Stereochemical rules for connecting protein fragments
Andrzej Kudlicki, Maga Rowicka, Zbyszek Otwinowski


In our approach to assembling a protein model, the electron density is fitted with fragments consisting of several peptide units. In low-quality regions of electron density maps, one runs into problems arising from either too many substantially different fragments, or a missing peptide bond. This can lead to difficulties in positioning the main chain of the protein. The problem may be solved by reducing the set of hypotheses by disqualifying inconsistent ones.

To this end, we characterize the local shape of a main chain segment by angles between vectors connecting four consecutive C-alpha atoms.

The local conformation is described by three quasi-conformation angles: two flat and one dihedral angle. We investigate the probability distribution of these angles, and find that the conformation space is highly restricted.
Since the procedure of matching hypotheses is computationally expensive, one needs a simple description of the quasi-conformation angle space. We employ the Maximum Entropy principle to design an analytical model the probability distribution. The result, combined with data-mining techniques, will allow for more efficient protein model building.
Automated protein model building

Modern biochemistry, as it is applied in genetics, pharmacology and medicine, strongly depends on knowing the three-dimensional structures of biological macromolecules, especially proteins and protein complexes. About 82% of known protein structures have been solved by X-ray crystallography. This process involves the following major steps:

  • growing protein crystals
  • obtaining their diffraction patterns
  • phasing, or calculating the electron density mapof a molecule (inverse problem), and
  • model building, i.e. placing atoms in this map.

Our approach to automated building of protein models given an experimental density map can be summarized as follows:

  1. Fitting typical protein fragments in electron density map. A main-chain fragment consists of several peptide units. Fragment libraries are built by clustering small secondary structure motives from known proteins.
  2. Creating and evaluating main-chain hypotheses as sets of compatible fragments. This step is achieved by means of a genetic algorithm.
  3. Attaching side-chain fragments to the built main chain.
  4. Refining the model, i.e. applying small corrections to atomic coordinates.

Here we focus on step 2. Assembling protein fragments into hypotheses requires designing a scoring function to assess whether two fragments are compatible, or not.

In the case of partly overlapping fragments, this task has been solved based on assessment how exact is the overlap, and direct calculation of van der Waals forces. It is often also important to determine, whether two disjointfragments, separated by a small distance, can both belong to the same protein main chain.

A typical case is presented in Fig. 1 (below). Proper distinguishing between pairs of compatible and incompatible disjoint fragments is most important in preventing building main-chain fragments into the protein side-chain. Nearby disjoint fragments may be compatible (this means that they can be included in one main-chain hypothesis), if the distance between them can accomodate one peptide unit.

We find, that this condition is not sufficient, and additional requirements have to be satisfied, concerning the orientation of these fragments.

Fig. 1 (on the right). A region in an experimental density map with several fitted fragments. The highlighted area shows the interface between two disjoint fragments.

We describe these requirements in terms of a probability distribution function (PDF), serving as additional statistical prior used in protein model building. This prior will also be used in refining a main-chain model. The PDF is evaluated using data from structures deposited in the PDB.

We describe this orientation by three angles χ, ζ, τ between virtual Calpha-Calpha bonds (see Fig. 2). The virtual bond is a hypothetical line connecting the Calpha atoms in consecutive peptide units.


Relative position of three consecutive virtual Calpha-Calpha bonds is defined by the three angles χ, ζ, and τ. However, due to the geometry of the problem, we shall instead describe it by , w=cosχ, x=cosζ, and τ.

In this set of variables, each volume element

dwdxdτ = d(cosχ)d(cosζ)dτ = sinχdχsinζdζdτ

corresponds to the same surface element in the real space. The variables w and x have values between -1 and 1.

Fig. 2 (on the right): Angles describing the spatial orientation of disjoint fragments. Blue spheres depict Calpha atoms, yellow bars - the virtual bonds. Two outer bonds are included in one fragment each, while the one in the middle is the virtual bond between fragments. The angles χ and ζ are planar angles between consecutive virtual bonds, and have values in the range from 0 to π. The dihedral (torsion) angle Τ can take any value from 0 to 2π.


Available protein structures (26319 structures as of 13 Jul 2004) are collected in the Protein Data Bank (PDB) database. To evaluate the probability distribution function of w, x and τ, we analyze a subset of the PDB, consisting of structures obtained from X-ray diffraction at resoutions better than 2.5 Å. After removing close homologs, the dataset consists of 4170 chains, containing over 106 aminoacids. From the data we create a 3-D histogram, equally spaced in the three variables. This histogram can be used as a prior probability lookup table when assessing compatibility of disjoint fragments.

The data are presented in Fig. 3 (below). An important feature of the plot is that it is not symmetrical with respect to swapping w, and x. This means, that the information contained in the PDF of Χ, ζ, τ could not be retrieved from data on neighboring peptide unit angles (Ramachandran plot) alone.

Fig. 3 (on the right): An isosurface representation of the empirical probability distribution function of (w, x, τ), calculated from structrures deposited in the PDB.
Exponential model

Drawbacks of the histogram representation:

  • Exactly zero in bins, where no data point fell - not a good prior probability - too prohibitive.
  • Nondifferentiable - not convenient for refinement procedures, involving derivatives.

Hence arises the need for an analytical description.

To find a functional form of the PDF we will use the Maximum Entropy Principle First, we construct a set of orthonormal base functions on . We choose them as


S_{ijk}(w,x,\tau) = {\cal L}_i(w) {\cal L}_j(x) \sin(k\tau)  ,

where are integers, and denotes the Legendre polynomial of the N-th order. The PDB data can be reduced to moments in this basis:

...i(w_n) {\cal L}_j(x_n)\sin k \tau_n\\
\end{array}\right. .

As the functional form of constraints for maximum-entropy modeling, we choose the values of the same moments, calculated from the investigated PDF with respect to the above base functions:

\mu^{(c)}_{l,n,m}={\displaystyle \int_{-1}^{1} d w \int_{-1}...
...i} d\tau\;}
C_{ijk}(w,x,\tau) p(w,x,\tau\vert\mbox{\rm coeff})

\mu^{(s)}_{l,n,m}={\displaystyle \int_{-1}^{1} d w \int_{-1}...
...i} d\tau\;}
S_{ijk}(w,x,\tau) p(w,x,\tau\vert\mbox{\rm coeff})

Using Lagrange multipliers one can compute that for such constrains the appropriate probability distribution p is an exponential function with the exponent being a linear function of the constraints:

p(w,x,\tau\vert\mbox{\rm coeff})=
\exp\left(-\lambda-\sum_{i,j\leq S^l ,k\leq S^t}
N(a_{ijk}, b_{ijk},w,x,\tau) \right) ,
where \begin{displaymath}N(a_{ijk}, b_{ijk},w,x,\tau)=\end{displaymath}\begin{displaymath}=a_{ijk}\;{\cal L}_i(w) {\cal L}_j(x)\cos k \tau +
b_{ijk}\;{\cal L}_i(w) {\cal L}_j(x)\sin k \tau .\end{displaymath}

The number $\lambda$ is a normalization parameter, introduced to assure that

{\displaystyle \int_{-1}^{1} d w \int_{-1}^{1} d x \int_{0}^{2\pi} d\tau\;}
p(w,x,\tau\vert\mbox{\rm coeff}) =1 ,

and $a_{ijk}$ and are the coefficients of the specific distribution.
$S^l$ and $S^t$ denote the maximum orders to which the Lagrange and trigonometric expansions are calculated. We assume these coefficients to be a priori statistically independent. To keep optimizations convergent, $a_{ijk}$ and are assigned a prior Gaussian probability distribution

p(a_{ijk})=\sqrt{\frac{c}{\pi}} e^{-c\; (a_{ijk})^2} ,

with $c= 0.25$. To find values of the coefficients, we use Bayes' theorem:

p( \{a_{ijk}, b_{ijk}\}_{i,j\leq S^l ,k\leq S^t}\vert w,x,\tau )\;
\begin{displaymath}= \frac{p(\{a_{ijk}, b_{ijk}\}_{i,j\leq S^l ,k\leq S^t})\;{\...
...q S^t})}
{{\displaystyle \prod_{n=1}^{N}}p(w_n,x_n,\tau_n)} .

This task boils down to maximizing of the following expression:

\lambda+ \sum_{i,j\leq S^l ,k\leq S^t}
\{a_{ijk} M^{(c)}_{i...
...m_{i,j\leq S^l ,k\leq S^t} \{ (a_{ijk})^2 +(b_{ijk})^2 \}  ,


\lambda=\log\left\{ {\displaystyle \int_{-1}^{1} d w \int_{-...
...k\leq S^t}
N(a_{i,j,k}, b_{i,j,k},w,x,\tau )\right)\right\} .

The integral for $\lambda$ is calculated numerically, and the whole optimization is performed using a library conjugate gradient routine. The required gradients were computed analytically, requiring only one simple numerical integration in evaluating .

After performing the minimization for different values of S, we find, that the probability distribution described by Equation (1) reproduces major features of the initial distribution when four orders are computed in both Lagrange polynomials and trigonometric functions.

The values of $a_{ijk}$and are given in the table below:

i 0 0 1 1 2 2 3 3
i $a_{ijk}$ $a_{ijk}$ $a_{ijk}$ $a_{ijk}$
j=0 0.0000 0.0000 0.4795 0.0000 -0.6496 0.0000 -0.3305 0.0000
  -0.0185 0.3406 -0.1722 -0.0246 -0.1468 -0.1848 0.1011 0.0330
  -0.0228 0.9266 0.0878 0.1250 0.0996 -0.3866 -0.0259 -0.1253
  -0.5055 0.1766 -0.0412 -0.0285 0.2165 -0.1214 0.0173 0.0132
j=1 0.4795 0.0000 0.1884 0.0000 -0.0819 0.0000 -0.1212 0.0000
  -0.1988 -0.1541 -0.1031 -0.0548 0.0086 0.0394 0.0597 0.0528
  0.0640 0.1783 0.0451 0.0573 0.0144 -0.0477 -0.0138 -0.0509
  -0.0123 -0.0416 -0.0194 -0.0269 -0.0172 -0.0006 -0.0002 0.0216
j=2 -0.6491 0.0000 -0.0809 0.0000 0.2085 0.0000 0.0634 0.0000
  -0.1660 -0.3318 -0.0034 -0.0448 0.0806 0.1330 -0.0009 0.0360
  0.0526 -0.3212 -0.0086 -0.0110 -0.0332 0.1600 0.0041 0.0187
  0.2524 -0.1300 0.0048 -0.0103 -0.1272 0.0617 -0.0095 0.0123
j=3 -0.3303 0.0000 -0.1184 0.0000 0.0668 0.0000 0.0785 0.0000
  0.1224 0.0491 0.0632 0.0175 -0.0073 -0.0145 -0.0405 -0.0216
  -0.0651 -0.1240 -0.0356 -0.0315 -0.0017 0.0410 0.0149 0.0315
  0.0231 0.0213 0.0153 0.0149 0.0054 0.0006 -0.0026 -0.0138

This work was supported by NIH grant No. 53163. The authors thank Mirek Gilski and Jan Zelinka for helpful discussions.
We are grateful to E.T. Jaynes Center for generous travel support.

| people | research | protein gallery | publications | positions | contact