Stereochemical rules for connecting protein fragments 

Abstract 

In our approach to assembling a protein model, the electron density is fitted with fragments consisting of several peptide units. In low-quality regions of electron density maps, one runs into problems arising either from too many substantially different fragments or from a missing peptide bond. This can lead to difficulties in positioning the main chain of the protein. The problem may be solved by reducing the set of hypotheses, disqualifying the inconsistent ones. To this end, we characterize the local shape of a main-chain segment by angles between the vectors connecting four consecutive Cα atoms.
The local conformation is described by three quasi-conformation angles: two planar and one dihedral. We investigate the probability distribution of these angles and find that the conformation space is highly restricted. Since the procedure of matching hypotheses is computationally expensive, one needs a simple description of the quasi-conformation angle space. We employ the Maximum Entropy principle to design an analytical model of the probability distribution. The result, combined with data-mining techniques, will allow for more efficient protein model building.

Automated protein model building 

Modern biochemistry, as applied in genetics, pharmacology and medicine, strongly depends on knowing the three-dimensional structures of biological macromolecules, especially proteins and protein complexes. About 82% of known protein structures have been solved by X-ray crystallography. This process involves the following major steps:
Our approach to automated building of protein models given an experimental density map can be summarized as follows:

Here we focus on step 2. Assembling protein fragments into hypotheses requires a scoring function that assesses whether two fragments are compatible. For partly overlapping fragments, this task has been solved by assessing how exact the overlap is and by directly calculating van der Waals forces. It is often also important to determine whether two disjoint fragments, separated by a small distance, can both belong to the same protein main chain. A typical case is presented in Fig. 1. Properly distinguishing between compatible and incompatible pairs of disjoint fragments is most important for preventing main-chain fragments from being built into the protein side chains. Nearby disjoint fragments may be compatible (that is, they can be included in one main-chain hypothesis) if the distance between them can accommodate one peptide unit. We find that this condition is not sufficient: additional requirements concerning the relative orientation of the fragments have to be satisfied.

Fig. 1. A region in an experimental density map with several fitted fragments. The highlighted area shows the interface between two disjoint fragments.

We describe these requirements in terms of a probability distribution function (PDF), which serves as an additional statistical prior in protein model building. This prior will also be used in refining a main-chain model. The PDF is evaluated using data from structures deposited in the PDB. We describe the relative orientation of the fragments by three angles χ, ζ, τ between virtual Cα-Cα bonds (see Fig. 2). A virtual bond is a hypothetical line connecting the Cα atoms of consecutive peptide units.

Parametrization 

The relative position of three consecutive virtual Cα-Cα bonds is defined by the three angles χ, ζ, and τ. However, due to the geometry of the problem, we shall instead describe it by w = cos χ, x = cos ζ, and τ.
In this set of variables, each volume element dw dx dτ = d(cos χ) d(cos ζ) dτ = sin χ dχ sin ζ dζ dτ corresponds to the same surface element in real space. The variables w and x take values between -1 and 1.
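The parametrization above can be illustrated with a short sketch that computes (w, x, τ) from four consecutive Cα positions. The function name `quasi_conformation` and the helper names are hypothetical. The sketch assumes that χ and ζ are measured between the direction vectors of consecutive virtual bonds and that τ follows the usual right-handed dihedral sign convention; the source does not state either choice, so both should be checked against Fig. 2 before use.

```python
import math

def _sub(a, b):
    return tuple(p - q for p, q in zip(a, b))

def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _unit(a):
    norm = math.sqrt(_dot(a, a))
    return tuple(p / norm for p in a)

def quasi_conformation(ca0, ca1, ca2, ca3):
    """Return (w, x, tau) for four consecutive C-alpha positions.

    Assumed convention (not stated in the text): chi and zeta are the
    angles between the directions of consecutive virtual bonds, and tau
    is the dihedral angle about the middle bond, mapped to [0, 2*pi).
    """
    u1 = _unit(_sub(ca1, ca0))   # first outer virtual bond
    u2 = _unit(_sub(ca2, ca1))   # middle virtual bond
    u3 = _unit(_sub(ca3, ca2))   # second outer virtual bond
    w = _dot(u1, u2)             # w = cos(chi)
    x = _dot(u2, u3)             # x = cos(zeta)
    n1 = _cross(u1, u2)          # normal of the plane of bonds 1 and 2
    n2 = _cross(u2, u3)          # normal of the plane of bonds 2 and 3
    tau = math.atan2(_dot(u2, _cross(n1, n2)), _dot(n1, n2))
    return w, x, tau % (2.0 * math.pi)
```

When two consecutive virtual bonds are collinear, the corresponding plane normal vanishes and τ is undefined; this sketch then returns 0 and leaves the handling of that case to the caller.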

Fig. 2. Angles describing the spatial orientation of disjoint fragments. Blue spheres depict Cα atoms, and yellow bars depict the virtual bonds. Each of the two outer bonds belongs to one fragment, while the middle one is the virtual bond between the fragments. The angles χ and ζ are planar angles between consecutive virtual bonds and take values in the range from 0 to π. The dihedral (torsion) angle τ can take any value from 0 to 2π.

Data 

Available protein structures (26319 structures as of 13 Jul 2004) are collected in the Protein Data Bank (PDB). To evaluate the probability distribution function of w, x and τ, we analyze a subset of the PDB consisting of structures obtained from X-ray diffraction at resolutions better than 2.5 Å. After removing close homologs, the dataset consists of 4170 chains containing over 10^6 amino acids. From the data we create a 3D histogram with bins equally spaced in the three variables. This histogram can be used as a prior probability lookup table when assessing the compatibility of disjoint fragments.
The data are presented in Fig. 3. An important feature of the plot is that it is not symmetrical with respect to swapping w and x. This means that the information contained in the PDF of χ, ζ, τ cannot be retrieved from data on neighboring peptide unit angles (Ramachandran plot) alone.
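The histogram lookup table described in this section might be sketched as follows. The names `build_prior_table`, `prior_density` and `bin_index`, the default of 20 bins per axis, and the sparse dictionary storage are all assumptions made for illustration; the text does not give the bin count or the storage layout actually used.

```python
import math

TWO_PI = 2.0 * math.pi
# Ranges of the three variables: w, x in [-1, 1] and tau in [0, 2*pi].
RANGES = ((-1.0, 1.0), (-1.0, 1.0), (0.0, TWO_PI))

def bin_index(value, lo, hi, bins):
    """Index of the equally spaced bin containing value, clamped to the range."""
    i = int((value - lo) / (hi - lo) * bins)
    return min(max(i, 0), bins - 1)

def _key(w, x, tau, bins):
    return tuple(bin_index(v, lo, hi, bins)
                 for v, (lo, hi) in zip((w, x, tau), RANGES))

def build_prior_table(samples, bins=20):
    """Build a normalized 3D histogram from an iterable of (w, x, tau) tuples.

    Only occupied bins are stored; the values are probability densities,
    so that their sum times the cell volume equals one.
    """
    counts = {}
    total = 0
    for w, x, tau in samples:
        key = _key(w, x, tau, bins)
        counts[key] = counts.get(key, 0) + 1
        total += 1
    cell_volume = (2.0 / bins) * (2.0 / bins) * (TWO_PI / bins)
    return {key: c / (total * cell_volume) for key, c in counts.items()}

def prior_density(table, w, x, tau, bins=20):
    """Look up the prior density for one (w, x, tau); empty bins give 0."""
    return table.get(_key(w, x, tau, bins), 0.0)
```

A sparse table keeps memory use proportional to the number of occupied bins, which seems a reasonable fit for a highly restricted conformation space, but a dense array would work equally well.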

Fig. 3. An isosurface representation of the empirical probability distribution function of (w, x, τ), calculated from structures deposited in the PDB.

Exponential model 

Drawbacks of the histogram representation:
Hence the need for an analytical description. To find a functional form of the PDF, we use the Maximum Entropy Principle. First, we construct a set of orthonormal base functions on the domain of (w, x, τ). We choose them as products of Legendre polynomials in w and x with sines and cosines of integer multiples of τ. The PDB data can then be reduced to moments in this basis, that is, to the averages of the base functions over the dataset.
As constraints for the maximum-entropy modeling, we require that the same moments, calculated from the modeled PDF with respect to the above base functions, take the values obtained from the data. Using Lagrange multipliers, one can show that under such constraints the appropriate probability distribution p is an exponential function whose exponent is a linear combination of the base functions.
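The reduction of data to moments can be sketched as below. This is an illustration under stated assumptions: the base functions are taken to be products of orthonormalized Legendre polynomials in w and x with orthonormalized sines and cosines of τ, and the names `legendre`, `legendre_on`, `trig_on`, `basis_labels`, `basis_value` and `empirical_moments` are hypothetical. The exact base functions and their indexing in the original work are not recoverable from this text.

```python
import math

def legendre(n, t):
    """Legendre polynomial P_n(t), computed with Bonnet's recurrence."""
    if n == 0:
        return 1.0
    p_prev, p = 1.0, t
    for k in range(1, n):
        p_prev, p = p, ((2 * k + 1) * t * p - k * p_prev) / (k + 1)
    return p

def legendre_on(n, t):
    """P_n scaled to unit norm on [-1, 1]."""
    return math.sqrt((2 * n + 1) / 2.0) * legendre(n, t)

def trig_on(m, kind, tau):
    """Constant, cosine or sine term scaled to unit norm on [0, 2*pi]."""
    if m == 0:
        return 1.0 / math.sqrt(2.0 * math.pi)
    func = math.cos if kind == "cos" else math.sin
    return func(m * tau) / math.sqrt(math.pi)

def basis_labels(max_legendre, max_trig):
    """All (k, l, m, kind) labels up to the given maximum orders."""
    labels = []
    for k in range(max_legendre + 1):
        for l in range(max_legendre + 1):
            labels.append((k, l, 0, "cos"))   # tau-independent term
            for m in range(1, max_trig + 1):
                labels.append((k, l, m, "cos"))
                labels.append((k, l, m, "sin"))
    return labels

def basis_value(label, w, x, tau):
    k, l, m, kind = label
    return legendre_on(k, w) * legendre_on(l, x) * trig_on(m, kind, tau)

def empirical_moments(samples, labels):
    """Average of every base function over a list of (w, x, tau) samples."""
    n = len(samples)
    return [sum(basis_value(lab, w, x, tau) for w, x, tau in samples) / n
            for lab in labels]
```

With maximum orders K and M, this labelling yields (K + 1)^2 (2M + 1) moments, so the whole dataset is summarized by a short vector instead of a full 3D table.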

A normalization parameter ensures that p integrates to one, and the coefficients of the expansion in the exponent are specific to the modeled distribution. The expansions in Legendre polynomials and in trigonometric functions are truncated at finite maximum orders. We assume the coefficients to be a priori statistically independent. To keep the optimization convergent, the coefficients are assigned a Gaussian prior probability distribution. To find the values of the coefficients, we use Bayes' theorem. This task reduces to maximizing the posterior probability of the coefficients given the data, or equivalently to minimizing its negative logarithm.
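The fitting step might be sketched as the objective below, which is a generic negative log-posterior for an exponential (maximum-entropy) model and not necessarily the expression used in the original work. The sketch assumes a zero-mean Gaussian prior of width `sigma` on every coefficient and a midpoint-sum approximation of the normalization integral on a regular grid; `log_partition`, `neg_log_posterior`, `gradient` and all argument names are hypothetical.

```python
import math

def _exponents(lam, grid_features):
    """Value of the linear exponent at every grid cell."""
    return [sum(l * f for l, f in zip(lam, feats)) for feats in grid_features]

def log_partition(lam, grid_features, cell_volume):
    """log of the normalization integral, approximated by a midpoint sum.

    grid_features[i] holds the base-function values at the centre of cell i.
    The maximum exponent is factored out for numerical stability.
    """
    exps = _exponents(lam, grid_features)
    top = max(exps)
    return top + math.log(cell_volume * sum(math.exp(e - top) for e in exps))

def neg_log_posterior(lam, moments, n_data, grid_features, cell_volume, sigma):
    """Negative log-posterior, up to a constant independent of lam."""
    log_z = log_partition(lam, grid_features, cell_volume)
    fit = sum(l * m for l, m in zip(lam, moments))
    prior = sum(l * l for l in lam) / (2.0 * sigma ** 2)
    return n_data * (log_z - fit) + prior

def gradient(lam, moments, n_data, grid_features, cell_volume, sigma):
    """Gradient: n_data * (model moments - data moments) + lam / sigma^2."""
    exps = _exponents(lam, grid_features)
    log_z = log_partition(lam, grid_features, cell_volume)
    grad = [l / sigma ** 2 for l in lam]
    for e, feats in zip(exps, grid_features):
        prob = cell_volume * math.exp(e - log_z)   # model probability of the cell
        for j, f in enumerate(feats):
            grad[j] += n_data * prob * f
    return [g - n_data * m for g, m in zip(grad, moments)]
```

Without the prior term, the gradient vanishes exactly when the model moments equal the data moments, which is the maximum-entropy constraint; the Gaussian term pulls the coefficients toward zero. The objective is convex, so any gradient-based minimizer can be applied to it.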


Results  
After performing the minimization for different values of S, we find that the probability distribution described by Equation (1) reproduces the major features of the initial distribution when four orders are computed in both the Legendre polynomials and the trigonometric functions.
The values of the coefficients are given in the table below:



This work was supported by NIH grant No. 53163. The authors thank Mirek Gilski and Jan Zelinka for helpful discussions. 

