Novel Foundation for Protein Model Building 

We combined advanced mathematical techniques, such as the Maximum Entropy Principle and Principal Component Analysis, with information theory. The result is a new approach to protein model building.

Introduction 

The foundation of 3D atomic model building and structure validation is the description of the building blocks: peptides and side chains. The traditional approach uses distances, angles, torsional angles and planarity restraints, and considers correlations between them only to a very limited extent. This causes many problems. We developed a very general and uniform approach that leads to a simple and very efficient computational implementation. This approach uses atomic coordinates in structure fragments and their correlations as basic variables. Probabilistic preferences are a simple function of atomic positions and relative orientations. It extends the atomic distance and angle restraints to include correlations between them. Other traditional knowledge (e.g. Ramachandran plot preferences) can be fully incorporated in a more efficient and elegant form. This approach has been extended to describe short characteristic peptide fragments and β-sheets consisting of noncontiguous polypeptides. The simplicity of this approach allows the structure validation process to consider alternatives produced by correlated changes in the structure.

New Fragment Library for Automatic Model Building
A nonhomologous set of structures (less than 30% sequence identity) with resolution better than 2.5 Å and R-factor smaller than 0.25 was chosen from the PDB (2936 monomers, 722,220 amino acids). Protein chains were cut into 5-peptide fragments and clustered (within rms 0.7 Å). Representatives of these clusters were then chosen, extended by several amino acids at both ends, and used as seeds for subsequent clustering of longer fragments (within rms up to 1.2 Å). For certain fragments this procedure was repeated. Finally, the representatives of the biggest clusters were chosen as library entries. The resulting library consists of 66 fragments, with lengths ranging from 5 to 15 amino acids. Some fragments contain not only single secondary structure elements, but also complex ones such as β-turn-α, β-hairpins, α-turn-β, α-turn-α and twisted β-strands. 54 of the 66 fragments are one-strand and 12 are two-strand (noncontiguous). These two-strand fragments are β-sheets.
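The cut-and-cluster step can be illustrated with a minimal sketch. The clustering algorithm is not specified in the text, so the greedy leader clustering below, the function names and the synthetic Cα-like traces are all assumptions made for illustration. Only the 5-residue window and the 0.7 Å cutoff are taken from the description above.

```python
import numpy as np

def superposed_rmsd(P, Q):
    """RMSD between two (N, 3) point sets after optimal rigid superposition."""
    P = P - P.mean(axis=0)
    Q = Q - Q.mean(axis=0)
    H = P.T @ Q
    s = np.linalg.svd(H, compute_uv=False)
    if np.linalg.det(H) < 0:      # best orthogonal map would be a reflection
        s[-1] = -s[-1]            # restrict the fit to proper rotations
    msd = (np.sum(P**2) + np.sum(Q**2) - 2.0 * s.sum()) / len(P)
    return float(np.sqrt(max(msd, 0.0)))

def sliding_fragments(chain, length=5):
    """Cut an (L, 3) trace into all overlapping windows of `length` points."""
    return [chain[i:i + length] for i in range(len(chain) - length + 1)]

def leader_cluster(fragments, cutoff=0.7):
    """Greedy clustering (an assumption): join the first representative within `cutoff`."""
    reps, members = [], []
    for idx, frag in enumerate(fragments):
        for k, rep in enumerate(reps):
            if superposed_rmsd(frag, rep) <= cutoff:
                members[k].append(idx)
                break
        else:                     # no representative was close enough
            reps.append(frag)
            members.append([idx])
    return reps, members

def helical_trace(n, radius, rise, turn_deg):
    """Idealized one-point-per-residue trace (hypothetical test data, not PDB data)."""
    t = np.deg2rad(turn_deg) * np.arange(n)
    return np.column_stack([radius * np.cos(t), radius * np.sin(t), rise * np.arange(n)])

helix = helical_trace(12, radius=2.3, rise=1.5, turn_deg=100.0)    # helix-like
strand = helical_trace(12, radius=0.9, rise=3.3, turn_deg=180.0)   # strand-like zigzag
fragments = sliding_fragments(helix) + sliding_fragments(strand)
reps, members = leader_cluster(fragments, cutoff=0.7)
cluster_sizes = sorted(len(m) for m in members)
```

Leader clustering depends on the input order and is only one of several reasonable choices. The seeded re-clustering of longer fragments described above would repeat the same comparison with extended windows and a looser cutoff.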
To reduce dimensionality before clustering, we used Principal Component Analysis (PCA), which also gave us insight into fragment variability. PCA is a method that transforms a number of correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible. The goal of PCA is twofold: to reduce the dimensionality of the data set and to identify new, meaningful underlying variables.
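As a small illustration of the PCA step, the sketch below computes principal components of flattened coordinate vectors with NumPy. The data are synthetic (one dominant direction of variation plus noise) and the helper name `pca` is hypothetical. It shows the general technique, not the exact pipeline used for the fragment library.

```python
import numpy as np

def pca(X):
    """PCA of row-wise samples X: returns (mean, variances, components), largest variance first."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centered data; rows of Vt are the principal axes
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    variances = s**2 / (X.shape[0] - 1)
    return mean, variances, Vt

# Synthetic stand-in for superimposed fragments: 200 samples of 5 atoms x 3 coordinates
rng = np.random.default_rng(0)
direction = rng.normal(size=15)
direction /= np.linalg.norm(direction)
amplitude = rng.normal(scale=1.0, size=(200, 1))
X = amplitude * direction + rng.normal(scale=0.05, size=(200, 15))

mean, variances, components = pca(X)
explained = variances / variances.sum()        # fraction of variance per component
scores = (X - mean) @ components[:2].T         # coordinates in the first two components
```

Clustering can then be carried out on `scores` rather than on the full coordinate vectors, which is the dimensionality reduction the text refers to.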

Interpretation of the First Principal Component in Terms of Cluster Fragments Variability
Cluster corresponding to the β-strand fragment. The two green dots near the cluster boundary are the outer structures shown in the right panel.
Internal flexibility of the β-strand fragments. (The β-strand fragment consists of five amino acids, but the N atom is removed at one end, and the C and O atoms at the other.) Two structures (green dots in the left panel) that differ along the first principal component have been superimposed and compared with the average structure from this cluster (middle).

Principal Component Analysis of Conformations of Proline  
Since the side chain of proline forms a ring together with its main-chain atoms, the individual movements of atoms are severely restricted and hence highly correlated. Therefore the standard practice of modeling every atom by an independent Gaussian is inadequate. The techniques we developed here for proline should also be applicable to modeling DNA, which contains similar five-atom rings. As our data set we selected 207 structures from the PDB. These structures have sequence identity below 90%, have been solved to a resolution of 1.2 Å or better, have an R-factor smaller than 0.30 and sequences longer than 20 residues. Together they contain 3397 proline residues. After removing residues with partial occupancy we obtained a data set of 3385 five-membered proline rings. Next, we used Kabsch's algorithm to align the extracted prolines and calculated the eigenvalues of the covariance matrix. We found that the first principal component accounts for 92% of the data variance, the second for 3%, and each of the others for less than 1%. This means that the conformation of the proline ring, usually described by several torsional angles, can be described by only one easy-to-find variable. Moreover, the first principal component also works as a classification variable.
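The align-then-diagonalize procedure can be sketched as follows. The rings here are synthetic five-atom pentagons with an artificial two-state out-of-plane pucker, not PDB prolines. Aligning every ring to the first one and classifying by the sign of the first-component score are both assumptions made for illustration.

```python
import numpy as np

def kabsch_align(P, ref):
    """Superimpose P (N, 3) onto ref (N, 3) with the Kabsch algorithm; both are centered first."""
    Pc = P - P.mean(axis=0)
    Rc = ref - ref.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Rc)
    d = 1.0 if np.linalg.det(U @ Vt) >= 0 else -1.0    # exclude improper rotations
    R = U @ np.diag([1.0, 1.0, d]) @ Vt                # minimizes |Pc @ R - Rc|
    return Pc @ R

def random_rotation(rng):
    """Random proper rotation matrix from a QR decomposition."""
    Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    if np.linalg.det(Q) < 0:
        Q[:, 0] = -Q[:, 0]
    return Q

# Synthetic rings (hypothetical data): flat pentagon plus a bimodal pucker amplitude
rng = np.random.default_rng(1)
theta = 2 * np.pi * np.arange(5) / 5
flat = np.column_stack([1.2 * np.cos(theta), 1.2 * np.sin(theta), np.zeros(5)])
pucker_mode = np.cos(2 * theta)                        # out-of-plane displacement pattern

rings, truth = [], []
for _ in range(300):
    q = rng.choice([-0.4, 0.4]) + rng.normal(scale=0.05)
    ring = flat.copy()
    ring[:, 2] = q * pucker_mode
    ring += rng.normal(scale=0.02, size=ring.shape)    # coordinate noise
    rings.append(ring @ random_rotation(rng) + rng.normal(scale=5.0, size=3))
    truth.append(q > 0)
truth = np.array(truth)

# Align everything to the first ring, then diagonalize the covariance matrix
aligned = np.array([kabsch_align(r, rings[0]) for r in rings])
X = aligned.reshape(len(rings), -1)
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
explained = eigvals / eigvals.sum()

# Use the first principal component as a two-state classification variable
scores = Xc @ eigvecs[:, 0]
labels = scores > 0
agreement = max(np.mean(labels == truth), np.mean(labels != truth))
```

The sign of an eigenvector is arbitrary, which is why `agreement` is taken over both label assignments. With real data, a margin around zero could be used to flag hard-to-classify rings.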

Interpretation of the First Principal Component in Terms of Proline Conformations 

Ring puckering for proline. Two outer rings (yellow and orange), which differ along the first principal component, have been superimposed and compared with the calculated artificial average proline structure (blue, middle). The orange and yellow rings correspond to the up and down conformations of the proline ring; the blue average structure is almost flat and does not occur.
Clustering of the coordinates of 3385 proline rings expressed in their first and second principal components. Approximate boundaries of the clusters corresponding to the up and down conformations are colored orange and yellow. The yellow cluster consists of 1559 prolines, the orange one of 1653. Hard-to-classify cases are colored light blue; there are only 173 of them out of 3385 (5%). This is what should be expected, because the first two principal components account for 95% of the data variability.

Variability Along Two Principal Component Axes  
Variability along the first principal component axis. 
Variability along the second principal component axis. 

Maximum Entropy Modeling of Torsion Angle Distribution in Proteins 

Using the Maximum Entropy principle, we find the functional form of the probability distribution of the torsional angles (Φ, Ψ) in proteins. We estimate the parameters of this distribution numerically, using the conjugate gradient method. We test approximations of orders 2 to 7. We calculate the information content of these approximations and compare it with that of the standard histogram method. As data, we selected 203 nonhomologous high-resolution protein structures from the Protein Data Bank (PDB). From this set, 10692 pairs of torsion angles (Φ, Ψ) were obtained and the trigonometric moments of their distribution were computed.
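A maximum entropy distribution constrained by trigonometric moments has an exponential form, p(Φ, Ψ) ∝ exp(Σ λ_k f_k(Φ, Ψ)), where the f_k are the constrained trigonometric functions. The sketch below fits such a model on a 10-degree grid. Several choices are assumptions and are not taken from the text: the basis (cos and sin of mΦ + nΨ with |m|, |n| up to the order), the synthetic von Mises samples standing in for PDB angles, and a damped Newton iteration used in place of the conjugate gradient method.

```python
import numpy as np

def trig_features(phi, psi, order):
    """cos and sin of m*phi + n*psi for frequency pairs up to `order` (assumed basis)."""
    pairs = [(m, n) for m in range(order + 1) for n in range(-order, order + 1)
             if m > 0 or n > 0]                  # one half-plane, constant term excluded
    arg = np.array([m * phi + n * psi for m, n in pairs]).T
    return np.hstack([np.cos(arg), np.sin(arg)])

def fit_maxent(mu, F, iters=100, tol=1e-10):
    """Find lam so that p ∝ exp(F @ lam) on the grid reproduces the target moments mu."""
    lam = np.zeros(F.shape[1])

    def model(l):
        a = F @ l
        shift = a.max()                          # numerical stabilization
        w = np.exp(a - shift)
        return w / w.sum(), shift + np.log(w.sum())

    def objective(l):
        return model(l)[1] - l @ mu              # convex dual: log Z - lam . mu

    for _ in range(iters):
        p, _ = model(lam)
        m = F.T @ p
        grad = m - mu                            # model moments minus data moments
        if np.abs(grad).max() < tol:
            break
        hess = (F * p[:, None]).T @ F - np.outer(m, m)   # feature covariance under p
        step = np.linalg.solve(hess + 1e-9 * np.eye(lam.size), grad)
        f0, t = objective(lam), 1.0
        # Backtracking line search (Armijo condition)
        while t > 1e-8 and objective(lam - t * step) > f0 - 1e-4 * t * (grad @ step):
            t *= 0.5
        lam = lam - t * step
    return lam, model(lam)[0]

# Synthetic angles in radians (hypothetical data): two broad clusters
rng = np.random.default_rng(2)
phi = np.concatenate([rng.vonmises(-1.1, 4.0, 600), rng.vonmises(-2.0, 4.0, 400)])
psi = np.concatenate([rng.vonmises(-0.7, 4.0, 600), rng.vonmises(2.4, 4.0, 400)])

order = 2
centers = np.deg2rad(np.arange(-175.0, 180.0, 10.0))     # 36 bin centers per axis
g_phi, g_psi = np.meshgrid(centers, centers, indexing="ij")
F = trig_features(g_phi.ravel(), g_psi.ravel(), order)
mu = trig_features(phi, psi, order).mean(axis=0)         # empirical trigonometric moments
lam, p = fit_maxent(mu, F)
max_moment_error = np.abs(F.T @ p - mu).max()
```

Because the fitted density is a smooth function of the angles with a small number of parameters `lam`, its derivatives are available analytically, unlike those of a histogram.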
Torsional angles Φ and Ψ describe the conformation of the main chain of the protein. Here we show a fragment of a protein chain; peptide units are represented by tiles.
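For reference, a torsion angle is computed from four consecutive atom positions: Φ from C(i-1), N(i), Cα(i), C(i), and Ψ from N(i), Cα(i), C(i), N(i+1). The function below is a minimal sketch of the standard dihedral formula. Its name and the toy coordinates are illustrative only.

```python
import numpy as np

def dihedral_deg(p0, p1, p2, p3):
    """Signed torsion angle in degrees defined by four points, about the p1-p2 bond."""
    p0, p1, p2, p3 = (np.asarray(p, dtype=float) for p in (p0, p1, p2, p3))
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)
    # Components of b0 and b2 perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return float(np.degrees(np.arctan2(y, x)))

# Toy coordinates: central bond along z, outer atoms in perpendicular directions
angle_cis = dihedral_deg([1, 0, 0], [0, 0, 0], [0, 0, 1], [1, 0, 1])
angle_quarter = dihedral_deg([1, 0, 0], [0, 0, 0], [0, 0, 1], [0, 1, 1])
angle_trans = dihedral_deg([1, 0, 0], [0, 0, 0], [0, 0, 1], [-1, 0, 1])
```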

Comparison of Experimental Data With Our Exponential Approximation  
10-degree bin histogram of the torsional angles Φ and Ψ in our data set of 10692 such pairs.
Our approximation of order 4. 

Our approximation is given by the function:

Comparison of Information Content (left panel) and Information Content per Parameter (right panel) of Our Approach Against the Standard Histogram Method
We define information content as Shannon's information entropy.
Note how unfavorable the plot of information content per parameter is for the histogram method.
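The text does not spell out how the entropy is evaluated for each method, so the following is only one possible reading. It computes the Shannon entropy of a normalized 10-degree histogram and divides it by the number of bins, treating each bin as one parameter. The synthetic angles, the function names and the per-bin parameter count are all assumptions.

```python
import numpy as np

def shannon_entropy_bits(p):
    """Shannon entropy -sum p*log2(p) of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                       # terms with p = 0 contribute nothing
    return float(-(p * np.log2(p)).sum())

def histogram_probabilities(phi_deg, psi_deg, bin_deg=10):
    """Normalized 2D histogram of angle pairs given in degrees on [-180, 180]."""
    edges = np.arange(-180, 180 + bin_deg, bin_deg)
    counts, _, _ = np.histogram2d(phi_deg, psi_deg, bins=[edges, edges])
    return counts / counts.sum()

# Synthetic angles (hypothetical data): a single broad cluster
rng = np.random.default_rng(3)
phi_deg = rng.normal(-63.0, 20.0, size=5000)
psi_deg = rng.normal(-43.0, 20.0, size=5000)

probs = histogram_probabilities(phi_deg, psi_deg)
entropy = shannon_entropy_bits(probs)
n_parameters = probs.size              # assumption: one parameter per histogram bin
entropy_per_parameter = entropy / n_parameters
```

Under this reading a 10-degree histogram always carries 36 × 36 = 1296 parameters, whereas a smooth model with a few dozen coefficients spreads a comparable amount of information over far fewer parameters.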

Conclusions  
Data-mining techniques, such as Principal Component Analysis, can substantially reduce the dimensionality of the problems encountered during protein model building. Moreover, they often ease clustering (cf. the protein fragment library) or classification (cf. proline conformations).
Using a smooth probability distribution instead of a traditional histogram allows for further analysis, such as optimization with analytical computation of derivatives. Calculating the information content of parameters helps in making an informed decision about how many parameters to consider.
