Novel Foundation for Protein Model Building
Maga Rowicka, Mei Qi, Andrzej Kudlicki, Jan Zelinka, Saqib Ahmad, Zbyszek Otwinowski

We combined advanced mathematical techniques, such as the Maximum Entropy Principle and Principal Component Analysis,
with information theory. The result is a new approach to protein model building.


The foundation of 3D atomic model building and structure validation is the description of the building blocks: peptides and side chains. The traditional approach uses distances, angles, torsional angles and planar restraints, and considers correlations between them only to a very limited extent. This causes many problems.

We developed a very general and uniform approach that leads to a simple and very efficient computational implementation. This approach uses atomic coordinates in structure fragments and their correlations as basic variables. Probabilistic preferences are a simple function of atomic positions and relative orientations.

It extends the atomic distances and angle restraints to include correlations between them.

Other traditional knowledge (e.g. Ramachandran plot preferences) can be fully incorporated in a more efficient and elegant form. This approach has been extended to describe short characteristic peptide fragments and β-sheets consisting of non-contiguous polypeptides.

The simplicity of this approach allows the structure validation process to consider alternatives produced by correlated changes in the structure.

New Fragment Library for Automatic Model Building

A non-homologous (less than 30% sequence identity) set of structures from the PDB with resolution better than 2.5 Å and R-factor smaller than 0.25 was chosen (2936 monomers, 722,220 amino acids). Protein chains were cut into 5-peptide fragments and clustered (within rms 0.7 Å).

Then, representatives of these clusters were chosen, enlarged by several amino acids at both ends, and used as seeds for subsequent clustering of longer fragments (within rms up to 1.2 Å). For certain fragments this procedure was repeated. Finally, the representatives of the biggest clusters were chosen as library entries.
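The clustering algorithm itself is not specified here. As a rough illustration only, the sketch below uses a simple greedy "leader" clustering with an rms threshold, assuming the fragments are already superimposed; the function names `rmsd` and `leader_cluster` and the toy coordinates are hypothetical.

```python
import math

def rmsd(a, b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    points. Assumes the two fragments are already superimposed."""
    if len(a) != len(b):
        raise ValueError("fragments must have the same number of atoms")
    sq = sum((p - q) ** 2 for pa, pb in zip(a, b) for p, q in zip(pa, pb))
    return math.sqrt(sq / len(a))

def leader_cluster(fragments, threshold):
    """Greedy threshold clustering (an assumed stand-in, not the poster's method):
    each fragment joins the first cluster whose representative lies within
    `threshold`; otherwise it starts a new cluster and becomes its representative."""
    clusters = []  # list of (representative_index, [member_indices])
    for i, frag in enumerate(fragments):
        for rep, members in clusters:
            if rmsd(fragments[rep], frag) <= threshold:
                members.append(i)
                break
        else:
            clusters.append((i, [i]))
    return clusters

# Toy two-atom "fragments"; real fragments would hold backbone atoms of 5 residues.
frags = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0)],
    [(0.0, 5.0, 0.0), (1.0, 5.0, 0.0)],
]
clusters = leader_cluster(frags, threshold=0.7)
```

The result of greedy clustering depends on input order, so a production pipeline would likely use a more robust scheme; the sketch only shows what "clustered within rms 0.7 Å" can mean operationally.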

The library thus constructed consists of 66 fragments, with lengths varying from 5 to 15 amino acids. Some fragments contain not only single secondary structure elements, but also complex ones such as β-turn-α, β-hairpins, α-turn-β, α-turn-α and twisted β-strands. 54 of the 66 fragments are one-strand and 12 are two-strand (non-contiguous). These two-strand fragments are β-sheets.

To reduce dimensionality before clustering, we used Principal Component Analysis.

In this way, we also gained insight into fragment variability.

Principal component analysis (PCA) is a method that transforms a number of correlated variables into a smaller number of uncorrelated variables called principal components. The first principal component accounts for as much of the variability in the data as possible, and each succeeding component accounts for as much of the remaining variability as possible.

The goal of PCA is twofold: to reduce the dimensionality of the data set and to identify new meaningful underlying variables.
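As a minimal sketch of the method, PCA can be computed from the eigen-decomposition of the covariance matrix. The example assumes NumPy is available; the function name `pca` and the synthetic data are illustrative and not taken from the poster.

```python
import numpy as np

def pca(data):
    """PCA via eigen-decomposition of the covariance matrix.

    data: array of shape (n_samples, n_variables).
    Returns (eigenvalues, components, scores): eigenvalues sorted in
    descending order, components as columns, scores = projected data."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: ascending order
    order = np.argsort(eigvals)[::-1]        # re-sort to descending
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    scores = centered @ eigvecs
    return eigvals, eigvecs, scores

# Synthetic data: two strongly correlated variables plus a small noise variable.
rng = np.random.default_rng(0)
t = rng.normal(size=500)
data = np.column_stack([
    t,
    2.0 * t + 0.1 * rng.normal(size=500),
    0.1 * rng.normal(size=500),
])
eigvals, components, scores = pca(data)
explained = eigvals / eigvals.sum()  # fraction of variance per component
```

Here `explained` gives the fraction of the total variance carried by each component, which is the quantity used to decide how many components to keep.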

Interpretation of the First Principal Component in Terms of Cluster Fragments Variability

Cluster corresponding to the β-strand fragment. The two green dots near the cluster boundary are the outer structures shown in the right panel.

Internal flexibility of the β-strand fragments. (The β-strand fragment consists of five amino acids, but the N atom is removed at one end and the C and O atoms at the other.) Two structures (green dots in the left panel) that differ along the first principal component have been superimposed and compared with the average structure from this cluster (middle).

Principal Component Analysis of Conformations of Proline

Since the side chain of proline forms a ring together with its main-chain atoms, the individual movements of atoms are severely restricted and hence highly correlated. Therefore the standard practice of modeling every atom by an independent Gaussian is
especially inadequate in this case.

The techniques we developed here for proline should also be applicable to modeling DNA, which contains similar five-atom rings.

As our data set we selected 207 structures from the PDB,
all of them solved by X-ray crystallography.

These structures have sequence identity below 90%, have been solved to a resolution of 1.2 Å or better, have an R-factor smaller than 0.30 and are longer than 20 residues. Together they contain 3397 proline residues. After removing residues with partial occupancy we obtained a data set of 3385 five-membered proline rings.

Next, we used Kabsch's algorithm to align the extracted prolines and calculated the eigenvalues of the covariance matrix of the aligned coordinates. We found that the first principal component accounts for 92% of the data variance, the second for 3%, and each of the others for less than 1%. This means that the conformation of the proline ring, usually described by several torsional angles, can be described by a single, easy-to-find variable. Moreover, the first principal component also works as a classification variable.
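The alignment and variance analysis can be sketched as follows. This is an illustration under stated assumptions, not the code used for the poster: it assumes NumPy, uses a synthetic puckered pentagon in place of real proline rings, and the helper names (`kabsch_align`, `explained_variance`, `random_rotation`) are hypothetical.

```python
import numpy as np

def kabsch_align(mobile, target):
    """Superimpose `mobile` onto `target` (both of shape (n_atoms, 3)) using
    the optimal rotation from the Kabsch algorithm. Returns a moved copy."""
    mob_c = mobile - mobile.mean(axis=0)
    tgt_c = target - target.mean(axis=0)
    h = mob_c.T @ tgt_c                       # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Force a proper rotation (determinant +1), never a reflection.
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    rot = vt.T @ np.diag([1.0, 1.0, d]) @ u.T
    return mob_c @ rot.T + target.mean(axis=0)

def explained_variance(aligned):
    """aligned: (n_structures, n_atoms, 3). Each structure is flattened to one
    vector; returns the variance fraction of each principal component."""
    flat = aligned.reshape(len(aligned), -1)
    eigvals = np.linalg.eigvalsh(np.cov(flat, rowvar=False))[::-1]
    eigvals = np.clip(eigvals, 0.0, None)     # remove tiny negative round-off
    return eigvals / eigvals.sum()

def random_rotation(rng):
    """Random proper rotation matrix from a QR decomposition."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q = q * np.sign(np.diag(r))
    if np.linalg.det(q) < 0:
        q[:, 0] *= -1
    return q

rng = np.random.default_rng(1)
angles = 2 * np.pi * np.arange(5) / 5
ring = np.column_stack([np.cos(angles), np.sin(angles), np.zeros(5)])

# Sanity check: a rotated and shifted copy is mapped back onto the original.
moved = ring @ random_rotation(rng).T + np.array([1.0, -2.0, 0.5])
recovered = kabsch_align(moved, ring)

# Synthetic rings: one atom puckered up or down, small coordinate noise,
# arbitrary orientation and position. These are NOT real proline data.
aligned = []
for _ in range(200):
    s = ring.copy()
    s[2, 2] = rng.choice([-0.4, 0.4]) + 0.05 * rng.normal()
    s += 0.01 * rng.normal(size=s.shape)
    s = s @ random_rotation(rng).T + rng.normal(size=3)
    aligned.append(kabsch_align(s, ring))
explained = explained_variance(np.array(aligned))
```

In this toy setup the first entry of `explained` dominates because the only systematic variation is the pucker of a single atom; the percentages reported above come from the real data set, not from this sketch.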

Interpretation of the First Principal Component in Terms of Proline Conformations

Ring puckering for proline. Two outer rings, yellow and orange, which differ along the first principal component, have been superimposed and compared with the calculated artificial average proline structure (blue, middle). The orange and yellow rings correspond to the up and down conformations of the proline ring; the blue average structure is almost flat and does not occur.

Clustering of coordinates of 3385 proline rings expressed in their first and second principal components. Approximate boundaries of the clusters corresponding to the up and down conformations have been colored orange and yellow. The yellow cluster consists of 1559 prolines and the orange one of 1653. Hard-to-classify cases are colored light blue; there are only 173 of them out of 3385 (5%). This is consistent with the first two principal components accounting for 95% of the data variability.

Variability Along Two Principal Component Axes

Variability along the first principal component axis.
This distribution is clearly bimodal, confirming that there are two conformations of proline. The points in the plot are colored according to the group they were assigned to in the next figure.

Variability along the second principal component axis.
Here, the distribution is Gaussian. This means that the second principal component is not useful as a classification variable.

Maximum Entropy Modeling of Torsion Angle Distribution in Proteins

Using the Maximum Entropy principle, we find the functional form of the torsional angle Φ, Ψ probability distribution in proteins.
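The functional form itself is not reproduced in this text. As a hedged sketch: if the constraints are trigonometric moments of the angles up to some order n (an assumption, and the exact definition of "order" used by the authors is not stated), the standard Maximum Entropy result is an exponential of a trigonometric polynomial. The index set I_n and the coefficients a_jk, b_jk below are notation introduced here for illustration.

```latex
% Assumed constraints: trigonometric moments up to order n,
% with I_n = \{ (j,k) : 0 < |j| + |k| \le n \}, one representative per pair \pm(j,k)
\langle \cos(j\Phi + k\Psi) \rangle = c_{jk}, \qquad
\langle \sin(j\Phi + k\Psi) \rangle = s_{jk}, \qquad (j,k) \in I_n

% Maximum entropy distribution under these constraints
p(\Phi, \Psi) = \frac{1}{Z} \exp\!\Big( \sum_{(j,k) \in I_n}
    \big[ a_{jk} \cos(j\Phi + k\Psi) + b_{jk} \sin(j\Phi + k\Psi) \big] \Big)

% Normalization constant
Z = \int_{-\pi}^{\pi} \int_{-\pi}^{\pi} \exp\!\Big( \sum_{(j,k) \in I_n}
    \big[ a_{jk} \cos(j\Phi + k\Psi) + b_{jk} \sin(j\Phi + k\Psi) \big] \Big)
    \, d\Phi \, d\Psi
```

The coefficients play the role of Lagrange multipliers and have no closed form in general, which is why they must be estimated numerically.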

We estimate the parameters of this distribution numerically, using the conjugate gradient method. We test approximations of orders 2 to 7. We calculate the information content of these approximations and compare them with the standard histogram method.

As data, we selected 203 non-homologous high-resolution protein structures from the Protein Data Bank (PDB). From this set, 10692 pairs of torsion angles (Φ,Ψ) were obtained and the trigonometric moments of their distribution were computed.
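As an illustration of what computing trigonometric moments can look like, the sketch below evaluates the sample mean of exp(i(jΦ + kΨ)) for all integer pairs with |j| + |k| up to a chosen order. The definition of "order", the function name `trigonometric_moments` and the three sample angle pairs are assumptions made for this example.

```python
import cmath
import math

def trigonometric_moments(pairs_deg, order):
    """Return {(j, k): moment} for all integer (j, k) with |j| + |k| <= order.

    Each moment is the sample mean of exp(i * (j*phi + k*psi)); its real part
    is the cosine moment and its imaginary part the sine moment.
    pairs_deg: iterable of (phi, psi) in degrees."""
    pairs = [(math.radians(phi), math.radians(psi)) for phi, psi in pairs_deg]
    n = len(pairs)
    moments = {}
    for j in range(-order, order + 1):
        for k in range(-order, order + 1):
            if abs(j) + abs(k) > order:
                continue
            total = sum(cmath.exp(1j * (j * phi + k * psi)) for phi, psi in pairs)
            moments[(j, k)] = total / n
    return moments

# Three made-up (phi, psi) pairs in degrees, for illustration only.
sample = [(-60.0, -45.0), (-120.0, 130.0), (-65.0, -40.0)]
moments = trigonometric_moments(sample, order=2)
```

The moments for (j, k) and (-j, -k) are complex conjugates of each other, so only half of them carry independent information.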

Torsional angles Φ and Ψ describe the conformation of the main chain of the protein. Here, we show a fragment of a protein chain; peptide units are represented by tiles.

Comparison of Experimental Data With Our Exponential Approximation

10-degree bin histogram of the torsional angles Φ and Ψ in our dataset of 10692 such pairs.

Our approximation of order 4.

Our approximation is given by the function:

Comparison of Information Content (left panel) and Information Content per Parameter (right panel) of our Approach Against Standard Histogram Method

We define information content as Shannon's information entropy.
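For reference, the Shannon entropy of a binned (Φ, Ψ) distribution can be computed as in the sketch below. How the plotted information content is derived from the entropy is not stated here, so this shows only the entropy itself; the function name `histogram_entropy` and the toy angle pairs are illustrative assumptions.

```python
import math
from collections import Counter

def histogram_entropy(pairs_deg, bin_width=10.0):
    """Shannon entropy, in bits, of the binned (phi, psi) distribution.

    pairs_deg: iterable of (phi, psi) in degrees, in the range [-180, 180].
    Empty bins contribute nothing to the sum."""
    n_bins = int(round(360.0 / bin_width))
    counts = Counter(
        (
            int((phi + 180.0) // bin_width) % n_bins,
            int((psi + 180.0) // bin_width) % n_bins,
        )
        for phi, psi in pairs_deg
    )
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())

# Four points in four different bins: a uniform distribution over 4 outcomes.
spread = [(-175.0, -175.0), (-5.0, -5.0), (5.0, 5.0), (175.0, 175.0)]
entropy_spread = histogram_entropy(spread)

# All points in one bin: no uncertainty.
entropy_single = histogram_entropy([(-60.0, -45.0)] * 10)
```

A histogram with 10-degree bins in both angles has 36 × 36 = 1296 bins, each of which acts as a free parameter, which is relevant when comparing information content per parameter.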

Note how unfavorable the plot of information content per parameter is for the histogram method.


Data-mining techniques, such as Principal Component Analysis, can substantially reduce the dimensionality of the problems encountered during protein model building. Moreover, they often ease clustering (cf. protein fragment library) or classification (cf. proline conformations).

Using a smooth probability distribution instead of a traditional histogram allows for further analysis, such as optimization with analytical computation of derivatives. Calculating the information content of parameters helps in making an informed decision about how many parameters to consider.
