4.6 Methods
4.6.1 Features used to train Random Forest Model
Given a multiple sequence alignment of a protein family, various sequence features can be derived that have been found to be informative of a residue-residue contact.
In total there are 250 features that can be divided into global, single position and pairwise features and are described in the following sections. If not stated otherwise, weighted features have been computed using amino acid counts or amino acid frequencies based on weighted sequences as described in section 2.6.3.
4.6.1.1 Global Features
These features describe alignment characteristics. Every pair of residues \((i,j)\) from the same protein will be assigned the same feature values.
Feature | Description | No. Features per residue pair \((i, j)\) |
---|---|---|
L | log of protein length | 1 |
N | number of sequences | 1 |
Neff | number of effective sequences computed as the sum over sequence weights (see section 2.6.3) | 1 |
gaps | average percentage of gaps over all positions | 1 |
diversity | \(\frac{\sqrt{N}}{L}\), N=number of sequences, L=protein length | 1 |
amino acid composition | weighted amino acid frequencies in alignment | 20 |
Psipred | secondary structure prediction by PSIPRED (v4.0) [225] given as average three state propensities | 3 |
NetsurfP | secondary structure prediction by Netsurfp (v1.0) [224] given as average three state propensities | 3 |
contact prior protein length | simple contact predictor based on expected number of contacts per protein with respect to protein length (see description below) | 1 |
There are in total 32 global alignment features per residue pair.
4.6.1.2 Single Position Features
These features describe characteristics of a single alignment column. Every residue pair \((i,j)\) will be described by two features, once for each position.
Feature | Description | No. Features per residue pair \((i, j)\) |
---|---|---|
shannon entropy (excluding gaps) | \(- \sum_{a=1}^{20} p_a \log p_a\) | 2 |
shannon entropy (including gaps) | \(- \sum_{a=1}^{21} p_a \log p_a\) | 2 |
kullback leibler divergence | between weighted observed and background amino acid frequencies [226] | 2 |
jensen shannon divergence | between weighted observed and background amino acid frequencies [226] | 2 |
PSSM | log odds ratio of weighted observed and background amino acid frequencies [226] | 40 |
secondary structure prediction | three state propensities PSIPRED (v4.0) [225] | 6 |
secondary structure prediction | three state propensities Netsurfp (v1.0) [224] | 6 |
solvent accessibility prediction | RSA and RSA Z-score Netsurfp (v1.0) [224] | 4 |
relative position in sequence | \(\frac{i}{L}\) for a protein of length \(L\) | 2 |
number of ungapped sequences | \(\sum_n w_n I(x_{ni} \neq 20)\) for sequences \(x_n\) and sequence weights \(w_n\) | 2 |
percentage of gaps | \(\frac{\sum_n w_n I(x_{ni} = 20)}{N_{\text{eff}}}\) for sequences \(x_n\) and sequence weights \(w_n\) | 2 |
Average Atchley Factor | Atchley Factors 1-5 [227] | 10 |
Average polarity (Grantham) | Polarity according to Grantham [228]. Data taken from AAindex Database [229]. | 2 |
Average polarity (Zimmermann) | Polarity according to Zimmermann et al. [230]. Data taken from AAindex Database [229]. | 2 |
Average isoelectricity | Isoelectric point according to Zimmermann et al. [230]. Data taken from AAindex Database [229]. | 2 |
Average hydrophobicity (Wimley&White) | Hydrophobicity scale according to Wimley & White [231]. Data taken from UCSF Chimera [231]. | 2 |
Average hydrophobicity (Kyte&Dolittle) | Hydrophobicity index according to Kyte & Doolittle [232]. Data taken from AAindex Database [229]. | 2 |
Average hydrophobicity (Cornette) | Hydrophobicity according to Cornette [233]. | 2 |
Average bulkiness | Bulkiness according to Zimmerman et al. [230]. Data taken from AAindex Database [229]. | 2 |
Average volume | Average volumes of residues according to Pontius et al. [234]. Data taken from AAindex Database [229]. | 2 |
There are 48 single position features per residue and consequently 96 single position features per residue pair.
Additionally, all single position features will be computed within a window of size 5. The window feature for center residue \(i\) will be computed as the mean feature over residues \([i\!-\!2, \ldots, i, \ldots, i\!+\!2]\). Whenever the window extends beyond the range of the sequence (for \(i\!<\!2\) and \(i\!>\!(L-2)\)), the window feature will be computed only over valid sequence positions. This results in an additional 96 window features per residue pair.
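The window averaging with truncation at the sequence ends can be sketched as follows (the function name and the use of a NumPy array are illustrative):

```python
import numpy as np

def window_mean(feature, w=2):
    """Mean of a per-position feature over the window [i-2, ..., i+2],
    truncated to valid sequence positions at the ends."""
    L = len(feature)
    out = np.empty(L)
    for i in range(L):
        lo, hi = max(0, i - w), min(L, i + w + 1)
        out[i] = feature[lo:hi].mean()
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(window_mean(x))  # position 0 averages only [1, 2, 3] -> 2.0
```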
4.6.1.3 Pairwise Features
These features are computed for every pair of columns \((i, j)\) in the alignment with \(i<j\).
Feature | Description | No. Features per residue pair \((i, j)\) |
---|---|---|
sequence separation | \(j-i\) | 1 |
gaps | pairwise percentage of gaps using weighted sequences | 1 |
number of ungapped sequences | \(\sum_n w_n I(x_{ni} \! \neq \! 20, x_{nj} \! \neq \! 20)\) for sequences \(x_n\) and sequence weights \(w_n\) | 1 |
correlation physico-chemical features | pairwise correlation of all physico-chemical properties listed in table 4.2 | 13 |
pairwise potential (buried) | Average quasi-chemical energy of interactions in an average buried environment according to Miyazawa&Jernigan [223]. Data taken from AAindex Database [229]. | 1 |
pairwise potential (water) | Average quasi-chemical energy of transfer of amino acids from water to the protein environment according to Miyazawa&Jernigan [223]. Data taken from AAindex Database [229]. | 1 |
pairwise potential (Li&Fang) | Average general contact potential by Li&Fang [69] | 1 |
pairwise potential (Zhu&Braun) | Average statistical potential from residue pairs in beta-sheets by Zhu&Braun [235] | 1 |
joint shannon entropy (excluding gaps) | \(- \sum_{a=1}^{20}\sum_{b=1}^{20} p(a,b) \log p(a,b)\) | 1 |
joint shannon entropy (including gaps) | \(- \sum_{a=1}^{21}\sum_{b=1}^{21} p(a,b) \log p(a,b)\) | 1 |
normalized MI | normalized mutual information of amino acid counts at two positions | 1 |
MI (+pseudo-counts) | mutual information of amino acid counts at two positions, including uniform pseudo-counts | 1 |
MI (+pseudo-counts + APC) | mutual information of amino acid counts at two positions; including pseudo-counts and average product correction | 1 |
OMES coevolution score | according to Fodor&Aldrich [222], with and without APC | 2 |
There are in total 26 pairwise sequence features.
4.6.2 Simple Contact Prior with Respect to Protein Length
The last feature listed in table 4.1 (“contact prior protein length”) denotes a simple contact predictor based on the expected number of contacts per protein with respect to protein length. The average number of contacts per residue, computed as the observed number of contacts divided by the protein length \(L\), has a non-linear relationship with \(L\), as can be seen in Figure 4.9.
In log space, the average number of contacts per residue can be fitted with a linear regression and yields the following functions:
- \(f(L) = 1.556 + 0.596 \log (L)\) for sequence separation of 0 positions
- \(f(L) = -1.273 + 0.59 \log (L)\) for sequence separation of 8 positions
- \(f(L) = -1.567 + 0.615 \log (L)\) for sequence separation of 12 positions
- \(f(L) = -2.0 + 0.624 \log (L)\) for sequence separation of 24 positions
A simple contact predictor can be formulated as the ratio of the expected number of contacts per residue, given by \(f(L)\), and the possible number of contacts per residue which is \(L-1\),
\[ p(r_{ij} = 1 | L) = \frac{f(L)}{L-1} \; , \]
with \(r_{ij}=1\) representing a contact between residue \(i\) and \(j\).
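This prior can be sketched as a small lookup over the fitted coefficients; the assumption here is that \(f(L)\) directly yields the expected number of contacts per residue (i.e. the fit is linear in \(\log L\)):

```python
import numpy as np

# Regression coefficients (intercept, slope) per sequence separation,
# taken from the fitted functions above.
_COEF = {0: (1.556, 0.596), 8: (-1.273, 0.590),
         12: (-1.567, 0.615), 24: (-2.0, 0.624)}

def contact_prior(L, seq_sep=12):
    """p(r_ij = 1 | L): expected contacts per residue f(L)
    divided by the L-1 possible contact partners per residue."""
    a, b = _COEF[seq_sep]
    f_L = a + b * np.log(L)
    return f_L / (L - 1)
```

For example, a protein of length 200 at sequence separation 12 gives a prior of roughly one percent.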
4.6.3 Cross-validation for Random Forest Training
Proteins constitute highly imbalanced datasets with respect to the number of residue pairs that do and do not form physical contacts. As can be seen in Figure 4.10, depending on the enforced sequence separation threshold and the protein length, the percentage of contacts per protein varies between 0% and 25%. Most studies applying machine learning algorithms to residue-residue contact prediction rebalanced the dataset by undersampling the majority class. Table 4.4 lists the proportions of contacts to non-contacts used to train several machine learning contact predictors. I followed the same strategy and undersampled residue pairs that are not in physical contact, using a proportion of contacts to non-contacts of 1:5.
Study | Machine Learning Algorithm | Proportion of Contacts : Non-contacts |
---|---|---|
Wu et al. (2008) [68] | SVM | 1:4 |
Li et al. (2011) [69] | Random Forest | 1:1, 1:2 |
Wang et al. (2011) [70] | Random Forest | 1:4 |
Di Lena et al. (2012) [78] | deep neural network | 1:\(\approx \!4\) (sampling 20% of non-contacts) |
Wang et al. (2013) [71] | Random Forest | 1:\(\approx 4\) (sampling 20% of non-contacts) |
The total training set comprises 50,000 residue pairs \(<8\,\text{Å}\) (“contacts”) and 250,000 residue pairs \(>8\,\text{Å}\) (“non-contacts”). I filtered residue pairs using a sequence separation of 12 positions and selected at most 100 contacts and 500 non-contacts per protein. The data is collected in equal parts from data subsets 1-5 (see methods section 2.6), so that the training set consists of five subsets that are non-redundant at the fold level. Each of the five models for cross-validation is trained on 40,000 contacts and 200,000 non-contacts originating from four of the five subsets. As the training set has been undersampled for non-contacts, it is not representative of real-world proteins, and the models need to be validated on a more realistic validation set. Therefore, each of the five trained models is validated not on the hold-out set but on a separate validation set containing 40 proteins. The proteins of the validation sets are randomly selected from the respective fifth data subset and are consequently non-redundant at the fold level with the training data. Performance is assessed by means of the standard contact prediction benchmark (mean precision against top ranked contacts).
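The per-protein undersampling can be sketched as follows; the function name, the representation of pairs as plain lists, and the fixed random seed are illustrative, as the text does not specify the implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(contacts, non_contacts, max_c=100, max_nc=500):
    """Undersample the residue pairs of one protein: keep at most
    100 contacts and 500 non-contacts, matching the 1:5 proportion."""
    idx_c = rng.choice(len(contacts), size=min(max_c, len(contacts)),
                       replace=False)
    idx_nc = rng.choice(len(non_contacts), size=min(max_nc, len(non_contacts)),
                        replace=False)
    return [contacts[i] for i in idx_c], [non_contacts[i] for i in idx_nc]
```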
I used the RandomForestClassifier class from the Python package scikit-learn (v0.19) [236] and trained the models on the features extracted from MSAs listed in methods section 4.6.1.
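A minimal training sketch with scikit-learn's RandomForestClassifier is shown below; the feature matrix is random placeholder data standing in for the 250 MSA-derived features, and the hyperparameters are illustrative choices, not those used in this work:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 600 residue pairs with 250 features each,
# at the 1:5 contact : non-contact proportion used for training.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 250))
y = np.repeat([1, 0], [100, 500])   # 1 = contact, 0 = non-contact

# n_estimators and random_state are illustrative, not from the text.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # predicted contact probability per pair
```

Ranking residue pairs by `scores` then yields the top-ranked contacts used in the precision benchmark.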
4.6.4 Feature Selection
A random forest model is trained on the total set of features. Given the distribution of Gini importance values of features from the model, subsets of features are defined by features having Gini importance values larger than the \(\{10, 30, 50, 70, 90\}\)-percentile of the distribution. Performance of the models trained on these subsets of features is evaluated on the same validation set.
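The percentile-based feature selection can be sketched as follows; the strict-inequality threshold convention is an assumption of this sketch:

```python
import numpy as np

def select_by_gini(importances, percentile):
    """Indices of features whose Gini importance exceeds the given
    percentile of the importance distribution (e.g. 10, 30, 50, 70, 90)."""
    threshold = np.percentile(importances, percentile)
    return np.where(importances > threshold)[0]

# Toy example: keeping features above the median of the distribution.
imp = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
print(select_by_gini(imp, 50))
```

In practice `importances` would come from the trained model's `feature_importances_` attribute.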
References
225. Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202., doi: 10.1006/jmbi.1999.3091.
224. Petersen, B., Petersen, T.N., Andersen, P., Nielsen, M., and Lundegaard, C. (2009). A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct. Biol. 9, 51., doi: 10.1186/1472-6807-9-51.
226. Robinson, A.B., and Robinson, L.R. (1991). Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Natl. Acad. Sci. U. S. A. 88, 8880–4.
227. Atchley, W.R., Zhao, J., Fernandes, A.D., and Drüke, T. (2005). Solving the protein sequence metric problem. Proc. Natl. Acad. Sci. U. S. A. 102, 6395–400., doi: 10.1073/pnas.0408677102.
228. Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185, 862–4.
229. Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., and Kanehisa, M. (2008). AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202–5., doi: 10.1093/nar/gkm998.
230. Zimmerman, J.M., Eliezer, N., and Simha, R. (1968). The characterization of amino acid sequences in proteins by statistical methods. J. Theor. Biol. 21, 170–201.
231. Wimley, W.C., and White, S.H. (1996). Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nat. Struct. Biol. 3, 842–8.
232. Kyte, J., and Doolittle, R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132., doi: 10.1016/0022-2836(82)90515-0.
233. Cornette, J.L., Cease, K.B., Margalit, H., Spouge, J.L., Berzofsky, J.A., and DeLisi, C. (1987). Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J. Mol. Biol. 195, 659–685., doi: 10.1016/0022-2836(87)90189-6.
234. Pontius, J., Richelle, J., and Wodak, S.J. (1996). Deviations from Standard Atomic Volumes as a Quality Measure for Protein Crystal Structures. J. Mol. Biol. 264, 121–136., doi: 10.1006/jmbi.1996.0628.
223. Miyazawa, S., and Jernigan, R.L. (1999). Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins 34, 49–68.
69. Li, Y., Fang, Y., and Fang, J. (2011). Predicting residue-residue contacts using random forest models. Bioinformatics 27., doi: 10.1093/bioinformatics/btr579.
235. Zhu, H., and Braun, W. (1999). Sequence specificity, statistical potentials, and three-dimensional structure prediction with self-correcting distance geometry calculations of beta-sheet formation in proteins. Protein Sci. 8, 326–42., doi: 10.1110/ps.8.2.326.
222. Fodor, A.A., and Aldrich, R.W. (2004). Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins 56, 211–21., doi: 10.1002/prot.20098.
68. Wu, S., and Zhang, Y. (2008). A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics 24, 924–31.
70. Wang, X.-F., Chen, Z., Wang, C., Yan, R.-X., Zhang, Z., and Song, J. (2011). Predicting residue-residue contacts and helix-helix interactions in transmembrane proteins using an integrative feature-based random forest approach. PLoS One 6, e26767., doi: 10.1371/journal.pone.0026767.
78. Di Lena, P., Nagata, K., and Baldi, P. (2012). Deep architectures for protein contact map prediction. Bioinformatics 28, 2449–57.
71. Wang, Z., and Xu, J. (2013). Predicting protein contact map using evolutionary and physical constraints by integer programming. Bioinformatics 29, i266–73.
236. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., and Dubourg, V. et al. (2011). Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830.