4.6 Methods
4.6.1 Features used to train Random Forest Model
Given a multiple sequence alignment of a protein family, various sequence features can be derived that have been found to be informative of a residue-residue contact.
In total there are 250 features that can be divided into global, single position and pairwise features and are described in the following sections. If not stated otherwise, weighted features have been computed using amino acid counts or amino acid frequencies based on weighted sequences as described in section 2.6.3.
4.6.1.1 Global Features
These features describe alignment characteristics. Every pair of residues \((i,j)\) from the same protein will be assigned the same feature values.
Feature | Description | No. Features per residue pair \((i, j)\) |
---|---|---|
L | log of protein length | 1 |
N | number of sequences | 1 |
Neff | number of effective sequences computed as the sum over sequence weights (see section 2.6.3) | 1 |
gaps | average percentage of gaps over all positions | 1 |
diversity | \(\frac{\sqrt{N}}{L}\), N=number of sequences, L=protein length | 1 |
amino acid composition | weighted amino acid frequencies in alignment | 20 |
Psipred | secondary structure prediction by PSIPRED (v4.0) [225] given as average three state propensities | 3 |
NetsurfP | secondary structure prediction by Netsurfp (v1.0) [224] given as average three state propensities | 3 |
contact prior protein length | simple contact predictor based on expected number of contacts per protein with respect to protein length (see description below) | 1 |
There are in total 32 global alignment features per residue pair.
4.6.1.2 Single Position Features
These features describe characteristics of a single alignment column. Every residue pair \((i,j)\) will be described by two features, once for each position.
Feature | Description | No. Features per residue pair \((i, j)\) |
---|---|---|
shannon entropy (excluding gaps) | \(- \sum_{a=1}^{20} p_a \log p_a\) | 2 |
shannon entropy (including gaps) | \(- \sum_{a=1}^{21} p_a \log p_a\) | 2 |
kullback leibler divergence | between weighted observed and background amino acid frequencies [226] | 2 |
jensen shannon divergence | between weighted observed and background amino acid frequencies [226] | 2 |
PSSM | log odds ratio of weighted observed and background amino acid frequencies [226] | 40 |
secondary structure prediction | three state propensities PSIPRED (v4.0) [225] | 6 |
secondary structure prediction | three state propensities Netsurfp (v1.0) [224] | 6 |
solvent accessibility prediction | RSA and RSA Z-score Netsurfp (v1.0) [224] | 4 |
relative position in sequence | \(\frac{i}{L}\) for a protein of length \(L\) | 2 |
number of ungapped sequences | \(\sum_n w_n I(x_{ni} \neq 20)\) for sequences \(x_n\) and sequence weights \(w_n\) | 2 |
percentage of gaps | \(\frac{\sum_n w_n I(x_{ni} = 20)}{N_{\text{eff}}}\) for sequences \(x_n\) and sequence weights \(w_n\) | 2 |
Average Atchley Factor | Atchley Factors 1-5 [227] | 10 |
Average polarity (Grantham) | Polarity according to Grantham [228]. Data taken from AAindex Database [229]. | 2 |
Average polarity (Zimmermann) | Polarity according to Zimmermann et al. [230]. Data taken from AAindex Database [229]. | 2 |
Average isoelectricity | Isoelectric point according to Zimmermann et al. [230]. Data taken from AAindex Database [229]. | 2 |
Average hydrophobicity (Wimley&White) | Hydrophobicity scale according to Wimley & White [231]. Data taken from UCSF Chimera [231]. | 2 |
Average hydrophobicity (Kyte&Dolittle) | Hydrophobicity index according to Kyte & Doolittle [232]. Data taken from AAindex Database [229]. | 2 |
Average hydrophobicity (Cornette) | Hydrophobicity according to Cornette [233]. | 2 |
Average bulkiness | Bulkiness according to Zimmerman et al. [230]. Data taken from AAindex Database [229]. | 2 |
Average volume | Average volumes of residues according to Pontius et al. [234]. Data taken from AAindex Database [229]. | 2 |
There are 48 single position features per residue and consequently 96 single position features per residue pair.
Additionally, all single position features will be computed within a window of size 5. The window feature for center residue \(i\) will be computed as the mean feature over residues \([i\!-\!2, \ldots, i, \ldots, i\!+\!2]\). Whenever the window extends beyond the range of the sequence (for \(i\!<\!2\) and \(i\!>\!(L-2)\)), the window feature will be computed only over valid sequence positions. This results in an additional 96 window features per residue pair.
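The window averaging with truncation at the sequence ends can be sketched as follows (the function name and the use of a NumPy array are illustrative):

```python
import numpy as np

def window_mean(feature, w=2):
    """Mean of a per-position feature over the window [i-2, ..., i+2],
    truncated to valid sequence positions at the ends."""
    L = len(feature)
    out = np.empty(L)
    for i in range(L):
        lo, hi = max(0, i - w), min(L, i + w + 1)
        out[i] = feature[lo:hi].mean()
    return out

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(window_mean(x))  # position 0 averages only [1, 2, 3] -> 2.0
```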
4.6.1.3 Pairwise Features
These features are computed for every pair of columns \((i, j)\) in the alignment with \(i<j\).
Feature | Description | No. Features per residue pair \((i, j)\) |
---|---|---|
sequence separation | \(j-i\) | 1 |
gaps | pairwise percentage of gaps using weighted sequences | 1 |
number of ungapped sequences | \(\sum_n w_n I(x_{ni} \! \neq \! 20, x_{nj} \! \neq \! 20)\) for sequences \(x_n\) and sequence weights \(w_n\) | 1 |
correlation physico-chemical features | pairwise correlation of all physico-chemical properties listed in table 4.2 | 13 |
pairwise potential (buried) | Average quasi-chemical energy of interactions in an average buried environment according to Miyazawa&Jernigan [223]. Data taken from AAindex Database [229]. | 1 |
pairwise potential (water) | Average quasi-chemical energy of transfer of amino acids from water to the protein environment according to Miyazawa&Jernigan [223]. Data taken from AAindex Database [229]. | 1 |
pairwise potential (Li&Fang) | Average general contact potential by Li&Fang [69] | 1 |
pairwise potential (Zhu&Braun) | Average statistical potential from residue pairs in beta-sheets by Zhu&Braun [235] | 1 |
joint shannon entropy (excluding gaps) | \(- \sum_{a=1}^{20}\sum_{b=1}^{20} p(a,b) \log p(a,b)\) | 1 |
joint shannon entropy (including gaps) | \(- \sum_{a=1}^{21}\sum_{b=1}^{21} p(a,b) \log p(a,b)\) | 1 |
normalized MI | normalized mutual information of amino acid counts at two positions | 1 |
MI (+pseudo-counts) | mutual information of amino acid counts at two positions, including uniform pseudo-counts | 1 |
MI (+pseudo-counts + APC) | mutual information of amino acid counts at two positions; including pseudo-counts and average product correction | 1 |
OMES coevolution score | according to Fodor&Aldrich [222], with and without APC | 2 |
There are in total 26 pairwise sequence features.
4.6.2 Simple Contact Prior with Respect to Protein Length
The last feature listed in table 4.1 (“contact prior protein length”) denotes a simple contact predictor based on the expected number of contacts per protein with respect to protein length. The average number of contacts per residue, computed as the observed number of contacts divided by the protein length \(L\), has a non-linear relationship with \(L\), as can be seen in Figure 4.9.
In log space, the average number of contacts per residue can be fitted with a linear regression and yields the following functions:
- \(f(L) = 1.556 + 0.596 \log (L)\) for sequence separation of 0 positions
- \(f(L) = -1.273 + 0.59 \log (L)\) for sequence separation of 8 positions
- \(f(L) = -1.567 + 0.615 \log (L)\) for sequence separation of 12 positions
- \(f(L) = -2.0 + 0.624 \log (L)\) for sequence separation of 24 positions
A simple contact predictor can be formulated as the ratio of the expected number of contacts per residue, given by \(f(L)\), and the possible number of contacts per residue which is \(L-1\),
\[ p(r_{ij} = 1 | L) = \frac{f(L)}{L-1} \; , \]
with \(r_{ij}=1\) representing a contact between residue \(i\) and \(j\).
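This prior can be sketched as a small lookup over the fitted coefficients; the assumption here is that \(f(L)\) directly yields the expected number of contacts per residue (i.e. the fit is linear in \(\log L\)):

```python
import numpy as np

# Regression coefficients (intercept, slope) per sequence separation,
# taken from the fitted functions above.
_COEF = {0: (1.556, 0.596), 8: (-1.273, 0.590),
         12: (-1.567, 0.615), 24: (-2.0, 0.624)}

def contact_prior(L, seq_sep=12):
    """p(r_ij = 1 | L): expected contacts per residue f(L)
    divided by the L-1 possible contact partners per residue."""
    a, b = _COEF[seq_sep]
    f_L = a + b * np.log(L)
    return f_L / (L - 1)
```

For example, a protein of length 200 at sequence separation 12 gives a prior of roughly one percent.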
4.6.3 Cross-validation for Random Forest Training
Proteins constitute highly imbalanced datasets with respect to the number of residue pairs that do and do not form physical contacts. As can be seen in Figure 4.10, depending on the enforced sequence separation threshold and the protein length, the percentage of contacts per protein varies between 0% and 25%. Most studies applying machine learning algorithms to residue-residue contact prediction rebalanced the dataset by undersampling the majority class. Table 4.4 lists the proportions of contacts to non-contacts used to train several machine learning contact predictors. I followed the same strategy and undersampled residue pairs that are not in physical contact, using a proportion of contacts to non-contacts of 1:5.
Study | Machine Learning Algorithm | Proportion of Contacts : Non-contacts |
---|---|---|
Wu et al. (2008) [68] | SVM | 1:4 |
Li et al. (2011) [69] | Random Forest | 1:1, 1:2 |
Wang et al. (2011) [70] | Random Forest | 1:4 |
Di Lena et al. (2012) [78] | deep neural network | 1:\(\approx \!4\) (sampling 20% of non-contacts) |
Wang et al. (2013) [71] | Random Forest | 1:\(\approx 4\) (sampling 20% of non-contacts) |
The total training set comprises 50,000 residue pairs \(<8\,\text{Å}\) (“contacts”) and 250,000 residue pairs \(>8\,\text{Å}\) (“non-contacts”). I filtered residue pairs using a sequence separation of 12 positions and selected at most 100 contacts and 500 non-contacts per protein. The data is collected in equal parts from data subsets 1-5 (see methods section 2.6), so that the training set consists of five subsets that are non-redundant at the fold level. Each of the five models for cross-validation is trained on 40,000 contacts and 200,000 non-contacts originating from four of the five subsets. As the training set has been undersampled for non-contacts, it is not representative of real-world proteins, and the models need to be validated on a more realistic validation set. Therefore, each of the five trained models is validated not on the hold-out set but on a separate validation set containing 40 proteins. The proteins of the validation sets are randomly selected from the respective fifth data subset and are consequently non-redundant at the fold level with the training data. Performance is assessed by means of the standard contact prediction benchmark (mean precision against top ranked contacts).
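The per-protein undersampling can be sketched as follows; the function name, the representation of pairs as plain lists, and the fixed random seed are illustrative, as the text does not specify the implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairs(contacts, non_contacts, max_c=100, max_nc=500):
    """Undersample the residue pairs of one protein: keep at most
    100 contacts and 500 non-contacts, matching the 1:5 proportion."""
    idx_c = rng.choice(len(contacts), size=min(max_c, len(contacts)),
                       replace=False)
    idx_nc = rng.choice(len(non_contacts), size=min(max_nc, len(non_contacts)),
                        replace=False)
    return [contacts[i] for i in idx_c], [non_contacts[i] for i in idx_nc]
```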
I used the RandomForestClassifier class from the Python package scikit-learn (v0.19) [236] and trained the models on the features extracted from MSAs listed in methods section 4.6.1.
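A minimal training sketch with scikit-learn's RandomForestClassifier is shown below; the feature matrix is random placeholder data standing in for the 250 MSA-derived features, and the hyperparameters are illustrative choices, not those used in this work:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder data: 600 residue pairs with 250 features each,
# at the 1:5 contact : non-contact proportion used for training.
rng = np.random.default_rng(1)
X = rng.normal(size=(600, 250))
y = np.repeat([1, 0], [100, 500])   # 1 = contact, 0 = non-contact

# n_estimators and random_state are illustrative, not from the text.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]  # predicted contact probability per pair
```

Ranking residue pairs by `scores` then yields the top-ranked contacts used in the precision benchmark.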
4.6.4 Feature Selection
A random forest model is trained on the total set of features. Given the distribution of Gini importance values of features from the model, subsets of features are defined by features having Gini importance values larger than the \(\{10, 30, 50, 70, 90\}\)-percentile of the distribution. Performance of the models trained on these subsets of features is evaluated on the same validation set.
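The percentile-based feature selection can be sketched as follows; the strict-inequality threshold convention is an assumption of this sketch:

```python
import numpy as np

def select_by_gini(importances, percentile):
    """Indices of features whose Gini importance exceeds the given
    percentile of the importance distribution (e.g. 10, 30, 50, 70, 90)."""
    threshold = np.percentile(importances, percentile)
    return np.where(importances > threshold)[0]

# Toy example: keeping features above the median of the distribution.
imp = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
print(select_by_gini(imp, 50))
```

In practice `importances` would come from the trained model's `feature_importances_` attribute.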
References
225. Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202., doi: 10.1006/jmbi.1999.3091.
224. Petersen, B., Petersen, T.N., Andersen, P., Nielsen, M., and Lundegaard, C. (2009). A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct. Biol. 9, 51., doi: 10.1186/1472-6807-9-51.
226. Robinson, A.B., and Robinson, L.R. (1991). Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Natl. Acad. Sci. U. S. A. 88, 8880–4.
227. Atchley, W.R., Zhao, J., Fernandes, A.D., and Drüke, T. (2005). Solving the protein sequence metric problem. Proc. Natl. Acad. Sci. U. S. A. 102, 6395–400., doi: 10.1073/pnas.0408677102.
228. Grantham, R. (1974). Amino acid difference formula to help explain protein evolution. Science 185, 862–4.
229. Kawashima, S., Pokarowski, P., Pokarowska, M., Kolinski, A., Katayama, T., and Kanehisa, M. (2008). AAindex: amino acid index database, progress report 2008. Nucleic Acids Res. 36, D202–5., doi: 10.1093/nar/gkm998.
230. Zimmerman, J.M., Eliezer, N., and Simha, R. (1968). The characterization of amino acid sequences in proteins by statistical methods. J. Theor. Biol. 21, 170–201.
231. Wimley, W.C., and White, S.H. (1996). Experimentally determined hydrophobicity scale for proteins at membrane interfaces. Nat. Struct. Biol. 3, 842–8.
232. Kyte, J., and Doolittle, R.F. (1982). A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132., doi: 10.1016/0022-2836(82)90515-0.
233. Cornette, J.L., Cease, K.B., Margalit, H., Spouge, J.L., Berzofsky, J.A., and DeLisi, C. (1987). Hydrophobicity scales and computational techniques for detecting amphipathic structures in proteins. J. Mol. Biol. 195, 659–685., doi: 10.1016/0022-2836(87)90189-6.
234. Pontius, J., Richelle, J., and Wodak, S.J. (1996). Deviations from Standard Atomic Volumes as a Quality Measure for Protein Crystal Structures. J. Mol. Biol. 264, 121–136., doi: 10.1006/jmbi.1996.0628.
223. Miyazawa, S., and Jernigan, R.L. (1999). Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins 34, 49–68.
69. Li, Y., Fang, Y., and Fang, J. (2011). Predicting residue-residue contacts using random forest models. Bioinformatics 27., doi: 10.1093/bioinformatics/btr579.
235. Zhu, H., and Braun, W. (1999). Sequence specificity, statistical potentials, and three-dimensional structure prediction with self-correcting distance geometry calculations of beta-sheet formation in proteins. Protein Sci. 8, 326–42., doi: 10.1110/ps.8.2.326.
222. Fodor, A.A., and Aldrich, R.W. (2004). Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins 56, 211–21., doi: 10.1002/prot.20098.
68. Wu, S., and Zhang, Y. (2008). A comprehensive assessment of sequence-based and template-based methods for protein contact prediction. Bioinformatics 24, 924–31.
70. Wang, X.-F., Chen, Z., Wang, C., Yan, R.-X., Zhang, Z., and Song, J. (2011). Predicting residue-residue contacts and helix-helix interactions in transmembrane proteins using an integrative feature-based random forest approach. PLoS One 6, e26767., doi: 10.1371/journal.pone.0026767.
78. Di Lena, P., Nagata, K., and Baldi, P. (2012). Deep architectures for protein contact map prediction. Bioinformatics 28, 2449–57.
71. Wang, Z., and Xu, J. (2013). Predicting protein contact map using evolutionary and physical constraints by integer programming. Bioinformatics 29, i266–73.
236. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., and Dubourg, V. et al. (2011). Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 12, 2825–2830.