PhD thesis: residue-residue contact prediction

4.3 Evaluating Random Forest Model as Contact Predictor

I trained a random forest classifier on the feature set described in methods section 4.6.1, using the optimal hyperparameters identified by 5-fold cross-validation as described in the previous section.

Figure 4.4 shows the ten most important features ranked by Gini importance. The two local statistical contact scores, OMES [222] and MI (mutual information between amino acid counts), are the most important features, together with the mean pair potentials according to Miyazawa & Jernigan [223] and Li & Fang [69]. Further important features include the relative solvent accessibility at both pair positions, the total percentage of gaps at both positions, the correlation between the mean isoelectric point property at both positions, the sequence separation, and the beta-sheet propensity in a window of size five around position i.

Figure 4.4: Top ten features ranked according to Gini importance. OMES+APC: APC corrected OMES score according to Fodor & Aldrich [222]. mean pair potential (Miyazawa & Jernigan): average quasi-chemical energy of transfer of amino acids from water to the protein environment [223]. MI+APC: APC corrected mutual information between amino acid counts (using pseudo-counts). mean pair potential (Li & Fang): average general contact potential by Li & Fang [69]. rel. solvent accessibility i(j): RSA score computed with NetSurfP (v1.0) [224] for position i(j). pairwise gap%: percentage of gapped sequences at either position i or j. correlation mean isoelectric feature: Pearson correlation between the mean isoelectric point feature (according to Zimmermann et al., 1968) for positions i and j. sequence separation: |j-i|. beta sheet propensity window(i): beta-sheet propensity according to Psipred [225] computed within a window of five positions around i. Features are described in detail in methods section 4.6.1.
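Two of the features in Figure 4.4 are APC corrected scores. As a minimal sketch, assuming the commonly used form of the average product correction (each raw score minus the product of the two positional mean scores divided by the overall mean score), the correction could be written as follows. The function name, the exclusion of the diagonal from the means, and the toy matrix are illustrative assumptions and not necessarily the implementation used for this thesis.

```python
def apc_correct(scores):
    """Apply an average product correction to a symmetric L x L score matrix.

    Assumption: diagonal entries are ignored when computing the means.
    """
    n = len(scores)
    # Mean score of each position over all other positions.
    row_means = [
        sum(scores[i][j] for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    # Mean over all off-diagonal entries (equal to the mean of the row means).
    overall_mean = sum(row_means) / n

    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                background = row_means[i] * row_means[j] / overall_mean
                corrected[i][j] = scores[i][j] - background
    return corrected


# Toy 3 x 3 matrix with made-up values, only to show the call.
raw = [
    [0.0, 0.4, 0.2],
    [0.4, 0.0, 0.6],
    [0.2, 0.6, 0.0],
]
apc = apc_correct(raw)
```

The subtracted term estimates the background signal shared by positions i and j, so pairs involving positions with generally high scores are penalized more strongly.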

Many features have low Gini importance scores, which means that they are rarely selected for splitting a node or yield little impurity reduction when they are, so they can most likely be removed from the dataset. Removing irrelevant features is a convenient way to reduce model complexity. Prediction performance may even increase after the most irrelevant features have been removed [218]. For example, during the development of EPSILON-CP, a deep neural network method for contact prediction, the authors performed feature selection using boosted trees. After removing the 75% least informative features (mostly features related to amino acid composition), the performance of their predictor increased slightly [86]. Other studies have also emphasized the importance of feature selection for improving performance and reducing model complexity [67,69].
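To make the notion of Gini importance concrete, the following minimal sketch computes the Gini impurity decrease of a single binary split for a two-class problem (contact vs. non-contact). The Gini importance of a feature accumulates such decreases over all nodes that split on that feature, averaged over the trees of the forest. The function names and the toy labels are illustrative assumptions.

```python
def gini_impurity(labels):
    """Gini impurity of a node with binary labels (1 = contact, 0 = non-contact)."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n  # fraction of contacts in the node
    return 2.0 * p * (1.0 - p)  # equals 1 - p^2 - (1 - p)^2


def impurity_decrease(parent, left, right):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left)
        + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children


# Toy node with two contacts and two non-contacts (made-up labels).
parent = [1, 1, 0, 0]
informative = impurity_decrease(parent, [1, 1], [0, 0])    # perfect separation
uninformative = impurity_decrease(parent, [1, 0], [1, 0])  # no separation
```

A feature that only ever produces splits like the second one accumulates almost no importance, which is why such features are candidates for removal.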

Figure 4.5: Mean precision of top ranked predictions over 200 proteins for random forest models trained on subsets of features of decreasing importance. Subsets of features have been selected as described in methods section 4.6.4.

Figure 4.6: Mean precision for top ranked contacts on a test set of 1000 proteins. pseudo-likelihood = APC corrected Frobenius norm of couplings computed with pseudo-likelihood. random forest = random forest model trained on 75 sequence-derived features. OMES = APC corrected OMES contact score according to Fodor & Aldrich [222]. mutual information = APC corrected mutual information between amino acid counts (using pseudo-counts).

As described in methods section 4.6.4, I performed feature selection by evaluating model performance on subsets of features of decreasing importance. Most models trained on subsets of the total feature space perform nearly identically to the model trained on all features, as can be seen in Figure 4.5. Performance of the random forest models drops noticeably when only the 25 most important features are used. For the further analysis, I use the random forest model trained on the 75 most important features, as this is the smallest feature set for which performance remains nearly identical to that of the model trained on the complete feature set.
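The selection rule described above (choose the smallest feature subset whose performance stays close to that of the full model) can be sketched as follows. The tolerance parameter, the function name, and all numbers in the toy example are hypothetical and are not results from this thesis; the actual procedure is the one given in methods section 4.6.4.

```python
def smallest_adequate_subset(precision_by_size, tolerance):
    """Return the smallest subset size whose precision is within `tolerance`
    of the precision obtained with the largest feature subset.

    precision_by_size maps the number of features to the mean precision
    of the model trained on that many top-ranked features.
    """
    full_size = max(precision_by_size)
    reference = precision_by_size[full_size]
    adequate = [
        size
        for size, precision in precision_by_size.items()
        if reference - precision <= tolerance
    ]
    return min(adequate)


# Hypothetical toy values, chosen only to illustrate the rule.
toy_precisions = {8: 0.50, 6: 0.495, 4: 0.47, 2: 0.40}
chosen = smallest_adequate_subset(toy_precisions, tolerance=0.01)
```

In this toy example the model with six features is selected, because dropping to four features costs more precision than the tolerance allows.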

Figure 4.6 shows the mean precision for the random forest model trained on the 75 most important features. The random forest model has a mean precision of 0.33 for the top \(0.5\cdot L\) contacts compared to a precision of 0.47 for pseudo-likelihood. Furthermore, the random forest model improves by approximately ten percentage points in precision over the local statistical contact scores, OMES and mutual information (MI). Both scores are among the most important features of the random forest model, as can be seen in Figure 4.4.
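For reference, the precision of the top \(0.5\cdot L\) predictions for a single protein can be computed as in the following minimal sketch. The data layout (a dictionary of pair scores and a set of true contacts), the function name, and the toy values are assumptions for illustration. Any filtering of residue pairs, for example by a minimum sequence separation, is omitted here. The mean precision is the average of this value over all proteins in the test set.

```python
def precision_at_top(scores, contacts, seq_len, fraction=0.5):
    """Fraction of true contacts among the top `fraction * seq_len` scored pairs.

    scores:   dict mapping residue pairs (i, j) to predicted scores
    contacts: set of residue pairs (i, j) that are true contacts
    """
    n_top = max(1, int(fraction * seq_len))
    # Rank pairs by decreasing score and keep the n_top best.
    ranked = sorted(scores, key=scores.get, reverse=True)[:n_top]
    hits = sum(1 for pair in ranked if pair in contacts)
    return hits / len(ranked)


# Toy protein of length 8 with made-up scores and contacts.
toy_scores = {(0, 6): 0.9, (1, 7): 0.8, (0, 7): 0.7, (2, 7): 0.6, (1, 6): 0.2}
toy_contacts = {(0, 6), (0, 7), (3, 7)}
precision = precision_at_top(toy_scores, toy_contacts, seq_len=8)
```

Here the four highest scoring pairs are evaluated and two of them are true contacts.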

When analysing performance with respect to alignment size, the random forest model outperforms the pseudo-likelihood score for small alignments (see Figure F.1). Both local statistical scores, OMES and MI, also perform poorly on small alignments, which suggests that the remaining sequence-derived features are highly relevant when the alignment contains only few sequences. This finding is expected, as it is well known that models trained on simple sequence features perform almost independently of alignment size [82,86].
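An analysis by alignment size of this kind can be sketched by grouping proteins into alignment-size bins and averaging the per-protein precision within each bin. The bin edges, the function name, and the toy records below are hypothetical and do not reproduce the binning used for Figure F.1.

```python
from bisect import bisect_right
from collections import defaultdict


def mean_precision_by_bin(records, bin_edges):
    """Average precision per alignment-size bin.

    records:   iterable of (alignment_size, precision) tuples, one per protein
    bin_edges: sorted list of bin boundaries; bin k collects the sizes with
               bin_edges[k - 1] <= size < bin_edges[k]
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for size, precision in records:
        bin_index = bisect_right(bin_edges, size)
        sums[bin_index] += precision
        counts[bin_index] += 1
    return {b: sums[b] / counts[b] for b in sorted(counts)}


# Made-up records: (number of sequences in the alignment, precision).
toy_records = [(50, 0.2), (80, 0.4), (500, 0.5), (5000, 0.9)]
per_bin = mean_precision_by_bin(toy_records, bin_edges=[100, 1000])
```

Running the same function on the per-protein precisions of each method yields one curve per method, which can then be compared bin by bin.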

References

222. Fodor, A.A., and Aldrich, R.W. (2004). Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins 56, 211–21. doi: 10.1002/prot.20098.

223. Miyazawa, S., and Jernigan, R.L. (1999). Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins 34, 49–68.

69. Li, Y., Fang, Y., and Fang, J. (2011). Predicting residue-residue contacts using random forest models. Bioinformatics 27. doi: 10.1093/bioinformatics/btr579.

224. Petersen, B., Petersen, T.N., Andersen, P., Nielsen, M., and Lundegaard, C. (2009). A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct. Biol. 9. doi: 10.1186/1472-6807-9-51.

225. Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202. doi: 10.1006/jmbi.1999.3091.

218. Menze, B.H., Kelm, B.M., Masuch, R., Himmelreich, U., Bachert, P., Petrich, W., and Hamprecht, F.A. (2009). A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data. BMC Bioinformatics 10, 213. doi: 10.1186/1471-2105-10-213.

86. Stahl, K., Schneider, M., and Brock, O. (2017). EPSILON-CP: using deep learning to combine information from multiple sources for protein contact prediction. BMC Bioinformatics 18, 303. doi: 10.1186/s12859-017-1713-x.

67. Cheng, J., and Baldi, P. (2007). Improved residue contact prediction using support vector machines and a large feature set. BMC Bioinformatics 8. doi: 10.1186/1471-2105-8-113.

82. Skwark, M.J., Michel, M., Menendez Hurtado, D., Ekeberg, M., and Elofsson, A. (2016). Accurate contact predictions for thousands of protein families using PconsC3. bioRxiv.