F Training of the Random Forest Contact Prior

Figure F.1: Mean precision for top ranked contacts predicted with random forest on a test set of 1000 proteins split into four equally sized subsets with respect to Neff. Subsets are defined according to quantiles of Neff values. Upper left: Subset of proteins with Neff < Q1. Upper right: Subset of proteins with Q1 <= Neff < Q2. Lower left: Subset of proteins with Q2 <= Neff < Q3. Lower right: Subset of proteins with Q3 <= Neff < Q4. pseudo-likelihood = APC corrected Frobenius norm of couplings computed with pseudo-likelihood. random forest = random forest model trained on 75 sequence derived features. OMES = APC corrected OMES contact score according to Fodor & Aldrich [222]. mutual information = APC corrected mutual information between amino acid counts (using pseudo-counts).
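The quartile-based grouping used for the four panels can be sketched as follows. This is a minimal illustration only: the function name, the dictionary input and the choice of the "inclusive" quantile method are assumptions, and the protein with the largest Neff is placed in the last subset.

```python
from statistics import quantiles


def split_by_neff_quartiles(neff_by_protein):
    """Group protein ids into four subsets using the quartiles of Neff.

    neff_by_protein: dict mapping a protein id to its Neff value
    (hypothetical input format chosen for this sketch).
    """
    # Q1, Q2, Q3 are the three cut points that divide the data into quarters.
    q1, q2, q3 = quantiles(list(neff_by_protein.values()), n=4, method="inclusive")
    subsets = {
        "Neff < Q1": [],
        "Q1 <= Neff < Q2": [],
        "Q2 <= Neff < Q3": [],
        "Neff >= Q3": [],
    }
    for protein, neff in neff_by_protein.items():
        if neff < q1:
            subsets["Neff < Q1"].append(protein)
        elif neff < q2:
            subsets["Q1 <= Neff < Q2"].append(protein)
        elif neff < q3:
            subsets["Q2 <= Neff < Q3"].append(protein)
        else:
            subsets["Neff >= Q3"].append(protein)
    return subsets
```

With distinct Neff values the four subsets have (nearly) equal size; ties at a cut point can make them slightly unbalanced.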

Figure F.2: Top ten features for Random Forest trained with additional pseudo-likelihood contact score feature. Features ranked according to Gini importance. pseudo-likelihood: APC corrected Frobenius norm of couplings computed with pseudo-likelihood. mean pair potential (Miyazawa & Jernigan): average quasi-chemical energy of transfer of amino acids from water to the protein environment [223]. OMES+APC: APC corrected OMES score according to Fodor & Aldrich [222]. mean pair potential (Li & Fang): average general contact potential by Li & Fang [69]. rel. solvent accessibility i(j): RSA score computed with NetSurfP (v1.0) [224] for position i(j). MI+APC: APC corrected mutual information between amino acid counts (using pseudo-counts). contact prior wrt L: simple contact prior based on expected number of contacts wrt protein length (see methods section ??). log protein length: logarithm of protein length. beta sheet propensity window(i): beta-sheet propensity according to PSIPRED [225] computed within a window of five positions around i. Features are described in detail in methods section 4.6.1.
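Several of the features above are "APC corrected". As a rough sketch, the commonly used average product correction subtracts from each pair score the product of the two position means divided by the overall mean. The function name and the list-of-lists input are assumptions of this example, and excluding the diagonal from the means is one convention among several.

```python
def apc_correct(scores):
    """Apply the average product correction to a symmetric L x L score matrix.

    scores[i][j] is a raw pair score, for example the Frobenius norm of the
    coupling matrix for positions i and j. Diagonal entries are ignored.
    """
    L = len(scores)
    # Mean score of each position with all other positions.
    row_means = [
        sum(scores[i][j] for j in range(L) if j != i) / (L - 1) for i in range(L)
    ]
    # Mean over all off-diagonal entries equals the mean of the row means.
    overall_mean = sum(row_means) / L
    corrected = [[0.0] * L for _ in range(L)]
    for i in range(L):
        for j in range(L):
            if i != j:
                corrected[i][j] = (
                    scores[i][j] - row_means[i] * row_means[j] / overall_mean
                )
    return corrected
```

A matrix in which every pair has the same score is corrected to zero everywhere, which reflects the intent of removing background signal shared by all pairs.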

Figure F.3: Mean precision for top ranked contacts over 200 proteins for various random forest models trained on subsets of features. Subsets of features have been selected as described in section 4.6.4.
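The evaluation metric in these figures, mean precision for top ranked contacts, can be illustrated with the following sketch. The data layout (one dictionary per protein with hypothetical keys "length", "scores" and "is_contact") and the choice to express the number of top ranked pairs as a fraction of protein length are assumptions made for this example.

```python
def precision_at_top(scores, is_contact, n_top):
    """Fraction of true contacts among the n_top highest scoring residue pairs.

    scores: dict mapping a residue pair (i, j) to its predicted score.
    is_contact: dict mapping the same pairs to True/False.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)[:n_top]
    if not ranked:
        return 0.0
    return sum(is_contact[pair] for pair in ranked) / len(ranked)


def mean_precision(proteins, length_fraction):
    """Average precision over proteins for the top length_fraction * L pairs."""
    values = []
    for protein in proteins:
        # Evaluate at least one pair, even for very short proteins.
        n_top = max(1, int(length_fraction * protein["length"]))
        values.append(
            precision_at_top(protein["scores"], protein["is_contact"], n_top)
        )
    return sum(values) / len(values)
```

Evaluating `mean_precision` for a range of `length_fraction` values yields one curve per model of the kind plotted in the figures.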

Figure F.4: Top ten features for Random Forest trained with additional contrastive divergence contact score feature. Features ranked according to Gini importance. Features are the same as in Figure F.2 plus the following additional feature: contrastive divergence: APC corrected Frobenius norm of couplings computed with contrastive divergence. Features are described in detail in methods section 4.6.1.
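Gini importance is commonly defined as the total decrease in Gini impurity contributed by all tree nodes that split on a feature, weighted by the fraction of samples reaching each node and averaged over the trees of the forest. The sketch below shows only the per-split building block; the function names are chosen for this example and do not refer to any particular library.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity 1 - sum_k p_k^2 of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def impurity_decrease(parent, left, right):
    """Decrease in Gini impurity when `parent` is split into `left` and `right`."""
    n = len(parent)
    weighted_children = (
        len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)
    )
    return gini_impurity(parent) - weighted_children
```

A split that separates contacts from non-contacts perfectly yields the largest possible decrease, whereas a split that leaves the class proportions unchanged yields zero.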

Figure F.5: Top ten features for Random Forest trained with additional pseudo-likelihood and contrastive divergence contact score features. Features ranked according to Gini importance. Features are the same as in Figure F.2 plus the following additional features: contrastive divergence: APC corrected Frobenius norm of couplings computed with contrastive divergence. Diversity (sqrt(N)/L): diversity of the alignment. Features are described in detail in methods section 4.6.1.

Figure F.6: Mean precision over validation set of 200 proteins for top ranked contact predictions for different choices of window size for single position features. Dashed lines represent the five models, each trained on four of the five subsets of the training data according to the 5-fold cross-validation scheme. Solid lines represent the mean over the five cross-validation models.
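The relation between dashed and solid lines can be sketched as follows: each of the five models is trained on four of the five training subsets, and the solid line is the point-wise mean of the five resulting precision curves. The helper names and the simple interleaved fold assignment are assumptions of this example, not a description of how the folds were actually drawn.

```python
def cross_validation_splits(items, n_folds=5):
    """Yield (training, held_out) lists; training joins all folds but one."""
    # Interleaved assignment of items to folds (illustrative choice).
    folds = [items[k::n_folds] for k in range(n_folds)]
    for held_out in range(n_folds):
        training = [
            item
            for k, fold in enumerate(folds)
            if k != held_out
            for item in fold
        ]
        yield training, folds[held_out]


def mean_curve(curves):
    """Point-wise mean of several precision curves of equal length."""
    return [sum(values) / len(values) for values in zip(*curves)]
```

In this setting each of the five models would be evaluated on the separate validation set of 200 proteins, giving five curves that are then passed to `mean_curve`.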

Figure F.7: Mean precision over validation set of 200 proteins for top ranked contact predictions for different choices of the non-contact threshold. Dashed lines represent the five models, each trained on four of the five subsets of the training data according to the 5-fold cross-validation scheme. Solid lines represent the mean over the five cross-validation models.

Figure F.8: Mean precision over validation set of 200 proteins for top ranked contact predictions for different choices of dataset composition with respect to the ratio of contacts and non-contacts. Dashed lines represent the five models, each trained on four of the five subsets of the training data according to the 5-fold cross-validation scheme. Solid lines represent the mean over the five cross-validation models.
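One simple way to obtain a training set with a given ratio of non-contacts to contacts is to keep all contacts and randomly subsample the non-contacts. The sketch below illustrates this idea under that assumption; the function name, the `ratio` parameter and the fixed seed are hypothetical and are not taken from the procedure used to produce the figure.

```python
import random


def subsample_non_contacts(contacts, non_contacts, ratio, seed=0):
    """Return all contacts plus about ratio * len(contacts) sampled non-contacts.

    ratio: desired number of non-contacts per contact (assumed parameter).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_keep = min(len(non_contacts), int(ratio * len(contacts)))
    return contacts, rng.sample(non_contacts, n_keep)
```

If fewer non-contacts are available than requested, all of them are kept and the resulting ratio is lower than the requested one.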

References

222. Fodor, A.A., and Aldrich, R.W. (2004). Influence of conservation on calculations of amino acid covariance in multiple sequence alignments. Proteins 56, 211–221. doi: 10.1002/prot.20098.

223. Miyazawa, S., and Jernigan, R.L. (1999). Self-consistent estimation of inter-residue protein contact energies based on an equilibrium mixture approximation of residues. Proteins 34, 49–68.

69. Li, Y., Fang, Y., and Fang, J. (2011). Predicting residue-residue contacts using random forest models. Bioinformatics 27. doi: 10.1093/bioinformatics/btr579.

224. Petersen, B., Petersen, T.N., Andersen, P., Nielsen, M., and Lundegaard, C. (2009). A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct. Biol. 9. doi: 10.1186/1472-6807-9-51.

225. Jones, D.T. (1999). Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol. 292, 195–202. doi: 10.1006/jmbi.1999.3091.