4.4 Using Contact Scores as Additional Features

Figure 4.7: Mean precision for top ranked contacts on a test set of 1000 proteins. random forest (pLL, CD) = random forest model trained on sequence features and the pseudo-likelihood and contrastive divergence contact scores. random forest (pLL) = random forest model trained on sequence features and the pseudo-likelihood contact score. random forest (CD) = random forest model trained on sequence features and the contrastive divergence contact score. contrastive divergence = APC corrected Frobenius norm of couplings computed with contrastive divergence. pseudo-likelihood = APC corrected Frobenius norm of couplings computed with pseudo-likelihood. random forest = random forest model trained on 75 sequence-derived features.

Figure F.1 shows that the random forest predictor improves over the pseudo-likelihood coevolution method when the alignment contains only a few sequences. To assess this improvement more directly, a combined random forest predictor can be built that is trained not only on the sequence-derived features but also on the pseudo-likelihood contact score as an additional feature. As expected, the pseudo-likelihood score is the most important feature in the model (see Appendix Figure F.2), followed by the same sequence features that were found in the previous analysis in Figure 4.4. The model trained on the 76 most relevant features performs as well as the model trained on the full feature set and was used in the benchmark shown in Figure 4.7. Combining simple sequence features with the pseudo-likelihood contact score indeed improves the predictive power of the random forest model over both individual approaches. The improvement is substantial for small alignments (about 12%), as can be seen in the left plot in Figure 4.8. In contrast, the improvement on large alignments (right plot in Figure 4.8) is smaller (about 5%), because the gain from simple sequence features is negligible compared to the much more powerful coevolution signal.
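The idea of adding a contact score as one extra feature column can be illustrated with a minimal sketch. It assumes scikit-learn and NumPy and uses synthetic stand-in data; the array sizes, the variable names (`X_seq`, `pll_score`, `is_contact`) and the hyperparameters are hypothetical and are not the settings used for the models in this chapter.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical sizes; the models in the text use 75 sequence-derived features.
n_pairs, n_seq_features = 600, 5

# Synthetic stand-ins: uninformative "sequence" features and binary contact labels.
X_seq = rng.normal(size=(n_pairs, n_seq_features))
is_contact = rng.integers(0, 2, size=n_pairs)

# Synthetic stand-in for a coevolution contact score: noisy but informative.
pll_score = is_contact + rng.normal(scale=0.5, size=n_pairs)

# Append the contact score to the sequence features as one additional column.
X_combined = np.column_stack([X_seq, pll_score])
feature_names = [f"seq_{k}" for k in range(n_seq_features)] + ["pll_score"]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_combined, is_contact)

# Rank features by impurity-based importance, most important first.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
top_feature = ranked[0][0]
```

In this toy setting the appended score dominates the importance ranking by construction, which mirrors the qualitative observation above but says nothing about the actual magnitudes reported in Appendix Figure F.2.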

Similarly, the contact scores derived from couplings computed with CD in chapter 3 can be added as a feature, either instead of or in addition to the pseudo-likelihood contact score. Again, the contrastive divergence and the pseudo-likelihood contact scores are the most important features in the respective models (see Appendix Figures F.4 and F.5). The three models trained on additional coevolution features perform comparably (see Figure 4.7), and apparently little information is gained by adding both coevolution contact scores. Since it has been shown in section 3.5 that pseudo-likelihood and contrastive divergence contact scores are highly correlated, resulting in very similar rankings for residue pairs, it is not surprising that the random forest model including both coevolution scores does not improve over the random forest models including only one of the two scores.
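One way to quantify how similarly two contact scores rank the same residue pairs is a rank correlation. The following sketch computes Spearman's rank correlation in plain Python; choosing Spearman's coefficient is an assumption made here for illustration and is not necessarily the measure used in section 3.5. The score values are made up.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for four residue pairs; both scores order the pairs identically.
pll_scores = [0.9, 0.1, 0.5, 0.3]
cd_scores = [0.8, 0.2, 0.6, 0.25]
rho = spearman(pll_scores, cd_scores)
```

A coefficient close to 1 means the two scores order the residue pairs almost identically, in which case a second score adds little that a tree-based model could exploit.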

Figure 4.8: Mean precision for top ranked contacts on a test set of 1000 proteins split into four equally sized subsets with respect to Neff. Subsets are defined according to quantiles of Neff values. Left: Subset of proteins with Neff < Q1. Right: Subset of proteins with Q3 <= Neff < Q4. Methods are the same as in Figure 4.7.
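The evaluation described in the caption above, splitting proteins into Neff quartiles and measuring precision among the top ranked contacts, can be sketched in plain Python. The function names, the quantile method (the default of `statistics.quantiles`) and the toy values are assumptions for illustration; in this sketch the top subset includes the protein with the maximum Neff.

```python
from statistics import quantiles


def precision_at_top(scores, labels, n_top):
    """Fraction of true contacts among the n_top highest-scoring residue pairs."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top = ranked[:n_top]
    return sum(label for _, label in top) / len(top)


def split_by_neff_quartiles(neff_by_protein):
    """Assign each protein to one of four subsets using the Neff quartiles."""
    q1, q2, q3 = quantiles(list(neff_by_protein.values()), n=4)
    subsets = {"Q1": [], "Q2": [], "Q3": [], "Q4": []}
    for name, neff in neff_by_protein.items():
        if neff < q1:
            subsets["Q1"].append(name)
        elif neff < q2:
            subsets["Q2"].append(name)
        elif neff < q3:
            subsets["Q3"].append(name)
        else:
            subsets["Q4"].append(name)
    return subsets


# Toy Neff values for eight hypothetical proteins.
neff_by_protein = {f"prot{i}": float(i) for i in range(1, 9)}
subsets = split_by_neff_quartiles(neff_by_protein)

# Toy scores and contact labels (1 = contact) for one protein.
example_precision = precision_at_top([0.9, 0.8, 0.1, 0.4], [1, 0, 1, 1], n_top=2)
```

Averaging `precision_at_top` over all proteins in a subset gives the mean precision for that subset at a given number of top ranked contacts.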