4.4 Using Contact Scores as Additional Features
Appendix Figure F.1 shows that the random forest predictor improves over the pseudo-likelihood coevolution method when the alignment contains only a few sequences. To assess this improvement more directly, a combined random forest predictor can be trained not only on the sequence-derived features but also on the pseudo-likelihood contact score as an additional feature. As expected, the pseudo-likelihood score is the most important feature in the model (see Appendix Figure F.2), followed by the same sequence features that were identified in the previous analysis in Figure 4.4. The model trained on the 76 most relevant features performs as well as the model trained on the full feature set and was used in the benchmark shown in Figure 4.7. Combining simple sequence features with the coevolution pseudo-likelihood contact score indeed improves the predictive power of the random forest model over both individual approaches. The improvement is substantial for small alignments (about 12%), as can be seen in the left plot in Figure 4.8. In contrast, the improvement on large alignments (right plot in Figure 4.8) is smaller (about 5%), because the gain from simple sequence features is negligible compared to the much more powerful coevolution signal.
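The procedure described in this paragraph (appending a coevolution contact score to the sequence features, ranking features by importance, and retraining on the most relevant ones) can be illustrated with a minimal sketch. It assumes scikit-learn and its impurity-based feature importance; the thesis may use a different implementation or importance measure. The data are synthetic stand-ins, and all names (`X_seq`, `pll_score`, `k`) are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data, one row per residue pair (NOT the thesis data).
n_pairs, n_seq_features = 2000, 10
y = rng.integers(0, 2, size=n_pairs)                 # 1 = contact, 0 = non-contact
X_seq = rng.normal(size=(n_pairs, n_seq_features))   # sequence-derived features
X_seq[:, 0] += 0.5 * y                               # one weakly informative sequence feature
pll_score = 2.0 * y + rng.normal(size=n_pairs)       # strongly informative coevolution score

# Combined feature matrix: sequence features plus the contact score as an extra column.
X = np.column_stack([X_seq, pll_score])
names = [f"seq_{i}" for i in range(n_seq_features)] + ["pll_score"]

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Rank features by impurity-based importance, most important first.
order = np.argsort(forest.feature_importances_)[::-1]
ranked = [names[i] for i in order]

# Retrain on the k most important features only (k is an arbitrary toy value here).
k = 3
X_top = X[:, order[:k]]
forest_top = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_top, y)
```

In this toy setting the coevolution score dominates the importance ranking by construction; comparing `forest` and `forest_top` on held-out residue pairs would be the analogue of checking that the reduced model performs as well as the full one.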
Similarly, the contact scores derived from couplings computed with contrastive divergence (CD) in chapter 3 can be added as a feature, either instead of or in addition to the pseudo-likelihood contact score. Again, the contrastive divergence and pseudo-likelihood contact scores are the most important features in the respective models (see Appendix Figures F.4 and F.5). The three models trained with additional coevolution features perform comparably (see Figure 4.7), and there is apparently little information gain from adding both coevolution contact scores. Since section 3.5 showed that pseudo-likelihood and contrastive divergence contact scores are highly correlated, resulting in very similar rankings of residue pairs, it is not surprising that the random forest model including both coevolution scores does not improve over the models including only one of the two scores.
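The redundancy argument can be made concrete by measuring how similarly two scores rank the same residue pairs. The sketch below uses the Spearman rank correlation as one possible measure; the source does not state which correlation measure was used in section 3.5, so this choice is an assumption. The score values are toy numbers, not thesis data, and the function names are hypothetical.

```python
def ranks(values):
    """Return 1-based ranks; tied values receive the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            result[order[pos]] = average_rank
        i = j + 1
    return result


def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


# Toy contact scores for six residue pairs (illustrative values only).
pll_scores = [0.91, 0.15, 0.40, 0.77, 0.05, 0.33]
cd_scores = [0.88, 0.20, 0.35, 0.80, 0.02, 0.41]

rho = spearman(pll_scores, cd_scores)
```

Here the two toy rankings differ only by one swap of adjacent pairs, so `rho` is close to 1. When two features order the residue pairs almost identically, a tree-based model gains little from the second feature because it supports nearly the same splits as the first.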