5.4 Evaluating the Bayesian Models for Contact Prediction
The posterior distribution for \(c_{ij}\) can be computed by marginalizing over all other contact states, which are summarized in the vector \(\c_{\backslash ij}\):
\[\begin{eqnarray} p(\cij | \X, \phi) &=& \int d \c_{\backslash ij} \, p(\c |\X, \phi) \nonumber\\ &\propto & \int d \c_{\backslash ij} \, p(\X|\c) \, p(\c | \phi) \nonumber\\ &\propto & \int d \c_{\backslash ij} \prod_{i'<j'} \sum_{k=0}^K g_{k}(c_{i'j'}) \, \frac{\Gauss( \mathbf{0} | \muk, \Lk^{-1})}{\Gauss(\mathbf{0} | \boldsymbol{\mu}_{i'j'k}, \boldsymbol{\Lambda}_{i'j'k}^{-1})} \, \prod_{i'<j'} p(c_{i'j'} |\phi_{i'j'}) \,, \end{eqnarray}\]where \(p(\c | \phi)\) represents a prior on contacts that is implemented by the random forest classifier trained on sequence-derived features, \(\phi\), as described in chapter 4. By pulling the term that depends only on the contact state \(\cij\) out of the integral over \(\c_{\backslash ij}\), one obtains the posterior distribution for \(c_{ij}\),
\[\begin{eqnarray} p(\cij | \X, \phi) & \propto & p(\cij |\phi_{ij}) \, \sum_{k=0}^K g_{k}(\cij) \, \frac{\Gauss( \mathbf{0} | \muk, \Lk^{-1})}{\Gauss(\mathbf{0} | \muijk, \Lijk^{-1})} \nonumber\\ & \times & \prod_{i'<j', (i',j') \ne (i,j)} \int d c_{i'j'} \, p(c_{i'j'} |\phi_{i'j'}) \, \sum_{k=0}^K g_{k}(c_{i'j'}) \, \frac{\Gauss( \mathbf{0} | \muk, \Lk^{-1})}{\Gauss(\mathbf{0} | \boldsymbol{\mu}_{i'j'k}, \boldsymbol{\Lambda}_{i'j'k}^{-1})} \; . \end{eqnarray}\]Since the second factor, involving the integrals over \(c_{i'j'}\), is constant with respect to \(\cij\), it can be absorbed into the proportionality constant, which yields
\[\begin{equation} p(\cij | \X, \phi) \propto p(\cij |\phi_{ij}) \, \sum_{k=0}^K g_{k}(\cij) \, \frac{\Gauss( \mathbf{0} | \muk, \Lk^{-1})}{\Gauss(\mathbf{0} | \muijk, \Lijk^{-1})} \, . \tag{5.5} \end{equation}\]A predicted contact map is obtained by using the posterior probability estimate for a contact, \(p(\cij \eq 1| \X, \phi)\), as the matrix entry for residue pair \((i,j)\).
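To make the computation behind equation (5.5) concrete, the following is a minimal sketch in Python. It assumes diagonal precision matrices, precomputed pair-specific parameters \(\muijk\) and \(\Lijk\) from the Gaussian approximation (see methods section 5.7.2), and a random forest prior evaluated for both contact classes; all function and variable names are illustrative, not the actual implementation.

```python
import numpy as np

def log_gauss_diag(x, mu, lam):
    """Log density of N(x | mu, diag(lam)^-1) with diagonal precision lam."""
    return 0.5 * (np.sum(np.log(lam)) - x.size * np.log(2.0 * np.pi)
                  - np.sum(lam * (x - mu) ** 2))

def posterior_contact_prob(prior_ij, g, mu, lam, mu_ij, lam_ij):
    """Posterior p(c_ij = 1 | X, phi) according to equation (5.5).

    prior_ij      : shape (2,), random forest prior p(c_ij | phi_ij) for c = 0, 1
    g             : shape (K+1, 2), mixture weights g_k(c) per contact class
    mu, lam       : shape (K+1, D), component means and diagonal precisions
    mu_ij, lam_ij : shape (K+1, D), pair-specific parameters of the
                    Gaussian approximation for residue pair (i, j)
    """
    zero = np.zeros(mu.shape[1])
    # per-component log ratio N(0 | mu_k, L_k^-1) / N(0 | mu_ijk, L_ijk^-1)
    log_ratio = np.array([log_gauss_diag(zero, mu[k], lam[k])
                          - log_gauss_diag(zero, mu_ij[k], lam_ij[k])
                          for k in range(mu.shape[0])])
    # shift for numerical stability; the constant cancels in the normalization
    ratio = np.exp(log_ratio - log_ratio.max())
    mix = ratio @ g                      # shape (2,): sum_k g_k(c) * ratio_k
    unnorm = prior_ij * mix              # equation (5.5), up to a constant
    return unnorm[1] / unnorm.sum()
```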
In the following, I assess the performance of the Bayesian models whose hyperparameters were learned using couplings from pseudo-likelihood maximization. Performance is evaluated as the precision of the top-ranked contact predictions, where the ranking now follows the posterior probability estimates for contacts.
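As a reference for how this evaluation could be implemented, here is a short, hypothetical helper; the only change compared to earlier benchmarks is that the scores passed in are posterior probabilities rather than heuristic coupling scores.

```python
import numpy as np

def precision_top_ranked(scores, true_contacts, n_top):
    """Precision over the n_top highest-scoring residue pairs.

    scores        : posterior probabilities p(c_ij = 1 | X, phi), one per pair
    true_contacts : binary labels from the structure, 1 if the pair is a contact
    n_top         : number of top-ranked pairs, e.g. L/5 for a protein of length L
    """
    top = np.argsort(scores)[::-1][:n_top]
    return float(true_contacts[top].mean())
```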
Figure 5.8 shows a benchmark for the Bayesian models using a three-component Gaussian mixture model for the coupling prior, with hyperparameters trained on datasets of different sizes (100,000, 300,000 and 500,000 residue pairs per contact class). The analysis of the Gaussian mixture models in the previous sections revealed that the statistics and the resulting distributions are coherent regardless of dataset size. Indeed, the precision over the top-ranked predictions is almost indistinguishable between the models learned on the different dataset sizes. The Gaussian mixture model with three components has 2004 parameters (see methods section 5.7.11.2), and it is reasonable to learn this many parameters from a dataset of 2 × 100,000 residue pairs, even considering the unknown uncertainty of the couplings to be modelled.
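The figure of 2004 parameters is consistent with the following rough accounting, which assumes 400-dimensional couplings, diagonal precision matrices, and a zero component whose mean is fixed at \(\mathbf{0}\) (the exact parametrization is given in methods section 5.7.11.2):

\[
\underbrace{2 \cdot (400 + 400)}_{\text{means, precisions of components 1, 2}} \;+\; \underbrace{400}_{\text{precisions of component 0}} \;+\; \underbrace{2 \cdot (3 - 1)}_{\text{free weights } g_k(c)} \;=\; 2004 \,.
\]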
Because the posterior probability of a contact utilizes additional information from the contact prior in the form of the random forest classifier (see chapter 4), it would not be fair to compare the posterior probabilities directly to the pseudo-likelihood-derived contact scores. Instead, the predictions of the Bayesian model can be compared to those of the random forest model that was additionally trained on the pseudo-likelihood-derived contact scores (see section 4.4). As can be seen in Figure 5.8, the Bayesian model predicts contacts more accurately than the heuristic contact score obtained from pseudo-likelihood couplings, but less accurately than the random forest model trained on sequence features and the pseudo-likelihood contact scores.
The likelihood function of contacts was optimized with respect to the coupling prior hyperparameters using equal numbers of residue pairs that are in physical contact and that are not. Residue pairs not in physical contact were defined as pairs with a \(\Cb\) distance greater than \(25 \angstrom\). Choosing a different non-contact threshold, \(\Cb\) distance \(>8 \angstrom\), has a negligible impact on performance, with the \(25 \angstrom\) cutoff giving slightly better results (see Appendix Figure G.14). Furthermore, I checked whether a different ratio of contacts to non-contacts affects performance. Appendix Figure G.14 also shows that choosing five times as many non-contacts as contacts gives slightly lower precision and has the disadvantage of longer runtimes.
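As an illustration of how the two training classes could be assembled under these definitions, consider the following sketch; the \(8 \angstrom\) contact threshold and the sampling strategy are assumptions made for the example, not a description of the actual pipeline.

```python
import numpy as np

def label_pairs(cb_dist, contact_thr=8.0, noncontact_thr=25.0):
    """Assign contact classes from pairwise C-beta distances (Angstrom).

    Pairs below contact_thr count as contacts, pairs above noncontact_thr
    as non-contacts; pairs in between are excluded from training.
    """
    labels = np.full(cb_dist.shape, -1, dtype=int)   # -1: excluded
    labels[cb_dist < contact_thr] = 1
    labels[cb_dist > noncontact_thr] = 0
    return labels

def balanced_sample(labels, n_per_class, seed=0):
    """Draw equal numbers of contacts and non-contacts for training."""
    rng = np.random.default_rng(seed)
    contacts = np.flatnonzero(labels == 1)
    noncontacts = np.flatnonzero(labels == 0)
    return (rng.choice(contacts, n_per_class, replace=False),
            rng.choice(noncontacts, n_per_class, replace=False))
```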
Figure 5.9 compares the performance of Bayesian models with Gaussian mixtures having different numbers of components, each trained on 100,000 residue pairs per contact class. The Bayesian model with a five-component Gaussian mixture shows minor improvements over the model with a three-component Gaussian mixture. Surprisingly, the Bayesian model with the ten-component Gaussian mixture performs slightly worse than the other two models. This is unexpected, because the analysis in the last section indicated that both the five- and the ten-component Gaussian mixture models are able to precisely model the empirical coupling distributions. However, as pointed out before, training of the hyperparameters did not converge within several thousand iterations, and further training might be necessary for the five- and ten-component Gaussian mixture models.
The trends described for the Bayesian models based on pseudo-likelihood couplings also hold for the Bayesian models based on contrastive divergence couplings. In detail, the Bayesian models based on contrastive divergence couplings perform equally well regardless of the size of the training set (see Appendix Figure G.15), the choice of the non-contact threshold, or the number of Gaussian components (see Appendix Figure G.16). Rather surprising is the finding that the Bayesian models based on contrastive divergence couplings perform worse than those based on pseudo-likelihood couplings (see Figure 5.10). In fact, they even have less predictive power than the heuristic pseudo-likelihood contact score, even though they incorporate prior information. This finding is unexpected, given that a crucial approximation within the Bayesian framework employs the Hessian of the full likelihood (see methods section 5.7.2) and not of the pseudo-likelihood. One would therefore assume the approximation to be more accurate for couplings obtained by maximizing the full likelihood with contrastive divergence. But apparently, the approximation works very well for pseudo-likelihood couplings.
It is interesting to note that the Bayesian models mainly perform worse for proteins in the second Neff quartile, which comprises Neff values in the range \(680 \le \text{N}_{\text{eff}} < 2350\) (see Appendix Figure G.17). This finding applies to all Bayesian models, regardless of the method used to obtain the MAP estimate of the couplings or the number of Gaussian components used to model the coupling prior. A thorough inspection of proteins with Neff values within this particular range did not reveal any further insights.
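For completeness, a stratified evaluation of this kind can be sketched as follows; the quartile boundaries are recomputed from the benchmark set, so the values 680 and 2350 quoted above would emerge from the data rather than being hard-coded.

```python
import numpy as np

def neff_quartile_bins(neff):
    """Assign each protein to one of four Neff quartiles (bins 0-3)."""
    q1, q2, q3 = np.percentile(neff, [25, 50, 75])
    return np.digitize(neff, [q1, q2, q3])

def mean_precision_per_bin(bins, precision):
    """Mean precision of the top-ranked predictions per Neff quartile."""
    return {b: float(precision[bins == b].mean()) for b in range(4)}
```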