5.4 Evaluating the Bayesian Models for Contact Prediction
The posterior distribution for \(c_{ij}\) can be computed by marginalizing over all other contact states, which are summarized in the vector \(\c_{\backslash ij}\):
\[\begin{eqnarray} p(\cij | \X, \phi) &=& \int d \c_{\backslash ij} \, p(\c |\X, \phi) \nonumber\\ &\propto & \int d \c_{\backslash ij} \, p(\X|\c) \, p(\c | \phi) \nonumber\\ &\propto & \int d \c_{\backslash ij} \prod_{i'<j'} \sum_{k=0}^K g_{k}(c_{i'j'}) \, \frac{\Gauss( \mathbf{0} | \muk, \Lk^{-1})}{\Gauss(\mathbf{0} | \boldsymbol{\mu}_{i'j'k}, \boldsymbol{\Lambda}_{i'j'k}^{-1})} \, \prod_{i'<j'} p(c_{i'j'} |\phi_{i'j'}) \,, \end{eqnarray}\]where \(p(\c | \phi)\) represents a prior on contacts that is implemented by the random forest classifier trained on sequence-derived features, \(\phi\), as described in chapter 4. By pulling the term that depends only on the contact state \(\cij\) out of the integral over \(\c_{\backslash ij}\), one obtains the posterior distribution for \(c_{ij}\),
\[\begin{eqnarray} p(\cij | \X, \phi) & \propto & p(\cij |\phi_{ij}) \, \sum_{k=0}^K g_{k}(\cij) \, \frac{\Gauss( \mathbf{0} | \muk, \Lk^{-1})}{\Gauss(\mathbf{0} | \muijk, \Lijk^{-1})} \nonumber\\ & \times & \prod_{i'<j', (i',j') \ne (i,j)} \int d c_{i'j'} \, p(c_{i'j'} |\phi_{i'j'}) \, \sum_{k=0}^K g_{k}(c_{i'j'}) \, \frac{\Gauss( \mathbf{0} | \muk, \Lk^{-1})}{\Gauss(\mathbf{0} | \boldsymbol{\mu}_{i'j'k}, \boldsymbol{\Lambda}_{i'j'k}^{-1})} \; . \end{eqnarray}\]Since the second factor, involving the integrals over \(c_{i'j'}\), is constant with respect to \(\cij\), it can be absorbed into the proportionality constant, which yields
\[\begin{equation} p(\cij | \X, \phi) \propto p(\cij |\phi_{ij}) \, \sum_{k=0}^K g_{k}(\cij) \, \frac{\Gauss( \mathbf{0} | \muk, \Lk^{-1})}{\Gauss(\mathbf{0} | \muijk, \Lijk^{-1})} \, . \tag{5.5} \end{equation}\]A predicted contact map is obtained by using the posterior probability estimate for a contact, \(p(\cij \eq 1| \X, \phi)\), as the matrix entry for residue pair \((i,j)\).
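To make the computation behind equation (5.5) concrete, the following is a minimal sketch in Python. It assumes diagonal precision matrices, precomputed pair-specific parameters \(\muijk\) and \(\Lijk\) from the Gaussian approximation (see methods section 5.7.2), and a random forest prior evaluated for both contact classes; all function and variable names are illustrative, not the actual implementation.

```python
import numpy as np

def log_gauss_diag(x, mu, lam):
    """Log density of N(x | mu, diag(lam)^-1) with diagonal precision lam."""
    return 0.5 * (np.sum(np.log(lam)) - x.size * np.log(2.0 * np.pi)
                  - np.sum(lam * (x - mu) ** 2))

def posterior_contact_prob(prior_ij, g, mu, lam, mu_ij, lam_ij):
    """Posterior p(c_ij = 1 | X, phi) according to equation (5.5).

    prior_ij      : shape (2,), random forest prior p(c_ij | phi_ij) for c = 0, 1
    g             : shape (K+1, 2), mixture weights g_k(c) per contact class
    mu, lam       : shape (K+1, D), component means and diagonal precisions
    mu_ij, lam_ij : shape (K+1, D), pair-specific parameters of the
                    Gaussian approximation for residue pair (i, j)
    """
    zero = np.zeros(mu.shape[1])
    # per-component log ratio N(0 | mu_k, L_k^-1) / N(0 | mu_ijk, L_ijk^-1)
    log_ratio = np.array([log_gauss_diag(zero, mu[k], lam[k])
                          - log_gauss_diag(zero, mu_ij[k], lam_ij[k])
                          for k in range(mu.shape[0])])
    # shift for numerical stability; the constant cancels in the normalization
    ratio = np.exp(log_ratio - log_ratio.max())
    mix = ratio @ g                      # shape (2,): sum_k g_k(c) * ratio_k
    unnorm = prior_ij * mix              # equation (5.5), up to a constant
    return unnorm[1] / unnorm.sum()
```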
In the following, I assess the performance of the Bayesian models whose hyperparameters were learned using couplings from pseudo-likelihood maximization. Performance is evaluated as the precision of the top-ranked contact predictions, where the ranking now follows the posterior probability estimates for contacts.
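As a reference for how this evaluation could be implemented, here is a short, hypothetical helper; the only change compared to earlier benchmarks is that the scores passed in are posterior probabilities rather than heuristic coupling scores.

```python
import numpy as np

def precision_top_ranked(scores, true_contacts, n_top):
    """Precision over the n_top highest-scoring residue pairs.

    scores        : posterior probabilities p(c_ij = 1 | X, phi), one per pair
    true_contacts : binary labels from the structure, 1 if the pair is a contact
    n_top         : number of top-ranked pairs, e.g. L/5 for a protein of length L
    """
    top = np.argsort(scores)[::-1][:n_top]
    return float(true_contacts[top].mean())
```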
Figure 5.8 shows a benchmark for the Bayesian models using a three-component Gaussian mixture model for the coupling prior, with hyperparameters trained on datasets of different sizes (100,000, 300,000 and 500,000 residue pairs per contact class). The analysis of the Gaussian mixture models in the previous sections revealed that the statistics and the resulting distributions are coherent regardless of dataset size. Indeed, the precision over the top-ranked predictions is almost indistinguishable between the models learned on the different dataset sizes. The Gaussian mixture model with three components has 2004 parameters (see methods section 5.7.11.2), and it is reasonable to learn this many parameters from a dataset of 2 × 100,000 residue pairs, even considering the unknown uncertainty of the couplings to be modelled.
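The figure of 2004 parameters is consistent with the following rough accounting, which assumes 400-dimensional couplings, diagonal precision matrices, and a zero component whose mean is fixed at \(\mathbf{0}\) (the exact parametrization is given in methods section 5.7.11.2):

\[
\underbrace{2 \cdot (400 + 400)}_{\text{means, precisions of components 1, 2}} \;+\; \underbrace{400}_{\text{precisions of component 0}} \;+\; \underbrace{2 \cdot (3 - 1)}_{\text{free weights } g_k(c)} \;=\; 2004 \,.
\]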
Because the posterior probability of a contact utilizes additional information from the contact prior in the form of the random forest classifier (see chapter 4), it would not be fair to compare the posterior probabilities directly to the pseudo-likelihood-derived contact scores. Instead, the predictions of the Bayesian model can be compared to those of the random forest model that was additionally trained on the pseudo-likelihood-derived contact scores (see section 4.4). As can be seen in Figure 5.8, the Bayesian model predicts contacts more accurately than the heuristic contact score obtained from pseudo-likelihood couplings, but less accurately than the random forest model trained on sequence features and the pseudo-likelihood contact scores.
The likelihood function of contacts was optimized with respect to the coupling prior hyperparameters using equal numbers of residue pairs that are in physical contact and that are not. Residue pairs not in physical contact were defined as pairs with a \(\Cb\) distance greater than \(25 \angstrom\). Choosing a different non-contact threshold, \(\Cb\) distance \(>8 \angstrom\), has a negligible impact on performance, with the \(25 \angstrom\) cutoff giving slightly better results (see Appendix Figure G.14). Furthermore, I checked whether a different ratio of contacts to non-contacts affects performance. Appendix Figure G.14 also shows that choosing five times as many non-contacts as contacts gives slightly lower precision and has the disadvantage of longer runtimes.
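As an illustration of how the two training classes could be assembled under these definitions, consider the following sketch; the \(8 \angstrom\) contact threshold and the sampling strategy are assumptions made for the example, not a description of the actual pipeline.

```python
import numpy as np

def label_pairs(cb_dist, contact_thr=8.0, noncontact_thr=25.0):
    """Assign contact classes from pairwise C-beta distances (Angstrom).

    Pairs below contact_thr count as contacts, pairs above noncontact_thr
    as non-contacts; pairs in between are excluded from training.
    """
    labels = np.full(cb_dist.shape, -1, dtype=int)   # -1: excluded
    labels[cb_dist < contact_thr] = 1
    labels[cb_dist > noncontact_thr] = 0
    return labels

def balanced_sample(labels, n_per_class, seed=0):
    """Draw equal numbers of contacts and non-contacts for training."""
    rng = np.random.default_rng(seed)
    contacts = np.flatnonzero(labels == 1)
    noncontacts = np.flatnonzero(labels == 0)
    return (rng.choice(contacts, n_per_class, replace=False),
            rng.choice(noncontacts, n_per_class, replace=False))
```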
Figure 5.9 compares the performance of Bayesian models with Gaussian mixtures having different numbers of components, each trained on 100,000 residue pairs per contact class. The Bayesian model with a five-component Gaussian mixture shows minor improvements over the model with a three-component Gaussian mixture. Surprisingly, the Bayesian model with the ten-component Gaussian mixture performs slightly worse than the other two models. This is unexpected, because the analysis in the last section indicated that both the five- and the ten-component Gaussian mixture models are able to precisely model the empirical coupling distributions. However, as pointed out before, training of the hyperparameters did not converge within several thousand iterations, and further training might be necessary for the five- and ten-component Gaussian mixture models.
The trends described for the Bayesian models based on pseudo-likelihood couplings also hold for the Bayesian models based on contrastive divergence couplings. In detail, the Bayesian models based on contrastive divergence couplings perform equally well regardless of the size of the training set (see Appendix Figure G.15), the choice of the non-contact threshold, or the number of Gaussian components (see Appendix Figure G.16). Rather surprising is the finding that the Bayesian models based on contrastive divergence couplings perform worse than those based on pseudo-likelihood couplings (see Figure 5.10). In fact, they even have less predictive power than the heuristic pseudo-likelihood contact score, even though they incorporate prior information. This finding is unexpected, given that a crucial approximation within the Bayesian framework employs the Hessian of the full likelihood (see methods section 5.7.2) and not of the pseudo-likelihood. One would therefore assume the approximation to be more accurate for couplings obtained by maximizing the full likelihood with contrastive divergence. But apparently, the approximation works very well for pseudo-likelihood couplings.
It is interesting to note that the Bayesian models mainly perform worse for proteins in the second Neff quartile, which comprises Neff values in the range \(680 \le \text{N}_{\text{eff}} < 2350\) (see Appendix Figure G.17). This finding applies to all Bayesian models, regardless of the method used to obtain the MAP estimate of the couplings or the number of Gaussian components used to model the coupling prior. A thorough inspection of proteins with Neff values within this particular range did not reveal any further insights.
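For completeness, a stratified evaluation of this kind can be sketched as follows; the quartile boundaries are recomputed from the benchmark set, so the values 680 and 2350 quoted above would emerge from the data rather than being hard-coded.

```python
import numpy as np

def neff_quartile_bins(neff):
    """Assign each protein to one of four Neff quartiles (bins 0-3)."""
    q1, q2, q3 = np.percentile(neff, [25, 50, 75])
    return np.digitize(neff, [q1, q2, q3])

def mean_precision_per_bin(bins, precision):
    """Mean precision of the top-ranked predictions per Neff quartile."""
    return {b: float(precision[bins == b].mean()) for b in range(4)}
```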