6 Conclusion and Outlook

With the work presented here, I was the first to formulate DCA as a principled statistical approach, providing true probability estimates and promoting biological insights. The transparency of the modelling process and the flexibility of the Bayesian framework lay the foundation to further improvements of DCA for small protein families in a mechanistic way.

In chapter 2, I presented a thorough analysis of coupling matrices that are inferred from a multiple sequence alignment (MSA) and reflect the tendencies of amino acids to co-occur at paired positions in the MSA. I showed that coupling matrices contain valuable information with a meaningful biological interpretation. For example, the distributions of coupling values reflect biophysical interaction preferences between amino acids and the signal weakens with increasing residue-residue distances. Furthermore, interdependencies between different couplings are coherent and induce characteristic patterns in coupling matrices, often indicating the structural constraint for the residue pair. The majority of this information is lost by the the way current methods apply heuristics to compute a prediction for a residue-residue contact. However, in my Bayesian framework presented in chapter 5, this information is explicitelly modelled.

Chapter 3 presented an alternative approach to infer the Potts model parameters. Due to the complexity of the normalization constant it is infeasible to maximize the full likelihood to derive the model parameters. The most popular DCA approaches optimize the pseudo-likelihood instead but it is unknown how wellthe pseudo-likelihood solution approximates the full likelihood solution in case protein families have only few members. In my work, I optimized the full likelihood by using an approximate gradient provided by an algorithm called contrastive divergence, which is a novel method in contact prediction. I systematically tuned the stochastic gradient descent algorithm for the use with contrastive divergence and also examined various modifications to the estimation of the gradient. My approach achieved comparable precision as pseudo-likelihood methods with minor improvements for small protein families, which could be traced back to amplified signals between strongly conserved residue pairs.

A random forest classifier for contact prediction which was trained on sequence features is discussed in chapter 4. This model yields a robust estimator that outperforms coevolution methods for small protein families where they suffer from the low signal-to-noise ratio. In line with the most successful contact predictors, which exploit information from various sources and multiple DCA methods, I integrated the predictions of the pseudo-likelihood and the constrastive divergence method as additional features for training. The individual methods greatly contribute and improve the predictive performance of the random forest classifier.

The Bayesian framework proposed in chapter 5 represents a principled statistical approach that eradicates the use of heuristics by explicitely modelling the full information contained in the coupling signatures while at the same time accounting for the uncertainty of the data. Based on the observations and biological interpretations of coupling signals in chapter 2, the prior on couplings was modelled as a Gaussian mixture. The hyperparameters were trained on inferred couplings and structures from many proteins and the Gaussian mixture model convincingly reproduced empirical coupling distributions. Posterior probabilty estimates of residue-residue contacts are obtained by combining the likelihood of contacts with prior information in form of the random forest contact class probabilities. They posterior probabilties are less precise than the heuristic predictions obtained from the pseudo-likelihood approach combined with prior information. A possible explanation is that the Gaussian mixture model of the coupling prior does not yet capture enough information in order to efficiently discriminate between contacts and non-contacts. Eventhough reproducing the one- and two-dimensional empirical distributions, it is plausible that the precise interdependencies between couplings require a more complex model, e.g. by using more Gaussian components or full instead of diagonal covariance matrices. Furthermore, the approximation to the regularized likelihood of the sequences with a multivariate Gaussian can and perhaps must be iteratively improved by another round of training employing an improved regularization prior. Finally, the reason could be that certain inherent modelling assumptions are not met or are too inaccurate but work to verify these assumptions is stil ongoing.

Especially with the limited knowledge and uncertainty in the data, the Bayesian statistical approach developed here provides a solid theoretical and statistically sound formulation for DCA with enhanced explanatory power compared to the uninformative heuristics. Through the formulation in the language of Bayesian statistics, the framework naturally allows the integration of additional prior knowledge and it facilitates its further usage in even more complex Bayesian hierarchies. It is also straightforward to extend the model towards the estimation of posterior probabilities of residue-residue distances (see section 5.7.10). The analysis in chapter 2 has demonstrated that the coupling signal weakens with increasing inter-residue distances which represents additional information that awaits full utilization. The information gain of residue-residue distance estimates over binary contact prediction is substantial and is a promising way to greatly improve de novo structure prediction.