2.6 Methods
2.6.1 Dataset
A protein dataset has been constructed from the CATH (v4.1) [191] database for classification of protein domains. All CATH domains from classes 1(mainly \(\alpha\)), 2(mainly \(\beta\)), 3(\(\alpha+\beta\)) have been selected and filtered for internal redundancy at the sequence level using the pdbfilter
script from the HH-suite[182] with an E-value cutoff=0.1. The dataset has been split into ten subsets aiming at the best possible balance between CATH classes 1,2,3 in the subsets. All domains from a given CATH topology (=fold) go into the same subsets, so that any two subsets are non-redundant at the fold level. Some overrepresented folds (e.g. Rossman Fold) have been subsampled ensuring that in every subset each class contains at max 50% domains of the same fold. Consequently, a fold is not allowed to dominate a subset or even a class in a subset. In total there are 6741 domains in the dataset.
Multiple sequence alignments were built from the CATH domain sequences (COMBS) using HHblits [182] with parameters to maximize the detection of homologous sequences:
hhblits -maxfilt 100000 -realign_max 100000 -B 100000 -Z 100000 -n 5 -e 0.1 -all hhfilter -id 90 -neff 15 -qsc -30
The COMBS sequences are derived from the SEQRES records of the PDB file and sometimes contain extra residues that are not resolved in the structure. Therefore, residues in PDB files have been renumbered to match the COMBS sequences. The process of renumbering residues in PDB files yielded ambigious solutions for 293 proteins, that were removed from the dataset. Another filtering step was applied to remove 80 proteins that do not hold the following properties:
- more than 10 sequences in the multiple sequence alignment (\(N>10\))
- protein length between 30 and 600 residues (\(30 \leq L \leq 600\))
- less than 80% gaps in the multiple sequence alignment (percent gaps < 0.8)
- at least one residue-pair in contact at \(C_\beta < 8\angstrom\) and minimum sequence separation of 6 positions
The final dataset is comprised of 6368 proteins with almost evenly distributed CATH classes over the ten subsets (Figure 2.14).
2.6.2 Computing Pseudo-Likelihood Couplings
Dr Stefan Seemayer has reimplementated the open-source software CCMpred [100] in Python. CCMpred optimizes the regularized negative pseudo-log-likelihood using a conjugate gradients optimizer. Based on a fork of his private github repository I continued development and extended the software, which is now called CCMpredPy. It is available upon request at https://bitbucket.org/svorberg/ccmpred-new. All computations in this thesis are performed with CCMpredPy unless stated otherwise.
CCMpredPy differs from CCMpred [100] which is available at https://github.com/soedinglab/CCMpred in several details:
Initialization of potentials \(\v\) and \(\w\): - CCMpred initializes single potentials \(\v_i(a) = \log f_i(a) - \log f_i(a= "-")\) with \(f_i(a)\) being the frequency of amino acid a at position i and \(a="-"\) representing a gap. A single pseudo-count has been added before computing the frequencies. Pair potentials \(\w\) are intialized at 0. - CCMpredPy initializes single potentials \(\v\) with the ML estimate of single potentials (see section 3.7.4) using amino acid frequencies computed as described in section 2.6.4. Pair potentials \(\w\) are initialized at 0.
Regularization:
- CCMpred uses a Gaussian regularization prior centered at zero for both single and pair potentials. The regularization coefficient for single potentials \(\lambda_v = 0.01\) and for pair potentials \(\lambda_w = 0.2 * (L-1)\) with \(L\) being protein length.
- CCMpredPy uses a Gaussian regularization prior centered at zero for the pair potentials. For the single potentials the Gaussian regularization prior is centered at the ML estimate of single potentials (see section 3.7.4) using amino acid frequencies computed as described in section 2.6.4. The regularization coefficient for single potentials \(\lambda_v = 10\) and for pair potentials \(\lambda_w = 0.2 * (L-1)\) with \(L\) being protein length.
Default settings for CCMpredPy have been chosen to best reproduce CCMpred results. A benchmark over a subset of approximately 3000 proteins confirms that performance measured as PPV for both methods is almost identical (see Figure 2.15).
Pseudo-likelihood couplings used in this thesis have been computed with CCMPredPy using the following flags:
--maxit 250 # Compute a maximum of MAXIT operations
--center-v # Use a Gaussian prior for single potentials
# centered at ML estimate v*
--reg-l2-lambda-single 10 # regularization coefficient for
# single potentials
--reg-l2-lambda-pair-factor 0.2 # regularization coefficient for
# pairwise potentials computed as
# reg-l2-lambda-pair-factor * (L-1)
--pc-uniform # use uniform pseudocounts
# (1/21 for 20 amino acids + 1 gap state)
--pc-count 1 # defining pseudo count admixture coefficient
# rho = pc-count/( pc-count+ Neff)
--epsilon 1e-5 # convergence criterion for minimum decrease
# in the last K iterations
--ofn-pll # using pseudo-likelihood as objective function
--alg-cg # using conjugate gradient to optimize
# objective function
For the comparison of CCMpred and CCMPredPy in Figure 2.15, CCMpred was run with the following flags:
-n 250 # NUMITER: Compute a maximum of NUMITER operations
-l 0.2 # LFACTOR: Set pairwise regularization coefficients
# to LFACTOR * (L-1)
-w 0.8 # IDTHRES: Set sequence reweighting identity
# threshold to IDTHRES
-e 1e-5 # EPSILON: Set convergence criterion for minimum
# decrease in the last K iterations to EPSILON
2.6.3 Sequence Reweighting
As discussed in section 1.5.1, sequences in a MSA do not represent independent draws from a probabilistic model. To reduce the effects of redundant sequences, a popular sequence reweighting strategy has been found to improve contact prediction performance. Every sequence \(x_n\) of length \(L\) in an alignment with \(N\) sequences has an associated weight \(w_n = 1/m_n\), where \(m_n\) represents the number of similar sequences:
\[\begin{equation} w_n = \frac{1}{m_n}, m_n = \sum_{m=1}^N I \left( ID(x_n, x_m) \geq 0.8 \right) \\ ID(x_n, x_m)=\frac{1}{L} \sum_{i=1}^L I(x_n^i = x_m^i) \tag{2.1} \end{equation}\]An identity threshold of 0.8 has been used for all analyses in this thesis.
The number of effective sequences \(\mathbf{\neff}\) of an alignment is then the number of sequence clusters computed as:
2.6.4 Computing Amino Acid Frequencies
Single and pairwise amino acid frequencies are computed from amino acid counts of weighted sequences as described in the last section 2.6.3 and additional pseudocounts that are added to improve numerical stability.
Let \(a,b \in \{1,\ldots,20\}\) be amino acids and \(q_0(x_i=a), q_0(x_i=a,x_j=b)\) be the empirical single and pair frequencies without pseudocounts. The empirical single and pair frequencies with pseudocounts, \(q(x_i=a), q(x_i=a, x_j=b)\), are defined
\[\begin{align} q(x_i \eq a) :=& (1-\tau) \; q_0(x_i \eq a) + \tau \tilde{q}(x_i\eq a) \\ q(x_i \eq a, x_j \eq b) :=& (1-\tau)^2 \; [ q_0(x_i \eq a, x_j \eq b) - q_0(x_i \eq a) q_0(x_j \eq b) ] + \nonumber\\ & q(x_i \eq a) \; q(x_j \eq b) \tag{2.3} \end{align}\]with \(\tilde{q}(x_i \eq a) := f(a)\) being background amino acid frequencies and \(\tau \in [0,1]\) is a pseudocount admixture coefficient, which is a function of the diversity of the multiple sequence alignment:
\[\begin{equation} \tau = \frac{N_\mathrm{pc}}{(N_\mathrm{eff} + N_\mathrm{pc})} \tag{2.4} \end{equation}\]where \(N_{pc} > 0\).
The formula for \(q(x_i \eq a, x_j \eq b)\) in the second line in eq (2.3) was chosen such that for \(\tau \eq0\) we obtain \(q(x_i \eq a, x_j \eq b) = q_0(x_i \eq a, x_j \eq b)\), and furthermore \(q(x_i \eq a, x_j \eq b) = q(x_i \eq a) q(x_j \eq b)\) exactly if \(q_0(x_i \eq a, x_j \eq b) = q_0(x_i \eq a) q_0(x_j \eq b)\).
2.6.5 Regularization
CCMpredPy uses an L2-regularization per default that pushes the single and pairwise terms smoothly towards zero and is equivalent to the logarithm of a zero-centered Gaussian prior,
\[\begin{align} R(\v, \w) &= \log \left[ \mathcal{N}(\v | \v^*, \lambda_v^{-1} I) \mathcal{N}(\w | \w^*, \lambda_w^{-1} I) \right] \nonumber \\ &= -\frac{\lambda_v}{2} ||\v-\v^*||_2^2 - \frac{\lambda_w}{2} ||\w-w^*||_2^2 + \text{const.} \; , \tag{1.8} \end{align}\]where the regularization coefficients \(\lambda_v\) and \(\lambda_w\) determine the strength of regularization.
The regularization coefficient \(\lambda_w\) for couplings \(\w\) is defined with respect to protein length \(L\) owing to the fact that the number of possible contacts in a protein increases quadratically with \(L\) whereas the number of observed contacts only increases linearly as can be seen in Figure 2.16.
Most previous pseudo-likelihood approaches using L2-regularization for pseudo-likelihood optimization set \(\v^* \eq \w^* \eq \mathbf{0}\) [100–102]. A different choice for \(v^*\) is discussed in section 3.7.4 that is is used per default with CCMpredPy. The single potentials will not be optimized with CD but will be fixed at \(v^*\) given in eq. (3.16). Furthermore, CCMpredPy uses regularization coefficients \(\lambda_v \eq 10\) and \(\lambda_w \eq 0.2\cdot(L-1)\) for pseudo-likelihood optimization and the choice for \(\lambda_w\) used with CD is discussed in section 3.3.1.
2.6.6 Correlation of Couplings with Contact Class
Approximately 100000 residue pairs have been filtered for contacts and non-contacts respectively according to the following criteria:
- sequence separation of residue pairs \(\ge 10\)
- diversity (\(=\frac{\sqrt{N}}{L}\)) of alignment \(\ge 0.3\)
- number of non-gapped sequences \(\ge 1000\)
- \(\Cb\) distance threshold for contact: \(<8\angstrom\)
- \(\Cb\) distance threshold for noncontact: \(>25\angstrom\)
2.6.7 Coupling Distribution Plots
For one-dimensional coupling distribution plots the residue pairs and respective pseudo-log-likelihood coupling values \(\wijab\) have been selected as follows:
- sequence separation of residue pairs \(\ge 10\)
- percentage of gaps per column \(\le 30\%\)
- evidence for a coupling \(\wijab\) estimated from the alignment, \(N_{ij} \cdot q_i(a) \cdot q_j(b) \ge 100\) with:
- \(N_{ij}\): number of sequences with no gaps at positions \(i\) or \(j\)
- \(q_i(a)\), \(q_j(b)\): frequencies of amino acids \(a\) and \(b\) at positions \(i\) and \(j\), respectively (computed as described in section 2.6.4)
These criteria ensure that uninformative couplings are neglected, e.g. sequence neighbors albeit being contacts according to the \(\Cb\) contact definition cannot be assumed to express biological meaningful coupling patterns, or couplings for amino acid pairings that do not have enough statistical power due to insufficient counts in the alignment.
The same criteria have been applied for selecting couplings for the two-dimensional distribution plots with the difference that evidence for a single coupling term has to be \(N_{ij} \cdot q_i(a) \cdot q_j(b) > 80.\)
References
191. Sillitoe, I., Lewis, T.E., Cuff, A., Das, S., Ashford, P., Dawson, N.L., Furnham, N., Laskowski, R.A., Lee, D., and Lees, J.G. et al. (2015). CATH: comprehensive structural and functional annotations for genome sequences. Nucleic Acids Res. 43, D376–D381., doi: 10.1093/nar/gku947.
182. Remmert, M., Biegert, A., Hauser, A., and Söding, J. (2012). HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–5., doi: 10.1038/nmeth.1818.
100. Seemayer, S., Gruber, M., and Söding, J. (2014). CCMpred-fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics, btu500.
102. Kamisetty, H., Ovchinnikov, S., and Baker, D. (2013). Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc. Natl. Acad. Sci. U. S. A. 110, 15674–9., doi: 10.1073/pnas.1314045110.