3.4 Using ADAM to Optimize Contrastive Divergence
ADAM computes per-parameter adaptive learning rates and incorporates momentum. Its default hyperparameters have been found to work well in practice, so that little tuning is required (see methods section 3.7.5.1 for details) [196,211]. Nevertheless, I tested ADAM with different learning rates for the optimization with CD-1 on protein 1mkcA00 (142 sequences) and protein 1c75A00 (28078 sequences) and found that both proteins are sensitive to the choice of learning rate. In contrast to plain stochastic gradient descent, ADAM permits larger learning rates for proteins with large alignments, because the learning rate is adapted to the magnitude of the gradient for every parameter individually. For protein 1mkcA00, with Neff=96, a learning rate of 5e-3 quickly leads to convergence, whereas for protein 1c75A00, with Neff=16808, an even larger learning rate can be chosen to obtain quick convergence (see Appendix Figure E.16). Therefore, I again specified the learning rate as a function of Neff, \(\alpha_0 = 2\mathrm{e}{-3}\log(\text{N}_{\text{eff}})\), such that for small Neff, e.g. the 5th percentile of the distribution in the dataset (\(\approx 50\)), this yields \(\alpha_0 \approx 8\mathrm{e}{-3}\), and for large Neff, e.g. the 95th percentile (\(\approx 15000\)), it yields \(\alpha_0 \approx 2\mathrm{e}{-2}\).
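As a minimal sketch of this Neff-dependent initial learning rate (the function name is illustrative and not part of the actual implementation):

```python
import numpy as np

def adam_initial_learning_rate(neff, scale=2e-3):
    """Initial ADAM learning rate alpha_0 = 2e-3 * log(Neff)."""
    return scale * np.log(neff)

# Values corresponding to the 5th and 95th Neff percentiles of the dataset
print(adam_initial_learning_rate(50))     # ~8e-3
print(adam_initial_learning_rate(15000))  # ~2e-2
```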
It is interesting to note that the norm of the coupling parameters, \(||\w||_2\), converges towards different values depending on the choice of the learning rate (see Appendix Figure E.16). By default, ADAM uses a constant learning rate, because the algorithm by its nature performs a kind of step size annealing. However, popular implementations of ADAM in the Keras [212] and Lasagne [213] packages allow the use of an annealing schedule. I therefore tested ADAM with a sigmoidal learning rate annealing schedule, which had already given good results for SGD (see section 3.2.2). Indeed, as can be seen in Appendix Figure E.17, when ADAM is used with a sigmoidal decay of the learning rate, the L2-norm of the coupling parameters \(||\w||_2\) converges roughly towards the same value independently of the choice of the learning rate. For the following analysis I used ADAM with a learning rate defined as a function of Neff and a sigmoidal learning rate annealing schedule with decay rate \(\gamma = 5\mathrm{e}{-6}\).
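The exact functional form of the sigmoidal schedule is the one defined in section 3.2.2 and is not restated here; the sketch below only illustrates the idea, assuming a sigmoid-shaped decay of the form \(\alpha_t = \alpha_0 / (1 + \gamma t^2)\), which may differ from the schedule actually used.

```python
import numpy as np

def sigmoidal_learning_rate(alpha0, t, gamma=5e-6):
    """Sigmoid-shaped learning rate decay, alpha_t = alpha_0 / (1 + gamma * t^2).
    NOTE: the exact schedule is defined in section 3.2.2; this form is an assumption."""
    return alpha0 / (1.0 + gamma * t**2)

alpha0 = 2e-3 * np.log(16808)          # Neff-dependent initial learning rate (1c75A00)
for t in [0, 100, 1000, 5000]:
    print(t, sigmoidal_learning_rate(alpha0, t))
```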
I evaluated CD-1, CD-10 and persistent contrastive divergence. As before, I also evaluated a combination of both, in which PCD is switched on once the relative change of the norm of the coupling parameters, \(||\w||_2\), falls below a small threshold. Figure 3.18 shows the benchmark for training the various modified CD models with the ADAM optimizer. Overall, the predictive performance of CD and PCD did not improve by using the ADAM optimizer instead of the manually tuned stochastic gradient descent optimizer. It can therefore be concluded that adaptive learning rates and momentum do not provide an essential advantage for inferring Potts model parameters with CD and PCD.
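The switching rule can be sketched as follows; the threshold value and function names are illustrative and not the exact settings used in the benchmark:

```python
import numpy as np

def should_switch_to_pcd(w_norm_prev, w_norm_curr, eps=1e-8, threshold=1e-3):
    """Switch from CD to PCD once the relative change of ||w||_2 between two
    successive checks falls below a small threshold (threshold value illustrative)."""
    rel_change = abs(w_norm_curr - w_norm_prev) / (abs(w_norm_prev) + eps)
    return rel_change < threshold

# Inside the training loop (schematic):
# w_norm = np.sqrt(np.sum(w**2))
# if not use_pcd and should_switch_to_pcd(prev_w_norm, w_norm):
#     use_pcd = True   # keep the Markov chains persistent from now on
# prev_w_norm = w_norm
```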
The convergence analysis for the two example proteins 1ahoA00 and 1c75A00 reveals that optimization with ADAM converges towards similar offsets in the L2 norm of the coupling parameters as optimization with plain SGD. For 1ahoA00, with low Neff=229, the L2 norm of the parameters converges towards \(||\w||_2 \approx 21.6\) when using CD-1 and towards \(||\w||_2 \approx 24\) when using PCD or CD-k with k>1 (compare left plots in Figures 3.19 and 3.17). For protein 1c75A00, with high Neff=16808, ADAM seems to find distinct optima that are clearly separated, in contrast to plain SGD. With CD-1 the algorithm converges towards \(||\w||_2 \approx 120\), with CD-5 towards \(||\w||_2 \approx 130\) and with CD-10 towards \(||\w||_2 \approx 131\). With PCD, regardless of whether the PCD gradient estimate is used right from the start of optimization or only later, the algorithm converges towards \(||\w||_2 \approx 134\). ADAM thus establishes a clear trend: longer sampling, or sampling with persistent chains, results in larger parameter estimates.
3.4.1 A Potts-model-specific convergence criterion
For the Potts model there exists a necessary but not sufficient condition that is satisfied at the optimum, where the gradient is zero (derived in method section 3.7.4), and that is given by \(\sum_{a,b=1}^{20} \wijab = 0\). This condition is never violated as long as the parameters satisfy it at initialization and the same step size is used to update all parameters. To understand why, note that the 400 partial derivatives \(\frac{\partial L\!L(\v^*, \w)}{\partial \wijab}\) for a residue pair \((i,j)\) and for \(a,b \in \{1, \ldots, 20\}\) are not independent. The sum over the 400 pairwise amino acid counts at positions \(i\) and \(j\) is identical for the observed and the sampled alignment and amounts to \(\sum_{a,b=1}^{20} N_{ij}\, q(x_i = a, x_j = b) = N_{ij}\).
Considering a residue pair \((i,j)\): if amino acid pair \((a,b)\) has higher counts in the sampled alignment than in the observed input alignment, then this difference in counts must be compensated by other amino acid pairs \((c,d)\) having lower counts in the sampled alignment than in the observed alignment (see Figure 3.20). Therefore it holds that \(\sum_{a,b=1}^{20} \frac{\partial L\!L(\v^*, \w)}{\partial \wijab} = 0\). This symmetry carries over into the parameter updates as long as the same step size is used for all parameters. However, when using adaptive learning rates, e.g. with the ADAM optimizer, this symmetry is broken and the condition \(\sum_{a,b=1}^{20} \wijab = 0\) can be violated during the optimization process.
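As an illustration (not part of the original analysis), the following numpy sketch for a single residue pair shows that the 400 gradient entries, given by the difference between observed and sampled pairwise frequencies, sum to zero, that a gradient step with one shared step size preserves \(\sum_{a,b} \wijab = 0\), and that an ADAM-style per-parameter rescaling (schematically, dividing by the square root of a per-parameter second moment) does not. The frequencies used here are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pairwise amino acid frequencies for one residue pair (i, j): both the observed
# and the sampled 20x20 tables sum to one, so their difference sums to zero.
def random_pair_freqs(rng):
    q = rng.random((20, 20))
    return q / q.sum()

q_obs, q_sampled = random_pair_freqs(rng), random_pair_freqs(rng)
grad = q_obs - q_sampled             # gradient estimate w.r.t. the 20x20 couplings w_ij
print(np.isclose(grad.sum(), 0.0))   # True: the 400 partial derivatives sum to zero

w = np.zeros((20, 20))               # couplings initialised with sum zero

# Plain gradient step: one step size for all 400 parameters keeps the sum at zero.
w_sgd = w + 1e-3 * grad
print(np.isclose(w_sgd.sum(), 0.0))  # True

# ADAM-style update: each parameter is rescaled individually, breaking the zero sum.
v = grad**2                          # stand-in for ADAM's per-parameter second moment
w_adam = w + 1e-3 * grad / (np.sqrt(v) + 1e-8)
print(np.isclose(w_adam.sum(), 0.0)) # generally False
```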
For proteins 1ahoA00 and 1c75A00, Figure 3.21 shows the number of residue pairs for which this condition is violated according to \(|\sum_{a,b=1}^{20} \wijab| > 1\mathrm{e}{-2}\) during optimization with ADAM. For about half of the 2016 residue pairs in protein 1ahoA00 the condition is violated at the end of optimization; for protein 1c75A00 it is about 2300 out of 2485 residue pairs. Whereas this is not a problem when computing the contact score based on the Frobenius norm of the coupling matrix, it is problematic when utilizing the couplings in the Bayesian framework presented in section 5, which requires the condition \(\sum_{a,b=1}^{20} \wijab = 0\) to hold.
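As a sketch, assuming a coupling tensor w of shape \(L \times L \times 20 \times 20\) (function names are illustrative, not from the actual implementation), the violation count and the Frobenius norm contact score could be computed as follows:

```python
import numpy as np

def count_violations(w, tol=1e-2):
    """Count residue pairs (i < j) with |sum_ab w_ijab| > tol."""
    L = w.shape[0]
    sums = np.abs(w.sum(axis=(2, 3)))   # L x L matrix of sum_ab w_ijab
    iu = np.triu_indices(L, k=1)
    return int(np.sum(sums[iu] > tol))

def frobenius_contact_score(w):
    """Contact score per residue pair: Frobenius norm of the 20x20 coupling matrix."""
    return np.sqrt(np.sum(w**2, axis=(2, 3)))

# w: fitted coupling tensor of shape (L, L, 20, 20), e.g. from a CD/PCD run
# print(count_violations(w), "of", w.shape[0] * (w.shape[0] - 1) // 2, "pairs violate the condition")
```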
References
196. Ruder, S. (2017). An overview of gradient descent optimization algorithms. arXiv.
211. Kingma, D., and Ba, J. (2014). Adam: A Method for Stochastic Optimization. arXiv.
212. Chollet, F., and others (2015). Keras.
213. Dieleman, S., Schlüter, J., Raffel, C., Olson, E., Sønderby, S.K., Nouri, D., Maturana, D., Thoma, M., Battenberg, E., Kelly, J., et al. (2015). Lasagne: First release. doi: 10.5281/ZENODO.27878.