4.2 Hyperparameter Optimization for Random Forest
There are several hyperparameters in a random forest model that need to be tuned to achieve the best balance between predictive power and runtime. While more trees in the forest generally improve the performance of the model, they slow down training and prediction. A crucial hyperparameter is the number of features that are randomly selected as candidates for a split at each node of a tree [221]. The stochasticity introduced by this random feature selection is a key characteristic of random forests, as it reduces the correlation between the trees and thus the variance of the predictor. Considering many features typically increases performance, because more options are available for each split, but it also increases the risk of overfitting and slows down the algorithm. In general, random forests are robust to overfitting as long as there are enough trees in the ensemble and the feature selection at each split introduces sufficient stochasticity. Overfitting can furthermore be prevented by restricting the depth of the trees, which is known as pruning, or by enforcing a minimal leaf node size, i.e. a minimal number of data samples ending up in a leaf node. A positive side effect of pruning and of requiring a minimal leaf node size is a speedup of the algorithm [219].
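The hyperparameter names used throughout this section match those of scikit-learn's RandomForestClassifier, so I illustrate with that API (an assumption about the implementation); the following minimal sketch, with illustrative rather than final values, shows where each of the quantities discussed above enters the model:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only; the tuned settings are derived below.
rf = RandomForestClassifier(
    n_estimators=500,     # number of trees: more trees improve performance but slow training
    max_features="sqrt",  # features randomly considered per split (decorrelates the trees)
    max_depth=None,       # restricting depth "prunes" the trees and limits overfitting
    min_samples_leaf=1,   # minimal number of samples that must end up in a leaf node
    n_jobs=-1,            # grow trees in parallel
)
```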
In the following, I use 5-fold cross-validation to identify the optimal architecture of the random forest. Details about the training set and the cross-validation procedure can be found in method section 4.6.3. First, I assessed the performance of models for combinations of the parameter n_estimators, defining the number of trees in the forest, and the parameter max_depth, defining the maximum depth of the trees (a sketch of this grid search is given after the list):
- n_estimators \(\in \{100,500,1000\}\)
- max_depth \(\in \{10, 100, 1000, None\}\)
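This first cross-validated grid search can be run as sketched below, assuming scikit-learn's GridSearchCV and hypothetical variables X and y holding the training features and contact/non-contact labels from method section 4.6.3; setting the scoring to precision mirrors the mean precision reported in this section, but the exact evaluation metric is an assumption:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Stage 1: number of trees vs. maximum tree depth, 5-fold cross-validation.
param_grid = {
    "n_estimators": [100, 500, 1000],
    "max_depth": [10, 100, 1000, None],
}
search = GridSearchCV(
    RandomForestClassifier(random_state=0, n_jobs=-1),
    param_grid,
    scoring="precision",   # assumed scoring; the text reports mean precision
    cv=5,                  # 5-fold cross-validation
)
search.fit(X, y)           # X: feature matrix, y: contact (1) / non-contact (0) labels
print(search.best_params_)
```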
Figure 4.2 shows that the top five parameter combinations perform nearly identically. Random forests with 1000 trees perform slightly better than models comprising 500 trees, irrespective of the depth of the trees. I chose n_estimators=1000 and, in order to keep model complexity small, max_depth=100 for further analysis.
Next, I optimized the parameters min_samples_leaf, defining the minimum number of samples required at a leaf node, and max_features, defining the number of randomly selected features considered for each split, using the following settings (see the sketch after the list):
- min_samples_leaf \(\in \{1, 10, 100\}\)
- max_features \(\in \{8, 16, 38, 75\}\), representing \(\log_2 N\), \(\sqrt{N}\), \(0.15N\) and \(0.3N\) respectively, with \(N=250\) being the number of features listed in method section 4.6.1.
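The second stage can be run analogously, holding the values selected in the first stage fixed; again a sketch under the same assumptions about X, y and the scoring metric:

```python
# Stage 2: minimal leaf size vs. features per split,
# with n_estimators and max_depth fixed from stage 1.
param_grid = {
    "min_samples_leaf": [1, 10, 100],
    "max_features": [8, 16, 38, 75],   # log2(N), sqrt(N), 0.15N, 0.3N for N = 250
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=1000, max_depth=100,
                           random_state=0, n_jobs=-1),
    param_grid,
    scoring="precision",
    cv=5,
)
search.fit(X, y)
print(search.best_params_)
```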
Randomly selecting 30% of the features (= 75 features) and requiring at least 10 samples per leaf yields the highest mean precision, as can be seen in Figure 4.3. I chose max_features=0.30 and min_samples_leaf=10 for further analysis. Tuning the hyperparameters in a different order or on a larger dataset gives similar results.
In a next step, I assessed dataset-specific settings, such as the window size over which single-position features are computed, the distance threshold used to define non-contacts, and the optimal proportion of contacts and non-contacts in the training set. I used the previously identified random forest hyperparameters (n_estimators=1000, min_samples_leaf=10, max_depth=100, max_features=0.30) and evaluated the following settings (a sketch of the contact/non-contact subsampling is given after the list):
- proportion of contacts/non-contacts \(\in \{1\!:\!2, 1\!:\!5, 1\!:\!10, 1\!:\!20 \}\) while keeping total dataset size fixed at 300,000 residue pairs
- window size \(\in \{5, 7, 9, 11\}\)
- non-contact threshold \(\in \{8, 15, 20\}\) \(\angstrom\)
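How a given contact/non-contact proportion can be realized while keeping the total training-set size fixed is sketched below; the function and its inputs (arrays of candidate contact and non-contact residue pairs) are hypothetical stand-ins for the actual dataset construction described in method section 4.6.3:

```python
import numpy as np

def subsample_pairs(contacts, non_contacts, ratio, total=300_000, seed=0):
    """Draw a training set with `ratio` non-contacts per contact,
    keeping the total number of residue pairs fixed at `total`."""
    rng = np.random.default_rng(seed)
    n_contacts = total // (1 + ratio)
    n_non_contacts = total - n_contacts
    idx_c = rng.choice(len(contacts), size=n_contacts, replace=False)
    idx_n = rng.choice(len(non_contacts), size=n_non_contacts, replace=False)
    return contacts[idx_c], non_contacts[idx_n]

# Example: a 1:5 proportion of contacts to non-contacts
# (50,000 contacts vs. 250,000 non-contacts).
# sampled_contacts, sampled_non_contacts = subsample_pairs(contacts, non_contacts, ratio=5)
```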
As can be seen in appendices ?? and ??, the default choice of a window size of five positions and a non-contact threshold of \(8 \angstrom\) proves to be the optimal setting. Furthermore, using five times as many non-contacts as contacts in the training set results in the highest mean precision, as can be seen in appendix ??. These estimates might be slightly biased, since the random forest hyperparameters were optimized on a dataset that already used exactly these settings.
References
221. Bernard, S., Heutte, L., and Adam, S. (2009). Influence of Hyperparameters on Random Forest Accuracy. In (Springer, Berlin, Heidelberg), pp. 171–180. doi: 10.1007/978-3-642-02326-2_18.
219. Louppe, G. (2014). Understanding Random Forests: From Theory to Practice. arXiv:1407.7502.