We have presented a number of alternative stopping rules for RF models. However, for the NHANES data to which we fitted the RF models, the rules differ little in MSPE. The optimal model highlights the overwhelming importance of blood glucose as a predictor of glycohemoglobin percentage, with a reduction in MSE four times greater than that associated with any other variable (Table 3).
One of the parameters in the RF algorithm is the minimum node size, below which a node remains unsplit. This parameter is very commonly available in implementations of the RF algorithm, in particular in the randomForest package 4. The problem of how to select the node size in RF models is much studied in the literature. In particular, Probst et al 5 review the topic of hyperparameter tuning in RF models, with a subsection dedicated to the choice of terminal node size. As they document, the optimal node size is often quite small, and in many packages the default is set to 1 for classification trees and 5 for regression trees. A number of packages allow for alternatives to the standard parental node size limit for node splitting. In particular, the randomForestSRC 6 and the partykit 7,8 R packages both allow splits to be limited by the size of the offspring node. As far as we are aware, no statistical package uses the range, variance or centile range based limits demonstrated here. It should be noted that limits on parental and offspring node size are not equivalent. While it is obviously the case that if the offspring node size is at least \(n\) then the parental node size must be at least \(2n\), the reverse is clearly not the case. For example, the candidate splits of a particular node of size \(2n\) would in general include offspring nodes of sizes \(1,2,\ldots,n-1,n,n+1,\ldots,2n-1\). Were one to insist on terminal nodes being of size at least \(n\) then only the split into two nodes each of size \(n\) would be considered, whereas without restriction on the size of the terminal nodes potential candidates would in general include nodes of size \(1,2,\ldots,n-1,n+1,\ldots,2n-1\) also, although the splitting variables might not in general allow all these to occur.
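The distinction between parental and offspring node size limits can be illustrated with a minimal sketch in Python. The function `candidate_splits` and its arguments `min_parent` and `min_child` are hypothetical names introduced here for illustration only; they do not correspond to the interface of any of the packages cited above. The sketch simply enumerates the offspring sizes that each type of limit would permit, ignoring the fact that the splitting variables may not allow all of them to occur.

```python
def candidate_splits(node_size, min_parent=None, min_child=None):
    """Enumerate the (left, right) offspring sizes permitted for one node.

    min_parent: hypothetical parental limit; a node smaller than this
                is left unsplit.
    min_child:  hypothetical offspring limit; a split is discarded if
                either offspring node would be smaller than this.
    """
    # Parental rule: the node is too small to be split at all.
    if min_parent is not None and node_size < min_parent:
        return []

    splits = []
    for left in range(1, node_size):
        right = node_size - left
        # Offspring rule: both offspring nodes must be large enough.
        if min_child is not None and min(left, right) < min_child:
            continue
        splits.append((left, right))
    return splits


n = 5
# Parental limit of 2n on a node of size 2n: every split remains a candidate.
parent_rule = candidate_splits(2 * n, min_parent=2 * n)
# Offspring limit of n on the same node: only the even split survives.
child_rule = candidate_splits(2 * n, min_child=n)

print(len(parent_rule))  # 9 candidate splits, offspring sizes 1..9
print(child_rule)        # [(5, 5)]
```

Under these assumptions, a parental limit of \(2n\) leaves all \(2n-1\) candidate splits of a node of size \(2n\) available, whereas an offspring limit of \(n\) leaves only the split into two nodes of size \(n\).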
Numerous variants of the RF model have been created, many with implementations in R software. For example, quantile regression RF, introduced by Meinshausen 9, combines quantile regression with random forests; an implementation is provided in the package quantregForest. Garge et al 10 implemented a model-based partitioning of the feature space, and developed the associated R software mobForest (although this has now been removed from the CRAN archive). Seibold et al 11 also used recursive partitioning RF models, which were fitted to amyotrophic lateral sclerosis data. Seibold et al have also developed software for fitting such models, in the R model4you package 12. Segal and Xiao 13 have outlined the use of RFs for multivariate outcomes and developed the R MultivariateRandomForest package 14 for fitting such models. A number of more specialized RF algorithms have also been developed. Wager and Athey 15 used concepts from causal inference, and introduced the idea of a causal forest. Foster et al 16 also used standard RFs as part of a causal (counterfactual) approach for subgroup identification in randomized clinical trial data. Li et al 17 have applied more standard RF models to analyze multicenter clinical trial data. An algorithm that combines RF methods and Bayesian generalized linear mixed models for the analysis of clustered and longitudinal binary outcomes, termed the binary mixed model forest, was developed by Speiser et al 18, using standard R packages. Quadrianto and Ghahramani 19 also proposed a novel RF algorithm incorporating Bayesian elements, which they implemented in Matlab, and compared this model with a number of other machine learning approaches in the analysis of several datasets. Ishwaran et al 20 outlined a survival RF algorithm that is applicable to right-censored survival data; an R package randomSurvivalForestSRC (now removed from the CRAN repository) has been written implementing this model, among other time-to-event RF variants.
For genomic inference, two R packages implementing standard RF models, GeneSrF and varSelRF, have been developed by Díaz-Uriarte and de Andrés 21 and Díaz-Uriarte 22. RFs have also been used in meta-analysis, and a software implementation is provided by the R package metaforest 23. The grf:geographical random forest package of Georganos et al 24 provides an implementation of the RF model specifically aimed at geographical analyses.
We have outlined stopping rules with specific application to regression trees. However, the basic idea would carry over readily to classification trees, using for example the Gini or cross-entropy loss functions.
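As a rough indication of how such an extension might look, the following Python sketch computes the Gini and cross-entropy impurities of a node from its class labels, together with the impurity decrease produced by a split. The function names, and in particular the threshold rule `below_impurity_limit`, are assumptions made here for illustration; the latter is only one conceivable analogue of the variance-based limit and has not been evaluated in this paper.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Proportion of observations in each class present in the node."""
    counts = Counter(labels)
    total = len(labels)
    return [count / total for count in counts.values()]


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def cross_entropy(labels):
    """Cross-entropy impurity (natural logarithm) of the node."""
    return -sum(p * math.log(p) for p in class_proportions(labels))


def impurity_decrease(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurity of the offspring."""
    n = len(parent)
    weighted = (len(left) / n) * impurity(left) \
        + (len(right) / n) * impurity(right)
    return impurity(parent) - weighted


def below_impurity_limit(labels, threshold, impurity=gini):
    """Hypothetical stopping rule (an assumption, not a method from the
    paper): leave the node unsplit once its impurity is at or below
    the chosen threshold."""
    return impurity(labels) <= threshold


node = ["a", "a", "b", "b"]
print(gini(node))                                         # 0.5
print(impurity_decrease(node, ["a", "a"], ["b", "b"]))    # 0.5
print(below_impurity_limit(["a", "a"], threshold=0.1))    # True
```

In this sketch the node impurity plays the role that the within-node variance plays for regression trees; whether such a limit would perform comparably to the rules considered here remains to be investigated.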