BoostTree and BoostForest for Ensemble Learning

Bootstrap aggregating (Bagging) and boosting are two popular ensemble learning approaches, which combine multiple base learners into a composite model for more accurate and more reliable performance. They have been widely used in biology, engineering, healthcare, etc. This paper proposes BoostForest, an ensemble learning approach that uses BoostTrees as base learners and can be used for both classification and regression. BoostTree constructs a tree model by gradient boosting. It increases the randomness (diversity) by drawing the cut-points randomly at node splitting. BoostForest further increases the randomness by bootstrapping the training data in constructing different BoostTrees. BoostForest generally outperformed four classical ensemble learning approaches (Random Forest, Extra-Trees, XGBoost and LightGBM) on 35 classification and regression datasets. Remarkably, BoostForest tunes its parameters by simply sampling them randomly from a parameter pool, which can be easily specified, and its ensemble learning framework can also be used to combine many other base learners.

For example, in biology, Wang et al. [3] used an ensemble of neural networks to emulate mechanism-based biological models. They found that the ensemble is more accurate than an individual neural network, and the consistency among the individual models can indicate the error in prediction. In Moon exploration, Yang et al. [7] used ensemble transfer learning and Chang'E data for automatic Moon surface impact crater detection and age estimation. They successfully identified 109,956 new craters from 7,895 training samples. In healthcare, Agius et al. [10] developed a Chronic Lymphocytic Leukemia (CLL) treatment-infection model, an ensemble of 28 machine learning algorithms, to identify patients at risk of infection or CLL treatment within two years of diagnosis.
One of the most popular algorithms for constructing the base learners is the decision tree [11], [12], [13]. Two common approaches for constructing the composite learner are bootstrap aggregating (Bagging) and boosting.
Bagging [14] connects multiple base learners in parallel to reduce the variance of the ensemble. Each base learner is trained using the same learning algorithm on a bootstrap replica, which draws N (the size of the original training set) samples with replacement from the original training set. The outputs of these base learners are then aggregated by majority voting (for classification) or averaging (for regression) to obtain the final output. To achieve robust performance, the base learners in an ensemble should be both accurate and diverse [15], [16], [17].
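The bootstrap-and-aggregate scheme described above can be sketched in a few lines of Python (a minimal illustration, not any library's implementation; the learner objects and function names are hypothetical):

```python
import numpy as np

def bootstrap_replica(X, y, rng):
    """Draw N samples with replacement from the original training set."""
    idx = rng.integers(0, len(X), size=len(X))
    return X[idx], y[idx]

def bagging_predict(base_learners, X, task="regression"):
    """Aggregate base-learner outputs: averaging for regression,
    majority voting for classification."""
    preds = np.array([m.predict(X) for m in base_learners])
    if task == "regression":
        return preds.mean(axis=0)
    # majority vote over the base learners' predicted class labels
    votes = [np.bincount(col).argmax() for col in preds.T.astype(int)]
    return np.array(votes)
```

Each base learner only needs a `predict` method; the ensemble logic is independent of the learning algorithm used on each replica.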
There are many approaches to increase the accuracy of base learners in an ensemble. Combining the advantages of tree models and linear models can greatly improve the model's learning ability, which is the main idea of model trees. M5 [18] constructs a linear regression function at each leaf to approximate the target function for high fitting ability. When a new sample comes in, it is first sorted down to a leaf, then the linear model at that leaf is used to predict its output. M5P (aka M5′) [19] trains a linear model at each node of a pruned tree to reduce the risk of over-fitting. More sophisticated regression algorithms, e.g., Ridge Regression (RR) [2], Extreme Learning Machine (ELM) [20], Support Vector Regression (SVR) [21], and Neural Networks [22], have rarely been used in model trees. A possible reason is that they have some hyper-parameters to tune, e.g., the regularization coefficient and the number of nodes in the hidden layers. It is an NP-complete problem to simultaneously determine the structure of the model tree and the parameters of each node model [23]. A common approach is cross-validation, but it is very time-consuming. Thus, it is desirable to develop a strategy that can make the model tree more compatible with these more sophisticated (and hence potentially better performing) regression models.
There are also many approaches to increase the diversity of base learners in an ensemble, which can be divided into three categories [24]: 1) Sample-based strategies, which train each base learner on a different subset of samples, and thus are scalable to big data. For example, Bagging uses bootstrap sampling to obtain different subsets to train the base learners, and AdaBoost [25] uses adaptive sample weights (larger weights for harder samples) in generating a new base learner. 2) Feature-based strategies, which train each base learner on different subsets of features, and thus are scalable to high dimensionality. For example, each decision tree in Random Forest (RandomForest) [26], [27], [28] selects the feature to be split from a random subset of features, instead of all available features. Similarly, each decision tree in Extremely Randomized Trees (Extra-Trees) [29] splits nodes by drawing the cut-points completely randomly. 3) Parameter-based strategies. If the base learners are sensitive to their parameters, then setting different parameters can improve the diversity. For example, different hidden layer weights can be used to initialize diverse neural networks. Interestingly, these three categories of diversity enhancement strategies are complementary; so, they can be combined for better performance.
Boosting [25], [30], [31], the driving force of the Gradient Boosting Machine (GBM) [11], can be used to reduce the bias of an ensemble. It is an incremental learning process, in which a new base learner is built to compensate for the errors of previously generated learners. Each new base learner is added to the ensemble in a forward stage-wise manner. As the boosting algorithm iterates, base learners generated at later iterations tend to focus on the harder samples. Mason et al. [32] described boosting from the viewpoint of gradient descent and regarded boosting as a stage-wise learning scheme to optimize different objective functions iteratively. FilterBoost [33] is a variant of AdaBoost [25], which fits an additive logistic regression model and minimizes the negative log-likelihood step-by-step. DeepBoosting [34] can use very deep decision trees as base learners. Popular implementations of GBM, e.g., XGBoost [12] and LightGBM [35], have been successfully used in many applications [36], [37], [38], [39]. However, traditional boosting approaches [11], [12], [35] often have many parameters and thus require cross-validation, which is unreliable on small datasets, and time-consuming on big data. It is desirable to develop an algorithm that has very few parameters to tune and is robust to them.

This paper proposes BoostForest, which integrates boosting and Bagging for both classification and regression. Our main contributions are: 1) We propose a novel decision tree model, BoostTree, that integrates GBM into a single decision tree, as shown in Fig. 1(a). BoostTree uses the node function, e.g., RR, ELM, or SVR, to train a linear or nonlinear regression model at each node for regression or binary classification, or J regression models for J-class classification, where J is the number of classes. For a given input, BoostTree first sorts it down to a leaf, then computes the final prediction by summing up the outputs of all node models along the path from the root to that leaf.
Similar to Extra-Trees, BoostTree increases the randomness (diversity) by drawing the cut-points randomly at node splitting. 2) We propose a novel parameter setting strategy, random parameter pool sampling, which makes BoostTree's hyper-parameters easier to tune than with traditional approaches. In this strategy, the parameters of BoostTree are not specific values, but random samples from candidate sets stored in a parameter pool. Each time a root node is generated, BoostTree randomly selects its hyper-parameters from the parameter pool. 3) Using BoostTrees as base learners, we propose a novel ensemble learning approach, BoostForest, as shown in Fig. 1(b). It first uses bootstrap to obtain multiple replicas of the original training set, and then trains a BoostTree on each replica. BoostForest uses the parameter pool sampling strategy to simplify its hyper-parameter tuning process, and outperforms several popular ensemble learning approaches. Moreover, it represents a very general ensemble learning framework, whose base learners can be any tree model, e.g., BoostTree, M5P, Extra-Tree (a single decision tree in Extra-Trees), or Logistic Model Tree (LMT) [40], or even a mixture of different models.
The remainder of this paper is organized as follows: Section 2 introduces some related work. Section 3 describes the details of our proposed BoostTree and BoostForest. Section 4 presents experimental results to demonstrate the superiority of BoostForest over several popular ensemble learning approaches on 35 classification and regression datasets. Finally, Section 5 draws conclusions and points out some future research directions.

RELATED WORK
GBM [11] generates an ensemble via an iterative process. It uses Newton's method to decompose the original classification or regression problem into multiple sub-regression problems, and solves them iteratively. Given a dataset D = {(x_n, y_n)}_{n=1}^N with N training samples, where x_n ∈ R^{D×1} and D is the feature dimensionality, an ensemble φ generated by GBM uses K base learners {f_k}_{k=1}^K to predict the output:

ŷ_n = φ(x_n) = Σ_{k=1}^K f_k(x_n). (1)

LogitBoost (Supplementary Algorithm 1) [41] is a popular implementation of GBM to optimize logistic regression. In each iteration, LogitBoost first computes the pseudo-labels and weights using the Newton (for binary classification) or quasi-Newton (for multi-class classification) method, and then updates the ensemble by adding a new regression model (for binary classification) or J new regression models [for J-class (J > 2) classification], which are trained to fit the pseudo-labels by weighted least-squares regression. If the original problem is regression, then in each GBM iteration, the pseudo-label is the residual between the true value and the prediction, and all samples have the same weight.
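The additive prediction ŷ = Σ_k f_k(x) and the residual-fitting loop can be illustrated with a toy GBM for regression (a sketch using depth-1 stumps on the first feature as base learners; all function names here are ours, for illustration only):

```python
import numpy as np

def fit_stump(X, r):
    """Fit a depth-1 regression tree (stump) to residuals r by
    scanning the unique values of the first feature (illustration only)."""
    x = X[:, 0]
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean())**2).sum() + ((right - right.mean())**2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, lv, rv = best
    return lambda Xq: np.where(Xq[:, 0] <= t, lv, rv)

def gbm_fit(X, y, K=50, lr=0.1):
    """Additive ensemble: each new stump fits the current residuals."""
    F = np.zeros(len(y))
    learners = []
    for _ in range(K):
        f = fit_stump(X, y - F)      # pseudo-labels = residuals
        F += lr * f(X)
        learners.append(f)
    return lambda Xq: lr * sum(f(Xq) for f in learners)
```

Each iteration shrinks the residuals, so later base learners concentrate on what the earlier ones got wrong.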
Malerba et al. [42] proposed Stepwise Model Tree Induction (SMOTI) to construct model trees gradually for regression. When the tree is growing, SMOTI adds more and more variables to the leaf model to approximate the target function. At each step, it adds either a splitting node or a regression node to the tree. A splitting node partitions the sample space, whereas a regression node removes the linear effect of the variables already included in its parent node and performs linear regression on one variable. Different from GBM, LMT [40] generates only one tree instead of multiple trees for predicting. LMT extends model trees from regression to classification, by integrating LogitBoost into the tree model. The final model at a leaf consists of all node models along the path from the root to that leaf.
Assume an LMT has trained a function f_m for the m-th node. For an input x_n, LMT first identifies q(x_n), the leaf node it belongs to, and then all f_m along the path from the root to that leaf node are summed up to predict the output, i.e.,

ŷ_n = Σ_{m ∈ Path_{q(x_n)}} f_m(x_n), (3)

where Path_{q(x_n)} is the collection of the node indices along the path from the root to the leaf node q(x_n). Similar to LogitBoost, LMT can use (2) to output the probability. SimpleLogistic [40] (a variant of LogitBoost) incrementally refines the linear logistic model. In each iteration, instead of using all features to perform linear regression, SimpleLogistic uses only one feature to train the model. In this way, only relevant features are selected, and the risk of over-fitting is reduced.
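The path-sum prediction can be sketched as follows (a minimal illustration with hypothetical `Node` fields; LMT's actual implementation differs):

```python
class Node:
    """A model-tree node: a node model plus an optional axis-aligned split."""
    def __init__(self, model, feature=None, cut=None, left=None, right=None):
        self.model, self.feature, self.cut = model, feature, cut
        self.left, self.right = left, right

def predict_path_sum(root, x):
    """Sum the node-model outputs along the path from the root to the
    leaf that x is sorted into (the LMT / BoostTree prediction rule)."""
    total, node = 0.0, root
    while node is not None:
        total += node.model(x)
        if node.feature is None:          # reached a leaf
            break
        node = node.left if x[node.feature] <= node.cut else node.right
    return total
```

Note that the prediction is NOT just the leaf model's output; every ancestor's model contributes one additive term.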
Gradient Boosting with Piece-wise Linear Regression Trees (GBDT-PL) [43] is a variant of XGBoost, which combines multiple Piece-wise Linear Regression Trees (PL-Trees) to predict the output. Similar to LMT, a PL-Tree fits a linear model at each node in an additive manner.
Webb [44] proposed MultiBoosting, which combines wagging (a variant of Bagging) and AdaBoost [25] to reduce both the bias and the variance. It uses AdaBoost to generate a sub-committee of decision trees. By wagging a set of sub-committees, MultiBoosting may achieve lower error than AdaBoost.
RandomForest [26] uses two techniques to improve the diversity of each tree, and hence reduces over-fitting: 1) Bagging, i.e., each tree is trained with a bootstrap replica drawn from the original training set; and, 2) feature subsampling, i.e., for each node of the tree, a subset of the features is randomly selected from the complete feature set, then an optimal feature is selected from the subset to split the node. Extra-Trees [29] splits nodes in a more random way than RandomForest. It first randomly draws one cut-point for each feature, and then selects the cut-point with the largest splitting gain to split a node. Extra-Trees reduces over-fitting by trading the base learners' accuracy for diversity.
TAO [45] first initializes the tree with a given depth by using CART [13] or random parameters, and then circularly updates each decision or leaf node to optimize the objective function. It trains a binary classifier at each decision node to classify a sample into the left or right child node, and a constant label (TAO-c) or linear softmax classifier (TAO-l) at each leaf to predict the output. Its decision nodes are oblique nodes, which can split the samples more effectively than axis-aligned nodes. Bagged TAO-l (BaggedTAO-l) [46] further uses Bagging to integrate multiple TAO-l trees for better performance.

BOOSTTREE AND BOOSTFOREST
This section introduces the details of our proposed BoostTree and BoostForest. All source code is available at https://github.com/zhaochangming/BoostForest.

Motivation
Three problems need to be solved in model tree based ensemble learning: 1) How to design a model tree that is more compatible with Bagging? Generally, as the model complexity increases, the bias of the model decreases, but the variance increases. Between the two popular ensemble learning strategies, Bagging is suitable for integrating complex base learners to reduce the variance, whereas boosting is suitable for integrating simple base learners to reduce the bias. MultiBoosting combines AdaBoost and wagging to further improve the performance of a decision tree. However, how to combine GBM and Bagging to improve the performance of model trees has not been studied. LMT is a stable model with low randomness, so simply combining multiple LMTs with Bagging may not outperform a single LMT. 2) How to handle both classification and regression problems by using a single model tree structure? LMT applies LogitBoost to a decision tree to generate an accurate model tree; however, it cannot handle regression problems. GBDT-PL handles regression problems by using GBM to integrate multiple PL-Trees. However, performing boosting in a single tree is more efficient, and easier to use with Bagging. 3) How to generate a model tree with easy parameter tuning, so that any regression algorithm, e.g., RR, ELM or SVR, can be used as its node function?
This section proposes BoostTree to solve the above three problems simultaneously. Moreover, multiple BoostTrees can be further integrated into a BoostForest for better performance.

General Idea of BoostTree
Our proposed BoostTree is inspired by LMT and GBM. Assume a BoostTree has M nodes, excluding the root. Then, it trains a function f_m for the m-th node, m ∈ [1, M], and uses (5) to predict the output.
BoostTree minimizes the following regularized loss function:

L(φ) = Σ_{n=1}^N ℓ(y_n, ŷ_n) + (λ/2) Σ_{m=1}^M ||a_m||², (6)

where λ is the regularization coefficient, and a_m is the coefficient vector of the m-th node model. The second term above penalizes the complexity of BoostTree to reduce over-fitting. Different loss functions ℓ can be used to cope with regression and classification problems, as will be introduced in Sections 3.3-3.5. For the ease of optimization, we require ℓ to be convex and differentiable.
In general, the objective function in (6) cannot be optimized directly. Inspired by LMT and GBM, BoostTree minimizes (6) in an additive manner. Assume a tree with T (T ≥ 2) leaves has been generated after T − 1 iterations. Then, there are M = 2T − 2 nodes, excluding the root. We can rewrite (6) as:

L(φ) = Σ_{m ∈ Leaves} LeafLoss_m, (7)

where

LeafLoss_m = Σ_{n ∈ I_m} ℓ(y_n, ŷ_n), (8)

i.e., I_m is the set of all training samples belonging to Leaf m, and LeafLoss_m measures the loss of Leaf m. In each iteration, BoostTree uses a greedy learning scheme to add branches to the leaf with the highest loss. Assume Node m is the leaf node with the highest loss. After the split, I_m is divided into two subsets: I_L (of the left node) and I_R (of the right node). Let f_L and f_R be the node models of the left and the right nodes, trained separately using I_L and I_R, respectively. Then, the reduction of the loss in (7) is:

δ = LeafLoss_m − L(f_L) − L(f_R) + C, (11)

where

L(f_L) = Σ_{n ∈ I_L} ℓ(y_n, F_m(x_n) + f_L(x_n)) + (λ/2)||a_L||², (9)
L(f_R) = Σ_{n ∈ I_R} ℓ(y_n, F_m(x_n) + f_R(x_n)) + (λ/2)||a_R||². (10)

Note that C is a constant, F_m is the ensemble of the models along the path from the root to Node m, and L(f_L) and L(f_R) are the loss functions for the left and right child nodes, respectively. (11) can be optimized by minimizing L(f_L) and L(f_R) separately. More specifically, BoostTree uses two steps to optimize (11): 1) Split the node: BoostTree implements four criteria to select the cut-point: a) XGBoost Splitting Criterion (XGB-SC), which uses gradient descent to reduce the loss; b) Gini Splitting Criterion (Gini-SC), which tries to improve the purity of each leaf; c) Mean Squared Error (MSE), which tries to reduce the MSE of each leaf; and d) C4.5 Splitting Criterion (C4.5-SC) [47], which selects the cut-point with the largest ratio of information gain.
Though all criteria are effective, XGB-SC achieved the best performance in our experiments, as will be demonstrated in Section 4.10.3; hence, we make it the default option in BoostTree. XGB-SC uses the Euclidean norm of the leaf score to constrain the tree complexity. Ignoring the parent node loss and inspired by Equation (7) in XGBoost [12], we can calculate the splitting gain as:

δ_gain = (1/2) [ (Σ_{n ∈ I_L} g_n)² / (Σ_{n ∈ I_L} h_n + λ) + (Σ_{n ∈ I_R} g_n)² / (Σ_{n ∈ I_R} h_n + λ) ], (12)

where

g_n = ∂ℓ(y_n, F_m(x_n)) / ∂F_m(x_n), (13)
h_n = ∂²ℓ(y_n, F_m(x_n)) / ∂F_m(x_n)² (14)

are the first and second order derivatives of the loss function w.r.t. F_m(x_n), respectively. Similar to Extra-Trees, to increase the randomness, BoostTree first draws a random cut-point uniformly for each feature, and then determines the best cut-point of the current node according to the maximum δ_gain. 2) Train the node model: BoostTree uses gradient boosting to decompose the original problem into multiple sub-regression problems, and uses a node function to solve a sub-regression problem in each node. It first calculates the pseudo-labels and sample weights to generate a temporary training set, and then trains a regression model for regression or binary classification, or J regression models for J-class (J > 2) classification. For simplicity, RR is used as the default node function. More details on training the node model for different tasks are introduced in Sections 3.3-3.5.
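The XGB-SC gain combined with Extra-Trees-style random cut-points can be sketched as follows (a simplified illustration with hypothetical function names; the actual BoostTree adds further details such as batch sampling and minimum-leaf-size checks, and an empty-side guard is omitted here for brevity):

```python
import numpy as np

def xgb_sc_gain(g, h, mask_left, lam):
    """Splitting gain ignoring the parent term (XGB-SC style):
    0.5 * [G_L^2/(H_L+lam) + G_R^2/(H_R+lam)]."""
    GL, HL = g[mask_left].sum(), h[mask_left].sum()
    GR, HR = g[~mask_left].sum(), h[~mask_left].sum()
    return 0.5 * (GL**2 / (HL + lam) + GR**2 / (HR + lam))

def pick_random_cut(X, g, h, lam, rng):
    """Draw ONE random cut-point per feature, then keep the feature
    whose random cut maximizes the gain (Extra-Trees-style randomness)."""
    best = None
    for d in range(X.shape[1]):
        lo, hi = X[:, d].min(), X[:, d].max()
        if lo == hi:                      # constant feature: cannot split
            continue
        cut = rng.uniform(lo, hi)
        gain = xgb_sc_gain(g, h, X[:, d] <= cut, lam)
        if best is None or gain > best[0]:
            best = (gain, d, cut)
    return best  # (gain, feature index, cut-point), or None
```

Only one candidate cut per feature is evaluated, which is what trades a little per-tree accuracy for the diversity that Bagging exploits.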
For the ease of implementation, we can simply store all possible values of each parameter in a parameter pool, from which BoostTree randomly selects a combination of parameters, e.g., the minimum number of samples at a leaf min_samples_leaf, and the regularization coefficient λ. The robustness of BoostForest to the parameters used to train each BoostTree will be studied in Section 4.10.1.
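Random parameter pool sampling can be as simple as the following sketch (the pool values below are hypothetical placeholders, not the candidate sets used in the paper's experiments):

```python
import random

# Hypothetical parameter pool: all candidate values of each hyper-parameter.
PARAM_POOL = {
    "min_samples_leaf": [1, 2, 5, 10, 20],
    "lambda": [0.01, 0.1, 1.0, 10.0],
}

def sample_params(pool, rng=random):
    """Each time a root node is generated, draw one value per
    hyper-parameter uniformly from the pool; no cross-validation needed."""
    return {name: rng.choice(values) for name, values in pool.items()}
```

Because every tree in the forest draws its own combination, the ensemble averages over hyper-parameter choices instead of committing to one.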
Algorithm 1 shows the pseudo-code of BoostTree using RR as the node function, where the subfunction FitModel assumes different forms according to different learning tasks, as shown in Algorithms 2-4.
Assume Node c is the left or right child node of Node m. According to GBM, the loss function for Node c can be expressed as: where f c (x n ) = a T c x n +b c ,ỹ n = y n −F m (x n ) is the pseudolabel, i.e., the residual between the true value and the prediction, I c is the set of all training samples belonging to Node c, a c ∈ R D×1 is a vector of the regression coefficients, and b c is the intercept. Note that all samples have the same weight in regression.
MSE is sensitive to noise and outliers. To improve BoostTree's robustness, we save the lower (upper) bound lb (ub) of each node's pseudo-label at the training stage, and clip the corresponding node model's output at the prediction stage:

f_c(x) ← min(max(f_c(x), lb_c), ub_c), (17)

where lb_c = min_{n ∈ I_c} ỹ_n and ub_c = max_{n ∈ I_c} ỹ_n. When the tree is growing, the residuals of the nodes generated at later iterations approach zero, so lb and ub get closer. The effect of clipping will be demonstrated in Section 4.10.2.

Algorithm 1: BoostTree using RR as the node function.
Input: Data = {(x_n, y_n)}_{n=1}^N, N training samples, where x_n ∈ R^{D×1};
    Pool_MSL, candidate value pool of the minimum number of samples at a leaf min_samples_leaf;
    Pool_λ, candidate value pool of the ℓ2 regularization parameter λ;
    (optional) MaxNumLeaf, the maximum number of leaves, default NULL.
if |LeafList| > 0 and (MaxNumLeaf == NULL or NumLeaf < MaxNumLeaf) then
    Use (8) to calculate the leaf loss of each node in LeafList;
    split(node*), where node* is the leaf node with the highest loss;
end
return node;
Algorithm 2 shows the pseudo-code of BoostTree for regression.

Algorithm 2: FitModel for regression.
Input: {(x_n, y_n)}_{n ∈ I_c}, sample set of the current node; F_m, ensemble of the models along the path from the root to the parent node of the current node; λ, the ℓ2 regularization parameter.
Output: A regression model f_c for the current node.
ỹ_n = y_n − F_m(x_n), n ∈ I_c;
Fit f_c = RidgeRegression({(x_n, ỹ_n)}_{n ∈ I_c}, λ) using RR with regularization parameter λ;
Clip f_c using (17).
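A minimal Python sketch of this FitModel step (ridge regression on the residual pseudo-labels, plus the clipping of (17); the helper names are ours, and a closed-form ridge solve stands in for a library call):

```python
import numpy as np

def fit_model_regression(X, y, F_m, lam):
    """Fit a ridge model to the residual pseudo-labels and clip its
    predictions to the pseudo-label range seen at training time."""
    r = y - F_m(X)                              # pseudo-labels (residuals)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append an intercept column
    d = Xb.shape[1]
    reg = lam * np.eye(d)
    reg[-1, -1] = 0.0                           # do not penalize the intercept
    w = np.linalg.solve(Xb.T @ Xb + reg, Xb.T @ r)
    lb, ub = r.min(), r.max()                   # bounds saved for clipping

    def f_c(Xq):
        Xqb = np.hstack([Xq, np.ones((len(Xq), 1))])
        return np.clip(Xqb @ w, lb, ub)         # the clipping of (17)
    return f_c
```

The clipping bounds are computed once, from the training residuals of this node, and applied to every later prediction of the node model.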

BoostTree for Binary Classification
In classification tasks, BoostTree is built using a LogitBoost-like algorithm, which iteratively updates the ensemble of the logistic linear models F by adding a new regression model f to it.
The cross-entropy loss is used in binary classification:

ℓ(y_n, F(x_n)) = −[y_n ln p(x_n) + (1 − y_n) ln(1 − p(x_n))], (18)

where p(x_n) is the estimated probability in (2). Assume Node c is the left or right child node of Node m. According to LogitBoost (Supplementary Algorithm 1) [41], the second order Taylor expansion can be used to approximate the loss function for Node c:

L(f_c) ≈ Σ_{n ∈ I_c} (h_n/2) (f_c(x_n) + g_n/h_n)² + (λ/2)||a_c||² + C, (19)

where C = Σ_{n ∈ I_c} ℓ(y_n, F_m(x_n)) − Σ_{n ∈ I_c} (1/2) g_n²/h_n is a constant, and g_n and h_n are calculated by using (13) and (14), respectively. Note that g_n and h_n are irrelevant to f_c. Therefore, f_c can be any regression model.
Then, we can construct the pseudo-label ỹ_n and the sample weight w_n, as in LogitBoost:

ỹ_n = (y_n − p(x_n)) / w_n,
w_n = p(x_n)(1 − p(x_n)),

in which p(x_n) is the estimated probability in (2). To improve BoostTree's robustness to outliers, we follow LogitBoost [41] to limit the minimum weight to 2ε (ε is the machine epsilon), and clip the value of the pseudo-label ỹ to:

ỹ_n ← min(max(ỹ_n, −y_max), y_max), (24)

where y_max ∈ [2, 4] (according to LogitBoost [41]). y_max = 4 was used in our experiments. Finally, we can remove C in (19) to simplify the loss function for child Node c:

L(f_c) = Σ_{n ∈ I_c} (w_n/2) (ỹ_n − f_c(x_n))² + (λ/2)||a_c||².

Algorithm 3 shows the pseudo-code of BoostTree for binary classification.
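These pseudo-labels and weights can be computed as follows (a sketch assuming labels y ∈ {0, 1} and a standard sigmoid for p; the paper's probability model (2) is not reproduced here, so that scaling is an assumption of this illustration):

```python
import numpy as np

EPS = np.finfo(float).eps   # machine epsilon

def logitboost_targets(y, F, y_max=4.0):
    """LogitBoost-style pseudo-labels and weights for binary classification.
    Assumes y in {0,1} and p = sigmoid(F)."""
    p = 1.0 / (1.0 + np.exp(-F))
    w = np.maximum(p * (1.0 - p), 2 * EPS)   # floor the weights at 2*eps
    z = (y - p) / w                          # pseudo-labels
    z = np.clip(z, -y_max, y_max)            # clip, with y_max in [2, 4]
    return z, w
```

Without the floor and the clip, confidently misclassified samples (p near 0 or 1) would produce near-zero weights and enormous pseudo-labels, destabilizing the weighted least-squares fit.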

Algorithm 3: FitModel for binary classification.
Input: {(x_n, y_n)}_{n ∈ I_c}, sample set of the current node; F_m, ensemble of the models along the path from the root to the parent node of the current node; λ, the ℓ2 regularization parameter.
Output: A regression model f_c for the current node.
BoostTree for Multi-Class Classification

In each iteration, LogitBoost handles the J-class classification problem by decomposing it into J regression problems, so f_c becomes a set of J linear models, f_c = {f_c^j}_{j=1}^J. Similarly, the second order Taylor expansion can be used to approximate the loss function for Node c:

L(f_c) ≈ Σ_{j=1}^J Σ_{n ∈ I_c} (h_n^j/2) (f_c^j(x_n) + g_n^j/h_n^j)² + (λ/2) Σ_{j=1}^J ||a_c^j||² + C, (27)

where C is a constant, a_c^j is the coefficient vector of f_c^j, g_n^j is the j-th element of g_n, and h_n^j is the j-th diagonal element of h_n. g_n and h_n are again calculated by using (13) and (14), respectively.
Then, for the j-th class, we can calculate the pseudo-label ỹ_n^j and sample weight w_n^j, as in LogitBoost:

ỹ_n^j = (y_n^j − p^j(x_n)) / w_n^j,
w_n^j = p^j(x_n)(1 − p^j(x_n)),

in which p^j(x_n) is the estimated probability of Class j in (4). To prevent too large pseudo-labels, we also use (24) to clip ỹ_n^j. Finally, we can remove C in (27) to simplify the loss function for Node c:

L(f_c) = Σ_{j=1}^J Σ_{n ∈ I_c} (w_n^j/2) (ỹ_n^j − f_c^j(x_n))² + (λ/2) Σ_{j=1}^J ||a_c^j||².

Algorithm 4 shows the pseudo-code of BoostTree for J-class (J > 2) classification.

BoostForest
BoostForest integrates multiple BoostTrees into a forest. It first uses bootstrap to generate K replicas of the original training set, and then trains a BoostTree on each replica.
For regression, the outputs predicted by all K BoostTrees are averaged as the final output:

BoostForest(x) = (1/K) Σ_{k=1}^K BoostTree_k(x).

For classification, the probabilities predicted by all K BoostTrees are averaged as the final probability:

BoostForest(x) = (1/K) Σ_{k=1}^K p_k(x),

where p_k(x) is the probability predicted by the k-th BoostTree.

Algorithm 4: FitModel for J-class (J > 2) classification.
Input: {(x_n, y_n)}_{n ∈ I_c}, sample set of the current node; F_m, ensemble of the models along the path from the root to the parent node of the current node; λ, the ℓ2 regularization parameter.
Output: The regression model set f_c for the current node.
, n ∈ I_c;
Fit f_c^j using weighted RR on D′ with regularization parameter λ;
end
/* Center the outputs of the regression models, as in LogitBoost. */
Algorithm 5 gives the pseudo-code of BoostForest.

Algorithm 5: BoostForest training algorithm.
Input: Data = {(x_n, y_n)}_{n=1}^N, N training samples, where x_n ∈ R^{D×1}; n_estimators, the number of BoostTrees; Pool_MSL, parameter value pool of the minimum number of samples at a leaf; Pool_λ, parameter value pool of the ℓ2 regularization parameter λ.
Output: A BoostForest.
Train BoostTree_i on Data′ using Algorithm 1;
Add BoostTree_i to BoostForest;
end
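The overall training loop of Algorithm 5 amounts to the following sketch (`train_tree` stands in for Algorithm 1, and per-tree parameter-pool sampling is omitted for brevity):

```python
import numpy as np

def train_boostforest(X, y, train_tree, n_estimators=100, seed=0):
    """Bagged ensemble: train one tree per bootstrap replica.
    `train_tree(X, y)` must return a callable predictor."""
    rng = np.random.default_rng(seed)
    forest = []
    for _ in range(n_estimators):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap replica
        forest.append(train_tree(X[idx], y[idx]))
    return forest

def forest_predict(forest, X):
    """Average the K trees' outputs (regression; for classification,
    average the predicted probabilities instead)."""
    return np.mean([tree(X) for tree in forest], axis=0)
```

Because each tree only sees its own replica (and, in the full algorithm, its own random hyper-parameters and cut-points), the averaged forest is far less sensitive to any single tree's randomness.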

Implementation Details
Inspired by the row sampling trick in XGBoost and LightGBM, we use two tricks to improve the speed of BoostTree: 1) Randomly select BatchSize samples to identify the cut-point, if the number of samples belonging to the splitting node is larger than the batch size BatchSize. 2) When the number of samples belonging to Node m is larger than BatchSize, randomly select BatchSize samples to train the node model, and approximate LeafLoss_m in (8) by computing it only on I_batch, the batch index set of I_m.
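The batch trick itself is a one-liner (a sketch; `I_m` denotes a node's sample index set, and the function name is ours):

```python
import numpy as np

def maybe_subsample(I_m, batch_size, rng):
    """Row-sampling trick: if a node holds more than `batch_size` samples,
    use a random batch for cut-point search and node-model fitting."""
    if len(I_m) <= batch_size:
        return I_m
    return rng.choice(I_m, size=batch_size, replace=False)
```

This bounds the per-node cost by BatchSize, so the expensive steps (gain evaluation and ridge fitting) no longer scale with the full node size near the root.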
To ensure that BoostTree has at least two leaf nodes, it repeatedly tries to split the root node until the split succeeds.

Discussions
To clarify the novelties of BoostForest, this subsection first discusses the differences and connections between it and some related approaches, e.g., LMT, GBDT-PL, MultiBoosting and TAO, and then summarizes the approaches that motivated BoostForest.
There are four main differences between LMT and BoostForest: 1) LMT uses SimpleLogistic to train the node model, and only handles classification. BoostTree can use any regression algorithm to train the node model, and can handle both classification and regression. 2) LMT computes the splitting gains on all values of the attributes to select the cut-point, whereas BoostTree draws the cut-points completely randomly, which increases its diversity and makes BoostTrees more suitable for Bagging. 3) LMT prunes the tree to reduce over-fitting and uses only one tree to predict the output. However, a single learner can easily be affected by noise and outliers, leading to poor generalization. BoostForest combines multiple BoostTrees with Bagging to reduce over-fitting.
4) Computational complexity (ignoring the pruning operation): let Depth denote the depth of the tree, N the sample size, and D the feature dimensionality. BoostTree's complexity is O(D·N + (N·D² + D³)·Depth) (using RR as the node function), where the first part is the cost of building an Extra-Tree, and the second is the cost of building the node models. If downsampling is performed before fitting the RR model, BoostTree's asymptotic complexity is even lower.

M5 and M5P train a linear model at each node during the training process. Their smoothing operations use all node models along the path from the root to a leaf to predict the output. M5 and M5P construct model trees for regression, whereas LMT constructs them for classification. LMT is the most relevant approach to BoostTree. They both integrate GBM into one single tree, train a base learner at each node to perform boosting, and combine all node models along the path from the root to the corresponding leaf to predict the output. BoostTree may be viewed as an upgraded LMT.
BoostTree improves LMT from three aspects: 1) Model structure: BoostTree expands LMT's node models to non-linear models. 2) Application scenario: BoostTree further extends LMT to regression problems. 3) Training process: BoostTree optimizes the cut-point selection strategy by combining XGB-SC and Extratrees' cut-point drawing strategy, and also the parameter setting strategy. These improvements make it easier to tune the hyper-parameters in BoostTree. Note that LMT can automatically perform crossvalidation within each node to determine the number of training iterations for SimpleLogistic, and prune the tree to further reduce over-fitting. However, when the node models are replaced by other more complex models, the computational cost of cross-validation at all nodes may be very high. To alleviate this problem, BoostTree develops a random parameter pool sampling strategy and combines it with Bagging.
There are four main differences between GBDT-PL and BoostForest: 1) GBDT-PL trains a linear model at each node, and does not evaluate its performance on multi-class classification. BoostTree can use any regression algorithm to train the node model, and can handle binary and multi-class classification, and regression. 2) GBDT-PL splits the node and trains the node models simultaneously, whereas BoostTree first splits the node, and then trains the node models. When the node models are replaced by other complex models, it is very time-consuming for GBDT-PL to select the cut-point, because it needs to train the left and right node models at all candidate cut-points. 3) GBDT-PL uses GBM to train multiple PL-Trees, whereas BoostTree integrates GBM into one single tree. 4) GBDT-PL needs a validation set to select its hyper-parameters, which include the tree structure and node model parameters. Its computational cost increases with the number of hyper-parameter combinations. BoostTree optimizes its hyper-parameter setting strategy by integrating random parameter pool sampling and Bagging.
When each node uses a linear model, BoostTree has a similar structure as a tree in GBDT-PL, because all linear node models along the path to the corresponding leaf can be combined into one single linear model. However, when each node uses a non-linear model, the node models cannot be easily combined, which makes the structures of BoostTree and GBDT-PL tree different.
There are three main differences between MultiBoosting and BoostForest: MultiBoosting and BoostForest both combine boosting and Bagging, but in different ways: MultiBoosting combines them serially, whereas BoostForest combines them in parallel.
TAO is a general and efficient optimization algorithm, which can train many types of decision trees; e.g., both TAO and BoostTree can train the base learners in a tree ensemble. However, there are three main differences between TAO and BoostTree: 1) TAO trains a binary classifier at each decision node to classify a sample into the left or right child node. Its decision nodes are oblique nodes, whereas BoostTree's decision nodes are axis-aligned nodes. 2) TAO uses only the leaf model to predict the output, whereas BoostTree combines all node models along the path from the root to the corresponding leaf to predict the output. 3) TAO circularly updates each decision or leaf node to optimize the objective function, whereas BoostTree optimizes it by using a greedy learning scheme to add new branches to the leaf.

Table 1 summarizes the approaches that motivated BoostForest. Essentially, BoostForest integrates five existing strategies and develops two new empirical ones.

Table 1. Approaches that motivated BoostForest:
- Splitting criteria: XGBoost [12], CART [13] and C4.5 [47]
- Cut-point drawing: Extra-Trees [29]
- Node model training: LMT [40], GBM [11], LogitBoost [41] and stochastic gradient boosting [31]
- Tree ensemble generation: RandomForest [26]
- Random parameter pool sampling: Empirical
- Clipping operation in regression: Empirical

EXPERIMENTAL RESULTS
Extensive experiments were carried out to verify the effectiveness of BoostForest in both classification and regression.
The following seven questions were examined: 1) What is the generalization performance of BoostForest, compared with several popular ensemble learning approaches, e.g., RandomForest [26], Extra-Trees [29], XGBoost [12], LightGBM [35], MultiBoosting [44], GBDT-PL [43], FilterBoost [33] and DeepBoosting [34]? 2) How fast does BoostForest converge, as the number of base learners increases? 3) How does the base learner model complexity affect the generalization performance of BoostForest? 4) Can BoostForest handle datasets with a large number of samples or features? 5) Can our proposed approach for constructing BoostForest, i.e., data replica by bootstrapping and random parameter selection from the parameter pool, also be used to integrate other tree models, e.g., Extra-Tree [29], M5P [19], and LMT [40]? 6) How does the performance of BoostForest change when different node functions, e.g., RR, ELM or SVR, are used in BoostTrees? 7) How robust is BoostForest to its hyper-parameters and the node splitting criterion?

Datasets
We performed experiments on 37 real-world datasets (20 for classification and 17 for regression), summarized in Supplementary Table 1. They covered a wide range of feature dimensionalities (between 4 and 1,024) and sample sizes (between 103 and 11,000,000). For each dataset, categorical features were converted to numerical ones by one-hot encoding. Unless stated otherwise, each feature's distribution was scaled to a standard normal distribution, and the labels in the regression datasets were z-normalized.
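This preprocessing can be illustrated with a minimal NumPy sketch (the helper names `one_hot` and `standardize` are ours, not the paper's):

```python
import numpy as np

def one_hot(column):
    """One-hot encode a categorical column (a list of labels)."""
    categories = sorted(set(column))
    return np.array([[1.0 if v == c else 0.0 for c in categories]
                     for v in column])

def standardize(X):
    """Scale each feature to zero mean and unit variance."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard constant features
    return (X - mu) / sigma
```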

Algorithms
The algorithms compared in our experiments are summarized in the Supplementary Material.

Experimental Setting
We performed experiments on 30 small to medium sized datasets (15 for classification and 15 for regression) in Sections 4.4-4.6 and 4.10, 35 datasets (30 small to medium sized ones and five large sized ones) in Section 4.7, 32 datasets (30 small to medium sized ones and two high dimensional ones) in Section 4.8, and 17 datasets (15 small to medium sized ones and two large sized ones) in Section 4.9.
All algorithms were repeated 10 times on each dataset (except MNIST and HIGGS). For each experiment, we first randomly divided the data into 60% training, 20% validation and 20% test. For MNIST or HIGGS, we randomly divided the data into 80% training, 20% validation, and used the original test data for testing. Next, we used the validation set to select the best parameters. Then, the training set and validation set were combined to train a model with these parameters. Finally, we verified its performance on the test set. For RandomForest and Extra-Trees, we used the out-of-bag error to select their parameters.
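The 60%/20%/20% partitioning above can be sketched as follows (an illustrative helper of our own, not the paper's code):

```python
import random

def split_indices(n, seed=0):
    """Randomly split n sample indices into 60% training,
    20% validation and 20% test subsets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    a, b = int(0.6 * n), int(0.8 * n)
    return idx[:a], idx[a:b], idx[b:]
```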
We used the classification accuracy and the root mean squared error (RMSE) as the main performance measure for classification and regression, respectively. We also computed the average rank and training time (including the process of parameter selection) for each algorithm on each dataset. For K algorithms, the best one has rank 1, and the worst has rank K. We used seconds as the unit of time in our experiments and 50 processing cores to run the experiments on a server with two Intel(R) Xeon(R) Platinum 8276 CPUs. Additionally, we used bytes as the unit of model size in our experiments.
To validate whether BoostForest significantly outperformed the baselines (α = 0.05), we first calculated the p-values using the standard t-test, and then performed Benjamini-Hochberg False Discovery Rate (BH-FDR) correction [48] to adjust them. The statistically significant results are marked by •.
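The BH-FDR adjustment can be sketched in a few lines of plain Python (the function `bh_fdr` is our illustration, not the paper's code):

```python
def bh_fdr(p_values):
    """Benjamini-Hochberg FDR correction: return adjusted p-values.
    Each p-value is scaled by m/k (k = its ascending rank), then a
    running minimum from the largest p-value down enforces
    monotonicity of the adjusted values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    prev = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        k = m - rank_from_end          # 1-based rank of p_values[i]
        prev = min(prev, p_values[i] * m / k)
        adjusted[i] = prev
    return adjusted
```

A test is then declared significant when its adjusted p-value is below α = 0.05.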

Generalization Performance of BoostForest
First, we compared the generalization performance of BoostForest with eight baselines. The results are shown in Tables 2 and 3. Table 2 shows that BoostForest achieved the best generalization performance on 17 out of the 30 datasets, the fastest average time in regression, the second fastest average time in classification, and the best average performance, standard deviation and rank in both classification and regression. Table 3 shows that BoostForest achieved the best generalization performance on nine out of the 11 datasets, the best average performance, standard deviation and rank, and the second fastest average time.
On average, BoostForest achieved the best classification accuracy or regression RMSE. It significantly outperformed RandomForest on 19 datasets, Extra-Trees on 15 datasets, XGBoost on 14 datasets, LightGBM on 12 datasets, MultiBoosting on nine datasets, GBDT-PL on 10 datasets, DeepBoosting on eight datasets, and FilterBoost on seven datasets.
It is also important to note that BoostForest used the default parameter settings on all datasets, and did not use a validation set to select the best parameters. So, it is very convenient to use. Additionally, each BoostTree is trained with a different training set, so BoostForest can be easily parallelized to further speed it up.
One limitation of BoostForest is that its average model size was the largest, so it may not be easily deployed to low storage devices. One of our future research directions is to reduce its model size.

Generalization Performance w.r.t. the Number of Base Learners
The number of BoostTrees in BoostForest needs to be specified, so it is important to study how its performance changes with this number.
On each dataset, we gradually increased the number of base learners from three to 250, and tuned other parameters of the four baselines by using the validation set (for XGBoost and LightGBM) or the out-of-bag error (for RandomForest and Extra-Trees). Fig. 1 shows the accuracies of the five algorithms on the last four classification datasets. Complete results on all 15 classification datasets are shown in Supplementary Fig. 1. Generally, as the number of base learners increased, the performances of all ensemble learning approaches quickly converged. BoostForest achieved the highest classification accuracy on 11 of the 15 datasets, and the second highest on another three (MV1, BD, and PID). Fig. 2 shows the RMSEs of the five algorithms on the last four regression datasets. Complete results on all 15 regression datasets are shown in Supplementary Fig. 2. Again, as the number of base learners increased, generally the performances of all algorithms quickly converged. BoostForest achieved the smallest RMSE on seven of the 15 datasets.
In summary, BoostForest has very good generalization performance w.r.t. the number of base learners, and it generally converges faster than the four baselines (RandomForest, Extra-Trees, XGBoost, and LightGBM).

Generalization Performance w.r.t. the Base Learner Model Complexity
We also evaluated the generalization performance of the five ensemble approaches, as the base learner model complexity increases.
The base learner model complexity was controlled by the maximum number of leaves MaxNumLeaf per tree, which was gradually increased from two to 32 for classification, and from two to 256 for regression. We fixed the number of base learners at 250, and tuned other parameters of the four baselines by using the validation set (for XGBoost and LightGBM) or the out-of-bag error (for RandomForest and Extra-Trees). Fig. 3 shows the accuracies of the five algorithms on the last four classification datasets. Complete results on all 15 classification datasets are shown in Supplementary Fig. 3. BoostForest achieved the highest classification accuracy on 12 of the 15 datasets. Fig. 4 shows the average RMSEs of the five algorithms on the last four regression datasets. Complete results on all 15 regression datasets are shown in Supplementary Fig. 4. On most datasets, the performances of all algorithms improved as the maximum number of leaves per tree increased. BoostForest achieved the smallest RMSE on seven of the 15 datasets, and the second smallest RMSE on the NO dataset.
These results suggest that BoostForest generalizes well as the base learner model complexity increases.

Generalization Performance on Large Datasets
Previous experiments have shown the superiority of BoostForest on small to medium sized datasets with not very high dimensionalities. This subsection investigates its performance on five datasets with a large number of samples and/or features. Table 4 compares the performance of BoostForest with four classical and popular ensemble methods on three classification datasets and two regression datasets. For the SUSY dataset, we set min_samples_leaf = 2 and MaxNumLeaf = 20,000 in each BoostTree.
BoostForest still demonstrated superior and consistent performance: it achieved the highest average accuracy and the second smallest average time in classification, and the lowest average RMSE and the third smallest average time in regression.
Considering that BoostForest's average model size was the largest and its base learners were more complex than baselines', we also compared BoostForest with Bagging-LightGBM, which uses Bagging to integrate 600 LightGBMs, and BaggedTAO-l, which also uses Bagging to integrate multiple complex trees. To increase BaggedTAO-l's training speed on YB and LR, we selected the cut-point from the set of each feature's {10, 20, . . . , 90} quantiles to initialize each TAO-l in BaggedTAO-l. LMT and M5P are also complex trees, which are compared with BoostForest in Section 4.8.
Supplementary Table 3 compares BoostForest with Bagging-LightGBM. BoostForest significantly outperformed Bagging-LightGBM on 15 datasets, achieving higher accuracies on 14 of the 18 classification datasets, and lower RMSEs on 11 of the 17 regression datasets. These results indicate that BoostForest can still achieve promising performance even when its average model size is smaller than Bagging-LightGBM's.
Supplementary Table 4 compares BoostForest with BaggedTAO-l. BoostForest significantly outperformed BaggedTAO-l on four datasets, and achieved higher accuracies on 16 of the 17 classification datasets. BaggedTAO-l had a smaller average model size, because TAO can use the sparsity penalty to remove unnecessary parameters. This idea may also be used to reduce BoostForest's model size.

Use Other Base Learners in BoostForest
Next, we studied if the strategy that BoostForest uses to combine multiple BoostTrees (data replica by bootstrapping, and random parameter selection from a parameter pool) can also be extended to other tree models, i.e., whether we can still achieve good ensemble learning performance when BoostTree is replaced by another tree model.
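The two ingredients of this strategy, bootstrapping a data replica per tree and drawing each tree's hyper-parameters at random from a pool, can be sketched generically (the `base_learner` factory and the pool contents below are illustrative, not the paper's implementation):

```python
import random

def train_forest(X, y, base_learner, param_pool, n_trees=100, seed=0):
    """Generic BoostForest-style ensemble: each base learner is
    trained on a bootstrap replica of the data, with hyper-parameters
    drawn randomly from the parameter pool."""
    rng = random.Random(seed)
    n = len(X)
    forest = []
    for _ in range(n_trees):
        # 1) data replica by bootstrapping (sampling with replacement)
        idx = [rng.randrange(n) for _ in range(n)]
        Xb, yb = [X[i] for i in idx], [y[i] for i in idx]
        # 2) random parameter selection from the parameter pool
        params = {name: rng.choice(values)
                  for name, values in param_pool.items()}
        forest.append(base_learner(Xb, yb, **params))
    return forest

def forest_predict(forest, x):
    """Average the base learners' predictions (regression)."""
    return sum(f(x) for f in forest) / len(forest)
```

Replacing `base_learner` with a routine that grows a BoostTree, an Extra-Tree, or an M5P tree yields the corresponding forests studied in this section.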
Supplementary Table 5 shows the results. ETForest outperformed Extra-Tree on 30 of the 32 datasets. ModelForest outperformed M5P on all 16 regression datasets. BoostForest outperformed BoostTree on all 32 datasets. These results indicate that our proposed strategy for integrating BoostTrees into BoostForest can also be used to integrate Extra-Tree and M5P into a composite learner with improved performance.
However, LMForest only outperformed LMT on six of the 16 classification datasets, though it achieved better average performance than LMT, i.e., our proposed strategy is not very effective in combining multiple LMTs to further improve their performance. Additionally, using Python to call LMT's and M5P's Java APIs to generate the base learners in parallel is not very efficient, so the average time of LMForest and ModelForest was relatively long.
In summary, BoostForest achieved better average classification performance than LMForest and ETForest, and also better average regression performance than ModelForest and ETForest, indicating that BoostTree is a more effective base learner for our proposed ensemble strategy than Extra-Tree, LMT and M5P.

Use Other Regression Models in BoostTree
We also studied if other more complex and nonlinear regression algorithms, e.g., ELM and SVR, can be used to replace RR as the node function in BoostTree. The resulting trees are denoted as BoostTree-ELM and BoostTree-SVR, respectively, and the corresponding forests as BoostForest-ELM and BoostForest-SVR.
ELM [20] is a single hidden layer neural network. It randomly generates the hidden nodes, and analytically determines the output weights through generalized inverse or RR. Its model complexity can be controlled by the number of hidden nodes NumHiddenNodes and the regularization coefficient λ of RR. We set their candidate values to {10, 20, 30, 40} and {0.0001, 0.001, 0.01, 0.1, 1}, respectively, to construct the parameter pool. The sigmoid activation function was used in the hidden layer.
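A minimal ELM sketch along these lines (the class and parameter names are ours; this is an illustration of the random-hidden-layer plus ridge-regression idea, not the paper's implementation):

```python
import numpy as np

class ELM:
    """Minimal Extreme Learning Machine: a random sigmoid hidden
    layer whose output weights are solved analytically by ridge
    regression (RR)."""
    def __init__(self, num_hidden=20, lam=0.01, seed=0):
        self.num_hidden, self.lam = num_hidden, lam
        self.rng = np.random.default_rng(seed)

    def fit(self, X, y):
        d = X.shape[1]
        # Hidden weights and biases are drawn randomly and never trained.
        self.W = self.rng.normal(size=(d, self.num_hidden))
        self.b = self.rng.normal(size=self.num_hidden)
        H = 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))  # sigmoid
        # Output weights via ridge regression: (H'H + lam*I) beta = H'y.
        A = H.T @ H + self.lam * np.eye(self.num_hidden)
        self.beta = np.linalg.solve(A, H.T @ y)
        return self

    def predict(self, X):
        H = 1.0 / (1.0 + np.exp(-(X @ self.W + self.b)))
        return H @ self.beta
```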
These results showed that when more complex and nonlinear regression algorithms, such as ELM and SVR, are used to replace RR as the node function in BoostTree, the performance may degrade due to overfitting; however, integrating the corresponding BoostTrees into a BoostForest can always improve the performance.
We also explored whether a convolutional neural network (CNN) or a deep neural network (DNN) can be used as the node function in BoostForest. The resulting forests are denoted as BoostForest-CNN and BoostForest-DNN, respectively. Supplementary Fig. 5 shows the structures of the CNN and DNN used in our experiments. We initialized each node model of BoostForest-CNN or BoostForest-DNN with its parent node model's weights, scaled its output values to (−4, 4) to be consistent with the range of the pseudo-labels in (24), and trained it for 20 epochs.
For the MNIST dataset, we set min_samples_leaf = 1,024, the batch size to 1,024, and the tree depth to 5 in each BoostTree, and scaled the minimum and maximum values of each feature to 0 and 1, respectively. For the HIGGS dataset, we set min_samples_leaf = 2,048, the batch size to 2,048, and the tree depth to 5 in each BoostTree, and randomly selected 2,000,000 samples to train each base learner in BoostForest-DNN and Bagging-DNN, respectively. If the number of samples belonging to the splitting node was larger than 100,000, we randomly selected 100,000 samples to identify the cut-point. When training the CNN or DNN, we adopted AdamW (https://pytorch.org/docs/stable/generated/torch.optim.AdamW) with betas of (0.9, 0.999), an initial learning rate of 0.01, and a weight decay of 0.0001.
Table 5 compares the performances of BoostForest-CNN and BoostForest-DNN with six approaches on MNIST and HIGGS, respectively. With less training time, BoostForest-CNN and BoostForest-DNN achieved higher accuracies than Bagging-CNN_150 and Bagging-DNN_150, respectively. With smaller model sizes, BoostForest-CNN (BoostForest-DNN) achieved higher accuracies than RandomForest, Extra-Trees, Bagging-XGBoost, Bagging-LightGBM, and Bagging-CNN_3 (Bagging-DNN_3). These results indicate that complex CNNs and DNNs can also be used as node functions in BoostForest.

Robustness of BoostForest
To investigate the robustness of BoostForest, we performed extensive experiments to study how its performance changed with different hyper-parameters. Additionally, we also studied the effect of the clipping operation in (17) and the splitting criterion.

Effect of min_samples_leaf and λ
There are mainly two hyper-parameters in our proposed BoostForest: min_samples_leaf and λ. Supplementary Table 7 shows the average performances of BoostForest, as min_samples_leaf increased from 5 to 15, and λ increased from 0.0001 to 1. Randomly selecting each BoostTree's parameters from the parameter pool to form a BoostForest achieved average performance comparable with using the best parameters, in both regression and classification, indicating that random parameter pool sampling can simplify BoostForest's parameter selection process.

Effect of the Clipping Operation in Regression
Clipping makes BoostForest more robust to noise and outliers in regression. To verify the effect of the clipping operation, we compared the performance of "BoostForest (w/o clipping)" with "BoostForest (w/ clipping)". Supplementary Table 8 shows the results. BoostForest (w/ clipping) outperformed BoostForest (w/o clipping) on most datasets. When the data noise was small, which means BoostForest can achieve a small RMSE, clipping made BoostForest too conservative to achieve better performance, e.g., on the AQ dataset.
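The exact form of the clipping operation in (17) is not reproduced in this section; the sketch below only illustrates the general idea of bounding a fitted node model's output by the range of the pseudo-labels it was trained on (the helper name is ours):

```python
def make_clipped(predict, residuals):
    """Wrap a fitted node model so that its output is bounded by the
    range of its training pseudo-labels (residuals). A single outlier
    can then no longer drag a prediction outside the observed range."""
    lo, hi = min(residuals), max(residuals)
    return lambda x: min(max(predict(x), lo), hi)
```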
Compared with the baselines in Table 2, even BoostForest (w/o clipping) achieved a better average RMSE than RandomForest, Extra-Trees, XGBoost, LightGBM and GBDT-PL, indicating that BoostForest (w/o clipping) is also an effective model, though BoostForest (w/ clipping) is more effective.

Effect of the Splitting Criterion
We also studied whether Gini-SC, C4.5-SC (usually used in C4.5 and LMT) or MSE-SC (usually used in CART and Extra-Trees) can replace XGB-SC in BoostForest.
Supplementary Table 9 shows BoostForest's performances of using different splitting criteria. Using Gini-SC, C4.5-SC and XGB-SC achieved comparable average accuracies in classification. Using XGB-SC achieved better average RMSE than using MSE-SC in regression.

CONCLUSIONS AND FUTURE RESEARCH
This paper has proposed a new tree model, BoostTree, that integrates GBM into a single model tree. BoostTree trains a regression model (for regression or binary classification) or multiple regression models (for multi-class classification) at each node. For a given input, BoostTree first sorts it down to a leaf, then computes the final prediction by summing up the outputs of all node models along the path from the root to that leaf.
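The prediction rule described above, summing the node-model outputs along the root-to-leaf path, can be sketched as follows (a toy node structure of our own, not the paper's implementation):

```python
class Node:
    """Toy BoostTree node: a node model plus an axis-aligned split.
    A leaf has left = right = None."""
    def __init__(self, model, feature=None, cut=None,
                 left=None, right=None):
        self.model, self.feature, self.cut = model, feature, cut
        self.left, self.right = left, right

def boosttree_predict(node, x):
    """Sort x down the tree, summing every node model's output along
    the path from the root to the reached leaf."""
    total = 0.0
    while node is not None:
        total += node.model(x)
        if node.left is None:  # reached a leaf
            break
        node = node.left if x[node.feature] <= node.cut else node.right
    return total
```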
Using BoostTrees as base learners, we also proposed a new ensemble learning approach, BoostForest. It first uses bootstrap to obtain multiple replicas of the training set, and then trains a BoostTree on each replica. Its hyper-parameters are easy to tune. Moreover, it represents a very general ensemble learning framework, whose base learners can be any tree model, e.g., BoostTree, Extra-Tree, M5P, or LMT, or even a mixture of different models.
BoostForest performs favorably over several ensemble learning approaches, e.g., RandomForest, Extra-Trees, XG-Boost, LightGBM, and GBDT-PL, in both classification and regression, and also MultiBoosting, FilterBoost and Deep-Boosting in classification. BoostForest simultaneously uses two randomness injection strategies: 1) data sample manipulation through bootstrapping; and, 2) input feature manipulation through randomly drawing cut-points at node splitting. Output representation manipulation will be considered in our future research.
Zhou and Feng [24] showed that Random Forests can be assembled into a Deep Forest to achieve better performance. As we have demonstrated that BoostForest generally outperforms Random Forest, it is also expected that replacing Random Forests in Deep Forest by BoostForests may result in better performance. This is also one of our future research directions.
BoostForest has two main limitations: 1) BoostForest cannot handle NULL values, because it needs all features to train a node model. 2) BoostForest may not be deployed to low storage devices, due to its large model size.
Therefore, our future research will: 1) improve BoostTree to handle NULL values; and 2) reduce BoostForest's model size.

APPENDIX A SUPPLEMENTARY ALGORITHMS
This section presents pseudo-code of additional algorithms introduced in this paper.
Algorithm: Fitting an ELM node model.
Input: {(x_n, y_n)}_{n∈I_c}, sample set of the current node;
F_m, ensemble of the models along the path from the root node to the parent node of the current node;
λ, the regularization parameter;
M, the number of hidden nodes.
Output: The ELM model f_c for the current node.
ỹ_n = y_n − F_m(x_n), n ∈ I_c;
Fit f_c = ELM({(x_n, ỹ_n)}_{n∈I_c}, λ, M) using ELM with regularization parameter λ and M hidden nodes;
Clip f_c using (18).

Algorithm: Fitting an SVR node model.
Input: {(x_n, y_n)}_{n∈I_c}, sample set of the current node;
F_m, ensemble of the models along the path from the root node to the parent node of the current node;
C, the regularization parameter of SVR;
ε, the slack variable of SVR.
Output: The SVR model f_c for the current node.
ỹ_n = y_n − F_m(x_n), n ∈ I_c;
Fit f_c = SVR({(x_n, ỹ_n)}_{n∈I_c}, C, ε) using SVR;
Clip f_c using (18).

APPENDIX C SUPPLEMENTARY TABLES
Supplementary Table 6: Mean and standard deviation (in parentheses) of the regression RMSE, when ELM or SVR is used to replace RR in BoostTree. The best performance is marked in bold. • indicates a statistically significant win for BoostForest-ELM or BoostForest-SVR.
Supplementary Table 7: Average performances of BoostForest on the 30 datasets, w.r.t. min_samples_leaf (MSL) and λ. The best performance is marked in bold.

Mean and standard deviation (in parentheses) of the classification accuracy.