Diversity enhancement random forest model for the risk identification of disease deterioration

Background: Random forest (RF) is a powerful ensemble algorithm for medical decision support (MDS). However, the requirements of higher accuracy and a smaller ensemble size remain significant burdens for current RF models, particularly for the risk identification of disease deterioration.

Inspired by [20], we propose a diversity enhancement random forest model to improve the identification rate of disease deterioration risk and to reduce the size of the random forest. Based on the above discussion, this paper aims to select the best trees, in terms of individual strength (i.e. accuracy) and diversity, from a large ensemble grown by random forest. The results of the new method are compared with those of random forest (RF), Extreme Tree (ET) and the ensemble of optimal trees (OTE) on benchmark data sets. The main contributions of this paper are summarized as follows: (1) an improved random forest model with higher predictive accuracy and a smaller ensemble size is proposed to identify disease deterioration risk; (2) both public benchmark data sets from KEEL and data sets collected from tertiary hospitals over the last three years are used to evaluate the proposed model.
The rest of the paper is organized as follows. The proposed diversity enhancement random forest (DERF), the underlying algorithm and some other related approaches are given in Sect. 2, and experiments and results based on the benchmark and collected data sets are given in Sect. 3. Finally, Sect. 4 concludes the paper.

Methods
Diversity enhancement is an effective way to optimize a random forest by selecting good but different decision trees. The principle of the proposed DERF is described as follows. First, the random forest is optimized by the OOB error rate. Then, the logistic loss function is introduced to evaluate the diversity of the OOB-optimized random forest.

Out of Bag (OOB) optimization
Out of Bag (OOB) optimization is utilized as the first-round optimization of the random forest; it can not only reduce the ensemble size but also improve the prediction performance. The principle of this method is described as follows. The random forest model utilizes Bootstrap sampling to select training samples for each decision tree.
Assuming that the size of the training sample is $m$, the Bootstrap sampling method randomly selects $m$ samples with replacement from the training set. The probability of each sample being selected in a single draw is $1/m$, which means that the probability of not being selected is $1 - \frac{1}{m}$. If the sampling is repeated $n$ times, the probability that a given sample is never selected can be expressed as $q = \left(1 - \frac{1}{m}\right)^{n}$. When the training sample size $m$ is large enough (and $n = m$), the limit of $q$ can be represented as $\lim_{m \to \infty} \left(1 - \frac{1}{m}\right)^{m} = e^{-1} \approx 0.368$, so roughly one third of the training samples are left out of each bootstrap sample; these out-of-bag samples are used to estimate the error of the corresponding tree. In the second round, the logarithmic (logistic) loss of each tree is evaluated on held-out data, for binary classification $-\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$, where $p_i$ is the predicted probability of the positive class; the prediction of a tree is perfect if the logarithmic loss is 0.
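As a quick numerical check, the following minimal Python sketch (the sample sizes are arbitrary illustration values, not taken from the paper) shows how $q$ approaches $e^{-1}$ as $m$ grows:

```python
# Numerical check of the OOB probability: with n = m bootstrap draws,
# q = (1 - 1/m)^m approaches e^{-1} ~ 0.368 as m grows.
import math

for m in (10, 100, 1_000, 10_000):
    q = (1 - 1 / m) ** m
    print(f"m = {m:>6}: q = {q:.4f}   (limit 1/e = {math.exp(-1):.4f})")
```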

Random step size greedy backward search
Considering that finding the optimal trees in the forest is an NP-complete problem, we use a heuristic algorithm, a random step size greedy backward search, to find a relatively good sub-forest.
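A minimal Python sketch of one way such a random step size greedy backward search could be organized is given below; the helper name ensemble_loss and the step-size/trial-budget parameters are illustrative assumptions, not the authors' exact implementation.

```python
import random

def greedy_backward_search(trees, X_val, y_val, ensemble_loss,
                           max_step=5, max_trials=50, seed=0):
    """Drop trees in random-sized steps while the validation loss does not worsen.

    trees         -- list of fitted decision trees (the current sub-forest)
    ensemble_loss -- callable(trees, X, y) -> float, e.g. the logarithmic loss of
                     the averaged class probabilities (an assumed helper)
    """
    rng = random.Random(seed)
    selected = list(trees)
    best_loss = ensemble_loss(selected, X_val, y_val)
    trials = 0
    while trials < max_trials and len(selected) > 1:
        # random step size: try to remove between 1 and max_step trees at once
        step = rng.randint(1, min(max_step, len(selected) - 1))
        drop = set(rng.sample(range(len(selected)), step))
        candidate = [t for i, t in enumerate(selected) if i not in drop]
        loss = ensemble_loss(candidate, X_val, y_val)
        if loss <= best_loss:      # keep the removal only if the sub-forest is no worse
            selected, best_loss = candidate, loss
            trials = 0             # reset the trial budget after a successful step
        else:
            trials += 1            # otherwise count a failed attempt
    return selected
```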

The diversity enhancement random forest algorithm
The framework of the DERF algorithm is as follows:
1. Take T bootstrap samples from the given portion S1 of the training data TR = (S1, S2).
2. Grow a classification tree on each bootstrap sample using the random forest method.
3. Rank the trees with respect to their out-of-bag (OOB) errors on the OOB data set and keep the M trees with the lowest individual OOB errors.
4. Calculate the logarithmic loss of each of the M trees on the subset S2.
5. Sort the M trees in ascending order of their logarithmic loss values.
6. Select the best trees with the K smallest logarithmic loss values from the M trees and build the final random forest using the greedy stepwise backward search.
The proposed DERF method is evaluated on the five disease data sets. 90% of the total data (TR) is used as training data and the remaining 10% as test data. 90% of the TR data (S1) is utilized to generate a certain number of independent classification and regression trees using the bootstrap method, along with randomly selecting a certain number of features for splitting the nodes of the trees. The remaining 10% of the TR data (S2) is employed to check the diversity of the trees using the logarithmic loss function.
Further, the greedy stepwise backward search method is applied to improve the performance of the random forest.
Fig. 1 The framework of the DERF algorithm.
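Steps 1-5 of the frame above can be condensed into the following Python sketch built on scikit-learn; the split sizes, tree counts and hyper-parameter defaults are illustrative assumptions, and the greedy backward search sketched earlier would then be applied to the returned candidate trees.

```python
# Sketch of steps 1-5 of the DERF frame: grow T trees on bootstrap samples of S1,
# keep the M trees with the lowest individual OOB errors, then rank them by
# logarithmic loss on the diversity-check set S2.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import log_loss

def grow_and_rank_trees(X_s1, y_s1, X_s2, y_s2, T=500, M=100, d="sqrt", seed=0):
    rng = np.random.default_rng(seed)
    n = len(X_s1)
    scored = []
    for t in range(T):
        # steps 1-2: bootstrap sample of S1 and a tree with random feature selection
        idx = rng.integers(0, n, n)
        oob = np.setdiff1d(np.arange(n), idx)
        tree = DecisionTreeClassifier(max_features=d, random_state=seed + t)
        tree.fit(X_s1[idx], y_s1[idx])
        # step 3: individual OOB error of the tree
        oob_err = 1.0 - tree.score(X_s1[oob], y_s1[oob]) if len(oob) else 1.0
        scored.append((oob_err, tree))
    scored.sort(key=lambda pair: pair[0])          # lowest OOB error first
    best_m = [tree for _, tree in scored[:M]]
    # steps 4-5: logarithmic loss of each surviving tree on S2
    losses = []
    for tree in best_m:
        # clip probabilities so fully grown trees cannot yield an infinite loss
        p = np.clip(tree.predict_proba(X_s2)[:, 1], 1e-6, 1 - 1e-6)
        losses.append(log_loss(y_s2, p))
    order = np.argsort(losses)
    return [best_m[i] for i in order]              # candidates for the greedy selection
```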

Results
To assess the performance of DERF, seven data sets are introduced. Five disease binary classification data sets are downloaded from KEEL (Knowledge Extraction based on Evolutionary Learning). To further verify the effectiveness of the model, another two data sets, on acute exacerbation of chronic obstructive pulmonary disease (AECOPD) and chronic respiratory disease (CRD), collected from tertiary hospitals over the last three years, are used.
A brief summary of the data sets is given in Table 1.
First, we investigate the factors that influence classification accuracy. These factors include the total number of trees T, the percentage M of best trees, and the number of features d. For the sake of simplicity, we first probe the optimal total number T of trees grown before the selection process. Then, we explore the ratio M of best trees checked by the OOB errors. Next, we search for the number of features d used for node splitting. Finally, we depict the ensemble size reduction effect of DERF. We implement the DERF classifier on the Python 3.6 development platform.
Considering stability and algorithmic efficiency, large values are recommended for the size of the initial set, subject to the available computation resources, and a value of T ≥ 500 is expected to work well in general. Figure 2 illustrates the effect of the number of trees in the initial set on the classification accuracy of DERF for the given data sets.
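As a rough illustration of this recommendation, the following sketch tracks the OOB accuracy of a plain random forest as T grows; it uses a synthetic data set generated with scikit-learn rather than the KEEL or hospital data, so the numbers are only indicative.

```python
# Illustration of how performance stabilizes as the initial forest size T grows.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
for T in (50, 100, 200, 500, 1000):
    rf = RandomForestClassifier(n_estimators=T, oob_score=True, random_state=0)
    rf.fit(X, y)
    print(f"T = {T:>4}: OOB accuracy = {rf.oob_score_:.4f}")
```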
The effect of the number of features selected at random for splitting the nodes of the trees on the classification accuracy was also investigated for the data sets; the results are shown in Fig. 4. The main reason that random forest is considered an improvement over bagging is the additional randomness introduced by randomly selecting a subset of features for splitting the nodes of the trees. The effect of this randomness can be seen in Fig. 4, where different values of d result in different classification accuracies for the data sets. For example, in the case of the heart data, selecting a higher value of d adversely affects the performance. For some data sets, spectfheart for example, selecting a large d results in better performance.
Fig. 2 The effect of the number of trees in the initial set on classification accuracy for the given data sets using DERF, OTE, ET and RF.
Figure 5 depicts the relationship between the number of decision trees in the initial set and the number of trees in the final DERF ensemble on the KEEL data sets. If the final ensemble size of DERF were half of the initial size, the plotted points would lie on the line $y = \frac{1}{2}x$. We can infer from Figure 5 that, on the KEEL data sets, the final ensemble size of the DERF model is less than half of the number of initial decision trees, which means that the DERF model reduces the ensemble size by more than half.
Experiments on the KEEL data sets show that the DERF model can reduce the ensemble size of the random forest without compromising prediction accuracy. Tables 2, 3 and 4 show the classification accuracy and final ensemble size produced by the proposed DERF on the KEEL data sets. Overall, we found that the proposed DERF method performed better than the other methods on three data sets (saheart, mammographic and spectfheart), and was comparable to the other methods on the remaining two data sets (heart and hepatitis). The numbers in parentheses represent the number of trees in the random forest; a smaller number means a smaller ensemble size.
We found that the proposed DERF model generally used a smaller ensemble size to achieve better performance than OTE, ET and RF. The result of the best performing method for each data set is shown in bold. In addition to the experiments on the KEEL data sets, we also evaluated DERF on the acute exacerbation of chronic obstructive pulmonary disease (AECOPD) and chronic respiratory disease (CRD) data sets that we collected, as shown in Figures 6 and 7. Figure 6 illustrates the effect of the number of trees in the initial set on classification accuracy for the AECOPD and CRD data sets using DERF, OTE, ET and RF. It can be seen that the DERF algorithm has the best prediction accuracy. DERF obtains the best prediction accuracy when the number of trees is between 0 and 500, and achieves the same predictive performance as the RF model when the number of decision trees exceeds 500. Figure 7 demonstrates the relationship between the number of decision trees in the initial set and the number of decision trees in the final DERF ensemble on the AECOPD and CRD data sets.

Figures 6 and 7 show that DERF outperforms the OTE, ET and RF models, which suggests that DERF achieves better performance for disease risk monitoring and the prediction of emergency care visit peaks.
Fig. 6 The effect of the number of trees in the initial set on classification accuracy for the AECOPD and CRD data sets using DERF, OTE, ET and RF.
Availability of data and materials
The datasets used during the current study are available from the corresponding author on reasonable request.