A Comparative Study of Forest Methods for Time-to-Event Data: Variable Selection and Predictive Performance

doi:10.21203/rs.3.rs-439232/v1

Research Article

This work is licensed under a CC BY 4.0 License

Background

Forest-based methods, widely used in machine learning, are an attractive alternative to the Cox model for time-to-event data. Random survival forests (RSF) are the most popular survival forest method, but they have drawbacks, such as a selection bias towards covariates with many possible split points. Conditional inference forests (CIF) are known to reduce this selection bias through a two-step split procedure based on hypothesis tests, which separates variable selection from splitting, but they are computationally expensive. The recently proposed random forests with maximally selected rank statistics (MSR-RF) appear to improve substantially on both RSF and CIF.

Methods

In this paper, we used a simulation study and a real data application to compare the prediction and variable selection performance of three survival forest methods: RSF, CIF and MSR-RF. To evaluate variable selection, we pooled all simulation runs and calculated how often the correct variables ranked at the top according to the variable importance measure; a higher frequency indicates better selection ability. We used the Integrated Brier Score (IBS) to measure the prediction accuracy of all three methods; the smaller the IBS, the better the prediction.
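The selection-frequency criterion can be illustrated with a minimal Python sketch. It is not the code used in the study: the function name `selection_frequency`, the toy importance values, and the choice to define "the top" as the top k variables (with k equal to the number of truly informative covariates) are all assumptions made for illustration.

```python
import numpy as np


def selection_frequency(importances, true_idx, top_k=None):
    """Fraction of simulation runs in which each true variable ranks in the top k.

    importances : array of shape (n_simulations, n_variables) holding one
                  variable-importance vector per simulation run.
    true_idx    : indices of the covariates that truly affect survival.
    top_k       : size of the "top" set; defaults to len(true_idx)
                  (an assumption, the abstract does not state the cutoff).
    """
    importances = np.asarray(importances, dtype=float)
    true_idx = list(true_idx)
    k = len(true_idx) if top_k is None else top_k

    # Rank variables within each run, largest importance first.
    order = np.argsort(-importances, axis=1)
    top = order[:, :k]

    # For each true variable, count the runs in which it appears in the top k.
    return {j: float(np.mean(np.any(top == j, axis=1))) for j in true_idx}


# Hypothetical importance values from four simulation runs with four covariates;
# covariates 0 and 1 are taken to be the truly informative ones.
vimp = np.array([
    [0.30, 0.20, 0.01, 0.02],
    [0.25, 0.01, 0.15, 0.03],
    [0.40, 0.22, 0.05, 0.00],
    [0.10, 0.35, 0.02, 0.04],
])
freq = selection_frequency(vimp, true_idx=[0, 1])
print(freq)
```

In this toy input, covariate 0 is in the top two in every run and covariate 1 in three of the four runs, so a method producing these importances would score higher on covariate 0 than on covariate 1.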

Results

1. Simulations show that the three forest methods differ only slightly in prediction performance. Real data results show that each of the three forest methods has advantages in different scenarios.

2. For variable selection performance,

1) MSR-RF and CIF have a higher selection frequency than RSF when the simulated datasets contain multiple categorical variables, and CIF performs especially well when an interaction term is present.

2) Forest methods appear suitable for data with correlated covariates, as the selection frequency fluctuates only slightly when the degree of correlation changes.

3) RSF and MSR-RF outperform CIF when all covariates are binary. MSR-RF outperforms RSF and CIF when all covariates are continuous.

4) MSR-RF remains relatively robust as the number of variables increases, whereas CIF performs poorly.

Conclusions

Each of the three forest methods has its own advantages in different situations. In practice, it is important to choose the appropriate method based on the characteristics of the covariates.

Biostatistics

Survival Analysis

Random Survival Forest

Conditional Inference Forest

Maximally Selected Rank Statistics

Machine Learning

Variable Selection

Brier Score