When Choosing the Best Subset Is Not the Best Choice

doi:10.21203/rs.3.rs-743866/v1

Research Article

https://doi.org/10.21203/rs.3.rs-743866/v1

This work is licensed under a CC BY 4.0 License

Version 1

Background: Variable selection in linear regression settings is a much-discussed problem. Best subset selection (BSS) is often considered an intuitively appealing ‘gold standard’, with its use being restricted mainly by its NP-hard nature. Instead, alternatives such as the least absolute shrinkage and selection operator (Lasso) or the elastic net (Enet) have become methods of choice in high-dimensional settings. A recent proposal represents BSS as a mixed integer optimization problem, so that much larger problems have become feasible in reasonable computation time. This has been exploited to study the prediction performance of BSS and its competitors. Here, we instead present an extensive simulation study assessing the variable selection performance of BSS compared to forward stepwise selection (FSS), Lasso and Enet. The analysis considers a wide range of settings that are challenging with regard to dimensionality, signal-to-noise ratio and correlations between relevant and irrelevant direct predictors. As a measure of performance we used the best possible F1 score for each method, so as to ensure a fair comparison irrespective of any criterion for choosing the tuning parameters.
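The "best possible F1 score" idea can be illustrated with a minimal sketch. It assumes that each method yields a sequence of candidate variable sets, one per tuning-parameter value, and that the true set of relevant variables is known, as it is in a simulation. The function names, the example sets and the handling of empty sets are hypothetical and are not taken from the study.

```python
from typing import Iterable, Set


def f1_selection(selected: Set[int], truth: Set[int]) -> float:
    """F1 score of a selected variable set against the true support.

    Uses F1 = 2*TP / (|selected| + |truth|), which is algebraically equal
    to the harmonic mean of precision and recall.
    """
    denom = len(selected) + len(truth)
    if denom == 0:
        return 0.0  # assumption: score 0 when both sets are empty
    true_positives = len(selected & truth)
    return 2 * true_positives / denom


def best_f1(path: Iterable[Set[int]], truth: Set[int]) -> float:
    """Best F1 over all candidate sets along a tuning-parameter path."""
    return max((f1_selection(s, truth) for s in path), default=0.0)


# Hypothetical example: variables 0, 1 and 2 are truly relevant.
truth = {0, 1, 2}
# Hypothetical supports returned by one method at four tuning values.
path = [{0}, {0, 3}, {0, 1, 3}, {0, 1, 2, 3, 4}]

scores = [f1_selection(s, truth) for s in path]
best = best_f1(path, truth)
print(scores, best)
```

Taking the maximum over the whole path removes the influence of any particular tuning rule (for example cross-validation or an information criterion), so methods are compared on what they could achieve at best.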

Results: Somewhat surprisingly, it was only in settings where the signal-to-noise ratio was high and the variables were (nearly) uncorrelated that BSS reliably outperformed the other methods. This was the case even in low-dimensional settings where the number of observations exceeded the number of variables by a factor of ten. Further, the FSS approach performed nearly identically to BSS.

Conclusion: Our results shed new light on the usual presumption that BSS is, in principle, the best choice for variable selection. More attention needs to be paid to the data-generating process when considering variable selection methods. Especially for correlated variables, convex alternatives like Enet are not only faster but also appear to be more accurate in practical settings.

Bioinformatics

variable selection

high dimensional

best subset selection

No competing interests reported.
