When Choosing the Best Subset Is Not the Best Choice

Background: Variable selection in linear regression settings is a much discussed problem. Best subset selection (BSS) is often considered an intuitively appealing 'gold standard', with its use being restricted mainly by its NP-hard nature. Instead, alternatives such as the least absolute shrinkage and selection operator (Lasso) or the elastic net (Enet) have become methods of choice in high-dimensional settings. A recent proposal represents BSS as a mixed integer optimization problem, so that much larger problems have become feasible in reasonable computation time. This has been exploited to study the prediction performance of BSS and its competitors. Here, we present an extensive simulation study assessing, instead, the variable selection performance of BSS compared to forward stepwise selection (FSS), the Lasso and the Enet. The analysis covers a wide range of settings that are challenging with regard to dimensionality, signal-to-noise ratio and correlations between relevant and irrelevant direct predictors. As a measure of performance we used the best possible F1 score for each method, so as to ensure a fair comparison irrespective of any criterion for choosing the tuning parameters. Results: Somewhat surprisingly, it was only in settings where the signal-to-noise ratio was high and the variables were (nearly) uncorrelated that BSS reliably outperformed the other methods. This was the case even in low-dimensional settings where the number of observations exceeded the number of variables by a factor of ten. Further, FSS performed nearly identically to BSS. Conclusion: Our results shed new light on the usual presumption that BSS is, in principle, the best choice for variable selection. More attention needs to be paid to the data-generating process when considering variable selection methods. Especially for correlated variables, convex alternatives such as the Enet are not only faster but also appear to be more accurate in practical settings.


Introduction
Selecting a subset of variables as direct predictors for an outcome is a much studied problem in regression modelling and has received renewed attention in the context of high-dimensional data, where some variable selection is unavoidable. It appears self-evident that best subset selection (BSS) [1][2][3] should be the gold standard for variable selection: clearly, if we assume there are s direct predictors and consider all combinations of variables up to a subset size k ≥ s, the true model has to be one of the candidate models, making BSS the obvious choice for variable selection. The main reason for dismissing BSS is that a naive implementation quickly becomes computationally infeasible with larger numbers of variables [1][4]. However, recent developments have cast a shadow on the performance of BSS even when computationally feasible. In their path-breaking work, Bertsimas et al. [5] formulated BSS as a mixed integer optimization (MIO) problem, pushing the boundary for the feasible number of variables p into the thousands while still searching over moderate subset sizes k. This now allows a better comparison of BSS with variable selection methods such as the popular least absolute shrinkage and selection operator (Lasso) [6] or variants thereof, e.g. the adaptive Lasso [7], the sparse-group Lasso [8] or the elastic net (Enet) and its adaptive version [9,10]. The convex nature of all these optimization problems enables quick computation even for millions of variables, making these methods the main choice in high-dimensional settings. While their theoretical and empirical performances have been studied in much detail [11][12][13][14][15], the question arises how they actually compare empirically to BSS in realistically high-dimensional settings.
A first extensive comparison of BSS with the Lasso and a simplified version of the relaxed Lasso [16] was carried out by Hastie et al. [4]. Perhaps surprisingly, the authors find that neither BSS nor the Lasso uniformly dominates the other, and moreover that forward stepwise selection is mostly as good as BSS, while the relaxed Lasso exhibits the best performance overall. An explanation may be found in the different bias-variance tradeoffs of the approaches. However, Hastie et al. [4] focus on the predictive performance of the methods, considering several popular metrics of predictive accuracy. Their results may therefore not hold up for variable selection performance, since different sets of selected variables can give very similar predictions but only one set of variables is the true set of direct predictors. Another recent simulation study using the MIO formulation was carried out by Takano et al. [17]. The authors compared BSS with the Lasso based on different optimization criteria determining the subset size k. However, in their study they only considered a low-dimensional setting with p = 100.
In the present paper we complement the above studies by specifically evaluating the selection performance of BSS compared to established variable selection approaches, gaining further insights into the properties of all the methods. Selecting the true direct predictors of an outcome is a distinct problem from prediction and often of substantial interest in its own right, for instance in genetics. We take advantage of the MIO formulation, which makes BSS feasible in practically relevant high-dimensional settings. While we choose a similar setting for our simulation study as Hastie et al. [4], we also extend their approach in several important ways: for wider applicability, we consider more complex situations than just the Toeplitz correlation structure, and we also choose different positions of the direct predictors within those correlation structures. All in all, our simulation has 270 different parameter combinations, and the results can be compared in an interactive web app that we created [18].
All methods evaluated here require choosing a tuning parameter or subset size, potentially affecting which and how many variables are selected. To enable a fair comparison, we chose for each method the optimal tuning parameter (or subset size) in terms of its best achievable F1-score. This allows us to assess the best possible variable selection performance, separating it from the issue of choosing a tuning parameter.
The paper is organized as follows: the methods section describes the selection procedures under investigation and their theoretical properties. Subsequently, we describe the set-up of the simulation study as well as our findings. Finally, we draw some conclusions and give an outlook on future work in the last section.

Methods
Given a vector of responses y ∈ ℝⁿ, a design matrix X ∈ ℝ^{n×p}, a vector of coefficients β ∈ ℝᵖ and a noise vector ε ∈ ℝⁿ with ε_i ~ N(0, σ²) independently for i = 1, …, n, we assume the linear model
$$y = X\beta + \varepsilon, \qquad (1)$$
where each column x_j, j = 1, …, p, has been standardized such that $\sum_{i=1}^n x_{i,j} = 0$ and $n^{-1}\sum_{i=1}^n x_{i,j}^2 = 1$. We further assume β to be sparse in the sense that for $s = \sum_{j=1}^p I(\beta_j \neq 0)$ we have s = O(n^c) for 0 < c < 1 [19,20]. We define b = (b_1, …, b_j, …, b_p)⊤ as the model vector that indicates the true model by its entries
$$b_j = I(\beta_j \neq 0).$$
Since β has to be sparse, most entries of b are 0. For selection we are interested in the entries of b rather than in the exact values of β. Therefore, let β̂ denote an estimator of β that we use to construct b̂ by setting its entries as
$$\hat{b}_j = I(\hat{\beta}_j \neq 0).$$
An estimator selects the correct variables if supp(b̂) = supp(b), where supp(·) denotes the support of a vector. We say that an estimator or procedure is selection consistent if supp(b̂) converges in probability to the true support. While the ordinary least squares (OLS) estimator is the best linear unbiased estimator for β when n > p, it is not useful for variable selection, where the aim is to discriminate zeros from non-zeros in β. For β_j = 0 it can be shown that the OLS estimate is β̂_j = O(n^{-1} log n) [21], i.e. it does not select a model for finite n, as estimated coefficients will not be exactly zero. In a high-dimensional setting with p > n the OLS estimate is not unique [22].
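As an illustration of this setup, the following sketch simulates data from model (1) with standardized columns and derives the model vector b from β. The sizes n, p and s are chosen for illustration only and are not the study's settings:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, s = 100, 20, 5  # illustrative sizes only

# Design matrix with columns standardized to mean 0 and variance 1
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)

# Sparse coefficient vector: only the first s entries are non-zero
beta = np.zeros(p)
beta[:s] = 1.0

# Linear model y = X beta + eps with eps_i ~ N(0, sigma^2)
sigma = 1.0
y = X @ beta + rng.normal(scale=sigma, size=n)

# Model vector b indicates the true support of beta
b = (beta != 0).astype(int)
print(b.sum())  # number of true direct predictors
```

A selection procedure is then judged by how well the support of its estimate b̂ matches the support of b.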
For variable selection, different penalized least squares approaches have been formulated as an optimization problem of the form
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \|\beta\|_q^q, \qquad (2)$$
where the penalty term $\|\beta\|_q := \left(\sum_{j=1}^p |\beta_j|^q\right)^{1/q}$ denotes the L_q-norm, with the special case $\|\beta\|_0 := \sum_{j=1}^p I(\beta_j \neq 0)$. The tuning parameter λ ≥ 0 controls the strength of the penalty: for q < 2, a higher λ shrinks the β̂_j more strongly towards 0 and fewer variables are selected. Choosing λ can be based on criteria like AIC, BIC, cross-validation or stability procedures [23][24][25][26], each with different goals and different assumptions [27][28][29]. In the following we will focus on some of the most prominent penalization approaches.

Best subset selection
Using the L_0-norm in (2) is known as best subset selection (BSS) and can be formulated as the following discrete optimization problem:
$$\min_{\beta} \|y - X\beta\|_2^2 \quad \text{subject to} \quad \|\beta\|_0 \leq k, \qquad (3)$$
with k ∈ ℕ determining the maximal number of non-zeros in β̂. Zhang et al. [30] showed that if a uniform signal strength condition holds for the smallest true predictor in β, BSS can achieve selection consistency with respect to b. Shen et al. [31] defined a degree of separation that describes how difficult it is to discriminate the true model from all other models in terms of the projection of y based on β̂. As a necessary condition for an L_0-norm based penalty approach to be selection consistent, they showed that the degree of separation has to be larger than a threshold that is a function of p, n and σ². The minimization of (3) is known to be NP-hard [32,33], and state-of-the-art algorithms had been capable of solving BSS problems in a feasible amount of time only if p < 50. Recently, Bertsimas et al. [5] reformulated (3) as a mixed integer optimization (MIO) problem:
$$\min_{\beta, z} \|y - X\beta\|_2^2 \quad \text{subject to} \quad (\beta_i, 1 - z_i): \text{SOS-1}, \;\; z_i \in \{0, 1\}, \; i = 1, \ldots, p, \;\; \sum_{i=1}^p z_i \leq k, \qquad (4)$$
where SOS-1 denotes a Specially Ordered Set of Type 1, i.e. at most one element of (β_i, 1 − z_i) can be non-zero. The authors showed that this reformulation guarantees optimality in the sense of (3). Due to efficient MIO solvers like Gurobi [34], problem (4) can be solved in minutes even when p is in the 1000s, n in the 100s and a moderate value of k is selected. However, certifying the optimality of the solution can take much more time: the Gurobi solver, for example, uses a lower and upper bound criterion to find a solution, where the upper bound typically converges much faster than the lower bound [4].
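For small p, problem (3) can also be solved by brute force, which makes the combinatorial nature of BSS concrete. The sketch below is our own naive enumeration (in practice an MIO solver such as Gurobi would be used): it fits OLS on every subset of size k and keeps the one with the smallest residual sum of squares.

```python
import numpy as np
from itertools import combinations

def best_subset(X, y, k):
    """Naive BSS: exhaustively search all subsets of size k for minimal RSS."""
    best_rss, best_set = np.inf, None
    for subset in combinations(range(X.shape[1]), k):
        X_s = X[:, subset]
        beta_s, *_ = np.linalg.lstsq(X_s, y, rcond=None)
        rss = np.sum((y - X_s @ beta_s) ** 2)
        if rss < best_rss:
            best_rss, best_set = rss, set(subset)
    return best_set

# Toy example with high signal-to-noise ratio and uncorrelated predictors
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[[0, 3, 7]] = 2.0
y = X @ beta + rng.normal(size=200)
print(best_subset(X, y, k=3))
```

Enumerating all C(p, k) subsets quickly becomes infeasible as p grows, which is exactly why the MIO reformulation matters.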

Forward step-wise selection
While BSS is limited by its NP-hard nature, stepwise selection is a popular alternative. It gradually adds (removes) variables to (from) a model based on some criterion of model fit. Due to this greedy strategy, these algorithms are computationally less challenging, with complexity O(p²). However, stepwise selection approaches are known to have numerous drawbacks: they result in unstable final models that are sensitive to small changes in the data [35][36][37], and they are only locally optimal and often miss direct predictors while selecting irrelevant variables [38,39]. Moreover, inference is problematic as they usually do not account for multiple testing issues [37,40]. Despite these drawbacks, we consider forward stepwise selection (FSS) [41,42] in our simulation study because it can be interpreted as a greedy heuristic for BSS [4]. It is defined as an iterative algorithm that starts with an empty active set A_0 = ∅ and, at step t, adds the variable that most reduces the residual sum of squares,
$$j_t = \arg\min_{j \notin A_{t-1}} \|y - P_{A_{t-1} \cup \{j\}}\, y\|_2^2,$$
where P_{A_{t−1}} denotes the projection of y onto the column space of X_{A_{t−1}}. Given j_t, the active set is updated as A_t = A_{t−1} ∪ {j_t} and used to estimate
$$\hat{\beta}_{A_t} = \arg\min_{\beta} \|y - X_{A_t}\beta\|_2^2, \qquad \hat{\beta}_{\setminus\{A_t\}} = 0,$$
where \{A_t} denotes the set of not selected predictors at step t.
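A minimal sketch of this greedy procedure (our own illustrative implementation, selecting by RSS reduction as described above):

```python
import numpy as np

def forward_stepwise(X, y, k):
    """Greedy FSS: at each step add the variable that most reduces the RSS."""
    active = []
    for _ in range(k):
        best_j, best_rss = None, np.inf
        for j in range(X.shape[1]):
            if j in active:
                continue
            candidate = active + [j]
            beta_c, *_ = np.linalg.lstsq(X[:, candidate], y, rcond=None)
            rss = np.sum((y - X[:, candidate] @ beta_c) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        active.append(best_j)
    return active

# Toy example: with strong signal and uncorrelated predictors FSS
# recovers the true support despite its greediness
rng = np.random.default_rng(2)
X = rng.normal(size=(200, 10))
beta = np.zeros(10)
beta[[1, 4, 8]] = 2.0
y = X @ beta + rng.normal(size=200)
print(sorted(forward_stepwise(X, y, k=3)))
```

Each step refits OLS on every candidate set, so the whole path up to size k costs O(kp) least squares fits; correlated designs are where this greedy choice can go wrong.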

Lasso and Elastic net
The least absolute shrinkage and selection operator (Lasso) [6] uses an L_1-norm as penalty term in (2) and shrinks all estimated coefficients by an absolute value towards zero. For a sufficiently large tuning parameter λ, some estimated coefficients are exactly zero; hence, the Lasso performs variable selection. The total number of zero coefficients is controlled by λ, where larger values result in sparser models. Although the Lasso can be combined with fast algorithms so that problems with p in the 10,000s can easily be solved, i.e. settings for which BSS with MIO can no longer be applied, it also has several drawbacks. Firstly, it only allows up to n non-zero regression coefficients, which can be a limiting factor if n ≪ p [42]. Secondly, if irrelevant variables are highly correlated with direct predictors of the outcome y, the Lasso selects almost arbitrarily among those true and false variables and is known not to be consistent, not even for the sign of the coefficients. Further, if there is a high pairwise correlation within a set of variables and they are all true direct predictors for y, the Lasso tends to select only one of these variables [9,43]. Thus, it cannot guarantee consistent variable selection [20]. To address some of these drawbacks, different modifications of the Lasso have been proposed. They rely on a priori knowledge about the data generating process or the functional relationship between variables [7,8,16,[44][45][46]]. An alternative is the elastic net (Enet) [9], which can be formulated as a weighted combination of the Lasso penalty and an additional Ridge (L_2) penalty:
$$\hat{\beta} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda\left(\alpha \|\beta\|_1 + (1 - \alpha) \|\beta\|_2^2\right).$$
The second tuning parameter 0 ≤ α ≤ 1 controls the weighting between the L_1- and L_2-penalty. Here, the L_1-penalty induces a Lasso-type variable selection while the L_2-penalty helps with highly correlated variables by increasing the diagonal entries of the covariance matrix X⊤X.
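The elastic net objective is typically minimized by coordinate descent. The following is our own minimal sketch (not the authors' implementation; it assumes the columns of X are standardized to variance 1 and uses the common parameterization with penalty λ(α‖β‖₁ + ½(1−α)‖β‖₂²); a real analysis would use an established package such as glmnet or scikit-learn):

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def elastic_net_cd(X, y, lam, alpha, n_sweeps=200):
    """Coordinate descent for the elastic net; alpha=1 gives the Lasso.
    Assumes each column of X has mean 0 and variance 1."""
    n, p = X.shape
    beta = np.zeros(p)
    r = y.copy()  # current residual y - X beta
    for _ in range(n_sweeps):
        for j in range(p):
            # partial correlation of x_j with the residual (plus own effect)
            rho = X[:, j] @ r / n + beta[j]
            new = soft_threshold(rho, lam * alpha) / (1.0 + lam * (1 - alpha))
            r += X[:, j] * (beta[j] - new)  # keep residual in sync
            beta[j] = new
    return beta

rng = np.random.default_rng(3)
n, p = 200, 10
X = rng.normal(size=(n, p))
X = (X - X.mean(axis=0)) / X.std(axis=0)
beta_true = np.zeros(p)
beta_true[:3] = 2.0
y = X @ beta_true + rng.normal(size=n)
beta_hat = elastic_net_cd(X, y, lam=0.2, alpha=0.5)
```

The soft-thresholding step zeroes out weak coordinates (the L_1 part), while the denominator 1 + λ(1−α) is the Ridge-induced shrinkage that stabilizes correlated variables.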
The latter guarantees a positive-definite covariance matrix, so that it is possible for all p estimated coefficients to be non-zero. More importantly, the L_2-penalty can be interpreted as an artificial decorrelation of the variables, making it easier to jointly select highly correlated variables if they are all direct predictors for y [9]. However, the double shrinkage of the Enet increases the bias of the estimators more than the Lasso or Ridge alone, and a rescaling of the estimators, which does not affect the number of estimated non-zeros, has been suggested [9].

Simulation design and evaluation
To evaluate the performance of BSS, FSS, Lasso and Enet for variable selection we simulated data from a linear model (1) with X ∼ N p (0, Σ) and y ∼ N n (Xβ, σ 2 I).
Since real-world applications of variable selection often have a small signal-to-noise ratio τ, we followed [4] and set σ² = β⊤Σβ / τ with 0.05 ≤ τ ≤ 6. For a low-dimensional setting we chose n = 1000 and p = 100, and for a high-dimensional setting n = 100 and p = 1000. Additionally, we considered an intermediate setting with n = 500 and p = 500. To assess the role of correlation between predictors we used an uncorrelated structure and the standard Toeplitz structure. The latter was created by setting the pairwise correlation between two variables x_{·,u} and x_{·,v} for u, v = 1, …, p to ρ^{|u−v|} with ρ ∈ {0.35, 0.7}. However, although the Toeplitz structure is a popular correlation structure in simulations, it can be implausible in some applications. For example, in genetic epidemiology it is often reasonable to assume that genes within a functional group are correlated with each other, but that they are nearly independent of genes from other functional groups. Hence, we also simulated data with correlations following a block structure, for which variables were grouped into blocks of size 10 with pairwise correlations of ρ ∈ {0.35, 0.7} within blocks and ρ = 0 between variables of different blocks. The number of direct predictors was set to s = 10 in all settings, and their position was either consecutive or equally spaced along the sequence of variables. For the consecutive positioning we set the first ten coefficients to be non-zero, while for the equally spaced positioning we set every tenth coefficient to be non-zero (see Figure 1). In all scenarios a non-zero coefficient was set to β_j = 1. Overall we investigated 270 different scenarios, and each one was repeated 100 times. As performance measure we used the F1-score
$$F1 = \frac{2PR}{P + R},$$
where P = TP/(TP + FP) is precision and R = TP/(TP + FN) is recall (TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively).
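The two correlation structures and the noise calibration can be set up in a few lines. The helpers below are our own illustrative code, not the study's scripts:

```python
import numpy as np

def toeplitz_cov(p, rho):
    """Sigma[u, v] = rho^|u - v| (Toeplitz correlation structure)."""
    idx = np.arange(p)
    return rho ** np.abs(idx[:, None] - idx[None, :])

def block_cov(p, rho, block_size=10):
    """Pairwise correlation rho within blocks of block_size, 0 between blocks."""
    Sigma = np.zeros((p, p))
    for start in range(0, p, block_size):
        Sigma[start:start + block_size, start:start + block_size] = rho
    np.fill_diagonal(Sigma, 1.0)
    return Sigma

def noise_variance(beta, Sigma, tau):
    """sigma^2 = beta' Sigma beta / tau fixes the signal-to-noise ratio tau."""
    return beta @ Sigma @ beta / tau

# Equally spaced non-zeros: every tenth coefficient is 1, so with blocks of
# size 10 each true predictor falls into a different block
p = 100
beta = np.zeros(p)
beta[::10] = 1.0
Sigma = block_cov(p, rho=0.35)
print(noise_variance(beta, Sigma, tau=1.0))  # 10 uncorrelated non-zeros -> 10.0
```

Drawing X row-wise from N_p(0, Σ) (e.g. via `rng.multivariate_normal`) and adding N(0, σ²) noise then reproduces one simulation replicate.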

Selection of tuning parameter
All methods considered here rely on choosing either a tuning parameter λ or a maximum subset size k a priori; the Enet even requires two tuning parameters. Obviously, the performance of each variable selection method depends on the choice of these tuning parameter(s) or subset size. Since we are interested in the best possible variable selection performance of each method, we used a grid of tuning parameters λ and α for Lasso/Enet, i.e. α = 0.1, 0.2, …, 0.9 and 1000 values for λ, where the largest λ returns an empty model and the smallest λ a full model. For BSS and FSS we used subset sizes k = 1, …, 20, which means that for k = 10 both methods had the chance to find the true model. In a final step, for each method to be compared, we chose only those tuning values / subset sizes that gave the highest F1-score in the considered setting.
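This 'oracle' tuning can be expressed compactly: every candidate support along a method's path is scored against the true support, and only the best F1-score is reported. A small sketch with a hypothetical path (the supports shown are invented for illustration):

```python
def f1(true_supp, est_supp):
    """F1 = 2PR / (P + R) with precision P and recall R over selected variables."""
    tp = len(true_supp & est_supp)
    if tp == 0:
        return 0.0  # also covers an empty selected set
    precision = tp / len(est_supp)
    recall = tp / len(true_supp)
    return 2 * precision * recall / (precision + recall)

def best_f1(true_supp, path):
    """Oracle choice of tuning parameter / subset size: keep the best F1."""
    return max(f1(true_supp, est) for est in path)

# Hypothetical nested path, e.g. FSS supports for k = 1, ..., 4
path = [{0}, {0, 3}, {0, 3, 9}, {0, 3, 9, 5}]
print(best_f1({0, 3, 9}, path))  # perfect recovery at k = 3 gives 1.0
```

Note that this deliberately removes the tuning-parameter selection problem from the comparison: any real-data analysis would still need a criterion for choosing λ, α or k.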
Although MIO greatly accelerates BSS in our high-dimensional settings, it can still run for hours, making an extensive comparison infeasible without a time limit. Following the suggestion of other authors [4,5], we set the time limit to 3 minutes, which is sufficient for the Gurobi solver to find a solution. However, certifying optimality can take much longer. Because this could disadvantage BSS compared to the other methods, we investigated the number of BSS solutions certified as optimal and the impact on the F1-score.

Results
Surprisingly, BSS reliably outperformed the other methods only in settings with a high signal-to-noise ratio and uncorrelated variables. Even in a low-dimensional setting, with the number of observations ten times the number of variables, the selection performance of BSS dropped dramatically when the true predictors were moderately correlated. In those cases BSS was even outperformed by the Lasso, which is known to be inconsistent for variable selection when the true predictors are highly correlated. Interestingly, the much simpler FSS achieved a performance similar to BSS in almost all settings, and in some even slightly better.
To give a more detailed impression of the BSS performance, we report the results of four specific settings regarding dimensionality, non-zero positioning as well as correlation structure and strength. The remaining results are shown in the supplement "Additional file 1" and the interactive web app [18], and support our main conclusions. Note that for the Enet we only show results with α ∈ {0.1, 0.5, 0.9}, representing a mostly Ridge-weighted, a balanced and a mostly Lasso-weighted Enet.

Variable selection performance
In general, all methods perform better under a Toeplitz structure than under a block structure. This seems plausible since the correlations under the Toeplitz structure are weaker than those within blocks. For the high-dimensional block setting with equally spaced non-zeros, ρ = 0.35 and low τ, all methods have a relatively small best F1-score (see Figure 2). FSS and BSS show nearly identical results and outperform Lasso and Enet only at large signal-to-noise ratios with respect to the F1-score (τ ≥ 3.52) and precision (τ ≥ 0.71). Lasso and Enet show better recall values on average except for τ ≥ 3.52, while the variability is high for τ ≤ 0.14. Figure 3 shows the results for a high-dimensional Toeplitz setting with consecutive non-zeros. The methods exhibit large differences: for τ ≥ 2.07 the Ridge-weighted Enet (α = 0.1) mostly achieves a very high F1-score of 1, clearly benefiting from the decorrelation. In comparison, the other methods cannot cope with highly correlated direct predictors, as seen from the low recall values and ensuing low F1-scores of the Lasso and the weak performances of BSS and FSS. Figure 3 also shows that the recall of BSS and FSS benefits slightly from an increase in τ.
In the low-dimensional block setting with ρ = 0.7 and equally spaced non-zeros, BSS and FSS barely outperform Lasso and Enet for most τ. Figure 4 shows that this is mainly due to their relatively high precision. However, when the non-zeros are consecutive, the performance of FSS and BSS decreases drastically. Even in the low-dimensional case the signal-to-noise ratio has to be very large (τ = 6) to achieve results comparable to Lasso and Enet (see Figure 5). In these cases the F1-scores of FSS and BSS are dominated by their poor recall.

Certified runs
Although we needed to set a time limit for the Gurobi solver (which we chose so that the total computation time would stay below 56 days), the poor performance of BSS cannot be explained by non-certified solutions. Figure 6 shows the number of certified and non-certified optimal solutions for the low-dimensional setting with consecutive non-zeros of Figure 5. For τ ≥ 3.52 the Gurobi solver certified all solutions, and at least 80 out of 100 runs were certified as optimal for smaller τ. This suggests that BSS could not have achieved a much better recovery of supp(b) in these settings. Conversely, an uncertified solution does not imply that BSS performs poorly. For example, Figure 7 shows the number of certified results for an intermediate Toeplitz setting (p = 500, n = 500, ρ = 0.35) with equally spaced true predictors. No run in this scenario could be certified for τ = 0.71, but BSS still achieved an F1-score of 1 in 98 out of 100 runs (see [18] and supplement "Additional file 1"). This means the Gurobi solver found an F1-optimal solution that is also the best possible solution but could not certify its optimality within 3 minutes.

Conclusion and Discussion
We carried out an extensive simulation study to compare the performance of BSS and its competitors regarding variable selection. We investigated a broad range of parameter constellations as well as different, and arguably more realistic, correlation structures than previous works. Our results show that the Enet and the Lasso outperform BSS and FSS in most scenarios. This was unexpected since, due to the maximal subset size k = 20, BSS must have considered the true model with s = 10 as a candidate model in every run. Hence, we would have expected BSS to perform similarly to or better than the other methods. Perhaps more surprisingly, we found that BSS cannot achieve a good variable selection performance even in low-dimensional settings if there is a moderate to large correlation (ρ ≥ 0.35) among the true predictors. In this case, BSS can only be 'saved' by a relatively high signal-to-noise ratio, which will be implausible in many practical situations. We also argue that the poor performance of BSS cannot be explained by non-certified runs alone.
Based on our empirical results, using a Ridge-weighted Enet seems a good choice for settings with correlated predictors. We can recommend an L 0 -norm penalization only in situations where a high signal-to-noise ratio and (nearly) uncorrelated true predictors are plausible. However, in view of the N P -hard nature of BSS and its performance being matched almost exactly by FSS, we see no benefit in choosing BSS over FSS.
Our complete results can be accessed via a web app [18] and via the supplement "Additional file 1". However, it has to be kept in mind that our simulation study was designed to evaluate variable selection methods, and not to assess the criteria for selecting the tuning parameters or subset size.
In future research, it might be promising to combine BSS or FSS with approaches like Lasso or Enet so as to preselect or decorrelate covariates. Hence, further insights into such combined approaches may be gained.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Figure 6 Number of simulation runs with certified optimal solution found by the Gurobi solver for a block-structured low-dimensional setting with ρ = 0.7 and consecutive true predictors.