Stable Variable Selection for High-dimensional Genomic Data with Strong Correlations

Background: High-dimensional genomic data studies often exhibit strong correlations, which result in instability and inconsistency in the estimates obtained from commonly used regularization approaches, including the Lasso, the MCP, and related methods. Results: In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures, and propose a two-stage procedure named rPGBS to address the issue of stable variable selection in various strong correlation settings. This approach repeatedly runs a two-stage hierarchical procedure consisting of random pseudo-group clustering followed by bi-level variable selection. Conclusion: Both the simulation studies and the high-dimensional genomic data analysis demonstrate the advantage of the proposed rPGBS method over the most commonly used regularization methods. In particular, the rPGBS yields more stable selection of variables across a variety of correlation settings, as compared to recent work addressing variable selection with strong correlations. Moreover, the rPGBS is computationally efficient across various settings.


Background
In bioinformatics studies, the links between genetic markers and phenotypes are highly important, but these genetic markers are often found to be highly correlated. Furthermore, genomic datasets often contain strong linear dependencies among the variables. Consider a high-dimensional data set with $n$ observations and $p$ variables, where $p \gg n$. Let $y_i$ be the observed response variable and $(x_{i1}, \cdots, x_{ip})$ be the $p$-dimensional predictor vector for observation $i$, $1 \le i \le n$. Without loss of generality, the data are standardized and centered such that $\sum_{i=1}^n y_i = 0$, $\sum_{i=1}^n x_{ij} = 0$ and $\sum_{i=1}^n x_{ij}^2 = n$. Then a high-dimensional linear regression model has the form
$$y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i, \quad 1 \le i \le n, \qquad (1)$$
where the $\varepsilon_i$s are i.i.d. random errors with mean 0 and constant variance $\sigma^2$. In a high-dimensional setting, a sparse model is often assumed; that is, most coefficients $\beta_j$ are zero, except for a few nonzero ones. Without loss of generality, we assume only the first $q$ coefficients are nonzero. Let $\beta = (\beta_1^\top, \beta_2^\top)^\top$, where $\beta_1 = (\beta_1, \cdots, \beta_q)^\top$ includes only the nonzero $\beta_j$s and $\beta_2 = (\beta_{q+1}, \cdots, \beta_p)^\top$ includes only zeros. A model can perform simultaneous variable selection and estimation if it provides exact zero estimates for some coefficients.
In the high-dimensional linear model (1), classical regularized methods for variable selection minimize a penalized least squares objective function
$$\frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \sum_{j=1}^{p} P_\lambda(|\beta_j|), \qquad (2)$$
where $\sum_{j=1}^{p} P_\lambda(|\beta_j|)$ is the penalty function chosen to produce a sparse $\beta$. Among the many existing penalty functions, the least absolute shrinkage and selection operator (Lasso, [1]) and the minimax concave penalty (MCP, [2]) are the two most commonly used penalties. In particular, the Lasso takes $P_\lambda(|\beta_j|) = \lambda|\beta_j|$, a convex penalty, while the MCP uses the folded concave penalty
$$P_{\lambda,\gamma}(|\beta_j|) = \lambda \int_0^{|\beta_j|} \Big(1 - \frac{t}{\gamma\lambda}\Big)_+ \, dt. \qquad (3)$$
Here $\lambda$ is a tuning parameter controlling the sparsity in both the Lasso and MCP. The MCP achieves better variable selection performance by penalizing only the small coefficients; the range of non-penalization is controlled by another tuning parameter $\gamma$. The smoothly clipped absolute deviation penalty (SCAD, [3]) is another folded concave penalty that performs similarly to the MCP. Other variants of the Lasso include the adaptive Lasso ([4]), the relaxed Lasso ([5]), the Moderately Clipped Lasso ([6]), etc. Regularization methods with the above penalties can perform well for variable selection in high-dimensional data settings when strong correlation does not appear among many covariates. However, issues arise when they are applied to highly correlated and/or linearly dependent variables. The underperformance of the Lasso for variable selection in high-dimensional data with correlated variables was also justified theoretically in [7] and [8], which proved that the data must satisfy the weak irrepresentable condition in order for the Lasso to select important variables consistently. Denote the response vector $y = (y_1, \cdots, y_n)^\top$ and the design matrix $X = (x_{ij})_{n \times p}$, with the $p$ variables listed as columns and the $n$ observations as rows. Correspondingly, write $X_1$ for the submatrix of $X$ with only the first $q$ columns of active variables (corresponding to $\beta_1$), and $X_2$ for the submatrix of the last $p - q$ inactive variables (corresponding to $\beta_2$). The weak irrepresentable condition requires
$$\big|\,\mathrm{sign}(\beta_1)^\top (X_1^\top X_1)^{-1} X_1^\top x_j \,\big| < 1 \quad \text{for all } q+1 \le j \le p,$$
where $\mathrm{sign}(u) = 1$ if $u > 0$ and $-1$ if $u < 0$. Thus, the weak irrepresentable condition means that the active variables and the inactive variables should not be strongly correlated.
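For concreteness, the Lasso and MCP fits above can be reproduced with the standard R packages glmnet and ncvreg; the following is a minimal sketch on simulated toy data (the data-generating choices here are ours, purely for illustration):

library(glmnet)   # Lasso via coordinate descent
library(ncvreg)   # MCP and SCAD penalties

set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1:5] %*% rep(1, 5) + rnorm(n))   # first 5 variables active

lasso_fit <- cv.glmnet(X, y, alpha = 1)                    # P_lambda(|b|) = lambda * |b|
mcp_fit   <- cv.ncvreg(X, y, penalty = "MCP", gamma = 3)   # folded concave penalty (3)

# Variables with nonzero coefficients under each penalty
which(coef(lasso_fit, s = "lambda.min")[-1] != 0)
which(coef(mcp_fit)[-1] != 0)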
[9] also detailed two main issues that arise with classical variable selection methods when the predictor variables are strongly correlated: instability and inconsistency. Instability refers to the inability to distinguish among correlated explanatory variables, and inconsistency refers to the inability to select variables consistently in the presence of linear dependence among predictors. As shown in [9], the commonly used methods often exhibit these discrepancies, especially in highly correlated genomic data settings. This phenomenon is also demonstrated in our comparative studies. During the last twenty years, much work has been done to improve Lasso- and MCP-type methods by taking into account either the strong correlation or the grouping effect.
Encourage simultaneous shrinkage using the ridge penalty
Since the Lasso tends to select only one variable from a group of variables with strong pairwise correlations, [10] proposed the elastic net (Enet) regularization approach by combining the $\ell_1$ penalty of the Lasso with the $\ell_2$ ridge penalty. The elastic net combines the feature selection of the Lasso and the grouped shrinkage of ridge regression ([11]) to improve the model's predictions. In particular, the elastic net estimator minimizes
$$\frac{1}{2n}\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\Big(\alpha\sum_{j=1}^{p}|\beta_j| + \frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2\Big).$$
By adding the additional ridge penalty $\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2$ to the Lasso, the elastic net produces (nearly) equal estimates, up to a change of sign if negatively correlated, for the coefficients of a group of highly correlated variables, and thus selects groups of correlated variables. Here $\alpha$ controls the magnitude of the grouping effect. Similarly, the Mnet ([12]) improves the variable selection results of the Enet by adding the ridge penalty to the MCP. While the Enet and Mnet use the ridge penalty to encourage grouping among correlated variables, the additional $\ell_2$ penalty shrinks all of the coefficients together. Thus both the Enet and Mnet tend to generate incorrect active variable sets, especially when active and inactive variables are strongly correlated.
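In the glmnet parameterization, the mixing weight alpha interpolates between the ridge penalty (alpha = 0) and the Lasso (alpha = 1); a brief sketch of an elastic net fit on the same kind of toy data as above:

library(glmnet)
set.seed(1)
n <- 100; p <- 200
X <- matrix(rnorm(n * p), n, p)
y <- as.numeric(X[, 1:5] %*% rep(1, 5) + rnorm(n))
# Penalty: lambda * [ alpha * sum|b_j| + (1 - alpha)/2 * sum b_j^2 ]
enet_fit <- cv.glmnet(X, y, alpha = 0.5)
which(coef(enet_fit, s = "lambda.min")[-1] != 0)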

Encourage the group variable selection with prior group information
In a high-dimensional data setting where the variables come with a group structure, it is often assumed that the variables function group-wise as well. For example, in gene expression data analysis, genes from the same genetic pathways often function similarly ([13]). [14] proposed the Group Lasso method for group variable selection, i.e., selecting active groups, when structured group information is available. Suppose the $p$ variables consist of $J$ groups such that $X_j$ includes only the variables in group $j$, and $\beta_j$ is the corresponding coefficient vector of group $j$. The Group Lasso applies the Lasso penalty at the group level, that is, it replaces the penalty in (2) by
$$\lambda \sum_{j=1}^{J} \sqrt{p_j}\, \|\beta_j\|_2.$$
Here $p_j$ is the size of group $j$ and $\|\beta_j\|_2 = \big(\sum_{k \in \text{Group } j} \beta_k^2\big)^{1/2}$. To improve the performance of group variable selection, [15] studied the Group MCP, which replaces the Lasso penalty with the MCP penalty at the group level; that is, $|\beta_j|$ in the MCP penalty function (3) is replaced by $\|\beta_j\|_2$ for group $j$.
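The grpreg R package implements both penalties; a minimal sketch, assuming a known group labeling (the toy data and group sizes below are illustrative choices of ours):

library(grpreg)
set.seed(1)
n <- 100; p <- 60
X <- matrix(rnorm(n * p), n, p)
grp <- rep(1:12, each = 5)                      # 12 groups of 5 variables
y <- as.numeric(rowSums(X[, 1:10]) + rnorm(n))  # groups 1 and 2 are active
fit_gl <- cv.grpreg(X, y, group = grp, penalty = "grLasso")  # Group Lasso
fit_gm <- cv.grpreg(X, y, group = grp, penalty = "grMCP")    # Group MCP
unique(grp[coef(fit_gl)[-1] != 0])   # selected groups ("all-in, all-out")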
Both the Group Lasso and Group MCP depend on prior group information that correctly divides the active and inactive variables into different groups. They are "all-in" or "all-out" methods: when a group is selected, all variables within it are selected, and when a group is not selected, all of its variables are discarded. The Group Lasso penalizes the grouped coefficients much as the Lasso does, since it uses the same tuning parameter (up to a factor related to the group size) for all groups, and hence suffers from estimation inefficiency and variable selection inconsistency. When the data are well structured into groups, both methods tend to perform well, but their performance becomes inconsistent when the data cannot be distinctly grouped ([16]). In such scenarios, adopting a bi-level selection approach avoids producing biased estimates and badly incorrect variable selection results.
Reduce the effect of strong correlation using the covariance matrix
Two recent methods, the Precision Lasso (PLasso, [9]) and the Whitening Lasso (WLasso, [17]), propose to reduce the influence of strong correlation in variable selection by incorporating the covariance matrix into the penalty function.
The PLasso is a Lasso variant that promotes sparse variable selection through regularization governed by the covariance and inverse covariance matrices of the explanatory variables. It can handle the challenges of variable selection when variables are both strongly correlated and linearly dependent. However, the penalty depends on a trade-off between the covariance and inverse covariance matrices that is hard to tune. In addition, the computation scales cubically with the number of samples and linearly with the number of explanatory variables, making it computationally expensive. Our numerical studies below demonstrate that it does not necessarily outperform traditional methods such as the MCP or Mnet; for example, the PLasso often severely over-selects variables, especially when the correlation or dependence structure is not strong. The WLasso is another recent method proposed to improve variable selection, especially in scenarios where active and inactive variables are strongly correlated. It transforms the model to remove the correlation among the columns of $X$ and applies the generalized Lasso approach ([18]) to select significant variables. The WLasso can be computationally tedious since it requires a decomposition of covariance matrices. In addition, it assumes strong correlation between active and inactive variables by estimating a covariance matrix $\Sigma$ with two blocks (one consisting mainly of active variables and the other mainly of inactive variables); this imposed block structure may cause instability and inaccuracy in variable selection. Our comparative studies show that the WLasso exhibits variability and inconsistency in variable selection, particularly when the correlations among predictors are high. Unlike the PLasso, the WLasso is challenging to apply to binary responses since it depends strongly on the generalized Lasso method.

Our contribution
As summarized above, previous work addressing variable selection in highly correlated high-dimensional data is computationally complex and not consistently efficient across varied settings, including both strong and weak correlation settings. In this paper, we conduct a thorough comparative study of many existing regularization methods for variable selection under different correlation structures. In particular, we propose an efficient procedure for stable variable selection in various high-dimensional correlated data settings. The proposed method, termed the rPGBS (repeatedly randomized Pseudo Group Bi-level Selection), is a stable variable selection approach that repeatedly runs a two-stage hierarchical procedure consisting of random pseudo-group clustering and bi-level variable selection. Owing to its comparatively simple algorithm and its consistently high variable selection efficiency across a variety of data settings, the proposed rPGBS is a robust approach for dealing with highly correlated predictors in high-dimensional genomic data.
In Section 2 of this paper, we present our main methodology and provide a Bayesian interpretation of it. In Section 3, we provide results from numerical studies using both simulations across different settings and two publicly available gene expression data sets. Section 4 contains some concluding remarks.

Methods
Here we propose the repeatedly randomized Pseudo Group Bi-level Selection (rPGBS) method for stable variable selection in the generalized linear model setting. We present our methodology for continuous response variables under the linear regression model (1), but the same procedure applies naturally to logistic regression with binary responses. Our method can be summarized in three main steps: 1) use $X$ to obtain a sample pseudo-group structure based on a randomly generated group number $J$; 2) apply a regularized method with a penalty producing bi-level sparsity, so that the estimated $\beta$ is sparse both at the group level and within groups; and 3) summarize the variable selection results by repeatedly performing the above two steps with $J$ randomly generated from a uniform distribution on $[k_{\min}, k_{\max}]$, where $1 \le k_{\min} \le k_{\max} \le p$. In the following subsections, we give more details on the three steps and then present the rPGBS algorithm.

Generating a random pseudo group pattern
The randomized pseudo-grouping information is generated by using the correlation or spatial dependence among the variables to cluster them into $J$ groups, where $J \sim \mathrm{Unif}[k_{\min}, k_{\max}]$. Existing clustering methods include partitioning around medoids (PAM, [19]) and the Voronoi iteration method ([20], [21], [22]). For our simulations and real data analysis, we use PAM to group the variables as a demonstration; see the sketch below. For a random group number $J$, the pseudo-group information $G(J)$ can be obtained with the pam function in the "cluster" R package ([19]). One can also use other clustering methods, allowing for even more flexible implementations of the rPGBS.
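A minimal sketch of this step, using pam from the cluster package on a correlation-based dissimilarity among the columns of X (the dissimilarity 1 - |cor| is one natural choice; others would work as well):

library(cluster)
set.seed(1)
X <- matrix(rnorm(100 * 50), 100, 50)        # toy design matrix
kmin <- ceiling(0.05 * ncol(X))
kmax <- ceiling(0.50 * ncol(X))
J <- sample(kmin:kmax, 1)                    # J ~ Unif{kmin, ..., kmax}
d <- as.dist(1 - abs(cor(X)))                # dissimilarity between variables
G <- pam(d, k = J, diss = TRUE)$clustering   # pseudo-group label G(J) per variable
table(G)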

Bi-level variable selection
The bi-level selection is a two-step process: selection at the group level and selection at the individual variable level. This is especially pertinent to our method, which addresses high-dimensional genomic data analysis where the inherent grouping structure is fuzzy and random. As stated earlier, many existing methods depend heavily on group selection, and the "all-in or all-out" methodology does not allow flexibility in variable selection. Existing bi-level variable selection methods include the composite MCP (cMCP, [15]), the Group Exponential Lasso (GEL, [23]), and the robust group MCP ([24]), among many others. These methods select groups and then perform another round of selection within the selected groups; unlike the MCP and Group MCP, they conduct a further selection within each selected group to determine the active variables. In addition, the individual variable-level penalty in the cMCP cannot exceed the penalty at the group level, so no single variable can dominate the penalty of its group.
Suppose $G(J)$ is an observed group assignment pattern with $J$ groups such that $X = (X_1, \cdots, X_J)$, where $X_j$ consists of the $p_j$ variables in group $j$. Denote $\beta_j = (\beta_{j1}, \cdots, \beta_{jp_j})^\top$ as the coefficient vector for $X_j$. The cMCP applies the MCP penalty both at the group level and within the group level:
$$\sum_{j=1}^{J} P_{\lambda, \gamma_O}\Big( \sum_{k=1}^{p_j} P_{\lambda, \gamma_I}(|\beta_{jk}|) \Big),$$
where $P(\cdot)$ is the MCP function in (3), $\gamma_O$ is the tuning parameter of the outer penalty (at the group level) and $\gamma_I$ is the tuning parameter of the inner penalty (within the group level). The GEL also performs bi-level selection, using an exponential penalty that induces sparsity at both the group level and the variable level; see [23] for more details. We choose to apply the cMCP since it avoids over-shrinkage both at and within the group level, owing to the nice properties of the MCP penalty. Although the cMCP is used in our rPGBS method for most of the numerical studies, including the real data analysis, we found that the rPGBS with the GEL may also perform well under certain simulation settings.
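The grpreg package also provides the cMCP and GEL penalties, so one epoch of bi-level selection can be sketched as follows (toy data; whether this matches the authors' own implementation is not stated, so treat it as illustrative):

library(grpreg)
set.seed(1)
n <- 100; p <- 60
X <- matrix(rnorm(n * p), n, p)
grp <- rep(1:12, each = 5)
y <- as.numeric(X[, c(1, 6, 11)] %*% rep(1, 3) + rnorm(n))  # one active variable in each of groups 1-3
fit_cmcp <- cv.grpreg(X, y, group = grp, penalty = "cMCP")  # composite MCP
fit_gel  <- cv.grpreg(X, y, group = grp, penalty = "gel")   # group exponential Lasso
which(coef(fit_cmcp)[-1] != 0)   # bi-level: individual variables, not whole groups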

Summarizing the repeated random process
From the above, we know the bi-level variable selection result depends on the observed pseudo-group information. Therefore, to obtain a stable variable selection result, it is essential to repeatedly generate $J \sim \mathrm{Unif}(k_{\min}, k_{\max})$ at random and use the corresponding pseudo-group information $G(J)$ to run the bi-level selection. Since the fuzzy group information and the corresponding bi-level sparsity behave like a black box, we treat one pass of randomized group assignment and bi-level selection as one epoch. Let $a_j$ be the binary activity status of variable $j$; that is, $a_j = 1$ if $\beta_j \neq 0$ in the true sparse model, and $a_j = 0$ otherwise. After $M$ epochs, we can estimate the probability of variable $j$ being active as
$$\hat r_j = \frac{1}{M} \sum_{m=1}^{M} a_j^{(m)},$$
where $a_j^{(m)}$ indicates whether variable $j$ is selected in epoch $m$. In summary, the rPGBS method outputs the activation probability vector $r = (r_1, \cdots, r_p)^\top$ via the following algorithm.

Algorithm 1 rPGBS Algorithm for variable selection
INPUT: Data $X$ and $y$, epoch number $M$, group number range $k_{\min}$ and $k_{\max}$.
OUTPUT: Activation proportion $r_j$ for all variables.
Initialize an active count vector $a = (a_1, \cdots, a_p)^\top$ with $a_j = 0$ for $1 \le j \le p$.
repeat
    Randomly generate an integer $J$ from $\mathrm{Unif}(k_{\min}, k_{\max})$.
    Produce pseudo-group information $G(J)$ with the random group number $J$.
    Run a bi-level selection (e.g., cMCP or GEL) with group information $G(J)$, and increase $a_j$ by 1 for each selected variable $j$.
until $M$ epochs are completed.
Return $r_j = a_j / M$ for $1 \le j \le p$.

Variables with a larger $r_j$ have a higher chance of being active. Variable $j$ is deemed active if $r_j > c$, where $0 < c \le 1$ is a cut-off value. The cut-off $c$ can be chosen by sorting the activation proportions $r_j$ from the smallest (0 in general) to the largest (1 in general): if the sorted proportions jump quickly to a higher value and then slow down or stabilize, the value at which the jump occurs is chosen as the cut-off. The coefficient estimate $\hat\beta$ is obtained by post-selection estimation; in particular, if $A$ includes all active variables, then $\hat\beta_{A^c} = 0$ and $\hat\beta_A$ is obtained by refitting $y$ on $X_A$. In all our simulation settings, we run 20 epochs and choose $(k_{\min}, k_{\max}) = (0.05p, 0.5p)$. The rPGBS with the cMCP is applied in most settings unless specified otherwise (the GEL may also be applied in some settings, depending on the type of data). For the rPGBS method, we use the cut-off value $c = 0.8$ in all linear regression settings and $c = 0.5$ for logistic regression, unless otherwise specified.
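Putting the three steps together, a minimal R sketch of Algorithm 1 might look as follows; it relies on the cluster and grpreg packages, and the function name and defaults are ours, not the authors' code:

library(cluster)   # pam() for pseudo-group clustering
library(grpreg)    # cv.grpreg() with the cMCP or GEL bi-level penalty

rPGBS <- function(X, y, M = 20,
                  kmin = ceiling(0.05 * ncol(X)),
                  kmax = ceiling(0.50 * ncol(X)),
                  penalty = "cMCP") {
  p <- ncol(X)
  counts <- numeric(p)                  # how often each variable is selected
  d <- as.dist(1 - abs(cor(X)))         # correlation-based dissimilarity
  for (m in seq_len(M)) {               # one pass = one epoch
    J <- sample(kmin:kmax, 1)           # J ~ Unif{kmin, ..., kmax}
    grp <- pam(d, k = J, diss = TRUE)$clustering   # pseudo groups G(J)
    fit <- cv.grpreg(X, y, group = grp, penalty = penalty)
    beta <- coef(fit)[-1]               # drop the intercept
    counts <- counts + (beta != 0)      # a_j^(m) = 1 if variable j is selected
  }
  counts / M                            # activation proportions r_j
}

# Usage: variables with r_j above the cutoff c are declared active, then
# refit by least squares on the selected set (post-selection estimation).
# r <- rPGBS(X, y); active <- which(r > 0.8)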

A Bayesian understanding
To understand the rationale of the rPGBS, we consider a Bayes setting in which
$$\pi(\beta \mid G) \propto \exp\Big\{ -n \sum_{j=1}^{J} P_\lambda(\beta_j; G) \Big\}, \qquad (6a)$$
$$G \mid J \sim F(\cdot \mid J), \qquad (6b)$$
$$J \sim \mathrm{Unif}\{k_{\min}, \cdots, k_{\max}\}. \qquad (6c)$$
Here $P_\lambda(\cdot)$ is proportional to some chosen penalty function, such as the MCP penalty in (3) or the exponential Lasso penalty in [23]. The summation in (6a) is based upon given group information $G$, where $G$ follows a distribution $F$ given the number of groups $J$ in (6b), and $J$ follows a prior distribution. In particular, for the rPGBS algorithm we take the uninformative uniform prior for $J$ in (6c). Thus, the rPGBS estimator of $\beta$ is approximately a posterior mode under the high-dimensional linear regression model (1).
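For a fixed grouping G, maximizing this posterior is equivalent to minimizing a bi-level penalized least squares criterion; the following sketch makes the correspondence explicit, with the penalty written generically and normalization constants suppressed:

\begin{aligned}
-\log \pi(\beta \mid y, G)
  &= \frac{1}{2\sigma^2}\,\lVert y - X\beta\rVert_2^2
     + n\sum_{j=1}^{J} P_\lambda(\beta_j; G) + \text{const},\\
\hat\beta(G)
  &= \arg\min_{\beta}\Big\{\frac{1}{2n}\,\lVert y - X\beta\rVert_2^2
     + \sigma^2 \sum_{j=1}^{J} P_\lambda(\beta_j; G)\Big\}.
\end{aligned}

Averaging the resulting selection indicators over draws of J and G, as the rPGBS does, then approximates the posterior selection probability of each variable.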

Simulation studies for continuous response data
We first perform simulation studies for continuous response data under different correlation and dependence settings.

Example 1
We simulate data from the multiple linear regression setting $y_i = \sum_{j=1}^{p} x_{ij}\beta_j + \varepsilon_i$, where $\beta_j = 1$ or $-1$ for active variables and 0 otherwise. The covariate vector $(x_{i1}, \cdots, x_{ip})$ is simulated from a multivariate normal distribution with mean $0_p$ and covariance matrix $\Sigma$, where $\Sigma$ takes one of three correlation structures: (a) block-wise, (b) group-wise, or (c) power-decay; an R sketch of these constructions follows the three descriptions below. The random error $\varepsilon_i$ is generated from $N(0, 1)$. We set $(n, p) = (200, 500)$.
(a) Block-wise. The block-wise structure is generated similarly to [17]. In particular, the covariance matrix is constructed as
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{12}^\top & \Sigma_{22} \end{pmatrix}.$$
Here $\Sigma_{11}$ is the correlation matrix among the active variables, where any two active variables have correlation coefficient $\alpha_1$; $\Sigma_{12}$, the correlation matrix between the active-variable block and the inactive-variable block, is a constant matrix with all entries equal to $\alpha_2$; and $\Sigma_{22}$ is the correlation matrix among the inactive variables, where any two inactive variables have correlation coefficient $\alpha_3$. For our simulation studies, $(\alpha_1, \alpha_2, \alpha_3) = (0.3, 0.5, 0.7)$. We set the number of active variables to $q = 10$ for this setting.
(b) Group-wise. The group-wise structure has within-group correlation $a$ and between-group correlation $a \cdot b$, so that variables in the same group are more strongly correlated than variables in different groups. In our simulation studies, the number of active groups is set to 3, $a = 0.9$, $b = 0.8$, and the number of active variables in this setting is 30.
(c) Power decay. In the power-decay structure, the correlation weakens as two variables are farther apart in index: $\mathrm{cor}(x_j, x_k) = \alpha^{|j-k|}$ for $1 \le j, k \le p$. In our simulation studies, we choose $\alpha = 0.99$. We set the number of active variables to $q = 10$ for this setting.
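The three covariance structures can be constructed directly in R; a hedged sketch follows (in (b), the group size is taken as 10, so that 3 active groups give 30 active variables — our reading of the setting rather than a stated detail):

library(MASS)   # mvrnorm for multivariate normal samples
n <- 200; p <- 500; q <- 10

# (a) Block-wise: active block (alpha1), cross block (alpha2), inactive block (alpha3)
a1 <- 0.3; a2 <- 0.5; a3 <- 0.7
Sigma_a <- rbind(cbind(matrix(a1, q, q),     matrix(a2, q, p - q)),
                 cbind(matrix(a2, p - q, q), matrix(a3, p - q, p - q)))
diag(Sigma_a) <- 1

# (b) Group-wise: within-group correlation a, between-group correlation a*b
a <- 0.9; b <- 0.8; gsize <- 10
grp <- rep(seq_len(p / gsize), each = gsize)
Sigma_b <- matrix(a * b, p, p)
Sigma_b[outer(grp, grp, "==")] <- a
diag(Sigma_b) <- 1

# (c) Power decay: cor(x_j, x_k) = alpha^|j - k|
alpha <- 0.99
Sigma_c <- alpha^abs(outer(1:p, 1:p, "-"))

beta <- c(rep(1, q), rep(0, p - q))
X <- mvrnorm(n, rep(0, p), Sigma_c)   # swap in Sigma_a or Sigma_b as needed
y <- as.numeric(X %*% beta + rnorm(n))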
Example 2
Similar to Example 1(c), except that we use an auto-regressive sampling scheme as in [9]. In particular, $X \in \mathbb{R}^{n \times p}$ has two subsets: correlated variables $X_1 \in \mathbb{R}^{n \times p_1}$ and linearly dependent variables $X_2 \in \mathbb{R}^{n \times p_2}$. Here $X_1$ is generated following the standard Yule-Walker equations ([25]), and $X_2$ depends on $X_1$.
To demonstrate the performance of the rPGBS, we compare it with the other 8 methods introduced in Section 1: the Lasso, MCP, Enet, Mnet, GLasso, GMCP, WLasso and PLasso. For the Group Lasso and Group MCP, the number of groups is fixed at $J = 0.1p$, so that the number of groups increases with $p$. For both the WLasso and PLasso, we use the publicly available implementations accompanying the two papers. For all 9 methods, the optimal tuning parameters are selected by 5-fold cross-validation. We run 50 replications for every simulation setting.
For each of the nine methods, including the rPGBS, we summarize the variable selection results using six measurements: Total Accuracy (TA, the percentage of variables correctly classified as active or inactive), TPR (the percentage of variables selected as active that are truly active), TNR (the percentage of variables selected as inactive that are truly inactive), Sensitivity (the percentage of truly active variables that are selected as active), Specificity (the percentage of truly inactive variables that are identified as inactive), and the total number of variables selected (Total #).
The average results (and standard deviations) over the 50 replications for Examples 1(a)-(c) are reported in Tables 1-3, and the average results for Example 2 are reported in Table 4. Note that the final column of these tables gives the Total # of variables selected by each method, with the true number of active variables shown in parentheses next to the column name.

Table 1 reports the comparative results for the linear block-wise correlation setting in Example 1(a), where the number of true active variables is 10. Among all 9 methods, the MCP performs best, while the rPGBS is comparable to the MCP: the average number of selected active variables is 10.24 (rPGBS) vs 10.14 (MCP), the TPR is 98.2% (rPGBS) vs 98.8% (MCP), and the other measurements (TA, TNR, Sensitivity and Specificity) are all 100% on average for both methods. The Mnet also performs rather well in this setting, but still worse than both the rPGBS and MCP. Consistent with previous studies, the Lasso performs worse than the MCP when strong correlation exists in high-dimensional settings. Both the WLasso and PLasso perform much worse than the rPGBS: the WLasso severely under-fits the model, selecting only 0.12 active variables on average, while the PLasso over-fits, selecting 83.8 variables on average. These results indicate that in this setting, where the correlation between active and inactive variables exceeds the correlation among active variables (0.5 vs 0.3), group variable selection tends to over-select variables if the group information is not appropriately used, as the GMCP performs worse than both the MCP and Mnet. The rPGBS, however, is relatively robust to inappropriate group information, thanks to the stability generated by the repeatedly randomized grouping and bi-level selection.

Table 2 gives the comparative results of all 9 methods for the power-decay correlation setting in Example 1(c), where the true active set also has 10 variables. In this setting, the rPGBS outperforms all other 8 methods, selecting 10.42 active variables on average with a TA of 99.1%. Among the other methods, those built around group selection over-select: for example, the number of active variables is estimated as 29.68 (GMCP) vs 3.86 (MCP) and 4.78 (Mnet). As in Table 1, the WLasso severely under-selects and the PLasso severely over-selects in the power-decay setting, with the WLasso selecting only 0.02 variables and the PLasso 421.7 variables on average. This shows that when the correlation structure of the data matrix is not appropriately characterized and used, methods that rely on it, such as the WLasso and PLasso, can perform far worse than standard variable selection methods.

Table 3 gives the comparative results for the group-wise correlation setting in Example 1(b), where the true model has 30 active variables. Our proposed rPGBS method still outperforms all 8 other methods. Since this setting is designed for grouped sparsity, it is not surprising that the performance of the GMCP improves compared with Tables 1 and 2; however, its group selection is still not ideal, since the exact group information is unavailable. Among all methods, the rPGBS attains the highest TA of 98.9%, the highest TPR of 91.1%, and the closest total number of selected variables, 29.48.
Besides the rPGBS, the Enet performs better than the remaining methods in terms of Total #, but still has a low Sensitivity of 40.3%. The under-selection of the WLasso and over-selection of the PLasso persist in this setting, as in the previous two settings. It is evident that the rPGBS is the strongest variable selection method in terms of stability and consistency across different correlation structures.

Table 4 gives the comparison results for Example 2, where both strong correlation and dependence exist among the variables. This is the simulation setting adopted by the PLasso ([9]), so it is not surprising that the performance of the PLasso improves substantially here, as the method is specifically designed to account for the dependence structure among the variables. However, the rPGBS continues to outperform all other methods, including the PLasso. The true model has 10 active variables in this setting; the rPGBS selects 11.92 active variables on average and attains the highest TA of 99.1% and the highest TPR of 75.9%, together with exceptional Specificity. Although improved, the PLasso still tends to over-select, with a Total # of 47.86 and a low TPR of 22.4%; the WLasso again severely under-selects in this setting. To illustrate the variability of each method, boxplots of TA, TPR, TNR and Total # over the 50 replications for this setting are plotted in Figure 4. The stability and consistency of the rPGBS are evident from its comparison with the other 8 methods in those four boxplots.

For the binary response (logistic regression) setting, the comparative results of all previously discussed methods except the WLasso are reported in Table 5; we do not have results for the WLasso since it is designed only for the linear regression model. Similar to the linear regression setting, the methods using the MCP penalty (rPGBS, MCP, Mnet, and GMCP) perform better than the methods using the Lasso penalty, including the PLasso. Among all methods, the rPGBS performs best and the PLasso worst. In the binary response setting, the rPGBS achieves almost perfect selection of the ten active variables, further evidencing its strength for variable selection in highly correlated settings.

ROC curves for rPGBS
As introduced in Section 2.3, a variable is declared active by the rPGBS if its selection proportion $r_j > c$. To demonstrate the performance of the rPGBS for stable and consistent variable selection across different choices of the cutoff value $c$, we let $c$ vary from 0 to 1 and plot the ROC curves for all simulated settings in Figure 4.
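Each ROC curve can be traced by sweeping the cutoff c over [0, 1]; a small sketch, where r is the activation-proportion vector from the rPGBS and truth is the indicator of truly active variables (both assumed available from a simulation run):

roc_points <- function(r, truth, cuts = seq(0, 1, by = 0.01)) {
  t(sapply(cuts, function(cc) {
    sel <- r > cc                      # variables declared active at cutoff cc
    c(FPR = mean(sel[truth == 0]),     # false positive rate
      TPR = mean(sel[truth == 1]))     # true positive rate
  }))
}
# plot(roc_points(r, truth), type = "l", xlab = "FPR", ylab = "TPR")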
The ROC curves show that the rPGBS performs best under the block-wise correlation settings (both linear and logistic models), and is also highly efficient under the other settings. This reinforces that the rPGBS performs stable variable selection across a variety of data settings.

Real data analysis
In this section, we apply the proposed rPGBS method and compare it with the other methods using two real gene expression data sets, one with a continuous response and the other with a binary response. Both are publicly available high-dimensional data sets with highly correlated predictors.

Breast cancer transcriptomic data
The breast cancer transcriptomic dataset was also used in [17] to validate the efficiency of the WLasso approach. It is publicly available (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE2990) and consists of gene expression profiles in breast cancer.
The original dataset has 10,674 genes and 150 observations. Among these genes, BRCA1 is chosen as the response variable while the others serve as predictors. A screening step based on the IQR (inter-quartile range) is applied to the original data matrix, retaining 995 predictors; a sketch of this step is given below. The pairwise correlations plotted in Figure 3 show that the data are highly correlated among those 995 gene expressions.
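A sketch of this screening step, where expr is a hypothetical samples-by-genes expression matrix standing in for the GSE2990 data:

iqr  <- apply(expr, 2, IQR)                   # spread of each gene's expression
keep <- order(iqr, decreasing = TRUE)[1:995]  # genes with the largest IQR
X    <- expr[, keep]                          # screened predictor matrix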
All 9 methods used in the simulation studies are applied to select important variables related to BRCA1. In addition, we compute the prediction error using post-selection estimation: we randomly divide the observations into training and test sets in a 2:1 ratio, each method is used only to select an active variable set $A_{\text{Method}}$, the estimated coefficients $\hat\beta_{A_{\text{Method}}}$ are computed by fitting a least squares regression to the training data restricted to the selected active set, and the prediction errors are computed on the test set (a short sketch of this evaluation appears at the end of this subsection). Table 6 gives both the variable selection results (the total number of selected variables $|A|$ and the full list of variables selected from the full data) and the prediction performance (on the test data) for all 9 methods. The rPGBS selected 16 variables with a prediction error of 0.15884. The MCP, Mnet and WLasso selected fewer variables, but they have higher prediction errors. The variables selected by the rPGBS are IRF9, RAI2, OS9, CYP1B1, DLX6, ATAD2, HADHA, KCNA10, CCNJ, APMAP, ELSPBP1, BMP2, TMUB2, LOC81691, SLC13A2 and ESPL1. A study published in Cancer Research states that its data were able to "...identify a novel IFN-independent role for IRF9 in the development of resistance to antimicrotubule agents in breast tumor cells and may link downstream mediators of IFN signaling to drug resistance in human cancers," supporting the rPGBS selection of this gene ([26]). CYP1B1 has been shown to enhance breast cancer cell proliferation and metastasis ([27]). A study published in Bioscience Reports found that "high expression of DLX6-AS1 was significantly closely associated with poor overall survival in tumor patients" ([28]). Furthermore, with regard to ATAD2, "Overexpression identifies breast cancer patients with poor prognosis, acting to drive proliferation and survival of triple-negative cells..." ([29]). "Involvement of mitochondria and the mitochondrial trifunctional protein (encoded by HADHA gene) in breast carcinogenesis is known" ([30]), indicating that HADHA is a significant gene. The SLC13A2 gene has been associated with metagene predictors of breast cancer recurrence ([31]).
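A sketch of the post-selection evaluation described above, where A is the index set selected by a given method (hypothetical here) and X, y are the screened predictors and the BRCA1 response:

set.seed(1)
idx   <- sample(nrow(X), size = round(2/3 * nrow(X)))     # 2:1 train/test split
train <- data.frame(y = y[idx],  X[idx,  A, drop = FALSE])
test  <- data.frame(y = y[-idx], X[-idx, A, drop = FALSE])
fit   <- lm(y ~ ., data = train)                          # post-selection least squares
mean((test$y - predict(fit, newdata = test))^2)           # test prediction error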

Leukemia Data
The second data set we consider is the leukemia data, used to validate the performance of the proposed method under a binary response. There are 47 patients with acute lymphoblastic leukemia (ALL, coded as 0) and 25 patients with acute myeloid leukemia (AML, coded as 1). The samples were assayed using Affymetrix Hu6800 chips, and expression data on 7,129 genes (Affymetrix probes) are available (http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html); the data can also be accessed in R through the golubEsets package. A screening step selected the predictor genes with the top 500 IQR values (a smaller pool than before, since n is small). The pairwise correlations plotted in Figure 4 show high correlation among the 500 predictors. All methods compared in Section 3.2, except the WLasso, were applied to this dataset to select active variable sets. To compute the misclassification error, the data were split into training and test sets in a stratified manner: the training data were obtained by randomly selecting 80% of the observations from each of the ALL and AML groups, and the test data consist of the remaining observations from each category; stratification was used because the sample size (n = 72) is small. As in Section 4.1, each method was used only to select an active variable set $A_{\text{Method}}$; a post-selection logistic regression was then fit to the training data to obtain $\hat\beta_{A_{\text{Method}}}$, and the misclassification error was computed on the test data. Both the selected active variable sets (from the full data) and the misclassification errors (from the test data) for each method are reported in Table 7.
The rPGBS selects only 3 genes (using a cutoff value c = 0.05) from the leukemia dataset, namely X17042_at, M19507_at, and X95735_at. Both X17042_at and M19507_at appear in Genome Biology's top-ranked 25 gene pairs for ALL/AML class separation ([32]), and X95735_at was on the list of leukemia markers in [33]. In addition, the rPGBS gives the smallest misclassification error of 0.06667. Although the MCP and Lasso produce the same misclassification error, the Lasso selects a much larger active set of 24 variables, and the MCP selects 4 genes, among which only X95735_at is also selected by the rPGBS.

Conclusion
In this paper, we studied stable variable selection for high-dimensional data with highly correlated covariates. We conducted a thorough comparative study of the variable selection performance of 8 existing regularization methods under different high-dimensional correlated data settings. More importantly, we proposed a novel, efficient method (rPGBS) for stable and consistent variable selection in high-dimensional data with possibly strong correlation and dependence.
Our proposed rPGBS method uses pseudo-group assignment and a bi-level penalty to encourage the grouping effect in variable selection, and adopts repeatedly randomized group assignment to achieve stability in variable selection. As a by-product, the rPGBS also produces an activation probability for each candidate variable.
The stability and consistency of the rPGBS for variable selection were demonstrated using both simulation studies and two publicly available gene expression data sets with continuous and binary responses. The results from the simulated data show that the rPGBS performs with consistent efficiency across various correlated data settings. The results from the real data analysis show that the active variable sets selected by our method are small yet attain lower prediction errors. These numerical studies demonstrate that the rPGBS yields stable variable selection across a variety of correlated high-dimensional data settings.
One major advantage of our method over the most closely related work dealing with highly correlated variables in high-dimensional data, namely the WLasso and PLasso, is that our method has a simpler algorithm, which makes it computationally fast even when $p \gg n$. Both the WLasso and PLasso involve complex algorithms for parameter tuning and variable selection, and are seen to be inconsistent across different types of correlation settings. In addition, the proposed method applies to binary response data, which is not possible for the WLasso, and the existing methods that do handle binary responses do not necessarily perform better than the rPGBS. Besides being fast and efficient, the rPGBS is user-friendly, offering a framework that can accommodate various types of high-dimensional data, such as count or survival data. One could also implement the rPGBS with other group assignment and bi-level selection methods.
A possible improvement would be a more principled approach to choosing the parameters, such as the epoch number $M$, $k_{\min}$, $k_{\max}$ and the cutoff $c$. In the current approach, $k_{\min}$ and $k_{\max}$ are gauged intuitively from the size of $p$, which could be refined to obtain more consistent results. Although our current choices of these parameters are simple, they have performed better than existing approaches for stable variable selection.