Stable Variable Selection for High-dimensional Genomic Data with Strong Correlations

doi:10.21203/rs.3.rs-923319/v1


Research Article



This work is licensed under a CC BY 4.0 License

Version 1



Background: High-dimensional genomic data often exhibit strong correlations among variables, which lead to unstable and inconsistent estimates from commonly used regularization approaches such as the Lasso, MCP, and related methods.

Result: In this paper, we perform a comparative study of regularization approaches for variable selection under different correlation structures, and we propose a two-stage procedure, named rPGBS, for stable variable selection in various strong-correlation settings. The approach repeatedly runs a two-stage hierarchical procedure consisting of random pseudo-group clustering followed by bi-level variable selection.
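The repeated two-stage idea described above can be illustrated with a minimal sketch. This is not the authors' rPGBS implementation: the abstract does not specify how pseudo-groups are formed or which penalties are used, so the sketch assumes a plain random partition of the variables into pseudo-groups and replaces penalized bi-level selection with a simple screening stand-in (groups ranked by mean absolute marginal correlation with the response, then variables ranked within the retained groups). The function name `selection_frequencies` and all of its parameters are hypothetical.

```python
import random


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def selection_frequencies(columns, y, n_groups, n_keep_groups,
                          n_keep_within, n_repeats, seed=0):
    """Hypothetical sketch: repeat (random pseudo-grouping -> two-level
    screening) and return how often each variable is selected."""
    p = len(columns)
    if not 1 <= n_groups <= p:
        raise ValueError("n_groups must be between 1 and the number of variables")
    rng = random.Random(seed)
    # Stand-in importance score; the paper uses penalized regression instead.
    score = [abs(pearson(col, y)) for col in columns]
    counts = [0] * p
    for _ in range(n_repeats):
        # Stage 1 (assumed): random partition of variables into pseudo-groups.
        order = list(range(p))
        rng.shuffle(order)
        groups = [order[g::n_groups] for g in range(n_groups)]
        # Stage 2a: group-level selection, keep the highest-scoring groups.
        groups.sort(key=lambda g: sum(score[j] for j in g) / len(g), reverse=True)
        for group in groups[:n_keep_groups]:
            # Stage 2b: within-group selection, keep the top variables.
            ranked = sorted(group, key=lambda j: score[j], reverse=True)
            for j in ranked[:n_keep_within]:
                counts[j] += 1
    return [c / n_repeats for c in counts]


# Synthetic demo: 20 strongly correlated predictors sharing one latent factor;
# only the first two enter the response.
data_rng = random.Random(1)
n, p = 100, 20
latent = [data_rng.gauss(0, 1) for _ in range(n)]
columns = [[0.8 * latent[i] + 0.6 * data_rng.gauss(0, 1) for i in range(n)]
           for _ in range(p)]
y = [columns[0][i] + columns[1][i] + data_rng.gauss(0, 1) for i in range(n)]

freq = selection_frequencies(columns, y, n_groups=5, n_keep_groups=2,
                             n_keep_within=2, n_repeats=50)
```

Variables with a selection frequency close to 1 are the ones chosen consistently across random groupings, which is the sense of "stable" used here. A faithful implementation would replace the correlation-based screening with the penalized bi-level selection step described in the paper.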

Conclusion: Both the simulation studies and the high-dimensional genomic data analysis demonstrate the advantage of the proposed rPGBS method over most commonly used regularization methods. In particular, rPGBS yields more stable variable selection across a variety of correlation settings than recent methods designed for variable selection under strong correlations. Moreover, rPGBS is computationally efficient across various settings.

Bioinformatics

Bi-level sparsity

High-dimensional data

MCP

Stable variable selection

Strong correlation