Background: This study compares LASSO, smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP) logistic classifiers for identifying genes related to chronic obstructive pulmonary disease (COPD) and for assessing the effects of those genes on disease progression, based on sputum cells, one of the main classes of cells involved in the disease.
Methods: We used genome-wide expression profiling to define gene networks relevant to the disease. The data were retrieved from the Gene Expression Omnibus (GEO) under accession number GSE22148. For 143 samples from ex-smokers with GOLD stage 2-4 COPD, 54,675 probes were initially assessed. After normalization, LASSO, SCAD and MCP logistic regressions were applied. A k-fold cross-validation scheme was used to evaluate the performance of the methods. All computations were done using the "ncvreg", "affy", "limma" and "sva" R packages.
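The three penalties differ mainly in how they treat large coefficients: the LASSO penalty keeps growing linearly, while SCAD and MCP level off, which is why they tend to select fewer genes. As a minimal illustration, the sketch below writes out the standard textbook forms of the three penalty functions in plain Python. It is not the "ncvreg" R code used in the study, and the tuning values a = 3.7 and gamma = 3 are illustrative assumptions rather than settings reported here.

```python
def lasso_penalty(beta, lam):
    """LASSO penalty: lam * |beta|, linear in |beta| everywhere."""
    return lam * abs(beta)


def scad_penalty(beta, lam, a=3.7):
    """SCAD penalty (requires a > 2); a=3.7 is an assumed illustrative value."""
    b = abs(beta)
    if b <= lam:
        # Same as LASSO near zero.
        return lam * b
    if b <= a * lam:
        # Quadratic transition region.
        return (2 * a * lam * b - b * b - lam * lam) / (2 * (a - 1))
    # Constant for large coefficients: no further shrinkage.
    return lam * lam * (a + 1) / 2


def mcp_penalty(beta, lam, gamma=3.0):
    """MCP penalty (requires gamma > 1); gamma=3 is an assumed illustrative value."""
    b = abs(beta)
    if b <= gamma * lam:
        # Penalization rate relaxes smoothly from lam down to zero.
        return lam * b - b * b / (2 * gamma)
    # Constant beyond gamma * lam.
    return gamma * lam * lam / 2


if __name__ == "__main__":
    lam = 1.0
    for beta in (0.5, 2.0, 5.0):
        print(beta,
              lasso_penalty(beta, lam),
              scad_penalty(beta, lam),
              mcp_penalty(beta, lam))
```

For a small coefficient all three penalties are close to each other; for a large one the SCAD and MCP values stay fixed while the LASSO value continues to increase.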
Results: The results of LASSO (AUC = 0.95, sensitivity = 0.91, specificity = 0.86) and SCAD (AUC = 0.97, sensitivity = 0.95, specificity = 0.85) logistic regression were similar. There were 23 and 22 significantly associated genes for LASSO and SCAD, respectively. The only difference between these two models concerned "stromal interaction molecule 2". With the MCP approach, the most conservative method, only 7 significant genes were detected (AUC = 0.94, sensitivity = 0.94, specificity = 0.82).
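For readers less familiar with the reported metrics, the following sketch shows how sensitivity, specificity and AUC can be computed from true class labels and predicted probabilities. The labels, scores and the 0.5 threshold are hypothetical toy values chosen for illustration; they are not data or settings from this study.

```python
def sensitivity_specificity(labels, scores, threshold=0.5):
    """Return (sensitivity, specificity) for 0/1 labels at a given cutoff."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)


def auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy example: 3 positive and 3 negative samples.
    toy_labels = [1, 1, 1, 0, 0, 0]
    toy_scores = [0.9, 0.8, 0.4, 0.6, 0.2, 0.1]
    print(sensitivity_specificity(toy_labels, toy_scores))
    print(auc(toy_labels, toy_scores))
```

Sensitivity and specificity depend on the chosen cutoff, whereas AUC summarizes ranking quality across all cutoffs, which is why the two kinds of metric can order the classifiers differently.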
Conclusions: In the present study, the relative expression of thousands of genes was assessed to identify genes associated with the progression of COPD. Differential analysis of gene expression data can reduce the number of genes, but only to a limited extent. To find a small and efficient subset of genes, alternative approaches such as logistic regression are needed. Regularization addresses the high-dimensionality problem that arises when this kind of regression is applied.