Title: Identification of Novel Genes Associated with Chronic Obstructive Pulmonary Disease Using Different Penalized Logistics Regression with High-Dimensional Biological Data from Human Sputum Cells

Background : Comparison of LASSO, smoothly clipped absolute deviation (SCAD) and minimax concave penalty (MCP) logistic classifiers in order to reconnaissance of related genes with COPD disease and assessing the genes effects on the progression of the disease based on one of the main classes of cells involved in the disease, Sputum Cells. We used a genome-wide expression profiling to define gene networks relevant to the disease. The data retrieved from Gene Expression Omnibus (GEO) with accession numbers "GSE22148". From 143 samples in GOLD stage 2-4 COPD ex-smokers, 54,675 probes primary were assessed. After normalization, LASSO, SCAD and MCP logistic regressions were applied. K-fold cross-validation scheme was used to evaluate the performance of two methods. All of the computational processes were done using "ncvreg", "Affy," "Limma" and "SVA" R packages. Results:


Background
Chronic obstructive pulmonary disease (COPD) is a progressive inflammatory disease mostly characterized by airway obstruction and is predicted to be among the first three causes of death worldwide by 2030 (1). General exhibitions include emphysema, small airway obstructions, and chronic bronchitis. In a healthy airway and alveolar development background, smoking is the first-line risk factors followed closely by the history of maternal/paternal asthma, maternal smoking, and childhood asthma or respiratory infections. Polluted air, second-hand smoking, and malnutrition could also lead to COPD in the susceptible population (2).
COPD is a complex disease and besides the environmental factors, the disease course is dependent on interactions between different genes. DNA microarrays now permit scientists to screen thousands of genes simultaneously and determine whether the expression pattern of these genes changes in normal and COPD tissue. So, new analytical methods must be developed for selecting genes related to COPD. Since there are numerous allele variants involved in COPD, single nucleotide polymorphism (SNP) has been vastly used and shown numerous susceptibility regions on the genome (3,4). Therefore, the associations between the genomics and the disease incidence and progression could be studied more precisely through machine-learning techniques (5). Also, network medicine has been introduced for facilitating the investigations on genomics, transcriptomics, proteomics, and other "-omics" to cast a more elucidating light on the complexity of the pathogenesis of diseases like COPD (6). One of the properties of microarray data is that the number of genes (p) exceeds the number of samples (n). They are dealing with the situation, which is commonly known as the high dimensional dataset. However, logistic regression as a highly appropriate classification tool for such high dimensional datasets from the microarray technique has a few drawbacks, such as the emergence of irrelevant data (7,8).
Moreover, regression analysis has been established to tend to overlook the multicollinearity problem (strong correlation between two or more than two genes in the regression model) (9).
So, overfitting and multicollinearity are the most common problems that arise in highdimensional data when applying statistical classification and prediction methods (10). Nowadays, researchers update, improve, and optimize the models such as the LASSO, minimax concave penalty (MCP), and smoothly clipped absolute deviation (SCAD) to introduce statistical learning models to overcome this issue. Penalized Logistic Regression models represent spares and interpretable model in high dimensional datasets and control the multicollinearity (7,8). Up until now, there has been no study on -omics data integration on COPD with these approaches.
Although LASSO has many excellent properties, it is a biased estimator, and this bias does not necessarily go away as increases. The bias of the LASSO estimate for a truly non-zero variable is constant even for large regression coefficients. One approach to reducing the bias of the LASSO is to use the weighted penalty approach. If we choose the weights in such a way that the variables with large coefficients have smaller weights, then we can reduce the estimation bias of the LASSO. It is the motivation of adaptive LASSO approaches. The SCAD penalty retains the penalization rate (and bias) of the LASSO for small coefficients, but continuously relaxes the rate of penalization as the absolute value of the coefficient increases. The idea behind the MCP is very similar. In comparison to SCAD, however, the MCP relaxes the penalization rate immediately while with the SCAD, the rate remains flat for a while before decreasing.
On the other hand, from a biological perspective, only a small subset of genes is strongly indicative of the target disease, and most genes are irrelevant to COPD classification and prediction (5). To fill this gap, the present study designed to apply statistical-learning methods for better understanding the genetic etiology in COPD affected by smoking habit. LASSO, MCP, and SCAD logistic regression applied to identify the most important genes related to GOLD stage 2-4 COPD ex-smokers.

Differential analysis of genes expression data
Differential analysis was performed on the expression profiling of 54675 probes by the array. The expression profilings of the probes were extracted from 143 patients in GOLD stage 2-4 COPD.
The results of the differential analysis showed significant expressions for the top 250 genes after adjusting values by the Benjamini-Hochberg-FDR correction at α =0.05 (More details shown in additional file 1).

Gene selection and prediction
Based on the LASSO logistic regression, 24 genes significantly affect COPD severity ( Table 1). The results of MCP and SCAD models are also presented in Table 1. The MCP model is the most conservative one and shows only seven genes as significant. The results of SCAD and LASSO models were similar except for one gene, "stromal interaction molecule 2", which was not significant in the SCAD model (p-value=0.074). While SCAD logistic classifier had the highest AUC 7 (0.97, 95% CI=0.95-0.98), sensitivity (95%), specificity (85%), and Youden index (0.81), LASSO logistic classifier had AUC (0.95, 95% CI=0.92-0.96), sensitivity (91%), specificity (86%) and Youden index (0.78). The ROC curves for the three approaches are depucted in figure 1. The difference between LASSO and SCAD accuracy indices was not significant (p-value=0.39). Consequently, 24 candidate genes identified here were associated with the progression of COPD by these classifiers. Based on their patterns of co-expression for twenty-four candidate genes by the heatmap plot for hierarchical clustering showed that all of the candidate genes were divided into the two major clusters ( Figure 2). According to this figure, the MCP model, in contrast to other models, showed the "Caldesmon 1" as the only significant gene in this classifier.

Discussion
Chronic Obstructive Pulmonary Disease (COPD) is a progressive life-threatening lung disease that causes breathlessness, and it was the fifth leading cause of death in 2002 (11). This disease continues to be a significant cause of morbidity, mortality, and health-care costs worldwide (12).
Total deaths from COPD are projected to increase by more than 30% in the next ten years unless urgent action is taken to reduce the underlying risk factors, especially tobacco consumption. The global burden of the disease study reports a prevalence of 251 million cases of COPD in 2016, worldwide. Estimates show that COPD becomes the third leading cause of death in 2017, worldwide. Recent studies have indicated that the occurrence of lung cancer is a multiple-factors and multiple-step process, and it is the result of interaction between genetic and environmental exposure factors (13). COPD is a complex disease that is influenced by genetic factors, environmental influences, and genotype-environment interactions. Genotype-environment interactions have been of interest to geneticists for decades (14). The main risk factors for COPD included smoking, indoor air pollution, outdoor air pollution, genetic factors, and occupational dust and chemicals. Some cases of COPD are due to long-term asthma.
In this study, we also identified 24 significant genes associated with COPD progression. That may represent novel biomarkers in the prognosis of COPD. In our analyses, the most significantly Stromal interaction molecules, STIM2, is a regulator of store-operated calcium (Ca2+) entry as well as basal cytoplasmic Ca2+ levels in human cells (15). As is known, E2 exposure inhibited STIM1 translocation in airway epithelia, preventing SOCE. E2 can signal non genomically by inhibiting basal phosphorylation of STIM1, and STIM2, leading to a reduction in SOCE (16).
Another study showed that STIM1 and STIM2 were significant as up-regulated genes versus healthy controls and healthy smokers (17). Also, AMOTL1 via the activation of LKB1/AMPK signaling and IFN-γ-induced hyperpermeability of cultured human lung microvascular endothelial cells by maintaining the levels of AmotL1 is related to lung function (18). Angiomotin Like one and caldesmon, one both have a role in involvement in Adhesion and Cell Motility in lung airway and alveolars and may have a role in obstruction of the airway by a problem in the expulsion of produced mucosa and destruction of alveolar walls or spasm in small airways (19). The STIM2 and AMOTL1 were selected as the most important genes in this study may reveal these genes as a novel target in the treatment of COPD (20). Also, caldesmon one gene (CALD1) as a novel gene associated with both the overall survival and the disease-free survival in bladder cancer patients (21), diabetic nephropathy (22,23), and glioma neovascularization (24). RNF130, ring finger protein 130, is a candidate gene for Endoplasmic reticulum-associated degradation using phylogenetic tree analysis (25). RNF130 is involved in the pathogenesis of gestational trophoblastic diseases (26). CALD1 and RNF130 genes are not previously detected in COPD studies that may represent novel biomarkers in the diagnosis or prognosis of COPD. However, STX6 is involved in diverse cellular functions in a variety of cell types and has been shown to regulate many intracellular membrane trafficking events such as endocytosis, recycling, and anterograde and retrograde trafficking. The oncogenic roles of STX6 in the progression of esophageal squamous cell carcinoma (ESCC) is established, and it might be a valuable target for ESCC therapy (27). An Epigenome-Wide Association Study found that STX6 may one of genes that associated with atopic asthma. They also reported that STX6 may has a role in methylate process that is seen in this disease (28) In other studies, the association between PLCB1 and the decrease of FEV1 and FEV1/FVC in their participation was shown (29). Another study reported that PLCB1 at least in two pedigree with severe COPD was observed (30) Other group found an association between ZNF33A gen and remission from asthma(31) LARP4B is a target gene of miR106 inhibition of miR-106b that suppressed the mRNA and protein expression of cancer-related genes (32). Syntaxin 6 may have a role in regulating neutrophil secondary granule exocytosis also stimulation of cells by Ca2+. This role may be effective in the inflammatory process that results in obstruction in airways in COPD (33).
Chronic obstructive pulmonary disease is a progressive health problem that is accompanied by dyspnea, cough, and sputum production. Dyspnea is caused by two mechanisms: 1) alveolar cell destruction and inability to the alveolar wall to maintain their structure and decrease available respiratory gases exchange surface area, and 2) inflammation of airways that causes narrowing It seems that two main mechanisms participate in the development of COPD, and several genes may involve in these processes. The first mechanism is oxidative stress and response of immune cells such as neutrophils, CD4, and CD8 lymphocytes and macrophages, which have an important role in this inflammatory process. It is reported that macrophage 5-10 times increased in airways, lung parenchyma, BAL fluid, and sputum in patients with COPD (35). Gens such as AMOTL1, Syntaxin 6, and PRX2 may have a role in inflammation or oxidative stress response.
Some other genes, such as Cacna1g and smooth muscles that existed in small airways. Some genes, such as CALD1, may have a role in the maintenance of cell members and may in the destruction of alveolars walls. And another mechanism may induce by these genes may production of glycoprotein s and amyloids that help to obstruct the small airways.
Regularization-based logistic regression models (e.g., LASSO, MCP, and SCAD) have already been used widely in microarray analysis (36). In this study, based on 10-fold cross-validation, the SCAD and LASSO regularized logistic regression models were found to perform better than the MCP logistic regression. LASSO has satisfying properties, and it is good for simultaneous estimation and eliminating trivial genes but is not good for grouped selection in microarray data. Our previous study showed that LASSO-based methods' (e.g., elastic-net) penalty are useful for gene selection in microarray COPD data (37). However, it is known that LASSO requires rather stringent conditions on the design matrix to be variable selection consistent (38). Non-convex penalized high-dimensional regression has recently received considerable attention to focus on identifying the unknown sparsity pattern,. We recommended the SCAD penalty, which enjoys the oracle property for variable selection. In this study, SCAD regularized logistic regression models was found to perform better than LASSO. The performance of the MCP procedure is satisfactory, but its optimal performance depends on the tuning parameter (39). In this study, MCP is the most conservative method for gene selection. In microarray data classification, SCAD regularized logistic regression may provide a useful methodology for future studies in the discovery of novel diagnostic-and prognostic biomarkers and novel therapeutic targets in the treatment of COPD.

Conclusion
In the present study, the relative expressions of thousands of the genes were assessed and identified as associated genes with the progression of COPD. Differential analysis of gene expression data is able to reduce the number of genes but in a limited manner. In order to find an efficient and small subset of genes, we should use alternative approaches like logistic regression. Regularization solves the high dimensionality problem in using this kind of regression.
In this dataset, SCAD logistic regression, with a lot of advantages in theoretical, computational, and practical aspects, had a higher accuracy rate than LASSO and MCP penalties and might be useful for diagnosis or suitable intervention for COPD. 13

Study population and dataset
The raw data of gene expression architecture in the small airway epithelium (SAE) of COPD In that study, two sputum studies were performed in GOLD stage 2-4 COPD ex-smokers from the ECLIPSE cohort. First, gene array profiling at baseline for 1480 patients was performed. One year later, samples from a separate population of 176 patients were used for real-time PCR. The gene expression findings for IL-18R were further analyzed using immunohistochemistry in lung tissue and induced sputum samples from patients outside the ECLIPSE cohort.

Normalization and filtering of primary probes
The "sva" and "affy" packages were used respectively for removing batch effects and other unwanted variations in data and for statistical comparisons (41,42). Also, the standardization and normalization in the "limma" package performed (43). In addition, differential analysis of genes expression data was conducted using the adjusted P-value based on the Benjamini-Hochberg-FDR correction at α =0.05. All statistical analyses performed using R version 3.5.2 (44).

LASSO, MCP, and SCAD logistic regressions
The ridge regression adds squared magnitude of coefficients as penalty term to regularize parameters. All of the estimated coefficients are non-zero, so no gene selection is performed.
However, LASSO regression uses the absolute value of magnitude of coefficient as penalty term instead (Equation 1), and hence provide automatic gene selection. On the other hand, the ridge penalty tends to shrink the coefficients of correlated variables toward each other, which is good for multicollinearity and grouped selection. However, the LASSO penalty is somewhat indifferent to the choice among a set of strong but correlated variables. Therefore, LASSO is good for simultaneous estimation and eliminating trivial genes but not for grouped selection. However, it is known that LASSO requires rather stringent conditions on the design matrix to be variable selection consistent (38). Characteristic (ROC) curve (AUC) was used to calculate the accuracy of classifiers' performance (50,51). We used "cv.ncvreg" and "roc" function in "ncvreg" and "pROC" R packages (52) for K-CV and ROC analysis respectively. 16

Interactive Cluster Heatmap
A heatmap is a popular graphical method for visualizing high-dimensional data. Rows and columns are sorted using a hierarchical clustering technique (53). The interactive cluster heatmap was applied by the "heatmaply" R package (54).

Consent for publication
Not applicable

Availability of data and materials
The datasets generated and/or analysed during the current study are available in the Gene Expression Omnibus (GEO) site in the National Center of Biotechnology Information (NCBI) database, with the accession number "GSE22148" repository, https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE22148       Table 1 should be palced in page 6.