COL1A1, COL1A2, COL3A1 and DCN are Potential Biomarkers to Predict Progression from Smokers to Suffering from Lung Adenocarcinoma

Background: To explore novel related genes and potential biomarkers that predict progression from smokers to lung adenocarcinoma (LA). Methods: Three datasets from GEO (Gene Expression Omnibus) database were used to identify differentially expressed genes (DEGs) between LA tissue (LAT) and normal tumor adjacent tissue (TAT). The overlap of DEGs could be found and enriched in gene oncology (GO) and pathways to discover the potential biological mechanisms. Protein-protein interaction (PPI) network was applied to nd the relationship among proteins. Survival analysis contributed to the deniteness of key genes. The expression of key genes in LA patients who smoke was veried. Furthermore, genetic alterations, co-expression and pathways of key genes were explored. To obtain more information, key genes were further analyzed in immune inltration, drug target and the distribution of single cell in LA. Results: 245 DEGs were revealed in 3 datasets from GEO. In Kaplan Meier plotter, we found that high expression of COL1A1, COL1A2 and COL3A1 was associated with poorer survival while low expression of DCN was contributed to poorer survival in LAs who smoke. Thus, three up-regulated genes (COL1A1, COL1A2, COL3A1) and one down-regulated gene (DCN) were dened as key genes. Their genetic alterations were more common in female LA smokers and co-expression genes/proteins of them mainly functioned at extracellular matrix. Furthermore, COL1A1, COL1A2, COL3A1 genes had a common targeted drug called Collagenase clostridium histolyticum (DB00048) and DCN gene had a targeted drug called Tromethamine (DB03754). In the Single Cell Expression Atlas of EMBL-EBL, COL3A1 gene was specically highly expressed in female LA patients with brain metastasis. Conclusions: COL1A1, COL1A2, COL3A1 and DCN could be regarded as novel potential biomarkers that predict progression from smokers to lung adenocarcinoma.


Background
According to the investigation of the World Health Organization (WHO), the new cases of lung cancer around the world were 2093876 and its death was 1761007 in 2018. It is well known that cigarette smoking is a main reason leading to lung cancer. In the year 2011, cigarette smoking accounted for 80.2% of lung cancer deaths among adults (≥ 35 years old) in the United States [1] . Besides, smoking status even affected the curative effect of lung adenocarcinoma (LA) and altered the transcriptome of non-involved lung tissue in LA [2,3] . People consider that the major pathological type of smoking-related lung cancer was squamous carcinoma while adenocarcinoma was rare, previously. With advances in diagnostic techniques and improved cigarette design, the proportion of lung adenocarcinoma among smokers has increased obviously in recent decades [4,5] . In the large population-based cohorts, the percentage of adenocarcinoma with current smokers ranged from 30-42% in the United States [6] .
Considering the development trend, we selected lung adenocarcinoma as the research object.
It has been reported that smoking could shorten the survival of LA patients with TP53 mutation and impact the DNA methylation level at genome-wide [7,8] . Besides, it was found that aberrant activation of mast cells and CD4 + memory T cells also contributed to LA development and progression. However, the signaling pathways and driver genes in smoking-associated lung adenocarcinoma remain uncertain [9] . In the last few decades, we mainly focused on the effect of smoking between lung cancer patients and normal people. But why non-smokers also suffered from lung cancer? Besides, we also knew that some people who smoke had no lung cancer during their whole life. Therefore, a further understanding of the biological characteristics and differences between cancerogenic and normal smokers may provide some references for the prevention of lung cancer and the treatment of patients who smoke.
To explore the potential mechanism and biomarkers of smokers who had lung adenocarcinoma, we compared the lung adenocarcinoma tissue (LAT) with normal tumor adjacent tissue (TAT) in smokers by analyzing datasets from GEO. The overlap of DEGs could be found and their gene oncology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways were analyzed to discover the potential biological mechanism. Protein-protein interaction (PPI) network was also applied to observe the relationship among proteins. Hub genes were concluded by Mcode (Molecular Complex Detection) and cytohubba modules in Cytoscape (a public bioinformatics software, version 3.7.1). By using Kaplan Meier plotter, survival analysis was analyzed to de ne key genes. Information from The Cancer Genome Atlas (TCGA) and ONCOMINE database were used to explore genetic alterations and to verify the expression of key genes in LA patients who smoke. Furthermore, the pathways and co-expression of key genes were explored. To obtain more information, key genes were further analyzed in immune in ltration, drag target and the distribution of single cell in LA.

Datasets selection
Based on the pre-retrieval results of objective microarray data, the GEO database was identi ed as the source of datasets. GEO is a comprehensive public database containing high-throughput gene expression data for multiple species and diseases, as well as chip and microarray data [10] . We selected lung adenocarcinoma patients who smoke as study subjects, and compared gene expression differences between LAT and TAT to nd predictive biomarkers. Datasets containing the following criteria for patient inclusion and exclusion may be included. Inclusion criterion: 1. Patients whose genes were sequenced from lung tissue were pathologically diagnosed as lung adenocarcinoma; 2. Smoking status included current or former smoking status, unlimited years of smoking; 3.The microarray data type was limited to gene sequencing data; 4.The age and gender of the patients were not limited. Exclusion criterion: 1. Datasets that couldn't be analyzed online using GEO2R;2. Gene symbol of datasets cannot be extracted or converted.

Identi cation of DEGs
Smoking samples of each dataset were grouped into "normal" and "cancer", and analyzed by GEO2R to search the top 250 DEGs with default setting. The results were downloaded and managed in EXCEL software. According to the criterion of |logFC|>1 and adjusted P-value (adj.P) < 0.01, the DEGs of each dataset were further screened and divided into up-regulated and down-regulated genes by the positive and negative of logFC. Venny2.1 was applied to nd the overlap of up-regulated and down-regulated DEGs in all datasets.

Functional enrichment analysis
In order to elucidate the functional pro les and pathways of the DEGs, we used WebGestalt (WEB-based Gene SeT AnaLysis Toolkit) to obtain the enriched biological process (BP), cellular component (CC), molecular function (MF) and KEGG pathway. P < 0.05 was considered statistically signi cant. The results were presented in bubble diagrams with the ggplot2 and dplyr packages in R 3.6.3 language.
PPI network construction and module analysis Firstly, the PPI network of DEGs was presented using the STRING online database, in which the moderate con dence (an interaction with a combined score > 0.4) was statistically signi cant [11] . Then, the MCODE plug-in in Cytoscape software (version 3.7.1) was used to cluster the PPI network based on topological principles to determine the closely related regions and the most signi cant modules in PPI network was found out (degree cutoff = 2, node score cutoff = 0.2, k-core = 2, max depth = 100) [12,13] .

Hub genes selection
In the identi cation of hub genes, Cytohubba plug-in of Cytoscape plays an important role in ranking the nodes in PPI network based on the different network characteristics of 11 topological analysis methods [14] . In this study, we sorted the top 10 hub genes from high to low according to the four topological methods of MCC (maximum clique centrality), MNC (maximum neighborhood component), Degree and EPC (edge percolated component), and identi ed the common genes as hub genes. In addition, based on the EGA, TCGA and GEO databases, an online tool Kaplan Meier-plotter was applied to analyze overall survival in uence in the hub genes in lung adenocarcinoma patients who smoke [15] . We de ned these genes with overall survival signi cance (P < 0.05) as key genes, showing potential roles to be used as predictive biomarkers.
Key genes analysis in lung adenocarcinoma who smoke Oncomine online database was used to conduct third-party veri cation of key gene expression in lung adenocarcinoma patients who smoke to improve reliability [16] . Besides, the cBioPortal online tool is used to study the genetic changes in the key genes in lung adenocarcinoma patients who smoke [17] . The online analysis tool GeneMANIA was used to build a network including co-expression and pathways of these key genes. This database contains complete datasets from GEO, BioGRID and other databases as well as genome datasets for speci c functions of organisms [18] .

Further analysis of key genes
To analyze the relationship between tumor and immune system, TISIDB online website was used to discover the situations of immune invasion of these genes among lung adenocarcinoma patients, based on the interaction among 28 kinds of lymphocytes. Meanshile, drug targets were also analyzed in TISIDB.
Moreover, the distribution of key genes in lung adenocarcinoma tissue cells was described through the Single Cell Expression Atlas of EMBL-EBL in order to obtain more meaningful information.

DEGs were identi ed between LAT and TAT
Three datasets about gene expression of LAT and TAT were found from GEO including GSE10072 [19] , GSE31547 and GSE32863 [20] . Speci cally, based on GPL94 ([HG-U133A] Affymetrix Human Genome

Enrichment analysis of the 245 DEGs
To reveal the biological functions of the 245 DEGs, we performed functional annotation and pathway enrichment analysis using the WebGestalt online tool. On the one hand, the main results (FDR < 0.01) of GO and KEGG enrichment of 49 up regulated DEGs were as follows: 1. BP: scollagen bril organization, extracellular matrix organization, extracellular structure organization and tissue development ( Fig. 2A). 2. MF: extracellular matrix structural constituent, extracellular matrix structural constituent conferring tensile strength, platelet-derived growth factor binding, structural molecule activity, protease binding and growth factor binding (Fig. 2B). 3. CC: brillar collagen trimer, banded collagen bril, extracellular matrix, collagen-containing extracellular matrix, complex of collagen trimers, endoplasmic reticulum lumen, collagen trimer, extracellular matrix component (Fig. 2C). KEGG: Protein digestion and absorption and ECM-receptor interaction (Fig. 2D). On the other hand, the main results (FDR < 0.01) of GO and KEGG enrichment of 196 down regulated DEGs were as follows: 1. BP: biological adhesion, cell adhesion, cardiovascular system development, vasculature development, blood vessel development, blood vessel morphogenesis, angiogenesis, circulatory system development, tube development and regulation of in ammatory response (Fig. 2E). 2. MF: amyloid-beta binding, peptide binding, glycosaminoglycan binding, signaling receptor binding, amide binding, cytokine binding, transforming growth factor beta binding, growth factor binding, low-density lipoprotein, particle binding and identical protein binding Twenty genes were selected by using PPI network The STRING database was used to predict the potential relationship of 245 DEGs at the protein level (combined score > 0.4). The PPI network was exported from the STRING database and interpreted by Cytoscape software, including 218 nodes and 753 edges (Fig. 3A). In addition, according to the established MCODE plug-in parameters, the most important PPI network module was selected, which was composed of 20 nodes and 76 edges (Fig. 3B). COL1A1, COL1A2, COL1A3 and DCN were identi ed as key genes In order to further identify the hub genes, Cytohubba plug-in was used to sort the top 10 nodes according to the established four topological analysis methods (Table 1). A total of 9 genes (IL6, SPP1, COL1A1, COL1A2, COL3A1, VWF, CTGF, COMP and DCN) were identi ed as hub genes, respectively. To identify key genes, Kaplan Meier plotter was applied to perform survival analysis. The signi cant genes are potential to be used as predictive biomarkers. In LA smokers, we found that high expression of COL1A1 (P = 0.016), COL1A2 (P = 0.015) and COL3A1 (P = 0.04) was associated with poorer survival while low expression of DCN (P = 0.012) was contributed to poorer survival. However, other genes made no sense in survival analysis (Fig. 4). Thus, COL1A1, COL1A2, COL1A3 and DCN were identi ed as key genes. The probable mechanisms of COL1A1, COL1A2, COL1A3 and DCN alterations in LAT To verify the expression of the key genes, 3 studies from Oncomine online database were used to further validate the expression of these 4 genes (Fig. 5A). Comparing to TAT in smokers, COL1A1, COL1A2 and COL3A1 were highly expressed signi cantly (P < 0.001), while DCN was lowly expressed signi cantly (P < 0.001), in LAT. All key genes had obvious expressive dissimilarities in smokers between LA samples and normal lung tissue samples, which was in accordance with the results from GEO database. Regarding the genetic alteration, 4 hub genes were altered in 65 (12.8%) of 508 lung adenocarcinoma patients who smoke (TCGA, rehose legacy) and only samples with genetic alteration were presented in Fig. 5B. Among the 4 genes, COL1A1 and COL1A2 were mainly altered by ampli cation. However, COL3A1 was mainly altered by missense mutation. Besides, DCN was mainly altered by deep deletion and missense mutation and its genetic alteration seemed to occur in female. There was no obvious correlation among the four genetic alterations. However, in Fig. 5C, genetic alterations of these key genes signi cantly occurred in female LA smokers by Chi-squared test (P-value = 1.913 × 10 − 3 , FDR = 0.0421). Including coexpression (84.90%) in 325 studies and pathways (15.10%) in 6 studies, an interaction network for the 4 genes with 20 proteins/genes was generated via GeneMANIA. Based on the synthesis score, the ranking order from high to low was CD36, ITGA11, COL6A3, COL5A2, LUM, COL5A1, SPARC, TGFB3, MXRA5, JUNB, POSTN, FN1, FBN1, VEGFD, THBS2, ITGB1, COL6A1, CDH11, GTF3A and FAP. According to the FDR value, the top seven credible functions of this network were extracellular matrix organization, extracellular structure organization, extracellular matrix, proteinaceous extracellular matrix, extracellular matrix disassembly, collagen and extracellular matrix part (Fig. 5D). COL1A1, COL1A2, COL3A1 and DCN genes could be applied as predictive biomarkers in LA We de ned COL1A1, COL1A2, COL3A1 and DCN genes as key genes and further analyses were performed. Spearman correlation test was conducted to explore the correlation between genes and lymphocyte in ltration. We found that among 517 LA samples, the in ltration abundance of 28 kinds of lymphocytes was almost positively correlated with the expression of the 4 key genes [21] . For the upregulated genes, the top 5 correlations of COL1A1, COL1A2, COL3A1 had better consistency and they were central memory CD8 T cell, regulatory T cell, memory B cell, natural killer cell, and natural killer T cell. For the down-regulated genes, the top 5 correlations of DCN were mast cell, type 1 T helper cell, natural killer cell, macrophage and regulatory T cell (Fig. 6A). Based on the information collected from DrugBank database, COL1A1, COL1A2, COL3A1 genes had a common targeted drug called Collagenase clostridium histolyticum (DB00048) and DCN gene had a targeted drug called Tromethamine (DB03754) (Fig. 6B). Via single cell RNA sequencing of 176 lung adenocarcinoma patient-derived cells (t-SNE Perplexity = 15) [22] , we analyzed the cell distributions in diseases (Fig. 7A), sex (Fig. 7B) and metastatic sites (Fig. 7C). Furthermore, the expression levels of COL1A1 (Fig. 7D), COL1A2 (Fig. 7E), COL3A1 (Fig. 7F) and DCN (Fig. 7G) in these cell distributions were also presented. By comparing these seven results, we found that COL1A1 was more expressed in male LA patients, COL1A2 was more expressed in female LA patients with brain metastases. Excitingly, COL3A1 was speci cally expressed highly in female LA patients with brain metastases.

Discussion
In lung cancer, lots of researches have shown that many signaling pathways could be activated by smoke, including β adrenergic receptor-mediated (β-ARs), nicotinic acetylcholine receptor-mediated (nAChRs), nuclear factor-KB (NF-KB), epidermal growth factor receptor (EGFR) and gamma aminobutyric acid (GABA) signaling pathways [23] . However, understanding the mechanisms that lead to lung adenocarcinoma in smokers remains a hard work. At the genetic level, smokers and nonsmokers were accurately identi ed in LA based on a support vector machine (SVM) classi cation model constructed from 27 characteristic genes with signi cant enrichment of cancer proteoglycan and Ras signaling pathway [24] . It was suggested that 7 mRNAs including CYP17A1, PKHD1L1, RPE65, NTSR1, FETUB, IGFBP1 and G6PC might be used as prognostic indicators in smokers with LA [9] . However, very few researchers focus on the development from normal lung tissue to LA tissue in smokers. Therefore, our research is committed to discover potential biomarkers that predict progression from smokers to lung adenocarcinoma.
In this study, a total of 245 DEGs (196 down-regulated genes and 49 up-regulated genes) were identi ed. In GO and KEGG enrichment analysis of 196 down-regulated genes, we found that these genes exist in the extracellular matrix and the endoplasmic reticulum lumen in the cytoplasm. They are mainly involved in the biological process of the construction of extracellular matrix and tissues development through protein digestion and absorption and extracellular matrix receptor action pathways, catalyzing the synthesis of extracellular matrix structural components, as well as the combination of growth factors and proteases. However, in GO and KEGG enrichment analysis of 49 up-regulated genes, we found that existing in extracellular matrix, cytoplasmic vesicles, secretory vesicles and plasma membrane, these genes catalyze the binding of various molecules through the complement system pathway, which is mainly involved in cell or molecule adhesion, blood vessel formation and development, and the regulation of in ammatory response. Thus, it has been seen that 245 DEGs mainly acted on the tumor microenvironment which is essentially composed of genetically abnormal cells surrounded by blood vessels, broblasts, immune cells, stem cells and extracellular matrix (ECM) [25] . The complement pathway is a type of innate immunity that mainly supplements immunoglobulin and enhances the ability of immune cells to clear by promoting in ammation and attacking pathogen cell membranes [26] . In the tumor microenvironment, complement regulates both pro-tumor and anti-tumor pathways. It has been proved that complement activation via a C3a receptor pathway mediates lung cancer progression while RNA interference with CD59 synthesis can inhibit the growth and metastasis of lung adenocarcinoma cells [27,28] . Dysregulated complement activation is a key link between in ammation, the suppression of antitumour immune responses and the promotion of tumorigenesis [29] . Furthermore, It was suggested that complement inhibitors or activators combined with targeted therapy or immunotherapy have promising prospects in the treatment of lung adenocarcinoma [26] .
After a series of screening, COL1A1, COL1A2, COL3A1 and DCN genes were de ned as key genes. COL1A1 (Collagen Type I Alpha 1 Chain) and COL1A2 (Collagen Type I Alpha 2 Chain) are protein coding genes that encode the pro-chains of type I collagen (COL1) which is related to osteogenesis imperfect [30] . COL3A1 (Collagen Type III Alpha 1 Chain) is another kind of collagen coding gene and its mutation is responsible for Ehler-Danlos syndrome type IV [31] . DCN, also called Decorin, is a protein coding gene and its mutation was regarded as a main aetiological agent of Congenital stromal corneal dystrophy (CSCD) [32] . Genetic alterations of these key genes were more common in female LA smokers. The network of coexpression and pathway indicated that these 20 genes/proteins were mainly functioned in extracellular matrix (especially collagen). CD36 was the most relevant pathway gene/protein with COL1A1 and COL1A2. CD36 could mediate the related pathway to inhibit angiogenesis which is the basis of tumor growth and metastasis [33] . Thus, we showed that COL1A1 and COL1A2 might promote tumor progression by inhibiting CD36 related pathways.
Regarding the immune in ltration analysis, we found that COL1A1, COL1A2, COL3A1 mainly regulate central memory CD8 T cell, regulatory T cell, memory B cell, natural killer (NK) cell, and natural killer T (NKT) cell while DCN mainly regulates mast cell, type 1 T helper cell, natural killer cell, macrophage and regulatory T cell. These immune cells act different roles in tumor immune microenvironment. Central memory CD8 T cell plays an important role in immunocytotherapy that it was obtained from the patient's body, proliferated in vitro and then transferred back to the body to achieve the anti-tumor effect [34] . For patients who were unresponsive to PD-L1 or cytotoxic T lymphocyte-associated protein 4 (CTLA-4), regulatory T cell was accelerated to express by antibody-mediated depletion of immune checkpoint 4-1BB in order to modulate an antitumor immune response [35] . Memory B cells secreted high level of immunoglobulin against tumor antigens that accumulated in regional lymph nodes partly caused by PD-L1 blockade [36] . Studies have shown that patients with low NK cells activity had an increased risk of lung cancer, and injecting multiple allogeneic NK cells tended to have a better prognosis [37,38] . Based on hematopoietic stem cell-engineer, invariant NKT cell, a potent immune cell for targeting cancer, changed its disadvantage of low level in cancer patients and developed an original therapy proved with long-term effect and no toxicity in vivo [39] . Mast cells can both anti-tumor through tumor in ltration can directly affect the proliferation and invasion of tumor cells and promote tumor through the establishment of the tumor microenvironment and regulating tumor cell immune response and whether anti-tumor or promote tumor depends on cancer types, tumor progression and the location of immune cells in tumor [40,41] . Particularly, it was suggested that abnormal activation of mast cells led to lung immune dysfunction in smokers, which also contributes to tumor development and progression [42] . Type 1 T helper cell mainly secretes cytokines Interferon-γ (IFN-γ) which is also bene cial to tumor cells such as facilitating tumor growth, altering immune resistance of tumor and promoting immunosuppressive tumor microenvironment [43,44] . The ability of macrophage to mount an effective antitumor response was governed by metabolism meanwhile its metabolism can be actively reprogramed by the tumor microenvironment via metabolites, cytokines or other signaling mediators so that the anticancer effect of macrophages was reduced [45] . Thus, inhibition of this reprogramming process has become a new approach for tumor therapy.
The immune environment in ltrated by these genes has potential value in tumor development and treatment. It is the importance of these genes in tumors has prompted us to look for relevant targets to inhibit them. In drug target analysis, COL1A1, COL1A2, COL3A1 genes had a common targeted drug called Collagenase clostridium histolyticum (DB00048) and DCN gene had a targeted drug called Tromethamine (DB03754). However, both of them haven't been applied in cancer therapy. Furthermore, single cell sequencing data was applied to explore expression speci city of these genes. It was displayed that COL1A1 gene more often expressed in male LA patients while COL1A2 and DCN genes more often expressed in female LA patients with brain metastases and COL3A1 gene was speci c high expression in female LA patients with brain metastases. Via IHC, Liu Y et al proved that COL1A1 and COL3A1 were signi cantly high expression in brain tumor tissue metastasized from LA [46] . Due to the speci c expression of COL3A1, it has the potential to predict brain metastases in LA. However, no studies have described the relationship between other genes (COL1A2 and DCN) and brain metastases.
Although potential biomarkers were found in our study, there are some de ciencies as followed. Firstly, limited to a clinical condition of smoker, we only choose ONCOMINE data to verify the expression key genes and didn't analyze the correlation of key genes and tumor stages. Both of them can further enhance veracity and richness in our study. Secondly, due to the limited experimental conditions, we didn't conduct experimental veri cation of the key genes, and some of these genes were only veri ed by the experimental results of published literature. Thirdly, the concrete functions and mechanisms of how key genes induce LA are still unclear. Thus, more studies are needed to clarify those questions.

Conclusions
By bioinformatics analysis of the gene expression pro le of smokers in LA, four core molecules were identi ed potentially associated with the pathogenesis from normal to LA in smokers, including 3 up regulated genes (COL1A1, COL1A2, COL3A1) and a down regulated gene (DCN). Besides, their genetic alterations were more common in female LA smoker and co-expression genes/proteins of them mainly functioned in extracellular matrix. Especially, COL3A1 is a potential biomarker to predict brain metastasis in lung adenocarcinoma. These key genes may be regarded as novel potential biomarkers that predict progression from normal to LA in smokers. However, due to the limitations of this study, further studies are needed to elucidate the biological function of these genes in the progression from normal to LA in smokers.    pathway networks of key genes via GENEMANIA (The area of genes/proteins represents synthesis score in co-expression and pathway. The thicker the line, the smaller the FDR value of the correlation).