Identication of important genes and pathways leading to poor prognosis of non-small cell lung cancer using Integrated Bioinformatics

Non-small cell lung cancer (NSCLC) is a common malignancy with a high morbidity and mortality rate worldwide. Herein, we employed an integrated bioinformatics approach to identify crucial genes and to investigate the underlying mechanism in the poor prognosis of NSCLC. We performed an integrated analysis of microarray datasets (GSE33532, GSE19188, GSE102287, GSE27262) containing 319 NSCLC and 232 normal lung tissues. 52 up-regulated and 173 down-regulated overlapping-differentially expressed genes were identied, further validated by cBioportal. DAVID was employed for Gene Ontology, KEGG pathway analysis, and STRING for protein-protein interaction network analysis. Cytoscape plug-ins, MCODE and CytoHubba, were used to acquire module 1–2 and to screen the top ten core genes, respectively. GEPIA and Kaplan-Meier plotter analyzed the expression levels and survival rates of the ten core genes, respectively. Four genes were up-regulated (COL1A1, MMP1, ADAM12, CTHRC1), and six genes were down-regulated (VWF, CD36, OGN, EDN1, CAV1, ITGA8) in NSCLC. Seven genes (COL1A1, ADAM12, VWF, OGN, EDN1, CAV1, ITGA8) were associated with NSCLC's poor prognosis (P < 0.01), four of which (VWF, CAV1, ITGA8, COL1A1) were enriched in the focal adhesion pathway (P = 1.04E-04), as per KEGG pathway analysis. It suggests that VWF, CAV1, ITGA8, and COL1A1 might promote NSCLC via the focal adhesion pathway.


Introduction
Lung cancer has a very high prevalence, morbidity, and mortality rate across the globe [1]. Lung cancer is classi ed as Non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC), which are the most commonly reported lung cancer diagnoses. Currently, 80-85% of lung cancer cases are NSCLC. The lung adenocarcinoma, along with lung squamous cell carcinoma, is the most commonly reported NSCLC subtypes, which accounts for around 40-50% and 20-30% of NSCLC cases, respectively [2][3][4]. Stage I or II (early-stage)_cases of NSCLC are primarily treated with surgical circumscription; however, such cases are not frequently observed. However, a majority of the NSCLC cases (60%) are diagnosed at stage III or IV (metastatic stage) or an advanced stage. Presently, surgical treatment is unavailable for NSCLC, and conventional chemotherapy, along with radiation therapy, is widely employed to treat the a icted subjects. Recent reports suggest the occurrence of EGFR-gene arrangements in lung adenocarcinoma.
TKIs are widely used to treat the tumors with either EGFR mutation or aberrant EGFR activation, as TKIs effectively mitigate the EGFR activity by competitive inhibition of ATP.TKIs have shown promising results in NSCLC treatment, which has encouraged further investigation of targeted therapy [5]. A plethora of studies suggested that targeted therapy can signi cantly improve the survival rate of NSCLC patients [6,7]. Therefore, an in-depth analysis of the underlying molecular mechanism for altered genomic changes and molecular pathways in the NSCLC and its association with the onset and progression of NSCLC might enhance the precision of targeted therapy. Also, it would lead to a better prognosis and an increased survival rate of NSCLC patients.
In the current investigation, rstly, Gene Expression Omnibus (GEO) was used to download four original microarray datasets, which encompassed 319 NSCLC, and 232 normal lung tissues (GSE33532 [8], GSE19188 [9], GSE102287 [10], and GSE27262 [11]). These datasets were analyzed by the GEO2R, an open-access online tool, and the Venn diagram was generated to identify the overlap between the differentially expressed genes (DEGs) in the four datasets. DAVID web portal was employed to perform the Gene Ontology (GO) and KEGG pathway enrichment analysis of the selected DEGs. The functional signi cance of the DEG's interactions was determined by STRING, an online database to analyze a protein-protein interaction (PPI) network. Besides, Cytoscape plug-ins (MCODE and CytoHubba) were employed to analyze the crucial modules and to acquire the top ten core genes. The NSCLC (MSK, Cancer Cell2018) [12], NSCLC (TRACERx, NEJM2017) [13], and NSCLC (TCGA, Nat Genet2016) [14] were validated on the cBioportal to ensure the reliability of DEGs identi cation. Additionally, we performed the gene expression pro ling interactive analysis (GEPIA) based study to evaluate the differences in expression levels of ten core genes between NSCLC and normal lung tissues. Concurrently, we analyzed the survival prognosis of genes by the Kaplan-Meier plotter in which only seven genes matched. KEGG pathway enrichment analysis of these seven DEGs showed the signi cant enrichment of VWF, CAV1, ITGA8, and COL1A1 in the focal adhesion pathway.

Microarray data sources and DEGs data processing
NCBI-GEO (https://www.ncbi.nlm.nih.gov/gds/) is a functional genomics database that includes microarray/gene pro les. In this study, we used this public repository to acquire gene expression pro les of the NSCLC and normal/adjacent lung tissues (GSE33532, GSE19188, GSE102287, and GSE27262). All chip data were from the GPL570 platform ([HG-U133_Plus_2] Affymetrix Human Genome U133 Plus 2.0 Array). It included a total of 80 NSCLC and 20 normal lung tissue, 91 NSCLC and 65 normal lung tissue, 123 NSCLC and 122 normal lung tissue, 25 NSCLC, and 25 normal lung tissue. P < 0.05 was used as the cut-off criterion, and [logFC] > 2 was used to acquire DEGs between NSCLC and normal lung tissue by the GEO2R online tool. Venn diagram (http://bioinformatics.psb.ugent.be/webtools/Venn/) was used to determine the overlapping DEGs from four datasets mentioned above; LogFC > 0 and LogFC < 0 were employed as the criteria for the up-regulated and down-regulated genes, respectively.

Gene Ontology (GO) and KEGG pathway analysis
We performed the Gene Ontology (GO) enrichment analysis to categorize the DEGs as per their molecular functions (MF), biological processes (BP), and cellular components (CC) [15]. Kyoto Encyclopedia of Genes and Genomes (KEGG) is an open-access database that facilitates an in-depth understanding of genes and their respective advanced functions, and the information is derived from molecular-level experimental data [16]. DAVID (https://david.ncifcrf.gov/) is an online bioinformatics platform with multifunctional tools for gene annotation, visualization, and integration of gene-speci c functions [17]. We employed the DAVID platform to evaluate the DEGs enrichment in CC, MF, BP, and KEGG pathways (P < 0.05).

Protein-protein interaction (PPI) network and module analysis
The STRING database (https://string-db.org/) was employed to investigate the interaction network of DEGs [18]. The PPI interaction network was created for DEGs with the help of the STRING database, with an interaction score of ≥ 0.4. The nodes and edges represented genes and interaction between the genes, respectively, in the PPI network. We used Cytoscape (http://www.cytoscape.org/) [19] to detect potential correlations between DEGs where the criteria of zero number of maximum interactors and ≥ 0.4 con dence scores were employed to analyze the PPI network acquired from the STRING database.
Additionally, the PPI network's two crucial modules were analyzed in the MCODE plug-in, and the top ten core genes were identi ed by the EcCentricity scoring method in the cytoHubba plugin of Cytoscape. 2.5 Survival analysis and differential expression of core genes in NSCLC We analyzed ten core genes by GEPIA (http://gepia.cancer-pku.cn/) [21]. The genes were categorized as upregulated or downregulated according to the difference in gene expression level in NSCLC and normal lung tissues (P < 0.05). The survival prognosis of these ten core genes was done by using the Kaplan-Meier plotter (http://www.kmplot.com/analysis/) [22]. and the survival prognosis of seven of them was signi cantly poor (P < 0.01).

Differentially expressed genes (DEGs) identi cation
The gene expression pro les of GSE33532, GSE19188, GSE102287, and GSE27262 in NSCLC and normal or adjacent lung tissues were obtained from the gene database, NCBI-GEO. GEO2R online tool was employed to determine the 795, 635, 1087, and 671 DEGs from GSE33532, GSE19188, GSE102287, and GSE27262, respectively with a P < 0.05, and logFC > 2 cut-off criteria. We observed an overlap of 52 upregulated and 173 down-regulated DEGs in four datasets (Table 1 and Fig. 1). Table 1 A total of 225 common DEGs (52 up-regulated genes and 173 down-regulated genes) were identi ed in the analysis of four data sets DEGs Genes name

Gene Ontology (GO) and KEGG pathway enrichment analysis of DEGs
The online web portal DAVID was used to categorize the 52 up-regulated and 173 down-regulated DEGs into three functional Gene Ontology (GO) categories, i.e., biological process (BP), molecular function (MF), and cellular component (CC) (Figs. 2A-2B). The outcome of the GO analysis revealed that the BP category was notably enriched, followed by CC and MF. In the biological process category of GO analysis, around 59% of the up-regulated genes were enriched in cell division, mitotic nuclear division, sister chromatid cohesion, G2/M transition of the mitotic cell cycle, spindle organization, and mitotic spindle assembly checkpoint. Similarly, the down-regulated DEGs were mostly enriched in angiogenesis, cell adhesion, BMP signaling pathway, vasoconstriction, response to cAMP, and response to hypoxia. In the cell component category of GO analysis, around 60% of the overexpressed DEGs were primarily enriched in the spindle, midbody, condensed chromosome kinetochore, nucleoplasm, spindle pole, and kinetochore. Similarly, we found the enrichment of cell surface, extracellular region, extracellular space, proteinaceous extracellular matrix, and membrane raft in the down-regulated DEGs. In the molecular function category of GO analysis, overexpressed DEGs were chie y enriched in ATP binding, metalloendopeptidase activity, and protein homodimerization activity, while down-regulated DEGs were enriched in heparin-binding, receptor activity, and transforming growth factor-beta binding ( Table 2). The KEGG pathway enrichment analysis elucidated the overrepresentation of oocyte meiosis and cell cycle pathway for the overexpressed DEGs along with malaria and vascular smooth muscle contraction (VSMC) pathway for the down-regulated DEGs (Table 3).

Protein-protein interaction (PPI) network and modular analysis
The protein-protein interaction network (PPI) for the 225 DEGs was constructed by using the STRING database and Cytoscape software. The PPI network of 48 up-regulated DEGs and 126 down-regulated DEGs embodied a total of 174 nodes and 816 edges (Fig. 3A). Moreover, 52 DEGs were not involved in the PPI network. Furthermore, we employed the MCODE to analyze the 174 nodes and the two crucial modules from the PPI network as per the degree of importance. The module 1 encompassed a total of 32 nodes and 511 edges (Fig. 3B, Table 4), and it was chie y involved in oocyte meiosis, cell cycle, p53 signaling pathway, and progesterone-mediated oocyte maturation. The module 2 encompassed a total of 9 nodes and 29 edges (Fig. 3C, Table 4), and it was chie y involved in the PI3K-Akt and HIF-1 signaling pathway. The CytoHubba analysis revealed the ten most crucial node-degree genes: COL1A1, ITGA8, VWF, MMP1, ADAM12, CD36, OGN, EDN1, CTHRC1, and CAV1. These genes were primarily involved in ECMreceptor interaction, focal adhesion, and PI3K-Akt signaling pathway (Fig. 3D, Table 4).

Validation of DEGs using cBioportal
The cBioportal was used to analyze the genetic variation to con rm the reliability of the DEGs mentioned above. We analyzed the data from a total of 1546 NSCLC samples. The analysis showed that a total of 1,445 samples (93%) of the three NSCLC study data sets had genetic mutations, and the gene mutation range was 76-95.8% (Fig. 4A). Besides, we also analyzed the genomic changes in the TCGA study samples on OncoPrint, which demonstrated the mutation frequencies range of 0.2%-20%. Apart from the low mutation frequencies of EDN1 (1.2%) and OGN (1%), the mutation frequencies of other genes were signi cantly higher. The two genes with the highest mutation frequencies were ITGA8 (9%) and VWF (8%) (Fig. 4B), which validated the reliability of DEGs identi cation.

GEPIA and Kaplan-Meier Plotter based analysis of core genes
We employed GEPIA to analyze the expression levels of ten core genes in tumors and normal lung tissues. The outcome of this analysis demonstrated that COL1A1, MMP1, ADAM12, and CTHRC1 were upregulated (P < 0.05), whereas VWF, CD36, OGN, EDN1, CAV1, and ITGA8 were down-regulated (P < 0.05) in lung adeno and squamous cell carcinoma (Fig. 5A). We also used Kaplan-Meier Plotter to analyze further the survival prognosis of these ten core genes in NSCLC. The outcome demonstrated that the survival rates of the two up-regulated genes, i.e., COL1A1 and ADAM12, were signi cantly poor (P < 0.01). Five out of the six down-regulated, i.e., VWF, OGN, EDN1, CAV1, and ITGA8, also had poor survival prognosis (P < 0.01) (Fig. 5B).

Enrichment analysis of KEGG pathway to reanalyze seven selected genes
We used DAVID to re-enrich the KEGG pathway for the selected seven genes to explore signaling pathways associated with them. The outcome of this analysis showed that four genes (VWF, CAV1, ITGA8, COL1A1) were signi cantly enriched in the focal adhesion pathway (P = 1.04E-04) ( Table 5, Fig. 6).

Discussion
In recent years, multiple investigations have identi ed the occurrence and progression of human cancer is a complicated process since it involves amalgamated action of multiple genes and proteins. Presently, targeted therapy is preferred over traditional therapies to treat various malignancies, including NSCLC [23]. Lung cancer is one of the most prevalent malignant tumors across the globe, and the advanced stage of this malignancy is challenging to treat. Over the last few decades, numerous studies have investigated the underlying mechanism behind the onset and development of lung cancer. However, the outcome of the majority of the studies was derived from a single gene or cohort studies. In the current study, we undertook an in-depth bioinformatics investigation of four original microarray datasets, which contained 319 NSCLC and 232 normal lung tissues. Firstly we identi ed a total of 225 common DEGs, of which 52 genes were up-regulated genes, and 173 genes were down-regulated in NSCLC tissue.
In the second step, we carried out GO and KEGG pathway enrichment analysis with the help of the DAVID database. In the BP category, the majority of the up-regulated DEGs were enriched in cell division, mitotic nuclear division, sister chromatid cohesion, G2/M transition of the mitotic cell cycle, spindle organization, and mitotic spindle assembly checkpoint. The down-regulated DEGs showed enrichment of BP, such as angiogenesis, cell adhesion, BMP signaling pathway, vasoconstriction, response to cAMP, and hypoxia. In the CC category, up-regulated DEGs were mainly enriched in the spindle, midbody, condensed chromosome kinetochore, nucleoplasm, spindle pole, and kinetochore. The down-regulated DEGs showed enrichment of CC, such as cell surface, extracellular region, extracellular space, proteinaceous extracellular matrix, and membrane raft. In the MF category, up-regulated DEGs were primarily enriched in ATP binding, followed by metalloendopeptidase activity and protein homodimerization activity, while down-regulated DEGs showed enrichment of heparin-binding, receptor activity, and transforming growth factor-beta binding. The KEGG pathway enrichment analysis depicted that the overexpressed DEGs were primarily enriched in oocyte meiosis and cell cycle, whereas down-regulated DEGs were primarily enriched in malaria and vascular smooth muscle contraction.
The third step involved the protein-protein interaction network analysis of the 225 DEGs with the help of the STRING database, and we identi ed 174 nodes and 816 edges in this network. Concurrently, we used the Cytoscape software to screen two crucial modules and the top ten core genes in the PPI network: COL1A1, MMP1, VWF, ADAM12, CD36, OGN, EDN1, CTHRC1, CAV1, ITGA8. These ten core genes primarily participated in ECM-receptor interaction, focal adhesion, and PI3K-Akt signaling pathway.
The fourth step embodied the analysis of the genetic variation between the three NSCLC study data sets, which included a total of 1546 samples. Furthermore, we validated the mutation frequency of each gene in the genome to improve the reliability of DEGs identi cation. Additionally, the outcome of the GEPIA analysis indicated that COL1A1, MMP1, ADAM12, and CTHRC1 were up-regulated (P < 0.05), whereas VWF, CD36, OGN, EDN1, CAV1, and ITGA8 were signi cantly downregulated (P < 0.05) in lung adeno and squamous cell carcinoma than the normal lung tissue. The Kaplan-Meier Plotter analysis of these ten core genes revealed a poor survival rate for two up-regulated genes (COL1A1, ADAM12) and ve downregulated genes (VWF, OGN, EDN1, CAV1, ITGA8) (P < 0.01). A re-enrichment of the KEGG pathway of these seven genes revealed that four of the seven genes (VWF, CAV1, ITGA8, COL1A1) were signi cantly enriched in focal adhesion (P = 1.04E-04), which might be potential novel targets for the treatment and improved survival prognosis in NSCLC.
Collagen 1A1 (COL1A1) is an extracellular matrix protein, and its gene is located on human chromosome 17q21. COL1A1 is chie y spotted in the skin and tendons that serves a crucial role in cellular behavior and tissue structure [24]. In recent years, a plethora of studies has reported the association of the COL1A1 with multiple malignancies such as hepatocellular carcinoma (HCC) [25], NSCLC [26], rectal adenocarcinoma (READ) [27], and breast cancer [28]. Also, the aberrant expression of COL1A1 could lead to the onset of hepatocellular carcinoma (HCC). The drug resistance, metastasis, and disease recurrence encountered during the HCG treatment could be inhibited by targeting the COL1A1 [25]. Shuo Fang et al.
demonstrated that the NSCLC cell line secretes COL1A1, whose molecular weight is different than the COL1A1 secreted by the cancer-associated and normal broblasts. The fold-change of COL1A1 mRNA in NSCLC tissues was more than the normal lung tissues, but there was no correlation with tumor stage (P > 0.05) [26].
The integrin family proteins are adhesion receptors, which are primarily distributed on the cell surface and mediates cell-to-cell and cell-to-extracellular matrix interactions [29]. Integrin alpha 8 (ITGA8) is a crucial member of the integrin family, which is overexpressed during lung and kidney organ development. The absence of ITGA8 in mice leads to abnormal kidney development, which suggests that ITGA8 plays a vital role in organ development [30]. in MM, which could be essential for the progression and recurrence of MM [33].
Von Willebrand factor (vWF) is a complex and multifunctional multimeric glycoprotein that plays a crucial role in normal blood coagulation [34]. In human cancers, vWF has been shown to have antitumor effects because of the negative regulation of angiogenesis and apoptosis. Numerous studies have reported that vWF and tumor cells could directly interact with each other through receptors (GPIb) and integrins (αIIBβ3 and αβ3) [35][36][37]. Stark's group reported that the loss of endothelial vWF leads to increased angiogenesis in vitro, whereas the in vivo model-vWF-de cient mice-showed enhanced neovascularization and angiogenesis. The molecular functions behind these altered processes were chie y regulated by vWF via various intracellular and extracellular pathways such as αvβ3, VEGFR-2, and Ang-2 signaling pathways [38]. We can speculate that apart from regulating angiogenesis, vWF might also have a pro-apoptotic function. The previous report by Terraube et al. stated that vWF inhibits metastasis by inducing tumor cell death [37,39].
Furthermore, Mochizuki et al. validated the pro-apoptotic effect of Vwf [40]. Besides, vWF was also found to be overexpressed in some of the other malignancies, such as prostate cancer [41] and leiomyosarcoma [42]. Till now, the research studies on the correlation between the vWF and cancer have been mostly incongruent. The previous literature reports have been contradictory about the correlation between vWF and cancer. Thus, an in-depth analysis is required to bring forth the effect of vWF on tumor formation.
Interestingly, CAV1 is both an anti-apoptotic and a pro-apoptotic protein, so that it shows both pro-and anti-cancer effects [54,55].
A plethora of studies have suggested that VWF, CAV1, ITGA8, and COL1A1genes were related to the onset and progression of multiple malignancies, but few studies have reported their roles in NSCLC. Therefore, apart from providing reliable and useful information, this study might provide a direction to the current research investigation on NSCLC.

Declarations
Availability of data and materials The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

Figure 2
The expression level and survival prognosis of ten core genes in NSCLC. (A) Out of the total ten core genes, four genes were up-regulated, and six genes were down-regulated in NSCLC (P<0.05). (B) Kaplan- Meier plotter was used to analyze the survival rate of ten core genes. The survival rates of seven out of ten genes were signi cantly reduced in NSCLC (P<0.01).   The overexpressed genes were classi ed into GO categories: BP, MF, and CC.