Identication of molecular markers associated with diagnosis and prognosis in lung adenocarcinoma by bioinformatics analysis

Background: Lung adenocarcinoma (LUAD) is the main histological subtype of lung cancer. However, the molecular mechanism underlying LUAD is not yet clearly dened, but elucidating this process in detail would be of great signicance for clinical diagnosis and treatment. Methods: Gene expression proles were retrieved from Gene Expression Omnibus database (GEO), and the common differentially expressed genes (DEGs) were identied by online GEO2R analysis tool. Subsequently, the enrichment analysis of function and signaling pathways of DEGs in LUAD were performed by gene ontology (GO) and The Kyoto Encyclopedia of Genes and Genomics (KEGG) analysis. The protein-protein interaction (PPI) networks of the DEGs were established through the Search Tool for the Retrieval of Interacting Genes (STRING) database and hub genes were screened by plug-in CytoHubba in Cytoscape. Afterwards, the miRNAs and the hub genes network was constructed via miRWalk. Finally, receiver operating characteristic (ROC) curve and Kaplan-Meier plotter were performed to analyze the diagnosis and prognosis ecacy of hub genes. Results: A total of 312 DEGs were identied, including 74 up-regulated and 238 down-regulated genes. GO analysis results showed that DEGs were mainly enriched in biological processes including composition of extracellular matrix, regulation of angiogenesis and so on. KEGG analysis results revealed DEGs were mainly enrolled in cell adhesion signaling pathway. Subsequently, 10 hub genes, CDC20, CENPF, TPX2, TOP2A, KIAA0101, CDCA7, ASPM, ECT2, UBE2T and COL1A1, were identied. And TOP2A, CDCA7, TPX2 and COL1A1 showed strong relationships with each other and the miRNAs nearby in miRNAs-mRNA network obtained by miRWalk website. Finally, all these 10 hub genes were found signicantly related to the diagnosis and prognosis of LUAD (p<0.05). Conclusions: Our results suggested that TOP2A, CDCA7, TPX2 and COL1A1 might present predictive value for the development and prognosis in LUAD, and might be used as potential molecular markers for the diagnosis and


Introduction
Lung cancer is the leading cause of cancer morbidity and mortality in the world, with 2.1 million new cases and 1.8 million deaths predicted in 2018 [1]. Although the great advance in surgical and molecular target therapies, the prognosis of lung cancer is still dismal [2]. According to histopathology characteristics, lung cancer can be divided into two main types: non-small cell lung cancer (NSLC) and small cell lung cancer (SCLC), with NSLC being accounted for 85% of lung cancer with lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC) and large cell carcinoma (LCC) three main subtypes [3]. LUAD has been ranked the rst cancer mortality in china [4],and receiving an operation remains a primary method, unfortunately the 5-year survival rate is around 10-15% [5]. Lacking sensitive and effective early diagnosis biomarkers and tending to metastasis and rapid dissemination are deemed to contribute to the poor prognosis. Therefore, looking for reliable biomarkers or drug targets is of great importance for improving the diagnosis and prognosis of LUAD.
High-throughput screening technique which is based on RNA sequencing or microarray gene expression pro les has been widely used for gene detection and diseases analysis. Academic can rapidly and precisely get the knowledge of key genes associated with serious diseases like cancer via a large number of biological information data mining [6]. For instance, Huang et al. used integrated bioinformatics analysis of RNA sequencing data and superenhancer catalogs to identify superenhancer-associated circRNAs and further found that loss of superenhancer-regulated circN x promoted cardiac regenerative repair and functional recovery [7]. As for lung cancer, compound CA-5f was considered as a novel latestage autophagy inhibitor with potent anti-tumor effect against NSLC via bioinformatics technique [8].
The discovery and the lucubration of these key genes obtained through bioinformatics analysis contribute to the discovery of effective biomarkers, which can be applied to the diagnosis, treatment and prognosis of tumors.
The aim of our study was to perform multiple analyses towards key genes of LUAD through integrated bioinformatics methods. First of all, we downloaded three microarray datasets containing mRNA expression data from Gene Expression Omnibus (GEO) and the differentially expressed genes (DEGs) were screened out via GEO2R online tool. Next, by virtue of Gene Ontology (GO), Kyoto Encyclopedia of Genes and Genomes (KEGG), functional annotation and enrichment analysis about these DEGs were performed, and then we established protein-protein interaction (PPI) network and obtained the hub genes in LUAD. Then, the miRNAs and the hub genes network was constructed to select and identify the candidate biomarkers. Receiver operating characteristic (ROC) diagnosis and Kaplan-Meier survival analysis were carried out to provide support for the diagnosis, targeted therapy and prognosis of LUAD.
Last but not least, we generated candidate biomarkers' expression violin plots based on patient pathological stage.

Materials And Methods
Acquisition and processing of GEO data. 3 LUAD datasets were obtained from GEO database (available online: http://www.ncbi.nlm.nih.gov/geo) [9], including the gene expression pro les GSE118370 DEGs were screened and identi ed via GEO2R online tools (available online: https://www.ncbi.nlm.nih.gov/geo/geo2r/) with the cut-off |log2FC|≥1 and adj. p < 0.05. The numbers of DEGs we got from GSE118370, GSE32863 and GSE43458 were 3302, 1584 and 1025, respectively. DEGs co-occurred in all 3 datasets were integrated together and showed in a Venn diagram (available online: http://bioinformatics.psb.ugent. be/webtools/Venn/). GO and KEGG Pathway enrichment analyses and visualization. Functional annotations and pathway enrichment analyses on DEGs were carried out through Enricher website (available online: https://amp.pharm.mssm.edu/Enrichr) [10], which included the visual results of GO and KEGG enrichment analyses. GO can be used to study gene-related biological process (BP), molecular function (MF) and cellular component (CC), while KEGG can simulate and analyze the pathways that DEGs were enriched in.
PPI network and acquisition of hub genes. PPI network can explain the possible mechanisms of the genes involved in intracellular processing. So the STRING (available online: https://stringdb.org/, Version11.0) [11], an online tool was rst applied to establish a PPI network of DEGs, which was then imported into Cytoscape software (version 3.6.1) [12]. In addition, the plug-in CytoHubba in Cytoscape was used to search and analyze the hub genes that were generated through Maximal Clique Centrality (MCC) score.
The construction of miRNA interaction network. MiRWalk website (available online: http://mirwalk.umm.uni-heidelberg.de/) [13] was used to establish a miRNA-mRNA regulatory network of the hub genes we obtained from CytoHubba plug-in.
Diagnosis and prognosis analyses of hub genes. Receiver operating characteristic (ROC) can evaluate the diagnostic e cacy of each index and the larger the area under curve (AUC), the higher the diagnostic e cacy [14]. We used SPSS (version 24) to gather statistics and analyze the case samples in GSE43458 dataset. In addition, the Kaplan-Meier plotter (available online: http://kmplot.com/) [15], a online database for survival analysis, was used to analyze the prognosis value of the hub genes. The hazard ratio (HR) with 95% con dence intervals and logrank P value were calculated and displayed on the diagram.
Expression violin plots about candidate biomarkers based on patient pathological stage. In order to detect the relationship between patient pathological stage and candidate biomarkers' expression, Gene Expression Pro ling Interactive Analysis (GEPIA, available online: http://gepia.cancer-pku.cn/) [16] was utilized.

Results
Identi cation and screening of DEGs. Three microarray datasets: GSE118370, GSE32863 and GSE43458 were obtained from GEO database, including 144 samples from LUAD patients and 94 samples of normal lung tissues. The differentially expressed genes (DEGs) were identi ed by online GEO2R analysis tool with the cut-off criterion of |log2FC|≥1 and adj. p < 0.05, and got 3302, 1584 and 1025 DEGs from each dataset, respectively. Afterwards, a Venn diagram ( Fig. 1) was used to get an intersection of DEGs from the 3 datasets above, acquiring a total of 312 DEGs with a consistent expression, in which there were 74 up-regulated genes and 238 down-regulated genes (Table 1) in the LUAD samples compared to the normal samples.
GO analysis and KEGG pathway analysis of DEGs. To evaluate the function of DEGs, we performed functional annotations and pathway enrichment analyses via GO and KEGG. According to the result of KEGG, DEGs emerged a high enrichment in the pathways of cell adhesion, protein digestion and absorption, tyrosine metabolism, ECM receptor interaction and so on ( Fig. 2A). GO analysis showed that DEGs mainly participated in the biological processes (BP) of the extracellular matrix organization, regulation of angiogenesis, negative regulation of blood vessel morphogenesis and so on (Fig. 2B). The molecular function (MF) included amyloid-beta binding, the activation of protein homodimerization, the activation of metalloendopeptidase inhibitors, the binding of low-density lipoprotein and so on (Fig. 2C). The analysis of cellular component (CC) emerged that these genes mainly encoded the components on the plasma membrane (Fig. 2D).
Analysis of PPI network and identi cation of hub genes. Exploring the protein-protein interaction in organisms can help to study molecular mechanisms of some diseases from a perspective of system. We built a PPI network via STRING website (Fig. 3A), then the interaction data were imported into Cytoscape, and then CytoHubba plug-in was applied to con rm 10 hub genes according to the scores of MCC. The results were shown in Fig. 3B. They were CDC20, CENPF, TPX2, TOP2A, KIAA0101, CDCA7, ASPM, ECT2, UBE2T and COL1A1, and their full names and functions were shown in Table 2. miRNA regulatory network of hub genes. Through inputting 10 hub genes into miRWalk website, we obtained the miRNAs regulatory network related with the hub genes (Fig. 4). The blue spots in the center were the hub genes, and what surrounded them were the miRNA which might regulate the hub genes. As were shown in Fig. 4, these four genes: TPX2, TOP2A, CDCA7 and COL1A1 had stronger interactions and formed regulation networks with the miRNAs nearby, in which the central miRNAs has-miR-6129, has-miR-6759-5p, has-miR-6510-5p and so on had two more connections with the four genes above. miR-145 [17], hsa-let-7d-5p [18] and so on also proved to have associations with the occurrence and the development of tumors.
Diagnosis and prognosis analyses of hub genes. Based on the GSE43458 dataset, SPSS software was used to analyze the diagnostic e cacy of these 10 hub genes. The results showed that the AUC of all the hub genes were above 0.800 (Fig. 5) Fig. 6.
Expression pro les analyses about four candidate biomarkers. As for the four candidate biomarkers, TPX2, TOP2A, CDCA7 and COL1A1, we further explored their relationship between expression pro les and patient's pathological stage. In Fig.7, we found that all these four genes, especially COL1A1, had an increase trend in multiple pathological stages of LUAD, and with the progression of tumors (from stage 1 to stage 4), the expression of these candidate biomarkers increased signi cantly.

Discussion
In this study, 312 DEGs were identi ed including 74 up-regulated and 238 down-regulated genes in three datasets from the GEO database. Furthermore, through Enricher online web tool, we visualized the outcomes derived from Gene Ontology and KEGG pathway enrichment analysis. As for the biological processes, these DEGs were enriched in extracellular matrix organization, regulation of angiogenesis, negative regulation of blood vessel morphogenesis, and negative regulation of angiogenesis and so on.
As for the molecular functions, the DEGs were mainly enriched in amyloid-beta binding, protein homodimerization activity, metalloendopeptidase inhibitor activity, low-density lipoprotein particle binding and calcium ion binding. For the cellular components, the DEGs showed enrichment in the integral component of plasma membrane. Angiogenesis is closely related to the occurrence and progression of cancers [19] and the formation of extracellular matrix is also associated with tumor metastasis and invasion [20,21]. According to KEGG pathway enrichment analysis, tyrosine metabolism was found to be signi cant in LUAD. Li et al. revealed that activation of tyrosine metabolism in CD13 + cancer stem cells may drive relapse in hepatocellular carcinoma by means of generating nuclear acetyl-CoA to acetylate and stabilize Foxd3, and allowing CD13 + cancer stem cells to sustain quiescence and resistance to chemotherapeutic agents [22]. Nitration of protein tyrosine has proved to be involved in a variety of biological processes, including signal transduction, protein degradation, energy metabolism, mitochondrial dysfunction, enzyme inactivation, immunogenic response, cell apoptosis and cell death, and plays an important role in the occurrence and metastasis of lung cancer [23]. Therefore, the signaling pathway of tyrosine metabolism was expected to be a potential drug therapy target for LUAD.
Next, DEGs PPI network was constructed via the STRING online database and Cytoscape software. By virtue of "CytoHubba" plug-in, the top ten hub genes, CDC20, CENPF, TPX2, TOP2A, KIAA0101, CDCA7, ASPM, ECT2, UBE2T, COL1A1 were identi ed and they were all found up-regulated in LUAD. In addition, miRWalk, an online analysis tool, was used to construct a network of miRNAs associated with the regulation of these genes, and nally four genes (TOP2A, CDCA7, TPX2 and COL1A1) were found to show strong associations with each other, which could be considered as new effective targets to improve the prognosis of LUAD patients.
Topoisomerase II (TOP2) has been clari ed to have crucial functions, including DNA replication, transcription and chromosome segregation, and more and more active anticancer drugs targeted it [24]. TOP2 contains two types of isozymes: TOP2A and topoisomerase II beta (TOP2B) [25], TOP2A is the only enzyme able to cleave and re-ligate the double-strand backbone of DNA, which is indispensable for DNA replication, transcription, and repair [26,27]. Ejlertsen et al. [26] showed that TOP2A was a direct molecular target of anthracyclines that can improve the sensitivity of anthracycline-containing chemotherapy in high-risk breast cancer patients. In malignant peripheral-nerve sheath tumor, TOP2A was the most overexpressed gene compared with benign neuro bromas [28]. High expression of TOP2A was found to be correlated to worse overall survival (OS) in all non-small-cell lung cancer and lung adenocarcinoma patients, but not in lung squamous cell carcinoma patients [29,30]. It has also been reported that miRNA's being associated with TOP2A plays an important role in lung cancer, for example, down-regulation of miRNA-144-3p whose potential target was TOP2A, was highly enriched in various key pathways like the protein digestion and absorption and the thyroid hormone signaling pathways in nonsmall cell lung cancer from the comprehensive meta-analysis [31].
Cell division cycle associated 7, CDCA7, was identi ed as a c-Myc responsive gene, and behaved as a direct c-Myc target gene. Overexpression of CDCA7 was found to enhance the transformation of lymphoblastoid cells, and it complements a transformation-defective Myc Box II mutant, suggesting its involvement in c-Myc-mediated cell transformation [32]. In quite a number of tumors, such as hepatocellular carcinoma [33], colorectal cancer [34], lymphoma [35], breast cancer [36], CDCA7 was all reported up-regulated and might be a potential prognostic factor and therapeutic target. Wang et al. have found that CDCA7 could promote lung adenocarcinoma proliferation via regulating the cell cycle and silencing CDCA7 inhibited cell proliferation in LUAD through G1 phase arrest and induction of apoptosis, which implied that CDCA7 might be identi ed as a potential therapeutic target for new biomarkers and LUAD [37].
TPX2, which is also known as DIL2 or p100, uses two exibly linked elements ('ridge' and 'wedge') in a novel interaction mode to simultaneously bind across longitudinal and lateral tubulin interfaces [38,39].
In ovarian cancer, it can promote the proliferation and migration of human ovarian cancer cells by regulating PLK1 expression [40]. Except for these function, in various cancers can TPX2 also control bladder cancer cell's proliferation and invasion via TPX2-p53-GLIPR1 regulatory circuitry [41], regulate the PI3K/AKT signaling pathway to facilitate hepatocellular carcinoma [42], interactive with miRNA such as miR-485-3p [43], miR-361-5p [44], miR-335-5p [45], miR-216b [46] and so on. Zhou et al. have veri ed that TPX2 can activate the epithelial-mesenchymal transition process and promote both the expression and activities of matrix metalloproteinase (MMP)2 and MMP9 in non-small cell lung cancer (NSCLC), which means TPX2 promotes the metastasis and malignant progression of NSCLC and could thus serve as a marker of poor prognosis in NSCLC [47].
Some studies have con rmed that COL1A1-related miRNAs were involved in the regulation of different kinds of tumors, including LUAD [18,48]. Zhang et al. found that COL1A1 was positively correlated with NOTCH3 expression, and the miR-150 /NOTCH3/COL1A1 axis might be involved in EGFR-TKI resistance in LUAD, which provided a potential target for LUAD therapy and could be used as a prognostic target [49]. Hsa-let-7d-5p has been proved that it can regulate cell cycle including both G1/S and G2/M cell cycle phase transitions and telomere maintenance in human lung broblasts and is related to cellular senescence [18]. Hsa-miR-483-3p can also play an important role in lung cancer caused by radon exposure by regulating certain signaling pathways such as proliferative protein kinases (MAPK) and reactive oxygen species (ROS) [50].
Followed by, ROC curve analysis of these 10 hub genes was conducted based on dataset GSE43458, and we found that they all had certain diagnostic e cacy, which TOP2A ranked rst. The results of Kaplan-Meier survival analysis also showed that these hub genes were signi cantly correlated with the prognosis of LUAD patients, suggesting that they may be potential prognostic biomarkers of LUAD. Lastly, we evaluated the expression patterns of TOP2A, CDCA7, TPX2 and COL1A1 in the main pathological stages and demonstrated that the expression of these four genes in tumors increased with the increase of the pathological stage of LUAD patients.
In summary, our study indicated that TOP2A, CDCA7, TPX2 and COL1A1 had strong relationships with each other and also owned better diagnosis and prognosis e cacy in LUAD. The mechanism and their mutual regulation network are worthy of further research and experiments. Anyway, all of our analyses may provide some useful direction into the potential biomarkers and molecular mechanism of the occurrence and development of LUAD.