Identification of 10 Important Genes with Poor Prognosis in Non-small cell Lung Cancer through Bioinformatical Analysis

Background: Lung cancer is the leading cause of cancer-related death in China and is responsible for more than 1 million deaths worldwide every year; non-small cell lung cancer (NSCLC) accounts for most of these cases. Despite great advances in pharmaceutical therapies for lung cancer, overall survival remains poor. Effective biomarkers are therefore needed to predict and improve the prognosis of lung cancer patients. Integrated bioinformatic analysis is a useful tool for mining valuable clues and can be applied to the search for new therapeutic targets. Methods: We analyzed four NSCLC datasets (GSE18842, GSE31210, GSE33532 and GSE101929) from the Gene Expression Omnibus (GEO). We found 162 differentially expressed genes (DEGs) shared by the four datasets, comprising 41 up-regulated and 121 down-regulated genes in NSCLC tissues. GO enrichment and KEGG pathway analyses were performed with DAVID. We then identified 10 core oncogenes by constructing a protein-protein interaction (PPI) network. Finally, we analyzed the 10 core oncogenes with the Kaplan Meier plotter online database and Gene Expression Profiling Interactive Analysis (GEPIA). Results: We identified 10 key oncogenes associated with the progression and poor prognosis of NSCLC: ANLN, CCNA2, CDCA7, DEPDC1, DLGAP5, HMMR, KIAA0101, RRM2, TOP2A, and UBE2T. Conclusion: These 10 genes may serve as therapeutic targets and useful prognostic biomarkers for NSCLC.


Background
Lung cancer, the leading cause of cancer-related death in China and responsible for more than 1 million deaths worldwide every year (1), can be divided into two classes: non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC). NSCLC, which includes adenocarcinoma, squamous cell carcinoma and large cell carcinoma, accounts for approximately 85% of all lung cancer cases.
Despite great advances in pharmaceutical therapies for lung cancer patients, overall survival remains poor, and NSCLC has become the most lethal human cancer. Hence, effective therapeutic targets are needed to improve the prognosis of lung cancer patients.
Gene chip (microarray) technology is a proven technique that has generated large amounts of expression data, which are stored in public databases (2). These data can be mined for valuable clues, and integrated bioinformatic analysis can help to uncover potential mechanisms for further study.
In the present study, we chose four NSCLC datasets from the Gene Expression Omnibus (GEO): GSE18842, GSE31210, GSE33532 and GSE101929. First, we found 162 differentially expressed genes (DEGs) shared by these four datasets, comprising 41 up-regulated and 121 down-regulated genes in NSCLC tissues. Then, we performed further bioinformatic analyses and identified 10 core genes by constructing a protein-protein interaction (PPI) network. To confirm the role of these 10 core genes in NSCLC, we analyzed their survival curves and their expression in NSCLC tissues versus normal lung tissues using the Kaplan Meier plotter online database and Gene Expression Profiling Interactive Analysis (GEPIA), respectively. Taken together, these 10 DEGs were all related to the prognosis of NSCLC. In conclusion, our bioinformatic study provides additional useful biomarkers for NSCLC patients. These biomarkers can be considered candidate therapeutic targets for NSCLC, and the results also suggest directions for our further study.

Data source and preprocessing
NCBI-GEO (https://www.ncbi.nlm.nih.gov/geo/), a free public database of microarray and gene expression profiles, was selected for our research. We used the key words ('non-small cell lung cancer' [All Fields] OR 'lung adenocarcinomas' [All Fields]) AND ('human' [Organism]) AND ('Expression profiling by array' [Filter]) to select related datasets. Next, we screened four gene expression profiles (GSE18842, GSE31210, GSE33532 and GSE101929) according to the following inclusion criteria: a. human NSCLC tissues, not cell lines; b. normal lung tissues used as controls; c. a total sample number, including tumor and normal tissues, of more than 50; d. the same platform across datasets, so that the data could be processed easily. All four selected profiles were based on the GPL570 platform. GSE18842 contained 46 NSCLC tissues and 45 normal lung tissues, GSE31210 included 226 NSCLC tissues and 20 normal lung tissues, GSE33532 covered 80 NSCLC tissues and 20 normal lung tissues, and GSE101929 incorporated 32 NSCLC tissues and 34 normal lung tissues.

Screening of differentially expressed genes (DEGs)
The DEGs between NSCLC tissues and normal lung tissues were screened using the GEO2R online tool. The fold change (FC) of each gene was expressed as logFC so that data derived from the same microarray platform were on a comparable scale (3). DEGs were defined by |logFC| > 2 and an adjusted P value < 0.05. The online Venn tool (http://bioinformatics.psb.ugent.be/webtools/Venn/) was used to identify the DEGs shared by the four datasets from the raw data in TXT format. DEGs with logFC > 2 were considered up-regulated, and DEGs with logFC < -2 were considered down-regulated.
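The screening rule and the Venn step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the GEO2R or Venn tool itself: the column names (`Gene.symbol`, `logFC`, `adj.P.Val`) are assumed to match a tab-separated GEO2R export, and the gene names and values in the toy tables are hypothetical placeholders.

```python
import csv
import io

LOGFC_CUTOFF = 2.0   # |logFC| > 2, as stated in the text
ADJ_P_CUTOFF = 0.05  # adjusted P value < 0.05, as stated in the text


def screen_degs(table_text):
    """Return (up, down) gene-symbol sets from a tab-separated table.

    Column names are an assumption about the exported file layout.
    """
    up, down = set(), set()
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        symbol = row["Gene.symbol"].strip()
        if not symbol:
            continue  # skip probes without a gene annotation
        if float(row["adj.P.Val"]) >= ADJ_P_CUTOFF:
            continue  # not significant
        log_fc = float(row["logFC"])
        if log_fc > LOGFC_CUTOFF:
            up.add(symbol)
        elif log_fc < -LOGFC_CUTOFF:
            down.add(symbol)
    return up, down


def common_degs(per_dataset):
    """Intersect up and down sets across datasets (the Venn step)."""
    ups = [up for up, _ in per_dataset]
    downs = [down for _, down in per_dataset]
    return set.intersection(*ups), set.intersection(*downs)


# Hypothetical toy tables standing in for two datasets.
TOY_A = (
    "Gene.symbol\tlogFC\tadj.P.Val\n"
    "GENE_A\t3.1\t0.001\n"
    "GENE_B\t-4.2\t0.0001\n"
    "GENE_C\t0.3\t0.5\n"
)
TOY_B = (
    "Gene.symbol\tlogFC\tadj.P.Val\n"
    "GENE_A\t2.6\t0.01\n"
    "GENE_B\t-3.0\t0.002\n"
    "GENE_D\t2.4\t0.2\n"
)

common_up, common_down = common_degs([screen_degs(TOY_A), screen_degs(TOY_B)])
print(sorted(common_up), sorted(common_down))
```

Keeping the up- and down-regulated sets separate before intersecting ensures that a gene is counted as common only when it changes in the same direction in every dataset.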

DEGs gene ontology (GO) enrichment and Kyoto encyclopedia of genes and genomes (KEGG) pathway analyses
After screening the DEGs from the four datasets, we performed GO enrichment and KEGG pathway analyses using the Database for Annotation, Visualization and Integrated Discovery (DAVID) (https://david.ncifcrf.gov/tools.jsp), which is designed to annotate the functions of large numbers of genes or proteins (4). GO analysis integrates annotation data, provides tools for accessing them, and identifies characteristic biological properties of the datasets (5). KEGG integrates currently known protein interaction network information, including metabolism, genetic information processing, environmental information processing, and cellular physiological processes (6). We used DAVID to perform the biological analyses of the DEGs and to visualize their enrichment in biological processes (BP), molecular functions (MF), cellular components (CC) and pathways. P < 0.05 was considered statistically significant.
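To make the enrichment P value concrete, the following is a minimal sketch of a one-sided over-representation test based on the hypergeometric distribution. It is not DAVID's implementation: DAVID reports a modified Fisher exact P value (the EASE score), whereas this sketch uses the plain hypergeometric tail. The function name and all counts in the example are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    k -- DEGs annotated with the term
    n -- total DEGs in the list
    K -- background genes annotated with the term
    N -- total background genes
    """
    upper = min(n, K)
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical example: 4 genes drawn from a background of 10,
# 3 of which carry the term; at least 2 of the 4 are annotated.
print(hypergeom_enrichment_p(2, 4, 3, 10))
```

A term would be reported as enriched when this P value falls below the 0.05 threshold stated above.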

Protein-protein interaction (PPI) network analysis
PPI network analysis of the identified DEGs was performed using the Search Tool for the Retrieval of Interacting Genes (STRING) (https://string-db.org/), an online tool for exploring interactions between genes and proteins. The PPI network was visualized in Cytoscape to examine the potential relationships among the DEGs (maximum number of interactors = 0 and confidence score ≥ 0.4) (7). In addition, the Molecular Complex Detection (MCODE) app in Cytoscape was used to identify modules of the PPI network (degree cutoff = 2, max. depth = 100, k-core = 2, and node score cutoff = 0.2) (8).
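Two of the parameters above, the confidence score cutoff and the k-core value, can be illustrated with a small pure-Python sketch. It is not the MCODE algorithm, which additionally weights nodes by local density and grows modules from seed nodes; it only shows how edges are thresholded and how a k-core is extracted. The edge list, the node names and the assumption that scores lie on a 0 to 1 scale are all hypothetical.

```python
from collections import defaultdict


def build_graph(edges, min_score=0.4):
    """Keep edges with score >= min_score; return an adjacency map."""
    graph = defaultdict(set)
    for a, b, score in edges:
        if a != b and score >= min_score:
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def k_core(graph, k=2):
    """Repeatedly drop nodes with fewer than k neighbours."""
    core = {node: set(nbrs) for node, nbrs in graph.items()}
    changed = True
    while changed:
        changed = False
        for node in [n for n, nbrs in core.items() if len(nbrs) < k]:
            for nbr in core.pop(node):
                if nbr in core:
                    core[nbr].discard(node)
            changed = True
    return core


# Hypothetical interaction list: (protein, protein, confidence score).
TOY_EDGES = [
    ("A", "B", 0.9),
    ("B", "C", 0.7),
    ("A", "C", 0.5),
    ("C", "D", 0.45),
    ("D", "E", 0.2),  # below the 0.4 cutoff, so it is discarded
]

core = k_core(build_graph(TOY_EDGES), k=2)
print(sorted(core))
```

In this toy network, node D survives the score filter but has only one remaining neighbour, so it is removed from the 2-core.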

Analyzing overall survival and RNA sequencing expression of core genes
Kaplan Meier-plotter (https://kmplot.com/analysis/) is a widely used web tool for illustrating the relationship between patients' overall survival and the expression levels of DEGs based on EGA, TCGA and GEO data (9). In this study, we obtained core genes correlated with the progression of NSCLC through the PPI network analysis. The correlation between core gene expression and survival in lung cancer was analyzed with Kaplan Meier-plotter. The hazard ratio (HR) with 95% confidence intervals and the log-rank P value were also computed and shown on the plot. To validate the importance of these core genes, we next used the GEPIA website (http://gepia.cancer-pku.cn/) to analyze RNA sequencing expression data from thousands of samples in the GTEx project and TCGA (10), including lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC).
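For readers unfamiliar with how such survival curves are built, the sketch below shows the Kaplan-Meier product-limit estimator in plain Python, together with a median split of patients by expression. It is an illustration only and is not the code behind Kaplan Meier-plotter; splitting at the median is an assumption made here for simplicity, and the follow-up times, event indicators and expression values are hypothetical.

```python
from statistics import median


def split_by_median(expression):
    """Return (high, low) patient indices relative to the median."""
    cut = median(expression)
    high = [i for i, x in enumerate(expression) if x > cut]
    low = [i for i, x in enumerate(expression) if x <= cut]
    return high, low


def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with an observed death.

    times  -- follow-up time per patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients who share the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical cohort of four patients.
print(kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1]))
print(split_by_median([1.0, 2.0, 3.0, 4.0]))
```

One curve would be computed for the high-expression group and one for the low-expression group; the log-rank test and hazard ratio mentioned above then compare the two curves.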

Screening of DEGs between NSCLC tissues and normal lung tissues
In total, the four datasets comprised 384 NSCLC tissues and 119 normal lung tissues. Up-regulated DEGs were defined by logFC > 2 and P value < 0.05, and down-regulated DEGs by logFC < -2 and P value < 0.05. Using the GEO2R online tool, a total of 772, 443, 610, and 926 DEGs were extracted from GSE18842,  Table 1).

GO enrichment analysis of the DEGs
All 162 DEGs were submitted to DAVID for functional annotation. The results are shown in Figure 2 and Table 2. Here we summarize only the top five terms in each category: a. in the biological process (BP) category, the up-regulated DEGs were mainly involved in collagen catabolic process, extracellular matrix disassembly, collagen fibril organization, sensory perception of sound, and proteolysis, whereas the down-regulated DEGs were mainly involved in angiogenesis, vasculogenesis, cell surface receptor signaling pathway, receptor internalization and vasoconstriction; b. in the cellular component (CC) category, the up-regulated DEGs were enriched in proteinaceous extracellular matrix, collagen trimer, extracellular region, and cytoplasm, whereas the down-regulated DEGs were enriched in integral component of plasma membrane, integral component of membrane, membrane raft, plasma membrane, and external side of plasma membrane; c. in the molecular function (MF) category, the down-regulated DEGs were enriched in receptor activity, heparin binding, ion channel binding, Ras guanyl-nucleotide exchange factor activity and angiotensin type II receptor activity, whereas no MF term was significantly enriched among the up-regulated DEGs.

KEGG pathway analysis of the DEGs
KEGG pathway analysis of the DEGs was also performed with DAVID. As shown in Table 3, the DEGs were mainly enriched in ECM-receptor interaction, cell adhesion molecules, leukocyte transendothelial migration, protein digestion and absorption, PPAR signaling pathway, adrenergic signaling in cardiomyocytes and neuroactive ligand-receptor interaction.

Protein-protein interaction (PPI) network and module analysis
We used the STRING database to build the PPI network of the 41 up-regulated and 121 down-regulated genes. The resulting PPI network of the DEGs is shown in Figure 3A. We then used Cytoscape MCODE to identify a significant module containing 10 nodes (ANLN, CCNA2, CDCA7, DEPDC1, DLGAP5, HMMR, KIAA0101, RRM2, TOP2A, and UBE2T) and 43 edges (Figure 3B). All 10 of these central nodes were up-regulated DEGs.
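As a quick check of how tightly connected this module is, the reported node and edge counts can be compared with the maximum possible number of edges. The sketch below computes the simple graph density for an undirected network without self-loops; this is a generic measure and is not claimed to be the exact score that MCODE reports. The function name is hypothetical.

```python
def module_density(n_nodes, n_edges):
    """Fraction of possible undirected edges that are present."""
    possible = n_nodes * (n_nodes - 1) // 2
    return n_edges / possible


# 10 nodes allow at most 45 edges; the module has 43 of them.
print(round(module_density(10, 43), 3))
```

With 43 of 45 possible edges present, the module is close to a complete graph, which is consistent with its selection as the most significant module.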

Analysis of core genes
We next used Kaplan Meier-plotter and GEPIA to further analyze the 10 core genes. Kaplan Meier-plotter was used to illustrate the relationship between patients' overall survival and the expression levels of these genes, and GEPIA was used to compare their expression levels between NSCLC tissues and normal lung tissues.
As shown in Figure 4, high expression of each of the 10 core genes was associated with significantly worse survival in NSCLC patients (P < 0.05). GEPIA results also demonstrated that all 10 genes were expressed at higher levels in NSCLC samples, including LUAD and LUSC, than in normal lung tissues (P < 0.05, Figure 5).

Discussion
As shown in Figure 6, the workflow of this study was as follows. First, we selected four NSCLC datasets from GEO according to the inclusion criteria described in the data source and preprocessing section. Second, we used the GEO2R online tool to extract the DEGs from each of the four datasets. Third, we applied the online Venn diagram tool to screen the DEGs common to the four datasets and found 162 DEGs, comprising 41 up-regulated and 121 down-regulated genes in NSCLC tissues. Fourth, we performed GO enrichment and KEGG pathway analyses of all 162 DEGs using DAVID. As shown in Table 3, the 162 DEGs were mainly enriched in ECM-receptor interaction, cell adhesion molecules, leukocyte transendothelial migration, protein digestion and absorption, PPAR signaling pathway, adrenergic signaling in cardiomyocytes and neuroactive ligand-receptor interaction. Fifth, we constructed the PPI network of these 162 DEGs using the STRING database and identified a significant module of 10 nodes with Cytoscape MCODE. These 10 core genes are ANLN, CCNA2, CDCA7, DEPDC1, DLGAP5, HMMR, KIAA0101, RRM2, TOP2A, and UBE2T. Finally, we analyzed the survival curves of these 10 core genes and their expression in NSCLC tissues versus normal lung tissues using the Kaplan Meier plotter online database and GEPIA, respectively. Taken together, all 10 genes were associated with poor prognosis in NSCLC, and all were up-regulated DEGs.
ANLN (Anillin), an actin-binding protein, was first identified in Drosophila as a 124 kDa protein and plays an important role in cytokinesis (11). ANLN is expressed at higher levels in the brain, testis, and placenta, and at lower levels in the heart, kidney, liver, pancreas, prostate, spleen and lung.
Recently, ANLN has been identified as a prognostic biomarker in cervical cancer, breast cancer, pancreatic cancer, colorectal cancer, and bladder urothelial carcinoma. ANLN is also overexpressed in the majority of primary NSCLC and is involved in lung cancer metastasis (12). Pathway analysis demonstrated that ANLN participates in developmental processes through regulation of the nuclear division pathway (13).
CCNA2 (Cyclin A2) is a ubiquitously expressed member of the cyclin family and is found in almost all human tissues (14). Evidence indicates that CCNA2 is up-regulated in many kinds of cancer and that, as an oncogene, it plays an important role in regulating cancer cell growth and apoptosis, in particular by controlling the cell cycle at the G1/S and G2/M transitions (15). CCNA2 can be used as a prognostic biomarker for colorectal cancer, ER+ breast cancer, esophageal squamous cell carcinoma and pancreatic cancer, among others. A recent study indicated that CCNA2 is more highly expressed in human NSCLC specimens than in normal lung tissues, and that it can induce EMT and promote NSCLC metastasis via the integrin αvβ3 signaling pathway (16). However, further research is needed to uncover the target genes of CCNA2.
CDCA7 (Cell division cycle-associated protein 7), also known as JPO1, is a new member of the cell division cycle-associated gene family (17). CDCA7 has been identified as a DNA-binding protein (18). MYC and E2F1 can bind to the CDCA7 promoter and thereby drive CDCA7 expression. Recently, CDCA7 was identified as a critical regulator of lymphomagenesis and invasion (19), and overexpression of CDCA7 predicted poor prognosis in triple-negative breast cancer and colorectal cancer (20, 21). Wang's study indicated that CDCA7 was significantly overexpressed in LUAD compared with normal lung tissues, and that silencing CDCA7 inhibited cell proliferation through G1 phase arrest and induction of apoptosis (22). CDCA7 can therefore be considered a therapeutic target for LUAD.
DEPDC1 (DEP domain containing 1), a highly conserved protein, plays important roles in many biological processes, such as cell proliferation, cell cycle progression, apoptosis and signal transduction (23). DEPDC1 was first reported to be highly overexpressed in bladder cancer, where it has a critical role in tumor development (24). DEPDC1 is now considered a novel oncoantigen that is up-regulated in many kinds of cancer, including hepatocellular carcinoma, nasopharyngeal carcinoma, prostate cancer, breast cancer, and malignant glioma. DEPDC1 expression is also increased in LUAD and can be used as a prognostic biomarker for NSCLC patients (25). Recently, DEPDC1 was found to induce apoptosis in A549 lung adenocarcinoma cells through the NF-κB signaling pathway (26). Further studies are needed to explore the mechanism of DEPDC1.
DLGAP5 (disc large homolog-associated protein 5), a mitotic spindle protein, can act as a signaling molecule because it contains a guanylate-kinase-associated protein (GKAP) domain, which is highly conserved among many species and found in various eukaryotic signaling proteins (27). DLGAP5 overexpression can promote the proliferative potential of human cells, and overexpression has also been found in hepatocellular carcinoma, prostate cancer, colorectal cancer and adrenocortical carcinoma. Recent studies also showed that DLGAP5 is highly overexpressed in lung cancer tissues compared with the corresponding normal lung tissues (28). Hence, DLGAP5 can be used as a promising biomarker for early detection of lung cancer.
HMMR (Hyaluronan-mediated motility receptor) is an oncogene that is highly up-regulated and plays important roles in the progression of human leukemias and solid tumors (29, 30). Tilghman's work revealed that HMMR was overexpressed in glioblastoma (GBM) tumors, where it supported the self-renewal and tumorigenic potential of GBM stem cells (31). Taken together, HMMR not only promotes tumor progression but also maintains cancer stem cell (CSC) stemness. Other studies have shown that HMMR has great value for prognostic prediction in NSCLC (32). However, further research is needed to clarify the regulatory mechanism of HMMR in NSCLC.
KIAA0101, also known as proliferating cell nuclear antigen (PCNA)-associated factor (PAF15), functions as an oncogene and is up-regulated in various cancers, including breast cancer, esophageal cancer, hepatocellular carcinoma, ovarian cancer and lung cancer. KIAA0101 has recently been considered a potential biomarker for recurrence and poor prognosis in tumor patients. Kato's study found that KIAA0101 was overexpressed in the great majority of lung cancers and that it could be used as a specific target for lung cancer treatment (33).
RRM2 (Ribonucleotide reductase M2 subunit), the small subunit of the ribonucleotide reductase complex, is a rate-limiting enzyme for dNTP production and plays critical roles in many cellular processes, such as cell proliferation, invasiveness, migration and angiogenesis (34). RRM2 has been reported to be overexpressed as a tumor driver in various malignancies, including breast cancer, gliomas, colorectal cancer, bladder cancer and NSCLC. Yang's work found that RRM2 was up-regulated in NSCLC tumors and cell lines and that this aberrant up-regulation predicted a poor prognosis (35). Mechanistically, they also revealed the vital role of the LINC00667/miR-143-3p/RRM2 signaling pathway in NSCLC progression. RRM2 can therefore be considered a therapeutic target for NSCLC.
TOP2A (Topoisomerase 2-alpha) encodes a nuclear enzyme that is involved in almost all processes of DNA metabolism, such as replication, transcription and chromosome segregation during interphase and mitosis (36). TOP2A has been reported to be expressed at higher levels in a variety of human cancers, including gastric cancer, bladder urothelial carcinoma, colon cancer and pancreatic cancer. TOP2A is also the target of some of the most widely used chemotherapeutic drugs for human cancers (37). However, the role of TOP2A in the progression of NSCLC has not been elucidated.
UBE2T (Ubiquitin-conjugating enzyme E2T, also known as HSPC150), a member of the E2 family, was first identified in a patient with Fanconi anemia (FA) (38). UBE2T takes part in major cellular processes such as cell cycle control, signal transduction and tumorigenesis by working with specific E3 ubiquitin ligases to activate the degradation of relevant substrates (39). UBE2T has also been found to be overexpressed in prostate cancer, osteosarcoma, gastric cancer, hepatocellular carcinoma and lung cancer. However, the mechanism by which UBE2T promotes the progression of NSCLC is not yet clear. Further studies are needed to clarify the relationship between UBE2T and NSCLC.

Conclusion
We identified 10 key oncogenes associated with the progression and poor prognosis of NSCLC: ANLN, CCNA2, CDCA7, DEPDC1, DLGAP5, HMMR, KIAA0101, RRM2, TOP2A, and UBE2T. These 10 genes may serve as therapeutic targets and useful prognostic biomarkers for NSCLC. However, the mechanisms by which these genes regulate the progression of NSCLC remain to be explored; understanding them would be useful for designing new drugs that target these oncogenes.

Authors' contributions:
L Wang and N Hu conceived and designed the idea to this manuscript; W Wu and C Fang collected and analyzed the data, and drafted the manuscript; C Zhang collected the data and revised the manuscript. All authors confirmed the final version of the manuscript for submission.

Funding
This work was supported by the National Natural Science Foundation of China (Grant No. 81803933) and the Xinglin Young Talent Program of Shanghai University of Traditional Chinese Medicine.

Availability of data and materials:
The datasets used and analyzed during the present study are available from the corresponding author on reasonable request.

Ethical Statement:
Our study did not require an ethical board approval because it did not contain human or animal trials.

Consent for publication:
Not applicable.

Figure 4
Analysis of the relationship between NSCLC patients' overall survival and the expression levels of the 10 core DEGs using Kaplan Meier-plotter. As shown, high expression of each of the 10 core genes was associated with significantly worse survival in NSCLC patients (P<0.05).

Figure 5
Analysis of the expression levels of the 10 core DEGs in NSCLC tissues compared with normal lung tissues using GEPIA. As shown, all 10 genes were expressed at higher levels in NSCLC samples, including LUAD and LUSC, than in normal lung tissues (*P<0.05). Red indicates lung cancer tissues and grey indicates normal lung tissues.

Figure 6
Workflow of the study.