Screening prognostic markers for non-small cell lung cancer based on data mining and bioinformatics analysis

Background At present, non-small cell lung cancer (NSCLC) has high morbidity and mortality, with frequent recurrence and metastasis, and the prognosis of patients cannot yet be predicted accurately in the clinic. Biomarkers are biomolecules with broad application prospects; their potential in cancer prognosis is gradually being revealed, and they are expected to be applied clinically. We integrated four gene expression profiles (GSE19188, GSE19804, GSE101929 and GSE18842) from the GEO database and screened the commonly differentially expressed genes (CDEGs) using the GEO2R online tool, identifying 952 CDEGs. Gene ontology analysis showed that the CDEGs were mainly enriched in biological processes such as cell adhesion, angiogenesis and positive regulation of angiogenesis, and in KEGG pathways such as ECM-receptor interaction and cell adhesion molecules (CAMs). Up-regulation of G2 and S phase-expressed protein 1 (GTSE1) was associated with poor prognosis in both lung adenocarcinoma (LADE) and lung squamous cell carcinoma (LUSC). Up-regulation of Neuromedin-U (NMU) and down-regulation of Proto-oncogene c-Fos (FOS) and Cyclin-dependent kinase inhibitor 1C (CDKN1C) were associated with poor prognosis only in LADE. We believe that GTSE1, NMU, FOS and CDKN1C have potential application value as prognostic markers for lung adenocarcinoma and are of great significance for evaluating treatment efficacy and monitoring relapse in lung adenocarcinoma. At the same time, GTSE1 may also serve as a new target for cancer treatment.


Background
Among malignant tumors, lung cancer has the highest morbidity and mortality worldwide [1]. By pathological type, lung cancer is mainly divided into small cell lung cancer (SCLC) and non-small cell lung cancer (NSCLC), with NSCLC accounting for 85% of all lung cancers [2]. Clinical treatments for lung cancer mainly include surgical resection, chemotherapy, radiation therapy, targeted therapy and immunotherapy. Only 20-30% of patients with NSCLC are suitable for surgical resection. The toxicity and side effects of chemotherapy and radiation therapy significantly limit their clinical application. Targeted therapy is now widely used in the clinic, but secondary drug resistance remains a challenging problem. Immunotherapy offers significantly enhanced efficacy, fewer side effects and long-lasting responses, but changes in the tumor microenvironment can weaken the immune response and reduce the therapeutic effect. Mining potential prognostic markers or therapeutic targets can deepen our understanding of the direction and mechanism of tumor development, and can help provide patients with personalized treatment plans, monitor treatment effects, and improve the quality of life of lung cancer patients.
At present, combining high-throughput omics technology with bioinformatics analysis remains an important and efficient way to discover disease-related target molecules in medical research. Integrating large amounts of omics data through bioinformatics analysis is likewise a reliable way to find targets with potential application value. For example, this approach has been applied to oral cancer [3], glioma [4], colorectal cancer [5], osteosarcoma [6], lung cancer [7][8][9][10][11][12] and ovarian cancer [13]. Through this approach, multiple prognostic markers for NSCLC, such as KIAA1522, PHLPP2 and TOP2A, have been discovered. Subsequent studies further showed that abnormally high expression of KIAA1522 is an independent factor for poor prognosis in NSCLC and is also related to platinum resistance [8]. Reduced expression of PHLPP2 in lung cancer can predict lung cancer progression [9]. TOP2A and TPX2 can jointly regulate the development of lung cancer, and TOP2A may be a prognostic marker for lung cancer [7]. Because lung cancer is highly heterogeneous, highly malignant and prone to recurrence and metastasis, a single marker can hardly meet complex and changing clinical needs. By contrast, a combination of markers can assess prognosis more comprehensively and with more stable accuracy. In this study, we screened four key differentially expressed genes of NSCLC based on meta-analysis and assessed their prognostic value.

Identification and function enrichment of DEGs
Differentially expressed genes (DEGs) in each series were identified with the GEO2R online tool (https://www.ncbi.nlm.nih.gov/geo/geo2r/). The criteria were set as adjusted P value < 0.01 and |logFC| > 1. The first step was to screen DEGs in each data set. The second step was to use an online tool (http://bioinformatics.psb.ugent.be/webtools/Venn/) to draw a Venn diagram and screen for commonly differentially expressed genes (CDEGs) across the four data sets. Functional enrichment and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway enrichment were performed with the Database for Annotation, Visualization and Integrated Discovery (DAVID, https://david.ncifcrf.gov/home.jsp); the functional enrichment covered biological process (BP), cellular component (CC) and molecular function (MF). The criterion was set as P value < 0.01.
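The two screening steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (gene, logFC, adjusted P value), the function names `select_degs` and `common_degs`, and the toy values are hypothetical and are not GEO2R output. The requirement that a gene change in the same direction in every data set is also an assumption; the text only states that a Venn diagram was used.

```python
from typing import Dict, List, Tuple

# One record per gene: (gene symbol, logFC, adjusted P value).
# Hypothetical layout; GEO2R exports more columns than this.
Row = Tuple[str, float, float]


def select_degs(rows: List[Row], p_cut: float = 0.01, lfc_cut: float = 1.0) -> Dict[str, str]:
    """Step 1: keep genes with adjusted P < p_cut and |logFC| > lfc_cut."""
    degs: Dict[str, str] = {}
    for gene, logfc, adj_p in rows:
        if adj_p < p_cut and abs(logfc) > lfc_cut:
            degs[gene] = "up" if logfc > 0 else "down"
    return degs


def common_degs(per_dataset: List[Dict[str, str]]) -> Dict[str, str]:
    """Step 2: intersect the DEG sets (the Venn overlap).

    Assumption: a gene counts as common only if its direction of change
    agrees in every data set.
    """
    if not per_dataset:
        return {}
    shared = set(per_dataset[0])
    for degs in per_dataset[1:]:
        shared &= set(degs)
    first = per_dataset[0]
    return {g: first[g] for g in sorted(shared)
            if all(d[g] == first[g] for d in per_dataset)}


# Toy input for two made-up data sets (values are illustrative only).
toy = {
    "set_A": [("GTSE1", 1.4, 0.0001), ("FOS", -2.1, 0.0002), ("X1", 0.5, 0.0001)],
    "set_B": [("GTSE1", 1.2, 0.001), ("FOS", -2.5, 0.003), ("X1", 1.8, 0.0001)],
}
cdegs = common_degs([select_degs(rows) for rows in toy.values()])
print(cdegs)
```

In this toy input, X1 passes the thresholds in only one data set and is therefore excluded from the overlap.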

PPI Network construction and visualization
The protein-protein interaction (PPI) network of the CDEGs was constructed with the Search Tool for the Retrieval of Interacting Genes (STRING, https://string-db.org/cgi/), a free tool for researchers. The minimum required interaction score was set to 0.04. The PPI network was then visualized with Cytoscape version 3.7.2, and closely related modules were highlighted with the MCODE plugin. The highest-scoring protein in each module was selected for more in-depth analysis.
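As a rough illustration of the network-building step, the sketch below keeps only interactions at or above a minimum score and ranks nodes by their number of partners. It is not the MCODE algorithm: MCODE scores nodes by local neighbourhood density, and plain degree is used here only as a simple stand-in. The edge layout, the function names and the placeholder proteins P1 to P4 with their scores are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (protein A, protein B, combined interaction score in [0, 1]); hypothetical layout.
Edge = Tuple[str, str, float]


def build_network(edges: Iterable[Edge], min_score: float) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from edges with score >= min_score."""
    adj: Dict[str, Set[str]] = defaultdict(set)
    for a, b, score in edges:
        if a != b and score >= min_score:
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


def rank_by_degree(adj: Dict[str, Set[str]]) -> List[Tuple[str, int]]:
    """Rank nodes by number of partners (ties broken alphabetically).

    Simple stand-in for module scoring; this is NOT MCODE.
    """
    return sorted(((node, len(nbrs)) for node, nbrs in adj.items()),
                  key=lambda item: (-item[1], item[0]))


# Placeholder proteins and scores, for illustration only.
toy_edges = [("P1", "P2", 0.90), ("P1", "P3", 0.50),
             ("P2", "P3", 0.02), ("P3", "P4", 0.70)]
network = build_network(toy_edges, min_score=0.04)  # threshold as stated in the text
print(rank_by_degree(network))
```

Raising `min_score` removes weaker edges and usually yields a sparser network with fewer, tighter modules.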

Kaplan-Meier survival analysis
The prognostic value of the CDEGs in patients with non-small cell lung cancer was analyzed with the Kaplan-Meier Plotter online database (https://kmplot.com/analysis/index.php?p=background), which combines data from the GEO, EGA (European Genome-phenome Archive) and TCGA (The Cancer Genome Atlas) databases and can assess the effect of 54K genes on survival in 21 tumor types. In this study, we used the NSCLC information in the database to analyze the prognostic value of the CDEGs. A P value < 0.05 was considered statistically significant.
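For readers unfamiliar with the estimator behind such survival curves, a minimal Kaplan-Meier sketch follows. It is not the Kaplan-Meier Plotter implementation; the function name `kaplan_meier` and the follow-up times are hypothetical. At each time with at least one observed death, the running survival probability is multiplied by (1 - deaths / number at risk); censored patients leave the risk set without lowering the curve.

```python
from typing import List, Tuple


def kaplan_meier(times: List[float], events: List[int]) -> List[Tuple[float, float]]:
    """Return [(time, survival probability)] at each distinct event time.

    events[i] is 1 if death was observed at times[i], 0 if the patient
    was censored (still alive at last follow-up).
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    data = sorted(zip(times, events))
    at_risk = len(data)
    survival = 1.0
    curve: List[Tuple[float, float]] = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        leaving = 0
        # Group all patients sharing the same follow-up time.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            leaving += 1
            i += 1
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Made-up follow-up times in months for one expression group.
months = [5, 8, 12, 12, 20]
died = [1, 0, 1, 1, 0]
for t, s in kaplan_meier(months, died):
    print(f"t={t} months, S(t)={s:.3f}")
```

To compare high- and low-expression groups, one curve would be computed per group and the difference tested, for example with a log-rank test, which is not shown here.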

Validation of CDEGs expression levels and correlation analysis
For the promising CDEGs, we verified their expression levels in lung adenocarcinoma samples through the GEPIA (Gene Expression Profiling Interactive Analysis) online database (http://gepia.cancerpku.cn/index.html). The cutoff values were set as |logFC| > 1 and P < 0.01. In addition, we assessed the correlation between expression levels and the clinical stage of tumors, and investigated whether the valuable CDEGs are independent factors influencing the prognosis of NSCLC.
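The pairwise correlation between two genes' expression levels can be illustrated with a hand-written Pearson coefficient. This is a sketch, not GEPIA's code; the function name `pearson_r` and the expression values are hypothetical. A coefficient near 0 indicates little linear relationship between the two genes, while values near +1 or -1 indicate strong positive or negative correlation.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return sxy / math.sqrt(sxx * syy)


# Made-up expression values (e.g. log2-scaled) for two genes across six samples.
gene_a = [2.1, 3.4, 1.8, 4.0, 2.9, 3.1]
gene_b = [5.0, 4.8, 5.3, 4.9, 5.1, 4.7]
print(round(pearson_r(gene_a, gene_b), 3))
```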

Screening of CDEGs
We collected four NSCLC data sets (GSE101929, GSE18842, GSE19188 and GSE19804) from the GEO database and performed differential expression screening. These four data sets contained 3179, 3162, 2601 and 1404 differentially expressed genes, respectively, of which 952 were commonly differentially expressed genes (CDEGs), including 256 up-regulated genes and 696 down-regulated genes (Fig. 1).

Enrichment of function and KEGG pathway
To get an overview of the role of the 952 CDEGs in the development of NSCLC, we performed a gene ontology analysis using the DAVID online tool. We selected the five most significantly enriched terms in each GO category (Table 2). The analysis showed that the biological processes significantly associated with the development of non-small cell lung cancer included cell adhesion, extracellular matrix organization, angiogenesis, positive regulation of angiogenesis and collagen catabolic process. In addition, cellular components such as proteinaceous extracellular matrix, extracellular space, extracellular region, extracellular matrix and collagen trimer, and molecular functions such as heparin binding, integrin binding, protein binding, calcium ion binding and metalloendopeptidase activity, were also closely associated with the development of non-small cell lung cancer. The KEGG pathway enrichment results showed that the Malaria, Cell cycle, ECM-receptor interaction, Cell adhesion molecules (CAMs) and Protein digestion and absorption pathways were most closely related to the occurrence of non-small cell lung cancer.
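The idea behind term enrichment can be illustrated with a plain hypergeometric over-representation test: given a background of N genes of which K carry an annotation, it gives the probability of seeing at least k annotated genes in a list of n. This sketch is an assumption-laden illustration and does not reproduce DAVID's own statistic, which, to our understanding, is a more conservative modified Fisher exact test. The function name and the numbers are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric probability P(X >= k).

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the query list, k: query genes annotated to the term.
    """
    if not (0 <= K <= N and 0 <= n <= N and 0 <= k <= min(n, K)):
        raise ValueError("inconsistent counts")
    total = comb(N, n)
    # comb(a, b) is 0 when b > a, so impossible overlaps contribute nothing.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Illustrative numbers only: 952 query genes, a made-up term with 300 annotated
# genes in a made-up background of 20000, and 40 of them in the query list.
p_value = hypergeom_enrichment_p(k=40, n=952, K=300, N=20000)
print(f"{p_value:.3e}")
```

A term would then be reported as enriched when its P value falls below the chosen criterion, such as the P value < 0.01 used in this study.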

PPI Network analysis and visualization
We performed a PPI network analysis using the STRING online tool to facilitate a systematic understanding of the interactions between CDEGs and to identify key genes involved in NSCLC development. To recognize the key genes in the network more clearly and intuitively, we visualized the network with Cytoscape and divided it into modules with the MCODE plugin. Of the 28 modules obtained, four were selected for further analysis. These four modules contained 69, 26, 37 and 61 nodes and 2125, 173, 248 and 214 edges, respectively (Table 3), indicating that these genes are very closely related and have multiple interacting molecules. To screen out the most representative target molecule of each module, we selected the highest-scoring gene in each module for further study: GTSE1, NMU, FOS and CDKN1C, in that order (Fig. 2).

Survival analysis
Of the four selected key genes, GTSE1 (logFC = 1.32, adjusted P value < 0.001) and NMU (logFC = 2.81, adjusted P value < 0.001) were up-regulated in tissues of patients with non-small cell lung cancer, whereas FOS (logFC = -2.22, adjusted P value < 0.001) and CDKN1C (logFC = -1.56, adjusted P value < 0.001) were down-regulated. To assess the prognostic value of GTSE1, NMU, FOS and CDKN1C in patients with NSCLC, we analyzed another group of NSCLC cases from the GEO, EGA and TCGA databases using Kaplan-Meier Plotter. The results showed that high expression levels of GTSE1 (P value < 0.01) and NMU (P value < 0.01) were significantly associated with shorter overall survival in LADE patients. In contrast, high expression levels of FOS (P value < 0.01) and CDKN1C (P value < 0.01) were significantly associated with longer overall survival in patients with LADE (Fig. 3). In LUSC, however, only the expression level of GTSE1 was associated with patient prognosis; NMU, FOS and CDKN1C were not.
These findings indicate that high expression of GTSE1 and NMU and low expression of FOS and CDKN1C can be used as indicators of poor prognosis in LADE.

GEPIA analysis
We have demonstrated the potential clinical value of GTSE1, NMU, FOS and CDKN1C in the prognosis of patients with NSCLC. To confirm whether the expression levels of GTSE1, NMU, FOS and CDKN1C in NSCLC tissues were consistent with the results from the four data sets, we verified 1654 samples by GEPIA analysis. The results showed that in lung adenocarcinoma the expression levels of GTSE1 (P < 0.05) and NMU (P < 0.05) were significantly up-regulated, and the expression levels of FOS (P < 0.05) and CDKN1C (P < 0.05) were significantly down-regulated (Fig. 4). The validation results for the four CDEGs were fully consistent with the results from the four data sets, further supporting the prognostic value of these four targets for LADE. In addition, we conducted a correlation analysis with GEPIA to investigate whether the four targets affect LADE independently of each other. The results showed only very weak positive or negative correlations between GTSE1 and NMU, GTSE1 and FOS, and FOS and CDKN1C (Fig. 5). The analysis of gene expression levels against clinical stage showed that, unlike GTSE1, the expression levels of NMU, FOS and CDKN1C were not related to the clinical stage of lung adenocarcinoma (Fig. 6). The expression level of GTSE1 increased with tumor stage, indicating that it can be used as a potential indicator of tumor progression. Based on all the above findings, GTSE1, NMU, FOS and CDKN1C have good prognostic value for patients with lung adenocarcinoma.
An important reason for the high mortality of lung cancer is that it readily metastasizes and relapses during treatment. There is therefore an urgent clinical need for prognostic markers that accurately reflect tumor progression. Transcriptomics, with its high-throughput advantages, is a powerful aid to medical researchers in screening target molecules. We therefore integrated and studied four mRNA expression profiles of non-small cell lung cancer derived from the GEO database.
By comparing NSCLC tissues with paired paracancerous tissues, we screened 952 CDEGs from four expression profiles, including 256 CDEGs with up-regulated expression and 696 CDEGs with down-regulated expression.
Gene ontology analysis showed that the CDEGs were mainly enriched in biological processes such as cell adhesion, angiogenesis and positive regulation of angiogenesis, and in KEGG pathways such as ECM-receptor interaction and cell adhesion molecules (CAMs). Piao JJ et al. reported a similar result [11].
We selected GTSE1, NMU, FOS and CDKN1C for further research because the PPI network analysis identified them as core genes. GTSE1 has been reported to be highly expressed in tumors such as liver cancer and melanoma and to be associated with poor patient prognosis [18,19]. Moreover, GTSE1 may be involved in tumorigenesis and progression by regulating p53 phosphorylation [20,21]. Another liver cancer study showed that down-regulation of GTSE1 promotes apoptosis, reduces anti-apoptotic ability and enhances sensitivity to chemotherapeutic drugs [22]. GTSE1 may therefore not only have potential as a prognostic marker for lung adenocarcinoma, but may also serve as a new target that provides a more effective route for targeted cancer therapy. NMU is best known as an inducer of uterine smooth muscle contraction. It also participates in the formation and development of various tumors. For example, Koji Takahashi et al. reported that the positive rate of NMU in NSCLC and SCLC was as high as 68% and 82%, respectively, and that overexpression of NMU was verified at both the protein and the transcriptional level [23]. In addition, studies have shown that NMU is overexpressed in HER2-overexpressing breast cancer, and that overexpression of NMU in breast cancer is associated with poor patient prognosis [24,25]. Similar findings have been reported in clear cell renal cell carcinoma and endometrial carcinoma [26][27][28]. CDKN1C is a tumor suppressor gene that is down-regulated in gastric cancer [29], bladder cancer [30], pancreatic cancer [31], lung cancer [32] and breast cancer [33], and its low expression is associated with poor patient prognosis. Importantly, all of the above research results strongly support our analysis results.
In addition, GTSE1, NMU, FOS and CDKN1C showed only very weak correlations with one another, suggesting that each target can be used alone as a prognostic marker for NSCLC. In conclusion, we believe that GTSE1, NMU, FOS and CDKN1C can be used as potential markers for the prognosis of lung adenocarcinoma, and that they provide a basis for evaluating treatment efficacy and monitoring recurrence in clinical lung adenocarcinoma. At the same time, GTSE1 may also be used as a new target for cancer treatment.

Conclusion
We performed a bioinformatics-based differential analysis of large-scale non-small cell lung cancer samples and matched paracancerous tissues, and selected four core CDEGs for in-depth study. Through meta-analysis, expression level verification and correlation analysis, we conclude that GTSE1, NMU, FOS and CDKN1C have potential clinical application value as prognostic markers of LADE. At the same time, GTSE1 may also be used as a new target for cancer treatment.

Declarations
All available URLs and online tools have been shown in the text.

Figure 1
CDEGs from four data sets. Analysis of the four data sets revealed 256 commonly up-regulated and 696 commonly down-regulated differentially expressed genes.

Figure 3
Prognosis analysis of CDEGs in patients with lung adenocarcinoma (LADE) and lung squamous cell carcinoma (LUSC). A indicates that, in lung adenocarcinoma patients, high expression of GTSE1 and NMU and low expression of FOS and CDKN1C are associated with poor prognosis. B indicates that, in lung squamous cell carcinoma patients, high expression of GTSE1 is associated with poor prognosis, while the associations for NMU, FOS and CDKN1C are not statistically significant.

Figure 5
Correlation analysis of GTSE1, NMU, FOS and CDKN1C expression levels in lung adenocarcinoma tissues. The results show a weak positive correlation between GTSE1 and NMU and between FOS and CDKN1C, and a weak negative correlation between GTSE1 and FOS. Correlation was assessed using the Pearson method; P < 0.05 was considered statistically significant.

Figure 6
Analysis of the correlation between GTSE1, NMU, FOS and CDKN1C expression levels and the clinical stage of lung adenocarcinoma. Only the expression level of GTSE1 is correlated with lung adenocarcinoma stage, and the correlation is positive; NMU, FOS and CDKN1C show no such correlation.