Genes co-related with poor prognosis of patients with lung cancer via bioinformatical approaches

Lung cancer is one of the most common malignant tumors and has high mortality worldwide. Recent studies have reported that molecular markers of lung cancer can be used as diagnostic and prognostic targets. However, these molecules are not ideal in terms of specificity and selectivity. Therefore, more reliable biomarkers that improve prognosis and clarify the underlying mechanisms are urgently needed for both clinical and basic research. This study aimed to identify significant genes associated with poor prognosis in lung cancer and their underlying mechanisms. We used gene expression datasets available from the GEO (Gene Expression Omnibus) database; the selected datasets comprised 109 lung cancer samples and 27 normal samples. First, DEGs (differentially expressed genes) between lung cancer and normal lung samples were screened out with the GEO2R tool and displayed with Venn diagram software and a heatmap. Second, we used DAVID (Database for Annotation, Visualization and Integrated Discovery) to perform KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway and GO (Gene Ontology) analyses. Third, the PPI (protein-protein interaction) network of these DEGs was constructed with Cytoscape and STRING (Search Tool for the Retrieval of Interacting Genes). Our results showed that 21 up-regulated and 116 down-regulated genes had consistent expression trends across the three selected datasets. Using the MCODE (Molecular Complex Detection) plug-in, 11 up-regulated and 16 down-regulated genes were selected. To further verify the expression differences, GEPIA (Gene Expression Profiling Interactive Analysis) was applied, and 26 of the 27 genes were found to be differentially expressed in lung cancer compared with normal lung tissues. Furthermore, Kaplan-Meier analysis of overall survival showed that 23 of the 26 genes were associated with significantly shorter survival. Finally, three genes, CDH5, CLDN5 and PECAM1, which were significantly decreased in lung cancer tissue, were identified through re-analysis with DAVID; they were mainly associated with leukocyte trans-endothelial migration. In conclusion, three significantly down-regulated differentially expressed genes associated with poor prognosis in lung cancer were identified on the basis of integrated bioinformatical methods. These down-regulated genes may serve as potential prognostic targets for patients with lung cancer.


Introduction
Lung cancer is the most common malignant tumor with high mortality in both men and women around the world [1][2][3]. Because of its insidious onset, most patients are already at an advanced stage at the time of diagnosis [4]. Regardless of subtype, the overall survival rate of lung cancer patients remains disappointing; fewer than 7% of patients survive 10 years after diagnosis across all stages of lung cancer [5]. Recently, researchers reported that molecular markers on lung cancer cells, such as 14C5 and α3β1, could be used as diagnostic targets [6][7]. However, these molecules were not ideal in terms of specificity and selectivity. Therefore, exploring more reliable biomarkers to improve prognosis and clarifying the underlying mechanisms are urgently needed for both clinical and basic research.
Gene chip assays have already been applied to the study of gene expression [8]. However, with this method alone, only changes in the expression of one or a few genes can be interpreted at a time. More recently, bioinformatical methods have been reported that can be applied across multiple genes to further investigate the underlying mechanisms, signaling pathways and interactions [9].
In this study, we first chose GSE33532, GSE43346 and GSE118370 from the GEO database. Second, we applied the GEO2R online tool and displayed the DEGs with Venn diagram software and a heatmap. Third, DAVID was used to analyze these DEGs, including GO analysis (molecular function (MF), cellular component (CC), biological process (BP)) and KEGG pathways. Fourth, to find the core genes, we established a PPI network and then applied the Cytoscape MCODE app for additional analysis of the DEGs. These core genes were then submitted to the Kaplan Meier plotter online database for survival analysis and to GEPIA to further verify their differential expression (P < 0.05).
Taking all the data above together, 23 DEGs were screened out. We then re-analyzed these 23 DEGs for KEGG pathway enrichment. Finally, three genes (CDH5, CLDN5, PECAM1) were found to be significantly enriched in the leukocyte trans-endothelial migration pathway.

Microarray data information
Three gene expression profiles of lung cancer and normal lung samples, GSE33532, GSE43346 and GSE118370, were obtained from NCBI-GEO [10]. The microarray data of all three datasets were based on the GPL570 platform and included 80 lung cancer samples and 20 normal lung samples, 23 lung cancer samples and 1 normal lung sample, and 6 lung cancer samples and 6 normal lung samples, respectively.

Screen out DEGs
DEGs between lung cancer and normal lung samples in the three datasets were screened out with the GEO2R online tool [11], using |logFC| > 2 and P value < 0.05 as thresholds. DEGs with logFC < -2 were considered down-regulated genes, whereas DEGs with logFC > 2 were considered up-regulated genes.
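The thresholds above amount to a simple filter. The snippet below is a minimal sketch, assuming each GEO2R result row has been reduced to a gene symbol, a logFC and a P value. The names `Row` and `classify_degs` and the toy rows are hypothetical and are not taken from the study.

```python
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    gene: str       # gene symbol
    log_fc: float   # log fold change, tumor vs. normal
    p_value: float  # P value reported for the comparison


def classify_degs(rows: Iterable[Row], lfc_cut: float = 2.0, p_cut: float = 0.05):
    """Split rows into up- and down-regulated gene sets.

    A gene is kept only if p < p_cut and |logFC| > lfc_cut.
    """
    up, down = set(), set()
    for row in rows:
        if row.p_value >= p_cut:
            continue  # not significant
        if row.log_fc > lfc_cut:
            up.add(row.gene)
        elif row.log_fc < -lfc_cut:
            down.add(row.gene)
    return up, down


# Toy rows with made-up numbers, for illustration only
toy = [
    Row("GENE_A", 2.7, 0.001),
    Row("GENE_B", -3.1, 0.010),
    Row("GENE_C", 2.5, 0.200),   # large change but not significant
    Row("GENE_D", -1.2, 0.001),  # significant but change too small
]
up, down = classify_degs(toy)
print(sorted(up), sorted(down))
```

Only GENE_A and GENE_B pass both thresholds in this toy input; the other two rows fail one criterion each.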

Draw Venn diagram and heatmap
The Venn diagram was drawn with the online Venn tool (http://bioinformatics.psb.ugent.be/webtools/Venn/), and the heatmap was drawn with Excel software.
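The central region of a three-way Venn diagram corresponds to a set intersection across the datasets. The following is a minimal sketch, assuming the DEGs of each dataset are already available as a set of gene symbols. The function name `common_degs` is hypothetical and the gene names are placeholders, not the study's results.

```python
from functools import reduce


def common_degs(deg_sets):
    """Return the genes present in every dataset's DEG set."""
    sets = [set(s) for s in deg_sets]
    if not sets:
        return set()
    return reduce(set.intersection, sets)


# Placeholder gene sets keyed by the three GEO accessions named in the text
degs = {
    "GSE33532": {"G1", "G2", "G3", "G4"},
    "GSE43346": {"G2", "G3", "G5"},
    "GSE118370": {"G2", "G3", "G4", "G6"},
}
shared = common_degs(degs.values())
print(sorted(shared))  # ['G2', 'G3']
```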

Gene ontology analysis and KEGG analysis by DAVID
Gene ontology analysis is a commonly used approach for annotating genes and their RNA and protein products, and for identifying characteristic biological properties of high-throughput transcriptome or genome data [12]. KEGG is an encyclopedia of genes and genomes [13]. DAVID is a functional annotation tool designed to identify the functions of large numbers of genes, RNAs and proteins [14].
Here, we used DAVID to assess the enrichment of the DEGs and their underlying pathways.
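Enrichment tools of this kind ask whether a gene list contains more members of a pathway or GO term than expected by chance. The sketch below uses a plain one-sided hypergeometric test to illustrate the idea. It is a generic over-representation test and not DAVID's exact statistic (DAVID reports a modified Fisher exact P value, the EASE score). The function name and all counts are made up for illustration.

```python
from math import comb


def hypergeom_enrichment_p(k: int, n: int, K: int, N: int) -> float:
    """Probability of seeing at least k pathway genes in a list of n genes.

    N : number of genes in the background
    K : number of background genes annotated to the pathway
    n : number of genes in the submitted list
    k : number of list genes annotated to the pathway
    """
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Made-up counts: 3 of 23 list genes fall in a 100-gene pathway,
# against a background of 20000 genes
p = hypergeom_enrichment_p(k=3, n=23, K=100, N=20000)
print(f"{p:.2e}")
```

With these toy counts the expected number of hits by chance is about 0.1, so observing 3 gives a small P value.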

PPI networks and core genes
We used STRING, a database of functional protein association networks, to evaluate PPI information [15]. Cytoscape software [16] was then applied to explore the potential relationships among these DEGs. Furthermore, the MCODE app of Cytoscape was used to screen out the core genes of the PPI network.
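MCODE weights nodes by the density of their local neighborhood, using k-cores, and then grows clusters from highly weighted seed nodes. The sketch below shows only the simpler underlying idea, a k-core decomposition of an undirected interaction graph. It is not the MCODE algorithm itself, and the protein names and edges are placeholders.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (protein_a, protein_b) pairs."""
    adj = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-loops
            adj[a].add(b)
            adj[b].add(a)
    return adj


def k_core(adj, k):
    """Return the nodes of the k-core.

    The k-core is the largest subgraph in which every node has at
    least k neighbors that are also in the subgraph.
    """
    alive = set(adj)
    changed = True
    while changed:
        changed = False
        for node in list(alive):
            if len(adj[node] & alive) < k:
                alive.discard(node)
                changed = True
    return alive


# Toy network: a fully connected group P1..P4 with a chain P4-P5-P6 attached
edges = [
    ("P1", "P2"), ("P1", "P3"), ("P1", "P4"),
    ("P2", "P3"), ("P2", "P4"), ("P3", "P4"),
    ("P4", "P5"), ("P5", "P6"),
]
graph = build_graph(edges)
print(sorted(k_core(graph, 3)))  # ['P1', 'P2', 'P3', 'P4']
```

The densely connected group survives the pruning while the sparsely connected chain is removed, which is the intuition behind selecting core genes from a PPI network.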

Verification of gene expression differences and survival analysis of core genes
To further verify gene expression differences, we used the GEPIA website (http://gepia.cancer-pku.cn/), which analyzes RNA sequencing expression data from the GTEx project and TCGA, comprising thousands of samples [17]. Kaplan Meier-plotter (http://www.kmplot.com/lung/) [18] is a survival analysis tool widely used to assess the survival information of large numbers of differentially expressed genes based on the EGA, GEO and TCGA databases. The log-rank P value (<0.05) and the hazard ratio with 95% confidence intervals were shown in the upper right corner of the plots.  Fig 1).
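The survival curves produced by such tools rest on the Kaplan-Meier estimator: at each time where a death is observed, the running survival probability is multiplied by one minus the fraction of at-risk patients who died at that time. The following is a minimal sketch with made-up follow-up data. It illustrates the estimator only and is not the implementation behind Kaplan Meier-plotter.

```python
def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each distinct death time.

    times  : follow-up durations
    events : 1 if death was observed, 0 if the subject was censored
    """
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Group all subjects sharing the same follow-up time
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        # Deaths and censored subjects both leave the risk set
        at_risk -= removed
    return curve


# Made-up follow-up times (months) and event flags
times = [5, 8, 8, 12, 15, 20]
events = [1, 1, 0, 1, 0, 1]
curve = kaplan_meier(times, events)
for t, s in curve:
    print(t, round(s, 3))
```

Censored subjects do not cause a drop in the curve, but they reduce the number at risk for later time points.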

GO analysis and KEGG pathway analysis of DEGs in lung cancers
In total, 137 DEGs were analyzed with DAVID 6.7. The results of the GO analysis are shown in Table 2.
In BP, up-regulated DEGs were mainly enriched in different stages of the cell cycle, while down-regulated DEGs were mainly enriched in vasculature development, blood vessel morphogenesis and related terms.
In CC, up-regulated DEGs were mainly enriched in microtubule, cytoskeleton, non-membrane-bounded organelle and other terms, while down-regulated DEGs were mainly enriched in plasma membrane part and other terms.
In MF, down-regulated DEGs were mainly enriched in transforming growth factor beta binding and actin filament binding.
The results of the KEGG analysis are shown in Table 3. Up-regulated DEGs were mainly enriched in the p53 signaling pathway, while down-regulated DEGs were mainly enriched in vascular smooth muscle contraction, cell adhesion molecules, dilated cardiomyopathy and other pathways.

PPI networks and core genes analysis
The expression of all 137 DEGs is shown in a heatmap of relative expression levels (Fig. 2A) [19][20]. The 137 DEGs were then submitted to the PPI network tool of STRING, which produced a network of 104 nodes and 243 edges. The network comprised 87 down-regulated and 17 up-regulated genes (Fig. 2B). We then applied the Cytoscape MCODE app for further analysis. Fig. 2C-D shows that 27 central genes, consisting of 11 up-regulated and 16 down-regulated genes, were screened out.

Analysis of core genes by the Kaplan Meier plotter and GEPIA
GEPIA was used to further verify the expression of the 27 core genes in lung cancer samples versus normal lung samples. Of the 27 genes, 26 showed significantly higher or lower expression in lung cancer samples than in normal lung samples (P<0.05, Table 4 and Fig. 3). Meanwhile, the Kaplan-Meier plotter was used to obtain survival information for the 26 core genes. Of these, 23 genes were associated with significantly worse survival, while 3 genes showed no significant association (Table 5 & Fig. 4). Thus, 23 meaningful genes remained after GEPIA and Kaplan-Meier plotter analysis.

Re-analysis of 23 genes via DAVID
KEGG pathway enrichment was re-analyzed with DAVID to identify possible pathways for these 23 core genes. Three genes (CDH5, CLDN5, PECAM1) were markedly enriched in leukocyte trans-endothelial migration and CAMs (cell adhesion molecules) (Table 6 & Fig. 5A and 5B). Furthermore, Pearson correlation analysis showed no significant correlation among CDH5, CLDN5 and PECAM1 (P>0.05 for each pair).
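The Pearson coefficient used for such pairwise checks is the covariance of two expression vectors divided by the product of their standard deviations. The following is a minimal sketch with made-up values. The function name `pearson_r` is illustrative and the numbers are not the study's data.

```python
from math import sqrt


def pearson_r(x, y):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)


# Made-up expression values for two hypothetical genes across three samples
gene_1 = [-1.0, 0.0, 1.0]
gene_2 = [1.0, 0.0, 1.0]
print(pearson_r(gene_1, gene_2))
```

In this toy input the second gene is a symmetric function of the first, so there is no linear relationship and the coefficient is zero.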

Discussion
In this study, we performed a bioinformatical analysis of three gene chip datasets to find more efficient biomarkers of lung cancer. In total, 109 lung cancer samples and 27 normal samples were included. Analysis with GEO2R identified 137 DEGs that changed consistently across the datasets, and these were displayed with a Venn diagram and a heatmap [21]. GO and KEGG pathway enrichment analyses with DAVID [22] indicated the pathways in which the up-regulated and down-regulated DEGs were enriched. Twenty-seven central genes were screened out with the PPI network and the Cytoscape MCODE app [23][24]. Furthermore, through GEPIA analysis [25] and Kaplan-Meier plotter analysis [26], we found that 26 of these 27 core genes showed clearly higher or lower expression in lung cancer samples than in normal samples, and that 23 of the 26 genes were associated with significantly worse survival. Finally, we re-analyzed the 23 core genes and found that three genes (CDH5, CLDN5, PECAM1) were enriched in leukocyte trans-endothelial migration and cell adhesion molecules (CAMs); these genes might be considered new effective targets for the diagnosis and prognosis of patients with lung cancer.
CDH5 (cadherin 5) encodes a classical cadherin and is located in a region on the long arm of chromosome 16 that is involved in loss of heterozygosity events in breast and prostate cancer. In 2016, Hung identified CDH5 as an angiogenic factor in lung cancer [27]. Beyond lung cancer, Mao reported that CDH5 was overexpressed in gliomas, correlated with tumor grade, and was an independent adverse prognostic predictor for patients with glioblastoma multiforme [28]. In addition, CDH5 has been reported to play a role in regulating angiogenesis, human drug-induced liver injury and gastric cancer [29][30][31].
CLDN5 (claudin 5) is a member of the claudin family; claudins are integral membrane proteins and components of tight junction strands. Mutations in this gene have been found in patients with velocardiofacial syndrome. Jia reported that down-regulation of CLDN5 increased tumor invasion and metastatic potential [32]. Ma found that CLDN5 was closely related to brain metastases from lung cancer [33]. In 2019, Jia indicated that high-dose bevacizumab likely increased lung tumor invasion and metastatic potential by down-regulating CLDN5 [34]. Moreover, CLDN5 has been closely linked to brain-related conditions, such as depression [35], schizophrenia [36], brain edema following fatal heat stroke [37] and tumor brain metastasis [38]. PECAM1 has been reported to be related to angiogenesis in lung cancer [39][40].
Our results suggest that bioinformatical analysis of gene chip datasets can be used to find more efficient biomarkers for lung cancer. However, this study has some limitations. First, the sample size was small, which might bias some of the results. Second, although numerous studies have shown that these three genes are related to various types of cancer, a PubMed search revealed very few studies of CDH5, CLDN5 and PECAM1 in the prognostic evaluation of lung cancer. Therefore, our findings may provide useful information for future studies of these three genes in lung cancer.

Conclusion
Our study, based on bioinformatics analysis, identified three down-regulated DEGs (CDH5, CLDN5, PECAM1) associated with poor prognosis in patients with lung cancer. These down-regulated genes may serve as potential prognostic targets and may be very helpful for clarifying the underlying mechanisms.
The prognostic information of the 26 core genes. The Kaplan-Meier plotter online tool was used to identify the prognostic information of the 26 core genes, and 23 of the 26 genes were associated with a significantly worse survival rate (p < 0.05).