Integrated bioinformatics analysis reveals ASPM and CENPF with prognostic value in lung cancer

Lung cancer (LC) is the most frequent type of cancer in the world, but its underlying mechanisms remain largely unknown. In this study, we analyzed three lung cancer gene expression microarray datasets covering different pathologic types to explore potential candidate genes in LC using integrated bioinformatics methods. A total of 459 overlapped differentially expressed genes (DEGs) were identified across three GEO gene expression profiles from different pathologic types of lung cancer, and their functional annotation was analyzed. For biological process, the DEGs were enriched in regulation of vasculature development and angiogenesis. The most significant molecular function of the DEGs was TGF-β receptor activity. The most significant Reactome pathways of the DEGs were cell cycle and extracellular matrix organization. A protein-protein interaction (PPI) network of the DEGs was constructed, and 23 candidate hub genes were identified in the network. Kaplan-Meier survival analysis showed that 21 of these genes were associated with the prognosis of LC. Genetic alteration analysis of these genes using cBioPortal showed that ASPM had the highest genetic alteration rate (9%) across the main pathological types in 3191 LC patients, and CENPF had the second highest (6%). ASPM and CENPF also showed a significant co-occurrence relationship in LC, and both participate in the regulation of the cell cycle. In the TF-miRNA-gene network of the 21 genes, CENPF had the most significant value, and the most relevant transcription factors (TFs) were NFYA, E2F1 and MYC. In conclusion, this study identified several key genes in LC and analyzed their potential TFs, providing possible therapeutic targets and biomarkers for further clinical application.


Background
Lung cancer (LC), one of the malignant tumors with the highest incidence worldwide, is one of the leading causes of cancer-associated mortality (1). The overall mortality of LC has increased rapidly in recent years, from 3.5 million in 1990 to 4.2 million in 2015 (2). Because early diagnosis of LC is difficult and effective medical therapy is lacking, the long-term survival rate of LC patients is improving only slowly (1). The occurrence and progression of lung cancer, as with most types of human tumors, is recognized as a heterogeneous process: environmental factors such as air pollution, unhealthy habits such as smoking, and genetic factors are all important causes of lung cancer (3,4). Abnormal expression and alteration of genes is vital to the development of LC (5,6), and therapy aimed at genetic alterations has improved the survival of terminal-stage lung cancer patients in recent years (7,8). Thus, it is very important to find the key genes and explore the precise mechanisms of lung cancer, which may help provide biomarkers for early diagnosis of LC and identify novel therapeutic targets.
Rapid progress in gene expression microarrays and bioinformatics has promoted the discovery of key genes and underlying molecular mechanisms in human diseases, especially tumors (9-11). In recent years, numerous microarray studies of LC have been carried out, producing a large amount of chip data that is mainly stored in public databases such as the Gene Expression Omnibus (GEO) and The Cancer Genome Atlas (TCGA). Most microarray studies of LC are single-cohort studies with small sample sizes and are not stratified by the pathological type of the lung tumor. All of these factors reduce the reliability of the results. However, using integrated bioinformatics methods to combine different studies, increase the sample size and analyze microarray data by pathological type might mitigate these disadvantages.
In this study, we selected three original human lung cancer microarray datasets from GEO: GSE74706, including 10 lung adenocarcinoma (LAC) tissues, 8 lung squamous cell carcinoma (LSCC) tissues and 18 normal lung tissues; GSE60052, including 79 small cell lung cancer (SCLC) tissues and 7 normal control lung tissues; and GSE19804, including 60 lung cancer tissues and 60 normal control lung tissues. We then analyzed DEGs using the online tools GEO2R and BioJupies (12) according to the pathological type of lung tumor, identified the DEGs that overlapped across all three pathological types, performed GO term and pathway enrichment analysis, and explored key candidate genes in the PPI network using Cytoscape software. We further performed Kaplan-Meier survival analysis of the hub genes using the online tool Kaplan-Meier Plotter (http://kmplot.com/analysis/).
We then used the cBioPortal for Cancer Genomics (http://www.cbioportal.org/) (13) to explore the genetic alterations and co-occurrence of the hub genes in LC patients. Finally, a TF-miRNA-gene network was constructed with NetworkAnalyst (https://www.networkanalyst.ca/) (14) to explore the potential regulatory mechanisms of these genes.

Identification of DEGs in Different Pathologic Types of LC
We screened 1328 DEGs of lung cancer from GSE19804, 4189 DEGs of LAC and 5764 DEGs of LSCC from GSE74706, and 3368 DEGs of SCLC from GSE60052, using cut-off criteria of an adjusted P value of 0.05 and a |logFC| of 1. A total of 459 overlapped DEGs were obtained (Figure 1).

Functional Annotation and Pathways Enrichment of DEGs
Gene Ontology analysis and Reactome pathway analysis of the DEGs were performed with Enrichr. In the Gene Ontology analysis, the DEGs were classified into three groups: biological process, molecular function and cellular component (Figure 2). In the biological process group, the DEGs were mainly enriched in positive regulation of vasculature development, regulation of angiogenesis, and positive regulation of angiogenesis. In the molecular function group, the DEGs were mainly enriched in transforming growth factor beta-activated receptor activity, transmembrane receptor protein serine/threonine kinase activity, protein homodimerization activity, integrin binding, transforming growth factor beta binding, histone kinase activity, amyloid-beta binding, BMP receptor activity, endopeptidase inhibitor activity, and kinase binding. In the cellular component group, the DEGs were mainly enriched in condensed nuclear chromosome, centromeric region; lamellar body; integral component of plasma membrane; condensed chromosome kinetochore; condensed nuclear chromosome kinetochore; spindle; G-protein coupled receptor dimeric complex; platelet alpha granule; membrane raft; and condensed chromosome, centromeric region. In the Reactome pathway analysis, the most significant pathways were Cell Cycle, Mitotic; Cell Cycle; Hemostasis; Platelet degranulation; Extracellular matrix organization; Signaling by Rho GTPases; Response to elevated platelet cytosolic Ca2+; and Mitotic Prometaphase.

The Genetic Alteration Analysis of Candidate Genes in LC Patients
We examined the genetic alterations of these 21 genes using cBioPortal. The results showed that 19 genes had genetic alterations in the main pathological types of lung tumor (Figure 6). Among these genes, ASPM had the highest genetic alteration rate (9%) in 3191 LC patients, and CENPF had the second highest (6%). ASPM and CENPF also showed a significant co-occurrence relationship in LC (Table 2).

Discussion
Although there has been great progress in targeted therapy and immunotherapy for lung cancer, its overall mortality is still high. Early diagnosis and treatment are critical, and it is important to seek novel therapeutic targets and biomarkers for the prevention, diagnosis and treatment of lung cancer. Bioinformatic analysis combined with microarray expression profiling is considered a powerful and comprehensive method for discovering novel diagnostic markers or therapeutic targets for various diseases, particularly cancers.
In this study, we integrated three lung cancer profile datasets from the GEO database and analyzed the data according to the pathologic types of lung cancer. We identified 459 overlapped DEGs in the three main pathologic types of lung cancer: LAC, LSCC and SCLC. GO term enrichment analysis was performed to annotate the DEGs, and the results demonstrated that the significant cancer-related biological processes were regulation of vasculature development and angiogenesis. Angiogenesis is essential for tumor growth and metastasis and plays an important role in the control of cancer progression. It has been validated as an effective therapeutic target in several kinds of tumors, including lung cancers, through randomized controlled clinical trials, and one class of effective drugs is the vascular endothelial growth factor (VEGF) pathway inhibitors (15,16).
The most significant molecular function of the DEGs was TGF-β receptor activity. The TGF-β pathway is important for the genesis and development of tumors. In the initiation stage of cancer development, TGF-β acts as a potential tumor suppressor and plays a critical role in inhibiting tumorigenesis by suppressing the proliferation of tumor epithelial cells. As the tumor progresses, TGF-β switches to a tumor promoter: it participates in cancer progression by promoting the proliferation, invasion and metastasis of cancer cells, and is also vital for maintaining the potential of cancer stem cells (17,18).
In the Reactome pathway analysis, the most significant pathways of the DEGs were cell cycle and extracellular matrix organization. The cell cycle is a fundamental process of cell life that controls cell growth and proliferation, and an abnormal cell cycle is strongly associated with carcinogenesis. Cyclin-dependent kinases (CDKs), key factors that regulate the cell cycle, are ideal therapeutic targets in tumors (19). The extracellular matrix (ECM) is an important tissue barrier that prevents tumor metastasis. The key components of the ECM, fibronectin and laminin, bind to integrin receptors on the surface membrane of cancer cells; they also determine the shape of cancer cells and control their differentiation and migration (20).
The PPI network of the DEGs was constructed and 23 key candidate genes were identified, 21 of which were significantly associated with the prognosis of lung cancer. Among these genes, ASPM had the highest genetic alteration rate (9%) in 3191 LC patients of the three main pathological types, and CENPF had the second highest (6%). The data showed that ASPM and CENPF had a significant co-occurrence relationship in LC, that the two genes interact in the PPI network, and that their related functions and pathways are both enriched in the cell cycle. The main genetic alterations of ASPM and CENPF were amplification and missense mutation. ASPM (abnormal spindle microtubule assembly) encodes a centrosomal protein that plays a critical role in regulating the proliferative divisions of neuroepithelial cells and neurogenesis, and participates in the malignant progression of many kinds of tumors (21,22). Overexpression of ASPM has been shown to be related to pathological stage and poor prognosis in liver cancer, pancreatic cancer, ovarian cancer and glioblastoma (23-25).
To explore the potential underlying mechanisms of the candidate hub genes, we constructed a TF-miRNA-gene network of the 21 genes, and the results showed that CENPF had the most significant value in the network. CENPF (centromere protein F), which is mainly expressed in the G2/M phase of the cell cycle, is an important protein for cell cycle regulation: it recruits checkpoint proteins such as Mad1 and BubR1, resulting in a consistent checkpoint response. CENPF has been identified as a key gene in several cancers, especially prostate cancer (26). CENPF has been reported as a critical regulator in the COUP-TFII-FOXM1-CENPF axis in human prostate cancer, which is activated during metastasis and results in a poor prognosis (27). Decreasing the activity of CENPF can markedly inhibit tumorigenesis in prostate cancer mouse models (28). Although ASPM and CENPF have been shown to be master regulators in several cancers, there are few reports associating them with lung cancer. Here we find that ASPM and CENPF are poor prognostic factors in lung cancer and that both have a high genetic alteration rate in the main pathological types of lung cancer. This result indicates that ASPM and CENPF may be potential therapeutic targets or biomarkers for lung cancer.
The TF-miRNA-gene network revealed that the most relevant TFs were NFYA, E2F1 and MYC, and that CENPF was the central node in the network. The association between CENPF and these TFs has rarely been reported. E2F1 and MYC play important roles in cell cycle progression and apoptosis and are associated with the progression of several tumors (26,29), which suggests that E2F1 and MYC may interact with CENPF and participate in tumorigenesis by regulating the cell cycle.

Conclusion
In this study, we identified ASPM and CENPF as key genes in the three main pathologic types of lung cancer (LAC, LSCC and SCLC), and these two genes may become potential therapeutic targets or biomarkers for lung cancer. Our study also has several limitations. The main limitation is that the results were obtained from microarray data and public databases through bioinformatic analysis and cannot provide direct information on gene expression level, protein activity or genetic alteration. Multicenter, large-sample follow-up studies are necessary to verify our results.
Materials And Methods

The Information of Microarray Data and Identification of DEGs
Gene expression profiles of lung cancer and normal lung tissues were obtained from NCBI-GEO: GSE74706, including 10 LAC lung tissues, 8 LSCC lung tissues and 18 normal lung tissues; GSE60052, including 79 SCLC lung tissues and 7 normal control lung tissues; and GSE19804, including 60 lung cancer tissues and 60 normal control lung tissues. We analyzed the data according to the pathologic types of lung cancer. GSE74706 and GSE19804 were analyzed with GEO2R, an online tool of GEO. GSE60052 was analyzed with BioJupies (https://amp.pharm.mssm.edu/biojupies/) (12), an online tool for analyzing GEO RNA-seq data. The criteria for DEGs were an adjusted P value cut-off of 0.05 and a |logFC| cut-off of 1. We then identified the genes that overlapped across the different lung cancer pathologic types.
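The filtering and overlap step described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the GEO2R or BioJupies implementation: the record layout, the function names `select_degs` and `overlapped_degs`, the handling of values exactly at the cut-offs, and the toy values (including the placeholder genes GENE_X and GENE_Y) are all assumptions made for the example.

```python
from typing import Dict, Iterable, Set, Tuple

# Each record is (gene symbol, adjusted P value, log fold change).
Record = Tuple[str, float, float]


def select_degs(records: Iterable[Record],
                p_cutoff: float = 0.05,
                lfc_cutoff: float = 1.0) -> Set[str]:
    """Return genes passing both cut-offs.

    Boundary handling (strict for P, inclusive for |logFC|) is an assumption.
    """
    return {
        gene
        for gene, adj_p, log_fc in records
        if adj_p < p_cutoff and abs(log_fc) >= lfc_cutoff
    }


def overlapped_degs(per_type: Dict[str, Iterable[Record]]) -> Set[str]:
    """Intersect the DEG sets obtained for every comparison."""
    deg_sets = [select_degs(records) for records in per_type.values()]
    return set.intersection(*deg_sets) if deg_sets else set()


# Toy values, invented for illustration only (not taken from the datasets).
toy = {
    "LAC": [("ASPM", 0.001, 2.1), ("CENPF", 0.002, 1.8), ("GENE_X", 0.20, 3.0)],
    "LSCC": [("ASPM", 0.003, 2.5), ("CENPF", 0.010, 1.2), ("GENE_Y", 0.01, 0.4)],
    "SCLC": [("ASPM", 0.0001, 3.0), ("CENPF", 0.004, 2.2)],
}
print(sorted(overlapped_degs(toy)))
```

Using set intersection makes the overlap independent of the order in which the comparisons are supplied.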

Functional Annotation and PPI Network of DEGs
Gene Ontology analysis and Reactome pathway analysis of the DEGs were performed with Enrichr (http://amp.pharm.mssm.edu/Enrichr/) (30), with a P value cut-off of 0.05. The PPI network of the DEGs was constructed using STRING (http://string-db.org) (31) with a confidence score threshold of >0.4. We visualized the PPI network using the Cytoscape software platform (32).

Key Candidate Genes Screening
In the PPI network, we calculated the degree of connectivity of each node using cytoHubba, a Cytoscape app. To add credibility to the results, we used two topological algorithms (MCC and Degree) to calculate the 25 top-scoring nodes, and chose the nodes that overlapped between both algorithms as hub genes for further study.
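The selection rule above (top-scoring nodes under two rankings, then their overlap) can be illustrated as follows. This is a hedged sketch rather than cytoHubba itself: the degree score is computed from an edge list, while the second score table stands in for MCC values that would be exported from cytoHubba. The function names, the tie-breaking rule and the toy network are assumptions.

```python
from typing import Dict, Iterable, List, Set, Tuple


def degree_scores(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Count the distinct interaction partners of each node (undirected edges)."""
    neighbours: Dict[str, Set[str]] = {}
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    return {node: len(partners) for node, partners in neighbours.items()}


def top_nodes(scores: Dict[str, float], k: int) -> List[str]:
    """Return the k highest-scoring nodes; ties are broken alphabetically."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [node for node, _ in ranked[:k]]


def hub_genes(score_tables: Iterable[Dict[str, float]], k: int = 25) -> Set[str]:
    """Keep the nodes that appear in the top k of every scoring method."""
    tops = [set(top_nodes(scores, k)) for scores in score_tables]
    return set.intersection(*tops) if tops else set()


# Toy network and hypothetical second-method scores, for illustration only.
toy_edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
toy_other = {"A": 6.0, "B": 4.0, "C": 1.0, "D": 1.0, "E": 5.0}
print(hub_genes([degree_scores(toy_edges), toy_other], k=2))
```

With k set to 25 and the two score tables taken from the MCC and Degree rankings, the same intersection yields the hub gene list.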

Kaplan-Meier Survival Analysis of the Candidate Genes
We performed Kaplan-Meier survival analysis of the hub genes using Kaplan-Meier Plotter (http://kmplot.com/analysis/), which contains survival data for 2437 lung cancer patients. The significance criterion was a log-rank P value cut-off of 0.05. Genes significantly associated with the prognosis of LC were selected for further study.
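For readers unfamiliar with the estimator behind these survival curves, the product-limit (Kaplan-Meier) calculation can be sketched as below. This is a minimal illustration and not the Kaplan-Meier Plotter implementation; it covers only the survival curve for one patient group, not the log-rank comparison between high- and low-expression groups. The function name and the toy follow-up data are assumptions.

```python
from typing import List, Sequence, Tuple


def kaplan_meier(times: Sequence[float],
                 events: Sequence[int]) -> List[Tuple[float, float]]:
    """Product-limit estimate: (event time, survival probability) pairs.

    events[i] is 1 if the event was observed at times[i], 0 if censored.
    """
    if len(times) != len(events):
        raise ValueError("times and events must have the same length")
    survival = 1.0
    curve = []
    # Survival only changes at times where at least one event was observed.
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        died = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1.0 - died / at_risk
        curve.append((t, survival))
    return curve


# Toy follow-up times (arbitrary units) and event flags, invented for illustration.
toy_times = [2, 3, 3, 5, 8]
toy_events = [1, 1, 0, 1, 0]
for time, prob in kaplan_meier(toy_times, toy_events):
    print(time, round(prob, 3))
```

Censored patients contribute to the number at risk up to their last follow-up time but never trigger a drop in the curve.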

The Genetic Alteration Analysis of Candidate Genes in LC Patients
We used the cBioPortal for Cancer Genomics (http://www.cbioportal.org/) (13) to explore the genetic alterations and co-occurrence of the hub genes in LC patients. We chose five LC studies comprising 3246 LC samples in total.
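A co-occurrence relationship between two genes is commonly assessed from a 2x2 table of samples (both altered, only one altered, neither altered). The sketch below computes a log2 odds ratio and a one-sided Fisher exact P value for such a table. The text does not specify the statistic used, so treating the analysis as a Fisher-type test is an assumption, as are the function name and the toy counts.

```python
from math import comb, log2
from typing import Tuple


def cooccurrence(both: int, only_a: int, only_b: int,
                 neither: int) -> Tuple[float, float]:
    """Return (log2 odds ratio, one-sided Fisher P value for co-occurrence)."""
    n = both + only_a + only_b + neither
    altered_a = both + only_a
    altered_b = both + only_b
    # Hypergeometric upper tail: probability of at least `both` shared samples
    # when the numbers of samples altered in each gene are held fixed.
    upper = min(altered_a, altered_b)
    p_value = sum(
        comb(altered_a, k) * comb(n - altered_a, altered_b - k)
        for k in range(both, upper + 1)
    ) / comb(n, altered_b)
    num, den = both * neither, only_a * only_b
    if den == 0:
        log_or = float("inf") if num > 0 else float("nan")
    elif num == 0:
        log_or = float("-inf")
    else:
        log_or = log2(num / den)
    return log_or, p_value


# Toy counts, invented for illustration only (not the study's data).
log_or, p = cooccurrence(both=8, only_a=2, only_b=1, neither=89)
print(log_or > 0, p < 0.05)
```

A positive log2 odds ratio with a small P value indicates a tendency toward co-occurrence, whereas a negative value would point toward mutual exclusivity.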

Function and Pathway Analysis of ASPM and CENPF
The PPI network of ASPM and CENPF was constructed using STRING, and Gene Ontology analysis and KEGG pathway analysis of ASPM- and CENPF-associated genes were performed in STRING.