Identification of Cell cycle Gene Signatures Predicting Survival in Patients with Lung Squamous Cell Carcinoma


 Background:Lung cancer (LC) is one of the most important and common malignant tumors, and its incidence and mortality are increasing annually. Lung squamous cell carcinoma (LUSC) is the common pathological type of lung cancer. A small part of biomarkers have been confirmed to be related to the prognosis and survival by data excavation. However, the moderate forecast effect of a single gene biomarker is not accurate. Thus, we aimed to identify new gene signatures to better predict Lung squamous cell carcinoma ( LU SC). Methods : Using the mRNA-mining approach, we performed mRNA expression profiling in large lung squamous cell carcinoma cohorts (n= from The Cancer Genome Atlas (TCGA) database. Gene Set Enrichment Analysis (GSEA) and Gene Set Variation Analysis(GSVA) were accomplished, and connections between genes and cell cycle were found in the Cox proportional regression model. Results : We have confirmed a set of four genes (CDKN1A, CHEK2, E2F4 and RAD21) that were importantly associated with overall survival (OS) in the test series. Based on the research of the four-gene signature, the patients in the test series could be divided into high-risk and low-risk teams. Additionally, multivariate Cox regression analysis revealed that the prognostic power of the four-gene signature is independent of the clinical factors. Conclusion : Our study demonstrated the connections between the four-gene signature and cell cycle. Novel insights into the research mechanisms of cell cycle was also revealed regarding the biomarkers of a poor prognosis for lung squamous cell carcinoma patients.


Background
In recent years, the incidence and mortality rates of lung cancer are the highest all over the world,and it is the leading cause of cancer death in China [1,2].There are two types of pathological categories: Non-small cell lung cancer (NSCLC) and small cell lung cancer (SCLC),and lung squamous cell carcinoma is the one type of Non-small cell lung cancer which accounts for approximately 25%-30% of all lung cancer patients [3].Although the incidence of LUSC can be controlled in a way by restricting tobacco smoking [4]. The identification of targeted treatment strategies for LUSC has remained an complicated goal [5]. An adding amount of evidence demonstrates that the detection and adhibition of molecular biomarkers will improve the prognostic evaluation and identification of potential high-risk patients with LUSC.Therefore we should find more biomarkers which affect the mechanism of tumorigenesis and progression of LUSC,So, the development of novel biomarkers which can predict the prognosis value for LUSC is of great significance.Recent studies have provided insights into the mechanisms that contribute to DNA repair in specific cell-cycle phases and have highlighted the mechanisms that ensure cell-cycle progression or arrest in normal and cancerous cells.Some gene Bind to and inhibits cyclin-dependent kinase activity, preventing phosphorylation of critical cyclin-dependent kinase substrates and blocking cell cycle progression and promote the development of LUSC.
In the past few years, many studies have confirmed that MicroRNAs (miRNAs) can control gene expression at the posttranscriptional level [6], which have been considered new biomarkers of cancer with infinite clinical value [7]. Scientists researched and detected many relating genes,and opened up a new approach for the further research and effective treatment method.Some biomarkers have been confirmed that they are related to the prognostic factor of lung squamous cell carcinoma. miR-324-3p and miR-1285 as diagnostic and prognostic biomarkers for early stage lung squamous cell carcinoma [8]. Minichromosome maintenance protein (MCM) has been found to be a relevant marker for prognosis in many human cancers,which may act as a potential therapeutic target for LUSC patients [9]. With the development of high throughput sequencing and and magnificent data analysis techniques, numerous databases have enabled detailed understanding of genomic changes.Disinterment of cancer related genes and the identification of gather models can lean on bioinformatics and data integration [10]. However, a single gene biomarker cannot produce good prediction effects, and some studies have found that the gene signature can be a better alternative to predict the prognosis and survival [11].The method of the Kaplan-Meier plots and Cox proportional hazards regression have been applied diffusely in biomarker screening [12,13]. Multigene prognostic signatures can be used in clinical treatment because of the original cancer biopsies. However, not all pathways have been explored to identify new biomarkers of LUSC.Therefore, more efforts are needed to find more sensitive biomarkers of LUSC.
The major public project "The Cancer Genome Atlas" (TCGA) [14]can collect,select and analyze human tissue for genome changes effectively. The TCGA database provides analysis of highthroughput data for varieties of genomic changes, including exchange of cell cycle.The information of patient diagnostics, treatment, and tumor pathology can be combined to the bioinformatics analysis.
We used Gene Set Enrichment Analysis (GSEA) and Gene Set Variation Analysis (GSVA) to search for some genes and perform further analysis. The priority of GSEA does not require a distinct differential gene threshold. The algorithm is also special such that several genes can be identified whose expression is based on the entire trend of effective data and overall level, even without any prior experience. The priority of GSVA were performed to explore the biological function and functional annotation of the differences. We conducted various of bioinformatics and survival analyses, screened for one pathway associated with cell cycle,and built a prognostic risk model to predict the prognosis of LUSC patients. The results of analysis could bring new insight into the molecular mechanisms based on cell cycle in LUSC.
In our study, we have drawn hallmark gene sets from lung squamous cell carcinoma cases with completed mRNA expression data set from the TCGA database. We have confirmed the key mRNAs related to cell cycle and have built a four-gene risks signature that can accurately predict the patients' prognosis results. Surprisingly, in fifteen pathways, the cell cycle-related risk signature can predict patients who are in the high-risk group with a poor prognosis.

Patient Clinical Data and mRNA Expression Dataset
The mRNA expression profiles and clinical data concerning lung squamous cell carcinoma patients were extracted from the TCGA database (https://cancergenome.nih.gov/) and were divided into two groups, one with lung squamous cell carcinoma tissue and one with adjacent noncancerous tissue.
Additionally, the following clinical information was recorded: gender, age, size of tumor, pathological stage, new tumor events after the initial treatment,status of distant organ metastasis, neoplasm cancer status, and residual tumor. Finally, patients were classified. The general clinical features are detailed in Table 1.

Gene Set Enrichment Analysis and Gene Set Variation Analysis
GSEA (http://www.broadinstitute.org/gsea/index.jsp) was performed to explore whether the identified gene sets showed significant differences between the groups [15]. The expression levels of mRNAs in lung squamous cell carcinoma samples and adjacent noncancerous tissue were analyzed.
Normalized p values (p < 0.05) were used to determine which functions can be used for further investigation. Gene Set Variation Analysis (GSVA), which was a GSE method that evaluates variation of pathway activity over a specimen population in an unsupervised method. We showed the robustness of GSVA in a comparison with present state of the art sample-wise enrichment technique.

Statistical Analysis
The expression profiles of 24,991 mRNAs were shown as raw data, and each mRNA was normalized by log2 for further analysis. We used univariate Cox regression to analyze and identify genes with obvious relationships to overall survival with p values <0.05. Next, we used multivariate Cox proportional hazards regression to analyze and further confirm the prognostic genes from the previous steps. The filtered mRNAs were classified into risk (hazard ratio (HR) > 1) and protective (0 < HR < 1) types. Thereafter, a prognostic risky score formula was established based on a linear combination of the expression levels weighted with the regression coefficients derived from the multivariate Cox regression analysis. Risky score = expression of gene 1 × β1 + expression of gene 2 × β2 + ⋯ + expression of gene n × βn. We separated patients into high-risk and low-risk subgroups by the median value using the median risky score as the cutoff. Next, we used Kaplan-Meier curves and log-rank methods to validate the prognostic importance of the risky score. Subsequently, we examined the differential expression of optimal genes in lung squamous cell carcinoma samples and adjacent noncancerous tissue. We classified them into high-risk and low-risk teams by the median risk score and used the KM method (multiplication of the positive limit) to predict the accuracy of the survival state and survival time. The survival function was constructed by the KM method, and the ROC curve was drawn. The chosen gene alterations were shown in specific cancer types online (http://www.cbioportal.org/). All the statistical analyses were performed using SPSS 16.0 and Graph Pad Prism8 software.

Initial Screening of Genes using GSEA and GSVA
We obtained the clinical features from patients with lung squamous cell carcinoma, along with an expression data set for 56392 mRNAs from the TCGA database. The KEGG gene sets had expressed signatures derived by concentrating multiple gene sets from the Molecular Signatures Database (MSigDB) to represent well-defined biological statuses or courses. GSEA was performed using the above detailed data to detect whether the identified gene sets showed statistically important differences between normal tissue and LUSC tissues patients. GSVA estimates variation of gene set enrichment over the samples independently of any class label.therefore,we got fifteen pathway( Fig.1)

Identification of Cell cycle-related mRNAs Associated with the Survival of Patients
First, we employed univariate Cox regression analysis of the 125 genes for preliminary screening and obtained 20 genes with p values <0.1. Additionally, multivariate Cox regression analysis was used to further examine the relationship between the expression profiles of 20 mRNAs and the patient survival rate. Subsequently, 4 mRNAs (CDKN1A, CHEK2, E2F4 and RAD21) were verified as independent poor prognostic indicators. The filtered mRNAs were classified into a risky type (CDKN1A, E2F4 and RAD21), whose HR was >1 with shorter survival, and a protective type (CHEK2), whose HR was <1 with longer survival ( Table 2).We made Pearson correlation coefficient among the 4 mRNAs on the basis of Table2 ,and we found the correlation between E2F4 and CHEK2, between E2F4 and RAD21, between CHEK2 and RAD21,and the value of R is greater than 0.3 (Fig.2).

Construction of a four-mRNA Signature to Predict Patients' Result
A prognostic risky score formula was established based on a linear combination of the expression levels weighted with the regression coefficients derived from multivariate Cox regression analysis.
The risky score=0.2551*expression of HMMR+0.2160*expression of B4GALT1-0.1570*expression of SLC16A3+0.1238*expression of ANGPTL4+0.2381*expression of EXT1+0.1027*expression of GPC1+0.1820*expression of RBCK1+0.1874*expression of SOD1+0.2226*expression of AGRN. Each patient of LUSC had only one risky score. We calculated the scores and ranked them, and then classified the patients into high-and low-risk groups by the median value (Fig.3A). The survival length of time (in days) of each patient is shown in Fig.3B, and the patients with high-risk scores showed higher mortality rates than those with low-risk scores. Additionally, a heatmap (Fig.3C) was revealed to display the expression profiles of the four mRNAs.Then,we compared the risk score to the prognosis of the 4-mRNA ,and it was showed that the different expression of four gene between cancer tissue and adjacent normal tissue in Fig.4A,and the different expression of four gene in each stage was displayed in Figure 4B,and the graph of survival curve of between risk score and four gene were showed in Figure 4C,therefore,we can see that the value of P of risk score is dominant. With the increasing risk score of patients with LUSC, the expression of high-risk types of mRNAs (CDKN1A, E2F4 and RAD21) was obviously upregulated. By contrast, the expression of the protective type of mRNAs (CHEK2) was downregulated.

Generation of Risk Score from the Four mRNA Signatures as an Indicator of Prognosis
The prognostic values of the risk scores were compared with the clinicopathological information by univariate and multivariate analyses. Samples with completed clinical data were used for analysis.The patients with lung squamous cell carcinoma. Among 154 patients,15(9.7%) received radiation therapy. Additionally, we found that the risk score, new event,tobacco smoking history,and neoplasm cancer status were independent prognostic indicators because they showed important differences in univariate analysis with p values <0.05 (Table 3). In the subsequent multivariate analysis (Table 3), we found that the risk score,new event,neoplasm cancer status and tobacco smoking history showed statistical significance in univariate and multivariate analyses (P<0.05). Whether univariate or multivariate analysis, the risky score had prominent prognostic values, with p values < 0.05 (HR = 1,566, 95% CI (confidence interval) = 1.073-2.288). Additionally, the most obvious clinical parameter to predict patient survival was "neoplasm cancer status", and patients with tumors are probably 4.871 times more likely to be exposed to death than those who were tumor free.From the value of P,we can draw that the risk score is more dominant than TNM classification. And the 4-mRNA expression-based survival risk score was used to assign patients into a low-risk or high-risk group using the median risk score as the cut-off. The ROC curve analysis score was 0.661 (Fig.5A), indicating good sensitivity and specificity of the 4-mRNA signature in predicting survival in LUSC. We also made ROC curve of important clinical parameter (Fig.5B-H)and found ROC curve of risky score was obviously higher than ROC curve of other clinical parameter,and superior to clinical parameter for prognosis indicator .

Validation of Four mRNA Markers for Survival Prediction by Kaplan-Meier Curve Analysis
Kaplan-Meier curves and the log-rank method showed a poor prognosis in patients with high-risk scores (p < 0.0010) (Fig.6A). Univariate Cox regression analysis of the overall survival showed that several clinicopathological data were effective in predicting the survival rate of lung squamous cell carcinoma, including Age,Sex,T classification, N classification ,M classification, new event, neoplasm cancer status, tobacco smoking history,radiation therapy and residual tumor. The K-M method was then adopted to confirm the above results. According to the curve, patients with age older than sixtyeight,with tumor after treatment,T classification greater than T1, distant organ metastasis, stage greater than stage I,a residual tumor,tobacco smoking history, or a positive tumor finding during the follow-up visit were correlated with poor prognosis (Fig. 6B,6C,6E,6G,6K,6M,6N,6O). These results provided further confirmation of the accuracy of our analysis. Hence, further stratified analysis was performed for data mining.And the risk score is superior to other clinical indicator from results.

Validation of differential gene between high risk team and low risk team
We classified the patiens of LUSC into two groups followed by risk score,and enriched them into pathway of high risk team and low risk team by GSEA and GSVA. And got differential gene,then put them into mass survival and got related genes. We made Pearson correlation coefficient between related genes and four genes ,and found the correlation between CDKN1AandKLK5, between CDKN1A and KLK7, and the value of R is greater than 0.3( Figure 7B).

Discussion
In recent years, some researchers have verified miRNAs could play an important role in tumorigenesis and progression of lung cancer. Accumulating evidence reveals the potential of miRNAs as biomarkers for evaluating and predicting the prognosis for lung cancer [16],indicating their considerable clinical significance in research [17]. For example, Xu et al reported that MicroRNA-106b serves as a prognostic biomarker and is associated with cell proliferation, migration, and invasion in osteosarcoma [18].Li etal drawed that MiR-421 is Overexpressed and Promotes Cell Proliferation in Non-Small Cell Lung Cancer [19]. However, patient survival could not be predicted by these genes because various factors can modulate a single gene, resulting in an inaccurate predictive effect. Thus, a gene signature comprising various genes has been built from the statistical model to predict cancer outcomes. The results highlight that the predictive effects of each gene involved can offer a more accurate prediction than a single biomarker [20].
With the development of gene signatures, the prognosis of some types of cancer has been predicted by statistical models. MiR-98-5p, miR-152-3p, miR-326 and miR-4289 Expression could serve as Biomarker for Prostate Cancer Diagnosis [21]. We also performed GSEA analysis using the mRNA expression data of the 503 LUSC patients and found that 4 exhibit significant differences, with a P value <0.05, and the minimum P value could be used for further analysis. We explored specific functions to identify genes by GSEA that could predict the survival of LUAD patients. Furthermore, we identified a combination of 4 genes with prognostic value for LUAD patients instead of a single gene by univariate Cox regression and multivariate Cox regression analyses. Subsequently, through comparison with some known prognostic biomarkers, we found that our identified risk signature may strongly support clinical results. We analyzed cell-related genes using the LUSC data set in TCGA and then compared one with lung squamous cell carcinoma tissue to one with adjacent noncancerous tissue data of LUAD patients. Kaplan-Meier analysis showed that a high-risk score was associated with metastasis and a poor prognosis. Among our 4 genes,Zhou etal showed that the expression of CDKN1A-interacting zinc finger protein 1 may contribute to the growth and angiogenesis of LSCC and may be a novel biomarker for LUSC [22]. Theasha Manicum etal demonstrated that E2F4 were significantly associated with unfavourable OS in all gastric cancer patients [23]. RAD21 were also increased in bladder cancer tissues and cell lines ,and it could be used as potential therapeutic targets in bladder cancer [24]. Conventional prognostic systems generally make inaccurate predictions for risk stratification and estimations of clinical outcomes because of the heterogeneity between patients. To the best of our knowledge, compared with a single common biomarker, the 4-mRNA signature can better predict the prognosis of LUSC.
The 4 genes were found to be correlated with cell cycle. CDKN1A is a protein ,which played multiple roles not only in the DNA damage response, but also in many cellular processes during cell growth.The function of CDKN1A is to arrest cell cycle progression by inhibiting the movement of cyclin-dependent kinases [25].And it is found to participate in other significant processes, such as DNA replication/repair,cell death,gene transcription, and cell motility [26]. CDKN1A can inhibit the CDK activity, by interacting with their N-terminal domain or by interfering with the phosphorylation of CDK1 and CDK2 [27]. Moreover, it contributes to the G1 arrest, by inhibiting the cyclin E and cyclin A/CDK2 activity [28].CHEK2 is Serine/threonine-protein kinase which is required for checkpointmediated cell cycle arrest, activation of DNA repair and apoptosis in response to the presence of DNA double-strand breaks. CHEK2 then reacts with downstream phosphatase CDC25, serine/threonine protein kinase NEK6, transcription factor FOXM1, p53 protein and BRCA1 or BRCA2 [29]. CHEK2 May also phosphorylate NEK6 which is involved in G2/M cell cycle arrest. And CHEK2 regulates apoptosis through the phosphorylation of p53/TP53, MDM4 and PML.E2F4 is transcription activator that binds DNA cooperatively with DP proteins through the E2 recognition site, 5'-TTTC[CG]CGC-3' found in the promoter region of a number of genes whose products are involved in cell cycle regulation or in DNA replication.RAD21 is involved in sister chromatid cohesion from the time of DNA replication in S phase to their segregation in mitosis, a function that is essential for proper chromosome segregation, postreplicative DNA repair, and the prevention of inappropriate recombination between repetitive regions.It acts as a target gene for multiple genes or as a regulatory gene for other genes to participate extensively in the physiological and pathological processes of the human body [30][31][32]. RAD21 is reported that expressed highly in breast cancer squamous cell carcinoma of the oral cavity and hepatocellular carcinoma tissues and promotes the proliferation and metastasis [33][34][35].By Pearson correlation analysis,we also find related gene KLK5 and KLK7 ,which is correlate to CDKN1A,expressed in lung cancer [36].Therefore,we can see the relation between the 4 genes and cell cycle, and the cell cycle arrest and apoptosis induced the progression of lung cancer.
In conclusion, this work is the first to report a four-gene risk signature related to cell cycle that can help predict survival and prognosis in LUSC patients. A higher risk score indicates a poorer prognosis. This finding will help future researchers in their efforts to identify new treatments for LUSC and to provide more gene targets to cure LUSC in patients.

Conclusion
We used a four-gene signature (CDKN1A, CHEK2, E2F4 and RAD21) to predict and evaluate LUSC via tissue or blood samples and examined whether mutations in these genes can promote the development of LUSC.Furthermore, we identified treatments related to cell circle to successfully target these genes; this signature could also be used to develop new targeted treatments to cure