Initial Screening of Genes using GSEA and GSVA
We got the clinical data from patients with LUSC, along with an expression dataset for 56,392 mRNAs from the TCGA database. The Kyoto Encyclopedia of Genes and Genomes (KEGG) gene sets had expressed signatures derived by choosing multiple gene sets from the Molecular Signatures Database (MSigDB) to be on behalf of well-defined biological statuses or courses. GSEA was identified using the above detailed data to detect whether the identified gene sets showed statistically important differences between normal tissue and LUSC tissue. GSVA estimates the variation in gene set enrichment over the samples independently of any class label. Therefore, we obtained fifteen pathways (Fig. 1), including DNA replication, cell cycle, homologous recombination, mismatch repair, proteasome, base excision repair, spliceosome, aminoacyl tRNA biosynthesis, pyrimidine metabolism, nucleotide excision repair, P53 signalling pathway, basal transcription factors, RNA degradation, RNA polymerase and oocyte meiosis, and we selected the cell cycle pathway based on the number of genes and the normalized enrichment score (NES).
Identification of Cell Cycle-related mRNAs Associated with Patient Survival
First, we used univariate Cox regression analysis of the 125 genes for rudimentary screening and got 20 genes with p values < 0.1.In addition, multivariate Cox regression analysis was performed to further examine the correlation between the expression profiles of 20 mRNAs and the patient survival rate.Whereafter, 4 mRNAs (CDKN1A, CHEK2, E2F4 and RAD21) were identified as independent poor prognostic indicators. The filtered mRNAs were divided into a risk type (CDKN1A, E2F4 and RAD21), whose HR was > 1 with shorter survival, and a protective type (CHEK2), whose HR was < 1 with longer survival (Table 2). We calculated the Pearson correlation coefficients of the 4 mRNAs (as shown in Table 2) and found a correlation between E2F4 and CHEK2, between E2F4 and RAD21, and between CHEK2 and RAD21 (all R values were greater than 0.3) (Fig. 2).
Table 2
The information of four prognostic mRNAs importantly associated with overall survival in patients with LUSC
mRNA
|
Ensemble ID
|
Location
|
B(cox)
|
HR(95%CIs)
|
P
|
CDKN1A
|
ENSG00000124762
|
Chr6:36,676,460 − 36,687,339
|
0.14381
|
1.15466
|
0.00125
|
CHEK2
E2F4
RAD21
|
ENSG00000183765
ENSG00000205250
ENSG00000164754
|
Chr22:28,687,743 − 28,742,422
Chr16:67,192,155 − 67,198,918
Chr8:116,845,935 − 116,874,866
|
-0.30838
0.25765
0.23935
|
0.73464
1.29388
1.27042
|
0.000924
0.023732
0.035461
|
Construction of a Four-mRNA Signature to Predict Patient Prognosis
The prognostic risk score formula was fixed based on a linear combination of the expression levels weighted with the regression coefficients derived from the multivariate Cox regression analysis: risk score = 0.14381*expression of CDKN1A -0.30838*expression of CHEK2 + 0.25765*expression of E2F4 + 0.23935*expression of RAD21. Each patient with LUSC had single risk score. We calculated and arranged the scores and then divided the patients into high- and low-risk groups by the median value (Fig. 3A). The survival time (in days) of each patient is shown in Fig. 3B, and the patients with high risk scores showed higher mortality rates than those with low-risk scores (Fig. 3A and Fig. 3B). Additionally, a heatmap (Fig. 3C) was generated to display the expression profiles of the four mRNAs. Then, we compared the risk score to the prognosis of the 4-mRNA group, and the differential expression of the four genes between cancer tissue and adjacent normal tissue is shown in Fig. 4A. The differential expression of the four genes in each stage is displayed in Fig. 4B, and the graph of the survival curve between the risk score and the four genes is shown in Fig. 4C. Therefore, we can see that the P value of the risk score is dominant. With the increasing risk score of LUSC patients, the expression of risk-type mRNAs (CDKN1A, E2F4 and RAD21) was obviously upregulated. By contrast, the expression of the protective-type mRNAs (CHEK2) was downregulated.
Generation of the Risk Score from the Four-mRNA Signature as an Indicator of Prognosis
The prognostic value of the risk score was compared with the clinicopathological information by univariate and multivariate analyses. Samples with complete clinical data were used for analysis. The median age of the 504 patients with LUSC was 68 years and included 373 male patients and 131 female patients. Among 391 patients, 79 (20.2%) developed a tumour during the follow-up visit. Among 439 patients, 41 (9.3%) had residual tumours. Among 503 patients, 184 (36.6%) had lymph node metastasis, and 86 (17.3%) had distant metastases among 497 patients with LUSC. Among 154 patients, 15 (9.7%) received radiation therapy. Additionally, we found that the risk score, new events, tobacco smoking history, and neoplasm cancer status were independent prognostic indicators because they showed significant differences in the univariate analysis, with p values < 0.05 (Table 3). In the subsequent multivariate analysis (Table 3), we found that the risk score, new events, neoplasm cancer status and tobacco smoking history showed statistical significance in the univariate and multivariate analyses (P < 0.05). Regardless of the analysis used (univariate or multivariate), the risk score showed prominent prognostic value, with p values < 0.05 (HR = 1.566, 95% CI (confidence interval) = 1.073–2.288). Additionally, the best clinical parameter to predict patient survival was neoplasm cancer status, and patients with tumours were 4.871 times more likely to experience death than those who were tumour free. Based on the P value, we can conclude that the risk score is more dominant than the TNM classification. The 4-mRNA expression-based survival risk score was applied to divide patients into a low-risk or high-risk group through the median risk score as the cut-off. The ROC curve analysis score was 0.661 (Fig. 5A),showing good sensitivity and specificity of the 4-mRNA signature in predicting survival in LUSC patients. We also generated an ROC curve of important clinical parameters (Fig. 5B-H) and found that the ROC curve of the risk score was obviously higher than and thus superior to the ROC curve of the other clinical parameters as an indicator of prognosis.
Table 3. Univariable and multivariable analyses for each clinical feature
Validation of the Four mRNA Markers for Survival Prediction by KM Curve Analysis
KM curves and log-rank tests showed a poor prognosis in patients with high risk scores (p < 0.0010) (Fig. 6A). The univariate Cox regression analysis of OS showed that several clinicopathological factors, including age, sex, T classification, N classification, M classification, new events, neoplasm cancer status, tobacco smoking history, radiation therapy and residual tumours, were valid in predicting the survival rate of LUSC patients. The KM method was then adopted to identify the above consequences. According to the curve, patients aged > sixty-eight, the presence of a tumour after treatment, a T classification > T1, distant organ metastasis, a tumour stage > stage I, a residual tumour, tobacco smoking history, or a positive tumour finding during the follow-up visit were correlated with a poor prognosis (Fig. 6B, 6C, 6E, 6G, 6K, 6M, 6N, 6O). These consequences provide further acknowledgement of the correctness of our analysis. Therefore, a further stratified analysis was achieved for data mining. The risk score is superior to other clinical indicators based on the results.
Validation of the Differentially Expressed Genes between the High-risk and Low-risk Groups
We classified LUSC patients into two groups by the risk score and determined the enriched pathways in the high-risk and low-risk groups by GSEA and GSVA. The differentially expressed genes were then subjected to mass survival analyses, and related genes were obtained. We calculated the Pearson correlation coefficients between related genes and the four genes and found a correlation between CDKN1A and KLK5 and between CDKN1A and KLK7, with R values greater than 0.3 (Fig. 7B).