The analysis flow of this study is shown in Fig. 1.
1. Identification of Potentially High-Risk Genes by Transcriptome, RNomics, and Methylome
1.1 Identification of differentially expressed genes, miRNAs, and methylations
RNA-Seq and microRNA-Seq data from the TCGA dataset were used to identify differentially expressed genes (DEGs), differentially expressed microRNA (DEmiRNAs), and differentially expressed long non-coding RNA (DElncRNAs). We obtained 5916 DEGs and 538 DEmiRNAs for LUAD, 7472 DEGs and 622 DEmiRNAs for LUSC (Table 1). DNA methylation data from UCSC Xena were used to identify differentially methylated genes (DMGs) (please see Materials and Methods). For LUAD, a total of 32,293 up-regulated methylation probes were distributed among 4446 DMGs, and 24,205 down-regulated methylation probes were distributed among 4783 DMGs. For LUSC, a total of 38,838 up-regulated probes were distributed among 2877 DMGs, and 63,575 down-regulated probes were distributed among 7754 DMGs. We also realized that the frequencies of up-regulated genes and miRNAs were much higher than those of down-regulated genes. Conversely, in the case of methylation genes, the frequencies of down-regulated DMGs were higher than those of up-regulated DMGs.
Table 1
The numbers of differentially expressed genes, microRNA, and LncRNAs in NSCLC.
Type
|
DEGs
|
DEmiRNAs
|
DElncRNAs
|
Up
|
Down
|
Total
|
Up
|
Down
|
Total
|
Up
|
Down
|
Total
|
LUAD
|
4061
|
1855
|
5916
|
418
|
120
|
538
|
1361
|
387
|
1748
|
LUSC
|
4524
|
2948
|
7472
|
472
|
150
|
622
|
1525
|
561
|
2086
|
1.2 Co-expression analysis of gene expression and DNA methylation
DNA methylation levels are related to gene expression and play an important role in the occurrence and development of cancer (16). By comparing the changes in expression levels and methylation levels between cancer samples and normal samples, we found that 2656 genes and 3680 genes were related to both DEGs and DMGs in LUAD and LUSC, respectively (Supplementary Table S1).
Generally, promoters with higher methylation levels contribute to the silencing of tumor suppressor genes. Conversely, lower methylation levels lead to the activation or up-regulation of proto-oncogenes during carcinogenesis. Therefore, we identified the genes with opposite trends between gene expression levels and DNA methylation levels. Finally, 1493 genes and 2183 genes (Fig. 2A and 2B; the red dots) with opposite expression differences were found. These genes were denoted as Dataset 1.
Then, we compared the functions of co-expression genes. In LUAD, the genes with opposite trends prefer to enrich in immune-response-related processes, cell-killing regulation, aging, digestion, etc. (Fig. 2C). In LUSC, genes with the opposite trends are likely to enrich in response to the drugs, organophosphorus, growth factor binding, gland development, etc. (Fig. 2D).
1.3 Construction of ceRNA regulatory network
The regulatory relationships among DEmiRNAs, DEGs, and DElncRNAs were predicted by the Starbase database. In LUAD, 122 DEmiRNAs regulate 301 DElncRNAs and 3227 DEGs. In LUSC, 113 DEmiRNAs regulate 331 DElncRNAs and 4361 DEGs. For the regulatory relationships, please see Supplementary Table S3. The DEGs involved in regulation were denoted as Dataset 2.
1.4 Prediction of potential high-risk genes
In this study, the overlapping genes of Dataset 1 and Dataset 2 were defined as potential high-risk genes. Finally, a total of 878 and 1378 potential high-risk genes were obtained for LUAD and LUSC, respectively (Fig. 3A and 3B).
GO and KEGG pathway enrichment analysis showed that, in LUAD and LUSC, the potential high-risk genes were both enriched with the collagen-activated signaling pathway, gland development, Pyrin domain binding, collagen receptor activity, and the PPAR signaling pathway (Fig. 3C–F). The risk genes of LUAD were also enriched in meiotic mismatch repair, the apical part of the cell, MHC class Ib protein complex, apical plasma membrane, lipid transporter activity, bile secretion pathways, and cholesterol metabolism pathways (Fig. 3C, 3E). The risk genes of LUSC were related to epithelial tube morphogenesis, nucleolar ribonuclease P complex, desmosome, cytoplasmic vesicle lumen, T cell receptor binding, and basal cell carcinoma pathways, and bile secretion pathways (Fig. 3D, 3F).
2. Prediction of Hub Genes
2.1 Identification of co-expressed genes by weighted gene co-expression network analysis
For potential high-risk genes, weighted gene co-expression network analysis (WGCNA) was used to predict co-expression modules. In LUAD, we eliminated four outlier samples (Fig. 4A) and identified 6 as the optimal soft threshold through the scatter plot (Fig. 4B). The 878 potential high-risk genes were clustered into three modules (Fig. 4C). Figure 4D shows the correlation coefficients between modules and two clinical traits; P-value < 0.05 for all three modules. The MEturquoise module with the highest absolute correlation coefficients (absolute correlation coefficient = 0.81, P-value = 3 × 10–125) was used for subsequent analysis, and the number of co-expressed genes was 462.
In LUSC, one outlier sample was removed (Fig. 4E), the optimal soft threshold was also set as 6 (Fig. 4F), and the 1378 potential high-risk genes were clustered into five modules (Fig. 4G). The MEturquoise module (absolute correlation coefficient = 0.81, P-value = 2 × 10–124) was used for subsequent analysis, and the number of co-expressed genes was 573 (Fig. 4H).
2.2. Identification of hub genes by interaction network analysis
The STRING database was used to obtain the protein–protein interaction relationships of co-expressed genes. In LUAD, the interaction network of 462 co-expressed genes consisted of 462 nodes and 1013 edges. In LUSC, the interaction network of 573 co-expressed genes consisted of 573 nodes and 1520 edges. We used the molecular complex detection (MCODE) plugin in Cytoscape to predict hub genes. Totals of 54 and 77 hub genes were obtained for LUAD and LUSC, respectively (Fig. 5).
3. Identification of Prognostic Genes using the LASSO Regression Model
To identify the disease-characterizing genes, we used the gene expression of hub genes and the LASSO regression method to construct prognostic models of LUAD and LUSC. With cross-validation (nflods = 20), two models were generated: the best model (λ = 0.0287 in LUAD, λ = 0.0482 in LUSC) and the simplest model (λ = 0.0728 in LUAD, λ = 0.0842 in LUSC) (Supplementary Figures S1a–d). In this study, we chose the simplest model to build the prognostic models, and 15 prognostic genes were obtained. The prognostic models were constructed based on the risk regression coefficient (Formulas 1 and 2). Then, the samples were divided into low-risk and high-risk groups based on the optimal cutoff values of each sample.
To test the accuracy of the model, we performed the following analysis on the training and testing sets, according to the above groupings. Firstly, the distribution of risk score and survival time was shown by Kaplan–Meier (KM) survival analysis and plotting the KM curve. The patients with lower risk-scores generally show better survival status and longer survival time (Log-rank test, P-value < 0.05) (Supplementary Figures S1e–h). Secondly, the distribution of risk scores for samples with different survival states (dead and alive) is shown in the boxplots. The risk scores of the patients in the dead state are significantly higher than those of the patients in the alive state (Wilcoxon test, P-value < 0.05) (Supplementary Figures S2a–d). Thirdly, the time-dependent ROC analysis showed that the model accuracies were 0.675 for LUAD training sets, 0.650 for LUSC training sets, 0.672 for LUAD testing sets, and 0.528 for LUSC testing sets (Supplementary Figures S2e–h).
4. Survival Analysis and Univariate Cox Regression Identifying Prognostic Genes Signatures
In this study, the survival analysis (Log-rank test) was used to test the reliability of the 15 prognostic genes as independent prognostic factors. Setting the threshold of the P-value to 0.05, 14 genes were related to the prognosis of LUAD and LUSC patients (Supplementary Figures S3 and S4). The survival rate of the high-expression group was higher in EPCAM, KL, NKD1, NTRK2, ESAM, ZEB2, SLC9A3, and ALDOC. Conversely, the survival rate of the low-expression group was higher in VIM, NES, KLF4, KRT8, C4BPA, and YWHAZ.
Then, a univariate COX analysis was performed on the 14 genes associated with prognosis, and the P-values of seven genes were less than 0.05 (Fig. 6). In LUAD, ZEB2 was identified as a protective factor, with NES, KLF4, and KRT8 as risk factors. In LUSC, ALDOC was identified as a protective factor, with C4BPA and YWHAZ as risk factors.
5. Validation of Prognosis-Related Genes using GEO Datasets
The 14 prognosis-related genes were again validated with GSE50081, GSE29013, and GSE37745 using an independent KM survival analysis; the threshold P-value was 0.05. Finally, we found ten prognosis-related genes in NSCLC (Supplementary Figure S5). Among them, the high-expression group of NKD1, NTRK2, and ZEB2 had a higher survival rate. The survival rates for the other seven genes were higher in the low-expression group.
6. Expression of Prognostic Genes
For prognostic genes, we analyzed the gene expression level in different stages of lung cancer (Table 2) and among different tumor types (Fig. 7).
Table 2
Intron losses and gains in Caenorhabditis
Gene
|
Average expression level1 (sample number)
|
P-value
(ANOVA)
|
P-value (Scheffe test)
|
Normal
|
Stage I
|
Stage II
|
Late Stage
|
N VS I
|
N VS II
|
N VS Late
|
I VS II
|
I VS Late
|
II VS Late
|
VIM
|
7.41 (56)
|
6.24 (268)
|
6.20 (115)
|
6.25 (101)
|
4.13×10–20
|
2.20×10–18
|
3.75×10–16
|
1.64×10–14
|
0.980
|
1.000
|
0.987
|
EPCAM
|
6.02 (56)
|
7.59 (268)
|
7.65 (115)
|
7.51 (101)
|
7.59×10–40
|
4.16×10–37
|
6.28×10–33
|
3.70×10–27
|
0.934
|
0.839
|
0.625
|
NES
|
4.67 (56)
|
3.05 (268)
|
3.28 (115)
|
3.24 (101)
|
1.47×10–26
|
2.01×10–26
|
1.15×10–16
|
1.12×10–16
|
0.205
|
0.379
|
0.996
|
KL
|
2.21 (56)
|
0.67 (268)
|
0.64 (115)
|
0.66 (101)
|
1.77×10–54
|
1.88×10–50
|
2.98×10–44
|
2.13×10–41
|
0.961
|
0.999
|
0.991
|
KLF4
|
5.26 (56)
|
2.98 (268)
|
3.05 (115)
|
3.15 (101)
|
4.16×10–39
|
1.17×10–37
|
6.62×10–30
|
1.52×10–26
|
0.961
|
0.645
|
0.933
|
NKD1
|
1.78 (56)
|
0.72 (268)
|
0.65 (115)
|
0.62 (101)
|
2.57×10–31
|
3.09×10–27
|
3.55×10–25
|
2.72×10–25
|
0.818
|
0.627
|
0.990
|
ERBB2
|
4.11 (56)
|
4.98 (268)
|
4.93 (115)
|
4.81 (101)
|
1.81×10–10
|
3.53×10–10
|
1.39×10− 07
|
1.74×10− 5
|
0.973
|
0.446
|
0.798
|
NTRK2
|
1.11 (56)
|
0.33 (268)
|
0.29 (115)
|
0.35 (101)
|
2.30×10–18
|
9.41×10–17
|
4.09×10–15
|
1.23×10–12
|
0.950
|
0.994
|
0.910
|
ESAM
|
5.87 (56)
|
3.94 (268)
|
3.84 (115)
|
3.83 (101)
|
6.89×10–45
|
4.79×10–40
|
1.11×10–36
|
1.51×10–35
|
0.777
|
0.761
|
1.000
|
ZEB2
|
2.47 (56)
|
1.46 (268)
|
1.38 (115)
|
1.35 (101)
|
2.00×10–38
|
5.01×10–33
|
2.87×10–31
|
1.63×10–31
|
0.631
|
0.388
|
0.982
|
KRT8
|
5.97 (56)
|
7.18 (268)
|
7.33 (115)
|
7.46 (101)
|
7.32×10–24
|
4.62×10–18
|
1.03×10–18
|
4.88×10–21
|
0.443
|
0.053
|
0.780
|
SLC9A3
|
0.09 (48)
|
0.56 (239)
|
0.57 (151)
|
0.54 (84)
|
9.52×10− 6
|
2.77×10− 5
|
4.61×10− 5
|
8.17×10− 4
|
0.999
|
0.994
|
0.985
|
ALDOC
|
2.61 (48)
|
4.03 (239)
|
3.99 (151)
|
3.89 (84)
|
2.84×10–17
|
1.75×10–16
|
1.80×10–14
|
1.28×10–10
|
0.991
|
0.757
|
0.899
|
C4BPA
|
6.95 (48)
|
2.95 (239)
|
1.94 (151)
|
2.74 (84)
|
7.26×10–44
|
4.66×10–32
|
1.94×10–43
|
9.87×10–28
|
1.71×10− 5
|
0.858
|
0.028
|
YWHAZ
|
6.07 (48)
|
7.37 (239)
|
7.32 (151)
|
7.44 (84)
|
6.01×10–52
|
2.83×10–48
|
9.80×10–42
|
3.80×10–42
|
0.782
|
0.741
|
0.340
|
Note: *We detected six putative intron gains in C. tropicalis and 12 in C. sinica. However, none of their annotations were confirmed by the RNA-seq data and thus, they were not counted as novel introns. |
A two-way ANOVA was used to compare the expression levels among normal samples, Stage I, Stage II, and the Late Stage. All prognostic genes showed significant differences in expression levels at different stages (Table 2, ANOVA test, P-value < 0.05 for all cases). The Scheffe test was used for multiple comparison tests. For the prognostic genes, there were significant differences in the expression levels between normal samples and any stage of tumor samples. However, there was almost no difference in expression levels among samples with different tumor stages. Only the expression of the C4BPA gene was significantly different between Stage I and Stage II, and significantly different between Stage II and the Late Stage.
After excluding the tumor types without control samples, 22 tumor types remained for comparing the differential expression of prognostic genes between tumor samples and normal samples. The expressions of different genes in different tumors were not completely consistent (Fig. 7 and Figure S5). For instance, the expressions of the KL gene of PCPG and THCA tumor samples were significantly higher than those of normal samples (the red dots in Fig. 7, Wilcoxon test, P-value < 0.05). Conversely, in LUAD, LUSC, KICH, ESCA, HNSC, KIRP, READ, BLCA, PRAD, CESC, STAD, COAD, KIRC, BRCA and UCEC, the expression levels of the KL gene were significantly lower than those of normal samples (the green dots in Fig. 7, Wilcoxon test, P-value < 0.05). However, there was no statistical difference in the KL gene expression between tumor samples and normal samples in YHYM, CHOL, PAAD, SARC, and LIHC (the gray dots in Fig. 7, Wilcoxon test, P-value > 0.05).
7. New Biomarker Genes: NES and ESAM
In the former analysis, we identified 10 prognostic genes for NSCLC. According to previous research, EPCAM, NKD1, NTRK2, ZEB2, VIM, KRT8, KLF4, and YWHAZ have been used as prognostic markers or have been recognized as potential markers (25–32). Therefore, we found two new potential biomarker genes: NES and ESAM.
The NES gene is a gene expression down-regulation and DNA methylation up-regulation. A total of 28 SNP sites were all distributed in the NES coding region. Among of them, 61% comprised missense mutations (N = 17), 32% comprised silent mutations (N = 9), and 7% comprised nonsense mutations (N = 2). The univariate COX analysis implied that the NES gene was a risk factor in LUAD (Fig. 6A, hazard ratio = 1.23, P-value = 0.013). The survival analyses of both the TCGA and GEO datasets revealed that the low expression of NES was conducive to patient survival (Fig. 8A, Log-rank test, P-value = < 0.05 for all cases). By comparing the differential expression among 22 tumor types, it was found that the NES gene expression of tumor samples was lower than that of normal samples for LUAD, LUSC, KICH, KIRP, PAAD, BLCA, PRAD, KIRC, BRCA, and UCEC, but higher for CHOL, LIHC, and, THCA (Fig. 8C).
The ESAM gene was also a gene expression down-regulation and DNA methylation up-regulation. Only one SNP site was located in the coding region, causing a missense mutation. The hazard ratio value of the ESAM gene was 0.9, which means that ESAM might be one of the protective genes; unfortunately, there was no statistically significant difference (P-value > 0.05, Fig. 6A). The results of the survival analyses based on TCGA and GEO datasets were different (Fig. 8D). The TCGA data displayed the high-expression group with a good survival rate (Log-rank test, P-value = 0.006). The GEO dataset exhibited the low-expression group with a higher survival rate (Log-rank test, P-value = 0.029). Additionally, we found that ESAM expression was lower in the tumor samples of LUAD, LUSC, KICH, KIRP, CESE, THCA, BRCA, and UCEC, and more highly expressed in the tumor samples of CHOL, LIHC, and KIRC (Fig. 8F).