3.1 Analysis of CRGs Expression and Overall Mutation Level of THCA Patients
First, we extracted 14 CRGs from TCGA-THCA dataset, and 11 CRGs (SLC25A3, DLD, LIAS, ATP7B, PDHA1, SLC31A1, FDX1, DBT, GCSH, LIPT1 and DLAT) were differentially expressed (Figure 1A). Among them, SLC25A3 was up-regulated in THCA group, while DLD, LIAS, ATP7B, PDHA1, SLC31A3, FDX1, DBT, GCSH, LIPT1 and DLAT were down-regulated in THCA group. We then conducted principal component analysis using CRDEGs, the results showed that tumor samples and normal samples were clearly distinguished (Figure 1B), indicating the importance of CRGs in THCA.
Mutation data were downloaded from the TCGA database to assess the influence of genetic mutations on THCA patients. The results showed that missense accounted for majority of mutations, and the highest frequency of variants in THCA patients was single nucleotide polymorphism with C > T accounting for the vast majority. The top 10 genes in THCA patients sorted by mutation rate was BRAF (59%), FRG1B (59%), NRAS (8%), NBPF10 (6%), NBPF1 (6%), UBBP4 (5%), KRTAP4 −11 (5%), CHEK2 (5%), EP400 (5%) and POTEC (4%) (Figure 1C).
Subsequently, we analyzed the mutation of top 20 genes (BRAF, FGR1B, NRAS, NBPF10, NBPF1, UBBP, KRTAP4-11, CHEK2, EP400, POTEO, TG, TTN, HRAS, FAM47C, MACF1, ZNF814, MT-ND5, MUC4, ZFHX3 and RPTN) with the highest mutation frequency in THCA patients, the results showed that missense mutation accounted for the main part of mutations (Figure 2A). At the same time, we analyzed the mutations and copy number variations (CNVs) of the above 11 CRDEGs among THCA patients, and missense mutations also accounted for the vast majority of these mutations. In addition, ATP7B exhibited the highest mutation frequency was ATP7B, and the main type of CNVs was censoring (Figure 2B).
3.2 Construction of THCA Prognostic Model
Pearson’s correlation analysis showed that the expressions of most CRDEGs had positive correlation with other CRDEGs, among which the highest correlation (r = 0.75) was between DLD and DLAT (Figure 3A), indicating the similar expression patterns among CRDEGs in THCA patients.
In order to identify whether prognostic genes were of diagnostic value, a RF model was constructed based on 11 CRDEGs (SLC25A3, DLD, LIAS, ATP7B, PDHA1, SLC31A1, FDX1, DBT, GCSH, LIPT1 and DLAT). We found that the accuracy of model was highest when the included variable was 5 (Figure 3B), and 5 genes (LIAS, LIPT1, DLD, FDX1, PDHA1) in model with the highest accuracy was selected as potential prognostic genes for THCA. Based on the above 5 genes in Cox regression models. The samples were grouped into high-risk group (44 samples) and low-risk group (225 samples) according to risk score and cut-off identified by the “maxstat” package.
As depicted in Figure 3C, LASSO model was also constructed. The characteristic parameters decreased and the absolute value of the coefficient increased with increasing λ. After simulating and selecting the number of features, we got two models (the best model and the simplest model) (Figure 3D). We chose the simplest model to construct the THCA prognostic model and identified 6 genes (SLC25A3, LIAS, ATP7B, PDHA1, SLC31A1 and GCSH) as potential prognostic genes of THCA. We next calculated the risk scores of all samples in accordance with the Cox risk regression coefficient of the above 6 genes. All samples are grouped into high-risk group (31 samples) and low-risk group (268 samples), the basis of grouping was consistent with RF model.
We next performed survival analysis in RF model and LASSO model, respectively, according to the above groupings. The results showed that no matter which model (RF model or LASSO model) was chosen, the survival rate of high-risk group was lower in both Training dataset and Testing dataset (Figure 3E, 3F, 3G and 3H). Meanwhile, the univariate COX regression results showed that the high-risk group exhibited higher degree of tumor progression regardless of model was chosen (Figure 4A and 4B). Finally, we compared the expressions of 5 prognostic genes (LIAS, LIPT1, DLD, FDX1 and PDHA1) of RF prognostic model and 6 prognostic genes (SLC25A3, LIAS, ATP7B, PDHA1, SLC31A1 and GCSH) of LASSO prognostic model between high-risk group and low-risk group in the Training dataset. The results represented as heat maps (Figure 3I and 3J) showed that the above prognostic gene differentially expressed between two groups in both two prognostic models.
3.3 GSVA Based on Risk Grouping
GSVA showed there were four differentially expressed hallmark gene sets (“APICAL_JUNCTION”, “APOPTOSIS”, “ANGIOGENESIS” and “COAGULATION” ) between two groups in RF model and LASSO model (Figure 5A and 5B). Among them, “APOPTOSIS” and “ANGIOGENESIS” have been proven to be closely associated with various cancers (35), indicating that the above two prognostic models might be highly accurate in the diagnosis and treatment of THCA.
3.4 Risk Score Grouping Gene Differential Expression Analysis & Functional Enrichment Analysis
We found 1749 DEGs (440 upregulated and 1311 downregulated) in RF prognostic model (Figure 6A) and 754 DEGs (289 upregulated and 465 downregulated) in LASSO prognostic model (Figure 6B) by using R package “limma”. The DEGs in RF prognositic model was shown in Table 2. Then 608 DEGs shown as Figure 6C and Table 3 were obtained by taking the intersection of DEGs in RF model and LASSO model.
As depicted in Figure 6D, 6E and 6F, GO enrichment analysis based on the above 608 DEGs indicated that DEGs were associated with biological processes (BP) (hormone transport, hormone secretion, signal release, etc.), cellular components (CC) (ion channel complex, cation channel complex, transmembrane transporter complex, etc.), and molecular functions (MF) (passive transmembrane transporter activity, receptor ligand activity, signaling receptor activator activity, etc.). The detailed results were shown in Table 4. KEGG functional enrichment analysis represented in Figure 6G and Table 5 showed that the DEGs had a main impact on pathways including cytokine−cytokine receptor interaction, viral myocarditis, cell adhesion molecules, arachidonic acid metabolism, Th17 cell differentiation, rheumatoid arthritis and so on.
3.5 CPs-based survival analysis and GSEA
9 prognostic genes (SLC25A3, LIAS, ATP7B, PDHA1, SLC31A1, GCSH, LIPT1, DLD and FDX1) of THCA was obtained by taking intersection of prognostic genes between RF prognostic model and LASSO prognostic model. Based on the above 9 prognostic genes, we calculated the CPs of all THCA patients using ssGSEA algorithm. THCA patients then was divided into two groups according to the optimal cut-off value of prognostic analysis based on CPs. As shown in Figure 7A, the results of survival analysis showed that the THCA patients with low-CPs had a relatively good prognosis (log-rank test P = 0.022). GSEA based on the fold differences of genes were shown in Table 6. The top 8 pathways with the smallest P-value were depicted in Figure 7B, among which glycerophospholipid metabolism (P = 0.0216) (Figure 7C), carbon metabolism (P = 0.0195) (Figure 7E), insulin signaling pathway (P = 0.0251) (Figure 7H), mineral absorption (P = 0.0251) (Figure 7I) and metabolic pathways (P = 0.0314) (Figure 7J) were significantly upregulated, and cell adhesion molecules (P = 0.0398) (Figure 7D), Th17 cell differentiation (P = 0.0375) (Figure 7F) and transcriptional mis-regulation in cancer (P = 0.0437) (Figure 7G) were significantly downregulated.
3.6 Transcription levels of 5 hub genes in THCA cells
To further identified our conclusions, the mRNA levels of the 5 hub genes (FDX1, LIAS, PDHA1, LIPT1 and DLD) were detected in Nthy-ori 3-1 cells, FTC-133 cells and 8505C cells by qRT-PCR using GAPDH as internal control. As depicted in Figure 8, the mRNA levels of FDX1, LIAS, PDHA1, LIPT1 and DLD were significantly downregulated in FTC-133 cells and 8505C cells. These results were consistent with the above prognostic models.
3.7 Immune Infiltration Analysis
As shown in Figure 9A, a panorama of 21 immune cells infiltration in all samples was constructed via CIBERSORT by removing immune cells when the sampling fractions of which were 0. We then explored the immune cells with significant differences according to the fractions of infiltration. As depicted in Figure 9B, the infiltrations of CD8+ T cells (P = 9.9e-05), Tregs (P = 0.01137), M0 Macrophages (P = 0.01134), M2 Macrophages (P = 0.00026 ), etc. exhibited significant differences in both high-CPs group and low-CPs group. We next separately analyzed and visualized the correlations among 21 immune cells. As showed in Figure 9C and 9D. In the high-CPs group, NK cells resting had significant positive correlation with Neutrophils (r = 0.36, P = 1.7e-06), and the negative correlation between CD8+ T cells and Resting memory CD4+ T cells was highest (r = -0.47, P = 6e-11). In the low-CPs group, Follicular helper T cells had significant positive correlation with Macrophages M1 (r = 0.46, P < 2e-16), and the negative correlation between Follicular helper T cells and Mast cells resting was highest (r = -0.45, P < 2e-16).
3.8 CPs-based Immunotherapy & Prognostic Model Construction
We further analyzed the characteristics of immunotherapy in two groups. The results showed that the overall TIDE assessment in high-CPs group was lower (Figure 10A). Among them, the scores of TIDE (P = 0.032) (Figure 10B) were higher in the high-CPs group, indicating the THCA patients with high-CPs might be more likely to response to immune checkpoint inhibitors.
Univariate Cox regression analysis and multivariate Cox regression analysis were conducted according to TNM stage, sex, age and grouping, the results showed that the age exerted significant impact on OS (Figure 10C and 10D). We also created a nomogram (Figure 10E) and a calibration curve (Figure 10F, 10G and 10H) using the R package “rms” package to estimate the rates of 1-, 3- and 5-year survival of THCA patients, and the results show that the nomogram model for 1-year survival was most consistent with the ideal model, and other nomogram models were basically consistent with the ideal model, indicating that our model had relatively high accuracy.