Patient characteristics
We included 33 tumor types in the core sample set, namely adrenocortical carcinoma (ACC), bladder urothelial carcinoma (BLCA), breast invasive carcinoma (BRCA), cervical and endocervical cancers (CESC), cholangiocarcinoma (CHOL), colon adenocarcinoma (COAD), lymphoid neoplasm diffuse large B-cell lymphoma (DLBC), esophageal carcinoma (ESCA), glioblastoma multiforme (GBM), head and neck squamous cell carcinoma (HNSC), kidney chromophobe (KICH), kidney renal clear cell carcinoma (KIRC), kidney renal papillary cell carcinoma (KIRP), acute myeloid leukemia (LAML), brain lower grade glioma (LGG), liver hepatocellular carcinoma (LIHC), lung adenocarcinoma (LUAD), lung squamous cell carcinoma (LUSC), mesothelioma (MESO), ovarian serous cystadenocarcinoma (OV), pancreatic adenocarcinoma (PAAD), pheochromocytoma and paraganglioma (PCPG), prostate adenocarcinoma (PRAD), rectum adenocarcinoma (READ), sarcoma (SARC), skin cutaneous melanoma (SKCM), stomach adenocarcinoma (STAD), testicular germ cell tumors (TGCT), thyroid carcinoma (THCA), thymoma (THYM), uterine corpus endometrial carcinoma (UCEC), uterine carcinosarcoma (UCS), and uveal melanoma (UVM).
The original platform information for each tumor type is shown in Table S1. The sample size, number of deaths, tumor stage distribution and mean age of the core sample set and the mutation sample set for each tumor type are summarized in Table S2.
Prognostic power of diverse clinical data, molecular data and combination data in each cancer type
The significant amplifications or deletions identified by GISTIC and the significantly mutated genes identified by MutSig are shown in Table S3. The features of the miRNA, mRNA, methylation and RPPA data were obtained directly from the level 3 data (Figure S1). After the univariate screen and LASSO for all patients in each cancer type, the significant prognostic factors identified by the Cox regression analysis are shown in Tables S4-S5.
Patients were randomly split into a training group (80%) and a testing group (20%) 100 times to calculate the concordance index (c-index). The mean c-index calculated from each data platform in each cancer type is shown in Fig. 1, with darker color blocks representing higher c-index values. The prediction performance of the clinical variables varied across cancer types, with the c-index ranging from as high as 0.86 for THCA to as low as 0.40 for CESC (Fig. 1). Among all molecular data, mRNA appeared to be the most informative prognostic variable, with the highest mean c-index of 0.58 (Fig. 1)[4].
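The repeated-split evaluation described above can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: the function names (`concordance_index`, `repeated_split_cindex`), the record fields (`time`, `event`) and the `fit` callback are hypothetical, and Harrell's pairwise definition of the c-index is assumed because the text does not state which estimator was used.

```python
import random
from statistics import mean


def concordance_index(times, events, risks):
    """Harrell's c-index (assumed estimator).

    A pair (i, j) is comparable when patient i had an event and a shorter
    follow-up time than patient j. The pair is concordant when patient i
    also has the higher predicted risk; tied risks count as 0.5.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable if comparable else float("nan")


def repeated_split_cindex(patients, fit, n_repeats=100, train_frac=0.8, seed=0):
    """Mean test-set c-index over repeated random train/test splits.

    `patients` is a list of dicts with hypothetical keys "time" and "event".
    `fit(train)` is any user-supplied routine that returns a function
    mapping one patient record to a risk score.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_repeats):
        shuffled = patients[:]
        rng.shuffle(shuffled)
        cut = int(round(train_frac * len(shuffled)))
        train, test = shuffled[:cut], shuffled[cut:]
        predict = fit(train)
        c = concordance_index(
            [p["time"] for p in test],
            [p["event"] for p in test],
            [predict(p) for p in test],
        )
        if c == c:  # skip splits without any comparable pair (NaN)
            scores.append(c)
    return mean(scores) if scores else float("nan")
```

In practice `fit` would wrap the feature selection and Cox model fitting on the training group only, so that the testing group never influences the selected features.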
In total, there were 163 comparisons between the molecular data and combination data, and 163 comparisons between the clinical data and combination data (5 molecular data platforms in each of the 33 cancer types, except that the protein data were missing in LAML and UVM; that is, 5 × 33 − 2 = 163). Additional predictive power was observed in 108 of the 163 comparisons between the combination data and molecular data across all cancer types, with the c-index of the combination data significantly higher than that of the molecular data alone (P < 0.05) (Table S6).
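The comparison count can be checked by enumerating the cancer type and platform pairs. The sketch below is only an illustration: the platform labels and variable names are assumptions, and RPPA is taken to be the protein platform that is missing in LAML and UVM.

```python
# The 33 TCGA cancer type codes listed in the patient characteristics.
CANCER_TYPES = [
    "ACC", "BLCA", "BRCA", "CESC", "CHOL", "COAD", "DLBC", "ESCA", "GBM",
    "HNSC", "KICH", "KIRC", "KIRP", "LAML", "LGG", "LIHC", "LUAD", "LUSC",
    "MESO", "OV", "PAAD", "PCPG", "PRAD", "READ", "SARC", "SKCM", "STAD",
    "TGCT", "THCA", "THYM", "UCEC", "UCS", "UVM",
]

# Labels chosen for illustration; RPPA stands for the protein data.
PLATFORMS = ["CNV", "methylation", "miRNA", "mRNA", "RPPA"]

# Protein data are reported as missing in these two cancer types.
MISSING = {("LAML", "RPPA"), ("UVM", "RPPA")}

comparisons = [
    (cancer, platform)
    for cancer in CANCER_TYPES
    for platform in PLATFORMS
    if (cancer, platform) not in MISSING
]
n_comparisons = len(comparisons)  # 5 * 33 - 2 = 163

# Share of comparisons in which the combination data significantly
# outperformed the molecular data alone (108 of 163, from the text).
share_vs_molecular = 108 / n_comparisons
```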
However, in almost half of the cancer types (14/33), the clinical data remained the most informative index for cancer prediction, with the highest c-index value among the clinical data, molecular data and combination data (Fig. 1). Incorporating molecular data into the clinical model boosted the prediction accuracy in only 27 of the 163 comparisons (P < 0.05) (Table S6).
Deeper insights from top-performing prognostic models
To gain deeper insights, we further demonstrated the practical applicability of the prognostic model in LGG, since the c-index of LGG was the highest among all 33 cancer types (Fig. 1).
As the TNM staging system was not applicable in TCGA LGG, we included age and the Karnofsky Performance Status Scale (KPS) as the clinical variables. We divided the patients by KPS score into high risk (score = 20, 30, 40), high-intermediate risk (score = 50, 60), low-intermediate risk (score = 70, 80), and low risk (score = 90, 100). Using the clinical data alone, survival outcomes differed significantly among these risk groups (log-rank P < 0.01, c-index = 0.75, Fig. 2A). When the clinical data were combined with molecular data, survival outcomes also differed significantly among the subgroups defined by the integrative models involving CNV (Fig. 2B, log-rank P < 0.01, c-index = 0.80), methylation (Fig. 2C, log-rank P < 0.01, c-index = 0.90), miRNA (Fig. 2D, log-rank P < 0.01, c-index = 0.85), mRNA (Fig. 2E, log-rank P < 0.01, c-index = 0.72) and RPPA (Fig. 2F, log-rank P < 0.01, c-index = 0.92). The detailed models are described in Table S7. In addition, the c-index values of the different classification methods showed that three integrative models (methylation, miRNA and RPPA) outperformed the clinical KPS score model in prognostic power (P < 0.05) (Fig. 2).
Pan-cancer analysis of adenocarcinoma, squamous cell carcinoma, neuronal tumors and kidney tumors
As the performance of prognostic prediction depends on the sample size, we evaluated the prognostic power of the different models at the pan-cancer level[8]. Recently, Malta et al. sorted cancer types into groups by stemness indices derived from transcriptomic and epigenetic features[5]. Following their findings, we combined BRCA, CHOL, COAD, ESCA (adenocarcinoma), LUAD, OV, PAAD, PRAD, READ, STAD and UCEC into the adenocarcinoma group; CESC, ESCA (squamous carcinoma), HNSC and LUSC into the squamous cell carcinoma group; GBM, LGG, PCPG, SKCM and UVM into the neuronal tumor group; and KICH, KIRC and KIRP into the kidney tumor group. The mean c-index calculated from each data platform is shown in Fig. 3.
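The grouping above can be written as a lookup in which ESCA is the only cancer type that depends on histology. This is a sketch under stated assumptions: the names `PAN_CANCER_GROUPS`, `ESCA_HISTOLOGY_TO_GROUP` and `pan_cancer_group` and the histology labels are hypothetical, and cancer types outside the four groups return `None`.

```python
# Cancer types assigned to each pan-cancer group, excluding ESCA,
# which is split by histology and handled separately below.
PAN_CANCER_GROUPS = {
    "adenocarcinoma": {
        "BRCA", "CHOL", "COAD", "LUAD", "OV",
        "PAAD", "PRAD", "READ", "STAD", "UCEC",
    },
    "squamous cell carcinoma": {"CESC", "HNSC", "LUSC"},
    "neuronal tumors": {"GBM", "LGG", "PCPG", "SKCM", "UVM"},
    "kidney tumors": {"KICH", "KIRC", "KIRP"},
}

# Hypothetical histology labels for ESCA samples.
ESCA_HISTOLOGY_TO_GROUP = {
    "adenocarcinoma": "adenocarcinoma",
    "squamous": "squamous cell carcinoma",
}


def pan_cancer_group(code, histology=None):
    """Return the pan-cancer group for a TCGA code, or None if ungrouped."""
    if code == "ESCA":
        return ESCA_HISTOLOGY_TO_GROUP.get(histology)
    for group, codes in PAN_CANCER_GROUPS.items():
        if code in codes:
            return group
    return None
```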
In adenocarcinoma and neuronal tumors, additional predictive power was observed for all combination data compared with molecular data alone (P < 0.05) or clinical data alone (P < 0.05) (Fig. 3A, C). In squamous cell carcinoma and kidney tumors, incorporating clinical data into the molecular model also boosted the prediction accuracy on all 5 platforms (P < 0.05, Fig. 3B, D). However, compared with the clinical data alone, no additional predictive power was observed for the integrative models involving CNV and methylation in squamous cell carcinoma, or methylation and miRNA in kidney tumors (P > 0.05, Fig. 3B, D).
Survival analysis based on mutation data
In most of the 100 randomizations, few features passed the univariate Cox screen and LASSO, since patients carrying a given gene mutation were far fewer than wild-type patients. We therefore performed only the Cox regression analysis to identify the significant prognostic mutated genes in each cancer type, which are shown in Table S8.
At the pan-cancer level, we explored the association of driver mutations with prognosis[9]. In adenocarcinoma, squamous cell carcinoma, neuronal tumors and kidney tumors, we correlated the top mutated driver genes with high mutation frequencies (> 10%) with survival outcomes. As shown in Fig. 4A-B, the hazard ratio (HR) of mutant TP53 versus wild-type TP53 was 2.05 [95% confidence interval (CI): 1.81–2.31] in adenocarcinoma and 1.44 (95% CI: 1.19–1.74) in squamous cell carcinoma. Compared with the PTEN wild-type population, PTEN mutation carriers had a reduced risk of death in adenocarcinoma (HR: 0.53, 95% CI: 0.42–0.66) (Fig. 4A). However, mutant PTEN conferred an increased risk of death in neuronal tumors (HR: 2.43, 95% CI: 2.02–2.94) (Fig. 4C). The other significant prognostic mutated genes included PIK3CA, KRAS, and ARID1A in adenocarcinoma (Fig. 4A), NOTCH1 in squamous cell carcinoma, IDH1, ATRX, BRAF, APOB, SPTA1, NF1, EGFR and CIC in neuronal tumors (Fig. 4C), and PBRM1 and SETD2 in kidney tumors (Fig. 4D).
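The reported hazard ratios and confidence intervals can be put on the log scale to compare effect sizes across genes. The sketch below assumes Wald-type 95% intervals that are symmetric around log(HR), which the text does not state; the function name `log_hr_summary` and its return keys are hypothetical. Under that assumption the geometric mean of the interval limits should reproduce the reported HR up to rounding, which provides a quick consistency check.

```python
import math


def log_hr_summary(hr, ci_low, ci_high, z_crit=1.959964):
    """Summarize a hazard ratio on the log scale.

    Assumes a Wald-type interval: log(HR) +/- z_crit * SE.
    """
    log_hr = math.log(hr)
    # The interval width on the log scale equals 2 * z_crit * SE.
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = log_hr / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    # Geometric mean of the limits; close to HR if the interval is Wald-type.
    ci_midpoint = math.sqrt(ci_low * ci_high)
    return {
        "log_hr": log_hr,
        "se": se,
        "z": z,
        "p_value": p_value,
        "ci_midpoint": ci_midpoint,
    }


# Values reported in the text.
tp53_adeno = log_hr_summary(2.05, 1.81, 2.31)
pten_adeno = log_hr_summary(0.53, 0.42, 0.66)
pten_neuronal = log_hr_summary(2.43, 2.02, 2.94)
```

A negative `log_hr` (as for PTEN in adenocarcinoma) indicates a lower hazard in mutation carriers, and a positive one (as for PTEN in neuronal tumors) indicates a higher hazard.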