Classification of genes in cancers and normal tissues
The RNA-seq data and corresponding clinical information for 6,918 cancer patients diagnosed with 21 distinct human cancer types, as catalogued in The Cancer Genome Atlas (TCGA) (Table S1), were downloaded. This dataset was uniformly processed through a consistent bioinformatics pipeline. Expression levels were subsequently normalized to TPM in order to enable comparative analysis across samples. We performed PCA to delineate the gene expression patterns among 21 different cancers (Fig. 1B). While a significant proportion of the cancers were closely aggregated, LIHC demonstrated pronounced heterogeneity in comparison to the other cancer types.
In this study, we adopted a comparable approach to categorize 19,652 protein-coding genes into five distinct categories based on their expression levels across various cancer types (Figure S1A) as previously described (18). Our analysis showed that a substantial portion (53.6%) of protein-coding genes were expressed in all cancers analysed, while an additional 12.1% of genes were not detected in any of the cancer types examined. The commonly expressed protein-coding genes were found to be enriched in typical cancer-related processes such as mRNA processing and cell cycle-related biological functions (Figure S1B). This enrichment aligns with the rapid cellular proliferation that occurs during tumorigenesis.
Our analysis extended to the prevalence of upregulated genes across all cancer types, encompassing categories of cancer-enriched, group-enriched, and cancer-enhanced genes (Fig. 2A). Remarkably, glioblastoma multiforme (GBM), testicular germ cell tumours (TGCT), and liver hepatocellular carcinoma (LIHC) exhibited the highest number of upregulated genes. This observation may be partly explained by the intrinsic heterogeneity of the brain, testis, and liver tissues, indicating that the elevated gene expression could be inherently connected to the properties of the tissues from which these cancers originate.
We downloaded the TPM expression profiles of genes in normal tissues from the Human Protein Atlas (18). Consequently, we sourced TPM profiles for 19 tissues corresponding to 17 distinct cancer types to delineate the gene expression patterns (Table S2). Furthermore, we organized the 19,564 protein-coding genes into five categories across these tissue types (Figures S1C-D). In contrast to cancer states, a smaller fraction of genes (43.2%) was expressed across all normal tissue types, and a lower number of genes (7.7%) remain undetected in normal tissues. This pattern suggests a shift in gene expression from normal to cancerous tissues. To delve deeper into this phenomenon, we conducted a comparative analysis of gene specificity categories between normal and cancerous tissues (Fig. 2B).
We observed that the majority of genes with low tissue specificity maintained this characteristic during the transition from normal to tumour conditions. These genes are predominantly involved in essential cellular biological processes such as ribosome biogenesis and mitochondrial gene expression (Figure S1E). Additionally, genes that were categorized as having elevated expression in normal tissues exhibited a shift to various specificity categories in the context of cancer, reflecting the heterogeneity of gene expression across different cancer types. We particularly focused on genes that were not detected in normal tissues but showed elevated expression in cancerous conditions (Fig. 2C), as these genes may contribute to the progression of tumorigenesis. Among these, we identified 263 genes (Table S3) predominantly involved in nucleosome assembly or DNA packaging processes (Fig. 2D), aligning with the rapid cellular proliferation typical of tumour progression.
The identification of prognostic genes for cancers
A Kaplan-Meier (KM) analysis was employed to assess the relationship between the patient’s tumour transcriptomic profiles and clinical survival outcomes, from the recruitment in the study to the occurrence of death. As described in the Methods section and our previous research (19), patients were stratified into groups based on the high or low expression levels of the genes. Genes were labelled as 'favourable' and 'unfavourable' if high expression correlated with better or poor survival outcomes, respectively. We analysed the number of prognostic genes (PGs) for each cancer type (Fig. 3A) and observed that KIRC and LIHC had the highest numbers of PGs. In KIRC, the majority of PGs (87.3%) were categorized as favourable genes, whereas the majority of PGs (92.3%) were categorized as unfavourable genes in LIHC.
The prognostic significance of the genes varied across cancer types, with some demonstrating consistent prognostic values. For example, CD6, a crucial gene for T-cell activation, is identified as a favourable prognostic marker in multiple cancers, such as breast invasive carcinoma (BRCA), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), head and neck squamous cell carcinoma (HNSC), and skin cutaneous melanoma (SKCM), as shown in Fig. 3B. Similarly, PSMB1, the non-catalytic component of the proteasome complex, was implicated in poor survival outcomes across several cancer types (Fig. 3C), namely bladder urothelial carcinoma (BLCA), BRCA, HNSC and lung adenocarcinoma (LUAD). Our findings suggest a potential underlying commonality in the regulatory mechanisms across these cancers.
Conversely, certain genes exhibited distinctly different prognostic significance depending on the cancer type. As shown in Fig. 3D, interferon-induced anti-viral exoribonuclease (ISG20), acting on single stranded RNA and involved in immune and inflammatory responses, correlated with improved survival in ovarian serous cystadenocarcinoma (OV) and SKCM. However, its high expression is indicative of poorer survival in GBM and KIRC. These findings align with previously published research about these cancers(20–22).
Validation of prognostic genes in different cancer cohorts
In the previous version of the Human Pathology Atlas (11), we reported significant variability in the number of prognostic genes (PGs) across different cancer types. To reduce dependence on a single dataset, we compiled 10 follow-up datasets (FDs) from various sources, each corresponding to one of the cancer types included in the leading datasets (LDs), specifically the TCGA cohorts (refer to Fig. 4A and Table S4). These FDs were re-annotated using the same bioinformatics pipeline and reference genomes, and a consistent approach was applied to filter their clinical records.
We analysed the connectivity among LDs and FDs for the 10 cancer types using PCA based on the expression patterns of all protein-coding genes (Figure S2A). Notably, the LIHC-FD exhibited distinct expression patterns that aligned closely with those of the LD cohort. The centroid plot in the upper right corner further demonstrated that LD and FD share similar variances in PCA. Additionally, a clearer clustering trend was observed in the dendrogram plot (Figure S2B), showing that dataset pairs of the same cancer type generally clustered more closely. However, a significant divergence was noted in the breast cancer (BRCA) FD, which could be attributed to its specific age demographic—comprising solely young individuals aged 25 to 35 years—differing from the broader age range (25 to 90 years old) in the LD.
The cluster heatmap supports our observations (Fig. 4B), aligning with the findings discussed above. Although the Spearman correlation coefficients for all LDs and FDs were generally above 0.6, indicating a robust association, expression profiles were most conserved within the same cancer type—reflecting underlying biological consistencies. Notably, GBM emerged as the most distinct cancer type, with the Spearman correlation between GBM datasets being the lowest among all dataset pairs of the same cancer type. In contrast, KIRC and LIHC displayed the highest Spearman correlation between their respective LD and FD. Consistent with the PCA results, LIHC also showed a considerably low Spearman correlation with datasets of other cancer types. Aside from these three cancers, other cancer types clustered together, suggesting greater homogeneity in gene expression patterns. Within this larger cluster, COAD and rectum adenocarcinoma (READ), both originating from the same organ and often collectively referred to as colorectal cancer, showed higher intragroup similarity.
To assess the robustness of the PGs, we performed KM analysis in the 10 FDs using the same pipeline. A gene was considered a confidence gene if it consistently demonstrated prognostic value across both the LD and FD of the same cancer type. Consequently, we were able to identify shared confidence prognostic genes (CPG) for the two independent data sets (Fig. 4C). KIRC and LIHC exhibited the highest number of shared prognostic genes, while cancers such as COAD and lung squamous cell carcinoma (LUSC) were found to have fewer prognostic genes identified by the two independent cohorts.
We further assessed the repeatability of PG identification by calculating the Spearman correlation for KM coefficients of genes between the LDs and FDs. As shown in Fig. 4D, there was a general trend of positive correlation between the KM coefficients for the same cancer type. The PGs from the LDs and FDs for four cancers (GBM, KIRC, LIHC, and LUAD) showed significant overlap (hypergeometric test, p < 0.05), indicating a robust expression-survival association for these cancers. Notably, KIRC and LIHC displayed the highest correlation coefficients (r = 0.64, JC = 0.26 for KIRC; r = 0.66, JC = 0.24 for LIHC, Fig. 4E). We observed that most of the identified KIRC CPGs were favourable and these genes are associated with the regulation of cell cycle-related transcription, whereas the majority of LIHC CPGs were unfavourable and they are enriched in biogenesis, RNA assembly and gene expression.
Cell proportions in cancer have a major effect on prognostic genes
Significant differences were observed in the CPGs across the 10 cancer types studied. Given that LDs and FDs originated from various sources, it was not possible to account for all variables in our analysis. To interpret these results from a systems biology perspective, we focused on LIHC, which showed high consistency in CPGs, and COAD, which displayed low consistency, for more detailed analysis.
The LIHC datasets exhibited a strong correlation in expression profiles across all protein-coding genes (Fig. 5A) and negligible differences in survival times (Fig. 5B), along with a considerable number of shared PGs. As previously mentioned, LIHC showed the highest similarity in KM coefficients (Fig. 5C). Using the Boruta SHAP algorithm, we evaluated critical features influencing patient survival. Four well-documented clinical variables (cancer stage, race, gender, and age) and the first three principal components of the LD expression profiles, representing overall expression patterns, were assessed. Across 100 iterations, expression principal components consistently emerged as significant factors for survival, while other clinical attributes were not emphasized (Fig. 5D). Furthermore, both LIHC datasets displayed high congruence in cell-type proportions, with hepatocytes constituting the majority (> 90%, Fig. 5E), indicating high cellular homogeneity within the samples.
In contrast to LIHC, COAD displayed distinct characteristics in all evaluated aspects. As previously noted, the gene expression profiles across all protein-coding genes of COAD were less distinctive compared to those of LIHC, showing a notably lower correlation (Fig. 5F). Significant differences were also observed in the survival times between the living and deceased patient groups (Fig. 5G), suggesting that the COAD cohorts might be subject to highly divergent exposure factors. These multiple discrepancies likely contributed to the lower confidence in PGs identified for COAD (Fig. 5H).
In our survival analysis (Fig. 5I), expression principal components for COAD were identified as important less frequently compared to LIHC, whereas 'Race' was more frequently recognized as a significant factor (N = 20), even surpassing PCA2. Additionally, the major cell types within COAD, namely epithelial cells and fibroblasts, showed significantly different proportions across the datasets, yet both were present in low percentages (Fig. 5J). These comparisons underscore the intrinsic differences between the two COAD cohorts, both in terms of the clinical characteristics of the patients and the cellular composition of the sequenced samples.
The KM analysis assigns greater weight to the survival days of deceased patients because they represent “completed event records.” When comparing survival days between cohort pairs (Figure S3, Table S5), it was found that, of the ten cancer dataset pairs, six showed no statistical difference in the survival days of the deceased patient group. Among these, cancer types such as KIRC, LIHC and LUAD demonstrated significant consistency between the LD and FD. Although there was no statistical difference in survival among deceased BRCA patients, intrinsic biological differences are evident; the FD for BRCA includes younger patients (ages 25–35) compared to the broader age range (26–90 years) in the LD, which could influence the overlap of prognostic genes. Additionally, variations in cancer subtypes or treatment modalities can lead to notable deviations between datasets. For instance, in the READ cohort, 43.2% of patients underwent pharmaceutical therapy and 56.8% received radiation therapy. In contrast, the majority of the (READ-FD cohort did not receive any treatment (77.7%), with only a small fraction undergoing pharmaceutical intervention. This divergence in treatment approaches is reflected in the substantial variation in survival durations observed between the living and deceased patients across both cohorts.
Construction of the cancer regulatory networks for prognostic genes
Our study revealed that KIRC and LIHC are characterized by a strong correlation between gene expression profiles and prognostic outcomes. Despite the abundance of PGs, selecting the most efficacious genes for treatment remains challenging. To improve the specificity of PGs selection, we construct a regulatory network for KIRC prognostic genes. This network serves as a strategic framework to guide the selection of genes within relevant pathways, potentially streamlining the identification of therapeutic targets (see Methods section for a detailed methodology).
We downloaded a comprehensive set of 186 KEGG pathways with their associated genes from MSigDB (23). For each sample in our dataset, we calculated the activity score for each of these pathways. The top 10 pathways with the lowest p values (p < 0.05) that significantly have different activity scores among the alive and deceased patient groups in KIRC-LD were shown in Fig. 6A, while the top pathways in KIRC-FD were shown in Fig. 6B. The tight junction pathway, which emerged as the shared pathway of different activity in both LD and FD cohorts, was thus regarded as a potential key pathway related to different survival outcomes. It plays a key role in cell adhesion and permeability in epithelial cells and shows reduced activity in KIRC samples compared to non-tumorous tissue (24). Additionally, it has been implicated in the progression of more advanced tumour pathology.
For KIRC, we utilized the ARACNe-inferred KIRC network(25), which includes 6,054 transcriptional regulators (TRs) and their gene regulatory associations. We conducted a linear regression analysis to assess the correlation between pathway activities and the activities of major identified TRs, with the robustness of these correlations verified via bootstrap analysis (n = 100 iterations). In KIRC, we identified 529 TRs that exhibit regulatory interactions with the tight junction pathway (Table S6). Of these, 319 TRs (60.03%) also demonstrated a correlation with patient survival outcomes in KIRC. In the KIRC-FD, 2,051 TRs were implicated in the regulation of the tight junction pathway, with 23.55% (483 TRs) associated with patient survival in KIRC-FD.
Comparative analysis between the KIRC-LD and KIRC-FD cohorts revealed 90 TRs involved in the tight junction pathway, which were also concurrently identified as KIRC CPGs in previous KM analysis (Fig. 6C). These TRs exhibited a high correlation in slope value (r = 0.89, by Spearman Coefficient, Fig. 6D). The majority of these TRs were categorized as favourable CPG biomarkers (88 TRs), each showing positive regulation of the tight junction pathway. In contrast, two TRs, DNMT3B and PPP1R1A, were classified as unfavourable KIRC CPGs, displaying a negative regulatory relationship with the tight junction pathway. The impairment of this pathway may play a critical role in KIRC pathogenesis, aligning with our findings where the overexpression of the two unfavourable CPGs could decrease pathway activity, potentially accelerating tumour progression and resulting in worse patient survival outcomes. Moreover, the two TRs have been extensively investigated across multiple studies and recognized as potential candidates for cancer therapy(26, 27), indicating their potential application in future KIRC research.
We applied a similar methodology to construct the regulatory network for LIHC prognostic genes (Table S7). Differential activation of the purine metabolism and RNA polymerase pathways was observed between alive and deceased LIHC patients, as well as LIHC-FD, as shown in Figures S4A and S4B. Within the regulatory framework of the purine metabolism pathway, 209 TRs were also identified as LIHC CPG (Figure S4C). The inhibition of purine metabolism is known to suppress the progression of hepatocellular carcinoma (HCC) (28). Notably, genes classified as unfavourable exhibited a positive regulatory association with purine metabolism, suggesting a potential inhibition through the unfavourable genes.
In contrast, 165 TRs, which also align with CPGs, were identified concerning the RNA polymerase pathway, as shown in Figure S4D. Although the activity scores of CPGs within survival-differential pathways indicated a lower Spearman correlation in LIHC, we observed three genes—TAF15, CHEK1, and PDCD6—as having the highest slope values in both purine metabolism and RNA polymerase pathways (Figures S4E-F). These genes have been implicated in the inhibition of HCC progression(29, 30) and cellular migration(31), illustrating their potential as targets for the development of effective HCC treatment.
The impact of updated datasets on prognostic genes
In the Human Pathology Atlas, our focus was on protein-coding genes, deriving expression levels from the aggregate of protein-coding transcripts. Utilizing Ensembl release 103, which includes updated gene or transcript classification for over 3,000 genes, updating our gene classification was a crucial initial step in our analysis. We then performed a correlation analysis of expression profiles to detect changes in overall expression patterns. Despite employing different gene quantification methods (previously FPKM and currently TPM), the average Spearman correlation coefficient remained above 0.8, which is relatively low considering the samples are generally the same.
Our dataset has been updated with the latest clinical records from the TCGA database, meticulously comparing changes on a case-by-case basis. Significant updates include alterations in cohort sample sizes, with notable reductions observed across most cancer types (Figure S5A). For instance, the sample size for uterine corpus endometrial carcinoma (UCEC) decreased by 67.5% due to the unavailability of raw bam files for 365 patients, resulting in the lowest expression correlation (r = 0.84, by the Spearman coefficient). Adjustments in patients’ clinical information were also evident; for example, the survival for a BRCA patient sample was revised to 1,468 days (2,024 days less than the previous record), and a CESC patient's status was updated to deceased with no change in survival time.
We conducted a comparative analysis of PGs across two versions of the dataset, as shown in Figure S5B. The number of PGs for each cancer type is listed, along with their respective categories, with the significance of overlap indicated by asterisks. While high consistency was anticipated and observed within the same cancer types, the gene lists are not entirely identical. This discrepancy highlights the sensitivity of survival analysis to data variations, particularly changes in expression levels and clinical information.