The Human Pathology Atlas for deciphering the prognostic features of human cancers

doi:10.21203/rs.3.rs-4544479/v1


Research Article


This work is licensed under a CC BY 4.0 License

Version 1


Cancer is one of the leading causes of mortality worldwide, highlighting the urgent need for a deeper molecular understanding of the disease's heterogeneity and the development of personalized treatments. Since its establishment in 2017, the Human Pathology Atlas has been instrumental in linking gene expression profiling with patient survival outcomes, providing system-level insights and experimental validation across a wide range of cancer research. In this updated analysis, we analysed the expression profiles of 6,918 patients across 21 cancer types using the latest gene annotations. Our refined approach enabled us to offer an updated list of prognostic genes for human cancers, with a focus on hepatocellular, renal and colorectal cancers. To strengthen the reliability of our findings, we integrated data from 10 independent cancer cohorts, creating a cross-validated, reliable collection of prognostic genes. By applying a systems biology approach, we identified that patient survival outcomes in kidney renal clear cell carcinoma (KIRC) and liver hepatocellular carcinoma (LIHC) are strongly associated with gene expression profiles. We also developed a prognostic regulatory network specifically for KIRC and LIHC to enhance the utility of the Human Pathology Atlas for cancer research. The updated version of the Human Pathology Atlas lays the foundation for precision oncology and the development of personalized treatment strategies.

Systems Biology

Cancer Biology

Pan-cancer study

Survival analysis

Pathology

System biology

Cancer remains a significant global health challenge, with recent estimates indicating approximately 19.3 million new cases and almost 10 million deaths annually (1). In Europe, breast, colorectal, lung, and prostate cancers are the most frequently diagnosed cancers, collectively representing over half of all cases (2). Of particular concern is the high rate of premature mortality associated with cancer, imposing substantial societal and economic burdens (3). Extensive efforts have been invested in cancer research to develop effective treatment options and improve prognostic outcomes. However, universally effective and resilient treatments remain limited due to the heterogeneity of cancer (4–7). This highlights the urgent need for a deeper understanding of the molecular mechanisms driving cancer pathogenesis and for the development of more effective, targeted and personalized treatment strategies. Cancer research has experienced significant evolution with advancements in computational power and the emergence of big data (8–10). Integrating multi-omics has propelled the field into a new era, where systems biology approaches can offer novel insights into cancer's complex pathology, bridging the existing gaps in our understanding of cancer pathogenesis and treatment efficacy.

Previously, we employed a systems biology approach to establish associations between gene expression profiles and patient survival outcomes, which we compiled into the Human Pathology Atlas (11). It is available in an open-access form as an essential component of the Human Protein Atlas (https://www.proteinatlas.org/), which has been integral to numerous cancer studies, furnishing experimental evidence and system-level insights to bolster research on biomarker identification and disease progression-related gene screening (12–14). Building upon the methodologies of our prior work, we have also identified tumour genes that correlate with patient survival, guiding us towards the discovery of novel drug targets and the development of inhibitory compounds capable of suppressing tumour cell growth and proliferation (15–17). These advancements emphasize the need for systematic exploration of prognostic gene signatures to enhance the precision of cancer diagnostics and therapeutics.

In this study, we re-annotated the pathological attributes of all protein-coding genes starting from the raw bam files and quantified gene expression as transcripts per million (TPM) to enable fair comparisons across a broad spectrum of genes and various cancer datasets. We also standardized gene expression on a quantile scale, allowing us to track shifts in gene expression from normal to tumour tissues. Furthermore, we updated the correlations between gene expression and survival outcomes using global gene expression profiling. Additionally, we compiled independent datasets from 10 different cancer types to identify a robust set of confidence prognostic genes (CPGs) that could enhance cancer research and potential clinical applications. Notably, we observed significant variations in prognostic-gene associations across cancer types. By focusing on liver hepatocellular carcinoma (LIHC) and colon adenocarcinoma (COAD), we investigated tumour heterogeneity and found that prognostic gene associations are highly specific to each cancer type. In the end, we constructed a prognostic regulatory network for kidney renal clear cell carcinoma (KIRC) and LIHC that incorporates these prognostic genes, paving the way for more comprehensive cancer investigations. The workflow of our study is depicted in Fig. 1A.

Classification of genes in cancers and normal tissues

The RNA-seq data and corresponding clinical information for 6,918 cancer patients diagnosed with 21 distinct human cancer types, as catalogued in The Cancer Genome Atlas (TCGA) (Table S1), were downloaded. This dataset was uniformly processed through a consistent bioinformatics pipeline. Expression levels were subsequently normalized to TPM in order to enable comparative analysis across samples. We performed PCA to delineate the gene expression patterns among 21 different cancers (Fig. 1B). While a significant proportion of the cancers were closely aggregated, LIHC demonstrated pronounced heterogeneity in comparison to the other cancer types.

In this study, we adopted a comparable approach to categorize 19,652 protein-coding genes into five distinct categories based on their expression levels across various cancer types (Figure S1A) as previously described (18). Our analysis showed that a substantial portion (53.6%) of protein-coding genes were expressed in all cancers analysed, while an additional 12.1% of genes were not detected in any of the cancer types examined. The commonly expressed protein-coding genes were found to be enriched in typical cancer-related processes such as mRNA processing and cell cycle-related biological functions (Figure S1B). This enrichment aligns with the rapid cellular proliferation that occurs during tumorigenesis.

Our analysis extended to the prevalence of upregulated genes across all cancer types, encompassing categories of cancer-enriched, group-enriched, and cancer-enhanced genes (Fig. 2A). Remarkably, glioblastoma multiforme (GBM), testicular germ cell tumours (TGCT), and liver hepatocellular carcinoma (LIHC) exhibited the highest number of upregulated genes. This observation may be partly explained by the intrinsic heterogeneity of the brain, testis, and liver tissues, indicating that the elevated gene expression could be inherently connected to the properties of the tissues from which these cancers originate.

We downloaded the TPM expression profiles of genes in normal tissues from the Human Protein Atlas (18). Consequently, we sourced TPM profiles for 19 tissues corresponding to 17 distinct cancer types to delineate the gene expression patterns (Table S2). Furthermore, we organized the 19,564 protein-coding genes into five categories across these tissue types (Figures S1C-D). In contrast to cancer states, a smaller fraction of genes (43.2%) was expressed across all normal tissue types, and a lower number of genes (7.7%) remain undetected in normal tissues. This pattern suggests a shift in gene expression from normal to cancerous tissues. To delve deeper into this phenomenon, we conducted a comparative analysis of gene specificity categories between normal and cancerous tissues (Fig. 2B).

We observed that the majority of genes with low tissue specificity maintained this characteristic during the transition from normal to tumour conditions. These genes are predominantly involved in essential cellular biological processes such as ribosome biogenesis and mitochondrial gene expression (Figure S1E). Additionally, genes that were categorized as having elevated expression in normal tissues exhibited a shift to various specificity categories in the context of cancer, reflecting the heterogeneity of gene expression across different cancer types. We particularly focused on genes that were not detected in normal tissues but showed elevated expression in cancerous conditions (Fig. 2C), as these genes may contribute to the progression of tumorigenesis. Among these, we identified 263 genes (Table S3) predominantly involved in nucleosome assembly or DNA packaging processes (Fig. 2D), aligning with the rapid cellular proliferation typical of tumour progression.

The identification of prognostic genes for cancers

A Kaplan-Meier (KM) analysis was employed to assess the relationship between the patient’s tumour transcriptomic profiles and clinical survival outcomes, from the recruitment in the study to the occurrence of death. As described in the Methods section and our previous research (19), patients were stratified into groups based on the high or low expression levels of the genes. Genes were labelled as 'favourable' and 'unfavourable' if high expression correlated with better or poor survival outcomes, respectively. We analysed the number of prognostic genes (PGs) for each cancer type (Fig. 3A) and observed that KIRC and LIHC had the highest numbers of PGs. In KIRC, the majority of PGs (87.3%) were categorized as favourable genes, whereas the majority of PGs (92.3%) were categorized as unfavourable genes in LIHC.
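As an illustration of the stratification described above, the following is a minimal sketch of a Kaplan-Meier estimate computed separately for high- and low-expression groups. The median cutoff, the function names and the toy values are assumptions made for this example only; the cutoff selection actually used is described in the Methods section and in (19).

```python
from statistics import median


def kaplan_meier(times, events):
    """Return [(time, survival probability)] at each time with at least one death.

    times  : follow-up in days for each patient
    events : 1 if the patient died at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        # Patients still under observation just before time t
        at_risk = sum(1 for x in times if x >= t)
        deaths = sum(1 for x, e in zip(times, events) if x == t and e == 1)
        if deaths:
            survival *= 1.0 - deaths / at_risk
            curve.append((t, survival))
    return curve


def split_by_expression(expression, cutoff=None):
    """Label each patient 'high' or 'low'; the median default is an assumption."""
    if cutoff is None:
        cutoff = median(expression)
    return ["high" if x > cutoff else "low" for x in expression]


# Toy cohort (made-up values, not taken from the study)
expression = [2.0, 8.0, 3.0, 9.0, 1.0, 7.0]
times = [100, 400, 150, 500, 120, 450]
events = [1, 0, 1, 1, 1, 0]

groups = split_by_expression(expression)
for label in ("high", "low"):
    idx = [i for i, g in enumerate(groups) if g == label]
    curve = kaplan_meier([times[i] for i in idx], [events[i] for i in idx])
    print(label, curve)
```

In this toy cohort the high-expression group survives longer, so the hypothetical gene would be labelled favourable, provided a log-rank test (not shown here) supported the difference.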

The prognostic significance of the genes varied across cancer types, with some demonstrating consistent prognostic values. For example, CD6, a crucial gene for T-cell activation, is identified as a favourable prognostic marker in multiple cancers, such as breast invasive carcinoma (BRCA), cervical squamous cell carcinoma and endocervical adenocarcinoma (CESC), head and neck squamous cell carcinoma (HNSC), and skin cutaneous melanoma (SKCM), as shown in Fig. 3B. Similarly, PSMB1, the non-catalytic component of the proteasome complex, was implicated in poor survival outcomes across several cancer types (Fig. 3C), namely bladder urothelial carcinoma (BLCA), BRCA, HNSC and lung adenocarcinoma (LUAD). Our findings suggest a potential underlying commonality in the regulatory mechanisms across these cancers.

Conversely, certain genes exhibited distinctly different prognostic significance depending on the cancer type. As shown in Fig. 3D, interferon-induced anti-viral exoribonuclease (ISG20), which acts on single-stranded RNA and is involved in immune and inflammatory responses, correlated with improved survival in ovarian serous cystadenocarcinoma (OV) and SKCM. However, its high expression is indicative of poorer survival in GBM and KIRC. These findings align with previously published research on these cancers (20–22).

Validation of prognostic genes in different cancer cohorts

In the previous version of the Human Pathology Atlas (11), we reported significant variability in the number of prognostic genes (PGs) across different cancer types. To reduce dependence on a single dataset, we compiled 10 follow-up datasets (FDs) from various sources, each corresponding to one of the cancer types included in the leading datasets (LDs), specifically the TCGA cohorts (refer to Fig. 4A and Table S4). These FDs were re-annotated using the same bioinformatics pipeline and reference genomes, and a consistent approach was applied to filter their clinical records.

We analysed the connectivity among LDs and FDs for the 10 cancer types using PCA based on the expression patterns of all protein-coding genes (Figure S2A). Notably, the LIHC-FD exhibited distinct expression patterns that aligned closely with those of the LD cohort. The centroid plot in the upper right corner further demonstrated that LD and FD share similar variances in PCA. Additionally, a clearer clustering trend was observed in the dendrogram plot (Figure S2B), showing that dataset pairs of the same cancer type generally clustered more closely. However, a significant divergence was noted in the breast cancer (BRCA) FD, which could be attributed to its specific age demographic—comprising solely young individuals aged 25 to 35 years—differing from the broader age range (25 to 90 years old) in the LD.

The cluster heatmap supports our observations (Fig. 4B), aligning with the findings discussed above. Although the Spearman correlation coefficients for all LDs and FDs were generally above 0.6, indicating a robust association, expression profiles were most conserved within the same cancer type—reflecting underlying biological consistencies. Notably, GBM emerged as the most distinct cancer type, with the Spearman correlation between GBM datasets being the lowest among all dataset pairs of the same cancer type. In contrast, KIRC and LIHC displayed the highest Spearman correlation between their respective LD and FD. Consistent with the PCA results, LIHC also showed a considerably low Spearman correlation with datasets of other cancer types. Aside from these three cancers, other cancer types clustered together, suggesting greater homogeneity in gene expression patterns. Within this larger cluster, COAD and rectum adenocarcinoma (READ), both originating from the same organ and often collectively referred to as colorectal cancer, showed higher intragroup similarity.

To assess the robustness of the PGs, we performed KM analysis in the 10 FDs using the same pipeline. A gene was considered a confidence gene if it consistently demonstrated prognostic value across both the LD and FD of the same cancer type. Consequently, we were able to identify shared confidence prognostic genes (CPG) for the two independent data sets (Fig. 4C). KIRC and LIHC exhibited the highest number of shared prognostic genes, while cancers such as COAD and lung squamous cell carcinoma (LUSC) were found to have fewer prognostic genes identified by the two independent cohorts.
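The selection of confidence genes can be sketched as a set intersection. This is a minimal illustration that assumes each cohort has already been reduced to its significant prognostic genes, and that "consistent" means the same direction (favourable or unfavourable) in both cohorts. The function name and the placeholder gene names are hypothetical.

```python
def confidence_genes(ld_calls, fd_calls):
    """Genes that are prognostic in both cohorts with the same direction.

    ld_calls, fd_calls : dict mapping gene -> 'favourable' or 'unfavourable',
    containing only genes that passed the significance filter in that cohort.
    """
    return {
        gene: label
        for gene, label in ld_calls.items()
        if fd_calls.get(gene) == label
    }


# Placeholder calls (illustrative only)
ld = {"GENE_A": "favourable", "GENE_B": "unfavourable", "GENE_C": "favourable"}
fd = {"GENE_A": "favourable", "GENE_B": "favourable", "GENE_D": "unfavourable"}
print(confidence_genes(ld, fd))
```

GENE_B is prognostic in both cohorts but with opposite directions, so under the stated assumption it is not retained.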

We further assessed the repeatability of PG identification by calculating the Spearman correlation for KM coefficients of genes between the LDs and FDs. As shown in Fig. 4D, there was a general trend of positive correlation between the KM coefficients for the same cancer type. The PGs from the LDs and FDs for four cancers (GBM, KIRC, LIHC, and LUAD) showed significant overlap (hypergeometric test, p < 0.05), indicating a robust expression-survival association for these cancers. Notably, KIRC and LIHC displayed the highest correlation coefficients (r = 0.64, JC = 0.26 for KIRC; r = 0.66, JC = 0.24 for LIHC, Fig. 4E). We observed that most of the identified KIRC CPGs were favourable and these genes are associated with the regulation of cell cycle-related transcription, whereas the majority of LIHC CPGs were unfavourable and they are enriched in biogenesis, RNA assembly and gene expression.
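The overlap statistics mentioned above can be illustrated with a short sketch. It assumes that "JC" denotes the Jaccard coefficient and that the hypergeometric test is one-sided (the probability of an overlap at least as large as the one observed). The size of the gene universe is a free parameter here, and all numbers in the usage example are made up.

```python
from math import comb


def hypergeom_overlap_pvalue(n_universe, n_ld, n_fd, n_shared):
    """P(overlap >= n_shared) when n_fd genes are drawn at random from a
    universe of n_universe genes, n_ld of which are prognostic in the LD."""
    total = comb(n_universe, n_fd)
    upper = min(n_ld, n_fd)
    tail = sum(
        comb(n_ld, k) * comb(n_universe - n_ld, n_fd - k)
        for k in range(n_shared, upper + 1)
    )
    return tail / total


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Illustrative numbers only: 1,000 genes tested, 100 and 80 prognostic
# genes in the two cohorts, 30 of them shared.
print(hypergeom_overlap_pvalue(1000, 100, 80, 30))
print(jaccard({"GENE_A", "GENE_B", "GENE_C"}, {"GENE_B", "GENE_C", "GENE_D"}))
```

Exact integer binomials from `math.comb` avoid floating-point underflow for small tail probabilities; for very large gene sets a library implementation of the hypergeometric survival function would be faster.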

Cell proportions in cancer have a major effect on prognostic genes

Significant differences were observed in the CPGs across the 10 cancer types studied. Given that LDs and FDs originated from various sources, it was not possible to account for all variables in our analysis. To interpret these results from a systems biology perspective, we focused on LIHC, which showed high consistency in CPGs, and COAD, which displayed low consistency, for more detailed analysis.

The LIHC datasets exhibited a strong correlation in expression profiles across all protein-coding genes (Fig. 5A) and negligible differences in survival times (Fig. 5B), along with a considerable number of shared PGs. As previously mentioned, LIHC showed the highest similarity in KM coefficients (Fig. 5C). Using the Boruta SHAP algorithm, we evaluated critical features influencing patient survival. Four well-documented clinical variables (cancer stage, race, gender, and age) and the first three principal components of the LD expression profiles, representing overall expression patterns, were assessed. Across 100 iterations, expression principal components consistently emerged as significant factors for survival, while other clinical attributes were not emphasized (Fig. 5D). Furthermore, both LIHC datasets displayed high congruence in cell-type proportions, with hepatocytes constituting the majority (> 90%, Fig. 5E), indicating high cellular homogeneity within the samples.

In contrast to LIHC, COAD displayed distinct characteristics in all evaluated aspects. As previously noted, the gene expression profiles across all protein-coding genes of COAD were less distinctive compared to those of LIHC, showing a notably lower correlation (Fig. 5F). Significant differences were also observed in the survival times between the living and deceased patient groups (Fig. 5G), suggesting that the COAD cohorts might be subject to highly divergent exposure factors. These multiple discrepancies likely contributed to the lower confidence in PGs identified for COAD (Fig. 5H).

In our survival analysis (Fig. 5I), expression principal components for COAD were identified as important less frequently compared to LIHC, whereas 'Race' was more frequently recognized as a significant factor (N = 20), even surpassing PCA2. Additionally, the major cell types within COAD, namely epithelial cells and fibroblasts, showed significantly different proportions across the datasets, yet both were present in low percentages (Fig. 5J). These comparisons underscore the intrinsic differences between the two COAD cohorts, both in terms of the clinical characteristics of the patients and the cellular composition of the sequenced samples.

The KM analysis assigns greater weight to the survival days of deceased patients because they represent “completed event records.” When comparing survival days between cohort pairs (Figure S3, Table S5), it was found that, of the ten cancer dataset pairs, six showed no statistical difference in the survival days of the deceased patient group. Among these, cancer types such as KIRC, LIHC and LUAD demonstrated significant consistency between the LD and FD. Although there was no statistical difference in survival among deceased BRCA patients, intrinsic biological differences are evident; the FD for BRCA includes younger patients (ages 25–35) compared to the broader age range (26–90 years) in the LD, which could influence the overlap of prognostic genes. Additionally, variations in cancer subtypes or treatment modalities can lead to notable deviations between datasets. For instance, in the READ cohort, 43.2% of patients underwent pharmaceutical therapy and 56.8% received radiation therapy. In contrast, the majority of the READ-FD cohort did not receive any treatment (77.7%), with only a small fraction undergoing pharmaceutical intervention. This divergence in treatment approaches is reflected in the substantial variation in survival durations observed between the living and deceased patients across both cohorts.

Construction of the cancer regulatory networks for prognostic genes

Our study revealed that KIRC and LIHC are characterized by a strong correlation between gene expression profiles and prognostic outcomes. Despite the abundance of PGs, selecting the most efficacious genes for treatment remains challenging. To improve the specificity of PG selection, we constructed a regulatory network for KIRC prognostic genes. This network serves as a strategic framework to guide the selection of genes within relevant pathways, potentially streamlining the identification of therapeutic targets (see Methods section for a detailed methodology).

We downloaded a comprehensive set of 186 KEGG pathways with their associated genes from MSigDB (23). For each sample in our dataset, we calculated the activity score for each of these pathways. The top 10 pathways with the lowest p values (p < 0.05) whose activity scores differed significantly between the alive and deceased patient groups in KIRC-LD are shown in Fig. 6A, while the top pathways in KIRC-FD are shown in Fig. 6B. The tight junction pathway, which showed differential activity in both the LD and FD cohorts, was thus regarded as a potential key pathway related to survival outcomes. It plays a key role in cell adhesion and permeability in epithelial cells and shows reduced activity in KIRC samples compared to non-tumorous tissue (24). Additionally, it has been implicated in the progression of more advanced tumour pathology.

For KIRC, we utilized the ARACNe-inferred KIRC network (25), which includes 6,054 transcriptional regulators (TRs) and their gene regulatory associations. We conducted a linear regression analysis to assess the correlation between pathway activities and the activities of major identified TRs, with the robustness of these correlations verified via bootstrap analysis (n = 100 iterations). In KIRC, we identified 529 TRs that exhibit regulatory interactions with the tight junction pathway (Table S6). Of these, 319 TRs (60.30%) also demonstrated a correlation with patient survival outcomes in KIRC. In the KIRC-FD, 2,051 TRs were implicated in the regulation of the tight junction pathway, with 23.55% (483 TRs) associated with patient survival in KIRC-FD.
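The regression-with-bootstrap step can be sketched as follows. This is a minimal illustration that assumes an ordinary least-squares fit of pathway activity on TR activity and resampling of patients with replacement. The function names, the fixed seed and the toy activity values are assumptions; the procedure actually used is described in the Methods section.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y regressed on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def bootstrap_slopes(x, y, n_iter=100, seed=0):
    """Slopes from n_iter resamples of the patients, drawn with replacement."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:  # degenerate resample, slope is undefined
            continue
        slopes.append(ols_slope(xs, [y[i] for i in idx]))
    return slopes


# Toy activities (made-up values): one TR and one pathway across six patients
tr_activity = [0.1, 0.4, 0.5, 0.9, 1.2, 1.5]
pathway_activity = [1.3, 1.7, 2.1, 2.7, 3.5, 3.9]

slopes = sorted(bootstrap_slopes(tr_activity, pathway_activity))
print("point estimate:", ols_slope(tr_activity, pathway_activity))
print("bootstrap range:", slopes[0], slopes[-1])
```

A TR whose bootstrap slopes keep the same sign across the resamples could be read as a stable positive or negative regulator of the pathway; the exact acceptance criterion is not stated in this section.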

Comparative analysis between the KIRC-LD and KIRC-FD cohorts revealed 90 TRs involved in the tight junction pathway, which were also concurrently identified as KIRC CPGs in previous KM analysis (Fig. 6C). These TRs exhibited a high correlation in slope value (r = 0.89, by Spearman Coefficient, Fig. 6D). The majority of these TRs were categorized as favourable CPG biomarkers (88 TRs), each showing positive regulation of the tight junction pathway. In contrast, two TRs, DNMT3B and PPP1R1A, were classified as unfavourable KIRC CPGs, displaying a negative regulatory relationship with the tight junction pathway. The impairment of this pathway may play a critical role in KIRC pathogenesis, aligning with our findings where the overexpression of the two unfavourable CPGs could decrease pathway activity, potentially accelerating tumour progression and resulting in worse patient survival outcomes. Moreover, the two TRs have been extensively investigated across multiple studies and recognized as potential candidates for cancer therapy (26, 27), indicating their potential application in future KIRC research.

We applied a similar methodology to construct the regulatory network for LIHC prognostic genes (Table S7). Differential activation of the purine metabolism and RNA polymerase pathways was observed between alive and deceased patients in both the LIHC-LD and LIHC-FD cohorts, as shown in Figures S4A and S4B. Within the regulatory framework of the purine metabolism pathway, 209 TRs were also identified as LIHC CPGs (Figure S4C). The inhibition of purine metabolism is known to suppress the progression of hepatocellular carcinoma (HCC) (28). Notably, genes classified as unfavourable exhibited a positive regulatory association with purine metabolism, suggesting a potential inhibition through the unfavourable genes.

In contrast, 165 TRs, which also align with CPGs, were identified concerning the RNA polymerase pathway, as shown in Figure S4D. Although the activity scores of CPGs within survival-differential pathways indicated a lower Spearman correlation in LIHC, we observed three genes (TAF15, CHEK1, and PDCD6) with the highest slope values in both the purine metabolism and RNA polymerase pathways (Figures S4E-F). These genes have been implicated in the inhibition of HCC progression (29, 30) and cellular migration (31), illustrating their potential as targets for the development of effective HCC treatment.

The impact of updated datasets on prognostic genes

In the Human Pathology Atlas, our focus was on protein-coding genes, deriving expression levels from the aggregate of protein-coding transcripts. Because Ensembl release 103 includes updated gene or transcript classifications for over 3,000 genes, updating our gene classification was a crucial initial step in our analysis. We then performed a correlation analysis of expression profiles to detect changes in overall expression patterns. Although the average Spearman correlation coefficient remained above 0.8 across the two gene quantification methods (previously FPKM, currently TPM), this value is relatively low considering that the samples are generally the same.

Our dataset has been updated with the latest clinical records from the TCGA database, meticulously comparing changes on a case-by-case basis. Significant updates include alterations in cohort sample sizes, with notable reductions observed across most cancer types (Figure S5A). For instance, the sample size for uterine corpus endometrial carcinoma (UCEC) decreased by 67.5% due to the unavailability of raw bam files for 365 patients, resulting in the lowest expression correlation (r = 0.84, by the Spearman coefficient). Adjustments in patients’ clinical information were also evident; for example, the survival for a BRCA patient sample was revised to 1,468 days (2,024 days less than the previous record), and a CESC patient's status was updated to deceased with no change in survival time.

We conducted a comparative analysis of PGs across two versions of the dataset, as shown in Figure S5B. The number of PGs for each cancer type is listed, along with their respective categories, with the significance of overlap indicated by asterisks. While high consistency was anticipated and observed within the same cancer types, the gene lists are not entirely identical. This discrepancy highlights the sensitivity of survival analysis to data variations, particularly changes in expression levels and clinical information.

In this study, we compiled and updated publicly available cancer datasets and conducted KM survival analyses to systematically explore the relationship between gene expression and patient survival outcomes. Our findings revealed distinct patterns across various cancer types. Notably, KIRC and LIHC demonstrated a significant number of prognostic genes (PGs), indicating a robust correlation between gene expression profiles and survival outcomes in these cancers. These PGs were further validated using independent datasets. The high expression correlation observed in both the initial and follow-up datasets for KIRC and LIHC suggests better consistency in the disease pathology rather than significant variability among patients. This consistency was also supported by cell type analysis derived from the LIHC datasets, suggesting that LIHC may exhibit uniform behaviour across different studies, potentially due to the homogeneous nature of the tissue involved.

However, the impact of gene expression on cancer prognosis varies across different cancer types. The fundamental complexity of cancer, which includes genetic diversity, epigenetic modifications, comorbidities, environmental factors, and lifestyle choices, contributes differently to disease progression and patient survival (32–35). Certain cancers, such as TGCT and PRAD, have been found to have a significantly smaller set of prognostic genes, suggesting a potentially weaker correlation between gene expression and survival outcomes in these cases.

Furthermore, our methodology for selecting PGs was stringent, utilizing a p-value threshold of less than 0.001 to ensure robust statistical significance. This rigorous cut-off minimizes the influence of potential gene expression fluctuations on our results. However, the unique characteristics of each tumour type may necessitate a more flexible approach to cut-off criteria, potentially adapting them to better match the specificities of individual cancers. Such a nuanced consideration of cut-off thresholds could facilitate a more tailored and insightful analysis when studying specific cancer types (15–17).

Additionally, we constructed prognostic networks for KIRC and LIHC, showcasing how cancer-specific prognostic genes can be integrated into cancer research. These CPGs can serve as a systematic reference to streamline the selection of gene candidates and further identify those with strong associations with survival outcomes.

Furthermore, our comparative analysis of clinical information across different cancer cohorts showed that even minor discrepancies can significantly affect survival analysis outcomes. This underscores the need for meticulous examination of the original data before conducting survival analysis to reduce the risk of error and ensure the reliability of the study's conclusions. While gene expression patterns serve as crucial biomarkers in some cancers, their prognostic value may be less pronounced in others, necessitating a comprehensive approach to understanding and predicting cancer survival. Future studies should strive to incorporate a broader range of data to enhance the accuracy of survival analyses and minimize the effects of inconsistent clinical information.

In conclusion, we employed Kaplan-Meier analyses to determine the prognostic significance of protein-coding genes for patient survival across 21 cancer types. We curated lists of genes with favourable and unfavourable prognostic values. Additionally, we compiled a robust list of genes for 10 cancer types, confirming their prognostic value through validation with independent cancer cohorts. Our analysis of clinical information indicated that gene expression patterns significantly impacted survival predictions, particularly in KIRC and LIHC. The results of this study will be presented in the updated Pathology section of the open access Human Protein Atlas resource (www.proteinatlas.org).

Pre-processing of data

We used the GDC client to download the raw BAM files for The Cancer Genome Atlas (TCGA) cohorts, from which transcripts per million (TPM) values were subsequently computed. After screening all samples across 21 cohorts, we retained data from 6,918 donors who had both primary tumour solid tissue samples and associated clinical information. This clinical information was sourced from the TCGA Pan-Cancer Clinical Data Resource (TCGA-CDR) (36) by categorizing the data according to cancer types.

We retrieved the global gene expression profiles (measured in Fragments Per Kilobase of transcript per Million mapped reads, FPKM) and clinical information for 442 donors from the International Cancer Genome Consortium (ICGC) database (http://icgc.org/), which includes data on breast cancer (BRCA-KR), liver cancer (LIRI-JP), ovarian cancer (OV-AU), and pancreatic cancer (PACA-AU). In our study, we limited our dataset to samples that included primary tumour solid tissue samples and clinical information. To avoid ambiguity in the expression data for donors with multiple tumour samples, we followed the criteria: preference was given to the sample labelled 'C01', or in the absence of such a label, we selected samples that were 'untreated', 'included in PCAWG', or had a 'higher percentage of cellularity'. All FPKM values were converted to TPM, focusing on protein-coding genes to ensure data consistency.
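The FPKM-to-TPM conversion follows the standard rescaling in which each sample's values are divided by their sum and multiplied by one million. The sketch below assumes the rescaling is applied per sample over whichever gene set is passed in; whether the renormalization was done before or after restricting to protein-coding genes is not stated here. The function name and placeholder genes are hypothetical.

```python
def fpkm_to_tpm(fpkm):
    """Rescale one sample's FPKM values so that they sum to one million.

    fpkm : dict mapping gene -> FPKM for a single sample
    """
    total = sum(fpkm.values())
    if total == 0:
        raise ValueError("sample has no expressed genes")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Placeholder sample (illustrative values only)
sample = {"GENE_A": 1.0, "GENE_B": 3.0}
print(fpkm_to_tpm(sample))
```

Because the rescaling is a per-sample constant factor, it preserves the rank order of genes within a sample while making totals comparable across samples.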

The metadata and raw RNA-sequencing data for colorectal cancer were acquired from individuals who had surgery at Uppsala University Hospital in Sweden (37). The colon adenocarcinoma (COAD-UCAN) cohort consists of data from 486 patients, and the rectum adenocarcinoma (READ-UCAN) cohort comprises data from 207 patients.

The raw BAM files and clinical information for 58 glioblastoma (GBM-GSE121720) patients were retrieved from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/) using the accession number GSE121720 (38). Only patients diagnosed with “primary glioblastoma” were included in our analysis. Additionally, we retrieved the RNA-seq data of 100 clear cell renal cell carcinoma (JAP-KIRC) patients from a Japanese cohort (39) from the European Genome-phenome Archive using the accession number EGAS00001000509.

The metadata and raw RNA-sequencing data for lung cancer were collected from patients who underwent surgical treatment between 2006 and 2010 at Uppsala University Hospital in Uppsala, Sweden. These data are available in the NCBI SRA database using the accession number SRP074349 (40, 41). The lung adenocarcinoma cohort (LUAD-UCAN) includes data from 105 patients, and the lung squamous cell carcinoma (LUSC-UCAN) cohort comprises data from 68 patients.

For all cohorts with available raw data in this study, we used BEDTools (42) to convert BAM files to FASTQ files and Kallisto (43) to calculate the TPM values for each gene (annotated with GRCh38 and Ensembl 103). During the analysis, we focused on protein-coding genes, taking the mRNA expression value of a gene as the sum of the TPMs of all its transcripts. We included genes with an average expression level > 1 across patients within each cancer type. Furthermore, we included only patients with a recorded survival time of more than 0 days to minimize potential inaccuracies in clinical information. We conducted a Principal Component Analysis (PCA) to illustrate the overall gene expression patterns across the 21 cancers. Finally, we clustered the cancers based on the mean expression level of genes, using Euclidean distance as the clustering metric.
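The gene-level aggregation and expression filter described above can be sketched as follows. This is an illustration under stated assumptions, not the pipeline's actual code: the transcript-to-gene mapping `tx2gene`, the function names and the toy values are hypothetical, and genes missing from a sample are treated as having zero expression.

```python
from collections import defaultdict
from statistics import mean


def gene_level_tpm(transcript_tpm, tx2gene):
    """Sum transcript-level TPMs into gene-level TPMs for one sample."""
    genes = defaultdict(float)
    for transcript, value in transcript_tpm.items():
        genes[tx2gene[transcript]] += value
    return dict(genes)


def expressed_genes(samples, threshold=1.0):
    """Return genes whose mean TPM across samples exceeds the threshold."""
    all_genes = set().union(*samples)  # union over the dict keys
    return {
        gene for gene in all_genes
        if mean(sample.get(gene, 0.0) for sample in samples) > threshold
    }


# Hypothetical mapping and two hypothetical patients of one cancer type
tx2gene = {"tx1": "G1", "tx2": "G1", "tx3": "G2"}
patient_1 = gene_level_tpm({"tx1": 0.5, "tx2": 1.0, "tx3": 0.2}, tx2gene)
patient_2 = gene_level_tpm({"tx1": 2.0, "tx2": 0.5, "tx3": 0.4}, tx2gene)
kept = expressed_genes([patient_1, patient_2])
```

In this toy case G1 has a mean TPM of 2.0 and is retained, whereas G2 has a mean of 0.3 and is dropped.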

Classification of genes in cancers and normal tissues

The TPM values for normal tissues were acquired from the Human Protein Atlas (44). We included only tissues with matched cancer types to ensure a fair comparison. We categorized the protein-coding genes into five distinct groups according to their expression patterns in tumours and tissue types. The classification is as follows: 1) Cancer/tissue enriched, where a gene's mRNA levels in one type of cancer or tissue are > 4 times the maximum levels found in all other cancers or tissues. 2) Group enriched, indicating elevated expression in about a quarter of the cancers or tissues. 3) Cancer/tissue enhanced, denoting genes with moderate expression levels. 4) Low cancer/tissue specificity, where expression levels are not significantly elevated in any cancer or tissue. 5) Not detected, for genes with expression levels below 1 in all cancers and tissues.
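As an illustration, only the two categories with explicit numeric thresholds in the text (the 4-fold enrichment rule and the detection limit of 1) are sketched below. The remaining three categories are returned as a single "unresolved" label, because their exact thresholds are not given here; the function name and example values are hypothetical.

```python
def classify_specificity(expr, fold=4.0, detect=1.0):
    """Classify one gene from its mean expression per cancer or tissue.

    expr: dict mapping cancer/tissue name -> expression level.
    Only 'enriched' and 'not detected' follow thresholds stated in the
    text; everything else is left unresolved in this sketch.
    """
    levels = sorted(expr.values(), reverse=True)
    if levels[0] < detect:
        return "not detected"
    # Enriched: the top level exceeds `fold` times the highest other level
    if len(levels) > 1 and levels[0] > fold * levels[1]:
        return "enriched in " + max(expr, key=expr.get)
    return "unresolved (group enriched, enhanced or low specificity)"


# Hypothetical expression profiles
print(classify_specificity({"LIHC": 50.0, "KIRC": 5.0, "COAD": 2.0}))
print(classify_specificity({"LIHC": 0.2, "KIRC": 0.5}))
```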

Kaplan-Meier Survival Analysis

For each gene, we stratified patients into two groups (high and low expression) based on the gene's TPM values for Kaplan-Meier (KM) survival analysis and compared the survival outcomes of the groups using log-rank tests. To identify the optimal expression cut-off for grouping, we examined all TPM values from the 20th to the 80th percentile as candidate thresholds, tested the difference in survival outcomes between the resulting groups, and chose the cut-off that yielded the lowest log-rank P value. The “survival” R package was used for the Kaplan-Meier survival analysis, and “ggplot2” was employed for visualizations. Genes were designated as prognostic genes (PGs) if they had log-rank P values less than 0.05. Additionally, a prognostic gene was considered unfavourable if the high-expression group had more observed events than expected; conversely, it was considered favourable if the number was lower. All analyses were executed using RStudio with R version 4.2.3.
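The cut-off scan can be illustrated with a self-contained sketch. The analysis itself was run with the “survival” R package; the Python code below is only an approximation of the same idea, with a hand-written two-group log-rank test, a simple nearest-rank percentile rule, and hypothetical function names and data.

```python
import math


def logrank(times, events, high):
    """Two-group log-rank test.

    Returns (p_value, observed_high, expected_high), where the last two
    values are the observed and expected event counts in the high group.
    """
    observed = expected = variance = 0.0
    for t in sorted({ti for ti, e in zip(times, events) if e}):
        n = sum(1 for ti in times if ti >= t)                        # at risk
        n1 = sum(1 for ti, h in zip(times, high) if ti >= t and h)   # at risk, high
        d = sum(1 for ti, e in zip(times, events) if ti == t and e)  # events at t
        d1 = sum(1 for ti, e, h in zip(times, events, high)
                 if ti == t and e and h)                             # events at t, high
        observed += d1
        expected += d * n1 / n
        if n > 1:
            variance += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    if variance == 0:
        return 1.0, observed, expected
    chi2 = (observed - expected) ** 2 / variance
    # Survival function of a chi-square distribution with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2)), observed, expected


def best_cutoff(expr, times, events, lo=0.2, hi=0.8):
    """Scan expression values between two percentiles; keep the lowest P."""
    ranked = sorted(expr)
    q_lo = ranked[int(lo * (len(ranked) - 1))]
    q_hi = ranked[int(hi * (len(ranked) - 1))]
    best = None
    for cut in sorted({x for x in expr if q_lo <= x <= q_hi}):
        high = [x > cut for x in expr]
        if all(high) or not any(high):
            continue  # both groups must be non-empty
        p, obs, exp = logrank(times, events, high)
        if best is None or p < best[1]:
            label = "unfavourable" if obs > exp else "favourable"
            best = (cut, p, label)
    return best


# Hypothetical gene: patients with high expression die earlier
expression = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
times = [50, 48, 45, 40, 38, 10, 8, 6, 5, 3]
events = [1] * 10
cutoff, p_value, label = best_cutoff(expression, times, events)
```

The gene would be called prognostic if `p_value` is below 0.05, and `label` reports the direction based on observed versus expected events in the high-expression group.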

Correlation analysis

A gene qualifies as an overlapping prognostic gene across different datasets if it is identified as a prognostic gene in any dataset and shows a consistent directional effect (either consistently positive or consistently negative across all datasets). To evaluate the correlation between gene expression patterns across two different cohorts, we used the Spearman coefficient and the Jaccard Coefficient (JC). Furthermore, we employed the hypergeometric test to determine the statistical significance of the overlap between two gene lists. We performed the entire analytical process using RStudio with R version 4.2.3.
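The overlap statistics can be sketched in a few lines. The sketch assumes the usual one-sided hypergeometric test (the probability of observing at least the given overlap by chance) and a shared gene universe for both lists; the function names and the toy gene lists are hypothetical.

```python
from math import comb


def jaccard(a, b):
    """Jaccard coefficient: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hypergeom_pvalue(overlap, size_a, size_b, universe):
    """P(X >= overlap) when size_b genes are drawn without replacement
    from a universe that contains size_a genes of the first list."""
    upper = min(size_a, size_b)
    total = comb(universe, size_b)
    hits = sum(
        comb(size_a, k) * comb(universe - size_a, size_b - k)
        for k in range(overlap, upper + 1)
    )
    return hits / total


# Hypothetical prognostic gene lists from two cohorts, universe of 20 genes
cohort_1 = {"g1", "g2", "g3", "g4", "g5"}
cohort_2 = {"g3", "g4", "g5", "g6", "g7"}
jc = jaccard(cohort_1, cohort_2)
p_overlap = hypergeom_pvalue(len(cohort_1 & cohort_2), len(cohort_1), len(cohort_2), 20)
```

For these toy lists the Jaccard coefficient is 3/7; the choice of gene universe strongly affects the hypergeometric P value and should match the set of genes actually tested in both cohorts.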

Clinical feature ranking

We analysed the significance of clinical features using the Boruta SHAP algorithm (45), which integrates Boruta's variable selection method with Shapley values, employing random forests to systematically determine variable importance. Next, we applied PCA to extract the primary expression patterns, focusing on the three principal components with the greatest impact. We transformed categorical clinical features, such as race and cancer stage, into numerical data using one-hot encoding. To achieve unbiased feature selection, we standardized all variables to a scale ranging from −1 to 1. To ensure the robustness of our feature selection method, we subjected all features to 100 shuffling iterations to bring them closer to a state of randomness. This entire analysis was carried out using Python.
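The encoding and scaling steps can be illustrated with a short sketch. It assumes that the scaling to the range −1 to 1 is a min-max rescaling, which the text does not state explicitly, and it does not reproduce the Boruta SHAP step itself; the function names and example values are hypothetical.

```python
def one_hot(values):
    """One-hot encode a categorical feature into one 0/1 column per category."""
    categories = sorted(set(values))
    return {c: [1.0 if v == c else 0.0 for v in values] for c in categories}


def scale_to_unit_range(column):
    """Min-max rescale a numeric column to the interval [-1, 1]."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no information
    return [2 * (x - lo) / (hi - lo) - 1 for x in column]


# Hypothetical clinical features for four and three patients respectively
stage = one_hot(["I", "II", "I", "III"])
age = scale_to_unit_range([40, 60, 80])
```

Putting all features on a common scale keeps variables with large numeric ranges, such as age in days, from dominating those encoded as 0 or 1.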

Prediction of cell-type proportion

We performed the analysis to identify cell types and their proportions within bulk RNA-seq datasets using the Dampened Weighted Least Squares (DWLS) approach (46). This technique is tailored to accurately deduce cell-type compositions, adjusting for any bias towards cells with either high gene expression levels or prevalence. Necessary reference profiles were sourced from single-cell RNA-seq data; for colorectal cancer, this data was retrieved from the GEO database using the accession number GSE178341. For hepatocellular carcinoma, the single-cell RNA-seq data was similarly obtained from the GEO database, linked to the accession number GSE149614.

Construction of the regulatory networks for prognostic genes

We retrieved the KEGG (47) pathway database from the Molecular Signatures Database (MSigDB (23)). We quantified molecular pathway and gene activity levels in tumour samples and established their associations through the following steps: 1) The normalized enrichment score for each pathway was calculated for individual samples using single-sample Gene Set Enrichment Analysis (ssGSEA (48)), and these scores were compiled into a pathway activity vector. Similarly, the VIPER algorithm (49) was used to determine the activity score of transcriptional regulators (TRs) based on the ARACNe-inferred cancer network (25). These scores formed the TR activity score vector. 2) Linear regression analysis was used to identify the regulatory relationship between gene activity (as the predictor) and pathway activity (as the response), denoted as the 'slope'. A positive slope indicates a direct association, whereas a negative slope indicates an inverse relationship. 3) The robustness of these associations was validated through bootstrapping with 100 iterations. 4) Pathways that showed significant concordance with prognostic genes (PGs) were categorized as prognostic pathways, highlighting their potential influence on patient outcomes. The analyses were performed using the TR2PATH (50) package within RStudio. We applied the Kolmogorov-Smirnov test to assess the differences in activity between patients who were alive and those who were deceased. Pathways with a P value > 0.05 were excluded from the analysis, which was performed using R.
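Steps 2 and 3 above (regression slope and bootstrap validation) can be sketched as follows. This is a simplified illustration and not the TR2PATH implementation: it fits an ordinary least-squares slope of pathway activity on regulator activity and resamples patients with replacement 100 times. The function names, the fixed random seed and the activity values are hypothetical.

```python
import random


def slope(x, y):
    """Ordinary least-squares slope of y (response) on x (predictor)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return sxy / sxx


def bootstrap_slopes(x, y, n_iter=100, seed=0):
    """Refit the slope on n_iter resamples of the patients (with replacement)."""
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    while len(slopes) < n_iter:
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        if len(set(xs)) < 2:
            continue  # slope is undefined when all predictor values coincide
        slopes.append(slope(xs, [y[i] for i in idx]))
    return slopes


# Hypothetical regulator activity (predictor) and pathway activity (response)
tr_activity = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
pathway_activity = [-3.1, -0.9, 1.05, 2.95, 5.1, 6.9]
full_slope = slope(tr_activity, pathway_activity)
boot = bootstrap_slopes(tr_activity, pathway_activity)
# Fraction of resamples whose slope has the same sign as the full-data slope
agreement = sum((s > 0) == (full_slope > 0) for s in boot) / len(boot)
```

A slope whose sign is stable across resamples suggests that the association does not depend on a small number of patients.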

Acknowledgments

We would like to thank the Human Protein Atlas, the Science for Life Laboratory, and the Swedish National Infrastructure for providing computational resources through the Uppsala Multidisciplinary Center for Advanced Computational Science (UPPMAX). We also thank The Cancer Genome Atlas for providing access to data.

Funding

This work was financially supported by the Knut and Alice Wallenberg Foundation (grant number 72110). MY's doctoral studies are sponsored by the China Scholarship Council (Grant No. 202006940003).

Author contribution

Data curation: MY, CZ and MS; Investigation: MY and CZ; Methodology: MY, CZ, XL, HY, XS, and AM; Visualization: MY; Software: KF and MZ; Project administration: CZ, MU and AM; Supervision: CZ, MU and AM; Funding acquisition: AM; Writing-original draft: MY; Writing-review and editing: CZ, MS, XL, HY, XS, HT, MU, and AM.

Competing interests

AM and MU are founders and shareholders of ScandiBio Therapeutics, ScandiEdge Therapeutics and Atlas Antibodies (MU). The other authors declare no competing interests.

Data availability

All data needed to evaluate the conclusions in the paper are present in the paper and/or the Supplementary Materials.

Code availability

The scripts required to reproduce the results presented in this paper are available in the GitHub repository (https://github.com/cellur-m/pathology_atlas).

H. Sung, J. Ferlay, R. L. Siegel, M. Laversanne, I. Soerjomataram, A. Jemal, F. Bray, Global Cancer Statistics 2020: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA: A Cancer Journal for Clinicians 71, 209-249 (2021).
T. Dyba, G. Randi, F. Bray, C. Martos, F. Giusti, N. Nicholson, A. Gavin, M. Flego, L. Neamtiu, N. Dimitrova, R. Negrão Carvalho, J. Ferlay, M. Bettio, The European cancer burden in 2020: Incidence and mortality estimates for 40 countries and 25 major cancers. European Journal of Cancer 157, 308-347 (2021).
C. Frick, H. Rumgay, J. Vignat, O. Ginsburg, E. Nolte, F. Bray, I. Soerjomataram, Quantitative estimates of preventable and treatable deaths from 36 cancers worldwide: a population-based study. The Lancet Global Health 11, e1700-e1712 (2023).
N. Howlader, G. Forjaz, M. J. Mooradian, R. Meza, C. Y. Kong, K. A. Cronin, A. B. Mariotto, D. R. Lowy, E. J. Feuer, The Effect of Advances in Lung-Cancer Treatment on Population Mortality. New England Journal of Medicine 383, 640-649 (2020).
D. J. Propper, F. R. Balkwill, Harnessing cytokines and chemokines for cancer therapy. Nature Reviews Clinical Oncology 19, 237-253 (2022).
Y.-M. Yang, P. Hong, W. W. Xu, Q.-Y. He, B. Li, Advances in targeted therapy for esophageal cancer. Signal Transduction and Targeted Therapy 5, 229 (2020).
K. C. Kurnit, G. F. Fleming, E. Lengyel, Updates and New Options in Advanced Epithelial Ovarian Cancer Treatment. Obstet Gynecol 137, 108-121 (2021).
J. Fan, K. Slowikowski, F. Zhang, Single-cell transcriptomics in cancer: computational challenges and opportunities. Experimental & Molecular Medicine 52, 1452-1465 (2020).
O. Menyhárt, B. Győrffy, Multi-omics approaches in cancer research with applications in tumor subtyping, prognosis, and diagnosis. Computational and Structural Biotechnology Journal 19, 949-960 (2021).
M. Hong, S. Tao, L. Zhang, L.-T. Diao, X. Huang, S. Huang, S.-J. Xie, Z.-D. Xiao, H. Zhang, RNA sequencing: new technologies and applications in cancer research. Journal of Hematology & Oncology 13, 166 (2020).
M. Uhlen, C. Zhang, S. Lee, E. Sjöstedt, L. Fagerberg, G. Bidkhori, R. Benfeitas, M. Arif, Z. Liu, F. Edfors, K. Sanli, K. v. Feilitzen, P. Oksvold, E. Lundberg, S. Hober, P. Nilsson, J. Mattsson, J. M. Schwenk, H. Brunnström, B. Glimelius, T. Sjöblom, P.-H. Edqvist, D. Djureinovic, P. Micke, C. Lindskog, A. Mardinoglu, F. Ponten, A pathology atlas of the human cancer transcriptome. Science 357, eaan2507 (2017).
Z. Tang, B. Kang, C. Li, T. Chen, Z. Zhang, GEPIA2: an enhanced web server for large-scale expression profiling and interactive analysis. Nucleic Acids Research 47, W556-W560 (2019).
D. Miao, C. A. Margolis, W. Gao, M. H. Voss, W. Li, D. J. Martini, C. Norton, D. Bossé, S. M. Wankowicz, D. Cullen, C. Horak, M. Wind-Rotolo, A. Tracy, M. Giannakis, F. S. Hodi, C. G. Drake, M. W. Ball, M. E. Allaf, A. Snyder, M. D. Hellmann, T. Ho, R. J. Motzer, S. Signoretti, W. G. Kaelin, T. K. Choueiri, E. M. Van Allen, Genomic correlates of response to immune checkpoint therapies in clear cell renal cell carcinoma. Science 359, 801-806 (2018).
Y. Jiang, A. Sun, Y. Zhao, W. Ying, H. Sun, X. Yang, B. Xing, W. Sun, L. Ren, B. Hu, C. Li, L. Zhang, G. Qin, M. Zhang, N. Chen, M. Zhang, Y. Huang, J. Zhou, Y. Zhao, M. Liu, X. Zhu, Y. Qiu, Y. Sun, C. Huang, M. Yan, M. Wang, W. Liu, F. Tian, H. Xu, J. Zhou, Z. Wu, T. Shi, W. Zhu, J. Qin, L. Xie, J. Fan, X. Qian, F. He, F. He, X. Qian, J. Qin, Y. Jiang, W. Ying, W. Sun, Y. Zhu, W. Zhu, Y. Wang, D. Yang, W. Liu, Q. Liu, X. Yang, B. Zhen, Z. Wu, J. Fan, H. Sun, J. Qian, T. Hong, L. Shen, B. Xing, P. Yang, H. Shen, L. Zhang, S. Cheng, J. Cai, X. Zhao, Y. Sun, T. Xiao, Y. Mao, X. Chen, D. Wu, L. Chen, J. Dong, H. Deng, M. Tan, Z. Wu, Q. Zhao, Z. Shen, X. Chen, Y. Gao, W. Sun, T. Wang, S. Liu, L. Lin, J. Zi, X. Lou, R. Zeng, Y. Wu, S. Cai, B. Jiang, A. Chen, Z. Li, F. Yang, X. Chen, Y. Sun, Q. Wang, Y. Zhang, G. Wang, Z. Chen, W. Qin, Z. Li, C. Chinese Human Proteome Project, Proteomics identifies new therapeutic targets of early-stage hepatocellular carcinoma. Nature 567, 257-261 (2019).
X. Li, K. Shong, W. Kim, M. Yuan, H. Yang, Y. Sato, H. Kume, S. Ogawa, H. Turkez, S. Shoaie, J. Boren, J. Nielsen, M. Uhlen, C. Zhang, A. Mardinoglu, Prediction of drug candidates for clear cell renal cell carcinoma using a systems biology-based drug repositioning approach. eBioMedicine 78, (2022).
M. Yuan, K. Shong, X. Li, S. Ashraf, M. Shi, W. Kim, J. Nielsen, H. Turkez, S. Shoaie, M. Uhlen, C. Zhang, A. Mardinoglu, A Gene Co-Expression Network-Based Drug Repositioning Approach Identifies Candidates for Treatment of Hepatocellular Carcinoma. Cancers 14, 1573 (2022).
O. K. Graves, W. Kim, M. Özcan, S. Ashraf, H. Turkez, M. Yuan, C. Zhang, A. Mardinoglu, X. Li, Discovery of drug targets and therapeutic agents based on drug repositioning to treat lung adenocarcinoma. Biomedicine & Pharmacotherapy 161, 114486 (2023).
M. Uhlén, L. Fagerberg, B. M. Hallström, C. Lindskog, P. Oksvold, A. Mardinoglu, Å. Sivertsson, C. Kampf, E. Sjöstedt, A. Asplund, I. Olsson, K. Edlund, E. Lundberg, S. Navani, C. A.-K. Szigyarto, J. Odeberg, D. Djureinovic, J. O. Takanen, S. Hober, T. Alm, P.-H. Edqvist, H. Berling, H. Tegel, J. Mulder, J. Rockberg, P. Nilsson, J. M. Schwenk, M. Hamsten, K. v. Feilitzen, M. Forsberg, L. Persson, F. Johansson, M. Zwahlen, G. v. Heijne, J. Nielsen, F. Pontén, Tissue-based map of the human proteome. Science 347, 1260419 (2015).
M. Uhlen, C. Zhang, S. Lee, E. Sjöstedt, L. Fagerberg, G. Bidkhori, R. Benfeitas, M. Arif, Z. Liu, F. Edfors, K. Sanli, K. v. Feilitzen, P. Oksvold, E. Lundberg, S. Hober, P. Nilsson, J. Mattsson, J. M. Schwenk, H. Brunnström, B. Glimelius, T. Sjöblom, P.-H. Edqvist, D. Djureinovic, P. Micke, C. Lindskog, A. Mardinoglu, F. Ponten, A pathology atlas of the human cancer transcriptome. Science 357, eaan2507 (2017).
M. Gao, Y. Lin, X. Liu, Y. Li, C. Zhang, Z. Wang, Z. Wang, Y. Wang, Z. Guo, ISG20 promotes local tumor immunity and contributes to poor survival in human glioma. OncoImmunology 8, e1534038 (2019).
T. Xu, H. Ruan, S. Gao, J. Liu, Y. Liu, Z. Song, Q. Cao, K. Wang, L. Bao, D. Liu, J. Tong, J. Shi, H. Liang, H. Yang, K. Chen, X. Zhang, ISG20 serves as a potential biomarker and drives tumor progression in clear cell renal cell carcinoma. Aging (Albany NY) 12, 1808-1827 (2020).
Z. Chen, M. Yin, H. Jia, Q. Chen, H. Zhang, ISG20 stimulates anti-tumor immunity via a double-stranded RNA-induced interferon response in ovarian cancer. Frontiers in Immunology 14, 1176103 (2023).
A. Liberzon, A. Subramanian, R. Pinchback, H. Thorvaldsdóttir, P. Tamayo, J. P. Mesirov, Molecular signatures database (MSigDB) 3.0. Bioinformatics 27, 1739-1740 (2011).
P. Li, P. Lan, S. Liu, Y. Wang, P. Liu, Cell Polarity Protein Pals1-Associated Tight Junction Expression Is a Favorable Prognostic Marker in Clear Cell Renal Cell Carcinoma. Frontiers in Genetics 11, (2020).
A. A. Margolin, I. Nemenman, K. Basso, C. Wiggins, G. Stolovitzky, R. D. Favera, A. Califano, ARACNE: An Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics 7, S7 (2006).
S. Takakura, T. Kohno, R. Manda, A. Okamoto, T. Tanaka, J. Yokota, Genetic alterations and expression of the protein phosphatase 1 genes in human cancers. International journal of oncology 18, 817-824 (2001).
K. Miyakuni, J. Nishida, D. Koinuma, G. Nagae, H. Aburatani, K. Miyazono, S. Ehata, Genome-wide analysis of DNA methylation identifies the apoptosis-related gene UQCRH as a tumor suppressor in renal cancer. Molecular Oncology 16, 732-749 (2022).
Y. C. Chong, T. B. Toh, Z. Chan, Q. X. X. Lin, D. K. H. Thng, L. Hooi, Z. Ding, T. Shuen, H. C. Toh, Y. Y. Dan, Targeted inhibition of purine metabolism is effective in suppressing hepatocellular carcinoma progression. Hepatology communications 4, 1362-1381 (2020).
Q. Zhu, Y. Hu, W. Jiang, Z.-L. Ou, Y.-B. Yao, H.-Y. Zai, Circ-CCT2 Activates Wnt/β-catenin Signaling to Facilitate Hepatoblastoma Development by Stabilizing PTBP1 mRNA. Cellular and Molecular Gastroenterology and Hepatology 17, 175-197 (2024).
S. Y. Wen, Y. T. Liu, B. Y. Wei, J. Q. Ma, Y. Y. Chen, PDCD6 Promotes Hepatocellular Carcinoma Cell Proliferation and Metastasis through the AKT/GSK3β/β-catenin Pathway. Biomedical and Environmental Sciences 36, 241-252 (2023).
N. Elgohary, R. Pellegrino, O. Neumann, H. M. Elzawahry, M. M. Saber, A. A. Zeeneldin, R. Geffers, V. Ehemann, P. Schemmer, P. Schirmacher, Protumorigenic role of Timeless in hepatocellular carcinoma. International Journal of Oncology 46, 597-606 (2015).
G. Gandaglia, A. Becker, Q. D. Trinh, F. Abdollah, J. Schiffmann, F. Roghmann, Z. Tian, F. Montorsi, A. Briganti, P. I. Karakiewicz, M. Sun, Long-term survival in patients with germ cell testicular cancer: A population-based competing-risks regression analysis. European Journal of Surgical Oncology (EJSO) 40, 103-112 (2014).
M. P. Purdue, S. J. Hutchings, L. Rushton, D. T. Silverman, The proportion of cancer attributable to occupational exposures. Annals of Epidemiology 25, 188-192 (2015).
J. Kim, J. E. Gosnell, S. A. Roman, Geographic influences in the global rise of thyroid cancer. Nature Reviews Endocrinology 16, 17-29 (2020).
J. Gammall, A. G. Lai, Pan-cancer prognostic genetic mutations and clinicopathological factors associated with survival outcomes: a systematic review. npj Precision Oncology 6, 27 (2022).
J. Liu, T. Lichtenberg, K. A. Hoadley, L. M. Poisson, A. J. Lazar, A. D. Cherniack, A. J. Kovatich, C. C. Benz, D. A. Levine, A. V. Lee, L. Omberg, D. M. Wolf, C. D. Shriver, V. Thorsson, The Cancer Genome Atlas Research Network, H. Hu, An Integrated TCGA Pan-Cancer Clinical Data Resource to Drive High-Quality Survival Outcome Analytics. Cell 173, 400-416.e411 (2018).
B. Glimelius, B. Melin, G. Enblad, I. Alafuzoff, A. Beskow, H. Ahlström, A. Bill-Axelson, H. Birgisson, O. Björ, P.-H. Edqvist, T. Hansson, T. Helleday, P. Hellman, K. Henriksson, G. Hesselager, M. Hultdin, M. Häggman, M. Höglund, H. Jonsson, C. Larsson, H. Lindman, I. Ljuslinder, S. Mindus, P. Nygren, F. Pontén, K. Riklund, R. Rosenquist, F. Sandin, J. M. Schwenk, R. Stenling, K. Stålberg, P. Stålberg, C. Sundström, C. Thellenberg Karlsson, B. Westermark, A. Bergh, L. Claesson-Welsh, R. Palmqvist, T. Sjöblom, U-CAN: a prospective longitudinal collection of biomaterials and clinical information from adult cancer patients in Sweden. Acta Oncologica 57, 187-194 (2018).
Y. Wu, M. Fletcher, Z. Gu, Q. Wang, B. Costa, A. Bertoni, K.-H. Man, M. Schlotter, J. Felsberg, J. Mangei, M. Barbus, A.-C. Gaupel, W. Wang, T. Weiss, R. Eils, M. Weller, H. Liu, G. Reifenberger, A. Korshunov, P. Angel, P. Lichter, C. Herrmann, B. Radlwimmer, Glioblastoma epigenome profiling identifies SOX10 as a master regulator of molecular tumour subtype. Nature Communications 11, 6434 (2020).
Y. Sato, T. Yoshizato, Y. Shiraishi, S. Maekawa, Y. Okuno, T. Kamura, T. Shimamura, A. Sato-Otsubo, G. Nagae, H. Suzuki, Y. Nagata, K. Yoshida, A. Kon, Y. Suzuki, K. Chiba, H. Tanaka, A. Niida, A. Fujimoto, T. Tsunoda, T. Morikawa, D. Maeda, H. Kume, S. Sugano, M. Fukayama, H. Aburatani, M. Sanada, S. Miyano, Y. Homma, S. Ogawa, Integrated molecular analysis of clear-cell renal cell carcinoma. Nature Genetics 45, 860-867 (2013).
A. Mezheyeuski, C. H. Bergsland, M. Backman, D. Djureinovic, T. Sjöblom, J. Bruun, P. Micke, Multispectral imaging for quantitative and compartment-specific immune infiltrates reveals distinct immune profiles that classify lung cancer patients. The Journal of Pathology 244, 421-431 (2018).
T. Goldmann, S. Marwitz, D. Nitschkowski, R. Krupar, M. Backman, H. Elfving, V. Thurfjell, A. Lindberg, H. Brunnström, L. La Fleur, A. Mezheyeuski, J. S. M. Mattsson, J. Botling, P. Micke, C. Strell, PD-L1 amplification is associated with an immune cell rich phenotype in squamous cell cancer of the lung. Cancer Immunology, Immunotherapy 70, 2577-2587 (2021).
A. R. Quinlan, I. M. Hall, BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842 (2010).
N. L. Bray, H. Pimentel, P. Melsted, L. Pachter, Near-optimal probabilistic RNA-seq quantification. Nature Biotechnology 34, 525-527 (2016).
M. Uhlén, L. Fagerberg, B. M. Hallström, C. Lindskog, P. Oksvold, A. Mardinoglu, Å. Sivertsson, C. Kampf, E. Sjöstedt, A. Asplund, I. Olsson, K. Edlund, E. Lundberg, S. Navani, C. A.-K. Szigyarto, J. Odeberg, D. Djureinovic, J. O. Takanen, S. Hober, T. Alm, P.-H. Edqvist, H. Berling, H. Tegel, J. Mulder, J. Rockberg, P. Nilsson, J. M. Schwenk, M. Hamsten, K. von Feilitzen, M. Forsberg, L. Persson, F. Johansson, M. Zwahlen, G. von Heijne, J. Nielsen, F. Pontén, Tissue-based map of the human proteome. Science 347, 1260419 (2015).
H. Stoppiglia, G. Dreyfus, R. Dubois, Y. Oussar, Ranking a random feature for variable and feature selection. Journal of Machine Learning Research 3, 1399-1414 (2003).
D. Tsoucas, R. Dong, H. Chen, Q. Zhu, G. Guo, G.-C. Yuan, Accurate estimation of cell-type composition from gene expression data. Nature Communications 10, 2975 (2019).
M. Kanehisa, M. Furumichi, M. Tanabe, Y. Sato, K. Morishima, KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45, D353-D361 (2016).
A. Subramanian, P. Tamayo, V. K. Mootha, S. Mukherjee, B. L. Ebert, M. A. Gillette, A. Paulovich, S. L. Pomeroy, T. R. Golub, E. S. Lander, J. P. Mesirov, Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proceedings of the National Academy of Sciences 102, 15545-15550 (2005).
M. J. Alvarez, Y. Shen, F. M. Giorgi, A. Lachmann, B. B. Ding, B. H. Ye, A. Califano, Functional characterization of somatic mutations in cancer using network-based inference of protein activity. Nature Genetics 48, 838-847 (2016).
S. Panja, M. I. Truica, C. Y. Yu, V. Saggurthi, M. W. Craige, K. Whitehead, M. V. Tuiche, A. Al-Saadi, R. Vyas, S. Ganesan, S. Gohel, F. Coffman, J. S. Parrott, S. Quan, S. Jha, I. Kim, E. Schaeffer, V. Kothari, S. A. Abdulkadir, A. Mitrofanova, Mechanism-centric regulatory network identifies NME2 and MYC programs as markers of Enzalutamide resistance in CRPC. Nature Communications 15, 352 (2024).
