Tumor cell total mRNA expression shapes the molecular and clinical phenotype of cancer

Cancers can vary greatly in their transcriptomes. In contrast to alterations in specific genes or pathways, differences in tumor cell total mRNA content have not been comprehensively assessed. Technical and analytical challenges have impeded examination of total mRNA expression at scale across cancers. To address this, we developed a model for quantifying tumor-specific total mRNA expression (TmS) from bulk sequencing data, which performs transcriptomic deconvolution while adjusting for mixed genomes. We used single-cell RNA sequencing data to demonstrate total mRNA expression as a feature of tumor phenotype. We estimated and validated TmS in 5,015 patients across 15 cancer types identifying significant inter-individual variability. At a pan-cancer level, high TmS is associated with increased risk of disease progression and death. Cancer type-specific patterns of genetic alterations, intra-tumor genetic heterogeneity, as well as pan-cancer trends in metabolic dysregulation and hypoxia contribute to TmS. Taken together, our results suggest that measuring cell-type specific total mRNA expression offers a broader perspective of tracking cancer transcriptomes, which has important biological and clinical implications. (TmS), which captures the ratio of total mRNA expression per haploid genome in tumor cells versus surrounding non-tumor cells. We first scrutinized total mRNA expression using single-cell data across four cancer types 26–28 , and then calculated TmS in matching bulk RNA and DNA data from 5,015 patients across 15 cancer types from TCGA, ICGC 29 , and at MD Anderson Cancer Center 30 (NCT01946165). Our analyses revealed that variation in total mRNA expression is a feature that captures cancer cell plasticity at a patient-sample level and predicts prognosis across cancers. with subsequently processed using single-cell These cells were then collected and sent for single-cell RNAseq experiments. The distributions of total UMI counts and gene counts are shown. d, Distribution of TmS for tumor samples from prostate cancer patients in TCGA who were untreated, and patients at MD Anderson who underwent treatment with ADT+ARSI. All samples from TCGA presented a Gleason score of 8+, matching the patient population at MDA.


Introduction
Reprogramming of the transcriptional landscape is a critical hallmark of cancer, which accompanies cancer progression, metastasis and resistance to treatment 1,2 . Recent single-cell studies revealed expansion of cell state heterogeneity in cancer cells that arise largely independently of genetic variation [3][4][5] , bringing new insights into long-standing topics of cancer cell plasticity 6 and cancer stem cells 7 .
Assessing these clinically relevant topics 8,9 in large patient cohorts, however, has been difficult due to the high cost and sample quality requirements associated with most single-cell technologies. As bulk tumor RNA and DNA sequencing data are already obtained from large patient series with clinical outcomes, in silico approaches to analyze human tissues may expedite our understanding of tumor heterogeneity. We thus aim to develop such an approach.
Previous approaches to build cellular differentiation hierarchies are not suitable for in vivo human tissue studies and further require known cell type-specific genetic markers 10 . Variation in total mRNA amount, i.e., the sum of detectable mRNA transcripts across all genes per cell, has been linked indirectly to cancer progression and de-differentiation as a result of MYC activation 11,12 or aneuploidy 13,14 . Single-cell studies recently demonstrated that the total number of expressed genes per cell is more predictive of cellular phenotype than alterations in any specific genes or pathways 15,16 . With current limitations in our knowledge of marker genes across cancers, total mRNA expression per tumor cell may represent a robust and measurable pan-cancer feature that warrants a systematic evaluation in patient cohorts.
Total tumor-cell mRNA expression information is masked during standard bulk data analysis, thus requiring deconvolution. Variation in total mRNA transcript levels is removed by routine normalization, together with technical biases such as read depth and library preparation [17][18][19][20] . Data generated from cancer studies contain reads from both tumor and admixed normal cells. Copy number aberrations such as gain or loss of chromosomal copies (i.e., ploidy) in tumor cells affect gene expression through dosage effect 14 .
Here, building upon prior work in bulk transcriptome deconvolution [21][22][23] and in modelling tumor ploidy 24,25 , we created a measure of tumor-specific total mRNA expression (TmS), which captures the ratio of total mRNA expression per haploid genome in tumor cells versus surrounding non-tumor cells. We first scrutinized total mRNA expression using single-cell data across four cancer types [26][27][28] , and then calculated TmS in matching bulk RNA and DNA data from 5,015 patients across 15 cancer types from TCGA, ICGC 29 , and at MD Anderson Cancer Center 30 (NCT01946165). Our analyses revealed that variation in total mRNA expression is a feature that captures cancer cell plasticity at a patient-sample level and predicts prognosis across cancers. 5

Diversity in total mRNA expression as a hallmark of cancer cells
To motivate a model-based quantification of total mRNA expression in bulk tissue, we first studied singlecell RNA sequencing (scRNAseq) data generated from 46,468 cells in human colorectal, liver 27 , lung 26 and pancreatic tumors 28 (Fig. 1a, Supplementary Fig. 1a; Methods, Supplementary Note 1.1).
Total unique molecular identifier (UMI) counts can be modelled as cell size (total RNA molecule counts) multiplied by cell capture efficiency 31 . We find total UMI counts are highly correlated with gene counts, the total number of expressed genes per cell, across all cells and cell types (median Spearman r = 0.96, median absolute deviation (MAD) = 0.03, Supplementary Note 1.2). This suggests a similar utility of total UMIs as gene counts for annotating cellular phenotype 15 , where technological effects such as capture efficiency 32

are nuisance parameters largely constant across cells (Supplementary Note 1.2).
Expected as a cancer hallmark, we observe larger variability in total UMI and gene counts in tumor cells compared to non-tumor cells (epithelial, stromal and immune cells) within individual experiments where mRNAs from all cells were generated using the same library (F-test for variance p values < 0.02, Supplementary Fig. 1b). Consistent with recent reports from both human and mouse tumors 26,33 , we find multiple Seurat 34 clusters within tumor and non-tumor cells presenting distinct total UMI and gene counts ( Fig. 1b-c, Supplementary Fig. 2a; Methods, Supplementary Note 1.2). Trajectory inference using Monocle [35][36][37] shows distinct gene expression states among these clusters (Fig. 1d,   Supplementary Fig. 2b). High-UMI tumor cells present a less differentiated state 15 (adjusted P values < 0.001, Fig. 1d, Supplementary Fig. 2b; Methods), together with a lower or similar cell cycle state.
Hence, UMI count is not a surrogate measure for proliferation (Supplementary Fig. 2c). In four patients with a shorter time-to-disease progression (colon, liver and pancreas cancers), or advanced-stage disease (lung cancer), the high-UMI tumor cell clusters present differentiation scores close to 1, indicating a stem-like cell state ( Fig. 1c-d, Supplementary Fig. 2b). This is further supported by their enrichment of stemness and the epithelial-mesenchymal-transition (EMT) gene sets across cancers, out of 18,617 gene sets 38,39 Table 2; Methods). Our observations confirm the utility of total UMI counts to measure a feature of cancer cell plasticity.

investigated (Supplementary
When normalized and pooled across cells (pseudo-bulk, Methods), increased average total UMI counts for tumor vs. non-tumor cells are observed in these four patient samples, as compared to other samples within each cancer type (Fig. 1e). Given that the observed differences in average total UMI counts are large and variable across patients, we hypothesize that a quantification of average tumor-specific total mRNA expression in bulk sequencing data is useful to track tumor phenotype and clinical behavior. 6 Estimating tumor-specific total mRNA expression from bulk sequencing data In order to quantify tumor-specific total RNA expression across a large number of patient samples, we employ three steps in a sequential deconvolution of matched DNA/RNAseq data ( Fig. 2a Fig. 3; Methods). The parameter can be derived using RNAseq or microarray data (e.g., using DeMixT 23 ). Using a profile likelihood of the DeMixT model to rank genes for each study cohort, we identify top-ranked genes as an intrinsic tumor signature gene set, where genes follow a unimodal distribution with low variance across the hidden tumor component and express differentially from the nontumor component (Supplementary Fig. 4a We benchmarked the performance of TmS estimation using total RNAseq data generated from mixed cell populations with known proportions 23 , resulting in accurate separation of H1092 lung cancer cells from cancer-associated fibroblasts ( Fig. 2b- Fig. 4c). Across cancers, they are consistently enriched in housekeeping and essential genes 41,42 , and in cancer hallmark 38  TmS is associated with prognostic clinicopathologic characteristics TmS is dependent upon the background normal reference tissue. For tumors derived from the same tissue, comparisons can be made between known histopathologic and/or molecular subtypes. Consistent trends are observed between subtypes of head-and-neck squamous cell carcinoma, breast carcinoma, renal papillary carcinoma, and prostate adenocarcinoma, where prognostically favorable subtypes are enriched in tumors with lower TmS and vice versa (Fig. 4a-e; Methods). In head-and-neck squamous cell carcinoma, the prognostically favorable human papillomavirus (HPV)-positive subtype has lower median TmS than the HPV-negative subtype (P value = 0.004, Fig. 4a). Similarly, triple negative receptor status is associated with higher TmS in breast carcinoma (adjusted P value = 5 x 10 -36 , Fig. 4b), in keeping with this subtype's known propensity for aggressive behavior. Subtypes of renal papillary carcinoma also show significant differences in TmS, where the more aggressive Type II tumors have higher TmS compared to Type I (P value = 7 x 10 -5 , Fig. 4c) 44 . In bladder cancer, increased TmS is associated with the basal molecular subtype, which is known to exhibit more aggressive behavior compared to luminal tumors (P value = 2 x 10 -4 , Fig. 4d) [45][46][47] . In prostate adenocarcinoma, TmS is associated with tumor grade as assessed by Gleason score, where high TmS tumors enrich for Gleason scores of 8 and above (P value = 0.004, Fig. 4e). To further support its unique biological identity, we find that TmS is not a surrogate for Tumor-Node-Metastasis stage, cellular proliferation or other patient characteristics such as age and sex, as is shown by weak to no correlations between TmS and the corresponding variables (Supplementary Fig. 5a-c).

TmS refines prognostication across cancer types in TCGA and ICGC-EOPC
We examined the association of TmS with overall survival (OS) and progression-free interval (PFI) across TCGA cohorts (Methods, Supplementary Note 3). In pan-cancer analyses, high TmS is associated with reduced OS and PFI (Fig. 4f, Supplementary Fig. 5d). We next examined each cancer type in the context of overall TNM stage classification, which is used across cancers for predicting prognosis and treatment decision-making. Analysis stratified by early (I/II) vs. advanced (III/IV) stages reveal three different patterns for the effects of TmS by stage. The first group of tumors show consistent effects across stages ( Fig. 4g; Supplementary Fig. 5e). This includes thyroid, lung adeno, colorectal, hepatocellular, stomach adeno, and renal clear cell carcinomas, where high TmS is associated with higher risks of death and/or disease progression within both early-and advanced-stage tumors. Prostate adenocarcinoma also belongs to this group, where stage is replaced by Gleason scores 7 vs. 8+ in line with clinical practice.
In head-and-neck squamous cell, lung squamous cell, and bladder urothelial carcinomas, high TmS is 8 associated with reduced survival in early-stage tumors only, while for advanced-stage tumors, high TmS is associated with improved survival ( Fig. 4h; Supplementary Fig. 5f). An opposite pattern is observed in breast and renal papillary carcinomas, where high TmS is associated with poor prognosis in advancedstage tumors, but improved survival in early-stage tumors ( Fig. 4i; Supplementary Fig. 5g). TmS remains significantly associated with survival outcomes in Cox regression models, after adjusting for key prognostic characteristics including subtype, stage and age, except in hepatocellular carcinomas and prostate adenocarcinoma (TCGA), where only a trend towards significance was observed (Fig. 4j, Supplementary Table 5). Early-onset prostate cancer from ICGC-EOPC shows a significant effect of TmS within Gleason score = 7 (Fig. 4j, Supplementary Fig. 5e, Supplementary Table 5). We also find that TmS, which is corrected for tumor ploidy and expressed per haploid genome copy, shows better risk stratification than an analogous measure without ploidy adjustment (Supplementary Fig. 6).
Pan-cancer and cancer-specific contributions to tumor-specific total mRNA expression We hypothesize that as a feature of tumor phenotype, tumor-specific total mRNA expression should be associated with interacting biological processes (Fig. 5a). Differential gene expression analyses comparing high to low TmS samples demonstrate enrichment for metabolic pathways across cancers (Supplementary Fig. 7a; Methods). Of the typically seven metabolic pathways of carbohydrates 48 , the pentose phosphate pathway is upregulated in high TmS samples, achieving statistical significance in 12 out of 15 cancer types (Fig. 5b, Supplementary Fig. 7a-b). This is a major pathway for nucleotide synthesis through the formation of ribose 5-phosphate. As such, tumors with high TmS may require upregulation of this pathway to provide nucleotides for mRNA synthesis. Moreover, we find a dichotomized hypoxia score 49 , based on 51 genes, is significantly associated with TmS across all 14 cancer types with available data ( Fig. 5c; Methods). High TmS correlates with high hypoxia in all cancer types except head-and-neck cancers. While there are no overlapping genes between the hypoxia and pentose phosphate pathway gene sets, the hypoxia scores are moderately correlated with pentose phosphate score across cancers (Supplementary Fig. 7c; Supplementary Note 4). Network analysis shows dense connectivity between a subnetwork of 61 genes from these two pathways (density score = 0.32 50 , Supplementary Fig. 7d; Methods). The consistent and pan-cancer identification of both biological processes supports the molecular relevance of our proposed TmS measurement.
Interrogation of genetic alterations in TCGA revealed cancer-specific patterns associated with TmS. We performed two interrogations in parallel: an agnostic association analysis of TmS with all nonsynonymous mutations (32,894 cancer-gene pairs, using logistic regression models to adjust for covariates such as tumor mutation burden), as well as a driver mutation-specific association analysis of TmS (24 cancer-gene pairs). We find 5 overlapping pairs out of 6 statistically significant pairs produced from each interrogation (Fig. 5d, adjusted P values < 0.01). The additional pair found through the agnostic 9 search (FGFR3 in bladder carcinoma) was missed in driver mutation analysis due to a lack of sample size. This suggests TmS can capture changes in tumor phenotypes induced by specific driver mutations 51 .
Furthermore, we examined broad-scale genetic alterations, including tumor mutation burden (TMB), chromosomal instability (CIN), and whole-genome duplication status (WGD). Significant associations are sparsely identified across cancers, suggesting that these features may contribute to tumor-specific total mRNA expression in specific cancer types but are not universal determinants (Supplementary Fig. 8).
Case studies in cancer patients with varying tumor-cell total mRNA expression Finally, we provide two independent cohort studies to validate that TmS can quantify variations in tumor cell total mRNA expression in response to different biological processes.
Renal medullary carcinoma (RMC) is a rare but highly aggressive disease characterized by very poor prognosis compared with clear cell renal cell carcinoma (KIRC), papillary renal cell carcinoma (KIRP) and chromophobe renal cell carcinoma (KICH), which are the three most common renal cell carcinoma subtypes 52 . All RMC tumors harbor inactivating SMARCB1 alterations leading to high MYC expression and replicative stress 30 . As a result, morphological features consistent with high levels of RNA transcription, such as open phase nuclei with prominent nucleoli, are a hallmark of RMC (Fig. 6a). We analyzed matched RNAseq and exome sequencing data from untreated primary RMC tumor samples and adjacent kidney control tissues, derived from 9 patients with clinicopathologic features representative of this highly aggressive disease 30 (Methods). Significantly higher TmS values are presented in RMC than those from KIRC, KIRP and KICH (with the same normal tissue background) in TCGA (Fig. 6b), demonstrating this MYC-driven renal cell carcinoma is distinguished from other renal cell carcinomas by increased tumor-cell total mRNA expression.
Androgen deprivation therapy (ADT) and enhanced androgen receptor (AR) signaling inhibitors (ARSIs) such as Enzalutamide (Enza) and Abiraterone (Abi) are currently used to treat prostate cancer patients 53,54 . AR activity has been shown to induce chromatin changes which leads to increased transcriptional activity [55][56][57][58] . Therefore, we hypothesize that tumor cells that remain after ADT+ARSI treatments will have reduced tumor-cell total mRNA expression levels. First, scRNAseq data from the prostate cancer cell line LNCaP treated with and without Enza for 48 hours (Methods, Supplementary Note 1) showed Enza-treatment reduced TmS in these cells (Fig. 6c). ARSI agents are used in clinical trials studying their effects in the treatment of high-risk primary prostate cancer (e.g., NCT01946165 at MD Anderson). We find considerably lower TmS in radical prostatectomy tissue specimens from 24 patients with high-risk localized disease (with Gleason score ³ 8) who were treated with ADT+ARSI for 6 months, as compared to untreated tumor samples from patients (with Gleason score ³ 8) in TCGA ( Fig.  6d; Methods), demonstrating that AR signaling inhibition in prostate cancer leads to decreased tumorcell total mRNA expression.

Discussion
Our study identifies TmS, a robust and measurable feature of cancer cell plasticity, from bulk tumor tissues that is clinically and molecularly relevant across cancers. While single-cell technology depicts tumor cell populations with distinct gene expression states (a microscopic view), questions remain on how these populations coexist and interchange to impact clinical outcome 6 . Average signals across all tumor cells intrinsically account for the magnitudes and fractions of each population and are demonstrated by this study to be informative for the clinical presentations of cancer (a macroscopic view).
Cancer cell plasticity, previously evaluated within a few tumors or in model systems only 9 , is measured in a pan-cancer panorama (5,015 patient samples from 15 cancer types) by TmS, an integrative RNA and DNA deconvolution metric for bulk tissues. Association of TmS with genomic features, metabolism and hypoxia, supports a consistent and biologically meaningful measurement of a bulk-level feature of tumor phenotype. The ability of TmS to refine prognostication is observed within each of the 13 cancer types where pathological stages are available, which was not achievable by signature genes for EMT (the most studied pathway for plasticity) 59 alone in bulk tissues 60 .
While high tumor-cell total mRNA expression is generally associated with aggressive disease, clinical context remains important to evaluate its prognostic implications, as the direction of the prognostic effect was inverted by stage in five out of thirteen cancer types. Such complexity is expected and consistent with often contradictory findings of cancer stem cell models across cancer types 6,8 , as the biological underpinnings are complex and largely unknown. Given that early and advanced stage tumors and different tumor types are often treated using distinct modalities, the inverted effect may in part be underpinned by a differential response of tumors with low vs. high total mRNA expression to treatment conditions. The prognostication effect of TmS in bladder cancer as pattern II (Fig. 4h) is supported by evidence shown in the basal molecular subtype, which, although more aggressive than the luminal subtype and presenting mostly in an advanced stage, is more responsive to frontline cisplatin-based chemotherapy 45,46 . Additional studies incorporating data from clinical trials will be needed to elucidate how stage-specific and treatment-related factors interact with tumor-cell total mRNA expression to determine patient outcome.
Conceptually, analogous to DNA ploidy measuring the total DNA content calculated per haploid genome, the total mRNA content per haploid genome can be considered the "ploidy of the transcriptome". Total mRNA content is a key parameter of tumor heterogeneity and phenotype plasticity, previously hidden in most RNA-based assays. While our current work focuses on interpretation of mRNA, TmS is validated 11 for total RNA and the methodology developed here can readily be applied to the quantification of other RNA species (e.g. rRNA, miRNA, piRNA), further illuminating the cancer transcriptome. Enhanced attention to "transcriptome ploidy" will enable better phenotypic characterization and a deeper biological understanding of transcriptional dysregulation in cancer and other diseases.

Methods
Additional details and results are described in Supplementary Note. Here, we summarize the key aspects of the analysis.
Total mRNA expression in single-cell RNA sequencing data

Dataset
We collected single-cell RNA sequencing (scRNAseq) data from nine patients, comprising 2 with colorectal adenocarcinoma, 3 with hepatocellular carcinoma, 2 with lung adenocarcinoma, and 2 with pancreatic adenocarcinoma (Supplementary Table 1), as well as from the LNCaP prostate cancer cell line. A full description is provided in Supplementary Note, Section 1.1.

Quality control, clustering, cell type annotation, and normalized UMI
For each sample, we first filtered out cells based on number of genes expressed, total UMI (unique molecular identifier) counts, and proportion of total UMI counts derived from mitochondrial genes. We also removed cells that are detected as doublets. After the quality control, the nine patient samples across four cancer types had 46,468 cells remaining, and the two samples from the LNCaP cell line had 6,707 cells remaining, respectively. Within each patient sample, highly variable genes were detected and used for principal component analysis (PCA). Cells were then clustered with the Seurat package 34 . Cell type was annotated using known marker genes 26,27,[61][62][63] . Tumor cells were identified based on the inferred presence of somatic copy number alterations by inferCNV 64 . We further merged Seurat 34 identified clusters that were not significantly different in gene counts, which is the total number of expressed genes (Wilcoxon rank-sum test, α=0.001, Fig. 1b). A full description is provided in Supplementary Note, Section 1.2.1.
To enable comparison between different scRNAseq samples within the same study, we performed scale normalization for the total raw UMI counts across all cells, to ensure the total UMI count per cell are comparable for different samples from the same study. A full description is provided in Supplementary Note, Section 1.2.2.
Trajectory and differentiation score calculation in tumor cells 12 We applied Monocle 2 (version 2.14.0) [35][36][37] to construct single-cell trajectories, and used the CytoTRACE score to measure the differentiation state of tumor cells 15 . For Monocle 2, we ran "differentialGeneTest" (cutoff: q-value < 0.01) to select differentially expressed (DE) genes across tumor cell clusters, then used the "DDR-Tree" method of "reduceDimension" with selected DE genes for dimensionality reduction and tree construction. In order to compare CytoTRACE scores between the tumor cell clusters from patient samples within the same cancer type, we integrated tumor cells from patient 1 and patient 2 from each of the colorectal, lung and pancreatic cancers using ComBat (version 3.20.0) 65 embedded in CytoTRACE to correct for batch effects.

Gene set enrichment analysis in scRNAseq data
We performed gene set enrichment analyses for the high and low UMI tumor cell clusters in scRNAseq data. We first compiled a comprehensive set of signatures with 18,617 human gene sets (containing at least 4 genes) from the Molecular Signatures Database (MSigDB, v6.2) 38 and CellMarker 39 . Among them, 341 gene sets were annotated as stemness signatures based on the key word '_stem_' in their names.
We quantified enrichment for high and low UMI tumor cells using the GeneOverlap R package (v1.24.0) 66 .
GeneOverlap took the DE genes, a gene set and the background genome size (number of expressed genes in the scRNAseq data expressed in ≥ 10 cells) as input, and gave a P value for the enrichment significance and Jaccard Index, which was calculated by the number of common genes in the DE gene list and the signature gene set divided by the union of them. P values were adjusted for multiple comparisons using the Benjamini-Hochberg (BH) correction. The DE genes between high UMI and low tumor cells were obtained by the "FindMarkers" function from Seurat with Wilcoxon rank-sum test, based on criteria of adjusted P value < 0.1, genes expressed in ≥ 10 cells, and absolute log2(fold change) > 0.585 (1.5 fold change).

Pseudo-bulk analysis
We pooled scRNAseq data to form pseudo-bulk samples and estimated the ratio of the mean total UMI counts of tumor cells to that of the non-tumor cells for each sample. The 95% confidence intervals were constructed using bootstrapping of the same number of tumor and non-tumor cells with 1,000 repetitions.
Tumor-specific total mRNA expression in bulk sequencing data . For non-tumor cells, which are commonly diploid, this assumption is assured.
In the analysis of bulk RNAseq data from mixed tumor samples, we are interested in comparing tumor with non-tumor cell groups. We let T denote tumor cells and N denote non-tumor cells. Therefore, we define a tumor-specific total mRNA expression score (TmS) to reflect the ratio of total mRNA transcript level per haploid genome of tumor cells to that of the surrounding non-tumor cells, i.e., TmStumor = ST / SN, simplified as TmS from here forward. It is necessary to calculate this ratio in order to cancel out technical effects presented in sequencing data that confounds with both ST and SN . Let T g = ∑ u gc ) and the tumor cell proportion (hereafter 'tumor purity') r = CT /(CT + CN). We thus have The tumor-specific mRNA proportion π derived from the tumor can be estimated using DeMixT 23 as π 4; the tumor purity r and ploidy T can be estimated using ASCAT 24 , ABSOLUTE 25 or Sequenza 40 based on the matched DNA sequencing data as ρ 4 and T 5 , respectively; the ploidy of non-tumor cells N was assumed to be 2 24,25 . Hence, we have In what follows, we use TmS to represent TmS 6 for simplicity. A full description is in Supplementary Note, Section 2.1.

Consensus of tumor purity and ploidy estimation
For DNA-based deconvolution methods such as ASCAT and ABSOLUTE, there could be multiple tumor purity ρ and ploidy $ pairs that have similar likelihoods. Both ASCAT and ABSOLUTE can accurately estimate the product of purity and ploidy ρ $ ; however, they sometimes lack power to identify ρ and $

TmS =
Eq.  Fig. 3f-g) Therefore, our goal is to select an appropriate set of genes as input to DeMixT that optimizes the estimation of the tumor-specific mRNA proportions ( π ). In general, genes expressed at different numerical ranges can affect estimation of π. We found that including genes that are not differentially finding the best genes. By applying a profile-likelihood based approach to detect the identifiability of model parameters 67 , we systematically evaluated the identifiability for all available genes based on the data, and selected the most identifiable genes for the estimation of π. As a general method, the profilelikelihood based gene selection strategy can be extended to any method that uses maximum likelihood estimation. We also employed a virtual spike-in strategy to balance proportion distributions which further improved the deconvolution performance. A full description is provided in Supplementary Note, Section 2.2.

Profile-likelihood based gene selection
Briefly, in the DeMixT model, for sample i ∈ (1,2, … , ) and gene ∈ (1,2, … , ), we have Eq. 4 15 where Yig represents the scale normalized expression matrix observed from mixed tumor samples, T'ig and N'ig represent the normalized relative expression of gene g within tumor and surrounding non-tumor cells, respectively. The estimated tumor-specific mRNA proportion π 4 is the desirable quantity for Eq.3.
We assume each hidden component follows the log2-normal distribution, i.e., T' ig ~ LN Fμ Tg , σ Tg 2 G and N' ig ~ LN Fμ Ng , σ Ng 2 G. We will use notation T and N and drop the ' sign from now on.
is the likelihood function of the DeMixT model.
The confidence interval of a profile likelihood function can be constructed through inverting a likelihoodratio test 68 . However, calculating the actual profile likelihood function of all genes (~20,000) is generally infeasible due to computational limits. We adopted an asymptotic approximation to quickly evaluate the profile likelihood function 67 , using the observed Fisher information of the log-likelihood, denoted as H(π 4,μ 4 T ,σ 4 T ). Then the asymptotic α level confidence interval of μ Tk can be written as 67 We hereby introduce a gene selection score to represent the length of an asymptotic profile-likelihood based 95% confidence interval of μ Tk for gene , Genes with a lower score have a smaller confidence interval, hence higher identifiability for their corresponding parameters in the DeMixT. Genes are ranked based on the gene selection scores from the smallest to the largest. A subset of genes that are ranked on top will be used for parameter estimation.
In the DeMixT R package (available from Bioconductor), our proposed profile-likelihood based gene selection approach is included as function "DeMixT_GS". A full description is provided in Supplementary   Note

Estimation of tumor-specific mRNA proportions from RNAseq data
For each cancer type, we filtered out poor quality tumor and normal samples that were likely misclassified.
We then selected available adjacent normal samples as reference for the tumor deconvolution using

Intrinsic tumor signature genes
For each cancer type, we conducted gene set enrichment analyses on Hallmark pathways and KEGG pathways 38 for all genes with their gene selection scores using GSEA 73 and g:Profiler 74 . The genes were ranked according to their gene selection scores from the smallest to the largest. We further evaluated the chromatin accessibility of intrinsic tumor signature genes using ATAC-seq data from TCGA samples 43 .
For each sample, we calculated the mean of the peak scores of all intrinsic tumor signature genes and compared it with the corresponding permuted null distribution for each cancer type. A full description is provided in Supplementary Note, Section 2.3.2.3.

TmS estimation in ICGC-EOPC (Early-Onset Prostate Cancer)
In this cohort, matched mRNA sequencing data and whole genome sequencing data, as well as clinical data including biochemical recurrence, Gleason score and pathologic stage, from 121 tumor samples and 9 adjacent normal samples from 96 patients (age at treatment < 55), were downloaded from Gerhauser et al. 29 We used the 9 available adjacent normal samples as the normal reference to run

Association with clinical variables
Kruskal-Wallis tests were used to compare the distribution of TmS between subgroups defined by each clinical variable (Fig. 4a-e). The P values from the Kruskal-Wallis tests were adjusted using BH correction across all available clinical variables within the corresponding cancer type.

Association with survival outcomes
We evaluated the association between TmS and survival outcome (overall survival and progression-free interval) across 15 cancer types in TCGA and ICGC-EOPC. To ensure sufficient sample size in each group, we summarized pathologic stages into two categories: early (I/II) and advanced (III/IV). For prostate cancers, we used Gleason score (Gleason Score = 7 versus 8+) instead of early and advanced stages. We used a recursive partitioning survival tree model, rpart 75 , to find the optimal TmS cutoff 19 separating different survival outcomes within each of the two stages defined above in each cancer type.
Splits were assessed using the Gini index, and the maximum tree depth was set to 2. Log-rank tests between high and low TmS groups within early or advanced pathological stages were performed. We then fitted multivariate Cox Proportional Hazard models with age, TmS, stage, and an interaction term of TmS and Stage (TmS x stage) as predictors of overall survival and progression free interval for each TCGA cancer type. We also applied the same procedure to two other metrics for comparison: ploidy and TmS x ploidy. We provide further information in Supplementary Note, Section 3.

Association with metabolism, hypoxia, and genomic dysregulations
For each cancer type from TCGA, we conducted gene set enrichment analyses 73  we further defined a pentose-phosphate-13 score, i.e., a score based on the 13-gene-set, as the enrichment score calculated by gene set variation analysis (GSVA) 76 .
Hypoxia scores were generated as described previously 49,77 using the Buffa Signature 78 , which has been previously shown to be well-correlated with direct and transcriptional measures of hypoxia 79 . High values of this score suggest more hypoxia (lower levels of oxygen), while low values suggest less hypoxia (higher levels of oxygen). See further information in Supplementary Note, Section 4.
We conducted a network analysis of the pentose phosphate pathway (15 genes) and hypoxia signature gene list (51 genes) using a genome-wide parsimonious composite network (PCNet) 80 with visualization and analysis performed with Cytoscape version 3.8.2 81 .
Tumor mutation burden (TMB) was calculated by counting the total number of somatic mutations based on the consensus mutations calls (MC3) 82 . Chromosomal Instability (CIN) scores were calculated as the ploidy-adjusted percent of genome with an aberrant copy number state. ASCAT was used to calculate 20 allele-specific copy numbers 24 . For samples present in both TCGA and PCAWG, the consensus copy number was derived from published results 83 . Tumor samples that had undergone WGD were identified based on homologous copy-number information 25 .
For each cancer type within TCGA, we considered genes which had driver mutations (including nonsense, missense and splice-site SNVs and indels) 51 in at least 10 samples. For each of these genes, samples were labelled as "with driver mutation" if they carried at least one driver mutation in that gene or "without driver mutation" otherwise. We investigated 24 cancer-gene pairs for driver mutation analysis. We applied a Wilcoxon rank-sum test to each candidate gene to compare the distributions of TmS of the samples with driver mutations versus without driver mutations. The P values of each gene were adjusted for multiple testing using BH correction across all candidate genes within the corresponding cancer type.
We also implemented an agnostic search among non-synonymous mutations (including SNVs and indels) over all genes for the 15 cancer types to identify those that were significantly associated with TmS. For each cancer type, we considered a gene as a candidate gene if there were at least 10 samples containing non-synonymous mutations in that gene. We investigated 32,894 cancer-gene pairs for non-synonymous mutation analysis. We applied two statistical tests to evaluate the difference between the "with nonsynonymous mutation" and "without non-synonymous mutation" samples. We first applied a Wilcoxon rank-sum test for each candidate gene to evaluate the difference between the distribution of TmS of the two group of samples. We then fitted a linear regression model using log2-transformed TmS as the dependent variable and mutation status as a predictor: log 2 (TmS)=b 0 +b 1 log 2 (TMB) +b 2 MUT, where TMB represents tumor mutation burden. MUT=1 if the sample has at least one non-synonymous mutation in the candidate gene, and MUT=0 otherwise. The P values were calculated by a t-test of the regression The P values of each gene based on Wilcoxon rank-sum test and t-test were adjusted by BH correction based on the number of candidate genes within the corresponding cancer type.

Reporting summary
Further information on research design is available in the Nature Research Reporting Summary linked to the paper.

Data availability
The UMI counts of the hepatocellular carcinoma single cell RNA sequencing data were downloaded from the Gene Expression Omnibus (GEO) with the accession code GSE125449. All other relevant data are available from the corresponding author upon reasonable request.

Code availability
All code used for analyses was written in R version 3.6.1 and will be made available. The core computational pipelines developed for estimating tumor-specific mRNA expression proportion are available in R package DeMixT1.8.0, which can be downloaded from https://www.bioconductor.org/packages/release/bioc/html/DeMixT.html.      shown in the boxes with a yellow background, suppose their corresponding total mRNA amounts are 2, 25 3, and 4, then the ploidy-adjusted, or per haploid genome, total mRNA amount would be 1, 1, and 1.

Figure Legends
Under the scenario of dosage compensation, i.e., more chromosomal copies but maintaining the same total dose, the second cell has a total mRNA amount of 2 and a per haploid genome value of 0. 67  respectively. Adjusted P values of Wilcoxon rank-sum tests are indicated by asterisks (* P < 0.05, ** < 0.01, *** < 0.001).