MOSBY enables multi-omic inference and spatial biomarker discovery from whole slide images

doi:10.21203/rs.3.rs-3938444/v1



This work is licensed under a CC BY 4.0 License

Journal Publication

published 06 Aug, 2024

The published version appears in Scientific Reports.

You are reading this latest preprint version

The utility of deep neural nets has been demonstrated for mapping hematoxylin-and-eosin (H&E) stained image features to expression of individual genes. However, these models have not been employed to discover clinically relevant spatial biomarkers. Here we develop MOSBY (Multi-Omic translation of whole slide images for Spatial Biomarker discoverY), which leverages contrastive self-supervised pretraining to extract improved H&E whole slide image features, learns a mapping between image and bulk omic profiles (RNA, DNA, and protein), and utilizes tile-level information to discover spatial biomarkers. We validate MOSBY gene and gene set predictions with spatial transcriptomic and serially-sectioned CD8 IHC image data. We demonstrate that MOSBY-inferred colocalization features have survival-predictive power orthogonal to gene expression, and enable concordance indices highly competitive with survival-trained multimodal networks. We identify and validate 1) an ER stress-associated colocalization feature as a chemotherapy-specific risk factor in lung adenocarcinoma, and 2) the colocalization of T effector cell vs cysteine signatures as a negative prognostic factor in multiple cancer indications. The discovery of clinically relevant, biologically interpretable spatial biomarkers showcases the utility of the model in unraveling novel insights in cancer biology as well as informing clinical decision-making.

The developments in high-throughput RNA, DNA, and protein assays as well as in digital pathology have enabled multiple different high-resolution perspectives on a given tissue. Sampled from the same tissue, these data modalities are inherently linked, because they arise from highly similar cell populations and potentially the same biological state. Successfully inferring this common biology is key for relating different data modalities, and predicting one modality from another. Deep neural nets have emerged as an effective and flexible framework to represent the common biology underlying different readouts, opening the way to ‘translate’ any data modality into another^1–5. A high-utility use case for this translation task involves digital pathology, where relatively easy-to-acquire hematoxylin and eosin (H&E) stained whole slide images (WSI) can be used to infer high-throughput molecular data that is time-consuming or expensive to acquire.

The success of inferring the joint biology (i.e. underlying latent distribution) between WSIs and molecular data relies heavily on extracting informative features from WSIs. Contrastive self-supervised learning has made a breakthrough in computer vision by learning high quality image representations in no-annotation settings, as shown in curated benchmark datasets such as ImageNet^6,7. Contrastive learning aims to discriminate between positive and negative image pairs, where positive pairs are augmentations of the same image and negative pairs are augmentations of different images⁸. Contrastive self-supervised learning has also shown promise in identifying cancer-specific morphological features in large-scale and heterogeneous histology datasets and in training feature extractor models that outperform ImageNet-pretrained networks^9–11. However, the superior performance of self-supervised features has largely been demonstrated for classification tasks, with limited attention to regression tasks. For instance, Wang et al.⁸ developed an oncology-focused feature extractor, RetCCL, that was trained on more than 34,000 histology images and achieved state-of-the-art performance in cancer subtype classification. Regression tasks, such as inferring gene expression levels from WSI features, have been addressed with multi-output regression networks^1,12; yet, self-supervised learning-based image features remain largely underexplored in this context¹³.

Among molecular data modalities, the inference of bulk and single-cell transcriptomics from H&E WSI features has received considerable attention^1,2,12; however, an underexplored extension of this question is whether mapping the same image features to proteomic data holds greater promise due to proteins being more proximally associated with phenotype. The full power and limitations of image to molecular data ‘translations’ will become more evident as these deep learning models are expanded to learn multiple -omic data modalities such as transcriptomics, proteomics, metabolomics, and whole exome or SNP array-based DNA measurements.

Here, we present the MOSBY (Multi-Omic translation of whole slide images for Spatial Biomarker discoverY) model that employs contrastive self-supervised learning-based (RetCCL) features from H&E images to infer high-throughput molecular data such as transcriptomics (single genes or gene signatures), proteomics, and selected DNA-based features (e.g. cancer DNA fraction, average DNA methylation). MOSBY adopts a multiple instance learning approach where it partitions gigapixel WSIs to small tiles, makes tile-level predictions as an intermediate step, and then aggregates those to obtain slide-level predictions. At the inference stage, MOSBY enables in silico spatialization of bulk omic profiles by reconstructing tiles from a given WSI. Correlation between two omic features across all tiles of a slide reveals patient-specific colocalization or spatial exclusion patterns. Expanded to the cohort, these spatial features allow identification of clinically relevant spatial biomarkers. We demonstrate the performance of MOSBY in pan-cancer TCGA data, and spatially validate results with 1) spatially resolved transcriptomic data in breast cancer, 2) serially-sectioned CD8 IHC images in bladder cancer, and 3) tumor microenvironment (TME) cell type predictions in nonsmall cell lung cancer from an orthogonal deep learning model. In contrast to a whole transcriptome approach, we focus on TME cell type marker genes for MOSBY gene models, whereby model predictions could be contrasted with validation data in (3). Signature models, on the other hand, aim to encompass a wider biology including TME cell type, metabolic, and oncogenic pathway signatures. We finally utilize signature models to showcase the derivation of biologically interpretable colocalization features consistently associated with risk and disease state in human cancers.

We developed the MOSBY model that learned a mapping from RetCCL (contrastive self-supervised learning)-based WSI features to bulk transcriptome, proteome, whole exome, SNP-array, and DNA methylome based profiles (Fig. 1a). We limited our study to 21 TCGA indications¹⁴ with at least 200 paired RNA-seq and H&E whole slide images, resulting in a total of 12592, 10192, and 12090 images matching with transcriptome, proteome and DNA-based data respectively (breakdown by indication in Supplementary Table 1). Analyzed transcriptomic features consisted of 55 TME-related genes^15,16 (Supplementary Table 2) and 175 gene sets that covered tumor-related processes^17–32, metabolic pathways³³, and TME cell type or process signatures^34–37 (Supplementary Table 3). The proteome model involved 191 total and phosphoprotein antibodies from the TCGA reverse-phase protein array (RPPA) panel that focused on tumor-intrinsic and oncogenic processes ³⁸. Tested DNA-based features were limited to tumor purity, cancer DNA fraction, subclonal genome fraction, and average DNA methylation (all continuous measures bound to the interval [0,1]).

In addition to single indication models, we also trained MOSBY in pan-tissue, pan-cancer, pan-squamous and pan-adenocarcinoma settings. Pan-tissue models consisted of LUNG (lung adenocarcinoma and squamous cell carcinoma), KIDNEY (clear cell and papillary renal cell carcinoma), BRAIN (glioblastoma and low grade glioma), and COADREAD (colon and rectal adenocarcinoma). The pan-adenocarcinoma (ADENO) model consisted of pancreatic, lung, stomach, colon, rectal, prostate, and ovarian adenocarcinomas; while the pan-SQUAMOUS model included squamous cell carcinomas of the lung, cervix, and head and neck. The ADENO and SQUAMOUS models allowed us to investigate whether histological similarities beyond tissue architecture enabled MOSBY to learn a better mapping from WSI features to multi-omic data.

Similar to the HE2RNA¹² model, a 2-layer perceptron was adopted as the multi-output regression network that mapped image features to omic variables, and a maximum of 8000 image tiles per slide were used. Separate models were trained for gene, signature, protein, and DNA variables. In contrast to HE2RNA, image tiles were randomly selected from each WSI to capture an unbiased representation of the entire slide, and were all used in training without clustering. In addition, batch-normalized transcriptomic and proteomic data were used as ground truth to enable across-indication comparisons with resulting MOSBY predictions. To obtain slide-level predictions, tile-level predictions of omic features were aggregated by averaging.
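The tile-to-slide aggregation described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the MOSBY implementation: the function names `tile_mlp` and `slide_prediction` are hypothetical, the ReLU activation is an assumption (the text only specifies a 2-layer perceptron), and the weights and tile features are toy values rather than trained parameters or RetCCL features.

```python
import random

def tile_mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron for one tile: features -> hidden (ReLU) -> omic outputs.
    w1 and w2 hold one weight vector per hidden unit / output variable."""
    hidden = [max(0.0, sum(xi * wi for xi, wi in zip(x, unit)) + b)
              for unit, b in zip(w1, b1)]
    return [sum(hi * wi for hi, wi in zip(hidden, unit)) + b
            for unit, b in zip(w2, b2)]

def slide_prediction(tiles, params, max_tiles=8000, seed=0):
    """Randomly keep at most max_tiles tiles, predict each tile,
    then average tile-level predictions into a slide-level prediction."""
    if len(tiles) > max_tiles:
        tiles = random.Random(seed).sample(tiles, max_tiles)
    tile_preds = [tile_mlp(t, *params) for t in tiles]
    n_out = len(tile_preds[0])
    slide_pred = [sum(p[j] for p in tile_preds) / len(tile_preds)
                  for j in range(n_out)]
    return tile_preds, slide_pred

# Toy example (assumed values): 2 tile features, 2 hidden units, 1 omic output.
params = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [[1.0, 1.0]], [0.0])
tiles = [[1.0, 2.0], [3.0, -4.0]]
tile_preds, slide_pred = slide_prediction(tiles, params)
print(tile_preds, slide_pred)
```

The tile-level outputs are kept alongside the slide-level average because the same intermediate predictions are what later sections use for in silico spatialization.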

Contrastive self-supervised pretraining benefits prediction of omic data from H&E whole slide images

The RetCCL feature extractor utilized TCGA H&E images during contrastive training; however, it was not supervised by gene expression information. Hence, there are no a priori guarantees for RetCCL-based image features to predict gene expression with higher accuracy than ImageNet-based features. The success in this task hinges upon the diversity of semantic (in this case, biological) features in the representations learned by RetCCL. Thus, we investigated the performance differences between MOSBY models trained with RetCCL- or ImageNet-based image features. Models were trained with 5-fold cross-validation (80/20 percent training/test set split). A random one fifth of the training set was also allocated as a validation set to determine an early stopping criterion using the Spearman correlation between slide-level model prediction and ground truth omic data. Spearman correlation was preferred to mitigate the effects of outlier samples in relatively smaller cohorts. Correlation coefficients from test sets were subsequently averaged across five folds to obtain the final performance score for a given feature.
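The evaluation scheme in this paragraph can be illustrated with a short sketch. It is a simplified illustration under assumptions: the helper names (`five_fold_splits`, `spearman`, `cross_validated_score`) are hypothetical, splitting is done over slide indices for brevity, and no model is trained here. Each fold holds out 20% of slides for testing and sets aside one fifth of the remaining training slides for validation; the final score is the mean test-set Spearman correlation over folds.

```python
import random

def five_fold_splits(n_slides, seed=0):
    """Yield (train, validation, test) index lists for 5-fold cross-validation.
    One fifth of each fold's training portion is set aside for early stopping."""
    idx = list(range(n_slides))
    random.Random(seed).shuffle(idx)
    folds = [idx[k::5] for k in range(5)]
    for k in range(5):
        rest = [i for j, fold in enumerate(folds) if j != k for i in fold]
        n_val = len(rest) // 5
        yield rest[n_val:], rest[:n_val], folds[k]

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def cross_validated_score(fold_results):
    """Mean test-set Spearman r; each fold is a (predicted, truth) pair."""
    scores = [spearman(pred, truth) for pred, truth in fold_results]
    return sum(scores) / len(scores)

# An extreme outlier (100) does not change the rank-based correlation.
print(spearman([1, 2, 3, 4, 100], [2, 4, 6, 8, 9]))
```

Because Spearman correlation depends only on ranks, a single extreme sample cannot dominate the score, which is the property the text cites for smaller cohorts.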

RetCCL-based image features consistently led to higher cross-validated test set averages in all four omic data types compared to features extracted with ImageNet-pretrained ResNet-50 architecture (Fig. 1b-1e). The pan-cancer model (PANCAN) with RetCCL-based features achieved the highest performance for all tested data types with median cross-validated Spearman correlation of 0.722 in single genes (0.673 with ImageNet-features) (Fig. 1b), 0.693 in signatures (0.647 with ImageNet-features) (Fig. 1c), 0.549 in proteomic data (0.512 with ImageNet-features) (Fig. 1d), and 0.6 in DNA features (0.533 with ImageNet-features) (Fig. 1e). In terms of single indication models, the thyroid cancer model (THCA) achieved best performance for single gene and protein expression data sets with RetCCL-based features. Median cross-validated Spearman correlations in these models were 0.59 vs 0.524 for single gene, and 0.443 vs 0.385 for protein expression data with RetCCL and ImageNet-based features respectively (Fig. 1b, 1d). The single indication models achieving best performance for signature and DNA data were the liver and bladder cancer models respectively (LIHC and BLCA) (Fig. 1c, 1e). Across tested signatures, the LIHC model showed a median cross-validated Spearman correlation of 0.532 with RetCCL-based vs 0.45 with ImageNet-based features. For tested DNA features, the BLCA model achieved a median cross-validated Spearman correlation of 0.553 with RetCCL-based vs 0.417 with ImageNet-based features. Taken together, these results indicated that contrastive self-supervised pretraining in large-scale histology datasets benefits prediction of multiomic data from H&E-stained whole slide images.

MOSBY tile level predictions are validated with spatially resolved transcriptomic data

For inference tasks, MOSBY ‘full models’ were trained with data from 80% of patients, with the remaining 20% used as the validation set to determine an early stopping criterion. A MOSBY full model was trained in the TCGA breast cancer (BRCA) cohort (N = 1576 WSIs), and subsequently deployed on H&E image tiles from a publicly available spatially resolved transcriptomic breast cancer dataset³⁹ (referred to as ST from here on, N = 68 WSIs) that served as an independent validation cohort. Image tiles were centered around ST spot coordinates to enable one-to-one comparison between spot level ground truth data and tile level model predictions (Methods). For individual slides, the MOSBY signature model predictions showed the highest concordance with ground truth for ‘poor differentiation’ (Stemness_Kim_Myc, Spearman r = 0.71) and stromal features (Stroma_Estimate, Spearman r = 0.63) (Fig. 1f, Supplementary Fig. 1a). Across 68 slides, concordance was highest for a monocyte signature with a median Spearman correlation of 0.238 (Supplementary Fig. 1a), and a maximum of 0.611 (Fig. 1f). This large variation across slides was observed for all tested signatures, suggesting that the quality of spatially resolved data is critical in validating MOSBY predictions. Of note, CD8 T cell infiltration and proliferation-related model predictions also showed strong concordance with ground truth, demonstrating the variety in phenotypes captured successfully by the model (Supplementary Fig. 1b).

Spot level concordance for single gene expression values was overall lower than that for signatures (Supplementary Fig. 1d), potentially driven by the large degree of zero reads (i.e. dropouts) in ground truth data (Supplementary Fig. 1e). However, model predictions for genes associated with stroma, plasma cells, and epithelial features showed good performance for individual slides (COL1A2, MZB1, EPCAM respectively) (Fig. 1g). CD68, a myeloid marker, showed the highest median concordance in gene models (Spearman r = 0.127), which was low in magnitude but consistent with the high concordance of myeloid cells in signature models (Supplementary Fig. 1d).

As MOSBY aims to discover the joint latent space between H&E WSIs and bulk omic profiles, the correlation structure in ground truth data was also utilized to validate MOSBY model predictions (Methods). This analysis revealed that the MOSBY signature-signature correlation matrix was significantly different from random for 67 out of 68 slides (empirical p-value < 0.05) (Supplementary Fig. 1c). For gene-gene correlation matrices, this value dropped to only 28 out of 68 slides (Supplementary Fig. 1f), again highlighting that computing gene signature scores is an effective strategy to deal with the dropout issue in spatially resolved transcriptomic datasets. Despite reduced overall performance in gene models, predicted gene expression levels still successfully captured specific highly correlated gene blocks derived from ground truth data, such as those among cancer-associated fibroblast (CAF) and immune cell marker genes (Supplementary Fig. 1g). An analysis across 68 slides indicated that MOSBY model predictions were particularly good at distinguishing CAF-related features from other TME population marker genes. For 63 out of 68 slides, the MOSBY gene-gene correlation matrix had a cluster significantly enriched for fibroblast marker genes, and this enrichment was higher than 89% for the large majority of the slides (Supplementary Fig. 1h).

Serially sectioned CD8 IHC whole slide images validate MOSBY in silico spatialization

MOSBY tile-level predictions were further validated with CD8 antibody-stained immunohistochemistry (IHC) WSIs. To this end, gene and signature models were trained with data from urothelial carcinoma immune checkpoint inhibitor (ICI) trials IMvigor210⁴⁰ and IMvigor211⁴¹ (Methods, N = 1460 paired H&E and RNA-seq samples). H&E-CD8 double-stained slides were not available; thus, serially-sectioned single-stained H&E and CD8 IHC whole slide images were computationally aligned (Methods, N = 1017 paired H&E and CD8 IHC WSIs). 454 slides that were correctly aligned based on multiple alignment metrics were subjected to a manual QC check and a stringent 95% tissue overlap threshold (Methods, Fig. 2a). The remaining 42 IHC WSIs constituted the analysis set, and were used to categorize “brownness” of pixels based on diaminobenzidine staining (i.e. DAB mask) (Methods). Each DAB mask was split into tiles, and “brown” pixels in each tile were counted to determine tile-level CD8 IHC quantification (Methods). The Pearson correlation coefficient was then computed between CD8 IHC and MOSBY-predicted tile-level values (Fig. 2a).
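The last two steps (tile-level counting of “brown” pixels and correlation with model predictions) can be sketched as follows. This is a toy illustration under assumptions: the DAB mask is taken as an already-thresholded binary grid (the thresholding itself is described in Methods and not reproduced here), tiles are non-overlapping squares, the function names are hypothetical, and the mask and prediction values are made up for the example.

```python
def tile_brown_counts(mask, tile):
    """Count 'brown' pixels (value 1) in each non-overlapping tile x tile block,
    scanning tiles row by row."""
    rows, cols = len(mask), len(mask[0])
    counts = []
    for r0 in range(0, rows - tile + 1, tile):
        for c0 in range(0, cols - tile + 1, tile):
            counts.append(sum(mask[r][c]
                              for r in range(r0, r0 + tile)
                              for c in range(c0, c0 + tile)))
    return counts

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

# Toy 4x4 binary DAB mask split into four 2x2 tiles (assumed values).
dab_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
ihc_counts = tile_brown_counts(dab_mask, tile=2)
# Hypothetical tile-level model predictions for the same four tiles.
predicted = [0.9, 0.1, 0.2, 1.0]
r = pearson(ihc_counts, predicted)
print(ihc_counts, round(r, 3))
```

The comparison is only meaningful when the IHC and H&E tiles cover the same tissue, which is why the alignment and overlap filtering precede this step.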

To compare with CD8 IHC quantification, gene features CD8A, CD8B, and signature features Cibersort_CD8_T_cells and T_effector_cells were used from MOSBY model predictions. Across 42 slides, CD8B and Cibersort_CD8_T_cells showed the strongest concordance with CD8 IHC, having the highest correlation in 22 and 15 slides respectively (Fig. 2b). Overall, CD8B model predictions had the highest median correlation with CD8 IHC tile-level values (Pearson r = 0.352), with a range of [-0.006, 0.789]. The density plot for r = 0.789 correlation (N = 16859 tiles, p ≈ 0, two-tailed exact test) indicates stronger concordance for values closer to the higher and lower ends of the distribution (Fig. 2c). Visual inspection of CD8 IHC and H&E WSIs confirmed successful alignment, but also revealed minor differences between tissue sections potentially due to these being non-adjacent sections (Fig. 2d). At low magnification, visualization of DAB mask tile level quantification and CD8B model predictions indicated that the model successfully captures CD8 T cell infiltration patterns in the tumor (Fig. 2e, 2f). High magnification inspection of model predictions and DAB mask (as well as the input H&E and CD8 IHC images) confirmed that MOSBY, despite being weakly supervised on a patient level bulk RNA-seq dataset, effectively learned a CD8 T cell-specific representation in H&E images (Fig. 2g–2j).

In order to gain insight into both high and low concordance cases, four representative slides were selected based on the distribution of predicted CD8B vs DAB mask correlations (near the 75th, 50th, and 25th percentiles, and near the minimum). Selected slides had Pearson correlation values of 0.48 (N = 30032 tiles, p ≈ 0, two-tailed exact test), 0.35 (N = 22942 tiles, p ≈ 0, two-tailed exact test), 0.25 (N = 11642 tiles, p ≈ 0, two-tailed exact test), and 0.011 (N = 16595 tiles, p = 0.16, two-tailed exact test) respectively (Supplementary Fig. 2a–2d). Low magnification inspection of DAB mask and MOSBY CD8B predictions indicated that model predictions captured overall CD8 T cell infiltration patterns in the first three slides despite the correlation being as low as 0.25 (Supplementary Fig. 2a–2c). For the last slide (r = 0.011), spatial patterns in DAB mask and model predictions were conspicuously discordant (Supplementary Fig. 2d). High magnification inspection of CD8 IHC and DAB mask images from this last slide revealed that the DAB mask captured brown staining artifacts in regions that did not show any expected CD8-specific staining patterns (Supplementary Fig. 2e, 2f). In addition, CD8 IHC and H&E tissue sections showed misalignment in small substructures, despite an overall high score in tissue alignment (Supplementary Fig. 2e–2h). Taken together, these results showed that MOSBY successfully learned a CD8 T cell-specific representation on H&E images, and that correlations as low as 0.25 may be sufficient to capture overall CD8 T cell infiltration patterns in the tumor microenvironment, suggesting the potential of this tool for patient stratification in ICI trials.

Validation with pathologist annotations of cancer regions indicates epithelial gene and signature markers may be misleading in accounting for tumor-specific expression

Various RNA- and protein-based epithelial markers such as EPCAM (GENE) and E-cadherin (PROTEIN) are frequently used in the literature to mark cancer cells in tissue specimens. However, epithelial markers can be expressed in both transformed and non-transformed epithelial cells. We utilized MOSBY tile level predictions in order to compare virtual spatialization of epithelial markers with pathologist tumor region annotations and assess the tumor-specificity of commonly used tumor markers.

MOSBY full models were trained in TCGA lung adenocarcinoma (LUAD), and deployed on WSIs in IMpower150⁴³, an independent validation cohort in nonsquamous nonsmall cell lung cancer with pathologist annotations (i.e. pen marks) demarcating cancer epithelia.

Pathologist annotations were compared with MOSBY features of tumor purity (DNA-based) and epithelial markers EPCAM (GENE), E-cadherin (PROTEIN), and an epithelial gene signature by Taube et al.²⁰. WSI level comparison indicated that MOSBY models trained with DNA-based quantification of cancer cells (tumor purity) achieve the strongest concordance with pathologist tumor region annotations (Fig. 3a, 3b). E-cadherin closely followed DNA-based features in terms of concordance with pathologist annotations, albeit with noticeably lower specificity for the cancer region in Slide 2 (Fig. 3d vs 3b). Interestingly, predicted spatialization of EPCAM transcript levels showed relatively poor concordance with cancer epithelium annotation (Fig. 3c); and the performance of the Taube et al. epithelial signature was further reduced, with the highest predicted tiles corresponding to non-tumor regions (Fig. 3e). Collectively, these results suggest that employing DNA-based tumor cellularity estimates is the most reliable approach for accounting for tumor-specific expression in bulk datasets; and published epithelial signatures may not be fit-for-purpose in this task.

MOSBY enables investigation of intratumor heterogeneity with prediction of single gene, signature, and protein expression levels

The MOSBY framework employs RNA-seq and RPPA data to predict gene (or signature) and protein expression levels respectively (Methods). The utility of the model in predicting single gene, signature, and protein expression levels and inferring intratumor heterogeneity is demonstrated in IMpower150 in the context of cell proliferation features (Supplementary Fig. 3). Three representative H&E slides are shown in this figure together with pathologist annotations of cancer epithelia (Supplementary Fig. 3a), and MOSBY-predicted spatialization of cell proliferation markers Cyclin B1 (Supplementary Fig. 3b), Hallmark G2M checkpoint signature (Supplementary Fig. 3c) and MKI67 (Supplementary Fig. 3d). Even though tile-level predictions of all three cell cycle markers are highest in cancer epithelia-annotated regions, the spatialization of protein and signature markers (Cyclin B1 and Hallmark G2M checkpoint signature) demonstrates a stronger localization to cancer epithelia (Supplementary Fig. 3b, 3c, 3d). In addition, the spatialization of MKI67 expression in all three slides appears to suffer from checkerboard-like artifacts (Supplementary Fig. 3d). These results suggest that molecular data prediction from WSIs may be more successful with proteins or gene signatures, as opposed to single genes. Potential reasons for this observation are 1) proteins being closer to phenotype than genes and 2) gene signatures increasing the signal-to-noise ratio from RNA-seq data. Biologically, the predicted spatial pattern of Cyclin B1 and Hallmark G2M checkpoint signature levels shows differential proliferation levels across cancer epithelia tiles (Slide 3 in Supplementary Fig. 3b, 3c), indicating MOSBY also has potential to infer intratumor heterogeneity.

MOSBY predicts stroma, immune, and proliferation features with highest accuracy in TCGA

Cross-validated test set performance in single-indication signature and protein models was used to determine biological features with the highest prediction accuracy, and the 25 best-performing features are highlighted in Fig. 4a, 4d. The best-performing signatures ESTIMATE Stroma and ESTIMATE Immune had median correlations of 0.627 and 0.616 across all tested indications (Fig. 4a). In randomly split test sets, the Spearman correlation of ESTIMATE Stroma and ESTIMATE Immune signatures reached as high as 0.787 and 0.781 in skin cutaneous melanoma and thyroid cancer respectively (Fig. 4b). Other best-performing signatures were again largely enriched in stroma and immune features as well as general mesenchymal characteristics (such as Hallmark EMT and Taube et al. mesenchymal signatures) (Fig. 4a). Of note, particular signatures known to play an important role in specific cancer indications also showed strong prediction accuracy in those pertinent settings. For instance, the Hallmark Angiogenesis signature had Spearman r = 0.746 in the liver cancer model (LIHC), and a fatty acid elongation signature had Spearman r = 0.741 in the low grade glioma setting (LGG) (Fig. 4b). After adjusting for multiple hypothesis testing with false discovery rate (adj. p < 0.05), 14 out of 21 tested indications showed significant performance (as measured by Spearman r) for more than 90% of tested signatures (total N = 175 signatures) (Fig. 4c).

In contrast with our signature set that represented both tumor-intrinsic pathways and cell populations in the TME, the TCGA RPPA antibody set was heavily focused on tumor-intrinsic pathways. Therefore, the best-performing proteins had a diversity of representation from tumor-intrinsic characteristics, such as proliferation (Cyclin-B1, FOXM1), DNA repair (MSH6, PARP1), and apoptosis (cleaved Caspase7) (Fig. 4d). MSH6 and Cyclin B1 were the 2 best-performing proteins that were tested in at least 20 indications (Fig. 4d). These proteins had median Spearman r = 0.423 and 0.417 respectively across tested indications, but correlations in individual indications were as high as 0.676 in PRAD for Cyclin B1, and 0.593 in LGG for MSH6. Overall, lower prediction accuracy in protein models was expected due to the lower signal-to-noise ratio in the RPPA technology compared to RNA-seq. However, in specific settings such as the PRAD model, both total and phosphoproteins showed strong prediction accuracy in test sets. Progesterone receptor (PR) and cMET_pY1235 reached Spearman r = 0.614 and 0.631 respectively in this indication (Fig. 4e). Also, the AKT/mTOR pathway showed evidence of strong prediction accuracy in the sarcoma setting where S6 and RICTOR_pT1135 antibodies showed Spearman r = 0.731 and 0.695 respectively (Fig. 4e). Despite paucity of representation in the RPPA panel, stromal and immune features were also found among the best-performing proteins such as ECM-associated Collagen-VI, Fibronectin and T cell-associated Lck (Fig. 4d). Overall, 11 out of 20 tested indications showed significant performance (adj. p < 0.1) for more than 90% of tested antibodies (total N = 191 antibodies) (Fig. 4f).

In terms of single gene models, best performing features confirmed the strong prediction accuracy associated with stromal features observed above; the highest-ranking genes were marker genes of fibroblasts (e.g. LUM, COL5A1) (Fig. 4g). Top-ranking genes also included markers of macrophages (e.g. CSF1R), and T cells (e.g. CD3E). In DNA models, tumor cellularity measures (tumor purity, cancer DNA fraction) achieved better performance than subclonal genome fraction, and average DNA methylation features (Fig. 4h).

Multi-indication runs may show exaggerated performance

The results above demonstrated that MOSBY models trained with gene signature and protein expression data generally perform better than those trained with single genes. Thus, using signature and protein expression data, we next addressed whether single- vs multi-indication models (introduced in Fig. 1b) had more utility from a clinical perspective. We focused on performance in the unseen test set to evaluate the generalization capacity of each approach in a scenario emulating the clinical setting; i.e. predicting gene and protein expression levels for a single patient. Performance was measured, as above, with the Spearman correlation between MOSBY slide-level predictions and ground truth levels of signature or (phospho)protein expression (N = 175 for signatures, 191 for total and phosphoproteins).

Multi-indication models (PANCAN, ADENO, BRAIN, KIDNEY, SQUAMOUS, LUNG) showed superior performance for both signature and protein prediction tasks, compared to single-indication counterparts (Supplementary Fig. 4a). The PANCAN model had a median cross-validated Spearman correlation of 0.693 for tested signatures (Supplementary Fig. 4a), with epithelial, proliferation, and stemness signatures showing the best prediction accuracy (cross-validated Spearman r = 0.826, 0.824, and 0.82 respectively). We focused on the Benporath_ES2 stemness signature to demonstrate the bias associated with multi-indication models. In a randomly split test set, this signature had Spearman r = 0.84 between predicted and ground truth levels in the PANCAN dataset (Supplementary Fig. 4b, left). However, the correlations in individual indications varied between 0.244 and 0.645 for this signature, suggesting that the PANCAN model performance may be spuriously inflated by combining multiple indications with different distributions, reminiscent of Simpson’s paradox (e.g. OV and READ slides concentrated on the higher end, and THCA and KIRC concentrated on the lower end of the distribution; Spearman r = 0.248, 0.326, 0.41, and 0.482 for OV, READ, THCA and KIRC respectively) (Supplementary Fig. 4b, right).

In cross-validated protein models, Cyclin B1, Claudin-7 and progesterone receptor (PR) showed the best prediction accuracy in the PANCAN model (Spearman r = 0.775, 0.772, and 0.749 respectively). To ask whether this performance was spuriously inflated, we again compared PANCAN and single indication models. In a randomly split test set, Cyclin B1 had a Spearman correlation of 0.76 in the PANCAN model (Supplementary Fig. 4c, left), while correlations in individual indications were lower, ranging from −0.10 to 0.597. Simpson’s paradox was again evident in protein model predictions, with Cyclin B1 levels in individual indications localizing to smaller parts of the pan-cancer distribution and showing lower prediction accuracy (e.g. Spearman r = −0.103, 0.198, 0.214, and 0.263 for LIHC, THCA, READ and CESC respectively) (Supplementary Fig. 4c, right).
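The pooling effect described in the two paragraphs above can be reproduced with a small synthetic example. The numbers below are invented purely for illustration and are not taken from the study: two hypothetical indications occupy different parts of the expression range, predictions are anti-correlated with ground truth within each indication, and yet the pooled Spearman correlation is clearly positive because the model separates the two indications.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation computed as the Pearson correlation of ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5

# Synthetic ground truth and predictions for two hypothetical indications.
truth_low, pred_low = [1, 2, 3, 4], [4, 3, 2, 1]            # low-expression indication
truth_high, pred_high = [11, 12, 13, 14], [14, 13, 12, 11]  # high-expression indication

within_low = spearman(truth_low, pred_low)
within_high = spearman(truth_high, pred_high)
pooled = spearman(truth_low + truth_high, pred_low + pred_high)
print(within_low, within_high, round(pooled, 3))
```

In this toy case the within-indication correlations are both negative while the pooled correlation is positive, which is why per-indication evaluation is the more conservative check of a pan-cancer model.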

We then investigated the potential bias in multi-indication models more systematically by comparing median test set performance (i.e. median correlation for signatures or proteins) in single vs multi-indication models (Supplementary Fig. 4e, 4g). The train/validate/test set splits in multi-indication models were implemented with indication stratification, enabling comparison with single-indication training. Results showed that neither single-indication nor multi-indication training was universally better (Supplementary Fig. 4e, 4g). However, evidence also showed that multi-indication training generally did not improve performance if a single-indication model reached a median performance of 0.45 for signatures (Supplementary Fig. 4e), and 0.35 for proteins (Supplementary Fig. 4g), with LGG as a potential exception benefiting from joint training with GBM.

Pan-tissue training appeared to improve median test set accuracy in certain settings such as protein models in kidney, brain, lung, colon/rectum (Supplementary Fig. 4g), and signature models in brain (Supplementary Fig. 4e). However, at the individual signature/protein level, further heterogeneity in model performance was observed (Supplementary Fig. 4f, 4h). A subset of signatures/proteins benefited from multi-indication training whereas another subset did not. This observation prevented us from making general recommendations as to the deployment of single vs multi-indication models in a clinical setting. In circumstances where there is an interest in specific signatures/proteins, it may be suboptimal to make the model decision based on median performance. However, results in Supplementary Fig. 4e, 4g can be used as a guideline in other settings that lack a focus on specific biological features. We utilize single indication models in subsequent parts of this study in order to prevent potential bias from multi-indication training.

Spatial patterns inferred from tile-level MOSBY predictions increase survival predictive power of gene signature-based models

MOSBY tile-level predictions enable in silico spatialization of a tested omic feature, as well as assessment of the spatial correlation between two tested features (positive and negative correlations indicating colocalization and spatial exclusion respectively). We define a ‘colocalization feature’ as the Pearson correlation between tile-level predictions of two omic features. As correlation coefficients were computed across all tiles on a slide, colocalization features represent slide-level as opposed to ‘local’ spatial patterns. We focused on our signature panel (N = 175) as the omic features to derive colocalization features and investigate survival associations, as the signature panel covered both tumor and non-tumor TME components. Moreover, using signatures as opposed to single genes enables the discovery of ‘biologically interpretable’ spatial biomarkers, as well-designed signatures capture pathway and cell type-related gene expression with higher fidelity. For a slide, correlations from all pairwise combinations of signatures (N = 15225) (i.e. the collection of all colocalization features) are referred to as the ‘colocalization map’ from here on (Fig. 1a).

A survival analysis was performed to ask whether slide-level patterns in colocalization maps harbored survival signals that could not be captured by the mere magnitude of signature expression. Three L1-regularized Cox proportional hazards regression models were fit to address this question, and the ability of the models to predict survival was computed with concordance indices (c-index) (Fig. 5a) (Methods). The first model only used flattened colocalization features (MOSBY predictions, N = 15225). The second Cox model only used signature expression levels (N = 175) with the goal of assessing the survival predictive power of gene expression ‘magnitude’. The third model was a joint model combining all signatures but only lasso-selected colocalization features from the first model in order to prevent colocalization features from dominating the model. Each model was run across 10 cross-validation folds to optimize the shrinkage parameter and also obtain mean and standard error estimates for the relevant c-index (Methods).

This process was implemented separately for all tested TCGA indications, and c-index mean and standard error estimates were plotted (Fig. 5a). We observed that the joint (i.e. third) model had a higher c-index than the signature-only model in most indications. This finding revealed that slide-level spatial patterns, discovered with an inference engine such as MOSBY, have survival predictive power that could not be captured by gene expression alone. As MOSBY colocalization features are biologically interpretable, this opens the door to discovering potentially ‘actionable’ colocalization or spatial exclusion patterns that predict clinical outcomes. Moreover, the c-indices achieved in the joint model were highly competitive with or higher than those reported in the literature from end-to-end survival-trained multimodal models utilizing genomic, transcriptomic, and image datasets⁴⁴. Of note, comparing colocalization-only and signature-only models, we also found that the total survival predictive power of slide-level spatial patterns was not as high as that of signature levels in most TCGA indications. Ovarian and rectal cancer results were an exception to this general pattern (Fig. 5a), suggesting spatial biomarkers discovered in these indications may have the greatest potential to lead to novel insights.

Colocalization maps enable discovery of biologically interpretable spatial biomarkers

We next investigated consistent spatial predictors of risk that were supported across multiple indications and also showed evidence of tumor specificity. In each indication, survival effects and tumor specificity of colocalization features were explored with two tests: 1) a univariate Cox regression model for survival, and 2) a Mann-Whitney test to compare tumor-normal levels of colocalization value. A potential spatial biomarker of risk was defined as a colocalization feature that was significantly associated with poor overall survival (p < 0.05) and also had elevated levels of colocalization in the tumor (adj. p < 0.05). Given these criteria, four colocalization features had evidence in four different TCGA indications to be a spatial biomarker of risk (Fig. 5b). Of these four, the colocalization between an ER stress signature (XBP1s targets ER17²²) and a neurotransmitter signature was associated with poor survival and malignant state in colon adenocarcinoma, lung adenocarcinoma, liver hepatocellular and ovarian cancers (Fig. 5c, 5d). In an independent non-squamous lung cancer study involving both immune checkpoint blockade (atezolizumab) and chemotherapy arms (IMpower110), this colocalization feature was also found to be associated with poor survival in the chemotherapy, but not immunotherapy arm, suggesting higher relevance as a resistance factor in chemotherapy (Fig. 5e, 5f). Of note, the ER17 and neurotransmitter signatures were not individually found to be associated with risk in any of the four mentioned indications (Supplementary Fig. 5a, 5b). Visual inspection of WSIs indicated that the expression of ER17 and neurotransmitter signatures primarily came from the microenvironment (as opposed to tumor region) in the case of high colocalization (Supplementary Fig. 5c). In low colocalization cases, the neurotransmitter signature expression primarily came from the tumor region whereas the ER17 signature expression was again predominantly in the microenvironment (Supplementary Fig. 5d).

Focusing on colocalization features involving immune system signatures, the T effector cell vs Cysteine colocalization was identified as the most consistent spatial biomarker of risk in TCGA. This colocalization feature was associated with poor survival and also showed significant tumor enrichment in breast, squamous lung, and ovarian cancers (Fig. 6a, 6b). T effector and cysteine signatures were not individually found to be associated with risk in any of these indications (Supplementary Fig. 5a, 6b). Visual inspection of WSIs indicated that the expression of T effector and cysteine signatures primarily came from the microenvironment (as opposed to tumor region) in the case of high colocalization (Supplementary Fig. 6c). As high colocalization is a risk factor, this expression pattern may be suggestive of a cysteine-associated immunosuppressive TME. In low colocalization cases, the T effector signature expression primarily came from the microenvironment whereas the cysteine signature expression was predominantly in the tumor region (Supplementary Fig. 6d).

The strongest survival effect for T effector vs cysteine colocalization was observed in breast cancer, where we investigated other immune cell types and found that immune vs cysteine colocalization was a general negative prognostic biomarker in this indication. Significant survival associations were observed for both lymphocytes/NK cells (Fig. 6c) and myeloid populations (Fig. 6d). The same immune vs cysteine colocalization features were also found to be negative prognostic factors in the atezolizumab arm of the IMpower110 non-squamous cohort (Fig. 6e, 6f). Of note, most of these features were not prognostic in the chemotherapy arm (Supplementary Fig. 6e, 6f). However, they did not qualify as predictive biomarkers, since the survival association differences between the atezolizumab and chemotherapy arms were not significant.

The MOSBY workflow achieves prediction of bulk omic profiles from H&E WSI features. Our results showed that, compared to ImageNet-based pretraining, self-supervised pretraining in large histological datasets allows creation of inference engines (e.g. RetCCL) that enable a more accurate mapping from image features to gene, signature, protein, and DNA-based measurements. We demonstrated that the most accurately predicted features by MOSBY involved processes such as proliferation, immune/stromal infiltration, differentiation, and epithelial-to-mesenchymal transition. This finding suggests that self-supervised features learned from pan-cancer histological datasets run the risk of accentuating biological processes and pathways that show the highest variation across different cancer indications. Training feature extractors on images from a single indication may be required to capture biological processes that play an important role in one or only a few cancer indications. RetCCL-based features showed promise by capturing angiogenesis in hepatocellular carcinoma, and fatty acid biology in low grade glioma. Yet, the accumulation of even larger histological datasets in the future has the potential to allow refinement of image features relevant for indication-specific pathways, thus making possible the discovery of a greater number of clinically relevant spatial biomarkers.

The utility of H&E images to predict high-throughput expression data has been largely explored in the context of individual genes^1,2,12. The RNA-Seq noise level associated with individual gene transcript counts can be alleviated by utilizing focused gene sets (i.e. signatures) aiming to measure particular pathway or cell type levels. Also, proteins and posttranslational modification levels (e.g. phosphorylation, acetylation) capture a more faithful representation of tissue phenotype compared to gene transcript levels. We demonstrated with pathologist tumor annotation validations that training the model with signature or protein expression levels led to potentially more interpretable results compared to single gene-trained models. Moreover, the MOSBY workflow has the benefit of employing both RNA and protein level data to interrogate the spatial distribution of biological processes, each modality functioning as an independent validation platform for the other. Therefore, training MOSBY with signature and protein levels has the highest potential of uncovering spatial patterns relevant for cancer biology and clinical outcome. In the case of tumor cellularity, our results indicated that DNA-based measures outperformed RNA and protein-based epithelial markers, and importantly, highlighted that published epithelial signatures may be misleading due to being expressed also by normal cells.

MOSBY, as with many other deep learning models, adopts a weakly-supervised approach by first making tile-level predictions and then aggregating tiles to obtain a prediction at the WSI level^4,45. Although an intermediate output of the model, tile-level predictions enable in silico spatialization of whole slide-level annotations, opening the way to inferring intratumor heterogeneity for the slide-level information used as ground truth^12,44. Spatial intratumor heterogeneity patterns learned from a cohort of patients subsequently allow investigation of clinically relevant biomarkers. In MOSBY, spatial patterns are captured by pairwise colocalization features. For a given signature pair, the colocalization value on a slide is defined as the Pearson correlation across all tiles. Thus, MOSBY colocalization features capture slide-level but not local spatial patterns. Local processes such as tertiary lymphoid structures are known to affect patient survival and response to cancer immunotherapy^46,47. We demonstrated in this study that slide-level spatial patterns also carried survival signals, and increased predictive power of gene signature-based survival models in most TCGA indications. Moreover, we noted that the concordance indices of joint models (both colocalization features and gene signature levels) either surpassed or were comparable to those of multimodal deep learning models that employed the complete set of WSI and omic (RNA-seq, mutation status, copy number variation) data in TCGA⁴⁴. This finding indicated that the 175-signature panel we defined was sufficient to capture most biological processes important for clinical outcome.

End-to-end neural networks trained to predict survival have been lacking in terms of direct biological interpretation of image regions important for the model^44,48. These models may incorporate mechanisms such as attention heatmaps and spatial credit assignment to increase interpretability^44,49, yet still require pathologist efforts to examine important image regions, whereby interpretation remains limited to phenotypes visible to the human eye. In contrast, MOSBY spatial features are biologically interpretable by design, which is an advantage of this approach over end-to-end neural networks trained to predict survival^44,48. In this study, we showed that the colocalization of an endoplasmic reticulum (ER) stress-related signature and a neurotransmitter signature is both elevated in tumors and associated with poor overall survival in four TCGA indications. The poor survival association in lung adenocarcinoma was also validated in the chemotherapy arm of an independent NSCLC cohort (nonsquamous samples in IMpower110), suggesting this colocalization may be a chemotherapy-specific risk factor. Moreover, we identified the T effector cell vs cysteine signature colocalization as a TME-related risk factor in multiple TCGA indications, as well as in the immunotherapy arm of the IMpower110 nonsquamous cohort. These results showcase the high utility of colocalization maps for discovering biologically interpretable, clinically relevant spatial biomarkers.

Large oncology data sets have enabled the routine development of machine learning models trained simultaneously in multiple cancer indications from the same organ (i.e. pan-tissue training), or from different organs across the body (i.e. pan-cancer training). Such models have shown success in mapping molecular tumor features to patient outcomes as well as in mapping WSI features to transcriptomic data ^12,50,51. However, in the clinical setting where the machine learning model will be used for a single patient, it becomes critical to understand whether multi-indication training enables more robust models with higher generalization capacity or introduces bias into the model due to distribution differences across indications. Hence, we trained single-indication, pan-tissue, and pan-cancer versions of the MOSBY model to investigate the relative merits and potential clinical utility of multi- vs. single-indication models. We discovered that better generalization (test set performance) is achieved with single-indication training in indications where the single-indication performance is moderate to high. In contrast, multi-indication training has potential to increase generalization capacity in indications where single-indication performance is relatively poor. While these conclusions can be made for the median performance across tested signatures and protein/phosphoproteins, significant heterogeneity was observed among these features in terms of showing better performance in the single- vs. multi-indication setting. To prevent potential bias from multi-indication training, we recommend the use of single-indication models where the feature set in the model reflects an unbiased discovery approach. However, an explicit single- vs. multi-indication comparison may be required for settings where the studied feature set reflects a specific biology of interest.

A limitation of our study is that MOSBY colocalization maps are not able to capture local spatial patterns. Future work involves the investigation of a graph neural network-based model on tile-level MOSBY predictions where we can capture local as well as slide-level patterns. Moreover, transformer-based architectures may increase the expressiveness of our model to allow a more accurate mapping from image features to bulk omic profiles.

Data

TCGA: Batch-normalized RNA sequencing, RPPA datasets as well as clinical and DNA-based data were obtained from the PanCanAtlas publications page of the Genomic Data Commons website (https://gdc.cancer.gov/about-data/publications/pancanatlas). H&E-stained slide images (i.e. Tissue Slides) were downloaded from GDC Data Portal (https://portal.gdc.cancer.gov/).

Spatially-resolved transcriptomic data: Publicly available breast cancer tissue slides and spatial transcriptomic assays processed by the Spatial Transcriptomics method^39,52 were downloaded from https://data.mendeley.com/datasets/29ntw7sh4r/5. The dataset contained 68 WSIs from a total of 23 patients along with spot coordinates and respective RNA-seq expression values from the spatial transcriptomics assay. WSIs were tiled into 224x224 patches as input for the MOSBY model. Each tile was centered around the pixel coordinate of an assay spot to represent the 100µm region of the spot. Log-normalized gene expression values were used as ground truth. Signature scores were calculated by first creating an AnnData object using the anndata package then applying the score_genes function from scanpy⁵³. All analyses were conducted using Python 3.10.

MOSBY preprocessing

Image tiling

On whole slide images, foreground elements (tissue) and background (glass) were isolated through luminosity-based segmentation. The Python library OpenSlide was used as a backend for generating 224x224px foreground tiles at 0.5mpp resolution. The same tiling protocol was leveraged for both training and inference, allowing tiles to be mapped back to the original WSI positions for visualization (Supplementary Fig. 7).

Feature Extraction

Contrastive self-supervised learning-based RetCCL⁸ was used to extract image features. RetCCL employs a ResNet-50 architecture to extract 2048 features for each image tile.

Model Training

TCGA

A maximum of 8000 tiles were selected randomly for each slide to yield an unbiased representation of the slide, and all tiles were concurrently used for training. A 2-layer perceptron (512 and 256 nodes per layer) was used to map image features to omic variables. Number of epochs was set to a maximum of 300, with early stopping allowed with a patience of 30 epochs. Ground truth RNA-seq data were log-transformed. Ground truth signature and protein levels had negative values, and thus were shifted to make all values nonnegative. The model was trained with MSE loss between prediction and ground truth levels, while Spearman correlation between these quantities was used as early termination criterion in the validation set. A batch size of 64, and AdamW optimizer with 1e-3 weight decay were used. Learning rate scheduler was implemented with step size 5 and gamma 0.9.
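The stopping rule and learning rate schedule described above can be sketched as follows. This is a simplified illustration and not the MOSBY training loop: the function names are hypothetical, the base learning rate of 1e-3 in the toy call is an assumption (the text does not report it), and the per-epoch validation Spearman correlations are made-up numbers standing in for real validation results.

```python
def step_lr(base_lr, epoch, step_size=5, gamma=0.9):
    """Learning rate after `epoch` completed epochs under a step schedule
    (multiply by gamma once every step_size epochs)."""
    return base_lr * gamma ** (epoch // step_size)


def early_stopping_epoch(validation_scores, max_epochs=300, patience=30):
    """Return (best_epoch, best_score, epochs_run).

    validation_scores: per-epoch validation Spearman correlations (higher is
    better). Training stops once `patience` epochs pass without improvement
    or when max_epochs is reached.
    """
    best_score, best_epoch, epochs_run = float("-inf"), -1, 0
    for epoch, score in enumerate(validation_scores[:max_epochs]):
        epochs_run = epoch + 1
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_score, epochs_run


# Toy validation curve: improves for 20 epochs, then plateaus at a lower value.
toy_scores = [0.1 + 0.01 * e for e in range(20)] + [0.2] * 100
best_epoch, best_score, epochs_run = early_stopping_epoch(toy_scores)

# Hypothetical base learning rate; two decay steps have occurred by epoch 10.
lr_at_epoch_10 = step_lr(1e-3, 10)
```

In this toy run the best epoch is the last improving one, and training halts 30 epochs later rather than running to the 300-epoch maximum.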

5-fold cross-validation was performed (64% training, 16% validation, and 20% test set in each fold) to assess model performance. Information leak was prevented by assigning all WSIs from the same patient to the same partition. For inference, full models were trained with 80% of the data, with the remaining 20% used as validation set to check early stopping criterion.
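A minimal sketch of a patient-level 5-fold split with the 64%/16%/20% proportions described above is given below. The helper name, the slide-to-patient dictionary layout, and the random seed are assumptions made for illustration; the actual partitioning code may differ. The property being demonstrated is that all slides from one patient land in the same partition.

```python
import random


def patient_level_folds(slide_to_patient, n_folds=5, val_fraction=0.2, seed=0):
    """Split slides into train/val/test per fold without splitting a patient.

    Each fold holds out one fifth of the patients for testing (20%); the
    remaining patients are split 80/20 into training (64%) and validation (16%).
    """
    patients = sorted(set(slide_to_patient.values()))
    random.Random(seed).shuffle(patients)
    chunks = [patients[i::n_folds] for i in range(n_folds)]

    folds = []
    for k in range(n_folds):
        test = set(chunks[k])
        rest = [p for j, chunk in enumerate(chunks) if j != k for p in chunk]
        n_val = round(len(rest) * val_fraction)
        val = set(rest[:n_val])
        assignment = {"train": [], "val": [], "test": []}
        for slide, patient in slide_to_patient.items():
            part = "test" if patient in test else "val" if patient in val else "train"
            assignment[part].append(slide)
        folds.append(assignment)
    return folds


# Toy cohort: 25 hypothetical patients with 2 slides each.
slides = {f"slide_{p}_{s}": f"patient_{p}" for p in range(25) for s in range(2)}
folds = patient_level_folds(slides)
```

Because partitions are assigned by patient rather than by slide, the realized slide proportions only approximate 64/16/20 when patients contribute different numbers of slides.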

IMvigor210 and IMvigor211

A full model was trained using both IMvigor210 and IMvigor211 images, and largely following the parameters used in TCGA training. Patients were split into 80% training and 20% validation set, stratified by trial and preventing information leak. The signature and gene model consisted of 175 signatures (Supplementary Table 3) and 73 genes (Supplementary Table 2) respectively. Different from TCGA runs, a maximum of 4000 randomly selected tiles per WSI were used during training, and WSIs having less than 200 tiles were filtered out. Ground truth RNA-seq data were log-transformed and standardized to have 0 mean and unit variance. Both gene and signature expression levels were shifted to make all values nonnegative.

Computational hardware and software

MOSBY was built with the PyTorch library (v1.11.0) in Python (v3.10.8) as a novel implementation of the HE2RNA¹² model. Python libraries used for data processing included NumPy (v1.23.5), Pandas (v1.2.4), Scikit-learn (v0.24.1), OpenSlide (v1.1.1), Zarr (v2.12.0), Tifffile (v2020.10.1), and OpenCV-cv2 (v4.5.5). Whole slide image tiling, RetCCL feature extraction and MOSBY model training were run on NVIDIA Tesla V100 Tensor Core GPUs (graphics processing units). Deep learning models were trained with the NVIDIA CUDA compiler (v12.1.105). Data visualization in Python was implemented with the Matplotlib (v3.3.4) and Seaborn (v0.11.1) libraries. Python statistical analyses such as Spearman and Pearson correlation were implemented with the SciPy library (v1.10.1).

Data processing in R (v4.1.1) was implemented with dplyr (v1.0.8), magrittr (v2.0.2), and reshape2 (v1.4.4) libraries. Data visualization in R was performed using ggplot2 (v3.3.5), ggpubr (v0.4.0), and ggsci (v2.9) libraries.

Spatial transcriptomics validation analysis

Derivation of an empirical null distribution for the distance between ground truth and predicted correlation structures

For each slide, the gene-gene correlation matrix was computed for both ground truth and model predictions. The Frobenius distance between these two matrices was used as a test statistic. Subsequently, the MOSBY correlation matrix rows and columns were randomly shuffled, and the Frobenius distance to the ground truth matrix was computed. Random shuffling was repeated 1000 times to obtain an empirical null distribution for MOSBY vs ground truth distance. The test statistic was then compared with the empirical null distribution to obtain a p-value for model predictions. The same sequence of steps was followed for signature-signature correlations.
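The permutation procedure above can be sketched as follows. Two details are assumptions rather than statements from the text: the same random permutation is applied to both rows and columns of the predicted matrix (so that it remains a valid relabeled correlation matrix), and the p-value is computed as the add-one-corrected fraction of null distances that are at most the observed distance. Function names and the toy matrices are hypothetical.

```python
import math
import random


def frobenius_distance(a, b):
    """Frobenius norm of the difference between two equally sized matrices."""
    return math.sqrt(
        sum((x - y) ** 2 for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))
    )


def permute(matrix, order):
    """Apply one permutation to both rows and columns (assumed shuffling scheme)."""
    return [[matrix[i][j] for j in order] for i in order]


def permutation_p_value(truth_corr, pred_corr, n_perm=1000, seed=0):
    """Return (observed distance, empirical p-value) for predicted vs ground truth."""
    rng = random.Random(seed)
    observed = frobenius_distance(truth_corr, pred_corr)
    order = list(range(len(pred_corr)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        if frobenius_distance(truth_corr, permute(pred_corr, order)) <= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive (an assumption).
    return observed, (hits + 1) / (n_perm + 1)


# Toy example: 6 features whose correlation decays with index distance, and a
# prediction that slightly shrinks the off-diagonal correlations.
n = 6
truth = [[0.5 ** abs(i - j) for j in range(n)] for i in range(n)]
pred = [[v if i == j else 0.95 * v for j, v in enumerate(row)] for i, row in enumerate(truth)]
observed, p_value = permutation_p_value(truth, pred)
```

A small p-value indicates that the predicted correlation structure is closer to the ground truth than would be expected if feature labels were assigned at random.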

CD8 IHC and H&E whole slide image alignment

IMvigor210 and IMvigor211 slides were scanned by an external party (CellCarta, Montreal, QC) on a Pannoramic 250 (3DHistech, Budapest, Hungary) with a 20x or 40x objective. Digital files were transferred to Genentech and converted to the Aperio SVS image format for all viewing and analysis. Hematoxylin and Eosin (HE) slides were aligned with a CD8 stained section from the same block (1017 total sample pairs). These sections were almost never serial, and displayed variable degrees of adjacency characteristic of the block being refaced between sections. Image alignment was performed in Matlab (R2022a, MathWorks, Natick, MA) by downsampling the images to ~20 µm per pixel and converting them to normalized grayscale images, before calculating an affine transformation using mutual information as the underlying matching metric. The transformation was then upscaled before being applied to the original CD8 high magnification image data, bringing it into alignment with the HE image. Various alignment metrics were produced (intersection over union, normalized cross correlation) to select for correctly aligned images (454 sample pairs). A final manual QC check at low magnification (at least 95% of tissue present on both slides, with the majority of visible structures in close proximity), which was not exhaustive, resulted in 42 sample pairs subjected to further analysis.

A binary mask was then produced from the aligned CD8 image using both HSV thresholding and a blue-normalized “brownness” algorithm⁵⁴.

CD8 IHC tile-level quantification

Tile size used for H&E images was set to 224x224 pixels at 0.5mpp. In cases where the native resolution was different from 0.5mpp, tiling was performed with an adjusted tile size at the native resolution, and tiles were subsequently downsampled or upsampled to arrive at 224 pixels at 0.5mpp. In most IMvigor slides, the native resolution was 0.243mpp, resulting in an adjusted tile size of approximately 460x460 pixels. CD8 IHC images were tiled using the ‘adjusted’ tile size to match the corresponding H&E tiles. The CD8 IHC count for a tile was computed using a convolution approach: each tile was split into 30x30 px subtiles, for which the number of 1s (brown pixels) was counted. The counts across subtiles were averaged to obtain the value for the tile. This average value was log-transformed (log(x + 1)) in comparisons with MOSBY model predictions.
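The tile-size adjustment and the subtile-based CD8 quantification can be illustrated with the sketch below. It assumes non-overlapping 30x30 subtiles and ignores any partial subtiles at the tile border; neither detail is specified in the text. The function names and the toy mask are hypothetical.

```python
import math


def adjusted_tile_size(native_mpp, target_px=224, target_mpp=0.5):
    """Tile edge length in native pixels that covers the same physical area
    as target_px pixels at target_mpp microns per pixel."""
    return target_px * target_mpp / native_mpp


def cd8_tile_score(mask, sub=30):
    """log(x + 1) of the mean brown-pixel count over sub x sub subtiles.

    mask: square list of lists with 1 for CD8-positive (brown) pixels, else 0.
    Subtiles are taken as a non-overlapping grid (an assumption).
    """
    n = len(mask)
    counts = [
        sum(mask[i][j] for i in range(r, r + sub) for j in range(c, c + sub))
        for r in range(0, n - sub + 1, sub)
        for c in range(0, n - sub + 1, sub)
    ]
    return math.log1p(sum(counts) / len(counts))


# Native resolution reported for most IMvigor slides.
edge_px = adjusted_tile_size(0.243)

# Toy 60x60 mask in which only the top-left 30x30 subtile is fully positive.
toy_mask = [[1 if (i < 30 and j < 30) else 0 for j in range(60)] for i in range(60)]
toy_score = cd8_tile_score(toy_mask)
```

For a native resolution of 0.243mpp, the computed edge length falls between 460 and 461 pixels, consistent with the approximate 460x460 tile size quoted above.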

Computation of signature scores from bulk RNA-seq data

In TCGA and IMvigor RNA-seq datasets, normalized gene expression values were log-transformed (log(x + 1)), z-transformed across samples to have 0 mean, unit variance, and subsequently averaged across genes to arrive at a single signature score for each sample. In TCGA, signature scoring was performed on the batch-normalized pan-cancer RNA-seq dataset, and hence signature scores were comparable across cancer indications.
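A minimal sketch of this scoring scheme follows. It assumes the population standard deviation for the z-transform and skips genes with zero variance across samples; the text does not state either choice. The function name, gene names, and expression values are hypothetical.

```python
import math
from statistics import mean, pstdev


def signature_scores(expr, signature_genes):
    """Per-sample signature score: log(x + 1), z-score per gene, mean over genes.

    expr: dict of gene -> list of normalized expression values, one per sample.
    """
    z_by_gene = []
    for gene in signature_genes:
        logged = [math.log1p(v) for v in expr[gene]]
        mu, sd = mean(logged), pstdev(logged)
        if sd == 0:
            continue  # a constant gene carries no information (assumed handling)
        z_by_gene.append([(v - mu) / sd for v in logged])
    # Average the z-scores across genes for each sample.
    return [mean(per_sample) for per_sample in zip(*z_by_gene)]


# Toy data: 3 samples, 3 hypothetical genes (GENE3 is constant and is skipped).
toy_expr = {
    "GENE1": [1.0, 5.0, 20.0],
    "GENE2": [0.0, 3.0, 10.0],
    "GENE3": [7.0, 7.0, 7.0],
}
scores = signature_scores(toy_expr, ["GENE1", "GENE2", "GENE3"])
```

Because each gene is standardized across samples before averaging, highly expressed genes do not dominate the score, and the scores are centered at zero across the cohort.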

Survival analysis

TCGA concordance index analysis: MOSBY signature models (full models trained with 80%-20% training/validation split) were used to make inference on all slides in a cancer indication. For a given slide, the colocalization for a signature pair was computed with a Pearson correlation across tile-level predictions of the two signatures. The collection of all pairwise correlation values for a slide formed the ‘slide colocalization matrix’ (N = 175x175). For patients with multiple existing H&E slides, the patient-level colocalization matrix was computed as the average of all pertinent slide-level colocalization matrices. Three sets of cross-validation runs were implemented for L1-regularized Cox regression models. The inputs to these survival models consisted of: Model 1) Flattened patient-level colocalization maps (N = 15225 features). Model 2) Patient-level signature scores (N = 175 signatures). Model 3) All tested signatures (N = 175) and lasso-selected colocalization features from Model 1. The input features were regressed against overall survival in all models and for all tested indications.
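The patient-level aggregation and flattening steps can be sketched as below. The sketch assumes that flattening keeps only the upper triangle of the symmetric matrix, excluding the diagonal, which is consistent with 15225 features for 175 signatures. Function names, signature names, and matrix values are hypothetical.

```python
from itertools import combinations


def average_matrices(matrices):
    """Element-wise mean of equally sized slide-level colocalization matrices."""
    n = len(matrices[0])
    return [
        [sum(m[i][j] for m in matrices) / len(matrices) for j in range(n)]
        for i in range(n)
    ]


def flatten_upper(matrix, names):
    """Keep one feature per unordered signature pair (upper triangle, no diagonal)."""
    return {
        (names[i], names[j]): matrix[i][j]
        for i, j in combinations(range(len(names)), 2)
    }


# Toy patient with two slides and three hypothetical signatures.
names = ["sig_A", "sig_B", "sig_C"]
slide_1 = [[1.0, 0.8, -0.2], [0.8, 1.0, 0.1], [-0.2, 0.1, 1.0]]
slide_2 = [[1.0, 0.4, -0.6], [0.4, 1.0, 0.3], [-0.6, 0.3, 1.0]]

patient_matrix = average_matrices([slide_1, slide_2])
features = flatten_upper(patient_matrix, names)
```

The resulting dictionary has one entry per signature pair and can serve as a single row of the design matrix for the colocalization-only Cox model.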

In all three models, the sequence of possible values for lambda (shrinkage parameter) was internally determined in the cv.glmnet function from the glmnet⁵⁵ R library (v4.1.3) prior to cross-validation runs. The lambda maximizing the mean Harrell’s concordance measure across 10 cross-validation folds was chosen as optimal, and used to determine the concordance index estimate for the model. The standard error estimate for the model concordance index was calculated across the 10 cross-validation folds.
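For readers unfamiliar with the metric, the sketch below gives a plain-Python version of Harrell's concordance index for right-censored data, together with a mean and standard error across folds. It is an illustration only and does not reproduce the glmnet internals; the function names and toy inputs are hypothetical, and the standard error is assumed to be the sample standard deviation divided by the square root of the number of folds.

```python
import math
from statistics import mean, stdev


def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the earlier failure has the higher risk.

    A pair (i, j) is comparable when subject i had an observed event strictly
    before subject j's follow-up time. Tied risk scores count as one half.
    """
    concordant = tied = comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1
                elif risk_scores[i] == risk_scores[j]:
                    tied += 1
    return (concordant + 0.5 * tied) / comparable


def mean_and_se(fold_values):
    """Mean and standard error of a per-fold metric (assumed SE definition)."""
    return mean(fold_values), stdev(fold_values) / math.sqrt(len(fold_values))


# Toy cohort: the third patient is censored (event = 0).
c_toy = harrell_c_index(
    times=[2, 4, 6, 8],
    events=[1, 1, 0, 1],
    risk_scores=[0.9, 0.7, 0.8, 0.1],
)

# Hypothetical per-fold c-indices.
c_mean, c_se = mean_and_se([0.6, 0.7, 0.8])
```

In the toy cohort, five pairs are comparable and four of them are ordered correctly by the risk score, giving a concordance index of 0.8.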

Kaplan-Meier plots

The survminer R library (v0.4.9) was used with a median cutoff to generate Kaplan-Meier plots. Log-rank test p-values were obtained internally in survminer using the survdiff function in the survival⁵⁶ library (v3.3.1).

Data availability

The accession URLs for publicly available data analyzed in this study (TCGA, spatial transcriptomics) are listed in the Data section of Methods. Datasets from clinical trials IMpower150 (H&E image data), IMpower110 (H&E image and clinical data), IMvigor210 (RNA-seq, H&E image and CD8 IHC image data), IMvigor211 (RNA-seq, H&E image and CD8 IHC image data) were also analyzed in the current study. IMvigor210 RNA-seq data is available at the European Genome-phenome archive (EGA) under the accession number EGAS00001002556, and was also published as an R package (http://research-pub.gene.com/IMvigor210CoreBiologies/). IMpower110 and IMpower150 datasets are not publicly available as data release is designated for the pending primary biomarker manuscripts.

IMvigor210 and IMvigor211 H&E and CD8 IHC image data as well as IMvigor211 RNA-seq data that support the findings of this study are available from Roche, but restrictions apply to the availability of these data, which were used under license for the current study, and so are not publicly available. Data are however available from the authors upon reasonable request and with permission of Roche. For up to date details on Roche's Global Policy on the Sharing of Clinical Information and how to request access to related clinical study documents, see here: https://go.roche.com/data_sharing.

Funding: This work was supported by Genentech, Inc.

Acknowledgements

We would like to thank all of the study participants and their families, and all of the site investigators, study coordinators, and staff. We also would like to thank Brandon Kayser, Robert Johnston, Aïcha Bentaieb, Dan Ruderman, Hector Corrada Bravo, Jason Hackney, and colleagues from the Oncology Reverse Translation team for providing critical feedback on the manuscript. This work was supported by Genentech, Inc.

Author contributions

K.L., A.K., Y.S. devised the study.

Y.S., V.P., A.K., J.E., E.L., E.W., B.N., M.S., M.B. acquired, analyzed, and interpreted the data.

Y.S. wrote the manuscript with input from the remaining authors.

Competing interests

Y.S., J.E., E.L., B.N., M.S., M.B., K.L. are employees of Genentech, Inc. and shareholders in F. Hoffmann La Roche, Ltd.

V.P. is an external partner at Genentech, Inc.

A.K. was an intern at Genentech, Inc.

E.W. was an intern at Genentech, Inc.

Comiter, C. et al. Inference of single cell profiles from histology stains with the Single-Cell omics from Histology Analysis Framework (SCHAF). http://biorxiv.org/lookup/doi/10.1101/2023.03.21.533680 (2023) doi:10.1101/2023.03.21.533680.
Alsaafin, A., Safarpoor, A., Sikaroudi, M., Hipp, J. D. & Tizhoosh, H. R. Learning to predict RNA sequence expressions from whole slide images with applications for search and classification. Commun. Biol. 6, 304 (2023).
Yang, K. D. et al. Multi-domain translation between single-cell imaging and sequencing data using autoencoders. Nat. Commun. 12, 31 (2021).
Tsai, P.-C. et al. Histopathology images predict multi-omics aberrations and prognoses in colorectal cancer patients. Nat. Commun. 14, 2102 (2023).
Haviv, D., Gatie, M., Hadjantonakis, A.-K., Nawy, T. & Pe’er, D. The covariance environment defines cellular niches for spatial inference. http://biorxiv.org/lookup/doi/10.1101/2023.04.18.537375 (2023) doi:10.1101/2023.04.18.537375.
Chen, X., Fan, H., Girshick, R. & He, K. Improved Baselines with Momentum Contrastive Learning. Preprint at http://arxiv.org/abs/2003.04297 (2020).
Oquab, M. et al. DINOv2: Learning Robust Visual Features without Supervision. Preprint at http://arxiv.org/abs/2304.07193 (2023).
Wang, X. et al. RetCCL: Clustering-guided contrastive learning for whole-slide image retrieval. Med. Image Anal. 83, 102645 (2023).
Fremond, S. et al. Interpretable deep learning model to predict the molecular classification of endometrial cancer from haematoxylin and eosin-stained whole-slide images: a combined analysis of the PORTEC randomised trials and clinical cohorts. Lancet Digit. Health 5, e71–e82 (2023).
Schirris, Y., Gavves, E., Nederlof, I., Horlings, H. M. & Teuwen, J. DeepSMILE: Contrastive self-supervised pre-training benefits MSI and HRD classification directly from H&E whole-slide images in colorectal and breast cancer. Med. Image Anal. 79, 102464 (2022).
Ciga, O., Xu, T. & Martel, A. L. Self supervised contrastive learning for digital histopathology. Mach. Learn. Appl. 7, 100198 (2022).
Schmauch, B. et al. A deep learning model to predict RNA-Seq expression of tumours from whole slide images. Nat. Commun. 11, 3877 (2020).
Nahhas, O. S. M. E. et al. Regression-based Deep-Learning predicts molecular biomarkers from pathology slides. Preprint at http://arxiv.org/abs/2304.05153 (2023).
Hoadley, K. A. et al. Cell-of-Origin Patterns Dominate the Molecular Classification of 10,000 Tumors from 33 Types of Cancer. Cell 173, 291-304.e6 (2018).
Bagaev, A. et al. Conserved pan-cancer microenvironment subtypes predict response to immunotherapy. Cancer Cell 39, 845-865.e7 (2021).
Patil, N. S. et al. Intratumoral plasma cells predict outcomes to PD-L1 blockade in non-small cell lung cancer. Cancer Cell 40, 289-300.e4 (2022).
Liberzon, A. et al. The Molecular Signatures Database (MSigDB) hallmark gene set collection. Cell Syst. 1, 417–425 (2015).
Miranda, A. et al. Cancer stemness, intratumoral heterogeneity, and immune response across cancers. Proc. Natl. Acad. Sci. U. S. A. 116, 9020–9029 (2019).
Şenbabaoğlu, Y. et al. Tumor immune microenvironment characterization in clear cell renal cell carcinoma identifies prognostic and immunotherapeutically relevant messenger RNA signatures. Genome Biol. 17, 231 (2016).
Taube, J. H. et al. Core epithelial-to-mesenchymal transition interactome gene-expression signature is associated with claudin-low and metaplastic breast cancer subtypes. Proc. Natl. Acad. Sci. U. S. A. 107, 15449–15454 (2010).
Masiero, M. et al. A core human primary tumor angiogenesis signature identifies the endothelial orphan receptor ELTD1 as a key regulator of angiogenesis. Cancer Cell 24, 229–241 (2013).
Harnoss, J. M. et al. IRE1α Disruption in Triple-Negative Breast Cancer Cooperates with Antiangiogenic Therapy by Reversing ER Stress Adaptation and Remodeling the Tumor Microenvironment. Cancer Res. 80, 2368–2379 (2020).
Gene Ontology Consortium et al. The Gene Ontology knowledgebase in 2023. Genetics 224, iyad031 (2023).
Hobert, O., Carrera, I. & Stefanakis, N. The molecular and gene regulatory signature of a neuron. Trends Neurosci. 33, 435–445 (2010).
Robertson, A. G. et al. Comprehensive Molecular Characterization of Muscle-Invasive Bladder Cancer. Cell 171, 540-556.e25 (2017).
Tsai, H. K. et al. Gene expression signatures of neuroendocrine prostate cancer and primary small cell prostatic carcinoma. BMC Cancer 17, 759 (2017).
Gillespie, M. et al. The reactome pathway knowledgebase 2022. Nucleic Acids Res. 50, D687–D692 (2022).
Xu, Q. et al. Pan-cancer transcriptome analysis reveals a gene expression signature for the identification of tumor tissue origin. Mod. Pathol. 29, 546–556 (2016).
Ben-Porath, I. et al. An embryonic stem cell-like gene expression signature in poorly differentiated aggressive human tumors. Nat. Genet. 40, 499–507 (2008).
Bhattacharya, B., Puri, S. & Puri, R. K. A review of gene expression profiling of human embryonic stem cell lines and their differentiated progeny. Curr. Stem Cell Res. Ther. 4, 98–106 (2009).
Shats, I. et al. Using a stem cell-based signature to guide therapeutic selection in cancer. Cancer Res. 71, 1772–1780 (2011).
Kim, J. et al. A Myc network accounts for similarities between embryonic stem and cancer cell transcription programs. Cell 143, 313–324 (2010).
Possemato, R. et al. Functional genomics reveal that the serine synthesis pathway is essential in breast cancer. Nature 476, 346–350 (2011).
Newman, A. M. et al. Robust enumeration of cell subsets from tissue expression profiles. Nat. Methods 12, 453–457 (2015).
Mariathasan, S. et al. TGFβ attenuates tumour response to PD-L1 blockade by contributing to exclusion of T cells. Nature 554, 544–548 (2018).
Böttcher, J. P. & Reis e Sousa, C. The Role of Type 1 Conventional Dendritic Cells in Cancer Immunity. Trends Cancer 4, 784–792 (2018).
Yoshihara, K. et al. Inferring tumour purity and stromal and immune cell admixture from expression data. Nat. Commun. 4, 2612 (2013).
Akbani, R. et al. A pan-cancer proteomic perspective on The Cancer Genome Atlas. Nat. Commun. 5, 3887 (2014).
Stenbeck, L., Bergenstråhle, L., Lundeberg, J. & Borg, Å. Human breast cancer in situ capturing transcriptomics. (2021) doi:10.17632/29ntw7sh4r.5.
Balar, A. V. et al. Atezolizumab as first-line treatment in cisplatin-ineligible patients with locally advanced and metastatic urothelial carcinoma: a single-arm, multicentre, phase 2 trial. Lancet 389, 67–76 (2017).
Powles, T. et al. Atezolizumab versus chemotherapy in patients with platinum-treated locally advanced or metastatic urothelial carcinoma (IMvigor211): a multicentre, open-label, phase 3 randomised controlled trial. Lancet 391, 748–757 (2018).
Diao, J. A. et al. Human-interpretable image features derived from densely mapped cancer pathology slides predict diverse molecular phenotypes. Nat. Commun. 12, 1613 (2021).
Socinski, M. A. et al. Atezolizumab for First-Line Treatment of Metastatic Nonsquamous NSCLC. N. Engl. J. Med. 378, 2288–2301 (2018).
Chen, R. J. et al. Pan-cancer integrative histology-genomic analysis via multimodal deep learning. Cancer Cell 40, 865-878.e6 (2022).
Lu, M. Y. et al. Data-efficient and weakly supervised computational pathology on whole-slide images. Nat. Biomed. Eng. 5, 555–570 (2021).
Helmink, B. A. et al. B cells and tertiary lymphoid structures promote immunotherapy response. Nature 577, 549–555 (2020).
Trüb, M. & Zippelius, A. Tertiary Lymphoid Structures as a Predictive Biomarker of Response to Cancer Immunotherapies. Front. Immunol. 12, 674565 (2021).
Katzman, J. L. et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med. Res. Methodol. 18, 24 (2018).
Javed, S. A. et al. Additive MIL: Intrinsically Interpretable Multiple Instance Learning for Pathology. (2022) doi:10.48550/ARXIV.2206.01794.
Chowell, D. et al. Improved prediction of immune checkpoint blockade efficacy across multiple cancer types. Nat. Biotechnol. 40, 499–506 (2022).
Chen, R. J. et al. Whole Slide Images are 2D Point Clouds: Context-Aware Survival Prediction using Patch-based Graph Convolutional Networks. (2021) doi:10.48550/ARXIV.2107.13048.
Ståhl, P. L. et al. Visualization and analysis of gene expression in tissue sections by spatial transcriptomics. Science 353, 78–82 (2016).
Wolf, F. A., Angerer, P. & Theis, F. J. SCANPY: large-scale single-cell gene expression data analysis. Genome Biol. 19, 15 (2018).
Brey, E. M. et al. Automated selection of DAB-labeled tissue for immunohistochemical quantification. J. Histochem. Cytochem. 51, 575–584 (2003).
Simon, N., Friedman, J., Tibshirani, R. & Hastie, T. Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent. J. Stat. Softw. 39, 1–13 (2011).
Therneau, T. M. & Grambsch, P. M. Modeling Survival Data: Extending the Cox Model. (Springer, 2000).

Competing interest reported. Y.S., J.E., E.L., B.N., M.S., M.B., and K.L. are employees of Genentech, Inc. and shareholders in F. Hoffmann-La Roche, Ltd. V.P. is an external partner at Genentech, Inc. A.K. and E.W. were interns at Genentech, Inc.


Editorial decision: Revision requested
17 May, 2024
Reviews received at journal
15 May, 2024
Reviewers agreed at journal
05 May, 2024
Reviews received at journal
17 Mar, 2024
Reviewers agreed at journal
08 Mar, 2024
Reviewers agreed at journal
07 Mar, 2024
Reviewers invited by journal
05 Mar, 2024
Editor assigned by journal
05 Mar, 2024
Editor invited by journal
28 Feb, 2024
Submission checks completed at journal
28 Feb, 2024
First submitted to journal
07 Feb, 2024
