ST profiling of the anterior and posterior parts of mouse brain sagittal tissue sections was generated with the Visium technology from 10x Genomics 14–17. Both datasets consist of serial H&E histology images paired with gene expression measurements at the spatial spots and their coordinates. The analyzed gene expression count matrices are outputs of the SpaceRanger pipeline 18.
The anterior sagittal dataset comprises two sets of H&E-stained histology images alongside corresponding spatial gene expression data 14, 15. In slice 1, the spatial transcriptomics (ST) data contain the expression profiles, as read counts, of 31,053 genes across 2,825 spots. Slice 2 contains ST data for the same 31,053 genes across 2,696 spots, also as read counts.
The posterior sagittal dataset includes two collections of H&E-stained histology images and their associated spatial gene expression data 16, 17. The first slice contains read-count expression levels of 32,285 genes across 3,355 spots. The second slice contains ST data for the same number of genes across 3,289 spots, quantified in the same way.
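For orientation, the snippet below sketches how these four sections can be loaded with scanpy's public dataset downloader. The sample identifiers are our assumption of the 10x Genomics names for these sections and should be verified against the actual SpaceRanger outputs used.

```python
# Hedged sketch: loading the four Visium sections with scanpy. The sample IDs
# are assumed 10x Genomics identifiers, not confirmed by the paper.
import scanpy as sc

sample_ids = [
    "V1_Mouse_Brain_Sagittal_Anterior",             # anterior, slice 1
    "V1_Mouse_Brain_Sagittal_Anterior_Section_2",   # anterior, slice 2
    "V1_Mouse_Brain_Sagittal_Posterior",            # posterior, slice 1
    "V1_Mouse_Brain_Sagittal_Posterior_Section_2",  # posterior, slice 2
]

# visium_sge returns one AnnData per section, carrying the count matrix,
# the spot pixel coordinates (adata.obsm["spatial"]), and the H&E image
# (under adata.uns["spatial"]).
adatas = {sid: sc.datasets.visium_sge(sample_id=sid) for sid in sample_ids}
for sid, adata in adatas.items():
    print(sid, adata.shape)  # (n_spots, n_genes), e.g. (2825, 31053)
```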
Data preprocessing
For the H&E histology images, we extracted patches based on the size and location of each spot. Each patch was a 224×224 image centered on a spot, approximately 55 µm on each side, and paired with the spot's corresponding gene expression profile. For the spatial gene expression profiles of each tissue section, the counts at each spot were normalized by the total count and log-transformed. The union of the top 1,000 most highly variable genes from each of the slices was used for training and prediction. Finally, the expression data of these samples were batch corrected using Harmony 19 before one of the slices was randomly selected to be held out for testing. For both datasets, slice 2 was used for training, while slice 1 was held out for testing.
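The sketch below illustrates this preprocessing for one dataset (the two anterior slices from the previous snippet). The scanpy and Harmony calls are real, but the paper applies Harmony to the expression data, while harmonypy corrects a PCA embedding, so we follow the standard scanpy route as an assumption; `extract_patch` is a hypothetical helper for the patch cropping step.

```python
# Sketch of the preprocessing pipeline described above; requires scanpy and
# harmonypy. `adatas` comes from the previous loading snippet.
import scanpy as sc

def extract_patch(img, xy, size=224):
    """Crop a size x size patch centered on a spot's full-resolution pixel
    coordinates (x, y); boundary handling is omitted for brevity."""
    x, y = int(round(xy[0])), int(round(xy[1]))
    h = size // 2
    return img[y - h:y + h, x - h:x + h, :]

slices = [adatas["V1_Mouse_Brain_Sagittal_Anterior"],
          adatas["V1_Mouse_Brain_Sagittal_Anterior_Section_2"]]

hvgs = set()
for adata in slices:
    adata.var_names_make_unique()
    sc.pp.normalize_total(adata)          # per-spot total-count normalization
    sc.pp.log1p(adata)                    # log(1 + x) transform
    sc.pp.highly_variable_genes(adata, n_top_genes=1000)
    hvgs |= set(adata.var_names[adata.var["highly_variable"]])

# Union of the per-slice HVGs, then Harmony batch correction across slices.
genes = sorted(hvgs)
merged = sc.concat({f"slice{i + 1}": a[:, genes] for i, a in enumerate(slices)},
                   label="slice")
sc.pp.pca(merged)                                      # Harmony runs on PCA space
sc.external.pp.harmony_integrate(merged, key="slice")  # requires harmonypy
```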
Learning image embedding for expression prediction.
Residual networks (ResNets) 20 are widely recognized architectures in image classification, acclaimed for the significant advance they represented upon introduction. They remain a benchmark in various image analyses 21–23 and serve as baselines in image studies introducing novel architectures 24, 25. Our focus in this paper was to extract highly representative features from H&E images using ResNet50. During the training phase, we utilized a dataset comprising \(N\) pairs of H&E image patches and their corresponding gene expressions. Each image patch was represented as a 3D tensor \(\mathbf{X}_{i}\in\mathbb{R}^{3\times W\times H}\), where \(W\) and \(H\) denoted the width and height of the patch, respectively. The associated gene expression \(\mathbf{y}_{i}\) was represented as a \(d\)-dimensional vector in \(\mathbb{R}^{d}\).
Our objective was to develop a deep learning framework capable of accurately predicting gene expression from a given image patch. We conceptualized our model as comprising two main components: a backbone network \(\mathcal{F}(\cdot)\) for initial feature extraction, followed by a feature refinement module \(\mathcal{G}(\cdot)\) to enhance the predictive capability of the extracted features. To optimize both modules, we employed the mean squared error (MSE) loss function during training:
$$\mathcal{L}_{mse}=\sum_{i=1}^{N}\left\|\mathbf{y}_{i}-\mathcal{G}\left(\mathcal{F}\left(\mathbf{X}_{i}\right)\right)\right\|_{2},$$
where \(\|\cdot\|_{2}\) denotes the L2 norm measuring the difference between the predicted gene expression and the ground truth.
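As a concrete illustration, the following PyTorch sketch instantiates \(\mathcal{F}\) as a ResNet50 backbone with its classification layer removed and \(\mathcal{G}\) as a small fully connected refinement head. The head sizes and the use of ImageNet-pretrained weights are our assumptions, not necessarily the paper's exact configuration.

```python
# Illustrative sketch of the two-module design F and G with an MSE objective.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResSATBackbone(nn.Module):
    def __init__(self, n_genes: int):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V2")
        backbone.fc = nn.Identity()        # F: (3, 224, 224) -> 2048-d features
        self.F = backbone
        self.G = nn.Sequential(            # G: feature refinement module
            nn.Linear(2048, 1024),         # layer sizes are assumptions
            nn.ReLU(),
            nn.Linear(1024, n_genes),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.G(self.F(x))           # predicted expression, shape (N, d)

model = ResSATBackbone(n_genes=1000)
x = torch.randn(8, 3, 224, 224)            # a toy batch of H&E patches
y_true = torch.randn(8, 1000)              # paired expression profiles
# Batched form of L_mse: per-spot squared L2 distance, averaged over the batch.
loss_mse = ((y_true - model(x)) ** 2).sum(dim=1).mean()
```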
Spot-Interaction Discovery.
The emergence of transformers has led to their widespread application in image and omics data analysis, as demonstrated by numerous studies 26–29. In particular, recent transformer-based architectures rely on self-attention and cross-attention mechanisms, which offer a means to capture interdependencies between different input modalities 30, 31.
In the context of our spatially resolved gene expression prediction approach, the Self-Attention Transformer (SAT) plays a crucial role in exploring spot-spot interactions. By clustering spots that exhibit high correlation, SAT enhances our model's ability to represent gene expression accurately. This approach facilitates a deeper understanding of spatial relationships within the data, contributing to improved predictive performance.
To analyze H&E images for spot interaction, we developed a spot-interaction module, as shown in Fig. 1. This module aimed to refine predictions of gene expression. To streamline the model and minimize the number of parameters, we introduced a non-parametric attentive module, whose operation was defined by the following equation:
$$\mathcal{A}\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{QK^{\mathrm{T}}}{\sqrt{d}}\right)V,$$
where \(Q\), \(K\), and \(V\) represent the query, key, and value matrices of a Transformer module, and \(d\) is the dimension of \(K\). To model spot interactions, we aimed to learn a compact gene expression correlation between the spots in the training set; here, \(Q\), \(K\), and \(V\) were all derived from the refined image embeddings \(\mathcal{G}(\mathcal{F}(\mathbf{X}_{i}))\). This formulation explores the matrix of spot-spot interactions: by clustering spots that exhibit high correlation, it enhances the robustness of the gene expression representation, and it elucidates the correlation among spots, improving the model's inference capabilities. To integrate knowledge of the ground-truth gene expressions into the learning process, we introduced a secondary MSE loss, formulated as follows:
$$\mathcal{L}_{mse}^{\prime}=\sum_{i=1}^{N}\left\|\mathbf{y}_{i}-\mathcal{A}\left(\mathcal{G}\left(\mathcal{F}\left(\mathbf{X}_{i}\right)\right)\right)\right\|_{2}.$$
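A minimal sketch of the non-parametric attentive module \(\mathcal{A}\) is given below: setting \(Q=K=V=f_{x}\) means no projection weights are learned, and the \(\mathrm{softmax}(QK^{\mathrm{T}}/\sqrt{d})\) matrix itself is the \(N\times N\) spot-spot interaction map.

```python
# Non-parametric attention over the refined spot embeddings f_x = G(F(X)).
import math
import torch

def spot_attention(f_x: torch.Tensor) -> torch.Tensor:
    """f_x: (N, d) refined spot embeddings -> (N, d) attention-refined output."""
    q = k = v = f_x                                       # no learned projections
    d = k.shape[-1]                                       # dimension of K
    attn = torch.softmax(q @ k.T / math.sqrt(d), dim=-1)  # (N, N) interaction map
    return attn @ v
```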
Integrating dual objectives into a unified framework, we defined the overall loss function as follows:
$$\mathcal{L}=\mathcal{L}_{mse}+\mathcal{L}_{mse}^{\prime}.$$
This combined loss framework was designed to enhance the model's predictive performance by balancing feature extraction and interactive feature refinement. By summing the contributions of the primary and secondary MSE losses, the model was steered to attend to the subtleties and complexities of gene expression data, which in turn was expected to improve its capacity to capture the intricate biological relationships represented within H&E images.
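Putting the two objectives together, a hedged sketch of one training step follows; `model` and `spot_attention` are the earlier snippets, and the optimizer choice and learning rate are placeholders rather than the paper's reported hyperparameters.

```python
# One training step under the combined objective L = L_mse + L'_mse.
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # placeholder settings

def training_step(x: torch.Tensor, y_true: torch.Tensor) -> float:
    f_x = model(x)                                        # G(F(X)): direct prediction
    y_attn = spot_attention(f_x)                          # A(G(F(X))): refined prediction
    loss = (((y_true - f_x) ** 2).sum(1).mean()           # L_mse
            + ((y_true - y_attn) ** 2).sum(1).mean())     # L'_mse
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```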
Evaluation Metrics
In this study, we employed the Pearson correlation coefficient (PCC) to measure the correlation between the spatial gene expression predicted by ResSAT and the observed gene expression. The PCC ranges from −1 to 1 and is computed by dividing the covariance of two variables by the product of their standard deviations:
$$\mathrm{PCC}=\frac{\mathrm{Cov}\left(\mathbf{y}_{\mathrm{true}},\mathbf{y}_{\mathrm{pred}}\right)}{\sigma\left(\mathbf{y}_{\mathrm{true}}\right)\,\sigma\left(\mathbf{y}_{\mathrm{pred}}\right)},$$
where \(\mathrm{Cov}(\cdot,\cdot)\) denotes covariance; \(\mathbf{y}_{\mathrm{true}}\) and \(\mathbf{y}_{\mathrm{pred}}\) represent the observed gene expression and the predicted gene expression, respectively; and \(\sigma(\cdot)\) denotes standard deviation.
We computed the mean correlation over all genes as follows:
$$\bar{R}=\frac{1}{k}\sum_{i=1}^{k}\mathrm{Corr}\left(\mathbf{y}_{\mathrm{true}}^{i},\mathbf{y}_{\mathrm{pred}}^{i}\right),$$
where \(\mathrm{Corr}(\mathbf{y}_{\mathrm{true}}^{i},\mathbf{y}_{\mathrm{pred}}^{i})\) is the Pearson correlation coefficient between the true and predicted expression values of the \(i\)-th gene, \(\mathbf{y}_{\mathrm{true}}^{i}\) and \(\mathbf{y}_{\mathrm{pred}}^{i}\), and \(k\) is the total number of genes.
In addition to the mean correlation over all genes, we computed the mean correlation of the top 50 most highly expressed genes (HEGs) in the predicted gene expression, compared with the observed gene expression. This metric provides insight into prediction performance specifically for genes that are highly expressed in the spatial context. Technically, the mean correlation of the top \(k'\) HEGs was calculated as follows:
$$\bar{R}_{50}=\frac{1}{k^{\prime}}\sum_{i=1}^{k^{\prime}}\mathrm{Corr}\left(\mathbf{y}_{\mathrm{true}}^{i},\mathbf{y}_{\mathrm{pred}}^{i}\right),$$
where \(\mathrm{Corr}(\cdot,\cdot)\) is defined as above and the index \(i\) runs over the top \(k'\) HEGs. We set \(k'=50\) by default in our experiments.
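All three metrics can be computed directly from the prediction and ground-truth matrices, as in the NumPy sketch below; ranking HEGs by mean predicted expression is our reading of the definition above.

```python
# Per-gene PCC, mean correlation over all k genes, and mean over top-k' HEGs.
import numpy as np

def gene_pcc(y_true: np.ndarray, y_pred: np.ndarray) -> np.ndarray:
    """y_true, y_pred: (n_spots, n_genes) -> per-gene Pearson correlations."""
    yt = y_true - y_true.mean(axis=0)
    yp = y_pred - y_pred.mean(axis=0)
    num = (yt * yp).sum(axis=0)
    den = np.sqrt((yt ** 2).sum(axis=0) * (yp ** 2).sum(axis=0))
    return num / den

def mean_corr(y_true, y_pred, top_k=None):
    pcc = gene_pcc(y_true, y_pred)
    if top_k is not None:                         # restrict to the top-k' HEGs,
        heg = np.argsort(y_pred.mean(axis=0))[::-1][:top_k]  # by mean prediction
        pcc = pcc[heg]
    return float(np.nanmean(pcc))                 # nanmean guards zero-variance genes

# R_bar    = mean_corr(y_true, y_pred)
# R_bar_50 = mean_corr(y_true, y_pred, top_k=50)
```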
Algorithm 1 Spatial transcriptomics prediction using H&E-stained images.
Require: Image patches \(\mathbf{X}\in\mathbb{R}^{N\times(3\times W\times H)}\); paired gene expression profiles \(\mathbf{y}_{\mathrm{true}}\in\mathbb{R}^{N\times d}\)
Ensure: Expression prediction \(\mathbf{y}_{\mathrm{pred}}\in\mathbb{R}^{N\times d}\)
1: function ResSAT(\(\mathbf{X}\))
2: \(f_{x}\leftarrow\mathcal{G}(\mathcal{F}(\mathbf{X}))\) ▷ \(N\) image embeddings
3: \(Q,K,V\leftarrow f_{x}\); \(d\leftarrow\) dimension of \(K\)
4: \(\mathbf{y}_{\mathrm{pred}}\leftarrow\mathrm{softmax}(QK^{\mathrm{T}}/\sqrt{d})\,V\) ▷ self-attention transformer
5: \(\mathcal{L}\leftarrow\mathcal{L}_{mse}+\mathcal{L}_{mse}^{\prime}\)
6: return \(\mathbf{y}_{\mathrm{true}}\), \(\mathbf{y}_{\mathrm{pred}}\), \(\mathcal{L}\)