scSTAR reveals hidden heterogeneity with a real-virtual cell pair structure across single cell samples


 We present scSTAR, single-cell States Transfer Across-sample of RNA-seq data, which estimates individual cell state dynamics by creating real-virtual cell pairs across samples/conditions, therefore each cell can act as its own control when comparing between conditions. Using simulated and published scRNA-seq datasets, we show that new cell subtypes/functions which are more linked to the biological problem under investigation are discovered, especially when current methods give blur patterns.

2 making it physically not possible to track the state transition of each individual cell across samples/conditions. Various analytical methods have been developed aiming to tackle this problem 5,6 , however, they are mainly based on the average state value over a cluster of cells, e.g., cell subtypes or cell relatives, and thus the state transition patterns extracted are limited to sub-bulk rather than single cell resolution. Furthermore, interrogating the 5 average state difference across cell clusters can lead to the overlooking of modest biological differences between individual cells, especially when the difference is smaller than the inter-cluster variations. Inspired by bulk data analysis, where this issue is usually addressed by comparing paired samples from the same subject 7,8 , here we propose a paired-cell model, scSTAR (single-cell State Transition Across-samples of Rna-seq data), 10 where for each real cell in one sample/condition, scSTAR estimates its virtual projection in the other. Such creation of real-virtual cell pairs allows each cell to act as its own control when comparing between samples/conditions. The scSTAR in silico estimates of individual cell state transition is achieve by generating real -virtual cell pairs across samples/conditions, and the difference between 15 the real-virtual cell pair should only relate to the biological problem of interest. We demonstrate that cell state dynamics can be achieved by maximising the covariance between cell states from various samples, which is the partial least squares solution (PLS) 9 (see Online Methods for more details). In a typical single cell analytical workflow, scSTAR can be applied just before "clustering/trajectory" etc., step [10][11][12][13][14] (Figure 1a) to 20 reveal detailed sub-cluster structures of cell dynamics pattern in response to experimental condition changes, even when they are buried by other interference.  To benchmark scSTAR, we compared the original (unprocessed), mutual nearest neighbors (MNN) 10 and scSTAR processed data with three commonly used unsupervised clustering methods, k-means, SC3 15 and Seurat 11 on a series of simulated noisy data (Methods). scSTAR-processed data outperforms unprocessed and MNN methods with clearer clustering and better ARIs (adjusted Rand index) in almost all comparisons, especially at low signal-to-noise ratio (FC (fold change) = 1.3, p <10 -5 ) ( Figure S1).
To further evaluate the performance of scSTAR for therapy response prediction, we applied it on a melanoma data 1 to construct an immunotherapy-response prediction model using cell type composition patterns (Additional information Remark A). scSTAR 5 effectively predicted the immunotherapy induced dynamics of each cell from pretreatment samples with AUC = 0.96 which is a clear increase compared to the prediction models (AUC = 0.8) with clusters reported in the original study ( Figure 1b). We used scSTAR to uncover novel condition-specific cell types in data from a mouse immunosenescence study (Additional information Remark B). The exhausted T cell (TExh) 10 and effector memory T cell (TEM) were prominent in elderly and young animals with unstimulated cells from B6 mice, respectively. In contrast, the central memory T cell (TCM) was unaffected with aging ( Figure S3a-d). Trajectory analysis also showed that TEM tended to be replaced by TExh with aging; but TCM was separated from such trajectory ( Figure S3e-f). The cell dynamic patterns revealed by scSTAR from B6 mice were 15 confirmed in CAST mice ( Figure S3g-j). A similar conclusion was drawn from stimulated cells ( Figure S4). In addition, beside what the original studies have found, novel cell subtypes, new discriminatory patterns, clearer progression trajectories, etc., were identified with scSTAR on four more scRNA-seq datasets 2, 16-18 (Additional information Remark C, D, E, F, Figure  20 S5-10). For example, PBMC divergence between early and late stage lung cancer patients was clearly revealed by scSTAR (Figure 1c, 1d); dendritic cell state transition from normal to tumor tissues was better illustrated (Figure 1e, 1f). Furthermore, an unknown Treg cell subtype associated with suppression of antitumor immune functions (Tumor Associated Treg (TAT)) was uncovered with scSTAR. 25 The Treg cell of adjacent normal and tumor tissues from four solid tumors, liver 19 , lung 17 , breast 20 and esophageal 2 cancers ( Figure 2a, Additional information Remark G) were examined. TAT can be well characterized by high expression of SQLE, HSPA5, HSP90B1 and PSMD12 from mTOC1 gene set (Figure 2b, 2c). It was also found significantly associated with poor patients' survival (Figure S11-S23) and similar patterns were 30 observed in 11 out of 21 cancers in TGCA database 21 (Figure 2d, 2e). The patients with high TAT also tended to be nonresponding to immunotherapy on a melanoma dataset 1 (Figure 2f). Its existence was verified on a human ESCC (Esophageal Squamous Cell Carcinoma) 2 (Figure 2g-k), a mouse lung cancer 22 scRNA-seq dataset (Methods, Figure  S25) and further confirmed by immunofluorescence microscopy in human ESCC ( Figure  35 2l). TAT suppresses immune functions similar as conventional eTreg (effector Treg) cells, but with much stronger amplitudes (Figure 2k, S24) and start to be functional at very early stage of tumorigenesis ( Figure S25d). Additionally, gene enrichment analysis showed TAT has highly active glycolysis and hypoxia, which also indicated a strong immune suppression capability 23 (Figure 2d). We envisage that the ubiquitous existence of TAT 40 in various solid tumor types and species (human and mouse) endows its potential to be an effective immunotherapy target in cancer treatment. fails. scSTAR can be applied directly on scRNA-seq data with various experimental setups, and it has the potential to considerably simplify comparative analysis in a variety of research and practical applications.

Methods
Methods, including statements of data availability and any associated accession codes and references, are available in the online version of the paper. The scSTAR R package is available at https://github.com/XZouProjects/scSTAR.
where � represents the projected profile of the cells from condition B to condition A, therefore, � is an unknown variable and not observable from the data, and needs to be estimated. We consider this as the virtual cells paired with Y; V represents state transfer of each cell when the condition shifted from A to B. We define V is mainly dependent on the way of manipulation rather than the cell initial states, 20 therefore, � and V are not expected to the associated biologically.
For most cell-state analysis, the variable of interest is . To estimate , the conventional scRNAseq data interpretation strategy is to infer the properties of V by comparing and some other observed cells X from condition A. However, the cell-to-cell heterogeneity may confound such performances, 25 especially when V is relatively weak. Such obstacle is ubiquitous in real experiments, such as the changes of cell states are usually smaller than the differences between cell subtypes.
Ideally, to estimate � from Y, principal component regression (PCR) can be adopted. For Eq 1 and given X and Y, principal component analysis (PCA) can decompose X as p t, where p is PCA loading 30 matrix and t denotes the corresponding PCA score. As � and X are different cells from the same condition, therefore, they can be characterized by the same loading P, e.g., X = p and � = p ′. Let = p ′ + . Give Y and p, ′ can be estimated by minimizing square error The solution of Eq 2 is t ′ = ( ) −1 = ( ) . And � can be estimated as ( ) −1 , further However, in practice, the mismatch between X and Y may bias the results obtained using PCR.
To account for such possible mismatch, partial least square (PLS) is adopted here, and the solution can well approximate the solution of Eq. 2. Suppose X and Y can be decomposed into linear combination forms X=wt, Y=cu using PLS. The cost function of PLS provides The term � , � � is maximized when = are the eigenvectors of Var(X) since we assume �~, which is the solution of PCA of X. The second term ( , ) = ( , ) = 0, as X and V are unrelated. Therefore, the solution of Eq 3 tends to approximate PCA analysis results of X.
Furthermore, the cost function can be expressed as It can be seen that the estimation of also accounts for the maximization of correlation between the two group of cells, which may account for the mismatch between � and .
Based on the above derivation, the virtual cell � can be estimated using ( ) −1 and the difference between the real-virtual cell pair driven by the biological problem of interest, i.e., V=Y-X � .

15
Similarly, the cell dynamic components contained in X when compared to Y can also be extracted, i.e.,

Three-step scSTAR procedure
In real scenarios, the variations contained in each cell can be considered as a linear combination of 20 the following components 24, 25 : where ℎ indicates the batch effect, and indicates the interferences not related to the discrimination between the two groups, consisting of random (including technical) (r) and biological (b) noise. represents the gene expression differences between the paired conditions studied.

25
To reduce the interferences of random noise, the OGFSC 26 method proposed in our previous study is first applied with the parameter α fixed to 0.5, which implies that the genes with variances smaller than the expected values are removed.

30
Let's define 'anchor' being the cells which can be paired between the two groups. A reasonable assumption is that the differences between a pair of anchor cells is only caused by batch effect 10 .
Here, the k-nearest neighbors (KNN) 27 method is used to identify the paired anchor cells, which should be within the k nearest neighbors mutually. By default, k is set to 3, and the similarity between cells is evaluated in terms of cosine metrics. Next, a PLS model (PLS1) is constructed on the anchor cells from both groups to characterize the batch effect. The component of batch effect Vbatch is removed using the PLS model and the minimum square error method as follows: Step 2: Cell state transfer estimation by decomposing Vsignal from Vnoise

15
Using a similar principle, a second PLSDA model (PLS2) is constructed. As noise variations do not contribute to the discrimination of the two groups, PLS2 dedicatedly captures the variation of .
Using the loading matrix 2 to estimate the virtual cell profiles, and signal can be extracted from ′: In such a way, the � is the estimation of .
where i indicates the cells from the same group. Those genes with larger in group A than group B, and at the same time, with values larger than the upper boundary of the confidence interval of the non-DE genes (by default 99%), are considered up-regulated in group A. In the same way, the DE genes upregulated in group B can also be identified.
The PLSDA model tends to only preserve the genes with much larger expression in one group than in the other group. Therefore, for those genes identified as up-regulated in one group, their processed expression values in this group approximate the change amplitudes of the same cells 5 between the two groups.

Simulated datasets
To evaluate the proposed algorithm, we used the protocol presented in Haghverdi et al. 10 to create simulated datasets, and mimic case-control studies, which contain batch effect, random and biological

Evaluation metrics
The results obtained by different clustering methods on the simulated data were evaluated by Adjusted where are contingency table entry values, and are the sum of the th row and the th column of the contingency table, respectively. The closer the ARI value is to 1, the closer it is to the true cluster.

Experimental Datasets 30
To illustrate the capability of scSTAR in discovering novel cell subtypes, five datasets were adopted: a head and neck squamous cell carcinoma data (GSE103322); an esophageal squamous cell carcinoma tissue dataset (GSE145370); a hepatocellular cell carcinoma tissue dataset (GSE140228); a lung squamous cell carcinoma tissue dataset (GSE99254); a breast cancer tissue dataset (GSE114727).

11
The potential clinical application of scSTAR was demonstrated on a melanoma immunotherapy treatment dataset (GSE120575).

Multi-color IHC
The clinical specimens performed with IHC in this study were collected with informed consent for

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.