Construction of continuously expandable single-cell atlases through integration of heterogeneous datasets in a generalized cell-embedding space

Single-cell RNA-seq and ATAC-seq analyses have been widely applied to decipher cell-type and regulation complexities. However, experimental conditions often confound biological variations when comparing data from different samples. For integrative single-cell data analysis, we have developed SCALEX, a deep generative framework that maps cells into a generalized, batch-invariant cell-embedding space. Using multiple benchmarks, we demonstrate that SCALEX accurately and efficiently integrates heterogeneous single-cell data. It outperforms competing methods, especially for datasets with partial overlaps, accurately aligning similar cell populations while retaining true biological differences. We demonstrate the advantages of SCALEX by constructing continuously expandable single-cell atlases for human, mouse, and COVID-19, which were assembled from multiple data sources and can keep growing through the inclusion of new incoming data. Analyses based on these atlases revealed the complex cellular


INTRODUCTION
Single-cell RNA sequencing (scRNA-seq) and assay for transposase-accessible chromatin using sequencing (scATAC-seq) technologies enable decomposition of
Integration methods have been developed to remove batch effects in single-cell datasets 10-16. One common strategy is to identify similar cells or cell populations across batches. This includes the mutual nearest neighbors (MNN) method 10, which identifies corresponding pairs of cells between two batches by searching for mutual nearest neighbors in gene expression. Scanorama 11 generalizes the neighbor search from two batches to multiple batches. Seurat v2 13 applies canonical correlation analysis (CCA) to identify common cell populations in low-dimensional embeddings across data batches, while Seurat v3 14 introduces "cell anchors" to mitigate the problem of mixing non-overlapping populations, an issue experienced in Seurat v2. Harmony 16 also applies population matching across batches, specifically through a fuzzy clustering algorithm.
It is notable that all of these cell similarity-based methods are local, in that cell correspondence across batches is identified through the similarity of individual cells or cell anchors/clusters. Accordingly, these methods all suffer from two common limitations. First, they are prone to mixing cell populations that only exist in some batches. This becomes a severe problem for the integration of datasets that contain non-overlapping cell populations in each batch (i.e., partially-overlapping data). Second, these methods can only remove batch effects from the current batches being assessed and cannot manage batch effects from additional, subsequently obtained batches. Thus, each time a new batch is added, an entirely new integration process is required that again examines the previous batches. This severely limits the capacity to integrate new single-cell sequencing datasets.
As an alternative to the cell similarity-based local methods, scVI 17 applies a conditional variational autoencoder (VAE) 18 framework to model the inherent distribution/structure of the input single-cell data. A VAE is a deep generative method that comprises an encoder and a decoder, wherein the encoder projects all high-dimensional input data into a low-dimensional embedding, and the decoder maps the embedding back to the original data space. The VAE framework can maintain the same global internal data structure between the high- and low-dimensional spaces 19. However, scVI incorporates a set of batch-conditioned parameters into its encoder, which restrains the encoder from learning a batch-invariant embedding space and limits its generalizability to new batches.
We previously applied a VAE and designed SCALE (Single-Cell ATAC-seq Analysis via Latent feature Extraction) to model and analyze single-cell ATAC-seq data 20. We found that the VAE framework in SCALE can disentangle cell-type-related and batch-related features in a low-dimensional embedding space. Here, having redesigned the VAE framework, we introduce SCALEX as a method for integration of heterogeneous single-cell data. We demonstrate that SCALEX integration is accurate, scalable, and computationally efficient for multiple benchmark datasets from scRNA-seq and scATAC-seq studies. As a specific advantage, SCALEX accomplishes data integration by projecting all single-cell data into a generalized cell-embedding space using a batch-free encoder and a batch-specific decoder. Since the encoder is trained to only preserve batch-invariant biological variations, the resulting cell-embedding space is a generalized one, i.e., common to all projected data. SCALEX is therefore able to accurately integrate partially-overlapping datasets without mixing non-overlapping cell populations. By design, SCALEX runs very efficiently on huge datasets. These two advantages make SCALEX especially useful for the construction and research utilization of large-scale single-cell atlases based on integrating data from heterogeneous sources. New data can be projected to augment an existing atlas, enabling continuous expansion and improvement of that atlas. We demonstrated these functionalities of SCALEX in the construction and analyses of atlases for human, mouse, and COVID-19 PBMCs.

Projecting single-cell data into a generalized cell-embedding space
The central goal of single-cell data integration is to identify and align similar cells across different batches, while retaining true biological variations within and across cell-types. The fundamental concept underlying SCALEX is disentangling batch-related components from batch-invariant components of single-cell data and projecting the batch-invariant components into a generalized, batch-invariant cell-embedding space. To accomplish this, SCALEX implements a batch-free encoder and a batch-specific decoder in an asymmetric VAE framework 18 (Fig. 1a, Methods). While the batch-free encoder extracts only biologically related latent features (z) from the input single-cell data, the batch-specific decoder is responsible for reconstructing the original data from z by reintroducing batch information during data reconstruction.
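As a sketch of the objective that this asymmetric design implies, the loss can be written in standard VAE notation. The symbols $x$ (input expression profile), $b$ (batch label), $q_\phi$ (encoder), and $p_\theta$ (decoder) are assumptions introduced here for illustration and are not notation taken from the paper. The encoder is conditioned on $x$ only, whereas the decoder also receives $b$; the 0.5 weight on the regularization term is the coefficient stated in the Methods.

```latex
\mathcal{L}(x, b) \;=\; -\,\mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z, b)\right]
\;+\; 0.5\, D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\middle\|\,p(z)\right)
```

The first term penalizes reconstruction error and the second pulls the posterior over $z$ toward the prior; because $b$ appears only in the decoder term, batch information has no direct route into $z$.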
Supplying batch information to the decoder in data reconstruction allows the encoder to learn a batch-invariant data representation for each individual cell during model training, which, as a whole, defines a generalized low-dimensional cell-embedding space. This learning is also facilitated by random slicing of all input single cells from different batches into mini-batches. Each mini-batch is forced into alignment with the same data distribution under the restriction of KL-divergence in the same cell-embedding space 21. SCALEX also implements Domain-Specific Batch Normalization (DSBN) 22 (Methods), a multi-branch Batch Normalization 23, in its decoder to support incorporation of batch-specific variations when reconstructing single-cell data.
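The multi-branch idea behind DSBN can be illustrated with a minimal, dependency-free sketch of a training-mode forward pass. This is not the SCALEX implementation: the function name `dsbn_forward`, the dictionary layout of the per-batch scale and shift parameters, and the toy data are all hypothetical, and a real layer would also track running statistics for inference.

```python
from statistics import fmean, pvariance


def dsbn_forward(rows, batch_ids, gamma, beta, eps=1e-5):
    """Normalize each feature with the statistics of the row's own batch.

    rows      : list of feature vectors, one per cell
    batch_ids : batch label of each row
    gamma/beta: dict mapping batch label -> per-feature scale / shift
                (hypothetical learned parameters, one branch per batch)
    """
    out = [None] * len(rows)
    for b in set(batch_ids):
        idx = [i for i, bid in enumerate(batch_ids) if bid == b]
        n_feat = len(rows[idx[0]])
        # Per-feature mean and variance are computed within this batch only.
        means = [fmean([rows[i][j] for i in idx]) for j in range(n_feat)]
        varis = [pvariance([rows[i][j] for i in idx]) for j in range(n_feat)]
        for i in idx:
            out[i] = [
                gamma[b][j] * (rows[i][j] - means[j]) / (varis[j] + eps) ** 0.5
                + beta[b][j]
                for j in range(n_feat)
            ]
    return out


# Toy example: batch "B" is on a very different scale from batch "A".
rows = [[1.0, 10.0], [3.0, 30.0], [100.0, 5.0], [300.0, 15.0]]
batch_ids = ["A", "A", "B", "B"]
gamma = {"A": [1.0, 1.0], "B": [1.0, 1.0]}
beta = {"A": [0.0, 0.0], "B": [0.0, 0.0]}
normalized = dsbn_forward(rows, batch_ids, gamma, beta)
```

Because each batch is normalized by its own branch, the first cell of each batch ends up at roughly the same normalized value even though the raw scales differ by two orders of magnitude.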
The design underlying SCALEX allows the encoder to function as a data projector that projects single cells from different batches into a generalized, batch-invariant cell-embedding space. SCALEX thus removes batch-related variations present in single-cell data while preserving batch-invariant biological signals in the cell embedding, making it an enabling tool for integration analyses of diverse single-cell datasets without relying on searching for cell similarities.

SCALEX integration is accurate, scalable, and accommodates diverse data types
We first evaluated the data integration performance of SCALEX on multiple well-curated scRNA-seq datasets, including human pancreas (eight batches of five studies) 24-28, heart (two batches of one study) 29, and liver (two studies) 30,31, as well as human non-small-cell lung cancer (NSCLC, four studies) 32-35 and peripheral blood mononuclear cells (PBMC; two batches assayed by two different protocols) 13. For comparison, we included several other methods in the analyses: Seurat v3, Harmony, Conos, BBKNN, MNN, Scanorama, and scVI (Methods).
We used Uniform Manifold Approximation and Projection (UMAP) 36 embeddings to visualize the integration performance of all methods (Methods). Note that all of the raw datasets displayed strong batch effects: cell-types that were common in different batches were separately distributed. Overall, SCALEX, Seurat v3, and Harmony achieved the best integration performance for most of the datasets by merging common cell-types across batches while keeping disparate cell-types apart (Fig. S1). MNN and Conos integrated many datasets but left some common cell populations not well aligned.
BBKNN, Scanorama, and scVI often had unmerged common cell-types, and sometimes incorrectly mixed distinct cell-types together. For example, for the T cell populations in the two batches of the PBMC dataset (Fig. 1b), the SCALEX, Seurat v3, Harmony, and MNN integrations were effective, whereas Scanorama both showed a larger misalignment and mixed all cell-types together without maintaining clear boundaries.
We quantified single-cell data integration performance using a silhouette score 37 and a batch entropy mixing score 10 (Methods). Briefly, the silhouette score assesses the separation of biological distinctions, and the batch entropy mixing score evaluates the extent of mixing of cells across batches. Overall, SCALEX outperformed all of the other methods as assessed by the silhouette score, and tied with Seurat v3 and Harmony as the best-performing methods based on the batch entropy mixing score (Fig. 1c). We note that SCALEX obtained a slightly lower batch entropy mixing score than Seurat v3 and Harmony on the liver dataset, which contains batch-specific cell-types and thus is a partially-overlapping dataset. However, Seurat v3 and Harmony may have obtained a high batch entropy mixing score by misaligning different cell-types. Indeed, because it considers only the degree of batch mixing and ignores cell-type differences, the batch entropy mixing score is not ideally suited for assessing partially-overlapping datasets.
We also tested the scalability and computational efficiency of SCALEX on large-scale datasets by applying it to 1,369,619 cells from the human fetal atlas dataset (two data batches, Methods) 38,39. SCALEX accurately integrated these two batches, showing good alignment of the same cell-types (Fig. S2, Fig. 1d). We then compared the computational efficiency of different methods using down-sampled datasets (of 10 K, 50 K, 250 K, and 1 M cells) from the human fetal atlas dataset. The runtime and memory usage of SCALEX increased only linearly with data size, whereas MNN, Seurat v3, and Conos consumed runtime and memory that increased exponentially and thus did not scale well beyond 250 K cells. Harmony consumed over 400 gigabytes (GB) of memory in analyzing the 1 M dataset, rendering it unsuitable for integration of datasets at this scale (Fig. 1e). Notably, the deep learning framework of SCALEX enables it to run very efficiently on GPU devices, with much reduced runtime (about 10 minutes and 16 GB of memory on the 1 M dataset).
Finally, SCALEX can be used to integrate scATAC-seq data as well as cross-modality data (e.g., scRNA-seq and scATAC-seq) (Methods). For example, SCALEX integrated the mouse brain scATAC-seq dataset (two batches assayed by snATAC and 10X) 40 very well, aligning common cell subpopulations and separating distinct ones (Fig. 1f). We also integrated the cross-modality PBMC data between scRNA-seq and scATAC-seq 41,42, and found that SCALEX could correctly integrate the two types of data and could distinguish rare cells that are specific to scRNA-seq data, including pDC and platelet cells (Fig. 1g). Thus, SCALEX has broad integration capacity across various types of single-cell data.

SCALEX integrates partially-overlapping datasets
Partially-overlapping datasets present a major challenge for local cell similarity-based integration methods 13,14, often leading to over-correction (i.e., mixing of distinct cell-types). As a global integration method that projects cells into a generalized cell-embedding space, SCALEX is expected to be immune to this problem.
For example, the liver dataset is a partially-overlapping dataset where the hepatocyte population contains multiple subtypes specific to different batches: three subtypes are specific to LIVER_GSE124395, and two other subtypes only appear in LIVER_GSE115469 (Fig. S3). We noticed that SCALEX kept the five hepatocyte subtypes apart, whereas Seurat v3 mixed all five and Harmony mixed the hepatocyte-SCD and hepatocyte-TAT-AS1 cells (Fig. 2a).
To characterize the performance of SCALEX on partially-overlapping datasets, we constructed test datasets with a range of common cell-types, down-sampled from the six major cell-types in the pancreas dataset (Methods). SCALEX integration was accurate for all cases, aligning the same cell-types without over-correction, whereas both Seurat v3 and Harmony frequently mixed the cell-types, particularly for the low-overlapping cases (Fig. 2b, Fig. S4). When there was no common cell-type, both Seurat v3 and Harmony collapsed the six cell-types to three, mixing alpha with gamma cells, beta with delta cells, and acinar with ductal cells to various extents. We repeated the cell-type down-sampling analysis with the 12 cell-types in the PBMC dataset as a more complex partially-overlapping example and observed similar results (Fig. S5), demonstrating that SCALEX is robust in retaining informative biological variations for partially-overlapping datasets.

Projection of unseen data into an existing cell-embedding space
The accurate, scalable, and efficient integration performance of SCALEX depends on its encoder's capacity to project cells from various sources into a generalized, batch-invariant cell-embedding space. We speculated that once a cell-embedding space has been constructed by integrating existing data, SCALEX should be able to use the same encoder to project additional (i.e., previously unseen) data onto the same embedding space. To test this hypothesis, we used the pancreas dataset. SCALEX integration removed the strong batch effect in the raw data, aligning the same cell-types together while keeping different cell-types clearly distinguished (Fig. 3a, Fig. S6a). Cell-types were validated by the expression of their canonical markers, including rare cells such as Schwann cells and epsilon cells (Fig. S6b).
We projected three new batches 43-45 for pancreas tissues (Fig. 3b) into this "pancreas cell space" using the same encoder trained on the pancreas dataset. After projection, most of the cells in the new batches were accurately aligned to the correct cell-types in the pancreas cell space, enabling their accurate annotation by cell-type label transfer (Fig. 3c, Methods). We benchmarked annotation accuracy by calculating the Adjusted Rand Index (ARI) 46, the Normalized Mutual Information (NMI) 47, and the F1 score, using the cell-type information in the original studies as a gold standard (Methods). The SCALEX annotations achieved the highest accuracy in comparisons with annotations using three other methods (Seurat v3, Conos, and scmap).

Expanding an existing cell space by including new data
The ability to project new single-cell data into a generalized cell-embedding space allows SCALEX to readily extend this cell space. To verify this, we projected two additional melanoma data batches (SKCM_GSE72056, SKCM_GSE123139) 48,49 onto the previously constructed PBMC space. The common cell-types were correctly projected onto the same locations in the PBMC cell space (Fig. 3d). For the tumor and plasma cells only present in the melanoma data batches, SCALEX did not project these cells onto any existing cell populations in the PBMC space; rather, it projected them onto new locations close to similar cells, with the plasma cells projected to a location near B cells, and the tumor cells projected to a location near HSC cells (Fig. 3e).
SCALEX projection enables post hoc annotation of unknown cell-types in the existing cell space using new data. We noted a group of cells previously uncharacterized in the pancreas dataset (Fig. 3a). We found that these cells displayed high expression levels for known epithelial genes (Methods). We therefore assembled a collection of epithelial cells from the bronchial epithelium dataset 50 . We then projected these epithelial cells onto the pancreas cell space and found that a group of antigen-presenting airway epithelial (SLC16A7+ epithelial) cells were projected onto the same location of the uncharacterized cells (Fig. 3f). This, together with the observation that both cell populations showed similar marker gene expression (Fig. 3g), indicates that these uncharacterized cells are also SLC16A7+ epithelial cells. SCALEX thus enables discovery science in cell biology by supporting exploratory analysis with large numbers of diverse datasets.

SCALEX supports construction of expandable single-cell atlases
The ability to combine partially-overlapping data in a generalized cell-embedding space makes SCALEX a powerful tool to construct a single-cell atlas from a collection of diverse and large datasets. We applied SCALEX integration to two large and complex datasets: the mouse atlas dataset (comprising multiple organs from two studies assayed by 10X, Smart-seq2, and Microwell-seq 6,51) (Fig. 4a) and the human atlas dataset (comprising multiple organs from two studies assayed by 10X and Microwell-seq 39,52).
Despite the strong batch effects in the raw data, SCALEX integrated the three batches of the mouse atlas dataset into a unified cell-embedding space (Fig. 4b,c, Fig. S7a). Common cell-types (including B, T, and endothelial cells in all tissues and proximal tubule, urothelial, and hepatocytic cells in certain tissues) were well aligned at the same position in the cell space. Non-overlapping cell-types (such as sperm, Leydig, and small intestine cells from the Microwell-seq data, keratinocyte stem cells and large intestine cells in the Smart-seq2 data, and oligodendrocytes in the Smart-seq2 and Microwell-seq data) were located separately in the space, indicating that biological variations were well preserved (Fig. S7b).
Importantly, atlases generated with SCALEX can be used and further expanded by projecting new single-cell data to support comparative studies of cells both in the original atlas and in the new data. To illustrate this, we projected two additional data batches of aged mouse tissues from Tabula Muris Senis (Smart-seq2 and 10X) 53 and two single-tissue datasets (lung and kidney) 54 onto the SCALEX mouse atlas space. We found that the same cell-types in the new data batches were correctly projected onto the same locations in the cell-embedding space of the initial mouse atlas (Fig. 4d), which was also confirmed by the accurate cell-type annotations for the new data by label transfer from the corresponding cell-types in the initial atlas (Fig. 4e, Methods). On the one hand, this mouse atlas can then be used to accurately identify and characterize the cells in the new data based on their projected locations in the cell space; on the other hand, projection of new data enables ongoing (and informative) expansion of an existing atlas.
Following the same strategy, we also constructed a human atlas by SCALEX integration of multiple tissues from two studies (GSE134255, GSE159929) (Fig. S8a,b). SCALEX effectively eliminated the batch effects in the original data and integrated the two datasets in a unified cell-embedding space (Fig. S8c,d). Again, we were able to correctly project two additional human skin datasets (GSE130973, GSE147424) 55,56 onto the human atlas cell-embedding space (Fig. S8e), and again accurately annotated these projected skin cells (Fig. S8f, Methods). These results illustrate that: i) SCALEX enables researchers to evaluate their project-specific single-cell datasets by leveraging existing information in large-scale (and ostensibly well-annotated) cell atlases; and ii) it also enables atlas creators to informatively integrate new datasets and attendant biological insights from many research programs.
We observed that some cell subpopulations were differentially associated with patient status (Fig. 5d). A subpopulation of CD14 monocytes (CD14-ISG15-Mono), specifically associated with COVID-19 patients, was characterized by its high expression of Type I interferon-stimulated genes (ISGs) and genes associated with immune-response-related GO terms (Fig. 5e,f). The frequency of CD14-ISG15-Mono cells increased significantly from healthy donors to mild/moderate and severe patients

Comparative analysis of the SCALEX COVID-19 PBMC atlas and the SC4 consortium study
Recently, a large-scale effort of the Single Cell Consortium for COVID-19 in China (SC4) has generated a single-cell atlas that contains over 1 million cells (including PBMCs and other tissues) from 171 COVID-19 patients and 25 healthy controls 65 (Fig. S11a). We projected the consortium dataset into the cell-embedding space of the SCALEX COVID-19 PBMC atlas, and found that the cell-types of the two atlases were well aligned in the embedding space (Fig. 5h,i, Fig. S11b,c).
Our analysis, based on the SCALEX COVID-19 PBMC atlas, yielded findings consistent with two conclusions from the SC4 study 65 . First, in both analyses diverse immune subpopulations displayed differential associations with COVID-19 severity.
The proportions of CD14 monocytes, megakaryocytes, plasma cells, and pro T cells were elevated with increasing disease severity, while the proportion of pDC and mDC cells decreased (Fig. 5g). Second, we confirmed that the megakaryocyte and monocyte populations are associated with cytokine storms triggered by SARS-CoV-2 infection and are further elevated in severe patients 66, based on calculating the same cytokine score and inflammatory score (defined in the SC4 study) for the cells of our SCALEX COVID-19 PBMC atlas (Fig. 5j, Methods).
Integration of the SC4 data further substantially improved both the scope and resolution of the SCALEX COVID-19 PBMC atlas. First, these data added macrophages and epithelial cells to the cell space, enabling investigation of their potential involvement in COVID-19. The integration also supported more precise characterization of specific cell subpopulations. For example, the megakaryocyte population, not distinguished in either single atlas, could be divided into two subpopulations in the combined atlas (Fig. 5h). An exploratory functional analysis of the differentially expressed genes in these two newly delineated megakaryocyte subpopulations (TUBA8-Mega and IGKC-Mega, Fig. S11d,e) revealed enrichment for the GO term "humoral immune response" in IGKC-Mega cells but enrichment for "negative regulation of platelet activation" in TUBA8-Mega cells (Fig. 5k). These results illustrate how the continuously expandable single-cell atlases generated using SCALEX capitalize on existing large-scale data resources and also facilitate discovery of biological and biomedical insights.

DISCUSSION
SCALEX provides a VAE framework for integration of heterogeneous single-cell data by disentangling batch-invariant components from batch-related variations and projecting the batch-invariant components into a generalized, low-dimensional cell-embedding space. By design, SCALEX models the inherent batch-invariant patterns of single-cell data, distinguishing it from previously reported integration methods based on cell similarities. SCALEX does not rely on the identification of common cell-types across batches, and therefore avoids the problem of cell-type over-correction, a severe problem for partially-overlapping datasets. SCALEX thus also overcomes issues of computational complexity in cell similarity-based methods; that is, the computational time required to identify similar cells may increase exponentially as the cell number increases.
These two features make SCALEX particularly useful for construction and integrative analysis of large-scale single-cell atlases based on very heterogeneous data (i.e., datasets acquired by different labs and using different single-cell analysis platforms). Our construction of human, mouse, and COVID-19 patient single-cell atlases, which aligned well with previously reported atlases generated from coordinated large-scale consortium efforts, demonstrates the particular ability of SCALEX to produce large-scale atlases from extant small-scale datasets. SCALEX achieves data integration by projecting all single cells into a generalized cell-embedding space using a universal data projector (i.e., the encoder). This data projector only needs to be trained once, and can then be used without retraining to continuously integrate new incoming data into an existing single-cell atlas. This capacity for continuous growth makes a SCALEX atlas an elastic resource, allowing the integration of many single-cell studies to support ongoing, very large-scale research programs throughout the life sciences and biomedicine.
While the number of single-cell studies is increasing enormously each year, best practices for experimental design and sample processing are not established, and there is no obviously dominant data-acquisition platform. SCALEX's ability to informatively combine data from heterogeneous studies and platforms makes it particularly suitable for the current era of single-cell biological research. Finally, the ability to conduct exploratory analysis within a generalized cell space suggests that SCALEX should be particularly useful for large-scale integrative (e.g., pan-cancer) studies. We speculate that use of SCALEX to project single-cell datasets (including, for example, scATAC-seq and scRNA-seq) from highly diverse cancer types to construct a pan-cancer single-cell atlas may lead to the discovery of previously unknown cell types that are common to divergent carcinomas and that function in pathogenesis, malignant progression, and/or metastasis.

Overview of the SCALEX model. SCALEX applies a variational autoencoder (VAE) to project the different batches of datasets into the same batch-invariant low-dimensional embeddings by learning a batch-free encoder and a batch-specific decoder simultaneously. Since the encoder and decoder are coupled to learn a batch-free encoder, a batch label is only exposed to the decoder within the domain-specific batch normalization (DSBN) layer.
In the loss function, the first term is the reconstruction term, which minimizes the distance between the generated output data and the original input data. The second term is the regularization term, which minimizes the Kullback-Leibler divergence between the posterior distribution and the prior distribution of the latent variable z. To enable a more flexible alignment in the latent space, we adjusted the coefficient of the second term to 0.5 in the final loss function.
The overall network architecture of SCALEX consists of an encoder and a decoder.
Preprocessing for scRNA-seq. We downloaded gene expression matrices and preprocessed them using the following procedure: (i) Cells with fewer than 600 genes and genes present in fewer than 3 cells were filtered out. (ii) Total counts of each cell were normalized to 10,000. (iii) Values of each gene were subjected to log transformation with an offset of 1. (iv) The top 2,000 highly variable genes were
Preprocessing for cross-modality data (scRNA-seq and scATAC-seq). We first created a gene activity matrix using the GeneActivity function in the Signac 70 R package to quantify the activity of each gene from scATAC-seq data. We then combined the gene activity score matrix with the scRNA-seq data matrix as two individual "batches" for integration. The subsequent preprocessing followed the same procedure used for the scRNA-seq data (above).
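Steps (i) to (iii) of the scRNA-seq preprocessing described above can be sketched in plain Python on a small count matrix. This is an illustration under stated assumptions rather than the authors' pipeline: the function name `preprocess` is hypothetical, cells are filtered before genes (the text does not specify the order), and the highly-variable-gene step is omitted because its details are not recoverable from the text.

```python
import math


def preprocess(counts, min_genes=600, min_cells=3, target_sum=10_000):
    """counts: list of per-cell lists of raw gene counts (cells x genes).

    Returns (normalized matrix, indices of the genes that were kept).
    """
    # (i) Drop cells that express fewer than min_genes genes.
    cells = [row for row in counts if sum(v > 0 for v in row) >= min_genes]
    if not cells:
        return [], []
    # (i) Drop genes detected in fewer than min_cells of the remaining cells.
    n_genes = len(cells[0])
    kept = [
        j for j in range(n_genes)
        if sum(row[j] > 0 for row in cells) >= min_cells
    ]
    cells = [[row[j] for j in kept] for row in cells]
    normalized = []
    for row in cells:
        total = sum(row)
        # (ii) Scale each cell to target_sum total counts.
        scale = target_sum / total if total > 0 else 0.0
        # (iii) Log transformation with an offset of 1, i.e. log(1 + x).
        normalized.append([math.log1p(v * scale) for v in row])
    return normalized, kept


# Toy matrix with thresholds lowered so that the filters are visible.
toy = [[5, 0, 5], [0, 0, 10], [2, 1, 7], [0, 0, 0]]
norm, kept = preprocess(toy, min_genes=1, min_cells=2)
```

In the toy call, the empty fourth cell and the second gene (detected in a single cell) are removed; with the thresholds from the text (600 genes, 3 cells) the same function applies unchanged to a real count matrix.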
Visualization. The UMAP algorithm 36 was used for visualization. We applied the neighbors function from the Python package Scanpy with the parameters n_neighbors=30 and metric="euclidean" to compute the neighbor graph, followed by the umap function with min_dist=0.1 to visualize cells in a two-dimensional space.
Silhouette score. We used the silhouette score to assess the separation of biological populations, computed with the silhouette_score function in the scikit-learn Python package. For each cell, the silhouette score combines the average intra-cluster distance (a) and the average nearest-cluster distance (b):
$$ s = \frac{b - a}{\max(a, b)} $$
Here, we took UMAP embeddings as input to calculate the silhouette score.
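To make the roles of the intra-cluster distance (a) and the nearest-cluster distance (b) concrete, the following is a small pure-Python sketch of the mean silhouette coefficient on 2-D coordinates. It is an illustration only, not the scikit-learn implementation used in the study; the function name `silhouette` is hypothetical, and at least two clusters are assumed.

```python
import math


def silhouette(points, labels):
    """Mean silhouette coefficient over all points (e.g. 2-D UMAP coordinates)."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            # A singleton cluster is conventionally given a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


pts = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
good = silhouette(pts, [0, 0, 1, 1])   # labels follow the two tight groups
bad = silhouette(pts, [0, 1, 0, 1])    # labels cut across the groups
```

Labels that follow the two tight groups give a score close to 1, whereas labels that cut across them give a negative score.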
Batch entropy mixing score. The batch entropy mixing score (adapted from "entropy of batch mixing" 10) was used to assess the regional mixing of cells from different batches, with a high score suggesting that cells from different batches are well mixed together.
The batch entropy mixing score was computed as follows:
(1) Calculated the proportion $P_i$ of cell numbers in each batch to the total cell numbers.
(3) Calculated the 100 nearest neighbors for each randomly chosen cell.
(4) Defined the regional mixing entropy for each cell in terms of $p_i$, the proportion of cells from batch i in a given region (such that $\sum_i p_i = 1$), and $p_i'$, a correction term that eliminates the deviation caused by the different cell numbers in different batches.
The total mixing entropy was then calculated as the sum of the regional mixing entropies.
i.e., we first clustered all cells in a dataset into several major clusters; then, for some large clusters, we further clustered them into minor clusters.
We used the confusion matrix to evaluate the accuracy of cell-type annotations (predictions) when a gold-standard annotation is available, which is typical for "cell-type annotation by label transfer" (see above). In cell-type annotation by label transfer, we predict the cell-types for a single-cell data_batch_1 using the annotations in another data_batch_2. When data_batch_1 is already annotated with cell-types, we can calculate the confusion matrix C=[Cij] to compare the cell-type predictions with the existing cell-type annotations, where Cij equals the percentage of cells known to be in cell-type i and predicted to be in cell-type j.
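The confusion matrix described above can be sketched as follows. The function name `confusion_matrix_percent` and the toy labels are hypothetical, and the percentages are normalized per known cell-type (per row), which is an assumption: the text does not state whether the percentage is taken over each row or over all cells.

```python
from collections import Counter


def confusion_matrix_percent(known, predicted):
    """Row-normalized confusion matrix, expressed in percent.

    Returns (labels, C), where C[i][j] is the percentage of cells known to be
    labels[i] that were predicted to be labels[j].
    """
    labels = sorted(set(known) | set(predicted))
    pair_counts = Counter(zip(known, predicted))
    row_totals = Counter(known)
    matrix = []
    for i in labels:
        total = row_totals[i]
        matrix.append(
            [100.0 * pair_counts[(i, j)] / total if total else 0.0 for j in labels]
        )
    return labels, matrix


known = ["alpha", "alpha", "beta", "beta"]       # gold-standard annotations
predicted = ["alpha", "beta", "beta", "beta"]    # labels obtained by transfer
labels, C = confusion_matrix_percent(known, predicted)
```

A perfect label transfer would put 100 on the diagonal and 0 elsewhere; off-diagonal mass shows which cell-types are being confused with each other.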
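Adjusted Rand Index. The Rand Index (RI) computes a similarity score between two clustering assignments by considering matched and unmatched assignment pairs, independent of the number of clusters. The Adjusted Rand Index (ARI) is obtained by adjusting the RI for chance:
$$ \mathrm{ARI} = \frac{\mathrm{RI} - E[\mathrm{RI}]}{\max(\mathrm{RI}) - E[\mathrm{RI}]} $$
Given the contingency table with entries $n_{ij}$, row sums $a_i$, column sums $b_j$, and total $n$, the ARI can also be written as:
$$ \mathrm{ARI} = \frac{\sum_{ij}\binom{n_{ij}}{2} - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\Big/\binom{n}{2}}{\frac{1}{2}\left[\sum_i\binom{a_i}{2} + \sum_j\binom{b_j}{2}\right] - \left[\sum_i\binom{a_i}{2}\sum_j\binom{b_j}{2}\right]\Big/\binom{n}{2}} $$
The ARI score is 0 in expectation for random prediction and 1 for a perfect match.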
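The contingency-table form of the ARI can be sketched with the standard library alone. This is the standard pair-counting formulation, not code from the study; the function name `adjusted_rand_index` is hypothetical, and at least two cells are assumed.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """ARI computed from pair counts in the contingency table."""
    n = len(labels_true)
    contingency = Counter(zip(labels_true, labels_pred))
    # Number of cell pairs that share a cell of the table, a row, or a column.
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_rows = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        # Degenerate case, e.g. both assignments consist of a single cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


same_partition = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
partial_match = adjusted_rand_index([0, 0, 1, 2], [0, 0, 1, 1])
```

The first call relabels the same partition and therefore scores 1; the second merges two true clusters into one predicted cluster and scores 4/7.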

Normalized Mutual Information.
$$ \mathrm{NMI}(P, T) = \frac{I(P; T)}{\sqrt{H(P)\,H(T)}} $$
where P and T are the categorical distributions for the predicted and real clusterings, I is the mutual information, and H is the Shannon entropy.
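A minimal sketch of the NMI with the geometric-mean normalization, $\sqrt{H(P)H(T)}$, is given below. The function names are hypothetical, natural logarithms are used throughout (the base cancels in the ratio), and the degenerate case of zero entropy is mapped to 0 by assumption.

```python
from collections import Counter
from math import log, sqrt


def entropy(labels):
    """Shannon entropy of a label assignment."""
    n = len(labels)
    return -sum((c / n) * log(c / n) for c in Counter(labels).values())


def mutual_information(p_labels, t_labels):
    """Mutual information between two label assignments of the same cells."""
    n = len(p_labels)
    joint = Counter(zip(p_labels, t_labels))
    p_counts, t_counts = Counter(p_labels), Counter(t_labels)
    return sum(
        (c / n) * log(c * n / (p_counts[p] * t_counts[t]))
        for (p, t), c in joint.items()
    )


def nmi(p_labels, t_labels):
    """Mutual information normalized by sqrt(H(P) * H(T))."""
    denom = sqrt(entropy(p_labels) * entropy(t_labels))
    return mutual_information(p_labels, t_labels) / denom if denom > 0 else 0.0


identical = nmi([0, 0, 1, 1], [1, 1, 0, 0])     # same partition, relabeled
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])   # statistically independent
```

Identical partitions give an NMI of 1 regardless of how the clusters are named, and independent partitions give 0.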

Generation of partially-overlapping datasets.
To simulate partially-overlapping datasets from the pancreas dataset, we used the pancreas_celseq2 and pancreas_smartseq2 data batches, and worked with only six cell-types (alpha, beta, ductal, acinar, delta, gamma). For each simulated partially-overlapping dataset, we randomly selected three to six cell-types from each batch, and counted the number of the common cell-types, which was used as the indicator for the overlapping level (whole integers, 0 to 6). We required the union of cell-types in the newly simulated partially-overlapping dataset to cover all six cell-types.
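The sampling constraint described above (three to six cell-types per batch, with the union covering all six) can be sketched as simple rejection sampling. The function name `sample_partial_overlap` and the use of rejection sampling are assumptions for illustration; the text does not say how the constraint was enforced.

```python
import random

CELL_TYPES = ["alpha", "beta", "ductal", "acinar", "delta", "gamma"]


def sample_partial_overlap(rng, cell_types=CELL_TYPES, min_k=3):
    """Draw cell-type subsets for two batches whose union covers all types.

    Returns (types in batch 1, types in batch 2, number of common types).
    """
    all_types = set(cell_types)
    while True:
        # Each batch keeps between min_k and all of the cell-types.
        a = set(rng.sample(cell_types, rng.randint(min_k, len(cell_types))))
        b = set(rng.sample(cell_types, rng.randint(min_k, len(cell_types))))
        # Redraw until every cell-type appears in at least one batch.
        if a | b == all_types:
            return a, b, len(a & b)


rng = random.Random(0)
types_1, types_2, n_common = sample_partial_overlap(rng)
```

Because the union must contain all six cell-types, the number of common cell-types equals the sum of the two subset sizes minus six, which spans the overlap levels 0 to 6.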
For the PBMC dataset, we used both of the two data batches and worked with twelve cell-types (B, CD4 T, CD4 naive T, CD8 T, CD8 naive T, DC, HSC, Megakaryocyte, NK, monocyte-CD14, monocyte-FCGR3A, pDC). We used the same down-sampling strategy as for the pancreas dataset (above).
Table 1). We then calculated the cytokine and inflammatory scores from the raw gene expression profile using the score_genes function implemented in Scanpy.

Data availability.
All data analyzed in this study are publicly available; the data sources are detailed in Supplementary Table 2.

Figure 1
The design and performance of SCALEX for single-cell data integration. a, SCALEX models the global structure of single-cell data using a variational autoencoder (VAE) framework. b, UMAP embeddings of the PBMC dataset before and after integration using SCALEX, Seurat v3, Harmony, Conos, or Scanorama, colored by batch and cell-type. c, Scatter plot showing a quantitative comparison of the silhouette score (y-axis) and the batch entropy mixing score (x-axis) on the benchmark datasets. d, UMAP embeddings of the SCALEX integration of the human fetal atlas dataset, colored by batch and cell-type. e, Comparison of computation efficiency on datasets of different sizes (sampled from the whole human fetal atlas dataset), including runtime (left) and memory usage (right). f, UMAP embeddings of the mouse brain scATAC-seq dataset before (left) and after integration (middle, right); colored by data batch or Leiden clustering. g, UMAP embeddings of the PBMC cross-modality dataset before (left) and after integration (middle, right); colored by batch or cell-type.

Figure 2
Comparison of integration performance over partially-overlapping datasets by different methods. a, Comparison over the liver dataset. b, Comparison over simulated datasets with different numbers of common cell-types (obtained by down-sampling the pancreas dataset). Misalignments are highlighted with red circles.

Figure 3
Projecting heterogeneous data into a generalized cell-embedding space. a, UMAP embeddings of the pancreas dataset after integration by SCALEX, colored by cell-type. b, UMAP embeddings of three pancreas data batches projected onto the pancreas space, colored by cell-type; the light gray shadows represent the original pancreas dataset. c, Confusion matrix between ground truth cell-types and those annotated by different methods. ARI, NMI and F1 scores (top) measure the annotation accuracy. d, UMAP embeddings of the PBMC dataset after integration and the two melanoma data batches projected onto the PBMC space, colored by cell-type; the light gray shadows represent the original PBMC dataset. e, The PBMC space that includes the original PBMC dataset and the two projected melanoma data batches. f, Annotating an uncharacterized small cell population in the pancreas dataset by projection of the bronchial epithelium data batches into the pancreas cell space. Only the uncharacterized cells in the pancreas dataset (left) and the SLC16A7+ epithelial cells in the bronchial epithelium data batches (right) are colored. g, Heatmap showing the normalized expression of the top-10 ranking specific genes for the uncharacterized cell population in different cell-types.

Figure 4
Construction of an expandable mouse single-cell atlas. a, Datasets acquired using different technologies (Smart-seq2, 10X, and Microwell-seq) covering various tissues used for construction of the mouse atlas.
b, UMAP embeddings of the mouse atlas dataset colored by batch and tissue. c, UMAP embeddings of the mouse atlas after SCALEX integration, labeled with and colored by cell-type. d, Two Tabula Muris Senis data batches and two mouse tissue (lung and kidney) datasets are projected onto the cell space of the mouse atlas, with the same cell-type colors as in c. e, Confusion matrix of the cell-type annotations by SCALEX and those in the original studies. Color bar represents the percentage of cells in confusion matrix Cij known to be cell-type i and predicted to be cell-type j.

Figure 5