Supplementary figures
Fig. S1 | Comparison of integration performance on benchmark datasets. UMAP embeddings for benchmark datasets grouped by batches and cell-types, before and after integration by different methods. Misalignments are highlighted with red circles.
Fig. S2 | The human fetal atlas. a, UMAP embeddings of the human fetal atlas dataset colored by batch before integration. b, Similarity matrix of meta-cell representations for different cell-types in the two data batches in the generalized cell-embedding space. Color bar represents the Pearson correlation coefficient between the average meta-cell representations of two cell-types, each from its respective data batch. c, Comparison of computational efficiency on datasets of different sizes (sampled from the whole human fetal atlas dataset), including runtime (left) and memory usage (right), in log scale.
Fig. S3 | Canonical marker genes of different cell-types and UMAP embeddings of the liver dataset. a, Dot plot of canonical marker genes for each cell-type. Dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker. b, UMAP embeddings of the liver dataset, colored by batch (left) and cell-type (right) after SCALEX integration. c, Normalized marker gene expression on the UMAP embeddings of the five hepatocyte subtypes. Color bar represents the expression level.
Fig. S4 | Integration over partially-overlapping datasets down-sampled from the pancreas dataset. Partially-overlapping datasets were generated by down-sampling the pancreas dataset, each consisting of batches with a decreasing number of common cell-types (ranging from 0 to 6). Integration results for SCALEX, Seurat, and Harmony are shown in the UMAP embeddings colored by batch (left) and cell-type (right) (overlapping number decreases from 6 to 0). Misalignments are highlighted with red circles.
Fig. S5 | Integration over partially-overlapping datasets down-sampled from the PBMC dataset. Partially-overlapping datasets were generated by down-sampling the PBMC dataset, each consisting of batches with a decreasing number of common cell-types (ranging from 0 to 6). Integration results for SCALEX, Seurat, and Harmony are shown in the UMAP embeddings colored by batch (left) and cell-type (right) (overlapping number decreases from 6 to 0). Misalignments are highlighted with red circles.
Fig. S6 | The pancreas dataset and the additional data batches. a, UMAP embeddings of the pancreas dataset, the three additional pancreas data batches, and the bronchial epithelium data batches (data from three donors), grouped by batch. b, Dot plot of canonical markers for the cell-types of the reference pancreas dataset. Dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker.
Fig. S7 | The SCALEX mouse atlas. a, UMAP embeddings of the mouse atlas data before integration, colored by batch. b, UMAP embeddings of three mouse atlas data batches (Smart-seq2, 10X, and Microwell-seq) after integration, colored by cell-type; the light gray shadows represent the original mouse atlas dataset. c, Dot plot of the top 5 cell-type-specific genes for each cell-type in the mouse atlas dataset. Dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker.
Fig. S8 | The SCALEX human atlas. a, The human atlas datasets, acquired using different technologies (Smart-seq2, 10X, and Microwell-seq) and covering various tissues, used for construction of the human atlas. b-c, UMAP embeddings of the human atlas dataset colored by batch and cell-type, before (b) and after (c) integration. d, Similarity matrix of meta-cell representations for cell-types in the two data batches in the generalized cell-embedding space after SCALEX integration. Color bar represents the Pearson correlation coefficient between the average meta-cell representations of two cell-types, each from its respective data batch. e, UMAP embeddings of the human atlas and two additional projected data batches colored by cell-type. f, Confusion matrix of the cell-type annotations by SCALEX and those in the original study. Color bar represents the percentage of cells in each entry Cij of the confusion matrix, that is, cells known to be in cell-type i and predicted to be in cell-type j.
Fig. S9 | COVID-19 immune landscape. a, UMAP embeddings of the raw COVID-19 PBMC dataset before integration. b, UMAP embeddings of the COVID-19 PBMC atlas colored by condition and Leiden clustering after SCALEX integration. c, Dot plot of canonical marker genes for each cell-type. Dot color represents average expression level, while dot size represents the proportion of cells in the group expressing the marker. d, UMAP embeddings of the COVID-19 PBMC atlas in individual batches after SCALEX integration, colored by cell-type; the light gray shadows represent the other batches of the COVID-19 PBMC atlas. e, Frequency of cell distributions across healthy people and influenza patient controls, and among mild/moderate, severe, and convalescent COVID-19 patients. Dirichlet-multinomial regression was used for pairwise comparisons, ***p<0.001, **p<0.01, *p<0.05.
Fig. S10 | COVID-19 heterogeneous dysfunctional immune response. a, Stacked violin plot of differentially-expressed genes between PNPLA2-Immature_Neutrophil and NCF1-Immature_Neutrophil cells. b, GO terms enriched in the differentially-expressed genes for PNPLA2-Immature_Neutrophil and NCF1-Immature_Neutrophil cells. c, Stacked violin plot of differentially-expressed genes between PRDM1-Plasma and MZB1-Plasma cells. d, GO terms enriched in the differentially-expressed genes for PRDM1-Plasma and MZB1-Plasma cells.
Fig. S11 | Projection of the SC4 dataset onto the SCALEX COVID-19 PBMC atlas. a-b, UMAP embeddings of the SC4 dataset before integration (a) and after projection onto the SCALEX COVID-19 PBMC space (b). c, Separate UMAP embeddings of each SC4 data batch, after being projected onto the SCALEX COVID-19 PBMC space, colored by cell-type. d, UMAP embeddings of the TUBA8-Mega and IGKC-Mega cells. e, UMAP embeddings of the differentially-expressed genes of TUBA8-Mega and IGKC-Mega cells.