FastIntegration significantly improves time efficiency and memory usage over Seurat V4 and enables large-scale data integration
We created FastIntegration by improving the original Seurat batch integration functions in several aspects (Figure 1A, details given in Methods). Most importantly, FastIntegration performs principal component analysis (PCA) only once and employs the resulting latent space during the iterative batch correction. By avoiding PCA re-computation at each iteration, the gene expression correction step can be parallelized and sped up. We added outlier detection and remove them at each round of integration to prevent overcorrection. We also optimized the use of variables to reduce memory usage and parallelized various functions to improve overall runtime performance.
As we intend FastIntegration to be a fast and high-capacity version of Seurat integration, we first assessed the results produced by these two methods for consistency with the same input dataset of 20 blood samples. Both methods obtained high Adjusted Rand Index (ARI) scores based on the ground-truth cell type labels and were highly consistent between them with clusters showing high mutual ARI scores (Figure S1). In this dataset, neutrophils only appear in two batches, posing a challenge to batch integration methods to maintain them separate from other similar myeloid cells. Seurat mixed the neutrophils with monocytes while FastIntegration was able to retain them as a distinct cluster, which can be attributed to our overcorrection prevention steps (Figure S1A-B). Next, we compared their memory and time usage for integrating 10 to 200 samples (Figure 1B). When integrating the same number of samples, FastIntegration required less memory and time than Seurat. Moreover, Seurat was unable to integrate more than 30 samples, showing a “Cholmod error” caused by a failure to create a large sparse matrix. In contrast, FastIntegration was able to integrate 200 samples within 40 minutes.
To further demonstrate the large-scale integration capabilities of FastIntegration, we integrated 4.2 million cells from 877 blood samples collected from COVID-19 patients and healthy donors. The integration took less than two days with 150 threads and 2 TB peak memory usage. After integration, inter-study variations were removed (Figure 1C) while the major cell types were well separated, confirmed by their marker gene expressions (Figure S2). Strikingly, when we extracted the plasma cells and re-clustered them, we could clearly distinguish IgG, IgM, and IgA expressing plasma cells (Figure 1D), which demonstrated that the integrated data retains the subtle differences between cells. We have also applied FastIntegration on the DISCO database (https://www.immunesinglecell.org/) 12 to create integrated atlases for different tissues, diseases, and cell types.
FastIntegration performs well on both homogeneous and heterogeneous data integration
Current benchmarking studies 10,11 have focused on the performance of integrating homogeneous datasets derived from the same tissue type. With the exponential growth of single-cell data, integration of heterogeneous datasets across different tissue types is becoming increasingly common. For example, several studies have integrated blood with lung tissue samples from COVID-19 patients to study the systematic immune responses at different sites 13,14. Here, we evaluated the performance of FastIntegration and four other state-of-the-art methods, namely Harmony 6, BBKNN 7, scVI 8, and Scanorama 9 in integrating large numbers of homogeneous and heterogeneous datasets. The homogeneous datasets comprised 50 blood samples, and the heterogeneous datasets contained 50 blood and 10 lung samples. Sample Local Inverse Simpson's Index (sLISI) and inverse cell type LISI (1/cLISI) scores were used to assess batch mixing and cell type separation respectively, with a higher value denoting better performance. As BBKNN outputs only a graph, we could not calculate its LISI scores. Adjusted rand index (ARI), a global evaluation metric, was used to assess the concordance between clustering and manual cell type annotation.
For both homogeneous and heterogeneous data integration, FastIntegration and BBKNN produced the highest ARI scores (Figure 2A). However, BBKNN returns a distance-based neighbor list only, which prevents some downstream analyses. In terms of sLISI scores, Harmony was the top method with FastIntegration ranked second for both homogenous and heterogeneous integration (Figure 2B, Figure S3). In terms of cLISI, FastIntegration was the best method for both homogenous and heterogeneous integration (Figure 2B, Figure S3). Among the evaluated methods, FastIntegration struck the best balance between cell type separation (cLISI) and batch mixing (sLISI) (Figure 2B). In the homogeneous dataset, only 2 of 50 samples contained neutrophils, and only scVI, Scanorama, and FastIntegration were able to maintain the neutrophils as a separate cluster (Figure S4). This means that they are less likely to overcorrect and can better maintain separation of batch-specific cell types as compared to Harmony and BBKNN. For the heterogeneous dataset, only FastIntegration was able to integrate the immune cells from both blood and lung samples (Figure S5). Collectively, FastIntegration showed superior and stable performance for both homogeneous and heterogeneous data integration.
Batch corrected gene expression values generated by FastIntegration can be used for downstream analyses
The batch correction functions of FastIntegration and Seurat aim to align shared cell types across batches, removing batch effects present in the gene expression values in this process. The resulting batch-corrected values are potentially usable for various downstream analyses but there has been no study that systematically assessed the performance of batch corrected values in downstream analyses. Moreover, Seurat integration by default uses the top 2000 highly variable genes and returns the corrected values only for these genes. If all available genes are used, it encounters performance issues when integrating more than 30,000 cells. As a result, using “FindConservedMarkers” on the uncorrected values is recommended for post-integration Differential Gene Expression (DGE) analysis (https://github.com/satijalab/seurat/issues/1717). However, FindConservedMarkers cannot identify DEGs of clusters that only exist in a subset of batches. Given that FastIntegration can integrate thousands of batches and output batch corrected expression values for all genes, we evaluated the differences between using batch-corrected and raw values in DEG identification and quantification of transcription factor (TF) activity.
The homogenous dataset of 50 blood samples was used to assess the DEG identification. For each cell type pair, we identified their DEGs twice using different sample subsets. The first batch of DEGs was identified between cell type 1 in samples 1 to 25, and cell type 2 in samples 26 to 50, and the second batch was identified between cell type 1 in samples 26 to 50, and cell type 2 in samples 1 to 25 (Figure 2C). This mimics the situation where some cell types only exist in a subset of samples. We found that the percentage of overlapping DEGs identified using corrected value is much larger (58% vs 41%) than those identified using the uncorrected values, implying that the DEGs identified using corrected values is more stable across batches (Figure 2D). Moreover, we found that 1.3% of the DEGs identified using the uncorrected values showed opposite fold changes in the two comparisons, which we attribute to batch effects. Only 0.02% of the DEGs identified using the corrected values show such opposite fold changes. Interestingly, when examining the performance of DEG identification in each cell type pair, we found that the performance of using batch-corrected values is much better for similar cell types such as switched B and non-switched B cells (Figure 2E). This ability to identify subtle cell differences is especially attractive for big data integration. To check the reliability of the DEGs, we also computed the DEGs between all pairs of cell types using the batch-corrected and raw values of all 50 samples, and compared them with the DEGs computed within individual samples. Assuming that the true DEGs can also be found in most individual samples, we counted the frequency of conserved DEGs being found in the DEGs identified within individual samples. A higher frequency denotes higher reliability for the detected DEGs. We found the DEGs identified with batch-corrected values to be more conserved in each single sample than DEGs identified with raw values (Figure 2F).
Using batch corrected values for DEG identification is highly controversial with the choice of uncorrected values being commonly stated as the preferred approach. In this study, we demonstrated how batch effects can negatively affect the use of uncorrected values for DEG identification with cell types not found in all batches. By using batch corrected values, we reduced the number of DEGs with incorrectly identified fold change directions and increased the overall accuracy. For cell types found in all batches, the choice of uncorrected versus corrected values should not matter, as the presence of the cells across all batches should cancel out the batch effects. Another commonly encountered scenario in DEG identification is that of technical batch effects confounding with biological or patient effects. While the former is nuisance and should be removed, the latter can be of significance to the study at hand. The disentanglement of both effects remains an unsolved challenge when all batches do not have the same sample to serve as a reference to guide the data integration. As such, uncorrected values remain the only choice for investigating biological effects present between samples.
To test the impact of batch effects on TF activity prediction, we applied SCENIC(ref) to the raw and batch-corrected data to predict the TF activities of individual cells. In the TF activity generated UMAP for uncorrected values, the batch effects are clearly present among the same cell types across different batches (Figure 2G). For the FastIntegration’s corrected values, most of the batch effects was eliminated with the same cell types across different batches clustering together.