3.1 Batch correction and integration
In order to show the effectiveness of scAEQN, we have compared the results obtained by different batch correction methods using different datasets. Low dimensional data of latent layer in scAEQN was extracted for visualization and cluster.
We generate t-SNE plots to show correction and integration results in different methods from MESC dataset (Fig. 2B). As shown in Fig. 2B, raw data has obvious batches that it is separated two blocks with different batches in every cluster. As shown on cell type distribution plots, scAEQN and scVI has obvious three cluster blocks contrast with raw data, Combat, MNN, Harmony, Limma, QuantNorm, and scBatch method. They are more obvious for the separation of cell types. However, Scanorama overcorrected on the MESC dataset that cells of different cell types are mixed in every cluster block. In batch distribution plots, scAEQN, scVI, Combat, Harmony, Limma methods have similar levels of batch integration, MNN, QuantNorm, and scBatch methods still retain obvious batch blocks after batch corrected, Scanorama has superior effective batch integration. Likewise, we plot t-SNE plots in different methods on mouse pancreas and human pancreas2 datasets (Supplementary Figure S1 and Figure S2). For the mouse pancreas dataset, it does not have a more obvious batch effect from raw data. For such data, scAEQN still retains a superior clustering pattern of cell types after batch correction, compared with Harmony, scBatch, scVI, Scanorama methods, Combat, Limma, MNN, and QuantNorm methods show obvious batches after batch correction. For the human pancreas2 dataset, scAEQN largely preserves the clustering pattern of cell types although there are subtle batches within each cluster block, compared to Combat, Harmony, Limma, and MNN methods that cell type cluster blocks are mixed. Scanorama method has overcorrected likewise, and scVI method has superior performance on this dataset. In conclusion, by visualizing the distribution of batch and cell type on different datasets, it illustrates that scAEQN can effectively correct the batch effect and retain cell type cluster pattern in scRNA-seq data.
While a clear batch integration of small datasets with only a few cell types or batches is easy to evaluate by visualization method, the assessment can become subjective when the comparison is made between different competitive methods on large complicated datasets [33]. The difficulty increases when the batches are not clearly segregated, like brain, mouse neuron and human pancreas1 datasets. Therefore, we also use three metrics to help us assess the batch correction methods in terms of batch mixing and cell type purity (Fig. 2A). From the ARI F1 score (Fig. 2Aa), scAEQN get the highest ARI F1 score in brain, mouse neuron, and mouse pancreas datasets compared to other methods, the scores are 0.7742, 0.4997, and 0.8441 respectively. scAEQN is second only to scVI on MESC and human pancreas2 datasets, with scores of 0.9479 and 0.9886 on the MESC dataset and 0.7325 and 0.8664 on the human pancreas2 dataset. scAEQN does not perform well on the human pancreas1 dataset, only higher than raw data. Whereas, in the ASW F1 score (Fig. 2Ab), scAEQN is ahead of other batch correction methods with a score 0.6030 on the human pancreas1 dataset. Besides, scAEQN gets the highest ASW F1 scores in mouse pancreas and human pancreas2 datasets. scAEQN is comparable to other methods on brain, MESC, and mouse neuron datasets. For LISI F1score (Fig. 2Ac), scAEQN has superior scores on mouse pancreas and human pancreas2 datasets. scAEQN score of 0.2924 is lower than MNN (0.3790) and Scanorama (0.4610) methods on the brain dataset and its score of 0.2926 is lower than Combat (0.3250), Limma (0.4169), and Harmony (0.4601) methods on MESC dataset. On the mouse neuron dataset, the scAEQN score of 0.3558 is lower than Limma (0.3708), Scanorama (0.4399), and Harmony (0.3714) methods and its score of 0.4415 is lower than Limma (0.4935) and Harmony (0.4724) methods on human pancreas1 datasets.
In general, the metrics of scAEQN are improved compared with raw data, indicating that scAEQN has the ability of batch correction. From the quantitative comparison of three metrics, scAEQN has a better ability of batch correction compared with other methods.
3.2 Cluster analysis
The final purpose of batch correction is to conduct more accurate downstream analysis of data, so it is vital to conduct downstream analysis of batch corrected data. We select six common single-cell RNA-seq datasets (Table 2) and perform batch correction with different batch correction methods for cluster analysis. These methods are BBKNN, Combat, Harmony, Limma, MNN, QuantNorm, scBatch, scVI, and Scanorama. On six datasets, we utilized k-means to cluster the corrected expression data or similarity matrix, and calculate its ARI score. In addition, we also cluster the uncorrected raw matrix with the same clustering method as a control group to verify whether the batch correction is effective before clustering. The clustering results generated by k-means clustering are unstable, we adopt 10 times clustering to get the average value of ARI to get the final evaluation index (Table 3).
Table 3
ARI for different batch correction methods in six datasets
| Brain | MESC | Mouse neuron | Human pancreas1 | Mouse pancreas | Human pancreas2 |
Uncorrected | 0.2044 | 0.4705 | 0.3202 | 0.4320 | 0.1704 | 0.3255 |
BBKNN | 0.1769 | 0.5037 | 0.3621 | 0.3814 | 0.1170 | 0.3401 |
Combat | 0.2179 | 0.5477 | 0.3528 | 0.4308 | 0.1594 | 0.3739 |
Harmony | 0.1769 | 0.5130 | 0.5127 | 0.4759 | 0.1227 | 0.3409 |
Limma | 0.1635 | 0.5043 | 0.3730 | 0.4114 | 0.1518 | 0.3900 |
MNN | 0.1291 | 0.5027 | 0.3434 | 0.2628 | 0.1129 | 0.2912 |
Quantnorm | 0.3299 | 0.2488 | 0.1178 | 0.4621 | 0.5272 | 0.5207 |
scbatch | 0.2022 | 0.5344 | 0.5451 | 0.5926 | 0.1434 | / |
scVI | 0.2325 | 0.9842 | 0.4221 | 0.5002 | 0.3530 | 0.5867 |
Scanorama | 0.0871 | 0.0769 | 0.1844 | 0.0187 | 0.1259 | 0.3418 |
scAEQN | 0.4992 | 0.8469 | 0.6733 | 0.6612 | 0.7832 | 0.6077 |
From the comparison of ARI metric results, the performance of scAEQN is effective than the rest of nine batch corrected methods on brain, mouse neuron, human pancreas1, mouse pancreas and human pancreas2 datasets, which improves at least 2% higher than the best of the nine methods and improves 26% than the second on mouse pancreas dataset. On mouse embryonic stem cells (MESC) dataset, scAEQN is second to scVI but it still outperformed the other methods and improved by 37% over raw data. We did not get the clustering ARI metric of the human pancreas2 dataset after correction by the scBatch method, because there are too many cells in the human pancreas2 dataset, and scBatch cannot process datasets with too many cells, which is also mentioned by the author. From the bar plot of ARI (Fig. 3A), scAEQN has obvious advantages in the cluster. Based on this, we demonstrate that scAEQN improves clustering accuracy while removing batch effects, and is superior to most batch correction methods.
3.3 Differential expression genes analysis
To further assess scAEQN, DEGs analysis is performed from the corrected data by scAEQN and compared with the uncorrected original data. We utilize FindAllMarkers function in Seurat3 package to obtain DEGs on six datasets, where log2FC threshold is 0.05 because of the corrected data are normalized and log transformed in scAEQN. The top 5 DEGs are selected by log2FC rank in every cell type on six datasets, and then plotted heatmap by ComplexHeatmap [28] method (Fig. 3B). We can see that the top 5 DEGs obtained by scAEQN have more distinct expression blocks in each cell type, compared with uncorrected raw data from Fig. 3B (a-d). scAEQN has distinct differential expression gene blocks in large cell types, like alpha, beta, ductal, delta, endothelial, and gamma cell types on the mouse pancreas (Fig. 3Be). And on the human pancreas2 dataset, scAEQN has also distinct differential expression gene blocks in large cell types, like alpha, beta, ductal, delta, acinar, and gamma cell types (Fig. 3Bf).
Based on the human pancreas1 dataset, we analyze the identification results of DEGs obtained by different batch correction methods using Seurat [15]. We compare our method scAEQN with the data of raw count matrix, MNN [10], Limma [3], Combat [2], and scBatch [13], eventually. We set the conditions of adjusted P-value < 1e-2 and log2FC value > 0.05 to analysis the number of DEGs (Fig. 3D). More DEGs are detected by scAEQN under the same conditions, which avoid missing possible genes, and provide a solid foundation for gene function analysis. At the same time, the number of DEGs, detected jointly by scAEQN with uncorrected, MNN, Limma, Combat, scBatch, is 279, 444, 206, 224, 420. Nearly half of the DEGs detected by other methods are consistent with scAEQN. Finally, we conclude the common DEGs obtained by these six methods, and obtain 165 co-expressed DEGs. The specific detected DEGs and co-expressed genes are shown in the supplementary note S10. In conclusion, scAEQN is feasible in the identification and analysis of differential expression genes.
3.4 Running time
To verify the operability of scAEQN, we record running time in seconds of different batch correction methods under five datasets with different cell numbers varied from tens to thousands, and parameters setting are described in section 2.2.
As can be seen from Fig. 3C, the abscissa represents the number of cells arranged from small to large in each dataset, the ordinate represents the running time in seconds, and the lines of different colors represent the running time of different batch corrections methods. Figure 3C(b) is an enlarged plot of the line marked by the red circle in Fig. 3C(a). There is a dotted line between 1886 and 3289 in Fig. 3C(a), because a total of 12 hours were spent on the dataset of 3289 cells using the scBatch method and did not get results, and spent around 4 hours on the dataset of 1886 cells. The running time of the scBatch method increases exponentially with the increase of the number of cells. MNN method also takes more time with the increase in the number of cells similarly, but the running time is still acceptable. Combat, Limma, and BBKNN are stable in running time. The running time of scAEQN is related to the parameters of the selected QuantNorm. When the parameters of each dataset in scAEQN are set to “method = vectorization”, the running time of our method will increase with the increase of the number of cells, but the longest time is only a few minutes. For instance, on the dataset with 651 cells of the human pancreas from different donors, when the parameter is set to “method = row/column”, the running time for scAEQN is only six minutes. Although scAEQN is not the best in running time performance, it is still within our acceptable range.