The Cellcano framework. Cellcano uses gene-level summaries of the raw scATAC-seq data as inputs. It incorporates the ArchR 30 pipeline to process the raw data and obtain gene scores for both reference and target datasets. Cellcano then applies an F-test on the reference gene scores to select cell-type-specific genes as features for model construction 31. After obtaining the reference and target gene scores for the selected features, Cellcano adopts a two-round supervised celltyping strategy, shown in Fig. 1. In the first round, Cellcano trains an MLP model on the reference gene scores and predicts cell types in the target data. If the target size is too small, Cellcano stops and returns the prediction results. When the target size is large enough (e.g., over 1,000 cells), Cellcano performs a second round of model training to improve the predictions. The second round starts with selecting anchors from the target data. Anchors are defined as the cells predicted with higher confidence in the first round. To identify them, we extract the first-round prediction probabilities for all cells and calculate their entropies; cells with low entropy are deemed more confidently predicted. To ensure that all cell types are represented and the cell type proportions remain consistent, Cellcano selects anchors with relatively low entropies within each predicted cell type to form a new reference dataset. Next, Cellcano trains a KD model on the new reference dataset (using predicted cell types as pseudo labels) and predicts cell types for the remaining non-anchors.
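The entropy-based confidence scoring described above can be sketched as follows. The probabilities are made-up first-round softmax outputs, and the 0.4 cutoff mirrors Cellcano's default entropy quantile; note the actual implementation applies the quantile within each predicted cell type rather than globally:

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy of each cell's predicted class probabilities.

    probs: (n_cells, n_types) softmax outputs from the first-round MLP.
    Lower entropy means a more confident prediction.
    """
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

# Made-up first-round probabilities for 4 cells over 3 cell types
probs = np.array([
    [0.98, 0.01, 0.01],  # confident -> low entropy
    [0.34, 0.33, 0.33],  # uncertain -> high entropy
    [0.90, 0.05, 0.05],
    [0.50, 0.30, 0.20],
])
ent = prediction_entropy(probs)
anchors = ent < np.quantile(ent, 0.4)  # keep the most confident cells
```

Cells below the entropy quantile cutoff become anchors and serve as the pseudo-labeled training set for the second-round KD model.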
Cellcano outperforms existing scATAC-seq celltyping methods. We collect four human peripheral blood mononuclear cells (PBMCs) datasets and two mouse brain datasets (Supplementary Table 1) to benchmark Cellcano. Among the four human PBMCs datasets, one is cell-sorted by FACS and can be considered a “gold standard”. The cell types in the other three datasets are annotated based on computational methods and prior biological knowledge, making them “silver standards” 32. For the six datasets, we design 50 experiments (Supplementary Table 2), which comprehensively cover different real application scenarios. We benchmark Cellcano against six competing methods: Seurat 24, scJoint 25, EpiAnno 27, ACTINN 8, scANVI 9 and SingleR 4. Even though Seurat and scJoint are not specifically designed for scATAC-seq celltyping with scATAC-seq data as reference, they can take gene scores as input for cell type prediction. For EpiAnno, we use ArchR to call peaks and count reads overlapping the peak regions to generate peak-by-cell matrices as its input. ACTINN, scANVI and SingleR, which are designed for scRNA-seq celltyping, can also take gene scores as input for scATAC-seq celltyping. ACTINN is a deep-learning-based method that is very similar to the first-round prediction of Cellcano. scANVI is a semi-supervised method that uses a deep generative model with variational inference to first integrate scRNA-seq datasets and then transfer annotations. SingleR is a correlation-based supervised scRNA-seq celltyping method. According to a recent survey study, SingleR is the second-best performer behind Seurat in scRNA-seq celltyping 12. We include these three methods to explore whether existing scRNA-seq celltyping methods can be directly applied to scATAC-seq with gene scores as input.
We evaluate the prediction performances from all methods by different metrics, including overall accuracy (Acc), adjusted rand index (ARI), macro F1 score (macroF1), Cohen’s kappa (κ), median F1 score (medianF1), median precision, and median recall.
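As a sketch of how these metrics can be computed with scikit-learn (the labels are toy values for illustration, and the “median” variants reflect our reading of the paper: the median of the per-cell-type scores):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, adjusted_rand_score,
                             cohen_kappa_score, f1_score,
                             precision_recall_fscore_support)

y_true = ["B", "NK", "NK", "CD4 T", "CD8 T", "CD8 T"]  # toy ground truth
y_pred = ["B", "NK", "B", "CD4 T", "CD8 T", "CD4 T"]   # toy predictions

acc = accuracy_score(y_true, y_pred)                  # overall accuracy
ari = adjusted_rand_score(y_true, y_pred)             # ARI
macro_f1 = f1_score(y_true, y_pred, average="macro")  # macroF1
kappa = cohen_kappa_score(y_true, y_pred)             # Cohen's kappa

# Per-cell-type precision/recall/F1, then the median across cell types
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=sorted(set(y_true)), zero_division=0)
median_precision, median_recall, median_f1 = map(np.median, (prec, rec, f1))
```

The multiclass metrics (Acc, ARI, macroF1, κ) summarize all cell types jointly, whereas the median precision/recall/F1 treat each cell type as its own binary task before aggregating.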
We first compare the performances using one fixed “gold standard” target dataset (Fig. 2A, Figure S1). In total, there are seven experiments using different references. Cellcano achieves the highest average accuracy across the seven experiments (Fig. 2A), while scJoint is a close second; the accuracies from all other methods are significantly lower. For all other metrics (Figure S1), Cellcano and scJoint in general achieve the highest performances among all methods, consistent with the results in prediction accuracy. Overall, the third-best performer is ACTINN, which is a variation of Cellcano's first-round prediction. The performance differences between Cellcano and ACTINN indicate the improvement brought by our second-round prediction.
We then evaluate the performances in all other 22 human PBMCs experiments (Fig. 2B, Figure S2). Since the experiments involve different target datasets, the baseline performance for each experiment can vary. We eliminate this baseline effect by computing the performance gains/losses for each method against the average: for each experiment, we take the average of the prediction performances from all seven methods and then subtract this average from each method's performance. Across these experimental scenarios, Cellcano and ACTINN outperform all other methods in all seven metrics. Compared to ACTINN, Cellcano achieves better performances in the multiclass metrics (Acc, ARI, macroF1 and Cohen's kappa κ), while ACTINN is slightly better in the metrics that treat each cell type as a binary classification task (median precision, median recall and medianF1). Similarly, we evaluate the performances in 21 mouse brain experiments (Fig. 2C, Figure S3) and observe that Cellcano again outperforms all other methods. Note that EpiAnno fails to generate results for two relatively large experiments (over 32k cells) due to memory limits. Overall, Cellcano outperforms all other methods across all scenarios: two systems (human PBMCs and mouse brain), 50 experiments, and seven metrics.
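The baseline removal described above amounts to centering each experiment's scores at the across-method mean; a minimal sketch with made-up accuracies:

```python
import pandas as pd

# Rows: methods, columns: experiments; the accuracies are made up.
scores = pd.DataFrame(
    {"exp1": [0.90, 0.85, 0.70], "exp2": [0.80, 0.78, 0.60]},
    index=["Cellcano", "scJoint", "SingleR"])

# Subtract each experiment's across-method mean to get gains/losses;
# positive values mean above-average performance for that experiment.
gains = scores - scores.mean(axis=0)
```

By construction, the gains/losses within each experiment sum to zero, so the centered values are comparable across experiments with different baselines.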
To further demonstrate how Cellcano's two-round procedure improves prediction, we use one experiment (one FACS-sorted human PBMCs dataset as target, a combination of four individuals from the Satpathy et al. 33 PBMCs dataset as reference) as an example to visualize the prediction results after each round. Figure 2D labels the ground truth cell types provided by FACS. After the first-round prediction, some B cells and natural killer (NK) cells are wrongly predicted as Monocytes (Fig. 2E, red boxes). After the second round, these wrong predictions are corrected (Fig. 2F, red boxes). Another observation is that, after the first round, many CD8 T cells on the boundary between the CD4 T cell and CD8 T cell clusters (black dotted line area) are not correctly predicted; after the second round, most of these cells are correctly assigned back to CD8 T cells. These results demonstrate the advantage of our second-round prediction with the KD model.
The choice of using gene scores as input. As mentioned before, scATAC-seq data can be represented in three different feature spaces: genome-wide fixed-size bins, peaks, and genes. Genome-wide fixed-size bins have a very large feature space, which poses a heavy computational burden. Peaks are not pre-defined and require additional steps for calling and unifying them. More importantly, since the peaks differ for each prediction task, one cannot reuse a pre-trained prediction model for new target data. In this work, we use gene scores as input because they are well defined and have a small feature space. It is also possible to further connect a model trained on gene scores to scRNA-seq models and vice versa. Our comparisons between Cellcano and EpiAnno (which takes peak counts as input) show that Cellcano with gene scores as input outperforms EpiAnno. Here, we further justify choosing gene scores over fixed-size bin counts.
We evaluate Cellcano with gene scores or fixed-size 500-bp bin counts as input in both human PBMCs and mouse brain experiments. The comparison of prediction accuracies from human PBMCs is shown in Fig. 3A. The two types of inputs produce almost identical results in most experiments, while two outliers show that using gene scores is significantly better. Trends are similar when comparing ARI and macroF1 (Figure S4A-B). In the mouse brain experiments, Cellcano with gene scores as input is better than with fixed-size bins in 62 out of 63 results (Fig. 3B, Figure S4C-D); the single exception is one experiment measured by ARI (Figure S4C). Overall, these results demonstrate that using gene scores as input works much better than using bin counts. In addition, the computational time with gene scores as input is much shorter (Fig. 3C). Considering both computational and prediction performances, we decide to use gene scores as input.
Gene scores can be summarized in different ways 18,30 and our next question is how to utilize these gene score models in Cellcano. In total, ArchR provides 54 variations of gene score models (details in Supplementary Note 1). Our results use the ArchR-recommended gene score model, which was shown to be the most accurate at inferring gene expression in matched scATAC-seq and scRNA-seq data. The model resides in the “GeneModel-GB-Exponential-Extend” category, which covers signals on the whole gene body and adds bi-directional exponential decay weights to reads outside the gene body area according to their distance to the gene body. We investigate whether using another model, or applying a majority voting strategy over all 54 ArchR gene score models, can result in a better prediction. To that end, we use each gene score model as input in Cellcano to predict cell types, and then take a majority vote across all 54 predictions to determine the final cell type call. More details about the majority voting strategy can be found in the Methods section. Figure 3D and Figure S5A-B show the results from using all individual gene score models and the majority voting in four human PBMCs experiments. We again remove the baseline performance for each experiment to compute the gains/losses, and then order the heatmap so that the left column has the largest average gain. Overall, the top ten or so performing gene score models are very similar. Majority voting ranks first in Acc, and the ArchR-recommended gene score model ranks fourth (Fig. 3D). However, the average Acc difference between majority voting and the ArchR-recommended gene score model is very small (0.34%). Similar trends are observed in ARI (majority voting ranks 1st and the ArchR-recommended model ranks 4th) and macroF1 (the ArchR-recommended model ranks 2nd and majority voting ranks 5th).
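A minimal sketch of a per-cell majority vote over the per-model predictions (ties here break toward the label encountered first, which may differ from the tie-breaking rule in the Methods section):

```python
from collections import Counter

def majority_vote(per_model_calls):
    """per_model_calls: one list of per-cell labels per gene score model.

    Returns the most frequent label for each cell across all models.
    """
    n_cells = len(per_model_calls[0])
    final = []
    for i in range(n_cells):
        votes = Counter(model[i] for model in per_model_calls)
        final.append(votes.most_common(1)[0][0])  # most frequent label wins
    return final

# Toy example: 3 gene score models voting on 2 cells
calls = [["B", "NK"], ["B", "Mono"], ["CD4 T", "NK"]]
consensus = majority_vote(calls)  # -> ["B", "NK"]
```

In the actual experiments this vote would run over 54 prediction lists, one per ArchR gene score model, which is why it costs roughly 54 times the compute of a single model.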
In summary, the slight improvement in Acc and ARI from majority voting, which uses 54 times the computational resources, is not worthwhile. Moreover, since the ArchR-recommended gene score model produces very similar results and performs well in other tasks, we recommend using it as Cellcano's default input.
Properties of Cellcano anchors. Similar to Seurat, Cellcano selects anchors from the target dataset and uses them as a reference to predict cell types for non-anchors in the second round. However, the anchor selection procedure in Cellcano is very different. Seurat selects anchors as Mutual Nearest Neighbors (MNN) in a low-dimensional space determined by canonical correlation analysis (CCA), which relies on a linear relationship between reference and target; the number of anchors selected is further determined by the parameter controlling how many neighbors are examined. In contrast, Cellcano obtains predicted probabilities for the target cells from the first-round MLP and then selects anchors based on the prediction entropies. The number of anchors in Cellcano is determined by a quantile of the entropies within each cell type.
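The per-cell-type quantile selection can be sketched as below; the column names and helper function are illustrative rather than Cellcano's actual API, and the 0.4 default mirrors the entropy quantile cutoff used in the experiments:

```python
import pandas as pd

def select_anchors(pred, quantile=0.4):
    """pred: DataFrame with 'pred_type' and 'entropy' columns.

    Returns a boolean mask marking anchors, chosen within each
    predicted cell type so every type is represented.
    """
    # Per-type entropy cutoff, broadcast back to each cell
    cutoff = pred.groupby("pred_type")["entropy"].transform(
        lambda e: e.quantile(quantile))
    return pred["entropy"] <= cutoff

# Toy first-round output: predicted types with their entropies
pred = pd.DataFrame({
    "pred_type": ["B", "B", "B", "NK", "NK", "NK"],
    "entropy":   [0.05, 0.90, 0.10, 0.30, 0.02, 0.70],
})
mask = select_anchors(pred)  # each type keeps its most confident cells
```

Applying the quantile within each predicted type, rather than globally, is what keeps the anchor set's cell type proportions consistent with the target data.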
We show an example below where Cellcano selects 40% of the cells (entropy quantile cutoff of 0.4) from the target dataset as anchors (Fig. 4A-C). Across all 29 human PBMCs experiments, Cellcano's anchors achieve much higher accuracy (median: 91.93%, mean: 91.04%) than Seurat's (median: 71.36%, mean: 69.04%), even though Cellcano selects more anchors (Fig. 4A-B). We also compare the non-anchor performance in Cellcano before and after the second-round prediction and observe an increase of 2.44% in the median and 3.27% in the mean. This improvement further validates the usefulness of our two-round prediction procedure. We then use the same experiment shown in Fig. 2D (one FACS-sorted human PBMCs dataset as target, a combination of four individuals as reference) as an example to visualize the anchors selected by Cellcano. We conclude that anchors selected by Cellcano better capture the full scope of the target data distribution (Fig. 4C, Figure S6) compared to those selected by Seurat (Fig. 4D).
Next, we explore how the number of selected anchors affects prediction performance. We first compare the performance between anchors and non-anchors under different quantile cutoffs in the human PBMCs experiments (Figure S7A-C) and mouse brain experiments (Figure S8A-C). We observe that when the quantile cutoff is lower, the anchor accuracies are higher. This is expected because a more stringent confidence criterion leads to higher prediction accuracy. However, using fewer anchors means a smaller training dataset in the second round. Moreover, using too few anchors could fail to capture the full scope of the target distribution, since the most confident cells tend to cluster around the cluster centroids. Both factors can result in decreased performance when predicting non-anchors (Figure S7-8, right panels, where the cutoff is 0.1). On the other hand, choosing too many anchors will include many wrongly predicted cells, which is detrimental to the second-round model training. Thus, the final prediction performance depends on a balance between anchor number and anchor accuracy.
We investigate the impact of anchor numbers in human PBMCs experiments (Fig. 4E, S9A-B) and mouse brain experiments (Fig. 4F, S9C-D). We select different numbers of anchors according to entropy quantile cutoffs (0.1 to 0.6 with step size 0.1) and compare the final prediction performances. Similar to the comparison in Fig. 2B, each experiment has a prediction baseline which is calculated as the average performance by using different quantile cutoffs. We then calculate the performance gains/losses for each quantile cutoff against the average.
Overall, the performances are stable when using quantile cutoffs of 0.2 or above (the median Acc varies within −0.4% ~ +0.9% in the human PBMCs experiments and −0.9% ~ +1.4% in the mouse brain experiments). The worst performance occurs when using 0.1 as the quantile cutoff, which again can be explained by the small training size in the second round and the failure to capture the target distribution. Therefore, with a moderate number of anchors, Cellcano produces stable prediction results. By default, we use 0.4 as the entropy quantile cutoff in our software implementation.
Cellcano works better than prediction with batch effects removed. A key advantage of Cellcano's two-round approach is that training a model on anchors in the target data alleviates the distributional shift between the reference and target data. This distributional shift is often caused by batch effects in high-throughput data, which raises the question of whether our two-round strategy is better than first removing the batch effect and then applying a direct prediction. We therefore compare Cellcano against its direct (first-round only) prediction with the batch effect removed by Harmony 34, which was demonstrated to have the best performance in a previous benchmark study 35. We also include Seurat in the evaluations because Seurat also claims to mitigate batch effects by projecting both datasets into the same low-dimensional space using CCA.
Our designed experiments involve combinations of individuals and batches in both reference and target datasets. We use the human PBMCs datasets as an example to show how Cellcano performs compared to Cellcano's first-round prediction with the batch effect removed and to Seurat (Figure S10). In the “inter-dataset” category, we use one individual from one dataset to predict one individual from another dataset; performances from the three methods are comparable. Next, we investigate the scenarios where the reference (inter-dataset: combined reference) or the target (inter-dataset: combined target) combines multiple batches or individuals. The combined-reference experiments represent the scenario where one wishes to use a large collection of public datasets to increase the reference size and improve the prediction. The combined-target experiments represent the scenario where the target data come from multiple batches but one wants to determine their cell types in one run. In both cases, there are batch effects inside either the reference or the target dataset, and especially when batch effects exist in the target dataset, the results with batch effects removed by Harmony show significantly reduced performances. We generate a low-dimensional visualization before (Figure S11A) and after batch effect removal (Figure S11B) using one example where one FACS-sorted PBMCs dataset is taken as the target and four individuals from Satpathy et al. 33 are combined as the reference. We observe that the batch correction results are chaotic and deteriorate the prediction (Figure S11B). Similar problems appear in Seurat, whose average performance drops when batch effects exist in the target datasets.
Among all comparisons, Cellcano is not affected by the batch effects and steadily outperforms the other methods. Even in intra-dataset prediction, where we use different individuals from the same dataset as reference and target, Cellcano still largely outperforms the other two methods, which may indicate their failure to handle effects among different individuals. These results demonstrate that Cellcano can handle data from different individuals and batches in both reference and target data without batch effect removal. This opens the possibility of training prediction models on a large compendium of datasets.
Cellcano is computationally efficient and scalable. We evaluate the computational performance of Cellcano and report the methods' runtimes for all experiments (Fig. 5A-B). For fair comparison, we combine the training time and prediction time into an overall runtime for Cellcano and EpiAnno, because all other methods need both reference and target datasets as input to perform prediction. Here, we do not consider data pre-processing time (such as the time used to generate peak counts or gene scores from the raw data). We sort the experiments by the total number of cells in the reference and target datasets. The results indicate that when the cell number is low, Cellcano, Seurat and scJoint use about the same runtime. However, as the cell number increases, Seurat and scJoint can be three times slower than Cellcano, and all other methods are 5 ~ 100 times slower. ACTINN, despite being a one-round prediction, is slower than Cellcano because ACTINN uses all genes for training while Cellcano selects 3,000 genes as features. An additional advantage is that, as a supervised celltyping method, Cellcano's pretrained models can be reused in future predictions, meaning the runtime can be further reduced by supplying the first-round pretrained model as input.