The Cellcano framework. Cellcano uses gene-level summaries of the raw scATAC-seq data as inputs. It incorporates the ArchR 30 pipeline to process the raw data and obtain gene scores for both reference and target datasets. Cellcano then applies an F-test on the reference gene scores to select cell-type-specific genes as features for model construction 31. After obtaining the reference and target gene scores for the selected features, Cellcano adopts a two-round supervised celltyping strategy, shown in Fig. 1. In the first round, Cellcano trains an MLP model on the reference gene scores and predicts cell types in the target data. If the target size is too small, Cellcano stops and returns the prediction results. When the target size is large enough (e.g., over 1,000 cells), Cellcano performs a second round of model training to improve the predictions. The second round starts with selecting anchors from the target data. Anchors are defined as the cells predicted with higher confidence in the first round. To identify them, we extract the first-round prediction probabilities for all cells and calculate their entropies; cells with low entropy are deemed more confidently predicted. To ensure that all cell types are represented and the cell type proportions remain consistent, Cellcano selects anchors with relatively low entropies within each predicted cell type to form a new reference dataset. Next, Cellcano trains a KD model on the new reference dataset (using predicted cell types as pseudo labels) and predicts cell types for the remaining non-anchors.
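The entropy-based confidence scoring described above can be sketched as follows. The probabilities are made-up first-round softmax outputs, and the 0.4 cutoff mirrors Cellcano's default entropy quantile; note the actual implementation applies the quantile within each predicted cell type rather than globally:

```python
import numpy as np

def prediction_entropy(probs):
    """Shannon entropy of each cell's predicted class probabilities.

    probs: (n_cells, n_types) softmax outputs from the first-round MLP.
    Lower entropy means a more confident prediction.
    """
    eps = 1e-12  # guard against log(0)
    return -np.sum(probs * np.log(probs + eps), axis=1)

# Made-up first-round probabilities for 4 cells over 3 cell types
probs = np.array([
    [0.98, 0.01, 0.01],  # confident -> low entropy
    [0.34, 0.33, 0.33],  # uncertain -> high entropy
    [0.90, 0.05, 0.05],
    [0.50, 0.30, 0.20],
])
ent = prediction_entropy(probs)
anchors = ent < np.quantile(ent, 0.4)  # keep the most confident cells
```

Cells below the entropy quantile cutoff become anchors and serve as the pseudo-labeled training set for the second-round KD model.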
Cellcano outperforms existing scATAC-seq celltyping methods. We collect four human peripheral blood mononuclear cells (PBMCs) datasets and two mouse brain datasets (Supplementary Table 1) to benchmark Cellcano. Among the four human PBMCs datasets, one is cell-sorted by FACS and can be considered a “gold standard”. The cell types in the other three datasets are annotated based on computational methods and prior biological knowledge, making them “silver standards” 32. For the six datasets, we design 50 experiments (Supplementary Table 2), which comprehensively cover different real application scenarios. We benchmark Cellcano against six competing methods: Seurat 24, scJoint 25, EpiAnno 27, ACTINN 8, scANVI 9 and SingleR 4. Even though Seurat and scJoint are not specifically designed for scATAC-seq celltyping with scATAC-seq data as reference, they can take gene scores as input for cell type prediction. For EpiAnno, we use ArchR to call peaks and count reads overlapping the peak regions to generate peak-by-cell matrices as its input. ACTINN, scANVI and SingleR, which are designed for scRNA-seq celltyping, can also take gene scores as input for scATAC-seq celltyping. ACTINN is a deep-learning-based method that is very similar to the first-round prediction of Cellcano. scANVI is a semi-supervised method that uses a deep generative model with variational inference to first integrate scRNA-seq datasets and then transfer annotations. SingleR is a correlation-based supervised scRNA-seq celltyping method. According to a recent survey study, SingleR is the second-best performer behind Seurat in scRNA-seq celltyping 12. We include these three methods to explore whether existing scRNA-seq celltyping methods can be directly applied to scATAC-seq with gene scores as input.
We evaluate the prediction performances from all methods by different metrics, including overall accuracy (Acc), adjusted rand index (ARI), macro F1 score (macroF1), Cohen’s kappa (κ), median F1 score (medianF1), median precision, and median recall.
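As a sketch of how these metrics can be computed with scikit-learn (the labels are toy values for illustration, and the “median” variants reflect our reading of the paper: the median of the per-cell-type scores):

```python
import numpy as np
from sklearn.metrics import (accuracy_score, adjusted_rand_score,
                             cohen_kappa_score, f1_score,
                             precision_recall_fscore_support)

y_true = ["B", "NK", "NK", "CD4 T", "CD8 T", "CD8 T"]  # toy ground truth
y_pred = ["B", "NK", "B", "CD4 T", "CD8 T", "CD4 T"]   # toy predictions

acc = accuracy_score(y_true, y_pred)                  # overall accuracy
ari = adjusted_rand_score(y_true, y_pred)             # ARI
macro_f1 = f1_score(y_true, y_pred, average="macro")  # macroF1
kappa = cohen_kappa_score(y_true, y_pred)             # Cohen's kappa

# Per-cell-type precision/recall/F1, then the median across cell types
prec, rec, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=sorted(set(y_true)), zero_division=0)
median_precision, median_recall, median_f1 = map(np.median, (prec, rec, f1))
```

The multiclass metrics (Acc, ARI, macroF1, κ) summarize all cell types jointly, whereas the median precision/recall/F1 treat each cell type as its own binary task before aggregating.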
We first compare the performances using one fixed “gold standard” target dataset (Fig. 2A, Figure S1). In total, there are seven experiments using different references. Cellcano achieves the highest average accuracy across the seven experiments (Fig. 2A), while scJoint is a close second; the accuracies from all other methods are significantly lower. For all other metrics (Figure S1), Cellcano and scJoint in general achieve the highest performances among all methods, consistent with the results in prediction accuracy. Overall, the third-best performer is ACTINN, which is a variation of Cellcano's first-round prediction. The performance differences between Cellcano and ACTINN indicate the improvement brought by our second-round prediction.
We then evaluate the performances in all other 22 human PBMCs experiments (Fig. 2B, Figure S2). Since the experiments involve different target datasets, the baseline performance for each experiment can vary. We eliminate this baseline effect by computing the performance gains/losses for each method against the average: for each experiment, we take the average of the prediction performances from all seven methods and then subtract this average from each method's performance. Across these experimental scenarios, Cellcano and ACTINN outperform all other methods in all seven metrics. Compared to ACTINN, Cellcano achieves better performances in the multiclass metrics (Acc, ARI, macroF1 and Cohen's kappa κ), while ACTINN is slightly better in the metrics that treat each cell type as a binary classification task (median precision, median recall and medianF1). Similarly, we evaluate the performances in 21 mouse brain experiments (Fig. 2C, Figure S3) and observe that Cellcano again outperforms all other methods. Note that EpiAnno fails to generate results for two relatively large experiments (over 32k cells) due to memory limits. Overall, Cellcano outperforms all other methods across all scenarios: two systems (human PBMCs and mouse brain), 50 experiments, and seven metrics.
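The baseline removal described above amounts to centering each experiment's scores at the across-method mean; a minimal sketch with made-up accuracies:

```python
import pandas as pd

# Rows: methods, columns: experiments; the accuracies are made up.
scores = pd.DataFrame(
    {"exp1": [0.90, 0.85, 0.70], "exp2": [0.80, 0.78, 0.60]},
    index=["Cellcano", "scJoint", "SingleR"])

# Subtract each experiment's across-method mean to get gains/losses;
# positive values mean above-average performance for that experiment.
gains = scores - scores.mean(axis=0)
```

By construction, the gains/losses within each experiment sum to zero, so the centered values are comparable across experiments with different baselines.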
To further demonstrate how Cellcano's two-round procedure improves prediction, we use one experiment (one FACS-sorted human PBMCs dataset as target, a combination of four individuals from the Satpathy et al. 33 PBMCs dataset as reference) as an example to visualize the prediction results after each round. Figure 2D labels the ground truth cell types provided by FACS. After the first-round prediction, some B cells and natural killer (NK) cells are wrongly predicted as Monocytes (Fig. 2E, red boxes). After the second round, these wrong predictions are corrected (Fig. 2F, red boxes). Another observation is that, after the first round, many CD8 T cells on the boundary between the CD4 T cell and CD8 T cell clusters (black dotted line area) are not correctly predicted; after the second round, most of these cells are correctly assigned back to CD8 T cells. These results demonstrate the advantage of our second-round prediction with the KD model.
The choice of using gene scores as input. As mentioned before, scATAC-seq data can be represented in three different feature spaces: genome-wide fixed-size bins, peaks, and genes. Genome-wide fixed-size bins have a very large feature space, which poses a heavy computational burden. Peaks are not pre-defined and require additional steps for calling and unifying them. More importantly, since the peaks differ for each prediction task, one cannot reuse a pre-trained prediction model for new target data. In this work, we use gene scores as input because they are well defined and have a small feature space. It is also possible to further connect a model trained on gene scores to scRNA-seq models and vice versa. Our comparisons between Cellcano and EpiAnno (which takes peak counts as input) show that Cellcano with gene scores as input outperforms EpiAnno. Here, we further justify choosing gene scores over fixed-size bin counts.
We evaluate Cellcano with gene scores or fixed-size 500-bp bin counts as input in both human PBMCs and mouse brain experiments. The comparison of prediction accuracies from human PBMCs is shown in Fig. 3A. The two types of inputs produce almost identical results in most experiments, while two outliers show that using gene scores is significantly better. Trends are similar when comparing ARI and macroF1 (Figure S4A-B). In the mouse brain experiments, Cellcano with gene scores as input is better than with fixed-size bins in 62 out of 63 results (Fig. 3B, Figure S4C-D); the single exception is one experiment measured by ARI (Figure S4C). Overall, these results demonstrate that using gene scores as input works much better than using bin counts. In addition, the computational time with gene scores as input is much shorter (Fig. 3C). Considering both computational and prediction performances, we decide to use gene scores as input.
Gene scores can be summarized in different ways 18,30 and our next question is how to utilize these gene score models in Cellcano. In total, ArchR provides 54 variations of gene score models (details in Supplementary Note 1). Our results use the ArchR-recommended gene score model, which was shown to be the most accurate at inferring gene expression in matched scATAC-seq and scRNA-seq data. The model resides in the “GeneModel-GB-Exponential-Extend” category, which covers signals on the whole gene body and adds bi-directional exponential decay weights to reads outside the gene body area according to their distance to the gene body. We investigate whether using another model, or applying a majority voting strategy over all 54 ArchR gene score models, can result in a better prediction. To that end, we use each gene score model as input in Cellcano to predict cell types, and then take a majority vote across all 54 predictions to determine the final cell type call. More details about the majority voting strategy can be found in the Methods section. Figure 3D and Figure S5A-B show the results from using all individual gene score models and the majority voting in four human PBMCs experiments. We again remove the baseline performance for each experiment to compute the gains/losses, and then order the heatmap so that the left column has the largest average gain. Overall, the top ten or so performing gene score models are very similar. Majority voting ranks first in Acc, and the ArchR-recommended gene score model ranks fourth (Fig. 3D). However, the average Acc difference between majority voting and the ArchR-recommended gene score model is very small (0.34%). Similar trends are observed in ARI (majority voting ranks 1st and the ArchR-recommended model ranks 4th) and macroF1 (the ArchR-recommended model ranks 2nd and majority voting ranks 5th).
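A minimal sketch of a per-cell majority vote over the per-model predictions (ties here break toward the label encountered first, which may differ from the tie-breaking rule in the Methods section):

```python
from collections import Counter

def majority_vote(per_model_calls):
    """per_model_calls: one list of per-cell labels per gene score model.

    Returns the most frequent label for each cell across all models.
    """
    n_cells = len(per_model_calls[0])
    final = []
    for i in range(n_cells):
        votes = Counter(model[i] for model in per_model_calls)
        final.append(votes.most_common(1)[0][0])  # most frequent label wins
    return final

# Toy example: 3 gene score models voting on 2 cells
calls = [["B", "NK"], ["B", "Mono"], ["CD4 T", "NK"]]
consensus = majority_vote(calls)  # -> ["B", "NK"]
```

In the actual experiments this vote would run over 54 prediction lists, one per ArchR gene score model, which is why it costs roughly 54 times the compute of a single model.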
In summary, the slight improvement in Acc and ARI from majority voting, which uses 54 times the computational resources, is not worthwhile. Moreover, since the ArchR-recommended gene score model produces very similar results and performs well in other tasks, we recommend using it as Cellcano's default input.
Properties of Cellcano anchors. Similar to Seurat, Cellcano selects anchors from the target dataset and uses them as a reference to predict cell types for non-anchors in the second round. However, the anchor selection procedure in Cellcano is very different. Seurat selects anchors as Mutual Nearest Neighbors (MNN) in a low-dimensional space determined by canonical correlation analysis (CCA), which relies on a linear relationship between reference and target; the number of anchors selected is further determined by the parameter controlling how many neighbors are examined. In contrast, Cellcano obtains predicted probabilities for the target cells from the first-round MLP and then selects anchors based on the prediction entropies. The number of anchors in Cellcano is determined by a quantile of the entropies within each cell type.
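The per-cell-type quantile selection can be sketched as below; the column names and helper function are illustrative rather than Cellcano's actual API, and the 0.4 default mirrors the entropy quantile cutoff used in the experiments:

```python
import pandas as pd

def select_anchors(pred, quantile=0.4):
    """pred: DataFrame with 'pred_type' and 'entropy' columns.

    Returns a boolean mask marking anchors, chosen within each
    predicted cell type so every type is represented.
    """
    # Per-type entropy cutoff, broadcast back to each cell
    cutoff = pred.groupby("pred_type")["entropy"].transform(
        lambda e: e.quantile(quantile))
    return pred["entropy"] <= cutoff

# Toy first-round output: predicted types with their entropies
pred = pd.DataFrame({
    "pred_type": ["B", "B", "B", "NK", "NK", "NK"],
    "entropy":   [0.05, 0.90, 0.10, 0.30, 0.02, 0.70],
})
mask = select_anchors(pred)  # each type keeps its most confident cells
```

Applying the quantile within each predicted type, rather than globally, is what keeps the anchor set's cell type proportions consistent with the target data.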
We show an example below where Cellcano selects 40% of the cells (entropy quantile cutoff of 0.4) from the target dataset as anchors (Fig. 4A-C). Across all 29 human PBMCs experiments, Cellcano's anchors achieve much higher accuracy (median: 91.93%, mean: 91.04%) than Seurat's (median: 71.36%, mean: 69.04%), even though Cellcano selects more anchors (Fig. 4A-B). We also compare the non-anchor performance in Cellcano before and after the second-round prediction and observe an increase of 2.44% in the median and 3.27% in the mean. This improvement further validates the usefulness of our two-round prediction procedure. We then use the same experiment shown in Fig. 2D (one FACS-sorted human PBMCs dataset as target, a combination of four individuals as reference) as an example to visualize the anchors selected by Cellcano. We conclude that anchors selected by Cellcano better capture the full scope of the target data distribution (Fig. 4C, Figure S6) compared to those selected by Seurat (Fig. 4D).
Next, we explore how the number of selected anchors affects prediction performance. We first compare the performance between anchors and non-anchors under different quantile cutoffs in the human PBMCs experiments (Figure S7A-C) and mouse brain experiments (Figure S8A-C). We observe that when the quantile cutoff is lower, the anchor accuracies are higher. This is expected because a more stringent confidence criterion leads to higher prediction accuracy. However, using fewer anchors means a smaller training dataset in the second round. Moreover, using too few anchors could fail to capture the full scope of the target distribution, since the most confident cells tend to cluster around the cluster centroids. Both factors can result in decreased performance when predicting non-anchors (Figure S7-8, right panels, where the cutoff is 0.1). On the other hand, choosing too many anchors will include many wrongly predicted cells, which is detrimental to the second-round model training. Thus, the final prediction performance depends on a balance between anchor number and anchor accuracy.
We investigate the impact of anchor numbers in human PBMCs experiments (Fig. 4E, S9A-B) and mouse brain experiments (Fig. 4F, S9C-D). We select different numbers of anchors according to entropy quantile cutoffs (0.1 to 0.6 with step size 0.1) and compare the final prediction performances. Similar to the comparison in Fig. 2B, each experiment has a prediction baseline which is calculated as the average performance by using different quantile cutoffs. We then calculate the performance gains/losses for each quantile cutoff against the average.
Overall, the performances are stable when using quantile cutoffs of 0.2 or above (the median Acc varies within −0.4% ~ +0.9% in the human PBMCs experiments and −0.9% ~ +1.4% in the mouse brain experiments). The worst performance occurs when using 0.1 as the quantile cutoff, which again can be explained by the small training size in the second round and the failure to capture the target distribution. Therefore, with a moderate number of anchors, Cellcano produces stable prediction results. By default, we use 0.4 as the entropy quantile cutoff in our software implementation.
Cellcano works better than prediction with batch effects removed. A key advantage of Cellcano's two-round approach is that training a model on anchors in the target data alleviates the distributional shift between the reference and target data. This distributional shift is often caused by batch effects in high-throughput data, which raises the question of whether our two-round strategy is better than first removing the batch effect and then applying a direct prediction. We therefore compare Cellcano against its direct (first-round only) prediction with the batch effect removed by Harmony 34, which was demonstrated to have the best performance in a previous benchmark study 35. We also include Seurat in the evaluations because Seurat also claims to mitigate batch effects by projecting both datasets into the same low-dimensional space using CCA.
Our designed experiments involve combinations of individuals and batches in both reference and target datasets. We use the human PBMCs datasets as an example to show how Cellcano performs compared to Cellcano's first-round prediction with the batch effect removed and to Seurat (Figure S10). In the “inter-dataset” category, we use one individual from one dataset to predict one individual from another dataset; performances from the three methods are comparable. Next, we investigate the scenarios where the reference (inter-dataset: combined reference) or the target (inter-dataset: combined target) combines multiple batches or individuals. The combined-reference experiments represent the scenario where one wishes to use a large collection of public datasets to increase the reference size and improve the prediction. The combined-target experiments represent the scenario where the target data come from multiple batches but one wants to determine their cell types in one run. In both cases, there are batch effects inside either the reference or the target dataset, and especially when batch effects exist in the target dataset, the results with batch effects removed by Harmony show significantly reduced performances. We generate a low-dimensional visualization before (Figure S11A) and after batch effect removal (Figure S11B) using one example where one FACS-sorted PBMCs dataset is taken as the target and four individuals from Satpathy et al. 33 are combined as the reference. We observe that the batch correction results are chaotic and deteriorate the prediction (Figure S11B). Similar problems appear in Seurat, whose average performance drops when batch effects exist in the target datasets.
Among all comparisons, Cellcano is not affected by the batch effects and steadily outperforms the other methods. Even in intra-dataset prediction, where we use different individuals from the same dataset as reference and target, Cellcano still largely outperforms the other two methods, which may indicate their failure to handle effects among different individuals. These results demonstrate that Cellcano can handle data from different individuals and batches in both reference and target data without batch effect removal. This opens the possibility of training prediction models on a large compendium of datasets.
Cellcano is computationally efficient and scalable. We evaluate the computational performance of Cellcano and report the methods' runtimes for all experiments (Fig. 5A-B). For fair comparison, we combine the training time and prediction time into an overall runtime for Cellcano and EpiAnno, because all other methods need both reference and target datasets as input to perform prediction. Here, we do not consider data pre-processing time (such as the time used to generate peak counts or gene scores from the raw data). We sort the experiments by the total number of cells in the reference and target datasets. The results indicate that when the cell number is low, Cellcano, Seurat and scJoint use about the same runtime. However, as the cell number increases, Seurat and scJoint can be three times slower than Cellcano, and all other methods are 5 ~ 100 times slower. ACTINN, despite being a one-round prediction, is slower than Cellcano because ACTINN uses all genes for training while Cellcano selects 3,000 genes as features. An additional advantage is that, as a supervised celltyping method, Cellcano's pretrained models can be reused in future predictions, meaning the runtime can be further reduced by supplying the first-round pretrained model as input.