We first explored the importance of the number of PCs used in our framework. We ran the workflow with 20 PCs and with 50 to 500 PCs in steps of 50, and applied it to all the datasets. The models achieved relatively stable performance when using 50 PCs or more (Fig. 1A). In particular, for the pancreas datasets, the accuracy was 0.94 with 50 PCs, increasing slightly to 0.945 with more than 100 PCs. Similar trends were observed on the PBMC and TM datasets. For instance, for the PBMC dataset, an accuracy of 0.88 was achieved with more than 50 PCs, compared with 0.86 for the model with 20 PCs. For the TM dataset, accuracy improved from 0.91 with 20 PCs to 0.935 with 50 PCs, and remained relatively stable at 0.94 when the number of PCs was increased to 100 or more. Thus, to enhance computational efficiency and keep the model concise, we selected 50 PCs as the default in our approach, prioritizing speed and simplicity.
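As an illustration of this tuning step, the following is a minimal sketch (not the authors' released code) of how such a PC sweep could be run with scikit-learn, assuming an expression matrix `X` (cells × genes) and a cell type label vector `y`; the grid values mirror the ones reported above.

```python
# Hypothetical sketch of the PC sweep: for each candidate number of PCs,
# fit PCA followed by LDA and record the mean 5-fold cross-validation accuracy.
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def accuracy_by_n_pcs(X, y, grid=(20,) + tuple(range(50, 501, 50))):
    """Return {n_pcs: mean CV accuracy} for each candidate number of PCs."""
    results = {}
    for n_pcs in grid:
        model = make_pipeline(PCA(n_components=n_pcs), LinearDiscriminantAnalysis())
        results[n_pcs] = cross_val_score(model, X, y, cv=5, scoring="accuracy").mean()
    return results
```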
We further investigated the contribution of each step or component of our PCLDA model. To do so, we compared the accuracy of four configurations on the different datasets (Fig. 1B): the LDA model alone; t-test-based gene screening followed by LDA; PCA dimensionality reduction followed by LDA; and the full PCLDA (i.e., the combination of gene screening, PCA dimensionality reduction, and LDA). PCLDA achieved the highest and most stable accuracy. Notably, fitting the LDA model alone failed on several datasets due to memory issues. In addition, collinearity issues arose on some datasets when fitting the LDA model after gene screening but without PCA (i.e., Gene Screening + LDA in Fig. 1B), which consequently led to worse performance.
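For concreteness, the four ablation configurations can be expressed as simple pipelines. The sketch below rests on assumptions: `SelectKBest(f_classif)` stands in for the paper's t-test-based gene screening (the ANOVA F-test reduces to a two-sample t-test for two groups), and `k=2000` genes and 50 PCs are illustrative values, not reported settings.

```python
# Hypothetical sketch of the four configurations compared in Fig. 1B.
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import make_pipeline

configs = {
    "LDA only": make_pipeline(LinearDiscriminantAnalysis()),
    "Gene screening + LDA": make_pipeline(
        SelectKBest(f_classif, k=2000),          # stand-in for t-test screening
        LinearDiscriminantAnalysis()),
    "PCA + LDA": make_pipeline(
        PCA(n_components=50),
        LinearDiscriminantAnalysis()),
    "PCLDA (screening + PCA + LDA)": make_pipeline(
        SelectKBest(f_classif, k=2000),
        PCA(n_components=50),
        LinearDiscriminantAnalysis()),
}
```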
We then compared the classification accuracy of PCLDA with nine other methods on the above-mentioned scRNA-seq datasets (Table 1). These benchmark methods include two statistical-metric-based methods (SingleR [36] and scmap [37]), two tree-based methods (CHETAH [38] and scClassify), four machine-learning-based methods (Garnett, SingleCellNet, scID, and SCINA [39]), and a semi-supervised-learning-based method (Seurat [40]); see Table 2 for details of the models. We used overall accuracy as the performance metric.
Table 2
Compared machine learning models used for cell annotation.

Tool name     | Language     | Computational approach
Seurat        | R            | Weighted nearest neighbour
SingleR       | R            | Spearman
CHETAH        | R, Shiny app | Spearman + confidence
SCINA         | R            | Bimodal distribution fitting to marker genes
SingleCellNet | R            | Random forest
scID          | R            | LDA
Garnett       | R            | Elastic net
scClassify    | R, Shiny app | Weighted kNN classifier
scmap         | R            | Cosine distance based kNN
1. Intra-dataset annotation and performance comparison
We first tested the classification accuracy of the ten methods (Table 2) on nine publicly available scRNA-seq datasets (Table 1). These datasets include two peripheral blood mononuclear cell (PBMC) datasets, four human pancreatic islet datasets, two Tabula Muris datasets (TM full and TM lung), and one mouse brain dataset. To avoid potential bias, we used a five-fold cross-validation scheme and report the averaged accuracy. The detailed comparison results are shown in Fig. 2. Most of the methods consistently showed strong performance across all datasets. Notably, PCLDA, Seurat, and SingleR achieved high average accuracies of 0.95, 0.96, and 0.945, respectively. In contrast, some methods, such as scID, Garnett, and scmap, exhibited poor performance. Additionally, some methods, such as SCINA, CHETAH, and SingleCellNet, performed decently on most datasets but struggled with the TM (full) data. The full TM dataset from the Smart-Seq2 platform contains 37 cell types (Table 1), and certain cell types have very few cells, making the classification task more challenging. As a result, these methods struggled to achieve high performance on the full TM dataset due to the increased complexity posed by these sparse cell type labels; Garnett even failed to train. Overall, in the intra-dataset cross-validation scenario, PCLDA ranked at the top on most of the datasets and exhibited high stability, with essentially no outliers.
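A minimal sketch of this intra-dataset evaluation, assuming `pclda` is an estimator such as the pipeline sketched earlier; stratifying the folds by cell type is our assumption, not a detail stated in the text.

```python
# Hypothetical 5-fold cross-validation of overall accuracy on one dataset.
from sklearn.model_selection import StratifiedKFold, cross_val_score

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)  # assumed stratified folds
mean_acc = cross_val_score(pclda, X, y, cv=cv, scoring="accuracy").mean()
print(f"averaged five-fold accuracy: {mean_acc:.3f}")
```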
2. Inter-dataset annotation and performance comparison
To evaluate the annotation tools in a more realistic setting, we conducted an inter-dataset performance evaluation on ten reference–query dataset pairs. As shown in Table 1, three of the ten testing scenarios use the same protocol for reference and query, and seven use different protocols/platforms.
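Schematically, each inter-dataset scenario amounts to fitting on the reference and scoring on the query. The sketch below is a hypothetical illustration, not the authors' pipeline; restricting both datasets to their shared genes is an assumed preprocessing step, and `ref_genes`/`query_genes` are assumed gene-name vectors.

```python
# Hypothetical inter-dataset annotation: train on the reference dataset,
# predict on the query dataset, and report overall accuracy.
import numpy as np
from sklearn.metrics import accuracy_score

shared = np.intersect1d(ref_genes, query_genes)            # common gene set
ref_pos = {g: i for i, g in enumerate(ref_genes)}
qry_pos = {g: i for i, g in enumerate(query_genes)}

pclda.fit(X_ref[:, [ref_pos[g] for g in shared]], y_ref)   # fit on reference
y_hat = pclda.predict(X_query[:, [qry_pos[g] for g in shared]])
print("overall accuracy:", accuracy_score(y_query, y_hat))
```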
We performed within-protocol prediction to assess how well the models perform when the query datasets were generated using the same protocol as the reference datasets, which is the most common application. PCLDA ranked among the top two methods in accuracy in all three within-protocol comparisons, with an average accuracy of 0.91, while the best competing methods, Seurat and SingleCellNet, achieved average accuracies of 0.92 and 0.90, respectively (Fig. 3). In particular, on the large PBMC datasets, which contain more than 90,000 cells, PCLDA achieved performance very close to that of the top-ranked method, indicating that PCLDA is highly effective at handling large data.
We then evaluated model performance on cross-protocol datasets. Not all models performed well across these datasets; a few methods (e.g., scID and Garnett) showed poor performance in most comparisons, with accuracies below 0.7. In contrast, PCLDA consistently achieved decent performance and ranked in the top tier in all seven comparisons, alongside Seurat and SingleR; the average accuracies for PCLDA, Seurat, and SingleR were 0.95, 0.95, and 0.946, respectively. Notably, when comparing models on lung tissue from the TM data, which contains only 10 cell types with a relatively uniform distribution of cells across them, PCLDA stood out with the best performance, achieving an accuracy of 0.99, whereas some other methods failed to perform adequately in this scenario.
It is also worth noting that predicting cell types for the full TM data is very challenging. As mentioned in the intra-dataset scenario, the TM full training (reference) set contains 37 cell types, many of which contain only a few cells, and gene expression among these cell types is highly correlated. Despite these challenges, PCLDA showed substantial improvement on this dataset, achieving the highest accuracy of 0.94. In contrast, many other methods struggled to perform well, underscoring the advantage of our approach in effectively distinguishing highly similar cell populations.
Altogether, these results across the 19 benchmark datasets in the three scenarios demonstrate that the performance of PCLDA is robust and stable. The model's performance remains consistently high for both intra-dataset prediction (the cross-validation scenario) and inter-dataset prediction (the cross-sample and cross-protocol scenarios).