cellMarkerPipe: Cell Marker Identification and Evaluation Pipeline in Single Cell Transcriptomes

Assessing marker genes from all cell clusters can be time-consuming and lack systematic strategy. Streamlining this process through a unified computational platform that automates identification and benchmarking will greatly enhance efficiency and ensure a fair evaluation. We therefore developed a novel computational platform, cellMarkerPipe (https://github.com/yao-laboratory/cellMarkerPipe), for automated cell-type specific marker gene identification from scRNA-seq data, coupled with comprehensive evaluation schema. CellMarkerPipe adaptively wraps around a collection of commonly used and state-of-the-art tools, including Seurat, COSG, SC3, SCMarker, COMET, and scGeneFit. From rigorously testing across diverse samples, we ascertain SCMarker’s overall reliable performance in single marker gene selection, with COSG showing commendable speed and comparable efficacy. Furthermore, we demonstrate the pivotal role of our approach in real-world medical datasets. This general and opensource pipeline stands as a significant advancement in streamlining cell marker gene identification and evaluation, fitting broad applications in the field of cellular biology and medical research.


Introduction
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful high-throughput technique, enabling the comprehensive pro ling of diverse cell populations within tissue samples [1][2][3][4][5][6] .The scRNA-seq technology not only facilitates the exploration of various biological processes in disease and development [7][8][9] , but also allows for the identi cation of both known and novel single cell types, along with the characterization of their respective marker genes [10][11][12] .In typical scRNA-seq analysis, following cell type clustering is to obtain marker genes that are speci c to the clusters 13 .These marker genes are then manually inspected using available information in the literature or cell marker databases such as CellMarker 14 and PanglaoDB 15 .While effective, this manual process can be time-consuming and potentially prone to biases when different marker gene identi cation approaches need to be tested and applied.
A range of computational tools has emerged to enhance the convenience and automation of marker gene identi cation in scRNA-seq analysis.ScType streamlines cell type annotation through a reference marker gene database 16 , emphasizing the crucial role of marker gene identi cation under the cluster context.
General-purpose feature selection, particularly dimension reduction based on globally highly variable genes, retains informative genes but may not offer cell type markers.In contrast, for de-novo marker gene identi cation, methods targeting differential expression (DE) genes have been proven effective in pinpointing genes speci c to cell types 17 .Through extensive testing among DE statistical approaches, Wilcoxon rank sum test was highlighted to be worked well for DE gene identi cation particularly with su cient sample size [17][18][19] .Seurat, a cornerstone package in scRNA-seq, performs non-parametric Wilcoxon rank sum test by default in FindAllMarkers function 20 in a one-against-all manner.SC3 21 , another comprehensive scRNA-seq analysis package, identi es DE genes through a non-parametric Kruskal-Wallis test.
Additional to DE based statistical tests, there exists a category of specialized tools with more sophisticated approaches for cluster-wise marker gene identi cation.These tools aim to computationally emulate cell sorting by identifying cell type-speci c genes or gene panels.COSG 22 presents a signi cant advancement in the eld by introducing cosine similarity-based marker gene identi cation, which proves to be a more precise, robust and scalable method for discerning true marker genes across various cell types.SCMarker 23 is an ab initio method designed for marker selection by exploring bi-modally distributed expression levels that are co-or mutually-exclusively expressed with some other genes.Then SCMarker assigns the top ten highest expressed genes from all markers to the speci c cell types.
Uniquely, COMET 24 has capability in predicting advantageous marker panels (gene combinations) from transcriptomic data, by a special hypergeometric test.Finally, scGeneFit 25 selects gene markers that collectively optimize cell label hierarchy recovery, leveraging label-aware compressive classi cation methods and signi cantly enhancing the accuracy of cell type identi cation.Then the marker genes are assigned to hierarchical cell labels by their high expressions 25 .
Given the diverse landscape of the above tools, researchers face a challenge to make a good choice, which requires careful consideration of tool performance, compatibility with speci c datasets, and suitability for addressing distinct biological questions.The rapid evolution of technology has led to continuous development of new tools and methodologies, further complicating the selection process for the most appropriate tools.Addressing this challenge, the development of a uni ed platform for benchmarking marker genes should aim to signi cantly enhance usability while ensuring consistent and comprehensive evaluation metrics for testing various marker gene identi cation tools.While versatile benchmark projects for scRNA-seq have been conducted in recent years, such as those addressing differential expression analysis 17 , dimension reduction methods 26 , clustering strategies 27 , and data matrix transformations 28 , a benchmark for specialized gene marker identi cation tools is still absent, let alone a uni ed platform to perform such assessments in a user-friendly manner.
Therefore, we propose cellMarkerPipe, an adaptable and all-in-one platform designed for cell type-speci c marker gene identi cation and benchmarking.This platform conveniently compiled and wrapped around a list of recent and specialized computational tools for cluster-speci c marker gene identi cation (from 2017 to 2022, see Supplementary Table 1), each contributing to the advancement of marker gene identi cation in the evolving eld of single-cell transcriptomics.Rigorous testing on diverse scRNA-seq datasets from human, mouse, and plant samples (Supplementary Table 2), in conjunction with known markers, validates the robustness of our systematic benchmarking approach.Through a case study, we illustrate the potential applications of the cellMarkerPipe platform in advancing gene therapeutics for targeted cell populations, paving the way for personalized and genomic editing treatments.Implemented in both Python and R, this open-source platform empowers researchers across diverse biological domains with a comprehensive and fully automated protocol for cell-type marker gene identi cation.

Overview of the pipeline
The cellMarkerPipe pipeline accepts input in the 10x format, comprising both a cell-gene matrix and cell cluster labels (see Fig. 1).The output from the pipeline is an evaluation report with comprehensive metrics for the identi ed marker genes from any selected methodology.
The cellMarkerPipe comprises three core modules: preparation, marker selection, and marker evaluation.
Firstly, the preparation step includes normalization, scaling, and potential dimension reduction through the selection of highly variable genes.This step also includes the ltering of low-quality single cells following best practices of Seurat (see Methods).The output of the preparation step is a normalized and scaled cell-gene matrix.Secondly, the marker selection step employs a standard cell label le and normalized gene expression data from the previous step to perform gene selection using various methods.The pipeline has already supported multiple R/python environments for the prioritized tools, i.e.Seurat, COSG, SC3, SCMarker, COMET, and scGeneFit, to be compared and benchmarked in this paper.
The pipeline also has the capability of allowing researchers to incorporate their own tools or gene selection methods, provided the data formats for cell clusters and normalized expression matrix are compatible with our standards.The output of marker selection is a csv le containing genes speci c to each cell type.Thirdly, in the evaluation step, the pipeline assesses the marker genes and outputs a report.The evaluation metrics are based on the re-clustering effect alone without any known marker genes; Optionally, users can provide known cluster-speci c marker genes to evaluate Precision and Recall scores as additional metrics.Thus, the nal evaluation report includes scores such as the Adjusted Rand Index (ARI) [cite:Objective criteria for the evaluation of clustering methods], Jaccard index, purity, normalized mutual information (NMI) 29 , and Fowlkes-Mallows Index (FMI) from the re-clustering assessment 30,31 , and precision and recall values for each cell type and overall dataset given known marker genes.

Systematic benchmarking in diverse testing scenarios
The evaluation of marker genes across different tools raised several critical considerations, including the number of selected marker genes, relative cell type abundance, input cell numbers, and number of highly variable genes at the dimension reduction stage (Supplementary Table 3).These factors that may affect our evaluation have been systematically explored in this section.Among the various e cacy metrics, ARI is frequently used to assess the re-clustering method based solely on the selected marker genes 32 .Additionally, precision is utilized to ascertain true positives among the selected marker genes for each cell type.From ARI and precision curves (Fig. 2a-d), SCMarker and COSG consistently perform well as an overall observation.The complete analytical metrics under various testing cases are all reported and show similar patterns (Supplementary Table 3).
In Case 1 (Fig. 2a), we evaluated the tool performance by controlling the total number of selected marker genes, using a publicly available Zeisel dataset 33 from the mouse brain.All the methods can be adjusted by certain parameters to reach the same or similar number of selected marker genes.With more marker genes selected, we observed the increasing trend of ARI scores showing an overall improvement in clustering e cacy.As expected, precision instead decreased since the total selected marker genes will dilute the proportion of known marker genes.Among methods, COSG and SCMarker exhibited good precision scores for identifying true positive gene markers at the same or similar number of marker genes being reported.Marker genes identi ed by COSG and SCMarker also show more speci c gene expressions patterns in correspondent cell types from heatmap (Supplementary Fig. 1).Moreover, ARI reaches saturation for most of the methods especially for SCMarker even with fewer than 20 genes (approximately 2 genes in each of the nine cell types).Since all datasets went through dimension reduction by selecting highly variable expressed genes during the preparation step, we selected the similar number of top high variable genes as a control method to indicate the baseline e cacy compared to those specialized tools.This "high variable" method exhibited lower overall precision but comparable ARI score as to other methods at any selected gene numbers, which indicates that the re-clustering on global informative genes may not always offer marker genes in a cell type speci c manner.This implies the necessity of comprehensive metrics in our pipeline.Lastly, since the top 10 selected marker genes for each cluster (this is about 90 marker genes in total) in those methods already display a stabilized clustering performance (ARI saturation), we will always report top 10 marker genes for each cluster in later experiments (and in heatmaps) by default.
In Case 2 (Fig. 2b), we tested methods on the relative cell population by Jurkat dataset 34 , an arti cial mixture of two distinct cell lines (Jurkat and 293T).By altering the mixing ratio from 1:1 to an imbalanced scenario up to 9:1 (Jurkat:293T), we observed a decrease in clustering e cacy when the cell types were more severely imbalanced.However, the effect on precision was not consistently stable since there are only one marker genes in each cell line were considered.Overall, SCMarker demonstrated superior re-clustering effectiveness in this test of imbalanced cell types.From the heatmap visualization of marker gene expression speci city, SCMarker, COSG, Seurat and SC3 all display strong patterns (Supplementary Fig. 2).
In Case 3 (Fig. 2c), regarding experimental throughput, we varied the total number of input cells from 100 to 2500 using PBMC-10K dataset 34 to investigate the impact for marker genes identi cation.We observed an enhanced clustering effectiveness (ARI) and precision score with over 500 cells, equating to roughly 50 cells per cluster.Generally, more cells can provide better distribution of gene expressions but may also bring in more noise.The gene expression speci city pattern is not very observable when using 2500 cells and 5000 genes as inputs in all methods (Supplementary Fig. 3).Given current technological capabilities which enable the processing of over 5000 cells, the limitation of the input cell number is of minor consideration, except for rare cell types.
In Case 4 (Fig. 2d), we tested the effect of input gene numbers (highly variable genes from top 500 to 5,000 in dimension reduction) using PBMC-10K dataset 34 .This experiment illustrated that input gene numbers (often after dimension reduction with highly variable genes) may not signi cantly affect clustering e cacy but do impact precision.SCMarkers and COSG displayed superior performance overall in the PBMC dataset, emphasizing that leveraging a substantial number of highly variable genes is bene cial for enhancing clustering e cacy 35 but may not necessarily serve as speci c markers for cell types.This underscores that genes playing crucial roles in overall clustering performance are not necessarily cell type-speci c marker genes.
In Case 5 (Fig. 2e), we utilized a plant dataset derived from Arabidopsis root single cells 36 to visualize the standardized gene expression speci city using the top 10 selected gene markers in each cell type.
Heatmaps can visually represent the marker genes that in uenced these cell clusters 37 .SCMarker and COSG played important roles in identifying type-speci c expressed genes, with the yellow colors indicating higher speci city of cell type expressions.Additionally, the red box highlights the re-discovery of known marker genes.The marker genes selected by SCMarker and COSG included more reported known marker genes (in red boxes) than those selected by other methods a rming the superior performance of SCMarker and COSG in this context.

Comparative studies in human and mice gut tissues
In this experiment, we utilized datasets from both human and mice gut cells 38,39 , to conduct comprehensive comparisons across our selected methodologies.Initially, we reconstructed and displayed single-cell clusters colored on the given cell type labels identi ed in the respective studies (Fig. 3a and   3b).In the human Colon, Ileum, and Rectum tissues, we obtained marker genes for cell types of Paneth, Goblet, Enterocyte, Stem-cell, Enteroendocrine, EP (Enterocyte progenitor), and TA (transit-amplifying), while in mouse Duodenum, Ileum, and Jejunum tissues, we obtained marker genes for Enterocyte, Tuft, Goblet, Enteroendocrine, Stem-cell, TA, EP, and Paneth cell types.Overall, COSG achieved consistently better gene recall scores in human Ileum Rectum and mouse Duodenum, while SCMarker and Seurat performed better in other tissues (Supplementary Table 4).The performance patterns of the computational method COSG remained relatively consistent across both human and mouse, as well as across various tissue types (Fig. 3a and 3b).From heatmaps, the speci c gene expression values within each cell type were distinctly elucidated by the top 10 marker genes identi ed through the COSG method (Fig. 3c and 3d).Genes selected by SCMarker also displayed speci c cell type expression patterns (Supplementary 4-9) across human and mice tissues.While we did observe shared markers, such as TFF3, ATOH1, and FCGBP in Goblet cells and LGR5 in Stem-cells from both human and mouse Ileum data, it is important to notice that, overall, the marker genes identi ed in human and mice datasets exhibited differences (Fig. 3c and 3d).

Running time comparison for various methods
In this experiment, we conducted a thorough analysis of the running time complexity and scalability across various methods using the PBMC dataset.By manipulating the number of genes and cells as input variables, we meticulously measured the running time taken for the marker gene selection step in seconds across different tools (Fig. 4a and 4b).COMET exhibited challenges in scalability as it necessitates the examination of marker gene combinations, making it less e cient when faced with an increased number of genes or cells.Similarly, scGeneFit displayed relatively extended processing times due to the evaluation of gene networks based on positive and negative correlations.In contrast, the running time for the remaining tools demonstrated similar performance, showing little variation in response to changes in input variables.

Examination of cell makers and cell types in gene therapy
A recent clinical trial 40 explores gene editing-based therapy for children with transfusion-dependent βthalassemia caused by HBB gene mutation.By targeting the BCL11A enhancer, researchers aim to induce γ-globin expression to compensate the globin de ciency.Two children received edited stem cells, achieving successful engraftment and transfusion independence for over 18 months.Single cell data from this study provide exploratory analysis, which revealed no notable side effects.Here we used our cellMarkerPipe to re-evaluate the single cell data from this medical research from one of the children with both unedited and modi ed blood cells using six methods.We obtained the cell clusters and gene markers from the original publication.With the known blood cell markers 40 , SCMarker and COSG identi ed a good set of marker genes (Fig. 5a) according to their precision and recall scores and comparable clustering e cacy.
The COSG selected top 10 marker genes were visually represented for each cell type, showcasing their speci c expressions in a heatmap (Fig. 5b).This pattern can be also observed in SCMarker identi ed marker genes, but not observed clearly from other tools (Supplementary Fig. 10-11).Most of these markers were shared across unedited and modi ed samples in identi ed blood cell types.Interestingly, when combining B cells (naïve and memory) in these two samples (Fig. 5c), BCL11a did not emerge as a distinguishing marker between edited and unedited cases.This overall analysis suggests that the gene editing effect may not signi cantly impact major mature cell types according to the marker gene selections.Additionally, we explored potential uncertainties arising from different clustering methods, such as SC3 (Fig. 5d).SC3 and Seurat were both controlled to generate twelve clusters, and we obtained top ten marker genes for each cluster by COSG.While minor differences were observed in overall precision and recall for known cell markers, it's important to notice that the choice of clustering method in scRNA-seq analysis may in uence marker gene selection and downstream analysis.Nonetheless, this variation, given commonly used cell clustering methods (such as Seurat is used by default in our cellMarkerPipe), shall not pose a major concern for a real-world data analysis.

Discussion
CellMarkerPipe places a strong emphasis on utilizing cell cluster aware marker gene identi cation methods.While general dimension reduction or feature selection methods hold their own advantages in speci c contexts, the prioritization of specialized marker gene selection techniques in this platform ensures a meticulous assessment of marker gene quality and speci city that can emulate the cell sorting approach.
Among the tools integrated within cellMarkerPipe, SCMarker and COSG both exhibit commendable stability and consistency in their performance.Particularly, COSG stands out for its adeptness and e ciency in employing simple cosine distance measures.This characteristic lends COSG a high degree of reliability in marker gene identi cation across diverse datasets, underscoring its suitability for robust single-cell analyses.
The unique feature of COMET, its ability to identify combinatorial gene markers, has not been extensively tested and fairly benchmarked in this research.Indeed, the gene panel identi cation of COMET represents a distinctive advantage, revealing intricate relationships among genes and offering signi cant biological insights into complex regulatory networks.Nevertheless, this unique capability of COMET comes with the trade-off of expensive time complexity.
Seurat, as a widely adopted tool in single-cell analysis, provides valuable utility through its FindAllMarkers function.It's crucial to recognize that while Seurat is convenient to use and should be su ce for diverse applications, it might not always be the optimal choice for the comprehensive discovery of marker genes.This consideration implies the importance of evaluating other marker identi cation approaches.
We limit our tests by default settings for differential expression (DE) gene statistical methods in two widely used packages, Seurat and SC3, recognizing the diverse options and settings explored in other DE benchmark projects 17,18 .When utilizing differential expression (DE) statistical methods, as they may select top-ranked DE genes that are highly expressed in target cells and a small group of nontarget cells, potentially leading to erroneous identi cation as marker genes 22 .This highlights the importance of incorporating other more sophisticated marker gene identi cation methods besides the DE approaches.
Overall, CellMarkerPipe stands out as a new, standardized and uniform platform designed for the identi cation of cell-speci c marker genes, coupled with comprehensive benchmarking capabilities.In practical applications, researchers have the exibility to seamlessly integrate cellMarkerPipe with any gene selection tool as a plugin, enabling them to e ciently pinpoint marker genes tailored to their speci c research objectives.Furthermore, users can readily access an informative evaluation report, thereby ensuring a thorough assessment of the identi ed markers.This streamlined process exempli es the adaptability and user-centric nature of cellMarkerPipe, providing a valuable resource for researchers seeking precise and reliable marker gene identi cation in their single-cell transcriptome analyses.

Pipeline Development
The cellMarkerPipe was developed by both python and R environment.For the simplicity and compatibility concerns, each of the methods we selected has its own working environment in the development of the pipeline.The installation of working environment correspondent to each of the tools is listed in the github (see code availability section).The pipeline was implemented in both command line version and the python modules.Python modular functions are tested in Jupyternotebooks in both a single computer and a computer cluster.
The preprocessing and data preparation step involves ltering data based on criteria such as the minimum and maximum gene count per cell and the percentage of mitochondrial genes per cell, following a widely adopted protocol in Seurat.For re-clustering evaluation, Seurat was employed, as it is the most prevalent package for cell clustering.In the nal experiment using a clinical dataset, we also implemented SC3 clustering to compare its clustering effectiveness in the preparation step.The same data formats were prepared from SC3 clustering output so that the marker selection step and evaluation step were run without any revision in procedures.

Benchmark from datasets and tools
The datasets in this study were chosen from popular public scRNA-seq datasets widely used in cell clustering and annotations.These datasets should have a clear demonstration of known cell types and marker genes being used in their study.We also aimed to maximize the species and tissue coverage with diverse application scenarios by these datasets.The data matrices and cell type information were downloaded from NCBI or their publication repository (Supplementary Table 2).The ground truth for the marker genes from each of the datasets were collected from the original publication or relevant studies they refer to (Supplementary Table 5).To make sure the fair comparison of the total gene selections in different tools, we ensure the total number of selected genes are comparable based on the adjustment of the parameters in each tool (in experiments mentioned in Fig. 1a), or simply report the top ten marker genes per cluster (in experiments other than Fig. 1a).For the method scGeneFit, since it's hard to directly control the selected gene numbers, we tuned the other parameters to make sure the number of selected genes is closest to the testing cases of all other methods.The values of parameters we used in this study for all tools ensure the proper comparison and reproducibility of this benchmark and used as default settings in our pipeline (Supplementary Table 6).When reporting the total selected marker genes (Supplementary Table 7) from each dataset, since some clusters may share the marker genes, the nonredundant numbers were reported in the paper (Fig. 1a) and evaluated in ARI and precision scores.The marker genes for individual cell types were also reported (Supplementary Table 7) and used in gene expression heatmaps (Supplementary Figs.1-11).
The heatmaps showing the cell type speci c gene expressions for selected marker genes per group were generated by Scanpy sc.pl.matrixplot function 41 at "standard_scale" mode.This mode is to standardize the given gene expressions into the values between 0 and 1, meaning for each variable or group the values subtract the minimum and divide each by its maximum.
Detailed statistics for these various experiments and more heatmaps were shown in Supplementary

Evaluation scores in benchmark reports
The nal evaluation report provides scores for metrics are either from re-clustering by Seurat default or comparison of the known gene markers.The re-clustering based metrics are Adjusted Rand Index (ARI), Jaccard index, purity, normalized mutual information (NMI), Fowlkes-Mallows Index (FMI).The comparisons of the known markers provide precision and recall.These metrics are demonstrated in the following formula.These scores are obtained by scikit-learn package 42 .
ARI is a measure of consistency between the observation and expected cluster results.The observation clusters is based on identi ed marker genes, while the expected clusters is based on the true cell cluster labels.Assume n is the total number of cells.The number represents the cell numbers in both i-th cluster in observation and j-th cluster in expectation .The number represents the cell numbers in i-th cluster in observation , and the number represents the cell numbers in j-th cluster in expectation .The ARI can be calculated by the formula below.The ARI value will be in the range [0, 1].
Higher values indicate better agreement between the predicted (observed) and true (expected) clusters by the identi ed gene markers. 1 The Jaccard Index measures the similarity between two sets by comparing the intersection (common elements in clusters) with the union (total elements in clusters) of the sets, often used to assess the similarity of clustering results 43 .The calculation is as follows given the i-th cluster in observation and j-th cluster in expectation .The overall Jaccard Index is based on the mean of all the cluster-wise comparisons.

2
NMI quanti es the mutual dependence between two sets of labels, adjusted for chance, providing a measure of the agreement between the two clustering results.In detail, NMI is mutual information between observation and expectation normalized by entropies of each and .NMI can be calculated by the following formula.
FMI assesses the similarity between two cluster results by computing the geometric mean of the pairwise precision and recall.TP (true positive), FP (false positive) and FN (false negative) are based on the cell label contingency table between i-th cluster in observation and j-th cluster in expectation .

4
Besides the above re-clustering metrics based on the identi ed marker genes, if user provides the "true" or expected marker gene sets, our pipeline also calculates the Precision and Recall scores to show whether the predicted marker genes are consistent with biological validated or "true" marker gene sets.Precision means the percentage of the corrected predicted genes among all predicted genes, while Recall means the percentage of the corrected predicted genes among all expected true results.Our pipeline will report cluster-wise Precision and Recall scores for each cluster, as well as the overall Precision and Recall scores when the marker genes from all clusters are pooled together.

Figure 2 Testing
Figure 2

Figure 4 Running
Figure 4