Single-Cell Manifold Preserving Feature Selection (SCMER)

A key challenge in studying organisms and diseases is to detect rare molecular programs and rare cell populations (RCPs) that drive development, differentiation, and transformation. Molecular features such as genes and proteins defining RCPs are often unknown and difficult to detect from unenriched single-cell data, using conventional dimensionality reduction and clustering-based approaches. Here, we propose a novel unsupervised approach, named SCMER, which performs UMAP style dimensionality reduction via selecting a compact set of molecular features with definitive meanings. We applied SCMER in the context of hematopoiesis, lymphogenesis, tumorigenesis, and drug resistance and response. We found that SCMER can identify non-redundant features that sensitively delineate both common cell lineages and rare cellular states ignored by current approaches. SCMER can be widely used for discovering novel molecular features in a high dimensional dataset, designing targeted, cost-effective assays for clinical applications, and facilitating multi-modality integration.


INTRODUCTION
A tissue in a living organism often consists of millions to billions of cells. While the terminally differentiated cells with relatively distinct molecular profiles can be readily distinguished via single-cell RNA sequencing (scRNA-seq) at current sampling depth, many cells involved in development, differentiation, and transformation remain difficult to detect 1 , 2 . For example, a fraction of tumor cells in renal cell carcinomas can go through sarcomatoid transformation driven by epithelial to mesenchymal transformation (EMT) 3 , 4 ; tumor cells in pancreatic ductal adenocarcinomas can transiently express stemness features (e.g., SOX2) at its invasion fronts 5 -7 . These cells can be relatively rare in the sampled populations, transiently expressing certain molecular features and thereby may not form distinct clusters in high dimensional feature spaces 8 , 9 .
To detect characteristic features (e.g., genes, proteins) in a single-cell dataset, studies often employ unsupervised clustering followed by one-cluster-vs-all differential expression (DE) analysis. These approaches can detect major cell types governed by lineage features that dominate data variance, but are insensitive to rare but unique features that have relatively small variance and manifest as level gradients within cell-type clusters (a.k.a. cell states) 1 4 . They are also clumsy at detecting features affecting multiple clusters, e.g., transcription factors (TFs) regulating multiple cell types 1 5 , as that involves comparison of an exponentially growing number of cluster combinations. To detect features associated with continuous developmental processes, many studies perform trajectory inference 1 6 followed by regression analysis to identify correlated features (e.g., Monocle 1 7 ). The selection of features depends on trajectories, which could be challenging to infer accurately for complex processes. A detailed comparison was performed by RankCorr 1 2 across various methods such as statistical tests, logistic regression, MAST 1 0 , scVI 1 1 , and COMET 1 3 .
Most existing approaches regard features as independent variables without exploring their interactions 1 8 .
As a result, they tend to identify redundant features (e.g. CD3D, CD3E and CD3G for T cells) that are highly correlated with the inferred clusters or trajectories, but ignore novel, uncorrelated features. Some recent work such as scHOT 1 8 and SCMarker 1 9 started to exploit correlational patterns among co-or antiexpressing genes. However, they do not model complex interactions of more than two genes. SCMarker cannot characterize continuous cell states, and scHOT relies on the accuracy of trajectory inference.
To enhance sensitivity in detecting rare features and RCPs, many studies 2 0 , 2 1 had to slice and dice data spaces in empirical, multifaceted ways 8 or perform iterative gating 2 2 and re-clustering at variable resolutions, which may lead to biased, irreproducible results.
Increasing the number and variety of molecular features 2 3 , 2 4 and improving the fidelity of the measurements can help discover RCPs. However, they unavoidably increase the already high cost of experiments. To make assays cost-effective towards clinical applications, it is important to select a compact actionable set of molecular features that unbiasedly represent molecular diversity in high dimensional data. This ability is important for designing and manufacturing customized assays, e.g., 10x targeted gene expression, MissionBio Tapestri 2 5 and NanoString GeoMx 2 6 , which perform multi-omics measurements of hundreds of selected DNA, RNA, and proteins.
To address these fundamental challenges, we developed SCMER (Single-Cell Manifold Preserving Feature Selection), which selects an optimal set of features such as genes or proteins from a single-cell dataset. Similar to t-Distributed Stochastic Neighbor Embedding (t-SNE) 2 7 and Manifold Approximation and Projection (UMAP) 2 8 , we hypothesize that a manifold defined by pairwise cell similarity scores sufficiently represents the complexity of the data, encoding both global relationship between cell groups and local relationship within cell groups 2 9 . By preserving such a manifold while performing feature selection, the most salient features that unbiasedly represent the original molecular diversity will be selected.
SCMER does not require clusters or trajectories, and thereby circumvents the associated biases. It is sensitive to detect diverse features that delineate common and rare cell types, continuously changing cell states, and multicellular programs

THE SCMER APPROACH AND ITS UNIQUE STRENGTH
In a nutshell, SCMER (Fig. 1a, Methods)  . SCMER selects optimal features that preserve the manifold and retain inter-and intra-cluster diversity ( Fig. 1b). It can be applied to discover rich molecular pathways, identify prognostic genes, and design customized DNA/RNA/antibody panels of restricted sizes towards clinical applications.
To elucidate the novel cell populations and features that SCMER uniquely identifies, we generated a synthetic dataset containing a branching trajectory of 4,000 single cells from five major cell types, namely progenitor (Pro), precursor of A and B (PreA and PreB), and mature A and B (A and B) (Fig. 1c). Four kinds of features are simulated, those (I) specific to one cell type/cluster, (II) shared by more than one cell type 1 5 , (III) gradually changing over cell states, and (IV) transiently activated (also known as checkpoints 3 3 ). We created a total of 180 features including 20 cell-type specific features for each cell type (100 in total), 10 gradually changing features for each branch (20 in total), 5 shared features for precursors and mature cells (10 in total), 5 transiently expressing features, and 45 random noise features.
Each cell was given a ground-truth pseudo-time in the trajectory, which parameterizes the expected level of a feature. In addition to major cell type labeling, the cells transitioning from precursor to mature are identified as "RCP-A" and "RCP-B", which overexpress type-IV features. Dispersion was added based on a negative-binomial distribution. In as few as 45  . SCMER robustly demonstrated the best performance on all the experiments.

HETEROGENEITY IN CANCER
Single-cell datasets derived from cancer samples are often highly complex, containing heterogeneous cell types and states in not only tumor cells but also stromal and immune cells. Supervised analysis of cancer data is challenging as cancer cells are highly plastic 4 2 and can express novel unknown features, which can heavily confound clustering and trajectory-based analysis. We applied SCMER on a single-cell RNA-seq melanoma dataset containing 4,645 cells from 19 human melanoma samples To understand the biological meanings of the selected genes, we compared them with the 11 gene sets described in the original publication that represent important cell types and pathways in the study. The selected genes compactly covered all the 11 gene sets ( Table 2). Interestingly, genes belonging to the known drug resistance AXL program and MITF program were also selected by SCMER. These genes do not preferentially express in a specific cluster (e.g. PMEL, TOB1, etc. in Fig. 2d and Supplementary Figure 1a To comprehensively assess the performance of SCMER, we varied the number of selected features and recorded the number of recalled gene sets. SCMER consistently recalled more gene sets than other widely-used methods for any given number of features, demonstrating the best performance (Fig. 2c).
We also applied SCMER to a large-scale pan-cancer single-cell transcriptomic study consisting of 198 cell lines and patient samples from 22 cancer types 3 5 . SCMER showed the high sensitivity to characterize intra-cluster heterogeneity, identifying recurrent heterogeneous programs shared by a majority of cell lines and by patient tumor samples (Supplementary Figure 3 and Supplementary Result 1).

IMMUNOCYTES
We further examined SCMER in a complex setting involving many cell subtypes and subtle intra-cluster structure and shared pathways. The dataset contains 39,563 gastrointestinal immune cells collected from inflamed tissues from ten Crohn's disease patients 3 6 . As a risk factor to cancer, chronic inflammation which affect many cellular processes by regulating MAP kinases. DUSP1 has been reported as a key regulator 4 5 of both innate and adaptive immune responses by inactivating p38 and JNK (c-Jun N-terminal kinase). We found that DUSP2 and DUSP4 showed gradients majorly in the T cell cloud, while DUSP1 was expressed in all the major cell types. These genes are related to a wide range of immune phenotypes including producing inflammatory cytokines and autoimmune responses, which are highly relevant to Crohn's disease.
SCMER again compared favorably to the other methods that selected various numbers of features (Fig.   3i, Supplementary Result 2). It was evident that the other methods tended to ignore features associated with intra-cluster heterogeneity and multicellular programs. The novel genes selected by SCMER were also highly enriched in multiple immune pathways 4 3 (humoral immune response, leukocyte migration, complement activation, etc.; Supplementary Table 5). Overall, SCMER sensitively preserved different types and levels of heterogeneity in the original data.

SCMER DISSECTS CONTINUOUS CELL LINEAGE DIFFERENTIATION
Cell differentiation involves many unique patterns of gene expressions, including those gradually changing, shared among cell types, or transiently activating during the process. None of these patterns can be characterized through clustering. Having shown that SCMER identifies both inter-and intra-cluster features, here, we examine if it can identify key differentiation features in hematopoiesis in human bone marrow (BM). A recent study 3 7 sequenced 6,915 BM cells from four healthy donors. The transcriptomes of the cells formed a continuum instead of discrete clusters in the UMAP, reflecting the continuous differentiation process (Fig. 4a).
From this dataset, SCMER selected 100 features, which clearly preserved the trajectory of differentiation ( Fig. 4b). The original publication reported 57 established features belonging to 12 sets ( Table 3).
SCMER recalled all the 12 sets as well as other well-known immune features (e.g. GNLY and CD74).
SCMER also picked up features for less abundant cells that were not reported in the original publication (Supplementary Table 6 . Similarly, increasing GPR183 expression was found in a subgroup of T cells, B cells, and Monocytes ( Fig. 4d). The function of GPR183 is not fully known, but emerging evidence shows that it may be a regulator of immune cell migration 4 6 . SCMER identified many genes that displayed gradient along the developmental trajectories, for example, AHSP and CA1 for Erythroid cells, and VPREB1 for B cells (Supplementary Figure 7b). It also identified genes such as PRTN3 (monocytes) and PDZD8 (erythroid) that appeared transiently expressed during the developmental process and became dim in terminally differentiated cells (Fig. 4d), which were not prioritized by Monocle. Besides, it identified TF genes such as JUN and SOX4, which play important roles in regulating cell differentiation 4 7 , 4 8 . We comprehensively evaluated the performance and confirmed that SCMER outperformed the other unsupervised methods in recapitulating molecular diversity (Fig. 4c).

SCMER IDENTIFIES MOLECULAR DRIVERS FROM PERTURBED CELLS
More and more studies using single-cell technologies to investigate heterogeneity of cells in response to a genetic or chemical perturbation 4 9 .
In these experiments, cell state may transition under complex kinetics.
To investigate the utility of SCMER in studying cellular responses, we applied it on single-cell data derived from dexamethasone (DEX) treated A549 lung adenocarcinoma cell line 3 8 . As reported in the original publication, the 1,429 cells sampled at 0, 1, and 3 hours after the DEX treatment formed a continuum in the transcriptomic space (Fig. 5a), indicating heterogeneous responses of the cell population. After running SCMER on the sci-RNA-seq data, 80 genes were selected, with the manifold and treatment states largely preserved (Fig. 5b).
We inferred TF activities based on motif enrichment 5 0 in the chromatin accessibility (sci-ATAC-seq) data co-assayed on the same set of cells 3 8 ( Methods, Fig. 5c). Among the top 50 highly variable TFs (Supplementary Figure 8a), NR3C1, the primary target of DEX 3 8 , had the most prominently increasing activity level over treatment time. Other TFs such as FEV 5 1 and the ETS family 5 2 , also targets of DEX, had decreasing activity levels.
We then correlated the expression levels of the genes selected by SCMER with the activity levels of the top TFs. We found that FKBP5, GALNT18, NRCAM, etc. were positively correlated with NR3C1, while CYP24A1, COL5A2, etc. were negatively correlated (Supplementary Table 7 ) in the whole transcriptome; while CYP24A1, which regulates multiple metabolism processes 5 5 , was the most negative ). Cells of high FKBP5 expression levels came mostly from 1 and 3 hours (Fig. 5d), with matched polarized distributions in the RNA and the ATAC embeddings (Fig. 5g). Similar patterns were observed between cells of high and those of low CYP24A1 expression levels (Fig. 5f,i).
Interestingly, SCMER also selected a group of genes uncorrelated with prominent TF activities (Fig. 5j,   Supplementary Figure 8). Among them were MKI67 (e.g., (Fig. 5e,h), which encodes proliferation marker protein Ki-67, and other cell-cycle genes such as CENPF, TOP2A, RYBP, MLH3, etc. Pathway analysis confirmed that these genes are highly enriched in cell proliferation pathways (Supplementary Table 8), indicating that an appreciable fraction of cells continued proliferating despite the treatment. It is not surprising that the levels of these genes were uncorrelated with chromatin state changes, as it has been shown that cell cycling status has little direct effect on chromatin accessibility 5 6 . Also among uncorrelated were several cancer cell stemness marker genes 4 3 such as ACTG1, TSC22D1, and FN1, which may indicate that a fraction of cancer cells maintained their stemness during the course of the treatment. These genes would have been missed by a DE analysis supervised by the treatment time.
Taken together, our results demonstrated the superior power of SCMER in discovering features associated with heterogeneous cellular state change in the context of perturbation experiments.

SCMER ACHIEVES CROSS-MODALITY FEATURE MAPPING
One challenge in applying scRNA-seq for cell-typing is that expression levels of mRNAs can differ substantially from those of homologous proteins, due to post transcriptional modifications 5 7 . Although performing multi-omics assays may be the ultimate solution, they are currently associated with higher cost and lower throughput. Thus, rather than simply selecting the homologous mRNAs, it is beneficial to identify the set of genes whose expression levels maximally represent cellular diversity at the protein level. This capability can be important for designing targeted, cost-effective assays for preclinical and clinical applications. SCMER is ideally suited for such a purpose, as it allows selecting features in one modality while preserving manifold in another modality.
The protein manifold based on 25 markers was utilized to "supervise" the selection of mRNAs (Methods). CITE-seq, which co-assays mRNA and protein markers from the same set of cells, is ideal for obtaining the optimal mapping between mRNAs and proteins (Fig. 6a,b).
As shown, the mRNA expression levels of genes homologous to the protein markers, such as CD4 (a T h cell marker) and NCAM1 (CD56, an NK cell marker) offered low power in delineating the corresponding cell types (Fig. 6d,e). Some markers, e.g., CD45RA (B cells and naïve T cells) and CD45RO (memory T cells) are isoforms of the same gene, PTPRC. Consequently, T cell subtypes were less distinguishable in the RNA space than in the protein space (Fig. 6b). The differences among CD8 T cell subtypes were even bigger than the differences between CD4 and CD8 T cells.
SCMER selected a set of genes that best preserved the diversity at the protein-level, notably the continuum among naïve CD8 T cells, memory CD8 T cells, and effector CD8 T cells (Fig. 6c). It identified genes that are non-homologous to the protein markers but better represent the protein level difference, for example, GPR183, KLRF1, CD79B, and S100A4 for CD4, CD56, CD45RA, and CD45RO, respectively (Fig 6d,f). On the other hand, the SCMER result appeared to better delineate progenitor cells than the protein markers, which demonstrates a strength of integrating complementary modalities.
Similar conclusions were drawn when applying SCMER on another smaller PBMC CITE-seq dataset 4 0 with 10 protein markers (Supplementary Result 3).
Importantly, the genes selected by SCMER from one donor (14,468 cells) appeared to preserve the cell diversity in another donor (16,204 cells) (Supplementary Figure 10e-f), which validated the applicability of SCMER in designing targeted panels for populational level testing.

DISCUSSION
SCMER was designed to meet an important need in single-cell molecular data analysis, to sensitively identify non-redundant features that delineate both common cell lineages and rare cellular states ignored by current approaches. It provides an ab initial approach for discovering novel genes and features in high dimensional datasets, designing cost-effective assays for potential clinical applications, and assisting multi-modality integration of gene expression, proteins, and other features.
SCMER does not require clusters or trajectories and is not affected by uncertainties in clustering or trajectory inference. It explores alternative explanations via feature selection and reports the most salient features representing different facets of cells and underlying molecular activities. As a result, on datasets involving hematopoiesis, lymphogenesis, tumorigenesis, and drug resistance and response, SCMER identified features representing major cell types, rare cell populations (RCPs), continuous cell states, and multicellular programs. In our study, it prevailed existing unsupervised methods and often performed better than or comparably to the supervised methods when accurate labeling was possible (Supplementary Figure 11). Moreover, SCMER can handle batch effects by treating batches as a stratum, and finding a consensus set of features that preserve the manifold in respective batches (Methods). In that manner, it will prioritize genes contributing to biological but not technical variances. . On a dataset with 10,000 cells and 2,000 candidate features, it typically converges in 20 to 40 iterations, which takes 5 to 10 minutes on a desktop computer equipped with a 3.20GHz 6-core Intel Core i7-8700 CPU. GPU acceleration is also supported, and the time consumption is halved with a middle-end NVidia GTX 960M GPU on a laptop computer with a 2.7GHz 4-core Intel Core i7-5700HQ CPU.
Because SCMER detects informative features that represent much wider and more complex biological processes than current methods, we expect it to be of immediate interest in projects producing large numbers of unsorted cells, such as the Human Cell Atlas 5 8 , the Human BioMolecular Atlas Program (HuBMAP) 5 9 , the Precancer Atlas . It will be beneficial in various scenarios including biomarker discovery and clinical assay designing. As a first-of-its-kind method designed for manifold preserving feature selection on biomedical data, it can potentially be broadly applied to non-single-cell data, for example, bulk RNA expression 2 9 , copy number aberration 6 2 , and genetic and drug screening data in large cohort studies such as TCGA 6 3 , GTEx 6 4 , Depmap 6 5 , CTRP 6 6 , etc.

CELL-CELL SIMILARITY
SCMER is inspired by three methods: Stochastic Neighbor-Preserving Feature Selection (SNFS) 6 7 , tdistributed stochastic neighbor embedding (t-SNE) 2 7 and Uniform Manifold Approximation and Projection (UMAP) 2 8 . Because UMAP is more sensitivity to both global relationship between cell groups and local relationship within cell groups 2 9 , we borrowed a part of the UMAP formulation, i.e.,

SUPERVISED MULTI-OMICS MODE
To transfer the manifold in one matrix ‫)܆(‬ to another (‫܆‬Ԣ), either between different modalities or subsets

‫܆‬
. This is also applicable to select features from a shortlist of the original ones.

USING PRESELECTED FEATURES
In the case that a researcher wants to specify a few features that are known to be useful, we slightly modify the regularization to is a diagonal matrix. If a feature is considered important a priori, the corresponding entry in ‫ܞ‬ is set to 0 to avoid ݈ ଵ -regularization. In this "softlysupervised" way, SCMER is more likely to select these features, but may still discard some of them if they are contradicting with the manifold. Thus, in addition, we provide a "hard-supervised" way where a set of features are guaranteed to be kept. Other features are selected to supplement them.

ORTHANT-WISE LIMITED MEMORY QUASI-NEWTON ALGORITHM
Limited-memory BFGS (L-BFGS) is an widely-used optimization algorithm in the quasi-Newton methods family 6 8 . It should be noted that setting the undefined point to 0 (or any other value) at ‫ݓ‬ ൌ 0 does not solve the problem as the discontinuity will also break L-BFGS. A modified version of L-BFGS called orthant-wise limited memory quasi-Newton (OWL-QN) algorithm 3 1 solves this problem by introducing pseudogradients and restrict the optimization to an orthant without discontinuities in the gradient.

L-BFGS optimizer is provided in PyTorch
. It constrains the optimization to be in the same "orthant" in each iteration.

DATA PREPROCESSING
For the melanoma data 3 4 , which is TPM based, after removing ERCC spike-ins, we processed the data using the standard workflow of SCANPY 6 9 , including quality control (filtering out genes that are detected in less than 3 cells), normalization (10,000 reads per cell), log transformation, highly variable genes detection (with a loose threshold to filter out noisy genes; not to be confused with the DXG we compared with), and scaling.
For the Ileum Lamina Propria Immunocytes data 3 6 , bone marrow data 3 7 , and A549 data 3 8 , which are UMI based, we used the standard workflow of SCANPY, including quality control (filtering out genes that are detected in less than 3 cells), normalization (10,000 reads per cell), log transformation, highly variable genes detection, and scaling. We used the stratified approach to suppress batch effect on the Ileum Lamina Propria Immunocytes data.
For protein data in CITE-Seq , we followed the preprocessing of protein data described in the original publication. For mRNA data in CITE-seq, we follow the standard workflow of SCANPY, as described above, except that we did not filter highly variable genes. We preprocessed protein data as mRNA data, without filtering highly variable genes.

INFERENCE OF TF ACTIVITIES
Because TFs tend to bind at sites with cognate motifs, accessibility at peaks with the motifs reflects their activity. To estimate transcription factor activity from sci-ATAC-seq data, we use chromVAR

COMPARISON WITH OTHER METHODS
To identify the highly expressed genes (HXG), we used the standard SCANPY 6 9 workflow. HXG is defined by the total reads of a gene across all cells. To identify the highly variable genes, we followed the standard scoring method in SCANPY 6 9 . SCMarker 1 9 provides a gene list without ranks. It has two parameters, We ran Monocle 1 7 in unsupervised and supervised manners. For the supervised run, the labels were used directly. The trajectory was inferred using clusters/labels and pseudo-time is calculated. Genes were ranked by the degree they are explained by functions (which were fitted with cubic splines) of pseudotime. For the unsupervised run, we clustered the cells and visually confirmed the clusters are concordance with the labels.
We ran RankCorr 1 2 in both supervised and unsupervised manner. For the supervised run, we used the label from the data directly. For the unsupervised run, we used the Leiden algorithm 7 0 for clustering which is the recommended method in SCANPY. Default parameters were used, and the clusters are visually checked that they are reasonable.
For random results, we randomly selected gene sets of given sizes. Reported are mean performance and the critical level of statistically significantly better (or worse) than random as defined by single-sample one-sided z-test at 5% significance level.

ETHICS APPROVAL AND CONSENT TO PARTICIPATE
Not applicable in this study.

CONSENT FOR PUBLICATION
Not applicable in this study.

COMPETING INTERESTS
The authors declare that they have no competing interests. (b) Applications of SCMER. SCMER selects features that preserve the manifold and retain inter-and intra-cluster diversity, and thus can be applied to discover rich molecular pathways, integrate modalities, and design customized DNA/RNA/antibody panels of restricted sizes.  (d-f) Expression of genes that show intra-cluster gradients.  Table 14).

FUNDING
(b) UMAP embedding of the dataset on SCMER selected genes.
(d) Activity of selected markers. Arrows are drawn for a visual reference for the developmental processes.