sc-REnF: An entropy-guided robust feature selection for single-cell RNA-seq data

Annotation of cells in single-cell clustering requires a homogeneous grouping of cell populations. Since single-cell data are susceptible to technical noise, the quality of the genes selected prior to clustering is of crucial importance in the preliminary steps of downstream analysis. Therefore, robust gene selection has gained considerable attention in recent years. We introduce sc-REnF (robust entropy-based feature (gene) selection method), which leverages the advantages of Rényi and Tsallis entropies for gene selection in single-cell clustering. Experiments demonstrate that, with a tuned parameter (q), Rényi and Tsallis entropies select genes that improve the clustering results significantly over the other competing methods. sc-REnF captures relevancy and redundancy among the features of noisy data extremely well owing to its robust objective function. Moreover, the selected features/genes can cluster unknown cells with high accuracy. Finally, sc-REnF yields good clustering performance on small-sample, large-feature scRNA-seq data. Availability: sc-REnF is available at https://github.com/Snehalikalall/sc-REnF


Introduction
In recent times, technological advances have made it possible to study RNA-seq data at single-cell resolution 1 . Single-cell RNA sequencing (scRNA-seq) is a powerful tool to capture gene expression snapshots in individual cells. Cell type detection is one of the fundamental steps in the downstream analysis of scRNA-seq data 2 . A widely used approach for this is to cluster the cells into different groups and determine the identity of cells within the individual groups/clusters 3,4 . This provides an unsupervised method of annotating the different cell types present in a large population of scRNA-seq data 5-7 . Starting from raw counts, scRNA-seq data analysis typically goes through the following steps before clustering: i) normalization (total-count and log-normalization), ii) feature selection, and iii) dimensionality reduction. While normalization/log-normalization adjusts the differences between the samples of individual cells and reduces the skewness of the data, feature selection seeks to identify the most relevant features (genes) from the large feature space. The top genes are either i) highly variable genes, selected by measuring the coefficient of variation 8,9 , or ii) highly expressed genes, having expression levels higher than the average across all cells 10 .
The performance of downstream analysis, mainly the clustering process, is heavily dependent on the quality of the selected top features/genes. The typical characteristics of good features/genes are: i) they should encode useful information about the biology of the system, ii) they should not contain random noise, and iii) they should preserve the useful biological structure while reducing the size of the data so as to reduce the computational cost of later steps. The conventional approach of selecting genes by high variability in the very first stage of the downstream analysis is seemingly simple. However, there are two major caveats: i) the observed variability of genes depends on the pseudo-count, which is arbitrary and can introduce biases in the data; ii) PCA dimensionality reduction implicitly depends on Euclidean geometry, which may not be appropriate for highly sparse and skewed scRNA-seq data. To take care of (i), adding a small pseudo-count to all normalized counts prior to log normalization is common practice in this pipeline. It is required because CPM (counts per million) normalization cannot change the dominating zeros in the scRNA-seq data, and without the pseudo-count the log transformation is not possible. For (ii), the application of PCA, despite its fast and memory-efficient behavior, is questionable because of the high sparsity and the discrete and skewed nature of the datasets.
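The CPM, pseudo-count, and log steps discussed above can be sketched as follows (a minimal illustration, not the paper's Linnorm-based pipeline; the function name and toy matrix are ours):

```python
import numpy as np

def cpm_log_normalize(counts, pseudocount=1.0):
    """Counts-per-million normalization followed by a log2 transform.

    counts: (n_cells, n_genes) raw count matrix.
    The pseudo-count is added before the log so zeros stay finite; as noted
    in the text, its (arbitrary) value influences the observed variability.
    """
    lib_sizes = counts.sum(axis=1, keepdims=True)  # total counts per cell
    cpm = counts / lib_sizes * 1e6                 # counts per million
    return np.log2(cpm + pseudocount)

# toy example: two cells, three genes
raw = np.array([[0.0, 10.0, 90.0], [5.0, 5.0, 0.0]])
norm = cpm_log_normalize(raw)
```

Note how a zero count maps exactly to log2(0 + 1) = 0, so the dominating zeros survive the transform unchanged.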
In this paper, we address these two challenges independently and in combination. We first present a method that finds the most informative features/genes from scRNA-seq datasets based on the widely used Rényi and Tsallis entropy measures. Although entropy-based feature selection is widely researched, its applications in the single-cell domain are less explored. The handful of works that exist in this domain 11,12 are not focused on informative gene selection from single-cell data: 11 aims to use entropy to improve the cluster annotation step, while 12 focuses on quantifying the differentiation potential of cells. In this paper, we show that employing Rényi and Tsallis entropy in the gene filtering process introduces major advantages, both in terms of clustering accuracy and in terms of a biologically meaningful interpretation (a.k.a. marker selection) of the results. The latter point is established because we are able to reveal novel marker genes for different cell types. Note that biological marker selection is usually a crucial step in the downstream analysis and depends on the purity of the cell clusters annotated in the previous stage. Here, we present a method for selecting the most informative genes that ultimately leads to a good clustering and annotation of cells in scRNA-seq data. We compare the proposed method with six state-of-the-art gene selection techniques: two methods are based on Min-Rényi 13 and Shannon 14 entropy, and four methods are well known for gene selection in scRNA-seq data (GiniClust 15 , PCA Loading 16 , CV2 Index 17 and Fano Factor 18 ). We also utilize three well-known scRNA-seq clustering methods (Seurat 4 , SC3 3 and CIDR 19 ) to validate the features (genes) selected by the different competing methods.
Beyond predicting the cluster annotation of single cells, we also demonstrate that our method provides better clustering and annotation on completely independent test data. For this, we split the data into training and test sets and demonstrate that the features selected on the training set are equally effective for clustering the test samples. We also carry out a comprehensive simulation study to establish the effectiveness of the proposed method on eight contaminated Gaussian-mixture datasets. The results show that the proposed method not only selects genes with high accuracy, but is also robust when the parameters are appropriately tuned.
Summary of contributions: Here, we provide the following novelties: -We explore the first entropy-based gene selection approach for clustering single-cell data. We utilize Rényi and Tsallis entropies, which have a major advantage over the Shannon method owing to their controlling parameter (q), which makes them less sensitive (robust) to the different noises present in the data.
-Our approach is the first to explicitly address how to learn feature relevancy and redundancy using Rényi and Tsallis entropies in scRNA-seq expression data. We propose an objective function that minimizes the conditional entropy between the selected features and maximizes the conditional entropy between the class label and the candidate feature.
-Our framework (with a tuned parameter) can cluster unseen scRNA-seq expression data with high accuracy. Clustering hitherto unclassified data is crucial for the annotation of cell types, and we show our framework to be effective in this case. We demonstrate that the features selected from the training set are equally valid for completely unseen test data of the same tissue.
-We derive new risk functions for the Rényi (R q ) and Tsallis (R T q ) entropies (see Equations 4 and 7 in Methods). We theoretically prove that the iterative selection of features eventually minimizes the Rényi (R q ) and Tsallis (R T q ) risks, which strengthens the robustness of our objective function.
-Our method provides good results for small-sample, large-feature single-cell data.
Short description of feature selection and related works
Selection of relevant features remains a challenging problem in machine learning because of the ever-increasing size of the data. The principal motivation of feature selection is to reduce the dimension of datasets so as to decrease computational cost and enhance the accuracy of algorithms. Existing approaches for feature selection in different domains, such as text categorization, image retrieval, intrusion detection, and genomic analysis, are described in 20,21 . Feature selection algorithms are broadly classified into three categories: filter, wrapper, and embedded models (see sec-1 of the supplement).
Several entropy-based filter methods have been proposed in the literature; among them, Shannon's entropy is utilized in most cases (see sec-1 of the supplement). Rényi and Tsallis entropies 22 are measures with some interesting characteristics. Recently, their application in the fields of security and privacy has reinvigorated interest in them 23 . Their utility in the biological domain is yet to be established; the present article is an attempt in this direction.

Results
In the following, we will describe the workflow of our analysis pipeline.

Workflow
Figure 1 describes the workflow of our analysis pipeline. All important steps are discussed in this subsection.
Preprocessing of raw datasets: Raw single-cell RNA-seq datasets are downloaded from publicly available sources. The RNA counts are organised as a matrix M cl×ge , where cl is the number of cells and ge is the number of genes. Each element [M] i j represents the count of the i th cell and the j th gene. If more than a thousand genes are expressed (non-zero values) in a cell, then the cell is termed good. A gene is considered expressed if its read count exceeds 5 in at least 10% of the good cells. The data matrix M with expressed genes and good cells is normalized using a linear model and normality-based normalizing transformation method (Linnorm) 24 . The resulting matrix is then log 2 transformed after adding one as a pseudo-count.
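The cell and gene filtering rules above can be sketched as follows (a minimal illustration using the thresholds stated in the text; the function name and toy matrix are ours, and the Linnorm normalization step is omitted):

```python
import numpy as np

def filter_cells_and_genes(M, min_genes_per_cell=1000, min_count=5, min_cell_frac=0.10):
    """Quality filtering as described in the text.

    M: (n_cells, n_genes) raw count matrix.
    A cell is 'good' if more than `min_genes_per_cell` genes are non-zero in it.
    A gene is 'expressed' if its count exceeds `min_count` in at least
    `min_cell_frac` of the good cells.
    """
    good_cells = (M > 0).sum(axis=1) > min_genes_per_cell
    Mg = M[good_cells]
    expressed = (Mg > min_count).mean(axis=0) >= min_cell_frac
    return Mg[:, expressed], good_cells, expressed

# toy usage with a relaxed cell threshold so the tiny matrix passes
M = np.array([[6, 0, 7], [6, 6, 0], [0, 0, 0]])
filtered, good_cells, expressed = filter_cells_and_genes(M, min_genes_per_cell=1)
```

In practice the defaults (1000 genes per cell, count > 5 in 10% of good cells) would be applied to the full count matrix before Linnorm normalization.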
Feature/gene selection using Rényi and Tsallis entropy
See panel-A of Figure 1. The feature/gene selection is driven by an iterative algorithm which selects the most relevant and non-redundant features using Rényi and Tsallis entropy in each step. First, the relevancy between all features and the class labels is computed to select the most relevant feature (see Equation 8). Next, non-redundant features are selected by calculating the redundancy between the remaining features and the already selected ones (see Equation 9).
Validation of selected features/genes
See Figure 1, panel-B. The validation is performed by employing benchmark single-cell clustering techniques to group the cells with the selected features/genes. The clustering results are evaluated using the Adjusted Rand Index (ARI) score, which ensures proper and accurate partitioning of cells. We compute differentially expressed (DE) genes from each cluster, which can be treated as marker genes of specific groups of cells.
Clustering of unannotated cell samples
See Figure 1, panel-C. Clustering unknown samples is crucial in scRNA-seq data analysis and can be addressed in a supervised or unsupervised way. For unknown cell types, our method can cluster the data with the selected genes, yielding a clustering that is as fine-grained as is justified by the available labeled data. We split the whole expression data in a 70:30 train:test ratio and observed that the genes selected on the training data can accurately cluster the cell samples of the test data. This is the key element that makes our approach work in practice.
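The iterative relevance/redundancy loop can be sketched as below. This is an illustrative simplification, not the exact objective of Equations 8-10: we use the difference form of the conditional Rényi entropy on integer-discretized expression values, take the feature with the lowest conditional entropy of the class given the feature as most relevant, and then greedily trade off redundancy against relevance. All function names and the scoring rule are ours.

```python
import numpy as np
from collections import Counter

def renyi_entropy(p, q):
    """Renyi entropy (nats) of a probability vector p, q > 0, q != 1."""
    p = p[p > 0]
    return np.log(np.sum(p ** q)) / (1.0 - q)

def joint_probs(x, y):
    """Joint distribution of two discrete (integer-coded) variables."""
    pairs = Counter(zip(x, y))
    p = np.array(list(pairs.values()), dtype=float)
    return p / p.sum()

def cond_renyi(x, y, q):
    """H_q(X|Y) = H_q(X,Y) - H_q(Y) (difference form of conditional Renyi)."""
    pj = joint_probs(x, y)
    py = np.bincount(y).astype(float)
    py /= py.sum()
    return renyi_entropy(pj, q) - renyi_entropy(py, q)

def select_features(X_disc, labels, n_select, q=0.7):
    """Greedy selection: relevant first, then non-redundant yet relevant."""
    n_genes = X_disc.shape[1]
    relevance = np.array([cond_renyi(labels, X_disc[:, g], q) for g in range(n_genes)])
    selected = [int(np.argmin(relevance))]   # lowest H_q(C|f): most relevant
    while len(selected) < n_select:
        best, best_score = None, -np.inf
        for g in range(n_genes):
            if g in selected:
                continue
            # redundancy: how predictable g is from the already-selected genes
            red = min(cond_renyi(X_disc[:, g], X_disc[:, s], q) for s in selected)
            score = red - relevance[g]        # prefer non-redundant, relevant
            if score > best_score:
                best, best_score = g, score
        selected.append(best)
    return selected

# toy data: labels encode two independent bits; gene 2 duplicates gene 0,
# so the second pick should be gene 1, not the redundant copy
labels = np.tile(np.arange(4), 2)
bitA, bitB = labels // 2, labels % 2
X_disc = np.stack([bitA, bitB, bitA], axis=1)
chosen = select_features(X_disc, labels, n_select=2, q=0.7)
```

On this toy example the redundant duplicate (gene 2) is skipped in favor of the complementary gene 1, which is the behavior the relevance/redundancy trade-off is meant to produce.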

Feature selection on synthetic contaminated data
Data generation
To validate sc-REnF on synthetic contaminated data, we first generate four Gaussian-mixture datasets with overlapping and non-overlapping classes (k) and then contaminate them by reshuffling the samples among the classes. Figure 2, panel-A shows the 2-dimensional t-SNE representation of the generated data with k = 4 and 5. In each case, we generated 500 samples containing 50 relevant and 250 irrelevant features by varying the mean (µ) in the range between -5 and 20 for non-overlapping classes and between -2 and 3 for overlapping classes. The covariance matrix (Σ) is fixed and is computed using the formula Σ(i, j) = ρ |i− j| with ρ = 0.5. For generating the irrelevant features, white Gaussian noise 11 is generated (µ = 0 and s.d. = 1) and added to the relevant features. Further, to make the constructed data noisy, we reshuffled/mixed the samples using an 8:2 mixing ratio among the classes.
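The data generation described above can be sketched as follows (a minimal version under our own naming; the exact contamination scheme in the paper may differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_contaminated_mixture(k=4, n=500, n_rel=50, n_irr=250,
                              mu_range=(-5, 20), rho=0.5, mix_frac=0.2):
    """Gaussian-mixture data with AR(1)-style covariance and label contamination.

    Sigma(i, j) = rho ** |i - j| on the n_rel relevant features; irrelevant
    features are white Gaussian noise. A fraction `mix_frac` of the labels is
    reshuffled to contaminate the classes (the text uses an 8:2 mixing ratio).
    """
    idx = np.arange(n_rel)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])
    means = np.linspace(*mu_range, k)              # one mean level per class
    labels = rng.integers(0, k, size=n)
    X_rel = np.array([rng.multivariate_normal(np.full(n_rel, means[c]), Sigma)
                      for c in labels])
    X_irr = rng.normal(0.0, 1.0, size=(n, n_irr))  # irrelevant noise features
    X = np.hstack([X_rel, X_irr])
    # contaminate: reshuffle the labels of a random subset of samples
    m = int(mix_frac * n)
    swap = rng.choice(n, size=m, replace=False)
    labels[swap] = rng.permutation(labels[swap])
    return X, labels

X, y = make_contaminated_mixture()
```

Overlapping classes are obtained by narrowing `mu_range` (e.g. (-2, 3)), as in the text.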

Comparisons with state-of-the-art
Here we compare sc-REnF with five gene selection algorithms (Shannon entropy, GiniClust, PCA Loading, CV2 Index, and Fano Factor) widely used in scRNA-seq clustering. For GiniClust, we use the R package with the default parameters as provided in the original paper. For PCA Loading, we consider the first three principal components as the default parameter. The other two methods, Fano Factor and CV2 Index, which are commonly used to define mean-normalized variance, are implemented from their original definitions. Three benchmark scRNA-seq clustering algorithms (Seurat 4 , SC3 3 and CIDR 19 ) are employed to validate the features selected by the different state-of-the-art gene selection techniques. We retain the default parameters as specified in each scRNA-seq clustering package.
The top 100 genes are selected for each competing method. The ARI score is computed for each clustering and is reported in Table 1. It is observed that sc-REnF (using Rényi and Tsallis entropy) performs well compared to the other competing methods. sc-REnF achieves the highest ARI score (for Rényi entropy with q-parameter = 0.7) on the Pollen dataset. Similarly, sc-REnF using Tsallis entropy yields the highest ARI score on the Darmanis dataset. Considering all three clustering techniques (Seurat, SC3 and CIDR) and all the datasets, sc-REnF (with Rényi and Tsallis entropy) provides better results than the other competing methods. Detailed results of the clustering analysis and marker analysis on the Darmanis and CBMC datasets are provided in the supplement (sections 5, 6 and 7).

Clustering unannotated cells with selected features
Clustering unannotated cells is crucial for the scRNA-seq analysis pipeline. Here we address this by performing a cross-validation with a 7:3 train-test split of the data. First, the training dataset is used by sc-REnF to select informative genes; the top 100 genes are selected for clustering and validation. The performance of sc-REnF is then computed on the test data using the genes selected in the earlier step: we cluster the test data with the selected genes and report the ARI value. The experiment is repeated 20 times with a random train-test split (7:3 ratio) in each case. Table 2 shows the median and standard deviation of the ARI scores.

Selected genes have reliable overlap with marker genes
Here the aim is to investigate the overlap of the selected gene sets with the DE genes of specific cell types. We compare the gene sets selected by the competing methods (sc-REnF, GiniClust, PCA Loading, Fano Factor, and CV2 Index) to count their overlap with the marker genes. Figure 3 shows six radar charts corresponding to six methods, each showing the number of overlapping genes in the four datasets. The area of each chart represents the total number of DE genes among the selected top 100 genes; a method covering a larger area has a larger number of overlapping DE genes. The figure depicts that sc-REnF (with Tsallis entropy) outperforms all the methods. sc-REnF using Rényi entropy covers almost the same area as the CV2 Index. For computing differentially expressed genes from the datasets, we used the Wilcoxon rank-sum test.
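The DE test and overlap count can be sketched as follows (an illustration with SciPy's rank-sum test on toy data, comparing each gene in one cluster against the rest; the function names are ours):

```python
import numpy as np
from scipy.stats import ranksums

def de_genes_for_cluster(X, cluster_labels, cluster, alpha=0.05):
    """Wilcoxon rank-sum test per gene, cluster vs rest, as in the text."""
    in_c = cluster_labels == cluster
    pvals = np.array([ranksums(X[in_c, g], X[~in_c, g]).pvalue
                      for g in range(X.shape[1])])
    return np.where(pvals < alpha)[0]

def overlap_count(selected, de):
    """Number of selected genes that are also DE (the radar-chart quantity)."""
    return len(set(selected) & set(de))

# toy usage: gene 0 is strongly shifted in cluster 0, gene 1 is not
rng = np.random.default_rng(2)
X = rng.normal(0, 1, (100, 2))
labels = np.array([0] * 50 + [1] * 50)
X[:50, 0] += 5.0                       # strong DE signal in gene 0
de = de_genes_for_cluster(X, labels, cluster=0)
n_overlap = overlap_count([0, 1], de)
```

In a real analysis the p-values would typically be corrected for multiple testing before thresholding.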

Stability of sc-REnF
Here we explore an additional advantage of sc-REnF over the other measures by validating the stability of its performance. A non-parametric statistical test, the Kruskal-Wallis test 25 , is utilized to examine the stability of the ARI scores resulting from the clustering. We vary the number of features from 20 to 100 and, for each case, compute the ARI score after clustering. Thus, for one method (e.g. Rényi) and one dataset, we get nine ARI scores (for #features = 20, 30, 40, 50, 60, 70, 80, 90, 100) representing the clustering performance with different numbers of selected features. To assess the variation of ARI scores across all the datasets for a particular method (e.g. Rényi), we perform the Kruskal-Wallis test. ARI scores are computed 50 times for each method, and the median of the scores is given to the Kruskal-Wallis test. Although all methods produce stable results with low p-values, sc-REnF (with Tsallis and Rényi entropy) shows the most stable performance among the methods (p-value = 8.01E-05 and 1.59E-04, respectively; see supplementary table-1). These results may be treated as a straightforward implication of the theoretical proof presented in the Methods section.
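The stability check can be sketched with SciPy's Kruskal-Wallis test. The ARI scores below are illustrative placeholders, not the paper's actual values; each row holds the nine per-feature-count scores for one dataset:

```python
from scipy.stats import kruskal

# ARI scores for one method across four datasets, at feature counts
# 20, 30, ..., 100 (illustrative numbers only)
ari_by_dataset = [
    [0.71, 0.72, 0.70, 0.73, 0.71, 0.72, 0.70, 0.71, 0.72],  # dataset 1
    [0.55, 0.56, 0.54, 0.55, 0.57, 0.55, 0.54, 0.56, 0.55],  # dataset 2
    [0.83, 0.84, 0.82, 0.83, 0.85, 0.83, 0.84, 0.82, 0.83],  # dataset 3
    [0.40, 0.41, 0.39, 0.40, 0.42, 0.40, 0.39, 0.41, 0.40],  # dataset 4
]
stat, pvalue = kruskal(*ari_by_dataset)
```

A small p-value here indicates that the score groups differ clearly between datasets while varying little within each dataset, i.e., the method's performance is stable with respect to the number of selected features.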

Clustering results on CBMC data
Here we provide clustering results on 8,000 cord blood mononuclear cells (CBMCs) using the features selected by sc-REnF (with Tsallis entropy). Figure 4, panels A and B, shows the t-SNE visualization of cells with the original labels and the predicted cluster labels, respectively. Most of the clusters, such as cluster-2, cluster-4 and cluster-14, determine unique cell types in the data. For example, cluster-2 captures most of the samples of CD14+ Mono cells, while cluster-4 and cluster-14 represent samples of 'NK' (natural killer) and erythrocyte cells. Some clusters represent more than one cell type: for example, cluster-13 includes DCs and pDCs, cluster-11

Discussions and conclusions
Clustering of cells in scRNA-seq data is an essential step for cell type discovery from a large population of cells. Owing to the large feature/gene set of scRNA-seq data, the selection of the most variable genes is crucial in the preprocessing step and has an immense effect on the later stages of downstream analysis. The proposed method, sc-REnF, addresses this issue by using an entropy-based (Rényi, Tsallis) feature selection method to identify possibly informative genes in the preprocessing steps. sc-REnF has the advantage over conventional statistical approaches that it can consider cell-to-cell dependency, based on the generalized, wide-spectrum entropy measures of Rényi and Tsallis. We demonstrated that sc-REnF, using the Rényi and Tsallis methods, introduces major advantages both in terms of clustering accuracy and in terms of marker gene detection in the downstream analysis of scRNA-seq data. sc-REnF yields a stable feature/gene selection with a controlling parameter (q) for the Rényi and Tsallis entropies. The optimal controlling parameter (q) is determined by applying the method to a synthetic contaminated Gaussian-mixture dataset. We later demonstrated that the selected range of q-values is applicable in real-life scRNA-seq data clustering tasks. On the four scRNA-seq datasets where we apply it, sc-REnF yields accurate clustering results, as validated by the ARI score. The stability of sc-REnF is demonstrated by evaluating its performance using the Kruskal-Wallis test: while applying sc-REnF multiple times with a varying number of features, the resulting ARI scores show minimal deviation (p-value < 0.05 for Kruskal-Wallis) for the Rényi and Tsallis entropies.
Although the primary objective of sc-REnF is variable gene selection in the preprocessing step of scRNA-seq data analysis, we extend the process towards the later stages of downstream analysis. We employ clustering techniques to group the cells using the selected genes. A precise clustering of cells also demonstrates the efficacy of our method in selecting the most variable genes in the first stage. This facilitates the selection of novel marker genes within each cluster. We pinpoint several markers that show a high expression level within a particular cluster, some of which are also identified in previously published cell marker databases.
Clustering of unknown samples based on reference data is a crucial problem for the identification of cell types in scRNA-seq cell classification. We address the problem by clustering the unknown samples using the genes selected on the reference data. We demonstrate the advantage of using the genes selected by sc-REnF in the clustering of unknown test samples. We observe good ARI scores when clustering test samples, suggesting that genes selected from the reference data are also effective in producing an accurate clustering of completely unknown test samples.
The execution time of sc-REnF is directly proportional to the number of selected features and can be expensive when one needs to select a large number of features. However, this can be tackled with the ever-increasing computing power of modern servers. Additionally, a regularization parameter has not been considered in our proposed approach, which may sometimes make the algorithm susceptible to overfitting unless it is carefully employed.
Taken together, the proposed method sc-REnF not only performs well for informative gene selection in the preprocessing step but can also support the classification of unknown cells in scRNA-seq data. Despite having been applied to feature selection in other domains, the application of Rényi and Tsallis entropy shows good potential in gene selection and cell clustering of scRNA-seq data. The results show that sc-REnF not only leads in the domain of robust feature (gene) selection but also accelerates the investigation of cell type definition in large scRNA-seq data. We believe that sc-REnF may be an important tool for computational biologists to explore the most informative genes and marker genes in the downstream analysis of scRNA-seq data.

Materials and method
Overview of datasets
Single-cell RNA sequence datasets
Four single-cell RNA-seq datasets are used for the evaluation of the proposed method: Yan, Pollen, Darmanis, and CBMC.
• CBMC 29 : Cord blood mononuclear cells (CBMCs) were profiled by CITE-seq, a method that measures both cellular protein and mRNA expression in the same cell using oligonucleotide-labeled antibodies. The dataset consists of the expression levels of 2000 mRNAs and 13 proteins, individually measured in 8,000 cord blood mononuclear cells. The dataset is available on the GEO website under accession GSE100866.
Deriving the Rényi and Tsallis risk functions
Let us consider three discrete random variables X, Y and Z with supports {x 1 , · · · , x d }, {y 1 , · · · , y p } and {z 1 , . . . , z n }, respectively. For each i = 1, 2, . . . , d, j = 1, 2, . . . , p and k = 1, 2, . . . , n, let us denote p i = P(X = x i ), p i jk = P(X = x i ,Y = y j , Z = z k ), p i| jk = P(X = x i |Y = y j , Z = z k ) and so on. The Rényi entropy of the random variable X is defined in terms of a positive real number q, with q ≠ 1, as

$$H_q(X) = \frac{1}{1-q} \log \left( \sum_{i=1}^{d} p_i^q \right).$$

Interestingly, this Rényi entropy reduces to the Shannon entropy as q → 1. It can also be extended to the three random variables X, Y and Z, so that their joint Rényi entropy is given by

$$H_q(X,Y,Z) = \frac{1}{1-q} \log \left( \sum_{i,j,k} p_{ijk}^q \right).$$

Accordingly, the conditional Rényi entropy can be defined as

$$H_q(X \mid Y,Z) = H_q(X,Y,Z) - H_q(Y,Z).$$

The Rényi entropy is a decreasing function of q. It can be shown that the conditional Rényi entropy closely corresponds to a risk function, i.e., a form of the expected error when we try to estimate the value of X given the values of Y and Z. We refer to the associated risk as the Rényi risk function R q , defined in Equation 4. Similarly, the Tsallis joint entropy of the three random variables X, Y and Z is defined in terms of a tuning parameter q > 0 as

$$S_q(X,Y,Z) = \frac{1}{q-1} \left( 1 - \sum_{i,j,k} p_{ijk}^q \right).$$

It also coincides with the Shannon entropy as q → 1.
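The q → 1 limits stated above can be checked numerically. The sketch below implements the standard (unconditional) Rényi and Tsallis entropies and verifies that both approach the Shannon entropy as q → 1, and that the Rényi entropy decreases in q:

```python
import numpy as np

def shannon(p):
    """Shannon entropy (nats) of a probability vector."""
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def renyi(p, q):
    """Renyi entropy: log(sum p_i^q) / (1 - q), q > 0, q != 1."""
    p = p[p > 0]
    return np.log(np.sum(p ** q)) / (1.0 - q)

def tsallis(p, q):
    """Tsallis entropy: (1 - sum p_i^q) / (q - 1), q > 0, q != 1."""
    p = p[p > 0]
    return (1.0 - np.sum(p ** q)) / (q - 1.0)

p = np.array([0.5, 0.3, 0.2])
h = shannon(p)
r = renyi(p, 1.0 + 1e-6)     # q -> 1 recovers Shannon
t = tsallis(p, 1.0 + 1e-6)
```

Both `r` and `t` agree with `h` to within the O(q - 1) discretization error, mirroring the limiting behavior used in the derivation.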
The conditional Tsallis entropy of the random variable X, given the values of Y and Z, is defined analogously.

Next, we define a Tsallis risk function, i.e., another form of the expected error when we try to estimate the value of X given the values of Y and Z; this is the Tsallis risk R T q of Equation 7, which involves the joint escort probability distribution 30

$$ep_{j,k} = \frac{p_{jk}^q}{\sum_{j',k'} p_{j'k'}^q}.$$

Feature relevance: Feature f i is more relevant to the class label C than feature f j , in the context of the already selected subset S, when the inequality in Equation 8 holds, where E (·|·) is a (bivariate) conditional entropy function.
Feature redundancy: If feature f j shares more information with feature f i than feature f j+1 does, then feature f j is more redundant with respect to feature f i , given the information about the class label C; this is characterized by the inequality in Equation 9, where E (·|·, ·) is an appropriate conditional entropy.
Objective function: To select each feature, we minimize the conditional entropy between a candidate f i ∈ (F − S) and the already selected features f s ∈ S (to reduce the redundancy between them) and maximize the conditional entropy between the class label C and f i ∈ (F − S). The selected feature subset S and the feature f i ∈ (F − S) are inductively defined by Equation 10. The features selected using this objective function (with Rényi and Tsallis entropy as the choice of E ) are optimal in the sense of minimizing the corresponding risk functions defined in Equations 4 and 7, respectively, as stated by the following proposition.
Theorem: At every step, the selected feature f i+1 minimizes the risk R, where R denotes either R q or R T q , defined in Equations 4 and 7, respectively.
Proof. In order to prove the theorem, we start from the objective function in Equation 10. Let u, v and v′ represent generic value tuples of f s ∈ S (a selected feature), f i+1 (the feature to be selected at the (i + 1) th step), and f ∈ (F − S) (a non-selected feature), respectively. Substituting this generic representation into Equation 12 for the Rényi