ClusterDE: a post-clustering differential expression (DE) method robust to false-positive inflation caused by double dipping

In typical single-cell RNA-seq (scRNA-seq) data analysis, a clustering algorithm is applied to find putative cell types as clusters, and then a statistical differential expression (DE) test is employed to identify the differentially expressed (DE) genes between the cell clusters. However, this common procedure uses the same data twice, an issue known as “double dipping”: the same data is used twice to define cell clusters as potential cell types and DE genes as potential cell-type marker genes, leading to false-positive cell-type marker genes even when the cell clusters are spurious. To overcome this challenge, we propose ClusterDE, a post-clustering DE method for controlling the false discovery rate (FDR) of identified DE genes regardless of clustering quality, which can work as an add-on to popular pipelines such as Seurat. The core idea of ClusterDE is to generate real-data-based synthetic null data containing only one cluster, as contrast to the real data, for evaluating the whole procedure of clustering followed by a DE test. Using comprehensive simulation and real data analysis, we show that ClusterDE has not only solid FDR control but also the ability to identify cell-type marker genes as top DE genes and distinguish them from housekeeping genes. ClusterDE is fast, transparent, and adaptive to a wide range of clustering algorithms and DE tests. Besides scRNA-seq data, ClusterDE is generally applicable to post-clustering DE analysis, including single-cell multi-omics data analysis.


Figure S1:
The generation process of synthetic null data from target data (top left) by scDesign3.We show the bivariate case (with two genes) for illustration purposes, and the real case is high-dimensional with thousands of genes.The null model consists of two parts: for marginal gene modeling, each gene's counts follow a negative binomial (NB) distribution; for joint gene modeling, the genes' dependence structure is specified by the Gaussian copula.For the marginal gene modeling part (top), a negative binomial (NB) distribution is fitted to each gene's counts in the target data, obtaining the two NB parameters (mean and dispersion) for each gene.For the joint gene modeling part (bottom), there are three steps.First, each gene's counts in the target data are transformed into cumulative distribution function (CDF) values, via the fitted NB distribution or the counts' empirical distribution (if the target cell number is large), so the gene's CDF values are uniform between 0 and 1.Second, each gene's CDF values are transformed into quantiles of the standard Gaussian N (0, 1) distribution.Third, a multivariate Gaussian distribution (a bivariate Gaussian distribution for illustration) is fitted to the transformed Gaussian values of the many genes whose correlations are to be modeled.The correlation matrix of the fitted multivariate Gaussian distribution specifies the Gaussian copula.After the marginal and joint modeling, the generation of the same genes' synthetic counts takes three steps.

Figure S4:
The FDRs and power of ClusterDE and the existing methods under various severity levels of double dipping when the two cell types have a size ratio of 1 : 1.The log fold change (logFC) summarizes the average gene expression difference between the two cell types in simulation (see Methods "Simulation design with two cell types and 200 DE genes").Corresponding to a small logFC, a small adjusted Rand index (ARI) represents a bad agreement between cell clusters and cell types, representing a more severe double-dipping issue.Across various severity levels of double dipping and the five DE tests, ClusterDE controls the FDRs under the target FDR thresholds (diagonal dashed line) and achieves comparable or higher power compared to the existing methods at the same actual FDRs.

Figure S5:
The FDRs and power of ClusterDE and the existing methods under various severity levels of double dipping when the two cell types have a size ratio of 1 : 4. The log fold change (logFC) summarizes the average gene expression difference between the two cell types in simulation (see Methods "Simulation design with two cell types and 200 DE genes").Corresponding to a small logFC, a small adjusted Rand index (ARI) represents a bad agreement between cell clusters and cell types, representing a more severe double-dipping issue.Across various severity levels of double dipping and the five DE tests, ClusterDE controls the FDRs under the target FDR thresholds (diagonal dashed line) and achieves comparable or higher power compared to the existing methods at the same actual FDRs.

Figure S6:
The FDRs and power of ClusterDE and the existing methods under various severity levels of double dipping when the two cell types have a size ratio of 1 : 9.The log fold change (logFC) summarizes the average gene expression difference between the two cell types in simulation (see Methods "Simulation design with two cell types and 200 DE genes").Corresponding to a small logFC, a small adjusted Rand index (ARI) represents a bad agreement between cell clusters and cell types, representing a more severe double-dipping issue.Across various severity levels of double dipping and the five DE tests, ClusterDE controls the FDRs under the target FDR thresholds (diagonal dashed line) and achieves comparable or higher power compared to the existing methods at the same actual FDRs.

Figure S7:
When the target data contains cells from two cell types (simulation; see Methods "Simulation design with one cell type and zero DE genes"), the synthetic null data generated by ClusterDE (second row) fills the gap between the two cell types but resembles the target data in other visual aspects of UMAP cell embeddings (left), per-gene expression mean and variance statistics (middle), and gene-gene correlations.In contrast, the synthetic null data generated by knockoffs (third row) and permutations (fourth row) do not resemble the target data.For the synthetic null data generated by knockoffs, it preserves the gene-gene correlations of the target data, but does not preserve per-gene expression mean and variance statistics.For the synthetic null data generated by permutations, it preserves per-gene expression mean and variance statistics of the target data, but does not preserve the gene-gene correlations.

Figure S8:
The FDRs and power of ClusterDE with three approaches for synthetic null generation: scDesign3 (the default in ClusterDE), the model-X knockoffs, and independent permutations of all genes across cells.Compared with the other two approaches, scDesign3 controls the FDR and yields higher power.
Figure S1:The generation process of synthetic null data from target data (top left) by scDesign3.We show the bivariate case (with two genes) for illustration purposes, and the real case is high-dimensional with thousands of genes.The null model consists of two parts: for marginal gene modeling, each gene's counts follow a negative binomial (NB) distribution; for joint gene modeling, the genes' dependence structure is specified by the Gaussian copula.For the marginal gene modeling part (top), a negative binomial (NB) distribution is fitted to each gene's counts in the target data, obtaining the two NB parameters (mean and dispersion) for each gene.For the joint gene modeling part (bottom), there are three steps.First, each gene's counts in the target data are transformed into cumulative distribution function (CDF) values, via the fitted NB distribution or the counts' empirical distribution (if the target cell number is large), so the gene's CDF values are uniform between 0 and 1.Second, each gene's CDF values are transformed into quantiles of the standard Gaussian N (0, 1) distribution.Third, a multivariate Gaussian distribution (a bivariate Gaussian distribution for illustration) is fitted to the transformed Gaussian values of the many genes whose correlations are to be modeled.The correlation matrix of the fitted multivariate Gaussian distribution specifies the Gaussian copula.After the marginal and joint modeling, the generation of the same genes' synthetic counts takes three steps.First, the genes' standard Gaussian values are jointly sampled from the fitted multivariate Gaussian distribution.Second, each gene's standard Gaussian values are transformed into the CDF values of the standard Gaussian distribution.Third, each gene's CDF values are transformed into quantiles of the gene's fitted NB distribution and thus become counts, which constitute the synthetic null data (top right).

Figure S2 :
Figure S2: When the target data contains cells from two cell types (simulation; see Methods "Simulation design with one cell type and zero DE genes"), the synthetic null data generated by ClusterDE fills the gap between the two cell types but resembles the target data in other visual aspects of UMAP cell embeddings (left), per-gene expression mean and variance statistics (middle), and gene-gene correlations.

Figure S3 :
Figure S3: Validity checks of the contrast scores of ClusterDE and P values of Seurat, Countsplit, and the TN test on an exemplary one-cell-type dataset, which does not contain any true DE genes by the simulation design (see Methods "Simulation setting with one cell type and zero true DE genes").The panels (rows) a-e represent the five DE tests in Seurat used in ClusterDE, Seurat, and Countsplit (see Methods "ClusterDE step 3: DE analysis"); since the TN test has its own DE test, its results are the same in the panels a-e.The first column shows that the ClusterDE contrast scores of all genes (true non-DE genes) are approximately symmetric around 0, which meets the assumption of ClusterDE for the FDR control.The second column shows the histograms of the P -values of the correlated genes (top) and the uncorrelated genes (bottom) from Seurat, Countsplit, and the TN test.A larger empirical Kullback-Leibler divergence (KL div.) between the P -value distribution and the theoretical Uniform[0, 1] distribution represents a more severe violation of the P -value uniformity assumption.The results show that Countsplit and the TN test have close-touniform P values for the uncorrelated genes, but their P values exhibit a severe departure from the uniform distribution for the correlated genes.The third column contains the quantile-quantile plots of the negative log-transformed P values corresponding to the second column.

Figure S9 :
Figure S9: Stability of the DE genes identified by Cluster in relation to the randomness of synthetic null generation.Given one target dataset simulated with two cell types (see Methods "Simulation design with two cell types and 200 DE genes"), 50 synthetic null datasets are generated with 50 random seeds, and DE genes are identified by ClusterDEusing each syntheticnull dataset.The red curve shows the mean and standard deviation (half of the vertical bar height) of the numbers of DE genes identified at each target FDR across the 50 random seeds.The cyan curve shows the mean and standard deviation (half of the vertical bar height) of the numbers of DE genes shared between two random seeds, across 50

Figure S11 :
Figure S11: ClusterDE avoids false discoveries under double dipping.Although the datasets contain two monocyte subtypes, the clustering results poorly match the subtype labels (the bottom four datasets in Fig.S10), and thus no DE genes should be discovered.The numbers in black and white are the number of DE genes and the proportion of DE genes among all genes, respectively.In most cases, ClusterDE does not find DE genes, as expected.

Figure S12 :
Figure S12: Gene set enrichment analysis (GSEA) of the ranked DE gene lists identified by ClusterDE and Seurat with five DE tests from three datasets.The red lines represent the enrichment of the "CD14 + /CD16 + Monocyte Marker Genes" set, and the blue lines represent the enrichment of the "Housekeeping Genes" set.The short vertical lines at the bottom show the rank distributions of the genes in the two gene sets within each ranked DE gene list.The normalized enrichment score (NES) reflects the direction and magnitude of enrichment, and the P value indicates the significance of enrichment.

Figure S13 :
Figure S13: Overlaps between monocyte markers/housekeeping genes and the top k DE genes, with k ranging from 1 to 100.The horiontal axis represents k (for example, k = 100 means we select the top 100 DE genes from the DE gene lists).The vertical axis indicates the number of monocyte markers (the top row in each panel) or housekeeping genes (the bottom row in each panel) among the top k DE genes found by each of the five DE tests (columns).In most cases, ClusterDE (blue line) identifies more monocyte subtype markers and fewer housekeeping genes compared to Seurat (red line).

Figure S14 :
Figure S14: The minus-average (MA) plots of ClusterDE contrast scores (target DE score minus null DE score) vs. averages of target DE scores and null DE scores.The red color labels four well-known CD14 + /CD16 + subtype markers, and the blue color labels four well-known housekeeping genes.The dashed black line indicates the contrast scores of 0. For housekeeping genes, their DE scores are large in both target data and synthetic null data, so their contrast scores are centered around 0. Hence, these housekeeping genes would be ranked top by Seurat (which only examines target DE scores) but not by ClusterDE.On the other hand, the DE scores of subtype markers are much larger in target data than in synthetic null data, so their contrast scores are large and ranked top by ClusterDE.

Figure S15 :
Figure S15: A demonstration of using ClusterDE in the presence of multiple cell clusters.a, A UMAP visualization of Seurat clusters found in the PBMC dataset Rep1 10x(v3).The blue box labels two neighboring clusters, which roughly represent the CD14 + /CD16 + monocytes.b, The cluster tree constructed by Seurat, with clusters 2 and 8 corresponding to the two clusters in the blue box in a.It is recommended that ClusterDE can help annotate two neighboring cell clusters in UMAP or a cluster tree, based on the more trustworthy post-clustering DE genes identified by ClusterDE.The reason is that neighboring clusters are more likely to be spurious.

Figure S16 :
Figure S16:A toy example to showcase the double-dipping issue.The two genes' expressions follow a bivariate Gaussian distribution as the cells come from a homogeneous cell type.However, if we run the K-means clustering to divide the cells into two clusters, the two genes are forced to exhibit different distributions in the two clusters.