The Poisson distribution model fits UMI-based single-cell RNA-sequencing data

Abstract Background: Modeling of single cell RNA-sequencing (scRNA-seq) data remains challenging due to a high percentage of zeros and data heterogeneity, so improved modeling has strong potential to benefit many downstream data analyses. The existing zero-inflated or over-dispersed models are based on aggregations at either the gene or the cell level. However, they typically lose accuracy due to a too crude aggregation at those two levels. Results: We avoid the crude approximations entailed by such aggregation through proposing an Independent Poisson Distribution (IPD) particularly at each individual entry in the scRNA-seq data matrix. This approach naturally and intuitively models the large number of zeros as matrix entries with a very small Poisson parameter. The critical challenge of cell clustering is approached via a novel data representation as Departures from a simple homogeneous IPD (DIPD) to capture the per-gene-per-cell intrinsic heterogeneity generated by cell clusters. Our experiments using real data and crafted experiments show that using DIPD as a data representation for scRNA-seq data can uncover novel cell subtypes that are missed or can only be found by careful parameter tuning using conventional methods. Conclusions: This new method has multiple advantages, including (1) no needfor prior feature selection or manual optimization of hyperparameters; (2) flexibility to combine with and improve upon other methods, such as Seurat. Another novel contribution is the use of crafted experiments as part of the validation of our newly developed DIPD-based clustering pipeline. This new clustering pipeline is implemented in the R (CRAN) package scpoisson .


Background
Single cell RNA-sequencing (scRNA-seq) estimates the transcriptome at the individual cell level. ScRNA-seq can directly measure cell-to-cell heterogeneity, which is more challenging using bulk RNA sequencing. First applied in 2009 [1], scRNA-seq has become the preferred tool to identify cell sub-populations and to investigate cellular heterogeneity [2,3,4,5,6,7], gene regulatory networks [8,9], stochastic fluctuations in transcription [10,11], and so on. Due to the unique features of the data distribution in scRNA-seq, it's essential to develop statistical methods which accurately model scRNA-seq data for many important downstream analyses including differential expression analysis and clustering of cells.
Existing methods typically model the scRNA-seq data at the gene level for differential expression analysis to find biomarkers, and at the sample level for clustering of cells to find cell subtypes; however they typically lose accuracy due to a too crude aggregation at those two levels. This aggregation has led to attempts to explicitly model the apparent resulting zero-inflation or over-dispersion. We propose more precisely addressing these issues by modeling the distribution of each individual entry of the data matrix. A major challenge is that scRNA-seq data typically contain a large number of zero counts for gene/cell combinations (often exceeding 90%) [12]. This is due to both biological reasons that some genes are only expressed in a cluster of cells, and technical limitations such as low RNA capture rates, low efficiency library construction, cell disintegration and RNA degradation.
There also exists a severe threshold effect in detection sensitivity of gene expression in scRNA-seq. Typically higher expressed genes in a cell tend to have a higher probability to be detected [13,14,4,15]. These characteristics can lead to large discrepancies among sequencing libraries for different cells, i.e. batch effects, and render many global normalization approaches ineffective. Various approaches have been proposed to address barriers that limit the interpretation of scRNA-seq data [16,17,18,19,20,21,22,23]. On the "wet-bench" side, unique molecular identifier (UMI) was introduced [24]. UMI reduces biases introduced by the extreme signal amplification that is necessary for scRNA-seq. Some researchers have argued that if the UMI technology works properly, there is no need to account for zero-inflation [25,22,26]. This is an encouraging perspective; however, these classical probability models are again only crude aggregations focusing on either cells or genes. To improve the accuracy of statistical modeling and gain more precise inference, we propose the novel and principled approach of studying individual entries of the gene-by-cell matrix. This approach is based on the Independent Poisson Distribution (IPD) statistical framework, where every gene in each cell follows its own Poisson distribution. Working with such a model is challenging because the maximum likelihood estimate of each Poisson parameter is simply the corresponding count, which is too noisy to be useful. To solve this problem which presents for the validation of the IPD model we first start with several biologically homogeneous data sets derived from single clonal cell lines [27]. Next, we perform parameter for any cell group. For some data [28] this approach gives results similar to those using other pipelines. For others [29] it shows an improvement. Overall, for additional downstream tasks, the DIPD matrix is proposed as a new data representation (model departure).
In sum, the IPD statistical framework has the potential to capture meaningful biological properties at a higher resolution than prior normalization methods, without the need for more complicated probability distributions. We demonstrate the usefulness of model departure DIPD as a novel data representation by conducting downstream analysis, such as clustering of cells. Our newly developed DIPD-based clustering pipeline is validated in multiple experimental data. Another important contribution of this paper is the use of the novel method called crafted experiments for the comparison of the DIPD with other methods in a principled way. While we demonstrate the value of our proposed model departure data representation for clustering, we anticipate it will be useful for additional downstream tasks, such as differential expression analysis, gene set tests and trajectory analysis, because it provides a useful replacement of the conventional data matrix.

Validation of Poissoneity for scRNA-seq data
Poissoneity postulates that each matrix entry (gene by cell) comes from an independent Poisson distribution. As stated in Methods, the Poisson parameter for each matrix entry can be estimated using GLM-PCA [25]. However, the success of that estimation requires a good choice of the number of latent vectors L, which is generally quite challenging. The model validation context we consider here allows an unusual approach to that challenge. In particular, finding a value of L which gives a good fit of the resulting IPD model establishes its validity. That goodness of fit is quantified here using both Q-Q envelope visualization and formal hypothesis testing.
To study the Poissoneity of scRNA-seq data, we first explore the simplest case: cells picked at random from a clonal cell line processed as a single batch (Plate 3 [27]) with L = 10 (for the reasons given in section Methods). In Fig. 1, panels a  Another experiment has an equal mixture of two different cell lines (Plates 5A and 6A). In Fig. 2  Poisson departure data representation Here, we introduce a novel data representation (DIPD) based on a departure from the IPD. The initial step is based on a crude two-way parameter approximation, where variation across cells is modeled by a cell-level parameter, and variation across genes is modeled by a gene-level parameter (as defined in equation (3)).
This initialization step in itself does not appropriately account for cell heterogeneity. In the next step, the interesting cell structure is captured by departures from the naïve two-way approximation in both genes and cells, and the original count matrix is replaced by a Poisson departure matrix. In the departure matrix, each entry is quantified by the relative location of that original count with respect to the tentative Poisson distribution, whose parameter comes from the initial two-way approximation. The departure measure is captured by a Poisson Cumulative Distribution Function (CDF), which leaves the unexpectedly small counts nearly 0 and unusually large counts close to 1. Next, the departure measure is put on a more statistically amenable scale using the logit function. As a result, unexpectedly large counts give large positive values and unexpectedly small counts give large negative values. between the two cell lines. We will show that the DIPD-based data matrix outperforms Seurat normalized counts as a novel data representation in a later section. Cell type clustering based on Poisson departure A major application of this data representation is cell clustering using DIPD. This can be used directly as input into other algorithms. It also opens the possibility for a novel clustering algorithm, as illustrated in Fig. 4. This algorithm, referred to as Hclust-Departure, operates as follows: Starting with the UMI count matrix (UMI ), a very crude two-way parameter approximation (more details in Methods) is used to estimate Poisson parameters (Λ). Cell heterogeneity is not assumed at this step.
Next, each UMI count is replaced by the DIPD (D) measure from the naïve model.
This DIPD-based matrix serves as the input for the clustering step. Clustering with k = 2 is applied and the two-way approximation and DIPD-based data matrix is recalculated separately for each of the two subclusters. This process is repeated until (a) the split is no longer statistically significant; (b) the maximum allowable number of splitting steps is reached; or (c) any current cluster has less than 10 cells.
Statistical significance is calculated using Sigclust2 [30]. For a homogeneous cluster of cells, all the departure entries (D) are similar, and therefore Sigclust2 should not find significant clusters.
To investigate the performance of Hclust-Departure, we compared it with a commonly used package, Seurat (version 3.1.1) [31].

Single clonal cell line
First, homogeneous data from a single clonal cell line (Plate 3) is tested [27]. There are no known clusters. This data serves as a negative control because the cells have

Two cell lines, equal mixture
Combining the data from two clonal cell lines (Plates 5A and 6A) in an equal mixture provided a positive control, as the two cancer cell lines were from independent patients, but of the same lineage [27]. Hclust-Departure resulted in two clusters, consistent with the known cell lines. Seurat also identified two clusters under the default setting (resolution parameter 0.8) as expected (Fig. 3). Three cell lines, unequal mixture Next, we applied Hclust-Departure, to data comprised of three mixture cell lines, at a ratio of 1:3:6 [32]. Hclust-Departure identified three clusters (panel a in Additional file 2). Using the default setting, Seurat identified 7 clusters. By tuning the Seurat resolution parameter from the default 0.8 to 0.1, overfitting was resolved (panel b in Additional file 2) and both approaches identified the three biologically defined clusters.

Multiple cell lineages, unequal mixture
To increase the complexity of the data further, data from the lymphoid organs of a mouse [28] was analyzed. These represent the complex lineages and populations of the hematopoietic system: T and B cells, which mediate the adaptive immune response, as well as dendritic cells (DCs), macrophages, mast cells, etc., which mediate the innate immune response as well as red blood cells (erythrocytes). Within each of these broad classes, multiple subclasses are recognized.
The results are visualized using t-distributed Stochastic Neighbor Embedding (t-SNE) [33] and Uniform Manifold Approximation and Projection (UMAP) [34] in Fig. 5 panels a, c and panels b, d. Hclust-Departure (panels a and b) is used without dimensionality reduction or feature selection. Seurat (panels c and d) is applied using the top 2,000 most variable features as defined by default. The cell type labels are manually assigned to each cluster using known lineage markers.
The clusters discovered by Hclust-Departure are consistent with those identified by Seurat. Furthermore, Hclust-Departure identifies several significant subclusters within common Seurat labels (namely B-cells (light/dark green clusters), NK cells (light/dark gold clusters) and erythrocytes (light gray/black clusters)).

To evaluate the biological plausibility of the additional clusters identified by
Hclust-Departure, we identified differentially transcribed genes using the t-test (cluster size larger or equal to 30) or the Wilcoxon rank-sum test (cluster size less than 30) (Fig. 6). The genes colored in red are statistically significant after FDR adjustment (p < 0.05), and have a large mean difference. The genes colored in orange have a significant difference but the mean difference is small. Those colored in black do not differ among clusters. Known cellular identity-specific differentiation markers are annotated by name. Their difference in departure representation is consistent     parameters for those methods is necessarily heuristic, specific to each data set, and not necessarily reproducible or robust. By comparison, Hclust-Departure has no tunable parameters, other than the significance level and neither has Sigclust2.

Hybrid Approach: model departure and Louvain clustering
A key difference between Hclust-Departure and other pipelines is the actual clustering algorithm. We therefore combine the DIPD data representation with the Louvain algorithm as implemented in Seurat.
To validate this combination, we used a different, very complex and very well studied data set with known ground truth. These are the Peripheral Blood Mononuclear Cells (PBMCs) data sets defined by [29].  Table 1 shows the confusion matrix. We also explored the more advanced normalization method SCTransform [18]. This method uses the residuals from Negative Binomial regression with the default parameters maintained for clustering. The results do not differ from the default normalization and are included in Additional file 3. None of the pipelines is completely consistent with the FACS labels in identifying subtypes of T cells. This may be due to the limited accuracy of the algorithms or it may be due to FACS labels not correctly signifying the underlying biological complexity, as T cell differentiation can be very fluid. Finally, DIPD-based data representation combined with Louvain clustering performs better than any of the pure pipelines ( Fig. 7 panel d). The hybrid method correctly identifies the T cell subsets and subgroups of monocytes (red cluster). This result suggests that modeling UMI counts by departure from Poissoneity has advantages over other normalization/transformation methods independent of the particular clustering algorithm.
To further define the performance of the hybrid approach, different parameters were explored using either DIPD-based representation (D) or log-normalized data as input. These were (a) the number of principal components (15, 20, 25 or 30) Variation in library size (total UMI counts per cell) is a driver of non-relevant variation in scRNA-seq. To explore this issue we artificially magnified the library size and compared different data representations ( Fig. 9 panels a and b). As noted above, many pipelines use multiplication and scaling to adjust for the library size effects. This poses a problem for data containing many zeros. This experiment again used the Zhengmix4eq data. To model library size effects, cells with a large or small library size were perturbed to be even larger or smaller (see Methods for details). We compare data representations from DIPD (yellow), log-normalized

Discussion
We develop an alternative data representation, DIPD, for scRNA-seq data as well as a clustering algorithm based on this data representation. DIPD is applicable to scRNA-seq data that incorporates experimental UMI correction. With an appro- Poisson parameters within the framework (Fig. 1).
Implementing Sigclust2 in clustering provides an explicit hypothesis testing for each cluster, which avoids parameter tuning. A direct comparison of different data representations demonstrated that DIPD had an improved performance over conventional log-normalized data (Fig. 7, 8). A hybrid approach combining DIPD with the Louvain clustering algorithm gave the best performance (Fig. 9). Using all the data represented as model departure allowed for the detection of weaker signals compared to feature selection based clustering.
A limitation of this pipeline is computational speed because it uses the full feature set. Computational speed vs. the number of features to be included in the model represents a trade-off of any unsupervised learning approach. It is not specific to this data representation.
At this point, we have only begun to identify biological scenarios that favor this data representation over others. It is necessary to explore additional scenarios where the DIPD and Hclust-Departure show differences compared to other approaches.
This may identify properties of scRNA-seq data beyond over-dispersion and zero inflation.
The idea of departure based data representation could also be used for other data types based on other distributions, for example, the Assay of Transposase Accessible Chromatin sequencing (ATAC-seq) data based on Binomial distributions.

Conclusions
Most of the existing scNRA-seq analysis methods suffer from a too crude aggregation at either gene or cell level. We proposed shifting the focus from modeling counts to modeling probabilities and avoided the crude approximations by our IPD statistical framework. We investigated the validity of this model using some carefully designed experiments. As a result, we achieved improved cell clustering performance using a novel data representation based on departures from the estimated Poisson distributions without prior feature selection or manual optimization of hyperparameters. The idea of our DIPD as data representation can also be combined with other clustering methods, such as the Louvain algorithm implemented in Seurat.
This novel data representation is useful in better understanding the mechanism of scRNA-seq.

Data Description
The main performance of the Poisson independent framework for data representation is illustrated using multiple data sets representing different scRNA-seq cate-  . The data were pre-processed as described in [27]. Specifically, filtering was done such that each cell had greater than 5,000 total UMI counts and greater than 1,500 detected cellular transcripts. Only protein coding transcripts that were detected in more than 0.5% of all cells were retained. The data set used here had a total of 621 cells and 12,689 genes.
This carefully constructed data enabled us to validate the Poissoneity under different scenarios, i.e. different degrees of batch variation. The data are summarized in Table 2. For instance, Plates 1 and 2 were from the same cell line but performed on different dates (biological replicates); Plates 3 and 4 also used the same cell line, but were performed on the same date (technical replicates). They were expected to be more similar as technical variation is smaller than biological variation. Data labeled Plate 5A and 5B represent cells where the scRNA-seq libraries from the same cell was sequenced in two independent runs. Thus these were the most similar data sets. The only variation should be due to randomness from the Poisson distribution.
Plates 6A and 6B were from an entirely different cell line JSC-1, and were expected to give a radically different expression signature from the BCBL-1 cell line. Plate 8 investigated the impact of doubletons by intentionally putting two cells per well.

Three cell lines mixture data
This data set was generated from a mixture of three cell lines by 10X Genomics as in [41] and cleaned by [32]. There are three cell lines in this data set: human dermal fibroblasts-skin, breast cancer luminal epithelial cell line, and breast cancer basal-like epithelial cell line. These were mixed at a ratio of 1:3:6. The cell of origin label for each cell was retained. The data were pre-processed as discussed in [32].
This data set contains 2,609 cells with known labels and 21,247 genes.

PBMC data
This scRNA-seq data was generated using 10X Genomics originally from [42]. Cells

Multiple cell lineages data
This data set was based on a study by [28]. This study sampled mouse spleen tissue and obtained scRNA-seq data sets using the 10X Genomics platform. We used one of the mice (Sample A5) which is comprised of 1,476 cells and 12,822 genes. Seurat data cleaning and cell clustering by default parameters were used in the original report and provided computational cell type labels (more details in [28]).

Existing Methods
We first discuss the GLM-PCA algorithm, which is applied in parameter estimation for our assessment of the IPD framework. Then we give a brief review of the Seurat pipeline, for data pre-processing steps and cell clustering as an example for the state-of-the-art in RNA-seq data analysis.

GLM-PCA algorithm
GLM-PCA is an algorithm for computing an analog of PCA in the context of generalized linear models (GLM) (see [25] for details). A typical organization for a scRNA-seq data set is a matrix of counts, where columns denote cells (indexed by c = 1, 2, ..., C), and rows denote genes (indexed by g = 1, 2, ..., G). Let x gc denote one matrix entry, and let n c = g x gc denote the total counts for the cell c. The GLM-PCA calculation using the Poisson distribution treats the counts as a random variable: X gc ∼ P oisson(λ gc ), i.e.
where α g is a gene specific parameter, where ξ gl and ρ cl are factor scores and
It enables users to study the cell-to-cell heterogeneity from transcriptome data.
Seurat also integrates diverse types of single cell data sets (see more details in [23,44,31]). At each step in the computation pipeline, there are multiple hyperparameters to consider. These provide the users with flexibility, but are selected heuristically.
Recommendations for these parameters are arrived at empirically and are varied depending on the input data set. Here we briefly review the standard workflow as described in [28].
quality control: Genes with less than three positive counts overall were excluded; cells where the unique gene counts (the number of detected genes) were above 2500 or below 200 were excluded; cells with total mitochondrial gene counts greater than 5% of the overall total were excluded.
normalization by cell: The gene expression for each cell (x gc ) was divided by the cell total counts (n c ) and this quotient was multiplied by a scale factor of 10,000 (default).
transformation: The natural log transformation was applied.
feature selection: The standardized variance (more details in [31]) was calculated for each gene, and the top 2,000 (default) genes with the highest cell-to-cell variation were retained.
scaling: The expression of each gene was scaled to have a mean of 0 and variance of 1 across cells. A variation of standard scaling includes regularized negative binomial regression, which is called SCTransform [18].
linear dimension reduction: The data was represented by the first 15 principal components obtained by Euclidian PCA.
clustering: Cell clustering was done with a graph-based clustering approach using the Louvain algorithm and visualized using t-SNE or UMAP methods.

Novel Methods
In the following section, we describe the approach for assessment of the validity of the IPD statistical framework. We propose DIPD as a novel data representation, which is a measurement of the relative location of that UMI counts with respect to the independent Poisson distribution at the individual entry level. The cell heterogeneity can be better reflected at the scale of continuous possibilities than in the original scale with excess zeros. Therefore, we further develop a departure-based cell clustering algorithms to identify cell subpopulations.

Independent Poisson statistical framework
We work with scRNA-seq data with individual matrix entries through an IPD statistical framework, where each matrix entry (x gc ) is a UMI count indicating expression of gene g for cell c. In particular, we model that as a Poisson random variable X gc , which is independent over genes and cells. The Poisson probability function is given in equation (1).  33 33 In this framework, the maximum likelihood estimate of λ gc is the UMI count x gc , which is not useful because of the large amount of natural Poisson variation. This motivates combining information and one approach is the GLM-PCA algorithm.
The challenge to measuring the goodness-of-fit is that can not be done using only one data point. We approach this by aggregating matrix entries

Q-Q plot for small discrete counts
Visualization methods are useful for assessing the "Poissoneity" of scRNA-seq data.
In general, the Q-Q (Quantile to Quantile) plot provides a useful visualization for comparing two distributions. These distributions can be either continuous or discrete, and a common application is to compare a data set represented by its Cumulative Distribution Function (CDF), with a hypothesized probability distri- In the case of checking an empirical CDF against a potential theoretical model CDF, a useful device for understanding the natural variation in a Q-Q plot was the Q-Q envelope. This visualization identifies which observed aspects were important and which were artifacts of sampling variation. This idea modeled the hypothesized sampling process by simulating repeated samples of the same size from the candidate theoretical distribution, and overlaying the envelope of resulting CDFs (also using the idea of continuity correction). In the case of conventional Q-Q plots (shown as black dots in panel a Additional file 4), this gave a useless visual impression in low count discrete settings. But as seen in Results section, the continuity corrected Q-Q envelopes are very useful.

Over-dispersion test
In the case of the Poisson distribution, an insightful test was the dispersion test. An important property of the Poisson distribution was the mean equals the variance.
However, many mixtures of Poisson, such as the Negative Binomial, have a variance which was larger than the mean, called over-dispersion.
This test is conducted using the dispersion test from the R package AER (v1.2-9; [46])

Zero-inflation test
A much different departure from the Poisson that can arise in certain applications was zero-inflation, where the number of observed zeros was larger than the expected number of zeros. We implemented this test with the R package vcdExtra (v0.7-5; [47]), which was based on a score test proposed by [48] using a test statistic with an asymptotic Chi-square distribution.

Model departure as data representation
Again, from our IPD framework, each gene expression measurement for each cell (i.e. each matrix entry) comes from an independent Poisson distribution with parameter λ gc . A naïve starting point for the application of that framework is viewing cell and gene differences in a purely additive way, i.e. a two-way approximation, expressed as where g indexes gene and c indexes cell. Of course, there is much richer biological structure beyond this, which we will represent in terms of departures from this approximation of each matrix entry.
Fitting of a simple two-way approximation The model (3) is fit to the data using maximum likelihood. In order to make parameter estimation identifiable, restrict that g e αg = G and c e βc = C.
There is a closed solution, which is: It's straightforward to prove that the first derivative at parameter estimates defined above are all zero.
We used the above two-way approximation as an initial model, which gave a first order approximation of both library effects and also gene by gene variation.
Phenomena, such as cell clustering, were effectively captured by studying the departure from that first order approximation. In other words, features of interest were captured by the difference between the observed UMI counts and the counts expected from the two-way approximation. In particular, the matrix entries that showed significant departure played an important role in cell clustering. The key idea of our departure representation of scRNA-seq data is to replace each count x gc by a number that reflects how well it is explained by the Poisson distribution from the simple two-way approximation. Clustering such numbers is effective at finding structure beyond the two-way fit, such as discriminating cell types. We started by representing departure in terms of where the given count x gc lay in the P oisson(λ gc ) distribution. A naïve approach to this would be to use the UMI count x gc in the CDF of the P oisson(λ gc ) distribution, i.e. F (x gc ;λ gc ) = P (X ≤ x gc |λ gc ). While this probability was very effective (i.e. probabilities close to zero or close to one indicate a strong departure) for large values ofλ gc , it was less effective for small values ofλ gc , because the probability had a lower bound of P (X = 0|λ gc ) = e −λgc ≈ 1 (as often encountered in scRNA-seq data). This problem was caused by the conventional CDF representation as P (X ≤ x). While it was typically not done, CDFs could also be represented as P (X < x), which for our purposes goes too far in the other direction (P (X = 0|λ gc ) = e −λgc ≈ 0). Hence, we chose to use the average form of the CDF, i.e.
By doing this, our representation of unexpectedly small UMI counts was nearly 0 and unexpectedly large UMI counts was close to 1.
Another consequence of the generally skewed shape of the Poisson distribution (at least for small values ofλ gc ) was that these probabilities tend to be quite asymmetric at the two ends of the distribution. A straightforward device for more balanced treatment of the departures from the Poisson fit was to take the matrix entries to be the logit transform of these CDF based probabilities: Since exactly 0 and 1 were not allowed for the logit transformation, set any matrix entries withF (x gc ;λ gc ) below 10 −10 as logit(10 −10 ), andF (x gc ;λ gc ) above (1 −

Cell clustering algorithm
The proposed clustering starts with the DIPD-based matrix computed for the complete data set. Hierarchical clustering using Euclidean distance and Ward´s linkage is recommended from a top-down viewpoint. At each step, we re-calculated the two-way approximation again within each subcluster, and the potential for further splitting is calculated using Sigclust2 [30], a method to assess statistical significance at each split based on a Monte Carlo simulation procedure. A non-significant result suggests cells are reasonably homogeneous and may come from the same cell type.
In addition, to avoid over splitting, we further require setting a maximum allowable number of splitting steps J (default is 10, which leads to at most 2 10 = 1024 total number of clusters) and minimal allowable cluster size S (the number of cells in a cluster allowed for further splitting, default is 10) beforehand. Thus the process was stopped when any of the conditions was satisfied: (1) the split was no longer statistically significant; (2) the maximum allowable number of splitting steps was reached; (3) any current cluster had less than 10 cells. This process was done in a recursive way. Algorithm 1 and Fig. 4 outline the procedure using hierarchical clustering in a recursive way based on departure representation.
We do not need to set the number of clusters beforehand. Thinking of the number of clusters in a multi-scale way as in [32], a coarser scale clustering can be obtained by stopping the clustering process at any stage in between. hclustDepart(dat j = dat 1j , j = j + 1) 11 hclustDepart(dat j = dat 2j , j = j + 1)

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable. The mouse multiple cell lineages data is available at the Gene Expression Omnibus (GSE148796).

Code availability
R code used to demonstrate the fit of our IPD statistical framework and perform clustering using Hclust-Departure is available as an R (CRAN) package and can be accessed from https://cran.r-project.org/web/packages/scpoisson/index.html.

Competing interests
The authors declare that they have no competing interests.  Visually, both data representations effectively demonstrate the differentially expressed genes among the three cell lines. However, highly expressed genes within single cells, as depicted by the bright red spots, may potentially play a role in clustering but many are filtered out by Seurat.
Additional file 3.pdf file The UMAP plot visualizing the clustering performance in the Zhengmix8eq data set [29] using Seurat SCTransform     variable genes kept by the Seurat pipeline. Visually, both data representations effectively demonstrate the differential expressed genes between the two cell lines. However, highly expressed genes within single cells, as depicted by the bright red spots, may potentially play a role in clustering but many are filtered out by Seurat. Figure 4 The Hclust-Departure cell clustering workflow. Hierarchical clustering is performed using Euclidean distance and Ward´s linkage in a recursive way. consists of n=1,476 cells from [28]. The top two panels (panels a, b) were based on Hclust-Departure using model departure as data representation. The bottom two panels (panels c, d) were labeled by cell types from the Seurat analysis of [28]. The clusters discovered by Hclust-Departure are consistent with those identified by Seurat. Furthermore, Hclust-Departure identifies several significant subclusters (namely B-cells, NK cells and erythrocytes).   Figure 7 The UMAP plots comparing clustering performance in the Zhengmix8eq data set [29] using different data representations and clustering methods. Panel a displays the FACS labels we used as a benchmark to measure clustering performance. Both the Seurat pipeline (panel b) and our Hclust-Departure pipeline (panel c) correctly identify the distinct cell types but fail to distinguish the subtypes within the T cells. Panel d uses the DIPD-based data matrix as data representation combined with Louvain clustering, which is a more direct comparison with panel b since 15 PCs and a resolution of 0.8 are used in both cases. It improves the original Seurat clustering performance by better distinguishing T memory cells from T helper/regulatory cells. proportions, c and f) data sets.         Supplementary Files This is a list of supplementary les associated with this preprint. Click to download.