Statistical inference from stem cell barcoding data using adaptive approximate Bayesian computation

Background: Barcodes that can be supplied to cells by transduction of a library of unique DNA sequences allow identiﬁcation of heterogeneity in cell populations and lineage tracing applications. Estimation of the number of hematopoietic stem cell (HSC) clones is important since it also allows to approximate the number of hematopoietic stem cells from which the circulating blood cells descend. This problem is similar to the species problem, well-known to ecologists. However, an additional ”degree of freedom” exists, since diﬀerent HSC generally give rise to clones with diﬀerent growth rates. This adds credibility to sampling models based on diﬀerent versions of Dirichlet-multinomial distributions. Results: We developed a truncated population approximate Bayesian computation (ABC) algorithm which is derived from sequential Monte Carlo ABC (SMC-ABC) and applied the method to the symmetric Dirichlet-multinomial model proposed by Zhang et al. (2005) and asymmetric Dirichlet-multinomial model we proposed. Methodology was tested using simulated and real-life data. Conclusions: Results suggest that ﬂexibility of the asymmetric Dirichlet-multinomial helps to obtain insight into heterogeneity of proliferating cell systems such as HSC. Estimates based on experimantal data approach the correct count of murine HSC.


Background
In recent years, next generation sequencing (NGS) has enabled new methods to study cells at the single-cell level, including single-cell DNA and RNA sequencing. Cells can be also traced using NGS paired with DNA barcodes in cells that are either endogenously present or exogenously supplied. Here, we are concerned with barcodes, which can be supplied to cells either by transduction of a library of unique DNA sequences, thereby allowing identification of heterogeneity in cell populations and lineage tracing, or by use of an inducible transposon which allows achieving a DNA barcode without removing cells from their natural environment. In this latter case the DNA sequence flanking a random site of transposon integration provides the barcode.
The use of barcoding systems for single cell studies stimulated interest in the development of statistical tools to analyze the data produced by these systems. The diversity of clones that can be studied is limited by the range of barcodes that can be generated and subsequently detected. We study here the exogenous barcoding system as published by Sun et al. (2014) and specifically undertake estimation of the number of the hematopoietic stem cells (HSC) based on distribution of barcodes in peripheral blood cell, which are HSC descendants. Our findings can be applied to other barcoding experiments and moreover, since the problem is similar to the one with which the classical "problem of species" is concerned, to ecological applications to estimate the true species diversity based on species frequencies in a field study.
As an informal introduction, let us consider an example taken from Sun et al. (2014) (Fig. 1). We would like to estimate the number of existing HSC based on experimental single-cell data. Among the transposon tags (Tn) detected in granulocytes analyzed in a single plate, some of the tags were found in single granulocytes, some of them appeared in two cells, and fewer tags were observed three times or more. In the experiment , 36 weeks after barcoding all cells (including all HSC) in the experimental animal, tag 187 (numbering is arbitrary) was observed three times, tags 188, 189, and 190 were observed twice, the rest of the tags were only observed once. The 36 weeks were allowed to complete the replacement of all the blood cells by descendants of the barcoded HSC. In a plate of wells holding individual granulocytes, tag 187 (numbering is arbitrary) is observed three times, tags 188, 189, and 190 are observed twice, the rest of the tags are only observed once. The standard plate has 96 wells, but in some of them (7 in this case) tags were not identified.
Suppose that we detect w unique Tn barcodes among n granulocytes being descendants of K barcoded HSCs, and n i is the number of granulocytes with tag i, while f j is the number of different Tn tags present in j cells. Then n i and f j satisfy the following constraints: where n i -s are the non-zero n i -s. We will only use the latter, without causing confusion. An important question is the development of an appropriate sampling model for observations relying on a sample of descendants of the HSC we try to count. If each of the K HSC were producing a clone of large size, the sample of n descendants would constitute a multinomially distributed vector (n 1 , . . . , n K ) ∼ M ult(n; K −1 , . . . , K −1 ). ( If n K, then most of the entries of the vector (n 1 , . . . , n K ) are 0-s. However, an additional "degree of freedom" exists, since different HSC generally give rise to clones with different growth rates. In such situation, the vector may be conditionally multinomial with random coefficient vector θ = (θ 1 , . . . , θ K ). Because of the duality between multinomial and Dirichlet distributions, it seems natural to employ this latter for the coefficients. In this paper, we examine models involving symmetric and asymmetric Dirichlet distributions, and their use to obtain the posterior distribution of the number of barcoded clones K, given data, under adaptive priors. Approximate Bayesian Computation (ABC) is a class of methods that apply for complex systems where likelihood function is intractable or impractical due to computational cost. The term ABC was first introduced by Beaumont et al. (2002). In ABC, the likelihood is approximated by a large number of simulations based on the acceptance-rejection. The algorithm first specifies a threshold value which indicates the required agreement level between observed and simulated data, and a prior distribution π(θ). We draw parameter θ from the prior distribution and generate simulated data y given θ according to the model f (y|θ), and determine the distance d(y, y obs ) between simulated and observed data. The algorithm accepts the parameter θ if d(y, y obs ) < . If the data is high dimensional we may reduce the dimension by using summary statistics s(y) rather than the vector y. The summary statistics s(y) should be sufficient for θ however frequently we cannot identify a sufficient statistics. Use of a non-sufficient statistic introduces more 'approximation' to the result.
In this paper we present an adaptive truncated-population ABC method based on sequential Monte Carlo (SMC) ABC introduced by Sisson et al. (2007) and Toni et al. (2008). We estimate parameters of the asymmetric Dirichlet-multinomial model for the barcoding problem by ABC and a proposed adaptive ABC method. We also evaluate the performance of the adaptive method.
2 The first iteration t = 1 is the pilot run. In the pilot run, for i = 1, · · · , N the algorithm generates the posterior distribution π 1 (θ i |y) using initial prior distribution p(θ i ) until the corresponding distance satisfies the acceptance criterion dist(y, y obs ) < 1 , with 1 being the first tolerance threshold. In this way we accept N estimated parameter values and we set an improper importance weight w 1 = Dir(1, 1, · · · , 1) with dimension N for each accepted parameter in the first iteration. 3 For iteration t = 2, · · · T, we truncate the posterior distribution π t−1 obtained in iteration t−1 to the interval (θ min , θ max ), where θ min and θ max may be chosen as for example the α and 1 − α quantiles of π t−1 . In this way we can make sure new parameters will be sampled from a more 'concentrated' range thus expedite the algorithm. The algorithm also perturbs the parameter values by using kernel function K t (e.g. a random walk distribution kernel, a uniform kernel or a Gaussian kernel) and obtains parameters from the kernel function. The truncated distribution under perturbation kernel K t is K t (θ t |θ t−1 ) and each parameter value is assigned with normalized weights w t and kernel distribution function to smooth the distribution as well as favor parameters that are closer to the true posterior. We propose that for each iteration t the improper importance weight w t , which is a Dirichlet random weight with size N defined as w t = Dir(1, 1, · · · , 1) for each accepted parameter. 4 For each iteration t similarly we draw θ from truncated and perturbed parameters values with w t until we accept N parameter values according to the corresponding t chosen for the corresponding iteration.
We employ the distance metric dist(y obs , y sim ) = (y obs − y sim ) 2 and we run the adaptive truncated-population ABC strategy to find target π(θ|y), N = 1500 parameters values are accepted in each iteration. For iteration t the algorithm uses a uniform random walk kernel with its variance equals twice the empirical variance of the parameter values accepted in the previous iteration t − 1. Figure 2 depicts six iterations of the adaptive truncated-population ABC algorithm given the Gaussian mixture model. 'Observations' are sampled from the 'true' posterior (black lines in Fig. 2). The algorithm performs satisfactorily on this 'toy' example.
Symmetric Dirichlet-multinomial model for count data As Zhang et al. (1997) and Boender and Rinnooy Kan (1987) suggested, in this section we apply a generalized multinomial model for estimation of the size of DNA barcoding population, with its sample value denoted as K. Let Y i be the number of barcodes that are observed i-times in DNA sequencing results, when a sample n is taken from the population. Assume that the number of vector count Y = (Y 1 , Y 2 , · · · , Y K ) follows a multinomial distribution with probability (θ 1 , θ 2 , · · · , θ K ): Dirichlet distribution can be applied to biological partition problems due to its exchangeability. Pitman (1996) considers species sampling problem by applying a Dirichlet random measure in his book. For DNA barcoding data, assume that the sampling process is independent and consider a symmetric Dirichlet distribution as the prior distribution for relative frequencies (θ 1 , θ 2 , · · · , θ K ) with parameter α : Assume the prior distribution for K is uniform. Then according to Bissiri et al. (2013) the posterior distribution of K given w and n is as follows: or which indicates that w is sufficient statistics for K if α is known. The posterior mode, which is also the Maximum A Posteriori (MAP) estimator, can be derived from this formula. Since in practical data analysis we do not know the value of α, the MAP estimator of K is thus not possible to compute. The barcoding sampling problem can be treated as the joint estimation of both parameters K and α. Zhang et al. (2005) implemented MCMC algorithm to realize the estimation of both parameters, however the posterior of K relies heavily on its prior distribution. In order to minimize the effect of how we choose the prior, we apply the adaptive ABC approach to analyze the barcoding sampling data.
Asymmetric Dirichlet-multinomial model for count data In the barcoding experiment, a biologically feasible assumption is likely that some of the barcoded HSC divide and differentiate abundantly whereas other remain dormant. In order to account for heterogeneity and variation within different clusters in barcodes, we then consider an asymmetric Dirichlet-multinomial model of cell probabilities θ = (θ 1 , θ 2 , · · · , θ K ) with positive real parameters α = (α 1 , α 2 , · · · , α K ), which has the probability density function: We can also rewrite this formula as for each barcode, the cell frequencies θ follow an asymmetric Dirichlet distribution with parameters qγ = α: where Z(qγ) is the normalizing constant in the Dirichlet distribution, and γ serves as a concentration parameter for q, which can reflect the abundance level within the clusters of barcodes. Now we specify the hierarchical Bayesian model for barcodes modeling as follows: first we take the model for the counts of observed species y to be multinomial, then in the second stage, θ|K, q, γ ∼ Dir(qγ).
We also assume that in order to make a complete Bayesian analysis. Without any prior knowledge we assume that α and γ are independent and from uniform distributions. The resulting hierarchical Bayesian model for multinomial count data is: Adaptive truncated-population ABC for Dirichlet-multinomial model Assume that K, α (and γ) are drawn from uniform distributions, then according to approximate Bayesian computation algorithm: 1 Compute the summary statistics of observed dataset s obs , and choose s = (w, f 1 , f 2 ) as summary statistics. Draw parameters from prior distributions of α, K (and γ). 2 Simulate the Dirichlet-multinomial distribution and compute summary statistics s sim . 3 For each iteration, accept parameter values if weighted L 2 distance: dist(s sim , s obs ) ≤ , then accept parameters.

Choice of summary statistics and stopping rule
The Dirichlet-multinomial model depends on w as well as vector counts f = (f 1 , f 2 , · · · , f n ). Considering that the nonparametric estimator from Chao (1984) is based on (w, f 1 , f 2 ), we also choose s = (w, f 1 , f 2 ) as the summary statistics.
Much research is carried out on how to determine the stopping criterion in an adaptive ABC algorithm (Simola et al. 2020). In this paper we apply the following simple criterion: starting from the fourth iteration, if a certain α quantile θ αt = (θ αt1 , θ αt2 , · · · ) for example when α = 10% of the posterior distribution in iteration t satisfies 0.9 ≤ θαti θ α(t−1)i ≤ 1.1 for each component i in the parameter vector, then the algorithm is stopped at iteration t and we assume that the posterior is stable.

Results
In order to address the poor-concentration issue caused by informative priors, an adaptive Bayesian method is applied in the following section. The adaptive ABC method is intended to improve the problems resulting from choosing uniform priors. A general summary of results is found in Table 1 Table 1 Summary of results of estimated K from six simulation studies. The true α value is not listed in Simulations 3 and 4, since the interpretation of parameter α in the symmetric model is different from that in the asymmetric model. Let us note that the crucial parameter K is correctly estimated also in these simulations.

Simulation study I
We simulate the species count data under symmetric Dirichlet-multinomial model with true K = 2000 and α = 1. We also set initial uniform prior U(0.01, 8) for α and U(100, 6000) for K. 4 iterations are performed and results are presented in Figure 3(a). The final posterior gives K M AP = 1967, with 95% credible interval (1708, 2571) and α M AP = 0.87, with 95% credible interval (0.55, 1.64). We also present the joint density plot of K and α (see Figure 3 (b)). Interestingly, the joint credible region for K and α is hyperbolic ("banana") in shape, indicating trade-off between the estimates of K and α. This effect might be influenced by sampling from uniform priors.

Simulation study II
We simulate barcoding data from symmetric Dirichlet-multinomial model with true K = 8000 and α = 4. The uniform prior for the first iteration is U(1, 10) for α, and U(2000, 12000) for K. 5 iterations are performed and results are presented in Figure 4(a). The final posterior gives K M AP = 7817 with 95% credible interval (7424, 8360) and α M AP = 4.4, with 95% credible interval (2.6, 5.9). We also present the joint density plot of K and α (see Figure 4(b)) and we also notice the hyperbolic ("banana") in shape, indicating trade-off between K and α.

Simulation study III
In order to check the sensitivity of the adaptive-prior ABC algorithm to model mis-specification, we also apply the algorithm involving the asymmetric Dirichletmultinomial model to the barcoding experiment data simulated from symmetric Dirichlet-multinomial simulation with true K = 2000. 4 iterations are performed. The asymmetric Dirichlet-multinomial model gives estimate K is 1982 with credible interval (1718, 2333), which covers the true value. Posterior distributions from the last iteration are presented in Figure 5(a). We also present the joint density plots of K and α and γ in Figure 5(b).

Simulation study IV
We also apply the asymmetric Dirichlet-multinomial model to the simulated barcoding data from symmetric Dirichlet-multinomial simulation with true K = 8000. 6 iterations are performed. The asymmetric Dirichlet-multinomial model gives estimate K of 8545 with credible interval (7960, 9019). Posterior distributions from the last iteration with kernel densities presented in dash lines in Figure 6(a). We also present the joint density plot of K and α and γ in Figure 6(b).

Simulation study V
We also apply the asymmetric Dirichlet-multinomial model to the simulated barcoding experiment data from asymmetric Dirichlet-multinomial simulation with true K = 10000, γ = 3000 and α = 2. 6 iterations are performed. The estimated K is 9412 with credible interval (8533, 10198). The estimated γ is 2935 with credible interval (2822, 3091). The estimated α is 2.3 with credible interval (1.8, 3.2). Posterior distributions from the last iteration are presented in Figure 7(a). We also present the joint density plot of K and α and γ in Figure 7(b).

Simulation study VI
We apply the asymmetric Dirichlet-multinomial model to another simulated barcoding data from asymmetric Dirichlet-multinomial simulation with true K = 6000, γ = 2000 and α = 4. 5 iterations are performed. The estimated K is 5860 with credible interval (5582, 6088). The estimated γ is 2036 with credible interval (1973,2088). The estimated α is 4.43 with credible interval (3.36, 5.36). Posterior distributions from the last iteration are presented in Figure 8(a). We also present the joint density plot of K and α and γ in Figure 8(b).  Figure 9(a). We also present the joint density plot of K and α and γ in Figure  9(b).

Conclusions and Discussions
In this article we consider statistical problems related to the use of new technologies in hematology, specifically of cell barcoding. This technique is used primarily to label a high percentage of cells in the organism with unique labels, being short DNA sequences integrated randomly in cell genome. As mentioned, at the time of barcoding, the identification is unique, however with time progeny of barcoded cells that proliferated, inherit the label from their parents. At the time when the sample is ascertained, some cells will have identical barcodes. The number of these different clones, reflect the number of labeled cells that proliferated. However, when cells are sampled, it is possible to characterize only a limited count n of them, usually of the order of several hundred. The number K of originally labeled cells of interest, such as for example the hematopoietic stem cells (HSC) is of the order or 10 4 or higher.
The typical problem a biologist wishes to deal with (see Background and Fisher et al. (1943)) is to estimate K given the sample of size n and more specifically given the count w and frequency f of the clones represented in the sample. This problem is similar to the well-established species-problem in theoretical population biology (see Zhang and Stern (2009) and Zhang (2007) (1984)) such as the Chao's lower bound, and Bayesian (Efron and Thisted (1976) and Zhang and Stern (2005)). In this paper, we designed and employed an adaptive Approximate Bayesian Computation algorithm based on symmetric Dirichlet-multinomial and asymmetric Dirichlet-multinomial models. The multistage algorithm, employs truncation of the intermediate posterior distributions, followed by kernel smoothing and re-weighting, to counteract the effects of non-informative priors. We used simulations, to demonstrate the convergence of the the algorithm under different scenarios, scaled to be similar to the typical experimental data. In addition, we applied the algorithm to previously published HSC barcoding data, showing its consistency with previous results.
We notice that the asymmetric Dirichlet-multinomial model can be seen as the generalized form of symmetric Dirichlet-multinomial model that addresses the heterogeneity issue and the asymmetric Dirichlet-multinomial is applicable for population size estimation of simulated data from symmetric Dirichlet-multinomial model. The use of Dirichlet distribution with flexibly adjustable parameter α is a generalization of previous models, which used α = 1 (Boender and Kan, 1987) and α → ∞ (Sun et al., 2014). It is interesting to notice that the α-value estimated from HSC data fluctuates between 3 and 4, which indicates heterogeneous clone size, however less so than if the multinomial parameters were sampled from uniform distribution on a simplex as it was proposed in Boender and Kan (1987).