Single-cell RNA-Seq datasets
We here analyzed the following scRNA-Seq datasets:
Tabula Muris Senis (TMS): This mouse scRNA-Seq dataset [15] encompasses many different tissue-types with samples collected at 6 different ages: 1, 3, 18, 21, 24 and 30 months. Data object files were downloaded from figshare https://doi.org/10.6084/m9.figshare.8273102.v2
We used the normalized data as provided in the h5ad files. We focused on the lung-tissue 10X dataset, because it contained one of the largest numbers of immune-cell subtypes with good representation across age-groups including multiple mouse replicates.
Colon cancer development: This is a human snRNA-Seq dataset [18] encompassing colon samples collected from healthy individuals, normal samples from unaffected individuals with FAP, polyps from FAP and non-FAP cases, and colorectal adenomas. We analysed both the processed snRNA-seq data available from GitHub (https://github.com/winstonbecker/scCRC_continuum), as well as the full unprocessed data available from GEO (GSE201348). Processed data were stored as Seurat objects that included donor, disease stage and cell-type annotation information.
The EVA algorithm
Here we describe the recently published algorithm (EVA: Extended LouVain Algorithm) to identify homogeneous communities in a network with node attributes [26]. Let \(G=(V, E, A)\) denote a graph where \(V\), \(E\) and \(A\) are the set of vertices, edges and node (cell) attributes, respectively. Node/cell attributes can be categorical or numerical such that \(A\left(v\right)\), with \(v\in V\), identifies the set of cell attribute values associated with cell \(v\). In our applications, we will mostly consider one cell attribute (typically the cell’s age or perturbation state) but below we formulate the model for any number of node attributes. With EVA, the goal is to identify a network partition, i.e. a mutually exclusive set of communities/clusters, that maximizes a topological clustering criterion as well as node label homogeneity within each community. Thus, the measure we wish to maximize consists of two components: the modularity Q that measures the extent to which the partitioning captures clusters of high-edge density, and a purity index P that measures the homogeneity of the communities in relation to node attributes.
In more detail, the modularity Q of a partition quantifies the edge density within communities relative to that expected under an appropriate null distribution [27], and is defined by
$$Q=\frac{1}{2m}{\sum }_{vw }\left[{A}_{vw}-\gamma \frac{{k}_{v}{k}_{w}}{2m}\right]\delta \left({c}_{v, }{c}_{w}\right) \left(1\right)$$
where m is the number of edges, \({A}_{vw}\) represents the edge weight between nodes \(v\) and \(w\), \({k}_{v}=\sum _{w}{A}_{vw}\) and \({k}_{w}=\sum _{v}{A}_{vw}\) are the sum of weights of the edges linked to nodes \(v\) and \(w\), respectively. The Kronecker \(\delta \left({c}_{v, }{c}_{w}\right)\) function is 1 when nodes \(v\) and \(w\) are in the same community (\(c\)) and 0 otherwise. The resolution parameter \(\gamma\), which typically takes values in the range \(0<\gamma \le 1 ,\)controls the number and size of inferred communities with higher resolution values leading to a greater number of smaller communities. For \(\gamma =1\), Q can take values between 0 and 0.5 [12].
The purity index is defined by the average node label homogeneity (i.e. purity) over all inferred communities [26]:
$$P=\frac{1}{\left|c\right|}\sum _{c\in C}{P}_{c} \left(2\right)$$
where \({P}_{c}\) represents the purity of community \(c\). The purity of a given community (\(c\)) is defined as the product of frequencies of the most frequent node attribute values within the\(c\)
$${P}_{c}=\prod _{A}\frac{{max}_{a}\left(\sum _{v\in c}a\left(v\right)\right)}{\left|c\right|} \left(3\right)$$
where \(A\) is the set of node attributes, and \(a\left(v\right)\) is an indicator function for node \(v\) taking value 1 if \(a=A\left(v\right)\), 0 otherwise. \({P}_{c}\) attains maximum purity 1 when all nodes in the community (\(c\)) have same attribute value sequence. In the case of just one cell attribute, this corresponds to the case where all cells in the community have the same attribute value. Of note, P takes values in the range 0 to 1.
Finally, the EVA algorithm is defined by the optimization of a generalized modularity function Z
$$Z=\alpha P+\left(1-\alpha \right)Q \left(4\right)$$
where \(\alpha\) is the purity index parameter taking values also in the range 0 to 1. Of note, for \(\alpha =0\), Z = Q and we recover the Louvain algorithm if \(\gamma =1\) [13] (or a modified Louvain algorithm if \(\gamma <1)\). At the other extreme (\(\alpha =1)\), Z = P, and the clustering algorithm only cares about maximizing purity subject to network connectivity constraints. For a given \(\alpha\) and \(\gamma\), we optimize Z following the algorithmic implementation of Citraro and Rossetti [12].
The ELVAR algorithm
Building on EVA, we developed an algorithm and R-package called ELVAR for differential abundance (DA) testing in scRNA-Seq data. Specifically, the biological question being addressed by ELVAR is whether the proportion of a given cell-state (or cell subtype) changes in relation to some other factor or cell attribute (e.g. age, disease stage). Given the potentially high technical and biological variability, such DA-testing should ideally be carried out in scenarios where multiple replicates are available [4]. ELVAR is designed for complex scenarios where (i) the source of variation associated with the cellular states or subtypes is relatively small and (ii) where sample replicates are available. The ELVAR algorithm consists of 4 main steps that we now describe:
-
Construction of the k nearest-neighbor (knn) cell-cell graph: As input, the EVA algorithm requires a connected nearest neighbor cell-cell graph, as generated for instance using Seurat’s FindNeighbors function. The number of nearest neighbors k should be chosen sensibly in relation to the total number of cells ntot. For instance, we aim for a ratio ntot/k ~ 50, so that for a 1000 cell graph, the number of neighbors is 20. To run EVA, the input graph must also have at least one vertex/node (i.e. cell) attribute, which will be used when inferring communities in the graph. In our applications this main cell attribute will be age or disease status. In general, cells also have other attributes besides the one being used in the community-inference process. For instance, these additional attributes could be the sample replicate (individual/mouse) from which the cell derives, or a particular cell-state or subtype. To avoid confusion, we will call the attribute used in the clustering or community inference as the “main attribute”, whilst the attribute being interrogated in DA-testing as the secondary attribute.
-
Selection of purity index parameter a: Another important input to the EVA algorithm is the value of the purity index parameter a (called \(\alpha\) in previous section), which controls the relative importance of purity P over modularity Q when optimizing Z. Typically, we recommend running the EVA algorithm 50 to 100 times for different a parameter values ranging from 0.1 to 0.9 (the extremes a = 0 and a = 1 are not interesting), in order to assess how P, Q and Z vary as a function of a. It is also important to record the number of inferred communities, as this also depends on the value of a. Following the recommendation of Citraro et al [12], we choose a such that when compared to Louvain (a = 0) there is a clear increase in the purity P without much degradation to the modularity Q. Ideally, the number of inferred communities must also be higher compared to Louvain, although this is not absolutely necessary. Thus, although for a fixed a EVA works by optimizing Z, we do not choose the a value which maximizes Z, because for most real-world networks, Z will increase with a and will be maximal when a = 1. For typical knn cell-cell networks, the optimal a value is generally in the range 0.7 to 0.9. As far as the resolution parameter is concerned, we generally consider \(\gamma =1\), as this allows direct benchmarking to the original Louvain algorithm which is also defined for \(\gamma =1\).
-
Inference of enriched communities with EVA: Having inferred the optimal parameter a value, we now rerun EVA with this input parameter a value, to infer communities. The next step is to then identify those communities that are enriched for specific main attribute values. This is done for each main attribute value in turn, using a Binomial test with a stringent Bonferroni-adjusted P-value < 0.05/(number of communities * number of attribute values) threshold. Only communities enriched for specific attribute values are taken forward for further analysis. Thus, the purpose of this important step is to remove cells that don’t clearly define cell-states associated with the attribute value. Having found all enriched communities for a given attribute value, we then group together all cells from these enriched communities to define a cell-group per attribute value. The cells within a cell-group may derive from distinct sample replicates.
-
DA testing for secondary attributes: For each cell group associated with a main attribute value v, we next count the number of cells with any given secondary attribute value contributed by each replicate r, whilst also recording the total number of cells contributed by that sample replicate. For each secondary attribute value, we then run a negative binomial regression (NBR) of the cell count against the main attribute value (here assumed ordinal e.g. age-group or disease stage) with the sample replicate’s total cell count being the normalization factor. Mathematically, we model the cell count \({n}_{st}\) of secondary attribute value t and sample s, as a negative binomial (NB)
$${n}_{st}∼NB({\mu }_{st},{\phi }_{t})$$
where \({\mu }_{st}\) is the mean number and \({\phi }_{t}\) is the dispersion parameter. We further assume that \({\mu }_{st}={\mu }_{vrt}={f}_{vt}{n}_{vr}\), where \({f}_{vt}\) is the fraction of cells of type t when they have main attribute value v, and where \({n}_{vr}\) is the total number of cells derived from sample s (which has main attribute value v and r is the replication index). We next assume that the log of \({f}_{vt}\) is a linear function of v, so that the final regression is of the form
$$\text{log}{\mu }_{vrt}={\alpha }_{t}+{\beta }_{t}\text{log}{n}_{vr}+{\gamma }_{t}{v}_{vr}$$
$$\text{log}{\mu }_{st}={{\alpha }_{t}+\beta }_{t}\text{log}{n}_{s}+{\gamma }_{t}{v}_{s}$$
Thus, given the cell counts \({n}_{st}\) we simply run a negative binomial regression (NBR) against the two covariates \({v}_{s}\) and \(\text{log}{n}_{s}\), to find out if the cell counts vary significantly with the main attribute value v. The covariate \(\text{log}{n}_{s}\) plays the role of a normalization factor, accounting for the total of number of cells contributed by sample s. Wald z-statistics and P-values of association are obtained from this NBR.
Benchmarking against Louvain is done by direct comparison of these statistics. Because the original Louvain algorithm is deterministic, whilst ELVAR is not (in ELVAR optimization is performed in a non-sequential random manner), for benchmarking we perform 100 distinct ELVAR runs, comparing the distribution of Wald-test statistics to the Louvain-derived one with a one-sided Wilcoxon rank sum test. Of note, further below we also describe how we benchmark ELVAR to a non-sequential randomized version of Louvain, where the Louvain output may also differ between runs.
ELVAR pseudocode
Below is an outline of pseudocode for running ELVAR. We assume the scRNA-Seq data is encoded in a Seurat object seu.o with all the required meta-information, including a main cell attribute to be used in the clustering (clustering attribute “CA”), and a secondary cell attribute for DA-testing that we call “SA”. The first step is to normalize the data and to build the k-nearest neighbor cell-cell graph:
Step-1 (Normalization and construction of cell-cell graph):
-
seu.o <- FindVariableFeatures(seu.o,selection.method=”vst”);
-
seu.o <- ScaleData(seu.o,features = rownames(seu.o));
-
seu.o <- RunPCA(seu.o);
-
Elbowplot(seu.o); ### to determine number of significant PCs: topPC
The choice of k in specifying the number of nearest neighbors should be chosen sensibly. Typically the ratio of number of cells/k should be around 50, so assuming 1000 cells, k should be around 20:
-
seu.o <- FindNeighbors(seu.o,dims = 1:topPC,k.param = 20);
-
adj.m <- as.matrix(seu.o@graphs$RNA_nn); diag(adj.m) <- 0;
-
gr.o <- graph.adjacency(adj.m,mode=”undirected”);
-
vertexN.v <- names(V(gr.o));
-
vertex_attr(gr.o,name=”CA”) <- [email protected]$CA;
-
is.connected(gr.o); ### check graph is connected (if not, incrementally increase k).
Step-2 (Estimate optimal purity index parameter a):
Step-3 (Inference of enriched communities with EVA):
-
eva.o <- Eva_partitions(gr.o,alpha = aOPT,Vattr.name=”CA”);
-
comm.o <- ProcessEVA(eva.o,seu.o);
Step-4 (Do DA-testing of a secondary attribute with negative binomial regressions):
The object nbr.o is typically the output of the glm.nb function from the MASS R-package, and statistics of association between the SA (e.g. Cd4t activation status) and CA (e.g. age) attributes can be extracted using summary(nbr.o)$coeff .
Louvain with random non-sequential node selection (LouvainRND)
Because EVA is a natural extension of Louvain, it is natural to benchmark it against the ordinary Louvain algorithm as implemented by Blondel et al [13], which is also implemented in the igraph R package. The ordinary Louvain algorithm gives a deterministic network partition output, i.e. every run of the Louvain algorithm results in the same partition. This is because during an optimization run nodes are visited and assessed for local moving sequentially. On the other hand, EVA builds upon algorithmic details implemented in the Leiden algorithm [28], which results in potentially different partitions every time EVA is run. Specifically, as with the Leiden algorithm, during an optimization run in EVA, we consider random (non-sequential) selection of nodes, which can therefore result in a different partition every time EVA is run. Thus, when benchmarking EVA against Louvain, it is important to account for these implementation differences. We do this by also benchmarking EVA against a modified version of Louvain (called LouvainRND), where during an optimization run, nodes are visited randomly, which may also result in different partitions for different runs.
Simulation Models
The analysis benchmarking EVA against Louvain, was performed in the context of two simulation models. In the first case, we selected 200 classical monocyte cells from the TMS lung tissue scRNA-Seq 10X dataset, ensuring all cells derive from the same mouse (mouseID = 19) and thus from the same age. For 100 of these cells, we then modified their scRNA-Seq profiles, simulating a “perturbed” cell-state, as follows. We randomly selected 50 genes among all genes not expressed in any of the 200 cells. For each perturbed cell, we then randomly subselected 20 genes from these 50, whose values were then altered in the cell, by randomly drawing 20 non-zero expression values from the distribution of non-zero expression values of the whole data matrix. Thus, this procedure generates a weak but significant co-expression structure among the 100 perturbed cells. Seurat was then applied to the 20138 gene x 200 cell scRNA-Seq data matrix, with VST feature selection followed by PCA. Top-8 PCs were selected to build the k-nearest neighbor cell-cell graph using k = 6. Louvain clustering algorithm as implemented in igraph was used to infer communities. EVA was run on the same cell-cell graph using a cell’s perturbation state as cell-attribute. Since the EVA result depends on initialization, we performed a total of 100 runs for each of nine choices of purity index parameter a (a = 0.1, 0.2, …, 0.8, 0.9). The final value of a was chosen heuristically as the value at which purity increased compared to the Louvain solution (a = 0) without compromising modularity too much. The quality of the EVA and Louvain clustering was assessed using the Adjusted Rand Index against the cell’s perturbation state, as well as using Chi-Square statistics.
For the second simulation model, we selected all classical monocytes from mice with mouse-IDs 0 and 1 representing 1 month old mice (201 & 284 cells), mouse IDs 2 and 3 representing 3 month old mice (51 & 60 cells), mouse IDs 13 and 14 representing 21 month old mice (104 & 138 cells) and mouse IDs 21 and 22 representing 30 month old mice (94 & 61 cells). For young mice (1 and 3 month old mice), 25% of cells were perturbed using the same procedure described previously. For old mice (21 and 30 months), the frequency of perturbed cells was increased to 50%. Thus, this model simulates an age-related increased in a perturbed state. The cell-cell graph was derived as before, this time using the top 10 PCs and k = 20, i.e k was increased in line with the larger number of cells (n = 993). Louvain and EVA were run on this cell-cell graph, in this case using age as the cell-attribute information. As before, EVA was initially run a 100 times for each of nine choices of a parameter, in order to select an optimal a based on overall purity and modularity values. Using the optimal a value, EVA was then compared to Louvain in its ability to predict the increased frequency of the perturbed cell-state in the older mice.
Application of ELVAR to detecting shifts in lung tissue Cd4 + T-cell subtypes
As part of the 10X lung-tissue TMS set, a total of 551 Cd4 + T-cells were profiled to allow testing of a shift in naïve to mature subtypes with age. We removed cells from 4 mice each contributing less than 10 cells, leaving a total of 537 cells from 11 mice representing five age-groups: 143 (1m), 122 (3m), 67 (18m), 107 (21m) and 98 (30m). Cells expressing Lef1, a well-known marker of naïve Cd4 + T-cells [16], were defined as naïve (n = 186), the rest as mature (n = 351). ELVAR was then applied to determine if the naïve/mature proportions change with age. Cell-cell graph was constructed using Seurat with VST and 8 top PCs and k = 10. EVA was run a total of 100 times with a = 0.8 (the optimal value in this dataset).
Application of ELVAR to M1/M2 polarization analysis in lung alveolar macrophages
As part of the 10X lung-tissue TMS set, lung alveolar macrophages were abundantly profiled (n = 1261) to allow testing of a shift in M1/M2 macrophage polarization with age. We removed cells from mice displaying batch effects, leaving a total of 1124 cells from 15 mice representing five age-groups: 517 (1m), 184 (3m), 193 (18m), 91 (21m) and 139 (30m). In order to annotate these 1124 lung alveolar macrophages into M1/M2 subtypes, we first identified 5 robust murine M1 (Cd80, Cd86, Fpr2, Tlr2, Cd40) and 5 robust M2 markers (Egr2, Myc, Arg1, Mrc1, Cd163) from the literature [29]. In an initial annotation, we declared cells as M1 if they co-expressed at least 2 of the 5 M1 markers, and similarly for M2. Cells annotated to both M1 and M2 subtypes were re-assigned an undetermined (UD) category alongside all other cells not annotated to either M1 and M2, resulting in 308 M1, 195 M2 and 621 UD-cells. We reasoned that UD-cells clustering predominantly with either M1 or M2 cells could be re-assigned to M1/M2 subtypes. To this end, we developed an iterative algorithm that reassigns the status of UD-cells to either M1 and M2, depending on their relative proportions among the neighbors of a given UD-cell. In more detail, we used the cell-cell graph as inferred using Seurat, and a multinomial test with P < 0.05 threshold to identify the UD-cells whose polarization status could be reassigned to either M1 or M2 status. We also required the absolute difference between the proportion of M1 and M2 neighbors of a given UD-cell to be larger than 0.2. This procedure was iterated 20 times, but numbers already converged after 7 iterations, resulting in 464 M1, 214 M2 and 446 UD-cells. ELVAR was then applied to determine if the M1/M2 proportions change with age. Cell-cell graph was constructed using Seurat with VST and 9 top PCs and k = 20. EVA was run a total of 100 times with a = 0.7 (the optimal value in this dataset).
Application of ELVAR to colon cancer progression
We applied ELVAR to explore if the fractions of epithelial stem-cells and T-regulatory cells changes with disease progression. The analysis was performed in two ways. In the first approach, we downloaded Seurat objects from https://github.com/winstonbecker/scCRC_continuum which contain QC-processed data and cell-type annotations for a subset of samples. In the case of epithelial cells, the analysis was performed on a subset of the data consisting of stem-cells, TA2 & TA1 transit amplifying progenitors, enterocyte progenitors, immature enterocytes and differentiated enterocytes cell states. We randomly picked 1000 cells from each cell state (thus a total of 6000 cells) in order to reduce the computational runtime because two cell states contained ≥ 30k cells. Next, we removed cells from 2 donors each contributing less than 10 cells and cells from 1 donor displaying a batch effect, leaving a total of 5810 cells from 11 donors representing four disease stages: 1153 (Normal), 1672 (Unaffected), 2911 (Polyp) and 74 (Adenocarcinoma). The distribution of cell-states was 843 stem-cells, 971 TA2, 1000 TA1, 999 enterocyte progenitors, 998 immature enterocytes and 999 enterocytes number of cells. A cell-cell graph was constructed using Seurat with variance stabilization for feature selection, and selecting the 15 top PCs with k = 20. ELVAR was run a total of 100 times with a = 0.8 and disease stage as the main cell-attribute for community detection. For the analysis of T-regulatory (Tregs) cells we focused on the subset of data consisting of Tregs, NK, Naïve T, CD4 + and CD8 + cells. We removed cells from 3 donors each contributing less than 10 cells and cells from 1 donor displaying a batch effect, resulting in a total of 6171 cells from 8 donors representing four disease stage groups: 381 (Normal), 3721 (Unaffected), 1900 (Polyp) and 169 (Adenocarcinoma). The distribution of cells across cell-types was: 472 Tregs, 245 NK, 1042 Naïve T, 3063 CD4 + and 1349 CD8 + number of cells. The cell-cell similarity graph was constructed using Seurat with variance stabilization for feature selection, selecting the 15 top PCs and number of nearest neighbors k = 20. ELVAR was run a total of 100 times with a = 0.8 with disease stage as the main clustering attribute.
In the second approach, we downloaded the raw count snRNA-Seq matrices from GEO (GSE201348). Data was normalized using Seurat with variance stabilization for feature selection, leaving a total of 380,527 cells. Because the cell-type annotation for the full dataset was not provided, we applied dimensional reduction, clustering, UMAP visualization and well-known marker genes from Becker et al [18] to the cells from the normal samples only, to annotate well separated cell clusters into enterocyte, goblet, immune-cell, stromal and endothelial cell categories. We then used Wilcoxon tests and marker-specificity scores [20, 21] to build an mRNA expression reference matrix for these 5 broad cell categories. With this mRNA expression reference matrix in place, we then used our robust partial correlation framework [21] to estimate cell-type probabilities for all cells from all disease stages. Using a probability threshold of > 0.7, we were thus able to confidently annotate 1866 endothelial cells, 104009 enterocyte cells, 78421 goblet cells, 24973 immune-cells and 7941 stromal cells. Because of the very high sparsity of the snRNA-Seq data, in order to confidently identify stem-like cells among the 104,009 enterocytes, we applied our validated CancerStemID algorithm [22, 23] which approximates stemness of single-cells from the estimated differentiation activities of tissue-specific transcription factors. In this instance, we used a set of 56 colon-specific TFs and their associated regulons, already validated by us previously [23]. The regulons were applied to the snRNA-Seq data, to estimate transcription factor differentiation activity (TFA) for each of the 56 TFs across each of the 104,009 enterocyte cells. We then declared stem-cell like cells as those displaying average TFA levels over the 56 TFs less than a threshold given by the 5% quantile of the average TFA distribution defined over the normal cells only. For ELVAR analysis, we only retained samples contributing at least 50 cells. For all other samples, all cells up to a maximum of 500 randomly selected cells were chosen, resulting in a total of 31,385 cells, drawn from 69 samples (8 normal, 16 unaffected FAPs, 41 polyps and 4 CRCs), encompassing 3761 normal, 7443 unaffected, 18558 polyp and 1623 CRC cells. The cell-cell k = 50 nearest neighbor graph was constructed using Seurat, and ELVAR run a 100 times with a = 0.8, using disease stage as main cell attribute.
Comparison of ELVAR to DAseq and MiloR
We compare the three algorithms in their ability to detect proportions of cell-states as a function of a main cell-attribute. In our context this main cell-attribute could be age or disease-status, and cell-states could refer to Cd4t activation status, or differentiation stage. What the three algorithms have in common is the inference of groups of cells that display differential abundance relative to the main cell-attribute. The methods differ in how these groups of cells are inferred. In ELVAR, we use the main cell-attribute information when inferring cellular communities from the nearest neighbor cell-cell graph, subsequently identifying those that display enrichment for any specific attribute value (e.g. age-group). In contrast, DAseq [7] and MiloR [4] infer local regions, or potentially overlapping cellular neighborhoods, displaying significant DA of age-groups. Thus, one way to compare all three algorithms is by selecting the cells that appear in these significant communities/regions/neighborhoods, and subsequently running negative binomial regressions of cell-state counts vs age-group, taking biological replicates into account and normalizing for the total number of cells that each replicate sample contributes to each age-group, as described earlier for ELVAR.
To understand the difference in performance between methods, we developed the following metric. Methods may display different sensitivity to detect DA of a secondary attribute relative to the main one because the significantly associated cell groups derived from each method (i.e. age-group enriched communities in the case of ELVAR, age-associated neighborhoods/regions in the case of MiloR/DAseq) may capture different numbers of cells. To make this clear, consider a scenario where one of the methods (call it “X”) can’t detect a cell group with sufficient numbers of old cells, say it detects a cell-group with at most 10 old cells, with 6 of these belonging to one cell-state “A”, with the remaining 4 belonging to another cell-state “B”. In contrast, another method “Y” does infer a large enough cell-group consisting of old-cells, say 30 cells with 15 belong to state “A” and 15 belonging to state “B”. As far as young cells are concerned, all methods are able to infer a cell-group with a considerable number of cells, say 50 young cells, with 40 belonging to state “A” and 10 belonging to state “B”. Because method "X” was not able to identify a cell-group with sufficient numbers of old-cells, it lacks power to detect the relative decrease of state “A” with age (two-tailed Fisher-test P = 0.22), whilst method “Y” has the power to detect it (two-tailed Fisher-test, P = 0.006). Whilst this hypothetical example ignores the variation due to sample replicates or variations due to replicate cell numbers, it clearly illustrates that the fraction of captured cells (fCaptCells) per main attribute value will strongly influence a method’s power to detect DA of an underlying cell-state with respect to this main cell attribute. Mathematically, we define the fraction of captured cells per attribute value a and from method m by:
$${fCaptCells}_{ma}=\frac{\left|\left(\#Cells with attribute value=a\right)\cap (\#Cells in groups from method m)\right|}{(\#Cells with attribute value=a)}$$
Specifically, a method that attains higher fCaptCellsma across the whole range of attribute values a, including the extremes if the attribute is ordinal, will display higher power.