Distinct characteristics of correlation analysis at the single-cell and the population level

Background: Correlation analysis is widely used in biological studies to infer molecular relationships within biological networks. Recently, single-cell analysis has drawn tremendous interests, for its ability to obtain high-resolution molecular phenotypes. It turns out that there is little overlap of co-expressed genes identified in single-cell level investigations with that of population level investigations. However, the nature of the relationship of correlations between single-cell and population levels remains unclear. In this manuscript, we aimed to unveil the origin of the differences between the correlation coefficients at the single-cell level and that at the population level, and bridge the gap between them. sampling and normalization procedure should be done before we draw any conclusions.


Background
Correlation analysis is widely used to identify closely related components and interactions within biological networks [1][2][3][4]. Traditionally, cellular behaviors were investigated by experimental measurements derived from bulk-averaged assays [3]. In recent years, cellular heterogeneity within genetically identical populations has been well-recognized and there has been an explosion of interest in single-cell analysis [5,6]. Comparative analysis of co-expressed genes between multi-levels suggested that correlation profiles obtained with single-cell data are not always consistent with those obtained with bulk-sample measurements [7][8][9]. Strikingly, in a previous study, researchers compared the glioblastoma expression profiles between bulk-tissue and single-cell data, and the result suggested that less than 10% of the co-expressed genes were shared, while a majority of gene pairs were highly correlated either at the bulk-tissue or the single-cell level [7].
Previous population-averaged measurements, such as western blotting, HPLC and microarray, were able to capture numerous protein-protein interactions within signaling pathways or regulatory networks [10]. Still, co-expression analyses at the single-cell level, which take intercellular variability into consideration, could give discrepant correlation coefficients [7,11]. Studies suggested distinct co-regulatory mechanisms underlying the co-expression relationships in the bulk samples and the single cells [7,12]. However, it remains to be fully addressed why a correlation based on individual-level (single-cell) data could be much stronger or weaker than those based on population-level (bulk-sample) data. Does a weak correlation indicate no interaction between two components? Which level of correlation is more robust and less affected by technical artifacts?
In this study, these fundamental questions were addressed by a bottom-up approach [13]. We bridged the gap between the single-cell level and the population-level correlations by deriving formulations, illustrating that their relationship depends on the relative variations or correlations within populations. Further, we developed mathematical models to mimic signaling cascades or multi-regulated gene expressions. Interestingly, the results indicated that the correlation between two components in the networks could be weak at the single-cell level, though a strong correlation existed at the population level. Thus, a strong correlation at the population level cannot demonstrate a strong correlation at the single-cell level, and connections within the biological networks should be clearly noted whenever the components are correlated at either level.

Aggregated correlation could be stronger, weaker or equal to individual correlation
The relationship between correlations across aggregated and individual levels has been studied in sociology [14]. The derived formulations could also be applied to biological analyses to describe correlations at the population and single-cell level. Definitions of the components of the formulations are shown in Table 1. Correlation between x and y (individual correlation) Correlation between u and v (aggregated correlation) Correlation between ix and iy (Correlation within) Here, we calculated and compared the Pearson's correlation coefficients between the single cell and the population level. Assuming ≠ 0, the ratio of the aggregated correlation to individual correlation was defined by Equation 1 (Supplemental materials).
The relationship between correlations at the single-cell and the population level depends on values of the standard-deviation-within relative to the aggregatedlevel standard deviations ( and ) and the correlation-within relative to the individual correlation ( ). Depending on varied values of these components, aggregated correlation could be stronger, weaker or equal to individual correlation ( Figure 1). Notably, when the correlation-within ( ) is weaker than individual correlation ( ), or the signs of correlation-within and individual correlation differ from each other (one is positive and the other is negative), the correlation at the population level is stronger than it at the singlecell level ( Figure 1C and Supplemental materials).

Interactions could not be excluded based on weak correlations at the single-cell level
Correlation analyses are widely used in discovering functional modules and exploring biological relationships [15,16]. Typically, the correlated-components in a regulatory network were defined by a Pearson/Spearman correlation coefficient greater than |±0.5| [17,18], and the others were filtered out as unrelated components. However, one question that is often overlooked was whether the weakly-correlated components were indeed unrelated at all? Singlecell analysis has become extremely popular in recent years and improved our understanding in many areas that have been traditionally studied at the population level. Does the correlation at the single-cell level provide novel insight into connections between the components?
To address these questions, we employed mathematical modeling to investigate the correlations between components in regulatory networks. Two toy models were developed to represent two typical types of biological regulatory systems respectively: Model1 described a multi-step signaling cascade ( Figure 2A) and Model2 characterized a multi-regulator controlled gene expression ( Figure 2B). X*(X1*, X2* and X3*) were the intermediate regulators and Y* was the responder ('*' indicates active form). Strikingly, although X* and Y* did interact with each other and they were indeed significantly correlated at the population level, the correlation coefficient between them at the single-cell level could be very weak ( Figure 2C-F). Specifically, we generated population-level data by randomly sampling the cells based on their proximity of X*, and then averaging the respective values (See Methods for a detailed description). There're overlaps between the neighboring bulk samples in our simulations, which was intended to mimic the possible overlaps of population-level measurements in the biological experiments.
We also investigated the correlations in a previous published regulatory network of mammalian cell cycle, the parameters of which were experimentally derived [19]. The discrepancy of correlation coefficients between the single-cell and the population level were shown. For the aggregated-level significantly correlated molecules, their correlation coefficients at the individual level could be much lower ( Figure S1). In sum, we could not exclude the possibility of interactions between components by their single-cell level correlation coefficient only.

Aggregated correlation is robust and less affected by technical random error
Technical random measurement errors are often unavoidable in experimental assays. When comparing correlations at the population and single-cell level, random measurement errors should be taken into consideration before drawing any conclusions. By taking the random errors into account, the measured values of variables at the single-cell or the population level are represented by where x m , y m , u m , and v m are the measured values, which are composed of x, y, u, or v and some random errors; e x and e y are the random measurement errors for variable x and y, respectively; e u and e v are the averaged random measurement errors for variable and , respectively.
Theoretically, for the measurements at the population level, the mean of random error should be a constant ( = 1 and = 2) and thus the variance of random error is zero ( 2 = 0 and 2 = 0). Therefore, the correlations at the population level are unaffected by the random measurement errors (Equation 6).
where denotes covariance, ρ uv ′ is the population-level correlation with random measurement errors; σ eu 2 and σ ev 2 are the variances of e u and e v , respectively.
In contrast, measurements at the single-cell level, as well as their correlations, are affected. By assuming the random errors (e x and e y ) are uncorrelated with each other, and they are independent of x, y, u, v, i x or i y , the individual correlation with measurement errors is represented by Equation 7.
Compared with the individual correlation without measurement errors (Equation 8) since σ ex 2 > 0 and σ ey 2 > 0, we get > ′ , which demonstrates that the individual correlation is reduced by measurement errors.
These theoretical conclusions could also be confirmed and visualized by model simulations. We added random error to the same dataset of Figure 2C and 2D, and then calculated the corresponding correlation coefficients of the data with random error at the single-cell ( Figure 3A) or the population level ( Figure 3B). As expected, the individual correlations were largely reduced while the aggregated correlations were unaffected. Notably, although in theory the aggregated correlations should stay the same (Eq.6), in practice, the mean of random error might most probably be a small number approximate to, but not equal to zero. As a consequence, the aggregated correlations could be also slightly or markedly reduced ( Figure S2), especially for the situation when the values of output (Y*) would not significantly change along with the inputs (X*).

Correlation within could not guide sampling
In many biological studies, researchers are exploring how a specific molecule E (effector) is positively or negatively regulate another molecule R (responder). Generally, molecule E might quantitatively affect molecule R only in a certain dose range. For instance, after reaching the saturation point, the amount of responder R does not change along with the effector E. It's usually essential to cover the entire responsive range in the experimentation design. But in the situation when we haven't clearly explored that range, it could be very tricky to identify the proper responsive range efficiently.
To find a method for efficiently locate the responsive range, within which the two molecules present mutual correlation, we were curious if the correlationwithin sheds light on it. We proposed a hypothesis that the correlation-within of two samples would be similar if they located closely on the responsive range, while the correlation-within varied a lot among samples covering a large responsive space.
To test our hypothesis, we investigated the correlations between Y* and X3* in Model2 by simulation. 30000 cells were simulated and divided into 30 bulk samples according to their close proximity of X3* ( Figure S3A and S3B). The correlations-within were calculated ( Figure S3C) and every five samples closely located were merged into one close-sample group. For each close-sample group, we calculated the correlation coefficients of the five bulk-sample within it, and then derived the standard deviation of these coefficients ( Figure 4A). Next, we generated a sparse-sample group by choosing one sample from each of the closesample groups, and then performed similar calculations. Interestingly, the standard deviations of the sparse-sample group (covering a large space) could be higher or lower than those of the close-sample groups ( Figure 4B and 4C). As a conclusion, our primary hypothesis was rejected by the model simulation, and the correlation-within could not offer us a better guess for the responsive range. It is important to perform sampling covering the responsive space as large as possible without prior knowledge, in particular for investigations on digital responses [20]. Unfortunately, correlation-within could not provide additional information about it.

Discussion
A central challenge in the biological research is to unveil the connections in the regulatory networks and understand them in a quantitative way. Correlation analysis, traditionally based on bulk-averaged assays, is widely used to identify interactions within metabolic, signaling or transcriptional networks. In recent decades, as advanced technologies make more single cell measurements accessible, one of the key questions arising is whether the conclusions based on population-averaged assays could be applied to individual cell behaviors.
Despite a well-recognized inconsistency of correlation analysis between the single-cell level and the population level, the underlying mechanism remains to be fully addressed. Our formulations illustrated that the individual-correlation could be stronger, weaker or equal to the corresponding aggregated-correlation, depending on the variations and the correlation within the population. By modeling a signal cascade and a parallel regulation respectively, we found that the correlation between two interactive components could be weak at the singlecell level, though they were strongly correlated at the population level. Therefore, the connection possibility between components could not be excluded solely by their correlation coefficient. Existence of connection between two molecules should be considerated whenever the components are highly correlated at either level.
Our results demonstrated that the variance-within plays a crucial role in the consistency of the correlation coefficients across the levels, which describes the heterogeneity within samples. Statistical artifacts, biased estimates or elimination of error variance might somewhat all contribute to the higher correlations at the aggregated level than those at the individual level [14]. Biological activities are often much more complex than just several definite interactions among a few molecules, one specific response often results from a combination of factors. Though most of these factors might not impact a lot, however, on the one hand, these impacts could constantly accumulate, on the other hand, the chaos theory suggests that even a tiny interference could induce a largely deviant response. All of these would attenuate the correlation between two variables at the individual level. However, these deviances are often averaged out at the aggregated level, which help to expose the correlation. Basically, the more deviant the individual cells, the larger the ratio of the aggregated correlation to the individual correlation. Thus, it was not surprising to find a low correlation coefficient for a pair of molecules at the single cell resolution while a strong correlation for the same pair has been reported in a previous population-based study. Variability in protein levels is a common phenomenon even in genetically identical cells. Across the population, proteins were log-normally distributed, with an average coefficient of variation (CV = standard deviation/mean) ranging from 0.12 to 0.28 [21,22]. Therefore, the discrepancy between the aggregated correlation and the individual correlation is always observed.
Our results illustrated an unavoidable difference of correlation coefficient between the single-cell and the population level, which raises the question of which level of correlation analysis weights more when discrepancy occurs. Depending on the purpose of an investigation, individual correlation, aggregated correlation or correlation-within should be taken into consideration to make proper conclusions wisely. Studies have revealed distinct biological insights of the co-expressed genes at the population or the single-cell level: the population level co-expressed genes share the same biological functions, while the singlecell level co-expression indicates interactions [7]. Specifically, considering the existence of technical random measurement errors, our data show that individual correlation is less robust. Therefore, higher accuracy is required for single-cell level investigation and if possible, proper normalization procedures for individual-level measurements should be done before drawing any conclusions.

Conclusions
Measurements at the single-cell level indeed proved us with a fantastic tool to increase the resolution of our exploration into the biological activities. When taking a dive from the population level into the single-cell level, discrepancy could occur for correlation analysis because of the heterogeneity within samples. Since distinct biological insights could be revealed from either the individual or aggregated perspective analysis, we should always be aware of the level of analysis we are at, or choose the proper level of data to explore in our research.

Model development and simulation
Two toy models were developed to represent two typical types of biological regulatory systems: Model1 described a multi-step signaling cascade and Model2 characterized a multi-regulator controlled gene expression. The illustrations of the two models are shown in Figure 2.
Model1 comprises three steps of signaling transduction and one feedback loop, described by eight ordinary differential equations. Model2 comprises three regulators and three feedback loops, described by eight ordinary differential equations. X1*, X2*, X3* and Y* are the active forms of X1, X2, X3 and Y, respectively.
10000 cells in total were simulated. The initial values of X1, X2, X3 and Y were randomly generated from lognormal distributions. The initial values of X1*, X2*, X3* and Y* were zeros. Correlation analyses were preformed based on the data of the systems at steady state. See supplemental materials for a detailed description.

The sampling procedure for the population level analysis
To represent the possible overlaps of population-level samples in the biological experiments, aggregated values with overlaps between neighboring groups were obtained by the following steps: (1) 10000 single cells were grouped based on their proximity of X* (2000 cells per group); (2) For each group, chose the closest 500 cells from its two neighboring groups, respectively (1000 cells in total). Specifically, for group on the most left or most right side, chose only 500 cells from the neighboring group. Then combined these cells with the original group (obtained a 3000-cells group or 2500-cells group); (3) Randomly chose 1000 cells from it and then averaged the values of X* and Y*.