Undersampled microbial profiles dramatically deviated from full profiles
By comparing the composition of mock communities with that of their seed communities, we investigated whether and how undersampled microbial profiles deviated from full profiles. Fifteen seed communities following a lognormal distribution and with different levels of β-diversity were generated. Each seed community was composed of 10⁴ species and 10⁸ organisms, approximating the prokaryotic diversity in one gram of soil or 0.2 liter of sea water [69]. A select proportion (0–100%) of organisms were renamed as new species and/or randomly shuffled, simulating community assembly processes such as dispersal and drift, respectively. As a result, seed communities with β-diversity ranging from 7.48% to 87.83% were generated (Supplementary Table 1). Mock communities were then generated by randomly subsampling a select number of organisms from the seed communities. A series of mock communities with different numbers of organisms (5000 to 200,000) were generated to investigate whether increasing sequencing depth eliminates random sampling issues. The seed communities with a 35% shuffling rate and 35% new taxa were selected to illustrate the deviation of undersampled microbial profiles from full profiles. In these communities, a large number of rare taxa were not captured by the mock communities, whereas the abundant taxa were rarely affected (Fig. 2A, B and C).
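The simulation design described above can be illustrated with a minimal sketch. It is not the authors' code: the community sizes are scaled down by several orders of magnitude so that it runs quickly, the lognormal shape parameter `sigma` is an assumed value, and all function names are hypothetical. The sketch draws a lognormal seed community, subsamples it without replacement to obtain mock communities, and counts how many seed taxa each mock community fails to capture.

```python
import random
from collections import Counter


def make_seed_community(n_species, n_organisms, sigma=2.0, rng=None):
    """Return per-species counts with lognormal relative abundances.

    sigma is an assumed shape parameter, not a value taken from the study.
    """
    rng = rng or random.Random()
    weights = [rng.lognormvariate(0.0, sigma) for _ in range(n_species)]
    # Assign each organism to a species with probability proportional to weight.
    draws = rng.choices(range(n_species), weights=weights, k=n_organisms)
    tally = Counter(draws)
    return [tally.get(s, 0) for s in range(n_species)]


def subsample(counts, depth, rng=None):
    """Draw `depth` organisms without replacement (a mock community)."""
    rng = rng or random.Random()
    # Expand counts into one entry per organism, then sample individuals.
    pool = [s for s, c in enumerate(counts) for _ in range(c)]
    tally = Counter(rng.sample(pool, depth))
    return [tally.get(s, 0) for s in range(len(counts))]


def richness(counts):
    """Number of taxa with at least one organism."""
    return sum(1 for c in counts if c > 0)


rng = random.Random(1)
# Scaled-down, hypothetical sizes chosen only for illustration.
seed = make_seed_community(n_species=1000, n_organisms=100_000, rng=rng)
for depth in (500, 2000, 10_000):
    mock = subsample(seed, depth, rng=rng)
    missed = richness(seed) - richness(mock)
    print(f"depth={depth}: {missed} of {richness(seed)} seed taxa not captured")
```

Because the taxa that go undetected are those with few organisms in the seed community, a sketch like this loses mostly rare taxa, which is the qualitative pattern reported in Fig. 2A, B and C.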
The β-diversity of the seed communities and the mock communities was also compared. β-diversity was overestimated for the undersampled mock communities, both for the whole community and for the abundant and rare subcommunities (Fig. 2D, E and F). Among these, the β-diversity of rare subcommunities was the most dramatically overestimated (Fig. 2F), while the β-diversity of abundant subcommunities was only slightly overestimated (Fig. 2E). Notably, increasing sequencing depth from 50,000 to 200,000 only slightly reduced the overestimation of β-diversity (Fig. 2), suggesting that the random sampling issues associated with microbial profiling could persist with current and near-future technologies.
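The inflation of β-diversity by undersampling can be reproduced in a small sketch. Two assumptions are made here that the text does not state: Bray-Curtis dissimilarity is used as the β-diversity measure, and the seed community is a hypothetical toy with 20 abundant and 500 rare taxa. Both mock communities are drawn from the same seed community, so the dissimilarity between the full profiles is zero and any apparent dissimilarity arises from random sampling alone.

```python
import random
from collections import Counter


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1.0 - 2.0 * shared / (sum(a) + sum(b))


def subsample(counts, depth, rng):
    """Draw `depth` organisms without replacement from a community."""
    pool = [s for s, c in enumerate(counts) for _ in range(c)]
    tally = Counter(rng.sample(pool, depth))
    return [tally.get(s, 0) for s in range(len(counts))]


# Hypothetical seed community: 20 abundant taxa and 500 rare taxa.
seed = [2000] * 20 + [20] * 500
rng = random.Random(7)

# Two independent mock communities from the SAME seed community.
for depth in (500, 5000, 25_000):
    a = subsample(seed, depth, rng)
    b = subsample(seed, depth, rng)
    print(f"depth={depth}: apparent dissimilarity {bray_curtis(a, b):.3f}")
```

Restricting the two vectors to the rare taxa (the last 500 entries in this toy example) before calling `bray_curtis` gives the corresponding comparison for the rare subcommunity.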
The β-diversity of null mock communities was also affected
We then investigated how random sampling affected the β-diversity of null communities, on which the inference of microbial stochasticity is based. Two randomization methods commonly used in microbial community analyses were investigated. In the first method, the composition of microbial communities was randomly shuffled while holding the community richness in each sample (α-diversity) and across all samples (γ-diversity) constant [37]. Here, the regional species pool is defined as the set of all microbial taxa found across the simulated communities with the same sequencing depth. Under this method, dissimilar null communities were expected. In the second method, null microbial communities were generated by randomly drawing individuals into given taxa with probability proportional to their relative abundance in the regional species pool, while also preserving both α-diversity and γ-diversity [73]. Under this method, low compositional variation among null communities was expected.
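The two randomization methods can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in [37] or [73]: the function names are hypothetical, the regional pool is taken as the column sums of a small made-up sample table, and only each sample's richness and total abundance are strictly preserved (γ-diversity is bounded by the pool but is not forced to be fully occupied). In the proportional variant, the choice of which taxa are present is also weighted by regional abundance, which is an assumption of this sketch.

```python
import random


def shuffle_null(sample, rng):
    """'Shuffle' null: reassign the sample's abundances to random taxa in the pool.

    Richness (number of non-zero taxa) and total abundance are unchanged.
    """
    null = list(sample)
    rng.shuffle(null)
    return null


def pick_weighted_without_replacement(weights, k, rng):
    """Pick k distinct indices with probability proportional to weight."""
    remaining = {i: w for i, w in enumerate(weights) if w > 0}
    picked = []
    for _ in range(k):
        ids = list(remaining)
        choice = rng.choices(ids, weights=[remaining[i] for i in ids], k=1)[0]
        picked.append(choice)
        del remaining[choice]
    return picked


def proportional_null(sample, pool_abundance, rng):
    """'Proportional' null: draw individuals according to regional abundance."""
    sample_richness = sum(1 for c in sample if c > 0)
    total = sum(sample)
    # Choose which taxa are present, then give each one individual.
    taxa = pick_weighted_without_replacement(pool_abundance, sample_richness, rng)
    null = [0] * len(sample)
    for t in taxa:
        null[t] = 1
    # Distribute the remaining individuals among the chosen taxa.
    extra = rng.choices(taxa, weights=[pool_abundance[t] for t in taxa],
                        k=total - sample_richness)
    for t in extra:
        null[t] += 1
    return null


# Made-up sample table (rows are samples, columns are taxa).
samples = [[50, 30, 0, 5, 0, 1], [40, 0, 20, 0, 3, 0]]
pool = [sum(col) for col in zip(*samples)]  # regional abundance per taxon
rng = random.Random(11)
for s in samples:
    print("shuffle:     ", shuffle_null(s, rng))
    print("proportional:", proportional_null(s, pool, rng))
```

The shuffle variant treats all taxa in the pool as interchangeable, so abundant taxa can land anywhere, whereas the proportional variant tends to place most individuals in the same regionally abundant taxa in every null sample. This is consistent with the expectation above that the first method yields dissimilar null communities and the second yields similar ones.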
Deviated β-diversity was also observed for the null communities, and several issues were noted (Fig. 3). First, as expected, the β-diversity of null communities relative to the observed values differed dramatically between randomization methods. For instance, when the community composition was randomly shuffled under constraints, the β-diversity of null communities (Fig. 3A) was larger than the observed β-diversity (Fig. 2B). However, when the community composition was generated in proportion to the relative abundance of the taxa in the regional species pool, the β-diversity of null communities (Fig. 3B) was much smaller than the observed β-diversity (Fig. 2B). Second, the β-diversity of null mock communities relative to that of null seed communities also differed dramatically between randomization methods. The β-diversity of null mock communities was smaller than that of null seed communities when the community composition was randomly shuffled under constraints (Fig. 3A). In contrast, the opposite pattern was observed when the randomization was proportional to the relative abundance of microbial taxa in the regional species pool (Fig. 3B). These contrasting patterns mainly resulted from rare subcommunities, whereas the abundant subcommunities were less affected (Fig. 3). Importantly, such large differences in the β-diversity of null communities between randomization methods may lead to markedly different conclusions in microbial community stochasticity inference. Third, samples with low sequencing depth (e.g., 5000 and 10,000) deviated more dramatically, or even showed the opposite pattern (Fig. 3). These results suggested that different randomization methods exerted different effects on undersampled microbial profiles, and that rare subcommunities were more strongly affected.
Microbial stochastic ratios were overestimated
Multiple approaches are available for inferring community stochasticity. Here, the stochastic ratio approach [71, 81] was first evaluated to see how undersampled microbial profiles affected the inferred microbial community stochasticity. An overestimated stochastic ratio was observed for both randomization methods (Fig. 4). This overestimation was consistently observed for rare subcommunities regardless of the randomization method (Fig. 4C and F). In contrast to rare subcommunities, the effect of random sampling on the stochastic ratio of abundant subcommunities differed by randomization method (Fig. 4B and E). The stochastic ratio of abundant subcommunities was rarely affected when the “shuffle” randomization method was used (Fig. 4B). Most critically, undersampled microbial profiles may lead to dramatically different conclusions. For example, when the community composition was randomly shuffled under constraints, a high stochastic ratio (> 0.75) was observed for both seed and mock communities (Fig. 4A, B and C). However, when the randomization was performed by drawing individual organisms in proportion to the relative abundance of microbial taxa in the regional species pool, the stochastic ratio was low (~ 0.40) for the seed community but high (> 0.52) for mock communities, even for those with a sequencing depth of 200,000 (Fig. 4D). Such issues also tended to occur with rare subcommunities (Fig. 4F). Overall, these results suggested that undersampled microbial profiles could lead to an overestimated stochastic ratio, especially for rare subcommunities. Such overestimation may lead to dramatically different conclusions depending on which randomization method is used.
Microbial stochasticity inference using the RCbray metric was also affected
In addition to the stochastic ratio analyses, the RCbray metric, which characterizes the deviation between null distributions and observed taxonomic turnover to infer the contributions of different processes to community assembly [73, 74], was also employed to evaluate how stochasticity inference was affected by random sampling issues. Because it was not possible to generate the required datasets experimentally (e.g., deep sequencing of 10⁸ organisms per sample), the same simulated datasets were used here. In addition, because it was technically almost impossible to simulate phylogenetic relationships representing the community assembly process of the mock communities, taxonomic compositional turnover was assessed using the RCbray metric alone, without considering the selection process inferred from phylogenetic signals. The same two randomization methods (i.e., “shuffle” and “proportional”) were investigated. Again, dramatically different results were observed for the different randomization methods (Fig. 5). This difference was mainly reflected in the relative contributions of different processes as judged by RCbray values. Notably, when the “shuffle” method was used, the contribution of deterministic factors causing variable communities (RCbray > 0.95) was overestimated, whereas the contribution of deterministic factors causing similar communities (RCbray < -0.95) was underestimated. This pattern was consistently observed for the whole community and for the abundant and rare subcommunities (Fig. 5A, B, and C). However, when the “proportional” randomization method was used, overestimation of stochastic processes was observed for the rare subcommunities (Fig. 5F). For the whole community and the abundant subcommunities, deterministic factors causing variable communities were found to be the sole process responsible for the compositional variation of the mock and seed communities when the sequencing depth was larger than 50,000 (Fig. 5D and E).
These results suggested that the RCbray metric was relatively robust to random sampling issues, but could be strongly affected by the randomization method.