A Core Set of Microorganisms are Biomarkers of Health or Disease Across a Variety of Settings
We downloaded gut microbiome whole-community metagenomic samples from 20 studies that included healthy controls and patients with a range of diseases, including colorectal cancer (CRC), type 2 diabetes (T2D), type 1 diabetes (T1D), metabolic syndromes (IGT), inflammatory bowel disease (IBD), Behcet’s disease (BD), hypertension (HD), liver cirrhosis, and others (Table 1). We hypothesized that many of the differentially abundant microorganisms within these studies are general markers of disease, or hallmarks of ’dysbiosis’. To test this hypothesis, we computed association scores in the form of Mann-Whitney AUC statistics for each taxon in each study and averaged them across studies. We then generated a core set of taxa whose average association across all studies was statistically significant at a threshold of p < 0.01 (Fig. 2) and an expanded set of markers at p < 0.05 (Figure S1).
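The per-taxon association score described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the toy abundances, and the convention that an AUC above 0.5 means the taxon is more abundant in healthy samples are all assumptions made for the example, and the significance test applied to the averaged score is not reproduced here.

```python
from statistics import mean


def mann_whitney_auc(healthy, diseased):
    """Return U / (n1 * n2) for one taxon in one study.

    This is the probability that a randomly chosen healthy sample has a
    higher abundance than a randomly chosen diseased sample; ties count 0.5.
    """
    wins = 0.0
    for h in healthy:
        for d in diseased:
            if h > d:
                wins += 1.0
            elif h == d:
                wins += 0.5
    return wins / (len(healthy) * len(diseased))


# Hypothetical relative abundances of a single taxon in two studies.
studies = {
    "study_A": {"healthy": [0.30, 0.25, 0.40], "diseased": [0.10, 0.20, 0.05]},
    "study_B": {"healthy": [0.20, 0.10, 0.30], "diseased": [0.15, 0.25, 0.05]},
}

# One AUC per study, then the average across studies for this taxon.
per_study_auc = {
    name: mann_whitney_auc(groups["healthy"], groups["diseased"])
    for name, groups in studies.items()
}
mean_auc = mean(per_study_auc.values())
print(per_study_auc, mean_auc)
```

Under the assumed convention, a mean AUC well above 0.5 would mark a candidate health-associated taxon and a mean AUC well below 0.5 a candidate disease-associated taxon.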
We identified 261 taxa at p < 0.05 (Figure S1) and 41 taxa at p < 0.01 (Fig. 2a). Many health-associated bacteria were common commensals such as the Lachnospiraceae 17 and Ruminococcaceae 18, 19 families, while many known pathogen-associated organisms such as Fusobacterium nucleatum 16, the Streptococcaceae family 20–22, and Solobacterium moorei 23 were identified as consistently disease-associated. Many of these taxa align with those reported in other studies. For example, out of 29 taxa identified as colorectal cancer-associated by Wirbel et al., 19 directly matched disease-associated markers in our extended set (Table 2) 12.
To demonstrate the consistency of these markers across studies, we performed a two-sided t-test comparing control and experimental groups for each taxon in each study and found that our most significant health markers were independently health-associated in almost every study, and similarly for disease markers (Fig. 2b, Figure S1b). This congruence indicates that our results were not biased by a handful of outlier studies but rather that the significant taxa in each of these studies overlap despite involving completely different patient cohorts. Next, we calculated the percentage of samples that contained each taxon. We found that health markers tend to be present in most samples, while some disease markers, such as Anaerostipes 24, are rarely carried and others, such as Clostridium hathewayi 25, are found in most samples (Fig. 2d, Figure S1d). Finally, we plotted the relative abundance of the general health and disease markers in microbiomes carrying these taxa. Here, we observed a wide distribution in relative abundance across most taxa (Fig. 2d, Figure S1d).
Predominance of Health-Associated Taxa in Microbiomes
We next asked whether the observation that health-associated markers are more abundant in gut metagenomes generalizes to other studies. We analyzed 15 additional datasets containing only healthy-labeled samples from Western populations and found that, in all cases, healthy taxa have a higher mean relative abundance than disease taxa (Fig. 3a). This observation extends to non-Western populations. We analyzed six datasets from geographies such as Tanzania, the Fiji Islands, and mountainous village regions in Latin America, and all of them followed the same trend with the exception of Smith et al. 26–31 (Fig. 3b). We also observed that health taxa are more abundant than disease markers even in datasets for which we only had unhealthy patient samples. Taken together, our results suggest that healthy taxa are more predominant in general, regardless of underlying condition 32–35 (Fig. 3c).
Oral Microbes are Predominantly Associated With Dysbiosis
We next assessed potential biological or functional similarities in our disease-associated taxa. From a review of the literature, we observed that many of our disease-associated taxa are commonly found in oral microbiomes 36. To investigate this relationship in our data, we first quantified the mean abundance of each organism in the oral cavity of healthy individuals using 206 oral microbiome samples from three studies 26, 37, 38. We found that, on average, disease taxa are more abundant in the oral cavity than healthy taxa, whereas healthy taxa are more abundant in the gut (Fig. 4a). No such pattern emerged when performing the same analysis on skin and nasal body sites 39–42.
When we compare disease-associated markers with health-associated markers in gut studies, we find that health-associated markers have a higher mean relative abundance in the gut than disease-associated markers (Fig. 4b). We found a tendency for more abundant gut taxa to be associated with health regardless of whether the mean AUC was statistically significant (slope = 31.7, p = 1.26e-8). Most taxa are close to 0.5 in mean AUC until approximately 10^-1 mean abundance, at which point a fanning out occurs: organisms above this threshold tend to be either strongly health- or disease-associated, although the trend leans towards higher-abundance organisms being more health-associated (Fig. 4c).
By contrast, when we compare the mean relative abundance of bacteria in the oral cavity with their disease vs. health association, we find that the more abundant bacteria are in the oral cavity, the more likely they are to be disease-associated (Fig. 4c). Of the 160 taxa that are both found in the oral cavity and included in the core gut group identified by our meta-analysis, 134 are associated with disease and only 26 with health. Furthermore, when we plotted the mean AUC score in gut studies vs. the log mean abundance of each oral taxon, we found that taxa with higher abundances in the oral cavity tend to be more disease-associated (regression slope = -50.4, p = 6.84e-22). This relationship also held when we extended the analysis to all oral taxa: higher mean abundance was associated with lower mean AUC, even when that AUC score was not statistically significant.
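The regression of mean AUC on log mean abundance can be illustrated with an ordinary least-squares fit. This is a sketch under stated assumptions: the base-10 logarithm, the helper name, and the toy values are all hypothetical, and the example does not reproduce the reported slopes or p-values, whose scaling is not specified in the text.

```python
import math


def ols_slope_intercept(x, y):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical taxa: mean relative abundance in the oral cavity and the
# mean AUC of the same taxa across gut studies.
oral_mean_abundance = [1e-4, 1e-3, 1e-2, 1e-1]
gut_mean_auc = [0.52, 0.48, 0.44, 0.40]

# Base-10 log is an assumption; the text only says "log mean abundance".
log_abundance = [math.log10(a) for a in oral_mean_abundance]
slope, intercept = ols_slope_intercept(log_abundance, gut_mean_auc)
print(slope, intercept)
```

A negative slope in this toy data mirrors the direction of the oral trend described above: taxa that are more abundant in the oral cavity have lower mean AUC in the gut.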
Almost All Disease Markers are Oral Taxa or Class Clostridia
When analyzing taxonomic abundance data, we noticed that many taxa were completely absent from most samples but, when present, followed a consistent distribution (Fig. 2c). Hence, the mean relative abundance of a taxon may be biased downward if the taxon is infrequently carried. To better understand this phenomenon, we visualized our data based not on mean relative abundance but on frequency of carriage. We compared the percentage of samples in which the different markers are found across body sites and discovered a striking pattern: the majority of disease taxa are rarely found in the gut but are common oral commensals (Fig. 5a, Figure S2). Furthermore, we found that all non-oral commensal disease markers were either from the class Clostridia or belonged to a higher-order taxonomic category such as genus Peptostreptococcus (Fig. 5b-d).
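The downward bias described above is easy to see in a small example that separates carriage frequency from abundance among carriers. The function name and the toy abundances are illustrative assumptions, and "carried" is taken here to mean any non-zero relative abundance, which may differ from the detection threshold actually used.

```python
def carriage_stats(abundances):
    """Return (carriage fraction, mean over all samples, mean over carriers).

    A sample is counted as a carrier when the taxon's relative abundance
    is greater than zero (an assumed detection rule).
    """
    n = len(abundances)
    carriers = [a for a in abundances if a > 0]
    carriage = len(carriers) / n
    mean_all = sum(abundances) / n
    mean_carriers = sum(carriers) / len(carriers) if carriers else 0.0
    return carriage, mean_all, mean_carriers


# Hypothetical taxon found in 2 of 10 stool samples.
taxon_abundance = [0.0, 0.0, 0.0, 0.02, 0.0, 0.04, 0.0, 0.0, 0.0, 0.0]
carriage, mean_all, mean_carriers = carriage_stats(taxon_abundance)
print(carriage, mean_all, mean_carriers)
```

In this toy case the mean over all samples is five times smaller than the mean among carriers, which is why carriage frequency and abundance in carriers are more informative when reported separately.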
We next grouped our taxa based on their frequency of carriage and disease score. As Fig. 5c shows, there are no core taxa that are both infrequently carried in the gut (< 20% of samples) and health-associated. These observations hold true in the extended set of health and disease markers with the exception of two broad categories (family Bacillaceae and genus Siphoviridae, a phage taxon) and Clostridium sp-L2-50 (Figure S3). Additionally, there are very few core markers that are both commonly carried in the gut and disease-associated. Taxa that fit into this category in the core set belong to Clostridia, and in the extended set they are mostly higher-order categories (for example, class Gammaproteobacteria and order Enterobacteriales). By contrast, there are no core health-associated taxa commonly found in oral microbiomes (Fig. 5d). This also applies to the extended set, again with the exception of a few higher-order taxa (Figure S4). The remaining disease-associated taxa are infrequently found in any body site (Fig. 5b). These are relatively rare organisms like Holdemania sp AP2 (2.37% carriage) and Atopobium sp ICM58 (0.08% carriage), while the rarest health-associated taxon is Bacteroides intestinalis (27.08% carriage) (Fig. 2d). There are a few exceptions to this pattern in the extended set, such as the disease marker Lachnospiraceae bacterium 3-1-57FAA-CT1 (29.68% carriage) (Figs. 6 and 10). All other core healthy markers are carried by more than half of stool samples, and only Clostridia disease markers are ever found at that frequency in the gut. These findings serve as a validation, reinforcing the consistency of the identified taxa as truly general markers of health and disease.
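The grouping by carriage frequency and association direction can be sketched as a simple two-way binning. The carriage percentages and health or disease labels below are the ones quoted in the paragraph above; the function name, the two-bin scheme, and the reuse of the 20% cutoff for every taxon are assumptions made for illustration.

```python
from collections import defaultdict


def group_taxa(taxa, rare_cutoff=20.0):
    """Bin taxa by (carriage class, association direction).

    taxa: iterable of (name, carriage percent, "health" or "disease").
    rare_cutoff: carriage below this percentage counts as infrequent.
    """
    groups = defaultdict(list)
    for name, carriage_pct, association in taxa:
        freq = "infrequent" if carriage_pct < rare_cutoff else "frequent"
        groups[(freq, association)].append(name)
    return dict(groups)


# Carriage values and labels as quoted in the text.
taxa = [
    ("Holdemania sp AP2", 2.37, "disease"),
    ("Atopobium sp ICM58", 0.08, "disease"),
    ("Bacteroides intestinalis", 27.08, "health"),
    ("Lachnospiraceae bacterium 3-1-57FAA-CT1", 29.68, "disease"),
]

groups = group_taxa(taxa)
for key, names in groups.items():
    print(key, names)
```

With these four taxa the ("infrequent", "health") bin stays empty, matching the observation that no health marker quoted here falls below the 20% carriage cutoff.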