Assemblage of the largest dataset of human vaginal microbiota profiles
We compiled a dataset of vaginal community compositions from 13,160 vaginal swab or lavage specimens that had been collected by our research group from three locations around the US: Baltimore, MD; Birmingham, AL; and Atlanta, GA. The samples originated from 1,975 North American women and included participants who self-identified as Black (n=1,343, 68%), White (n=403, 20.4%), Hispanic (n=110, 5.6%), Asian (n=95, 4.8%), as well as 17 women who identified as a different race and 7 women who did not self-identify. Many of the women had participated in longitudinal studies (n=916) and therefore contributed more than one sample to the compiled dataset; the median number of samples contributed per participant was three and ranged from one to seventy. All of the women included in this study were of reproductive age as defined by recent menstruation and were not pregnant at the time of sampling. The participant age range was 13-53 with a median participant age of 25. Eleven of the women were younger than 15 and two were older than 49. The composition of their vaginal microbiota was established by deep sequencing of the V3-V4 region of the 16S rRNA gene with an average of 54,898 reads per sample (range: 1,005-411,805).
Construction of community state type reference centroids
The compiled dataset of 13,160 vaginal microbiota profiles includes representations of all previously identified CSTs and was used as a comprehensive training dataset. We first defined the CSTs in the training dataset using hierarchical clustering of the pairwise Bray-Curtis distances between samples with Ward linkage (Figure 1). We then identified seven CSTs, four of which had a high relative abundance of Lactobacillus species. These seven CSTs could be further broken down into thirteen sub-CSTs. To conform with previous studies, we name these as follows: CST I—L. crispatus dominated, CST II—L. gasseri dominated, CST III—L. iners dominated, and CST V—L. jensenii dominated. CSTs I and III were more common in this dataset than CSTs II and V, and were each split into two sub-CSTs denoted with A and B. The “A” version represents samples that had a higher relative abundance of the focal species, with the “B” version representing samples with a somewhat lower relative abundance of that species. We also identified three CSTs which did not have a high relative abundance of lactobacilli which we term CST IV-A, IV-B, and IV-C. CST IV-A had a high relative abundance of Candidatus Lachnocurva vaginae (formerly known as BVAB1 (33)) and a moderate relative abundance of G. vaginalis, while IV-B had a high relative abundance of G. vaginalis and low relative abundance of Ca. L. vaginae. Both IV-A and IV-B had moderate relative abundances of Atopobium vaginae. Samples assigned to CST IV-C had a low relative abundance of Lactobacillus spp., G. vaginalis, A. vaginae, and Ca. L. vaginae and were instead characterized by the abundance of a diverse array of facultative and strictly anaerobic bacteria. We thus further split CST IV-C into 5 sub-CSTs as follows: CST IV-C0—an even community with moderate amount of Prevotella, CST IV-C1—Streptococcus dominated, CST IV-C2—Enterococcus dominated, CST IV-C3—Bifidobacterium dominated, and CST IV-C4—Staphylococcus dominated. Samples assigned to CST IV-C represented 6% (n=802) of the training dataset.
We next constructed a reference centroid for each of the thirteen sub-CSTs identified in the training dataset by averaging the relative abundances of each taxa across the samples assigned to the sub-CST. These centroids represent the average community composition of each sub-CST, as defined by the training dataset, and can be used as a stable reference for the assignment of vaginal microbiota profiles to CSTs (Figure 2).
VALENCIA: a novel method for assigning samples to community state types
We implemented a nearest centroid classification algorithm, which we term VALENCIA, to leverage the training dataset for reproducible and robust assignment of vaginal microbiota to community state types. The similarity of a vaginal microbiota profile to each of the thirteen reference centroids is evaluated using Yue and Clayton’s θ (34), a similarity measure based on species proportions. This yields an array of thirteen similarity scores, ranging from 0.0 (no shared taxa) to 1.0 (all taxa shared and at the same relative abundance) for each sample. The reference centroid to which the sample bears the highest similarity provides an optimal assignment to a sub-CST. These similarity scores can also be used to gauge confidence in the assignment and are particularly useful for handling cases where the sample either does not match any of the centroids or has close matches to multiple centroids. For these cases, VALENCIA yields a low confidence score to indicate the degree of ambiguity in the CST assignment.
We first used VALENCIA to reassign CSTs to our training dataset. The taxonomic composition of samples assigned to each CST generally matched that of the associated reference centroid (Figure 3a). As expected, differences in Shannon diversity (H) were observed among the sub-CSTs (Figure 3b). Samples assigned to sub-CSTs defined by the dominance of a single phylotype were found to have lower values of H. A comparison between the initial hierarchal clustering and the assignments provided by VALENCIA revealed 1,454 disagreements (11% of the samples). Discordant samples largely originated from communities which bore some degree of similarity to multiple CSTs. These communities exist in the grey areas between two or more community state types and are therefore difficult to classify. Based on the taxonomic profiles for these discordant samples, we have more confidence in the assignment provided by VALENCIA than that provided by hierarchical clustering (Additional file 1). For example, a number of samples were assigned to CST V by the hierarchical clustering but only contained ~20% relative abundance of L. jensenii and were instead majority L. iners. VALENCIA assigned these samples to CST III-B, which we find more agreeable. A similar pattern can be seen for discordant samples hierarchical clustering assigned to CST II which, based on the reference centroid for this community state type, should have a majority of L. gasseri. In this case the discordant CST II samples had either a majority L. iners, which VALENCIA assigned to CST III-B or a majority G. vaginalis, which VALENCIA assigned to CST IV-B.
Validation of VALENCIA against outside datasets
To further demonstrate the broad applicability of VALENCIA, we applied it to a number of test datasets that had not been included in the training dataset. These test datasets are from studies which had sampled women outside the age range of the training dataset, from non-North American populations, or had sequenced different 16S rRNA gene variable regions. Here we present our analysis of three such datasets: Test dataset 1 contained publicly available microbiota profiles from adolescent girls (n=245, aged 10 to 15) derived from sequencing the 16S rRNA gene V1-V3 region (31). Test dataset 2 was generated from SWAN analyzed in-house and contained microbiota profiles from postmenopausal women (n=1,380, aged 60 to 72) derived from sequencing the 16S rRNA gene V3-V4 region. Finally, test dataset 3 contained publicly available microbiota profiles from eastern and southern African reproductive age women (n=110) derived from sequencing the 16S rRNA gene V4 region (32). For test dataset 2, we applied our own taxonomic annotation pipeline to the data while for test datasets 1 and 3 we relied on the taxonomic annotations provided in the original studies. We assigned all of the samples in each of these three datasets to community state types using VALENCIA. In general, we find the CST assignments made by VALENCIA to be acceptable, although there is no “ground truth” from which to benchmark (Figure 4). The distributions of similarity scores between the samples and their matching reference centroid for outside datasets 1 and 3 did not substantially differ from the same distribution provided by the reclassification of the training dataset (Figure 4). Similarity scores for test dataset 2 were typically lower than those for the training dataset, although this is primarily driven by the high prevalence of CST IV-C and low prevalence of Lactobacillus spp. dominant CSTs in postmenopausal women. For test dataset 1, the author’s made their CST assignments (which were derived from hierarchical clustering) publicly available (31). We find 17 instances of discordance between our and the original assignments (6.7%). As was the case for the reclassification of the training dataset, these discordant samples occupy the grey space between CSTs. For example, six instances of discordance come from communities which were assigned to CST I by Hickey et al. but had more L. iners than L. crispatus—VALENCIA assigned these samples to CST III-B.
Relationships between VALENCIA-defined community state types, Nugent score, and vaginal pH
Prior to the application of amplicon sequencing to define microbiota composition, and still today, researchers and clinicians used the Nugent scoring system to evaluate and categorize vaginal microbial communities for the diagnosis of bacterial vaginosis (BV), a common vaginal condition (35). The Nugent score is primarily based on the morphology of Gram stained bacterial cells viewed under light microscopy. An abundance of Lactobacillus morphotypes, large Gram-positive rods, yields a low Nugent score while the presence of Gram-variable small (G. vaginalis) and/or curved rods (Ca. L. vaginae (36), Mobiluncus spp.) yields a high Nugent score, indicating Nugent-BV. We examined the relationship between VALENCIA-defined CSTs and Nugent score categories and observed high concordance (Figure 5a). Lactobacillus-dominant CSTs typically have low Nugent scores, while communities assigned to other CSTs have higher Nugent scores. CST IV-A VALENCIA-assigned communities, which are enriched of Ca. L. vaginae, had the highest Nugent scores, consistent with the Nugent scoring system (37). However, we were interested to see how the Nugent score system evaluated communities VALENCIA assigned to CST IV-C. These communities do not have a high relative abundance of Lactobacillus spp., G. vaginalis, or Ca. L. vaginae and instead have a high relative abundance of Streptococcus, Enterococcus, Staphylococcus, Prevotella and Bifidobacterium. We find that the Nugent scoring system does not reliably assign these communities to a single category and instead gives mixed results. Of the five subtypes of CST IV-C, CST IV-C2 (Enterococcus) and IV-C4 (Staphylococcus) are most often assigned low Nugent scores although these communities are not dominated by Lactobacillus spp. Besides the ambiguities observed for these CST IV-C communities, VALENCIA-defined CSTs and the Nugent scoring system are largely in accordance with the definition of the Nugent score.
The microbiota is thought to be the primary driver of vaginal pH through the release of acidic fermentation end products. A low vaginal pH (≤4.5) has been associated with decreased risk of adverse health outcomes and is usually achieved via the production of lactic acid by lactobacilli (38-40). Not surprisingly then, we found that the communities VALENCIA assigned to CSTs which are dominated by Lactobacillus spp. were associated with lower vaginal pH than those VALENCIA assigned to other CSTs (Figure 5b, see supplemental figure 2 for odds of vaginal pH >4.5 for each sub-CST). Communities which were dominated by L. crispatus (CST I) had the lowest pH (78% of samples had a pH ≤4.5) with those dominated by L. iners (CST III) and L. jensenii (CST V) following close behind. L. gasseri dominated communities (CST II) were associated with the highest vaginal pH among the Lactobacillus spp.-dominant CSTs. Also as expected, communities which were deficient in Lactobacillus spp. typically had a higher vaginal pH. Communities VALENCIA assigned to CST IV-A had the highest pH followed closely by those assigned to CST IV-B. Due to the large number of samples included in this dataset, we were also able to examine pH for samples assigned to the less common CSTs. Among the subtypes of CST IV-C, we find that communities with Enterococcus, Staphylococcus, or Bifidobacterium were associated with lower vaginal pH while the majority of Streptococcus communities typically had a higher pH. None of these CST IV-C sub-CSTs (percent of samples with pH ≤4.5: IV-C0 24%; IV-C1 15%; IV-C2 42%; IV-C3 28%; IV-C4 39%) consistently reach the low vaginal pH achieved by CSTs I, III, or V (percent of samples with pH ≤4.5: I-A 83%; III-A 61%; V 61%, Figure 5b).
Associations between community state types and participant’s race and age
Statistical associations between CSTs and demographic, medical, and/or behavior data have been used to identify factors that influence the composition of vaginal microbiota. Our training dataset contained microbiota compositions from over 1,900 allowing us to re-examine some previously observed associations with more statistical power. We sought to determine whether a participant’s race or age influenced their likelihood of having particular community state types. For each community state type with sufficient sample size (I, II, III, IV-A, IV-B, IV-C), we modeled their presence or absence as a binomial response variable with race and age as fixed predictor variables and the subject as a random variable (Figure 6a). We found that women who self-identify as Black or African American were less likely to have CST I than women who identify as White or Asian (z=4.6, p < 0.001; z=2.9, p=0.0235). Black women were also more likely to have CST IV-A than White women (z=4.9, p<0.001) and CST IV-B than women who identified as White or Asian (odds ratios: 6.91 versus 0.96 and 0.05; p<0.001, p<0.001). None of the Asian women included in this study were found to have CST IV-A. CST IV-B was also more common for Hispanic women than White women (z=3.3, p=0.006). Finally, we found that Asian women were more likely to have CST III than either Black or White women, although this association was weaker (odds ratios: 1.72 versus 0.34 and 0.36; p=0.025, p=0.050). No significant associations with race were found for CSTs II, V or IV-C, which may be due to sample size limitations as these three CSTs are less prevalent than the others. Our results agree with previous studies (7, 41) and show a much refined association between racial differences and prevalence of CSTs.
We were also able to examine associations between a participant’s age and their likelihood of having each CST using the same logistic regression models. Our analyses indicate that the representation of CST III varies with age, after adjusting for race (Figure 6b). Among reproductive age women, we observed that older women were less likely to have L. inersdominated CSTs than younger women (z=-3.589, p < 0.001). The probability of having CST III ranged from 40% for the youngest women included in the study, down to 20% for the oldest women. No significant associations with age were observed for the other CSTs. However, in our validation of VALENCIA we analyzed the SWAN dataset of postmenopausal women between the ages of 60-72 (Figure 4b). As expected, the representation of CSTs clearly differs between this dataset of postmenopausal women and the training dataset of reproductive age women.