Kenyan cohort of adolescent girls and young women (AGYW)
Samples for this pilot cohort were collected from the Kenyan Girls Health Study (KGHS), a prospective sexual health study of AGYW with limited sexual experience in Thika, Kenya (full study details are provided in ). Girls who met the following criteria and provided written informed consent with their guardian present were enrolled between 2014 and 2016: 1) age 16–21; 2) HIV and HSV-2 seronegative; 3) willing to undergo genital examination; 4) ≤1 sexual partners (total) to date. All samples used in this analysis were from the baseline clinical visit at which the participants underwent a genital exam and provided vaginal swab specimens. Demographic and sexual history data was gathered using questionnaires. Participant demographic and clinical metadata can be found in Table 1 and Table S1, Additional File 2. Study approval was obtained from the Kenya Medical Research Institute Scientific Ethics Review Unit, the University of Washington Institutional Review Board and the Conjoint Health Research Ethics Board (CHREB) at the University of Calgary.
After collection, samples were stored at -80°C and transported on dry ice from the study site to the University of Washington, and then on to the University of Calgary. Saline (0.9% NaCl) was prepared with UV irradiated H2O (HyClone; Cytiva Life Sciences, Marlborough, MA, USA) and filtered with a 100 kDa molecular weight cut-off (MWCO) Amicon Ultra-15 filter unit (Millipore, Oakville, ON, CA). Vaginal swabs were eluted in 1.5 mL sterile filtered saline by thawing at room temperature for 5–10 minutes (min), then vigorously vortexed for 1 min to release the remaining biomass; subsequent swab processing steps were conducted at 4°C. Swab eluates were centrifuged at 18,407 × g for 10 min and resulting cell-free supernatant and pellet samples stored at -80°C. Swabs were processed in batches of 9, including a sham swab as a negative control (Puritan sterile foam tipped applicator swab; Guilford, Maine, USA).
Total protein content in cell-free swab supernatants was measured using the PierceTM BCA protein assay according to the manufacturer’s instructions (ThermoFisher Scientific, Burnaby, BC). Briefly, 25 µL aliquots of swab supernatants or bovine serum albumin (BSA) standards diluted in 0.9% saline were added to a 96 U-well plate (Greiner Bio-One, Monroe, NC, USA) in technical duplicate. Working reagent (200 µL/well) was added prior to incubation at 37°C for 30 min. The plate was then cooled to room temperature before the optical density at 562 nm (OD562) was measured in a BioTek Synergy H1 plate reader (BioTek Instruments, Inc, Winooski, VT) after a 5 second orbital shake. To account for differences in swab load/biomass that can result from sampling variability, all biomolecule measurements (lactic acid, glycogen/maltodextrin and enzyme activities) were normalized to the total protein concentration of each vaginal swab supernatant (Tables S2–S3, Additional File 2).
D- and L-lactic acid were quantified in cell-free swab supernatants using the sequential method of the D-/L-Lactic Acid Rapid Assay Kit (Megazyme, Darmstadt, Germany). Adapted from the manufacturer’s instructions for a 96 well plate format, 50 µL of D-/L-lactic acid standards and swab supernatants were aliquoted into a 96 U-well plate (Greiner Bio-One) in technical duplicate. Samples and standards were mixed with D-glutamate-pyruvate transaminase (2 µL/well), NAD+ (9 µL/well), buffer (45 µL/well) and water (83 µL/well), and the background absorbance (OD340) was measured in a BioTek Synergy H1 plate reader. D-lactic acid was quantified by adding D-lactate dehydrogenase (5 µL) to each well and measuring OD340after a 20 min incubation following a 5 second orbital shake. Next, 5 µL of L-lactate dehydrogenase was added to each well of the same plate and the same steps were followed for quantification of L-lactic acid. Total lactic acid was considered the sum of D- and L-lactic acid.
Glycogen and maltodextrin
The specificity of a fluorometric glycogen assay kit (BioVision, Milpitas, CA) was evaluated by testing 3.2 µg/mL solutions of glycogen (BioVision), maltodextrin (Sigma-Aldrich), maltotriose (Sigma-Aldrich, St. Louis, MO) and maltose (ThermoFisher Scientific) (Figure S1, Additional File 1). These solutions, alone and in combination, were then used to validate that size exclusion improves the kit’s specificity for glycogen. Carbohydrate solutions (50 µL) were diluted in 250 µL of HyPure H2O (Cytiva Life Sciences, Marlborough, MA) and added to a 100 kDa MWCO Amicon Ultra-0.5 centrifugal filter unit (Millipore Sigma, Oakville, ON). Glycogen (concentrate) and maltodextrin/glucose (filtrate) were separated by centrifuging at 14,000 x g for 2.5 min at room temperature. After centrifugation, the volume of each concentrate/filtrate was adjusted to 300 µL with HyPure H2O. Filtrate and diluted concentrate (1:100) were added to a black optical bottom 96 well plate (10 µL/well; Greiner Bio-One) in technical duplicate and each fraction was processed separately in the presence or absence of hydrolysis enzymes according to the manufacturer’s instructions. Fluorescence (excitation/emission at 535/587 nm) was measured using a BioTek Synergy H1 plate reader. Background glucose (measured by not adding hydrolysis enzymes to the concentrate and filtrate) was subtracted from the corresponding hydrolyzed sample to measure glycogen and maltodextrin, respectively. Vaginal swab supernatants were processed using the same procedure.
Glycogen degrading enzyme activity
Enzyme activity in vaginal swab supernatants and cell pellets was quantified using the α-Glucosidase Activity Assay Kit (Abcam, Waltham, MA, USA), Enzchek Ultra Amylase Kit (Invitrogen, Carlsbad, CA, USA), the α-amylase Ceralpha assay kit (Megazyme, Wicklow, Ireland) and the pullulanase PullG6 Kit (Megazyme) per the manufacturers’ instructions. The specificity of the Enzchek Ultra Amylase kit was tested using control enzymes from the α-Glucosidase Activity Assay Kit (Abcam), an αmylase kit (Abcam) and the pullulanase assay kit (Megazyme). The pullulanase (Megazyme) kit was adapted for a 96 well plate. Briefly, pullulanase control enzyme dilutions (Bacillus licheniformis pullulanase) were used to generate a standard curve according to manufacturer instructions. These, along with swab supernatants were incubated in 1.5 mL tubes for 10 min at 40°C. In parallel, the PullG6 substrate (50 µL/well) was incubated in a 96 U-well plate (Greiner Bio-One) that was sealed to prevent evaporation. The enzyme standards (50 µL/well) and swab supernatant samples (75 µL/well) were added to the PullG6 substrate plate in duplicate wells and incubated for 30 min at 40°C. Stopping reagent (2% weight/volume Tris buffer pH 9.0) was added to each well (100 µL/well) and OD400was measured in a BioTek Synergy H1 plate reader. Cell-associated enzyme activity was evaluated in swab pellets that had been resuspended in 50 µL of 0.9% NaCl; samples 2, 7, 13 and 17 did not have adequate sample remaining and were excluded. Reported measurements account for the volume of swab eluate used to generate the pellet, allowing for direct comparison to the activity in swab supernatants.
Human α-amylase concentration
The concentration of host α-amylase in swab supernatants was measured using a pancreatic α-amylase sandwich ELISA (Abcam, Toronto, ON, CA). The α-amylase standard (50 µL) and swab supernatants (75 µL) were added to the provided microplate in technical duplicate and the assay performed according to manufacturer’s instructions. A BioTek Synergy H1 plate reader was used for colorimetric quantification (OD450) after a 5 second orbital shake.
DNA extraction and quantitative polymerase chain reaction (qPCR)
Genomic DNA was extracted from vaginal cell pellets using the DNeasy PowerSoil Pro QIAcube HT Kit on a QIAcube HT instrument (Qiagen, Germantown, MD). Sham swabs were included to monitor for contamination of swab prep and extraction materials. All extracts were assessed for the presence of PCR inhibitors by spiking in and quantitatively amplifying a synthetic jellyfish aequorin gene fragment . Total bacterial load was assessed using broad range 16S rRNA gene qPCR with the following primers/probes: 16SBR_337F, 5'-ACTCCTRCGGGAGGCAGCAG-3'; 16SBR_793R,
5'-GGACTACCVGGGTATCTAATCCTGTT-3'; 16SBR_511R_FAM/TAM, 5'-/56-FAM/CGTATKACCGCGGCTGCTGGCAC/36-TAMSp/-3' (procured as PrimeTime qPCR Probe assays with 2:1 primers:probe ratio, Integrated DNA Technologies, Coralville, IA). An Escherichia coli 16S rRNA gene fragment was cloned using a TOPO™ TA Cloning Kit (ThermoFisher Scientific) and the resulting plasmid used to generate the standard curve. Standard curve and extract reactions were set up in technical duplicate using TaqMan™ Environmental Master Mix 2.0 (ThermoFisher Scientific) and run on a StepOne Plus Real-Time PCR System (Applied Biosystems, ThermoFisher Scientific) with the following parameters cycled 40 times: 15 second denature at 95°C, 1 min anneal/extend at 60°C. 16S rRNA gene copies detected in a given volume of DNA extraction were volumetrically converted to copies/mL swab eluate and then to copies/swab.
Shotgun metagenomic sequencing
Genomic DNA from vaginal cell pellets and the ZymoBIOMICS Microbial Community Standard (Zymo Research, Irvine, CA) was cleaned and concentrated with either sparQ PureMag beads (QuantaBio, Beverly, MA) or the Genomic DNA Clean & Concentrator-10 Kit (Zymo Research, Irvine, CA), according to the manufacturer’s instructions. Shotgun libraries were prepared from 50 ng genomic DNA using the QIAseq FX DNA Library Kit (Qiagen, Germantown, MD), according to manufacturer’s instructions, with fragmentation times of 10–15 min, 6–8 cycles of PCR amplification and sample-specific linkage to combinatorial dual indices. SparQ beads were used for all bead clean-up steps and to fine-tune library size distribution with peak fragment sizes of 300–500 base pairs (bp). Libraries were pooled at 3 nM and sequenced on a HiSeq X instrument (Illumina, San Diego, CA) to generate 150 bp paired-end reads (Psomagen, Rockville, MD).
Metagenomic data processing and taxonomy classification
Raw Illumina reads were subjected to adapter trimming, quality filtering and human sequence decontamination using a custom, publicly-available snakemake pipeline (metqc: https://github.com/SycuroLab/metqc). In brief, adapters were removed using Cutadapt version 1.16  and 3’/5’ end trimming and quality filtering were performed with PRINSEQ version 0.20.4 . Clean reads were filtered to remove human DNA contamination using Best Match Tagger (bmtagger) version 3.101 with version 19 of the human reference genome . Microbial taxonomic profiling was performed with the clean non-human reads using MetaPhlAn version 3.0.1 . Bacterial species exhibiting a relative abundance ≥0.1% and prevalence ≥10% were retained in the relative abundance plot created using the ggplot2version 3.3.3 library package  in R version 4.0.5 . Hierarchical clustering was applied to a Bray-Curtis dissimilarity matrix calculated on MetaPhlAn3 read count data using the vegdist(bray) distance function, with specification of the UPGMA method using the hclust(average) R function, in vegan version 2.5-7 . Participant metadata was incorporated into the relative abundance plot in R using pheatmapversion 1.0.12  with colour palettes generated using the RColorBrewerversion 1.1-2 . Mathematical analyses of bacterial abundance were performed on a read count table that had been adjusted for barcode hopping as follows. First the relative abundance of highly abundant/prevalent taxa unexpectedly found in the Zymo control sample due to low level barcode hopping was calculated (L. crispatus and L. iners, each representing <0.005% of the control sample relative abundance). A number of reads representing the equivalent relative abundance was then subtracted from each sample containing those two taxa; this approach scales the adjustment to the total number of reads in each sample. The net effect is that the total read count for each of these species is slightly reduced, bringing some samples that had very low read counts for these two species to 0 (undetected). Unadjusted and adjusted count tables are provided in Tables S4–S5, Additional File 2.
Metagenome-assembled genome binning
Metagenome-assembled genome (MAG) binning was accomplished using the metaWRAP pipeline version 1.3 . Metagenomic assemblies were generated for each sample with SPAdes version 3.13.0  and evaluated for quality using QUAST version 5.0.2 . Metagenome binning and refinement was performed using MetaBAT2 version 2.12 , MaxBin2 version 2.2.7  and CONCOCT version 1.0.0  software. MAGs were retained if CheckM version 1.1.3  quality assessment reported ≥50% completeness and ≤10% contamination. MAGs were annotated using prokka version 1.13  and classified to the species level using the genome taxonomy database toolkit (GTDB-tk) version 1.5.0 .
Presence and sequence heterogeneity of L. crispatus pulA
To assess the presence and allelic content of L. crispatus type I pullulanase (pulA) genes from women around the globe, whole genome assemblies from urogenital L. crispatus isolates were selected from the NCBI unfiltered prokaryotes list (downloaded August, 2021) by parsing the GenBank metadata for “Homo sapiens” and manually investigating the isolation source and referenced publications. A total of 86 L. crispatus urogenital tract isolate genomes were identified and verified by CheckM to be ≥90% complete and ≤10% contaminated. We also downloaded 17 publicly available L. crispatus MAGs binned by Pasolli et al. from human vaginal metagenomes [74-76] and annotated them as described above for KGHS MAGs; only MAGs meeting our minimum quality thresholds of ≥50% completeness and ≤10% contamination were considered. Type I pullulanase genes were assessed in each isolate genome and MAG using the BLASTn program of BLAST+ version 2.9.0 [77, 78]. The full-length and functionally characterized pulA from L. crispatus isolate genome RL03 (NCBI accession NZ_NKLQ01000285)  was used as the query, and the resulting hits were filtered to include those with ≥95% sequence identity. pulA genes with <100% sequence identity or <100% query coverage were subject to further analysis using SnapGene® version 5.3.0 (GSL Biotech LLC, San Diego, CA) and InterProScan (EMBL-EBI, Cambridgeshire, UK) to evaluate potential functional consequences of any observed insertions or deletions (i.e. frameshift mutations or domain/signal peptide loss) or single nucleotide polymorphisms (SNPs, i.e. those that result in nonsense mutations). We assessed L. crispatus genome coverage and that of the pulA locus in each L. crispatus-containing KGHS metagenome by mapping clean non-human reads from each sample to the L. crispatus FDAARGOS_743 reference genome (NCBI accession GCF_009730275.1)  using bowtie2 version 2.3.5  with default parameters. Depth of coverage was calculated using samtools version 1.11  and the pulA coding sequence region specified as positions 1986082–1989861 on the closed chromosome of the L. crispatus FDAARGOS_743 reference genome (NCBI accession NZ_CP046311.1). Coverage plots were generated in R using ggplot2 version 3.3.3 . Finally, a targeted PCR assay was used to confirm L. crispatus pulA gene presence in the KGHS samples. Using 20 ng of DNA extracted from the vaginal cell pellets, a near full-length pulA gene fragment was amplified with Phusion Hot Start Flex 2X Master Mix (New England Biolabs, Ipswich, MA, USA) using the following primers: PulA-74F, 5’-GTACTTCAGCTATTATGAGTCTTTGG-3’ and PulA-3651R, 5’- GCGCTTTCCATTCTTCTTAACT-3’. Based on the RL03 L. crispatus pulA sequence, this assay generates a 3578 bp product. Select products were sequenced using the Sanger method by the Centre for Health Genomics and Informatics (CHGI) at the University of Calgary and analyzed in SnapGene®. To interrogate the location, type and predicted functional consequences of pulA mutations, an alignment of each pulA sequence containing a mutation, along with the RL03 reference sequence, was created in SnapGene® using MUSCLE 3.8.1551 with default parameters.
To assess phylogenetic relationships amongst L. crispatus genomes and MAGs, cpn60 phylogenetic marker gene sequences were extracted from Prokka  genome annotation files using the gene product identifier “60 kDa chaperonin” and coordinates. Multiple sequence alignment of 115 full-length L. crispatus cpn60 sequences and a Lactobacillus acidophilus cpn60 sequence included as the outgroup (strain La-14, NCBI accession GCF_000389675.2:NC_021181.2) was performed using MUSCLE  with default parameters. A maximum likelihood (ML) tree was constructed in RAXmL using the GTRCAT model for nucleotide sequence alignments [84, 85]. Branches with less than 50% reproducibility over 1000 bootstrap iterations were collapsed .
To account for differences in swab load, all biomolecule measurements were normalized to the total protein content of each swab supernatant or cell pellet (ng). Measurements that fell below the assay limit of detection were replaced with a value equal to half the lower limit of the standard curve. Differences in biomolecule abundance or activity across sample groups were assessed with a Student’s t-test or a Mann-Whitney U-test. Relationships between biomolecules were assessed with a generalized linear regression model and Spearman’s correlation. Statistics were performed in GraphPad Prism (GraphPad, San Diego, CA) or Stata 15 (StataCorp, College Station, TX) and graphs were prepared in GraphPad Prism or R Studio . GraphPad was used to assess data variance (F-test) and normality (Shapiro-Wilk test), and statistical significance was evaluated with an alpha value ≤0.05 and the following significance levels: *p≤0.05, **p≤0.01, ***p≤0.001, ****p≤0.0001.