Human microbiome studies: context and sampling. Human microbiome studies were done in the context of ZOE 2.0, a genetic epidemiologic study of early childhood oral health among a community-based sample of preschool-age children in North Carolina (NC)52. In brief, between 2016 and 2019, 8,059 children ages 36–71 months who attended public preschools in NC were enrolled in the study and 6,404 of them underwent comprehensive clinical examinations by trained and calibrated dental examiners. Data on childhood dental caries experience were collected using modified International Caries Detection and Classification System (ICDAS) criteria53. Two supragingival (i.e., “dental plaque”) biofilm samples were collected during the clinical encounters immediately prior to dental examinations, which took place before or at least 30 minutes after snack or breakfast. The plaque sample that was carried forward to microbiome analyses was obtained using sterile toothpicks from the facial/buccal surfaces of primary teeth in the upper-right quadrant, i.e., Universal tooth numbering system: #A, #B, #C, #D, and #E; FDI tooth numbering system: #55, #54, #53, #52, and #51 (Extended Data Fig. 3). Upon collection, plaque samples were placed in RNAlater TissueProtect 1.5mL tubes and were frozen at -20oC on-site until transferred to university core biospecimen processing facility for further processing or long-term storage at -80oC. Detailed information regarding sample collection, storage, processing, nucleic acid extraction, and sequencing has been reported in a recent protocol publication54.
Participants and phenotyping. We carried forward to whole genome shotgun sequencing (WGS, metagenomics/MTG) and RNA sequencing (RNA-seq, metatranscriptomics/MTX) supragingival samples of the first 300 ZOE 2.0 participants, 50% with and 50% without person-level dental caries experience defined at the “established” caries lesion detection threshold, ICDAS≥3)46. This lesion threshold corresponds to macroscopic tooth structure loss, i.e., a “cavity”. The ZOE 2.0 participants (mean age=52 months) formed the ‘discovery’ sample whereas the ‘replication’ sample comprised 116 similar-aged children (mean age=55 months), members of the same study population (i.e., enrolled in NC public preschools) examined under virtually identical conditions (i.e., by one clinical examiner) during the parent study’s pilot phase. For the purposes of the present study, we quantified caries experience using the most sensitive clinical criteria including the enumeration of early-stage caries lesions (i.e., at the ICDAS≥1 threshold) according to the recent international consensus definition of early childhood caries (ECC)8. This was done both locally (i.e., within the five tooth surfaces where plaque biofilm was harvested from) and at the person-level (i.e., in the entire dentition, comprising all 88 primary tooth surfaces). We considered the surfaces where plaque was collected from as the most informative in terms of microbiome taxonomy and thus localized caries experience was the primary clinical trait in all analyses. Nevertheless, we posited that the taxonomy of microbiome biofilm in those areas is likely to also be informative for the condition of one’s entire dentition--therefore, we considered a secondary ‘person-level’ caries experience trait as a secondary clinical trait. Estimates of these caries experience traits, in the discovery and replication samples, as well as participants’ demographic information are present in Extended Data Table 5.
Sequencing and alignment. Total nucleic acid was quantified using QuantIT® PicoGreen®. 5 ng of genomic DNA was processed using the Nextera XT DNA Sample Preparation Kit (Illumina). Target DNA was simultaneously fragmented and tagged using the Nextera Enzyme Mix-containing transposome that fragments the input DNA and adds the bridge PCR (bPCR)-compatible adaptors required for binding and clustering in the flow cell. Next, fragmented and tagged DNA was amplified using a limited-cycle PCR program. In this step index 1(i7) and index 2(i5) was added between the downstream bPCR adaptor and the core sequencing library adaptor, as well primer sequences required for cluster formation. The thermal profile for the amplification had an initial extension step at 72°C for 3 min and initial denaturing step at 95°C for 30 sec, followed by 15 cycles of denaturing of 95°C for 10 seconds, annealing at 55°C for 30 seconds, a 30 second extension at 72°C, and final extension for 5 minutes at 72°C. The DNA library pool was loaded on the Illumina platform reagent cartridge (Illumina) and on the Illumina instrument. Sequencing output from the Illumina HiSeq 4000 2x150 was converted to fastq format and demultiplexed using Illumina Bcl2Fastq 2.20.0 (Illumina, Inc. San Diego, CA, USA.). Quality control of the demultiplexed sequencing reads was verified by FastQC (Babraham Institute. Cambridge, UK). Adapters were trimmed using Trim Galore (Babraham Institute. Cambridge, UK). The resulting paired-end reads were classified with Kraken255 and Bracken 2.556 using a custom database including human, fungal, bacterial, and the expanded Human Oral Microbiome Database (eHOMD) genomes57 to produce an initial taxonomic composition profile. All reads identified as ‘host’ were eliminated. Paired-end reads were joined with vsearch 1.10.258. Any remaining adapter reads were trimmed again using Trim Galore. Estimates of gene family, path abundance, and path coverage were produced from the remaining reads using HUMAnN359 based on taxonomic estimates from MetaPhlAn 360,61.
RNA isolation was performed using Qiagen RNeasy Mini Kit (Cat. No. / ID: 74104) and RNA was quantified using Nanodrop1000 (ThermoFisher, Waltham, MA). To generate MTX data via RNA-seq., sequencing output from the Illumina HiSeq 4000 2x150 platform was converted to fastq format and demultiplexed using Illumina Bcl2Fastq 2.20.0 (Illumina, Inc. San Diego, CA, USA). Quality control of the demultiplexed sequencing reads was verified by FastQC (Babraham Institute. Cambridge, UK). Adapters were trimmed using Trim Galore (Babraham Institute. Cambridge, UK). The resulting paired-end reads were classified with Kraken255 and Bracken 2.556 using a custom database including human, fungal, bacterial, and the expanded Human Oral Microbiome Database (eHOMD) genomes57 to produce an initial taxonomic composition profile. All reads identified as host were eliminated. Paired-end reads were joined with vsearch 1.10.258. The resulting single-end reads were again trimmed of any remaining adapters using Trim Galore. HUMAnN 3.059 was used to generate gene family and pathway-level data based on taxonomic estimates from the MetaPhlAn 360,61 metagenomic analysis. Additionally, MTX gene expression analyses were performed for the four “top species” that were tested experimentally. Reads classified by Kraken2 as belonging to each of the four top species and any strain thereof were extracted by species, individually by bacterial species aligned to the relevant reference transcriptome with STAR62, and subsequently quantified via Salmon63.
Quality control procedures. An overview of quality control, scaling, and filtering of taxa is presented in Extended Data Fig. 4. After removal of viral sequences, the remaining data were first arranged in a reads-per-kilobase (RPK) format, then rescaled into transcripts-per-million (TPM), and finally rescaled close to the averaged-over-subject RPK level. The average RPK in MTG in the discovery sample was 8,004,958, and the rescaled TPM was set as 8,000,000 for each subject. To facilitate discovery and inference, species with relative abundance less than 10-5 or with prevalence rate less than 10% were excluded from all taxonomic discovery analyses. Out of 6,411 non-viral species, 5,990 low-abundance and 2,935 low-prevalence (including 2,932 both low-abundant and low-prevalence) taxa were filtered, retaining 418 taxa for all downstream MTG analyses. MTX data in the discovery sample, as well as MTG and MTX data in the replication sample were all processed in a similar fashion. Total RPKs per subject were 11,014,832, 5,077,759, and 2,739,606 on average, and the total rescaled TPMs per subject were on average 11 million, 5 million, and 3 million, respectively. Numbers of retained taxa after applying abundance and prevalence filters were 385 for the MTX in the discovery sample, 422 and 397 for the MTG and MTX data in the replication sample, respectively. We refer to these sets of retained taxa as “core.” TPM normalization was not done for the targeted MTX of the 4 top species, due to limited diversity of the sample. Average RPK per sample was 72,474. We retained genes with average >20 reads for downstream differential expression and gene-gene interaction analyses. Using this filter, out of 9,103 total genes available for these 4 top species, 542 genes were retained: 47 for S. mutans, 39 for S. sputigena, 8 for P. salivae, and 448 for L. wadei.
Identification of species significantly associated with caries experience. We used log-normal linear models to test the association between species’ differential abundance in MTG and MTX data and quantitative measures of caries experience. The models included terms for phenotypes of interest (i.e., either localized or person-level caries quantitative experience), batch effects (i.e., the first sequencing batch included 52 samples and the second batch the remaining 248 samples), age at enrollment (measured in months), and race/ethnicity (reported by legal guardians and categorized as white-non Hispanic, African American-non-Hispanic, and others including Hispanics), and a unity was added to the rescaled TPM abundance data before log-transformation with base 2. Presence of each species in MTG data was controlled for in models examining differential expression (i.e., abundance in MTX data). A false discovery rate (FDR) correction for multiple testing was applied using the Benjamini-Hochberg procedure64 for each of the four models (caries traits and abundance / expression).
We used strict criteria for the identification of bacterial species associated with caries experience and required taxa to be FDR-significantly differentially abundant for quantitative disease experience defined both locally and at the person-level, in MTG and MTX data (i.e., 4 models), in the discovery sample (n=300). To reduce the potential for false discovery, we sought for additional evidence of association in an independent sample, i.e., we examined the replication of the previously identified associations in a sample of 116 similarly aged participants from the same population as the parent study (i.e., public preschools in NC), similar clinical and microbiome data. Thus 4 more models were created, for localized and person-level disease, in MTG and MTX data. As evidence of replication, we considered, in order of ascending importance, directional consistency of the estimate of association, nominal significance, or FDR-level significance in the replication sample. Species that were FDR-significant in all 4 models in the discovery sample and were at least nominally significant for localized disease experience in MTG data were termed “significant species”. This set of species with high-confidence evidence of association from multiple traits, MTG and MTX data, and from all 416 study participants were prioritized for reporting and were candidates for consideration in the experimental validation pipeline.
Inter-species correlations and pathways involving species significantly associated with caries experience. To provide initial insights into the inter-relationships of the 16 significant species, we examined their pairwise correlation patterns in health and disease. For this purpose, we used Pearson correlation coefficients between model residuals that were generated for each of the 16 significant species. These log-normal models had the same specifications with models used in the main analyses (i.e., adjusted for batch effects and age) with the addition of an adjustment for disease experience. Species’ mean correlations with each of the other 15 significant species were examined between health (i.e., no localized disease experience) and disease (i.e., any localized disease experience), and in MTG and MTX.
To identify pathways significantly associated with caries experience in the biofilm transcriptome, we examined pathway and pathway-species MTX data that were prepared in RPK format as described above. Total RPKs per sample were on average 383,119, and the RPK data were transformed into a TPM format with a scale of 400,000. For this analysis, we departed from the top 100 pathways in MTX data in terms of overall abundance and prioritized the top 30 pathways that had representation from at least one of the 16 significant species. Each of these 30 pathways, summed over all species including those unclassified, was tested using the log-normal model with the same set of covariates as in the main discovery analysis and the addition of a unity before log2 transformation. An FDR correction for testing of 30 pathways was applied using the Benjamini-Hochberg procedure. Using this procedure, we identified pathways whose expression in MTX was significantly associated with caries experience. Additionally, we examined the representation of the 16 significant species in these pathways as percent of significant over all species involved in each pathway.
Shortlisting of species for virulence assays and experimental testing. We departed from a list of 16 species that showed strong and replicable evidence of association with dental caries experience (Extended Data Table 1). To select a shortlist of candidates that could be fully characterized in virulence, biofilm, and possibly in vivo studies in this study, we considered species’ statistical evidence of association in the discovery sample (i.e., p-value), evidence replication in the independent sample, representation of different genera (i.e., selected one species per genus), and availability of clinical isolates. Based on these criteria, from the list of 16 significant species we prioritized Streptococcus mutans (the known, well-established pathogen), and Selenomonas sputigena, Prevotella salivae, and Leptotrichia wadei (as 3 new candidates). These 4 taxa, referred to as “top species”, were carried forward to in vitro virulence assessments and biofilm characterization and served as candidates for further in vivo colonization and cariogenicity studies.
Microorganisms and growth conditions used in laboratory validation. Streptococcus mutans UA159 (ATCC 700610), Selenomonas sputigena ATCC 35185, Leptotrichia wadei JCM 16777, Prevotella salivae JCM 12084 were used in in vitro and in vivo studies. Bacteria were grown in Brain Heart Infusion Supplemented (BHIS) medium (Brain Heart Infusion broth supplemented with 5 g/L yeast extract, 5 mg/L Hemin, 1 mg/L Vitamin K1, and 0.5 g/mL L-cysteine) at 37 °C to exponential phase in an anaerobic chamber (Anaerobe Systems). For in vitro biofilm studies, a saliva-coated hydroxyapatite disc (2.7 ± 0.2 cm2; Clarkson Chromatography Products) was placed in a vertical position to mimic the tooth-enamel surfaces of human teeth. Single or mixed biofilms were grown on the apatitic surfaces in BHIS medium supplemented with 1% sucrose, the primary dietary sugar associated with tooth decay, to simulate the cariogenic condition in ECC patients. Biofilms were grown on saliva-coated hydroxyapatite surfaces for 24 h to allow the establishment of single or mixed-species communities. For real-time live imaging, a green fluorescent protein (GFP)-tagged S. mutans UA159 strain (S. mutans UA159 Ef-Tu-ngGFP)65 was used. Biofilm EPS glucan matrices were labeled via supplementing the culture medium with 1 μM Alexa Fluor 647 dextran conjugate (Molecular Probes) during biofilm growth. This labeling method is highly specific for S. mutans-derived α-glucans since the fluorescently labeled dextrans serve as primers for streptococcal glycosyltransferases and are directly incorporated into glucans during biofilm EPS synthesis.
Top species’ virulence and metabolic profile assessment
Acid tolerance and acidogenicity. Acid tolerance tests were performed using Brain Heart Infusion (BHI) broth (Anaerobe Systems) with pre-adjusted pH. Lactic acid (13.42 M) was used to adjust the pH of the medium (ranging between 2.9 and 6.5). Each of the top species was serially diluted to 107 cells (CFU)/mL. One hundred microliters (100 µL) of bacterial suspension were transferred to 900 µL pH adjusted medium to study the growth of single species under acidic conditions. In parallel, each of the new species (S. sputigena, L. wadei, or P. salivae) was mixed with S. mutans (equal proportion of single species) to examine the growth of mixed species combinations. The cultures were incubated in the anaerobic chamber at 37 °C for 48 h. Bacterial growth was quantified using a microplate reader (SpectraMax M2e) by measuring optical density values at 610 nm, whereas the pH values were measured using a pH meter (FiveEasy Benchtop F20 pH/mV Meter, Mettler Toledo). Both the lowest pH values at which each bacterial species survived (detectable growth) and the final pH values of the culture (after 48h) were recorded to assess acid tolerance. For acidogenicity (acid production), we used standard glycolytic pH-drop assay. Bacterial cells (single or mixed-species, as described above) were incubated in salt solution (50 mM KCl, 1mM MgCl2.6H2O) with 1% glucose and allowed to produce acids over time. The decrease in pH was measured using a pH meter every 1 h over a period of 12 h and the proton production rate was calculated to compare acidogenicity of different species either alone or in combination with S. mutans.
Metabolic profiling. Top species’ metabolic profiles were measured using real-time isothermal microcalorimetry66. S. mutans, S. sputigena, L. wadei, P. salivae, and P. oulorum were transferred to BHI broth and incubated in the anaerobic chamber at 370C for 48 hours. Each culture was serially diluted to 107 cells, and 60 µl of each bacterial suspension was transferred, in triplicate, in titanium vials (calWell; SymCel) containing 540 µl of pre-reduced BHI. Real-time heat production, proxying metabolic activity, was measured using a calScreener™ microcalorimeter (SymCel Sverige AB, Stockholm, Sweden) in a 48-well plate (calPlate™) as previously described67. The full processing of the samples and plate preparation were performed inside the anaerobic chamber, to maintain the anaerobic condition for optimal bacterial growth. Heat and corresponding energy data were quantified with calView™ software (Version 1.0.33.0, © 2015 SymCel, Sverige AB). The instrument was set and calibrated at 37 °C, with all handling and set-up done according to the manufacturer’s recommendations.
Biofilm live imaging. Biofilm live imaging was performed based on our established fluorescence labeling and confocal imaging protocols optimized for oral biofilms with some modifications68. Biofilms were dip-washed three times in phosphate buffered saline (PBS, pH 7.1) to remove any loosely bound microbes from the surface. To enhance the GFP fluorophore development in GFP-tagged S. mutans cells, we performed aerobic fluorescence recovery immediately before imaging, following the protocol previously reported69. Biofilms were counterstained with Syto 60 (Molecular Probes), which labeled all bacterial cells. Super-resolution live imaging was conducted at 37 °C using a 40× (numerical aperture = 1.2) water immersion objective on a Zeiss LSM800 upright confocal microscope with Airyscan. For real-time imaging of bacterial motility, multi-channel confocal images were taken at a 2.6-second interval for 30 s.
Computational biofilm image analysis. We then used a fluorescence subtraction method to analyze the biofilm spatial structuring (positioning of different species and EPS across the 3D biofilm structure) as detailed previously36. In brief, we applied channel subtraction using the Image Calculator in ImageJ Fiji (https://imagej.net/Fiji) to calculate the fluorescence signal from the new species in the mixed-species biofilm, using the following equation: ChNew species = ChAll bacteria - ChS. mutans, where ChAll bacteria is the channel of all bacteria (SYTO60), and ChS. mutans is the channel of S. mutans (GFP). Computational image processing and quantitative analysis were performed using BiofilmQ software (https://drescherlab.org/data/biofilmQ), an image analysis toolbox optimized for biofilms37. After image segmentation using Otsu algorithm, we conducted a cube-based object declumping in BiofilmQ that dissected the entire biofilm into small cubic volumes (cube size = 2 μm). This function allows analyses of structural properties within biofilm subdomains with spatial resolution since each cube has a unique spatial coordinate in 3D. Parameters including local shape volume, relative abundance, and intensity were calculated for each channel within the cubes. For co-localization analysis, we used Mander’s overlap coefficient in BiofilmQ to quantify the degree of spatial proximity of two different fluorescence signals in relation to each other. Two Manders’ coefficients were calculated: 1) the new species with S. mutans and 2) the new species with EPS. To track single-cell bacterial motility in real time, we performed computational single-particle tracking and generated time-resolved trajectories using the TrackMate plugin in ImageJ Fiji70. Biofilm visualization was performed using maximum intensity projection and 3D surface rendering in ImageJ Fiji.
Targeted bacterial gene expression and gene-gene interaction analyses. We used a targeted MTX approach to test S. mutans, S. sputigena, L. wadei, and P. salivae gene expression. Reads were aligned to these species reference genomes using STAR-Salmon62,63. Where multiple strains were available, results were merged to the species level. The association of gene expression with caries experience was then tested for each gene using log-normal models adjusting for batch effects, age, and race/ethnicity, and applying an FDR multiple testing correction for each species. Additionally, we additionally examined gene-gene interactions between S. mutans and S. sputigena—the two species that demonstrated enhanced acidogenesis and unique biofilm structure when co-cultured. To this end, the same log-normal models including gene-gene interactions were used including pairwise combinations of all filtered genes from these two species, applying an FDR multiple testing correction.
In vivo rodent model of childhood caries. Bacterial colonization on teeth and their impacts on disease onset were assessed on an established rodent caries model as detailed elsewhere with some modifications40,71. In brief, 15-d old female Sprague-Dawley rat pups (specific-pathogen-free grade) were purchased with dams (8 pups per dam) from Harlan Laboratories (Indianapolis, IN, USA). Upon arrival, animals were pre-screened for oral infection of S. mutans and S. sputigena by oral swabs and real-time polymerase chain reaction (qPCR) and were determined not to be infected with either organism. Oral swabs (FLOQSwab, COPAN Diagnostics, Murrieta, CA, USA) were taken from each of the animals and bacterial DNA were extracted using DNeasy PowerLyzer Microbial Kit (Qiagen, Valencia, CA, USA). Species-specific qPCR primer sets were used for microorganism detection as follows: S. mutans: Forward 5’ ACCAGAAAGGGACGGCTAAC 3’, Reverse 5’ TAGCCTTTTACTCCAGACTTTCCTG 3’; S. sputigena: Forward 5’ GGTCAGCCTTATCAGTTCCGTT 3’, Reverse 5’ GGCGAGCTTTCAGCAATCTTAG 3’; all bacteria: Forward 5’ TCCTACGGGAGGCAGCAGT 3’, Reverse 5’ GGACTACCAGGGTATCTAATCCTGTT 3’.
The pups were randomly assigned into four groups that received different bacterial infections and were housed separately: (1) S. mutans alone; (2) S. sputigena alone; (3) S. mutans plus S. sputigena; (4) control without S. mutans or S. sputigena infection. Animals were inoculated daily using cotton oral swabs with actively growing cultures of S. mutans and/or S. sputigena (~108 CFU/mL in BHIS) at the age of 19-23 d (five doses in total). Each of the infected groups were confirmed for its designated microbial infection at 21 d, 24 d and 30 d by oral swabs and qPCR, while the control group remained free of either S. mutans or S. sputigena. No cross-contamination was detected throughout the experiment. All animals were provided with the National Institutes of Health cariogenic diet #2000 and 5% sucrose water ad libitum to mimic the cavity-promoting diet in childhood caries. The experiment proceeded for three weeks, at the end of which the animals were euthanized using carbon dioxide. The jaws were dissected and processed for caries scoring of teeth according to Larson’s modification of Keyes’ system72. Caries scores were determined by a calibrated examiner. Investigators were masked to experimental group (i.e., infection) allocations during the infection, sampling, and assessment stages, by using color-coded samples. In vitro and in vivo experimental data were presented as mean ± standard deviation. Data were subjected to Student’s pairwise t-test or analysis of variance (ANOVA) with post-hoc Tukey HSD test to account for multiple comparisons. Differences between groups were considered statistically significant when p < 0.05.
Ethical approval. Human observational data and analyses received approval (#14-1992) from the University of North Carolina-Chapel Hill Office of Human Research Ethics Institutional Review Board on September 18, 2014. Legal guardians of all children provided written informed consent for participation in the study. The in vivo experimental study was reviewed and approved by the Institutional Animal Care and Use Committee of the University of Pennsylvania (IACUC#805735). All research was performed in accordance with the Declaration of Helsinki.
Reporting summary. Further information on research design is available in the Reporting Summary linked to this article.
Data availability
Microbiome taxonomy (MTG and MTX) and pathway information for the entire biofilm microbial community, as well as targeted MTX data for the four top species used in this study have been deposited and are freely accessible alongside metadata (i.e., demographic and clinical phenotype information) via the Carolina Digital Repository and accession number 5d86p890x: https://cdr.lib.unc.edu/concern/data_sets/5d86p890x Raw sequence data for ZOE 2.0 have been deposited as part of dbGaP accession phs002232.v1.p1 and are pending release. ZOE pilot study sequence data have been made available via the National Center for Biotechnology Information Sequence Read Archive (NCBI SRA) and BioProject accession number: PRJNA843091.
Code availability
The scripts used to perform this analysis can be found at https://github.com/Hunyong/ZOE_metagenomics_2022