Study cohort We used data from the Center for Oral Health Research in Appalachia 2 study (COHRA2) (44). COHRA2 recruited White, pregnant women between 2011 and 2015 from Pennsylvania and West Virginia. Healthy women who were in the 12th to 29th week of pregnancy, of European descent, over 18 years of old, fluent in English, and with a singleton pregnancy were eligible for inclusion. Women and their babies were followed longitudinally through the early years of the baby’s life. Women were excluded if they had tuberculosis, were immunocompromised, thought they might soon leave the general regions of West Virginia or southwestern Pennsylvania, or did not have a reliable telephone contact. Mother-child pairs also were excluded from the study if the child was delivered before the 35th week of pregnancy or if the mother or child developed a serious medical condition.
Participants completed in-person visits when the child was 2-months and 12-months old, then yearly thereafter. Mother-child pairs from the Pennsylvania site had additional in-person visits at birth and when the child’s first primary tooth erupted. At in-person visits mothers and children underwent a comprehensive dental assessment by a trained and calibrated dental professionals (training and calibration described in detail in Neiswanger et al (44)); participants were asked not to eat or drink for 2 hours prior to the examination. The examination included caries assessment via the PhenX Toolkit Dental Caries Experience Prevalence Protocol (http://www.phenxtoolkit.org/, protocol number 080300) which allows for the decayed, missing, and filled tooth count to be calculated either including or excluding white spots. The dental examination also included collection of microbial samples from saliva, plaque and gingival swabs using OMNIgene Discover kits (OM-501 or 505 DNA Genotek); only saliva and plaque samples were used in this analysis. Saliva was collected via swabs for children too young to spit into a collection tube and via spitting otherwise. Pooled plaque samples were taken with a Stimudent or curette from three intact tooth surfaces (in UNS/FDI notation: 8-buccal/51-buccal, 24-buccal/71-buccal, 31-occlusal/84-occusal or nearby surfaces if these were not intact). Plaque is also taken from tooth surfaces with untreated dental lesions. Salivary pH was also measured at visits where the child was old enough to spit (most by 12-months, all by 24-months).
A 30–45-minute telephone interview was administered to the mothers at approximately 6-month intervals to capture sociodemographic and behavioral data.
Sampling & case definition For this analysis we selected 99 children who had any dental lesions, including white spots (d1mft), at or prior to the 60-month visit as early childhood caries (ECC) cases. The visit in which a child was first identified as having a dental lesion was the incident-visit for that child. We then selected a similar number of children who were free of dental lesions at the same visit as the cases to serve as incidence-density sampled controls (n = 90). Incidence density sampling does not preclude the reselection of a control as a case at later time points; controls can also be selected as controls for multiple cases (Fig. 1) (45). In this analysis, one control was later selected as a case and one control was selected as a control twice (n = 92 control records). Duplicate records of the case/control and control/control children were not used in the supervised random forest: the case/control was only included as a case and the control/control was only included as a control once. In both unsupervised clustering techniques, we did not include duplicate records from these individuals when performing initial clustering but did include them when graphing and testing associations between identified clusters and variables of interest (i.e., in Table 1, Figs. 1–3). The number of total unique individuals in the analysis was 189, with 191 unique person-records (Additional file 3).
All available saliva samples from cases and controls, up to and including the incident-visit saliva sample, were pulled for 16S rRNA amplicon sequencing (Fig. 1). Note that selected individuals occasionally missed visits, did not have a saliva sample available, or had a saliva sample which failed 16S amplicon quality control (Additional file 3). Additionally, we randomly selected a sub cohort of 15 cases presenting with enamel lesions at or after the 36-month visit and 15 corresponding controls. Plaque and saliva samples from the visit of case diagnosis for these 30 individuals were submitted for shotgun metagenomic sequencing.
Laboratory and bioinformatics pipeline for 16S rRNA amplicon metagenomic sequencing Bacterial DNA was extracted from aliquots of saliva. Library preparation and sequencing of the 16S rRNA V4 amplicon was performed by the Michigan Microbial Systems Molecular Biology Laboratory using previously validated protocols (46). DNA extraction was performed using the Eppedorf EpMotion liquid handling system following the Qiagen MagAttract PowerMicrobiome kit protocol. The V4 variable region was amplified from extracted DNA using barcoded dual-index primers and sequenced on the Illumnia MiSeq platform using the MiSeq Reagent Kit V2 500 cycles. Each plate of samples was submitted with a positive mock community control, a DNA extraction kit control, and a negative water control (Additional file 2 Figures S3-4). Reads were processed to amplicon sequence variants (ASVs) using DADA2 (version 1.14.1) (47) and the Human Oral Microbiome Database (HOMD) version 15.2 (48). To identify contaminants, we used the R package decontam (version 1.8.0) (49). We filtered out samples with less than 1000 reads (n = 6 samples lost). Diversity metrics were calculated using the estimate_richness function from the R package phyloseq all ASVs. However, to limit the number of features used in supervised and unsupervised learning, we instituted a prevalence-abundance ASV filter. ASVs which were present in less than 5% of all samples and which represented less than 5% of all sequences in the samples in which they were present were excluded from the analytic subset for supervised random forest and unsupervised clustering techniques (m = 273 ASVs in analytic subset). ASVs were not collapsed at the genus or species level.
Random forest We used the 12- and 24-month visits as inputs for the random forest as these visits preserved a large subset of pre-incident samples. Only non-incident cases and matched controls were used in the random forest: individuals with incident-visit saliva samples were excluded (6 individuals with available samples who were identified as cases or controls at the 12-month visit and 37 individuals identified at the 24- month visit were excluded, total sample size of n = 158 and n = 133). Hellinger transformed ASV counts from the 273 ASVs in our analysis subset were used in the random forest. Using the train function in the R package caret (50), we ran 5 repeats of 10-fold cross validated random forest machine algorithms with 500 trees. We allowed the mtry parameter (number of parameters randomly sampled as candidates at each tree split) to be tuned from a choice of 2, 136, or 271 using the receiver operating characteristic curve; for both the 12-month and 24-month all taxa random forest an mtry parameter of 2 was selected. Area under the receiver operating curve and other evaluation statistics were calculated using the R package MLeval (51).
Dirichlet multinomial community state typing We used the R package DirichletMultinomial to cluster samples into community state types (CSTs) using Dirichlet multinomial mixture models (52). We fit ten Dirichlet multinomial models, using as input the count matrix of the 855 samples by 273 ASVs in the analytic subset and varying the number of Dirichlet components (i.e., CSTs) from 1 to 10. We calculated the Laplace measure of fit for each model and plotted against k, identifying k = 6 as the best model. We varied k={4, 5} as a sensitivity analysis. Samples were assigned to the single k CST for which they had the highest posterior probability of membership; if a sample assigned to no CST at a posterior probability > 80%, the sample was not assigned to any CST.
Weighted co-occurrence networks We used the R package WGCNA to build a signed weighted network of ASVs using the Hellinger-transformed count matrix of 855 samples and 273 ASVs (53). As a sensitivity analysis, we used the center-log ratio transformed count matrix instead. The soft thresholding power of the signed network was selected to maximize the R^2 of the model fit while preserving the mean connectivity of the network using the pickSoftThreshold function in WCGNA. We used a dynamic tree cut and the cutreeDynamic function in WCGNA to identify network modules or clusters using a minimum module size of 5 and a deep split value of 4, with the aim of producing more fine-grained clusters. Intramodular connectivity statistics were calculated for each ASV using the intramodularConnectivity function. Finally, per-sample module relative abundances were calculated by summing the relative abundances of all ASVs belonging to the same module.
Replication cohort We performed the exact same bioinformatics and analytic pipeline on publicly available V3-V4 16S rRNA gene data from the Holgerson cohort (PRJEB35824) (12), as we did to the COHRA2 samples. This cohort was also composed of sequential salivary samples from similarly aged children, the prevalence of ECC was 10% by 60 months of age. The laboratory methods for these samples are described in Holgerson et al. (12). All the bioinformatics parameters and steps were the same as described above, with the exception that decontam was not used to identify potential contaminants as the publicly available data did not include DNA quantitation data. Since we could not obtain access to any metadata characteristics of these samples, including ECC status, the random forest models could not be run. For visualization purposes, the two matching networks shown in Main Fig. 4C were filtered to only edges with a weight > 0.03. The full, unfiltered network images are shown in Additional File 2. To compare the relatedness of the amplicon sequence variants assigned to various network modules across the COHRA2 and Holgerson cohort, we performed multiple sequence alignment of the amplicons using the R packages msa, using the ClustalW algorithim (54). We computed pairwise distances from the DNA sequences using the r function dist.dml from the r package phangorn (55), using the JC69 model. We created a neighbor joining tree using the phangorn function NJ, then fit a generalized time-reversible with gamma rate maximum likelihood tree using the neighbor joining tree as a starting point. We obtained 100 bootstrap values for the tree using bootstrap.pml and plotted the tree using ggtree (56) and collapsed branches present in < 50 of the bootstrapped trees.
Statistical analyses We used logistic regression to test for associations between summary metrics from unsupervised clustering and ECC separately at the 12- and 24-month visits while controlling for potential confounders identified from literature review and directed acyclic graphic. For each time point only pre-incident cases and controls were included in the regression, i.e., cases diagnosed at 12-months of age and age-matched controls were not included in the 12-month regression models. This ensures that the regression models test for prospective associations between current salivary bacteriome and future ECC diagnosis. To test for associations between CST and future ECC diagnosis, ECC status was used as the outcome and CST assignment was included as a categorical predictor. To test for associations between network modules and future ECC diagnosis, ECC status was used as the outcome and relative abundance of the network module was included as a continuous predictor ranging from 0 to 100. This allows the exponentiated coefficient to be interpreted as the odds ratio for a 1 percentage point increase in network module abundance. In adjusted models we included the following covariates: binary indicator for child being currently breastfed at the visit, binary indicator for maternal report of child antibiotic use within 3-months of visit, count of emerged primary teeth, binary indicator for birth delivery mode, binary indicator for maternal education greater than high school, and categorical variable for visit of case diagnosis/control matching.
Laboratory and bioinformatics pipeline for shotgun metagenomic sequencing DNA was extracted from plaque and saliva samples using the Zymobiomics miniprep kit according to the manufacturer’s instructions. Isolated DNA was quantified by Qubit. DNA libraries were prepared using the Illumina Nextera XT library preparation kit according to the manufacturer’s protocol. Library quantity and quality was assessed with Qubit (ThermoFisher) and Tapestation (Agilent Technologies, CA, USA). Libraries were then sequenced on Illumina HiSeq platform 2x150bp. Quality filtering and adapter trimming were performed using Trimmomatic and the Nextera PE adapters. Host DNA was removed using bowtie2 and the GRCh38 index. Trimmed, cleaned and decontaminated reads were processed through both the Humann3 short-read profiling pipeline (57) and the SqueezeMeta assembly-based pipeline (version 1.4.0) (58). Plaque and saliva samples were run separately through the assembly pipeline. Briefly, assembly was done using Megahit, ORFs were predicted using Prodigal, and similarity searches against GenBank, eggnog and KEGG were conducted using Diamond. Read mapping against contigs was performed using Bowtie2. Binning was done using MaxBin2 and Metabat2 and bins were combined using DAS Tool. To test for differential abundance of KEGG orthologs and taxa abundance estimated from contigs, we used DESeq2, first filtering out KEGG or taxa with fewer than 500 reads from the testing subset. We tested for enrichment in KEGG pathways using gene set enrichment analysis and the R package fgsea separately on plaque and saliva samples. We used the package SQMTools to extract functional and taxonomic subsets of interest, such as the KEGG orthologs which annotated to oxidative phosphorylation. To test correlations between 16S rRNA gene amplicon sequence variants and abundances of taxa from whole genome sequencing, we used a partial spearman correlation while controlling for incident visit and case status.