To extract sequence information from cfDNA isolates, we used a single-stranded DNA library preparation that improves the recovery of microbial cfDNA relative to host cfDNA by up to seventy-fold for cfDNA in plasma [5]. We quantified microbial cfDNA by alignment of sequences to microbial reference genomes [11, 12] (see Methods). We identified two classes of noise, which we addressed using a bioinformatics workflow that implements both novel and previously described filtering approaches [13, 14] (Fig. 1a). The first class, which we term “digital cross-talk”, stems from errors in alignment and from contaminant sequences that are present in microbial reference genomes, including human-related sequences or sequences from other microbes. Digital cross-talk affects distinct segments of a microbial genome and gives rise to inhomogeneous coverage of the reference genome. We computed the coefficient of variation in the per-base genome coverage for all identified species (CV, computed as the standard deviation in genome coverage divided by the mean coverage) and removed taxa for which the CV differed greatly from the CV determined for a uniformly sampled genome of the same size (see Methods), because such a difference indicates that a significant number of the sequences assigned to the genome are due to digital cross-talk.
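The CV comparison described above can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: the function names, the read-placement simulation used to obtain the CV of a uniformly sampled genome, and the use of an absolute difference are all assumptions made here, and the threshold `delta_cv_max` is a placeholder parameter.

```python
import random
import statistics


def coverage_cv(coverage):
    """Coefficient of variation: std. dev. of per-base coverage / mean coverage."""
    mean = statistics.fmean(coverage)
    if mean == 0:
        raise ValueError("coverage is zero everywhere; CV is undefined")
    return statistics.pstdev(coverage) / mean


def simulate_uniform_coverage(genome_size, n_reads, read_length, seed=0):
    """Per-base coverage when read starts are drawn uniformly at random.

    Assumption: this simple simulation stands in for the 'uniformly sampled
    genome of the same size' mentioned in the text.
    """
    rng = random.Random(seed)
    coverage = [0] * genome_size
    for _ in range(n_reads):
        start = rng.randrange(genome_size - read_length + 1)
        for pos in range(start, start + read_length):
            coverage[pos] += 1
    return coverage


def passes_cv_filter(observed_coverage, n_reads, read_length, delta_cv_max):
    """Keep a taxon only if its CV is within delta_cv_max of the uniform CV."""
    uniform = simulate_uniform_coverage(len(observed_coverage), n_reads, read_length)
    delta = abs(coverage_cv(observed_coverage) - coverage_cv(uniform))
    return delta <= delta_cv_max


# Toy example: 100 reads of 50 bp on a 2,000 bp genome.
evenly_covered = simulate_uniform_coverage(2000, 100, 50, seed=1)
# Cross-talk-like profile: all aligned bases pile up in one 100 bp segment.
piled_up = [50 if pos < 100 else 0 for pos in range(2000)]

keep_even = passes_cv_filter(evenly_covered, 100, 50, delta_cv_max=2.0)
keep_piled = passes_cv_filter(piled_up, 100, 50, delta_cv_max=2.0)
```

In this toy case the piled-up profile has a CV several units above that of the simulated uniform genome, so it fails the filter, whereas the evenly covered profile passes.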
A second class of noise is due to physical contamination of the sample with environmental DNA present at the time of collection and in reagents used for DNA isolation and sequencing library preparation [13]. We reasoned that the total biomass of environmental DNA would be consistent for samples prepared in the same batch. LBBC filters environmental contaminants by performing batch variation analysis on the absolute abundance of microbial DNA quantified with high accuracy. The core elements of LBBC can be implemented using any metagenomics abundance estimation algorithm that uses sequence alignment to full microbial genomes. In our analysis, we estimate the genomic abundance of each species using a maximum likelihood model implemented in GRAMMy [12] (see Methods). From the relative abundance of species, we compute the absolute number of molecules in a dataset corresponding to a specific species, accounting for differences in genome sizes for all identified microbes. The total biomass of microbial DNA is then estimated as the proportion of sequencing reads derived from a species, multiplied by the measured biomass input to the library preparation reaction. Recent approaches have identified environmental contaminants by i) looking for batch-by-batch covariation in the relative abundance of microbes measured by metagenomic sequencing, or ii) examining the (inverse) correlation between biomass of the sample and the relative abundance of microbial DNA in the sample [13, 14]. These studies have shown the dramatic effect of environmental contamination in low biomass settings. LBBC effectively combines these two prior approaches into one. Using this analysis applied to the metagenomic cfDNA datasets described below, we estimate that the total biomass of environmental, contaminant DNA can exceed 100 pg (range of 0 pg to 230.4 pg).
This is a small amount of DNA (< 1% of sequencing reads) that nonetheless can significantly impact the interpretation of metagenomic sequencing results. We further incorporated a known-template negative control in the library preparation procedures to identify any remaining contaminant sequences. The use of a negative control is recommended for metagenomics studies [15], and was implemented in our previous work [2, 16]. Here, we compared the microbial abundance detected in samples to that detected in controls to set a baseline for environmental contamination. This analysis indicated that, on average, only 46% of physical contaminant species determined by LBBC are removed using comparison to a negative control alone, supporting the need for the additional filters implemented in LBBC.
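The batch variation analysis can be illustrated with a short sketch. It assumes that a per-species read fraction is already available (for example from an abundance estimator, after any genome-size adjustment) and that a taxon is flagged when the variance of its absolute biomass across the samples of one batch falls below a minimum. The function names, the data layout, and the threshold value are hypothetical choices for illustration only.

```python
import statistics


def species_biomass_pg(species_read_fraction, input_biomass_pg):
    """Absolute biomass of one species: read fraction x measured input biomass."""
    return species_read_fraction * input_biomass_pg


def flag_batch_contaminants(batch, min_variance_pg2):
    """Return taxa whose absolute biomass barely varies within one batch.

    batch maps sample name -> (input biomass in pg, {taxon: read fraction}).
    A taxon missing from a sample is treated as having zero biomass there.
    """
    taxa = set()
    for _, fractions in batch.values():
        taxa.update(fractions)

    flagged = set()
    for taxon in taxa:
        biomasses = [
            species_biomass_pg(fractions.get(taxon, 0.0), input_pg)
            for input_pg, fractions in batch.values()
        ]
        # Near-constant biomass across a batch suggests a shared reagent or
        # environmental source rather than a sample-specific organism.
        if statistics.pvariance(biomasses) < min_variance_pg2:
            flagged.add(taxon)
    return flagged


# Toy batch: the contaminant contributes about 2 pg to every sample,
# while Escherichia is abundant in a single sample only.
toy_batch = {
    "sample_A": (1000.0, {"Methylobacterium": 0.002, "Escherichia": 0.05}),
    "sample_B": (2000.0, {"Methylobacterium": 0.001}),
    "sample_C": (500.0, {"Methylobacterium": 0.004}),
}
contaminants = flag_batch_contaminants(toy_batch, min_variance_pg2=1.0)
```

Because the filter works on absolute biomass rather than relative abundance, the contaminant is flagged even though its read fraction differs four-fold between samples.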
[Figure 1]
We evaluated and optimized LBBC using a dataset available from a recently published study that assessed the utility of urinary cfDNA for the monitoring of bacterial infection of the urinary tract [2]. We analyzed 44 cfDNA datasets from male and female kidney recipients. These included 16 datasets from subjects with E. coli UTI, 11 datasets from subjects with Enterococcus UTI, and 17 datasets from subjects without UTI, as determined by conventional urine culture performed on the same day. Prior to application of the LBBC algorithm, the ratio of sequences assigned as non-host vs host (paired host reads relative to sequences assigned to microbial taxa) was 4.4 × 10⁻¹ ± 1.68 in this dataset. We detected 616 bacterial genera across all 44 samples (Fig. 1b; RGE > 10⁻⁶), many of which were atypical in the urinary tract, including Herminiimonas and Methylobacterium, albeit at very low abundance.
We defined two parameters for threshold-based filtering: (1) the maximum difference between the observed CV and that of a uniformly sequenced taxon at the same sequencing depth and genome size, ΔCVmax, and (2) the minimum allowable within-batch variation, σ²min. A third, fixed parameter was used to remove species identified in the negative controls (threshold of 10-fold the observed representation in the negative controls). We optimized these parameters based on the following metric:
$$\mathrm{BC}_{\mathrm{score}} = k_{TP}\,TP + k_{TN}\,TN + k_{FP}\,FP + k_{FN}\,FN + k_{U}\,U$$
where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives, respectively; U is the total number of identified taxa for which an orthogonal measurement was not performed; and the coefficients k are the weights applied to these values when optimizing the filtering parameters. Here, we chose {kTP, kTN, kFP, kFN, kU} = {4, 2, -1, -2, -0.2}, and used nonlinear minimization by gradient descent on the variable BCscore to determine an optimal set of threshold parameters: {ΔCVmax, σ²min} = {2.00, 3.16 pg²}.
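If the metric is read as a weighted linear sum of the five counts (an assumption based on the description of the coefficients k as weights), scoring a candidate pair of thresholds reduces to the sketch below. The function name and the example counts are hypothetical; only the default weights are taken from the text, and the optimization step itself is not reproduced.

```python
def bc_score(tp, tn, fp, fn, unverified, weights=(4, 2, -1, -2, -0.2)):
    """Weighted sum of classification counts (assumed linear form).

    weights is (k_TP, k_TN, k_FP, k_FN, k_U); the defaults are the values
    quoted in the text. `unverified` is U, the number of identified taxa
    without an orthogonal measurement.
    """
    k_tp, k_tn, k_fp, k_fn, k_u = weights
    return k_tp * tp + k_tn * tn + k_fp * fp + k_fn * fn + k_u * unverified


# Illustrative counts for two candidate threshold settings (not study data).
strict = bc_score(tp=10, tn=5, fp=2, fn=1, unverified=20)
lenient = bc_score(tp=10, tn=3, fp=4, fn=1, unverified=60)
```

Under these weights, a setting that leaves many unverified taxa in the output is penalized even when it detects the same number of true positives.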
Applying LBBC with these parameters to urinary cfDNA microbiome profiles led to a diagnostic sensitivity of 100% and specificity of 91.8%, when analyzed against results from conventional urine culture. We computed a confusion matrix (see Methods) and determined the accuracy of the test to be 0.886 (no information rate, NIR = 0.386, p < 10⁻¹⁰). Without LBBC, the test achieved a sensitivity of 100% but a specificity of 3.3%, and an accuracy of 0.000 (as most samples have both E. coli and Enterococcus). Applying a simple filter that excludes taxa with relative abundance below a pre-defined threshold (RGE > 0.1) led to an accuracy of 0.864 (sensitivity of 81.5%, specificity of 96.7%); however, such filtering does not remove sources of physical or digital noise at high abundance and may remove pathogens present at low abundance. After applying LBBC, we observed far fewer bacterial genera outside of Escherichia and Enterococcus in samples from patients diagnosed with UTI (Fig. 1c). LBBC did not remove bacteria that are known to be commensal in the female genitourinary tract, including species from the genera Gardnerella and Ureaplasma [17]. For male subjects without UTI, we detected a single Lactobacillus species among all subjects, consistent with the view that the male urinary tract is sterile in the absence of infection. For patients with UTI, the urinary microbiomes were less diverse in males compared to females, as previously reported [18]. These examples illustrate that LBBC conserves key relationships between pathogenic and non-pathogenic bacteria.
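For readers less familiar with these performance measures, the sketch below computes sensitivity, specificity, and accuracy from binary confusion-matrix counts. The counts are illustrative and are not those of this study, and the accuracy reported in the text may have been computed per sample across several classes, which this two-class sketch does not attempt to reproduce.

```python
def diagnostic_metrics(tp, tn, fp, fn):
    """Sensitivity, specificity and accuracy from binary confusion counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative reference call")
    total = tp + tn + fp + fn
    return {
        "sensitivity": tp / (tp + fn),   # fraction of culture-positive calls recovered
        "specificity": tn / (tn + fp),   # fraction of culture-negative calls kept negative
        "accuracy": (tp + tn) / total,   # fraction of all calls that agree with the reference
    }


# Illustrative counts only.
metrics = diagnostic_metrics(tp=8, tn=9, fp=1, fn=2)
```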
We next applied LBBC to the analysis of cfDNA in amniotic fluid. Circulating cfDNA in maternal plasma has emerged as a highly valuable analyte for the screening of aneuploidy in pregnancy [19], but no studies have examined the properties of cfDNA in amniotic fluid. Furthermore, no studies have assessed the utility of amniotic fluid cfDNA as an analyte to monitor clinical chorioamnionitis, the most common diagnosis related to infection made in labor and delivery units worldwide [20]. Traditionally, it was thought that clinical chorioamnionitis was due to microbial invasion of the amniotic cavity (i.e. intra-amniotic infection), which elicits a maternal inflammatory response characterized by maternal fever, uterine tenderness, tachycardia and leukocytosis as well as fetal tachycardia and a foul-smelling amniotic fluid [21, 22]. However, recent studies in which amniocentesis has been used to characterize the microbiologic state of the amniotic cavity and the inflammatory response [amniotic fluid interleukin (IL)-6 >2.6 ng/ml [23]] show that only 60% of patients with the diagnosis of clinical chorioamnionitis have proven infection using culture or molecular microbiologic techniques [10]. The remainder of the patients have clinical chorioamnionitis in the presence of intra-amniotic inflammation (i.e. sterile intra-amniotic inflammation) or with neither intra-amniotic inflammation nor microorganisms in the amniotic cavity [10]. Therefore, the emergent picture is that clinical chorioamnionitis at term is a heterogeneous syndrome, which requires further study to optimize maternal and neonatal outcomes [24]. We analyzed forty amniotic cfDNA isolates collected from the following study groups of women: 1) with clinical chorioamnionitis and detectable microorganisms (n = 10), 2) with clinical chorioamnionitis without detectable microorganisms (n = 15), and 3) without clinical chorioamnionitis (i.e. normal full-term pregnancies) (n = 15).
Microorganisms were detected by cultivation and broad-range PCR coupled with electrospray ionization mass spectrometry or PCR/ESI-MS (see Methods). Data from several independent clinical assays were available, including levels of interleukin 6 (IL-6), white and red blood cell counts, and glucose levels (see Methods).
[Figure 2]
We obtained 77.7 ± 31.8 million paired-end reads per sample, yielding a per-base human genome coverage of 1.90x ± 0.88x. The data provide unique insight into the properties of amniotic fluid cfDNA. For women carrying a male fetus, we used the coverage of the Y chromosome relative to autosomes to estimate the fetal fraction of cfDNA in amniotic fluid (see Methods). The fetal fraction ranged from 6.0% to 100%, and was strongly anticorrelated with inflammatory markers such as IL-6 [25, 26] (Spearman’s rho of -0.763, p = 1.34 × 10⁻⁴, n = 20; Fig. 2a). We attribute this observation to the recruitment of immune cells to the amniotic cavity during infection [27, 28]. We next used paired-end read mapping to determine the fragment length profiles of cfDNA in amniotic fluid (Fig. 2b). We found that amniotic fluid cfDNA was highly fragmented (median length 108 bp), and lacked the canonical peak at 167 bp typically observed in the fragmentation profile of plasma cfDNA [19, 29]. To determine size differences between fetal and maternal cfDNA in amniotic fluid, we computed the median fragment length for molecules derived from the X and Y chromosomes in cfDNA from male pregnancy samples. We hypothesized that if all cfDNA in a sample originated from the male fetus, the median fragment lengths for the X- and Y-aligned DNA would be equivalent, and, conversely, that in samples with a large fraction of cfDNA originating from the mother, a length discrepancy may arise. Using this approach, we found that fetal-derived cfDNA was shorter than maternal-derived cfDNA (up to 31 bp shorter; Fig. 2c). Previous reports have similarly noted that fetal cfDNA in urine and plasma is shorter than maternal cfDNA [30, 31].
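The two fetal cfDNA measurements described above can be sketched as follows. The exact estimator is given in the Methods and is not reproduced here; this sketch assumes that a male fetus contributes one Y chromosome per diploid genome, so that the fetal fraction is approximated as twice the ratio of Y-chromosome to autosomal coverage, and it assumes both coverages are already normalized for mappable length. All function names and input values are hypothetical.

```python
import statistics


def fetal_fraction_from_y(y_coverage, autosome_coverage):
    """Approximate fetal fraction for a male fetus, clipped to [0, 1].

    Assumption: one Y copy per diploid fetal genome and none in maternal
    DNA, so fetal fraction ~ 2 * (Y coverage / autosomal coverage).
    """
    if autosome_coverage <= 0:
        raise ValueError("autosomal coverage must be positive")
    return min(1.0, max(0.0, 2.0 * y_coverage / autosome_coverage))


def median_length_gap(x_fragment_lengths, y_fragment_lengths):
    """Median X-aligned length minus median Y-aligned length, in bp.

    Y-aligned fragments are fetal only; X-aligned fragments mix maternal
    and fetal DNA, so a positive gap suggests shorter fetal fragments.
    """
    return statistics.median(x_fragment_lengths) - statistics.median(y_fragment_lengths)


# Illustrative inputs only.
fraction = fetal_fraction_from_y(y_coverage=0.5, autosome_coverage=2.0)
gap_bp = median_length_gap([100, 140, 160], [100, 110, 120])
```

With these toy inputs the fetal fraction is 0.5 and the X-aligned fragments are 30 bp longer at the median than the Y-aligned ones.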
We next examined the utility of LBBC for the diagnosis of clinical chorioamnionitis. Prior to application of the LBBC algorithm, the ratio of sequences assigned as non-host vs host (paired host reads relative to sequences assigned to microbial taxa) was 1.08 × 10⁻² ± 4.76 × 10⁻² in this dataset. After applying LBBC with a relaxed batch variation minimum to account for species-level analysis (σ²min = 1 pg²), no bacteria were detected in the normal pregnancy group (Fig. 2d), in line with recent studies that point to a sterile amniotic cavity and placenta in the absence of infection [32, 33]. The cfDNA sequencing assay detected only six of the fourteen bacterial genera identified by bacterial culture or PCR/ESI-MS, and was unable to identify a fungal pathogen, Candida albicans, detected by PCR/ESI-MS (see Methods). We asked if these false negatives were due to LBBC filtering. Relaxation of the filtering thresholds revealed that Ureaplasma was removed in four samples by the batch variation filter; other false negatives were not due to LBBC filtering. Interestingly, in all cases of chorioamnionitis without detectable microorganisms, no bacterium was identified (Fig. 2d), in line with previous evidence showing that chorioamnionitis and intra-amniotic inflammation can occur in the absence of microbial invasion of the amniotic cavity [10]. Last, in two samples, we identified a high burden of viral DNA, including papillomavirus in one sample and bacteriophage in another (Fig. 2d), demonstrating the utility of cfDNA paired with LBBC to detect viruses in the amniotic fluid.