Sample collection
Anterior nasal swab samples were obtained from healthy adults according to a standardized protocol as previously described in an earlier study [28]. PBS was added (1 mL) to the nasal swab samples and vortexed briefly. Four aliquots were made with one nasal swab sample, and a swab and 200 uL of sample solution was utilized for each aliquot. Sputum was collected from adult patients with cystic fibrosis described in a previous study [29, 30]. Briefly, adult persons over age 18 satisfying cystic fibrosis clinical diagnostic criteria and receiving routine care at the Massachusetts General Hospital Adult Cystic Fibrosis Center were recruited. The volume of sputum samples was supplemented with PBS to make 1mL of 6 aliquots, and gently homogenized by syringes to make evenly distributed aliquots. Bronchoalveolar lavage (BAL) fluid was collected from intubated patients for clinically indicated bronchoscopies with excess BAL. For BAL, each sample was evenly separated into 6 individual aliquots to have volume of 200 uL. Ethical approval for this study was obtained by the Institutional Review Board of Mass General Brigham (Protocol #2018P002934, 2019P002868 and 2020P001761).
Host DNA depletion treatments
5 different host DNA depletion methods (lyPMA, Benzonase, HostZERO, MolYSIS, and QIAamp) were tested for nasal swab, sputum and BAL samples. The total number of treatment groups per each sample was 6 (5 different treatment and 1 control group). Nasal swab samples had 2 control groups due to the ability to collect only 4 nasal swabs per participant at any given time.
For lyPMA, the procedure followed a previously published protocol by Marotz et al [10]. Briefly, samples were centrifuged to collect cells (10,000 g, 8 min). After carefully discarding the supernatant, the pellets were resuspended with 200 ul of DNAse free-water (163049500, Qiagen, Germany) and mixed by Voltex-Gini2. Samples were left at room temperature for 5 min and 5 uL of PMA (40019, Biotium, USA) was added to the sample (10 uL of 1 mM PMA). After briefly vortexing, samples were incubated in the dark room at room temp (5 min). To bind PMA dyes to DNA, samples were placed horizontally on an orbital shaker and exposed to a light source with 2610 lumens (LED A21, GE, USA) at 20 cm distance for 30 min, and rotated every 5 min.
Benzonase treatment method was conducted as described at Nelson et al. [16]. Firstly, 7 mL of DNAse-free water was added to 200 uL sample, and then the samples were placed on an orbital shaker for 1 hour at 60 RPM to lyse mammalian cells. 800 uL of 10x Benzonase buffer (200 mM Tris-HCl (15567027, Invitrogen, USA), 10 mM MgCl2 (AM9530G, Invitrogen, USA)) and 250U of Benzonase (E1014-25KU, Sigma, USA) to each sample (1 uL) was added to the samples, and the mixtures were incubated for 2 hours at 37 C (120 rpm) in an incubator (New Brunswick Innova 42, Eppendorf, Germany). After centrifuging at 8,000 g for 10 min the pellets were resuspend with 1mL PBS and moved to 1mL tubes. The second centrifuge was conducted at 13000g for 3min, the supernatants were removed, and the pellets were resuspended with 400 uL of TE (AM9849, Invitrogen, USA) + 5 mM EDTA (15575-020, Invitrogen, USA).
HostZERO was implemented according to the manufacturer’s protocol (https:/files.zymoresearch.com/protocols/d4310_hostzero_microbial_dna_kit.pdf). Briefly, 1 mL of host DNA depletion solution (D4310-1-20, Zymo, USA) was added per 200 mL of sample. The mixture was agitated by orbital shaking for 15 min at room temp at 180 rpm. After centrifuging the tube at 10,000 g for 5 min at room temperature, the supernatant was carefully removed. 100 uL of microbial selection buffer D4310-2-5, Zymo, USA) and 1 uL of microbial selection enzyme (D-4310-3-50, Zymo, USA) was added to the samples, and the samples were incubated at 600 rpm, 37C for 30 min in a thermomixer.
MolYsis Basic 5 (D301-050, Molzyme, Germany) was implemented following the manufacturer’s protocol. Briefly, 250 uL buffer CM was added to the samples and they were agitated by vortexing for 15 seconds and incubated at room temperature for 5 minutes. Reagents (250 uL buffer DB1, 10 uL MolDNase B) were added to the samples and briefly mixed by vortexing for 15 seconds. After an incubation process at room temp for 15 minutes, samples were centrifuged 12,000 g for 10 minutes and the supernatant was removed carefully. The pellet was resuspended with 1 mL buffer RS and the Vortex Gini2, and centrifuged at 12,000 g for 5 minutes. Finally, 80 uL buffer RL was added to the pellets and mixed with pellets by Vortex Gini.
For QIAamp, the procedure followed the manufacturer’s protocol (https://www.qiagen.com/us/resources/resourcedetail?id=c403392b-0706-45ac-aa2e-4a75acd21006&lang=en). Briefly, after adding 800 uL of PBS (MRGF-6230, Growcells, USA) to each sample to make the total reaction volume 1 mL, 500 uL Buffer AHL (1080302, Qiagen, Germany) was added to 1 mL of sample. Samples were incubated at room temperature for 30 minutes at 600 rpm. The pellet was collected by centrifuging the tube at 10,000 g for 10 minutes and removing the supernatant carefully. After adding 190 uL of Buffer RDD (1018702, Qiagen, Germany) and 2.5 uL of bezonase (1038893, Qiagen, Germany), the samples were incubated at 37C for 30 minutes at 600 rpm. 20 uL of Proteinase K was added and samples were incubated at 56C for 30 minutes at 600 rpm afterwards. All incubation processes were conducted with a thermomixer (Thermomixer C 5382, Eppendorf, Germany).
DNA extraction
The same nucleic acid extraction approach was applied to all sample types as previously described [31]. In brief, treated and untreated samples, reagent-only negative controls, and mock community positive controls (Zymo Research) were extracted using a protocol optimized for respiratory samples with a magnetic bead-based protocol using the Maxwell HT 96 gDNA Blood Isolation System (Promega) on a KingFisher Flex instrument. Briefly, cetyl trimethyl ammonium bromide (CTAB) is added to samples in individual Lysing Matrix E tubes (MP Biomedicals), incubated at 95°C for 5 minutes followed by beadbeating for three 30 second cycles at 7.0 m/s, incubated with proteinase K at 70°C for 10 minutes, 300 sample µL lysate collected, additional beadbeating for three 30 second cycles at 7.0 m/s with each cycle, and additional 300 sample µL lysate collected. Sample lysates are transferred to 96 well plates for binding, washing, and elution steps on the Kingfisher Flex sample purification system.
Quantitative polymerase reaction (qPCR)
Quantification of human DNA was determined focusing on the LINE-1 region with the Femto human DNA quantification kit (Zymo E2005, USA) with standards. Bacterial DNAs were measured targeting 16S region with a set of universal primers (5’-CCTACGGGAGGCAGCAG-3′ and 5’-ATTACCGCGGCTGCTGG-3′) for bacterial 16S rRNA19 and bacterial DNA standards (Zymo E2006-2, USA) for quantification. All reactions were performed in triplicate. Absolute quantification was determined using standard curves generated according to the manufacturer’s protocol (https://files.zymoresearch.com/protocols/_e2006_femto_bacterial_dna_quantification_kit_ver.pdf and https://files.zymoresearch.com/protocols/_e2005_femto_human_dna_quantification_kit.pdf).
Metagenomics sequencing and data processing
PicoGreen dsDNA assay kits were used for DNA concentration measurement at library preparation (P7589, Invitrogen, USA). Due to low microbial DNA content, a DNA library prep kit (E6177L, New England Biolabs, USA) designed for high range of input (100–500 ng) was used. For all sample types, 1:25 diluted adapter was used during adapter ligation, and 12 cycles of PCR amplification was conducted. Success of library preparation was assessed with fragment analyzer (DNF-474-0500, Agilent, USA) and qubit (Invitrogen, USA). In total 157 samples (30 BAL, 35 nasal, 30 sputum, 30 negative control from host depletion, 30 positive controls from host depletion, 1 extraction positive control, and 1 extraction negative control) were sequenced on the Illumina NovaSeq platform targeting 10 Gb/sample. Reads were processed with Casava (Illumina) and bbduk to retrieve sequences and remove Illumina adapters.
Profiling of metagenomes were processed with bioBakery 3 [7] combined with bowtie2 with hg38 reference database for mapping and removing human reads [32]. Specifically, MetaPhlAn 3.0 and HUMAnN 3 were employed for taxonomical and functional profiling, respectively. Community profiles, either microbial taxonomy or predicted function of genes, and their hierarchical structures were merged by `phyloseq` package v1.41.1 [33]. Outputs were normalized to relative abundances considering the length of core genomes used for the identification for taxa, and reads per kilo million for function, respectively. Proportions of host DNAs in a sample was calculated using both qPCR and mNGS results using the following equations.
Host DNA (qPCR) [%] = Hq / (Bq + Hq) eqn. 1
Host DNA (sequencing) [%] = RH / (RC) eqn. 2
Where, Hq is the absolute amount of host DNA quantified by qPCR with LINE-1 region, and Bq is the amount bacterial DNA quantified by qPCR with 16S region, RH is the host reads identified by bowtie2 among RC, and the RC is the cleaned reads after removing low quality reads. Furthermore, low prevalent taxa were removed at a 5% threshold for statistical analyses, to avoid resulting in wrong association i.e. detecting more taxa after host DNA depletion [34].
Statistical analyses
Statistical analyses were pre-registered on the Open Science Foundation (https://osf.io/2jtc5) [35]. All statistical analyses were conducted in R version 4.3.1 (https://www.r-project.org). Library preparation success rates were assessed with a logistic mixed-effect model using the glmer function from ` lme4` R package v1.1.34 [36]. Final reads, % host DNA and other continuous outcomes were assessed by linear mixed effect models using lme4::lmer function. Predictors of the models were sample type, treatment method, and an interaction term for sample type x treatment method. Repeated measurements from one participant were accounted for with a random effect term. Alpha diversity indices were calculated with `vegan` package v2.6.4 [37]. Beta diversity was calculated with the vegan::vegdist, where Morisita-Horn dissimilarity index was calculated by subtracting the Morisita-Horn similarity index from 1, ordinated with the `phyloseq`, and paired distances extracted using `harrietr` package v0.2.3 (https://github.com/andersgs/harrietr). Stratified analyses by sample type were conducted when the sample type * treatment method interaction term was significant. Predictors of overall microbial community structure was evaluated on beta diversity using permutational analysis of variance (PERMANOVA). All the predictors were tested as fixed effects to compare the effect size between them (feature ~ subject + sample type + treatment method + sample type * treatment method). After confirming sample type-treatment specific effects with the sample type * treatment method interaction term, analyses stratified by sample typewere used (feature ~ lyPMA + Benzonase + HostZERO + MolYsis + QIAamp, strata = subject id). To quantify the potential bias of treatment compared to controls, paired beta diversity indices between each treated and untreated sample were extracted and used for a subsequent linear mixed effects model.
The effect of treatment on alpha diversity and on species-specific differential abundance was identified by a linear mixed effect model implemented in lme4::lmer (feature ~ sample type + lyPMA + Benzonase + HostZERO + MolYsis + QIAamp + (1|subject id)). Analyses performed for differential abundance first implemented a centered log-ratio transformation to account for compositionality prior to regression modeling using `microbiome` R package v1.22.0 [38]. The false discovery rate (FDR) was calculated using the `qvalue` R package v2.32.0 [39].
Mediation analysis was conducted using `mediation` package v4.5.0 [40] with species richness as an outcome, the treatment method as an exposure, the final reads as a mediator, and sample type as a mediator-outcome confounders. Mixed effects linear regression was used for both outcome and mediator models, and the analysis was stratified by each treatment method.
All the data were visualized using R packages ggplot2 v3.4.4 [41] and ggpubr v0.6.0 (https://github.com/kassambara/ggpubr/)
Sensitivity analysis
For further identification of contaminants, two different statistical decontamination methods were employed. Firstly, the `decontam` package [23] was used with the `combined` method based on bacterial DNA from the qPCR result as total DNA concentration. The analysis employed negative controls, BAL, nasal, and sputum without negative controls after prevalence and abundance filtering. Secondly, we estimated decontaminated genus-level relative abundances using `tinyvamp`[24] based on the MetaPhlAn read count table. The community compositions of the negative control and mock community samples were treated as known, and the relative abundance profiles were estimated for each of the remaining samples. A single contaminant relative abundance profile was estimated for each protocol. Detection efficiencies were estimated for taxa in the mock community relative to E. faecalis. The expected number of reads attributable to contamination was assumed to be inversely proportional to the bacterial DNA concentration in each sample (parameter `Z_tilde`). Optimization of the unweighted Poisson criterion was performed until convergence was reached. Optimization was performed separately for each of the six protocols. Despite its presence in the mock community, S. enterica was not detected in any of the mock community samples in the HostZERO protocol. Therefore, to enable estimation, a pseudocount of 1 was imputed for a single positive control sample for this protocol only. For downstream analysis we only considered the estimated relative abundances of the samples of unknown composition (output matrix P).