Study Design. Ten healthy adult volunteers, aged 18–65 years, were recruited to provide stool samples over a 4-month period, before, during, and after antibiotic exposure. Participants had not received antibiotic treatment one year prior to study enrollment, nor they ever experienced allergic reactions to the antibiotic class used in the study. Participants did not follow special dietary habits (vegetarian or vegan). To reduce the effect of individual variation, all volunteers were treated in parallel with a single antibiotic (Table S1). Four antibiotics from different chemical and therapeutic classes were used: ciprofloxacin (quinolone class), cefuroxime (β-lactam class), doxycycline (tetracycline class), and azithromycin (macrolide class). As a control, stool samples from two untreated healthy individuals also were processed. Treatments were randomly assigned and open label.
Sampling. Six stool samples from each participant were obtained as one sample 15 days before treatment ± 1 day (T1), two samples on the third (T2) and fifth day (T3) of antibiotic treatment ± 1 day, and three samples at 15 (T4), 30 (T5) and 90 days (T6) after treatment ± 1 day. Samples were immediately stored at -18 ˚C at volunteers’ homes and transported to the hospital within 14 days, packed to prevent thawing. Upon arrival at the hospital, samples were stored at − 80 ˚C until processing.
DNA extraction. DNA was extracted from 5 g aliquots of frozen stool using an MO BIO PowerMax Soil DNA Extraction Kit (MO BIO Laboratories, Inc) according to the manufacturer's protocol with a few modifications. Stool samples were stored at − 80 °C in sterile 50 ml Falcon tubes until extraction. To samples, 15 ml MO BIO PowerBead Solution and MO BIO Garnet beads were added before vortexing for 1 min at maximum speed using a horizontal vortex adopter (SI-H506, Horizontal 50 mL Tube Holder, Scientific Industries). Solution C1 (1.4 ml) was added before incubation at 65 °C for 30 min with shaking at 130 rpm. Samples were vortexed for 10 min at maximum speed, 6 ml Solution C2 was added, samples were incubated at 4 °C at 20 min and processed per MO BIO instructions.
DNA purification. After extraction, DNA was purified with PowerClean Pro DNA Clean-Up Kits (MO BIO Laboratories, Inc.) according to the manufacturer's protocol. When necessary, isolated DNA was concentrated to > 50 ng/uL using a vacuum concentrator (Concentrator plus, Eppendorf). The DNA quantity was measured using Qubit 2.0 Fluorometer (Thermo Fisher Scientific Inc.). The DNA quality was measured using a NanoDrop ND-1000 spectrophotometer (Thermo Fisher Scientific) and size was examined by gel electrophoresis of 5 µl DNA on a 1% (w/v) agarose gel with RedSafe Nucleic Acid Staining Solution (iNtRON Biotechnology).
DNA library preparation and sequencing. DNA samples were sent to Macrogen (South Korea) for library preparation and sequencing (Illumina Hiseq 2000 PE125). DNA libraries for sequencing were using TrueSeq Nano 550 bp kits (Illumina). The input template was 200 ng according to kit instructions. Sequencing depth was set as up to a minimum of 6 Gb data per sample.
RNA extraction. RNA extraction was by MO BIO PowerMicrobiome™ RNA Isolation Kit (MO BIO Laboratories, Inc.) according to manufacturer instructions. RNA was stored at -80 C until processing.
RNA library preparation and sequencing. RNA samples were sent to Macrogen (South Korea) for library preparation and sequencing (Illumina Hiseq 2000 PE125). For each sample, 2.5 µg total RNA was used as input for rRNA, which was processed using Ribo-Zero Gold rRNA removal kits - Epidemiology (Illumina). RNA libraries were constructed using Trueseq RNA library preparation kits (Illumina) according to manufacturer instructions. Two platforms were used for RNA sequencing with balanced data output (Illumina Hiseq 2500 PE125 and Illumina NextSeq PE75). Sequencing depth was set as up to a minimum of 2 Gb per sample.
Phage DNA extraction. Phage particles were isolated from 5 g aliquots of frozen stool with 50 ml Phage Buffer (10 mM Tris, pH 7.5, 10 mM MgCl2, 68 mM NaCl, 1 mM CaCl2) before homogenization by vortexing for 20 min at the highest speed (SI-H506, Horizontal 50-mL Tube Holder, Scientific Industries). Samples were centrifuged 3 times at 4 °C: 2 min at 872 x g, 10 min at 3800 x g, and 20 min at 7500 x g. After each centrifugation, supernatants were transferred to new 50 ml Falcon tubes and pellets discarded. Supernatants were filtered through 0.22 µm filters (EMD Millipore Sterivex-GP SVGPL10RC Polyethersulfone Filter Unit, Millipore). To concentrate virus particles, 10 ml of filtered supernatants were concentrated to 1 ml by centrifugation with 100 Da Amicon Ultra filters (Amicon Ultra-15 Centrifugal Filter Units, Millipore) at 3488 x g at 15 °C. Supernatants in Amicon tubes were washed two times with 5 ml Phage Buffer and volumes adjusted to 1 ml. Supernatants were filtered through 0.45 µm syringe filters (Cellulose acetate membrane syringe filter, Filter Technology) into 1.5 ml phase-lock gel tubes (5 PRIME), 40 µL lysozyme (10 mg/mL, Sigma-Aldrich) was added, and filtrates incubated for 30 min at 37 °C under shaking at 300 rpm. After incubation, 400 µL chloroform was added to samples before incubating for 15 min at room temperature with gentle inversion every 2 minutes. Samples were centrifuged at 14,000 x g for 5 min at room temperature and supernatants transferred to 1.5 ml Eppendorf tubes. A mix of 500 U bovine pancreas DNase I recombinant (Roche), 33 U Baseline-ZERO DNase (Epicentre), 6 U Salt Active Nuclease (ArcticZymes), and 500 U RNase A (Roche) was added to samples with 100 µl 10 × Incubation buffer (Roche) for incubation at 37 °C for 90 min followed by DNase inactivation at 75 °C for 10 min. After DNase/RNase treatment, phage particles were stored overnight at 4 °C. Phage DNA was extracted using Phage DNA Isolation Kits (Norgen Biotek) according to the manufacturer's protocol. DNA quantity was measured using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific Inc.). Phage DNA samples were stored at -80 °C.
Control for bacterial contamination in phage DNA extractions. Full-length 16S rRNA gene (1503 bp) was amplified with 16S_up (AAGAGTTTGATCCTGGCTCAG) and 16S_lp (TACGGCTACCTTGTTACGACTT) primers [24] from Pseudomonas aeruginosa PAO1 reference strain (NC_002516) by quantitative PCR (qPCR). The template for qPCR reactions with SYBR Green Master Mix (Thermofisher) was 0,5 ng phage DNA. All samples were amplified by qPCR with triplicates of standards with known gene copy numbers and negative controls. Standards were 10-fold dilutions of purified, full-length 16S rRNA gene amplicons. Phage DNA samples with higher cycle amplifications (above 28 Ct corresponding 102 gene copies) were discarded and phage DNA extraction was repeated. Phage DNA below the threshold of detection for 16S rRNA gene copies (102) was used for next-generation sequencing library preparations.
Phage library preparation and sequencing. Phage DNA libraries were prepared using KAPA HyperPlus Kits (Kapa Biosystems). All steps were on ice except two cleanups that were at room temperature. Samples of 2.5 ng phage DNA were diluted in 17.5 µl 10 mM Tris-HCl (pH 8.0-8.5). Enzymatic fragmentation was by adding 2.5 µl KAPA Frag Buffer (10X) and 5 µl of KAPA Frag Enzyme to DNA dilutions in PCR tubes. Tubes were vortexed gently and spun down briefly, then incubated at 37 °C for 30 min in a thermocycler that was pre-cooled to 4 °C. After adding End-repair and A-tailing buffer, PCR tubes were vortexed and spun down and immediately incubated at 65 °C for 30 min in a thermocycler with the lid preheated to 85 °C. Adapter ligation reactions were in the same tubes with an addition of 3.75 µl PCR-grade water, 15 µl Ligation Buffer, 5 µl DNA ligase, and 1.25 µl 750 pM single adapters (Pentabase). Tubes were mixed thoroughly and centrifuged briefly and incubated at 20 °C for 60 min. Products were cleaned using Agencourt AMPure XP reagent at a ratio of 1:0.7 (adapter ligation reaction product: reagent). Reagent and product were mixed by pipetting 10 times followed by short spindown centrifugation. Mixtures were left for 15 min at room temperature to bind DNA to beads. Beads were captured by magnets for 5 min and supernatants were carefully removed and discarded. With tubes still on the magnet, 200 µl freshly prepared 80% ethanol was added with incubation for 30 s. Ethanol was discarded and the ethanol washes repeated. Tubes were left on the magnet for 5 min to dry the beads before resuspending them in 12.5 µL 10 mM Tris-HCl, pH 8.0-8.5 and vortexing for 30 s. Beads were incubated at room temperature for 2 min to elute DNA, then captured by magnet for 5 min. Supernatants were transferred to new tubes and 12.5 µl 2X KAPA HiFi HotStart ReadyMix and 2.5 10X KAPA Library Amplification Primer Mix added to 10 µl adapter-ligated library. Reagents were mixed and tubes centrifuged briefly. Library amplification used the cycling protocol: 1 cycle 98 °C, 45 s; followed by 14 cycles of 98 °C for 15 s, 60 °C for 30 s, and 72 °C for 30 s, with a final extension at 72 °C for 1 min. Post-amplification cleanup was as described above. Concentrations were measured using a Qubit 2.0 Fluorometer (Thermo Fisher Scientific Inc.) and library size (average 500–900 bp) was determined using a Bioanalyzer (Agilent 2100 Bioanalyzer system, Agilent Technologies). Libraries pooled and sequenced on a MiSeq platform (PE300).
Nucleic acid extraction and sequencing control. Reference strain E.coli MG1655 transformed with pzZE21mCherry was used as a control for the nucleic acid extraction and sequencing protocols. Sequencing reads were mapped to reference genome using CLC Genomic Workbench (version released in 2015). More than 99,5% of reads mapped back to the reference genome.
Sequencing data quality control. All raw reads from DNA, RNA and, phage libraries underwent quality trimming using a previously described pipeline [25] to filter out adapters and universal primer sequences, low-quality bases (< Q20), reads shorter than 75 bp, duplication reads and reads mapping to the human genome with over 95% identity. Computational scripts are at https://github.com/TingtZHENG/VirMiner/. Quality control results are summarized in Data S1.
rRNA removal from RNA sequencing clean data. Removal of rRNA was by riboPicker (version 0.4.3) [26] against the non-redundant rRNA database (riboPicker, downloaded from http://edwards.sdsu.edu/ribopicker/rrnadb/rnadb_2012-01-17.tar.gz) with arguments “-c 80 -i 90”. The results are summarized in Data S1.
In silico estimation of bacterial contamination in virome. We compared the 16S rRNA gene content in virome and whole metagenome samples. First, viral reads were truncated to 125 bp to match the read length of the whole metagenome dataset. Subsequently, both datasets were mapped against the SILVA database (v.123) [27] using bwa mem v. 0.7.15 [28] and the number of unique reads mapped with at least 90% identity was counted for both. Finally, the percentage of 16S reads within the entire read set of each sample was calculated and compared with published virome datasets (MetaVir) [29] (Figure S8).
De novo assembly. The de Bruijn graph-based assembler IDBA-UD (v. 1.1.1) [30] was used for de novo assembly. Clean reads from all time points for each individual were pooled for the co-assembly of metagenome and virome, respectively. For metagenomes, parameters for IDBA-UD were: “-mink 40 -maxk 100 -step 10 -num_threads 24 --min_contig 300 -pre_correction”. For viromes with PE300 sequencing, k = 180 was selected as the max kmer length. Two modifications were made in the source code before compiling IDBA_UD: in file src/basic/kmer.h, constant kNumUint64 was changed from 4 to 8 to allow maximum kmer length beyond 124; in file src/sequence/short_sequence.h, constant kMaxShortSequence was set to 512 to support longer read length. Virome co-assembly used parameters: “-mink 20 -maxk 180 -step 20 -num_threads 24 --min_contig 800 -pre_correction”. After co-assembly, paired-end reads (DNA, RNA and phage libraries) were aligned to respective assemblies using bwa “mem” model (v. 0.7.15) [28]. Statistics of mapped and unmapped reads were calculated using samtools with function “flagstat” (v. 1.3.1) [31]. Overall, final assemblies had mean mapping percentages of 82.3–91.4% for metagenomic contigs (Table S6) and 76.4–93.3% for viromic contigs (Table S7). Samtools functions “depth -aa” and “idxstats” were used to calculate contig coverage and depth and per-locus depth.
Updated phage orthologous group (uPOG) database. We used an updated POG database for phage gene annotation – uPOG [32]. The uPOGs are available on our website (http://147.8.185.62/VirMiner/downloads/updated_POG_database/).
Open Reading Frame (ORF) prediction and annotation. For metagenomes, MetaGeneMark was adopted to predict coding DNA sequence (CDS) regions in assembled metagenome contigs using default parameters [33]. The functional COG category for each protein was assigned using National Center for Biotechnology Information rps-BLAST [34] with the parameter “-e 1e-5”. Protein sequences were aligned to the Kyoto Encyclopedia of Genes and Genomes (KEGG) database [35] using Diamond blastp [36] with the parameter “-e 1e-5”. For viromes, ORFs were predicted by GeneMarkS v4.3 [37] with the parameter “--phage”. Predicted ORFs from metagenomic and viromic contigs were mapped against Pfam [38], POG 2012 [39], and uPOG databases by DIAMOND blastp [36] with parameters “--id 70 -e 1e-5” and the top hits were selected.
Antibiotic resistance gene homolog (ARG) annotation. The CARD database [47] and the accompanying Resistance Gene Identifier (RGI) pipeline were used to annotate ARGs in the metagenome and the virome. For metagenomes, protein sequences of predicted ORFs were used as input. For viromes, input was viral contig sequences. The RGI hits with “Perfect” and “Strict” identification were used as qualified ARG annotations.
Antibiotic-specific resistance gene homolog (AsRG) annotation. To identify ARG homologs with proven resistance to each antibiotic given to the study participants, we compiled a list of AsRGs from previous literature (Table S3). Genes with no annotation hit in the metagenomes were removed. The list was further verified by CARD [47] and only genes listed as “confers_resistance_to_drug” in “Sub-Term(s)” of each antibiotic were kept. For potential AsRGs in the mobilome (virome, phage-like contigs, plasmid contigs, and contigs with TE), only the RGI hits with “Perfect” or “Strict” identification and a minimum identity of 95% were kept.
Calculation of transcriptional activity. A gene or contig’s DNA and RNA abundances were calculated by transcripts per million mapped reads (TPM). Only genes with detectable abundances (TPM > 1e-5) across all time points were analyzed for DNA abundance. For the genes, contig, or gene sets with detectable DNA and RNA abundances (TPM > 1e-5) across all time points, transcriptional activity (TA) was calculated as TA = RNA (TPM) / DNA (TPM). Log2 transformation for TA was applied before statistical tests.
Taxonomic assignment of phage contigs. The RefSeq [40] database of viral reference genomes Release 81 (March 2017) was used to map contigs from each assembly using blastn [34] with filtration criteria: E < 1e-4, identity > 70% and coverage > 50%. Contigs that shorter than 3 kb were discarded. The relative abundance of each contig with a reference viral genome hit was calculated as transcripts per kilobase per million mapped reads (TPM). For each viral family, relative abundance was calculated as the sum of TPM of all contigs assigned to the viral family.
Phage contig verification by VirSorter and VirFinder. All phage library contigs were analyzed by VirSorter [41] with “virome database” and “virome de-contamination” modes. Contigs classified in any viral categories or as prophages were considered viral. VirFinder [42] analysis used default settings and contigs with a false discovery rate (FDR) below 0.05 were considered viral.
Phage-like contig identification in metagenomes. Phage library contigs were mapped against metagenomic contigs to find target phage-like contigs and phage integration sites in metagenomes. MegaBlast [34] was applied with parameters “--id 90 -e 1e-5”. Mapped regions with unmapped gaps smaller than 100 bp were catenated. Metagenomic contigs with coverage greater than 50% of the phage contig were identified as phage-like contigs in metagenomes.
Phage contig filtration. Phage library contigs meeting at least two of the following criteria were marked as confident phage contigs: 1) annotated with uPOG gene; 2) annotated with viral genes from PFam; 3) mapped to known phage genome in RefSeq; 4) identified as viral contig by VirSorter; 5) identified as viral contig by VirFinder; 6) mapped to at least 3 target phage-like contigs in the metagenome. Only confident phage contigs were used in downstream analyses. Contigs meeting only one criterion were marked as “suspicious phage contigs” with others marked as “contaminant contigs.
Plasmid contig annotation. Due to the complexity of a metagenome and the presence of homologous sequences from different species, metagenomic assemblies always yield fragmented contigs and inevasible mis-assemblies. Thus, a traditional plasmid identification tool that relies on circular contig assembly of high quality may not be the best practice for metagenome assemblies. Thus, we employed PlasFlow [43], a neural network-based tool to identify potential plasmid contigs from models trained by kmer frequencies. Among the metagenomic contigs longer than 1kbp, plasmid contigs were annotated with PlasFlow v1.0 with the parameters and default models (k = 5, 6, 7, respectively). Among the results, entries marked as “unclassified” or “chromosome” were discarded, and contigs binned as “plasmid” with a probability score over 0.7 were kept as plasmid contigs. The ARGs on these plasmid contigs were marked as mobile ARGs.
Transposable element annotation. These elements in metagenomic contigs were annotated using the database ISFinder [44]. ORF protein sequences and contig nucleotide sequences were mapped against, respectively, ISFinder protein and nucleotide databases using Diamond blastp [36] or blastn [34] with parameters “--id 70 -e 1e-5”. Contigs with mapped transposable elements (TE) were marked as contigs with transposable elements.
Identification of mobile contigs and mobile AsRGs. Metagenomic contigs annotated as phage-like contigs, plasmid contigs, or contigs with transposable elements were classified as mobile contigs. Identified AsRGs on these contigs were classified as mobile AsRGs. For each individual and each species ARG profile, identified AsRGs with no mobile copies in that species were classified as non-mobile AsRGs.
Species-specific marker gene annotation. Species marker gene profiles were obtained through MIDAS [45] with arguments “run_midas.py genes -s very-sensitive --species_cov 0.1”, for samples with efficient read depth (“merge_midas.py genes -sample_depth 0.1”). Functional profiles (Gene Ontology, KEGG Orthology, Enzyme Commission number (EC)) were further summarized based on annotations from the MIDAS reference database (midas_db_v1.2). Metagenomic ORFs were mapped against this marker gene database using Diamond blastp [36]. All hits with identity over 70% and E-value less than 1e-5 were kept for contig-to-species assignment.
Contig binning and contig-species assignment. For each individual, co-assembled metagenomic contigs were binned using MaxBin 2.2.4 [46] with the default parameters. For each contig or contig bin, species-specific marker genes were summarized. If more than 70% of the species-specific marker genes appeared in a contig or a contig bin could be assigned to a single candidate species with only one candidate species identified, then the contig or contig bin was assigned to that species. If a contig was successfully assigned to a host species, the species for its contig bin was then ignored. If a contig could not be assigned to a host species, while its bin could be assigned to a host species, then the contig bin’s host species was assigned to the contig. All the functional genes on a contig, including ARGs, were assigned to the contig’s host species.
HGT-supporting read pair extraction. Reads mapped to phage-contig homologous regions on phage-like contigs were extracted by samtools function “view” [31] and mate reads were extracted from bam files using grep command. Read pairs with reads mapped to different contigs were kept for cross-contig read-pair analysis. A read pair is defined as an HGT-supporting read pair if the ARG-carrying contig and its mate contigs were assigned to different host species.
Metagenome taxonomic assignment. The relative abundance of taxa was analyzed using MetaPhlAn2 with default settings [48].
Community diversity and dissimilarity. For metagenomes, species-level taxonomic profiles were used as input for alpha-diversity and beta-diversity analyses. Alpha-diversity was represented by Shannon index and beta-diversity by Bray-Curtis distances, both using the vegan R package [49]. For viromes, alpha-diversity (Shannon index using vegan) was calculated on the relative abundance matrix of confident phage contigs (in TPM). For beta-diversity, MASH [50] MinHash sketch strategy for estimating the Jaccard index was used to estimate dissimilarity between samples. Briefly, a mash sketch for each read file from each time point was derived and distances of all-against-all sketches were calculated. Ordinations for beta-diversity analysis were calculated by nonmetric multidimensional scaling for illustrations.
Definitions of special species categories. 1. Disappeared: the species undetected in all post-treatment samples (T4 to T6); 2. New: the species undetected in the baseline, but persistently abundant (relative abundance > 0.1%) in all post-treatment samples (T4 to T6); 3. Proliferating: the species meeting any of the two criteria: 1) a species with a relative abundance < 0.1% in baseline and reached 5% in T2 or T3, or 2) a species with a relative abundance ≥ 0.1% in baseline and increased by at least 10-fold in the relative abundance in T2 or T3.
Growth rate index (GRiD) calculation. GRiD was calculated according to a previously published method [51] with the following parameters: “grid multiplex -d /sbidata/shared2/GRiD/Stool -p -c 0.2 -m”. GRiD may generate not applicable (NA) values for all species in one sample, thus such time points (T3 for DOX-a and T4 for CTR-a) were considered invalid and excluded from analyses. For statistics, we kept only the species with no NA values across all valid time points.
Gram-positive and Gram-negative species. Bacterial species and strain information were acquired from PATRIC [52], visited in November 2017. Bacterial species with all strains with the same (or missing) Gram-staining type were assigned as Gram-positive or Gram-negative species. Species with conflicting Gram-staining types for different strains were categorized as dubious.
Statistics and data visualization. Statistics were done in R, with data visualization by R and corresponding packages including ggplot2, grid, gridExtra, RColorBrewer, ellipse, and pheatmap. Two-tailed Wilcoxon signed-rank test was performed for comparisons of paired data in nonparametric statistics. Two-tailed Wilcoxon rank-sum tests were performed for comparisons of unpaired data in nonparametric statistics. Adonis tests were applied for the community dissimilarity. Fisher’s exact tests (only when there are less than 5 observations in a table cell) or Pearson's chi-square tests were performed for 2 × 2 tables for independence test. Significance threshold was set to p < 0.05 or false discovery rate (FDR) < 0.05.