Experimental process
Each fresh fecal specimen from five healthy volunteers was fully stirred and divided into 40-50 tubes immediately after defecation, and then stored at -80 ℃. For each specimen, we conducted standard VLP enrichment and DNA extraction (using our previous method[20]) for each of ~30 tubes for follow-up analysis. Our experimental plan and four main hypotheses were shown in Figure 1. Firstly, to explore the optimal cycle number for PCR amplification of high-fidelity enzyme, we prepared 4 amplification products of virome DNA samples for each specimen: one sample without amplification (0C; C represents the number of cycles) for direct DNA library construction and three samples with 5, 15 and 30 cycles of amplification (later referred to as 5C, 15C, and 30C), respectively (Fig. 1a). Due to the relatively low concentration of DNA, the 0C sample was mixed from 12-16 tubes of the unamplified virome DNA and the 5C sample was mixed from 8-10 tubes of the amplified products of virome DNA to get enough DNA for library construction of Illumina-based sequencing (minimum total DNA amount 0.05μg; Supplementary Table 1). Then, since 15 cycles of amplification had the optimal results by comprehensive consideration, to validate the repeatability of the whole viral metagenomic process, we prepared 4 parallel virome DNA products for each specimen (15C-0, 15C-1, and 15C-2 from three independently unpooled virome DNA samples, and 15C-3 from above Fig. 1a) and performed PCR amplification with 15 cycles for them (Fig. 1d). Next, we prepared additional 3 virome DNA amplified products with identical pooled templates per specimen using multiple displacement amplification (MDA) for 90 minutes. One MDA product was used for Illumina-based sequencing and compared with the non-MDA samples (PCR amplification by high-fidelity enzyme) to evaluate the reliability of MDA (Fig. 1b). The other two products were then mixed to get enough DNA for library construction and performed nanopore sequencing based on the PromethION platform to test the capability of long-read sequencing in improving viral metagenomic assembly (Fig. 1c). Finally, we compared the efficiency and coverage of viruses between the viral metagenome and traditional bulk metagenome (Fig. 1e).
Optimization of the cycle number for PCR-based amplification
For each individual, samples with different PCR amplification cycles (0C, 5C, 15C-0, 15C-1, 15C-2, 15C-3, and 30C) were compared. All samples were successful for DNA library construction and shotgun sequencing, except one sample (Subject No. 2 with 5C, #2-5C) was failed (Supplementary Table 1). We found that the samples in 30 cycles were significantly lower in both the number and total length of raw metagenomic-assembled contigs when compared with the 0C and 15C samples (Fig. 2a; Supplementary Table 2). Similar results were also found in the number and total length of identified viral contigs (Fig. 2b). Samples in 5 cycles showed unstable results, e.g., #3-5C (Subject No. 3 with 5C) had assembled the largest number of raw/viral contigs than other #3 samples, while #4-5C (Subject No. 4 with 5C) had assembled the fewest contigs comparing with other #4 samples. We calculated the “recall rate” for each sample that is defined as the number of viruses assembled from a sample divided by the number of viruses assembled from all samples of an individual (Fig. 2c). 0C, 5C, and 15C samples recovered average 37.8% (ranged from 27.5% to 53.9%), 34.8% (ranged from 11.7% to 69.8%), and 29.7% (ranged from 12.0% to 43.6%) of viruses in five fecal specimens, respectively, while the 30C samples recovered only average 13.0% (ranged from 4.2% to 18.2%) of viruses. In summary, these findings suggested that 0C, 5C, and 15C samples could reconstruct a considerable large proportion of the fecal DNA virome, moreover, the amplification cycle number of 15 was the optimized method considering its experimental convenience and stability of the result. Notably, although more contigs were recovered in low amplification cycle (i.e., 0C, 5C, and 15C) samples, their N50 length, as well as the estimated completeness of the viral contigs did not seem to extend (Fig. 2b; Supplementary Fig. 1), which suggesting that the assembly performance could still be improved by other technologies.
To further evaluate the potential deviation in viral community structure, we grouped the viral contigs of all samples into a catalog of nonredundant viruses using 95% nucleotide similarity[29] and then profiled the viral composition of various samples by mapping the sequencing reads against this catalog. Analysis of the within-sample viral diversity based on viral profiles revealed that the 30C samples had the lowest diversity (in all three diversity parameters) as comparing with the samples by other numbers of cycles, while the 15C samples had almost equal Shannon and Simpson indexes by comparing with 0C and 5C samples (Fig. 3d). Principal coordinates analysis (PCoA) and Spearman correlation analyses of the viral profiles showed that all samples belonging to the same individuals were closely clustered (Fig. 3e; Supplementary Fig. 2), all of which demonstrating the high consistency of viral composition across different amplification cycles.
Validation of technical replication
The four 15C repetitions using the same experimental procedures of each fecal specimen had markedly differed in DNA concentration and total DNA amount, and their data production and performance of raw metagenomic assembly were not significantly different (coefficient of variation [CV] <40% for each sample; Supplementary Table 1-2). Also, the number of identified viral contigs and their assembly parameters did not differ among repetitions. For 4 repetitions from each individual, the proportion of viral reads and the within-sample diversity parameters (i.e., Shannon and Simpson indexes, observed number of viruses) were similar (CV <40% for all; Supplementary Fig. 3a). PCoA showed that 4 samples of the same specimen clustered together (PERMANOVA R2 = 0.97, p<0.001), and they significantly differed among samples from other individuals (Supplementary Fig. 3b). Similar, Spearman correlation analysis revealed that the samples of the same specimen were high consistently (ρ = 0.91±0.07, ranged from 0.73 to 0.98), while samples from the different specimens were distanced (ρ = 0.04±0.23, ranged from -0.27 to 0.42) (Supplementary Fig. 3c). These findings demonstrated that a high reproducibility result can be observed from independent repeat viral metagenome procedures.
Evaluation of MDA
The results by MDA method did not show a significant difference in the performance of raw metagenomic assembly compared with the 5C, 15C, and 30C samples, but their number and total length of raw contigs were lower than the 0C samples (Supplementary Fig. 4a; Supplementary Table 2). Similar, the results of viral identification of MDA samples were not very different from that of all non-MDA samples (Supplementary Fig. 4b). The “recall rate” (updated by adding the new samples) of MDA samples was average 22.6% (ranged from 12.0% to 29.9%), which was slightly lower than the 15C samples (average 27.2%) but remarked larger than the 30C samples (average 12.1%) (Supplementary Fig. 4c). Moreover, within each individual, the MDA samples were highly consistent with other non-MDA samples in both viral diversity and composition (Supplementary Fig. 5).
Combining short and long reads improve viral metagenome assembly
The data generated by Nanopore sequencing were preprocessed, which led to, on average, 266,026 (ranged from 77,682 to 431,229) long reads per sample, with average reads length 5,071 bp and an average quality score of 8.8 (corresponding base accuracy 86.8%; Supplementary Table 1). Notably, the average 96.3% of the short reads could be robustly mapped into the Nanopore long reads for each individual, confirming long reads well represented the DNA virome of original specimens. Based on a large amount of viral sequencing data, we first performed a state-of-the-art short-read assembly for each individual using the combined data of four 15C samples and then used the long reads for scaffolding the short-read-assembled contigs to generate a hybrid result, followed by gap-filling based on both short and long reads (see Methods). The hybrid assembly had significantly improved the performance of metagenomic assembly comparing with the short-read approach, with an average 15.8% (n = 2,743 vs. 2,369) increase of raw sequences and 19.8% (24.3 Mbp vs. 20.3 Mbp) increase of total length (Fig. 3a; Supplementary Table 3). Likewise, the number and total length of viral sequences had extended on average 23.6% (n = 545 vs. 441) and 26.3% (5.1 Mbp vs. 4.0 Mbp), respectively (Fig. 3b). Moreover, although the N50 length of these viruses was not extended, we found that the number of high-quality viruses (>90% completeness as estimated by CheckV[30]) was remarkably increased in the hybrid assemblies of each sample (Fig. 3c). These findings indicated that the long reads were effective in improving viral metagenome assembly and viral genome reconstruction.
Genomic and functional characterization of high-quality viruses
Next, we specifically focused on the high-quality viruses (n = 151; average length 50,952 bp; N50 length 60,043 bp; length ranged from 3,376 to 206,128 bp; Supplementary Table 4) generated by the hybrid assembly, as these viruses represented the dominant proportion (>75%) of viral relative abundances in original fecal specimens. The average estimated completeness of these viruses was 99.4% (ranged from 90.9% to 100%), while 72 of which were identified as “finished” genomes as they contained the high-confidence direct terminal repeat (DTR) or inverted terminal repeat (ITR) sequences. 70 of 151 viruses could be robustly assigned into the viral families, of which the members of Siphoviridae (n = 34), Microviridae (n = 10), Myoviridae (n = 6), were most frequently occurred (Fig. 4a), in agreement with the previous studies reporting that these three families were dominated in human gut virome[15]. Almost all known viruses were prokaryotic viruses, except 3 eukaryotic viruses (Circoviridae, n = 2; Geminiviridae, n = 1). We also assembled the near-complete genomes of 6 viruses that belonged to a candidate viral family, “Quimbyviridae”[31], suggesting the probably widespread of this family in gut virome. To further investigate the novelty of our virus catalog, we compared the viral genomes with three large-scale human gut virus datasets including Gut Virome Database (GVD)[8], Gut Phage Database (GPD)[3], and Metagenomic Gut Virus catalog (MGV)[6]. 60.3% (91/151) of our viruses were completely absent from all three databases (Fig. 4a), including all 10 members of Microviridae, highlighting that more unknown viruses are still needed to be identified in the human gut. In addition, we identified the bacterial hosts of 63 of 151 viruses based on their homology of genome sequences or CRISPR spacers to the available gut microbial genomes. This analysis revealed some novel virus-host affiliations such as 4 virus-host pairs between “Quimbyviridae” members and Prevotella spp. or the virus-host pair between a Microviridae virus and a Cyanobacteria species (Supplementary Fig. 6). Members of Firmicutes and Bacteroidetes were the most frequent hosts of the virus catalog (Fig. 4a), consistently with previous studies showing that these phyla are most dominant in healthy human gut[1, 32].
We predicted a total of 10,951 protein-coding genes from the high-quality viruses and annotated functions of 939 of these genes based on the KEGG (Kyoto Encyclopedia of Genes and Genomes)[33] database. Totaling 206 viral auxiliary metabolic genes (AMGs) that were assigned to specific metabolic pathways were further analyzed to elucidate the metabolic capabilities of the viruses (Supplementary Table 5). Strikingly, 22.7% of viral AMGs were involved in sulfur metabolism (Fig. 4b), in agreement with recent reports that the viruses are widely participants in both organic and inorganic sulfur metabolism in human gut[34, 35]. The proteins involved in the destructive metabolism of peptidoglycan (an important struct of bacterial cell walls) were frequently encoded by the viruses (consisting of 11.4% AMGs), such as peptidoglycan DL-endopeptidase function as both cell wall hydrolases and poly-γ-glutamic acid hydrolases[36], which would facilitate infection and fitness of bacterial host by such viruses. Besides these, we also found that the viruses encoded several important but rarely reported functions, including the enzymes involved in nicotinate and nicotinamide metabolism, folate metabolism, metabolism of other molecular (e.g., lipopolysaccharide, glycerophospholipid pantothenate, porphyrin). These findings largely extended the functional capacity of gut virome.
Virus identification in bulk metagenome versus viral metagenome
Averaging 814 viral contigs (ranged from 479 to 1,147) and an average total viral length of 6.8 Mbp (ranged from 3.9 Mbp to 9.6 Mbp) were generated from the bulk metagenome samples (Supplementary Fig. 7a), which were 49.4% and 33.5% larger than that of the hybrid-assembled viral metagenome samples, respectively. However, the N50 length and estimated completeness of bulk samples were significantly lower than those of VLP samples (Supplementary Fig. 7b), probably due to the lower proportion of viruses in bulk samples. Surprisingly, only an average of 16.4% (ranged from 10.9% to 22.3%) of the viruses identified by VLP metagenome were shared with the viruses of bulk metagenome (Fig. 5a). Further comparison at the family level between the VLP-specific and bulk-specific viruses revealed that, despite both two types of viruses were dominated with Siphoviridae and Myoviridae, they had significantly differed in frequency among some families (Fig. 5b). For example, 20 crAss-like phages were recovered by viral metagenomes but only 3 were assembled in bulk samples. Also, the VLP metagenomes uniquely recovered all Microviridae (n = 13), Circoviridae (n = 3), and Drexlerviridae (n = 2), and Genomoviridae (n = 2) viruses, whereas the bulk metagenome specially recovered Ackermannviridae (n = 5), Herpesviridae (n = 3), and Pithoviridae (n = 2). Moreover, an average of 2.4% (ranged from 1.3% to 4.3%) viruses in viral metagenomes were recognized as prophage by the CheckV prophage algorithm, while this proportion was average 5.4% (ranged from 3.8% to 8.1%) in bulk viruses (Fig. 5c).
Finally, we compared the viral profiles of viral and bulk samples to investigate the viral diversity and structure difference by these two technologies. Bulk samples showed higher within-sample diversity than the viral metagenome samples (Fig. 5d). Likewise, PCoA and Spearman correlation analyses showed that the viral profiles of viral metagenome and bulk metagenome samples from the same individuals remarkedly differed, with Spearman correlation coefficient ρ = 0.67±0.11 (Fig. 5e-f), and this phenomenon was also observed in viral composition at the family level (Fig. 5g). Collectively, our findings suggested a considerable difference between the two technologies in profiling the DNA virome of the human gut.