Optimization and Evaluation of Viral Metagenomic Amplification and Sequencing Methods Toward a Genome-Level Resolution of the Human Fecal DNA Virome

Background: Viruses in the human gut have been linked to health and disease. Deciphering the gut virome depends on metagenomic sequencing of the virus-like particles purified from fecal specimens. A major limitation of conventional viral metagenomic sequencing is the low recoverability of viral genomes from the metagenomic dataset. Results: Herein, we developed an optimized method for viral amplification and metagenomic sequencing to maximize the recovery of viral genomes. Using 5 fecal specimens with multiple repetitions, we determined the optimal number of PCR cycles for high-fidelity enzyme-based amplification and the reliability of multiple displacement amplification in virome DNA preparation, verified the reproducibility of the optimized whole viral metagenomic experimental process, and tested the capability of long-read sequencing to improve viral metagenomic assembly. Based on our optimized results, we generated 151 high-quality viruses using the combined data from short-read (15 cycles of PCR amplification) and long-read sequencing. Genomic analysis of these viruses found that most (60.3%) of them were previously unknown and showed a remarkable diversity of viral functions, notably the presence of 206 viral auxiliary metabolic genes. Finally, we compared the viral metagenomic and bulk metagenomic sequencing approaches and revealed significant differences in the efficiency and coverage of viral identification between them. Conclusions: Our study demonstrates the potential of optimized experimental and sequencing strategies for uncovering viral genomes from fecal specimens, which will facilitate future research on the genome-level characterization of complex viral communities.


Background
The human gut is a reservoir of numerous microbes including bacteria, archaea, eukaryotes, and viruses [1,2]. Viruses are the smallest organisms in the human gut microbiome, but they may be the largest in terms of the variety of species [3] and the number of organisms, with 10^9-10^12 viral particles per gram of feces [4,5], and likely play important roles in shaping the gut microbiome.
As a valuable untargeted tool, next-generation high-throughput sequencing has been widely used to characterize the viral community (or "virome") of the human intestine. Currently, there are two main approaches for sequencing viruses from fecal samples: direct whole-metagenome sequencing (referred to as bulk metagenomic sequencing) and virus-like particle (VLP) enrichment followed by metagenomic sequencing (referred to as viral metagenomic sequencing) [6]. The bulk metagenomic sequencing approach extracts the whole genetic material of the gut microbiome; although the proportion of viral sequences is relatively low, it has the advantages of simple experimental operation and a high success rate of library construction for next-generation sequencing. Additionally, a large number of publicly available bulk metagenomic datasets have provided massive input material for virome mining, promoting the generation of several large-scale viral genome catalogs, such as the Human Virome Database (HuVirDB) [7], Gut Virome Database (GVD) [8], Metagenomic Gut Virus catalog (MGV) [6], and Gut Phage Database (GPD) [3]. On the other hand, although its experimental process is more complex, viral metagenomic sequencing represents viral communities more comprehensively and has improved detection sensitivity for low-abundance and rare viruses [9]. For these reasons, a series of studies have used viral metagenomic sequencing to delineate gut viral diversity and structure, showing that the gut virome of healthy adults is highly diverse, individual-specific, and broadly stable over long periods [10].
Based on these two approaches, a growing number of viral metagenomic studies have also shown the relevance of the gut virome to human health and disease, highlighting a considerable role of the viral community in multiple systemic disorders including rheumatoid arthritis [11], nonalcoholic fatty liver [12], alcoholic hepatitis [13], colorectal cancer [14], inflammatory bowel disease [15], and obesity-related type 2 diabetes [16]. These studies have greatly advanced the understanding of the extensively unexplored diversity of the human virome.
In practice, analysis based on viral metagenomic sequencing is often insufficient for characterizing individual viruses at the genome level. Because of the limitation of low biomass, a viral metagenome inevitably requires multiple polymerase chain reaction (PCR) amplification cycles to obtain sufficient DNA for high-throughput sequencing [17]. For the high-fidelity enzyme approach, the typical number of amplification cycles is 30-40, which may result in severe stochastic or systematic bias in viral genome fragmenting, skew viral abundances, and over-amplify small circular single-stranded DNA (ssDNA) viruses [6,18-21]. Such amplification leads to an uneven distribution of read coverage across the viral genome, which prevents genome reconstruction from the sequencing fragments [22,23]. Reducing the number of PCR cycles reduces the amplification bias but simultaneously reduces the yield of PCR product, which makes it necessary to determine the optimal cycle number for amplification. Currently, multiple displacement amplification (MDA) has been proven efficient for amplifying very small amounts of DNA and is frequently used in viral metagenomic analysis [24,25]. However, this method also has a large amplification bias and can generate numerous template-independent errors [26], which limits its wide application in the genome-level analysis of the gut virome.
In addition to optimizing viral amplification experiments, another promising strategy for viral genome analysis is long-read sequencing. Long-read viral metagenomic analysis based on nanopore sequencing has been preliminarily validated using mock viral communities as well as human oral virome samples, providing significant improvements in the recovery of viral genomes [27,28]. Long-read sequencing data can theoretically capture near-complete viral genomes in single reads, overcoming the challenges of assembling viral genomes from short-read data. However, it requires a roughly 10-fold greater total amount of viral DNA with high integrity and molecular weight [9]. Conveniently, MDA can amplify relatively long fragments and is thus helpful for preparing input material for long-read sequencing.
In this study, we propose optimized viral metagenomic amplification and sequencing strategies that aim to maximize the recovery of viral genomes from human fecal specimens. We performed parallel virus enrichment and DNA extraction to generate ~30 viral DNA samples from each of 5 fresh fecal specimens and conducted experiments including 1) optimizing the cycle number for PCR amplification, 2) evaluating the reproducibility of the optimized whole viral metagenomic experimental process, 3) evaluating the reliability of MDA, 4) evaluating the performance of long-read nanopore sequencing for metagenomic assembly, and 5) comparing viral metagenomic and bulk metagenomic sequencing. Our analyses recommend an optimized strategy and uncovered hundreds of metagenome-assembled viral genomes from these specimens, including 151 high-quality viruses that are amenable to in-depth taxonomic and functional characterization.

Methods
Fecal specimen collection and preprocessing
The fecal specimens were collected near the laboratory and processed within 1 hour. More than 20 g of fresh fecal specimen per person was collected from five healthy volunteers using a sterile fecal collection box. To verify the repeatability of the experiment, after stirring and mixing with a sterile tongue depressor, the fresh fecal specimens were aliquoted into 35-50 subsamples of equal mass (0.17 g per tube) in 1.5 ml sterile tubes and stored at -80 ℃. This study was approved by the ethics committee of Dalian Medical University [NO 2020-014], and informed consent was obtained from all participants.

Experimental grouping and process
To answer the 5 key questions in gut DNA virome research shown in Figure 1, we carried out a scheme comprising 5 corresponding experimental routes and compared their results. First, because differences in the PCR template can cause a higher proportion and a more divergent pattern of biased contigs [24] in fecal viromes obtained by MDA or Q5 high-fidelity enzyme amplification, the excess virome DNA extracted in the same batch from 16 tubes of fecal subsamples per person was mixed thoroughly to ensure that all subsequent MDA or high-fidelity enzyme amplifications used identical templates (except for the repeatability verification experiments, which used individual subsamples; Fig. 1a). To determine the effect of the number of PCR cycles on amplification bias during virome DNA preparation, we screened for the optimal number of PCR cycles among 0 (PCR-free), 5, 15, and 30 (denoted as 0C, 5C, 15C, and 30C, where C represents the cycles) using the pooled virome DNA as template; the PCR products were used as input DNA for library construction for high-throughput sequencing. Second, to determine the reliability of multiple displacement amplification, we performed MDA with the same pooled templates. Third, long-read sequencing by the nanopore technique was applied and compared with the Illumina sequencing results. Fourth, the reproducibility of the virome DNA enrichment and amplification procedure was further evaluated using 4 randomly selected independent tubes of fecal subsample per person as starting material, which were distinct from the pooled templates. Using the validated optimum of 15 PCR cycles, we performed the experiment with four independent repetitions to evaluate reproducibility. Last, the results of bulk metagenomic and viral metagenomic sequencing were compared.
Virus-like particle enrichment and viral DNA extraction
Virus-like particle enrichment and viral DNA extraction followed our previously described protocol with minor modifications [20]. Briefly, 1 ml of Hank's Balanced Salt Solution (HBSS) without phenol red was added to each tube of feces (0.17 g), which was then vigorously homogenized by vortexing (at least 15 s of pulse vortexing). The samples were centrifuged at 10000 × g for 10 min at 4 ℃. The supernatant was passed through serial filtrations using 0.45 µm and then 0.2 µm filters [18]. Then, the samples were ultracentrifuged at 750000 × g for 60 min at 8 ℃ and the precipitate was resuspended in 500 µL of HBSS. 120 µL of the resuspension was transferred and treated with 23

Virome DNA was amplified by the MDA method according to the manufacturer's instructions (GenomiPhi V2 Amplification kit, GE Healthcare, Little Chalfont, UK). 9 µL of sample buffer and 1 µL of the previously pooled virome DNA were pre-cooled at 4 ℃. Then, 9 µL of reaction buffer and 1 µL of enzyme mix were added, and amplification proceeded at 30 ℃ for 90 min.

Metagenomic sequencing
The concentration of the cleaned DNA amplified by the Q5 high-fidelity enzyme or by MDA was measured with the Qubit dsDNA HS Assay Kit (ThermoFisher Scientific, Waltham, MA, USA). To reach the minimum input required for NGS library construction given the Q5 product yields, 8-10 parallel amplified products from 5 cycles of amplification, or 12-16 parallel products from 0 cycles, were mixed (Figure 1). All short-read shotgun metagenomic sequencing of virome DNA was performed on the Illumina NovaSeq platform according to our previous research process [20]. Briefly, libraries were prepared with a fragment length of approximately 350 bp. Paired-end reads of 150 bp were generated in the forward and reverse directions. To reach the minimum input required for nanopore library construction given the MDA product yields, 10 amplified products (= 2 parallel amplified products * 5 samples) were mixed (Figure 1). The long-read nanopore sequencing of virome DNA was performed on the PromethION platform. NEBNext FFPE DNA Repair Mix and the NEBNext End repair / dA-tailing Module (New England Biolabs, USA) were used for DNA damage repair, end repair, and addition of an A base at the 3' end. Native Barcoding Expansion 1-12/13-24 (Oxford Nanopore Technologies, UK) and NEB Blunt/TA Ligase Master Mix (New England Biolabs, USA) were used for barcode attachment. Then, the ligation sequencing kit (SQK-LSK109, Oxford Nanopore Technologies, UK) was used for library construction.

Data preprocessing and assembly
For the short-read data, raw metagenomic sequencing reads were filtered and trimmed using fastp v0.20.1 [52] with the options '-q 20 -u 30 -y -l 90 --trim_poly_g' to generate high-quality reads. Host contamination reads were removed by querying high-quality reads against the human genome GRCh38 using Bowtie2 v2.4.1 [53]. For the long-read data, raw nanopore reads were filtered using NanoFilt v2.8.0 with the options '-q 7 -l 2000'. For each sample, all short-read data were mapped to the long reads using BLASTN v2.11.0, and the long reads for which over 20% of the total length was covered by short reads at >80% identity were kept. The high-quality short reads of each sample were assembled into contigs using SPAdes v3.14.1 [54] with the options '-meta -k 21,33,55,77'. For the hybrid assembly, we developed a custom pipeline to improve the viral metagenomic assembly. Briefly, long reads from each sample were used to scaffold the contigs pre-assembled by SPAdes via SSPACE-LongRead v1-1 [55] with the options '-i 80 -a 300 -g 1500'. GapFiller v1.11 [56] was then used to close gaps within scaffolds based on short reads from the corresponding sample with the options '-m 20 -r 0.6 -g 2'. To further reduce the number of gaps, a second gap-filling step was performed based on long reads from the corresponding sample via FGAP v1.8.1 [57] with the options '-i 80 -C 200 -R 2000 -I 2000'. The contigs with a minimum length of 3,000 bp were extracted from the scaffolds for further analyses.
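The long-read retention rule (keep a read when over 20% of its length is covered by short-read alignments at >80% identity) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and it assumes hits are given as 1-based inclusive read coordinates with a percent identity, as in BLASTN tabular output.

```python
def covered_fraction(read_length, hits, min_identity=80.0):
    """Fraction of a long read covered by alignments above min_identity.

    hits: iterable of (start, end, percent_identity) in 1-based inclusive
    read coordinates (assumed layout; adapt to the actual BLASTN output).
    """
    intervals = sorted(
        (min(s, e), max(s, e)) for s, e, ident in hits if ident > min_identity
    )
    covered = 0
    cur_start, cur_end = None, None
    for start, end in intervals:
        if cur_end is None or start > cur_end + 1:
            # Disjoint from the running interval: bank it and start a new one.
            if cur_end is not None:
                covered += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent: extend the running interval.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        covered += cur_end - cur_start + 1
    return covered / read_length


def keep_long_read(read_length, hits, min_fraction=0.20, min_identity=80.0):
    """True when strictly more than min_fraction of the read is covered."""
    return covered_fraction(read_length, hits, min_identity) > min_fraction
```

Overlapping alignments are merged before counting so that a region hit by many short reads is not counted more than once.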

Analyses of viral sequences
After assembly, all contigs (≥3,000 bp) were assessed by CheckV v0.7.0 [30], and the contigs with more than 25% known microbial genes were removed. The remaining contigs were considered viral sequences if they met any of the following criteria: 1) the contig had more viral genes than host genes in CheckV; 2) the contig had a p-value <0.01 in DeepVirFinder v1.0 [58]; 3) the contig was identified by VIBRANT v1.2.1 [59]. To decontaminate the catalog of viral sequences, following a previous study [8], we first searched for bacterial universal single-copy orthologs (BUSCO) [60] within the viral sequences using HMMsearch with default options and calculated the ratio of the number of BUSCOs to the total number of genes in each viral sequence (BUSCO ratio). Highly contaminated viral sequences with a BUSCO ratio ≥5% were then removed, and the remainder were taken as the final viral sequences for each sample. The filtered viral sequences were dereplicated in the following steps: 1) pairwise alignments of all viral sequences were performed using BLASTN v2.11.0 with the options '-evalue 1e-10 -word_size 20 -num_alignments 10000'; 2) viral sequences that shared 95% nucleotide identity across 75% of the sequence were clustered into a viral operational taxonomic unit (vOTU) using a custom script; 3) the longest viral sequence was selected as the representative sequence for each vOTU.
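The BUSCO-ratio filter and the dereplication steps can be illustrated with a short sketch. The authors' custom script is not available in the text, so everything here is an assumption: the greedy longest-first strategy, the choice to measure the 75% coverage against the shorter sequence of each pair, and all function and parameter names.

```python
def busco_ratio(n_busco_genes, n_total_genes):
    """Ratio of BUSCO hits to all genes; sequences with ratio >= 0.05 are removed."""
    return n_busco_genes / n_total_genes if n_total_genes else 0.0


def cluster_votus(lengths, alignments, min_identity=95.0, min_coverage=0.75):
    """Greedy clustering; returns {representative_id: [member_ids]}.

    lengths: {seq_id: length in bp}
    alignments: {(query_id, subject_id): (percent_identity, aligned_bp)}
    (hypothetical summary of the pairwise BLASTN results).
    """
    def linked(a, b):
        hit = alignments.get((a, b)) or alignments.get((b, a))
        if hit is None:
            return False
        identity, aligned_bp = hit
        shorter = min(lengths[a], lengths[b])
        return identity >= min_identity and aligned_bp / shorter >= min_coverage

    clusters = {}
    # Longest first, so each representative is the longest member of its vOTU.
    for seq_id in sorted(lengths, key=lambda s: (-lengths[s], s)):
        for rep in clusters:
            if linked(seq_id, rep):
                clusters[rep].append(seq_id)
                break
        else:
            clusters[seq_id] = [seq_id]
    return clusters
```

Because sequences are visited from longest to shortest, the key of each cluster is automatically the longest member, which matches step 3 of the procedure.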
Taxonomic annotation of viral sequences was based on protein sequence alignment to a combined database derived from Virus-Host DB (downloaded in May 2021), crAss-like protein sequences from Guerin's study [61], and viral protein sequences from Benler's study [31]. Putative proteins of the viral sequences were predicted using Prodigal [62] with the option '-meta' and then aligned to the reference database using DIAMOND v2.0.6.144 [63] with the options '--query-cover 50 --subject-cover 50 --id 30 --minscore 50 --max-target-seqs 10'. A viral sequence was assigned to a viral family when over a quarter of its proteins matched the same family.
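The family-level voting rule can be sketched as below. This is an illustration under stated assumptions: each protein contributes at most one family (for example, that of its best DIAMOND hit), the denominator is the total number of predicted proteins, and `assign_family` is a hypothetical helper name.

```python
from collections import Counter


def assign_family(n_proteins, hit_families, min_fraction=0.25):
    """Return the family matched by more than min_fraction of proteins, else None.

    n_proteins: total number of predicted proteins in the viral sequence.
    hit_families: one family name per protein that had a database hit.
    """
    if n_proteins == 0 or not hit_families:
        return None
    family, count = Counter(hit_families).most_common(1)[0]
    return family if count / n_proteins > min_fraction else None
```

With a threshold just above 25%, more than one family could in principle qualify; this sketch simply returns the most frequent one.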
The abundance of vOTU was calculated by aggregating the matching clean reads in each sample using Bowtie2 and SAMtools [65]. The abundance of the family-level population was obtained by aggregating the abundance of vOTUs matched to the same family. The relative abundance of each population (vOTU or family) was the ratio of its abundance to the total abundance of all populations in each sample.
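The abundance calculation reduces to summing mapped reads and normalizing by the sample total. A minimal sketch, assuming per-vOTU read counts have already been extracted from the Bowtie2/SAMtools output; the function names and the "Unclassified" label are illustrative and not taken from the study.

```python
def aggregate_by_family(votu_abundance, votu_family):
    """Sum vOTU abundances per family; vOTUs without a family go to 'Unclassified'."""
    family_abundance = {}
    for votu, value in votu_abundance.items():
        family = votu_family.get(votu, "Unclassified")
        family_abundance[family] = family_abundance.get(family, 0) + value
    return family_abundance


def relative_abundance(abundance):
    """Divide each population's abundance by the total over all populations."""
    total = sum(abundance.values())
    if total == 0:
        return {name: 0.0 for name in abundance}
    return {name: value / total for name, value in abundance.items()}
```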

Statistical analyses
All statistical analyses were performed in R v4.0.2. Alpha diversity indexes were assessed based on the relative abundance profile of vOTUs using the function diversity. PCoA was performed based on the Bray-Curtis distance using the function pcoa in the ape package. Data visualization was carried out using the ggplot2 package. Spearman's correlation coefficient was calculated using the cor.test function. Significance tests were performed using the function wilcox.test with the parameter 'paired=T'.
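For readers who want to see what these indexes compute, the following Python sketch gives the standard formulas. It is not the R code used in the study. It assumes the Shannon index uses the natural logarithm and that the Simpson index is reported as 1 minus the sum of squared proportions; the text does not state which variants were used.

```python
import math


def shannon(proportions):
    """Shannon index H = -sum(p * ln p) over non-zero proportions."""
    return -sum(p * math.log(p) for p in proportions if p > 0)


def simpson(proportions):
    """Gini-Simpson index 1 - sum(p^2)."""
    return 1.0 - sum(p * p for p in proportions)


def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two abundance profiles of equal length."""
    numerator = sum(abs(a - b) for a, b in zip(profile_a, profile_b))
    denominator = sum(profile_a) + sum(profile_b)
    return numerator / denominator if denominator else 0.0
```

A PCoA is then an eigendecomposition of the matrix of pairwise Bray-Curtis dissimilarities between samples.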

Experimental process
Each fresh fecal specimen from five healthy volunteers was fully stirred and divided into 40-50 tubes immediately after defecation, and then stored at -80 ℃. For each specimen, we conducted standard VLP enrichment and DNA extraction (using our previous method [20]) on each of ~30 tubes for follow-up analysis. Our experimental plan and four main hypotheses are shown in Figure 1. First, to explore the optimal cycle number for PCR amplification with the high-fidelity enzyme, we prepared 4 amplification products of virome DNA samples for each specimen: one sample without amplification (0C; C represents the number of cycles) for direct DNA library construction, and three samples with 5, 15, and 30 cycles of amplification (later referred to as 5C, 15C, and 30C, respectively; Fig. 1a). Because of the relatively low concentration of DNA, the 0C sample was pooled from 12-16 tubes of unamplified virome DNA and the 5C sample was pooled from 8-10 tubes of amplified virome DNA products to obtain enough DNA for library construction for Illumina-based sequencing (minimum total DNA amount 0.05 μg; Supplementary Table 1). Then, since 15 cycles of amplification gave the best overall results, to validate the repeatability of the whole viral metagenomic process we prepared 4 parallel virome DNA products for each specimen (15C-0, 15C-1, and 15C-2 from three independent, unpooled virome DNA samples, and 15C-3 from the pooled sample of Fig. 1a) and performed PCR amplification with 15 cycles on them (Fig. 1d). Next, we prepared 3 additional amplified virome DNA products from identical pooled templates per specimen using multiple displacement amplification (MDA) for 90 minutes. One MDA product was used for Illumina-based sequencing and compared with the non-MDA samples (PCR amplification by high-fidelity enzyme) to evaluate the reliability of MDA (Fig. 1b). The other two products were mixed to obtain enough DNA for library construction and subjected to nanopore sequencing on the PromethION platform to test the capability of long-read sequencing to improve viral metagenomic assembly (Fig. 1c). Finally, we compared the efficiency and coverage of virus detection between the viral metagenome and the traditional bulk metagenome (Fig. 1e).

Optimization of the cycle number for PCR-based amplification
For each individual, samples with different numbers of PCR amplification cycles (0C, 5C, 15C-0, 15C-1, 15C-2, 15C-3, and 30C) were compared. DNA library construction and shotgun sequencing succeeded for all samples except one (Subject No. 2 with 5C, #2-5C), which failed (Supplementary Table 1). We found that the 30-cycle samples were significantly lower in both the number and total length of raw metagenome-assembled contigs than the 0C and 15C samples (Fig. 2a; Supplementary Table 2). Similar results were found for the number and total length of identified viral contigs (Fig. 2b). The 5-cycle samples showed unstable results; e.g., #3-5C (Subject No. 3 with 5C) yielded the largest number of raw/viral contigs among the #3 samples, while #4-5C (Subject No. 4 with 5C) yielded the fewest contigs among the #4 samples. We calculated the "recall rate" for each sample, defined as the number of viruses assembled from a sample divided by the number of viruses assembled from all samples of that individual (Fig. 2c). The 0C, 5C, and 15C samples recovered on average 37.8% (range 27.5% to 53.9%), 34.8% (range 11.7% to 69.8%), and 29.7% (range 12.0% to 43.6%) of the viruses in the five fecal specimens, respectively, while the 30C samples recovered on average only 13.0% (range 4.2% to 18.2%). In summary, these findings suggest that 0C, 5C, and 15C samples can reconstruct a considerable proportion of the fecal DNA virome, and that 15 amplification cycles is the best choice considering its experimental convenience and the stability of its results. Notably, although more contigs were recovered in the low-cycle (i.e., 0C, 5C, and 15C) samples, neither their N50 length nor the estimated completeness of their viral contigs appeared to increase (Fig. 2b; Supplementary Fig. 1), suggesting that the assembly performance could still be improved by other technologies.
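The recall rate defined in this section is a simple set ratio. A minimal sketch, assuming each sample is represented by the set of nonredundant virus identifiers assembled from it; the sample names and identifiers are invented for illustration.

```python
def recall_rates(sample_to_viruses):
    """Per-sample recall: viruses in the sample / viruses across all samples.

    sample_to_viruses: {sample_name: set of virus ids} for ONE individual.
    """
    all_viruses = set().union(*sample_to_viruses.values())
    if not all_viruses:
        return {name: 0.0 for name in sample_to_viruses}
    return {
        name: len(viruses) / len(all_viruses)
        for name, viruses in sample_to_viruses.items()
    }


# Toy example with made-up identifiers.
example = {
    "0C": {"v1", "v2", "v3"},
    "15C": {"v2", "v4"},
    "30C": {"v1"},
}
rates = recall_rates(example)
```

Because the denominator is the union over an individual's samples, adding new samples (as done later for the MDA comparison) changes the recall rates of the existing ones.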
To further evaluate potential deviations in viral community structure, we grouped the viral contigs of all samples into a catalog of nonredundant viruses at 95% nucleotide similarity [29] and then profiled the viral composition of the samples by mapping the sequencing reads against this catalog. Analysis of within-sample viral diversity based on these profiles revealed that the 30C samples had the lowest diversity (in all three diversity parameters) compared with the samples amplified with other numbers of cycles, while the 15C samples had almost the same Shannon and Simpson indexes as the 0C and 5C samples (Fig. 3d). Principal coordinates analysis (PCoA) and Spearman correlation analyses of the viral profiles showed that all samples belonging to the same individual clustered closely (Fig. 3e; Supplementary Fig. 2), demonstrating the high consistency of viral composition across different numbers of amplification cycles.

Validation of technical replication
The four 15C repetitions of each fecal specimen, produced with the same experimental procedures, differed markedly in DNA concentration and total DNA amount, and their data production and raw metagenomic assembly performance were not significantly different (coefficient of variation [CV] <40% for each sample; Supplementary Tables 1-2). Also, the number of identified viral contigs and their assembly parameters did not differ among repetitions. For the 4 repetitions from each individual, the proportion of viral reads and the within-sample diversity parameters (i.e., Shannon and Simpson indexes, observed number of viruses) were similar (CV <40% for all; Supplementary Fig. 3a). PCoA showed that the 4 samples of the same specimen clustered together (PERMANOVA R² = 0.97, p<0.001) and differed significantly from the samples of other individuals (Supplementary Fig. 3b). Similarly, Spearman correlation analysis revealed that samples of the same specimen were highly consistent (ρ = 0.91±0.07, range 0.73 to 0.98), while samples from different specimens were distant (ρ = 0.04±0.23, range -0.27 to 0.42) (Supplementary Fig. 3c). These findings demonstrate that independent repetitions of the viral metagenome procedure yield highly reproducible results.
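The coefficient of variation used as the reproducibility criterion is the standard deviation divided by the mean. A minimal sketch, assuming the sample (n-1) standard deviation; the text does not state which estimator was used, and the example values are invented.

```python
import statistics


def coefficient_of_variation(values):
    """CV = sample standard deviation / mean (needs at least two values)."""
    mean = statistics.mean(values)
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return statistics.stdev(values) / mean


# Made-up contig counts for four replicates of one specimen.
cv = coefficient_of_variation([90, 100, 110, 100])
is_reproducible = cv < 0.40  # the <40% threshold quoted in the text
```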

Evaluation of MDA
The MDA samples did not differ significantly in raw metagenomic assembly performance from the 5C, 15C, and 30C samples, but their number and total length of raw contigs were lower than those of the 0C samples (Supplementary Fig. 4a; Supplementary Table 2). Similarly, the viral identification results of the MDA samples were not very different from those of the non-MDA samples (Supplementary Fig. 4b). The "recall rate" (updated by adding the new samples) of the MDA samples averaged 22.6% (range 12.0% to 29.9%), which was slightly lower than that of the 15C samples (average 27.2%) but markedly higher than that of the 30C samples (average 12.1%) (Supplementary Fig. 4c). Moreover, within each individual, the MDA samples were highly consistent with the non-MDA samples in both viral diversity and composition (Supplementary Fig. 5).
Combining short and long reads improves viral metagenome assembly
Preprocessing of the data generated by Nanopore sequencing yielded, on average, 266,026 (range 77,682 to 431,229) long reads per sample, with an average read length of 5,071 bp and an average quality score of 8.8 (corresponding to a base accuracy of 86.8%; Supplementary Table 1). Notably, an average of 96.3% of the short reads could be robustly mapped to the Nanopore long reads for each individual, confirming that the long reads represented the DNA virome of the original specimens well. Based on this large amount of viral sequencing data, we first performed a state-of-the-art short-read assembly for each individual using the combined data of the four 15C samples and then used the long reads to scaffold the short-read-assembled contigs to generate a hybrid result, followed by gap-filling based on both short and long reads (see Methods). The hybrid assembly significantly improved the performance of metagenomic assembly compared with the short-read approach, with an average 15.8% (n = 2,743 vs.   (Fig. 3b). Moreover, although the N50 length of these viruses was not extended, we found that the number of high-quality viruses (>90% completeness as estimated by CheckV [30]) was remarkably increased in the hybrid assemblies of each sample (Fig. 3c). These findings indicate that the long reads were effective in improving viral metagenome assembly and viral genome reconstruction.
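The correspondence between the average quality score of 8.8 and a base accuracy of 86.8% follows from the standard Phred definition, Q = -10 log10(P_error). A short sketch of the conversion; the function name is illustrative.

```python
def phred_to_accuracy(q):
    """Convert a Phred quality score to per-base accuracy: 1 - 10^(-Q/10)."""
    return 1.0 - 10 ** (-q / 10)


accuracy_percent = round(phred_to_accuracy(8.8) * 100, 1)  # 86.8
```

By the same formula, the NanoFilt threshold of Q7 used in the Methods corresponds to a per-base accuracy of roughly 80%.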

Genomic and functional characterization of high-quality viruses
Next, we focused specifically on the high-quality viruses (n = 151; average length 50,952 bp; N50 length 60,043 bp; length range 3,376 to 206,128 bp; Supplementary Table 4) generated by the hybrid assembly, as these viruses represented the dominant proportion (>75%) of viral relative abundance in the original fecal specimens. The average estimated completeness of these viruses was 99.4% (range 90.9% to 100%), and 72 of them were identified as "finished" genomes because they contained high-confidence direct terminal repeat (DTR) or inverted terminal repeat (ITR) sequences. Seventy of the 151 viruses could be robustly assigned to viral families, among which members of Siphoviridae (n = 34), Microviridae (n = 10), and Myoviridae (n = 6) occurred most frequently (Fig. 4a), in agreement with previous studies reporting that these three families dominate the human gut virome [15]. Almost all known viruses were prokaryotic viruses, except for 3 eukaryotic viruses (Circoviridae, n = 2; Geminiviridae, n = 1). We also assembled the near-complete genomes of 6 viruses belonging to a candidate viral family, "Quimbyviridae" [31], suggesting that this family is probably widespread in the gut virome. To further investigate the novelty of our virus catalog, we compared the viral genomes with three large-scale human gut virus datasets: the Gut Virome Database (GVD) [8], Gut Phage Database (GPD) [3], and Metagenomic Gut Virus catalog (MGV) [6]. 60.3% (91/151) of our viruses were completely absent from all three databases (Fig. 4a), including all 10 members of Microviridae, highlighting that many unknown viruses remain to be identified in the human gut. In addition, we identified the bacterial hosts of 63 of the 151 viruses based on the homology of their genome sequences or CRISPR spacers to available gut microbial genomes. This analysis revealed some novel virus-host affiliations, such as 4 virus-host pairs between "Quimbyviridae" members and Prevotella spp. or the virus-host pair between a Microviridae virus and a Cyanobacteria species (Supplementary Fig. 6). Members of Firmicutes and Bacteroidetes were the most frequent hosts of the virus catalog (Fig. 4a), consistent with previous studies showing that these phyla are the most dominant in the healthy human gut [1,32].
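The N50 length quoted for the high-quality viruses is the length of the shortest sequence in the smallest set of longest sequences that together contain at least half of the total bases. A minimal sketch of the usual definition follows; the function name and the toy lengths are illustrative.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


toy_n50 = n50([2, 2, 2, 3, 3, 4, 8, 8])  # total 32 bp; 8 + 8 reaches half
```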
We predicted a total of 10,951 protein-coding genes from the high-quality viruses and annotated the functions of 939 of these genes based on the KEGG (Kyoto Encyclopedia of Genes and Genomes) [33] database. A total of 206 viral auxiliary metabolic genes (AMGs) assigned to specific metabolic pathways were further analyzed to elucidate the metabolic capabilities of the viruses (Supplementary Table 5). Strikingly, 22.7% of the viral AMGs were involved in sulfur metabolism (Fig. 4b), in agreement with recent reports that viruses participate widely in both organic and inorganic sulfur metabolism in the human gut [34,35]. Proteins involved in the degradation of peptidoglycan (an important structural component of bacterial cell walls) were frequently encoded by the viruses (11.4% of AMGs), such as peptidoglycan DL-endopeptidases, which function as both cell wall hydrolases and poly-γ-glutamic acid hydrolases [36] and would facilitate the infection and fitness of such viruses in their bacterial hosts. In addition, we found that the viruses encoded several important but rarely reported functions, including enzymes involved in nicotinate and nicotinamide metabolism, folate metabolism, and the metabolism of other molecules (e.g., lipopolysaccharide, glycerophospholipid, pantothenate, porphyrin). These findings considerably extend the known functional capacity of the gut virome.

Virus identification in bulk metagenome versus viral metagenome
The bulk metagenome samples generated an average of 814 viral contigs (range 479 to 1,147) with an average total viral length of 6.8 Mbp (range 3.9 Mbp to 9.6 Mbp) (Supplementary Fig. 7a), which were 49.4% and 33.5% larger than those of the hybrid-assembled viral metagenome samples, respectively. However, the N50 length and estimated completeness of the bulk samples were significantly lower than those of the VLP samples (Supplementary Fig. 7b), probably because of the lower proportion of viruses in bulk samples. Surprisingly, only an average of 16.4% (range 10.9% to 22.3%) of the viruses identified in the VLP metagenome were shared with the viruses of the bulk metagenome (Fig. 5a). Further comparison at the family level between the VLP-specific and bulk-specific viruses revealed that, although both types of viruses were dominated by Siphoviridae and Myoviridae, they differed significantly in the frequency of some families (Fig. 5b). For example, 20 crAss-like phages were recovered from the viral metagenomes but only 3 were assembled from the bulk samples. Also, the VLP metagenomes uniquely recovered all Microviridae (n = 13), Circoviridae (n = 3), Drexlerviridae (n = 2), and Genomoviridae (n = 2) viruses, whereas the bulk metagenomes specifically recovered Ackermannviridae (n = 5), Herpesviridae (n = 3), and Pithoviridae (n = 2). Moreover, an average of 2.4% (range 1.3% to 4.3%) of the viruses in viral metagenomes were recognized as prophages by the CheckV prophage algorithm, while this proportion averaged 5.4% (range 3.8% to 8.1%) among bulk viruses (Fig. 5c).
Finally, we compared the viral profiles of viral and bulk samples to investigate the differences in viral diversity and structure between these two technologies. Bulk samples showed higher within-sample diversity than the viral metagenome samples (Fig. 5d). Likewise, PCoA and Spearman correlation analyses showed that the viral profiles of viral metagenome and bulk metagenome samples from the same individuals differed markedly, with a Spearman correlation coefficient of ρ = 0.67±0.11 (Fig. 5e-f), and this was also observed in viral composition at the family level (Fig. 5g). Collectively, our findings suggest a considerable difference between the two technologies in profiling the DNA virome of the human gut.

Discussion
With the emergence of VLP enrichment technology and the rapid development of high-throughput sequencing technologies, virome research, especially on the DNA virome, has received widespread attention [24]. Disease-specific alterations in the gut virome, such as in inflammatory bowel disease (IBD) [15,37,38], irritable bowel syndrome (IBS) [39], acute malnutrition [40], and childhood obesity and metabolic syndrome [41], have been widely reported. However, a critical limitation of human gut virome studies is that the different protocols (including virus-like particle enrichment, nucleic acid purification, and sequencing strategies) adopted by different research groups have led to general discrepancies in results [42]. In this study, we developed experimental and sequencing strategies that aim to improve the viral metagenomic method for genome-level characterization of human fecal samples.
Several studies have used spiked-in viruses or mock communities of known composition to evaluate the accuracy or reproducibility of experimental procedures on complex biological samples [9,43,44]. However, because viruses differ widely in particle size, overall charge, envelope, isoelectric point, icosahedral capsid shape, and tails, the limited set of available viruses in such artificial samples cannot fully represent the virome diversity of the human gut [42]. The strength of using actual biological samples, such as feces, is that they can reveal diversity and variability that may not be apparent in low-diversity mock communities.
The main experimental steps for virome sample preparation include the concentration of viral particles, the elimination of contaminating cells and free nucleic acids, and the extraction, amplification, and purification of viral nucleic acids [17]. Herein, we applied different amplification and sequencing methods to fecal samples from five adults to identify the approach that recovers viruses at the best genome-level resolution for human gut virome investigation. Because the average yield of VLP DNA from 2-5 g of feces is only 500 ng [2], random amplification of virome nucleic acids was essential to obtain sufficient material from low-biomass samples for metagenomic shotgun sequencing. We selected two enzyme systems: Q5® High-Fidelity DNA Polymerase and the MDA-based GenomiPhi V2 Amplification kit, the latter being commonly used in current virome research [44,45]. However, MDA is known to severely decrease genetic diversity and reproducibility and can produce a large excess of ssDNA viruses [21,46]. Although the MDA samples did not differ significantly from the Q5 high-fidelity enzyme samples in raw metagenomic assembly performance, our results showed that MDA samples had a lower "recall rate" than the 15C Q5 high-fidelity enzyme samples within each individual, which indicates its shortcomings.
Stochastic or systematic biases may be associated with the extent of random amplification [24]; thus, the number of amplification cycles is an important parameter for viral metagenomic analysis. The number of PCR cycles generally depends on the viral load. Because polymerases can introduce mutations, fewer cycles are recommended for high viral loads: the fewer the cycles, the less the bias. A total of 30 cycles generally works for any stool sample [47]. To avoid initial bias, we used the same DNA extract for all amplification processes in parallel. In our results, all 0C, 5C, and 15C samples yielded a satisfactory number and total length of identified viral contigs, had almost equal Shannon and Simpson indexes, and could reconstruct a considerably large proportion of the fecal DNA virome. However, because of the low product yield per reaction in the 0C and 5C samples, we needed to combine at least 12 or 8 parallel samples, respectively, to obtain sufficient material for metagenomic sequencing, which is laborious and increases the workload. Moreover, DNA samples with low concentration were easily degraded and difficult to store, resulting in a high failure rate in library construction (e.g., the #2-5C sample in this study). Therefore, 15 amplification cycles is the optimal choice considering experimental convenience and stability of the results. In addition, a previously reported amplification method produced viral contigs with a median length of around 5 kb [46], whereas our 15C results had a median contig length of ~11 kb, showing its advantage. Based on the yield of high-molecular-weight DNA suitable for metagenomic sequencing, assembly performance, cost, and hands-on time, we selected the Q5 high-fidelity enzyme with 15 amplification cycles (15C) as the basis for our virome DNA amplification.
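The Shannon and Simpson indexes referred to above can be computed from per-virus abundances as in the following minimal sketch. Two assumptions are made here that the text does not specify: the Shannon index uses the natural logarithm, and the Simpson index is reported in its Gini-Simpson form (1 − Σp²). The function names and the toy abundance vectors are illustrative only.

```python
import math


def shannon_index(abundances):
    """Shannon diversity H = -sum(p * ln p) over non-zero proportions."""
    total = sum(abundances)
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)


def simpson_index(abundances):
    """Gini-Simpson diversity 1 - sum(p^2) (assumed form; see lead-in)."""
    total = sum(abundances)
    return 1.0 - sum((a / total) ** 2 for a in abundances)


# Hypothetical read counts for four viruses in two samples.
even_sample = [25, 25, 25, 25]      # perfectly even community
skewed_sample = [85, 5, 5, 5]       # dominated by one virus

for name, counts in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, round(shannon_index(counts), 3), round(simpson_index(counts), 3))
```

Both indexes increase with richness and evenness, so near-equal values across the 0C, 5C, and 15C samples would indicate that the number of cycles did not noticeably distort the community structure.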
Since nanopore sequencing requires larger amounts of nucleic acid than Illumina short-read sequencing, amplification is even more necessary. Nanopore sequencing is characterized by long read lengths (>30 kb), yet PCR enzymes other than those used for MDA generally yield products shorter than 10 kb. Therefore, MDA is the most suitable amplification method for preparing nucleic acid material for nanopore sequencing. The advantage of hybrid assembly, which combines short-read and long-read sequencing and results in more high-quality viral genomes, has been widely reported [48,49].
Herein, we conducted a long-read shotgun metagenomic experiment on MDA-amplified DNA using the PromethION nanopore sequencer to study the gut virome. Our results indicated that hybrid assembly based on the short- and long-read data was effective in improving viral metagenomic assembly and genome reconstruction.
The novelty of our virus catalog was demonstrated by querying it against three large-scale human gut virus datasets: GVD, GPD, and MGV. The small number of eukaryotic viruses recovered may be related to the filtration step in the virus enrichment process. Interestingly, we found that a large proportion of viral AMGs were involved in sulfur metabolism, in agreement with recent reports that viruses participate widely in both organic and inorganic sulfur metabolism in the human gut [34,35]. Sulfide provides a fitness advantage to viruses, and viruses are also drivers of organosulfur metabolism, with important implications for human health [34]. Since most VLPs were bacteriophages that infect bacterial hosts, enzymes related to peptidoglycan metabolism were frequently encoded by the viruses, which is consistent with our viral AMG results (up to 11.4%). For example, peptidoglycan DL-endopeptidases function as both cell wall hydrolases and poly-γ-glutamic acid hydrolases [36], which may facilitate interaction with the bacterial host. Although the mechanism is still unclear, other research has shown that bacterial lipopolysaccharide or peptidoglycan seems to be used by viruses to protect themselves [50]. Other enzymes present among our viral AMGs, involved in nicotinate and nicotinamide metabolism, folate metabolism, and the metabolism of other molecules (e.g., glycerophospholipid, pantothenate, porphyrin), have been reported in the gut microbiome [51] but rarely in viral research. Horizontal gene transfer may help deliver these functional genes from a virus to its hosts, thereby further affecting intestinal physiological function. These findings greatly extend the known functional capacity of the gut virome and reinforce the necessity of incorporating viral contributions into further research on gut microbial function.
In the final part of this study, we compared the results of the viral metagenomic and bulk metagenomic approaches. On average, only 16.4% of the viruses identified in the viral metagenomes were shared with those of the bulk metagenomes, suggesting good complementarity between the two methods. In addition, an average of 2.4% of VLP viruses were recognized as prophages, compared with an average of 5.4% of bulk viruses. The main reason may be that prophage genomes are mostly integrated into their bacterial hosts, and viral metagenomic preparation removes hosts such as bacteria through steps such as filtration. Therefore, viral metagenomic sequencing has inherent limitations for prophage studies. In contrast, prophages in bacterial genomes are not missed in bulk metagenomic analysis, because they are retrieved when the bacterial fraction of the samples is sequenced. Our results also suggest a considerable difference between these two technologies in profiling the virome of the human gut ecosystem. Thus, for more comprehensive virome information, the two methods should be combined.

Conclusions
Overall, we developed an improved and reproducible workflow that combines Illumina sequencing of DNA amplified with a high-fidelity enzyme for 15 PCR cycles and nanopore sequencing of MDA-amplified DNA to uncover hundreds of high-quality viral genomes from five fecal specimens. This work provides methods for studying the virome in feces.

Declarations
Ethics approval and consent to participate
Ethical approval for this study was obtained from the ethics committee of Dalian Medical University [File No 2020-014], and informed consent was obtained from all participants.

Overview of the experimental design of this study. The volume (average from 5 subjects) of viral DNA samples used for Illumina or Nanopore sequencing is shown, and the detailed information of all samples is provided in Supplementary Table 1. VLP, virus-like particle.

Comparison of performances for short-read assembly and hybrid assembly. a-b, Statistics of the raw metagenomic-assembled contigs (a) and identified viral contigs (b) for samples using two assembly approaches. Paired Wilcoxon rank-sum test: * p<0.05; ** p<0.01. c, Comparison of the N50 length of all viruses, the number of high-quality viruses, and the short-read mapping rate between samples using two assembly approaches.
