A unified catalog of 64,922 phage genomes from the ruminant gastrointestinal tract
To provide a comprehensive overview of the phages associated with the gastrointestinal tract (GIT) of ruminants, we collected a total of 2,333 metagenomic samples from 18 previously published research [6, 7, 13, 23, 26, 58, 66-78] (Table S1) that covered ten GIT sites from eight ruminant species, including (Fig 1A and Tables S1, S2). including buffalo (n = 745), cattle (n = 930), goat (n = 563), sheep (n = 133), deer (n = 115), yak (n = 50) and cow (n = 46). After quality filtering and removing host DNA sequences (Methods), a total of 14.17 terabytes (Tb) of clean data with more than 33 million reads and 9 billion bases per sample were retained (Table S1). We assembled them into a total of 302,721,852 contigs using MEGAHIT [42], averaging 132,251 contigs per sample with an N50 length of 3,836 (Table S4).
To identify putative phage genomes, we screened the assembled contigs using a bioinformatics pipeline adopted from Luis et al. [32] (Methods), followed by quality assessment for viral genome completeness using CheckV [45] and dereliction using CD-HIT [46]. We obtained a total of 74,519 viral contigs (mostly bacteriophages) with >50% completeness and length >1.5 kb, corresponding to 64,922 non-redundant viral populations (VPs), i.e., species-level clusters at an Average Nucleotide Identity (ANI) of 95%. We defined the latter (i.e., the 64,922 non-redundant VPs) as the Unified Ruminant Phage Catalogue (URPC). Among these, 6,035 (9.29%), 3,085 (4.75%), and 55,802 (85.95%) were classified as complete, high- and medium-quality, respectively, according to the CheckV tool (Fig 1B; Table S5).
Previous studies have commonly employed a 5kb threshold for identifying metagenome-based viral genomes [31, 39]. In our study, we opted for a 1.5kb threshold. To substantiate this choice, we categorized all URPC phages into four length groups: <5kb (n = 777), 5~30kb (n = 14,462), 30~60kb (n = 35,199) and >60kb (n = 14,484). We first compared the qualities of phages in each group and were intrigued to discover that the <5kb group exhibited the highest proportion of complete (29.81%) phage genomes, as determined by CheckV, in comparison to the other three groups (Fig. S2A). Moreover, taxonomic annotation, as per our methods, was successful for 91.48% of phages in the <5kb group, surpassing the rates observed in all other groups (Fig. S2B). Therefore, our findings indicate that the utilization of short contigs (i.e., 1.5~5k) not only aids in more accurately estimating the number of phages but also surprisingly enhances the annotation rates.
We used the rarefaction analysis to show that the saturation curve is far from plateaued, and more samples are required for the discovery of ruminant GIT phages (Fig. 1A). Similar trends were observed in human gut virome catalogs such as the Metagenomic Gut Virus (MGV) [21] and a phage genome catalogue of the Japanese [20]. Among all the animal hosts, we obtained the highest number of VPs (n = 33,156) in the buffalo, followed by the cattle (n = 10,589), goat (n = 8,756), and sheep (n = 5,876). And the number of viruses identified varied in different GIT sites (Fig. S3), which correlated with the number of samples we collected. The genome size and viral taxa vary among different animal hosts, indicating species-specific viral composition. Given the recent interest in human gut phageome, we then compared the genome length of URPC and other published metagenome-assembled human gut viral genomes, and found that URPC genomes were significantly longer than those in the human gut (p < 2.22e−16, Wilcoxon Rank Sum test; Fig. S4).
We annotated the VPs using VirusTaxo [52] and Demovir (https://github.com/feargalr/Demovir) (Fig. S5; Table S5) and assigned 74.69% of the VPs at the family level (Fig 1C). Among the annotated VPs, 16,507 (25.42% of the total) belong to the Siphoviridae, followed by Poxviridae (n = 6,327), Mimiviridae (n = 5,409), Baculoviridae (n = 3,962), Myoviridae (n = 3,846), Podoviridae (n = 2,360) and Microviridae (n = 2,291). The overall taxonomic distribution, dominated by viral families such as Siphoviridae, Microviridae, Myoviridae, Podoviridae, was consistent with other metagenome-derived viral catalogs in ruminant rumen (RVD) [31] and human gut [21, 33]. Particularly, we reannotated the viral genomes from RVD using our pipeline for taxonomic classification (see Methods), and we found that the families Siphoviridae, Podoviridae, Myoviridae, Baculoviridae, and Myoviridae accounted for majority of the viral genomes in both URPC and RVD datasets. However, we also identified more phages from family Podoviridae than RVD (3.6% in URPC, and 0.5% in RVD) (Fig. S6; Table S6) indicating that URPC expands the diversity of the ruminant gastrointestinal phage genomes.
We then examined the novelty of the URPC phage genomes by comparing them with several public viral databases including the NCBI viral Reference genomes (Release 201, Jul 06, 2020), IMG/VR v3 [39], four public rumen virome datasets [25-27, 31] and four human gut virome genome catalogs [21, 32-34] (Table S3). Applying an Average Nucleotide Identity (ANI) threshold of 95%, we observed that URPC exhibited the highest number of shared viral populations (VPs) with the RVD (Fig. S7A). Notably, 28.11% of URPC genomes were found in the RVD, while 33.84% of RVD genomes were identified in URPC (Fig. S7A). The substantial overlap between the two datasets can be attributed to the similar number of rumen samples used in URPC (826) compared to the RVD (975), despite variations in tools and criteria for viral contig identification in the latter [31] (refer to Table S7 for a detailed comparison). For a fair comparison, only 41,738 VPs from the RVD dataset meeting the same criteria as our dataset (i.e., completeness > 50%) were considered. At these criteria, this study identified a significantly higher number of VPs (64,145) compared to the RVD (Table S7).
Furthermore, with the inclusion of three additional public rumen phage datasets, a total of 46,668 (71.89%) URPC phages were determined to be novel at a 95% ANI threshold (Fig. S7A), signifying URPC’s substantial contribution to expanding the ruminant gastrointestinal tract phage dataset despite prior outstanding works. When considering all the aforementioned public viral datasets, we found that 60.53% (n = 39,300) of VPs were considered novel at the 95% ANI threshold, indicating that the majority of URPC phages are novel (Fig. S7B).
Organism-specific distribution of URPC genomes in animal hosts
To investigate the correlation between the composition of VPs and their animal hosts, we first calculated the distribution of the VPs in each animal host. We discovered that 99.91% (n = 64,863) of VPs had only one animal host (referred to as organism-specific from now on), while only a few (n = 59) appeared in two or three animal hosts (Fig. 2A). To evaluate the distribution of phages with their animal hosts under higher level, we clustered the VPs into viral clusters (VCs) using methods adopted from the GPD [32] (Methods) and generated a total of 55,635 VCs. Among these, 99.06% of (n = 55,122) the VCs are organism-specific. Similarly, most (91.43%, n=50,874) of the VCs were distributed only in one GIT sit (Fig. 2B). Among the 4,761 VCs that were distributed in two or more GIT sites, 92.69% (n = 4,413) came from the same animal host, indicating an organism-specific distribution of the phages in the animal hosts. Among the 80 “broad-range” VCs (presented in three or more animal hosts), most of their animal hosts were goats (n = 60), buffaloes (n = 55), cattle (n = 52), sheep (n = 51), and deer (n = 39), while there were fewer in yaks (n= 4), camels (n = 2), and cows (n = 1) (Fig. 2C), which might be due to fewer available GIT samples of the latter three animals. To find out whether these “broad-range” VCs were food-related, we also include the VPs from the IMG/VR v3 database and re-did the viral clustering. We found that two of the “broad-range” VCs could be clustered with phages found in the terrestrial, freshwater, and plants (Fig. 2C), which confirmed previous research that phages could be readily introduced into the rumen from water sources, as well as housing and farm infrastructure [19]. However, the origins of the other 78 broad-range VCs remained to be identified.
We next characterized the largest VCs (i.e., the VCs were ranked according to the number of containing VPs). 27 out of the top 50 had two or more animal hosts, suggesting that the diverse VCs were also the well-adapted ones; this trend was more apparent among the top ten VCs (i.e., of these, eight were found in two and more animal hosts). In addition, we observed that most of the top ten VCs consisted of lytic bacteriophages (Fig. 2E; the lifestyles were determined using a DeePhage tool; Methods). Most of the phages in the top ten VCs were 30~60kb in size, which were well within the size range of typical phage genomes; in contrast, crAssphages and Gubaphages were the most diverse in the human gut virome, which were significantly longer (~100kb in size) [32, 79]. Interestingly, the majority of broad-range phages in the top ten VCs were identified in the rumen (63%; Table S5), much higher than those identified in the rectum (including fecal samples; 17%), despite that we had comparable samples from the two GIT sites (826 vs. 753, Fig. 1A). These results strongly suggested that at least some of the rumen phages were likely originated from the environment.
In summary, we found that most of the ruminant phages are organism- and GIT site-specific at both the VP and VC levels, with a few broad-range ones, likely originating from the environment.
Co-diversification of phages with their animal hosts
We next investigated whether the phage genomes in the broad-range VCs could show co-diversification patterns with their animal hosts. We focused on the 80 broad-range VCs that were presented in three or more animal hosts; for each of the VCs, we used a phylogenetic tree-based method to test whether phages from the same animal host were significantly closer than those of different animal hosts (Methods). We observed that in 83.75% (n=67) of the VCs, the phages tend to significantly co-evolute with their animal hosts (i.e., VPs from the same animal host in a VC were clustered together on the evolutionary tree and had significantly closer evolutionary distance), while only 2.5% were classified as significantly not co-evolution (Fig. 3A). We included in Fig. 3B-F a few typical examples to showcase our analysis. As shown in Fig. 3B, we identified two deer phages in VC_1167, which showed closer phylogenetic relationships to phages of other ruminants than to each other, indicating significant non-co-evolution. Conversely, the Fig. 3C-F showed a few cases of significant co-evolution in VC_1341, VC_1220, VC_35, and VC_95, in which multiple phages from the same animal hosts often cluster together in their respective phylogenetic trees. Further efforts would be required to illustrate whether the co-evolution was due to the adaption of the microbial hosts of the phages to the ruminant species.
The URPC contains the highest proportion of lytic phages as compared with other environments
The observation that the lytic phages account for seven of the top ten VCs encouraged us to further characterize the lifestyle of the ruminant phages. Because the metagenome samples we collected were not separated viral particles, we expected that many phages were derived from bacterial cells and were more likely to be integrated into the bacterial chromosomes as temperate phages. However, we found that the majority (59.60%, n = 38,696) of the phages were classified as lytic phages (virulent or uncertain virulent) using a DeePhage tool [49], which outperformed several existing tools in terms of accuracy. We found similar proportions of lytic phages in the eight ruminants (55.15% to 88.59%, with an interquartile range (IQR) of 55.77% to 61.86%; Fig. 4A) as well as across GIT sites (53.76% to 61.51%, with an IQR of 57.35 to 59.08%; Fig. S8). Similarly, we also identified a comparable percentage of lytic phages in two rumen viral genome datasets (48.55% in RVD, and 52.43% in moose rumen [26]) that contained at least 200 phage genomes (Table S3), further supporting our findings.
We then compared with public datasets and found surprisingly that phages from all other environments had lower proportions of lytic phages. Since the length and completeness of the virus affect the number of genes detected in the viral genome, we performed the same quality control consistent with our URPC (filtering length >1.5k and completeness >50% estimated using CheckV [45]) on the viral genomes from public databases (Table S3). For example, 44.40% of phages in the IMG/VR v3, the most comprehensive phage database so far, are lytic, which is significantly lower than the URPC. We found the same results when stratifying the IMG/VR phages according to the habits (Fig. 4A). In addition, we found an overall of 32% lytic phages in several human gut virome datasets (24.02% to 40.81%, interquartile range (IQR): 28.00% to 38.03%; Table S3), consistent with previous observations that the human gut phages were mostly temperate [20, 32].
Interestingly, out of all the human gut virome/metagenome datasets we have analyzed, we found only one that contained a similar proportion of lytic phages to the URPC: the Tanzania hunter gut metagenome. As shown in Fig. 4B, 57.5% of the phages identified in the Tanzania hunter dataset were lytic, similar to that of the URPC (p = 0.61, Chi-square test), while both were significantly different from the other human gut virome datasets (p < 0.01, Chi-square test). We speculate that the different lifestyles between the Tanzania hunter and the other human samples might underlie the different phage lifestyles, such as the consumption of raw or less processed foods and high exposure to microbe-enriched environments of the hunters [80]. However, due to the limited numbers of samples (i.e., 40 metagenomic samples from the NCBI SRA database; PRJNA392180) and identified viral contigs (i.e., 40 non-redundant viral contigs with completeness above 50%), our hypothesis should be tested using larger datasets.
Bacterial and archaeal host prediction of the URPC phages identifies dozens of lytic phages targeting methane-producers
Predicting viral hosts is crucial for understanding their roles and impacts [82], and phages can serve as ideal tools to regulate ruminant GIT ecosystems by limiting the number of their microbial hosts through lytic infections [83]. We thus predicted hosts for all the VPs using metagenome-assembled genomes (MAGs) from public datasets [6, 7, 58, 59] using two different methods, namely the CRISPR-spacer- and sequence similarity-based methods (Methods). We were able to assign a total of 9,271 phages (14.28% out of the total) to their putative bacterial/archaeal hosts, including 4,690 (50.59%) to a public ruminant GIT genome collection and 5,562 (59.99%) to the MAGs in the Global Microbial Gene Catalog (GMGC [59]). Among these, 754 phages could be assigned consistently to the same hosts by both methods (e.g., the highly confident prediction results; Fig. 5A, Table S8). We observed little overlaps between the predicted virus-host connection pairs produced by the two methods, consistent with previous results [84, 85]. In general, a total of 7,227 (77.95%) phages were classified as a specialist (Fig. 5B), meaning that they infect only one genus (i.e., specialist phages), while the others were predicted to infect two or more genera (i.e. generalist phages), which confirmed previous research that phages have a limited host range [20, 86, 87].
Among all the predicted virus-microbial host relationships, Firmicutes (the combination of Firmicutes_A, Firmicutes, and Firmicutes_C) was the most common phylum targeted by the URPC phages (n = 5,214, i.e., the number of interacting phages), followed by Bacteroidetes (n = 3,554) (Fig. 5C), which were the two groups of beneficial bacteria that were dominant in the ruminant GIT [88]. Many of the functionally important genera were targeted by the phages. At the genus level, the most predicted hosts were Prevotella (n = 1,035; one of the most abundant and versatile genera that contributes to hemicellulose degradation, lignocellulose pretreatment, and feruloyl esterase activity [89]), followed by cellulose digestive Bacteroides (n = 625) which secrete cellulases and hemicellulases to degrade cellulose and hemicellulose into glucose and other sugars for ruminants [12, 90], Lachnospira (n = 524) and Roseburia (n = 381) major short-chain fatty acids (SCFAs) producers in the rumen providing energy and anti-inflammatory effects [12, 91, 92]. Our results suggested important regulatory roles of phages in the ruminant GIT microbial structures and functions.
Phage could be an ideal tool to inhibit the growth of methane-producing archaea in the GIT of ruminants [93]. However, no lytic phages targeting methane producers have been identified [16]. Here, we retrieved 109 phages that infected methanogenic archaea from the phage-microbial host analysis (Fig. 5D). Of these, 74 were lytic (virulent or uncertain-virulent; Method) and could target the six genera of methanogens (i.e ISO4, Methanobrevibacter, Methanobrevibacter_A, Methanobrevibacter_B, Methanocorpusculum, Methanosphaera) annotated by GTDB-Tk [94]. These results should facilitate targeted isolation of phages and experimental validation of their lysis efficiency against methanogenic archaea.