Identification of archaeal genomic contigs from the metagenomes expands the archaeal diversity in the human gut
We identified the human-associated archaeal genome contigs from 12 human microbial metagenomic datasets consisting of 3,971 samples from rural and urban human populations across 13 countries (Table S1), resulting in 17,830 archaeal genomic contigs from the human gut samples, but only 33 from the samples of other body sites (detailed in Methods, Fig. 1a, Fig. S1, and Fig. S2). Thus, we focused on the archaea inhabiting the human gut. These contigs were taxonomically assigned based on the taxonomic information of the encoding proteins using the GTDB taxonomy system24 (detailed in Methods). The result revealed a remarkably high taxonomic diversity of as-yet undescribed archaea in the human gut across 4 phyla including Methanobacteriota (72.76%), Thermophlasmatota (27.10%), Halobacteriota (0.10%) and Altarchaeota (0.03%), 8 families, 22 genera and 56 species (Table S2a and Table S2b). To further validate this result, these 17830 archaeal contigs were mapped to the 1,162 classified human gut archaeal genomes collected in UHGG23 with BLASTn (E-value ≤ 10 − 5 and coverage ≥ 0.5, Table S3a). The result showed that 15,732 contigs were matched to 833 gut archaeal genomes (see Table S3b, Fig. 1b), while 2,098 (11.8%) did not yield a significant match in the 1,162 reference genomes. These 833 genomes were classified into 3 families (with 7 genera and 13 species) (Fig. 1b). Most genomes were taxonomically affiliated with the genus Methanobrevibacter_A (735 genomes; 88.24%), in agreement with earlier reports12. Other genomes were affiliated to Methanomethylophilaceae UBA71 (44; 5%), Methanomethylophilus (19; 2.3%), Methanosphaera (17; 2%), Methanomassiliicoccus_A (11; 1.3%) and Methanocorpusculum MX-02 (6; 0.7%). The discrepancy in the number of the archaeal taxa assigned by these two methods suggested that more novel archaea with extremely low abundance likely are present in the human gut.
Then, these 17,830 archaeal genomic contigs were de-replicated by clustering according to 95% average nucleotide identity (ANI) and only the contigs with length \(\ge\)3 kbp were kept. The longest contig within each cluster was chosen as the representative sequence, resulting in 2,948 nonredundant archaeal genomic contigs. We assessed the prevalence of these archaea in the human populations based on the number of metagenomic sequencing reads mapped to the representative sequences (Fig. 1c). Due to raw reads of some HMP samples are not available, the total reads we used for the following analysis were derived from 1,904 metagenomic samples (Table S4). As a result, a total of 1,770 (92.26%) samples had at least one read that was mapped to the archaeal contigs. It turned out that the most prevalent archaeal genera in the human gut were Methanobrevibacter_A (82.14%), followed by Methanomethylophilaceae ISO4-G1 (74.32%), Methanomethylophilaceae UBA71(58.56%), Methanomethylophilus (28.2%) and Methanosphaera (20.27%), indicating the most common archaea in the human intestine.
The human gut carries a complex, previously unexplored virome
To perform a comprehensive search for human gut archaeal viruses, first, we constructed a Human Gut Associated Archaeal Spacer Database (HGASDB) including 13,021 nonredundant CRISPR spacers recruited from the identified archaeal genomic contigs and the 1,162 archaeal genomes of UHGG23 (detailed in Method). These spacers were derived from the contigs and genomes of different archaea lineages, with the genus Methanobrevibacter_A contributing to the largest number of spacers (89.82%). In particular, 8,962 spacers specifically were derived from M. smithii, 2,549 spacers from M. smithii_A, and 185 spacers from other three species (M. woesei, M. orals, and M. millerae) (Fig. 1d, Table S5). A small number (n = 1,325; 10.18%) of spacers were derived from other archaeal genera. We then developed a pipeline based on the spacers and viral signatures (i.e. hallmark genes for the known archaeal viruses) to recruit the archaeal viral sequences in the 2,271 assembled metagenomic datasets and the publicly available human gut virus collections (see Methods and Fig. 2a for further details), resulting in 16,234 sequences. After we filtered out archaeal and bacterial genomic contamination and the sequences not encoding the viral signatures (see in Methods), these sequences were ultimately clustered (\(>\)95% Average Nucleotide Identity) into 1,279 nonredundant viral species, and the longest sequences within each species were selected as the representative. These representative sequences were considered as final archaeal virus sequences, namely Human Gut Archaeal Virome Database (HGAVD), for further analysis. In particular, 1,080 archaeal viral representative sequences in HGAVD were detected from the assembled metagenomic datasets, 89 from IMG/VR25, 92 from GPD26, 14 from GVD13, 2 from HGV10, 1 from EVP27, and 1 from GL-UVAB28.
Subsequently, genome completeness for the representative sequences of these viral species was estimated using CheckV29, giving rise to four different quality tiers: complete genomes (3%), high-quality (9%), medium-quality (7%), low-quality (17%), and the remainders (67%) being undetermined (Fig. 2b and Table S6). In addition, we applied VirSorter 30 (categories 1–6) and VirFinder 31 (score \(\ge\)0.7 and p \(<\)0.05) on the sequences in HGAVD, and in total 442 HGAVD sequences (Table S6) were classified as viral sequences by these tools. We did not detect plasmid signatures using PlasForest32 in these sequences. Two sequences encoded both transposase genes and viral signatures. We then aligned the HGAVD species with the 85 nonredundant proviruses derived from 557 (50–100% completeness) of the 1,162 gut archaeal genomes in UHGG23, resulting in 56.5% (n=723) of the 1,279 species sharing identity > 95% with those proviruses. These results demonstrated that the identification workflow we constructed here is reliable. Considering that the HGAVD sequences encode viral signatures and were matched to the spacers, a number of the HGAVD sequences likely are novel archaeal viruses and were ignored by these well-developed software tools.
To further explore the extent to which the HGAVD viral species were homologous to the known archaeal viruses in the RefSeq database (v201) (built-in database of vConTACT2) and thereby taxonomically classify these viruses, we constructed the gene sharing networks generated by vConTACT2, where viral clusters (VCs) approximate genus level taxonomy33. With the sequences from the archaeal viral genomes in the database RefSeq and the 1,279 archaeal viral species, this analysis clustered 735 HGAVD species into 61 VCs, 391 viral species into outliers (where contigs were assigned to a VC but shared fewer similar proteins than the bulk of the cluster), and 153 viral species into singletons (sequences that did not cluster with any other sequences). Only 2 VCs included one known reference viral sequence, respectively. This suggests that the majority of the VCs derived from the human gut likely represent viral genera that were novel to the archaeal viruses in RefSeq (Table S7). Moreover, in agreement with the previous gut virome studies26,34, the majority (67.9%) of the HGAVD viral species can’t be taxonomically classified into any known viral order. Less than half of the species (32.1%) were taxonomically classified into, specifically, the Caudovirales order (n = 388) (virus characterized by having tails and icosahedral capsids), the Cremevirales order (n = 13), and the Haloruvirales order (n = 2) (Fig. 2c).
We further compared the HGAVD viruses to those of the publicly available virus collections (detailed in Method) including 11,827 sequences (only the sequences encoding at least one protein sequence with hit to those of HGAVD) from MGV (Metagenomic Gut Virus) catalog34, 3,502 from Prokaryotic Viral Refseq (supplied by vConTACT2), and 37 provirus sequences (sharing identity\(\le\)95% with the HGAVD sequences) derived from the archaeal genomes in UHGG (Table S8, Fig. 3a and Fig. S4). The MGV catalog is the newest human gut viral database and contains extensive viral genomic diversity. The vConTACT2 network analysis resulted in the generation of 68 VCs with at least 1 HGAVD prediction. However, the 102 MGV archaeal virus sequences were clustered into 15 VCs, and 37 proviruses were only clustered into 9 VCs, reflecting the great diversity of the gut archaeal virus taxa represented by HGAVD at the genus level. Strikingly, we found that 1,097 of the 1,279 HGAVD viral species (86%) were not grouped with any viral genomes from other collections (Fig. 3a), while a majority of 37 archaeal proviruses (78.4%) and the MGV archaeal viral sequences (83.3%) were grouped with the HGAVD viruses, indicating that HGAVD can represent most of the archaeal viruses in other gut virus collections, and the number of HGAVD viruses is much higher than that of other databases. Taken together, HGAVD considerably expanded the previously unknown archaeal viral diversity in the human gut.
Archaeal viruses are highly prevalent in the human gut
We estimated the abundance of the HGAVD viral species in the human gut samples by metagenomic read recruitment and accordingly performed the principal coordinate analysis (PCoA). No significant difference in the human gut archaeal viral composition was observed between male and female sex (ANOSIM, r = 0.002, p = 0.306) or according to BMI distribution (ANOSIM, r = 0.011, p = 0.201) (Fig. S7). Nevertheless, when the analysis was stratified by country, we observed that the diversity of these archaeal viruses was distinct in the samples of different locations. In particular, the archaeal viral communities between the Tanzanian and the populations from China, America, and the UK displayed significant differences, respectively (ANOSIM, R > 0.7, p < 0.001; Table S9 and Fig. 3b).
Based on the abundance determined by the reads mapping, we further investigated the prevalence of these viruses among the human populations. The result indicated that the prevalence of 7 archaeal viral species was > 10% across the human populations. These viruses belonged to 7 different VCs (Table S7 and Fig. 3c). These 7 viral species all were predicted to infect M. smithii and had a higher prevalence in Asian, European, and American populations than in the African population. Moreover, 712 archaeal viral species were prevalent in 1% of the human population. Remarkably, the virus IMG|UGV-GENOME-0271153, one putative medium-quality viral genome (40.51 kb, CheckV), had the highest prevalence (72.16%) among the human populations and was predicted to infect Methanobrevibacter smithii. This virus genome encodes 46 genes and 8 of them were predicted for the caudoviral functional proteins (Fig. 3d and Table S10a). Furthermore, all the viral sequences (23kbp-55kbp in length) in the same VC with this virus had the host of M. smithii (Fig. 3a) and were derived from the samples of United Kindom, Sweden, Austria, United States, China, Spain, and Madagascar, respectively, further suggesting the wide distribution of this virus among the global population. In particular, another highly prevalent caudovirus (10.7%) IMG|UGV-GENOME-0263128 encoding 51 genes was detected more frequently in the African population than IMG|UGV-GENOME-0271153 (see Fig. 3c). The viral sequences in the IMG|UGV-GENOME-0263128-contained VC were from 19kbp to 56 kbp in size and were predicted to infect the hosts of M. smithii and M. smithii_A (Fig. 3a). These two highly prevalent viruses likely are temperate because integrase gene was detected on the genome of the virus (IMG|UGV-GENOME-0263128) or the genomes of other viruses within the same VC (IMG|UGV-GENOME-0271153) (Fig. 3d and Table S10b).
It is worth mentioning that 13 smacoviruse species were identified and were clustered into 3 VCs with lengths ranging from 2.0 ~ 2.5kbp in HGAVD, reflecting the diversity of these small viruses in the human gut. Smacoviruse in the order of Cremevirales has a small circular single-stranded DNA genome and had been identified in fecal samples (both feces and rectal swabs) of various animals35. These HGAVD smacoviruses were targeted by 7 spacers derived from the archaeal genomes in UHGG and they were predicted to infect Methanomassiliicoccus intestinalis or Methanomassiliicoccus_A intestinalis. Compared with the cohort of Asia and America, the prevalence of smacovirus was higher in African and European populations (Fig. 3e).
Viruses infecting M. smithii are a major component of the archaeal virome in the human gut
To accurately investigate the diverse virus-host interactions, we particularly screened for the CRISPR spacers present in the 1162 archaeal genomes in UHGG to target the HGAVD viral sequences (see Methods). As expected, the majority (n = 1217; 95.2%) of the viral species connected to the genus Methanobrevibacteria_A, which is dominant in the human gut archaeaome (Fig. 4a and Fig. 1c). We then measured viral diversity by determining the number of VCs for each archaeal genera, revealing that the genus Methanobrevibacter_A harbored a significantly higher viral diversity than those of other archaeal genera (Fig. 4b), with 51 VCs assigned to this genus. Among the 51 VCs, 47 VCs were specific to M. smithii, only 17 VCs specific to M. smithii_A, and 13 VCs were linked to both these two archaeal species, reflecting archaeal viruses can infect their hosts cross-species. To show this in detail, we constructed the network of host-virus by matching the HGAVD viruses with the CRISPR spacers derived from the 1,162 gut archaeal genomes. Surprisingly, we revealed that approximately one-third of HGAVD viral species had a broad host range (Fig. 4c). Namely, 434 viral species had a host range that spanned 2 archaeal species (M. smithii and M. smithii_A) and 12 viral species had a host range across 3 archaeal species (M. smithii, M. smithii_A, and M. woesei). These analyses provide a comprehensive blueprint of archaeal virus-mediated gene flow networks in the human gut microbiome.
Most of (305/388 = 78.6%) the caudoviral species in HGAVD connected to the host of M. smithii. Thus, phylogenetic trees were constructed based on the predicted large subunit terminase (LST) to estimate the diversity of these archaeal viruses. In total, the LSTs were identified in 80 of 305 viral species, with 1 belonging to the Terminase_1(PF03354) domain, 33 to Terminase_3 (PF04466) and 46 to Terminase_6 (PF03237). Therefore, phylogenetic trees of the HGAVD viruses and closely related viruses in the RefSeq database (v201) were constructed using the proteins encoding the domains Terminase_3 and Terminase_6, respectively (Fig. 4d and Fig. S5). As shown on the trees, all gut archaeal viral clades detected in our study did not include any known viruses, likely representing novel viral types. In consequence, the LST phylogeny expanded the diversity of the viruses that infect M. smithii and defined new lineages.
Archaeal virus genomes encode an extensive functional repertoire
The functional potential of human gut archaea has been extensively studied12. HGAVD can enable us to explore the functional potential of the archaeal virome in the human gut. To do this, we identified 97,208 protein-coding genes on the representative sequences for these 1,279 viral species and compared these genes with the Pfam (v32) database. Overall, 40% (n = 39,268) of the viral genes did not have significant matches (cutoff: e-value\(<\)1e-5, score \(>\) 50) in this database and were not assigned to any biological functions, indicating that remarkably little is known about the functional potential of human gut archaeal viruses (see Fig. S6 and Fig. 5a).
The viruses of M. smithii contained the most functional diversity with proteins homologous to 1,034 different kinds of caudovirus-specific proteins in the Pfam database (only the proteins assigned biological function were taken into consideration), such as prohead protein, baseplate J, portal protein, tail fibers, and terminase large subunit, whereas other archaeal viruses lacked some caudovirus-specific functional proteins (Fig. 5b). For example, except for the viruses infecting M. smithii, the remainder had no proteins annotated for lysis. In particular, the genes encoding HNH endonuclease were observed on the viral genomes of both M. smithii and M. woesei. This protein potentially cleaves DNA into genome-length units during packaging and may operate in concert with their terminase large subunit and portal proteins36.
The representative sequences of 36 archaeal viral species in HGAVD were measured as complete genomes by CheckV. They were clustered into 7 different VCs and taxonomically classified to Caudovirales (n = 33, 6 VCs) and Cremevirales (n = 3, 1 VC). Analysis of these whole viral genomes in the order Caudovirales (Fig S10 and Table S11) resulted in an interesting finding that a gene encoding the protein homologous to PeiW frequently occurred on many viral genomes (n = 23). The prototype PeiW found in the archaeal prophage psiM100 was identified as the autolytic enzyme pseudomurein endoisopeptidase produced by the thermophilic methanoarchaeon Methanothermobacter wolfeii to cleave pseudomurein cell-wall sacculi of archaeal methanogens37. The phylogenetic analysis of PeiW revealed that except for the viruses of Methanothermobacter wolfeii, other archaeal viruses could also be the carrier of PeiW, such as the viruses of M. smithii and Methanobrevibacter olleyae (Fig. 5d). When extending this analysis to all HGAVD viruses, 150 viruses encoded the genes of PeiW (Fig. S9), suggesting the importance of this gene for the archaeal viruses in infecting methanogenic archaea.
In the analysis of these complete caudoviral genomes, 29 of 33 contained the genes encoding phage integrase protein. However, only 9 genomes were predicted as proviruses, and 20 were not flanked by host DNA by CheckV. In particular, we observed that 10 genomes infecting M. smithii or M. olleyae also encoded proteins belonging to the antitoxin MazE superfamily. MazE-MazF Toxin–Antitoxin system (TA) has been found in the genomes of E. coli and other bacteria38. TA systems can function as anti-virus mechanisms and viruses have evolutionally obtained defense mechanisms to avoid these systems. For example, T4 phage harbors a MazE antitoxin against MazF encoded by E.coli for efficient growth39. In addition, a small protein belonging to the antitoxin MazE superfamily protein was found in the temperate haloarchaeal virus SNJ1 to control the lysis-lysogeny switch40. Accordingly, the antitoxin MazE protein encoded by the HGAVD archaeal viruses might highlight an arms race between the gut archaea and their viruses over TA systems by regulating the state of viruses (lysis and lysogenic). Further, we performed a phylogenetic analysis based on the MazE antitoxin protein sequences detected in these viral genomes. The phylogenetic tree shows that (Fig. 5e), the viruses predicted to infect M. smithii and M. olleyae were separated into different clades. We performed comparative genomic analysis on the representative caudoviral sequences selected for each VC of the complete HGAVD caudoviruses (Fig. 5f). Although these viral sequences encoded the genes for the core viral functional proteins (such as portal and terminase proteins), and the accessory genes for PeiW, MazE, or integrase, they were shown divergent in genomic sequence. Overall, the analysis on these complete HGAVD viral genomes indicated that temperate archaeal viruses were dominant in the human gut, similar to the human gut bacterial phages41, while most of the archaeal viruses detected in this study likely were in the lytic status.