Identification of novel phage genomes from whole-community human gut metagenomes
The collection of 5,742 whole-community assembled metagenomes was searched for the presence of complete phage genomes. To limit the search space to putatively closed genomes, only ‘circular’ contigs with direct repeats (50–200 bp) at their termini (n = 95,663) were searched for open reading frames (ORFs) matching a known phage marker profile (i.e, the terminase large subunit, major capsid protein, or portal protein). In total, 3,738 contigs encode at least one ORF that passed the e-value and length cutoff criteria (Methods) (Additional file 1). Dereplication at approximately 95% mean nucleotide identity reduced the number of phage marker-matching contigs to 1,886 (Additional file 2). A subset of 664 contigs encoded all three markers, 531 encoded two of the three and the remaining 691 possessed a single detectable marker (Additional file 3). The putative phage contigs had a median length of 44.9 kb which is consistent with the recent estimates of the median genome size of dsDNA phages [13]. To exclude any contaminating contigs (e.g., a plasmid harboring an integrated phage), each was assessed with ViralVerify [16] and Seeker [26], two bioinformatic tools trained to discriminate phage genomes from other sequences. These tools classified almost all the selected contigs as phages with varying levels of confidence except for several cases that, upon manual examination, were found to represent false negative classifications by these tools (Additional file 2). Although we cannot rule out the possibility that some non-phage contigs were retained erroneously, the results collectively suggest that the set of circular marker-matching contigs predominantly consists of complete phage genomes.
To determine the host ranges of the phages, a database of CRISPR spacers from prokaryotic genomes was used to query the metagenomic phages for potential matches. In total, 553 (29%) of the dereplicated phage genomes were found to be targeted by at least one CRISPR-Cas system allowing host prediction (Additional file 4). The most common predicted hosts were Firmicutes (321 phages), followed by Bacteroidetes (143), Actinobacteria (43), Proteobacteria (41) and Verrucomicrobia (4).
Many phages have been found to encode Anti-CRISPR proteins (Acrs) to parry CRISPR-Cas defenses [48–51]. Given their function in counter-defense, Acrs evolve rapidly and show limited sequence similarity to experimentally characterized Acrs, making inference challenging [48]. However, a machine-learning based method has been recently developed that utilizes genomic context to identify candidate Acrs [47]. Application of this method showed that 41 phages, 16 of which were found to be targeted by a CRISPR-Cas system of their inferred host, encoded at least one candidate Acr, (Additional file 5). The highest-scoring Acrs belong to four phages that are targeted by Bifidobacterium CRISPR-Cas systems. All four phages are ≥ 97% identical over ≥ 90% of their length at the nucleotide level to uncharacterized prophages in cultured Bifidobacterium isolates (Additional file 5), confirming their host-tropism assignment via CRISPR spacer-protospacer matches. In these phages, two candidate Acr-encoding genes lie between the large terminase subunit and integrase (Additional file 6). The localization of the Acr-encoding genes suggests they are expressed not only upon initial entry into the host cell and during lysogeny [52], but also upon transition to the lytic program to prevent cleavage of progeny phage genomes by CRISPR-Cas, as demonstrated experimentally in Listeria-infecting phages [53]. Transcription of Acrs is typically regulated by HTH domain-containing proteins termed Acr-associated proteins (Acas) [54]. Indeed, in the Bifidobacterium phages identified here, a short HTH domain-encoding ORF is located immediately downstream of the Acrs and can be predicted to regulate the expression of these two genes throughout the phage lifecycle. While these uncharacterized Bifidobacterium phages possess the characteristic features of Acr loci, the great majority of the phages identified in this work did not harbor any detectable Acrs yet were targeted by CRISPR-Cas (Additional file 4). Some of these phages might encode distinct Acrs undetectable by the method we used that was trained on a collection of previously characterized Acrs, whereas others might employ alternative anti-CRISPR strategies.
Selection of candidate families for comparative genomics
The taxonomic analysis based on phage hallmark proteins demonstrates that few phages in the human gut belong to a currently accepted ICTV-family. To prioritize candidate families for in-depth analysis, we next complemented the hallmark gene-based taxonomic analysis with whole-genome comparisons and abundance calculations of each phage relative to GenBank phages.
A gene-sharing network was constructed with the phages recovered from metagenomes and those deposited in GenBank. Edges are drawn between two viral genomes, represented as nodes, based on the number of ORFs that share significant sequence similarity [38]. Most of the metagenome-recovered phages bore multiple connections within the network to GenBank phages, in agreement with the manual curation of these contigs as genuine phage genomes (Fig. 1B). However, two large groups of phages (tentatively labelled “Flandersviridae” and “Gratiaviridae”) were weakly connected to the larger network, reflecting disparate genome content. The divergence of the gene content of these phages from those of previously known phages and their distinct position in phylogenetic trees (Fig. 1A and see below) indicate that they represent novel genera, and likely, new families.
To quantify the fractional abundance [1] of each phage in the human gut viral community, reads from a collection of 1,265 human gut viromes were competitively mapped against a database containing the metagenome-recovered and GenBank phages. The majority of the genomes do not recruit any reads (“detection”) from more than 2% of the viromes (Q1-Q3, 0–2% of viromes) (Fig. 1C-D), consistent with the previously reported individuality of the human gut viromes [2, 7, 11]. A notable exception are the crAss-like phages [59] that recruit at least one read from about one-third of the viromes (Q1 - Q3, 9–28%), in agreement with previous reports of their cosmopolitan distribution [60, 61]. One uncharacterized Caudovirales genome was frequently observed in the collection of human gut viromes (54%, Fig. 1C), suggesting that this phage is also cosmopolitan. To rule out the possibility that the observed frequency stemmed from non-specific read mapping to one or a few loci, rather than the complete genome, the coverage of sequencing reads across the genome (accession OMAC01000147.1) was examined. The broad coverage of this genome in the viromes confirms that its frequent detection is not an artifact, although several loci present in the reference sequence were absent in the viromes (Additional file 7). The exceptional detection of this uncharacterized phage (hereafter referred to as Quimbyvirus, after the character Mayor Quimby from the Simpsons) in the human gut viral community warrants its detailed examination.
Thus, three groups of phages were selected for in-depth analysis based on their distinct positions in the phylogenetic trees of the marker genes (all three groups), combined with divergent gene contents (“Flandersviridae” and “Gratiaviridae”) and high abundance in the human gut viral community (“Flandersviridae” and “Quimbyviridae”). A comparative genomic analysis of each candidate family is presented below, case-by-case.
“Quimbyviridae” phages are abundant, hypervariable phages infecting Bacteroides
In the TerL phylogenetic tree, Quimbyvirus belongs to a group of phages whose closest characterized relatives include the Vequintavirinae and Ounavirinae subfamilies, under the now defunct Myoviridae family. To elucidate the taxonomic affiliation of Quimbyvirus, genomes from adjacent branches were examined (Fig. 2). The median genome length of Quimby-like phages is 75.2 kb, close to the genome size of a branch basal to the Quimby-like branch, “group 4986” (72 kb), but smaller than the genomes of other phages in adjacent branches, Ounavirinae (88 kb) and Vequintavirinae (145 kb). Despite the similarity in genome size, phylogenetic reconstruction of the portal protein and MCP separate the Quimby-like phages from group 4986 (Additional file 8). Moreover, most Quimby-like phages encode a DnaG-family primase and DnaB-family helicase that are both absent in group 4986. However, in one branch of Quimby-like phages, the primase was lost from the replication module. The genomes of this branch encode a protein adjacent to the DnaB-family helicase with significant structural similarity to the winged helix-turn-helix domain of RepA (HHpred probability, 96.5) (Fig. 2). RepA-family proteins mediate replication of plasmids by interacting with host DnaG primases [62], suggesting that the RepA-like protein coopts the host primase during replication, triggering the loss of the phage-encoded dnaG in this lineage. Consistent with a RepA-mediated episomal replication strategy, no integrase is identifiable in the genomes on this branch yet the phages encode numerous antirepressors, proteins involved in the lysis-lysogeny decision of temperate phages [63, 64]. The rest of the Quimby-like phages harbor a full-length integrase, indicating that these phages integrate into their host cell genome (Fig. 3). Based on the topologies of the TerL, portal, MCP and DnaG trees, we propose that Quimby-like phages represent a novel taxonomic group at the family rank (henceforth, the “Quimbyviridae”). The potential differences in replication strategies (episomal vs. integrated) combined with the topologies of the phylogenetic trees of marker proteins suggest that “Quimbyviridae” splits into two distinct subfamilies.
The Quimbyvirus genome aligns with a cryptic prophage of the bacterium Bacteroides dorei (CP011531.1), with 95% nucleotide sequence identity across 92% of its length, indicating that B. dorei, a common constituent of human gut microbiomes [65], carries a prophage closely related to Quimbyvirus. Inspection of the alignment shows that Quimbyvirus site-specifically integrates into the tRNA-Asp gene of B. dorei, a typical site of prophage integration [66]. The hosts of the other “Quimbyviridae” phages, determined through CRISPR-spacer analysis, include the Prevotella, Bacteroides and Parabacteroides genera within the phylum Bacteroidetes and the Lachnospiraceae within the phylum Firmicutes. In contrast, the hosts of group 4986 do not include any Bacteroidetes. The differences in the inferred host ranges support separating group 4986 from ‘Quimbyviridae” phages and suggests that group 4986 might represent a novel family, but these genomes were not investigated further.
Some of the “Quimbyviridae” phages harbor diversity-generating retroelements (DGRs), a cassette of genes that selectively mutate a short locus, known as the variable repeat, that is part of a C-type lectin or an immunoglobulin-like domain [67, 68]. Targeted mutation of these domains yields proteins with altered binding affinities and specificities [69]. The DGR cassette in Bordetella phage BPP-1 of the genus Rauchvirus is the only experimentally studied DGR system in a phage, where diversification of the C-type lectin domain-containing tail fiber gene enables adsorption to different host cell receptors [70]. In Quimbyvirus, the RT component of the DGR is encoded by overlapping ORFs in all three frames (ORFs 52–54), suggesting that the active RT is produced by two programmed frameshifts. Although overlapping ORFs and programmed frameshifts have been identified in in many compact tailed phage genomes [71–74], DGR RTs have thus far only been predicted to be encoded by a single ORF. To discern if the frameshifts render the RT inactive, the variable repeats were examined for adenine-specific substitutions, a hallmark of DGR-mediated variation [68]. The two variable repeats reside in ORF 47 and 80 of the Quimbyvirus genome which both encode proteins containing C-type lectin domains, the canonical target of DGRs [67] (Fig. 2). Alignment of the variable repeats with their cognate template repeats from nearly identical Quimbyvirus genomes (≥ 95% average nucleotide identity) allowed the detection of 22 adenine sites in the variable repeat exhibiting substitutions whereas all other bases were nearly perfectly conserved (Additional file 9). Collectively, these results suggest that the frameshifted RT possesses the selective infidelity that characterizes DGR-mediated hypervariation.
The first variable repeat resides in the C-terminus of ORF 51 that is located downstream of the tail fiber genes, suggesting that this gene codes for a structural component of the virion, similar to the hypervariable tail fiber of phage BPP-1 [70, 75]. The second DGR target locus is in ORF 84 that is distal to the phage structural gene module and is expressed from the opposite DNA strand, suggestive of a non-structural protein. The genomic neighborhood of ORF 80 includes genes coding for a nuclease, four methyltransferases and a tRNA ligase within 7 kb. The nuclease shows significant sequence and structural similarity to E. coli mutY (HHpred probability, 97.3, Additional file 10), a DNA glycosylase involved in base excision repair. The methyltransferases are most similar to adenine- and cytosine-modifying enzymes (HHpred probability 100 and 99.9, respectively, Additional file 10) that likely prevent cleavage by host restriction endonucleases. Similarly, the tRNA ligase might repair tRNAs cleaved by host anticodon endonucleases [76]. Overall, the adjacency of ORF 84 with defense- and counterdefense-related genes implies that this hypervariable phage protein plays a role in the phage-host conflicts; however, the exact functions of the DGR and hypervariable target proteins during the life cycle of “Quimbyviridae” phages remain to be investigated.
“Flandersviridae” phages are common and abundant in whole-community metagenomes
Analysis of the phylogenetic trees of TerL identified a deep branch of 29 gut phages (dereplicated from 196 total genomes) that joins the family Ackermannviridae (Fig. 3A). Annotation of the ORFs encoded by the 29 representative contigs demonstrated that the genomes are colinear, confirming that they belong to a cohesive group (Fig. 3B). The cohesiveness of this group was confirmed by the gene-sharing network, where these genomes form a coherent cluster that has few connections to the larger network (Fig. 1B), reflecting distant (if any) similarity between most of the proteins encoded by these phages and proteins of phages in GenBank. The median genome size of the phages in this group is 85.2 kb, compared to 157.7 kb among the Ackermannviridae phages. There is a conserved module of structural genes that encode the MCP, portal, sheath and baseplate proteins, TerL and the virion maturation proteinase. The presence of a contractile tail sheath indicates that these viruses possess contractile tails similar to those in the family Ackermannviridae, in agreement with the TerL phylogeny. Several of the genes within the structural block contain immunoglobulin-like or C-type lectin domains (e.g., BACON and GH5, respectively) which are predicted to play a role in adhesion of the virion to bacterial cells or host-associated mucosal glycans [77–80]. Downstream of the structural block is a module of genes involved in DNA replication that includes a DnaB-family helicase, DnaG-family primase and DNA polymerase I (PolA). The polA gene is widely distributed among dsDNA phages and therefore serves as a useful marker for delineating the diversity of phage replication modules [81]. Phylogenetic reconstruction of both polA and dnaG encoded by these phages confirmed their monophyly (Additional file 11). Following the replication module is an approximately 20 kb long locus containing ORFs that showed no detectable similarity to functionally characterized proteins. Two of the phages harbor matches to CRISPR spacers encoded by Bacteroides and Parabacteroides spp., indicating these bacteria serve as hosts. Based on the large terminase and polA phylogeny, colinearity of their genomes and differences from known phages in both genome size and content, we propose that these Bacteroides-infecting phages represent a novel taxonomic group, probably, with a family rank, hereafter, named “Flandersviridae” (after the region where some of the metagenomes were sampled).
Although all members of the “Flandersviridae” are syntenic, some contain an insertion of two adjacent genes encoding nucleotidyltransferase superfamily enzymes within the DNA replication module. One enzyme belongs to the ispD family that is involved in the biosynthesis of isoprenoids [82, 83], and the other is a licD family enzyme that is responsible for the addition of phosphorylcholine to teichoic acids present in bacterial cell walls [84] (Fig. 3). To our knowledge, neither of these enzymes has been reported in phages previously. Given that only some members of the “Flandersviridae” possess these genes, they are unlikely to perform essential functions in phage reproduction, and instead could be implicated in phage-host interactions. The licD family enzyme might modify teichoic acids to prevent superinfection by other phages, given that these polysaccharides serve as receptors for some phages to adsorb to the host cells [85]. The role of ispD is less clear because ispD family enzymes catalyze one step in the biosynthesis of isopentenyl pyrophosphate, a building block for a large variety of diverse isoprenoids [86]. Phages manipulate host metabolic networks including central carbon metabolism, nucleotide metabolism and translation [87]; the discovery of ispD present in the “Flandersviridae” phage genomes might add to this list the isoprenoid biosynthetic pathway.
Complete “Flandersviridae” phage genomes were recovered from 249 whole-community human gut metagenomic assemblies. Their frequent assembly into closed contigs suggests that these phages might persist in their host cells as extrachromosomal circular DNA molecules, similar to phage P1 [88]. However, neither genes involved in DNA partitioning nor lysis-lysogeny switches are readily identifiable in the “Flandersviridae” genomes. Thus, this group of phages might be obligately lytic although discerning the lifestyle of a phage from the genome sequence alone is challenging [89]. Regardless of their lifestyle, the frequent recovery of these phages from whole-community metagenomes implies that they are common members of the human gut virome. Indeed, the “Flandersviridae” phages reach similar detection frequency as the crAss-like phages (Fig. 1D) although there are fewer Flanders-like phages in the database. Like the “Quimbyviridae”, the even coverage of sequencing reads across one “Flandersviridae” genome (accession OLOC0100071.1) confirms its detection is not artifactual (Additional file 12). The high fractional abundance and detection of Flanders-like phages in viromes generally agrees with their frequent assembly from whole community metagenomes although they were not the most abundant (see Discussion). Overall, Flanders-like phages represent a previously undetected phage group that is widely distributed in human gut viromes.
“Gratiaviridae”, a putative novel family of phages infecting Bacteroides
A deeply branching cluster of 18 genomes (dereplicated from 45 total) is basal to the families Autographiviridae, Drexlerviridae and Chaseviridae on the TerL phylogenetic tree (Fig. 4A). Although not commonly present in gut viromes (Fig. 1D), the deep relationship between these contigs and established phage families prompts in depth genome analysis of these putative phages. All 18 genomes encode a DnaG-family primase and a DnaE-family polymerase, and phylogenetic reconstruction for these genes demonstrates monophyly of these phages, the sole exception being the dnaE gene of bacteriophage phiST, a marine Cellulophaga-infecting phage that belongs to the polyphyletic, currently defunct Siphoviridae family [90] (Additional file 13). The dnaG and dnaE genes are nested within a module of other replication-associated genes that include superfamily I and II helicases, SbcCD exonucleases and a RecA family ATPase (Fig. 4B). The structural module is composed of genes that encode an MCP, capsid maturation protease, portal protein, baseplate proteins and a contractile tail sheath protein. Although these genomes are not strictly colinear as observed for the “Flandersviridae” phages, the overall similarity of the proteins encoded by these phages is apparent in the gene-sharing network where they form a coherent cluster that shares some edges with the crAss-like phages (Fig. 1B). Similar to crAss-like phages, the predicted hosts suggested by CRISPR-spacer matches are the Bacteroides and Parabacteroides genera (Additional file 4). Taken together, the phylogenetic and genomic organization of these phages indicate that they represent a new family, provisionally named “Gratiaviridae” (after the pioneering phage biologist Dr. Andre Gratia).
In addition to structural and replication proteins, “Gratiaviridae” phages encode several enzymes of the ferritin-like diiron-carboxylate superfamily. The ferritin-like enzymes encoded by these phages belong to two families, namely, DNA protecting proteins (DPS) and manganese-catalases. Manganese-catalases have not been documented in phage genomes and DPS-like enzymes have only been observed in seven Lactobacillus-infecting phages [91]. Both enzymes are involved in the tolerance of anaerobes to oxidative stress. Catalases detoxify hydrogen peroxide to oxygen and water, enhancing survival of anaerobic Bacteroides in the presence of oxygen [92]. DPS enzymes catalyze a reaction between oxygen and free iron to yield insoluble iron oxide, lowering the concentration of both intracellular oxygen and free iron levels that would otherwise react with hydrogen peroxide and produce a hydroxyl radical, the most toxic reactive oxygen species [93, 94]. “Gratiaviridae” phages might deploy catalase- and DPS-like enzymes during infection to enhance the tolerance of their strictly anaerobic Bacteroides hosts to oxidative damage. Notably, these enzymes were not restricted to the “Gratiaviridae” but could be identified in 196 (manganese catalase) and 36 (Dps) other phage genomes, including the “Flandersviridae”. The frequent identification of these enzymes in gut phage genomes underscores the importance of intracellular iron and reactive oxygen species concentration for productive infections in an anaerobic environment.
Five of the “Gratiaviridae” phages encode a protein containing a serine/threonine protein kinase domain with distant but significant sequence similarity to HipA family kinases (HHpred probability 99, Additional file 10). Whereas HipA family kinases are present in numerous, phylogenetically distinct bacterial genomes as the toxin component of a distinct variety of type II toxin-antitoxin systems [95, 96], there are only two characterized examples of protein kinases encoded by phages. The protein kinase of T7-like phages phosphorylates RNA polymerase and RNAse III early during infection as part of the takeover of the host cell transcriptional and translational machinery [97–99]. In contrast, the protein kinase of E. coli phage 933W is expressed during lysogeny and mediates abortive infection upon superinfection of the host cell by phage HK97 [100]. The HipA-like kinase is unlikely to function early during infection like the kinase of T7-like phages because, in all five “Gratiaviridae” phages, the kinase is encoded between the portal protein and MCP genes, which are expressed late during infection in numerous cultured phages [101, 102]. Instead, the kinase might confer immunity to heterotypic phage infection, analogous to the kinase encoded by 933W [100]. In support of an immunity-related role, an AAA-family ATPase and a glycosyltransferase are encoded immediately upstream of the kinase in all five phage genomes (Fig. 4). Glycosyltransferases are encoded within capsular polysaccharide biosynthetic loci [103] and phase variation of the capsular polysaccharides confers immunity from phages that rely on these molecules for adsorption [104]. The specific roles of the HipA-family kinase, ATPase and glycosyltransferase are unknown but, collectively, these enzymes might modify host cell capsules, granting temporary immunity to heterotypic phage infection while the morphogenesis of “Gratiaviridae” progeny virions completes.