Establishing a high-quality virus dataset
Four viromes were constructed from four Arctic cryosphere samples: two CB samples (labeled CB17 and CB18) collected from the same borehole site in successive years (2017 and 2018), one sea-ice brine sample (labeled SB), and one seawater sample (labeled SW) from the same region (Figure S1). Two of them (i.e., CB17 and SB) were sampled in 2017 for comparing the viral community compositions using short-read viromes [32]. Here, we enhanced virus recovery from these samples via (results below): increasing sequencing (2.1 times deeper), adding long reads, upgrading assemblies, and using new methods that can better capture viral signals from short contigs. Combining the 2017 samples (CB17 and SB) with two new samples (CB18 and SW) collected in 2018, and using both short- and long-read viromes (named “long-read-powered viromes”), we generated a total of ~7.5 × 1010 quality-controlled bases of sequence data (Table S1), on average 2.6 times deeper per metagenome than the previous report of brine viromes [32].
After generating the viromic dataset, we leveraged the low-input hybrid assembly approaches we developed previously [36, 37] to determine whether adding complementary long reads to short-read virome data would improve the recoveries of vOTUs (approximate species-level viral operational taxonomic unit) and facilitate assembly of hypervariable regions (HVRs), which are informative for estimating microdiversity. Comparing “short-read-only (SR) assemblies” and “short+long-read (SLR) assemblies” (where both assemblies had the same sequencing depth) for the two CB samples indicated that an average of 9%, 25%, 77%, and 133% more vOTUs of length ≥5 kb, ≥10 kb, ≥25 kb, and ≥50 kb, respectively, were obtained in SLR than SR assemblies (Figure 1AD; Table S2), with N50 increasing from ~33 kb in SR to ~41 kb in SLR. In addition, SLR assemblies obtained ~3 times more unique vOTUs (Figure 1BE) and significantly improved (>5 times) the more challenging assemblies of HVRs (Figure 1CF). Genomic comparisons further revealed that 25% of contigs derived from SR assemblies were nested within contigs derived from SLR assemblies (Figure 1G; Tables S3 and S4). Together, these results indicated that long-read sequencing significantly improved the recoveries of vOTUs (particularly for long viral contigs) and HVRs within viral genomes, consistent with previous findings [36, 37]. Here, however, we evaluated these improvements while controlling for differences in sequencing depth among assemblies, which was not done in other reports [36, 37] but can significantly influence such comparisons. To maximize virus recovery, we followed the established pipeline [36, 37] to use all reads (i.e., without subsampling) for assembly and combined viral contigs generated from both SR and SLR for further analysis.
A total of 11,088 vOTUs (≥5 kb) were recovered, including 6,142 vOTUs ≥10 kb (Table S1). Among them, 5680 vOTUs (≥5 kb) were from CB and SB, a 4.4-fold increase over the 1,305 vOTUs reported previously for the same ecosystems [32]. In addition, the N50 and the percentage of complete genomes (PCG, assessed by checkV) for the 11,088 vOTUs were 1.4 times longer (N50: 21.9 kb versus average 15.5 kb) and 2.0 times higher (PCG: 3.0% versus average 1.5%) than for vOTUs from previously reported CB/SB [32] and the Global Ocean Viromes 2.0 (GOV2) dataset [38]. In all samples, 78.1% (range: 58.8–97.6%) of the reads recruited to the 11,088 vOTUs (Table S1). This level of read recruitment is exceptional and may represent the best recovery of any viromes published to date; compared to the also deeply sequenced GOV2 dataset, this read recruitment was about 4-fold higher [38]. Beyond the markedly strong read recruitment to reference genomes, rarefaction analysis also suggested that viral sampling was close to saturation (Figure S2), with less than 6, 11, and 50 additional vOTUs identified if adding an additional 1 million reads to the CB, SB, and SW libraries, respectively (See Methods). Overall, we generated four high-quality viromes for these largely under-studied Arctic cryosphere brines and established an important virus dataset for further ecological and evolutionary investigations.
CB communities remain mostly stable through two successive years
We first compared the viruses between the two CB samples (CB17 and CB18) that were collected from the same permafrost borehole in successive years 2017 and 2018. The two years overlapped in the majority of CB vOTUs (400 of 596, comprising >97.6% of each community), despite a slight increase of the overall viral concentration in 2018 and the differences in rare vOTUs that in total comprised only 0.3–2.4% of each community (Figure 2ABC; Figure S3 & Table S1). These minor differences may be attributable to the variations in host relative abundances, or the possible effects of chemical differences (e.g., extracellular polysaccharide content) and the in situ spatial heterogeneity of brines between years [16].
SB and SW communities mostly overlap, but are distinct from CB communities
Comparing SB and SW communities found that the most abundant vOTUs were present in both samples (Figure 2DE). However, SW had 2.0 times higher richness (10,188 vs 5,084 vOTUs) than SB and the unique vOTUs were mainly rare viruses that totally comprised only 23.0% and 4.3% of the SW and SB communities, respectively. This result was expected, as seawater is the primary source of viruses (and their hosts) to brines in sea ice during the freeze up, yet further selection by the harsh conditions in the brine may have reduced richness [17]. Because the viral communities were highly similar and closely linked environmentally, the SB and SW viromes were merged into one ‘SB/SW virus dataset’, just as the two CB viromes (CB17 and CB18) were combined into one ‘CB virus dataset’ for subsequent comparative analyses (unless otherwise stated, as for viral concentration and diversity comparisons).
We then assessed the viral concentration, diversity, and community composition in CB to those in SB/SW. As published previously [16, 32], viral concentration in CB was about 3 orders of magnitude higher than in SB and SW (i.e., 108 versus 105 mL–1; Table S1), while viral diversity was substantially lower than in either SB or SW (Shannon: 4.1 ± 0.1 for CB versus 8.0 ± 0.2 for SB/SW; Figure 2F). This pattern was consistent with the findings for microbes in the same samples and might be partly due to the higher organic matter (~100 times higher; more organic matter may support more organisms) and salinity (~2 times higher; higher salinity selects for a smaller number of adapted members, thus lower diversity) in CB than SB/SW [16]. Read recruitment-based analysis found that no vOTU was shared between CB and SB/SW, indicating that CB contained unique viral species (Figure 2G). Dating of the permafrost matrix in which these CB are embedded suggested that the CB and their viruses (and hosts) have been isolated from the atmosphere and meteoric water for at least 40,000 years [8], which would have provided a long time period for virus communities to diverge. This description of the CB viral community expands the known diversity of viruses in the cryosphere by adding entirely unique members not found in SB or SW, and further strengthens our previous efforts [32] by increasing the number of discovered brine vOTUs fourfold.
Higher novelty in CB viruses and potential ancient marine origin
We next evaluated the genus-level novelty and environmental distribution of all 6,142 longer vOTUs (≥10 kb) by genome-based network analyses to compare them to published viruses in the RefSeq database and 250 metagenomes from many habitats: global oceans (GOV2), deep ocean water and sediment, permafrost, soil, air, glacier cryoconite and ice core, and lake water (see Methods). First, only one viral cluster (VC; approximate genus-level taxonomy) was shared between the CB and SB/SW samples (Table S5), implying that each community type had distinct viral genera. Second, 14.5% and 45.6% of the CB and SB/SW vOTUs, respectively, formed VCs with viruses from databases (Figure 3AC; Table S5 and Figure S4), indicating a much higher (85.5% versus 54.4%) genus-level novelty of viruses in CB than SB/SW. Third, for those vOTUs associated with database viruses, most of them (73.1% and 92.4% for CB and SB/SW, respectively) formed VCs with the GOV2 Arctic seawater viruses (Figure 3; Table S5). The implication at the genus level that some CB viruses originated from seawater (prior to becoming isolated within permafrost) supports the marine origin of CB in the Utqiaġvik region, previously based on salt composition, stable isotopes, the presence of Marinobacter-like viral genes, and an abundance of Marinobacter species in the brines [8, 16, 33].
Given that genus-level analyses showed that most (84.6% and 99.9% for CB and SB/SW, respectively) of the database-affiliated vOTUs were associated with the GOV2 sample-derived viruses (Figure 3; Table S5), we further leveraged these GOV2 metagenomes to evaluate the global distribution of CB and SB/SW viral species by recruiting metagenomic reads to the vOTUs (≥5 kb) from both this study and GOV2 datasets (see Methods). Similar to the findings from genus-level analyses, only a single vOTU was shared between CB and GOV2 samples, while SB/SW samples comprised viruses that were mostly related to Arctic viruses, as indicated by nonmetric multidimensional scaling (NMDS) analysis (Figure S5).
Viruses predicted to actively infect the dominant brine microbes
We next investigated whether there were gene transcriptions from brine viruses while in their hosts, either as infecting lytic viruses or as prophages. To address this question, published metatranscriptomic reads from the same project [10] were recruited to the 596 CB and 5,084 SB vOTUs. The results revealed that 18 CB and 9 SB vOTUs were potentially active in the brines, with 7 of the 18 active CB vOTUs and 6 of the 9 active SB vOTUs predicted to infect Marinobacter and Polaribacter, respectively (Figure 4A). These taxa were the dominant genera in their respective brine types [10, 16], suggesting that viruses have been actively impacting the brine ecosystems through infection of the dominant microbial lineages. As well, two CB vOTUs that encoded fatty acid desaturase (FAD) genes, thought to improve host survival via membrane adaptation to the brine environment [32], were also active. Through the additional sequencing here, the genomic context of the two vOTUs was extended to 1.7 and 9.2 times longer (i.e., 222 and 230 kb versus 127 and 25 kb, respectively). These two FAD-encoding viruses were predicted to infect Pseudomonas, belonged to the same viral genus (likely a novel one; Figure 4AB; Table S6), and had highly similar genomes (Figure 4C).
We then explored which genes were being transcribed by these two Pseudomonas phages, according to the coverage of recruited reads from metatranscriptomes, and found that a capsid gene, a head gene, a gene relevant to signal transduction, and several genes of unknown function were among the genes that were most actively transcribed (i.e., with highest metatranscriptomic read-based coverage). The FAD genes had some representative reads but with lower coverage (Figure 4C), suggesting that the FAD genes might be transcribed at low levels or only under certain conditions or certain infection stages (e.g., not during virion assembly, the putative occurrence of which in these samples was suggested by the high transcription of capsid and head genes in the two Pseudomonas phages; Figure 4C). Overall, and though only a small number of active viruses were identified, these results from metatranscriptomic analyses provided the first window into the activity of CB viruses – at both the genome and gene level – and reinforce our understanding that viruses likely impact brine ecosystems through active infection of abundant microbes and encoding host-derived metabolic genes.
EpsG: a novel AMG potentially influencing host EPS synthesis in CB
In addition to the FAD genes [32], we assessed whether the brine viruses encoded any AMG that could influence the production or metabolism of extracellular polysaccharides (EPS), which can serve as cryo- and osmo-protectant for microbes [39-43] and were >1000-fold more concentrated in CB than in SB and SW [16]. From gene annotations of all the recovered vOTUs (Table S7), we identified an epsG gene that was flanked by viral genes on a complete viral genome recovered from CB (Figure 5A; Table S8). The genome was assembled from the long-read dataset, while the SR assembly yielded two fragments that were identical to the full genome (Figure 5A). The epsG gene represents a novel AMG, as it has not been found previously to be encoded by viruses. In bacteria, the epsG gene is a member of the eps operon (gene cluster), which produces proteins that function in polymerization [44] for EPS synthesis and biofilm formation [45, 46]. Mutations in epsG can control EPS production and, in Pseudoalteromonas atlantica, biofilm formation [47, 48]. In subzero brines, microbes have been reported to produce EPS as both cryo- and osmo-protectants [42, 49]. We speculate that viruses hijacked this epsG to influence EPS synthesis by their hosts and thus enhance host survival in these harsh settings – a useful strategy for a prophage lifestyle.
We then explored the evolution and functionality of epsG. Phylogenetic analysis was conducted to evaluate the relationship of this virus-encoded epsG (vEpsG) to 61 microbial epsG (mEpsG) sequences that were recovered from the concurrently sampled CB microbial metagenomes [10] (n = 11) and from the NCBI nr database (n = 50). The CB vEpsG was deeply divergent from all mEpsGs, including those from the CB samples (Figure 5B; Figure S6). Thus, we speculate that the espG was transferred from a microbe to a virus a long time ago (e.g., in the late Pleistocene, prior to or soon after entrapment in CB), and the gene subsequently diverged in the viral genome. Finding CB vEpsG to be unique is consistent with our previous reports of ancient CB virus-encoded FAD genes [32] and ancient glacier virus-encoded motility genes [50], which are also distinct from all known microbial genes.
An evaluation of the amino acid and protein sequences suggests that the vEpsG is functional. No conserved motif has been reported previously for EpsG, but we identified two motifs that were highly conserved in the tested vEpsGs and mEpsGs (Figure S7). Further, the number of nonsynonymous mutations relative to the number of synonymous mutations, a measure of selection pressure, was determined by recruiting short metagenomic reads from CB samples to this vEpsG. No single nucleotide polymorphism (SNP) was discovered for the vEpsG gene, which thus was probably under purifying selection in CB, indicative of a functional gene. This interpretation was supported by analyzing the evolutionary dynamics of epsG homologs across lineages, which implied that the epsG was under purifying selection (average dN/dS = 0.053) and remained functional (Table S9). No activity was observed for this AMG by metatranscriptomic read recruitment, which might be due to transcription of epsG being too low for detection or only occurring under certain conditions. Though experimental evaluation is required to establish function, the genomic evidence that vEpsG is likely functional leads us to presume that it would alter host EPS synthesis in ways that improve cryo- and osmo-protection in the CB ecosystem.
Lower viral evolutionary pressure in CB than SB/SW
Intra-population variations (i.e., microdiversity) can improve ecological resilience and offer windows into population- and gene-level selective pressures [51-54]. With the relatively recent opportunity to calculate such variations in viromics [55], we next assessed whether viral microdiversity and the selection pressures acting on viral genes differed in the two subzero brine types, representing relatively stable (CB) and fluctuating (SB) conditions, with SW as a reference. This revealed that viral microdiversity (via nucleotide diversity π value and the density of SNPs at both genome and gene levels) and gene selection pressure (via pN/pS) in CB were significantly lower than in SB and SW (Figure 6A–D; Figure S8). Higher microdiversity may be generated and maintained by species adaptation and expansion into harsh or disturbance-prone environments, in which the viruses may experience strong selection pressure; such processes may drive viral speciation and provide advantages for viral adaptation to environmental extremes or perturbations [35, 56]. Indeed, higher microdiversity reflects the increase in microbial stress responses [57-59] and adaptations to environmental fluctuations [60]. In this study, SB was collected from first-year sea ice that forms and melts annually and, during its lifetime, provides interior liquid habitats that fluctuate considerably with seasonal and diurnal changes in temperature and salinity, as well as in the composition of microbial communities [12, 16, 17]. In such systems, a higher level of microdiversity for a viral population could be advantageous because it may allow populations to survive when environmental conditions and hosts change [61]. In contrast, the studied CB has been separated from the surface environment and remained under relatively stable temperature and salinity conditions for millennia [8]. In such geophysically stable ecosystems, viruses might have become dominated by those best adapted to these unique brine habitats and be under a relatively relaxed environmental selection, as was also suggested for their microbial hosts [9, 11, 16].
Looking specifically at gene selection pressures (via pN/pS) in the two brine types, we found that 103 (0.5% of total 18,820) CB and 2,848 (2.2% of total 128,218) SB viral genes were under positive selection (Figure 6E). Functional gene annotation could be assigned to 23 of the 103 CB genes, and to 415 of the 2,848 SB genes (Figure 6E; Table S10). About half of these annotated and positively selected genes were related to DNA replication, metabolism, and virion structure (Figure 6E) – genes often under strong selection pressure during adaptation to new microbial hosts [62, 63]. Because 80–85% of the positively selected genes had unknown functions, exploring the functionality of these positively selected genes at a large-scale level is challenging.
Though many functions remain to be more deeply explored, especially as annotations are improved, here we focused on phage tail fiber genes as a proxy for assessing absorption and infection of host cells [64], and thereby the phage-host co-evolutionary arms race [65-68]. We hypothesized that mutations in some tail fiber genes may be under positive selection during arms-race evolution and more frequent in the fluctuating SB environment than in the relatively stable CB. A total of 707 phage tail fiber genes were identified from SB, including 23 genes (3.3% of 707) that were putatively under positive selection (pN/pS range: 1.09–8.97; Table S10). In contrast, none of the 51 tail fiber genes identified from CB viruses had a pN/pS ratio greater than 1, implying that all were under purifying selection (Table S10). Finding a lower positive selection pressure for mutations in tail fiber genes in cryopeg compared to those in SB is consistent with our overall result of lower gene-level selection pressure in CB than SB ( Figure 6CD) and further supports our proposition that viral evolutionary pressure was lower in the relatively stable CB habitat than in the fluctuating, transient habitats within sea ice.