The virus classification tree can be integrated into the NCBI taxonomy for classifying viral sequences from metagenomic and metatranscriptomic samples. To illustrate the power of having a classification tree of 715,652 viruses, we ran ParaKraken  on two different studies. The first study is a Populus metagenome that consists of two different genotypes (P. deltoides and P. deltoides x P. trichocarpa) in three different compartments (endosphere, rhizosphere, and soil). Each genotype-compartment pair had five replicates (with one P. deltoides endosphere sample removed for quality control reasons). Without the viral databases, the resulting metagenomic classification had 50% of reads with unknown taxonomic classification. The high amount of unknown microbial dark matter allows for a low-end estimation to be made of how many viral sequences may exist in microbiome samples. The second study is an ASD study comparing post mortem brain biopsies between ASD patients and controls. While the vast majority of reads in the ASD study are human, we captured a portion of the brain microbiome, including the viriome.
To identify differences in the phytobiome between compartments and genotypes, we ran ParaKraken on endosphere, rhizosphere, and soil samples from P. deltoides and P. deltoides x P. trichocarpa (a hybrid). To mirror commonly used mapping methods in recent publications aimed at characterizing plant viriomes , the samples were initially analyzed using only the NCBI databases of all publicly available genomes from prokaryotes, archaes, eukaryotes and 8,000 viral taxa. The NCBI databases resulted in 50% of the 4.8 billion reads being assigned to taxa, with the rest representing unknown microbial dark matter (Fig. 2A). To illustrate the increased mapping ability of the k-mer based viral databases, we then analyzed the samples using our k-mer based viral databases in combination with the NCBI databases. The inclusion of the 715,672 JGI metagenome viruses resulted in an increase in mapping by 347 million reads. Metagenome viruses make up between 6%-20% (mean 15%) of the total mapped reads for a given sample, greatly increasing the coverage of the viriome (Supplemental Table 1). In addition to the increased read coverage, viral sequences also differ between compartments and genotypes leading to potential associations between viruses, the host, and other community members (Supplemental Table 2).
To better explore the differences between compartments and genotypes, we ran differential abundance analyses comparing both within and between genotypes, focusing on viruses. For the within genotype comparison, we compared endosphere vs rhizosphere, rhizosphere vs soil, and soil vs endosphere for each genotype (Fig. 3A). There were 65 viral sequences that were significantly differentially abundant across the comparisons. Rhizosphere (48 and 37 sequences) and soil (36 and 39 sequences) had much higher amounts of differentially abundant viral sequences compared to the endosphere (four and six sequences), likely due to the lower amounts of detected viral reads in the endosphere. Rhizosphere and soil profiles are similar to each other, with only 10 significant sequences unique to one compartment and genotype. Additionally, rhizosphere and soil samples within a genotype are more similar than rhizosphere and soil across genotypes. The two endosphere samples shared three out of the six unique significant endosphere viral sequences.
In addition to the within genotype comparisons, a within compartment comparison was performed as well comparing: P. deltoides soil to hybrid soil, P. deltoides rhizosphere to hybrid rhizosphere, and P. deltoides endosphere to hybrid endosphere (Fig. 3B). There were 48 significantly differential viral sequences across the comparisons. Similar to the within genotype comparison, the rhizosphere (28 and 15 sequences) and soil (27 and 16 sequences) had more significant viruses compared to the endosphere (three and two sequences). Additionally, the soil and rhizosphere samples for a given genotype have nearly identical significant viral sequences with one virus unique to the P. deltoides rhizosphere, three unique to the hybrid rhizosphere, and four unique to the hybrid soil. There were no significant viruses shared across genotypes for the rhizosphere and soil. The endosphere samples had no unique viruses associated only with the endosphere, and each endosphere sample shared at least one significant virus with the hybrid rhizosphere/soil and P. deltoides rhizosphere/soil. The significant differences in viral sequences associated with the P. deltoides, the hybrid, and the different compartments suggest host factors and differences in microbiome community composition may select for different viruses. Whether the lower number of viruses associated with the endosphere samples relative to the rhizosphere/soil is due to differential centrifugation, some host-related factor, or some database bias is unknown.
Autism Spectrum Disorder Metatranscriptome
To better understand the association between viral sequences and human health, we analyzed post mortem brain tissue metatranscriptomic samples  from ASD individuals and controls (Fig. 2B). We first aligned reads to the GRCh38 human reference genome, resulting in 67.5% (2.99 billion of the total 4.43 billion) of reads mapping to the reference. The unmapped reads were then processed with our ParaKraken pipeline. Unsurprisingly, 95% of the unmapped reads were assigned to eukaryotes, most likely due to ambiguous mappings and differences between the patient and the human reference genome (Supplemental Table 3). Despite the high percentage of human sequences in the unmapped reads, metagenomic viral reads were readily identified in all samples, ranging from 5 k to 125 k reads (avg 0.06% of reads; whereas, bacteria make up 0.57% of reads). To assess the uniqueness of the JGI viruses, we quantified the percent of reads that mapped at the metagenome virus level. Only 8.9% of reads mapped at a level higher than the individual virus, suggesting that the JGI viruses are highly unique and unambiguous.
To further understand if there is an association between ASD and viral sequences in the brain biopsies, differential abundance was performed comparing ASD to controls. Eight metagenome viral sequences were significantly more abundant in ASD cases relative to controls at a > 2x fold change, compared to zero significantly more abundant in the controls at that fold change. When comparing all of the significant viruses (p-value < 0.05 and f-value > 0.9; irrespective of fold change), the average fold change of the nine viruses significantly higher in ASD was 2.23 (1.99–2.62) compared to 1.09 (1.00-1.16) for the five viruses significantly higher in controls, suggesting that brain tissue from those with ASD may have relatively higher abundances of viral sequences (at least for the viruses contained in our databases). Increased numbers of polyomaviruses such as those identified here have previously been reported in brain tissue of individuals with ASD , and the number of viruses within an individual has been shown to be correlated with decreased neuropsychological development , supporting the idea that there is an association between ASD and the viruses present in brain tissue.
Assessment of Virus Uniqueness and Abundance
To assess the uniqueness of viral sequences, we downloaded an updated version of NCBI’s viral taxonomy in Feb. 2020 . We first computationally generated reads from all of NCBI’s viral sequences with a length of 200 bp and a sliding window of one, resulting in 1.06 billion reads. We then ran these reads through ParaKraken on both the NCBI and JGI databases. A slight majority of the reads (53.4%) mapped only to the individual viral isolate in which the read was generated. A small percentage of the reads (8.3%) mapped to the root, which means that the viral reads either had homology to either another superkingdom or to the metagenomic viruses. The uniqueness of the reads within the NCBI’s database suggests that much of the viral diversity is undiscovered, there are few viruses in which there are multiple similar isolates sequenced, and there are few overlaps between NCBI’s viruses and the viruses from JGI. The JGI viruses in the Autism metagenome had much lower than expected ambiguity based on NCBI’s results. The ratio of reads that mapped to the individual isolate compared to non-root viral reads was 1.4 (53.4 / 38.3) for the viruses in NCBI compared to the 10.2 (91.1 / 8.9) for the JGI metagenome virus in the Autism dataset.
To further understand the uniqueness of NCBI’s viruses, we used two different databases to classify viruses in a COVID-19 BAL sample  (Supplemental Table 4). The first database consists of the original NCBI and JGI parakraken databases, while the second database includes the Feb. 2020 version of NCBI’s viruses (which includes the SARS-CoV-2 genome that causes COVID-19). Without the isolate of interest, 26878 (0.2% of total reads) reads mapped to different Coronavirinae, with the top hit of 22238 being SARS coronavirus (the old coronavirus taxonomy). With the addition of the SARS-CoV-2 isolate, 62480 reads (0.5% of total reads) now mapped to Coronavirinae with the majority 62461 mapping to the specific virus taxa of interest: severe acute respiratory syndrome coronavirus 2. SARS-CoV-2 has an 89.1% sequence homology with another SARS-like coronavirus , which is partly why we were able to identify coronavirus reads without the exact isolate of interest. However, despite having a highly similar virus already sequenced, the inclusion of the exact isolate of interest increased the mapping by 2.3x (reaffirming the prior results that the majority of viral sequences are unique given the sparsity of the number of viruses that have been sequenced).
In addition to the COVID-19 sample, we also analyzed a cassava sample from Mozambique that was confirmed to be infected with a cassava brown streak virus  (Supplemental Table 5). We ran ParaKraken with the original NCBI database and the JGI databases on the sample and confirmed the presence of the virus of interest. ParaKraken identified 1942 reads (0.1% of total reads) associated with the cassava brown streak virus. As expected, neither the coronavirus nor the brown streak virus was identified by ParaKraken in the Autism brain samples or the plant metagenome samples. Both the SARS-CoV-2 and cassava brown streak viruses demonstrate that ParaKraken can identify viruses of interest during an active infection, and that viruses contain highly unique sequences, partially due to the lack of viruses sequenced and viruses with taxonomy.