DeepMicrobeFinder sorts metagenomes into prokaryotes, eukaryotes and viruses, with marine applications

1 Sequence classification is valuable for reducing the complexity of metagenomes and providing a 2 fundamental understanding of the composition of metagenomic samples. Binary metagenomic 3 classifiers offer an insufficient solution because metagenomes of most natural environments are typically 4 derived from multiple sequence sources including prokaryotes, eukaryotes and the viruses of both. 5 Here we introduce a deep-learning based (not reference-based) sequence classifier, DeepMicrobeFinder, 6 that classifies metagenomic contigs into five sequence classes, e.g., viruses infecting prokaryotic or 7 eukaryotic hosts, eukaryotic or prokaryotic chromosomes, and prokaryotic plasmids. At different 8 sequence lengths, DeepMicrobeFinder achieved area under the receiver operating characteristic 9 curve (AUC) scores >0.9 for most sequence classes, the exception being distinguishing prokaryotic 10 chromosomes from plasmids. By benchmarking on 20 test datasets with variable sequence class 11 composition, we showed that DeepMicrobeFinder obtained average accuracy scores of 0.94, 0.87, and 12 0.92 for eukaryotic, plasmid and viral contig classification respectively, which were significantly higher 13 than the other state-of-the-art individual predictors. Using a 1-300 µm daily time-series metagenomic 14 dataset sampled from coastal Southern California as a case study, we showed that metagenomic read 15 proportions recruited by eukaryotic contigs could be doubled with DeepMicrobeFinder’s classification 16 compared to the counterparts of other reference-based classifiers. In addition, a positive correlation 17 could be observed between eukaryotic read proportions and potential prokaryotic community growth 18 rates, suggesting an enrichment of fast-growing copiotrophs with increased eukaryotic particles. With 19 its inclusive modeling and unprecedented performance, we expect DeepMicrobeFinder will be a 20 useful addition to the toolbox of microbial ecologists, and will promote metagenomic studies of 21 under-appreciated sequence types. 22


Introduction
Microbes are omnipresent in all conceivable systems on earth, be it extreme environment such as deep-sea hydrothermal vent or host-associated ecosystem like human gut, exerting immense influence on the global biogeochemical cycles (Falkowski et al., 2008;Azam & Worden, 2004) and host physiology (Turnbaugh et al., 2007;Berendsen et al., 2012).Studies of microbial diversity and evolution were pioneered by the discovery of three domains of life using the universally conserved small subunit rRNA gene 1/25 sequences (SSUs) as the phylogenetic marker (Woese & Fox, 1977), which enabled biodiversity surveys of environmental microbial communities (Pace et al., 1986;Olsen et al., 1986) and gave rise to the discovery of abundant archaea lineages in the open ocean (Fuhrman, 1992;DeLong, 1992).Microbial coding potentials were further probed using cloning libraries of natural microbial assemblages (e.g., cosmid and fosmid libraries) (Olsen et al., 1986;Schmidt et al., 1991;Stein et al., 1996;Vergin et al., 1998;Rondon et al., 2000;Béjà et al., 2000;Legault et al., 2006), which were revolutionized by shot-gun metagenomes to infer ecological roles of uncultured microbes (Venter et al., 2004;Handelsman, 2004).Depending on where, when and how metagenomic samples were collected, the microbial richness within a sample can range from a consortium of several dominant strains to a conglomerate of thousands of species.The tremendous amount of ever-growing metagenomic data compound with its inherent heterogeneous nature not only provides opportunities to decipher the cryptic interactions of complex microbial communities and the genome-level evolutionary trajectory of individual species, but also poses challenges on how to reliably extract genomic/transposable fragments from metagenomic sequence pools.By assigning metagenomic contigs into distinct groups, the complexity of metagenomes can be reduced to certain taxonomic levels, from coarse domains to fine-grained consensus species or strains.Metagenomic applications developed with the objective of computationally retrieving intended contigs can be briefly framed into two categories, supervised contig classification tools (i.e., viral contig predictors), and unsupervised contig clustering tools (i.e., metagenomic binners, see Sedlar et al., 2017 for a review of binning strategies).
Microbial communities are a collection of diverse biological entities including the ribosome-encoding cellular organisms (REOs), the capsid-encoding organisms (CEOs, e.g., viruses) that can only reproduce within cells of REOs, and orphan replicons (plasmids, transposons, etc) that parasitize REOs or CEOs for propagation (Raoult & Forterre, 2008).Viruses are one of the most abundant entities on earth (Suttle, 2005(Suttle, , 2007)), playing a crucial role in the global biogeochemical cycles by controlling nutrient flow via viral shunt (Fuhrman, 1999;Wilhelm & Suttle, 1999).Metagenomic contig classification has been heavily focused on the prediction of viral sequences.One of the state-of-the-art tools to identify viral contigs from metagenomic assemblies is VirSorter (Roux et al., 2015), which predicts viral contigs based on viral signal and categorizes them into three tiers with different confidence levels.VirFinder (Ren et al., 2017) is another viral contig predictor that employs k-mer frequencies and logistic regression to classify contigs to either viral or host sequences, which outperforms VirSorter at different contig lengths, especially for shorter contigs without detectable viral genes.The success of k-mer based methods has inspired the application of deep learning in viral sequence discovery, which leads to the development of DeepVirFinder (Ren et al., 2020) and PPR-Meta (Fang et al., 2019), both of which use one-hot encoding to convert DNA sequences into presence/absence matrices of nucleotides, and use neural networks to train virus-host classifiers at different contig lengths.Besides, PPR-Meta combines both nucleotide path and codon path in the encoding step, and classifies contigs into viruses, host chromosomes and plasmids (Fang et al., 2019).VIBRANT (Kieft et al., 2020) is a recently published tool that uses neural networks to distinguish prokaryotic dsDNA, ssDNA and RNA viruses based on "v-score" metrics, which were calculated from significant protein hits to a collection of Hidden Markov Model (HMM) profiles derived from public databases.Most of the aforementioned tools target bacteriophages.Eukaryotic virus predictors are emerging in recent years, and one of such tools is Host Taxon Predictor (HTP) (Gałan et al., 2019), which utilizes four machine learning methods to classify viral sequences to eukaryotic viruses or bacteriophages based on sequence features including mono-, dinucleotide absolute frequencies and di-trinucleotide relative frequencies.Beyond viruses, plasmids are also important players of shaping 2/25 microbial genome evolution and environmental adaptation.Plasmids are mobile genetic elements that can exchange genes with host chromosomes and shuttle between different hosts, leading to gene flows within microbial communities.Thus, by carrying genes related to environmental adaptations and defense systems, plasmids play a pivotal role in maintaining the host genetic and phenotypic plasticity, and increase the host fitness to the changing environments.This also poses a challenge in classifying chromosomal and plasmid contigs from metagenomes, which is particularly true for plasmids sharing a significant amount of genes with their host genomes.There are multiple dedicated tools developed besides PPR-Meta, such as cBar (Zhou & Xu, 2010), PlasFlow (Krawczyk et al., 2018), PlaScope (Royer et al., 2018) and PlasClass (Pellow et al., 2020).In principle, PlaScope employs a similarity searching approach based on species-specific databases, while cBar, PlasFlow and PlasClass use differential k-mer frequencies with different machine learning methods.Beyond viruses and plasmids, there is a paucity of applications targeting the classification of eukaryotic contigs from metagenomes, while eukaryotes are indispensable to the ecological functioning of natural microbial communities.Reference-based applications such as Kaiju (Menzel et al., 2016) and MetaEuk (Levy Karin et al., 2020a) search for close matches in reference databases, thus can be used to assign reads or contigs to taxonomic groups.
While the accuracy of these applications depends on the completeness of reference databases, their performance in classifying eukaryotic contigs is arguable due to the lack of a comprehensive microbial eukaryotic database (Keeling et al., 2014).EukRep (West et al., 2018) is a reference-independent application that uses k-mer frequency and linear-SVM to classify metagenomic contigs into eukaryotic and prokaryotic sequences.It has been proven that when combined with the conventional metagenomic and metatranscriptomic analyses, such as reconstructing eukaryotic bins and gene co-abundance analysis, biological and ecological insight can be readily obtained for uncultured eukaryotes (Vorobev et al., 2020;West et al., 2018).
Despite the significant progress made in the past years, there isn't one tool that can classify eukaryotic/prokaryotic genomes, eukaryotic/prokaryotic viruses, and plasmids in one shot.In fact, all these binary classifiers suffer from sequence types that are not modeled, such as eukaryotic contigs or plasmids can be misclassified as viruses by viral predictors, and viral contigs can be misclassified as plasmids by plasmid predictors, etc.Thus, in order to achieve a more reliable classification of the target sequences, one has to run several of these tools consecutively, each suffers from its own sensitivity and specificity, and the error rates propagate throughout the workflow, resulting in less accurate and biased classification.
Here we introduce DeepMicrobeFinder, a versatile multi-class metagenomic contig classifier based on convolutional neural networks (CNN).We show that DeepMicrobeFinder outperforms all the existing tools by precision and sensitivity across all test datasets with different sequence type compositions.
More importantly, DeepMicrobeFinder is superior to the other tools by classifying all sequence types simultaneously, which will greatly reduce the time and computation resource usage compared to the conventional way of pipelining a set of different predictors.Using a coastal marine metagenomic dataset as a case study, we show that DeepMicrobeFinder captures more eukaryotic contigs than reference-based classifiers.The higher eukaryotic read proportion is positively correlated with prokaryotic community growth rates, indicating the higher abundance of fast-growing copiotrophs might be involved in the recycling of particular nutrients during the spring bloom.
Complete and draft viral sequences and associated metadata were retrieved from the NCBI Viruses database (Brister et al., 2015) (https://www.ncbi.nlm.nih.gov/genome/viruses) on Jan 17, 2020, which contains 3,214,806 nucleic acid records in total.To include more viral sequences from the increasing metagenomic and single-cell genomic datasets, we also included more than 760,000 viral sequences from IMG VR2 (Paez-Espino et al., 2019).In total, there are 3,975,259 viral sequences included in this study.
For viruses from NCBI, we classified them into eukaryotic or prokaryotic viruses according to their host domains based on the host taxonomy lineages using taxonkit v0.3.0 (Shen & Ren, 2021).Similarly, the IMG VR2 viruses were also classified into eukaryotic or prokaryotic viruses based on the "Host_domain" field in the sequence information file.

Sequence preprocessing, model selection, training and validation
Training sequences were randomly selected using seqtk (available at https://github.com/lh3/seqtk)from the collected host, virus and plasmid sequences, which contained 3,404 prokaryotic chromosome sequences (Prok), 5,515 prokaryotic virus sequences (ProkVir), 10,952 eukaryotic chromosome sequences (Euk), 173,082 eukaryotic virus sequences (EukVir), and 2,390 plasmid sequences (Plas).Validation sequences were randomly selected from the same original dataset after removing the training sequences.Training and validation sequences were labeled as one of the five classes, e.g., Prok, ProkVir, Euk, EukVir, and Plas.In order to train models at different sequence lengths (500 bp, 1 kb, 2 kb, 3 kb, 5 kb), we first cut training sequences into fixed-length fragments for each model.This resulted in roughly an equal number of chopped sequence fragments for each class for different length models.Specifically, the numbers of training fragments are 400,000 at 500 bp, 200,000 at 1 kb, 90,000 at 2 kb, 66,000 at 3 kb, and 38,000 at 5 kb, respectively.The validation dataset was ∼ 1/10 of the training dataset for each length model.
Both the training and validation sequence fragments were used as input to train a fully connected 3 4/25 layer one dimension multi-class CNN.Sequence fragments were first one-hot encoded strandwise into a binary matrix with the dimension of 4 × L where L is the length of the fragment, which was used as the input layer of the neural network.The neural network comprises three convolutional layers, with 64, 128 and 256 filters and kernel sizes of 6, 3, and 2, accordingly.To improve the model robustness and reduce overfitting, a max pooling layer and a batch normalization layer were added after each of the first two convolutional layers, and a global max pooling layer, a dropout layer and a flatten layer were added after the third convolutional layer.Two dense layers were connected after the convolutional layers, with the first one containing 500 hidden units and the second one containing 5 hidden units, corresponding to 5 input classes (Fig. S1).Input sequences were encoded in both forward and reverse directions, and the predicted classes were determined by the average scores of both results.When preparing training and validation data, first the training data was sampled from the entire downloaded data, then from the leftover non-training data, we sampled the validation and test datasets with the size of 1/10 of the training data, respectively.The validation and test datasets were used to compute the model performance metrics.
Two prediction modes were provided for user input sequences, the single mode and the hybrid mode.
The single mode allows users to select a specific length model, then to cut input sequences into nonoverlapping fixed-length chunks to fit the selected model, and finally to make predictions based on the cumulative scores of all chunks for each input sequence.Chunks smaller than the half of the model length will be discarded.In the hybrid mode, when possible, models with longer sequence length have the highest priority.Input sequences were first cut into chunks corresponding to longer models, the remaining part of the sequence were further cut into chunks corresponding to shorter sequence models if possible.This way, most part of the input sequences will be used for the prediction, and the longer models will be always preferred to maximize prediction accuracy.The final prediction scores will be the sum of predicted scores for all chunks.

Custom benchmark dataset preparation
To compare with the other state-of-the-art individual predictors, we have generated 20 equal-sized (1000 contigs) test datasets with a variable composition of the 5 sequence classes (Supplemental Table S1).

Use-case data preparation and analysis
The daily time-series metagenomic samples were taken off the coast of Southern California using an Environmental Sample Processor (ESP), and the 1�µm A/E filters (Pall Gelman) collected during the day were used for DNA extraction as described previously (Needham et al., 2018).Metagenomic libraries were prepared using the Ovation® Ultralow V2 DNA-Seq library preparation kit (NuGEN, Tecan Genomics) under the manufacturer's instruction using 10 ng of starting DNA and 13 PCR cycles.

Community doubling time and growth rate calculation
The prokaryotic community microbial growth rates were calculated using gRodon (Weissman et al., 2021) in weighted metagenomic mode with temperature adjustment.Specifically, prokaryotic contigs from each individual assembly were predicted using DeepMicrobeFinder and annotated using Prokka v1.14.5 (Seemann, 2014).Reads were mapped to predicted coding genes to get the coverage information, and genes encoding ribosomal proteins were used as highly expressed gene sets for growth rate prediction.
The ambient temperature recorded by the sampler was used for growth rate prediction according to the user manual.The direct output of gRodon is doubling time, which was converted to growth rate per day using equation 1, where µ and d stand for maximal growth rate per day and minimal doubling time, respectively.

A CNN-based multi-class classifier
Microbial eukaryotes and viruses infecting them are not dispensable but indigenous in microbial communities of diverse ecosystems.Confidently identifying these sequences in metagenomes is crucial to understanding their ecological roles.Unfortunately, most of the eukaryotic viruses and hosts are underappreciated by current state-of-the-art tools.For instance, assessed by the predicted viral scores, the two popular viral contig predictors, VirFinder (Ren et al., 2017) and PPR-Meta (Fang et al., 2019), gave high scores to prokaryotic viral sequences and low scores to prokaryotic host sequences.However, the scores for eukaryotic host and eukaryotic viral sequences were more evenly distributed (Fig. S2).

6/25
Out of 500 randomly subsampled genomic sequences for each sequence type of prokaryotes, prokaryotic viruses, microbial eukaryotes, and eukaryotic viruses downloaded from NCBI, 454 prokaryotic viruses and 85 prokaryotic hosts had VF-score above 0.5, while 238 eukaryotic viruses and 157 eukaryotic hosts had VirFinder-score (VF-score) above this value (Fig. S2a).A similar trend can be observed for PPR-Meta (Fig. S2b), indicating these tools are not well trained to handle eukaryotic virus and host sequences, which calls for the development of novel predictors that take more sequence types into consideration in the model training step.
Here we compiled a collection of training datasets and trained several CNN-based multi-class models using one-hot encoding at variable lengths (500 bp, 1 kb, 2 kb, 3 kb, and 5 kb) to simultaneously classify eukaryotic host, eukaryotic virus, prokaryotic host, prokaryotic virus and plasmid sequences in one shot (Fig. S1).Using test sequences randomly sampled from the datasets that were not used for training, we evaluated the model performance using the Receiver Operating Characteristics (ROC) curve for each sequence type for each trained model (Fig. 1).Overall, we showed that with the sequence length increased, the model performance improved based on the Area Under the Receiver Operating Characteristic (AUC or AUROC) measurements for most sequence types (Fig. 1).The AUC scores were higher for eukaryotic viruses and eukaryotic hosts, followed by prokaryotic viruses, plasmids, and lastly prokaryotic hosts (Fig. 1).When prokaryotic host and plasmid classes collapsed into one sequence type, the performance of all length models improved, indicating the misclassification between prokaryotic hosts and plasmids is a major caveat of DeepMicrobeFinder (Fig. 1).Left panel shows the ROC curves for 5 sequence classes at different model lengths (500bp, 1k, 2k, 3k and 5k), and the right panel shows the performance when prokaryotic hosts and plasmids were collapsed into the "Prokaryotes" class.

DeepMicrobeFinder outperforms EukRep in eukaryotic host sequence prediction
We compiled a list of benchmark datasets to cover 20 composition scenarios of different sequence types (Table S1).The five sequence types were first grouped into two umbrella classes: PROK (including prokaryotic hosts, prokaryotic viruses and plasmids) and EUK (eukaryotic hosts and eukaryotic viruses).
Five large groups were first determined using the PROK:EUK ratios of 9:1, 7:3, 5:5, 3:7, 1:9, then within each group, the fractions of host sequences decreased gradually (the details of benchmark dataset preparation can be found in the Materials and methods section).This allowed us to compare the performance of DeepMicrobeFinder with the other state-of-the-art predictors under variable sequence composition of metagenomes.had accuracy or F1 score lower than 0.8 in most cases (Fig. 2), and for those datasets EukRep achieved accuracy or an F1 score higher than 0.8, the EUK fraction (including eukaryotic hosts and viruses, Table S1) of the test datasets were less than 10% (Fig. 2, Table S1), suggesting the performance of EukRep decreases with the increase of the eukaryotic contig proportion.This trend also holds for each fixed EUK fraction, the performance of EukRep increases with the eukaryotic host:virus ratios decrease from 5:1 to 2:1 (Fig. 2, Table S1).In contrast, the accuracy and F1 score of DeepMicrobeFinder are higher than 0.9 in all tested scenarios with smaller standard deviations (accuracy/F1 score: 0.02/0.019)8/25 compared to EukRep (accuracy/F1 score: 0.17/0.18),indicating DeepMicrobeFinder is accurate and robust to all the tested scenarios.
A further look into those misclassified sequence types revealed that the poor performance of EukRep is mainly due to its low sensitivity in recognizing eukaryotic contigs, though some plasmids and prokaryotic viruses were also misclassified into eukaryotes when the PROK fractions were high (Fig. S4).

DeepMicrobeFinder outcompetes PlasFlow and PPR-Meta in plasmid sequence prediction
Plasmids are mobile genetic elements of diverse prokaryotes and are one of the major agents of horizontal gene transfer (HGT) among hosts.Here we compared the performance of DeepMicrobeFinder to PlasFlow (Krawczyk et al., 2018) and PPR-Meta (Fang et al., 2019) using the same benchmark datasets.
The F1 scores of DeepMicrobeFinder and PPR-Meta were higher than those of PlasFlow in all tested scenarios (pairwise Wilcoxon test adj.p-values ≤ 5.7e-06; Fig. S5), and DeepMicrobeFinder showed significantly improved results than PPR-Meta in all tested cases (pairwise Wilcoxon test adj.p-value ≤ 1.7e-05; Fig. 3 & S5).Moreover, DeepMicrobeFinder showed a conspicuous increment in performance metrics when the EUK fractions were high, while the performance of PPR-Meta and PlasFlow were severely impaired, which is particularly perceptible for PlasFlow (Fig. 3).For each fixed PROK fraction, the performance of DeepMicrobeFinder was roughly the same or slightly decreased with the increase of non-host sequences, while marginally improved for PPR-Meta and PlasFlow.We further examined the misclassified sequences and found PlasFlow had high sensitivity but low specificity, the dominance of misclassified sequence types was in line with the composition of benchmark datasets (Fig. S6a).PPR-Meta might benefit from its modeling of chromosomes and phages, while it still had a low specificity mainly due to the misclassification of prokaryotic and eukaryotic host sequences into plasmids (Fig. S6b).It's noteworthy that DeepMicrobeFinder might further benefit from its modeling of eukaryotic hosts and viruses since eukaryotic host sequences were rarely classified 9/25 as plasmids, though the misclassification rates between plasmids and prokaryotic hosts were still high (Fig. S7a).The performance of DeepMicrobeFinder can be improved by collapsing prokaryotic hosts and plasmids into one sequence class as demonstrated previously (Fig. 1), which also greatly reduced the total misclassified cases (Fig. S7b).This suggests that current tools are inefficient in distinguishing plasmids from prokaryotic hosts, and further improvements on the neural network structures or using additional features extracted from gene or operon centric approaches might yield a better classifier.

DeepMicrobeFinder achieves improved results in viral sequence prediction
Viruses are ubiquitously found in every natural system where cellular organisms colonize.Significant

& S8
).The accuracies and F1 scores of DeepMicrobeFinder were rarely lower than 0.9, while none of the other tested tools had an accuracy or F1 score higher than 0.9.In addition, with the share of EUK sequences increasing, the accuracies and F1 scores of VIBRANT, VirSorter, and PPR-Meta decreased, while DeepMicrobeFinder kept constant or slightly increased.VIBRANT and PPR-Meta showed slightly higher accuracy scores than VirSorter in most cases, while within each fixed PROK fraction, both of them showed a decreasing tendency in both accuracies and F1 scores with the increasing of non-host sequences, suggesting higher proportions of virus and plasmid sequences can degrade the performance of VIBRANT and PPR-Meta (Fig. 4a).In contrast, the performance of VirSorter was less variable within each fixed PROK:EUK group, and the accuracy and F1 scores could be higher than VIBRANT and PPR-Meta in cases where the host percentages were lowest (host:virus:plasmid ratio of 2:1:1 for PROK, and host:virus ratio of 2:1 for EUK) (Fig. 4).This is particularly true for PPR-Meta in terms of F1 scores, which could rapidly decline to lower than 0.7 when the host sequences were less dominant (Fig. 4b).A previous benchmarking study showed that VirSorter had a slightly higher specificity than VIBRANT on distinguishing plasmid sequences (Kieft et al., 2020), indicating the higher plasmid fraction in the benchmark datasets might degenerate the performance of VIBRANT and PPR-Meta.The misclassified sequences by VirSorter were mainly bacteriophages and eukaryotic viruses, indicating it suffered from low sensitivity (Fig. S9a).VIBRANT also showed low sensitivity in predicting bacteriophages and eukaryotic viruses, while it also frequently classified plasmids and prokaryotic genomes as viruses (Fig. S9b).The rapid deterioration of PPR-Meta with increasing EUK fraction can be attributed to its low specificity by misclassifying eukaryotes to phages and its low sensitivity by misclassifying eukaryotic viruses to chromosomes (Fig. 6b).
Both DeepMicrobeFinder and PPR-Meta are multiclass classifiers, here we also compared the accuracies and F1 scores of them on these custom test datasets (Fig. S10

DeepMicrobeFinder predicted more eukaryotic and viral contigs than reference-based predictors
Reference-based classifiers can suffer from incomplete genomic databases, particularly for complex natural environments such as marine or soil systems.To test the performance of DeepMicrobeFinder in real metagenomic context, here we examined its performance with the other two sequence classifiers, Kaiju (Menzel et al., 2016) and MetaEuk (Levy Karin et al., 2020a), using a 1-300 µm size fraction marine metagenomic dataset sampled off the coast of Southern California (Needham et al., 2018).Using the co-assembled contigs as the reference, we show DeepMicrobeFinder classified less prokaryotic but more eukaryotic, eukaryotic viral and prokaryotic viral contigs than Kaiju and MetaEuk (Fig. 5a).
Notably, though DeepMicrobeFinder assigned less prokaryotic and more eukaryotic reads than other classifiers, the relative abundance profiles across the whole time series are highly correlated (Fig. 12a 11/25 & 12b), and to a less extent for the prokaryotic viral read percentage profiles (Fig. 12c).This is not the case for eukaryotic viral read abundance profiles, where Kaiju and MetaEuk are highly correlated, but not to DeepMicrobeFinder (Fig. 12d).To sum up, DeepMicrobeFinder is more correlated with MetaEuk in eukaryotic read profiles, and more correlated with Kaiju in prokaryotic and prokaryotic viral read profiles.Clean reads were aligned to metagenomic contigs and percentages of mappable reads were calculated (e).Mapped read percentages were further summarized according to sequence types of reference contigs as predicted by DeepMicrobeFinder (f), Kaiju (g) and MetaEuk (h).Prokaryotes included both prokaryotic hosts and plasmids.
UnclassifiedViruses were sequences predicted to be viruses but their taxonomy couldn't be further resolved by Kaiju or MetaEuk.Same benchmarking datasets were used as in Fig. 2.

Abundant microbial eukaryotes are correlated with potential prokaryotic community growth rates
If we assume higher eukaryotic read percentages can be a proxy of higher eukaryotes-derived particles, will higher percentages of eukaryotic reads result in faster prokaryotic community growth?Here we calculated the prokaryotic community maximum doubling time using gRodon with species abundance correction (Weissman et al., 2021) and found a significant positive correlation between centered log-ratio (CLR) transformed eukaryotic read percentages and prokaryotic potential community growth rates (Fig. 6a & 6b), suggesting higher relative abundances of eukaryotes might brew more fast-growing particle-attached copiotrophs.Significant correlations (p-values < 0.01) can also be observed between the relative abundance profiles of prokaryotic and eukaryotic viral sequences, as well as between eukaryotic and prokaryotic viral read profiles (Fig. 6a).In contrast, the relative read abundance profiles of prokaryotic and prokaryotic viral sequences were negatively correlated with prokaryotic community potential growth rates, though this correlation relationship was less significant (Fig. 6a & 6b). 12/25
Due to challenges in cultivation and whole genome-sequencing of microbial eukaryotes, biodiversity surveys of microbial eukaryotes were commonly performed using marker genes, such as the 18S rDNA hypervariable V4 or V9 regions (Pawlowski et al., 2012;Amaral-Zettler et al., 2009).The amplicon-based analysis provides valuable information on the taxonomy of microbial eukaryotes, while in order to probe their metabolic potentials or ecological functions, genomic and transcriptomic information are essential.
Despite several achievements in collecting microbial eukaryotic genes (Carradec et al., 2018;Vorobev et al., 2020), transcripts (Keeling et al., 2014) or single-cell amplified genomes (SAGs) (Sieracki et al., 2019) towards a comprehensive microbial eukaryotic database, our knowledge are still limited by the availability of diverse microbial eukaryotic genomes ] (Burki et al., 2020).With the rapid accumulation of metagenomic datasets and availability of binning software, it's appealing to recover eukaryotic genomes from natural microbial communities.EukRep was developed in such a context to identify eukaryotic contigs for metagenomic binning (West et al., 2018).This approach has enabled the genome-resolved analysis of fungi, protists, and rotifers from human microbiome studies (West et al., 2018;Olm et al., 2019).Similar approaches have been applied to marine microbiome studies (Duncan et al., 2020;Delmont et al., 2020), which recovered hundreds of microbial eukaryotic MAGs and provided insight into the functional diversity and evolutionary histories of microbial eukaryotes beyond the taxonomic information.Large DNA Viruses (NCLDV) (Iyer et al., 2001).Since most of the commonly used viral predictors are trained on the RefSeq viral database, it's expected that these tools suffered from identifying eukaryotic viruses from the test datasets (Fig. 4 & S2).Given the high diversity of protists (Foissner, 1999;Slapeta et al., 2005), high throughput metagenomes and single-cell genomes are expected to offer a culture-independent solution to rapidly expand the coverage of viral database.For instance, two recent studies reconstructed 2,074 and 501 NCLDV MAGs from global environmental metagenomes (Schulz et al., 2020;Moniruzzaman et al., 2020), dramatically increased the phylogenetic and functional diversity of NCLDVs.Single-cell metagenomics was also employed to identify viruses infecting marine microbial eukaryotes (Needham et al., 2019b,a), these studies provided insightful findings of the viral encoded proteins and metabolic pathways.
These studies demonstrated that metagenomics and single-cell genomics can be promising in studying microbial eukaryotes and viruses infecting them.While most commonly used tools are not optimized in classifying eukaryotes (Fig. 2 & S3) or eukaryotic viruses (Fig. 4 & S2).Given the high performance of DeepMicrobeFinder and the evidence of abundant eukaryotic contigs in marine ecosystems (Fig. 5 & 6), we expect it will be a valuable addition to the toolbox of marine ecologists.

The challenge of classifying prokaryotic host and plasmid sequences
DeepMicrobeFinder has a relatively lower accuracy in classifying plasmids when compared to the classification of eukaryotic or viral contigs (Fig. 2, 3 & 4).The majority of the sequences that were misclassified as plasmids were from prokaryotic host genomes (Fig. S7), confirming classifying prokaryotic chromosomal and plasmid sequences is a caveat of DeepMicrobeFinder (Fig. 1).In comparison, the other tested plasmid classifiers suffered from both prokaryotic and eukaryotic sequences as we have benchmarked (Fig. 3, S5 & S6).It's noteworthy that this marginal advantage can be crucial in natural environments, such as the marine environments as we mentioned here (Fig. 5 & 6), where eukaryotic sequences can have a substantial impact on the performance of plasmid sequence classifiers.
This also indicates that it is achievable to separate plasmid sequences from eukaryotic sequences solely based on patterns of oligonucleotides, and current plasmid predictors can benefit from using a more comprehensive training dataset including eukaryotic sequences.
It is understandable given the higher genome complexity of eukaryotes than prokaryotes (Lynch & Conery, 2003), such as the coding density, prevalence of introns and repetitive sequences, etc.In contrast, it's challenging to classify plasmids and prokaryotic chromosomal sequences for all the tested plasmid predictors (Fig. 3).The reasons can be manifold, but plasmid transmission among microbial hosts and plasmid-chromosome gene shuffling can be two fundamental ones.The host range of plasmids is variable, it can be within closely related species for narrow host range plasmids, or across distant phylogenetic groups for broad host range plasmids (Jain & Srivastava, 2013).Broad host range plasmids can be important drivers of the gene flux among host microbes in natural environments (Heuer & Smalla, 2007;Wolska, 2003;Davison, 1999).For instance, in natural soil microbial communities, the IncP-and IncPromA-type broad host range plasmids were found to be able to transfer from proteobacteria to diverse bacteria belonging to 11 bacterial phyla (Klümper et al., 2015).When plasmid carriage could increase the fitness of the hosts, such as improving host survival with antibiotic resistance, it can be rapidly 14/25 adopted and persistently maintained in natural microbial communities (Li et al., 2020;Bellanger et al., 2014).On the other hand, when the maintenance of plasmids imposed a high fitness cost on the hosts, plasmids or plasmid-borne genes could be lost as in the process of purifying selection (Hall et al., 2016).
Interestingly, studies also suggested sometimes this fitness cost could be ameliorated by compensatory evolution (Millan et al., 2014;Harrison et al., 2015;Loftie-Eaton et al., 2017), which was hypothesized to be the major factor of plasmid survival and persistence (Hall et al., 2017).Plasmid carriage also increases the chance of plasmid-chromosome genetic exchange mediated by SOS-induced mutagenesis citeprodriguez-beltranHorizontalGeneTransfer2021 or mobile genetic elements such as transposons and integrons, etc citepfrostMobileGeneticElements2005,rodriguez-beltranHorizontalGeneTransfer2021.For instance, genes carried by transposons or in the variable regions were also frequently found on plasmids (Eberhard, 1990;Zheng et al., 2015).Thus, the permissive transfer of plasmids across diverse hosts and the plasmid-chromosome gene flow pose a challenge for current plasmid classifiers.The oligonucleotide-based approaches might be complemented by gene-centric approaches using plasmid signature genes or enriched gene functions, such as genes involved in mobilization or conjugation.In addition, a comprehensive plasmid database is also crucial for model training, and plasmid enriched metagenomics (plasmidome) can be a promising way to screen plasmids from environmental samples (Shi et al., 2018).

Ecological influence of microbial eukaryotes on the prokaryotic community
Marine water is a continuum of particles, which are aggregates of diverse planktonic detritus and minerals, providing nutrient rich microenvironments in the oligotrophic oceans (Simon et al., 2002).
Size-fractionated metagenomes are commonly used to study microbial lifestyles in aquatic environments.
Bacterial production experiments showed that though PA microbes were less abundant, their growth rates could be 20-fold higher than their FL counterparts (Friedrich et al., 1999).Fast-replicating PA bacteria were also tightly linked to chlorophyll a, suggesting their growth might be fueled by particulate organic matter (POM) derived from phytoplankton (Friedrich et al., 1999).PA microbes are also found to be associated with the degradation of hydrocarbons and lipid materials derived from eukaryotic plankton (Yoshimura et al., 2009;Wei et al., 2013;Fontanez et al., 2015), playing an important role in the POM remineralization and biogeochemical cycles.Nutrient-demanding copiotrophs (here Vibrio, Roseovarius, Polaribacter, etc, according to Needham et al. (2018)) are usually PA microbes, while oligotrophs (here SAR11, Prochlorococcus, etc) are FL adapted to nutrient poor environments (Giovannoni et al., 2014).
Copiotrophs encode more genes involved in carbohydrate and amino acid transport and metabolism than oligotrophs (Weissman et al., 2021), and can be classified solely based on the minimum doubling time (<5 hours), which correlates with the codon usage bias (CUB) of highly expressed genes (such as ribosomal genes) due to the selection for translational efficiency (Vieira-Silva & Rocha, 2010;Long et al., 2021;Weissman et al., 2021).Thus, the increase of eukaryotic read proportion can be used as a proxy of higher POM availability, which promotes the growth of fast-growing particle degraders and changes the prokaryotic community composition.Conversely, using a species abundance aware community growth rate prediction method (Weissman et al., 2021), as we have done here, one can also probe the relative nutrient or POM status of given samples based on the predicted potential community growth rates from metagenomes.
15/25 collection or analysis, the decision to publish, and preparation of the manuscript.We thank Dr. David M. Needham, Dr. J. Cesar Ignacio-Espinoza and Erin B. Fichot for their help with DNA extraction and metagenomic library preparation.

Fig 1 .
Fig 1.The ROC curves and AUC scores of different length models assessed on validation datasets.

Fig 2 .
Fig 2. Distribution of (a) accuracies and (b) F1 scores across 20 test datasets for DeepMicrobeFinder and EukRep.The top panel shows the sequence composition of 20 test datasets, the composition ratios of different sequence types can be found in TableS1.The dashed black lines indicate where accuracy or F1 score equals to 0.8.Please note a decreasing tendency in both accuracy and F1 score for EukRep along with the increasing eukaryotic fractions.

Fig 3 .
Fig 3. Distribution of (a) accuracy and (b) F1 scores across 20 test datasets for DeepMicrobeFinder, PlasFlow and PPR-Meta on plasmid classification.DeepMicrobeFinder achieved the highest performance in respect to accuracy (a) and F1 scores (b).Same benchmarking datasets were used as in Fig. 2.
advances have been made in recent years in developing tools to identify viral contigs from metagenomic assemblies, using essentially gene-centric (e.g., VirSorter, VIBRANT) or oligonucleotide-centric (e.g., VirFinder, DeepVirFinder, PPR-Meta) approaches.Here we compared the performance of DeepMi-crobeFinder to VirSorter, VIBRANT, and PPR-Meta on viral contig prediction using the aforementioned benchmark datasets.DeepMicrobeFinder achieved better performance in terms of accuracy and F1 score than all the other tools in all tested datasets (pairwise Wilcoxon test p-values ≤ 1.1e-05; Fig.4 ).Although DeepMicrobeFinder can classify more sequence classes than PPR-Meta, it still outperformed PPR-Meta in all tested cases in both accuracies and F1 scores (pairwise Wilcoxon test p-values ≤ 1.9e-06; Fig. S10 & S11).

Fig 4 .
Fig 4. Distribution of (a) accuracy and (b) F1 scores across 20 test datasets for DeepMicrobeFinder, VirSorter, VIBRANT and PPR-Meta on viral contig classification.DeepMicrobeFinder received the highest scores in both accuracy and F1 score in all tested scenarios compared to the other predictors.Increasing the fraction of eukaryotic related sequences didn't impaired the performance of DeepMicrobeFinder, but did for the other tools.The dashed black lines indicate where accuracy or F1 score equals to 0.8.Same benchmarking datasets were used as in Fig. 2.

Fig 5 .
Fig 5. Sequence classification and read abundance of a 1-300 µm size fraction marine metagenomic dataset sampled off the coast of Southern California.Metagenomic contigs were classified using DeepMicrobeFinder, Kaiju and MetaEuk at a length cutoff of 2 kb, and percentages of different sequence types were calculated (a).Contigs predicted as Prokaryotes by both Kaiju and MetaEuk (b), and contigs that were not classified by Kaiju (c) or MetaEuk (d) were further broken down into DeepMicrobeFinder's classification.Clean reads were aligned to metagenomic contigs and percentages of mappable reads were calculated (e).Mapped read percentages were further summarized according to sequence types of reference contigs as predicted by DeepMicrobeFinder (f), Kaiju (g) and MetaEuk (h).Prokaryotes included both prokaryotic hosts and plasmids.UnclassifiedViruses were sequences predicted to be viruses but their taxonomy couldn't be further resolved by Kaiju or MetaEuk.Same benchmarking datasets were used as in Fig.2.

Fig 6 .
Fig 6.Positive correlation of prokaryotic community potential growth rates and eukaryotic read percentages.Correlation coefficients were calculated between relative abundances of different sequence types and prokaryotic community potential growth rates (a).Prokaryotic community potential growth rates (b) as a linear function of centered log-ratio (CLR) transformed read percentages of eukaryotes and prokaryotes.Numbers in (a) were Pearson's r correlation coefficients, only significant (p-values ≤ 0.01) correlations were highlighted in colors.CLR transformation was performed using the mclr function from the SPRING package(Yoon et al., 2019) before performing correlation or regression studies.The potential growth rate of sample Mar_20 were determined to be outliers based on the Grubbs test (p-value = 5.4e-08), thus excluded from the regression analysis.
Beyond microbial eukaryotes, current viromic studies are biased towards viruses infecting prokaryotes.This could be introduced by the skewed distribution of viral genomes in the RefSeq database, which is dominated by phages and pathogenic viruses.By Jan 23, 2021, among 10,161 viral reference 13/25 genomes, there are only 33 records belonging to algae-infecting Phycodnaviridae and 6 belonging to protists-infecting Mimiviridae.Both of the two viral families are subgroups of the Nucleocytoplasmic 1 9