Estimating the distribution of viral taxa in next-generation sequencing data using artiﬁcial neural networks

Background: Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step for comparative metagenomics. For that purpose, sequencing reads are usually classiﬁed by mapping them against a database of known viral reference genomes. This fails, however, to classify reads from novel viruses and quasispecies whose reference sequences are not yet available in public databases. Methods: In order to circumvent the problem of a mapping approach with unknown viruses, the feasibility and performance of neural networks to classify sequencing reads to taxonomic classes is studied. For that purpose, taxonomy and genome data from the NCBI database are used to sample artiﬁcial reads from known viruses with known taxonomic attribution. Based on these training data, artiﬁcial neural networks are ﬁtted and applied to classify single viral read sequences to diﬀerent taxa. Model building includes diﬀerent input features derived from artiﬁcial read sequences as possible predictors which are chosen by a feature selection method. Training, validation and test data are computed from these input features. To summarise classiﬁcation results, a generalised confusion matrix is proposed which lists all possible misclassiﬁcation combination frequencies. Two new formulas to statistically estimate taxa frequencies are introduced for studying the overall viral composition. Results: We found that the best taxonomic level supported by the NCBI database is that of viral orders. Prediction accuracy of the ﬁtted models is evaluated on test data and classiﬁcation results are summarised in a confusion matrix, from which diagnostic measures such as sensitivity and speciﬁcity as well as positive and negative predictive values are calculated. The prediction accuracy of the artiﬁcial neural net is considerably higher than for random classiﬁcation and posterior estimation of taxa frequencies is closer to the true distribution in the training data than simple classiﬁcation or mapping results. Conclusions: Neural networks are helpful to classify sequencing reads into viral orders and can be used to complement the results of mapping approaches. The machine learning approach is not limited to already known viruses. In addition, statistical estimations of taxa frequencies can be used for subsequent comparative metagenomics. 14


Background
Next-generation sequencing (NGS) is now regularly used to identify viral sequences in the biological sample of an infected host in order to relate the presence of a 5 virus with symptoms of the host [1][2][3]. Furthermore, NGS data is regularly used to determine the taxonomic composition of pathogens for the purpose of comparative metagenomics [4] or reconstruction of networks between microbial taxa [5]. Most computational virus detection pipelines or pipelines for determining the taxonomic composition map sequencing reads or assembled contigs against viral reference se- 10 quences available in databases [6][7][8][9][10]. These mapping approaches have been proven to be successful in a large number of examples, however, they mostly fail to classify reads from new emerging viruses whose sequences are not yet deposited in a database. Besides such mapping approaches, machine learning approaches have been used to classify reads to taxonomic ranks [11] such as genus, family, order, class 15 and phylum. Among the machine learning approaches listed by Peabody et al., only one approach studies the performance of deep learning models for taxonomic classification (Rasheed et al. [12]). Rasheed et al. use artificial neural networks (ANN) to classify reads to different taxonomic ranks and evaluate their models on datasets representing about 500 microbial species. Some approaches have also been under- 20 taken to use machine learning models on NGS data to identify or classify reads that belong to novel emerged viruses. Zhang et al. [13] as well as Bartoszewicz et al. [14] employ diverse machine learning models to identify viruses that can infect humans. Whilst Zhang et al. applied the k-nearest neighbour (k-NN) model using k-mer frequencies as input features for their classification, Bartoszewicz  In contrast to the approaches by Zhang et al. as well as Bartoszewicz et al., in this work we study the ability of an ANN to classify sequencing reads on the taxonomic rank of nine biological orders. While the ANN studied by Rasheed et al. was fitted on sequence reads from a rather small number of microbes, we train our model on 30 the basis of several thousand viral sequences available at the NCBI database [15]. In comparison to support vector machines or linear discriminant analysis, ANNs are reported to achieve the highest accuracy rates on test data, though it depends on the specific classification problem [16,17].
Before fitting ANNs, we link available viral reference sequences with their taxo- 35 nomic description and study the distribution of different taxonomic ranks (phyla, subphyla, classes, subclasses, orders, etc.) to identify on which of these ranks a solid basis for building a training dataset would be reasonable. Next, we extract kmer distributions and inter-nucleotide distances as input features from the available reference sequences. In total, we generated 120 input features. 40 In the methods section, we describe the building of the training data in more detail, as well as the ANN models used. Furthermore, we present the new formulas for estimating the frequency distribution of viral orders in a sample. In the results section, we present a simulation study in which we evaluate the performance of the ANNs and demonstrate how to estimate the frequency distributions. We also

Taxonomy and genome data
A taxonomy database containing data that are needed for taxonomic classifica-50 tion can be downloaded from the ftp-server ftp://ftp.ncbi.nih.gov/refseq/ release/viral of the National Center for Biotechnology Information [15,18]. We downloaded these data in March 2020.
A section of the taxonomy database file fullnamelineage.dmp from this ftpserver is depicted in Figure 1. In this file, taxonomic information about species are 55 listed in rows. Beginning with archaea, bacteria and fungi, viral data is included from line 2,007,414 onwards. Row entries contain species names and taxonomy ranks which are unfortunately neither labelled nor complete.
We assign these taxa to taxonomy ranks based on the International Committee on Taxonomy of Viruses [19]. To enable correct assignments, we use the taxa end-60 ings summarised in Table 1. The ending "...viridae" indicates to which family the virus belongs. The Bovine adenovirus 4 on line 2,007,418 from Figure 1, for example is a member of the family of Adenoviruses.
As the taxonomy database does not provide any unique NC identifier, the species names of the reference genome file are searched in the taxonomy file to link genome 65 with taxonomy data.
We obtain viral reference genome FASTA files [20] again from the NCBI database and concatenate the three files into a single one. The first lines of the FASTA file ( Figure 2) show its composition. Line one consists of a unique identifier together with its species name. The genome sequence consisting of the single-letter codes A, 70 C, G and T that represent the four nucleobases follows from line two onwards. Combining genome and taxonomy data, we create a structured taxonomy table including FASTA file species which replaces the unstructured original taxonomy file. As there is data missing, we only choose species with available taxonomy information.
From more than 2 million taxonomy file lines only about 200, 000 virus file lines 75 remain. In contrast, the FASTA file contains only 12, 180 viral genome sequences whereof 2, 332 sequences provide taxonomy data. Hence, maximum 2, 332 can be used for classification.

Taxonomy category
From a list of 14 taxonomy ranks, we need to decide which one to choose for 80 classifying viral reads. As a necessary criterion, any classification algorithm requires at least two different taxonomy levels. To achieve a high accuracy, artificial neural networks need a big amount of training data in relation to the number of different levels one wants to predict.
As a consequence, we look for a taxonomy category that has more than one, but 85 not too many different levels and that provides a sufficient amount of available viruses per level. This is why we choose the taxonomy category order from Table  2. Table 3 depicts the total taxa frequencies. Take notice of the imbalanced count distribution. For example, in the category "order" there are only seven Bunyavirales, 90 but 1, 859 Caudovirales taxa.

Sampling of viruses and reads
We continue with the statistical sampling of viruses and reads from which the input features will be computed subsequently. In step one, from each order j ∈ {1, ..., C}, we randomly sample the corresponding viruses without replacement and sort them 95 into training, validation and test viruses. The disjunct partition is 70, 15 and 15%. For instance, from seven Bunyavirales viruses in total, five are used for training and one for validation or test data, see Table 4.
In step two, we randomly sample with replacement the same amount of viruses out of each of the 3 · 9 = 27 cell entries. From each order j ∈ {1, ..., C} and for 100 training, validation and test data separately (i ∈ {1, 2, 3}), we sample n i,j of N i,j viruses which yields a total amount of n viruses: ∀j ∈ {1, ..., C} : n 1,j + n 2,j + n 3,j = 0.7n + 0.15n + 0.15n = n .,j = n The second sampling stage is illustrated in Figure 3. Let us assume, for test data we want to sample two viruses within each order. The randomly chosen viruses 105 are surrounded by squares. However, there is the possibility that some orders only provide an insufficient amount of viruses so that viruses of these orders need to be sampled multiple times. For example, the single Bunyavirales virus must be sampled twice.
The whole sampling plan from step one and two is also known as two-stage strat-110 ified sampling with disproportionate allocation [21]. After sampling the viruses, we sample reads of the FASTA file by choosing random start positions out of a discrete uniform distribution and a read length of 150.

Input features
Possible input features to select are k-mers [22], inter-nucleotide distances [23] and 115 DNA sequence motifs [24]. k-mers are subsequences of length k of the original read sequences such as A, TT or GTA. Inter-nucleotide distances are the distances between two identical bases. For instance, the DNA sequence TTATACTACGTGGGGGGGGGTCCT exhibits T-T distances of 1, 2, 3, 4, 10 and 3 and we are interested in the count frequencies of the distances 2 to 10. A DNA sequence motif is a set of words which 120 can have arbitrary or a subset of A, C, G and T letters on particular base positions while having fixed letters on other positions. Here, we focus on DNA sequence motifs with two or three fixed letters, therefrom two at the margins, such as C.....T or G..G...A. Points represent place holders.
The number of input features sums up to 125 4 1 + 4 2 + 4 3 + 4 · 9 + x + y = 4 + 16 + 64 + 36 + x + y = 120 + x + y, whereof 4 k are the numbers of k-mers, 36 inter-nucleotide distances and x = 19 or y = 24 the numbers of selected DNA sequence motifs with two or three fixed letters, respectively. Since we do not select any DNA sequence motifs, which will be justified in the next subsection, we use 120 input features in total. We do not select more  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 on the one hand this will result in too many input features and on the other hand the probability of frequency zero is too high. The decision of the selection of input features will be made on the results of Kruskal-Wallis tests and accuracy comparisons of neural network simulation runs. 135

Feature selection
The first step is, for each DNA sequence motif separately, to count the relative frequency in each complete nucleobase sequence of the FASTA file. In the second step, we compare the frequencies between and within the groups or orders and perform a Kruskal-Wallis test [25]. High differences of the frequencies between the 140 orders in relation to low differences within the orders point out that this particular DNA sequence motif could be a good input feature to differentiate between the orders. On the third step, we select the DNA sequence motifs with the lowest pvalue respectively highest effect size 2 [26], up to a sequence length of eight.
From all possible DNA sequence motifs with two or three fixed letters, we choose 145 those with an effect size not less than 0.08 or 0.2, respectively. For small data, we iterate 10 simulation runs per layer in the artificial neural network and report mean and standard deviation of accuracy results (see Table 5). Each layer uses a subset of all proposed input features. If the paired t-test recognises a statistically significant and relevant accuracy increase of the subsequent layer, its corresponding input 150 features are added to the artificial neural net model. Finally, the suggested feature selection method chooses 1-mers, 2-mers, 3-mers and inter-nucleotide distances as input features.

Artificial neural net
For computation of the artificial neural network (ANN) [27], we apply the R-package 155 keras [28]. The sequential model contains 100, 000 training reads per order. The input layer contains 120 input nodes, the hidden layer 64 nodes and the output layer nine nodes for the nine different orders. The respective activation functions used are Rectified Linear Unit [29] for the hidden layer and Softmax [30] for multinomial classification of the output layer.

160
As optimisation algorithm we choose stochastic gradient descent. To quantify the model performance, sparse categorical crossentropy and accuracy are suitable metrics.
While training the model, the data are shuffled before each epoch to allow faster convergence and to prevent memorizing of the order of the training samples. [31]. 165 The batch size is 2 10 and maximum epoch count is 500 with an early stopping method to prevent overfitting [32].
The whole simulation procedure is iterated N sim times on independently generated sets of data samples each, see Figure 5.

170
A generalised confusion matrix [33] provides an appropriate summary of classification result metrics. Cell entry (j, k + 2) of Table 6 counts the reads of order j − 1 which were classified to order k − 1. The test data provide 21, 429 reads per order as row-sums, in total 9 · 21429 = 192861. The diagonal from top left to bottom right provides the correct classification frequencies. 175   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 As additional metrics, true positive and negative rates T P R = T P T P + F N and T N R = T N T N + F P as well as positive and negative predictive values P P V = T P T P + F P and N P V = T N T N + F N are reported in the special confusion matrix in Table 7. Note that in both Tables 6 180 and 7 all numbers are mean values on 10 simulation runs of which each utilises a different partition of training, validation and test data.

Estimation of taxa frequencies
The artificial neural network classifies single reads into taxonomy ranks. However, we do not only want to predict to which order a particular read belongs, but also we 185 want to estimate the global taxa count distribution of the nine orders. In this way we gain a summary of species abundance on taxonomy ranks. Prior and posterior taxa count distributions are calculated as follows.
There are N sim = 10 simulation runs and N cat = 9 categories, in this case orders.
is a finite sequence of generalised confusion matrices, without first two columns, which contain absolute read count frequencies of the simulation runs.
is a finite sequence of generalised positive predictive values, that is the probability that a read classified as taxon k * actually belongs to taxon j: For each simulation run with s ∈ {1, . . . , N sim }, we compute a leave-one-out mean matrix The mean values are calculated over all P s matrices except the one corresponding to the particular simulation run: Note that all training, validation and test samples of all simulation runs are generated artificially from the genome and taxonomy database and thus their real taxonomy ranks are known to the analyst, with the exception of test data of simulation   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 run s. Correct classifications of test samples of this simulation run are unknown to the analyst in real-world scenarios, see Figure 5, and hence to be predicted. This is 205 why those data are excluded from the computation of leave-one-out mean matrix P s . The column sums of By applying the law of the total probability and multiplying a leave-one-out mean P s matrix by A, we receive the posterior estimation of read count distribution: By using information about misclassification rates and generalised positive predictive values, the estimation of taxa frequencies should be improved. These, however, 215 includes the necessity to perform multiple simulation runs on independently generated artificial data to estimate misclassification rates. Apart from estimating taxa frequencies based on the classification results of artificial neural net simulation runs, we implement a supplementary mapping approach where we generate artificial 150 base pair read sequences with constant quality val-220 ues and map these versus our virus reference genomes represented in the training data using Bowtie2 [34]. As we want to classify new, yet unknown, viruses from each order we randomly sample 15% of the viruses and use these to sample 21, 429 artificial read sequences which are stored in a FASTQ file. The remaining viruses from each order are stored in a FASTA file so that virus sets of both files are disjunct. 225 This procedure is conform to sampling viruses and reads of the ANN approach which ensures comparability. Finally, from the mapping results the taxonomy ranks are determined and taxa frequencies are computed.
Whether the posterior estimation is actually superior to prior estimation and mapping estimation will be examined in the results section.

Results
Accuracy performance history As a first result, the performance history of four different simulation runs is exemplarily depicted in Figure 4. Primarily, the smooth curves stand out which is caused by a large batch size. The training time differs greatly, from 500 epochs on plot A 235 to only 50 on plot D. Training loss decreases negatively exponentially and training accuracy increases logarithmically. Validation accuracy is considerably higher than 1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 11% which means that our neural net classifications are clearly more precise than simple random classifications. Plot A shows an exceptional validation performance behaviour. Validation loss decreases slightly until epoch 150, then increases to a value higher than on start, while validation accuracy is still increasing. This indicates that the uncertainty for predicting class probabilities increases, whereas the classification results are becoming more precise. Plots B to D partly show better loss and accuracy performances on validation data in comparison to training data.

Generalised confusion matrix 245
While according to Table 7 the relative majority of reads from most orders is correctly classified, the misclassification rate is rather high regarding other orders. Due to the close relationship between Bunyavirales order and Paramyxoviridae family, which belong to Mononegavirales order, most of Bunyavirales reads are falsely classified into Mononegavirales order, but not the other way around. In detail, 7, 606 Bunyavirales reads are classified to Mononegavirales and only 2, 728 Mononegavirales reads are classified to Bunyavirales. This is justified by basic principles of set theory. Since Bunyavirales and Mononegavirales, to which the Paramyxoviridae family belongs, share the same ancestor, a relative majority of Bunyavirales reads is classified to the subset of Mononegavirales viruses that belong to the Paramyx-255 oviridae family and thus is classified correctly. Contrariwise, only Mononegavirales reads selected from Paramyxoviridae family are majoritarian classified to Bunyavirales order and this does not apply for reads from other Mononegavirales families. Nidovirales, Picornavirales and Tymovirales are biologically related and Herpesvirales descend from Caudovirales which explains high misclassification rates between 260 these orders (see Table 6). For instance, 3, 816 or 4, 458 Picornavirales reads are mapped to Nidovirales or Tymovirales, respectively.

Estimation of taxa frequencies
The boxplots on Figure 6 show the distributions of mapping, prior and posterior estimation of taxa frequencies of the nine orders. Indeed, both variance and bias 265 and thus mean squared error of posterior estimation are considerably smaller for every order in comparison to prior or mapping estimation.
The mapping estimation performed worst by far. Mapping rate was low with a mean value of 14.5% and a standard deviation of 3.3% which is caused by the disjunct partition of FASTQ and FASTA viruses in the simulation runs. From the 270 remaining mapped reads, not a single read was mapped to a Bunyavirales or Tymovirales virus and thus taxa frequencices of these orders were underestimated. In contrast to that, Caudovirales and Ligamenvirales taxa ranks were overestimated considerably. In median, more than 40% of all reads were mapped to Caudovirales although only 100/9 ≈ 11% was expected. Since the difference in the statistical es-275 timation error of the mapping approach in comparison to both other approaches is graphically obvious, no statistical estimation was necessary to prove its inferiority.
Besides the graphical comparison, we apply Mood's median test to compare the median absolute deviation of estimated frequencies from the true value of 21, 429 test reads between the groups "Prior" and "Posterior". The p-value of 280 0.00000002197 is highly significant. A paired alternative to the median test is the   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 sign test. Indeed both prior and posterior estimation are computed on same data for each simulation run which allows the usage of the sign test, yet we used the more conservative median test to confirm the MAD (mean absolute deviation) error differences of both estimators. Median absolute deviations of both groups to the true 285 value are 8, 872.5 or 2, 741.1 which results in a difference of 6, 131.4. Thus, median posterior estimation of taxa frequencies is more than 6, 000 reads more precise than median prior estimation. That implies, using generalised positive predictive values contributes vastly to more precise taxa frequency distribution and even prior estimation is superior to a mapping approach based estimation, at least if the probe is 290 composed of new virus reads.

Performance comparison between machine learning and mapping approach
In this work, we classified viral read sequences to nine different viral taxonomic orders by using ANNs. Furthermore, we developed two formulas to statistically 295 estimate species abundance on taxonomy ranks. After building and training, the performance history of the model was evaluated graphically on validation data. Prediction accuracy of the derived methods was evaluated on test data and classification results were summarised. From this confusion matrix, diagnostic measures such as true and false positive rates as well as positive and negative predictive values 300 were calculated. The prediction accuracy of the artificial neural net was considerably higher than for random classification and posterior estimation of taxa frequencies was relatively precise in comparison to prior estimation or to order distributions obtained by the mapping approach, at least for yet unknown, new viruses.
Evaluation of our new approach shows that an ANN model can be helpful to 305 classify mapped viral read sequences into viral taxa. The results of this classification could also be helpful to support or complement the findings of a mapping approach by classifying reads from novel viruses, too. The calculation of misclassification rates and training of the ANN can be done independently of the test data the analyst wants to classify and thus can involve large datasets and require a long runtime 310 which could help to improve accuracy and estimation performance. As a benefit in comparison to mapping, our new approach is not limited to already known viruses. In addition, statistical estimations of taxa frequencies provide an insight into taxa abundance of viral species.
For fitting the ANN we draw training reads from only 70% of viruses from each or-315 der. With this, we took into account scenarios with reads of novel viruses in the data of a test sample. If an ANN model was trained on reads from all available viruses in a database instead, we would expect that a mapping based approach would excel the ANN performance. Nevertheless, the ANN model could then be used to validate the mapping results. Independently of which of those two approaches, mapping or 320 machine learning, are used for taxa classification, in both cases taxa abundance estimations could be enhanced by our posterior estimation formulas applying the law of the total probability. While these formulas were evaluated for our ANN model, here, as a future perspective we want to also implement an algorithm to enhance classification accuracy of mapping results based on estimated misclassification rates. 325   1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65 Phylogenetic explanations for misclassifications In our evaluation of the neural net for classification of viral orders, several misclassifications were observed. A reason for the false classification of reads from Bunyavirales to the order Mononegavirales could be explained by the close phylogenetic relation between these two orders (cf. Table 1 from Wolf et al. [35]).
Furthermore, the phylogenetic relationship between Picornavirales as an ancestor and Nidovirales as a descendant on branch two [35], which both belong to positive-sense RNA viruses, is somewhat close which results in a high one-way misclassification rate from Picornavirales to Nidovirales.

Bunyavirales Caudovirales Herpesvirales Ligamenvirales Mononegavirales
Step one of the sampling procedure is to randomly sample from each order viruses without replacement and allocate them to training, validation and test viruses. 70, 15 or 15% of sampled viruses are used for training, validation or test samples, respectively.
Training Validation Test Σ  Table 5 Feature selection. We choose DNA sequence motifs with an effect size not less than 0.08 or 0.2, respectively. This yields 19 or 24 DNA sequence motifs. Using short-runtime artificial neural networks on small data, we iterate the simulation runs 10 times per depicted layer and report the accuracy results. In this step-wise procedure, we select only those sets of features that contribute a significant and relevant accuracy increase. The paired t-test is statistically not significant on layer four and accuracy increase is not relevant on layer four, which is why DNA sequence motifs are excluded as features and thus not selected. Only 1-mers, 2-mers, 3-mers and inter-nucleotide distances remain for the actual simulation procedure on big data.