Deep learning (DL) models in computational biology have been applied to an increasing number of challenges1, such as virus detection2, antibiotic resistance prediction3, and contamination removal4. Such models are often developed with DL application programming interfaces (APIs) such as Keras, which enable researchers to stack neural layers into deep neural networks. However, these interfaces do not provide solutions for efficient data handling and network training for genomic data modalities.
With deepG, we provide a software library that includes adaptations for genomic datasets at the nucleotide and amino acid levels and provides an easy-to-use interface for training and applying DL networks. In deepG, the standard workflow starts from collections of FASTA files divided into two or more sets that correspond to class labels. The deepG data generator iterates over the input files to train a deep neural network, which can then be applied to new datasets to perform predictions (Fig. 1A, S1).
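The workflow above can be illustrated with a minimal sketch. This is not the deepG API (deepG is used through its own library functions); the helper names, the fixed context size, and the per-class file layout are assumptions made for illustration only:

```python
# Minimal sketch of a label-from-folder FASTA data generator:
# iterate over FASTA files grouped by class label and yield
# one-hot encoded, fixed-length windows with their labels.
import itertools

BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            else:
                chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)

def one_hot(seq):
    """Encode a nucleotide string as 4-dim one-hot vectors;
    ambiguous bases (e.g. N) become all-zero vectors."""
    out = []
    for base in seq:
        vec = [0.0] * 4
        if base in BASES:
            vec[BASES[base]] = 1.0
        out.append(vec)
    return out

def sample_generator(files_by_class, context=150):
    """Cycle over per-class FASTA files and yield (window, label)."""
    for label, path in itertools.cycle(files_by_class.items()):
        for _, seq in read_fasta(path):
            for start in range(0, len(seq) - context + 1, context):
                yield one_hot(seq[start:start + context]), label
```

In practice such a generator feeds batches to a Keras model; deepG's actual generator additionally handles padding, reverse complements, and the subsampling strategies discussed below.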
A common challenge of applying DL to genomics is that input sequences are typically longer than in traditional DL domains, especially when input samples are full (meta)genomes. To address this, the deepG library includes specialized neural networks that account for long-range dependencies spanning multiple batches of samples (stateful long short-term memory) as well as architectures designed for long input sequences, such as WaveNet5, which implements residual and parameterised skip connections with dilated convolutions to speed up convergence in this data regime. Further, we have found that processing nucleotide sequences in a naïve, purely sequential fashion often leads to highly ineffective training, since this approach successively processes potentially varying nucleotide distributions of input samples, leading to repeated regimes of under- and overfitting (Fig. 1B). We identified and implemented combinations of subsampling strategies that mitigate these problems (Fig. 1A), making model training straightforward for a wide range of research questions. Furthermore, the library supports more customized and advanced training methods, such as the training of language models and fine-tuning (Fig. 1C) (see Supplementary Methods).
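The subsampling idea can be sketched as follows. This is an illustration of the concept, not deepG's implementation: instead of consuming one genome end-to-end, each training sample is drawn from a randomly chosen input sequence at a random offset, so every batch mixes nucleotide distributions across files and classes. All names are assumptions:

```python
# Sketch: class- and position-randomized subsampling, so batches do not
# reflect the composition of any single input genome.
import random

def random_subsample_batch(sequences_by_class, context=150,
                           batch_size=8, rng=None):
    """Draw a batch of (window, label) pairs from random sequences
    at random offsets.

    sequences_by_class: dict mapping class label -> list of sequences
    (in practice these would be streamed from FASTA files).
    """
    rng = rng or random.Random()
    batch = []
    labels = list(sequences_by_class)
    for _ in range(batch_size):
        label = rng.choice(labels)                       # random class
        seq = rng.choice(sequences_by_class[label])      # random sequence
        start = rng.randrange(0, len(seq) - context + 1) # random offset
        batch.append((seq[start:start + context], label))
    return batch
```

Because consecutive batches are no longer dominated by a single genome's nucleotide distribution, the oscillation between under- and overfitting described above is avoided.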
We validated our method by developing classifiers for common bioinformatic tasks and show that our approach achieves accuracy comparable, or even superior, to highly specialized state-of-the-art tools. To demonstrate the wide range of possible applications, we trained supervised models at the read level, locus level (within non-coding regions), gene level, genome level, and metagenome level (Fig. 2a). We provide code notebooks that fully reproduce these use cases at http://deepg.de.
At the read level, we trained a model that discriminates between bacterial and human sequences, which can be used to screen for human contamination in metagenomic data. After 3 hours of training with a context size of 150 nucleotides (nt), corresponding to a typical read length, the model achieved a balanced accuracy of 97% when trained on a set of bacterial genomes and a human reference genome. Although the model was trained on genomic fragments larger than sequencing reads, its context size of 150 nt allows direct, alignment-free inference on FASTQ files. We evaluated it on a paired-end metagenomic dataset with a processing speed of over 250,000 reads per minute on a consumer-grade graphics processing unit (GPU), demonstrating practical throughput on real-world metagenomic screening scenarios. deepG also supports model development directly on read-level data (FASTQ files) and can account for base-call uncertainty in the network's input encoding by using a probability encoding instead of one-hot encoding, which is of potential interest for the direct processing of long, low-quality sequencing reads (Fig. S2). When comparing model-based contamination detection on a synthetic evaluation dataset, where 75% of the reads originate from E. coli and the remaining 25% from a human genome, the deepG model shows accuracy similar to the alignment-based read removal tool BMTagger6 with 97.58% vs. 
97.98% accuracy (non-significant difference on ten datasets according to a paired Wilcoxon test). However, compared to alignment-based methods, the deepG model generalizes better to other eukaryotes: when replacing the human contaminant with mouse reads, our model correctly classifies 98.13% of the reads, compared to 75.25% for BMTagger (Fig. 2a).
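The probability encoding mentioned above can be sketched as follows. deepG's exact scheme may differ; here each called base is weighted by 1 − p_error derived from its Phred quality score, with the error probability spread over the other three bases (an assumption for illustration):

```python
# Sketch: probability encoding of a FASTQ read from Phred+33 qualities,
# replacing hard one-hot vectors with soft per-base probabilities.
BASES = {"A": 0, "C": 1, "G": 2, "T": 3}

def phred_to_error(qual_char, offset=33):
    """Convert a Phred+33 quality character to an error probability."""
    return 10 ** (-(ord(qual_char) - offset) / 10)

def probability_encode(seq, quals):
    """Encode a read as per-base probability vectors instead of one-hot."""
    out = []
    for base, q in zip(seq.upper(), quals):
        p_err = phred_to_error(q)
        vec = [p_err / 3] * 4          # error mass shared by other bases
        if base in BASES:
            vec[BASES[base]] = 1 - p_err
        else:                          # ambiguous call: uniform
            vec = [0.25] * 4
        out.append(vec)
    return out
```

A high-quality base thus approaches its one-hot vector, while a low-quality base approaches a uniform distribution, letting the network downweight unreliable positions in long, noisy reads.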
At the locus level, we trained a model that predicts CRISPR arrays, which are variable-length features present in non-coding parts of sequences. CRISPR array identification is of interest because it cannot be captured by models leveraging the profile of a multiple sequence alignment, such as profile Hidden Markov Models (pHMMs)7,8: there are no clearly conserved motifs, only conserved higher-order structures9. The deepG model outperforms strategies that rely purely on local alignment in regions prone to false positives, such as Staphylococcus aureus repeat-like elements10, with an accuracy of 95% on these CRISPR-like sequences, and reaches an area under the receiver operating characteristic curve (AUC ROC) of 0.98 (Fig. 2b,c). This shows that deepG models have higher representational power, taking the semantics, syntax, and synteny of genomic sequences into account, and work well on non-coding parts of the genome. Such models can be used similarly to pHMMs but do not require a multiple sequence alignment, operate on plain collections of genes or other sequences such as genomic islands, and can model more powerful relationships, making them applicable to a wider range of sequences and genomic regions than pHMMs.
At the gene level, we used deepG to build and train a classifier that detects 16S rRNA genes (context size of 500 nt), achieving a balanced accuracy of 0.975 within 1 hour of training, using the genomic gene pool as the background. deepG can apply such models within a sliding window over an input sequence to screen for possible hits, as demonstrated when we identified the location of this gene in E. faecalis. We compared the predictive performance on 1,059 genomes with Barrnap11, a bacterial ribosomal RNA predictor based on HMMs. The two tools agree on 99.3% of genomes. For two genomes where Barrnap outputs warnings due to a low alignment fraction, the deepG model reports no hits; deepG reports false positives on 5 genomes, which could be filtered out using a reasonable length cutoff of at least 800 nt and a mean aggregated confidence of at least 0.8.
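The sliding-window screening and the length/confidence filter can be sketched as a small post-processing step. The thresholds follow the text; the function name, window/step sizes, and merging logic are assumptions, not the deepG API:

```python
# Sketch: turn per-window model confidences into filtered gene hits by
# merging consecutive positive windows and applying the length (>= 800 nt)
# and mean-confidence (>= 0.8) cutoffs from the text.

def call_hits(confidences, window=500, step=250,
              threshold=0.5, min_len=800, min_conf=0.8):
    """confidences[i] is the model score for the window starting at i*step.
    Returns (start, end, mean_confidence) tuples for retained hits."""
    hits, run = [], []
    for i, c in enumerate(confidences + [0.0]):   # sentinel flushes last run
        if c >= threshold:
            run.append((i, c))
        elif run:
            start = run[0][0] * step
            end = run[-1][0] * step + window
            mean_c = sum(s for _, s in run) / len(run)
            if end - start >= min_len and mean_c >= min_conf:
                hits.append((start, end, mean_c))
            run = []
    return hits
```

An isolated positive window spans only a single window length and is discarded, which is exactly the kind of short false positive the cutoffs remove.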
To demonstrate the application of deepG at the genome level, we predicted bacterial morphology from the genomic sequence. Here, we used labels provided by the BacDive database12 on sporulation, a trait of particular interest due to the potential epidemiological danger of spore-forming bacteria. With the library, we trained a model that predicts the ability to sporulate based on subsequences of 1M nt. We applied the model to full genomes (test data) unseen during training, with an average inference time of less than 7 seconds per genome. Since the context size of 1M nt may be smaller than the genome to be predicted, deepG runs multiple predictions over the genomic sequence (every 100,000 nt) and aggregates these into a final prediction. On this task, deepG achieved a balanced accuracy of 97.1% on the test set, incorrectly classifying 15 genomes out of 512, while Traitar, a tool that infers sporulation and other phenotypic properties using presence/absence information of gene families13, falsely classified 29 genomes (94.3% balanced accuracy). Combined with the short prediction time, this demonstrates the utility of deepG for predicting bacterial sporulation from genomic sequence. Our model was trained for around 10 days on a single data center GPU and on a different dataset than Traitar, which could also contribute to the increase in accuracy.
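The genome-level aggregation can be sketched as follows. The 1M-nt context and 100,000-nt stride come from the text; the function name and the mean aggregation rule are assumptions for illustration (deepG may aggregate differently):

```python
# Sketch: strided whole-genome prediction with mean aggregation.

def aggregate_genome_prediction(genome_len, score_fn,
                                context=1_000_000, stride=100_000):
    """score_fn(start, end) stands in for a trained model's probability
    for the window [start, end); returns the mean score over all windows."""
    starts = range(0, max(genome_len - context, 0) + 1, stride)
    scores = [score_fn(s, s + context) for s in starts]
    return sum(scores) / len(scores)
```

For a 1.5 Mnt genome this evaluates six overlapping 1M-nt windows (offsets 0 to 500,000) and averages their scores into one genome-level prediction.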
To demonstrate the application of deepG to full metagenomes, we trained a supervised model on the Chinese cohort of a colorectal cancer (CRC) study (128 metagenomes)14. The input data are the full metagenome samples of the study, with individuals grouped into CRC and healthy subjects. The resulting deepG model reaches an AUC ROC of 0.82, similar to the scores reported by the authors of the original study (0.81 and 0.87 AUC ROC, depending on the analysis strategy). deepG therefore achieved comparable performance without requiring any alignment, taxonomic annotation, or functional annotation, and it is not limited to coding regions or the identification of functional groups, allowing a more comprehensive analysis of the data. To facilitate model training when sequences are long, for instance when one input sample is a whole metagenomic sample, deepG can be run in a “set training”15 regime, in which no order is encoded among subsamples (contigs).
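The order-invariance of set training can be illustrated with a toy example. This is not deepG's architecture; the composition-based embedding below is a stand-in for a learned per-contig encoder, used only to show why pooled set representations ignore contig order:

```python
# Sketch: permutation-invariant "set" representation of a metagenome by
# mean-pooling independent per-contig embeddings.

def embed_contig(seq):
    """Toy per-contig embedding: normalised nucleotide composition
    (a trained network would produce a learned embedding instead)."""
    n = max(len(seq), 1)
    return [seq.count(b) / n for b in "ACGT"]

def set_pool(contigs):
    """Permutation-invariant pooling: mean over contig embeddings."""
    embs = [embed_contig(c) for c in contigs]
    return [sum(col) / len(embs) for col in zip(*embs)]
```

Because the mean is symmetric in its arguments, shuffling the contigs of a sample leaves the pooled representation, and hence the downstream prediction, unchanged.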
While most of the aforementioned applications describe the construction of a supervised binary classifier, where a model is trained from scratch to discriminate between two classes, deepG can also run in a multi-label setting – e.g., for the prediction of multiple classes at a taxonomic rank, or when metagenomes are grouped into more than two sets. Moreover, since deepG models are implemented with the Keras functional API, it supports custom and more advanced models – e.g., when the user has access to additional information, such as clinical metadata, that can be used as input for the model. Besides supervised training, deepG also supports unsupervised training modes, such as Contrastive Predictive Coding16 and Self-GenomeNet#. In this setting, a model is trained on unlabeled data and can later serve as a foundation for supervised tasks, increasing performance and speeding up model convergence by making efficient use of unlabeled datasets to capture informative representations. deepG also supports Tensorboard, a tool that allows users to track training runs and generate custom metrics, such as balanced accuracy (Fig. S3).
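The balanced accuracy metric mentioned above is the mean of per-class recalls, which, unlike plain accuracy, is robust to class imbalance. A standalone sketch of the computation (deepG logs it via Tensorboard; this version only illustrates the formula):

```python
# Sketch: balanced accuracy = mean of per-class recalls.
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += (t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)
```

On an imbalanced dataset, a classifier that always predicts the majority class gets high plain accuracy but only 0.5 balanced accuracy, which is why the latter is reported throughout this work.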
While further tools for implementing DL models for genome sequence data are available (as reviewed by Alharbi et al.17; the Python packages Janggu18 and Selene19 also apply DL models to genomic input), these tools are designed with a focus on applying DL to human genome data. deepG additionally implements data augmentation strategies to handle scenarios that arise when DL is applied to microbial data collections. Furthermore, we demonstrated deepG to be effective for bacterial, viral, and mixed sequence origins across different species, from 150 nt read-level data to (meta)genomes using set learning.
To support the full taxonomic range and different sequence lengths, deepG comes with a range of data augmentation methods and training schemes required for such input types (Fig. 1C), making DL training possible for non-human datasets. Another key feature of deepG is its support for both supervised and unsupervised models. This gives researchers the ability to use pre-trained models, e.g., to speed up training and improve accuracy on supervised tasks. Pre-trained models could also be used for clustering based on the neural representation, an alternative that may provide more accurate and robust results than classical clustering methods20.
The deepG software resource will enable many researchers to create and apply customized learning models when the classification problem is very specific and other software tools are outdated or unavailable, or where classical machine learning tools reach their limitations, e.g., due to the Markov assumptions of pHMM models or the lack of known features. deepG code, documentation, and interactive case studies can be found at https://deepG.de.