1.1 Motivation
According to the World Health Organization (WHO), infectious diseases continue to be one of the leading causes of death worldwide. Recent research has shown that humans are heavily colonized by thousands of microbes that may be harmful, harmless, or beneficial to human health (Qin et al., 2010; Hooper and Gordon, 2001). Because the human core microbiota is very diverse, it is particularly challenging to determine whether a particular bacterial strain is pathogenic to humans.
Currently, the gold standard for identifying infectious agents is Koch's postulates, established in the 19th century, which require isolation and cultivation of microbial strains (Sassetti et al., 2003; Young et al., 1984). However, culturing takes a long time and many pathogens are difficult to culture, so this approach rarely meets clinical requirements.
Another, phylogeny-based approach to classifying bacteria by human pathogenicity is to look for molecular features (Falkow, 1997), two-component systems (Stock et al., 2000), and secretion systems (Hacker and Kaper, 2000). However, these methods have two problems. First, such features are exchanged between pathogenic and avirulent strains of the same or different species through horizontal gene transfer (HGT) (Frost et al., 2005; Frost et al., 2008; Ho Sui et al., 2009). Second, some virulence genes do not directly determine virulence but are nevertheless essential for the bacterium to survive in the host and evade the host's immune system (Wassenaar and Gaastra, 2001; Paine et al., 2002).
With the latest advances in next-generation sequencing (NGS) technology, DNA sequencing has become the state of the art in pathogen detection (Lecuit and Eloit, 2014; Calistri and Palù, 2015). The amount of data in bacterial sequence databases is accumulating rapidly (Benson et al., 2015; O’Leary et al., 2016), and bioinformatics tools and techniques that use NGS data are increasingly applied to the diagnosis and monitoring of infectious diseases (Lecuit and Eloit, 2014; Calistri and Palù, 2015). Even when no biological background is available, machine learning methods can infer the pathogenic phenotype from NGS reads, independently of databases of known organisms, and such methods are being studied intensively.
1.2 Related Work
Existing methods for predicting pathogenicity can be broadly divided into two types: read-based (Miller et al., 2013; Byrd et al., 2014; Naccache et al., 2014; Deneke et al., 2017; Bartoszewicz et al., 2019) and protein-based (Iraola et al., 2012; Cosentino et al., 2013; Eran et al., 2018), as described below. The method we propose in this article belongs to the former category.
Read-based methods use raw short genomic reads as input. Miller et al. and others developed several tools for mapping NGS reads to reference genomes and for classification (Miller et al., 2013; Byrd et al., 2014; Naccache et al., 2014). However, these tools predict taxonomy rather than phenotype, depend heavily on the taxonomic coverage of the underlying reference data set, and cannot be used to predict novel pathogens.
Recently, several read-based phenotype prediction methods have been published.
Deneke et al. proposed PaPrBaG, a random forest-based pathogenicity prediction method (Deneke et al., 2017). It predicts the pathogenicity of novel, unknown bacteria after training on a large number of labelled pathogenic and non-pathogenic bacteria. Whereas other methods rely on similarity to known reference genomes, PaPrBaG can make predictions from NGS data with very low genomic coverage.
Another recent method for predicting pathogenicity is the Deep Learning Approach to Pathogenicity Classification (DeePaC), which includes a universal, extensible framework for neural architectures that guarantee identical predictions for any given DNA sequence and its reverse complement (Bartoszewicz et al., 2019). It combines reverse-complement architectures with the integration of predictions for both mates in a read pair, and it designs reverse-complement convolutional neural networks and Long Short-Term Memory (LSTM) networks, reducing the error rate by nearly half compared to the previous state of the art.
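The two properties described above, identical predictions for a sequence and its reverse complement and the integration of predictions for both mates of a read pair, can be illustrated with a minimal Python sketch. This is not the DeePaC implementation: DeePaC builds the invariance into the network architecture, whereas the sketch simply averages the outputs of an arbitrary scoring function over both strands and then over both mates. The names `rc_invariant`, `predict_pair`, and `toy_score` are hypothetical, and `toy_score` is a stand-in for a trained classifier.

```python
from typing import Callable

# Watson-Crick complements; N (unknown base) maps to itself.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def rc_invariant(score: Callable[[str], float]) -> Callable[[str], float]:
    """Wrap a scoring function so a read and its reverse complement score the same.

    Assumption: invariance is obtained by output averaging, which is a
    simplification and not the architecture-level approach used by DeePaC.
    """
    def wrapped(seq: str) -> float:
        return 0.5 * (score(seq.upper()) + score(reverse_complement(seq)))
    return wrapped


def predict_pair(score: Callable[[str], float], mate1: str, mate2: str) -> float:
    """Integrate the predictions for both mates of a read pair by averaging."""
    return 0.5 * (score(mate1) + score(mate2))


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: fraction of A bases."""
    return seq.count("A") / len(seq) if seq else 0.0


if __name__ == "__main__":
    model = rc_invariant(toy_score)
    read = "AACG"
    # The raw toy score differs between strands; the wrapped score does not.
    print(toy_score(read), toy_score(reverse_complement(read)))
    print(model(read), model(reverse_complement(read)))
    print(predict_pair(model, "AACG", "TTTT"))
```

Averaging is the simplest way to obtain the invariance, at the cost of evaluating the model twice per read.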
Although these new methods are more accurate, applying any of them in mission-critical contexts remains problematic because black-box models are not interpretable and because opportunistic pathogenicity is neglected.
Protein-based methods characterize the phenotype of a microbe by the presence or absence of members of protein families (PFs) in its genome, provided that an assembled genome is available.
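As a hedged illustration of this representation, the Python sketch below turns a set of genomes, each described by the protein families detected in it, into binary presence/absence feature vectors of the kind a classifier could be trained on. The function name `presence_absence_matrix`, the genome names, and the PF identifiers are all hypothetical; none of them come from the tools discussed here.

```python
from typing import Dict, List, Set, Tuple


def presence_absence_matrix(
    genomes: Dict[str, Set[str]]
) -> Tuple[List[str], Dict[str, List[int]]]:
    """Encode each genome as a 0/1 vector over all observed protein families.

    Returns the sorted list of family identifiers (the column order) and a
    mapping from genome name to its presence/absence vector.
    """
    # The feature space is the union of all families seen in any genome.
    families = sorted(set().union(*genomes.values()))
    rows = {
        name: [1 if pf in pfs else 0 for pf in families]
        for name, pfs in genomes.items()
    }
    return families, rows


if __name__ == "__main__":
    # Hypothetical toy input: genome name -> protein families found in it.
    toy_genomes = {
        "genome_1": {"PF_a", "PF_c"},
        "genome_2": {"PF_b"},
        "genome_3": {"PF_a", "PF_b", "PF_c"},
    }
    columns, matrix = presence_absence_matrix(toy_genomes)
    print(columns)
    for name, row in matrix.items():
        print(name, row)
```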
Iraola et al. proposed BacFier, the first large-scale application of the protein-based approach, which trains a Support Vector Machine (SVM) model to predict bacterial virulence based on known families of orthologous genes (Iraola et al., 2012). This method depends on a virulence factor database that annotates virulence at the gene level and is therefore limited to specific proteins known to be associated with virulence. It ignores many unannotated genes whose sequences are available and that may be related to virulence (or anti-virulence) functions.
Other protein-based pathogenicity prediction tools create and annotate PFs based on their frequency of occurrence in pathogenic or non-pathogenic organisms, without depending on pre-established databases (Cosentino et al., 2013; Eran et al., 2018).
Cosentino et al. (2013) developed PathogenFinder, a web server for predicting bacterial pathogenicity from proteomes, genomes, or raw reads (https://cge.cbs.dtu.dk/services/PathogenFinder/). Rather than relying on proteins already known to be involved in pathogenicity, the web server uses a selection of protein groups created without regard to annotated function or known involvement in pathogenicity. It can predict pathogenicity for all taxonomic groups of bacteria with 88.6% accuracy. Because the approach is not biased toward known pathogenicity factors, it could also be used to discover novel ones. However, the step that clusters proteins into PFs is a computational bottleneck.
Eran et al. proposed BacPaCS, a machine learning method for bacterial pathogenicity classification based on a sparse support vector machine (sparse-SVM) (Eran et al., 2018). By fully automating training on clinically relevant data, it greatly shortens computation time and allows a much larger training data set than before. On a clinically relevant data set containing only bacteria with human hosts, BacPaCS distinguished pathogenic from non-pathogenic bacteria with high accuracy.
However, the human body hosts many microbes in long-term coexistence. In most cases these microbes do not exhibit pathogenicity, but under certain conditions they become pathogenic to humans. Previous methods have not considered such opportunistic pathogens and are therefore not well suited to clinical requirements.
In this paper we propose IMLA, a novel interpretable machine learning approach for classifying unidentified bacterial genomes as human pathogens, opportunistic pathogens, or neither. Because model interpretability is essential for healthcare applications, we then interpret the model with the following model-agnostic interpretation methods: feature importance, accumulated local effects, and Shapley values.
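To make one of these interpretation methods concrete, the Python sketch below computes exact Shapley values for a single prediction by enumerating all feature subsets. It is an illustration under stated assumptions rather than the procedure used in this paper: features absent from a subset are replaced by a fixed baseline value, the function name `shapley_values` and the toy model are hypothetical, and the exhaustive enumeration is exponential in the number of features, so it is only practical for very small feature sets.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(
    model: Callable[[Sequence[float]], float],
    x: Sequence[float],
    baseline: Sequence[float],
) -> List[float]:
    """Exact Shapley values of model(x) relative to model(baseline).

    Assumption: a feature outside the coalition takes its baseline value.
    """
    n = len(x)

    def value(coalition: set) -> float:
        # Features in the coalition come from x, all others from the baseline.
        mixed = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size that exclude feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(features: Sequence[float]) -> float:
    """Hypothetical score: one additive feature plus one pairwise interaction."""
    return 2.0 * features[0] + features[1] * features[2]


if __name__ == "__main__":
    x = [1.0, 1.0, 1.0]
    baseline = [0.0, 0.0, 0.0]
    phi = shapley_values(toy_model, x, baseline)
    print(phi)
    # Efficiency property: contributions sum to model(x) - model(baseline).
    print(sum(phi), toy_model(x) - toy_model(baseline))
```

In the toy model, the additive feature receives its full effect, while the two interacting features share the interaction term equally.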