Homology-based prediction of resistance to antituberculous medications using machine learning algorithms


Objectives: We aimed to develop a prediction model based on machine learning algorithms to predict the impact of variants on drug resistance in Mycobacterium tuberculosis. Data were collected from the TB Drug Resistance Database (TBDReaMDB), and the drug sensitive variants were collected from the GMTV database. We also collected a list of 1115 nsSNPs reported in proteins related to resistance to Rifampicin, Isoniazid, Pyrazinamide and Ethambutol. The PMut online tool was used to generate the features used in algorithm training. We trained different classifiers in R on the features generated by PMut. The classifiers trained were Random Forest, Boosting, Naive Bayes, Neural Networks, k-Nearest Neighbors, Logistic Regression, and Linear Discriminant Analysis. Results: The 445 variants valid for comparison were divided into a training dataset (75%) and a testing dataset (25%). We compared the classifiers according to AUC, accuracy, kappa, sensitivity, specificity, positive predictive value, and negative predictive value. The results show that Random Forest was the best classifier (accuracy: 0.9072, kappa: 0.690, sensitivity: 1.00, specificity: 0.5909, positive predictive value: 0.8929, negative predictive value: 1.00, detection rate: 0.773). This indicates that homology-based machine learning algorithms could be a solid base for the development of rapid tools for screening M. tuberculosis resistance to medications.


Introduction
Tuberculosis (TB) is the most common cause of death attributed to a single microorganism worldwide. Without combatting antimicrobial resistance in Mycobacterium tuberculosis, the goals of the End TB strategy of the World Health Organization (WHO) will not be achieved. Antimicrobial resistance has been acknowledged as an important threat to public health and economic growth by the leaders of the G20 nations (1). The WHO estimates that of the 600 000 newly emerging MDR-TB or rifampicin resistant TB (RR-TB) cases every year, only 25% are detected and notified. Evidence also suggests that there is a 16% and 80% higher frequency among estimated and notified MDR-TB patients. In addition, the number of reported MDR/RR-TB cases increased by more than one third in 9 of the 30 high MDR-TB burden countries between 2015 and 2016 (1).
The definitive diagnosis of pulmonary tuberculosis is made by isolating M. tuberculosis from a bodily secretion (e.g., culture of sputum, bronchoalveolar lavage, or pleural fluid) or tissue (pleural biopsy or lung biopsy). Additional diagnostic tools are also considered sufficient to make a diagnosis; these include sputum acid-fast bacilli (AFB) smear and nucleic acid amplification (NAA) testing. Radiographic studies are also an important supportive tool (2).
The implementation of molecular tests worldwide has dramatically reduced the time needed to diagnose tuberculosis (3)(4)(5)(6). These tests can be used to diagnose the infection in different biological samples (sputum, cerebrospinal fluid, pleural fluid, urine, blood, stool) (6). Nucleic acid amplification tests (NAATs) also enable the prediction of resistance to important first- and second-line drugs on the basis of resistance-conferring mutations (7). Nevertheless, most commercially available tests cover only certain drug targets and remain inadequate when comprehensive individualized treatment regimens have to be established (1).
The best coverage for molecular drug susceptibility testing (DST) is given by sequencing the entire mycobacterial genome (whole genome sequencing, WGS) (8). This method reports all potential resistance-conferring mutations. However, WGS has a few drawbacks. First, WGS requires relatively large amounts of enriched mycobacterial DNA, which in practice means early positive culture material. Second, although special enrichment procedures make direct sequencing from fresh clinical samples feasible, this approach is still several years away from clinical implementation (9)(10)(11).
Researchers believe that if these issues can be overcome, information obtained from WGS combined with computational sequence analysis, which may be uploaded to large databases, is probably the future of molecular drug resistance prediction (12,13).
The aim of this study was to develop a bioinformatics machine learning algorithm capable of predicting the impact of variants on resistance to antituberculous medications. The next sections describe the methods used, the main features included in the classifier, and the accuracy parameters.

Study design
We developed a machine learning algorithm that is able to predict the impact of nonsynonymous variants (nsSNPs) on the susceptibility of Mycobacterium tuberculosis to antituberculous medications based on amino acid sequence and predicted protein secondary structure.

Databases:
We collected data from two databases: the TB Drug Resistance Database (TBDReaMDB) and the GMTV database, from which we collected the drug sensitive variants (14,15).
TBDReaMDB is a comprehensive resource on drug resistance mutations in M. tuberculosis, developed by conducting a systematic review to identify drug resistance mutations from the existing literature to include in the database (14).
GMTV contains a broad spectrum of data derived from different sources and related to M. tuberculosis molecular biology, epidemiology, TB clinical outcome, year and place of isolation, and drug resistance profiles, and it displays the variants across the genome using a dedicated genome browser. The GMTV database, which includes 1084 genomes and over 69,000 SNP or indel variants, can be queried about M. tuberculosis genome variation and putative associations with drug resistance, geographical origin, and clinical stages and outcomes (15).

Inclusion and Exclusion criteria:
We collected a list of 1488 nsSNPs that are associated with drug resistance to Rifampicin, Isoniazid, Pyrazinamide and Ethambutol. The data included the gene ID, protein ID, codon number, wild-type amino acid, mutant amino acid, drug susceptibility, and variant impact on the protein.
Variants were grouped into two groups (sensitive vs resistant) according to drug sensitivity. Variants found in drug sensitive organisms were labelled sensitive, whereas variants found in drug resistant organisms were labelled resistant. We included only proteins that have variants in both groups. These proteins are Rv0667, Rv1908c, Rv2043c, Rv2428, Rv3793, and Rv3795. The final number of variants included was 1115.
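The filtering rule described above (keeping only proteins that have variants in both the sensitive and the resistant group) can be sketched as follows. This is a minimal illustration in Python, not the procedure used in the study: the record layout, the function name `proteins_with_both_classes`, and the toy protein identifiers are all hypothetical.

```python
from collections import defaultdict


def proteins_with_both_classes(variants):
    """Keep only variants whose protein has both sensitive and resistant variants."""
    labels_by_protein = defaultdict(set)
    for v in variants:
        labels_by_protein[v["protein"]].add(v["label"])
    # A protein qualifies only if both labels were observed for it.
    keep = {
        protein
        for protein, labels in labels_by_protein.items()
        if {"sensitive", "resistant"} <= labels
    }
    return [v for v in variants if v["protein"] in keep]


# Toy records with invented identifiers (not taken from TBDReaMDB or GMTV).
toy_variants = [
    {"protein": "protA", "codon": 10, "label": "resistant"},
    {"protein": "protA", "codon": 25, "label": "sensitive"},
    {"protein": "protA", "codon": 31, "label": "resistant"},
    {"protein": "protB", "codon": 7, "label": "resistant"},  # only one class
    {"protein": "protC", "codon": 99, "label": "sensitive"},  # only one class
]

filtered = proteins_with_both_classes(toy_variants)
print(sorted({v["protein"] for v in filtered}), len(filtered))
```

In this toy input only `protA` has variants in both groups, so `protB` and `protC` are dropped.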

Features generation:
We used the PMut online tool to generate the features included in the algorithm training.
The PMut web portal allows the user to perform pathology predictions, to access a complete repository of pre-calculated predictions, and to generate and validate new predictors. The default predictor performs with good quality scores. The PMut portal is freely accessible at http://mmb.irbbarcelona.org/PMut (16).
The features computed by PMut and entered in the classifiers were the number of sequences in the alignment, the number of amino acids in the aligned position (no gaps), the total and relative number of aligned wild-type amino acids, the total and relative number of aligned mutated amino acids, the position weight matrix score, and the PMut overall score. Table (

Software/ Packages:
We used Microsoft Excel 2016 and SPSS v22 to clean the data. We used R to train the classifiers. We used the "caret" and "nnet" packages to train and evaluate the classifiers, and the "ROCR" package to plot the ROC curves and calculate the AUC for each classifier.
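To make the AUC criterion concrete, the sketch below computes it directly from its definition: the probability that a randomly chosen positive (here, resistant) example receives a higher score than a randomly chosen negative one, with ties counted as one half. It is a Python stand-in written for illustration only; the study itself used the R "ROCR" package, and the function name `roc_auc` and the toy scores are assumptions.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class, 0 for the negative class.
    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented classifier scores, for illustration only.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [1, 1, 0, 1, 0]
auc = roc_auc(toy_scores, toy_labels)
print(round(auc, 4))
```

The pairwise loop is quadratic in the number of examples, which is acceptable for a few hundred variants; a rank-based formulation would be preferable for larger datasets.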

Exploratory analysis:
Exploratory data analysis revealed that all frameshift mutations, insertions, and deletions (791 mutations) were associated with drug resistance, and none was reported in a drug sensitive organism. Regarding STOP variants, only 5 variants were found in drug sensitive organisms. As a result, these variant types were not included in the training and testing datasets. We divided the remaining 445 variants into two datasets: a training dataset (75%) and a testing dataset (25%). Table (3) and Figure S1 compare the performance of the classifiers.
The sensitivity of direct microscopy and of mycobacterial culture for a bronchoalveolar lavage sample is 19% and 50%, respectively, while the specificity is 90% and 100%, respectively, which makes both of them excellent diagnostic tools but unsuitable for tuberculosis screening (24). The features of our algorithm are exactly the opposite. The algorithm was able to accurately predict the impact of a variant on drug resistance. It identified all the drug resistant variants, with a sensitivity of 100%. However, the specificity was only around sixty percent.
This indicates that there is a considerable possibility of classifying a drug sensitive variant as drug resistant, whereas the probability of classifying a drug resistant variant as sensitive is extremely low. These features lead to the conclusion that machine learning based classifiers can be an excellent base for screening tools for mycobacterial drug resistance.
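The relationship between these performance figures can be illustrated with a short sketch that derives each metric from a 2x2 confusion matrix, taking "resistant" as the positive class. The source does not report the confusion matrix; the counts used below (75 true positives, 9 false positives, 13 true negatives, 0 false negatives) are an assumption, chosen only because they reproduce the rounded values reported for the Random Forest classifier. The function name `binary_metrics` is likewise hypothetical.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard metrics from a 2x2 confusion matrix (positive = resistant)."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return {
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1 - expected),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "detection_rate": tp / n,
    }


# Assumed counts, not reported in the source (see the paragraph above).
metrics = binary_metrics(tp=75, fp=9, tn=13, fn=0)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With zero false negatives, sensitivity and negative predictive value are both 1.00, while the 9 false positives among 22 sensitive variants account for the specificity of about 0.59.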
The question that arises is what the next step could be. The first step will be to develop an online server that can predict the impact of variants on drug resistance within minutes. If this server is combined with a molecular tool that can detect amino acid variants, the result will be a quick bedside screening tool for tuberculosis drug resistance.