Antibiotics, often described as ‘miracle drugs’, revolutionized the prevention and treatment of bacterial infections1. However, irrational and indiscriminate use of antibiotics has driven the evolution of resistance against almost all major classes of antibiotics2,3. Antibiotic resistance is not restricted to any particular geographical region or country; it is a global phenomenon that has, over the last several years, spiraled into a major public health problem. Recognizing the crisis created by the expansion of microbial antibiotic resistance and the lack of significant antibiotic development efforts, the World Health Organization published the “first global priority pathogen list” in 2017, categorizing antibiotic-resistant bacteria into critical, high and medium priority pathogens according to the urgency of the need for new antibiotics4.
Antibiotic resistance is a dynamic phenomenon. Hence, regular surveillance of antibiotic resistance genes in microbes and metagenomes, spanning clinical and environmental settings, is important for understanding the epidemiology and foreseeing the emergence of new antibiotic resistance factors. The rapid decline in the cost of high-throughput DNA sequencing technologies has led to their routine use on clinical5–7 and environmental samples. Consequently, a plethora of information about antibiotic resistance and antibiotic resistance genes (ARGs) is available in gene databases. Metagenomic studies have also added immensely to our knowledge of the ARG pool8. It is expected that, in the near future, classical microbiological and molecular methods for determining antimicrobial resistance might be replaced by whole genome sequencing (WGS) based identification of microbial resistomes using antibiotic resistance databases and in-silico prediction tools. At present, however, a major hindrance to the identification and annotation of ARGs from WGS data is the abundance of fragmented genes/genomes in the genome databases (due to incomplete assembly). This implies that an ideal bioinformatics tool for detecting ARGs in whole genomes should also work effectively on short sequencing reads and fragmented contigs.
In the past few years, many in-silico resources and tools have been developed to predict, monitor, catalogue and characterize the spread of ARGs1. These include the Antibiotic Resistance Genes Database (ARDB)9, the Comprehensive β-lactamase Molecular Annotation Resource (CBMAR)10, ResFinder11, the Comprehensive Antibiotic Resistance Database (CARD)12, and Resfams13. These tools have provided important insights into the identification and prevalence of ARGs and the prediction of new resistance mechanisms. Among the different antibiotic resistance databases, CARD is the most popular among microbiologists owing to its free public availability and ease of use.
Several groups have reported that the existing bioinformatics tools for ARG prediction/identification are biased towards a few specific ARGs. For example, ResFinder11 and SEAR14 are tuned mainly to predict plasmid-based ARGs15. Another popular ARG resource, PATRIC, is suited to identifying resistance against carbapenem, methicillin, and β-lactam antibiotics.
Arango-Argoty et al. have shown that a similarity search of manually curated potential ARGs from the UniProtKB database against the CARD and ARDB genes yielded hits with sequence identity in the range of 20 to 60%, bit scores >50, and highly significant e-values (< 1e-20)16. This indicates that ARG annotation tools based on the “best hit” approach might suffer from a high rate of false negatives. In general, a high identity cut-off (~80% or above) is used as the threshold for determining the diversity of ARGs in a resistome, in order to avoid a high false positive rate17. As a consequence, a significant number of genuine ARGs that fall below the cut-off might be missed during the similarity search, that is, reported as false negatives. The problem of false negatives is further compounded in the case of short reads of nearly 100 bp or 25 amino acids.
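The trade-off described above can be illustrated with a minimal sketch of best-hit annotation under an identity cut-off. The `Hit` record, the function names, and all hit values below are hypothetical and chosen only for illustration; they are not taken from CARD, ARDB, or any real search output.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment hit (hypothetical record layout)."""
    query: str
    subject: str
    identity: float  # percent identity
    bitscore: float
    evalue: float


def best_hits(hits):
    """Keep, for each query, the hit with the highest bit score."""
    best = {}
    for hit in hits:
        current = best.get(hit.query)
        if current is None or hit.bitscore > current.bitscore:
            best[hit.query] = hit
    return best


def annotate_best_hit(hits, min_identity=80.0):
    """Annotate a query only if its best hit passes the identity cut-off."""
    return {
        query: hit.subject
        for query, hit in best_hits(hits).items()
        if hit.identity >= min_identity
    }


# Toy data (assumed values): q2 has a statistically significant hit,
# but at an identity far below a typical 80% cut-off.
toy_hits = [
    Hit("q1", "refARG_A", 98.5, 540.0, 1e-150),
    Hit("q1", "refARG_B", 67.0, 380.0, 1e-100),
    Hit("q2", "refARG_C", 45.0, 160.0, 1e-40),
]

strict = annotate_best_hit(toy_hits, min_identity=80.0)
relaxed = annotate_best_hit(toy_hits, min_identity=40.0)
```

With the strict cut-off, `q2` is left unannotated even though its hit has a highly significant e-value, which is the false-negative scenario discussed above. Lowering the cut-off recovers it, but in real data this would also admit more false positives.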
It has also been observed that very stringent criteria for similarity-search-based ARG annotation mostly identify known ARGs11,13. This means that several new ARGs discovered by functional screening would have been missed by the similarity-based approach18–23. These results also indicate that the best-hit similarity search approach (a) is suitable mainly for finding highly conserved ARGs, and (b) might fail to find novel ARGs and/or those with low sequence identity to known ARGs.
The abundance of ARGs in metagenomic and transcriptomic data is estimated either by aligning individual reads to a reference database or by de-novo assembly of short reads into large contigs followed by alignment to a reference database. Although alignment after short-read assembly is considered the better method of annotation, when read coverage is low the assembly is highly fragmented, and similarity-search-based methods are therefore not very useful. Because they efficiently model the conservation of a sequence, hidden Markov models (HMMs) can identify the regions that are necessary for protein function. Hence, an HMM-based tool might be more sensitive in detecting remote homology. Two of the most popular databases for protein annotation, namely Pfam and TIGRFAMs, are examples of the better annotation capability of HMMs in comparison to simple sequence alignment tools. However, profile HMMs have seen very limited application in ARG annotation. One such example is Resfams13, an HMM-based search tool. Resfams offers an alternative to tools based on pairwise local sequence alignment for identifying emerging and/or unknown ARGs on the basis of homology to functional domains of AMR proteins. It has been reported that the HMM-based method, Resfams, identified AMR proteins with a low false-discovery rate relative to the BLAST-based approach of AMR protein annotation using ARDB and CARD13.
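To make the idea of position-specific conservation concrete, the following is a deliberately simplified sketch: a match-state-only profile (log-odds scores per alignment column) built from a gap-free toy alignment. It omits the insert and delete states and the transition probabilities of a real profile HMM, and the alignment, the uniform background frequencies, and the pseudocount are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Build per-column log-odds scores from equal-length, gap-free sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for i in range(length):
        counts = Counter(seq[i] for seq in alignment)
        column = {
            aa: math.log2(((counts.get(aa, 0) + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        profile.append(column)
    return profile


def score(profile, sequence):
    """Sum the column scores of a sequence of the same length as the profile."""
    if len(sequence) != len(profile):
        raise ValueError("sequence length must match profile length")
    return sum(column[aa] for column, aa in zip(profile, sequence))


# Toy alignment (hypothetical): columns 0, 2, 3 and 5 are fully conserved.
toy_alignment = ["STFKAL", "SVFKVL", "SAFKIL", "STFKLL"]
profile = build_profile(toy_alignment)

# Keeps every conserved residue but changes both variable columns.
conserved_query = "SGFKGL"
# Matches the variable columns of the first sequence but mutates every conserved column.
divergent_query = "ATYRAV"

conserved_score = score(profile, conserved_query)
divergent_score = score(profile, divergent_query)
```

The profile rewards agreement at the conserved columns far more than agreement at variable ones, so `conserved_query` scores higher than `divergent_query`. A pairwise comparison against a single reference sequence weights all positions equally and cannot make that distinction.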
Despite the advantages of HMMs over simple alignment-based search tools, few attempts have been made to apply profile HMMs to ARG annotation. Here, we describe an in-silico tool named BacARscan (an acronym for “Bacterial Antibiotic Resistance scan”) for quick identification and annotation of ARGs in –omics datasets. BacARscan models insertions, deletions and the conservation of amino acids in the form of profile HMMs. Each profile HMM of BacARscan is annotated, using the existing ARG databases, with relevant details such as the class of antibiotics targeted by the gene and the functional type of the ARG. It is user friendly, freely accessible and can be integrated into any user-defined ARG annotation pipeline. BacARscan is composed of two types of ARG HMM libraries, namely pARGhmm and nARGhmm, which work on protein and gene datasets, respectively. During a search, BacARscan scans all profile HMMs against each query sequence and lists the ARGs found in the query dataset together with their corresponding details, such as the class of antibiotics targeted by the gene and the functional type of the ARG.
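The scan-and-annotate step described above could look roughly like the sketch below. It is not BacARscan's actual implementation or output format: it assumes, purely for illustration, that the profile-HMM library is searched with HMMER's `hmmscan` and that hits are read from its `--tblout` table. The model names, annotation entries, sample hit lines, and the e-value cut-off are all hypothetical.

```python
def parse_tblout(text, max_evalue=1e-5):
    """Parse HMMER --tblout text into (model, query, evalue, score) tuples.

    For hmmscan, column 1 is the profile (target) name and column 3 is the
    query sequence name; columns 5 and 6 are the full-sequence e-value and score.
    """
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split(maxsplit=18)  # the last field is a free-text description
        model, query = fields[0], fields[2]
        evalue, bitscore = float(fields[4]), float(fields[5])
        if evalue <= max_evalue:  # assumed significance cut-off
            hits.append((model, query, evalue, bitscore))
    return hits


def annotate_hits(hits, annotations):
    """Attach antibiotic class and functional type to each reported hit."""
    rows = []
    for model, query, evalue, bitscore in hits:
        info = annotations.get(model, {})
        rows.append({
            "query": query,
            "model": model,
            "evalue": evalue,
            "score": bitscore,
            "antibiotic_class": info.get("antibiotic_class", "unknown"),
            "functional_type": info.get("functional_type", "unknown"),
        })
    return rows


# Hypothetical per-model annotation table (names and entries are made up).
toy_annotations = {
    "pARG_0001": {"antibiotic_class": "beta-lactam",
                  "functional_type": "antibiotic inactivation"},
    "pARG_0002": {"antibiotic_class": "tetracycline",
                  "functional_type": "efflux"},
}

# Hypothetical hit lines laid out in the --tblout column order.
toy_tblout = """\
# target name  accession  query name  accession  E-value  score  bias ...
pARG_0001 - seq1 - 1.2e-80 265.3 0.1 1.5e-80 265.0 0.1 1.0 1 0 0 1 1 1 1 toy model A
pARG_0002 - seq2 - 3.0e-12 45.7 0.0 4.1e-12 45.2 0.0 1.1 1 0 0 1 1 1 1 toy model B
pARG_0002 - seq3 - 0.5 8.1 0.0 0.7 7.6 0.0 1.3 1 0 0 1 1 0 0 toy model B
"""

report = annotate_hits(parse_tblout(toy_tblout), toy_annotations)
```

Here `report` contains one row per significant hit, and the weak `seq3` hit is dropped by the assumed e-value cut-off. Keeping the annotation table separate from the search step is one way such a scan could be slotted into a user-defined pipeline.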