BacARscan: A Comprehensive and Interactive Web-Resource to Discern Antibiotic Resistance Gene Diversity in –Omics Datasets

Regular surveillance of antibiotic resistance genes (ARGs) is important to understand the emergence and epidemiology of antibiotic resistance (AR) in clinical and environmental niches. With diminishing costs, NGS technologies are anticipated to replace classical microbiological and molecular methods for determination of AR. One major hindrance to the identification and annotation of ARGs from WGS data is that a major part of genome databases contains fragmented genes/genomes (due to incomplete assembly). Herein, we propose a web resource of bacterial ARGs, named BacARscan (Bacterial Antibiotic Resistance scan), to detect, predict and characterize ARGs in metagenomic, genomic and proteomic data. The current version of BacARscan comprises 254 ARG models, each annotated with a resistance profile against different classes of antibiotics, resistance mechanism, etc. Benchmarking on a combined dataset of AR and non-AR proteins yielded 92% precision and 95% F-measure. BacARscan can also discriminate between homologous protein families of which not all are involved in AR. BacARscan identified more ARGs in (a) gut microbiome and (b) datasets comprising short-read genomic and proteomic sequences of ESKAPE pathogens. Analysis of clinical metagenomic data indicated its potential to complement and/or supplement WGS-based identification of ARGs in clinical samples. BacARscan standalone software and web-server are freely available at http://proteininformatics.org/mkumar/bacarscan and in the GitHub repository (https://github.com/University-of-Delhi-south-campus/BacARscan).

Comparative analysis of prediction of different types of AR genes: multiple programs (ARDB, CARD, Resfams and BacARscan) were searched against a gut metagenomic dataset composed of nucleotide and translated sequences, and the different types of AR genes predicted by each method are shown.


Introduction
Antibiotics, often designated as 'miracle drugs', revolutionized the prevention and treatment of bacterial infections 1. However, irrational and indiscriminate use of antibiotics has resulted in the evolution of antimicrobial resistance against almost all major classes of antibiotics 2,3. Antibiotic resistance is not restricted to any particular geographical region or country; rather, it is a global phenomenon, which over the last several years has spiraled into a major problem of public health significance. Realizing the crisis created by the expansion of microbial antibiotic resistance and the lack of significant antibiotic development efforts, the World Health Organization in 2017 published the "first global priority pathogen list" and categorized antibiotic-resistant bacteria into critical, high and medium priority pathogens, according to the urgency of need for new antibiotics 4.
Antibiotic resistance is a dynamic phenomenon. Hence, regular surveillance of antibiotic resistance genes of microbes and metagenomes, spanning clinical and environmental settings, is important to understand the epidemiology and foresee the emergence of new antibiotic resistance factors. The rapid decline in the cost of high-throughput DNA sequencing technologies has resulted in their use on routine clinical [5][6][7] and environmental samples. Consequently, a plethora of information about antibiotic resistance and antibiotic resistance genes (ARGs) is available in gene databases. Metagenomic studies have also added immensely to our knowledge of the ARG pool 8. It is expected that in the near future, classical microbiological and molecular methods for determination of antimicrobial resistance might be replaced by whole genome sequencing (WGS) based identification of microbial resistomes using antibiotic resistance databases and in-silico prediction tools. However, at present a major hindrance in the identification and annotation of ARGs from WGS data is the abundance of fragmented genes/genomes in the genome databases (due to incomplete assembly). This also implies that an ideal bioinformatics tool for detecting ARGs in whole genomes should be able to work effectively on short sequencing reads and fragmented contigs, too.
In the past few years, many in-silico resources and tools have been developed to predict, monitor, catalogue and characterize the spread of ARGs 1. These include the Antibiotic Resistance Genes Database, ARDB 9, the Comprehensive β-lactamase Molecular Annotation Resource, CBMAR 10, ResFinder 11, the Comprehensive Antibiotic Resistance Database, CARD 12, and Resfams 13. These tools have provided important insights into the identification and prevalence of ARGs and the prediction of new resistance mechanisms. Among the different antibiotic resistance databases, CARD is the most popular among microbiologists owing to its free public availability and ease of use.
Several studies have reported that the existing bioinformatics tools for ARG prediction/identification are biased towards a few specific ARGs. For example, ResFinder 11 and SEAR 14 are more tuned to predict plasmid-based ARGs 15. Another popular ARG resource, PATRIC, is suitable for identifying resistance against carbapenem, methicillin, and β-lactam antibiotics.
Arango-Argoty et al. have shown that a similarity search using the manually curated potential ARGs from the UniProtKB database against the CARD and ARDB genes resulted in sequence identities in the range of 20 to 60%, bit scores >50, and highly significant e-values (< 1e-20) 16. This indicated that ARG annotation tools based on the "best hit" approach might suffer from a high rate of false negatives. In general, a high identity cut-off (>80%) is used as a threshold to determine the diversity of ARGs in a resistome.
This means that a significant number of ARGs might be missed as false negatives during a similarity search. The problem of false negatives is further compounded in the case of short reads consisting of nearly 100 bp or 25 amino acids. To avoid a high false positive rate, generally a very high identity threshold (~80%) is considered 17, which might result in a high rate of false negative ARG annotation.
It has also been observed that very stringent criteria for similarity search based ARG annotation would mostly identify known ARGs 11,13. This means that several new ARGs that were discovered by functional screening would have been missed by the similarity-based approach 18-23. The results also indicate that the best-hit similarity search approach (a) is more suitable to find only highly conserved ARGs, and (b) might fail to find ARGs that are novel and/or have low sequence identity to known ARGs.
The abundance of ARGs in metagenomic and transcriptomic data is estimated either by alignment of individual reads to a reference database or by de-novo assembly of short reads into large contigs followed by alignment to a reference database. Though alignment after short-read assembly is considered a better method of annotation, in the case of low read coverage the assembly is highly fragmented; hence similarity search based methods are not very useful. Owing to their ability to efficiently model the conservation of a sequence, Hidden Markov models (HMMs) can identify the regions that are necessary for protein function. Hence an HMM-based tool might be more sensitive in the detection of remote homology. Two of the most popular databases for protein annotation, namely Pfam and TIGRFAMs, are examples of the better annotation capability of HMMs in comparison to simple sequence alignment tools. However, profile HMMs have found very limited application in ARG annotation. One example of the use of HMMs in ARG annotation is Resfams 13, which is an HMM-based search tool. Resfams offers an alternative to the pairwise local sequence alignment based tools for identification of emerging and/or unknown ARGs on the basis of homology to functional domains of AMR proteins. It has been reported that the HMM-based method, namely Resfams, identified AMR proteins with a low false-discovery rate vis-à-vis the BLAST-based approach of AMR protein annotation using ARDB and CARD 13.
Despite several advantages of HMMs over simple alignment based search tools, only very limited attempts have been made to apply profile HMMs in ARG annotation. Here, we describe an in-silico tool, named BacARscan (acronym for "Bacterial Antibiotic Resistance scan"), for quick identification and annotation of ARGs in -omics datasets. BacARscan was developed by modeling insertions, deletions and conservation of amino acids in the form of profile-HMMs. Each profile-HMM of BacARscan is annotated with relevant details, such as the class of antibiotics targeted by the gene and the functional type of the ARG, using the existing ARG databases. It is user friendly, freely accessible and can be integrated in any user-defined ARG annotation pipeline. BacARscan is composed of two types of ARG HMM libraries, namely pARGhmm and nARGhmm, that work on protein and gene datasets respectively. During the search, BacARscan scans all profile-HMMs against the query sequence and lists the ARGs in the query dataset with their corresponding details, such as the class of antibiotics targeted by the gene and the functional type of the ARG.

Retrieval of antibiotic resistance gene and protein sequences
The data on antibiotic resistance genes/proteins were collected from various databases that contain information about antibiotic resistance genes/proteins, such as ARDB 9, ARG-ANNOT 24 31. Some of these databases are focused only on ARGs, while a few provide ARG information as part of a larger schema such as integrons (INTEGRALL) or antibiotic resistance cassettes (RAC). In addition to the antibiotic resistance databases, we also collected data from published research papers. All ARGs were first binned on the basis of their antibiotic inactivation profile. Members of each bin were then divided into distinct clusters on the basis of sequence homology using the BLASTClust module of NCBI's BLAST program at an identity threshold of 90% and query coverage of 95% (BLASTCLUST with the "-S 90 and -L 0.95" options). After removing protein clusters that were composed of only fragmented sequences, we obtained 254 sequence clusters.
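The cluster-filtering step can be illustrated with a minimal sketch. It assumes that the clustering output is a plain-text file in which each line lists the whitespace-separated identifiers of one cluster, and it uses the minimum cluster size of five sequences that is stated for the positive dataset; the function names and the example identifiers are hypothetical and are not part of BacARscan.

```python
from typing import Iterable, List


def read_clusters(lines: Iterable[str]) -> List[List[str]]:
    """Parse cluster output: one cluster per line, identifiers separated by whitespace."""
    return [line.split() for line in lines if line.strip()]


def keep_large_clusters(clusters: List[List[str]], min_size: int = 5) -> List[List[str]]:
    """Keep only clusters with at least `min_size` member sequences."""
    return [cluster for cluster in clusters if len(cluster) >= min_size]


# Hypothetical identifiers, for illustration only.
example_lines = [
    "p1 p2 p3 p4 p5 p6",
    "p7 p8",
    "",
    "p9 p10 p11 p12 p13",
]
clusters = read_clusters(example_lines)
kept = keep_large_clusters(clusters)
print(f"{len(kept)} of {len(clusters)} clusters retained")
```

Removal of clusters that contain only fragmented sequences is not modelled here, since the criterion for calling a sequence fragmented is not specified in the text.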
Building of pARGhmm profile from antibiotic resistance protein sequences
First, we performed a cluster-wise global multiple sequence alignment (MSA) of proteins using MUSCLE 32 with default parameters. Each MSA was manually curated to remove fragmented sequences. Each MSA was then converted into a profile-HMM using the hmmbuild function of HMMER (version 3.1) with default settings 33.
The resulting library of profile-HMM models of the protein sequences of ARGs is henceforth referred to as pARGhmm. To assess the ability of the HMM models to accurately identify the proteins of their own family, we used a dataset composed of ARG and non-ARG sequences. We observed that with an e-value threshold of 1e-6 we were able to assign the maximum number of proteins to the correct ARGhmm. Hence, for all subsequent searches we used 1e-6 as the e-value threshold. When multiple ARGhmms were aligned to the same region of a protein with e-values lower than 1e-6, the model with the maximum alignment score was chosen as the best ARG model.
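The alignment and model-building steps can be sketched as command lines assembled in Python. This is an illustrative sketch only: it assumes the MUSCLE 3.x command-line syntax (`-in`/`-out`) and the standard HMMER 3 usage `hmmbuild <hmmfile_out> <msafile>`; the file and directory names are hypothetical, and the manual curation of each MSA is not represented.

```python
from pathlib import Path
from typing import List


def muscle_command(fasta: Path, aligned: Path) -> List[str]:
    """Command for a default-parameter MSA (assumes MUSCLE 3.x syntax)."""
    return ["muscle", "-in", str(fasta), "-out", str(aligned)]


def hmmbuild_command(aligned: Path, hmm: Path) -> List[str]:
    """Command for building a profile-HMM from an MSA with HMMER 3 defaults."""
    return ["hmmbuild", str(hmm), str(aligned)]


def build_plan(cluster_fasta: Path, out_dir: Path) -> List[List[str]]:
    """Return the two commands needed to turn one cluster FASTA into a profile-HMM."""
    aligned = out_dir / f"{cluster_fasta.stem}.afa"
    hmm = out_dir / f"{cluster_fasta.stem}.hmm"
    return [muscle_command(cluster_fasta, aligned), hmmbuild_command(aligned, hmm)]


# Hypothetical paths; nothing is executed here.
plan = build_plan(Path("clusters") / "cluster_001.fasta", Path("models"))
for command in plan:
    # To run a step: subprocess.run(command, check=True)
    print(" ".join(command))
```

Keeping the commands as argument lists, rather than shell strings, makes it straightforward to insert the curation step between alignment and model building.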
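The best-model rule described above can be sketched as follows. The sketch assumes that "the same region" means any overlap between two alignments on the query protein, and it resolves overlaps greedily from the highest score downwards; the `Hit` record and the model names are hypothetical and do not reflect the output format of BacARscan.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Hit:
    model: str     # name of the matching profile-HMM (hypothetical names below)
    start: int     # alignment start on the query, 1-based inclusive
    end: int       # alignment end on the query, inclusive
    evalue: float
    score: float   # alignment (bit) score


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two alignments share at least one query position."""
    return a.start <= b.end and b.start <= a.end


def best_hits(hits: List[Hit], max_evalue: float = 1e-6) -> List[Hit]:
    """Keep hits with e-value below the threshold; per overlapping region keep the top score."""
    significant = sorted(
        (h for h in hits if h.evalue < max_evalue),
        key=lambda h: h.score,
        reverse=True,
    )
    chosen: List[Hit] = []
    for hit in significant:
        if not any(overlaps(hit, kept) for kept in chosen):
            chosen.append(hit)
    return sorted(chosen, key=lambda h: h.start)


hits = [
    Hit("model_A", 10, 280, 1e-80, 310.5),
    Hit("model_B", 15, 275, 1e-60, 250.0),  # same region as model_A, lower score
    Hit("model_C", 300, 400, 1e-12, 55.0),
    Hit("model_D", 300, 420, 1e-3, 20.0),   # fails the e-value threshold
]
selected = best_hits(hits)
print([h.model for h in selected])
```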

Functional annotation of pARGhmm profiles
All HMM models of pARGhmm were annotated with their functional properties using the databases from which the ARG sequences were obtained and the published literature. The information associated with each ARGhmm comprises (a) the class and subclass of antibiotics against which the query proteins/genes impart resistance, (b) the resistance mechanism, (c) the antimicrobial resistance spectrum, (d) AMR protein names and families, and (e) the function of the AMR genes. In addition, the UniProt ID of the proteins from which the annotations were collected is also given (Table S1).

Construction and functional annotation of nucleotide version of ARGhmm (nARGhmm)
To facilitate the use of ARGhmm directly on genomic/metagenomic data, without translating them into protein sequences, we also developed a nucleotide version of ARGhmm, named nARGhmm. The procedure to build nARGhmm was the same as that followed for building pARGhmm. Briefly, for each AR sequence cluster, we constructed two different HMMs, one for protein (pARGhmm) and another for nucleotide (nARGhmm). The complete set of 254 functionally annotated nARGhmms and pARGhmms was named BacARscan. The scheme for building pARGhmm and nARGhmm from a dataset of curated antibiotic resistance protein/gene sequences is shown in Fig. 1.
The performance of nARGhmm was evaluated on a dataset of short sequence reads generated from the back-translation of protein sequences of the positive and negative independent datasets. The length of each read was 100 nucleotides, and 20 random short sequence reads were generated from each ARG sequence. Similar to the pARGhmms, the prediction capability of the nARGhmms was evaluated using precision and F-score.
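The read-generation step can be illustrated with a minimal sketch. It assumes that read start positions are drawn uniformly at random, that reads may overlap or repeat, and that sequences shorter than the read length yield no reads; none of these details is stated in the text, and the function name is hypothetical.

```python
import random
from typing import List, Optional


def sample_reads(
    sequence: str,
    read_length: int = 100,
    n_reads: int = 20,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Draw `n_reads` random substrings of `read_length` nucleotides from `sequence`."""
    if rng is None:
        rng = random.Random()
    if len(sequence) < read_length:
        return []  # assumption: sequences that are too short are skipped
    last_start = len(sequence) - read_length
    starts = [rng.randint(0, last_start) for _ in range(n_reads)]
    return [sequence[s:s + read_length] for s in starts]


# Toy 300-nt sequence; a fixed seed keeps the example reproducible.
gene = "ACGT" * 75
reads = sample_reads(gene, rng=random.Random(0))
print(len(reads), len(reads[0]))
```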

Dataset-I: Evaluation dataset
The efficiency of BacARscan was evaluated using a dataset consisting of two different categories of sequences, namely positive and negative. The negative sequences were obtained from the protein clusters formed by BLASTClust that had < 5 sequences. Since these proteins were not used to create the profile-HMM models, they were designated as the negative dataset. The sequences that were used to build the pARGhmm, i.e. the protein clusters that contained ≥ 5 sequences, were designated as the positive dataset.

Dataset-II: Independent dataset
A. To test the performance of BacARscan vis-à-vis other methods of ARG annotation, we also carried out a comparative performance evaluation on an independent dataset obtained from NCBI BioProject. This dataset contained gut microbiomes of 13 healthy individuals of various ages and genders, including an unweaned infant (BioProject Accession: PRJNA28117) (Taxonomy ID 408170) 34. The dataset has a total of 167,118 genes, which were translated into protein sequences.
B. For benchmarking purposes, we used two independent datasets to evaluate the prediction capability of BacARscan. 1) Penicillin-binding proteins (PBPs) or DD-peptidase proteins, which were retrieved from the UniProtKB database on 15-02-21 using a keyword search (penicillin binding proteins). We found 60 reviewed, non-fragmented PBPs in the UniProtKB database whose existence was established at the protein level. 2) Non-antibiotic-resistance bacterial efflux (non-ARE) proteins, which were collected from our earlier work (Reference). The final independent benchmark datasets consisted of 60 PBP and 389 non-ARE protein sequences respectively.

Evaluation parameters
During evaluation, only hits with e-values lower than 1e-6 were considered significant. The e-value cut-off of 1e-6 was determined on the basis of an independent evaluation of ARGhmm. At e-value 1e-6 we found maximum precision, recall and F-score (details in Case Study-I: Comparative evaluation of prediction efficiency of pARGhmm). To evaluate the efficiency of both nARGhmm and pARGhmm, we classified the prediction results into four categories, namely true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). TP and TN were the observations that were correctly predicted, while the predictions categorized as FP and FN were wrong. When an ARG sequence was predicted to belong to its actual class, the prediction was classified as TP. Similarly, when a non-ARG was predicted as non-ARG, it was classified as TN. When an ARG was predicted as non-ARG, the result was labeled FN, and when a non-ARG was predicted as ARG, the result was labeled FP. Using this classification, we calculated precision, recall and F-score to measure the prediction capability of BacARscan.
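The three measures follow their standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN), and F-score = 2 × precision × recall / (precision + recall). A minimal sketch is given below; the counts in the example are hypothetical and are not taken from Table 1.

```python
def precision(tp: int, fp: int) -> float:
    """Fraction of predicted ARGs that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of true ARGs that are recovered."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical counts, for illustration only.
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)
f = f_score(p, r)
print(f"precision={p:.3f} recall={r:.3f} F-score={f:.3f}")
```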

Results & Discussion
Performance assessment of pARGhmm
To assess the performance of pARGhmm, we used an evaluation dataset that consisted of antibiotic-resistant (positive dataset) and non-antibiotic-resistant proteins (negative dataset). When only the first hit of the pARGhmm search result, ranked by e-value, was used to evaluate the prediction capability, 228 true positive and 26 false positive predictions were found (Table 1). We also evaluated the performance of pARGhmm using a majority approach, in which multiple top-ranked search results were included and the final outcome was decided by consensus. For example, during assessment on the basis of the top three search results, each hit among the top three that belonged to the class of the query pARGhmm was categorized as a true positive, and each hit that did not was categorized as a false positive. If the majority of the search results (two out of three) belonged to the class of the query pARGhmm, the overall prediction was categorized as a true positive. With the majority approach, the maximum performance was achieved when the top five hits were used to evaluate the performance. The precision and F-measure were 92.12% and 95.90% respectively. Overall, the assessment showed that the prediction performance increased until the decision was made on the basis of five hits; afterwards the performance started decreasing. The consistency in search efficiency of each ARGhmm was evaluated using the leave-one-out cross validation (LOOCV) approach. During LOOCV, one sequence was excluded from the process of model building and then the performance of the model was tested against a dataset containing (a) the excluded sequence, (b) the sequences from which the remaining 253 HMMs were built and (c) non-antibiotic resistance sequences that were not used for making the ARGhmms. We found near perfect consistency in the performance of the majority of HMMs (Figure S1).
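The majority approach can be sketched as a simple vote over the top k hits. The sketch assumes a strict majority (more than half of the hits considered), so ties are not counted as true positives; the tie rule, the function name and the class labels in the example are assumptions made for illustration.

```python
from typing import Sequence


def majority_call(top_hit_classes: Sequence[str], query_class: str, k: int = 5) -> bool:
    """True if a strict majority of the top-k hits belong to the class of the query."""
    considered = list(top_hit_classes[:k])
    if not considered:
        return False
    matches = sum(1 for hit_class in considered if hit_class == query_class)
    return matches > len(considered) / 2


# Two of the top three hits agree with the query class, so the call is a true positive.
ranked_classes = ["beta-lactamase", "beta-lactamase", "efflux", "efflux", "efflux"]
top3_call = majority_call(ranked_classes, "beta-lactamase", k=3)
top5_call = majority_call(ranked_classes, "beta-lactamase", k=5)
print(top3_call, top5_call)
```

The example also shows why the outcome can change with k: the same ranked list gives a different consensus when the top five hits are considered instead of the top three.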

Comparative evaluation of prediction efficiency of nucleotide and protein modules of BacARscan (p & nARGhmm)
The performance of nARGhmm was evaluated using a dataset of short-read sequences of 100 nucleotides in length. The short-read sequence library was generated from the gene sequences constituting the positive and negative evaluation datasets. When only the first hit was considered for evaluation, the precision and F-measure values were 90.94% and 95.25% respectively (Table 1). Similar to pARGhmm, the performance of nARGhmm was also evaluated using the majority approach. The best performance with the majority approach was achieved when the top five hits were used to quantify the performance. The comparative evaluation of pARGhmm and nARGhmm also indicated comparable performance in terms of both precision and F-measure. Based on the above findings, we conclude that BacARscan is an efficient tool to find ARGs in both protein and gene datasets.
Case Study-I: Comparative evaluation of prediction efficiency of pARGhmm on metagenomic data
The prediction efficiency of pARGhmm vis-à-vis other annotation tools was evaluated on an independent dataset from a functional metagenomic study of the gut microbiome of 13 healthy humans 34. BacARscan identified 235 types of ARG in it (Table 2). Among the other methods, a BLAST search against ARDB found 155 types of ARG (Table 2), Resfams revealed 166 different types of ARG, while the Resistance Gene Identifier (RGI) tool of CARD found 157 types of ARG. Comparison on the basis of resistance mechanism showed that in four mechanism classes, namely acetyltransferase, efflux, beta-lactamase and antibiotic target alteration, BacARscan predicted more ARGs than Resfams. In 'gene modulating resistance' and 'glycopeptide resistance' the number of distinct profiles was higher in Resfams. The results showed that the performance of BacARscan was better than that of Resfams, which is also an HMM-based ARG annotation tool. Moreover, in BacARscan the antibiotic-resistance-mechanism categories have been further sub-classified so that it can deliver the maximum available information to the user (Figs. 2 and 3). Our results indicated that BacARscan not only performed better but also provided more information than other ARG annotation tools.

Case Study-II: Annotation of antibiotic resistance genes in ESKAPE pathogens
The results of ARG annotation in all five ESKAPE pathogens revealed a predominance of efflux-based ARGs, followed by beta-lactamase or acetyltransferase based ARGs. The diversity of ARGs in ESKAPE pathogens discerned by BacARscan was similar to that reported in other comparable work 37. Though a slight variation was found in the number of ARGs among different strains of each ESKAPE pathogen, the distribution of the different ARG classes was similar.
We also reassessed the resistome prediction ability of other ARG annotation tools vis-à-vis BacARscan using the ESKAPE pathogen data. Resfams annotations revealed that the majority of ARGs present in the ESKAPE pathogens belonged to the efflux category (Fig. 4). Although both Resfams and BacARscan discerned a similar pattern of ARG distribution, BacARscan found a significantly greater diversity of ARGs encoding efflux proteins than Resfams. On the basis of homology to CARD database proteins, the output of RGI-CARD was divided into three categories, namely perfect, strict and loose. In the present work, only ARGs of the perfect and strict categories were included for comparison. Similar to BacARscan and Resfams, as per RGI-CARD the majority of ARGs belonged to the efflux category. However, the number of ARGs predicted by RGI-CARD and ARDB was much lower than that predicted by BacARscan and Resfams (Table S4).

Case Study-III: Comparison of BacARscan and UniProtKB ARG annotations
To validate the ARG annotations made by BacARscan, we used Acinetobacter baumannii strain 6200 (Ab6200) [Genome Assembly ID: 216230] as a model. We compared the ARG annotations of Ab6200 in UniProtKB with the BacARscan annotations. Our results showed that BacARscan annotated many proteins of Ab6200 as related to antibiotic resistance that were absent from the list of antibiotic resistance proteins annotated by UniProtKB (Table S3). This indicated that BacARscan is capable of identifying additional, unannotated antimicrobial resistance factors in microbial pathogens; however, the novel ARGs predicted by BacARscan need experimental verification.
Case Study-IV: Prediction of antibiotic resistance genes in clinical genomic samples
We performed an ARG search using nARGhmm against short-read sequences of metagenomic samples that were obtained from bile, gut and saliva samples of six human patients (named P1 to P6) suffering from acute cholecystitis (25). The original metagenomic analysis was performed by Kujiraoka et al. 35 to discern the imbalance in the microbiota of patients suffering from acute cholecystitis. As shown in Fig. 5, we found a large number of ARGs in all three samples of patient P1. Interestingly, we did not find any ARG in patient P5 and found only two ARGs in P6. Overall, the number of ARGs was highest in the gut microbiome, followed by the bile and saliva microbiomes. This may be due to the high microbial load in the gut sample, as was also observed in the original study. We also noticed that efflux was the most abundant class of ARG across the microbiomes. The fact that BacARscan did not identify ARGs in the bile microbiomes of patients P5 and P6 indicates its accuracy of prediction and its potential to complement and/or supplement WGS-based identification of ARGs in clinical samples. This observation was in line with the original study, which reported very few bacterial reads in the bile microbiome of patients P5 and P6 35. The results showed that BacARscan could discern the ARG diversity and the corresponding mechanism of antibiotic resistance in clinical metagenomic studies as well. We could not compare our results with any other ARG annotation method, since none of the existing tools provides a platform to scan raw, non-assembled reads.

Comparison with other existing ARG annotation resources
Among the several ARG annotation resources, ARDB, CARD and Resfams are the most popular. Hence, we also compared the performance of BacARscan vis-à-vis ARDB, CARD and Resfams. ARDB 9 was the first in-silico resource that provided a centralized repository to facilitate characterization of ARGs. It kick-started the annotation of ARG sequence data obtained from high-throughput sequencing using BLAST. The Comprehensive Antibiotic Resistance Database (CARD) is the most up-to-date repository of ARGs 12,38. It identifies and annotates ARGs using either BLAST or the Resistance Gene Identifier (RGI) module. The BLAST option performs a sequence similarity search against the CARD reference sequences. RGI predicts ARG(s) based on homology and SNP models. Resfams is a database of protein families that exhibit antibiotic resistance 13. It sought to improve on the simple sequence alignment based methods by adding position-specific information to make local alignments. In Resfams, each ARG is represented as an HMM model.
ARDB and CARD have an integrated BLAST program to infer the biological roles of proteins based on their similarity to experimentally characterized proteins. Simple linear sequence alignment based methods are very effective for comparing sequences with a high degree of similarity (60 percent or higher) but fail in the case of highly divergent and distantly related proteins 16,17. Also, the last update of ARDB was done in 2009, which presently makes it less attractive than other highly curated AR databases like CARD or Resfams. In BacARscan, improvements were made by (a) adding position-specific information to each alignment, (b) measuring sequence similarity against curated domain assignments and (c) including probabilistic models, namely profile Hidden Markov Models, which calculate a sequence profile on the basis of the frequency of occurrence of a given amino acid at each position of a protein sequence. Both Resfams and BacARscan use profile-based searching, which in comparison to BLAST offers better sensitivity and specificity, is faster and more accurate, and has a better ability to detect highly diverged homologs owing to the strength of its underlying probability models. Hence, BLAST-based search is more suitable, quick and easy for smaller datasets, while BacARscan is inherently more apt to handle the big datasets generated by metagenomic or transcriptomic studies.
Though both BacARscan and Resfams are based on HMM models, BacARscan has several advantages over Resfams: (a) BacARscan is available both as a web server and as a standalone tool; (b) the number of profile-HMM models in BacARscan is 254, which is nearly 50% more than the 170 profile-HMM models in Resfams; (c) BacARscan can be used on both protein and nucleotide sequence data with comparable performance efficiency; (d) the nucleotide version of BacARscan can be used on a fully assembled genome as well as on a short-read library without any need for assembly of reads; (e) BacARscan has significantly more ARGhmms representing each antimicrobial resistance mechanism than Resfams (Fig. 6). Further, we have also categorized the BacARscan ARGhmms on the basis of their target antibiotics (Fig. 7).
Benchmarking on homologous non-antibiotic resistance conferring proteins
In order to assess the capability of BacARscan to correctly handle proteins/sub-families that do not provide antibiotic resistance but whose homologous proteins/sub-families can do so, we constructed an independent dataset comprising penicillin-binding proteins (PBPs) and non-antibiotic-resistance efflux proteins (non-ARE). Both PBPs and β-lactamases belong to the superfamily of serine penicillin-recognizing enzymes and have similar conserved protein folds 39,40. It is pertinent to mention here that no PBP or non-ARE proteins were used to construct the HMM models. Similar to previous studies [11][12][13], during the benchmarking we adopted the best-hit approach with an e-value cutoff of 1e-20. When we scanned this dataset using BacARscan, six PBPs were annotated as β-lactamases while 23 non-ARE proteins were annotated as efflux-based ARGs. Overall, the results showed that BacARscan could discriminate between ARG and non-ARG homologous proteins/sub-families.
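The counts reported above can be turned into error rates with simple arithmetic. The sketch below assumes that all 60 PBPs and all 389 non-ARE proteins are true negatives, so that every ARG annotation on this dataset counts as a false positive; the variable and function names are illustrative.

```python
def rate(count: int, total: int) -> float:
    """Fraction of `total` sequences represented by `count`."""
    return count / total


# Counts taken from the benchmarking paragraph above.
pbp_total, pbp_false_pos = 60, 6
non_are_total, non_are_false_pos = 389, 23

pbp_fpr = rate(pbp_false_pos, pbp_total)
non_are_fpr = rate(non_are_false_pos, non_are_total)
overall_fpr = rate(pbp_false_pos + non_are_false_pos, pbp_total + non_are_total)
specificity = 1.0 - overall_fpr

print(f"PBP false-positive rate:     {pbp_fpr:.1%}")
print(f"non-ARE false-positive rate: {non_are_fpr:.1%}")
print(f"overall false-positive rate: {overall_fpr:.1%}")
print(f"overall specificity:         {specificity:.1%}")
```

Under this assumption, 420 of the 449 homologous non-ARG proteins are left unannotated at the 1e-20 cutoff.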

Description of the BacARscan webserver
To provide the BacARscan prediction module to the scientific community, we also established a webserver to which gene/protein sequences can be uploaded for annotation. The output of BacARscan contains the complete annotation of the query sequence. A maximum of ten sequences can be processed by the BacARscan webserver in one go. For whole genome/metagenome/proteome scale annotation, the standalone version is required; it is provided in the download section of the webserver and in the GitHub repository.

Potential use of BacARscan
Most traditional antibiotic resistance determination methods are based on laboratory tests, but the time and resources required to generate these data sometimes make them unsuitable for quick monitoring. The development of high-throughput sequencing methods has opened up a quick and cost-effective alternative for antibiotic resistance determination. BacARscan is an effort to provide an additional way to identify ARGs in an -omics (proteomics/genomics and metagenomic) sample. BacARscan can also be combined with traditional surveillance and can thus complement the traditional methods of ARG annotation. The current version of BacARscan supports prediction using only 254 ARG families, but in future we will extend it to new ARGs or families. We hope that BacARscan will help in the prediction of ARGs and in the progress of studies related to AMR.

Conclusions & Further Developments
In the present study, we have described a novel antibiotic resistance analysis tool named BacARscan. The proposed tool can be used to monitor and annotate microbial ARGs in both genomic and proteomic datasets. Benchmarking against other widely used online ARG annotation tools like ARDB and Resfams showed that BacARscan was able to identify more ARGs than these tools. One of the most notable improvements of BacARscan over other ARG annotation methods is its ability to work on genomes and short-read sequence libraries with equal efficiency and without any requirement for assembly of short reads. The analysis of ESKAPE pathogen proteomes showed that BacARscan could predict ARGs and provide valuable information about the mechanism of antibiotic resistance thereof. Analysis of a clinical metagenomic dataset by BacARscan suggests its potential as a complementary and/or supplementary adjunct to the traditional method of culture-dependent analysis of ARGs. Benchmarking with non-AR proteins revealed that BacARscan has the ability to differentiate between ARGs and non-ARGs. We anticipate that BacARscan will be a valuable tool for the monitoring, characterization, identification and surveillance of ARGs in bacterial communities at an early stage of infection/outbreak. Also, BacARscan might be helpful for biologists in the estimation and annotation of the abundance of resistance gene(s) and their diversity in agricultural, environmental and animal-associated metagenomic samples and microbial communities. We believe that BacARscan will help the scientific community in its efforts to control the spread of antibiotic-resistant bacteria in medicinal and clinical environments. Though the performance of BacARscan on different datasets looks promising, we understand that without constant updates it would not remain useful to the scientific community.
The information content of BacARscan will be periodically updated as new classes of antibiotic resistance proteins emerge. It should also be noted that none of the above-mentioned methods screens for AMR conferred via mutation; they focus solely on dedicated resistance genes.

Figure 1 Steps adopted to construct profile-HMM models of BacARscan: An initial set of all AR proteins was curated and clustered. All clusters were compiled and only those with ≥ 5 sequences were used. Multiple sequence alignments of those clusters were performed. The MSAs were then used to create profile HMMs of proteins and nucleotides. Afterward, each pARGhmm and nARGhmm was annotated using different sources.

Declarations
Statistics for each adopted step in the generation of the HMMs are in parentheses.
Comparison of Resfams and BacARscan profile-HMM models on the basis of their resistance mechanism