Genes encoding β-lactamases (BLs) confer resistance to a widely prescribed class of antibiotics, the β-lactams [1]. These β-lactam-degrading enzymes represent a major public health threat due to their ability to inactivate clinically important β-lactams and rapid dissemination of β-lactamase genes among pathogenic microorganisms [2, 3]. Mobilization of BL genes from environmental organisms into pathogens additionally occurs [4], since β-lactams are naturally produced compounds [5–7], and is also relevant for public health. Currently, more than 1800 BL variants have been described [8]. BL variants are classified into four classes based on conserved active-site amino acid motifs (Ambler classification) [9] or multiple classes based on functional characteristics and substrate/inhibitor profiles (Bush-Jacoby-Medeiros classification) [10]. According to the Ambler classification system, BL variants are classified into four molecular classes: A, B, C and D [9]. Among the Ambler classes, classes A, C and D are serine BLs, which hydrolyze β-lactams by forming an acyl enzyme intermediate through an active site serine, whereas class B enzymes are metallo-β-lactamases (MBLs) and utilize at least one active-site zinc ion to hydrolyze their substrates [11].
Assessing presence of β-lactamases and their specific variants with next generation sequencing (NGS) technologies in isolate genomes and metagenomes could help guide treatment selection and enable improved surveillance of BL in both clinical and environmental settings (12, 13, 14). However, identifying short reads carrying BL genes remains challenging due to the frequent sharing of amino acid functional domains and motifs between proteins encoded by target antimicrobial resistance genes (ARGs) like BL genes and non-BL or other (non-ARG) gene sequences [2, 3]. Full-length gene sequences assembled as part of genomes or metagenomes generally represent less of a problem with this respect, although the high sequence divergence often observed between and within BL classes represents another major challenge for identification of short read or full-length gene sequences carrying BLs [8, 12]. For instance, distinct BLs, even of the same class, frequently share less than 40% amino acid identity across their sequences, which is problematic for detecting homologous sequences [13].
Identification of ARG-carrying reads typically relies on sequence similarity searches against curated databases such as ResFinder [14], The Comprehensive Antibiotic Resistance Database (CARD) [15], Antibiotic Resistance Gene Database (ARDB) [16] and UniProt (Universal Protein Resource) databases, or the use of ARGs-OAP and DeepARG tools [17–19]. Due to the continuous release of genomes and metagenomes carrying novel BL variants [5], these curated databases are not updated in a timely fashion. Further, BL sequences are frequently annotated inconsistently or erroneously in the public domains [20], meaning that sequences remain unassigned to a family or class, or sequences that are related but not BLs such as lactam-binding (but not hydrolyzing) sequences are identified as BLs. For these reasons, as well as the technical challenges mentioned above related to shared amino acid domains and motifs, the tools and databases available are far from ideal in detecting BL encoding genes from metagenomic samples, especially for those sequences carried by short reads [17–19]. Hence, a reliable approach to detect BLs including novel and divergent variants in short read (unassembled) metagenomic data sets is needed. It is important to note that these issues are also relevant for amplicon sequencing data sets such as those that result from PCR assays with broad-specificity BL primers. In particular, it remains challenging to distinguish between BL sequences (specific priming) and non-BL sequences produced in such amplicon data sets due to non-specific priming [21, 22]. Regions of high sequence identity also often make it difficult to distinguish accurately between multiple BLs in both amplicon and shotgun short read metagenome data [23, 24].
To address the technical limitations in detecting BLs in short-read metagenomic shotgun or amplicon sequences, we built ROCker models for BLs and evaluated them with mock 150 bp and 250 bp read data sets of known composition. Instead of relying on fixed e-value thresholds for the whole alignment, as is typically the case in similarity searches for detecting ARG-carrying sequences which can return an unknown number of false positives and negatives, ROCker identifies position-specific, most-discriminant bit score thresholds in sliding windows along the sequence of the target protein sequence (e.g., BL) using the receiver operating characteristic (ROC) curve. Hence, ROCker can account for non-discriminative domains shared by unrelated proteins [25]. We have previously reported that ROCker often shows more than a 50-fold lower false discovery rate (FDR) when compared to the common practice of using fixed e-values or Hidden Markov Model (HMM)-based searches for genes of the microbial nitrogen cycle [25]. Here we show that our ROCker models can reliably detect and type BLs in short-read metagenomes with similar difference in FDR and a high accuracy (F1 score) compared to alternative approaches (F1 = ~ 1 vs 0.02–0.86). Specifically, we built ROCker models for all BL classes, i.e., classes A, B, C and D of the Ambler classification system as an example of the between-class resolution that ROCker can provide. We also built ROCker models for TEM (Temoniera β-lactamases, in class A), as an example of the within-class resolution of a highly conserved family of BLs that ROCker models can provide. To avoid repetition of our methodology and results, we present the results for two representative ROCker models, class D BLs (oxacillinase β-lactamases, OXA) and TEM, in the main text; we have provided the remaining models in the supplementary material.