Neisseria gonorrhoeae, the bacterial pathogen that causes gonorrhea, is one of the most common sexually transmitted infections, with an estimated 82.4 million new infections globally in 2020.1 Antimicrobial resistance (AMR) in N. gonorrhoeae is increasing and is considered an urgent global health issue.6 Currently, ceftriaxone is the last remaining empiric treatment option for gonorrhea, and ceftriaxone monotherapy is recommended in many settings, including the USA and UK.2,3
Next-generation sequencing has revolutionized our understanding of bacterial pathogens, in general, and has advanced our knowledge of N. gonorrhoeae, in particular, including the determination of AMR mutations, investigating outbreaks, and surveillance.7–9 The establishment of large databases of genomic sequence data provide a rich resource for public health professionals and researchers seeking to understand trends of N. gonorrhoeae on a global scale. One such database is PathogenWatch (https://pathogen.watch/), a publicly available database that combines sequence data with metadata, including phenotypic AMR data, that can be used not only for surveillance purposes, but also for the development of molecular assays to improve diagnosis and treatment.10,11
Using bacterial genomic data, prior research has used machine learning (ML) algorithms to predict AMR in various pathogens. For example, Nguyen et al. used extreme gradient boosting to predict MIC values for Nontyphoidal Salmonella species against multiple antibiotics.12 An analysis of different ML models for predicting AMR in different combinations of bacterial species and antibiotics, one of which included N. gonorrhoeae with ciprofloxacin and azithromycin, found performance varied depending on the species, resistance metrics, and antibiotic drugs, highlighting the complexity of developing clinically applicable ML models.13 When training an ML model, two types of input data can be used with genomic data, namely k-mer-based and reference-based.14 A k-mer method has an advantage when the clinical reference is not set and if pathogens have complex AMR mechanisms. On the other hand, the reference-based method incorporates well-established prior knowledge such as certain mutations in AMR genes.
N. gonorrhoeae is a pathogen that has been studied extensively, and many mutations are known to be associated with resistance to ceftriaxone.4,5 Recently, Demczuk et al. used multivariate regression to create an equation for predicting minimum inhibitory concentration (MIC) values of N. gonorrhoeae using a number of antibiotics, including ceftriaxone, within a dataset of Canadian isolates.12,15 However, machine learning approaches using genotypic data to predict AMR to ceftriaxone have not yet been accomplished. The objectives of this study were to use the global PathogenWatch database to develop, evaluate, and compare several different ML algorithms that use genotypic data to predict susceptibility/decreased-susceptibility of N. gonorrhoeae to ceftriaxone.
In total, there were 12,936 genome sequences extracted from PathogenWatch and 9,540 sequences with MIC data included in the ML analyses (Fig. 1a). Among those in the ML analyses, most N. gonorrhoeae sequences were from the USA and other high-income countries (Fig. 1b, Extended data Fig. 1). In total, 368 (0.04%) strains were associated with ceftriaxone MICs >0.064 mg/L, and classified as decreased-susceptible. Synthetic minority oversampling technique (SMOTE) was used to generate additional data for 8,804 DS synthetic sequences, resulting in 18,344 total sequences (9,171 susceptible + 368 decreased-susceptible + 8,804 decreased-susceptibleSMOTE). (Extended data Fig. 2)
We evaluated seven different ML algorithms on 18,344 real and synthetic ceftriaxone S/DS N. gonorrhoeae sequences. The AUC and bAcc scores for each model are depicted in Fig. 2a and b, while other score measures such as precision (0.723-0.903), recall (0.611-0.976), accuracy (0.755-0.926), and F-1 score (0.713-0.928) are listed in Table 1. The RFC model had the highest performance out of the seven ML models, having the top scores in precision (0.903), accuracy (0.926), F-1 score (0.928), AUC (0.966), and bAcc (0.926). Thus, the RFC model was selected for the prediction of ceftriaxone S/DS and further analysis.
To tease apart the RFC model, the SHAP values were calculated for each of the 97 SNPs initially used to train the model in order to measure the impact of each feature in the ML model (Fig. 2c). The RFC model was retrained with a different combination of these top SNPs. When the top 20 SNPs were used, a small decrease in AUC (0.966 to 0.941) and balanced accuracy (0.926 to 0.882) were observed compared to the model including all 97 SNPs. The models showed similar scores even when using only the top 5 SNPs: penA-501, penB-120, ponA-421, penB-121, and penA-545 (AUC: 0.915, and bACC: 0.875). (Fig. 4d-e; Table 2)