We obtained antibiotic prescribing indication data for 826,533 prescriptions from 171,460 adult inpatients (≥ 16 years) between 01-October-2014 and 30-June-2021 at three hospitals in Oxford, UK. The most commonly prescribed antibiotics were co-amoxiclav (n = 269,945, 33%), gentamicin (n = 70,002, 8%), and metronidazole (n = 65,094, 8%) (Appendix S5), and the most common specialities were General Surgery (n = 146,719, 18%), Acute General Medicine (n = 98,687, 12%), and Trauma and Orthopaedics (n = 90,719, 11%) (Appendix S6). Patients had a median age of 56 years (IQR 36–73), and 94,721 (55%) were female.
To assess classifier performance further, we also used an independent external test dataset from the Horton Hospital, Banbury (~ 30 miles from Oxford). This dataset comprised 111,617 prescriptions from 25,924 patients between 01-December-2014 and 30-June-2021, with 13,650 unique free-text indications. The antibiotics prescribed (Appendix S5) and specialities (Appendix S6) were broadly similar to the Oxford training set. Patients had a median age of 67 years (IQR 47–80), and 13,853 (53%) were female.
Prescription indications
From the 826,533 Oxford prescriptions, 86,611 unique free-text indications were recorded. The top 10 accounted for 41% of all prescriptions; these included “Perioperative Prophylaxis” (20%), “UTI” (4%), “LRTI” (3%), “Sepsis” (3%), and “CAP” (3%). The most commonly occurring 4000 unique indications, used for model training, accounted for 84% (692,310) of prescriptions (Appendix S7).
As expected, different wording was used to express similar concepts, e.g. “CAP [community acquired pneumonia]”, “LRTI [lower respiratory tract infection]”, “chest infection”, and “pneumonia”. Misspellings were also common, e.g. “infction”, “c. dififcile”. Many indications expressed uncertainty or multiple potential sources of infection, e.g. “sepsis ?source”, “UTI/Chest”. Reflecting the complexity of prescribing, there were many potentially informative but rarely occurring indications, e.g. “transplant pyelonephritis”, “Ludwig’s angina”, and “deep neck infection”, which were seen only 51 (< 1%), 27 (< 1%), and 13 (< 1%) times respectively (Appendix S8).
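To illustrate why exact keyword matching struggles with this variability, a minimal sketch of a dictionary/regex matcher is shown below; the keyword patterns are hypothetical examples for illustration only, not the rules used in the study.

```python
import re

# Hypothetical keyword patterns for two source categories (illustration only,
# not the study's actual regular expressions).
PATTERNS = {
    "Respiratory": re.compile(r"\b(cap|lrti|hap|pneumonia|chest)\b", re.IGNORECASE),
    "Urinary": re.compile(r"\b(uti|pyelonephritis|urosepsis)\b", re.IGNORECASE),
}

def match_sources(indication: str) -> list[str]:
    """Return every category whose keyword pattern matches the raw free text."""
    return [source for source, pattern in PATTERNS.items() if pattern.search(indication)]

for text in ["CAP", "UTI/Chest", "infction", "sepsis ?source"]:
    print(text, "->", match_sources(text) or ["no match"])

# Synonyms covered by the keyword list are matched ("CAP", "UTI/Chest"), but the
# misspelling "infction" and the uncertain "sepsis ?source" return no match -
# the kind of gap the trained classifiers described below are intended to close.
```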
‘Ground truth’ labels
Following labelling by clinical experts, the 4000 most commonly occurring free-text indications were classified into 11 categories, with a separate variable capturing the presence of uncertainty. The most commonly assigned sources were “Prophylaxis” (267,788/692,310 prescriptions, 39%), “Respiratory” (125,744, 18%) and “Abdominal” (61,670, 9%); 50% (n = 344,773) of prescriptions had “No Specific Source”. Uncertainty was expressed most often in “Neurological” and “ENT” cases, at 38% and 33% respectively (Fig. 1A). Although “Respiratory” was the most common category overall after “Prophylaxis”, there were more distinct text strings associated with “Abdominal” infections, and “Skin and Soft Tissue” infection also had a disproportionately large number of unique text strings (Fig. 1B). Most ‘multi-source’ prescriptions (> 90%) were a combination of “Prophylaxis” and a specific source. Excluding prophylaxis, the most common combinations of sources were “No Specific Source” with “Not Informative”, “Urinary” with “Respiratory”, and “Skin and Soft Tissue” with “ENT”, in 1.6%, 0.58%, and 0.41% of prescriptions, respectively (Fig. 1C-D). The first two combinations reflected diagnostic uncertainty; the last reflected infections of the face, head and neck, which frequently involve skin/soft tissue.
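In practical terms, each labelled indication can be represented as a multi-label binary vector over the 11 source categories plus a separate uncertainty flag. The sketch below illustrates one possible encoding; the category ordering and helper names are assumptions for illustration, not the study's code.

```python
import numpy as np

# The 11 source categories referred to in the text; the ordering here is
# arbitrary and used only for illustration.
CATEGORIES = [
    "Prophylaxis", "Respiratory", "Abdominal", "Urinary", "Skin and Soft Tissue",
    "ENT", "Orthopaedic", "Neurological", "Other Specific", "No Specific Source",
    "Not Informative",
]

def encode_label(sources: list[str], uncertain: bool) -> np.ndarray:
    """Encode one indication as 11 binary source flags plus 1 uncertainty flag."""
    vector = np.zeros(len(CATEGORIES) + 1, dtype=np.float32)
    for source in sources:
        vector[CATEGORIES.index(source)] = 1.0
    vector[-1] = float(uncertain)
    return vector

# e.g. "sepsis ?source": no specific source identified, uncertainty expressed
print(encode_label(["No Specific Source"], uncertain=True))
# e.g. "UTI/Chest": a multi-source indication spanning two categories
print(encode_label(["Urinary", "Respiratory"], uncertain=False))
```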
Classifier performance
We trained classifiers using the labelled training data from the three Oxford hospitals (Fig. 2). Compared with clinician-assigned labels in the internal Oxford test dataset (n = 2000), the weighted-average F1 score across classes was highest using Bio + Clinical BERT (average F1 = 0.97 [worst-performing class F1 = 0.84, best-performing F1 = 0.98]), followed by fine-tuned GPT3.5 (F1 = 0.95 [0.77–0.99]), base BERT (F1 = 0.93 [0.23–0.98]) and tokenisation + XGBoost (F1 = 0.86 [0.64–0.96]). Nearly all approaches exceeded traditional regular expression-based matching (F1 = 0.71 [0.00–0.93]). The few-shot GPT4 model, which did not require labelled data, performed similarly to this baseline (F1 = 0.71 [0.30–0.98]). Similar performance characteristics were achieved on the external validation dataset from Banbury (n = 2000; weighted-average F1 scores: Bio + Clinical BERT 0.98 [0.87–1.00], fine-tuned GPT3.5 0.97 [0.70–1.00], base BERT 0.97 [0.63–0.99], XGBoost 0.84 [0.63–1.00], regex 0.74 [0.00–0.96], GPT4 0.86 [0.25–1.00]) (Table 1, which also shows classification run times).
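For context, the sketch below outlines how a Bio + Clinical BERT model can be fine-tuned for this multi-label task with the Hugging Face transformers library; the checkpoint name, hyperparameters and 12-dimensional label vectors (11 sources plus uncertainty) are illustrative assumptions rather than the exact configuration used in the study.

```python
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

CHECKPOINT = "emilyalsentzer/Bio_ClinicalBERT"  # assumed public Bio+Clinical BERT checkpoint
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(
    CHECKPOINT, num_labels=12, problem_type="multi_label_classification")

class IndicationDataset(torch.utils.data.Dataset):
    """Pairs tokenised free-text indications with 12-dimensional binary label vectors."""
    def __init__(self, texts, labels):
        self.encodings = tokenizer(texts, truncation=True, padding=True, max_length=32)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx], dtype=torch.float)
        return item

def finetune(train_texts, train_labels):
    # Multi-label classification applies a sigmoid per label (BCE loss), so each
    # source and the uncertainty flag are scored independently at prediction time.
    args = TrainingArguments(output_dir="bert_indications", num_train_epochs=3,
                             per_device_train_batch_size=32, learning_rate=2e-5)
    trainer = Trainer(model=model, args=args,
                      train_dataset=IndicationDataset(train_texts, train_labels))
    trainer.train()
    return trainer
```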
Table 1
Model performance metrics for the internal (Oxford) and external (Banbury) test sets. Each score is listed as the weighted average across the classes (sources), together with the lowest- and highest-performing class in parentheses; ‘–’ indicates not available. Overall Accuracy refers to the score calculated with each sample’s predictions treated as a whole. The fine-tuned Bio + Clinical BERT outperforms all other methods on both internal and external test sets.

Internal Oxford test set
| Model | F1 score | ROC AUC | PR AUC | Per-class accuracy | Accuracy (overall) | Training runtime (per 4k) | Test runtime (per 10k) |
|---|---|---|---|---|---|---|---|
| Regex | 0.71 (0.00–0.93) | – | – | 0.82 (0.32–0.99) | 0.14 | – | 6.4s |
| XGBoost | 0.86 (0.64–0.96) | 0.96 (0.87–0.99) | 0.90 (0.62–0.99) | 0.95 (0.92–1.00) | 0.72 | 6s | 1.2s |
| Base BERT | 0.93 (0.23–0.98) | 0.99 (0.91–1.00) | 0.97 (0.69–0.99) | 0.98 (0.97–0.99) | 0.88 | 282s [1] | 82.2s |
| Bio + Clinical BERT | 0.97 (0.84–0.98) | 0.99 (0.96–1.00) | 0.98 (0.88–1.00) | 0.99 (0.98–1.00) | 0.94 | 279s [2] | 83.1s |
| Fine-tuned OpenAI GPT3.5 | 0.95 (0.77–0.99) | – | – | 0.98 (0.97–1.00) | 0.91 | ~3500s [1] | ~3000s [2] |
| Few-shot OpenAI GPT4 | 0.71 (0.30–0.98) | – | – | 0.87 (0.64–1.00) | 0.50 | – | ~3000s [1] |

External Banbury test set

| Model | F1 score | ROC AUC | PR AUC | Per-class accuracy | Accuracy (overall) |
|---|---|---|---|---|---|
| Regex | 0.74 (0.00–0.96) | – | – | 0.82 (0.41–0.99) | 0.24 |
| XGBoost | 0.84 (0.63–1.00) | 0.94 (0.86–1.00) | 0.87 (0.57–1.00) | 0.94 (0.88–1.00) | 0.68 |
| Base BERT | 0.97 (0.63–0.99) | 0.99 (0.95–1.00) | 0.98 (0.75–1.00) | 0.99 (0.99–1.00) | 0.95 |
| Bio + Clinical BERT | 0.98 (0.87–1.00) | 0.99 (0.97–1.00) | 0.98 (0.87–1.00) | 0.99 (0.99–1.00) | 0.97 |
| Fine-tuned OpenAI GPT3.5 | 0.97 (0.70–1.00) | – | – | 0.99 (0.98–1.00) | 0.95 |
| Few-shot OpenAI GPT4 | 0.86 (0.25–1.00) | – | – | 0.95 (0.81–1.00) | 0.73 |
[1] Using one Nvidia V100 GPU
[2] OpenAI’s cloud service
Classifier Performance by Class
Using the best-performing classifier, Bio + Clinical BERT, we assessed performance within each category. The best-performing categories within our internal test set were “Respiratory”, “No Specific Source” and “Prophylaxis” (F1 score = 0.98), followed by “Urinary” (0.97) and “Abdominal” (0.96); most remaining categories scored 0.88–0.90, including “Not Informative” (0.89) and “Neurological” (0.88). The worst-performing category was “Orthopaedic” (0.84), likely due to the wide variety of terms used and the small number of training samples (n = 14, Appendix S9). Uncertainty was also detected well (0.96) (Fig. 3A, Appendix S10).
In the external test data, scores varied slightly: all source categories except “Not Informative” had F1 scores on average 0.02 higher than in the internal test set. These small differences likely arose from differences in category composition and in the vocabulary shared between the training and test datasets.
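Per-class and weighted-average F1 scores of this kind can be computed directly from the binary label matrices; a brief sketch using scikit-learn follows (array and function names are hypothetical).

```python
import numpy as np
from sklearn.metrics import f1_score

def f1_report(y_true: np.ndarray, y_pred: np.ndarray, names: list[str]) -> dict:
    """Per-class F1 plus the weighted average, for (n_samples, n_labels) binary arrays."""
    per_class = f1_score(y_true, y_pred, average=None, zero_division=0)
    weighted = f1_score(y_true, y_pred, average="weighted", zero_division=0)
    return {"per_class": dict(zip(names, per_class.round(2))),
            "weighted_average": round(float(weighted), 2)}
```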
Misclassifications
Most misclassifications were spread evenly across classes for single indications. The two most common misclassifications occurred for “Orthopaedic” and “Other Specific” cases, with 12% being misclassified as “Prophylaxis” and 8% as “Skin and Soft Tissue”, respectively on the internal test set. On the external test set, most misclassifications were predicted to be “Other Specific” or “Prophylaxis” (Fig. 3C).
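Misclassification patterns like these can be summarised, for indications with a single true source, as a row-normalised confusion matrix; a minimal sketch is given below (assuming lists of single-category true and predicted labels).

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

def misclassification_table(y_true: list[str], y_pred: list[str],
                            names: list[str]) -> pd.DataFrame:
    """Share of each true category predicted as each category (rows sum to 1)."""
    counts = confusion_matrix(y_true, y_pred, labels=names)
    table = pd.DataFrame(counts, index=names, columns=names)
    return table.div(table.sum(axis=1), axis=0).round(2)
```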
Training Dataset Size
We examined the effect of training dataset size on model performance using randomly selected subsets of 250, 500, 750, 1000, 1500, 2000, 3000, and 4000 unique indications, evaluated on both the internal and external test sets. Performance (ROC AUC and F1 scores) increased notably as the training size grew from 250 to 1000 samples, suggesting a minimum of around 1000 labelled indications is needed for adequate performance. However, improvement was limited as the training dataset size rose to 4000, indicating there may be only marginal gains from expanding the training data beyond 4000 samples (Appendix S11).
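A learning-curve analysis of this kind can be sketched as below; the subset sizes come from the text, while `train_fn` and `eval_fn` stand in for the (hypothetical) training and evaluation helpers sketched earlier.

```python
import random

SUBSET_SIZES = [250, 500, 750, 1000, 1500, 2000, 3000, 4000]

def learning_curve(indications, labels, train_fn, eval_fn, seed=42):
    """Train on random subsets of the labelled unique indications and record test-set scores."""
    rng = random.Random(seed)
    indices = list(range(len(indications)))
    scores = {}
    for size in SUBSET_SIZES:
        subset = rng.sample(indices, size)
        model = train_fn([indications[i] for i in subset], [labels[i] for i in subset])
        scores[size] = eval_fn(model)  # e.g. weighted F1 on the internal/external test set
    return scores
```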
Comparing free-text indications to ICD10 codes
We also compared infection sources from manually labelled ‘ground truth’ free-text indications with sources inferred from ICD10 diagnostic codes. 31% of sources classified as “unspecific” using diagnostic codes could be resolved into specific sources using free-text indications. Rarer infection sources such as “CNS” and “ENT” (< 1% and no occurrences in diagnostic codes, respectively) were better represented by sources extracted from free-text (4% each). Overall, where defined, sources listed in clinical codes generally concurred with the ‘ground truth’ free-text sources (Fig. 4).
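This comparison depends on mapping diagnostic codes onto the same source categories; the sketch below uses a deliberately small, hypothetical prefix mapping purely for illustration (the study's full ICD10-to-source mapping is not reproduced here).

```python
# Hypothetical ICD10 prefix examples only; not the study's mapping.
ICD10_PREFIX_TO_SOURCE = {
    "J": "Respiratory",            # e.g. J18 pneumonia, unspecified organism
    "N39": "Urinary",              # e.g. N39.0 urinary tract infection
    "A41": "No Specific Source",   # e.g. A41.9 sepsis without a documented site
}

def sources_from_icd10(codes: list[str]) -> set[str]:
    """Map a patient's diagnostic codes to infection-source categories by longest prefix."""
    sources = set()
    for code in codes:
        for prefix, source in sorted(ICD10_PREFIX_TO_SOURCE.items(),
                                     key=lambda item: len(item[0]), reverse=True):
            if code.startswith(prefix):
                sources.add(source)
                break
    return sources or {"unspecific"}

print(sources_from_icd10(["N39.0", "A41.9"]))  # {'Urinary', 'No Specific Source'}
print(sources_from_icd10(["Z51.1"]))           # {'unspecific'} - no infection-source prefix matched
```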