A total of 341 dogs presented for airway assessment between September 2015 and September 2021 were included in this study (Table 1). Dogs were excluded if they had lower airway disease confirmed by imaging (computed tomography or radiography) or physical examination (thoracic auscultation); were younger than one year of age (the minimum age for the RFG scheme); or had upper airway diseases other than BOAS (e.g. nasal tumour, trauma).
Table 1
Breakdown of breeds in the dataset.
Breed | No. of dogs |
French Bulldog | 170 |
Pug | 81 |
Bulldog | 51 |
Other/unspecified | 39 |
The Department of Veterinary Medicine Ethics and Welfare Committee at the University of Cambridge approved the study and experimental protocol, under ethical review applications CR62, CR63, CR213, and CR215. All dog owners provided informed consent to include their animals in this study.
The data was gathered at the Queen’s Veterinary School Hospital (Cambridge), Animal Health Trust (Newmarket), breed-specific dog shows, and health clinics. Assessment of each dog was performed by a trained veterinarian, following the RFG scheme protocol [10]. Auscultation with a 3M Littmann 3200 electronic stethoscope was performed over the larynx from the side, with the head in a neutral position. Sound recordings were made for up to 30 seconds with a 4 kHz sampling frequency. The assessor then graded the severity of stertor and stridor noises as either inaudible, mild, moderate, or severe. The respiratory pattern (regular or irregular) and level of inspiratory effort were also recorded.
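As a quick worked check of the recording parameters above, the following minimal sketch computes the maximum number of samples per recording and the highest frequency that a 4 kHz sampling rate can represent (the Nyquist limit). The function and constant names are illustrative and do not come from the study.

```python
SAMPLE_RATE_HZ = 4_000   # sampling frequency stated in the text
MAX_DURATION_S = 30      # recordings last up to 30 seconds


def max_samples(sample_rate_hz: int, duration_s: float) -> int:
    """Number of samples in a recording of the given duration."""
    return int(sample_rate_hz * duration_s)


def nyquist_hz(sample_rate_hz: int) -> float:
    """Highest frequency representable at this sampling rate."""
    return sample_rate_hz / 2


print(max_samples(SAMPLE_RATE_HZ, MAX_DURATION_S))  # 120000
print(nyquist_hz(SAMPLE_RATE_HZ))                   # 2000.0
```

A full-length recording therefore holds at most 120,000 samples, and any spectral analysis of the recordings is limited to content below 2 kHz.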
After the initial recording, each dog underwent an exercise test, which involved 3 minutes of movement at a trotting speed of approximately 4–5 miles per hour. Following the test, the respiratory recording and assessment were repeated. The trained veterinarian used the functional grading scheme to assess the overall grading of BOAS from the pre-exercise and post-exercise measurements. Overall, 374 unique encounters were recorded, with a total of 665 individual recordings. A minority of patient encounters (78) occurred after an operation. Study data was collected and managed using REDCap data capture tools hosted at Cambridge University [23, 24].
Table 2 shows the breakdown of the functional grades given to each patient in the dataset. Following the RFG scheme [10], we designate grades 0/1 as “BOAS negative” and grades 2/3 as “BOAS positive”. Dogs classified as “BOAS positive” are deemed to be clinically affected by the disease and should be considered for intervention [10].
Table 2
Breakdown of functional grades in the dataset.
Overall state | Functional grade | No. of encounters |
BOAS negative | 0 (No BOAS) | 47 |
BOAS negative | 1 (Mild BOAS) | 112 |
BOAS positive | 2 (Moderate BOAS) | 157 |
BOAS positive | 3 (Severe BOAS) | 58 |
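The mapping from functional grade to binary label can be sketched as follows. This is a minimal illustration using the counts in Table 2; the names `GRADE_COUNTS`, `boas_status`, and `status_totals` are hypothetical and are not taken from the study's code.

```python
# Encounter counts per functional grade, copied from Table 2.
GRADE_COUNTS = {0: 47, 1: 112, 2: 157, 3: 58}


def boas_status(grade: int) -> str:
    """Map an RFG functional grade (0-3) to the binary label used in the text."""
    if grade not in (0, 1, 2, 3):
        raise ValueError(f"unexpected functional grade: {grade}")
    return "BOAS positive" if grade >= 2 else "BOAS negative"


def status_totals(counts: dict) -> dict:
    """Sum encounter counts within each binary label."""
    totals = {"BOAS negative": 0, "BOAS positive": 0}
    for grade, n in counts.items():
        totals[boas_status(grade)] += n
    return totals


totals = status_totals(GRADE_COUNTS)
print(totals)                 # {'BOAS negative': 159, 'BOAS positive': 215}
print(sum(totals.values()))   # 374
```

Under this mapping, 215 of the 374 encounters (about 57%) are labelled "BOAS positive", so the two classes are only mildly imbalanced.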
Stertor and stridor are the key audible signatures detected with a stethoscope when assigning the RFG scheme grade. Following the scheme, the presence of moderate or severe stertor or stridor results in a “BOAS positive” grade. However, stertor and stridor are not equally prevalent in the dataset. Table 3 shows the number of recordings labelled with each combination of stertor and stridor grades, demonstrating that moderate and severe stridor recordings are far less common than moderate and severe stertor recordings. The table also shows that severe stridor rarely appears without severe stertor. The reverse is not true: moderate or severe stertor regularly appears without any stridor. Particularly in French Bulldogs and Bulldogs, laryngeal collapse, recognisable as stridor, is a consequence of other airway obstruction that produces stertor. Given this clinical reasoning and the make-up of our dataset, neglecting stridor would not have a significant impact on the specificity of our diagnosis as BOAS positive or negative. We therefore focus this study on training a machine learning model to detect the presence of stertor sounds.
Table 3
Count of recordings with paired stertor and stridor labels.
Stertor grade | Stridor: None | Stridor: Mild | Stridor: Moderate | Stridor: Severe |
None | 105 | 7 | 2 | 0 |
Mild | 101 | 35 | 12 | 4 |
Moderate | 191 | 50 | 8 | 3 |
Severe | 19 | 13 | 5 | 10 |
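The marginal counts discussed in the text can be recovered from Table 3 with a short calculation. The sketch below assumes that rows index the stertor grade and columns index the stridor grade, which is the reading consistent with the surrounding discussion; the variable names are illustrative only.

```python
LEVELS = ["None", "Mild", "Moderate", "Severe"]

# Assumed layout: TABLE3[stertor_level][stridor_level], values copied from Table 3.
TABLE3 = [
    [105, 7, 2, 0],    # stertor: None
    [101, 35, 12, 4],  # stertor: Mild
    [191, 50, 8, 3],   # stertor: Moderate
    [19, 13, 5, 10],   # stertor: Severe
]

# Marginal totals per grade for each sound.
stertor_totals = dict(zip(LEVELS, (sum(row) for row in TABLE3)))
stridor_totals = dict(zip(LEVELS, (sum(col) for col in zip(*TABLE3))))

# Moderate or severe stertor recorded with no stridor at all.
stertor_without_stridor = TABLE3[2][0] + TABLE3[3][0]

# Moderate or severe stridor recorded with only absent or mild stertor.
stridor_without_stertor = sum(TABLE3[r][c] for r in (0, 1) for c in (2, 3))

print(stertor_totals)            # {'None': 114, 'Mild': 152, 'Moderate': 252, 'Severe': 47}
print(stridor_totals)            # {'None': 416, 'Mild': 105, 'Moderate': 27, 'Severe': 17}
print(stertor_without_stridor)   # 210
print(stridor_without_stertor)   # 18
```

On this reading, 299 recordings carry moderate or severe stertor against 44 with moderate or severe stridor, and 18 recordings have moderate or severe stridor with only absent or mild stertor.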
Studies with large machine learning datasets typically hold out around 25 to 30% of the data as an unseen test set to evaluate generalisation and real-world performance. However, with a dataset of the size used in this study, a test set of that fraction would be too small to give statistically reliable results. K-fold cross-validation could be used to estimate the performance of classifiers, but studies have shown that this biases the performance estimate because the model's hyperparameters are optimised using the test fold [25]. To produce a more accurate estimate of performance, we apply a nested cross-validation strategy, which significantly reduces the bias in the estimation of classifier performance whilst using the entire dataset [25].
Figure 1 details the nested cross-validation procedure, which aims to remove the potential bias introduced by stratifying data into folds, selecting a test portion, and selecting hyperparameters. We first split the data into 5 approximately equally sized folds by applying a stratified minimisation algorithm that balances key prognostic variables (gender, breed, body condition score, BOAS grade) across all the groups. We then pick one fold and remove it as the test fold. We run a four-fold cross-validation on the remaining data, optimising the model hyperparameters using a random search. The best hyperparameters are used to train a single model on the four training folds that then makes a prediction on the unseen test fold. We repeat this four more times, each time with a new test fold. We then repeat the whole process, starting with a new random stratification. This outer loop is repeated 10 times, giving 50 independent runs with test predictions. Finally, we average the performance across all optimised runs when reporting results.
Figure 1. Nested cross-validation procedure used to train and evaluate models.
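The nested cross-validation loop described above can be sketched in plain Python. This is a minimal illustration under several stated assumptions, not the study's implementation: the data are synthetic, the "model" is a toy one-dimensional threshold classifier whose single hyperparameter is an `offset`, folds are balanced on the class label only (a simple stand-in for the stratified minimisation algorithm, which balances several variables), and the inner four-fold cross-validation reuses the four remaining outer folds. All function names are hypothetical.

```python
import random
from statistics import mean


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds, balancing the class label only."""
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    folds, pos = [[] for _ in range(k)], 0
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for i in idxs:  # deal indices round-robin so fold sizes stay equal
            folds[pos % k].append(i)
            pos += 1
    return folds


def fit(xs, ys, idxs, offset):
    """Toy model: threshold at the midpoint of the class means, plus offset."""
    pos = [xs[i] for i in idxs if ys[i] == 1]
    neg = [xs[i] for i in idxs if ys[i] == 0]
    return (mean(pos) + mean(neg)) / 2 + offset


def score(threshold, xs, ys, idxs):
    """Accuracy of the threshold classifier on the given indices."""
    return mean([1.0 if (xs[i] > threshold) == (ys[i] == 1) else 0.0 for i in idxs])


def nested_cv(xs, ys, outer_k=5, repeats=10, n_search=20, seed=0):
    rng = random.Random(seed)
    test_scores = []
    for _ in range(repeats):                      # new random stratification
        folds = stratified_folds(ys, outer_k, rng)
        for t in range(outer_k):                  # each fold is the test fold once
            inner = [f for j, f in enumerate(folds) if j != t]
            best_offset, best_cv = 0.0, -1.0
            for _ in range(n_search):             # random hyperparameter search
                offset = rng.uniform(-1.0, 1.0)
                cv_scores = []
                for v in range(len(inner)):       # inner four-fold cross-validation
                    train = [i for j, f in enumerate(inner) if j != v for i in f]
                    cv_scores.append(score(fit(xs, ys, train, offset), xs, ys, inner[v]))
                if mean(cv_scores) > best_cv:
                    best_offset, best_cv = offset, mean(cv_scores)
            # Retrain on all four training folds, then score the unseen test fold.
            train = [i for f in inner for i in f]
            test_scores.append(score(fit(xs, ys, train, best_offset), xs, ys, folds[t]))
    return test_scores


def make_toy_data(n=100, seed=1):
    """Synthetic 1-D data: class 1 is shifted upwards relative to class 0."""
    rng = random.Random(seed)
    ys = [i % 2 for i in range(n)]
    xs = [rng.gauss(2.0 * y, 1.0) for y in ys]
    return xs, ys


xs, ys = make_toy_data()
scores = nested_cv(xs, ys)
print(len(scores), round(mean(scores), 3))  # 50 runs and their mean test accuracy
```

The key property is that the test fold is never seen during the hyperparameter search: the search only ever scores candidates on the inner validation folds, so the averaged test score is not inflated by tuning.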