Using machine learning, we developed and validated a simple clinical decision algorithm to diagnose hyperfunctioning parathyroid gland based on three readily available imaging and laboratory variables. Performance of the algorithm for correct binary classification of a hyperfunctioning gland was excellent and outperformed all other variables. Our simple decision tree can be easily implemented into clinical practice to provide an evidence-based probability map of hyperfunctioning and normal glands based on imaging and biochemical profile to guide the surgeon in individualized preoperative planning.
Currently, the paradigm for radiology reports is based on descriptive and subjective verbiage (e.g., “most likely” or “consistent with”) and typically does not convey a degree of diagnostic confidence or probability of disease (18). This leaves the onus on the clinician to interpret the verbiage of the report, in addition to integrating other variables and data, to formulate a diagnostic or treatment decision (18, 19). Additionally, 4D-CT and MIBI imaging are generally interpreted as a binary result (positive or negative); however, patients with PHPT exist on a spectrum of disease, and their imaging and biochemical data should accurately reflect that spectrum for optimal patient care. To address this problem, we developed a clinical decision tree using ML that is highly accurate, evidence-based, and defines five distinct probability categories for a hyperfunctioning gland based on serum calcium and PTH levels and 4D-CT and MIBI imaging, with respective probabilities: (i) 3.8%; (ii) 8.0%; (iii) 19.4%; (iv) 75.0%; (v) 97.0%. Using a separate dataset, the results from the training set were validated with comparable probabilities. The decision tree can be used to guide the choice of parathyroid surgery (MIP vs. BNE). For instance, in a patient with 96% probability for one gland and 3% probability for each of the other three glands, a MIP appears to be the best option. In contrast, in a patient with 24% probability for all four glands, a BNE can be planned with greater confidence.
Using 16 different input variables, the ML signature selected 3 variables for optimal prediction of a hyperfunctioning gland and discarded the other 13. The signature also demonstrated that in glands with any positive imaging, no variable other than imaging was needed for optimal prediction; however, in glands with negative imaging, the biochemical profile (Ca*PTH) was the most important determinant of the probability of a hyperfunctioning gland (a false negative on imaging) or, conversely, of a normal gland. Interestingly, the Ca*PTH cutoff thresholds automatically derived by the ML signature were similar to the original Wisconsin Index (WIN) risk categories proposed by Mazeh et al. (12), with Ca*PTH < 676, 676 < Ca*PTH < 1292, and Ca*PTH > 1292 for our ML algorithm compared to Ca*PTH < 800, 801 < Ca*PTH < 1600, and Ca*PTH > 1600 for WIN. This concordance in Ca*PTH cutoff thresholds derived by separate methodologies suggests that the biochemical profile assessed using Ca*PTH does reflect underlying PHPT disease biology and severity, with a more severe biochemical profile correlating with increasing probability of single gland disease. Indeed, WIN is a component of the composite multigland disease score proposed by Sepahdari et al. (13), a prospectively validated predictive scoring system for likelihood of MGD based on 4D-CT findings and WIN (13, 14). However, Edafe et al. demonstrated that WIN did not accurately differentiate between single and multigland disease in a large cohort of 624 patients (20). In contrast to WIN, our Ca*PTH cutoffs were derived through ML without supervision, and our prediction tree incorporates imaging criteria of 4D-CT and MIBI and may be further adapted using ML to potentially predict multigland disease.
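To make the comparison of the two sets of cutoffs concrete, the following minimal Python sketch stratifies a Ca*PTH product under both schemes. It is an illustration only, not the code used in this study. The function names, the stratum labels ("low", "intermediate", "high"), the units (serum calcium in mg/dL, PTH in pg/mL), and the handling of values that fall exactly on a cutoff are assumptions made for the example.

```python
from bisect import bisect_right

# Cutoffs quoted in the text; everything else here is illustrative.
ML_CUTOFFS = (676.0, 1292.0)    # thresholds derived by the ML signature
WIN_CUTOFFS = (800.0, 1600.0)   # Wisconsin Index thresholds (Mazeh et al.)

# Hypothetical labels for the three strata defined by two cutoffs.
LABELS = ("low", "intermediate", "high")


def ca_pth_product(calcium_mg_dl: float, pth_pg_ml: float) -> float:
    """Return the Ca*PTH product (units are an assumption of this sketch)."""
    return calcium_mg_dl * pth_pg_ml


def stratum(product: float, cutoffs: tuple) -> str:
    """Map a Ca*PTH product to one of three strata.

    A value exactly equal to a cutoff is placed in the upper stratum;
    the text does not specify boundary handling, so this is an assumption.
    """
    return LABELS[bisect_right(cutoffs, product)]


if __name__ == "__main__":
    product = ca_pth_product(10.5, 70.0)  # hypothetical patient values
    print(product, stratum(product, ML_CUTOFFS), stratum(product, WIN_CUTOFFS))
```

With these hypothetical values the product is 735, which falls in the middle stratum under the ML-derived cutoffs but in the lowest stratum under WIN, showing how the lower ML thresholds shift borderline patients between categories.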
Our MLCDA outperformed all individual variables, including the imaging variables combined 4D-CT + MIBIsensitive, 4D-CT alone, and MIBI alone, which had the highest performances among single variables. A prior study compared 4D-CT vs. MIBI in the same cohort of patients and showed that 4D-CT was superior to MIBI and that combined 4D-CT + MIBI did not provide added benefit over 4D-CT alone, suggesting limited value of MIBI (9). Contrary to that study, the ML signature selected 4D-CT + MIBIspecific as one of three important variables to include in the predictive tree model. When both imaging modalities were positive (positive 4D-CT + MIBIspecific), the probability of a hyperfunctioning gland increased to 97%, from 75% when only one modality was positive. This suggests that MIBI imaging does provide incremental clinical value by increasing diagnostic confidence, which may not be readily detected using standard diagnostic accuracy measures and requires the power of an ML algorithm to work through all possible permutations. The differing conclusions of the two studies may be partly explained by the fact that MIBI had a slightly higher specificity than 4D-CT in our prior study (99% vs. 96%, respectively) and a higher PPV (96% vs. 90%, respectively), probably due to fewer false positives with MIBI (9), and these subtle differences were determined to be important for prediction by the ML random forest algorithm.
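The structure of the decision tree described above can be sketched as a short Python function. This is a hedged illustration under stated assumptions, not the published algorithm. The source reports the five probabilities, the two Ca*PTH cutoffs, and that 97% and 75% apply to two and one positive modality, respectively. The assignment of 19.4%, 8.0%, and 3.8% to specific Ca*PTH strata in imaging-negative glands is an assumption inferred from the statement that a more severe biochemical profile favors single gland disease; the boundary handling, the function name, and the reduction of each imaging result to a single boolean are likewise assumptions.

```python
def hyperfunctioning_probability(ct_positive: bool,
                                 mibi_positive: bool,
                                 ca_pth: float) -> float:
    """Return an illustrative per-gland probability of hyperfunction.

    Probabilities and cutoffs are those quoted in the text; their mapping
    to the imaging-negative Ca*PTH strata is an assumption of this sketch.
    """
    # Both modalities positive: highest-probability category.
    if ct_positive and mibi_positive:
        return 0.970
    # Exactly one modality positive.
    if ct_positive or mibi_positive:
        return 0.750
    # Imaging negative: Ca*PTH determines the category (assumed ordering:
    # a higher product implies a lower chance of a missed gland).
    if ca_pth < 676:
        return 0.194
    if ca_pth <= 1292:  # boundary handling is an assumption
        return 0.080
    return 0.038


if __name__ == "__main__":
    # Hypothetical patient: one gland positive on both studies, three
    # imaging-negative glands, Ca*PTH product of 1500.
    glands = [(True, True), (False, False), (False, False), (False, False)]
    for ct, mibi in glands:
        print(hyperfunctioning_probability(ct, mibi, 1500.0))
```

Applied to each of the four glands in turn, such a function would yield the kind of per-gland probability map discussed earlier, which the surgeon could weigh when choosing between MIP and BNE.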
Despite the growing research and applications of ML and AI in medical imaging, only two prior studies have evaluated the use of ML in hyperparathyroidism and parathyroid diseases (21, 22). Imbus et al. used an ML algorithm to differentiate between single and multigland disease (21), and Somnay et al. evaluated ML to diagnose patients with PHPT (22). Both studies used demographic, clinical, and laboratory variables as predictors, but neither included imaging variables. While both studies showed high accuracy of ML, the algorithms were not translated into a readily implemented clinical algorithm, unlike in our study. Similarly, our complex signature provided a deeper understanding of the relative importance and relationships of different variables in prediction than our simple MLCDA; however, the complex signature could not be translated into clinical practice unless implemented in software, and it also requires postoperative variables. Therefore, we pursued prospective validation of the simple MLCDA. Our study differs from prior studies in that our ML signature simplified a dataset of many variables into a clinical decision tree using only three variables, which can be implemented in radiology reports to communicate probabilities to the surgeon for surgical planning.
Another novel aspect of our work is that we implemented several strategies to limit the risk of false-positive findings and optimized the simplicity of the model for translation into clinical practice. First, we used a large sample size for algorithm development and training. Second, we decided to select an optimal decision tree rather than using thousands of trees from the ML algorithm. The decision tree was optimized in the training set to predict pathology as a binary variable. Thus, the output was five classes redefining probability of disease as five categories corresponding to a spectrum from lowest to highest probability of a hyperfunctioning gland. Third, we created a low-dimensional model. Of note, we evaluated 16 candidate variables, but the algorithm demonstrated that most of the variables were highly correlated and therefore redundant, without providing added diagnostic value. The unsupervised random forest algorithm demonstrated in the training set that only three variables were necessary: one continuous variable (Ca*PTH) and two binary variables (4D-CT+/- and MIBI+/-). Using more than three variables did not significantly improve the performance of the model in our training set. In practice, this means that the number of dimensions in this signature is limited and that the signature is likely underfitted, as larger datasets would allow for developing more complex models achieving higher performance. Fourth, this low-dimensional signature was tested on a validation set that was not used for training and reached similar performance. Most importantly, unlike some AI/ML algorithms, our ML-derived clinical decision tree is not a black box and is a ready-to-use clinical tool that can be implemented across institutions to replace radiology verbiage with evidence-based probabilities. This spectrum of probability with different degrees of certainty could lead to a precision medicine approach and increase diagnostic confidence.
Our study has several limitations. First, our cohort was acquired from a single institution, which limits the generalizability of our findings to other institutions and practice settings. Second, our algorithm is based on combined 4D-CT and MIBI, which may not be available or routinely performed at other institutions; however, we tried to address this limitation by adapting the prediction algorithm for 4D-CT or MIBI alone. This resulted in slightly lower, but still high, prediction probabilities when imaging was positive: 89% for 4D-CT and 96% for MIBI. While 4D-CT had a lower probability of a hyperfunctioning gland when positive compared to MIBI, it had an overall lower probability of false negative imaging when imaging was negative (24.1% vs. 33.3%), particularly when Ca*PTH < 676, which partly explains the superior performance of 4D-CT in multigland disease. A third limitation is the potential overfitting of the model to the training dataset. We reduced overfitting by specifying a maximum combination of three informative candidate variables through forward search and feature combination. Lastly, we did not include parathyroid gland size as a variable in our simple signature because our goal was to create a prediction based only on preoperative variables, although the complex signature demonstrated that parathyroid weight on pathology was the most important predictor variable. We hope to address this limitation in a future study by using estimated parathyroid weight on 4D-CT.