In this study, a random forest model was built by combining a wide array of comorbidity and demographic information from health administrative data to classify patients as undergoing transcatheter closure of PFO or of ASD. To our knowledge, this is the first study to use machine learning techniques to separate these patient groups in administrative data. Using the CorHealth cardiac registry as the reference standard, our model achieved an overall accuracy of 76% with balanced sensitivity (76%) and specificity (75%), a substantially better and more balanced classification performance than the traditional approach, which labels a TC as being for PFO or ASD based only on a past history of stroke or TIA (accuracy 68%, sensitivity 36%, specificity 96%).
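For reference, the traditional approach amounts to a one-line classification rule. The sketch below is illustrative only; the indicator name is a hypothetical placeholder rather than an actual dataset field.

```python
import numpy as np

def traditional_rule(prior_stroke_tia: np.ndarray) -> np.ndarray:
    """Label a transcatheter closure as PFO (1) when a prior stroke/TIA
    code is present, otherwise ASD (0). Uses a hypothetical 0/1 indicator."""
    return (prior_stroke_tia == 1).astype(int)

# Four hypothetical procedures, two with a prior stroke/TIA code
print(traditional_rule(np.array([1, 0, 0, 1])))  # [1 0 0 1]
```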
Our random forest model identified several additional variables that can be used to improve the overall performance of PFO and ASD classification. Although the final model retained a relatively large number of variables, comorbidity and intervention codes should be readily available in similar administrative databases. The specific intervention codes that were most common in this patient population may differ across administrative data systems, but should still be readily available as billable codes.
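As a rough illustration of this general modeling approach (not the study's actual code), the sketch below assumes binary indicator columns for comorbidities and intervention codes alongside demographics; all column names and values are hypothetical placeholders.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical feature matrix: 0/1 indicators for comorbidities and
# intervention codes drawn from administrative data, plus demographics.
df = pd.DataFrame({
    "age": [54, 67, 45, 72, 38, 61],
    "prior_stroke_tia": [1, 0, 1, 0, 1, 0],
    "atrial_fibrillation": [0, 1, 0, 1, 0, 1],
    "echo_billing_code": [1, 1, 0, 1, 1, 0],
    "label_pfo": [1, 0, 1, 0, 1, 0],  # reference standard label (1 = PFO, 0 = ASD)
})

X = df.drop(columns="label_pfo")
y = df["label_pfo"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)

# Fit a random forest on the comorbidity / intervention-code indicators
rf = RandomForestClassifier(n_estimators=500, random_state=0)
rf.fit(X_train, y_train)

# Feature importances indicate which codes contribute most to classification
print(sorted(zip(rf.feature_importances_, X.columns), reverse=True))
```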
A different type of administrative data, the CorHealth cardiac registry, was used as the reference standard in our study because of its wide availability. CorHealth registry data are collected by the hospitals where all TC procedures are performed, with dedicated funding for clinical abstractors, and are gathered primarily for quality assurance and accountability rather than for billing purposes. Although reliability and validity checks are not routine for CorHealth data beyond identifying missing data, its clinical richness and its attention to distinguishing between TC for PFO and for ASD make it an acceptable reference standard against which to assess the accuracy of an administrative database algorithm.
With regard to the use of other administrative databases, while coding within the CIHI-DAD has been found to be very accurate for procedural information and individual demographics, its quality is considerably more variable for the coding of major diagnoses.27, 28 In addition, ICES databases do not capture health care use outside of Ontario.27, 28 Moreover, although PFO and ASD do not have their own distinct ICD codes, most of the variables used to build the random forest model, as with all administrative health data, depend on ICD coding. In general, diagnostic codes may not accurately reflect true disease status, leading to potential misclassification bias.29 Regardless of the knowledge of the individual entering the information, the inherent structure of ICD codes may not allow the inclusion of important clinical details, such as underlying anatomic diagnoses or a history of interventions or surgeries, that may be pertinent to long-term patient outcomes.11 Furthermore, there is little to no quality control to confirm accurate coding at the level of individual patients.11 This can create challenges in describing and differentiating pathologies of interest on the basis of ICD codes alone.10, 11
Given these known challenges with administrative data, certain performance measures were chosen to make this model useful for a wide variety of purposes. The ultimate intention in creating this classification model was to use it to study long-term outcomes for both ASD and PFO patients and to allow comparison between health care sites, which has not been feasible thus far using administrative data. As such, the classification algorithm should aim to identify both patient groups rather than prioritize one over the other, which is why sensitivity, specificity, and accuracy were evaluated together rather than focusing on a single measure. Typically, sensitivity is prioritized when the primary consideration is to identify true positives, even at the risk of including false positives.30 This enhances the inclusiveness of the model and can improve the generalizability of the results.30, 31 To balance this, and because it is inversely related to sensitivity, specificity was evaluated as well to identify true negatives.32 Although positive predictive value is often prioritized when identifying a patient cohort, accuracy was chosen in this study as the overall measure of how well the model differentiated between transcatheter procedures done for PFO and for ASD.33 Because the correct identification of both PFO and ASD patients was of chief interest, it was important to consider not only the overall performance of the model through accuracy, but also the balance between sensitivity and specificity.
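For clarity, the measures discussed here can all be expressed in terms of a single 2x2 confusion matrix. The sketch below uses made-up counts purely to show the definitions; it does not reproduce the study's data.

```python
def classification_measures(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard measures derived from a 2x2 confusion matrix."""
    return {
        "sensitivity": tp / (tp + fn),                  # proportion of true positives found
        "specificity": tn / (tn + fp),                  # proportion of true negatives found
        "ppv":         tp / (tp + fp),                  # positive predictive value
        "accuracy":    (tp + tn) / (tp + fp + tn + fn), # overall correct classifications
    }

# Illustrative counts only
print(classification_measures(tp=76, fp=25, tn=75, fn=24))
```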
There are some limitations related specifically to the use of random forest for classification. Random forest models, like other predictive algorithmic models, are probability-based and data-driven. As such, although the modeling strategy may be transferred to different settings, the specific model may not be directly applied, and external validation is necessary before future use.34, 35 A second set of labeled data was unavailable for our study, so external validation was not possible here. Additionally, while the complexity of random forests makes them a powerful modeling tool, they are less easily interpretable than other models and so may not be as accessible in certain settings.34, 35
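If a second labeled cohort were to become available, external validation would consist of applying the already-fitted model, without any refitting or retuning, to the new data. The sketch below illustrates this under that assumption, using synthetic data and an arbitrary number of features in place of real cohorts.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)

# Stand-in for the model fitted on the original (development) cohort
X_dev = rng.integers(0, 2, size=(200, 5))
y_dev = rng.integers(0, 2, size=200)
rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_dev, y_dev)

# External validation: score the same fitted model on an independent,
# separately labeled cohort -- no refitting or retuning is performed.
X_ext = rng.integers(0, 2, size=(100, 5))
y_ext = rng.integers(0, 2, size=100)
y_pred = rf.predict(X_ext)

tn, fp, fn, tp = confusion_matrix(y_ext, y_pred).ravel()
print("accuracy:", accuracy_score(y_ext, y_pred))
print("sensitivity:", tp / (tp + fn), "specificity:", tn / (tn + fp))
```

In practice, the external cohort would be an independently labeled registry sample rather than synthetic data, and the same sensitivity, specificity, and accuracy measures would be reported.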