Various reports on medical-device recalls in Japan have been published, including studies analyzing the nature of recalls and the time period of their occurrence [15, 16], studies analyzing the causes of recalls [17], and studies focusing on medical materials and the impact of recalls on the medical field [18]. Other studies have investigated bugs in medical-device programs and software [19]. Here, we constructed a model that predicts which of the roughly 10,000 medical-device malfunction reports published each year will lead to a recall. A previous study based on decision trees [20] used malfunction reports from only a single year but covered all medical devices (Classes I–IV); whether such a model can estimate sufficiently well must be closely scrutinized. Among the approximately 154,600 malfunction reports collected over the past 15 years, only about 1% (~2,000 cases) resulted in a recall. We therefore considered that a model trained on all of these reports combined would yield poor estimation accuracy and incur a high computational cost.
In Japan, approximately 3–5 of the medical devices recalled per year are Class I devices. Among recalled devices, CIEDs (which are Class IV devices) are the most common, and they are recalled frequently because of their high risk. We therefore considered that their recalls notably affect both medical practice and companies. The CIEDs (specifically, pulse generators) targeted in our study are often combined with accessories such as leads. Over 10,000 malfunction reports, including those related to accessories, have been filed, but only around 500 CIED cases, including accessories, have been recalled, implying a large degree of data imbalance and, consequently, poor estimation accuracy. To improve the estimation accuracy, we limited our study to the pulse generators of CIEDs, which have a high recall rate.
A frequency analysis of the extracted words suggested that the malfunctions most likely to be reported are “problems related to premature battery consumption” (including “problems related to early battery consumption”), “infections not originating from the product,” and “pacing failure” (including suspected cases). Meanwhile, the specific health damages include “re-operation or fear of re-operation,” “infection or suspected infection,” and “removal/replacement of the generators.” This information appears in both recalled and non-recalled cases. As shown in Table 1, very few words are unique to the recalled group. Meanwhile, over 90% of the problems involving (for example) “charging time” or “pacing function” resulted in recalls. Such words are likely to become characteristic words representing recalls during BERT estimation. However, only around 20 words in the recalled group have occurrence frequencies above 50%. The recalled group in our dataset is small (363 cases), so correctly identifying its characteristic words is crucial for improving the estimation accuracy.
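The word-level comparison described above can, in principle, be reproduced by counting, for each word, the share of reports containing it that belong to the recalled group. The following is a minimal sketch of such a count; the pandas column names `tokens` and `recalled` are illustrative assumptions, not the fields of our actual dataset.

```python
from collections import Counter

import pandas as pd


def group_word_frequencies(df: pd.DataFrame) -> pd.DataFrame:
    """Compare how often each word appears in recalled vs. non-recalled reports.

    Assumes a `tokens` column (list of words per report, e.g. the output of
    morphological analysis) and a boolean `recalled` column.
    """
    counts = {True: Counter(), False: Counter()}
    for tokens, recalled in zip(df["tokens"], df["recalled"]):
        counts[bool(recalled)].update(set(tokens))  # count each word once per report

    rows = []
    for word in set(counts[True]) | set(counts[False]):
        in_recalled = counts[True][word]
        in_not = counts[False][word]
        rows.append({
            "word": word,
            "recalled_reports": in_recalled,
            "not_recalled_reports": in_not,
            # share of the reports containing this word that were recalled
            "recalled_share": in_recalled / (in_recalled + in_not),
        })
    return pd.DataFrame(rows).sort_values("recalled_share", ascending=False)
```

Words with a high recalled share, such as the “charging time” and “pacing function” examples above, are the candidates most likely to act as characteristic words during BERT estimation.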
Here, we aimed to construct a model that can predict all cases leading to recall. To place more weight on the Recall measure, we replaced the usual F-score, the harmonic mean of Precision and Recall, with the F2-score. An Fβ-score of 0.5 or higher is considered appropriate, and a value over 0.6 is considered high. Our dataset was highly unbalanced, with the recalled group constituting less than 10% of all samples. Learning on this dataset is therefore biased toward the majority class (the not group), which improves the accuracy of judging the not-group samples. Accordingly, the ACC scores of all BERT-based models were high (> 90%), regardless of the explanatory variable. However, the Recall scores, which indicate the recall-detection accuracy, were approximately 60% for malfunction status and approximately 10% for health-damage status. Although some models achieved high ACC and Recall scores for health-damage status, their F2-scores were all below 0.5, suggesting that correct learning is not possible on the original dataset. After under/oversampling the data, the ACC and Precision scores decreased, but the Recall exceeded 70% under all conditions. The model trained by tohoku-BERT on oversampled data achieved a particularly high Recall for malfunction status in the recalled group (0.934; 24 errors out of 363 cases). However, the F2-score of this model was low (0.392). The most consistent model was UTH-BERT trained on undersampled data, with a Recall of 0.931 and an F2-score of 0.655 for malfunction and health-damage status.
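For reference, Fβ = (1 + β²) · Precision · Recall / (β² · Precision + Recall), so the F2-score weights Recall four times as heavily as Precision. The sketch below illustrates the evaluation metrics and the random under/oversampling step; it uses scikit-learn and imbalanced-learn with placeholder arrays and is not the actual BERT fine-tuning pipeline used in this study.

```python
import numpy as np
from imblearn.over_sampling import RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.metrics import accuracy_score, fbeta_score, precision_score, recall_score


def evaluate(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    """Metrics reported in this study; beta=2 emphasizes Recall over Precision."""
    return {
        "ACC": accuracy_score(y_true, y_pred),
        "Precision": precision_score(y_true, y_pred, zero_division=0),
        "Recall": recall_score(y_true, y_pred),
        "F2": fbeta_score(y_true, y_pred, beta=2),
    }


# Placeholder training data with roughly 10% positive (recalled) samples.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 8))
y_train = (rng.random(1000) < 0.1).astype(int)

# Resampling is applied to the training split only, so that the evaluation
# split retains the original class imbalance.
X_under, y_under = RandomUnderSampler(random_state=0).fit_resample(X_train, y_train)
X_over, y_over = RandomOverSampler(random_state=0).fit_resample(X_train, y_train)
```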
One limitation of this study is the unification of the notation of the written content. A glossary of medical-device malfunction terms has been compiled by the International Medical Device Regulators Forum and published by the PMDA [21]. The descriptions of malfunction status are, in principle, based on this glossary. In contrast, the definitions of terms related to health damages, such as “unwell,” “feeling unwell,” “chest tightness,” and “chest pain,” are ambiguous and depend on the authors of the reports. In this study, such spelling variants were tokenized and learned in a mixed state; that is, as separate words. Therefore, in future work, these spelling variants must be carefully examined and (in some cases) unified before learning. In addition, these examinations were overseen by a single researcher (a clinical engineer). More accurate forced extractions for synonym unification can be expected with assistance from engineers and recall personnel at manufacturing and sales companies. However, if a word does not appear characteristically in the cases leading to recalls, it will probably have a negligible effect on the estimation accuracy. For example, because tohoku-BERT did not correctly extract the word “re-operation” during the morphological analysis, the word was registered in an additional dictionary, although its proportion in the recalled group was only 1/9th that in the not group. Therefore, the estimation accuracy is unlikely to change noticeably even if “re-operation” is split into “re” and “operation.” In contrast, the estimation accuracy can be improved by ensuring that words such as “pacing function” and “charging time,” which frequently appear in recalled cases, or words that are unique to recalled cases, are registered in the dictionary. Therefore, to improve the estimation accuracy, the characteristic terms in recalled cases must be extracted in the correct form for learning.
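In our study, the characteristic terms were registered in an additional dictionary for the morphological analyzer. As an alternative illustration, the sketch below registers such terms directly as extra tokens of a pretrained tokenizer via the Hugging Face transformers API so that they are not split during tokenization; the English terms stand in for the original Japanese expressions, and this is not the exact procedure used in this study.

```python
from transformers import AutoModel, AutoTokenizer

# tohoku-BERT (the MeCab-based Japanese tokenizer dependencies, e.g. fugashi, must be installed)
model_name = "cl-tohoku/bert-base-japanese"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

# Domain terms that should be kept intact rather than split into sub-tokens
# (English stand-ins for terms such as "pacing function" and "charging time").
extra_terms = ["pacing function", "charging time", "re-operation"]
num_added = tokenizer.add_tokens(extra_terms)

# The embedding matrix must be resized to cover the newly added tokens;
# their embeddings are then learned during fine-tuning.
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))
```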
Expanding the range of target devices is another avenue for future research. As malfunctions and health damages depend on the target device, we will optimize the morphological analysis for each device, extract the words unique to its recalled cases, and aim to further improve the estimation accuracy.