We developed a new CAD system for hip fracture based on a deep learning algorithm. This system provided high accuracy, sensitivity, and specificity. The areas activated on the heat map all corresponded to the areas identified by the orthopedic surgeon. Inexperienced residents’ diagnostic accuracy, sensitivity, and specificity in the diagnosis of hip fracture improved when they used the CAD system.
Our CAD system, based on a deep learning algorithm, had several advantages over other systems. Table 4 summarizes our literature review of AI-based systems for the diagnosis of hip fracture [17–19]. We used the largest amount of training data, collected from multiple institutions: nearly all hip fracture images available at the participating institutions were used, and approximately 10,000 machine learning images were generated from approximately 5,000 cases. Large datasets are the key to success in machine learning [29]. The majority of published studies on AI to date were conducted at a single institution; only 6% of these studies used data from multiple institutions [30]. Our multi-center dataset provides 1) a large amount of data and 2) images in different imaging formats. In this study, the deep learning algorithm achieved high accuracy across multiple institutions, despite the use of different radiographic equipment and image file formats. This high performance on multi-center data may facilitate the practical application of the system.
Table 4

| Study | Year | Institution | Number of patients | Number of images for machine learning | Fracture type (femoral neck / trochanteric fracture) | Images including implants on hip or spine | Accuracy (%) | Sensitivity (%) | Specificity (%) | AUC | Grad-CAM | Clinician test (AI-aided test) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Adams et al. [1] | 2018 | 1 | 805 | 805 | femoral neck fracture | excluded | 90.6 | N/A | N/A | 0.98 | no | no |
| Urakawa et al. [29] | 2018 | 1 | 1773 | 3346 | femoral trochanteric fracture | excluded | 95.5 | 93.9 | 97.4 | 0.97 | no | no |
| Cheng et al. [6] | 2019 | 1 | 3605 | 3605 | both | included | 91 | 98 | 84 | 0.98 | yes | no |
| Current study | 2020 | 3 | 4851 | 10484 | both | included | 96.1 | 95.2 | 96.9 | 0.99 | yes | yes |
The performance of our deep learning algorithm was as good as that described in previous reports. On the other hand, the algorithm failed to correctly diagnose 3.9% of the test images (39 out of 1,000): 24 images with fractures were diagnosed as "without fracture" and 15 images without fracture were diagnosed as "with fracture". Interestingly, the sensitivity and specificity of the deep learning algorithm were similar to those of the orthopedic surgery fellows (95.8% vs. 95.5%). This suggests that our deep learning algorithm has diagnostic ability comparable to, but not exceeding, that of an orthopedic surgeon.
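For illustration, the error counts above can be converted into the reported metrics with a standard confusion-matrix calculation. The exact positive/negative split of the 1,000 test images is not stated here; the 505/495 split below is an assumption chosen because it is consistent with the reported accuracy, sensitivity, and specificity.

```python
# Confusion-matrix metrics for the 1,000-image test set.
# Assumption: 505 fracture-positive and 495 fracture-negative images
# (not stated in the text; chosen to be consistent with the reported figures).
n_pos, n_neg = 505, 495
fn = 24   # fracture images read as "without fracture"
fp = 15   # non-fracture images read as "with fracture"
tp = n_pos - fn
tn = n_neg - fp

accuracy = (tp + tn) / (n_pos + n_neg)   # 961/1000 = 96.1%
sensitivity = tp / (tp + fn)             # recall on fracture images
specificity = tn / (tn + fp)

print(f"accuracy={accuracy:.3f}, sensitivity={sensitivity:.3f}, specificity={specificity:.3f}")
```

Note that accuracy follows directly from the 39 errors regardless of the split, while sensitivity and specificity depend on the assumed class balance.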
Second, our CAD system, which was based on a deep learning algorithm, was able to provide a heat map of the fracture site, which offered evidence of where the AI recognized the fracture. In all cases, the fracture site indicated on the heat map was located in the area indicated by the orthopedic surgeon. AI-based diagnostics has classically been associated with a “black box problem” [31]: the model cannot explicitly express the features on which it relies, the reasons for its judgments are not clear, and humans cannot understand or interpret them. In this study, we used Grad-CAM to visualize class-discriminative regions on the X-rays, which could reveal the location underlying the diagnosis. However, Grad-CAM could show the fracture only as a rough area and could not show the fracture line itself. Moreover, the image information on which the deep learning algorithm based its decision (e.g., the fracture line, bone marrow edema, or soft-tissue contrast) remains unclear.
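As a sketch of the underlying computation (not the authors' implementation): Grad-CAM pools the gradients of the class score over each convolutional feature map into per-channel weights, forms a weighted sum of the maps, and applies a ReLU. Because the result lives at the low resolution of the last convolutional layer, it localizes only a coarse region, not a thin fracture line.

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Core Grad-CAM computation on precomputed tensors.

    feature_maps: (C, H, W) activations of the last conv layer.
    gradients:    (C, H, W) gradients of the target class score
                  (e.g., "with fracture") w.r.t. those activations.
    Returns an (H, W) heat map normalized to [0, 1].
    """
    # Per-channel weights: global average pooling of the gradients.
    weights = gradients.mean(axis=(1, 2))                              # (C,)
    # Weighted sum of feature maps, then ReLU (keep positive evidence only).
    cam = np.maximum((weights[:, None, None] * feature_maps).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam /= cam.max()
    # The coarse (H, W) map is then upsampled to the radiograph's resolution
    # and overlaid as a heat map; spatial detail finer than H x W is lost.
    return cam

# Toy example with random tensors standing in for a network's outputs.
rng = np.random.default_rng(0)
cam = grad_cam(rng.standard_normal((64, 7, 7)), rng.standard_normal((64, 7, 7)))
```

The 7×7 toy resolution above illustrates why the heat map marks a region rather than a line: one cell of the map can cover dozens of pixels of the original X-ray.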
Third, in this study, the diagnostic accuracy, sensitivity, and specificity of residents improved when they used the CAD system. Moreover, the CAD system improved their diagnostic accuracy regardless of the year of residency. There have been many studies in which deep learning algorithms showed high diagnostic performance at the basic research level [14]. However, those studies did not provide comparisons with healthcare professionals (i.e., human vs. machine), and few reported comparisons with healthcare professionals on the same test dataset. As shown in Table 4, previous studies on deep learning algorithms for hip fractures did not assess how such algorithms affect clinicians' diagnostic abilities [17–19]. Our study showed that the CAD system would be useful for aiding residents in the diagnosis of hip fracture.
The present study was associated with several limitations. First, the present dataset included cases of pathological fracture caused by metastatic bone tumors but did not include cases of osteomyelitis without fracture. It is desirable to consult a specialist as soon as possible in such cases; however, the CAD system developed in this study may not be able to point them out. Second, each image must be divided by preprocessing. A CAD system that can diagnose hip fractures directly from X-rays of both hips, without preprocessing, should be developed using the deep learning algorithm obtained in this study. Third, the diagnostic imaging test was not conducted in an actual clinical setting. This was a retrospective study conducted via a PACS-like web interface used by clinicians for medical imaging. Unlike in clinical practice, where high-resolution monitors are used, the images were read on home personal computers, so the diagnostic rate of clinicians may be underestimated. It is also possible that the proportion of "with fracture" images in the test dataset differs from the actual incidence of fracture in clinical practice. In this regard, future prospective studies in actual clinical settings using an actual PACS system are needed. Fourth, we were not able to assess whether clinicians fundamentally improved their diagnostic abilities through the diagnostic imaging test. In the test, clinicians read images without diagnostic aid and then with diagnostic aid consecutively; because they did not read the images at regular time intervals, the educational effect is unknown. In addition, the correctness criterion for the diagnostic imaging test was whether the clinician answered the fracture site correctly on the basis of the presence or absence of a fracture.
Grad-CAM presented the heat map as an indication of the fracture site, but it is unclear how much the heat map contributed to the clinicians' reading performance.
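The preprocessing mentioned in the second limitation (dividing a radiograph of both hips into single-hip images) can be sketched as a simple crop. The midline split and mirroring below are illustrative assumptions, not the study's actual pipeline, which is not specified in this section.

```python
import numpy as np

def split_both_hips(xray: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Divide an AP pelvic radiograph (H, W) into two single-hip crops.

    Illustrative assumption: a vertical midline split, with the second
    half mirrored horizontally so both crops share the same orientation
    before being fed to a single-hip classifier.
    """
    h, w = xray.shape
    first_half = xray[:, : w // 2]
    second_half = xray[:, w - w // 2 :][:, ::-1]  # mirror for consistent orientation
    return first_half, second_half

# Toy example: a 512x1024 synthetic radiograph.
img = np.zeros((512, 1024))
a, b = split_both_hips(img)
```

A system that skips this step would need to localize and classify each hip within the full pelvic image, which is one direction suggested by the limitation above.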