Artificial Intelligence Improves the Accuracy of Residents in the Diagnosis of Hip Fractures – A Multicenter Study

YOICHI SATO (yoichisugar.trauma@gmail.com), JCHO Tokyo Yamate Medical Center, https://orcid.org/0000-0001-6144-3735
YASUHIKO TAKEGAMI, Nagoya University Graduate School of Medicine
TAKAMUNE ASAMOTO, Tsushima City Hospital
YUTARO ONO, Nagoya Daini Red Cross Hospital
HIDETOSHI TSUGENO, Nagoya Daini Red Cross Hospital
RYOSUKE GOTO, SearchSpace Co., Ltd.
AKIRA KITAMURA, SearchSpace Co., Ltd.
SEIWA HONDA, Nonprofit Organization Nagoya Orthopedic Regional Healthcare Support Center


Background
In Japan, as many as 13 million elderly people have osteoporosis [1,2]. Fragility fractures associated with osteoporosis, such as hip fractures and spinal fractures, are also increasing, with 200,000 patients suffering hip fractures annually [3]. Patients with hip fractures require hospital admission as soon as possible, because the longer treatment is delayed, the worse their walking ability and prognosis become [4,5].
Most hip fracture patients visit the emergency department because they have difficulty walking due to pain. In the emergency department, clinicians are exposed to excessive time pressure and mental stress, which can cause fatigue and misdiagnosis [6,7]. This tendency is particularly pronounced among residents [8]. In previous studies, the misdiagnosis rate at the initial diagnosis of hip fractures was estimated to be 2–10% [9].
A delay in diagnosis and treatment worsens the prognosis [10], and a misdiagnosis may lead to medical litigation [6]. To prevent misdiagnosis, radionuclide bone scans, computed tomography (CT), and magnetic resonance imaging (MRI), in addition to plain X-rays, are recommended as supplementary diagnostic imaging [11,12]. However, these additional tests are not available in all institutions.
In recent years, deep learning, a method of machine learning using multi-layered neural networks, has emerged and improved the accuracy of image recognition [13]. In the field of medicine, many previous studies have reported the application of deep learning to imaging analysis and demonstrated its high diagnostic accuracy [14]. Several studies have applied deep learning algorithms to the diagnosis of fractures [15]. Olczak first demonstrated the use of artificial intelligence (AI) with a deep learning approach for the diagnosis of ankle and wrist fractures on plain X-rays [16]. Some papers have also described the use of deep learning algorithms to diagnose hip fractures [17][18][19]. In addition, a previous study reported that a deep learning algorithm improved the diagnostic accuracy of fracture detection by clinicians [20]. However, these studies were conducted in single centers, the datasets were relatively small, and the image processing methods were uniform. Few studies have described the improvement of clinicians' diagnostic accuracy for hip fractures with the aid of deep learning algorithms, and no studies have reported differences in outcomes according to years of clinical experience.
Thus, we planned to train a deep learning model using a large dataset of images obtained by various protocols in a multi-institutional setting. We newly developed a computer-aided diagnosis (CAD) system based on a model that can visualize the basis of the AI's diagnosis. In the present study, we hypothesized that the CAD system would improve the diagnostic accuracy of physicians, including residents.

Subjects
All experiments were performed in accordance with the ethical standards of the amended Declaration of Helsinki. This study was conducted with the approval of the ethics committee of each hospital (Gamagori City Hospital: approval No. 368-1; Tsushima City Hospital: approval No. 2019-3; Nagoya Daini Red Cross Hospital: approval No. 1360).
We collected images from three hospitals (Gamagori City Hospital, Tsushima City Hospital, and Nagoya Daini Red Cross Hospital) in Aichi Prefecture, Japan. Nagoya Daini Red Cross Hospital provides tertiary care in an urban area with a population of 2.3 million. The other two hospitals, Gamagori City Hospital and Tsushima City Hospital, are primary care hospitals in a rural area of Japan. Table 1 shows the background factors of each institution. We also included cases with a hip fracture on the opposite side during the study period, hip implants on the opposite side (n = 452), complicated pubic or sciatic fractures (n = 93), osteoarthritis of the hip (Kellgren-Lawrence grade III or IV; n = 84) [21], spine implants (n = 46), and pathologic fractures of the proximal femur due to metastatic cancer (n = 12). We excluded images for the following reasons: periprosthetic fracture (n = 32), bilateral hips not included within the image range (n = 14), and femoral shaft fracture (n = 7). Finally, we utilized 5242 AP pelvic X-rays from 4851 cases (sex: male, n = 1193; female, n = 3658; mean age at injury: 81.1 years) (Fig. 1). Of these, we diagnosed 5024 (95.8%) from frontal simple hip radiographs, 97 (1.9%) with additional lateral radiographic views, and 121 (2.3%) with CT or MRI for definitive or exclusionary diagnosis.

Evaluation of fractures
Two orthopedic surgeons (YS, TA) assessed the presence or absence and the type of fracture. The kappa statistic of inter-observer agreement for the presence or absence of fracture was 0.91. Disagreements were resolved by discussion. To classify the fracture type, we used the Garden classification (stages I–IV: G/S I–IV) for femoral neck fractures [22] and the AO/OTA classification (31-A1, A2, A3) for femoral trochanteric fractures [23]. We defined a greater trochanteric fracture as one in which the fracture line did not extend to the medial cortex [24]. A total of 5024 cases (95.8%) were diagnosed from AP pelvic X-rays alone. The other patients were diagnosed by lateral X-ray (n = 97; 1.9%) or by CT or MRI (n = 121; 2.3%). Table 2 shows the classifications of fracture types. (Table 1)

Image preprocessing and development of the algorithm
We used an Intel Core i7-8700K, Ubuntu 18.04, and Python 3.7 to perform image processing on the target image data and train the algorithm. Images extracted from the DICOM server were converted into 3-channel, 8-bit JPEG images and resized to 380 × 380 pixels; we used uncompressed data. Each image was annotated with a rectangle that enclosed the entire fracture site. To extract larger images, we placed a vertical dividing line at a position with a 50-pixel margin from the rectangle. The half-images without the rectangle were adopted as non-fractured-side data, yielding 5242 images that did not contain the fracture site; the half-images of the same size containing the rectangle were adopted as fractured-side data, yielding 5242 images that contained the fracture site. In total, 10484 images were prepared for machine learning (Supplementary Fig. 1).
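The half-image extraction described above reduces to simple coordinate arithmetic. The following is a minimal sketch, assuming the dividing line sits 50 pixels beyond the side of the rectangle nearest the midline and that the non-fractured crop of the same width is taken from the opposite edge; the function name and exact placement rule are illustrative, not the authors' code.

```python
def split_pelvis_image(width, height, fracture_box, margin=50):
    """Split an AP pelvic radiograph into a fractured-side crop and a
    non-fractured-side crop of equal width (hypothetical sketch of the
    50-pixel-margin rule; crops are (left, top, right, bottom) tuples).

    fracture_box: (x0, y0, x1, y1) rectangle enclosing the fracture site.
    """
    x0, _, x1, _ = fracture_box
    if (x0 + x1) / 2 < width / 2:
        # Fracture on the left half: dividing line 50 px right of the rectangle.
        divide = x1 + margin
        fractured = (0, 0, divide, height)
        # Same-width crop taken from the opposite (right) edge of the image.
        nonfractured = (width - divide, 0, width, height)
    else:
        # Fracture on the right half: dividing line 50 px left of the rectangle.
        divide = x0 - margin
        fractured = (divide, 0, width, height)
        nonfractured = (0, 0, width - divide, height)
    return fractured, nonfractured
```

Both crops have identical dimensions, so the fractured and non-fractured training images are directly comparable.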
We used Python 3.7 to train the algorithm, with PyTorch 1.3 and Fast.ai 1.0 as deep learning libraries, and an Nvidia RTX 2070 GPU for training and inference. To perform transfer learning [25], we used the EfficientNet-B4 model [26], which was pre-trained on ImageNet (Supplementary Fig. 2). A deep convolutional neural network (DCNN) approach was used for learning. The model was trained for two-class classification, with images with fractures as positive and images without fractures as negative.
We used gradient-weighted class activation mapping (Grad-CAM) [27] to visualize the basis for the deep learning algorithm's diagnosis of a fracture. We applied the show-heatmap function of Fast.AI (http://www.fast.ai) to the deep learning algorithm to obtain the heatmap. Through this process, we developed a CAD system based on a deep learning algorithm that provides both a diagnosis and a visualization of its basis.
We determined the calculation time for the whole process of inference and heatmap generation per image of the test dataset: the total calculation time for 1000 images of the test data was divided by 1000 to obtain the average time per image.

Controlled experiment with clinicians
To investigate the application of the CAD system and verify its effectiveness in a clinical setting, we conducted a controlled experiment with clinicians. There were 65 residents in the three institutions included in the study. Thirty-one of these residents agreed to participate (10 in their first year of residency and 21 in their second year). Four orthopedic surgery fellows (orthopedic surgeons with 6–7 years of clinical experience before taking the specialty exam in Japan) also undertook the test. Each participant provided informed consent at their respective institution.
We randomly extracted 300 images (133 of the non-fractured side and 167 of the fractured side) from the 1000-image test dataset described in a previous study [20]. The 300 images included 136 right femur images and 164 left femur images. First, we checked the performance of the deep learning algorithm on the 300 images.
Then, the physicians undertook the diagnostic test. The outline of the test was as follows: 1) the physician diagnosed the presence or absence of a fracture unaided; 2) after the physician answered, the CAD system added its visualization of the fracture to the same image; 3) as a second test, the physician answered again with this aid (Supplemental Fig. 4). This sequence was repeated for all 300 images.
Assessment

Performance of the deep learning algorithm
We evaluated the performance of the trained deep learning algorithm using the test image dataset. We calculated the accuracy, sensitivity, specificity, F-value, and receiver operating characteristic (ROC) curve, and measured the area under the curve (AUC), as described in the STARD 2015 guidelines [28].
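The metrics named above follow directly from the counts of the two-class (fracture / no fracture) confusion matrix. A plain-Python sketch is shown below; the study itself used Scikit-Learn for these calculations, and the function name is illustrative.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute accuracy, sensitivity, specificity, and F-value from the
    counts of a two-class confusion matrix ("with fracture" = positive)."""
    sensitivity = tp / (tp + fn)            # true-positive rate (recall)
    specificity = tn / (tn + fp)            # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    # F-value: harmonic mean of precision and sensitivity.
    f_value = 2 * precision * sensitivity / (precision + sensitivity)
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "f_value": f_value}
```

The ROC curve and AUC additionally require the model's continuous output scores rather than hard labels, which is why the study used Scikit-Learn for that part of the analysis.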
Evaluation of the heatmap generated by the CAD system
We validated the accuracy of Grad-CAM in accordance with previous research [18]. We used a total of 40 images, 20 with and 20 without fractures, randomly selected from the images that the algorithm had correctly diagnosed in the test dataset. An image was judged "with fracture" if the area of highest signal intensity on the heatmap was located directly over the femur, between the femoral head and just above the lesser trochanter. The assessor (YS) evaluated the consistency between the high-signal-intensity region on the heatmap and the actual fracture site on the X-ray using sensitivity and specificity. The kappa value for intra-observer agreement over a two-week interval was 1.0.
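The location criterion above can be illustrated as a simple peak-in-region check. This is a sketch only, using nested lists in place of a real heatmap array and a rectangular region of interest as a stand-in for the femoral-head-to-trochanter zone; in the study this assessment was performed visually by a human reader.

```python
def heatmap_peak_in_roi(heatmap, roi):
    """Return True if the hottest heatmap cell lies inside the given
    rectangular region of interest (hypothetical stand-in for the
    femoral head to just-above-the-trochanter zone).

    heatmap: 2-D list of activation values.
    roi: (row0, col0, row1, col1), inclusive-exclusive bounds.
    """
    peak_val, peak_rc = None, None
    for r, row in enumerate(heatmap):
        for c, v in enumerate(row):
            if peak_val is None or v > peak_val:
                peak_val, peak_rc = v, (r, c)
    r0, c0, r1, c1 = roi
    return r0 <= peak_rc[0] < r1 and c0 <= peak_rc[1] < c1
```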
The diagnostic accuracy of physicians with or without the use of the CAD system
We compared the accuracy, sensitivity, and specificity with and without the aid of the CAD system among residents and orthopedic surgery fellows. We also compared the diagnostic accuracy of the first-year residents to that of the second-year residents.

Statistical analysis
The EZR software program was used to perform the statistical analyses [16]. The Shapiro-Wilk test was used to evaluate normality, the nonparametric Mann-Whitney U test was used to compare continuous variables, and Fisher's exact test was used to analyze categorical variables. P values of < 0.05 were considered to indicate statistical significance. Scikit-Learn (https://scikit-learn.org/) was used to analyze the performance of the deep learning algorithm.

Performance of the deep learning algorithm
The deep learning algorithm misdiagnosed 39 of the 1000 test images. A total of 24 images with fractures were diagnosed as "without fracture" (false negatives). These included slightly displaced fractures (n = 21): fractures located at the greater trochanter of the femur (n = 9), non-displaced femoral neck fractures (n = 8), and femoral trochanteric fractures (n = 8; AO/OTA 31-A1). The others were relatively displaced fractures (n = 3): femoral trochanteric fractures (n = 2; AO/OTA 31-A2, A3) and a displaced femoral neck fracture (n = 1; G/S III, IV). A total of 15 images without fracture were diagnosed as "with fracture" (false positives). These included 13 normal images, one case with deformity after conservative treatment, and one case after nail removal (Fig. 3).

Evaluation of the heatmap generated by the CAD system
For images diagnosed by the algorithm as "with fracture", Grad-CAM showed a high-signal region consistent with the fracture site. For images diagnosed as "without fracture", Grad-CAM showed high-signal areas in regions other than the femoral neck and trochanter (Fig. 4). For the 20 "with fracture" images, the high-signal region on the heatmap coincided with the fracture site in all 20 images. Of the 20 "without fracture" images, 19 had high-signal areas outside the region from the femoral head to just above the trochanter, but one image had a high-signal area in the greater trochanter.
(Supplemental Fig. 5)

The diagnostic accuracy of physicians with or without the use of the CAD system
The results of the diagnostic accuracy of the first- and second-year residents and the orthopedic surgery fellows are presented in Table 3. The accuracy, sensitivity, and specificity of the residents improved with the CAD system, irrespective of their year of residency. In contrast, the accuracy, sensitivity, and specificity of the orthopedic surgery fellows did not change with the use of the CAD system.

Discussion
We developed a new CAD system based on a deep learning algorithm for hip fracture. This system provided high accuracy, sensitivity, and specificity. The areas activated on the heatmap all corresponded to the areas pointed out by the orthopedic surgeon. The diagnostic accuracy, sensitivity, and specificity of inexperienced residents in the diagnosis of hip fracture improved when they used the CAD system.
Our CAD system, based on a deep learning algorithm, had some advantages over previously reported systems. We conducted a literature review of AI-based systems for the diagnosis of hip fracture, summarized in Table 4 [17][18][19]. First, we used the largest amount of learning data from multiple institutions. In this study, almost all of the hip fracture images obtained from the participating institutions were used, and approximately 10,000 machine learning images were generated from approximately 5,000 cases. Large datasets are the key to success in machine learning [29]. The majority of published studies on AI to date were conducted in a single institution; only 6% of these studies used data from multiple institutions [30]. Our multicenter dataset provided 1) a large amount of data and 2) images with different imaging formats. The deep learning algorithm achieved high accuracy across institutions despite the use of different radiographic equipment and image file formats, and this robustness to multicenter data may help in the practical application of the system. The performance of our deep learning algorithm was as good as that described in previous reports. On the other hand, the algorithm failed to diagnose 3.9% of images (39 out of 1,000 test images) correctly: 24 images with fractures were diagnosed as "without fracture" and 15 images without fracture were diagnosed as "with fracture". Interestingly, the sensitivity and specificity of the deep learning algorithm were similar to those of the orthopedic surgery fellows (95.8% vs. 95.5%). This suggests that our deep learning algorithm has diagnostic ability comparable to, but not exceeding, that of an orthopedic surgeon.
Second, our CAD system was able to provide a heatmap of the fracture site, which showed where the AI recognized the fracture. In all cases, the fracture site indicated on the heatmap was located in the area indicated by the orthopedic surgeon. AI-based diagnostics has classically been associated with a "black box problem" [31]: the model cannot explicitly express its features, the reasons for its judgment are not clear, and humans cannot understand or interpret them. In this study, we used Grad-CAM to visualize class-discriminative regions on the X-rays, which revealed the location underlying the diagnosis. However, Grad-CAM could show the fracture only as a rough area and cannot show the fracture line itself. Moreover, the image information on which the deep learning algorithm based its decision (e.g., the fracture line, bone marrow edema, or soft-tissue contrast) remains unclear.
Third, in this study, the diagnostic accuracy, sensitivity, and specificity of residents improved when they used the CAD system, regardless of the year of residency. There have been many studies in which deep learning algorithms showed high diagnostic performance at the basic research level [14]. However, most did not provide comparisons with healthcare professionals (i.e., human vs. machine), and few reported comparisons with healthcare professionals using the same test dataset. As shown in Table 4, previous studies on deep learning algorithms for hip fractures did not assess how such algorithms affect clinicians' diagnostic abilities [17][18][19].
Our study showed that the CAD system would be useful for aiding residents in the diagnosis of hip fracture.
The present study was associated with several limitations. First, the dataset included cases of pathological fractures caused by metastatic bone tumors but did not include cases of osteomyelitis without fracture. It is desirable to consult a specialist as soon as possible in such cases; however, the CAD system developed in this study may not be able to point this out. Second, the image needs to be divided by preprocessing. A CAD system that can diagnose hip fractures from X-rays of both hips without preprocessing should be developed using the deep learning algorithm obtained in this study. Third, the diagnostic imaging test was not conducted in an actual clinical setting. This was a retrospective study conducted via a PACS-like web interface of the kind used by clinicians for medical imaging. Unlike the high-resolution monitors used in clinical practice, the images were read on home personal computers, so the diagnostic rate of clinicians may be underestimated. It is also possible that the proportion of "with fracture" images in the test differed from the frequency encountered in clinical practice. In this regard, future prospective studies in actual clinical settings using a real PACS system are needed. Fourth, we could not assess whether clinicians fundamentally improved their diagnostic abilities through the diagnostic imaging test. In the test, clinicians read images without and then with diagnostic aid consecutively, rather than at regular time intervals, so the educational effect is not known. In addition, the correctness criterion for the diagnostic imaging test was whether the clinician answered correctly on the basis of the presence or absence of a fracture.
Grad-CAM presented the heatmap as an indication of the fracture site, but it is unclear how much the heatmap contributed to the clinicians' reading of the images.

Conclusion
We developed a new CAD system for the diagnosis of hip fracture based on a deep learning algorithm. This system provided high accuracy, sensitivity, and specificity. The areas activated on the heatmap all corresponded to the areas pointed out by the orthopedic surgeon. The accuracy, sensitivity, and specificity of residents in the diagnosis of hip fracture improved when they used this CAD system. This system may aid residents in the diagnosis of hip fractures.

Consent to publish
This study adopted an opt-out approach. This research falls into the category of "research that does not use samples obtained from human subjects" in the "Ethical Guidelines for Medical Research Involving Human Subjects" under "research using existing samples and information held at the institution". In accordance with the guidance, we published the disclosure document of this research on each hospital's website for 30 days from the date of hospital director's approval, notifying the research subjects and guaranteeing them the opportunity to refuse the research.

Availability of data and materials
The datasets analyzed during the current study are available from the corresponding author on reasonable request.

Competing Interests
The authors RG and AK are employees of Search Space Co. Ltd., a startup company whose eventual products and services will be related to the subject matter of this article. No authors own shares in the above companies. SH, the last author, represents the AI research division of the nonprofit organization (NPO) Nagoya Orthopedic Regional Healthcare Support Center (https://www.fracture-ai.org/). The NPO's AI Research Division is a research division established for multicenter collaborative research. With the exception of the two Search Space Co. Ltd. employees and one NPO employee, no authors received compensation from these organizations.

Funding
Not applicable

Authors' Contributions
YS carried out the studies and drafted the manuscript. YT participated in its design and helped to draft the manuscript. RG, AK, and SH developed the computer-aided diagnosis system using a deep learning model. TA, YO, and HT recruited the participants and developed the image dataset. YS and TA analyzed the X-ray data. SH coordinated the team. All authors read and approved the final manuscript.