In this study, we used DL to classify bone status into three stages (normal, osteopenia, and osteoporosis) from abdominal CT, which is already widely used in clinical practice. With this DL model, CT images acquired for other medical purposes can be used to screen for patients with undetected osteoporosis risk without additional cost or radiation exposure. Models were built separately on CT images alone, on demographic data alone, and on a combination of the two. The multimodal model performed well for multi-classification, with an AUC of 0.94 and an ACC of 0.80.
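The three-class metrics quoted in this discussion can be derived from a confusion matrix in a one-vs-rest fashion. The following is a minimal sketch, not the pipeline used in the study: the function names, the class ordering, and the toy labels are illustrative assumptions, and AUC is omitted because it requires predicted probabilities rather than hard labels.

```python
# Minimal sketch (assumed names and toy data, not the study's code):
# accuracy and one-vs-rest sensitivity, specificity, and precision
# for a three-class problem, computed from a confusion matrix.
from typing import Dict, List, Tuple

CLASSES = ["normal", "osteopenia", "osteoporosis"]  # assumed label order 0, 1, 2


def confusion_matrix(y_true: List[int], y_pred: List[int], n_classes: int = 3) -> List[List[int]]:
    """Rows are true classes, columns are predicted classes."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def one_vs_rest_metrics(cm: List[List[int]]) -> Tuple[float, Dict[str, Dict[str, float]]]:
    """Return overall accuracy and per-class one-vs-rest metrics."""
    total = sum(sum(row) for row in cm)
    accuracy = sum(cm[i][i] for i in range(len(cm))) / total
    per_class = {}
    for k, name in enumerate(CLASSES):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                    # true class k, predicted otherwise
        fp = sum(row[k] for row in cm) - tp     # predicted k, true class differs
        tn = total - tp - fn - fp
        per_class[name] = {
            "sensitivity": tp / (tp + fn) if tp + fn else 0.0,  # identical to recall
            "specificity": tn / (tn + fp) if tn + fp else 0.0,
            "precision": tp / (tp + fp) if tp + fp else 0.0,
        }
    return accuracy, per_class


if __name__ == "__main__":
    # Toy labels for illustration only.
    y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
    y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]
    acc, metrics = one_vs_rest_metrics(confusion_matrix(y_true, y_pred))
    print(f"accuracy = {acc:.2f}")
    for name, m in metrics.items():
        print(name, {k: round(v, 2) for k, v in m.items()})
```

Sensitivity and recall name the same quantity, so a single value is computed for both. How the per-class values are averaged into one summary figure (macro or weighted) is a reporting choice that this sketch leaves open.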
In general, osteoporosis has a high prevalence and is difficult to detect before a fracture occurs. [24] Therefore, most older patients miss the opportunity to reduce the fracture risk caused by decreased bone density. Women > 65 and men > 70 years of age are exposed to many factors that can cause osteoporosis, such as low body weight and a history of previous fractures. [25] Although DXA is the representative method used to diagnose osteoporosis, it has some disadvantages, such as a low utilization rate and radiation exposure. Moreover, there are practical challenges, such as the need for specialized knowledge of the equipment and limited accessibility due to low penetration rates. Consequently, an alternative approach to osteoporosis detection uses bone data derived from abdominal CT scans, a method widely recognized as a valuable tool for osteoporosis screening. [16, 22, 26, 27] Abdominal CT covers the spine and femur and is known to be very useful for accurately assessing osteoporosis risk in the region of interest. Proximal femoral fractures account for only a small proportion of osteoporotic fractures but carry high morbidity and mortality rates.
Thus, our study offers several novelties for the multi-classification of osteoporosis into three stages using the femur region on abdominal CT images. First, our results provide an opportunity to overcome the shortcomings of DXA and to respond quickly to the potential risk of the disease using widely available, easily obtained CT images and demographic data. In particular, we used real-world clinical data to classify individuals into the normal, osteopenia, and osteoporosis stages simultaneously, based on the femur region in abdominal CT. Previous studies have primarily focused on classifying individuals as either normal or having osteoporosis, with limited attention given to osteopenia, the stage that lies between the two (Table 3).
Using the multimodal dataset, we obtained an AUC of 0.94, ACC of 0.81, precision and sensitivity of 0.8, and specificity and recall of 0.1. These outcomes surpass those reported in several recent studies. [19, 28] Notably, our results exceed the performance achieved by machine learning-based X-ray image analysis in many existing studies. [23, 28, 29] Zhang et al. performed multi-classification using the same CNN model as our team but reported an AUC of 0.81 and ACC of 0.6, lower than our results. [28] Liu et al. performed binary classification for each pair of the three groups (normal, osteopenia, osteoporosis) and reported an AUC of 0.88 for normal versus osteopenia, 0.87 for normal versus osteoporosis, and 0.75 for osteopenia versus osteoporosis. [29] Yamamoto et al. reported relatively high performance, with an AUC of 0.93 and ACC of 0.88, for classification between normal and osteoporosis. [23] However, these studies used X-ray images and were conducted on fewer patients than our study. In addition, Yasaka et al.’s study exhibited superior performance to ours in classifying normal and osteopenia; however, generalization would be difficult because the validation dataset in that study was very small, unlike ours. [30] Like our study, that study used CT images. However, by performing multi-class classification, our approach offers greater accessibility and applicability in clinical practice. Second, our study presents the explanatory potential of DL models.
One prevalent issue with DL-based approaches is the ‘black box’ nature of these models, which hinders a clear understanding of their internal processes. [31] Misinterpretations by AI-based models can lead to incorrect diagnoses, emphasizing the need for model validation. In our study, we employed Grad-CAM to visualize the regions underlying the model's inference and confirmed that the feature regions used by the model aligned with clinically relevant areas (Fig. 4). [32–34] The Grad-CAM analysis showed that feature extraction concentrated on the femoral neck, in line with the clinical knowledge that the femoral neck is the area most vulnerable to osteoporosis. Furthermore, our study carries implications for expanding its clinical applications. By collecting data spanning age groups from 20 to 70, we can support multi-classification of osteoporosis across various age brackets. Moreover, our findings remain applicable even for individuals with scoliosis or those who have undergone femur-related surgery, as we use the entire thigh region observable in abdominal CT scans. Additionally, by using all femoral bone images within abdominal CT scans, covering the femoral neck, head, and shaft, we obtain input data that need minimal preprocessing in most cases. This presents significant potential for the rapid adoption of computer-aided diagnosis. Lastly, we conducted a comprehensive analysis to identify errors that may have arisen from differences in data distribution while classifying osteoporosis (Supplementary Fig. 1). Classification is expected to be difficult at the boundary points that demarcate the stages, given that disease risk varies with the T-score even within the same stage. To visualize these discrepancies, we used the T-score. Our analysis confirmed that most classifications between normal and osteoporosis were accurate, while errors were concentrated at the transitional boundaries between disease stages.
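The boundary effect described above can be illustrated by grouping misclassified cases according to how far their T-score lies from the nearest stage threshold. The sketch below assumes the standard WHO T-score criteria (normal at -1.0 or above, osteoporosis at -2.5 or below, osteopenia in between). The function names, the 0.5 margin, and the sample values are hypothetical and are not taken from the study.

```python
# Minimal sketch under stated assumptions: WHO T-score thresholds,
# a hypothetical margin of 0.5, and made-up example cases.
from typing import List, Tuple

NORMAL_CUTOFF = -1.0        # T-score >= -1.0: normal (WHO criterion)
OSTEOPOROSIS_CUTOFF = -2.5  # T-score <= -2.5: osteoporosis (WHO criterion)


def who_stage(t_score: float) -> str:
    """Map a DXA T-score to its stage."""
    if t_score >= NORMAL_CUTOFF:
        return "normal"
    if t_score > OSTEOPOROSIS_CUTOFF:
        return "osteopenia"
    return "osteoporosis"


def boundary_distance(t_score: float) -> float:
    """Distance from the T-score to the nearest stage threshold."""
    return min(abs(t_score - NORMAL_CUTOFF), abs(t_score - OSTEOPOROSIS_CUTOFF))


def split_errors_by_margin(
    t_scores: List[float], predicted: List[str], margin: float = 0.5
) -> Tuple[List[float], List[float]]:
    """Return T-scores of misclassified cases near and far from a threshold."""
    near, far = [], []
    for t, pred in zip(t_scores, predicted):
        if who_stage(t) != pred:
            (near if boundary_distance(t) <= margin else far).append(t)
    return near, far


if __name__ == "__main__":
    # Hypothetical cases: two errors sit just beside a threshold, one does not.
    t_scores = [0.3, -0.9, -1.2, -2.4, -2.6, -3.4]
    predicted = ["normal", "osteopenia", "osteopenia",
                 "osteoporosis", "osteoporosis", "normal"]
    near, far = split_errors_by_margin(t_scores, predicted)
    print("errors near a boundary:", near)
    print("errors far from a boundary:", far)
```

A larger share of errors in the near group than in the far group would be consistent with the pattern reported in Supplementary Fig. 1.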
However, this study has several limitations. First, performance was verified only for contrast-enhanced abdominal CT data. Although CT is a common imaging modality, it seldom provides BMD information in the clinic, owing to technical difficulties. Thus, DXA is required to measure BMD at the expense of additional radiation exposure. Furthermore, abdominal CT is a frequently acquired medical image and is generally used for patients without kidney function abnormalities or contrast agent side effects. Accordingly, we conducted our study using contrast-enhanced abdominal CT, which showed high accuracy in osteoporosis classification. Second, there was a time gap between the examinations collected in this study. To collect as much data as possible, we included CT scans acquired within 3 months before or after the DXA scan. Therefore, two or more CT images may be matched to a single DXA examination for the same patient. Third, data were collected from a single medical institution, so it is difficult to establish whether the model performs equally well on data from other institutions or CT equipment. In the future, we plan to collect more data from several machines and hospitals to reduce bias and increase robustness. Fourth, we confirmed that using structured clinical data and unstructured CT image data together improved performance compared with using either data type alone. However, the results obtained using only images were not significantly different from those obtained using multimodal data. This finding suggests that the demographic data used in this study had only a minor effect. Improved performance can be expected by additionally using clinical variables directly related to bone density, such as medication and disease history; however, this may come at the cost of how widely the model can be used in clinical practice.
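The pairing rule behind the second limitation, in which CT scans taken within 3 months of a DXA examination are matched to it, can be sketched as follows. This is an illustration under stated assumptions rather than the study's actual procedure: "3 months" is approximated as 90 days, and the function name, record layout, and patient identifiers are hypothetical.

```python
# Minimal sketch (assumed 90-day window, hypothetical records):
# pair each DXA examination with the same patient's CT scans
# acquired within the window before or after it.
from collections import defaultdict
from datetime import date, timedelta
from typing import Dict, List, Tuple

WINDOW = timedelta(days=90)  # assumption: "3 months" treated as 90 days


def match_ct_to_dxa(
    dxa_exams: List[Tuple[str, date]],
    ct_scans: List[Tuple[str, date]],
    window: timedelta = WINDOW,
) -> Dict[Tuple[str, date], List[date]]:
    """Map each (patient_id, dxa_date) to the sorted CT dates inside the window."""
    ct_by_patient = defaultdict(list)
    for patient_id, ct_date in ct_scans:
        ct_by_patient[patient_id].append(ct_date)

    matches = {}
    for patient_id, dxa_date in dxa_exams:
        matches[(patient_id, dxa_date)] = sorted(
            ct_date
            for ct_date in ct_by_patient[patient_id]
            if abs(ct_date - dxa_date) <= window
        )
    return matches


if __name__ == "__main__":
    # Hypothetical records for illustration only.
    dxa = [("P1", date(2020, 6, 1)), ("P2", date(2020, 6, 1))]
    ct = [
        ("P1", date(2020, 4, 15)),  # 47 days before the DXA exam
        ("P1", date(2020, 7, 20)),  # 49 days after the DXA exam
        ("P1", date(2021, 1, 5)),   # outside the window
        ("P2", date(2019, 12, 1)),  # outside the window
    ]
    for key, dates in match_ct_to_dxa(dxa, ct).items():
        print(key, "->", dates)
```

In this toy case one DXA examination is paired with two CT scans, which is the situation the limitation describes. When several scans are matched to the same patient, splitting training and test data by patient rather than by image is a common way to keep related images out of both sets.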