Automated and Precise Bone Mineral Density Prediction and Fracture Risk Assessment using Hip/Lumbar Spine Plain Radiographs via Learning Deep Image Signatures and Correlations

Dual-energy X-ray absorptiometry (DXA) and the Fracture Risk Assessment Tool are recommended tools for osteoporotic fracture risk evaluation, but are underutilized. We present a novel and fully automated tool to identify fractures, predict bone mineral density (BMD), and evaluate fracture risk using plain pelvis and lumbar spine radiographs. The performance of this tool was evaluated in 1693 and 11908 patients with pelvis or lumbar spine radiographs and DXA, respectively. The model was well calibrated for hip and spine BMD assessments with minimal or no bias. The area under the curve and accuracy were 0.89 and 92.4% for hip osteoporosis, 0.87 and 86.8% for spine osteoporosis, 0.92 and 94.6% for high 10-year major fracture risk, and 0.92 and 92.2% for high hip fracture risk, respectively. The success rates of our automated algorithm in a real-world test were 85.3% and 90.4% for hip and spine, respectively. The clinical use of this automated tool may increase the likelihood of identifying high-risk patients in previously unscreened populations.


Introduction
Osteoporosis is a common bone disease 1 that poses an increasing global health burden. 2 All major types of osteoporosis-related fragility fractures are associated with chronic pain, disability, functional dependence, 3 and enhanced morbidity. 4 In addition, patients with fragility fractures have a two- to threefold increase in mortality, 5 despite the availability of effective anti-osteoporotic drugs. 6 Dual-energy X-ray absorptiometry (DXA) is the current preferred modality for measurement of bone mineral density (BMD) in the human hip or lumbar spine, which is the essential component of the fracture risk assessment tool (FRAX™) used to estimate the 10-year risk of hip or major osteoporotic fracture. 7 Currently, both DXA and FRAX™ are underutilized, despite their usefulness in the identification of patients at risk and in treatment decision-making processes. 8 Among Medicare beneficiaries ≥ 65 years of age, only 30% of women and 4% of men were tested for BMD with DXA. 9 In addition, only 10.2% of female 10 and 6% of male patients 11 with fragility fractures have undergone BMD testing before the index fracture. Opportunistic screening for osteoporosis using imaging modalities other than DXA is a potential strategy to effectively and feasibly stratify the unscreened population into distinct risk groups regarding osteoporosis and fragility fractures. For example, several studies used computed tomography (CT)-based metrics to classify osteoporosis, 12 simulate DXA T-scores, 13 and predict fracture outcomes. 14 However, the performance, radiation dose, and population coverage of CT-based screening strategies are barriers to their use in clinical settings.
Unlike DXA and CT, plain radiography has greater availability, broader indications, lower radiation dose, and lower overall costs. The spatial resolution of radiographs is excellent, allowing the visualization of fine bone texture, which is correlated with bone density 15 and can distinguish patients with osteoporotic fractures from controls. 15,16,17 Therefore, an automated tool based on hip or spine radiographs for identifying hip fracture and vertebral compression fracture (VCF), predicting BMD, and evaluating fracture risk can help identify patients with greater fracture risk among individuals undergoing radiography of the hip or spine for other reasons. Deep learning algorithms have achieved performance superior to traditional methods in many visual recognition tasks, including object detection, localization, and classification. 18 The abilities of deep neural networks to learn, identify and optimize essential task-effective image features and decision-making functions from large amounts of images are essential to their success in terms of fracture detection, 19 retinopathy grading, 20 and lung nodule identification. 21 On the basis of our clinical experience, we hypothesized that changes in fine bone microarchitecture associated with the process of age-related bone loss could be visualized on radiographs and used for reliable prediction of BMD by visual recognition models. To test this hypothesis, we proposed and validated a novel deep learning method to predict BMD using hip and spine radiographs on paired data collected from radiographs and DXA. Our deep learning models have a three-step workflow: extraction of the region of interest (ROI) enabled by a highly robust deep adaptive graph landmark detection algorithm; automated quality control to exclude unqualified images from BMD estimation; and deep neural network joint processing of the ROI and the patient's clinical information to calculate the BMD. 
We calculated the predicted BMD-based 10-year risks of major osteoporotic fracture and hip fracture, then compared these risk estimates with risks determined using DXA-based BMD measurements. To fully automate the process with regard to reproducibility, robustness, and performance reliability, we created a set of additional algorithms to localize the ROIs (proximal femur or L1-4 lumbar vertebrae), identify hip fracture or VCF, and check the radiograph quality to ensure that implants and foreign bodies were absent from the ROIs (Figure 1). The automated precise ROI localization, hip and L1-4 vertebrae segmentation, detection of hip fracture and VCF, quality check for the images, inference of BMD, and FRAX risk reporting were packaged into a single tool, which was implemented on the inference server for clinical application.

Results
From 2006 to 2017, 25960 patients with paired DXA-pelvis radiographs (17.0% of patients with pelvis radiographs) and 72059 patients with paired DXA-lateral radiographs of the lumbar spine (16.8% of patients with lumbar radiographs) were screened to identify hip and spine cohorts for analysis. The first data pairs from patients with DXA and radiographs performed within 180 days were included. For patients with multiple DXA examinations, the earliest examination was used as the index DXA. For each index DXA, the radiographs with the shortest interval to DXA were chosen. After the exclusion of patients without complete data, patients with data obtained using a GE DXA scanner, and patients with radiographs of inadequate quality, 3295 patients in the hip cohort (training set: 1602; testing set: 1693 patients), and 16908 patients in the spine cohort (training set: 5000; testing set: 11908 patients) were included in the analysis (Figure 2). No patient was included in more than one group. The mean age was 66.6 ± 10.8 years, and the median time between DXA and spine radiographs was 15 days (interquartile range, 5-43 days).
After quality assessment to exclude unsuitable vertebrae, 33299 lumbar vertebrae were included in the analysis (70.0%). The mean BMD per vertebra was 0.852 ± 0.191 g/cm², which was significantly greater than the mean predicted value (0.846 ± 0.172 g/cm²; P < 0.001); however, this difference was trivial and not clinically meaningful.
These trends were similar across L1-L4 and both age and sex strata, but the differences were not statistically significant. In the spine testing set, 5747 patients (48.3%) had osteoporosis (T-score ≤ −2.5 in the vertebra with the lowest T-score). Table 2 summarizes the model performance to predict BMD using hip or lumbar spine radiographs. Pearson's correlation coefficients between DXA-measured and model-predicted BMD were 0.93 for the hip and 0.92 for the lumbar spine, indicating strong correlations. Linear regression showed that predicted BMD closely tracked measured BMD (hip: R² = 0.87, root mean square error = 0.056; spine: R² = 0.86, root mean square error = 0.065). The model was well calibrated in the hip (slope = 0.998, calibration-in-the-large = 0.001), as shown in the calibration plot (Figure 3a). For the lumbar spine BMD, model prediction tended to slightly underestimate BMD, although the difference was trivial and not clinically significant (Figure 3b). Bland-Altman analysis of BMD indicated no significant differences between predicted and measured hip BMD (bias of −0.001 g/cm²; 95% confidence interval, −0.004 to 0.001). A small bias of −0.005 g/cm² (95% confidence interval, −0.006 to −0.004) was noted for lumbar spine BMD prediction. As shown in Table 2, the model performance remained consistent across various age and sex strata, demonstrating that the algorithm was robust. Table 3 illustrates the discriminatory performance of the model to classify hip or spine osteoporosis and identify patients with greater 10-year risks of major osteoporotic fractures (≥ 20%) and hip fractures (≥ 3%). The algorithm provided a high degree of discrimination for osteoporosis (area under the receiver operating characteristic curve [AUC], 0.89 for the hip and 0.87 for the spine). The overall accuracies were 92.4% for hip osteoporosis and 86.8% for lumbar spine osteoporosis.
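The agreement statistics reported above (Pearson's correlation, root mean square error, and Bland-Altman bias) can be illustrated with a minimal sketch. It assumes two paired lists of BMD values; the function name `agreement_metrics`, the sample values, and the normal-approximation confidence interval are illustrative assumptions and do not reproduce the Stata analysis used in the study. The root mean square error here is computed directly from the paired differences, not from a fitted regression.

```python
import math
from statistics import mean, stdev

def agreement_metrics(measured, predicted):
    """Return Pearson r, RMSE, and Bland-Altman bias with an approximate 95% CI."""
    n = len(measured)
    mx, my = mean(measured), mean(predicted)
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, predicted))
    sxx = sum((x - mx) ** 2 for x in measured)
    syy = sum((y - my) ** 2 for y in predicted)
    r = sxy / math.sqrt(sxx * syy)
    # Differences are predicted minus measured, so a negative bias means underestimation.
    diffs = [y - x for x, y in zip(measured, predicted)]
    rmse = math.sqrt(mean(d * d for d in diffs))
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs) / math.sqrt(n)  # normal approximation (assumption)
    return {"r": r, "rmse": rmse, "bias": bias,
            "ci": (bias - half_width, bias + half_width)}

# Illustrative values only (g/cm^2); these are not data from the study.
measured = [0.62, 0.71, 0.85, 0.93, 1.04]
predicted = [0.64, 0.70, 0.83, 0.95, 1.01]
result = agreement_metrics(measured, predicted)
```

With real data, the two lists would hold the DXA-measured and model-predicted BMD for the same hips or vertebrae.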
The median FRAX 10-year major fracture and hip fracture risks did not significantly differ when scores were based on the predicted BMD (10.81% and 2.81%, respectively; P = 0.79) and when scores were based on the measured BMD (10.68% and 2.78%, respectively; P = 0.74). The classification performances regarding patients with high 10-year risks of major osteoporotic fractures and hip fractures were better than the osteoporosis classification performance, with AUCs of 0.92 and 0.92, and accuracies of 94.6% and 92.2%, respectively. As shown in Table 4, the network performances for classification of osteoporosis and identification of patients with high risks of major and hip fractures were robust across all age and sex groups, despite significant differences in the association strength (P < 0.001).
Next, we packaged the ROI localization, fracture detection, image quality check, BMD estimation, osteoporosis detection, and FRAX risk evaluation into two standalone tools for the hip and spine, respectively. We implemented the tools in the central inference platform connected to the picture archiving and communication system (PACS) in the Chang Gung Memorial Hospital (Linko branch). The hospital PACS transferred all newly acquired images to the inference platform on a daily basis.
In total, 7353 consecutive pelvis radiography examinations were conducted from March 2020 to November 2020. The tool identified 1013 radiographs with bilateral total hip replacement, hip fractures, or other image quality issues that may impede BMD estimation. BMD was successfully predicted for the remaining 6271 images (85.3%). From November 2020 to January 2021, we collected 11291 consecutive lateral radiographs of the lumbar spine. The tool identified 1084 radiographs with VCFs, implants, vertebroplasty, or other features that may impede BMD estimation. The success rate for producing a predicted BMD from a single spine radiograph was 90.4% (10208 radiographs).

Discussion
Osteoporosis is a silent disease before fragility fractures, which often lead to multiple morbidities and increased mortality in affected patients. 4 Previous studies estimated that one in three women and one in five men aged > 50 years will experience fragility fractures in their lifetime. 22,23 There is increasing evidence regarding the effectiveness and cost-effectiveness of therapeutic agents in the prevention of fragility fractures. 24,25 Therefore, population-based screening is imperative for the identification of at-risk patients and implementation of preventive measures. However, current DXA-based programs screen fewer than one-third of eligible women and one-tenth of eligible men. 9 Therefore, osteoporosis screening based on DXA seems inadequate. At Chang Gung Memorial Hospital (CGMH), approximately 17% of patients with pelvis or spine radiographs had previously undergone DXA-based assessment of BMD. This study developed an automated, reliable tool to evaluate fracture risk using hip or spine radiographs to effectively broaden the screening population and increase the number of identifiable high-risk patients.
The performance of the tool was robust with DXA as the reference and compared favorably with non-DXA modalities used to classify osteoporosis, such as quantitative bone ultrasound (AUC, 0.762), 26 CT-based opportunistic screening using CT attenuation of the spine (AUC, 0.83), 12 and machine-learning-based T-score simulation (accuracy, 82%). 13 In addition to effective identification of patients with osteoporosis, the tool accurately predicted FRAX risk and identified patients with high risks of major osteoporotic (AUC, 0.92; accuracy, 94.6%) or hip fractures (AUC, 0.92; accuracy, 92.2%). Our real-world clinical assessment using consecutive pelvis and spine radiographs also demonstrated that 85.3% of patients with pelvis radiographs and 90.4% of patients with spine radiographs could be automatically screened for osteoporosis and evaluated for future fracture risk. Importantly, most such patients had never been screened by DXA. Taken together, the results of this study demonstrated that the radiograph-based screening tool could accurately identify patients with osteoporosis and high fracture risk in previously unscreened populations.
BMD is not the only determinant of fracture risk. The National Osteoporosis Risk Assessment study found that 82% of osteoporotic fractures occurred in women with T-score > −2.5, and 67% occurred in women with T-score > −2.0. 27 Other risk factors (e.g., history of osteoporotic fracture) are essential for the identification of high-risk patients. However, many patients with occult hip fractures and VCFs are asymptomatic, and are often diagnosed with other imaging modalities. 12,28 We exploited the excellent spatial resolution of radiographs to identify unsuspected fragility fractures during the preprocessing and quality control process, prior to estimation of BMD. For hip fracture detection, we incorporated our previously published PelviXNet algorithm. 29 We also developed a vertebral fracture assessment algorithm based on a Deep Adaptive Graph network, which determines anatomical landmarks for standard six-point vertebral morphometry that facilitates VCF detection using the widely accepted semiquantitative Genant visual method. 30,31 The overall model performance improved after the exclusion of hip or vertebral fractures. The integrated process automated the identification of hip and vertebral fractures, providing initial quality control for BMD estimation. This process also identified clinical risk factors for fragility fractures, without a requirement for clinical input. Therefore, our tool could evaluate fragility fracture risk based on a single radiograph (existing fractures and predicted BMD) and its age and sex metadata. However, other patient-related clinical risk factors (e.g., history of hip fracture, comorbidity, medication, and lifestyle) require input from electronic medical records.
Opportunistic screening for osteoporosis using other imaging modalities has been assessed previously. The best studied strategy is the use of abdominal CT to predict BMD; 13,32,33 classify osteoporosis based on CT attenuation, 12 simulated BMD, 32,33 T-score, 13 or detection of osteoporotic fractures; 34 or use imaging biomarkers to predict the risk of fractures. 14 An earlier study compared the CT Hounsfield units over a manually annotated ROI involving vertebral body trabecular bone with its paired DXA T-score; this approach for detection of osteoporosis yielded an AUC of 0.83. 12 A deep learning-based model provided a better correlation between predicted and reference values, but its validation included only small datasets. 13,32,33 A study testing the performance of simulated T-scores on a larger dataset of 1843 CT-DXA pairs achieved an accuracy of 82% to detect osteoporosis. 13 This algorithm was integrated with VCF identification and CT trabecular density as biomarkers, and its performance for the prediction of 5-year fracture risks was compared with the performance of FRAX alone (i.e., without BMD input). This CT-based predictor provided automated risk evaluation using CT-derived metrics and compared favorably with FRAX prediction. 14 Osteoporosis and fragility fracture risk have also been assessed on dental, 35,36 hip, 37 and spine radiographs, 36 as well as magnetic resonance imaging. 38 These studies demonstrated the feasibility of using non-DXA modalities to expand opportunistic screening to a broader population at risk, although the applicability and usability of such tools in real clinical settings are questionable.
In contrast, the present study provided a fully automated tool enabling opportunistic screening for osteoporosis and evaluation of fragility fracture risk using plain radiographs of the hip and spine. Our tool utilizes ubiquitous, low-cost radiographs that involve substantially lower radiation exposure than CT-based tools, thus maximizing the likelihood that eligible populations will be screened, regardless of DXA or CT scan availability. Our tool can assess both the hip and spine, and is therefore not limited to the spine alone (e.g., during evaluation with CT-based tools).
Furthermore, we envision that other musculoskeletal radiographs may also be used to predict bone density and risks of fracture, regardless of the original purpose of the images. This strategy requires no additional patient time or radiation exposure and involves minimal costs, but may substantially improve the risk profiling for fragility fractures.
This study had several limitations. First, Chang Gung Memorial Hospital is a medical center in which the patients tend to have more severe disease. A large proportion of patients have fractures or implants. Our study population may not have represented the healthier population, which is the target of osteoporosis screening. However, because the tool was developed based on this more complex population, the ROI localization, quality check, and BMD prediction processes can presumably be readily adapted to populations with fewer complications. Second, the calculation of FRAX in this study did not consider past medical history, medication use, family medical history, or lifestyle (e.g., alcohol consumption and smoking status) because this information requires input from the hospital information system.
However, the performance assessment should not change because these parameters are identical for FRAX based on the DXA-measured or model-predicted BMD. For clinical implementation, the tool can be modified to report full FRAX results when digital data are available. Third, the tool was created using the reference BMD values reported by Hologic DXA scanners alone, although both Hologic and GE DXA scanners are actively used at Chang Gung Memorial Hospital. Systematic differences in BMD measurement and reporting between DXA manufacturers hampered the tool's performance in our early experiments. Manufacturer-specific models may be needed in some clinical settings. Fourth, the performance of the prediction tool is influenced by radiograph image quality. In addition to existing fractures, accurate BMD prediction may be impeded by foreign bodies, implants, bowel gas, and bone pathologies (e.g., avascular necrosis or severe osteoarthritis). The actual rate of radiographs that could be evaluated for BMD and fracture risk surpassed 85% in our real-world test. Depending on a patient's specific indications, radiographs are often examined repeatedly. Therefore, the per-patient success rate will potentially increase as more radiographs become available over time.
This study demonstrated that a robust opportunistic screening tool for osteoporosis and fracture risk assessment, based on conventional radiographs obtained for various indications, was able to provide VCF detection, BMD, and fracture risk estimation in a fully automated process. This tool leveraged state-of-the-art deep learning algorithms to provide a more efficient strategy for population-based opportunistic screening with minimal or no additional cost. The integration of this automated tool into the hospital information system may increase the likelihood of identifying high-risk patients in previously unscreened populations.

Hypothesis and study design
This retrospective cohort study was performed to test the hypothesis that an automated deep neural network-based tool could effectively predict BMD and risk of fragility fractures using plain radiographs of the pelvis and lumbar spine. This tool is a collection of algorithms to identify and segment regions of interest (hip or lumbar spine), check for factors that would influence BMD prediction (e.g., image quality/positioning, existing hip or vertebral fractures, implants, and/or foreign bodies), and subsequently predict hip and vertebral BMD and fracture risk. We compared the predicted BMD with the BMD measured by central DXA. We also calculated the 10-year risks of hip and major osteoporotic fractures using FRAX tools (https://www.sheffield.ac.uk/FRAX/). The fracture risk prediction performance was compared between algorithm-predicted BMD and DXA-measured BMD.

Setting
This study was approved by the Institutional Review Board at the Chang Gung Memorial Hospital (Taiwan) and was conducted in accordance with the tenets of the Declaration of Helsinki. The requirement for informed consent was waived because the data presented in this paper were fully de-identified to protect patient confidentiality. This study was performed using data from Chang Gung Memorial Hospital, the largest private hospital system in Taiwan, which includes seven acute hospitals with 10050 beds, that received 8. We collected consecutive pelvis radiographs conducted between March 2020 and November 2020 and spine radiographs conducted between November 2020 and February 2021.

BMD measurement
Proximal femoral and lumbar spine DXA scans were performed using a Hologic QDR- Hip T-scores were calculated using the revised NHANES III white female reference values. 39,40 Because there is no international reference standard for the lumbar spine BMD, lumbar T-scores were calculated using the manufacturer's reference values. For each patient, the lowest T-score of the femoral neck or lumbar vertebrae was used to categorize osteoporosis or calculate FRAX risk.
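The categorization rule described above (lowest T-score across sites, with osteoporosis defined as T-score ≤ −2.5) can be sketched as follows. The reference mean and standard deviation in this example are hypothetical placeholders, not NHANES III or manufacturer values, and a single reference pair is applied to all vertebrae only to keep the example short.

```python
def t_score(bmd, ref_mean, ref_sd):
    """T-score: distance of a BMD value from the young-adult reference mean, in SD units."""
    return (bmd - ref_mean) / ref_sd

def has_osteoporosis(t_scores, threshold=-2.5):
    """Apply the threshold to the lowest T-score among the evaluated sites."""
    return min(t_scores) <= threshold

# HYPOTHETICAL reference values (g/cm^2), for illustration only.
REF_MEAN, REF_SD = 1.00, 0.10

# Illustrative per-vertebra BMD values; not data from the study.
site_bmd = {"L1": 0.80, "L2": 0.78, "L3": 0.74, "L4": 0.82}
scores = {site: t_score(bmd, REF_MEAN, REF_SD) for site, bmd in site_bmd.items()}
osteoporotic = has_osteoporosis(scores.values())
```

In practice, each site would use its own reference values, as described in the text.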

Acquisition and preprocessing of radiographs
The radiographs were collected from the PACS and anonymized before the study procedure. The images were converted to grayscale and resized to a resolution of 0.15 mm × 0.15 mm pixel spacing, then stored as 12-bit images. A deep adaptive graph (DAG) landmark detection method was developed to formulate the anatomical landmarks of the pelvis and spine as graphs, and to robustly and accurately detect these landmarks. 41 We detected 16 anatomical landmarks on hip radiographs, including 12 landmarks on the pelvic boundary and four landmarks on the femoral head and trochanter. We detected six anatomical landmarks for each of the lumbar vertebrae on spine radiographs from L1 to L4. Based on the detected anatomical landmarks, ROIs were extracted from the radiographs and used as input for the BMD prediction model. For hip radiographs, ROIs of the left and right hips were extracted. For the lumbar spine, ROIs were extracted for each vertebra from L1 to L4. Examples of the detected anatomical landmarks and ROIs are shown in Figure 1. A schematic representation of the pipeline and models used to predict BMD is shown in Figure 1.
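The resampling to 0.15 mm pixel spacing and the 12-bit storage described in this section can be illustrated with a minimal sketch. The interpolation method is not stated in the text, so nearest-neighbour sampling is used here purely as an assumption; the function names and the tiny 2 × 2 image are hypothetical.

```python
def target_shape(rows, cols, spacing_mm, target_mm=0.15):
    """Output size such that each output pixel covers target_mm x target_mm."""
    row_sp, col_sp = spacing_mm
    return (round(rows * row_sp / target_mm), round(cols * col_sp / target_mm))

def resample_nearest(pixels, out_shape):
    """Nearest-neighbour resampling of a 2-D list (assumed interpolation method)."""
    in_rows, in_cols = len(pixels), len(pixels[0])
    out_rows, out_cols = out_shape
    return [
        [pixels[min(in_rows - 1, int(r * in_rows / out_rows))]
               [min(in_cols - 1, int(c * in_cols / out_cols))]
         for c in range(out_cols)]
        for r in range(out_rows)
    ]

def to_12bit(pixels, src_max):
    """Linearly rescale non-negative intensities to the 12-bit range 0..4095."""
    return [[round(p * 4095 / src_max) for p in row] for row in pixels]

# Hypothetical 16-bit image acquired at 0.30 mm pixel spacing.
raw = [[0, 65535], [32768, 65535]]
shape = target_shape(len(raw), len(raw[0]), (0.30, 0.30))
resampled = resample_nearest(raw, shape)
stored = to_12bit(resampled, 65535)
```

Halving the pixel spacing from 0.30 mm to 0.15 mm doubles the number of rows and columns, which is why the example image grows from 2 × 2 to 4 × 4.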

Anatomical landmark detection via Deep Adaptive Graph
The anatomical landmarks were detected using DAG, a method introduced in our previous publication. 41 In DAG, the anatomical landmarks are formulated as a graph whose vertices $\mathbf{v}$ are the landmark positions. A global transformation GCN first locates the anatomy coarsely, yielding the vertices $\mathbf{v}^{1}$; local refinement GCNs then update the vertices iteratively as $\mathbf{v}^{t+1} = \mathbf{v}^{t} + \Delta\mathbf{v}^{t}$, where $\Delta\mathbf{v}^{t}$ is the displacement estimated by the local refinement GCN at the $t$-th step.
During training, the training loss is calculated for both the global transformation GCN and the local refinement GCNs. Because the goal of global transformation GCN is to locate the anatomy coarsely, the following margin loss is used: $L_{\mathrm{global}} = \left[ \mathbb{E}\left( \lVert \mathbf{v}^{1} - \hat{\mathbf{v}} \rVert_{1} \right) - m \right]_{+}$, where $[u]_{+} = \max(0, u)$; $\mathbf{v}^{1}$ and $\hat{\mathbf{v}}$ denote the globally transformed and ground truth vertices, respectively; and $m$ is a hyperparameter representing a margin that aims to achieve high robustness for coarse landmark detection and forgive small errors. To encourage the local refinement GCNs to learn a precise localization, L1 loss is directly applied to all vertices after the refinements, written as follows: $L_{\mathrm{local}} = \mathbb{E}\left( \lVert \mathbf{v}^{T} - \hat{\mathbf{v}} \rVert_{1} \right)$, where $\mathbf{v}^{T}$ denotes the vertices after the last local refinement GCN. The graph edge weights are treated as learnable parameters, which are initialized randomly at the beginning of training and updated via back-propagation during training. In our experiment, the hip and spine DAG models were trained using 3306 and 1076 pelvic and spine radiographs with expert annotations, respectively.
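The two training losses described here, a margin loss for the coarse global stage and an L1 loss for the refined vertices, can be sketched in plain Python. This is an illustrative reading that assumes the losses average the per-vertex L1 distance over all vertices; the function names and coordinates are hypothetical, and the actual DAG implementation may differ.

```python
def mean_l1(pred, target):
    """Mean over vertices of the L1 distance between predicted and true 2-D points."""
    total = sum(abs(px - tx) + abs(py - ty)
                for (px, py), (tx, ty) in zip(pred, target))
    return total / len(pred)

def global_margin_loss(v_global, v_true, margin):
    """Margin loss: errors below the margin are forgiven (loss is zero)."""
    return max(0.0, mean_l1(v_global, v_true) - margin)

def local_l1_loss(v_last, v_true):
    """L1 loss on the vertices after the last local refinement step."""
    return mean_l1(v_last, v_true)

# Hypothetical landmark coordinates in pixels.
v_true = [(10.0, 10.0), (20.0, 20.0)]
v_global = [(11.0, 10.0), (20.0, 22.0)]   # coarse estimate, mean L1 error 1.5
v_last = [(10.0, 10.5), (20.0, 20.0)]     # refined estimate, mean L1 error 0.25

loss_forgiven = global_margin_loss(v_global, v_true, margin=2.0)
loss_penalized = global_margin_loss(v_global, v_true, margin=1.0)
loss_local = local_l1_loss(v_last, v_true)
```

The margin makes the coarse stage indifferent to small errors, leaving precise localization to the refinement stage, which is penalized on every residual pixel.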

Automated radiograph quality assessment procedure
Some medical conditions may affect the hip and vertebra anatomy, making plain films unsuitable for BMD estimation. The most common conditions include implantation (e.g., total hip replacement or spine fusion) and fracture. Therefore, we conducted an automated quality assessment to exclude hips and vertebrae with implants or fractures that were unsuitable for BMD prediction. If a vertebra met any of the three criteria, it was considered abnormal and excluded from downstream processing. These criteria only detected apparent moderate to severe compression fractures to avoid ambiguity in determining mild or borderline deformities.
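As an illustration of how six vertebral landmarks can support a moderate-to-severe compression check, the sketch below computes anterior, middle, and posterior heights and grades the height loss with the commonly cited Genant semiquantitative cut-offs (20-25% mild, 25-40% moderate, > 40% severe). It is not the study's three criteria, which are not reproduced in this text; comparing the smallest with the largest height within the same vertebra is an assumption made for brevity, and all names and coordinates are hypothetical.

```python
import math

def heights(landmarks):
    """landmarks: {'ant'|'mid'|'post': (top_xy, bottom_xy)} -> height per column."""
    return {name: math.dist(top, bottom) for name, (top, bottom) in landmarks.items()}

def height_loss(landmarks):
    """Fractional loss of the smallest height relative to the largest (assumption)."""
    h = heights(landmarks)
    return 1.0 - min(h.values()) / max(h.values())

def genant_grade(loss):
    """Map fractional height loss to a Genant-style grade (0 = normal .. 3 = severe)."""
    if loss > 0.40:
        return 3
    if loss >= 0.25:
        return 2
    if loss >= 0.20:
        return 1
    return 0

def exclude_from_bmd(landmarks):
    """Exclude only apparent moderate or severe deformities (grade 2 or 3)."""
    return genant_grade(height_loss(landmarks)) >= 2

# Hypothetical wedge-shaped vertebra: anterior height 20, posterior height 30 (pixels).
wedge = {"ant": ((0, 0), (0, 20)), "mid": ((15, 0), (15, 26)), "post": ((30, 0), (30, 30))}
normal = {"ant": ((0, 0), (0, 30)), "mid": ((15, 0), (15, 30)), "post": ((30, 0), (30, 30))}
```

Restricting exclusion to grade 2 or higher mirrors the stated intent of avoiding ambiguity in mild or borderline deformities.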

Algorithm development and training procedure for BMD prediction
We developed a deep learning algorithm to estimate the hip/spine BMD from each corresponding ROI. The neural network used a VGG-16 with batch normalization and squeeze-and-excitation block as the backbone to encode the input image. Compared with deeper and more complex backbone networks (e.g., ResNet and DenseNet), our empirical results indicated that a VGG-16 block with a shallower architecture achieved better BMD prediction performance. We hypothesized that the visual patterns correlated with the BMD were at lower levels (e.g., texture and cortical bone structure), which could be effectively modeled by shallow networks (i.e., no greater object-level abstraction is needed). Because patient age and sex were correlated with BMD, we added this information to the neural network to assist in BMD prediction. In particular, features extracted by the VGG-16 block were first flattened and processed by a fully connected layer to obtain a feature vector of length 512. The patient age and sex information were represented by two values and concatenated with the feature vector. Because L1-L4 vertebrae have distinct BMD statistics, the vertebra index information was required by the model to accurately predict the BMD. Therefore, in the spine model, the vertebra index (from L1 to L4), encoded by a one-hot vector of length 4, was also concatenated with the feature vector (in addition to the encoded patient age and sex information). During training, ROIs were augmented by random affine transformation and subsequently resized to −0.2 to +0.2 and the contrast from −0.2 to 0.2). The L1 distance between the predicted BMD and the ground truth BMD obtained from DXA was regarded as the training loss. A fourfold cross-validation procedure was conducted, and ensemble learning was adopted to combine the predictions of the four trained models during inference.
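The way image features are combined with patient information, and the way the four cross-validation models are combined at inference, can be sketched as follows. The sketch assumes raw age and a 0/1 sex code as the "two values", and assumes simple averaging for the ensemble; neither detail is specified in the text. The stand-in models are hypothetical callables, not trained networks.

```python
def build_input(image_features, age, sex, vertebra_index=None):
    """Concatenate the 512 image features with age, sex, and (spine only) a one-hot L1-L4 index."""
    if len(image_features) != 512:
        raise ValueError("expected a feature vector of length 512")
    extra = [float(age), float(sex)]  # encoding of age and sex is an assumption
    if vertebra_index is not None:
        one_hot = [0.0] * 4
        one_hot[vertebra_index - 1] = 1.0  # vertebra_index is 1..4 for L1..L4
        extra += one_hot
    return list(image_features) + extra

def l1_loss(predictions, targets):
    """Mean absolute difference between predicted and DXA-measured BMD."""
    return sum(abs(p - t) for p, t in zip(predictions, targets)) / len(predictions)

def ensemble_predict(models, x):
    """Average the predictions of the fold models (averaging is an assumption)."""
    return sum(model(x) for model in models) / len(models)

# Hypothetical stand-ins for the four fold models; each returns a BMD in g/cm^2.
fold_models = [lambda x: 0.80, lambda x: 0.90, lambda x: 0.85, lambda x: 0.85]

spine_input = build_input([0.0] * 512, age=70, sex=1, vertebra_index=3)
hip_input = build_input([0.0] * 512, age=70, sex=1)
predicted_bmd = ensemble_predict(fold_models, spine_input)
```

The spine input has length 518 (512 + 2 + 4) and the hip input has length 514 (512 + 2), matching the concatenation described in the text.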

Implementation details
Deep learning models were developed on a workstation with a single Intel Xeon E5- The overall discriminative abilities to discern osteoporosis and high-risk patients were evaluated using the AUC. Other measures were also calculated, including sensitivity, specificity, positive predictive value, and negative predictive value.
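The discrimination measures listed above can be computed as in the following sketch, which treats the DXA-based FRAX risk as the reference and the predicted-BMD-based risk as the test result, using the ≥ 20% major fracture threshold. The risk values are invented for illustration, the function names are hypothetical, and the AUC is computed with the pairwise (Mann-Whitney) definition instead of the Stata routines used in the study.

```python
def confusion_metrics(y_true, y_pred):
    """Sensitivity, specificity, PPV, NPV, and accuracy from binary labels.

    Assumes every denominator is non-zero (both classes present and predicted).
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "accuracy": (tp + tn) / len(y_true),
    }

def auc(y_true, scores):
    """Probability that a positive case scores higher than a negative case (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t]
    neg = [s for t, s in zip(y_true, scores) if not t]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative 10-year major fracture risks (%); not data from the study.
reference_risk = [5.0, 12.0, 22.0, 30.0, 18.0, 25.0]   # from DXA-measured BMD
predicted_risk = [6.0, 21.0, 24.0, 28.0, 15.0, 19.0]   # from model-predicted BMD

THRESHOLD = 20.0  # high risk of major osteoporotic fracture
y_true = [r >= THRESHOLD for r in reference_risk]
y_pred = [r >= THRESHOLD for r in predicted_risk]
metrics = confusion_metrics(y_true, y_pred)
area = auc(y_true, predicted_risk)
```

The same functions apply to the hip fracture threshold (≥ 3%) and to osteoporosis classification with T-score ≤ −2.5, provided the score is oriented so that higher values indicate the positive class.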
Analyses were conducted using Stata software, version 16 (StataCorp, College Station, TX, USA).

Figure 1
Schematic representation of the workflow for hip and spine BMD estimation.

Figure 2
Flowchart of the study population.