This study constructed a hybrid model combining XGBoost and SVM to predict fracture risk in elderly patients with osteoporosis from a real-world dataset, and identified risk factors for fracture using a comprehensive feature importance score. All candidate predictor variables were initially used as inputs to the model. The data passed through a wrangling pipeline that included missing-value imputation and handling of imbalanced labels. We then tested several machine learning models, among which the hybrid model combining XGBoost and SVM outperformed the other benchmarks.
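The imputation and rebalancing steps of such a pipeline can be sketched as follows. This is a minimal illustration under stated assumptions, not the pipeline used in this study: median imputation and random oversampling are stand-ins chosen for simplicity, and the function names (`column_medians`, `impute`, `random_oversample`) are hypothetical.

```python
import random
from statistics import median


def column_medians(train_rows):
    """Median of the observed (non-None) values in each column of the training set."""
    n_cols = len(train_rows[0])
    return [
        median([row[j] for row in train_rows if row[j] is not None])
        for j in range(n_cols)
    ]


def impute(rows, medians):
    """Replace None entries with the medians learned from the training set."""
    return [
        [medians[j] if value is None else value for j, value in enumerate(row)]
        for row in rows
    ]


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen rows of the smaller classes until all classes
    have as many rows as the largest one. Apply to the training split only."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels
```

The medians are learned from the training split and reused on held-out data, and resampling is restricted to the training split so that evaluation data keep their original class distribution.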
In medical applications, where high dimensionality and limited data are common, high prediction accuracy and model stability are crucial, and fusing XGBoost and SVM into a hybrid model offers several advantages. In simple terms, combining the two models could improve accuracy by capitalizing on the different aspects of the data that each model captures best. Concretely, the combination could strengthen the models' inherent robustness, given XGBoost's aptitude for handling noise and outliers and SVM's resistance to overfitting when appropriately kernelized and regularized. This can yield more reliable predictions, a highly desirable trait in medical applications. Additionally, the combined model can manage imbalanced data, another frequent concern in medical scenarios, by adjusting the weights on minority classes in XGBoost and by applying techniques such as SMOTE to the data used to train the SVM. Moreover, although SVM is often regarded as a "black box", the hybrid model retains a level of interpretability through its XGBoost component: examining feature importance offers valuable insight into which factors are driving predictions. XGBoost's capacity for capturing complex non-linear relationships and SVM's proficiency in binary and multiclass classification problems are particularly notable, and both models have demonstrated superior performance across diverse tasks [16].
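One common way to fuse two classifiers is soft voting, in which the predicted probabilities are averaged. The sketch below assumes this rule purely for illustration, since the fusion rule is not specified in this section; the helper names and the weight `w_xgb` are hypothetical. The first helper computes the negative-to-positive ratio that is conventionally used as a starting value for XGBoost's `scale_pos_weight` parameter when weighting the minority class.

```python
def negative_to_positive_ratio(labels):
    """Ratio of negative (0) to positive (1) labels, a conventional starting
    value for XGBoost's scale_pos_weight parameter."""
    positives = sum(1 for y in labels if y == 1)
    if positives == 0:
        raise ValueError("no positive examples in labels")
    return (len(labels) - positives) / positives


def soft_vote(p_xgb, p_svm, w_xgb=0.5):
    """Weighted average of two models' predicted fracture probabilities.
    w_xgb is a hypothetical tuning weight, not a value from the study."""
    if not 0.0 <= w_xgb <= 1.0:
        raise ValueError("w_xgb must lie in [0, 1]")
    if len(p_xgb) != len(p_svm):
        raise ValueError("probability lists must have equal length")
    return [w_xgb * a + (1.0 - w_xgb) * b for a, b in zip(p_xgb, p_svm)]


def classify(probabilities, threshold=0.5):
    """Turn fused probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in probabilities]
```

In practice the two probability lists would come from the fitted models; an SVM needs probability calibration to produce them (for example, scikit-learn's `SVC` with `probability=True`).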
Our comprehensive feature importance score combined four sources of information (weight, cover, gain, and SHAP) to identify the top 20 features, shown in Fig. 9. These features play important roles in the model and can be viewed as risk factors for fracture. Some of the features [17] found in this study are well known to be associated with osteoporosis or even fractures, such as smoking history [18], weight loss [19], DXA value of Lumbar spine 1, weight, age, sodium-blood, and DXA value of L1-L2 [7].
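A composite score of this kind can be sketched as follows. The aggregation rule here (min-max scaling of each measure followed by an unweighted mean) is an assumption made for illustration, as are the function names; the exact formula is not given in this section. In a real workflow the weight, gain, and cover values could be read from XGBoost's `Booster.get_score(importance_type=...)`, and the SHAP entry would typically be the mean absolute SHAP value per feature.

```python
def min_max(values):
    """Scale values to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def composite_importance(metrics):
    """Rank features by the mean of their min-max-scaled importances.

    metrics maps a measure name (e.g. "weight", "cover", "gain", "shap")
    to a dict of feature -> raw importance. Every measure must cover the
    same set of features.
    """
    features = sorted(next(iter(metrics.values())))
    totals = {feature: 0.0 for feature in features}
    for per_feature in metrics.values():
        scaled = min_max([per_feature[f] for f in features])
        for feature, score in zip(features, scaled):
            totals[feature] += score / len(metrics)
    # Highest composite score first.
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

Scaling each measure before averaging keeps a measure with a large numeric range, such as weight counts, from dominating the others; the top 20 entries of the returned list would then form the reported feature set.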
Moreover, several factors that have not been well documented in prior studies appeared to be related to osteoporosis and deserve attention: (i) bone metabolites: Calcium-blood and Alkaline phosphatase (ALP)-blood, (ii) inflammatory response markers: C reactive protein-blood, and (iii) indicators of lipid metabolism: Apolipoprotein A/B ratio-blood and High-density lipoprotein cholesterol (HDL-C)-blood. These five factors rank 1, 8, 4, 2, and 6, respectively, in our model, all of which are high rankings. The detailed SHAP diagram is shown in Fig. 10.
Our model predicted that calcium (Ca)-blood may be a high-risk factor for fracture. Although 99% of the body’s calcium is stored in the bones and only the remaining 1% is in the blood, blood calcium is much easier to measure than bone calcium; it is therefore a routine laboratory test for patients with osteoporosis. In patients with primary osteoporosis, blood calcium values are usually within the reference range. It is well established [20] that blood calcium and parathyroid hormone (PTH) levels regulate each other: a decrease in blood calcium ion concentration triggers a secondary increase in PTH secretion by parathyroid chief cells, which stimulates osteoclast (OC) proliferation, inhibits osteoblast (OB) activity, and results in osteoporosis. A recent study [21] in the Pakistani population suggested that patients with osteoporosis have significantly lower serum calcium than controls. Likewise, in our model, as shown in Fig. 10, when blood calcium levels decrease (blue), the predicted probability of fracture (SHAP value) increases. Alkaline phosphatase (ALP) is a hydrolase enzyme responsible for removing phosphate groups from many molecules, including nucleotides, proteins, and alkaloids [22]. It is one of the bone formation markers among the bone turnover biochemical markers (BTMs) and reflects osteoblast activity and bone formation status. Although not useful in the diagnosis of osteoporosis, ALP plays an important role in the differential diagnosis of several skeletal disorders, determination of bone turnover type, monitoring of treatment adherence, and evaluation of drug efficacy [23]; ALP levels are usually normal or mildly elevated in patients with primary osteoporosis. A recent study has also shown that serum total ALP activity > 129 U/L can serve as an indicator for osteoporosis in males [24], consistent with what our model predicts (shown in Fig. 10).
C-reactive protein (CRP), a non-specific and highly sensitive inflammatory biomarker, participates in the immune response [25]. Inflammatory processes are involved in a wide variety of physical health problems, and systemic chronic inflammation (SCI) often increases with age [26]. Studies [27, 28] also show that older individuals have higher circulating levels of cytokines, chemokines, and acute phase proteins, and greater expression of genes involved in inflammation. Moreover, SCI is persistent and ultimately causes collateral damage to tissues and organs over time, for example by inducing osteoporosis. In addition to age, physical inactivity was found to be directly associated with increased anabolic resistance, increased CRP levels, and increased levels of proinflammatory cytokines in healthy individuals [29]. These effects, in turn, promote several inflammation-related pathophysiologic alterations, including osteoporosis [30, 31]. Furthermore, several studies showed that increased CRP was linked to an increased rate of osteoporotic fracture [32, 33]. Overall, our model's predictions, shown in Fig. 10, agree with these findings.
The association between lipid and bone metabolism has become an increasing focus of interest in recent years [34]. Our model also identified two lipid metabolism-related indicators: Apolipoprotein A/B (apoA/B) ratio-blood and High-density lipoprotein cholesterol (HDL-C)-blood. A study by Dennison et al. [35], investigating the correlation between bone mineral density (BMD) and lipid profiles, observed that total spine BMD was inversely correlated with levels of apoA but positively associated with levels of apoB in males and females, which agrees with our model results. Regarding HDL-C, however, our model's trends do not appear to match exactly what has been reported. On the one hand, Ackert-Bicknell et al. concluded that there is sufficient evidence that bone metabolism and HDL-C are genetically linked, and that HDL-C can interact directly with osteoblasts and osteoclasts [36]. Yamaguchi et al. [37] investigated the correlation between plasma lipid levels and BMD and found that low levels of HDL-C were associated with an increased risk of vertebral fracture (similar to the trend predicted by our model). Nevertheless, another study reported the opposite result [35]. On the other hand, Hsu et al. [38] analyzed the association between plasma lipid profile and BMD, bone mineral content (BMC), and osteoporotic fractures in 7137 Chinese males, 4585 premenopausal females, and 2248 postmenopausal females, and detected no significant correlation between whole-body BMC and levels of HDL-C. Similarly, another study showed no association between HDL-C and BMD [39]. Several factors may be responsible for these discrepancies. Firstly, differences in non-modifiable characteristics of the subjects, including age, sex, and medication history, may have introduced bias. Secondly, differences in modifiable characteristics among the study subjects, including cigarette and alcohol consumption or physical activity, may also have led to bias. Apart from these reasons, the use of different research methodologies may have affected the results. Therefore, the relationship between lipid profile and bone metabolism warrants further investigation.
To the best of our knowledge, some of the remaining factors identified by our model, such as heart disease [17], Retinol binding protein (RBP)-blood [40], and Bilirubin (BIL)-blood [41, 42], also appear to be related to fractures, and future work should focus more on these indicators.
Several limitations of our work should be noted. Firstly, the retrospective and observational nature of our study may introduce unavoidable bias. Secondly, because the raw data are imbalanced, we oversampled and undersampled the training set, which might lead to deviation from the true values. Thirdly, the patient data came from a single center in China. Therefore, further research with larger samples and multiple centers is necessary to validate our model’s performance.