The present study investigated the prediction of incident coronary heart disease (CHD) using machine learning techniques and explored the key risk factors associated with CHD occurrence.
This study revealed that comorbidities such as hypertension, metabolic syndrome, and diabetes were more prevalent among individuals with incident CHD, consistent with previous research [35, 36]. The presence of these comorbidities suggests the need for comprehensive management and preventive strategies targeting multiple risk factors in individuals at risk of CHD.
Optimal machine learning model
Our study evaluated the performance of five distinct machine learning models for identifying individuals at risk of CHD. All five models discriminated well between individuals with and without incident CHD, as reflected in their high AUC values. LR, a conventional method, demonstrated its utility in classifying binary outcomes under the assumption of a linear relationship between the input variables and CHD incidence. It serves as a straightforward prediction method and provides a baseline for comparison with nonparametric machine learning models [37]. Nevertheless, LR is not well suited for the analysis of nonlinear and high-dimensional datasets. SVM, with its capacity to handle both linear and nonlinear relationships via a variety of kernels, is a robust alternative [24–26]. Moreover, tree-based ensemble models such as RF, XGBoost, and LightGBM have demonstrated their ability to capture complex data patterns [23, 28, 30, 31, 38]. In summary, RF was the top performer, achieving the highest AUC and the highest values on the other evaluation metrics, which underscores its effectiveness in CHD prediction. This finding aligns with similar studies [18, 27].
The predictive modeling results demonstrated that the RF model outperformed the other supervised models in predicting incident CHD, which is in line with the findings of similar studies [39, 40].
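To make the AUC-based comparison concrete, the following minimal sketch computes the AUC as the probability that a randomly chosen individual with incident CHD receives a higher predicted score than a randomly chosen individual without it, and then selects the model with the highest value. The labels, scores, and model names (`model_a`, `model_b`) are hypothetical toy values for illustration only; they are not data or results from this study, and this is not necessarily how the AUC was computed here.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of the AUC.

    labels: iterable of 0/1 outcomes (1 = event).
    scores: iterable of predicted risks, same length as labels.
    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC is undefined unless both classes are present")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: outcomes and predicted risks from two models.
labels = [0, 0, 1, 1, 0, 1]
model_scores = {
    "model_a": [0.10, 0.40, 0.35, 0.80, 0.20, 0.70],
    "model_b": [0.30, 0.60, 0.20, 0.90, 0.50, 0.40],
}

# Compare the models by AUC and keep the best one.
aucs = {name: roc_auc(labels, s) for name, s in model_scores.items()}
best = max(aucs, key=aucs.get)

for name, value in aucs.items():
    print(f"{name}: AUC = {value:.3f}")
print("best model:", best)
```

An AUC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two classes. The pairwise loop is quadratic in sample size, so a rank-based implementation would be preferable for a cohort-sized dataset.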
Key risk factors for CHD
Regarding the key risk factors identified in this study, blood pressure variables, particularly systolic blood pressure, were found to play a significant role in predicting incident CHD. This finding aligns with the well-established association between elevated blood pressure and the development of CHD [41, 42]. Non-HDL cholesterol and glucose levels also emerged as important predictors of incident CHD, highlighting the significance of dyslipidemia and impaired glucose metabolism in CHD risk [43–45].
Age, a traditional risk factor, retained its significance in predicting incident CHD, reflecting the cumulative exposure to various risk factors over time [46, 47]. Metabolic syndrome, characterized by a cluster of metabolic abnormalities, also emerged as a noteworthy risk factor for incident CHD. This finding highlights the importance of considering the collective impact of metabolic factors in assessing CHD risk [36, 48]. An inverse relationship was observed between incident CHD and both HDL-c and eGFR. This relationship is consistent with the expected protective effects of higher HDL cholesterol levels [49] and better renal function [50] in reducing CHD risk.
Comparison of risk factors in our study with Framingham and Suita scores and other studies
Both the Framingham and Suita scores were developed primarily to predict CHD. The Framingham risk score incorporates six coronary risk factors: age, sex, smoking habits, blood pressure, total cholesterol, and HDL cholesterol [51]. The Suita score, which is tailored to the Japanese population, encompasses similar factors but adds an assessment of chronic kidney disease (CKD) stage, and it outperformed the original Framingham risk score in predicting CHD [52]. The risk factors identified by our machine learning models overlap with those included in both the Framingham and Suita scores. However, noteworthy differences arise from the distinct methodologies employed. Unlike traditional scores, which rely on predefined variables, our models selected risk factors based on their predictive capabilities. This data-driven approach introduces an element of objectivity, allowing for the identification of novel risk factors and potentially refining CHD risk assessment beyond the constraints of preestablished variables.
Comparing these results with those of previous studies, the identified variables align with established risk factors for incident CHD, including age, blood pressure, lipid profile, and metabolic abnormalities [2, 36, 48]. However, the specific ranking of these variables based on SHAP values provides additional insights into their relative importance in predicting CHD incidence.
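The idea of ranking variables by SHAP values can be illustrated with a minimal sketch. It assumes a linear model with features treated as independent, for which the SHAP value of feature j for one individual is the coefficient multiplied by the deviation of that feature from its mean; global importance is then the mean absolute SHAP value across individuals. This is a simplification chosen for illustration and is not the tree-based computation that would apply to the RF model. The feature names, values, and coefficients below are hypothetical and are not results from this study.

```python
from statistics import mean


def linear_shap_values(X, weights):
    """SHAP values for a linear model, assuming independent features.

    phi[i][j] = weights[j] * (X[i][j] - mean of feature j)
    """
    n_features = len(weights)
    col_means = [mean(row[j] for row in X) for j in range(n_features)]
    return [
        [weights[j] * (row[j] - col_means[j]) for j in range(n_features)]
        for row in X
    ]


def rank_by_mean_abs_shap(shap_values, feature_names):
    """Rank features by mean absolute SHAP value, largest first."""
    importance = {
        name: mean(abs(row[j]) for row in shap_values)
        for j, name in enumerate(feature_names)
    }
    return sorted(importance.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical features, toy measurements, and illustrative coefficients.
feature_names = ["sbp", "non_hdl", "age"]
X = [
    [120.0, 3.0, 50.0],
    [140.0, 4.0, 60.0],
    [160.0, 5.0, 70.0],
]
weights = [0.02, 0.30, 0.01]

shap_values = linear_shap_values(X, weights)
ranking = rank_by_mean_abs_shap(shap_values, feature_names)

for name, value in ranking:
    print(f"{name}: mean |SHAP| = {value:.3f}")
```

Because each SHAP value is expressed on the scale of the model output, the ranking reflects how much each variable moves the prediction on average, not merely whether it is statistically associated with the outcome.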
The present work highlights the capacity of machine learning techniques to capture complex relationships and identify previously unrecognized risk factors in CHD risk evaluation. Machine learning approaches offer a flexible framework for investigating the complex nature of CHD. The association between elbow joint thickness and CHD observed in this study is an example of how machine learning can surface previously overlooked candidate risk factors, although further research is needed to confirm this association. This finding underscores the advantages of using machine learning algorithms to uncover hidden patterns in large datasets, ultimately assisting researchers and clinicians in making data-informed decisions. In contrast, conventional analytical methods are constrained by predefined rules and models established by experts, which may limit their ability to provide novel insights that extend beyond explicit definitions.
Limitations
The generalizability of this study is open to question. The focus on a specific population might restrict the applicability of the findings to diverse populations with varying genetic, environmental, and lifestyle factors. In addition, the complexity of machine learning models can make interpretation difficult. Although these models exhibit enhanced predictive accuracy, the lack of transparency in their internal workings might obscure the underlying mechanisms involved. This "black-box" nature might hinder the translation of results into actionable insights. Therefore, additional research is needed to externally validate the findings of our study and to assess the efficacy of machine learning algorithms in clinical practice.
In conclusion, this study emphasizes the predictive value of machine learning models for accurately identifying individuals at risk of incident CHD. Additionally, the use of machine learning techniques has led to the identification and exploration of novel risk factors associated with incident CHD. The incorporation of these findings into clinical practice has the potential to enhance CHD risk assessment and improve patient outcomes.