Background and Objectives: Cardiovascular disease (CVD) remains a leading cause of death worldwide, and early detection is critical for effective intervention. Diabetes, especially in its early stages, shares many pathophysiological features with CVD, which makes it a significant predictor of cardiovascular risk. This study explores the relationship between early-stage diabetes symptoms and CVD risk by developing and evaluating predictive models based on logistic regression and Random Forest algorithms.
Methods: The study used two publicly available datasets: the Early Stage Diabetes Risk Prediction Dataset (520 instances) and the Heart Failure Clinical Records Dataset (299 instances). Data preprocessing included median imputation of missing values and binary encoding of categorical variables. Feature engineering involved creating a symptom severity score by summing key diabetes-related symptoms. Logistic regression and Random Forest models were trained on 80% of the data and tested on the remaining 20%.
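The preprocessing and feature-engineering steps described in the Methods can be illustrated with a minimal, standard-library sketch. It is not the study's code: the symptom column names, the `preprocess` and `split_80_20` helpers, and the random seed are all assumptions made for illustration, since the abstract does not list which symptoms enter the severity score.

```python
import random
from statistics import median

# Hypothetical column names: the abstract does not say which symptoms were summed.
SYMPTOMS = ["polyuria", "polydipsia", "sudden_weight_loss", "weakness"]


def to_binary(value):
    """Encode a Yes/No style answer as 1/0."""
    return 1 if str(value).strip().lower() in {"yes", "1", "true"} else 0


def preprocess(rows):
    """Return new rows with binary symptoms, median-imputed age, and a severity score.

    Symptom answers are assumed complete; only the numeric `age` column
    may contain missing values (None) in this sketch.
    """
    cleaned = [
        {"age": r["age"], **{s: to_binary(r[s]) for s in SYMPTOMS}} for r in rows
    ]
    # Median imputation: fill missing ages with the median of the observed ages.
    age_fill = median(r["age"] for r in cleaned if r["age"] is not None)
    for r in cleaned:
        if r["age"] is None:
            r["age"] = age_fill
        # Symptom severity score: the count of symptoms that are present.
        r["severity"] = sum(r[s] for s in SYMPTOMS)
    return cleaned


def split_80_20(rows, seed=0):
    """Shuffle with a fixed seed, then hold out 20% of the rows for testing."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(0.2 * len(shuffled))
    return shuffled[n_test:], shuffled[:n_test]
```

With 299 records, `split_80_20` would hold out 60 rows for testing; the models themselves would then be fitted on the remaining rows with any standard machine learning library.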
Findings: The Random Forest model outperformed logistic regression, achieving an accuracy of 81.4%, an AUC of 0.88, and a balanced accuracy of 83.5%, compared with an accuracy of 76.3% and an AUC of 0.78 for logistic regression. The performance difference between the models was statistically significant (p = 0.015). Serum creatinine, ejection fraction, and age were identified as significant predictors of heart failure risk.
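For readers unfamiliar with the three metrics reported above, the following sketch shows how accuracy, balanced accuracy, and AUC are conventionally defined for a binary classifier. The function names are illustrative, and the sketch does not reproduce the study's evaluation code or its significance test, which the abstract does not specify.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity (recall on class 1) and specificity (recall on class 0)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive case outscores a random
    negative case, with ties counted as one half (the rank-based definition)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

Balanced accuracy is the more informative of the two accuracy figures when the outcome classes are imbalanced, because it weights recall on each class equally.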
Conclusion: Symptoms associated with early-stage diabetes can be effective predictors of heart failure risk, with Random Forest showing strong predictive performance. These findings highlight the potential of machine learning in early detection of high-risk patients, facilitating timely interventions.