Background: Suicide results from a complex interaction of factors. Most classical statistical methods are not efficient enough to capture this complexity. Statistical (machine) learning, a newer branch of statistics, makes it possible to model complex relationships between risk factors and responses.
Methods: We aimed to identify high-risk groups for suicide using three classification methods, namely logistic regression (LR), decision tree (DT), and random forest (RF), and to compare the prediction accuracy of the models. This study used data from a cross-sectional study conducted at the Hamadan University of Medical Sciences in 2015-2016 to investigate the prevalence of suicidal ideation and related risk factors among university students. The LR, DT, and RF models were used to identify the high-risk group for suicide. Finally, the three models were compared using sensitivity (SE), specificity (SP), and the area under the receiver operating characteristic (ROC) curve.
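As an illustration of the three comparison criteria, the following is a minimal sketch in plain Python, not the code used in the study. The function names, the 0.5 classification threshold, and the toy labels and scores are assumptions made for this example only; the area under the ROC curve is computed through its rank interpretation, that is, the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (SE, SP) for binary labels, where 1 marks the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison (ties count 1/2)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: true labels and predicted probabilities from any model.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.7, 0.4, 0.6, 0.2, 0.1, 0.3, 0.8]

# Assumed cut-off of 0.5 to turn probabilities into class predictions.
y_pred = [1 if s >= 0.5 else 0 for s in scores]

se, sp = sensitivity_specificity(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"SE={se:.2f}, SP={sp:.2f}, AUC={auc:.4f}")
```

Computing the same three quantities separately on the training and validation samples for each model gives the kind of comparison described above. SE and SP depend on the chosen threshold, whereas the area under the ROC curve does not.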
Results: In the training sample, the area under the ROC curve of the DT was greater than that of the LR and RF. In the validation sample, however, the RF model had the best performance and the DT had the worst performance among these methods.
Discussion: In this study, the risk factors for suicide differed between men and women. According to the results of the DT, substance abuse, average, general health score, faculty of education, and depression were risk factors for suicidal ideation in both genders. In addition, despair about the future and residence (parents' house/dormitory) were among the factors contributing to suicidal ideation in men. On the other hand, parents' education, interest in the discipline, and anxiety were factors influencing suicidal ideation in women. The results of the RF indicated that depression, general health score, average, anxiety, and substance abuse were important risk factors for suicidal ideation in both genders. Also, faculty of education and age were risk factors for suicide in women.
Conclusions: In the training sample, the DT performed better, but in the validation sample, the RF model provided better results. The LR was the best model for identifying patients, and the DT and RF were the best models for identifying healthy individuals.