We developed a GNN model to predict acute SI within 2 weeks and validated it in an external test set, where it showed improved sensitivity compared with baseline models. MindWatchNet, an ensemble of GIN models trained with different sampling methods, achieved a sensitivity, specificity, accuracy, and AUC of 80.9%, 80.6%, 80.6%, and 0.877 (95% CI, 0.854–0.897), respectively, whereas an SVM achieved 15.03%, 99.81%, 98.72%, and 0.574 (95% CI, 0.552–0.597), respectively. Thus, MindWatchNet, which is based on a GIN, substantially improved the sensitivity at the cost of reduced specificity and accuracy. The low sensitivity of the baseline models means that many individuals who may attempt suicide would not be identified before an attempt, which can lead to irreversible events5. In contrast, the markedly higher sensitivity of MindWatchNet allows more individuals who may attempt suicide to be identified, suggesting that this model could be of considerable help in real-world practice.
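To make the relationship between these metrics concrete, the following Python sketch computes sensitivity, specificity, and accuracy from hard predictions, the AUC as a Mann–Whitney rank statistic, and a percentile-bootstrap 95% CI. It is an illustrative sketch, not the authors' evaluation code: the function names are hypothetical, and the percentile bootstrap is an assumption, since the CI method is not described in this section. The usage lines show why accuracy alone is misleading at a low prevalence: when 1 in 100 cases is positive, a classifier that always predicts SI-negative reaches 99% accuracy with 0% sensitivity.

```python
import numpy as np


def binary_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy from hard 0/1 predictions."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # SI-positive, predicted positive
    tn = np.sum(~y_true & ~y_pred)  # SI-negative, predicted negative
    fp = np.sum(~y_true & y_pred)
    fn = np.sum(y_true & ~y_pred)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / y_true.size,
    }


def auc(y_true, scores):
    """ROC AUC as P(score of a random positive > score of a random negative),
    counting ties as 1/2 (Mann-Whitney U divided by n_pos * n_neg)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)


def bootstrap_auc_ci(y_true, scores, n_boot=2000, alpha=0.05, seed=0):
    """Point AUC and a percentile-bootstrap (1 - alpha) CI (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    if y_true.all() or not y_true.any():
        raise ValueError("both classes are required to compute an AUC")
    rng = np.random.default_rng(seed)
    stats = []
    while len(stats) < n_boot:
        idx = rng.integers(0, y_true.size, size=y_true.size)
        if y_true[idx].all() or not y_true[idx].any():
            continue  # skip resamples that lost a class
        stats.append(auc(y_true[idx], scores[idx]))
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return auc(y_true, scores), (float(lo), float(hi))


# Usage: at a prevalence of 1%, "always SI-negative" looks accurate but finds no one.
y = np.array([1] + [0] * 99)
m = binary_metrics(y, np.zeros_like(y))
print({k: round(float(v), 2) for k, v in m.items()})
# {'sensitivity': 0.0, 'specificity': 1.0, 'accuracy': 0.99}
```

Accuracy is the prevalence-weighted average of sensitivity and specificity, so under severe imbalance it is dominated by specificity; this is why sensitivity and the AUC are the more informative metrics here.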
Our model achieved good performance by incorporating the following three factors. 1) The GNN extracts an informative graph embedding. The GIN29, a spatial GNN variant designed for graph classification, can extract more discriminative graph representations than other GNNs such as graph convolutional networks (GCNs): its sum aggregation followed by a multilayer perceptron makes it as powerful as the Weisfeiler–Lehman graph isomorphism test29, and, like other GNNs, it generalizes convolutional neural networks (CNNs) to non-Euclidean data that can be represented as graphs, such as brain connectivity30,31. 2) An ensemble method using under-sampling and over-sampling (i.e., SMOTE for nominal and continuous features (SMOTE-NC)32) was designed to handle class imbalance. 3) Rich information from multi-dimensional scales and clinico-demographic information from large multi-centre datasets was used: seven questionnaires covering domains such as depression, anxiety, resilience, and self-esteem, obtained from 31,720 individuals across four centres, including universities and hospitals. Jung et al.33 reported that the baseline models performed well in predicting SI in the past 12 months in a young population in which the proportion of positive cases was approximately 13 times that of the current data (12.4% vs. 0.97%; see Table 1). However, predicting acute SI within 2 weeks is more challenging. In the present study, given the severe class imbalance, the SVM without the ensemble method (a baseline model) could not extract a good representation of the positive cases, resulting in a much lower sensitivity (~15%) than that of MindWatchNet, even though its specificity and accuracy were nearly 99%. This finding suggests that class imbalance should be addressed, for example with the ensemble method, to prevent prediction bias towards the majority class (i.e., a model that always predicts SI-negative). Such handling may benefit any kind of model, but this analysis is beyond the scope of the current study.
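The resampling ensemble can be illustrated with a minimal sketch, assuming plain numeric feature vectors rather than graphs: each member is trained on either a randomly under-sampled or a SMOTE-over-sampled copy of the training set, and the members' probabilities are averaged (soft voting). For self-containment, plain SMOTE for continuous features stands in for SMOTE-NC (which additionally handles nominal features and is available as `SMOTENC` in the imbalanced-learn library), and a hypothetical nearest-centroid scorer stands in for a GIN; neither substitution reflects MindWatchNet's actual components.

```python
import numpy as np


def random_undersample(X, y, rng):
    """Keep every minority (y == 1) row plus an equal-sized random subset of majority rows."""
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    keep = np.concatenate([pos, rng.choice(neg, size=pos.size, replace=False)])
    return X[keep], y[keep]


def smote_oversample(X, y, rng, k=5):
    """Plain SMOTE (continuous features only): create synthetic minority rows on the
    segment between a minority row and one of its k nearest minority neighbours."""
    minority = X[y == 1]
    n_new = max(0, int((y == 0).sum()) - minority.shape[0])
    dist = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)  # a row is not its own neighbour
    k = min(k, minority.shape[0] - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]
    base = rng.integers(0, minority.shape[0], size=n_new)
    partner = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # uniform position along each segment
    synthetic = minority[base] + gap * (minority[partner] - minority[base])
    return np.vstack([X, synthetic]), np.concatenate([y, np.ones(n_new, dtype=y.dtype)])


def fit_centroid_scorer(X, y):
    """Hypothetical stand-in for a GIN: probability from relative distance to class centroids."""
    mu_pos, mu_neg = X[y == 1].mean(axis=0), X[y == 0].mean(axis=0)

    def predict_proba(Xq):
        d_pos = np.linalg.norm(Xq - mu_pos, axis=1)
        d_neg = np.linalg.norm(Xq - mu_neg, axis=1)
        return 1.0 / (1.0 + np.exp(d_pos - d_neg))  # > 0.5 when closer to the positive centroid

    return predict_proba


def fit_resampling_ensemble(X, y, n_members=4, seed=0):
    """Alternate under- and over-sampled training sets; average member probabilities."""
    rng = np.random.default_rng(seed)
    members = []
    for i in range(n_members):
        sampler = random_undersample if i % 2 == 0 else smote_oversample
        X_bal, y_bal = sampler(X, y, rng)
        members.append(fit_centroid_scorer(X_bal, y_bal))
    return lambda Xq: np.mean([m(Xq) for m in members], axis=0)


# Usage on synthetic, severely imbalanced data (about 2% positives).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (500, 2)), rng.normal(3.0, 0.5, (10, 2))])
y = np.array([0] * 500 + [1] * 10)
ensemble = fit_resampling_ensemble(X, y)
print(ensemble(np.array([[0.0, 0.0], [3.0, 3.0]])).round(2))
```

Resampling is applied only to the training data in this sketch; validation and test sets would keep their natural prevalence so that the reported sensitivity and specificity remain meaningful.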
Interestingly, our model can reveal not only feature importance but also associations among features. PHQ_2 and STAI-S had the highest saliency values; the former was associated with other PHQ-9 items, and the latter was associated with resilience and self-esteem (Fig. 2d).
We predicted MaDEs as a pseudo-label before predicting acute SI because pre-existing psychiatric disorders such as major depressive disorder (MDD) are known to increase suicide risk34, so this information should help to predict acute SI accurately. In MaDE prediction, all the conventional and GIN models achieved AUCs above 0.90 and sensitivities above 90%. This finding suggests that both the PHQ-9 and other scales, including the GAD-7, contributed to predicting the MaDE labels. The MaDE pseudo-labels were then used as input to predict acute SI. Although the odds of SI were 3.46 times higher in individuals with a MaDE than in those without (Fig. 2c), MaDE had low saliency, possibly because it is associated with SI indirectly through its associations with various PHQ-9 items and GAD_7 (“Feeling afraid, as if something awful might happen”) (Fig. 2d and Supplementary Fig. 3d). Interestingly, lifetime SA showed both the highest OR among the binary items (Fig. 2c) and a higher saliency score than MaDE. In addition, MaDE can be accurately predicted with conventional or GIN models. These results suggest that collecting information on lifetime SA and predicting MaDE with a model, instead of diagnosing it through structured interviews, is an efficient approach for survey-based screening for suicide risk. Moreover, the nearly identical attention plots for the training/validation set (Supplementary Fig. 3) and the test set (Fig. 2) suggest that the GIN, which models the relationships between scale items and clinico-demographic information in graph-structured data, may have extracted a common “scale and clinico-demographic signature” of acute SI.
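An OR compares odds rather than probabilities: an OR of 3.46 means that the odds of SI, p/(1 − p), are 3.46 times higher in the exposed group, which approximates a ratio of probabilities only when the outcome is rare. The sketch below computes an OR and a Wald 95% CI from a 2×2 table. The counts in the usage line are hypothetical, and the Wald interval is an assumed choice, not necessarily the method behind Fig. 2c.

```python
import math


def odds_ratio(a, b, c, d, z=1.96):
    """Odds ratio and Wald CI for the 2x2 table

                     outcome+  outcome-
        exposed         a         b
        unexposed       c         d

    OR = (a/b) / (c/d) = ad / bc; the CI is built on the log scale."""
    if min(a, b, c, d) == 0:
        raise ValueError("zero cell: add a continuity correction (e.g. 0.5) first")
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of log(OR)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, (lo, hi)


# Hypothetical counts, chosen only for illustration.
or_, (lo, hi) = odds_ratio(10, 20, 5, 40)
print(round(or_, 2), round(lo, 2), round(hi, 2))
```

Because the interval is symmetric on the log scale, it is asymmetric around the OR itself, which is the usual behaviour of reported OR confidence intervals.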
In the attention plots, the model identified salient items among the multi-dimensional questionnaires and other information (Fig. 2). Specifically, among the 19 questionnaire items, several PHQ-9 items (e.g., items 2, 4, 5, 6, and 8) and the RAS and STAI-S total scores showed high saliency values. Among these, depressed mood (PHQ_2) and high state anxiety (STAI-S total score) were the two most salient features. The first two items of the PHQ-9 assess the two cardinal symptoms of depression, i.e., PHQ_1 (anhedonia) and PHQ_2 (depressed mood or hopelessness)35. Depressed mood (PHQ_2) mediates the association between negative life events and SI22, and its severity is also strongly associated with SI36. In a network analysis of depressive symptoms, hopelessness (PHQ_2) was the most central node (with the highest betweenness centrality) in the symptom network, with strong connections between PHQ_2 and PHQ_9 (suicide) and between PHQ_2 and PHQ_6 (worthlessness), and a moderate connection between PHQ_2 and PHQ_1 (anhedonia)37; all of these are salient features for SI in the attention plot (Fig. 2b and Supplementary Fig. 3b). Another network analysis of anxiety and depressive symptoms revealed the same pattern, with connections between PHQ_2 and PHQ_1, PHQ_2 and PHQ_6, and PHQ_2 and PHQ_9, making PHQ_2 a central node38. In addition, psychomotor symptoms (PHQ_8, i.e., “moving or speaking so slowly that other people could have noticed or the opposite”) were another salient feature for acute SI (Fig. 2b). In a large population-based longitudinal study, anxiety disorders were found to be independent risk factors for suicidal behaviours (i.e., SI and SA), and comorbid anxiety and mood disorders were associated with an increased risk of SA2.
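Betweenness centrality scores a node by summing, over every pair of other nodes, the fraction of shortest paths between them that pass through it. The sketch below implements Brandes' algorithm for an unweighted, undirected graph in plain Python and applies it to a hypothetical toy symptom network whose edges loosely echo the connections described above; the edges are illustrative only and are not the networks estimated in the cited studies37,38.

```python
from collections import deque


def betweenness_centrality(adj):
    """Brandes (2001) betweenness centrality for an unweighted, undirected graph.

    adj maps each node to the set of its neighbours. Scores are unnormalized:
    for each unordered pair (s, t) of other nodes, a node earns the fraction of
    s-t shortest paths that pass through it."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        order, preds = [], {v: [] for v in adj}
        sigma, dist = dict.fromkeys(adj, 0), dict.fromkeys(adj, -1)
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in order of decreasing distance from s.
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted once from each endpoint.
    return {v: score / 2 for v, score in bc.items()}


# Hypothetical toy symptom network (illustrative edges only).
edges = [("PHQ_2", "PHQ_1"), ("PHQ_2", "PHQ_6"), ("PHQ_2", "PHQ_9"),
         ("PHQ_1", "PHQ_4"), ("PHQ_6", "PHQ_9")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

scores = betweenness_centrality(adj)
print(max(scores, key=scores.get), scores["PHQ_2"])  # PHQ_2 4.0
```

In this toy graph, PHQ_2 scores highest because every shortest path from PHQ_1 or PHQ_4 to PHQ_6 or PHQ_9 passes through it, which is the sense in which a "bridge" symptom is central.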
In our results, a high STAI-S total score was associated with increased acute SI, which is consistent with previous studies showing that both state and trait anxiety increase suicide risk39,40. Resilience has been reported to protect against symptoms of anxiety and depression and to strongly influence the associations between these symptoms and lifestyle factors41. This is consistent with findings that low resilience is strongly associated with mild depression, which might increase the risk of SI compared with no depression, and that psychological resilience is linked to social support42. Moreover, low resilience has been identified as a risk factor for suicidal behaviours43. In our study, a high RAS total score was associated with decreased SI, and vice versa, which is also consistent with previous studies showing that high resilience is one of the most protective features against SAs26,44.
In the ablation study of PHQ_9, the item related to the acute SI label, the model without PHQ_9 showed no significant difference in AUC compared with the model with this item (AUC = 0.869 vs. 0.877, respectively; p = 0.150), which indicates that the model did not “cheat” by predicting acute SI from PHQ_9 alone. In the validation of the true labels for acute SI, the model prediction score correlated more strongly with the KSSI score, a more accurate proxy for acute SI, than PHQ_9 did (ρ = 0.719 vs. 0.664, respectively; p = 0.005; see the Validity of the labels for acute SI section in the Results). The PHQ-9 was originally designed to screen for depression and assess its severity, not to assess suicide risk22. Interestingly, in a recent validation study, Na et al.45 showed that PHQ_9 is an insufficient tool for assessing suicide risk and SI because of its limited utility in certain clinico-demographic and clinical subgroups, which is in line with our results. Our results indicate that model-based predictions derived from multi-dimensional information are more valid than those based on a single question (i.e., PHQ_9 and the acute SI label) and could serve as an alternative to a structured interview or a suicide risk scale. While PHQ_9 itself may not be a valid measure of SI, our results (Fig. 2b) suggest that intermediate scores (i.e., 2–3 points) on this item should not be overlooked. The same applies to PHQ_6 (worthlessness) and PHQ_8 (psychomotor symptoms) (Fig. 2b).
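The ablation result rests on testing the difference between two correlated AUCs measured on the same subjects. The test used in the study is not stated in this section; as one assumed option, the sketch below uses a paired bootstrap: subjects are resampled with both models' scores kept together, and the observed AUC difference is divided by the bootstrap standard deviation to give a two-sided z-test. DeLong's test is a common analytic alternative. The score arrays in the usage lines are synthetic and hypothetical.

```python
import math

import numpy as np


def auc(y, s):
    """ROC AUC via the Mann-Whitney statistic (ties count 1/2)."""
    y = np.asarray(y).astype(bool)
    s = np.asarray(s, dtype=float)
    pos, neg = s[y], s[~y]
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (pos.size * neg.size)


def paired_bootstrap_auc_test(y, s_full, s_ablated, n_boot=2000, seed=0):
    """Two-sided test of AUC(full) - AUC(ablated) on the same subjects.

    Subjects are resampled with both scores kept together, so the correlation
    between the two models is preserved; z = observed difference / bootstrap SD."""
    y = np.asarray(y).astype(bool)
    if y.all() or not y.any():
        raise ValueError("both classes are required to compute an AUC")
    s_full = np.asarray(s_full, dtype=float)
    s_ablated = np.asarray(s_ablated, dtype=float)
    observed = float(auc(y, s_full) - auc(y, s_ablated))
    rng = np.random.default_rng(seed)
    diffs = []
    while len(diffs) < n_boot:
        idx = rng.integers(0, y.size, size=y.size)
        if y[idx].all() or not y[idx].any():
            continue  # skip resamples that lost a class
        diffs.append(auc(y[idx], s_full[idx]) - auc(y[idx], s_ablated[idx]))
    sd = float(np.std(diffs, ddof=1))
    if sd == 0.0:  # e.g. identical score vectors
        return observed, (1.0 if observed == 0.0 else 0.0)
    z = observed / sd
    return observed, math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))


# Usage with synthetic scores: the "ablated" model shares most of the signal.
rng = np.random.default_rng(1)
y = np.array([0, 1] * 100)
shared = rng.normal(0, 1.0, y.size)
with_item = y + shared
without_item = y + shared + rng.normal(0, 0.3, y.size)
print(paired_bootstrap_auc_test(y, with_item, without_item, n_boot=500))
```

Resampling subjects rather than each model's scores independently is the key design choice: it accounts for the strong correlation between the full and ablated models, which an unpaired comparison would ignore.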
It is worth noting that this multi-dimensional scale dataset was collected before the outbreak of COVID-19, so the specific presentations of mental illness, including depression and anxiety, arising from the consequences of the COVID-19 pandemic may not be reflected in the scales used in the present study. Further research is needed to explore the effectiveness of MindWatchNet during the COVID-19 pandemic. In addition, the true labels for acute SI could be improved by obtaining labels for suicidal behaviour from reference standards, such as structured interviews by clinicians for all participants; however, this process is time-consuming, impractical, and requires substantial research funding.
This study has several limitations. Because predicting MaDEs from a small dataset can lead to overfitting, the benefit of using MaDE as a pseudo-label46 to predict SI should be confirmed in future studies. The types of institutions in our data may not generalize to data obtained from other settings, such as workplaces. The type of institution showed relatively low saliency compared with lifetime SA or MaDE (Fig. 2c); nevertheless, its contribution for each individual might not be meaningful and must be interpreted carefully. Although beyond the scope of the current study, the impact of the edge and sparsity definitions of the graph on performance should be explored. To generalize our results from young adults to other populations, further studies covering a wider age range are needed. Longitudinal cohort studies are needed to investigate factors that can predict future SAs or new SI cases. Verification studies are needed to determine whether predicting SI instead of SAs is effective in preventing SAs in the real world.
In conclusion, we developed and validated a deep-learning-based complementary tool that uses deep features extracted from multi-dimensional self-report questionnaires (covering depression, anxiety, resilience, and self-esteem) and clinico-demographic information from a large dataset to instantaneously predict suicide risk and monitor responses to suicide prevention strategies. This tool might be useful in remote clinical practice for the general population of young adults in specific situations such as the COVID-19 pandemic.