Predicting Suicide Among US Veterans Using Natural Language Processing-enriched Social and Behavioral Determinants of Health

Despite recognition of the critical association between social and behavioral determinants of health (SBDH) and suicide risk, SBDH extracted from unstructured electronic health record (EHR) notes remains underutilized in suicide predictive modeling. This study investigates the impact of SBDH, identified from both structured and unstructured data using a natural language processing (NLP) system, on suicide prediction within 7, 30, 90, and 180 days of discharge. Using EHR data of 2,987,006 Veterans between October 1, 2009, and September 30, 2015, from the US Veterans Health Administration (VHA), we designed a case-control study demonstrating that incorporating structured and NLP-extracted SBDH significantly enhances the performance of three architecturally distinct suicide predictive models: elastic-net logistic regression, random forest (RF), and multilayer perceptron. For example, RF achieved notable improvements in suicide prediction within 180 days of discharge, with an increase in the area under the receiver operating characteristic curve from 83.57% to 84.25% (95% CI = 0.63%-0.98%, p < 0.001) and in the area under the precision-recall curve from 57.38% to 59.87% (95% CI = 3.86%-4.82%, p < 0.001) after integrating NLP-extracted SBDH. These findings underscore the potential of NLP-extracted SBDH to enhance suicide prediction across various prediction timeframes, offering valuable insights for healthcare practitioners and policymakers.


Introduction
Suicide has consistently ranked among the leading causes of mortality in the US for decades, with a substantial 35.6% increase from 2000 to 2021 1. In 2021 alone, suicide accounted for 48,183 fatalities in the US 1, while the global toll surpassed 700,000 2,3. Existing data indicate a higher suicide rate among Veterans than non-veteran adults over the last decade, and notably, Veterans are experiencing a more pronounced increase in suicide risk 4.
Prior studies found that 80% of suicide victims were in contact with their primary care providers in the year preceding their death, and within the same timeframe, 25.7-31% had sought mental health care 5,6. This puts healthcare providers in a unique position to intervene, and a better predictive tool may assist them in mitigating the prospective risk of suicidal events. Social and behavioral determinants of health (SBDH) encompass factors such as socioeconomic status, access to healthy food, education, and housing that wield strong influence over an individual's health outcomes 7. Prior studies established strong relationships between SBDHs and suicidal behaviors 8-12. For example, social disruptions (e.g., relationship dissolution, financial insecurity, legal problems, and exposure to childhood adversity) exhibit significant associations with suicidal behaviors 8,12-15. However, leveraging SBDHs for predicting suicide has been challenging, primarily due to the limitations of structured data sources, such as ICD codes, in capturing comprehensive and reliable SBDH information. Unstructured clinical notes, enriched with detailed SBDH information, can play a vital role in this regard 12,16.
The increasing use of Electronic Health Records (EHR) in the US has stimulated efforts to identify patients at suicide risk using EHR data. This has resulted in data mining and machine learning approaches to predict suicidal behavior and suicide mortality among patients in large healthcare systems 17,18. While most existing work on suicide risk assessment using SBDH has focused on structured data sources, unstructured EHR notes represent a relatively untapped data source that can be accessed relatively inexpensively. With the advent of advanced natural language processing (NLP) techniques, there are substantial opportunities to automate SBDH extraction from EHR notes to augment structured data, aiding healthcare providers with a more holistic view of a patient's overall health status and suicide risk 19,20.
The US Department of Veterans Affairs (VA) operates the largest integrated healthcare network in the country, with a national EHR system used by more than 1,200 medical centers and clinics 21. Given the great public concern about the health of Veterans, the VA presents a unique opportunity to fully leverage its data for suicide-related predictive modeling. In this study, we conducted the first retrospective case-control study to examine the impact of both structured and NLP-extracted (from unstructured notes) SBDH on suicide death among Veterans. We evaluated three architecturally distinct suicide prediction models across multiple prediction windows. As detailed below, our findings show that SBDHs can improve all models' predictive performance across different prediction windows.

Data Source and Study Design
In this study, we used inpatient and outpatient EHRs from the US Department of Veterans Affairs Veterans Health Administration (VHA) Corporate Data Warehouse. We included all discharges from outpatient emergency room and inpatient care between October 1, 2009 (start of Fiscal Year [FY] 2010) and September 30, 2015 (end of FY 2015), and following Kessler et al. 22, the unit of analysis was the hospital discharge. Our study protocol was approved by the institutional review board of VA Bedford Health Care. The Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis (TRIPOD) 23 reporting guidelines were followed.
Cases were defined as discharges followed by death from suicide (according to the National Death Index 24, with International Classification of Diseases (ICD), Tenth Revision, codes X60-X84, Y87.0, and/or U03 as the underlying cause of death) within the next D days (the 'prediction window'). From each discharge, we established a 2-year retrospective 'observation window' to aggregate all relevant information for prediction. Each case was randomly matched, without replacement, to 5 discharges that were not followed by suicide in the prediction window (controls), on discharge type and date (± 1 year). Our discharge inclusion criteria were: 1) at least one diagnosis or procedure record within the observation window, 2) patients at least 18 years old with no conflicting demographic information, and 3) discharge at least D days before the study end date (September 30, 2015).
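The matching step above can be sketched as follows. This is a minimal illustration under an assumed record layout (dicts with `id`, `type`, and `date` fields), not the study's actual implementation:

```python
import random

def match_controls(cases, pool, n_controls=5, window_days=365, seed=0):
    """Match each case, without replacement, to n_controls control
    discharges of the same discharge type whose discharge dates fall
    within +/- window_days of the case's discharge date."""
    rng = random.Random(seed)
    available = set(range(len(pool)))
    matched = {}
    for case in cases:
        eligible = [i for i in sorted(available)
                    if pool[i]["type"] == case["type"]
                    and abs((pool[i]["date"] - case["date"]).days) <= window_days]
        if len(eligible) < n_controls:
            continue  # a real pipeline would handle unmatched cases explicitly
        picks = rng.sample(eligible, n_controls)
        available.difference_update(picks)  # sampling without replacement
        matched[case["id"]] = picks
    return matched
```

Removing picked controls from the `available` set is what enforces the without-replacement constraint across cases.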
We analyzed 4 prediction windows (7, 30, 90, and 180 days), resulting in 4 cohorts. For each discharge, our task was to predict death by suicide within the prediction window, given all data from the observation window. We put aside the discharges from FY 2015 as the hold-out test data and used the remaining discharges for training.

Predictor Construction
We categorized all predictors into four groups: demographics, codes, suicidal behavior (SB) information, and SBDH. The demographic predictors comprise patients' race, gender, age, and marital status. Codes include diagnosis, procedure, and medication codes. Since diagnosis and procedure codes are hierarchical, encoding all of them may lead to overfitting. We therefore used the single-level Clinical Classification Software (CCS) [1] to categorize them. This led to 283 categories for diagnosis codes and 248 categories for procedure codes. We also categorized medication codes following the VA National Formulary 25 (VANF) drug classification. More details are available in Appendix 1. SB information includes suicide attempt (SA) and suicidal ideation (SI), obtained using the phenotype algorithm available through the VA's Centralized Interactive Phenomics Resource (CIPHER) [2].
SBDHs were identified from structured data using ICD-9 and VHA stop codes (structured SBDH) and from clinical notes using NLP (NLP-extracted SBDH). Structured SBDHs included 6 factors: social or familial problems, employment or financial problems, housing instability, legal problems, violence, and non-specific psychosocial needs. NLP-extracted SBDHs were obtained from unstructured clinical notes using a transformer-based 26 NLP system 12, and comprised 12 factors: social isolation, job or financial insecurity, housing instability, legal problems, barriers to care, violence, transition of care, food insecurity, substance abuse, psychiatric symptoms, pain, and patient disability. SBDHs were extracted from the following 9 note types: emergency department notes, nursing assessments, primary care notes, hospital admission notes, inpatient progress notes, pain management notes, mental health notes, social worker notes, and discharge summaries. To assess the impact of SBDH from both sources, we combined them to create 13 distinct SBDH factors (Appendix 2). In addition to the individual-level SBDHs mentioned above, we also included a neighborhood-level socioeconomic variable, the area deprivation index (ADI) [3], which represents the socioeconomic status of a patient's neighborhood. ADI includes state- and national-level rankings of neighborhoods based on socioeconomic disadvantage; a higher ADI indicates lower socioeconomic status. We linked each patient's EHR data to the ADI database via their address zip code and discharge quarter of the calendar year to identify the corresponding national-level ranking, which we included as a predictor.
We extracted all predictors from the observation window except diagnosis codes, SB, and SBDH (excluding ADI). Diagnosis codes were extracted only from the discharge day, as this yielded the best performance in our initial experiments. To capture prior documented SA and SI, we extracted SB data from any time before the current discharge date. Furthermore, we varied the time frame for SBDH to investigate how their proximity to discharge affects subsequent suicide. We chose 7 (a week), 30 (a month), 90 (3 months), 180 (6 months), 365 (1 year), and 730 days (2 years) as candidate time frames. To give the model a sense of time variability, we also used SBDH predictors extracted from all six time windows simultaneously.
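Window-indexed binary SBDH predictors of this kind might be constructed as follows. The factor names and record layout are illustrative, not the study's actual schema:

```python
WINDOWS = (7, 30, 90, 180, 365, 730)  # candidate look-back windows, in days

def sbdh_window_features(discharge_date, mentions, windows=WINDOWS):
    """mentions: iterable of (factor_name, note_date) pairs found in notes.
    Returns {"<factor>_<w>d": 1} for every window w containing at least
    one mention of the factor; absent keys are implicitly 0."""
    features = {}
    for factor, note_date in mentions:
        days_before = (discharge_date - note_date).days
        if days_before < 0:
            continue  # ignore notes dated after the discharge
        for w in windows:
            if days_before <= w:
                features[f"{factor}_{w}d"] = 1
    return features
```

Emitting one indicator per window is what lets the model see whether a factor was documented recently or only in the distant past.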
In summary, we considered 619 candidate predictors (Table 1): 4 demographic variables, 283 diagnosis code variables, 248 procedure code variables, 50 medication code variables, 2 SB variables, 6 structured SBDH variables, 12 NLP-extracted SBDH variables, 13 combined SBDH variables, and 1 ADI variable. Demographic and ADI variables were categorical, whereas the remaining predictors were constructed as binary variables indicating absence or presence.

Predictor Screening
Predictor screening was performed on the binary features of diagnosis, procedure, and medication codes. First, we removed any of these predictors with a prevalence below 1%. Next, for each remaining predictor, we fit a univariate logistic regression model of suicide death on the predictor and the demographic variables. We evaluated the p-values of the predictors from these univariate models and used the Benjamini-Hochberg procedure 27 to control the false discovery rate (FDR) at 10%. Only predictors with an adjusted p-value smaller than 0.1 were used as candidate predictors to build the predictive models. Our two-stage screening removed 87.4%, 78.51%, 75.77%, and 71.41% of the predictors for the case-control cohorts with 7, 30, 90, and 180-day prediction windows, respectively. Prior work suggests that predictor screening can help with noise reduction and substantially improve out-of-sample model performance 28,29. We stress that SBDH variables were excluded from the screening stage because the focus of this work is analyzing their impact on suicide prediction.
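Given per-predictor p-values from the univariate models, the second screening stage can be sketched as below: the Benjamini-Hochberg step-up procedure keeps all predictors up to the largest rank i whose sorted p-value satisfies p_(i) <= (i/m)·q, where q is the FDR level:

```python
def benjamini_hochberg(pvals, fdr=0.10):
    """Benjamini-Hochberg step-up procedure: return a keep/drop flag per
    p-value, controlling the false discovery rate at level `fdr`."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    k = 0  # largest rank whose sorted p-value clears its BH threshold
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= fdr * rank / m:
            k = rank
    keep = [False] * m
    for i in order[:k]:
        keep[i] = True
    return keep
```

The first screening stage (dropping binary predictors present in under 1% of discharges) is a simple column-mean threshold and is omitted here.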

Statistical Analyses
We employed three different machine learning (ML) methods for predictive modeling, namely, elastic-net logistic regression (ENL), random forest (RF), and multilayer perceptron (MLP). For the ENL and RF models, we used 10-fold cross-validation on the training data and performed grid searches over a wide range of hyperparameters to select the best models. For MLP, we used a 2-layer feed-forward network with ReLU 30 as the activation function.
To tune the hyperparameters of the MLP, we set apart 20% of the training data as the validation set. As our cohorts had a case-control ratio of 1:5, we used cost-sensitive learning 31 for all models to ensure that they weighted suicide events equally with non-suicide events. For ENL and RF, we averaged all metrics over the 10 folds. For MLP, we averaged the model performance over three runs with different seeds. We experimented with different combinations of predictors, as shown in Tables 2 and 3. For SBDH, we experimented with the following combinations: structured SBDH, NLP-extracted SBDH, combined SBDH, structured SBDH + ADI, NLP-extracted SBDH + ADI, and combined SBDH + ADI. To evaluate the models' predictive performance, we examined various performance metrics on the test data, including the area under the receiver operating characteristic curve (ROC AUC), area under the precision-recall curve (PR AUC), sensitivity, specificity, and positive predictive value (PPV). Since suicide is a rare event, we calculated sensitivity, specificity, and PPV for different risk group sizes. A risk group size P for a predictive model indicates the fraction of the test set with the highest risk for suicide, as identified by the model.
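A minimal scikit-learn sketch of this setup follows. The hyperparameter grids here are illustrative placeholders, not the study's actual (much wider) grids: `class_weight="balanced"` implements cost-sensitive learning for the 1:5 case-control ratio by reweighting classes inversely to their frequency, and `scoring="average_precision"` selects hyperparameters by PR AUC:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Elastic-net logistic regression (ENL): the elasticnet penalty requires
# the saga solver; l1_ratio interpolates between L2 and L1 regularization.
enl = GridSearchCV(
    LogisticRegression(penalty="elasticnet", solver="saga",
                       class_weight="balanced", max_iter=5000),
    param_grid={"C": [0.1, 1.0], "l1_ratio": [0.1, 0.5, 0.9]},
    scoring="average_precision",  # PR AUC, the selection metric used here
    cv=10,
)

# Random forest (RF) with the same class weighting and selection metric.
rf = GridSearchCV(
    RandomForestClassifier(class_weight="balanced", random_state=0),
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10]},
    scoring="average_precision",
    cv=10,
)
```

Calling `enl.fit(X, y)` then runs the 10-fold grid search and refits the best configuration on the full training data.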
Following prior studies 22,32 and our data statistics, we used 0.05, 0.10, 0.20, and 0.60 as values of P.
As this is a case-control study, we also report adjusted PPV 33. PPV denotes the probability that a patient predicted to be at high risk dies by suicide. Measuring PPV is important because it indicates the chances of saving patients' lives with interventions.
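The risk-tier metrics can be sketched as follows. The adjusted-PPV line uses one standard Bayes re-weighting of sensitivity and specificity to a true population prevalence; the paper cites its own reference 33 for the exact adjustment used, so treat this formula as an illustrative stand-in:

```python
def risk_tier_metrics(scores, labels, tier=0.05, population_prevalence=None):
    """Sensitivity, specificity, and PPV for the top `tier` fraction of
    risk scores; optionally a PPV re-weighted to a true population
    prevalence to correct for case-control oversampling of cases."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    k = max(1, int(round(tier * n)))     # size of the flagged risk group
    flagged = set(order[:k])
    tp = sum(1 for i in flagged if labels[i] == 1)
    fp = k - tp
    fn = sum(labels) - tp
    tn = n - k - fn
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    out = {"sensitivity": sens, "specificity": spec, "ppv": tp / k}
    if population_prevalence is not None:
        pi = population_prevalence
        num = sens * pi
        den = num + (1.0 - spec) * (1.0 - pi)
        out["adjusted_ppv"] = num / den if den else 0.0
    return out
```

Because sensitivity and specificity are invariant to the case-control sampling ratio, plugging them into Bayes' rule with the population prevalence recovers a prevalence-corrected PPV.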
In addition, we conducted calibration analysis and measured predictor importance using the Kernel SHAP (Shapley Additive Explanations) method 34. For each model, we used PR AUC to select the best hyperparameter configuration. All analyses used Python 3.8; ENL and RF were implemented using scikit-learn 35 0.23.1, and MLP was implemented using PyTorch 36 1.5.1.

Prevalence of Suicide
Out of 17,267,304 discharges from 2,987,006 Veterans (Fig. 1), 17,210,996 were eligible to be considered for the 7-day prediction with 849 cases, amounting to a 0.005% suicide rate at the discharge level. At the patient level, the suicide rate within 7 days of discharge was 0.03%, with 849 suicide deaths among 2,703,173 patients. Similarly, the suicide rates within 180 days of discharge were 0.05% at the discharge level and 0.27% at the patient level. In summary, the 4 case-control cohorts for prediction windows of 7, 30, 90, and 180 days consisted of 5,094 (849 cases and 4,245 controls), 14,256 (2,376 cases and 11,880 controls), 29,580 (4,930 cases and 24,650 controls), and 46,668 discharges (7,778 cases and 38,890 controls), respectively. More details are available in Appendix 3.

Overall Model Performance
The results are shown in Tables 2 and 3. With 'SBDH' as predictors, we only report results for the combinations that yielded the best PR AUC scores. We noticed incremental improvements across almost all models and prediction windows as we added a new predictor group. Adding codes and SB information always improved the AUC scores (Table 2). A similar trend can also be observed with SBDHs. However, the best SBDH setting for PR AUC did not always yield the best ROC AUC score.
ENL achieved the best AUC scores for the 7- and 30-day prediction windows, except that MLP attained the best PR AUC in the 7-day prediction window. In contrast, RF achieved the best AUCs across the 90- and 180-day prediction windows. In general, models for the shortest prediction window (7 days) had the lowest ROC AUCs (74.44%-77.65%), and as prediction windows got longer, the models performed better, with the highest ROC AUCs (77.39%-83.94%) obtained for the longest prediction window (180 days). PR AUC scores demonstrated a similar trend. AUC scores were almost always higher among outpatient ED discharges than inpatient discharges.
Across all prediction windows with the best predictor configuration, these models detected 12.98%-24.58% of all deaths from suicide at the 5% risk tier (Table 3). This means that even considering only the 5% of discharges with the highest model-assigned suicide risk, a suicide intervention program based on these models could capture 12.98%-24.58% of the discharges where the patients would otherwise die by suicide. Increasing the risk group size can capture even more discharges, for example, 24.97%-41.14% at a 10% risk group size. PPVs and adjusted PPVs increase as the prediction window increases and the risk group size decreases. We obtained the highest adjusted PPV of 1.07% for the RF model over the 180-day prediction window at the 5% risk tier. This suggests that, among the top 5% of discharges by predicted risk, patients from 1.07% of discharges would die by suicide within 180 days of hospital discharge in the absence of any additional intervention program.

Impact of NLP-extracted Predictors
In this study, we used an NLP system to extract SBDH from clinical notes. We compared our NLP-extracted SBDHs with structured and combined SBDHs. eTable 5 lists the SBDH combinations that yielded the best performance for each model at each prediction window. In half of the settings (6 out of 12), NLP-extracted SBDHs were the best choice, whereas structured SBDHs performed better in four settings. We also found ADI to be helpful in most settings.

Calibration and Predictor Importance
Of the three models, RF was better calibrated than the others (eFigures 1-2). However, there was no noticeable difference between models with and without SBDH (eFigure 2). We also measured predictor importance using the Kernel SHAP method (eFigure 4). Based on SHAP values, we identified predictors that pushed a model toward positive predictions (suicide death) and predictors that did the opposite, naming them positive and negative predictors, respectively. Upon examining the top 30 positive predictors, we found that SA, SI, and the age group 79 or higher were the most common predictors across different models and prediction windows. In contrast, Black race, female gender, and age 50-59 were the most consistent negative predictors in the top 30. Among diagnosis predictors, 'Administrative/social admission', 'COPD', 'alcohol-related disorders', and 'anxiety disorders' were the most common positive predictors. As for procedure categories, 'anesthesia' was a common positive predictor, whereas 'cardiac stress tests' was a common negative predictor. Among medications, 'sedative hypnotics' was a prominent positive predictor and 'antidepressants' was a common negative predictor. Among SBDHs, 'social isolation' (NLP-extracted) and 'violence' (structured) were two of the most common positive predictors. We emphasize that SHAP values do not indicate risk or protective factors; rather, they help rank predictors according to their usefulness for a task (suicide prediction) with respect to a model (ENL, RF, or MLP).

Ensemble Learning
Ensembling is a popular technique for aggregating multiple models' predictions to improve system robustness. Among various aggregator functions, such as linear averaging, majority voting, and boosting, we chose linear averaging for our study. First, for each model, we averaged the prediction probabilities over all folds/runs; then, we averaged these across models. We did this for the two best models (ENL and RF) and for all three models. The results are shown in Table 4. We found that ensembling ENL and RF improved the AUC scores over the best single models for the 7-, 30-, and 90-day prediction windows. However, performance did not improve for the 180-day prediction window. Comparatively, ensembling all three models was only helpful for the 7-day prediction window.
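The two-stage linear averaging can be sketched as below; this is a minimal illustration of the aggregation order, not the study's code:

```python
def ensemble_average(models_runs):
    """models_runs: one entry per model, each a list of per-fold/per-run
    probability vectors over the same test discharges. First average
    within each model, then average across models."""
    per_model = []
    for runs in models_runs:
        n = len(runs[0])
        per_model.append([sum(run[i] for run in runs) / len(runs)
                          for i in range(n)])
    n = len(per_model[0])
    return [sum(probs[i] for probs in per_model) / len(per_model)
            for i in range(n)]
```

Averaging within each model first gives every model equal weight in the final ensemble, regardless of how many folds or runs it has.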
Overall, the RF model remained better calibrated than the ensembled systems (eFigure 3).

Discussion

SBDH improved the performance in all cases, with ROC AUC improvements of up to 3.86% and PR AUC improvements of up to 11.21%. This is consistent with prior studies 22,37 in which multiple SBDH factors were identified as important predictors of suicide after discharge from VA psychiatric hospitalization. However, those studies lacked a robust deep-learning-based system for extracting SBDH from clinical notes. Our results also showed that all models benefited from including NLP-extracted SBDHs, whether in combination with other SBDHs or alone. This highlights the merit of harnessing clinical notes through NLP to enrich SBDH information for improved predictive modeling.
Our work showed that near-term prediction of suicide death is more challenging than longer-term prediction; all models performed best with the 180-day prediction window, and performance kept declining as the prediction window decreased. This may partly stem from the lack of adequate samples in shorter prediction windows, making it more challenging for any model to map the predictors to suicide. Other studies suggested that the larger number of suicides over longer windows increases predictive models' statistical power 22,37. They found that models built to predict suicide over longer windows outperform models built for shorter windows even when applied at those shorter windows.
We also ranked the predictors using their SHAP values (eFigure 4). We discovered that records of prior SA and SI are two of the most important predictors of death by suicide across all prediction windows. SA is well-established as a significant risk factor for suicide 3,38. Data indicate that one out of every 100 attempt survivors dies from suicide within the first year, a risk approximately 100 times higher than that observed in the general population 39. Furthermore, the risk of suicide can persist up to 32 years following an attempt 40. A systematic review of 90 studies found a 6.7% suicide completion rate and a 23% non-fatal attempt rate 41. We also found 'social isolation' and 'violence' to be two of the most common positive SBDH predictors. Prior studies showed a strong association between social isolation and suicide risk 12,42-45. Exposure to violence is also a well-known risk factor for suicidality 12,46,47.
Using NLP to extract clinically relevant information from EHR notes is not new. Datta et al. reviewed 78 studies that utilized NLP to extract cancer-related information 48. Mitra et al. developed a deep-learning-based NLP system to extract social determinants of health from EHR notes and showed their significant associations with suicide among US Veterans 12. Bhanu et al. designed an NLP system to extract SB information from EHR notes 49. Many other works have also used NLP systems to detect suicidality in EHR notes 50-53. However, ours is the first case-control study to incorporate NLP-extracted SBDHs as predictors for suicide death prediction.
Although predicting suicidal behavior has been an active area of research 17,22,28,54,55, our study differs in its addition of NLP-extracted SBDH as predictors to analyze their impact on the predictive performance of a diverse set of models. Despite the many existing studies on suicide prediction, integrating their findings into existing healthcare systems poses a multitude of challenges, such as lack of logistical support at deployment centers, risk-benefit tradeoffs, cost-effectiveness, a sense of false reassurance 22, and generalizability, among others.
Moreover, a systematic review of 17 suicide prediction studies found that all predictive models suffer from low PPV, regardless of the population distribution or risk tier 56, making suicide prediction a challenging task. In contrast, Kessler et al. showed that predictive models have positive net benefit across plausible ranges of the PPV distribution 37.
Limitations and Future Work

Our study has several limitations. First, the VA population's demographic composition differs from that of the overall US population. Nonetheless, research utilizing VHA data has informed non-VA facilities in implementing enhanced clinical practices 57-59. Additionally, our study employed no VA-exclusive predictors, allowing the same predictors to be extracted from EHRs at non-VA facilities for customized prediction models. Second, our analysis focused solely on outpatient emergency and inpatient care discharges. Expanding to other hospital settings could enhance our understanding of SBDHs' impact on suicide; we leave this for future work. Third, we restricted the observation window to 2 years to incorporate relatively current SBDHs, but extending it to encompass historical SBDHs may enhance model predictions, a subject we will explore in future research. Lastly, we utilized the ADI, available only at the census block group level; we plan to investigate the recently proposed social vulnerability metric 60 as an alternative in future studies.

Conclusions
Ours is the first large-scale study to use NLP-extracted SBDH information from unstructured EHR data to predict suicide among Veterans. We showed that incorporating NLP-extracted SBDH improved predictive performance across different models and prediction windows. Consequently, integrating NLP-extracted SBDH with structured EHR data offers a promising avenue for the advancement of a more effective suicide prevention system.

Declarations

Table 1
List of predictors considered in our study.

Table 2
Performance of different predictive models across different prediction windows. ROC AUC: area under the receiver operating characteristic curve; PR AUC: area under the precision-recall curve; SD: standard deviation; Demo: demographic variables; SB: suicidal behaviors (attempt and ideation); Codes: diagnosis, procedure, and medication codes; SBDH: social and behavioral determinants of health; ENL: Elastic Net Logistic Regression; RF: Random Forest; MLP: Multilayer Perceptron.

Table 3
Performance of different predictive models across different prediction windows for the best predictor configuration. PPV: positive predictive value; ENL: Elastic Net Logistic Regression; RF: Random Forest; MLP: Multilayer Perceptron.

Table 4
Performance of different predictive models, including ensembled systems, across different prediction windows. ROC AUC: area under the receiver operating characteristic curve; PR AUC: area under the precision-recall curve.