Patient data source
We trained and tested a GLM and nine other ML algorithms using historical data from the Amsterdam Study of Acute Psychiatry (ASAP). The aim of ASAP was to study the relationship between the incidence of (involuntary) psychiatric hospitalizations on the one hand and prior psychiatric history, the course of the psychiatric disorder, the patient’s social circumstances, and patient opinions and experiences on the other19,20. The dataset used in the current analysis contains data from a cohort of patients who had an emergency consultation with either the Psychiatric Emergency Service Amsterdam or the Acute Treatment Unit in Amsterdam between 15 September 2004 and 15 September 2006 (the “index” contact), with a follow-up period of 12 months. Although the data are some years old, this is still the largest, most extensive, and most complete dataset on long-term hospitalization outcomes after psychiatric crisis care in the Netherlands.
The data collected at baseline during the emergency consultation were: age, gender, domestic situation, and the Diagnostic and Statistical Manual of Mental Disorders, Fourth Edition, Text Revision axis I diagnostic category. To determine the severity of the current psychopathology, the Severity of Psychiatric Illness rating scale (SPI)21 was used. The SPI contains 14 items, each rated on a four-point scale (no risk, low risk, moderate risk, high risk), with an additional category for items on which no information is present22.
All variables related to health care consumption and the numbers of care contacts in the five years before and the 12 months after the index contact were extracted from the patient health records kept by the three largest mental health institutions in Amsterdam: JellinekMentrum (now Arkin), AMC de Meren (now Arkin), and GGZ inGeest.
All analyses were performed on routinely collected anonymized data from the participating institutions. Therefore, this study was exempted from medical ethics review and opt-in informed consent from participants was not necessary according to article 9 of the General Data Protection Regulation23.
Dependent variable
The dependent variable in our prediction model was hospitalization, operationalized as any psychiatric hospitalization in any of the three participating psychiatric hospitals in the 12 months after the index psychiatric crisis care contact.
Predictor variables
The following 39 variables were collected in the ASAP study and were used to train our prediction model.
Socio-demographic (5): Gender, age, living situation (alone, with partner, with parents, institutionalized, other), marital status, cultural background (Dutch, Moroccan, Turkish, Surinamese, Netherlands Antilles, North Africa excl. Morocco, Sub Saharan Africa, other western minorities, other non-western minorities)
SPI items (14): Suicide risk, danger to others, severity of psychiatric symptoms, problems with self-care, substance misuse, medical condition(s), disturbances in patients’ family connectedness, professional functioning, stability of patients’ living situation, patient is motivated to receive treatment, prescription medication compliance, anosognosia, patients’ family involvement in informal care, and symptom persistence.
Clinical (2): Psychiatric diagnosis (Depression, psychotic disorder, mania/bipolar disorder, alcohol/substance use disorder, all other disorders, no diagnosis) and Global Assessment of Functioning (GAF) score (0-100).
Psychiatric intake and care register data (18): Patients’ informal social support system involved (Yes/No), patient referrer (general practitioner, first aid station, mental health care, police, other), number of previous face-to-face treatment contacts up to 2 weeks / 1 month / 3 months / 6 months / 12 months before the index crisis care contact, number of previous psychiatric hospitalizations (last 12 months and last 5 years), number of previous psychiatric day care treatments (last 12 months and last 5 years), number of involuntary treatments/hospitalizations (last 12 months and last 5 years), days of psychiatric hospitalization (last 12 months), any earlier psychiatric care referrals (> 1 year and >5 years before current contact).
Statistical procedures
First, a dataset was created consisting of the 39 predictor variables and the dependent variable. Patients with missing hospitalization data, missing SPI data, or who died during the study were removed. As some of the statistical techniques used cannot adequately handle missing observations, the remaining missing data were imputed using the mice package24 in R.
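The mice package implements multivariate imputation by chained equations. As an illustrative sketch only (not the pipeline used in the study, which was run in R), a comparable chained-equations approach is available in Python via scikit-learn's IterativeImputer; the data below are randomly generated stand-ins for the ASAP predictors.

```python
# Sketch of chained-equation imputation, analogous in spirit to R's mice.
# The 200x5 matrix and 10% missingness rate are arbitrary illustration values.
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[rng.random(X.shape) < 0.1] = np.nan  # ~10% of entries missing at random

# Each feature with missing values is iteratively modeled from the others.
imputer = IterativeImputer(max_iter=10, random_state=0)
X_imp = imputer.fit_transform(X)
```

Unlike mice, which produces multiple imputed datasets for pooled inference, this sketch yields a single completed dataset, which is usually sufficient when the goal is prediction rather than inference.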
The ML techniques were first run on training data to create a model, and then evaluated on independent test data. We used K-fold cross-validation (with K=10) to validate the model parameters. In K-fold cross-validation, K successive mutually exclusive test sets are created. The algorithm is iteratively fitted on the training datasets, and predicted classifications are then calculated for the corresponding test set. With K=10, at each iteration another 10% of the data is set aside from the original dataset for validation purposes. In the end, each observation in the original dataset has a predicted classification that was obtained when it was part of the test set25. We chose K=10 because a simulation study by Kohavi26 indicated that for real-world datasets the best method for model selection is 10-fold stratified cross-validation.
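The procedure above can be sketched as follows, assuming a scikit-learn workflow with synthetic data (the actual analyses were run in R); note how every observation receives exactly one out-of-fold prediction:

```python
# 10-fold stratified cross-validation producing an out-of-fold predicted
# probability for every observation, as described in the text.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Hypothetical data standing in for the real predictors and outcome.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
# For each fold, the model is fitted on the other 9 folds (90% of the data)
# and predicts on the held-out 10%; results are stitched back together.
oof_prob = cross_val_predict(LogisticRegression(max_iter=1000), X, y,
                             cv=cv, method="predict_proba")[:, 1]
```

Stratification keeps the hospitalized/not-hospitalized ratio roughly constant across folds, which matters when the outcome is imbalanced.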
GLM and nine other ML algorithms were selected in order to achieve maximum variation among the used approaches. The nine other ML algorithms were DeepBoost (R package deepboost), Keras/TensorFlow (R package keras and the TensorFlow and Keras libraries for Python), k-nearest neighbors (R package class), naive Bayes (R package klaR), neural network (R package nnet), oblique random forest (R package obliqueRF), random forest (R package randomForest), stochastic gradient boosting (R package gbm), and (model averaged) support vector machines (with class weights) (R package kernlab). All algorithms had implementations in R and/or Python. A detailed description of all ten algorithms is presented in Additional File 1.
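A comparison of this kind can be sketched as a single loop over candidate algorithms scored on cross-validated AUC. The scikit-learn models below are generic stand-ins for some of the algorithm families named above (they are not the R implementations used in the study, and families such as DeepBoost and oblique random forest are omitted):

```python
# Sketch: compare several algorithm families on 10-fold cross-validated AUC.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical data standing in for the ASAP predictors and outcome.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

models = {
    "GLM (logistic regression)": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "naive Bayes": GaussianNB(),
    "random forest": RandomForestClassifier(random_state=0),
    "stochastic gradient boosting": GradientBoostingClassifier(random_state=0),
}
aucs = {name: cross_val_score(m, X, y, cv=10, scoring="roc_auc").mean()
        for name, m in models.items()}
for name, auc in sorted(aucs.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {auc:.3f}")
```

Running all algorithms through an identical cross-validation harness is what makes their AUC scores directly comparable.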
All numeric predictor variables were centered and scaled in the pre-processing phase, and categorical variables were recoded into dummy variables. In the presented base case analysis, we did not balance the two levels of the dependent variable (hospitalized/not hospitalized); in a sensitivity analysis, all results were validated under a balanced scenario, created by under-sampling the most prevalent outcome. Confusion matrices, accuracy scores, sensitivities, specificities and the area under the Receiver Operating Characteristic (ROC) curve (AUC, or c-statistic) were calculated for each model. The AUC measures the area underneath the plot of the ROC curve and is an aggregate measure of the performance of the model27. Theoretically, the AUC can take any value between 0 and 1, with 0 corresponding to 100% wrong predictions, 1 corresponding to 100% correct predictions, and 0.5 corresponding to chance-level discrimination.
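These pre-processing and evaluation steps can be sketched as below, again with hypothetical Python/scikit-learn code and randomly generated stand-in variables (the column names `age`, `gaf`, and `referrer` are merely illustrative):

```python
# Sketch: centering/scaling, dummy coding, performance metrics, and
# under-sampling for the balanced sensitivity analysis.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(18, 80, 400),
                   "gaf": rng.integers(1, 100, 400),
                   "referrer": rng.choice(["gp", "police", "mhc"], 400)})
y = (rng.random(400) < 0.3).astype(int)        # ~30% "hospitalized"

X = pd.get_dummies(df, columns=["referrer"])    # categorical -> dummies
X[["age", "gaf"]] = StandardScaler().fit_transform(X[["age", "gaf"]])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0, stratify=y)
prob = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
pred = (prob >= 0.5).astype(int)

tn, fp, fn, tp = confusion_matrix(y_te, pred).ravel()
accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
auc = roc_auc_score(y_te, prob)

# Balanced sensitivity analysis: under-sample the more prevalent outcome
# so that both classes occur equally often.
maj, mino = np.where(y == 0)[0], np.where(y == 1)[0]
keep = np.concatenate([rng.choice(maj, size=len(mino), replace=False), mino])
X_bal, y_bal = X.iloc[keep], y[keep]
```

Note that the AUC is computed from the predicted probabilities, while the confusion-matrix metrics depend on the chosen classification threshold (0.5 here).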
We also estimated the relative unique importance of each individual predictor variable for the overall AUC score using the filterVarImp function in the R package caret28. We standardized the AUC associated with each variable by dividing its absolute AUC deviation by the absolute AUC deviation associated with the most impactful variable.
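The idea can be sketched as follows: each predictor is scored on its own univariate AUC (as caret's filterVarImp does for two-class problems), and the deviations are rescaled so the strongest predictor gets importance 1. This Python sketch assumes "absolute deviation" means the deviation of each variable's AUC from the chance level of 0.5, which is one common convention:

```python
# Sketch of filter-style variable importance: per-predictor univariate AUC,
# standardized by the largest absolute deviation from 0.5 (chance level).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score

# Hypothetical data with a few informative predictors.
X, y = make_classification(n_samples=500, n_features=8, n_informative=3,
                           random_state=0)

# AUC of each predictor used alone as a score for the outcome; a predictor
# negatively associated with the outcome yields an AUC below 0.5, so we
# take the absolute deviation from 0.5.
auc_dev = np.array([abs(roc_auc_score(y, X[:, j]) - 0.5)
                    for j in range(X.shape[1])])
relative_importance = auc_dev / auc_dev.max()  # most impactful variable -> 1.0
```

This is a filter method: each variable is scored independently of the fitted model, so the scores reflect univariate discrimination rather than a variable's contribution within the multivariable model.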
To evaluate the predictive accuracy of the best performing model, we calculated the Net Reclassification Improvement (NRI) of that model in comparison to both the GLM-based model and the least well performing model. The NRI is an index that provides an estimate (with a confidence interval) of how well one model classifies subjects compared to another29.
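As an illustration, the category-free (continuous) form of the NRI sums, for events, the net proportion whose predicted risk moves up under the new model and, for non-events, the net proportion whose predicted risk moves down. The sketch below uses simulated probabilities and is not the study's actual computation (which would also include a confidence interval, e.g. via bootstrapping):

```python
# Sketch of the category-free Net Reclassification Improvement (NRI)
# between two models' predicted probabilities, on simulated data.
import numpy as np

def continuous_nri(y, p_new, p_old):
    """NRI = (P(up|event) - P(down|event)) + (P(down|nonevent) - P(up|nonevent))."""
    up, down = p_new > p_old, p_new < p_old
    event, nonevent = y == 1, y == 0
    nri_events = up[event].mean() - down[event].mean()
    nri_nonevents = down[nonevent].mean() - up[nonevent].mean()
    return nri_events + nri_nonevents

rng = np.random.default_rng(0)
y = (rng.random(200) < 0.3).astype(int)
p_old = rng.random(200)
# A "better" model: nudges event risks up and non-event risks down.
p_new = np.clip(p_old + 0.2 * (y - 0.5) + rng.normal(0, 0.05, 200), 0, 1)
print(round(continuous_nri(y, p_new, p_old), 3))
```

The continuous NRI ranges from -2 to 2, with positive values indicating that the new model reclassifies subjects in the right direction more often than the old one.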