Machine learning algorithms to predict 30-day readmission in patients with stroke: a prospective cohort study

Background No studies have discussed machine learning algorithms to predict the risk of 30-day readmission in patients with stroke. The objective of the present study was to compare the accuracy of the artificial neural network (ANN), K nearest neighbor (KNN), support vector machine (SVM), naive Bayes classifier (NBC), and Cox regression (COX) models and to explore the significant factors in predicting 30-day readmission after stroke. Methods This study prospectively compared the accuracy of the models using clinical data for 1,476 patients with stroke treated in six hospitals between March 2014 and September 2019. A training dataset (n=1,033) was used for model development, a testing dataset (n=443) was used for internal validation, and a validating dataset (n=167) was used for external validation. A global sensitivity analysis was performed to compare the significance of the selected input variables. Results Of all forecasting models, the ANN model had the highest accuracy in predicting 30-day readmission after stroke and had the highest overall performance indices. According to the ANN model, 30-day readmission was significantly associated with post-acute care (PAC) program, patient attributes, clinical attributes, and functional status scores before rehabilitation (all P < 0.05). Additionally, PAC program was the most significant variable affecting 30-day readmission, followed by nasogastric tube insertion and stroke type (P < 0.05). Conclusions Comparisons of the five forecasting models indicated that the ANN model had the highest accuracy in predicting 30-day readmission in stroke patients. Before stroke patients are discharged from hospitalization, they should be counseled regarding their potential for recovery and other possible outcomes. These important predictors can also be used to educate patients with stroke who are candidates for PAC rehabilitation about the course of recovery and health outcomes.

Background
Long-term disabilities after stroke can be enormous physical, mental, and financial burdens for patients, their families, and society [1]. Readmission for any cause after acute care for stroke is associated with increased mortality, increased healthcare costs, and decreased functional status [1][2][3]. Additionally, high readmission rates negatively affect the profitability of a healthcare institution. Therefore, reducing readmission has become an active area of research in the medical literature. Although many models for predicting outcomes of stroke treatments have been proposed in recent years, models for predicting 30-day readmission have had major shortcomings: (1) recently proposed forecasting models have shown lower prediction accuracy compared to conventional models [4][5][6][7][8][9], (2) proposed forecasting models require use of health insurance claims data, which would not be available in a real-time clinical setting [8,10], and (3) predictions of 30-day readmission do not consider post-acute care (PAC) program, patient attributes, clinical attributes, and functional status score before rehabilitation [11][12][13][14][15]. In the current study, the best predictors of hospital readmission within 30 days after stroke were identified using artificial neural network (ANN), K nearest neighbor (KNN), support vector machine (SVM), naive Bayes classifier (NBC), and Cox regression (COX) models. Healthcare administrators in Taiwan can use the predictive simulation results obtained in this study not only for developing and improving healthcare policies but also for building support systems for healthcare decision making. In this study, we aimed to compare the five forecasting models in terms of accuracy and to explore the significant variables in predicting hospital readmission within 30 days after stroke.

Study design and patients
The study population included all patients who had an ICD-9-CM code for stroke (433.01, 433.10, 433.11, 433.21, 433.31, 433.81, 433.91, 434.00, 434.01, 434.11, 434.91, and 436 for ischemic stroke; 430 and 431 for hemorrhagic stroke) and had been admitted to the PAC ward at one of four community hospitals (three regional hospitals and one district hospital) or had been admitted to a traditional non-PAC ward at one of two medical centers in south Taiwan between March 2014 and September 2019. The enrollment criteria were acute stroke, within 30 days of stroke onset, and a score of 2 to 4 on the Modified Rankin Scale (in this scale, absence of symptoms is scored as 0; no significant disability and slight, moderate, moderately severe, and severe disability are scored as 1, 2, 3, 4, and 5, respectively) [16]. A total of 1,476 patients with stroke were initially recruited in the study, and data for another 167 patients with stroke were collected from October to December 2019 (Fig. 1). The study protocol was approved by the institutional review board at Kaohsiung Medical University Hospital (KMUH-IRB-20140308), and written informed consent was obtained from each participant.
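The stated inclusion rules can be read as a simple record filter. The following is a minimal sketch, not the authors' implementation: the function names `stroke_type` and `is_eligible` and the argument names are hypothetical, and treating "within 30 days of stroke onset" as an inclusive bound of 0 to 30 days is an assumption.

```python
# ICD-9-CM codes listed in the study for each stroke type.
ISCHEMIC_CODES = {
    "433.01", "433.10", "433.11", "433.21", "433.31", "433.81", "433.91",
    "434.00", "434.01", "434.11", "434.91", "436",
}
HEMORRHAGIC_CODES = {"430", "431"}


def stroke_type(icd9_code):
    """Return 'ischemic', 'hemorrhagic', or None for a code outside the study lists."""
    code = icd9_code.strip()
    if code in ISCHEMIC_CODES:
        return "ischemic"
    if code in HEMORRHAGIC_CODES:
        return "hemorrhagic"
    return None


def is_eligible(icd9_code, days_since_onset, mrs_score):
    """Apply the three stated enrollment criteria to one patient record.

    Assumption: the 30-day window is inclusive (0 <= days <= 30).
    """
    return (
        stroke_type(icd9_code) is not None
        and 0 <= days_since_onset <= 30
        and 2 <= mrs_score <= 4  # Modified Rankin Scale score of 2 to 4
    )
```

For example, `is_eligible("434.91", 12, 3)` accepts an ischemic stroke patient 12 days after onset with a Modified Rankin Scale score of 3, whereas a score of 5 would be rejected.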
A research assistant collected the following data from medical records: PAC program (PAC group or non-PAC group), patient attributes (age, gender, education, and body mass index (BMI)), clinical attributes (stroke type, nasogastric (NG) tube, Foley catheter, hypertension, diabetes mellitus (DM), hyperlipidemia, atrial fibrillation, previous stroke, acute care length of stay (LOS), rehabilitation ward LOS), and functional status score before rehabilitation. In multivariate analysis, the potential predictors were the independent variables, and 30-day readmission was the dependent variable.

Statistical analysis
The unit of analysis in this study was the individual patient with stroke. Statistical analysis was performed in the following steps. In the first step, the statistical significance of continuous variables was tested by one-way analysis of variance, and the statistical significance of categorical variables was tested by Fisher's exact test. Univariate analyses were performed to identify significant predictors (P < 0.05). In the second step of the statistical analysis, the cases in the overall database were randomly divided into two datasets: a training dataset including 1,033 cases was used for model development, and a testing dataset including 443 cases was used for internal validation. Additionally, a validating dataset including 167 new cases was used for external validation. The independent variables fitted to the forecasting models were the significant predictors, and the dependent variable was 30-day readmission. After model training, model outputs were collected for each testing dataset. In the third step of statistical analysis, 1,000 pairs of forecasting models with 95% confidence intervals were compared in terms of accuracy in predicting 30-day readmission in patients with stroke. The statistical significance of the differences between two models in the performance indices was calculated using a chi-squared test, since this test is nonparametric and does not require a normal distribution of either the data or the variances. Indices used for performance comparisons included sensitivity, specificity, positive and negative predictive value (PPV and NPV), accuracy, and area under the receiver operating characteristic curve (AUROC).
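The performance indices named above follow from the confusion matrix of predicted versus observed readmission. The sketch below uses the standard textbook definitions in plain Python; it is an illustration and not the STATISTICA procedure used in the study. The function names are hypothetical, and labels are assumed to be coded 1 for readmitted and 0 for not readmitted.

```python
def performance_indices(y_true, y_pred):
    """Compute sensitivity, specificity, PPV, NPV, and accuracy from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # readmitted, flagged
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # not readmitted, not flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarm
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed readmission

    def ratio(num, den):
        # An index is undefined when its denominator is empty.
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


def auroc(y_true, scores):
    """AUROC as the probability that a positive case outscores a negative one.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUROC is quadratic in the number of cases, which is acceptable for datasets of the size described here (443 testing and 167 validating cases); a rank-based formulation would be preferable for much larger samples.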
In the fourth and final step of statistical analysis, global sensitivity analysis was performed to assess the relative significance of the input variables in the forecasting model and to rank their importance. The global sensitivity of the input variables against the output variable was expressed as the ratio of the network error (sum of squared residuals) [24]. Variables with a variable sensitivity ratio (VSR) of 1 or lower were assumed to diminish performance and were removed.
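One common way to form such an error ratio is to divide the model's sum of squared residuals with a variable made uninformative by the sum of squared residuals with all variables intact. The sketch below follows that reading and rests on two assumptions that the text does not confirm: that the ratio is taken in this direction, and that a variable is "omitted" by replacing it with its sample mean. The function names and the `predict` callable are hypothetical stand-ins for a trained model.

```python
def sum_squared_error(predict, rows, targets):
    """Sum of squared residuals of a prediction function over a dataset."""
    return sum((predict(row) - t) ** 2 for row, t in zip(rows, targets))


def variable_sensitivity_ratios(predict, rows, targets, names):
    """Return {variable name: error with variable neutralized / baseline error}.

    Assumption: a variable is neutralized by replacing it with its mean.
    A ratio above 1 means the model degrades without the variable; a ratio
    of 1 or lower suggests that the variable does not help.
    """
    baseline = sum_squared_error(predict, rows, targets)
    if baseline == 0:
        raise ValueError("baseline error is zero; the ratio is undefined")
    ratios = {}
    for j, name in enumerate(names):
        mean_j = sum(row[j] for row in rows) / len(rows)
        # Copy each row with column j replaced by its mean.
        masked = [list(row[:j]) + [mean_j] + list(row[j + 1:]) for row in rows]
        ratios[name] = sum_squared_error(predict, masked, targets) / baseline
    return ratios
```

Under this reading, the variables can be ranked by sorting the returned dictionary by ratio in descending order, and any variable with a ratio of 1 or lower would be a candidate for removal.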
All statistical analyses were performed using the STATISTICA 13.0 software package (StatSoft, Inc., Tulsa, OK, USA). All statistical tests were two-sided; a P value less than 0.05 was considered statistically significant.

Study characteristics
Table 1 shows that 1,283 patients (86.9%) joined the per-diem PAC program and the remaining patients selected the fee-for-service non-PAC program. The stroke patients had a mean age of 65.5 years (standard deviation, SD 13.0 years) and most patients were male (62.5%). During the study period, 120 patients with stroke were readmitted within 30 days. In univariate analysis, PAC program, age, gender, education, BMI, stroke type, NG tube, Foley catheter, hypertension, DM, hyperlipidemia, atrial fibrillation, previous stroke, acute care LOS, rehabilitation LOS, and functional status score before rehabilitation were significantly associated with 30-day readmission (P < 0.05), and these significant predictors were included in the forecasting models (Table 2).

Comparison of the forecasting models
The training and testing datasets did not significantly differ in the significant predictors and 30-day readmission (data not shown); therefore, samples were compared between the training and testing datasets to increase reliability in the validation results. Table 3 shows that sensitivity, specificity, PPV, NPV, accuracy, and AUROC were all significantly superior in the ANN model in comparison with the other forecasting models (P < 0.001).

Sensitivity analysis
Additionally, to verify the predictive accuracy of the models, data for 167 new cases were collected. Table 5 compares the performance indices obtained by the ANN, KNN, SVM, NBC, and COX models for external validation. Compared to the other forecasting models, the ANN model also consistently and significantly obtained better performance indices in predicting 30-day readmission (P < 0.001).

Discussion
To the best of our knowledge, this study is the first to use forecasting models to analyze 30-day readmission in patients with stroke. Accuracy in predicting 30-day readmission in patients with stroke was compared among the five forecasting models. When all models were constructed using a given set of clinical inputs, the ANN model was clearly superior to the other forecasting models. Additionally, unlike previous works in which the analyses were performed using a dataset for a single medical center, our study used prospective and longitudinal data from multiple medical institutions, which provides a more accurate depiction of current treatment for patients with stroke [10][11][12][13]. Additionally, unlike single-center series studies, our use of registry data more accurately depicts stroke treatment in large populations. Using registry data also minimizes referral bias or bias caused by the practices of a single physician or a single institution [25,26].
Recent works have repeatedly demonstrated the superior performance of the ANN model compared to other models [9,13]. The advantages offered by the unique characteristics of the ANN model have been confirmed by statistical analyses. For example, using an ANN model can enable more appropriate and more accurate processing of inputs that are incomplete or inputs that introduce noise [9,27]. Another advantage is that linear and non-linear ANN models with good potential for use in large-scale medical databases can be constructed using data that are highly correlated but not normally distributed. Prognosis prediction is only one of the many applications of ANN models in clinical research in the medical field [27].
The comparisons of various models in the present study suggest that, by expanding the number of potential predictors, the ANN model facilitates systematic analysis of various diseases and facilitates comparisons of the effectiveness of research methods. Additionally, the proposed model can be extended to outcome prediction for treatments other than PAC and in patients other than patients with stroke.
The global sensitivity analysis of the weights of significant predictors of 30-day readmission in the patients with stroke in this study revealed that the best predictor was PAC. This finding is consistent with earlier reports that, compared to all other stroke treatment variables, PAC has the largest effect on outcome in terms of overall treatment cost, functional status after stroke, and duration of hospital stay before transfer to rehabilitative ward [25,28]. Wang et al. coupled a natural experimental design with propensity score matching to assess the impact of PAC in stroke patients and to examine the longitudinal effects of PAC on functional status [25]. The study concluded that intensive rehabilitative PAC delivered on a per-diem basis substantially improves functional status in patients with stroke. Another recent study compared a wide range of functional domains between a stroke PAC group and a well-matched nationwide cohort of patients with stroke who did not receive PAC [29]. The authors similarly reported that the stroke PAC group had significantly better outcomes in terms of restoration of functional impairments, 90-day clinical outcomes, and healthcare utilization.
The present study found that, before rehabilitation, NG tube insertion was significantly associated with 30-day readmission (P < 0.001). During the study period, no patient with stroke required NG tube insertion after rehabilitation. Previous works indicate that NG tube insertion may be beneficial in acute stroke. However, prolonged use is associated with poor prognosis [30,31]. A large clinical trial recently reported that, at 6 months after stroke, survival and other medical outcomes are better in patients who have had NG tubes removed compared to those who still require NG tubes at 6 months [30]. As reported previously in Ho et al., our study found that removal of the NG tube early after stroke is associated with a reduced rate of readmission, reduced incidence of pneumonia, and reduced mortality [31].
Hemorrhagic stroke is associated with a higher readmission rate and higher mortality compared to ischemic stroke [32,33]. Stroke is generally more severe in hemorrhagic cases than in ischemic cases. In the first 3 months after stroke, readmission and mortality are higher in hemorrhagic stroke, and both readmission and mortality are independently associated with the hemorrhagic nature of the stroke lesion. In the present study, 30-day readmission was higher in patients with hemorrhagic stroke than in patients with ischemic stroke.
This prospective observational study of a cohort of patients with stroke in Taiwan analyzed data for patients treated at multiple healthcare institutions. The ANN model developed in this study improves accuracy in identifying correlations between predictors and 30-day readmission in patients with stroke.
However, the proposed predictive model has many other potential clinical applications. For example, healthcare institutions can improve care quality by using the methods developed in this study to evaluate the effectiveness of medical treatment. Since the proposed ANN model accurately predicts 30-day readmission, healthcare administrators at other institutions can use the model to demonstrate the need for prompt and appropriate PAC for patients with stroke. A broader application of the model is in facilitating the formulation and promotion of healthcare policies and the development of decision-support systems in Taiwan, which would ultimately enhance the health of all citizens. However, further studies are needed to determine the true clinical relevance of the ANN model and to clarify whether clinicians can effectively use the model to predict prognosis and to optimize medical management for patients with stroke.
To confirm our finding that the PAC program was significantly associated with 30-day readmission in patients with stroke, Table 6 presents an international data comparison. The comparison includes this study and three other selected studies of similar populations in the United States and Taiwan [34][35][36]. These studies all shared the following features: 1) the sample size was relatively large, 2) the mean age of the study sample was similar to that in the present study, 3) data were drawn from state or national datasets, and, most importantly, 4) the study explored 30-day readmission measures in patients with stroke. All of these previous studies are consistent with the findings of the present study that a multidisciplinary PAC program can significantly reduce 30-day readmission in patients with stroke compared with a non-PAC program (P < 0.001).

Conclusions
Based on the comparison results in this study, we conclude that the ANN model is superior to the other forecasting models in terms of accuracy in predicting 30-day readmission in patients with stroke after hospital discharge. Overall performance indices for the ANN model were also superior. These important predictors can also be used to educate patients with stroke who are candidates for PAC rehabilitation about the course of recovery and health outcomes. Although the practical applicability of database studies such as this has been convincingly demonstrated, future studies can expand the range of clinical variables included in the analysis, which could yield further novel results and could also improve precision. Such data could be vital for developing, promoting, and improving health policies for treating patients with stroke.

Declarations
The dataset used during the current study is available from the corresponding author on request.

Authors' contributions
YCC and HYS contributed to the design of the study. JHC, YJY, HFL, CHL, HHH, KWH and SCY collected the data. YCC and HYS analyzed and interpreted the data, and drafted the manuscript. JHC, YJY, HFL, CHL, HHH, KWH and SCY reviewed the manuscript. All authors read, commented on, and approved the final manuscript.

Ethics approval and consent to participate
The study protocol was approved by the institutional review board at Kaohsiung Medical University Hospital (KMUH-IRB-20140308), and written informed consent was obtained from each participant.

Consent for publication
Not applicable.

Figure 1. Flowchart of the study.