Unplanned 30-Day Readmissions: Hospital Data Warehouse Modelling.

Background: Unplanned hospital readmissions are a major healthcare and economic burden. This study compared statistical methods and machine learning algorithms for predicting the risk of all-cause 30-day hospital readmission in two French academic hospitals. Methods: The dataset included hospital stays selected from the clinical data warehouses (CDW) of the two hospitals (Rennes and Tours Academic Hospitals) using the criteria of the French national methodology to measure the 30-day readmission rate (i.e. ≥ 18-year-old patients, geolocation, no iterative stays, and no hospitalization for palliative care). Then, the prediction performance of Logistic Regression, Naive Bayes, Gradient Boosting, Random Forest, and Neural Networks was compared separately for the two hospitals, using the same CDW data pre-processing for all algorithms. For each model, the area under the receiver operating characteristic curve (AUC) for 30-day readmission prediction and the time needed to train the algorithm were calculated. Results: In total, 259,092 and 197,815 stays were included from the Rennes and Tours Academic Hospital CDWs, respectively, with readmission rates of 8.8% (Rennes) and 9.5% (Tours). The AUC of the regression models for the two hospitals ranged from 0.61 to 0.64, with computation times exceeding 18 hours. The AUC of the machine learning models ranged from 0.61 to 0.69 with computation times below 13 hours. Conclusions: Better performance and shorter computation times are obtained with machine learning methods. It is still necessary to compare different algorithms to identify the most efficient model.


Background
The days following hospital discharge are at high risk of adverse events [1,2], and preventing unexpected readmissions is an important issue. Moreover, unplanned readmissions represent a healthcare and economic burden. In France, the 30-day readmission rate varies between 12% and 14% [3]. In the United States, the 30-day readmission rate is about 20%, depending on the state, and the cost of these readmissions is estimated at 17.4 billion dollars [4].
The days following hospital discharge are a key point in the care pathway [1,2], and care during this period must be coordinated between caregivers. To comprehensively address this problem in the context of population ageing and the increasing prevalence of chronic diseases, the care pathway between hospital and primary care has to be taken into account. A systematic review highlighted that some 30-day readmission events are avoidable, but their proportion varies greatly, depending on the judgment criterion (a combination of diagnostic codes, adverse events, adverse drug reactions, and subjective criteria) used to define such avoidability [5].
Public health policies have been implemented to reduce the 30-day readmission rates in many countries, such as France, Germany, Denmark [6], and the United States [7]. In France, the Agence Technique de l'Information sur l'Hospitalisation (ATIH; French national agency of medical information) has put in place an indicator of 30-day readmission, called RH30 [8], based on the French version of the Diagnosis-Related Group (F-DRG) system. However, it is also important to develop methods to anticipate unplanned readmissions, by identifying relevant clinical indicators, and by developing predictive models to integrate these indicators. The most widespread predictive models are based on classical statistical approaches that are widely used in medical research, such as logistic regression analysis, to assess potential improvements in medical care [9][10][11].
The progressive implementation of electronic health records (EHR) in healthcare facilities represents an unprecedented source of data that could be exploited to better understand the patient care pathway, both descriptively and analytically. The emergence of automatic learning methods makes it possible to use these data to anticipate patient trajectories. However, to make EHR usable for multicentre studies, an interoperability effort is required to arrange data in clinical data warehouses (CDW) using common terminologies [12,13]. Recently, machine learning methods have started to be tested (e.g., Gradient Boosting, Random Forest, Neural Networks) [9]. The contribution of each of these methods must now be assessed in similar conditions. The goal of this study was to compare different approaches to predict the 30-day readmission risk using two CDWs of two French hospitals.

Study design
Rennes and Tours hospitals are academic hospitals with similar activity and organisation (Table 1). The CDW data model and technical design are the same in these two hospitals [13]. The same algorithms for selection criteria, readmission criteria, and data management, and the same prediction algorithms, were run separately using the Rennes and Tours CDW data. Patients with at least one hospital admission date in Rennes or Tours between 1 January 2013 and 31 December 2018 were included in the study. The two datasets were then divided into training set (admission date before 31 December 2016) and test set (admission date after 1 January 2017).
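The chronological split into training and test sets described above can be illustrated with a minimal Python sketch. The record layout, the field names, and the inclusive treatment of the 31 December 2016 boundary are assumptions made for illustration; they are not taken from the CDW schema or from the study code.

```python
from datetime import date

# Hypothetical stay records; field names are illustrative, not the CDW schema.
stays = [
    {"stay_id": 1, "admission_date": date(2013, 5, 2)},
    {"stay_id": 2, "admission_date": date(2016, 12, 30)},
    {"stay_id": 3, "admission_date": date(2017, 1, 2)},
    {"stay_id": 4, "admission_date": date(2018, 11, 15)},
]

# Assumption: admissions up to and including this date go to the training set.
TRAIN_END = date(2016, 12, 31)


def temporal_split(records, train_end=TRAIN_END):
    """Split stays chronologically: earlier stays train, later stays test."""
    train = [r for r in records if r["admission_date"] <= train_end]
    test = [r for r in records if r["admission_date"] > train_end]
    return train, test


train_set, test_set = temporal_split(stays)
print([r["stay_id"] for r in train_set])
print([r["stay_id"] for r in test_set])
```

Splitting by date rather than at random keeps every test stay later than every training stay, which mimics how a model would be applied to future admissions.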
Rennes and Tours CDWs contain most of the EHR data, including clinical notes, drug prescriptions, laboratory test records, and claim data. The different stays are chained by patient in a de-identified EHR. All information about a patient's hospital stay is de-identified and organised in structured and unstructured data in the CDW.
Diagnoses and comorbidities were coded using the French version of the International Classification of Diseases, 10th Edition (ICD-10) [14], and grouped with the F-DRG [15]. Medical and surgical procedures were also coded according to the French classification of clinical procedures (CCAM) [16]. Drug prescriptions were coded using the Anatomical, Therapeutic and Chemical (ATC) codes in the Rennes CDW. Drug data were not yet available in the Tours CDW at the time of the analysis. Hospital departments and laboratory data are currently coded according to a local thesaurus.

Inclusion criteria
To ensure comparability of the results with the French national indicator for 30-day readmission (RH30), the selection criteria were those of the ATIH RH30 national methodology [8]: ≥18-year-old patients who received obstetrical/gynaecological, surgical, or medical care.

Non-inclusion criteria
Patients without geolocation in mainland France and patients in palliative care settings were excluded. Hospital stays with an entry mode other than from home and iterative stays (chemotherapy/radiotherapy sessions, transplant context, renal haemodialysis sessions, cataract surgery) were also excluded. Readmissions within 30 days after an iterative stay were not considered as unplanned readmissions.

Unplanned readmission definition
An unplanned readmission was defined as a hospital stay within 30 days after the end of the index stay.
The index stay was defined as a hospital stay with a discharge to home and not preceded by another hospitalisation within 30 days before the index admission date. Stays corresponding to the inclusion and exclusion criteria and with at least one index stay were included. The aim was to predict readmission within 30 days of the index stay.
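The two definitions above can be made concrete with a short Python sketch for a single patient. This is an illustration under stated assumptions, not the study's code: the field names are hypothetical, stays are assumed sorted by admission date and non-overlapping, the 30-day look-back is measured from the previous discharge date, and the exclusion criteria (iterative stays, palliative care, entry mode) are assumed to have been applied beforehand.

```python
from datetime import date

WINDOW_DAYS = 30

# Hypothetical stays for one patient, sorted by admission date.
stays = [
    {"admit": date(2017, 1, 1), "discharge": date(2017, 1, 5), "to_home": True},
    {"admit": date(2017, 1, 20), "discharge": date(2017, 1, 22), "to_home": True},
    {"admit": date(2017, 6, 1), "discharge": date(2017, 6, 3), "to_home": True},
]


def label_stays(patient_stays, window=WINDOW_DAYS):
    """Return one (is_index, readmitted) pair per stay of a single patient."""
    labels = []
    for i, stay in enumerate(patient_stays):
        prev = patient_stays[i - 1] if i > 0 else None
        nxt = patient_stays[i + 1] if i + 1 < len(patient_stays) else None
        # Index stay: discharged home and no hospitalisation in the look-back
        # window (assumption: measured from the previous discharge date).
        recent_prior = (
            prev is not None
            and (stay["admit"] - prev["discharge"]).days <= window
        )
        is_index = stay["to_home"] and not recent_prior
        # Readmission: the next stay starts within the window after discharge.
        readmitted = (
            is_index
            and nxt is not None
            and 0 <= (nxt["admit"] - stay["discharge"]).days <= window
        )
        labels.append((is_index, readmitted))
    return labels


labels = label_stays(stays)
print(labels)
```

In this toy example the second stay begins 15 days after the first discharge, so the first stay is an index stay with a readmission, and the second stay is not itself an index stay.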

Data extraction
Covariates extracted from the two CDWs [13] were: age, sex, length of stay, number of previous hospitalisations, illness severity (F-DRG classification), major diagnostic categories, medical diagnoses and comorbidities (ICD-10), medical and/or surgical procedures (CCAM classification), hospital department, and available laboratory data.
ICD-10 and CCAM codes were grouped by their first three characters, indicating the diagnostic category and the procedure and organ, respectively. Laboratory data were coded as binary variables, and were considered as abnormal if at least one value during the stay was outside the normal range.
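As an illustration of the two encoding rules above, the following Python sketch truncates codes to their first three characters and turns a series of laboratory values into a binary abnormality flag. The function names, the example codes, and the reference range are hypothetical and chosen only for the example.

```python
def group_code(code, n_chars=3):
    """Truncate an ICD-10 or CCAM code to its first n characters."""
    # Dots are dropped first so that "I50.1" and "I501" group identically.
    return code.replace(".", "").upper()[:n_chars]


def lab_abnormal(values, low, high):
    """Return 1 if any value measured during the stay is outside [low, high]."""
    return int(any(v < low or v > high for v in values))


# Example codes (illustrative): an ICD-10 diagnosis and a CCAM-style procedure.
print(group_code("I50.1"))
print(group_code("HHFA001"))

# Illustrative haemoglobin values (g/dL) with an assumed normal range of 12-16.
print(lab_abnormal([13.5, 11.2, 12.8], low=12.0, high=16.0))
print(lab_abnormal([13.5, 14.0], low=12.0, high=16.0))
```

Grouping at three characters trades clinical detail for a much smaller number of distinct features, and the binary flag ignores how far or how often a value leaves the normal range.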
Demographic data, aggregated at the city and district levels when the municipality had more than 10,000 inhabitants [17], were merged with the patient data using the geometric map background and the corresponding geolocation. Missing data were imputed using the K Nearest Neighbor method (taking the mean of the five nearest neighbors); numerical values were considered Missing At Random (MAR). Features with less than 1% of events were removed to avoid unpredictable and inexplicable responses from prediction models related to rare events.
For the logistic regression models, only significant covariates (p-value ≤0.05 by univariate analysis) were retained; a multivariate model with stepwise selection of variables was then fitted to obtain the most parsimonious model that maximised the area under the Receiver Operating Characteristic (ROC) curve (AUC).
The algorithms used for readmission prediction were those most frequently described in the literature: logistic regression, Random Forest, Gradient Boosting, Naive Bayes, and Neural Networks [9,18-21].
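The K Nearest Neighbor imputation and the rare-feature filter described in this section can be sketched in plain Python as follows. This is a simplified illustration, not the study's R implementation: the column names are hypothetical, the distance is an unscaled Euclidean distance over the other columns (which are assumed complete), and only one column is imputed at a time.

```python
import math


def knn_impute(rows, target, k=5):
    """Fill None in column `target` with the mean of the k nearest complete rows."""
    others = [c for c in rows[0] if c != target]
    complete = [r for r in rows if r[target] is not None]
    imputed = []
    for r in rows:
        if r[target] is not None:
            imputed.append(dict(r))
            continue
        # Rank complete rows by Euclidean distance on the other columns.
        nearest = sorted(
            complete,
            key=lambda c: math.dist(
                [r[o] for o in others], [c[o] for o in others]
            ),
        )[:k]
        value = sum(n[target] for n in nearest) / len(nearest)
        imputed.append({**r, target: value})
    return imputed


def drop_rare_features(rows, binary_cols, min_rate=0.01):
    """Keep binary features whose event rate is at least min_rate."""
    n = len(rows)
    return [c for c in binary_cols if sum(r[c] for r in rows) / n >= min_rate]


# Hypothetical data: the last row has a missing area-level income value.
rows = [
    {"age": 30, "income": 20.0}, {"age": 32, "income": 22.0},
    {"age": 34, "income": 24.0}, {"age": 36, "income": 26.0},
    {"age": 38, "income": 28.0}, {"age": 80, "income": 60.0},
    {"age": 35, "income": None},
]
filled = knn_impute(rows, "income", k=5)
print(filled[-1]["income"])

# Feature "a" occurs in 0.5% of 200 stays, feature "b" in 5%.
flags = [{"a": int(i == 0), "b": int(i < 10)} for i in range(200)]
kept = drop_rare_features(flags, ["a", "b"])
print(kept)
```

In practice the covariates would usually be standardised before computing distances, so that variables measured on large scales do not dominate the choice of neighbours.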
The main outcome was the AUC, and the secondary outcomes were sensitivity, specificity, and positive and negative predictive values at the cut-point closest to the top-left corner of the ROC space. The calculation time was measured after the data management step, from the start of feature selection to the end of model fitting.
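The outcome measures above can be illustrated with a small, self-contained Python sketch. It computes the AUC as the proportion of positive-negative pairs ranked correctly and then picks the threshold whose ROC point lies closest to the top-left corner (0, 1). The labels and scores are made-up toy values, and the function names are illustrative; this is not the code used in the study.

```python
import math


def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A tie between a positive and a negative counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def ratio(num, den):
    """Safe division: NaN when the denominator is zero."""
    return num / den if den else float("nan")


def closest_to_top_left(y_true, scores):
    """Return metrics at the threshold nearest to the ROC point (0, 1)."""
    best = None
    for t in sorted(set(scores)):
        pairs = list(zip(y_true, scores))
        tp = sum(1 for y, s in pairs if y == 1 and s >= t)
        fp = sum(1 for y, s in pairs if y == 0 and s >= t)
        fn = sum(1 for y, s in pairs if y == 1 and s < t)
        tn = sum(1 for y, s in pairs if y == 0 and s < t)
        sens, spec = ratio(tp, tp + fn), ratio(tn, tn + fp)
        distance = math.hypot(1 - sens, 1 - spec)
        if best is None or distance < best["distance"]:
            best = {
                "threshold": t, "distance": distance,
                "sensitivity": sens, "specificity": spec,
                "ppv": ratio(tp, tp + fp), "npv": ratio(tn, tn + fn),
            }
    return best


# Toy example: 1 = readmitted within 30 days, scores = predicted risk.
y = [0, 0, 1, 0, 1, 1, 0, 1]
risk = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
auc_value = auc(y, risk)
best = closest_to_top_left(y, risk)
print(auc_value, best["threshold"], best["sensitivity"], best["specificity"])
```

The pairwise AUC is quadratic in the number of stays, which is fine for a toy example; on datasets of the size reported here a rank-based computation would be the usual choice.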
As a secondary objective, the model explainability was assessed by identifying the covariates considered to be important by the different algorithms. This importance was assessed according to criteria adapted to each algorithm: Odds Ratio (OR) for logistic regression, relative influence for Gradient Boosting, Gini index for Random Forest, and Garson algorithm for Neural Networks. The most relevant covariates were compared between hospitals.
Data handling and pre-processing were performed on R Studio, version 3.6.0 [22].

Ethics and Consent
The clinical data warehouses have been authorized by the . The 30-day readmission rate for the included stays was 9.5% (Figure 1).
Comparison of patients with and without unplanned 30-day readmission showed that patients with unexpected readmission were older (+3.8 years in Rennes and +6.9 years in Tours), had longer mean length of stay (+1.4 days in Rennes and +0.9 days in Tours) and higher mean number of previous stays (2.6 and 2.4 in Rennes and Tours, respectively, versus 1.8). The socio-demographic data did not differ between patients with and without unplanned 30-day readmission (Table 2). Tables 3 and 4. Concerning the model explainability, the covariates selected by the stepwise regression models were all different, except for the 'musculoskeletal and connective tissue diseases and injuries' covariate. With Gradient Boosting, among the 20 covariates with the greatest relative influence, seven covariates were shared by both hospitals: age, number of previous stays, ICD-10 R18 ascites, red blood cell count, haemoglobin, urea, and C-reactive protein. The Random Forest model prioritized numerical covariates in both hospitals: age, severity level, length of stay, number of previous stays, and sociodemographic variables. Neural Networks did not have any common covariate between hospitals (Additional file 1 shows the most important variables for the different tested models with the CDW data from Rennes and Tours hospitals).

Discussion
This is the first 30-day readmission predictive modelling study performed using data from two CDWs in France. This study was carried out using a methodology as close as possible to the definition of the French national indicator of unplanned 30-day readmission and the explanatory variables suggested by the French national agency of medical information.
The population health of the neighbouring areas, the local health care network, the hospital management, the computerization of services, and the data integration advancement in CDWs are factors that might influence the results of the models. The choice of inclusion criteria and the focus on all-cause 30-day readmissions influenced the performance of the predictive models. The selection of covariates and the algorithm used are end-of-process parameters that depend on the previous steps.
The healthcare organization is very similar at the Rennes and Tours academic hospitals, both in terms of positioning within the local healthcare network and in terms of patient discharge policy, emergency activities, unplanned care rate, complex clinical situations, and university hospital activities providing innovative care. The main difference concerns cancer management, which is part of the care activities of Tours hospital, whereas it is performed in a separate cancer centre in Rennes. This difference explains the higher number of excluded stays for palliative care and iterative stays at Tours compared with Rennes. Data integration using the same CDW technology reduces variability. However, the two CDWs are at different stages of development in terms of flow development and data integration, and therefore drug data were not yet available in the Tours CDW. Running the same algorithms with the same inclusion criteria also allowed assessing the consequences of differences in data quality on algorithm tuning.
The readmission rates found for Rennes and Tours academic hospitals were lower than previously reported; however, several explanatory factors need to be taken into account. First, the selection of stays was carried out according to a methodology that excluded <18-year-old people, rehabilitation stays, iterative stays, and palliative care, thus eliminating stays that often have more frequent 30-day readmissions. Moreover, as early readmissions on the first day of discharge were merged with the previous stay due to the claim data rules, it was not possible to identify them in the structured data. This caused an underestimation of the readmission rate and a loss of information for prediction modelling. As the two CDWs contained only data for a specific hospital, it was not possible to identify potentially unexpected readmissions in other hospitals or to identify alternate care facilities [23]. Consistent with these limitations, the performance (AUC value) of our prediction models was lower than the mean AUC of 0.71 found in the literature [9] (i.e. studies on 30-day hospital readmission prediction models, in peer-reviewed journals and in repositories available on GitHub [24]).
One of the main issues mentioned in previous works is the lack of reproducibility. Our work overcomes this difficulty by performing the same prediction analyses using data from two similar CDWs with the same definition of target population, the same data management, and the same algorithms. Predictive models from studies in the United States often included as covariates the patients' resource conditions, health insurance coverage, ethnicity, and job status, which are available in American health care facilities, and have a significant impact on the probability of readmission. In France, the social context is also a determining factor for 30-day readmission prediction, but fewer data are available to assess the patients' social situation. Therefore, aggregated data were included at the level of cities and districts to try to improve the performance of the predictive models. However, aggregated data can introduce an ecological bias, due to the lack of individual specification of this information and the correlations between social and geographical inequalities. For both cities, socio-demographic characteristics were not significantly different between readmitted and non-readmitted patients.
The long computation times for the logistic regression models showed the inadequacy of classical statistical methods of stepwise covariate selection, in addition to their limited performance and reproducibility. Conversely, methods adapted to large datasets gave better prediction performances with shorter computation times. The covariates were selected using a data-driven method by choosing the information that improved the model performance. The purpose of this study was not to interpret the association between covariates and unplanned 30-day readmission. ICD-10 and CCAM codes were grouped in hierarchical codes to reduce dimensionality and to retain clinically relevant explainability. The importance of covariates differed among algorithms, for reasons specific to the mathematical logic of each model. For instance, Random Forest placed more importance on numerical features.
The objective of this study was to compare the performance of different algorithms. As the models do not have a specific use yet, there was no reason to focus on sensitivity (to screen for stays at risk of readmission) or specificity (to target specifically stays with readmission). During the development and usage of care indicators, their usefulness and the medical facts interpreted must be constantly questioned. The results obtained with these prediction models could be used to target some patients at high risk of unplanned 30-day readmission. However, their current performance is not sufficient for a wider use. Furthermore, explaining the reasons for a potential 30-day readmission in a way that the physician can understand remains a challenge for medical decision making in current clinical practice [25].
In the future, an improvement in the models' performance or the identification of events that provide a more operational response for decision making at discharge is necessary. In addition, implementing CDWs with a similar data model might allow validating prediction algorithms using data from different hospitals and assessing the replicability of results.

Availability of data and materials statement
The data used for this study are real-life medical data from Rennes and Tours academic hospitals. These data are integrated and de-identified in clinical data warehouses.
The datasets generated and/or analysed during the current study are not publicly available due to professional secrecy but are available from the corresponding author on reasonable request.

Competing interests
None declared