Forecasting Daily Arrivals and Peak Occupancy in a Combined Emergency Department

Objective Emergency department (ED) crowding is a global problem associated with negative patient outcomes such as mortality and prolonged length of stay. Forecasting overcrowding would enable pre-emptive strategical maneuvers and is a subject of constant academic interest. However, most studies focus on forecasting arrivals in United States ED setting. We propose a novel and intuitive crowding metric called daily peak occupancy and assess forecasting ED crowding in a Nordic Combined ED using both established and novel predictive algorithms. Methods All episodes of care in Tampere University Hospital ED were acquired from December 1, 2014 to June 19, 2019, amounting to 488 167 individual events. Predictability of two target variables was investigated: total daily arrivals (TDA) and daily peak occupancy (DPO) with forecast horizon of one day. Three models were investigated: Seasonal Autoregressive Moving Average (SARIMA), Facebook’s Prophet algorithm, and General Linear Model (GLM). Calendar variables were used as independent variables. Results SARIMA outperformed other models in predicting both total daily arrivals and daily peak occupancy with mean absolute percentage errors of 6.6% (±5.3) and 12.4% (±10.7) respectively. Next day overcrowding can be predicted using SARIMA with an AUC of 0.74 and accuracy of 79 %. Conclusion Predictive models can be utilized in a Nordic emergency medicine setting with similar or better accuracy as previously reported. Predicting future occupancy is possible but more challenging than predicting arrivals. Predicting peaks in demand remains a signicant challenge. Future work should focus on investigating the value of exogenous variables in increasing model sensitivity.


Introduction
Emergency department (ED) crowding is a global problem associated with negative patient outcomes such as mortality, increased number of medical errors and prolonged length of stay (1)(2)(3). ED crowding is a complex phenomenon that can be divided into three components as described by Asplin et al. 2003: 1) patient input, 2) throughput, 3) output (4). Major factors affecting input are the number of seriously ill and injured patients, the availability of ambulatory care, and socioeconomic characteristics of the population. In some systems, input can also be adjusted by diverting patients to other centers.
Throughput is controlled by the speed of triage and diagnostic evaluation, whereas output is controlled by the availability of patient beds both inside and outside the hospital. Improving patient input or output requires broad, multi-stakeholder effort throughout the healthcare system. Throughput can be directly modulated by the ER management by changes in resourcing and is thus a prime target for optimization (5).
A challenge to resource optimization is the di culty of predicting the load the ED is exposed to. Much research has been done on predicting the number of arrivals or the number of admissions. In order to better forecast future patient volumes, predictive algorithms with different forecasting horizons and exogenous variables have been investigated (6,7). Typical inputs for predictive algorithms utilize calendar and meteorological variables as input. Time series tools such as Seasonal Autoregressive Integrated Moving Average (SARIMA) have achieved mean average percentage errors (MAPE) of 4-15% on a one day ahead forecast horizon (6). The ability to better predict ED loading would enable hospital administrators to make informed decisions on resourcing and improve ED quality metrics. However, most of the research has been conducted in the United States using arrivals as the target variable with less emphasis on occupancy (7). In this study we aimed to provide a novel perspective on ED forecasting by introducing a new target variable called daily peak occupancy (DPO). DPO is affected by all three components of the Asplin's model in contrast to daily arrivals, which may prove useful when building multivariate models with several exogenous independent variables. Moreover, we hypothesize that DPO may be a more clinically relevant target variable than daily arrivals since high input does not necessarily develop into crowding if compensated by e cient throughput and output. This study has three goals. First we investigate the applicability of well-documented forecasting models in a Nordic setting. Second, we document the performance of novel Prophet algorithm in the context of ED forecasting. Third, we introduce a novel target variable called DPO and investigate it's predictability in terms of both continuous and discrete error measures.

Data
Tampere University Hospital is an academic hospital located in Tampere, Finland serving a population of 900,251 patients and providing level 1 trauma center equivalent capabilities. The ED has 70 beds and approximately 100,000 patients are treated annually. For this study, data on all registered ED visits were obtained from hospital database created during the sample period from December 1, 2014 to June 19, 2019. The requested dataset contained timestamps for admission and discharge, diagnosis, patient's age, gender, identi er of the physician and patient's municipality of residence. Overcrowding is relatively common and an internal protocol has been implemented to help the staff to react to these situations. Two levels of overcrowding are recognized: level 1 when occupancy exceeds 80 patients (70% occupancy) and level 2 when occupancy exceeds 100 patients (88% occupancy). Overcrowding annotations were acquired from a patient logistics system (Uoma, Unitary Healthcare Ltd., Tampere, Finland) and these were used to validate the occupancy reconstruction as described below. This was a retrospective study using anonymized data acquired from the hospital electronic health record. According to Medical Research Act of Finland, retrospective studies base on health records are exempted from review by Ethics committee and it was approved by Tampere University Hospital Research, Development and Innovation Centre.

Study design
This study consist of two phases, both of which aim to predict events one day ahead. In the rst phase we perform predictions for total daily arrivals. In the second phase DPO is used as the target variable. Daily Peak Occupancy (DPO) is de ned as the highest occupancy observed on any given day. Occupancy of the emergency department is not continuously recorded in the hospital database and thus it was not contained in the requested dataset. We reconstructed the hourly occupancy by utilizing an algorithm that used the timestamps of admission and discharge as inputs to output hourly occupancy. Once the hourly occupancy had been reconstructed we resampled the data into daily resolution by extracting the maximum occupancy of each day. The resulting vector was validated by comparing it to overcrowding alerts given by ED staff and recorded using a dedicated patient logistics system as described above. The busiest 25% of the days were labeled as crowded since this threshold has been previously documented to be associated with increased 10-day mortality (8).

Analysis
Statistical analyses were performed using Python 3.7.3. Jupyter Lab 1.0.4 powered by iPython 7.8.0 was used in the exploratory data analysis (9). Analytical tools and data structures were provided by Pandas 0.25.0 (10). Visualizations were created using Matplotlib 3.1.1. Predictive models were created using StatsModels 0.10.1 (11), Scikit-learn (12) and Prophet (13). SARIMA hyperparameters were optimized using pmdarima (14). The dependencies were managed using Anaconda 4.7.12. An exhaustive list of all dependencies is provided in the Appendix.

Predictive models
Both simple naïve and seasonal naïve models were included as benchmarks. In case of simple naïve, the value of the previous day is used as the prediction for tomorrow. Seasonal naïve on the other hand works by observing the corresponding value a season ago and using this value for predicting the future. For example, when forecasting the occupancy peak of the coming Tuesday, the peak of last Tuesday was used. Some earlier publications utilize the seasonal naïve as the baseline to compare different methods to (15). Others suggest linear regression as the benchmark method (16).
Seasonal Autoregressive Moving Average (SARIMA) is widely used in both econometrics and ED forecasting and has been established as a benchmark model in international forecasting competitions (17)(7). SARIMA consists of 7 independent components denoted by (p, d, q)x(P, D, Q)s. The parameters p, d, and q describe the autocorrelation, differencing, and size of the moving average window of the non-seasonal component of the models, whereas P, D, and Q describe the same values for the seasonal component. S parameter describes the length of the season and was assigned a value of 7 due to the wellknown weekly seasonality of the target variables. We report results for both SARIMA and SARIMAX, the latter of which was trained using weekdays and months as additional independent variables.
General Linear Model (GLM) has been widely used in predicting ED visits and is regarded as the benchmark against which other models are compared (16). Weekdays and months were used as independent variables and a Poisson distribution was assumed for both target variables.
Prophet is a modular regression model created by Facebook. It has three main components: trend, seasonality, and holidays and also supports adding exogenous predictors into the linear component of the model. The potential of Prophet in ED forecasting has been mentioned but as far as we know no results have so far been published in the peer-reviewed literature although results in u epidemic peak intensity prediction have been promising (18,19). Prophet is tested both with and without calendar variables.

Cross-validation
Both SARIMA and Prophet evaluate the complete historical vector to perform predictions into the future. We utilized a time series cross-validation to simulate moving forward in time while iteratively updating the models with new data. In both phases the data was divided into train and validation sets using December 31, 2017 as the split date. The resulting train sets were used to evaluate the initial hyperparameters of SARIMA and Prophet. We then iterated through the resulting validation set from January 1, 2018 to June 19, 2019 with steps of one day, performing predictions one day forward on each iteration and updating model parameters. GLM was trained using a simple holdout using the train and validation sets described above.

Accuracy measures
Mean Absolute Percentage Error (MAPE) is the most used accuracy metric in the eld of ED forecasting (7) and is widely used in time series forecasting in general. MAPE has the advantage of being scale-independent, thus allowing model comparisons between different datasets. For these reasons, MAPE was selected as the preferred continuous error measure in phases 1 and 2. The formula for MAPE is as follows: In phase 2 we evaluate the accuracy of the models as binary predictors with the goal of forecasting whether the following day will be crowded or not which calls for discrete error measures. We report Receiver operating characteristics (ROC) and calculate the Area under the curve (AUC) measure along with reporting overall accuracy, sensitivity and speci city of the models.

Descriptive statistics
In the sample period we observed 488,167 episodes of care and 131,644 individual patients. Of these patients 48% were male, 52% female and the average age was 52.7 years (95% CI 52.6-52.7). On average there were 261 (± 25) daily arrivals and the average daily peak occupancy was 75 (± 13). The 75th percentile was at the occupancy of 85 patients. A weekly seasonality was observed in the target variables and an hourly seasonality in the underlying hourly vectors. Number of arrivals was lowest at 5 a.m. with 3.5 (± 2.1) patients on average, and highest at 2 p.m. with 17.9 (± 4.5) patients. Lowest occupancy was observed at 6 a.m. with 17.6 (± 5.6) patients and highest at 4 p.m. with 71.6 (± 14.3) patients. In terms of weekly seasonality, Thursdays had the lowest number of daily arrivals with 246 (± 19.4) patients on average and Saturdays were the busiest with 276 (± 23.9) patients. On average, Sundays had the lowest DPO's with 64.4 (± 10.6) patients whereas Mondays had the highest with 85 (± 11.6) patients. Please see Fig. 1 for details.

Model performance
In phase 1 SARIMAX(1, 0, 1) with calendar variables was identi ed as the most accurate model in predicting TDA with MAPE of 6.6% (± 5.3). Prophet performed with MAPE 6.7% (± 6.4), which was superior to GLM, which performed with MAPE of 8.1% (± 6.6). Omitting calendar variables resulted in better performance when using Prophet (6.7 vs. 6.9 %). All models bested the benchmark models (Table 1, Fig. 2). . Prophet performed with 13.1% (± 11.6) and GLM with 14.5% (± 13) respectively. All models outperformed the benchmarking models. SARIMA, with de nitions described above, was the best predictor in terms of binary performance as well, providing an accuracy 79 %. Both Prophet and SARIMA without exogenous variables provided an AUC of 0.74 as did GLM although with temporally inconsistent manner. In terms of binary prediction, all models were high on speci city but low on sensitivity with unadjusted sensitivities ranging from 50 % on GLM to 37 % on SARIMA and speci cities ranging from 89 % on SARIMA to 77 % on GLM. (

Limitations
This study is limited by its retrospective nature and the generalizability of the results should be con rmed in a prospective setting. Despite validating the reconstructed occupancy as described above, some uncertainty remains on the correctness of the target variable. Only few independent variables were used and it is likely that the performance would improve e.g. by incorporating holiday variables. However, this was out of scope of our current investigation.

Discussion
In this study we showed that readily available algorithms can be used to predict not only future arrivals but also future peak occupancies on a daily resolution in a Nordic ED. SARIMA defended its place as the best performing model at least when the number of exogenous features is limited. In terms of discrete error measures, all models demonstrated poor sensitivity but acceptable speci city.
Comparing our results to those of earlier studies is challenging due to signi cant differences in both study settings and reporting practices. However, we document superior performance of 6.6 % in predicting total daily arrivals compared to Whitt et al. 2019 (20) in which MAPE 8.4 % was achieved using a SARIMAX model. Ekström et al. 2015 documented MAPE 6.5 % in a comparable hospital setting using GLM with internet activity of the preceding day and weekdays as the independent variables (21). It is noteworthy that similar forecasting accuracy was achieved in this study using only static calendar variables as an input.
Our ndings demonstrate that forecasting occupancy is more challenging than forecasting arrivals with MAPE 12.4 % compared to 6.6 % using a SARIMA model. This is theoretically sound; DPO is dependent not only on ED input, but also on throughput and output capacities (4). Whether the performance documented in this study is su cient to provide actionable information for hospital management is debatable. We hypothesize that inclusion of other exogenous variables in a multivariate model with DPO as the target variable could substantially improve the performance of the forecasts. We consider the availability of inpatient beds, number of website visits and ED throughput e ciency characteristics to be especially interesting potential predictors. We suggest that future work should focus on investigating the combined value of these predictors in improving model sensitivity.
DPO as an ED crowding metric is a relevant parameter as negative outcomes are related to occupancy of the ED, not input or throughput per se. This is also demonstrated by the fact that many validated scoring systems for ED crowding use parameters such as proportion of beds occupied and longest waiting time as input. The challenge with ED scoring systems is that in many cases their inputs are not readily available or require substantial integration effort to bring them into practice (22). Additionally, many scoring systems are generalizable only to a subset of ED's. For example, NEDOCS uses number of patients in the ventilator as one input of the score (23). Even though undoubtedly correlating with increased mortality, having a signi cant number of mechanically ventilated patients simultaneously in the emergency department is rare in the Finnish setting as these patients are typically promptly transferred to dedicated intensive care units and may not re ect the perceived crowding in a more general setting. In the future, the predictive power of DPO for negative outcomes and perceived crowding should be investigated.
The majority of attempts to forecast ED visits look at patient population as one surrogate (7). However, patients presenting at the ED vary widely and have different underlying conditions. It could therefore be bene cial to divide the ED into relevant subsections e.g. by department, acuity, diagnosis or method of arrival. An example of meaningful strati cation of ED visits was done by Aroua et al. (24). They strati ed patients according to 25 classes of major diagnostic categories and achieved low MAPEs of 7-9% for cardiovascular, respiratory, and gastrointestinal diagnoses, whereas MAPE was as high as 68 % for mental and substance abuse disorders. If implemented into a practical model, these ndings would have obvious utility in administrative decision-making. In addition, training individual models for a number of relevant subgroups may prove bene cial in increasing model accuracy since both seasonality and the effect of predictors likely differ signi cantly across different subgroups.

Conclusions
In this study we showed that readily available time series forecasting tools can be used in a Nordic combined ED with similar or better accuracy as previously reported. We also documented the performance of these models as binary predictors of high daily peak occupancy and demonstrated their poor sensitivity in this task. Further work should focus on increasing model sensitivity by identifying a set of useful independent variables and validating their use in a prospective setup.

Ethics approval
Since our study was retrospective in nature, an approval from the ethics committee was not required. An institutional approval was acquired prior to data collection with following speci cations.