A seq2seq model to forecast COVID-19 cases, deaths and reproduction numbers (R) in US counties

The global pandemic of coronavirus disease 2019 (COVID-19) has killed almost two million people worldwide and over 400 thousand in the United States (US). As the pandemic evolves, informed policy-making and strategic resource allocation rely on accurate forecasts. To predict the spread of the virus within US counties, we curated an array of county-level demographic and COVID-19-relevant health risk factors. In combination with the county-level case and death numbers curated by Johns Hopkins University, we developed a forecasting model using deep learning (DL). We implemented an autoencoder-based Seq2Seq model with gated recurrent units (GRUs) in the deep recurrent layers. We trained the model to predict future incident cases, deaths and the reproduction number, R. For most counties, the model makes accurate predictions of new incident cases, deaths and R values up to 30 days into the future. Our framework can also be used to predict other targets that are useful indices for policymaking, for example hospitalizations or the occupancy of intensive care units. Our DL framework is publicly available on GitHub and can be adapted for other indices of COVID-19 spread. We hope that our forecasts and model can help local governments in the continued fight against COVID-19.


Introduction
The global pandemic of COVID-19, first identified from an outbreak in Wuhan, China at the end of December 2019, has drastically changed societies and economies all over the world. As of the beginning of 2021, the virus had infected more than 86 million people and claimed almost two million lives across the globe. The US has reported approximately 24% of the world's total cases and 19% of the world's total deaths, and has been reporting the highest numbers of daily new infections in the world since November 2020. The cooler weather, the reopening of schools and businesses in the fall, holiday travel and gatherings, and the evolving infectiousness of the virus have all contributed to the rapidly rising case counts in many regions.
Accurate forecasting models are essential for government policymaking during the fight against the virus. A variety of methodologies have been used to forecast the spread of COVID-19. Most have used traditional epidemiological models such as the susceptible-infected-removed (SIR) or susceptible-exposed-infectious-removed (SEIR) models. These models use ordinary differential equations and rely on assumptions about, for example, the transmission characteristics of the virus. Machine learning and deep learning methods have also been used. They provide a complementary, data-driven approach that does not rely on epidemiological assumptions. The most important prerequisite for deep learning models is access to a large amount of data from which to learn. While deep learning models were limited by data availability in the early months, the long stretch of the pandemic worldwide has, unfortunately, provided ample data for more accurate modeling and longer-term forecasts.
Many teams and methods have been used to model and forecast the spread of the pandemic. Currently, over 50 teams from around the globe are participating in the COVID-19 Forecast Hub (www.covid19forecasthub.org), led by the Reich Lab at the University of Massachusetts. The Forecast Hub coordinates with the US Centers for Disease Control and Prevention (CDC) to generate weekly forecasts of cases, deaths, and hospitalizations in the US for up to four weeks in the future. The ensemble forecast, which uses the median prediction across a few dozen eligible models, is published weekly on the CDC website [1]. Most of the forecasts focus on national and state outcomes. Only a few teams have generated forecasts for US counties, and the weekly published ensemble models only forecast incident cases at the county level; they do not forecast deaths or other outcomes.
The main goal of this study was to improve the accuracy of county-level forecasts and to increase the types of outcomes forecast at the county level to include R values and deaths. Another strength of our model is the inclusion of multimodal predictors. We curated an array of relevant demographic indicators such as population density, rural/urban ratios, and a number of chronic health risk factors that have been linked to the risk of more severe COVID-19 infections. We demonstrate that our model learned patterns of input features to make accurate predictions of new incident cases, deaths, and R values up to 30 days in the future. Our framework can also be used to predict other targets that are useful indices for policymaking, for example hospitalization or the occupancy of intensive care units (ICU).

Data source
The time series for daily cumulative cases and deaths were downloaded from the COVID-19 Data Repository by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University [2]. Incident cases and deaths were generated from the cumulative series and seven-day moving averages (7-MA) were used as input sequences for the prediction models.
We computed the daily reproduction number, R, time series using a sliding weekly window for each county, based on the method described in Cori et al. [3] and implemented in the R statistical software package EpiEstim. Negative numbers in the incident case series due to reporting errors were set to 0. R indicates how fast the virus is spreading: it is the average number of secondary cases caused by a single infected individual over the whole infectious period. R values above 1 indicate increasing spread of the virus in the community, and the goal of epidemic control is to bring R below 1 (as close to 0 as possible).
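At the core of the Cori et al. estimator is a ratio of recent incidence to "total infectiousness" over the same window. A minimal sketch of that point estimate follows; the function name and the simplified formulation are ours, and EpiEstim itself additionally places a Bayesian gamma prior on R and reports credible intervals:

```python
import numpy as np

def estimate_r(incidence, si_dist, window=7):
    """Sliding-window point estimate of R: incidence summed over the
    window divided by total infectiousness summed over the window.
    `si_dist[k]` is the probability that the serial interval is k+1 days.
    Simplified illustration of the Cori et al. ratio, not EpiEstim itself."""
    incidence = np.asarray(incidence, dtype=float)
    t_max = len(incidence)
    # Total infectiousness: Lambda[t] = sum_k I[t-k] * w[k]
    lam = np.zeros(t_max)
    for t in range(t_max):
        for k, w in enumerate(si_dist, start=1):
            if t - k >= 0:
                lam[t] += incidence[t - k] * w
    r = np.full(t_max, np.nan)
    for t in range(window, t_max):
        num = incidence[t - window + 1 : t + 1].sum()
        den = lam[t - window + 1 : t + 1].sum()
        if den > 0:
            r[t] = num / den
    return r
```

With a toy one-day serial interval and case counts that double daily, this estimator recovers R = 2, matching the intuition that each case produces two secondary cases.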
Google has released data on daily mobility score changes reflecting the effects of social distancing and local lockdown measures, as well as school and business status. The Google Community Mobility Reports (https://www.google.com/covid19/mobility/) contain percentages of movement changes over six types of places: retail & recreation, grocery & pharmacy, parks, transit, workplaces, and residential regions. The changes are relative to "baseline" values reported from January to February 2020. We computed a single daily mobility score by averaging the percentage changes from five categories (retail & recreation, grocery & pharmacy, parks, transit, workplaces) and the negated residential change. All six categories were highly significantly correlated with one another and with the mean daily mobility score. The daily mean mobility scores were missing for 24% of days, for which we carried forward the most recent available scores. Mobility scores prior to February 15, 2020 were considered baseline and therefore set to 0%.
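The collapse of the six categories into one daily score can be sketched as follows (argument names are illustrative, not Google's column headers; residential change is negated because staying home more corresponds to less mobility):

```python
import numpy as np

def daily_mobility_score(retail, grocery, parks, transit, workplaces, residential):
    """Mean of the five 'away from home' percentage changes and the
    negated residential change, giving one daily mobility score."""
    return float(np.mean([retail, grocery, parks, transit, workplaces, -residential]))
```

For example, a day with changes of -10, -5, 5, -20, -30 and +12 (residential) yields a score of -12.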
Population size, population density, rural/urban ratio, and other demographic information such as sex, age and racial composition, were obtained from the 2019 National and State Population Estimates from the US Census Bureau. The numbers of ICU beds (per 1,000 population) were obtained from the 2019 Kaiser Family Foundation Survey (KFF). Other health risk factors, such as differential use of preventive services (preventable hospitalization rate, insurance coverage, mammography and flu vaccine immunization rates), chronic health conditions (diabetes, smoking and obesity), socioeconomic factors (household income, poverty, education, housing density, unemployment rate), as well as an air quality index, were obtained from the Behavioral Risk Factor Surveillance System (BRFSS). A total of 28 county-level variables are included as baseline features. All data were linked by counties' 5-digit FIPS (Federal Information Processing Standards) codes.

Seq2Seq Model
Our prediction model used the Sequence-to-sequence learning framework (Seq2Seq) that powers language-processing applications like Google Translate [4].
The Seq2Seq algorithm trains a model to convert sequences from one domain (e.g. past COVID-19 cases, deaths, resident mobility, etc.) to sequences in another domain (e.g. future COVID-19 cases). Seq2Seq generation is flexible in that input and output sequences can have different lengths. However, information on the entire input sequence is required in order to predict the target sequence.
Seq2Seq uses recurrent neural networks (RNNs), which are specialized for mapping sequences. Our Seq2Seq model (shown in Fig. 1) uses gated recurrent units (GRUs) [5] in the RNN layers. The model consists of three parts: an encoder, an encoding vector (generated from the input sequence), and a decoder [4,6]. It takes m days of data as input and predicts COVID-19 cases for n future days. The encoder generates an aggregated representation of the m-day input sequence. Our encoder is a stack of m GRU units; each unit accepts the input features for a single day of the sequence (daily cases, deaths, mobility, etc.), aggregates the information, and propagates it forward via its hidden state h_i to the next GRU unit. The final hidden state produced by the encoder represents the encoding of the entire input sequence; this vector aims to encapsulate the information from all input elements so that the decoder can make accurate predictions. The encoding is provided as the initial hidden state of the decoder, that is, h_{m+1} = s_1 in Fig. 1. The decoder predicts the output sequence from this encoding: it is a stack of n GRU units, where the jth unit predicts COVID-19 cases for the jth future day. Each GRU unit accepts the hidden state from the previous unit and produces an output as well as its own hidden state.
The power of this model lies in its ability to map sequences of different lengths to each other: the ith input and jth output need not align one-to-one, and the input and output sequence lengths can differ.
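The encoder-decoder data flow can be illustrated with a minimal NumPy sketch using untrained random weights; our actual model is implemented in Keras and learns its weights, so this only shows how the final encoder state becomes the decoder's initial state and how the decoder rolls out n future values:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell (Cho et al.) with random, untrained weights,
    for illustrating the forward pass only."""
    def __init__(self, n_in, n_hidden):
        s = 0.1
        self.Wz, self.Uz = rng.normal(0, s, (n_hidden, n_in)), rng.normal(0, s, (n_hidden, n_hidden))
        self.Wr, self.Ur = rng.normal(0, s, (n_hidden, n_in)), rng.normal(0, s, (n_hidden, n_hidden))
        self.Wh, self.Uh = rng.normal(0, s, (n_hidden, n_in)), rng.normal(0, s, (n_hidden, n_hidden))

    def step(self, x, h):
        z = sigmoid(self.Wz @ x + self.Uz @ h)            # update gate
        r = sigmoid(self.Wr @ x + self.Ur @ h)            # reset gate
        h_tilde = np.tanh(self.Wh @ x + self.Uh @ (r * h))
        return z * h + (1 - z) * h_tilde                  # new hidden state

def seq2seq_forecast(inputs, n_future, n_hidden=16):
    """Encode an (m, features) input sequence day by day, hand the final
    hidden state to the decoder (h_{m+1} = s_1), and roll out n_future
    scalar outputs via a linear readout that is fed back as the next
    decoder input."""
    encoder = GRUCell(inputs.shape[1], n_hidden)
    decoder = GRUCell(1, n_hidden)
    W_out = rng.normal(0, 0.1, (1, n_hidden))
    h = np.zeros(n_hidden)
    for x in inputs:                    # encoder: aggregate m days
        h = encoder.step(x, h)
    s, y = h, np.zeros(1)               # encoding -> decoder initial state
    outputs = []
    for _ in range(n_future):           # decoder: predict n future days
        s = decoder.step(y, s)
        y = W_out @ s
        outputs.append(float(y[0]))
    return outputs
```

Running it on a 90-day, 4-feature input produces a 30-value output sequence, matching the m-in, n-out shape described above.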

Model selection and hyperparameter tuning
Hyperparameter tuning for the Seq2Seq model was carried out using the Keras API (version 2.3.1), the TensorFlow library (version 1.14.0) and HyperOpt [7].
The encoder and decoder each have two layers. Our search space included layer sizes ranging from 3 to 5000. We used the Adam optimizer and tuned the learning rate and decay in the range 1e-6 to 1e-2. We tuned the L2 regularization parameter in the range 1e-7 to 1e-3; L2 regularization was applied to both the weights and biases. We tuned the mini-batch size in the range 4-256. Early stopping was used to avoid overfitting. Models were trained using the mean squared error (MSE) as the loss function. The best model hyperparameters were chosen based on the lowest validation loss. We inspected both the training and validation losses to ensure that models were not overfitting.
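The search space above can be sketched in plain Python for illustration; in our actual tuning runs, HyperOpt expressions such as hp.loguniform and hp.quniform play the role of these samplers, and the discrete batch-size grid below is our own illustrative choice:

```python
import math
import random

random.seed(0)

def log_uniform(lo, hi):
    """Sample uniformly on a log scale, analogous to hp.loguniform."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

def sample_config():
    """Draw one configuration from the search space described above:
    layer size, Adam learning rate and decay, L2 strength, batch size."""
    return {
        "layer_size": random.randint(3, 5000),
        "learning_rate": log_uniform(1e-6, 1e-2),
        "decay": log_uniform(1e-6, 1e-2),
        "l2": log_uniform(1e-7, 1e-3),
        "batch_size": random.choice([4, 8, 16, 32, 64, 128, 256]),
    }
```

Each sampled configuration would be trained with early stopping and scored by its validation MSE, with the lowest-loss configuration retained.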
Learning and remembering a long input sequence can be challenging for a recurrent network. Short input sequences, on the other hand, may not provide enough information for learning patterns and making accurate forecasts. To determine the optimal input sequence length, we used HyperOpt to search the above hyperparameter spaces for three input sequence sizes: 60, 90 and 120 days of input data (cases, deaths, Google mobility, and R values). The model predicted outcomes for each day over the next 30 days. The optimal input sequence length was the one that resulted in models with the lowest validation losses.

Model training and forecast evaluation
Of the 3242 counties (or county-equivalents) in the US states and territories, 2780 had data on the input features we selected for modelling. We randomly split these counties into training (n = 1983, 71.3%), validation (n = 398, 14.3%) and test sets (n = 399, 14.4%). We used the training and validation sets for the hyperparameter search and for learning the model weights and biases. As Supplementary Fig. 1 shows, the input sequences were generated from the whole series with an N-day sliding window, where N is the optimal input length. For each county, day 1 was the day on which its first case was reported. Models were trained on stacks of input sequences of the fixed optimal length N to predict the next 30 days of outcomes.
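The sliding-window construction amounts to the following sketch (the assumption that the target series occupies the first feature column is ours, made for illustration; N = 90 matches the optimal length reported later):

```python
import numpy as np

def make_windows(series, n_in=90, n_out=30):
    """Slice a county's daily series (days x features) into overlapping
    (input, target) pairs: n_in days of all features in, the next n_out
    days of the target (assumed here to be column 0) out."""
    X, Y = [], []
    for start in range(len(series) - n_in - n_out + 1):
        X.append(series[start : start + n_in])
        Y.append(series[start + n_in : start + n_in + n_out, 0])
    return np.array(X), np.array(Y)
```

A 200-day, 4-feature county series yields 200 - 90 - 30 + 1 = 81 overlapping training windows.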
Training stopped when the validation MSE did not decrease over two consecutive epochs. We evaluated model performance on the test set using data up to November 30, 2020. Accuracy metrics included the relative error (RE = (predicted value - observed value)/observed value) and the intraclass correlation coefficient between predicted and observed values using a two-way random-effects model (ICC(2,1)). In addition to examining prediction accuracy for the test set on days for which we had training and validation data, we also evaluated the model historically by comparing the forecasted values at a given date with the actual values reported later. For the historical performance, we report Pearson's correlation coefficient (Pearson's r) and RE for the November 30 forecast to evaluate how well the model predicted future cases and deaths over the next 30 days.
Finally, using our 30-day forecast, we generated the 4-week-ahead weekly forecast (total weekly incident cases) that we submitted to the COVID-19 Forecast Hub for inclusion in the CDC ensemble model. We compared the historical performance of our forecast with that of the CDC ensemble model, as well as the models of the individual participating teams, on two forecasting dates, November 23 and 30, 2020. In addition to the ICC(2,1) coefficients between each team's forecasted values and the actual reported values, and the REs, we also report absolute relative errors (ARE = |predicted value - observed value|/observed value), whose median represents the overall deviation of predicted from observed values better than the median RE does.
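The RE and ARE metrics defined above are straightforward to compute for paired forecast/observation arrays; a sketch:

```python
import numpy as np

def relative_errors(predicted, observed):
    """RE = (predicted - observed)/observed and ARE = |RE|, elementwise,
    for paired forecast and observation arrays."""
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    re = (predicted - observed) / observed
    return re, np.abs(re)
```

For instance, forecasts of 90 and 110 against observations of 100 give REs of -0.1 and 0.1, which cancel in a median of REs but both contribute 0.1 to the median ARE, which is why the median ARE better reflects overall deviation.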

Determine the optimum input sequence length
The initial hyperparameter space search for models using various lengths of input sequence showed that the lowest training and validation MSEs were achieved with 90 days of input (Supplementary Fig. 2). Using either shorter (60 days) or longer (120 days) inputs resulted in higher training and validation MSEs. We therefore built overlapping 90-day sliding windows from the input sequences to train the model to predict the next 30 days, as illustrated in Supplementary Fig. 1.
The relative errors (REs) are plotted in Fig. 2e. The median REs are close to 0 for both incident cases (-0.04) and R values (0.01), although both are statistically significantly different from 0 (p < .0001) owing to the large sample sizes. In contrast, the median RE for incident deaths is -0.89 and the RE distribution deviates markedly from 0, indicating that our model substantially under-predicts deaths.

Forecast evaluation
The performance of the model was evaluated against the cases and deaths reported later. Using the forecasts generated on November 30 as an example, Supplementary Fig. 3a shows that the daily predicted cases from all counties for the next 30 days were significantly correlated with the seven-day moving average of actually reported cases. Over the 30 days, Pearson's r declined slowly but remained highly significant, from r = .98 on the first day to ~.70 on the furthest forecast days (all p < .001). Supplementary Fig. 3b shows the predicted versus actual deaths; the daily correlation r ranged from 0.87 to 0.63 (all p < .001). Supplementary Fig. 3c shows the predicted R values versus the R calculated from actual cases; the daily correlation r ranged from 0.64 to 0.29 (all p < .001). Supplementary Fig. 3d shows accurate case forecasts, based on consistent and small RE values close to 0 (median RE -0.06). Death forecasts, although having a smaller RE quantile range, consistently deviated from 0 in the negative direction (median RE -0.76). The R values have very small quantile ranges and an overall median RE (-0.008) close to zero. Similar results were found for forecasts generated on other dates (not reported). The 30-day-ahead forecasts are updated weekly on Mondays and archived in our GitHub repository (https://ylzhang29.github.io/UpstateSU-GRU-Covid/).

Comparison with other models
We compared our model's forecasts at two time points, November 23 and November 30, with the corresponding results from the CDC ensemble model and the individual models of the participating teams. Twelve teams and their models have reported forecasts for US counties, including the CDC ensemble model; only case forecasts have been reported by these teams. Note that our forecasts for these two dates were not submitted for inclusion in the ensemble model.
The total numbers of counties forecasted by each team/model and their median REs are summarized in Supplementary Table 1.

Examination of the RE distributions (Supplementary
To compare the models on forecasts of the same set of counties, we excluded three teams that reported on fewer than 50% of the total counties. Nine teams/models, including the ensemble model, were compared with our model using only the common set of counties for which all models reported forecasts. Figure 3a plots the ARE distributions of the two forecasts for each of the four weeks ahead from these nine models and our model. Figure 3b plots the overall median ARE for each team over the four weeks of forecasts. Table 1 lists additional evaluation metrics, including both median RE and median ARE; the Wilcoxon rank-sum test was used to compare our model's AREs with those of the other models (see Table 1). Table 2 shows the ICC(2,1) between the predicted and actual cases for each model's forecast. Overall, these results show that while all models produced forecasts in significant agreement with the actual observed values, large prediction uncertainties remain in many counties, evidenced by RE and ARE medians that deviate substantially from 0. The prediction uncertainty increases as the forecast interval increases (Fig. 3b).

Uncertainty in forecasting models is common, as has been observed in influenza epidemics and other serious outbreaks such as the Ebola and Zika viruses [9][10][11][12]. Many complex and unknown factors contribute to virus spread, disease severity and mortality. It is also common to have many irregularities and unforeseen discrepancies in the reporting process and database curation, which add noise to the data and pose challenges for accurate forecasting.
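For reference, ICC(2,1) (two-way random effects, absolute agreement, single measure, per Shrout and Fleiss) can be computed from the two-way ANOVA mean squares. A sketch, treating the predicted and observed case counts as two "raters" of each county:

```python
import numpy as np

def icc_2_1(ratings):
    """ICC(2,1) for an n_subjects x n_raters matrix, from the two-way
    ANOVA decomposition (Shrout & Fleiss, two-way random effects,
    single measure, absolute agreement)."""
    Y = np.asarray(ratings, dtype=float)
    n, k = Y.shape
    grand = Y.mean()
    ss_subj = k * ((Y.mean(axis=1) - grand) ** 2).sum()
    ss_rater = n * ((Y.mean(axis=0) - grand) ** 2).sum()
    ss_err = ((Y - grand) ** 2).sum() - ss_subj - ss_rater
    msr = ss_subj / (n - 1)                 # between-subjects mean square
    msc = ss_rater / (k - 1)                # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))      # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
```

Perfect agreement between predicted and observed columns yields an ICC of exactly 1, and systematic or random disagreement pulls the value toward 0.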
While the factors that contributed to our forecast errors and uncertainties remain to be clarified, incorporating other local parameters, such as hospitalization and ICU saturation rates and data from local nursing homes, may help to improve forecast accuracy for incident deaths. However, these data are not readily available for many counties. The COVID-19 pandemic continues in the US and throughout the world, with substantial uncertainty over the coming months due to the appearance of new strains and the logistics of vaccinating entire populations. Having data on these variables at the county level could improve forecasts in the future.
One important limitation to consider when interpreting the forecasting results is that pandemic projections, ours and others', are based only on observed reported cases. Case reporting, however, is not uniform across entities. In the US, much of the information about cases is collected and collated at the level of the county or large municipality, which reports to a state department of health, which in turn reports to national repositories. One entity may put a strong emphasis on testing individuals who present with symptoms, whereas another may have implemented a widespread asymptomatic surveillance policy. How and which cases are identified can be dramatically affected by such policy differences and testing strategies, which were largely influenced by non-objective policy decisions and human interpretation during the course of the pandemic. In short, our machine learning model, like most other COVID-19 forecasting models, only learns to predict the reported cases (or deaths and other indices), which were likely biased by non-objective influences that were not uniform across reporting entities.
Since January 4, 2021, we have updated our forecasts of deaths and R numbers in our GitHub repository every Monday (https://ylzhang29.github.io/UpstateSU-GRU-Covid/). Visualizations of several useful metrics derived from our forecasts are provided on the GitHub page to facilitate understanding. These include projections of cases and deaths for all forecasted counties, and US county maps showing the percentage change in deaths and R values and the weekly cases and deaths per 100,000 population in future weeks. The code for our model's deep learning algorithm is also available in the repository.