On stay at home orders: Using the power of data science for spatial and temporal modeling and visualization of COVID-19

Coronavirus disease 2019 dominated and augmented many aspects of life beginning in early 2020. Related research and data generation developed alongside its spread. We developed a Bayesian spatio-temporal Poisson disease mapping model for estimating real-time characteristics of the coronavirus disease in the United States. We also created several dashboards for visualization of the statistical model for fellow researchers and simpler spatial and temporal representations of the disease for consumption by community members in our region. Findings suggest that the risk of confirmed cases is higher for health regions under partial stay at home orders and lower in health regions under full stay at home orders, when compared to before stay at home orders were declared. These results confirm the benefit of state-issued stay at home orders as well as suggest compliance to the directives towards the older population for adhering to social distancing guidelines.


Introduction
The coronavirus disease 2019  began in Wuhan, China in December of 2019 and quickly evolved into a global crisis (Trombetta et al. 2016). As of May 18, 2020, data suggested that the pandemic resulted in 1,492,822 confirmed cases and 89,101 deaths in the United States alone (Johns Hopkins University Center for Systems Science and Engineering 2019). Due to the widespread nature of this crisis and the need to develop data-driven insights into COVID-19 and its spread, data from various reputable sources were made publically available and updated daily.
In this project, data science tools are coupled with robust statistical modeling to offer interactive real-time exploration and visualization of COVID-19 at the local, regionally, nationally, and internationally (Chang et al. n.d.;Microsoft Corporation 2014;Salesforce 2003). Data visualization dashboards that are updated automatically and daily furnished an ideal environment for understanding the distribution and characteristics of COVID-19 across the United States over time. These dashboards allowed for the easy dissemination of information to a range of audiences in academe, government and nonprofit organizations, and members of the general populace. We display these dashboards alongside an array of related ones created by colleagues in the UNCW Data Science program, here.
The first steps of this project involved compiling data from multiple sources, including: COVID-19 data from Johns Hopkins University Center for Systems Science and The stay at home orders variable was self-constructed with guidance from government mandates and the New York Times (Mervosh et al. 2020). After compiling the data, we constructed visualization dashboards for optimal viewing, focusing on those values that vary over time (e.g., confirmed COVID-19 cases). Next, we implemented the statistical model to spatially and temporally explore the prevalence of confirmed cases and death outcomes. The statistical model we employed was a Bayesian spatio-temporal Poisson disease mapping model that allowed for estimation of important fixed effect parameter estimates in addition to residual spatial and temporal variation (Lawson et al. 2016;Lawson and Lee 2017;Lesaffre and Lawson 2013). The dashboards linked above display the real-time implementation of this statistical model. The results reported in this article focus on the time of the pandemic when stay at home orders were largely in effect.

Results
An R Shiny COVID-19 regression dashboard with the results for the data presented in this manuscript is available here. This limits the data to the referenced March 23, 2020 to May 18, 2020 window. Figure 1 displays the standardized incidence or mortality ratios ( / ) for confirmed cases, active cases, new cases, and deaths for the first, third, fifth, seventh, and ninth Mondays across all health regions in the United States. These maps suggest that the spatial distribution of increased (greater than 1) vs. decreased (less than 1) risk remains largely consistent across time, which provides justification for the separate spatial and temporal random effect terms in the model described above. Additional real time visualizations of these outcomes, including representation as a percent of the population, can be seen here at the county level via daily updates of the Johns Hopkins data. The time window considered in this manuscript is represented as days 1-57 on the slide bar. Table 1 displays the statistical model parameter estimates associated with the confirmed case and death outcomes for the US as a whole and for the four separate US Census-defined regions. An inverse logarithm was applied to these parameter estimates to offer an interpretation directly related to the multiplicative change in relative risk ( ) of the outcome considered. Consequently, an estimate with a 95% credible interval fully less than one suggests decreased risk while an estimate with an interval greater than one suggests increased risk. The estimates for the other two outcomesactive cases and new casesare largely consistent across all models (these findings are made available in the linked Shiny dashboard). Generally speaking, the parameter estimates for the outcomes in Table 1 suggest increased risk under partial stay at home orders and decreased risk under full stay at home orders compared to before the stay at home orders were instated. Risk appears to decrease as the percent unemployment in a region rises. Similarly, findings suggest risk increases for areas with a higher percent of Black or African American population in the South. Some evidence exists that risk is also higher where there is a higher percent of smokers and a higher percent older population. The testing variable was least consistent among the predictors in its effect on confirmed cases and deaths outcomes. It appears that more positive tests, both count of positive tests and percent positive tests, suggest more confirmed cases but not necessarily more deaths. Figure 1: Standardized incidence or mortality ratio for A) confirmed cases, B) active cases, C) new cases, and D) deaths for the first, third, fifth, seventh, and ninth Mondays across all health regions in the United States.

Discussion
The results presented here demonstrate the usefulness of these spatio-temporal statistical models for identifying several important risk factors for a range of COVID-19 outcomes. The use of Bayesian inference is both relevant to this inquiry and valuable as a benchmark and template for researchers to use for additional applications in data science. Additionally, the dashboards offer an ideal interactive environment for displaying these results for a range of interested stakeholders.  (Bialek et al. 2020;Onder et al. 2020).
Although the fixed effect estimates were quite similar across models, the spatial and temporal random effects were different when the US Census-defined regions were compared to those generated for the entire United States. Some of these differences are a consequence of scaling the random effects in the different models, but several indicators suggest an opposite risk direction. This finding indicates that there are differences in the COVID-19 crisis across the country; thus, it is critical to consider these areas on their own. The US Census-defined region with the most consistent spatial random effect distribution to the one estimated for the United States as a whole was the Northeast. It is possible that the Northeast was driving much of the US-based model because the outbreak was centralized in that region for most of this time period.
The West US Census-defined region was the most inconsistent with respect to the fixed effect estimates, and this is likely due to two factors: 1) the geographic and cultural heterogeneity within that region and 2) the strict nature of the stay at home orders in both California and Washington.
The statistical model implemented in this study has both advantages and disadvantages for understanding disease spread. First, this model appears robust to data delays that are inherent in real-time data (Jordan 2020). For example, there is a strong seasonal trend in the data where more new cases are reported later in the week versus earlier in the week (see this dashboard for evidence). However, that same trend is not evident in the temporal random effect estimates and we observed no well estimated parameters when including a variable that attempted to adjust for this trend. The reason for the models' robustness lies in the expected count ( ) term, since that matrix by definition includes the same seasonal trend. One negative feature for these models involves the computational efficiency. While implementation via the INLA package in R speeds up the model fitting process, these complex models become more computationally inefficient with every day of new data generation. As such, real-time, or daily data as used here, updates become less feasible the longer the crisis continues. Our current real-time rendering of these models uses weeks as the temporal unit for this reason.

Conclusion
In sum, this article demonstrates how statistical models of this nature and caliber can illuminate important relationships with disease spread and identify relevant populationlevel characteristics. Most importantly, we believe that these results confirm the benefit of state-issued stay at home orders as well as suggest compliance with respect to the directives towards the older population for adhering to social distancing guidelines.

Data
Although most of the spatial data employed in this analysis are available at the county level, spatio-temporal statistical modeling of these data at this aggregation is complicated for two major reasons: 1) zero inflation and 2) large dimension. Both of these complications would render our methods inappropriate and inefficient. In order to avoid the issues associated with county-level modeling, we used a broader spatial aggregationhealth regions, which are one or more counties serviced by the same health department (Day et al. 2019 particularly for the timeline pertaining to stay at home orders since several states had partially reopened but none had fully reopened. Second, by constraining the window of time it afforded the ability to examine data on a daily temporal scale as opposed to the real time models now that show data in weeks for computational efficiency. Finally, confining the data in this way allowed for the inclusion of testing information, which lags behind case counts in data release.
We examined four outcomes: number of confirmed cases, number of active cases, number of new cases, and number of deaths. Recovery data is not captured at the US county level and therefore is not incorporated as a stand-alone outcome or in the calculation of active cases. Additionally, there could be differences in covariate relationships based on the region of the country considered. As such, we made an estimation for the US as a whole and apportioned the numbers into the following

Covariates for adjustment
All models adjust for the same neighborhood-level covariates, which were selected based on two criteria: to describe the state-level responses to the pandemic and to represent population level characteristics that may contribute to the disparities in outcome counts between health regions. We elaborate on the selection of predictors in the paragraphs below. The statistical model indirectly adjusts for differences in population across the observed areas (see Section 2.2 below for more details). Images (gif or png) for all predictors considered are available for review in the supplemental files.
To understand and adjust for the state-level response to COVID-19, we considered two important temporal predictors, the state-by-state issuance of stay at home orders and the rate of testing. These variables are specified at the state-level because they are either mandated at the state-level or are only available at that level of aggregation, respectively. The self-constructed predictor representing presence of statewide stay at home orders required daily updates when running in real-time. The most current, convenient, and reliable information for this categorical variable (levels: before, duringpartial, duringfull, and none) came from the New York Times (Mervosh et al. 2020).
To ensure accuracy, we concurrently examined government mandates for more information. Population characteristics were also included as predictors in this model, specifically the presence of an international airport, land area of the health region, percent of the population that is unemployed, percent of the population that is a smoker, percent of the population that is aged 65 and older, and percent of the population that is Black or African American. The binary variable for presence of an international airport was used to represent community connectivity to international locations, which could be an where is the outcome of interest (count of confirmed cases, active cases, new cases, or deaths), is the mean of the Poisson model, is the expected count, is the relative risk, ′ is the design matrix for the predictors, is the vector of fixed effect estimates, is the spatial random effect, and is the temporal random effect.
is calculated as the rate of infection across all health regions on a given day times the population at risk for a given health region, which is assumed constant over time. As such, a unique is produced for each health region and day. Given the Bayesian methodology applied here, prior distributions were required for all parameters and they were defined as follows: ~(0, −1 ) for each of the fixed effect parameter estimates, ~(0, −1 ) for an uncorrelated spatial random effect, ~( −1 , −1 ) for a temporal random walk effect, and all precisions were such that ~(2,1). All these prior distributions are considered non-or weakly-informative and sensitivity analysis (data not shown) suggests little influence on the resulting parameter estimates. This model description applies to all outcomes (confirmed cases, active cases, new cases, and deaths) and US Census-defined regions (US, S, W, MW, and NE) considered.

Computational Details
R statistical software furnished much of the means for data processing and analysis in the work presented here. Specifically, the R packages rgdal, INLA, and fillmap were necessary for spatial data processing, statistical modeling, and spatial plotting respectively (Bivand et al. 2019;Blangiardo et al. 2013;Carroll 2016;Carroll et al. 2015;Martins et al. 2013; R Core Team 2015; Rue et al. 2009;Held 2010, 2011;Ugarte et al. 2014). We, along with several others from the UNCW Data Science program, produced data visualization dashboards with an array of popular software including R Shiny, Tableau, and Power BI (Chang et al. n.d.;Microsoft Corporation 2014;Salesforce 2003). In addition to the regression model results, these dashboards also display general spatial, temporal, and spatio-temporal tracking of COVID-19 cases and deaths at the county, state, national, and international levels. Code for the statistical models and Shiny apps are available at this GitHub repository.