Predicting County-Level COVID-19 Cases using Spatiotemporal Machine Learning: Modeling Human Interactions using Social Media and Cell-Phone Data


 Measurements of human interaction through proxies such as social connectedness or movement patterns have proved useful for predictive modeling of COVID-19. In this study, we first compare the power of Facebook’s social connectedness with cell phone-derived human mobility for predicting county-level new cases of COVID-19. Our experiments show that social connectedness is a better proxy for measuring human interactions leading to new infections. Next, we develop a SpatioTemporal autoregressive eXtreme Gradient Boosting (STXGB) model to predict county-level new cases of COVID-19 in the coterminous US. We evaluate the model on five weekly forecast dates between October 24 and November 28, 2020 over one- to four-week prediction horizons. Comparing our predictions with a baseline Ensemble of 32-models currently used by the CDC indicates an average 58% improvement in prediction RMSEs over two- to four-week prediction horizons, pointing to the strong predictive power of our model.


Introduction
Human interaction in close physical proximity is the primary cause of the transmission of highly contagious diseases such as COVID-19 1 . Measuring human interaction is therefore an important step in understanding and predicting the spread of COVID-19 2,3 . However, tracking human interactions requires rigorous contact tracing at national and regional scales which has not been implemented in the United States due to the economic, legal, and sociocultural concerns, as well as inadequate testing supplies, and insufficient national coordination 4 .
As a result, researchers have adopted different proxies to track human interaction levels. One such proxy is the "Social Connectedness Index" (SCI), generated from Facebook's friendship data. SCI represents the probability that two users in a pair of regions (e.g., U.S. counties) are friends (i.e., connected) on Facebook 5 . Kuchler et al. 6 reported on the strong correlation between early hotspots of COVID-19 outbreak and their level of social connectedness. The underlying assumption in leveraging SCI as a proxy for physical human interactions is that individuals who are socially connected on Facebook have a higher probability for physical interaction, thereby, potentially contributing to the spread of communicable diseases.
Human mobility flow, as measured by anonymized cell phone data, serves as another proxy for quantifying human interactions 7,8 . Widely used to study the spread of COVID-19, most studies incorporating cell-phone data have focused on the change in mobility within a spatial unit 9,10 , while a few others have also incorporated the flow between different spatial units 11 to predict transmissions across units, albeit mostly in the early stages of the pandemic with limited evaluation data. The underlying assumption in this approach is that more movements between spatial units leads to higher interactions, and consequently, an elevated risk of disease spread.
It is unclear, however, which of these approaches-using social media connectedness versus cell phone-derived human mobility flow-is a better indicator of physical interaction within and between different regions.
Furthermore, the underlying assumption in each approach may not necessarily be valid in the case of COVID-19: considering the sporadic and regional stay-at-home orders across the United States, social connectedness may not lead to physical interaction, at least not to the same level as pre-pandemic. Similarly, given the recommended preventative measures such as mask-wearing and physical distancing 12 , human flow from one location to another may not necessarily lead to physical interactions that could communicate the disease, especially in public places, where preventative measures are enforced more strictly.
In this paper, we compare the predictive power of Facebook's social connectedness index, as an example of social media proxy, with cell phonederived human mobility data, as an example of human flow proxy, in forecasting county-level new cases of COVID-19 in the conterminous US over multiple prediction horizons. County-level prediction is more challenging than state-level prediction [13][14][15] , yet it serves as the highest spatial resolution for national models in the U.S., since cases are aggregated and reported at the county level. Longer term County-level predictions are also essential for policy making and resource allocation.
The unique characteristics of COVID-19, including its presymptomatic and asymptomatic contagiousness, rapid spread, along with variations in regional response policies, such as inconsistent and sporadic testing and contact tracing, make forecasting the spatial patterns of this disease challenging. Researchers have used a variety of methods including time-series autoregressive models [16][17][18] , machine learning techniques [19][20][21] , epidemiologic models such as SIR model and its variants 22,23 , and combinations of these methods 24 for forecasting COVID-19. We experiment with five different machine learning-based spatiotemporal autoregressive algorithms to perform county-level predictions, and use the best algorithm, i.e. the one with the lowest average prediction RMSE and MAE, to compare between Facebook-and cell phone-derived features.
We compare our best model predictions against one of the most prominent collective efforts in forecasting COVID-19 in the U.S., namely, the Ensemble model developed by the "COVID-19 Forecast Hub" team 25 which is used by the Centers for Disease Control and Prevention (CDC) to report predictions of new cases and deaths in U.S. counties in one-to four-weeks ahead horizons 25,26 .
Our specific contributions are as follows: (a) designing inter-county and intracounty features for spatiotemporal autoregressive machine learning of countylevel new cases, (b) comparing the performance of social media connectedness (derived from Facebook) and human flow connectedness (derived from SafeGraph's cell phone data) by incorporating inter-county spatial lags for predicting county-level new COVID-19 cases, and (c) improving the long-term prediction of county-level new cases of COVID-19 in the coterminous U.S. in comparison to a baseline Ensemble model, using an end-to-end model.

Algorithm Selection
Five different machine learning algorithms were trained and tuned using each set of features named in the next section (details in Section 4), and tested over the last 5 weeks of our dataset (same dates as Tables 2 & 5), by holding out one week at a time for testing.

Ensemble
We compared the predictions of our best model, STXGB-FB, against the predictions of the COVID-19 Forecast Hub Ensemble of 32 models (used by the CDC in reporting forecasts of new cases 26 ) over one-, two-, three-, and four-week horizons. We trained and tuned STXGB-FB for each prediction horizon separately.  week prediction horizons (Table 3).

a b
To investigate the potential of SafeGraph cell phone-derived features in longterm predictions, we generated one-, two-, three-, and four-week forecasts using the STXGB-SG model as well. This model does not perform as well as STXGB-FB, pointing to the superiority of Facebook-derived features in our models consistent with the one-week predictions (Supplementary Table 4, Supplementary   Information). However, while STGXB-SG generates larger errors compared to STXGB-FB, it still outperforms the Ensemble model in long-term prediction horizons. Nov. 14 and Nov. 21 across all prediction horizons. To find potential explanations for this inconsistency, we inspected the spatial patterns of errors. Figure 3 illustrates   The majority of counties in the U.S are rural, which are also the ones with fewer medical resources, and where social media data or cell-phone mobility data, which underlie our models, might be less representative [29][30][31] . To investigate our models' performance in rural-majority counties compared to the Ensemble baseline, we categorized the counties into urban-and rural-majority by

Discussion
We demonstrated that incorporating (1) spatiotemporal lags using inter-county indices of connectedness and (2) Table 4). We will investigate alternative designs of inter-county connectedness metrics from SafeGraph mobility data in the future to ensure the utilization of the full potential of this dataset.
The superior performance of Facebook-derived features means that stronger county-level social (media) connections could lead to a higher chance of (unsafe) human interaction (e.g., house parties) and thus, COVID spread, compared to human flow from one location to another. The findings of this paper also suggest

Methods
This section outlines the details of feature engineering, algorithm selection, and implementation of spatiotemporal autoregressive machine learning models for predicting new cases of COVID-19 in the conterminous United States. We describe our experimental setup for comparing the predictive power of Facebook-derived features and SafeGraph's cell phone-derived mobility features (as proxies for human physical interaction) for this purpose, and the evaluation of our models against the COVID-19 Forecast Hub Ensemble baseline.

Base Features
We engineered features for machine learning that can be categorized into five Features in category C indirectly capture population compartments of the susceptible-infected-recovered (SIR) epidemiological models 41,42 . We defined the COVID-19 incidence rate of a county as its number of cases per 10,000 population and included a four-week lagged (t-4) weekly average of total incidence rates in each county as a feature (category C.i) to capture the Susceptible and Recovered compartments. If the feature value is small, many individuals in the unit have not yet contracted the disease; and therefore, are still susceptible. If the feature value is sufficiently large (level of sufficiency is learned by the model), the compartment is approaching higher levels of immunity as a whole. We use machine learning algorithms that are capable of learning such non-linear relationships. It is worth noting that vaccination rates can similarly be incorporated as a feature. However, inoculations in the U.S. started after our study period, and thus, not included here.
To account for the latency associated with the effects of temperature, new and historical incidences, and human interaction on the spread of COVID-19, we generated four temporal weekly lags of features in categories B-E (we assume that the features in category A are static during our study period). Notably, for category C, we also included (natural log-transformed values of) change in incidence rate( (∆ + 1)), i.e., the subtraction of observed start-of-week incident rate from end-of-week incidence rate during the four weekly temporal-

Conceptualization of models with social media and cell phone features
We evaluate the predictive power of the Facebook-derived features (category D) and the SafeGraph-derived features (category E) against a base model by developing four different model setups (not to be confused with the final evaluations against the baseline Ensemble model). Here, we provide an outline of these models, with more details on the specific algorithms and features mentioned in Table 4 in the following sections.  All of the models mean(daily minimum temperature) t t-1, t-2, t-3, t-4 mean(daily maximum temperature) t C-COVID-19 incidence rate All of the models Ln (Δ incidence rate t +1) t-1, t-2, t-3, t-4

t-1, t-2, t-3 and t-4 indicate one-, two-, three-, and four-week lags, respectively. The target variable for one week-ahead prediction horizon on forecast date d is the number of new cases per 10k from the forecast date through the end of prediction horizon t in each county, i.e., Δ(incidence rate) t,d . The target for the two week-horizon prediction is Δ(incidence rate) t+1,d , three week-horizon is Δ(incidence rate) t+2,d , and four week horizon is Δ(incidence rate) t+3,d . Ln()in the table indicates natural logarithm, Mean () indicates weekly average, Δ indicates weekly change, i.e., difference (calculated by subtracting the value of the
Ln (mean (incidence rate) t +1) t-4 mean (Stay Put) t and slope (Stay Put) t t-1, t-2, t-3, t-4 mean (Change in Movement) t and slope (Change in Movement) t t-1, t-2, t-3, t-4

E-SafeGraph
-SG and -SGR models -SGR model mean (baseliend median_home_dwell_time) t and slope(baselined median_home_dwell_time) t t-1, t-2, t-3, t-4 mean (baseliend % full_time_work_behavior_devices) t and slope(baseliend % full_time_work_behavior_devices) t t-1, t-2, t-3, t-4 The first model (base model) only includes base features: socioeconomic features (category A), four temporally-lagged weekly temperature features (category B), and four temporally-lagged weekly change in incidence rates in each county, as well as weekly average of incidence rates during the fourth lagged week (category C). Therefore, the base model only incorporates temporal lags of the features and the target variable in predicting new cases of COVID-19.
The second model, which we identify by the "-FB" suffix (to note the inclusion

ii. Inter-county features and spatial lag modeling
The intra-county features capture the intrinsic movement-related characteristics of a county and ignore its interactions (i.e. spatial lags) with the counties to which it is connected. Therefore, we calculated inter-county metrics of connectivity as a basis for incorporating spatiotemporal lags in our models.
Notably, the connectedness in this context transcends spatial connectedness in the form of mere physical adjacency.
Social connectedness Index (SCI), another dataset published by Facebook, is a measure of the intensity of connectedness between administrative units, calculated from Facebook friendship data. Social connectedness between two counties i and j is defined as 46 : where , is the number of friendships between Facebook users who live in county i and those who live in county j; while and are the total number of active Facebook users in counties i j and, respectively. Social Connectedness is scaled to a range between 1 and 1,000,000,000 and rounded to the nearest integer to generate SCI, as published by Facebook 47 . Therefore, if the SCI value between a pair of counties is twice as large as another pair, it means the users in the first county-pair are almost twice as likely to be friends on Facebook than the second county-pair 46 . We used the latest version of the SCI dataset (at the time of our analyses), which was released in August 2020 47 .
While SCI provides a measure of connectivity, our goal is to capture the spatiotemporal lags of COVID-19 cases in county i, i.e., the number of recent COVID-19 cases in other counties connected to county i. Using SCI, Kuchler et al. 6 created a new metric, called Social Proximity to Cases (SPC) for each county, which is a measure of the level of exposure to COVID-19 cases in connected counties through social connectedness. We use a slight variation of SPC, defined as follows for county i at time t : where 10 , is the number of COVID-19 cases per 10k population (i.e., incidence rate) in county j as of time t. For county i, the sums j and h are over all counties. In other words, SPC for a county i, in time t, is the average of COVID-19 incidence rates in connected counties weighted by their social connectedness to county i, i.e., the spatial lag of incidence rates. To the best of our knowledge, SPC data has not been published, but we were able to generate this feature using the original method 6 , modified for our weekly temporal lagged features and calculated using incidence rates (cases per 10k population) rather than total number of cases. In the -FB models (Table 4), we incorporated features of weekly change (Δ) in SPC at four temporally lagged weeks (difference between the end and start of the lag week) to model the Infected SIR compartment in connected counties, as well as weekly average of SPC in the fourth lagged week (t-4), to capture the Susceptible and Recovered SIR compartments in connected counties, similar to the rational for features in category C, as explained earlier.

Features derived from SafeGraph i. Intra-county movement features
To generate movement features from cell phone data, we used SafeGraph's SDM dataset that is "generated using a panel of GPS pings from anonymous mobile devices" 48 . The SDM dataset contains multiple mobility metrics published at the Census Block Group (CBG) level. Among these metrics, distance_traveled_from_home (median distance traveled by the observed devices in meters) and completely_home_device_count (the number of devices that did not leave their home location during a day) 48 are conceptually closest to the metrics included in the Facebook's Movement Range Dataset. We used these two features in our -SG model, which is the conceptual equivalent of the -FB model, but with cell phone-derived features instead of the Facebook-derived features (Table 4).
We included the SafeGraph's median_home_dwell_time (median dwell time at home in minutes for all observed devices during the time period), and full_time_work_behavior_devices (the number of devices that spent more than 6 hours at a location other than their home during the day) 48  We used weekly averages and slopes (calculated by fitting a linear regression model to the values as the response variable and day of week as the independent variable) of these four metrics as features in our models (Table 4).

ii. Inter-county features and spatial lag modeling
Building on the conceptual structure of SCI, we derived a novel and daily intercounty connectivity index from SafeGraph's SDM dataset to quantify connectedness between counties based on the level of human flow from one county to the other (measured through cell-phone pings). We call this index "Flow where for counties i and j, , is the sum of visits with origin i and destination j.
is the number of devices whose home location is in county i. We then scale FCI to a range between 1 to 1,000,000,000.
We defined FPC as: Facebook's social network and friendship connections do not change significantly over time, and therefore, SCI is a static index in a one-year period.
Conversely, inter-county human flow from SafeGraph is dynamic and can change dramatically, even within a week. We generated daily FCI (and FPC) for each county-pair in the US to utilize the full temporal resolution of the SDM dataset.
We used weekly change (Δ) of FPC for the four temporally lagged weeks, and its average only in the fourth week as features -SG and -SGR models, with the same rationale as features in Category C and D to capture SIR compartments in connected counties (Table 4).

Model Implementation
The ultimate target variable in all of our autoregressive models is the number importantly, minimize the sensitivity of the models to the population of counties.
Our exploratory work did confirm that using this logged of incidence rates produced better results.  Table 4). The weekly calculation of features is based on weeks starting on Sundays and ending on Saturdays, with predictions also made for horizons spanning Sunday-Saturday periods as in common practice 25 .
Our features, models, evaluations and comparisons are limited to the counties in the coterminous US. Table 4 summarizes the features that we used and the number of temporal lags (if any) used for each feature. All features were standardized for use in machine learning algorithms.
Our general approach to training, validation and testing of our models for different prediction horizons is similar, only, with target variables calculated separately for the specific prediction horizon. We first outline our approach for one-week ahead prediction horizons, which is used as the basis for algorithm selection and comparison of Facebook-and SafeGraph-derived features (-FB and -SG models); It is worth noting that we compare the two feature sets in longer term predictions too (Supplementary Table 5). We then provide an overview of the implementation of the models for longer term prediction horizons.
We trained and tuned the models using randomized search and 5-fold cross validation, and tested the best tuned model for predicting new cases in the week following the forecast date (for the one-week prediction horizons, as listed in For the next forecast date, October 31, the training size increased by one week (per county), and the target week was also shifted by one week. Table 5 summarizes the forecast dates, one-week ahead prediction horizons, and training data size. The data used in generating these features spans a period from March 29 to November 28, 2020 to cover the temporal lags, and the target variable is collected through December 12, 2020 for the evaluation of four-week ahead predictions on the last forecast date. More details on cross validation, hyperparameters, and evaluation are presented in the supplementary information (Section C). We evaluated the performance of TXGB, STXGB-FB, STXGB-SG, and STXGB-SGR by comparing the RMSE and MAE scores of the predictions against the observed numbers of new cases in the corresponding prediction horizon (results in Table 2).
We also tested the -FB and -SG models for all prediction horizons across all forecast dates (results in Supplementary Table 5).
Re-tuning and re-training all different variations of the STXGB model for each forecast date and prediction horizon resulted in considerably improved predictions compared to the baseline model (Table 4 and Supplementary Tables 4   and 5). Each STXGB model was tuned and trained on a regular desktop machine (with a 6 core Ryzen 5 3600X CPU and 64GB of RAM) in approximately 12-13 minutes for a single prediction horizon, and thus, in almost one hour for all of the four prediction horizons.

Evaluation against the COVID-19 Forecast Hub Ensemble baseline
In addition to the one-week short-term predictions, we performed long-term predictions of new COVID-19 cases in two-, three-, and four-week ahead prediction horizons. We only used the STXGB algorithm to develop long-term prediction models since it outperformed other algorithms in short-term predictions (see Section 2). We used the same set of features for long-term predictions, with modifications on the target variable to reflect different prediction horizons. For instance, the two-, three-, and four-week ahead horizons of the Forecast date Oct. 24, were Oct 24 to Nov. 7, Oct. 24 to Nov. 14, and Oct 24 to Nov. 21, respectively.
The model for each horizon was trained and validated separately using the same training data and approach described in the previous Section (Table 5), and was tested on two, three and four weeks of unseen data, respectively, for each horizon. We evaluated the models' predictions by comparing them against the predictions generated by the COVID-19 Forecast Hub's Ensemble model as well as the ground-truth values of new cases derived from JHU CSSE COVID-19 reports.
Additionally, we compared the long-term predictions of STXGB-FB and STXGB-SG (Supplementary Table 5).

Data Availability
All of the datasets used in this study are publicly available (at the time of writing this manuscript). We created socioeconomic features from the 5-year survey data--between 2014-2018--provided by the American Community Survey (ACS) and available at IPUMS National Historical GIS portal

Code Availability
All code necessary for the replication of our results is available for reviewers upon request. The code will be published publicly on GitHub under MIT license upon acceptance of this article.

Summary of Contributions
Morteza Karimzadeh conceptualized the project, designed the features and contributed 30% of data processing and implementation, and contributed equally to writing. Behzad Vahedi conducted the majority of data processing, implementation, literature review, and contributed equally to writing. Hamidreza Zoraghein contributed to the study design and 10% of implementation and writing.