Stay Home Save Lives: A Machine Learning Approach to Causal Inference to Evaluate Impact of Social Distancing in the US

Although a few studies have estimated the impact of the COVID-19 pandemic, there is still a need for an actual policy evaluation of the social distancing measures already implemented. This is especially pressing in the US context, because nearly a dozen US states are considering reopening their economies following protests against social distancing. We use a machine learning based Generalized Synthetic Control Method, treating the US states that adopted social distancing early as the treatment group and the states that adopted it much later as the control group, and controlling for state and time fixed effects (to address selection bias and endogeneity). We find that social distancing is associated with a lower COVID-19 infection growth rate (by 192%) when compared to the no policy intervention counterfactual.


Introduction
With around a dozen [i] US states considering reopening their economies, following anti-lockdown protests in dozens of major cities [ii] and state capitals, it is difficult to advocate for a continued lockdown without an actual policy evaluation of the social distancing measures. Such an evaluation is especially necessary because the World Health Organization, which sits at the opposite end of the polarized pro- versus anti-lockdown policy debate, has just warned that the ongoing COVID-19 pandemic is far from over [iii].
Although a few post-COVID-19 studies have estimated the possible gains for the US (e.g. Greenstone and Nigam, 2020) or simulated COVID-19's spread and mortality impacts in the US (e.g. Ferguson et al. 2020), there is a need for a causal policy evaluation of the social distancing already implemented, to measure what we have achieved so far relative to a no intervention counterfactual. We use the Generalized Synthetic Control Method (GSCM) developed by Xu (2017), treating the US states that adopted social distancing early as the treatment group and the states that adopted it much later as the control group, and controlling for state and time fixed effects (to address selection bias and endogeneity). We find that social distancing is associated with a lower COVID-19 infection growth rate (by 192%). GSCM calculates weights for the untreated (control) units in order to create a synthetic twin of the treatment unit in the pre-treatment period; it uses an interactive fixed effect model (discussed later) in this re-weighting phase (Xu, 2017). As in predictive machine learning, GSCM then makes an out-of-sample (post-treatment period) prediction using the calculated weights (based on the interactive fixed effect model) in order to create a counterfactual for the treatment unit. In this sense, the method is in the same spirit as machine learning predictive modeling.
Since COVID-19 is the only major pandemic in recent memory, the pre-COVID-19 literature on policy evaluations of social distancing is understandably thin. This paper proceeds as follows: section 2 presents the data sources and the empirical strategy, which justifies the method and the assignment of the treatment and control groups; section 3 presents the results and visualization of social distancing measures; and section 4 concludes.

Data
Our outcome variable is the growth rate of daily confirmed cases (the growth rate of infection), and the right-hand-side indicator variable is social distancing. We collect the cumulative number of confirmed cases and deaths at the FIPS code level from the Johns Hopkins University (JHU) Center for Systems Science and Engineering [iv]. JHU posts the data in a wide time series format. We follow the preliminary lines of the R (an open source language and environment) code posted by Tim Churches (Mar 05, 2010) in a blog to extract the data from JHU and put it into a panel format [v]. The rest of the data cleaning and manipulation is done in R to prepare the data for the model of our choice, the Generalized Synthetic Control Method. We estimate the growth rate of confirmed cases from the number of daily confirmed cases. We use the data up to April 13.
Glanz et al. published an article in the New York Times titled "Where America Didn't Stay Home Even as the Virus Spread." Using location data from the data intelligence firm Cuebiq, the authors provide a map of the United States showing, following the implementation of social distancing, "… when the average distance travelled first fell below 2 miles" [vi]. This is the working definition of social distancing used in this paper.
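The data preparation described above can be illustrated with a minimal Python sketch (the paper itself works in R). It reshapes a wide series into a panel and derives a growth rate of daily cases from cumulative counts. The function names are hypothetical, and the growth rate formula (the relative change in daily new cases from one day to the next) is an assumption, since the text does not spell out the exact definition.

```python
def wide_to_panel(wide, dates):
    """Reshape {unit: [cumulative count per date]} into (unit, date, count) rows."""
    return [(unit, d, c) for unit, series in wide.items()
            for d, c in zip(dates, series)]


def growth_rates(cumulative):
    """Growth rate of daily new cases computed from a cumulative series.

    Assumed definition: (new_t - new_{t-1}) / new_{t-1}.
    Returns None where the previous day had no new cases (rate undefined).
    """
    # Daily new cases are first differences of the cumulative counts.
    daily = [b - a for a, b in zip(cumulative, cumulative[1:])]
    rates = []
    for prev, cur in zip(daily, daily[1:]):
        rates.append((cur - prev) / prev if prev > 0 else None)
    return rates


# Toy example with made-up numbers (not JHU data).
panel = wide_to_panel({"A": [10, 20, 40, 70]}, ["d1", "d2", "d3", "d4"])
rates = growth_rates([10, 20, 40, 70])
```

Other definitions, such as the log difference of cumulative cases, would fit the same pipeline; only `growth_rates` would change.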

Empirical Approach
The classic model for understanding the impact of an event on an outcome of interest is difference-in-differences (DID), the most widely used model for answering causal questions. A major limitation is that it depends heavily on the assumption that the mean outcomes of the treatment and control units follow parallel time trends in the pre-treatment period. The same parallel trend is assumed to hold in the post-treatment period in the absence of the treatment. In addition, identifying the treatment effect requires the treatment event to be exogenous. In other words, the treatment status cannot be determined by any factor that also affects the outcome variable of interest. Sometimes a workable assumption is conditional independence (also known as 'selection-on-variables'), which states that if we can identify the variables to which the treatment is endogenous, we can control for those variables in the model. In this way, we break the dependency between the treatment status and the outcome of interest created by those variables.
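To make the DID logic concrete, here is a minimal Python sketch of the textbook two-group, two-period estimator: the change in the treated group's mean outcome minus the change in the control group's mean outcome. The function name and the toy numbers are illustrative assumptions and are not taken from the paper's data.

```python
from statistics import mean


def did_estimate(rows):
    """Two-by-two DID from rows of (treated: bool, post: bool, outcome: float)."""
    cells = {(g, p): [] for g in (False, True) for p in (False, True)}
    for treated, post, y in rows:
        cells[(treated, post)].append(y)
    m = {key: mean(values) for key, values in cells.items()}
    # (treated post - treated pre) - (control post - control pre)
    return (m[(True, True)] - m[(True, False)]) - (m[(False, True)] - m[(False, False)])


# Made-up growth rates: treated falls by 0.3, control falls by 0.1.
toy_rows = [
    (True, False, 0.5), (True, True, 0.2),
    (False, False, 0.4), (False, True, 0.3),
]
did = did_estimate(toy_rows)  # close to -0.2, up to floating point rounding
```

The estimate is only interpretable as a treatment effect if the control group's change is a valid stand-in for what the treated group would have experienced, which is exactly the parallel trends assumption discussed above.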
Another method that has gained momentum is the Synthetic Control Method (SCM) proposed by Abadie, Diamond and Hainmueller (2010). SCM relaxes the parallel trend assumption of DID and essentially computes a "synthetic twin" of the treatment unit by reweighting the control units using pre-treatment data on the outcome and other covariates. In our opinion, SCM uses a machine learning approach to create the counterfactual for the treatment unit in the post-treatment period. It calculates the weights for each control unit using the pre-treatment period and then plugs those weights into the post-treatment control unit data to create the counterfactual for the treated unit. One caveat is that SCM is applicable to a single treatment unit. In this paper, we use the more sophisticated Generalized Synthetic Control Method (GSCM), which estimates latent factors (time-varying) from the control group data and uses these factors to estimate factor loadings (unit-specific intercepts) for the treated unit (Xu, 2017). This implies that GSCM relaxes even the selection-on-variables assumption to a great extent and permits the treatment status to be endogenous to unknown time-varying and unit-specific covariates.
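The factor-based procedure can be sketched in a few lines of Python. This is a deliberately simplified illustration and not the gsynth implementation: it assumes a single latent factor, no covariates, no additive fixed effects, and noise-free toy data, and all function names are hypothetical. The factor is estimated from control units only (by power iteration), the treated unit's loading is fitted by least squares on the pre-treatment period, and the post-treatment counterfactual is predicted out of sample.

```python
import math


def leading_factor(Y_co, iters=200):
    """Leading time factor of the control outcomes via power iteration.

    Y_co is a list of T rows, each holding the outcomes of the N control units.
    """
    T, N = len(Y_co), len(Y_co[0])
    v = [1.0] * T
    for _ in range(iters):
        u = [sum(Y_co[t][j] * v[t] for t in range(T)) for j in range(N)]  # Y' v
        w = [sum(Y_co[t][j] * u[j] for j in range(N)) for t in range(T)]  # Y (Y' v)
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def gsc_att(Y_co, y_tr, T0):
    """Average post-period gap between the treated unit and its counterfactual."""
    f = leading_factor(Y_co)
    # Loading of the treated unit: least squares fit on the first T0 periods only.
    lam = sum(f[t] * y_tr[t] for t in range(T0)) / sum(f[t] ** 2 for t in range(T0))
    y0_hat = [lam * f_t for f_t in f]  # predicted untreated outcome, all periods
    gaps = [y_tr[t] - y0_hat[t] for t in range(T0, len(y_tr))]
    return sum(gaps) / len(gaps), y0_hat


# Toy data: one factor, three controls, one treated unit with a true effect of -2
# in the last two periods.
factor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
Y_co = [[f_t * loading for loading in (1.0, 2.0, 3.0)] for f_t in factor]
y_tr = [1.5 * f_t for f_t in factor]
y_tr[4] -= 2.0
y_tr[5] -= 2.0
att, y0_hat = gsc_att(Y_co, y_tr, T0=4)
```

On this toy data the sketch recovers the built-in effect because the treated unit shares the controls' factor but has its own loading, a situation in which a plain DID comparison of means would be biased.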
We exploit this advantage in our paper because our treatment assignment of social distancing is not random. GSCM also allows for multiple treatment units, which is the case in our paper. Xu and Liu (2020) show an implementation of the model in R [viii]. We follow their code to implement the model on the data for this paper. We also assess the matching quality between the treatment average and the synthetic twin in the pre-treatment period by eyeballing whether their paths overlap. Any difference in the post-treatment period can be attributed to the effect of social distancing.

Results
Table 1 shows the results. The outcome variable is the growth rate of confirmed cases. In column (1) of Table 1, we include both state and day fixed effects and also control for the first day of infection. This is our main model of interest. We find a statistically significant average treatment effect on the treated (ATT) of -1.92 (-192%). This number unambiguously demonstrates the direction of the impact of social distancing. Figure 1 shows how the ATT evolves over time from the pre- to the post-social-distancing era.
We eyeball the match quality between the treatment and its counterfactual in the pre-social-distancing period. Their paths do not overlap perfectly, but they follow each other very closely. This shows a good match in the pre-treatment period, so we can take the treatment effect in the post-treatment period seriously. In our opinion, this is a striking result showing how effective social distancing can be in reducing the growth rate of infection during a pandemic.
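A visual check of the pre-treatment match can be complemented by a simple number. The Python sketch below computes the root mean squared gap between the treated average and its synthetic twin over the pre-treatment periods. This statistic is an illustrative assumption and is not reported in the paper; the function name and toy series are hypothetical.

```python
import math


def pre_treatment_rmse(treated_avg, synthetic, T0):
    """Root mean squared gap over the first T0 (pre-treatment) periods."""
    gaps = [treated_avg[t] - synthetic[t] for t in range(T0)]
    return math.sqrt(sum(g * g for g in gaps) / T0)


# Made-up series: the paths agree in two of three pre-treatment periods.
fit = pre_treatment_rmse([1.0, 2.0, 3.0, 0.0], [1.0, 2.0, 4.0, 5.0], T0=3)
```

A value that is small relative to the typical level of the outcome points to a close pre-treatment match; only the first `T0` periods enter the calculation, so the post-treatment gap does not affect it.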
Columns 2 and 3 show the treatment effect with state fixed effects only and day fixed effects only, respectively. The ATT with only state fixed effects is -39% and is significant at the 5% level. The ATT with only day fixed effects is -53% and is significant at the 10% level. Again, our main result is the full model with both state and day fixed effects in column 1; we report columns 2 and 3 for comparison. Note that GSCM creates a synthetic twin for each treatment state using information from the control pool of states. More precisely, GSCM channels the control pool information through an interactive fixed effect model, of which state and day fixed effects are special cases (Xu, 2017).
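For reference, the interactive fixed effect model and its fixed-effects special case can be written out as follows. The notation follows Xu (2017) and does not appear in the text above, so the symbols should be read as assumptions: $D_{it}$ is the treatment indicator, $x_{it}$ the observed covariates, $f_t$ the latent factors, $\lambda_i$ the factor loadings, and $\mathcal{T}$ the set of $N_{tr}$ treated units.

```latex
% Interactive fixed effect model
Y_{it} = \delta_{it} D_{it} + x_{it}'\beta + \lambda_i' f_t + \varepsilon_{it}

% State and day fixed effects as a special case of the factor term
f_t = (1, \xi_t)', \quad \lambda_i = (\alpha_i, 1)'
\;\Rightarrow\; \lambda_i' f_t = \alpha_i + \xi_t

% Estimated counterfactual and ATT in post-treatment period t
\hat{Y}_{it}(0) = x_{it}'\hat{\beta} + \hat{\lambda}_i'\hat{f}_t, \qquad
\widehat{ATT}_t = \frac{1}{N_{tr}} \sum_{i \in \mathcal{T}}
  \left[ Y_{it}(1) - \hat{Y}_{it}(0) \right]
```

Setting the factor term to $\alpha_i$ alone or $\xi_t$ alone gives the state-only and day-only specifications of columns 2 and 3.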
Our treatment (social distancing by March 26) is not random and may well be correlated with unknown or unobserved state-specific and time-specific factors. Our main result is therefore the two-way fixed effects model in column 1, which accounts for the correlation between treatment assignment and unknown factors more comprehensively.

Conclusion
We investigate whether social distancing measures in the US worked when compared to a no policy counterfactual. In our case, the treatment status is not exogenous and is possibly correlated with many other factors, so traditional DID-like approaches to causal inference would not identify the impact of social distancing on the infection growth rate. We therefore use the Generalized Synthetic Control Method, which permits treatment status to be correlated with unknown factors that vary over time and across states, and estimates the counterfactual for the treatment states through an out-of-sample (post-social-distancing period) prediction (Xu, 2017).
This is in the same spirit as predictive machine learning modeling. We find that social distancing is associated with a lower COVID-19 infection growth rate (by 192%) when compared to the no intervention counterfactual.