2.1 Data
The dataset and its interactive dashboard are created and maintained by a group of researchers at Johns Hopkins University[1] (Dong et al., 2020). The dataset reports Covid-19 infections and deaths in real time. It reports cases at the city level in the US, which are later aggregated to the state level. Our data start on 2020-01-21 and end on 2020-04-13, for a total of 4,536 observations. Our outcome variable is the growth rate of confirmed cases (the growth rate of infection), which we estimate from the number of daily confirmed cases, and the right-hand-side indicator variable is social distancing. We collect the cumulative numbers of confirmed cases and deaths at the Federal Information Processing Standards (FIPS) code level from the Johns Hopkins University (JHU) Center for Systems Science and Engineering[i]. JHU posts the data in a wide time-series format. We follow the preliminary lines of the R (an open-source language and environment) code posted by Tim Churches (Mar 05, 2020) in a blog to extract the data from JHU and put it into a panel format[ii]. The remaining data cleaning and manipulation are done in R to prepare the data for our chosen model, the Generalized Synthetic Control Method. We do not go beyond April 13 because the independent effect of social distancing was becoming harder to identify: people were learning more about the benefits of social distancing, and control states were catching up in implementing it. Later, states issued ‘lockdown’ or shelter-in-place orders, which made it extremely hard to distinguish the treatment group from the control group. Restricting the analysis to the period immediately following the social distancing measures is therefore appropriate.
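The reshaping and growth-rate steps described above can be sketched as follows. This is a minimal illustration in Python rather than the R code used for the paper, and every name and number in it is hypothetical. The text does not spell out the exact growth-rate formula, so the sketch assumes the day-over-day percentage change in new daily cases.

```python
from datetime import date

# Hypothetical wide-format input: one entry per state, one value per date,
# each value a cumulative confirmed count (made-up numbers, not JHU data).
dates = [date(2020, 3, 1), date(2020, 3, 2), date(2020, 3, 3), date(2020, 3, 4)]
wide = {
    "State A": [1, 2, 4, 7],
    "State B": [0, 0, 3, 3],
}

def wide_to_panel(wide, dates):
    """Return one {state, date, cumulative} record per state-day."""
    panel = []
    for state, counts in wide.items():
        for day, cumulative in zip(dates, counts):
            panel.append({"state": state, "date": day, "cumulative": cumulative})
    return panel

def add_growth_rate(panel):
    """Add daily new cases and their day-over-day growth rate in percent.

    Assumed definition: 100 * (new_t - new_{t-1}) / new_{t-1}.
    The rate is None when the previous day's count is missing or zero.
    """
    by_state = {}
    for row in panel:
        by_state.setdefault(row["state"], []).append(row)
    for rows in by_state.values():
        rows.sort(key=lambda r: r["date"])
        prev_cum, prev_new = None, None
        for row in rows:
            new = None if prev_cum is None else row["cumulative"] - prev_cum
            if new is None or prev_new is None or prev_new == 0:
                row["growth_pct"] = None
            else:
                row["growth_pct"] = 100.0 * (new - prev_new) / prev_new
            row["new_cases"] = new
            prev_cum, prev_new = row["cumulative"], new
    return panel

panel = add_growth_rate(wide_to_panel(wide, dates))
```

Leaving the rate undefined when the previous day has zero new cases is one possible convention; other choices, such as a log difference of cumulative counts, would treat early low-count days differently.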
2.2 Treatment and Control Status
Glanz et al. (2020) published an article in the New York Times titled “Where America Didn’t Stay Home Even as the Virus Spread.” Using location data from the data intelligence firm Cuebiq, the authors provide a map of the United States showing, following the implementation of social distancing, “… when the average distance travelled first fell below 2 miles[iii]”. This is the working definition of social distancing that we use in this paper. Glanz et al. (2020) divide places into 5 categories according to the date on which the mean distance travelled first dropped under two miles. The dates are March 16, March 19, March 24, and March 26. We place the states without social distancing until March 26 in the control group, and the states that implemented social distancing earlier than March 26 in the treatment group. In some states, a few counties did not show social distancing until March 26 although the majority of counties did. We put these states in the control pool as well, in order to obtain a clean and conservative treatment group. The control group for this study includes the following states: Idaho, Wyoming, Utah, Arizona, New Mexico, Oklahoma, Arkansas, Alabama, Louisiana, Mississippi, South Carolina, Tennessee, Virginia, West Virginia, Kentucky, Kansas, North Dakota, South Dakota, Nebraska, Missouri, Iowa, Illinois, Wisconsin, Indiana, North Carolina, Florida, Maryland, Pennsylvania, Vermont, and Texas. The treatment group consists of California, Colorado, Connecticut, Delaware, District of Columbia, Georgia, Guam, Hawaii, Maine, Massachusetts, Michigan, Minnesota, Montana, Nevada, New Hampshire, New Jersey, New York, Northern Mariana Islands, Ohio, Oregon, Puerto Rico, Rhode Island, Virgin Islands, and Washington. The event (treatment) date is Mar 26, 2020. Figure 1 shows the average outcome of the treatment group and its constructed counterfactual in the pre- and post-treatment periods.
States vary in the date on which their first infection was detected. Assuming the novel coronavirus has a life cycle that is independent of a state’s first reported infection date, different states may show a slower or faster growth rate of infection depending on that first day. The first day may also significantly influence how seriously people take social distancing, because the number of cases is low at the beginning and grows exponentially later. Our treatment status can therefore be endogenous to the first day of infection, which is a proxy for the life cycle of the virus. We control for the first day of infection in our model in order to break the dependency between the growth rate of infection and treatment status.
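The assignment rule described above can be written as a small sketch. The state names and dates below are hypothetical placeholders, not the values read from the Glanz et al. (2020) map, and the sketch ignores the judgement call the paper makes for states whose counties are mixed.

```python
from datetime import date

EVENT_DATE = date(2020, 3, 26)  # event/treatment date stated in the text

# Hypothetical input: first date on which a state's average distance
# travelled fell below 2 miles; None means it had not done so by the cut-off.
first_below_two_miles = {
    "State A": date(2020, 3, 19),
    "State B": date(2020, 3, 24),
    "State C": None,
}

def assign_group(first_date, event_date=EVENT_DATE):
    """Label a state 'treatment' if it distanced strictly before the event date."""
    if first_date is not None and first_date < event_date:
        return "treatment"
    return "control"

groups = {state: assign_group(d) for state, d in first_below_two_miles.items()}
```

Under this reading of the rule, a state that first dropped below two miles on March 26 itself falls in the control group, since it shows no social distancing "until March 26".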
2.3 Summary Statistics
In this section we provide some descriptive statistics and visualizations. First, we look at the average growth rate of infection for the treated and control groups, before and after the intervention. In the pre-treatment period, the average growth rate of infection was 33% for the treatment group and 19% for the control group. In the post-treatment period, it was 88% for the treatment group and 75% for the control group. Figure 2 presents the average growth rate of infection by period and group for our sample time frame.
Note that the treatment states show a higher average growth rate both before and after the treatment date of March 26. Their infection growth rate is thus above that of the control states throughout our sample period. This makes our analysis both more relevant and more interesting: we want to see whether, even after social distancing, the treatment states continued to show a high growth rate of infection relative to the control states. Without a proper causal inference technique, the descriptive statistics and graphics could mislead the reader into concluding that social distancing had no effect in curbing infection rates. We therefore test whether a more sophisticated model supports or contradicts what is visible to the naked eye. Figure 3 shows the variation in the growth rate across the treated and the control states before and after social distancing.
In the pre-treatment period, the standard deviation (SD) of the growth rate of infection was 6.57 for the treatment group and 2.62 for the control group. In the post-treatment period, it was 9.35 for the treatment group and 7.41 for the control group. The treated states evidently differ much more from each other than the control states do in terms of the growth rate of infection. We believe the smaller variation within the control group makes it a better donor pool for constructing the synthetic counterfactual for our treatment group under the method used in this paper.
2.4 Method
The classic model used to understand the impact of an event on an outcome of interest is difference-in-differences (DID), the most widely used model for answering causal questions. A major limitation of this model is its heavy dependence on the assumption that the mean outcomes of the treatment and control units follow parallel time trends in the pre-treatment period. The same parallel trend is assumed to hold in the post-treatment period in the absence of the treatment. In addition, identifying the treatment effect requires exogeneity of the treatment event. In other words, the treatment status cannot be determined by any factor that also affects the outcome variable of interest. Sometimes a workable assumption is conditional independence (also known as ‘selection on observables’), which states that if we can identify the variables to which the treatment is endogenous, we can control for them in the model. In this way, we can break the dependency between the treatment status and the outcome of interest created by those variables.
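To make the DID logic concrete, the sketch below applies the textbook two-by-two formula to the raw group means reported in Section 2.3. It is purely illustrative: it is not the paper's estimate, it includes no controls, and it is only meaningful under the parallel-trends assumption discussed above. The function name is hypothetical.

```python
def did_estimate(treat_pre, treat_post, control_pre, control_post):
    """Two-by-two difference-in-differences on group means."""
    return (treat_post - treat_pre) - (control_post - control_pre)

# Average growth rates of infection (percent) reported in Section 2.3.
naive_did = did_estimate(treat_pre=33, treat_post=88, control_pre=19, control_post=75)
print(naive_did)  # -1, i.e. (88 - 33) - (75 - 19) percentage points
```

The change in the treatment group (55 points) and in the control group (56 points) are almost identical, which is why raw means alone say little about the effect of social distancing.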
Another method that has gained momentum is the Synthetic Control Method (SCM) proposed by Abadie, Diamond and Hainmueller (2010). SCM relaxes the parallel trend assumption of DID and essentially computes a “synthetic twin” of the treatment unit by reweighting the control units using pre-treatment data on the outcome and other covariates. In our opinion, SCM uses a machine learning approach to create the counterfactual for the treatment unit in the post-treatment period. It calculates a weight for each control unit using the pre-treatment period and then applies those weights to the post-treatment control unit data to create the counterfactual for the treated unit. One caveat is that SCM is applicable to a single treatment unit. In this paper, we use a more sophisticated approach, the Generalized Synthetic Control Method (GSCM) proposed by Xu (2017), which combines SCM with the interactive fixed effects model, an approach for modelling time-varying, unit-specific factors. These factors are not observed in the data but are nevertheless accounted for. GSCM fits an interactive fixed effects model to the control unit data to obtain the latent (time-varying) factors and then uses these factors to estimate factor loadings (unit-specific intercepts) for the treated units (Xu, 2017). This implies that GSCM relaxes even the selection-on-observables assumption to a great extent and permits the treatment status to be endogenous to unknown time-varying and unit-specific covariates.
We exploit this advantage in our paper because our treatment assignment of social distancing is not random. GSCM also allows for multiple treatment units, which is the case in our paper. Xu and Liu (2020) show how to implement the model in R[i]. We follow their code to implement the model on the data for this paper. We also assess the matching quality between the treatment average and the synthetic twin in the pre-treatment period by visually checking whether their paths overlap. Given a close pre-treatment match, any difference in the post-treatment period can be attributed to the effect of social distancing.
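For reference, the model behind GSCM can be written compactly. The notation below follows our reading of Xu (2017) and should be treated as a summary sketch rather than this paper's exact specification: D_{it} is the treatment indicator (here, social distancing), x_{it} collects observed covariates, f_t is the vector of latent time-varying factors, and \lambda_i is the vector of unit-specific factor loadings.

```latex
% Outcome model for unit i in period t
Y_{it} = \delta_{it} D_{it} + x_{it}'\beta + \lambda_i' f_t + \varepsilon_{it}

% Imputed untreated outcome for a treated unit in a post-treatment period
\hat{Y}_{it}(0) = x_{it}'\hat{\beta} + \hat{\lambda}_i' \hat{f}_t

% Average treatment effect on the treated in period t,
% averaged over the N_{tr} treated units in the set \mathcal{T}
\widehat{ATT}_t = \frac{1}{N_{tr}} \sum_{i \in \mathcal{T}} \left[ Y_{it}(1) - \hat{Y}_{it}(0) \right]
```

The factors f_t and coefficients \beta are estimated from the control units only, and the loadings \lambda_i of each treated unit are then estimated from its pre-treatment outcomes.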
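The visual check of pre-treatment fit can be complemented by a simple numeric summary. The sketch below is an assumption on our part and not a step described in the paper: it computes the root mean squared gap between the treated average and the synthetic counterfactual before the event, and the period-by-period gaps afterwards. All series values are made up.

```python
from math import sqrt

def pre_treatment_rmse(treated_avg, synthetic_avg, n_pre):
    """Root mean squared gap over the first n_pre (pre-treatment) periods."""
    gaps = [t - s for t, s in zip(treated_avg[:n_pre], synthetic_avg[:n_pre])]
    return sqrt(sum(g * g for g in gaps) / n_pre)

def post_treatment_gaps(treated_avg, synthetic_avg, n_pre):
    """Treated minus synthetic outcome for each post-treatment period."""
    return [t - s for t, s in zip(treated_avg[n_pre:], synthetic_avg[n_pre:])]

# Hypothetical average growth rates: three pre-treatment and two post-treatment periods.
treated = [10.0, 12.0, 15.0, 14.0, 13.0]
synthetic = [10.0, 13.0, 14.0, 17.0, 18.0]

rmse = pre_treatment_rmse(treated, synthetic, n_pre=3)
gaps = post_treatment_gaps(treated, synthetic, n_pre=3)
```

A small pre-treatment RMSE relative to the size of the post-treatment gaps is what "overlapping paths" looks like numerically; here the gaps are negative, meaning the hypothetical treated average falls below its synthetic twin after the event.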