Impact of Modelling Approach on the Estimation of Government Intervention Effects in COVID-19

The efficacy of government interventions in epidemic has become a hot subject since the onset 15 of COVID-19. There is however much variation in the results quantifying the effects of interventions, which is partly related to the varying modelling approaches employed by existing studies. This paper therefore aims to examine how the choice of modelling approach would affect the estimation results of intervention effects, by experimenting with different modelling approaches on a same data set composed of the 500 most affected U.S. counties. We compare the 20 most frequently used methods from the two classes of modelling approaches, which are Bayesian hierarchical model from the class of computational approach and difference-in-difference from the class of natural experimental approach. We find that computational methods are likely to produce larger estimates of intervention effects due to simultaneous voluntary behavioral changes. In contrast, natural experimental methods are more likely to extract the true effect of 25 interventions. Among different difference-in-difference estimators, the two-way fixed effect estimator seems to be an efficient one. Our work can inform the methodological choice of future research on this topic, as well as more robust re-interpretation of existing works, to facilitate both future epidemic response plans and the science of public health.


Introduction
In fighting COVID-19, non-pharmaceutical interventions aiming at reducing mobility and contact have been implemented by governments around the world repetitively. Given the strong (and mostly negative) impact of these interventions on economy and social life, the efficacy of interventions has become a hot subject of study [1][2][3][4][5] , which is informative for both future epidemic response plans and the science of public health. However, there is much variation in the analysis 5 results quantifying the effects of the interventions, which prevents researchers and policy makers from drawing clear and reliable conclusions (estimates with the reproduction number as the outcome of interest are summarized in Table S1). There are two major sources of such variation: the first is the sample data (countries, regions) used in the analyses, as the effects of interventions might be naturally different in different local contexts influenced by the intensity of enforcement, 10 lifestyle, culture and so on; the second is the modelling approach, which might capture different structures in the data thus needs careful evaluation.
The modelling approaches employed by existing analyses can be grouped into two classes: those that incorporate a natural experimental design and those that do not (review of the methods used 15 in 23 highly relevant studies in Table S1). The former usually employ tools from the field of econometrics or biostatistics which have been developed for identifying causal influences in public policy and medical issues. The natural experimental design is featured by comparing regions that implemented or relaxed an intervention in a period (treatment group) to regions that did not (control group), given that the latter group is similar enough to the former group to be 20 considered as the counterfactual. The dominant method used has been difference-in-difference 4,6-11 , a method that identifies causal treatment effect through comparing pre-and post-treatment performance of control and treatment groups 12 ; and occasionally synthetic control 13 and interrupted time series analysis 14 . The second class of analyses do not incorporate a treatmentcontrol comparison but focus on maximizing model fit to observed data based on assumed model 25 structures. Specific methods that have been used include Bayesian hierarchical model 2,3,15-19 , generalized linear regression 5,20-25 and certain machine learning models 5 .
The assumed advantage of natural experimental design lies in the ability to rule out common temporal changes in the treatment and control groups, which might otherwise be mistaken as the 30 treatment effect, thus identify the true effect of an intervention. To be more specific, if there is a natural trend in the epidemic happening together with an intervention effect, researchers would be able to rule it out by subtracting the epidemic course of the treatment group with that of the control group. In the case of COVID-19, multiple processes could lead to such a natural trend, including increased precautionary behavior driven by the severity of the epidemic, decrease in the proportion of susceptible population, virus variants, etc. 9 . This paper therefore aims to empirically examine how the choice of modelling approach would affect the estimation results of intervention effects, by experimenting with different modelling 5 approaches on a same data set. We focus on the most frequently used methods from the two classes, which are difference-in-difference from the class of natural experimental methods and Bayesian hierarchical model from the class of more computational methods (can be checked in Table S1). It should be noted that among the non-natural experimental methods, generalized linear regression is also frequently used, however, most of the model settings are not able to 10 capture the causal relationships and only association can be claimed from the analysis results 20, 21,24,26 . More specifically, in terms of Bayesian hierarchical model, we use the model developed by Brauner et al. (2021), which is a later modification in a group of similar models 3,19 .
This model uses infection case and death data in each region to backward infer daily reproduction numbers and then the impacts of interventions on the reproduction number. In 15 terms of the difference-in-difference method, we test two estimators. The first is the two-way fixed effect estimator, which is a widely used difference-in-difference estimator in policy analysis and the most used one in relevant studies 12 . The estimation is implemented through a linear model with the outcome of interest as dependent variable and intervention status per region per day as well as region and day fixed effects as independent variables. Estimating the 20 linear model is equivalent to comparing the changes in treatment and control groups before and after intervention when there are two groups and two time periods. However, this estimator could be biased if there are multiple time periods and the treatment effect changes over time, which is possible in the case of COVID-19 as the compliance to interventions may change (e.g. it may take some time for compliance to increase, and compliance may also decrease as time goes) 19,27 . 25 To accommodate the heterogeneous effect, we further experiment with a new difference-indifference estimator that is robust to temporally heterogeneous treatment effect 28,29 . The robust estimator directly compares the epidemic course after n days of intervention status change in all possible pairs of treatment and control groups across the entire study period and computes the average treatment effect (n  0). 30 We take the United States as the case and analyze the effect of government interventions using county-level data of 500 counties with the largest number of infections. We use data from the first pandemic wave, that is, from March to August 2020, since intervention effect estimates in later periods could be confounded by more factors including lockdown fatigue, virus variants and vaccination. Six widely applied government interventions are evaluated, which are stay-at-home order, school closure, childcare closure, non-essential retail closure, banning small-size gatherings (below 10 people), and banning large-size gatherings (above 10 people) (Fig. 1).

5
Accurate knowledge of the effectiveness of government interventions is key to cost-effective policy making not only for controlling the on-going COVID-19 but also for future public health crisis. By directly experimenting with three main stream estimation methods, this work aims to facilitate a better understanding on the strength and weakness of relevant modelling approaches and the potential bias of reported results. While we cannot fully encompass all possible 10 methodology, our analysis is nonetheless informative on refining the analysis on this issue and grounding decision making on sound scientific evidences. interventions switched between none (0), partial restriction (0.5) and full restriction (1) during the study period. The widths of lines in the graphs are proportional to the number of counties in the corresponding status on the corresponding day.

Existence of common behavioral trends 20
Since the key difference between natural experimental and non-natural experimental methods lies in the use of a control group to rule out universal trend in the epidemic, we start by testing whether such trend exist and whether it correlates with the implementation or relaxation of interventions. Among the possible contributors to the universal trend, we focus on residents' behavioral changes, which is supposed to be a major contributor and the information is available 25 through Google's COVID-19 Community Mobility Reports that provides daily indicators on the movements of residents in a region (the county level for the United States) 30 .
By regressing the county-specific daily mobility indicators on day variables and intervention status, we find a statistically significant day effect on most days in our study period, indicating a 5 non-zero common change in the amount of travel conducted by residents in different counties on each day (Fig. 2). There was first an upward common trend in the time staying at home from March to mid-April, followed by a gradual decrease afterwards, yet by the end of the period, the time staying at home was still higher than before the pandemic. Actually, the day effects account for a larger proportion of the total variance in mobility than government interventions, as the 10 adjusted R 2 of the model is 0.22 with only intervention status as explanatory variables and increases to 0.66 when day effects are added (full results in Table S2). Correlation tests show that this common behavioral trend is statistically significantly correlated with the status of interventions (Pearson's r=0.02~0.39, all statistically significant at 0.001 level). The positive correlation suggests that the estimation of intervention effects could be upwardly biased by 15 simultaneous spontaneous behavioral changes without proper controlling measures, which confirms the concern behind natural experimental methods.

Comparison of model results 25
The intervention effect estimates produced by the three methods are shown in Fig. 3, which are different from each other given the diverse modelling strategies. Generally speaking, the semimechanistic model produces the largest estimates on interventions' effects in reducing Rt, which is consistent with our reasoning that non-natural experimental methods are likely to produce larger estimates. Three interventions are estimated to have a statistically significant impact on 5 reducing Rt, which are school closure (reducing Rt by -67.4%, 95% confidence interval: -69.7 to -64.8%), childcare closure (reducing Rt by -17.6%, -24.5 to -9.7%), and non-essential retail closure (reducing Rt by -6.9%, -14.9 to -1.2%). The particularly large effect of school closure might be related to the interaction between the sequence of interventions and structure of the However, none of these three interventions are found to have a statistically significant impact according to the two difference-in-difference estimators, which could be a consequence of excluding spontaneous behavioral changes by the natural experiment design. Generally, estimation results acquired from the two difference-in-difference estimators are closer to each 15 other than to the results of the semi-mechanistic model (differences are not statistically significant). Stay-at-home order is found to have the most prominent effect in reducing Rt by both of the two difference-in-difference estimators: -6.6% (-10.6 to -2.5%) as estimated by the two-way fixed effect estimator and -17.3% (-31.7 to 0%) as estimated by the robust estimator, which nonetheless are not statistically different (z-statistics = 1.27). The wider confidence 20 interval of the latter could be because the estimation is on a day-by-day basis so that a smaller sample is used in estimating the impact of implementing or relaxing an intervention for n days (n  {0 ~ 21}), resulting in more uncertainties. Besides, banning small-size gatherings is also found to statistically significantly reduce Rt by the two-way fixed effect estimator (-5.1%, -9 to -1.1%), while its estimate is associated with a wide confidence interval when using the robust estimator. 25 The other interventions are not found to have a statistically significant impact by both estimators.
All the interventions satisfy the parallel pre-trend assumption, which means that the epidemic course in the control group is similar to that of the treatment group before intervention status changes, so that the former can be proper counterfactuals for the latter (details on the methodology and results of parallel pre-trend test in Supplementary Method and Table S3&4). 30 Since the difference between the results from the two difference-in-difference estimators should stem from the temporally heterogeneous intervention effects, we then examine the day-by-day estimates of intervention effects produced by the robust difference-in-difference estimator (Fig.  3B). The results suggest that the intervention effects do change with the length of implementation. Among the two interventions that are found to have significant impacts on reducing Rt in the two-way fixed effect model, stay-at-home order shows an increasing effect over time, which might be caused by the gradual change of behavior and enforcement measures; and small-size gathering ban demonstrates a V-shaped trend, suggesting a loss of effect after 5 initial effectiveness.

Discussion
The wide use of non-pharmaceutical interventions in COVID-19 provides large-scale natural experiments on the impacts of various public health measures, which are valuable information for fostering the knowledge in relevant fields such as public health, urban resilience, public policy and informing future policy making. Diverse methodologies have been developed to evaluate the efficacy of interventions in COVID-19. By testing three major modelling methods 5 on the epidemic data from the United States, we demonstrate that quite different estimates of intervention effects could be acquired with different methods. Particularly, methods without explicit comparison of treatment and control groups tend to over-estimate intervention effects, as we find there is non-negligible spontaneous behavioral change happening together with the interventions. This is further testified by comparing our estimates to previous analyses-our 10 estimates from difference-in-difference methods are generally smaller than estimates from nonnatural experimental methods 2,17,20,23 , though this comparison is not absolutely valid due to varying geographical extents of the studies. The results suggest that estimates produced by methods with a natural experimental design could be more reliable, despite of differences caused by variance in details of modelling approaches. Our work can inform the methodological choice 15 of future research on this topic as data are still accumulating, and also facilitate more robust reinterpretation of existing works.
Regarding to the two estimators with natural experimental design, the one robust to temporal heterogeneity in intervention effects is methodologically more superior. However, since much 20 less observations are available for each n-day effect estimate, there could be less certainty about the estimate thus wider confidence intervals are produced. Considering that the results from the two difference-in-difference estimators are not statistically significantly different, the two-way fixed effect estimator is a preferable choice, which also has the advantage of fast computing. On our data, it takes 42 CPU hours (2.6 GHz) to finish bootstrapping 100 times for the estimation of 25 one intervention. Nonetheless, if longer period data is collected and used, the robust estimator might be able to draw narrower confidence intervals thus become a more preferable choice.
Nonetheless, when employing the two-way fixed effect estimator, one should be aware of the possible bias, which is jointly influenced by the intervention effect in each treatment-control pair 30 and their distribution across time 28,29 . Therefore, the difference between the results of the two difference-in-difference estimators in our analysis might not hold for other countries and time periods thus could not be taken as a reference. Neither might the temporal trends of intervention effects in this analysis (Fig. 3B) hold in alternative contexts, which could be influenced by the dynamics of enforcement intensity and compliance.
A limitation to our methodological analysis could be the potential signaling or spillover effect of interventions. The former refers to the possibility that increased intervention in treatment regions 5 makes people in the control group alerted and more cautious, or the reversed in the reopening stage 9 . The latter refers to the possibility that intervention in treatment regions also reduces (or increases in reopening) infections in the control group due to change in cross-region travel 7 . Both cases would deflate the intervention effect estimates in all the methods.

Testing common behavioral trends
We examine whether there is a common behavioral trend across all sample counties using Google's COVID-19 Community Mobility Reports. This dataset provides the daily percentage change in the visits to different categories of places (or the time spent at places of residence) 5 comparing with baselines (different baselines for each day of the week, using the median value from the 5-week period Jan 3 to Feb 6, 2020). We focus on the 'residence' indicator, referring to the percentage change in the time spent at home, since it can reflect the overall change of mobility. Next, we test whether there is a 'common component' in the time people staying at places of residence each day across counties after controlling for interventions, by regressing the where log(Rc,t) denotes the log-transformed instantaneous reproduction number in county c on day t; xi,c,t denotes the status of intervention i in county c on day t and i denotes their 30 coefficients; c and t denote the unit and time fixed effects, respectively; and c,t is the error term. We estimate robust standard error in a way that allows c,t to correlate at the county level 33 .
We use log(Rc,t) as the dependent variable to match with the Bayesian hierarchical model, whose parameter estimation is interpreted as the proportional change of Rt by interventions. The 5 coefficients of intervention variables in this model can be interpreted in the same way by taking exp(i). Daily Rt in each county is estimated with the method proposed by Cori et al. 34 , with a 7day sliding time window to smooth infection numbers and parameters of the serial interval of infections based on previous epidemiological investigation of COVID-19 (mean=7.5 days, standard deviation=3.4 days) 35 . Rt estimates with coefficient of variation greater than 0.3 are 10 excluded from further analysis (mainly due to small case numbers in the corresponding time window).

Experiment 3. Difference-in-difference: Robust estimator
The heterogeneity robust difference-in-difference estimator is proposed by de Chaisemartin and 15 D'Haultfoeuille 28 . This estimator estimates the average treatment effect of implementing an intervention for k days (k  0) across all the regions where an intervention switched from level d (d{0, 0.5, 1}) on t-1 to d ' (d ' {0, 0.5, 1}, d '  d) on t and remained d ' at least till day t+k (switching regions), using the regions where the intervention remains d between t-1 and t+k as controls (stable regions). Since the treatment effect is estimated through explicit matching 20 between switching and stable regions and analyzing their difference after a given number of days of treatment, temporal heterogeneity will not create bias. The estimator is computed as follows (9) 25 (10) (11) where DIDi,k denotes the estimated effect of intervention i after k days of implementation; 30 t{1,…,T} denotes the observed dates; Di,c,t denotes the status of intervention i in county c on day t; Ni,d,d',t,k and Ni,d,d,t,k denote the number of switching counties and stable counties between day t and t+k. If either Ni,d,d',t,k or Ni,d,d,t,k is 0, then DIDi,d,d',t,k is defined as 0. ei,c,t,k denotes the change of the outcome variable, which is the logarithm of instantaneous reproduction number, between switching and stable counties after controlling for the impact of other interventions (X'i,c,t); ei,c,t,k is estimated through Eq.11 with ordinary least square using only stable observations;  is a vector of coefficients and b is the constant. The standard error of DIDi,k is estimated by bootstrapping 5 for 100 times. Further, to account for the correlation of errors at the day level 4,36 , we utilized block bootstrap with days as blocks.
To compare with the estimates obtained with the other two methods, we further compute the average intervention effect across k days and the corresponding standard error. The average 10 effect is taken as the mean of DIDi,0 to DIDi,k and the standard error is computed as follows. (12) where sei,0-k denotes the standard error of the average effect after k days of intervention i, sei,a denotes the standard error of DIDi,a; and covi,a,b denotes the covariance between DIDi,a and 15 DIDi,b. We estimate DIDi,k for 21 days after an intervention, which is a common length of interventions in practice.