‘Not finding causal effect’ is not ‘finding no causal effect’ of school closure on COVID-19

In a paper recently published in Nature Medicine, Fukumoto et al. tried to assess the government-led school closure policy during the early phase of the COVID-19 pandemic in Japan. They compared reported incidence rates between municipalities that had and had not implemented school closure in selected periods during March–May 2020, rigorously matching for potential confounders, and claimed to have found no causal effect on the incidence rates of COVID-19. However, the effective sample size (ESS) of their dataset was substantially reduced in the matching process because of imbalanced covariates between the treatment (i.e. with closure) and control (without closure) municipalities, which led to wide uncertainty in the estimates. The study title "No causal effect…" is therefore a rather strong statement, because the results are also consistent with a strong mitigating effect of school closure on the incidence of COVID-19.


Main Text
School closure as a means to control outbreaks was studied mostly for influenza prior to the emergence of COVID-19; these studies generally suggested low-to-moderate effects, but the evidence on other respiratory infections, including coronavirus diseases, has been limited (Viner et al., 2020). In the earliest phase of a pandemic, decisions sometimes need to be made without sufficient evidence; nonetheless, such decisions should undergo retrospective policy assessment to provide insights and refinement for future pandemic responses.
One of the challenges in this type of analysis of the early COVID-19 epidemic in Japan is the limited statistical power due to low case counts. During the first wave of the epidemic from February to June 2020, which overlapped with the study period of Fukumoto et al., Japan never observed more than 1,000 COVID-19 cases per day. As a result, out of the total 79,989 municipality-level daily counts from the 847 municipalities included, 99.9% were less than 10 cases per day (Figure S2 of the original study). Moreover, the matching technique used to minimise confounding has a known side effect of limiting statistical power, especially when there is little overlap in the covariates between arms.
Unfortunately, the analysis in Fukumoto et al. appears to suffer from these issues. As the saying goes, "absence of evidence is not evidence of absence": when the uncertainty range covers practically meaningful values, it should not be prematurely concluded that there is "no effect" just because the effect estimate is statistically insignificant. Here I highlight limitations of the analysis and discuss possible factors that may have rendered the study underpowered.

Relative ATC and ATT estimates
The original study measures the effect of school closures as the absolute difference in incidence rates between the treatment and control municipalities. However, the theoretical ground is unclear for assuming a fixed additive effect of school closures on the incidence rate per capita. The effect estimates relative to the baseline incidence would be a more intuitive and interpretable measure for assessment of its practical use. It should also be noted that since incidence rates can only take non-negative values, the absolute mitigating effect of school closure can only be as high as the average incidence rate in the control group.
I rescaled the reported average treatment effects (average treatment effect on the control: ATC and average treatment effect on the treatment: ATT) and their confidence intervals relative to the average outcome (incidence rate per capita) in the control group (Figure 1). The confidence intervals of the relative ATC and ATT cover most of the range from 100% reduction to 100% elevation, suggesting the underpowered nature of the original study. An effect of 50% reduction (i.e. -50% relative effect), which most experts would agree is of practical significance, or even complete reduction (i.e. -100%) was within the confidence intervals over a substantial part of the period of interest. An ESS of around 40-50 in the matched arms (Figure 1d) was likely insufficient to find statistical significance because the incidence of infectious diseases typically exhibits higher dispersion than in independent and identically distributed settings due to its self-exciting nature (i.e. an increase in cases induces a further increase via transmission).
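As a minimal sketch of this kind of rescaling (not the replication code itself), an absolute effect estimate and its confidence limits can be divided by the control-group mean outcome to express them as percentages. The function name and all numbers below are hypothetical, and the sketch assumes the control mean is treated as a fixed constant, ignoring its own sampling uncertainty.

```python
def relative_effect(estimate, ci_low, ci_high, control_mean):
    """Express an absolute effect and its CI as a percentage of the control mean.

    Assumption: control_mean is treated as fixed (its uncertainty is ignored).
    """
    if control_mean <= 0:
        raise ValueError("control_mean must be positive")
    return tuple(100.0 * x / control_mean for x in (estimate, ci_low, ci_high))


# Hypothetical values for illustration only (not taken from the original study).
point, low, high = relative_effect(
    estimate=-0.25, ci_low=-1.0, ci_high=0.5, control_mean=0.5
)

# A "no effect" conclusion is hard to justify if the interval still
# contains a practically meaningful reduction such as -50%.
covers_half_reduction = low <= -50.0 <= high
```

Checking whether the rescaled interval contains a threshold of practical significance, as in the last line, is the comparison that the relative scale makes straightforward.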

Statistical power demonstration with assumed causal mitigating effect of 50%/80%
To further examine the statistical power of the study, I artificially modified the dataset such that school closure has a 50% or 80% mitigating effect on the incidence rate per capita. On the treatment reference date (April 6) and onward, the expected incidence rate of each municipality in the treatment group was assumed to be 50%/20% that of the matched control municipality plus Poisson noise (see Supplementary document for details). The results suggested that, even with a mitigating effect as large as 50%/80%, the approach in the original study might not have reached statistical significance (Figure 2). The absolute ATT for the 50% mitigating effect (Figure 2b) appears similar to what was referred to as "no effect" in the original study. The ATT for the 80% mitigating effect was also statistically insignificant (Figures 2c and 2d), suggesting that the study was underpowered to find even moderate to high mitigating effects, if any. ATC estimates also yielded similarly insignificant or barely significant patterns (Figure S1).
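The logic of such a power check can be sketched with a simplified simulation. The sketch below is an illustration under stated assumptions, not the procedure in the Supplementary document: it draws both arms from Poisson distributions with hypothetical rates (rather than modifying the observed data), and it replaces the original study's estimator with a normal-approximation 95% confidence interval for the mean paired difference. The function name and parameter values are hypothetical.

```python
import numpy as np


def simulated_power(control_rates, effect, n_sim=2000, seed=0):
    """Share of simulated datasets whose 95% CI for the mean paired
    difference (treated minus control) excludes zero.

    Assumptions: independent Poisson counts per municipality, a
    multiplicative effect on the treated rate, and a normal-approximation CI.
    """
    rng = np.random.default_rng(seed)
    control_rates = np.asarray(control_rates, dtype=float)
    n_pairs = control_rates.size
    treated_rates = (1.0 - effect) * control_rates
    significant = 0
    for _ in range(n_sim):
        control = rng.poisson(control_rates)
        treated = rng.poisson(treated_rates)
        diff = treated - control
        se = diff.std(ddof=1) / np.sqrt(n_pairs)
        # If all differences are identical, se is 0 and no test is possible.
        if se > 0 and abs(diff.mean()) > 1.96 * se:
            significant += 1
    return significant / n_sim


# Hypothetical setting: 45 matched pairs, 0.3 expected cases per control unit.
rates = np.full(45, 0.3)
power_50 = simulated_power(rates, effect=0.5)
power_80 = simulated_power(rates, effect=0.8)
```

Because independent Poisson counts are less dispersed than real incidence data, a sketch like this would, if anything, be expected to overstate the power available in practice.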

Separation of propensity scores
I also noticed that propensity scores computed for one of the included subanalyses, inverse-probability weighting, exhibited substantial or complete "separation" (Heinze et al. 2002), and most samples were essentially lost due to the substantial imbalance in the assigned weights (Figure S2). Although separation of propensity scores can arise from overfitting, in this case it remained (while slightly ameliorated) even after addressing overfitting by Lasso regularisation (Figure S3). This indicates that the treatment assignments may have been nearly deterministic in the dataset, which can compromise the performance of quasi-experimental causal inference via "positivity violation" (Petersen et al. 2020).
The authors did not use propensity scores in the Mahalanobis distance-based genetic matching for the main analysis, in contrast to the general recommendation (Diamond and Sekhon, 2012)[1]. This means that the covariates that strongly determined the treatment assignment may not have received large weights (and therefore were not prioritised) in the matching process, which could leave bias arising from these potential confounders unadjusted for [2]. The robustness to this concern could be assessed by computing the ESS from another genetic matching that includes propensity scores and a calliper (to ensure the matched pairs have sufficiently similar features).
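How separation erodes the effective sample size can be illustrated with a short sketch. It assumes one common definition of ESS for weighted samples (Kish's formula, the squared sum of weights divided by the sum of squared weights) and ATE-type inverse-probability weights; the original study may have used different definitions, and the propensity scores below are hypothetical.

```python
import numpy as np


def ipw_weights(propensity, treated):
    """ATE-type inverse-probability weights: 1/e if treated, 1/(1-e) otherwise."""
    e = np.asarray(propensity, dtype=float)
    t = np.asarray(treated, dtype=bool)
    return np.where(t, 1.0 / e, 1.0 / (1.0 - e))


def kish_ess(weights):
    """Kish's effective sample size: (sum of w)^2 / sum of w^2."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / np.square(w).sum()


treated = np.ones(10, dtype=bool)

# Good overlap: every unit has a propensity score of 0.5, so weights are equal.
ess_overlap = kish_ess(ipw_weights(np.full(10, 0.5), treated))

# Near-separation: nine units are almost certain to be treated, while one
# treated unit has a score near 0 and receives a weight of about 1000.
scores = np.array([0.999] * 9 + [0.001])
ess_separated = kish_ess(ipw_weights(scores, treated))
```

With equal weights the ESS equals the number of units, whereas a single dominant weight pulls it towards 1, which is the sense in which most samples are "essentially lost" under separation.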
[1] The authors cite King and Nielsen, 2019 as a reason not to use propensity scores; however, King and Nielsen clarify that their criticism is specifically directed at propensity score matching and does not necessarily apply to the use of propensity scores in other methods, including genetic matching.
[2] For example, many regression coefficients for prefecture dummy variables had large values (~5 or larger) in the Lasso regularised model, whereas 236 out of 483 matched pairs of municipalities in the main analysis for April 6 had their prefecture dummy variables unmatched.

Conclusion
The reanalysis of Fukumoto et al. suggested that the study was underpowered to identify the presence of causal effects of school closure on COVID-19. While I recognise the importance of their attempt to assess the school closure policy given the collateral effects imposed on students and their families, I argue that their conclusion of "no causal effect" was not strongly supported by the data due to the limited statistical power. Finding no mitigating effect would not in itself be surprising, as children were not the centre of the outbreak, especially in the earliest phase (Davis et al. 2020); nonetheless, evidence claiming "no effect" would need to show that effects were at least below the level of practical significance.
Altogether, these limitations represent difficulties in post-hoc causal analysis of mass interventions implemented without a built-in evaluation design such as randomisation. The fact that even the reasonably designed approach of Fukumoto et al. suffers from insufficient power emphasises the importance of the "evidence-generating" philosophy in policy planning, as has been promoted for medicine (Embi et al., 2013).

Declarations
Code availability statement
Replication code along with the full analysis report (also provided as Supplementary document) is available from a GitHub repository: https://github.com/akira-endo/reanalysis_Fukumoto2021. The repository contains the replication code from the original study, which is partially modified and reused.

Figure 1
Relative average treatment effect on the control (ATC) and average treatment effect on the treatment (ATT). The turquoise vertical lines represent the date of treatment (school closure). The black lines and shaded areas represent the mean effect and 95% confidence intervals, respectively.