Hypothesis testing and sample size considerations for the test-negative design

The test-negative design (TND) is an observational study design to evaluate vaccine effectiveness (VE) that enrolls individuals receiving diagnostic testing for a target disease as part of routine care. VE is estimated as one minus the adjusted odds ratio of testing positive versus negative comparing vaccinated and unvaccinated patients. Although the TND is related to case-control studies, it is distinct in that the ratio of test-positive cases to test-negative controls is not typically pre-specified. For both types of studies, sparse cells are common when vaccines are highly effective. We consider the implications of these features on power for the TND. We use simulation studies to explore three hypothesis-testing procedures and associated sample size calculations for case-control and TND studies. These tests, all based on a simple logistic regression model, are a standard Wald test, a continuity-corrected Wald test, and a score test. The Wald test performs poorly in both case-control and TND when VE is high because the number of vaccinated test-positive cases can be low or zero. Continuity corrections help to stabilize the variance but induce bias. We observe superior performance with the score test as the variance is pooled under the null hypothesis of no group differences. We recommend using a score-based approach to design and analyze both case-control and TND. We propose a modification to the TND score sample size to account for additional variability in the ratio of controls over cases. This work expands our understanding of the data mechanisms of the TND.


INTRODUCTION
The test-negative design (TND) is an observational vaccine study that is commonly used to monitor the effectiveness of influenza vaccines 1,2 as well as vaccines targeting rotavirus 3, cholera 4 and COVID-19 [5][6][7]. It originated from the indirect cohort study to measure pneumococcal vaccine effectiveness (VE) in 1980 8. In a typical TND 1, patients seek health care for symptoms of a particular disease, and their specimens are taken for laboratory testing for the vaccine-targeted pathogen. Groups of test-positive cases and test-negative controls are formed according to the test results, analogous to cases and controls in a case-control study. Vaccination history and demographic information of each enrolled individual are recorded. The central assumption is that the vaccine of interest has no impact on other etiologies of disease 2. The TND can also be used to estimate the relative effectiveness of two vaccines in a direct comparison, or the relative effectiveness of a single vaccine over time by stratifying on time since vaccination. Both test-positive cases and test-negative controls are restricted to people who would seek health care if they experienced symptoms, reducing selection bias due to health-care-seeking behavior 9,40. In addition, TND studies are cost-effective, as they require neither prospective follow-up nor active sampling of controls from the community 6. The studies can be integrated into existing surveillance systems 10.
The TND is commonly analyzed as a case-control study using either logistic [11][12][13][14] or conditional logistic regression [15][16][17]. Covariates often included are age, calendar time, sex, enrollment sites, and comorbidities 5,6,[18][19][20]. VE is estimated as one minus the adjusted odds ratio with an associated Wald-based confidence interval and p-value 21,22. A strength of vaccines is that they can be highly protective: many vaccines, such as those against COVID-19 39 and HPV 41, exhibit effectiveness above 90%. As a result, it is not rare to observe low numbers of vaccinated test-positive cases 14. In these settings, the Wald approach can produce unreliable or even intractable variance estimates. An alternative approach is to add a continuity correction 23,24 or use exact methods 11,25. Score testing is another option; when estimating the variance under the null hypothesis of no difference, the groups are pooled, which reduces sparsity. Even if a different null hypothesis is used, the results can still be tractable. However, score tests are not commonly used for TND analyses.
Limited guidance is available on power and sample size calculations for the TND. The TND is fundamentally a passive design, with investigators not having direct control over the number of test-positive cases and test-negative controls observed. Yet power and sample size calculations are useful for determining study feasibility and the breadth of eligibility criteria, planning the number of participating sites, and defining the study's duration. Investigators may conduct interim analyses of data as part of real-time monitoring, and they may wish to time these only after sufficient data have accrued. The most natural approach for power and sample size calculations is to use case-control equivalents to design the TND. For case-control studies, Breslow proposed a sample size corresponding to the Wald test in 1987 26. Fleiss modified Breslow's sample size corresponding to the Wald test by adding a continuity correction 27. The score sample size 20 was developed based on score statistics from a logistic regression. Other sample size methods, such as the arcsine transformation sample size 26 and exact unconditional and conditional test sample sizes 30, were also proposed.
In this article, we examine hypothesis-testing methods, assessing the performance of the Wald test without and with continuity corrections, and a score test based on a case-control study applied to TND data, with a focus on sparse data settings. We also compare the performance of their associated sample size calculations. We explore differences between the case-control and TND studies, and identify the added variability relative to the case-control studies due to the random ratio of test-positives to test-negatives in the TND. We propose a sample size calculation strategy for the TND to mitigate that additional variability.

Sample size methods
We consider three sample size calculation methods corresponding to three different hypothesis tests: a standard Wald test, a Wald test with continuity corrections, and a score test. These are one-sided hypothesis tests for the null and alternative hypotheses $H_0: \mathrm{VE} \le \delta$ and $H_1: \mathrm{VE} > \delta$, with $\delta \ge 0$. Data can be summarized in a simple 2x2 table with cell counts $a$, $b$, $c$, $d$ as shown in Table 1, and VE is estimated by one minus the odds ratio, i.e., $1 - \frac{ad}{bc}$. The equivalent null hypothesis is then that the log odds ratio is greater than or equal to $\log(1 - \delta)$, and the equivalent alternative hypothesis is that the log odds ratio is less than $\log(1 - \delta)$. Sample size calculations are often based on simplified assumptions, and we discuss the basic scenario without adjusting for confounders here. The approach adjusting for confounders will be similar to a logistic regression with added covariates 28.
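As a small illustration of the estimator above, the following Python sketch computes VE from a 2x2 table. The cell layout (a, b = vaccinated test-positive and test-negative; c, d = unvaccinated test-positive and test-negative) is an assumption taken from the expected cell counts given in the simulation settings, and the function name and the counts are hypothetical.

```python
def vaccine_effectiveness(a, b, c, d):
    """Estimate VE as one minus the odds ratio ad/(bc).

    Assumed layout: a, b = vaccinated test-positive / test-negative,
    c, d = unvaccinated test-positive / test-negative.
    """
    if b == 0 or c == 0:
        raise ValueError("odds ratio is undefined when b or c is zero")
    odds_ratio = (a * d) / (b * c)
    return 1.0 - odds_ratio


# Hypothetical counts, for illustration only.
print(vaccine_effectiveness(a=2, b=40, c=30, d=60))  # 1 - 120/1200 = 0.9
```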

Standard Wald test
The standard Wald test is a common test of the log odds ratio, where variance is estimated by the Delta method utilizing the alternative hypothesis. The Wald test statistic $Z_W$ based on the four cell counts in Table 1 is:

$$Z_W = \frac{\log\left(\frac{ad}{bc}\right) - \log(1 - \delta)}{\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}},$$

where $\delta$ is the margin in the null hypothesis $H_0: \mathrm{VE} \le \delta$. With a sufficiently large sample size, this test statistic follows a standard normal distribution, which can be used to derive a corresponding p-value. Note that if any of the cell counts are zero, the standard Wald test statistic $Z_W$ is intractable.
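A minimal sketch of this test in Python is given below, assuming the usual delta-method variance of the log odds ratio (the sum of reciprocal cell counts) and the cell layout a, b = vaccinated positive/negative, c, d = unvaccinated positive/negative. The function name and the example counts are hypothetical.

```python
import math
from statistics import NormalDist


def wald_test(a, b, c, d, delta=0.0):
    """One-sided Wald test of H0: VE <= delta on the log odds ratio scale.

    Returns (z, p_value); small z (a log odds ratio far below log(1 - delta))
    is evidence for H1: VE > delta.
    """
    if min(a, b, c, d) == 0:
        # Both log(ad/bc) and 1/a + 1/b + 1/c + 1/d break down with a zero cell.
        raise ZeroDivisionError("Wald statistic is undefined when a cell count is zero")
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = (log_or - math.log(1 - delta)) / se
    p_value = NormalDist().cdf(z)  # lower-tail p-value
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = wald_test(a=2, b=40, c=30, d=60)
print(z, p)
```

Raising an error for a zero cell mirrors the intractability noted in the text; a real analysis would need an explicit rule for such tables.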
Corresponding to the standard Wald test, the Fleiss sample size method 27 is widely used in practice for the design of case-control studies. We modify the formula to re-express the sample size in terms of parameters relevant to the TND. These are: VE, the assumed level of vaccine effectiveness; $p_N$, the expected fraction vaccinated among negative tests, which is a proxy for the vaccination coverage in the source population under the central assumption that the vaccine has no effect on test-negative illness; $1 - e^{-\Lambda_I(T)}$, the cumulative incidence of test-positive illness in the unvaccinated population, assuming that individuals test positive no more than once during the study period $T$ (i.e., gaining immunity after infection with the target pathogen); and $\Lambda_N(T)$, the cumulative hazard of test-negative illness during the study period $T$, allowing individuals to repeatedly test negative with different circulating pathogens producing similar symptoms.
From these inputs, we define several related quantities. These are: $p_I$, the expected fraction vaccinated among positive tests, $p_I \approx \frac{p_N(1 - \mathrm{VE})}{1 - p_N \mathrm{VE}}$ (see Appendix); and $\pi$, the expected fraction of test-positive cases amongst all tests (i.e., percent positivity). Based on the expected cell counts, $\pi$ can be approximated as follows:

$$\pi \approx \frac{(1 - p_N \mathrm{VE})\,(1 - e^{-\Lambda_I(T)})}{(1 - p_N \mathrm{VE})\,(1 - e^{-\Lambda_I(T)}) + \Lambda_N(T)}.$$

Alternatively, $\pi$ can be estimated based on historical surveillance data.
The quantity $\pi$ has a parallel to the ratio $k$ of controls to cases that is often specified in case-control studies (e.g., $k = 2$ for 2:1 controls to cases). In a case-control study with ratio $k$, the fraction of cases amongst all observations is $\pi = \frac{1}{k+1}$. In case-control studies, this quantity is pre-specified and fixed by design. In a TND, the number of positive tests or negative tests is typically not controlled due to the passive sampling. Then, $\pi$ represents the expected fraction of cases amongst all tests.
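The derived quantities can be computed directly from the four inputs. The sketch below is an illustration under the assumption that the expected test-positive and test-negative counts per person are $(1 - p_N \mathrm{VE})(1 - e^{-\Lambda_I(T)})$ and $\Lambda_N(T)$, as implied by the expected cell counts in the simulation settings; the function and argument names are hypothetical.

```python
import math


def derived_quantities(ve, p_n, cum_hazard_pos, cum_hazard_neg):
    """Return (p_i, pi, k) from VE, coverage and the two cumulative hazards.

    p_i: expected fraction vaccinated among test positives
    pi:  expected fraction of test positives among all tests
    k:   implied number of controls per case, (1 - pi) / pi
    """
    incidence = 1 - math.exp(-cum_hazard_pos)  # unvaccinated cumulative incidence
    p_i = p_n * (1 - ve) / (1 - p_n * ve)
    positives = incidence * (1 - p_n * ve)     # expected positives per person
    negatives = cum_hazard_neg                 # expected negative tests per person
    pi = positives / (positives + negatives)
    return p_i, pi, (1 - pi) / pi


# Settings quoted in the simulation section: hazards 0.001 and 0.002 per day over 100 days.
p_i, pi, k = derived_quantities(ve=0.95, p_n=0.30,
                                cum_hazard_pos=0.001 * 100,
                                cum_hazard_neg=0.002 * 100)
print(p_i, pi, k)
```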
The standard Wald sample size $n_W$, with one-sided significance level $\alpha$ and desired power $1 - \beta$, is adapted for the TND by expressing it in terms of $\pi$, $p_I$ and $p_N$. For a TND, $n_W$ denotes the estimated required total number of tests in the study.
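For orientation, the sketch below implements the textbook Fleiss formula for comparing two proportions with unequal group sizes, re-expressed through the case fraction $\pi$ and returning the total number of tests. Whether this matches the exact adaptation used in this paper is an assumption; the function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist


def fleiss_sample_size(p_i, p_n, pi, alpha=0.025, power=0.80):
    """Total number of tests from the uncorrected Fleiss two-proportion formula.

    p_i, p_n: expected fraction vaccinated among test positives / negatives
    pi:       expected fraction of test positives among all tests
    """
    z_a = NormalDist().inv_cdf(1 - alpha)
    z_b = NormalDist().inv_cdf(power)
    p_bar = pi * p_i + (1 - pi) * p_n  # pooled fraction vaccinated
    null_sd = math.sqrt(p_bar * (1 - p_bar) * (1 / pi + 1 / (1 - pi)))
    alt_sd = math.sqrt(p_i * (1 - p_i) / pi + p_n * (1 - p_n) / (1 - pi))
    n = (z_a * null_sd + z_b * alt_sd) ** 2 / (p_i - p_n) ** 2
    return math.ceil(n)


# 95% VE and 10% coverage, with pi derived from the simulation hazards.
p_n = 0.10
p_i = p_n * 0.05 / (1 - p_n * 0.95)
print(fleiss_sample_size(p_i, p_n, pi=0.301))
```

Under these assumptions the result should land near the sample sizes quoted in the power comparison (about 230 tests at 10% coverage), although small differences from the paper's own calculation are to be expected.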

Wald test with continuity corrections
To avoid zero cell counts, which make the standard Wald test statistic intractable, and to better approximate a normal distribution, a small number $\epsilon$, referred to as a continuity correction, can be added to each cell count. Various continuity corrections are described in the literature 23,35,36. An example is the Yates' correction, based on $\epsilon = 0.5$. The continuity-corrected Wald test statistic $Z_C$ is:

$$Z_C = \frac{\log\left(\frac{(a+\epsilon)(d+\epsilon)}{(b+\epsilon)(c+\epsilon)}\right) - \log(1 - \delta)}{\sqrt{\frac{1}{a+\epsilon} + \frac{1}{b+\epsilon} + \frac{1}{c+\epsilon} + \frac{1}{d+\epsilon}}},$$

where $\delta$ is the margin in the null hypothesis. For the continuity-corrected Wald test statistic, a corresponding sample size calculation method is the Fleiss sample size with Yates' correction, a modification of the standard Wald sample size. The corrected sample size $n_C$ is expressed as a function of $n_W$, $\pi$, $p_I$ and $p_N$.
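A sketch of such a correction is shown below. It uses the textbook Fleiss, Tytun and Ury continuity-corrected sample size, rewritten for a total sample size and a case fraction $\pi$; treating this as the form intended here is an assumption, and the function name is hypothetical.

```python
import math


def continuity_corrected_size(n_w, pi, p_i, p_n):
    """Inflate an uncorrected total sample size n_w for the continuity correction.

    Textbook form: n_c = (n_w / 4) * (1 + sqrt(1 + 2 / (n_w * pi * (1 - pi) * |p_i - p_n|)))**2
    """
    inner = 1 + 2 / (n_w * pi * (1 - pi) * abs(p_i - p_n))
    return math.ceil(n_w / 4 * (1 + math.sqrt(inner)) ** 2)


# Illustrative inputs: 95% VE, 10% coverage, an uncorrected size of 228 tests.
p_n = 0.10
p_i = p_n * 0.05 / (1 - p_n * 0.95)
print(continuity_corrected_size(n_w=228, pi=0.301, p_i=p_i, p_n=p_n))
```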

Score test
The final test considered is a score test based on the likelihood from a simple logistic regression with binary vaccination status 28. The test statistic $Z_S$ utilizes the estimated variance under $H_0$; it is built from $n$, the total number of tests, $U_0$, the score under the null, and $V(U_0)$, the variance of $U_0$ calculated based on the information matrix. For demonstration, when the margin of the hypotheses is 0, i.e., $\delta = 0$, the test statistic simplifies to $Z_S = (\hat{p}_I - \hat{p}_N)/\sqrt{\hat{V}}$, where $\hat{p}_I$ and $\hat{p}_N$ are the observed fractions vaccinated among test positives and test negatives and $\hat{V}$ is the empirical estimated variance of $\hat{p}_I - \hat{p}_N$ when $\delta = 0$ (see supplement for details). Note that the variance in the score test statistic is developed under the null hypothesis. By pooling data from groups under the null hypothesis, the test statistic is tractable even when an individual cell is zero, as long as all margins are non-zero.
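For the zero-margin case, the score test from a logistic regression with one binary covariate coincides with the pooled-variance comparison of two proportions (the signed square root of Pearson's chi-square without correction). The sketch below uses that standard form as a stand-in for the statistic described above; the function name and counts are hypothetical, and the cell layout is assumed to be a, b = vaccinated positive/negative and c, d = unvaccinated positive/negative.

```python
import math
from statistics import NormalDist


def score_test(a, b, c, d):
    """One-sided score test of H0: VE <= 0 with variance pooled under the null."""
    n_pos, n_neg = a + c, b + d        # test-positive and test-negative totals
    n_vacc, n_unvacc = a + b, c + d    # vaccinated and unvaccinated totals
    n = n_pos + n_neg
    if min(n_pos, n_neg, n_vacc, n_unvacc) == 0:
        raise ZeroDivisionError("score statistic needs all margins to be non-zero")
    diff = a / n_pos - b / n_neg       # difference in fraction vaccinated
    pooled = n_vacc / n                # fraction vaccinated under the null
    var0 = pooled * (1 - pooled) * (1 / n_pos + 1 / n_neg)
    z = diff / math.sqrt(var0)
    return z, NormalDist().cdf(z)      # lower-tail p-value


# A table with zero vaccinated test positives is still tractable.
print(score_test(a=0, b=20, c=15, d=28))
```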
For the score test statistic, a corresponding case-control score sample size $n_S$ can be calculated. The proposed score sample size for high VE can be found by grid search, starting from the case-control score sample size and continuing until the right-hand side of the power equation achieves the desired power.
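The grid search idea can be sketched as follows. This is a simplified illustration, not the paper's exact calculation: it assumes the number of test positives is binomial with probability $\pi$, uses a normal approximation for the power of the pooled test at each possible split, and averages those powers. All function names are hypothetical.

```python
import math
from statistics import NormalDist


def cc_power(n_pos, n_neg, p_i, p_n, alpha=0.025):
    """Approximate power of the pooled one-sided test for fixed group sizes."""
    if n_pos == 0 or n_neg == 0:
        return 0.0
    p_bar = (n_pos * p_i + n_neg * p_n) / (n_pos + n_neg)
    sd0 = math.sqrt(p_bar * (1 - p_bar) * (1 / n_pos + 1 / n_neg))      # under H0
    sd1 = math.sqrt(p_i * (1 - p_i) / n_pos + p_n * (1 - p_n) / n_neg)  # under H1
    z_a = NormalDist().inv_cdf(1 - alpha)
    return NormalDist().cdf(((p_n - p_i) - z_a * sd0) / sd1)


def tnd_power(n, p_i, p_n, pi, alpha=0.025):
    """Average cc_power over a binomial number of test positives among n tests."""
    total = 0.0
    for m in range(n + 1):
        log_w = (math.lgamma(n + 1) - math.lgamma(m + 1) - math.lgamma(n - m + 1)
                 + m * math.log(pi) + (n - m) * math.log(1 - pi))
        total += math.exp(log_w) * cc_power(m, n - m, p_i, p_n, alpha)
    return total


def tnd_sample_size(n_start, p_i, p_n, pi, target=0.80, alpha=0.025, n_max=100000):
    """Increase n from n_start until the averaged power reaches the target."""
    n = n_start
    while tnd_power(n, p_i, p_n, pi, alpha) < target:
        n += 1
        if n > n_max:
            raise RuntimeError("target power not reached below n_max")
    return n


p_n = 0.10
p_i = p_n * 0.05 / (1 - p_n * 0.95)
print(tnd_sample_size(n_start=228, p_i=p_i, p_n=p_n, pi=0.301))
```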

Simulations
To compare the case-control studies and TNDs, we performed a simulation study based on the same vaccine effectiveness and same population vaccine coverage.The ratio of cases to controls is fixed by design in the case-control study but variable in the TND, although we fix the expected value of the ratio for the latter so that the studies can be directly compared.
We considered scenarios across several levels of vaccine effectiveness, VE = 30%, 50%, 70%, 90%, 95%, with vaccine coverage $p_N$ = 10%, 30%, 50%, 70%, 90%. Vaccination is assumed to be completed before the study starts. Because vaccination coverage is constant over time, calendar time is not a confounder 32. A fraction $p_N$ of individuals in the population is randomly selected to be vaccinated and the remaining fraction $1 - p_N$ stays unvaccinated. An all-or-none vaccine model 31 is adopted. Among vaccinated individuals, a proportion of VE × 100% is randomly selected to be fully protected and the rest are not protected, sharing the same incidence rate as the unvaccinated.
To focus on the comparison between the TND and the case-control study, we assume a constant hazard for both test-positive and test-negative illness, i.e., $\lambda_I(t) = \lambda_I$, $\lambda_N(t) = \lambda_N$. We generate event times separately for test-positive events and test-negative events so individuals can test negative many times and remain in the at-risk source population. Individuals are not censored after they test positive due to the inclusive sampling. Notice that the source population we consider here is the population who will seek health care and be tested if sick.
The TND data do not require a fixed ratio of test-negative controls to test-positive cases, so we stop counting events when the number of tests reaches the desired sample size. The case-control data have the fixed ratio $k$, so we stop counting test-positive events when the number of positive tests reaches $n\pi$, where $n$ is the calculated sample size and $\pi$ the expected fraction of test positives; $n(1 - \pi)$ test-negative controls are then randomly selected from all test-negative events in the population. We also assume 100% sensitivity and 100% specificity of the diagnostic testing. Each scenario is run for 100,000 iterations. Simulations are performed using R (R Core Team, 2019).
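The two stopping rules can be illustrated with a simplified sketch. It is written in Python rather than R and draws each test directly from assumed probabilities instead of generating event times, so it is a stand-in for the simulation described above rather than a reproduction of it. The function names are hypothetical, and the inputs $p_I$, $p_N$ and $\pi$ are the expected fractions defined in the sample size methods.

```python
import random


def simulate_tnd(n, p_i, p_n, pi, rng):
    """One TND table: only the total number of tests n is fixed."""
    a = b = c = d = 0
    for _ in range(n):
        positive = rng.random() < pi
        vaccinated = rng.random() < (p_i if positive else p_n)
        if positive and vaccinated:
            a += 1
        elif vaccinated:
            b += 1
        elif positive:
            c += 1
        else:
            d += 1
    return a, b, c, d


def simulate_case_control(n, p_i, p_n, pi, rng):
    """One case-control table: the numbers of cases and controls are fixed."""
    n_pos = round(n * pi)
    n_neg = n - n_pos
    a = sum(rng.random() < p_i for _ in range(n_pos))
    b = sum(rng.random() < p_n for _ in range(n_neg))
    return a, b, n_pos - a, n_neg - b


rng = random.Random(2024)  # fixed seed so the illustration is repeatable
print(simulate_tnd(63, p_i=0.021, p_n=0.30, pi=0.254, rng=rng))
print(simulate_case_control(63, p_i=0.021, p_n=0.30, pi=0.254, rng=rng))
```

In the case-control table the column total a + c is the same in every replicate, whereas in the TND table it varies from run to run.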

Comparison between the test-negative design data and case-control data
Our simulation results allow us to compare the characteristics of the data generated by a TND and by a comparable case-control study with the same VE, vaccination coverage, and expected ratio of cases to controls. In Figure 1, we compare the distributions of the four cell counts across the two designs in a setting with 95% VE. The most notable difference was in the distribution of the unvaccinated test-positive cases (panel c).
Because the standard Wald test is intractable when a zero is present in the cell counts, we compared the frequency of observing zero vaccinated test-positive cases across case-control and TND studies (Table 2). Zeros can be very common for both study types when VE is high and vaccination coverage in the population is low. Overall, we noticed minimal differences in the frequency of zeros between the two designs, although in general more zeros are observed in the case-control study as compared to the TND, particularly when vaccine coverage is low. Thus, both designs are prone to intractability if a standard Wald test is applied.

Adding continuity correction to the Wald test
Moreover, we found that adding continuity corrections to the Wald test stabilized variance but induced bias in the point estimate. In Figure 2, we scanned the continuity correction from 0 to 2 and evaluated the bias and variance of the log odds ratio. The black line indicates the mean bias of the log odds ratio among 100k iterations, and the blue line is the standard error of the 100k log odds ratio estimates. When no continuity correction was added, both bias and variance were intractable since zeros occurred in the denominator. As the continuity correction increased, the estimated variance was stabilized, while the bias increased. Even with the widely used Yates' correction of adding 0.5 to each cell count, the bias was around 0.5.
Figure 2. Bias and standard error of log odds ratio for various continuity corrections for 30% vaccine coverage $p_N$ and 95% VE in the test-negative design.

Power performance of the three testing approaches
To broadly compare the three testing approaches and two study designs, we calculated simulated power for vaccination coverage $p_N$ ranging from 10% to 90%, all assuming 95% VE. For each vaccination coverage level, we calculated the sample size using the Wald formula to achieve 80% power. These sample sizes ranged from n=230 for 10% vaccine coverage to n=37 for 70% vaccine coverage, reaching their minimum at 70% coverage (supplement). We analyzed the data using the three tests, substituting a continuity-corrected version of the Wald test where the standard Wald test is intractable. The results are shown in Table 3. For both case-control studies and TNDs, the two types of Wald tests failed to achieve the desired 80% power, with some exceptions when vaccine coverage was 90%. Comparing the three tests vertically, we found that the score test performed the best across all scenarios. The score test had more stable performance; recall that the score statistic is still tractable when zero cell counts occur. Type I errors for the three tests were well controlled (supplement). Comparing the case-control studies and TNDs from equivalent settings, we typically observed lower power for the TND.
Next, we compared the sample size calculation methods corresponding to each of the three tests (Figure 3). The standard Wald test statistic $Z_W$ was not evaluated since it is frequently intractable.
Starting with the standard Wald sample size and test, Figure 4 shows very low power for both case-control and test-negative design studies when VE is high, especially for low vaccine coverage, indicating that the standard Wald sample size is insufficient. For the continuity-corrected Wald sample size and test, Figure 5 shows low power for both types of studies when VE is high and vaccination coverage is low, but high power (above the targeted 80%) when both VE and vaccination coverage are high; this indicates that the sample size is insufficient for low vaccine coverage but conservative for high vaccine coverage. For the score sample size and test, Figure 6 shows that power was maintained around the desired level.

Figure 6. Simulated power of score test with score sample size for case control (red) and test-negative design (green). x axis: vaccine effectiveness, y axis: simulated power. Vaccine coverage $p_N$ varies from 10% to 90% for different panels. Desired power is 80%.
In some of the scenarios where VE is high (90% and 95%), we observe lower power for the TND even though power was sufficient for the case-control study. To explore the reason for this discrepancy, we studied the estimated variance of the score as a function of the total number of test positives (Figure 7). Recall that the total number of test positives ($a + c$) is fixed by design in the case-control study but varies for the TND. When the total number of test positives in the TND is similar to the fixed value for the case-control study (shown in red), both designs have similar variability in the score test statistics. Yet when the total number of test positives is higher than expected, the TND score statistic has greater variance, and when the total number of test positives is lower than expected, the TND score statistic has lower variance. Thus, there is overall higher variability in the score statistic of the TND than in the case-control study, which is not reflected in the sample size calculations based on the case-control design.

Proposed TND score sample size and power performance for high vaccine effectiveness
Table 4 illustrates the proposed score sample size and the case-control score sample size for 90% and 95% vaccine effectiveness. The proposed score sample size was larger than the case-control sample size across the range of vaccine coverage for high VE, since it accounted for the additional variability in the TND.
Table 5 shows that the simulated power under the proposed sample size improved compared to the case-control score sample size across different vaccine coverages for 90% and 95% VE. The proposed sample size tended to be conservative, especially for low vaccine coverages.

DISCUSSION
We examined properties of the TND in comparison to a standard case-control study, with a focus on hypothesis testing and sample size calculation. We considered two Wald-based methods and a score-based method. For hypothesis testing, a key disadvantage of the Wald test is that it can be intractable for high VE because of sparsity in the number of vaccinated test positives. Adding continuity corrections to the Wald test enabled estimation but induced bias. For both the TND and case-control study, the score test was more robust across settings, particularly for high VE. Thus, we recommend score-based approaches for testing the vaccine effect in the logistic regression model. The score test can be readily fit using standard statistical software, and it would represent an improvement over Wald-based approaches, which are common in the TND literature 21,22.
With respect to sample size calculation methods, we recommend a score-based approach adapted from the case-control literature. When accompanied by score-based testing, we found this approach to be the most robust at maintaining the desired power. We detected a slight reduction in power for the score-based sample size in settings with high VE and low vaccination coverage. This reduction in power was more pronounced for the TND when compared to a traditional case-control study. While the ratio of cases to controls is constrained in case-control studies, this ratio is itself a random variable in TNDs. This is due to the TND's passive sampling scheme, where patient enrollment relies on health-care-seeking behavior and is not controlled by the investigators 37,40.
With too few test positives captured, the score test statistic is closer to the null value. We proposed a modified score sample size strategy for high vaccine effectiveness to account for the additional variability of this ratio, with variance calculated under the multinomial distribution. This approach enhances the power performance but provides conservative sample sizes. This work indicates that sample size calculation methods based on case-control designs have limitations when applied to TNDs and so should not be used uncritically. In this setting, study planning with simulation is another valuable tool.
The additional variability on the column margin in the contingency table results in the TND cell counts following a multinomial distribution rather than a binomial distribution with one-way variability as in the case-control data. Therefore, the likelihood linked with the logistic regression is not able to fully describe the variance of the vaccine coverage between test positives and test negatives, especially for high vaccine effectiveness and low vaccine coverage (few vaccinated test positives). With the distribution-based variance, the proposed sample size tends to yield power higher than desired. An alternative approach not considered here is to derive a score test sample size from a regression linked to the multinomial distribution.
The work has several limitations. We considered a simplified scenario with constant vaccine coverage, constant VE, and constant disease hazard over time. We did not consider patterns of health care seeking among the source population. The study population we considered is the population who will seek care if sick. Investigators need to account for the fraction seeking health care if they consider that health-care-seeking behavior varies by vaccination status 32, but the testing strategies and power calculations are similar. We also assumed the diagnostic test has perfect sensitivity and specificity 38. Furthermore, we did not consider confounders, such as age or risk status, that are commonly included in TND analysis. We simplified the scenario to focus attention on sample size calculations, which are frequently conducted using a variety of simplifying assumptions. Nonetheless, we expect the central points about sparsity at high VE causing a breakdown in the analysis and the role of added variability in the ratio of positives to negatives to carry forward into more complex settings. Other analytical methods, such as exact methods and Bayesian methods [42][43][44], are also discussed in the literature but not covered here. We focused on methods with a corresponding sample size formula. The continuity-corrected Wald test also has a link to Bayesian methods, with the added cell counts akin to a non-informative prior. Finally, we have framed the problem as a hypothesis test to assess whether VE > 0% or relative VE > 0% (in the case of a head-to-head comparison or vaccine waning). Investigators may prefer to test a different null hypothesis or seek a desired precision for the point estimate. This would require further modification.
The TND is a relatively new observational study design that is rapidly growing in popularity. Though it is in many ways similar to case-control studies, it has distinct features resulting from how cases and controls are passively sampled 37. The convenient sampling method results in extra variability in the number of test positives and the number of test negatives. In practice, at the outset of a TND study, it may be difficult to predict the number of tests that will accrue and their positivity; nevertheless, these approaches can help investigators assess the potential power of their study and can inform planning decisions such as the number of sites to include and patient eligibility criteria. Based on our examination, we recommend using the score test and score sample size under the case-control framework to design the study. Modifications of the score sample size were proposed to account for the additional variability in the ratio of cases over all tests. The work expands our understanding of the data features of the TND relative to a case-control design, bridging gaps in design approaches for the TND.

• Ethics approval and consent to participate
This study did not involve human participants, human data, or human tissue; therefore, no ethical approval or consent to participate was necessary.

• Consent for publication
The authors grant their consent for the publication of this manuscript.

• Availability of data and materials
The research described in this manuscript did not involve the use of any real data or materials.

• Competing interests
The authors declare no competing interests.

• Funding
This research was funded by NIH/NIAID R01-AI139761.

• Authors' contributions
Y.H. served as the main author, orchestrating the simulation studies and crafting both the methods and results sections. NE.D., the corresponding author, focused on refining the introduction and performing comprehensive manuscript revisions. Y.Y., ME.H., and IM.L. contributed through critical reviews and commentary, enhancing the manuscript's overall quality. All authors have reviewed the final version of the manuscript and consented to its submission.
In the simulation study, we consider $T$ as 100 days and constant hazards $\lambda_I = 0.001$, $\lambda_N = 0.002$ $\text{day}^{-1}$. With different combinations of vaccine effectiveness and vaccine coverage, 1%-10% of the population will be infected by the test-positive pathogen and around 20% of the population will be infected by test-negative pathogens by the end of the study. Each individual has at most one positive test and at most 3 negative tests. Less than 1% of individuals have more than one negative test in the settings considered. To ensure the study duration is around 100 days, the source population $N$ is calculated based on the expected cell counts (see Appendix). For each combination of vaccine effectiveness VE and vaccine coverage $p_N$, the unit values of cell counts are calculated: $a^{(1)} = E(a)/N = p_N(1 - \mathrm{VE})[1 - e^{-\Lambda_I(T)}]$, $b^{(1)} = E(b)/N = p_N \Lambda_N(T)$, $c^{(1)} = E(c)/N = (1 - p_N)[1 - e^{-\Lambda_I(T)}]$, $d^{(1)} = E(d)/N = (1 - p_N)\Lambda_N(T)$. The two Wald sample sizes and the score sample size are calculated at the 0.025 significance level and 80% desired power based on the $n_W$, $n_C$ and $n_S$ formulae. The source population size $N$ is then determined by dividing the preset sample size by the sum of unit cell counts, i.e., $N = \frac{n}{a^{(1)} + b^{(1)} + c^{(1)} + d^{(1)}}$, where $n$ is the calculated sample size.

In Figure 1, the unvaccinated test-positive cases (panel c) had far lower variability in the case-control study. This occurs because the total number of test positives is constrained by design in the case-control study. In contrast, in the TND, only the total number of tests was fixed, yielding greater variability in the individual cell counts. Differences are also visible for panel d, again reflecting the constrained column margins in the case-control study.
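The source population size calculation described in the simulation settings can be sketched as follows, using the unit cell counts as stated there. Rounding up to a whole number of individuals is an assumption, as are the function and argument names.

```python
import math


def source_population_size(n, ve, p_n, hazard_pos, hazard_neg, days):
    """Population size N whose expected number of tests over the study equals n."""
    incidence = 1 - math.exp(-hazard_pos * days)  # unvaccinated cumulative incidence
    cum_neg = hazard_neg * days                   # cumulative hazard of negative tests
    a_unit = p_n * (1 - ve) * incidence           # vaccinated test positives per person
    b_unit = p_n * cum_neg                        # vaccinated test negatives per person
    c_unit = (1 - p_n) * incidence                # unvaccinated test positives per person
    d_unit = (1 - p_n) * cum_neg                  # unvaccinated test negatives per person
    return math.ceil(n / (a_unit + b_unit + c_unit + d_unit))


# 95% VE, 30% coverage and a preset sample size of 63 tests, as in Figure 1.
print(source_population_size(n=63, ve=0.95, p_n=0.30,
                             hazard_pos=0.001, hazard_neg=0.002, days=100))
```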

Figure 1. Density of cell counts in the case-control study (red) and test-negative design (green) for 95% VE, 30% vaccine coverage $p_N$ with total sample size of 63.
From Figure 3, we observe that adding the continuity correction increased the Wald sample size by 20% to 50% for 95% VE. The score sample size was the smallest across all scenarios. The standard Wald sample size is similar to the score sample size for 10%-90% vaccine coverage. The required sample size for low vaccine coverage is the largest, while 70% vaccine coverage requires the smallest sample size. As vaccine coverage increases up to 90%, the sample size increases; this reflects more sparsity in the unvaccinated cells in the table.

Figure 4. Simulated power for the case-control study (red) and the test-negative design (green): the standard Wald test, with continuity correction when there are zero vaccinated test positives, with the standard Wald sample size. x axis: vaccine effectiveness, y axis: simulated power. Vaccine coverage $p_N$ varies from 10% to 90% for different panels. Desired power is 80%.

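Figure 5. Simulated power of the Wald test with continuity correction, using the continuity-corrected Wald sample size, for case control (red) and test-negative design (green). x axis: vaccine effectiveness, y axis: simulated power. Vaccine coverage $p_N$ varies from 10% to 90% for different panels. Desired power is 80%.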

Figure 7. Distribution of score test statistics for 30% vaccine coverage $p_N$, 95% VE for the case-control study (red) and the test-negative design (green). Brown dashed line indicates the critical value for the test statistic at the 0.025 significance level.
The terms of the case-control score sample size are related to the variance of the numerator of the score test statistic, $\hat{p}_I - \hat{p}_N$.

Proposed TND score sample size for high vaccine effectiveness
To account for the additional variability of the fraction of test positives over all tests, $\pi$, in the TND, we propose a modification to the case-control score power calculation for high VE. The standard calculation is based on a single assumed fraction $\pi$. We took the summation of the power over all possible values of the observed fraction $\hat{\pi}$. To calculate the probability of rejection for each value of $\hat{\pi}$, it is necessary to define two variance terms. The variance of the test statistic numerator $\hat{p}_I - \hat{p}_N$ under the null is roughly constant across values of $\hat{\pi}$, which we denote as $\sigma_0^2$.