Population-level information as a key input for public health decision-making
While diagnostic tests are usually developed for individual diagnosis and patient care, their results also play a crucial role in public health decision-making. Population-level case data, collected from the number of positive diagnostic tests in surveillance systems worldwide, are a central input parameter for decision-making processes in public health policy. In this context, cases may represent different outcomes of contact with an infectious agent (e.g., infections or deaths), as well as different types of measures of this contact (e.g., incident cases or cumulative cases derived from seroprevalence studies).
Surveillance systems for infectious diseases report the number of cases associated with specific pathogens using standardized case definitions based on pre-defined rules (including diagnostic test results) and legal obligations. These surveillance systems run constantly for notifiable diseases associated with high public health risks. Surveillance-related case data (based on diagnostic test results) are directly used for public health decision-making. They enable the development and parameterization of infectious disease models (e.g., for early warning and monitoring) and of decision-analytic models (e.g., for assessing the benefit-harm balance, cost-effectiveness or other trade-offs when guiding public health interventions). This is especially true in epidemic or pandemic situations, when reducing harm at a population level becomes a crucial aspect of the decision-making philosophy8,9 and high-consequence decisions must be made under uncertainty and time pressure. In such scenarios, two fundamental and extremely relevant quantities are: a measure of the presence of the infection in the population (e.g., prevalence or incidence data), and a measure of existing immunity to the infection in the population, i.e., seroprevalence data.
An important decision supported by dynamic infectious disease modelling studies focusing on predicting infection dynamics is the timing of interventions. Interventions are most effective when deployed in time10 and may cease to be effective if implemented too late11. Decisions about implementing interventions must therefore be made in a timely manner, sometimes with incomplete evidence, but with all relevant information collected and reported appropriately. Monitoring population-level data from as early as possible is essential, since such data can be used to set thresholds for starting interventions12 and to determine when intervention measures are no longer necessary and can be ended13.
Due to reporting delays and the fact that the decision-making process is not instantaneous, decisions can come too late when relying solely on current population-level data. This is where infectious disease modelling comes in. Models help decision-makers obtain reasonable estimates of how the epidemic is likely to progress and what impact different interventions may have. This enables timely and informed decision-making14–16. Combined with benefit-harm and health economic models to account for unintended effects and costs of interventions, infectious disease models enable decision-makers to make optimal decisions given the available evidence and resources17–19.
The points discussed above are exemplified by decision-making during the SARS-CoV-2 pandemic. Even during the early phases of the pandemic, decisions about interventions were made with population-level data in hand. In the UK, the timing of the first nationwide lockdown was determined based on the predicted number of people treated for SARS-CoV-2 in the intensive care unit (ICU)12. In Australia, more targeted lockdowns were implemented based on regional prevalence data20,21, and local lockdowns were also implemented in the UK during later phases of the pandemic22. Prevalence data became even more important when contact tracing and test-intervention strategies were implemented, because the predictive value of diagnostic tests depends on the infection prevalence. As vaccines became available, subpopulations most at risk of severe COVID-19 were prioritised and given the opportunity to be vaccinated first23,24. In Germany, vaccination and testing rules controlling access to parts of public life varied from region to region. Again, the region-specific thresholds were based on the number of hospitalised patients testing positive for SARS-CoV-2 in the respective region.
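The dependence of a test's predictive values on prevalence can be made concrete with a short calculation; the accuracy values and prevalences below are purely illustrative:

```python
def predictive_values(sens, spec, prev):
    """Positive and negative predictive values from sensitivity,
    specificity and prevalence (expected cell proportions)."""
    tp = sens * prev              # true positives
    fp = (1 - spec) * (1 - prev)  # false positives
    fn = (1 - sens) * prev        # false negatives
    tn = spec * (1 - prev)        # true negatives
    return tp / (tp + fp), tn / (tn + fn)

# The same test (95% sensitive, 98% specific) at low vs. high prevalence:
for prev in (0.001, 0.05, 0.30):
    ppv, npv = predictive_values(0.95, 0.98, prev)
    print(f"prevalence {prev:>5.1%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

At 0.1% prevalence, fewer than 1 in 20 positive results of this hypothetical test would be true positives, whereas at 30% prevalence about 19 in 20 would be.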
Mathematical models were used throughout to support the decision-making process. The threshold for applying the first nationwide lockdown in the UK was set based on the number of people estimated to be in need of ICU treatment based on different modelling scenarios12. In Austria, the decision to prioritise vaccinating elderly and vulnerable groups was based on decision-analytic modelling to minimise hospitalisations and deaths25. In general, infectious disease and decision-analytic models contributed substantially to the type and intensity of interventions implemented26–28. Once tests became widely available, they were also used to devise effective mass testing and isolating strategies29–31.
The current pandemic has thus demonstrated the need for accurate and timely population-level case data and clinical case data (requiring different diagnostic tests and testing strategies), to allow public health policy decisions to be as well-informed as possible. Diagnostic tests, as the primary tool to obtain these population-level data, are therefore at the heart of all modelling efforts during an epidemic or pandemic, and early and precise knowledge about their accuracy is crucial for interpreting and further applying these case data.
Challenges for diagnostic test evaluation in an epidemic setting
Diagnostic tests developed for emerging infections should serve various purposes, including individual clinical diagnosis, screening, and – as discussed above – surveillance. These purposes demand distinct strategies and, in theory, require separate approval mechanisms32. However, test development, the evaluation of technical validity, clinical validity and utility, and test validation currently do not account for this in a generalized way. The challenges and potential solutions described in this article, and the framework proposed herein, have been developed with all these purposes in mind.
In the initial phase of an outbreak of an emerging infection, the main focus of diagnostic test development is providing a diagnostic test that can identify infected individuals with high sensitivity, so that they can be isolated and treated as soon as possible. This is usually achieved by direct detection of the pathogen, e.g., by molecular genetic tools like polymerase chain reaction (PCR), microscopy, antigen tests or cultivation of the microorganisms involved. Later, a better understanding of the immune protection conferred by contact with the agent is required, leading to the development of indirect pathogen detection tools, i.e., antibody tests. Here, sensitivity and specificity are equally important to evaluate proxies of long-term immune protection and to detect past low-severity infections that would otherwise have been missed. The specificity of the direct detection tools developed earlier can also come into play in the case of reported reinfections, where it becomes important to understand whether these reinfections were due to false positives in a time of intensified testing. High specificity is also important once treatment options are available but possibly come with relevant side effects, high costs or limited availability. Once tests begin to be used as part of an intervention strategy or for other population-level aims, they need to be developed as point-of-care (POC) diagnostic tests, which may tolerate lower accuracy but must be easy and quick to administer in practice. Furthermore, target populations, testing aims and prioritised estimators (e.g., sensitivity or specificity) can change rapidly, necessitating constant test evaluation and re-evaluation.
During an epidemic or pandemic, direct and indirect tests are thus used for different purposes and require different study designs, with different sample size calculations and study populations, to provide critical information with high precision and validity.
During epidemics with emerging infections, all new tests must, in general, quickly go through three steps: the test must be developed, its clinical performance assessed, and then information on its performance incorporated into infectious disease modelling to inform public health decision-making. Each step involves potential sources of bias that must be considered. In the following, we describe potential challenges during these steps and how these challenges might affect the submission process to regulatory agencies, taking into account the perspective of test developers from the industry.
Diagnostic test development
Diagnostic tests for emerging infections typically fall into the so-called in vitro diagnostic (IVD) test category, as they examine human body specimens (e.g., nasopharyngeal swabs, nasal swabs, blood or saliva32). IVDs are generally considered medical devices33. Consequently, their development has to adhere to the rules of regulatory agencies and a pre-defined, complex legal framework. Currently, the EU IVD Regulation 2017/746 covers IVD medical devices and focuses on a legislative process that prioritises individual safety, which means that different types of clinical data must be collected before submission. If a test is deemed capable of distinguishing infected individuals from non-infected ones, it must be demonstrated that this capability is not a one-off result34.
There are several phase models for the development of diagnostic tests in the literature. In the following, we use the frequently used four-phase model2,35,36:
- In phase I, the analytical performance is evaluated,
- in phase II, the diagnostic accuracy is estimated roughly and the threshold is determined,
- in phase III, the clinical performance is estimated in a confirmatory way, and
- in phase IV, the test is evaluated together with the subsequent diagnostic and/or therapeutic measures with regard to a patient-relevant endpoint.
Inter-rater agreement, analytical sensitivity (minimally detectable levels)34 and cross-reactivity have to be investigated in phase I studies to verify the technical validity, repeatability and reproducibility of laboratory tests (on a lot-to-lot, instrument group, and day-to-day basis). However, in the early phase of an epidemic or pandemic, there are often not enough samples from infected individuals. Sharing data and using a common infrastructure, for instance by collecting samples at national reference centres, could solve this problem, provided the samples are made accessible to IVD developers. A possible limitation of this approach is the risk of spectrum bias due to the particular mix of individuals, e.g., there may be more severe cases among the samples than in the target population. Furthermore, regulatory agencies do not allow the use of (frozen) biobank samples for approval.
After having shown good technical performance, the next step is demonstrating clinical performance in phase II and III studies. An integral part of assessing the sensitivity and specificity of a continuous diagnostic test is determining the threshold at which it should be used34. This must be fixed before moving on to diagnostic test evaluation, to avoid bias caused by data-driven threshold selection37,38. The optimal threshold for a diagnostic test depends on the prevalence and on the consequences of misclassification in either direction31,39, both of which may change over time; this would mean that a new study is needed every time the threshold changes, requiring extensive resources (especially time and money).
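To make this prevalence dependence concrete, the following sketch picks the cost-optimal operating point of a continuous test from a small set of candidate thresholds; the operating points and misclassification costs are hypothetical:

```python
# Hypothetical ROC operating points for a continuous test: (threshold, sens, spec)
operating_points = [
    (1.0, 0.99, 0.80),
    (2.0, 0.95, 0.90),
    (3.0, 0.88, 0.96),
    (4.0, 0.75, 0.99),
]

def expected_cost(sens, spec, prev, c_fn=5.0, c_fp=1.0):
    """Expected misclassification cost per person tested:
    cost of false negatives plus cost of false positives."""
    return c_fn * prev * (1 - sens) + c_fp * (1 - prev) * (1 - spec)

def optimal_threshold(points, prev, c_fn=5.0, c_fp=1.0):
    """Operating point minimising the expected misclassification cost."""
    return min(points, key=lambda p: expected_cost(p[1], p[2], prev, c_fn, c_fp))

# The optimum shifts as prevalence changes:
for prev in (0.02, 0.20):
    thr, se, sp = optimal_threshold(operating_points, prev)
    print(f"prevalence {prev:.0%}: threshold {thr}, Se {se}, Sp {sp}")
```

With a false negative weighted five times a false positive, the cost-optimal choice moves from the most specific operating point at 2% prevalence to a markedly more sensitive one at 20% prevalence.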
Phase II studies are initial, so-called proof-of-concept studies of clinical performance and are often carried out in a two-gate design40, where sensitivity is estimated in diseased individuals and specificity in healthy samples from a different source. However, this design can lead to spectrum bias (Table 1). Rutjes et al.40 have already pointed out that sensitivity and specificity are generally overestimated in such studies. Likewise, Lijmer et al.41 have shown in their meta-analysis that a two-gate case-control design is likely to lead to an overestimation of diagnostic accuracy. In most situations outside an epidemic or pandemic, tested individuals are symptomatic and suspected of having the infection of interest when the test is intended to guide therapy or decisions about isolation. During epidemics or pandemics, however, tested individuals can also be asymptomatic if the test is intended as a contact tracing tool or screening test34. In both cases, real-world samples may not be as pristine as those in a laboratory setting34, because testing can also be performed at the POC, in the community, at the workplace, at school, or at home32. A test may require different performance characteristics if it is the first test in line, used to triage who will be tested further, compared to when it is used to confirm infection. For instance, in a confirmation setting, most individuals who clearly do not have the infection of interest will already have been excluded34.
Diagnostic test evaluation
IVDs must be evaluated in phase III diagnostic accuracy studies that ideally start by including all individuals who would be tested in clinical practice, to avoid selection bias (all-comer studies). Individuals fulfilling the inclusion criteria should be enrolled consecutively, without judging how likely each person is to test positive or negative34. In such prospective diagnostic studies, to minimise variability and thus increase statistical power, all study participants ideally undergo all tests under investigation (index tests) as well as the reference standard used to assign their final diagnosis.
The reference standard must be sufficiently reliable to differentiate between people with and without the target condition, but it is usually not perfect34. This imperfection has to be taken into account when interpreting the results. Suppose a POC antigen test for SARS-CoV-2 is evaluated against a PCR reference standard, resulting in a sensitivity of 90%. This does not mean that 90% of people with SARS-CoV-2 will be detected, but that the POC test will be positive in 90% of cases with a positive PCR test. Solutions may include follow-up data or composite reference standards, which use all tests or clinical criteria available for a diagnosis. However, if the test under evaluation is part of this composite reference standard, incorporation bias may result42.
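The distinction matters because sensitivity measured against an imperfect reference overstates detection of true infections. A minimal sketch, assuming for illustration that PCR itself detects 95% of true infections and that the POC test detects none of the cases PCR misses:

```python
def detection_rate_vs_truth(sens_vs_ref, sens_ref, sens_in_ref_negatives=0.0):
    """Share of truly infected people detected by the index test, when the
    reference standard (e.g. PCR) itself misses a fraction of infections.
    sens_vs_ref          : index test sensitivity measured against the reference
    sens_ref             : reference standard's own sensitivity vs. the truth
    sens_in_ref_negatives: index test detection rate among reference-missed cases
    """
    return sens_vs_ref * sens_ref + sens_in_ref_negatives * (1 - sens_ref)

# POC test: 90% sensitivity measured against PCR; PCR detects 95% of infections.
print(detection_rate_vs_truth(0.90, 0.95))
```

Under these assumptions, the 90% figure reported against PCR corresponds to only about 85% of truly infected individuals being detected.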
Depending on the phase of the epidemic or pandemic, recruitment speed can vary considerably due to changes in incidence. The guideline on clinical evaluation of diagnostic agents of the European Medicines Agency2 demands that the sample size of a confirmatory diagnostic accuracy study be specified in the study protocol. The required sample size is highly dependent on the prevalence of the target condition, which may change during the recruitment phase, potentially rendering a priori sample size calculations inappropriate by the time recruitment takes place.
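The prevalence dependence of the required sample size can be sketched with a standard normal-approximation calculation; the target precision and prevalences below are illustrative:

```python
from math import ceil

def total_n_for_sensitivity(sens_expected, precision, prevalence, z=1.96):
    """Total participants needed so that the expected number of true cases
    yields an approximate 95% CI half-width of `precision` around the
    sensitivity estimate (normal approximation)."""
    n_positive = (z**2 * sens_expected * (1 - sens_expected)) / precision**2
    return ceil(n_positive / prevalence)

# Estimating Se = 90% to +/- 5 percentage points: the required cohort
# grows steeply as prevalence falls.
for prev in (0.20, 0.05, 0.01):
    print(f"prevalence {prev:.0%}: N = {total_n_for_sensitivity(0.90, 0.05, prev)}")
```

Because the number of truly infected participants is what drives precision, a prevalence drop mid-study multiplies the required enrolment and can invalidate the original plan.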
Submission to regulatory agencies
Studies for industry face rigorous regulatory and ethics requirements, as clinical trials follow strict processes and regulatory guidelines that are assessed in the regulatory submission process and potentially controlled by audits. Clinical studies must be transparent, traceable and reproducible, and special attention must be paid to data quality and privacy. This leads to very detailed study preparation, documentation and quality control, and to long, relatively inflexible study processes.
When the SARS-CoV-2 pandemic began in late 2019, the need for diagnostic tests grew with the rising number of cases. Regulatory bodies (such as the U.S. Food & Drug Administration, FDA) established country-specific emergency use authorization guidelines43,44 to make it easier and faster to bring a SARS-CoV-2 test to market and make it accessible during the pandemic. As soon as the emergency situation is declared over, tests must go through the regular submission process in every country to obtain clearance.
Requirements like sample size, inclusion criteria for subjects, properties of the reference test and more are different for each country's submission process and may change during an epidemic or pandemic. Therefore, it is not always possible to cover submissions for different countries or certificates within one study, and several studies must be planned.
The different and changing requirements are not the only challenges submission teams face. The changing prevalence of infection makes adequate project management and timeline planning difficult. Recruitment of positive cases fulfilling the recruitment requirements can be very slow, leading to a longer study duration and, therefore, a longer time to market. New mutations of SARS-CoV-2 make re-evaluations of statistical properties necessary. Considering regulatory changes during pandemics and possible mutations, (pre)planning such a study is complicated and time-consuming.
Potential solutions for the challenges presented
The challenges discussed in the previous sections are multidimensional but can be addressed by countermeasures in three areas. First, test developers should use methodological approaches to study design and statistical analysis that increase study efficiency and reduce the risk of bias. Second, strategic approaches and regulatory guidance for the industry should be deployed to clearly define opportunities but also limitations in the development and approval process. Third, results and feedback from population-level mathematical modelling should inform test development and validation, for instance by deriving optimal study designs based on formal value-of-information analyses.
Methodological solutions
Methodological solutions fall into two categories: statistical methods to control bias, and methods to increase speed and efficiency.
The different biases in diagnostic studies have been described extensively, both in general45–47 and also specifically in the context of the SARS-CoV-2 pandemic48 and POC tests for respiratory pathogens49. From a methodological standpoint, the problem of bias can be addressed in two ways: either by choosing a study design in the planning stage that minimises the risk of bias, or by using analytical methods that correct for potential bias.
An excellent overview of how to avoid bias through an appropriate design can be found in Pavlou et al.50. Important for the planning phase is the work of Shan et al.51, who present an approach to calculate the sample size in the presence of verification bias (i.e., partial or differential verification bias).
In terms of bias reduction methods during the analysis phase, most studies focus on the correction of verification bias. Bayesian approaches are mainly proposed for differential verification bias52,53, while there are a variety of methods for partial verification bias (for a methodological review see Chikere et al.54 or de Groot et al.55, for implementation in R see Arifin et al.56).
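As one classical example of such an analysis-phase correction, the Begg and Greenes approach re-weights the verified subsample under the assumption that verification depends only on the index test result. A minimal sketch with hypothetical counts:

```python
def begg_greenes(n_tpos, n_tneg, v_tpos, d_tpos, v_tneg, d_tneg):
    """Sensitivity and specificity corrected for partial verification bias,
    assuming verification depends only on the index test result
    (Begg & Greenes, 1983).
    n_tpos, n_tneg : all participants testing positive / negative
    v_tpos, v_tneg : how many of them were verified by the reference standard
    d_tpos, d_tneg : verified participants found truly diseased in each group
    """
    p_d_given_tpos = d_tpos / v_tpos  # P(diseased | test positive)
    p_d_given_tneg = d_tneg / v_tneg  # P(diseased | test negative)
    tp = p_d_given_tpos * n_tpos          # expected true positives overall
    fn = p_d_given_tneg * n_tneg          # expected false negatives overall
    tn = (1 - p_d_given_tneg) * n_tneg    # expected true negatives overall
    fp = (1 - p_d_given_tpos) * n_tpos    # expected false positives overall
    return tp / (tp + fn), tn / (tn + fp)

# 1000 tested, 200 positive; all positives verified, only 200 of 800 negatives:
sens, spec = begg_greenes(200, 800, 200, 180, 200, 20)
print(f"corrected Se {sens:.3f}, corrected Sp {spec:.3f}")
```

In this example, the naive sensitivity computed from verified participants only would be 90%, while the corrected estimate is about 69%, illustrating how strongly partial verification can distort the apparent accuracy.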
Time to market has to be reduced significantly in pandemics to find an optimal trade-off between misclassification and missed opportunities for action. From a statistical point of view, the methods and processes must be reconsidered. One possibility for improving study designs and statistical analyses is the use of adaptive designs, which can increase efficiency. These approaches have long been established in therapeutic studies and are also anchored in guidelines3,57. With adaptive designs, it is possible to make pre-specified modifications during the study. For example, inclusion and exclusion criteria can be changed, the trial can be terminated early for futility or efficacy, or the sample size can be recalculated. Thorlund et al.58 summarise the characteristics of typical adaptive designs very clearly in their review. Cerqueira et al.59 observed in their review of published studies with adaptive designs that the pharmaceutical industry in particular increasingly uses simple adaptive designs, while more complex adaptive designs remain rare.
In diagnostic studies, however, this topic is still relatively new, and experience with adaptive designs in diagnostic clinical trials for submissions is limited. A summary of the current state of research can be found for diagnostic accuracy studies in Zapf et al.60, for randomised test-treatment studies in Hot et al.61 and for adaptive seamless designs in Vach et al.62. Methods for blinded and unblinded sample size re-calculation in diagnostic accuracy studies have been published recently63–66, as have adaptive designs for test-treatment studies67 and adaptive seamless designs68. The diagnostic industry depends heavily on regulatory guidelines worldwide. If regulatory bodies emphasised more efficient diagnostic trials including, e.g., adaptive designs, this would incentivise the implementation of modern study designs.
In the following, concrete possible solutions to the above-mentioned challenges are explained as examples. For details, please refer to the corresponding articles.
- To address the problem of setting a threshold in an early study that may later turn out not to be optimal, the approach of Westphal et al.69 can be used: a limited pool of promising thresholds is selected and then evaluated simultaneously in the validation study, with the type I error adjusted accordingly. Another idea is to use mixture modelling without defining a threshold70. Prevalence-specific cut-offs might also be developed and defined a priori.
- If the testing strategy and thus the target population change during the study, adaptive designs offer the possibility to re-estimate the sample size in a blinded manner based on the prevalence estimated in the interim analysis63.
- To address the problem of biased diagnostic accuracy estimates in two-gate designs, a seamless enrichment design could be chosen, in which proof-of-concept and confirmation are performed together in one study68. However, regulatory authorities remain wary of the possible shortcomings of these innovative designs, and a lot of work is needed to get them approved71. This, in turn, leads manufacturers of diagnostic tests to be conservative in their study designs.
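The blinded sample size re-estimation mentioned above can be sketched as follows: the number of positive cases needed for the accuracy estimate stays fixed, while total enrolment is rescaled to the prevalence observed (pooled, blinded to index test results) at the interim analysis. All numbers are illustrative:

```python
from math import ceil

def reestimate_total_n(required_n_positive, interim_prevalence):
    """Blinded interim re-estimation: the number of true-positive cases needed
    for the planned precision is fixed; total enrolment is rescaled to the
    prevalence actually observed so far (pooled across arms, blinded to
    individual test results)."""
    return ceil(required_n_positive / interim_prevalence)

# Planned for 139 positives at an assumed 10% prevalence (N = 1390);
# the interim look shows only 4% prevalence, so enrolment must grow:
print(reestimate_total_n(139, 0.10))  # planned: 1390
print(reestimate_total_n(139, 0.04))  # re-estimated: 3475
```

Because only the pooled prevalence is used, the re-estimation does not reveal the interim accuracy results and therefore does not inflate the type I error.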
Solutions for political decision-making based on mathematical modelling
When considering model input data, one key aspect that modelling studies must take into account is the deliberate parameterization of test accuracy for case numbers that are based directly or indirectly on the results of a diagnostic test72. This typically includes incidence rates as well as seroprevalence estimates. During the first three months of the SARS-CoV-2 pandemic, only a minority of modelling studies in the field accounted for test accuracy estimates; the remainder used incidence and later seroprevalence data as if they represented the ground truth. This approach would be appropriate if incidence or seroprevalence data were already corrected for imperfect test accuracy. Even then, the correction procedure should be reported in the modelling study to enable a transparent evaluation of model parameterization, and the model(s) should be reparametrized once updated information on diagnostic test accuracy becomes available. The earlier that decisions are made based on updated information, the greater the impact of these decisions on population health (Fig. 1). A decision made earlier by just a few weeks or even a couple of days can make a huge difference, offering a critical time window for accelerated diagnostic studies. Figure 2 shows the sensitivity of model-based assessments of interventions to diagnostic test accuracy parameters. The results show that even relatively small biases in the estimation of test accuracy (much smaller than those found in the Cochrane reviews) for an antibody test used to derive the proportion of undetected cases in a population have an enormous effect on the predicted further course of the epidemic (the mechanism for this impact is that the proportion of undetected cases is used to correct reported case numbers before they are used to calibrate transmissibility estimates and other parameters).
Such results can be enough to change public health decision-making from, for instance, not implementing population-level contact reduction measures to introducing a hard lockdown, if the defined outcome of interest crosses a set decision-analytic threshold.
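One widely used way to correct an apparent prevalence for imperfect test accuracy before it enters a model is the Rogan-Gladen estimator; a minimal sketch with illustrative survey numbers:

```python
def rogan_gladen(apparent_prevalence, sens, spec):
    """True prevalence corrected for imperfect test sensitivity and
    specificity (Rogan & Gladen, 1978), clipped to [0, 1]."""
    corrected = (apparent_prevalence + spec - 1) / (sens + spec - 1)
    return min(max(corrected, 0.0), 1.0)

# A seroprevalence survey reads 8% positive with an antibody test that is
# 90% sensitive and 98% specific:
print(rogan_gladen(0.08, 0.90, 0.98))  # about 0.068
```

With a 98% specific test, an 8% apparent seroprevalence corresponds to a corrected estimate of roughly 6.8%; feeding the uncorrected figure into a model would overstate existing population immunity accordingly.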
Longitudinal panels as a platform for diagnostic accuracy studies
Given the rapidly changing research questions during an epidemic or pandemic, there is a huge practical challenge in setting up diagnostic studies, even with the modern study designs described above, because the acceptable time spans for recruiting study participants and for conducting the actual studies are very short. The availability of a study platform that allows immediate initiation of diagnostic studies reflecting the current research question and infection dynamics is indispensable for timely studies in the field. One way to ensure this is the sustainable implementation of a longitudinal panel within existing cohorts (e.g., the NAKO Health Study74) that is tested regularly for the presence or absence of the pathogen with one or several defined tests under evaluation. Another way would be to use data from hospitals, health insurers or public health agencies. In this approach, a platform comparable to the UK ONS panel75 or the REACT study76 can be built and used for two equally important purposes: the evaluation of the tests or testing strategies under study, and the real-time communication of the results of the respective tests representing current or past infection dynamics. In this setting, flexible and fast study designs can fulfil both purposes at the same time.
Feedback triangle at the centre of a unified framework
As discussed in the section above, the development and evaluation of diagnostic tests in an epidemic or pandemic setting is closely linked to the modelling studies used to inform political and public health decision-making. This link is at the centre of the unified framework we propose based on experiences during the SARS-CoV-2 pandemic (Fig. 3). The execution of diagnostic studies for new tests, or for new application areas of existing tests, depends heavily on current test strategies and those potentially applied in the future. Results from diagnostic studies are a direct input into mathematical modelling studies, and in turn the results of these models are used for decision-making within a defined decision-making framework. However, modelling studies can also give crucial feedback to those responsible for planning and analysing diagnostic accuracy studies. Here, so-called value-of-information analyses can help identify those gaps in knowledge about diagnostic test accuracy that need to be tackled first or require the greatest attention77. This can directly affect sample size estimations, for instance if it is clear that more precision is needed to estimate the test's specificity (as is often the case with antibody tests). Therefore, the optimal strategy for dealing with these constant feedback loops is to establish continuous collaboration between the disciplines representing the three parts of the loop (in green in Fig. 3). This collaboration platform can use the longitudinal panel with complementary perspectives described above to create a unified diagnostic test development and evaluation framework during an epidemic or pandemic. The modern study designs and bias reduction methods described above can be applied to obtain the best potentially available evidence on diagnostic test accuracy in different settings.
Diagnostic test-intervention studies using a cluster-randomised approach
In many situations, diagnostic test accuracy estimates should only be seen as surrogate information, since the actual outcome of interest during an ever-changing pandemic, especially in its later phases, is the effect of applying the test on clinical or population-level outcomes. Here it is possible, as discussed during the SARS-CoV-2 pandemic, to take a step further and move test evaluation to phase IV, i.e., diagnostic test-intervention studies. In this phase, individuals or clusters of individuals are randomised to a diagnostic strategy (e.g., regular testing of the entire population versus testing only in case of symptoms). The relevant clinical endpoint is then compared between randomised groups35. Thus, the test strategy is treated like an intervention evaluated for its effectiveness and safety. Diagnostic test accuracy contributes to this endpoint but is not the only factor under evaluation. The practicability of the strategy, as well as real-world effectiveness and interaction with other interventions (e.g., case isolation and quarantine of close contacts), are also assessed indirectly in this approach. In a dynamic infectious disease setting, where an intervention can have indirect effects on people other than the target population, only cluster-randomised approaches allow a reasonable estimation of the population-level effects of the intervention under study. In infectious disease epidemiology, similar designs are applied when assessing the effectiveness of vaccination programs on a population level, often combined with a staggered entry approach that allows all clusters to benefit from the intervention over time (the so-called stepped-wedge design). During the pandemic, small-scale pilot studies were discussed that tried to mirror such an approach in a non-randomised way, often claiming to be natural experiments.
However, most of them did not follow guidelines and recommendations available for diagnostic test-intervention studies that would have improved the quality of the results and their usefulness for evidence-based public health. Rigorous application of cluster-randomised diagnostic test-intervention studies to implement testing strategies can support decision-making processes in the later stages of an epidemic or pandemic.
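The staggered entry underlying a stepped-wedge design can be sketched as a simple allocation matrix, in which every cluster starts under control conditions and one cluster crosses over to the testing strategy per period (cluster and period counts are illustrative):

```python
def stepped_wedge_schedule(n_clusters, n_periods):
    """0/1 intervention matrix for a basic stepped-wedge design: all clusters
    start in the control condition, and one cluster crosses over to the
    intervention per period, so every cluster eventually receives it."""
    schedule = []
    for cluster in range(n_clusters):
        crossover = cluster + 1  # period in which this cluster switches
        schedule.append([1 if period >= crossover else 0
                         for period in range(n_periods)])
    return schedule

# Four clusters observed over five periods:
for row in stepped_wedge_schedule(4, 5):
    print(row)
```

Randomising the order in which clusters cross over (rather than assigning it) is what turns this schedule into a cluster-randomised stepped-wedge trial.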