Method validation and method verification are two distinct processes that may be required at different points in the development of a method from initial design to implementation in multiple laboratories and/or semi-field sites. A laboratory is required to carry out validation when:
- A laboratory has designed or developed a new method
- A laboratory is required to demonstrate comparability between a novel method and an existing standard method
- A standard method has been modified
- A standard method is used for a new purpose
Full design and development of a method involves its conception from scratch, including preliminary testing to define whether the method is logistically feasible and can measure the desired outputs. Where the novel method measures the same outputs as an existing method, the novel and existing methods should be compared during validation. Modification of an existing standard method alters one aspect, for example extending the exposure time used in a test, and requires the internal and external validation process stages (see below) to be repeated. Using a method for a new purpose might involve, for example, applying a method validated for one product class to characterise a different product class. Validation in such cases depends on the magnitude of the change but may require feasibility experiments to demonstrate that the change in scope has not affected the capacity of the method to reliably capture its endpoints.
Laboratories adopting a validated method should conduct method verification [19], which can be conducted using controls of a known value and/or response and ensures that the implementing laboratory can reproduce the established method performance.
Stages of the method validation process
Four stages for evaluating bioassays and semi-field tests are proposed: 1) preliminary development, 2) feasibility experiments, 3) internal validation, and 4) external validation (Table 1). The stages are designed to ensure that the method is scientifically sound and reproducible within the variation exhibited in biological tests [5,19]. During preliminary development, the method is devised, and endpoints and analytical requirements are defined and tested [22]. At the feasibility stage, the performance parameters and endpoints are verified, and a standard operating procedure (SOP) is drafted. In internal validation, the analytical performance of the method is tested, the method claim is drafted, and a data package for external validation sites is compiled. During external validation, the method is evaluated in multiple laboratories/sites and the final method claim is produced. Once external validation is successful, the method can be implemented.
[Table 1 to appear here, table included in the Tables and Legends section]
Preliminary development
The purpose of the preliminary development stage is to assess the suitability of the proposed method design for a defined purpose in a defined setting, to define the endpoints and the level of allowable analytical error (both imprecision and inaccuracy) for each, and to build robustness into the method, i.e., to minimise the impact of changes in variables or testing conditions on results. Experiments conducted as part of preliminary development typically use small sample sizes so that data on a range of conditions and variables (for examples, refer to the testing conditions below) can be generated and used to refine the method parameters and guide the experimental design for feasibility and internal validation.
Define method scope and endpoints
The method design, application, and endpoints to be used to assess method performance should be clearly defined. An endpoint is a quantifiable output that can be recorded using the method, e.g., oviposition inhibition in female mosquitoes exposed to an insect growth regulator. Every endpoint that is intended to become part of the eventual method claim must be defined. The definition should state precisely what is to be measured, when, and the desired range of measurement, e.g., 'number of eggs laid per female, up to five days post-exposure, from 0-300 eggs'.
Define acceptability criteria
Acceptability criteria define the allowable error within the method and are dependent on the effect size of each endpoint. In the example from the previous section, for the measurement outcome '50% reduction in the number of eggs laid per female, up to five days post-exposure, from 0-300 eggs', an acceptability criterion might be: 'Measure a 50% reduction in the number of eggs laid with 10% precision within the reportable range'. The allowable error should be as small as possible yet align with what is practically achievable and scientifically justifiable [24].
Factors to consider when defining acceptability criteria (a worked example follows the list):
- Within-day imprecision should be less than one quarter of the total allowable error, or a coefficient of variation (CV) < 20% [19,25,26]. Between-day imprecision typically has the same allowable error, but this can be increased if justified [22]
- For measurement outcomes relating to target values, criteria can be set either as a multiple of standard deviation (SD), e.g., within 3 SD of the mean, or within a percentage range of the target value, e.g., ±25%
- For phenotypic measurement outcomes, an indicative threshold can be used, although such thresholds should be used with care [27]; for example, 98% mortality in a susceptibility test using a discriminating concentration in monitoring for insecticide resistance
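As an illustration of these criteria, the minimal R sketch below checks a set of hypothetical within-day replicate results against a CV criterion, a percentage-of-target criterion, and an SD-multiple criterion. All values are invented for illustration and are not derived from real bioassay data.

```r
# Hypothetical within-day replicates of an endpoint, e.g., % 24-h mortality
set.seed(42)
replicates <- rnorm(20, mean = 50, sd = 6)

# Within-day imprecision as a coefficient of variation (criterion: CV < 20%)
cv <- sd(replicates) / mean(replicates) * 100
cat(sprintf("Within-day CV: %.1f%%\n", cv))

# Percentage-range criterion: mean within +/- 25% of a target value
target <- 50
deviation_pct <- abs(mean(replicates) - target) / target * 100
cat(sprintf("Deviation from target: %.1f%%\n", deviation_pct))

# SD-multiple criterion: mean within 3 SD of the target
cat("Within 3 SD of target:",
    abs(mean(replicates) - target) <= 3 * sd(replicates), "\n")
```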
Identify the analytical parameter/s to be measured
At least one analytical parameter must be evaluated [17,20]. Within the common analytical parameters of accuracy/trueness, precision, linearity, range and robustness, the most useful parameters for bioassay and semi-field validations are typically precision, robustness, linearity (concentration dependence), and range (reliable range of test values) [17,20,21,28–30].
Define testing conditions
Test conditions encompass conditions critical for method performance. These can be identified from literature or in-house laboratory data [17]. Bioassay testing conditions include:
- Vector age
- Vector status (sex, fed/unfed)
- Preparation conditions for the vector, e.g., sugar starvation
- Vector holding conditions post-exposure
- Time of day
- Environmental conditions
- Maximum/minimum number of vectors per replicate assay
- Sample handling conditions
- Storage pre- and post-test
- Time to reach ambient temperature prior to testing
- Sample preparation (age, washing)
To determine the optimum conditions for the method, experiments varying the testing conditions should be conducted, for example, changing the time of day that the bioassay is conducted to determine whether the mosquito's circadian rhythm affects the results of the test [31]. A method is deemed robust if small variations in testing conditions do not heavily impact its performance [22] for the selected purpose, e.g., evaluation of pyrethroid content on an ITN. Testing conditions may have a small or large impact on the assay results depending on the specific mode of action of the chemistry being bioassayed.
Select a comparison method (where applicable)
If the novel method has been designed to measure the same outcomes as an established standard method, the two should be compared. Standardised methods currently recommended in WHO guidelines are the WHO cylinder test and bottle bioassay to measure insecticide susceptibility, the WHO cone test and the tunnel test to characterise ITN fabrics, and the Ifakara Ambient Chamber Test and experimental hut trials to measure entomological efficacy of ITNs [4,27]. Select as the comparator the method with the most similar test conditions and/or entomological endpoint(s) to the putative new method.
Define controls and control ranges
Negative (baseline) and positive controls must be defined. Since it is not always known what non-insecticidal features of a product may impact the measured endpoint, the negative control should be as close as possible to the product under evaluation. The positive control should induce a known and significant impact on the endpoint under evaluation. Methods designed for characterisation of dual active ingredient (AI) products must include controls which contain each AI separately and in combination [32–34].
Conduct baseline and robustness experiments
Baseline experiments assess the performance of a method under assumed optimum testing conditions. Robustness experiments identify variables or testing conditions that might affect the method's results [17,35].
For baseline experiments:
- Consider the testing conditions that can potentially affect results and define standard measurement levels, for example, a specified temperature range, to control for such effects
- Conduct trial experiments using the simplest design possible, for example, tests using negative controls such as an untreated net for an ITN method
For robustness testing:
- Alter testing conditions or variables one at a time whilst keeping all other parameters unchanged. Although it is possible to vary multiple conditions simultaneously [17,36], due to the high variability in bioassays, one variable at a time is recommended
- Evaluate the degree of robustness: significance testing, a procedure used to quantify whether a result is likely due to chance or to some factor of interest, can be employed to identify the factors that are important to consider when assessing the method's performance (see the sketch after this list)
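As a hedged illustration of this approach, the R sketch below (hypothetical data) varies a single testing condition, time of day, and uses a binomial generalised linear model to test whether it affects mortality while all other conditions are held constant.

```r
# Hypothetical robustness check: vary one condition (time of day) only
set.seed(1)
dat <- data.frame(
  time_of_day = rep(c("morning", "afternoon"), each = 20),
  dead  = c(rbinom(20, 5, 0.80), rbinom(20, 5, 0.70)),  # deaths per replicate of 5
  total = 5
)

# Binomial GLM: is mortality associated with time of day?
fit <- glm(cbind(dead, total - dead) ~ time_of_day,
           family = binomial, data = dat)
summary(fit)  # a significant coefficient flags time of day as a condition to control
```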
Sample size:
A sample size of at least twenty replicates per group should be used for baseline or robustness experiments [37].
When designing the experiments, apply the following definitions:
- Replicate: for example, a single set of five individual mosquitoes in a WHO cone test or mosquitoes exposed together in a tunnel test
- Sample: for example, a single piece of a net
- Testing system: for example, the mosquitoes being tested. Mosquitoes reared together under the same controlled conditions are referred to as the same testing system. This can be a single colony at a point in time or one colony maintained over time that is characterised and maintains fitness parameters within defined limits.
The data from the baseline or robustness experiments should be analysed and compared to the acceptability criteria. Where necessary, the method can be modified, and the outcome(s) and acceptability criteria refined and retested before proceeding to the feasibility stage. Figure 1 outlines a decision tree that can be used at each process evaluation stage to determine whether progression to the next stage is appropriate.
[Figure 1 to appear here, figure included in the Figures and Legends section after references]
[Text box 1 (Terminology) to appear here]
Feasibility experiments
Feasibility experiments are employed to understand the inherent variability of a method, to obtain values that can be used for estimating sample size for the internal validation experiments and to assess the utility and logistical ease of the proposed technique. Where two tests have equivalent performance characteristics, the one which is easier to use, cheaper, faster, more sensitive or more accurate might be preferred.
Estimating an appropriate sample size
Testing 20-30 replicates in a feasibility study is usually enough to obtain an estimate for variability/precision for use in formal sample size calculations [37,38]. Ideally, 20-30 replicates in each study arm (WHO cones or cylinders, for example) would be tested on a single day to estimate within-day precision, followed by testing at least one replicate per day over a period of 20 days whilst holding all conditions constant to estimate between-day precision [19]. However, for bioassays that use long exposure times, such as the tunnel test, this study design is not possible. The sample size should be adjusted appropriately to suit the design of the method, performing at least four replicates per day. Additionally, the use of insects as the test system in bioassays means that it is not possible to hold the test system constant, i.e., use the same mosquitoes each day. Rigorous colony rearing procedures should be followed to ensure colony stability to minimise insect variability, and data on fitness parameters should be collected for consideration as a potential source of variability [34]. To account for this variability, we recommend that at least four replicates should be tested for a minimum of five days wherever possible and any analysis should include day of testing as a variable to account for the temporal bias inherent in bioassays using live insects.
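The R sketch below illustrates, with simulated data only, how within- and between-day variability might be summarised from the recommended design of at least four replicates per day over at least five days; day of testing is retained as a grouping variable, as advised above.

```r
# Simulated feasibility data: 4 replicates per day over 5 days
set.seed(7)
days <- 5; reps <- 4
day_shift <- rnorm(days, 0, 4)                    # between-day variation
dat <- data.frame(
  day   = factor(rep(1:days, each = reps)),
  value = 60 + rep(day_shift, each = reps) + rnorm(days * reps, 0, 5)
)

# Within-day CV: average of the per-day CVs
within_day_cv <- mean(tapply(dat$value, dat$day,
                             function(x) sd(x) / mean(x))) * 100

# Between-day CV: CV of the daily means
day_means      <- tapply(dat$value, dat$day, mean)
between_day_cv <- sd(day_means) / mean(day_means) * 100

cat(sprintf("Within-day CV: %.1f%% | Between-day CV: %.1f%%\n",
            within_day_cv, between_day_cv))
```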
Describe testing pattern and testing period
The testing schema and testing period in the experimental design of feasibility experiments define how the within- and between-day error of the method will be measured, and are typically determined during replication sub-studies (refer to Replication experiments). The testing pattern should be balanced with respect to the number of replicates tested in a single day and the number of replicates tested each day over multiple days so that reliable estimates for the within- and between-day precision are obtained.
Defining final endpoints for validation and drafting an SOP
During the preliminary development and feasibility stages, multiple endpoints might be trialled. The data from feasibility studies are used to identify which of those endpoints are reliable and suitable for use in assessing the method's performance during internal and external validation. All selected endpoints and their acceptability criteria should be included in a draft SOP.
Select strains for use in validation experiments
Both pyrethroid-susceptible and pyrethroid-resistant mosquito strains can be used in validation experiments. Where relevant, strains should be selected with reference to existing WHO testing guidelines [27], WHO implementation guidance, and published works. For example, Lees et al. [34] provide a strain characterisation SOP which can be used for dual-AI ITNs and adapted as appropriate for other studies that require resistant mosquito strains.
Internal validation
The purpose of the internal validation phase is to ensure that the method is reproducible within a laboratory, i.e., minimally validated, and to compile a data package that can be used by external laboratories/sites to externally validate the method.
Determining appropriate sample size and study design
Data from feasibility studies are used in a formal power calculation to determine the sample size for internal validation. This can be achieved by using standard formulas for sample size estimation or simulation studies for complex designs involving multiple varying factors and testing schema [39–41]. The predefined effect size for the primary endpoint of interest together with the SD/variability estimated from the feasibility experiments should be used to estimate the sample size. In a case where multiple endpoints are of primary interest, we recommend that the endpoint with the smallest effect size and greatest variability in the feasibility experiments is used in the calculation [42].
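As a minimal sketch of such a calculation, the base R functions below estimate per-group sample sizes for a continuous endpoint and a proportion endpoint; the effect sizes and SD shown are purely illustrative stand-ins for values taken from feasibility data.

```r
# Continuous endpoint: detect a reduction of 50 eggs per female,
# with SD = 40 estimated from feasibility experiments
power.t.test(delta = 50, sd = 40, sig.level = 0.05, power = 0.8)

# Proportion endpoint: detect a difference in 24-h mortality of 70% vs 90%
power.prop.test(p1 = 0.70, p2 = 0.90, sig.level = 0.05, power = 0.8)
```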
Draft the method claim
The method claim is a statement that clearly defines the scope of the method and the outcomes, analytical parameters, and acceptability criteria associated with it. Considerations to be taken into account when employing the method, for example, incorporating the variability of sample materials into sample size calculations, should be stated as part of the claim. For example, the Video Cone Test (VCT) PLUS, an extension of the standard WHO cone test, is designed to characterise the effects of co-formulations of pyrethroids and non-pyrethroid insecticides based on mosquito activity in the cone (imprecision/CV < 30%) and 24-hour mortality within ±3% of the standard WHO cone assay. A detailed example of a method claim can be found on the Innovation to Impact (I2I) website [43].
Compile a data package
A data package must be produced by the laboratory that developed the method and provided to the external validating laboratories. The data package must include:
- SOP
- Method claim
- Applicability
- Measurement outcomes
- Analytical parameters
- Acceptability criteria and justification
- Study designs including sample sizes and testing schema
- Controls and/or criteria for selecting controls
- Criteria for strain selection
The method-developing laboratory should ensure that the product(s) and strain(s) used during the internal validation phase are characterised, and the results are provided together with the data package to assist with the interpretation of the validation results.
External Validation
As methods that use entomological endpoints to evaluate vector control tools are usually implemented in multiple laboratories and/or sites, at least two external laboratories should validate the method to ensure reproducibility [22,44]. These laboratories extensively validate the method by confirming that the method claim is reproducible at multiple sites/laboratories using a standardised SOP. This allows different levels of precision to be assessed, for example, within-day, within-laboratory, between-day, and between-laboratory. The external validation sites should follow the experimental design associated with the method claim that was defined following the internal validation stage. All the outcomes and analytical parameters associated with the method claim should be assessed.
A statement of the final claim and a full validation report are produced once external validation is complete. Ongoing quality assurance procedures or method verification in implementing sites certify results produced using the method.
Validation sub-studies
In each of the process stages of feasibility, internal validation and external validation, different relevant sub-studies are conducted depending on the intended purpose of the method and the design of the bioassay. Figure 2 shows each of the process stages and the possible sub-studies that might be employed.
When designing sub-studies, a single experiment can be designed for multiple purposes or to assess multiple analytical parameters. For example, an experiment designed to measure precision, i.e., a replication experiment, can include a comparator method, i.e., comparison experiments. Table 2 provides a summary of typical categories of the methods used to assess vector control tools and their associated studies and performance parameters.
[Figure 2 to appear here, figure included in the Figures and Legends section after references]
Table 2. Examples of experimental types which could be applied as validation sub-studies for methods used to evaluate vector control products.
Linearity or reportable range experiments
The reportable range of a method is the span of test values for which reliable results can be obtained; linearity is the ability of a method to obtain results which are directly proportional to a given concentration [19,44]. These studies can be implemented at all the stages of the validation process including baseline experiments. The purpose of these experiments is to determine a working range of the method’s results that is accurate and precise. For example, a reportable range for a method to measure the characteristics of an ITN might be the minimum to maximum level of 24-hour mortality which can be reliably measured by the method and the variability within the range.
For methods with phenotypic outcomes, establishing the LD50 and/or LD90 for each active ingredient can serve as a substitute. Methods intended to be used for durability monitoring of products should be assessed using, e.g., ITNs that have undergone various numbers of washes, with accompanying chemical analysis of treatment concentration. This will approximate testing at different concentrations and ensure that method performance is validated against a range of different product conditions.
At least five replicates of known values at each concentration/number of washes (where appropriate) should be analysed by bioassay and chemical methods in triplicate to define the reportable range.
Data analysis
Linearity over the reportable range can be inspected visually using a scatter plot and line of best fit, or assessed by fitting a regression line through the points in the linear range [19,21,45]. To control for potential confounding factors, we recommend the latter. For methods that are non-linear, a non-linear regression curve can be fitted. For methods used to assess durability, precision should be evaluated throughout the range to determine the method's reliability for estimating entomological outcomes over time/number of washes.
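The base R sketch below illustrates the recommended regression approach on an invented wash/mortality series; the relationship and all values are hypothetical.

```r
# Hypothetical durability series: mortality against number of washes
set.seed(3)
washes    <- rep(c(0, 5, 10, 15, 20), each = 5)
mortality <- 95 - 2.5 * washes + rnorm(length(washes), 0, 4)

plot(washes, mortality,
     xlab = "Number of washes", ylab = "24-h mortality (%)")
fit <- lm(mortality ~ washes)   # regression line through the linear range
abline(fit)

summary(fit)$r.squared          # closeness to linearity over the tested range
confint(fit)                    # uncertainty in slope and intercept
```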
Replication experiments
Replication experiments are conducted during the feasibility, internal validation, and external validation process stages. During replication experiments, estimates are obtained for random error [19]. The goal is to determine the typical variability of the method during normal usage through measuring precision [22] and therefore the experimental design should encompass routine day-to-day variations.
Precision can be evaluated at different levels [22,44]:
- Repeatability (intra-assay/within-run precision): precision observed among replicate bioassays performed under the same operating conditions within a day
- Intermediate precision: within-laboratory variation across, for example, different days, different operators, and different mosquito-rearing cages
- Reproducibility: precision of agreement between laboratories
Repeatability and intermediate precision are evaluated during the feasibility and internal validation stages, while all levels of precision should be evaluated during external validation. Repeatability variability is usually smaller than the other two levels of precision because many more sources of variation contribute to between-laboratory variation than to within-laboratory or within-day variation [18,22]. Therefore, careful attention should be paid when defining the acceptability criteria for the different levels of precision.
Table 3 gives a summary of stages involved when conducting replication experiments.
[Table 3 to appear here, table included in the Tables and Legends section]
Data analysis
Common measures of precision are the SD and the CV, also known as the relative standard deviation. However, these measures are not ideal if the data are non-normally distributed, contain a high proportion of outliers, or have unequal numbers of replicates per group [46]. In such cases, alternatives to the CV can be used, such as the Geometric Coefficient of Variation (GCV), Coefficient of Quartile Variation (CQV), Coefficient of Variation based on the Median Absolute Deviation (CVMAD), and Coefficient of Variation based on the Interquartile Range (CVIQR) for simple estimates [47–49], or the intra-class correlation coefficient (ICC) [50,51]. More details about the formulas, pros and cons of each method, and examples of R packages (where available) are contained in Table S1 (Additional file 2). A sketch of these alternatives follows.
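The base R sketch below computes the CV and these alternatives for a small, skewed, hypothetical dataset. The formulas follow common definitions, but the cited references and Table S1 should be consulted before adopting a particular variant; note that R's mad() applies a normal-consistency constant by default.

```r
# Skewed hypothetical data with one extreme value
x <- c(12, 15, 14, 13, 40, 16, 15, 14, 13, 15)

cv     <- sd(x) / mean(x) * 100
gcv    <- sqrt(exp(sd(log(x))^2) - 1) * 100            # geometric CV
q      <- quantile(x, c(0.25, 0.75))
cqv    <- unname((q[2] - q[1]) / (q[2] + q[1])) * 100  # quartile variation
cv_mad <- mad(x) / median(x) * 100                     # MAD-based CV (scaled MAD)
cv_iqr <- IQR(x) / median(x) * 100                     # IQR-based CV

round(c(CV = cv, GCV = gcv, CQV = cqv, CVMAD = cv_mad, CVIQR = cv_iqr), 1)
```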
The data analysis performed should reflect the study design that was implemented, and a data analysis plan should be produced in advance alongside the study protocol. Usually, there are different sources of variation in replication studies, and it is important to estimate precision whilst accounting for the variability of all possible factors. These factors can be fixed and/or random variables, for example, estimating the within-day variability while accounting for testing day, operator, and site variability. The most powerful approach to estimating precision in replication studies is to use mixed-effects models and to report the CV and/or ICC with their associated 95% confidence intervals (CIs) [18,50,51]. The incorporation of 95% CIs is critical given the many unknown factors that can influence the results of a study but cannot be controlled for in the study design [42]. These analysis methods are applicable to various types of data, including continuous, proportions, binary, and counts. For example, this can be implemented using the VCA (normal data only) and rptR R packages, among other software or packages [51,52].
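As an illustration, the sketch below uses the rptR package cited above to estimate the ICC (repeatability) with a bootstrapped 95% CI from a mixed-effects model with testing day as a random effect; the data are simulated and the design is deliberately minimal.

```r
library(rptR)

# Simulated replication data: 4 replicates per day over 10 days
set.seed(11)
dat <- data.frame(
  day   = factor(rep(1:10, each = 4)),
  value = rep(rnorm(10, 0, 3), each = 4) + rnorm(40, 50, 5)
)

# ICC for testing day, with a bootstrapped 95% CI
rpt(value ~ (1 | day), grname = "day", data = dat,
    datatype = "Gaussian", nboot = 500, npermut = 0)
```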
Comparison experiments
These experiments are conducted during the feasibility, internal validation, and external validation phases and determine whether there are any differences between an existing method and a new method. For example, the WHO cone test is the standard method to measure the impact of mosquito tarsal contact with an AI applied for vector control; a novel method developed to measure the impact of exposure using a different approach could be compared to the cone test to determine the comparability of the two methods. Usually, this is performed by testing the same sample with both methods [19]. However, such designs are not feasible for bioassays, as the same insects/replicate samples cannot be measured twice using different methods/tests since pre-exposure will influence the outcome of a second exposure [5]. Therefore, comparison experiments for bioassays should be conducted in parallel using the same test system under the same conditions for both the pre-existing and novel methods to allow comparison.
Comparison studies for methods designed to evaluate products with new modes of action should be undertaken in parallel with a product of known performance using existing methods. Table 4 gives a summary for implementing comparison experiments.
Table 4. Comparison experiments
| Stage | Notes |
| --- | --- |
| 1. Select comparison method | |
| 2. Determine the maximum number of replicates that can be performed in a single day | Four replicates are the minimum to calculate within-run imprecision |
| 3. Perform 30 replicates each of the new method and comparator method in parallel over the smallest possible number of days | For methods with extremely high variation, 30 replicates may be insufficient |
Data analysis
The data analysis will depend on the analytical parameter of interest, and it can be performed using the methods discussed above (as appropriate). To assess the performance of the novel method, a Bland-Altman plot should be employed to describe the agreement between the two methods based on the endpoint(s) of interest [53,54]. The results obtained from the two methods should be compared within a group (i.e., holding all other conditions/parameters constant).
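A minimal base R sketch of a Bland-Altman analysis is shown below, with hypothetical daily mean mortalities from the two methods run in parallel and paired by testing day, consistent with the parallel design described above.

```r
# Hypothetical daily mean mortality from parallel runs, paired by day
set.seed(5)
day_effect <- rnorm(15, 0, 5)
new_method <- 70 + day_effect + rnorm(15, 0, 3)
comparator <- 72 + day_effect + rnorm(15, 0, 3)

avg   <- (new_method + comparator) / 2
diffs <- new_method - comparator
bias  <- mean(diffs)
loa   <- bias + c(-1.96, 1.96) * sd(diffs)   # 95% limits of agreement

plot(avg, diffs,
     xlab = "Mean of the two methods (% mortality)",
     ylab = "Difference (new - comparator)")
abline(h = c(bias, loa), lty = c(1, 2, 2))
```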
Measurement uncertainty
Validation results should be reported with an uncertainty measure (e.g., 95% CI), which indicates the margin of doubt that exists for the obtained results [22,44]. For example, the CV as a measure for precision can be reported together with its corresponding 95% CI.
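One simple way to attach such an uncertainty measure, sketched below with hypothetical data, is a nonparametric bootstrap CI for the CV; other interval methods from the cited references may be preferred in practice.

```r
# Hypothetical replicate results
set.seed(9)
x  <- rnorm(30, mean = 50, sd = 6)
cv <- function(v) sd(v) / mean(v) * 100

# Nonparametric bootstrap 95% CI for the CV
boot_cv <- replicate(2000, cv(sample(x, replace = TRUE)))
c(CV = cv(x), quantile(boot_cv, c(0.025, 0.975)))
```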
Outlying data points
Outlying data points/outliers are extreme values in an experimental dataset [52,55]. Outliers can negatively impact results and/or the validity of fitted models by violating the normality assumption, and should therefore be identified and handled appropriately [18]. All extreme data points should be double-checked to exclude recording or operator error prior to outlier analysis. Outliers can be identified using visualisation, e.g., boxplots, or formal statistical tests such as regression models or a modified Grubbs test using the median and the MD68 statistic, which can be employed using the VCA R-package [52,56]. The proportion of outliers should not exceed 1% of the total dataset [18,52]. If outliers are identified, error estimates/analytical parameters such as precision should be calculated with and without the outliers to assess their impact on the method's performance results [18].
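The base R sketch below illustrates this workflow on invented data: extreme points are screened visually, flagged using the boxplot rule (a simple stand-in for the formal tests named above), and precision is recomputed with and without them.

```r
# Hypothetical data with one injected extreme value
set.seed(13)
x <- c(rnorm(40, 50, 5), 95)

boxplot(x)                           # visual screen for extreme points
out <- x %in% boxplot.stats(x)$out   # boxplot-rule flag

cv <- function(v) sd(v) / mean(v) * 100
cat(sprintf("CV with outliers: %.1f%% | without: %.1f%% (%.0f%% flagged)\n",
            cv(x), cv(x[!out]), mean(out) * 100))
```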