Statistical analysis should begin with careful consideration of the data for each response to be analyzed. A basic step is determining the appropriate distribution for a response. This begins by assessing whether the data are best realized as (a) a continuous response, such as egg shell thickness, that might come from a normal distribution, or (b) a proportion such as number of cracked eggs per eggs laid, which could be treated as normally distributed after a normalizing, variance stabilizing transform (usually an arc-sine square-root transform), or (c) a count response, such as total number of cracked eggs, or (d) a conditionally binomial response, such as number of cracked eggs conditioned on the number eggs laid. A list of commonly reported responses with their distributions is given in the supplementary material.

The most appropriate statistical methodology should be determined in order best to distinguish between real effects and mere artifacts of statistical probability by more properly reflecting the nature of the data and experimental design. To the extent possible, statistical analysis should be consistent with visual assessment of data. Only in limited situations, such as assessment of normality and variance homogeneity, and only then with expert judgment, a visual assessment may be sufficient without formal testing. Where visual assessment and formal tests are in conflict, the cause should be explored.

The ideal statistical methodology is a regression approach to estimate an appropriate percent effect of biological importance and its associated measure of uncertainty. This ideal is hampered by the small number of treatment groups in typical avian reproduction guideline studies.

The statistical tests listed in the case studies and in the decision flow diagrams are intended to be implemented as described in the cited references, especially Green et al (2018). Not all software packages that offer these tests implement them in equivalent fashion. For example, the R package mcp has a procedure that may appear to be Williams’ test. In fact, Williams’ test as described in Williams (1971, 1972) and Green et al (2018) and recommended here and in some OECD guidelines is quite different from the test in the mcp package. The StatCHARRMS R package provides a good, but not perfect, approximation to Williams’ test as developed by Williams. A similar precaution is needed for regression models. For example, the software ToxRat does a preliminary transformation of the data prior to fitting regression models that, if not disabled by the user, distorts the concentration-response relationship and can result in seriously misleading BMD/ECx estimates. It is not a purpose of this manuscript to critique software packages that might be used to carry out the recommended protocols but some recommendations are provided below. In addition, the website associated with Green et al (2018) does offer programming code in SAS and R to carry out the tests and regression models discussed.

Figure 7 gives a decision chart that captures the highlights of the regression modeling steps. Figure 8 provides the same for a NOEC determination. Detailed NOEC decision charts are given in the Supplement for each type of response (e.g., quantal, continuous, conditionally binomial, count).

In Figure 8 the Conover test can be configured as a non-parametric alternative to the Dunn test, but the Dunn test is recommended in numerous OECD test guidelines and guidance documents (e.g. OECD 2006, 2014) and its power properties are documented more completely (e.g., Green et al 2018 and in documents supporting OECD test guidelines).

If there are outliers, at most 6 in total should be removed and preferably 4 or fewer. Otherwise, the outlier-omitted data may no longer truly represent the data collected. All data should be re-analyzed after outliers are omitted. If the NOEC or BMDx changes, then care should be taken interpreting results.

If a transform removed non-normality or variance heterogeneity/overdispersion, then the results of the transformed data are generally preferred. A check of distribution fit for GLMM models is assessed through studentized residuals and a non-significant normality test for these residuals means the data fit the modeled distribution.

3.1. Steps in the recommended statistical protocol. Additional detail is given in the Supplement.

1) Assess the distribution

Once the conceptually appropriate distribution is determined, it is important to assess the fit of that distribution (e.g., normality), variance homogeneity or overdispersion. Dunnett and Williams tests and various regression models assume normally distributed data with homogeneous variances. Both are assessed through residuals from an ANOVA model. Normality of the residuals can be assessed using the Shapiro-Wilk or Anderson-Darling test. In the case of generalized linear mixed models (GLMM), studentized residuals are used to assess agreement of the data to the modelled distribution. Variance homogeneity for a normally distributed response can be assessed using Levene’s test. For incidence and count data, overdispersion (also called extra-binomial variance) can be assessed using Tarone’s C(α) test or a method based on GLMMs.

2) Determine the presence, meaning, and impact of outlier

Careful consideration of outliers is advised since outliers can sometimes show that a statistically significant effect is the result of a small number of observations or the lack of statistical significance may be the result of high variability caused by one or more outliers. It should be emphasized that outliers are statistically detected unusual observations, not “bad” observations to be discarded. The primary purpose of outlier detection is to determine to what extend a small number of unusual observations influences the statistical tests and models. These observations may also be important indicators that merit further investigation. The Tukey outlier rule is recommended for continuous responses. But formal outlier rules need to be supplemented by consideration of other data anomalies. For example, 0 fertile eggs out of 1 egg laid is very different from 0 fertile eggs out of 36 eggs laid. A weighted analysis or treatment of fertile eggs as binomially distributed conditioned on the number of eggs laid is a potential way of dealing with some outlier issues. Decision trees for NOEC determination given in the Supplementary material indicate when consideration of outliers is applied. It should be noted that if the NOEC changes after outliers are removed or a normalizing, variance stabilizing transform is found, then scientific judgment is needed to resolve the difference.

3) Assess concentration-response monotonicity

Monotonicity in the dose-response should be assessed to determines whether a trend test (e.g., Williams, Jonckheere-Terpstra, Cochran-Armitage) should be used. Use of a trend test where it is not justified can obscure a real effect or indicate an effect that is not justified. Failure to use a trend test where it is justified ignores relevant biology and can miss an important effect or lead to confusion when a low dose is found statistically significant but higher doses are not.

In general, if a chemical affects a biological response, the effect increases with increasing concentrations of the chemical. That is, one expects a monotonic concentration-response. This is not a strict requirement, but serious deviations from monotonicity rule out the use of trend tests and should prompt careful exploration of the data. Much additional discussion of trend tests and ways to assess monotonicity are given in Green et al (2018) and Springer and du Hoffmann (2018). For normally distributed data with homogeneous variances, Williams’ test is recommended, but with cautions. This test uses a pool-the-adjacent-violators (PAVA) algorithm to smooth the data by forcing monotonicity. If the data deviates greatly from monotonicity, there can be too much smoothing which distorts the interpretation of the data. Green et al (2018) contains further discussion of this, as does OECD TG 248 (OECD 2019). As a rough guide, if three or more mean responses from positive test concentrations are merged by the PAVA algorithm, then the data may not be suitable for Williams’ test. A test for monotonicity is given in the Supplementary material.

For continuous response data that do not meet the requirements of normality and variance homogeneity, the Jonckheere-Terpstra test is a non-parametric trend test that has similar power as Williams’ test to detect effects. Like Williams’ test this is a step-down trend test but unlike Williams, it does not use a smoothing algorithm and so does not have the same tendency as Williams’ test to mask departures from monotonicity. For incidence data, the Cochran-Armitage test is very useful step-down trend test. Where overdispersion is found, a robust version of that test using the Rao-Scott adjustments can be used. All these tests are discussed in detail in Green et al (2018), where additional references are also given. Of those several deserve additional mention, including OECD (2006, 2014).

4) Use historical control data if available

Valverde *et al* (2018) investigated the utility of historical control data for interpreting avian reproduction studies, including power analyses to document the size effect that could be expected to be found statistically significant. The work reported here continues and, to some degree, extends that work. If historical control data are available, such data could provide information on which observations indicate real effects, which observations are well within the historical control range, and can alert the investigator to the presence of an unusual control that may skew statistical analysis. By examining the study data in the context of historical control data, some responses may be found not to require further statistical analysis. Once statistical analysis is done to determine a NOEC or estimate an ECx value, the study data again can be compared to relevant historical control data to help interpretation for hazard identification and risk assessment.

The most appropriate historical control data is from the same laboratory that does the concurrent study and uses data within a time interval centered on the date of the concurrent study. Historical control data from other laboratories can be used if appropriate inter-laboratory comparisons have been done. A span of 2 - 5 years on each side of the date of the concurrent study is recommended. However, European Commission (2013) recommended a 5-year span centered on the starting date of the study. The span will depend in part on the number of studies in the database. It would be best to have 20 or more studies from the HCD, approximately equally split on both sides of the concurrent study date where possible. Once the span of time to include in the HCD is determined, extreme observation should be discarded to avoid skewing the interpretation. It is suggested that a concurrent treatment mean response between the 5th and 95th percentiles of the HCD is not indicative of a real effect. These percentiles are dependent on the number of studies in the HCD and a reality check would include assessing the data using several time spans, such as ± 2, ± 3, and ± 5 years in the HCD to make sure these percentiles are not overly influenced by the size or the time span of the HCD. Note also that 5% of 20 is 1, so the 5% and 95% bounds on a smaller HCD are of questionable relevance.

5) Transform responses to meet test requirements or use generalized (non-)linear mixed models

Transformation of responses must be order-preserving. For example, the Freeman-Tukey transform of proportion data need not be order preserving and its use can distort or even reverse some concentration-response relationships and produce misleading results. If regression models are used to estimate ECx, the meaning of an x% change in the transformed response is unlikely to be equivalent to an x% change in the original, untransformed response.

For proportion responses such as viable eggs per eggs set, the traditional way to analyze is to treat these responses as continuous responses, often with a normalizing, variance stabilizing transformation such as the arc-sine square-root transform. That remains a viable method, but another method can be more informative and is more consistent with the nature of the data. This is the use of a generalized linear mixed model that treats the numerator, viable eggs in the illustration, as binomially distributed conditioned on the denominator, eggs set in the illustration. Count data, such as eggs laid, can likewise be analyzed by treating the data as continuous, usually following a square-root transform, or using a GLMM with a Poisson distribution. Where overdispersion is found, an adjustment is recommended, such as using a negative binomial distribution or allowing variance to vary by treatment group. See Green et al (2018) for additional details and references on all the statistical recommendations.

6) Use regression or BMD methodology where supported by data

Where more treatment groups are available in a study and regression modeling is feasible, model selection criteria are important. Criteria are described in the Supplement. Simulation studies reported by Burnham and Anderson (1998) among others, demonstrate that if the same model selection procedure is followed in repeat studies using the same test concentrations and study design, then different models from the set of models used will be selected in different studies. To compensate for this model uncertainty, a model averaging technique can be implemented. There are two main ways to approach model averaging. Benchmark dose (BMD) methodology outlined in (EFSA 2009a) indicated that the lowest point and interval estimates be used from all models in a set of standard models. This recommendation was updated in EFSA (2017) to use a combination of bootstrap sampling and weighted averages. The most common weighting scheme is based on a single information criterion, such as AIC or BIC. Details are given in the Supplement. With either approach, care must be taken to identify the set of models to use, as clearly both model average and model selection are highly dependent on the models utilized. In addition, one should not rely solely on an automated procedure, such as Akaiki weight, that down weights contributions from poorly fitting models or focuses on only one selection criterion. It is also important to understand the limitations of regression modelling. Once a model is fit to a dataset, it is mathematically possible to estimate ECx for any positive value of x up to 100 for a decreasing model. Not all such estimates are statistically reliable. The dangers of extrapolation much beyond the range of tested positive concentrations are well understood. Also, an estimate of ECx for x < 10 is often beyond the capability of the data. For example, a 1% or 5% change in adult body weight or proportion of eggs laid that hatch or survive 14 days is rarely possible. The impracticality of such estimates is often indicated by a wide confidence interval or a confidence interval extending below 0. More detail on this in given in the Supplement under the heading of model fitting criteria.

7) Assess the need for special regression models

When there is a flat response in the treatment groups but all such groups differ significantly from the control, a “hockey-stick” model may be helpful in describing the data and providing ECx estimates where more standard decreasing models fail. If hormesis is evident a hormetic model, such as Brain-Cousens, should be considered. Such models usually require more test concentrations than commonly found in avian reproduction studies.

8) Consider an alternative to NOEC and BMD

The small number of treatment groups can sometimes be overcome by a statistical methodology designed to test for a specified level of effect of biological or regulatory importance. For example, a maximum “safe dose” or MAXSD can be identified at and below which the effect of the test substance is significantly less than 10%. This method can also be applied when there are more treatment groups but no acceptable regression model can be found.

## 3.2. Software

While it is not the intent to give a survey of software available to carry out the recommended statistical tests to determine a NOEC or models to estimate ECx or BMDx, it still seems appropriate to provide brief descriptions of some software packages useful for the two approaches. For regression model fitting, including model averaging, there are at least three good choices. These are the R package drc (Ritz *et al* 2005, 2006); Proast (Slob 2018, 2019) which was developed specifically for regulatory risk assessment under the auspices of RIVM; BBMD (Shao and Shapiro 2018; Shao 2021) which provides a Bayesian implementation. Also notable is the BMD software developed by the United States EPA (BMDS Benchmark Dose Tools | US EPA) which is an Excel-based application. The current version (3.2) provides model averaging only for dichotomous responses, which limits its utility for avian reproduction studies. The first two cited packages use the Akaiki information criteria to obtain weights for model averaging. The third and fourth cited packages use weights based on prior distributions but otherwise follow the same idea of estimating both BMDx and BMDxLB on these weights. One should be aware that Bayesian model averaging can produce notably different results from the information criteria approach and the list of models used in averaging can also have a strong impact on results. The criteria (e.g., all convergent models from a fixed list or only those meeting some additional criteria) used to decide which model fits to include can also impact results.

For NOEC determination, Cetis (Ives 2021), which was developed for the United States EPA, implements all the standard statistical tests recommended, but not the GLMM tests. The R package PMCMRPlus (2021) provides all tests described for continuous responses, including non-parametric rank-based tests, but it does not include GLMM tests or tests for quantal data. SAS software has very useful procedures for GLMM models but these require programming. There are numerous R packages for GLMM but results from different packages compared to each other or to SAS will often not agree. A good resource for relevant GLMM models in R is Hothorn (2016).

## 3.3. Biological relevance

Real improvement in hazard identification and risk assessment requires scientifically based criteria for what constitutes a hazard. According to EFSA 2009b, in determining a NOAEL there may not be a consideration of the effect or its biological relevance. Therefore, it is proposed to use endpoints that are based on a consideration of the biological and/or ecological relevance. Consequently, the biological relevance should be always considered for the final toxicological endpoint selection as a higher tier refinement option.

For example, Case study 4 illustrated the importance of having an agreement on the size effect on eggshell thickness in evaluating a statistical finding as despite the statistical significance of the decreases in all treatment groups, it should be noted that the maximum observed decrease was only 6%. This endpoint is a rare instance when scientific evidence is available for this purpose. According to EFSA 2009b population effects in the wild tend to come about after thinning of 18% or more. Eggshell breakage increases when eggshells become more than 22% thinner than unaffected eggshells (Lincer 1975). Overall, the maximum observed decrease of 6% should not be considered biologically important and thus the final NOEL should be set to the maximum concentration tested (Table 7, Figure 4).

The regulatory process would be much enhanced by the establishment of such information on other key responses. As it is, an arbitrary rule is adopted, such as a 10% change, or a statistically significant change is used regardless of biological importance.