Tables 1, 1A, and 1B summarize the characteristics of the alternative space used in this Monte Carlo experiment to assess the Type-I error rates of the F-test. The space comprises 400 variants of the Weibull distribution, covering all kinds of distributions used in the previous literature in terms of shape, scale, skewness, and kurtosis. Tables 2-5 present descriptive statistics of the Type-I error rates across conditions with equal and unequal sample sizes for three and five groups. These tables report the average sample size (N/J), the sample size variation among the groups ( ), the minimum, maximum, and median Type-I error rates, and the proportion of size distortions according to the liberal and conservative criteria of robustness proposed by Bradley (1978) and the criterion of robustness proposed by Diebold & Chen (1996).
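To make the robustness criteria concrete, the following minimal sketch shows how the proportion of size distortions can be computed from a list of empirical Type-I error rates. It assumes a nominal level of 0.05 and the usual statement of Bradley's (1978) bands (0.5α to 1.5α for the liberal criterion, 0.9α to 1.1α for the conservative one). The function names and the example rates are hypothetical and are not taken from Tables 2-5.

```python
def bradley_bounds(alpha=0.05, criterion="liberal"):
    """Return (lower, upper) limits for the empirical Type-I error rate."""
    # Liberal band: 0.5*alpha to 1.5*alpha; conservative band: 0.9*alpha to 1.1*alpha.
    half_width = {"liberal": 0.5, "conservative": 0.1}[criterion]
    return (1 - half_width) * alpha, (1 + half_width) * alpha


def proportion_outside(rates, alpha=0.05, criterion="liberal"):
    """Share of empirical rates that fall outside the chosen Bradley band."""
    lower, upper = bradley_bounds(alpha, criterion)
    violations = [r for r in rates if r < lower or r > upper]
    return len(violations) / len(rates)


# Hypothetical empirical rates, for illustration only.
example_rates = [0.048, 0.052, 0.061, 0.038, 0.081]
print(proportion_outside(example_rates, criterion="liberal"))       # 0.2
print(proportion_outside(example_rates, criterion="conservative"))  # 0.6
```

With these illustrative rates, only 0.081 lies outside the liberal band, whereas 0.061, 0.038, and 0.081 all lie outside the conservative band.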
On balance, the Type-I error rates of the F-test fall within the bounds of Bradley’s liberal criterion (Clinch & Kesselman, 1982; Zijlstra, 2004; Schmider et al., 2010; Patrick, 2009; Black et al., 2010; Blanca et al., 2017), except for a few violations with 3 groups. This holds regardless of the degree of deviation from normality, the sample size, whether the groups share the same skewness and kurtosis, and whether the variances are equal. Under non-normality and heterogeneity, the violations occur for sets of distributions with a variance ratio (VR) in double figures, with the exception of one set with VR = 2.78. Our findings support the rule of thumb proposed by Blanca et al. (2018) that a variance ratio greater than 1.5 may be considered a potential threat to the robustness of the F-test with unequal sample sizes and 3 groups. Furthermore, we generalize this rule of thumb to both equal and unequal sample sizes with 3 and 5 groups.
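As an illustration of the rule of thumb, the sketch below computes the population variance ratio for a set of Weibull groups and compares it with the 1.5 threshold. It assumes that VR is the largest group variance divided by the smallest and uses the closed-form Weibull variance, scale² · [Γ(1 + 2/shape) − Γ(1 + 1/shape)²]. The parameter values are hypothetical and do not correspond to any specific row of Table 1.

```python
import math


def weibull_variance(shape, scale):
    """Population variance of a Weibull(shape, scale) distribution."""
    g1 = math.gamma(1 + 1 / shape)
    g2 = math.gamma(1 + 2 / shape)
    return scale ** 2 * (g2 - g1 ** 2)


def variance_ratio(params):
    """Largest over smallest population variance (assumed definition of VR)."""
    variances = [weibull_variance(shape, scale) for shape, scale in params]
    return max(variances) / min(variances)


# Hypothetical (shape, scale) pairs for three groups; shape = 1 is the exponential case.
groups = [(1.0, 1.0), (1.0, 1.2), (1.0, 1.5)]
vr = variance_ratio(groups)
print(round(vr, 2), vr > 1.5)  # 2.25 True
```

Here the variance ratio of 2.25 exceeds the 1.5 threshold, so this hypothetical configuration would be flagged as a potential threat to robustness.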
With 5 groups, the F-test is robust according to Bradley’s liberal criterion for both equal and unequal sample sizes, regardless of the degree of deviation in any of the parameters of concern. Table 5 (column 5) reports the proportions of Type-I error rates of the F-test that fall outside the range of Bradley’s liberal criterion for equal sample sizes. All of these violations occur for sets of distributions whose variance ratio exceeds the rule of thumb (VR > 1.5); otherwise, the F-test is robust.
However, the F-test is not robust according to Bradley’s conservative criterion or Diebold & Chen’s criterion when non-normality and heterogeneity are present simultaneously. With 3 groups, the proportion of Type-I error rates falling outside the limits of Bradley’s conservative criterion varies from 4 to 82 percent for unequal sample sizes and from 22 to 32 percent for equal sample sizes (Tables 2-3). With 5 groups, these proportions vary from 12.0 to 94.0 percent for unequal sample sizes and from 38.0 to 40.0 percent for equal sample sizes (Tables 4-5). Nevertheless, under the rule of thumb for the variance ratio, the F-test is robust for equal sample sizes with 3 and 5 groups, because all the violations occur in scenarios where VR is greater than 1.5.
For unequal sample sizes with 3 and 5 groups, the F-test is not robust according to Bradley’s conservative criterion. The F-test becomes conservative, meaning that the Type-I error rates fall below the lower limit of Bradley’s conservative criterion. In such scenarios, the variance ratio varies from 1.14 to 1.41 for 3 groups and from 1.17 to 1.49 for 5 groups. These findings, based on Bradley’s conservative criterion, suggest 1.14 as a new variance ratio threshold for unequal sample sizes. For unequal sample sizes, according to the robustness criteria proposed by Bradley (conservative) and Diebold & Chen, the proportion of Type-I error rate violations increases as the inequality (variation) in sample sizes increases (Figs. 1 & 2).
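The kind of simulation underlying these rates can be sketched as follows for the 3-group case. This is a minimal illustration and not the exact design used in this study. It assumes that each Weibull sample is shifted by its population mean so that the null hypothesis of equal means holds, and it uses the closed-form survival function of the F(2, ν) distribution, which is valid only for three groups. The function names, the seed, the number of replications, and the parameter values are all hypothetical.

```python
import math
import random


def f_statistic(groups):
    """One-way ANOVA F statistic and its within-group degrees of freedom."""
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_within


def p_value_three_groups(f_value, df_within):
    """Exact P(F > f_value) for F(2, df_within); valid only for 3 groups."""
    return (1 + 2 * f_value / df_within) ** (-df_within / 2)


def centered_weibull(rng, shape, scale, n):
    """Weibull sample shifted by its population mean (assumption: H0 of equal means)."""
    mean = scale * math.gamma(1 + 1 / shape)
    # random.weibullvariate takes (scale, shape) in that order.
    return [rng.weibullvariate(scale, shape) - mean for _ in range(n)]


def type1_error_rate(params, sizes, reps=2000, alpha=0.05, seed=1):
    """Empirical rejection rate of the F-test under the null, 3 groups only."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        groups = [centered_weibull(rng, k, lam, n) for (k, lam), n in zip(params, sizes)]
        f_value, df_within = f_statistic(groups)
        if p_value_three_groups(f_value, df_within) < alpha:
            rejections += 1
    return rejections / reps


# Hypothetical unequal-n, unequal-variance configuration.
rate = type1_error_rate([(1.5, 1.0), (1.5, 1.1), (1.5, 1.2)], sizes=(10, 20, 30))
print(rate)
```

The resulting rate can then be compared with the Bradley or Diebold & Chen limits. For more than three groups, the p-value would need a general F distribution function, for example from a statistics library.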
Diebold & Chen (1996) proposed using the asymptotic standard error (SE) of the nominal size of a test to judge size distortions. The proportions of Type-I error rates of the F-test falling outside the ±3SE bands are reported in Tables 2-3 for 3 groups and in Tables 4-5 for 5 groups. At first sight, these results do not support the robustness of the F-test. Further investigation reveals that, under the rule of thumb (VR < 1.5), the F-test is liberal for equal sample sizes and conservative for unequal sample sizes with 3 groups.
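The ±3SE band can be sketched as below, assuming the asymptotic standard error of the nominal size takes the binomial form sqrt(α(1 − α)/R), where R is the number of Monte Carlo replications. The value R = 10,000 is a placeholder chosen for illustration and is not the replication count of this study; the function names are likewise hypothetical.

```python
import math


def three_se_band(alpha=0.05, replications=10_000, width=3.0):
    """Lower and upper limits of the alpha +/- width*SE band."""
    se = math.sqrt(alpha * (1 - alpha) / replications)
    return alpha - width * se, alpha + width * se


def classify(rate, alpha=0.05, replications=10_000):
    """Label an empirical Type-I error rate relative to the +/-3SE band."""
    lower, upper = three_se_band(alpha, replications)
    if rate > upper:
        return "liberal"
    if rate < lower:
        return "conservative"
    return "robust"


lower, upper = three_se_band()
print(round(lower, 4), round(upper, 4))  # 0.0435 0.0565
print(classify(0.060), classify(0.040), classify(0.050))  # liberal conservative robust
```

Because the band narrows as the number of replications grows, this criterion is considerably stricter than Bradley’s liberal band for any large simulation.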
For the 5-group setup, under the rule of thumb for the variance ratio, the F-test is liberal (never conservative) for equal sample sizes and conservative (never liberal) for unequal sample sizes. On balance, if we set the rule of thumb aside, the F-test is liberal for equal sample sizes and non-robust for unequal sample sizes with both three and five groups. According to the criterion proposed by Diebold & Chen (1996), the robustness of the F-test requires the variance ratio to be very close to one, implying that equality of variances should hold.
The findings of this study highlight that the robustness of the F-test in terms of Type-I error rates depends on the distributional assumptions, the evaluation criterion, and the variance ratio threshold. According to Bradley’s liberal criterion, under the 1.5 threshold for the variance ratio, the F-test is robust for both equal and unequal sample sizes in the three- and five-group setups. Ignoring the variance ratio threshold leads to the conclusion that the F-test is not robust. Similarly, under the variance ratio threshold, the F-test is robust for equal sample sizes and non-robust for unequal sample sizes according to Bradley’s conservative criterion. Finally, according to the criterion proposed by Diebold & Chen (1996), the robustness of the F-test requires the equality of variances assumption, which is in line with the findings of Harwell et al. (1992) and Delacre et al. (2019). The variance ratio threshold proposed by Blanca et al. (2018) is sensible, because a high variance ratio indicates extreme contamination in the data distribution, which is rare in real data sets (Blanca, Arnau, López-Montiel, Bono, & Bendayan, 2013). Further investigation of the Type-I error rates of the F-test against such rare distributions (with outliers, different shapes, and different variances), including the Beta, Gamma, Lognormal, and Weibull distributions, reaffirms our finding that the F-test is robust only if the variance ratios of the samples under consideration are below the threshold (see Appendix, Table A). The results in Table A show that, beyond normality and equality of variances, the robustness of the F-test depends on (i) the variance ratio threshold, (ii) the sample size, and (iii) the equality of the sample sizes.
These findings are useful for researchers in the social sciences and medicine, because the F-test, in terms of Type-I error rates, is shown to be a robust statistical method under non-normality and heterogeneity according to Bradley’s liberal criterion. The variance ratio threshold of 1.5 plays an important role in validating the use of the F-test. However, Bradley’s conservative criterion and the criterion of Diebold & Chen do not validate the use of the F-test when the distributional assumptions of normality and homogeneity are violated. We therefore encourage researchers to examine the distribution underlying the data at hand in terms of normality, variance ratio (if the variances are not equal), and equality of sample sizes.