Validity of ANOVA under Non-normality & Heterogeneity

doi:10.21203/rs.3.rs-2071136/v1


Research Article


https://doi.org/10.21203/rs.3.rs-2071136/v1

This work is licensed under a CC BY 4.0 License

Version 1


Background

The performance of the F-test depends critically on the distributional assumptions and on the choice of robustness measure. The robustness of the F-test to non-normality has been studied and validated in the literature; the assumption of homogeneity, however, has not received due attention. As regards robustness measures, the F-test has not been evaluated against liberal and conservative criteria simultaneously. This study provides a systematic examination of the robustness of the F-test to violations of normality and homogeneity in terms of Type I error, considering a wide variety of distributions commonly found in psychology and the social and medical sciences.

Method

This study uses Monte Carlo simulations to compute the Type-I error rates of the F-test under non-normality and heterogeneity. To ensure reliable results, 100,000 samples were generated under 1900 scenarios. The manipulated parameters include the shape and scale parameters of the distributions, the number of groups, equal and unequal sample sizes, total and average sample size, inequality in the sample sizes, and the variance ratio.

Results

The findings show that the robustness of the F-test in terms of Type-I error rates depends on the choice of robustness measure, the variance ratio, the sample size, and the equality of the sample sizes. Below a threshold value of the variance ratio, the F-test is robust when evaluated against the liberal criterion and non-robust when evaluated against the conservative criteria.

MSC Classification: 91-10, 62J10, 65C20

F-test

ANOVA

Homogeneity

Normality

Variance ratio

In psychology and the medical sciences, one-way analysis of variance (ANOVA) is one of the most widely used statistical techniques. Asymptotically, the parametric test statistic of ANOVA follows the F-distribution under the assumptions that the outcome variable is normally and independently distributed with equal variances among the groups. However, these conditions of normality and homogeneity are hard to meet with real data. Blanca, Arnau, López-Montiel, Bono, and Bendayan (2013) explore 693 real data distributions with sample sizes ranging from 10 to 30 and conclude that only 5.5% of the distributions are close to normality when skewness and kurtosis are considered together, although they find extreme contamination to be rare in real data sets. Kobayashi (2005) examines normality and homogeneity in chronic toxicity studies with rats and observes that many hematology and biochemistry items are non-normal. Blanca, Alarcón, Arnau, Bono, and Bendayan (2017) explore the impact of known and unknown non-normal distributions on the one-way ANOVA test for both equal and unequal sample sizes, and their results show the F-test to be robust in terms of Type-I error rates in 100 percent of cases.

A plethora of studies since the 1930s have explored the robustness of the F-statistic under non-normality. The importance of the normality assumption for the F-test has generated many simulation studies; the assumption of homogeneity, however, has not received due attention in the literature. Black, Ard, Smith, and Schibik (2010) studied the effects of non-normality on ANOVA by considering twelve Weibull distributions in the non-normal space. The shape parameter was found to have a significant impact on the size and power of ANOVA; on balance, however, ANOVA was found to be robust against most of the alternative distributions. Schmider, Ziegler, Danay, Beyer, and Bühner (2010) employed Monte Carlo procedures to reinvestigate the robustness of analysis of variance against non-normal distributions. Their study confirmed the robustness of the one-way fixed-effect design against different types of non-normal alternatives using 3 groups of 25 values each. Earlier literature also supports the robustness of analysis of variance against non-normality (Pearson, 1931; Feir-Walsh & Toothaker, 1974; Clinch & Keselman, 1982; Lix, Keselman, & Keselman, 1996; Patrick, 2009; Lantz, 2013).

Gamage & Weerahandi (1998) study the size performance of different tests in one-way ANOVA at the 0.05 level of significance under heterogeneity. With mild heteroscedasticity, the Welch test and the classical ANOVA F-test retain their size properties, with the Welch test having a slight advantage. Lee & Ahn (2003) establish that the F-test of ANOVA remains applicable under minor deviations from the assumption of equal variances when there is a positive correlation between variances and sample sizes. However, the consequences of violating equality of variances are serious if there is a negative correlation between variances and sample sizes (Weerahandi, 1995). The measure used to quantify heterogeneity and the criterion used to assess robustness may both matter when deciding about the robustness of the F-test (Blanca et al., 2017). The literature (Zijlstra, 2004; Moder, 2010; Blanca et al., 2017) suggests the variance ratio as a measure of heterogeneity. Diebold & Chen (1996) compute the asymptotic standard error for the nominal size of a test, \(\sqrt{\widehat{\alpha }\left(1-\widehat{\alpha }\right)/s}\), where \(\widehat{\alpha }\) is the estimated size of the test and s is the number of simulations. However, the confidence limits constructed from this standard error provide a conservative measure of robustness. Bradley (1978) suggests a more liberal definition of robustness, under which the empirical size of the test should not exceed 1.5 times the nominal rate.
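The two families of robustness criteria discussed above can be written as simple interval checks. The following is a minimal sketch, not the authors' implementation. It assumes the bounds commonly attributed to Bradley (1978), namely 0.5 to 1.5 times the nominal rate for the liberal criterion and 0.9 to 1.1 times the nominal rate for the conservative one, and it assumes the usual binomial form of the asymptotic standard error, sqrt(p(1 - p)/s), evaluated at the nominal rate. The function names and the default multiplier k = 3 are illustrative choices.

```python
import math


def bradley_interval(alpha=0.05, liberal=True):
    """Interval for the empirical Type I error rate.

    Assumed bounds: liberal 0.5*alpha to 1.5*alpha,
    conservative 0.9*alpha to 1.1*alpha.
    """
    half_width = 0.5 if liberal else 0.1
    return ((1 - half_width) * alpha, (1 + half_width) * alpha)


def asymptotic_se(p, s):
    """Binomial standard error of a rejection rate p based on s simulations."""
    return math.sqrt(p * (1 - p) / s)


def se_band(alpha=0.05, s=100_000, k=3.0):
    """Band alpha +/- k standard errors around the nominal size (assumed form)."""
    se = asymptotic_se(alpha, s)
    return (alpha - k * se, alpha + k * se)


def is_robust(rate, interval):
    """True when the empirical rate lies inside the given interval."""
    lower, upper = interval
    return lower <= rate <= upper
```

With 100,000 simulations the standard-error band is far narrower than either Bradley interval, which is why an empirical rate such as 0.06 can pass the liberal criterion and still fall outside the band.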

To sum up, the performance of the F-test depends critically on the distributional assumptions and on the choice of robustness measure. Among the distributional assumptions, the robustness of the F-test against non-normality is validated in the literature; the assumption of equal variances, however, has not been given due consideration. As regards robustness measures, the F-test has not been evaluated against liberal and conservative criteria simultaneously.

This study is motivated by several considerations. First, the robustness of the F-test is validated in the aforementioned literature provided that departures from normality are not moderate or severe, that the distributions have the same shape, and that sample sizes are large and equal. Terms such as large sample size and moderate or severe departure are open to different interpretations, which is the main difficulty for applied researchers. Second, most of the literature focuses on non-normality and gives little attention to the assumption of homogeneity. Under heterogeneity of variances, the F-test for ANOVA can yield biased results and lead to invalid statistical analysis (Delacre et al., 2019). Third, the F-test for analysis of variance is sensitive to the number of groups. With more than two groups, the F-test becomes more liberal, i.e., Type-I error rates exceed the nominal level of significance (Harwell et al., 1992). Therefore, one-way designs with three and five groups are considered with varying sample sizes to explore the performance of the F-test in terms of Type-I errors. Fourth, both the conservative and the liberal measures of robustness proposed by Diebold & Chen (1996) and Bradley (1978) are used in this study.

This study aims to systematically explore the impact of both non-normality and heterogeneity on Type-I error rates of ANOVA using both conservative and liberal measures of robustness. The association of the robustness with sample size, equality of the sample sizes, number of groups, type of non-normality and heterogeneity of groups will be explored in this study.

The rest of the paper proceeds as follows: Section 2 introduces the methodology and the data distributions used in the Monte Carlo simulations, Section 3 discusses the findings, and Section 4 concludes the paper.

This study uses Monte Carlo procedures to examine the effect of a wide variety of distributions found in psychology and the social and medical sciences on the robustness of the F-test for analysis of variance (ANOVA). To choose the non-normal distributional space, we first calculated the type-I error rates of the F-test against distributions with (i) high skewness, (ii) high kurtosis, and (iii) outliers. The high type-I error rates against these distributions (Table 1A & Fig. 1B) do not warrant the use of the F-test when there are severe distortions of normality. Given the inapplicability of the F-test to distributions with significant departures from normality, we focused on the non-normal space with slight or moderate deviations from normality. This purpose is best served by the Weibull distributions (Black et al., 2010). Distributions (D_i) with different degrees of contamination are simulated by varying the values of skewness and kurtosis (Blanca et al., 2017). A wide range of sample sizes, both equal and unequal, with unequal variances is considered. Both shape and scale parameters are used to vary the characteristics of the data distributions for the different groups (Table 1A & 1B).

Monte Carlo simulations are performed in MATLAB R2021b to compute the Type-I error rates of the F-test under non-normality and heterogeneity. To ensure reliable results, 100,000 samples were generated under 1900 scenarios. One-way designs with three and five groups (J) are considered. Because unbalanced designs are also used in studies involving one-way analysis of variance, both equal and unequal sample sizes are used to extend our results to more realistic situations. The average group sample size (N/J) varies from 10 to 100 in steps of 5, with the total sample size (N) ranging from 30 to 500. The inequality in the sample sizes of the groups is measured by the coefficient of sample size variation, \(\left(\varDelta n\right)\), the ratio of the standard deviation of the group sample sizes to their mean. Following Blanca et al. (2017), low, medium, and high values of \(\varDelta n\) are set at 0.16, 0.33, and 0.50 respectively. Fifty distributions are investigated with varying degrees of departure from normality and homogeneity, with the variance ratio (VR), the ratio of the largest to the smallest group variance, ranging from 1.06 to 16.21. Type-I error rates of the F-test are analyzed for each condition according to the criteria proposed by Diebold & Chen (1996) and Bradley (1978).
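The two design quantities defined above, the coefficient of sample size variation and the variance ratio, are simple to compute. The sketch below is illustrative only. The text does not state whether the standard deviation of the group sizes uses the n or the n - 1 denominator, so the `population` flag is an assumption left to the reader, and both function names are hypothetical.

```python
import statistics


def sample_size_variation(group_sizes, population=False):
    """Coefficient of sample size variation: SD of group sizes over their mean.

    The choice between the population (n) and sample (n - 1) denominator
    is an assumption; the source text does not specify it.
    """
    mean_n = statistics.fmean(group_sizes)
    if population:
        sd_n = statistics.pstdev(group_sizes)
    else:
        sd_n = statistics.stdev(group_sizes)
    return sd_n / mean_n


def variance_ratio(groups):
    """Largest group variance divided by the smallest group variance."""
    variances = [statistics.variance(g) for g in groups]
    return max(variances) / min(variances)
```

For example, group sizes of 10, 20, and 30 give a coefficient of 0.5 with the sample standard deviation, and equal group sizes give 0.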

SIMULATION DESIGN

The following steps are used to generate the data, to compute the critical values, and to compute the type-I error rates of the F-test.

1. Generate 100,000 samples of normal and homoscedastic data under the null hypothesis of equal means for all groups.
2. Apply the F-test to these samples and record the values of the test statistic.
3. Arrange the values in ascending order; the \(100(1-\alpha )\)th percentile is the critical value at the \(\alpha\) level of significance.
4. Draw 100,000 samples of non-normal and heteroscedastic data under the null hypothesis of equal means for all groups.
5. Apply the F-test to these samples and use the critical value calculated in step 3 to decide whether to reject the null hypothesis. The proportion of rejections of the null hypothesis is the type-I error rate.
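The simulation steps above can be sketched in a few lines. This is a scaled-down Python illustration, not the authors' MATLAB code. It assumes that the null hypothesis of equal means is imposed by subtracting each Weibull group's theoretical mean, that the critical value is the empirical upper percentile of the simulated null F values, and that a few thousand replications are enough for a demonstration (the study uses 100,000). All function names and defaults are hypothetical.

```python
import math
import random


def f_statistic(groups):
    """Classical one-way ANOVA F statistic for a list of samples."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


def centered_weibull(rng, n, shape, scale):
    """Weibull draws shifted to mean zero so that all group means are equal."""
    mean = scale * math.gamma(1 + 1 / shape)
    # random.weibullvariate takes (scale, shape) in that order.
    return [rng.weibullvariate(scale, shape) - mean for _ in range(n)]


def simulate_type1(sizes, shapes, scales, alpha=0.05, reps=2000, seed=1):
    """Estimate the Type I error rate of the F-test for one scenario."""
    rng = random.Random(seed)
    # Steps 1-3: simulated critical value from normal, homoscedastic data.
    null_f = sorted(
        f_statistic([[rng.gauss(0, 1) for _ in range(n)] for n in sizes])
        for _ in range(reps)
    )
    index = min(reps - 1, math.ceil((1 - alpha) * reps) - 1)
    critical = null_f[index]
    # Steps 4-5: rejection rate under non-normal, heteroscedastic data.
    rejections = 0
    for _ in range(reps):
        groups = [
            centered_weibull(rng, n, c, b)
            for n, c, b in zip(sizes, shapes, scales)
        ]
        if f_statistic(groups) > critical:
            rejections += 1
    return rejections / reps
```

A call such as `simulate_type1([20, 20, 20], [1.5, 1.5, 1.5], [1.0, 1.0, 2.0])` would correspond to three equal-sized groups with a common shape and a variance ratio of 4, since the Weibull variance is proportional to the square of the scale parameter.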

Tables 1, 1A, and 1B summarize the characteristics of the alternative space used in this Monte Carlo experiment to assess the type-I error rates of the F-test, with 400 variants of the Weibull distribution covering all kinds of distributions used in the previous literature in terms of shape, scale, skewness, and kurtosis. Tables 2-5 present descriptive statistics of the Type-I error rates across conditions with equal and unequal sample sizes for three and five groups. These tables report the average sample size (N/J), the sample size variation among the groups (\(\varDelta n\)), the minimum, maximum, and median type-I error rates, and the proportion of size distortions under the liberal and conservative criteria of Bradley (1978) and the criterion of Diebold & Chen (1996).

On balance, the type-I error rates of the F-test are within the bounds of Bradley’s liberal criterion (Clinch & Keselman, 1982; Zijlstra, 2004; Patrick, 2009; Black et al., 2010; Schmider et al., 2010; Blanca et al., 2017), except for a few violations with 3 groups, regardless of the degree of deviation from a normal distribution, the sample size, equal or unequal distributions in terms of skewness and kurtosis, and equal or unequal variances. Under non-normality and heterogeneity, the violations occur for sets of distributions with a variance ratio (VR) in double figures, except for one set of distributions with VR=2.78. Our findings support the rule of thumb proposed by Blanca et al. (2018) that a variance ratio greater than 1.5 may be considered a potential threat to the robustness of the F-test with unequal sample sizes and 3 groups. Furthermore, we generalize this rule of thumb to both equal and unequal sample sizes with 3 and 5 groups.

With 5 groups, the F-test is robust according to Bradley’s liberal criterion for both equal and unequal sample sizes, regardless of the degree of deviation in any of the parameters of concern. Table 5 (column 5) reports the proportions of Type-I error rates of the F-test falling outside the range of Bradley’s liberal criterion for equal sample sizes. These violations occur only for sets of distributions whose variance ratio exceeds the rule of thumb, i.e., VR>1.5; otherwise the F-test is robust.

However, the F-test is not robust according to Bradley’s conservative criterion or Diebold & Chen’s criterion when non-normality and heterogeneity are considered simultaneously. The proportion of Type-I error rates falling outside the limits of Bradley’s conservative criterion varies from 4 to 82 percent for unequal and from 22 to 32 percent for equal sample sizes with 3 groups (Tables 2-3). With 5 groups, these proportions vary from 12.0 to 94.0 percent for unequal and from 38.0 to 40.0 percent for equal sample sizes (Tables 4-5). Nevertheless, under the rule of thumb for the variance ratio, the F-test is robust for equal sample sizes with 3 and 5 groups, as all the violations occur in scenarios where VR is greater than 1.5.

For unequal sample sizes with 3 and 5 groups, the F-test is not robust according to Bradley’s conservative criterion. The F-test becomes conservative, meaning that Type-I error rates fall below the lower limit of Bradley’s conservative criterion. In such scenarios, the variance ratio varies from 1.14 to 1.41 for 3 groups and from 1.17 to 1.49 for 5 groups. These findings, based on Bradley’s conservative criterion, define 1.14 as a new threshold of the variance ratio for unequal sample sizes. For unequal sample sizes, according to the robustness criteria proposed by Bradley (conservative) and Diebold & Chen, the proportion of type-I error rate violations increases as the sample size inequality (variation) increases (Fig. 1 & 2).

Diebold & Chen (1996) proposed the use of the asymptotic standard error for the nominal size of a test. The proportions of Type-I error rates of the F-test falling outside the ±3SE bands are reported in Tables 2-3 and Tables 4-5 for 3 and 5 groups respectively. At first sight, these results do not show the F-test to be robust. Further investigation reveals that, under the rule of thumb (VR<1.5), the F-test is liberal for equal and conservative for unequal sample sizes with 3 groups.

In the 5-group setup, under the rule of thumb for the variance ratio, the F-test is liberal but not conservative for equal sample sizes and conservative but not liberal for unequal sample sizes. On balance, if the rule of thumb is set aside, the F-test is liberal for equal and non-robust for unequal sample sizes with both three and five groups. Under the criterion proposed by Diebold & Chen (1996), the robustness of the F-test requires the variance ratio to be very close to one, implying that equality of the variances should hold.

The findings of this study highlight that the robustness of the F-test in terms of Type-I error rates depends on the distributional assumptions, the criteria of evaluation, and the variance ratio threshold. Under Bradley’s liberal criterion and the 1.5 threshold of the variance ratio, the F-test is robust for equal and unequal sample sizes in both the three- and five-group setups. Ignoring the variance ratio threshold leads to the conclusion that the F-test is not robust. Similarly, under the variance ratio threshold, the F-test is robust for equal and non-robust for unequal sample sizes according to Bradley’s conservative criterion. Finally, under the criterion proposed by Diebold & Chen (1996), the robustness of the F-test requires the equality of variances assumption, which is in line with the findings of Harwell et al. (1992) and Delacre et al. (2019). The variance ratio threshold proposed by Blanca et al. (2018) makes sense, as a high variance ratio indicates extreme contamination in the data distribution, which is rare in real data sets (Blanca, Arnau, López-Montiel, Bono, & Bendayan, 2013). Further investigation of the type-I error rates of the F-test against such rare distributions (having outliers, different shapes, and different variances), including Beta, Gamma, Lognormal, and Weibull distributions, reaffirms our finding that the F-test is robust only if the variance ratios of the samples under consideration are below the threshold level (see appendix, Table A). The results in Table A make it evident that, other than normality and equality of variances, the robustness of the F-test depends on (i) the variance ratio threshold, (ii) the sample size, and (iii) the equality of the sample sizes.

These findings are useful for researchers in the social sciences and medicine, as the F-test is shown to be robust in terms of type-I error rates under non-normality and heterogeneity according to Bradley’s liberal criterion. The variance ratio threshold of 1.5 plays an important role in validating the use of the F-test. However, the conservative criteria proposed by Bradley and by Diebold & Chen do not validate the use of the F-test when the distributional assumptions of normality and homogeneity are violated. We therefore encourage researchers to analyze the distribution underlying the data in hand in terms of normality, variance ratio (if the variances are not equal), and equality of sample sizes.

This study attempts to provide a systematic examination of the effect of non-normality and heterogeneity on the Type-I error rates of the F-test for ANOVA under a wide variety of conditions (1900). Although the robustness of the F-test under non-normality is established in the literature (Pearson, 1931; Feir-Walsh & Toothaker, 1974; Lix, Keselman, & Keselman, 1996; Patrick, 2009; Black, Ard, Smith, & Schibik, 2010; Schmider, Ziegler, Danay, Beyer, & Bühner, 2010), due attention has not been given to the assumption of homogeneity. Therefore, assuming heterogeneity and non-normality, one-way designs with three and five groups are considered with varying sample sizes to explore the performance of the F-test in terms of Type-I errors. Both liberal and conservative criteria of evaluation are considered.

Under Bradley’s liberal criterion and the 1.5 threshold of the variance ratio, the F-test is robust for equal and unequal sample sizes in both the three- and five-group setups. A variance ratio above 1.5 was proposed by Blanca et al. (2018) as a potential threat to the robustness of the F-test for unequal sample sizes with three groups. This study generalizes this rule of thumb to one-way designs with both equal and unequal sample sizes and with three and five groups. According to Bradley’s conservative criterion, under the variance ratio threshold, the F-test is robust for equal and non-robust for unequal sample sizes. Finally, under the criterion proposed by Diebold & Chen (1996), the robustness of the F-test requires the equality of variances assumption, which is in line with the findings of Harwell et al. (1992) and Delacre et al. (2019). Overall, other than normality and equality of variances, the robustness of the F-test depends on (i) the variance ratio threshold, (ii) the sample size, and (iii) the variation of the sample sizes.

Before using the F-test for analysis of variance, a researcher should compute (i) the skewness and kurtosis of the data, (ii) a boxplot for outlier detection, and (iii) the variance of each sample in order to compute the variance ratio. If there are no outliers, skewness and kurtosis are close to their normal benchmarks, and the variance ratio is less than 1.5, the F-test is applicable. In addition, the sample size variation should be kept as low as possible. In short, the F-test is not recommended under significant departures from normality and homogeneity.
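The checks recommended above can be gathered into one small routine. The sketch below is a hedged illustration rather than a prescribed procedure. It assumes moment-based skewness and excess kurtosis (both 0 for a normal distribution), the usual 1.5 times IQR boxplot fences for flagging outliers, and the 1.5 variance ratio limit discussed in the text. The function names and the returned dictionary layout are hypothetical, and no cut-off for "close to normal" is imposed because the text does not give one.

```python
import statistics


def skewness(x):
    """Moment-based skewness m3 / m2**1.5 (0 for a symmetric sample)."""
    n = len(x)
    m = statistics.fmean(x)
    m2 = sum((v - m) ** 2 for v in x) / n
    m3 = sum((v - m) ** 3 for v in x) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(x):
    """Moment-based kurtosis m4 / m2**2 minus 3 (0 for a normal distribution)."""
    n = len(x)
    m = statistics.fmean(x)
    m2 = sum((v - m) ** 2 for v in x) / n
    m4 = sum((v - m) ** 4 for v in x) / n
    return m4 / m2 ** 2 - 3


def boxplot_outliers(x):
    """Values outside the 1.5 * IQR fences used in a standard boxplot."""
    q1, _, q3 = statistics.quantiles(x, n=4)
    iqr = q3 - q1
    return [v for v in x if v < q1 - 1.5 * iqr or v > q3 + 1.5 * iqr]


def anova_prechecks(groups, vr_limit=1.5):
    """Summarize shape, outliers, and variance ratio for each group."""
    variances = [statistics.variance(g) for g in groups]
    vr = max(variances) / min(variances)
    return {
        "skewness": [skewness(g) for g in groups],
        "excess_kurtosis": [excess_kurtosis(g) for g in groups],
        "outliers": [boxplot_outliers(g) for g in groups],
        "variance_ratio": vr,
        "vr_below_limit": vr < vr_limit,
    }
```

The routine only reports the quantities; judging whether skewness and kurtosis are close enough to their normal benchmarks remains with the analyst.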

Conflict of Interest Statement:

Authors of the manuscript declare no conflict of interest.

Black G, Ard D, Smith J, Schibik S (2010) The impact of the Weibull distribution on the performance of the single-factor ANOVA model. Int J Ind Eng Comput 1(2):185–198
Blanca MJ, Alarcón R, Arnau J, Bono R, Bendayan R (2018) Effect of variance ratio on ANOVA robustness: Might 1.5 be the limit? Behav Res Methods 50(3):937–962
Blanca MJ, Alarcón R, Arnau J, Bono R, Bendayan R (2017) Non-normal data: Is ANOVA still a valid option? Psicothema 29(4):552–557
Blanca MJ, Arnau J, López-Montiel D, Bono R, Bendayan R (2013) Skewness and kurtosis in real data samples. Methodology
Bradley JV (1978) Robustness? Br J Math Stat Psychol 31:144–152
Clinch JJ, Keselman HJ (1982) Parametric alternatives to the analysis of variance. J Educational Stat 7(3):207–214
Delacre M, Leys C, Mora YL, Lakens D (2019) Taking parametric assumptions seriously: Arguments for the use of Welch’s F-test instead of the classical F-test in one-way ANOVA. International Review of Social Psychology 32(1)
Diebold FX, Chen C (1996) Testing structural stability with endogenous breakpoint: A size comparison of analytic and bootstrap procedures. J Econ 70(1):221–241
Feir-Walsh BJ, Toothaker LE (1974) An empirical comparison of the ANOVA F-test, normal scores test and Kruskal-Wallis test under violation of assumptions. Educ Psychol Meas 34(4):789–799
Gamage J, Weerahandi S (1998) Size performance of some tests in one-way ANOVA. Commun Statistics-Simulation Comput 27(3):625–640
Harwell MR, Rubinstein EN, Hayes WS, Olds CC (1992) Summarizing Monte Carlo results in methodological research: The one- and two-factor fixed effects ANOVA cases. J Educational Stat 17(4):315–339. DOI: https://doi.org/10.3102/10769986017004315
Lantz B (2013) The impact of sample non-normality on ANOVA and alternative methods. Br J Math Stat Psychol 66(2):224–244
Lee S, Ahn CH (2003) Modified ANOVA for unequal variances. Commun Statistics-Simulation Comput 32(4):987–1004
Lix LM, Keselman JC, Keselman HJ (1996) Consequences of assumption violations revisited: A quantitative review of alternatives to the one-way analysis of variance F test. Rev Educ Res 66(4):579–619
Moder K (2010) Alternatives to F-test in one-way ANOVA in case of heterogeneity of variances (a simulation study). Psychol Test Assess Model 52:343–353
Patrick JD (2009) Simulations to analyze Type I error and power in the ANOVA F test and nonparametric alternatives (Doctoral dissertation, University of West Florida)
Pearson ES (1931) The analysis of variance in cases of non-normal variation. Biometrika 23:114–133
Schmider E, Ziegler M, Danay E, Beyer L, Bühner M (2010) Is it really robust? Methodology
Weerahandi S (1995) ANOVA under unequal error variances. Biometrics 51(2):589–599
Zijlstra W (2004) Comparing the Student’s t and the ANOVA contrast procedure with five alternative procedures (Master’s thesis, Rijksuniversiteit Groningen). Retrieved from http://www.ppsw.rug.nl/~kiers/ReportZijlstra.pdf
Kobayashi K (2005) Analysis of quantitative data obtained from toxicity studies showing non-normal distribution. J Toxicol Sci 30(2):127–134

Tables 1-5 are available in the Supplementary Files section.
