The impact of linkage disequilibrium and epistasis in the studies of inheritance based on Hayman’s diallel and generation mean analysis

This simulation-based study assessed the impact of linkage disequilibrium (LD) and epistasis on Hayman’s diallel and generation mean analysis, assuming hundreds of genes, variable degree of dominance, and seven types of digenic epistasis. The diallel parents were 15 doubled haploids from a high LD population. The generation mean analysis was based on seven generations, assuming association. Under low LD and no epistasis, the diallel analysis provided confident results about the inheritance of the quantitative trait and high correlation between number of recessive genes and Wr + Vr, but biased estimates of the dominance components and genetic parameters. The additional consequences of high LD under no epistasis were rejection of the additive-dominance model assuming high heritability and lower correlation. Assuming 100% of epistatic genes, for four epistasis types there was evidence of inadequacy of the additive-dominance model. Assuming 30% of epistatic genes, there was a tendency for accepting the additive-dominance model for low heritability traits and for rejecting for high heritability traits. Linkage and epistasis affect the estimates of the genetic components of the generation means. Even assuming 100% of interacting genes, for most epistasis types there was no statistical evidence of epistasis. Assuming positive partial dominance, the signs of the epistatic components do not allow discriminate complementary, recessive, dominant and recessive, duplicate genes with cumulative effects, and non-epistatic genic interaction. Negative epistatic components evidence dominant epistasis. When the additive × additive and dominance × dominance components are positive and the additive × dominance component is negative, there is duplicate epistasis.


Introduction
The genetic design diallel is regularly used in plant breeding. It is commonly employed to provide interpopulation crosses or single crosses/F 1 's from inbred/ pure/doubled haploid (DH) lines, for assessing heterosis, combining ability, or molecular genetic diversity, for predicting non-assessed testcrosses and single-crosses, among other applications (Mowers et al. 2018;Kadam et al. 2016;Yu et al. 2020;Leng et al. 2019). There are several methods of analysis but most investigations are based on the models (fixed or random) and methods proposed by Griffing (1956a). The main reasons that explain the general choice by breeders for the Griffing's combining ability analysis are: the methodology can be used for any crop and trait and the computation and interpretation of the genetic parameters are simple. Concerning the study of inheritance of quantitative traits, both heterosis and combining ability analyses allows testing non-additive effects, if there is genetic variability between the parents. But it is not possible to test for epistasis.
In the context of inheritance, an interesting approach based on diallel cross was proposed by Hayman (1954). However, very few inheritance studies based on Hayman's method were published in the last 10 years (de Lima et al. 2019;Makumbi et al. 2018;Shahadati-Moghaddam et al. 2017;Kalinina and Lyakh 2011). The Hayman's method has indeed some limitations, as the assumptions of no epistasis (independent action of non-allelic genes) and no linkage disequilibrium (LD) (genes independently distributed in the parents). But it has positive aspects as testing the adequacy of the additive-dominance model. Then, if there is epistasis, the breeder can ignore the trait or take into account the influence of epistasis on the analysis. Why, then, the Hayman's method has been almost ignored in the studies of inheritance of quantitative traits? For sure, this cannot be attributable to a regular evidence of epistasis since most of the empirical data show mainly additive genetic variation (Hill et al. 2008). In my opinion, the main reasons are: the method seems very complex for breeders in regard to computation and interpretation; the breeders guess that the Griffing's method provides the same inferences; and the method is restricted for diploids and homozygous parents. However, Hayman's approach is not very complex for computing and interpreting and it provides some information on inheritance that a combining ability analysis does not offer.
Another interesting approach for the study of inheritance of quantitative traits is the generation mean analysis, proposed by Mather and Jinks (1971). Similar to the heterosis and combining ability analysis, this biometrical genetics methodology is based on the estimation of linear components of means of populations derived from homozygous parents. This approach has been commonly used by breeders of self-and cross-pollinated crops since its proposition (Rai et al. 2020;Addy et al. 2020;Verma and Singh 2018). Its main advantages are: it is easily performed for crops with inbred/pure/DH lines; it is applicable to any trait but few studies involved grain yield (Mohammed et al. 2018); it allows for testing dominance and epistasis separately; and the computation and the interpretation are simple.
Assuming absence of epistasis, the linear components of means do not depend on the LD. However, the genotypic variance and its genetic components are affected by LD (Hill and Maki-Tanila 2015). Because joint modelling epistasis and LD is a challenger for quantitative and biometrical geneticists, in the most important theoretical papers on heterosis and combining ability analysis there is only superficial information on the influence of non-allelic interaction (Gardner and Eberhart 1966;Kempthorne 1956;Griffing 1956b). Because his method is based on components of the genotypic variance for parents and F 1 's, Hayman (1954) makes some statements on how LD and epistasis affects the diallel analysis (see Sects. 4.4. Correlated gene distributions and 4.5. Non-allelic gene interaction). The knowledge provided by Hayman (1954) was extended in the studies of Hill (1964), Nassar (1965), Mather (1967), and Coughtrey and Mather (1970). In regard to generation mean analysis, Mather and Jinks (1971) included theory and some general conclusions for the analysis assuming LD and epistasis (see Sect. 5.18 Linkage of interacting genes). Hill (1964) and Nassar (1965), based on 3-and 10-gene models, provided contrasting results regarding the consequences of LD on the W r / V r graph and the order of dominance. Mather (1967) and Coughtrey and Mather (1970) assumed only complementary and duplicate epistasis in a two-gene model with no LD.
Thus, none previous published study provided general information on the joint impact of LD and epistasis on the Hayman's diallel and generation mean analyses. Then, the objective of this study was to provide significant additional knowledge about the influence of LD and epistasis on the Hayman's diallel and generation mean analysis, based on simulated data. I assumed hundreds of genes, variable degree of dominance, variable gene frequencies, LD, and seven types of digenic epistasis. In the first part of this study I present the theoretical background that supports the software used for simulating the dataset.

Material and methods
Genetic variances and covariances of the Hayman's diallel assuming LD and epistasis Consider n inbred/pure/DH lines (n > 3). Assume initially LD but no epistasis. In regard to two genes, the genotype probabilities are P(AABB) = f 22 = u a u b + Δ ab , where u and v are the allelic frequencies and Δ ab is the measure of LD in the gametic pool. Note that I did not state that the parents are a sample from a reference population or that the two genes are linked. Using the same definition of Hayman (1954) for the genotypic values of the r-th parent and the F 1 between parents r and s, the variance of the parents where d i is the deviation between the genotypic value of the homozygote of greater expression and the mean of the genotypic values of the homozygotes ( m ) andw = u − v . Because the means of the parents and the F 1 are not affected by the LD (in the absence of epistasis), the difference between the mean of the n 2 progeny ( m L1 ) and the mean of their parents ( m L0 ) is equal to the function derived by Hayman (1954): Thus, the average variance is V 1L1 = (1∕4)D + (1∕4) The covariance between the non-recurrent parents and their offspring in the r-th array is W r = (1∕2)D − (1∕4)F r . Thus, the average covariance is W 0L01 = (1∕2)D − (1∕4)F . Then, if there is dominance and LD, the difference between the covariance and the variance in the arrays is not a constant value. The difference is (1∕4) D − H 1 + k(1∕4)F 1r . This implies that the points ( W r , V r ) does not lie on a straight line of unit slope through their mean point ( W 0L01 , V 1L1 ). However, the function does not allow realizing how much LD affects the deviation from 1. The variance of the array means is Note that, because LD, the four equations independent of r allows the estimation of the parameters D , H 1 − F 1 , H 2 − F 1 , and F . This implies that the genetic parameters that are dependent of H 1 and/or H 2 -average degree of dominance, mean value of uv (symmetry) for dominant genes, proportion between dominant and recessive genes, and number of dominant genes -are biased. Because the functions for these parameters are complex, the magnitude of the bias can only be assed using simulated data. Further, it is not also clear how LD affects the order of dominance of the parents in the graph ( W r , V r ).
Assume now that the two genes are epistatic. Assuming an epistatic effect ( I ij ; i, j = 2, 1, and 0) for each genotype, the variance of the parents is V *  22 is the covariance between the non-epistatic deviation and the epistatic effect in the array.
is the variance of the average epistatic values of the array means, Cov 1 deviation and the epistatic effect in the F 1 , and 1 is the covariance between the average non-epistatic and the epistatic values of the array means.
The covariance in the r-th array is W * r = W r + Cov 1 01(r) + Cov 2 01(r) + Cov 3 01(r) , where, for example, for the array of the parent AABB, are the covariances between the non-epistatic deviation of non-recurrent parent and the epistatic effect of F 1 , between the non-epistatic deviation of F 1 and the epistatic effect of non-recurrent parent, and between the epistatic effects of non-recurrent parent and F 1 , respectively. The average covariance Note that the difference between the covariance and the variance in the arrays is W Thus, the epistasis is an additional factor that deviate the points ( W * r , V * r ) from a straight line of unit slope through the mean point ( W 0L01 , V 1L1 ). However, the function does not also allow realizing how much epistasis affects the deviation from 1. The variance of the array means is Thus, epistasis introduces an additional bias in the estimates of the Hayman's genetic parameters, which can only be assessed using simulated data.
Generation mean analysis with LD and epistasis Assuming two linked epistatic genes, an epistatic effect for each genotype, and parents AABB and aabb (association), the genotypic values of the parents and the F 1 are respectively. Using the notation of Mather and Jinks (1971) for the additive and dominance components, where r is the recombination frequency. Note that in F 2 , linked genes with r lower than 0.5 are in LD. The absolute LD value in the gametic pool of F 1 is (1 − 2r)∕4 . The value is positive with coupling and negative with repulsion. The mean of the F 2 generation is where n is the number of selfing generations. It is interestingly to note that the expectation of the epistatic values in a selfed generation is not directly proportional to the expectation in the F 2 generation, since E(I) (n) = E(I) (0) + deviation (see the deviation for the F n+2 generation in the Appendix).
The average genotypic values of the two backcross generations are , where E(I) 1 = (1 − r) I 22 + I 11 + r I 21 + I 12 and E(I) 2 = (1 − r) I 11 + I 00 + r I 10 + I 01 . Thus, assuming digenic epistasis, the means of the parents, F 1 , F 2 , F 3 , and backcrosses (seven equations) depend on 10 genetic linear components (seven epistatic components). A very known approach for allowing testing epistasis and estimating and testing epistatic components was provided by Mather and Jinks (1971). This simplified approach has been used for modelling epistasis in genomic selection, GWAS (genome-wide association studies), and QTL mapping. Mather and Jinks (1971)   Note that I wrote simplified approach because the assumptions for the additive x additive, additive x dominance, and dominance x additive components do not necessarily met for any type of digenic epistasis, including complementary, duplicate, dominant, recessive, dominant and recessive, duplicate genes with cumulative effects, and non-epistatic genic interaction, regardless of the degree of dominance. To characterize complementary epistasis it is necessary to assume (Mather 1967). Then, it is assumed complete dominance (|h/d|= 1).

Data simulation
The simulated dataset was generated using the software REALbreeding (available by request). REALbreeding has been used in studies related to genomic selection , GWAS , QTL mappisng (Viana et al. 2017), LD (Andrade et al. 2019), population structure (Viana et al. 2013b), heterotic grouping/genetic diversity (Viana et al. 2020), and plant breeding (Viana et al. 2013a). In summary, the software simulates individual genotypes for genes and molecular markers and phenotypes in three phases, using inputs from the user. The first phase (genome simulation) is the specification of the number of chromosomes, molecular markers, and genes as well as marker type and density. The second phase (population simulation) is the specification of the population(s) and sample size or progeny number and size. A population is characterized by the average frequency for the genes (biallelic) and markers (first allele). The last phase (trait simulation) is the specification of the minimum and maximum genotypic values for homozygotes, the minimum and maximum phenotypic values (to avoid outliers), the direction and degree of dominance, and the broad sense heritability.
Because the genotypic values for any two interacting genes are not known, there are infinite genotypic values that satisfy the specifications of each type of digenic epistasis. For example, fixing the gene frequencies (the population) and the parameters m, d, h, and h/d (degree of dominance) for each gene (the trait), the solutions G 22 = G 21 = G 12 = G 11 = 5.25 and G 20 = G 10 = G 02 = G 01 = G 00 = 5.71 or G 22 = G 21 = G 12 = G 11 = 6.75 and G 20 = G 10 = G 02 = G 01 = G 00 = 2.71 define complementary epistasis but the genotypic 63 Page 6 of 13 Vol:. (1234567890) values are not the same. The solution implemented in the software allows the user to control the magnitude of the epistatic variance (V(I)), relative to the magnitudes of the additive and dominance variances (V(A) and V(D)). As an input for the user, the software requires the ratio V(I)/(V(A) + V(D)) for each pair of interacting genes (a single value; for example, 1.0). Then, for each pair of interacting genes the software samples a random value for the epistatic value I 22 (the epistatic value for the genotype AABB), assuming I 22 ∼ N(0, V(I)) . Then, the other epistatic effects and genotypic values are computed. In this study, I assumed ratio 1. Increasing the ratio increases the magnitude of the additive, dominance, and epistatic genetic values. I simulated grain yield (g/plant) and expansion volume (a measure of popcorn quality; ml/g), assuming 400 genes distributed in 10 chromosomes of 200 cM. For grain yield and expansion volume, I specified positive dominance (average degree of dominance of 0.6) and bidirectional dominance (average degree of dominance of 0.0), respectively. For grain yield, the minimum and maximum genotypic values for homozygotes were 30 and 160. The minimum and maximum phenotypic values for homozygotes were 10 and 180. For expansion volume, I assumed 5 and 55 as the minimum and maximum genotypic values for homozygotes. The minimum and maximum phenotypic values were 0 and 60. I assumed no epistasis (but LD), seven types of digenic epistasis and an admixture of these types, defining 30 and 100% of epistatic genes. For the diallel cross, I generated 1,000 DH lines from a population with high LD and selected 15 DHs. The number of dominant genes in the DHs ranged from 136 to 278. The criterion for selecting the parents was two to three DHs at random from six classes for the number of dominant genes: 0 to 140 up to 261 to 290. The number of dominant genes in the selected DHs ranged from 161 to 246. Then, the selected DHs were crossed in a complete diallel without reciprocals. For the generation mean analysis, I generated the contrasting parental inbred lines (P 1 and P 2 ), assuming all genes in association, and the generations F 1 , F 2 , F 3 , BC 1 , and BC 2 . The numbers of plants per generation were 50, 50, 50, 400, 400, 400, and 400, respectively.
I also generated a scenario of low LD and no epistasis, assuming 10 independent genes and 10 DHs sampled from 1,000 DHs. The criterion for selecting the 10 DHs was minimization of the LD. The LD values ( Δ ab = p AB − p A p B ) ranged from − 0.11 to 0.09 (absolute average value 0.04). For comparison purpose, I computed the LD values for the 80 genes in the chromosomes 1 and 2. The ranges and the means of the absolute LD values for the genes in chromosomes 1, 2 and for the independent genes were, respectively, − 0.15 to 0.25 and 0.07, − 0.16 to 0.21 and 0.05, and − 0.20 to 0.20 and 0.05. The broad sense heritability at the plant level was 20%. For the progeny level were 40, 60, and 80%. To avoid the influence of the experimental error, all analyses were based on the parametric genotypic values and variances and covariances, provided by REALbreeding.

Results
The impact of LD on the Hayman's diallel analysis is evident even in the scenario of independent genes. The analysis of variance of W r −V r indicated adequacy of the additive-dominance model (P value of 1.00) even assuming a heritability of 80%. The regression of W r on V r indicated partial dominance ( 0 = 0.02 ; P value of 0.00) and 1 equal to 1 (P value of 0.38). The coefficient of determination was 0.99. The conclusion of partial dominance is correct since the average degree of dominance is 0.52 (in the range 0.01-1.16). However, the LD between the independent genes led to a bias of − 22.5 and − 27.3% in the estimates of the dominance components and a bias in the range − 13.8 to − 47.7% for three of the genetic parameters, ignoring the estimate of the number of dominant genes that is always sub estimated (4 vs. 9, because the DHs have one fixed gene). The correlation between the number of recessive genes and W r + V r was 0.92 and the number of dominant genes for the parents' order of dominance was 6, 6, 6, 6, 5, 5, 4, 5, 4, and 2. These results indicates that when there is low LD and no epistasis, the Hayman's diallel analysis provides confident results about the inheritance of the quantitative trait but biased estimates of the dominance components and the genetic parameters.
The negative consequences of high LD under no epistasis are also biased estimates of the dominance components and genetic parameters plus a biased estimate of the covariance F, a significant test for the homogeneity of W r − V r assuming a heritability of 80%, and an intermediate value for the correlation between the number of recessive genes and W r + V r Page 7 of 13 63 Vol.: (0123456789) (0.51) (Table 1). Thus, high LD can lead to inadequacy of the additive-dominance model for a high heritability trait. The number of dominant genes for the parents' order of dominance was 245, 176, 246, 195, 230, 229, 203, 189, 191, 217, 166, 176, 214, 203, and 161. Regardless of the heritability trait, there was evidence of partial dominance and a slight deviation from 1.0 for the regression coefficient.
As demonstrated from the theory, epistasis is a significant additional factor negatively affecting the Hayman's diallel analysis, especially assuming a high proportion of interacting genes. Assuming 100% of epistatic genes and dominant, recessive, duplicate genes with cumulative effects, and non-epistatic genic interaction, there was evidence of inadequacy of the additive-dominance model, regardless of the trait heritability (Table 1 and Online Resource Fig. 1). Irrespective of the type of epistasis and significance of the test of adequacy of the additive-dominance model, there was a tendency for concluding in favor of overdominance, which is a wrong inference. Further, the correlation between the number of recessive genes and W r + V r tends to be lower (in the range − 0.01 to 0.67; 0.39 on average) and all genetic components and parameters will be very biased. Assuming 30% of epistatic genes, there was a tendency for accepting the additive-dominance model for low heritability traits and for rejecting the model for high heritability traits (Table 1 and Online Resource Table 1 and Fig. 1). Regardless of the significance level of the adherence test, there was evidence of partial dominance and the correlation between the number of recessive genes and W r + V r had intermediate magnitude (in the range 0.31-0.62; 0.50 on average). However, as demonstrated from the theory, the genetic components and parameters will be very biased. Note that, in general, the coefficient of regression has a magnitude close to 1.0, but only assuming an admixture of epistasis types it is consistently not statistically different from 1.0. In regard to the heterosis, that is unaffected by LD, note that the epistasis tends to decrease it, except assuming duplicate genes with cumulative effects and non-epistatic genic interaction (Table 1 and Online  Resource Table 1).
In the absence of epistasis, as theoretically demonstrated, linkage (LD) does not affect the generation mean analysis, respecting that this method cannot distinguish absence of dominance from symmetrical bidirectional dominance. Under the sample sizes and heritability (a low value, 20%) assumed, the experimental evidence is that there is additive variability, dominance, and no epistasis, for both traits (Table 2). Additionally, there was evidence of (predominantly) positive dominance for grain yield and (predominantly) negative dominance for expansion volume. Note also that, regardless of the trait, heritability, and genetic control, if the sample sizes for parents and derived generations are representative, the analysis will provide confident estimates of the parametric means (correlations close to 1.0 between the estimated and true means). However, linkage and epistasis significantly affects the estimates of the genetic components of the generation means, regardless of the type of epistasis, proportional the percentage of epistatic genes (Table 2). One impressive result by fitting the epistatic components ([i], [j], and [l]) is that, even assuming 100% of interacting genes, for most epistasis types there were evidence of no epistasis (P values in the range 0.052-0.15). Another remarkable result is a null or negative correlation between the epistatic components of my model and of the Mather and Jinks' model, for most of the epistasis types. Note that the two higher correlations (among four) are associated with the less biased estimates of the non-epistatic components, under duplicate genes with cumulative effects and non-epistatic genic interaction. Fitting the Mather and Jinks' complete model when there is statistical evidence of epistasis or fitting the additive-dominance model when there is epistasis but no statistical evidence of epistasis provides biased estimates of the genetic components, proportional to the percentage of epistatic genes (Table 2 and Online  Resource Table 2). For grain yield, under duplicate and dominant and recessive epistasis, a wrong inference about dominance occurred (negative dominance and absence of dominance, respectively, instead of positive dominance).

Discussion
Probably because Hill (1964), Nassar (1965), Mather (1967), and Coughtrey and Mather (1970) used simplified genetic models (2 to 10 genes, complete dominance, only LD or epistasis) for investigating the effects of LD or epistasis on Hayman's diallel analysis, their main findings are very limited and only partially confirmed in this study. Hill (1964)    Vol.: (0123456789) and the generation mean analysis evidenced complementary epistasis. Concluding, taking into account the theory presented and the results from the simulation study, LD and epistasis can have negative effects on the Hayman's diallel analysis and on the Mather and Jinks' generation mean analysis, proportional to the LD level and the percentage of epistatic genes, and depending on the predominant type of epistasis. However, biased estimates of quadratic or linear genetic parameters are not so serious if the inheritance of the quantitative trait is correctly inferred, at least partially. Note that, excepting for a high proportion of epistatic genes under high LD, the general correct conclusion was partial positive dominance for grain yield from both analysis. Further, the order of dominance provides a good discrimination between the parents, regardless of the type of epistasis, percentage of interacting genes, and level of LD. Unfortunately, the detection of epistasis from both analyses is highly affected by the trait heritability, predominant type of epistasis, and percentage of interacting genes.