A multivariate heterogeneous variance components model for multi-environment studies with locational genetic effects

In this paper, a multivariate heterogeneous variance components model was developed which allows for determination of location specific variance components in the analysis of multiple related traits. In addition to spatial heterogeneity, genetic similarities are also considered by assigning genetic variance components. The performance of the developed model was evaluated through an extensive simulation study and comparison of models was conducted by heritability estimations. The simulation study reveals that the developed method can control the locational heterogeneity well and the heritability estimations are close to desired proportions for the developed model. A real plant breeding data set was used for illustration.


Introduction
Previously, several methods were developed to estimate genetic variance components using random effect models. These models allow estimation of genetic effects to be related more to familial relationships such as pedigree structures rather than environmental relatedness. A motivation for our study is that the heterogeneity arising from the locational clustering may cause underestimation of genetic variance components, especially in animal and plant breeding studies.
In recent decades, there is growing interest in multivariate analysis due to the availability of extensive largescale data (Li et al. 2021;Sundararajan 2021;Rao et al. 2020). Especially in the field of interdisciplinary sciences such as statistical genetics, the multivariate approach allows for analysis of multiple traits affected by the interplay of both genes and environmental influences. Moreover, these multiple traits are usually related to each other in several ways; hence, the correlation between multiple variants should also be taken into consideration with a multivariate approach. The linear mixed model (LMM) was widely used to analyse the relationship between a trait and genetic random factors, and the multivariate linear mixed model (mvLMM) is an extension of LMM for modelling multiple traits simultaneously.
In the LMM approach, certain sources of variation are assessed by random effect variables and the variance-covariance parameters of these variables are referred to as variance components. Since they are useful for explaining the total variation, estimation of variance components has become an essential part of the modelling process in most applied science fields. Particularly in animal and plant breeding studies, the familial and environmental relationships necessitate the usage of LMM due to the naturally clustered structure of the population (Correddu et al. 2021).
Moreover, this clustered structure may often cause genetic diversity among the members of species owing to natural selection or geographic conditions, and the genetic variance components ensure the understanding of the genetic ethology of the characteristics of interest.
Although variance components models have a long history, advanced computational strategies are still being developed to estimate variance parameters accurately (Henderson 1953;Smith and Graser 1986;Searle et al. 1992;Lynch and Walsh 1998). Especially, due to the large dimensions of matrices in mvLMM, most of the extensions are presented to overcome computational difficulties. For instance, Gilmour et al. (1995) developed a computationally convenient and extensive algorithm based on average information matrix for the estimation of variance parameters, and Lee and Van der Werf (2006) discussed the efficiency of direct use of the variance covariance matrix with a general complex pedigree. Recently, an extension of mvLMM based on genomic information was developed by combining the direct average information algorithm with an eigen decomposition of the genomic relationship matrix (Lee and Van der Werf 2006).
In addition to computational solutions, multivariate variance components models need to be extended to account for several different sources of relatedness and heterogeneity. Especially in genetic analysis, most of the complex traits are affected by a collaboration of genetic and environmental factors, and this collaboration may require additional variance components. For instance, due to the interplay of genes and environment in multiple environment population studies, it is misleading to assume that the genetic background is common across different locations (Covarrubias-Pazaran 2016). In such study designs, specifying separate genetic variance components for each environment is a way to model the heterogeneity arising from the interaction of genetic and environmental factors.
In this paper, the focus is the estimation of heterogeneous variance components of mvLMM for the analysis of multiple-related traits across multiple locations. In addition to spatial heterogeneity, genetic similarities are also considered by assigning genetic variance components. Due to the genetic background and location clustered structure of the desired design, a complex multiple phenotype simulation is conducted that relies on genotype simulations. A multivariate heterogeneous variance components model is proposed by taking location-specific variance components into consideration.

Multivariate variance components model
. . .; m and j ¼ 1; 2; . . .; n. Then, the matrix notation of mvLMM is where X 2 R nmÂPm is the design matrix for P fixed effects and Z 2 R nmÂQm is the design matrix for Q random effects.
The diagonal elements of X and Z are identical in themselves, represents the residual variance-covariance matrix of trait i and the term r 2 e i is the residual variance of trait i. The random effects are U ¼ vec u 1 ; u 2 ; . . .; u m ½ and . u i is assumed to be normally distributed with zero mean and the variance-covariance matrix of u i is D i 2 R QÂQ .
For multiple traits, the multivariate distribution can be assumed to be where V is the variance-covariance matrix of all observations.
Due to the multivariate structure, the random effects covariance and the residual covariance between traits i and t are denoted by D it and R it ¼ I n r 2 e it , respectively (i ¼ 1; 2; . . .; m; t ¼ 1; 2; . . .; m and i 6 ¼ t). The term r 2 e it is the residual covariance between traits i and t.
In the mixed modelling approach, the random part of the model may have several components reflecting different grouping effects such as treatment effect, common environmental effect or serial effect for repeated measurements. Especially in genetic studies, a random term is usually defined to consider the genetic similarities. In this case, including genetic background effects, the components of the random vector for trait i can be split into two parts; genetic and non-genetic random effects. For simplicity, considering a single component for the random genetic effects on all individuals as a vector of total genetic value, and the variance-covariance matrix for trait i can be written as where G i is the genetic relatedness matrix, and E i1 , …, E iS are the variance-covariance matrices of S ¼ Q À 1 ð Þ random effect components other than genetic background. Here, G i ¼ Kr 2 g i and r 2 g i denote the genetic variance component related to trait i. K 2 R nÂn is the genomic relationship or kinship matrix. Similarly, the genetic component of D it is G it ¼ Kr 2 g it and the term r 2 g it is the genetic covariance between traits i and t.
In single-environment studies, r 2 g i and r 2 g it are assumed to also be single. However, in genetic studies, heterogeneity can occur based on an environmental indicator, such as location. In this case, the distribution of a genetic component is expressed by location-specific variances.

Extension to the multivariate variance components model
In this study, the multivariate variance components model is typically expanded to include location-specific variance components by specifying a heterogeneous variance structure for the trait vector, at specific levels of a random factor. Compared to the standard mvLMM, the proposed method has the important advantage of accounting for several different sources of relatedness and heterogeneity which occur in genetic analysis. Moreover, the developed model allows modelling of spatial trends to provide more flexibility.
Here, a multi-environment design with L different locations is considered. Regarding the locational heterogeneity, the genetic component of the mvLMM is assumed to have location-specific variance r 2 g il denoting the genetic variance component for environment l (l ¼ 1; 2; . . .; LÞ. Then, the genetic variance-covariance matrix across L locations for trait i can be written as where r 2 g ilb is the covariance term reflecting genetic effects in two different environments l and b (l ¼ 1; 2; . . .; L;b ¼ 1; 2; . . .; L and l 6 ¼ b). For an mvLMM accounting for heterogeneous variances, the variance components vector can be decomposed as H ¼ r 2 g 11 ; . . .; r 2 g 1L ; . . .; r 2 g m1 ; . . .; r 2 g mL ; . . .; r 2 e 1 ; . . .; r 2 To illustrate the location-specific variance components, a multiple trait design in four locations is considered (L ¼ 4). Regarding the locational heterogeneity, the genetic component of the mvLMM is assumed to have location-specific variance r 2 g il denoting the genetic variance component for environment l (l ¼ 1; 2; . . .; LÞ.
The log likelihood function of the mvLMM is

Estimation of heterogeneous variance components
In this study, a multivariate Newton-Raphson iterative algorithm was used to obtain residual/restricted maximum likelihood estimates (REML) for variance components. REML is often solved by updating variance components based on Hessian matrix or Fisher's information matrix consisting of second derivatives of the log likelihood function (Searle et al 1992;Lynch and Walsh 1998). For a more efficient computation, the REML method is implemented via the average of the observed Hessian and the expected Fisher information matrices (Gilmour et al 1995;Lee and Van der Werf 2006). In the average information Newton-Raphson (AI-NR) algorithm, the REML estimates are obtained with where H is the vector of variance components decomposed to Eq. (7) and AI is the average information matrix consisting of second derivatives of the log likelihood function L. For mvLMM, the AI matrix is directly derived from V as In a multiple trait design, considering observations collected in L environments, the variance covariance matrix V can be rewritten as . . .
3 Data simulation In the simulation scenario design, the number of multiple traits was assigned as three and number of samples was assigned as n = 1000. Multiple traits are simulated as the sum of genetic background effect, locational effect and noise effects as described in Meyer and Birney (2018), and covariate effects other than the genetic and locational effects are considered to be nuisances. Genetic background effects were simulated based on kinship estimates. For the kinship estimates, 500 bi-allelic single nucleotide polymorphisms (SNP) are simulated for 1000 samples and the level of genotype is generated relying on probability in a binomial distribution with uniformly sampled reference allele frequencies (0.05, 0.1, 0.3, 0.4 and 0.5).
In order to reveal the effects of location-specific variance components, a location indicator was simulated as a categorical variable and influenced multiple traits with different proportions; w = 0, 0.3, 0.5, 0.8 and 1. Due to the multivariate structure of the design, three different levels of correlation q = 0.25, 0.5 and 0.75 between traits were considered and the correlated effect was simulated from multivariate normal distribution (MVN) with the considered level of correlation. The proportion of genetic variance r 2 g is set to 0.30. The correlation structures for some remarkable scenarios were visualized by heat maps, as given in Fig. 1.
In the modelling process of the simulation study, the heritability estimations were obtained under mvLMM and heterogeneous mvLMM with location-specific variance components. Estimation of heritability, h 2 , relies on the partitioning of observed variation into unobserved genetic and environmental components (Wray and Visscher 2008). For the mvLMM with a general variance component, the heritability is estimated as the proportion of genetic variance to total variance, h 2 ¼ r 2 g = r 2 g þ r 2 e . However, in a multi-environment design with L different locations, heritability depends on the proportion of total genetic variance in L locations and can be estimated as The results in Table 1 indicate the proximity of the heritability estimation to the true proportion (30%) of genetic variance.
In Table 1, w ¼ 0 indicates the absence of locational heterogeneity and heritability estimations are similar and about 0.30. When using mvLMM with location-specific variance components, as the proportion associated with location increases, the heritability estimations are closer to 0.30. Our results also show that the dependence level of multiple traits do not affect estimations remarkably.

Application to real data
In this section, the two variance components models were fitted to safflower (Carthamus tinctorius L.) data collected from three different locations (Bolu, Haymana and The correlation between yield and length in three different environments is illustrated in Fig. 2.
The REML estimations were obtained from mvLMM and heterogeneous mvLMM with location-specific variance components by using the AI-NR algorithm. The model results are summarized in Table 2.
In standard mvLMM, six components (r 2 g Yield , r 2 g Length , r 2 g YieldÀLength , r 2 e Yield , r 2 e Length , r 2 e YieldÀLength ) were estimated, whereas the number of components was doubled for heterogeneous mvLMM by including separate components for each environment. For instance, the genetic variance component of yield was represented by r 2 g Yield in standard mvLMM; however, in heterogeneous mvLMM, the genetic variance component of yield has its own sub-components of r 2

Discussion
In this paper, a multivariate heterogeneous variance components model with location-specific variance components was developed, which allowed for determination of location-specific variance components. The simulation results show that the heritability estimations are closer to the desired proportions using the developed mvLMM with location-specific genetic variance components as the locational heterogeneity increases. Based on real data results, the developed heterogeneous mvLMM with location-specific variance components fits the data better in multi-environmental designs. Thus, our method can control for locational heterogeneity better compared to an mvLMM with a general variance component.

Conclusion
mvLMM is generally suitable for modelling multiple traits simultaneously, and multivariate variance components are useful for explaining the total variation. However, especially in complex experimental designs where heterogeneity can occur due to several indicators, standard variance components models have some limitations such as assuming single variance components. As an extension of the multivariate variance components model, the proposed method allows for location-specific variance components by specifying a heterogeneous variance structure for the trait vector. In addition to spatial heterogeneity, genetic similarities are also considered by assigning genetic variance components. For future studies, the developed model can be extended to allow for modelling of longitudinal responses in later implementations. Moreover, future simulation studies based on the multivariate heterogeneous variance components model may offer insights about how to identify and interpret the parameters in more complex models. A more comparative study can be developed by including the addition of other multivariate modelling methods such as structural equation algorithms.
Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.