Genomic Selection Using Environmental Covariates Within an Integrated Factor Analytic Linear Mixed Model

Key message The integration of environmental covariates within a single-stage factor analytic approach provides breeders with an informative and practical framework to utilise genotype by environment interaction for genomic selection. Abstract This paper introduces a single-stage genomic selection approach which directly integrates environmental covariates within a special factor analytic framework. The factor analytic approach of Smith et al. (2001) is an eﬀective method of analysis for multi-environment trial (MET) datasets, but has limited biological interpretation since the underlying factors are latent so the modelled genotype by environment interaction (GEI) is observable, rather than predictable. The advantage of using known environmental covariates, such as soil moisture and daily temperature, is that the modelled GEI becomes directly interpretable, and thence predictable. This paper develops a model for both predictable and observable GEI in terms of a joint set of known and latent factors, as well as non-genetic sources of variation within trials and environments. This single-stage approach is referred to as the integrated factor analytic linear mixed model (IFA-LMM). The IFA-LMM is demonstrated on a late-stage cotton breeding MET dataset from Bayer Crop Science. The results show that the environmental covariates explain 34.6% of the genetic variance across environments (compared to only 23.3% for a conventional regression model). This represents 92.7% of the crossover GEI. The latent factors then explain 40.7% of the genetic variance, which represents 87.6% of the non-crossover GEI. This demonstrates the ability of the IFA-LMM to model crossover and non-crossover GEI in a manner that is both informative and practical to plant breeding.


Introduction
This paper introduces a single-stage genomic selection (GS) approach which directly integrates environmental covariates within a special factor analytic framework. The factor analytic approach of Smith et al. (2001) is an effective method of analysis for multi-environment trial (MET) datasets, including a parsimonious model for genotype by environment interaction (GEI) as well as non-genetic sources of variation. The advantage of using regressions on environmental covariates, such as soil moisture and daily temperature, is that the modelled GEI becomes directly interpretable, and thence predictable. The GS approach developed in this paper exploits the desirable features of both classes of model.
Genomic selection is a form of marker-assisted selection that can improve the genetic gain in animal and plant breeding programmes (Meuwissen et al., 2001). In plant breeding, however, GS is often restricted by the presence of GEI, that is the change in genotype response to a change in environment. There are two appealing features of using environmental covariates for GS. Meaningful biological interpretation can be ascribed to GEI and predictions can be obtained for any tested or untested genotype into any observed or unobserved environment.
Environmental covariates were first used for the analysis of MET datasets by Yates and Cochran (1938). Their work was later popularised by Finlay and Wilkinson (1963), and involved a fixed regression for yield on a set of environment means (covariates) with a specific response (intercept and slope) for each genotype. Knight (1970) extended this approach to explore GEI with regards to a more complex set of covariates, instead of environment means. These approaches have distinct limitations when applied to the analysis of MET datasets, however. Many of which can be overcome by implementing linear mixed models (Smith et al., 2005).
Environmental covariates were first implemented within a linear mixed model by Piepho et al. (1998). Heslot et al. (2014) extended this approach to GS using a random regression on environmental stress covariates derived from daily weather observations. At a similar time, Jarquín et al. (2014) adapted a reaction norm model for a very large set of correlated stress covariates. They found that the genotype (marker) by covariate interactions explained only 23% of the genetic variance across environments. These examples highlight the current limitations of using environmental covariates for GS. That is, they are often highly correlated and only explain a small proportion of GEI (Brancourt-Hulmel et al., 2000).
The factor analytic approach of Smith et al. (2001) includes a multiplicative model for GEI in terms of a small number of common environmental factors. The common factors are effective for modelling repeatable GEI common to all environments in the MET dataset. Any non-repeatable GEI specific to individual environ-ments is then captured by a set of genetic residuals. The model also bears similarities to the ordinary regressions with one important difference. The environment loadings (covariates) are estimated from the data as well as the genotype scores (slopes).
The factor analytic approach has been widely and successfully adopted for the analysis of MET datasets (Oakey et al., 2016;Ukrainetz et al., 2018). The two main variants involve pedigree data (Oakey et al., 2007) or, in terms of GS, molecular marker data. Tolhurst et al. (2019) demonstrated a factor analytic linear mixed model for GS within a major Australian wheat breeding programme. This single-stage approach incorporates an effective model for genotype (marker) by environment interaction as well as non-genetic sources of variation within trials and environments. There is one limitation of this approach, however. The common factors are latent so the modelled GEI is only observable, rather than predictable. This has lead to ad hoc post processing of the latent factors with known environmental covariates (Oliveira et al., 2020).
Until now, the analysis of MET datasets has involved only one set of known covariates or latent environmental factors. The aim of this paper is to extend the factor analytic linear mixed model in Tolhurst et al. (2019) to directly integrate known environmental covariates. This new approach is hereafter referred to as the integrated factor analytic linear mixed model (IFA-LMM). There are three appealing features of the IFA-LMM.
1. The IFA-LMM includes a multiplicative model for GEI in terms of a small number of known and latent environmental factors. This simultaneously reduces the dimension of the environmental covariates as well as the dimension of the environments. 2. The multiplicative model captures predictable GEI in terms of known environmental covariates. This enables direct biological interpretation of the underlying factors. 3. The multiplicative model also captures observable GEI in terms of latent environmental factors, which are orthogonal to the known covariates. This captures any remaining GEI not modelled by the covariates.
The IFA-LMM is demonstrated on a late-stage cotton breeding MET dataset from Bayer Crop Science.
2 Statistical Models

Preliminaries
Assume a general MET dataset comprises t field trials conducted across p environments, where t ≥ p and t = p j=1 t j , with t j being the number of trials in environment j. Also assume that phenotypic and marker data are available on v genotypes. The n-vector of phenotypic data on v genotypes and p environments is given Genomic selection using an integrated factor analytic linear mixed model 3 by y = y ⊤ 1 , y ⊤ 2 , . . . , y ⊤ p ⊤ , where y j = y ⊤ j;1 , y ⊤ j;2 , . . . , y ⊤ j;tj ⊤ is the n j -vector of data for environment j and y j;k is the n jk -vector of data for trial k in environment j. The length of y is therefore given by: Lastly, assume that environmental covariate data are available on q covariates for p d environments. To simplify the derivations of the IFA model, it is assumed that p = p d and p ≥ q, where p is the number of environments with phenotypic data. The p × q matrix of environmental covariate data is therefore given by S = [s 1 s 2 . . . s q ], with columns given by the centred and scaled covariates for each environment. Appendix A presents a computational approach to fitting the IFA-LMM when p > p d . The application to scenarios where p < p d is the topic of current research.

Linear mixed model
The linear mixed model for y can be written as: where τ is a p-vector of fixed effects with n × p design matrix X (assumed to have full column rank), u is a vpvector of random genotype by environment (GE) effects with n × vp design matrix Z, u p is a vector of random non-genetic (peripheral) effects with design matrix Z p and e is the n-vector of residuals. The vector of fixed effects, τ , includes the mean parameter for each environment. This vector is fitted as fixed, rather than random, following a classical quantitative genetics approach in which the GE effects in different environments are regarded as different correlated traits (Falconer and Mackay, 1996). The vector of random non-genetic effects, u p , accommodates the plot structures of trials and environments. Further effects in τ or u p may relate to genotypes without marker data (Tolhurst et al., 2019) or spatial modelling (Gilmour et al., 1997), respectively.

Model assumptions
The vectors of random effects and residuals in Equation 1 are assumed to have a joint normal distribution: Following Tolhurst et al. (2019), G p is diagonal with a separate variance for each set of non-genetic effects and R is block diagonal with a two-dimensional (separable) autoregressive process of order one for each environment (Gilmour et al., 1997). The form of G is developed below, but note that all variance matrices in Equation 2 are fitted at the environment level, not trial level. This completely aligns the non-genetic and residual variance models with the genetic variance model.

Conventional variance models for the GE effects
The GE effects are modelled using molecular marker data, and therefore referred to as the additive GE effects. This formulation is an extension of the univariate genomic BLUP model (Stranden and Garrick, 2009), in which: where M = [m 1 m 2 . . . m r ] is a v × r design matrix with columns given by the centred marker genotypes, u m is a rp-vector of additive marker by environment effects, G e is a p × p variance matrix between environments and K = MM ⊤ /m is the v × v genomic relationship matrix between genotypes (VanRaden, 2008).
The integrated factor analytic model is developed below for u. The application to additive GE effects modelled using pedigree data or non-additive GE effects is straightforward.

Random regression model
A random effects analogue to the regression model in Piepho et al. (1998) is initially considered. This model enables a classical quantitative genetics approach to modelling GEI, which is considered as the lack of correlation between environments.
The additive GE effects modelled using a random regression on environmental covariates are given by: where g is a v-vector of genotype main effects (intercepts), S = [s 1 s 2 . . . s q ] is the p × q matrix of column centred and scaled environmental covariates, γ = γ ⊤ 1 , γ ⊤ 2 , . . . , γ ⊤ q ⊤ is the corresponding vq-vector of genotypespecific regression coefficients (slopes) in which γ i reflects the response of genotypes to the i th environmental covariate and δ = δ ⊤ 1 , δ ⊤ 2 , . . . , δ ⊤ p ⊤ is a vp-vector of residual GE effects (deviations) specific to individual environments. It is assumed that: where σ 2 g is the main effect variance, Σ = ⊕ q i=1 σ 2 i and Ψ = ⊕ p j=1 ψ j are q × q and p × p diagonal variance matrices for covariates and environments, respectively, in which σ 2 i is the variance for the i th covariate and ψ j is the variance for the j th environment. Other joint variance matrices can be used for g and γ, including those which ensure the regression is invariant to linear transformations in S.
The variance matrix for u is therefore given by: where J p is a p × p unit matrix.

Factor analytic model
The factor analytic model (Smith et al., 2001) for GS is now considered. This model is effective for modelling the covariances between GE effects in terms of a small number of latent environmental factors (Kelly et al., 2007). The number, k, is called the order of the model and let FAk denote the factor analytic model with this order. The FAk model for the additive GE effects is given by: where is the corresponding vk-vector of genotype-specific coefficients (scores) in which f l reflects the response of genotypes to the l th latent factor and δ is the vp-vector of residual GE effects. The factor analytic model for u is postulated as a linear combination of latent environment loadings (covariates) and genotype-specific scores (slopes) for each factor, plus residual GE effects (deviations) specific to individual environments. This highlights the analogy to the random regression model in Equation 4. It is assumed that: where Ψ = ⊕ p j=1 ψ j is the p × p specific variance matrix across environments.
The variance matrix for u is therefore given by: 2.3 Integrated factor analytic model The integrated factor analytic (IFA) model was motivated by the desire to ascribe direct biological interpretation to the latent factors underlying the conventional factor analytic model, and thence utilise repeatable GEI for prediction. The model for the additive GE effects, u, in Equation 6 can be modified to directly integrate known environmental covariates. The factor loadings, Λ, and residual GE effects, δ, become parametrised by two orthogonal sources of GEI, that is known covariates and latent environmental factors. The IFA model is therefore postulated as a special factor analytic model, in which: where B = [S Γ] is a p × p matrix of basis functions, S = [s 1 s 2 . . . s q ] is the p × q matrix of column centred and scaled environmental covariates, Γ is a p × (p − q) projection matrix and γ is a vp-vector of regression coefficients which reflect the specific response of genotypes to the known covariates and latent environmental factors. The p × k matrix of joint factor loadings is given by: where Λ s is a q × k matrix of loadings for the known environmental covariates and Λ r is a (p − q) × k matrix of latent environment loadings. The factors underlying Λ s and Λ r are therefore referred to as the known and latent factors, respectively. The projection matrix, Γ, in Equation 8 ensures the known and latent factors are orthogonal. This is achieved by projecting Λ r into the orthogonal complement to the space spanned by S. A convenient choice is: where each column is a p-vector which forms a basis for the orthogonal complement to the range in S. The projection matrix is then taken as the first (p − q) columns in Equation 10: where S ⊤ Γ = 0. This specification ensures that the IFA model has the same number of variance parameters as the conventional factor analytic model in Equation 6. When p ≫ q, however, it may be desirable to use fewer than p − q columns, and thence fewer variance parameters. This enables the IFA model to be scalable to very large p. Other forms of Γ can be used, including a diagonal matrix. The IFA model for the additive GE effects is obtained by substituting the components in Equation 8 into Equation 6, which gives: where the genotype-specific scores, f , reflect the response of genotypes to the joint set of known and latent factors, that is the joint factors. Other forms of f can be used, including the extension to separate sets of factors. The residual GE effects, δ, then model any remaining GEI not explained by the joint factors. The parameterisation for δ in Equation 8 may not be sensible, however, particularly for higher order IFA models in which the percentage of variance explained by these effects is typically very small. This leads to a simplified IFA model given by: which is analogous to the conventional factor analytic model in Equation 6, but with one important difference.
Genomic selection using an integrated factor analytic linear mixed model 5 The linear combination of loadings and scores now involve a joint set of known and latent factors which are parametrised by known covariates and latent environmental factors, respectively. This directly introduces a multiplicative model for interpretable, and thence predictable GEI, into the factor analytic model.

Model assumptions
The random effects in Equation 12 are assumed to have a joint normal distribution: where Ψ = ⊕ p j=1 ψ j is the p × p specific variance matrix across environments.
The IFA variance matrix for u is therefore given by: where G e = ΛΛ ⊤ + Ψ and Λ = B Λ s Λ r .
The known and latent variance matrices between environments are then given by: where the covariance between any pair of environments is different for each source of GEI. Each variance matrix is converted to a correlation matrix for interpretability.

Model features
The IFA variance matrix in Equation 13 is postulated such that the underlying factors model both known and latent sources of GEI, and therefore have direct biological interpretation. The maximum amount of variance is explained by the known environmental covariates, that is SΛ s Λ ⊤ s S ⊤ . Any remaining variance in the underlying factors is then explained by the latent environmental factors, ΓΛ r Λ ⊤ r Γ. The projection matrix, Γ, ensures the known and latent factors are orthogonal. The covariance matrices, SΛ s Λ ⊤ r Γ and ΓΛ r Λ ⊤ s S, reflect the joint regression across both factors.
There are two important results.
1. When p = q; Λ ≡ SΛ s and the additive GE effects are modelled exclusively with environmental covariates: This model is a factor analytic regression on environmental covariates. Genotype main effects (intercepts) can be included where appropriate. 2. When q = 0; Λ ≡ Λ r and the model for u is given by the conventional factor analytic model in Equation 6, in which the underlying factors are not directly interpretable.

Maximum number of factors
The IFA variance matrix in Equation 13 can be compared to an equivalent unstructured form to obtain the maximum number of underlying factors allowed for a particular number of covariates and environments. This comparison can be made using a more general IFA model with different numbers of known and latent factors. Let these numbers be denoted by k s and k r , respectively, and let IFAk s -k r denote the IFA model with this order. The total number of joint factors is then given by k = max(k s , k r ). The maximum number of known and latent factors ensures that G is not over-parametrised. There are two common scenarios.
1. When k s = k r , the total number of joint factors must satisfy: k ≤ min p, 1 2 2p + 1 − 8p + 1 . This is the same condition required for the conventional factor analytic model (Smith, 1999). 2. When k s > k r , the matrix of joint factor loadings in Equation 9 is reformulated as: and the number of known factors must satisfy: where ξ = k s − k r ≤ 8p+1 8(p−q) . Similar results can be obtained when k s < k r , but note that additional column vector/s of zeros would be concatenated to Λ s , instead of Λ r .

Model estimation
The final model is obtained by substituting Equation 12 into Equation 1, which gives: where This model is referred to as the integrated factor analytic linear mixed model (IFA-LMM). The form of the IFA-LMM in this paper is an extension of the genomic BLUP model, and therefore referred to as the additive IFA-LMM. The equivalent extension of ridge-regression BLUP is referred to as the marker-additive IFA-LMM. Typically r ≫ v, so the computational burden in fitting the additive IFA-LMM is substantially less, and hence its inclusion in this paper. Two alternative variants of the IFA-LMM can also be derived. The only difference between these two variants is replacing the genomic relationship matrix in Equation 13 with either a pedigree relationship matrix or a diagonal matrix, respectively. The extension to joint modelling of additive and non-additive GE effects is then straightforward.
All variants of the IFA-LMM were coded in R (R Core Team, 2021) using open source libraries. Appendix B presents a computational approach for fitting the additive IFA-LMM used in this paper. This approach obtains residual maximum likelihood (REML) estimates of the variance parameters using an extension of the sparse formulation of the average information algorithm (Thompson et al., 2003). The REML estimates of the key variance parameters are given by Λ ŝ Λ r andΨ. The empirical BLUPs of the associated random effects are then given byf andδ.

Identifiability and rotation of loadings
When k > 1, the REML estimate of the joint factor loadings, Λ ŝ Λ r , is not unique so that k(k − 1)/2 constraints are required to ensure identifiability during estimation. There are two distinct scenarios for the IFA-LMM.
1. When k s ≥ k r , the k s (k s − 1)/2 elements are constrained in the upper triangle ofΛ s . 2. When k s < k r , the k r (k r − 1)/2 elements are constrained in the upper triangle ofΛ r , instead ofΛ s .
The factor loadings and genotype scores are then rotated to a principal component solution for interpretability. This rotation is given by: whereΛ * s = cΛ s V andΛ * r = cΛ r V. The constant c is chosen as either 1 or -1 to ensure the majority of the first rotated loadings inΛ * are positive (Smith and Cullis, 2018). The k × k rotation matrix V is obtained via the singular value decomposition: The rotation applied in Equation 18 ensures the first rotated factor accounts for the maximum amount of variance, the second accounts for the next largest amount and is orthogonal to the first, and so on. Other orthogonal rotations can be used, including the quartimax (Carroll, 1953) and varimax criteria (Kaiser, 1958).

Model selection
Model selection is achieved using a combination of formal and informal criteria, with emphasis on the former. Formal assessment of model fit is achieved using the Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC). Informal assessment is achieved using the percentage of variance explained by the underlying factors. The overall percentage of variance explained by the known factors is given by: and for individual environments by: Similar measures are obtained for the joint factors, that isv and v, using the variance explained byΛ * Λ * . The final model order is typically chosen such thatv s and v are sufficiently high and the number of environments with low variance explained in v s and v is small. This may require different numbers of known and latent factors (see Section 2.3). Specific examples of the measures above are presented in Section 3.

Model summaries and interpretation
The main limitation of the conventional factor analytic model is that the underlying factors are latent so they do not have direct biological interpretation. The IFA model overcomes this limitation since it integrates known covariates, as well as latent environmental factors.
The empirical BLUP of the additive GE effects is partitioned accordingly: whereβ s andβ r are vp-vectors based on the known and latent factors, so are therefore referred to as the known and latent GE effects, respectively. An important feature of the IFA-LMM is that the known and latent GE effects are orthogonal, that isβ ⊤ sβr = 0. The empirical BLUP of the joint GE effects is then given by: The joint GE effects are based on the joint factors and therefore capture repeatable GEI patterns common to all environments. The known and latent GE effects,β s andβ r , then capture predictable and observable forms of repeatable GEI in terms of known covariates and latent environmental factors, respectively. By comparison, the residual GE effects,δ, capture non-repeatable GEI specific to individual environments (Smith et al., 2015). The percentage of variance explained by the known GE effects,v s , is presented in Equation 20. This reflects the percentage of variance explained by all covariates, which is the percentage of GEI that is directly interpretable and thence predicable.
The percentage of variance explained by individual covariates is given by: wherev c = 1 ⊤ q v c , which is equivalent tov s . Specific examples of this measure are presented in Section 3. Cullis et al. (2014) demonstrated latent regression plots to assess genotype performance and stability in response to the k rotated factors in the conventional factor analytic model. The regression plots developed in this section demonstrate performance and stability in response to the known covariates in the IFA model, as well as the latent environmental factors.

Regression plots for genotypes
The data points on the first l = 1 plot are given by: whereλ * s1 is the q-vector of loadings for the first rotated factor andβ si is the p-vector of known GE effects for the i th genotype, such that: The data points on the remaining l = 2, 3, . . . , k s plots are then adjusted for the preceding rotated factor(s), such that: whereλ * s l is the q-vector of loadings for the l th rotated factor andf * li is the corresponding score for the i th genotype.
-The slope of the regression line is given by the empirical BLUP of the genotype-specific score for that rotated factor, that isf * li . -The data points along the regression line are given by the fitted GE effects for each environment for that rotated factor, that is Sλ * s lf * li . -The deviations from the regression line are given by the additive GE effects for the higher order factors, that is k h=l+1 Sλ * s hf * hi . Similar plots are constructed for the joint GE effects usingΛ * andβ i , which have similar definition to the known GE effects in Equation 26.

Interpretation of rotated factors
Interpretation of the k rotated factors is achieved using the percentage of variance explained by individual factors and covariates.
1. The percentage of variance explained by each rotated factor is given by: wherev s = ks l=1v s l . 2. The percentage of variance in each rotated factor explained by individual covariates is given by: wherev cs l = 1 ⊤ q v cs l is the percentage of variance in the l th rotated factor explained by all environmental covariates. Note thatv cs l = 100% since the known factors are fully explained by the covariates. Also note that the numerator in Equation 29 accounts for the variance of each covariate as well as half the covariance with the remaining covariates.
Similar measures are obtained for the joint factors usinĝ Specific examples of these measures are presented in Section 3.

Application to example MET dataset
The Bayer Crop Science Cotton Breeding Programme evaluates the commercial merit of test genotypes by annually conducting multi-environment field trials. The cotton breeders use the analysis of MET datasets to select and promote superior genotypes through stages of the programme. There are three late-stages of field evaluation, referred to as preliminary commercial P1, P2 and P3. Seed cotton yield is the primary trait considered in this paper, and is measured in pounds per acre (lbs/a). The 2017 P1 MET dataset for cotton yield is detailed below.

Data description
Experimental design and phenotypic data Table 1 presents a summary of the MET dataset. There were t = 72 field trials conducted in p = 24 environments across eight states in the northern, southern and western cotton growing regions of USA ( Figure 1). A total of 208 genotypes were evaluated across all 24 environments. Each environment consisted of three trials. Each trial consisted of a unique set of 68 test genotypes plus four checks. This allocation ensured the MET dataset was balanced between environments, and between trials within environments, that is prior to removing missing plots.
Each trial was designed as a randomised complete block design with two replicate blocks of all genotypes. Yield data was recorded on the majority of plots with 6.54% missing. The number of non-missing plots per test genotype ranged from 39 to 47 (mean of 45). The number of non-missing genotypes in common between environments ranged from 173 to 208 (mean of 204). This allocation produced excellent connectivity in the MET dataset. Table 2 presents a summary of the environmental covariates in the MET dataset. There were q = 18 covariates collated for all environments, including latitude and longitude coordinates. The remaining 16 covariates consisted of 11 weather and 5 soil observations that were collated using daily meteorological data collated by Bayer Crop Science. Table 2 summarises the observations for each covariate across the northern, southern and western cotton growing regions. Each covariate was then centred and scaled for use in the subsequent analyses.

Marker data
Molecular marker data were available for v = 204 (of the 208) genotypes, coded according to a high confidence set of 36,009 single-nucleotide polymorphisms (SNPs). For each marker, genotypes were coded as either 1 (homozygous for the major allele), 0 (heterozygous) or -1 (homozygous minor allele). The frequency of heterozygous marker scores was low given the level of selfing accumulated up to the P1 stage. Monomorphic markers were then removed and missing markers were imputed using the k-nearest neighbour approach of Troyanskaya et al. (2001) with k = 10.
The genomic relationship matrix K in Equation 3 was constructed using the pedicure package (Butler, 2021) in R (R Core Team, 2021). The default settings in pedicure were used as filters, with minor allele frequency > 0.0002% and missing marker frequency < 0.9998%. A total of r = 24, 265 markers were retained using this criteria. The diagonal elements in K ranged from 0.004 to 2.022 (mean of 1.234). The off-diagonals ranged from -0.388 to 1.322 (mean of -0.006). Table 3 presents a summary of the conventional linear mixed models fitted to the MET dataset. The important results from each model are detailed below.

Conventional linear mixed models
The analyses commenced by fitting a diagonal variance structure to the additive GE effects in different environments. This process reflects the initial singlesite analyses routinely performed on MET datasets, in which the GE effects in different environments are assumed to be uncorrelated. The single-site analyses are typically used to accommodate the experimental design, address spatial variations and identify potential outliers. This highlights three important processes.
1. Replicate blocks effects within trials and trial effects within environments were fitted to accommodate the plot structures of trials and environments (Bailey, 2008). 2. Random column effects were fitted for seven environments to address extraneous variation in the column direction (Gilmour et al., 1997). Random row effects were fitted for one environment only. 3. Few outliers were detected for six environments. Table 1 presents the environment mean yield and generalised narrow-sense heritability (reliability) following Cullis et al. (2006). Both measures vary substantially between environments and growing regions. A compound symmetric model was then fitted to the additive GE effects, which reflects many current applications of GS in plant breeding. The distinguishing feature of this model is that the GE effects in different environments are now assumed to be correlated. The form of the compound symmetric model is very restrictive, however, since it comprises a single variance component for genotype main effects and genotype by environment interaction effects. Table 3 shows the inferiority of this model in terms of the formal model selection criteria, AIC and BIC, as well as the informal measure, percentage of variance explained. Note that the latter measure was calculated as the percentage explained by the main effects alone, that isv = 36.2%.

Random regression model
The next linear mixed model involved a random regression on the set of 18 environmental covariates presented in Table 2. This model enables interpretation of the covariances between environments in terms of known environmental covariates. Table 3 shows the superiority of this model compared to the diagonal and compound symmetric models. In this case, the percentage of variance explained was calculated as the percentage explained by the genotype intercepts (main effects) as well as the genotype-specific slopes on each covariate, that isv = 58.4%. The percentage explained by the covariates alone is 23.3%.

Factor analytic model
The next series of linear mixed models involved fitting a conventional factor analytic model to the additive GE effects. This model provides a parsimonious alternative to the unstructured model for the covariances between environments. There were k = 4 factors required to reach an adequate fit and acceptable percentage of variance explained, that isv = 75.2%. Half of the 24 environments achieved at least 70% variance explained, however, two environments had <50% (omitted for brevity). Table 3 shows the superiority of the FA4 model compared to all remaining conventional linear mixed models, including the higher order models which proved unnecessarily complicated.

Additive IFA-LMM
The analyses then proceeded by fitting a series of additive IFA-LMMs using the formal and informal model selection criteria. The full sequence of models is presented in Table 4.
There were k s = 4 known and k r = 3 latent factors (i.e. k = 4) required to reach an adequate fit and acceptable percentage of variance explained by these factors, that isv s = 34.6% andv r = 40.7% (v = 75.3% in total). The former measure reflects the percentage of repeatable GEI common to all environments that is directly interpretable and thence predicable. The IFA4-3 model captures more predictable GEI than the conventional random regression, that isv s = 34.6% compared to only 23.3%, despite using the same set of environmental covariates. The remaining 24.7% not explained by the known and latent factors reflects any residual (nonrepeatable) GEI specific to individual environments. Table 4 shows the superiority of the IFA4-3 model compared to all remaining IFA-LMMs, as well as the conventional linear mixed models in Table 3. These tables also show the equivalence of the IFA-LMM and the conventional factor analytic linear mixed models, where the model selection criteria are the same for FA1 and IFA1-1, FA2 and IFA2-2, and so on. The distinguishing feature of the IFA-LMM is the ability to jointly model predictable and observable GEI using different numbers of known and latent factors. The total number of genetic parameters estimated in the IFA4-3 model is fewer than in the FA4 model, that is 108 compared to 114, despite very similar model selection criteria. Table 5 presents a summary of the cotton growing environments from the final additive IFA-LMM. The variance of each environment ranged from 7,548.0 to 47,574.2 (497,423.0 overall). These variances are obtained from the denominator in Equation 21. The percentage of variance explained by the known factors, v s , ranged from 12.7% to 86.3% (v s = 34.6%). The percentage explained by the joint factors, v, ranged from 44.9% to 100.0% (v = 75.3%). That is, the variance explained by the joint GE effects is substantially higher than the variance explained by the known GE effects, and thence by the environmental covariates alone. Table 5 also presents the REML estimates of the environment loadings for each rotated factor, that iŝ λ * l . The first rotated factor comprises positive loadings and explainsv 1 = 43.9% of the variance. The higher order factors comprise both positive and negative loadings and explainv 2:4 = 31.4%. The fitted GE effects for the first rotated factor therefore capture non-crossover GEI while the higher order factors primarily capture crossover GEI (Smith and Cullis, 2018). This is an important feature of the IFA-LMM, and will be discussed below. Table 6 presents a similar summary for the environmental covariates. The covariance explained by each covariate ranged from -261,200.8 to 199,616.7 (172,040.9 overall). These covariances are obtained from the numerator in Equation 24. The percentage of variance explained by individual covariates, v c , ranged from -52.5% to 40.1% (v s = 34.6%). The percentages can be positive or negative, since this measure involves (positive or negative) covariances between environmental covariates. Notable covariates here include maximum dew point temperature, maximum downward solar radiation and maximum temperature, as well as average cloud cover, minimum temperature and minimum humidity. Table 6 also presents the REML estimates of the rotated factor loadings for each environmental covariate, that isΛ * s . The interpretation of these loadings is similar to above, but with one important difference. All rotated factors comprise both positive and negative loadings so the additive GE effects for all rotated factors capture non-crossover GEI. The first rotated factor explains onlyv s1 = 5.4% of the variance and the higher order factors explainv s2:4 = 29.2%. The practical implication of this will be discussed below. Figure 2 presents the REML estimates of the genetic correlations between environments in terms of the a known or b joint GE effects. Both sets of correlations are ordered based on a dendrogram (applied to b), where environments closer together are more correlated, and have more similar genotype rankings, than those further apart. The colourkey ranges from 1 (complete agreement in rankings) through zero (dissimilarity in rankings) to -1 (complete reversal and crossover of rankings). Figure 2 suggests there is structure to the modelled forms of GEI. There are four notable features.

Genetic correlations between environments
1. The correlations for the joint GE effects are predominately higher than the correlations for the known GE effects, that is due to the environmental covariates alone. 2. The highest correlations generally occur between environments in the same growing region. Environments along the east coast of the USA are also very similar as well as those along the Mississippi river, that is regardless of region ( Figure 1). 3. The lowest correlations correspond to 17NC1. This environment is not similar to the remaining environments in the northern growing region, but is similar to those environments along the east coast. 4. The correlations in the northern and southern growing regions are well explained by maximum temperature. The correlations in the western growing region are well explained by maximum dew point temperature, maximum downward solar radiation and maximum temperature.
3.5 Regression plots for genotypes Figure 3 presents a series of regression plots for check genotypes C1 and C2 in terms of the a known or b joint GE effects. The first of these plots are used to assess genotype performance and stability in response to the k s = 4 rotated factors, that is in response to the environmental covariates alone. Since the first rotated factor comprises both positive and negative loadings (Table 6), the pair of regression lines intersect and the genotype rankings change across the sample of environments. For example, the fitted GE effects for C2 are higher than C1 for environments with negative loadings, but the converse is true for environments with positive loadings. This will be the case for any pair of genotypes in the MET dataset and indicates that changes in performance are attributed to both crossover and noncrossover GEI. Similar interpretation can be made for the higher order factors.
Figure 3a also demonstrates that the second rotated factor is correlated with longitude (ρ = 0.80), where the loadings on the left correspond to environments in the northern and southern growing regions while the loadings on the right correspond to the western growing region. This highlights an important limitation of the conventional factor analytic linear mixed model, where interpretation is often limited to post processing of the rotated loadings. The distinguishing feature of the IFA-LMM is the ability to ascribe direct biological interpretation to the rotated factors in terms of the known environmental covariates. This feature will be discussed below.
The regression plots in Figure 3b also summarise genotype performance and stability, but in response to the k = 4 rotated factors, that is in response to the known covariates as well as the latent environmental factors. In this case, the first rotated factor comprises only positive loadings (Table 5), so the pair of regression lines never intersect and the genotype rankings never change (crossover). This is an important difference to the plots in Figure 3a, and indicates that changes in performance are attributed to non-crossover GEI only. Furthermore, since the lines diverge, the difference between the fitted GE effects increases as the loadings increase. This demonstrates the ability of factor analytic models to capture heterogeneity of scale (non-crossover) variance between environments (Smith and Cullis, 2018). The higher order factors comprise both positive and negative loadings, so the additive GE effects primarily capture crossover interactions only. Figure 3b suggests check C1 is predominately higher performing than C2 across the sample of environments since it has a large predicted score (slope) for the first rotated factor. Both checks are notably unstable, however, since they are highly responsive to the higher order factors and therefore have a large spread of additive GE effects (deviations) about the first factor regression. Table 7 presents important supplementary information to the regression plots in Figure 3, and is used to ascribe biological interpretation to the rotated factors. This table presents the percentage of variance in each rotated factor explained by each environmental covariate. In terms of the known GE effects, the percentage of variance explained by individual covariates, v cs l , ranged from -454.9 to 223.4 (v cs l = 100%; l = 1, 2, 3 or 4). The same measure for the joint GE effects, v c l , ranged from only -162.8 to 197.1 (v c1 = 12.4% andv c l > 89%; l = 2, 3 or 4). This highlights three distinct features.

Interpretation of rotated factors
1. The percentages for the joint GE effects are predominately larger than the percentages for the known GE effects, that is due to the environmental covariates alone. 2. The percentages can be positive or negative, since this measure involves (positive or negative) covariances between environmental covariates. 3. Notable covariates include maximum dew point temperature, maximum downward solar radiation and maximum temperature, as well as average cloud cover, minimum temperature and minimum humidity. Table 7 suggests the environmental covariates primarily model crossover GEI, with a substantial percentage of variance in the higher order joint factors explained by covariates alone, that isv c2:4 = 92.7%. This feature can be visualised in Figure 3 where the regression plots for the higher order factors are very similar for the known and joint GE effects. By comparison, there is stark dissimilarity between the regression plots for the first rotated factor since the first joint factor is not well explained by covariates, that isv c1 = 12.4%.

Discussion
This paper introduced an integrated factor analytic linear mixed model (IFA-LMM) for GS in plant breeding, or more generally a single-stage approach for the analysis of MET datasets. The IFA-LMM is a special factor analytic approach which directly integrates environmental covariates. This approach has the potential to become the preferred method of analysis in global plant breeding programmes. There are two appealing features of using environmental covariates for GS. That is, meaningful biological interpretation can be ascribed to GEI and forward predictions can be obtained for any genotype into any environment. This represents a long-standing objective of plant breeding programmes. Many of the current approaches, however, fail to appropriately accommodate the general structure of the phenotypic and covariate data, which has limited their ability to adequately model genotype (and marker) by environment interaction (Heslot et al., 2014;Jarquín et al., 2014).
The factor analytic approach of Smith et al. (2001) provides an effective model for repeatable GEI in terms of a small number of common environmental factors. This approach, or some variant, has been widely and successfully adopted for the analysis of MET datasets (Beeck et al., 2010;Ukrainetz et al., 2018). In terms of GS, Tolhurst et al. (2019) demonstrated a singlestage factor analytic linear mixed model which incorporates molecular marker data and appropriately accommodates the general structure of MET datasets. There is one limitation of this approach, however. The common factors are latent so the modelled GEI is only observable, rather than predictable.
The approach developed in this paper models predictable and observable GEI in terms of a joint set of known and latent factors, respectively. This reflects a special factor analytic linear mixed model which directly integrates known environmental covariates. There are three appealing features of the IFA-LMM.
1. The IFA-LMM includes a multiplicative model for GEI in terms of a small number of known and latent factors. This simultaneously reduces the dimension of the covariates as well as the environments. 2. The multiplicative model captures predictable GEI in terms of the covariances between environmental covariates. This introduces known environmental information into the underlying factors so they now have direct and meaningful interpretation. 3. The multiplicative model also captures observable GEI in terms of the latent covariances between environments. This captures any remaining GEI which is strictly orthogonal to the covariates.
These features address the long-standing objective of plant breeding programmes. The IFA-LMM is embedded within an extension of the computationally efficient reduced-rank formulation of the average information algorithm (Appendix B). Thompson et al. (2003) developed the reduced rank formulation for conventional factor analytic models assuming independent relationships between genotypes, and thence exploited the maximum sparsity of MET datasets. Kelly et al. (2007) extended this formulation to incorporate a (sparse) numerator relationship matrix. The extension to a dense genomic relationship matrix involves complex matrix factorisations and finding more efficient formulations of the IFA-LMM is the topic current research.
The IFA-LMM was demonstrated on a late-stage cotton breeding MET dataset from Bayer Crop Science. The example MET dataset is typical of plant breeding data in terms of the large number of genotypes compared to the number of trials and environments, that is v ≫ t ≥ p. This dataset was used to conveniently illustrate the concepts and methods developed in Section 2. However, it only represents a small subset of trials in the Bayer Cotton Breeding Programme and is not intended to capture the full target population of production environments. A larger MET dataset across multiple years and more locations is common in such applications, since transient and static interactions are substantial components of GEI within the full target population. This will ensure the scope of the covariates, loadings and scores are relevant to the forward prediction of new (untested) genotypes into new (unobserved) environments. Computational challenges are anticipated for these larger MET datasets and finding efficient ways to scale the IFA-LMM is the topic of current research.
There are three important results from Section 3.
1. The selected IFA-LMM involves fewer genetic parameters compared to the selected factor analytic linear mixed model, despite very similar model selection criteria (Tables 3 and 4). This highlights a distinct advantage of integrating known covariates within a special factor analytic framework. 2. More variance is explained by the environmental covariates in the selected IFA-LMM compared to the conventional random regression, that isv s = 34.6 compared to 23.3%. This represents 92.7% of the crossover GEI captured by the multiplicative model, and highlights the importance of appropriately modelling the covariates. 3. The latent factors then explain 40.7% of the variance across environments, which represents 87.6% of the non-crossover GEI. This feature can be visualised in Figure 2 where there is stark dissimilarity between the GEI patterns for the known and joint GE effects.
These results demonstrate the importance of categorising environments based on known covariates as well as latent factors.
Interpretation of the IFA-LMM was demonstrated using a series of regression plots (Figure 3). The regression plots are used to assess genotype performance and stability in response to the known and latent factors, that is in response to predictable and observable GEI. Previously, interpretation of genotype performance and stability has been limited to post processing of model terms, for example by correlating environmental covariates with latent factors (Oliveira et al., 2020) or by examining the response of reference and probe genotypes (Mathews et al., 2011). The distinguishing feature of the IFA-LMM is the ability to ascribe direct biological interpretation to the rotated factors in terms of the known environmental covariates, such as maximum temperature and minimum humidity.
The integration of environmental covariates within a single-stage IFA-LMM for GS provides breeders with an informative and practical framework to utilise repeatable GEI for selection.

Data Availability
The data that support the findings of this study are available from Bayer Crop Science. Restrictions apply to the availability of these data, which were used under license for this study. Data-Driven Innovation Chancellor's fellowship.

Author Contributions
DT developed the methodology, curated the data, ran the analyses and wrote the manuscript. CG and BG provided input on plant breeding perspectives. BG also provided the data. JH organised the research project and secured funding. GG reviewed the manuscript. All authors have read and approved the manuscript for submission.

Conflict of Interest
The authors declare that they have no conflict of interest.

References
Bailey RA (2008)      Note: 128 non-genetic and residual variance parameters estimated in all models. The selected IFA-LMM is distinguished with bold font. Table 5 The selected additive IFA-LMM, Part 1: Summary of growing environments. Presented for each environment is the mean yield (lbs/a), REML estimate of the variance and percentage of variance explained by the known (v s ) or joint (v) factors, as well as the estimates of the rotated factor loadings (λ * l ; l = 1, 2, 3 or 4).

State
Env Note: The overall percentage of variance explained by the known (v s ) or joint (v) factors, and by each rotated factor (v l ) is presented in the final row. Table 6 The selected additive IFA-LMM, Part 2: Summary of environmental covariates. Presented is the REML estimate of the covariance explained by each covariate and percentage of variance explained by each covariate (v c ), as well as the estimates of the rotated factor loadings (λ * s l ; l = 1, 2, 3 or 4).

Cov
Covar Note: The overall percentage of variance explained by the known factors (v s ), and by each rotated factor (v s l ) is presented in the final row. Table 7 The selected additive IFA-LMM, Part 3: Interpretation of rotated factors. Presented is the percentage of variance in each rotated factor explained by each environmental covariate in terms of the known (v cs l ) or joint (v c l ) GE effects (l = 1, 2, 3 or 4). Note: The percentage of variance in each factor explained by all covariates (v cs l orv c l ) is presented in the final row.