Simulations of genomic selection accuracy and model updating across multiple breeding strategy scenarios in common bean

doi:10.21203/rs.3.rs-2097712/v1


Research Article



This work is licensed under a CC BY 4.0 License


Genomic selection predicts the breeding value of selection candidates according to genotypes that are estimated to have favorable effects based on a model. The effectiveness of genomic selection is strongly tied to its prediction accuracy. Previous studies have evaluated the accuracy of genomic selection using simulations. The aim of this study was to evaluate changes in accuracy of genomic selection based on many known QTLs identified in the literature and determine their relationship with true breeding values. Simulation results revealed that correlation-based prediction accuracies (also referred to as realized accuracy) fluctuate depending on trait genetic architecture, breeding strategy and the number of initial parents involved in the breeding program. Generally, maximum accuracies were achieved under a mass selection strategy followed by pedigree and single seed descent methods. Model updating benefitted some breeding strategies more than others (e.g., single seed descent vs mass selection). For low heritability traits (i.e., yield), conventional methods provided comparable rates of genetic gain, but genetic gain under genomic selection reached a plateau in a lower number of cycles.

genomic selection

selection strategy

mass selection

pedigree

single seed descent

common bean

genetic gain

favorable allele fixation

This paper investigates selection accuracies using plant breeding simulations with and without genomic selection model updating. Trait architecture and breeding strategy impact genomic prediction accuracy more than initial parent population size.

Genomic selection

First described by Meuwissen et al. (2001), genomic selection (GS) is a technique that can make use of information from genetic markers. With advances in DNA technology and declining costs for genotyping, breeders can now gain access to high density marker datasets. In particular, genome wide association studies (GWAS) have allowed for the discovery of high-resolution genomic signals while using fewer resources (Tibbs Cortes et al., 2021). The quantitative trait loci (QTL) identified in this way are predicted to contribute to the phenotype of a trait. In the past, markers that were closely linked to a QTL could be used to select individuals with a desired allele. Much of the success of marker assisted selection (MAS) was in traits that were controlled by single genes of major effect. Application of MAS to polygenic traits, or traits controlled by many genes, has seen less success. Even with high density markers, there are limitations to the effectiveness of MAS. This is largely because the linkage phase between a marker and a QTL must be determined each time before its use.

GS is a method that uses estimated breeding values based on model-predicted phenotypic values to advance selection candidates to the next generation. GS differs from MAS in that all markers across the genome are used for prediction. A training population, where individuals are both genotyped and phenotyped, is first used to train a model. Then, the model is applied to a testing population, where individuals have only been genotyped, to predict their phenotypes and assign a genomic estimated breeding value (GEBV) to each individual. The advantage of genomic selection is that it has the potential to save the time and resources that would normally be put towards phenotyping individuals. This is because selection candidates would only need to be genotyped, so only genotyping costs would need to be considered. Depending on the genetic architecture of the trait, GS also has the potential to increase selection accuracy (Massman et al., 2013).

Factors impacting genomic selection accuracy

The main drawback to the use of genomic selection is the accuracy with which the model can predict phenotypes from the genotypes. Genomic selection has been widely used in animal breeding programs. For example, in dairy cattle, one study found annual genetic gain increases of 33 to 77% in three different breeds following the implementation of genomic selection (Doublet et al., 2019). Despite the promising findings for animal breeding, the move towards implementing genomic selection in plants, especially for complex quantitative traits, has been slow. This is likely due to a number of factors that affect the accuracy of genomic selection.

Training population size and trait heritability

Several studies have found that the training population size greatly impacts the accuracy of genomic selection. A larger training population may increase accuracy by up to 20%. Furthermore, the heritability of a trait can impact the training population size required for accurate predictions, especially when h² is less than 0.4. For example, to obtain an accuracy of 0.7, a training population size of 9000 is required for a trait with h² = 0.2 if the effective population size is 1000. This contrasts sharply with a training population size of 3000 when the trait heritability is 0.5 (Lorenz et al., 2011).

Population structure

Accounting for population structure is a key factor for successfully implementing GS. Isidro et al. (2015) demonstrated that stratifying populations can improve the accuracy of GS. Another group of researchers considered the effects of relatedness between individuals when designing a training population. GS accuracy was determined to be highest when individuals in the training population were closely related to individuals in the testing population. Furthermore, in cases where relatedness is low, increasing the diversity of a training population can improve accuracy (Norman et al., 2018).

Genomic selection model

Several different models are available for predicting marker effects. Heslot et al. (2012) previously compared the effectiveness of 11 GS models. These included random regression best linear unbiased prediction (rrBLUP), Bayesian ridge regression (BRR), Bayesian Lasso (BL), BayesB, weighted Bayesian shrinkage regression (wBSR), BayesCπ, empirical Bayes (E-Bayes), elastic net, reproducing kernel Hilbert space (RKHS), support vector machine (SVM), random forest (RF), and neural network (NNET). The authors recommended the use of rrBLUP, BL, and wBSR due to their ease of implementation, versatility, and limited overfitting. They noted that BayesCπ was not an ideal model due to its high computational time. Meanwhile, E-Bayes and NNET both led to overfitting, with E-Bayes also having reduced accuracy and NNET requiring more computational power. Interestingly, RKHS also resulted in overfitting; however, its accuracy was not impacted, meaning that while the model picked up more noise, it was also able to capture more genetic signal. RF led to promising accuracies, but may require more validation before being established as a GS model (Heslot et al., 2012).

Model update

A simulation study conducted based on a sorghum breeding program found that updating the genomic selection model every year can increase genetic gains up to 39% (Muleta et al., 2019). Accuracy is greater when the training population contains individuals in the same generation as the selection candidates. In essence, as the number of generations separating the training population and selection candidates increases, the accuracy will decrease. Thus, model updates are required to ensure the genomic selection accuracy is maintained (Heffner et al., 2010).

Although genomic selection has been widely implemented in animal breeding, its use in plant breeding still requires further validation, particularly when dealing with pulse crops (e.g. dry beans). The objective of this study was to investigate the accuracy of genomic selection in a simulation study in dry beans. Five breeding strategies were simulated with the selection of three traits. The following hypotheses were tested:

Selection strategy influences genomic selection accuracy and genetic gain.

GS model updating influences genetic gain and prediction accuracy to varying degrees depending on selection strategy, trait heritability, and number of initial parents.

By accounting for population size, heritability and number of independent loci, GS accuracy can be predicted.

Simulation Parameters

The construction of the various phenotypic selection simulation scenarios used for the experiments reported here can be found in further detail in Lin et al. (2022). In this experiment, five breeding strategies (mass selection, bulk breeding, single seed descent, pedigree method and modified pedigree method) with a parental population size of 30 were simulated using the QuLinePlus module (Hoyos-Villegas et al., 2019). Selection was carried out on three traits: days to flowering (DF), white mold tolerance (WM), and seed yield (SY). The simulation was conducted with 20 runs, each with 6 cycles.

Implementing the GS Model

The package ‘rrBLUP’ (Endelman, 2011) was used to determine the marker effects to simulate GS in the study. The model for rrBLUP is shown in Eq. 1:

\(y = X\beta + Zu + \epsilon\) [1]

where y is a vector of phenotypes, X is a design matrix for the fixed effects β, Z is a design matrix for the random effects u, where u ~ N(0, Kσ²_u), and ε is the vector of residuals. The genotypes and phenotypes were obtained from the parental population generated using the simuPOP program (Peng & Kimmel, 2005). Genomic selection was simulated by using the mixed.solve function in the R package rrBLUP to estimate the effect of each marker on the phenotype. Once the marker effects were determined, they were input into the “MarkerEffects” sheet with a genotyping error rate of 5%. From there, individuals were selected on the sum of the marker effects (genomic estimated breeding value) rather than the QTLs as in conventional breeding.
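The two steps described above (estimating marker effects, then summing them into a GEBV) can be sketched in a few lines. This is only an illustrative sketch, not the code used in the study: it is written in Python rather than R, it uses a ridge solution with a fixed shrinkage parameter `lam` instead of the REML variance estimates that mixed.solve provides, it takes K as the identity matrix, and it treats the phenotypic mean as the only fixed effect. The function names, the -1/0/1 marker coding, and the toy data are all assumptions made for the example.

```python
def ridge_marker_effects(Z, y, lam):
    """Solve (Z'Z + lam*I) u = Z'(y - mean(y)) for the marker effects u.

    Assumption: lam is supplied by the user; rrBLUP would instead derive it
    from REML estimates of the residual and marker variances.
    """
    n, m = len(Z), len(Z[0])
    y_mean = sum(y) / n
    yc = [v - y_mean for v in y]  # centre phenotypes (mean as the only fixed effect)
    # Build the normal equations A u = b.
    A = [[sum(Z[i][j] * Z[i][k] for i in range(n)) + (lam if j == k else 0.0)
          for k in range(m)] for j in range(m)]
    b = [sum(Z[i][j] * yc[i] for i in range(n)) for j in range(m)]
    # Gaussian elimination with partial pivoting.
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    u = [0.0] * m
    for r in range(m - 1, -1, -1):
        s = sum(A[r][c] * u[c] for c in range(r + 1, m))
        u[r] = (b[r] - s) / A[r][r]
    return u


def gebv(genotype, effects):
    """GEBV as the sum of marker effects weighted by the marker genotype codes."""
    return sum(g * e for g, e in zip(genotype, effects))


# Toy training data (hypothetical): 4 individuals, 2 markers coded -1/0/1.
Z = [[1, 0], [0, 1], [-1, 0], [0, -1]]
y = [2.0, 1.0, -2.0, -1.0]
effects = ridge_marker_effects(Z, y, lam=2.0)
candidate_gebv = gebv([1, 1], effects)  # a genotyped-only selection candidate
```

Selection candidates would then be ranked by `gebv` rather than by phenotype; a larger `lam` shrinks all marker effects more strongly towards zero.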

GS Model Updating

A previous study (Lin et al., 2022) obtained GS accuracies from simulations among various selection strategies. However, the model reported in Lin et al. (2022) was not updated during the simulation. To simulate updating the GS model, a parental population generated with simuPOP (Peng and Kimmel, 2005) was used as the reference population in the first 3 cycles. At the end of the third cycle, a sample of selected individuals was used as the reference population to estimate new marker effects. From cycle 4 to 6, selection was conducted on the basis of the newly estimated marker effects. GS with and without model updating was simulated under each of the scenarios described in the Simulation Parameters section above (5 selection strategies and 3 traits with 20 runs and 6 cycles).

GS Model Accuracies

GS accuracies were assessed via two estimates: first based on Eq. 3 (predicted accuracy) and second based on the correlation between TBV and GEBV (realized accuracy). Using these accuracies, the influence of selection strategy and genetic architecture on GS model accuracy was assessed.

In Silico Realized GS accuracy

The outputs from the simulations were used to estimate the accuracy of genomic selection. More specifically, the information regarding the breeding values was obtained from the .pou files generated by the QuLinePlus module. The genotypic values from the conventional breeding framework were considered to be the true breeding values (TBV), since selection candidates are chosen based on the QTLs provided to the simulation platform. On the other hand, the genotypic values from the genomic selection framework represented the genomic estimated breeding values (GEBV), since individuals were selected based on estimated marker effects.

QU-GENE performs selection on a per-plot basis. Thus, the correlation between the family mean GEBV and TBV values was used as a direct estimate of realized genomic selection accuracy. Original code was used to conduct calculations and correlations and may be found on the lab GitHub (McGill Pulse Breeding and Genetics Laboratory, 2022).
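As an illustration of the realized accuracy described above, the following sketch computes a Pearson correlation between family mean TBVs and GEBVs. It is not the code from the lab repository; the function name and the choice to return `None` when either set of values has zero variance (in which case the correlation is undefined) are assumptions made for this example.

```python
import math


def realized_accuracy(tbv, gebv):
    """Pearson correlation between family mean TBV and GEBV.

    Returns None when either variable has zero variance, because the
    correlation is then undefined.
    """
    n = len(tbv)
    if n != len(gebv) or n < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    mean_t = sum(tbv) / n
    mean_g = sum(gebv) / n
    ss_t = sum((t - mean_t) ** 2 for t in tbv)
    ss_g = sum((g - mean_g) ** 2 for g in gebv)
    if ss_t == 0 or ss_g == 0:
        return None  # all families share one value: correlation undefined
    cov = sum((t - mean_t) * (g - mean_g) for t, g in zip(tbv, gebv))
    return cov / math.sqrt(ss_t * ss_g)
```

Returning `None` rather than raising an error lets a per-cycle loop record the undefined cycles explicitly and continue with the remaining cycles.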

Expected Formula-based GS Accuracy

Daetwyler et al. (2010a) described several components that impact the accuracy of GS. The authors derived a formula for GS accuracy, as follows:

\({r}_{g\widehat{g}G}=\sqrt{\frac{{N}_{p}{h}^{2}}{{N}_{p}{h}^{2}+{n}_{G}}}\) [2]

where N_p refers to the number of individuals in a training population, h² is the heritability, and n_G is the number of independent loci. Based on the derived formula, the accuracy of GS is influenced by the heritability of the trait, the number of individuals in the training population, and the number of loci being considered. As LD will result in some of the loci being linked, the number of independent chromosome segments (M_e) should be used in place of n_G (Daetwyler et al., 2010b). In addition to this, the parental population generated using simuPOP will have adequate LD, which must be accounted for. By replacing n_G with M_e, one can derive Eq. 3:

\({r}_{g\widehat{g}G}=\sqrt{\frac{{N}_{p}{h}^{2}}{{N}_{p}{h}^{2}+{M}_{e}}}\) [3]

where N_p refers to the number of individuals in a training population, h² is the heritability, and M_e is the number of independent chromosome segments. Eq. 3 was the formula used to obtain predicted GS accuracies. Eq. 4 was used to calculate M_e:

\({M}_{e}=\frac{2{N}_{e}L}{{log}\left(4{N}_{e}L\right)}\) [4]

where N_e is the effective population size, and L is the genome length in Morgans.
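Eqs. 3 and 4 can be written as two small functions to make the calculation concrete. This is a sketch under stated assumptions rather than the study's R code: the function names are invented for the example, and the logarithm in Eq. 4 is assumed to be the natural logarithm. The genome length, training population size, and heritability in the usage lines are hypothetical values chosen only for illustration; the N_e value is arbitrary as well.

```python
import math


def independent_segments(n_e, genome_morgans):
    """Eq. 4: M_e = 2 * N_e * L / log(4 * N_e * L), natural log assumed."""
    return 2.0 * n_e * genome_morgans / math.log(4.0 * n_e * genome_morgans)


def expected_accuracy(n_p, h2, m_e):
    """Eq. 3: expected accuracy from training size, heritability and M_e."""
    return math.sqrt(n_p * h2 / (n_p * h2 + m_e))


# Hypothetical inputs: N_e = 100, L = 10 Morgans, N_p = 30, h2 = 0.5.
m_e = independent_segments(n_e=100, genome_morgans=10.0)
acc = expected_accuracy(n_p=30, h2=0.5, m_e=m_e)
```

Because M_e only appears in the denominator of Eq. 3, any increase in N_e or L lowers the expected accuracy, while a larger N_p or h² raises it.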

Equation 4 requires the estimation of N_e. To accomplish this, the variance effective size estimator described by Waples (1989) was used, and is shown in Eq. 5:

\({\widehat{N}}_{e}= \frac{t}{2\left[{\widehat{F}}_{c}- \left(\frac{1}{{2S}_{0}}+\frac{1}{{2S}_{t}}\right)\right]}\) [5]

where t refers to the number of generations that have elapsed between the two sampled populations, \({\widehat{F}}_{c}\) refers to the estimator for the standardized variance of gene frequency changes at a single locus, and S₀ and S_t indicate the sample sizes of the population at time 0 and time t, respectively. The estimator F_c can be written as:

\({F}_{c}=\frac{1}{k}\sum _{i=1}^{k}\frac{{({x}_{i} - {y}_{i})}^{2}}{({x}_{i}+{y}_{i})/2 - {x}_{i}{y}_{i}}\) [6]

where k is the number of alleles, x_i is the observed allele frequency at time 0, and y_i is the observed allele frequency at time t.
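A minimal sketch of Eqs. 5 and 6 follows. The function names are assumptions, and the code is not taken from the lab repository. It assumes every allele passed in is segregating in at least one of the two samples (otherwise the denominator in Eq. 6 is zero), and it returns `None` when the drift signal does not exceed the sampling correction, since Eq. 5 then gives a non-positive or infinite estimate.

```python
def f_c(x, y):
    """Eq. 6: standardized variance of allele frequency change.

    x and y hold the frequencies of the same k alleles at time 0 and time t.
    """
    k = len(x)
    if k == 0 or k != len(y):
        raise ValueError("x and y must be non-empty and of equal length")
    total = sum((xi - yi) ** 2 / ((xi + yi) / 2.0 - xi * yi)
                for xi, yi in zip(x, y))
    return total / k


def variance_effective_size(fc, t, s0, st):
    """Eq. 5: temporal estimate of N_e from F_c, t generations apart."""
    denom = 2.0 * (fc - (1.0 / (2.0 * s0) + 1.0 / (2.0 * st)))
    if denom <= 0:
        return None  # drift not distinguishable from sampling error
    return t / denom


# Toy example (hypothetical frequencies at one biallelic locus).
fc = f_c([0.5, 0.5], [0.6, 0.4])
ne = variance_effective_size(fc, t=1, s0=50, st=50)
```

With several loci, the per-locus `f_c` values would be averaged before being passed to `variance_effective_size`.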

In our study, average F_c estimates for all loci were determined and used to determine N_e. From there, genomic selection accuracy was estimated using Eq. 3. One of the output files generated by QU-GENE is the .fre file, which provides the allele frequencies at each marker. These allele frequencies were entered into Eq. 6, and subsequently used to calculate N_e in Eq. 5, M_e in Eq. 4, and finally the expected accuracy in Eq. 3. All calculations were performed using original code written in R, which can be found on the lab GitHub page (McGill University Pulse Breeding and Genetics Laboratory, 2022).

Principal component analysis

Principal component analyses were conducted to visualize the relationships between the factors that influence both genetic gain and genomic selection accuracy. The family means for 7 different factors that contribute to the genetic gain and genomic selection accuracy in the first cycle were determined for each of the 20 runs. The 7 factors included genetic gain, fixation of favourable alleles, Hamming distance, genetic variance, effective population size, true breeding value, and genomic estimated breeding value. The fixation of favourable alleles described the average percentage of beneficial alleles that were fixed in the population. Meanwhile, the Hamming distance was used to describe the distance of an individual from an ideal genotype. This distance was determined as the number of base pairs that differ from the optimal genotype. The effective population size was calculated according to Eq. 5. All calculations were performed using original code written in R, which may be found on the lab GitHub page (McGill University Pulse Breeding and Genetics Laboratory, 2022). Lastly, the R packages ggbiplot (version 0.55) and ggplot2 (version 3.3.3) were used to create the principal component analysis plots.
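Two of the factors above, the Hamming distance and the fixation of favourable alleles, can be illustrated with a short sketch. The representation of a genotype as a string with one character per position, the function names, and the toy population are assumptions for this example and are not taken from the study's code.

```python
def hamming_distance(genotype, ideal):
    """Number of positions at which a genotype differs from the ideal genotype."""
    if len(genotype) != len(ideal):
        raise ValueError("genotype and ideal must have the same length")
    return sum(1 for g, i in zip(genotype, ideal) if g != i)


def percent_fixed_favourable(population, ideal):
    """Percentage of positions at which every individual carries the ideal allele."""
    n_loci = len(ideal)
    fixed = sum(1 for j in range(n_loci)
                if all(ind[j] == ideal[j] for ind in population))
    return 100.0 * fixed / n_loci


# Toy population (hypothetical): three individuals, four positions.
ideal = "AAGC"
population = ["AAGT", "AAGC", "ATGC"]
distances = [hamming_distance(ind, ideal) for ind in population]
fixed_pct = percent_fixed_favourable(population, ideal)
```

A population whose individuals all have a Hamming distance of zero would have 100% of favourable alleles fixed, which is why the two measures move in opposite directions as selection proceeds.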

Genomic estimated breeding values (GEBV)

Genomic estimated breeding values were determined for each cycle for 10 cycles (Supplementary Fig. 1). Over the course of ten cycles, GEBVs saw either a gradual increase or decrease before reaching a plateau. For DF and SY, the GEBVs increased rapidly before plateauing. The opposite trend was observed for WM tolerance, with GEBVs declining before leveling off. The parental population sizes had an impact on the GEBVs at the end of the breeding program. For DF, the GEBVs averaged across the strategies were 19.90, 17.47, 13.98, and 9.87 for 15, 30, 60, and 100 parents, respectively. For WM tolerance, parental population sizes of 15, 30, 60, and 100 resulted in average GEBVs of -35.26, -61.52, -62.34, and -62.52, respectively. Lastly, for SY, the average GEBVs were 2420.18, 2745.80, 2185.04, and 2183.38 for parental population sizes of 15, 30, 60, and 100, respectively.

True Breeding Values (TBV)

True breeding values were obtained from the QU-GENE output files and plotted over 10 cycles (Supplementary Fig. 2). Over the course of ten cycles, TBVs saw either a gradual increase or decrease before reaching a plateau. For DF and SY, the TBVs increased and eventually plateaued. The opposite was true for WM tolerance, where TBVs declined before reaching a plateau. There were notable differences between the TBVs when different numbers of parents were used at the beginning of the cycle. For each of the traits, as the number of parents increased, the average TBV for the strategies decreased. For DF, the average TBVs across strategies at the end of the 10th cycle were 20.21, 17.69, 13.95, and 9.96 DF for 15, 30, 60, and 100 parents, respectively. For WM tolerance (0-100 severity score) after 10 cycles, the average TBVs were -46.17, -62.32, -62.51, and -62.55 for 15, 30, 60, and 100 parents, respectively. For SY, the average TBVs were 2830.95, 2759.40, 2190.16, and 2189.79 kg/ha for 15, 30, 60, and 100 parents, respectively. For most breeding scenarios, bulk breeding, single seed descent, pedigree, and modified pedigree methods led to similar TBVs. Mass selection resulted in a lower TBV for DF and SY, while it led to a higher TBV for WM tolerance in comparison to the other four strategies.

GS Model Accuracies

In Silico Realized GS Accuracy

GS accuracies were estimated from the correlation between the TBV and the GEBV. Under each of the five strategies, accuracies saw a general decline before plateauing. They ranged from -0.35 for SY to 0.32 for WM (Fig. 1). The mean accuracies for each strategy were -0.03, -0.02, 0.02, 0.05, and -0.01 for mass selection, bulk breeding, single seed descent, the pedigree method, and the modified pedigree method, respectively.

For DF, the highest accuracy (0.31) was observed under the pedigree method with 30 parents, while the lowest accuracy (-0.35) was in bulk breeding with 100 parents. When considering WM, single seed descent with 15 parents resulted in the greatest accuracy (0.32). The lowest WM accuracy was seen under mass selection with 60 parents (-0.29). Interestingly, in cycle 2, there was an increase in accuracy, after which the WM accuracy declined rapidly and became negative by cycle 4. For SY, both the highest (0.24) and lowest (-0.34) accuracies were observed in mass selection. For certain cycles, a correlation could not be obtained. In these cycles, the variance was zero and the correlation was undefined.

Expected Formula-based GS Accuracy

GS accuracies determined using Eq. 3 ranged from 0.07 for SY to 0.63 for DF (Fig. 2). In general, prediction accuracy decreased over the 10 cycles. The decline was smaller with parental population sizes of 15 and 30. Prediction accuracies were higher with larger parental population sizes. The strategies had similar accuracies and followed similar trends when the parental population size was small, regardless of heritability. However, with large parental population sizes, mass selection had a much greater prediction accuracy than the other strategies, and its accuracy remained relatively high across cycles.

Heritability had an impact on GS accuracy: accuracy was highest for DF, followed by WM tolerance and then SY. Under mass selection with 100 parents, the accuracy decreased over 10 cycles from 0.63 to 0.47 for DF, from 0.46 to 0.39 for WM tolerance, and from 0.43 to 0.29 for SY. In most breeding scenarios, bulk breeding resulted in the lowest prediction accuracies. Under bulk breeding with 15 parents, the accuracy decreased over 10 cycles from 0.18 to 0.10 for DF, from 0.11 to 0.09 for WM tolerance, and from 0.09 to 0.07 for SY.

Genetic gain with Model Update

The results from the model update indicated that there was a sharp increase followed by a rapid decline in genetic gain. Model update only seemed to improve genetic gain in one or two cycles immediately after the update, only to return to the rates of genetic gain prior to the update. Conventional breeding was included alongside genomic selection as a comparison for model update. Figure 3 shows that updating the GS model resulted in an increase in genetic gain after cycle 3 for mass selection, the pedigree method, and the modified pedigree method when selecting for DF and SY. However, it led to a decrease in genetic gain immediately after cycle 3, followed by an increase after cycle 4, and a decrease after cycle 5 for all strategies when selecting for WM tolerance. When compared to conventional breeding, genomic selection led to higher levels of genetic gain for certain strategies in the cycle following the GS model update. For DF, mass selection under genomic selection was 23.8% higher compared to conventional breeding. Meanwhile, the pedigree method and the modified pedigree method were 30.2% and 34.0% higher in genomic selection, respectively. For WM tolerance, mass selection led to 17.0% greater genetic gain using genomic selection than conventional breeding, while the modified pedigree method under genomic selection resulted in 9.94% higher genetic gain. Finally, for SY, mass selection, the pedigree method, and the modified pedigree method resulted in 22.7%, 11.3%, and 18.2% higher genetic gain, respectively using genomic selection compared to conventional breeding. For all other breeding scenarios, there was little to no difference between genomic selection and conventional breeding in the cycle after the GS model update.

In Silico Realized GS Accuracy with Model Update

GS accuracies, represented by the correlations between the TBV and the GEBV, were obtained and plotted over 6 cycles (Fig. 4). The in silico realized accuracies fluctuated from one cycle to the next regardless of the model update, and updating the model generally did not improve accuracies. Mass selection had the greatest variability in accuracy, having the highest accuracy in some cycles and the lowest in others. For DF, following the model update at cycle 4, there was a small improvement in accuracy under single seed descent and the modified pedigree method, where accuracies increased by 0.08 and 0.04, respectively. The other three strategies saw a decrease in accuracy.

From cycle 3 to cycle 4, WM tolerance GS accuracies declined by 0.09, 0.13, and 0.12 for mass selection, bulk breeding, and the pedigree method, respectively. GS accuracies increased by 0.14, 0.03, and 0.02 between cycle 3 and 4 for mass selection, bulk breeding, and single seed descent, respectively. Although mass selection resulted in an increase in accuracy after cycle 3, it rapidly dropped and became negative. Decreases in GS accuracy after the third cycle were observed for the pedigree method and the modified pedigree method. However, the strategy with the highest accuracy in the final cycle was the pedigree method, with a value of 0.06.

Expected Formula-based GS with Model Update

GS accuracies determined using Eq. 3 ranged from 0.08 for SY to 0.58 for DF (Fig. 4). For all breeding scenarios, a general trend was observed where an increase in accuracy occurred after the GS model update at cycle 3, followed by a decline from cycle 4 to 5. The peak accuracy predicted from the DF simulation was 0.58, occurring at cycle 4 with mass selection. For WM tolerance, the peak accuracy was 0.51, occurring at cycle 4 with the pedigree method. The peak accuracy for SY was 0.38 at cycle 4 using the pedigree method.

True breeding values (TBV) with Model Update

After the model update, the true breeding values were determined and plotted over 6 cycles (Fig. 5). In general, updating the model resulted in an increase in TBVs. At cycle 3 (where the update occurred), there was an increase in the TBV for all breeding scenarios. For DF, the TBV increased by 9.53, 7.05, 6.32, 4.34, and 6.38 from cycle 3 to cycle 4 for mass selection, bulk breeding, single seed descent, the pedigree method, and the modified pedigree method, respectively. For WM tolerance, TBVs rose by 15.0, 38.3, 9.44, 0.73, and 10.7 from cycle 3 to cycle 4 for mass selection, bulk breeding, single seed descent, the pedigree method, and the modified pedigree method, respectively. Lastly, for SY from cycle 3 to cycle 4, mass selection, bulk breeding, single seed descent, the pedigree method, and the modified pedigree method had increases in TBVs of 761, 299, 134, 129, and 134, respectively. TBVs appeared to plateau after cycle 4 for DF and SY. However, for WM tolerance, TBVs rapidly declined after cycle 4.

Genomic estimated breeding values (GEBV) with Model Update

Genomic estimated breeding values were plotted over 6 cycles (Fig. 5). For DF and SY, GEBVs saw an increase following the model update. For WM, GEBVs saw an increase and subsequent decline. For DF, there was a pronounced increase from cycle 3 to cycle 4 for mass selection, the pedigree method, and the modified pedigree method, with increases of 19.0, 18.8, and 20.6, respectively. Smaller increases were observed for the other two strategies: GEBVs increased by 7.66 and 6.02 between cycle 3 and 4 for bulk breeding and single seed descent, respectively. For WM tolerance, a large increase was observed for bulk breeding, while mass selection led to the smallest increase in GEBV. From cycle 3 to cycle 4 for WM tolerance, GEBVs increased by 4.03, 39.1, 9.43, 16.7, and 13.1 for mass selection, bulk breeding, single seed descent, the pedigree method, and the modified pedigree method, respectively. Lastly, for SY, all five strategies resulted in an increase in GEBV following the model update, with the greatest increase observed in mass selection and the smallest increase in single seed descent. GEBVs for SY increased by 1867, 367, 130, 886, and 1251 for mass selection, bulk breeding, single seed descent, the pedigree method, and the modified pedigree method, respectively.

Principal Component Analysis

Principal component analysis was conducted to show the overall result of the simulation with the model update. Figure 6 shows the PCA plot for DF, where 80.53% of the variance is explained by the first two principal components. Notably, the eigenvectors for TBV and GEBV are close together and point in the same direction. The eigenvectors for genetic gain and Hamming distance point in similar directions. Towards the right side of the PCA plot, there were two clusters for mass selection that formed on the extreme of the GEBV and TBV eigenvectors. On the opposite side, a cluster containing all five strategies was found in the extremes of both the Hamming distance vector and the genetic gain vector. In the direction of the eigenvector for fixation of favourable alleles, there was a cluster consisting of bulk breeding. No clusters formed in the extreme of the effective population size eigenvector. Near the center of the plot was a cluster consisting of the pedigree method and single seed descent.

The first two principal components in the WM PCA (Fig. 6) explained 79.13% of the variance. As with DF, the eigenvectors for GEBV and TBV were close to each other. Meanwhile, the eigenvectors for Hamming distance and genetic gain were located close together. In the extreme of the Hamming distance eigenvector, there was a cluster consisting of mass selection, while for the genetic gain eigenvector, there was a cluster for the modified pedigree method. In the direction of the eigenvector for the fixation of favourable alleles, there was a cluster for bulk breeding.

For SY (Fig. 6), the first two principal components described 79.47% of the variance. The GEBV and TBV eigenvectors are very close together and point in similar directions. On the opposite end are the Hamming distance and genetic gain eigenvectors, which are located close together. Towards the extreme of the Hamming distance eigenvector is a cluster made of mass selection and bulk breeding. Between the eigenvectors for fixed favourable alleles and effective population size, there was a cluster corresponding to bulk breeding.

The results demonstrated that the prediction accuracy varied across strategy, trait, and parental population size. Genetic gain also varied with strategy, trait, and population size. The PCAs for each trait had eigenvectors pointing in the same direction; however, the strategies did not cluster separately. This suggests that trait architecture is a fundamental parameter of genomic selection accuracy, and the ideal strategy and parental population size will vary by trait.

Formula-based GS Accuracy compared to in silico Realized GS Accuracy

The formula-based GS accuracy results suggested that increased parental population sizes should lead to a higher accuracy. In addition, this increase should be particularly evident in mass selection. However, these equation-based accuracies did not reflect the in silico realized GS accuracies estimated from TBV and GEBV correlations. Moreover, the realized accuracies fluctuated from one cycle to the next, while the predicted accuracies saw a gradual decline over six cycles. Formula-based accuracies were much higher and more consistent than in silico realized accuracies. This is likely due to selection-induced changes in the population that are not accounted for in the equation. Prediction accuracy estimates fail to reflect a population under the conditions present in this experiment, where population sizes are relatively small and selection pressure is high.

Further, Werner et al. (2015) point out the impact that population structure has on the ability to accurately predict breeding values; however, population structure was not considered in this experiment. The lack of concordance between predicted and realized accuracies was probably because variability in the population and the impact that selection strategy has on the structure of a population were not considered. This explains why, at times, predicted and realized accuracies differed. Further, Eq. 2 does not properly account for situations with a very large number of loci. Based on Eq. 2, as the number of loci increases, the accuracy will become biased and shift towards 0. This is because there cannot be an infinite number of independent loci. Instead, independent chromosome segments were used in Eq. 3 to predict GS accuracies (Daetwyler et al., 2010b).
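The shift towards 0 follows directly from the form of Eq. 2 and can be illustrated with a short worked calculation. The values N_p = 30 and h² = 0.5 (so that N_p h² = 15) are hypothetical and chosen only to make the arithmetic easy to follow:

\(\underset{{n}_{G}\to \infty }{\lim}\sqrt{\frac{{N}_{p}{h}^{2}}{{N}_{p}{h}^{2}+{n}_{G}}}=0, \qquad \sqrt{\frac{15}{15+15}}\approx 0.71 \;\; ({n}_{G}=15), \qquad \sqrt{\frac{15}{15+1485}}=0.10 \;\; ({n}_{G}=1485)\)

Under these assumed values, a hundredfold increase in the number of loci counted as independent reduces the expected accuracy from about 0.71 to 0.10, even though nothing about the training population has changed.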

Using estimates of LD as opposed to a known number of loci added a degree of abstraction to the GS accuracy estimates, possibly adding to the disparity between realized and predicted accuracies. Brard and Ricard (2015) previously explored the ability of formulae to predict GS accuracy. They found that the method by which N_e, and by extension M_e, were estimated had an impact on how well the formulae could predict GS accuracy. Both N_e and M_e can be calculated using a number of different methods, meaning that different results may be obtained even when the same formula is used to predict GS accuracy. This may lead to overestimation of GS accuracy when M_e is high and underestimation when M_e is low. Since the values of N_e calculated in this study were quite low (average N_e = 137), the formula may have overestimated GS accuracy. Several models have been proposed for the estimation of N_e (Caballero and Toro, 2000; Crow and Morton, 1955; Wang and Hill, 2000; Wright, 1938). Depending on the model used, certain assumptions are made regarding the population under investigation. For plant species in particular, few estimates of N_e have been made. Siol et al. (2007) were the first to report estimates for the highly selfing model legume species Medicago truncatula. Since M. truncatula and P. vulgaris belong to the same family, Fabaceae, this method for estimating N_e was selected.

In the PCA plots, the eigenvectors for TBV and GEBV pointed in similar directions, which gives the impression that correlations would be an accurate measure of GS accuracy. However, Figs. 1 and 4 show the limitations of using correlations. In some of the later cycles, the variance was zero because TBVs and GEBVs converged on a single value, leading to undefined correlations.
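The undefined correlations can be made concrete with a minimal sketch of realized accuracy computed as the Pearson correlation between TBVs and GEBVs. The function name, the choice to return None for the undefined case, and the example values are assumptions made for illustration; this is not the code used in the study.

```python
import math


def realized_accuracy(tbv, gebv):
    """Pearson correlation between TBVs and GEBVs.

    Returns None when either vector has zero variance, because the
    correlation is then undefined (division by zero).
    """
    if len(tbv) != len(gebv) or len(tbv) < 2:
        raise ValueError("need two equal-length vectors with at least 2 values")
    n = len(tbv)
    mean_t = sum(tbv) / n
    mean_g = sum(gebv) / n
    ss_t = sum((t - mean_t) ** 2 for t in tbv)
    ss_g = sum((g - mean_g) ** 2 for g in gebv)
    if ss_t == 0 or ss_g == 0:
        return None  # all individuals share one breeding value
    cov = sum((t - mean_t) * (g - mean_g) for t, g in zip(tbv, gebv))
    return cov / math.sqrt(ss_t * ss_g)


# Hypothetical early cycle: breeding values still vary among individuals.
print(realized_accuracy([10.0, 12.0, 15.0, 11.0], [9.5, 12.5, 14.0, 11.5]))

# Hypothetical late cycle: TBVs have converged on a single value.
print(realized_accuracy([15.0, 15.0, 15.0, 15.0], [14.0, 14.5, 14.2, 14.1]))
```

Returning an explicit missing value, rather than 0, keeps cycles with no remaining variance from being read as cycles with poor prediction.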

Impact of Training Population Size on GS Accuracy

One of the major factors that influence genomic selection accuracy is the training population size. The closed system that was simulated in QU-GENE involved taking the progeny at the end of each cycle and using them as the parents in the next cycle. Part of this process consisted of maintaining 30 families after each cycle. The simuPOP population, which was used as the parents at the start of the first cycle, was also used for training. This resulted in a small training population. As a result, prediction accuracies were relatively low, because the accuracy of the GS model increases with the training population size (Rincent, 2017). Population structure, which can also affect accuracy, would not have been a concern in this study because no population structure was present (data not shown).

Impact of Selection Strategy on GS Accuracy

According to the PCA plots, the strategies did not cluster separately from one another. Most of the clusters that formed consisted of more than one strategy. However, according to the realized accuracies, the pedigree method and single seed descent led to the greatest accuracies by the end of the 10 cycles under selection for DF. Single seed descent involves advancing one seed from each plant, eliminating the impact of natural or artificial selection on genotypes. The pedigree method involves advancing seed from several superior families. Therefore, the higher accuracies observed under these two methods are likely due to the retention of genetic variance from generation to generation. When the training population captures more variance, the model tends to perform better on the testing population (Rincent, 2017).

Impact of Trait Architecture on GS Accuracies

Consistent with findings from previous studies, predictive ability for traits controlled by a small number of QTL varies depending on which prediction method is used (Wang, 2015). For WM tolerance and SY, realized accuracy could not be obtained for some of the later cycles. This may have been a result of certain alleles becoming fixed, so that the population ultimately converged on a single breeding value. When all TBVs in a cycle are the same, the correlation is undefined because the variance is zero. If phenotypic and genomic selection lead to different sets of alleles becoming fixed, then it is possible that they converge on different breeding values. This may explain the poor correlation, despite TBVs and GEBVs pointing in similar directions in the PCA plots. The fixation of alleles may also have been influenced by the number of QTLs. As previously shown by Lin et al. (2022), selection for WM tolerance led to the highest percentage of fixed favourable alleles in the fewest cycles of selection, while days to flowering had the lowest allele fixation rate. This may explain why correlations could not be obtained for some of the later cycles for white mold tolerance and seed yield. It is also important to note that, for effective use of GS models, a higher number of QTLs will result in higher accuracy and higher genetic gain. GS is an effective method when selecting on quantitative traits controlled by numerous genes (Tong et al., 2021).
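The link between allele fixation and identical breeding values can be sketched as follows. The example assumes a purely additive trait and genotypes coded as the count of favourable alleles (0, 1 or 2) at each QTL; the function names, effect sizes and genotypes are hypothetical and are not taken from the simulations.

```python
def fixed_favourable_fraction(genotypes):
    """Fraction of loci at which every individual is homozygous favourable.

    genotypes: list of individuals, each a list of favourable-allele
               counts (0, 1 or 2), one entry per locus.
    """
    if not genotypes or not genotypes[0]:
        raise ValueError("need at least one individual and one locus")
    n_loci = len(genotypes[0])
    fixed = sum(
        all(individual[locus] == 2 for individual in genotypes)
        for locus in range(n_loci)
    )
    return fixed / n_loci


def additive_tbv(genotype, effects):
    """True breeding value under a purely additive model (an assumption)."""
    return sum(count * effect for count, effect in zip(genotype, effects))


effects = [0.5, 1.0, 0.25, 2.0]  # hypothetical per-allele QTL effects

early_cycle = [[2, 1, 0, 2], [2, 2, 1, 2], [2, 0, 2, 2]]
late_cycle = [[2, 2, 2, 2], [2, 2, 2, 2], [2, 2, 2, 2]]

for name, population in (("early", early_cycle), ("late", late_cycle)):
    tbvs = [additive_tbv(g, effects) for g in population]
    print(name, fixed_favourable_fraction(population), tbvs)
```

Once every locus is fixed, all individuals share the same TBV, so the variance is zero and a TBV and GEBV correlation cannot be computed for that cycle.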

Effectiveness of Model Updating

It is important to note that, to update the model, the simulation must be stopped so that a new model can be generated. Stopping and re-running the simulation will by default result in an increase in genetic gain. To account for this, phenotypic (or conventional) selection was included alongside genomic selection to evaluate the impact of updating the model. Figure 3 shows that, for a number of strategies, the increase in genetic gain immediately after the model update was larger under GS than under phenotypic selection. This suggests that updating the GS model may be beneficial for certain strategies under genomic selection. However, as was the case with the unchanged GS model, predicted accuracies did not reflect the realized accuracies when model updating was included. According to predicted accuracies, updating the GS model led to a spike in accuracy that quickly declined again in the next cycle. This spike in accuracy points to the importance of relatedness between training and testing populations. The realized accuracies fluctuated from one cycle to the next. However, the inclusion of model updating did lead to an increase in genetic gain.

Updating was most beneficial for the pedigree method and mass selection in the DF simulation. For SY, model updating was beneficial under the pedigree method. Model updating did not benefit any of the strategies in the WM tolerance simulation, further suggesting the importance of both the quality and the quantity of QTL data. Unexpectedly, the genetic gain eigenvector was closest to the Hamming distance eigenvector and far from the eigenvector for the fixation of favourable alleles. Thus, results for the GS model update should be interpreted cautiously.

Conclusion

GS has been widely used in animal breeding; however, its effectiveness in plant breeding requires further validation before implementation, particularly given the complex processes that take place in plant breeding program pipelines. This study showed that genetic gain and GS accuracy vary across selection strategies. GS will only be useful if the model can accurately predict the phenotype of a trait from the genotype. Numerous studies have investigated prediction accuracy in simulations. However, in those studies, QTLs were simulated and evenly distributed across the genome, with effect sizes drawn from a random distribution. This study aimed to assess GS accuracy in simulations that better reflect the real world, in which QTLs, effect sizes and positions were based on reported QTL mapping experiments. The findings indicate that formula-based estimates of accuracy do not reflect the accuracies obtained from correlations between TBV and GEBV. Furthermore, according to in silico realized accuracies, there may be some benefits to using genomic selection under single seed descent or the pedigree method. For a given parent population size, accuracies varied depending on which strategy was used and which trait was being evaluated. These findings suggest that trait architecture and breeding strategy play a more significant role in genomic prediction than the initial parent population size.

Abbreviations

BL Bayesian Lasso

BRR Bayesian ridge regression

DF Days to flowering

GEBV Genomic estimated breeding value

GS Genomic selection

GWAS Genome-wide association study

LD Linkage disequilibrium

MAS Marker-assisted selection

NNET Neural network

PCA Principal components analysis

QTL Quantitative trait locus

RKHS Reproducing kernel Hilbert space

rrBLUP Ridge regression best linear unbiased prediction

SVM Support vector machine

SY Seed yield

TBV True breeding value

WBSR Weighted Bayesian shrinkage regression

WM White mold

Declarations

Funding

Natural Sciences and Engineering Research Council of Canada – Discovery Grants.

Conflicts of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Ethics approval

Not applicable

Consent to participate

Not applicable

Consent for publication

Not applicable

Availability of data and material

Data and code are available at:

GitHub: https://github.com/McGillHaricots/peas-andlove/tree/master/Simulation-files

Code availability

Not applicable

Authors' contributions

IC: Analyzed the data and wrote the manuscript.

JL: Conducted the research and experiments, analyzed the data, and wrote the manuscript.

VA: Assisted with research project guidance, reviewed the manuscript, and provided comments.

ZJ: Provided input and reviewed the manuscript.

JO: Provided input and reviewed the manuscript.

PM: Provided input and reviewed the manuscript.

DJ: Provided input and reviewed the manuscript.

VHV: Led the project and conceived the idea, assisted with data interpretation, and assisted with writing of the manuscript.

References

Brard, S. and A. Ricard. 2015. Is the use of formulae a reliable way to predict the accuracy of genomic selection? Journal of Animal Breeding and Genetics 132: 207-217. doi:10.1111/jbg.12123.
Caballero, A. and M.A. Toro. 2000. Interrelations between effective population size and other pedigree tools for the management of conserved populations. Genetical Research 75: 331-343. doi:10.1017/S0016672399004449.
Crow, J.F. and N.E. Morton. 1955. Measurement of Gene Frequency Drift in Small Populations. Evolution 9: 202-214. doi:10.2307/2405589.
Daetwyler, H.D., R. Pong-Wong, B. Villanueva and J.A. Woolliams. 2010a. The Impact of Genetic Architecture on Genome-Wide Evaluation Methods. Genetics 185: 1021-1031. doi:10.1534/genetics.110.116855.
Daetwyler, H.D., R. Pong-Wong, B. Villanueva and J.A. Woolliams. 2010b. The impact of genetic architecture on genome-wide evaluation methods. Genetics 185: 1021-1031. doi:10.1534/genetics.110.116855.
Doublet, A.-C., P. Croiseau, S. Fritz, A. Michenet, C. Hozé, C. Danchin-Burge, et al. 2019. The impact of genomic selection on genetic diversity and genetic gain in three French dairy cattle breeds. Genetics Selection Evolution 51: 52. doi:10.1186/s12711-019-0495-1.
Endelman, J.B. 2011. Ridge Regression and Other Kernels for Genomic Selection with R Package rrBLUP. Plant Gen. 4: 250-255. doi:10.3835/plantgenome2011.08.0024.
Heffner, E.L., A.J. Lorenz, J.L. Jannink and M.E. Sorrells. 2010. Plant breeding with genomic selection: gain per unit time and cost. Crop science 50: 1681-1690.
Heslot, N., H.P. Yang, M.E. Sorrells and J.L. Jannink. 2012. Genomic selection in plant breeding: a comparison of models. Crop science 52: 146-160.
Hoyos-Villegas, V., V.N. Arief, W.-H. Yang, M. Sun, I.H. DeLacy, B.A. Barrett, et al. 2019. QuLinePlus: extending plant breeding strategy and genetic model simulation to cross-pollinated populations—case studies in forage breeding. Heredity 122: 684-695. doi:10.1038/s41437-018-0156-0.
Isidro, J., J.-L. Jannink, D. Akdemir, J. Poland, N. Heslot and M.E. Sorrells. 2015. Training set optimization under population structure in genomic selection. Theor Appl Genet 128: 145-158.
Lin, J., V. Arief, Z. Jahufer, J. Osorno, P. McClean, D. Jarquin, et al. 2022. Simulations of rate of genetic gain in dry bean breeding programs. Theor Appl Genet. doi:10.21203/rs.3.rs-1442864/v1.
Lorenz, A.J., S. Chao, F.G. Asoro, E.L. Heffner, T. Hayashi, H. Iwata, et al. 2011. Chapter Two - Genomic Selection in Plant Breeding: Knowledge and Prospects. In: D. L. Sparks, editor Advances in Agronomy. Academic Press. p. 77-123.
Massman, J.M., H.-J.G. Jung and R. Bernardo. 2013. Genomewide Selection versus Marker-assisted Recurrent Selection to Improve Grain Yield and Stover-quality Traits for Cellulosic Ethanol in Maize. Crop Sci 53: 58-66. doi:10.2135/cropsci2012.02.0112.
Meuwissen, T.H.E., B.J. Hayes and M.E. Goddard. 2001. Prediction of Total Genetic Value Using Genome-Wide Dense Marker Maps. Genetics 157: 1819.
Muleta, K.T., G. Pressoir and G.P. Morris. 2019. Optimizing Genomic Selection for a Sorghum Breeding Program in Haiti: A Simulation Study. G3 Genes|Genomes|Genetics 9: 391-401. doi:10.1534/g3.118.200932.
Norman, A., J. Taylor, J. Edwards and H. Kuchel. 2018. Optimising Genomic Selection in Wheat: Effect of Marker Density, Population Size and Population Structure on Prediction Accuracy. G3 Genes|Genomes|Genetics 8: 2889-2899. doi:10.1534/g3.118.200311.
Peng, B. and M. Kimmel. 2005. simuPOP: a forward-time population genetics simulation environment. Bioinformatics 21: 3686-3687. doi:10.1093/bioinformatics/bti584.
Siol, M., I. Bonnin, I. Olivieri, J.M. Prosperi and J. Ronfort. 2007. Effective population size associated with self-fertilization: lessons from temporal changes in allele frequencies in the selfing annual Medicago truncatula. Journal of Evolutionary Biology 20: 2349-2360. doi:10.1111/j.1420-9101.2007.01409.x.
Tibbs Cortes, L., Z. Zhang and J. Yu. 2021. Status and prospects of genome-wide association studies in plants. The Plant Genome 14: e20077. doi:10.1002/tpg2.20077.
Wang, J. and W.G. Hill. 2000. Marker-Assisted Selection to Increase Effective Population Size by Reducing Mendelian Segregation Variance. Genetics 154: 475-489. doi:10.1093/genetics/154.1.475.
Waples, R.S. 1989. A generalized approach for estimating effective population size from temporal changes in allele frequency. Genetics 121: 379-391. doi:10.1093/genetics/121.2.379.
Wright, S. 1938. Size of population and breeding structure in relation to evolution. Science 87: 430-431. doi:10.1126/science.87.2263.425-a.

SupplementaryFigures.docx
