Trait variation and summary statistics
Spatial adjustment of agronomic traits proved efficient in reducing residual error, as illustrated in Suppl. Figure 1.
Table 1 shows the variance components of the random effects in model (1), estimated from the whole dataset, together with the heritability of variety means.
(table 1 around here)
The genotype component s²g appears to be larger than the interaction components s²gS and s²gY, leading to heritabilities ranging from 0.4 (Yield) to 0.8 (Test weight and friability). The heritabilities of malting traits are all larger than 0.6, despite an experimental design smaller than that used for agronomic traits.
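As a reading aid, the link between variance components and the heritability of variety means can be sketched as below. Model (1) is not restated in this section, so the formula, the residual variance `s2_e` and the numbers of sites, years and replicates are assumptions made for illustration. They are not the values or the exact estimator used in the study.

```python
def variety_mean_heritability(s2_g, s2_gs, s2_gy, s2_e, n_sites, n_years, n_reps=1):
    """Heritability of variety means under an ASSUMED formula:

        H2 = s2_g / (s2_g + s2_gS / n_sites + s2_gY / n_years
                     + s2_e / (n_sites * n_years * n_reps))

    s2_g  : genotype variance
    s2_gs : genotype x site interaction variance
    s2_gy : genotype x year interaction variance
    s2_e  : residual variance (assumed to be part of the model)
    """
    if min(n_sites, n_years, n_reps) < 1:
        raise ValueError("numbers of sites, years and replicates must be >= 1")
    denominator = (
        s2_g
        + s2_gs / n_sites
        + s2_gy / n_years
        + s2_e / (n_sites * n_years * n_reps)
    )
    return s2_g / denominator


# Hypothetical variance components, chosen only to show the mechanics.
h2_example = variety_mean_heritability(
    s2_g=2.0, s2_gs=1.0, s2_gy=1.0, s2_e=4.0, n_sites=2, n_years=2
)
print(round(h2_example, 2))
```

With this form, a genotype component that dominates the interaction components gives a high heritability even when few sites and years are available.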
Correlation and Principal component analysis
Figure 1 shows the single-trait distributions and pairwise correlations of the conditional modes of genotypic effects for the ten traits studied. Among agronomic traits, the highest correlation (0.77) is between thousand-grain weight (TGW) and calibration, an expected result. Protein content is negatively correlated with yield (-0.32), but not as tightly as reported in bread wheat, e.g. -0.82 in French registration trials 1991-1999 (Oury et al., 2003). Moreover, a high protein content is not sought in malting barley, since too much protein causes problems in the filtration process, as illustrated by the negative correlation between protein content and extract rate (-0.35). Thus, a stabilization of protein content, which is necessary to feed the yeast correctly, is desired rather than a continuous enrichment.
(Fig 1 around here)
Among malting-related traits, the highest correlation (0.90) is found between viscosity and β-glucan content, followed by that between friability and extract (0.64). Both correlations were expected on causal grounds. Viscosity and β-glucan are negatively correlated with extract, which is favorable to breeding objectives, since extract is to be increased while viscosity is to be reduced. Malting traits are weakly correlated with agronomic traits, the highest (in absolute value) negative correlation being between extract and protein content (-0.42). This suggests that genetic improvement of agronomic traits and malting traits can be achieved independently.
These correlations are also visible on the first two axes of the principal component analysis shown in Suppl. Figure 2. The two groups of tightly correlated malting traits lie in opposite positions along axis 1, whereas the agronomic traits, particularly TGW and calibration, are mostly carried by axis 2 and are thus independent of the malting traits. Protein content is poorly represented in this plane, being less correlated with all other variables. Heading date is also poorly represented on axes 1-2 and is therefore not correlated with agronomic or quality traits.
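The correlation matrix and the principal component analysis discussed above can be sketched as follows. This is a minimal illustration on simulated values: the trait names in the comments, the sample size and the helper `pca_on_correlations` are hypothetical, and the software actually used for the analysis is not stated here.

```python
import numpy as np


def pca_on_correlations(trait_values):
    """PCA of standardized traits via eigen-decomposition of the correlation matrix.

    trait_values: array (n_lines, n_traits) of genotypic values.
    Returns the correlation matrix, the share of variance carried by each
    axis, and the trait loadings (correlations between traits and axes).
    """
    X = np.asarray(trait_values, dtype=float)
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]  # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    loadings = eigvecs * np.sqrt(np.clip(eigvals, 0.0, None))
    return R, explained, loadings


# Simulated genotypic values: three traits share a latent factor, one does not.
rng = np.random.default_rng(0)
n_lines = 200
latent = rng.normal(size=n_lines)
traits = np.column_stack([
    latent + 0.3 * rng.normal(size=n_lines),   # stands for "viscosity"
    latent + 0.3 * rng.normal(size=n_lines),   # stands for "beta-glucan"
    -latent + 0.8 * rng.normal(size=n_lines),  # stands for "extract"
    rng.normal(size=n_lines),                  # stands for an unrelated trait
])
R, explained, loadings = pca_on_correlations(traits)

# Quality of representation of each trait in the plane of axes 1-2.
quality_axes_1_2 = (loadings[:, :2] ** 2).sum(axis=1)
print(np.round(R, 2))
print(np.round(quality_axes_1_2, 2))
```

The sum of squared loadings on axes 1-2 gives the quality of representation of a trait in that plane, which is the quantity behind statements such as "poorly represented on axes 1-2".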
Molecular Data
The distribution of the 24,101 filtered markers was fairly homogeneous between chromosomes, ranging from 2,505 on chromosome 4H to 4,604 on chromosome 3H. The scatter plot of the 679 breeding lines and cultivars on the first two axes of the principal coordinate analysis of the Rogers' distance matrix is shown in Figure 2.
(Fig 2 around here)
The clouds of the two breeders' lines show both overlapping regions and regions specific to each breeder. Cultivars are spread more widely over the whole graph, with a higher density in the middle zone where the two breeders' lines overlap. The overlap may be explained by the use of cultivars as parents of crosses by both breeders, while the fact that each breeder also has its own sources of parents explains an incipient divergence between the two sets of breeding lines. However, the overlap seems large enough to anticipate successful cross-prediction between the two breeders, i.e. with one breeder's set used for training and the other set used for validation.
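The distance-based ordination described above can be sketched as follows, under stated assumptions. For biallelic markers coded as the frequency of one allele within each line (0 or 1 for inbred lines), Rogers' distance reduces to the mean absolute difference in allele frequency across markers. The simulated genotypes, the group sizes and the function names are hypothetical.

```python
import numpy as np


def rogers_distance(allele_freq):
    """Rogers' distance between lines for biallelic markers.

    allele_freq: array (n_lines, n_markers) holding the frequency of one
    allele in each line (0 or 1 for inbred lines, 0.5 for heterozygotes).
    For two alleles per locus the distance is the mean over loci of
    |p_i - p_j|.
    """
    P = np.asarray(allele_freq, dtype=float)
    n = P.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        D[i] = np.abs(P - P[i]).mean(axis=1)
    return D


def pcoa(D, n_axes=2):
    """Principal coordinate analysis (classical scaling) of a distance matrix."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n  # centering matrix
    B = -0.5 * J @ (D ** 2) @ J          # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_axes]
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))


# Two simulated groups of inbred lines with partly diverged allele frequencies.
rng = np.random.default_rng(1)
freq_a = rng.uniform(0.1, 0.9, size=300)
freq_b = np.clip(freq_a + rng.normal(0.0, 0.2, size=300), 0.05, 0.95)
lines_a = rng.binomial(1, freq_a, size=(30, 300)).astype(float)
lines_b = rng.binomial(1, freq_b, size=(30, 300)).astype(float)

D = rogers_distance(np.vstack([lines_a, lines_b]))
coords = pcoa(D, n_axes=2)
print(coords[:3])
```

Plotting `coords` with one color per group gives the kind of scatter plot in which overlapping and group-specific regions can be examined.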
Genomic prediction
Cross-validation and forward-validation prediction abilities using the popular method GBLUP are presented in Table 2.
(table 2 around here)
Random cross-validation using the whole set of lines (N=679) shows moderate predictive ability for yield and protein content (0.45-0.50), and good to very good predictive ability for all quality and malting-related traits. In particular, predictive abilities of traits measured by the micro-malting test (last four rows) are all larger than 0.65, and up to 0.80 for friability. This is very encouraging for the prospect of screening more candidates efficiently (at lower cost) and/or at an earlier stage in the breeding scheme, thereby enabling a faster genetic gain for malting quality traits.
Columns 2 and 3 show predictive abilities in random cross-validation using lines from a single breeder + founder lines, i.e. what a single breeder can hope to achieve on its own, without sharing data with another breeder. The differences are 1) a smaller population size compared to column 1 (N=359 and N=410, respectively), which is expected to yield lower predictive abilities, and 2) phenotypes from a single breeder come from the same set of environments, which is expected to give higher repeatability (broad-sense heritability) of the measured traits. These two effects are likely to balance each other, since predictive abilities are nearly as good as those in column 1, i.e. when using the whole dataset, and even higher for some traits of Breeder 1, despite the smaller size of its available material (N=359 vs N=410 for Breeder 2).
Column 4 shows predictive abilities obtained by cross-validation within a very small training set (N=95), made of the founder lines evaluated in common by both breeders. Although they are more variable than those obtained with the largest training set (standard deviation 2-4 times larger), they are unexpectedly large, particularly for malting traits.
To assess whether the predictive ability obtained with subsets of lines is due to training set size, we randomly sampled subsets of limited size from the whole dataset. Figure 3 shows the predictive abilities obtained for Yield (Figure 3a) and Friability (Figure 3b) using random sampling vs. predetermined subsets (single breeder and/or founder lines).
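Random cross-validation with GBLUP can be sketched as follows. This is a minimal illustration on simulated data, not the pipeline used for Table 2: the genomic relationship matrix follows a VanRaden-type formula, the variance ratio is fixed from an assumed heritability `h2` instead of being estimated (e.g. by REML), and all function names and simulation settings are hypothetical.

```python
import numpy as np


def vanraden_g(markers):
    """Genomic relationship matrix from markers coded 0/1/2 (VanRaden-type)."""
    M = np.asarray(markers, dtype=float)
    p = M.mean(axis=0) / 2.0  # allele frequencies
    Z = M - 2.0 * p           # centered marker scores
    return Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))


def gblup_predict(G, y, train, test, h2):
    """Predict test lines from training lines; lambda = s2_e / s2_g is ASSUMED from h2."""
    lam = (1.0 - h2) / h2
    mu = y[train].mean()
    G_tt = G[np.ix_(train, train)]
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train)), y[train] - mu)
    return mu + G[np.ix_(test, train)] @ alpha


def cv_predictive_ability(G, y, k=10, h2=0.5, seed=0):
    """Mean and SD over k folds of cor(predicted, observed) in the validation fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    abilities = []
    for test in folds:
        train = np.setdiff1d(np.arange(len(y)), test)
        pred = gblup_predict(G, y, train, test, h2)
        abilities.append(np.corrcoef(pred, y[test])[0, 1])
    return float(np.mean(abilities)), float(np.std(abilities))


# Simulated example: 300 lines, 100 markers, additive trait with h2 = 0.8.
rng = np.random.default_rng(42)
n_lines, n_markers, h2 = 300, 100, 0.8
freqs = rng.uniform(0.2, 0.8, size=n_markers)
M = rng.binomial(2, freqs, size=(n_lines, n_markers))
g = M @ rng.normal(size=n_markers)
noise_sd = np.sqrt(g.var() * (1.0 - h2) / h2)
y = g + rng.normal(0.0, noise_sd, size=n_lines)

G = vanraden_g(M)
mean_pa, sd_pa = cv_predictive_ability(G, y, k=10, h2=h2)
print(round(mean_pa, 2), round(sd_pa, 2))
```

Predictive ability is taken here as the correlation between predicted and observed values in each validation fold, averaged over folds. Restricting `G` and `y` to a subset of lines before calling `cv_predictive_ability` mimics the single-breeder and founder-only columns.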
(Fig 3 around here)
As expected from theory, predictive ability decreases with sample size when sampling is random, while its variability increases. Using training sets from a single breeder and/or registered varieties leads to contrasting results. For both traits, PA from Breeder 2 + founder lines is close to that of random samples of similar size, while Breeder 1 samples give slightly higher PA, and founder lines only (N=95) give higher PA than expected from random samples of the same size. The advantage of founder lines over random sampling is particularly pronounced for yield, with the PA of the 95 founder lines being higher than that obtained by cross-validation using the whole population.
These differences between the predictive abilities of training sets of similar size can hardly be attributed to the average coancestry between lines of the training set (and of the validation set, since it is randomly sampled in cross-validation). Indeed, the kinship coefficients (after normalization of the K matrix from the A.mat function) within each subset do not differ much from each other, at least when considering their average: 0.195, 0.200 and 0.201 for Breeder 2, Breeder 1 and founders, respectively. There is no clear-cut structure among the lines, and none that fits the a priori grouping by breeder origin (Suppl. Figure 3).
Another explanation could be that founder lines were evaluated by both breeders, and thus in 5 locations each year, instead of only 2 or 3 locations for each breeder's own lines. This should be visible in the broad-sense heritabilities estimated from a single subset of lines, which are shown in Suppl. Table 1. Heritabilities estimated on founder lines only are always higher than when estimated on the whole dataset. This may partly explain the higher predictive ability shown in Figure 3a. Moreover, heritabilities estimated in Breeder 2's material are always lower than those estimated from Breeder 1's lines, which is consistent with the relative position of predictive abilities in Figure 3.
Figure 4 shows the effect of the number of randomly selected markers on predictive ability for the same two traits. As expected, predictive ability increases and its standard error decreases as the number of markers increases, up to a plateau reached with as few as 2,000 markers, which are enough to achieve predictive abilities nearly as high and as reliable as those obtained with the full marker data. The most likely explanation is that linkage disequilibrium between each of the 2,000 markers and its neighbors extends far enough to capture the effect of any QTL lying between them. To test this hypothesis, we estimated the decay of linkage disequilibrium with physical distance between markers.
Suppl. Figure 4 illustrates this decay on chromosome 1H.
(Fig 4 around here)
Although LD seems to decay quite rapidly at the scale of a whole chromosome, on average (green curve) it remains greater than 0.3 up to approximately 2 Mb. Given the size of the barley genome, 4,250 Mb in our data, approximately 2,100 regularly spaced markers (4,200/2) would achieve complete coverage of the genome at an LD threshold of 0.3. This agrees with our empirical finding that prediction accuracy is nearly optimal with M = 2,000 markers.
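The coverage argument above is simple arithmetic and can be written down as a short sketch. The helper names are hypothetical, and using the squared correlation between marker scores as the LD measure is an assumption, since the LD statistic is not named in this section.

```python
import math

import numpy as np


def ld_r2(marker_a, marker_b):
    """LD between two markers as the squared correlation of their scores (assumed measure)."""
    r = np.corrcoef(marker_a, marker_b)[0, 1]
    return float(r ** 2)


def markers_for_coverage(genome_mb, ld_extent_mb):
    """Number of regularly spaced markers so that the spacing equals the LD extent."""
    if ld_extent_mb <= 0:
        raise ValueError("ld_extent_mb must be positive")
    return math.ceil(genome_mb / ld_extent_mb)


# Values quoted in the text: 4,250 Mb genome, LD above 0.3 up to about 2 Mb.
n_needed = markers_for_coverage(genome_mb=4250, ld_extent_mb=2)
print(n_needed)  # 2125, i.e. roughly 2,100 markers

# Toy marker scores for six inbred lines (coded 0/2), for illustration only.
m1 = np.array([0, 2, 0, 2, 2, 0])
m2 = np.array([0, 2, 0, 2, 0, 0])
print(round(ld_r2(m1, m2), 2))
```

The calculation assumes evenly spaced markers and a uniform LD extent along the genome, which is a simplification of the decay pattern.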
Table 3 shows the predictive abilities obtained with across-population validation, i.e. using pre-defined subsets for training and validation. Since there are no replicates, standard deviation of PA is not available in this case.
(Table 3 around here)
The size of the training set roughly decreases from left to right. As expected, PA decreases with the size of the training set, more rapidly than with random cross-validation (Table 2), particularly for Yield and protein content. However, PA remains within the range of practical usefulness for malting quality traits.
Values in column 1 are close to those of column 1 in Table 2, with training sets of similar size (N=612 in random 10-fold CV, N=569 in the BRE1+BRE2 subset). This is illustrated in Figure 5, which shows the predictive abilities for the 10 traits in the first columns of Table 2 (random cross-validation) and Table 3 (across-population validation), i.e. with the largest possible size of the training set. Across-population validation gives predictive abilities that are slightly lower for agronomic traits, except Test weight, but very similar for malting traits, and even higher for extract rate.
(Fig5 around here)
To explore why malting-related traits are predicted more precisely and more robustly by molecular markers than agronomic traits, we tried genomic prediction with models that depart from the infinitesimal model used in GBLUP. LASSO and Bayes Cpi both estimate additive effects, but allow some markers to have null or very small effects, while a few have larger effects. Results are presented in Table 4.
(Table 4 around here)
As expected, given the limits of the design, Yield and protein content show moderate levels of both heritability estimates. This is also the case for heading date, a trait that is most often considered highly heritable. This is likely due to the relatively narrow range of variation in our studied material, made up only of six-rowed winter barley adapted to western Europe.
It is worth noting that the first column is the broad-sense heritability of plot means, also called repeatability, which depends on the design; its square root is assumed to be the theoretical upper limit of the predictive ability of any model.
The traits that show high heritability accordingly have high predictive abilities. Globally, there are very few differences in predictive abilities among the 4 models, although LASSO shows lower PA, particularly for the least heritable traits, namely yield and protein content. Bayes Cpi gives PA very similar to those of GBLUP, sometimes slightly but not significantly higher, the difference often being in the third digit, i.e. within the range of 2 standard deviations. By comparison, EGBLUP, which aims to model first-order epistatic interactions, has higher PA than GBLUP, which only accounts for additive marker effects, for most traits, sometimes with a significant improvement (second digit), particularly for yield.
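The difference between GBLUP and EGBLUP can be illustrated with a minimal sketch, under stated assumptions. In the usual EGBLUP formulation, additive-by-additive epistasis is modeled with a second relationship matrix equal to the Hadamard (element-wise) product of the additive genomic relationship matrix with itself. The variance weights below are arbitrary placeholders, not estimates, and the data and the helper `kernel_blup_predict` are hypothetical. How the models in Table 4 were actually fitted is not described in this section.

```python
import numpy as np


def kernel_blup_predict(kernels, weights, y, train, test, s2_e=1.0):
    """BLUP with one or several relationship matrices and FIXED variance weights.

    kernels : list of (n, n) relationship matrices
    weights : assumed variance components, one per kernel
    s2_e    : assumed residual variance
    """
    K = sum(w * k for w, k in zip(weights, kernels))
    mu = y[train].mean()
    V = K[np.ix_(train, train)] + s2_e * np.eye(len(train))
    alpha = np.linalg.solve(V, y[train] - mu)
    return mu + K[np.ix_(test, train)] @ alpha


# Simulated inbred lines coded 0/2 and a placeholder phenotype.
rng = np.random.default_rng(2)
n_lines, n_markers = 120, 60
M = rng.integers(0, 2, size=(n_lines, n_markers)) * 2.0
p = M.mean(axis=0) / 2.0
Z = M - 2.0 * p
G = Z @ Z.T / (2.0 * np.sum(p * (1.0 - p)))  # additive relationships
H = G * G                                    # Hadamard product: additive x additive epistasis
y = rng.normal(size=n_lines)

train, test = np.arange(100), np.arange(100, 120)
pred_gblup = kernel_blup_predict([G], [1.0], y, train, test)
pred_egblup = kernel_blup_predict([G, H], [1.0, 0.5], y, train, test)
print(pred_gblup[:3], pred_egblup[:3])
```

In a real analysis the variance components attached to `G` and `H` would be estimated from the data, and the comparison between models would rest on cross-validated predictive abilities.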