Improving selection efficiency of crop breeding with a genomic prediction aided partial phenotyping strategy

Increasing the number of environments for phenotyping of crop lines in earlier stages of breeding programs can improve selection accuracy. However, this is often not feasible due to cost. In our study, we investigated a partial phenotyping strategy that does not test all entries in all environments, but instead capitalizes on genomic prediction to predict missing phenotypes in additional environments without extra phenotyping expenditure. The breeders' main interest, response to selection, was directly simulated to evaluate the effectiveness of the partial genomic phenotyping strategy in a wheat dataset. Whether the partial phenotyping strategy resulted in a greater selection response depended on the correlations of phenotypes between environments. The partial phenotyping strategy consistently showed statistically significantly higher simulated responses to selection, compared to complete phenotyping, when the majority of completely phenotyped environments were negatively correlated and any extension environment was highly positively correlated with any of the completely phenotyped environments. Our results indicate that genomics-based partial phenotyping can improve selection response at middle stages of crop breeding programs.

[...] response in the context of multi-environment trials. We also investigate the relationship among environments and how this affects the effectiveness of the proposed genomics-assisted partial phenotyping strategy.


Introduction
Genomic selection is a promising tool to assist plant breeding by accelerating selection gain per

Genotypic data and correlations between environments
The genotypic data of the 189 lines used in this study were drawn from the genotypic data of 2,412 wheat lines fingerprinted with 41,666 single nucleotide polymorphisms (SNPs) from a 90K SNP array in He et al. (2019). As the number of genotypes was reduced, SNPs were refiltered by removing those with a minor allele frequency of less than 0.05, which left 32,800 SNPs for subsequent analyses. The genetic diversity of the 189 genotypes was inspected by cluster analysis using Rogers' distance (Rogers 1972) estimated from the 32,800 SNPs. The correlation between environments was estimated as the Pearson correlation coefficient between the BLUEs of the 189 genotypes in different environments.
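The three steps described here (minor allele frequency filtering, Rogers' distance and between-environment Pearson correlations) can be sketched in a few lines of NumPy. This is a minimal illustration and not the authors' pipeline. It assumes SNPs are coded as 0/1/2 copies of one allele with no missing calls, in which case Rogers' distance for biallelic markers reduces to the mean absolute difference in allele frequency between two lines. The function names are hypothetical.

```python
import numpy as np


def filter_by_maf(genotypes, threshold=0.05):
    """Drop SNP columns whose minor allele frequency is below `threshold`.

    genotypes: (n_lines, n_snps) array coded as 0/1/2 allele copies (assumed coding).
    Returns the filtered matrix and the boolean mask of retained SNPs.
    """
    allele_freq = genotypes.mean(axis=0) / 2.0
    maf = np.minimum(allele_freq, 1.0 - allele_freq)
    keep = maf >= threshold
    return genotypes[:, keep], keep


def rogers_distance(genotypes):
    """Pairwise Rogers' distance for biallelic markers coded 0/1/2.

    For two alleles per locus the distance is the mean over loci of
    |p_x - p_y|, where p is the within-line allele frequency (0, 0.5 or 1).
    """
    freq = genotypes / 2.0
    n_lines = freq.shape[0]
    dist = np.zeros((n_lines, n_lines))
    for i in range(n_lines):  # row-wise loop keeps memory use modest
        dist[i] = np.abs(freq - freq[i]).mean(axis=1)
    return dist


def environment_correlations(blues):
    """Pearson correlations between environments.

    blues: (n_lines, n_environments) array of per-environment BLUEs.
    """
    return np.corrcoef(blues, rowvar=False)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    toy_genotypes = rng.integers(0, 3, size=(10, 200))  # toy data, not the wheat panel
    filtered, kept = filter_by_maf(toy_genotypes)
    print(filtered.shape, rogers_distance(filtered).shape)
    toy_blues = rng.normal(size=(10, 6))
    print(environment_correlations(toy_blues).round(2))
```

A distance matrix of this kind can be passed to any hierarchical clustering routine for the cluster analysis; which routine was used in the study is not stated in this section.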

Multi-environment genomic prediction model
A multi-environment genomic prediction model explicitly describing genotype-by-environment interactions was used:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}_E\mathbf{e} + \mathbf{Z}_G\mathbf{g} + \mathbf{ge} + \boldsymbol{\varepsilon}$$

where $m$ is the number of environments, $n$ is the number of genotypes, $\mathbf{y}$ is an $(m \times n)$-dimensional vector of BLUEs of genotypes in each environment, $\mu$ is the common intercept, $\mathbf{e}$ is the $m$-dimensional vector of environment main effects, $\mathbf{g}$ is the $n$-dimensional vector of additive genetic main effects of genotypes, $\mathbf{ge}$ is the $(m \times n)$-dimensional vector of genotype-by-environment interaction effects, $\boldsymbol{\varepsilon}$ is the random residual, $\mathbf{Z}_E$ is the incidence matrix for $\mathbf{e}$, and $\mathbf{Z}_G$ is the incidence matrix for $\mathbf{g}$ [...] The number of iterations was fixed at 30,000 and the first 5,000 iterations were discarded as burn-in.
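To make the role of the incidence matrices concrete, the following sketch builds them for a balanced layout and assembles the linear predictor from the effects named in the model description. It is illustrative only: the symbol names (`Z_E`, `Z_G`, `e`, `g`, `ge`), the environment-major ordering of the response vector and the assumption that every genotype is observed in every environment are choices made for this example, and the sketch says nothing about how the model was actually fitted.

```python
import numpy as np


def incidence_matrices(m, n):
    """Incidence matrices for a balanced design with m environments and n genotypes.

    Assumes the response vector is ordered environment by environment
    (all n genotypes of environment 1, then environment 2, and so on).
    """
    Z_E = np.kron(np.eye(m), np.ones((n, 1)))  # (m*n, m): maps environment effects
    Z_G = np.kron(np.ones((m, 1)), np.eye(n))  # (m*n, n): maps genotype effects
    return Z_E, Z_G


def linear_predictor(mu, e, g, ge):
    """Intercept + environment + genotype + genotype-by-environment effects."""
    Z_E, Z_G = incidence_matrices(len(e), len(g))
    return mu + Z_E @ e + Z_G @ g + ge


if __name__ == "__main__":
    e = np.array([1.0, -1.0])        # two toy environments
    g = np.array([0.5, 0.0, -0.5])   # three toy genotypes
    ge = np.zeros(e.size * g.size)   # no interaction in this toy example
    print(linear_predictor(10.0, e, g, ge))
```

With masked phenotypes, the rows of these matrices that correspond to unobserved genotype-environment cells would simply be dropped from the training data and predicted instead.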

Partial phenotyping strategy
We compared the selection response of complete phenotyping trials in fewer environments with that of a partial genomic phenotyping strategy in additional environments. To this end, all possible combinations of three environments out of the total six environments were used as the complete phenotyping trials, which retained all phenotypic values (BLUEs per environment). Phenotypic values in combinations of four, five and six environments (there is just one combination using all six environments) were proportionally masked to create the partial phenotyping trials. The percentage of phenotypic values retained in the 4-, 5- and 6-environment combinations was 75%, 60% and 50%, respectively, which made the phenotyping intensity in all 3-, 4-, 5- and 6-environment combinations equivalent. Thus, the number of BLUEs and the amount of phenotype data collected were the same in all scenarios. There were twenty different combinations of three environments out of the total six environments. Each 3-environment combination was extended to three 4-environment and three 5-environment combinations by including one or two environments from the remaining three environments. According to the phenotyping proportions (75%, 60% and 50%) of the 4-, 5- and 6-environment combinations, phenotypic values in each 4- and 5-environment combination were randomly masked one hundred times, and those in the 6-environment combination were randomly masked three hundred times. This resulted in the same replication level (300) for each 3-environment combination and its three extended 4- and 5-environment combinations, as well as the single 6-environment combination. The random masking strategy of phenotypic values was based on cross-validation strategy two (CV2) in He et al. (2019). Specifically, in this study, each genotype had six environment-specific BLUEs. We first attempted to randomly mask one BLUE per genotype in the 4-, 5- and 6-environment combinations to make the phenotyping proportions the same as in the 3-environment complete phenotyping trial. If masking one BLUE was insufficient to meet the required phenotyping proportion, another BLUE per genotype was masked until the required phenotyping proportion was reached.
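The design arithmetic above can be checked with a short sketch: 20 base combinations, three extensions each to four and five environments, and retained proportions of 3/4, 3/5 and 3/6. The masking function is one plausible reading of the procedure, assuming that every genotype keeps exactly three randomly chosen BLUEs in an extended combination; the exact randomisation used in the study may differ, and all names are hypothetical.

```python
import itertools

import numpy as np

ENVIRONMENTS = tuple(range(6))  # placeholder labels for the six environments


def extended_combinations(base, k):
    """All k-environment combinations that contain the 3-environment `base`."""
    remaining = [env for env in ENVIRONMENTS if env not in base]
    return [
        tuple(sorted(base + extra))
        for extra in itertools.combinations(remaining, k - len(base))
    ]


def mask_blues(n_genotypes, n_envs, n_kept=3, rng=None):
    """Boolean (genotype x environment) matrix; True marks a retained BLUE.

    Assumption: each genotype keeps `n_kept` randomly chosen environments,
    so n_envs - n_kept BLUEs per genotype are masked.
    """
    rng = np.random.default_rng() if rng is None else rng
    observed = np.zeros((n_genotypes, n_envs), dtype=bool)
    for i in range(n_genotypes):
        kept = rng.choice(n_envs, size=n_kept, replace=False)
        observed[i, kept] = True
    return observed


if __name__ == "__main__":
    bases = list(itertools.combinations(ENVIRONMENTS, 3))
    print(len(bases), "complete-phenotyping combinations")
    for k in (4, 5, 6):
        n_ext = len(extended_combinations(bases[0], k))
        observed = mask_blues(189, k, rng=np.random.default_rng(1))
        print(k, "environments:", n_ext, "extension(s), retained",
              f"{observed.mean():.0%}", "with", observed.sum(), "BLUEs")
```

Under this reading, every scenario uses 3 × 189 = 567 BLUEs, which is the sense in which the phenotyping intensity is equivalent across the 3-, 4-, 5- and 6-environment combinations.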

Response to selection
The genomic prediction model, which is a linear mixed model, can be used to directly estimate the response to selection through a simulation-based approach following Piepho and Möhring (2007). Briefly, the multi-environment genomic prediction model was fitted using phenotypic records of the complete phenotyping trial (3-environment combination) and phenotypic records of the partial phenotyping trials (4-, 5- and 6-environment combinations). We were mainly interested in the relationship between the true genetic main effect $g$ and its best linear unbiased prediction (BLUP) $\hat{g}$, because the selection was based on the BLUP, while the response of [...]

The repeatability of each environment was above 0.4, indicating that the phenotypic data were of high quality (Fig. 1a). The distribution of BLUEs in different environments was approximately normal (Fig. 1b). Several large families were identified by clustering analysis and linkages existed across families (Supplementary Fig. S1). The Rogers' distance values between any pair of genotypes ranged from 0.01 to 0.53.
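The idea behind a simulated response to selection, as described under "Response to selection", can be illustrated with a toy Monte Carlo experiment: draw true genetic effects and their predictions jointly, select the top fraction on the predictions, and average the true effects of the selected genotypes. The sketch below does not reproduce the procedure of Piepho and Möhring (2007) or the fitted multi-environment model. It substitutes a standardised bivariate normal with an assumed correlation (`accuracy`) between true and predicted effects, and every name and default value is hypothetical.

```python
import numpy as np


def simulated_response(accuracy, selected_fraction, n_genotypes=189,
                       n_sims=2000, seed=0):
    """Mean true genetic effect of genotypes selected on their predictions.

    accuracy: assumed correlation between true effects and predictions.
    selected_fraction: proportion of genotypes selected (e.g. 0.1 to 0.9).
    True effects are standardised, so the response is in genetic SD units.
    """
    rng = np.random.default_rng(seed)
    n_selected = max(1, int(round(selected_fraction * n_genotypes)))
    responses = np.empty(n_sims)
    for s in range(n_sims):
        g = rng.standard_normal(n_genotypes)      # true genetic effects
        noise = rng.standard_normal(n_genotypes)
        # predictions correlated with g at the assumed accuracy
        g_hat = accuracy * g + np.sqrt(1.0 - accuracy**2) * noise
        top = np.argsort(g_hat)[-n_selected:]     # select on predictions
        responses[s] = g[top].mean()              # realised gain in true effects
    return responses.mean()


if __name__ == "__main__":
    for fraction in (0.1, 0.5, 0.9):
        low = simulated_response(0.4, fraction)
        high = simulated_response(0.8, fraction)
        print(f"selected {fraction:.0%}: response {low:.2f} (r=0.4) vs {high:.2f} (r=0.8)")
```

In this toy setting, a strategy is better whenever it raises the correlation between predicted and true genetic effects at the same phenotyping cost, which is the comparison the study makes between complete and partial phenotyping.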

Simulated response to selection
Twenty-one 4-environment combinations with partial phenotyping applied had statistically significantly higher responses to selection than their equivalent 3-environment combination with complete phenotyping under each selection ratio, i.e. 10%-90% (Table 1). For the 5- and 6-environment combinations, this number was twenty-three and seven, respectively (Table 2; Table 3) [...]

Our study investigated the potential of a genomics-assisted partial phenotyping strategy via simulated selection responses. Partial phenotyping can lead to a similar or greater response and provides information on genotype performance in more environments, compared to fully replicated trials. As the level of phenotyping (i.e. the number of observations) was the same in all scenarios, the advantage of partial phenotyping was achieved with a similar budget. While families existed in our population, our partial phenotyping strategy tested each genotype in at least one environment. Consequently, as all genotypes were included in the reference set, the family structure did not bias genomic prediction through differences in relatedness across the phenotype masking scenarios. The correlations between environments in our study included high (e.g. 0.84), moderate (e.g. [...] Table 4 and can be used to understand when partial phenotyping can be beneficial.

Group 1 had a highly positive correlation (0.84) between environments and the partial phenotyping strategy did not result in additional selection response, regardless of the number of expansion environments added (Tables 1-4).

In group 2, all pairwise correlations were positive and when the extended environment was highly positively correlated (0.84) with any of the complete phenotyping environments, the partial phenotyping strategy was always superior (Table 1; Supplementary Table S1).
However, this superiority was not maintained when additional environment(s) were included that were only poorly correlated with the complete phenotyping environments (Table 2; Table 3; [...])

[...] (Table 1; Table 2; Table 4). This suggests that the robustness of group 3 is less than that of groups 1 and 2, and the superiority of including two expansion environments in group 3 depends on the relationship between the two expansion environments. In combinations 17-18, no expansion environment was highly positively correlated with any of the complete phenotyping environments. However, two expansion environments were highly correlated (0.84), i.e. Year2015_TOS1 and Year2015_TOS3, and each was moderately positively correlated with one of the complete phenotyping environments, which made the partial phenotyping strategy superior (Table 2). In contrast, the corresponding 4-environment partial phenotyping scenarios on their own did not show superiority (Table 1).

For group 4, where one pair of environments had a positive correlation and two pairs a negative correlation, i.e. combinations 8-10, 14-16 and 20, partial phenotyping resulted in a greater response when one expansion environment was highly correlated (0.84) or all expansion environments had moderate positive correlations with the complete phenotyping environments (Table 1). In some cases, such as combinations 16 and 20, even one extended environment with a moderate positive correlation with the complete phenotyping environments was superior (Table 1). This suggests that when environments are dissimilar the partial phenotyping strategy is particularly useful, a finding corroborated by the largest number of superior 5- and 6-environment combinations in group 4 (Table 2; Table 3).

Breeders are advised to consider the expected phenotypic correlation between environments when deciding whether genomics-assisted partial phenotyping is of value.
As shown in Table 4, when the environments projected for complete phenotyping contain a highly positive correlation, the partial phenotyping strategy does not increase selection response. For any other combination of complete phenotyping environments, adding one expansion environment that is highly positively correlated with any of the complete phenotyping environments will always be beneficial. When most complete phenotyping environments are negatively correlated, including more (≤3) expansion environments also consistently improved the response as long as one highly positively correlated expansion environment was added. It is worth noting that while adding one highly positively correlated expansion environment was of benefit, breeders could choose this environment for complete phenotyping if some prior knowledge was available, which would revert the combination to group 1. Nevertheless, adding positively correlated environments under partial phenotyping was generally of benefit (group 4, Table 1). However, in practice, breeders tend to choose environments that are distinct in order to select germplasm that is widely adapted.

Finally, although the budgets of a partial phenotyping strategy with different numbers of expansion environments are theoretically identical, the actual cost would rise if the number of environments was increased, regardless of trial size. Hence, breeders should assess the practicality of the genomics-assisted partial phenotyping strategy based on both the relationship between testing environments and the complexity of breeding program deployment.

Conclusion
Our study demonstrated that a genomics-assisted partial phenotyping strategy can improve selection effectiveness for crop breeding, especially at the middle stages of a breeding program when multi-environment trials are not feasible due to cost. The partial phenotyping strategy was optimal when most of the complete phenotyping environments were negatively correlated and at least one of the extension environments was highly positively correlated with any of the complete phenotyping environments.