Identify Superior Parental Lines for Biparental Crossing via Genomic Prediction: Rice as an Example

Background: A set of superior parental lines is the key to high-performing recombinant inbred lines (RILs) for biparental crossing in a rice breeding program. The number of possible crosses in such a breeding program is often far greater than the number that breeders can handle in the field. A practical parental selection method via genomic prediction (GP) is therefore developed to help breeders identify a set of superior parental lines from a candidate population before field trials.

Results: Parental selection via GP often involves truncation selection, that is, selecting the top fraction of accessions based on their genomic estimated breeding values (GEBVs). However, truncation selection inevitably causes a loss of genomic diversity in the breeding population. To preserve genomic variation, the selection of closely related accessions should be avoided. We first proposed a new index to quantify the genomic diversity of a set of candidate accessions. Then, we compared the performance of three classes of strategy for parental selection: those that consider (a) GEBV only, (b) genomic diversity only, and (c) both GEBV and genomic diversity. We analyzed two rice (Oryza sativa L.) genome datasets for the comparison. The results show that the strategies considering both GEBV and genomic diversity have the best or second-best performance for all the traits analyzed in this study.

Conclusion: Combining GP with Monte Carlo simulation can be a useful means of parental selection for rice pre-breeding programs. Different strategies can be implemented to identify a set of superior parental lines from a candidate population. In consequence, the strategies considering both GEBV and genomic diversity that can balance the starting GEBV average with maintenance of genomic diversity should be

where $r$ is the recombination rate and $d$ is the linkage distance between SNP markers A and B. Using a series of Bernoulli trials with the estimated recombination rates, the crossovers on each chromosome were simulated to yield the sequence of a gamete; two gametes were then paired to produce the genotype data for the progeny.
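The gamete simulation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the Haldane mapping function is an assumed choice for converting linkage distance into a recombination rate, the symbols `r` and `d` are illustrative, and the function names (`haldane_recombination_rate`, `simulate_gamete`, `self_pollinate`) are hypothetical.

```python
import numpy as np

def haldane_recombination_rate(distance_morgans):
    """Assumed mapping function (Haldane): r = (1 - exp(-2 d)) / 2."""
    d = np.asarray(distance_morgans, dtype=float)
    return 0.5 * (1.0 - np.exp(-2.0 * d))

def simulate_gamete(hap_a, hap_b, rec_rates, rng):
    """Draw one gamete for a single chromosome.

    hap_a, hap_b : 0/1 allele arrays for the two parental haplotypes (length m).
    rec_rates    : recombination rates between adjacent markers (length m - 1).
    """
    hap_a = np.asarray(hap_a)
    hap_b = np.asarray(hap_b)
    m = hap_a.shape[0]
    # A fair Bernoulli draw picks the strand the gamete starts on.
    start = rng.integers(0, 2)
    # One Bernoulli draw per marker interval: True means a crossover occurs there.
    crossovers = rng.random(m - 1) < rec_rates
    # Strand at each marker: the start strand, flipped once per preceding crossover.
    strand = (start + np.concatenate(([0], np.cumsum(crossovers)))) % 2
    return np.where(strand == 0, hap_a, hap_b)

def self_pollinate(hap_a, hap_b, rec_rates, rng):
    """Pair two independent gametes from the same plant to form one progeny."""
    return (simulate_gamete(hap_a, hap_b, rec_rates, rng),
            simulate_gamete(hap_a, hap_b, rec_rates, rng))
```

For a genome with several chromosomes, the same function would be applied to each chromosome independently and the resulting segments concatenated.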

GBLUP Model
We considered the following single-trait GBLUP model for GP:
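As a point of reference, a common way to write a single-trait GBLUP model is sketched below. The symbols ($\mathbf{y}$, $\mu$, $\mathbf{g}$, $\mathbf{e}$, $\mathbf{K}$, $\sigma_g^2$, $\sigma_e^2$, $n$) are assumed here for illustration and may not match the authors' notation in Eq. [1].

```latex
% Assumed notation: y = phenotypic values, mu = constant term,
% g = genotypic values, e = residuals, K = genomic relationship matrix.
\begin{aligned}
\mathbf{y} &= \mu \mathbf{1}_n + \mathbf{g} + \mathbf{e}, \\
\mathbf{g} &\sim N\!\left(\mathbf{0},\, \sigma_g^2 \mathbf{K}\right), \qquad
\mathbf{e} \sim N\!\left(\mathbf{0},\, \sigma_e^2 \mathbf{I}_n\right), \\
\hat{\mathbf{g}} &= \sigma_g^2 \mathbf{K}
  \left(\sigma_g^2 \mathbf{K} + \sigma_e^2 \mathbf{I}_n\right)^{-1}
  \left(\mathbf{y} - \hat{\mu} \mathbf{1}_n\right).
\end{aligned}
```

In this form the last line is the best linear unbiased predictor of $\mathbf{g}$ when the variance components are treated as known; in practice they are replaced by estimates.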

The GEBV for the breeding population is $\hat{\mathbf{g}}$ plus the estimate of the constant term $\mu$.

The Index to Quantify Genomic Diversity
Let $\mathbf{g}_0$ be the vector of genotypic values and $\mathbf{K}_0$ be the genomic relationship matrix for a particular set of accessions with size $n_0$. According to the GBLUP model of Eq. [1], the covariance matrix for $\mathbf{g}_0$ is given by:

$$\operatorname{Var}(\mathbf{g}_0) = \sigma_g^2 \mathbf{K}_0 \qquad [2]$$

The determinant of the covariance matrix represents the overall variability for the genotypic values, which is calculated as:

$$\left|\sigma_g^2 \mathbf{K}_0\right| = \left(\sigma_g^2\right)^{n_0} \left|\mathbf{K}_0\right| \qquad [3]$$

Clearly, the determinant of Eq. [3] is proportional to the D-score defined in Eq. [4]. The D-score of Eq. [4] ranges from 0 to 1. For a fixed $n_0$, a subset of accessions chosen from a breeding population that achieves the maximal D-score will have greater genomic diversity than competing subsets of size $n_0$. The concept of the D-score is adopted from optimum experimental designs (Atkinson and Donev 1992).

We required a highly efficient algorithm to search a candidate population for the subset of accessions that achieves the maximal D-score. For this task we used a genetic algorithm, which is an exchange algorithm with three operators: roulette wheel selection, crossover, and mutation (Whitley 1994).

To evaluate a variety of strategies in determining parental lines, we carried out the following steps.

Step 1: For a specific target trait, we used all of the phenotypic values available from the rice genome dataset to build the corresponding GBLUP model of Eq. [1].

Step 2: We predicted the GEBVs for all of the accessions in the dataset through the

Step 3: For each subset of 10 accessions determined by the seven strategies, we crossed every pair of parental lines to produce 45 F1 hybrids. From this point on, we simulated the genotype data for successive generations of progeny through Monte Carlo simulation. Each of the 45 F1 hybrids produced 60 F2 individuals by self-pollination, resulting in 2700 F2 individuals. After obtaining the GEBVs for the 2700 F2 individuals via the trained GBLUP model of Step 1, we retained the top 45 F2 individuals. Again, we used these 45 F2 individuals to produce 2700 F3 individuals (each F2 individual produced 60 F3 individuals) and retained the top 45 F3 individuals. We repeated the same procedure until 2700 F10 individuals were produced, which are assumed to form a fixed population.
Step 4: For the resulting 2700 F10 individuals generated according to each strategy, we found the best F10 RIL, that is, the one with the top GEBV.

A flowchart of the procedure is displayed in Figure 1. We repeated this analysis procedure 30 times to obtain the best F10 RIL from each repetition for each strategy. The average of the GEBVs for the best F10 RILs was then calculated and used as the measure of efficiency for the strategy. Note that for the traits BRSW, FPP, and PNPP in Dataset I and YLD in Dataset II, larger GEBVs are preferable (i.e., these traits follow the rule "the larger, the better"). For the remaining five traits, FTAA, FTAF, and PH in Dataset I and FT and PH in Dataset II, the rule is "the smaller, the better".
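The crossing, selfing, and truncation scheme of Steps 3 and 4 can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' implementation: the genome is a single chromosome with one constant recombination rate between adjacent markers, parents are treated as fully inbred lines, and the GEBV is computed from additive marker effects as a stand-in for the trained GBLUP model. All function and parameter names are hypothetical.

```python
import itertools
import numpy as np

def gamete(hap_a, hap_b, rec_rate, rng):
    """One gamete from a diploid plant; a crossover occurs between adjacent
    markers with probability rec_rate (single chromosome, for brevity)."""
    m = hap_a.shape[0]
    switches = rng.random(m - 1) < rec_rate
    strand = (rng.integers(0, 2) + np.concatenate(([0], np.cumsum(switches)))) % 2
    return np.where(strand == 0, hap_a, hap_b)

def gebv(plant, effects):
    """Stand-in GEBV: additive marker effects applied to allele dosages (0/1/2)."""
    return float((plant[0] + plant[1]) @ effects)

def best_f10_gebv(parents, effects, rng, n_progeny=60, n_keep=45,
                  last_generation=10, rec_rate=0.1, larger_is_better=True):
    """Cross all parent pairs, then self and truncate each generation up to F10."""
    # F1 hybrids: one haplotype from each inbred parent, for every pair of parents.
    plants = [(parents[i], parents[j])
              for i, j in itertools.combinations(range(len(parents)), 2)]
    sign = 1.0 if larger_is_better else -1.0
    for generation in range(2, last_generation + 1):
        # Every retained plant is selfed to give n_progeny offspring.
        progeny = [(gamete(a, b, rec_rate, rng), gamete(a, b, rec_rate, rng))
                   for a, b in plants for _ in range(n_progeny)]
        scores = np.array([sign * gebv(p, effects) for p in progeny])
        if generation == last_generation:
            # Step 4: report the GEBV of the most desirable line in the last generation.
            return sign * scores.max()
        # Keep the n_keep most desirable plants as parents of the next generation.
        top = np.argsort(scores)[::-1][:n_keep]
        plants = [progeny[k] for k in top]

# Toy usage: 10 random inbred lines with 200 markers and random marker effects.
rng = np.random.default_rng(1)
toy_parents = rng.integers(0, 2, size=(10, 200))
toy_effects = rng.normal(size=200)
print(best_f10_gebv(toy_parents, toy_effects, rng))
```

With the default arguments, 10 parents give 45 F1 hybrids and each generation from F2 to F10 contains 45 × 60 = 2700 individuals, matching the counts in the text. Repeating the call 30 times and averaging the returned values would mirror the efficiency measure described above.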

Calculation of Genetic Gain
To gain an understanding of the genetic improvement on a target trait using different strategies, we estimated genetic gain as

Strategies Comparison Based on the Best F10 RILs
The GEBV averages of the best F10 RILs from the 30 repetitions using each of the seven strategies are displayed in Tables 1 and 2 [...] increases for the three strategies considering the genomic diversity only. Also, the desirability declines from the parental generation to the F1 generation for every strategy, due to the heterogeneous alleles in F1 hybrids.

To explore the extent to which the top two accessions contribute to the subset of ten parental lines determined by the four strategies GEBV-O, GEBV-GD-30, GEBV-GD-50, and GEBV-GD-100, we compared each subset with a reduced group consisting of the F1 hybrids whose parental lines contain at least one of the top two accessions for that subset. Every reduced group consists of 17 F1 hybrids. We then followed the same analysis procedure to obtain the GEBV averages for the best F10 RILs from 30 repetitions based on the reduced group. The results are displayed in Table 3 [...] Table 5.

It is known that Dataset I contains more genomic diversity than Dataset II, since it consists of five subpopulations and one admixed group. The greater genomic diversity of Dataset I could lead, for some traits, to a bigger difference between the strategies considering both GEBV and genomic diversity and the strategy considering GEBV only. For example, the difference in the GEBV averages of the best F10 RILs between GEBV-GD-50 and GEBV-O is about -9.06 days for FTAA and -2.55 days for FTAF in Dataset I (Table 1), but the corresponding difference is just -0.09 days for FT in Dataset II (Table 2). However, flowering time is very sensitive to the environment, so genomic diversity alone cannot account for the different results between these two datasets. More interestingly, the greater genomic diversity of Dataset I could lead to a larger genetic gain for a specific trait. From Table 4, the mean of the genetic gains using the seven strategies for PH in Dataset I is -42.15 cm. But, from Table 5