Genome-wide association study and genomic prediction for yield and grain quality traits of hybrid rice

Genomic selection is an efficient tool for breeding selection, especially for quantitative traits controlled by multiple genes with low heritability. To validate the application of genomic selection in hybrid rice breeding, the yield and grain quality traits of 404 hybrid rice breeding lines were investigated, and the same accessions were genotyped using a 56 K SNP chip. All measured traits varied widely among the tested accessions, and most of the traits were correlated. GWAS identified 67 significant loci for the yield-related traits and 123 significant loci for the grain quality traits. Two of these loci were associated with increased grain yield but decreased grain quality. The genomic estimated breeding values (GEBVs) of all the yield and grain quality traits were calculated using 15 different prediction algorithms. Plant height, panicle length, thousand grain weight, grain length/width ratio, amylose content, and alkali value were more predictable than the other traits. However, the predictive accuracy of the GS models differed among traits. This study provides useful information for genomic selection of specific traits using appropriate markers and prediction models.


Introduction
The use of heterosis in hybrid rice has become increasingly important since the beginning of hybrid rice extension in China (Ma and Yuan 2015). Hybrid rice has contributed greatly to food security in China and the world. The average yield of rice in China increased from 3.5 ton/ha in 1975 to 6 ton/ha in 1995, and to 7 ton/ha in 2018 (FAOSTAT), and the grain quality of rice has been improved. About 50% of newly registered rice varieties in China (national approval) have grain quality of grade one (Lu et al. 2019). However, developing a hybrid variety by conventional breeding is time- and labor-consuming, even though marker-assisted selection (MAS) has been used. Future breeding of hybrid rice will benefit from the use of new breeding technology integrated with genetics, genomics, computational science, and artificial intelligence. Rice is a model species for genomic study of monocotyledonous plants. The genome of rice was fully sequenced in 2005 (International Rice Genome Sequencing Project and Sasaki 2005), and more than 3000 genes have been cloned and analyzed (Yao et al. 2018).
A large number of molecular markers have been developed for MAS of important traits such as plant height, blast resistance, leaf blight resistance, submergence tolerance, and fragrance (Jena and Mackill 2008). However, the success of MAS depends heavily on the heritability and genetic architecture of the selected traits. MAS is not effective for traits controlled by a large number of genes/QTLs with small effects. With the development of high-throughput sequencing and chip technology, genome-wide association study (GWAS) has been used to identify useful genes/QTLs for complex traits. A total of 130 associated loci were identified for 38 agronomic traits of 1,495 elite hybrid rice varieties and their inbred parental lines (Huang et al. 2015). QTLs related to heterosis and yield traits of hybrid rice were also identified (Zhen et al. 2017; Chen et al. 2019; Su et al. 2021). However, rice yield is mainly controlled by minor-effect loci and is therefore difficult to select for in the breeding process. Thus, genomic selection (GS), or genome-wide selection (GWS), has been proposed as a promising tool and applied to animal and plant genetic improvement (Meuwissen et al. 2001). GS achieves higher genetic gain than marker-assisted selection for complex traits controlled by a large number of QTLs (Crossa et al. 2017b). However, GS has not yet been successfully used in hybrid rice breeding.
Genomic selection uses the genotypes and phenotypes of target traits from individuals in a training population to establish prediction models, and then uses the models to predict the genomic estimated breeding values (GEBVs) of individuals in a test population from their genotypes (Crossa et al. 2017a). The approach assumes that, with high-density SNP markers distributed throughout the whole genome, at least one SNP is in linkage disequilibrium with each quantitative trait locus affecting the target trait, so that the effect of each QTL can be captured by SNP markers (Meuwissen 2007). The statistical models of genomic selection can be roughly divided into two categories. The first is the direct method, which treats the individual as a random effect and uses the genetic relationship matrix, constructed from the genetic information of the reference and predicted populations, as the variance-covariance matrix; variance components are estimated iteratively to obtain the predicted breeding value of each individual. The second is the indirect method, which first estimates the marker effects in the reference population and then sums these effects according to the genotypes of the prediction population to obtain the estimated breeding value of each individual (Zhang et al. 2011; Misztal and Legarra 2017). Different prediction models use different statistical methods; thus, the efficiency of the models needs to be compared and validated before they are used for breeding selection.
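The two categories can be illustrated with a minimal NumPy sketch. It is not the implementation used in this study: the data are simulated, the function names are hypothetical, and the shrinkage parameter lam is fixed by hand, whereas the direct method described above estimates variance components iteratively. With the same fixed lam, ridge-type marker effects (indirect route) and a relationship matrix built as ZZ' (direct route) give the same predicted values.

```python
import numpy as np

def marker_effect_gebv(Z_train, y_train, Z_test, lam):
    """Indirect route: estimate ridge marker effects, then sum them per individual."""
    n_markers = Z_train.shape[1]
    mu = y_train.mean()
    # Ridge solution of (Z'Z + lam*I) u = Z'(y - mu)
    lhs = Z_train.T @ Z_train + lam * np.eye(n_markers)
    u = np.linalg.solve(lhs, Z_train.T @ (y_train - mu))
    return mu + Z_test @ u

def relationship_gebv(Z_train, y_train, Z_test, lam):
    """Direct route: individuals as random effects with covariance proportional to ZZ'."""
    n_train = Z_train.shape[0]
    mu = y_train.mean()
    K_train = Z_train @ Z_train.T      # relationships among training individuals
    K_cross = Z_test @ Z_train.T       # relationships between test and training individuals
    weights = np.linalg.solve(K_train + lam * np.eye(n_train), y_train - mu)
    return mu + K_cross @ weights

# Simulated data (assumption: 60 individuals, 200 markers coded 0/1/2, then centered)
rng = np.random.default_rng(0)
Z = rng.integers(0, 3, size=(60, 200)).astype(float)
Z -= Z.mean(axis=0)
true_effects = rng.normal(0.0, 0.1, size=200)
y = Z @ true_effects + rng.normal(0.0, 1.0, size=60)

train, test = np.arange(50), np.arange(50, 60)
gebv_indirect = marker_effect_gebv(Z[train], y[train], Z[test], lam=50.0)
gebv_direct = relationship_gebv(Z[train], y[train], Z[test], lam=50.0)
print(np.allclose(gebv_indirect, gebv_direct))
```

The equivalence holds only for this ridge-type pair; models such as BayesB or random forest make different assumptions about marker effects and need not agree with GBLUP.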
Genomic selection has been successfully used in animal breeding programs to increase the rate of genetic gain of dairy cattle, pig, dairy goat, layer chicken, and fish (García-Ruiz et al. 2016; Samore and Fontanesi 2015; Mucha et al. 2015; Wolc et al. 2015; López et al. 2015). In recent years, simulations and experimental studies have been conducted to validate the efficiency of this method in plant breeding. In rice specifically, the predictive accuracy for heading date, culm length, panicle length, panicle number, grain length, and grain width varied from 0.4 to 0.8 in a population of 110 rice cultivars using nine prediction methods (Onogi et al. 2015). The highest predictive abilities for spikelets per panicle, heading date, plant height, and protein content were 0.44-0.7 in a diverse population of 413 rice inbred lines from 82 countries genotyped with a 44 K SNP chip (Isidro et al. 2015). The GEBVs of other traits such as grain shape, grain yield, nitrogen balance index, panicle weight, grain weight, and blast resistance have been predicted using inbred lines or cultivars (Spindel et al. 2015; Yabe et al. 2018; Iwata et al. 2015; Grenier et al. 2015; Hassen et al. 2018; Huang et al. 2019). Genomic prediction has also been conducted for grain yield, thousand grain weight, and indices of different traits of hybrid rice (Wang et al. 2017; Xu et al. 2014, 2018; Cui et al. 2020). The predicted GEBVs from different populations were similar; thus, genomic selection is a reliable method for rice breeding.
In this study, we investigated the yield-related traits and grain quality-related traits of 404 hybrid rice lines that were genotyped using a 56 K SNP chip, and conducted a genome-wide association study and genomic prediction for 20 traits using 15 statistical methods. The objectives of this study were to validate the predictive accuracy of different models and to find the best-fit statistical methods for prediction of different traits.

Field experiments and phenotyping data collection
A randomized complete block design (RCBD) with three replications was used for field experiments. For each hybrid, 250 seedlings were transplanted into a 13.3 m² plot (5 m × 2.66 m) at a spacing of 0.2 m × 0.266 m. At maturity, growth period (GP, days from seeding to harvest), number of tillers (TN), plant height (PH, cm), panicle length (PL, cm), number of grains per panicle (GN), spikelet fertility (SF, %), and thousand grain weight (TGW, g) were measured. Panicles from a 1 m² area were harvested and dried for calculating the grain yield (YLD, kg/ha).

Plant materials
The grain quality was evaluated following the industrial standard for quality of rice variety (NY/T 593-2013, Ministry of Agriculture, China). The evaluated traits include BRR (brown rice rate, %), WRR (white rice rate, %), WWRR (whole white rice rate or head rice rate, %), GL (grain length, mm), GLWR (grain length/width ratio), CP (chalk percentage, %), CD (chalk degree), AC (amylose content, %), GC (gel consistency, mm), ALK (alkali value), TRANS (transparency), and OG (overall grade of grain quality; grade 1 is the best, grade 3 is the worst). The phenotypic data of all the traits were collected each year for all the accessions.

Genotyping data collection
Rice seeds were germinated in petri dishes at 28 °C in an incubator. Leaf samples were collected from 2-week-old seedlings and ground in a mortar with liquid nitrogen, and genomic DNA was extracted using the standard CTAB extraction protocol (Doyle and Doyle 1987). The quality of each DNA sample was checked by electrophoresis on a 1% agarose gel, and the DNA concentration was measured using a UV-Vis spectrophotometer (NanoDrop 8000, Thermo Fisher Scientific, USA). These high-quality DNA samples were then used for fragmentation, hybridization with the 56 K SNP chip, and imaging in a GeneTitan Multi-Channel Instrument (Thermo Fisher Scientific, USA) following the user manual. The rice 56 K SNP chip was designed by Huazhi Biotechnology Co. Ltd. and includes 56,897 SNPs from the dataset of the 3000 rice genome project (3 K RGP) (Li et al. 2014).

GWAS and GS analysis
Basic statistical analysis and Pearson correlations among traits were calculated using Minitab 17 (Minitab LLC). GWAS was conducted using TASSEL 5.0 (Bradbury et al. 2007). Sites with more than 10% missing data and sites with a minor allele frequency below 0.05 were removed, and the filtered dataset containing 34,832 high-quality SNPs was used for the following analysis. A cladogram was created using the neighbor-joining algorithm in TASSEL, and the phylogenetic tree was displayed in the Archaeopteryx tree viewer. The kinship matrix with scaled IBS was generated from the genotyping data. A united data file with genotyping and phenotyping data (mean values of each accession) of the hybrids was created using union join. The united file, along with the kinship matrix, was analyzed for marker-trait association using a mixed linear model (MLM). The compression level was set to optimum level, and variance component estimation was set to P3D. The criterion for claiming a QTL was p < 1 × 10⁻⁴ (−log10 p > 4.0). The identified QTLs were named using the CGSNL nomenclature (McCouch and Committee on Gene Symbolization 2008).
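The site filter and the significance criterion described above can be sketched in a few lines of NumPy. This is an illustration only, not the TASSEL procedure itself: the function names, the 0/1/2 allele-count coding, and the use of NaN for missing calls are assumptions made for the example.

```python
import numpy as np

def filter_sites(genotypes, max_missing=0.10, min_maf=0.05):
    """Return a boolean mask of sites to keep.

    genotypes: (n_lines, n_sites) array of allele counts 0/1/2, np.nan = missing call.
    A site is kept if at most 10% of calls are missing and its minor allele
    frequency is at least 0.05 (thresholds taken from the text).
    """
    missing_rate = np.isnan(genotypes).mean(axis=0)
    called = (~np.isnan(genotypes)).sum(axis=0)
    allele_sum = np.nansum(genotypes, axis=0)
    with np.errstate(invalid="ignore", divide="ignore"):
        freq = allele_sum / (2 * called)          # frequency of the counted allele
        maf = np.minimum(freq, 1 - freq)          # minor allele frequency
        keep = (missing_rate <= max_missing) & (maf >= min_maf)
    return keep

def is_significant(p_values, threshold=1e-4):
    """Flag marker-trait associations with p < 1e-4, i.e. -log10(p) > 4."""
    return np.asarray(p_values) < threshold

# Toy data: 4 lines x 4 sites
demo = np.array([
    [0, 0, 2, np.nan],
    [1, 0, 2, np.nan],
    [2, 0, 2, 1],
    [1, 0, 2, 0],
], dtype=float)
keep = filter_sites(demo)
print(keep)                               # only the first site passes both filters
print(is_significant([1e-5, 1e-3]))
```

In the toy data, the second and third sites are monomorphic and the fourth has 50% missing calls, so only the first site is retained.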
Genomic selection models were built using big scale ridge regression (bigRR), genomic best linear unbiased prediction (GBLUP), least absolute shrinkage and selection operator (LASSO), ridge regression BLUP (rrBLUP), sparse partial least squares regression (SPLS), reproducing kernel Hilbert space (RKHS), BayesA, BayesB, BayesC, Bayesian ridge regression (BRR), random forest classifier (RFC), random forest regression (RFR), support vector regression (SVR), support vector linear classifier (SVC), and Bayesian regularized neural network (BRNN) in G2P (https://github.com/cma2015/G2P), an integrated genomic selection package for predicting phenotypes from genotypes in R. The main features of each model are listed in Table S1. For example, the grain yield from multi-year and multi-site trials was calculated using a mixed linear model with the lme4 package in R (Bates et al. 2015):

$$y_{ij} = \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij} + \varepsilon_{ij}$$

where $y_{ij}$ is the yield of the $i$th variety in the $j$th environment, $\mu$ is the overall average yield, $\alpha_i$ is the varietal effect of the $i$th variety, $\beta_j$ is the environmental effect of the $j$th environment, $(\alpha\beta)_{ij}$ is the interaction effect of the $i$th variety and the $j$th environment, and $\varepsilon_{ij}$ is the residual error.
Variety is a fixed effect, while environment and the variety × environment interaction are random effects. GEBVs of the different traits were calculated in R using the sommer package (Covarrubias-Pazaran 2016) for the GBLUP model and G2P for the other models.
The predictive accuracy of the models was validated by fivefold cross-validation: all data were randomly divided into five subsets, and each subset was used in turn as the validation set while the other four subsets served as the training set, until all data had been predicted. The accuracy was calculated once for each subset, and the average across subsets was used as the predictive accuracy of the model.
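The cross-validation scheme can be sketched as follows. This is a hedged illustration rather than the G2P workflow: the ridge predictor, the simulated data, and the names ridge_predict and five_fold_accuracy are assumptions, and any other model could be passed in through the predict argument.

```python
import numpy as np

def ridge_predict(Z_train, y_train, Z_test, lam=1.0):
    """Stand-in prediction model (ridge regression on marker genotypes)."""
    mu = y_train.mean()
    lhs = Z_train.T @ Z_train + lam * np.eye(Z_train.shape[1])
    u = np.linalg.solve(lhs, Z_train.T @ (y_train - mu))
    return mu + Z_test @ u

def five_fold_accuracy(Z, y, predict=ridge_predict, k=5, seed=0):
    """Average Pearson correlation between predicted and observed values over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # random, non-overlapping subsets
    accuracies = []
    for i, test_idx in enumerate(folds):
        # The other k-1 subsets form the training set for this fold
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predicted = predict(Z[train_idx], y[train_idx], Z[test_idx])
        accuracies.append(np.corrcoef(predicted, y[test_idx])[0, 1])
    return float(np.mean(accuracies)), folds

# Simulated, highly heritable trait (assumption: 100 lines, 50 centered markers)
rng = np.random.default_rng(1)
Z = rng.integers(0, 3, size=(100, 50)).astype(float)
Z -= Z.mean(axis=0)
y = Z @ rng.normal(0.0, 1.0, size=50) + rng.normal(0.0, 1.0, size=100)

accuracy, folds = five_fold_accuracy(Z, y)
print(round(accuracy, 2))
```

Averaging the per-fold correlations follows the description above; pooling all predictions before computing a single correlation is a common alternative and can give slightly different values.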
To evaluate the effect of significant loci from the GWAS analysis, the predictive accuracy of each model for each trait was compared between two marker sets: all SNP markers, and all SNP markers with the significant GWAS markers excluded.

Statistics of the phenotypic data
Eight yield-related traits and 12 grain quality-related traits were investigated. There was significant variation for each trait. Most of the traits showed normal distribution, except TN, GLWR, and CD with higher kurtosis values than other traits (Table 1).
Based on Pearson correlation, spikelet fertility was not correlated with number of tillers (TN) or thousand grain weight (TGW). The other yield-related traits were correlated (Table 2). Among the grain quality-related traits, whole white rice rate (WWRR), grain length (GL), chalk degree (CD), and overall grade (OG) were not correlated with amylose content (AC) or gel consistency (GC). In addition, whole white rice rate (WWRR), gel consistency (GC), alkali value (ALK), and transparency (TRANS) were not correlated with grain length/width ratio (GLWR), while most of the other grain quality traits were correlated (Table 3).

Diversity and classification of the hybrids tested
All 404 hybrids were genotyped using a 56 K SNP chip, which includes 56,897 SNPs. After filtering, 34,832 high-quality SNPs remained, with 1992-4736 SNP markers on each chromosome (Fig. 1). The phylogenetic tree built from these high-quality SNPs showed that all the hybrids were indica rice, with the exception of three hybrids that were genetically more distant from the others, possibly due to introgression from japonica rice (Figure S1).

We found that two SNP markers, one on chromosome 5 (AX-155748928) and one on chromosome 12 (AX-154698806), were significantly associated with several yield and grain quality traits (Table 4). Few accessions carried the GG genotype at either marker, and their phenotypic means did not differ from those of the AG genotype. Compared with the homozygous AA genotype, the heterozygotes (AG) at both AX-155748928 and AX-154698806 were higher in plant height, number of grains per panicle, yield, chalkiness, transparency, and overall grade (i.e., lower quality), but lower in white rice rate and alkali value (Figure S2).

Genomic selection models for yield and grain quality traits
The GEBVs of all yield and grain quality traits were calculated using 15 different prediction algorithms, and fivefold cross-validation was used to evaluate the prediction accuracy. For the same trait, the predictive abilities of different GS models differed. Likewise, for the same GS model, the predictive abilities varied among traits (Fig. 4). Plant height, panicle length, thousand grain weight, grain length, grain length/width ratio, chalk percentage, amylose content, alkali value, and gel consistency had higher predictive accuracy than the other traits (Table 5). Thousand grain weight was well predicted by all the models, while the transparency of the grain had very low predictive accuracy. The predictabilities of grain length/width ratio, chalk percentage, amylose content, alkali value, and gel consistency varied significantly among models. The predictive accuracy for grain yield ranged from 0.18 to 0.35 (average 0.31) (Table S4). When the GS models with and without the significant SNP markers from the GWAS analysis were compared, most of the models had higher predictive accuracy when the significant markers were included (Fig. 5, Table S5).

QTL and genes affecting rice yield and grain quality
It is generally accepted that rice grain yield and quality are negatively related. High-yield varieties usually have low grain quality, and rice breeders have been trying to balance these traits in the breeding process (Xiao et al. 2021). However, the genetic linkage between grain yield and grain quality has not been dissected. In this study, a total of 67 QTLs for grain yield and 123 QTLs for grain quality were identified by comprehensive evaluation of related traits. Among these loci, we found two SNP markers that were significantly associated with various yield and grain quality traits. The heterozygotes (AG) of markers AX-155748928 on chromosome 5 (chr5:5,417,532) and AX-154698806 on chromosome 12 (chr12:13,961,623) had high yield but low grain quality. The SNP marker AX-155748928 was located in the exon region of gene LOC_Os05g09590, which has unknown function (putative uncharacterized protein). Based on online database analysis (Proost and Mutwil 2017), this gene was highly expressed at seed development stages S4-S5 (11-29 days after pollination) (Figure S3). Gene LOC_Os05g09590 was significantly associated with grain chalkiness (Misra et al. 2019). The SNP marker AX-154698806 was located between genes LOC_Os12g24450 and LOC_Os12g24460 (5849 bp upstream of LOC_Os12g24460). LOC_Os12g24450, a putative unclassified retrotransposon protein, was highly expressed at seed development stage S5 (21-29 days after pollination), and LOC_Os12g24460, a putative uncharacterized protein, was highly expressed at seed development stage S4 (11-20 days after pollination) (Figure S3). Further validation of the effects of these genes on rice yield and grain quality may help identify ways to improve both grain yield and quality.

Table 3 Pearson correlation among grain quality traits. Upper number is Pearson correlation, lower number is p value. *Significance at p < 0.05, **significance at p < 0.01, ***significance at p

Predictive accuracy of GS models
High prediction accuracy is a prerequisite for successful application of genomic selection. The prediction accuracy is often measured by the correlation between observed phenotypes and the predicted GEBVs or predicted phenotypes in cross-validation (Xu 2017). The predictive accuracy is influenced by several factors such as population size, variation within the training population and between the training and the test populations, heritability of a trait, marker density, and statistical method (Crossa et al. 2017a; Robertsen et al. 2019). In this study, the predictive abilities of the models were measured by Pearson and Spearman correlations between the predicted GEBVs and the actual values of the hybrid rice accessions. The predictive accuracy for the same trait varied among models, and the same model also performed differently on different traits. No single model can be used for a good prediction of all the traits since the genetic structure of a trait is

Fig. 4 Comparison of predictive accuracy of different GS models for different traits

Previous studies showed that significant markers from GWAS have an obvious effect on genomic prediction and can be used to help decide which model strategies should be considered (Wilson et al. 2021). In this study, when the predictabilities of models with and without the SNP markers from GWAS were compared, almost all the models had higher predictive accuracy when the GWAS markers were included. Thus, markers associated with the trait should be considered in genomic selection models for better predictive accuracy.
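As an illustration of the two correlation measures mentioned above, the following NumPy sketch computes Pearson and Spearman correlations between predicted and observed values. The helper names and the toy numbers are assumptions for the example, not data from this study. Spearman correlation is computed as the Pearson correlation of ranks, so it rewards getting the ranking of accessions right even when the relationship is not linear.

```python
import numpy as np

def rank_with_ties(values):
    """1-based ranks; tied values share the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values))
    ranks[order] = np.arange(1, len(values) + 1)
    for v in np.unique(values):
        tied = values == v
        ranks[tied] = ranks[tied].mean()
    return ranks

def pearson(x, y):
    """Linear correlation between two vectors."""
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    """Rank correlation: Pearson correlation computed on the ranks."""
    return pearson(rank_with_ties(x), rank_with_ties(y))

# Toy example: predictions that rank the accessions perfectly but are not linear
observed = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
predicted = observed ** 2

r_pearson = pearson(predicted, observed)
r_spearman = spearman(predicted, observed)
print(round(r_pearson, 3), round(r_spearman, 3))
```

For selection decisions that depend only on the ranking of candidates, the Spearman value is often the more relevant of the two.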

Predictive accuracy and potential application of GS models
At present, there is no model that can be widely applied to all traits. Although the stability and accuracy of GS models are continuously improving, two main challenges remain, namely, computational accuracy and computational efficiency. The direct method (represented by GBLUP) has higher computational efficiency (i.e., it requires less time and fewer resources) but lower accuracy than the indirect method (represented by BayesB). A further difficulty with the direct method is parameter setup, which depends heavily on the researcher's experience, since different parameter settings have profound effects on the final results. Similarly, although the indirect method has high accuracy, it is difficult to use for guiding breeding practice because solving for its parameters requires a large amount of computation that cannot be run in parallel. In this study, BayesB and RKHS had high predictive accuracy for plant height, panicle length, thousand grain weight, grain length, grain length/width ratio, amylose content, and alkali value, while BRNN, GBLUP, rrBLUP, RFC, SPLS, SVC, and SVR had low predictive accuracy for other traits (Table S4). The predictive accuracy was low (0.22-0.35) for grain yield, but high for yield component traits such as number of grains per panicle, spikelet fertility, and thousand grain weight. Thus, it will be more useful to predict yield-related traits such as grain weight than yield itself. The predictive accuracies for grain quality traits were higher, with the sole exception of transparency.
The goal of GS for hybrid breeding is to use the genotypes of the parents to predict the performance of the hybrids, which will significantly reduce the number of crosses needed for field testing (Xu et al. 2014; Labroo et al. 2021). Although only F1 hybrids were investigated in this study, in practical breeding only the parents (sterile lines and restorer lines) need to be genotyped; the genotypes of the F1 hybrids can then be simulated and used to predict yield and grain quality traits with the GS models. This will significantly reduce the number of crosses to be made and the hybrids to be tested; thus, genomic selection is more efficient for hybrid rice breeding.
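A minimal sketch of how F1 genotypes might be simulated from parental genotypes is given below. It assumes that both parents are fully homozygous inbred lines coded as 0 or 2 copies of the alternate allele at every site, so that each F1 genotype is simply the average of the two parental codes; residual heterozygosity and genotyping errors are ignored. The function names and the toy genotypes are hypothetical.

```python
import numpy as np

def simulate_f1(female, male):
    """F1 allele counts (0/1/2) from two homozygous parents coded 0 or 2."""
    female = np.asarray(female)
    male = np.asarray(male)
    for name, parent in (("female", female), ("male", male)):
        if not np.isin(parent, (0, 2)).all():
            raise ValueError(f"{name} parent must be homozygous (0 or 2) at every site")
    # Each parent contributes one allele per site, so the F1 count is the mean of the codes
    return (female + male) // 2

def all_crosses(females, males):
    """Genotypes of every female x male cross, shape (n_females, n_males, n_sites)."""
    females = np.asarray(females)
    males = np.asarray(males)
    return simulate_f1(females[:, None, :], males[None, :, :])

# Toy parents: 2 sterile lines and 3 restorer lines genotyped at 4 sites
sterile_lines = np.array([[0, 2, 0, 2],
                          [2, 2, 0, 0]])
restorer_lines = np.array([[0, 0, 2, 2],
                           [2, 2, 2, 2],
                           [0, 0, 0, 0]])

f1 = all_crosses(sterile_lines, restorer_lines)
print(f1.shape)        # (2, 3, 4)
print(f1[0, 0])        # [0 1 1 2]
```

The simulated F1 genotype matrix can then be passed to a trained prediction model in place of genotypes measured on the hybrids themselves.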

Conclusions
In this study, a total of 404 hybrid rice accessions were genotyped with a 56 K SNP chip, and 20 traits related to yield and grain quality were investigated. Sixty-seven significant loci were identified for the yield-related traits, and 123 significant loci were identified for the grain quality traits by genome-wide association study. Two of these loci were associated with increased grain yield but decreased grain quality. Genomic selection models based on 15 different prediction algorithms were established using the genotypic and phenotypic data. The GS models are useful for plant height, panicle length, thousand grain weight, grain length/width ratio, amylose content, and alkali value, but the predictive accuracy for the other traits is low. Using a proper model for each specific trait is important for successful genomic selection. The GS models could be used to predict some important traits of hybrid rice from the genotypes of the parental varieties.
Author contribution P Yu and C Ye wrote the original draft; L Li, H Yin, J Zhao, and Y Wang performed the experiments; Z Zhang, W Li, and Y Long analyzed the data; X Hu, J Xiao, G Jia, and B Tian designed and supervised this study.
Funding This study was financially supported by the National Key Research and Development Project of Ministry of Science and Technology (2017YFD0102002-4).

Data availability
The data supporting the findings of this study are available within the article and its supplementary materials.