3.1 Experimental material and genotyping
Four bi-parental rice populations (MP2), named MPA, MPB, MPC, and MPD, were developed using eight elite US parents. The combination of parents and the development methodology of the MP2 was explained by Cerioli et al. (2022). Four-way populations (MP4) were obtained by crossing pairs of F1 individuals from the MP2 populations: MPA x MPB (MP4-AB), MPA x MPC (MP4-AC), MPA x MPD (MP4-AD), MPB x MPC (MP4-BC), MPB x MPD (MP4-BD) and MPC x MPD (MP4-CD). Multi-parent populations consisting of six or eight parent combinations (MP6-8) were generated using F1 individuals from the MP4 populations. A set of 430 MP2, 1,026 MP4, and 324 MP6 lines were genotyped and used for this study. For the MP2 materials, a single F3 plant was sampled for genotyping, serving as the seed source for all subsequent generations. In the MP4 and MP6-8 populations, an F4 plant was derived and genotyped.
Additionally, data from preliminary yield trials (PYTs) from the LSU Rice Breeding Program performed at the H. Rouse Caffey Rice Research Station (HRCRRS) in Rayne, LA, during 2020, 2021, and 2022 was also used for this work. These PYTs consisted of 2,929 F2:F4 and F3:F5 lines, of which 1,263 correspond to the PYTs from 2020, 943 from 2021, and 733 from 2022. All PYT entries were genotyped with a single leaf sample from the plot in the field.
Genotyping of the PYT materials, MP2, MP4, and MP6-8, was conducted using the LSU500 AmpSeq custom marker set that was described by Cerioli et al. (2022) through the genotyping service provider Agriplex Genomics (https://agriplexgenomics.com). This SNP array is an optimized set of informative markers used routinely for genotyping US rice germplasm that has proven useful for breeding purposes.
Markers used to calculate LD decay were quality-controlled using the “raw. data” function from the package snpReady (Granato et al., 2017), with filtering set for minor allele frequency (MAF) > 0.05 and a call rate of 0.95. The final dataset was transformed from nucleotide genotype coding (e.g., 'A', 'C', 'T', 'G') to numeric coding (0= homozygote class 1, 1= heterozygotes, and 2= homozygote class II, respectively). This conversion allowed for efficient computation and interpretation of the genetic data within the statistical analysis framework.
Additionally, the additive and epistatic genomic relationship matrices (i.e., Ga and GI, respectively) were calculated using the LSU500 marker set. The Ga was obtained using the “A.mat” function from the sommer R package markers were also filtered for a MAF >0.05 (Covarrubias-Pazaran, 2019), which implemented the equation proposed by VanRaden (2008) to account for genetic relatedness among individuals. Then, the second-order epistatic kinship (additive x additive) was obtained using the “E.mat” function from the same package and is computed as the Hadamard product of the genomic additive relationship matrix (Henderson, 1985).
3.2 Phenotypic data
All tested lines were grown at the HRCRRS (Crowley, LA, 30°14′30″ N, 92°20′46″ W), following the recommended nutrition and water management practices from the Louisiana Rice Production Handbook (Saichuk, 2014). The trials were planted in 1.42 m wide plots and 4.12 m long with a heavy-duty grain drill (Almaco Equipment, Nevada, IA), using a seeding rate of 78 kg.ha-1, and 2 cm was the planting depth. Days to heading was measured as the number of days elapsed from the emergence of the plants until the day in which 50% of the plants within the plot present panicles emerging from the sheath of the flag leaf. The determination of grain yield (kg.ha-1) was carried out by harvesting the entire plot through a Delta Plot (Wintersteiger AG) combine harvester and a Harvest Master Grain Meter (Juniper Systems), which was used to collect the weight and moisture (%) of the grain. Moisture values from each plot were adjusted to 120 g.kg-1 of water content. Samples of 100 g were milled in a PAZ laboratory mill (Zaccaria USA) and were used to measure the whole milling percentage (%) expressed as grams per kilogram of whole milled kernels over rough rice seed. Chalk percentage (%) was measured with SeedCount (Next Instruments) and corresponded to the percentage of the chalky area of the milled grain. Days to heading, grain yield, whole milling percentage, and chalk percentage were collected for MP2 and MP6-8 materials and all PYTs; this data was not collected for the MP4 materials which were omitted from the phenotypic evaluation.
430 MP2 genotypes were tested in 2020; from those, 220 were evaluated in 2021 in two different environments (i.e., different planting dates in the same location). 318 and 178 MP6-8 genotypes were also evaluated in two environments (i.e., other planting dates in the same location) during 2021 and 2022, respectively. During 2020, 2021, and 2022, three different PYTs were grown as regular breeding activities within the LSU Rice Breeding Program. These PYTs represent different forms of herbicide resistance within the long grain materials, namely the “Conventional,” “Clearfield ®” and “Provisia ®”, hereafter PYT CN, PYT CL, and PYT PV respectively (Table 1).
Table 1. Name of the trials used for this study, description, number of genotypes, year of evaluation, and the dates of emergence and harvest. The grey color indicates the multi-parent (MP6-8) or bi-parental (MP2) materials developed and used for this study, and the white color shows the use of Preliminary Yield Trials (PYTs) that are part of the regular breeding activities and were used as part of this research.
3.3 Statistical-genetics Analysis
All the analyses were performed using the statistical programming language R (R Core Team, 2021). The sommer package (Covarrubias-Pazaran, 2019) was utilized for calculating LD as well as the linear mixed model equations (LMME) for estimating genetic heritability values, obtaining the best linear unbiased estimators and predictors (BLUEs and BLUPs, respectively), and the GEBVs.
3.3.1 LD decay and haplotype size
The calculation of LD involves estimating the Pearson correlation coefficient r2, which measures the non-random association of alleles at different loci. LD decay, conversely, refers to the decrease in linkage disequilibrium over physical distance. This study assessed LD decay by calculating r2 between pairs of genetic markers using the “LD.decay” sommer function that incorporates a marker matrix and a map providing the physical distances between markers in kilobase pairs (i.e., kbp).
The LD statistics were computed per chromosome for the MP2, MP4, and MP6-8 materials using the LSU500 marker set. Subsequently, the LD values were aggregated across all 12 chromosomes. To visualize the LD decay pattern, the r2 was plotted against the physical distance in kbp. Only significant correlations (p < 0.001) were retained to focus on meaningful associations to calculate the average and median LD values. Haplotype sizes in terms of mean and media, along with selected percentiles (10th, 25th, 75th, and 90th), were calculated using distances (in kbp) between significant pairs of markers. In this study, LD was estimated within the MP2 populations as a whole and separately within the MP6-8, aligning with conventional practices in GS (Meuwissen et al., 2001; Heffner et al., 2009). As part of a practical approach, information from populations was combined.
3.3.2 Phenotypic data analysis and statistical models for genomic predictions
Outliers from the phenotypic data were removed using the “outlierTest” function in the car package (Fox and Weisberg, 2019) in R, which uses the Bonferroni outlier test. For the statistical analysis of the trials, a two-step approach was performed. The subsequent statistical analysis was carried out in three steps. First, a spatial adjustment was performed based on rows and columns using two-dimensional splines. For this step, the “SpATS” package from sommer was used. The best linear unbiased estimators (BLUEs) for the traits in consideration for every line were extracted using the function “predict. mmer” from the same package. Hence, the SpATS mixed model for each trial is formulated as follows (Equation 1):
where the vector y contains the phenotypic observations, arrayed as rows within columns, β is a vector of fixed terms that includes the intercept, and X is the associated design matrix. The expression for the smooth spatial surface is enclosed by the fixed (unpenalized) term Xsβs and the random (penalized) component Zss. The vector u comprises mutually independent sub-vectors of random row and column effects accounting for discontinuous field variation, expressed as u = [ur|uc] with a design matrix Z. Additionally, the vector g contains the fixed effect of the genotypes and Zg represents the design matrix for genotypes. The vector e represents the error term with a distribution e ~ N (0, σ2e).
In the second step, a multi-environment trial (MET) analysis was performed, adding the planting date and year as fixed effects (Equation 2):
Here y* represents the adjusted means from the first step, (Equation 1); p is the planting date effect with X as the correspondent design matrix, s is the year effect with T as the design matrix, and u is the random term for genotypes with Z as the design matrix. The best linear unbiased predictors (BLUPs) from each genotype were extracted using the predict.mmer function mentioned above.
To perform the genomic predictions, we used Additive (A; reduced model) and Additive + Epistasis models (A+I; full model), as follows (Equation 3):
2.3.3 Heritability and k-fold cross validation (CV) predictive ability
Narrow-sense genetic heritability (h2) was calculated for the A and A+I models described above for each of the traits, as the ratio between the additive variance component and the sum of the additive and residual error component (A model), in the case of the A+I model, the epistatic variance component was added in the denominator of the equation, as is defined in Equation 4:
where σa2 is the additive genetic variance from the A or the A+I model, σi2 is the genomic epistatic variance, and σe2is the residual variance. These calculations were conducted for each of the traits and each trial. Subsequently, the computed heritability values were averaged within each population, encompassing MP2, MP6-8, and PYTs.
To evaluate prediction accuracy, a cross-validation (CV) approach was employed (Gianola et al., 2009; Meuwissen et al., 2001) using the A or A+I as the LMME to assess their predictive performance in each trial. A k-fold cross-validation scheme was used, with five replicates and five folds, totaling twenty-five predictions per scenario. Consequently, the GEBVs were obtained for all individuals in the validation population. The CV predictive ability was calculated as the Pearson Correlation (R) between the adjusted phenotypes (BLUP) and the computed GEBVs.