Plant materials
The set of 227 Texas elite (TXE, F9) breeding lines were developed by the two Texas A&M AgriLife Research wheat breeding programs located at Amarillo and College Station, TX during 2009 – 2018, which included 13 released TAM cultivars. Briefly, the lines were originally selected at the F6 generation according to their performance in the observation yield trials conducted at two locations followed by evaluation preliminary yield trials in six locations and advanced yield trials in ten locations at the F7 and F8 generations, respectively. The state-wide yield TXE trials were conducted at 16 locations across Texas with three replicates per location. The superior TXE lines were further evaluated for traits of yield, end-use quality, biotic and abiotic tolerance, and agronomic traits either toward the new cultivar release or as the germplasms to enter the new breeding cycles. Therefore, the set of TXE lines represented the major gene sources in Texas wheat breeding programs and were appropriate materials for building genomic prediction models to improve selection efficiency. In addition, a set of 105 lines entered into the advanced yield trials was evaluated at ten locations with two replications per location, and their yield data were used to validate the genomic prediction model developed from the TXE collection.
Grain yield data analysis
The TXE trials were conducted during 2009 - 2018 were conducted in three replications at 16 locations that represented four typical wheat growing regions (High plains, Rolling plains, Blacklands and South Texas) in Texas (Fig. S1), and grain yield data were collected from 6 to 14 locations each year (Table 1) due to the abandoned harvest in some locations experienced serious damage caused by severe weather. Totally, yield data were from 102 environments defined as year-by-location combinations (Table 1). However, TXE yield data were typically imbalanced with 40 to 66 lines evaluated each year and 38 to 41 lines were also tested in the following year (Table 1). Three cultivars, ‘TAM 112’, ‘TAM 401’ and ‘TAM W-101’ were used as controls across all years.
To manage the imbalanced yield data for using in GWAS and then developing genomic prediction model for grain yield in Texas wheat breeding programs, genetic correlations coefficients among yield data of common TXE lines in different environments were calculated through R package META-R (Alvarado et al. 2020), and the significantly positive correlation among data collected from different environments indicated that TXE lines showed similar trends of reacting to growing condition under those environments. Yield data collected from all environments were then grouped based on their correlation calculated from META-R, and the best linear unbiased prediction (BLUP) of each line was calculated using the mixed model y = Xβ + Zu + ε using the R package lme4 (Bates et al. 2015) with genotype was set as the only random effect, where y represents the vector of observations, β and u mean fixed and random effects, respectively, X and Z are matrices of observations related to fixed and random effects, respectively, and ε is the residual of the model. Since genotype was the only random effect in this model, variation from random effect will be the genetic variance (VG) and thus can be used to estimate the broad-sense heritability (H2) of grain yield in each correlation group using the formula H2 = VG / (VG + Ve), where Ve is the residual variance. If a dataset from one environment was included in heritability estimation and lead to an increase in yield heritability of one correlation-based group, the dataset was kept in that correlation-based group to increase the power of detecting genetic effects on grain yield. Such a broad-sense heritability estimation was conducted for each environment, including the datasets that were non-correlated in correlation analysis in the previous step, to finally determine if the dataset should be kept in the corresponding correlation-based group.
Once correlation-based groups of yield data were finalized through broad-sense heritability estimation, the R package lme4 and the mixed model y = Xβ + Zu + ε were used again to calculate the best linear unbiased estimation (BLUE) of each line in each correlation-based group with genotype was set as the fixed and environment set as the random effects. The yield BLUEs calculated in each group were used for GWAS and developing genomic prediction models.
Evaluation resistance to greenbug and Hessian fly
Growth chamber and greenhouse experiments were conducted to evaluate resistance to greenbug (Schizaphis graminum Rondani) and Hessian fly (Mayetiola destructor Say) in the TXE lines. Briefly, wheat plants grown in one-gallon pots were maintained in 60 × 60 × 60 cm cages (MegaView Science Co., Ltd., Taichung, Taiwan) equipped with insect proof mesh. Greenbug biotype E and Hessian fly biotype GP colonies were established and maintained on caged wheat plants for approximately six weeks prior to the evaluation experiments and were used as the source for the subsequent assays.
Greenbug infestation was done according to Weng and Lazar (2002). Cultivars ‘TAM 105’ or TAM 111, and ‘TAM 110’ were used as susceptible and resistant controls, respectively. Sixteen lines and two controls were grown in a 30 × 50 cm flat with 20 seeds per line. At the three-leaf stage, about 500 greenbugs were scattered over each test flat and the flats were then kept in a growth chamber at 22 °C with a day length of 8 hours. The plant was scored as either resistant (normal healthy) or susceptible (chlorotic leaf and necrotic stem lesions) 10 -14 days after infestation. Percentage of resistant plants in each line was recorded for GWAS.
Hessian fly infestation was conducted as described in (Chen et al. 2009). Wheat accessions ‘Carol’ (H3), ‘Cardwell’ (H6) and ‘Molly’ (H13) were used as the resistant checks, and ‘Danby’ as the susceptible control. Twenty lines and four checks with 25 seeds per line were planted in one plastic flat (56 × 36 cm) in a greenhouse at 18 ± 3 °C with day length as 14 hours. When the first leaf was fully expanded and the second leaf started emerging, about 200 newly mated female flies were released to each flat covered with a cheesecloth tent (540 × 120 × 40 cm). Resistance rating were conducted three weeks later and the stunted plants having bloated live larvae at stem base were considered as susceptible (S), and the normally healthy plants with small dead larvae or tiny live larvae between leaf sheaths as resistant (R). Percentage of resistant plants per line was calculated for GWAS.
SNP genotyping and marker data management
Whole genomic DNA was extracted from leaf samples using CTAB (cetyl trimethylammonium bromide) method (Stewart and Via 1993) with slight modification (Liu et al. 2013). SNP genotyping was done through ddRADSeq procedure (Peterson et al. 2012). Briefly, genomic DNA was co-digested with two restriction enzymes PstI (CTGCAG) and MspI (CCGG) and barcoded adapters were then ligated to DNA segments of each individual sample. Adapter oligos were synthesized from Integrated DNA Technologies (IDT), Inc. (Coralville, Iowa), and were mixed in equimolar amounts (30 µM of top and bottom oligos). After denaturing at 95°C for 10 sec, oligos were cooled to 12°C at a rate of 0.1°C per sec. P5-Index adapters were made through annealing the top and bottom oligos (Top oligo (5’ - 3’): AAT GAT ACG GCG ACC ACC GAG ATC TAC ACX XXX XXX XTC TTT CCC T; Bottom oligo (5’ -3’): /5Phos/AXX XXX XXX GTG TAG ATC TCG GTG GTC GCC GTA TCA TT, where XXXXXXXX represents 8-base i5 index sequences). The P5-PstI-Bridge adapters were made by annealing top (Pster_T, 5’ to 3’): /5Phos/ACA CGA CGC TCT TCC GAT CTT GCA and bottom (Pster_B, 5’ to 3’): AGA TCG GAA GAG CGT CGT GTA GGG AAA G oligos. P7-MluCI Adapter was made by annealing top (P7-MluCI_T, 5’ to 3’): AAT TAG ATC GGA AGA GCA CAC GTC TGA ACT CCA GTC AC and bottom (P7-MluCI_B, 5’ to 3’): GTG ACT GGA GTT CAG ACG TGT GCT CTT CCG ATC T.
The ddRADSeq libraries were constructed using 96-plex plate with a single random blank well used for quality control, and were then sequenced through an Illumina HiSeq 2000 at the Genomics & Bioinformatics Services of Texas A&M AgriLife Research at college station, TX (Yang et al. 2020), and SNP calls were made using the reference-based Stacks Pipeline (Catchen et al. 2013) using IWGSC v1.0 as the reference genome (IWGSC 2014), which obtained over 247,000 raw SNP data with missing rate below 50%. Considering all TXE lines were at F9 generation or later that have a very low level of heterozygosity and thus the homozygous SNP readings should be more reliable, which is also approved by comparison of SNP readings of few control lines with 2 - 4 replications included for DNA sequencing and SNP calling (data not shown). Majority of heterozygous SNP readings were more likely due to technique error during sequencing according to their extra high heterozygosity rate. Therefore, heterozygous marker data from TXEs thus were all converted as the missing data, and all SNP data were filtered again using criteria of data-missing rate less than 30% and minor allele frequency (MAF) below 5% through the computer package Tassel v5.0 (http://www.maizegenetics.net/) (Bradbury et al. 2007), which retained over 75,000 SNPs with a higher level of reliability. Genotype imputation with accuracy of 98% was then conducted using computer program Beagle (v5.0) (Browning and Browning 2007) and achieved data-missing rate to less than 10%. Imputed data were filtered again through MAF less than 5% and obtained the final set of 70,525 SNPs were used for GWAS.
In the set of 105 advanced breeding lines from yield trial in 2018, SNP genotyping and marker data management was done using the similar methods as indicated for TXE collections. A total of 384,648 SNPs was called and imputed in the 105 advanced lines with marker data missing rate less than 10%. Among those SNPs, 37,975 were the common set between TXE and advanced breeding lines and were extracted for validating the accuracy of genomic prediction model developed from TXE by comparing the predicted with observed yield in those 105 advanced lines.
Population structure analysis
From the raw 247,000 SNPs in TXE with a data missing rate of less than 50%, a set of 8,401 SNPs with data missing rate less than 18% and heterozygosity less than 5% was considered the most reliable markers for analyzing population structure in TXE lines. The computer program Structure v2.3.4 (https://web.stanford.edu/group/pritchardlab/structure.html) (Falush et al. 2003; Pritchard et al. 2000) was used with the number of presumable sub-populations (K) set from three to ten with iteration number equal to ten. For simulation running under each K, length of burn-in period was set to 10,000 and number of MCMC replicates was set to 100,000 with the model of admixture and correlated allelic frequency was used. The number of sub-populations was then determined using delta K (ΔK) method described in Evanno et al. (2005) through the online tool Structure Harvester (http://taylor0.biology.ucla.edu/structureHarvester/). Meanwhile, phylogenetic tree using 70,525 imputed SNP data through UPGMA (unweighted pair group method with arithmetic mean) hierarchical clustering method was also carried out through Tassel v5.0 (Bradbury et al. 2007) and the clade tree was drawn using the online tool Interactive Tree Of Life (iTOL v5) (Letunic and Bork 2019, https://itol.embl.de/). The phylogenetic tree was used to verify the results obtained from Structure v2.3.4.
Genome-wide association studies
GWAS was carried out using the set of 70,525 imputed SNPs through Tassel v5.0 (Bradbury et al. 2007). Principal component analysis (PCA) was conducted with the number of sub-populations determined by Structure v2.3.4 to generate the Q-matrix that incorporated as the covariate in association analysis, and the fixed and random effect mixed model (MLM) (Liu et al. 2016) was used for association mapping with the K-matrix showing the relationship of all individuals that was used to account for effects due to kinship. For detecting genomic regions associated with grain yield, the yield BLUE of each line calculated using the R package lme4 in correlation-based groups was used as the trait data and GWAS was separately conducted in each correlation-based group. A modified Bonferroni adjustment was applied to estimate the significance threshold for marker-trait associations (Gao et al. 2010). Briefly, the R package simpleM (http://simplem.sourceforge.net/) was used and detected 5,930 effective independent tests from all SNPs used for GWAS in TXE lines, indicated that significant thresholds would be -log10 (P) = 5.77 and -log10 (P) = 5.07 for α = 0.01 and α = 0.05, respectively, and the suggestive threshold would be -log10 (P) = 4.38 (α = 0.25). Since grain yield has the quantitative nature and controlled by many loci in addition to various effects from environmental conditions, the LOD value of 4.0 thus was set as the indicative threshold to determine genomic regions associated with grain yield. Whereas for insect resistance, the threshold of LOD = 6.0 was applied since the traits in TXE were likely to be qualitative and governed by few genes.
SNP allele frequency change during new TXE lines development
According to the time of line development, the 227 TXE lines were divided into three groups using a three-year interval with the first group including 92 lines developed during 2009 – 2011(namely the old group), the second group containing 67 lines from 2012 – 2014, and the third group had 68 lines from 2015 – 2017 (namely the newly developed group). Therefore, comparing allele frequency between the first and the third groups would have a good indication of allele drifting due to breeding selection in Texas wheat breeding programs. Allele frequency change was investigated focusing on the major allele genotype of SNPs in the TXE collection. The frequency of each SNP major allele was respectively calculated in the first and third groups and then to find the difference between the two frequencies. The allele frequency of a SNP allele was considered as changed if the frequency difference between the first and the third groups was greater than 20%.
Genomic prediction model development
For developing genomic prediction models, 7,573 SNP markers from all chromosomes with at least one million bases (Mb) apart were selected for estimating the mean effect of each marker. The R package rrBLUP (ridge regression best linear unbiased prediction, Endelman 2011) was used for developing genomic prediction model. The mixed model y = µ + Xβ + ε was used with y as the vector of phenotypic means, µ as the overall mean, X as the marker matrix, β as the matrix of marker effects and ε as the vector of residual effects. The genomic estimated breeding values (GEBVs) of each line were calculated by adding the grand mean to the product of genotypic matrix and the vector of mean effect of each marker. The prediction accuracy was measured by the correlation between the predicted and observed yield BLUEs. Genomic prediction models were developed separately in each of the correlation-based groups, and the prediction accuracy was estimated at three times using 60%, 70% and 80% of TXE lines as training sets and 40%, 30% and 20% as testing sets, accordingly. For each training/testing set, prediction accuracy was obtained based on a calculation using 500 repeated runs.
To validate the prediction models developed in each correlation-based group, all TXE lines were used as the training set and 105 advanced breeding lines were used as the testing set. The common SNPs between TXEs and advanced breeding lines with at least one Mb apart on each chromosome were selected. Marker effects were estimated using BLUEs of each correlation-based group through rrBLUP mixed model y = µ + Xβ + ε. The GEBVs of each advanced breeding line were calculated by adding the grand mean to the product of genotypic matrix and the vector of mean effect of each marker. Prediction accuracy was then measured through the correlation between predicted and observed yield in 105 advanced breeding lines.