A new strategy for using historical imbalanced yield data to conduct genome-wide association studies and develop genomic prediction models for wheat breeding

Using imbalanced historical yield data to predict performance and select new lines is an arduous breeding task. Genome-wide association studies (GWAS) and high throughput genotyping based on sequencing techniques can increase prediction accuracy. An association mapping panel of 227 Texas elite (TXE) wheat breeding lines was used for GWAS and a training population to develop prediction models for grain yield selection. An imbalanced set of yield data collected from 102 environments (year-by-location) over 10 years, through testing yield in 40–66 lines each year at 6–14 locations with 38–41 lines repeated in the test in any two consecutive years, was used. Based on correlations among data from different environments within two adjacent years and heritability estimated in each environment, yield data from 87 environments were selected and assigned to two correlation-based groups. The yield best linear unbiased estimation (BLUE) from each group, along with reaction to greenbug and Hessian fly in each line, was used for GWAS to reveal genomic regions associated with yield and insect resistance. A total of 74 genomic regions were associated with grain yield and two of them were commonly detected in both correlation-based groups. Greenbug resistance in TXE lines was mainly controlled by Gb3 on chromosome 7DL in addition to two novel regions on 3DL and 6DS, and Hessian fly resistance was conferred by the region on 1AS. Genomic prediction models developed in two correlation-based groups were validated using a set of 105 new advanced breeding lines and the model from correlation-based group G2 was more reliable for prediction. This research not only identified genomic regions associated with yield and insect resistance but also established the method of using historical imbalanced breeding data to develop a genomic prediction model for crop improvement.


Introduction
The complex nature of grain yield makes it difficult for precise selection of lines with high yield potential. Accumulation of favorite allele combinations for yield would lead to lines with improved yields. For identifying favorite alleles, genome-wide association studies (GWAS) (Atwell et al. 2010;Rafalski 2010) have showed advantages over traditional QTL mapping using bi-parental mapping populations. GWAS uses the natural collection of germplasm lines such as landraces, varieties, and breeding lines as mapping panels, and detects historical recombination events and linkage disequilibrium (LD) to identify the non-random association between allele loci and traits (Flint-Garcia et al. 2003). GWAS can identify multiple alleles simultaneously since a wider range of germplasms in a panel would contain more diverse genetic composition (Zhu et al. 2008;Myles et al. 2009;Atwell et al. 2010).
Increasing marker coverage in the genome will enhance the power of GWAS for allele identification. Single nucleotide polymorphisms (SNPs), the variations on a single nucleotide at the specific position, are the most abundant and widely distributed genome markers (Agarwal et al. 2008). With the advances in DNA sequencing technology, the genotype-bysequencing (GBS) (Elshire 2011) and later doubledigested restriction-site associated DNA sequencing (ddRADseq) (Baird et al. 2008;Peterson et al. 2012) are robust approaches of identifying SNPs that are randomly distributed throughout the whole genome and thus are suitable for investigating genome-wide genetic variations. Particularly, by aligning SNP flanking DNA sequences to assembled whole genome sequences of hexaploid wheat (IWGSC 2014;Zimin et al. 2017), tetraploid wheat (Avni et al. 2017), and Aegilops tauschii (Jia et al. 2013b), chromosomal location of each SNP can be precisely located to accurately track favorite alleles.
Previously, QTL analysis was widely used to identify genomic regions associated with the traits and then develop molecular markers for marker-assisted selection (MAS). However, most important traits such as grain yield, yield components, and end-use quality are all highly polygenic with each locus contributing only a very small proportion of total phenotypic variance (Simmonds et al. 2014;Tyagi et al. 2014;Jia et al. 2013a;Collard and Mackill 2008), which leads to weak stability and low repeatability in those QTLs and thus limits the application of MAS for accumulating desirable genes in crop improvement. Genomic selection (GS) utilizes a large set of markers covering a whole genome to detect all possible alleles within the LD and their effects on the trait, and to estimate the genomic breeding value of each line to conduct selection in breeding (Meuwissen et al. 2001;Bernardo and Yu 2007;Bhat et al. 2016;Rutkoski et al. 2016;Daetwyler et al. 2008;Sun et al. 2019;Tsai et al. 2020). Therefore, using GWAS to identify favorite alleles and form the training population to develop prediction models for conducting GS will be an efficient way of accumulating desirable alleles for improving yield and other polygenic traits.
To conduct GWAS and GS, the composition of training populations and precision of phenotyping are two additional critical factors affecting prediction accuracy Michel et al. 2017;Marulanda et al. 2015). Using advanced lines from the same breeding program as a training population has showed a positive effect on prediction accuracy (Endelman et al. 2014;Daetwyler et al. 2008). It was demonstrated that a training population including lines from the same family, half sibs, and more distant lines could be efficient for a GS scheme (Verges and Van Sanford 2020). Using historical advanced breeding lines developed at different periods together with germplasm lines from the same breeding program may also be a good strategy for association mapping 1 3 Vol.: (0123456789) and genomic prediction studies, simply because all those lines could represent all genetic sources in the program with historical recombination maintained to ensure mapping resolution, particularly when the training population is keeping updated as the new germplasm lines were introduced into the program.
Another advantage of using advanced breeding lines in GWAS and genomic prediction is that the lines have been evaluated in many years under different environmental conditions and have phenotypic data already available. This would be especially helpful for traits such as grain yield that requires many resources for phenotyping (Verges and Van Sanford 2020). However, the most significant difficulty in using phenotypic data of the historical breeding lines for GWAS is that the lines are evaluated at different time and that the data were typically imbalanced. Research using imbalanced data in historical breeding lines for GWAS and genomic prediction have been seldom reported. The possibility of using imbalanced data for GS was explored by clustering analysis of data through pre-defined mega-environments based on climatic patterns, farming systems, water regimes, and the incidence of biotic and abiotic stress, but this strategy appears ineffective for genomic selection (Dawson et al. 2013). Therefore, it is necessary to identify an appropriate way that can directly utilize the imbalanced historical data for GWAS and genomic prediction.
During the past few decades, the Texas A&M AgriLife Research and Extension Center at Amarillo, TX, has released several drought tolerant cultivars such as 'TAM 105', 'TAM 107', 'TAM 111', and'TAM 112' (Porter et al. 1980, 1987;Lazar et al. 2004;Rudd et al. 2014). Among them, TAM 111 and TAM 112 were two broadly planted cultivars in the Great Plains hard red winter wheat regions since 2010 based on planted acreages (NASS, 2011(NASS, -2013. They have been the core parents in many HWW breeding programs due to their drought tolerance and wide adaptability that have been confirmed using RNA-seq . A newer release 'TAM 114' has superior bread-making quality and drought tolerance . A grain and forage dual purpose awnless wheat 'TAM 204' has higher yield, drought tolerance, and a good level of resistance to insects such as greenbug, Hessian fly, and wheat curl mite ). These widely adapted winter wheat cultivars have been used as germplasm lines in wheat breeding programs in the USA and many other countries. To localize their genes conferring the superior traits will greatly improve selection efficiency for improving yield, end-use quality, and tolerance to biotic and abiotic stresses (Liu et al. 2016a;Yang et al. 2020Yang et al. , 2019Yu et al. 2021).
In this research, we used a set of 227 elite breeding lines (including the aforementioned released cultivars) developed by Texas A&M AgriLife Research wheat breeding programs in the last 10 years as the mapping panel for conducting GWAS to identify favorite alleles and as the training population to build a genomic prediction model for selecting grain yield. By combining correlation analysis with genetic heritability estimation using the imbalanced yield data collected from diverse environments, we successfully developed a data management strategy of using the imbalanced historical yield data for conducting GWAS and building genomic prediction models. In addition, the genomic prediction model was further validated using a set of newly developed advanced breeding lines from Texas wheat breeding programs.

Plant materials
The set of 227 Texas elite (TXE, F 9 ) breeding lines were developed by the two Texas A&M AgriLife Research wheat breeding programs located at Amarillo and College Station, TX, during 2009-2018, which included 13 released TAM cultivars. Briefly, the lines were originally selected at the F 6 generation according to their performance in the observation yield trials conducted at two locations followed by evaluation preliminary yield trials in six locations and advanced yield trials in ten locations at the F 7 and F 8 generations, respectively. The state-wide yield TXE trials were conducted at 16 locations across Texas with three replicates per location. The superior TXE lines were further evaluated for traits of yield, end-use quality, biotic and abiotic tolerance, and agronomic traits either toward the new cultivar release or as the germplasms to enter the new breeding cycles. Therefore, the set of TXE lines represented the major gene sources in Texas wheat breeding programs and were appropriate materials for building genomic prediction models to improve selection efficiency. In addition, a set of 105 lines entered into the advanced yield trials was evaluated at ten locations with two replications per location, and their yield data were used to validate the genomic prediction model developed from the TXE collection.
Grain yield data analysis The TXE trials conducted during 2009-2018 were conducted in three replications at 16 locations that represented four typical wheat growing regions (High Plains, Rolling Plains, Blacklands, and South Texas) in Texas (Fig. S1), and grain yield data were collected from 6 to 14 locations each year (Table 1) due to the abandoned harvest in some locations experienced serious damage caused by severe weather. Totally, yield data were from 102 environments defined as year-by-location combinations (Table 1). However, TXE yield data were typically imbalanced with 40 to 66 lines evaluated each year and 38 to 41 lines were also tested in the following year (Table 1). Three cultivars, 'TAM 112', 'TAM 401', and 'TAM W-101' were used as controls across all years.
To manage the imbalanced yield data for using in GWAS and then developing genomic prediction model for grain yield in Texas wheat breeding programs, genetic correlations coefficients among yield data of common TXE lines in different environments were calculated through R package META-R (Alvarado et al. 2020), and the significantly positive correlation indicated that those common TXE lines showed similar trends of reacting to growing condition under those environments. Yield data collected from all environments were then grouped according to their correlations. For example, if data of common lines in datasets A and B were correlated and data of common lines in datasets B and C were correlated, the three datasets A, B, and C will be kept in one correlated group though no common lines between A and C for correlation calculation. The best linear unbiased prediction (BLUP) in each correlation group was calculated using the mixed model y = Xβ + Zu + ε using the R package lme4 (Bates et al. 2015) with genotype was set as the only random effect, where y represents the vector of observations, β and u mean fixed and random effects, respectively, X and Z are matrices of observations related to fixed and random effects, respectively, and ε is the residual of the model. Since genotype was the only random effect in this model, variation from random effect will be the genetic variance (V G ) and thus can be used to estimate the heritability (H 2 ) of grain yield in each correlation group using the formula where V e is the residual variance. If a dataset from one environment was included in heritability estimation and lead to an increase in yield heritability of one correlation-based group, the dataset was kept in that correlation-based group. Such a heritability estimation was conducted for each environment, including the datasets that were non-correlated in correlation analysis in the previous step, to finally determine if the dataset should be kept in the corresponding correlation-based group. Once correlation-based groups of yield data were finalized through heritability estimation, the R package lme4 and the mixed model y = Xβ + Zu + ε were used again to calculate the best linear unbiased estimation (BLUE) of each line in each correlation-based group with genotype was set as the fixed and environment set as the random effects. The yield BLUEs calculated in each group were used for GWAS and developing genomic prediction models.
Evaluation resistance to greenbug and Hessian fly Growth chamber and greenhouse experiments were conducted to evaluate resistance to greenbug (Schizaphis graminum Rondani) and Hessian fly (Mayetiola destructor Say) in the TXE lines. Briefly, wheat plants grown in one-gallon pots were maintained in 60 × 60 × 60 cm cages (MegaView Science Co., Ltd., Taichung, Taiwan) equipped with insect proof mesh. Greenbug biotype E and Hessian fly biotype GP colonies were established and maintained on caged wheat plants for approximately 6 weeks prior to the evaluation experiments and were used as the source for the subsequent assays.
Greenbug infestation was done according to Weng and Lazar (2002). Cultivar 'TAM 105' or TAM 111 was the susceptible control while 'TAM 110' was the resistant control. Sixteen lines and two controls were grown in a 30 × 50 cm flat with 20 seeds per line. At the three-leaf stage, about 500 greenbugs were scattered over each test flat and the flats were then kept in a growth chamber at 22 °C with a day length of 8 h. The plant was scored as either resistant (normal healthy) or susceptible (chlorotic leaf and necrotic stem lesions) 10-14 days after infestation. Percentage of resistant plants in each line was recorded for GWAS.
Hessian fly infestation was conducted as described in Chen et al. (2009). Wheat accessions 'Carol' (H3), 'Cardwell' (H6), and 'Molly' (H13) were used as the resistant checks, and 'Danby' as the susceptible control. Twenty lines and four checks with 25 seeds per line were planted in one plastic flat (56 × 36 cm) in a greenhouse at 18 ± 3 °C with day length as 14 h. When the first leaf was fully expanded and the second leaf started emerging, about 200 newly mated female flies were released to each flat covered with a cheesecloth tent (540 × 120 × 40 cm). Resistance rating was conducted 3 weeks later and the stunted plants having bloated live larvae at stem base were considered as susceptible (S), and the normally healthy plants with small dead larvae or tiny live larvae between leaf sheaths as resistant (R). Percentage of resistant plants per line was calculated for GWAS.

SNP genotyping and marker data management
Whole genomic DNA was extracted from leaf samples using CTAB (cetyl trimethylammonium bromide) method (Stewart and Via 1993) with slight modification (Liu et al. 2013). SNP genotyping was done through ddRADSeq procedure (Peterson et al. 2012). Briefly, genomic DNA was co-digested with two restriction enzymes PstI (CTG CAG ) and MspI (CCGG) and barcoded adapters were then ligated to DNA segments of each individual sample. Adapter oligos were synthesized from Integrated DNA Technologies (IDT), Inc. The ddRADSeq libraries were constructed using 96-plex plate with a single random blank well used for quality control and were then sequenced through an Illumina HiSeq 2000 at the Genomics & Bioinformatics Services of Texas A&M AgriLife Research at College Station, TX (Yang et al. 2020), and SNP calls were made using the reference-based Stacks Pipeline (Catchen et al. 2013) using IWGSC v1.0 as the reference genome (IWGSC 2014), which obtained over 247,000 raw SNP data with missing rate below 50%. Considering all TXE lines were at F 9 generation or later that have a very low level of heterozygosity and thus the homozygous SNP readings should be more reliable, which is also approved by comparison of SNP readings of few control lines with 2-4 replications included for DNA sequencing and SNP calling (data not shown). Majority of heterozygous SNP readings were more likely due to technique error during sequencing according to their extra high heterozygosity rate. Therefore, heterozygous marker data from TXEs thus were all converted as the missing data, and all SNP data with > 30% missing rate or MAF < 5% were removed using the computer package Tassel v5.0 (http://www.maizegenetics.net/) (Bradbury et al. 2007), which retained over 75,000 SNPs with a higher level of reliability. Genotype imputation with accuracy of 98% was then conducted using computer program Beagle (v5.0) (Browning and Browning 2007) and achieved data missing rate to less than 10%. Imputed data were filtered again by removing SNPs with MAF less than 5% and obtained the final set of 70,525 SNPs were used for GWAS.
In the set of 105 advanced breeding lines from yield trial in 2018, SNP genotyping and marker data management was done using the similar methods as indicated for TXE collections. A total of 384,648 SNPs was called and imputed in the 105 advanced lines with marker data missing rate less than 10%. Among those SNPs, 37,975 were the common set between TXE and advanced breeding lines and were extracted for validating the accuracy of genomic prediction model developed from TXE by comparing the predicted with observed yield in those 105 advanced lines.

Population structure analysis
From the raw 247,000 SNPs in TXE with a data missing rate of less than 50%, a set of 8,401 SNPs with data missing rate less than 18% and heterozygosity less than 5% was considered the most reliable markers for analyzing population structure in TXE lines. The computer program Structure v2.3.4 (https://web.stanford.edu/group/pritchardlab/structure.html) (Falush et al. 2003;Pritchard et al. 2000) was used with the number of presumable sub-populations (K) set from three to ten with iteration number equal to ten based on preliminary structure scan in the TXE collection. For simulation running under each K, length of burnin period was set to 10,000 and number of MCMC replicates was set to 100,000 with the model of admixture and correlated allelic frequency was used. The number of sub-populations was then determined using delta K (ΔK) method described in Evanno et al. (2005) through the online tool Structure Harvester (http://taylor0.biology.ucla.edu/structureHarvester/). Meanwhile, phylogenetic tree using 70,525 imputed SNP data through UPGMA (unweighted pair group method with arithmetic mean) hierarchical clustering method was also carried out through Tassel v5.0 (Bradbury et al. 2007) and the clade tree was drawn using the online tool Interactive Tree Of Life (iTOL v5) (Letunic and Bork 2019). The phylogenetic tree was used to verify the results obtained from Structure v2.3.4.
Genome-wide association studies GWAS was carried out using the set of 70,525 imputed SNPs through Tassel v5.0 (Bradbury et al. 2007). Principal component analysis (PCA) was conducted with the number of sub-populations determined by Structure v2.3.4 to generate the Q-matrix that incorporated as the covariate in association analysis, and the fixed and random effects mixed model (MLM) (Liu et al. 2016b;Yu et al. 2006;Zhang et al. 2010) was used for association mapping with the K-matrix showing the relationship of all individuals that was used to account for effects due to kinship. For detecting genomic regions associated with grain yield, the yield BLUE of each line calculated using the R package lme4 in correlation-based groups was used as the trait data and GWAS was separately conducted in each correlation-based group. Bonferroni adjustment using R package simpleM (http://simplem.sourceforge.net/) (Gao et al. 2010) determined the significant threshold of − log10(P) = 4.0 for grain yield and − log10(P) = 6.0 for insect resistance. the second group containing 67 lines from 2012 to 2014, and the third group had 68 lines from 2015 to 2017 (namely the newly developed group). Therefore, comparing allele frequency between the first and the third groups would have a good indication of allele drifting due to breeding selection in Texas wheat breeding programs. Allele frequency change was investigated focusing on the major allele genotype of SNPs in the TXE collection. The frequency of each SNP major allele was respectively calculated in the first and third groups and then to find the difference between the two frequencies.
Genomic prediction model development For developing genomic prediction models, 7,573 SNP markers from all chromosomes with at least one million bases (Mb) apart were selected for estimating the mean effect of each marker. The R package rrBLUP (ridge regression best linear unbiased prediction) (Endelman 2011) was used for developing genomic prediction model. The mixed model y = µ + Xβ + ε was used with y as the vector of phenotypic means, µ as the overall mean, X as the marker matrix, β as the vector of marker effects, and ε as the vector of residual effects. The genomic estimated breeding values (GEBVs) of each line were calculated by adding the grand mean to the product of genotypic matrix and the vector of mean effect of each marker. The prediction accuracy was measured by the correlation between the predicted and observed yield BLUEs. Genomic prediction models were developed separately in each of the correlation-based groups, and the prediction accuracy was estimated at three times using 60%, 70%, and 80% of TXE lines as training sets and 40%, 30%, and 20% as testing sets, accordingly. For each training/testing set, prediction accuracy was obtained based on a calculation using 500 repeated runs.
To validate the prediction models developed in each correlation-based group, all TXE lines were used as the training set and 105 advanced breeding lines were used as the testing set. The common SNPs between TXEs and advanced breeding lines with at least 1-Mb apart on each chromosome were selected. Marker effects were estimated using BLUEs of each correlation-based group through rrBLUP mixed model y = µ + Xβ + ε. The GEBVs of each advanced breeding line were calculated by adding the grand mean to the product of genotypic matrix and the vector of mean effect of each marker. Prediction accuracy was then measured through the correlation between predicted and observed yield in 105 advanced breeding lines.

Grain yield in two correlation-based groups
Based on correlations among yield of overlapped lines in different environments, yield data from eighteen environments were not correlated with any of the remaining 84 environments (Tables S1 and S2). Data from those 84 environments were divided into two correlation-based groups with groups G1 containing 36 and G2 including 48 environments (Table 1). After heritability estimation, data from five non-correlated environments were added into the group G1 but data from two environments were dropped from the group G2. Therefore, group G1 contained data of 41 environments and group G2 carried data from 46 environments for further analysis, and data from 15 noncorrelated environments were abandoned (Table S3). Interestingly, the majority of dataset in correlationbased group G1 included environments from the High Plains and Rolling Plains that normally have low level of rainfall and represented the drought-prone areas of wheat acreage in Texas, whereas the majority of data in group G2 contained environments from Blacklands and South Texas that usually received relatively a higher level of rainfall and represented the wet growing conditions of Texas (Table S3).
Of the two correlation-based groups that respectively included data collected from 41 and 46 environments, the best linear unbiased estimation (BLUE) of each line was calculated through R package lme4 and the skewed yield distributions were observed in both groups (Fig. 1). In correlation-based group G1, the grain yield of all lines ranged from 2,000 to 4,500 kg/ha, but 201 (88.5%) lines have grain yield in the range of 3,250-4,000 kg/ha. Whereas in the correlation-based group G2, grain yield of each line was in the range of 2,250-4,750 kg/ha but with more lines distributed in a broader range (Fig. 1, Table S11). The grain yield in two datasets was not correlated and Bartlett's test also indicated heterogeneous variances in the two datasets. Since the correlation-based groups G1 and G2 were corresponding to dry and wet growing condition in north and south Texas, respectively, the broader trait variation in group G2 corresponding to the less stressful growing conditions in south Texas may more likely explain the genetic variance.

SNP genotyping and population structure in TXE collection
Of the 70,525 imputed SNPs with data missing rate less than 10%, the B-genome chromosomes carried the most (37,773) SNPs, followed by the chromosomes in the A-genome (23,782) and the D-genome (8,970) (Table S4), indicating the relatively lower level of genetic diversity in the D-genome. Particularly, only 731 and 708 SNPs were identified on chromosomes 4D and 5D, respectively, suggesting that the two chromosomes in TXE lines have the least variations.
Population structure analysis revealed five subpopulations in the TXE collection (Fig. 2) and the population was mainly admixed with all released cultivars (Fig. S2) spread into different sub-populations, which indicated that the TXE collection covered primary gene sources in Texas wheat breeding programs. Phylogenetic tree developed using 70,525 imputed SNPs also suggested the similar population structure (Fig. 3). Fig. 1 Distribution of the best linear unbiased estimations (BLUEs) for grain yield in the two correlationbased groups G1 and G2. BLUEs were calculated using the R package lme4 (Bates et al. 2015) 1 3 Vol.: (0123456789)

Genome-wide association studies of grain yield in TXE lines
Association analysis identified 74 genomic regions in two correlation-based groups associated with grain yield with significant level above threshold of − log10(P) = 4.0 (Fig. 4, Tables S5 and S6). Of those associations, two regions were commonly detected in both groups and located on chromosome 1D at 229.6 and 345.4 Mb with − log10(P) scores of 4.4 and 4.8 and each contributing around 10% of yield variation. The favorite alleles at two regions had the additive effects of increasing yield by 330.5 and 325.0 kg/ha in group G1 and 406.7 and 343.5 kg/ha in group G2, respectively. In addition, 17 genomic regions identified only in correlation-based group G1 and 55 regions only in group G2 were associated with grain yield. In group G1, those genomic regions were located on nine chromosomes, namely, 2A, 3A, 3B, 5A, 5D, 6B, 6D, 7B, and 7D. The − log10(P) scores ranged from 4.0 to 5.8 and explained 7 to 14% of yield variations with favorite alleles having the potential of increasing yield 220.9-816.3 kg/ha. There were ten genomic regions with each explaining 10% or more of phenotypic variations located on chromosome 2A at 308.1 Mb, 3A at 12.8 Mb, 3B at 245.3 Mb, 5A at 24.1 Mb, 6B at 32.8 Mb and 682.6 Mb, 6D at 454.1 Mb, and 7B at 627.2, 637.8, and 676.8 Mb. In group G2, the significant genomic regions were detected from all chromosomes with the significance varying from − log10(P) = 4.0 to 8.1 and each accounting for 7 to 16% of trait variations with the favorite allele having the additive effects of increasing yield 258.0-516.1 kg/ha. The most significant association was located at 14.7 Mb on chromosome 7D and explained 16% of the phenotypic variations with the favorite allele having the potential of increasing yield by 516.1 kg/ha (Tables S5 and S6). A total of 16 regions on seven chromosomes each explained over 10% of trait variations in group G2 (Table S5 and Fig. 4).

Genome-wide association studies of greenbug and Hessian fly resistance in TXE lines
For reaction to greenbug, 173 and 42 TXE lines were susceptible and resistant, respectively, and twelve lines showed partial resistance (Table S7). GWAS indicated that three genomic regions on chromosome 3DL (565.0 Mb), 6DS (7.2 Mb), and 7DL (597.9 Mb) were associated with greenbug resistance in TXE lines ( Fig. 5a and Tables S8 and S9). A region on chromosome 7DL showed the largest effect and explained 48.7% of the trait variation in TXE lines, and the regions on 3DL and 6DS explained 10.7% and 15.3% of phenotypic variation, respectively. For reaction to Hessian fly, data were obtained from 219 TXE lines with 166 susceptible, 18 resistant, and 35 partially resistant (Table S7). Only the genomic region on 1AS at 7.8 Mb was significantly associated with the resistance and explained 17.0% of trait variation in TXE lines ( Fig. 5b and Tables S8 and S9).

SNP allele drift in TXE lines due to breeding selection
By comparing frequency of major alleles in 70,525 SNPs between TXE groups of 2009-2011 (old) and 2015-2017 (new), allele frequencies of 10,034 SNPs decreased. Meanwhile, allele frequency in a different set of 974 SNPs each increased over 20% in the new TXE group. Of the SNPs with allele frequency decreasing in the newly developed TXEs, the allele genotype of 2,000 SNPs that have been the major alleles in TXE group of 2009-2011 was changed to minor allele. Whereas in SNPs with allele frequency increasing in the newly developed  (Table S10). Comparing the SNPs that had significant associations in increasing yield, several were located in the vicinity of the SNPs that had allele status changing from minor to major in new TXEs, such as the ones on 2A at 101.9 Mb, 3B at 245.3 and 738.3 Mb, 6D at 454.1 Mb, 7A at 620.3 Mb, 7D at 621.2 Mb, and 7B at 654.1, 711.9, and 741.0 Mb. Each of these alleles showed the potential of increasing yield by 281.8-611.3 kg/ha (Table S5). Among the 74 SNPs significantly associated with yield in groups G1 or G2, there were trends that the newer lines or cultivars had more favorite alleles for increasing yield (Table S6). Based on the pseudomolecule physical position indicated in the reference wheat genome sequence (IWGSC 2014), markers with allele frequency changing over 20% were mostly located at the distal sides of the chromosomes (Fig. 6) and agreed with the higher rate of recombination observed at the distal regions of the chromosomes.

Genomic prediction in TXE lines and model validation using advanced breeding lines
Genomic prediction models were tested in three situations that randomly picked 60%, 70%, and 80% of TXE lines as training populations to predict the remaining TXE lines. After 500 independent runs in each situation, prediction accuracy using yield data from correlation-based group G2 is higher than using data from correlation-based group G1 (Table 2). Average prediction accuracies in group G1 varied from 0.42 to 0.44 but that increased from 0.68 to 0.70 in group G2 as the size of training population increased. The lowest range of the prediction accuracies was 0.14-0.19 in group G1 and 0.49-0.52 in group G2 and the maximum prediction accuracies were 0.61-0.75 in group G1 and 0.81-0.89 in group G2. This indicated that yield data in correlation-based group G2 were more reliable for genomic prediction.
To validate the prediction models developed in two correlation-based groups, yield data in a set of 105 advanced breeding lines were collected in 2018 from four environments including rain-fed and irrigated location in Bushland, TX, irrigated location in Etter, TX, and the rain-fed location in McGregor, TX. All TXE lines were used as the training set and 105 advanced breeding lines were used as the testing set. From the common 37,975 SNPs between TXE and advanced breeding lines, a total of 5,542 SNPs that were at least 1-Mb apart on each chromosome were selected for genomic prediction, and the prediction accuracies using the models developed from the correlation-group G2 ranged from 0.12 to 0.29, but none of the predictions based on the models from correlation-group G1 was correlated with the observed yield (Table 3). This is consistent with previous results of models from correlation-group G2 which were more reliable for genomic prediction when using TXE lines as both training and testing sets.

Discussion
Wheat grain yield is a very complex trait and is affected by numerous genes involved in many different biological processes affecting plant development, photosynthesis, carbon mobilization, grain filling, and maturity. The effect of each gene is very limited and varied under different environments. Testing grain yield in breeding lines thus demands major efforts and breeding resources since it needs to be done in many locations under multiple years with replications included. Using historical yield data in the past or current breeding lines or cultivars to conduct genome-wide association analysis and genomic prediction will provide a cost-effective way of identifying beneficial genes for increasing yield and cumulating favorite alleles for crop improvement. However, imbalanced historical data obtained during breeding are hard to use since each environmental condition is unique and cannot be repeated. Interactions between genotypes and environments vary at different times and locations, which greatly increased difficulties of identifying favorite alleles. In this study, we developed a strategy of using correlations among yield data of overlapped lines evaluated at different times and locations to group different environments that have showed interactions with similar magnitudes. The grouping is further tested through heritability estimation. The best linear unbiased estimation (BLUE) calculated in each correlation-based group was then used for GWAS and developing genomic prediction models, which were further validated through a set of advanced breeding lines. This research thus developed a new strategy of using the imbalanced historical data for detecting beneficial genes and conducting genomic prediction in breeding.
There are numerous QTLs for yield and yield components identified from bread wheat trials worldwide. Chromosome regions with significant SNPs associated with yield from GWAS in this study were almost coincided to the position of several QTLs identified from previous research from bread wheat trials conducted in the US Great Plains or other regions (Table S5) 2 Mb on 7D that were associated with yield and yield components identified from a bi-parental mapping population derived from the cross between two popular cultivars TAM 111 and TAM 112 (Yang et al. 2020). Both cultivars and their derivatives have been the core parents in both Texas whet breeding programs. Particularly, the region around 591.2 Mb on 7D is harboring gene Gb3 conferring greenbug resistance . Breeders found that the majority of the TAM 112 derivatives had a decent yield in dry environments as Gb3 was kept (J Rudd, personal communication, 2020).
Yield-associated regions at 603.0 Mb on chromosome 3D and 25.4 Mb on 7B were very close to the QTLs associated with spikes per square meter at 603.8 Mb on 3D linked to XIWA6485 and kernel per spike at 22.6-24.7 Mb on 7B linked to XIWB71684 with favorite alleles from 'TAM 111' (Assanga et al. 2017). The yield-associated region at 603.0 Mb on 3D was very close to a QTL at 603.4 Mb on 3D of 'ND 705' that was associated with spikes per square meter and linked to XIWB17317 .
Yield-associated regions at 532.8 Mb on 1A and at 711.5 Mb on 7B from this study were physically close to a flour yield QTL at 533.4 Mb on 1A and a grain volume weight QTL at 709.6 Mb on 7B detected in a recombinant inbred mapping population derived from cross between TAM 111 and TAM 112 population (Yang et al. 2020;Dhakal et al. 2021a). The yield-associated region at 531.3 Mb on 2D from this study was also identified in the mapping population derived from TAM 112/TAM 111 and associated with thousand kernel weight and kernel diameter (Yang et al. 2020;Dhakal et al. 2021b). Since cultivars TAM 111 and TAM 112 were core parents used in the Texas A&M AgriLife Research wheat breeding programs, it is very possible that these favorite alleles were carried through generations due to selections.
The yield-associated region at 16.2 Mb on 7D was close to gene TaGS3-D1 located in region 6.5-6.8 Mb on 7D affecting wheat kernel weight and length (Rasheed et al. 2016;Zhang et al. 2014). Two QTLs at 32.8 Mb and 47.5 Mb on 6B associated with thousand kernel weight (Zou et al. 2017) coincided with the two yield-associated regions at 32.8 and 47.5 Mb on 6B in this study. The regions at 709.2 Mb on 3A, 625.7 Mb on 5A, and 633.9 Mb on 6B from this study were very close to gene TaTGW6-A1 at 711.1 Mb on 3A, a QTL at 619.5 Mb on 5A for thousand kernel weight, and a QTL at 631.8 Mb on 6B for grain volume weight . Therefore, the consistence between findings of this research and those from previous reports further proved that the strategy we developed for using imbalanced historical breeding data is effective for identifying important beneficial genes.
The region at 597.9 Mb on 7DL showed significant association with greenbug resistance and is corresponding to Gb3, the gene known to be carried by germplasms used for developing TXE lines and present in cultivars such as TAM 110, TAM 112, 'TAM 115', and TAM 204 (Lazar et al. 1997;Rudd et al. 2014Rudd et al. , 2019Weng and Lazar 2002;Liu et al. 2014). The other two regions on 3DL and 6DS with minor effects on greenbug resistance might be novel genes since no greenbug resistance have been reported from these two genomic regions. Hessian fly resistance was associated with a region on 1AS (7.8 Mb) in this study; the position coincided with a Hessian fly resistance QTL on 1AS in 'Duster' (PI 644,016) (Li et al. 2015;Edwards et al. 2012). It is likely that Hessian fly resistance in TXE lines is derived from Duster since many TXE lines had this cultivar in their pedigree. It thus indicated that the additional resistance sources need to be included in the future Texas wheat breeding to maintain the long-lasting resistance to Hessian fly.
From this study, eight chromosome regions significantly associated with yield, along with several major genes in TXE lines such as wheat curl mite resistance genes Cmc TAM112 and Cmc3 (Dhakal et al. 2018;Dhakal et al. 2017;Zhao et al. 2021;Gaurav et al. 2021), seed storage protein subunit genes Gli-B1 and Glu-D1, dwarf gene Rht-B1, and grain weight and length gene TaGS-D1 Liu et al. 2014). These mostly coincided with regions in which allele frequencies were greatly increased in the newly developed lines (Fig. 6b; Table S10), which may be a good indication of accumulating favorite alleles during selection. Particularly, the recently released cultivars fall into different sub-populations (Figs. 3 and S2) showing improved yield, disease and/or insect resistance, drought tolerance, and enhanced baking and milling quality attributes. As aforementioned, cultivars TAM 111 (Lazar et al. 2004) and TAM 112  were used as parental lines in new releases due to their high yield and superior drought  TXE, 68 lines). a Allele frequency decreased in newly developed TXEs. b Allele frequency increased in newly developed TXEs with some major genes indicted in the corresponding position according to previous research Dhakal et al. 2018;Zhang et al. 2014). Physical position of SNPs was determined according to pseudomolecule position in wheat reference genome v1.0 (IWGSC 2014). Darker regions indicated that more markers have frequency changed ◂ tolerance in addition to the greenbug and wheat curl mite resistance carried in TAM 112. For example, cultivar TAM 114 derived from crosses of using TAM 111 as parents showed excellent baking and milling quality, and intermediate resistance to Hessian fly , and cultivar TAM 204 selected from the crosses involving TAM 112 showed a good level of resistance to greenbug, Hessian fly, and wheat curl mite in addition to the high grain yield . The newly released TAM 115 is also a selection from crosses involving TAM 112 and showed high yield, good drought tolerance, and resistance to greenbug and wheat curl mite (Rudd et al., not published). Therefore, further research focusing on the regions where allele frequency greatly increased in those newly developed TXE lines may provide an efficient way of revealing beneficial alleles for wheat improvement.
Of the two correlation-based groups G1 and G2 developed through historical yield data of TXE lines, results from GWAS and genomic prediction both indicated that yield data from group G2 may be more reliable. This is supported by the fact that group G2 contained environments either in north Texas under irrigated conditions or from south Texas with relatively higher level of rainfall and thus had better growing conditions. On the other hand, the group G1 included mainly dryland environments with severer drought stress, which greatly limited the expression of yield potential in each line and led to a much narrow yield variation (Fig. 1). Similarly, GWAS detected fewer genomic regions significantly associated with yield in group G1 than in G2 (Table S5), and genomic prediction models in two groups using 60-80% of lines as training set also pointed to lower prediction accuracy in group G1 (0.14-0.75) than in G2 (0.49-0.89) ( Table 2). Validation of genomic prediction models through a set of advanced breeding lines also indicated that predictions made using yield data of group G2 showed stronger correlation with the observed data (Table 3). Therefore, the strategy of combining correlation and heritability estimates to group data from different environments used in this study also provided a way of selecting appropriate data from diverse environments for genomic predicting in breeding.