Genome-wide association study and genomic selection for plant height, maturity, seed weight, and yield in soybean

study A total of 250 soybean accessions were evaluated for maturity, plant height, seed weight, and yield over three years. This panel was genotyped with a total of 10,259 high quality SNPs postulated from genotyping by sequencing (GBS). Population structure was inferred using STRUCTURE 2.3.4, GWAS was performed using a Bayesian Information and Linkage Disequilibrium Iteratively Nested Keyway (BLINK) model, and GS was evaluated using a ridge regression best linear unbiased predictor (rrBLUP) model. The results revealed that: a total of 20, 31, 37, 31, and 23 SNPs were signicantly associated with the average 3-year data for maturity, plant height, seed weight, and yield, respectively; some signicant SNPs were mapped into previously described loci (E2, E4, and Dt1) affecting maturity and plant height in soybean and a new locus mapped on chromosome 20 was signicantly associated with plant height; Glyma.10g228900, Glyma.19g200800, Glyma.09g196700, and Glyma.09g038300 were candidate genes found in the vicinity of the top or the second best SNP (if no annotated genes found close the top one) for maturity, plant height, seed weight, and yield, respectively; a 11.5-Mb region of chromosome 10 was associated with both seed weight and yield; and GS accuracy was trait-, year-, and population structure-dependent. were found on chromosome 19. The plant height-associated SNP with the highest LOD value found in this study was mapped on chromosome 19. We found that all signicant SNPs associated with plant height and mapped on chromosome 19 overlapped with the Dt1 locus, which has been well-described for controlling plant height in soybean [5]. These ndings suggest that the present investigation contributes towards enriching SNP markers associated with plant height, which is essential in eciently establishing a breeding pipeline for agronomic trait improvement in soybean.

Correlation between maturity, plant height, seed weight, and yield Pearson's correlation coe cients (r) between maturity, plant height, seed weight, and yield were computed. Maturity was almost not correlated with seed weight (r = 0.054) and yield (r = 0.019), respectively (Table 4). However, a medium correlation was found between maturity and plant height (r = 0.439). A low correlation was identi ed between seed weight and yield (r = 0.080). Seed weight was negatively correlated with plant height (r=-0.257). Plant height was not correlated with yield (r=-0.004) ( Table 4).
For each trait, Pearson's correlation coe cients were also calculated between the three years. The correlation between years was high for maturity, seed weight, and plant height, and relatively moderate for yield. For maturity, the highest correlation was between data from 2008 and 2010 (r = 0.763) ( Table 4), whereas the lowest one was between 2009 and 2010 (r = 0.657) ( Table 4). For seed weight, the data were highly consistent over years since the lowest correlation was 0.876, corresponding to seed weight in 2008 and 2010 (Table 4). Yield data varied from year to year where the highest correlation was found between 2009 and 2009 (r = 0.604), and the lowest one between 2008 and 2010 (r = 0.484) ( Table 4). Similarly to seed weight, data were also consistent over years for plant height where the lowest correlation was 0.848 for the data obtained in 2009 and 2010 (Table 4).
Population structure and genetic diversity analysis SNP ltering yielded a total of 10,259 high-quality SNPs that were used for GWAS and GS. SNP number per chromosome varied from 292 (chromosome 12) to 785 (chromosome 18), with an average SNP number of 513 per chromosome (Table 5). There was also a large variation in the SNP density per chromosome with chromosome 16 having the lowest SNP density (73.29 kb between two adjacent SNPs) and chromosome 12 having the highest one (136.50 kb) ( Table 5). The average distance between two adjacent SNPs was 94.86 kb. The average minor allele frequency (%) per chromosome ranged between 16.24% and 24.73%, with an average of 21.02% ( Table 5). The average percentage of heterozygous SNPs per chromosome varied from 0.42-0.74%, with an average of 0.59%. There was not signi cant discrepancy in the average percentage of missing SNPs per chromosome (Table 5).
STRUCTURE Harvester indicated a delta K peak at K equal to 2 (Fig. 3A), indicating that the panel involving the 250 soybean genotypes consisted of two subpopulations. The bar plot from STRUCTURE 2.3.4 showed the two-well differentiated subpopulations where the subpopulation 1 was shown in red and the subpopulation shown in 2 in green (Fig. 3B). The rst cluster consisted of 146 genotypes, accounting for 57.2% of the total population. The second subpopulation involved a total of 103 genotypes, which corresponded to 40.4% of the whole panel. In addition to the two distinct subpopulations, an admixture consisting of 6 genotypes was also identi ed and accounted for only 2.4% of the soybean panel used in this study. The mean inbreeding coe cients of the subpopulation relative to the total population were 0.565 and 0.043 for the subpopulation 1 and subpopulation 2, respectively (Table S2). The average distances between individuals in the same cluster were 0.332 (subpopulation 1) and 0.154 (subpopulation 2) ( Table S2). The overall proportions of membership of a genotype within each cluster were 0.567 and 0.433 for subpopulation 1 and subpopulation 2, respectively (Table S2). The average allele frequency divergence among populations was 0.095 (Table S2).
A genetic diversity analysis was also carried out along with population structure as shown in Fig. 3C. In the phylogenetic tree, the subpopulation 1 was displayed using solid red circles, the subpopulation 2 was shown using solid green circles, and the admixture using solid blue circles (Fig. 3C). A good correlation was found between the genetic diversity analysis and the structure analysis. A large number of genotypes belonging to the subpopulation 1 was placed on the left-hand side of the phylogenetic tree (Fig. 3C), whereas most of the genotypes grouped into subpopulation 2 were clustered on the right part of the phylogenetic tree (Fig. 3C). The admixture genotypes were randomly scattered in the phylogenetic tree.
Genome-wide association study (GWAS) in maturity, plant height, seed weight and yield

Seed weight
Similarly to the pattern of SNPs found for maturity and plant height, the SNP number and location varied between years for seed weight (Fig. 6). For the combined data (2008,2009, and 2010), a total of 37 SNPs were found to be associated with seed weight ( Table 7). The SNP Chr04_36949349 was the only one which was consistent across three years (LOD_2008 = 13.55, LOD_2009 = 9.01, LOD_2010 = 26.76, MAF = 16.47%) ( Table S3) (Table 7). T-test analysis was signi cant for genotypic class comparison for these SNPs (Figs. 8K-O). Out of the 37 SNPs associated with the average seed weight over three years, 10 were located on chromosome 10 ( Table 7).
The average LOD values of the signi cant SNPs in 2010 was also lower than that of found in 2008 but was almost similar to the 2009 results (Table S3). The SNP markers for the 2010 yield data were scattered across the soybean genome (  (Table S3), which were located on chromosomes 8, 5, 18, 3, and 14, respectively (Fig. 7C).

Candidate genes selection
Candidate genes found within a 10-kb genomic region spanning a signi cant SNP associated with the combined data over three years for maturity, plant height, seed weight, and yield were investigated. For maturity, a total of 20 signi cant SNPs were identi ed ( Table 6). Of which, 14 were mapped to genomic regions harboring annotated genes in Soybase (www.soybase.org). The annotated genes had a wide variety of functions. Leucine-rich repeat (LRR) domain was prevalent as shown in Table 6. The candidate genes associated with the most signi cant SNPs were Glyma.10g228900, Glyma.13g220200, Glyma.08g046800, Glyma.17g100600, and Glyma.08g176000, which encoded for LRR receptor-like protein kinase, F-box/leucine rich repeat protein, sensor histidine kinase, malate and lactate dehydrogenase, and replication protein A-related, respectively (Table 6). A candidate gene involved in epigenetics such as O-methyltransferase was also identi ed.
Of the 31 signi cant SNPs associated with the combined data for plant height, 25 were found in the vicinity of annotated genes (Table 6). Functional annotations of the candidate genes were diverse and mainly included transcription factors, kinases, and biomolecule transporters. The candidate genes found near the most signi cant SNPs were Glyma.19g200800, Glyma.05g048800, Glyma.19g195500, Glyma.19g187900, and Glyma.06g134200, encoding for transcription factor NF-Y alpha-related, cleavage and polyadenylation speci city factor, ubiquitin, PHD-nger, and serine/threonine-protein kinase, respectively (Table 6).
A total of 24 candidate genes were found for seed weight. Of which, 19 had functional annotations and one has an uncharacterized function (Table 7). GWAS for seed weight revealed a high probability of QTL(s) on chromosome 10; however, most of the SNPs found on the chromosome 10 were not mapped in the vicinity of an annotated gene (Table 7). Functional annotations related to the candidate genes had diverse functions. The candidate genes with de ned functional annotations and associated with the most signi cant SNPs were Glyma.09g196700, LOC100801248, Glyma.20g181100, Glyma.02g161100, and Glyma.18g251800, which encoded for ring nger domain-containing, protein TIC 56 chloroplasticlike, LRR receptor-like protein kinase, camp-response element binding protein-related, and putative serine/threonine protein kinase, respectively (Table 7).
A total of 15 annotated genes were found near the signi cant SNPs associated with the combined data for yield over three years. Of which, 14 had functional annotations. Similarly to what was found for seed weight, only one candidate gene was identi ed on chromosome 10, which harbored 10 SNPs associated with yield ( Table 7). Most of the candidate genes encoded for transpiration factors and transferases. The candidate genes with functional annotations and found close to the most signi cant SNPs for yield were Glyma.09g038300, Glyma.08g366900, Glyma.07g082800, Glyma.18g021100, and Glyma.01g093300, encoding for calmodulin-binding transcription activator (CAMTA), zinc nger protein with krab and scan domains, homo-oligomeric avin containing cys decarboxylase family, gamma-glutamyltransferase, and RNA recognition motif, respectively (Table 7).
Genomic selection (GS) in maturity, plant height, seed weight and yield GS accuracy for maturity, plant height, seed weight, and yield were estimated. Using a 5-fold crossvalidation within the 250-soybean accession panel, GS accuracy was found to be trait-dependent ( Fig. 9) (Table S4). On average, the accuracy of GS was the highest for seed weight (0.84) and was the lowest for maturity (0.47) (Table S4). Variations in GS accuracy were found across years (Fig. 9). The between-year variation was important for yield where the highest accuracy was 0.72 in 2008 and the lowest one was 0.50 in 2010 (Fig. 9). The estimation of GS accuracy was also conducted using the two subpopulations derived from structure analysis. Cross-validation using all 250 soybean accessions and samples from the Q1 group provided similar trend as shown in Fig. 8. However, GS accuracy signi cantly dropped down for maturity when samples from the Q1 cluster were used ( Fig. 9) (Table S4). Interestingly, GS averaged 0.64 with a less variation between traits and across years when samples from subpopulation 2 were used to estimate GS accuracy (Fig. 9). In addition, accuracy for predicting maturity was the best using Q2 samples ( Fig. 9) (Table S4). The ANOVA results indicated statistically signi cant interaction effect between year and population type on the accuracy of GS for maturity (F value = 5.77, p-value = 0.0001), plant height (F value = 11.47, p-value < 0.0001), seed weight (F value = 76.17, p-value < 0.0001), and yield (F value = 12.82, p-value < 0.0001) ( Table 8). The results indicated that year and population structure were important factors to take into account when incorporating GS in a breeding program. GS accuracy using datasets from different years had similar trends to that of obtained from a within-year cross validation (Figs. 9 and 10). Overall, there is lack of consistency between GS accuracy and the training year across traits and population types. This could be explained by the signi cant interaction effect of population structure and years on GS of plant height (F value = 2.86, p-value = 0.0091), seed weight (F value = 4.56, p-value = 0.0001), and yield (F value = 17.37, p-value < 0.0001) ( Table 9). Yield in 2010 was better predicted using dataset from 2009 than using yield data obtained in 2008 when all samples within the panel and individuals from Q2 were used for cross-validation, respectively ( Fig. 10) (Table S5). However, plant height in 2010 was better predicted by the dataset recorded in 2008 for all sample-and Q1 sample-cross validation. Different results were found for plant height when crossvalidation was carried out using individuals from the Q2 group (Fig. 10). Genomic prediction appeared to be year-independent but subpopulation-dependent for maturity ( Fig. 10) (Table S5). ANOVA results showed that there was no signi cant interaction effect between population structure and year in predicting maturity (F value = 1.68, p-value = 0.1229) ( Table 9). In addition, year effect on GS of maturity was not signi cant (F value = 0.88, p-value = 0.4519) ( Table 9). However, GS for maturity was heavily in uenced by population structure (F value = 1232.94, p-value < 0.0001) ( Table 9). Maturity could be better predicted than the other traits when samples from the group Q2 were used for cross-validation (Fig. 10). Despite the stability of maturity prediction between training and testing year as shown in Fig. 9, GS performed poorly for this trait when all samples and samples from the subpopulation 1 were used, respectively ( Fig. 10) (Table S5).
GS was also conducted using samples from subpopulation 1 as training set and samples from subpopulation 2 as testing set and vice versa. For cross-validation using data from the same year (Table  S6), discrepancy in GS accuracy was found when samples from one subpopulation was used to predict the ones from the other group (Fig. 11). Overall, selection accuracy was slightly higher for most traits when the GS model was trained using samples from subpopulation 1 (Fig. 11). This difference was substantial for maturity. The average GS accuracy for maturity was 0.49 and 0.20 for Q1-based training set and Q2-based population, respectively (Table S6). Unlike the two previous approaches where crossvalidation was carried out using within-subgroup samples, variation of selection accuracy between years was more pronounced when prediction was done across subpopulations (Fig. 11). In addition, the interaction effect of population structure and year on GS was signi cant for maturity (F value = 89.30, pvalue < 0.0001), plant height (F value = 14.02, p-value < 0.0001), seed weight (F value = 56.77, p-value < 0.0001), and yield (F value = 184.92, p-value < 0.0001) ( Table 10). When the training set was taken from subgroup 1 in order to predict yield of subpopulation 2, selection accuracy was 0.72 and 0.41 for 2009 and 2008, respectively (Table S6). A similar trend was found for yield when the training samples were derived from subpopulation 2 (Fig. 11).
In addition to performing a within-year GS using two subpopulations, trait prediction for other years of a genetically-distant population subset was also conducted. Overall, results indicated a similar trend to what was found using a within-year GS approach (Figs. 11 and 12). The interaction effect of population structure and year on GS was signi cant for maturity (F value = 13.18, p-value < 0.0001), plant height (F value = 14.70, p-value < 0.0001), seed weight (F value = 264.27, p-value < 0.0001), and yield (F value = 25.23, p-value < 0.0001) (Table 11). Prediction accuracy for most traits was slightly higher when the samples from Q1 were used to train the GS model (Table S7) (Table S7). When the GS model was trained using samples from Q2, the prediction accuracy for yield was 0.  (Table S7). Updating data on plant height, maturity date, and seed weight each year helped increase GS when the model was trained under Q2. The update was achieved by averaging the training set data from 2008 and 2009 to predict the testing set in 2010. GS accuracy was 0.47, 0.07, and 0.60 for plant height, maturity date, and seed weight, respectively, using the 2008 data from the Q2 samples to predict the 2010 data from the Q1 samples (Fig. 11) (Table S7). By taking the average data from 2008 and 2009 to establish the training set, plant height, maturity date, and seed weight of the Q2 samples were predicted with an accuracy of 0.52, 0.09, and 0.68, respectively (Table S7).
These results indicated that GS accuracy was impacted by multiple factors such as population structure and the variable year from which the training set was established.

Discussion
The soybean accessions used in this study were evaluated over 3 years for plant height, maturity, seed weight, and yield. These agronomic traits are key characters to improve in a soybean breeding program.
Soybean varieties with good agronomic performance could be highly competitive in the soybean seed market. Efforts towards providing soybean growers with high performing varieties have been of upmost importance in a soybean breeding program. In this report, signi cant differences in plant height, maturity, seed weight, and yield were identi ed among the genotypes. In addition, the year effect was also signi cant. The design used in this research did not allow for an estimation of a possible genotype by year interaction effect. This will be conducted in a further study. However, the ndings were in agreement with previously reported studies where the year effect can substantially impact plant height, maturity, seed weight, and yield in soybean [22][23][24][25][26]. In this report, seed weight was negatively correlated with plant height. This result was different from that of reported by Kato et al [25] who found that seed weight was positively correlated with plant height. Since the germplasm genotypes used by Kato et al [25] was different from the ones used in this investigation, the discrepancy found between the two studies suggested that agronomic trait should be investigated on a per population-type basis in a breeding program, which provides additional rationale to our study.
GWAS was conducted using a BLINK model. BLINK is one the latest and most improved statistical models to conduct GWAS [27]. Spurious associations could be reduced by incorporating population structure and Kinship effects into the GWAS model [28]. This has been established into the BLINK algorithm [29]. Therefore, we did not run the MLM (Q + K) of Tassel 5 [30] for GWAS since previous investigations had successfully demonstrated that BLINK had more statistical power in identifying true associations and reducing false positives for a large number of traits [27]. The only purpose of running population structure (Q) in this study was to assess GS accuracy between two unrelated subpopulations, which will be further discussed in this report.
A total of 20, 37, 31, and 23 SNPs were found to be signi cantly associated with maturity, seed weight, plant height, and yield, respectively, in soybean using the average data obtained over 3 years (Tables 6  and 7). Diers et al [31] also reported a similar range of SNP number associated with these traits using a nested association mapping (NAM) soybean population. A total of 19, 29, 15, and 23 SNPs were reported to be associated with maturity, seed weight, plant height, and yield, respectively [31]. Assefa et al. [32] found a total of 14, 10, and 9 SNPs associated with seed weight, plant height, and yield, respectively, upon a GWAS study involving a total of 419 soybean accessions. After SNP validation is performed, the information from this report could be used in trait introgression efforts in soybean breeding programs.
This has been successfully demonstrated by Hegstad et al. [33] who introgressed large-effect QTL regions from commercial soybean cultivars with high yield into the Corteva Agriscience soybean accessions.
Previous investigations reported a total of more than 60 loci controlling maturity in Soybase (https://www.soybase.org/). Chromosome 16 has been shown to harbor the most signi cant loci affecting soybean maturity. Of the 20 SNPs found to be associated with the average maturity over 3 years in this study, one was mapped at 31 Mb on chromosome 16. Diers et al. [31] found a total of 19 regions for maturity on chromosome 16. However, Xia et al. [34] reported a signi cant discrepancy in SNPs associated with maturity across multiple environments, which was also consistent with that of reported in this study where different SNPs were reported in different years with different environmental conditions. A cluster of signi cant SNPs were identi ed in a 232-kb region of chromosome 20. This region spanned a signi cant SNP associated with maturity that was reported by Zatybekov et al. [35]. A high-LOD SNP (LOD > 10), Chr10_45903960, associated with maturity was also found on chromosome 10. A total of 10 loci on chromosome 10 were reported to be associated with maturity in Soybase (https://www.soybase.org/). One of those loci harbored the SNP Chr10_45903960. Two of the SNPs associated with maturity, Chr10_45903960 and Chr20_41339091, were located into the E2 and E4 loci that have been reported to control maturity in soybean [10]. For the plant height-related SNPs, previous investigations showed discrepancy in terms of SNP location. Zatybekov et al. [35] reported two SNPs signi cantly associated with plant height on chromosome 9 and 20, which were mapped at 42 Mb and 8 Mb, respectively. Our results did not indicate any plant-height associated SNPs on chromosome 9, whereas the one mapped on chromosome 20 is about 7.5 Mb away from that of reported by Zatybekov et al. [35]. Of the 68 loci associated with plant height in Soybase (https://www.soybase.org/), 19 loci were found on chromosome 19. The plant height-associated SNP with the highest LOD value found in this study was mapped on chromosome 19. We found that all signi cant SNPs associated with plant height and mapped on chromosome 19 overlapped with the Dt1 locus, which has been well-described for controlling plant height in soybean [5]. These ndings suggest that the present investigation contributes towards enriching SNP markers associated with plant height, which is essential in e ciently establishing a breeding pipeline for agronomic trait improvement in soybean.
Zhang et al. [1] reported that seed weight was a complex trait controlled by a large number of loci. This statement appeared to be sound when taking into account the number of SNPs associated with maturity reported in this study. To date, more than 100 QTLs affecting seed weight have been reported (https://www.soybase.org/). A large number of these QTLs were mapped on chromosomes 2, 4, 5, 7, 17, 18, and 20 [31,35]. Our results revealed a SNP with an LOD of 22.58 at 37 Mb location on chromosome 4, indicating the likelihood of a strong QTL affecting seed weight in this region. In addition, a large cluster of SNPs were found on chromosome 10. Some of which overlapped with previously reported seed weightrelated QTLs (https://www.soybase.org/). For yield, previous reports showed that SNP markers associated with yield were scattered across the soybean genome. To date, more than 170 loci have been associated with yield in soybean (https://www.soybase.org/). Zatybekov et al. [35] mapped SNP markers associated with soybean yield on chromosomes 14, 17, and 20. Diers et al. [31] reported 23 loci affecting soybean yield on chromosome 16 alone. Two of the SNPs on chromosome 19 reported in this study were in the vicinity of a signi cant yield SNP marker identi ed by Assefa et al. [32]. Despite the lack of overlapping SNPs between traits, overlapping signi cant loci were identi ed to control two or more traits. A 7.2-Mb region of chromosome 2 and de ned by the SNP markers Chr02_12086588 and Chr02_19239630 were associated with both seed weight and plant height. A 9-Mb region of chromosome 4 harboring the SNPs Chr04_46043483, Chr04_46043518, and Chr04_36949349 was signi cantly associated with both maturity date and seed weight. A genomic DNA sequence spanning a 450-Kb region of chromosome 7 harbored the SNPs Chr07_33588669 and Chr07_7610107 and was associated with seed weight and yield, respectively. In addition, a 270-Kb region of chromosome 8 de ned by the signi cant SNP markers Chr08_47483065 and Chr08_47747059 were associated with both seed weight and yield. One of the most important loci reported in this investigation is de ned by an 11.5-Mb region of chromosome 10 containing the signi cant SNPs Chr10_18370776, Chr10_19170955, Chr10_19620114, Chr10_20805615, Chr10_24454215, Chr10_24773660, Chr10_29894008, Chr10_19477000, Chr10_24773517, Chr10_26343503, Chr10_27017034, and Chr10_29778879. These signi cant SNPs were associated with seed weight and yield. A 4-Mb region of chromosome of chromosome 19 was found to be associated with maturity, plant height, and yield in soybean, which was in agreement with a study investigated by Assefa et al. [32]. A 5.4-MB region of chromosome 20 harbored loci that were associated with maturity, plant height, and seed weight. These ndings were consistent with previously reported studies [31,35].
Our results suggested that maturity-associated candidate genes encoding for Leucine-rich repeat proteins were prevalent. Osakabe et al. [36] showed that LRR proteins acted as a key regulator for maturity in plant. In addition, Jinn et al. [37] reported that these proteins were also involved in oral organ abscission.
These previous investigations supported that the LRR domains found in this study could be good candidate genes for maturity in soybean and could be further investigated towards validation. The ndings indicated O-methyltransferase being a candidate gene for maturity in soybean. Held et al. [38] found that O-methyltransferase-related genes were highly expressed during cell maturation in maize. The candidate genes associated with plant height consisted of transcription factors, kinases, and biomolecule transporters. One of these transcription factors is NF-Y alpha-related. Zhao et al. [39] reported that this transcription factor regulated plant growth, indicating that NF-Y alpha-related could be a good candidate gene for plant height in soybean. The SNP markers associated with plant height and mapped in the Dt1 locus on chromosome 19 were in the vicinity of a gene that encodes for a tetratricopeptide repeat protein (TPR). However, the role of TPR domains in affecting plant height has been poorly investigated. The candidate genes associated with seed weight that were reported in this study had diverse functional annotations. A large number of candidate genes playing signi cant roles in seed development were hormone-signaling [40]. However, no candidate genes involved in hormone signaling pathways were identi ed. A large number of candidate genes found for yield were transcription factors and transferases.
These results were consistent with that of reported by Diers et al. [31] who reported candidate transcription factors that could affect yield in soybean. Candidate genes associated with maturity, plant height, seed weight, and yield were reported in this study. Additional investigations would be required to validate these candidate genes.
GS accuracy for maturity, plant height, seed weight, and yield was assessed. Howard and Jarquin [41] reported that GS performed better than phenotypic selection in soybean when dealing with complex trait.
In this study, we found that the trend of GS accuracy was similar for all traits when all samples and Q1 samples, respectively, were used for cross validation (Figs. 8 and 9). Interestingly, GS for plant height was lower than yield when cross-validation was conducted using data from the same year from all samples and Q1 samples, respectively. These results were different from that of reported by Ma et al. [42] who also used a one-year data to estimate GS accuracy for plant height and yield in soybean. They reported an accuracy of 0.86 and 0.47 for plant height and yield, respectively. However, when cross-validation was conducted using data from different years regardless of the subpopulation, GS was higher for plant height than yield. This could be explained by the fact that plant height has higher heritability than yield [42], thus resulting in a higher genomic prediction accuracy. In addition, inconsistency in results has been found in previous studies investigating GS for yield in soybean. Jarquín et al. [21] suggested an accuracy of 0.64 for yield, whereas Stewart-Brown et al. [43] and Duhnen et al. [44] reported an accuracy of 0.26 and 0.39, respectively. We have also highlighted the effect of population structure on the accuracy of GS.
The results indicated that GS accuracy for maturity was heavily affected by population structure. This could be explained by the fact maturity can cause a structure within a population, thus using maturity data from one subpopulation to predict the maturity data from another unrelated subpopulation would not work. We have also found that year can signi cantly affect GS, implying that updating the training model each year will be necessary for e ciently establishing a GS pipeline within a breeding program for agronomic trait improvement in soybean.

Conclusions
GWAS and GS were conducted for four agronomic traits: maturity, plant height, seed weight, and yield in an association panel consisted of 250 soybean accessions. A total of 20, 37, 31, and 23 SNPs were found to be signi cantly associated with maturity, seed weight, plant height, and yield, respectively, in this soybean panel using the average data obtained over 3 years. Glyma.10g228900, Glyma.19g200800, Glyma.09g196700, and Glyma.09g038300 were identi ed as candidate genes for maturity, plant height, seed weight, and yield, respectively. A 11.5-Mb region of chromosome 10 was associated with both seed weight and yield. The GS accuracy was trait-, year-, and population structure-dependent. The SNP markers for plant height, maturity, seed weight and yield can be used in molecular breeding to improve the four agronomic traits in soybean through MAS and GS. After validation, the candidate genes can be transferred to new soybean cultivars using the linked SNP markers through MAS. The high GS accuracy has con rmed that the four agronomic traits can be selected in soybean molecular breeding through GS.

Plant materials and phenotyping
A total of 250 soybean accessions were used in this study and they were originated collected from China, the United States, South Korea and Japan with 168 (67.2% out of 250), 76 (30.4%), 3 (1.2%), and 2 (0.8%) accessions, respectively plus with one with unknown origin (S1 Table). Soybean accessions were planted in Shijiazhuang (114°83′E, 38°03′N) in Hebei province, China with a randomized complete block design (RCBD) with three replicates during the growing season of 2008, 2009, and 2010. A total of 36 soybean seeds were sown in two rows with a row length of 2 meters and space between rows was 0.5 meter.
Phenotyping was conducted for maturity, plant height, seed weight, and yield.
Data distribution was visualized using the 'MASS' package of R v. 3.4.2 [45]. Distribution of each trait was displayed separately for each year and then combined. ANOVA for each trait was conducted using PROC MIXED of SAS® v. 9.4. The statistical model for the analysis was the following Y ij = µ + Gi + T j + ε ij with i = 1,2, … ,250 and j = 1,2,3 Y ij was the response from the i th genotype in the j th year, µ was the overall mean, G i represented the effect of the i th genotype ( xed effect), T j was the effect of the j th year ( xed effect), and ε ij was the experimental error associated with the ij th observation.
Genotyping DNA was extracted from young soybean leaves using the CTAB (hexadecyltrimethyl ammonium bromide) method [46]. DNA library was prepared using the restriction enzyme ApeKI following the GBS protocol described by Elshire et al. [47] and DNA sequencing was performed using GBS method [47,48]. The 90bp, double-end sequencing was performed on each soybean genotype using the GBS protocol by an Illumina HiSeq at the Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Beijing, China. The GBS dataset contained 3.26 M short-reads or 283.74 Mbp of sequence for each accession. The short reads were aligned to the soybean whole genome sequence (Wm82.a1.v1) (https://www.soybase.org/GlycineBlastPages/archives/Gma1.01.20140304.fasta.zip; https://www.soybase.org/GlycineBlastPages/index.php?db_select=Gma1.01) using SOAPaligner/soap2 (http://soap.genomics.org.cn/) and SOAPsnp v. 1.05 was used for SNP calling [49,50]. Approximately a half million SNPs were identi ed upon SNP calling. Minor allele frequency (MAF) threshold was set to 5%, SNP with heterozygosity more than 10% was discarded, and SNP with missing data more than 15% were also removed. Upon SNP ltering, a total of 10,259 high quality SNPs were retained and used for further analysis.

Population structure analysis
Population structure was inferred using STRUCTURE 2.3.4 through a Bayesian resampling technique [51]. Out of the 10,259 ltered SNPs, a total of randomly chosen 5,129 SNPs were used for inferring population structure (K). Using a randomly selected SNP for conducting a population structure analysis is a valid approach when the use of the complete set of SNPs is computationally heavy as described by Huang et al. [52]. Analysis was run using an admixture model along with a correlated allele frequency model, which was independent of each run [53].
A total of ten runs were performed for each estimated K. The Markov chain Monte Carlo (MCMC) length of the burn-in period was 30,000 and the number of MCMC iterations was 50,000. The identi cation of optimal K was done using STRUCTURE Harvester [54, http://taylor0.biology.ucla.edu/structureHarvester/] and based upon the equation established by Evanno et al [55]. The optimal K was that of corresponding the highest delta K value.
The inferred population structure K was used to generate the structure Q-matrix consisting of K vectors.
Each soybean accession was assigned to a Q cluster using a cut-off probability of 0.55. Any soybean accessions that could not be grouped in any of the clusters would be considered as admixture.
Population structure was visualized using STRUCTURE PLOT with the option "sort by Q" [56].

Genetic diversity analysis
The genetic diversity analysis and the establishment of the phylogenetic tree were done using MEGA 7 [57]. Statistical inference was achieved using a maximum likelihood procedure and the parameters for The Q-matrix component was imported into MEGA 7 and was used along with the genetic diversity analysis when drawing the phylogenetic tree. Consistency or discrepancy between the population structure analysis and the genetic diversity analysis could be easily captured by doing so. The shape of "node/subtree marker" and the "branch line" had the same color as the bar plots obtained from STRCUTURE PLOT for the sub-trees of each cluster (Q).

Genome-wide study analysis (GWAS)
GWAS was conducted using a Bayesian Information and Linkage Disequilibrium Iteratively Nested Keyway (BLINK) and run in R using the package 'BLINK' [27]. Signi cant SNPS were those with an LOD value greater than 3 [58].
BLINK was an improved model version of Fixed and Random Model Circulating Probability Uni cation (FarmCPU) and has been shown to be statistically powerful and e cient in identifying signi cant SNPs associated with trait of importance [27]. FarmCPU involved a xed effect model (FEM) and a random effect model (REM), which were run iteratively. FarmCPU assumed an even distribution of markers within the genome. However, this assumption was relaxed in BLINK where a Linkage Disequilibrium information was used instead [27]. The REM part was replaced by a second FEM in BLINK. The two FEM models used in BLINK were the following. QTNs, which were initially empty and with effects b 1 , b 2 , …, b k , respectively; M ij represented the j th genetic marker of the i th sample; and e i was the residual having a distribution with mean zero and a variance σ 2 e .
Candidate gene discovery A 10 kb-genomic region spanning a signi cant SNP was used for candidate gene search using the G.
max Williams 82.a2 reference in Soybase (https://www.soybase.org/). The SNPs associated with the combined data over three years for maturity date, plant height, seed weight, and yield were used for candidate gene discovery. Functional annotations pertaining to each postulated candidate gene were also investigated using Soybase (https://www.soybase.org/).

Genomic-selection and cross-validation
GS was conducted using a ridge regression best linear unbiased predictor (rrBLUP) model. rrBLUP has been shown to be effective in estimating the effects of loci controlling complex traits [59]. The rrBLUP model was y = WGβ + ε [17]. In this equation, y represented the vector phenotypic data; β denoted the marker effect with β ~ N(0,Iσ 2 β ); W represented the incidence matrix relating the genotype to the vector phenotype; G was the matrix displaying the genetic marker; and ε referred to the random error. The solution for rrBLUP was β_hat = (Z T Z + Iλ) −1 Z T y with Z = WG. The ridge parameter was de ned as λ = σ 2 e / σ 2 β , where σ 2 e was the residual variance and σ 2 β the marker effect variance. rrBLUP was carried out in R using the 'rrBLUP' package [60].
Cross-validation was conducted using 4 different approaches. The rst approach consisted of sampling accessions from and cross-validating within the 255 soybean genotypes (whole panel), the genotypes belonging to subpopulation 1 (Q1) from structure analysis (Q1 panel), and the genotypes from subpopulation 2 (Q2) from structure analysis (Q2 panel), respectively. For each de ned subgroup, the training dataset was taken from a year and the validation dataset was also extracted from the same year.
The second approach differed from the rst one in a way that the training dataset from a year was used to predict the genotype's performance in the succeeding year(s). The third strategy used samples from the Q1 panel to predict the Q2 samples' performance within the same year and vice versa. The fourth methodology consisted of training the GS model using dataset from the Q1 panel from a particular year that was used to predict the traits of the Q2 panel from another year and vice versa. Due to the relatively small sample size of each panel, a 5-fold cross-validation was carried out for GS involving the whole panel and the Q1 panel, and a 3-fold cross validation was performed for GS using the Q2 panel. Doing so allowed for an accurate estimation of the Person's correlation coe cient (GS accuracy) that was established based on a sample size (> 30) that could be statistically valid under such constraint.
ANOVA was conducted to assess the interaction effect of year and population type (whole panel, Q1 panel, and Q2 panel) on GS accuracy using PROC MIXED of SAS® v. 9.4. The statistical model for this analysis was the following. Y ijk = µ + T i + P j + + YP ij + ε ijk with i = 1,2,3, j = 1,2,3, and k = 1,2,3,…,100 Y ijk was the GS accuracy from the i th year using the j th population type at the k th replication, T i represented the effect of the i th year ( xed effect), P j was the effect of the j th population type ( xed effect), and ε ijk was the experimental error associated with the ijk th observation.

Declarations
Ethics approval and consent to participate All data and materials are not related to human and animals. All research materials of soybean germplasm accessions are obtained from Chinese Academy Of Agricultural Sciences. This research is not related to any plant specimens to be deposited as vouchers or any other association for this section.

Consent to publish
Not applicable Availability of data and materials SNP data can be found in the ENA using accession number PRJEB34546 (https://www.ebi.ac.uk/ena/data/view/PRJEB34546) and other data supporting this article have been listed in the supplementary materials

Competing interests
The authors declare that they have no competing interests.

Tables
Due to technical limitations, the tables are only available as a download in the supplemental les section.
Supplementary File Table S1. List of genotypes evaluated for maturity, plant height, seed weight, and yield over three years.
SD represents the standard deviation for each trait over three years Table S2. Mean Fst values, average distance between samples within the same subpopulation, average probability value of from each individual within each cluster, and allele frequency divergence among populations.  )) values, minor allele frequency at the SNP locus, and gene ID and functional annotation. Table S4. Genomic selection accuracy for maturity, plant height, seed weight, and yield using 100 replications and where cross-validation was performed within all samples, samples from subpopulation Q1, and samples from subpopulation Q2, respectively. Table S5. Genomic selection accuracy where cross-validation was performed within all samples, samples from subpopulation Q1, and samples from subpopulation Q2, respectively. Cross-validation within each group was conducted as following. The data from 2008 were used to predict the data from 2009 and 2010, respectively, the data from 2009 were used to predict the data from 2010, and the average data from 2008 and 2009 were used to predict the data from 2010. Table S6. Genomic selection accuracy for maturity, plant height, seed weight, and yield using samples from subpopulation 1 (Q1) as training set and samples from subpopulation 2 (Q2) as testing set and vice versa. Estimation of genomic selection accuracy was done using 100 replications. Table S7. Genomic selection for maturity, plant height, seed weight, and yield using samples from subpopulation 1 (Q1) as training set and samples from subpopulation 2 (Q2) as testing set, and vice  Genomic selection accuracy for yield, maturity, plant height, and seed weight using training/testing sets from all 225 soybean accessions (all samples), samples derived from Q1, and samples from the Q2 subpopulation, respectively. Cross-validation was conducted in a way that the data from a year was used to predict that of from the succeeding year(s).

Figure 11
Genomic selection accuracy for plant height, maturity, 100-yield, and yield using samples from Q1 as a training set and individuals from Q2 as a testing set, and vice versa. Cross-validation was done using data from the same year.

Figure 12
Genomic selection accuracy for plant height, maturity, 100-yield, and yield using samples from Q1 as a training set and individuals from Q2 as a testing set, and vice versa. Cross-validation was performed using the data from a year to predict that of from the succeeding year(s).

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.