GWAS and genomic selection for marker-assisted development of sucrose enriched soybean cultivars

Sucrose concentration in soy-derived foods is becoming a seminal trait for the production of food-grade soybeans. However, limited scientific knowledge is reported on this increasingly important breeding objective. In this study, 473 genetically diverse soybean germplasm accessions and 8477 high-quality single nucleotide polymorphisms (SNPs) were utilized to pinpoint genomic regions associated with seed sucrose contents through a genome-wide association study (GWAS). A total of 75 significant SNPs (LOD ≥ 6.0) were identified across GLM, FarmCPU and BLINK models, including four stable and novel SNPs (Gm03_45385087_ss715586641, Gm06_10919443_ss715592728, Gm09_45335932_ss715604570 and Gm14_10470463_ss715617454). Gene mining near 20 kb flanking genomic regions of the four stable SNP markers identified 23 candidate genes with the majority of them highly expressed in soybean seeds and pod shells. A sugar transporter encoding major facilitator superfamily gene (Glyma.06G132500) showing the highest expression in pod shells was also identified. Moreover, selection accuracy, efficiency and favorable alleles of 75 significantly associated SNPs were estimated for their utilization in soybean breeding programs. Furthermore, genomic predictions with three different scenarios revealed better feasibility of GWAS-derived SNPs for selection and improvement of seed sucrose concentration. These results could facilitate plant breeders in marker-assisted breeding and genomic selection of sucrose-enriched food-grade soybean cultivars for the global soy-food industry.


Introduction
Soybean (Glycine max L.) is an important legume crop cultivated worldwide for food, feed, industrial and bioenergy purposes (Qiu et al. 2011). It is integrated as a rotational crop in cropping schemes due to nitrogen-fixing properties, which aids in more sustainable agricultural practices (Lee et al. 2015). The demand for soy-derived foods is increasing over the past few decades due to popularity of plant-based diets (Lynch et al. 2018). On a dry matter basis, soybean seeds contain approximately 40% proteins, 35% carbohydrates, 20% oils and 5% ash (Krober and Cartter 1962). Among carbohydrates, about 40% are soluble sugars which are pivotal for soybean nutritional quality and further comprised of 5% sucrose, 3% stachyose and 1.5% raffinose (Wilson 2016). Soybean seeds also contain other minor soluble sugars such as fructose, glucose and galactose (Hymowitz et al. 1972). Soluble sugars, especially sucrose, substantially affect the nutritional quality of soy-derived foods including edamame, natto, soymilk, tempeh and tofu (Taira et al. 1990; Poysa and Woodrow 2002;Hou et al. 2009). High sucrose concentration improves the sweetness and flavor of soy foods. Additionally, sucrose is involved in various metabolic and biosynthetic processes, plays a vital role in growth and development, coordinates relationship between plant sources and sinks, and responds to different abiotic stresses (Farrar et al. 2000;Ruan et al. 2010;Ruan 2012;Du et al. 2020). Thus, the paramount roles of sucrose in higher plants cannot be underestimated, especially in nutritional quality improvement of plant-based foods.
Food-grade soybean cultivars are chosen based on physical, chemical and processing quality attributes of seeds (Rao et al. 2002). To successfully meet the food-grade quality standards set by the Ontario Soybean and Canola Committee (OSACC), soybean seeds must contain 6.0-8.0% sucrose content (OSACC 2020). The OSACC classifies soybean cultivars based on seed sucrose contents as low (< 6.4%), moderate (6.4-7.0%) and high (> 7.0%). Seed sucrose concentration varies significantly among cultivars and is highly influenced by environment and agronomic management practices (Bellaloui et al. 2011;Li et al. 2012). Therefore, breeding efforts received little attention for sucrose improvement as compared with other agronomic and quality traits, as later are comparatively more valuable to commodity-grade soybeans. Nevertheless, with shifts in demand for plantbased foods, genetic breeding for sucrose-enriched seeds is becoming a major objective of soybean breeders.
Linkage mapping has been effectively utilized to identify quantitative trait loci (QTL) associated with soybean seed sucrose contents. Maughan et al. (2000) identified 17 QTLs related to sucrose contents using a recombinant inbred line (RIL) population. Kim et al. (2005Kim et al. ( ) & (2006 reported six sucrosecontributing QTLs on chromosomes 2, 11, 12, 16, and 19 using a RIL population derived from a cross between Keunolkong and Shinpaldalkong soybean cultivars. Skoneczka et al. (2009) reported a major QTL on chromosome 6 in an interval of Sat_213 and Satt_643 molecular markers using two F 2 segregation populations, which could explain 76% of the total genetic variation in sucrose concentration. Maroof and Buss (2011) detected a large effect QTL linked with sucrose and stachyose contents on chromosome 11. Wang et al. (2014) reported one QTL for sucrose and raffinose contents in a set of 170 F 2:3 derived RILs. Zeng et al. (2014) identified three sucrosecontributing QTLs from an F 2 mapping population derived from contrasting parents MFS-553 (low sucrose) and PI 243545 (high sucrose). These QTLs were mapped on chromosomes 5, 9, and 16 and could explain 46%, 10% and 8% variations in the total sugar content, respectively. Akond et al. (2015) detected 14 QTLs significantly associated with soluble sugar contents including three sucrose-related QTLs in a set of 92 F 5:7 derived population. Recently, Patil et al. (2018) identified four QTLs on chromosomes 6, 8, 16, and 20 using a RIL population derived from G. max (Williams 82) and G. soja (PI 483460B) and reported a major QTL for sucrose contents (qSuc_08) on chromosome 8. However, all these family-based QTL mappings relied upon biparental populations which offer limited resolution and precision, requiring the use of high-throughput and accurate modern approaches.
As an alternative to linkage analysis, GWAS has been used to elucidate the molecular basis of complex traits. Advances in genome sequencing techniques have accelerated the resolution and accuracy of GWAS in soybean (Li et al. 2015). Next-generation sequencing enables the GWAS to precisely uncover the genetic basis of quantitative traits and their underlying putative QTLs/SNPs as compared with traditional QTL mapping techniques (Korte and Farlow 2013;Fang et al. 2017). The application of low-cost and high throughput sequencing approaches such as genotyping-by-sequencing (GBS) with advancement in computational analysis has enabled to design molecular markers that could help to improve complex traits with high selection accuracy and efficiency. The use of diverse genotypic panels in GWAS provides comprehensive information on allelic diversity and single nucleotide polymorphisms (SNPs) in different genetic backgrounds that may not be possible with a biparental population (Heffner et al. 2009). Using GWAS, several studies on the improvement of soybean agronomic and quality traits have been reported (Hwang et al. 2014;Zhou et al. 2015;Cao et al. 2017;Khan et al. 2019;Yang et al. 2020;Lee et al. 2021). However, to date, only a few studies involving the GWAS approach have been conducted to identify genomic regions associated with sucrose content in soybean seeds (Sui et al. 2020;Ficht et al. 2022;Xu et al. 2022;Lu et al. 2022).
In this study, a GWAS was performed using 473 diverse soybean germplasm accessions and 8,477 high-quality SNPs to pinpoint stable and large-effect genomic regions associated with seed sucrose contents. Selection accuracy, efficiency and favorable alleles of significant SNPs, as well as candidate genes were also identified for marker-assisted breeding and genomic selection of sucrose enriched soybean cultivars. The improved knowledge would facilitate towards development of food-grade cultivars and enhance the global soy-derived food market.

Plant materials
The United States Department of Agriculture (USDA) soybean germplasm panel comprising 5,483 accessions along with seed sucrose concentration data were retrieved from the U.S. National Plant Germplasm System (https:// npgsw eb. ars-grin. gov/ gring lobal/ metho dacce ssion? id1= 51073 & id2= 494186, accessed on January 10, 2023). The USDA germplasm panel consisted of breeding material, elite lines, approved cultivars, obsolete cultivars and landraces. Among this panel, 473 accessions with 100 seed weights of > 23 g were filtered out and chosen for GWAS (Table S1). The selected accessions mainly originated from 11 countries including China, France, Japan, Nepal, North Korea, Russian Federation, South Korea, Sweden, Taiwan and the United States. The USDA-ARS germplasm curation staff and their collaborators conducted field experiments at various locations across several years to record the seed weight of soyabean germplasm panel. Sucrose concentration in this germplasm panel was determined by following standard sucrose extraction and quantification methods (Choung 2010;Teixeira et al. 2012) and details are provided at GRIN (https:// npgsw eb. arsgrin. gov/ gring lobal/ method? id= 494186).

Genotyping
Genomic DNA was isolated from fresh leaves of selected germplasm panel using DNeasy 96 Plant Kit (QIAGEN, Valencia, CA), followed by genotyping with Illumina Golden Gate SNP assay. The panel was genotyped with SoySNP50K iSelect Bead Chip (Song et al. 2013) consisting of 42,291 SNPs distributed across 20 soybean chromosomes were downloaded from https:// www. soyba se. org/ snps/ downl oad. php. The density and distribution map of SNPs on each soybean chromosome was drawn using a CMplot package (Yin et al. 2021) and shown in Fig.  S1. SNPs with > 10% missing & heterozygosity data, < 5% minor allele frequency and tri-allelic status were removed, keeping only 8,477 high-quality SNPs for further analyses.

Population genetics analysis
Principle component analysis (PCA) and neighborjoining phylogenetic tree analysis were conducted through the Genomic Association and Prediction Integrated Tool version 3 (GAPIT3)  implemented in R (v4.1.3; R Core Team 2022) with default settings. The possible population structure was inferred based on block principal pivoting method using the LEA package (Frichot and François 2015) in R. The simulations were performed using genotypic information of 8,477 high-quality SNPs by setting 1,000 repetitions, 100 alpha, 100 iterations and number of populations (K) from 2 to 10. The cross-entropy inference was set at a threshold of 5% and a Q-matrix threshold of > 0.5 for subsequent population structure analysis of 473 USDA soybean germplasm accessions. Moreover, a similar attempt was conducted with the whole set of SNPs to find 97 Page 4 of 18 Vol:. (1234567890) possible discordance in population structure derived from high-quality SNPs, however, a non-significant difference in population structure was observed (data not shown). Hence, only high-quality SNP set-derived population structure was considered for this study.

Association analysis
GWAS was performed to find SNPs significantly associated with seed sucrose concentration using generalized linear model (GLM) (Nelder and Wedderburn 1972), mixed linear model (MLM) (Zhang et al. 2010), fixed and random model circulating probability unification (FarmCPU) (Liu et al. 2016), and Bayesian-information and linkage-disequilibrium iteratively nested keyway (BLINK) (Huang et al. 2019) models implemented in GAPIT3 . In this study, a comparatively strict logarithmic of odds (LOD ≥ 6.0) threshold based upon the Bonferroni correction test (α = 0.05) (Bland and Altman 1995) was set to determine highly significant associations.

Favorable SNP alleles, selection accuracy and selection efficiency
Favorable alleles of GWAS-derived 75 significant SNPs were determined from genotypic and phenotypic data of 114 contrasting soybean accessions containing high and low (57 of each level) seed sucrose contents. Similarly, SNP selection accuracy and efficiency were computed by following previously reported formulas (Shi et al. 2016;Ravelombola et al. 2019). Genomic selection/prediction Genomic selection/prediction was computed based on genomic estimated breeding values (GEBVs) of 473 selected germplasm panel. GEBVs were estimated using nine genomic prediction models (BA, BB, BL, BRR, cBLUP, gBLUP, RF, rrBLUP and SVM) with five different training to testing set ratios (50/50, 67/33, 75/25, 83/17 and 90/10), and whole-set versus GWAS-derived SNPs (Ravelombola et al. 2019(Ravelombola et al. , 2021Shi et al. 2021Shi et al. , 2022. For GP estimation using different models, ridge regression best linear unbiased prediction (rrBLUP) package (Endelman 2011) was implemented in R, genomic BLUP (gBLUP) and compressed BLUP (cBLUP) implemented in GAPIT3 , Bayes A (BA), Bayes B (BB), Bayes LASSO (BL) and Bayes ridge regression (BRR) implemented in BGLR (Pérez and De Los Campos 2014), random forest (RF) implemented in randomForest R package (Breiman 2001), and support vector machine (SVM) implemented in kernlab packages (Karatzoglou et al. 2023). Prediction accuracy of tested GP models was computed using the average Pearson's correlation coefficient (r) between GEBV estimates from training and testing sets (Zhang et al. 2016). The averaged r-values were estimated from 100 repetitions of five above-mentioned training to testing sets and boxplots were generated using the ggplot2 package (Wickham 2016).

Genetic diversity among USDA soybean germplasm
The USDA soybean germplasm consisted of a hefty collection of diverse accessions collected from all over the world. Among this collection, 473 accessions with 100 seed weights > 23 g were chosen for this study (Table S1). The majority of the accessions originated from South Korea, Japan and China with marginal contributions from eight other countries. The chosen accessions included advanced breeding material, elite lines, cultivars, landraces and wild relatives. Sucrose in germplasm collection varied from 0.1% to 10.5% with a median value of 4.3% (Fig. 1A). Frequency distribution plot indicated a nearly symmetrical distribution of sucrose contents and germplasm accessions. Wider genetic diversity in Japanese, South Korean, Chinese and Russian accessions was observed as sucrose in these countries accessions ranged from 0.4% to 10.5%, 1.2-10.2%, 0.1-6.5% and 1.2-6.8% respectively (Fig. 1B), indicating Asia as edamame soybean centre of the origin and further improvement. Two accessions (PI 398768 and PI 229343) with > 10% sucrose contents were recognized from this diverse germplasm set (Table S1), which could be exploited in soybean breeding programs as donor parents for the development of higher-yielding and more nutritious cultivars. Hence, the germplasm collection was genetically diverse for conducting genome-wide association and genomic selection studies.
Population structure, PCA and phylogenetic analyses By subjecting the genotypic information of 8,477 high-quality SNPs set, population structure, PCA and phylogenetic analyses were carried out for 473 soybean accessions selected from the USDA germplasm. An optimal number of three ancestral subpapulations or clusters were inferred with K value 2-10 and Q-matrix threshold > 0.5 using LEA package in R ( Table S2). The projected population structure is shown in Fig. 2A. A similar attempt utilizing the whole 42,291 SNPs set yielded a non-significant difference in structure results when compared with high-quality SNPs set-derived population structure (data not shown). Cluster I, II and III consisted of 41 (8.67%), 167 (35.31%) and 238 (50.32%) accessions with major contributions from China, South Korea, and Japan, respectively ( Fig. 2A and Table S2).
Similarly, PCA and phylogenetic analyses clearly distributed 473 soybean accessions into three distinct clusters or groups by using the high-quality SNPs set information subjected to GAPIT 3 (Fig. 2B, C). Collectively, these results support genetic diversity results and indicate that sub-populations or clusters have stable boundaries with well-defined genetic backgrounds. GWAS for sucrose contents GWAS was conducted to pinpoint high-quality SNPs significantly associated with sucrose contents in USDA soybean germplasm. GLM, MLM, FarmCPU and BLINK models were tested by GAPIT3. Combined and individual Manhattan plots identified a total of 75 SNPs significantly associated (LOD ≥ 6.0) with sucrose contents in seeds ( Fig. 3A and S2A-S5A). The number of significant SNPs varied from 6 to 63 among GLM, FarmCPU and BLINK, without single SNP contribution by MLM, and unevenly distributed on 13 out of 20 soybean chromosomes (Table S3). The observed LOD distributions [− log10(p)] in QQ-plots also showed deviations from expected LOD distributions ( Fig. 3B and S2B-S5B) confirming a significant association of identified SNPs with sucrose contents in USDA germplasm. Moreover, this study also identified a single overlapping SNP (Gm03_45385087_ ss715586641) with LOD > 6.0 across GLM, Farm-CPU and BLINK models (Fig. 3A), indicating the presence of a highly stable large effect quantitative trait loci (QTL) significantly associated with sucrose contents on chromosome 3. Furthermore, three other overlapping significant SNPs were also pinpointed by GLM and FarmCPU (Gm06_10919443_ss715592728 and Gm09_45335932_ss715604570) or GLM and BLINK (Gm14_10470463_ss715617454) models, revealing preponderance of stable and large effect QTLs for sucrose contents on chromosomes 6, 9 and 14 (Table 1).
GLM identified 63 SNPs on chromosomes 2, 3,4,6,8,9,11,13,14 and 18 (Fig . S2A) with LOD and minor allele frequency (MAF) ranging from 6.0 to 9.49 and 0.084-0.490, respectively (Table S3). Whereas MLM was unable to find even a single SNP with LOD ≥ 6.0, however, few SNPs were associated with sucrose contents at a lower LOD value (≥ 4.0) ( Fig. S3A). FarmCPU and BLINK identified six SNPs each on chromosomes 3, 5,6,8,9,11,13,14 and 15 (Figs. S4A and S5A). The LOD and MAF of FarmCPU-derived SNPs ranged from 6.38 to 9.69 and 0.064-0.425, respectively. Likewise, BLINKderived SNPs had LOD and MAF in the range of 6.28-7.53 and 0.022-0.425, respectively (Table S3). In this study, identification of at least four stable and large-effect SNPs/QTLs significantly associated with sucrose contents in seeds (Table 1) could provide a basis for marker-assisted development of high sucrose-containing soybean genotypes. Candidate genes associated with sucrose contents Candidate genes for sucrose contents were mined from 20 kb flanking genomic regions of four stable and large-effect SNPs through the Phytozome database. A total of 23 protein-encoding candidate genes were identified whose start or end positions lay within 20 kb genomic regions (Table 2). Among these, 9 genes code for hypothetical proteins with unknown functions, whereas 3 genes code for P-loop containing nucleoside triphosphate hydrolases. Identified genes from chromosome 3 code for cytochrome B5 isoform E, auxin-responsive protein, Rho GTPase activating protein with PAK-box and DHHC-type zinc finger family proteins with moderate to highest expressions in soybean seeds and pod shells, along with other vegetative tissues. Similarly, candidate genes positioned on chromosome 6 code for sugar transporter/major facilitator superfamily protein, rhodanese/cell cycle control phosphatase superfamily protein and P-loop containing nucleoside triphosphate hydrolase superfamily proteins with moderate to highest expressions in seeds and pod shells. Whereas candidate genes from chromosome 9 code for purple acid phosphatase 10, chromatin remodeling 31, NAD(P)-linked oxidoreductase superfamily protein, prefoldin 6 and RNI-like superfamily proteins with only three showing moderate to highest expressions in reproductive, as well as, in vegetative tissues. However, both of the genes positioned on chromosome 14 code for hypothetical proteins and only one is highly expressed in root tissues. Putative functions of all these candidate genes included energy production for conformational changes, transmembrane transport of materials and regulation of proteins localization, stability and activity. Interestingly, a sugar transporter encoding major facilitator superfamily gene (Glyma.06G132500) was also identified in this study which showed highest expressions in soybean pod shells at 10-17 days after fertilization (DAF) (Severin et al. 2010). None of the four stable SNPs was positioned within the genic region of any candidate gene, as all four SNPs were positioned at < 5-25 kb distance from start or end position of all candidate genes (Table 2). Further detailed studies are required to explore the putative roles of identified candidate genes in sucrose biosynthesis, transport, and storage.

Genomic prediction of sucrose contents
Genomic prediction using different training to testing set ratios and GP models GP was performed by subjecting five different ratios of training/testing sets (50/50, 67/33, 75/25, 83/17 and 90/10), nine GP models (BA, BB, BL, BRR, cBLUP, gBLUP, RF, rrBLUP and SVM) and the whole SNPs set, making a total of 45 combinations. The averaged Pearson correlation coefficients (r Ȳ100 ) along with standard errors of 100 individuals run between GEBVs and sucrose contents for 45 combinations are displayed in Fig. 4. Five training/ testing sets had similar but not identical averaged r values across all nine GP models, except cBLUP with slightly lower r values. Increasing the training/ testing ratio marginally improved selection accuracy across all tested GP models, however, the SE was also enlarged. The averaged r value across all tested models was 0.49 and ranged from 0.34 to 0.52 (Table S5). Likewise, 90/10 ratio set had the highest averaged r value (0.51) and varied from 0.36 to 0.55. In general, four GP models (BB, BRR, gBLUP and rrBLUP) had the highest selection accuracy values over all other models with cBLUP having the lowest accuracy values (Fig. 4). Overall, these results indicate that genomic selection in soybean for sucrose contents can be carried out using any training/testing set and GP model (except for cBLUP).
Genomic prediction using whole set and GWAS-derived SNPs GP was also completed between the whole set and GWAS set SNPs to explore selection efficiency of identified SNPs significantly associated with sucrose contents. Significantly higher prediction accuracy values were observed for GWAS set SNPs across all GP models as compared with whole set SNPs (Fig. 5). The r-values of GWAS-derived SNPs ranged from 0.44 to 0.63 with a mean value of 0.60, which are 16-33% higher than r-values of whole set SNPs. On average, prediction accuracy was improved by 22% with GWAS-set SNPs (Table 3). Among GP models, the highest increase in prediction accuracy using GWAS set SNPs was observed with cBLUP (33%), followed by BL (25%) and gBLUP (24%), whereas the lowest increase was found in rrBLUP (16%) and RF (17%). However, despite the highest increase in prediction accuracy with cBLUP, its r-value was significantly lower than r-values of all other GP models (Fig. 5), indicating that this model is less efficient for genomic selection of sucrose contents as compared with all other models. Collectively, these results highlight that the prediction accuracy of GWAS-derived SNPs set is high and these SNP markers could be utilized for genomic selection/prediction of high sucrose contents in soybean seeds.

Discussion
Wider genetic diversity in soybean for improvement of seed sucrose concentration Sucrose is desirable for the better taste and flavor of soy-derived foods. High sucrose concentration in soybean seeds is becoming an important breeding objective for development of food-grade soybean cultivars. To successfully meet the food-grade quality standards, seed sucrose concentration should be 6.0-8.0% (OSACC 2020). During recent years, significant attention have been given to characterising the soybean germplasm globally for exploring the range of available genetic diversity in seed sucrose concentration. For example, Sui et al. (2020)  locations in China and reported 3.33-9.94% seed sucrose contents. Likewise, Ficht et al. (2022) characterized a panel of 266 historic and current soybean accessions across four field locations in Ontario (Canada) and described a wider genetic variation of 1.48-9.31% in seed sucrose concentration. Lu et al. (2022) assessed 278 diverse soybean accessions across three different locations and reported 5.52-16.46% total soluble sugar contents. Similarly, Xu et al. (2022) reported 0.58-13.99% total soluble sugars in 264 soybean accessions evaluated under two different environments. In this study, we selected 473 USDA soybean germplasm accessions in which seed sucrose contents showed a near-symmetrical distribution with a wider genetic variation of 0.1-10.5% (Fig. 1, Table S1). Four Asian countries including Japan, South Korea, China and Russia harbored relatively diverse accessions, as large variations were observed in seed sucrose contents of soybean accessions originating from these countries. Among all accessions, at least 83 contained 6.0-10.5% sucrose concentration in their seeds, indicating these as "food-grade soybean" accessions (Table S1). Moreover, at least two accessions with > 10% sucrose contents were also identified which could be exploited as donor parents for high sucrose contents in soybean breeding programs. Overall, the availability of this wider genetic diversity provides a rich resource for further improvement of seed sucrose concentration along with desirable agronomic and seed composition traits.
Stable SNPs for screening of soybean genotypes Molecular markers are an essential prerequisite for accurate and robust screening of genetically diverse germplasm sets. Linkage analysis and GWAS are the most commonly used methods for molecular marker identification. To date, 42 QTLs have been reported for soybean seed sucrose contents through linkage analysis of bi-parental mapping populations (https:// www. soyba se. org/ search/ qtlli st_ by_ symbol. php, accessed on February 13, 2023). However, only a few studies involving GWAS have been reported in recent years (Sui et al. 2020;Ficht et al. 2022;Xu et al. 2022;Lu et al. 2022). This study also followed GWAS approach to pinpoint important genomic regions significantly associated with seed sucrose contents. After subjecting the genetic and biochemical variation information of 8,477 high-quality SNPs and 473 USDA soybean germplasm accessions respectively, 75 SNPs were significantly (LOD ≥ 6.0) associated with seed sucrose contents (Fig. 3A). Among these, four stable SNP markers were also identified which showed overlapping between two or three GWAS tested models (Table 1). However, none of these stable SNPs overlapped with previously reported GWAS-derived SNPs indicating these as novel and stable for screening of diverse soybean germplasm sets. Possible reasons for their non-overlapping with already identified SNPs include a comparatively higher LOD threshold and differences in the number of highquality SNP sets, genetic background of germplasm sets, GWAS-tested models and softwares/packages used for GWAS. Overall, on basis of a combination of highest LOD threshold and overlapping between GWAS-tested models, it could be expected that four candidate SNPs identified in this study are comparatively more stable and efficient than previously reported GWAS-derived SNPs for screening of soybean germplasm accessions containing genetic variation in their seed sucrose contents.
Candidate genes regulating sucrose biosynthesis, transport and storage GWAS has become a mainstream approach for the identification of trait-linked genes due to involvement of relatively short linkage disequilibrium fragments (Li et al. 2015). At present, several hundreds of genes have been identified to participate in the regulation of soybean sucrose concentration (Sui et al. 2020;Lu et al. 2022). In this study, 23 candidate genes were identified within 20 kb flanking genomic regions of four stable SNPs (Table 2). Among these, at least 15 candidate genes were moderate to highly expressed in seeds and pod shells, along with other vegetative tissues (Severin et al. 2010). Interestingly, a sugar transporter encoding major facilitator superfamily gene (Glyma.06G132500) showing highest expression in pod shells at 10-17 DAF was also identified. Positioning of these candidate genes near four stable SNP peaks and their expression patterns in seeds and pods mimic their putative roles in regulating soybean seed sucrose concentration. Until now, none of these candidate genes has been functionally characterized. Their detailed characterization would be intriguing, however, beyond the scope of the current study, and could open new avenues to breed sucrose-enriched soybean. Therefore, these genes are worth exploring in future studies for deciphering their possible roles in sucrose biosynthesis, transport and storage within soybean plants.

Marker-assisted development of sucrose enriched soybean cultivars
Marker-assisted breeding has proven greater advantages over conventional breeding for the improvement of plant quantitative traits. Among various molecular markers, kompetitive allele specific PCR (KASP) marker assays offer high-throughput accuracy, efficiency and low cost for the detection of SNPs in large and diverse germplasm sets (He et al. 2014). KASP markers have several applications in marker-assisted selection and breeding for the improvement of crop yield, resistance/tolerance and quality traits (Patil et al. 2016(Patil et al. , 2017Zhao et al. 2019;Liu et al. 2020;Wilkes et al. 2023 (Table S4). High selection accuracy (> 66%) and efficiency (> 40%) were computed for four stable SNPs, except for one SNP which showed comparatively moderate efficiency (24.5%). Moreover, their favorable alleles were also identified using contrasting soybean genotypes with high and low seed sucrose contents. However, development and validation of KASP markers assay are required, which is beyond the scope of this study, for practical utilization of identified SNP markers and their favorable alleles in soybean breeding programs. In future, additional studies involving development and validation of KASP markers could reveal their practicability in marker-assisted breeding of sucrose enriched soybean cultivars.

Genomic selection/prediction for higher genetic gains
Genomic prediction is one of the most important parameters for measuring the performance of genomic selection. GP accuracy is dependent upon several factors including objective trait with its heritability, markers density and their association with the trait, population size, level of linkage disequilibrium, genomic selection models and relationship between training to testing populations . In this study, GP accuracy was tested using three scenarios: (1) different training to testing set, (2) different GP models, and (3) using whole-set versus GWAS-derived SNPs. The five training/testing sets (50/50, 67/33, 75/25, 83/17 and 90/10) showed similar but non-identical averaged r-values (range 0.34-0.52) (Fig. 4). Marginal improvements in GP accuracy (0.03-0.07) were observed when training/testing set ratios were increased (Table S5). The highest accuracy was observed in 90/10 ratio set, however, variance also enlarged. Moreover, GP accuracy testing with nine GP models also revealed similar results, although slightly lower accuracy values were observed with cBLUP model (Fig. 4). Comparable accuracy trends have been reported in other recent studies (Keller et al. 2020;Ravelombola et al. 2021;Shi et al. 2021Shi et al. , 2022. Similarly, testing with GWAS-derived SNPs showed significantly higher accuracy values across all GP models than whole-set SNPs (Fig. 5). On average, a 22% increase in prediction accuracy was observed which ranged from 16% to 33% (Table 3). GWAS-derived SNPs have also shown higher prediction accuracy values than wholeset and randomly selected SNPs for seed size and amino acid contents in soybean (Zhang et al. 2016;Qin et al. 2019). Collectively, our results highlighted that GWAS-derived SNPs are more efficient and any training/testing set ratio of ≥ 50% and tested GP models, except cBLUP, could be used for genomic selection/prediction of sucrose enriched soybean seeds. However, genomic predictions could be biased when GWAS-derived SNPs are used to predict genomicsassisted breeding values of the same GWAS panel or probably lower prediction accuracy may be observed in other panels with different individuals (Shi et al. 2022). In conclusion, genomic selection/prediction along with GWAS-derived SNPs and MAS would be beneficial for marker-assisted breeding of high sucrose-containing soybean cultivars and/ or other quantitative traits of important plant species.

Conclusion
In this study, 473 USDA soybean germplasm accessions were evaluated for estimation of genetic diversity in seed sucrose contents. Wider genetic diversity existed in accessions originating from four Asian countries including China, Japan, Russia and South Korea. Two accessions with > 10% seed sucrose contents (PI 398768 and PI 229343) were identified which could be utilized as donor parents for the development of food-grade soybean cultivars. Population structure, PCA and phylogenetic analyses classified the 473 accessions into three distinct groups or clusters having stable boundaries with well-defined genetic backgrounds. GWAS with 8,477 high-quality SNPs pinpointed four stable and largeeffect SNPs/QTLs (Gm03_45385087_ss715586641, Gm06_10919443_ss715592728, Gm09_45335932_ ss715604570 and Gm14_10470463_ss715617454) significantly associated (LOD ≥ 6.0) with seed sucrose contents. A total of 23 candidate genes including a sugar transporter (Glyma.06G132500) were identified from 20 kb flanking genomic regions of four stable SNPs. Selection accuracy and efficiency, as well as favorable alleles, of GWAS-derived SNPs were also estimated. Moreover, genomic predictions were tested using three different scenarios and results revealed that GWAS-derived SNPs are more valuable for the improvement of soybean seed sucrose contents through marker-assisted and genomic selections.