Genetic analysis of sucrose concentration in soybean seeds using a historical soybean genomic panel

Significant QTL for sucrose concentration have been identified using a historical soybean genomic panel, which could aid in the development of food-grade soybean cultivars. Soybean (Glycine max (L.) Merr) is a crop of global importance for both human and animal consumption, which was domesticated in China more than 6000 years ago. A concern about losing genetic diversity as a result of decades of breeding has been expressed by soybean researchers. In order to develop new cultivars, it is critical for breeders to understand the genetic variability present for traits of interest in their program germplasm. Sucrose concentration is becoming an increasingly important trait for the production of soy-food products. The objective of this study was to use a genome-wide association study (GWAS) to identify putative QTL for sucrose concentration in soybean seed. A GWAS panel consisting of 266 historic and current soybean accessions was genotyped with 76 k genotype-by-sequencing (GBS) SNP data and phenotyped in four field locations in Ontario (Canada) from 2015 to 2017. Seven putative QTL were identified on chromosomes 1, 6, 8, 9, 10, 13 and 14. A key gene related to sucrose synthase (Glyma.06g182700) was found to be associated with the QTL located on chromosome 6. This information will facilitate efforts to increase the available genetic variability for sucrose concentration in soybean breeding programs and develop new and improved high-sucrose soybean cultivars suitable for the soy-food industry.


Introduction
Soybean (Glycine max (L.) Merr) is a crop of global importance for both human and animal consumption and is used largely as a source of protein meal and vegetable oil in human diets (Hartman et al. 2011;Qiu et al. 2013). Historically, soybean has been a staple in Asian cuisine, but over time it has been integrated into both European and North American food markets. Soy-derived food products including tofu, soymilk, tempeh and natto have gained popularity over the past few decades due to an increased demand of plant-based diets (Lynch et al. 2018). Soybeans provide a similar amount of protein as food from animal sources; however, they have lower amounts of saturated fat and little to no cholesterol, making them an ideal health food items, as mentioned above (Young 1991;Qiu et al. 2013). This has translated into an overall greater increase in the number of soy-food products in the global food sector.
Successful soybean food-grade cultivars are chosen based on physical traits, chemical traits and processing quality of the seed (Brar and Carter 1993;Rao et al. 2002). Sucrose concentration is a notable compositional trait that must be considered, especially because its target range differs depending on its intended end product. The range in seed sucrose concentration that is needed to meet the food-grade quality testing is between 6.0 and 8.0% (OSACC 2020). The Ontario Soybean and Canola Committee (OSACC) classifies cultivars based on seed sucrose concentration as low (< 6.4%), moderate (6.4-7.0%) and high (> 7.0%) (OSACC 2020).
Breeding for soybean seed sucrose concentration has received little attention, as the improvement of other agronomic and seed composition traits, including yield, seed oil, seed protein and disease resistance, are more important to commodity-grade soybeans. However, with shifts in demand, seed sucrose concentration is becoming an important chemical trait for breeders to focus on. Sucrose is the Communicated by Dechun Wang. 1 3 major product of photosynthesis and thus a major sugar transported throughout higher plants. It has been shown to have many important functions including acting as a major substrate for sink metabolism, a main form of translocated carbon as well as a number of regulatory and integrative functions within the plant (Farrar et al. 2000). Sucrose metabolism has been shown to be linked to the metabolism of both organic and inorganic nitrogen and may also be an important factor for the balance of resource acquisition and allocation within and between plant organs (Farrar et al. 2000). Finally, the seed sucrose content of a plant is representative of a long-term balance between supply (e.g., photosynthesis) and demand (e.g., plant growth, storage and nutrient assimilation) (Farrar et al. 2000).
Sucrose metabolism is well defined in the literature and has, therefore, been a focus for plant biochemists. Sucrose synthase (SUS) is a glycosyl transferase enzyme that is involved in sugar metabolism (Stein and Granot 2019). SUS catalyzes the reversible cleavage of sucrose into fructose and either uridine diphosphate glucose (UDP-G) or adenosine diphosphate glucose (ADP-G; Stein and Granot 2019). Both cleaved products are then available for use by the plant within many metabolic pathways including energy production, primary metabolite production and the synthesis of complex carbohydrates (Stein and Granot 2019). Previous studies have shown that the overexpression of SUS genes displayed increased growth, increased xylem area and cellwall width and increased cellulose and starch content, indicating the potential for using SUS genes as candidate genes for the improvement of agronomic and chemical traits in crop plants (Stein and Granot 2019). Xu et al. (2019) identified 100 SUS genes in a number of higher plants, including soybean, and provided the phylogenetic relationship framework between the diverged SUS gene subfamilies (I, II and III). This information provides potential insight for its utility in marker-assisted selection.
The availability of high-throughput genotyping methods (e.g., genotyping-by-sequencing [GBS]) based on nextgeneration sequencing (NGS) technology allows for using GWAS as a method for the detection of putative QTL related to soybean seed sucrose concentration. GWAS is a powerful tool for analyzing complex traits with a larger scope than traditional QTL identification strategies (Korte and Farlow 2013;Fang et al. 2017). Unlike biparental cross populations, the use of GWAS and a diversity panel captures allelic diversity and genetic background effects that may not have been uncovered otherwise (Heffner et al. 2009). The use of GWAS allows breeders to detect genomic regions associated with traits of interest and provides them with estimates regarding the size and direction of allelic effects (Abdel-Shafy et al. 2014;Contreras-Soto et al. 2017).
The objective of this study was to use GWAS to identify QTL related to sucrose concentration in soybean seed. The improved knowledge of particular QTL related to sucrose will allow for the integration of genetic variation into breeding germplasm leading toward the development of superior food-grade soybean cultivars with a focus on enhancing the global soy-food market.

Selection of germplasm
A soybean panel was formed representing breeding material, historical and elite cultivars indicative of decades of selection within the University of Guelph, Guelph Campus (174 accessions) and University of Guelph, Ridgetown Campus (96 accessions) soybean breeding programs, were used to evaluate phenotypic diversity for seed sucrose concentration as described earlier (Bruce et al. 2019(Bruce et al. , 2020 (Table S1). Four soybean cultivars, Colby, RCAT Wildcat, OAC 07-26C and DH4202, were grown in all locations.

Field trials
The soybean panel was grown at four field locations: . Trials were planted as nearest-neighbor randomized complete block designs (RCBD) with two replications. Each four-row plot was planted with 500 seeds. Plots were 5 m in length, with 40 cm row spacing. Each site was maintained using conventional management practices. Locations were harvested when every plot had reached full maturity.

Phenotypic data collection
Seed sucrose concentration, protein concentration and oil concentration were recorded for each plot. A Perten DA 7250 near-infrared reflectance (NIR) spectrometer (Perten Instruments, Hägersten, Sweden) was used to measure all traits using calibrations provided by Perten Instruments, as reported in Whiting et al. (2020). A small tray of seeds from each plot were randomly sampled and hand-screened before each measurement was taken. Perten DA 7250 NIR estimates of sucrose concentration are imperfect, but unbiased (Table S2). Manitoba Brown was removed from further analysis due to issues with NIR measurements and darkcolored seed coats.

Statistical analysis
Sucrose concentration data were examined in each environment (year × location) to identify environments with poor performance. All environments that grew the Guelph and Ridgetown panels had consistent seed trait performance; therefore, all data were kept for combined analysis. As a result of sampling errors in the Chatham 2016 environment, only a single-entry-based seed sample was used and was therefore treated as a single block.
Plot data for sucrose concentration from St. Pauls and Woodstock, Ontario locations, were spatially corrected using radial smoothing in PROC GLIMMIX in SAS 9.4 (2013) (SAS Institute, Cary, NC, USA) with the model including genotype and cov_sp as fixed effects and longitude (long) and latitude (lat; physical field positions) as random effects and an effect of cov_sp = spline (lat long). All further analyses used the spatially corrected data.
Both Chatham and Ridgetown locations were analyzed using a nearest-neighbour adjustment in Agrobase (Agronomix Software 2019) to produce adjusted plot values. The model that was employed was y ij = m + v i + t ij + e ij , where y ij is the plot value of the i th cultivar of the j th block for sucrose concentration, m is the trial mean for sucrose concentration, v i is the effect of the i th cultivar in the j th block and e ij is the residual error for the i th cultivar in the j th block, where the two adjacent plots are used in the calculation of the local trend.
Analyses of variance (ANOVA) of sucrose concentration within test locations were partitioned into fixed effects and random effects. Entry, Environment and Entry × Environment were used as the fixed effects, and Block (Environment) was used as the random effect using the PROC GLIMMIX procedure in SAS version 9.4 (2013) (SAS Institute Inc., Cary, NC, USA) for a RCBD.
Combined analyses of variance were conducted over ten environments using the PROC GLIMMIX procedure in SAS version 9.4 (2013) (SAS Institute Inc., Cary, NC, USA). Least square means (LSMEANS) were calculated using PROC GLIMMIX procedure across single locations and combined location and year analyses. Pearson's correlation coefficients were calculated using the PROC CORR procedure performed between sucrose concentration LSMEANS and protein, yield and oil LSMEANS to determine any linear correlation.

Genotypic data
DNA extraction was performed using the Qiagen DNeasy 96 Plant Kit (Toronto, Canada) following the manufacturer's protocol. The soybean panel was genotyped using the genotyping-by-sequencing (GBS) method, as described by Elshire et al. (2011) and Sonah et al. (2013) (Torkamaneh et al. 2017) and FASTQ files were demultiplexed determined through barcodes sequences. The demultiplexed reads were trimmed and mapped using the Williams82 soybean reference genome (Gmax_275_Wm82. a2.v1) (Schmutz et al. 2010;Bruce et al. 2019). Nucleotide variants were identified from the mapped reads and were removed if: (i) there were two or more alternate alleles, (ii) the read quality (QUAL) score was < 32, (iii) the mapping quality score was below 30, and (iv) the read depth was < 2 (Bruce et al. 2019). Missing data imputation and integration was performed using BEAGLE v5 (Browning and Browning 2007) previously described by Torkamaneh and Belzile (2015) and Torkamaneh et al. (2018). In total, 76,549 genome-wide SNPs were used for this study. A minor allele frequency (MAF) filter of 0.05 and heterozygous SNP filter of 0.5 were applied prior to carrying out the GWAS. No missing data were present in the dataset. A principal component analysis (PCA) and phylogenetic tree analysis were performed in TASSEL v5 (Bradbury et al. 2007).

GWAS analysis
GWAS analyses were performed with rMVP R package (Yin et al. 2020) using fixed and random model circulating probability unification (FarmCPU; Liu et al. 2016). The factored spectrally transformed linear mixed model (FaST-LMM) and efficient mixed model analysis (EMMA; Kang et al. 2008;Lippert et al. 2011) embedded in FarmCPU were implemented. A kinship matrix was calculated either using the VanRaden method (K) or the EMMA method (K*) to determine relatedness among individuals (Kang et al. 2008). The population structure was determined using fast-STRU CTU RE (Raj et al. 2014) and a principal component analysis (PCA). A model taking into account kinship and PCA (P + K*) was found to provide the best fit. The negative log(1/n) was used to establish a significance threshold (Wang et al. 2012).

Phenotypic analysis of seed sucrose concentration
The panel was evaluated for sucrose concentration in multi-environment trials during the 2015-2017 field seasons. The minimums, maximums and means for each environment are displayed in Table 1. The 2015 St. Pauls, Ontario (STPL15) environment displayed the highest average sucrose concentration with a mean and standard error of 7.37 ± 0.067. The lowest average sucrose concentration was observed in Chatham, Ontario, in 2016 with a mean seed sucrose concentration of 6.13 ± 0.062.
The panel showed a normal distribution (Shapiro-Wilk, p < 0.05) for seed sucrose concentration (Fig. 1a), with an overall mean of 6.83 ± 0.9. A combined analysis of variance was carried out for the panel displaying a significant difference among genotypes, environments and genotype*environment effects in the panel (Table S3). Broad sense heritability (H 2 ) was calculated for sucrose concentration in each environment and ranged from 0.09 in Ridgetown, Ontario, in 2016 to 0.91 in Woodstock, Ontario, in 2015 ( Table 2). The combined environment H 2 was 0.77.

Relationship between traits
The relationship between sucrose concentration and other value-added and agronomic traits (seed protein, oil and yield) was also studied in the panel (Tables 3 and 4). The Woodstock yield data for 2017 were dropped due to error. Analysis by year revealed significant negative relationships between oil and protein in both years in across all locations (Tables 3 and 4). A significant negative relationship between sucrose and protein was observed in across all years at the p < 0.001 level (Table 3). A positive relationship was observed between sucrose concentration and oil; however, this relationship was not significant. A significant positive relationship between sucrose concentration and yield was observed in 2015, 2016 and 2017 (Table 3).

Genotypic analysis of seed sucrose concentration
In total, 76,549 high-quality genome-wide GBS-SNPs were identified in this panel consisting of 266 soybean   genotypes. A PCA and phylogenetic tree clustering were carried out using 76 k SNPs ( Fig. 1b and c) that indicated genetic admixture within the panel, without any tight or distinct subsets between the Guelph and Ridgetown campus breeding programs, due to shared founder germplasm. A single best linear unbiased estimator (BLUE) for sucrose concentration and 76 k GBS-SNPs data were used as input for the GWAS. The GWAS analysis was performed using the models described above, and population structure (P) and cryptic relatedness (K*) were incorporated as covariates to reduce false positive signals. Using this approach, we identified seven QTL (− log 10 P ≥ 4.88) for seed sucrose concentration on chromosomes 1, 6, 8, 9, 10, 13 and 14 ( Fig. 2a; Table 5). Of these, qSUC.1.39, qSUC8.22 or qSUC14.26 were novel QTL, which have not been reported previously in those genomic regions on chromosomes 1, 8 and 14, respectively. One of the most intriguing features of this study is that almost all alleles related to these QTL were frequently present in this panel, as represented by the average sucrose concentration per allele ( Table 5). The largest differences in allelic contribution to sucrose concentration are associated with qSUC.1.39 (and qSUC.10.5 (Table 5). The estimated allele effects of the putative QTL indicate that they control close to 30% of the phenotypic variation within this panel (Table 5).

Candidate genes
We used the soybean public database (SoyBase 2020) and soybean reference genome (Williams 82) annotation (Wm82. a2.v1) to identify candidate genes for sucrose accumulation.
To search for candidate genes, the QTL flanking regions were set up as 100 kb on either side of the QTL peak. There  Fig. 2 Genome-wide association study for seed sucrose concentration (% seed composition) in the University of Guelph soybean population. a A Manhattan plot showing 7 putative QTL for seed sucrose concentration. b The normal Q-Q plot for seed sucrose concentration in the population were no previously identified QTL or genes in the region associated with qSUC.1.39, qSUC.8.22 or qSUC.14.26.
T h e ge n e a s s o c i a t e d w i t h q S U C . 6 . 1 5 i s Glyma.06g182700, located approximately 361 bp upstream of the SNP peak (15,682,704). It is annotated as a protein encoding carbonic anhydrase and has been shown to have an indirect role in the sucrose metabolic pathway (Neupane et al. 2020). Gene expression data provided by Severin et al. (2010) showed that Glyma.06g182700 is highly expressed in soybean root nodules.
The gene associated with qSUC.9.2 is Glyma.09g035700, with the peak (2,989,052) falling within the gene itself. Glyma.09g035700 is annotated as iron-sulfur binding protein. This gene plays a role in assisting with the oxidation-reduction reactions of electron transport in both mitochondria and chloroplasts (Balk and Pilon 2011). Gene expression data provided by Severin et al. (2010) showed that Glyma.09g035700 is highly expressed in the soybean young leaf and flower, with moderate expression in soybean pods, 7 and 10-13 days after fertilization (DAF) and soybean seeds from 14 to 35 DAF.
T h e ge n e a s s o c i a t e d w i t h q S U C . 1 0 . 5 i s Glyma.10g060200, with the peak (5,593,982) falling within the gene itself. It is annotated as glutamate-ammonia ligase. This gene is involved with the production of glutamine in the glutamine biosynthetic pathway (Miflin et al. 1981). Gene expression data provided by Severin et al. (2010) showed that Glyma.10g060200 is highly expressed in the soybean flower and moderate expression in the young leaf and pod shell.
The gene associated with qSUC.13.17 includes Glyma.13g070500, with the peak (17,093,380) falling within the gene itself. It is annotated as a protein with N-terminal bromo-adjacent homology (BAH) and transcription elongation factor S-II (TFS2N) domains (SoyBase 2020). It is shown to be involved in the miRNA pathway and involved in translational repression (TAIR 2020). Gene expression data provided by Severin et al. (2010) showed that Glyma.13g070500 is highly expressed throughout the entire plant during the plant's development.

Discussion
The majority of soybean breeding programs have focused on the improvement of commodity-related traits, including yield, seed protein and seed oil concentration. Increased demand for soy-food products has shifted the focus of some soybean breeding programs to include value-added, foodgrade traits, such as seed sucrose concentration. The identification of sucrose-related QTL would aid the development of high-sucrose soybean cultivars for the food-grade market through marker-assisted selection.
Fewer than ten other studies have sought to identify sucrose-related QTL in soybean, with each utilizing biparental RILs populations (Maughan et al. 2000;Kim et al. 2005;Skoneczka et al. 2009;Zeng et al. 2015). Only one study (Sui et al. 2020) used GWAS to identify five QTL associated with sucrose concentration. While these studies have identified over 25 sucrose-related QTL, these loci have limited applicability in marker-assisted selection schemes using different genetic backgrounds. The improvement of seed sucrose concentration became a priority for the University of Guelph soybean breeding programs, and it became necessary to identify novel sucrose-related QTL in a vast array of genetic backgrounds associated with these programs. Before the current study, GWAS had not been used to analyze genetic data for putative sucrose-related QTL.
Candidate genes for each QTL were identified. A number of the genes play roles in general biosynthetic pathways including the glutamine pathway. However, one candidate gene, Glyma.06g182700 encoding carbonic anhydrases, has been shown to play a role in the sucrose metabolic pathway specifically (Kavroulakis et al. 1999). A similar gene and function has previously been reported in pigeonpea (Cajanus cajan; Dutta et al. 2011).
Although putative QTL for seed sucrose concentration were found, they were limited in number and the amount of phenotypic variation explained. The moderate to high heritability for seed sucrose indicated that approximately ~ 77% of the phenotypic variation for seed sucrose concentration was controlled genetically, inferring that phenotypic selection has the ability to increase genetic gain in the breeding program. A potential reason for locating no more than seven QTL may be the result of the method of determining seed sucrose levels. NIR estimation of sucrose concentration though unbiased could be variable (Table S1). This indicates that we had somewhat limited power to detect QTL or map them to precise locations in the genome. Moreover, the limited range in seed sucrose concentration is the direct result of selecting for other important agronomic and quality traits. As most of the breeding focus was geared toward increased yield and seed protein or oil concentration, the available genetic diversity was lost. In order to regain variation in seed sucrose concentration, the introduction or introgression of high-sucrose parental genotypes should be explored.
Currently, there is minimal research available for QTL for seed sucrose concentration; however, with the increased interest in soy-food products, sucrose is becoming a more important trait for breeders. Sucrose is an important quality trait for soy-food products, with an acceptable range of approximately 6.0-8.0% seed sucrose concentration. More research is needed to provide soybean breeders with confirmed QTL for seed sucrose concentration to allow for their use in marker-assisted selection. After decades of selection for agronomic traits like yield and seed protein concentration, the available variation in seed sucrose in soybean breeding germplasm may have been reduced compared to ancestral cultivars. The findings of these putative QTL provide a foundation for the elucidation of useful QTL for the future development of food-grade soybean cultivars.