Phenotypic Analyses of Protein and Other Value-added Food-grade Traits
The RIL populations were evaluated for seed weight, yield, and protein and sucrose concentrations in multi-environment trials during the 2015 and 2016 field seasons (Fig. 1; Supplementary Tables S1-S4). The parental cultivars differed in seed protein concentration in both populations. In POPn_1, ‘AC X790P’ had an average protein concentration of 48.08% (± 0.19%, standard error) across the five testing environments, while ‘S18-R6’ had an average of 40.93% (± 0.19%). In POPn_2, ‘AC X790P’ had an average protein concentration of 48.24% (± 0.21%) across the five testing environments, while ‘S23-T5’ had an average of 42.60% (± 0.21%).
Differences in protein concentration among the RILs in each population were significant in the individual environments and in the combined multi-environment data (Fig. 1; Supplementary Table S1). In POPn_1, seed protein concentration varied from 41.53% to 45.27%, with an average protein concentration of 43.31% (± 0.03%). In POPn_2, seed protein concentration varied from 41.93% to 47.46%, with an average protein concentration of 44.60% (± 0.03%) (Fig. 1; Supplementary Table S1). Transgressive segregation was observed in some individual environments but not in the combined multi-environment data (Supplementary Table S1). The normally distributed entry LSMEAN estimates (Fig. 2) are consistent with protein concentration being controlled by multiple genes.
The parental cultivars also differed in seed yield, seed weight, and seed sucrose concentration, and considerable variation was noted within the combined multi-environment data for both populations (Fig. 1). In POPn_1, entry seed weight estimates (grams per 100 seeds) varied from 18.08 grams to 23.88 grams, with an average seed weight of 21.18 grams (± 0.055 grams). Seed yield varied from 2.55 tonnes ha⁻¹ to 4.49 tonnes ha⁻¹, with an average seed yield of 3.57 tonnes ha⁻¹ (± 0.025 tonnes ha⁻¹), and seed sucrose concentration varied from 5.44% to 6.82%, with an average sucrose concentration of 6.06% (± 0.016%; Supplementary Tables S2-S4). Similar variability was noted in POPn_2 (Fig. 1). Seed weight varied from 17.67 grams to 22.95 grams, with an average seed weight of 20.34 grams (± 0.057 grams). Seed yield varied from 2.52 tonnes ha⁻¹ to 4.40 tonnes ha⁻¹, with an average seed yield of 3.34 tonnes ha⁻¹ (± 0.024 tonnes ha⁻¹), and seed sucrose concentration varied from 4.95% to 6.75%, with an average sucrose concentration of 5.84% (± 0.014%). Transgressive segregation was noted for seed yield and seed sucrose concentration in both populations. While some RILs exhibited transgressive segregation for seed weight in individual environments, this was not observed in the combined multi-environment data (Supplementary Tables S2-S4).
Our previous study revealed significant (p < 0.01) genotype, environment, and genotype × environment effects for protein concentration and yield in these populations, which indicates the important role of genetic factors in the performance of these target traits. High heritability was noted for protein concentration and 100-seed weight (H² = 0.93-0.95 and 0.87-0.89, respectively; Supplementary Table S5). Moderate heritability was observed for sucrose concentration (H² = 0.70-0.81; Supplementary Table S5), and low heritability was observed for seed yield (H² = 0.22-0.36; Supplementary Table S5).
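To make the heritability scale concrete, the sketch below computes broad-sense heritability on an entry-mean basis from variance components. The formula is a common convention for multi-environment trials and is an assumption here, since the estimation method is not restated in this section. The function name, the variance components, and the number of replicates are hypothetical; only the five environments come from the text.

```python
def entry_mean_heritability(var_g, var_ge, var_error, n_env, n_rep):
    """Broad-sense heritability on an entry-mean basis (assumed formula).

    H2 = var_g / (var_g + var_ge / n_env + var_error / (n_env * n_rep))
    """
    if n_env <= 0 or n_rep <= 0:
        raise ValueError("n_env and n_rep must be positive")
    # Phenotypic variance of an entry mean across environments and replicates
    var_phenotypic = var_g + var_ge / n_env + var_error / (n_env * n_rep)
    if var_phenotypic <= 0:
        raise ValueError("phenotypic variance must be positive")
    return var_g / var_phenotypic


# Hypothetical variance components, not taken from the study
h2 = entry_mean_heritability(var_g=1.0, var_ge=0.2, var_error=0.6, n_env=5, n_rep=2)
print(round(h2, 3))  # 0.909
```

Under this convention, a large genotype × environment or residual variance relative to the genotypic variance lowers H², which is one way a trait such as seed yield can show low heritability despite significant genotypic effects.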
Relationships between Traits
Pearson’s correlation coefficients were used to determine the relationships between seed protein concentration and sucrose concentration, seed weight, and yield in individual environments as well as in the combined multi-environment data. Based on the combined multi-environment data, large, significant (α = 0.05) negative correlations were observed between seed protein and sucrose concentration in both populations (POPn_1: r = -0.47; POPn_2: r = -0.70; Fig. 2). In POPn_1, seed protein concentration and seed weight were positively correlated (r = 0.53), and seed weight and sucrose concentration were negatively correlated (r = -0.29). Interestingly, no significant relationships were noted between seed protein concentration and seed yield in either population (POPn_1: r = 0.09; POPn_2: r = -0.06) (Fig. 1; Fig. 2). The linear relationships among the target agronomic and seed quality traits in individual environments are available in Supplementary Table S6.
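The correlation analysis described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library, not the software used in the study. The trait values are hypothetical, and the function names are ours. The t statistic has n - 2 degrees of freedom; obtaining a p-value from it requires a Student's t distribution, which is not included in the standard library.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length >= 3")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom (requires |r| < 1)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


# Hypothetical entry means (%), not taken from the study
protein = [43.1, 44.0, 42.5, 45.2, 43.8]
sucrose = [6.2, 5.9, 6.4, 5.5, 6.0]
r = pearson_r(protein, sucrose)
print(round(r, 2), round(t_statistic(r, len(protein)), 2))
```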
SNP Mapping of the Soybean Genome
Linkage maps were constructed from polymorphic SNP markers in each population. In POPn_1, a linkage map was created using 807 polymorphic SNP markers and divided into 39 linkage groups. A linkage map consisting of 1,406 SNP markers on 40 linkage groups was created for POPn_2. All 20 chromosomes in the soybean genome were represented, with most chromosomes consisting of two or more linkage groups. The linkage maps were 2,385 and 2,690 cM in length for POPn_1 and POPn_2, respectively. The number of linkage groups, which exceeded the number of chromosomes, was attributed to large chromosomal regions lacking polymorphic markers between the parental genotypes, as elite Canadian soybean cultivars may share similar pedigrees.
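As a rough summary of marker coverage, the map statistics reported above can be converted into average marker density. The sketch below uses only the marker counts, linkage-group counts, and map lengths given in the text. The interval count (markers minus linkage groups) assumes every marker occupies a distinct map position, which is an assumption and may not hold if markers co-segregate.

```python
def map_summary(n_markers, n_groups, length_cm):
    """Average marker density for a linkage map.

    Assumes each marker has a distinct position, so a linkage group with
    k markers contributes k - 1 intervals.
    """
    n_intervals = n_markers - n_groups
    if n_markers <= 0 or n_intervals <= 0:
        raise ValueError("need more markers than linkage groups")
    return {
        "cm_per_marker": length_cm / n_markers,
        "mean_interval_cm": length_cm / n_intervals,
    }


# (markers, linkage groups, map length in cM) as reported in the text
maps = {"POPn_1": (807, 39, 2385.0), "POPn_2": (1406, 40, 2690.0)}
for name, (n_markers, n_groups, length_cm) in maps.items():
    stats = map_summary(n_markers, n_groups, length_cm)
    print(name, round(stats["cm_per_marker"], 2), round(stats["mean_interval_cm"], 2))
```

On these figures, POPn_1 averages roughly 3 cM per marker and POPn_2 roughly 2 cM per marker, although the averages conceal the large marker-poor regions noted above.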
QTL Associated with Seed Protein Concentration
Using the combined multi-environment data, fourteen large-effect QTL associated with seed protein concentration were identified on Chromosomes 1, 2, 4, 5, 6, 8, 12, 13, 15 and 18. All the QTL were associated with protein in at least four individual environments. These fourteen QTL explained between 10.4% and 21.9% of the observed phenotypic variation in seed protein concentration measured from the combined multi-environment data (Table 1). Six of these QTL – qPro_Gm01-2, qPro_Gm04-3, qPro_Gm06-1, qPro_Gm06-3, qPro_Gm12-3, and qPro_Gm12-4 – carried the favorable alleles from ‘S18-R6’ or ‘S23-T5’, while the remaining eight QTL – qPro_Gm02-3, qPro_Gm04-4, qPro_Gm05-2, qPro_Gm06-6, qPro_Gm08-2, qPro_Gm13-4, qPro_Gm15-3, and qPro_Gm18-3 – carried the favorable alleles from ‘AC X790P’. The presence of favorable protein-related QTL alleles in different genetic backgrounds suggests that it may be possible to stack these alleles to develop superior high-protein progeny.
Nine putative QTL – qPro_Gm01-2 (R² = 10.4%), qPro_Gm04-4 (R² = 13.7%), qPro_Gm05-2 (R² = 14.2%), qPro_Gm06-1 (R² = 21.9%), qPro_Gm06-3 (R² = 12.6%), qPro_Gm08-2 (R² = 12.3%), qPro_Gm12-3 (R² = 11.6%), qPro_Gm12-4 (R² = 12%), and qPro_Gm13-4 (R² = 11.6%) – identified in this study were previously unreported (Table 1) [26]. Four of these QTL were identified in both mapping populations (Table 1). The five QTL associated with seed protein concentration that co-localized with previously identified protein-related QTL on SoyBase are listed in Table 1 and Supplementary Table S7.
QTL Associated with Additional Value-Added Traits
Genomic regions harboring putative large-effect QTL associated with seed protein concentration were evaluated for their associations with seed yield, sucrose concentration, and seed weight using composite interval mapping analysis with the multiple QTL mapping (MQM) algorithm (Table 2; Supplementary Table S8). Of the fourteen protein-related QTL, eight were co-localized with QTL associated with other traits. Three protein-related QTL – qPro_Gm01-2, qPro_Gm02-3, and qPro_Gm12-4 – were co-localized with QTL associated with seed sucrose concentration (Table 2). The favorable alleles were inherited from opposing parental sources for each of these genomic regions, which supports the significant negative relationship observed between seed protein and sucrose concentration in this study (Table 2; Fig. 3). The remaining five protein-related QTL were associated with seed weight, with positive associations noted for three of these regions (Table 2; Fig. 3). Favorable alleles were donated by each parental cultivar for all traits of interest. Protein-related QTL were not co-localized with significant regions for seed yield, consistent with the non-significant relationship between seed protein concentration and seed yield in both populations. On SoyBase, seven of our protein-related QTL co-localized with previously identified QTL for seed weight (nine QTL), seed oil concentration (five QTL) and seed yield (two QTL) (Supplementary Table S7).
Candidate Gene Mining within Protein QTL Regions
For further validation of the QTL associated with seed protein concentration, a list of candidate genes was compiled using the Glyma 2.0 Assembly of Williams 82 on SoyBase (Wm82.a2.v1) based on their functional annotations. The number of genes in each QTL flanking region varied from four to seventy-four. In the flanking region corresponding to qPro_Gm13-4 (spanning 26 kb), five genes were identified. These genes include Glyma.13G167800 and Glyma.13G167900, which are located 6 and 9 kb downstream of the SNP peak (28246299) and are annotated as a ribosomal protein and a ribosome biogenesis regulatory protein, respectively (Table 3). These genes have an indirect role in protein synthesis. Gene expression data provided by Severin et al. indicated that Glyma.13G167800 is expressed in the seed from 10 to 21 days after flowering (DAF). Glyma.13G167900 is also expressed in the seed, albeit at a lower level than Glyma.13G167800. Two candidate genes, Glyma.06G004500 and Glyma.06G001800, underlying qPro_Gm06-1 were identified. These genes, located 74 kb upstream and 148 kb downstream of the QTL peak, respectively, encode transmembrane amino acid transporter proteins and ribosomal family proteins (Table 3). Previous transcriptomic analyses noted increased expression of Glyma.06G004500 in the seed at 14 to 17, and 21 DAF.
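The positional screening described above, in which genes are listed by their distance from a QTL peak, can be illustrated with a short sketch. This is not the procedure used on SoyBase. The gene identifiers, gene coordinates, and the symmetric 13 kb window are hypothetical placeholders; only the peak position (28246299) is taken from the text. Offsets are signed by genomic coordinate, so a positive value means the gene lies at a higher coordinate than the peak.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    start: int  # bp, inclusive
    end: int    # bp, inclusive


def genes_near_peak(genes, peak_bp, window_bp):
    """Return (gene_id, signed offset in bp) for genes overlapping peak +/- window.

    The offset is 0 when the peak falls inside the gene; otherwise it is the
    distance from the peak to the nearest gene boundary.
    """
    hits = []
    for gene in genes:
        # Skip genes entirely outside the window
        if gene.end < peak_bp - window_bp or gene.start > peak_bp + window_bp:
            continue
        if gene.start <= peak_bp <= gene.end:
            offset = 0
        elif gene.start > peak_bp:
            offset = gene.start - peak_bp
        else:
            offset = gene.end - peak_bp  # negative: gene lies before the peak
        hits.append((gene.gene_id, offset))
    return sorted(hits, key=lambda hit: abs(hit[1]))


# Hypothetical genes and coordinates, for illustration only
genes = [
    Gene("gene_A", 28252299, 28254000),
    Gene("gene_B", 28255299, 28257000),
    Gene("gene_C", 28100000, 28101000),
]
print(genes_near_peak(genes, peak_bp=28246299, window_bp=13000))
# [('gene_A', 6000), ('gene_B', 9000)]
```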
Glyma.04G212500 and Glyma.04G214500 were identified within the qPro_Gm04-4 interval. These genes are associated with the cupin superfamily and the ribosomal protein family, respectively (Table 3). The cupin superfamily includes seed storage proteins, while ribosomal protein family genes are associated with mRNA translation. In addition, the candidate gene Glyma.04G212500 is located exactly at the SNP peak position, which supports a role for cupin in seed protein concentration. Glyma.06G113700, Glyma.06G116400, and Glyma.06G119700 were located in the qPro_Gm06-3 region (Table 3). Glyma.06G113700 encodes a potential structural constituent of the 40S ribosomal protein. Glyma.06G116400 and Glyma.06G119700 were associated with a transmembrane amino acid transporter protein and an intracellular transport protein, respectively (Table 3).
Three candidate genes, Glyma.15G129800, Glyma.15G130000, and Glyma.15G134800, which are annotated as structural constituents of the ribosome, were identified in the qPro_Gm15-3 region (Table 3). Moreover, Glyma.06G225600 and Glyma.06G225700, which were annotated as translation initiation factor proteins, were identified within the qPro_Gm06-6 interval (Table 3). Glyma.02G220000 and Glyma.02G221500, which contribute to the structural integrity of the ribosome and play a role in translation, were located in the qPro_Gm02-3 region (Table 3). Based on previous transcriptomic analyses, Glyma.02G220000 is expressed in the seed at 14 to 17, 21, 25, 28 and 35 DAF.
Candidate genes were also postulated for sucrose- and seed weight-related QTL that co-localized with protein-related regions. Four candidate genes were identified: Glyma.06G004400 and Glyma.06G007900, which were located in the qPro_Gm06-1 and qWt_Gm06-1 regions, and Glyma.15G133600 and Glyma.15G133800, which were located in the qPro_Gm15-3 and qWt_Gm15-4 regions. All four genes are involved in carbohydrate metabolism (GO:0005975) (Table 4).