SNP selection and characteristics of the ZDX1 array
In order to develop a high-throughput SNP typing array customized for the Illumina sequencing platform, SNPs were selected in re-sequence data from 2,214 soybean germplasm accessions, including 862 improved cultivars (Glycine max (L.) Merr.), 1,131 landraces, 218 annual wild soybean accessions (Glycine soja Sieb. & Zucc.), and three perennial wild soybean accessions (Glycine Subgenus Glycine) (Fig. S1). The SNP information was compared with the soybean reference genome Wm82.a2.v1 (Phytozome, https://phytozome.jgi.doe.gov/pz/portal.html), and 11,048,862 initial SNP sites were obtained after screening and quality control. Following removal of SNPs with no polymorphism, 50 bp interference, or one-sided probe length of 0, a total of 2,039,377 loci were retained. Using GenomeStudio software for genotyping and adjustment (Fig. S2), we then screened these loci for previously described QTL, GWAS, and developed array loci, as well as other published loci associated with valuable traits, ultimately resulting in a set of 158,959 high-quality sites (Fig. 1a, Supplemental Table 1).
We next examined the genomic distribution of the 158,959 high-quality SNPs and found that they were evenly distributed across the 20 soybean chromosomes. The number of SNP sites on each chromosome ranged from 6,085 to 9,314, of which, 90.23% fell within 10 kb (6.0 kb average distance) (Supplemental Table 2). In addition, SNP number showed a highly significant positive correlation with chromosome length, with a Pearson’s coefficient of 0.98 (p = 8.61E-14) (Fig. 1b). We mapped 64,435 of the candidate SNPs to 50,592 annotated genes, accounting for 90.92% of the total number of predicted genes in the soybean reference genome (Fig. 1c). In addition, another 4.29% of the large-effect SNPs could potentially affect gene function, including 5,684 non-synonymous SNPs, 119 Stoploss SNPs (four of which are both nonsynonymous or Stoploss), 604 Stopgain SNPs, six frameshift SNPs, and 414 alternative splicing SNPs.The SNPs selected for inclusion in the ZDX1 array also included 14,685 synonymous sites, 6,120 unknown sites, 14,845 sites located in intronic regions, 12,158 sites located within the 1,000 bp upstream or downstream of a gene, 9,804 sites located in UTR regions, and 94,524 sites located in intergenic regions (Supplemental Table 1). A/G and T/C (transitions) represented the main nucleotide variants on the chip, with 54,219 and 54,262, respectively, accounting for 68.25% of the total SNPs. The site frequency spectrum (SFS) for the 2,114 re-sequenced accessions showed that the sites with minor allele frequency (MAF) > 0.1 accounted for 81.3% of the total. SNPs with MAFs between 0.10–0.20, 0.20–0.30, 0.30–0.40, and 0.40–0.50 accounted for 31.48%, 19.20%, 15.79%, and 14.85%, respectively (Fig. 1d). Collectively, these data showed that the SNPs selected for ZDX1 were evenly distributed across the soybean chromosomes and that the array had high gene coverage and utilization, thus enabling population structure analysis, whole-genome-based selection, and other related studies.
In addition, the ZDX1 array was also designed to retain high-priority loci, including 2,402 SNPs for genes related to important traits and 627 SNPs for genes that underwent domestication or improvement (Supplemental Table 3). In addition, Analysis using Soybase showed that the candidate SNPs for ZDX1 also included 953 SNPs in QTL intervals, 547 GWAS-identified SNPs (https://soybase.org), and 110,811 SNPs that differed between ecological groups (Supplemental Table 4). Moreover, 3,869 SNPs from two low-density arrays, the 1.5K BeadChip (Shen et al. 2005) and the BARCSoySNP6K (Song 2014) array were also included (Supplemental Table 5). Compared with the three high-density arrays SoySNP50K, 180K AXIOM®, and NJAU 355K SoySNP, the ZDX1 array contains 134,737 unique sites (Fig. 1e), with a specificity rate as high as 84.8%. In addition, 14 important functional sites (causal SNPs) related to traits such as growth period, resistance to cyst nematodes, leaf type, pod setting habit, seed coat color, seed dormancy, and phosphorus efficiency (Supplemental Table 6) were selected for the array, thus facilitating identification of economically and agronomically valuable traits and screening for elite germplasm.
As a final step in marker selection, we evaluated the accuracy of the marker information. To this end, we first determined the site detection rates in 817 well-established breeding materials (Supplemental Table 7) and found that the detection rate for each sample was between 84.40% and 95.98%, with an average of 95.19%. At the same time, three DNA samples were randomly selected for two repetitions, and the genotype similarity between the repetitions was > 99.9% (Supplemental Table 8). The above data shows that this array has a high degree of accuracy and repeatability. Taken together, these results confirmed that the high-density ZDX1 array was both reliable and accurate.
Analysis of genetic diversity of breeding population and screening of fixed sites in breeding improvement
Subsequently, we applied the ZDX1 array for genotyping in a test population of 817 breeding lines in soybean breeding program, including 77 parental lines, 169 non-parental lines, and 571 stable progeny lines developed using the pedigree method after crossing. To analyze the genetic diversity of the three subpopulations, we next conducted linkage disequilibrium analysis (LD; indicated by r2). The results showed that the attenuation rate of the non-parental lines r2 was higher than that of the progeny and parental lines, and the distance at which r2 decayed by half was 244 kb, 276 kb, and 303 kb, respectively (Fig. 2a). These results indicated that the genetic diversity among the non-parental lines was higher than that of the parental lines and progeny, which thus helped to broaden the genetic diversity of the parental lines. Similar to the results of LD analysis, PCA analysis (Fig. 2b) confirmed that the distribution of the non-parental line subgroup was more scattered, further indicating higher genetic diversity.
The results of MAF showed that 46,376 sites in the test population were completely fixed, and that 38,625 sites (83.29%) contained differences between groups from geographically separated ecological regions. Furthermore, the percentages of fixed sites in the non-parental lines, parental lines, and progeny were 34.72%, 41.79%, and 34.63%, respectively (Supplemental Table 9, Fig. S3). To further clarify which sites were selected and fixed during the breeding process, 6,579 sites were selected based on their polymorphisms in the 817 accessions, as well as in the progeny subgroup (MAF = 0). It is worth noting that the minor allele types corresponding to these sites were the same across all three subgroups. Statistical analysis showed that the MAF values of the parental lines ranged between 0 and 0.0390, while the MAF values of the non-parental lines ranged between 0 and 0.1317 (Fig. 2c). Among them, 235 sites were identified where the MAF values of the parental and the non-parental lines were > 0.01, and 109 sites were located in genic regions spanning 95 genes (Supplemental Table 10). Taken together, these results suggested that these apparently informative SNP sites were fixed during the breeding process, which can be selected in future breeding.
Germplasm screening for breeding target traits using functional sites in ZDX1 array
In order to then develop elite germplasm using the functionally informative SNP sites, we selected fourteen SNP sites from the array to identify the test population, among which five were found to be non-polymorphic. These five marker sites included stem termination (Dt1/Gmtfl1-ta, Dt1/Gmtfl1-ab), and the seed coat color (Gm850). Among the six sites related to maturity, e4-oto and GmGPRR3b/Tof12 were completely fixed, and the MAF values for e1-fs, e1-as, e3-fs, and e4-keshuang were calculated to be between 0.001 and 0.192. In addition, the MAF values for three sites associated with cyst nematode resistance, rhg1-a/GmSNAP18, Rhg4/GmSHMT08, and GmSNAP11, were between 0.012 and 0.017, while the MAF value for leaf-shape Ln/ln site was 0.203. These findings suggested that these unfixed sites could be related to the genetic diversity of the phenotype, and potentially controlled traits that are desirable for breeders (Fig. S4).
We next analyzed six alleles from E1, E3, E4, and GmPRR3b related to fertility, and MAF analysis showed that these six alleles were found in 11 distinct allelic combinations in the population, among which the e1-fs, e1-as/e3-fs/e4-kes, e1-as/e3-fs, and e1-as/e4-kes genotypes were associated with precocity (Supplemental Table 11). Notably, only one accession, ‘Dongnong36’ (80.73d), carried the e1-fs genotype. Among the materials with two or more genotypes that included e1-as, e3-fs, and e4-kes, nine parental lines/nonparental lines exhibited an earlier fertile period (87.14-97.98d), while three progeny (HJ15-1231, HJ15-896, and HJ15-897) had relatively late growth periods (109.17-114.32d). These results showed that functional SNPs associated with determinate growth period in the ZDX1 array could thus be used to identify germplasm with early maturity phenotypes.
Three nematode resistance-associated SNPs, including Gm18_1643660, Gm08_8361148, and Gm11_32970174 (located in the rhg1, Rhg4, and SCN3-11 genes, respectively) were also covered by the array (Table 1). We found that the frequency of alleles for enhanced disease resistance in the tested materials was relatively low: 1.22%, 1.71%, and 1.47%, respectively. These three sites could be found in eight allelic combinations among the 817 accessions of the diversity panel, while seven accessions were identified that carried all of the resistance loci, including three known resistant varieties, ‘Kangxian1hao’, ‘Kangxian5hao’, and ‘Kangxian8hao’. In addition to these accessions, three new varieties not previously known to carry nematode resistance were also identified, including ‘Shundou5hao’, ‘Qinong1hao’, ‘Fengdou 23’, as well as the progeny ‘HJ15-863’. The proportion of resistant progeny was extremely low, potentially due to the difficulty of large-scale phenotypic identification and the lack of directional selection against SCN in the progeny.
Table 1
Allelic combinations at the rhg1-a, Rhg4, and GmSNAP11 loci
Combination
|
rhg1-a/GmSNAP18
Gm18_1643660
|
Rhg4/GmSHMT08
Gm08_8361148
|
GmSNAP11
Gm11_32970174
|
Number of parental lines
|
Number of non-parental lines
|
Number of progeny
|
Com1
|
GG
|
GG
|
TT
|
0
|
6
|
1
|
Com2
|
CC
|
CC
|
CC
|
76
|
162
|
557
|
Com3
|
CC
|
CC
|
TT
|
0
|
0
|
3
|
Com4
|
GG
|
CC
|
TT
|
0
|
0
|
2
|
Com5
|
CC
|
GG
|
CC
|
1
|
0
|
6
|
Com6
|
GG
|
CC
|
CC
|
0
|
0
|
1
|
Com7
|
CG
|
CC
|
CC
|
0
|
0
|
1
|
Com8
|
CG
|
GC
|
TC
|
0
|
1
|
0
|
SNPs mapped to the Ln/ln loci were also used to genotype 817 test materials. This screen revealed that 649 narrow leaflet soybean accessions all carried the CC genotype, 166 broad leaflet accessions harbored the GG genotype, and two heterozygous CG accessions showed both broad and narrow leaflet. This result reflected the prevalence of narrow leaflet among the soybean breeding lines used in Northeast China, and demonstrates the high accuracy of leaf shape detection using SNPs associated with functional loci in the ZDX1 array. Indeed, a greater proportion of round-leaf accessions were present in the non-parental lines (32.0%), while round-leaf accessions in the parental lines and progeny accounted for 10.4% and 18.4% of these populations, respectively. These proportions again reflected the preference by breeders for narrow leaflet (Fig. S5). We also found that breeders in the high latitudes of Northeast China have a preference for narrow-leaf breeding materials. This is related to the fact that the narrow leaflet lines have > 4 pods (Fang et al. 2013) and greater light transmission through the canopy.
Using breeding index and genetic distance to explore the method of parental selection
In order to illustrate how the ZDX1 array can improve the parental selection process, we next used genotype data to generate a kinship matrix for the full accessions, which revealed pairwise genetic distance that ranged between − 0.54 and 2.56 (with larger values indicating closer kinship; see Fig. S6). Analysis of R7 (beginning maturity), SW (100-seed weight), protein content, oil content, and SY (seed yield) in 298 progeny derived from the parental lines for both parents showed that the rate over best-parent of each trait was non-significantly negatively correlated with the genetic relationship between the parents (p = 0.30–0.97), and the correlation coefficients (rhd) were − 0.02 to -0.42, suggesting that the more distant the parental relationship, the greater the possibility that a higher proportion of progeny would outperform the parental lines. In addition, the mean value of each trait among progeny was positively correlated with the average parental value, with correlation coefficients (rpo) were between 0.33 and 0.73. Among them, mean oil and seed weight of the progeny showed an extremely significant (p < 0.01) correlation with the mean of these trait values among the parental lines (Fig. 3), which indicated that the use of elite parents in hybrid combinations allows the selection of elite progeny.
While high yield is the most important goal in soybean breeding, traits such as maturation time and seed quality (i.e., protein and oil contents) should also be comprehensively considered during selection. To this end, we included all of the five traits in the selection index, defined as the breeding index (BI), with which we scored the parental lines and 298 progeny which both parents are included in parental lines. Based on BI values, the parents could be categorized as high, medium, or low phenotypes (Supplemental Table 7). The 30 (top 10%) high-performance progeny could then be divided into five types based on the parental BI index. Using this system, we identified two high×high types, 11 high×medium types, nine high×low types, three medium×medium types, and five medium×low types. Of these types, 73.3% involved contributions from at least one high type parent (Fig. 4). When evaluating new lines with BI, the commonly used control varieties ‘Keshan1hao’ (BI = 0.53) and ‘Neidou4hao’ (BI = 0.25) were rated as a “High” type. This standard was also used to screen out two new varieties ‘Mengdou1137’ and ‘Mengdou640’ that passed the national certification, the "High×Low" parental combination was used to generate these two varieties, the average genetic relationship was relatively distant (-0.15). These results indicated that the selection of more distantly-related parents, among which at least one parent has strong multiple trait indexes, will more likely produce progeny with highest composite agronomic performance for these traits. It provides a reference for us to select suitable parents in complex self-bred crop breeding.
Following identification of candidate parental lines with suitable genetic distance and high-performance phenotypes, we also needed to enable efficient breeding decisions by eliminating redundant germplasm accessions from the diversity panel, which otherwise results in considerable genetic redundancy in the selected parental subgroup. We have counted the parental lines of the bottom 30 progeny (bottom 10%), among these parents, 12 parents including ‘Dengke4hao’ and ‘Hujiao1120’ did not derive excellent progeny (top 10%) (Supplemental Table 12), they can no longer be used in future breeding. Meanwhile, based on the genetic relationships indicated by different metrics for genetic distance, including 0.5 ~ 1.0, we identified the non-parental lines with higher similarity to the parental lines used in crosses (Fig. S7A), and finally eliminated 21 redundant non-parental lines including ‘Mei1’ and ‘Nenao08-1092’ based on kinship scores of > 1.0 (Supplemental Table 13). After screening out redundant parents, the improved combinations were proposed for future breeding. Meanwhile, the high-performing progeny lines should also be included in the parent nursery. We therefore selected the accessions with the top 10% of BI values, and calculated the number of potential combinations that could be formulated using different metrics for genetic distance, including − 0.5 ~ 0.0 (Fig. S7B). Using a kinship score of <-0.3 as the standard, we selected 46 high-potential combinations for use in future breeding experiments (Supplemental Table 14). By eliminating redundant parents and formulating potential combinations, the parent population structure is optimized and breeding efficiency is improved.
Different strategies based on ZDX1 array improve the accuracy of genomic selection in theoretical and actual breeding
We next explored the efficiency of different strategies to improve the accuracy of genomic selection using the ZDX1 array. The results of GBLUP (genomic best linear unbiased prediction) analysis to test the accuracy of selection based on the ZDX1 array revealed that the prediction accuracy for the five traits of beginning maturity, seed weight, protein content, oil content, and seed yield were 0.79, 0.73, 0.78, 0.77, and 0.69 respectively; these scores were all significantly higher than those of ABLUP (pedigree-based best linear unbiased prediction), based on pedigree relationships, and HBLUP (combined best linear unbiased prediction) based on both pedigree relationship and genotype data (p < 0.01) (Fig. 5a, Supplemental Table 15). These results indicated that the genomic information provided by the array can better reflect the population structure than pedigree relationships.
We subsequently identified 33,756, 33,733, and 33,761 sites that were respectively selected as marker subsets from gene regions, intergenic regions, or the whole genome. GBLUP analysis confirmed these three marker sampling methods showed no significant differences in their accuracy for predicting yield. For the other traits, the accuracy of prediction using markers for genic regions was 2.33% higher than that of SNP markers for intergenic regions, with highly significant differences among methods for each of the four traits (p < 0.01). Also, markers associated with genic regions were more accurate by an average of 0.57% than those sampled from across the whole genome, and were significantly more accurate for predicting 100-seed weight, protein content, and oil content (p < 0.01). Furthermore, use of only the 33,756 SNPs in genic regions also significantly improved the predictive accuracy (p < 0.01) for selecting 100-seed weight, and protein and oil contents compared with accuracy provided by using all 69,022 SNPs (Fig. 5b, Supplemental Table 15). These results showed that the sites on the genic-region in the array include more useful genetic information. For most traits, the strategy of sampling SNP markers for gene-encoding regions can reduce the number of requisite markers while also improving the accuracy of genomic selection.
In order to evaluate the efficiency of ZDX1 in predicting progeny in actual breeding, we also selected 246 parents as training population I, and 283 of the 571 progeny bred in 2015 as Predicted Population I. We used ZDX1 array to predict the top 50% of the 283 progeny. The prediction accuracy for the five traits in these high-value lines ranged from 0.30 to 0.45 (Circle1). We then used these selected 141 high-value progeny to expand training population I to generate training population II to further predict the 288 progeny bred in 2016 (Predicted Population Ⅱ) (Fig.5c). The results showed that with the exception of yield, the predictive accuracy was improved for these traits, ranging from 0.48 to 0.67 (Circle2), while the average accuracy was significantly increased by 32.1% (p=0.024) (Fig.5d, Supplemental Table15). Collectively, these results demonstrate that the predictive accuracy of breeding decisions based on the ZDX1 array can be improved by establishing a model using the parental lines and continuously expanding the model with high-performing progeny.