Evaluation of Wild Segments Related to Growth Period Traits Using a Modified Wild Soybean (Glycine Soja Sieb. Et Zucc.) Chromosome Segment Substitution Line Population

He Qingyuan Anhui Science and Technology University Shihua Xiang Zigong Academy of Agricultural Sciences Huawei Yang Zigong Academy of Agricultural Sciences Wubin Wang Nanjing Agricultural University Yingjie Shu Anhui Science and Technology University Zhengpeng Li Anhui Science and Technology University Songhua Wang (  Wangsh@ahstu.edu.cn ) Anhui Science and Technology University Xiaoyan Yang Anhui Science and Technology University


Introduction
Soybean [Glycine max (L.)] is a major source of vegetable protein and oil providing approximately 69% and 57% of dietary protein and oil, respectively (USDA-FAS, http://www.fas.usda.gov/). Its seed is composed about 40% protein and 20% oil (Li et al., 2019). Improving seed yield and quality is the major target for soybean breeders. Seed traits including 100-seed weight (100SW), protein content (PRC) and oil content (OIC) are important traits determining the yield and quality of soybean (Yang et al., 2019). 100seed weight is an important yield component of soybean and was generally positively correlated with seed yield (Burton et al. 1987a). Increases in seed oil and protein contents of soybean would improve the quality. Therefore, it is important way to improve soybean yield and quality by increasing 100SW, PRC and OIC of soybean (Han et al., 2012;Li et al., 2019). To dissect their genetic basis will help the improvement of yield and quality of soybean.
The 100SW, PRC and OIC are complex traits governed by multiple quantitative trait loci (QTLs) and in uenced considerably by environments in soybeans. Hitherto, more than 304 QTL for 100SW, 248 for PRC and 327 for OIC have been reported from 59, 58 and 34 bi-parental mapping populations, respectively, which were stored in Soybase (http://soybase.org; Diers et al., 1992;Mansur et al., 1996;Specht et al., 2001;Jun et al., 2008;Rossi et al., 2013;Akond, et al., 2014). However, the most locations of QTLs found in different studies often do not map to the same position. Only a few for 100SW, PRC and OIC have been con rmed or cloned (Lu et al., 2017;Fasoula et al., 2004;Wang et al., 2014). In addition, the limited recombination of bi-parent population or relatively low density of marker which hindered breeding efforts to improve seed traits through marker assisted selection (MAS).
In recent year, with the development of high density gene chip and re-sequencing, natural populations having more extensive recombination events in the evolutionary process of a natural population and shorter linkage disequilibrium (LD) blocks were used for mapping and wider phenotypic variation available through genome-wide association studies (GWAS) (Gupta et al., 2005). It has widely been applied to dissect the genetic basis for some complex traits in human (Pritchard and Przeworski, 2001;Jiang et al., 2013) and animal (Farnir et al., 2000;Miller et al., 2015). At present, association studies have also been used in many plants, such as Arabidopsis (Kim et al., 2007;Fulgione and Hancock, 2018), maize (Falke et al., 2007;Michael et al., 2009), rice (Yang et al., 2014;), cotton Liu et al., 2018) and soybean (Zhang et al., 2015;Chang et al., 2018). However, how to break the negative correlation between oil and protein, breed soybean varieties having high PRC, OIC and 100SW has not been established in soybean. In order to better identify loci of high PRC, OIC and 100SW, and to service for breeding, high protein germplasm were used to dissect related genetic mechanism.
In the present study, GWAS analysis of soybean PRC, OIC and 100SW in Sichuan and Chongqing China based on 228 tested local and breeding varieties and 135 SSR markers and 107081 SNPs. Soybean of Sichuan and Chongqing are elite high protein germplasms (Yang and Zhou, 2020). The goal of this study is to identify QTLs associated with PRC, OIC and 100SW and to screen candidate genes located in peak marker regions. This study enhances our understanding of their genetic mechanisms of PRC, OIC and 100SW in soybean. This study will lay a foundation for increasing soybean breeding e ciency of PRC, OIC and 100SW.

Phenotypic evaluation and statistical analysis
After the seeds were bulk-harvested and dried at each plot, the 100SW was determined in grams per 100 uniformly cleaned seeds. The PRC and OIC of seed were measured with about 20 g of seed by nearinfrared analysis instrument (Perten DA7200, Swedish).
The joint analysis of variance (ANOVA) was conducted for three traits using PROC GLM procedure of SAS 9.2 (SAS Institute, Cary, United States). The statistical model was Y ijk = μ + G i + E j +(GL) ij + γ k + e, where μ is the total mean, G i is the effect of the i th genotype, E j is the effect of the j th environment; (GE) ij is the genotype × environment interaction; γ k is the effect of k th replication; and e is random error following N (0, σ e 2 ). The broad-sense heritability (h 2 ) was estimated as: h 2 = σ 2 g / [σ 2 g + (σ 2 ge /j) + (σ 2 /rj)] for whole experiment and h 2 = σ 2 g / [σ 2 g + σ 2 /r] for an individual environment, where σ 2 g is the genotypic variance among accessions, σ 2 ge is the genotype × environment interaction variance, σ 2 /rj is the error variance, j, r are the numbers of environment and replications per environment, respectively. All parameters were estimated from the expected mean squares in the ANOVA.
DNA extraction, SSR marker analysis, SNPs polymorphism, genotyping and haplotype block estimation Total DNA extracted from fresh leaves using the slightly modi ed CTAB method (Doyle and Doyle, 1984).
A total of 135 simple sequence repeat (SSR) markers distributed basically evenly in Soymap2 (Song et al., 2004) were selected to genotype each accession. The reaction system and program of PCR were performed according to the method of He et al. (2015). The PCR products were detected on 8% polyacrylamide gels, and the fragments were visualized by sliver staining.
High-throughput SNPs were generated by illumina soybean 200K gene chip (Beijing Compass Biotechnology Co.). The internal reference quality control of SNP chip and methods of calling variation were based on according to chip operation manual (Illumina, 2012). A total of 107081 single nucleotide polymorphisms (SNPs) with low minor allele frequency (MAF) ≥ 1% and missing proportion ≤ 10% were used to genotype and further study in the 228 accessions. The genotype and SNP markers were established according to pLINK V2.0 software (Shaun, 2010). The haplotype blocks were divided by LD heatmap package of R project (Shin et al., 2006). The distribution of SNPs on chromosomes was displayed by Gapit software (version 3.0) (Tang et al., 2016).

Linkage disequilibrium estimation
Linkage disequilibrium (LD) between pair-wise SNPs was estimated by r 2 (the squared correlation between two loci) and D' (standardized disequilibrium coe cient) (Gaut and Long, 2003) using TASSEL5.2.59 software (Bradbury, 2007). In addition, LD (D' and r 2 ) was estimated separately for all pairwise SNPs with a 100 kb summary bin setting within the 5 Mb distance. The physical location of the euchromatin and heterochromatin region for each chromosome was de ned as in the reference genome of G. max 2.0 (www.soybase.org). The LD decay was plotted as the physical distance versus r 2 with smooth curves tted by locally weighted regression (LOESS) curves decayed to half its maximum value or to r 2 = 0.2 (Huang et al., 2010).

Population structure and principal component analysis
The kinship of accessions was calculated using Gapit software (version 3.0) with VanRaden program.
The population structure was estimated by the software STRUCTURE 2.4.3 (Pritchard et al., 2000) using non-linkage SNPs. The number of subpopulations (k) was set from 1 to 9 initially, with 5,000 burn-in period, 1000,000 Markov chain Monte Carlo (MCMC) method. Each Q matrix and the relevant P-value were obtained. The most likely number of subpopulations was selected according to the relationship between k and Δk (Evanno et al., 2005). Principal component analysis (PCA) was also performed by Gapit software to examine genetic structure and variation of different ecotype of soybean, and the rst three eigenvectors were plotted in three dimensions. The cladogram was constructed using TASSEL V5.2.59 with neighbor joining method (Bradbury, 2007).

Genome-wide association studies
Associations mapping was performed by the TASSEL2.1 using SSR markers. The further GWAS was conducted by TASSEL5.2.59 with mixed linear model (MLM) using SNPs. To minimize false positives and increase result reliability, both population structure (PCA) and kinship (K) matrix were applied for the population. The Bonferroni-corrected threshold of genome-wide signi cance was de ned at 0.05 for the rst stage, and 5 different threshold values were selected to converge in computing process for the second stage, respectively. QTLs were considered to exist at haplotype blocks where LOD score exceeded the corresponding signi cance threshold.

Prediction of candidate genes
The particular candidate genes were considered basing on the predictive genes source of Glyma 2.0 gene model. Their functions were predicated using protein-protein BLAST search tool of NCBI (http://blast.ncbi.nlm.nih.gov/Blast.cgi).

Phenotypic variation in the SCLBP
The phenotypic characteristic of 100SW, PRC and OIC for SCLBP are showed in table 1. Averaged over four environments, 100SW, PRC and OIC showed a large variation with range values of 4.82-33.35 (g), 36.47-49.75% and 14.68-21.77%, respectively. The ranges of absolute values of kurtosis and skewness were 0.11-1.58 for 100SW, PRC and OIC, therefore, their distributions were approach to normal distributions. The heritability (h 2 ) and genetic coe cient of variation (GCV) of 100SW were highest, secondly is OIC, the lowest is PRC. The variation distribution of three traits' phenotypic was showed in Fig. 1. The variance analysis showed that the 100SW, PRC and OIC were signi cantly affected by varieties, environments and reciprocity of varieties and environments (Table 2).

GWAS analysis of 100SW, PRC and OIC using SSR
Based on the MLM model of tassel2.1, the Q matrix of population structure was used as covariate, and to calculate the associate between phenotypic value of 100SW, PRC and OIC and 135 pairs of SSR marker.
A total of 26 SSR markers were associated with 100SW, 33 with PRC and 31 with OIC were detected in four environments. The allele of Sat_260 on chromosome 8 controlling 100SW was simultaneously detected in four environments. Two loci controlling 100SW were simultaneously detected in three environments, 6 loci controlling 100SW, 7 loci controlling PRC and 6 loci controlling OIC were simultaneously detected in two environments, respectively (Table S1).

Distribution of SNPs and genetic characteristics of the mapping population
A total of 159072 SNPs were initially identi ed in 238 accessions. 107081 valid SNPs were obtained and used for further analyses in 228 accessions after ltration according to the following criteria: (1) the minimum allele frequency (MAF) ≥ 0.01, (2) sites with deletion ratio less than 0.10, and (3) remove samples with high deletion ratio less than 0.10, the average marker density of 1 SNP every 8.86 kb genome-wide, varying across chromosomes from 7.45 kb per SNP on chromosome 17 to 12.22 kb per SNP on chromosome 12 (Table S2). The MAF and haplotype block (>50 kb) for the population characteristics are presented in Fig. 2. The population structure was analyzed using non-linkage 4792 SNP markers. The most likely K-value was determined 3 according to the relationship between K-value and ΔK-value (Fig. 3A), which suggested that the overall population could be divided into three subpopulations. The result was also supported by the phylogenetic analysis (Fig. 3B, C) and PCA (Fig.   3D). The LD decay rate of the population was estimated at 2000 kb in euchromatin, where r 2 dropped to half of its maximum value (r 2 = 0.50) (Fig. 4). The haplotype analysis showed that 107081 SNPs were grouped into 8720 haplotype blocks. The size of the blocks ranged from 58 bp to 1000 kb across the whole genome. The distribution of haplotype blocks is shown in table S3.

GWAS analysis for 100SW, PRC and OIC using SNPs via MLM
Genome-wide association studies were conducted using the best linear unbiased predictors (BLUPs) of individual performance in four environments. A total of 28 SNP loci for 100SW, 198 loci for PRC, and 250 loci for OIC were identi ed in the mixed linear model (MLM) association panel at suggestive signi cance level (P = 1 × 10 -4 ). Among them, the two loci (Gm11_9895764 and Gm11_9917646) controlling 100SW were simultaneously detected in three environments, 9 loci in two environments. Two loci controlling PRC and six loci controlling OIC were simultaneously detected in two environments, respectively. Other loci were only detected in one environment (Fig. 5, Table S4).
The haplotype analysis showed that 10 SNPs were grouped into 5697 haplotype blocks. The size of the blocks ranged from 6 bp to 200 kb across the whole genome. The distribution of haplotype blocks is shown in Supplementary Fig. S3. The haplotype analysis showed that the mapping results were classi ed into 13 blocks controlling 100SW, 35 blocks controlling RPC and 60 blocks controlling OIC, respectively. Thereinto, the block Gm11_9895764-9917646 controlling 100SW was simultaneously detected in four environments, and two blocks Gm14_46067605-46114908 and Gm14_48656669-48766017 were simultaneously detected in two environments. The block Gm20_32208295-33208205 controlling PRC was simultaneously detected in three environments, and the block Gm01_1554929-157332 was simultaneously detected in FY2016 and ZG2016. The block Gm20_31492271-32206162 controlling was simultaneously detected in three environments. Four blocks (Gm13_28317040-28337047, Gm18_56604855-56608672, Gm20_26429128-27429119 and Gm20_32208295-33208205) were simultaneously detected in two environments. Other blocks were only detected in one environment. The block (Gm14_48656669-48766017) was detected to control both 100SW and PRC. Three blocks were detected to control both PRC and OIC, as follows: Gm13_28856666-28922456, Gm20_31492271-32206162 and Gm20_32208295-33208205 (Table S4).

Prediction of Candidate Genes
To predict candidate genes for loci signi cantly associated with 100SW, PRC and OIC, we selected the putative genes tagged by the most likely reliable blocks. There are 15 putative genes in the range of 50 kb upstream and downstream of the Gm11_9895764-9917646 block associated with 100SW. Among them, the gene of Glyma.11g130800 encodes transducin/ WD40 repeat-like superfamily protein which regulate seed weight and volume in Arabidopsis thaliana (You, et al., 2011). The gene of Glyma.11g130800 is considered to be the most likely gene for 100SW. There are 17 putative genes in the range of 50 kb upstream and downstream of Gm13_32352799-32393061 block associated with PRC. The gene of Glyma.13g217000 encodes CCCH-type zinc nger protein which regulates the seed storage protein (Chen et al., 2013). There are 15 putative genes in the range of 50 kb of the Gm08_ 9426150 SNP associated with OIC. Among them, the gene of Glyma.08g122600 encodes Aldo-keto reductase which is one of the most important enzymes in fatty acid synthesis. There are 29 putative genes in the block of Gm20_32208295-33208205 for OIC. Among them, the gene of Glyma.20g086700 encodes MATH domain-containing protein which regulates seed oil content (Lassner et al., 1995).

Potential applications and features of SCLBP
The high-density genetic map could identify more recombination events and will increases the accuracy of QTLs mapping. In the study, a high-density genetic map was used, and a total of 159072 SNPs was generated by illumina soybean 200 K gene chip lter 107081 valid SNP markers after ltration on 20 linkage groups. A major goal of development SCLBP was to map the locations of favorable alleles from landraces that could be useful in soybean improvement. A comprehensive evaluation and utilization of large-scale representative germplasm is important for crop improvement through identi cation and access to allelic variations affecting the crop phenotype. The SCLBP population has advantage of weight of 100SW, elite high protein and oil content for soybean breeding. Mapping populations were the basis of QTL mapping. In soybean, numerous QTL for kinds of traits have been detected by primary mapping populations such as F 2 (Maughan et al. 1996), F 2:3 (Keim et al. 1990), RILs (Orf et al. 1999;Zhang et al. 2004) and so on. The GWAS has advantage in convenience of group construction and detecting multiple alleles of the same locus, and providing better service for plant breeding. It laid a foundation for marker assisted selection and molecular breeding that the closely linked markers were obtained and the elite allelic variation were dissected using the SCLBP population.
Genetic System of 100SW, PRC and OIC in SCLBP The di culty in QTL studies is nding stable and real QTLs that can be detected in various environments and plant materials. With the help of soybean genome (www.soybase.org), the QTL was detected in different populations could be compared. At least 29 31 and 58 associated with 100SW, PRC and OIC have been reported in near or the same regions by previous literatures using SSR in the study, respectively (Table S1). At least 11, 44 and 101 for 100SW, PRC and OIC have been reported in near or the same regions by previous literatures using SNP in the study, respectively (Table S4). However, the 9, 11 and 9 loci for 100SW, PRC and OIC haven't been reported using SSR. The 3, 5 and 8 for 100SW, PRC and OIC haven't been reported using SNP in the study, respectively. These loci may be novel discovered loci controlling 100sw, PRC and OIC. Broader genetic variation for 100SW, PRC and OIC exists in different soybean populations due to natural and arti cial selection in different environments. Further study should con rm these loci and identify candidate genes for them. Zhang J.P., Song Q.J., Cregan P.B., Nelson R.L., Wang X.Z., Wu J.X., Jiang G.L. (2015) Genome-wide association study for owering time, maturity dates and plant height in early maturing soybean (Glycine max) germplasm. BMC Genomics 16: doi: 10.1186/s12864-015-1441-4. Zhang W.K., Wang Y.J., Luo G.Z., Zhang J.S., He C.Y., Wu X.L., Gai J.Y., Chen S.Y. (2004) QTL mapping of ten agronomic traits on the soybean (Glycine max L. Merr.) genetic map and their association with EST markers. Theor. Appl. Genet. 108 (6)