Development of SNP marker panels for genotyping by target sequencing (GBTS) and its application in soybean

A high-throughput genotyping platform with customized flexibility, high genotyping accuracy, and low cost is critical for marker-assisted selection and genetic mapping in soybean. Three assay panels were selected from the SoySNP50K, 40K, 20K, and 10K arrays, containing 41,541, 20,748, and 9670 SNP markers, respectively, for genotyping by target sequencing (GBTS). Fifteen representative accessions were used to assess the accuracy and consistency of the SNP alleles identified by the SNP panels and sequencing platform. The SNP alleles were 99.87% identical between technical replicates and 98.86% identical between the 40K SNP GBTS panel and 10× resequencing analysis. The GBTS method was also accurate in the sense that the genotypic dataset of the 15 representative accessions correctly revealed the pedigree of the accessions, and the biparental progeny datasets correctly constructed the linkage maps of the SNPs. The 10K panel was also used to genotype two parent-derived populations and analyze QTLs controlling 100-seed weight, resulting in the identification of the stable associated genetic locus Locus_OSW_06 on chromosome 06. The markers flanking the QTL explained 7.05% and 9.83% of the phenotypic variation, respectively. Compared with GBS and DNA chips, the 40K, 20K, and 10K panels reduced costs by 5.07% and 58.28%, 21.44% and 65.48%, and 35.74% and 71.76%, respectively. Low-cost genotyping panels could facilitate soybean germplasm assessment, genetic linkage map construction, QTL identification, and genomic selection.


Introduction
Increasing crop productivity to meet demands is critical in the face of a rapidly growing global population (Ray et al. 2013;Arabzai et al. 2021). Breeding new varieties with high yields is often considered a solution. Traditional selection-based breeding mainly relies on visual observation of plant traits and performance in the field, which is a time-consuming and labor-intensive task (Guo et al. 2019;Yang et al. 2021). In recent decades, marker-assisted selection (MAS) has been developed for crop improvement (Cuo et al. 2017). Molecular markers have been shown to play a key role in accelerating the "Green Revolution" . For instance, markers associated with the sd1 (semi-dwarf) gene have been widely used to reduce plant height and improve lodging resistance and yield in rice (Srivastava et al. 2019), while markers associated with a well-known dwarfing gene in wheat, Rht (reduced height), have been used for the selective breeding of commercial wheat varieties (Grover et al. 2018;Miedaner et al. 2018).
Many types of molecular markers, such as amplified fragment length polymorphisms (AFLPs), restriction fragment length polymorphisms (RFLPs), random amplified polymorphism DNA (RAPD), simple sequence repeats (SSRs), insertions/deletions (indels), and single nucleotide polymorphism (SNP) markers, have been developed and applied in crop breeding (Chen et al. 2021). Due to their abundance in the genome, high degree of polymorphism, and ease of genotyping, SNPs have become the most commonly used markers (Song et al. 2013;Lee et al. 2015;Chen et al. 2021;Sun et al. 2022). In addition, advances in next-generation sequencing technologies have enabled the detection of a large number of highquality SNP markers in plant populations.
SNP assays have become a key technology in soybean genetic research, and a variety of soybean SNP chips/assays have been developed. For example, the SoySNP50K assay, an Illumina Infinium BeadChip containing 52,041 SNPs, was the first to be developed (Song et al. 2013). The SoySNP180K assay, with 170,223 SNP markers (Lee et al. 2015); the SoySNP355K assay, with 355,595 SNP markers (Wang et al. 2016); the BARC-SoySNP6K assay, with 6000 SNP markers (Song et al. 2020); the SoySNP200K assay, containing 158,959 SNP markers (Sun et al. 2022); and the GenoBaits Soy40K assay, with 40,334 SNPs/indels , have been developed and used for soybean population evaluation and breeding.
Bead chip and genome-by-sequencing (GBS) genotyping platforms have been widely applied in soybean breeding programs. GBS is commonly used for genotyping, but SNPs are not fixed among populations, which is a major limiting factor for MAS application (Guo et al. 2019). Bead chip assays are easy to use, and the SNPs in the assay are fixed. The SoySNP50K assay is one of the most widely used bead chip assays in soybean research. For example, the SoySNP50K assay has been used to genotype thousands of soybean germplasms, including 18,480 domesticated soybean and 1168 wild soybean accessions in the USDA Soybean Germplasm Collection (Song et al. 2015), and the dataset is widely used by soybean researchers. However, access to the SoySNP50K assay outside the USA is limited due to high costs and a lack of genotyping equipment in other countries. Genotyping by target sequencing (GBTS), as a targeted sequence capture strategy, has been used to develop SNP marker panels in maize and cucumber breeding (Guo et al. 2019). However, there have been very few reports on the application of GBTS in soybean. Therefore, our objectives are to develop and evaluate SNP marker panels, including 40K, 20K, and 10K panels, from the SoySNP50K assay (Song et al. 2013) and apply the assays to QTL mapping. This study will provide an alternative approach to accurately genotype soybean populations at low cost. In addition, because the SNPs from the SoySNP50K assays are highly polymorphic and the SoySNP50K dataset from the USDA Soybean Germplasm Collection is publicly available from SoyBase (https:// www. soyba se. org/ snps/), the SNP panels developed in this study will enable the comparison of new germplasm with US accessions.
In addition, two RIL populations, QH3417 and XH1617, consisting of 171 and 183 lines, respectively, were created from crosses between Qihuang34 × HJ117 and Xudou16 × HJ117 with the single-seed descent (SSD) method and were used to construct genetic linkage maps and assess the quality of the SNP panels through collinearity analysis. The genotypic dataset was also used to analyze QTLs for 100seed weight in soybean. Two RIL populations and their parents were grown in Shijiazhuang, Hebei,China (located at 114.8 E longitude,38.0 N latitude) in June 2019. A field experiment with a randomized complete block design and two replicates was conducted. Each line and its parents were planted in a single row, 2.0 m long and 0.5 m wide, with a distance of 0.1 m between plants.

Plant sampling and genetic analysis
After harvest, the seeds were dried at room temperature, and the 100-seed weight for each line of QH3417 and XH1617 was measured in 4 replicates by electronic scales. The broad-sense heritability (h 2 b ) of 100-seed weight was estimated based on the where V G is the genetic variance among RILs and V E is the error variance (Cui et al. 2020).

DNA extraction
At the first fully expanded trifoliate leaf stage (V2 stage), ten leaves were collected from each soybean accession and from each line of each of the two RIL populations to extract genomic DNA using the cetyltrimethylammonium bromide (CTAB) method (Saghai-Maroof et al. 1984). The quality and concentration of DNA were examined by agarose gel electrophoresis and a NanoDrop spectrophotometer.

Selection of SNP loci for development of the marker panels
A total of 45,000 SNPs, including 42,509 SNPs in the SoySNP50K dataset, were selected from the SoySNP50K assay (Song et al. 2013). A set of probes, each with 110 nt and a GC content greater than 30%, were designed using GenoBaits Probe Designer (Guo et al. 2019) to target each SNP flanking region. The specificity of all probes from each SNP region in the reference genome (Schmutz et al. 2010) was assessed. Each SNP was captured by 2 cross-covered probes. The probe set was synthesized by a semiconductor-based in situ synthesis process. All the SNPs were ranked by the average missing rate per locus and the average sequencing depth according to the 15 accessions. The lowest-ranking 3459 SNPs were removed. Finally, 41,541 SNPs were selected and included in the 40K SNP marker panel. Two additional marker panels, 20K and 10K, were generated from the 40K SNP marker panel by sequencing at low depths.
DNA library construction and probe hybridization DNA library construction and probe hybridization were performed according to Guo et al. (2019). Briefly, the genomic DNA was fragmented and end repaired. After adding the A-tail and adapter ligation, the ligation products with 200-300 bp sizes were indexed with different barcode sequences. The resulting DNA library was eluted with Tris-HCl at pH=8.0. For probe hybridization, 500 ng library DNA, 5 μL GenoBaits Block I and 2 μL GenoBaits Block II were mixed in the tube and vacuum-dried. The dry powder and hybridization buffer were mixed and incubated at 65°C for 2 h in a PCR instrument for library hybridization. Then, 100 μL of GenoBaits DNA Probe Beads were added to the reaction for target capture. Library amplification was carried out by PCR with probe beads, which were washed at room temperature. The probe beads with post-PCR products were washed with 75% ethanol. Finally, the DNA library was checked with a Qubit 2.0 Fluorometer (Thermo Fisher Scientific, CA); the portion that passed quality control was loaded onto the flow cell and sequenced with PE150 on the MGISEQ-2000 platform (MGI, Shenzhen, China).

SNP identification and genetic linkage map construction
Sequence quality was assessed using FastQC (www. bioin forma tics. babra ham. ac. uk/ proje ct), and the reads were mapped onto the reference genome using BWA (biobwa. sourc eforge. net) with default parameters. SNP identification was performed using GATK (softw are. broad insti tute. org/ gatk). The genotypic datasets of the 40K, 20K, and 10K panels were obtained using a Perl script written by Mol Breeding.
The SNPs in the 10K genotypic files of the two RIL populations QH3417 and XH1617 were eliminated if alleles in the parent were identical or missing, the missing rate was > 20.00% among RILs, or segregation was distorted at P value < 0.05 in Chi-square tests. Genetic linkage maps for the two RIL populations of QH3417 and XH1617 were constructed as described by Yang et al. (2019). A total of 2055 SNP markers in QH3417 and 2111 in XH1617 were used to construct the genetic linkage maps using Join Map 4.1 (Van Ooijen 2006).

Phylogenetic tree construction and collinearity analysis
Phylogenetic trees and the collinearity of the physical and genetic positions of the markers were analyzed to evaluate the quality of the 40K, 20K, and 10K SNP marker panels developed by GBTS technology. Phylogenetic trees were constructed by FigTree v1.4.3 software (Gadissa et al. 2021). The collinearity of SNP positions between the physical map (PM, measured in megabases/ Mb) of the Wm82.a2.v1 reference genome and the genetic maps (GMs, measured in centimorgans/cM) in the RIL populations QH3417 and XH1617 was calculated and displayed using an R program. The coordinate scales of the physical map (PM, Mb) and the genetic map (GM, cM) of each SNP marker in the RIL populations were calculated as follows: where Mr means the rth (r=1, 2, …, n) SNP marker; the maximum values of n for the RIL populations QH3417 and XH1617 are 2055 and 2111, respectively. A statistical program to detect the collinearity between the PM and the CM of each RIL population was written in the form of R scripts, and the results were displayed in R.

QTL identification
The average 100-seed weight of the replicates of each line in the two RIL populations was used for QTL detection. The analysis was performed using the composite interval mapping (CIM) method in Windows Mr Page 5 of 11 26 Vol.: (0123456789) QTL Cartographer software 2.5 (Wang 2007). A permutation test with 500 iterations was conducted to obtain a proper log of the odds (LOD) score or threshold to declare QTLs (Doerge and Churchil 1996). Model 6 (standard model) was chosen with a control marker and window size of 5 and 10 cM, respectively. The walk speed of precision selection was 1 cM.
The SNP density per 1 Mb was higher in euchromatic regions than in the centromeres of the Wm82. a2.v1 assembly ( Fig. 1a and Fig. S1) because recombination events in the heterochromatic regions were infrequent, and the source panel, SoySNP50K, was purposely selected to reduce the number of SNPs in this region (Song et al. 2013).
The missing rate ( Fig. 1b and Fig. S2) decreased as the number of sequencing reads in each panel increased. A total of 1.2 Gbp, 0.6 Gbp, and 0.4 Gbp sequencing reads were required to reduce the missing rate to less than 0.02 for 40K, 20K, and 10K, respectively ( Fig. 1b  and Fig. S2). Sequencing reads were reduced by 50.00% and 66.67% in the 20K and 10K SNP marker panels, respectively, compared to the 40K panel.

Consistency of SNPs from different sequencing technologies among 15 soybean accessions
To examine the reliability and consistency of SNP alleles from GBTS, the 15 representative soybean accessions were sequenced twice with a 40K SNP GBTS panel, and the SNP alleles from these replicates were compared ( Table 2). The concordance of the SNP genotypes from repeat target sequencing averaged 99.87% and ranged from 99.28% in ZYD04569 with 41054 SNPs to 99.98% in Hobbit with 41293 SNPs. The concordance of SNP alleles of the 15 soybean accessions between 10× resequencing data and GBTS data averaged 98.86% and ranged from 95.64% for ZYD02878 with 39499 SNPs to 99.70% for Hobbit with 41314 SNPs (Table 3). The results suggested that the 40K SNP marker panel was highly reliable and consistent.

Phylogenetic relationships revealed by different marker panels
Phylogenetic trees were constructed for the 15 representative soybean accessions using 40K, 20K and 10K SNP marker panels to further evaluate the quality of the SNP marker panels. As shown in Fig. 1c  and Fig. S3, identical phylogenetic relationships among the soybean accessions were observed based on the 40K, 20K, and 10K SNP markers. The wild soybean accessions were in one cluster, and the cultivated soybean accessions were in another cluster. Although all three GBTS panels could be used for germplasm evaluation, the 10K panel might be a better choice in terms of minimizing costs.
High-density genetic linkage map construction using the 10K SNP marker panel To evaluate the application of these marker panels from GBTS in genetic linkage map construction in soybean. Two RIL populations, derived from Qihuang34 × HJ117 (QH3417) and Xudou16 × HJ117 (XH1617), were sequenced using the 10K SNP marker panel. The linkage  (57), with a genetic length of 108.70 cM (Table S2). The linkage map of XH1617 consisted of 2111 SNP markers (Table S3). The total linkage map length and average genetic distance between two adjacent SNP markers were 4798.81 cM and 2.27 cM, respectively. The largest and smallest numbers of SNP markers were 134 on chromosome 18 and 71 on chromosome 04 and chromosome 12, respectively. A relatively high degree of collinearity between the SNP positions in the reference genome and the genetic maps was observed in these two RIL populations, as there was a distinct diagonal direction to the relationship between physical length and genetic distance in the plot of each population (Fig. 2).

Genetic analysis and QTL identification for the 100-seed weight trait in soybean
The 100-seed weight of three representative plants from each line of QH3417 and XH1617 was determined and used for QTL identification. As shown in Table S4, the 100-seed weight was between 16.60 g and 31.75 g in QH3417 and 11.45 g and 28.05 g in XH1617. Genetic analysis suggested that the distributions of the 100-seed weight approximated a normal distribution, and the skewness and kurtosis were 0.16 and 0.05 in QH3417, respectively, and −0.19 and −0.09 in XH1617, respectively. The broad-sense heritability (h 2 b ) for the two RILs was 0.96 and 0.92, demonstrating that the variation in 100-seed weight caused by experimental error was small.
QTL analysis in the two RIL populations identified a common stable locus on chromosome 06, Locus_OSW_06 (Table 4). In QH3417, qOSW-34 was mapped to the 6,541,348-6,611,603 bp intervals between the markers Gm06_6611603_A_G and Gm06_6541348_G_T with an LOD value of 4.85, which explained 9.83% of the phenotypic variation. In XH1617, qOSW-16 was detected in the interval of 6187779-6370390 bp between the markers of Gm06_6187779_T_C-Gm06_6370390_G_A with an LOD value of 4.09, explaining 7.05% of the phenotypic variation.

Cost of GBTS compared with GBS and DNA chips
The cost in soybean breeding programs, especially the high genotyping cost, is the major constraint for breeders. The costs of GBTS, GBS, and DNA chips were analyzed (Table 5). According to the Mol Breeding Company, the costs for DNA extraction, library construction, probe hybridization, and labor were $0.44, $2.94, $2.21, and $1.47 per sample, respectively, for each panel of GBTS; however, the costs for sequencing, bioinformatics analysis and equipment depreciation varied depending on the panels. Compared to the costs of the 40K marker panel, the costs of the 20K and 10K panels were reduced by 33.33% and 66.67% for sequencing, 60.81% and 79.73% for data analysis, and 29.93% and 59.86% for equipment depreciation, respectively. The total genotyping costs of GBTS for 40K, 20K, 10K, GBS with a sequence depth of 2× and DNA chips containing 50K SNP markers were $13.68, $11.32, $9.26, $14.46, and $32.79 per sample, respectively. Compared to GBS and DNA chips, the costs for 40K, 20K, and 10K were reduced by 5.07% and 58.28%, 21.44% and 65.48%, and 35.74% and 71.76%, respectively. GBTS has a cost advantage over GBS and DNA chips, especially when the 10K marker panel is used.

Discussion
In recent decades, sequencing technologies, such as first-generation DNA sequencing, next-generation DNA sequencing, and Affymetrix GeneChips, have been used for genotyping and application of MAS in soybean breeding (Song et al. 2013;Lee et al. 2015;Heather and Chain 2016;Wang et al. 2016;Song et al. 2020;Sun et al. 2022). However, whole-genome resequencing of all genotypes is not cost-effective and is sometimes unnecessary, while gene chips require expensive equipment. Breeders frequently need highefficiency, low-cost, and flexible genotyping platforms for breeding (Rasheed et al. 2017). The new sequencing technology of GBTS has the advantages of customized flexibility, high sequencing accuracy, and low sequencing cost and has been widely used in maize breeding (Guo et al. 2019).
As two core SNP capture technologies in GBTS, GenoPlex (based on multiplexing PCR) and GenoBaits (based on sequence capture in solution) can assay many types of molecular markers, including SSRs, SNPs, and indels (Xu et al. 2020;Guo et al. 2021). Compared to DNA chips/assays, GBTS is more flexible in meeting the requirement of various numbers of markers by controlling the sequencing depth (Guo et al. 2019;Xu et al. 2020). In this study, three high-quality SNP marker panels were developed for GBTS (Table S1) and could be utilized for soybean germplasm evaluation, genetic mapping, marker-assisted selection, etc. In addition, the relatively high cost of genotyping accessions is the main constraint on the application of MAS in soybean breeding (Guo et al. 2019); GBTS provides an alternative approach to genotype lines at relatively low cost, according to our cost comparison.
Recently, one GBTS liquid chip, the GenoBaits Soy40K assay, was developed based on the whole genome sequences of approximately 2900 soybean accessions . The 40K, 20K, and 10K SNP marker panels were developed based on the SoySNP50K assay (Song et al. 2015), which will greatly facilitate the comparison of the genotypes to the 18,480 domesticated soybean and 1168 wild soybean accessions in the USDA Soybean Germplasm Collection, which were genotyped with SoySNP50K chips (Fig. 1a and Fig. S1).
Molecular markers have been widely used for soybean germplasm evaluation (Song et al. 2013;Lee et al. 2015;Wang et al. 2016;Song et al. 2020;Sun et al. 2022), genetic linkage map construction (Song et al. 2016), and QTL identification (Song et al. 2020;Yang et al. 2021). SNPs have become the mainstream marker type due to their advantages of high abundance and ease of high-throughput genotyping, which could greatly improve the efficiency of soybean genetic studies and breeding programs (Chen et al. 2021). However, the quality of the SNP marker is the key factor in the success of MAS application and genetic studies (Ben-Ari and Lavi 2012; Chen et al. 2021). In this study, the 40K, 20K and 10K marker panels were all from the well-tested SoySNP50K assay panel reported by Song et al. (2013). The density of the SNPs in these three marker panels was higher in the euchromatic regions than in the centromeres ( Fig. 1a and Fig. S1). The SNP alleles were consistent between resequencing data and the 40K marker panel developed by GBTS among 15 soybean accessions (Tables and Table 3), suggesting that the SNP markers in these three marker panels were reliable and of high quality.
The results from phylogenetic tree analysis of the 15 tested soybean accessions showed that the phylogenetic relationships were consistent based on the markers in the three panels. Interestingly, HJ117 was grouped with JD12, JD17 with Hobbit, and Zhong-huang42 with Qihuang34, which was consistent with their pedigrees. HJ117 is the progeny of JD12; JD17 is the progeny of Hobbit; and Zhonghuang42 and Qihuang34 share the same parent, Youchu4. Additionally, Zhonghuang13 and Zheng196 share the same parents, Juxuan23 and 5905 (Zhang et al. 2014).
A relatively high degree of collinearity of SNP positions between the reference genome and genetic map constructed by the 10K SNP marker panel was observed in the two RIL populations. As expected, one stable QTL associated with soybean 100-seed weight, namely, Locus_OSW_06, was identified on chromosome 06 with a LOD value of 4.09-4.85, explaining 7.05-9.83% of the phenotypic variation. This QTL was reported in a previous study by Pathan et al. (2013). This result suggested that the linkage map constructed by GBTS technology is appropriate for QTL identification.
The SoySNP50K assay is one of the most widely used bead chip assays in the soybean community and has accelerated soybean breeding progress. However, some developing countries have not been able to take advantage of the SoySNP50K assay due to the high costs of purchasing genotyping equipment and carrying out the sequencing process. The 40K, 20K, and 10K SNP marker panels in this study are low cost and have high accuracy, which may solve the problem more effectively in some developing countries. In addition, the 40K, 20K, and 10K panels contained fewer markers than the SoySNP50K assay. However, the evaluation of too many markers will increase the cost and generate redundant information (Song et al. 2020). Marker panels with fewer SNPs could also have certain advantages if they are widely used and commercialized for soybean population evaluation and breeding. For instance, the BARCSoySNP6K assay, with 6000 SNP markers, was carefully chosen from the SoySNP50K assay and was found to be an excellent tool for QTL detection, genomic selection and genetic relationship assessment (Song et al. 2020). In this study, all three marker panels were used for germplasm evaluation, and the 10K marker panel can used for linkage map construction and QTL 26 Page 10 of 11 Vol:. (1234567890) identification in soybean. Overall, the application of these three marker panels was developed by new sequencing technology for GBTS and is reasonable and beneficial for soybean breeding programs.

Conclusion
This study provided an alternative approach for genotyping soybean lines at low cost and with high accuracy. Three SNP marker panels, 40K, 20K, and 10K, were selected from the SoySNP50K assay and were developed by GBTS. The marker panels could be used for soybean germplasm evaluation, linkage map construction, and QTL identification. Most importantly, the resulting genotypic data of soybean accessions from these three marker panels could be compared to the SoySNP50K dataset of the USDA Soybean Germplasm Collection. Data availability The datasets generated and/or analyzed during the current study are available from Long Yan on reasonable request.

Declarations
Ethics approval and consent to participate No applicable.

Competing interests
The authors declare no competing interests.