Platinum-grade soybean reference genome facilitates characterization of genetic underpinnings of traits

Cultivated soybean (Glycine max) is an important source for protein and oil. Each soybean strain has its own genetic diversity, and the availability of more soybean genomes may enhance comparative genomic analysis of soybean. In this study, we constructed a high-quality de novo assembly of an elite soybean cultivar Jidou 17 (JD17) with high contiguity, completeness, and accuracy. We annotated 59,629 gene models and reconstructed 235,109 high-quality full-length transcripts. We have molecularly characterized the genotypes of some important agronomic traits of JD17 by taking advantage of these newly established genomic resources. We reported a high-quality genome and annotations of a wide range of cultivars, and used them to analyze the genotypes of genes related to important agronomic traits of soybean in JD17. We have demonstrated that high-quality genome assembly can serve as a valuable reference for soybean genomics and breeding research community.

We identi ed 58,738 protein-coding genes and 253,172 full-length transcripts in the JD17 genome. The average length of transcripts (2,525 bp) and 3'UTR (707 bp) in JD17 genome are comparatively longer than that in Wm82 (1,628 bp, 374 bp), ZH13 (1,512 bp, 338 bp), and W05 (1,650 bp, 405 bp) genome (Table 3).  Evolutionary relationship between JD17 and other soybean lines We compared the genome sequencing results of the JD17 genome and 68 lines (including 10 wild soybean lines from different regions) [2]. SNVs and small Indels were obtained (Table S2) through aligning the Illumina sequencing results of 68 soybean strains to the JD17 reference genome by using BAW [18], followed by calling variations using GATK [19]. In order to determine the evolutionary status of JD17, all these variations were merged, obtaining a total of 18,156,237 SNV sites and 2,307,757 small Indel (≤ 10 bp) sites. After quality ltering, 5,544,995 SNV sites were used to construct a neighbor-joining tree. The results showed that JD17 was closely related to Williams (PI548631). This result is consistent with that Williams was the parental ancestor of JD17 [12]. JD17 was also closely related to the cultivated varieties Fen Dou No.85, Cang Dou-11 and Jin Da No.52, which were all cultivars in the Huang-Huai-Hai region (Fig. 2).

Characterization of ower color related genes in JD17
In order to determine the genes of important agronomic traits in JD17, we investigated a lot of literature and found 140 genes in total (Table S3). Among them, 25 genes related to soybean owering were discovered (Table S3). Six of them control and regulate soybean ower color, including W1, W2, W3, W4, Wm and Wp [20]. Cloning results showed that these six locus were F3'5'H [21], MYB [22], DFR1 [23],DFR2 [24] FLS1/gm s1 [25] and F3H [26], respectively (Fig. 1). The ower color of soybeans with dominant W1W3W4 genotype is dark purple, the ower color of W1W3w4 genotype is light purple, the ower color of W1w3W4 genotype is purple, and W1w3w4 has nearly white owers [27]. Subsequent research showed that w1w3W4 has white owers [23]. The F3'5'H gene was cloned in Clark and B09121 strains [21]. The ower colors of the two soybean lines are purple (W1) and light purple (w1-lp). Compared with the wild-type W1, the difference between the mutant-type w1-lp is that a single base G is replaced with A at the 653 bp after the initial codon, while the mutant-type w1 has a 65 bp insertion (Fig. 3a). At the W1 locus, the sequence of JD17 is exactly the same as Clark-w1 (white ower), and there is such a 65 bp sequence insertion, indicating that the genotype of JD17 is w1w1, which is consistent with the white ower phenotype of JD17. For the two genes that regulate ower color, W3 (DFR1) and W4 (DFR2), the genotype of w3w4 is characterized by nearly white owers, while the genotype of W3w4 is characterized by purple throat owers [23]. The DFR1 gene was rst cloned in L70-4422, L68-1774, Harosoy and Williams 82 [23]. Comparative analysis found that the genotype of JD17 was consistent with Wm82 except that there was no loss of 32 bp (Fig. 3b). Although they have this difference in sequence, their ower color is still the same, indicating that the 32 bp deletion has no effect on ower color. The W4 (DFR2) gene was cloned in Clark and kw4, Bay and E30-D-1 [24]. The comparative analysis found that the key site of JD17 gene sequence is exactly the same as that of Bay (W4) ( Fig. 3c), which means that the genotype of JD17 is also W4. Therefore, we determined that the gene of JD17 is w1w3W4 with white owers, which is the same as Wm82 [23].
The Wm (FLS/gm s1) gene was rst cloned in Harosoy (WmWm) and Harosoy-wm (wmwm). The homozygous wild-type Harosoy (WmWm) has purple owers, while the homozygous mutant Harosoy-wm (wmwm) is magenta [25]. The change in owers color here is due to the deletion of a single base G from FLS1 in Harosoy. In the Wm82 strain, the FLS1 gene is exactly the same as Harosoy, and they are all homozygous WmWm without single-base G deletion [25]. The gene sequence of JD17 is consistent with Wm82 and Harosoy, except for a single base A insertion in the intron region (Fig. 3d). It is inferred that JD17 should also be a white ower, which is the same as Wm82.

Characterization of owering time-related genes in JD17
Among the genes related to owering (Table S3), there are 19 genes related to the owering cycle, including 3 genes FT2c [28], J gene [29], and E1 gene [30] DNA mutation sites were determined at the molecular level ( Fig. 1). The comparison found that the FT2c gene in the JD17 strain is the same as the allele carrying wild-type GsFT2c (Fig. 4a). The owering time of plants with wild-type GsFT2c was about 2.8 days earlier than that of Wm82 under short-day light, and about 2 days later under long-day light [28]. The mutations at this locus speculate that JD17 should have a short owering cycle under short-day light, about 29 days, and about 45 days under long-day light.
The J gene was cloned in PI 159925, BR121 and Harosoy [29]. In these strains, PI 159925(j) has a single base C deletion at 1,352 bp after the start codon compared with Harosoy(J), which leading to the early termination of encoding (Fig. 4b). However, BR121(j) compared with Harosoy(J) has a 10 bp deletion at 587-596 bp after the initial codon, which resulting in a frameshift mutation. The variation in both cases delayed the owering time under short-day sunlight, from about 32 days to about 45 days [29]. The sequence of this gene in JD17 is consistent with Wm82 and Harosoy (J), so it can be speculated that JD17 should behave to promote owering under short-day light.
The E1 gene was rst cloned in the Harosoy-E1 strain [30]. Compared with Wm82, the coding sequence of E1 gene in the Harosoy-E1 strain is G at 44 bp after the initial codon ( Fig. 4c), while the mutant e1-as genotype is shown here as a missense mutation caused by a C base substitution, the genotype of e1-fs at 49 bp is a frameshift mutation caused by A base insertion, and the mutant e1-as genotype is shown here as a missense mutation caused by a C base substitution [30]. The comparative analysis showed that the genotype in the JD17 genome was consistent with e1-as, which suggested that JD17 should have a shorter owering cycle.

Characterization of seed traits-related genes in JD17
Soybean seeds have many traits, including seed coat color, seed size, grain weight, oil content, protein content, etc., which are all very important agronomic traits.
The GmIRCHS gene is an important gene that controls the color of the seed coat. Compared with the TM strain, the study found that THM (A pigmented seed coat mutant Toyohomare) mainly caused the pigmentation of the seed coat through a 3.3-kb region deleted (Fig. 5a), from normal yellow to brown [31]. The result of sequence alignment revealed that the 3.3Kb was not lost in JD17, but about 56 kb was inserted between the adjacent J and CHS3, and the seed coat color of JD17 was still yellow [11].
The GmHs1-1 gene controls the cracking of the seed coat and is closely related to water penetration. The genotype is distinguished based on mutations at 7 sites in the promoter region of the gene [32]. The sequence comparison evidence of JD17 showed that it was completely consistent with the Wm82 sequence (Fig. 5b), the genotype was Gmhs1-1, and the phenotype was permeable seed coat.
In the seed weight study, the SoyWRKY15a gene was identi ed in cultivated soybean Suinong14 (SN14) and wild soybean ZYD00006. Through 6 polymorphic loci, 4 genotypes were de ned, H1, H2, H3 and H4. The indel at -61 bp in the 5'UTR is a major and independent cis-motif variation (Fig. 5c) [33]. JD17 has the same genotype as Wm82, and can be determined to be H1. According to studies, the phenotype of this genotype has large seeds and relatively large leaves.
The FATB1a gene is related to the palmitic acid content of seeds. Compared with the Jack strain, there is a 254-kb deletion in the mutant N0304-303-3 (Fig. 5d), which directly affects the low-level palmitic acid content of soybean seeds, about 5% [34].
Through the result of sequence alignment, we found that there is such a sequence in JD17 (GmJD_05G0012300). The genotype is No deletion, Homozygous FATB1a, and it is speculated that the palmitic acid content phenotype of JD17 may be around 10%.
The ZF351 gene is closely related to the oil content of soybean seeds, which is mainly determined by the A/T SNP at 767 bp upstream of the gene start codon (Fig. 5e), and the expression of the GmZF351 allele with A at -767 bp is signi cantly higher than that of the GsZF351 allele with T at -767 bp, and promotes the increase of oil content [35]. The result of sequence alignment shows that the genotype of JD17 is GmZF351, which is consistent with Wm82, and it is speculated that the oil content is higher. GmDREBL is another gene related to oil content, and its genotype is determined by the 7 sites upstream of the gene [36]. The sequence of JD17 shows that its genotype is consistent with Wm82 (variation type 2) (Fig. 6a). Since the two genes closely related to oil content in JD17 are the same as Wm82, it is inferred that the oil content of JD17 is also the same as Wm82, which is 18.2%. GA20OX mainly affects the 100-seed weight of seeds, and its genotype is mainly determined by 29 sites in the promoter region ( Fig. 6b) [37]. Sequence alignment showed that the genotype in JD17 was completely consistent with Wm82, and it was inferred that its 100-seed weight was 15-20 g, which is consistent with the phenotype [12].
NFYA gene is also related to oil content, and the main difference in its genotype comes from whether there is an insertion of ~ 1,500 bp in the promoter region (Fig. 6c) [36]. While the sequence of JD17showed that there is such an insertion, it is speculated that the oil content of JD17 is also similar to Wm82, which is 18-22% [12].
The GmSWEET10a and GmSWEET10b genes are related to the seed 100-kernel weight, fatty acid content, and protein content. They were identi ed by GWAS association analysis of more than 800 soybean lines [38]. Among them, the GmSWEET10a gene distinguishes different genotypes by 5 sites upstream of the gene and 5 sites of the gene body (Fig. 6d). The sequence of JD17 is showed be H_III-3 type through aligned with the region of Wm82. GmSWEET10b mainly de nes different genotypes based on the differences of 17 sites (Fig. 6e). According to the sequence characteristics of JD17, its genotype was identi ed as H_III-1 type. According to the phenotypic statistics in the article [38], we can roughly speculate that the seed weight of JD17 is about 18 g, the fatty acid content is about 20%, and the protein content is about 36% [12].

Characterization of genes underlying other important traits
There are also genes related to other important traits, such as the gene GmphyA2 related to pigment synthesis [39], the gene Dt2 related to stem growth [40], the gene SHAT1-5 (Fig. 1), which is closely related to fruit pod dehiscence [41]. And the genotypes and related phenotypes of JD17 were identi ed and predicted respectively based on the characteristics of these published genotypes.
In TK780 and wild soybean Hidaka 4 (H4), the insertion of the repetitive sequence GTTTT changed the GmphyA2 genotype from E4 to e4 (Fig. 7a), resulting in yellow owers appearing on plants under continuous far-red light [39]. By determining that the genotype of JD17 is consistent with the wild type H4, it is inferred that its phenotype is also consistent with that of the wild type H4.
The genotype of Dt2 is mainly determined by the differences in the three sites within the three genes (Fig. 7b) [40]. Although there are some differences between the sequence of JD17 and Wm82 in these regions, these three key sites are consistent. It is concluded that the genotype of JD17 is consistent with Wm82 as dt2/dt2, and the predicted stem growth phenotype is indeterminate. SHAT1-5 gene is a gene related to the control of pod dehiscence identi ed in cultivated soybean HEINONG44 and wild soybean ZYD00755, mainly by promoting the thickening of the secondary wall of brous cap cells to inhibit pod dehiscence [41]. The genotype of JD17 was inferred to be GmSHAT1-5 based on the deletion of its sequence (Fig. 7c). It is speculated that the pod phenotype of JD17 is the same as that of cultivated soybean HEINONG44, and it is easy to crack.

Discussion
In this study, we successfully constructed a platinum-grade de novo assembly and annotation of JD17 genome. The contig N50 of the JD17 genome assembly reached approximately 18 Mb (Figure S1), taking the advantage of long PacBio reads, deep coverage (110X). The chromosomal continuity was obtained through Hi-C analysis. The high-quality assembly of JD17 led to the reduced number of gaps within genes and between genes ( Table 3). The JD17 genome assembly, whose contig N50, scaffold N50, and accuracy have reached 18.0 Mb, 50.6 Mb, and 99.999%, respectively, suggesting that it has reached platinum-grade quality.
The high quality JD17 was further annotated using Iso-Seq long reads, which signi cantly improved gene annotation with increased alternative-splicing isoforms, more complete 3'UTRs and transcripts (Table 3). Therefore, the JD17 genome could be widely used as a high-quality reference genome for soybean biology and molecular breeding.
Taking advantage of this platinum-grade genome assembly, we successfully identi ed the molecular sequences that explain important traits of JD17. By aligning the sequence of cloned genes related to important agronomic traits to the genome sequence of JD17, the genotypes of these genes in JD17 were determined. And based on previous studies, it is consistent with the phenotype, such as white owers, owering cycles, yellow seed coats, 100-seeds weight, oil and protein content and so on [11,12]. And combined with the results of SNPs, it is determined that there are few SNPs near these genes, especially no SNPs exist at the key sites for determining the genotype. High-quality assembly and more precise annotation will help gene location and cloning, and provide accurate locus and mutation information for its genotype. I believe that high-quality genomes will provide a good foundation for molecular cloning, GWAS, and genetic breeding research, especially the discovery of QTLs for important traits in the genetic background of JD17. It will also become an important foundation for soybean genetic diversity research

Conclusions
In summary, we have integrated a platinum-grade genome through PacBio and Hi-C technology, and obtained a more complete genome-wide annotation. Our precise genome and annotation will help accurately mapping the trait-related genes, important agronomic traits and their genotypes and genetic analysis. It will also provide important resources for soybean diversity research and pan-genome research.

Plant and Sample preparation
Soybean seeds (Glycine max cv. JD17) used in this study were from Hebei Academy of Agricultural and Forestry Sciences. The seeds are planted and extracted in Xuelu Wang's Laboratory in Huazhong Agricultural University. The seeds were sterilized with chlorine gas (5 ml of 32% (w/w) HCl to 100 ml 4-5% (wt/vol) sodium hypochlorite in a beaker) for 15 h [42] and then left in a sterile hood for 2 h. The sterilized seeds were sown in growth bottles lled with sand after being soaked in sterile Milli-Q water for 30 min and watered with sterile Fahraeus solution [43] containing 2 mM KNO 3 . Seeds were grown in a growth chamber (light) at 28℃ and 8 h (dark) at 23℃ with 60% humidity for 16 h. RNA preparation and Sequencing 1, 4, 6, 8, 10, 15, 20, 25 and 30 dpi seedings were selected to carry out PacBio isoform sequencing. Underground tissues of inoculation and uninoculation from the 9 timepoints were collected and used for RNA extraction respectively. After that, nine RNA samples of inoculation and uninoculation were mixed equally as a sample, respectively. The two RNA samples were prepared for PacBio isoform sequencing (Iso-Seq). In addition, we also selected different tissues including root, nodule, stem, leaf, pod, seed and ower for mixed RNA-SEq. All the RNA was extracted by TRIzol reagent (Invitrogen 15596026). we performed RNA-Seq on illumina platform and produced approximately 10 Gb raw data with 150 bp pair-end reads.
Whole-genome sequencing using SMRT technology 10-day post inoculation (dpi) root tissue of plants was used for SMRT whole-genome sequencing. Underground tissues were collected for genomic DNA preparation with modi ed CTAB method [44]. Using 97 ug DNA, PacBio sequencing libraries were produced following manufacturers protocols as described for the Greater than 30 kb-SMRTbell Libraries Needle Shearing  (Table S4). At the same time, we also used a part of DNA samples for resequencing with a sequencing depth of approximately 60X. These data are mainly used for the evaluation of genome size and its heterozygosity, post-assembly error correction and genome quality assessment.

Hi-C library construction and sequencing
For samples used for Hi-C assisted assembly, leaves xed in 1% (volume / volume) formaldehyde were used for library construction. Cell lysis, chromatin digestion, proximity ligation treatment, DNA recovery and subsequent DNA manipulation were performed as previously described [45]. The restriction enzyme used in chromatin digestion is Mbol. Finally, the Hi-C library was sequenced on the Illumina HiSeq X 10 platform for 150 bp paired end reads, with a sequencing depth of approximately 100X.
De novo genome assembly of JD17 To perform de novo assembly of the JD17, we combined three different assemblers, including CANU [46] Table  S5). The main assembly was performed on whole SMRT sequenced long reads. All assembly softwares were performed with a presumed 1-Gb genome size. If not speci ed, all programs in our study were run with default parameters. CANU was run with 'errorRate = 0.013', and FALCON was run with 'length_cutoff = -1' for initial mapping of seed reads for the error-correction phase. For a better FALCON assembly, we additionally optimized parameters as 'DBsplit = -x500, --s400, pa_HPCdaligner = -v -B128 -t16 -e.70 -l1000 -s1000 -T8 -M24, ovlp_HPCdaligner = -v -B128 -t32 -h60 -e.96 -l500 -s1000 -T8 -M24 and overlap_ ltering = --max_diff 60 --max_cov 60 --min_cov 2'. The stats of three initial assemblies were show in Table S5. We subsequently used the CANU assembly as the working set because it was better able to phase the diploid genome as well as indicating better contiguity from N50 evaluations. Subsequently, the draft assembly was polished using Quiver (SMRT Link v 5.0.1.9585) over twice iterations and nally corrected using Illumina short reads with Pilon ( Figure S3). Then ltering contigs that were not part of soybeans were aligned to NT library using blastn.

Continuation and connection of contigs
Comparing three different assembly results from CANU [46], FALCON and HGAP4, we found that they have some complementarity, so we continue to further extend and connect the CANU contigs to optimize our assembly results. The extended and connected contigs were performed on CANU contigs using the GPM [47] pipeline. Firstly, GPM will loading Wm82 as reference genome, and loading CANU assembly as input. These contigs were ordered and located on chromosome based on Wm82 (Glycine_max_v2.0) using blastn. This step produced a draft chromosome assembly, as JD17 v0.1. Secondly, we loaded FALCON assembly result and aligned with CANU assembly using blastn [48], then we can get the relationships between CANU and FALCON contigs. So, the CANU contigs can be extended or connected by FALCON contigs, which is the JD17 v0.2. Thirdly, we loaded HGAP4 assembly and repeat the above steps, as the JD17 v0.3. Fourthly, we loaded parts of contigs from assembly was performed on 70 Gb PacBio long reads using CANU and repeat the above steps, as the JD17 v0.4. The whole 120 Gb PacBio long reads are sort by length of reads, and extract 70 Gb reads according the length of reads. In whole process, these extended or connected contigs are supported by PacBio subreads to avoid introducing new errors. Finally, the contigs belong to chloroplast and mitochondrial sequences were removed from JD17 v0.4 (Table S6). At every steps, we merged these assemblies by GPM [47], polished them with Quiver and Pilon [49].
But the JD17 v0.4 was also polished using Quiver over twice iterations and corrected using Illumina short reads with Pilon [49]. The JD17 genome assembly (Glycine_max_JD17v1.0) was 995.0 Mb in size, with a contig N50 of 18.0 Mb ( Table 1). The detailed assembly process is shown in the Figure S3.
The assessment of genomic heterozygosity and size is using the Genomic Character Estimator program, gce v 1.0.0 (ftp://ftp.genomics.org.cn/pub/gce), and the heterozygous ratio based in kmer individuals is 0.029, and the corrected estimate of genome size is about 1.11 Gb.
To anchor hybrid contigs into chromosome, the Hi-C sequencing data were aligned into contigs using bwa. According to the orders and orientations provided by the alignment, those contigs were clustered into chromosomes by ALLHiC v0.9.8 [50] with default params. According to the ALLHiC groups and assembly results create hic les, manual correction and validation were also performed by drawing contact maps with juicerbox [51]. The genome assembly was nalized after this correction (Table  S6).

Quality assessment of JD17 genome assembly
To assess the quality of the Glycine_max_JD17v1.0 assembly, we used our 65 Gb resequencing data. First, by aligning all reads to the assembly with BWA-MEM in BWA [52], the mapping rate is over 98.8% and the coverage was over 99.65%, which shows the consistency between the assembly and reads. By using the GATK tools [13,14] for SNPs calling with JD17 resequencing data, we found 78,033 SNP, of which only 8,000 were homozygous, indicating that the JD17 genome has an accuracy of over 99.999% (Table S2). The completeness of the assembly was estimated by BUSCO with default parameters.

Annotation of TE and ncRNA sequences
To investigate the JD17 genome sequence features, we identi ed transposable elements (TEs) and other repetitive elements by RepeatMasker [44]. MITEs (miniature inverted transposable elements) were collected by MITE-Hunter [53] with all default parameters. In order to get as much reliable LTR (long terminal repeat) retrotransposons information as possible, we used the LTR_retriever [54] analysis process, which integrates the output of LTR_FINDER [55] and LTRharverst tools in GenomeTools [56]. Masking sequence with RepeatMasker (version 4.0.8) (http://www.repeatmasker.org/) based on MITEs and LTR library that has been identi ed. The other tandem repeats were identi ed by constructing a de novo repeat library using Repeatmodeler (version 1.0.11) (http://www.repeatmasker.org/). RepeatMasker was run against the genome assembly again, with all above library as the query library.
Non-coding RNAs were predicted by the Infernal program using default parameters [57] and comparing the similarity of secondary structure between the JD17 genome sequence and Rfam [58] (v12.0) database.

Annotation of protein-coding genes
We performed gene calling analysis with MAKER [59,60], by using multi-sourced EST and protein sequences as evidences (including nonredundant soybean EST/protein sequences from NCBI, assembled JD17 RNA-Seq from mixed samples and Iso-Seq data from underground samples, Wm82 transcripts and proteins, Lotus japonicus and Medicago truncatula protein sequences), AUGUSTUS [61] and SNAP [62] as ab initial gene predictors, and a customized repeat library for RepeatMasker [44,63].
The RNA-Seq data was de novo assembled using Trinity [64] to obtain the assembled cDNA sequence, and annotated with the PASA tool along with the non-redundant isoforms sequence from Iso-SEq. The annotation results will be used for AUGUSTUS model training and prediction. At the same time, these Iso-Seq de-redundant sequences are also used by MAKER for preliminary prediction, and the results are used for de novo sequencing of SNAP. In order to obtain more accurate and complete annotation results, we will use the AUGUSTUS trained model, the SNAP hmm model, RNA-Seq and Iso-Seq evidence, ESTs from NCBI, William 82 transcripts, RepeatMasker annotated repeated sequence results, And protein sequences from multiple species are combined as MAKER's input evidence to make predictions, and EVM is called to integrate the prediction results, and nally the PASA software [65] is used to update these annotation results. The detailed process is shown in the Figure S4.
Finally, in order to obtain more complete alternative splicing information, we manually added the dropped Iso-forms. These transcripts are from the full-length transcript sequence of Iso-seq, and the results are generated by using GMAP [66] alignment to the genome.
Function annotation of protein sequence were predicted by interproscan software with InterPro databases [67]. Nucleotide sequences were aligned to the 'nr' database by BLASTX [48] (version 2.2.28+). Gene Ontology terms for all genes were annotated by Blast2GO [68].
Blastn were used to search highest hit of genes between two genomes (identity > = 90%). For the multiple genes which corresponded to one gene, we only retained the adjacent genes that mapped to different positions within one gene.

Evolutionary analysis
Data source of evolution analysis: The 68 strains used for evolution analysis are all from BIG Data Center (http://gsa.big.ac.cn/index.jsp) under Accession Number PRJCA000205 [69]. The selection of these strains is mainly based on different regions and higher sequencing depth. The detailed strains are shown in the Table S2.
By comparing these 68 strains with JD17, using the bwa tool to compare to the JD17 reference genome, and then using the GATK tool for SNP calling analysis, merge these result for evolutionary analysis, set the lter value to the lter value is "SNP: The sequencing data (PacBio Whole Genome Sequencing data for assembly, resequencing data, Hi-C data and Mixed-Sample RNA-seq data for Annotation) used in this study have been deposited into National Center for Biotechnology Information under BioProject Number PRJNA412346 with accession number SRR12416523 -SRR12416632, SRR12416444, SRR12416750 and SRR9643849 -SRR9643851. The nal assembly had been deposited at GenBank JACKXA000000000.

Funding
The funding body played no role in the design of the study and collection, analysis, and interpretation of data and in writing the  Overview of the JD17 reference genome Tracks from outer to inner circles indicate: a) the chromosome of the genome; b) the gene density map; c) the repeat sequence density map; d) density distribution diagram of SNP sites; e) density distribution diagram of InDel. The reddish color indicates greater density, the yellower color indicates lower density.

Figure 2
Evolutionary relationship between JD17 and other soybean lines The lines marked in red are the representative strains of each area, the blue lines are the branches where JD17 is located, and the blue text is the area to which the strains belong.   Gene models that regulate seed traits a) Gene model of GmIRCHS, which regulates the color of the seed coat. b) Gene model of GmHs1-1, which is related to the cracking of the seed coat. c) Gene model of SoyWRKY15a, which is related to the size of seed. d) Gene model of FATB1a, which is related to palmitic acid content of seed. e) Gene model of GmZF351, which is which is closely related to the oil content of seeds.

Figure 6
Gene models related to seed traits a) Gene model of GmDREBL, which is related to the oil content of seeds. b) Gene model of GA20OX, which is related to the 100-seed weight of seeds. c) Gene model of NFYA, which is related to the oil content of seeds. d) and e) Gene models of GmSWEET10a and GmSWEET10b, which are related to 100-seed weight, fatty acid content and protein content of seeds.