The Glycine max sample was collected from Shijiazhuang (37°6′25″N, 114°42′47″E). Genomic DNA and total RNA were isolated from leaf tissues. High-quality DNA was extracted using QIAGEN® Genomic kits. The extracted DNA was quantified and checked by three methods: a NanoDrop 2000 Spectrophotometer (Thermo Fisher Scientific), agarose gel electrophoresis, and a Qubit Fluorometer (Invitrogen). After these checks, the DNA was purified using AMPure PB beads (PacBio 100-265-900), and the resulting high-quality genomic DNA (gDNA) was used for library construction. The size and concentration of the library fragments were assessed using an Agilent 2100 Bioanalyzer (Agilent Technologies, USA). Qualified libraries were evenly loaded on SMRT Cells and sequenced for 30 hours on the Sequel II/IIe system (Pacific Biosciences, CA, USA).
Briefly, for Hi-C library construction, the sample was first fixed with formaldehyde and then digested with the HindIII restriction enzyme. The DNA ends were repaired and labeled with biotin, and T4 DNA ligase was used to ligate the interacting fragments into circular molecules. After ligation, proteinase K was added to reverse the cross-linking and to digest the proteins bound to the ligated DNA fragments, yielding purified DNA. The purified DNA was then sheared into fragments of 300 to 500 bp, and the biotin-labeled fragments were isolated using Dynabeads® M-280 Streptavidin (Life Technologies). Finally, the Hi-C library was constructed and sequenced on the Illumina NovaSeq 6000 platform with 150 bp paired-end reads.
To obtain high-quality data, the raw polymerase reads were subjected to quality control using the PacBio SMRT-Analysis package (https://www.pacb.com). The following polymerase reads were filtered out: (1) polymerase reads shorter than 50 bp, (2) polymerase reads with a quality value below 0.8, and (3) polymerase reads consisting of self-ligated adaptors. Adaptor sequences were also removed from the retained polymerase reads. SMRT Link 9.0 (parameters --min-passes = 3 --min-rq = 0.99) was then used to generate CCS reads for subsequent assembly.
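The length and quality thresholds above can be illustrated with a minimal Python sketch. It is not the SMRT-Analysis implementation: the `PolymeraseRead` record and the function names are hypothetical, and only the two numeric cutoffs (50 bp and 0.8) come from the text.

```python
from dataclasses import dataclass

MIN_LENGTH_BP = 50      # cutoff stated in the text
MIN_READ_QUALITY = 0.8  # cutoff stated in the text


@dataclass
class PolymeraseRead:
    """Hypothetical record layout, used only for illustration."""
    name: str
    length: int      # read length in bp
    quality: float   # read quality value in [0, 1]


def passes_filter(read: PolymeraseRead) -> bool:
    """Keep reads that are at least 50 bp long with quality of at least 0.8."""
    return read.length >= MIN_LENGTH_BP and read.quality >= MIN_READ_QUALITY


def filter_reads(reads):
    """Return the reads that survive both cutoffs, in input order."""
    return [read for read in reads if passes_filter(read)]
```

Adaptor detection and trimming (criterion 3) are left out of the sketch, since they depend on the adaptor sequences and the alignment settings used by the vendor software.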
Hifiasm (https://github.com/chhylp123/hifiasm) was employed to assemble the HiFi reads into a preliminary genome assembly (primary contigs). To obtain a chromosome-level genome, we performed Hi-C-assisted assembly. The ~114.5 Gb of raw reads (Data file 1 and Data file 2) underwent preliminary quality control with fastp [14], and the resulting clean reads were aligned to the primary contigs using HiCUP. Only valid read pairs were used for further analysis. ALLHiC was used for Hi-C-assisted scaffolding, and Juicebox was then used to fine-tune the ALLHiC clustering results. The final genome had a contig N50 of 19.32 Mb and a total contig length of 1041.94 Mb, with a scaffold N50 of 51.43 Mb and a total scaffold length of 1041.95 Mb (Data file 3 and Data file 4).
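For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows the standard definition: the length of the shortest sequence in the smallest set of longest sequences that together cover at least half of the total assembly length. The function name is illustrative and this is not the script used by the authors.

```python
def n50(lengths):
    """Return the N50 of a collection of contig or scaffold lengths (in bp)."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    ordered = sorted(lengths, reverse=True)  # longest first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the cumulative sum reaches half the total.
        if running >= half_total:
            return length
```

Applied to the contig lengths and then to the scaffold lengths of an assembly, the same function gives the contig N50 and the scaffold N50, respectively.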
To assess the quality of the assembly, a custom script was used to compute the number of scaffolds clustered into each chromosome, the chromosome sequence lengths, and the genome anchoring rate. The Hi-C anchoring rate was calculated from the number of sequences assembled to the chromosome level and the number of sequences that were not. The chromosome-level genome was partitioned into equal-length bins of 500 kb, and the number of Hi-C read pairs spanning any two bins was used as the signal intensity of the interaction between those bins. Heatmaps (Data file 5) were generated from these signals. BUSCO (Benchmarking Universal Single-Copy Orthologs: http://busco.ezlab.org/) [18] was also applied to assess genome quality. A core gene set of 248 genes conserved across six eukaryotes was used for the CEGMA [19] evaluation. The evaluation showed that most core eukaryotic genes (97.18%) and most genes in the BUSCO dataset (99.4%) were successfully identified (Data file 6).
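The anchoring rate and the 500 kb binning described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' script: the function names, the `(chromosome, position)` input layout, and the use of 0-based positions are all hypothetical, and `anchoring_rate` accepts either sequence counts or summed lengths because the text does not specify which unit the reported rate uses.

```python
from collections import Counter

BIN_SIZE = 500_000  # 500 kb bins, as stated in the text


def anchoring_rate(anchored, unanchored):
    """Fraction of the assembly placed on chromosomes.

    Both arguments must share a unit (sequence counts or total bp).
    """
    total = anchored + unanchored
    if total == 0:
        raise ValueError("empty assembly")
    return anchored / total


def contact_counts(read_pairs, bin_size=BIN_SIZE):
    """Count Hi-C read pairs per pair of bins.

    Each read pair is ((chrom_a, pos_a), (chrom_b, pos_b)) with 0-based
    positions (an assumed layout). Keys are ordered so that a contact
    between bins A and B is counted once, whichever mate came first.
    """
    counts = Counter()
    for (chrom_a, pos_a), (chrom_b, pos_b) in read_pairs:
        bin_a = (chrom_a, pos_a // bin_size)
        bin_b = (chrom_b, pos_b // bin_size)
        first, second = sorted((bin_a, bin_b))
        counts[(first, second)] += 1
    return counts
```

The resulting counts form the upper triangle of a symmetric contact matrix, which is the kind of signal a Hi-C heatmap displays.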
RepeatMasker [21] and RepeatProteinMask (http://www.repeatmasker.org/) were employed to identify sequences similar to known repeat sequences, and LTR_FINDER [22] was used for de novo prediction. In total, 361,475,923 bp of RepBase TEs and 453,714,080 bp of de novo repetitive sequences were identified (Data file 7). Gene structures were predicted using AUGUSTUS (http://bioinf.uni-greifswald.de/AUGUSTUS/) [24] (Data file 8 and Data file 9). The resulting gene set was then annotated against the protein databases NR (https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/), SwissProt (http://www.uniprot.org/), KEGG (http://www.genome.jp/kegg/) and InterPro (https://www.ebi.ac.uk/interpro/). A total of 57,151 genes were predicted, of which 54,550 were functionally annotated (Data file 10). The circular plot illustrates gene density, transposable element (TE) density, and GC density (Data file 11). tRNAscan-SE [29] (http://lowelab.ucsc.edu/tRNAscan-SE/) was used to identify tRNA sequences in the genome, BLAST [30] alignment was used to find rRNA, and INFERNAL (http://infernal.janelia.org/) was used to predict miRNA and snRNA sequences. The copy numbers of miRNA, tRNA, rRNA and snRNA ranged from 68 to 5,116 (Data file 12).
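As an illustration of the GC density track mentioned above, the following Python sketch computes the GC fraction in consecutive non-overlapping windows of a sequence. The window size used for the published plot is not stated in the text, so `window` is left as a parameter, and the function names are hypothetical.

```python
def gc_fraction(sequence):
    """GC fraction among called bases (A, C, G, T); None if there are none."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    called = gc + seq.count("A") + seq.count("T")
    if called == 0:
        return None  # e.g. a window made entirely of N
    return gc / called


def windowed_gc(sequence, window):
    """GC fraction for consecutive non-overlapping windows of `window` bp."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        gc_fraction(sequence[start:start + window])
        for start in range(0, len(sequence), window)
    ]
```

Ambiguous bases are excluded from the denominator so that gap regions do not pull the GC estimate towards zero; the last window may be shorter than the others.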
Table 1
Overview of data files/data sets.
| Label | Name of data file/data set | File type (file extension) | Data repository and identifier (DOI or accession number) |
|---|---|---|---|
| Data file 1 | Statistics on sequence data | Spreadsheet (.xls) | https://figshare.com/s/6de11eca18b3ccef8314 [12] |
| Data file 2 | Hi-C raw data | Fastq file (.fastq) | https://ngdc.cncb.ac.cn/gsa [13] |
| Data file 3 | Assembly statistics of HJ117 | Spreadsheet (.xls) | https://figshare.com/s/6de11eca18b3ccef8314 [15] |
| Data file 4 | genome.fa | Fasta file (.fasta) | https://figshare.com/s/6de11eca18b3ccef8314 [16] |
| Data file 5 | Hi-C interaction heatmap | Image file (.tif) | https://figshare.com/s/6de11eca18b3ccef8314 [17] |
| Data file 6 | Assessment results of CEGMA and BUSCO | Spreadsheet (.xls) | https://figshare.com/s/6de11eca18b3ccef8314 [20] |
| Data file 7 | Results of transposable element classification statistics | Spreadsheet (.xls) | https://figshare.com/s/6de11eca18b3ccef8314 [23] |
| Data file 8 | Results of gene structure prediction | Spreadsheet (.xls) | https://figshare.com/s/6de11eca18b3ccef8314 [25] |
| Data file 9 | Glycine.max.gene.gff | Gff file (.gff) | https://figshare.com/s/6de11eca18b3ccef8314 [26] |
| Data file 10 | Genome annotation of HJ117 | Spreadsheet (.xls) | https://figshare.com/s/6de11eca18b3ccef8314 [27] |
| Data file 11 | Overview of the HJ117 reference genome | Image file (.tif) | https://figshare.com/s/6de11eca18b3ccef8314 [28] |
| Data file 12 | Statistics on non-coding RNA annotation results | Spreadsheet (.xls) | https://figshare.com/s/6de11eca18b3ccef8314 [31] |
| Data file 13 | Clean RNA reads of leaf tissues | Fastq file (.fastq) | https://ngdc.cncb.ac.cn/gsa [34] |
| Data file 14 | Clean HiFi data | Fastq file (.fastq) | https://ngdc.cncb.ac.cn/gsa [35] |