Chromosome Level Genome Assembly Reveals Cassia Tora May Be an Ancient Species in Leguminosae

Background: Cassia tora L. is an annual leguminous plant. Its seeds had wide utility in herbal medicine in Asian, but is usually regard as a potent, invasive weed which even helps farmers eliminate other parasitic weed species. However, compared with other important legume crops, C. tora is far from fully developed and the genetic basis is greatly lacking. A reference genome sequence will be great valuable resource for its genome evolution, genetic breeding and development. Results: Here, we de novo assembled a chromosome-scale genome for C. tora by combining PacBio sequencing technology with chromatin interaction mapping, resulting in 621-Mb genome with a contig N50 of 2.5 Mb, of which 77.44% was ordered and oriented on 13 pseudochromosomes. The genome contained 32,361 protein-coding genes with a repetitive DNA content of approximately 58.15%. Gypsy-type LTRs constituted the largest subfamily. LTR insertion events seldom occurred in this genome over the past 10 million years. Comparative genomic analyses showed that C. tora diverged ~95 million years ago (MYA), which revealed it has the most distant genetic relationship with 11 other legumes. Compared with other legume crops, Cassia tora is maybe an ancient species in Leguminosae. Conclusions: The high-quality reference genome sequence reported here furnishes unprecedented insights into genome dynamics and provides an important basis for future research on legume genome evolution.


Background
Cassia tora L. (syn. Senna tora Roxb.) is an annual leguminous weed in the subfamily Caesalpinioideae [1]. It was originally introduced from the tropical Americas and is now extensively distributed throughout China, Korea, India, Southeast Asia, and the Southwest Paci c [2]. C. tora grows easily and rapidly, is very stress tolerant, and thrives in dry soil from elevations of sea level to 1,800 m. It interferes with the growth of agriculturally important crops and economically useful plants and depletes soil fertility [3]. C. tora disperses widely and occupies substantial tracts of farmland that could otherwise be allocated towards the cultivation of various types of valuable vegetation.
However, C. tora may actually be bene cial to farmers in certain ways. As it starts to grow, C. tora occupies all adjacent spaces and prevents other weeds from establishing there. Natural green grass seldom grows in areas overtaken by C. tora. Thus, it naturally removes and replaces weeds. Parthenium hysterophorus has disastrous effects on agricultural elds and may trigger skin and respiratory ailments. Though it has been di cult to eradicate, P. hysterophorus is effectively inhibited by C. tora. Aqueous extracts of C. tora may contain allelochemicals that suppress the germination of seeds produced by other varieties of weeds [4]. Moreover, the application of C. tora seed gum as a natural coagulant of raw and undiluted pulp and paper mill e uent (PPME) has been investigated [5].
C. tora also contains anthraquinones and has been used in traditional Chinese and Ayurvedic medicine. The dry and ripe seeds of C. tora are known as 'Juemingzi' and considered traditional Chinese medicine (TCM). Their use was recorded in 'Shennong Ben Cao Jing' and they have been administered to treat headache, dizziness, dysentery, and eye disease for > 2,000 y. Modern pharmacology has demonstrated that Juemingzi extracts have antioxidant, hypolipidemic, neuroprotective, and other physiological e cacy [6]. Thus, Juemingzi has become a very popular TCM for the treatment of mutagenicity, genotoxicity, hepatotoxicity, and acute in ammatory diseases [7][8][9][10]. Various pharmacologically active chemical constituents have been detected and identi ed in C. tora seeds including anthraquinones, terpenoids, avonoids, lipids, phenolic compounds, amino acids, and polysaccharides [11,12]. C. tora seeds are also used as an ingredient in functional and popular herbal beverages in Asia. Puri ed C. tora endosperm our has been used as a food thickener (E-427) [13].
As a widely distributed legume plant, compared with other legumes, its use is far from being developed, the genetic basis is greatly lacking, and the breeding of varieties has not been reported. The genome evolution of leguminous plants has aroused widespread concern [14][15][16][17][18][19], but the genome evolution of C. tora is still unclear.
High-quality genomes are the foundations of genetic and genome-wide studies [20][21][22][23]. Here, we assembled a high-quality, chromosome-scale genome sequence for C. tora using a combination of singlemolecule real-time (SMRT), and Hi-C sequencing. This genome sequence will facilitate genomic research, metabolic engineering, and cultivar improvement in C. tora.

Genome size estimate
We selected a landrace line JMZ1 for genome size estimate. The highest peak provided a peak depth of 74 for genome size estimation. As the total number of k-mers was 45,035,768,647, the genome size was calculated to be ~ 601.44 Mb. Therefore, the sequenced Illumina reads (56.18 Gb) provided ~ 93 × coverage. The estimated repetitive sequence content was ~ 58.61% based on a peak depth of 142. As no heterozygosity peak was found, the estimated heterozygosity was ~ 0.02% based on a half-depth of 74. The GC content of the genome of this species was ~ 37.30% (Supplementary Fig. S1; Supplementary Table S1;).

Genome sequencing and assembly
De novo DSS3 genome assembly was performed using a combination of PacBio single-molecule, realtime (SMRT) sequencing and chromatin interaction mapping (Hi-C). We obtained 56.40 Gb raw data by A total of 119 Mb of read pairs was collected via Illumina sequencing and 35.94 Gb of clean Hi-C reads were generated with 57-fold coverage. The scaffolds were broken into 50-kb fragments of equal length and reassembled by Hi-C data analysis. The low-and middle Hi-C coverage depths in this region were localized and identi ed as the error points. The aforementioned genome assembly was corrected and a nal genome was generated that captured 621.06 Mb with contig N50 = 2.5 Mb, 4,473 contigs, and scaffold N50 = 29.35 Mb (Table 1).  Table 2). The Hi-C assembly heatmap of the chromosomes displayed 13 distinct groups which denoted high-quality assembly (Fig. 1). Note: Ratio% of rst two items shows % of clustered sequence number or length compared to contig number or genome size. Ratio% of next two items shows % of ordered and oriented sequence number or length compared to cluster sequence number or sequence.
We used the raw Illumina HiSeq paired-end reads to validate the genome assembly. Of these, 85% could be properly mapped to the assembly (Supplementary Table S5). Thus, the genome assembly contained comprehensive genome information. The Core Eukaryotic Genes Mapping Approach (CEGMA) and Benchmarking Universal Single-Copy Orthologs (BUSCO) methods were used to assess genome assembly accuracy and completeness. The CEGMA v. 2.5 alignment showed that 448 CEG proteins (97.82%) with high-con dence hits and 229 (92.34%) highly conserved full-length sequences were contained in the genome assembly. A BUSCO analysis revealed that we obtained > 93.89% (1,352/1,440 BUSCOs) genome coverage.
Annotation of the C. tora genome sequence Using homology-based and de novo approaches, we identi ed 361.17 Mb of repetitive elements accounting for 58.15% of the genome. The main repetitive sequences were transposable elements (TEs) of which the most abundant were long terminal repeats (LTRs) (Supplementary Table S6). Among the Class I elements (retrotransposons), the Gypsy-, LARD, and Copia-type LTRs represented the three largest subfamilies (19.88%, 7.12%, and 6.81%, respectively). Among the Class II elements (DNA transposons), the CACTA repeats and Helitron-type elements occupied 1.68% and 1.39% of the genome, respectively. The remaining genome consisted of putative host genes (1.74%) and simple sequence repeats (SSRs; 6.82%). Most of these TEs were distributed in the central parts of the chromosomes. Most of the repeat sequence types were unequally distributed along the chromosomes. Their densities were higher in the centromeric than the telomeric regions on all pseudomolecules with similar patterns (Fig. 2). Overall, the centromeric regions were enriched in various repeat elements.
We identi ed 11.64% unclassi ed repeats in the genome. Thus, many new LTR types could be discovered in plants. We hypothesize that these TEs are different from those of the ancestral legume species and may have contributed to diversi cation and speciation in these plants.
A combination of homology-based, de novo, and RNA-seq approaches predicted 32,361 protein-coding genes that covered 20.37% of the genomic sequence length (126.52 Mb) (Supplementary Table S7). In total, 87.76% of the predicted genes were supported by homology-based-and RNA-seq data derived from a mixture of leaf, stem, ower, and pod tissues ( Supplementary Fig. S2). The average gene size and exon number per gene were 3,875 bp and 5.09, respectively (Supplementary Table S8). The average exon and intron lengths were 239.09 bp and 528.87 bp, respectively. Of the predicted genes, 29,163 (90.12%) were functionally annotated in the NR, TrEMBL, GO, KOG, and KEGG databases (Supplementary Table S9).
Most of the protein-coding genes were concentrated at the distal chromosome regions and were relatively sparse in the proximal regions on pseudochromosomes with similar distribution patterns (Fig. 2). Thus, the gene distribution was non-random and resembled that for other plant species reported [23,24].
We predicted 3,948 non-coding genes including 632 transfer RNAs, 446 ribosomal RNAs, 88 microRNAs, and 2,782 pseudogenes with evidence of transcription but no consistent coding sequence. We also detected 2,980 motifs and 34,761 domains (Supplementary Table S10).
Comparative analysis of the genomes of C. tora and other plant species  Fig. S3). The contracted gene families in C. tora were enriched in 'biological death process', 'antioxidant activity', and 'nutrient reservoir activity'.
A phylogenetic tree was constructed based on the single-copy orthologous genes shared by C. tora and 11 other legumes (Fig. 3). The legumes differentiated from grapes over 116 MYA. Cassia tora differentiated from 11 other legumes ~ 95 MYA. Therefore, C. tora is only distantly related to other legumes. After ~ 20 MYA, peanut differentiated from the other nine legumes. The latter then evolved into two clusters. Five vegetable legume species grouped into one cluster and differentiated only ~ 33-4.7 MYA. The other cluster included four legumes (alfalfa for leaf use, licorice for root use, lobelia for leaf use, and chickpea for bean use) that differentiated as early as 56-34 MYA.
LTR family expansion number and time were analyzed for the genomes of C. tora and 11 other plant species (Fig. 4). Using C. cajan as a reference, no distinct peak LTR ampli cation activity was observed in C. tora over 10 MYA. Thus, the C. tora genome was stable and inactive compared to the other 11 species.
Collinearity analysis between C. tora and A. ipaensis Cassia tora and A. ipaensis (subfamily Faboideae) diverged ~ 95 MYA. A synteny analysis between C. tora and A. ipaensis revealed that 33.79% of their genes were collinear in 490 blocks (Fig. 5). The genes on one peanut chromosome were dispersed over four C. tora chromosomes. This nding re ected poor collinearity and demonstrated substantial differences between genomes.
The 4DTv (fourfold degenerate transversion) distribution of the homologous gene pairs within these syntenic blocks suggested that C. tora has not undergone any recent lineage-speci c whole-genome duplication (WGD) event. Rather, it experienced the paleohexaploidy event (γ) common to all eudicots (Fig. 6).

Discussion
Compared with other legumes, the genetic research of C. tora is still in its infancy. Although its nutritive and pharmacological properties of the active metabolites have been investigated [7][8][9][10] and their biosynthetic pathways have been preliminarily characterized [25], to the best of our knowledge, the C. tora genetic basic research is very limited.
Single-molecule sequencing technology combined with Hi-C assembly technology [26,27] has been widely used in the construction of a chromosome level high-quality genome [28][29][30][31]. A high quality genome at chromosome level lays a very important foundation for the further research and development of a species [19,24,32,33]. With the dividend of the decreasing cost of sequencing, we were able to sequence the species with Single-molecule sequencing technology and assemble it with Hi-C technology to form a high-quality genome. The ancestral legume karyotype was predicted consisting of a minimum of 19 proto-chromosomes [19]. In the study, the Hi-C assembly displayed 13 distinct groups. We propose the C. tora genome maybe have been massively rearranged during evolution.

Genome expansion in plants is primarily driven by polyploidization (whole-genome duplication events)
and the proliferation of TEs [34]. Based on the single-copy orthologous genes shared by C. tora and 11 other legumes (Fig. 3), these legume species differentiated from grapes on about 116 MYA. C. tora differentiated from 11 other legumes ~ 95 MYA which revealed C. tora has the most distant relationship with other legumes. After ~ 20 MYA, peanut differentiated from the other nine legumes. Other legume species differentiated in 75-80 MY, which was consistent with recent report about pea genome [19].
Among the other 11 legumes, A. ipaensis has the closest relationship with it. Analysis of genome sequences of different legumes indicated that whole genome duplication has played an important role in evolution of legume genome [17]. The 4DTv distribution of the homologous gene pairs within these syntenic blocks suggested that C. tora has not undergone any recent lineage-speci c whole-genome duplication (WGD) event except for the paleohexaploidy event (γ) common to all eudicots. A synteny analysis between C. tora and A. ipaensis re ected poor collinearity.
Repetitive elements were major drivers in the evolution of these large genomes such as Pea genome (~ 4.45 Gb) [19]. In the Fabeae tribe of Leguminosae, genome dynamics are dominated by a single lineage of Gypsy elements that account for 57% of the variation in genome size in this clade [34,35]. In C. tora genome, Fifty-eight percent of the genomic sequences belongs to repetitive. The main repetitive sequences were TEs, of which the most abundant were long terminal repeats (LTRs) and the Gyps -type LTRs represented the three largest subfamilies, which was consistent with previous reports [34,35]. Compared to the other 11 species, LTR ampli cation activity peak in C. tora was not observed in over 10 MYA, which re ected a feature of genome formation which is not active. Another possible explanation is that the LTR-TEs in C. tora had undergone lower arti cial selection than those in the other species. Moreover, as C. tora naturally self-pollinates, it has low heterozygosity (0.02%). The foregoing genomic analysis indicated that the Cassia tora genome is ancient and stable.
In addition to comparative genomic research, this genome also will provide an important foundation for the later research, such as characterizing and ampli ng the genes encoding the anthraquinones in C. tora, or identi ng the genes responsible for its virulence against other plants (bene cial or detrimental).

Conclusion
Here, we produced a high-quality genome sequence for C. tora that may serve as a vital tool to analysis features of the C. tora genome and help to characterize legume family and establish deep syntenic relationships among different taxa of legumes. The foregoing analysis indicated that the C. tora was a relatively stable and relatively old genome. The valuable genome sequence will also facilitate the characterization of its many excellent traits, enhance genetic improvement and allow more e cient use of the genus.  Table 1a). All of the ∼56.18 Gb of clean reads obtained from the Illumina platforms were subjected to 19-mer frequency distribution analysis. The K-mer distribution can be used to infer the estimated genome size using the following formula: genome size = kmer number/peak depth.

Materials
Genome sequencing, assembly, and scaffolding The libraries for SMRT sequencing were constructed using DNA isolated from leaf tissue according to the standard protocol provided by Paci c Biosciences (Menlo Park, CA, USA). This protocol comprised library construction and sequencing as well as library and sample quality detection.
De novo assembly of the PacBio reads was initially conducted with the Canu pipeline (v1.5) and Falcon (https://github.com/ruanjue/wtdbg) assemblers. Their parameters were set for high consistency and continuity. The sequence assembled by Canu was then used to assemble the genome in Wtbdg (v1.2.8) (https://github.com/ruanjue/wtdbg). Illumina paired-end reads were then used in assembly polishing with Pilon v. 1.17 to correct the assembly [36]. Statistical information is shown in Supplementary Table 2.
Hi-C sequencing, assembly, scaffolding, and evaluation Hi-C experiments were performed as previously reported [37]. Both valid and non-valid interaction pairs were identi ed by HiC-Pro [27]. Hi-C data were aligned with the initially assembled genome (v. 1.1) by BWA v. 0.7.10-r789 to obtain mapped data [38]. The paired-end reads were mapped onto v. 1.2 PacBio contigs which were then grouped into13 chromosome clusters and scaffolded using Lachesis software [37].

Genome quality evaluation
Illumina reads were used to map and con rm deciphered sequence coverage (Supplementary Table 5). The completeness of the nal genome assembly was evaluated by CEGMA v.2.3 [32]and Benchmarking Universal Single-Copy Orthologs (BUSCO v2.0) [39].
All of the predicted gene structures were integrated into a consensus set with EVidenceModeler [51].

Comparative genome analysis
OrthoMCL (v2.0) [67] was used to analyze the family classi cation of protein sequences. PHYML [68] software was used to construct the phylogenetic tree. CAFE (v4. Neighbor-joining (NJ) tree for C. tora and 11 other legume species. Phylogenetic analysis and divergence time estimation for 13 plant species and gene family dynamics for each branch. Divergence times are indicated by black numbers next to branching nodes (right). Blue and red numbers below each species name (left) indicate numbers of gene family contraction and expansion events, respectively.