Sequencing and assembly. The sequencing and assembly of the B. ramiflora genome were performed based on long-read PacBio SMRT sequencing in combination with short-read Illumina HiSeq2500 sequencing for error correction. Through approximately 60 Gb of next-generation sequencing data, we conducted a survey analysis of the genome with a K-mer of 19 and estimated that its genome size was approximately 973 Mb, with a heterozygosity of 0.634% and a repeat sequence ratio of 57.5% (Fig. S1). The Illumina HiSeq2500 sequencing generated approximately 68.6 Gb of raw reads, with a sequencing depth of 60×, while the PacBio SMRT sequencing generated 114.14 Gb of raw reads, with a sequencing depth of 100×. The B. ramiflora genome was assembled through different assembly methods (Table S1), including CANU + deredundancy, WTDBG (Canu error correction + WTDBG assembly), WTDBG + deredundancy, and Flye + deredundancy. The CANU + deredundancy method showed the lowest properly paired mapping rate (69.59%) in the next-gen sequencing data, and WTDBG, Flye, and WTDBG + deredundancy showed obvious abnormal secondary peaks in the GC depth graph (inconsistent with the GC content of the main peak, with an average GC depth of below 20×). Considering that the assembly difficulty may be related to the complexity of the species itself, based on the positions of the abnormal peaks in the above assembly results, we proportionally filtered the original subreads, in which subreads with a GC content in the range of 40–45%/45–50% were filtered out, and the GC content of the filtered subread sequences essentially became normally distributed. The filtered subreads were assembled using the Canu assembler, after which the preliminary assembly did not show any abnormal GC peak, and the genome size after deredundancy was close to the survey evaluation size, with a ContigN50 of approximately 300 kb. To further improve the assembly, after filtering out the subreads highly homologous to mitochondrial and chloroplast sequences, the remaining sequences were assembled using the Canu + deheterozygosity method, which achieved the best assembly, with a normal distribution of GC content and no abnormal peaks.
The size of the final assembly of B. ramiflora genome was approximately 975.8 Mb, which was slightly larger than the genome size estimated based on K-mer. This was likely because the high heterozygosity of the genome led two sister chromatids to be assembled together, resulting in a mixed genome greater than the actual size. The assembly contained 3,346 contigs, with the following statistics: GC content: 35.31%; N50: 509.33 kb; Contig Min: 1.63 kb; N90: 118.07Kb; Contig Max: 7.74 Mb. For plant genomes with high heterozygosity, an assembly with a contig N50 of 509.33 kb is considered good. Compared with the assembled genomes of other plant species of family Euphorbiaceae, i.e., Ricinus communis (genome size: 350.5 Mb; Contig N50 21.4 kb), Manihot esculenta (genome size: 693.6 Mb; Contig N50 3.3 Mb), Hevea brasiliensis (genome size: 1.5 Gb; Contig N50 152.7 kb), and Jatropha curcas (genome size: 266.8 Mb, Contig N50 134.3 kb) (data source: NCBI database), the genome of B. ramiflora is large, being smaller only than that of H. brasiliensis, with a Contig N50 only smaller than that of M. esculenta.
Genomic evaluation. To verify the accuracy and completeness of the assembly, we first evaluated the data mapping ratio of the DNA library, which accounted for 98.86% of the total. When we used the RNA libraries of three different tissues of B. ramiflora for alignment, the alignment rate was 88.38% for roots, 89.73% for stems, and 90.43% for leaves (Table S2). Finally, we evaluated the B. ramiflora genome assembly using the benchmarking universal single-copy orthologs (BUSCO)21 and found 2,121 BUSCOs, of which 97.5% were single-copy BUSCOs (86.7% were both unique and complete, and 10.8% were multicopy and complete), 0.8% were BUSCOs with incomplete coverage, and 1.7% were missing BUSCOs, indicating that most of the single-copy genes in the genome of B. ramiflora are included without repeats or overassembly. The genome assembly is high-quality and has high coverage.
Gene annotation. The information of assembled repeated sequences of B. ramiflora is summarized in Table S3. The assembled repeated sequences accounted for 73.47% (716.93 Mb) of the total sequences, higher than the K-mer estimation (57.5%), indicating the complexity of the B. ramiflora genome. Long terminal repeat (LTR)-retrotransposons are a class of transposon with the highest proportion of repeat sequences in plant genomes and the main cause of genome expansion, so high-quality sequences of LTR-retrotransposons are crucial to a genome assembly22,23. Sequences of LTR-retrotransposons in the B. ramiflora genome, which accounted for 52.1% of the total sequence, were mainly of the types Copia (89.67 Mb, 9.19% of the genome) and Gypsy (281.22 Mb, 28.82% of the genome), and the LTR-retrotransposons of other types were 137.53 Mb in length (14.09% of the genome), indicating that the B. ramiflora genome contains many unique LTR-retrotransposons. Tandem repeats were very rare, only accounting for 0.09% of the total. Unknown repeat sequences accounted for 15.91% of the total, indicating that the genome contains many new genes and unique genes. Statistical results showed that the size of the B. ramiflora genome was mainly determined by the LTR-retrotransposons with their repetitive sequences, which accounted for 70.91% of the total repetitive sequences and 52.1% of the total genome. There were still many unknown repeat sequences in the B. ramiflora genome. A total of 3,452 noncoding RNAs were found in the genome, which were mainly snRNA, rRNA, and tRNA sequences and numbered 1981, 674, and 526 respectively.
We performed a de novo, homology-based search and RNA-Seq to predict genes from repeat-masked B. ramiflora genome sequences. Table 1 shows the final prediction results. A total of 29,172 protein-coding genes, 127.25 Mb in total length, were predicted for the B. ramiflora genome, with an average size of 4.4 kb per gene, generating 32,692 transcripts (52.18 Mb in total length, 1.60 kb in size on average), each gene generating an average of 1.1 transcripts. There were a total of 185,435 exons, including 178,416 coding exons (281 bp on average), each transcript containing 5.7 exons on average; there were a total of 152,743 introns (629 bp on average). The total length of coding sequences (CDS) was 40.83 Mb (1,248 bp in size on average). Compared with some species of the Euphorbiaceae family, such as R. communis (number of predicted genes: 26,049), M. esculenta (number of predicted genes: 34,312), H. brasiliensis (number of predicted genes: 42,686), and J. curcas (number of predicted genes: 22,718) (data source: NCBI database), B. ramiflora has a medium number of genes, which is likely related to its high proportion of repeat sequences throughout its genome.
Table 1
Gene model prediction statistics in B. ramiflora
Type | Value |
Number of genes | 29,172 |
Total genic length (bp) | 127,254,215 |
Mean gene length (bp) | 4,362 |
Number of transcripts | 32,692 |
Transcripts per gene | 1.1 |
Total transcript length (bp) | 52,174,963 |
Mean transcript length (bp) | 1,595 |
Number of exons | 185,435 |
Exons per transcript | 5.7 |
Exon length (bp) | 281 |
Number of coding exons | 178,416 |
Number of introns | 152,743 |
Mean intron length (bp) | 629 |
coding region Total CDS length (bp) | 40,826,528 |
Average length of coding regions (mean CDS length) (bp) | 1,248 |
We next compared the gene function annotation results with those of five major databases (NR, Swiss-Prot, eggNOG, GO, and KEGG) and found that 25,980 genes (89.06%) had been annotated and 3,192 genes (10.94%) had not, indicating that there were still many unknown genes in the B. ramiflora genome, which suggested that B. ramiflora may have a highly unique evolutionary process. Some 25,962 (89.00%), 19,939 (68.355%), 24,699 (84.67%), 17,621 (60.40%), 8,724 (9.91%) coding genes had been annotated in the NR, SwissProt, eggNOG, GO, and KEGG databases, respectively. Fewer genes of B. ramiflora were found in the KEGG database, likely because B. ramiflora is a wild tree species containing many genes of unknown function.
Phylogenetic evolution analysis. To further understand the evolutionary position of B. ramiflora, we performed gene homology analysis on B. ramiflora and five related angiosperms (R. communis, M. esculenta, H. brasiliensis, J. curcas, and Populus euphratica) using Arabidopsis thaliana as the outgroup, the seven species containing 206,360 protein sequences, including 27,886, 18,174, 26,645, 30,653, 20,267, 28,266, and 22,938 sequences, respectively, with 3,283, 304, 747, 1,024, 542, 875, and 3,081 unique protein sequences and 6,267, 8,440, 5,836, 4,836, 8,214, 4,407, and 6,863 single-copy gene sequences. The result of gene family cluster analysis of five species of the original Euphorbiaceae (B. ramiflora, R. communis, M. esculenta, H. brasiliensis, and J. curcas) is shown in Fig. 2. The common gene families contained a total of 11,667 genes, while B. ramiflora-specific gene families contained 1,150 genes, the most among the five species, suggesting that B. ramiflora has acquired new gene families/genes in the evolutionary process. In B. ramiflora-specific gene families, 102 GO enrichment categories, mainly in catalytic activity, transferase activity, kinase activity, hydrolase activity, and nucleic acid metabolic process, were found. Thirteen enriched KEGG pathways, mainly in three aspects, i.e., genetic information (mismatch repair, homologous recombination, basal transcription factors, DNA replication, RNA degradation, nucleotide excision repair, and spliceosome), metabolic pathways (monoterpenoid biosynthesis, isoquinoline alkaloid biosynthesis, glutathione metabolism, tyrosine metabolism, alanine, aspartate, and glutamate metabolism), and plant-pathogen interaction, were found.
We performed clustering analysis on the gene families of B. ramiflora, R. communis, M. esculenta, H. brasiliensis, J. curcas, Populus euphratica, and A. thaliana using the OrthoMCL method, in which 1,739 single-copy homologous genes were used to generate the phylogenetic tree. As shown in Fig. 3, B. ramiflora clustered into two sister groups with four species of the Euphorbiaceae family, which is consistent with the APG IV system, in which B. ramiflora belongs to the Phyllanthus genus. We estimated that the divergence between B. ramiflora and Euphorbiaceae occurred 59.9 Mya (in the range of 50.4–75.2 Mya), and B. ramiflora and P. euphratica diverged from Euphorbiaceae 66.6 Mya (in the range of 53.4–75.2 Mya), while A. thaliana diverged from the other species 108.0 Mya (in the range of 107–109 Mya).
Gene family expansion and contraction analysis. During the evolutionary process, 173 gene families of B. ramiflora expanded, mainly in metabolic pathways, including linoleic acid metabolism, monoterpenoid biosynthesis, alpha-linolenic acid metabolism, sesquiterpenoid and triterpenoid biosynthesis, glyoxylate and dicarboxylate metabolism, peroxisome, diterpenoid biosynthesis, tropane, piperidine and pyridine alkaloid biosynthesis, and anthocyanin biosynthesis; 22 gene families contracted, mainly in kinase activity, transferase activity, phosphorylation, metabolic process, and catalytic activity. Compared with four species of Euphorbiaceae, B. ramiflora showed nine collapsed gene families and no expanded gene families, indicating that B. ramiflora is closely related to the Euphorbiaceae family.
A total of 278 candidate positive selection genes (P < 0.05) were screened through the likelihood ratio test and then subjected to functional enrichment analysis, which registered them to 160 GO pathway categories and four KEGG pathway categories (Fig. 4).
Whole-genome duplication and collinearity analysis. Whole-genome duplication (WGD) events are of great significance in plant evolution and speciation. Accompanied by the doubling of genes, they provide new genes for plant evolution, accelerate the generation of new genes and the expansion of gene families, and improve the plant’s ability to adapt to environmental changes, thereby promoting plant evolution. The Ks and transversions at four-fold degenerate sites (4DTv) distribution maps of the homologous genes of B. ramiflora, P. euphratica, and M. esculenta (Fig. 5) showed that the B. ramiflora genome had one peak, corresponding to Ks and 4DTv values of 2.4 and 0.6, respectively, which were also present in both M. and P. euphratica, suggesting a genome-wide triplication event (γ) that occurred before the eudicot differentiation. No recent genomic duplication signal was detected in the B. ramiflora genome.
The Vitis vinifera genome contains chromosomal relics of its angiosperm ancestors and has been commonly adopted as a reference genome for analyzing the collinearity of angiosperm species. Through collinearity alignment of the B. ramiflora and V. vinifera genomes, we found 13,698 pairs of homologous genes. As shown in the dot plot (Fig. 6), the two genomes had many collinear fragments. However, long fragments at the same position hardly matched, suggesting that after their evolutionary divergence, no WGD events have occurred, which is consistent with the Ks and 4DTv values. We found that some fragments showed ambiguous matches at scattered spots in other intervals, which we speculated are relics left by the ancient genome-wide triplication γ event in the B. ramiflora genome, supporting the consistent collinearity between B. ramiflora and V. vinifera.