Sequencing data statistics
The PacBio platform were used for whole genome sequencing the A. cannabinum genome. Table 1 represents total sequencing volume was 117.13 G, and the coverage was 490.04 X based on calculations of the genome size (239.02 M) estimated from a survey [41]. In addition, the Illumina platform was used for the sequencing of a constructed second-generation small fragment library.
Table 1
Apocynum cannabinum genome sequencing data statistics
Pair-end libraries
|
Insert size
|
Total data (G)
|
Read length (bp)
|
Sequence coverage (X)
|
llumina reads
|
350 bp
|
31.94
|
150
|
133.63
|
PacBio reads
|
29.8 kb
|
8.76
|
11734
|
162.16
|
10X_Genomics
|
500–700 bp
|
46.43
|
150
|
194.25
|
Total
|
-
|
117.13
|
-
|
490.04
|
Assembly results and statistics
The genome size was 260 Mb, the contig N50 was 3.11 Mb, and the scaffold N50 was 4.19 Mb (scaffolds of 100 bp and above were selected for assembly results). The GC analysis showed that the GC content of A. cannabinum was 33.26%. Through Hi-C-assisted assembly, the final genome sequence was anchored to 11 chromosome. The draft genome scaffold chromosome anchoring results showed that the constructed super-scaffold N50 was 21.16 Mb and the number of super-scaffolds was six, which was significantly decreased. The number of scattered scaffolds was low, showing good assembly results. Among the scaffolds, N90 was 11, indicating that 90% of sequences were on the 11 chromosome. A comparison of N50, N60, N70, N80, and N90 showed that the Hi-C assembled A. cannabinum genome result was significantly improved from the original draft genome (Table 2).
Table 2
Apocynum cannabinum genome assembly results
Sample ID
|
Length
|
Number
|
Contig (bp)
|
Scaffold (bp)
|
Contig (bp)
|
Scaffold (bp)
|
10X_Genomics
|
Total
|
259,856,155
|
260,145,681
|
479
|
439
|
Max
|
9,567,299
|
15,154,524
|
-
|
-
|
Number > = 2000
|
-
|
-
|
479
|
439
|
N50
|
3,107,135
|
4,189,930
|
26
|
19
|
N60
|
2,442,000
|
3,120,689
|
35
|
27
|
N70
|
1,722,976
|
2,442,000
|
48
|
36
|
N80
|
1,116,808
|
1,547,882
|
66
|
49
|
N90
|
257,161
|
339,954
|
109
|
80
|
Hi-C
|
Total
|
259,856,155
|
260,165,581
|
569
|
330
|
Max
|
9,567,299
|
26,508,341
|
-
|
-
|
Number > = 2000
|
-
|
-
|
567
|
330
|
N50
|
2,442,000
|
21,162,547
|
33
|
6
|
N60
|
1,938,819
|
20,272,879
|
45
|
7
|
N70
|
1,256,579
|
19,617,059
|
61
|
9
|
N80
|
786,196
|
19,407,958
|
87
|
10
|
N90
|
207,482
|
17,325,238
|
152
|
11
|
A heatmap was used to validate the Hi-C assembly results (Fig. 1). From the heatmap, we can see that the interaction relationships near the diagonal were far stronger than those far away from the diagonal. The overall results showed that there was no significant clustering error between the A. cannabinum chromosomes.
Evaluation of assembly results
BUSCO and CEGMA software were used to evaluate the assembly results for A. cannabinum. The results showed that of the 1440 orthologous single-copy genes in the plant database, 1348 genes (C = 93.6%) were completely covered, of which 1258 single copies were covered (S = 87.4%), 89 multiple copies were covered (D = 6.2%), 30 genes were not completely covered (F = 2.1%), and 62 genes were missing (M = 4.3%). This suggests that the vast majority of single-copy genes was completely assembled, that there was no over-assembly or re-assembly, and that the assembly accuracy and integrity were good. At the same time, 227 genes were assembled from 248 core eukaryotic genes (CEGs), indicating that the assembly results were relatively complete. Alignment using BWA software showed that the alignment rate of all small fragment reads to the genome was 94.80% and the coverage was 96.95%, demonstrating that the reads and the assembled genome possess good consistency.
Repeat sequence annotation
Repeat sequences can be classified as tandem repeats and interspersed repeats. Tandem repeats include microsatellite sequences and minisatellite sequences. Interspersed repeats are also known as transposon elements and include DNA transposons that use DNA-DNA transposition and retrotransposons. Commonly seen retrotransposons include long terminal repeats (LTRs), long interspersed elements (LINEs), and short interspersed elements (SINEs). Around 36.62% of sequences in the A. cannabinum genome were identified as repeat sequences, of which DNA transposons, LTRs, LINEs, and SINEs accounted for 2.22%, 28.45%, 1.01%, and 0%, respectively(Table 3).
Table 3
Statistics of repeat sequence classification results for A. cannabinum
|
Denovo + Repbase length (bp)
|
% in genome
|
TE protein length (bp)
|
% in genome
|
Combined TE length (bp)
|
Type
|
DNA
|
5,282,438
|
2.03
|
765,646
|
0.29
|
5,763,303
|
2.22
|
LINE
|
1,744,782
|
0.67
|
1,157,517
|
0.44
|
2,630,319
|
1.01
|
SINE
|
1,597
|
0.00
|
0
|
0
|
1,597
|
0.00
|
LTR
|
72,896,704
|
28.02
|
16,110,716
|
6.19
|
74,004,538
|
28.45
|
Unknown
|
14,826,765
|
5.70
|
0
|
0
|
14,826,765
|
5.70
|
Total
|
92,189,164
|
35.44
|
18,033,358
|
6.93
|
93,170,210
|
35.81
|
Note: Denovo + Repbase are transposon elements obtained after the databases predicted by RepeatModeler, RepeatScout, Piler, and LTR_FINDER were combined with the RepBase nucleic acid database. This was followed by integration using Uclust software according to the 80-80-80 principle, following which RepeatMasker was used for genome annotation. The TE proteins are transposon elements that were obtained when the RepeatProteinMask software was used to annotate genomes in the RepBase protein database. Combined TEs is the result obtained by combining the results obtained from these two methods and following the removal of redundant genes. Unknown indicates that this repeat sequence could not be classified by RepeatMasker. |
Table 3
Statistics of repeat sequence classification results for Apocynum cannabinum
|
Denovo + Repbase length (bp)
|
% in genome
|
TE protein length (bp)
|
% in genome
|
Combined TE length (bp)
|
Type
|
DNA
|
5,282,438
|
2.03
|
765,646
|
0.29
|
5,763,303
|
2.22
|
LINE
|
1,744,782
|
0.67
|
1,157,517
|
0.44
|
2,630,319
|
1.01
|
SINE
|
1,597
|
0.00
|
0
|
0
|
1,597
|
0.00
|
LTR
|
72,896,704
|
28.02
|
16,110,716
|
6.19
|
74,004,538
|
28.45
|
Unknown
|
14,826,765
|
5.70
|
0
|
0
|
14,826,765
|
5.70
|
Total
|
92,189,164
|
35.44
|
18,033,358
|
6.93
|
93,170,210
|
35.81
|
Note: Denovo + Repbase are transposon elements obtained after the databases predicted by RepeatModeler, RepeatScout, Piler, and LTR_FINDER were combined with the RepBase nucleic acid database. This was followed by integration using Uclust software according to the 80-80-80 principle, following which RepeatMasker was used for genome annotation. The TE proteins are transposon elements that were obtained when the RepeatProteinMask software was used to annotate genomes in the RepBase protein database. Combined TEs is the result obtained by combining the results obtained from these two methods and following the removal of redundant genes. Unknown indicates that this repeat sequence could not be classified by RepeatMasker. |
Gene structure annotation
Through the annotation of the A. cannabinum genome, a total of 22,793 non-redundant protein coding genes were predicted. The mean gene sequence length of the gene was 3532 bp, the mean length of coding sequence was 1206 bp, and the exons and introns were 231 bp and 551 bp, respectively. The mean number of exons in every gene was 5.22. Comparison of the lengths of genes, coding sequences, introns, exons, and the number of exons in A. cannabinum with C. gigantea, Catharanthus roseus, C. arabica, and L. usitatissimum showed that the lengths of the coding sequences and exons and the number of exons in every gene in A. cannabinum were closest to C. gigantea and L. usitatissimum, while C. roseus had lower numbers than these three species and C. arabica had higher numbers than these species. The lengths of the genes and introns of A. cannabinum were most similar to C. arabica, while the lengths of the genes and introns of L. usitatissimum were significantly lower than the other four plant species. The lengths of almost all introns in the five plant species were within 1000 bp, and introns with lengths greater than 2000 bp were extremely rare or absent (Fig. 2).
Annotation of gene function and non-coding RNAs
A total of 78.70% of genes could be aligned to known metabolic pathways; 12,823 functional genes could be annotated by GO function; and 4.4% of genes had unknown functions. Among the annotated protein-coding genes, 21,780 genes had homologous sequences in the protein database, accounting for 95.6% of annotated genes. Furthermore, 95.2% of genes were aligned to non-redundant protein databases, of which 92.30% contained conserved protein domains, showing that the functions of most A. cannabinum genes were relatively conserved. In addition, 689 tRNAs, 3934 rRNAs, 592 snRNAs, and 159 miRNAs were annotated in the A. cannabinum genome (Fig. 3).
Comparative genomic analysis of A. cannabinum
Clustering analysis of the gene families indicated that 43,837 gene families were clustered in 17 species, of which 3084 gene families were common to all these species, while there were 55 common single-copy gene families in the different species. The clustering results of A. cannabinum, A. venetum, C. roseus, C. gigantea, and C. capsularis were used to plot a Venn diagram (Fig. 4).
From the Venn diagram, we can see that A. cannabinum contains 283 species-specific gene families and 449 genes compared with other species. The gene family expansion and contraction analysis results showed that the most recent common ancestor (MRCA) contained 43,823 gene families, of which 399 gene families and 166 genes were expanded in A. cannabinum, while 492 gene families and 231 genes were contracted (Fig. 5).
The gene family clustering analysis results showed that C. roseus, A. venetum, A. cannabinum, and C. gigantea clustered in the same taxon. The Markov chain Monte Carlo (MCMC) program in PAML software indicated that the speciation time for A. cannabinum was around 35.9 (26.7–46.8) million years ago.