De novo RNA-seq assembly from C. latifolia and C. capitulata fruits
We sequenced cDNA libraries from C. latifolia and C. capitulata using the Illumina HiSeq 2500 platform. Before analysis, we discarded raw reads with average quality values < 20, reads shorter than 50 nucleotides, and reads containing ambiguous ‘N’ bases. After adapter trimming and filtering, we obtained 44,396,896 high-quality reads from C. latifolia and 43,863,400 from C. capitulata. Using Trinity 2.11, we then assembled these reads into 85,697 contigs (mean length 775 bp) for C. latifolia and 76,775 contigs (mean length 744 bp) for C. capitulata. The distributions of transcript lengths and transcripts per million (TPM) values are shown in Additional File 1 and Additional File 2. The N50 values for C. latifolia and C. capitulata transcripts were 1,324 and 1,205 nt, respectively (Table 1). Unigene clustering using CD-Hit yielded 70,371 unigenes in C. latifolia and 63,704 in C. capitulata (Table 1).
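The three read-level filters described above can be sketched as follows. This is a minimal illustration, not the pipeline used in the study: the function names are hypothetical, Phred+33 quality encoding is assumed, and adapter trimming is assumed to have been done beforehand.

```python
from typing import Iterable, Iterator, Tuple

Read = Tuple[str, str, str]  # (read ID, bases, Phred+33 quality string)


def mean_phred(quality: str, offset: int = 33) -> float:
    """Average Phred score of a quality string (Phred+33 is an assumption)."""
    return sum(ord(ch) - offset for ch in quality) / len(quality)


def keep_read(seq: str, quality: str,
              min_mean_q: float = 20.0, min_len: int = 50) -> bool:
    """Return True if a read passes the three filters named in the text."""
    if len(seq) < min_len:          # reads with < 50 nucleotides
        return False
    if "N" in seq.upper():          # reads with ambiguous 'N' bases
        return False
    return mean_phred(quality) >= min_mean_q  # average quality < 20 is dropped


def filter_reads(reads: Iterable[Read]) -> Iterator[Read]:
    """Yield only the reads that pass keep_read."""
    for read_id, seq, quality in reads:
        if keep_read(seq, quality):
            yield read_id, seq, quality
```

A read with an average quality of exactly 20 is kept here, because the text describes removing reads with values below 20.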
Table 1. Overview of de novo RNA-seq assembly from C. latifolia and C. capitulata fruits.
| | C. latifolia | C. capitulata |
|---|---|---|
| High-quality reads | 44,396,896 | 43,863,400 |
| Total Trinity genes | 69,446 | 63,951 |
| Total Trinity unigenes | 70,371 | 63,704 |
| Total Trinity transcripts | 85,697 | 76,775 |
| GC (%) | 44.0 | 45.6 |
| N10 (nts) | 3,214 | 2,676 |
| N20 (nts) | 2,460 | 2,103 |
| N50 (nts) | 1,324 | 1,205 |
| Total assembled bases | 66,426,868 | 57,098,016 |
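For readers unfamiliar with the Nx statistics in Table 1, the sketch below shows the usual definition: the length of the shortest contig in the smallest set of longest contigs that together cover at least x% of the assembled bases. The function name is hypothetical and the toy lengths are not data from this study; the last two lines only re-derive the mean contig lengths from the totals reported in Table 1.

```python
from typing import Sequence


def nx(lengths: Sequence[int], fraction: float = 0.5) -> int:
    """Nx statistic: N50 for fraction=0.5, N10 for fraction=0.1, and so on."""
    total = sum(lengths)
    if total == 0:
        raise ValueError("no assembled bases")
    running = 0
    # Walk from the longest contig down until the target share is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= total * fraction:
            return length
    raise ValueError("fraction must be between 0 and 1")


toy_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]  # toy values, 54 bases in total
toy_n50 = nx(toy_lengths)        # 10 + 9 + 8 = 27 reaches half of 54
toy_n10 = nx(toy_lengths, 0.1)   # the longest contig alone covers 10%

# Mean contig length = total assembled bases / number of transcripts (Table 1).
mean_latifolia = round(66_426_868 / 85_697)
mean_capitulata = round(57_098_016 / 76_775)
```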
The gene repertoires of the two Curculigo species resemble those of other monocots
Low annotation rate of the transcripts: To gather functional information about the transcripts identified from de novo assembly, we used the Basic Local Alignment Search Tool to compare the in silico-translated transcripts against protein sequences (BLASTx) from several databases: the nonredundant protein (NR) database at the National Center for Biotechnology Information (NCBI), RefSeq, UniProt/Swiss-Prot, Clusters of Orthologous Groups of proteins (COG), the rice (Oryza sativa) proteome (Os-Nipponbare-Reference-IRGSP-1.0, Assembly: GCF_001433935.1), and the Arabidopsis (Arabidopsis thaliana) proteome (Assembly: GCF_000001735.4). We selected the top hit for each query with an E-value threshold of 1e−10. This yielded annotations for 38,433 out of 85,697 transcripts (44.8%) in C. latifolia and 40,554 out of 76,775 transcripts (52.8%) in C. capitulata. All annotations are listed in Additional File 3. The number of annotated transcripts for each database is listed in Table 2. The low annotation rate suggests that the two Curculigo species differ substantially from the classical model plant systems that contribute much of the information stored in public databases.
Table 2. Number of functional annotations of transcripts from C. latifolia and C. capitulata fruits.
| Annotated database | C. latifolia | C. capitulata |
|---|---|---|
| COG¹ | 11,875 | 12,448 |
| RefSeq | 37,922 | 39,369 |
| UniProt | 36,783 | 38,901 |
| NR² | 37,118 | 39,340 |
| Rice³ | 34,761 | 36,204 |
| Arabidopsis⁴ | 33,332 | 34,684 |
| All six databases | 38,433 | 40,554 |

¹ COG: Clusters of Orthologous Groups of proteins.
² NR: nonredundant protein database of the National Center for Biotechnology Information.
³ Assembly: GCF_001433935.1.
⁴ Assembly: GCF_000001735.4.
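The top-hit selection and the annotation rates quoted above can be illustrated with a short sketch. It assumes the searches were exported in the standard 12-column BLAST tabular layout (E-value in column 11, bit score in column 12), treats the 1e−10 threshold as inclusive, and picks the hit with the highest bit score per query; none of these details is stated in the text, and the function names are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, float, float]  # (subject ID, E-value, bit score)


def top_hits(lines: Iterable[str], max_evalue: float = 1e-10) -> Dict[str, Hit]:
    """Keep the best-scoring hit per query among hits passing the E-value cut."""
    best: Dict[str, Hit] = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue > max_evalue:
            continue
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


def annotation_rate(n_annotated: int, n_total: int) -> float:
    """Percentage of transcripts with at least one accepted hit."""
    return round(100 * n_annotated / n_total, 1)


# Figures reported in the text (annotated transcripts / all transcripts).
rate_latifolia = annotation_rate(38_433, 85_697)
rate_capitulata = annotation_rate(40_554, 76_775)
```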
Conservation across monocots: After BLASTx searches with the C. latifolia and C. capitulata transcripts against the NR database, we determined the extent of gene conservation across plant species by running Blast2GO [40]. We estimated the similarity of the two Curculigo species to various plant species by counting the number of hits from each species obtained by BLAST searches (Fig. 2). The top six species displaying the highest homology with C. latifolia and C. capitulata transcripts were monocots, like Curculigo, supporting the view that the assembled Curculigo genes are highly similar to known genes from other monocots. The top six species sharing the highest similarity with C. latifolia and C. capitulata were identical in terms of both species and rank order.
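The per-species hit counting behind Fig. 2 can be sketched as below. This is only an illustration of the counting step, not the Blast2GO implementation: it assumes each top hit carries an NR-style description ending in the source organism in square brackets, and the function names and example titles are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable, List, Optional, Tuple

# NR descriptions conventionally end with the organism name in brackets.
_SPECIES = re.compile(r"\[([^\[\]]+)\]\s*$")


def species_of(title: str) -> Optional[str]:
    """Extract the trailing '[Genus species]' tag from a hit description."""
    match = _SPECIES.search(title)
    return match.group(1) if match else None


def top_species(titles: Iterable[str], n: int = 6) -> List[Tuple[str, int]]:
    """Count top hits per species and return the n most frequent."""
    counts = Counter(s for s in map(species_of, titles) if s)
    return counts.most_common(n)
```

Running `top_species` separately on the top hits of each Curculigo species and comparing the two ranked lists mirrors the comparison of species and rank order described above.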
Expression of functionally similar genes between the two species: Using the COG database, we classified 11,875 transcripts from C. latifolia and 12,448 from C. capitulata into functional categories (Fig. 3). We observed no significant differences between the two species, which supports the notion that these two species have functionally similar genes.
We also analyzed the functions of the assembled transcripts via Gene Ontology (GO) analysis using the rice genome annotation (Additional File 4). Again, no significant differences were observed between the two species. The results also suggested that the repertoires of genes from the two species are similar to those of better-known species.
Fewer than half of the genes are highly similar between C. latifolia and C. capitulata fruits
Using the unigene sequences, we analyzed the similarity between C. latifolia and C. capitulata genes. We performed BLAST searches using each transcript from one species as the query sequence against all transcripts from the other species with a threshold E-value of 1e−5 or less and selected the reciprocal best hits. We defined unigenes with high similarity between the two species as common genes, and unigenes with low similarity between the species, or present in only one species, as unique genes. In total, we deemed 38.6% (27,155 out of 70,371) of genes in C. latifolia and 42.6% (27,155 out of 63,704) of genes in C. capitulata to be common genes (Fig. 4). The relatively small number of common genes suggests that a long time has passed since the divergence of these species, which is consistent with results of lineage analysis based on plastid DNA from Hypoxidaceae family members. Indeed, although the Curculigo genus constitutes a single clade, C. latifolia and C. capitulata are not the most closely related species within this clade [5].
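The reciprocal-best-hit criterion can be sketched as follows. The sketch assumes that the best hit of every query in each direction has already been extracted from the BLAST output after applying the 1e−5 E-value threshold; the function names and the toy identifiers are hypothetical.

```python
from typing import Dict, List, Tuple


def reciprocal_best_hits(a_to_b: Dict[str, str],
                         b_to_a: Dict[str, str]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    return sorted((a, b) for a, b in a_to_b.items() if b_to_a.get(b) == a)


def common_share(n_common: int, n_unigenes: int) -> float:
    """Percentage of a species' unigenes that fall in the common set."""
    return round(100 * n_common / n_unigenes, 1)


# Toy best-hit maps: only L_1 <-> C_1 is reciprocal.
latifolia_best = {"L_1": "C_1", "L_2": "C_1", "L_3": "C_9"}
capitulata_best = {"C_1": "L_1", "C_9": "L_4"}
toy_pairs = reciprocal_best_hits(latifolia_best, capitulata_best)

# Shares reported in the text: 27,155 common genes in each species.
share_latifolia = common_share(27_155, 70_371)
share_capitulata = common_share(27_155, 63_704)
```

Because each reciprocal pair contributes exactly one gene to each species, the number of common genes is the same (27,155) on both sides, while the percentages differ with the size of each unigene set.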
Next, we investigated the proportion of annotated genes in these species using the COG, RefSeq, UniProt, and NR databases and the genomes of rice and Arabidopsis (shown in Table 2). Among the common genes, 17,337 and 17,199 genes were annotated (63.8% and 63.3% of common genes) in C. latifolia and C. capitulata, respectively. By contrast, there were 11,718 annotated unique genes (27.1% of unique genes) among genes found only in C. latifolia and 14,848 (40.6% of unique genes) among those found only in C. capitulata. Thus, the annotation rate was higher for common genes than for unique genes, even though common genes were fewer in number. One possible explanation for this observation is that many of the genes common to both species may also be shared with the model plant species that are highly represented in the databases employed.
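The percentages in this paragraph follow from the counts already given, taking the number of unique genes in each species as its unigene total minus the 27,155 common genes. The short sketch below re-derives them; the variable and function names are illustrative only.

```python
def percent(part: int, whole: int) -> float:
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * part / whole, 1)


COMMON = 27_155
unigenes = {"C. latifolia": 70_371, "C. capitulata": 63_704}
annotated_common = {"C. latifolia": 17_337, "C. capitulata": 17_199}
annotated_unique = {"C. latifolia": 11_718, "C. capitulata": 14_848}

rates = {}
for species, total in unigenes.items():
    unique = total - COMMON  # unigenes without a reciprocal best hit
    rates[species] = {
        "unique_genes": unique,
        "common_annotated_pct": percent(annotated_common[species], COMMON),
        "unique_annotated_pct": percent(annotated_unique[species], unique),
    }
```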
We then compared the expression profiles of 27,155 common genes between C. latifolia and C. capitulata. Although the sequences of the corresponding genes in C. latifolia and C. capitulata were similar, their expression profiles were not necessarily equivalent. Nonetheless, only 111 out of the 27,155 common genes had TPM ratios ≥ 50 (Table 3). Of these 111 genes, five were neoculin-related genes, indicating that the expression profiles of at least some neoculin-related genes differ significantly between the two species.
Table 3. Comparison of the expression profiles of C. latifolia and C. capitulata.
| C. latifolia TRINITY_ID | C. latifolia RefSeq | C. latifolia TPM | C. capitulata TRINITY_ID | C. capitulata RefSeq | C. capitulata TPM | Pident† | E-value† |
|---|---|---|---|---|---|---|---|
| L_19492_c6_g1_i1 | trans-resveratrol di-O-methyltransferase | 36282 | C_19332_c0_g2_i1 | trans-resveratrol di-O-methyltransferase | 277 | 99.18 | 0 |
| L_20774_c6_g2_i5 | trans-resveratrol di-O-methyltransferase | 31648 | C_20405_c1_g1_i2 | trans-resveratrol di-O-methyltransferase | 573 | 99.02 | 0 |
| *L_22219_c0_g1_i1 | mannose-specific lectin-like | 7634 | *C_16562_c0_g1_i1 | mannose-specific lectin-like | 80 | 97.75 | 0 |
| L_22040_c0_g1_i1 | chalcone synthase-like | 6483 | C_22230_c0_g1_i1 | chalcone synthase-like | 69 | 100 | 0 |
| L_39489_c0_g1_i1 | cinnamoyl-CoA reductase 1-like | 4584 | C_43958_c0_g1_i1 | cinnamoyl-CoA reductase 1-like | 37 | 100 | 0 |
| L_17418_c0_g1_i1 | benzyl alcohol O-benzoyltransferase | 2848 | C_20771_c2_g1_i3 | benzyl alcohol O-benzoyltransferase | 18 | 96.27 | 0 |
| L_18625_c0_g1_i1 | glutelin type-A 1-like | 2641 | C_18515_c0_g1_i1 | glutelin type-A 1-like | 35 | 100 | 0 |
| L_20161_c0_g1_i1 | probable polyamine oxidase 5 | 2333 | C_20921_c0_g1_i1 | probable polyamine oxidase 5 | 38 | 99.17 | 0 |
| L_20171_c0_g1_i1 | pyruvate decarboxylase 1 isoform X1 | 2140 | C_19622_c0_g1_i1 | pyruvate decarboxylase 1 isoform X1 | 30 | 99.74 | 0 |
| L_19390_c0_g1_i1 | benzyl alcohol O-benzoyltransferase-like | 1721 | C_20336_c0_g1_i1 | benzyl alcohol O-benzoyltransferase-like | 25 | 99.01 | 0 |
| L_17288_c0_g1_i1 | 5-methyltetrahydropteroyl-triglutamate--homocysteine methyltransferase 1 | 1527 | C_20491_c0_g1_i4 | 5-methyltetrahydropteroyl-triglutamate--homocysteine methyltransferase 2-like | 19 | 98.22 | 0 |
| L_22101_c0_g1_i1 | cytochrome P450 71A1-like | 1130 | C_20591_c0_g1_i1 | cytochrome P450 71A1-like | 14 | 100 | 0 |
| L_9054_c0_g2_i1 | uncharacterized protein LOC105052971 | 891 | C_20462_c0_g1_i1 | uncharacterized protein LOC105052971 | 16 | 99.15 | 0 |
| L_19899_c1_g1_i5 | elongation factor 1-alpha-like | 720 | C_16211_c0_g1_i1 | hypothetical protein CARUB_v100096370mg, partial | 11 | 99.75 | 0 |
| L_39417_c0_g1_i1 | palmitoyl-acyl carrier protein thioesterase, chloroplastic-like | 659 | C_1125_c0_g1_i1 | palmitoyl-acyl carrier protein thioesterase, chloroplastic-like | 0.89 | 99.88 | 0 |
| L_8999_c0_g1_i1 | probable protein Pop3 | 657 | C_3239_c0_g1_i1 | probable protein Pop3 | 10 | 99.79 | 0 |
| *L_16562_c0_g1_i1 | mannose-specific lectin-like | 652 | *C_16324_c0_g1_i1 | mannose-specific lectin-like | 8 | 98.8 | 0 |
| L_20784_c0_g1_i1 | mannan endo-1,4-beta-mannosidase 5-like | 477 | C_20300_c0_g1_i1 | mannan endo-1,4-beta-mannosidase 5-like | 8 | 99.81 | 0 |
| L_17063_c0_g1_i1 | uncharacterized protein LOC103705182 | 457 | C_15604_c0_g1_i1 | | 7 | 99.51 | 0 |
| L_9763_c0_g1_i1 | 4-hydroxyphenyl-pyruvate dioxygenase | 441 | C_17419_c0_g1_i2 | 4-hydroxyphenyl-pyruvate dioxygenase | 6 | 97.85 | 0 |
| L_15645_c0_g1_i1 | hypothetical protein PHAVU_005G042200g | 378 | C_19503_c0_g1_i2 | uncharacterized protein LOC103713005 | 4 | 98.6 | 0 |
| L_39500_c0_g1_i1 | uncharacterized protein C24B11.05-like isoform X2 | 323 | C_15665_c0_g1_i2 | uncharacterized protein C24B11.05-like isoform X2 | 6 | 96.74 | 0 |
| L_16206_c0_g1_i1 | cytochrome P450 71A1-like | 295 | C_18399_c0_g1_i1 | cytochrome P450 71A1-like | 5 | 99.88 | 0 |
| L_9770_c0_g1_i1 | Os09g0480700, partial | 278 | C_11365_c0_g1_i1 | Os09g0480700, partial | 3 | 99.52 | 0 |
| L_20943_c2_g1_i1 | LOW QUALITY PROTEIN: ATP-citrate synthase beta chain protein 1-like | 276 | C_20189_c1_g1_i6 | LOW QUALITY PROTEIN: ATP-citrate synthase beta chain protein 1-like | 5 | 99.74 | 0 |
| L_5031_c0_g1_i1 | | 265 | C_26197_c0_g1_i1 | | 3 | 99.53 | 2E-108 |
| L_19581_c0_g1_i1 | peroxidase 43 | 244 | C_20763_c0_g1_i7 | peroxidase 43 | 3 | 99.32 | 0 |
| L_22200_c0_g1_i1 | | 237 | C_21279_c0_g3_i1 | | 0 | 92.42 | 0 |
| L_16082_c0_g1_i1 | uncharacterized protein LOC105035694 | 230 | C_20815_c0_g1_i2 | uncharacterized protein LOC105035694 | 4 | 97.73 | 0 |
| L_1821_c0_g1_i1 | protein EARLY RESPONSIVE TO DEHYDRATION 15-like | 213 | C_5863_c0_g3_i1 | protein EARLY RESPONSIVE TO DEHYDRATION 15-like | 1 | 94.17 | 0 |
| L_21840_c4_g7_i1 | | 197 | C_46444_c0_g1_i1 | | 2 | 100 | 0 |
| L_11489_c0_g1_i1 | | 189 | C_51079_c0_g1_i1 | | 3 | 95.13 | 4E-114 |
| L_21813_c0_g1_i1 | protein kinase APK1B, chloroplastic-like | 184 | C_20869_c0_g1_i9 | protein kinase APK1B, chloroplastic-like | 0.97 | 95.04 | 0 |
| L_16611_c0_g1_i1 | | 163 | C_8161_c0_g1_i1 | | 2 | 100 | 0 |
| L_12355_c0_g1_i1 | myb-related protein 306-like | 160 | C_7266_c0_g1_i1 | myb-related protein 306-like | 3 | 99.89 | 0 |
| L_18378_c0_g1_i1 | probable L-ascorbate peroxidase 4 | 158 | C_17994_c1_2_i1 | probable L-ascorbate peroxidase 4 | 2 | 96.39 | 0 |
| L_21677_c0_g1_i1 | S-adenosylmethionine decarboxylase proenzyme-like | 149 | C_15562_c0_g2_i1 | S-adenosylmethionine decarboxylase proenzyme-like | 0.92 | 96.67 | 0 |
| L_14830_c0_g1_i1 | NAC transcription factor 29-like | 135 | C_20428_c0_g1_i1 | NAC transcription factor 29-like | 0 | 97.61 | 0 |
| L_14165_c0_g2_i1 | probable peroxygenase 4 | 131 | C_17339_c0_g1_i2 | probable peroxygenase 4 | 2 | 95.32 | 0 |
| L_21840_c4_g4_i2 | | 130 | C_11729_c0_g1_i1 | | 2 | 100 | 0 |
| L_39737_c0_g1_i1 | Glutathione peroxidase 2 | 127 | C_8347_c0_g1_i1 | Glutathione peroxidase 2 | 2 | 96.26 | 0 |
| L_4928_c0_g1_i1 | | 124 | C_44794_c0_g1_i1 | | 1 | 100 | 3E-101 |
| L_20250_c0_g1_i1 | protein NRT1/ PTR FAMILY 5.6-like | 114 | C_29979_c0_g1_i1 | protein NRT1/ PTR FAMILY 5.6-like | 2 | 97.44 | 0 |
| L_15628_c0_g1_i1 | formin-A-like | 103 | C_20575_c0_g1_i5 | formin-A-like | 0 | 90.44 | 0 |
| L_21235_c2_g9_i1 | | 101 | C_9877_c0_g1_i1 | | 1 | 92.77 | 3E-98 |
| *L_19752_c0_g1_i1 | mannose-specific lectin 3-like | 33 | *C_18595_c0_g1_i1 | mannose-specific lectin 3-like | 2301 | 97.6 | 0 |
| L_16463_c0_g2_i2 | LOW QUALITY PROTEIN: S-norcoclaurine synthase-like | 16 | C_6989_c0_g1_i1 | LOW QUALITY PROTEIN: S-norcoclaurine synthase-like | 8393 | 91.39 | 0 |
| L_32395_c0_g1_i1 | | 14 | C_4973_c0_g1_i1 | | 8765 | 86.17 | 4E-92 |
| L_19456_c0_g1_i1 | polyphenol oxidase, chloroplastic-like | 13 | C_20237_c3_g1_i1 | polyphenol oxidase, chloroplastic-like | 1496 | 83.82 | 0 |
| L_14333_c0_g1_i1 | | 12 | C_13197_c0_g1_i1 | | 42047 | 94.54 | 7E-75 |
| L_55067_c0_g1_i1 | defensin Ec-AMP-D1 {ECO:0000303\| PubMed:18625284}-like | 9 | C_39416_c0_g1_i1 | defensin Ec-AMP-D1 {ECO:0000303\| PubMed:18625284}-like | 2475 | 95.1 | 0 |
| L_5253_c0_g1_i1 | Disease resistance-responsive (dirigent-like protein) family protein, putative | 9 | C_16870_c0_g2_i1 | Disease resistance-responsive (dirigent-like protein) family protein, putative | 547 | 94.54 | 0 |
| L_1586_c0_g1_i1 | glycine-rich protein-like isoform X1 | 8 | C_39384_c0_g1_i1 | | 2895 | 94.72 | 0 |
| L_23556_c0_g1_i1 | basic blue protein-like | 5 | C_14117_c0_g1_i1 | basic blue protein-like | 606 | 94.75 | 0 |
| L_13618_c0_g1_i1 | non-specific lipid-transfer protein 1-like | 5 | C_13976_c0_g1_i1 | lipid transfer protein precursor | 655 | 96.6 | 0 |
| L_465_c0_g2_i1 | microsomal glutathione S-transferase 3-like | 5 | C_4959_c0_g1_i1 | microsomal glutathione S-transferase 3-like | 246 | 93.89 | 0 |
| L_21384_c3_g4_i1 | | 5 | C_17484_c0_g1_i1 | | 424 | 86.76 | 1E-56 |
| L_9003_c0_g1_i1 | dirigent protein 22-like isoform X1 | 5 | C_19511_c0_g1_i1 | dirigent protein 22-like | 834 | 96.09 | 0 |
| L_4015_c0_g1_i1 | CASP-like protein 2A1 | 4 | C_4840_c0_g1_i1 | CASP-like protein 2A1 | 241 | 98.25 | 0 |
| L_16618_c0_g1_i1 | hypothetical protein SORBIDRAFT_05g026700 | 3 | C_4999_c0_g1_i1 | Bowman-Birk type trypsin inhibitor-like isoform X2 | 5459 | 86.06 | 1E-135 |
| L_4834_c0_g2_i1 | xylem serine proteinase 1-like | 3 | C_9966_c0_g1_i1 | subtilisin-like protease | 232 | 96.91 | 0 |
| L_6907_c0_g1_i1 | serine/threonine-protein kinase CDL1-like | 3 | C_11871_c1_g1_i1 | serine/threonine-protein kinase CDL1-like | 183 | 96.44 | 0 |
| L_17444_c0_g1_i1 | cytochrome P450 CYP82D47-like | 3 | C_20684_c0_g1_i1 | cytochrome P450 CYP82D47-like | 182 | 94.92 | 0 |
| L_40485_c0_g3_i1 | non-specific lipid-transfer protein 1-like | 3 | C_39065_c0_g1_i1 | non-specific lipid-transfer protein 1-like | 1455 | 90.92 | 0 |
| L_18380_c0_g1_i2 | conserved hypothetical protein | 3 | C_39186_c0_g1_i1 | conserved hypothetical protein | 654 | 93.2 | 4E-127 |
| L_31252_c0_g1_i1 | | 3 | C_12650_c0_g1_i1 | non-specific lipid-transfer protein-like | 283 | 94.9 | 4E-65 |
| L_42464_c0_g2_i1 | alpha carbonic anhydrase 8-like, partial | 3 | C_40148_c0_g1_i1 | alpha carbonic anhydrase 7-like | 235 | 93.1 | 0 |
| L_13852_c0_g1_i1 | endoglucanase 6 | 3 | C_18579_c0_g1_i1 | endoglucanase 19-like | 707 | 97.65 | 0 |
| L_39898_c0_g1_i1 | oxygen-evolving enhancer protein 3-1, chloroplastic-like | 2 | C_12932_c0_g1_i1 | oxygen-evolving enhancer protein 3-1, chloroplastic-like | 160 | 96.93 | 0 |
| L_6056_c0_g1_i1 | Calvin cycle protein CP12-1, chloroplastic-like | 2 | C_12691_c0_g1_i1 | calvin cycle protein CP12-1, chloroplastic | 215 | 92.42 | 3E-128 |
| L_6093_c0_g1_i1 | | 2 | C_41578_c0_g1_i1 | | 170 | 93.87 | 2E-135 |
| L_17773_c0_g1_i1 | uncharacterized protein LOC105056845 | 2 | C_12165_c0_g1_i1 | uncharacterized protein LOC105056845 | 249 | 92.36 | 0 |
| L_24151_c0_g1_i1 | ribonuclease 3-like | 2 | C_39292_c0_g1_i1 | ribonuclease 3-like | 389 | 98.07 | 0 |
| L_8678_c0_g1_i1 | uncharacterized protein LOC105056672 | 2 | C_4730_c0_g1_i1 | uncharacterized protein LOC105056672 | 116 | 98.71 | 0 |
| L_19431_c2_g4_i1 | polyubiquitin 4-like, partial | 2 | C_20039_c0_g8_i1 | hypothetical protein PHAVU_003G1236000g, partial | 2116 | 94.38 | 6E-106 |
| L_250_c1_g1_i1 | probable glutathione S-transferase parA | 2 | C_16559_c0_g1_i1 | probable glutathione S-transferase parA | 419 | 98.26 | 0 |
| L_10676_c0_g1_i1 | probable linoleate 9S-lipoxygenase 5 | 2 | C_17658_c0_g1_i1 | probable linoleate 9S-lipoxygenase 5 | 1644 | 98.82 | 0 |
| L_8975_c0_g1_i1 | chitinase-like protein 1 | 2 | C_16475_c1_g1_i1 | chitinase-like protein 1 | 183 | 97.92 | 0 |
| L_44393_c0_g1_i1 | hypothetical protein POPTR_0004s03650g | 2 | C_18495_c2_g1_i1 | conserved hypothetical protein | 2751 | 92.78 | 3E-66 |
| L_759_c0_g1_i1 | CAS1 domain-containing protein 1-like | 2 | C_21365_c0_g1_i1 | CAS1 domain-containing protein 1-like isoform X2 | 183 | 99.26 | 0 |
| L_30327_c0_g1_i1 | conserved hypothetical protein | 2 | C_23037_c0_g1_i1 | conserved hypothetical protein | 134 | 96.47 | 2E-116 |
| L_5572_c0_g1_i1 | short-chain type dehydrogenase/reductase-like | 2 | C_14194_c0_g1_i1 | short-chain type dehydrogenase/reductase-like | 205 | 96.22 | 0 |
| L_45139_c0_g1_i1 | putative germin-like protein 2-1 | 2 | C_21890_c0_g1_i1 | putative germin-like protein 2-1 | 111 | 96.2 | 1E-146 |
| L_22251_c0_g1_i1 | xyloglucan endotransglucosylase/ hydrolase protein 9-like | 2 | C_40495_c0_g2_i1 | LOW QUALITY PROTEIN: xyloglucan endotransglucosylase/ hydrolase protein 9-like | 131 | 98.37 | 0 |
| L_56341_c0_g1_i1 | peroxidase 4-like | 1 | C_10149_c0_g1_i1 | peptide-N4-(N-acetyl-beta-glucosaminyl)asparagine amidase A-like | 1301 | 93.81 | 2E-93 |
| L_56680_c0_g1_i1 | peptide-N4-(N-acetyl-beta-glucosaminyl)asparagine amidase A-like | 1 | C_19920_c0_g1_i1 | peroxidase 4-like | 583 | 98.68 | 7E-113 |
| *L_307_c0_g2_i1 | mannose-specific lectin-like | 1 | *C_9931_c0_g1_i1 | mannose-specific lectin-like | 14867 | 99.35 | 0 |
| L_48085_c0_g1_i1 | probable indole-3-acetic acid-amido synthetase GH3.1 | 1 | C_19080_c1_g1_i1 | probable indole-3-acetic acid-amido synthetase GH3.1 | 107 | 84.39 | 2E-70 |
| L_46946_c0_g1_i1 | chlorophyll a-b binding protein 7, chloroplastic-like | 1 | C_41884_c0_g1_i1 | chlorophyll a-b binding protein, chloroplastic | 141 | 98.41 | 0 |
| L_4845_c0_g1_i1 | chlorophyll a-b binding protein CP26, chloroplastic-like | 1 | C_10575_c0_g1_i1 | chlorophyll a-b binding protein CP26, chloroplastic-like | 353 | 98.25 | 0 |
| L_23363_c0_g1_i1 | uncharacterized protein LOC105056050 | 1 | C_21609_c0_g1_i1 | uncharacterized protein LOC105056050 | 1612 | 98.84 | 0 |
| *L_30823_c0_g1_i1 | mannose-specific lectin-like | 1 | *C_17363_c2_g1_i3 | mannose-specific lectin-like | 317 | 98.64 | 0 |
| L_645_c0_g1_i1 | putative lipid-transfer protein DIR1 | 1 | C_12082_c0_g1_i1 | putative lipid-transfer protein DIR1 | 108 | 97.09 | 0 |
| L_50661_c0_g1_i1 | oxygen-evolving enhancer protein 2, chloroplastic-like | 1 | C_14711_c0_g1_i1 | oxygen-evolving enhancer protein 2, chloroplastic-like | 133 | 97.74 | 0 |
| L_16663_c0_g1_i2 | | 1 | C_20564_c0_g1_i1 | | 642 | 89.54 | 2E-112 |
| L_41624_c0_g1_i1 | isocitrate lyase | 1 | C_15046_c0_g1_i1 | isocitrate lyase | 116 | 98.32 | 7E-180 |
| L_33923_c0_g1_i1 | galactinol synthase 2-like isoform X1 | 1 | C_13705_c0_g1_i1 | galactinol synthase 1-like | 127 | 92.82 | 0 |
| L_36400_c0_g1_i1 | putative cell wall protein | 0.98 | C_26021_c0_g1_i1 | putative cell wall protein | 117 | 98.05 | 2E-98 |
| L_53880_c0_g1_i1 | uncharacterized protein LOC105056050 | 0.93 | C_5177_c0_g1_i1 | proactivator polypeptide-like 1 | 644 | 98.94 | 0 |
| L_6399_c0_g1_i1 | auxin-induced protein 22D-like | 0.93 | C_10469_c0_g1_i1 | auxin-induced protein 22D-like | 171 | 96.71 | 0 |
| L_10569_c0_g2_i1 | | 0.91 | C_16356_c0_g1_i1 | | 1072 | 93.62 | 0 |
| L_21646_c1_g1_i3 | protein HOTHEAD-like | 0.91 | C_14207_c0_g1_i1 | protein HOTHEAD-like | 330 | 96.12 | 0 |
| L_22097_c0_g1_i1 | | 0.82 | C_9693_c0_g2_i1 | | 155 | 92.67 | 0 |
| L_30250_c0_g1_i1 | polygalacturonase inhibitor | 0.64 | C_17486_c1_g1_i2 | Polygalacturonase inhibitor | 171 | 93.72 | 2E-170 |
| L_50985_c0_g1_i1 | putative phytosulfokines 6 isoform X1 | 0.47 | C_22933_c0_g1_i1 | putative phytosulfokines 6 isoform X2 | 136 | 95.71 | 0 |
| L_39567_c0_g2_i1 | profilin-1 | 0 | C_15886_c0_g1_i1 | profilin-1 | 615 | 97.97 | 0 |
| L_5103_c0_g1_i1 | trans-resveratrol di-O-methyltransferase-like | 0 | C_39904_c0_g1_i1 | trans-resveratrol di-O-methyltransferase-like | 430 | 79.15 | 0 |
| L_24431_c0_g6_i1 | 60S ribosomal protein L24 | 0 | C_1942_c0_g1_i1 | 60S ribosomal protein L24 | 278 | 97.76 | 0 |
| L_3220_c0_g1_i1 | | 0 | C_1273_c0_g1_i1 | chlorophyll a-b binding protein 6, chloroplastic | 264 | 92.97 | 3E-47 |
| L_16735_c0_g2_i2 | uncharacterized protein LOC105047938 | 0 | C_39063_c0_g1_i1 | uncharacterized protein LOC105047938 | 172 | 92.49 | 0 |
| L_256_c0_g1_i2 | Os06g0133500 | 0 | C_16734_c1_g2_i1 | Os06g0133500 | 151 | 92.07 | 9E-165 |
*: neoculin-related transcripts (cf. Fig. 5 and Additional File 6)
†: Pident and E-value are from BLASTN searches performed with C. latifolia as query against C. capitulata.
The table lists common genes with a TPM ratio ≥ 50 between the two species, excluding pairs in which both genes have TPM values < 100. Genes are sorted by their TPM value in C. latifolia and shown alongside the corresponding C. capitulata genes. Note that no gene was highly expressed in both species. This pattern strongly suggests that the regulation of gene expression changed as the two species diverged.
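The selection rule for Table 3 can be sketched as a small predicate. The function names are hypothetical, and the handling of a TPM of zero (treated here as an infinite ratio when the other species is expressed) is an assumption, since the text does not say how such pairs were scored.

```python
import math


def tpm_ratio(tpm_a: float, tpm_b: float) -> float:
    """Fold difference between two TPM values, larger over smaller."""
    high, low = max(tpm_a, tpm_b), min(tpm_a, tpm_b)
    if high == 0:
        return 1.0        # both unexpressed: no difference (assumption)
    if low == 0:
        return math.inf   # expressed in one species only (assumption)
    return high / low


def is_divergent(tpm_a: float, tpm_b: float,
                 min_ratio: float = 50.0, min_tpm: float = 100.0) -> bool:
    """True if the pair meets the Table 3 criteria described in the note."""
    if tpm_a < min_tpm and tpm_b < min_tpm:
        return False      # exclude pairs in which both genes are below 100 TPM
    return tpm_ratio(tpm_a, tpm_b) >= min_ratio
```

For example, the first row of Table 3 (36282 versus 277 TPM) passes with a ratio of about 131, whereas a hypothetical pair at 1000 and 500 TPM would not.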
Lectin genes expressed in C. latifolia and C. capitulata fruits
We previously demonstrated that C. latifolia fruits contain a taste-modifying protein, neoculin, consisting of an NBS-NAS heterodimer that is similar to lectins in the GNA family. To better understand the general outline of the GNA gene family in these species, we therefore counted the lectin genes expressed in the fruits of C. latifolia and C. capitulata in each of the 12 lectin families. To determine the number of lectin genes, we performed tBLASTN searches against all transcripts in each species using the sequences of 12 representative lectins as queries [41] (Table 4). In both species, the GNA family, which includes the neoculin (NBS and NAS) genes, was among the largest lectin families. Ten of the 45 lectin genes in C. latifolia and 13 of the 49 lectin genes in C. capitulata belonged to the GNA family. We therefore analyzed the GNA family genes in these species, including the neoculin genes, in more detail.
Table 4. Number of predicted lectin genes using tBLASTN in C. latifolia and C. capitulata fruits.
| Lectin domain | Model lectin | C. latifolia | C. capitulata |
|---|---|---|---|
| ABA domain | Agaricus bisporus agglutinin | 0 | 0 |
| Amaranthin domain | Amaranthus caudatus agglutinin | 0 | 0 |
| CRA domain | Robinia pseudoacacia chitinase-related agglutinin | 3 | 4 |
| Cyanovirin domain | Nostoc ellipsosporum agglutinin | 0 | 0 |
| EUL domain | Euonymus europaeus agglutinin | 1 | 1 |
| GNA domain | Galanthus nivalis agglutinin | 10 | 13 |
| Hevein domain | Hevea brasiliensis agglutinin | 3 | 2 |
| JRL domain | Artocarpus integer agglutinin | 9 | 4 |
| Legume domain | Glycine max agglutinin | 8 | 16 |
| LysM domain | Brassica juncea LysM domain | 1 | 1 |
| Nictaba domain | Nicotiana tabacum agglutinin | 10 | 8 |
| Ricin-B domain | Ricinus communis agglutinin | 0 | 0 |
| Total number of lectin genes | | 45 | 49 |
Analysis of GNA family and neoculin-related transcripts
We constructed a phylogenetic tree using the deduced protein sequences from 17 transcripts of well-known GNA family members and 25 full-length neoculin-related transcripts from Curculigo (10 from C. latifolia and 15 from C. capitulata; Fig. 5); the method used for sequence selection is shown in Additional File 5. The TPM values (calculated by RSEM) are listed after the transcript IDs. An alignment of all sequences is shown in Additional File 6. The C. latifolia transcript L_16562_c0_g1_i1 was a good match for NBS, while L_16562_c0_g1_i2 was a good match for NAS, except for one amino acid substitution (Additional File 7); these transcripts will be referred to as NBS and NAS hereafter. The predicted proteins derived from neoculin-related transcripts formed a distinct group separate from known GNA family members. Neoculin-like sequences formed one group that included NBS and NAS (named the ‘neoculin group’), as well as two other large groups (group 1 and group 2) (Fig. 5). In addition to NBS and NAS, the neoculin group also included proteins whose transcripts were highly expressed (C_9931_c0_g1_i1) and that presented the conserved amino acid residues critical for binding mannose (and thus have the potential for lectin activity). In addition, each transcript had an ortholog in both Curculigo species. Furthermore, transcripts from this group exhibited such a high DNA sequence identity that qRT-PCR analysis could not be performed with high accuracy on individual members.
Many highly expressed transcripts belonged to group 1 (L_22219_c0_g1_i1 [TPM: 7,600]; C_18595_c0_g1_i1 [TPM: 2,300]; C_9454_c0_g1_i1 [TPM: 2,000]). Although these highly expressed transcripts encode proteins that are very similar to mannose-binding lectins, they are not mannose-binding lectins, as they lack the conserved and essential amino acid residues that form the mannose-binding sites. At this time, we do not know their physiological functions or the reason for their high expression. Predicted proteins encoded by group 2 transcripts were relatively close to the lectins Polygonatum multiflorum agglutinin (PMA) and Polygonatum roseum agglutinin (PRA) from the Polygonatum genus. Unlike in group 1, there were no highly expressed transcripts in this group.
In each group, we detected neoculin-related orthologous transcripts with high similarity between C. latifolia and C. capitulata. The existence of many such orthologs is noteworthy given the relatively small proportion of common genes (only approximately 40% of the unigenes in each species; Fig. 4). We infer that these orthologs probably existed before the divergence of these two species, whereas their amino acid differences probably arose afterwards. Because plants, including Curculigo, are immobile, genetic diversity is beneficial for them: it improves the chance that a population survives multiple stresses. It would be interesting to determine whether Curculigo plants other than C. latifolia and C. capitulata contain neoculin-related genes, especially genes in the neoculin group.
Within the neoculin group, we identified transcripts encoding proteins with high similarity to NBS and NAS in both C. latifolia and C. capitulata. Notably, although the corresponding NBS and NAS genes were highly expressed in C. latifolia, their C. capitulata orthologs were only weakly expressed (C_16324_c0_g1_i1 and C_16324_c0_g1_i2). The TPM values for NBS and NAS genes in C. latifolia were approximately the same, with 650 and 620 TPMs, respectively. This result is in agreement with the finding that their encoded proteins form a heterodimer [18]. Although C_9931_c0_g1_i1 was highly expressed in C. capitulata, with a TPM value of 15,000 (the fifth highest expression level among all C. capitulata transcripts), its C. latifolia ortholog (L_307_c0_g1_i1, L_307_c0_g2_i1) was expressed at a very low level. Curiously, in all three groups (neoculin group, groups 1 and 2) for which there were orthologs in both species, if a gene was highly expressed in one species, its ortholog was weakly expressed in the other species; we did not identify a single case where orthologs were highly expressed in both species. The data shown in Table 3 also support this pattern. These results strongly suggest changes in the gene expression regulatory system due to divergence of the two species.
Next, we aligned the deduced amino acid sequences for the proteins belonging to the neoculin group (Fig. 6a). We divided the sequences into nine regions, including the regions removed by cleavage of the secretion signal peptide and three mannose binding site (MBS)-like regions: N pro-sequence (N-Pro), N-terminal (N-term), MBS1, inter1, MBS2, inter2, MBS3, C-terminal (C-term), and C pro-sequence (C-Pro). The His-11 residue was present in the N-term region of NBS and in the predicted proteins encoded by transcripts L_16562_c0_g1_i1 in C. latifolia and C_16324_c0_g1_i1 in C. capitulata. This site is essential for the pH-dependent taste-modifying activity of neoculin. By contrast, transcripts C_9931_c0_g1_i1 in C. capitulata and L_307_c0_g1_i1 and L_307_c0_g2_i1 in C. latifolia (abbreviated ‘C_9931 series’) did not code for His-11, which was replaced by Tyr-11, as in NAS. In addition, Cys-77 and Cys-109, which form an intermolecular disulfide bond between NBS and NAS, were present within the inter2 and C-term regions in both species, but were absent in the C_9931 series. Thus, it is likely that proteins corresponding to the C_9931 series do not form dimers.
Four residues are responsible for the binding and activation of the human sweet receptor: Arg-48, Tyr-65, Val-72, and Phe-94 [26]. Although Tyr-65 and Val-72 were identified in the C_9931 series, Arg-48 and Phe-94 were not conserved (Leu-48 and Val-94 occupied these positions). The lack of His-11 and of two of these four indispensable residues, as well as the lack of dimerization, indicates that the C_9931 series proteins may not possess the sweet taste or taste-modifying properties of classic neoculin. Indeed, a preliminary test indicated that C. capitulata fruits did not have a sweet taste or taste-modifying properties despite the high expression level of C_9931_c0_g1_i1 (data not shown). Three sites similar to the MBS were present in the MBS1, MBS2, and MBS3 regions of this protein. Moreover, whereas NBS and NAS lack the essential residues of the MBS, all of these residues were conserved in C_9931_c0_g1_i1, making C_9931_c0_g1_i1 a likely lectin candidate.
Based on this protein alignment, we investigated all amino acid substitutions in each region in comparison to the two reference sequences, NBS and NAS (Additional File 8). The amino acid substitution rate with reference to NBS is shown in the heatmap in Fig. 6b. Between the NBS series and the NAS series, the substitution rate ranged from 18% to 27% across the regions from N-term to C-term (23% overall; 26 of 114 residues in NBS). The highest substitution rate was 27% in the MBS2 region, followed by 24% in the inter2 and C-term regions. In the C_9931 series, the highest substitution rate was 53% in the C-term region, followed by the MBS3 region (44%) and inter2 region (43%). These results suggest that the region from inter2 to C-term is the main source of sequence diversity among neoculin group members.
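A per-region substitution rate of the kind summarized in Fig. 6b can be computed from two rows of the alignment as sketched below. The function names, region coordinates, and toy sequences are hypothetical. The sketch also assumes that positions where the reference has a gap are skipped and that a gap in the compared sequence counts as a difference, which may not match the authors' exact procedure.

```python
from typing import Dict, Tuple


def substitution_rate(ref: str, other: str) -> float:
    """Fraction of reference residues that differ in the aligned sequence."""
    if len(ref) != len(other):
        raise ValueError("sequences must come from the same alignment")
    # Skip alignment columns where the reference itself has a gap.
    pairs = [(r, o) for r, o in zip(ref, other) if r != "-"]
    if not pairs:
        raise ValueError("reference has no residues in this region")
    return sum(r != o for r, o in pairs) / len(pairs)


def rates_by_region(ref: str, other: str,
                    regions: Dict[str, Tuple[int, int]]) -> Dict[str, float]:
    """Substitution rate per named region; coordinates are 0-based, end-exclusive."""
    return {name: substitution_rate(ref[start:end], other[start:end])
            for name, (start, end) in regions.items()}


# Toy aligned sequences and region boundaries (not data from this study).
toy_rates = rates_by_region("ACDEFGHIKL", "ACDQFGHIRM",
                            {"N-term": (0, 5), "C-term": (5, 10)})

# Overall NBS-versus-NAS rate quoted in the text: 26 of 114 residues.
overall_pct = round(100 * 26 / 114)
```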
Biochemical analysis
We extracted proteins from C. latifolia and C. capitulata fruits and subjected them to SDS-PAGE, followed by Coomassie brilliant blue (CBB) staining and immunoblotting using a mixture of polyclonal anti-NAS and anti-NBS specific antibodies (Fig. 7). The CBB-stained gel is shown in Fig. 7a and the corresponding immunoblot in Fig. 7b. By CBB staining, we detected an 11-kDa band representing NBS and a 13-kDa band representing NAS in C. latifolia fruit samples (Fig. 7a). In C. capitulata fruits, some bands around 11 kDa may be the protein encoded by C_9931_c0_g1_i1, which had a high TPM value. Immunoblotting confirmed the identity of the bands corresponding to NBS and NAS in C. latifolia fruits. However, we detected no such bands in C. capitulata fruits (Fig. 7b), perhaps because NBS and NAS accumulate at very low levels in this species, as reflected by the low TPM values of their encoding transcripts (as described above). The amino acid sequence of the C-term region, which is recognized by the antibody, was also very different in C_9931_c0_g1_i1 compared to both NBS and NAS, which is consistent with the finding that the proteins detected by CBB staining were not detected by immunoblotting.