To generate a comprehensive overview of neem transcriptome, total RNAs were extracted from leaves, fruits (containing seeds), roots, stems, and flowers. To obtain the transcriptome data, we used the hybrid-seq technology that combines Illumina HiSeq and PacBio SMRT sequencing data and corrects the errors in long reads with short reads [24]. Different neem tissues (leaves, flowers, stems, fruits (containing seeds) and roots) were sequenced separately using Illumina HiSeq platform and generated 41.14, 41.35, 40.60, 40.59 and 40.64 million clean reads, respectively. PacBio SMRT platform produced 6.75 million clean reads. After calibration with short reads from Illumina platform, the assembled unigenes were corrected with an N50 of 5076 bp and mean length of 3607 bp (Table 2). The obtained assembly was 2.5 times longer than in the previous report [14]. The length of these unigenes ranged from 500 to 6001 bp. The majority (over 55.5%) of reads were distributed in the range of 4501 bp and above (Figure 2a).
Table 2. Summary of Illumina and PacBio output data quality and assembled sequences of libraries of neem.
Sequencing Platform
|
Sample
|
Reads Number (M)
|
Clean Bases (G)
|
Q20 (%)
|
Q30 (%)
|
Total number of unigene
|
Total length of unigene (bp)
|
Mean length of unigene (bp)
|
N50
|
GC (%)
|
Ilumina HiSeq
|
Leaf
|
41.14
|
6.17
|
96.95
|
92.30
|
50394
|
82508501
|
1380
|
2201
|
40.64
|
Flower
|
41.35
|
6.20
|
97.29
|
93.06
|
62426
|
95564612
|
1530
|
2335
|
40.13
|
Stem
|
40.60
|
6.09
|
97.29
|
93.11
|
65762
|
98612970
|
1499
|
2372
|
40.39
|
Fruit
|
40.59
|
6.09
|
97.45
|
93.39
|
66668
|
81799852
|
1226
|
2077
|
41.93
|
Root
|
40.64
|
6.10
|
97.44
|
93.43
|
45459
|
57083335
|
1255
|
2100
|
41.10
|
Hiseq summary
|
—
|
—
|
—
|
—
|
—
|
113008
|
175268545
|
1550
|
2599
|
40.99
|
PacBio SMRT
|
Mixed tissue
|
6.75
|
11.28
|
—
|
—
|
22884
|
82035635
|
3584
|
5068
|
43.55
|
Calibrated PacBio by Illumina
|
|
—
|
—
|
—
|
—
|
20201
|
72872459
|
3607
|
5076
|
43.55
|
Q20 and Q30 on Illumina platform correspond to the predicted base call error rate of 1 % and 0.1%, respectively.
N50: The minimum contig length needed to cover 50% of the transcriptome.
To confirm the accuracy of the hybrid-seq (FPKM) results, we selected 10 unigenes and used qRT-PCR to determine their relative expression (Figure S2). The qRT-PCR and FPKM results were consistent except for transcript/19882 and transcript/16577. Their expression from RNA-seq (FPKM) and relative expression level by qRT-PCR in root and stem were inconsistent. The inconsistency between qRT-PCR and FPKM in transcriptome analysis has also been reported by other researchers. In the study by Zhang [30], the FPKM expression levels of 8 out of 58 genes from transcriptome were inconsistent with their qRT-PCR results. In another study comparing the gene expression fold changes in the samples of Metarhizium acridum CQMa102, approximately 86.2% of the genes showed consistent results between RNA-sequencing and qRT-PCR data [31]. The inconsistency between FPKM and qRT-PCR may result from multiple reasons. Although qRT-PCR and RNA-seq are both used to measure gene expression, the unit of measurement [32] as well as the computing method are different for FPKM and qRT-PCR. Many factors affect the accuracy of FPKM and qRT-PCR. Bias in PCR amplification [33] and RNA-seq library preparation [34] and sequencing adds noise to the RNA-seq data. The quality of the mRNA, amplification efficiency, and the choice of reliable internal controls referred to as reference genes affect the accuracy of qRT-PCR [35]. Therefore, the consistency of transcript/16577 and transcript/19882 between qRT-PCR and FPKM was acceptable. Further examinations need to be performed on these two unigenes.
Ten unigenes were cloned by PCR using primers listed in Additional file 1. According to the sequencing results of the cloned unigenes, each of them was 100% identical to the sequences obtained from hybrid-seq platform. Among them, 4 unigenes attracted our attention since they contained a complete ORF of the genes. One of them (transcript/14554) was annotated as the neem NADPH-cytochrome P450 reductase 2. The cloned transcript/14449 was found to encode an OSC consisting of 760 amino acids. The length of two CYP450 unigenes (transcript/16971 and transcript/16742) was 1536 bp and 1527 bp, and they encoded proteins consisting of 511 and 508 amino acids, respectively. Detailed sequencing results indicated that the unigenes in our study had good accuracy and were therefore reliable for further analyses.
Functional annotation and classification of unigenes
A total of 19,907 unigenes (98.54% of 20,201 unigenes) were annotated in at least one database (Table S3). The annotated unigenes were compared to known nucleotide sequences of other plant species. They best matched to the known nucleotide sequences from C. sinensis (52.43%), Citrus clementine (23.49%), Theobroma cacao (2.44%), Vitis vinifera (2.4%), and others (19.23%).
Clusters of Orthologous Groups of proteins (COG) and Gene Ontology (GO) classification were used to further evaluate the completeness and effectiveness of the neem annotation. All 17,634 unigenes (87.3% of 20,201 unigenes) were classified into 25 functional COGs (Figure 2b). 2610 unigenes (14.8% of the total 17,634 classified unigenes) were categorized into general function prediction only cluster, which formed the largest group, whereas the clusters for replication, transcription, recombination, and repair followed closely. Although only 378 unigenes were categorized into the “Secondary metabolites biosynthesis transport and catabolism” cluster, they may play important roles in providing precursors for secondary metabolite biosynthesis.
A total of 13,453 assembled unigenes (66.6% of 20,201 unigenes) were assigned at least one of the 55 GO terms (Figure 2c). These unigenes were predominantly assigned to metabolic process (GO:0008152) and cellular process (GO:0009987). The unigenes categorized in the molecular function category were predominantly associated with catalytic activity (GO:0003824) and binding functions (GO:0005488). The unigenes categorized in the cellular component category were predominantly associated with membrane (GO: 0016020), cell part (GO: 0044464), and cell (GO: 0005623). These findings showed that the main COG and GO classifications for the fundamental biological processes were identified.
The Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway database was used to systematically evaluate the gene biological functions in different pathways. A total of 16,778 unigenes were matched to the database and assigned to 135 KEGG pathways (Table S4). Metabolic pathways (3590 unigenes, 21.4%) and biosynthesis of secondary metabolites (1560 unigenes, 9.3%) were the two dominant categories. In the biosynthesis of secondary metabolites category (Table S5) in neem, subcategories of flavonoid biosynthesis, terpenoid backbone biosynthesis (Table S6), steroid biosynthesis, sesquiterpenoid and triterpenoid biosynthesis (Table S7), diterpenoid metabolism, and carotenoid biosynthesis were included. There were 242 unigenes involved in the metabolic pathways of terpenoids and polyketides.
Differential expression analysis of unigenes in neem
In order to find candidate genes in the azadirachtin A biosynthesis pathway, all transcripts had been identified, annotated, and mapped in different pathways. Genes with a higher expression in the tissues with high azadirachtin A content, such as fruits and leaves, are more likely to be the involved in azadirachtin A biosynthesis. The expression level of unigenes was calculated by the FPKM method. The upregulated DEGs in leaf and fruit were unigenes which higher-expressed in leaf or fruit compared to the other tissues. Up-regulated DEGs in fruit and leaf were 219 and 397, respectively. These DEGs were used for mining candidates involved in azadirachtin A biosynthesis.
First screening of candidate genes involved in azadirachtin A biosynthesis
According to the putative azadirachtin A pathway in Figure 1, tirucalla-7,24-dien-3β-ol is assumed as the scaffold formed from 2,3-oxidosqualene. A few steps such as hydroxylation and furan ring formation occur after scaffold formation. The hydroxyl groups are then either oxidized to acid or acylated or esterified to esters, forming limonoid compounds like azadirone or nimbin. Azadirachtin A is finally obtained after modifications on azadirone or nimbin. ADH, CYP450, ACT, and EST are supposed to be involved in azadirachtin downstream pathway and thus their encoding-unigenes were chosen as candidates.
As a triterpenoid, the first step of azadirachtin A biopathway was the formation of its scaffold catalyzed by OSC. Among all unigenes involved in terpenoid biosynthesis, 8 detected unigenes were annotated as OSC, including transcript/1784, transcript/1866, transcript/8176, transcript/8892, transcript/9751, transcript/14584, transcript/19700, and transcript/14449. Among them, only transcript/14449 expressed higher in fruit. After phylogenetic analysis (Figure 3) of transcript/14449 with several characterized OSCs from other plants, transcript/14449 was grouped with AiOSC1 [18], a newly characterized OSC from neem, catalyzing the formation of tirucalla-7,24-dien-3β-ol. The other unigenes were grouped with cycloartenol synthase. After DNA sequence analysis with AiOSC1(Figure S3), transcript/14449 is 100% identical to AiOSC1, which means that these two genes were the same gene. It indicated that transcript/14449 could be a candidate gene for producing the azadirachtin A scaffold.
Among all the DEGs in fruit and leaf transcriptome data, DEGs encoding ADH, CYP450, ACT and EST were discovered. There were sixteen DEGs encoding ADH. Four up-regulated DEGs encoding ADH were selected for further screening. As for CYP450, sixteen DEGs encoded CYP450 and ten DEGs were selected. Similarly, thirteen and nineteen DEGs encoded ACT and EST respectively and there were four and twelve DEGs up-regulating in leaf and fruit tissues, respectively. DEGs with same sequence or sequences within 150 amino acids were excluded in further screening.
Among all the DEGs in fruit and leaf transcriptome data, DEGs encoding ADH, CYP450, ACT, and EST were identified. There were 16 DEGs encoding ADH, 4 upregulated DEGs encoding ADH were selected for further screening. There were 16 DEGs encoding CYP450, 10 out of these DEGs were selected. Similarly, 13 and 19 DEGs encoded ACT and EST, respectively. There were 4 and 12 DEGs upregulated in leaf and fruit tissues, respectively. DEGs with the same sequence or sequences coding for proteins of length less than 150 amino acids were excluded from further screening.
Further screening of the other four enzymes through phylogenetic analysis and domain prediction
Phylogenetic analysis (Figure 4) and protein domain prediction were used for further screening of these DEGs. Both, transcript/22186 and transcript/18833, were grouped with cinnamyl-alcohol dehydrogenase 4 (CADH4) [36] and transcript/19291 was grouped with CADH1. CADHs catalyze the biosynthesis of cinnamaldehyde from cinnamyl alcohol. Transcript/18482 was grouped with ADHX [37] which showed activity to primary and secondary alcohols. Transcript/18833 and transcript/19291 contained the PLN02514 domain that was also found in CADH [38]. Transcript/22186 contained the nsLTP2 [39] domain that is present in non-specific lipid-transfer protein. Transcript/18482 contained the GxGxxG motif [40] found in S-(hydroxymethyl) glutathione dehydrogenase. Transcript/18833 and transcript/19291 were selected as candidates for further examination.
According to the phylogenetic analysis of unigenes encoding CYP450, 5 transcripts (transcript/16057, transcript/16577, transcript/16950, transcript/16777, and transcript/17001) were grouped with members in CYP71 [41] and CYP72 [42] clades, whose members are reported to be involved in terpenoid biosynthesis. Transcript/16971 was grouped with CYP94B1 [43], an enzyme catalyzing the hydroxylation at C12 of jasmonyl-L-amino acid. Transcript/17284, transcript/17057, transcript/17636, and transcript/17854 were grouped in another clade. Transcript/17057 fell into a group with CYP82C4[44], an enzyme hydrolyzing xanthotoxin (8-methoxypsoralen) into 5-hydroxyxanthotoxin. Transcript/17284 was grouped with CYP94B3 [45], this revealed that transcript/17284 may act as a hydroxylase. Transcript/17636 and transcript/17854 were classified into a group with uncharacterized CYP98A1.
According to CYP450 domain analysis, transcript/17636 and transcript/16971 contained the same CypX domain [46] as CYP81B1. The P450-cyclo_AA_1 domain in transcript/16950 was also found within cytokinin trans-hydroxylase [47]. The PLN00168 domain [48] was found in transcript/17824 and transcript/19854. The PLN02687 domain (a domain in flavonoid 3'-monooxygenase [49]) was also contained by transcript/16057, transcript/16577, and transcript/17057. Transcript/16777 and transcript/17001 contained the PLN02774 domain, which is often found in brassinosteroid-6-oxidase [50]. Therefore, 5 DEGs (transcript/16057, transcript/ 16577, transcript/17057, transcript/16777, and transcript/17001) contained domains found within CYP450 oxidases and the others contained domain found within CYP450 hydroxylase.
Among all ACT DEGs, transcript/18186 was grouped with ARE1, the enzyme encoding sterol O-acyltransferase [51]. Transcript/18214 fell into a subclade with a DCR, a member of BAHD acyltransferase from A. thaliana involved in cutin biosynthesis [52]. Transcript/17792 was grouped with ARE2, a sterol O-acyltransferase from Candida albicans [53]. Transcript/19132 and TSM1 were in a group which reveal that transcript/19132 was likely to be methyltransferase. Through domain analysis, transcript/17792 and transcript/18214 were shown to contain the HXXXD domain that is often found in the BAHD ACT family [52]. Transcript/19132 contained the domain PLN02177, also found in glycerol-3-phosphate acyltransferase [55]. Transcript/18186 and eukaryotic initiation factor 4B [56] had the same eIF-4B domain. Therefore, after combining phylogenetic analysis and domain prediction, transcript/17792 and transcript/18214 were selected as candidates of ACT involved in azadirachtin A biosynthesis.
Upon phylogenetic analysis of all EST DEGs, transcript/19998 and transcript/16750 were grouped with KAI2 [57]. KAI2 has been reported to be involved in seed germination and did not show esterase activity. Transcript/19188 was divided into a subclade with A. thaliana GDL15, that belonged to GDSL-like [58] lipase/ acylhydrolase superfamily and displayed hydrolytic activity with esters. Transcript/18100 was grouped with PME3, a pectinesterase catalyzing the hydrolysis of (1,4)-α-D-galacturonosyl methyl ester [59]. Transcript/19748 formed a tight subclade with TGL1, which is a sterol esterase mediating the hydrolysis of steryl esters [60]. Transcript/19882 and transcript/19697 were in the same group as HIDH [61] and CXE18, respectively and these two enzymes show activity to carboxylic esters. The conserved domain analysis of EST candidates presented that transcript/19188 contained the Ser-His-Asp (Glu) triad found in an SNGH plant lipase [62]. Transcript/19882, transcript/19697, and transcript/19748 had the AES domain [63] which is also contained within acetyl esterase/lipase. PAE domain [64] in transcript/18100 was also found in pectin acetylesterase while transcript/16750 and transcript/19998 contained the plant pectinesterase inhibitor domain PLN02201. Therefore, transcript/16750 and transcript/19998 were removed from the list of candidates after phylogenetic and domain analysis.
Molecular Docking analysis of CYP450s
The active site prediction in the 5 CYP450s was performed and the results are listed in Table S8. To further analyze the interactions between CYP450s and the four ligands, molecular docking was performed with Autodock 4.0. The details of docking are listed in Table S9 and specific interactions are displayed in Figure 5. Analysis revealed that binding energy was lowest in case of CYP16057 docked with tirucalla-7,24-dien-3β-ol forming zero hydrogen bond. However, azadirone and nimbolide formed stable complexes with CYP16057 with one hydrogen bond with −7.20 and −6.40 kcal/mol of binding energies (Figure 5 and Table S9), respectively. The docking analysis for CYP16577 revealed that among all the ligands, binding energy was lowest for tirucalla-7,24-dien-3β-ol and nimbin with −10.07 and −9.83 kcal/mol, respectively, forming zero and two hydrogen bonds, respectively. Docking of CYP16577 with triterpenoids showed interaction through one hydrogen bond with binding energy of −9.42 and −9.16 kcal/mol for azadirone and nimbolide, respectively (Figure 5 and Table S9). The binding energy of CYP16777 docked with azadirone and tirucalla-7,24-dien-3β-ol was −9.96 and −9.75 kcal/mol, respectively, forming one hydrogen bond each. However, nimbolide and nimbin formed stable complexes with three and two hydrogen bonds respectively, indicating that the conformation of nimbolide is best suitable for CYP16777 when the number of hydrogen bonds formed between protein and ligand is set as the criterion. The hydrogen bonds between nimbolide and CYP16777 are formed at PHE354, ARG355, and ARG419, with ligand moiety at different positions (Table S9) with varying bond length (Figure 5). The docking analysis for CYP16950 revealed that among all the ligands, binding energy was lowest for tirucalla-7,24-dien-3β-ol and azadirone with −8.31 and −7.90 kcal/mol without forming hydrogen bond. However, nimbin and nimbolide formed stable complexes with CYP16950 with one hydrogen bond with −6.32 and −6.66 kcal/mol binding energies, respectively (Figure 5 and Table S9). Docking of CYP17001 with triterpenoids showed interaction through only one hydrogen bond with the lowest binding energy of −10.11 kcal/mol for tirucalla-7,24-dien-3β-ol. Both, azadirone and nimbolide, formed stable complexes with CYP17001 through two hydrogen bonds. The hydrogen bonds between nimbin and CYP17001 are formed at ARG355, ARG419, and GLY423 with the binding energy of −9.82 kcal/mol (Figure 5 and Table S9).