Transcriptome analysis and data validation
Hybrid-seq based omics analysis had made great advances in the fusion gene discovery  and genome assembling . The high quality of transcriptome sequencing and assembly represent a great resource for the study of non-model species. Neem genome data and transcriptome data were obtained using next-generation sequencing (NGS) technology, followed by a genome draft improvement using PacBio assembly. Given the complexity of neem genome, and just the complexity of secondary metabolites synthesis in general, both long- and short-read technologies at the highest level of accuracy were needed in order to propel the level of discover the elusive mechanisms lies behind. The sensitive and high-throughput Illumina HiSeq technique can add long-reads capabilities with PacBio SMRT assistance. The accuracy achieved with PacBio SMRT essentially is on par with what was achieved with sequencing by synthesis short-read technology, which was critically important for de novo sequencing and highly homologous regions of plant genomes.
The precursors, intermediates and even azadirachtin A were predominantly distributed and differentially concentrated in the seed kernels and leaves of neem . These secondary metabolites with plant defense activities were accumulated in order to survive the damage by pests during the critical period of fruit development and dry matter accumulation. Therefore, tissues of leaf and fruit are very likely to reveal most functional genes participated in the production of azadirachtin A or intermediates related to biosynthetic process.
To generate a comprehensive overview of neem transcriptome, total RNAs were extracted from leaves, fruits (containing seeds), roots, stems and flowers, then the mRNA was isolated, and cDNA libraries were established and sequenced separately using Illumina Hiseq platform, which generated 41.14, 41.35, 40.60, 40.59 and 40.64 million clean reads. PacBio SMRT platform were applied to mixed tissue sample (composed of these five tissues) to assist the short-read technology and 6.75 million clean reads were generated. The clean reads were individually generated with the average GC percentage of 40.99% and 43.55% by Illumina Hiseq and PacBio platforms. After calibration with long reads data generated by PacBio platform, short reads were assembled into 20201 good quality unigenes with an N50 of 5076 bp and mean length of 3607 bp (Table 2). The length of these unigenes ranged from 500 to 6001bp, and the majority (over 55.5%) were disturbed in 4501 bp and above (Figure 4a). However, there are still 8987 unigenes whose lengths were less than 4501 bp.
To verify the reliability of DEGs from transcriptome analysis, ten candidates were selected and validated by their expressions. Gene expressions determined using RNAseq and qRT-PCR were compared. Good concordance was observed between the qRT-PCR and RNAseq results (Figure 3), which indicated the high reliability of Illumina Hiseq data. To verify the reads accuracy from PacBio SMRT, these ten candidates were also selected to clone the sequences. The PCR sequencing results of selected unigenes showed that they were 100% identical to the sequences obtained by PacBio SMRT. Detailed sequencing results indicated that the generated unigenes in our study were of good quality and therefore reliable for further DEG analysis and functional annotation.
Functional annotation and classification
In order to analyze the function of these assembled unigenes, the unigenes were aligned against the Nr database, Nt database, SwissProt protein database, KEGG database, COG and GO database using the BLASTX program with an E-value cut-off of 10-5. In total, 19907 unigenes (98.54% of 20201 unigenes) were annotated in at least one database (Table S2). The annotated unigenes were compared to known nucleotide sequences of other plant species, which were best matched to the known nucleotide sequences from Citrus sinensis (52.43%), followed by Citrus clementine (23.49%), Theobroma cacao (2.44%), Vitis vinifera (2.4%) and other (19.23%).
COG and GO classification were used to further evaluate the completeness and effectiveness of the neem annotation. All 17634 unigenes (87.3% of 20201 unigenes) were classified into 25 functional Clusters of Orthologous Groups (COG) clusters (Figure 4b). Of those, 2610 unigenes (14.8% of the total 17634 classified unigenes) were categorized into general function prediction only cluster, which formed the largest group, whereas the clusters for replication, transcription, recombination and repair followed closely. Although only 378 unigenes were categorized into the “Secondary metabolites biosynthesis transport and catabolism” cluster (Figure 4c), they may play important roles in providing precursors in secondary metabolite biosynthesis.
In total, the 74905 assembled unigenes (66.3% of 113008 unigenes obtained in Hiseq sequencing analysis) were assigned at least one of the 55 GO terms (Figure 4b). Of those, these unigenes were predominantly assigned to the metabolic process (GO:0008152) and cellular process (GO:0009987). The unigenes categorized in the molecular function category were predominantly associated with catalytic activity (GO:0003824) and binding functions (GO:0005488). The unigenes categorized in the cellular component category were predominantly associated with membrane (GO: 0016020), cell part (GO: 0044464) and cell (GO: 0005623). These findings showed that the main COG and GO classifications for the fundamental biological processes were identified.
The KEGG pathway database was used to systematically evaluate the gene biological functions participated in different pathways. A total of 16778 unigenes were matched to the database and assigned to 135 KEGG pathways (Table S3). Metabolic pathways (3590 unigenes, 21.4%) and biosynthesis of second metabolites (1560 unigenes, 9.3%) were the two dominant categories. In the biosynthesis of secondary metabolites category (Table S4) in neem, subcategories of flavonoid biosynthesis, terpenoid backbone biosynthesis, steroid biosynthesis, diterpenoid metabolism and carotenoid biosynthesis were included. There were 242 unigenes involved in the metabolism pathways of terpenoids and polyketides.
Based on the KEGG database analysis, a total of 63 unigenes (0.38% of 16778 unigenes with pathway annotation) were associated with terpenoid backbone biosynthesis (Table S5), which included 13 mevalonate (MVA) pathway unigenes and 38 methylerythritol phosphate (MEP) pathway or 1-deoxy-D-xylulose-5-phosphate (DXP) pathway unigenes (Figure 6). Among them, three unigenes encoded squalene synthase and squalene epoxidase, respectively. Transcript/13723 and transcript/18081 encoding squalene synthase and they were up-regulated in both fruit and leaf. Transcript/17638 was found expressing highest in fruit. For unigenes encoding squalene epoxidase, transcript/15581 expressed high in flower and leaf while transcript/15615 expressed much higher in fruit and root. Transcript/21847 expressed the highest in fruit while expressed the lowest in leaf.
As a triterpenoid, the first step of azadirachtin A biopathway was the formation of its scaffold catalyzed by OSC. However, there was no exact information about the scaffold or key enzyme catalyzing its formation in neem. Among all unigenes involved in terpenoids biosynthesis, eight detected unigenes were annotated as OSC, including transcript/1784, transcript/1866, transcript/8176, transcript/8892, transcript/9751, transcript/14584 and transcript/19700 and transcript/14449. Among them, only transcript/14449 expressed higher in fruit and leaf than in other three tissues. While after phylogenetic analysis (Figure S1) of these OSCs with several identified OSCs from other plants, transcript/14449 was grouped with AiOSC1, a newly characterized OSC from neem catalyzing the formation of tirucalla-7,24-dien-3β-ol, while other unigenes were grouped with cycloartenol synthase. After amino acid sequence analysis with AiOSC1, transcript/14449 is 100% identical to AiOSC1, which means that these two genes were the same gene. It indicated that transcript/14449 could be a candidate gene for producing azadirachtin A scaffold.
Differential expression analysis of unigenes in neem
In order to find candidate genes in azadirachtin A biosynthesis pathway, all transcripts had been identified and annotated and mapped in different pathways. Genes expressed higher in azadirachtin A stimulating-organ leaf and fruit would be more likely to be the involved-genes in azadirachtin A biosynthesis. The expression level of unigenes was calculated by FPKM. The number of up-regulated DEGs in leaf and fruit comparing to other tissues were 219 and 397, respectively. All these DEGs were used for mining candidate involved in azadirachtin A biosynthesis.
First screening of candidate genes involved in azadirachtin A biosynthesis
From the putative azadirachtin A pathway in Figure 1, tirucalla-7,24-dien-3β-ol assumed as scaffold and formed from 2,3-oxidisqualene. After scaffold formation, a few important steps occurred such as hydroxylation and furan ring formation. Formation of hydroxyl group at C7 of azadirachtin A was critical for maintaining insecticidal activity . The hydroxyl groups were either oxidized to acid or acylated by acyltransferase or esterified by esterase/lipase and then further formed limonoid compounds like azadirone or nimbin. After multiple steps such as C ring clearage and endoperoxide biosynthesis, azadirachtin A was finally obtained. In our study, enzymes ADH, CYP450, ACT and EST were supposed to be involved in azadirachtin pathway and their encoded unigenes were chosen for mining candidates. According to the annotation of DEGs and their expression levels in five tissues, DEGs up-regulated in leaf and fruit tissues were used for mining azadirachtin A biosynthetic candidates.
Among all the DEGs in fruit and leaf transcriptome data, sixteen DEGs encoding ADH were discovered, and four DEGs (transcript/19291, transcript/18833, transcript/18482 and transcript/22186) were up-regulated in fruit and leaf. As for CYP450, sixteen DEGs encoding CYP450 and ten higher expressing DEGs (transcript/17636, transcript/17854, transcript/16057, transcript/17284, transcript/16777, transcript/17001, transcript/16577, transcript/16950, transcript/17057 and transcript/16971) were identified. Similarly, thirteen and nineteen DEGs encoding ACT and EST were isolated and there were four and twelve DEGs up-regulating in leaf and fruit tissues, respectively. There are 30 DEGs up-regulated in leaf and fruit. Sequences under 150 amino acids were excluded. Finally, the resulted 26 DEGs were selected and their amino acid sequences were used for further screening phylogenetic analysis and domain prediction.
Further screening of candidates through phylogenetic analysis and domain prediction
Phylogenetic trees were constructed (Figure 5) with MEGA7. Protein domains (Table 1) were predicted and assigned with HMMSCAN. Transcript/22186 was grouped with CADH4 . Transcript/18833 and transcript/19291 were grouped with CADH1. CADH1 and CADH4 are enzymes catalyzing biosynthesis of cinnamaldehyde from cinnamyl alcohol. Transcript/18482 was grouped with ADHX  which had activity to primary and secondary alcohols. Transcript/18833 and transcript/19291 contained PLN02514 domain which was also found in cinnamyl-alcohol dehydrogenase . Transcript/22186 contained nsLTP2  domain which was contained in non-specific lipid-transfer protein. Transcript/18482 contained GxGxxG motif  found in S-(hydroxymethyl) glutathione dehydrogenase.
According to the phylogenetic analysis of CYP450 DEGs, they were divided into three clades. Among them, transcript/17001, transcript/16777 were grouped with CYP71AV1, and transcript/16971 was grouped with CYP94B1. CYP71AV1 was involved in artemisin biosynthesis and catalyzed three continuous oxidations of amorpha-4,11-diene to produce artemisinic acid . CYP94B1  catalyzed the hydroxylation at C12 of jasmonyl-L-amino acid. Transcript/17284, transcript/17057, transcript/17636 and transcript/17854 were contained in second clade. Transcript/17057 fell into a group with CYP82C4 , an enzyme hydrolyzed xanthotoxin (8-methoxypsoralen) into 5-hydroxyxanthotoxin. Transcript/17284 grouped with CYP94B3  and it revealed that transcript/17284 may act as a hydroxylase. Transcript/17636 and transcript/17854 were divided into a group with uncharacterized CYP98A1. Transcript/16950 and transcript/16577 were grouped with a hydroxylase CYP71D1  and CYP72A14, respectively. Transcript/16057 grouped with CYP72A63 which catalyzed three sequential oxidations at C-30 of 11-oxo-beta-amyrin . Through phylogenetic analysis, transcript/17001, transcript/16777 and transcript/16057 may act as oxidases, while the others were likely to be hydroxylases.
According to CYP450 domain analysis, transcript/17636 and transcript/16971 contained the same CypX domain  with CYP81B1. P450-cyclo_AA_1 domain in transcript/16950 was also found within cytokinin trans-hydroxylase . PLN00168 domain  was found in transcript/17824 and transcript/19854, and PLN02687 domain (a domain in flavonoid 3'-monooxygenase ) was also contained by transcript/16057, transcript/16577 and transcript/17057. Transcript/16777 and transcript/17001 contained PLN02774 domain, which was often found in brassinosteroid-6-oxidase . Therefore, four DEGs (transcript/16057, transcript/16577 and transcript/17057 and transcript/17001) contained domains within CYP450 oxidases and the others contained domain within CYP450 hydroxylase
Among all ACT DEGs, transcript/18186 was grouped with ARE1, the enzyme encoding sterol O-acyltransferase . Transcript/18214 fell into a subclade with LPAT2, an acyl-sn-glycerol-3-phosphate acyltransferase of Brassica oleracea . Transcript/17792 was grouped with HCT2 which catalyzing the transfer of an acyl from p-coumaroyl-CoA to various acyl acceptors. Transcript/19132 and TSM1  were in a group which reveal that transcript/19132 was likely to be methyltransferase. Through domain analysis, transcript/17792 and transcript/18214 contained HXXXD domain that often found in BAHD family . Transcript/19132 contained same domain PLN02177 with glycerol-3-phosphate acyltransferase . Transcript/18186 and eukaryotic initiation factor 4B  had the same eIF-4B domain.
For EST DEGs after phylogenetic analysis, transcript/19998 and transcript/16750 grouped with KAI2 . KAI2 was reported to be involved in seed germination and didn’t show activity as esterase. Transcript/19188 was divided into a subclade with A. thaliana GDL15, which belonged to GDSL -like lipase/acylhydrolase superfamily and displayed hydrolytic activity with esters. Transcript/18100 was grouped with PME3, a pectinesterase catalyzed the hydrolysis of (1,4)-α-D-galacturonosyl methyl ester . Transcript/19751 and transcript/19748 formed a tight subclade with TGL1which was a sterol esterase and mediated the hydrolysis of steryl esters . Transcript/19882 and transcript/19697 were in a group with HIDH  and CXE18 , respectively and these two enzymes shew activity to carboxylic esters. The conserved domain analysis of EST candidates presented transcript/19188 contained Ser-His-Asp (Glu) triad found in a SNGH plant lipase . Transcript/19882, transcript/19697, transcript/19748 and transcript/19751 had AES domain  which contained by acetyl esterase/lipase. PAE  domain in transcript/18100 was also found in pectin acetylesterase while transcript/16750 and transcript/19998 contained the plant pectinesterase inhibitor domain PLN02201.
According to results from phylogenetic analysis and domain prediction, three ADH DEGs (transcript/22186 was excluded) and ten CYP450 DEGs were finally selected as candidates although there were some differences in results from the phylogenetic tree and domain prediction. As for ACT DEGs, transcript/18214 and transcript/17792 were selected as candidates. Among DEGs for EST candidates, transcript/16750 and transcript/19998 were excluded from the candidate pool. Finally, 22 DEGs were isolated and chosen as candidates involved in azadirachtin A biosynthesis.
Measurement of unigenes expression in secondary metabolite pathways in neem
The expressions of unigenes involved in secondary metabolites including three terpenoids, two sterols and putative azadirachtin A downstream pathway in five neem tissues were analyzed based on the KEGG annotation and FPKM (Figure 6). There are two pathways for the biosynthesis of terpenoids precursor IPP (isopentenyl pyrophosphate), one is MVA pathway, and the other is MEP pathway. Among all unigenes, 13 unigenes were found to be related to MVA pathway and 38 unigenes related to MEP pathway (Table S5). They encoded some enzymes involved in IPP biosynthesis in MVA and MEP pathway. Some of them ((unigenes encoding mevalonate kinase (MVK), 1-deoxy-D-xylulose-5-phosphate synthase (DXPS) and 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase (MDS)) were expressing higher in leaf and fruit comparing with the other three tissues.
GPP (geranyl pyrophosphate) was the common intermediate of all monoterpenoids and it formed by IPP catalyzed by Geranyl diphosphate synthase (GPPS). Unigenes encoding four enzymes in monoterpenoids myrcene, limonene and terpineol biosynthetic pathways expressed high in flower and fruit. As for diterpenoid (E, E)-4, 8, 12-trimethyltrideca-1, 3, 7, 11-tetraene (TMTT) biosynthesis, unigene encoding Geranylgeranyl diphosphate synthase (GGPPS) and CYP82G1 expressed higher in flower and stem followed by in fruit and leaf, respectively.
Expression pattern of unigenes involved in in sesquiterpenoid and triterpenoid were also measured. Farnesyl diphosphate synthase (FDS) catalyzed formation of farnesyl pyrophosphate (FPP) from IPP and unigene encoding FDS expressed highest in fruit followed by in root and flower. Solavetivol pathway is the sesquiterpenoid pathway detected in neem and unigene encoding solavetivol synthase (CYP71D55) expressed higher in fruit and leaf. As for triterpenoids biosynthesis, unigenes involved in 2,3-oxidosqualene biosynthesis from FPP were measured. Two FPPs were catalyzed by squalene synthase (SQS) and then formed into 2,3-oxidosqualene through squalene epoxidase (SQLE). Unigenes encoding SQS and SQLE were found to express the highest in leaf.
Plants contain various kinds of terpenoids, polyketides and sterols. Campesterol, stigmasterol and β-sitosterol had also been successfully isolated from neem . They are important sources for the production of heterocyclic compounds. In this study, unigenes encoding cycloartenol synthase (CAS1) were identified. The expression level of the unigenes in different tissues were in the order of fruit> stem>root. Unigenes encoding enzymes ((delta14-sterol reductase (TM7SF2) and CYP51G1)) respectively were relatively high in fruit or leaf while the expression level was not the same for unigene encoding methylsterol monooxygenase (SMO1) and sterol-4alpha-carboxylate 3-dehydrogenase (NSDHL).
The expression level of unigenes encoding enzyme in putative azadirachtin A pathway had been presented. Transcript/14449 encoded the first enzyme in azadirachtin A downstream pathway and it expressed highest in fruit followed by in leaf. After scaffold synthesis, the deletion of a methyl group at C14 occurred and catalyzed by enzyme encoded by transcript/16198. Transcript/16198 shew similar expression pattern with transcript/14449. Secondary alcohol at C3 was continuously oxidized into ketone group by transcript/18725 and transcript/17679, two unigenes expressed highest in fruit tissue. Intermediates with C3 ketone were more common in the isolated compounds . And next important step involved in azadirachtin A was the formation of furan ring. Transcript/16971 and transcript/16742 encoded two CYP450s producing compound melianone  which was common in. members of Meliaceae family such as Melia azedarach  and Melia toosenda . With some uncharacterized enzymes, melianone was modified into intermediate with furan ring as well as the alcohol at C7. The C7-hydroxylased compound was further modified by esterase which encoded by unigenes transcript/18100 (highest-expressed in fruit and leaf) and finally became azadirachtin A under several reactions.
After analysis of the expression pattern of unigenes in secondary metabolites pathway in neem, the DEGs involved in these pathways were obtained. As a kind of terpenoid, the synthesis of azadirachtin A starts from IPP. The increase of precursor leads to the higher production of terpenoid. In the case of artemisinic acid production, improvement of terpenoid precursor by engineering MVA pathway made 500 time increases in yield . Thus, the up-regulated DEGs in MVA or MEP pathway and 2,3-oxidosqualene biosynthesis from IPP would be used as building blocks in the construction of azadirachtin A precursor biosynthetic pathway in the microorganism.
Unigenes up-regulated in both leaf and fruit (azadirachtin A producing tissues) were selected as candidates. While some unigenes which expressed the highest either in leaf or in fruit were also selected for further characterization. For example, one unigene transcript/16198 encoded sterol 14-demethylase catalyzing deletion of methyl group of C14 in sterols. It was chosen as a candidate and supposed to catalyzed removement of methyl group at C14 of tirucalla-7,24-dien-3β-ol because of the structural similarity between sterols and tirucalla-7,24-dien-3β-ol.
Although reaction types and key enzymes were partially proposed based on the structural differences between intermediates in our putative azadirachtin A pathway, some information was still missing. For instance, we couldn’t find the enzyme in hydroxylation reaction at C7 site from neem. What’s more, the order of reactions during the synthesis from azadirone to azadirachtin A was not clear. Neither the numbers of reactions nor the catalysis type were characterized. These limitations of pathway lead to insufficient mining on neem transcriptome data. This might be one of the reasons for slow progress in azadirachtin pathway exploration although numerous available neem genome and transcriptome data.