Transcriptome analysis of developing castor bean seeds and identication of ricinoleic acid biosynthesis genes

Background Ricinoleic acid is a kind of unsaturated fatty acid in castor oil with wide application value. It is sourced from R. communis seed oil, where it is present in large amounts. However, there is little transcriptomic information on genes related to ricinoleic acid biosynthesis in castor bean. Results In order to better understand the regulation mechanism of ricinoleic acid biosynthesis, immature seeds at three developmental stages (15, 30, and 45 days after pollination) were collected. The results indicated that the accumulation of castor oil and ricinoleic acid increased gradually during seed development, and reached the maximum value at the late stages of seed development (45 days after pollination). Furthermore, RNA sequencing was conducted to analyze the transcriptome of the developing seeds at three developing stages. Totals of 9,875 differentially expressed genes were identied among the three time points. Based on the annotation information, 49 DEGs related to lipid biosynthesis were screened among all DEGs. Through cluster analysis of the 49 DEGs, ten genes with increasing FPKM values from seed development stages S1 to S3 were selected as candidate key enzymes, since they showed similar patterns of increase with castor oil accumulation and ricinoleic acid biosynthesis during seed development. The transcriptomic data of the 10 candidate key enzyme genes was further validated by qRT-PCR. Ultimately, a putative model of key genes correlated with ricinoleic acid accumulation was built. Our study identied a series of key genes and revealed the proposed molecular mechanism of ricinoleic acid accumulation in castor seeds through the transcriptional analysis. It broadens our knowledge of ricinoleic acid biosynthesis and castor oil accumulation and also provides a theoretical foundation for the genetic engineering key genes that can improve the ricinoleic acid production in castor bean as well as other plants.


Abstract
Background Ricinoleic acid is a kind of unsaturated fatty acid in castor oil with wide application value. It is sourced from R. communis seed oil, where it is present in large amounts. However, there is little transcriptomic information on genes related to ricinoleic acid biosynthesis in castor bean.
Results In order to better understand the regulation mechanism of ricinoleic acid biosynthesis, immature seeds at three developmental stages (15,30, and 45 days after pollination) were collected. The results indicated that the accumulation of castor oil and ricinoleic acid increased gradually during seed development, and reached the maximum value at the late stages of seed development (45 days after pollination). Furthermore, RNA sequencing was conducted to analyze the transcriptome of the developing seeds at three developing stages. Totals of 9,875 differentially expressed genes were identi ed among the three time points. Based on the annotation information, 49 DEGs related to lipid biosynthesis were screened among all DEGs. Through cluster analysis of the 49 DEGs, ten genes with increasing FPKM values from seed development stages S1 to S3 were selected as candidate key enzymes, since they showed similar patterns of increase with castor oil accumulation and ricinoleic acid biosynthesis during seed development. The transcriptomic data of the 10 candidate key enzyme genes was further validated by qRT-PCR. Ultimately, a putative model of key genes correlated with ricinoleic acid accumulation was built.
Conclusion Our study identi ed a series of key genes and revealed the proposed molecular mechanism of ricinoleic acid accumulation in castor seeds through the transcriptional analysis. It broadens our knowledge of ricinoleic acid biosynthesis and castor oil accumulation and also provides a theoretical foundation for the genetic engineering key genes that can improve the ricinoleic acid production in castor bean as well as other plants.

Background
Castor bean (Ricinus communis L.) which belongs to the spurge family Euphorbiaceae, is a kind of oil crops with very high economic value. Its seeds contain more than 50% of an unusual oil with many industrial uses.
The oil is particularly rich in ricinoleic acid (~ 80%), which is a high-value hydroxy fatty acid that emerging as a raw material for high-grade lubricating oil production.
However, the oil content of different castor bean varieties varies greatly, and a knowledge-based molecular breeding system has not been established. The yield of castor oil has always been relatively low because of inadequate traditional planting methods, which hampered the broader use of castor oil. Although the castor oil biosynthetic pathway has been elucidated by Lin et al [1], as well as Broun et al [2], among others [3,4], there is a lack of molecular research on improving the castor oil yield.
Illumina RNA sequencing has promptly spread into the most important and commonly used approach for exploring functional genes, screening differentially expressed genes, or identifying new genes, in both animals and plants [24][25]. In plants, RNA transcriptome pro ling is also widely used to provide valuable data on new genes, highly expressed genes, SSR, differentially expressed genes and etc.. Many such studies were done in Arabidopsis [26], rice [27], maize [28] and R. communis. Chandrasekaran et al. (2014) [29] identi ed ABA mediated regulatory changes in the process of oil accumulation in developing castor seeds through transcriptome pro ling. Han et al. (2020) [30] found autophagy-related genes in castor bean by RNA-sEq. Tian et al. (2019) revealed a series of genes related to biosynthesis via transcriptome analysis in Hiptage benghalensis [31].
In this study, we aimed to better understate the regulation mechanism of ricinoleic acid biosynthesis and castor oil accumulation by transcriptional pro ling. A high oil-content variety of castor bean was chosen and seeds were harvested at three developmental stages [15,30, and 45 days after pollination (DAP)] for ricinoleic acid analysis. Additionally, the developing castor seeds transcriptome pro les of developing seeds and candidate key enzyme genes of were summarized by second-generation sequencing (SGS). Furthermore, a pool of differentially expressed genes encoding key enzymes that possibly regulate ricinoleic acid accumulation were identi ed and the veri cation of their expression pro les with related genes by quantitative real-time reverse transcription PCR (qRT-PCR) was analyzed. Finally, a putative model of ricinoleic acid biosynthesis and castor oil accumulation was built. The results provide a theoretical foundation for the genetic engineering and key genes that can improve the ricinoleic acid production in castor bean as well as other plants.

Results
Castor oil and ricinoleic acid accumulation during seed development in R. communis seeds The oil content of mature seeds from 213 R. communis varieties was analyzed, and the strain 93 − 10 with the highest seed oil content (about 69.2%) was selected. As displayed in Fig. 1, the castor oil and ricinoleic acid accumulation in R. communis seeds at three developmental stages (15,30, and 45 DAP) were analyzed. The seed oil content quickly increased from stage 1 (15 DAP, S1) to stage 2 (30 DAP, S2) (53.8%), and then gradually increased to 61.7% at stage 3 (45 DAP, S3). The ricinoleic acid content showed a similar trend, increasing rapidly to 64.5% from S1 to S2, and reached the maximum of 74.8% at S3 (Fig. 1).
Reads from mature mRNAs should be aligned to exon regions. The comparison of reads to introns is due to the retention of mRNA precursors and introns with variable splicing. Some reads matched to intergenic regions due to incomplete genome annotation. As shown in Fig. 2, most of the reads were aligned to exon regions, 87.56% (S1, Fig. 2a), 87.09% (S2, Fig. 2b) and 87.56% (S3, Fig. 2c) in each library.
Global gene expression and SNP/InDel analysis Overall,19,442,19,143, and 18,941 expressed genes were identi ed at the S1, S2 and S3 stage, respectively (Fig. 3a). In all, there were 20,423 genes were identi ed during the entire seed development (Fig. 3b). Among these, 17,872 genes were continuously expressed from S1 to S3, while conversely, 1,192 genes (529 in S1, 367 in S2 and 296 in S3) were uniquely expressed at unique developmental stage. At stage 1, it owned the highest abundance of developmental stage-speci c genes (529), which suggested that a considerable number of genes regulating oil accumulation are concentrated in S1.Furthermore, serious genes (657 in S1 and S3, 318 in S2 and S3, 384 in S1 and S3) were found 657 to be expressed simultaneously at two stages. (Fig. 3b).
We identi ed a total of 130,461 polymorphisms (97,932 SNPs and 32,529 InDels) at the three time points. The annotated information showed that the top locations for both SNPs (Fig. 4a) and InDels (Fig. 4b) were intergenic, upstream and downstream of genes.
Identi cation of Differentially expressed genes related to ricinoleic acid biosynthesis during R. communis seed development A total of 9,875 (4,955 up-and 4,920 down-regulated) genes were differentially expressed during R. communis seed development. All the identi ed differentially expressed genes (DEGs) were searched against six databases, the Gene Ontology (GO), Cluster of Orthologous Group (COG), Protein family (Pfam), Kyoto Encyclopedia of Genes and Genomes (KEGG), the manually annotated and reviewed protein sequence database (SwissProt) and non-redundant (NR) database, to obtain their annotated information using Basic Local Alignment Search Tool (BLAST) software [32].
Next, the number of DEG were analyzed between any two libraries (S1 vs. S2, S2 vs. S3 and S1 vs. S3). In above three paired comparison groups, of the number of up-regulated DEGs was all higher than the number of down-regulated DEGs (Fig. 5). The smallest number of DEGs (4,240) was detected for the comparison of S1 with S2, encompassing 2,323 up-and 1,917 down-regulated genes. The DEGs number for the comparison of S2 with S3 increased slightly 4,936, including 2,613 up-and 2,323 down-regulated genes. The largest number of DEGs (7,807) were found for the comparison of S1 and S3, among which 3,930 genes were up-and 3,877 down-regulated (Fig. 5). Heat maps from the hierarchical clustering of DEGs between S1 vs. S2 (Fig. 6a), S2 vs. S3 (Fig. 6b) and S1 vs. S3 (Fig. 6c) paired groups were shown in Fig. 6.
Selection of differentially expressed oil-related genes in R. communis and qRT-PCR con rmation According to the analysis results of castor oil and ricinoleic acid content during castor seed development, we found that the contents increased from S1 to S2 and then to S3. On this basis, we screened the key enzyme DEGs with high FPKM values at S3. Ten genes of cluster I with similar expression patterns were selected as the candidate key genes related to seed oil accumulation in R. communis (  The expression changes of 10 key candidate genes were con rmed by qRT-PCR analysis to validate the Illumina RNA sequencing data (Fig. 10). The results suggested that the data by RNA-seq were consistent with qRT-PCR analysis during castor seed development.

Discussion
Ricinoleic acid is a highly valuable unusual hydroxy fatty acid with wide industrial value, but its yield is not too high. Therefore, the identi cation of the ricinoleic acid accumulation mechanism may provide a theoretical foundation for the genetic engineering and key genes that can improve the ricinoleic acid production in castor bean as well as other plants.

Gene expression analysis between different seed maturation stages
In this study, transcriptome sequencing was performed to comprehensively analyze the transcriptome pro le and identify the DEGs in developing seeds from different oil-accumulation stages.
Through analyzing the DEG number between different pair groups, we found that, from S1 to S2, a higher number of upregulated DEGs than down-regulated DEGs were detected, while from S2 to S3, a higher number of down-regulated DEGs were identi ed, indicating in the early stage more positive control of gene expression and then in the late stage more negative regulation during the ricinoleic acid accumulation. It was found that the number of DEGs in the early stage (S1 vs. S2) was relatively lower than in the late stage (S2 vs. S3 and S1 vs. S3), and most of these were down-regulated. This result indicated that castor oil accumulation is mainly the result of late key gene regulation. The gene regulation in the early stage was not obvious. Six DEGs could be found both from S1 vs. S2 group and S2 vs. S3 group, indicating that these six genes played stable roles in ricinoleic acid biosynthesis and regulation during the seed maturation.

Expression of oil-related genes during seed development after owering
In this study, 20 differentially expressed lipid-related genes were identi ed as likely playing important roles in oil accumulation during seed maturation of R. communis. In the Kennedy pathway, DGAT, which was the last catalytic enzymes for TAG formation, has a dominant role in adding the hydroxyl fatty acyl into TAGs [23]. One DGAT-encoding gene (29801.t000101), found in this study during castor seed development, may act a pivotal part in transferring hydroxyl fatty acids from PC to TAG in R. communis ( Table 3). The nal enzyme catalyzing TAG formation in the acyl editing pathway, PDAT, showed high expression levels at S3 during castor seed development in this study (Table 3). Cagliari et al. (2010) also reported that DGAT2 played a vital role in the enrichment of ricinoleic acid in R. communis [33]. Furthermore, in other plants which can produce ricinoleic acid, DGAT also possesses a signi cant affect in ricinoleic acid-enriched TAGs enrichment. In P. fendleri, DGAT1, DGAT2 and PDAT2 were all conducive primarily to hydroxy fatty acid enrichment [34], whereas in H. benghalensis, DGAT2 and PDAT2 were found to own high expression levels in the process of ricinoleic acid biosynthesis [31].
LACS might participate in transferring modi ed fatty acids from PC to the acyl-CoA pool together with PLA2, which is later for producing TAGs by some enzymes such as DGAT and etc., in the Kennedy pathway [15]. In this study, a LACS7gene (29844.t000211), had high expression levels during seed development, may contribute to ricinoleic acid accumulation (Table 3). Nevertheless, in H. benghalensis, LACS8 has been reported to have the same effect [31]. At the same time, LACS9 as the dominant LACS isoform was revealed to act an vital role in oil synthesis in R. communis and P. fendleri [34]. During castor seed development, PLA2 can help to accumulate a high level of the hydroxy fatty acid ricinoleate in TAGs [35]. In this study, a PLA2 gene (30142.t000005) showed a high expressed level at S3 during seed development, may have important contribution to the accumulation of ricinoleic acid. .
OLEs are the main oil-body proteins, with steroleosins and caleosins constituting the other two major classes of proteins associated with oil bodies [36]. In this study, two OLE genes (29794.t000071 and 30147.t000162) showed the highest expression level among all selected candidate key DEGs. Tian et al. reported that high expression of OLE3 during seed development in H. benghalensis was participated in the accumulation of ricinoleic acid in oil-body formation [31]. OLE genes had been reported to contribute to monoacylglycerol acyltransferase and PLA2 bifunctional enzyme activity in peanuts [37]. The PXG gene encodes a caleosin, another kind of oil body [38]. In this study, two PXG genes (30008.t000036 and 29673.t000033) showed a high expression level at S3 during seed development and may possess a part in TAG storage. In previous studies, a similar function of PXG was observed in H. benghalensis, but not in R. communis [33] or P. fendleri [34]. Lin at al. (2007) found PLC2 catalyzes the transformation of 2-oleoyl-PC into 1-acyl-2-oleoyl-sn-glycerol [39]. PLC and/or PLD, together with PAPase, may also relate to the transformation of PtdCho to DAG [40]. In this study, a PLC gene (29756.t000030) showed high expression levels at S3 during seed development and may play a role in converting ricinoleoyl-PC into 1,2-diricinoleoyl-sn-glycerol.
In this study, an ACCase-α gene (30174.t000396) showed a high expressed level at S3 during seed development. The synthesis of oil in castor bean seeds starts from the production of malonyl-CoA by ACCase, which is a key regulatory step of fatty acid synthesis and oil formation in seeds, and the encoding gene affects the entire life process of plants. ACCase is present in the cytosol and plays the role of transferring a carboxyl group in the reaction process [40]. The ACCase-catalyzed reaction is a rate-limiting step in fatty acid synthesis. Therefore, overexpression of the ACCase gene in plants, especially in oil crops such as soybean, Brassica napus and sesame, can improve their ability to synthesize fatty acids, which has been con rmed by many studies [41][42].
To deep illustrate the mechanism of ricinoleic acid biosynthesis, a theoretical model of gene expression changes during seed development was constructed based on our transcriptomic data (Fig. 11).

Conclusions
In this study, the mechanism and regulation of ricinoleic acid biosynthesis in R. communis was revealed at the transcriptome level. A series of genes encoding lipid-related enzymes were screened, and their expression pro les during ricinoleic acid accumulation were obtained through bioinformatics analysis. Based on the results of this study and previous reports, a theoretical model of oil accumulation in developing seeds of R. communis was constructed. The ndings of this study broaden our understanding of ricinoleic acid biosynthesis and regulation at the level of molecular biology, laying a solid basis for improving the ricinoleic acid accumulation capacity of important oil crops. Analysis of the oil and ricinoleic acid content Ten immature seeds of castor line 93 − 10 during seed development at three time points were harvested with three biological replicates. After removing the seed coat, the seeds were weighed in the fresh state, dried in the oven at 60 °C for 48 h, and weighted again in the dry state. The water content of seeds was calculated by subtracting the dry weight from the fresh weight of the seeds. To measure the castor oil content the dry seeds were ground into a powder with a mortar and pestle, wrapped in lter paper, and extracted with n-hexane in a Soxhlet extractor at 85 °C for 12 hours. After extraction, the n-hexane was removed in a rotary evaporator, followed by drying in an oven at 68 °C for 1 hour. The resulting lipid mass was used to calculate the oil content as the percentage of oil in dry seeds (w/w). The ricinoleic acid content was measured by gas chromatography. Heptadecanoic acid (C17:0) (≥ 99%, GLPBIO, USA) was chosen as the internal standard, following the method of Pan et al with minor modi cation [31,43]. About 15 mg of seed samples were methylated to generate fatty acid methyl esters with 3 N methanolic HCl at 80 °C for 2 h, which were extracted with n-hexane, dried in nitrogen and suspended in 1.5 mL dichloromethane. The analysis was carried out by an Agilent 6890 N gas chromatograph, was and then equipped with a hydrogen ame ionization detector and a DB-WAX capillary column (30 m × 0.32 mm × 0.15 µm) (Agilent, USA).The temperature program was 200 °C keep for 25 min, 5 °C per min until to 220 °C for 20 min. Then 1 µL of injection volume, 250 °C of injector temperature, and a split injection mode with a split ratio of 30:1 was used. Helium was used as carrier gas with a ow rate of 1.5 mL per min. Fatty acids meet fatty acid methyl ester standards (Sigma-Aldrich, USA). The relative percentage of ricinoleic acid was calculated from its peak area.

Bioinformatics analysis
FastQC (v0.11.8) was used for quality-checking the raw RNA-seq reads to eliminate the low-quality reads and adaptor sequences [44]. The high-quality clean reads were then mapped to the published R. communis genome sequences [45]. Bowtie (v2.2.3) software was used to adjust the reference genome [46]. The paired-end clean reads were aligned to the reference genome using TopHat (v2.0.12) [47]. Cu inks (v2.1.1) was a software used to detect all transcripts via qRT-PCR [48]. GATK [49] was used to identify single base mismatches between sequenced samples and the reference genome, and identify potential SNP loci. SnpEff [50] was used for annotating variations (SNP, InDel) and predicting the effects of variations.

Differential expression analysis and functional annotation
For analyzing the differential expression of genes, software TopHat and Cu inks were used to blast the RNA sequencing data against R. communis genome sequences. The abundance of gene transcripts was calculated using the FPKM method. DEG screening was conducted using DESeq to analyze biological duplicates. During the DEG screening, DEG fold change was ≥ 8 and a FDR < 0.001 was considered that DEG was signi cantly differentially expressed between the control and treatment groups. The FPKM values of DEGs of three time points were rst normalized via log 2 and then used for clustering with their expression patterns.
All DEG sequenceswere blast by the GO, COG, KEGG pathway, SwissProt and NR databases using BLAST software to gain the functional annotation information. The annotated information of all DEGs during seed development was analyzed and then used to screen out the genes encoding key enzymes related to the ricinoleic acid biosynthesis pathways.

QRT-PCR analysis
The transcript levels of a total of 10 genes at three time points (S1, S2 and S3) were performed by qRT-PCR analysis to verify the Illumina RNA sequencing data. Total RNA of developing seeds at S1-S3 was isolated and 1 µg was reverse-transcribed using the PrimeScript™ RT reagent Kit with gDNA Eraser (Takara, Japan). The actin gene of R. communis was used as the internal standard [51]. Primers were designed by Primer 5.0 software (Premier, Canada). All gene primer sequences including actin gene are listed in  Table S1. Statistics of the comparison of sequencing data with the reference genome. Table S2. List of all annotated information of DEGs between S1 and S2. Table S3. List of all annotated information of DEGs between S2 and S3. Table S4. List of all annotated information of DEGs between S1 and S3. Table S5. List of 49 key enzymes associated with lipid biosynthesis. Table S6. Primers used for quantitative real-time PCR. Figure 1 The castor oil content (A) and ricinoleic acid (C18:1-OH) (B) content of developing castor bean (mean±SD, n=3).   The annotation of all SNPs (A) and InDels (B).

Figure 5
Analysis of differential genes expression in paired comparison groups among the three development stages.

Figure 6
Hierarchical clustering analysis of intersectional DEGs in the three paired groups (S1 vs. S2, S1 vs. S3, and S2 vs. S3) based on data of log2Ratio from one gene. The color bar ranging from blue to red indicates low to high transcriptional expression level.

Figure 9
Three clusters with different gene expression patterns for all key enzyme genes. (A) Cluster I: rising pattern; (B) Cluster II: bell-shaped pattern; (C) Cluster III: decreasing pattern.

Figure 10
The expression changes of 10 key differentially expressed lipid-related genes were con rmed by qRT-PCR analysis to validate the Illumina RNA sequencing data. The comparative log2FC and ΔΔCt values at stage S1 were used as the control for normalization. The results represent the means of three biological replicates (mean ± SD, n=3).

Figure 11
Theoretical model of gene expression changes the ricinoleic acid biosynthetic pathway during seed development based on the transcriptomic data obtained in this study. The expression levels (represented by Log2FPKM) of the 49 key enzyme-encoding genes in developing seeds of R. communis at different development stages (S1-S3) are highlighted in color scales (green to red).

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.