Shorts reads and assembly quality
Table 1 shows the different quality variables of the S. thurberi fruit peel transcriptome. A total of 288,199,704 reads with 150 base pairs (bp) in length in paired-end mode were sequenced with the Illumina NextSeq500 platform at the Arizona Genetic Core facility of the University of Arizona at Tucson, AZ, USA. After trimming, 243,194,888 (84.38%) cleaned short reads with at least 29 mean quality scores per read in the Phred scale, and between 80 to 150 bp in length were obtained to carry out the assembly. After removing contaminating sequences, redundancy, and low-expressed transcripts, the assembly included 174,449 transcripts with an N50 value of 2,110 bp. BUSCO score showed that 85.4% are completed transcripts, although out of these, 37.2% were found to be duplicated.
Table 1
Quality metrics of the Stenocereus thurberi fruit peel transcriptome.
Metric | Data |
Total transcripts | 174449 |
N50 | 2110 |
Smallest transcript length (bp) | 200 |
Largest transcript length (bp) | 19114 |
Mean transcript length (bp) | 1198.7 |
GC (%) | 41.33 |
Total assembled bases | 209110524 |
TransRate score | 0.05 |
BUSCO score (%) | C: 85.4% (S:48.2%, D:37.2%) F: 10.7% M: 3.9% |
Values were calculated through the TrinityStats function of Trinity (11) and TransRate software (24). Completeness analysis was carried out through BUSCO (25) by aligning the transcriptome to the Embryophyte database through BLAST with an E value threshold of 1x10− 3. Abbreviations: Complete (C), single (S), duplicated (D), fragmented (F), missing (M).
Homology searches
A summary of the homology search in the main public protein database for the S. thurberi transcriptome is shown in Table 2. From these databases, the higher homologous transcripts were found in RefSeq (26) with 93,993 (53.87%). Based on the E value distribution, 41,685 (44%) and 68,853 (49%) of the hits, it was found a strong homology (E value lower than 1x10− 50) to proteins in the Swiss-Prot (27) and RefSeq databases, respectively (Fig. 1a-b). On the other hand, 56,539 (52.34%) and 99,599 (71.11%) of the matches showed a percentage of identity higher than 60% in the Swiss-Prot and RefSeq databases, respectively (Fig. 1c-d).
Table 2
Summary of homology search for S. thurberi transcripts in different databases.
Category | Number of transcripts (%) |
Total transcripts | 174449 (100%) |
With a hit to RefSeq | 93993 (53.87%) |
With a hit to SwissProt | 70381 (40.34%) |
With a hit to nr-NCBI | 72162 (41.37%) |
With a hit to plantTFDB | 45970 (26.35%) |
With a hit to iTAK-TR | 58704 (33.65%) |
With a hit to iTAK-PK | 48186 (27.65%) |
With a hit to TAIR (Arabidopsis thaliana) | 81556 (46.75%) |
With a hit to ITAG (Solanum lycopersicum) | 85331 (48,91%) |
With a hit to any of these databases | 98611 (56.52%) |
Without a hit to any of these databases | 75838 (43.47%) |
Homologous sequences were predicted by an alignment through BLASTx (28) to the protein databases listed in the table with an E value threshold of < 1x10− 10 for the nr-NCBI database and an E value threshold of < 1x10− 5 for the others.
Figure 1 Homology analysis of assembled transcripts. E value distribution (a, b) and identity distribution (c, d) of the matches in the Swiss-Prot (a, c) and RefSeq (b, d) databases. (a,b) The number inside the pie chart indicates the number of transcripts recorded using that E value. Alignment by BLASTx with an E value threshold of 1x10− 5.
In Additional file 1, it is shown the distribution of the top hits found in different species, which showed that the highest number of matches in the Swiss-Prot database were for proteins from Arabidopsis (80.2%). For the RefSeq database, the highest number of matches were for proteins from Beta vulgaris (62%). In Fig. 2, it is shown the homology between transcripts from S. thurberi and proteins of commercial fruits, as well as proteins and transcripts of cacti. Transcripts from S. thurberi homologous to proteins from fruits of commercial interest avocado (Persea americana), peach (Prunus persica), strawberry (Fragaria vesca), orange (Citrus sinensis), and grapefruit (Vitis vinifera) ranged from 77,285 (44.30%) to 85,421 (48.96%), with 70,802 transcripts homologous to all the five fruit protein databases (Fig. 2a).
Figure 2 Venn diagram of the homology search results against model plants databases, commercial fruits, and cacti. The number in the diagram corresponds to the number of transcripts from S. thurberi homologous to sequences from that plant specie. (a) Homologous to sequences from Fragaria vesca (Fa), Persea americana (Pa), Prunus persica (Pp), Vitis vinifera (Vv), and Citrus sinensis (Cs). (b) Homologous to sequences from Opuntia streptacantha (Of), Selenicereus undatus (Su), Hylocereus polyrhizus (Hp), and Pachycereus pringlei (Pap). (c) Homologous to sequences from Solanum lycopersicum (Sl), Arabidopsis thaliana (At), from the commercial fruits (Fa, Pa, Pp, Vv, and Cs), or the cactus included in this study (Of, Su, Hp, and Pap). Homologous searching was carried out by BLAST alignment (E value < 1x10− 5). The Venn diagrams were drawn by ggVennDiagram in R Studio.
Transcripts homologous to transcripts or proteins from the cactus dragon fruit (Hylocereus polyrhizus), prickly pear cactus (Opuntia streptacantha), Mexican giant cardon (Pachycereus pringlei), and pitahaya (Selenicereus undatus) ranged from 76,238 (43.70%) to 114,933 (65.88%), with 64,009 transcripts homologous to all the four cactus databases (Fig. 2b). Further, out of the total of transcripts, 44,040 transcripts (25.25%) showed homology only to sequences from cactus, but not for model plants or commercial fruits (Fig. 2c).
A total of 69,622 (39.91%) transcripts showed homology to transcription factors, protein kinases, and other transcriptional regulators in the PlantTFDB and iTAK databases (Table 2). In the PlantTFDB, 45,970 (26.35%) transcripts homologous to transcriptional factors (TF) from 57 families were identified (Fig. 3). From these, the most frequent TF families were the basic-helix-loop-helix (bHLH), myeloblastosis-related (MYB-related), NAM, ATAF, and CUC (NAC), ethylene responsive factor (ERF), and the WRKY domain family (WRKY) (Fig. 3).
Figure 3 Transcription factor (TF) families distribution of S. thurberi fruit peel transcriptome. The X-axis indicates the number of transcripts with hits to each TF family. Alignment to the PlantTFDB database by BLASTx was carried out with an E value threshold of 1x10− 5. The bar graph was drawn by ggplot2 in R Studio.
Functional categorization
In Additional file 2, it is shown the results of functional annotations by the Blast2GO (29) suite pipeline. 72,162 transcripts showed homology to proteins in the nr-NCBI database (Table 2), and 83,066 transcripts showed to have at least a functional domain in the InterPro database. This data allowed the assignment of Gene Ontology (GO) terms to 68,559 transcripts (39.3% of the total) (See Additional file 2).
Figure 4 shows the top 20 GO terms assigned to the S. thurberi transcriptome, corresponding to the Biological Processes (BP) and Molecular Function (MF) categories. For BP, organic substance metabolic processes, primary metabolic processes, and cellular metabolic processes showed the higher number of transcripts. Further, for MF, organic cyclic compound binding, heterocyclic compound binding, and ion binding were the processes with the higher number of transcripts.
In Fig. 5, it is shown the KEGG pathways with the higher number of transcripts involved. However, S. thurberi transcripts were classified into 142 KEGG pathways. The pathways with the higher number of transcripts recorded were pyruvate metabolism, glycerophospholipid metabolism, glycolysis/gluconeogenesis, and citrate cycle. Further, among the top 20 KEEG pathways, the cutin suberin and wax biosynthesis include more than 30 transcripts.
Figure 4 Top 20 Gene Ontology (GO) terms assigned to the S. thurberi fruit peel transcriptome. Bars indicate the number of transcripts assigned to each GO term. Assignment of GO terms was carried out by Blast2GO with default parameters. BP and MF mean Biological Processes and Molecular Functions GO categories, respectively. The graph was drawn by ggplot2 in R Studio.
Figure 5 Top 20 KEGG metabolic pathways distribution in the S. thurberi fruit peel transcriptome. Bars indicate the number of transcripts assigned to each KEGG pathway. Assignment of KEGG pathways was carried out in the Blast2GO suite. The bar graph was drawn by ggplot2 in R Studio.
Identification of lncRNAs
Out of the total of transcripts, 122,234 (70.07%) were identified as coding transcripts, and 43,391 (24.87%) were classified as lncRNA. In Fig. 6, it is shown a comparison of the length (Fig. 6a) and expression (Fig. 6b) of lncRNA and coding RNA (cRNA). Besides, this graph includes the homology search results among lncRNA and cRNA against the transcriptomes of the cactus O. streptacantha, H. polyrhizus, P. pringlei, and S. undatus (Fig. 6c).
Both length and expression values were higher in cRNA than in lncRNA. In general, cRNA ranged from 201 to 18,629 bp with a mean length of 1,507.18, whereas lncRNA ranged from 200 to 5198 bp with a mean length of 481.51 (Fig. 6a). The higher expression values recorded from cRNA and lncRNA were 12.83 and 9.45 log2(CPM), respectively (Fig. 6b). Besides, the Venn diagram shows that 18,727 lncRNA were homologous to transcripts reported from H. polyrhizus, P. pringlei, and S. undatus (Fig. 6c). Furthermore, 111,554 transcripts were found to be homologous between cRNA from S. thuberi and the transcriptomes of the different cactus species included. Also, 10,680 transcripts were found to be present in S. thurberi only (Fig. 6c). The alignment to the cactus transcriptomes found that lncRNA from S. thurberi showed lower identity (Fig. 6d), higher E values (Fig. 6e), and lower coverage (Fig. 6f) than the cRNAs.
Figure 6 Comparison of coding RNA (cRNA) and lncRNAs from S. thurberi transcriptome and among cactus species. (a) Box plot of transcript length distribution. The Y-axis indicates the length of each transcript in base pairs. (b) Box plot of expression levels. The Y-axis indicates the log2 of the count per million of reads (log2(CPM)) recorded for each transcript. Expression levels were calculated by the edgeR package in R studio. (a, b) The lines inside the boxes indicate the median, and the higher and lower boxes limits represent the 25th-75th percentiles. (c) Venn diagram of the homology search results against the proteins of Opuntia streptacantha and transcripts of Selenicereus undatus, Hylocereus polyrhizus, and Pachycereus pringlei. (d, e, f) Density plots of the identity (d), E value (e), and coverage (f) for the hits recorded from the alignment of the cRNA and lncRNA from S. thurberi against transcripts of S. undatus, H. polyrhizus, and P. pringlei. (e) The X-axis indicates the negative exponential of the E values (1e− X). As the highest negative exponential value recorded was 180, a value of 1e− 200 was assigned to all the E values of 0.0 to represent the lowest E values. Homologous searching was carried out by BLAST alignment using an E value of less than 1x10− 5. The box plots, the density plots, and the Venn diagram were drawn by ggplot2 and ggVennDiagram in R Studio.
Identification of tentative reference genes (dup: abstract ?)
Mean transcripts per million of reads (TPM) and coefficient of variation (CV) were calculated for 4,980 not differentially expressed (NDE) transcripts in S.thurberi fruit peel transcriptome (log2FC between + 1.0 and − 1.0; FDR < 0.05). Transcripts with a CV value lower than 0.11, corresponding with the percentile five of the CV, and a mean transcript per million higher than 1,281.7, corresponding with the percentile 95 of mean TPM, were used as filters to identify stably expressed transcripts. From these, 5 transcripts were selected as tentative reference genes. Besides, three NDE transcripts homologous to previously identified stable expressed reference genes in other species of cactus fruit (32–34) were selected. Homology metrics for the 8 tentative reference genes selected are shown in Additional file 3. Primer sequences used to quantify the transcripts by qRT-PCR during pitaya fruit development are shown in Additional file 4.
Expression stability of tentative reference genes
In Additional file 5, it is shown the melting curves of the eight candidate reference genes in which it is shown the amplification specificity. On the other side, for the eight tentative reference transcripts, cycle threshold (Ct) values obtained by qRT-PCR during sweet pitaya fruit development ranged from 16.85 to 30.26 (Fig. 7a). Plastidic ATP/ADP-transporter (StTLC1) showed the highest Ct values with a mean of 27.34. Polyubiquitin 3 (StUBQ3) showed the lowest Ct values in all five sweet pitaya fruit developmental stages (Fig. 7a).
Figure 7 Expression stability analysis of tentative reference genes. (a) Box plot of cycle threshold (Ct) distribution of candidate reference genes during sweet pitaya fruit development (10, 20, 30, 35, and 40 days after flowering). The black line inside the box indicates the median, and the higher and lower boxes limits represent the 25th -75th percentiles. (b) Bar chart of the geometric mean (geomean) of ranking values calculated by RefFinder for each tentative reference gene (X-axis). The lowest values indicate the best reference genes. (c) Bar chart of the pairwise variation analysis and determination of the optimal number of reference genes by the geNorm algorithm. A pairwise variation value lower than 0.15 indicates that the use of Vn/Vn + 1 reference genes is reliable for the accurate normalization of qRT-PCR data. The Ct data used in the analysis were calculated by qRT-PCR in a QIAquant 96 5 plex (QIAGEN) according to the manufacturer's protocol. The box plot and the bar graphs were drawn by ggplot2 and Excel programs, respectively. Abbreviations: Actin 7 (StACT7), a-tubulin (StTUA), elongation factor 1-alpha (StEF1a), COP1-interactive protein 1 (StCIP1), plasma membrane ATPase 4 (StPMA4), BEL1-like homeodomain protein 1 (StBLH1), polyubiquitin 3 (StUBQ3), and plastidic ATP/ADP-transporter (StTLC1).
NormFinder (22) calculated the stability value for all eight transcripts, ranging from 0.45 to 1.27 (See Additional file 6). The lowest stability values were 0.45, 0.51, 0.97, and 0.99, corresponding to the transcripts elongation factor 1-alpha (StEF1a), α-tubulin (StTUA), plastidic ATP/ADP-transporter (StTLC1), and actin 7 (StACT7), respectively. For BestKeeper (21), the most stable expressed transcripts were polyubiquitin 3 (StUBQ3), α-tubulin (StTUA), and elongation factor 1-alpha (StEF1a), with values of 0.72, 0.75, and 0.87, respectively. In the case of the delta Ct method (35), the transcripts elongation factor 1-alpha (StEF1a), α-tubulin (StTUA), and plastidic ATP/ADP-transporter (StTLC1) showed the best stability values (See Additional file 6).
According to geNorm (20) analysis, the most stable expressed transcripts were α-tubulin (StTUA), elongation factor 1-alpha (StEF1a), polyubiquitin 3 (StUBQ3), and Actin 7 (StACT7), with values of 0.74, 0.74, 0.82, 0.96, respectively. All the pairwise variation values (Vn/Vn + 1) were lower than 0.15, ranging from 0.019 for V2/V3 to 0.01 for V6/V7 (Fig. 7c). The V value of 0.019 recorded for V2/V3 indicates that the use of the best two reference genes StTUA and StEF1 is reliable enough for the accurate normalization of qRT-PCR data. No third reference genes are required (20). With the exception of BestKeeper analysis, StEF1a and StTUA were the most stable transcripts for all of the stability analysis methods carried out in this study (See Additional file 6). The comprehensive ranking analysis indicates that StEF1, followed by StTUA and StUBQ3, are the most stable expressed genes and are stable enough to be used as reference genes in qRT-PCR analysis during sweet pitaya fruit development (Fig. 7b).
Identification of cuticle biosynthesis-related transcripts
Three cuticle biosynthesis-related transcripts DN17030_c0_g1_i2, DN15394_c0_g1_i1, and DN23528_c1_g1_i1 tentatively coding for the enzymes cytochrome p450 family 77 subfamily A (CYP77A), Gly-Asp-Ser-Leu motif lipase/esterase 1 (GDSL1), and an ATP binding cassette transporter family G member 11 transporter (ABCG11/WBC11), respectively, were identified and quantified. The best homology match for StCYP77A (DN17030_c0_g1_i2) was for AtCYP77A4 (AT5G04660) from Arabidopsis and SmCYP77A2 (P37124) from eggplant (Solanum melongena) in the TAIR and Swiss-Prot databases, respectively (See Additional file 3).
TransDecoder, InterPro (36), and TMHMM (37) analysis showed that StCYP77A codes a polypeptide of 518 amino acids (aa) in length that comprises a cytochrome P450 E-class domain (IPR002401) and a transmembrane region (residues 10 to 32). The phylogenetic tree constructed showed that StCYP77A is grouped in a cluster with all the CYP77A2 proteins included in this analysis, being closer to CYP77A2 (XP_010694692) from B. vulgaris and Cgig2_012892 (KAJ8441854) from Carnegiea gigantean (See Additional file 7).
StGDSL1 (DN15394_c0_g1_i1) alignment showed that it is homologous to a GDSL esterase/lipase from Arabidopsis (Q9LU14) and tomato (Solyc03g121180) (See Additional file 3). TransDecoder, InterPro, and SignalP (38) analysis showed that StGDSL codes a polypeptide of 354 aa in length that comprises a GDSL lipase/esterase domain, IPR001087 and a signal peptide with a cleavage site between position 25 and 26 (See Additional file 8).
In Fig. 8, it is shown the bioinformatic analysis carried out in the in silico protein predicted StABCG11. The phylogenetic tree constructed shows three clades corresponding to the ABCG13, ABCG12, and ABCG11 protein classes with bootstrap support ranging from 40 to 100% (Fig. 8a). StABCG11 is grouped with all the ABCG11 transporters included in this study in a well-separated clade, being closely related to its tentative ortholog from C. gigantean Cgig2_004465 (KAJ8441854). InterPro and TMHMM results showed that the StABCG11 sequence contains an ABC-2 type transporter transmembrane domain (IPR013525; PF01061.27) with six transmembrane helices (Fig. 8b).
The predicted protein sequence of StABCG11 (DN23528_c1_g1_i1) is 710 aa in length, holding the ATP binding domain (IPR003439; PF00005.30) and the P-loop containing nucleoside triphosphate hydrolase domain (IPR043926; PF19055.3) of the ABC transporters of G family. Multiple sequence alignment shows that the Walker A and B motif sequence and the ABC signature (39) are conserved between the ABCG11 transporters from Arabidopsis, tomato, S. thurberi, and C. gigantean (Fig. 8c).
Figure 8 Analysis of the predicted protein StABCG11 from Stenocereus thurberi. (a) Phylogenetic tree of StABCG11 and related proteins of the classes ABCG11, ABCG12, and ABCG13 from Arabidopsis thaliana (At), Gossypium arboreum (Ga), Citrus sinensis (Cs), Medicago truncatula (Mt), Solanum lycopersicum (Sl), Eutrema halophilum (Eh), Carnegiea gigantean (Cg), Beta vulgaris (Bv), and Spinacia oleracea (So). The database accession number next to the protein name is shown. The scale bar of 0.10 represented a sequence divergence of 10%. The number in the branches is the percentage bootstrap value of 1000 replicates. The highest percentages represent higher significant results. The black square beside the protein name shows AtABCG11, AtABCG12, and AtABCG13 from A. thaliana. The red circle and red triangle next to the protein name show StABCG11 from S. thurberi and a protein from the closest related species, C. gigantean, respectively. Neighbor-joining (NJ) phylogenetic tree constructed by MEGA11 software. (b) The predicted transmembrane helices of StABCG11. The probability of membrane insertion (Y-axis) and transmembrane region represented by purple color was determined by TMHMM software. (c) Multiple sequence alignment of StABCG11 and its homologous from A. thaliana (AT1G17840), S. lycopersicum (Solyc03g019760), and C. gigantean (KAJ8441854). Amino acids are colored according to the chemistry classification of their side-chain. The darkest blue bars below the protein sequences indicate 100% conservation. Black rectangles show the conserved sequence of the Walker A and B motif and the ABC signature, named below the rectangles. Black width lines below the sequence show the predicted transmembrane helices of StABCG11. Alignment was carried out by MUSCLE in MEGA11 and drawn by ggmsa in R Studio.
Evaluation of reliable reference genes and quantification of cuticle biosynthesis-related transcripts
According to the results of the expression stability analysis (Fig. 7), four normalization strategies were tested to quantify three cuticle biosynthesis-related transcripts during sweet pitaya fruit development. The four strategies consist of normalizing by StEF1a, StTUA, StUBQ3, or StEF1a + StTUA. Primer sequences used to quantify the transcripts StCYP77A (DN17030_c0_g1_i2), StGDSL1 (DN15394_c0_g1_i1), and ABCG11 (DN23528_c1_g1_i1) by qRT-PCR during sweet pitaya fruit development are shown in Additional file 4. Melting curves showing the amplification specificity of StCYP77A, StGDSL1, and ABCG11 are shown in Additional file 5.
The three cuticle biosynthesis-related transcripts showed differences in expression during sweet pitaya fruit development (See Additional file 9). The same expression pattern was recorded for the three cuticle biosynthesis transcripts when normalization was carried out by StEF1a, StTUA, StUBQ3, or StEF1 + StTUA (Fig. 9). A higher expression of StCYP77A and StGDSL1 is shown at the 10 and 20 days after flowering (DAF), showing a decrease at 30, 35, and 40 DAF. StABCG11 showed a similar behavior, with a higher expression at 10 and 20 DAF and a reduction at 30 and 35 DAF. Nevertheless, unlike StCYP77A and StGDSL1, a significant increase at 40 DAF, reaching the same expression as compared with 10 DAF, is shown for StABCG11 (Fig. 9).
Figure 9 Relative expression of cutin biosynthesis-related transcripts StCYP77A, StGDSL1, and StABCG11 during sweet pitaya fruit development. Relative expression was calculated through the 2− ΔΔCt method using elongation factor 1-alpha (StEF1a), α-tubulin (StTUA), polyubiquitin 3 (StUBQ3), or StEF1a + StTUA as normalizing genes at 10, 20, 30, 35, and 40 days after flowering (DAF). The Y-axis and error bars represent the mean of the relative expression ± standard error (n = 4–6) for each developmental stage in DAF. The Ct data for the analysis was recorded by qRT-PCR in a QIAquant 96 5 plex (QIAGEN) according to the manufacturer's protocol. The graph line was drawn by ggplot2 in R Studio. Abbreviations: cytochrome p450 family 77 subfamily A (StCYP77A), Gly-Asp-Ser-Leu motif lipase/esterase 1 (StGDSL1), and ATP binding cassette transporter family G member 11 (StABCG11).