Content of glucosides in samples
All validation projects (Limit of detection (LOD), Limit of quantitation (LOQ), Calibration curve, Mean correlation coefficient Linear range, Accuracy, Injection precision, Stability, System suitability) used for detecting steviolbioside, rebaudioside B (Reb B), 1,2-stevioside (ST), rebaudioside F (Reb F), rebaudioside A (Reb A), rebaudioside D (Reb D) and rebaudioside M (Reb M) from HPLC-UV analysis satisfied the quantitative requirements (Table 1). The contents of steviol glucosides in the leaves of all detected samples are shown in Table 2, and the HPLC chromatograms are also shown in Figure 1. The results demonstrated that all of the analytical glucosides in the experimental genotypes were clearly varied and provided a potential basis for WGCNA co-expression network analysis to uncover the glucosyltransferases involved in the biosynthesis pathway of the corresponding glycosides. Furthermore, data obtained separately from the leaves of the seedling, adult, and budding stages of the ‘023’ genotype also revealed that the accumulation of steviol glycosides in S. rebaudiana peaked in the budding period.
Output from combining sequencing approach to the leaves of stevia
To obtain the complete full-length transcriptome of S. rebaudiana and identify the potential genes involved in the biosynthesis pathway of steviol glycosides, both NGS (ILLUMINA) and SMRT (PACBIO) sequencing platforms were combined to sequence six different stevia genotypes. First, eighteen high-quality cDNA libraries from six different genotypes (each in triplicate) were sequenced on the Illumina HiSeq X Ten platform, and 48,234,398 clean reads were generated after quality filtering (Table S1). Then, high-quality full-length cDNAs from the pooled RNA sample of the ‘023’ genotype were sequenced on the PacBio Sequel platform, and a total of 16,289,315 subreads (approximately 29.1 billion bases) were obtained. After we performed the IsoSeq protocols (https://github.com/PacificBiosciences/IsoSeq_SA3nUP/wiki#datapub), which included Circular Consensus Sequences (CCS), Classify and Cluster, a total of 39,879 consensus isoform sequences with considerably more accurate sequence information were obtained, and their average length was 1,949 bp. To further correct the isoform sequences, all NGS clean reads were subsequently used to correct the 39,879 consensus isoform sequences using LoRDEC software (v0.7, parameter: -k 21 -T 20 -s 3 and others set to default) [19]. After removing the redundant sequences for all SMRT isoforms using CD-HIT (identity=0.98) [20], 30,859 contigs (containing approximately 59.8 million bases) were produced with a mean length of 1,938 bases. To validate the quality of 30,859 corrected contigs, firstly, we mapped the corrected contigs to the eukaryota_odb9 database and the result showed that there were 229 sequences on the perfect match (Figure S1). Secondly, mapping of the clean reads to the corrected contigs and resulting a high alignment rate of 85.42%.
In addition to contigs coding for proteins and the remaining contigs, homologous searches against the Coding Potential Calculator (CPC), Coding-Non-Coding Index (CNCI), Coding Potential Assessment Tool (CPAT) and Pfam reference protein databases predicted 508 of these contigs to be long non-coding RNAs (lncRNAs) with a mean length of 1,321 bases (Figure 2A, B). After SSR analysis, the number of sequences containing SSR was 1,776 and without the compound SSRs; furthermore, the repeat numbers of the mononucleotides, dinucleotides, trinucleotides, tetranucleotides, pentanucleotides, and hexanucleotides were 4,212, 2,030, 3,544, 94, 44, and 130, respectively.
According to the results from the comparison of transcript length distribution between Illumina and PacBio Sequel platforms, it was indicated that the transcripts assembled from the Illumina short reads by Trinity software [21] could not accurately represent full-length cDNAs in S. rebaudiana. The average length of assembled contigs from the Illumina platform (mean 905.8 bp) was notably shorter than those from the PacBio Sequel platform (mean 1,949.2 bp). In addition, approximately 69.4% of the assembled contigs from NGS reads were <1,000 bp, whereas only 13.4% of the contigs from PACBIO reads were <1,000 bp (Figure 3). Nevertheless, from this study, it seemed that the SMRT reads further corrected by the NGS data could be considerably better than simply relying on IsoSeq protocols.
Phylogenetic analysis of the UDP-glycosyltransferase multigene family
UDP-glycosyltransferases (UGTs) are defined by the presence of a C-terminal consensus sequence containing 44 amino acids and are responsible for transferring a glycosyl moiety from an activated donor to an acceptor molecule in all living organisms [22, 23]. In Arabidopsis thaliana, a molecular phylogenetic tree was constructed consisting of ninety-nine UGT sequences and a composite phylogenetic tree that also includes all of the additional plant UGTs with known catalytic activities [22]. This work has significantly promoted the prediction of the evolutionary history, substrate specificities and structure-function relationships of UGTs in Arabidopsis. Nevertheless, although many studies have been performed on the glycosyltransferases of S. rebaudiana, there are still no reports on its phylogenetic tree. Therefore, a comprehensive neighbour-joining tree with ninety-eight complete SrUGTs was constructed (Figure 4). The alignment included twenty-six SrUGTs functionally characterized in this study. After bootstrap analysis with 1000 replicates, the SrUGTs were strongly divided into fourteen major groups, with each having a support greater than 95% in distance analysis excluding group I (66% bootstrap). The fourteen well-defined major groups of SrUGTs suggest that at one time, there were fourteen ancestral genes. One sequence (SrUGT78D2) with a long unique terminal branch, suggesting accelerated evolutionary rates, tends to distort phylogenetic analyses by reducing apparent bootstrap support for nearby clades. Therefore, the data were reanalysed without this sequence. This analysis provided stronger statistical confidence (bootstrap from 64% to 88%) to two of the ancestral genes, corresponding to groups M and N, which are likely to share a more recent common origin. Interestingly, a similar AtUGT78D1 gene has been found in Arabidopsis [22], indicating that the UGT78D genes in the glycosyltransferase family may have evolved more rapidly.
In an attempt to predict the structure-function relatedness of the SrUGT family, numerous UGTs identified from a wide range of plant species and having different biochemical functions were aligned with the ninety-eight SrUGTs and constructed a composite phylogenetic tree. Among these additional plant UGTs, seven of the corresponding UGTs were successfully clustered within the fourteen groups identified by this study (Figure 5). Interestingly, PdUGT94AF1 and PdUGT94AF2 derived from Prunus dulcis involved in the formation 1,6-β-D-glucosidic linkage of Prunasin [24] were clustered in group A but had a long genetic distance between them and SrUGTs of this group, implying that the SrUGTs in this group may have no related specificity. Due to the lack of other glycosyltransferases capable of forming 1,6-glucosidic bonds, this tree cannot predict the glycosyltransferases involved in the synthesis of steviol glycosides containing 1,6-glucosidic bonds (such as Reb L). FeUGT79A8 and PhUGT79A1 identified from Fagopyrum esculentum and Petunia x hybrida could utilize UDP-rhamnose as sugar donors [25]; both UGTs belong to group A and are closely related to the SrUGT79 subfamily (100% bootstrap), demonstrating that the SrUGT79 subfamily may be the glycosyltransferases of UDP-rhamnose specificity in S. rebaudiana. In addition, EUGT11 from Oryza sativa and SrUGT91D2 are known to catalyse the 1,2-β-D-glucosidic linkage of steviol glycosides [7, 26]; therefore, it is reasonable to believe that the SrUGT91 subfamily is responsible for the formation of the 1,2-β-D-glucosidic linkage in stevia. UBGAT from Scutellaria baicalensis is one of the few glycosyltransferases identified in plants that could use UDP-GlcUA as the sugar donor to catalyse a glucuronosylation reaction [27]. In the composite tree, UBGAT was clustered in group C and had a closer relationship with the SrUGT88 subfamily; moreover, the similarity of the PSPG box between the UBGAT and SrUGT88 subfamily was more than 65%; therefore, we speculate that the SrUGT88 subfamily should be the UDP-glucuronic acid-recognizing glycosyltransferases in stevia. AtUGT78D1 identified from Arabidopsis thaliana could utilize UDP-xylose as a sugar donor [28]. In this tree, AtUGT78D1 and SrUGT78D2 are clustered in group L and have high bootstrap support (100%). Therefore, it is speculated that SrUGT78D2 should be a glycosyltransferase in stevia, which may recognize UDP-xylose and participate in the xylosylation of glycosides, such as RF. In 2005, Richman et al. (2005) identified and characterized three UGTs (SrUGT85C2, SrUGT74G1 and SrUGT76G1) involved in the synthesis of steviol glycosides: SrUGT85C2 and SrUGT74G1 glucosylate the C13-hydroxyl and C19-carboxylic acid functional groups of the steviol backbone, forming a β-D-glucoside, respectively, while SrUGT76G1 is capable of catalysing 1,3-β-D-glucosylation at both the C13 and C19 positions of steviol. Most SrUGTs in group N belong to the SrUGT85 subfamily and have a high support value; therefore, we hypothesized that the SrUT85C subfamily may be responsible for glucosylating the C13-hydroxyl position of the steviol backbone. Similar to the SrUGT85 subfamily, in group J, the SrUGT74G subfamily could theoretically be the glucosylated C19-carboxylic acid functional group of the steviol backbone. Furthermore, the SrUGT76G and SrUGT76I subfamilies not only have a high degree of support (100% bootstrap) but also have a close genetic distance, indicating that these two subfamilies may be responsible for the formation of the 1,3-β-D-glucosidic linkage.
WGCNA co-expression network analysis for the investigation of steviol glycoside biosynthesis
To date, many steps involved in the biosynthesis pathway of steviol glycosides have been successfully uncovered, especially for elucidating four UDP-dependent glucosyltransferases (UGTs) [6, 29], but several glucosylation steps of some glycosides that have not been resolved to date. In addition, for the large family of glucosyltransferases, we speculate that multiple enzymes with similar functions may participate in the same catalytic step in the glucosylation of steviol glycosides. Typically, the traditional method for differential expression analysis is constrained to paired sample analysis and thus unable to perform systematic analysis with large datasets from heterogeneous sources simultaneously [30, 31]. Therefore, in this study, one co-expression network approach named WGCNA, which was proved to be a powerful tool in systematically describing the correlation relationship between clusters of highly correlated genes or modules and external conditions or sample traits [31, 32], was used to analyse the potential UGTs involved in the glucosylation of steviol glycoside. First, we performed qRT-PCR analysis of the expression levels of nine UGTs (SrUGT71H1, SrUGT85B1-2, SrUGT91D2, SrUGT76G1-1, SrUGT91D3, SrUGT85C3-1, SrUGT79A2, SrUGT73G2 and SrUGT71I1) in the leaves of six genotypes to confirm the reliability of the transcriptome data, and the primers used for qRT-PCR are shown in Table S2. The results showed that the tendency of these genes to be expressed was similar between the qRT-PCR and the transcriptomic data, confirming that the transcriptomic results were reliable (Figure 6). And then all 71,718 transcripts assembled from NGS as input raw data for WGCNA co-expression network analysis. First, genes with low fluctuation expression (standard deviation < 1) were filtered, and 14,995 genes remained. When the power value of adjacency functions for weighted networks was 9, both the correlation coefficient and degree of gene connectivity could satisfy the requirement of scale-free network distribution to the greatest extent possible. Based on the selected power value, a weighted co-expression network model was established, and 14,995 genes were eventually divided into fifteen modules, of which the grey module had no reference significance because of the failure to assign to any module. The hierarchical clustering dendrogram of gene networks is visualized in Figure 7A.
To identify modules that are significantly associated with the traits of steviol glycoside content, fifteen generated modules were correlated with the traits. The modules related to each trait were screened according to the absolute value of correlation coefficient > 0.3 and p-value < 0.05. The colour-coded table in Figure 7B shows the full module-trait relationships. For each trait-related module, the correlation between the gene expression profile and the corresponding traits (Gene Significance, GS) and the correlation between the gene expression profile and the module eigengenes were calculated. The results showed that the genes in the module are both highly correlated with the traits and the eigengenes. For example, the module eigengenes of turquoise (r=0.71, correlation p-value=0.00092) and dark-orange (r=-0.78, correlation p-value=0.00014) were significantly positively correlated or negatively correlated with RA, respectively. As a result, fourteen gene modules that are highly associated with steviol glycosides were identified. Among these genes in the fourteen modules, two genes belong to the acetylglucosaminyltransferase, fifty-five genes were annotated to be members of the plant UGT superfamily, including three SrUGT85C2, one SrUGT74G1, one SrUGT76G1, one SrUGT85A8 and one SrUGT91D2, which have already been reported to be involved in the glucosylation of steviol glycosides except for SrUGT85A8 [6, 7], illustrating the reliability of our results. Furthermore, the expression levels of these genes in the leaves of the six genotypes are shown in Figure 9.
In S. rebaudiana, steviol glycosides have been derived from the tetracyclic diterpene steviol backbone [6], and the precursors of steviol are actually synthesized via a series of enzymes consisting of DXS, DXR, CMS, CMK, MCS, HDS, HDR, GGDPS, CPPS, KS, KO and KAH [8, 33, 34]. Accordingly, the genes encoding enzymes involved in steviol biosynthesis might be expected to exhibit a similar co-expression pattern with the SrUGTs involved in the synthesis of steviol glycosides. Therefore, we further performed a similar co-expression analysis between the fifty-five SrUGTs and the genes involved in steviol biosynthesis. Notably, forty-four SrUGTs, including SrUGT85C2, SrUGT74G1, SrUGT76G1 and SrUGT91D2, were then identified as being co-expressed with at least one of the upstream genes (Figure 8), and it is reasonable to believe that these SrUGTs may be directly involved in the synthesis of the corresponding steviol glycosides, which warrants further research.
To confirm the sequence accuracy of the candidate SrUGTs, twenty of these forty-four candidate SrUGTs were cloned by reverse transcriptase polymerase chain reaction (RT-PCR) using the designed primers (Table S3), and then sequenced with the pClone007 Blunt vector. After analysis the nucleotide similarity between the cloned and the transcriptome, it was found that there were three base substitutions in the SrUGT91D1-1 sequence, twelve base substitutions in the SrUGT85B4 sequence, one base substitutions in the SrUGT73G2 sequence, eight base substitutions in the SrUGT85C3-1 sequence, eleven base substitutions in the SrUGT95A2 sequence and no base difference in the remaining fifteen SrUGTs sequence (additional file 6). Because of the self-incompatible in stevia, the sequence difference should be due to genetic heterozygosity [9, 10]. These results demonstrated the reliability of our transcriptome sequences.