Analysis of PacBio sequencing datasets
Transcriptome sequencing of the caucasian clover rhizome was completed, and 19.82 GB of clean data were obtained using one cell. We identified 658,323 reads of inserts (ROIs) with a mean length of 2,286 bp, quality of 0.94; 12 passes from 720,832 polymerase reads with full passes>=0 and a predicted consensus accuracy>0.8 (Table 1). In total, the ROIs included 62.87% (449,460) full-length (FL) reads and 29.4% (193,513) non full-length reads of the entire transcriptome sequence from the 5’ to the 3’ end and polyA tail. Additionally, the number of full-length nonchimeric (FLNC) reads was 441,885, with an average full-length nonchimeric read length of 1,969 bp (Table 1). The main number distribution of cDNA and ROIs were shown in Additional file 1: Figure S1a and S1b.
As PacBio sequencing results have a high error rate, FLNC reads were clustered using the iterative isoform-clustering (ICE) algorithm and corrected with the Illumina Hiseq2500 platform to correct errors. We generated 227,516 consensus isoforms with an average consensus isoform length of 2,086 bp, including 148,836 high-quality isoforms (Table 1). We successfully obtained 80,654 non-redundant transcripts using CD-HIT for caucasian clover rhizomes analysis.
Prediction of ORFs, SSRs and lncRNAs and identification of AS events
To identify putative protein-coding sequences, we predicted 78,209 open reading frames (ORFs) using TransDecoder. In total, 65,227 coding sequences (CDSs) were identified with start and stop codons, and the distribution of the numbers and lengths of complete CDSs is shown in Fig.1a. Among them, 12,630 transcripts were distributed in the 100-200 bp range.
A total of 79,424 sequences (167,351,883 bp) were examined, including 58,276 simple sequence repeats (SSRs) and 36,110 SSR-containing sequences (Additional file 2: Table S1). The number of sequences containing more than one SSR was 13,856, and the number of SSRs present in compound form was 10,041. In addition, most are mononucleotides (33,533), dinucleotides (8,610) and trinucleotides (14,026).
In this study, 2,429 long noncoding RNAs (lncRNA) transcripts were predicted by a coding potential calculator (CPC), coding-non-coding index (CNCI), pfam protein structure domain analysis and coding potential assessment tool (CPAT) (Fig.1b), revealing candidate lncRNAs for future research.
Regarding alternative splicing (AS) events, 6,821 were detected. Because no reference genome is available for caucasian clover, we could not identify the types of AS. Nonetheless, as AS is an important mechanism for regulating gene expression and producing proteome diversity, we show the results of these AS events in the KEGG enrichment (Fig.1c), and the genes were found to be highly enriched in the following categories: “glycolysis/Gluconeogenesis” (147), “spliceosome” (129), “carbon metabolism” (129), “protein processing in endoplasmic reticulum” (109) and “biosynthesis of amino acids” (96).
Transcript annotation
We annotated 77,927 (96.61%) transcripts in at least one of seven databases, including NCBI non-redundant protein (NR), Swiss-Prot (a manually annotated and reviewed protein sequence database), Gene Ontology (GO), Clusters of Orthologous Groups (COG), EuKaryotic Orthologous Groups (KOG), Protein family (Pfam) and Kyoto Encyclopedia of Genes and Genomes (KEGG). The number of detailed annotations for five of the databases (GO, COG, NR, KEGG and Swiss-Prot) is shown in a Venn diagram (Additional file 3: Figure S2).
Through homologous species analysis comparing transcriptome sequences in the NR database, 77,721 transcripts were annotated. Approximately 65.23% (50,689) of the sequences were aligned to Medicago truncatula sequences, followed by Cicer arietinum (23.72%, 18,435) (Fig.2a).
All the assembled transcripts were subjected to searches against the COG database to evaluate the effectiveness and completeness of the transcriptome annotation, and the results were divided into 26 main categories (Fig.2b). The clusters “general function predicted only” (10,388), “transcription” (6,199), “replication, recombination and repair” (5,988), represented three of the largest groups, followed by “signal transduction mechanisms” (5,828) and “posttranslational modification, protein turnover, chaperone” (2,943).
A 18,529 KEGG pathway analysis was performed to identify associated biochemical pathways (Fig.2c). A total of 33,383 (42.84%) transcripts were matched in the KEGG database and further classified into 128 KEGG pathways. “biosynthesis of amino acids” (1396), “carbon metabolism” (1389), “protein processing in endoplasmic reticulum” (1232), “starch and sucrose metabolism” (1189) and “spliceosome” (1170) were the most represented pathways.
Based on the GO analysis, 57,583 transcripts were enriched in the three ontologies (Fig.2d) “biological process”, “molecular function” and “cellular component”. Transcripts involved in biological processes mainly included “metabolic process” (39,010), “cellular process” (33,383), and “single-organism process” (26,933). In molecular function, transcripts were mainly enriched in “binding” (32,350), “catalytic activity” (32,206), and “transporter activity” (3,404). Regarding the “cellular component” category, the major classes of transcripts were related to “cell part” (24,076) “cell” (23,984) and “organelle” (16,648).
Expression of specific genes and statistics ofdifferentially expressed genes (DEGs)
We investigated the transcript expression levels in the five tissues, and T1 had the highest number of expressed genes (76,124), followed by T4 (75.978), T2 (75,885), T3 (74,396) and T2 (74,327) (Additional file 4: Figure S3a). The number of genes co-expressed in each tissue was 68,241. Fragments per kilobase of transcript sequence per million base pairs sequenced (FPKM) values were used to represent the expression levels of genes, and the distributions in each tissue are shown in Additional file 4: Figure S3b. To determine specific genes expressed in tissues and provide insight into the specialized developmental process, specific genes (at least two repeats and FPKM>0.1) with the top 5 expression levels were selected for investigation(Table 2). Among them, FPKMs were higher in T2 and T5: F01_cb16574_c994/f1p0/1592 (mitogen-activated protein kinase, MAPK3) and F01_cb71 58_ c 94/f1p0/1000 (betaine aldehyde dehydrogenases, BADH).
To identify the gene expression differences in the development of the caucasian clover rhizome, we focused on the identification of differentially expressed genes (DEGs). As shown in the diagram of the DEG distribution in different rhizome tissues (Fig.3a), T3 and T4 had the largest number (33,612) of DEGs, with 18,372 up-regulated and 15,240 down-regulated genes. Moreover, T5 and other tissues (T1, T2, T3 and T4) had more DEGs; 4,585 co-upregulated genes and 4,196 co-downregulated genes were found in T5 compared with different tissues (Fig.3b and c). With regard to the up-regulated genes in T5, only 65 were enriched in the KEGG analysis: “phenylpropanoid biosynthesis” (9), “protein processing in endoplasmic reticulum” (8), “phenylalanine metabolism” (7), “carbon metabolism” (7) and “isoflavonoid biosynthesis” (6) (Fig.3d). For the down-regulated genes in T5, 231 genes were annotated, mainly involving “carbon metabolism” (35), “biosynthesis of amino acids” (34), “glycolysis/gluconeogenesis” (29), “protein processing in endoplasmic reticulum” (25) and “spliceosome” (21) (Fig.3e).
TFs prediction and WGCNA analysis
Transcription factors (TFs) play critical roles in plant growth and development. We examined 4,501 putative TFs from 64 different families((Additional file 5: Table S2), and the top 20 TF families are shown in Fig.4a, with the AP2/ERF-ERF (374) , C3H (372), BHLH (324), WRKY (305), GRAS (302), NAC (270), BZIP (246), C2H2 (239), and MYB-related (180) families having the most. These TFs are widely involved in plant growth and responses to stress and are related to rhizome development.
We used weighted gene co-expression network analysis (WGCNA) to further explore the relationship between TFs (filtering TF with FPKM value <1 and K-ME<0.7) and physiological characteristics in the rhizome of caucasian clover (Additional file 6: TableS3). Highly correlated TFs genes clusters are defined as modules, in which TFs genes within the same cluster are highly correlated. WGCNA analysis identified eight distinct modules (labelled with different colours in Fig.4b). The correlation coefficient between the characteristic genes of each module of 10 different modules and each different sample (trait) is presented in Fig.4c. Notably, IAA-trait and GA3-trait were significantly correlated with the MEgreenyellow modules and MEgreen modules (r2 > 0.8, p <10-4). The majority genes of the green module were up-regulated for six traits, except for ABA; most of these TFs were mainly up-regulated in T3 (Additional file 7: FigureS4). We identified 11 hub genes based on the criteria of KME (eigengene connectivity) ≥0.99 and edge weight value ≥0.5 in the green module based on the regulatory network (Fig.4d). These hub genes mainly belong to the HB-KNOK, AP2/ERF-ERF, GRAS, C2H2, C3H and NAC families (MEgreen-GA3); moreover, these TF families were upregulated in T1, T2 and T3, particularly in T3 (Fig. 4e), and may be related to the formation of nodules in the rhizome.
Identification of hormone signalling-related genes in rhizome development
Plant hormones play an important role in all aspects of development. Accordingly, we mapped DEGs to hormone signalling and transduction pathways for caucasian clover and analysed their expression in different tissues. In total, 276 transcripts are related to the synthesis and metabolism of eight hormones, including auxin (IAA), abscisic acid (ABA), ethylene (ETH), cytokinin (CTK), gibberellic acid (GA), brassinosteroid (BR), jasmonic acid (JA) and salicylic acid (SA) (Fig.5). The maximum number of transcripts (62) is related to IAA synthesis and metabolism, followed by ABA at 60, JA at 52, SA at 24, ETH at 28, BR at 20, CTK at 18 and GA at 10. All these significant genes related to hormones synthesis and metabolism exhibited different expression in the different tissues. In addition, most genes associated with the IAA pathway were up-regulated in T2 and T5. Regarding SA signalling, almost all genes belonged to the TGA family, with up-regulation only in T4. Most transcripts associated with BR signaling showed higher levels in T4 and T5. For CTK transduction, all crucial genes associated with CTK signalling pathway were identified as DEGs. Only three genes showed no up-regulation trend in T3, which may be the tissues in which cells divide in large numbers. Only ten DEGs were associated with GA signalling; five transcripts were classified as GID1 and significantly up-regulated in T2. Genes related to ABA signalling and transduction displayed no significant change. For JA signalling, 52 transcripts all were annotated as JAZ, with up-regulation in T1, T2 and T3. In contrast, most of ETH signaling genes exhibited higher expression in T3, and all were down-regulated in T2.
Verification of gene expression by qRT-PCR
To confirm the accuracy of the genes obtained by RNA-seq, twelve genes including six plant hormone signal transduction genes, three TFs and three genes belonging to other classes were randomly selected for quantitative real-time RT-PCR (qRT-PCR) analysis. Good reproducibility between the qRT-PCR and RNA-seq results was indicated by Pearson’s correlation analysis, verified the accuracy and reliability of the RNA-seq data (Additional file 8: TableS4).