Overview of full-length transcriptome sequencing in abiotic stress-treated M. sativa L. roots. In this study, M. sativa L. roots were grown under different abiotic stressors (darkness, low temperature, 400 mM NaCl, 25% PEG, and 100 µM ABA) for 0, 3, 6, 12, 24, 48, and 72 h (Fig. 1A). Then, we extracted total RNA from the M. sativa L. roots, which were subjected to different abiotic stressors, equally mixed the RNAs together. After purification and repair, we constructed and examined the library using Agilent 2100 system (Agilent Technologies, Palo Alto, CA, USA). A full-length transcriptome sequencing experiment was carried out using PacBio Sequel (Fig. 1B). The obtained accurate full-length transcripts were analyzed using PacBio Single Molecule Real Time (SMRT™) DNA sequencing technology (Fig. 1C). A total of 21.53 Gb clean reads were obtained and utilized to correct the SMRT reads in M. sativa L. roots under abiotic stress. Based on full passes ≥ 0 and sequence accuracy > 0.80, we successfully obtained 566,076 reads-of-insert (ROI) (total bases: 1,058,177,144) with mean length of 1,869 bp, mean read quality of 0.96, and 19 passes.
Summary of Pacbio RS sequencing base on sequencing by synthesis (SMRT). Besides, the ROI read length distribution of each size bin and the 1–6 k size bin were shown, and the ROI read length was mainly distributed around 2000 (Fig. 2A and 2B). The number of full passes is related to the length of the cDNA, but generally decreases with the increase in cDNA length. The accuracy of the ROI sequence is affected by the number of full passes. The more the full passes contribute the higher the sequence accuracy. ROI quality values reflect the accuracy of the sequence. The distribution of full passes of ROI sequences was shown in Fig. 2C, and the distribution of ROI quality values was high (Fig. 2D). Besides, through the full-length transcriptome sequences, we obtained 34,871 filtered short reads, 116,022 non-full-length reads, 415,183 full-length reads, and 409,291 full-length non-chimeric reads. The average full-length non-chimeric read length (FLNC) was 1,916; the full-length percentage was 73.34% (Table S2). The FLNC length distributions of each size bin and the 1–6 k size bin were what was expected, and the FLNC length also mainly distributed around 2000 (Fig. 2E and 2F). According to the ROI classification for the 1–6 k size, the percentage of filtered short reads was 6.2%; the percentage of full-length (chimeric) was 1%; the percentage of full-length (non-chimeric) was 73.2%; the percentage of non-full-length (no poly-A) was 11.6%; and the percentage of the non-full-length (no primer) was 8.9% (Fig. 2G).
Alternative splicing and SSR analyses and the prediction of coding sequences. In our results, there were 194,286 consensus isoforms (average consensus isoform read length of 2,077), 41,248 polished high-quality isoforms, 152,636 polished low-quality isoforms; the percent of polished high-quality isoforms was 21.23%. The consensus isoform length distribution of each size bin was exhibited, and the consensus isoform length was also mainly distributed around 2000 (Fig. 3A). In addition, BUSCO was used to evaluate transcriptome integrity, and the results indicated that the number of complete and single-copies was 418, the number of complete and duplicated was 536, the number of fragmented transcriptomes was 29, and the number of missing transcriptomes was 457 (Fig. 3B). Meanwhile, BLAST software was applied to predict the candidate events of alternative splicing through pairwise comparison of the unreferenced transcriptomes in the three generations of after redundantion[27]. If the comparison results satisfy the conditions in the previous research, they are considered as candidate alternative splicing events. And we also exhibited the alternative splicing events in Table S3. Furthermore, we analyzed SSRs using the MIcroSAtellite identification tool (MISA). Here, the detection of SSRs was part of the genetic analysis for M. sativa L. roots under abiotic stress. And the distribution of SSR types was displayed. There were 3944 compound SSR (c), 35 compound SSR with overlapping bases between two SSRs (c*), 13133 Mono-nucleotide (p1), 3535 Di nucleotide (p2), 6825 Tri nucleotide (p3), 349 Tetra nucleotide (p4), 117 Penta nucleotide (p5) and 113 Hexa nucleotide (p6) (Table S4). The coding region sequences and the corresponding amino acid sequences of the transcripts were predicted using TransDecoder software, and we also successfully obtained a total of 75,596 ORFs, including 42,725 complete ORFs. The complete length distribution of the predicted CDS coding proteins was shown in Fig. 3C.
LncRNA prediction and transcription factor analysis. Meanwhile, the numbers of lncRNA transcripts were presented in Fig. 4A, which was predicted by CPC, CNCI, CPAT and a pfam protein structure domain analysis. We discovered 3043 lncRNAs through the conjoint analysis of the CPC, CNCI, CPAT and pfam. Next, the target genes of the predicted 3,043 lncRNA sequences were also predicted using LncTar, and the predicted target genes were exhibited in Table S5. In addition, the transcription factors were predicted and analyzed using iTAK software; 8,336 transcription factors were predicted, and the distribution of the different transcription factors was shown in Fig. 4B. We discovered that the transcription factors mainly contained C3H, C2H2, bZIP, bHLH, GRAS, WRKY, SNF2, AP2/ERF-ERF, MYB-related, PHD, etc.
Functional annotation of the transcripts. In our results, 77,221 transcripts were annotated, including 32,297 transcripts in the COG database, 55,044 in the GO, 34,349 in the KEGG, 48,647 in the KOG, 62,168 in the Pfam, 54,521 in the Swiss-Prot, 72,017 in the eggNOG and 76,356 in the NR (Table S6). And the integrated_function annotation of all identified transcripts was also displayed in Table S7. According to the NR database, in the homologous species distribution, Medicago truncatula accounted for 80.03%, Fusarium verticilliides accounted for 3.92%, Cicer arietinum accounted for 2.66%, and Medicago sativa accounted for 2.66% (Fig. 5A). The GO annotation system was divided into biological processes (BP), molecular functions (MF), and cellular components (CC). We discovered that these differentially expressed transcripts under abiotic stresses were mainly enriched in CC terms (cell part, cell, organelle, membrane, organelle part, membrane part, macromolecular complex and cell junction, etc), MF terms (catalytic activity, binding, transporter activity, structural molecule activity, nucleic acid binding transcription factor activity, electron carrier activity, molecular transducer activity and enzyme regulator activity, etc) and BP terms (metabolic process, cellular process, single-organism process, response to stimulus, biological regulation, localization and cellular component organization or biogenesis, etc) (Fig. 5B). The COG database can be utilized to directly classify homologous gene products. Our results exhibited that general function prediction only accounted for 18.84%, replication, recombination and repair accounted for 10.4%, transcription accounted for 10.28%, and signal transduction mechanisms accounted for 9.84% (Fig. 5C). The eggNOG database serves as an annotation of functional descriptions and classifications for directly homologous groups. Our data revealed that the eggnog function classification under abiotic stresses mainly included posttranslational modification, protein turnover, chaperones (7.89%), signal transduction mechanisms (6.87%), transcription (5.11%), and carbohydrate transport and metabolism (4.92%), etc (Fig. 5D). The KOG database was used to divide homologous genes from different species into different orthologous clusters. Our data displayed that the KOG function classification under abiotic stresses mainly contained general function prediction only, posttranslational modification, protein turnover, chaperones, signal transduction mechanisms, translation, ribosomal structure and biogenesis, and carbohydrate transport and metabolism, etc (Fig. 5E). The KEGG database is a collection of various pathways, representing molecular interactions and reaction networks. Our data from KEGG analysis disclosed that the enrichment pathways of the transcripts under abiotic stresses mainly included circadian rhythm-plant (ko04712), ubiquitin mediated proteolysis (ko04120), plant-pathogen interaction (ko04626), peroxisome (ko04146), endocytosis (ko04144), eyc. Our data exhibited the ubiquitin mediated proteolysis (Fig. 5F). We also exhibited all KEGG pathways and the related genes (Table S8).
Analysis of SNP and the expression of transcripts. The number of SNPs in HomoSNP, HeteSNP, and AllSNP was exhibited in Table S9. We also provide the SNP-related data in the Table S10. The distribution of SNP density was high in the 0–1 kb (Fig. 6A). In addition, the non-redundant transcripts obtained by third-generation sequencing (single-molecule real-time sequencing) were used as a reference for sequence alignment and subsequent analysis. STAR was used to conduct sequence alignment between Clean Reads and transcripts to obtain location information on transcripts. And the comparison of non-redundant transcripts between second-generation and third-generation sequences was shown in Table S11. The total distribution of expression for the transcriptomes was displayed in Fig. 6B. To further examine the degree of dispersion in transcript expression, a boxplot of fragments per kilobase of exon per million (FPKM) in each sample was prepared to intuitively compare the expression levels of the transcripts in different samples (Fig. 6C). The correlation of biological replicates can not only test the reproducibility of biological experiments, but also assess the reliability of differentially expressed genes. Pearson’s correlation coefficient analysis was used as the evaluation index of biological repeatability and correlation. We discovered that the three biological replicates in each group were relatively good according to the heatmap of the pairwise comparisons (Fig. 6D).
Identification of differentially expressed transcripts in M. sativa L. roots under CK, NaCl, and PEG stresses. Furthermore, based on the full-length transcriptome sequencing, the differentially expressed transcripts were then identified by NGS technology in M. sativa L. roots under CK, 400 mM NaCl, and 25% PEG stress. The differentially expressed transcripts were screened by DESeq, and the correlation plots of the M. sativa L. transcripts under CK, 400 mM NaCl, and 25% PEG stress were exhibited in Fig. 7A. In addition, the volcano plot and hierarchical clustering analysis revealed the differentially expressed transcripts in M. sativa L. transcripts in response to the three stressors. There were 4,080 differentially expressed transcripts (2,609 downregulated and 1,471 upregulated) in the CK stress group compared to the 400 mM NaCl stress group; 5,854 transcripts were differentially expressed (3,241 downregulated and 2,613 upregulated) in the CK stress group compared to the 25% PEG stress group; 8463 transcripts were differentially expressed (3,896 downregulated and 4,567 upregulated) in the 400 mM NaCl stress group compared to the 400 mM NaCl stress group (Fig. 7B and 7C, Table 1).
Table 1
The statistics of the differentially expressed transcripts
DEG_Set | All_DEG | up-regulated | down-regulated |
T01_T02_T03 vs. T04_T05_T06 CK stress vs. NaCl stress | 4080 | 1471 | 2609 |
T01_T02_T03 vs. T07_T08_T09 CK stress vs. PEG stress | 5854 | 2613 | 3241 |
T04_T05_T06 vs. T07_T08_T09 NaCl stress vs. PEG stress | 8463 | 4567 | 3896 |
Annotation analysis of differentially expressed transcripts in M. sativa L. roots under CK, NaCl, and PEG stresses. According to the NGS results, the differentially expressed transcripts were also documented by GO, COG, eggNOG, KOG, and KEGG pathway analyses. The data from GO analysis showed that the differentially expressed transcripts in M. sativa L. roots under CK, NaCl, and PEG stresses were mainly enriched in CC terms (cell, cell part, organelle, membrane, membrane part, organelle part and macromolecular complex, etc), MF terms (catalytic activity, binding, transporter activity, nucleic acid binding transcription factor activity, structural molecule activity and electron carrier activity, etc) and BP terms (metabolic process, cellular process, single-organism process, response to stimulus, biological regulation, and localization, etc) (Fig. 8A). The COG function classification in M. sativa L. roots under CK, NaCl, and PEG stresses mainly included general function prediction only, signal transduction mechanisms, carbohydrate transport and metabolism, and transcription, etc (Fig. 8B). The eggNOG function classification in M. sativa L. roots under CK, NaCl, and PEG stresses mainly included carbohydrate transport and metabolism, osttranslational modification, protein turnover, chaperones, energy production and conversion, and amino acid transport and metabolism, etc (Fig. 8C). The KOG function classification in M. sativa L. roots under CK, NaCl, and PEG stresses mainly included General function prediction only, signal transduction mechanisms, posttranslational modification, protein turnover, chaperones, and translation, ribosomal structure and biogenesis, etc (Fig. 8D). In addition, the data from KEGG analysis presented that the enrichment pathways of differentially expressed transcripts between CK and NaCl stresses mainly included carbon metabolism, biosynthesis of amino acids, glycolysis/gluconeogenesis, starch and sucrose metabolism, ribosome and protein processing in endoplasmic reticulum, etc; the enrichment pathways of differentially expressed transcripts in PEG stress mainly included carbon metabolism, biosynthesis of amino acids, starch and sucrose metabolism, glycolysis/gluconeogenesis, phenylpropanoid biosynthesis, plant hormone signal transduction and plant-pathogen interaction, etc; the enrichment pathways of differentially expressed transcripts between NaCl and PEG stresses mainly included biosynthesis of amino acids, carbon metabolism, glycolysis/gluconeogenesis, ribosome, starch and sucrose metabolism, protein processing in endoplasmic reticulum and RNA transport, etc (Fig. 8E).Therefore, the differentially expressed transcripts in M. sativa L. roots under CK, NaCl, and PEG stress were mainly enriched in carbon metabolism, biosynthesis of amino acids, glycolysis/gluconeogenesis pathways.
Screening and identification of differentially expressed transcripts. Based on the KEGG analysis, we selected the highly differentially expressed transcripts, which were related to carbon metabolism, biosynthesis of amino acids, glycolysis/gluconeogenesis pathways among the CK, NaCl, and PEG stress groups. As shown in Fig. 9A, hierarchical clustering exhibited the distributions of the Top10 up-regulated and Top10 down-regulated transcripts in Medicago sativa L. roots between NaCl and CK groups (Left), between PEG and CK groups (Middle), and between PEG and NaCl groups (Right). Subsequently, we also selected 5 upregulated and 5 downregulated transcripts between the three groups for verification. The results of the RT-qPCR assay disclosed that ADH1 was upregulated, PEPC and MJG19.6 were downregulated in the NaCl stress group relative to the CK stress group; PCKA and GAPC1 were upregulated, PEPC was downregulated in the PEG stress group compared to the CK stress group; BCAT2, PCKA and GAPC1 were upregulated, ADH1 and PEPC were downregulated in the PEG stress group compared to the NaCl stress group (Fig. 9B and 9C).