PacBio Iso-Seq sequencing of Miscanthus transcriptome
To obtain a wide coverage of the Miscanthus transcriptome, high-quality mRNAs from leaves and stems of two Miscanthus sinensis genotypes (B0605 and C0542) were sequenced using the PacBio Iso-Seq Sequel platform. A total of 1,453,866,187 nucleotides were generated from the B0605 sample with a total of 647,252 multipass reads of inserts (ROIs), which included 597,037 (92.24%) FL non-chimeric ROIs and 39,686 (6.13%) non-FL ROIs (Table 1). The length of B0605 ROIs ranged from 200 bp to 15,600 bp, with a mean read length of 2,246 bp (Fig. 1a and Table 1). A total of 1,413,954,821 nucleotides were generated from the C0542 sample with a total of 635,350 ROIs, which included 583,088 (91.77%) FL non-chimeric ROIs and 43,779 (6.89%) non-FL ROIs (Table 1). The length of C0542 ROIs ranged from 200 bp to 14,000 bp, with a mean read length of 2,225 bp (Fig. 1a and Table 1). Overall, our PacBio Iso-Seq dataset consisted mostly of high-quality ROIs with quality values above 0.95, which is much higher than the quality of most PacBio ROIs reported in previous studies (above 0.85) (Fig. 1b) [18, 22]. Furthermore, we used an isoform-level clustering tool CD-HIT-EST to generate the cluster consensus of all the FLNC ROI sequences. In total, 240,665 final consensus isoforms were obtained with lengths ranging from 200 bp to 18,237 bp and a N50 value of 2,863 bp (Fig. 2a and Table 1).
Combined Miscanthus transcriptome with Illumina short-read sequencing
To provide high coverage sequence information, high-quality mRNAs from two biological replicates of the same tissues used for PacBio sequencing were simultaneously sequenced on the Illumina HiSeq 2,500 platform with pair-end of 150 bp. After cleaning and quality checks, more than 6 Gb of sequencing data with over 400 million total clean reads were obtained from each sample (Table 2). The Q20 percentage of each dataset was above 97.5%, Q30 percentage of each dataset was above 94% and quality scores of most reads exceeded 39 (Table 2 and Fig. 1c). Overall, the length of Illumina reads was distributed among lower reads than PacBio reads and constituted 65.83% of reads < 600 bp (Fig. 1a). The filtered Illumina read sequences of each sample were then aligned against the PacBio isoforms of the corresponding sample using Bowtie 2 [26] with the default highly sensitive settings. More than 90% of Illumina reads were completely mapped to the PacBio isoforms of each sample, which revealed a good agreement between the short-read datasets and the long-read datasets at the nucleotide level (Table 2). The other clean reads that could not be mapped to the PacBio isoforms were further de novo assembled into 196,798 transcripts using the Trinity program [27]. The size of the assembled transcripts ranged from 200 to 8,526 bp, with a N50 value of 619 bp (Table 3 and Fig. 2b). Among these transcripts, 147,335 (78.75%) were shorter than 600 bp and only 16,206 (8.66%) were longer than 1,000 bp (Fig. 2b).
Furthermore, we used CD-HIT-EST to generate the cluster consensus of all the PacBio isoforms and Illumina assembled transcripts. A total of 408,801 final non-redundant transcripts were obtained after combining the two datasets with the length distributed from 200 bp to 18,237 bp and a N50 of value 2,697 bp (Table 3 and Fig. 2c). Of all the detected transcripts, 189,406 (46%) genes could be identified by both Illumina HiSeq and PacBio Iso-Seq platforms (Fig. 2d). Illumina HiSeq detected more uniquely identified genes (168,149, 41%) than PacBio Iso-Seq (51,246, 13%) (Fig. 2d), suggesting that the sequencing depth of PacBio Iso-Seq is lower than that of Illumina HiSeq. Moreover, BUSCO (Benchmarking Universal Single Copy Orthologs) was utilized to determine the completeness of our transcript dataset with the Viridiplantae (green plant) dataset [28]. The results showed that this dataset contains 87.5% complete plus 11.5% partial BUSCO orthologues (Additional file 1). Only a minority (0.9%, 4 genes) of the genes was missing in our Miscanthus transcriptome, which indicated good coverage and transcriptome completion of the dataset.
Comparison of the Miscanthus transcriptome with genomes of related species
As a non-model species in the Andropogoneae tribe, a draft chromosome-scale genome sequence for M. sinensis has recently been released, which is insufficient compared to other Andropogoneae crops with high quality, well annotated genomes such as rice, maize, sorghum, and sugarcane [29, 30]. The transcriptome is the comprehensive, functional readout of the genome. The availability of complete genome sequences of rice, maize, sorghum, sugarcane and the recently released miscanthus genome makes it possible to conduct comparative genomic analyses to gain a better understanding of the Miscanthus transcriptome [29–35]. In our study, the six sequenced datasets including two sets of PacBio FLNC ROIs and four sets of Illumina clean reads as well as all the final clustered transcripts were mapped to the genomes of rice [31], maize [32], sorghum [33], sugarcane [34] and Miscanthus [29] respectively (Table 4). For all the species, mapping ratios were highest with PacBio FLNC ROIs (34.92%–99.80%), followed by clustered transcripts (10.06%–96.64%), and then Illumina clean reads (0.6%–87.02%). This indicates that the Illumina-derived reads and de novo assembled transcripts may generate more sequence divergence than the PacBio Iso-Seq long isoforms. In total, 99% of the PacBio reads, about 85% of the Illumina reads, and 96% of the final combined transcripts could be mapped back to the Miscanthus genome. The high mapping ratios between the Miscanthus transcriptome and genome reflects the high accuracy of PacBio sequencing and strengthened the confidence in our integrated approach.
On the other hand, the mapping ratios of both the PacBio and Illumina-derived Miscanthus transcriptome sequences to the genome sequences of the four other species ranged from low to high, in the following order: maize, sorghum, sugarcane, and rice. The mapping ratios were the lowest when compared with the rice genome (0.6%–35.23%), which is unsurprising because rice is a C3 photosynthesis species while Miscanthus and the other three species are C4 photosynthesis species.
Miscanthus and the other three species that belong to the Saccharastrae subtribe were more closely related. Among them, the highest mapping ratios were identified between the Miscanthus and the sugarcane genome (45.6%–93.57%), followed by the sorghum genome (30.7%–80.22%), and the maize genome (12.3%–77.67%). These results provided new evidence to support previous phylogenetic studies of the relationships between Saccharastrae (which includes Miscanthus, maize, sorghum and sugarcane), Saccharum (including Miscanthus, sorghum and sugarcane), and the still unresolved interspecific breeding group “Saccharum complex” (which includes Miscanthus and sugarcane) [36].
Transcript annotation and functional classification
To confirm the putative function of the assembled transcripts, all 408,801 non-redundant transcripts were annotated using BLAST based on sequence similarity searches against five public protein databases, including the NR (NCBI non-redundant protein sequences), Swiss-Prot (a manually annotated and reviewed protein sequence database), eggNOG (evolutionary genealogy of genes: Non-supervised Orthologous Groups), GO (Gene Ontology), and KEGG (Kyoto Encyclopedia of Genes and Genome) databases, with an E-value threshold of 10-5 (Additional file 2). In total, 306,228 (∼75%) transcripts were successfully annotated in at least one of these five databases. Amongst them, 302,375 (∼74%) transcripts had significant hits in Nr database, 201,067 (∼49%) in Swiss-Prot database, 297,975 (∼73%) in eggNOG database, 154,759 (∼38%) in GO database, and 11,405 (∼3%) in KEGG database (Fig. 3a). Meanwhile, 102,573 (∼25%) transcripts remained unknown, so they can be considered as new transcripts putatively unique to Miscanthus (Additional file 2). Of all the hits to the NR proteins from BLASTX, most transcripts (149,618, ∼49%) were annotated to Sorghum bicolor, followed by Zea mays (59,871; 19.8%) (Fig. 3b), which is consistent with several previous Miscanthus transcriptome analyses when the sugarcane genome sequences were not yet available [10–12]. Thus, it is no surprise that most of our Miscanthus sequences could be mapped to the sugarcane genome but could not be annotated owing to the limited sugarcane genome annotation information in public databases.
As an international standardized gene functional classification system, GO offers a dynamic updated controlled vocabulary and a strictly defined concept to describe the properties of genes and their products in any organism [37, 38]. A total of 154,759 transcripts were annotated to at least one GO term, which were classified into 50 functional groups under three major functional categories (biological process, cellular component, and molecular function) (Fig 4). In the category of biological process, the main groups were “cellular process” (75,733; 48.9%), “metabolic process” (73,453; 47.5%), and “single-organism process” (52,302, 33.8%). In the cellular component category, transcripts related to the “cell” (72,611; 46.9%), “cell part” (72,224; 46.7%), and “organelle” (19,514; 12.6%) were the best-represented groups. For the molecular function category, the most abundant transcripts were associated with “catalytic activity” (74227; 48%) and “binding” (71,790; 46.4%). Gene products often participate in multiple processes, and a single gene can therefore be annotated to multiple GO terms. Of the transcripts with GO annotations, about 24% (36,930) were assigned to just one GO term, more than 75% (117,529) were assigned to over two GO terms, and about 3% (5,563) can be assigned to over ten GO terms (Additional file 3). All these transcripts are important resources for genetic manipulations of Miscanthus in the future. Furthermore, the networks of gene interactions in cells could be well understood by the KEGG pathway analysis. In total, 11,405 transcripts were assigned into 32 KEGG pathway levels covering five main categories including organismal systems (1,770; 15.5%), cellular processes (1,137; 10%), environmental information processing (928; 8.1%), genetic information processing (2,558; 22.4%), and metabolism (4,860; 42.6%) (Fig. 5). The three most represented pathways were “translation”, “signal transduction”, and “biosynthesis of other secondary metabolites”, followed by “folding, sorting and degradation” and “metabolism of terpenoids and polyketides”, whereas “sensory system” and “membrane transport” pathways represented the smallest categories.
As Miscanthus sinensis is a potentially high-yielding bioenergy crop, cell wall biosynthesis/assembly is of fundamental importance to the biomass accumulation and utilization of the plant. Thus, a total of 1,108 transcripts associated directly or indirectly with cell wall biosynthesis and assembly were sorted out in Additional file 4 according to the annotation information of all transcripts (Additional file 2). Most of the identified transcripts were correlated with genes involved in the biosynthesis of three major cell wall components, including lignin, cellulose and hemicellulose. Lignin biosynthesis is a complex process that involves many enzymatic reactions. Almost all genes involved in each step of the monolignol biosynthetic pathway have been identified in the Miscanthus transcriptome, including phenylalanine ammonialyase (PAL), 4-coumarate-CoA ligase (4CL), Cinnamate 4-hydroxylase (C4H), caffeoyl CoA 3-O-methyltransferase (CCoAOMT), Caffeic acid 3-O-methyltransferase (COMT), cinnamyl alcohol dehydrogenase (CAD), cinnamoyl-CoA reductase (CCR), hydroxycinnamoyl CoA shikimate hydroxycinnamoyl transferase (HCT), p-coumaroylshikimate 3′-hydroxylase (C3′H), and ferulic acid 5-hydroxylase (F5H). Cellulose is mainly synthesized by members of cellulose synthase (CESA) and cellulase enzymes, which were representative ones found in the cellulose-related transcripts. KORRIGAN (KOR) and COBRA family genes that may be involved in the assembly of crystalline cellulose were also identified. We have also identified numerous hemicellulose biosynthesis related transcripts of Xyloglucan endo-transglycosylase (XTH), Xyloglucan 6-xylosyltransferase (XXT), Xyloglucan glycosyltransferase (XGT), Xylosyltransferase (XT) and Glucan synthase (GS) subfamily genes, indicating that Xyloglucans maybe the predominant hemicelluloses in the Miscanthus cell walls. In addition to the three major cell wall components, a minor proportion of cell wall structural protein related transcripts have also been found in the transcriptome, such as pectin related genes and cell wall loosening genes expansin (EXP). Additionally, transcripts of other glycosyltransferases(GTs)genes may also be related to cellulose or hemicellulose biosynthesis. The identification of genes involved in the formation of cell wall components and structures could provide insight into the molecular mechanisms underlying biosynthesis and assembly of cell wall in Miscanthus and thus serve as good candidates for future functional studies for improving the biomass properties of Miscanthus.
Transcription factors
Transcription factors (TFs) play critical roles in various plant developmental processes by regulating transcription to switch genes on and off. They act either alone or in a coordinated fashion to trigger many fundamental genomic processes such as cell division, cell death, and development, and periodically react to signals coming from outside the cell [39]. The TF-encoding Miscanthus transcripts were annotated by comparing all the 408,801 integrated transcripts against the plant transcription factor database (PlantTFDB v3.0) [40]. A total of 125,608 (~30.7%) TFs were identified to be broadly distributed against 116 plant species (Additional file 5). Most TF transcripts were annotated from the well characterized rice database, followed by sorghum, which showed very high genome mapping ratios to Miscanthus transcripts (Fig. 6a). These TF transcripts were from 57 annotated TF families, of which the top 25 identified families have been listed in Figure 6b. The largest five families included bHLH (11,877; 9.5%), MYB (10,219; 8.1%), WRKY (9,479; 7.5%), NAC (8,071; 6.4%), and ERF (7,392; 5.9%), which are all well known and most studied plant transcription factors. Additional noteworthy families with high numbers of transcripts included FAR1-like, C2H2, C3H, bZIP, and B3 families. The high prevalence of TFs identified in our transcriptome suggest relatively complex transcriptional regulation in Miscanthus plants.
Identification of alternative splicing in Miscanthus
Alternative splicing (AS) is a crucial regulatory mechanism that played an important role in understanding transcriptome and proteome diversity [41]. The extent and complexity of AS has been studied in several model plants with reference genomes mainly using high-throughput next-generation sequencing [14]. However, it is difficult to use this technique to examine AS in species without well-annotated reference genomes, especially in heterozygosity species. Although short-read sequencing transcriptome analysis of Miscanthus have previously been investigated, a precise prediction and identification of the alternative transcript splicing has not been possible in this potentially very complex transcriptome. However, the newest PacBio sequencing enables examining alternative splicing events without using a reference genome, 240,665 consensus full-length transcripts were used to identify alternative splicing events by using the de novo pipeline [23]. A total of 3,898 alternative splicing events were identified, including 2,155 (55.28%) exon skipping events, 1,245 (31.94%) intron retentions, 142 (3.64%) alternative 3’splice sites, and 1,356 (9.13%) alternative 5’splice sites (Fig. 7a). Among the AS genes, more than half of them (78.03%) possess only two isoforms, 15.50% genes possess three isoforms, 3.64% genes possess four isoforms, 1.53% genes possess five isoforms, 1.02% genes possess more than six isoforms and 0.27% genes have >10 isoforms (Fig. 7b). Among them, 2,936 isoforms were annotated from the Nr database and sorted into 10 broad functional categories according to their predict functions (Fig. 7c, Additional file 6). The major category included isoforms whose functions have not yet been ascertained, which were assigned as ‘unknown’ category. The rest noticeably enriched categories were primary metabolism (4.77%), defence/stress (3.34%) and binding and transport (3.30%). Besides, 0.85% isoforms were assigned to ‘cell wall related’ category and 1.63% isoforms were assigned to ‘transcriptional regulation’ category, these isoforms are worth further studies to reveal their regulatory functions on cell wall biosynthesis/assembly.