Tree tomato full-length transcriptome sequencing with SMRT
To capture a representative FL transcriptome of tree tomato, RNA samples from ten different tissues were collected and pooled in equal amounts for library preparation and sequencing. SMRT sequencing yielded a total of 9.92 Gb of subread bases, comprising 9,877,631 subreads with an average length of 1005 bp and an N50 of 1974 bp. Some of these subreads were extremely long (>5000 bp), whereas approximately 70.41% of the subreads fell in the size range of 200 to 1000 bp. Of the 416144 CCS isoforms, 308699 were identified as consensus FLNC reads with a mean length of 2099 bp, based on the ICE clustering algorithm (Table 1). The length distributions of the subreads and FLNC reads are shown in Fig. 1A and 1B, respectively.
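The mean length and N50 reported above summarize the subread length distribution in different ways. The following minimal sketch shows how both statistics are conventionally computed from a list of read lengths; the function names and the toy lengths are illustrative assumptions, not the pipeline or data used in this study.

```python
def mean_length(lengths):
    """Arithmetic mean of the read lengths."""
    return sum(lengths) / len(lengths)


def n50(lengths):
    """Return the N50: the length L such that reads of length >= L
    together contain at least half of all sequenced bases."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)  # longest reads first
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy read lengths (bp), for illustration only.
toy_lengths = [300, 500, 800, 1000, 1500, 2000, 2500]
print(round(mean_length(toy_lengths), 2))  # 1228.57
print(n50(toy_lengths))                    # 2000
```

Because N50 is weighted by bases rather than by read count, it exceeds the mean whenever many short reads coexist with a smaller number of long ones, which is consistent with the pattern reported here (mean 1005 bp, N50 1974 bp).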
Functional annotation of transcripts of tree tomato with Nr, Swiss-Prot, KEGG, KOG, GO, Nt and Pfam databases
To acquire a comprehensive reference FL transcriptome of tree tomato, transcripts were functionally annotated by sequence similarity search against seven different databases. A total of 140327, 104294, 135138, 78300, 53520, 152310 and 53520 transcripts were functionally annotated against the Nr, Swiss-Prot, KEGG, KOG, GO, Nt and Pfam databases, respectively (Fig. 2; Additional file 1: Table S1). The Nr homologous species distribution showed that the best BLAST hits for tree tomato transcripts came from Solanum tuberosum (52712 isoforms), Solanum pennellii (21171 isoforms), Solanum lycopersicum (16666 isoforms) and Capsicum annuum (15851 isoforms) (Fig. 3; Additional file 2: Table S2).
Transcripts annotated with GO terms were classified into three categories: biological process, cellular component and molecular function (Fig. 4; Additional file 3: Table S3). In the biological process category, metabolic process (27699 matched genes, 51.75%), cellular process (27089, 50.61%), single-organism process (20063, 37.49%), localization (7706, 14.40%), biological regulation (6786, 12.68%), regulation of biological process (6648, 12.42%) and response to stimulus (5893, 11.01%) were highly represented. The most abundant subcategories of cellular component were cell (12693 matched genes, 23.72%) and cell part (12693, 23.72%), followed by organelle (8699, 16.25%), macromolecular complex (7507, 14.03%), membrane (7344, 13.72%), membrane part (7019, 13.11%) and organelle part (3961, 7.40%). In the molecular function category, binding (30712 matched genes, 57.38%), catalytic activity (26279, 49.10%), transporter activity (3491, 6.52%), molecular function regulator (2574, 4.81%), structural molecule activity (1939, 3.62%), nucleic acid binding transcription factor activity (1199, 2.24%) and molecular transducer activity (958, 1.79%) were the most prominently represented (Fig. 4; Additional file 3: Table S3).
The transcripts obtained were also compared with the KOG database. KOG analysis showed that tree tomato transcripts were assigned to a total of 26 categories (Fig. 5; Additional file 4: Table S4). The largest group was general function prediction only (15323 matched genes, 19.57%), followed by post-translational modification, protein turnover, chaperones (9750, 12.45%) and signal transduction mechanisms (8614, 11.00%) (Fig. 5; Additional file 4: Table S4). In the KEGG functional classification, 5895 of the 135138 annotated transcripts were assigned to signal transduction, making it the largest group (4.36%) among the major categories. The next three largest KEGG categories were translation (5233, 3.87%), folding, sorting and degradation (4989, 3.69%), and carbohydrate metabolism (4745, 3.51%) (Fig. 6; Additional file 5: Table S5). In addition, alignment results against the Pfam, Swiss-Prot and Nt databases are summarized in Additional files 6, 7, and 8: Tables S6, S7, and S8, respectively.
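The category shares quoted for KOG and KEGG can be checked by simple arithmetic. The sketch below assumes that each percentage is the category count divided by the number of transcripts annotated in the corresponding database (78300 for KOG and 135138 for KEGG, as given in the annotation totals); the helper name `share` is illustrative.

```python
def share(count, total):
    """Percentage of `total` represented by `count`, rounded to 2 decimals."""
    return round(100 * count / total, 2)


# Assumed denominators: transcripts annotated in each database.
KOG_TOTAL = 78300
KEGG_TOTAL = 135138

kog_counts = {
    "general function prediction only": 15323,
    "post-translational modification, protein turnover, chaperones": 9750,
    "signal transduction mechanisms": 8614,
}
kegg_counts = {
    "signal transduction": 5895,
    "translation": 5233,
    "folding, sorting and degradation": 4989,
    "carbohydrate metabolism": 4745,
}

kog_shares = {name: share(n, KOG_TOTAL) for name, n in kog_counts.items()}
kegg_shares = {name: share(n, KEGG_TOTAL) for name, n in kegg_counts.items()}

print(list(kog_shares.values()))   # [19.57, 12.45, 11.0]
print(list(kegg_shares.values()))  # [4.36, 3.87, 3.69, 3.51]
```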
Structure analysis of the full-length transcriptome of tree tomato
CDS from the full-length transcriptome of tree tomato were predicted using ANGEL software, and the frequency of each CDS length was evaluated. The most prevalent CDS lengths ranged from 400 to 2000 bp (Fig. 7). A detailed breakdown of each CDS length category is listed in Additional file 9: Table S9.
Analysis of the non-redundant transcripts with iTAK software predicted a total of 5114 genes to be TFs (Additional file 10: Table S10). These TFs belonged to different TF families, among which the most abundant was SNF2 (338 matched genes, 6.61%), followed by C3H (336, 6.57%), others (309, 6.04%), GRAS (213, 4.17%), MYB-related (188, 3.68%), bHLH (167, 3.27%), WRKY (163, 3.19%) and SET (161, 3.15%) (Fig. 8). A total of 43227, 42872, and 110333 noncoding RNA candidates were predicted by CPC, CNCI and Pfam, respectively. Among them, 29453 transcripts were identified by all three computational approaches (Fig. 9).
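The consensus set of noncoding RNA candidates corresponds to the central region of a three-way Venn diagram. A minimal sketch of that step, assuming each tool's output has already been reduced to a collection of candidate transcript IDs (the IDs and the function name below are hypothetical):

```python
def consensus_noncoding(cpc_ids, cnci_ids, pfam_ids):
    """Transcript IDs flagged as noncoding candidates by all three approaches."""
    return set(cpc_ids) & set(cnci_ids) & set(pfam_ids)


# Hypothetical candidate lists, for illustration only.
cpc = ["t1", "t2", "t3", "t5"]
cnci = ["t2", "t3", "t4", "t5"]
pfam = ["t2", "t3", "t5", "t6", "t7"]

shared = consensus_noncoding(cpc, cnci, pfam)
print(sorted(shared))  # ['t2', 't3', 't5']
```

Taking the intersection is the conservative choice: a transcript is retained only if no approach finds evidence of coding potential, so the consensus set can be no larger than the smallest individual prediction.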
SSR identification and validation of tree tomato
A screen of the 79549 genes using the MicroSatellite identification tool (MISA) yielded diverse SSR types, including mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide repeats as well as some complex repeats. Among these, mononucleotide repeats (63.97%) occurred most frequently, followed by dinucleotide (8.54%) and trinucleotide repeats (7.79%) (Fig. 10; Additional file 11: Table S11). To validate the SSR markers, 30 primer pairs were randomly selected, of which 23 (76.7%) successfully amplified tree tomato genomic DNA, yielding clear PCR amplicons of the expected product sizes. These 23 primer pairs produced stable, reproducible bands and can be selected for further analysis (Fig. 11).
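To illustrate what an SSR screen detects, the following minimal sketch finds perfect tandem repeats with a regular expression. It is not the tool used in this study: the minimum repeat thresholds (10 for mononucleotide, 6 for dinucleotide and 5 for tri- to hexanucleotide units) are assumed values chosen for illustration, complex (compound) SSRs are not handled, and the test sequence is invented.

```python
import re

# Assumed thresholds: unit length -> minimum number of tandem repeats.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def find_ssrs(seq):
    """Return sorted (start, unit, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One unit followed by at least (min_rep - 1) further copies of it.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip units that are repeats of a shorter motif
            # (e.g. "AA" or "ATAT"); those belong to the shorter class.
            if any(
                unit == unit[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            ):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)


# Invented sequence containing one mono-, one di- and one trinucleotide SSR.
toy_seq = "GGC" + "A" * 12 + "CGT" + "AG" * 7 + "GAC" + "CTT" * 5 + "GA"
ssrs = find_ssrs(toy_seq)
print(ssrs)  # [(3, 'A', 12), (18, 'AG', 7), (35, 'CTT', 5)]
```

Each reported start position is a 0-based offset into the sequence; in practice, flanking sequence around such positions is what primer pairs like those validated above would be designed against.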