PacBio single molecule real-time sequencing-based full-length transcriptome of tree tomato (Solanum betaceum Cav.) and mining of simple sequence repeat (SSR) markers

Background: Tree tomato (Cyphomandra betacea (Cav.) Sendtn.) is a neglected, fast-growing, promising small fruit crop that provides a rich source of nutrition for human consumption. However, a transcriptome atlas of this important species is still lacking. Results: In this study, RNA samples from a broad diversity of tissues (roots, leaves, stems, flowers and fruits) of tree tomato were sequenced using Pacific Biosciences' long-read single-molecule real-time sequencing technology. A total of 308699 full-length non-chimeric sequences with a mean length of 1005 bp and an N50 length of 1974 bp were obtained from our multi-tissue normalized cDNA libraries. A total of 140327, 104294, 135138, 78300, 53520, 152310 and 53520 transcripts were functionally annotated using the Nr, Swiss-Prot, KEGG, KOG, GO, Nt and Pfam databases, respectively. Gene structural characteristics of the full-length transcriptome of tree tomato were subsequently investigated, including the prediction of coding sequences and the identification of transcription factor families, long non-coding RNAs and simple sequence repeat (SSR) markers. Thirty primer pairs were randomly selected to evaluate the application of SSR markers, 23 of which amplified successfully. Conclusions: This is the first confident characterization of the FL transcriptome profile of tree tomato. The large-scale and high-quality transcriptome atlas and SSR molecular markers provided in the present study will facilitate further genetic studies of this important species.


Background
Tree tomato (Solanum betaceum Cav., syn. Cyphomandra betacea (Cav.) Sendtn.), also familiarly known as tamarillo, is a neglected, fast-growing, promising small fruit crop native to the Andean region [1] and widely cultivated in the tropics and subtropics of South America, New Zealand, Australia and India, among others [2][3][4]. As fresh fruit, it is an important and nutrient-dense food source in the human diet, containing plenty of sugars, organic acids, minerals, ascorbic acid, provitamin A, carotenoids, vitamin B6 and phenolics [5][6][7]; as a processed product, it represents an important export commodity and stimulates both local and overseas demand in fruit markets [3].
Previous studies mainly focused on its biochemical properties [5,6], phenology [8], and reproductive biology, including flower and pollen morphology, physiology, fruit characteristics, intraspecific hybridization and genetic diversity [4]. Despite its importance and recent progress, a reference genome and transcriptome of tree tomato are not available, which has severely impeded in-depth functional genomics, molecular genetics and genetics-assisted breeding of tree tomato. Additionally, de novo assembly of transcriptome sequences from second-generation short-read sequencing, without a well-annotated reference genome, has been challenging [9]. The advent of Pacific Biosciences' (PacBio) long-read single-molecule real-time (SMRT) sequencing approach, also called third-generation sequencing technology, addressed these challenges and provided opportunities to obtain reliable genome-wide full-length (FL) transcripts directly [9].
The third-generation sequencing technology can generate reads with an average length of more than 10 kb ('P6-C4' chemistry), thereby removing the need for further assembly and covering the size distribution of most transcripts in eukaryotes [10,11]. Under such circumstances, PacBio sequencing has become an ideal tool to effectively and accurately capture FL or nearly FL transcripts of model and non-model species [9].
To overcome the drawback of the high sequencing error rate in PacBio sequencing, a tailored analysis pipeline, the Isoform Sequencing (Iso-Seq) pipeline, has been developed to calculate the circular consensus sequence (CCS) from more than two subreads [12].
In addition, transcriptome profiling has proved an effective approach for genome-wide development of simple sequence repeat (SSR) markers in many non-model plants at a large scale and low cost [15,24,25,26]. SSRs are good DNA fingerprinting markers for assessing genetic diversity and population structure and for distinguishing closely related cultivars because of the advantages of single locus, multiple allele variations and abundant polymorphism [27]. To date, only AFLP markers have been used to measure the genetic diversity of different tree tomato varieties [28]. SSR markers identified and developed at the genome-wide scale in tree tomato are therefore highly desirable.
Herein, the PacBio SMRT sequencing technology was adopted to construct FL cDNA libraries from several tissues of tree tomato. The structural characteristics of the transcripts were then investigated, including the prediction of coding sequences (CDS) and the identification of transcription factor (TF) families, long non-coding RNAs (lncRNAs) and SSR markers. The distribution of SSR motifs was also investigated and SSR analysis was performed. This is the first confident characterization of the FL transcriptome profile of tree tomato. Molecular breeding of tree tomato will be accelerated by developing SSR markers associated with the FL transcriptome. The results of this study open exciting avenues for transcriptome-based studies of this important and promising fruit crop.

Results
Tree tomato full-length transcriptome sequencing with SMRT
To capture a representative FL transcriptome of tree tomato, RNA samples from ten different tissues were collected and pooled in equal amounts for library preparation and sequencing. Using SMRT sequencing technology, a total of 9.92 Gb of subread bases were obtained, comprising 9,877,631 subreads with an average length of 1005 bp and an N50 length of 1974 bp. Some of these subreads were extremely long (>5000 bp). Approximately 70.41% of the subreads fell in the size range of 200 to 1000 bp. Of the 416144 CCSs, 308699 were identified as consensus FLNC reads with a mean length of 2099 bp (Table 1) based on the ICE clustering algorithm. The length distributions of the subreads and FLNC reads are shown in Fig. 1A and 1B, respectively.
The transcripts obtained were also compared with the KOG database. KOG analysis showed that tree tomato transcripts were assigned to a total of 26 categories (Fig. 5; Additional file 4: Table S4). The largest group was general function prediction only (15323 matched genes, 19.57%), followed by post-translational modification, protein turnover, chaperones (9750, 12.45%) and signal transduction mechanisms (8614, 11.00%) (Fig. 5; Additional file 4: Table S4). In the KEGG functional classification, 5895 of the 135138 annotated transcripts (4.36%) were assigned to signal transduction, making it the largest group among the major categories. The next three largest KEGG categories were translation (5233, 3.87%), folding, sorting and degradation (4989, 3.69%), and carbohydrate metabolism (4745, 3.51%) (Fig. 6; Additional file 5: Table S5). In addition, alignment results against the Pfam, Swiss-Prot and Nt databases are summarized in Additional files 6, 7, and 8: Tables S6, S7, and S8, respectively.

Structure analysis of the full-length transcriptome of tree tomato
CDS from the full-length transcriptome of tree tomato were predicted using ANGEL software, and the frequency of each CDS length was evaluated. The most prevalent CDS lengths ranged from 400 to 2000 bp (Fig. 7). A detailed breakdown of each CDS length category is listed in Additional file 9: Table S9.

SSR identification and validation of tree tomato
A screen of the 79549 genes using MISA yielded diverse SSR types including mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide repeats and some complex repeats. Among these, the mononucleotide repeats (63.97%) exhibited the highest frequency of occurrence, followed by dinucleotide (8.54%) and trinucleotide repeats (7.79%) (Fig. 10; Additional file 11: Table S11). For validation purposes, 30 primer pairs were randomly selected to evaluate the application of SSR markers, 23 of which were successfully amplified from the genomic DNA of tree tomato, resulting in clear PCR amplicons of the expected product sizes. These 23 primer pairs, which showed reproducible bands and stable repeatability, can be selected for further analysis (Fig. 11).
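To illustrate how perfect SSRs of each motif class can be located in a sequence, the following minimal Python sketch scans for tandem repeats with regular expressions. It is not the MISA program used in this study: the function names, the toy sequence and the minimum repeat thresholds (10, 6, 5, 5, 5 and 5 repeats for mono- to hexanucleotide motifs) are assumptions made for illustration, and compound SSRs are not handled.

```python
import re
from collections import Counter

# Assumed minimum repeat counts per motif length
# (1 = mononucleotide ... 6 = hexanucleotide); illustrative values only.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_reducible(motif):
    """True if the motif is itself a repeat of a shorter unit (e.g. 'AA', 'ATAT')."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d) for d in range(1, n))


def find_perfect_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for each perfect SSR, sorted by start."""
    seq = seq.upper()
    hits = []
    for size, min_count in min_repeats.items():
        # A motif of `size` bases followed by at least (min_count - 1) copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_count - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_reducible(motif):
                continue  # counted under its shorter unit instead
            hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


# Toy sequence containing one mono-, one di- and one trinucleotide SSR.
toy = "GC" + "A" * 12 + "CGT" + "AG" * 7 + "TC" + "CTG" * 5 + "AC"
ssrs = find_perfect_ssrs(toy)
by_class = Counter(len(motif) for _, motif, _ in ssrs)
print(ssrs)
print(by_class)
```

Tallying hits by motif length, as in `by_class`, gives the kind of per-class frequency summary reported above.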

Discussion
Transcriptome analysis has become a powerful technique for investigating global gene expression profiles and has shaped our understanding of multiple biochemical pathways associated with physiological processes in the past few years [39]. Previously, because of the limitation of short sequencing reads, transcriptome analysis in species that lack reference genome sequences often encountered complicated problems [9,39]. Recent advances in the PacBio SMRT sequencing technique enable the simultaneous and accurate interrogation of genome-wide gene expression [9]. The availability of the PacBio SMRT sequencing technique has thus generated interest in understanding complex transcriptomes qualitatively and quantitatively [13][14][15][16][17][18][19][20].
Tree tomato has been identified as a promising small fruit crop high in antioxidants and nutritional value [2][3][4][5][6][7]. A high-confidence transcriptome atlas of this important species is still lacking. In the current study, through PacBio SMRT sequencing without assembly, we successfully obtained the first high-quality functionally annotated reference transcriptome for tree tomato. Moreover, transcriptome-derived SSR markers were developed in this study. The large-scale and high-quality transcriptome atlas and molecular markers provided in the present study will facilitate further genetic studies of this important species.
In order to capture as many transcribed genes as possible, a broad diversity of tissues from major plant organs (roots, leaves, stems, flowers, fruits, etc.) at different developmental stages (juvenile and adult) of tree tomato were collected for the RNA-Seq analysis in the current study. Collection encompassed the juvenile, vegetative and reproductive phases, representing a variety of transcriptional stages. Thus, the transcriptome atlas will be useful for future studies of tissue types and developmental stages.
PacBio sequencing generated 308699 FLNC sequences with a mean length of 1005 bp and an N50 length of 1974 bp (Table 1). The N50 value is a weighted median: sequences of this length or longer account for at least half of the total length of all contigs [40]. Previous studies reported N50 values of 3356 bp, 2459 bp and 3179 bp in white myoga ginger (Z. striolatum Diels) [15], tea plant (C. sinensis) [18] and litchi (L. chinensis Sonn.) [20], respectively. Comparatively speaking, larger N50 values represent more accurate and effective transcriptome assembly [40]. The differences among these studies may be species-related. Because no NGS-based transcriptome of tree tomato is available, the full-length transcriptome obtained by PacBio SMRT sequencing here could not be compared with an earlier version.
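As an illustration of the statistic, N50 can be computed by sorting sequence lengths in descending order and accumulating them until half of the total length is reached. The Python sketch below uses a hypothetical helper, `n50`, and toy lengths rather than the data of this study.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first length at which the running sum reaches half the total is the N50.
        if running >= half_total:
            return length


# Toy example: total length 20, so half is 10; 6 + 5 = 11 reaches it at length 5.
toy_lengths = [2, 3, 4, 5, 6]
print(n50(toy_lengths))  # prints 5
```

Because long sequences carry more weight than short ones, the N50 can differ considerably from the mean length of the same set.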
Since no other reference transcriptome or draft genome data are available for tree tomato, it is imperative to assign transcripts to different biological functions and metabolic pathways. In this study, sequence-based alignments were therefore performed against multiple databases, resulting in the significant BLAST hits shown in Fig. 2-6. For example, we used the GO annotations to assign each transcript to a set of GO slims covering the biological process, cellular component and molecular function categories (Fig. 4). The GO annotation results illustrate that the transcripts of tree tomato are involved in diverse molecular functions and biological pathways [30]. The largest KOG group, "general function prediction only" (Fig. 5), generally denotes biochemical activity [32]. The KEGG classification in Fig. 6 integrates the molecular interaction networks and metabolic pathways in tree tomato [31].
Another important aspect of our study was the analysis of the structure of the full-length transcriptome of tree tomato, as shown in Fig. 7-9. CDS (Fig. 7), TFs (Fig. 8) and lncRNAs (Fig. 9) were thoroughly analyzed. LncRNAs represent a novel class of non-protein-coding transcripts and exert regulatory effects on numerous biological processes [41]. In this study, a rigorous screening criterion combining CPC, CNCI and the Pfam database led to the identification of lncRNAs (Fig. 9), which is useful for further investigation of the functional roles or evolution of lncRNAs in tree tomato.
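A combined screen of this kind can be expressed as simple set logic. The Python sketch below assumes that a candidate lncRNA must be called non-coding by both CNCI and CPC and must have no Pfam domain hit; this intersection rule, the function name and the transcript IDs are assumptions for illustration and are not taken from the pipeline used here.

```python
def candidate_lncrnas(cnci_noncoding, cpc_noncoding, pfam_hits):
    """Keep transcripts called non-coding by both tools and lacking a Pfam domain."""
    return (set(cnci_noncoding) & set(cpc_noncoding)) - set(pfam_hits)


# Hypothetical transcript IDs, for illustration only.
cnci = {"t1", "t2", "t3", "t4"}   # non-coding according to CNCI
cpc = {"t2", "t3", "t4", "t5"}    # non-coding according to CPC
pfam = {"t4"}                     # transcripts with a known protein domain

print(sorted(candidate_lncrnas(cnci, cpc, pfam)))  # prints ['t2', 't3']
```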
A full-length transcriptome, which contains an enormous quantity of sequence information, is a potentially rich source for SSR discovery [15]. Moreover, transcriptome-based SSR mining and development increase the likelihood of detecting SSR markers associated with functional genes because of their close linkage to expressed genes [42]. SSR markers have proved to be the most favored genetic markers for estimation of genetic diversity, phylogenetic analyses, genotype identification, marker-trait association, comparative mapping and genetic map construction [43]. Previously, genetic diversity studies of tree tomato germplasm have mostly relied on AFLP markers [28]. To the best of our knowledge, no SSR markers have been available in tree tomato until now. The current study presents the first mining and development of SSR markers in tree tomato (Fig. 10; Additional file 11: Table S11). We believe that the SSR markers generated here will fill this gap to some extent, if not completely. Moreover, in view of the lack of genome sequences for tree tomato, the SSR markers identified here contribute a valuable resource for marker-assisted breeding in tree tomato.

Conclusion
Recent advances in the PacBio long-read single-molecule real-time (SMRT) sequencing approach make it possible to decipher complex transcriptomes in plant species even without reference genome sequences. In this study, we successfully obtained a high-quality full-length transcriptome of an important and promising small fruit crop, tree tomato (Cyphomandra betacea (Cav.) Sendtn.), by using the PacBio SMRT sequencing technology. This is the first long-read transcriptome for tree tomato, which will be important for gene discovery in tree tomato and for future delineation of gene function and annotation of the tree tomato genome sequence. In addition, the newly discovered SSR markers from the transcriptome data will facilitate future molecular breeding of tree tomato.

Plant materials
Five-year-old bearing tree tomato plants used in this study were grown at the experimental base of the College of Horticulture, Sichuan Agricultural University, Chengdu, China (latitude 30.71°N, longitude 103.87°E). Seven tissues (root tips, shoot tips, mature leaves, flower buds, flowers in full bloom, young fruit and mature fruit) from three independent mature trees, and three tissues (root tips, shoot tips and leaves) from three tree tomato seedlings, were sampled and mixed afterward. Tree tomato seedlings were obtained by incubating seeds, randomly collected from the above mature trees, at 22 °C and 95% relative humidity.

RNA extraction
Total RNA was extracted using a PureLink RNA Mini Kit (Invitrogen Inc., Carlsbad, CA, USA), followed by DNase digestion and RNA purification using an on-column PureLink DNase Kit (Invitrogen Inc.) according to the manufacturer's instructions. RNA degradation and potential contamination were monitored on 1% agarose gels. The purity of RNA samples was determined using a NanoPhotometer Spectrophotometer (Implen, Westlake Village, CA, USA). RNA concentration was measured using a Qubit 2.0 Fluorometer (Invitrogen Inc.). RNA integrity was checked using an RNA Nano 6000 Assay Kit on a BioAnalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA) before sequencing library preparation.

Construction of Iso-Seq complementary DNA (cDNA) library and PacBio sequencing
Construction of the Iso-Seq cDNA library and PacBio sequencing were performed at Novogene Co., Ltd (Beijing, China). The mRNA was enriched from 4.0 μg of total RNA using oligo-dT magnetic beads and reverse transcribed into cDNA using the SMARTer PCR cDNA Synthesis Kit (Clontech, now Takara, http://www.takarabio.com). The size-selected cDNA library was constructed according to the BluePippin Size Selection System protocol as described by PacBio (PN 100-092-800-03) and sequenced on the PacBio Sequel platform.

Reads processing and error correction
Raw data acquired after SMRT sequencing were processed using SMRT Link v5.0 software. CCS reads were generated from subread BAM files, and full-length non-chimeric (FLNC) reads and non-full-length reads were distinguished by the simultaneous presence of the poly-A tail signal and the 5' and 3' cDNA primers in reads of insert (ROIs). Short reads were discarded. Subsequently, the FLNC sequences were clustered at the isoform level with the iterative clustering and error correction (ICE) software to generate consensus isoforms [29]. The non-full-length CCSs were polished with the Quiver algorithm. Finally, isoforms with a minimum Quiver accuracy of 0.99 were considered high-quality isoforms and used for further analysis.
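The full-length criterion described above can be sketched as a simple classifier. The Python example below is a simplified illustration, not the SMRT Link implementation: it assumes reads are already oriented in the sense direction, uses exact string matching instead of a tolerant alignment, treats any internal primer copy as evidence of a chimera, and relies on made-up primer sequences, a hypothetical function name and an assumed minimum poly-A length of 20 bases.

```python
import re

# Made-up primer sequences for illustration; not the actual cDNA primers.
FIVE_PRIMER = "ACGTACGTAC"
THREE_PRIMER = "GTACTCTGCG"


def classify_read(seq, five_primer=FIVE_PRIMER, three_primer=THREE_PRIMER, min_polya=20):
    """Classify a read as 'FLNC', 'chimeric' or 'non-full-length'."""
    seq = seq.upper()
    # A full-length read must carry both primers at its ends.
    if not (seq.startswith(five_primer) and seq.endswith(three_primer)):
        return "non-full-length"
    insert = seq[len(five_primer):len(seq) - len(three_primer)]
    # ...and a poly-A tail immediately upstream of the 3' primer.
    if not re.search(r"A{%d,}$" % min_polya, insert):
        return "non-full-length"
    # An extra primer copy inside the insert suggests two fused cDNA molecules.
    if five_primer in insert or three_primer in insert:
        return "chimeric"
    return "FLNC"


body = "GATTACA" * 5
full_length = FIVE_PRIMER + body + "A" * 25 + THREE_PRIMER
print(classify_read(full_length))                # FLNC
print(classify_read(FIVE_PRIMER + body))         # non-full-length
print(classify_read(full_length + full_length))  # chimeric
```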

Transcript analysis
Potential CDS regions within transcripts were predicted with ANGEL software, a long-read implementation of ANGLE [34]. TFs were predicted from putative protein sequences with iTAK software [35]. TF sequences were downloaded from the Plant Transcription Factor Database (v4.0) and used for blastp searches with the default cutoff (E-value < 0.05) [36]. LncRNAs were first screened with the Coding-Non-Coding Index (CNCI) using default parameters [37] and with the Coding Potential Calculator against the NCBI eukaryotic protein database (E-value < 1e-10) [38]. Then, each transcript was translated in three possible frames, and Pfam Scan with the default parameters -E 0.001 --domE 0.001 was used to determine whether a known protein family domain was present. SSRs within the transcriptome were identified with the MIcroSAtellite (MISA) program (http://pgrc.ipk-gatersleben.de/misa/), which allows the identification and localization of both perfect and compound microsatellites. For PCR amplification, genomic DNA was extracted from fresh leaves of tree tomato using a DNeasy Plant Mini Kit (Qiagen, Valencia, CA, USA) according to the manufacturer's protocol. The 30 primer pairs used for PCR amplification are listed in Additional file 12.

Availability of data and materials
The data generated or analysed during this study are included in this published article and its supplementary information files.

Authors' contributions
LL designed this study and contributed to the concept of this paper. HD performed the bioinformatic analysis and wrote the paper. ML, XL, and QD performed the transcriptome analysis. ZW, JW, DL, XH, and XW performed the SSR experiment. All authors have read and approved the manuscript.