InTransBo: An Integrative Transcript Library to Enable Genome-Free Systematic Exploration of Bougainvillea

Members of the genus Bougainvillea are rich sources of natural dyes, pigments, and traditional medicines. They are also commonly used as ornamentals in roadside landscape construction. However, the horticultural development of Bougainvillea flowers with extended growth periods and coloration is not always feasible. One reason is the limited molecular knowledge and the lack of genomic information for Bougainvillea. Here, we compiled an expressed transcript sequence library for Bougainvillea by integrating 20 Illumina-sequenced RNA transcriptomes. The library consisted of 97,623 distinct transcripts. Of these, 47,006 were protein-coding, 31,109 were lncRNA, and 19,508 were unannotated. We also confirmed that the library can serve as an alternative genomic reference for accurate transcriptome assembly and that its performance was substantially better than that of the de novo method. We also curated the Integrative Transcript Library database for Bougainvillea, known as InTransBo (http://www.bio-add.org/InTransBo/index.jsp). To the best of our knowledge, the present study provides the first large-scale genomic resource for Bougainvillea. Overall, the library helps fill the genomic gap and elucidate the transcriptional nature of Bougainvillea. It may also advance progress in the precise regulation of flowering in horticulture. The same strategy can be readily applied toward the systematic exploration of other plant species lacking complete genomic information.

Analyses of the metabolites (Sangthong, Suksabye, & Thiravetyan, 2016), natural dyes and pigments (Sangthong et al., 2016), medicinal uses, and species diversity of Bougainvillea have been conducted. In contrast, there have been few molecular studies of this genus. To date, no Bougainvillea genome has been sequenced, and this is unlikely to change in the next few years because of technical and economic difficulties. Limited omics research has been performed to elucidate the molecular basis of the aforementioned properties of Bougainvillea, especially at the systematic level. As no genome has been clarified for Bougainvillea, current molecular research on this plant often relies on comparison with, or reference to, Arabidopsis thaliana. Consequently, gene behavior in Bougainvillea remains uncertain, ambiguous, or even misinterpreted because of the genomic differences between the two organisms. Therefore, an alternative genomic resource is required that can complement or fill the no-genome gap. Here, our objectives were to use multiple Illumina RNA sequencing (RNA-seq) transcriptomes determined for various Bougainvillea tissues to generate a sequence library consisting of all expressed transcripts. This library could serve as an alternative genomic reference for the molecular exploration of Bougainvillea and is presented as an online interactive database.

Sample Collection
Tissue samples of pot-grown Bougainvillea glabra L. Choisy (Tonganhong) in its normal flowering period were collected in October 2018 in Xiamen City, P.R. China. Samples comprised thorns, flower thorns, small buds, bracteoles, leaf sprouts, flower sprouts, lobules, stems under buds, stems under bracteoles, and flowers. Two biological replicates per tissue type were used (Table 1). The tissues were excised either from different parts of the same plant or from different plants, washed with distilled water, and briefly air-dried in a clean environment. The tissues were mixed, randomly divided into two replicates, packed in silver paper, frozen in liquid nitrogen, and stored in Chen's lab at the School of Life and Science, Xiamen University.

RNA Library Construction and Deep Sequencing
The sample mixtures were lysed with 1 mL TRIzol reagent (Invitrogen, Carlsbad, CA, USA). Total RNA was prepared according to the manufacturer's instructions. RNA purity was evaluated with a NanoPhotometer® spectrophotometer (Implen USA, Westlake Village, CA, USA). The RNA concentrations were measured with a Qubit® RNA assay kit and a Qubit® 2.0 fluorometer (Life Technologies, Waltham, MA, USA). RNA integrity was evaluated with the RNA Nano 6000 assay kit in an Agilent Bioanalyzer 2100 system (Agilent Technologies, Santa Clara, CA, USA). RNA samples passing the quality control test were stored at -20 °C until later use.
The RNA library was constructed by Novogene Co. Ltd. (Beijing, China) following the standard operating procedure. Three micrograms of RNA per sample were used as the input material for library preparation. Polyadenylated (poly(A)) RNA was purified from total RNA with poly-T oligo-attached magnetic beads. RNA sequencing was performed on an Illumina HiSeq 4000 system (Illumina, San Diego, CA, USA) using the 125-bp, strand-specific, paired-end mode.

RNA-Seq Data Preprocessing
Before proceeding to the transcriptome assembly, the RNA-seq raw data were filtered with Trimmomatic (Bolger, Lohse, & Usadel, 2014) to exclude low-quality reads, reads containing adaptors, and reads with N content > 10%. Reads were considered low quality when bases with sQ ≤ 5 accounted for > 50% of the read.
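The two numeric thresholds above can be illustrated with a minimal Python sketch. It is not Trimmomatic's algorithm and it does not perform adaptor detection. It assumes that the sQ criterion means a read is discarded when more than 50% of its bases have a quality score ≤ 5, and that qualities are Phred+33 encoded. The function name `keep_read` and its parameters are hypothetical.

```python
def keep_read(seq: str, qual: str,
              max_n_frac: float = 0.10,
              low_q: int = 5,
              max_low_q_frac: float = 0.50,
              phred_offset: int = 33) -> bool:
    """Return True if a read passes the N-content and low-quality filters.

    Assumptions (not taken from the paper): Phred+33 quality encoding,
    and thresholds applied per read as strict "greater than" cut-offs.
    """
    if not seq or len(seq) != len(qual):
        return False  # empty or malformed record

    # Criterion 1: discard reads whose N content exceeds 10%.
    n_frac = seq.upper().count("N") / len(seq)
    if n_frac > max_n_frac:
        return False

    # Criterion 2: discard reads in which more than 50% of the bases
    # have a quality score of 5 or lower.
    low_bases = sum(1 for ch in qual if ord(ch) - phred_offset <= low_q)
    if low_bases / len(qual) > max_low_q_frac:
        return False

    return True
```

For paired-end data, a pair would typically be kept only if both mates pass, although the source does not state how pairs were handled.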

Library-Based Transcriptome Assembly and Performance Comparison
To make a genomic reference for the transcriptome assembly, the library was annotated and preformatted as a gff file with Annoscript (version 1.

Database Construction
For user convenience, the transcript library was presented as an online interactive database called InTransBo, which was constructed on the Linux-Apache-JSP platform. MySQL software was used to manage data storage, access, and maintenance. Efficient and friendly user interfaces were designed with JavaScript for interactive transcript search and retrieval.

Results and Discussion
Of the 97,623 transcripts in the library, 47,006 were protein-coding, 31,109 were lncRNA, and the remaining 19,508 were not annotated (Fig. 1a). We performed a gene conservation analysis by mapping the expressed Bougainvillea transcripts against the Arabidopsis thaliana genome. We found that 44,344 transcripts had homologs in the Arabidopsis thaliana genome, 24,870 transcripts mapped to intergenic regions, and 28,409 were unmapped (Fig. 1). The unmapped transcripts were either Bougainvillea-specific or wrongly assembled. To verify sequence completeness, we selected ten conserved plant genes and compared them with their homologs in Arabidopsis thaliana (Fig. 1b). Of the ten Bougainvillea genes, seven had both sequence identity and coverage > 90% relative to their corresponding Arabidopsis thaliana homologs. The remaining three genes failed at least one of the two criteria. These results support the sequence completeness of the transcript library for most of the conserved genes tested.
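The completeness criterion used above (identity and coverage both > 90%) can be expressed as a short Python sketch. The source does not define how identity and coverage were computed, so this sketch assumes identity is the percentage of identical positions within the alignment and coverage is the aligned length as a percentage of the homolog length. The function name and arguments are hypothetical.

```python
def passes_completeness(n_identical: int, aligned_len: int, homolog_len: int,
                        min_identity: float = 90.0,
                        min_coverage: float = 90.0) -> bool:
    """Check whether a transcript meets both completeness criteria.

    Assumed definitions (not confirmed by the paper):
      identity = identical positions / aligned length * 100
      coverage = aligned length / homolog length * 100
    """
    if aligned_len <= 0 or homolog_len <= 0:
        return False
    identity = 100.0 * n_identical / aligned_len
    coverage = 100.0 * aligned_len / homolog_len
    # Both criteria must be met; failing either one fails the gene.
    return identity > min_identity and coverage > min_coverage
```

Under these definitions, a transcript that aligns over 1,000 positions of a 1,050-position homolog with 950 identical positions passes, whereas the same alignment with only 800 identical positions fails on identity.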

Integrative Transcript Library Database for Bougainvillea
The InTransBo database is freely accessible at http://www.bio-add.org/InTransBo/ or at its mirror site http://bioinf.xmu.edu.cn/InTransBo/. InTransBo uses keyword search and sequence BLAST to retrieve data interactively. The keyword search function allows both exact and fuzzy transcript searches via the input of complete or partial gene symbols, gene names, protein names, and abbreviated protein names (Fig. 2a). The search engine returns the hit terms in alphabetical order along with the gene symbols, protein names, transcript lengths, and transcript IDs. Clicking on a transcript ID redirects to the transcript information page, which lists transcript annotations, sequence information, and transcript expression profiles in that order (Fig. 3).
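The difference between exact and fuzzy keyword search can be sketched in Python as follows. This is only an in-memory illustration: the actual InTransBo search runs on a MySQL/JSP back end whose queries are not described in the source. Fuzzy matching is assumed here to mean case-insensitive substring matching, and the `Transcript` record and `search` function are hypothetical.

```python
from typing import List, NamedTuple


class Transcript(NamedTuple):
    # Hypothetical record layout mirroring the fields shown in search results.
    transcript_id: str
    gene_symbol: str
    protein_name: str
    length: int


def search(records: List[Transcript], keyword: str,
           fuzzy: bool = True) -> List[Transcript]:
    """Return matching transcripts sorted alphabetically by gene symbol.

    fuzzy=True  -> keyword may be any part of the symbol or protein name
    fuzzy=False -> keyword must equal the whole field (case-insensitive)
    """
    key = keyword.strip().lower()
    if not key:
        return []

    def matches(field: str) -> bool:
        value = field.lower()
        return key in value if fuzzy else key == value

    hits = [r for r in records
            if matches(r.gene_symbol) or matches(r.protein_name)]
    return sorted(hits, key=lambda r: (r.gene_symbol.lower(), r.transcript_id))
```

With fuzzy matching, a partial keyword such as "myb" would return every record whose gene symbol or protein name contains that string, whereas the exact mode returns a record only when the whole field equals the keyword.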
For newly identified and unnamed sequences, InTransBo provides data access via BLAST as an alternative method. The database supports BLASTn and tBLASTn for querying with nucleic acid and protein sequences, respectively. The input may be either a sequence typed in text form or an uploaded file in FASTA format (Fig. 2b). The embedded BLAST engine returns all hit sequences that meet the default expectation threshold (E value = 1e-10), sorted by hit score (Fig. 2b). For each hit, alignment details may be obtained using the "Alignment" hyperlink. Detailed information on the hit transcript may be acquired via the transcript ID hyperlink. The transcript library is free to download; however, user registration is required. The library is annotated and formatted as a gff file so that it may be used as an alternative reference to the genome for transcriptome assembly and other molecular applications.
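The hit-filtering behavior described above (keep hits at or below the E-value threshold, then sort by score) can be illustrated with a minimal Python sketch. It assumes BLAST results in the standard 12-column tabular format (`-outfmt 6`), where column 11 is the E value and column 12 is the bit score. How InTransBo handles hits internally is not described in the source, and the function name `filter_hits` is hypothetical.

```python
from typing import List, Tuple

Hit = Tuple[str, str, float, float]  # (query ID, subject ID, E value, bit score)


def filter_hits(tabular_text: str, max_evalue: float = 1e-10) -> List[Hit]:
    """Keep hits with E value <= max_evalue, sorted by descending bit score.

    Expects the standard 12-column BLAST tabular layout:
    qseqid sseqid pident length mismatch gapopen qstart qend
    sstart send evalue bitscore
    """
    hits: List[Hit] = []
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) < 12:
            continue  # skip malformed rows
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue <= max_evalue:
            hits.append((fields[0], fields[1], evalue, bitscore))
    # Highest-scoring hits first.
    hits.sort(key=lambda h: h[3], reverse=True)
    return hits
```

A stricter or looser threshold can be applied by changing `max_evalue`; the default mirrors the 1e-10 cut-off stated in the text.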

Library-based Transcriptome Assembly
The library was used to demonstrate the reference-based assembly of 20 RNA-seq transcriptomes according to a typical genome-based method. The gene expression profiles in different tissues can be accessed in the InTransBo database by specifying gene symbols. We compared transcriptome assembly performance between the library-based method and the de novo Trinity method. The comparison was performed on two external Bougainvillea RNA-seq datasets (Table 1) generated by different research groups under various experimental conditions and with differing sequencing quality. Assembly performance was evaluated based on the read mapping ratio, number of unique genes and bridges, sequence completeness, fragmentation ratio, and estimated assembly score. The results are illustrated in Fig. 4. Overall, the library-based method outperformed the de novo Trinity method: it had a higher N50, a lower fragmentation ratio, and a higher assembly score. The Trinity-assembled transcriptomes had relatively higher read mapping ratios and unigene numbers. However, the Trinity method also retained numerous fragmentary sequences that could not be correctly assembled.
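For readers unfamiliar with the N50 statistic mentioned above, the following minimal Python sketch shows the standard definition: the length of the shortest transcript in the smallest set of longest transcripts that together cover at least half of the total assembled length. It is a generic illustration, not the evaluation pipeline used in this study, and the function name `n50` is hypothetical.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Compute the N50 of a collection of sequence lengths.

    Sequences are sorted from longest to shortest and accumulated until
    the running total reaches at least half of the overall length; the
    length of the sequence that crosses this point is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    total = sum(ordered)
    if total == 0:
        return 0  # no sequences, or only zero-length entries
    running = 0
    for length in ordered:
        running += length
        if running * 2 >= total:
            return length
    return 0  # not reached for non-empty positive input
```

Because N50 rewards long contiguous sequences, an assembly containing many short fragments tends to score lower, which is consistent with the fragmentation issue noted for the de novo assemblies.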
Bougainvillea spp. require systematic transcriptome research but lack genome support. Whole-genome sequencing is not yet a viable option because it is costly and technically impracticable. Here, we used the computational method TransIntegrator to construct an integrative transcript library for Bougainvillea based on 20 heterogeneous RNA-seq datasets. We curated the InTransBo database from the transcript library for interactive data retrieval. Using this library, we demonstrated reference-based transcriptome assembly for the 20 RNA-seq transcriptomes. To the best of our knowledge, this study is the first to investigate Bougainvillea at the molecular level in the absence of a genome, and the InTransBo database could be the first genomic sequence source for Bougainvillea. A subsequent analysis revealed that the library-based method outperformed the de novo Trinity method in terms of transcriptome assembly. Therefore, the library may serve as a reliable alternative reference that stands in for the genome and enables the molecular exploration of Bougainvillea. The present study showed that the InTransBo database mitigates the genome constraint on the systematic investigation of Bougainvillea. We believe that it also helps elucidate the molecular biology of Bougainvillea and facilitates precise flower regulation in horticultural practices. The same strategy could be readily applied toward the systematic exploration of other plant species lacking adequate genomic data.

CONFLICT OF INTEREST
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.