Full-Length Transcriptome Sequencing and EST-SSR marker development of tiger lily (Lilium lancifolium Thunb.)

Mingwei Sun (  sunmingweihappy@163.com ) Lianyungang Academy of Agricultural Sciences Yilian Zhao Nanjing Agricultural University Xiaobin Shao Lianyungang academy of agricultural science Jintao Ge Lianyungang academy of agricultural science Xueyan Tang Lianyungang academy of agricultural science Pengbo Zhu Lianyungang academy of agricultural science Jiangying Wang Lianyungang academy of agricultural science Tongli Zhao Lianyungang Academy of Agricultural Sciences


Introduction
Owning different corolla shape and color, tiger lily was world widely welcome for its beautiful appearance. On the other hand, tiger lily is also considered as the healthy food and taken as one of the medical plants published by the Chinese Ministry of Health for the reason that it contains various of nutrients and bioactive compounds such as phenolic glycosides, pectin, steroidal saponins, alkaloids, and kinds of vitamins. As a triploid plant, tiger lily owns an enormous genome about 32.8 pg − 47.9 pg (Leitch et al. 2007). Information of the whole genome and transcriptome would make great contribution for tiger lily gene model study, but the large and not unrevealed genome and high heterozygosity make it di cult for gene functional study.
Next-generation sequencing platforms have provided us active transcriptional patterns in transcriptomic analysis over the past decade. Despite gene expression could be accurately quanti ed with the massive resequencing throughput, some important transcript information covering long lengths, including AS (alternative splicing), SSR (ssimple sequence repeats), and lncRNA (long non-coding RNA), would be lost for the reason that RNA or cDNA need to be fragmented during sample preparation and only information with short-reads could be obtained with the next-generation sequencing platforms. With the research requirement, the long-read sequencing platforms that could provide entire cDNA molecules are available, including Paci c Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) (Rhoads, Au 2015;Bayega et al. 2018). Because the full length of transcripts could be captured by PacBio, the accuracy of gene annotation, isoform identi cation, and lncRNA discovery was improved much compared with those from the next-generation sequencing platforms (Abdel-Ghany et al. 2016;Gonzalez-Garay 2016). Thus, PacBio single-molecule real-time isoform sequencing has been widely used for transcriptome pro ling in many plants and animals (Dóra et al. 2018;Dahe et al. 2019).
In this study, the PacBio SMRT technology was adopted to carried out the full-length cDNA library sequencing of tiger lily. To obtain a transcriptome with high accuracy, we combined the RNA-seq from Illumina HiSeq plat to achieve comprehensive full-length transcriptomes for isoform transcripts identi cation and quanti cation. This dataset could provide us rich information about full-length cDNA sequences that can help broaden our understanding of tiger lily. The function annotation of these fulllength transcripts and detection of lncRNAs (Long noncoding RNAs) would help us to further understand the gene functions. The developed SSR primers in this study would be useful tools in gene mapping, a nity and systematic classi cation, and make contribution for tiger lily breeding. Therefore, These results would provide e cient tools for gene study and breeding.

Materials And Methods
Plant material, RNA was extraction, cDNA library preparation, and PacBio ISO-Seq Tiger lily (Lilium lancifolium Thunb.) was grown at Lianyungang Academy of Agricultural Sciences (Lianyungang, China). Samples of the scales from three individual plants were collected and frozen in liquid nitrogen and stored at −70 °C for RNA extraction. The Iso-Seq library was constructed using the Isoform Sequencing protocol (Iso-Seq) with the Clontech SMARTer PCR cDNA Synthesis Kit and the BluePippin Size Selection System protocol as described by Paci c Biosciences (PN 100-092-800-03).
PacBio ISO-Seq analysis Circular consensus sequence (CCS) was generated after the raw reads were processed with the SMRTlink 7.0 software (https://www.pacb.com/support/software-downloads/). Then the CCSs were classi ed into full length and non-full length reads according to the rule that whether the 5′primer, 3′primer and poly A tail were existed. Clustering was completed to predict consensus isoforms from full-length transcripts using the Interactive Clustering and Error Correction (ICE) algorithm and the consensus isoforms polishing with non-full-length transcripts was carried out with software Quiver v1.1.0 (Paci c Biosciences of California Inc.; Menlo Park, CA, USA) (Chin et al. 2013). The obtained sequences without redundancy and extention on either end were de ned as transcripts. To ensure the sequence accuracy, PacBio reads were corrected using the Illumina RNAseq data with the software LoRDEC (Leena et al. 2014).After all redundancy in corrected consensus reads was removed by CD-HIT (Li, Godzik 2006), the nal consensus isoforms for the subsequent analysis were obtained.

Full-length transcripts from the sepals of Lilium lancifolium Thunb
To obtain a comprehensive full-length transcriptome data, the extracted RNA from ve samples were pooled together and sequenced using the PacBio Sequel platform. A total of 54.86 G subread bases was obtained by two SMRT cells from the constructed PacBio library, yielding 815,624 CCS (Circular Consensus Sequence) reads with mean length of 1,295 bp (Table 1). Of the total reads, 461,744 (56.62%) were full-length reads containing the 5' primer, 3' primer and the poly (A) tail and 459,322 (56.32%) were FLNC (full length readsnon-chimeric) sequences (Table 1). With ICE Quiver and Arrow polishing algorithms, 38,707 polished full-length consensus transcripts with a mean length of 1574 bp were produced (Supplementary material 1). To improve consensus accuracy, FLNC sequences were corrected with Illumina short-read RNA-seq reads with software LoRDEC. As a result, 8,596 nucleotides in the 38,707 transcripts were revised, and the mean length is still 1574 bp (Table 1). We found that suquence length of most transcripts (98.3%) ranges from 0.2 to 4 kb (Supplementary material 1). With software CD-HIT, the redundant and similar sequences for one unigene was removed. Finally, a total of 15,608 unigenes were obtained with the threshold of 95% of software CD-HIT, and mean length of the unigenes turns to be 1627 (Supplementary material 2).  (Fig. 1) . Gene sequence character of the TFs would help us much in gene functional studie and serve as an excellent starting point for rapid gene discover.

GO classi cation
GO analysis about the unigenes was carried out to further classify the gene functions. GO enrichment illustrated that 10,706 unigenes were categorized into 52 functional groups, which could be divided into three categories: biological process, cellular component and molecular function. In the biological process group, metabolic process accounts for the most unigenes (5188), followed by cellular process (4676) and single-organism process (3044). The largest processes in molecular function group are binding (6393), catalytic activity (4851), and transporter activity (473) in turn. Cell and cell part owns the same number (2258) of Cellular Component processes, followed by organelle (1594), macromolecular complex (1282), and membrane (1136)

Discussion
As an e cient and reliable method for full-length transcripts screening, SMRT sequencing technology has made great contribution for whole-transcriptome pro ling studies, especially for some plant species without reference genome sequences. In this study, the tiger lily whole-transcriptome equencing was carried out with SMRT sequencing technology based on the PacBio Sequel platform. To ensure the sequencing accuracy, a total of 63.85 Gb sequencing data was generated, including 815,624 CCS and 459,322 FLNC reads. Percentage of FLNC reads in all CCS reads was 56.31%, which is similar with the result that obtained in alfalfa (Chao et al. 2019) and strawberry (Li et al. 2017). Although gene expression could be accurately quanti ed with next-generation sequencing platforms, the long nucleotide sequences usually could not be captured. Transcripts obtained by SMRT sequencing technology are commonly longer than that from next-generation sequencing platforms, where one read usually represents a fulllength transcript (Sharon et al. 2013). In this study, the average length of tiger lily transcripts by SMRT was 72,076 bp, which is much longer than those obtained by Illumina sequencing technology from previous research (114.59 bp − 297.76 bp) (Li et al. 2014;Tian, Jihua, Xie 2019). Furthermore, we found that 97.45% of all transcripts were longer than 2,000 bp in this SMRT sequencing result, indicating that PacBio SMRT sequencing technology is an e cient and appropriate approach for transcript sequences information study, especially for the long transcript sequences.
Transcription factors are a kind of important regulatory factors that affecting gene expression. With lncRNA, microRNA, and methylation together, transcription factors play a regulatory role in ne-tuning gene expression in cell differentiation and growing development, especially in response to abiotic stress and plant disease (Feng et al. 2020;Gong et al. 2020) (Arnold et al. 2020). An increasing number of studies were focused on regulatory network in plants in the recent research. However, no lncRNAs from tiger lily have been reported until now. In this study, 768 transcription factors and 6852 lncRNAs were identi ed from the tiger lily transcripts, which will be useful for further research in tiger lily.
For the reason the tiger lily genome is not avaliable now, the SSR primers would be useful tools in gene mapping, genetic diversity, comparative genomics, and gene functional study, especially for the breeding work. The long transcript sequence information made it easier for high-quality SSR maker development. Based on the polished gene sequence, 3319 SSRs were detected from 2968 SSR-containing unigenes. Among the six types of SSR repeat motifs (motif with 1 to 6 nucleotides), mononucleotide repeats account for the biggest proportion, accounting for a proportion of about 41.32%, followed by di-and trinucleotide motifs. In the present study, the most frequent mono-, di-, and tri-nucleotide motifs were A/T, AT/AT and CCG/CGG, respectively. Of all the nucleotide repeats, the most abundant repeat motif is A/T, which accounting about for 39.59% of all the motifs and is thought to be frequent in the genomic sequences of plants (Ulf, Hans, Leif 1993), followed by AT/AT (15.94%) and AG/CT (14.41%) belonging to the di-nucleotide motifs. Of the tri-nucleotide motifs, CCG/CGG accounts for a proportion of 9.01% of the total motifs, followed by AGG/CCT (3.93%) and AGC/CTG (3.65%). According to the SSR marker information, 128 pairs of primers were developed to test the maker reliability. As a result, when the cDNA, which obtained through reverse transcription of RNA, was taken as PCR ampli cation temple, all the 128 pairs of primers could successfully ampli ed PCR products. But when the whole genome DNA was taken as template, only 105 (82.03%) primer pairs could successfully ampli ed PCR products. The targeting region with long-length intron and primer nucleotide located on two neighboring exons may be the reasons for the above failed PCR ampli cation of 23 pairs of primers. These results suggested that the developed SSR markers based on PacBio SMRT sequencing platform is an effective and helpful tools for genetic diversity, gene mapping, and other types of genetic studies in tiger lily, and would make great contribution in the tiger lily breeding work.
In conclusion, the full-length transcriptome of tiger lily based on the PacBio SMRT sequencing technology was analyzed. This study provides the rst the third-generation long-read transcriptome sequencing of tiger lily. Based on the obtained transcriptome data, 38,707 unigenes, 6852 lncRNAs, and 768 TF members were identi ed. In addition, 3,319 SSRs were detected, and 105 from 128 primer pairs could successfully ampli ed PCR products. The obtained transcriptome data and developed SSR markers may facilitate further genetic studies and help to breeding new varieties on tiger lily.

Figure 5
The largest processes in molecular function group are binding (6393), catalytic activity (4851), and transporter activity (473) in turn. Cell and cell part owns the same number (2258) of Cellular Component processes, followed by organelle (1594), macromolecular complex (1282), and membrane (1136) Supplementary Files