PacBio sequencing and error correction
The full-length transcriptome of mature and immature G. luofuense seeds comprised a total of 12,869,707 subreads (19.81 Gb) with an average length of 1,540 bp (Table S1, Fig. S1A). After self-correction with an accuracy value of ROIs > 0.8, 384,042 circular consensus sequences (CCSs) with an average length of 1,919 bp were generated, of which full-length, non-chimeric (FLNC) reads accounted for 81% (312,444, Fig. S1B). The FLNC reads were clustered using the ICE algorithm, and non-FLNC reads were polished. The FLNC reads and polished non-FLNC reads were merged, yielding 165,883 polished consensus isoforms ranging from 167 to 13,816 bp in length (Fig. S1C). The 165,883 polished consensus reads were further corrected using Illumina sequencing data with LoRDEC software. The mean length and N50 and N95 values changed slightly after correction (Table S2).
Genome mapping and novel gene detection
The corrected polished consensus reads were mapped to the G. luofuense reference genome using GMAP. 162,887 (98.19%) reads were mapped to the reference (Fig. S1D); of these, 63,049 uniquely mapped reads (38.01% of total mapped reads) were mapped to the positive strand of the reference genome, 60,292 uniquely mapped (36.35%) reads were mapped to the negative strand, 39,546 (23.84%) were multiply mapped reads, and 2996 (1.81%) reads were unmapped. The mapping density on each scaffold of G. luofuense genome was shown in Fig. S1E. Over 98% of the mapped reads showed similarity to the reference genome, and coverage values of the mapped reads were all above 80% (Fig. S1F). After deleting the unmapped and redundant reads, 41,151 reads remained, of which 7,899 were novel isoforms of known genes and 5,726 reads were from novel genes.
Annotation and classification of novel genes
The 5,726 novel genes were annotated by searching against six databases—NCBI NR, KEGG, GO, SwissProt, KOG, and Pfam. A total of 4,099 novel genes were annotated, of which 2,588 were annotated in the NR database (Table S3). Five species—Picea sitchensis (649 genes), Amborella trichopoda (116), Vitis vinifera (88), Elaeis guineensis (80), and Nelumbo nucifera (61)—produced the largest numbers of hits to the G. luofuense novel genes (Fig. S2A). 2,487 novel genes were annotated with KEGG pathways (Table S3), and the most enriched pathways were “signal transduction” (169 genes), “carbohydrate metabolism” (83 genes), and “translation” (69 genes, Fig. S2B). GO analysis classified 2,069 genes into three categories: “biological process”, “cellular components” and “molecular functions” (Fig. S2C). Novel genes classified in the biological process category were mainly annotated with the terms “metabolic process” (1,052), “cellular process” (1,037), and “single-organism process” (581). Novel genes classified in the cellular component category were mainly annotated with the terms “cell” (519), “cell part” (519), and “membrane” (367). Novel genes classified in the molecular function category were mainly annotated with the terms “binding” (1,192), “catalytic activity” (942), and “transporter activity” (132). 1,930 genes, 1,315 genes and 2,069 genes were annotated with the Swiss Prot, KOG and Pfam databases, respectively (Table S3).
AS and APA analysis
After mapping reads to the reference genome of G. luofuense, a total of 9,061 AS events were detected. These could be classified into seven types (Fig. 2A): retained intron (2,713, 29.94%), alternative 3′ splice site (2,468, 27.24%), alternative 5′ splice site (1,769, 19.52%), skipped exon (1305, 14.40%), alternative first exon (542, 5.98%), alternative last exon (217, 2.39%), and mutually exclusive exon (47, 0.52%).
To verify the AS events identified, expression of two genes, i.e. Tns00138667g03 and TnS000973269g04 were validated by qRT-PCR (Fig. 2B). In addition, a total of 8,512 genes from G. luofuense seeds had at least one supported poly(A) site. Of these, 3,654 (42.93%) had a single poly(A) site, and 640 (7.52%) had at least five poly(A) sites (Fig. 2C). The largest number of poly(A) sites—21—was found in the gene TnS000670009g01.
Identification of TFs and lncRNAs
A total of 2,160 transcription factors (TFs) from 86 gene families were detected using iTAK. The largest fraction of identified TFs came from the C3H (5.6%), bHLH (4.53%), and MYB-related (4.26%) families (Fig. 3A). In addition, 11,885, 5,958, 11,294 and 11,037 lncRNAs were identified using the CNCI, CPC, PFAM and PLEK methods, respectively. A total of 3,551 lncRNAs were identified by all four methods (Fig. 3B), with lengths ranging from 200 to 7,840 bp. The lncRNAs were further classified into four types (Fig. 3C): 1,422 (40.05%) sense intronic lncRNA, 1,149 (32.36%) long intergenic non-coding RNA, 547 (15.40%) antisense lncRNA, and 433 sense overlapping lncRNA (12.19%). The length distribution of the identified lncRNAs was considerably narrower than that of mRNAs predicted from the G. luofuense genome (Figs. 3D). Moreover, most identified lncRNAs had five or fewer exons, whereas mRNAs predicted from the reference genome tended to have larger numbers of exons (Fig. 3E).
Illumina sequencing of seed samples at two developmental stages
To explore gene expression patterns during seed development of G. luofuense, 306,900,384 clean Illumina reads (46.04 Gb of raw data) with Q30 values from 93.54% to 94.07% were generated from three immature seed samples (IS) and three mature seed samples (MS) (Table S4). After the deletion of adaptors and low-quality reads, the average GC content of the six samples was 47.08%. PCA analysis showed that gene expression was highly correlated among the replicate samples of immature and mature seeds (correlation efficiency value=0.95, cumulative proportion of variation explained by PC1 and PC2 = 78.7%) (Fig. 4A). After mapping to the G. luofuense genome, the mapping ratios of IS (average 89.44%, Table S5) were found to be significantly larger than those of MS (average 84.46%, Student’s t-test p-value=0.003). RNA-seq analysis of the two developmental stages yielded a total of 23,977 genes (19,010 in IS and 20,737 in MS), of which 2,970 were identified as novel genes.
Enrichment analysis of DEGs and qRT-PCR validation
A total of 14,323 differentially expressed genes (DEGs) were identified between IS (control group) and MS: we found 7,891 upregulated genes and 6,432 genes downregulated (Fig. 4B) from IS to MS. The DEGs were also annotated with the three categories of GO terms, and multiple GO terms in the “biological process” category were significantly enriched with regard to Z-scores and adjusted p-values (Fig. 4C). The top five enriched GO terms were “single-organism cellular process” (GO:0044763), “single-organism process” (GO:0044699), “metabolic process” (GO:0008152), “cellular metabolic process” (GO:0044237), and “Organic substance metabolic process” (GO:0071704). The DEGs were also enriched in multiple KEGG pathways with reference to Arabidopsis thaliana. The top five enriched KEGG pathways were “metabolic pathways” (KEGG ID: ath01100, 1229 genes), “biosynthesis of secondary metabolites” (ath01110, 844 genes), “carbon metabolism” (ath01200, 179 genes), “ribosome” (ath03010, 164 genes), and “starch and sucrose metabolism” (ath00500, 154 genes) (Fig. 4D). qRT-PCR was used to validate the relative expression of 14 genes of interest: four MADS-box genes, four Aux/IAA genes, four bHLH genes, and two MYB genes. The relative expression of the 14 genes at the two seed developmental stages is presented in Fig. 4E.