Construction and annotation of the FLNC transcriptome database for A. cristatum
After quality control, a total of 11,966,252 subreads, namely, 6,447,695 and 5,518,557 subreads from two different cells, were successfully generated (Table 1). A total of 504,811 representative CCSs for ZMWs were obtained. A total of 405,302 CCSs were classified as FL transcripts based on the presence of 5’ primers, 3’ primers and poly(A) tails. After demultiplexing, refining, clustering and polishing of FL transcripts were performed, a total of 44,372 FLNC transcripts with a maximum length of 9,468 bp, a N50 of 3,572 bp and average FL coverage of 5.1 were generated (Table 1). In addition, the proportion of incomplete transcripts of FLNC transcripts was only 6.30% in BUSCO analysis (Table 2). As expected, the PacBio FLNC transcripts were generally longer and more complete than the transcripts assembled via the Illumina sequencing platform in previous studies [20, 21] (Figure 2, Table 2). However, the higher proportion of unmapped reads (72.24%) indicated that PacBio could not detect all transcripts due to insufficient sequencing data (Table 2). These results indicated that the PacBio FLNCs and transcripts assembled by 2nd generation sequencing should be integrated to obtain a high-quality A. cristatum transcriptome database.
Functional annotation of the FLNC transcripts was conducted using 5 different public databases (Table 3 and Figure 3). Of these, 30,854 FLNC transcripts were found to have homologs in the SwissProt database. A total of 24,588 transcripts had significant matches in the eggNOG database, and 23,996 transcripts received Pfam domain assignments. Furthermore, 23,754 transcripts had matches in the Kegg database, and 29,424 transcripts were associated with GO terms. Moreover, the numbers of FLNC transcripts with transmembrane regions, signal peptides and rRNA transcripts were 5,601, 2,344 and 329, respectively. Altogether, 32,318 FLNC transcripts had at least one annotation (Table 3). In addition to protein-coding RNAs, 8,202 candidate non-coding RNAs were predicted in non-annotated FLNC transcripts.
Tissue-enriched FLNC isoforms
To analyse tissue-enriched transcript expression, a total of 11 transcriptome libraries were generated from 4 different tissues with multiple biological replicates of A. cristatum (Table S1). The Illumina sequencing generated approximately 15 million sequencing reads in each sample. After filtering the low-quality reads, about 99.98% of the sequencing reads were retained for downstream analysis. Quality-controlled RNA-Seq reads from the leaves, stems, roots and caryopses of A. cristatum were mapped to FLNC transcripts (Table S1). “Expressed” transcripts were defined as those with both (1) an average FPKM greater than 4 and (2) an FPKM greater than 2 for each replicate of the given tissue [29], resulting in the detection of 12,251 leaf, 13,440 stem, 14,192 root and 15,253 caryopsis protein-coding transcripts and 8,899 transcripts that may have “housekeeping” functions and were expressed in all sampled tissues (Figure 4A). As expected, GO enrichment analysis showed that basic cell biological and metabolic processes were enriched in the 8,899 ubiquitously expressed transcript set, including terms such as organonitrogen compound metabolic and biosynthetic process, organic substance metabolism, protein and peptide metabolism, and amide metabolic and biosynthetic based process (Figure 4B and Table S2). Additionally, the ubiquitous category shared intracellular part, organelle, ribonucleoprotein complex, and mitochondrial part terms.
Tissue-enriched transcripts, that is, transcripts expressed at significantly higher levels in a particular tissue compared to all other tissues (FDR ≤0.01, Fold Change ≥4, FPKM ≥2) were next identified in each type of tissue. We observed that the caryopsis tissue had the highest number of tissue-enriched transcripts (1,515), followed by leaf (266), root (210), and stem (32) tissues. As expected, GO analysis showed that tissue-enriched FLNC transcripts were enriched for particular molecular functions that varies with tissues. Leaf tissue-enriched transcripts were associated with photosynthesis, with GO terms such as oxidoreductase activity, ribulose-bisphosphate carboxylase activity, photosynthesis dark reaction, carbon-carbon lyase activity, chloroplast, and flavonoid biosynthetic process. (Figure 4C and Table S3). In addition, the stem tissue-enriched set was associated with many well-characterized transporter activity functions, including transferase activity, transferring glycosyl groups, transferring hexosyl groups, sucrose 1F-fructosyltransferase activity, fructosyltransferase activity, peptide:proton symporter activity, solute:proton symporter activity, solute:cation symporter activity, amide transmembrane transporter activity, symporter activity, and proton-dependent peptide secondary active transmembrane transporter activity GO terms (Figure 4D and Table S4). GO enriched analysis of the root tissue suggested that, in addition to expected categories associated with response to stress (response to external biotic stimulus, response to fungus, and response to biotic stimulus, regulation of defence response to fungus, and regulation of response to stimulus) and signal transduction (hormone-mediated signalling pathway, salicylic acid mediated signalling pathway, ethylene-activated signalling pathway and phosphorelay signal transduction system), response to chitin, oxygen-containing compound, and organonitrogen compound terms appeared in the root-enriched transcript list (Figure 4E and Table S5). The vast majority of GO terms associated with the caryopsis tissue-enriched genes were related to cellular processes, including protein heterodimerization activity, protein-DNA complex, DNA packaging complex, nucleosome, chromosomal part, protein dimerization activity, and DNA/nucleic acid/protein binding/heterocyclic compound binding terms (Figure 4F and Table S6). In summary, these tissue-enriched GO terms may provide insight into gene expression in A. cristatum tissue development and maintenance.
FLNC transcripts compared with wheat gene model
To compare transcripts between A. cristatum and wheat, the 44,372 FLNC transcripts were aligned to the IWGSC RefSeq v1.0 genome (Figure 5) and compared with the wheat gene set model. A total of 43,020 FLNC transcripts were mapped to 17,510 loci that were spread across the wheat genome (Figure 5B and 5C). Among these transcripts, 16,374 FLNC transcripts had multiple exons, and 4,604 loci had multiple transcripts, with an average of 1.8 transcripts per locus. The distribution and density of FLNC transcripts on the wheat genome were calculated for all chromosomes in the wheat genome, and sharply decreased from the telomeres to centromeres in the whole wheat chromosomal regions (Figure 5B). The number of FLNCs in each chromosome was not directly proportional to the chromosomal length and gene number. The most FLNC transcripts were aligned to the homologous group 2 chromosomes, whereas the homologous group 6 chromosomes contained the fewest FLNC transcripts (Figure 5E). Interestingly, the highest FLNC transcript number and density were observed on the wheat D genome (17,326, 43.5 FLNC transcripts/10 Mb), followed by the wheat B genome (13,500, 25.8 FLNC transcripts/10 Mb), and the wheat A genome (11,612, 23.4 FLNC transcripts/10 Mb) (Figure 5B, 5E and 5F). The distribution and density of the wheat-genome loci to which FLNC transcripts were mapped were similar to the distribution and density of FLNC transcripts (Figure 5C, 5G and 5H). In total, 3,398 novel FLNC transcripts were mapped to the intergenic regions of the wheat genome that did not overlap with wheat genes (Figure 5D and 5I). The density of the novel transcripts also decreased from the chromosome ends towards the centromeres (Figure 5D), and the highest density was also observed in the wheat D genome (Figure 5J).
GO analysis showed that novel FLNC transcripts were enriched for nucleic acid biological activity and biosynthetic processes, such as DNA polymerase activity, endonuclease activity, DNA recombination, integration and DNA biosynthetic processes, RNA-DNA hybrid ribonuclease activity, and nucleotidyltransferase activity (Figure 6A and Table S7). In addition, there were 1,352 FLNC transcripts that were not aligned to the wheat genome that are considered to be A. cristatum-specific transcripts compared with wheat. The vast majority of GO terms associated with the A. cristatum-specific transcripts, including the COPI vesicle coat, retrograde vesicle-mediated transport from Golgi to endoplasmic reticulum, and Golgi vesicle transport terms, were related to protein transport processes in the cytoplasm. Additionally, these transcript categories shared terms associated with multi-organism metabolic processes such as the RNA-DNA hybrid ribonuclease activity, transporter activity of nucleobase:cation symporter, uptake transmembrane and nucleobase transmembrane terms (Figure 6B and Table S8). Thus, these 4,750 FLNC transcripts, containing 3,398 novel and 1,352 A. cristatum-specific transcripts, might represent particularly positive selection compared with wheat and be helpful for understanding the genetic diversity of Triticeae.
Identification of candidate genes associated with thousand-grain weight in A. cristatum-wheat translocation line Pubing 3035
A total of 4 transcriptome libraries were generated from mixed RNA from the leaves and caryopses of Fukuho, Pubing 3035, BC2F2_6P+ and BC2F2_6P- sampled during four different periods. Illumina sequencing generated 42,700,222, 30,258,610, 29,705,108 and 30,538,203 sequence reads in Fukuho, Pubing 3035, BC2F2_6P+ and BC2F2_6P-, respectively. After filtering the low-quality reads, approximately 99.98% of the sequencing reads (42,692,045 reads for Fukuho, 30,253,410 reads for Pubing 3035, 29,699,108 reads for BC2F2_6P+ and 30,533,230 for BC2F2_6P-) were retained for downstream analysis (Table S1).
To identify genes specifically expressed in the translocation fragment, high-quality clean sequencing data were aligned to the reference sequences from an integration of the A. cristatum transcriptome and the wheat genome. Differential analysis using DESeq2 revealed that a total of 12 A. cristatum transcripts exhibited differential expression between non-translocation and translocation lines that met the parameters of log2(fold change) ≤ -4 and adjusted P value ≤ 0.05 (Table 4). The sequences of these 12 significantly differentially expressed transcripts were used as queries to search orthologous regions from genome sequences of wheat; the search indicated that homologous genes were located in the same interval on chromosome 6A/B/D. These intervals ranged from the TraesCS6A02G191200 gene to the TraesCS6A02G202900 gene on chromosome 6A, spanning 82.8 Mbp, from the TraesCS6B02G219700 gene to the TraesCS6B02G233700 gene on chromosome 6B, spanning 80.9 Mbp, and from the TraesCS6D02G174400 gene to the TraesCS6D02G187400 gene on chromosome 6D, spanning 88.7 Mbp (Table S9). transcript/24685 and TRINITY_DN94508_c0_g1_i1, transcript/16718 and TRINITY_DN118140_c0_g2_i2, transcript/14210 and transcript/9968 and TRINITY_DN12662_c0_g1_i1 and TRINITY_DN75295_c0_g1_i1 corresponded to the same homologs of the wheat genome, suggesting that they might be isoforms of the same gene or be derived from different homologous genes (Table S9). We developed polymorphic markers based on the sequences of homologous genes in the wheat 6A/B/D chromosomal regions corresponding to the 12 differentially expressed transcripts (Tables 4 and S10). The orthologous genomic regions of the translocation fragment in A. cristatum were identified in wheat chromosome 6A (Figure 7), indicating that the wheat chromosome interval corresponding to the A. cristatum translocation fragment in Pubing 3035 was from the TraesCS6A02G190200 to the TraesCS6A02G204000 gene of chromosome 6A and that obvious rearrangements could be observed on the 6P translocation segment compared with the wheat 6A chromosome (Figure 7). According to these results, it could be speculated that the genomic region of the translocation fragment in A. cristatum shows collinearity with chromosomes 6A of wheat.
The functions of these 12 significantly differentially expressed transcripts were investigated and one of them, transcript/7882, was homologous to the rice gene OsUBP15/LG1, which encodes a constitutively expressed ubiquitin-specific protease 15 (OsUBP15) that possesses de-ubiquitination activity in vitro and is a positive regulator of grain width and size in rice [52]. The marker WGRG9 was developed from transcript/7882 and corresponding with TraesCS6A02G192600, TraesCS6B02G231700 and TraesCS6D02G179700 (Figure 7 and Table S9). Therefore, the orthologue of WGRG9 in the corresponding P genomic region can serve as a candidate gene for control of thousand-grain weight in Pubing 3035; this gene should be subjected to functional verification in a future study.