After submitting approximately 100 million paired-end RNA-seq reads to the pipeline described above, we obtained a preliminary transcriptome assembly containing 9,275 elements, which are summarized in Supplementary Table ST1-S0c and shown in detail in Supplementary Table ST1-S1 (available at doi.org/10.6084/m9.figshare.14143757.v3). Transcriptome completeness identified 655/758 (86.4%) of the universal genes present in the Fungi odb10, supporting the high quality of this assembly (Fig. 1A; Table 1; Supplementary Table ST1-S2).
Transcripts were classified into different Transcription Class Codes (TCCs), reflecting their respective nature/origin (see Supplementary Table ST1-S0d) and filtered, to maintain only transcripts displaying FPKM ≥ 1 and belonging to TCCs “=”, “j”, “u”, “i” and “x”. This resulted in a reference TM pan-transcriptome containing 8,613 elements (Table 1; Supplementary Table ST1-S3). In total, 5,525 of these elements belong to TCC “=”, representing transcripts with exact match to a gene’s exon chain (Table 1; Supplementary Table ST1-S3). We also identified 2,503 putative mRNA isomorphs (derived from 2,236 genes), characterized as multi-exon transcripts, containing at least one exon-exon junction match (TCC “j”) (Table 1; Supplementary Table ST1-S3). This suggests that ∼23.6% of the 9,480 CDSs currently mapped in the T. marneffei PM1 draft genome [6] may be alternatively processed through alternative splicing, differential polyadenylation, or alternative transcription start sites. Finally, we identified 585 transcripts that do not match any gene currently mapped in the TM genome (Table 1; Supplementary Table ST1-S3). These novel transcripts are represented by 82 elements from TCC “u” (transcripts mapping in intergenic regions), 2 elements from TCC “i” (intronic transcripts) and 501 elements from TCC “x” (transcripts displaying overlap with exonic regions, but mapped on the opposite strand). Since the most recent Ensembl genomic annotation for the T. marneffei PM1 genome identified 9,480 genes, the 585 novel elements identified herein represent an ∼5.81% increase in the number of genes identified in this fungus.
The 8,613 transcripts in the reference pan-transcriptome were annotated with EnTAP against Swiss-Prot, EggNOG proteins, RefSeq and Nr proteins databases, resulting in matches for 8,219 transcripts (95.4%). Most transcripts (6,790; 78.8%) were functionally annotated with GO terms. Among these, 6,469 transcripts (75.1%) displayed matches to GO Biological Processes, while 5,189 (60.2%) displayed matches to GO Cellular Components and 6,351 (73.7%) to GO Molecular Functions. The ten most frequent functional groups within each GO category are shown in Fig. 1C. In addition, 2,353 (27.3%) isomorphs were annotated into at least one KEGG pathway term (Supplementary Table ST1-S4). InterProScan assigned protein family hits for 8,469 (98.3%) isomorphs (Supplementary Table ST1-S4). Next, transcripts not annotated by EnTAP were distributed into 10 RNA classes of the Rfam database, with the aid of Infernal, allowing their identification as non-coding RNAs (ncRNAs), based on their similarity to a series of such molecules, previously described in other organisms (Supplementary Table ST1-S4). Among these, are the small Cajal body RNAs (scaRNAs), involved in modifying small nucleolar RNAs (snoRNAs), such as snR80, widely observed in the Dikarya sub-kingdom [14]. We also identified snosnR61, described in Schizosaccharomyces pombe and Saccharomyces cerevisiae; and the Afu transcripts, previously described in Aspergillus fumigatus, Paracoccidioides brasiliensis and other fungi [10–13, 15]. After Infernal annotation, we conducted blastn searches, comparing the sequences of all 585 novel transcripts with EST data available for Talaromyces and other Trichocomaceae, at NCBI. This analysis resulted in matches for 100 transcripts (Table 1; Supplementary Table ST1-S5), confirming the existence of a large proportion of such novel elements (∼17.1%) by an independent approach, based on direct cDNA cloning (thus, reinforcing that they are not PCR artifacts produced during RNA-seq library construction.
All 8,613 transcripts were also evaluated for their coding potential (CP) with the aid of a custom pipeline, based on three CP prediction tools: CPAT, CPC2 and RNAsamba (see Material and Methods). These analyses led to the identification of 7,358 coding and 313 noncoding transcripts (Table 1; Supplementary Table ST1-S4). Additionally, there were 942 elements with undetermined coding potential, since their analyses by the CP-prediction tools provided conflicting results (Table 1; Supplementary Table ST1-S4). From the 7,358 coding elements, 90 were considered as derived from previously unmapped genes, since they do not correspond to any coding sequence (CDS) previously described in the TM genome (TCCs “u”, “x”) (Table 1; Supplementary Table ST1-S6). The 313 noncoding transcripts were further evaluated by FEElnc, which identified 102 lncRNAs among such elements, along with their corresponding partner mRNAs (Table 1; Supplementary Table ST1-S7-S8). Thus, the remaining 211 elements were considered novel putative non-coding RNAs (ncRNAs) (Table 1; Supplementary Table ST1-S9). Finally, the 313 putative ncRNAs were submitted to a blastn analysis against previously described ncRNA sequences from Talaromyces (available at the Ensembl Fungi database). These analyses resulted in matches for 125 putative ncRNAs reported herein (Table 1; Supplementary Table ST1-S10), indicating that our reference transcriptome identified 188 novel noncoding elements in the Talaromyces genus, including 48 novel lncRNAs (Table 1; Supplementary Table ST1-S11).