New Putative Long Non-Coding RNAs (lncRNA) Revealed by Pan-Transcriptome of the Emerging Human Pathogenic Fungus Talaromyces Marneffei

Previous genomic/transcriptomic analyses of Talaromyces marneffei (TM) unravelled relevant pathogenicity-related elements, as well as chromosomal regions potentially involved with the production of non-coding RNAs (ncRNAs), which have been parsimoniously reported in fungi. This manuscript describes a comprehensive pan-transcriptome assembly for TM that identies a series of previously undetected genetic elements in this emerging pathogenic fungus. Our results conrm that ~58.28% of the 9,480 genes currently annotated in the TM genome are, in fact, transcribed in vivo and that ~23.6% of them may display alternative isomorphs. Moreover, we identied 585 transcripts that do not match any gene currently mapped in the genome, represented by 90 coding transcripts and 140 ncRNAs, including 48 long non-coding RNAs (lncRNAs). Overall, we expect that the novel elements described herein may contribute to improve the currently available Talaromyces databases and foster studies aiming at characterizing lncRNA-mediated gene expression control in fungi.


Introduction
Talaromyces marneffei (TM), previously known as Penicillium marneffei, is a thermally dimorphic fungus viewed as an important emerging pathogen in endemic regions of Southeast Asia, Southern China and Eastern India [1][2][3]. Recent data showed that TM infection correlates with the highest mortality rates among HIV/AIDS patients, surpassing those observed in most HIV-associated complications [3][4]. TM infection has also been occasionally observed in individuals submitted to treatments involving immunosuppressive drugs (including cancer, systemic lupus erythematosus and bone marrow/solid organ transplants, among others) [5].
Omics analyses constitute some of the most effective tools currently available to properly understand the general biology and virulence mechanisms in pathogenic fungi. Accordingly, analysis of the ~28.9 Mb draft TM genome unravelled a series of relevant elements, which seem to be associated with development and virulence in this microorganism, including meiotic genes, proteins involved in pheromone response pathways and putative virulence factors [6]. Moreover, genome annotation identi ed chromosomal regions that potentially encode a series of non-coding RNAs (ncRNAs) in TM, while preliminary transcriptome analyses provided evidence for the existence of microRNA-like RNAs (miRNAs) in this microorganism [7][8][9][10]. Although miRNAs and other types of ncRNAs are important controllers of gene expression in plants and animals, their presence in fungi, as well as their role in controlling the outcome of relevant biological processes in such organisms has only started to be discovered [11][12][13][14][15][16]].

Raw Libraries collection
The TM mycelial-yeast pan-transcriptome described herein was built from RNA-seq libraries of T. marneffei PM1, available at the NCBI SRA database, under project ID PRJNA212740 [9, 6, 15].

Bioinformatics Analysis
Reads were submitted to a work ow previously described by [16], with minor modi cations. Brie y, raw FASTQ sequencing data were processed in a Public Galaxy Server, available at usegalaxy.eu. Initially, the quality of raw sequences was assessed using FastQC [17] and MultiQC [18]. Fastp [19] was then used to remove low-quality reads (Q<30) and adapters. To remove sequences from the human host, reads were aligned with Bowtie2 [20] against a local database containing the human reference genome hg38, available at Galaxy Europe (usegalaxy.eu). To remove rRNA reads, the high-quality reads were aligned to sequences in the SILVA ribosomal RNA (rRNA) [21] and Rfam databases [22] using SortMeRNA [23]. NCBI UniVec database (https://www.ncbi.nlm.nih.gov/tools/vecscreen/univec/) was used to remove vector contaminants from libraries, with the aid of Bowtie2. Quality-ltered reads were then mapped to the latest version of the T. marneffei PM1 reference genome (GCA_000750115, ASM75011v1), available at the Ensembl Fungi database [24], using HISAT2 [25].
StringTie [25] was then used to assemble the mapped reads into transcripts, using the de novo transcriptome reconstruction method, allowing identi cation of all transcripts present in each sample (including currently annotated genes, as well as newly identi ed elements and isomorphs). StringTie merge [25] was next used to combine redundant transcription structures, providing a uni ed reference transcriptome, with unique identi ers. Cu inks [26] was then used to estimate expression values (FPKM) for each element in the StringTie-generated transcriptome. Transcriptome completeness analysis was conducted with the aid of BUSCO using the Fungi odb10 reference database [27]. Finally, transcripts were classi ed into different Transcription Class Codes (TCCs), re ecting their respective nature/origin, with the aid of Cuffcompare [26], using a GFF3 reference annotation le for the T. marneffei PM1 genome (obtained from Ensembl Fungi). The nal assembly derived from the StringTie/Cu inks/Cuffcompare analyses was ltered by expression level and only elements displaying FPKM ≥ 1 and TCCs "=", "j", "i", "u" and "x" were considered real transcripts and used in subsequent analyses, as suggested by [16].

Functional Annotation
Functional annotations were obtained with the aid of the Eukaryotic Non-Model Transcriptome Annotation Pipeline (EnTAP) [28]. Contigs were queried for similarity (blastx, e-value ≤ e-5) against the National Center for Biotechnology Information non-redundant protein database (NCBI nr); NCBI proteins reference database (RefSeq); Swiss-Prot curated database from UniProt Knowledgebase (UniProtKB) and the EggNOG proteins database. EnTAP ontology searches via EggNOG also helped to assign the biological function to the genes, identifying GO (Gene Ontology) and KEGG (Kyoto Encyclopedia of Genes and Genomes) terms. Additionally, InterProScan [29] mapped the transcripts to their respective protein families. Elements not annotated by EnTAP were evaluated with Infernal (cmscan) [30] and classi ed into the different families of ncRNAs de ned in the Rfam database. After Infernal annotation, blastn (e-value ≤ e-5) searches compared possible novel transcripts with EST (expressed sequence tags) data available for Talaromyces and other Trichocomaceae, at NCBI.

lncRNA discovery and conservation
All real transcripts were evaluated for their respective coding potential (CP) with the aid of three tools: RNAsamba [31], Coding Potential Calculator (CPC2) [32] and Coding Potential Assessment Tool CPAT [33]. In this work, we considered putative long non-coding RNAs (lncRNA), those transcripts with length ≥200 bp, not annotated by EnTAP/Infernal and that were identi ed as non-coding by all CP tools. Finally, these putative lncRNAs were submitted to FEElnc [34], which evaluated their potential nature as lncRNAs, identi ed their respective partner genes and classi ed the new lncRNAs to their localization and the directionality, in relation to their partner RNAs. To discover conserved ncRNAs, we aligned putative ncRNA sequences against Talaromyces ncRNAs sequences (available at the Ensembl Fungi database) using blastn with a cut-off e-value ≤ e-5, coverage ≥ 90 and identity ≥ 50%.

Reproducibility of methods
To guarantee the reproducibility of the data presented in this manuscript, we provide all information about the versions, sources and references for all tools used to obtain the results presented herein in the Supplementary

Results And Discussion
After submitting approximately 100 million paired-end RNA-seq reads to the pipeline described above, we obtained a preliminary transcriptome assembly containing 9,275 elements, which are summarized in Supplementary Transcripts were classi ed into different Transcription Class Codes (TCCs), re ecting their respective nature/origin (see Supplementary Table ST1-S0d) and ltered, to maintain only transcripts displaying FPKM ≥ 1 and belonging to TCCs "=", "j", "u", "i" and "x". This resulted in a reference TM pantranscriptome containing 8,613 elements (Table 1; Supplementary Table ST1-S3). In total, 5,525 of these elements belong to TCC "=", representing transcripts with exact match to a gene's exon chain (Table 1;  Supplementary Table ST1-S3). We also identi ed 2,503 putative mRNA isomorphs (derived from 2,236 genes), characterized as multi-exon transcripts, containing at least one exon-exon junction match (TCC "j") ( Table 1; Supplementary Table ST1-S3). This suggests that ∼23.6% of the 9,480 CDSs currently mapped in the T. marneffei PM1 draft genome [6] may be alternatively processed through alternative splicing, differential polyadenylation, or alternative transcription start sites. Finally, we identi ed 585 transcripts that do not match any gene currently mapped in the TM genome (Table 1; Supplementary   Table ST1-S3). These novel transcripts are represented by 82 elements from TCC "u" (transcripts mapping in intergenic regions), 2 elements from TCC "i" (intronic transcripts) and 501 elements from TCC "x" (transcripts displaying overlap with exonic regions, but mapped on the opposite strand). Since the most recent Ensembl genomic annotation for the T. marneffei PM1 genome identi ed 9,480 genes, the 585 novel elements identi ed herein represent an ∼5.81% increase in the number of genes identi ed in this fungus.
The 8,613 transcripts in the reference pan-transcriptome were annotated with EnTAP against Swiss-Prot, EggNOG proteins, RefSeq and Nr proteins databases, resulting in matches for 8,219 transcripts (95.4%). Most transcripts (6,790; 78.8%) were functionally annotated with GO terms. Among these, 6,469 transcripts (75.1%) displayed matches to GO Biological Processes, while 5,189 (60.2%) displayed matches to GO Cellular Components and 6,351 (73.7%) to GO Molecular Functions. The ten most frequent functional groups within each GO category are shown in Fig. 1C with the aid of Infernal, allowing their identi cation as non-coding RNAs (ncRNAs), based on their similarity to a series of such molecules, previously described in other organisms (Supplementary Table   ST1-S4). Among these, are the small Cajal body RNAs (scaRNAs), involved in modifying small nucleolar RNAs (snoRNAs), such as snR80, widely observed in the Dikarya sub-kingdom [14]. We also identi ed snosnR61, described in Schizosaccharomyces pombe and Saccharomyces cerevisiae; and the Afu transcripts, previously described in Aspergillus fumigatus, Paracoccidioides brasiliensis and other fungi [10][11][12][13]15]. After Infernal annotation, we conducted blastn searches, comparing the sequences of all 585 novel transcripts with EST data available for Talaromyces and other Trichocomaceae, at NCBI. This analysis resulted in matches for 100 transcripts (Table 1; Supplementary Table ST1-S5), con rming the existence of a large proportion of such novel elements (∼17.1%) by an independent approach, based on direct cDNA cloning (thus, reinforcing that they are not PCR artifacts produced during RNA-seq library construction. All 8,613 transcripts were also evaluated for their coding potential (CP) with the aid of a custom pipeline, based on three CP prediction tools: CPAT, CPC2 and RNAsamba (see Material and Methods). These analyses led to the identi cation of 7,358 coding and 313 noncoding transcripts (Table 1; Supplementary   Table ST1-S4). Additionally, there were 942 elements with undetermined coding potential, since their analyses by the CP-prediction tools provided con icting results (Table 1; Supplementary Table ST1-S4). From the 7,358 coding elements, 90 were considered as derived from previously unmapped genes, since they do not correspond to any coding sequence (CDS) previously described in the TM genome (TCCs "u", "x") ( Table 1; Supplementary Table ST1-S6). The 313 noncoding transcripts were further evaluated by FEElnc, which identi ed 102 lncRNAs among such elements, along with their corresponding partner mRNAs (Table 1; Supplementary Table ST1-S7-S8). Thus, the remaining 211 elements were considered novel putative non-coding RNAs (ncRNAs) ( Table 1; Supplementary Table ST1-S9). Finally, the 313 putative ncRNAs were submitted to a blastn analysis against previously described ncRNA sequences from Talaromyces (available at the Ensembl Fungi database). These analyses resulted in matches for 125 putative ncRNAs reported herein (Table 1; Supplementary Table ST1-S10), indicating that our reference transcriptome identi ed 188 novel noncoding elements in the Talaromyces genus, including 48 novel lncRNAs (Table 1; Supplementary Table ST1-S11).