Collection of fungi and corresponding RNA-seq datasets
A total of 220 fungi covering 51 orders were collected in FungiExp, whose genome assemblies and annotations are hosted in the Ensembl database (Release 47) [12]. Only RNA-seq data generated from Illumina platforms were considered because of their ubiquity and high base-calling accuracy. The corresponding RNA-seq raw sequence files were retrieved by searching the Sequence Read Archive (SRA, https://www.ncbi.nlm.hih.gov/sra) database using terms Platform=’Illumina’, Strategy=’RNA-seq’ and Source=’Transcriptomic’. The retrieved RNA-seq datasets were gathered with further manual curation of sample information including strain, tissue, development stage, genotype, and treatment conditions. After being mapped, a total of 35,821 RNA-seq experiments from 3,052 studies containing 67.8 terabytes bases were adopted in FungiExp (Table S1).
Improvement of gene model annotation
The correct structural annotation of genes is critical for downstream analyses, especially for the identification of alternative splicing events based on read alignments on exons. Except for a few model fungi, the gene model annotations of most fungi are largely incomplete. Only one reference transcript is annotated for each gene. We used the following procedure to improve gene models.
1. High-quality RNA-seq data curated manually were mapped to the reference genome with Hisat2 [13] and generated transcript structures based on read alignments with StringTie2 [14].
2. Only novel multi-exonic transcripts of at least 200 bps in length, an average coverage of 2x per transcript, and an average coverage of 1x per exon were retained for further refining.
3. We further removed the novel transcripts not occurring in at least 1/3 of all experiments and at least 3 experiments.
After improvement, the splice junctions and exons were respectively increased by 7.4% and 10.86% on average (Table S2). The multi-exon genes with alternative transcripts were increased from 0.76% to 15.85% on average (Table S2). The average transcript number per multi-exon gene was increased from ~1 to 1.2. Some fungi, such as Saccharomyces cerevisiae, did not give additional alternative transcripts, nearly all their genes being single-exonic. Ustilaginoidea virens showed the greatest improvement in gene models, with more than 69% of multi-exon genes having alternative transcripts after the process.
In order to explore cross-species conservation conveniently, we identified orthologous gene groups based on the longest protein sequences of genes by using orthofinder [15].
Identification of alternative splicing events
Fungal alternative splicing events are evolutionary adaptations to changing external conditions and widely interest the fungus research community [16]. In our experiments, rMATs was used to call five types of alternative splicing events according to RNA-seq reads alignments on exons (Figure 1A) [17]. As in plants, RI (retained intron) was the most popular type of alternative splicing event detected in all collected RNA-seq samples, accounting for 58% on average (Figure 1B). The intron chains of 74.35% alternative splicing events conformed to known transcripts, and the remaining events were deduced by rMATS according to read mapping on exons (Figure 1C).
Estimate of gene expression and alternative splicing profiles
StringTie2 was used to detected gene and transcript expression levels based on RNA-seq read alignments. Two expression level metrics such as TPM (transcripts per million) and FPKM (fragments per kilo base per million mapped reads) were adopted in FungiExp.
PSI (percentage spliced in) is used to estimate alternative splicing profiles [18]. It refers to the ratio between reads including or excluding exons and indicates how efficiently sequences of interest are spliced into transcripts. Two metrics are used to estimate alternative splicing including JCEC and JC. For JCEC, PSI values are calculated based on both reads that span junctions defined by rMATS (Junction Counts) and reads that do not cross an exon boundary (Exon Counts). For JC, PSI values are calculated based only on reads that span junctions (Junction Counts) [17].
Differential and specific expression analyses
Two popular methods, DESeq2 [19] and edgeR [20], are integrated into FungiExp for users to perform differential expression analysis. The former supports comparison with sample replicates, whereas the latter supports comparison both with and without sample replicates.
Three statistical models, including rMATS_unpaired, rMATS_paired, and MATS_LRT are used to detect differential alternative splicing by means of read counts on events. In the MATS_LRT model, read counts of all replicates in a group are added together. The rMATS_unpaired model is used for replicates without pairs between groups. The rMATS_paired is employed when each replicate is paired with another between the two sample groups.
For specific expression analysis, users need to customize at least 4 experimental groups to detect specifically expressed genes. If a gene/transcript’s expression level is significantly higher/lower in a group than all other groups, the gene/transcript is supposed to be specifically expressed in this group. A similar definition is used for a specific alternative splicing event.
Enrichment analysis
The two enrichment analysis methods, namely the Hypergeometry test and Gene Set Enrichment Analysis, were integrated into FungiExp for users to summarize the function distribution of differentially/specifically expressed genes [21-22]. The former will find enrichment among genes where the difference is large, but it will not detect a situation in which the difference is small but evidenced in a coordinated way in a set of related genes. The latter method addresses this limitation.
Utility
The FungiExp platform was developed based on improved gene models, alternative splicing events, gene/transcript expression levels, and alternative splicing profiles in all collected RNA-seq samples (Figure 2). The 220 fungi are listed in an interactive table in the home page for users to access. For each fungus, users can download all of its data in the summary page, perform retrieval operations in the search and blast pages, and perform expression analyses in the comparison and specificity pages.
Retrieval
In the search page, users can query genes to display gene/transcript expression levels and profiles of associated alternative splicing events by gene identifier, symbol, and functional annotation terms (Figure 3A). Users can also search genes by a pathway through which relationships among genes can be displayed visually (Figure 3B). In addition, users can specify metrics (TPM or FPKM) to show expression levels and specify a metric (JCEC or JC) to display alternative splicing profiles. Alternatively, genes can be queried by sequence similarity in the blast page (Figure 3C). This is useful when looking for homologous or orthologous genes if they are not easily identified by an identifier or symbols. The retrieved genes are displayed in a concise table with basic information and links to a gene page showing expression levels (Figure 3D).
Display of gene expression and alternative splicing
By means of the query and analysis modules, the genes are listed in an interactive table that includes links for users to open pages showing gene/transcript expression levels and alternative splicing profiles (Figure 4A). The gene, transcript, and alternative splicing pages consist of similar structures.
The basic information in them will be listed at the top of page, including genome position, symbol, and orthologous and various functional annotation terms. Moreover, by clicking the orthologous group ID, users can open a pop-up page to explore orthologous gene expression levels in other species (Figure 4B).
Subsequently, a genome browser allows users to explore gene, transcript, and alternative splicing in the genome context (Figure 4C). Notably, the sequence of interest can be directly obtained from the genome browser.
As a transcript page, the next section is a functional structure graph that illustrates the positions of Pfam domains (Figure 4D).
This is followed by a drop-down list box containing all collected studies and a hierarchical column chart displaying expression levels or alternative splicing profiles in a chosen study. Study information, such as study title, RNA-seq information, and experimental group number, is listed in the drop-down list box. The average expression levels or alternative splicing profiles in the experimental groups are shown in the 1st level column chart. The sample information of the experimental group, such as strain, tissue, genotype, developmental stage and treatment, are displayed in the horizontal ordinate. A click on an experimental group opens the 2nd column chart, which displays the expression level or alternative splicing profile in each experiment (Figure 4E).
Finally, two tables respectively list the associated transcripts and alternative splicing events with links to pages showing corresponding transcript expression levels or alternative splicing profiles. In transcript and alternative splicing pages, users can open the impact page to show the way that alternative splicing events act on the transcript if available (Figure 4D).
Differential expression analysis
In the comparison page, users can flexibly customize two experimental groups to perform differential expression analyses. In addition, users can select different differential expression/splicing analysis methods and downstream enrichment analysis methods and corresponding parameters. FungiExp supports offline analysis and email notification functions. Users can open the result page to view or download results of analysis by clicking on the link in the notification email.
On the result page, three sections display the differentially expressed genes. The first section contains the principal components analysis (PCA) result and heat map diagrams illustrating experiment clustering based on gene expression levels (Figure 5A). Then, two diagrams present the results for GO and pathway enrichment analysis in differentially expressed genes (Figure 5B). Finally, the differentially expressed genes are listed in an interactive table (Figure 5C). In the table, users can directly obtain important information such as mean expression levels in the control and treatment groups, fold-changed, Q-value, and links for users to open new pages to explore the details of gene expression levels.
Similarly, three sections display the differentially alternative splicing events. The PCA and heat map diagrams are based on alternative splicing PSI values. The GO and pathway enrichment analysis diagrams are based on differentially spliced genes. The differentially alternative splicing events are also listed in an interactive table with links for users to open new pages to explore the details of gene alternative splicing.
Specific expression analysis
In addition to the analysis methods and parameters that are similar in the comparison page, users should customize at least 4 experimental groups to discover differentially expressed or alternative splicing genes. For example, users may want to find the specifically expressed genes in different developmental stages. Indeed, specific expression analysis consists of multiple differential expression analyses. Therefore, compared with differential expression analysis, it tends to consume more time.
On the result page, four sections display the specifically expressed genes. The first section contains the PCA result and heat map diagrams illustrating experiment clustering based on gene expression levels (Figure 6A). Then, a column chart summarizes the specifically expressed gene numbers in each group (Figure 6B). Next, two tables respectively show enriched GO and pathways for specifically expressed genes (Figure 6C). Finally, the specifically expressed genes are listed in an interactive table (Figure 6D). In the table, users can click links to open new pages to explore the details of gene expression. Similarly, four sections display specifically alternative splicing events.
A case study to explore fungal gene expression changes induced by heat stress
Heat stress affects a broad range of cellular processes that result in cell cycle arrest, damage to membranes and cytoskeletal structures, and accumulation of misfolded proteins [23-24]. We applied FungiExp to a set of RNA-seq datasets (SRP071140) that were obtained from Fusarium graminearum under normal and high temperature and released by Duc-Cuong et. al.. FungiExp reported a total of 3,268 differentially expressed genes after heat stress. Among these genes, the change in gene expression level of FGRAMPH1_01G04861 (FgHSP90) was validated by quantitative real-time (qRT)-PCR [25]. FungiExp showed that these differentially expressed genes were enriched with respect to unfolded protein binding (GO:0051082), transmembrane transporter activity (GO:0022857), transmembrane transport (GO: 0055085), mismatch repair (GO:0006298), and protein folding (GO:0006457) (Figure 7A-1).
We further explored gene splicing changes after heat stress. From the PCA result and heat maps, the RNA-seq samples were clustered into control and treatment groups based on alternative splicing profiles (Figure 7B). This implied that gene splicing was also affected by heat stress in F. graminearum. A total of 235 differential alternative splicing events in 189 genes were identified in which RI events accounted for more than 87% (Figure 7C). In the 189 differentially spliced genes, only 59 genes were significantly changed in their expression levels (Figure 7D). FungiExp showed that the differentially spliced genes were enriched with regard to protein kinase activity (GO:0004672) and protein phosphorylation (GO:0006468) (Figure 7A-2) and to MAPK signaling pathway – yeast (ko04011). Some studies have shown that, in eukaryotic species from yeast to human, stress-activated protein kinases, i.e., members of the MAPK subfamily, regulate the transcriptional response to various environmental stresses [26].
In the differentially spliced genes, the protein product of FGRAMPH1_01G26803 contained two RNA binding domains (PF00076) that are highly abundant in eukaryotes and are found in proteins involved in post-transcriptional gene expression processes including mRNA and rRNA processing, RNA export, and RNA stability [27]. The expression level of this gene was not changed significantly, but the PSI value of an associated alternative splicing event FGRAPH1RI0000000595 was significantly increased by 0.516 after heat stress (Figure 7E). The alternative splicing removed 54 nucleotides from the associated transcript FGRAMPH1_01T26803 and finally caused a shorter distance between the two domains in the corresponding protein product (Figure 7F). The findings of alternative splicing in the regulation gene may provide some cues for further investigation of fungal response to external stress. We believe that the discovery of these events demonstrates the value of FungiExp as an important and necessary complement to the existing RNA-seq studies.