Overview of the software ecosystem
We have developed a suite of web-based tools consisting of EcoOmicsDB (www.ecoomicsdb.ca) for ortholog query and cross-species comparisons, EcoOmicsAnalyst (www.ecoomicsanalyst.ca) for raw reads processing using EcoOmicsDB, and ExpressAnalyst (www.expressanalyst.ca) for expression profiling. While the primary motivation for these tools is to improve RNA-seq analysis for non-model organisms, these tools can also be used to analyze data from common model organisms. The three software tools are described in more details below.
Ecoomicsdb: A High-resolution Ortholog Database For Cross-species Gene Annotation
EcoOmicsDB is a custom ortholog database that was developed to significantly improve the resolution and transcriptome coverage of the KEGG ortholog database described previously in Seq2Fun version 1.0 (Liu, Ewald et al. 2021). It currently includes ~ 13 million protein-coding genes from 596 species (Table 1). Of these protein-coding sequences, 5,871,017 were annotated with KEGG pathways and 1,567,627 were annotated with GO terms. The 596 species were organized into 29 taxonomic sub-groups, based on the NCBI taxonomy database (Schoch, Ciufo et al. 2020). Symbols, descriptions, and functional annotations were harmonized across individual proteins for each ortholog group (more details are provided in the methods section). All details for each ortholog group in EcoOmicsDB are accessible via a web-interface (www.ecoomicsdb.ca) that can be queried by either ortholog ID or Entrez ID.
Table 1
EcoOmicsDB contains 29 taxonomic sub-groups available for RNA-seq annotation and quantification of non-model organisms in eukaryotes. The taxonomic levels are indicated by the number in the Level column and also by indentations and font styles in the Group column.
Level | Group | Species | Proteins | Ortholog | Seq2Fun (v1) |
1 | Eukaryotes | 596 | 12 828 537 | 666 067 | 8041 |
2 | Animals | 370 | 7 150 735 | 270 089 | - |
3 | Vertebrates | 212 | 4 588 985 | 83 704 | 6723 |
4 | Mammals | 94 | 1 910 363 | 47 144 | 5883 |
4 | Birds | 31 | 482 205 | 22 397 | 4177 |
4 | Reptiles | 20 | 384 584 | 21 725 | 4342 |
4 | Fishes | 61 | 1 736 572 | 42 497 | 4308 |
3 | Arthropods | 119 | 1 727 651 | 113 673 | 3541 |
4 | Insects | 101 | 1 376 824 | 70 170 | - |
4 | Crustaceans | 7 | 154 960 | 37 407 | - |
3 | Cnidarians | 9 | 203 000 | 24 003 | - |
3 | Mollusks | 9 | 206 905 | 35 775 | - |
3 | Nematodes | 6 | 134 093 | 35 865 | 2324 |
2 | Plants | 127 | 3 968 027 | 162 988 | 3012 |
3 | Eudicots | 93 | 3 180 221 | 102 677 | - |
3 | Monocots | 17 | 560 027 | 43 451 | - |
3 | Algae | 14 | 155 495 | 38 334 | - |
2 | Fungi | 138 | 1 278 312 | 148 080 | 2423 |
3 | Ascomycetes | 100 | 904 642 | 98 151 | - |
4 | Eurotiomycetes | 20 | 196 228 | 25 710 | - |
4 | Saccharomycetes | 36 | 195 913 | 14 873 | - |
4 | Dothideomycetes | 10 | 123 200 | 28 898 | - |
3 | Basidiomycetes | 33 | 363 997 | 56 935 | - |
2 | Protists | 52 | 660 237 | 134 451 | 2696 |
3 | Alveolates | 21 | 207 674 | 51 205 | - |
4 | Apicomplexans | 18 | 93 576 | 14 632 | - |
3 | Stramenopiles | 8 | 119 746 | 31 581 | - |
3 | Amoebozoa | 7 | 81 844 | 22 114 | - |
3 | Euglenozoa | 9 | 86 483 | 12 363 | - |
EcoOmicsDB was created with the OrthoFinder software (Emms and Kelly 2019), which identifies rooted ortholog groups by inferring groups of sequences that share a common ancestor. The sequence-similarity parameters for ortholog definition were chosen to produce ortholog groups at a higher resolution than the KO system, and species-specific functional annotations were compiled to produce both KEGG and GO term gene sets for Seq2Fun ortholog IDs (Kanehisa, Sato et al. 2016, Consortium 2019). The first round of analysis took more than ten days to complete using a server with 54-threads and 504GB RAM. The analysis binned 12,828,537 protein-coding sequences from 596 species into 666,067 ortholog groups. The size distribution of these ortholog groups largely follow a power law distribution (Fig. 2A) (Girvan and Newman 2002, Arita 2005). While most orthologs group contain fewer than ten sequences, the largest ones were many times larger than this, with the biggest one (s2f_0000000) containing more than 50,000 transcription factor sequences. Of particular concern to our toxicology-focused authors was the 5th largest ortholog group (s2f_0000004) that contained > 28,000 cytochrome P450 enzymes, an average of 47 per species. Grouping at this level makes it difficult to infer gene-level insights. To address this issue, we further split the largest 10, 000 ortholog groups into 76, 066 groups with an adaptive k-means clustering-based approach (Fig. 2B), with the largest ortholog group being split into more groups (n = 96) than the smallest (n = 2). An example of this is shown for the vitellogenin ortholog group (Fig. 2C), a protein family that is important in the study of non-mammalian vertebrate species because it is a highly sensitive biomarker for exposure to estrogenic compounds. It is not found in the KEGG ortholog database. Details on the ortholog splitting approach, parameters, and rationale are given in the Methods section.
Ecoomicsanalyst: Efficient Rna-seq Data Processing Via Web-based Interface
EcoOmicsAnalyst is a web-based tool designed to produce count tables from raw RNA-seq reads for any eukaryotic species. For species with no reference transcriptome, reads are aligned to EcoOmicsDB using Seq2Fun 2.0. For species with a reference transcriptome, users can choose between using either Seq2Fun or Kallisto (Bray, Pimentel et al. 2016). EcoOmicsAnalyst has a user account system to allow users to upload, store, and process FASTQ files on our server. Each account is limited to 30GB of data. If users are uncomfortable uploading their data, have a dataset larger than 30GB, or want to avoid the time-consuming upload step, we provide a Docker image for local installation.
The EcoOmicsAnalyst interface and workflow are designed around ‘Projects’ (Figure S1). Upon login, users are brought to their ‘Project Home’, which includes a row for each project with a status bar that indicates whether the job is “running” or “completed”. Users can initiate a new project from this page by clicking either “With Reference” for Kallisto-based quantification or “Without Reference” for Seq2Fun-based quantification. After initiating a project, users are brought to the “File Management” page, which allows them to view and navigate FASTQ files in their directory on the server. Files can be added to this directory using FileZilla (filezilla-project.org). Detailed instructions and tutorials on how to do this are available on this page. Next, the “Data Sanity Check” page is used to select samples for quantification, label samples with experimental factors, and, in the case of paired end reads, group forwards and backwards reads from the same sample. The “Reads Quantification” page allows users to select a reference transcriptome or use our ortholog database for read mapping together with a few quantification parameters. After job submission, the “Job Status View” page will be displayed, showing a summary of the number of completed samples and a log of the Kallisto or Seq2Fun output. At this stage, users can close the EcoOmicsAnalyst browser tab, and the job will continue to run on the server. Total processing time will vary depending on current server load, dataset size, and sample sequencing depth, but in our experience a dataset with 15 FASTQ files that are ~ 1GB each takes around six hours to complete. Users will receive email notices once their jobs are complete. The “Project Results” page includes a summary table of quantification statistics, PCA plots, rarefaction curves, and plots of reads quality before and after filtering. From here, users can go to the “Download Results” page to obtain the count table, feature annotation details, and other analysis information. The count table is ready for downstream analysis using ExpressAnalyst.
The underlying algorithm of EcoOmicsAnalyst is Seq2Fun 2.0 (www.seq2fun.ca). The version 2.0 significantly reduces memory footprint while maintains a high efficiency compared to version 1.0 (Liu, Ewald et al. 2021). For example, Seq2Fun 2.0 maintains ~ 2 million reads per minute while decreasing memory usage from 1.49 GB to 0.94 GB while processing our test datasets, despite version 2.0 having a much larger (> 125 times increase) database compared to version 1.0. Version 2.0 also includes a new function called SeqTract to retrieve the mapped reads for a given list of genes. Seq2Fun generates mapped reads of all genes into a single fastq.gz or a pair of fastq.gz files for single and pair-end reads, respectively. To conduct a target gene assembly, SeqTract takes a list of Seq2Fun IDs and the mapped fastq.gz files as an input and outputs a fastq.gz file for each ID. It supports multi-threading, consumes very little memory and is highly efficient. The fastq.gz file for each gene can be fed into popular de novo assemblers (Holzer and Marz 2019) to assemble the contigs for primer design, isoform identification, or phylogenetic analysis.
Expressanalyst: Comprehensive Transcriptome Profiling Through Visual Analytics
ExpressAnalyst is a web-based platform combing visualization, statistics, and functional analysis for gene expression data analysis. It currently contains three modules according to the data input type: ‘Enrichment Analysis’ for functional analysis of gene lists, ‘Expression Profiling’ for statistical and functional analysis of gene expression tables, and ‘Meta Analysis’ for statistical and functional analysis of several gene expression tables collected under similar experimental conditions. The ‘Enrichment Analysis’ and ‘Meta Analysis’ modules have been described by our previous publications (Xia, Fjell et al. 2013, Xia, Lyle et al. 2013, Zhou, Soufan et al. 2019). Here, we focus on the ‘Expression Profiling’ module which contains many new and updated features to specifically support results generated from EcoOmicsAnalyst.
The ‘Expression Profiling’ module supports microarray or RNA-seq tables for 28 model species, as well as Seq2Fun IDs and KEGG ortholog IDs for any eukaryotic species with transcriptomics data. Users can also upload their own custom annotation files. After data upload, the ‘Data Quality Check’ page gives statistical and visual summaries of the feature annotation and raw data. Next, the ‘Data Filtering & Normalization’ page allows users to filter by variance, abundance, and perform several common normalization techniques. Boxplots, density plots, and PCA plots allow the users to examine their data before and after applying these normalization steps. On the ‘Differential Expression Analysis’ (DEA) page, users can select between several established DEA methods – limma, edgeR and DESeq2 (Robinson, McCarthy et al. 2010, Love, Huber et al. 2014, Ritchie, Phipson et al. 2015) with their associated parameters. Next is the ‘Select Significant Genes’ page, where users can define their p-value and/or fold-change cut-offs and view a table of the resulting gene expression results (Fig. 3A). All genes in the table are linked to their NCBI gene cards or EcoOmicsDB ortholog profiles (Fig. 3B). Users can view violin plots of expression values across experimental factors (Fig. 3A). Finally, the ‘Analysis Overview’ provides six visual analytics functions to help identify important features, functions and their correlations through interactive Volcano Plot, Enrichment Network, Ridgeline Chart, Heatmap, etc. (Fig. 3C-D).
Benchmark And Case Studies
Seq2Fun 1.0 was rigorously validated in our original publication (Liu, Ewald et al. 2021). Here, to ensure that Seq2Fun 2.0 also reproduces results obtained using traditional approaches, we carried out two cases studies using organisms with reference transcriptomes (American lobster and zebrafish), as well as with one new case study involving salamander species with and without reference genomes. Seq2Fun 2.0 produced almost identical PCA variance structures and relative numbers of DEGs between treatment groups compared to analysis with reference transcriptomes (see SI for details), following results obtained for Seq2Fun 1.0. Here, we focus on the third case study to demonstrate how the concepts and tools described in this paper can be used to efficiently analyze and interpret a comparative transcriptomics dataset from multiple salamander species, some without reference transcriptomes.
The RNA-seq dataset was originally collected as part of a comparative study of transcriptional responses to limb regeneration in three ambystomatid salamander species (Dwaraka, Smith et al. 2019), one with a reference genome (Ambystomatidae mexicanum, abbreviation MEX) and two without (Ambystomatidae andersoni, abbreviation AND; Ambystomatidae maculatum, abbreviation MAC). In the original experiment, an upper arm was amputated from larvae from each of the three species, and tissue samples were taken at the time of amputation (time0), and 24 hours after amputation (time24). Three individuals from each species and time group were sequenced, resulting in 3 reps * 2 time points * 3 species = 18 RNA-seq samples. RNA-seq data was quantified using a reference transcriptome from MEX and de novo transcriptomes from AND and MAC. Differential expression analysis was conducted separately for each species, and differentially expressed genes (DEGs) that were shared across species were identified by searching for sequence similarities using the BLAST algorithm.
For our analysis, FASTQ files were downloaded from NCBI GEO (accession GSE116777) and were re-processed in nine hours with the Seq2Fun module (vertebrates database) using EcoOmicsAnalyst. The count table was uploaded to ExpressAnalyst for downstream statistical and functional analysis. PCA plots on the “Data Normalization” page show that the primary source of variability in the normalized counts matrix is species differences, as shown by the separation according to species along PC1 (Fig. 4A). AND and MEX samples fall closer to each other than to MAC, which makes sense given that AND is more closely related to MEX (estimated divergence time = 4.27 million years) than MAC is (estimated divergence time = 21.47 million years) (Hedges, Marin et al. 2015). The second largest source of variability was time since amputation, shown by the separation of samples with time0 and time24 annotations along PC2 (Fig. 4A).
Differential expression analysis was performed with the limma option to identify genes that were significantly different between time0 and time24 across species by analyzing all samples together and considering species as a blocking factor. This takes variability associated with species into account when calculating the p-value for differences between time0 and time24. Using the same statistical thresholds as the original publication (adj. p-value < 0.1 (FDR), no log2FC cut-off), a total of 2780 DEGs were identified.
The “Interactive Volcano Plot” tool in ExpressAnalyst was used to perform overrepresentation analysis (ORA) separately on the up- and down-regulated genes with KEGG, Gene Ontology Biological Process (GO BP), Molecular Function (GO MF), and Cellular Component (GO CC) gene sets (Fig. 4B). Overall, there were 27 significantly enriched pathways in the list of up-regulated genes and 81 in the list of down-regulated genes. The up-regulated pathways were mainly related to immune response, cell proliferation, and programmed cell death. The down-regulated pathways were mainly related to muscle tissue and cellular metabolism. This is consistent with the functional analysis results reported by the original publication which noted that up and down-regulated genes were enriched in GO BP terms related to wound healing and tissue development, and muscle tissue and cellular metabolism, respectively (Dwaraka, Smith et al. 2019).
The top five DEGs (Table 4) were queried in EcoOmicsDB to investigate gene-level details of the ortholog groups of interest. The database can be queried by Seq2Fun or Entrez ID to retrieve details and generate graphics on all genes and species in the relevant ortholog group. For convenience, the table of differential expression results in ExpressAnalyst contain links to the EcoOmicsDB profiles for all Seq2Fun ortholog IDs. Further, graphics on the coverage of different species sub-groups provides valuable insights into the taxonomic domain of Seq2Fun ortholog groups.
Table 4
Top 5 DEGs from the salamander case study. The p-value and log2FC statistics are from the ExpressAnalyst results; the # Entrez ID and # Species are from ortholog profiles in EcoOmicsDB.
Seq2Fun ID | Symbol | Adj. p-value | log2FC | # Entrez ID | # Species |
s2f_0005954003 | PLEK2 | 4.72E -8 | 4.22 | 222 | 207 |
s2f_0000105017 | MMP8 | 1.06E -7 | 6.22 | 156 | 143 |
s2f_0000040013 | TNC | 1.06E -7 | 4.34 | 309 | 218 |
s2f_0000131002 | LAMB3 | 1.06E -7 | 4.60 | 234 | 211 |
s2f_0000105021 | MMP12 | 2.25E -7 | 5.99 | 463 | 153 |
EcoOmicsDB shows that the top five DEGs are supported by lots of evidence (> 155 genes and > 142 species for each). The top four orthologs have an Entrez ID to species ratio close to one, and examination of the tables in EcoOmicsDB show that in cases where there are multiple Entrez IDs from one species, the symbols and descriptions are very similar (i.e. TNC and TNC-like isoform). The 5th ortholog (s2f_0000105021) contains on average three Entrez ID sequences per species. Examination of the EcoOmicsDB output shows that these are generally matrix metalloproteinase-3 (also known as stromelysin; K01394), matrix metalloproteinase-7 (also known as matrilysin; K01397), and matrix metalloproteinase-12 (also known as macrophage elastase; K01413). The Entrez ID-specific GO terms show that they have identical functional annotations in nearly all cases, for example the details in EcoOmicsDB for Entrez IDs 109569210 (MMP12, zebu cattle), 102388711 (MMP7, Chinese alligator), and 103089307 (MMP3, Yangtze river dolphin). Taken together, there is ample evidence that these differentially expressed orthologs are robust and represent real genes/proteins.
The original publication reported 405 transcripts that were significantly impacted in all three species. We quantified all RNA-seq samples with Seq2Fun, even though there was a reference transcriptome for one species. This greatly simplified the downstream analysis. Since all quantified samples shared the same set of Seq2Fun IDs, the data could be integrated across species and analyzed in a single DEA, which improved the statistical power because the overall sample size was 18, instead of three separate analyses with six samples each. This likely explains our 2780 DEGs versus their 405. Finally, we note that the use of salamanders makes this a particularly strong case for our proposed RNA-seq analysis framework. Amphibians can have notoriously large genomes (Liedtke, Gower et al. 2018), estimated to range from 14 to 120 GB across salamander species (for reference, the human genome is 3.2 GB) (Nowoshilow, Schloissnig et al. 2018). Performing de novo assembly of two salamander genomes would be extremely computationally intensive. The full analysis for this case study, including raw reads processing, statistical analysis, and figure preparation, took less than 24 hours, was completed without using the command line or R programming, and was all done on a laptop computer. In cases where more detailed isoform analysis is desired, targeted assembly of reads mapping to individual ortholog groups of interest can be performed.