A unified platform for RNA-seq analysis in non-model species

doi:10.21203/rs.3.rs-2187066/v1

This work is licensed under a CC BY 4.0 License

Journal publication: published 24 May 2023 in Nature Communications

Version 1

The increasing application of RNA-seq to study non-model organisms demands easy-to-use and efficient bioinformatics tools to help researchers quickly uncover biological and functional insights from large datasets. Here, we present a unified software suite for processing, analyzing, and interpreting RNA-seq data from any eukaryotic species. This suite consists of a) EcoOmicsDB (www.ecoomicsdb.ca), a database for ortholog mapping and cross-species comparison; b) EcoOmicsAnalyst (www.ecoomicsanalyst.ca), a platform for raw data processing and annotation; and c) ExpressAnalyst (www.expressanalyst.ca), a platform for statistical and functional analysis. The utilities of this suite are demonstrated through case studies of RNA-seq data from multiple non-model species with or without reference transcriptomes. By coupling ultra-fast read mapping algorithms with high-resolution ortholog databases through a user-friendly web interface, the tool suite enables researchers to obtain global expression profiles and gene-level insights from raw RNA-seq reads within 24 hours.

Biological sciences/Computational biology and bioinformatics/Computational platforms and environments

Biological sciences/Computational biology and bioinformatics/Data processing

Biological sciences/Computational biology and bioinformatics/Databases/Genetic databases

Biological sciences/Computational biology and bioinformatics/Functional clustering

Biological sciences/Computational biology and bioinformatics/Software

The last decade has seen growing applications of RNA-seq to environmental and agricultural studies involving non-model organisms (Wachi, Matsubayashi et al. 2018). Reference genomes/transcriptomes are not available for many of these species, and thus de novo-assembled transcriptomes are typically required to quantify raw RNA-seq reads. The approach contains two main steps: transcript assembly and gene annotation. The first step, assembly, involves piecing together putative transcripts from the raw RNA-seq data, which is a computationally intensive task that typically requires weeks of runtime on high-performance computers (HPC) equipped with hundreds of gigabytes (GB) of memory. Commonly used software includes Trinity, SOAPdenovo-Trans, and Oases (Luo, Liu et al. 2012, Schulz, Zerbino et al. 2012, Haas, Papanicolaou et al. 2013). In the second step, possible annotations are assigned to the assembled transcripts in the form of gene symbols, short descriptions, and functions as defined by widely used pathway libraries and functional ontologies such as Kyoto Encyclopedia of Genes and Genomes (KEGG) and Gene Ontology (GO) (Kanehisa and Goto 2000, Consortium 2019). Annotation is usually performed with Basic Local Alignment Search Tool (BLAST) based algorithms through a process called “annotation transfer” that leverages existing genome annotations from other species (Otto, Dillon et al. 2011). Despite the wide adoption of this strategy, there is no consensus approach for choosing how many or which genomes to use, or for how to resolve multiple, conflicting, or missing annotations (Raghavan, Kraft et al. 2022). Count tables obtained from those de novo-assembled transcriptomes are difficult to analyze and interpret, as this process often results in hundreds of thousands of transcripts, most of which have either inconsistent or missing functional annotations (e.g., hypothetical proteins) (Liao, Li et al. 2019, Raghavan, Kraft et al. 2022).

In summary, the current practice for RNA-seq analysis involving non-model organisms is computationally intensive, requires advanced programming skills, and produces transcript IDs (usually in-house IDs) that are difficult to reproduce, compare across studies, and ultimately re-use. There is an urgent demand for computationally efficient, user-friendly, reproducible, and functionally coherent methods for RNA-seq processing and analysis for non-model organisms.

Here we describe a unified conceptual framework for reads quantification and annotation of RNA-seq data from non-model organisms and present three companion tools to streamline the whole process (Fig. 1). By leveraging the concept that short reads (~ 75bp) can be uniquely mapped to ortholog groups (Schatz, Delcher et al. 2010, Chhangawala, Rudy et al. 2015), the typical steps of transcript assembly and annotation are replaced with the alignment of individual translated reads directly to a comprehensive, high-resolution protein ortholog database created from the genomes of hundreds of eukaryotic species. Downstream analysis is performed on the resulting high-resolution ortholog count table, similar to increasingly popular strategies to process shotgun metagenomics data (Menzel, Ng et al. 2016).

Our current effort builds upon a command-line algorithm, Seq2Fun, that we previously developed for mapping RNA-seq reads from eukaryotic species to KEGG ortholog (KO) databases with a translated search (Liu, Ewald et al. 2021). We demonstrated that Seq2Fun outperforms de novo assembly in terms of accuracy, precision, computing time, and memory usage, while producing functional profiles that are consistent with traditional methods (Liu, Ewald et al. 2021). Despite this promising result, subsequent testing revealed several important limitations associated with this KO-based annotation system. The first is limited transcriptome coverage. Not all protein-coding genes are annotated with KOs. For example, the human genome has 19,648 protein-coding genes, of which only 14,964 (76.16%) are annotated with KOs (Kanehisa, Sato et al. 2016). Coverage is even lower for non-mammalian species. For example, the zebrafish genome has 26,584 protein-coding genes, of which only 16,322 (61.40%) are annotated with KOs. This incomplete transcriptome coverage is biased, particularly for non-mammalian species for which fewer KEGG pathways are defined, and some biological processes, such as egg yolk formation in oviparous species, are notably absent. For example, vitellogenin, a precursor to egg yolk protein formation and a biomarker of interest (Hansen, Dizer et al. 1998), is not found in the KO system. The second limitation is transcript resolution, as KO groups often gather many genes from one species together. While this is inevitable to some extent during ortholog definition, increased resolution would be an asset to many in the user community. The third limitation is the lack of functional annotation beyond KEGG pathways. Finally, the command-line interface of Seq2Fun limits accessibility to those with programming skills.
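The coverage percentages quoted above follow directly from the gene counts given in the text. A minimal sketch of that arithmetic is shown here; the dictionary layout and function name are illustrative only.

```python
# Counts quoted in the text: (KO-annotated genes, protein-coding genes).
ko_counts = {
    "human": (14_964, 19_648),
    "zebrafish": (16_322, 26_584),
}


def ko_coverage_percent(annotated: int, total: int) -> float:
    """Percentage of protein-coding genes that carry a KO annotation."""
    return 100.0 * annotated / total


for species, (annotated, total) in ko_counts.items():
    pct = ko_coverage_percent(annotated, total)
    print(f"{species}: {pct:.2f}% of {total:,} protein-coding genes have a KO")
```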

In the current work, we have developed three web-based tools that work in concert to enable streamlined, user-friendly analysis of RNA-seq data from any eukaryotic species. To demonstrate its utility, we provide a comprehensive case study involving RNA-seq data from three species of salamanders. In doing so we show that researchers can use this suite of tools to obtain comprehensive functional insights from raw RNA-seq reads from any eukaryotic species generally in less than 24 hours of runtime, the majority of which is unsupervised upload (if using our public server) and processing time. A powerful feature of these tools is that they are web-based and can be run on personal computers by individuals who have limited (or no) knowledge of programming.

Overview of the software ecosystem

We have developed a suite of web-based tools consisting of EcoOmicsDB (www.ecoomicsdb.ca) for ortholog query and cross-species comparisons, EcoOmicsAnalyst (www.ecoomicsanalyst.ca) for raw reads processing using EcoOmicsDB, and ExpressAnalyst (www.expressanalyst.ca) for expression profiling. While the primary motivation for these tools is to improve RNA-seq analysis for non-model organisms, they can also be used to analyze data from common model organisms. The three software tools are described in more detail below.

EcoOmicsDB: A high-resolution ortholog database for cross-species gene annotation

EcoOmicsDB is a custom ortholog database that was developed to significantly improve the resolution and transcriptome coverage of the KEGG ortholog database described previously in Seq2Fun version 1.0 (Liu, Ewald et al. 2021). It currently includes ~ 13 million protein-coding genes from 596 species (Table 1). Of these protein-coding sequences, 5,871,017 were annotated with KEGG pathways and 1,567,627 were annotated with GO terms. The 596 species were organized into 29 taxonomic sub-groups, based on the NCBI taxonomy database (Schoch, Ciufo et al. 2020). Symbols, descriptions, and functional annotations were harmonized across individual proteins for each ortholog group (more details are provided in the methods section). All details for each ortholog group in EcoOmicsDB are accessible via a web-interface (www.ecoomicsdb.ca) that can be queried by either ortholog ID or Entrez ID.

Table 1

EcoOmicsDB contains 29 taxonomic sub-groups available for RNA-seq annotation and quantification of non-model organisms in eukaryotes. The taxonomic levels are indicated by the number in the Level column and also by indentations and font styles in the Group column.
Level	Group	Species	Proteins	Ortholog	Seq2Fun (v1)
1	Eukaryotes	596	12 828 537	666 067	8041
2	Animals	370	7 150 735	270 089	-
3	Vertebrates	212	4 588 985	83 704	6723
4	Mammals	94	1 910 363	47 144	5883
4	Birds	31	482 205	22 397	4177
4	Reptiles	20	384 584	21 725	4342
4	Fishes	61	1 736 572	42 497	4308
3	Arthropods	119	1 727 651	113 673	3541
4	Insects	101	1 376 824	70 170	-
4	Crustaceans	7	154 960	37 407	-
3	Cnidarians	9	203 000	24 003	-
3	Mollusks	9	206 905	35 775	-
3	Nematodes	6	134 093	35 865	2324
2	Plants	127	3 968 027	162 988	3012
3	Eudicots	93	3 180 221	102 677	-
3	Monocots	17	560 027	43 451	-
3	Algae	14	155 495	38 334	-
2	Fungi	138	1 278 312	148 080	2423
3	Ascomycetes	100	904 642	98 151	-
4	Eurotiomycetes	20	196 228	25 710	-
4	Saccharomycetes	36	195 913	14 873	-
4	Dothideomycetes	10	123 200	28 898	-
3	Basidiomycetes	33	363 997	56 935	-
2	Protists	52	660 237	134 451	2696
3	Alveolates	21	207 674	51 205	-
4	Apicomplexans	18	93 576	14 632	-
3	Stramenopiles	8	119 746	31 581	-
3	Amoebozoa	7	81 844	22 114	-
3	Euglenozoa	9	86 483	12 363	-

EcoOmicsDB was created with the OrthoFinder software (Emms and Kelly 2019), which identifies rooted ortholog groups by inferring groups of sequences that share a common ancestor. The sequence-similarity parameters for ortholog definition were chosen to produce ortholog groups at a higher resolution than the KO system, and species-specific functional annotations were compiled to produce both KEGG and GO term gene sets for Seq2Fun ortholog IDs (Kanehisa, Sato et al. 2016, Consortium 2019). The first round of analysis took more than ten days to complete using a server with 54 threads and 504 GB RAM. The analysis binned 12,828,537 protein-coding sequences from 596 species into 666,067 ortholog groups. The size distribution of these ortholog groups largely follows a power law distribution (Fig. 2A) (Girvan and Newman 2002, Arita 2005). While most ortholog groups contain fewer than ten sequences, the largest ones were many times larger than this, with the biggest one (s2f_0000000) containing more than 50,000 transcription factor sequences. Of particular concern to our toxicology-focused authors was the 5th largest ortholog group (s2f_0000004) that contained > 28,000 cytochrome P450 enzymes, an average of 47 per species. Grouping at this level makes it difficult to infer gene-level insights. To address this issue, we further split the largest 10,000 ortholog groups into 76,066 groups with an adaptive k-means clustering-based approach (Fig. 2B), with the largest ortholog group being split into more groups (n = 96) than the smallest (n = 2). An example of this is shown for the vitellogenin ortholog group (Fig. 2C), a protein family that is important in the study of non-mammalian vertebrate species because it is a highly sensitive biomarker for exposure to estrogenic compounds. It is not found in the KEGG ortholog database. Details on the ortholog splitting approach, parameters, and rationale are given in the Methods section.
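The idea of splitting large ortholog groups into a size-dependent number of sub-groups can be illustrated with a small sketch. This is not the procedure used to build EcoOmicsDB, whose parameters are given in the Methods. The rule in `adaptive_k` (log-scale interpolation between 2 and 96), the group sizes, and the use of generic numeric feature vectors to represent sequences are all assumptions made for illustration.

```python
import math


def adaptive_k(group_size, min_size, max_size, k_min=2, k_max=96):
    """Hypothetical rule: larger groups are split into more sub-groups."""
    if max_size == min_size:
        return k_min
    frac = (math.log(group_size) - math.log(min_size)) / (
        math.log(max_size) - math.log(min_size)
    )
    return round(k_min + frac * (k_max - k_min))


def kmeans(points, k, iterations=50):
    """Plain Lloyd's algorithm; centroids are seeded with evenly spaced points."""
    step = len(points) / k
    centroids = [list(points[int(i * step)]) for i in range(k)]
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_labels == labels:
            break  # converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Toy feature vectors standing in for sequences of one ortholog group.
toy_points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
toy_labels = kmeans(toy_points, k=2)
print(toy_labels)
```

In practice the feature vectors would have to be derived from the sequences themselves (for example from pairwise similarity), which this sketch leaves out.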

EcoOmicsAnalyst: Efficient RNA-seq data processing via web-based interface

EcoOmicsAnalyst is a web-based tool designed to produce count tables from raw RNA-seq reads for any eukaryotic species. For species with no reference transcriptome, reads are aligned to EcoOmicsDB using Seq2Fun 2.0. For species with a reference transcriptome, users can choose between using either Seq2Fun or Kallisto (Bray, Pimentel et al. 2016). EcoOmicsAnalyst has a user account system to allow users to upload, store, and process FASTQ files on our server. Each account is limited to 30GB of data. If users are uncomfortable uploading their data, have a dataset larger than 30GB, or want to avoid the time-consuming upload step, we provide a Docker image for local installation.

The EcoOmicsAnalyst interface and workflow are designed around ‘Projects’ (Figure S1). Upon login, users are brought to their ‘Project Home’, which includes a row for each project with a status bar that indicates whether the job is “running” or “completed”. Users can initiate a new project from this page by clicking either “With Reference” for Kallisto-based quantification or “Without Reference” for Seq2Fun-based quantification. After initiating a project, users are brought to the “File Management” page, which allows them to view and navigate FASTQ files in their directory on the server. Files can be added to this directory using FileZilla (filezilla-project.org). Detailed instructions and tutorials on how to do this are available on this page. Next, the “Data Sanity Check” page is used to select samples for quantification, label samples with experimental factors, and, in the case of paired-end reads, group forward and reverse reads from the same sample. The “Reads Quantification” page allows users to select a reference transcriptome or use our ortholog database for read mapping together with a few quantification parameters. After job submission, the “Job Status View” page will be displayed, showing a summary of the number of completed samples and a log of the Kallisto or Seq2Fun output. At this stage, users can close the EcoOmicsAnalyst browser tab, and the job will continue to run on the server. Total processing time will vary depending on current server load, dataset size, and sample sequencing depth, but in our experience a dataset with 15 FASTQ files that are ~1 GB each takes around six hours to complete. Users will receive email notices once their jobs are complete. The “Project Results” page includes a summary table of quantification statistics, PCA plots, rarefaction curves, and plots of read quality before and after filtering. From here, users can go to the “Download Results” page to obtain the count table, feature annotation details, and other analysis information. The count table is ready for downstream analysis using ExpressAnalyst.

The underlying algorithm of EcoOmicsAnalyst is Seq2Fun 2.0 (www.seq2fun.ca). Version 2.0 significantly reduces the memory footprint while maintaining high efficiency compared to version 1.0 (Liu, Ewald et al. 2021). For example, Seq2Fun 2.0 maintains a throughput of ~2 million reads per minute while decreasing memory usage from 1.49 GB to 0.94 GB on our test datasets, despite version 2.0 having a much larger database (> 125 times increase) than version 1.0. Version 2.0 also includes a new function called SeqTract to retrieve the mapped reads for a given list of genes. Seq2Fun writes the mapped reads of all genes to a single fastq.gz file for single-end reads or to a pair of fastq.gz files for paired-end reads. To conduct a targeted gene assembly, SeqTract takes a list of Seq2Fun IDs and the mapped fastq.gz files as input and outputs a fastq.gz file for each ID. It supports multi-threading, consumes very little memory, and is highly efficient. The fastq.gz file for each gene can be fed into popular de novo assemblers (Hölzer and Marz 2019) to assemble the contigs for primer design, isoform identification, or phylogenetic analysis.

ExpressAnalyst: Comprehensive transcriptome profiling through visual analytics

ExpressAnalyst is a web-based platform combining visualization, statistics, and functional analysis for gene expression data analysis. It currently contains three modules according to the data input type: ‘Enrichment Analysis’ for functional analysis of gene lists, ‘Expression Profiling’ for statistical and functional analysis of gene expression tables, and ‘Meta Analysis’ for statistical and functional analysis of several gene expression tables collected under similar experimental conditions. The ‘Enrichment Analysis’ and ‘Meta Analysis’ modules have been described in our previous publications (Xia, Fjell et al. 2013, Xia, Lyle et al. 2013, Zhou, Soufan et al. 2019). Here, we focus on the ‘Expression Profiling’ module, which contains many new and updated features to specifically support results generated from EcoOmicsAnalyst.

The ‘Expression Profiling’ module supports microarray or RNA-seq tables for 28 model species, as well as Seq2Fun IDs and KEGG ortholog IDs for any eukaryotic species with transcriptomics data. Users can also upload their own custom annotation files. After data upload, the ‘Data Quality Check’ page gives statistical and visual summaries of the feature annotation and raw data. Next, the ‘Data Filtering & Normalization’ page allows users to filter features by variance and abundance and to apply several common normalization techniques. Boxplots, density plots, and PCA plots allow users to examine their data before and after applying these normalization steps. On the ‘Differential Expression Analysis’ (DEA) page, users can select between several established DEA methods, namely limma, edgeR and DESeq2 (Robinson, McCarthy et al. 2010, Love, Huber et al. 2014, Ritchie, Phipson et al. 2015), with their associated parameters. Next is the ‘Select Significant Genes’ page, where users can define their p-value and/or fold-change cut-offs and view a table of the resulting gene expression results (Fig. 3A). All genes in the table are linked to their NCBI gene cards or EcoOmicsDB ortholog profiles (Fig. 3B). Users can view violin plots of expression values across experimental factors (Fig. 3A). Finally, the ‘Analysis Overview’ provides six visual analytics functions to help identify important features, functions and their correlations through interactive Volcano Plot, Enrichment Network, Ridgeline Chart, Heatmap, etc. (Fig. 3C-D).
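To make the filtering and normalization step concrete, the sketch below applies an abundance filter, a simplified log2 counts-per-million transform, and a variance filter to a toy count table. It is an illustration of the general idea only: the thresholds, function names, and the exact transform are assumptions and do not reproduce the options implemented in ExpressAnalyst.

```python
import math
from statistics import pvariance


def filter_by_abundance(counts, min_total=4):
    """Keep features whose summed count across samples reaches min_total."""
    return {gene: row for gene, row in counts.items() if sum(row) >= min_total}


def log2_cpm(counts, prior=1.0):
    """Simplified log2 counts-per-million; the prior avoids log2(0)."""
    n_samples = len(next(iter(counts.values())))
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {
        gene: [math.log2(row[j] / lib_sizes[j] * 1e6 + prior) for j in range(n_samples)]
        for gene, row in counts.items()
    }


def filter_by_variance(values, drop_fraction=0.15):
    """Drop the given fraction of features with the lowest variance."""
    ranked = sorted(values, key=lambda gene: pvariance(values[gene]))
    n_drop = int(len(ranked) * drop_fraction)
    return {gene: values[gene] for gene in ranked[n_drop:]}


# Toy count table: feature -> counts in four samples (made-up numbers).
toy_counts = {
    "g1": [10, 20, 30, 40],
    "g2": [0, 0, 1, 0],
    "g3": [100, 100, 100, 100],
    "g4": [5, 50, 5, 50],
}
abundant = filter_by_abundance(toy_counts)
normalized = log2_cpm(abundant)
retained = filter_by_variance(normalized, drop_fraction=0.34)
print(sorted(retained))
```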

Benchmark and case studies

Seq2Fun 1.0 was rigorously validated in our original publication (Liu, Ewald et al. 2021). Here, to ensure that Seq2Fun 2.0 also reproduces results obtained using traditional approaches, we carried out two case studies using organisms with reference transcriptomes (American lobster and zebrafish), as well as one new case study involving salamander species with and without reference genomes. Seq2Fun 2.0 produced almost identical PCA variance structures and relative numbers of DEGs between treatment groups compared to analysis with reference transcriptomes (see SI for details), in line with the results obtained for Seq2Fun 1.0. Here, we focus on the third case study to demonstrate how the concepts and tools described in this paper can be used to efficiently analyze and interpret a comparative transcriptomics dataset from multiple salamander species, some without reference transcriptomes.

The RNA-seq dataset was originally collected as part of a comparative study of transcriptional responses to limb regeneration in three ambystomatid salamander species (Dwaraka, Smith et al. 2019), one with a reference genome (Ambystoma mexicanum, abbreviation MEX) and two without (Ambystoma andersoni, abbreviation AND; Ambystoma maculatum, abbreviation MAC). In the original experiment, an upper arm was amputated from larvae from each of the three species, and tissue samples were taken at the time of amputation (time0) and 24 hours after amputation (time24). Three individuals from each species and time group were sequenced, resulting in 3 replicates × 2 time points × 3 species = 18 RNA-seq samples. RNA-seq data were quantified using a reference transcriptome from MEX and de novo transcriptomes from AND and MAC. Differential expression analysis was conducted separately for each species, and differentially expressed genes (DEGs) that were shared across species were identified by searching for sequence similarities using the BLAST algorithm.
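The experimental design described above can be enumerated to confirm the sample count. The sample naming scheme below is hypothetical and is not taken from the GEO record.

```python
from itertools import product

species = ["MEX", "AND", "MAC"]
time_points = ["time0", "time24"]
replicates = [1, 2, 3]

# One label per sequenced individual: species x time point x replicate.
samples = [
    f"{sp}_{tp}_rep{rep}" for sp, tp, rep in product(species, time_points, replicates)
]
print(len(samples))
```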

For our analysis, FASTQ files were downloaded from NCBI GEO (accession GSE116777) and were re-processed in nine hours with the Seq2Fun module (vertebrates database) using EcoOmicsAnalyst. The count table was uploaded to ExpressAnalyst for downstream statistical and functional analysis. PCA plots on the “Data Normalization” page show that the primary source of variability in the normalized counts matrix is species differences, as shown by the separation according to species along PC1 (Fig. 4A). AND and MEX samples fall closer to each other than to MAC, which makes sense given that AND is more closely related to MEX (estimated divergence time = 4.27 million years) than MAC is (estimated divergence time = 21.47 million years) (Hedges, Marin et al. 2015). The second largest source of variability was time since amputation, shown by the separation of samples with time0 and time24 annotations along PC2 (Fig. 4A).

Differential expression analysis was performed with the limma option to identify genes that were significantly different between time0 and time24 across species by analyzing all samples together and considering species as a blocking factor. This takes variability associated with species into account when calculating the p-value for differences between time0 and time24. Using the same statistical thresholds as the original publication (adj. p-value < 0.1 (FDR), no log2FC cut-off), a total of 2780 DEGs were identified.
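The FDR threshold above refers to p-values adjusted for multiple testing. A minimal sketch of the Benjamini-Hochberg adjustment is given below, assuming this is the FDR method applied. It does not reproduce limma (an R package) or its moderated statistics, and the p-values are made-up numbers.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


raw_p = [0.01, 0.04, 0.03, 0.20]
adj_p = benjamini_hochberg(raw_p)
n_significant = sum(p < 0.1 for p in adj_p)
print(adj_p, n_significant)
```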

The “Interactive Volcano Plot” tool in ExpressAnalyst was used to perform overrepresentation analysis (ORA) separately on the up- and down-regulated genes with KEGG, Gene Ontology Biological Process (GO BP), Molecular Function (GO MF), and Cellular Component (GO CC) gene sets (Fig. 4B). Overall, there were 27 significantly enriched pathways in the list of up-regulated genes and 81 in the list of down-regulated genes. The up-regulated pathways were mainly related to immune response, cell proliferation, and programmed cell death. The down-regulated pathways were mainly related to muscle tissue and cellular metabolism. This is consistent with the functional analysis results reported by the original publication which noted that up and down-regulated genes were enriched in GO BP terms related to wound healing and tissue development, and muscle tissue and cellular metabolism, respectively (Dwaraka, Smith et al. 2019).
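Overrepresentation analysis is commonly computed as a one-sided hypergeometric test. The sketch below shows that calculation under the assumption that this is the test being applied; the source does not state the exact test used by ExpressAnalyst, and the numbers in the example are made up.

```python
from math import comb


def ora_p_value(hits, list_size, set_size, universe):
    """P(X >= hits) when drawing list_size genes from a universe that
    contains set_size genes belonging to the pathway of interest."""
    upper = min(list_size, set_size)
    total = comb(universe, list_size)
    tail = sum(
        comb(set_size, k) * comb(universe - set_size, list_size - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Toy example: 5 of 5 selected genes fall in a 5-gene pathway, universe of 10.
p_all_hits = ora_p_value(hits=5, list_size=5, set_size=5, universe=10)
print(p_all_hits)
```

In a real analysis the resulting p-values would still need multiple-testing correction across all tested gene sets.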

The top five DEGs (Table 4) were queried in EcoOmicsDB to investigate gene-level details of the ortholog groups of interest. The database can be queried by Seq2Fun or Entrez ID to retrieve details and generate graphics on all genes and species in the relevant ortholog group. For convenience, the table of differential expression results in ExpressAnalyst contains links to the EcoOmicsDB profiles for all Seq2Fun ortholog IDs. Further, graphics on the coverage of different species sub-groups provide valuable insights into the taxonomic domain of Seq2Fun ortholog groups.

Table 4

Top 5 DEGs from the salamander case study. The p-value and log2FC statistics are from the ExpressAnalyst results; the # Entrez ID and # Species are from ortholog profiles in EcoOmicsDB.
Seq2Fun ID	Symbol	Adj. p-value	log2FC	# Entrez ID	# Species
s2f_0005954003	PLEK2	4.72E-8	4.22	222	207
s2f_0000105017	MMP8	1.06E-7	6.22	156	143
s2f_0000040013	TNC	1.06E-7	4.34	309	218
s2f_0000131002	LAMB3	1.06E-7	4.60	234	211
s2f_0000105021	MMP12	2.25E-7	5.99	463	153

EcoOmicsDB shows that the top five DEGs are supported by substantial evidence (> 155 genes and > 142 species for each). The top four orthologs have an Entrez ID to species ratio close to one, and examination of the tables in EcoOmicsDB shows that in cases where there are multiple Entrez IDs from one species, the symbols and descriptions are very similar (e.g., TNC and TNC-like isoform). The 5th ortholog (s2f_0000105021) contains on average three Entrez ID sequences per species. Examination of the EcoOmicsDB output shows that these are generally matrix metalloproteinase-3 (also known as stromelysin; K01394), matrix metalloproteinase-7 (also known as matrilysin; K01397), and matrix metalloproteinase-12 (also known as macrophage elastase; K01413). The Entrez ID-specific GO terms show that they have identical functional annotations in nearly all cases; see, for example, the details in EcoOmicsDB for Entrez IDs 109569210 (MMP12, zebu cattle), 102388711 (MMP7, Chinese alligator), and 103089307 (MMP3, Yangtze river dolphin). Taken together, there is ample evidence that these differentially expressed orthologs are robust and represent real genes/proteins.
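The Entrez ID to species ratios discussed above can be recomputed from the last two columns of Table 4. The snippet below is a small worked calculation; the variable names are illustrative.

```python
# (number of Entrez IDs, number of species) per ortholog, taken from Table 4.
top_degs = {
    "PLEK2": (222, 207),
    "MMP8": (156, 143),
    "TNC": (309, 218),
    "LAMB3": (234, 211),
    "MMP12": (463, 153),
}

ratios = {symbol: n_ids / n_species for symbol, (n_ids, n_species) in top_degs.items()}
for symbol, ratio in ratios.items():
    print(f"{symbol}: {ratio:.2f} Entrez IDs per species")
```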

The original publication reported 405 transcripts that were significantly impacted in all three species. We quantified all RNA-seq samples with Seq2Fun, even though there was a reference transcriptome for one species. This greatly simplified the downstream analysis. Since all quantified samples shared the same set of Seq2Fun IDs, the data could be integrated across species and analyzed in a single DEA, which improved the statistical power because the overall sample size was 18, instead of three separate analyses with six samples each. This likely explains our 2780 DEGs versus their 405. Finally, we note that the use of salamanders makes this a particularly strong case for our proposed RNA-seq analysis framework. Amphibians can have notoriously large genomes (Liedtke, Gower et al. 2018), estimated to range from 14 to 120 Gb across salamander species (for reference, the human genome is 3.2 Gb) (Nowoshilow, Schloissnig et al. 2018). Performing de novo assembly of two salamander genomes would be extremely computationally intensive. The full analysis for this case study, including raw reads processing, statistical analysis, and figure preparation, took less than 24 hours, was completed without using the command line or R programming, and was all done on a laptop computer. In cases where more detailed isoform analysis is desired, targeted assembly of reads mapping to individual ortholog groups of interest can be performed.

The main motivation of this work was to address the major computational bottlenecks facing researchers collecting RNA-seq data from non-model organisms (Ambrosino, Colantuono et al. 2018, Ambrosino, Tangherlini et al. 2019). As the costs of acquiring these data continue to drop, data analysis is increasingly becoming the limiting factor as many research groups do not have the in-house expertise or resources to process, analyze, and interpret RNA-seq data. From firsthand experience, existing point-and-click software for de novo assembly can cost upwards of $10 000 USD for the assembly and annotation alone. Hiring a contract bioinformatician is also expensive and can result in a lengthy analysis. While the initial analysis may be conducted quickly, each follow-up research query or visualization modification requires communication, typically over email. By removing barriers related to computing resources, programming skills, and knowledge of bioinformatics databases, we believe that the combination of EcoOmicsDB, EcoOmicsAnalyst, and ExpressAnalyst will make RNA-seq data processing and analysis more accessible to the environmental life sciences community.

Using our current framework, RNA-seq data can be easily compared across an unlimited number of species without increasing the complexity of finding ortholog matches. This contrasts with typical BLAST-based approaches where pairwise sequence-similarity searches must be conducted between transcriptomes of different species, greatly increasing the complexity of the analysis with each additional species. Resolving conflicting ortholog annotations is unavoidable in RNA-seq analysis of non-model organisms: even when researchers choose to analyze their data with a de novo transcriptome, they still must annotate this de novo transcriptome by drawing on functional information from other species (Conesa, Götz et al. 2005, Otto, Dillon et al. 2011). EcoOmicsDB comprehensively addresses ortholog grouping and annotation across many species and, when used as the database for Seq2Fun, produces count tables that always use the same set of IDs, so it is well suited for cross-species comparisons of transcriptomics data. Such comparisons are of great interest as demonstrated by recent efforts to use ‘omics data for cross-species extrapolation of toxicity mechanisms and regulatory applications in the field of environmental toxicology (LaLone, Basu et al. 2021). ExpressAnalyst enables fast hypothesis-analysis-conclusion cycles through its interactive visual analytics tools such as the heatmap, ridgeline, and upset plots. By connecting dense statistical results to powerful bioinformatics databases containing functional information in a flexible and visual format, the user is guided to move from overall trends and patterns to investigate prominent results in great depth.

The genomes of non-model organisms have been much less studied than those of mammalian model organisms, and there are still many uncharacterized IDs in published reference genomes. With continued development and engagement with the environmental life sciences community, particularly on expanding Seq2Fun to quantify non-coding genes and further improving ortholog annotation (details in SI), EcoOmicsDB can serve as the reference resource for transcript identification and functional annotation in non-model organisms. In future versions, we envision a system in which functional information learned from individual transcriptomics studies is added to the ortholog profiles in EcoOmicsDB. Over time, knowledge from such studies can be pooled across species to gain insights into uncharacterized proteins. This will not only improve the annotation and functional analysis of current studies but will also benefit the whole research community that works on orthologs and non-model organism transcriptomics.

Overview of the Seq2Fun algorithm

Seq2Fun is designed to efficiently perform translated search of RNA-seq reads against a protein database and was described and evaluated in detail in our first publication (Liu, Ewald et al. 2021). Here we give a brief overview of the Seq2Fun algorithm for convenience. The core algorithm is based on the Burrows-Wheeler Transform (BWT) and the FM-index (Full-text index in Minute space; FMI). BWT rearranges millions of protein sequences into a highly compressible form, which significantly reduces the size of the database while maintaining a small memory footprint. FMI creates a searchable data structure based on the compressed sequences that can quickly find matches to short amino acid sequences.
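The way a BWT-based index supports fast exact matching can be shown with a toy example. The sketch below builds a naive suffix array, BWT, and occurrence table for a single short peptide and then runs the standard FM-index backward search. It is a textbook illustration, not the Seq2Fun implementation: real indexes avoid the quadratic suffix sort, sample the occurrence table, and cover millions of sequences. The peptide is made up.

```python
from collections import Counter


def build_fm_index(text):
    """Return (suffix array, C table, occurrence table) for text + '$'."""
    text += "$"  # '$' sorts before all amino acid letters
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    counts = Counter(text)
    # c_table[ch]: number of characters in the text strictly smaller than ch.
    c_table, total = {}, 0
    for ch in sorted(counts):
        c_table[ch] = total
        total += counts[ch]
    # occ[ch][i]: number of occurrences of ch in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in occ:
            occ[ch][i + 1] = occ[ch][i] + (b == ch)
    return sa, c_table, occ


def backward_search(pattern, sa, c_table, occ):
    """Return sorted start positions of pattern, matching right to left."""
    lo, hi = 0, len(sa)  # half-open interval of matching suffixes
    for ch in reversed(pattern):
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


protein = "MKVLAAGIVKVLA"  # made-up peptide
sa, c_table, occ = build_fm_index(protein)
print(backward_search("KVLA", sa, c_table, occ))
```

Each pattern character narrows the interval of matching suffixes with two table lookups, so the cost of a query depends on the pattern length rather than on the size of the database.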

The first step in the Seq2Fun algorithm is to conduct read quality control and join paired-end reads. Next, each cleaned read is translated into dozens of peptide sequences using all possible reading frames. In addition to finding the true peptide among these, Seq2Fun must also consider many other sequences that capture amino acid substitutions and deletions because the reference database contains many species that cover a wide taxonomic range (Schatz, Delcher et al. 2010, Chhangawala, Rudy et al. 2015). Considering all possible amino acid substitutions for all possible reading frames is not computationally feasible, and so Seq2Fun employs two filters to greatly reduce the search space: peptide length and BLOSUM62 score. BLOSUM62 is a widely used substitution matrix that scores each amino acid replacement according to how often it is observed in alignments of related proteins. First, translated sequences generated by different reading frames are filtered by length, on the theory that the ‘true’ translation will result in a long amino acid sequence. If there are multiple amino acid sequences tied for the longest length, the most promising one is selected based on BLOSUM62 scores.
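The length filter described above can be sketched as a six-frame translation followed by selection of the longest stop-free fragment. This is a simplified illustration using the standard genetic code, not the Seq2Fun source: the BLOSUM62 tie-breaking step is omitted, and the read is a made-up sequence.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons; unknown codons (e.g. containing N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_fragments(read):
    """Stop-free peptide fragments from all six reading frames."""
    fragments = []
    for strand in (read, reverse_complement(read)):
        for offset in range(3):
            for fragment in translate(strand[offset:]).split("*"):
                if fragment:
                    fragments.append(fragment)
    return fragments


def longest_fragment(read):
    """Length filter only; ties are not resolved by BLOSUM62 in this sketch."""
    return max(six_frame_fragments(read), key=len)


read = "ATGAAAGTTTTAGCTGCTGGTATT"  # made-up 24-nt read
print(longest_fragment(read))
```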

Next, the search starts with a seed of seven amino acids at the C terminus of the fragment to find potential start positions of matches to sequences in the reference database. A backward search strategy is used to extend potential matches from the C terminus to the N terminus. Allowing mismatches helps overcome evolutionary distances between the query and target sequences; however, considering all amino acid substitutions would significantly slow down the search. Seq2Fun reduces the search space by again using BLOSUM62 scores to prioritize the most likely substitutions. The search for sequences matching the query fragment continues until the N terminus is reached or the maximum number of substitutions is exceeded. The search for a given read stops if the potential match meets certain thresholds, including a minimum matching score and a maximum number of substitutions. Otherwise, the search continues with the next most likely peptide according to the previously outlined criteria. Finally, after each read has been searched, Seq2Fun summarizes the number of best matches to each ortholog group into a count table.
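The seed-and-extend logic can be illustrated with a deliberately simplified sketch that works on plain strings rather than on the FM-index. The function names and the `max_subs` parameter are hypothetical, every mismatch is counted equally, and the BLOSUM62-based prioritization of substitutions is omitted.

```python
SEED_LEN = 7  # seed length given in the text


def find_seed_ends(query, target, seed_len=SEED_LEN):
    """Return end positions (exclusive) in target where the query's
    C-terminal seed matches exactly."""
    seed = query[-seed_len:]
    ends = []
    start = target.find(seed)
    while start != -1:
        ends.append(start + seed_len)
        start = target.find(seed, start + 1)
    return ends


def extend_backwards(query, target, end, max_subs):
    """Compare residues from the C terminus towards the N terminus.

    Returns (aligned_length, substitutions), or None once the number of
    substitutions exceeds max_subs.
    """
    subs = 0
    q, t = len(query) - 1, end - 1
    while q >= 0 and t >= 0:
        if query[q] != target[t]:
            subs += 1
            if subs > max_subs:
                return None  # budget exceeded: abandon this candidate
        q -= 1
        t -= 1
    return len(query) - 1 - q, subs


target = "MSTNPKPQRKTKRNTNRRPQDVKFPGG"
query = "PQRATKRNTNRR"  # differs from the target at one residue
for end in find_seed_ends(query, target):
    print(end, extend_backwards(query, target, end, max_subs=2))
```

Anchoring on an exact seed keeps the number of candidate positions small, and the substitution budget bounds the work spent on each candidate.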

We have updated Seq2Fun to version 2.0 to reduce its memory footprint (e.g., from 1.49 GB to 0.94 GB) so that the whole process can be run on a personal computer. We found that the main bottleneck is input/output (I/O) speed. Because of this I/O constraint, using more than 16 threads may not improve performance; a high-speed drive can relieve the constraint. Version 2.0 also includes a new function called SeqTract, which retrieves the mapped reads for a given list of genes. Seq2Fun writes the mapped reads of all genes into a single fastq.gz file for single-end reads or a pair of fastq.gz files for paired-end reads. To support targeted gene assembly, SeqTract takes a list of Seq2Fun IDs and the mapped fastq.gz files as input and outputs one fastq.gz file per ID. It supports multi-threading, consumes very little memory, and is highly efficient. The fastq.gz file for each gene can be fed into popular de novo assemblers (Hölzer and Marz 2019) to assemble contigs for primer design, isoform identification, or phylogenetic analysis.
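The kind of per-gene read extraction that SeqTract performs can be sketched in a few lines of Python. This is a single-threaded illustration only, not SeqTract itself. How Seq2Fun records the assigned ID for each mapped read is not described here, so the sketch takes a caller-supplied `id_from_header` function; the `last_token` convention and all function names are hypothetical.

```python
import gzip
import os


def split_reads_by_id(mapped_fastq_gz, wanted_ids, out_dir, id_from_header):
    """Write one <ID>.fastq.gz per requested ID; return reads written per ID."""
    wanted = set(wanted_ids)
    os.makedirs(out_dir, exist_ok=True)
    handles, counts = {}, {}
    try:
        with gzip.open(mapped_fastq_gz, "rt") as fin:
            while True:
                record = [fin.readline() for _ in range(4)]  # FASTQ: 4 lines per read
                if not record[0]:
                    break  # end of file
                gene_id = id_from_header(record[0])
                if gene_id not in wanted:
                    continue
                if gene_id not in handles:
                    path = os.path.join(out_dir, gene_id + ".fastq.gz")
                    handles[gene_id] = gzip.open(path, "wt")
                    counts[gene_id] = 0
                handles[gene_id].writelines(record)
                counts[gene_id] += 1
    finally:
        for handle in handles.values():
            handle.close()
    return counts


def last_token(header):
    """Hypothetical convention: the ID is the last whitespace-separated field."""
    return header.split()[-1]
```

A call such as `split_reads_by_id("mapped.fastq.gz", ["id_a", "id_b"], "per_gene", last_token)` would produce `per_gene/id_a.fastq.gz` and `per_gene/id_b.fastq.gz`, each ready to be passed to an assembler. The sketch keeps one output handle open per ID, so a very long ID list would need batching.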

Creation of a high-resolution ortholog database - EcoOmicsDB

All protein-coding genes (n = 12,828,537) from 596 organisms covering all major phyla of eukaryotes were downloaded from KEGG using KEGGREST (version 1.34.0) (Kanehisa, Sato et al. 2016). Protein FASTA files for each species were submitted to OrthoFinder (version 2.5.4) for classification of genes into ortholog groups (parameters: t = 56, a = 25) (Emms and Kelly 2019). OrthoFinder is a highly accurate and scalable pipeline for ortholog inference. It takes protein sequences as input and identifies homologs using both heuristic analysis of similarity scores from pairwise sequence comparisons and gene phylogenetic trees, which clarify the relationships between orthologs and paralogs. OrthoFinder was run on a server with 56 threads and 512 GB of RAM. Ortholog grouping for all 596 organisms took approximately 10 days.

The number of sequences in each ortholog group follows a power law (Fig. 2A), with the largest groups combining tens of thousands of sequences. This level of summary does not approximate gene-level count tables because tens of distinct functional groups are collapsed into one, which makes the results difficult to interpret. To address this, we applied an additional algorithm to the 10,000 largest ortholog groups, splitting each into multiple sub-groups to increase the resolution. First, MAFFT was used to align the protein sequences within each ortholog group (Katoh and Standley 2013). FastTree was then used to generate a phylogenetic tree of all sequences in the group (Price, Dehal et al. 2010). Next, the phylogenetic tree, or dendrogram, was converted into a distance matrix based on pairwise cophenetic distance, which is the height of the dendrogram at the point where the branches containing the two sequences first merge. k-means clustering was then used to split the sequences into groups based on this distance matrix. Choosing an appropriate value of k for each ortholog group was challenging because we sought to minimize the grouping of functionally distinct proteins while maximizing the grouping of the same functional protein across species. There is an inherent tradeoff between these two objectives: the higher the resolution (the larger the value of k), the easier it is to keep functionally distinct proteins apart but the harder it is to group the same protein across species. After several iterations of clustering and evaluating the results, we defined k as (number of sequences / number of species) × 2. For each ortholog group, gene-specific information was collapsed to a single gene symbol with an associated text description, KEGG pathway, and GO annotation by tabulating the frequency of symbols and descriptions and choosing the most frequent ones, after removing generic terms such as “uncharacterized” identified by manual inspection.
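Two of the rules in this paragraph, the choice of k and the collapsing of annotations to the most frequent informative label, are simple enough to sketch directly. The function names below are hypothetical. Rounding k to the nearest integer with a floor of 1 is an assumption, as the text gives only the ratio. The generic-term list is illustrative; only “uncharacterized” is named in the text.

```python
from collections import Counter

# Illustrative list; the text names only "uncharacterized" explicitly.
GENERIC_TERMS = ("uncharacterized", "hypothetical")


def choose_k(n_sequences, n_species):
    """k = (number of sequences / number of species) * 2.

    Rounding and the floor of 1 are assumptions made for this sketch.
    """
    return max(1, round(n_sequences / n_species * 2))


def consensus_label(labels, generic_terms=GENERIC_TERMS):
    """Return the most frequent label after dropping empty and generic ones."""
    informative = [
        label
        for label in labels
        if label and not any(term in label.lower() for term in generic_terms)
    ]
    if not informative:
        return None  # the group remains unannotated
    return Counter(informative).most_common(1)[0][0]


print(choose_k(3000, 500))
print(consensus_label(["vtg1", "uncharacterized protein", "vtg1", "vtg2"]))
```

The ratio of sequences to species approximates how many gene copies per species a group holds, so doubling it errs on the side of finer sub-groups.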

EcoOmicsDB currently consists of 29 taxonomic sub-group databases based on the NCBI taxonomy system (Table 1). Selecting an appropriate sub-group database can improve the speed and specificity of read annotation and quantification. However, if the protein database is not selected properly (e.g., reads from a protist sample are mapped to the “fishes” database), key functional groups may be missed. As an empirical guideline for quality control, we provide summary statistics on the set of “core” orthologs for each sub-group (defined as orthologs that are present in the genomes of > 90% of sub-group species) and advise that > 80% of core genes should typically be quantified when a taxonomically appropriate database is selected.
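The core-ortholog check can be expressed as a short calculation. The sketch below assumes a presence table mapping each species to its set of ortholog IDs and treats an ortholog as quantified when its count is above zero; both the data layout and that definition are assumptions, and the function names are hypothetical.

```python
def core_orthologs(presence, min_fraction=0.9):
    """Return orthologs present in more than min_fraction of species.

    presence: dict mapping species name -> set of ortholog IDs.
    """
    n_species = len(presence)
    species_counts = {}
    for orthologs in presence.values():
        for ortholog in orthologs:
            species_counts[ortholog] = species_counts.get(ortholog, 0) + 1
    return {o for o, n in species_counts.items() if n / n_species > min_fraction}


def core_coverage(count_table, core):
    """Fraction of core orthologs with a non-zero count in the sample."""
    if not core:
        return 0.0
    quantified = {o for o, count in count_table.items() if count > 0}
    return len(core & quantified) / len(core)


# Toy sub-group of 10 species: A and C occur in all, B in 9 of 10.
presence = {f"sp{i}": {"A", "B", "C"} for i in range(9)}
presence["sp9"] = {"A", "C"}
core = core_orthologs(presence)
coverage = core_coverage({"A": 5, "B": 3, "C": 0}, core)
print(sorted(core), coverage, coverage > 0.8)
```

A coverage well below the 0.8 guideline would suggest that the selected sub-group database does not match the sample's taxonomy.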

Implementation of the web-based platforms

EcoOmicsAnalyst and ExpressAnalyst are implemented using the PrimeFaces library (www.primefaces.org, version 12) and R (version 4.1.3). EcoOmicsDB is implemented using the PrimeNG library (version 13). EcoOmicsAnalyst performs read quality checks with fastp (version 0.21.1) and quantification with either Kallisto (version 0.46.1), when a reference is available, or Seq2Fun (version 2.0.2). On the public server, FASTQ file upload is handled by ProFTPD (version 1.3.6) and job management by Slurm (version 20.11.2). Users frequently reported that data uploading is the most time-consuming and challenging step because of limited bandwidth or data security concerns. To address this, we have created a Docker image of EcoOmicsAnalyst that enables RNA-seq processing and quantification on a local computer through the same user-friendly interface. The count tables generated by EcoOmicsAnalyst can be uploaded directly to ExpressAnalyst, which is hosted on a Google Cloud instance. Many features of ExpressAnalyst were previously published as part of the NetworkAnalyst platform (Zhou, Soufan et al. 2019). Here, we separated the general gene expression profiling features from the network building and visualization features. Significant effort went into developing high-performance interactive volcano plots, heatmaps, and ridgeline plots to facilitate exploratory data analysis, with built-in annotation and functional analysis options that support count tables produced by Kallisto or EcoOmicsAnalyst/Seq2Fun. The three tools are available at www.ecoomicsanalyst.ca, www.expressanalyst.ca, and www.ecoomicsdb.ca.

Case study methods

Salamander RNA-seq data were obtained from NCBI’s SRA at sample accessions SRR7499348-SRR7499365, excluding sample SRR7499350, which was identified as an outlier by the QA/QC in the original publication (Dwaraka, Smith et al. 2019). Files were downloaded and converted from SRA to FASTQ format using the NCBI SRA Toolkit (version 2.11.3) before being uploaded to EcoOmicsAnalyst. Samples were aligned to the “Vertebrates” database using Seq2Fun. The count table was uploaded to ExpressAnalyst as an RNA-seq count table and normalized with the “Log2 counts per million” option, which is based on limma voom, followed by differential analysis with limma (Ritchie, Phipson et al. 2015). For differential analysis, time was set as the primary factor and species as the secondary factor, with the secondary factor defined as a “blocking factor”. A “specific comparison” was then performed between ‘time0’ and ‘time24’. Following the original publication (Dwaraka, Smith et al. 2019), genes were considered differentially expressed if their FDR-adjusted p-values were less than 0.1. The list of DEGs was split into up- and down-regulated genes, and each list was analyzed for enriched KEGG pathways using the “Interactive Volcano Plot” tool. A pathway was defined as significantly enriched if its FDR-adjusted p-value was less than 0.05 and the gene set contained at least five DEGs.
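The thresholds used in this case study can be illustrated with a small worked example. The sketch assumes that the FDR adjustment is the Benjamini-Hochberg procedure, which the text does not state explicitly. The function names and toy values are hypothetical and do not reproduce any result from the study.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def split_degs(results, fdr_cutoff=0.1):
    """Split DEGs into up- and down-regulated lists.

    results: list of (gene, log2_fold_change, p_value) tuples.
    """
    adjusted = bh_adjust([p for _, _, p in results])
    up = [g for (g, lfc, _), q in zip(results, adjusted) if q < fdr_cutoff and lfc > 0]
    down = [g for (g, lfc, _), q in zip(results, adjusted) if q < fdr_cutoff and lfc < 0]
    return up, down


def pathway_is_enriched(adjusted_p, n_degs_in_set, fdr_cutoff=0.05, min_hits=5):
    """Apply the two enrichment criteria: FDR below 0.05 and at least five DEGs."""
    return adjusted_p < fdr_cutoff and n_degs_in_set >= min_hits


toy = [("a", 2.0, 0.01), ("b", -1.5, 0.04), ("c", 0.3, 0.03), ("d", -0.2, 0.5)]
print(split_degs(toy))
print(pathway_is_enriched(0.01, 4), pathway_is_enriched(0.01, 5))
```

Requiring a minimum number of DEGs per gene set guards against pathways that reach significance on the strength of only one or two genes.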

Author contributions: The overarching goals and ideas of this project were conceptualized by PL, JE, EL, YSJ, JS, JH, NB, and JX. The novel statistical methods were developed by PL, JE, and JX. Formal analysis, data curation, and visualization were done by JE and PL. PL, JE, OH, ZP, GZ, and JX contributed to the development of software. The software was validated and improved by general user feedback from JE, OH, EL, YSJ, JS, JH, and NB. JH and JX contributed resources, including samples and computing resources. The original draft was written by PL, JE, and JX, and all authors contributed to reviewing and editing of the final manuscript. NB and JX acquired funding and supervised the project.

Acknowledgements

This project was funded by Genome Canada and Génome Québec’s Bioinformatics and Computational Biology (B/CB) grant. We thank the team at the Huntsman Marine Science Centre for providing lobster data through funding from the Government of Canada’s Department of Fisheries and Oceans Multi-Partner Research Initiative. We thank early users who provided valuable feedback. We also thank Joseph O’Brien for technical assistance.

References

Ambrosino, L., C. Colantuono, F. Monticolo and M. L. Chiusano (2018). "Bioinformatics resources for plant genomics: opportunities and bottlenecks in the-omics era." Current Issues in Molecular Biology 27(1): 71–88.
Ambrosino, L., M. Tangherlini, C. Colantuono, A. Esposito, M. Sangiovanni, M. Miralto, C. Sansone and M. L. Chiusano (2019). "Bioinformatics for marine products: An overview of resources, bottlenecks, and perspectives." Marine drugs 17(10): 576.
Arita, M. (2005). "Scale-freeness and biological networks." Journal of biochemistry 138(1): 1–4.
Bray, N. L., H. Pimentel, P. Melsted and L. Pachter (2016). "Near-optimal probabilistic RNA-seq quantification." Nature biotechnology 34(5): 525–527.
Chhangawala, S., G. Rudy, C. E. Mason and J. A. Rosenfeld (2015). "The impact of read length on quantification of differentially expressed genes and splice junction detection." Genome biology 16(1): 1–10.
Conesa, A., S. Götz, J. M. García-Gómez, J. Terol, M. Talón and M. Robles (2005). "Blast2GO: a universal tool for annotation, visualization and analysis in functional genomics research." Bioinformatics 21(18): 3674–3676.
Consortium, G. O. (2019). "The gene ontology resource: 20 years and still GOing strong." Nucleic acids research 47(D1): D330-D338.
Dwaraka, V. B., J. J. Smith, M. R. Woodcock and S. R. Voss (2019). "Comparative transcriptomics of limb regeneration: Identification of conserved expression changes among three species of Ambystoma." Genomics 111(6): 1216–1225.
Emms, D. M. and S. Kelly (2019). "OrthoFinder: phylogenetic orthology inference for comparative genomics." Genome biology 20(1): 1–14.
Girvan, M. and M. E. Newman (2002). "Community structure in social and biological networks." Proceedings of the national academy of sciences 99(12): 7821–7826.
Haas, B. J., A. Papanicolaou, M. Yassour, M. Grabherr, P. D. Blood, J. Bowden, M. B. Couger, D. Eccles, B. Li and M. Lieber (2013). "De novo transcript sequence reconstruction from RNA-seq using the Trinity platform for reference generation and analysis." Nature protocols 8(8): 1494–1512.
Hansen, P.-D., H. Dizer, B. Hock, A. Marx, J. Sherry, M. McMaster and C. Blaise (1998). "Vitellogenin–a biomarker for endocrine disruptors." TrAC Trends in Analytical Chemistry 17(7): 448–451.
Hedges, S. B., J. Marin, M. Suleski, M. Paymer and S. Kumar (2015). "Tree of life reveals clock-like speciation and diversification." Molecular biology and evolution 32(4): 835–845.
Hölzer, M. and M. Marz (2019). "De novo transcriptome assembly: A comprehensive cross-species comparison of short-read RNA-Seq assemblers." Gigascience 8(5): giz039.
Kanehisa, M. and S. Goto (2000). "KEGG: kyoto encyclopedia of genes and genomes." Nucleic acids research 28(1): 27–30.
Kanehisa, M., Y. Sato, M. Kawashima, M. Furumichi and M. Tanabe (2016). "KEGG as a reference resource for gene and protein annotation." Nucleic acids research 44(D1): D457-D462.
Katoh, K. and D. M. Standley (2013). "MAFFT multiple sequence alignment software version 7: improvements in performance and usability." Molecular biology and evolution 30(4): 772–780.
LaLone, C. A., N. Basu, P. Browne, S. W. Edwards, M. Embry, F. Sewell and G. Hodges (2021). "International Consortium to Advance Cross-Species Extrapolation of the Effects of Chemicals in Regulatory Toxicology." Environmental Toxicology and Chemistry 40(12): 3226–3233.
Liao, X., M. Li, Y. Zou, F.-X. Wu and J. Wang (2019). "Current challenges and solutions of de novo assembly." Quantitative Biology 7(2): 90–109.
Liedtke, H. C., D. J. Gower, M. Wilkinson and I. Gomez-Mestre (2018). "Macroevolutionary shift in the size of amphibian genomes and the role of life history and climate." Nature Ecology & Evolution 2(11): 1792–1799.
Liu, P., J. Ewald, J. H. Galvez, J. Head, D. Crump, G. Bourque, N. Basu and J. Xia (2021). "Ultrafast functional profiling of RNA-seq data for nonmodel organisms." Genome research 31(4): 713–720.
Love, M. I., W. Huber and S. Anders (2014). "Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2." Genome biology 15(12): 1–21.
Luo, R., B. Liu, Y. Xie, Z. Li, W. Huang, J. Yuan, G. He, Y. Chen, Q. Pan and Y. Liu (2012). "SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler." Gigascience 1(1): 18.
Menzel, P., K. L. Ng and A. Krogh (2016). "Fast and sensitive taxonomic classification for metagenomics with Kaiju." Nature communications 7(1): 1–9.
Nowoshilow, S., S. Schloissnig, J.-F. Fei, A. Dahl, A. W. Pang, M. Pippel, S. Winkler, A. R. Hastie, G. Young and J. G. Roscito (2018). "The axolotl genome and the evolution of key tissue formation regulators." Nature 554(7690): 50–55.
Otto, T. D., G. P. Dillon, W. S. Degrave and M. Berriman (2011). "RATT: rapid annotation transfer tool." Nucleic acids research 39(9): e57-e57.
Price, M. N., P. S. Dehal and A. P. Arkin (2010). "FastTree 2–approximately maximum-likelihood trees for large alignments." PloS one 5(3): e9490.
Raghavan, V., L. Kraft, F. Mesny and L. Rigerte (2022). "A simple guide to de novo transcriptome assembly and annotation." Briefings in bioinformatics 23(2): bbab563.
Ritchie, M. E., B. Phipson, D. Wu, Y. Hu, C. W. Law, W. Shi and G. K. Smyth (2015). "limma powers differential expression analyses for RNA-sequencing and microarray studies." Nucleic acids research 43(7): e47-e47.
Robinson, M. D., D. J. McCarthy and G. K. Smyth (2010). "edgeR: a Bioconductor package for differential expression analysis of digital gene expression data." Bioinformatics 26(1): 139–140.
Schatz, M. C., A. L. Delcher and S. L. Salzberg (2010). "Assembly of large genomes using second-generation sequencing." Genome research 20(9): 1165–1173.
Schoch, C. L., S. Ciufo, M. Domrachev, C. L. Hotton, S. Kannan, R. Khovanskaya, D. Leipe, R. Mcveigh, K. O’Neill and B. Robbertse (2020). "NCBI Taxonomy: a comprehensive update on curation, resources and tools." Database 2020.
Schulz, M. H., D. R. Zerbino, M. Vingron and E. Birney (2012). "Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels." Bioinformatics 28(8): 1086–1092.
Wachi, N., K. W. Matsubayashi and K. Maeto (2018). "Application of next-generation sequencing to the study of non‐model insects." Entomological Science 21(1): 3–11.
Xia, J., C. D. Fjell, M. L. Mayer, O. M. Pena, D. S. Wishart and R. E. Hancock (2013). "INMEX—a web-based tool for integrative meta-analysis of expression data." Nucleic acids research 41(W1): W63-W70.
Xia, J., N. H. Lyle, M. L. Mayer, O. M. Pena and R. E. Hancock (2013). "INVEX—a web-based tool for integrative visualization of expression data." Bioinformatics 29(24): 3232–3234.
Zhou, G., O. Soufan, J. Ewald, R. E. Hancock, N. Basu and J. Xia (2019). "NetworkAnalyst 3.0: a visual analytics platform for comprehensive gene expression profiling and meta-analysis." Nucleic acids research 47(W1): W234-W241.

The authors declare no competing interests.

supplementary.docx
Supplementary information
SIcs1zebrafish.xlsx
Statistical results for case study 1
SIcs2lobster.xlsx
Statistical results for case study 2
