(w)HOL(e)ISTIC gene ontology and pathway analysis of data using open source web tools

Objective Downstream analysis of next generation sequencing (NGS) experiments provides researchers a means of deciphering their results. These downstream analyses elucidate clusters of genes or networks of biological interest under the conditions being studied. One convention for examining gene interactions is to conduct downstream investigations based on gene ontology (GO), pathway, and network analyses of statistically significant genes of interest. Unfortunately, the software available for these types of analyses is expensive, not species specific, and subject to gaps in annotation. These difficulties can cause studies to omit this downstream step, limiting the utility of the data. In order to facilitate pathway and network analyses of candidate gene lists from NGS studies, a workflow was constructed based on the use of open-sourced freely available software and genomic databases termed the “(w)HOL(e)ISTIC GO enrichment” approach. Results Overlap of multiple open-source software was used to annotate, analyze, and visualize biological networks. It is a 3-stage process in which stage 1 (Annotation) is the generation of alias identifiers. Stage 2 (Analysis) is a two-part process generating ontologies and networks with statistical inferences. Stage 2 relies on information from databases such as Reactome, KEGG, and InterPro. Stage 3 (Visualization) allows for figure creation.


Introduction
The data used for the example in this article is from a differential gene expression analysis of monocyte derived cells infected with either the highly pathogenic or low pathogenic Porcine Reproductive and Respiratory Syndrome Virus (PRRSV) designated as HP-PRRSV and VR-2332 respectively. The candidate genes generated from this analysis were based on comparisons of the HP-PRRSV or VR-2332 infected cells that had been stimulated with one of five immune function cytokines.
The samples were derived from blood monocytes from at least five outbred piglets that have similar genetic background. Total RNA was extracted from the pooled cells of four replicates, and used to construct sequencing libraries for the transcriptomic analysis. Sequence libraries were used to generate 100 bp paired-end reads using the Illumina® 2500 HiSEq. Procedures for read processing, mapping, and alignment have been previously described [1]. Normalization of gene counts was 3 carried out using the rlog transformation function and calculations of differential gene expression for each comparison were based upon the average log expression and the regularized log 2 fold change (rLogFC) using DESeq2 [2]. Genes were annotated using the Ensembl database [3] and g:Profiler [4].
The fold changes reported are the difference in expression values for each gene based on the comparison and a threshold of equal to or greater than +/-2 rLogFC. Ensembl identifier (ID), rLogFC, and normalized count data for each treatment group and replicate were compared. These gene lists were used as the query lists for network prediction.

Main Text
In order to facilitate pathway and network analyses of candidate gene lists from high-throughput studies, an in-house method referred to as the (w)HOL(e)ISTIC GO and pathway analysis workflow was constructed based on the use of open source freely available software and genomic databases. The example data is from "Comparative analysis of signature genes in PRRSV-infected porcine monocytederived cells to different stimuli" published by Miller et al., 2017 [5]. The (w)HOL(e)ISTIC GO enrichment approach is useable with multiple species, and allows use of species specific annotations when available.
A list of differentially expressed genes with a rLogFC ≥ 2 or ≥ -2 was supplied from each comparison to the (w)HOL(e)ISTIC GO and pathway analysis work flow. The workflow can be used with data from any analysis method. The workflow takes as input a user-defined gene list. Application of the workflow to a gene list consists of several steps: annotation, analysis, and lastly visualization.
The first step is the annotation of the query list from the gene expression or other experiment (Fig. 1).
This step calls for a researcher to take their output list of genes from their analysis and generate a list of gene name aliases for each gene of interest. This step is carried out in the first step to assuage issues with the lack of robust gene and protein annotation in many livestock and non-model species.
This step is carried out by starting with the query list in your preferred nomenclature [i.e. Ensembl, Human Genome Organisation Gene Nomenclature Committee (HGNC)]. Next copy and paste the query list into any or all of the following web-based software: Ensembl Biomart [3], g:Profiler These software programs will search various biological databases to find any additional gene/protein names associated with the query list of results. The order in which the software is used doesn't matter because all terms will be aggregated at the end prior to moving on to step 2. The main goal of this step is to provide the researcher with multiple gene names to better ensure that the results of interest can examined regardless of up-to-date or out-of-date gene curation. The gene aliases also help to examine syntenic regions and orthologues in case a researcher needs to rely on sequence homology as part of their analyses.
The second step will use the query list that has been populated with the gene aliases to carry out a two-part analysis step. Part one of this step uses the expanded query list to perform a gene ontology (GO) analysis using several web based programs. The use of multiple GO analysis software programs is done to allow a researcher the ability to compare consensus in any terms, pathways/networks, or statistical models shared between the programs. Examination of consensuses pathways and functions is done to afford the researcher repeatability, which in turn leads to greater confidence in an experiments results. The GO analysis portion of step 2 is carried out using the web-based programs: GOtermFinder [7], PantherDB [8], DAVID6.8 [9][10][11], and the g:Profiler (g:GoSt Tool) [4].
Part two of this step employs the programs STITCH [12]and STRING [13] to predict any possible genegene or gene-chemical interactions, in order to help uncover possible gene networks or interactions related to the results. This portion of step 2 is also based upon the expanded query list from step one and as output provides a visual representation of nodes (genes) and edges (the interaction) that exist as networks within the data. The software does not differentiate between the expression levels of the genes in the list, but does draw in information from various databases to predict the effect (positive, negative, unknown) a gene is expected to exert on another in the network. In the predicted network outputs, a red line connecting nodes represents inhibition, green lines represent activation; dark blue lines represent binding; purple lines represent catalysis; yellow lines represent transcriptional regulation, light blue represents phenotype, and black lines are representative of reaction.
The last step of the method is the visualization step, which can be carried over from the pathway/network analysis portion of step 2. This is because the software STITCH [12] and STRING [13] 5 produces a visual output that can be manipulated and downloaded as an image file.

Limitations Annotation
Some ontology programs will eliminate duplicate genes and aliases, others won't, be sure to check input type accepted, compare starting list with output, to avoid duplicate counts.
Multiple databases and IDs might be needed. Ensembl Biomart will only allow 3 external sources for name annotation, g:Profiler will only do one at a time.
Check versions and updates of software. The most recent may not always be the best as it may be a beta version.
Order of which software is used doesn't matter as you will aggregate all terms and aliases at the end prior to step 2. Analysis 2 stage step (ontology and pathway) Be mindful of usefulness of statistics vs. network or pathway.
In the PantherDB GO analysis only one dataset will be output at a time. Check the results that the input list numbers match.
Be aware of most current updates (reference genome and annotation).
All programs give a text and/or html option for output.
Visualization STRING/STITCH images can be manipulated to give the best or additional views.  Step 3. Visualization.