To facilitate pathway and network analyses of candidate gene lists from high-throughput studies, an in-house method, the (w)HOL(e)ISTIC GO and pathway analysis workflow, was constructed from freely available, open-source software and genomic databases. The example data are from “Comparative analysis of signature genes in PRRSV-infected porcine monocyte-derived cells to different stimuli” by Miller et al., 2017 [5]. The (w)HOL(e)ISTIC GO enrichment approach can be used with multiple species and allows the use of species-specific annotations when available.
A list of differentially expressed genes with an rLogFC ≥ 2 or ≤ -2 from each comparison was supplied to the (w)HOL(e)ISTIC GO and pathway analysis workflow. The workflow takes a user-defined gene list as input and can be used with data from any analysis method. Applying the workflow to a gene list consists of three steps: annotation, analysis, and visualization.
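The selection of candidate genes described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the cutoff is meant as an absolute rLogFC of at least 2 (that is, ≥ 2 or ≤ -2). The gene names, the values, and the function `select_candidates` are hypothetical and are not part of the published workflow.

```python
# Hypothetical differential-expression results as (gene name, rLogFC) pairs.
# The names and values are invented for illustration only.
results = [
    ("GENE_A", 3.1),
    ("GENE_B", -2.4),
    ("GENE_C", 0.7),
    ("GENE_D", -1.9),
    ("GENE_E", 2.0),
]


def select_candidates(rows, cutoff=2.0):
    """Keep genes whose rLogFC is >= cutoff or <= -cutoff."""
    return [gene for gene, rlogfc in rows if abs(rlogfc) >= cutoff]


# The resulting list is what would be pasted into the annotation tools.
query_list = select_candidates(results)
print(query_list)
```

Using the absolute value keeps both strongly up-regulated and strongly down-regulated genes with a single comparison.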
The first step is the annotation of the query list from the gene expression or other experiment (Fig. 1). Here the researcher takes the output list of genes from their analysis and generates a list of gene name aliases for each gene of interest. Annotation is performed first to mitigate the lack of robust gene and protein annotation in many livestock and non-model species. Start with the query list in the preferred nomenclature [e.g. Ensembl, Human Genome Organisation Gene Nomenclature Committee (HGNC)]. Next, copy and paste the query list into any or all of the following web-based tools: Ensembl BioMart [3], g:Profiler (g:Convert tool) [4], HGNC [HGNC Comparison of Orthology Predictions tool (HCOP)] [6], or others. These programs search various biological databases for any additional gene/protein names associated with the query list. The order in which the tools are used does not matter, because all terms are aggregated before moving on to step 2. The main goal of this step is to provide the researcher with multiple names for each gene, so that the results of interest can be examined regardless of whether the gene curation is up to date. The gene aliases also help in examining syntenic regions and orthologues when a researcher needs to rely on sequence homology as part of their analyses.
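The aggregation of aliases at the end of step 1 can be sketched as follows. This is a minimal illustration, assuming each tool's results have already been exported and read into a dictionary that maps a query identifier to the aliases that tool returned. The identifiers, the alias values, and the function `aggregate_aliases` are hypothetical; no tool API is called.

```python
from collections import defaultdict


def aggregate_aliases(*alias_tables):
    """Merge per-tool alias tables into one sorted name list per query gene."""
    merged = defaultdict(set)
    for table in alias_tables:
        for query_id, aliases in table.items():
            merged[query_id].add(query_id)  # keep the original name as well
            # Skip blank entries and strip surrounding whitespace.
            merged[query_id].update(a.strip() for a in aliases if a.strip())
    return {query_id: sorted(names) for query_id, names in merged.items()}


# Hypothetical exports from three annotation tools (placeholder names only).
biomart_export = {"QUERY_1": ["ALIAS_A"], "QUERY_2": ["ALIAS_C"]}
gconvert_export = {"QUERY_1": ["ALIAS_A", "ALIAS_B"]}
hcop_export = {"QUERY_2": ["ALIAS_D", " "], "QUERY_3": []}

expanded = aggregate_aliases(biomart_export, gconvert_export, hcop_export)
for query_id, names in sorted(expanded.items()):
    print(query_id, names)
```

Because the aliases are collected into sets, the merged result is the same whichever tool is consulted first, and names returned by more than one tool appear only once.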
The second step uses the query list populated with gene aliases to carry out a two-part analysis. Part one uses the expanded query list to perform a gene ontology (GO) analysis with several web-based programs. Multiple GO analysis programs are used so that a researcher can look for consensus in the terms, pathways/networks, or statistical models shared between the programs. Examining consensus pathways and functions affords the researcher repeatability, which in turn leads to greater confidence in an experiment's results. The GO analysis portion of step 2 is carried out using the web-based programs GOtermFinder [7], PantherDB [8], DAVID 6.8 [9–11], and g:Profiler (g:GOSt tool) [4].
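One way to look for consensus between the GO programs is to count how many of them report each term. The sketch below assumes the enriched terms from each program have already been exported as lists. The term labels are placeholders rather than real GO identifiers, and the function `consensus_terms` and its `min_tools` parameter are hypothetical.

```python
from collections import Counter


def consensus_terms(results_by_tool, min_tools=2):
    """Return the terms reported as enriched by at least `min_tools` programs."""
    counts = Counter()
    for terms in results_by_tool.values():
        counts.update(set(terms))  # count each term at most once per program
    return sorted(term for term, n in counts.items() if n >= min_tools)


# Hypothetical enriched-term lists, one per program (placeholder labels).
enriched = {
    "GOtermFinder": ["TERM_A", "TERM_B"],
    "PantherDB": ["TERM_A", "TERM_C"],
    "DAVID": ["TERM_A", "TERM_B", "TERM_B"],
    "g:Profiler": ["TERM_D"],
}

shared = consensus_terms(enriched)
print(shared)
```

Raising `min_tools` makes the consensus stricter; setting it to the number of programs keeps only terms that every program reports.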
Part two of this step employs the programs STITCH [12] and STRING [13] to predict possible gene-gene or gene-chemical interactions, which helps uncover gene networks or interactions related to the results. This portion of step 2 is also based on the expanded query list from step 1 and outputs a visual representation of the nodes (genes) and edges (interactions) that form networks within the data. The software does not differentiate between the expression levels of the genes in the list, but it does draw on information from various databases to predict the effect (positive, negative, unknown) that one gene is expected to exert on another in the network. In the predicted network outputs, red lines connecting nodes represent inhibition; green lines, activation; dark blue lines, binding; purple lines, catalysis; yellow lines, transcriptional regulation; light blue lines, phenotype; and black lines, reaction.
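When recording network results by hand, the colour legend above can be kept as a simple lookup table. The sketch below encodes only the colour-to-action pairs stated in the text. The function `describe_edge` and the example edges are hypothetical, and nothing here reads STITCH or STRING output directly.

```python
# Edge colour legend as described in the text for the predicted networks.
EDGE_ACTIONS = {
    "red": "inhibition",
    "green": "activation",
    "dark blue": "binding",
    "purple": "catalysis",
    "yellow": "transcriptional regulation",
    "light blue": "phenotype",
    "black": "reaction",
}


def describe_edge(source, target, colour):
    """Translate an edge colour noted from the network image into its action."""
    action = EDGE_ACTIONS.get(colour.strip().lower(), "unknown")
    return f"{source} -> {target}: {action}"


# Hypothetical edges transcribed from a network image (placeholder names).
edges = [
    ("GENE_A", "GENE_B", "Red"),
    ("GENE_B", "GENE_C", "dark blue"),
    ("GENE_C", "CHEMICAL_X", "grey"),
]
descriptions = [describe_edge(*edge) for edge in edges]
for line in descriptions:
    print(line)
```

Colours that are not in the legend are reported as "unknown" rather than raising an error.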
The last step of the method is visualization, which can be carried over from the pathway/network analysis portion of step 2, because STITCH [12] and STRING [13] produce visual output that can be manipulated and downloaded as an image file.