Gene Set linkage analysis: a tool for interpreting the overall functional impacts of observed transcriptomic changes

To date, the majority of software tools developed for high-level interpretation of transcriptomics data were based on annotation enrichment. These tools use existing biological concepts to describe the observed omics changes. However, if an observation cannot be accurately described by an existing concept, these tools can not report any term or will report very general terms (such as “biological process”, GO:0008150), which provides limited assistance for researchers to understand the data and design further investigation. We present the gene set linkage analysis (GSLA) tool for interpretation of the collective functional impacts of a set of changed genes. The GSLA algorithm relies on a functional association network to evaluate whether the observed omics changes collectively interfere with functions of known biological processes. Although an omics change may not be accurately described by an existing concept, its functional impact may still be described by well-established concepts. GSLA has been shown useful in several previous studies. It derived novel insights into high-level coordination of physiological processes, where conventional annotation enrichment-based tools did not provide similar insights. This standalone version of GSLA tool integrates interaction networks for four species (i.e., A. thaliana, D. melanogaster, H. sapiens and S. cerevisiae) and using four kinds of annotation gene sets (i.e., Gene Ontology, Reactome pathway, Panther pathway and Wikipathways) for each species.


Background
Biological traits of interest, such as disease states and crop yields, are produced by interactions among different physiological functions and biological processes, and each physiological function and biological process is implemented by a network of molecular interactions to ensure its robustness and coordination with other functions and processes.
In the age of omics, we know that each individual has genetic idiosyncrasies and that biological functions and processes are more robust and consistent across populations.
Therefore, researchers tend to uncover high-level laws of biology from omics data at the biological process level because these laws are more robust, logical, and can be consistently applied in a population.
Various tools have been developed for omics data interpretation. Most tools are based on the annotation enrichment strategy. The widely used tools include GSEA [1], DAVID [2] and Enrichr [3]. These tools use existing concepts of biological processes to describe the observed omics changes. However, if an observation (i.e., an actual biological process) cannot be accurately described by an existing concept, these tools do not report the biological process or report very general biological processes (such as "biological process", GO:0008150), which does not help researchers understand the data or provide direction for further investigation.

Implementation
To address this challenge, we developed the gene set linkage analysis (GSLA) method, which relies on a functional association network to evaluate whether observed omics changes collectively interfere with the functions of known biological processes. Although an omics change may not be accurately described by an existing concept, its functional impact can still be described by well-established concepts. In previous studies, GSLA has been shown to distinguish sensitive and insensitive cell lines based on the transcriptome changes measured before the onset of apoptosis. GSLA correctly anticipated that only the sensitive cell lines would undergo apoptosis [4]. GSLA was also shown to be capable of identifying a key functional impact exerted by stem cells to rescue fulminant hepatic failure in pigs. A cytokine known to regulate this function significantly improved animal survival in both rat and pig models [5]. In both cases, widely used annotation enrichmentbased approaches did not provide similar insights.
This software article describes the standalone version of the GSLA tool, which is an operating system-independent Java program. This program supports 4 species: A. thaliana, D. melanogaster, H. sapiens and S. cerevisiae. It can use local computing resources to improve analyzing efficiency. This program has been successfully applied in a number of previous studies, and the results show that GSLA can provide useful insights that are valuable for designing further investigations, whereas the widely used annotation enrichment-based approaches do not provide similar insights.
Algorism description GSLA relies on testing two hypotheses to detect substantial functional linkages between biologically meaningful gene sets (one set represents the observed changes, and the other set represents a known biological process or function). The first hypothesis (Q1) expects that the inter-gene-set interaction density between functionally linked gene sets is higher than the background interaction density between random gene sets (i.e., the interactions between two functionally linked gene sets are stronger than those between two random gene sets). The second hypothesis (Q2) expects that the observed high density between functionally linked gene sets can only be observed in a biologically correct interactome (i.e., the interactions between two functionally linked gene sets observed in a biologically correct interactome are greater than those observed in random interactomes consisting of the same genes and same topology). Q1 tests the strength of the functional linkage between two gene sets. Q2 verifies that the observed robust linkage is due to the biologically accurate network topology (i.e., our knowledge of molecular mechanisms) rather than the compositions of these two gene sets. Q2 is used to remove the confounding factor of gene set composition and ensure the "biological significance" of the detected functional linkages between gene sets. Q1 and Q2 are related but different tests.
These tests complement each other to increase the sensitivity and specificity of GSLA.
To test Q1, the density of the inter-gene-set interactions is first calculated as the observed number of inter-gene-set interactions divided by the total number of possible inter-gene-set gene pairs (See Fig. 1, Q1). A common gene shared by two gene sets is treated as two distinct delegate genes, and each delegate gene belongs to only one gene set. Any gene that interacts with the shared gene is considered to interact with both delegate genes in both gene sets. Two gene sets sharing a gene do not constitute an inter-gene-set interaction.
To test Q2, the interaction network is randomized 100000 times. Suppose that there are two interactions that involve four genes (i.e., G1-G2 and G3-G4). By replacing these two interactions with two new interactions (i.e., G1-G3 and G2-G4), the topology of the interaction network changes, but the gene composition and number of neighbors of each gene remain the same. This topology perturbation is performed 100000 times to create one random interactome. Using 100000 random interactomes, the P-value for Q2 is computed as the fraction of random interactomes in which the densities of the inter-geneset interactions are higher than the densities observed in the correct interaction network (See Fig. 1, Q2).

Input gene set
To interpret the biological significance of an experimentally observed molecular profile, we recommend using the "most significant" 20-200 genes to represent an observed omics change. The GSLA algorithm considers all query genes equal; therefore, using too many genes will likely dilute the focus on the most important changes. However, because profiling data are intrinsically noisy and interaction networks have false positives and false negatives, using too few genes will not suppress this noise and take advantage of the corroborative strength of a gene set. In our experience, using the top 20-200 significant genes to represent a molecular profile usually produces consistent results.

Interaction network
The quality and coverage of interaction networks used in GSLA determines whether the query gene set can be correctly annotated to biologically meaningful gene sets. For D.
melanogaster and S. cerevisiae, their experimentally reported protein-protein interactions Annotation gene sets GSLA assesses the functional linkage between an observed gene set and a functional gene set to annotate the potential functional impacts of the observed gene set. Several types of functional gene sets are used, including GO biological processes [10], the Panther pathway [11,12], the Reactome pathway [13], and WikiPathways [14]. For all species  [15], the expected mean of this normal distribution is the density of the interactions among all genes (0.0008). As shown in Additional file 1, the expected standard deviation approaches 0.0012 as N increases in our simulation, where N random gene pairs are sampled 100000 times to calculate the mean and standard deviation of the interaction density. According to these data, the not very high inter-geneset interaction density of 0.01 translates into a small P-value (P<10-10) if the two gene sets both have ≥20 members (N≥400). The functional linkages between two gene sets that are worthy of experimental investigation are anticipated to be much stronger than those linkages that marginally pass the significance threshold. Therefore, we set the cutoff value for Q1 to ≥0.01 (indicating that the functional linkages between two gene sets are sufficiently strong) and that for Q2 to ≤0.001 (indicating that the functional linkages between two gene sets are significant) as default. Users can optimize the results by adjusting the cutoff values.

Output file
The output file contains annotation gene sets that have functional linkages with the queried gene set. These annotation gene sets represent phenotype changes that are responsible for the genotype changes.

Result
We use two examples to illustrate the usage of GSLA.
Understanding the stress factors in yeast secondary fermentation GSLA is frequently used to derive insights into omics changes that are associated with specific phenotypes at the biological process level. Here, the yeast secondary fermentation data (GEO accession GSE29273 [16]) were reanalyzed to illustrate this usage. To investigate the main factors that affect yeast physiology during its secondary fermentation, Penacho et al. performed a genome-wide expression analysis of an industrial wine yeast strain under real secondary fermentation conditions. Comparisons were performed between samples obtained at different time points to identify genes that are simultaneously up-or downregulated. The authors analyzed the biological annotations that were significantly associated with these genes using GENECODIS. Their results showed that the main factor influencing gene transcription during secondary fermentation is alcohol respiration, but GENECODIS did not report other stress factors, such as nitrogen starvation.
In our analysis, the GEO2R online service was used to identify differentially expressed genes that satisfied "t statics >0" or "t statics < 0" and were simultaneously up-or downregulated across different time points. Genes were sorted by the absolute value of the t-statics, and the top 50 genes with significantly changed expression, containing both up-and downregulated genes (Additional file 2), were analyzed.
The GSLA results are provided in Additional file 3. In our GSLA analysis of the upregulated genes, some terms were consistent with the results of GENECODIS, such as "electron transport chain", "cellular respiration", and "TCA cycle". In addition, GSLA reported another category of terms that are related to nitrogen metabolism, including "glutathione metabolic process" and "metabolism of amino acids and derivatives". Nitrogen starvation is an important stress factor in secondary fermentation, and the consumption of glutamine is a manifestation of nitrogen starvation [17]. In the original publication, GENECODIS also identified new peripheral terms that are consistent with nitrogen starvation, such as increased ROS, Sod2p, Cat1p activities and the upregulation of autophagy. However, GENECODIS failed to identify any term directly related to nitrogen starvation.
Similarly, the analysis of the downregulated genes also showed that GSLA and GENECODIS produced consistent terms, but the terms produced by GSLA are more specific and relevant.
We also used DAVID [2] and WebGestalt [18] to analyze the same gene list. GSLA was the only tool that reported nitrogen starvation-related processes in the results (Additional file 4). These results showed that GSLA has an improved capability of understanding the functional impacts of an omics change.
Predicting functions of microRNAs GSLA can also predict the overall functional impacts of a set of genes. MicroRNAs are known to regulate target genes to achieve their high-level functions. In this example, GSLA was used to predict high-level microRNA function via an analysis of its target genes.
The target genes of an Arabidopsis microRNA, ath-MIR417, are listed in Additional file 5, along with their GO annotations. Our GSLA analysis of these target genes identified a biological function that has not been previously annotated to any of the ath-MIR417 target genes, namely, water deprivation (Table 1). This biological process was identified because two genes functioning in water deprivation have strong functional interactions with one target gene of ath-MIR417, and these functional interactions passed the GSLA criteria.
This prediction is consistent with a study showing that the expression of ath-MIR417 was significantly changed under dehydration stress [19].

Discussion
GSLA uses interaction networks and annotated functional gene sets to identify the potential functions of a query gene set. To demonstrate the utility of GSLA, we reanalyzed two published datasets, which were parallelly analyzed with widely used enrichment-based tools. In the first case, GSLA demonstrated superior capability in identifying known stress factors that were missed by enrichment-based tools. In the second case, we showed that GSLA may be used to predict the functional impacts of a biological meaningful gene set (e.g. target genes of a microRNA).
Previously, GSLA has also been used in a series of transcriptomic data analyses. It has been used in analysis of the expression profiles of multiple myeloma cells treated by lovastatin. Results show that different mechanisms led sensitive cells to apoptosis and insensitive cells to survival. The identification of disrupted apoptosis pathway was achieved using transcriptomic profiles observed at an early time point before apoptosis really took place. GSLA were compared with other enrichment-based tools. Only GSLA produced this insight [4]. In another study of human bone mesenchymal stem cells rescuing fulminant hepatic failure in pigs, GSLA found a biological process that explains a significant proportion of the transcriptomic benefits of stem cells. Further analysis confirmed that a known regulatory cytokine of this biological process can increase survival of fulminant hepatic failure in both pig and rate models [5]. In an Arabidopsis study, GSLA explained the mechanism of abscisic acid hypersensitivity in Arabidopsis roa1 mutant during its seedling stage, and identified a gene without significant expression change but with significant functional associations to genes that show expression changes. This gene was later demonstrated to have a key role in producing the phenotype [9]. In these examples, enrichment-based analysis tools were parallelly used. Only GSLA produced these insights.
A limitation of GSLA is the requirement for high quality interaction networks to infer functional synergy and high quality annotation gene sets to express the collective functional impacts of a gene set. For a number of species, lacking high quality interaction networks and high quality annotation gene sets limited the capability of GSLA to produce insights that may inspire further study. Continued development of high quality interaction networks and annotation gene sets for more model organisms will fill the gap.

Conclusion
The GSLA tool is designed for interpretation of the collective functional impacts of a set of

Consent for publication
Not applicable.

Supplementary Files
This is a list of supplementary files associated with the primary manuscript. Click to download.