Software Implementation
The core functions of CORALIS have been implemented in the R programming language (version 4.2.1). Additionally, CORALIS includes several accessory tools to preprocess and set up a database that combines ncRNA-target interaction data from different web sources (Fig. 1).
CORALIS has been structured into three well-defined modules:
Data collection and pre-processing
The Python module requests was used to access and retrieve experimentally supported ncRNA-target gene interaction data from the miRTarBase 9.0 and RNAInter v4 databases.
Database design and setup
The Python module sqlite3 was used to integrate the miRTarBase 9.0 and RNAInter v4 databases into a single SQL database.
Target-enrichment analysis
The CORALIS base functions have been entirely implemented in R.
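The database setup step can be illustrated with a minimal sketch using sqlite3. The table name, column names, and example rows below are assumptions made for illustration only (the placeholder lncRNA and gene names are not real annotations); the actual CORALIS schema is not described here.

```python
import sqlite3

# Hypothetical rows standing in for records parsed from each source.
# Tuples: (ncRNA, target gene, species[, ncRNA type]).
mirtarbase_rows = [
    ("hsa-miR-98-5p", "IGF2BP2", "Homo sapiens"),
    ("hsa-miR-1291", "ABCC1", "Homo sapiens"),
]
rnainter_rows = [
    ("hsa-miR-98-5p", "IGF2BP2", "Homo sapiens", "miRNA"),    # duplicate pair
    ("LNCRNA_PLACEHOLDER", "GENE_X", "Homo sapiens", "lncRNA"),  # made-up names
]


def build_database(conn, mirtarbase_rows, rnainter_rows):
    """Merge both sources into one table, keeping each pair only once."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS interactions (
               ncrna      TEXT NOT NULL,
               target     TEXT NOT NULL,
               species    TEXT NOT NULL,
               ncrna_type TEXT NOT NULL,
               source     TEXT NOT NULL,
               UNIQUE (ncrna, target, species)
           )"""
    )
    # Assumption: every miRTarBase record is a miRNA-target interaction.
    conn.executemany(
        "INSERT OR IGNORE INTO interactions VALUES (?, ?, ?, 'miRNA', 'miRTarBase')",
        mirtarbase_rows,
    )
    # INSERT OR IGNORE skips pairs already loaded from the first source.
    conn.executemany(
        "INSERT OR IGNORE INTO interactions VALUES (?, ?, ?, ?, 'RNAInter')",
        rnainter_rows,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")
build_database(conn, mirtarbase_rows, rnainter_rows)
n_rows = conn.execute("SELECT COUNT(*) FROM interactions").fetchone()[0]
```

Here the uniqueness constraint is one possible way to avoid counting the same interaction twice when it is reported by both sources.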
Target enrichment analysis
The CORALIS package provides a series of functions for conducting ncRNA-target enrichment analysis and visualization. The source database, statistical methods and results summary are detailed below:
Data collection & Databases
CORALIS gathers up to 643974 experimentally validated interactions between ncRNAs (miRNA, lncRNA, snoRNA, snRNA, rRNA) and their target genes (mRNA) from the miRTarBase and RNAInter databases. Briefly, the latest version of miRTarBase contains more than 2 million interactions between ~ 4,630 miRNAs and ~ 23,426 target genes from 37 species (> 10000 curated articles) (3). RNAInter is a repository of RNA-associated interactions, including RNA-RNA, RNA-protein and RNA-DNA interactions, that comprises more than 47 million interactions from 156 species (4). CORALIS combines miRTarBase and RNAInter into a single SQL database containing experimentally validated interactions between ncRNAs and their target mRNAs. Although all the ncRNA types supported here are strongly implicated in the regulation of gene expression, microRNAs remain the most studied molecules due to their key role as regulators of the cell cycle and of disease development, such as cancer, and even as co-factors or inhibitory molecules against pathogens. Table 1 summarizes the total number of annotated interactions by species and ncRNA type included in CORALIS.
Table 1
Supported species in CORALIS for ncRNA-target enrichment analysis. Each entry indicates the number of ncRNA-target interactions available for that species.
Species | miRNA | lncRNA | snoRNA | snRNA | rRNA |
Homo sapiens | 380639 | 72655 | 248 | 9979 | 4040 |
Mus musculus | 40681 | 74871 | 54616 | 908 | 791 |
Caenorhabditis elegans | 3183 | 0 | 0 | 0 | 0 |
Rattus norvegicus | 592 | 0 | 0 | 0 | 0 |
Bos taurus | 265 | 0 | 0 | 0 | 0 |
Drosophila melanogaster | 148 | 0 | 0 | 0 | 0 |
Danio rerio | 144 | 0 | 0 | 0 | 0 |
Arabidopsis thaliana | 96 | 0 | 0 | 0 | 0 |
Gallus gallus | 82 | 0 | 0 | 0 | 0 |
Sus scrofa | 36 | 0 | 0 | 0 | 0 |
TOTAL | 425866 | 147526 | 54864 | 10887 | 4831 |
Statistics
The CORALIS function tienrich performs ncRNA-gene interaction enrichment analysis to test whether a user-defined set of ncRNAs has enriched target genes. Although the chi-square test is often used to approximate the p-value, it is inaccurate when the expected frequencies in the contingency table are < 5 (8). Therefore, we implemented the one-tailed Fisher’s exact test, which is equivalent to the hypergeometric test, to calculate an exact p-value regardless of how small the expected frequencies are (9). We built a two-dimensional contingency matrix to represent the frequencies and marginal totals of the ncRNA-target gene interactions annotated in the miRTarBase and RNAInter databases (Table 2).
Table 2
The 2 × 2 contingency table used to test gene i as a potential target of the ncRNAs in the user’s input, with the cell frequencies represented as a, b, c and d, and the marginal totals as a + b, c + d, a + c, and b + d.
| ∈ ncRNAs input | ∉ ncRNAs input | Total |
∈ Gene i | a | b | a + b |
∉ Gene i | c | d | c + d |
Total | a + c | b + d | N = a + b + c + d |
where i represents a particular gene with at least one annotation in the source databases (miRTarBase or RNAInter); a + b is the number of ncRNAs annotated as interaction partners of gene i; c + d is the number of ncRNAs that do not interact with gene i; N is the total number of ncRNAs with at least one annotation; a represents the hits, that is, the number of ncRNAs in the user’s input that interact with gene i; and c represents the number of ncRNAs in the input that do not interact with gene i.
The p-values of the enrichment analysis are computed as follows:
$$p = 1 - \sum_{n=0}^{a-1} \frac{\binom{a+b}{n}\binom{c+d}{a+c-n}}{\binom{N}{a+c}}$$
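As an illustration, the contingency counts and the one-tailed p-value can be sketched in a few lines of Python. The function and argument names are hypothetical and this is not the tienrich source code. The p-value is computed as one minus the sum of the hypergeometric probabilities of observing n = 0, ..., a − 1 hits when a + c input ncRNAs are drawn from the N annotated ncRNAs, of which a + b interact with gene i.

```python
from math import comb


def contingency_counts(input_ncrnas, gene_interactors, annotated_ncrnas):
    """Return (a, b, c, d) for one gene, following the layout of Table 2."""
    universe = set(annotated_ncrnas)
    inp = set(input_ncrnas) & universe          # input ncRNAs with annotations
    partners = set(gene_interactors) & universe  # ncRNAs interacting with gene i
    a = len(inp & partners)   # in the input, interact with gene i (hits)
    b = len(partners - inp)   # not in the input, interact with gene i
    c = len(inp - partners)   # in the input, do not interact with gene i
    d = len(universe) - a - b - c
    return a, b, c, d


def enrichment_pvalue(a, b, c, d):
    """One-tailed (upper) hypergeometric p-value, P(hits >= a)."""
    n_total = a + b + c + d
    drawn = a + c  # size of the annotated input set
    denom = comb(n_total, drawn)
    # Probability mass of observing fewer than `a` hits.
    lower_tail = sum(
        comb(a + b, n) * comb(c + d, drawn - n) for n in range(a)
    )
    return 1 - lower_tail / denom


# Toy example: 10 annotated ncRNAs, 3 in the input, 3 interacting with the gene.
annotated = [f"r{k}" for k in range(1, 11)]
counts = contingency_counts(["r1", "r2", "r3"], ["r1", "r2", "r4"], annotated)
p_value = enrichment_pvalue(*counts)
```

In this toy example the counts are (2, 1, 1, 6) and the p-value equals 22/120, the probability of observing two or three hits.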
Moreover, the tienrich function provides the adjusted p-value (false discovery rate (FDR), using the Benjamini-Hochberg correction) and the odds ratio (OR), together with its standard error and confidence interval at alpha = 0.05:
$$OR= \frac{\left(\frac{a}{c}\right)}{\left(\frac{b}{d}\right)}$$
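A minimal Python sketch of these quantities follows. How the standard error and confidence interval are obtained is not stated above, so the sketch assumes the usual Wald interval on the log odds ratio (with the standard error taken on the log scale), and it uses a plain implementation of the Benjamini-Hochberg step-up adjustment. Function names are hypothetical.

```python
from math import exp, log, sqrt


def odds_ratio_with_ci(a, b, c, d, z=1.959964):
    """Odds ratio, SE of log(OR), and Wald CI (assumed method; z for alpha = 0.05).

    Assumes all four cells are non-zero; zero cells would need a correction.
    """
    odds_ratio = (a / c) / (b / d)
    se_log_or = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(odds_ratio) - z * se_log_or)
    upper = exp(log(odds_ratio) + z * se_log_or)
    return odds_ratio, se_log_or, lower, upper


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


or_value, or_se, ci_low, ci_high = odds_ratio_with_ci(2, 1, 1, 6)
fdr_values = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

With the toy counts (2, 1, 1, 6) the odds ratio is 12, and the four toy p-values are adjusted to 0.02, 0.04, 0.04 and 0.02.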
Customizable Parameters
The ncRNA input should be formatted as miRBase IDs for microRNA-target enrichment analysis (e.g., ‘hsa-miR-3196’) or as Official Gene Symbols for the remaining ncRNAs (e.g., ‘RUNX2’) (see https://www.mirbase.org/ and https://www.genenames.org/).
The CORALIS tienrich function has several customizable parameters:
- RNA interaction type, which depends on the type of ncRNA defined in the input (miRNA-gene, lncRNA-gene, snoRNA_mRNA or snRNA_mRNA).
- Target organism: Homo sapiens, Mus musculus
- Statistical parameters:
i) Minimum number of interactions per target gene. Users can set a minimum number of interactions between target gene i and the ncRNAs in the input. If this condition is not met, gene i is discarded and the next target gene is evaluated. For example, if the min argument is set to two (min = 2), tienrich performs target enrichment analysis only on genes with two or more interactions with the ncRNAs in the input.
ii) FDR cut-off. Users can also set the threshold below which target genes are reported as enriched by modifying the fdr argument.
By default, tienrich identifies enriched genes at an adjusted p-value (FDR) < 1 and with at least two interactions with the ncRNAs in the input.
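The effect of the two statistical parameters can be sketched as follows. This is an illustrative Python sketch with hypothetical function and field names, not the tienrich implementation; it assumes that the minimum-interaction filter is applied before testing and the FDR cut-off afterwards, with the default values stated above.

```python
def select_candidates(hits_per_gene, min_interactions=2):
    """Keep genes with at least `min_interactions` hits among the input ncRNAs."""
    return {
        gene: hits
        for gene, hits in hits_per_gene.items()
        if len(hits) >= min_interactions
    }


def apply_fdr_cutoff(rows, fdr=1.0):
    """Keep result rows whose adjusted p-value is strictly below `fdr`."""
    return [row for row in rows if row["FDR"] < fdr]


# Hypothetical hits: input ncRNAs interacting with each candidate gene.
hits_per_gene = {
    "IGF2BP2": ["hsa-miR-98-5p", "hsa-miR-589-3p"],
    "GENE_X": ["hsa-miR-98-5p"],  # made-up gene with a single hit, discarded
}
candidates = select_candidates(hits_per_gene, min_interactions=2)

# Hypothetical result rows after testing and FDR adjustment.
rows = [{"Gene_symbol": "IGF2BP2", "FDR": 0.0356036},
        {"Gene_symbol": "GENE_Y", "FDR": 0.40}]
enriched = apply_fdr_cutoff(rows, fdr=0.05)
```

Because genes below the minimum are discarded before testing in this sketch, they would not count towards the number of tests used in the multiple-testing correction.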
Software Output And Visualization
The ncRNA-target enrichment analysis returns a dataset containing the enriched target gene(s), the ncRNAs in the input that are involved in each interaction, the number of interactions identified per target gene, and additional statistical parameters (OR, p-value, FDR) for each interaction (Table 3).
Table 3
Target enrichment analysis output.
Gene_symbol | Num. interactions | ncRNAs | pvalue | FDR | OR | OR.SE | OR.IC.lower | OR.IC.upper |
CDC42BPA | 2 | hsa-miR-98-5p / hsa-miR-485-5p | 0.0010938 | 0.0356036 | 56.63736 | 0.8486273 | 10.733718 | 298.85180 |
HIST1H2AG | 3 | hsa-miR-98-5p / hsa-miR-1291 / hsa-miR-485-5p | 0.0011267 | 0.0356036 | 20.38710 | 0.7186965 | 4.984250 | 83.38941 |
HIST1H4H | 2 | hsa-miR-98-5p / hsa-miR-1291 | 0.0010938 | 0.0356036 | 56.63736 | 0.8486273 | 10.733718 | 298.85180 |
IGF2BP2 | 2 | hsa-miR-98-5p / hsa-miR-589-3p | 0.0003791 | 0.0356036 | 105.42857 | 0.8866236 | 18.546541 | 599.31302 |
MED13L | 2 | hsa-miR-98-5p / hsa-miR-582-3p | 0.0005771 | 0.0356036 | 81.93651 | 0.8685365 | 14.934057 | 449.54905 |
ABCC1 | 2 | hsa-miR-98-5p / hsa-miR-1291 | 0.0033190 | 0.0356901 | 30.54762 | 0.8275950 | 6.032915 | 154.67764 |
The CORALIS nodeVisu function displays a series of plots to analyse ncRNA-target interactions:
- Barplot showing the top target genes (scored by FDR and number of interactions) of the ncRNAs in the input (Fig. 2A).
- Chord diagram showing up to 25 target genes and their respective ncRNA interactions (Fig. 2B).
- Interactive HTML widget that displays the network of interactions between ncRNAs and their respective top target genes (Fig. 2C).
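All three plots can be derived from the enrichment output. As a rough Python sketch (not the nodeVisu implementation), the rows of the output table can be ranked and expanded into an ncRNA-gene edge list. The ranking rule used here (ascending FDR, then descending number of interactions) and the dictionary keys are assumptions; the example rows reuse gene names, ncRNAs and FDR values from Table 3.

```python
from collections import Counter


def edges_from_results(rows, top=25):
    """Rank target genes and expand each row into (ncRNA, gene) edges."""
    ranked = sorted(rows, key=lambda r: (r["FDR"], -r["n_interactions"]))[:top]
    edges = []
    for row in ranked:
        # The ncRNAs column lists interactors separated by " / ".
        for ncrna in row["ncRNAs"].split("/"):
            edges.append((ncrna.strip(), row["Gene_symbol"]))
    return edges


rows = [
    {"Gene_symbol": "IGF2BP2", "n_interactions": 2, "FDR": 0.0356036,
     "ncRNAs": "hsa-miR-98-5p / hsa-miR-589-3p"},
    {"Gene_symbol": "HIST1H2AG", "n_interactions": 3, "FDR": 0.0356036,
     "ncRNAs": "hsa-miR-98-5p / hsa-miR-1291 / hsa-miR-485-5p"},
    {"Gene_symbol": "ABCC1", "n_interactions": 2, "FDR": 0.0356901,
     "ncRNAs": "hsa-miR-98-5p / hsa-miR-1291"},
]
edges = edges_from_results(rows)
# Number of target genes per ncRNA, e.g. to size nodes in a network plot.
ncrna_degree = Counter(ncrna for ncrna, _ in edges)
```

An edge list of this kind is a common input for bar plots of interaction counts, chord diagrams and network widgets.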