Input data
The main R script of the workflow reads the input data from three distinct files stored in a folder:
(1) Count table: This is the main data file, with samples in columns and the measurement of each well in rows. The proposed tool is applicable to the miScript miRNA PCR Array (Qiagen), which contains 384 wells covering 372 miRNAs and 12 controls. Specifically, each of the 372 miRNA wells contains a miScript Primer Assay for a miRNome or pathway/disease/functionally-related mature miRNA. Of the 12 control wells, 2 contain replicate C. elegans miR-39 miScript Primer Assays, which can be used as an alternative normalizer for the array data (Ce); 6 each contain an assay for a different snoRNA/snRNA, which can be used as a normalization control for the array data; 2 contain replicate miRTC Primer Assays (RTC); and 2 contain positive PCR controls (PPC).
(2) Metadata: This file lists the sample IDs and the group to which each sample belongs, e.g. normal/tumor.
(3) miRNA well annotation: A file that links each well to the miRNA it examines.
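As an illustration of how the inputs relate to the plate layout, the following minimal Python sketch checks that each sample column has one value per well and that the sample IDs agree between the count table and the metadata. The workflow itself is an R script and this is not its code; the category labels, the function name `check_inputs` and the in-memory data structures are assumptions made for the example.

```python
# Well categories of the 384-well array, as listed in the text.
# The keys are illustrative labels, not names used by the tool.
PLATE_LAYOUT = {
    "miRNA": 372,        # miScript Primer Assays for mature miRNAs
    "Ce": 2,             # replicate C. elegans miR-39 assays
    "snoRNA_snRNA": 6,   # snoRNA/snRNA normalization controls
    "RTC": 2,            # replicate miRTC assays
    "PPC": 2,            # positive PCR controls
}
EXPECTED_WELLS = sum(PLATE_LAYOUT.values())  # 372 + 12 controls


def check_inputs(count_table, metadata):
    """Return a list of problems; the list is empty if the inputs agree.

    count_table: dict mapping sample ID -> list of well measurements
    metadata:    dict mapping sample ID -> group label
    """
    problems = []
    for sample, values in count_table.items():
        if len(values) != EXPECTED_WELLS:
            problems.append(
                f"{sample}: expected {EXPECTED_WELLS} wells, found {len(values)}"
            )
        if sample not in metadata:
            problems.append(f"{sample}: missing from metadata")
    for sample in metadata:
        if sample not in count_table:
            problems.append(f"{sample}: in metadata but not in count table")
    return problems


# Toy data: S2 is truncated and unannotated, S3 has no measurements.
count_table = {"S1": [25.0] * 384, "S2": [25.0] * 380}
metadata = {"S1": "normal", "S3": "tumor"}
problems = check_inputs(count_table, metadata)
for problem in problems:
    print(problem)
```

The annotation file is not covered here; a similar check could confirm that it has one entry per well.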
Workflow
The framework is implemented in three distinct phases: (1) QC and normalization, (2) differential analysis and (3) functional analysis. Specifically:
(1) QC and normalization
The first phase takes the initial count table as input. The quality control process checks the percentage of undetected or not available (NA) values in each column against a user-defined maximum. Moreover, the ratio between the reverse transcription control (RTC) assay, which detects an artificial RNA template, and the positive PCR controls (PPC), which monitor for PCR inhibitors, is calculated, and a standard threshold is used to validate the reverse transcription efficiency.
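The NA criterion can be pictured with a short Python sketch. It is only an illustration under stated assumptions, not the code of the R workflow: missing values are represented as `None`, and the names `na_percentage`, `filter_samples` and `max_na_percent` are hypothetical. The RTC/PPC check is not sketched.

```python
def na_percentage(values):
    """Percentage of missing (None) measurements in one sample column."""
    missing = sum(1 for v in values if v is None)
    return 100.0 * missing / len(values)


def filter_samples(count_table, max_na_percent):
    """Split samples into those that pass the NA criterion and those that fail.

    Returns (kept, excluded): kept maps sample ID -> values, excluded maps
    sample ID -> its NA percentage, so the reason for exclusion is recorded.
    """
    kept, excluded = {}, {}
    for sample, values in count_table.items():
        pct = na_percentage(values)
        if pct <= max_na_percent:
            kept[sample] = values
        else:
            excluded[sample] = pct
    return kept, excluded


# Toy example with four wells per sample and a 50% limit.
count_table = {
    "S1": [20.1, None, 25.0, 30.2],   # 25% NA, kept
    "S2": [None, None, None, 28.0],   # 75% NA, excluded
}
kept, excluded = filter_samples(count_table, max_na_percent=50)
print(sorted(kept), excluded)
```

Recording the NA percentage of each excluded sample makes it straightforward to state the reason for exclusion later on.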
The data normalization module offers a choice between an endogenous and an exogenous miRNA approach. The output of this step is the normalized data matrix, which includes only the samples that passed the NA criterion. Additionally, an optional visualization step generates figures that are automatically stored in the analysis folder: an UpSet plot of the NA distribution and boxplots of the counts before and after normalization.
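The text does not give the normalization formula, so the following Python sketch should be read as one plausible reading rather than the tool's method. It assumes Ct-like values and a reference-subtraction scheme in which the mean of the chosen control wells is subtracted from every well of the same sample. It further assumes that the endogenous option uses the snoRNA/snRNA wells and the exogenous option uses the C. elegans miR-39 (Ce) wells. All function and label names are hypothetical.

```python
from statistics import mean

# Assumed mapping from normalization option to the control wells it uses.
REFERENCE_WELLS = {"endogenous": "snoRNA_snRNA", "exogenous": "Ce"}


def normalize_sample(values, well_types, method):
    """Subtract the mean of the reference wells from each well of one sample.

    values:     measurements of one sample, one per well (None = missing)
    well_types: category label of each well, in the same order as values
    method:     "endogenous" or "exogenous"
    """
    ref_type = REFERENCE_WELLS[method]
    refs = [
        v for v, t in zip(values, well_types)
        if t == ref_type and v is not None
    ]
    if not refs:
        raise ValueError(f"no usable {ref_type} wells for normalization")
    ref_mean = mean(refs)
    # Missing values stay missing; everything else is shifted by the reference.
    return [None if v is None else v - ref_mean for v in values]


# Toy sample with three miRNA wells and three control wells.
values = [25.0, 30.0, None, 20.0, 22.0, 18.0]
well_types = ["miRNA", "miRNA", "miRNA", "snoRNA_snRNA", "snoRNA_snRNA", "Ce"]

endo = normalize_sample(values, well_types, "endogenous")
exo = normalize_sample(values, well_types, "exogenous")
print(endo)
print(exo)
```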
(2) Differential analysis
This module is implemented with the limma package in R (9). Its output is the list of differentially expressed miRNAs, selected using a user-defined adjusted p-value threshold. Moreover, a hierarchical cluster analysis is performed at this stage, and the corresponding heatmap is constructed and stored in the analysis directory.
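To make the adjusted p-value threshold concrete, the Python sketch below applies the Benjamini-Hochberg correction to a set of raw p-values and keeps the miRNAs below a cutoff. This is an illustration only: the workflow relies on limma in R, the text does not state which adjustment method is used (Benjamini-Hochberg is assumed here), and the miRNA names, p-values and 0.05 cutoff are invented for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def select_significant(pvalue_by_mirna, threshold):
    """Names of miRNAs whose adjusted p-value is below the threshold."""
    names = list(pvalue_by_mirna)
    adjusted = bh_adjust([pvalue_by_mirna[n] for n in names])
    return [n for n, p in zip(names, adjusted) if p < threshold]


# Hypothetical raw p-values for four miRNAs.
raw = {"miR-A": 0.001, "miR-B": 0.02, "miR-C": 0.03, "miR-D": 0.5}
adjusted = bh_adjust(list(raw.values()))
significant = select_significant(raw, threshold=0.05)
print(significant)
```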
(3) Functional analysis
The downstream analysis links the differentially expressed miRNAs to the genes they regulate using the multiMiR package, which includes several databases, such as mirtarbase (10), tarbase (11) and diana_microt (12), among others, covering both predicted and validated targets. Moreover, the target genes of the differentially expressed miRNAs are used for KEGG and Gene Ontology (GO) enrichment analysis, as facilitated by the enrichR package (13). Finally, barplots that present the results of the enrichment analysis are stored in the analysis folder.
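The logic of this phase can be sketched in two steps: pool the target genes of the differentially expressed miRNAs, then test whether a gene set is over-represented among them. The Python code below is a generic illustration using a hypergeometric upper-tail test; it does not reproduce the queries of multiMiR or the scoring of enrichR, and all miRNA names, gene names and set sizes are invented.

```python
from math import comb


def collect_targets(targets_by_mirna, de_mirnas):
    """Union of the target genes of the differentially expressed miRNAs."""
    genes = set()
    for mirna in de_mirnas:
        genes |= targets_by_mirna.get(mirna, set())
    return genes


def overrepresentation_pvalue(universe_size, set_size, n_targets, overlap):
    """P(X >= overlap) when n_targets genes are drawn without replacement
    from a universe that contains set_size genes of the tested gene set."""
    total = comb(universe_size, n_targets)
    upper = min(set_size, n_targets)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_targets - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


# Hypothetical miRNA-target links and a hypothetical pathway gene set.
targets_by_mirna = {
    "miR-A": {"G1", "G2"},
    "miR-B": {"G2", "G3"},
    "miR-C": {"G9"},
}
targets = collect_targets(targets_by_mirna, ["miR-A", "miR-B"])

universe = {f"G{i}" for i in range(1, 21)}      # 20 background genes
pathway = {"G1", "G2", "G3", "G4", "G5"}        # 5 pathway genes
overlap = len(targets & pathway)

pvalue = overrepresentation_pvalue(
    len(universe), len(pathway), len(targets), overlap
)
print(sorted(targets), overlap, pvalue)
```

In practice the background universe and the gene sets come from the annotation databases, and the resulting p-values would be corrected for multiple testing across gene sets.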
At the end of the entire process, a report file is automatically exported. The report contains information about the particular execution, including the user-defined criteria, the reasons for which samples were excluded, the overall execution time and the total memory usage.
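A report of this kind can be assembled as in the following Python sketch, which times a task, records its peak memory and lists the settings and excluded samples. It is an illustration under assumptions, not the reporting code of the R workflow: the report layout and all names are hypothetical, and `tracemalloc` only tracks memory allocated by Python itself, which may differ from the total memory usage the tool reports.

```python
import time
import tracemalloc


def run_with_report(task, settings, excluded):
    """Run task() and return (result, report text).

    settings: dict of user-defined criteria to record
    excluded: dict mapping sample ID -> reason for exclusion
    """
    tracemalloc.start()
    start = time.perf_counter()
    result = task()
    elapsed = time.perf_counter() - start
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()

    lines = ["Settings:"]
    lines += [f"  {key} = {value}" for key, value in settings.items()]
    lines.append("Excluded samples:")
    lines += [f"  {sample}: {reason}" for sample, reason in excluded.items()]
    lines.append(f"Elapsed time (s): {elapsed:.3f}")
    lines.append(f"Peak traced memory (bytes): {peak}")
    return result, "\n".join(lines)


# Toy task standing in for the analysis itself.
result, report = run_with_report(
    task=lambda: sum(range(100_000)),
    settings={"max_na_percent": 50, "adjusted_p_threshold": 0.05},
    excluded={"S2": "75.0% NA values exceeds the 50% limit"},
)
print(report)
```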