RNA extraction, library preparation and Illumina sequencing
We extracted total RNA from 30 serum samples from healthy individuals, anonymous blood donors who consented to collection of a separate fraction for circulating RNA biomarker discoveries with the miRCURY RNA Isolation kit for Biofluids (Exiqon) spiked in with Exiqon RNA spike in mix (1:50) of recommended amount eluted in 50 µl. As RNA concentration was too low to be measured by spectrophotometry or fluorometry, quality of RNA was assessed with the microRNA QC PCR Panel, 96 well (V1.RO) from 5 µl of 1:50 dilution of cDNA obtained after reverse transcribing 2 µl of total RNA with miRCURY LNA™ Universal RT microRNA PCR (Exiqon) which uses external spiked in to control for miRNA yield and profiles specific miRNAs to estimate hemolysis. Quantitative PCR reactions were performed on Roche LightCycler 480 instrument (Roche) in 10 µl volume reactions. All samples included in the analysis had equivalent miRNA content and low levels of hemolysis (DCq (miR-23a – miR-451) ranging from 4.46 to 6.47, below the assay rejection limit value of 7). We constructed indexed libraries using the TruSeq Small RNA kit (Illumina) starting from all remaining RNA extracted from 200 µl (~48 µl, equivalent to ~192 µl serum). Library products were pooled together in equimolar amounts based on Bioanalyzer peak quantification at the expected size around 142 bp. A single pool with 30 libraries was size selected with the Pippin prep system (SAGE Science) with 3% agarose and dye free Marker F in the 115-165 bp broad range to enrich for miRNA products and minimize variation between samples in the size selection step. The sequencing data generated consisted of 2 x 50 nt reads resolved on a HiSeq 2500 system with v3 SBS reagents (Illumina). Two of the 30 samples were excluded because they were clear outliers by PCA exploratory analysis and had poor library yields. The final dataset consists of 28 samples.
Small RNA-seq analysis pipeline
To analyze small RNA-seq PE data we have developed the XICRA pipeline (https://github.com/HCGB-IGTP/XICRA). It is developed in python, with multiple separate modules and it is available as a pip package (https://pypi.org/project/XICRA/). This pipeline is designed to take paired end reads in fastq format, trim adapters and low-quality base pairs positions, and merge read pairs (R1 & R2) that overlap. A mapping step to the reference genome assigns joined reads to all major RNA biotypes including miRNA and isomirs, tRNA fragments (tRFs) and piwi associated RNAs (piRNAs). Then, XICRA produces a miRNA analysis at the isomir level using joined reads, with several choices of software that can be selected by the user with standardized output. Results are generated for each sample, analyzed and summarized for all samples in a single expression matrix. This information can be processed at the miRNA or isomir level (single sequence) but also summarizing for each isomir variant type. Statistical summaries can be easily accessed using the accompanied R package XICRA.stats (https://github.com/HCGB-IGTP/XICRA.stats). Although the pipeline is designed to take paired-end reads, it also accepts single-end reads. The workflow of the pipeline is described in Figure 1.
XICRA uses cutadapt [29] for the adapter trimming analysis. Default trimming preset parameter settings are: to keep all reads regardless of whether the adapter is found or not, a 10% maximum adapter matching error rate (mismatches, insertions and deletions), and a 3bp minimum overlap length. User must provide specific adapter sequences for the trimming analysis. An optional previous quality checking step can be performed for each sample using FastQC [30] before the trimming analysis. Results are summarized for all samples using MultiQC software [31].
Once all reads are adapter trimmed, the tool uses FastqJoin from ea-utils [32] to join the two PE reads, if provided, on the overlapping ends. Apart from the joined reads, this tool also generates two files with the R1 and R2 reads that cannot be joined. As a default the minimum overlap is set to 6 bp and the maximum allowed difference for the reads to be joined is set to 0% to retain 100% matching read pairs ensuring high quality sequencing information. Parameters can be modified using the different options provided.
The XICRA pipeline can continue to process either joined PE reads or SR reads. Two levels of mapping are implemented. The first level profiles RNA biotypes using STAR [33] to map reads against the reference genome and featureCounts [34] to extract and quantify numbers of reads by class. The second level focuses specifically on small RNA subclasses. Here we describe the miRNA analysis implemented within XICRA but the modularity and versatility of the pipeline would make it quite straightforward to include other RNA biotypes analyses in detail.
For miRNAs analysis at the isomir resolution level, XICRA allows the user to use either miraligner [26], sRNAbench from sRNAtoolbox [27] or OPTIMIR [28]. Each software uses different strategies and might produce different results [35]. We have included them as they allow following standardization procedures performed by miRTOP software and adopt the miR.gff3 file format [36]. Again, the pipeline modular implementation would allow adding additional softwares converging and adapting to miRTOP and miR.gff3 format. For each of the softwares mentioned above and included within the miRNA module in XICRA default parameters are used. Some of these parameters can be modified using the different options provided. As a result of this miRNA module, annotation is generated that categorizes isomirs into classes based on their sequence modifications (including iso_5p, iso_3p, iso_add, iso_snv, iso_snv_seed, iso_snv_central_offset, iso_snv_central, iso_snv_central_suppl) following miRTOP suggested classification scheme. A final conversion step from individual per sample miR.gff3 files into a single expression matrix is performed. This file serves as input for differential expression (DE) analysis. Information is provided for each unique sequence and indexed names contain the miRNA, the variant type and license plate (unique identifier, UID) provided by miRTOP. Duplicated entries at the sequence level, produced by different modifications from the same or different miRNA are discarded. An additional matrix is provided containing the sequence information for each encrypted UID.
Per sample read count matrices at the isomir level are summarized into a single expression matrix that it serves as input for DE analysis between the comparison groups of interest. We have generated an additional R package (XICRA.stats) that facilitates the retrieval of these matrices and parses the information included within each unique index name provided. The DE analysis can be done aggregating data at the mature miRNA level (i.e. hsa-miR-501-3p), by isomir class (i.e. hsa-miR-501-3p_iso_5p), by specific length variant cluster (i.e. hsa-miR-501-3p_iso_3p:-2) or with the sequence of the read itself as the counting data. This is useful since different types of modification may coexist in a single sequence, and non-templated additions and internally edited sequences can differ leading to isomirs that can fall into different categories or be derived from different mature miRNAs. DE analysis is performed outside of the tool with DESeq2 package in R [37].
Gene set enrichment analysis applied to isomirs
We adapted the Gene Set Enrichment Analysis (GSEA) tool [38] for miRNA analysis by building miRNA isoform ‘gene sets’ grouping all class identifiers by class type (including iso_5p, iso_3p, iso_add, iso_snv, iso_snv_seed, iso_snv_central_offset, iso_snv_central, iso_snv_central_supp). Gene expression results were issued for pre-ranked mode analysis with the default setting sorted by the log2fold change metric. Gene sets were built from the isomirnome detected in serum from male and female individuals, the dataset generated in house used in this study. All unique sequences obtained after trimming all reads in all samples combined were annotated at the isomir level and grouped by isomir type. Each isomir gene set was composed of all the unique sequences assigned to a particular isomir type.
Venn diagrams
Venn diagrams depicting overlap between regulated genes were made using the Venn function in the gtools package in R [39].
External paired end read small RNA sequencing dataset
We searched the NCBI SRA database for any small RNA sequencing datasets generated using Illumina paired end reads. Very few datasets were found fulfilling our filtering criteria. We found a dataset belonging to a project (GEO GSE114923; PRJNA473134) that studied miRNA biomarkers in nonalcoholic fatty acid liver disease (NAFLD). The design of the experiment contained 8 patients, classified as control or case, and the dataset was sequenced using paired-end NextSeq 500 paired-end libraries. See additional details in the original publication [40].
We produced a full analysis at the isomir level using XICRA and the three different software available (sRNAbench, miraligner and OPTIMIR) using paired-end information and single end reads (R1 and R2, respectively). For the trimming process, we used as adapter sequences the corresponding NEBNext library adapters. Once the isomir expression matrices were generated for each software and read type, we used R package XICRA.stats for parsing the information and DESeq2 in R [37] for the analysis of DE isomirs or miRNAs using the experiment design described above (case vs. control).
Computer simulations
To illustrate the potential of paired-end reads at the isomir level analysis we have generated computer simulations to test the impact of technical errors from single end or paired-end reads. We followed the guidelines previously described for isomir computer simulations by Amsel et al. 2017 [35]. We created biological variation and technical variation using multiple high throughput sequencing profiles and evaluated the performance of the simulation using sensitivity and precision of the isomirs detected under several circumstances. See details of the bioinformatic script and details in (https://github.com/HCGB-IGTP/XICRA/tree/master/BMC_bioinformatics_paper/simulation)
Biological variation
We created artificial miRNA isoforms from Homo sapiens mature and hairpin sequences in miRBase (v.21) [41]. We additionally included the canonical fasta sequence of each miRNA to the variant dataset generated. From the variant frequency table generated we selected 100 miRNAs and discarded some random variants and generated random distribution frequencies of the variant types generated. To simplify the interpretation and evaluation, for each variant type, we selected a single isoform. For each variant type we included the corresponding frequencies generated to a total amount of 100 sequences for each mature miRNA. Thus a total of 10000 sequences were simulated with preset frequencies.
Technical simulation
For each biological dataset generated, we used ART [42] with Illumina HiSeq2500 and MiSeq-v1 sequencing system profiles using paired-end mode to simulate next generation sequencing (NGS) reads. We grouped all isoforms according to length and generated NGS simulation for each length subset to finally merge them all in a single file for each read accordingly. We used a 10x sequencing coverage for each input fasta sequence. As previously noted [35], due to the nature of the ART simulation we had to parse and omit about half the total reads generated as they were reverse complemented. We only discarded reverse complemented R1 reads and its R2 counterpart accordingly. Due to the implementation based on frequencies that we did in the biological variation procedure, we made sure that, when applying the same coverage for each sequence and discarding reverse complement, the biological variation frequencies generated would be maintained in the NGS simulation. The observed range would vary from 5-500 counts for each single variant type simulated.
Performance evaluation
We evaluated the performance of using PE reads for miRNA isomir analysis using the NGS simulation datasets and the pipeline XICRA. For each dataset, we used the miRNA module using paired-end mode and single end mode for the R1 and R2 reads. For the paired-end mode we initially joined reads using two different join percentage difference cutoff (fastq-join parameter) to test the effect of using 100% perfect R1 and R2 reads or allowing the default difference (8%) along the minimum default overlap length cutoff (6 bp). For the single-end mode, we used the total R1 reads simulated or the total R2, reversed complemented using seqtk software [43] respectively. We generated a miRNA analysis at the isomir level using the three different softwares available within XICRA: miraligner, sRNAbench and OPTIMIR.
Using the biological variation frequencies generated as true positives for each dataset, we evaluated the amount of isomirs detected for each software and type of read. For paired-end reads, we also used a different percentage difference cutoff. For each isomir, the detected counts were classified as: True positives (TP) when observed counts matched the expected counts; false positives (FP) when observed counts exceeded the expected counts and were wrongly assigned; and false negatives (FN) when observed counts did not get to the minimum expected counts. We calculated the sensitivity or recall as TP/(TP+FN) and the precision or specificity as TP/(TP+FP). We also reported True Negatives (TN) when expected counts were not observed and new generation isomirs when new variants or miRNA appeared and were not expected. We plotted results using ggplot2 [44] for each software and type of read respectively.