miR-Island: an ultrafast and memory-efficient tool for plant miRNA annotation and expression analysis

Background Next-generation sequencing of small RNAs has yielded an abundance of microRNA (miRNA) profiling data for diverse plant species. Many programs have been developed for plant miRNA annotation based on sequencing data, but these programs typically require computers with powerful hardware configurations. At present, few ultrafast computational tools are available for this type of data analysis on standard personal computers. Results We present miR-Island, an ultrafast tool for plant miRNA identification from deep sequencing data. Two important strategies contribute to the speed of the miR-Island program: (1) extracting precursor candidates using a pseudogenome and (2) using parallel processing for RNA secondary structure prediction. In our analysis, the pseudogenomic strategy reduced the time required for miRNA precursor extraction from 85 seconds to 19 seconds in Salvia miltiorrhiza, and parallel processing reduced the time for the secondary structure prediction of 3,957 S. miltiorrhiza miRNA precursors from 90 seconds to 32 seconds. miR-Island completed miRNA annotation for Arabidopsis in 18 minutes on a standard personal computer, which was less than 50% of the time for ShortStack and 9% of that for miRDeep-P. In terms of accuracy, miR-Island identified 128 miRNAs, which included 68 known miRNAs in miRBase. ShortStack predicted 55 total miRNAs, including 38 known miRNAs in miRBase. Of the 175 total miRNAs predicted by miRDeep-P, only 57 miRNAs were registered in miRBase. Agreement between miR-Island and ShortStack was moderate (kappa = 0.47). For the prediction of miRNAs from three other plant datasets, miR-Island required at most about 50% of the ShortStack run time and at most about 2.5% of the miRDeep-P run time. When the three programs were run on

Conclusion Unlike other approaches, miR-Island is an ultrafast and memory-efficient tool for plant miRNA annotation and quantification on a standard personal computer, achieved through a pseudogenome strategy and multithreading. In addition, miR-Island is a single-command Perl script that is convenient to use for miRNA annotation and quantification. miR-Island was implemented in Linux and is available under the GNU GPL license from https://github.com/janeyurigao/miR-Island.

Background
MicroRNAs (miRNAs) are ~21 nucleotide noncoding RNAs excised from stem-loop precursors by Dicer or Dicer-like endonucleases, and they direct transcriptional or posttranscriptional gene expression regulation in metazoans and plants [1]. In the last decade, miRNA detection and expression profiling have become commonly used experimental strategies [2], reflecting the tremendous advances in next-generation sequencing (NGS) technologies [3]. Although mature miRNAs from small RNA (sRNA) sequencing can be quantified without first aligning reads to a reference genome, this approach leads to a high rate of false positives because the many other classes of sRNAs present in the datasets, including siRNAs, snoRNAs and piRNAs, could be falsely identified as miRNAs.
In contrast to animal miRNAs, plant miRNAs are more conserved and have variable precursor lengths [4], which necessitates dedicated programs for plant miRNA annotation. The currently available tools for plant miRNA identification can be divided into web-based and local tools. The web-based miRNA annotation tools are generally easy to use and include the following: SoMART [5], PlantMiRNAPred [6], miRAuto [7], SeqBuster [8], DSAP [9], CPSS [10], PsRobot [11], DARIO [12] and miRPlant [13]. However, web-based tools are limited in terms of file upload size and reference genome selection. For this reason, local programs have been more widely adopted than web-based tools. MIREAP was available early on for plant miRNA prediction from sRNA-seq data. miRDeep-P [2] was designed for plant miRNA annotation with a plant-specific scoring system and filtering criteria, and it is the most widely used tool for plants. ShortStack [14] is a "one-command" script for comprehensive sRNA annotation and quantification, including miRNAs, phased-siRNAs and piRNAs. The University of East Anglia (UEA) sRNA workbench [15] was designed for the comprehensive analysis of NGS sRNA data, such as the identification of miRNAs and their targets as well as expression level comparison for specific sRNA loci. miRDeepFinder [16] provides a comprehensive annotation of plant miRNAs from deep sequencing sRNA datasets. miREvo [17] is an integrated miRNA evolutionary analysis platform for NGS datasets that relies on miRDeep2 as the core algorithm for miRNA prediction. miRanalyzer [18] was developed based on a random forest model to annotate and quantify miRNAs. mirTools2 [19] provides detailed annotation for each known miRNA and presents differential miRNA expression in a scatter plot. miRNAkey [20] is a graph-based software for the discovery and quantification of conserved miRNAs. MIReNA [21] was designed for the prediction of miRNAs and pre-miRNAs, and it explores a multidimensional space defined by only five parameters. shortran [22] is a command-line Python-based software package for miRNA annotation and quantification. Semirna [23] uses a putative target sequence as input and allows miRNA searches. microHARVESTER [24] takes a miRNA query and identifies candidate miRNA homologs from a set of sequences; the candidate genes are subsequently subjected to various filters before the final conclusion. MIRcheck [25] uses sequence/structure specification and 20-mer coordinates for miRNA prediction. MaturePred [26] is a machine-learning method based on support vector machines that predicts the positions of plant miRNAs for new plant pre-miRNA candidates. miR-PREFeR [27] uses miRNA expression patterns and follows the criteria for plant miRNA annotation to accurately predict plant miRNAs from one or more sRNA-seq data samples of the same species. However, none of these programs can rapidly identify miRNAs on a standard personal computer for most plants, especially when the reference is a scaffold-level genome that has not been assembled to the chromosome level.
Personal computers now commonly contain multiple CPU cores, which allow the computer to process multiple sets of information in parallel. However, the currently available miRNA prediction tools, including miRDeep-P and ShortStack, were implemented for only one CPU core and therefore do not make full use of the resources commonly available on personal computers today. Additionally, among the steps involved in miRNA prediction from deep sequencing data, the prediction of RNA secondary structures is the most time-consuming task. In this study, miR-Island, a "one-command" Perl script, was developed for plant miRNA annotation on standard personal computers, which any laboratory can obtain at low cost. To achieve fast miRNA annotation, miR-Island incorporates two strategies: (i) a pseudogenomic strategy that greatly accelerates the extraction of miRNA candidate precursors when using a scaffold-level genome assembly and (ii) RNAfold parallel processing, which can increase the speed of secondary structure prediction for miRNA precursors up to 3-fold. The accuracy and speed of miR-Island were compared with those of miRDeep-P and ShortStack on datasets from Arabidopsis and four other plant species. The results showed that miR-Island could annotate plant miRNAs with unprecedented speed and acceptable accuracy on a standard personal computer.

Implementation
Overview
Our software implements the following six steps (Fig. 1).
1) Formatting of the genome. The input sRNA data are transformed into an acceptable format. If the reference genome is a contig- or scaffold-level assembly, a pseudogenome is generated from the reference genomic sequences (Fig. 2A): many contigs or scaffolds are linked in tandem, separated by a spacer string of 20 Ns. The local position of each contig or scaffold in the pseudogenome is recorded, and an index is established for retrieving the real position of an identified miRNA.
2) Mapping sRNA reads to the reference genome. Bowtie (version 0.12.9) [28] is used for this mapping step, and the output is in SAM/BAM format.
3) Extraction of potential miRNA precursors. Two different methods are used to identify the potential precursors of known and novel miRNAs. For known miRNAs, two ~270 nt fragments are extracted as potential precursors, each comprising the known miRNA flanked by 230 nt of genomic sequence on one side (5′ or 3′) and 20 nt on the other. For novel miRNAs, fragments containing two or more islands (Fig. 2B) are excised from the genome as potential precursors, with the precursor length not exceeding the user-set threshold.
4) Secondary structure prediction. miR-Island predicts the secondary structures of the precursors with multiple threads through parallel processing (Fig. 3).
5) miRNA identification. miR-Island performs miRNA prediction using plant-specific criteria based on the recommendations of Thakur et al. [29], except for the parameter adapted for plant-specific minimum free energy (MFE). The following additional criteria were also adapted: i) loop length should be greater than 6 nt; ii) subroutine for processing input files; iii) miRNA-miRNA* duplex should have fewer than 5 mismatches; iv) mature sequence should not have a continuous string of six or more of the same base; v) miRNA-miRNA* duplex should account for more than 75% of the reads mapping to the precursor locus.
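The numeric criteria in step 5 can be sketched as a simple filter. The following Python fragment is only an illustration under stated assumptions: the `Candidate` record and its field names are hypothetical, only the four quantitative criteria (loop length, duplex mismatches, homopolymer run, duplex read fraction) are encoded, and it is not taken from the miR-Island Perl source.

```python
import re
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical record for one precursor candidate (field names are assumptions)."""
    mature: str             # mature miRNA sequence
    loop_len: int           # hairpin loop length in nt
    duplex_mismatches: int  # mismatches in the miRNA-miRNA* duplex
    duplex_reads: int       # reads mapping to the miRNA or miRNA*
    locus_reads: int        # all reads mapping to the precursor locus


# Six or more consecutive identical bases.
HOMOPOLYMER = re.compile(r"(.)\1{5,}")


def passes_filters(c: Candidate) -> bool:
    """Apply the four numeric criteria listed in step 5."""
    if c.loop_len <= 6:                # loop must be longer than 6 nt
        return False
    if c.duplex_mismatches >= 5:       # fewer than 5 mismatches allowed
        return False
    if HOMOPOLYMER.search(c.mature):   # no run of six or more identical bases
        return False
    if c.locus_reads == 0:             # no reads, so no expression evidence
        return False
    # The duplex must account for more than 75% of reads at the locus.
    return c.duplex_reads / c.locus_reads > 0.75
```

Each threshold is a strict inequality, matching the wording of the criteria ("greater than", "fewer than", "more than"), so a candidate with exactly 75% duplex reads is rejected.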

Datasets
The raw data for the analyzed sRNA-seq datasets were downloaded in SRA format from the GEO database, and the accession numbers are listed in Additional file 1.
The reference genomes for Arabidopsis, tomato, rice, maize and Salvia miltiorrhiza were downloaded from the ftp sites listed in Additional file 2. In addition, one S. miltiorrhiza scaffold-level genome assembly [30] was downloaded from the ftp site

The pseudogenomic strategy accelerates the extraction of miRNA precursors
To reduce the number of separate sequences that must be read during the precursor extraction step, we first concatenated scaffold sequences into longer sequences comparable in size to assembled plant chromosomes. Through this transformation, more than 21,000 S. miltiorrhiza genomic scaffold sequences were transformed into 50 pseudogenomic sequences with a maximum length of 60 Mb. Next, the sRNA-seq reads with the adaptor sequence removed were aligned to the pseudogenome for miRNA precursor extraction. It took only 19 seconds to extract 3,957 miRNA candidates from the S. miltiorrhiza pseudogenome. The same extraction step took 85 seconds on the scaffold genome, indicating a > 4-fold processing speed improvement when using the pseudogenome (Fig. 4).
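The concatenation and coordinate recovery described here can be illustrated with a short Python sketch. It assumes scaffolds are packed greedily in input order up to a maximum length, with a spacer of 20 Ns between neighbours; the function names and the packing rule are illustrative assumptions and may differ from what the miR-Island Perl script does.

```python
from bisect import bisect_right

SPACER = "N" * 20  # spacer of 20 Ns between linked scaffolds, as in the text


def build_pseudogenome(scaffolds, max_len):
    """Concatenate (name, sequence) pairs into pseudo-sequences of at most max_len.

    Returns (pseudo_seqs, index); index[i] lists (offset, name, length)
    for every scaffold placed in pseudo-sequence i, sorted by offset.
    A single scaffold longer than max_len gets its own pseudo-sequence.
    """
    pseudo_seqs, index = [], []
    chunks, entries, length = [], [], 0
    for name, seq in scaffolds:
        gap = len(SPACER) if chunks else 0
        if chunks and length + gap + len(seq) > max_len:
            # Current pseudo-sequence is full: flush it and start a new one.
            pseudo_seqs.append(SPACER.join(chunks))
            index.append(entries)
            chunks, entries, length, gap = [], [], 0, 0
        entries.append((length + gap, name, len(seq)))
        chunks.append(seq)
        length += gap + len(seq)
    if chunks:
        pseudo_seqs.append(SPACER.join(chunks))
        index.append(entries)
    return pseudo_seqs, index


def to_scaffold_coords(index, pseudo_id, pos):
    """Map a 0-based pseudogenome position back to (scaffold_name, local_pos).

    Returns None when the position falls inside a spacer.
    """
    entries = index[pseudo_id]
    offsets = [offset for offset, _, _ in entries]
    i = bisect_right(offsets, pos) - 1
    if i < 0:
        return None
    offset, name, size = entries[i]
    local = pos - offset
    return (name, local) if local < size else None
```

For example, with three toy scaffolds of 4, 4 and 6 nt and `max_len=30`, the first two are joined into one 28 nt pseudo-sequence and the third starts a new one; position 25 of the first pseudo-sequence maps back to position 1 of the second scaffold.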
Parallel computing speeds up the secondary structure prediction of miRNA precursors
To reduce the time required for the prediction of precursor secondary structures, we used a parallel computing strategy for RNAfold analysis of miRNA precursor candidates. With parallel processing, the secondary structure predictions for the 3,957 precursors took only 32 seconds. This result corresponded to an approximately 3-fold (2.8-fold) processing speed improvement over single-thread processing, which required ~90 seconds for this step (Fig. 5).
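A minimal Python sketch of this kind of parallel folding is given below. It is one possible arrangement, not the authors' Perl implementation: it assumes the ViennaRNA `RNAfold` executable is on the PATH and accepts the `--noPS` option, and the helper names `fold_with_rnafold` and `fold_all` are hypothetical. The folding function is passed in as a parameter so the scheduling logic can be exercised without RNAfold installed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def fold_with_rnafold(seq):
    """Fold one sequence by calling the RNAfold command-line tool.

    Assumes RNAfold prints the sequence on the first output line and the
    dot-bracket structure with its energy on the second.
    """
    result = subprocess.run(
        ["RNAfold", "--noPS"],  # --noPS: do not write PostScript drawings
        input=seq + "\n",
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()[1]


def fold_all(seqs, fold=fold_with_rnafold, workers=4):
    """Fold all sequences concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fold, seqs))
```

Threads suffice in this sketch because each worker spends its time waiting on an external process. With the timings reported above, the observed speedup is 90 / 32 ≈ 2.8.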

Performance on Arabidopsis data
To evaluate the performance of miR-Island, the accuracy of miRNA annotation was first measured on an Arabidopsis dataset. Of the 128 total miRNAs identified by miR-Island, 68 were known miRNAs registered in miRBase (Fig. 6A), indicating an accuracy of 53.1%. Using ShortStack, 38 of the 55 predicted miRNAs (69.1%) were registered in miRBase (Fig. 6A). The kappa coefficient between the two programs was 0.47 (Table 1), indicating moderate agreement. miRDeep-P predicted 175 miRNAs, but only 57 were registered in miRBase (Fig. 6A), giving the lowest prediction accuracy of the three tools (32.6%; independent-samples t-test, P < 0.05). The kappa coefficients for miR-Island/miRDeep-P and ShortStack/miRDeep-P of 0.54 (Table 2) and 0.43 (Table 3), respectively, indicated moderate agreement. MIREAP identified 214 miRNAs, but none of them were registered in miRBase (Additional file 3). Therefore, all subsequent performance analyses of miR-Island were based on comparisons with ShortStack and miRDeep-P. In terms of the efficiency of the three tools on the Arabidopsis dataset, miRDeep-P and ShortStack completed the miRNA annotation in 200 and 45 minutes, respectively, whereas miR-Island required only 18 minutes to finish the miRNA annotation and quantification (Fig. 6B). In addition, the memory usage of miR-Island was equivalent to that of ShortStack (2.7 GB) and less than that of miRDeep-P (3.0 GB) (Fig. 6C).
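The two statistics used in this comparison can be reproduced with a few lines of Python. The sketch below treats accuracy as the fraction of predicted miRNAs registered in miRBase, as in the text, and computes Cohen's kappa from a 2x2 agreement table. The function names are illustrative, and because the counts behind Tables 1 to 3 are not reproduced here, the kappa example in the usage note uses made-up counts.

```python
def known_fraction(known, predicted):
    """Percentage of predicted miRNAs that are registered in miRBase."""
    return 100.0 * known / predicted


def cohen_kappa(both, only_a, only_b, neither):
    """Cohen's kappa for two tools, from the four cells of a 2x2 table.

    both    : candidates called by both tools
    only_a  : called by tool A only
    only_b  : called by tool B only
    neither : called by neither tool
    """
    n = both + only_a + only_b + neither
    p_observed = (both + neither) / n
    p_a = (both + only_a) / n  # fraction called by tool A
    p_b = (both + only_b) / n  # fraction called by tool B
    p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    return (p_observed - p_expected) / (1 - p_expected)
```

With the counts reported above, `known_fraction(68, 128)`, `known_fraction(38, 55)` and `known_fraction(57, 175)` round to 53.1, 69.1 and 32.6. As a purely hypothetical kappa example, `cohen_kappa(20, 5, 10, 15)` gives 0.4.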

Performance on other plant datasets
For further comparison, the three tools were used for miRNA annotations of tomato, rice and maize datasets. miR-Island was consistently the fastest tool for miRNA annotation (Fig. 7). For the prediction of tomato miRNAs, miR-Island required only about 9% and 2.5% of the time spent by ShortStack and miRDeep-P, respectively (Fig. 7A). For rice miRNA prediction, it required only about 20% and 0.2% of the time spent by ShortStack and miRDeep-P, respectively (Fig. 7B). For maize miRNA prediction, miR-Island required only about 50% of the time spent by ShortStack (Fig. 7C). miRDeep-P could not finish the maize miRNA prediction because of insufficient memory. Overall, miR-Island was more memory-efficient than the other two programs (Fig. 7D-F). Finally, miR-Island predicted the largest number of known miRNAs among the three programs (Fig. 7G-I).
Performance on S. miltiorrhiza data
The performance of miR-Island was also evaluated on a scaffold-level S. miltiorrhiza genome assembly, and the results are shown in Fig. 8. miRDeep-P predicted a total of 118 miRNAs and completed the miRNA analysis in about 11 hours with a maximum memory usage of 2 GB. Only 8 of the predicted miRNAs were in miRBase, corresponding to an accuracy of 6.8%. ShortStack ran for 40 minutes with a maximum memory usage of > 6 GB, and a total of 42 miRNAs were predicted, corresponding to an accuracy of 19%. miR-Island completed the miRNA prediction in only 17 minutes with a maximum memory usage of 2 GB, and 13 of the 69 predicted miRNAs (18.8%) were known miRNAs. In summary, for miRNA prediction using a scaffold-level S. miltiorrhiza reference genome assembly, miR-Island was faster than ShortStack and had similar accuracy, and it was faster and had greater accuracy than miRDeep-P (Fig. 8).

Discussion
At present, almost all of the available tools that analyze sRNA-seq data to identify miRNAs, including miRDeep-P [2], ShortStack [14] and MIREAP, first align the sRNA library reads to the genome and then extract the candidate precursors according to certain strategies. For some species, only a scaffold-level genome assembly is available. In these cases, the available reference genome comprises a large number of scaffold sequences, and each individual scaffold is not very long. For example, there are more than 21,000 scaffolds in the S. miltiorrhiza genome [30], and the average scaffold length is only about 5 kb. In contrast, the 12 rice chromosome sequences are 20-40 Mb in length [35]. A large number of discrete scaffold sequences greatly reduces the efficiency of extracting candidate precursors. To circumvent this inefficiency, miR-Island uses a pseudogenomic method to reduce the number of discrete sequences that must be read by the computer during the precursor extraction step. The pseudogenomic strategy improved the processing speed of miR-Island more than 4-fold compared to the non-pseudogenomic method, in which the scaffold sequences are not transformed into a pseudogenome.

Conclusions
miR-Island is a single-command Perl script developed as an ultrafast and memory-efficient tool for plant miRNA annotation and quantification on a standard personal computer. The key steps for the speedup are as follows: 1) to excise the precursor candidates with high efficiency, miR-Island first determines the type of reference genome provided. If the provided reference genome consists of whole chromosome sequences, as for Arabidopsis or rice, the genome will be

Availability of data and materials
Not applicable

Competing Interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Funding
This work was supported by the Natural Science Foundation of China (grant no. 31372075), the Natu Tech University.

Authors' Contributions
WZ and WJ conceived of the study. WZ designed and implemented the software. TG and XM tested the software. TG and XM wrote the manuscript. All authors read, commented on and approved the final manuscript.

Figure 1 A schematic diagram of miR-Island for plant miRNA annotation and expression analysis.
Figure 3 Multithreading as a parallel computing strategy for secondary structure prediction.
Figure 4 Performance time analysis for the extraction of miRNA precursors from a scaffold-level assem
Figure 5 Performance time analysis for miRNA precursor prediction based on multi- and single-thread
Figure 6 Performance evaluation of miR-Island compared to the miRDeep-P and ShortStack programs
Figure 7 Performance evaluation of miR-Island, miRDeep-P and ShortStack on tomato, rice and maize