The expression of the primiRNAs was reliably predicted
We modeled the expression of primiRNA in 1829 primary cell types and tissues (Material and Methods). In brief, we considered the TSSs of a miRNA consistently annotated in at least four of fourteen previous studies as the TSS of this miRNA [18]. We then measured the expression of this miRNA by the relative log expression normalized counts of reads mapped to the neighborhood of the TSS in the Cap Analysis of Gene Expression (CAGE) experiments [23]. Finally, we modeled the expression of the primiRNAs with the expression of mRNAs in these 1829 CAGE experiments by the least absolute shrinkage and selection operator (LASSO) regression [24]. LASSO selected a subset of mRNAs for each miRNA. We called these selected mRNAs for a miRNA as its associated mRNAs.
We found that the expression of the associated mRNAs could reliably model the expression of primiRNAs (Fig. 1A). We calculated the correlation of the predicted expression value by the LASSO model with the true expression value. The minimum, mean, and median Pearson’s correlation per miRNA was 0.59, 0.91 and 0.91, respectively (pvalue = 0 for all correlations). Similar, the minimum, mean and median Spearman’s correlation per miRNA was 0.27, 0.77 and 0.80, respectively (pvalue = 0 for all correlations). If we measured the similarity of the predicted expression in every sample, the minimum, mean, and median Pearson’s correlation per sample was 0.51, 0.87 and 0.89, respectively (pvalue = 0 for all correlations). Correspondingly, the minimum, mean and median Spearman’s correlation per sample was 0.43, 0.81 and 0.82, respectively (pvalue < 3.12e14 for all correlations). The significant correlation suggested that the expression of the associated mRNAs could reliably model the primiRNA expression.
We further examined the miRNAs with their expression accurately predicted (Pearson’s correlation > 0.9) and the miRNAs with the expression not predicted so well (Pearson’s correlation < 0.6) (Table 1). Note that the correlation was larger than 0.59 for all miRNAs for the two types of correlation calculations. We found that the miRNAs with their expression accurately predicted were primiRNAs with much higher expression levels and much larger expression variation. On the contrary, miRNAs that were not modeled so well were primiRNAs with low expression and low expression variation. For instance, the miRNAs modeled well had a maximum expression value and standard deviation of 313.46 and 427.57, while the miRNAs modeled not so well had the corresponding value as 0.20 and 0.43, respectively. In fact, for every miRNA modeled not so well, they had zero expression in at least 65.12% of the samples. In other words, these miRNAs were not modeled so well because they were not so related to the experimental conditions these samples considered. If we excluded the miRNAs with their expression standard deviation smaller than 3 in these 1829 samples, the Pearson’s correlation of the predicted expression with the true expression was larger than 0.83 for each of the remaining 240 miRNAs (pvalue = 0). We called these 240 miRNAs active miRNAs in the 1829 samples. The good modeling of the expression of the 240 primiRNAs suggested that the expression of almost all primiRNAs could be predicted well by their associated mRNAs when related samples were considered.
Table 1
The minimum, maximum, and median of Pearson’s correlation coefficient of three groups of miRNAs. The miRNAs in the high group had a correlation > 0.9. The miRNAs in the low group had a correlation < 0.6. The miRNAs in the active group had the standard deviation of expression larger than three.
Group

Min correlation

Max correlation

Median correlation

Min expression

Max expression

Median expression

High

0.90

1.00

0.94

0.37

313.48

16.93

Low

0.59

0.59

0.59

0.13

0.13

0.13

Active

0.83

1.00

0.92

0.47

313.48

15.70

The above analysis was based on all 1829 samples. To see whether the LASSO models trained on a subset of samples could predict primiRNA expression on the remaining samples, we did tenfold crossvalidation (Fig. 1B). On average, the minimum, mean and median Pearson’s correlation per miRNA was 0.02, 0.65 and 0.67, respectively. The minimum, mean and median Spearman’s correlation per miRNA was 0.13, 0.65 and 0.65, respectively. Correspondingly, the minimum, mean and median Pearson’s correlation per sample was 0.13, 0.70, and 0.67, respectively. The minimum, mean and median Spearman’s correlation per sample was 0.15, 0.68 and 0.69, respectively. Although lower compared with the model trained with all 1829 samples, the correlation was still significantly large (pvalue of the mean and median correlation was 0), suggesting that the primiRNA expression was reliably predicted. The lower correlation in the crossvalidation also indicated that the regulation of the primiRNA expression was samplespecific, and the inferred model based on a specific subset of samples would predict primiRNA expression better for that particular subset of samples.
The associated mRNAs were different from target mRNAs
Previous studies modeled mature miRNA expression with the expression of their target mRNAs [8, 13]. The target mRNAs of a miRNA contain its target sites and can be bound by this miRNA. We thus compared the associated mRNAs inferred above with the target mRNAs for each miRNA. The target mRNAs of a miRNA were retrieved from two sources (Material and Methods). One was the computationally predicted targets by the TargetScan Version 7.2 tool [5]. The other was the annotated miRNA targets from the miRTarBase Version 8.0 database [25].
We found that the associated miRNAs were different from the target mRNAs for every miRNA. On average, only about 2.22% of the associated mRNAs were the targetScan mRNAs, and about 5.26% of targetScan mRNAs were the associated mRNAs for a miRNA. Similarly, the corresponding percentage was 1.06% and 5.56%, respectively, when the miRTarBase mRNAs were compared with the associated mRNAs for every miRNA. The rare overlap between the two types of mRNAs may make sense, since if the associated mRNAs were largely the target mRNAs, the primiRNAs would have a highly correlated expression with their mature miRNAs, which was not the case in previous studies [3, 4]. The rare overlap thus corroborates the previous observations that primiRNAs and mature miRNAs have in general uncorrelated expression patterns [3, 4].
Since the associated mRNAs modeled the primiRNA expression better than the target mRNAs, we hypothesized that their expression correlated better with the expression of the corresponding primiRNA than the expression of the target mRNAs. We found that for a miRNA, the expression of most of its associated mRNAs indeed correlated better with its expression than the expression of its target mRNAs across the 1829 samples, no matter whether the target mRNAs were defined by targetScan or miRTarBase (Fig. 2). When we compared the expression correlation of the associated mRNAs with the expression correlation of the target mRNAs, we found that 84.53% of miRNAs had a significantly higher correlation with their associated mRNAs than with their target mRNAs (MannWhitney pvalue < 0.001).
Despite the rare overlap of the two types of mRNAs for most miRNAs, two miRNAs, hasmir326 and hsmir454, both had seven associated mRNAs overlapped with the targetScan mRNAs. For these two miRNAs, their associated mRNAs significantly overlapped with their targetScan mRNAs (hypergeometric testing pvalue 0.026 and 0.006). Interestingly, the miRNA mir454 also had five associated mRNAs overlapped with the miRTarBase mRNAs (hypergeometric testing pvalue 0.001).
The associated mRNAs were biologically sound
The above analysis showed that the associated mRNAs had more correlated expression pattern with the primiRNAs than the target mRNAs, implying the functionality of these associated mRNAs for each miRNA. We further investigated other properties of the associated mRNAs. We found that most of them were not affected by the neighbor sizes around miRNA TSSs and different samples used. Moreover, these associated mRNAs for most miRNAs had enriched functional annotations.
We checked how different neighborhood sizes might affect the associated mRNAs inferred for a miRNA (Material and Methods). We considered 100, 300 and 500 base pairs around an annotated miRNA TSS as the TSS regions to measure the normalized gene expression of miRNAs and mRNAs. For the three neighborhood sizes, the median number of associated mRNAs identified was 164, 163, and 164, respectively, indicating the robustness of the inferred associated mRNAs. We also found that more than 77.0% of the associated mRNAs were the same for a miRNA when different neighborhood sizes were used, suggesting that the associated mRNAs were likely to be biologically meaningful and intrinsically related to the corresponding miRNAs. Although most associated mRNAs were the same for a miRNA, a fraction of them were different, most likely due to the conditionspecific expression of the primiRNAs.
We also studied how different samples may change the associated mRNAs inferred for a miRNA. For a given miRNA, we compared its associated mRNAs inferred from each of the ten folds in the tenfold crossvalidation with the associated mRNAs inferred with all samples (Fig. 3A). Here the neighborhood size was set to be 100 base pairs since different neighborhood sizes did not change the predictions much. Interestingly, the median number of the associated mRNAs was still around 163. Moreover, on average, about 80.79% of the associated mRNAs were shared between every fold and the model with all samples. We also found 79.78% of the associated genes were shared between every fold and at least four other folds (Fig. 3B). Together with the above analysis, it was evident that most associated mRNAs were the same when different neighborhood sizes or different subsets of samples were used. It also showed that a fraction of the associated mRNAs were conditionspecific, occurring in specific sets of samples.
With the associated mRNAs for each miRNA, we investigated whether they significantly shared gene ontology (GO) functions [26]. With the false discovery rate cutoff 1, more than 58.63% primiRNAs had at least one GO term significantly shared by its associated mRNAs. Among these primiRNAs, the median number of significantly shared GO terms was three. The GO analysis suggested that the associated mRNAs were likely biologically meaningful. Note that not all miRNAs had their associated mRNAs significantly shared GO functions, partially due to the imperfect GO annotation.