Finding miRNAs originated by mRNA retrotransposition events
To identify miRNAs originating from mRNA retrocopies (retro-miRs), we developed a set of local pipelines and surveyed several databases (Fig. 1A). Briefly, we constructed and used a set of computational algorithms to assess and integrate information from three databases: miRBase, the reference database for information on several miRNAs [23]; miRIAD, a database containing annotation and further data on intragenic miRNAs and their host genes [24, 25]; and RCPedia, a database of retrocopies present in humans and other species [26]). Further information about the investigated genes (retrocopies and miRNAs), such as their conservation, expression, miRNA targets, and functional information, was retrieved from other databases (e.g., TargetScan [27], miRTarBase [28], FANTOM phase 5, and The Cancer Genome Atlas (TCGA) and locally processed (Fig. 1A). Further details are available in the Methodology section.
Using this strategy, we identified 17 retro-miRs in retrocopied sequences. These miRNAs were grouped into three distinct classes: retroposed (R-retro-miRs), exon junction (EJ)-retro-miRs, and novel retro-miRs (N-retro-miRs) (Fig. 1B). Six R-retro-miRs (35.3%; miR-4444-2, miR-1244-2, miR-1244-3, miR-1244-4, miR-1244-5, and miR-1244-6) are retroduplications of exonic miRNAs already present in their parental gene sequences (Fig. 1B, Supplementary Figure S1). These include four R-retro-miRs that are fully identical to their parental miRNAs and two others that have variations in their precursor stem, including one retro-miR with variations in the mature region (outer seed region). We found seven EJ-retro-miRs (41.2%; miR-492, miR-622, miR-4426, miR-4426-1, miR-3654, miR-3654-1, and miR-10527), which were also in the exonic region of their parental genes but spanned two exons (Fig. 1B, Supplementary Figure S2). Interestingly, none of these miRNAs have been annotated in the parental gene region, likely because they are split by an intronic sequence. The last retro-miR class, N-retro-miR, comprised four miRNAs (23.5%; miR-4788, miR-7161, miR-4468, and miR-572) that had a de novo origin in the retrocopied sequences because of the occurrence of mutations during evolution (Fig. 1B, Supplementary Figure S3). Upon checking the equivalent region in the parental gene, no evidence of a stem loop, that is, a putative miRNA, was found (Supplementary Figure S4).
Next, we investigated the characteristics of retrocopies containing retro-miRs. First, we found that the six R-retro-miRs present in both retrocopy and parental genes originated from six distinct retroduplication events but only from two parental genes: PTMA (five retrocopies with retro-miRs) and HNRNPA3 (one retrocopy) (Supplementary Table S1). Interestingly, these genes were highly retroduplicated in the human genome. PTMA presented 13 retrocopies and HNRNPA3 presented 17 retrocopies (Supplementary Table S2). In contrast, five distinct parental genes originated from the seven retrocopies with seven EJ-retro-miRs, and four distinct parental genes originated from four retrocopies, each with an N-retro-miR (Supplementary Table S1 and S2).
These results led us to investigate the events of chromosomal duplication of genes containing exonic miRNAs, in addition to the retro-miRs we identified. Briefly, we found that among the 19,768 protein-coding genes, 144 contained exonic miRNAs, of which 17 were duplicated by retrotransposition and 24 by DNA-mediated duplication. Interestingly, only two DNA-based duplications contained miRNAs in their duplicates (Supplementary Table S3). Taken together, these results suggest that retroduplication is a common pathway for the origin of novel miRNAs.
Characteristics and conservation of retro-miR events
To trace the evolutionary origin of each retrocopy and its retro-miRs, we used a homology-based approach (see Methods) to determine their conservation (Fig. 2A). First, this analysis revealed that all these retrocopies and their miRNAs originated in the primate lineage (Supplementary Table S4) and, in agreement with the literature on retrocopies, are primate-specific [19]. By correlating the ages of retro-miRs with their classes, we observed that the oldest events (conserved in all primates) were enriched in de novo N-retro-miRs. EJ-retro-miRs were spread across all ages, and R-retro-miRs were only present in great apes and up (Fig. 2A).
These results prompted us to investigate whether retrocopies containing retro-miRs evolved under functional constraints (purifying or positive selection) or under neutral conditions. To assess this, we used retrocopy-predicted coding regions to quantify the ratio of the number of nonsynonymous substitutions per nonsynonymous site to the number of synonymous substitutions per synonymous site (dN/dS). We found three retrocopies under neutral selection (dN/dS ~ 1, retrocopies containing retro-miR-4444-2, retro-miR-1244-2, and retro-miR-3654-1) and 12 retrocopies with a dN/dS ratio different from 1 (potentially under functional constraints) (Fig. 2D). However, only for three of these retrocopies, we confirmed a dN/dS value significantly different than 1 (p-value < = 0.05, likelihood ratio test), with two retrocopies under purifying selection (retrocopies containing retro-miR-1244-6 and retro-miR-7161) and one retrocopy (containing retro-mir-1244-5) under positive selection (Fig. 2D, Supplementary Table S5).
Taken together, these results indicate that retro-miRs originated during primate evolution in distinct waves. Additionally, considering the recent insertion of these retrocopies and retro-miRs into the evolutionary history of humans, we observed a tendency to purify evolutionary pressure on them.
Expression of retro-miRs in normal tissues.
After dissecting the genomic features of retro-miRs, we attempted to identify their transcription and target genes, which are fundamental features of functional miRNAs. We examined evidence of retro-miR transcription using short RNA sequencing (RNA-seq) [29] and CAGE data [30]. All retro-miRNAs (except retro-miR-10527) were expressed in normal samples (404 individuals, 32 tissues) and/or cell lines (48 cell lines) (Fig. 3A). Notably, we observed distinct expression patterns of retro-miRs in different tissues and cell lines. Interestingly, a novel retro-miR (mir-4788) was highly expressed (the highest median expression among all retro-miRs) in normal samples. Conversely, in stem cells, miR-4788 showed the lowest (median) expression.
Next, to further support the true expression of these loci containing retro-miRs, we evaluated the presence of a transcription start site (TSS) (up to 4 Kbp) upstream of their 5' end using CAGE sequencing. The results confirmed TSSs for 82% (14/17) of the retro-miRs (Fig. 3B). In agreement with the RNA-seq data, retro-miRs with a defined TSS had, in general, higher expression and were more broadly expressed in different tissues than retro-miRs without a TSS (Fig. 3B).
Next, we compared the expression of retro-miRs with that of all the other annotated miRNAs (Supplementary Table S6). Figure 3C shows that retro-miRs were in the middle of the first quartile in terms of expression levels and number of tissues expressed. Together, we observed that all retro-miRs had similar expression levels, which is expected for genes (miRNAs) that have recently evolved. Additionally, N-retro-miRs were among the miRNAs expressed in fewer tissues, a pattern that is expected for novel genes (miRNAs).
In a further investigation, we sought to identify the set of genes targeted by retro-miRs, a sine qua non feature of functional miRNAs [27]. Notably, we identified target genes for all retro-miRs using TargetScan (7mer-m8 and 8mer) and TargetScan (8mer), and experimentally validated the targets (Fig. 4A; Supplementary Table S7). Only one retro-miR (miR-10527) had no experimentally validated targets, which is in accordance with our previous results (Fig. 3), where we reported no expression of miR-10527. Next, to evaluate the global functions of these retro-miR targets, we performed a Gene Ontology investigation of their biological processes. We found that these target genes function in key cellular processes, including the regulation of transcription, DNA replication, cell proliferation and differentiation, and DNA binding (Fig. 4B; p-value < 0.05; fold enrichment > 1.5). Next, we separately examined each set of targets for all retro-miRs (Supplementary Figure S5) and identified two sets of target genes. The two novel retro-miRs (miR-4788 and miR-572) had gene targets enriched in neural biological processes, particularly those involved in the regulation of neurotransmitters and synapses (Fig. 4C, Supplementary Figure S5). In addition, miR-4788 is conserved from humans to marmosets, and miR-572 is conserved from humans to gorillas. This finding may be implicated in the phenomenon of species-specific trait heritability of miRNAs.
We also assessed whether the targets of the retro-miRs were enriched in specific Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways. Remarkably, we found enrichment of several cancer-specific pathways, including breast, colorectal, gastric, pancreatic, prostate, and lung cancers (Fig. 4D), in addition to more general cancer-related pathways, such as P53, ErbB, and MAPK signaling, as well as resistance to platinum drugs and EGFR tyrosine kinase inhibitors (Supplementary Table S8). In this analysis, several known cancer genes, including BRAF, KRAS, EGFR, and TP53 emerged as targets of retro-miRs.
Taken together, the expression and gene target investigation revealed that retro-miRs (except miR-10527) have reliable expression and validated gene targets, which are the two fundamental characteristics of functional miRNAs. We also observed that retro-miRs target genes involved in fundamental cell processes (e.g., cell growth, regulation of transcription, and post-transcriptional regulation) and that two novel retro-miRs (miR-572 and miR-4788) regulate genes related to neuronal functions. Additionally, we observed that these retro-miRNAs targeted key cancer genes, suggesting that they play oncogenic roles in cancer tissues.
Retro-miRs are highly or majorly transcribed in cancer tissues
Our previous results (Fig. 4D) prompted us to study the expression of these retro-miRs in cancer tissues. Using short RNA-seq data available in TCGA [31], we expressed mature retro-miRs in 10 solid tumors. Their normal counterpart tissues were used as controls due to the relatively limited number of normal samples. Overall, this investigation revealed that retro-miRs have distinct and high expression levels in cancer (Fig. 5). Specifically, miR-10527-5p, miR-3654, miR-4444, miR-4788, and miR-1244 were highly expressed in cancer tissues but were also expressed in several normal tissues (Fig. 5). Interestingly, miR-10527-5p was highly expressed in tumors, although it was not expressed in normal samples or cell lines (Fig. 3). Remarkably, five retro-miRs emerged as putative tumor biomarkers because they presented a very low (or absent) expression in normal samples and consistent expression in cancer. Notably, two retro-miRs (miR-622 and miR-492) were exclusively expressed in seven cancer types each: BRCA (miR-622), COAD (miR-622 and miR-492), LUAD (miR-622), PAAD (miR-622 and miR-492), PRAD (miR-622 and miR-492), STAD (miR-492), and THCA (miR-492) (Fig. 5). Two mature miRNAs from the same precursor (miR-7161), miR-7161-3p and miR-7161-5p, were also exclusively expressed in six cancer types: BRCA (miR-7161-3p and miR-7161-5p), COAD (miR-7161-3p), LUAD (miR-7161-3p), LIHC (miR-7161-5p), PAAD (miR-7161-3p and miR-7161-5p), and PRAD (miR-7161-3p). Finally, miR-4426 was found to be expressed exclusively in BRCA, COAD, KIRC, and PAAD.
Taken together, these results indicate that retro-miRs are highly expressed in cancer, with some retro-miRs presenting a cancer-specific expression pattern and thus, their potential as tumor biomarkers should be further studied.
Overall Survival based on groups of retro-miRs in TCGA samples
Having confirmed the high and distinct expression of retro-miRs in cancer, we sought to determine whether these miRNAs have prognostic value in the overall survival of patients with cancer. To assess this, we used Reboot [32], an algorithm that identifies gene signatures (including those of miRNAs) associated with cancer patient prognosis. Using this strategy in approximately 8000 samples from 10 cancer types presenting with overall survival (OS) data, we found two signatures with prognostic value: a signature with two retro-miRs (Fig. 6A-B) in liver hepatocellular carcinoma (LICH) and another with three retro-miRs with prognostic value in STAD (Fig. 6C-D). In detail, Fig. 6A shows the overall survival curve of two sets of patients with LICH: patients with higher expression of miR-4444 and miR-3654 presented significantly (p-value < 0.0002) lower overall survival (e.g., overall survival of ~ 1650 days for 50% of patients) than patients with lower miR-4444 and miR-3654 expression (e.g., survival of ~ 2600 days (60% more) for 50% of patients). In patients with STAD, three retro-miRs (miR-622, miR-4788, and miR-4444) with high expression were associated with worse overall survival. Patients with a higher expression of these retro-miRs had a lower overall survival (~ 850 days for 50% of patients) than patients with a lower expression of these retro-miRs (~ 1300 days for 50% of patients). Thus, these data also show the prognostic value of these retro-miRs in LIHC and STAD and add a greater layer of functionality (which needs to be further explored) to these retrocopy-derived miRNAs.