CYP20A1 contains a unique 3’UTR with Alu-driven divergence
Our previous work had identified 319 Alu exonized genes wherein co-occurrence of regulatory events coalesced within Alus (19). Literature mining revealed that 91 out of these 319 genes map to apoptosis and nearly 75% of them cluster around three discrete hubs: cell cycle-DNA damage response (p53 hub; 31 genes), mitochondrial events (mito hub; 22 genes) and proteostasis (ubi hub; 15 genes) (Supplementary Information S1 and Table S1). As the majority of these exonization events occur in the 3’UTRs of transcripts, which can modulate miRNA regulatory networks, we focused on identifying specific events in the 3’UTRs of these genes.
We found a transcript isoform of CYP20A1 gene (referred to as CYP20A1_Alu-LT hereafter) that has an 8.93kb long 3’UTR, 65% of which is derived from the exonization of 23 Alus (Figure 1a). Since this transcript has an unusual density of Alus across the length of the UTR, we characterized it further for regulatory potential. There is a considerable disparity in the annotation of CYP20A1 transcript isoforms on different genomic portals (10, 9 and 4 annotated isoforms in Ensembl, NCBI and UCSC, respectively; see Supplementary information, S1). Amongst the nine transcripts of human CYP20A1 annotated in NCBI, experimental evidence (i.e., support from RNAseq reads or microarray probe signals) is available only for CYP20A1_Alu-LT (NM_177538.2), the longest isoform (10.94 kb). It is an outlier in terms of its 3’UTR length as it occupies the 85th position in the genome-wide length distribution of 3’UTRs (Figure 1b). Such extended 3’UTRs are extremely rare and even among Alu-exonized genes (i.e., genes with Alu exonization events documented in one or more of their transcript isoforms), less than 3% have UTRs longer than 6kb (Supplementary Figure S1). The length and enrichment do not seem to correlate with the density of exonized Alus (r = 0.25) compared to the genomic average of 5.42 exonization events/ 3’UTR. The Alus in CYP20A1_Alu-LT belong to subfamilies of different evolutionary ages, suggesting that their insertion could have happened over a period.
Genomic region proximal to CYP20A1_Alu-LT 3’UTR is relatively well conserved
The coding region of CYP20A1_Alu-LT is remarkably well conserved among vertebrates, both at the sequence level as well as the length of the mature protein (Table 1). The chimpanzee, macaque and mouse CYP20A1 code for the same 462-470aa protein as in humans although their annotated transcript orthologs range between 1-3kb. Multiple sequence alignment across vertebrates reveals a strong conservation at both the N and the C terminals (Supplementary Figure S2); of the first 100aa, 62 are completely conserved while 18 contain lineage specific substitutions with residues that have similar functional groups. This is also corroborated by the minimal evolutionary divergence across vertebrate CYP20A1 proteins (Figure 1c) and a strong purifying selection in CDS (Ka/Ks ~0.2 in mammals and <0.1 in non-mammalian vertebrates) (Table 1).
On the other hand, 3’UTR extension in CYP20A1_Alu-LT seems to be mediated by the primate specific insertion of Alus. Its orthologs in mouse, rat and zebrafish are extremely short (within 1kb). In mouse, we observe a sparse presence of two B1 SINEs, one each of simple repeat and low complexity repeat whereas the zebrafish 3’UTR lacks repeats altogether. The longest annotated CYP20A1 transcripts for mouse (NM_030013.3), rat (NM_199401.1) and zebrafish (NM_213332.2) are 2.27, 2.03 and 1.79kb, respectively. The 5’UTR appears to be well conserved across the primate lineage (except lemur and proboscis monkey); however, the divergence in the 3’UTR, as evident from Jukes Cantor measure, increases as we move from the great apes to rhesus macaque and is primarily contributed by the Alus, with the breakpoints mostly coinciding with an Alu insertion (Figure 1d). It shows maximum divergence from mouse that were treated as a non-primate evolutionary out-group. To control for the length difference between the 5’ and 3’UTRs, we also checked for conservation in the 10kb region upstream of the first transcription start site (TSS) of CYP20A1 and 10kb downstream of transcription end site (TES) and found it to be almost perfectly conserved among the higher primates, except for some New World monkeys (Supplementary Figure S3). Taken together, these observations suggest that insertion of exonized Alus might have contributed to the specific divergence of this 3’UTR, in a genomic region that is otherwise conserved, at least among the higher primates.
Since this UTR seems to have appeared relatively late in the primate evolution, we tested if it carries variations that can differentiate modern human populations. Among the 23 SNPs in CYP20A1 3’UTR (16 within Alus), 11 have average heterozygosity scores >0.2, some of them as high as 0.48. We analyzed the data from 1000 Genomes Phase I and found significant differentiation for seven of these SNPs with global FST values ranging between 0.2-0.4 (Supplementary Table S2). Interestingly, we also found a GWAS SNP (rs11888559, C/T, T=0.237/1187) in this UTR, which is associated with height in Filipino women (27) with a global FST of 0.36, rs11888559 differentiates the east Asian (CHB) and European (CEU) from the ancient African (YRI) population. Expectedly, it also exhibits high-derived allele frequencies (DAFs) in these populations (0.81 and 0.95 in CHB and CEU, respectively). We also found another SNP rs7577078, within Alu, with high DAFs in all the three populations.
Characterization of CYP20A1_Alu-LT
We next investigated whether the full-length transcript containing this uniquely diverged 3’UTR is actually transcribed. As 2/3rd of this 3’UTR comprise repetitive sequences, it was a challenge to capture the full-length transcript in expression arrays or map it uniquely from sequencing reads. Moreover, there are differences in annotations regarding the full-length 3’UTR-containing isoform in various genomic portals (Supplementary information, S1). Therefore, we designed eleven pairs of primers spanning the entire length of the transcript and experimentally confirmed the expression of CYP20A1_Alu-LT. We validated three of its amplicons by Sanger sequencing to negate spurious amplification from other Alu-rich loci in the genome (Figure 2a, also see Supplementary information, S1 for detailed method).
We observed variable expression of CYP20A1_Alu-LT in the six cell lines that we had initially tested (Figure 2a). Since we had used cancerous cell lines, its expression could potentially be attributed to the aberrant transcriptional profiles in cancer (28). To delineate if CYP20A1_Alu-LT expression is due to the cancerous state of the cells, we compared its expression in a neuroblastoma cell line (SK-N-SH) with those in primary neuron, glia (astrocyte) and neural progenitor cells (NPC). Neuroblastoma shares features with both mature neurons and NPCs, but is distinct from glia and we found that CYP20A1_Alu-LT expression differs significantly only between glia and SK-N-SH but not in neurons or NPCs (Figure 2b). These suggest that our observations in the cancerous cell lines are unlikely to be artifactual.
We selected MCF-7, a breast adenocarcinoma cell line, for some of our subsequent experiments as it has been extensively used for drug screening and studying the effect of xenobiotics on different CYP family genes (29,30). The copy number of CYP20A1 is not altered in this line (2n=2) (31). We performed 3’RACE to determine the exact transcription termination site for CYP20A1_Alu-LT. This was followed by nested PCR and amplicon sequencing to confirm CYP20A1_Alu-LT as a bonafide RNA transcript (Figure 2c, Supplementary information, S3). Our findings are further supported by the TargetScan (release 7.2) which builds on the longest Gencode 3’UTR and reports even longer, 12.85kb UTR (ENST000000356079.4). The algorithm calculates 3’UTR length based on 3P-seq tags, accounting for the usage of mRNA cleavage and splice sites, normalized across multiple tissues.
Exon skipping differentiates CYP20A1_Alu-LT from the protein coding isoforms
We observed that the expression of CYP20A1_Alu-LT is rather low although CYP20A1 protein is relatively abundant (Supplementary Figure S4), suggesting that other isoforms may contribute to protein levels. When we compare CYP20A1_Alu-LT with the shorter 3’UTR containing isoforms, we observe a skipping of the sixth exon in this transcript. Using primers encompassing the sixth exon, we could distinguish between the transcripts; with the larger isoform (i.e., CYP20A1_Alu-LT) corresponding to the 196bp amplicon and shorter ones to 277bp which also show a much higher expression (Figure 2d).
In order to assess the relative contribution of different isoforms to the overall expression of CYP20A1, we used RNA-seq data from 15928 single nuclei derived from the different layers of the human cerebral cortex (32). NM_177538 (CYP20A1_Alu-LT) is expressed in 75% of the nuclei whereas all the other RefSeq isoforms are found in <1% (cut-off CPM≥50). There are 7038, 5134 and 1841 single nuclei in which NM_177538 (but no other isoform) is expressed with ≥10, 50 and 100 reads, respectively (Supplementary Table S3). Interestingly, it is expressed in rosehip neurons - a highly specialized cell type in humans (Supplementary Table S4) (33).
Although the long 3’UTR transcript is annotated as the principal isoform, its expression level did not correlate with CYP20A1 protein which is relatively abundant in MCF-7 cells (Supplementary Figure S4). To probe further, we performed in silico translation of all CYP20A1 isoforms in six-frames and compared them to the annotated human CYP20A1 protein. The two short 3’UTR isoforms matched – one perfectly and another with an additional amino acid stretch (Figure 2d), but the CYP20A1_Alu-LT goes out of frame in the sixth and seventh exon and BLAST analysis of the human proteome does not report any hits with the truncated 24 amino acid peptide. Taken together, these data suggest that the CYP20A1_Alu-LT is unlikely to be coding for CYP20A1 protein and may represent a novel non-coding transcript isoform originating from the same locus. This may be a case of evolutionary sub-functionalization of a gene into two different classes of transcripts that might have evolved for different functions (Figure 2e) (32).
CYP20A1_Alu-LT expression in non-human primates
Among the non-human primates, we did not find any annotated transcripts beyond 3kb from this locus. Our preliminary analyses of expression data, from public databases, of chimpanzee and macaque (prefrontal cortex, CD4+ T cells) did not yield any reads mapping to this 9kb 3’UTR. Subsequently, we checked for CYP20A1_Alu-LT expression in the reference transcriptomes of non-human primates (http://www.nhprtr.org) (33). Total RNA reads derived from 157 libraries of 14 non-human primate species show consistent mapping patterns on CYP20A1 3’UTR. Mapping is higher in the neighboring coding exons, but the pattern is consistent across different tissues and the number of reads comparable, with a slightly higher expression in kidney and lungs. In chimpanzee, reads are evenly distributed across the length of the entire 3’UTR; however, distribution is patchy in the other Old world monkeys (with peaks mostly in the non-repeat regions). Expression is minimal in New world monkeys (marmoset, squirrel monkey) and completely absent in lemur, although the adjoining coding exons show comparable expression, suggesting that CYP20A1_Alu-LT is expressed only in the higher primates (Supplementary Information S3).
CYP20A1_Alu-LT 3’UTR as an evolving miRNA regulatory hub
Our earlier study had revealed that 3’UTR exonized Alus could provide novel miRNA binding sites thereby making the transcripts amenable to miRNA mediated post-transcriptional regulation (21). We explored whether this recently diverged, Alu-rich 3’UTR could have evolved as a regulatory hub. We first checked whether the 3’UTR of CYP20A1_Alu-LT is also targeted by miRNAs. A query in miRTarBase (release 6.0) (34) revealed that CYP20A1_Alu-LT 3’UTR had predicted target sites for 169 miRNAs (supported by microarray/ sequencing data), of which 46 were listed as functional miRNAs in FuncMir (miRDB) (Supplementary Table S5). Most strikingly, ~50% of these are either primate-specific or human-specific miRNAs (microRNAviewer) (35). The occurrence of target sites for human-specific miRNAs in this recently evolved UTR prompted us to carry out further in-depth analysis of miRNA recognition elements (MREs).
Since our 3’UTR has diverged across the vertebrate phylogeny, we did not consider algorithms that employ evolutionary conservation as prediction criteria. Many algorithms which predict target sites based on seed sequence matches also seem to have limitations as the length and position of the seed sequence is variable amongst miRNAs (36,37). In order to reduce false MRE prediction in non-conserved regions, we used miRanda that employs a two-step strategy: sequence complementarity, followed by thermodynamic stability of the predicted miRNA-mRNA duplex (38). Using stringent cut-off criteria, we obtained a total of 4742 MREs for 994 miRNAs, 4500 of which overlap with Alus (4382 MREs, if a conservative estimate of >50% overlap is considered) (Supplementary Table S6). The MREs overlapping with Alu elements are considered as Alu-MREs.
These 4742 MREs span the entire length of the 3’UTR along with several high density pockets in Alu regions (Figure 3a). The 23 exonized Alus mainly belong to Alu S and J family and are from 13 different subfamilies - AluSx, AluSp, AluSc, AluSz6, AluSq2, AluSx3, AluSc8, AluSx1, AluSz, AluSg, AluJo, AluJb and AluJr. Their presence into the 3’UTR of CYP20A1_Alu-LT in 5’ to 3’ direction is represented in Figure 3a from top to bottom of the circos plot. The 994 miRNAs were grouped on the basis of numbers of MREs present in the 3’UTR of CYP20A1_Alu-LT. MREs are grouped in range of 1-5, 6-10, 11-20, 21-43, for group 1 (G1), group 2 (G2), group 3 (G3) and group 4 (G4), respectively. The total numbers of miRNAs in each group are 702, 178, 92, 22 for G1, G2, G3 and G4, respectively. Only 2% of total miRNAs have MREs more than 20 (Group G4), whereas ~70% of the miRNAs were in the group G1 with ≤ 5 MREs. The miRNAs present in G4 are shown in figure 3a, with their number of MREs on 3’UTR written in brackets. The connections in circos plot show the presence of binding sites in each Alu. Non-Alu region is grouped as one and shown at the bottom of the circos. Majority of sites are present in Alus as all the connections from each of the group fall in all the Alu elements. Only ~ 5% of miRNA binding sites fall in non-Alu regions (Figure 3a).
It is plausible that the accumulation of so many Alu-MREs (miRNA binding sites present in Alu elements) in this 3’UTR has been due to the retrotransposition or recombination of Alus with pre-existing target sites. To test this possibility, we carried out analysis on 1000 sets of 23 Alus taken randomly from the genome with matched length composition and subfamily. We did not observe a similar distribution of Alu-MREs - only 0.5 and 4.2% of these random sets had MREs ≥4742 and 4500, respectively (Figure 3b). All the 23 Alus on CYP20A1_Alu-LT UTR are diverged from the consensus sequences of their respective subfamilies; some of these substituted bases might have aided the creation of MREs within this UTR. Next, we sampled the distribution of MREs for each subfamily randomly picking 2000 members from anywhere in the genome and asked where in this distribution CYP-Alus lie. As expected, CYP-Alus were clear outliers for all 13 subfamilies. However, when a similar distribution is plotted only for 3’UTR-Alus, CYP-Alus lie above the median (but within the distribution) for eight (out of 13) subfamilies viz., AluJo, Sc, Sc8, Sg, Sx, Sx1, Sx3 (1 out of 3) and Sz (Supplementary information, S4). Taken together, these data convey that there are certain subsets of 3’UTR-Alus, at least for the 13 subfamilies analyzed, that have accumulated MREs. This suggests that the chance of Alus having retrotransposed into 3’UTRs with pre-existing MREs is extremely low and these MREs have been created within Alus post exaptation. An alternative, albeit less likely, possibility is greater retention of MRE sites within 3’UTR Alus than elsewhere in the genome. However, even among the 3’UTRs, the propensity of Alu elements with high MRE content to occur in tandem in a single UTR is low i.e., CYP20A1_Alu-LT UTR contains many instances of such high MRE containing Alus whereas other UTRs have relatively fewer such instances. One possibility is that accumulation of MREs could potentiate its function as miRNA sponge for a regulatory network (Figure 3c).
CYP20A1_Alu-LT isoform functions as a potential miRNA sponge
To determine if CYP20A1_Alu-LT can be a potential miRNA sponge, we characterized this 3’UTR further using bioinformatics and experimental approaches. First, we checked its level using RT-qPCR in both nuclear and cytosolic fractions and found that it is predominantly localized to the cytosol - a feature observed in most sponges (Figure 4a). A sponge RNA also typically contains 4 to 10 low binding energy MREs for a particular miRNA that are separated by a few nucleotides and is generally devoid of destabilizing RNA elements. In CYP20A1_Alu-LT, using a stringent cut-off for MRE prediction (binding energy≤ -25kcal/mol), we observed miRNAs with as many as 43 MREs and binding energy as low as -47kcal/mol. Out of the 994 miRNAs, 140 have ≥10 MREs and are distributed across the length of the UTR (Figure 3a, also see Supplementary information, S4).
We next checked for the presence of bulge within the MREs for the 23 prioritized miRNAs (Table 2) using miRanda with default parameters. To screen for MREs that would efficiently dock miRNA without degrading the CYP20A1_Alu-LT transcript, we used twin criteria – a complete match (2-7) in 6-mer seed site and presence of mismatch or insertion at 9-12 position. 6-mer sites with wobble base pairing were also retained as two wobble-pairs were maximally present in some of the MREs. We found five such sites for miR- 6724-5p, two each for miR-1254, miR-4767 and miR-3620-5p and one each for miR-941, miR-4446-3p, miR- 296-3p, miR-619-5p, miR-6842-3p and miR-1226-5p (Table 3). At all these sites, we observed insertion in CYP20A1_Alu-LT, which suggests the possibility of a bulge formation in the sponge RNA. This can potentially prevent CYP20A1_Alu-LT from miRNA directed degradation and increase its efficiency to sequester miRNA molecules.
Potential sponge activity of CYP20A1_Alu-LT in primary neurons in response to heat shock and HIV1-Tat
To probe if the alteration in CYP20A1_Alu-LT level could affect expression of transcripts containing cognate MREs, we looked for conditions where it is likely to be altered. In these conditions the miRNA that targets these MREs should also be expressed. We anticipated that in conditions where there is a higher expression of the potential ‘sponge’ (CYP20A1_Alu-LT), the abundant MREs would sequester the miRNAs. This should relieve its other cognate targets and we should observe a higher expression of those genes. Whereas in conditions where CYP20A1_Alu-LT is downregulated, the miRNA would be free to bind its cognate targets, thereby reducing their expression (Figure 3c).
We first queried for the expression of the 994 miRNAs having potential MREs in CYP20A1_Alu-LT from miRNA expression profiles available in public datasets. These experiments, mostly microarray based, showed low concordance across replicates and high variability across experiments (Supplementary information, S1). So we tested this experimentally in MCF-7 and primary neurons. Since primary neurons preferentially express longer 3’UTRs, we reasoned it would be an ideal model to study miRNA-mediated regulation events(39,40). We carried out small RNA-seq and using a cut-off of at least 10 MREs on CYP20A1_Alu-LT 3’UTR and TPM value of 50, we obtained a set of 21 and 9 miRNAs in MCF-7 and neurons, respectively, of which 7 were common to both (Table 2).
Since CYP20A1 has been identified as a candidate from a set of Alu exonized genes that map to apoptosis, we asked if this would respond to triggers that induce cell death. HIV1-Tat is a potent neurotoxin that kills ~50% more neurons compared to the vehicle control (Figure 4b). Upon treating primary human neurons with HIV1 full length Tat protein, followed by 6 hours recovery, CYP20A1_Alu-LT was found to be significantly upregulated (1.65 fold). However, progenitor cells, which are immune to Tat (Figure 4b), did not show any such trend (Figure 4c). CYP20A1_Alu-LT’s 3’UTR also carries 17 potential binding sites for HSF1, 14 of them within Alus, which show positional conservation in agreement with previous studies (41). This suggests that this transcript may also be amenable to antisense-mediated downregulation during heat shock response as demonstrated by an earlier work from our lab (41). We found CYP20A1_Alu-LT to be significantly downregulated (2.68 folds) in primary neurons upon heat shock, followed by 1hr recovery (Figure 4d).
In order to query the expression of the other cognate targets of the 9 prioritized miRNAs in these two conditions, we performed stranded RNA-seq of primary neurons after these treatments. The expression of CYP20A1_Alu-LT in RNA-seq showed similar patterns of expression as observed in RT-qPCR, significantly downregulated 2.68 folds (log2FC=-1.42) upon heat shock (HS) recovery and 1.21 folds upregulated (log2FC=0.28) during Tat response. The latter, however, did not cross our stringent statistical significance threshold. Out of the 3876 genes differentially expressed in HS or Tat, 380 exhibit positively correlated expression patterns as CYP20A1_Alu-LT (Figure 5a, Supplementary Table S7). All of these 380 genes contain at least one MRE for one or more of the 9 prioritized miRNAs and the majority of their MREs are canonical and not Alu-derived. On the contrary, CYP20A1_Alu-LT contains a total of 116 MREs for all these 9 miRNAs combined (Supplementary Table S7). There is a significant enrichment of MRE sites (for our 9 prioritized miRNAs) in this set of 380 genes compared to all expressed genes (complete transcriptome; FPKM>2) or those with a significant differential expression (FPKM>2; FDR<0.05) (Table 4; Kolmogorov-Smirnov test). The MRE distribution in expressed genes and significantly differentially expressed genes were not significantly different. Similar results were obtained by comparing the median distribution of MREs in the 3 sets using Mann-Whitney U test ( Supplementary Figure S5, Table 4). The abundance of MREs (for our 9 prioritized miRNAs), plotted against the FPKM values in significantly differentially expressed genes, exhibit a normal distribution (Shapiro-Wilk test, p<2.2e-16). Genes with high MRE counts have lower expression values and vice-versa (Supplementary Figure S6). To further validate the enrichment of MREs in the 380 genes, we performed Monte-Carlo simulation using one million random sets of 380 genes . The MRE densities in the 380 genes with expression pattern correlated to CYP20A1_Alu-LT, were clear outliers when plotted with the distributions derived from random sets (p-value 9.99999e-07), for all the miRNAs except miR-5096 (Figure 5b). The 9 miRNAs studied had nearly similar distribution of MREs across genes (Supplementary Figure S7). Taken together these data suggest that the set of 380 genes represents potential cognate targets whose expression levels can be modulated by CYP20A1_Alu-LT through competing for the miRNAs targeting them.
According to Gene Ontology analysis for this gene set using Toppfun (FDR <0.05) the top five processes were hemostasis (28 genes), axon guidance (25 genes), neutrophil degranulation (23 genes), platelet signaling, activation and aggregation (18 genes) and ECM organization (18 genes). Other processes include mRNA processing and mitochondria translation, metabolism, amino acid and nucleotide synthesis and antigen presentation (Supplementary Table and Figure S8). Pathways from the Topfunn enrichment were manually grouped into major biological processes. For the 9 miRNAs, analysis of target genes in the set of 380 genes and their pathways revealed blood coagulation to be the major biological process targeted by almost all miRNAs followed by neuron development (Figure 5c). This implies that CYP20A1_Alu-LT might be important in maintaining the homeostasis and fine tuning the neurological pathways.