Stable intronic sequence RNAs (sisRNAs) are selected regions in introns with distinct properties

doi:10.21203/rs.2.14880/v4

Background: Stable introns and intronic fragments make up the largest population of RNA in the oocyte nucleus of the frog Xenopus tropicalis . These stable intronic sequence RNAs (sisRNAs) persist through the onset of zygotic transcription when synchronous cell division has ended and the developing embryo consists of approximately 8000 cells. Despite their abundance, the sequence properties and biological function of sisRNAs are just beginning to be understood. Results: To characterize this population of noncoding RNA, we identified all of the sisRNAs in the X. tropicalis oocyte nucleus using published high-throughput RNA sequencing data. Our analysis revealed that sisRNAs, have an average length of ~360 bps, are widely expressed from genes with multiple introns, and are derived from specific regions of introns that are GC and TG rich, while CpG poor. They are enriched in introns at both ends of transcripts but preferentially at the 3’ end. The consensus binding sites of specific transcription factors such as Stat3 are enriched in sisRNAs, suggesting an association between sisRNAs and transcription factors involved in early development. Evolutionary conservation analysis of sisRNA sequences in seven vertebrate genomes indicates that sisRNAs are as conserved as other parts of introns, but much less conserved than exons. Conclusion: In total, our results indicate sisRNAs are selected intron regions with distinct properties, and may play a role in gene expression regulation .

Epigenetics & Genomics

sisRNA

intron

noncoding RNA

oocyte

transcription

Xenopus

RNA is one of the three major macromolecules essential for living organisms and consists of three major types: messenger RNA (mRNA), transfer RNA (tRNA), and ribosomal RNA (rRNA). Whereas mRNAs code for proteins [1], tRNAs and rRNAs are non-coding RNAs that play essential roles in protein translation ([2, 3]). In the past 15 years, several additional types of non-coding RNAs have been shown to play important roles in regulation of gene expression [4], including microRNAs (miRNA, 21-22 nt) [5], small interfering RNAs (siRNA, 20-25 nt) [6], and Piwi-interacting RNAs (piRNA, 29-30 nt) [7]. While these regulatory non-coding RNAs are usually very small (< 50 nt), there are also the long non-coding RNAs (lncRNA), which are longer than 200 nucleotides [8], and most recently identified, the stable intronic sequence RNAs (sisRNAs), the majority of which are several hundreds of nucleotides [9, 10].

In higher eukaryotes, the majority of protein-coding genes have one or more non-coding introns interspersed within the coding sequence that are spliced from the primary transcript [11]. The spliced introns are primarily in the form of a lariat in which the 5’ end is linked to the 3’acceptor splice site. These lariats are debranched into a linear form and then degraded rapidly in most cases [12, 13]. The majority of intron fragments are believed to be unstable, with a few exceptions [14-17]. Recently, a number of sisRNAs were identified in the oocyte nucleus (germinal vesicle or GV) of Xenopus tropicalis [9] and later in the oocyte cytoplasm [10]. Although less stable than cytoplasmic mRNA, the sisRNAs are very stable. Transcription inhibition studies revealed that they are stable for at least 2 days and RNA sequencing analysis demonstrated that they are transferred to the egg upon GV breakdown and persist until at least the blastula stage [9]. These sisRNAs are usually only a portion of the full intron and while nuclear sisRNAs are present as either linear or lariat molecules, most cytoplasmic sisRNAs are in lariat form [9, 10]. Thus far, sisRNAs have also been found in human, mouse, chicken, zebrafish the Epstein-Barr virus and Drosophila melanogaster [9, 18-22]. Little is known about the sequence properties and biological function of these abundant sisRNAs although recently, sisR-1 and sisR-4, have been shown to participate in a feedback loop to modulate their parental gene expression in Drosophila [20, 22-24].

To further characterize sisRNAs, we identified all sisRNAs in the X. tropicalis genome using the published high-throughput sequencing data of RNA from the GV [9]. We then determined the average length of sisRNAs, their distribution in genes, sequence composition, transcription factor binding site (TFBS) enrichment, gene ontology and evolutionary conservation. Here we show that sisRNAs are most widely expressed from genes with multiple introns, enriched in GC and TpG. They are enriched in introns at both ends of transcripts but preferentially at the 3’ end. They also contain specific transcription factor binding sites (TFBS), which supports recent findings that sisRNAs play a role in the regulation of gene expression.

Genome-wide identification of sisRNAs in Xenopus tropicalis

To study the sequence characteristics of sisRNAs, the high-throughput sequencing data of RNA (RNA-seq) from the GV of the frog Xenopus tropicalis [9] was used to detect sisRNA peaks with the model based analysis for ChIP-seq (MACS) algorithm [25]. We identified a total of 63,410 RNA peaks by a more stringent criterion (FDR=0.01) (Figure 1A, Table 1) and compared the location of each RNA peak to exons and introns to identify the true sisRNA peaks (i.e. those from spliced introns that did not cross intron–exon boundaries.) Two widely used and primarily manually annotated gene sets, RefSeq (9,448 protein-coding genes) and Ensembl (28,967 protein-coding genes), were used as references to identify the sisRNA peaks. 24,901 sisRNAs were identified by refSeq genes (termed as refSeq sisRNAs), with a total length of ~9 Mbps, which accounts for 0.6% of the X. tropicalis genome (Figure 1A, Table 1). Similarly, 34,169 sisRNAs were identified by Ensembl genes (termed as Ensembl sisRNAs), with a total length of ~12 Mbps, which accounts for 0.8% of the genome (Figure 1A, Table 1). Together, a total of 20,020 peaks were identified by both gene sets (Figure 1A), which represents ~80% of refSeq sisRNAs and ~60% of Ensembl sisRNAs respectively. The sisRNA length ranges from 160 bps to 2,908 bps with the majority 200-500 bps long and an average of 363.6 bps and 356.6 bps for refSeq and Ensembl sisRNAs, respectively (Figure 1B-D, Table 1). Taken together, two high quality datasets of genome-wide sisRNAs were generated for further investigation of sisRNA properties.

Genome-wide identification of sisRNAs in Xenopus tropicalis

To study the sequence characteristics of sisRNAs, the high-throughput sequencing data of RNA (RNA-seq) from the GV of the frog Xenopus tropicalis [9] was used to detect sisRNA peaks with the model based analysis for ChIP-seq (MACS) algorithm [25]. We identified a total of 63,410 RNA peaks by a more stringent criterion (FDR=0.01) (Figure 1A, Table 1) and compared the location of each RNA peak to exons and introns to identify the true sisRNA peaks (i.e. those from spliced introns that did not cross intron–exon boundaries.) Two widely used and primarily manually annotated gene sets, RefSeq (9,448 protein-coding genes) and Ensembl (28,967 protein-coding genes), were used as references to identify the sisRNA peaks. 24,901 sisRNAs were identified by refSeq genes (termed as refSeq sisRNAs), with a total length of ~9 Mbps, which accounts for 0.6% of the X. tropicalis genome (Figure 1A, Table 1). Similarly, 34,169 sisRNAs were identified by Ensembl genes (termed as Ensembl sisRNAs), with a total length of ~12 Mbps, which accounts for 0.8% of the genome (Figure 1A, Table 1). Together, a total of 20,020 peaks were identified by both gene sets (Figure 1A), which represents ~80% of refSeq sisRNAs and ~60% of Ensembl sisRNAs respectively. The sisRNA length ranges from 160 bps to 2,908 bps with the majority 200-500 bps long and an average of 363.6 bps and 356.6 bps for refSeq and Ensembl sisRNAs, respectively (Figure 1B-D, Table 1). Taken together, two high quality datasets of genome-wide sisRNAs were generated for further investigation of sisRNA properties.

sisRNAs are widely expressed and preferentially located in genes with multiple introns

To determine the number of genes with sisRNAs, we first divided the genes of each dataset into 10 groups according to the number of introns in a gene. Genes without introns were excluded from further analysis. From the 9,448 RefSeq genes, 93.8% (8,864) have multiple exons while only 6.2% (584) of the genes have a single exon (Table 2A). Our data show that the more introns a gene has, the more likely the gene generates sisRNAs. While 24.9% of genes with a single intron have sisRNAs located in their intron, the possibility that a gene with 5 introns has a sisRNA increased to 66.6%. When a gene has 10 or more introns the possibility increased to 81.0% (Figure 2A). On average, 67.3% of Refseq genes (5,965/8,864) with introns produce sisRNAs (Figure 2A, Table 2A). 你= 6

/

We also calculated the average number of different sisRNAs per gene. We observed that the average sisRNA number increases with intron number (the average number of introns per gene is 7.8 in Xenopus tropicalis). For example, genes with a single intron have an average of 0.60 sisRNAs, genes with 5 introns have an average of 2.06 sisRNAs, and genes with 10 or more introns have an average of 5.06 sisRNAs (Figure 2B, Table 2A). On average, there are 2.97 sisRNAs per gene and 0.36 sisRNAs per intron (Table 2A). We also compared the intron length to the number of sisRNAs for each intron, and it indicates that longer introns tend to have more sisRNAs (R=0.72, Supplementary Fig. 1A-B). Next, we compared the total length of introns (Supplementary Figure 1C,) or normalized intron number per Mb (Supplementary Figure 1D) to the number of sisRNAs for each gene. The results indicate that the number of sisRNAs is well correlated with the total length of introns (R=0.87) rather than the number of introns (R=0.09) in a gene. We observed a similar result for Ensembl genes.

The sisRNA sequences are GC and TpG rich

To investigate the base-pair composition in sisRNAs, we first calculated the prevalence of all 10 unique dinucleotides in sisRNAs and in the X. tropicalis genome. We observed that sisRNAs as compared to the X. tropicalis genome are rich in AC, CC, AG, CA and GC, while poor in CG, AT, AA, TA and GA, for both RefSeq- and Ensembl- identified sisRNAs (Figure 3A). We then extended the calculation to trinucleotides. As we expected, sisRNAs are rich in CCA, GCC, CAC, GCA, CAG, AGC, which are all trinucleotides containing GC or CA|TG, the dinucleotides shown to be enriched in sisRNAs (Figure 3B). sisRNAs are very poor in CGN, which are trinucleotides with CpG, and also poor in ATA, AAT, AAA and TAA (Figure 3B). Generally speaking, CpG rich regions (e.g. CpG islands) are G+C rich and CpG poor regions are A+T rich. Interestingly, we observed that sisRNAs are CpG poor while GC rich.

We then calculated the GC content, CpG density and CA|TG density for each sisRNA as well as each scaffold in the whole genome (Figure 3C-E). Compared to the X. tropicalis genome, sisRNAs have a higher GC content (Figure 3C), a lower CpG density (Figure 3D) and a higher CA|TG density (Figure 3E). As a result, in terms of base-pair compositions, sisRNAs are different from the genome, with a different prevalence of dinucleotides, trinucleotides, and a lower CpG density, higher CA|TG density and GC content. In other words, sisRNAs are unique regions of the genome with their own properties. The results for RefSeq sisRNAs and Ensembl sisRNAs are nearly identical (Figure 3), providing strong support for our results.

The sisRNAs are specific regions of the introns

Since sisRNAs are derived from regions of introns, and introns have a distinct base-pair composition as compared to the whole genome (Haddrill et al. 2005), we expect sisRNAs to also have a sequence composition different from the whole genome. However, it remains unclear whether sisRNAs are randomly distributed in introns or in specific regions of the intron. GC content (G+C%), CpG density and CA|TG density are widely used and important parameters to analyze sequences characteristics. Thus, we calculated these parameters for each intron with/without sisRNAs and compared these results with sisRNAs alone. The total GC content of the introns with sisRNAs (40.59%) is higher (p<0.001, t-test) than the genome (40.07%). The GC content in sisRNAs alone (41.92% and 41.79% for RefSeq and Ensembl sisRNAs, respectively) is higher than that of the introns (p<0.001, t-test) (Figure 4A, Table 3). The CpG density of introns when compared to the whole genome is poor (Figure 4B). Interestingly, introns with sisRNAs have a higher CpG density than both introns without sisRNAs and sisRNAs alone (Figure 4B). While the CA|TG density in introns is very similar to the genome, sisRNAs have a higher CA|TG density than introns with sisRNAs (Figure 4C). Thus, the sisRNAs are in regions of the introns with higher CA|TG density and GC%. We further divided the introns without sisRNAs into two groups according to whether the host gene is with/without sisRNAs: Intron A (the host gene without sisRNAs) or Intron B (the host gene has sisRNAs but the intron itself is without sisRNAs). We found that intron A and intron B are very similar, thus sisRNAs are closely associated with the introns from which they originate. Taken together, these results reveal that sisRNAs are specific regions of introns with distinct sequence compositions.

sisRNAs are enriched at both 5’ and 3’ end of transcripts, with a preference for the 3’ end

After we observed that sisRNAs are in specific regions of introns, with unique base-pair compositions, we analyzed whether they are derived from specific introns along the gene. To study where sisRNAs are enriched, we concatenated all the introns for each gene along the transcript. We divided each joined intron transcript into 100 bins and identified the bins in which the sisRNAs are located. As shown in Figure 5A, the sisRNAs are mostly enriched at the beginning (5’ end) and the end (3’end) of the transcript with a preference for the 3’ end. An example is shown for the gene nasp: more sisRNAs were observed at the 3’ end (Figure 5B). These results further confirmed that sisRNAs are not random sequences from the introns: they have distinct sequence compositions and are preferentially driven from the 3’ end of a transcript.

We next asked whether sisRNAs are derived from the 5’ and 3’ end of introns because these end introns have unique properties. To investigate this possibility, we divided all the introns from genes with more than 3 introns into 3 categories: 1^st -5’ end introns (S), middle introns (M) and last -3’ end introns (E). The results indicate that last -3’ end introns are very similar to the middle intron, while the first -5’ end introns have different compositions and are slightly longer (Supplementary Figure 2). The properties of the first introns may be different because they sometimes overlap with the promoter regions, which have higher CpG, CA|TG density, and GC content. These results indicate that sisRNAs are not preferentially derived from 3’ end introns because these introns have unique properties, but instead some other mechanism of sisRNA production is involved.

Another possibility is that sisRNAs produced from 3’ end introns are remnants of gene transcription and splicing. To test this idea, we performed the analysis of correlation between the expression levels of host genes and sisRNAs. As shown in Figure 5C, the expression levels of host genes (FPKM) and sisRNAs peak signals are negatively relevant (R=-0.1724), which does not support sisRNAs as remnants of gene transcription and splicing that have not been cleaned from cells. Thus, the stability of sisRNAs is regulated independently from the host mRNAS. Taken together, sisRNAs are most enriched in the 3’ end of introns, and the mechanism for this enrichment remains to be investigated.

Specific TFBS are enriched in sisRNA-producing introns

Because of the specific sequence properties and preference for introns on the ends of a gene, we performed an enrichment calculation of transcription factor binding sites (TFBS) in sisRNAs to study the potential functions of sisRNAs. We searched for enriched DNA motifs among the 935 position weight matrices (PWMs) collected from the TRANSFAC databases [26] in sisRNAs and in introns without sisRNAs (Figure 6). We also calculated the motif enrichment in both RefSeq and Ensembl sisRNAs and the enrichments in two datasets are nearly identical (R=0.99), indicating the calculation is robust and the quality of identified sisRNAs is high (Figure 6B). Stat3, NF-κB, p50:p50, MYOG:NF1 and GAF consensus sites are enriched in sisRNAs but depleted in introns without sisRNAs (Figure 6A). Stat3 (signal transducer and activator of transcription 3) is a member of the STAT protein family, functions as a transcriptional activator [27], and is highly expressed in X. tropicalis (Figure 6C). NF-κB (nuclear factor kappa-light-chain-enhancer of activated B cells) is a protein complex that controls transcription [28]. p50 is the mature NF-κB subunit, which has no intrinsic ability to activate transcription and has been proposed to act as a transcriptional repressor when binding with κB elements as homodimers (p50:p50) [29]. Myogenin (MYOG) is one of four muscle-specific basic helix-loop-helix regulatory factors involved in controlling myogenesis [30], and NF-1 (Neurofibromin 1) is a negative regulator of the Ras signal transduction pathway [31] and is required for skeletal muscle development [32]. The GAGA factor (GAF) is one of a few transcription factors that can regulate transcription at multiple levels: depending on its target genomic location, it can act as either activator or repressor [33]. We also observed that some TFBS, such as Ncx, Prop1 and Nkx3a, are depleted in sisRNAs while slightly enriched in introns without sisRNAs (Figure 6A). These results suggest that specific TFBS involved in transcription regulation, either activation or suppression, are enriched in sisRNA-producing introns, which imply that sisRNAs may play a functional role in transcriptional regulation.

The GO terms Nucleotide binding, RNA binding and ATP binding are enriched in genes with sisRNAs

To further investigate the role of sisRNAs, we asked whether the genes with sisRNAs share any function. GO (Gene Ontology) enrichment analysis [34] of the 5,419 RefSeq genes with sisRNAs indicated that 13.9% of the genes are involved in nucleotide binding, 3.2% of the genes are involved in RNA binding and 7.9% genes are involved in ATP binding (Table 4). The other enriched gene sets included RNA processing, mRNA metabolic process and translation (Table 4). These results suggested that sisRNAs might have a potential biological function involved in gene regulation and metabolism.

The sisRNAs are as evolutionary conserved as introns, and much less than exons

Our data indicate that sisRNAs are in specific regions of introns, contain TFBSs, and are in the introns of genes involved in nucleotide, ATP and RNA binding. To investigate whether these sisRNAs are conserved across species, we determined the PhastCons conservation score for sisRNAs and introns (Figure 7). The PhastCons scores [35] are calculated based on multiple alignments of 6 vertebrate genomes (zebrafish, chicken, opossum, rat, mouse and human) with X. tropicalis. As expected, the boundary of exon and intron has the highest conservation score (Figure 7A, C). Although sisRNAs are as conserved as other intron regions, they are still much less conserved than exons, suggesting they might not be conserved among species, which was also shown in a recent study [10].

An alternative approach to evaluate the conservation of sisRNAs is to compare the sisRNAs in different cell types from various species. We obtained the cytoplasm sisRNAs data recently published [21], which contains several species and cell types, including human red blood cells, Hela cells, mouse red blood cells and 3T3 cells, chicken DF1 cells, and Xenopus laevis XTC cells. We compared the GC content, CpG density, and CA|TG density among these cytoplasm sisRNAs to the refSeq and Ensembl sisRNAs that we identified (Figure 7D-F). We found that between different cell types in the same species, for example, human red blood cells and Hela cells, the GC content, CpG density and CA|TG density of these sisRNAs are quite different from each other (Figure 7D-F). While for the same cell types between different species, for example, human red blood cells and mouse red blood cells, some sequence composition similarities exist (Figure 7D-F). These results indicate that sisRNAs are cell type specific. Although sequences of sisRNAs may not be so conserved among species, the sequence compositions are similar.

The X. laevis XTC sisRNAs are very similar to X. tropicalis GV sisRNAs in terms of sequence composition. We next asked how many common host genes with sisRNAs exist between the two species. We compared 5,563 unique genes host sisRNAs in GV of X. tropicalis to 2,029 unique genes which have sisRNAs in X. laevis XTC cells, there are only 602 genes overlapped and this chance is even lower than random selection (Figure 7G). Among these 602 genes, the number of sisRNAs from the same intron is less than 50 (data not shown). Overall, our calculation suggested there may not be positional or sequence conservation between sisRNAs, which is also supported by other recent work [10].

We also performed GO (Gene Ontology) enrichment analysis for the genes host sisRNAs in each cell type (Supplementary Figure 3A-F). In chicken DF1 cells, the sisRNAs are not enriched in any gene sets (Supplementary Figure 3E). Even though sisRNAs are enriched in some gene sets in the other 5 cell types, there are no common enriched gene sets present in all 5. The most common enriched gene set is related to sister chromatid segregation or nuclear division, which is observed in human RBC, Hela, mouse RBC, and X. laevis XTC cells (Supplementary Figure 3A-C, F). Another common enriched biological process is covalent chromatin modification, which is observed in human RBC, mouse RBC and 3T3 cells (Supplementary Figure 3A, C-D). It will be interesting to test whether sisRNAs play functional roles in cell division or chromatin organization.

To determine the number of genes with sisRNAs, we first divided the genes of each dataset into 10 groups according to the number of introns in a gene. Genes without introns were excluded from further analysis. From the 9,448 RefSeq genes, 93.8% (8,864) have multiple exons while only 6.2% (584) of the genes have a single exon (Table 2A). Our data show that the more introns a gene has, the more likely the gene generates sisRNAs. While 24.9% of genes with a single intron have sisRNAs located in their intron, the possibility that a gene with 5 introns has a sisRNA increased to 66.6%. When a gene has 10 or more introns the possibility increased to 81.0% (Figure 2A). On average, 67.3% of Refseq genes (5,965/8,864) with introns produce sisRNAs (Figure 2A, Table 2A). 你=6