Genome-wide identification of sisRNAs in Xenopus tropicalis
To study the sequence characteristics of sisRNAs, the high-throughput sequencing data of RNA (RNA-seq) from the GV of the frog Xenopus tropicalis  was used to detect sisRNA peaks with the model based analysis for ChIP-seq (MACS) algorithm . We identified a total of 63,410 RNA peaks by a more stringent criterion (FDR=0.01) (Figure 1A, Table 1) and compared the location of each RNA peak to exons and introns to identify the true sisRNA peaks (i.e. those from spliced introns that did not cross intron–exon boundaries.) Two widely used and primarily manually annotated gene sets, RefSeq (9,448 protein-coding genes) and Ensembl (28,967 protein-coding genes), were used as references to identify the sisRNA peaks. 24,901 sisRNAs were identified by refSeq genes (termed as refSeq sisRNAs), with a total length of ~9 Mbps, which accounts for 0.6% of the X. tropicalis genome (Figure 1A, Table 1). Similarly, 34,169 sisRNAs were identified by Ensembl genes (termed as Ensembl sisRNAs), with a total length of ~12 Mbps, which accounts for 0.8% of the genome (Figure 1A, Table 1). Together, a total of 20,020 peaks were identified by both gene sets (Figure 1A), which represents ~80% of refSeq sisRNAs and ~60% of Ensembl sisRNAs respectively. The sisRNA length ranges from 160 bps to 2,908 bps with the majority 200-500 bps long and an average of 363.6 bps and 356.6 bps for refSeq and Ensembl sisRNAs, respectively (Figure 1B-D, Table 1). Taken together, two high quality datasets of genome-wide sisRNAs were generated for further investigation of sisRNA properties.
sisRNAs are widely expressed and preferentially located in genes with multiple introns
To determine the number of genes with sisRNAs, we first divided the genes of each dataset into 10 groups according to the number of introns in a gene. Genes without introns were excluded from further analysis. From the 9,448 RefSeq genes, 93.8% (8,864) have multiple exons while only 6.2% (584) of the genes have a single exon (Table 2A). Our data show that the more introns a gene has, the more likely the gene generates sisRNAs. While 24.9% of genes with a single intron have sisRNAs located in their intron, the possibility that a gene with 5 introns has a sisRNA increased to 66.6%. When a gene has 10 or more introns the possibility increased to 81.0% (Figure 2A). On average, 67.3% of Refseq genes (5,965/8,864) with introns produce sisRNAs (Figure 2A, Table 2A).
We also calculated the average number of different sisRNAs per gene. We observed that the average sisRNA number increases with intron number (the average number of introns per gene is 7.8 in Xenopus tropicalis). For example, genes with a single intron have an average of 0.60 sisRNAs, genes with 5 introns have an average of 2.06 sisRNAs, and genes with 10 or more introns have an average of 5.06 sisRNAs (Figure 2B, Table 2A). On average, there are 2.97 sisRNAs per gene and 0.36 sisRNAs per intron (Table 2A). We also compared the intron length to the number of sisRNAs for each intron, and it indicates that longer introns tender to have more sisRNAs (R=0.72, Supplementary Fig. 1A-B). We observed a similar result for Ensembl genes. On average 49.6% of genes (13,044/26,313) with introns encode sisRNAs. On average there are 1.94 sisRNAs encoded in an Ensemble gene and each intron encodes 0.22 sisRNAs (Figure 2, Table 2B).
The sisRNA sequences are GC and TpG rich
To investigate the base-pair composition in sisRNAs, we first calculated the prevalence of all 10 unique dinucleotides in sisRNAs and in the X. tropicalis genome. We observed that sisRNAs as compared to the X. tropicalis genome are rich in AC, CC, AG, CA and GC, while poor in CG, AT, AA, TA and GA, for both RefSeq- and Ensembl- identified sisRNAs (Figure 3A). We then extended the calculation to trinucleotides. As we expected, sisRNAs are rich in CCA, GCC, CAC, GCA, CAG, AGC, which are all trinucleotides containing GC or CA|TG, the dinucleotides shown to be enriched in sisRNAs (Figure 3B). sisRNAs are very poor in CGN, which are trinucleotides with CpG, and also poor in ATA, AAT, AAA and TAA (Figure 3B). Generally speaking, CpG rich regions (e.g. CpG islands) are G+C rich and CpG poor regions are A+T rich. Interestingly, we observed that sisRNAs are CpG poor while GC rich.
We then calculated the GC content, CpG density and CA|TG density for each sisRNA as well as each scaffold in the whole genome (Figure 3C-E). Compared to the X. tropicalis genome, sisRNAs have a higher GC content (Figure 3C), a lower CpG density (Figure 3D) and a higher CA|TG density (Figure 3E). As a result, in terms of base-pair compositions, sisRNAs are different from the genome, with a different prevalence of dinucleotides, trinucleotides, and a lower CpG density, higher CA|TG density and GC content. In other words, sisRNAs are unique regions of the genome with their own properties. The results for RefSeq sisRNAs and Ensembl sisRNAs are nearly identical (Figure 3), providing strong support for our results.
The sisRNAs are specific regions of the introns
Since sisRNAs are derived from regions of introns, and introns have a distinct base-pair composition as compared to the whole genome (Haddrill et al. 2005), we expect sisRNAs to also have a sequence composition different from the whole genome. However, it remains unclear whether sisRNAs are randomly distributed in introns or in specific regions of the intron. GC content (G+C%), CpG density and CA|TG density are widely used and important parameters to analyze sequences characteristics. Thus, we calculated these parameters for each intron with/without sisRNAs, and compared these results with sisRNAs alone. The total GC content of the introns with sisRNAs (40.59%) is higher (p<0.001, t-test) than the genome (40.07%). The GC content in sisRNAs alone (41.92% and 41.79% for RefSeq and Ensembl sisRNAs, respectively) is higher than that of the introns (p<0.001, t-test) (Figure 4A, Table 3). The CpG density of introns when compared to the whole genome is poor (Figure 4B). Interestingly, introns with sisRNAs have a higher CpG density than both introns without sisRNAs and sisRNAs alone (Figure 4B). While the CA|TG density in introns is very similar to the genome, sisRNAs have a higher CA|TG density than introns with sisRNAs (Figure 4C). Thus, the sisRNAs are in regions of the introns with higher CA|TG density and GC%. We further divided the introns without sisRNAs into two groups according to whether the host gene is with/without sisRNAs: Intron A (the host gene without sisRNAs) or Intron B (the host gene has sisRNAs but the intron itself is without sisRNAs). We found that intron A and intron B are very similar, thus sisRNAs are closely associated with the introns from which they originate.
Taken together, these results reveal that sisRNAs are specific regions of introns with distinct sequence compositions.
sisRNAs are enriched at both 5’ and 3’ end of transcripts, with a preference for the 3’ end
After we observed that sisRNAs are in specific regions of introns, with unique base-pair compositions, we analyzed whether they are derived from specific introns along the gene. To study where sisRNAs are enriched, we concatenated all the introns for each gene along the transcript. We divided each joined intron transcript into 100 bins, and identified the bins in which the sisRNAs are located. As shown in Figure 5A, the sisRNAs are mostly enriched at the beginning (5’ end) and the end (3’end) of the transcript with a preference for the 3’ end. An example is shown for the gene nasp: more sisRNAs were observed at the 3’ end (Figure 5B). These results further confirmed that sisRNAs are not random sequences from the introns: they have distinct sequence compositions and are preferentially driven from the 3’ end of a transcript.
We next asked whether sisRNAs are derived from the 5’ and 3’ end of introns because these end introns have unique properties. To investigate this possibility, we divided all the introns from genes with more than 3 introns into 3 categories: 1st -5’ end introns (S), middle introns (M) and last -3’ end introns (E). The results indicates that last -3’ end introns are very similar to the middle intron, while the first -5’ end introns have different compositions and are slightly longer (Supplementary Figure 2). The properties of the first introns may be different because they sometimes overlap with the promoter regions, which have higher CpG, CA|TG density, and GC content. These results indicate that sisRNAs are not preferentially derived from 3’ end introns because these introns have unique properties, but instead some other mechanism of sisRNA production is involved.
Another possibility is that sisRNAs produced from 3’ end introns are remnants of gene transcription and splicing. To test this idea, we performed the analysis of correlation between the expression levels of host genes and sisRNAs. As shown in Figure 5C, the expression levels of host genes (FPKM) and sisRNAs peak signals are negatively relevant (R=-0.1724), which does not support sisRNAs as remnants of gene transcription and splicing that have not been cleaned from cells. This suggests that sisRNAs play an inhibitory role in host gene expression. Taken together, sisRNAs are most enriched in the 3’ end of introns, and the mechanism for this enrichment remains to be investigated.
Specific TFBS related to transcription regulation are enriched in sisRNAs
Because of the specific sequence properties and preference for introns on the ends of a gene, we performed an enrichment calculation of transcription factor binding sites (TFBS) in sisRNAs to study the potential functions of sisRNAs. We searched for enriched DNA motifs among the 935 position weight matrices (PWMs) collected from the TRANSFAC databases  in sisRNAs and in introns without sisRNAs (Figure 6). We also calculated the motif enrichment in both RefSeq and Ensembl sisRNAs and the enrichments in two datasets are nearly identical (R=0.99), indicating the calculation is robust and the quality of identified sisRNAs is high (Figure 6B). Stat3, NF-κB, p50:p50, MYOG:NF1 and GAF consensus sites are enriched in sisRNAs but depleted in introns without sisRNAs (Figure 6A). Stat3 (signal transducer and activator of transcription 3) is a member of the STAT protein family, functions as a transcriptional activator , and is highly expressed in the X. tropicalis cytoplasm (Figure 6C). NF-κB (nuclear factor kappa-light-chain-enhancer of activated B cells) is a protein complex that controls transcription . p50 is the mature NF-κB subunit, which has no intrinsic ability to activate transcription and has been proposed to act as a transcriptional repressor when binding with κB elements as homodimers (p50:p50) . Myogenin (MYOG) is one of four muscle-specific basic helix-loop-helix regulatory factors involved in controlling myogenesis , and NF-1 (Neurofibromin 1) is a negative regulator of the Ras signal transduction pathway  and is required for skeletal muscle development . The GAGA factor (GAF) is one of a few transcription factors that can regulate transcription at multiple levels: depending on its target genomic location, it can act as either activator or repressor . We also observed that some TFBS, such as Ncx, Prop1 and Nkx3a, are depleted in sisRNAs while slightly enriched in introns without sisRNAs (Figure 6A). These results suggest that specific TFBS involved in transcription regulation, either activation or suppression, are enriched in sisRNAs, which imply that sisRNAs may play a functional role in transcriptional regulation.
The GO terms Nucleotide binding, RNA binding and ATP binding are enriched in genes with sisRNAs
To further investigate the role of sisRNAs, we asked whether the genes with sisRNAs share any function. GO (Gene Ontology) enrichment analysis  of the 5,419 RefSeq genes with sisRNAs indicated that 13.9% of the genes are involved in nucleotide binding, 3.2% of the genes are involved in RNA binding and 7.9% genes are involved in ATP binding (Table 4). The other enriched gene sets included RNA processing, mRNA metabolic process and translation (Table 4). These results suggested that sisRNAs might have a potential biological function involved in gene regulation and metabolism.
The sisRNAs are as evolutionary conserved as introns, and much less than exons
Our data indicate that sisRNAs are in specific regions of introns, contain TFBSs, and are in the introns of genes involved in nucleotide, ATP and RNA binding. To investigate whether these sisRNAs are conserved across species, we determined the PhastCons conservation score for sisRNAs and introns (Figure 7). The PhastCons scores  are calculated based on multiple alignments of 6 vertebrate genomes (zebrafish, chicken, opossum, rat, mouse and human) with X. tropicalis. As expected, the boundary of exon and intron has the highest conservation score (Figure 7A, C). Although sisRNAs are as conserved as other intron regions, they are still much less conserved than exons, suggesting they might not be conserved among species, which was also shown in a recent study .
An alternative approach to evaluate the conservation of sisRNAs is to compare the sisRNAs in different cell types from various species. We obtained the cytoplasm sisRNAs data recently published , which contains several species and cell types, including human red blood cells, Hela cells, mouse red blood cells and 3T3 cells, chicken DF1 cells, and Xenopus laevis XTC cells. We compared the GC content, CpG density, and CA|TG density among these cytoplasm sisRNAs to the refSeq and Ensembl sisRNAs that we identified (Figure 7D-F). We found that between different cell types in the same species, for example, human red blood cells and Hela cells, the GC content, CpG density and CA|TG density of these sisRNAs are quite different from each other (Figure 7D-F). While for the same cell types between different species, for example, human red blood cells and mouse red blood cells, some sequence composition similarities exist (Figure 7D-F). These results indicate that sisRNAs are cell type specific. Although sequences of sisRNAs may not be so conserved among species, the sequence compositions are similar.
The X. laevis XTC sisRNAs are very similar to X. tropicalis GV sisRNAs in terms of sequence composition. We next asked how many common host genes with sisRNAs exist between the two species. We compared 5,563 unique genes host sisRNAs in GV of X. tropicalis to 2,029 unique genes which have sisRNAs in X. laevis XTS cells, there are only 602 genes overlapped and this chance is even lower than random selection (Figure 7G). Among these 602 genes, the number of sisRNAs from the same intron is less than 50 (data not shown). Overall, our calculation suggested there may not be positional or sequence conservation between sisRNAs, which is also supported by other recent work .
We also performed GO (Gene Ontology) enrichment analysis for the genes host sisRNAs in each cell type (Supplementary Figure 3A-F). In chicken DF1 cells, the sisRNAs are not enriched in any gene sets (Supplementary Figure 3E). Even though sisRNAs are enriched in some gene sets in the other 5 cell types, there are no common enriched gene sets present in all 5. The most common enriched gene set is related to sister chromatid segregation or nuclear division, which is observed in human RBC, Hela, mouse RBC, and X. laevis XTC cells (Supplementary Figure 3A-C, F). Another common enriched biological process is covalent chromatin modification, which is observed in human RBC, mouse RBC and 3T3 cells (Supplementary Figure 3A, C-D). These results imply that sisRNA may play functional roles in cell division or chromatin organization.