Genome-wide Sequence Analysis of Human Splice Acceptor Regions for Motif Discovery

Background Splicing mechanism in eukaryotic cells, drives the protein diversity. This important mechanism is orchestrated by regulatory factors controlled through various splicing signals. 3’ and 5’ splice sites, and common branch point sequences are known as initiating signals of splicing, and changes in these signals can be the underlying cause of diseases. Nevertheless, an extensive genome-wide analysis of the sequences around these signals is lacking. In this study, we focused on the genome-wide motif analysis of splice acceptor region and analyzed 400 nucleotides long sequences (300 nucleotides upstream and 100 nucleotides downstream of 3’) to identify candidate motifs with functional importance. Results 207,583 sequences retrieved from Ensembl Biomart are analyzed with MEME ChIP resulting in 518 significant splice acceptor region motifs. Besides 457 known motifs, 227 of which are observed on various mammalians other than Homo sapiens, 61 novel mammalian motifs are identified. Furthermore, proteins binding to the known motifs, are mainly annotated for homeobox, homeodomain, DNA binding region, and transcription regulation function. Additionally, 4 novel motifs, supported with experimental evidence from DNA binding studies, are identified as potential branch-site motifs. Conclusions Genome-wide computational motif analysis around the human acceptor region sequences identified significant splice acceptor region motifs. The known mammalian motifs, among the computationally significant acceptor region motifs, had known biological roles with prior experimental support. The novel acceptor region motifs identified for the first time in this study are proposed as candidate motifs, and studying their biological roles in functional studies will further increase our understanding of the splicing mechanisms.


Background
In eukaryotes, genes are transcribed into pre-messenger RNAs (pre-mRNA) from which introns are excised and the remaining exons are joined to produce the mature mRNA through the splicing process. Alternative splicing allows production of variety of proteins from a single pre-mRNA by regulating the inclusion of different exon combinations in different mRNAs and thus in the mature 3 transcripts [1]. In fact, gene splicing increases the genetic coding capacity to form structurally and functionally different protein isoforms.
Protein coding genes, are generally spliced and most of them are alternatively spliced by means of a fine balance within the regulatory factors, which controls the splice-site selection. Disturbance of regulation of splicing results in malfunction of the biological processes, which can lead to variety of diseases [2]. For example, spinal muscular atrophy, is caused by a point mutation in an exonic regulatory element [3][4][5][6]. Tauopathies (diseases of central nervous system) such as in Alzheimer's Disease, is known to be due to the changes in the ratio of the protein isoforms produced by alternative splicing [7][8][9]. Frontotemporal lobar dementia is caused by the loss of a splicing factor and aberrant alternative splicing events of several genes are observed in patients with Parkinson's Disease [10]. Furthermore, specific alterations in the expression of splicing factors are observed in various cancers. Mutations of tumor suppressor genes affecting splice site selection are shown to cause cancer, as in the case of BRCA1 gene for breast cancer [11][12][13].
The currently accepted molecular mechanisms of splicing reveals that introns are excised away from the primary transcripts at conserved sequences called splice sites, which are found at the 5' and 3' ends of the introns. The excised intronic RNA sequence commonly begins with the dinucleotide GU at the 5' end and ends with AG at its 3' end, which are also known as donor and acceptor dinucleotides respectively. These consensus sequences are critical for splicing, and changes on these sequences result in inhibition of splicing [14]. Another important sequence for splicing resides at the branch point, which is located between 19 to 37 nucleotides upstream of the acceptor site [15] and contains the nucleotide (always adenine) which forms the intermediate intron lariat with 5' splice site [16]. A recent genome-wide study on human splicing branch points [15] has identified a set of 5 to 6 nucleotides long motifs around the branch point. These branch point sequences are called B-box elements as they are enriched for B-nucleotides (C, G, and U). B-box elements are the motifs in the form of UUNAN (U-motif), CUNAN (C-motif), G/AUNAN (G/A-motif), and others (B-boxes without a branch point adenine) [15].
Currently, studies of splicing signals and motifs have been restricted to the sequences around the 3' 4 and 5' splice sites, and branch point. There is lack of computational studies for genome-wide discovery and prioritization of conserved motifs around the splice regions. In order to fill this gap for 3' splice sites of human genome, we have performed a genome-wide motif discovery analysis for the sequences around the splice acceptor sites. We have retrieved 400 nucleotides long sequences around the 3' splice sites (300 nucleotides upstream and 100 nucleotides downstream) for 207,583 sites and searched for the conserved patterns. Based on an extensive analysis through motif finders and an experimental database search afterwards (which is held from March 2018 to June 2018), we have identified various patterns, some of which are already known human DNA binding motifs.
Additionally, we have identified several motifs that locates to the acceptor region for the first time in human genome, but previously observed on other mammalian genomes. Furthermore, we have found various novel motifs, most of which are enriched up to 50 nucleotides around the acceptor and few conforming to branch point B-boxes indicating that they may have a role during splicing. Our findings indicate that computationally significant motifs found in this study are biologically significant for mammalian genomes. Therefore, we believe that further analysis of the proposed motifs by experimental methods will further increase the understanding on splicing mechanisms.

Methods
Main workflow of our study for the construction of a genome-wide motif discovery analysis on splice acceptor sites is as outlined in Figure 1. First, genome scale retrieval of acceptor sequences is completed, then various motif finders are used to find patterns on these sequences. Afterwards, motifs with low significance are filtered during the pre-processing step. Finally, biological relevance of the motif sequences is evaluated.

Acceptor region sequence retrieval
In this study, we have focused on the 400 nucleotides around the splice acceptor sites (300 nucleotides upstream and 100 nucleotides downstream of the acceptor site). The exact base pair locations of acceptor sites are retrieved through Ensembl Biomart [17]. Exon start locations, which are the indicators of splice acceptor sites have been retrieved for all protein-coding genes based on GRCh38.p12 human assembly of Ensembl Biomart. By this query we have also gathered the transcript start site (TSS) locations in order to eliminate 5'UTR sequences as our main consideration is the splice acceptor sites (Supplementary A.1 presents the query for the chromosome 1 through Ensembl Biomart). After gathering the exact acceptor positions by eliminating the exons with initiation codons, we have calculated the base pair locations of start and end points for 400 nucleotides long sequences surrounding the acceptor sites. Next, UCSC DAS Server, located at http://genome.ucsc.edu/cgi-bin/das, is used to retrieve the sequences for the selected 400-nucleotide long acceptor regions (by the Perl script provided in Supplementary A.2).
Motif analysis of 400-nucleotide long acceptor regions MEME-ChIP [18], is selected as the motif finding tool for this study. MEME-ChIP runs several motif analyzers including MEME, DREME, Centrimo,, FIMO, and Tomtom which are explained below: MEME, discovers novel, ungapped motifs (recurring, fixed-length patterns) in the provided sequences [19]. DREME, discovers short, ungapped motifs (recurring, fixed-length patterns) that are relatively enriched in the provided sequences [20]. CentriMo, identifies known or user-provided motifs that show a significant preference for particular locations in the provided sequences [21].
FIMO, searches the sequences for occurrences of provided motifs and provides the locations of matches [22].
Tomtom, compares the motifs against a database of known motifs [23].
After sequence retrieval, we have taken the reverse complement of the negative strand sequences.
Next, we have arranged all of the sequences in FASTA format. We have run the MEME-ChIP with its default configuration for the matched known motifs, i.e. vertebrates (in vivo and in silico) of Eukaryote DNA, which is the largest set for known motifs of eukaryotic DNA. Furthermore, for DREME and Centrimo, we have used the default parameters suggested by MEME-ChIP. However, for MEME we have limited the number of motifs to 20 and the length of the motifs to from 9 to 50 nucleotides. The former limitation is held for making the algorithm run faster and finding the most significant motifs whereas the latter limitation helps for finding the novel motifs, which may be excluded by DREME, as it looks for the shorter motifs up to 8 nucleotides (Supplementary A.3 MEME-ChIP Analysis Parameters).

Pre-processing of identified motifs
A filtering strategy is developed to select the most significant motifs, by inspecting the central enrichment information provided by Centrimo. Steps of the filtering process is as follows:

1.
We have filtered the motifs with region width less than 3 nucleotide (the width of the most enriched region), since they mostly represent the common dinucleotide signal, AG, of the acceptor sites.

2.
We have filtered the motifs based on the E-values (the expected number motifs that would have at least one region as enriched for best matches to the motif as the reported region) The cut-off value e-5 is used to filter the motifs based on their significance.

3.
We have described a new value for prioritization of the identified motifs, which equals to " and set its threshold to -10 in order to select the motifs with higher significance but less number of sequence matches and also the motifs with lower significance but high number of sequence matches.

4.
We have filtered the motifs, which were not found to be significant by all three motif finders, CentriMo, MEME and DREME.

5.
We have removed the duplicate motifs. There were some duplicate motifs such as ALX3_full_1 and MA0634.1 which are represented by different identifiers as they exist in different motif databases (i.e. HumanTF 1.0 [24] and JASPAR 2018 [25] respectively).

Biological annotation of the known splice region motifs
Functional enrichment analysis. The Database for Annotation, Visualization and Integrated Discovery (DAVID) is a functional annotation tool, which mainly provided typical batch annotation and gene-GO term enrichment analysis to highlight the most relevant GO terms associated with a given list.
Biological functions of transcription factors (TFs) that recognize the significant acceptor region motifs 7 of this study are evaluated by DAVID v6.8 [26,27].
Comparison with experimental databases. FootprintDB is a meta-database of TFs, which locates them to their experimentally determined DNA binding motifs and DNA binding sites (i.e. cis elements) [28].
As a meta-database it integrates several databases such as 3d-footprint [29], JASPAR [25], UniPROBE [30], HumanTF [24], and etc. which are experimental repositories for the binding preferences of TFs and proteins. FootprintDB's search engine requires a DNA consensus motif or site as an input to produce a list of DNA-binding proteins (mainly TFs), detected to bind with a similar DNA motif [28].
Here we have used FootprintDB's sequence search service to reveal the biological significance of the known motifs by using the count matrix representation, which is a basic position frequency matrix (PFM) created by counting the occurrences of each nucleotide at each position.

Biological annotation of the novel splice region motifs
Analysis of repetitive sequences. RepeatMasker, uses curated libraries of repeats namely Dfam [31] and Repbase [32] to screen DNA sequences for interspersed repeats and low complexity DNA sequences [33]. We have used RepeatMasker for the detection of the motifs on the repetitive sequences. For this search IUPAC codes in the sequences were replaced with the alternative nucleotides to represent up-to 3000 possible alternatives. The limitation was necessary as for many of the motifs, especially the long motifs, the number of alternatives were really high and over the capacity of the online RepeatMasker. Furthermore, UCSC Blat [34] was used for searching repetitions in case the whole motif sequence is not indicated to be repetitive by RepeatMasker but a huge portion of it may indicate a repeat.
Analysis of motif locations. In order to investigate the motif locations and detect whether there are any special regions, which are predominantly preferred, we have divided our acceptor site sequences into 8 regions as shown in Figure 2. Afterwards, we have utilized the output of FIMO program [22], which provides the exact locations of the motif occurrences in the analyzed sequences, in order to match the motif locations within the defined regions. The program run with a p-value of 0.01.
Furthermore, Region 6 is further divided into 3 more regions in order to investigate the motifs that predominantly occur in the branch point region (i.e. between 19 and 37 nucleotides upstream of the acceptor site) as defined in [15].
Comparison with experimental databases. FootprintDB's sequence search facility is utilized for novel motifs in the same manner as known motifs to reveal their biological significance.
Results 518 motifs were found to be significant MEME-ChIP analysis of 207,583 splice acceptor site sequences revealed 1491 significant Centrimo motifs with e-value ≤ 10 using the binomial test. After applying the filtering rules as defined in the pre-processing step, 518 motifs were selected for further investigation. Among these 518 acceptor site motifs, 457 (i.e. nearly 88.22%) were mentioned by MEME-ChIP to be previously known, 33 were novel motifs found through MEME, and the remaining 28 were the novel motifs found through DREME.
All of these known and novel motifs are listed in Supplementary B.

Analysis results of known motifs
FootprintDB annotation of 457 previously known motifs revealed 455 of the motifs as DNA binding motifs. The remaining two motifs, namely "HOXD12_DBD_4" and "PITX1_full_1", didn't have any footprintDB entry, although there was a link provided by MEME-ChIP. This may be due to some versioning inconsistencies between MEME-ChIP and footprintDB.
51% of the known motifs are experimentally observed in human genome and remaining 49% are experimentally observed in other mammalian genomes. When we investigate previously known 455 motifs through footprintDB in terms of the organisms that they were identified, we have observed that 51% of them have been experimentally shown to locate in human genome (221 of the motifs were identified in human genome and 9 of the motifs were identified in both Homo sapiens and Mus musculus). Remaining 49%, i.e. 225, of the known motifs previously identified on non-human mammalian genomes, mostly in Mus musculus genome, and they were novel to Homo sapiens ( Figure   3).

Analysis results of novel motifs
Sequence lengths of novel motifs. Genome wide motif search on splice acceptor sequences through MEME algorithm has produced 33 novel motifs, which are 9 to 50 nucleotides long. Searching motifs with DREME algorithm has produced 28 motifs which are 5 to 8 nucleotides long. Distribution of motifs according to their lengths is summarized in Figure 6. According to an experimental study [15], branch-points have shown a restricted distribution upstream of the 3' splice site with 90% of branch-points occurring within 19 to 37 nucleotides upstream. This is the Region 6 (shown in Figure 2 of Methods section), which is the first rank for 31 novel motifs' occurrence frequencies. With a closer look at the positions of these novel motifs within Region 6, we have observed that 4 of the novel motifs namely, AAATRH, AAATRY, ACATWY, and TAWATR occur most frequently in Region 6_2. Most of the novel motifs with <20nt length are found to be >80% similar with previously known Homo sapiens motifs. Another analysis for the novel motifs with <20nt length, were searching them from footprintDB. For this purpose, footprintDB's sequence search feature was used and this analysis showed that 34 of the inspected 38 motifs were >80% similar with previously known motifs of Homo sapiens as shown in Figure 7. The 4 motifs that showed less similarity was namely "CYCTCCYTNCCACMCT", "RGRRAARGRRAGAGAG", "TTTTTTTTTTTTNAKW", and "YCCYTCCCASCYCCYC", and their similarity levels to previously known motifs of Homo sapiens were 67,75%, 69,56%, 78,19% and 70,88% respectively.

Discussion
In this study, we have performed a genome-wide computational motif analysis around the human acceptor region sequences. We have retrieved sequences of 400 nucleotides long regions, which are located 300 nucleotides upstream and 100 nucleotides downstream of the actual acceptor sites. Motif finding algorithms included in the MEME-ChIP were utilized for each chromosome independently, due to the computational limitation of the tool (The sequence file input of MEME-ChIP can be at most 80MB, but our sequences are nearly 92MB in total). The resulting significant motifs were combined afterwards, and filtered according to their length, significance, and consensus. 518 motifs passed the filtering. Among the selected motifs 457 were previously known mammalian motifs, but 61 of them were novel motifs of the acceptor region.
Previously known motifs, identified as computationally significant motifs of acceptor site, were experimentally proven to locate on Homo sapiens and/or other mammalian genomes as DNA binding motifs. The proteins, binding to the known acceptor site motifs identified here, mainly annotated for homeobox and homeodomain, DNA binding region, and/or transcription regulation functions in both Homo sapiens and Mus musculus genome. Homeobox genes direct the formation of body structures during early embryonic development, and their alternative splicing is known to be under strict regulation [35]. The genomic organization and the complex arrangement of regulative elements in homeobox gene structure is essential in their spatio-temporal regulation and function [36]. Also conserved intronic regions, containing homeobox sequences [37] or alternative promoters [38] are known to have regulative roles. The homeobox sequences that are identified as acceptor region motifs in this study likely to have a regulative role in the alternative splicing mechanism. Also, the computationally significant acceptor motifs that are experimentally shown to be functional in nonhuman organisms can be prioritized for further investigation of their roles in human genome.
We were able to detect 61 novel motifs through two different algorithms, namely DREME and MEME.
We have used both algorithms as DREME is designed for finding shorter motifs up to 8 nt long whereas MEME searches for the longer ones. We have arranged MEME to detect motifs up to 50 nt long. As functional annotation results of the known motifs showed their biological significance and supported the computational predictions, we investigated the relevance of the novel computationally significant motifs. Comprehensive analysis of the biological significance of the novel motifs are done based on the length of the motifs.
Novel motifs with >20nt length were found to be simple or low complexity repeats. Shorter motifs found to be significant by DREME are all found to have sequence similarity between 84,6% to 99,80%, to previously known Homo sapiens motifs without any perfect match. For example, the motif namely "RGRAA", showed 99,80% similarity to "yrsTTTCabTTyCyc" (IRF8_HUMAN.H10MO.DIM01272 model of HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) v10 - [39]) when it is aligned as "---------TTyCy-". This is not an exact match as the aligned motif is much longer, so MEME-ChIP categorizes it as a novel motif. Only 6 of the MEME motifs were found to be >80% similar with previously known motifs. Furthermore, such MEME motifs are at most 16 nt long. Our observation supports that as the novel motif length increases, the corresponding sequences are not actually biologically significant patterns but mainly repetitive sequences.
For the novel motifs shorter than 20nt (i.e. 38 motifs), locational analysis of FIMO [22] outputs showed that 31 of them most frequently occur in Region 6, which is the intronic region in the upstream part of the splice acceptor site. Furthermore, Region 6 overlaps with the biologically significant part of the intronic sequences, where polypyrimidine tract (located about 5-40 base pairs upstream [40]) and the branch-point (19-37 base pairs upstream - [15]) maps. 27 of the novel motifs located most frequently 13 between the 1st and 18th nucleotides upstream of the acceptor site, indicating a high probability for regulating acceptor site splicing. Like the donor site splice signals that facilitates binding of U1 snRNP, these novel acceptor motifs can be proposed to be sequences that facilitates recognition of a site or a unique U2AF heterodimer [41]. Availability of data and material All data generated during this study are included in this published article and its supplementary materials. Acceptor site region sequences analyzed in this study are available in the GitHub repository ( https://github.com/gkaraduman/motifanalysis). Organism distribution of known motifs according to footprintDB. 49% (i.e. 221) of the motifs located in human genome, 48% (i.e. 219) of the motifs located to mouse genome, 2% (i.e. 9) of the motifs were seen on both Homo sapiens and Mus musculus, and 1% (i.e. 6) of the motifs were seen on some other rat types.
21 Figure 4 Functional annotations of Homo sapiens TFs binding to known acceptor motifs identified in this study.
22 Figure 5 Functional annotations of Mus musculus TFs binding to known acceptor motifs identified in this study.

Figure 6
Distribution of the MEME and DREME motifs according to their length. Most of the MEME motifs are 16 to 31 bases long whereas most of the DREME motifs are 6 bases long Novel motifs' (<20nt length) similarity percent with previously known Homo sapiens motifs.
Motifs found by DREME algorithm are shown in blue and MEME algorithm are shown in orange. DREME motifs have higher similarity percent compared to MEME identified motifs.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download. BMC_Genomics_GK_YAS_PaperSupplementary_Revised.docx