Identification of cis-regulatory motifs in first introns and the prediction of intron-mediated enhancement of gene expression in the plant Arabidopsis thaliana.

Background. Intron mediated enhancement (IME) is the potential of introns to enhance expression of its respective gene. This essential function of introns has been observed in a wide range of species, including fungi, plants, and animals. Studies in the plant ​ Arabidopsis thaliana have shown that enhancing introns exhibit a distinct base composition and are generally the first intron located close to the transcription start site. However, the mechanisms underlying the enhancement are as of yet poorly understood. The goal of the study was to identify potential IME-related sequence motifs and genomic features found in first introns of genes in the plant Arabidopsis thaliana. Results. Based on the rationale that functionale sequence motifs are evolutionarily conserved, we exploited the deep sequencing information available for Arabidopsis thaliana, covering more than one thousand Arabidopsis accessions, and identified 81 candidate hexamer motifs with increased conservation across all accessions, and which also exhibited positional occurrence preferences. Of those, 71 were found associated with increased correlation of gene expression of genes harboring them, suggesting a cis-regulatory role. Filtering further for effect on gene expression correlation yielded a set of 16 hexamer motifs, corresponding to five consensus motifs. While all five motifs represent new motif definitions, two are similar to the two previously reported IME-motifs, whereas three are altogether novel. To identify additional IME-related genomic features, Random Forest models were trained for classification of gene expression level based on an array of different sequence-related features. The results indicate that introns harbor information with regard to gene expression level and suggest sequence-compositional features as most informative, while position-related features, that were thought to be of central importance before, were found with lower than expected relevance. Conclusions. ​ Exploiting deep sequencing and broad gene expression information and on a genome-wide scale, this study confirmed the regulatory role on first-introns, characterized their intra-species conservation, and identified a set of novel sequence motifs located in first introns of genes in the genome of the plant ​ Arabidopsis thalian ​ a that may play a role in inducing high and correlated gene expression of the genes harboring them.


Introduction
The presence of introns is one of the defining characteristics of eukaryotes (1) . Introns are nucleotide sequences within a gene that are transcribed, but are removed during the maturation of the functional mRNA construct. While self-splicing introns, which catalyze their own removal from the maturing RNA, can be found in all domains of life, spliceosomal introns are unique to eukaryotes. As the name suggests, these introns require the spliceosome, a large protein/RNA complex, for removal from the nascent RNA product.
The origin of introns has been discussed intensively. Some studies suggest that introns, at their core, are selfish deoxyribonucleic acid (DNA) elements, emerging during population bottlenecks (2) . Their function, if having any at all, would therefore be a secondary development.
The evolutionary age of introns proved to be a controversial topic, which divided the field into two major theories. The "intron early" theory posits that introns are of very ancient origin, even predating cellular life, and all modern organisms without introns lost them secondarily. By contrast, the "intron late" theory advocated a relatively recent or even several independent origins (3) . However, as more whole genome sequences for a variety of prokaryotes became available, revealing an absence of introns in the majority of this group, the intron early theory was regarded increasingly unlikely. The occurrence of spliceosomal introns in all known eukaryotes and the precursor self-splicing introns in prokaryotes indicate a very ancient origin nonetheless.
This led to the proposition of a synthesis theory (4) , suggesting a very early emergence of self-splicing introns in simple prokaryotes, which further developed into spliceosomal introns in a common ancestor of all eukaryotes.
The question as to which functions introns and intron splicing have, has been discussed since their discovery. The fact that they can be found in the nuclear genomes of all eukaryotic life forms suggests a central importance (1,5) . This assumption also seems to reconcile the absence of an immediate functional role with the substantial burden introns impose on the harboring cells and organisms as a whole. The spliceosome complex consists of several small nuclear ribonucleic acids (snRNAs) and proteins, which all have to be available at high quantities to guarantee efficient splicing and to avoid delays in mRNA maturation, straining the cellular resources. Furthermore, transcription and intron splicing itself are energy-intensive processes.
Lastly, due to the crucial role of splice junctions in the correct splicing of genes, the organisms containing introns are prone to gene loss by single point mutations. A single mutation in the splice site or a mutation that generates a new splice site can lead to incorrect splicing and unwanted retention of intronic sequence or deletion of parts of exons. This may lead to loss of function of the resulting protein or to complete gene-loss due to introduction of a premature stop codon, either contained in the intronic sequence or by a frameshift, resulting in nonsense-mediated decay.
Additionally, intron mutations can lead to an activation of non-canonical splice sites, resulting in similar effects (6) . Correspondingly, a high number of human genetic diseases have been linked to mutations in introns (7) . Similarly, several plant mutants related to splicing errors affecting the phenotype are known (8) .
As introns have been observed across all eukaryotic cells, it seems reasonable to assume that introns play a vital role. Allowing alternative splicing is one of the leading explanations for the prevalence of introns. Alternative splicing is a process of selective removal or retention of exons or introns. This leads to multiple different mRNA transcripts that can be derived from a single gene, resulting in several different proteins, thereby efficiently expanding the protein repertoire.
Alternative splicing has been observed for most multi-exon genes, and has been suggested as the main contributor to eukaryotic complexity (9) . Conversely, other authors suggest that only a very small portion of alternative transcripts are actually translated into proteins at relevant levels (10) . Besides alternative splicing, other functions for introns have been discovered over the years. mRNA-stability has been linked to introns, indicating that splicing stabilizes mRNA, leading to an increase of mRNA half-life (11) . Furthermore, splicing can assist in the 3′-end formation of the pre-miRNA by recruiting capping factors (12) . Surprisingly, the "fate" of an intron does not necessarily end after its excision from the mRNA. Introns can contain RNA genes, which are expressed upon removal from the host mRNA. Several non-coding RNA families have been found in introns, such as snoRNAs, long non-coding RNAs (lnRNAs), miRNAs and small-interfering RNAs (siRNAs) (1) . The resulting RNAs can then have gene-regulatory roles. miRNAs and siRNAs inhibit translation of their target genes by binding to the mRNA and either blocking translation or initializing degradation of the mRNA. Thus, by inhibiting its host or its host antagonist, siRNAs can both promote or inhibit the expression of its host gene (13) .
As perhaps one of the most essential functions of introns, the enhancement of gene expression has been reported. Studies have shown that certain introns are able to enhance the expression of their respective genes by a significant amount. Interestingly, in contrast to normal enhancers, these introns have to be transcribed to trigger this effect (14) . This enhancement, known as Intron Mediated Enhancement (IME), is strong enough to be used as a tool in the repertoire of molecular biology techniques to boost the expression of specific target genes, and has been suggested to contribute to the high expression levels of housekeeping genes (15) . IME was one of the earliest known functions of introns, when it was discovered in 1987 in maize (16) .
Since then, IME has been found in a variety of species, from plants to vertebrates and nematodes (17,18) . It has been reported that IME can act via increased transcription rate, increased nuclear export of the transcript, transcript stability, and even enhanced translation efficiency (19) . The mechanisms responsible for these diverse modes of action of introns on the gene expression are as of yet not fully understood. However, a strong correlation between proximity of an intron to the transcription start site (TSS) and its capability to enhance expression has been identified, with the vast majority of reported IME found associated with the first (5'-most) intron of a gene (20) . As for the actual mechanism that results in gene enhancement, both splicing-dependent and splicing-independent effects have been reported.
The splicing snRNA U1 is known to interact with and recruit polymerase II via the transcription factor TFIIH, increasing the probability of transcription initiation in the presence of a functional 5′ splice site, possibly even independent of the actual splicing process (21) . This effect is the stronger the closer the site is to the TSS, as closer proximity is likely to result in an increased probability of the recruited polymerase to initiate transcription. Potentially further contributing to IME, the previously mentioned effect of splicing on the mRNA stability and capping efficiency increase the amount of functional mRNA. The exon junction complex, which forms at the junction of two exons after splicing, can increase the rate of nuclear export of the processed mRNA (22) . Finally, the exon junction complex has also been linked to increase in translation efficiency, by interacting with translation factors and ribosome components and recruiting or activating them (23) .
All previously described mechanisms depend largely on the splicing process and the spliceosome. This would mean that merely the presence of an intron in close proximity to the TSS is sufficient to induce IME, independently of the actual intron sequence. However, several studies have shown that different introns at the same position can have different effects on gene expression, ranging from no effect to strong stimulation (24) . Furthermore, certain introns can confer tissue specificity, leading to a spatially differentiated expression enhancement (25) . These findings strongly indicate a sequence-dependent mode of action of IME, which does not primarily rely on splicing alone. Concordantly, studies have shown that intron sequences can enhance expression without being spliced at all (19,26) . This implies a regulation similar to expression regulation by transcription factors. However, severe reduction of IME when splicing was inhibited has been reported as well (27,28) . These conflicting observations either suggest two independent mechanisms of IME, splicing-dependent and splicing-independent, or a combination thereof, which is supported by some studies reporting a reduction, but not a depletion of IME in absence of splicing (15) . To understand how all these factors play together, identifying introns, which lead to IME is crucial.
Primarily, IME-introns have been identified by experimental evidence (16,25,29) . While this is essential for gaining further insight into IME, it only covers a small portion of potential IME candidate introns. To identify IME introns on a larger scale, bioinformatic methods are required.
Currently, the only available computational method with this goal is IMEter, which works under the assumption that TSS-proximal introns are enriched in IME sequence motifs assumed as words (k-mers of length 5) (30,31) . IMEter computes a log-odds score for an intron sequence to correspond to TSS-proximal, and hence IME-signal-bearing, introns by scoring the actually present pentamers relative to observed relative frequencies of pentamers in TSS-proximal vs.
TSS-distal introns. This straightforward approach has shown promising results. Many of the previously established IME-introns were assigned high scores high by IMEter, and the authors even suggested a correlation between IME-induced mRNA-fold increase and IMEter score (32) .
Furthermore, in top scoring introns, two sequence motifs were detected, which, when present at high densities, are able to induce IME (25,32) . These motifs even led to an increase of mRNA levels when located within an exon (15) . However, not all introns that have been reported to be able to induce IME, score accordingly with IMEter or are enriched for the two reported motifs (15,32) . Therefore, alternative computational approaches may identify additional regulatory motifs in introns.
One commonly used strategy to bioinformatically identify functional sequence elements, called phylogenetic footprinting, assumes that they are likely conserved across related species.
With available sequence and associated single nucleotide polymorphism (SNP) information, this approach can also be applied to intra-species evolution, as applied, for example, in Arabidopsis thaliana (33) . This method requires a large set of sequences to achieve a high motif resolution.
However, when performed across species, increasing the number of species in a set also increases the divergence between sequences, leading to ambiguity in the orthologous relationships. Conversely, if the sequences are too similar, for example being from the same species, the density of accumulated mutations might be too low to determine conserved regions.
Therefore, in this case, a large number of genomes is required to ensure a minimum variety among the compared sequences. The 1001-genome-project offers this opportunity, with a Single Nucleotide Polymorphism (SNP)-set for 1,135 fully sequenced Arabidopsis thaliana accessions (34) . Moreover, a large compendium of gene expression data (microarray-and RNA-seq-based) is available, allowing to test whether introns sharing a particular motif also share a similar expression pattern as well as available methylome data, permitting to include epigenetic information in the analysis (35) . A previous study was able to identify novel motifs in promoter regions using the 1001 Genome project SNP set (36) . The authors compared conservation not only across a single mapping location, but compared all mapping locations of a motif. This design attempts to avoid the previously described problem of the relatively low SNP density between the Arabidopsis accessions by determining the degree of conservation of a motif over all its occurrences in the genome. As a further relevant parameter, it was shown that most motifs had a positional preference, indicating a correlation between occurrence of a motif and SNP density.
As an alternative and broader, more general approach, machine learning can be applied to the problem of identifying factors relevant for IME. Machine learning has been applied successfully to a wide range of biological problems, such as predicting RNA and protein folding, identifying descriptors of enzyme efficiency, and prediction of regulatory motifs (37)(38)(39) . While the classical machine learning algorithms, such as decision trees, lost popularity in favor of deep learning, their models allow better interpretation. By identifying intron features that can influence gene expression levels, alternative modes of action underlying IME could be discovered.
This study builds on the rationale that IME-motifs are conserved more than expected by chance and uses a SNP-based approach to identify cis-regulatory elements, initially defined as sequence hexamers, applied to intronic sequences. Introns were split into two groups, first (5'-most) introns, which are known to be able to induce IME, and all other downstream introns. By adding conservation and location distribution as characteristic features associated with IME candidate motifs, our approach attempts to extend the concepts established by IMEter, which relies on candidate motif occurrence differences in the first vs. other introns alone. Differential methylation as a potential regulator of IME was also investigated here. For validation of functional relevance, correlation of gene expression of all genes containing candidate IME motifs in their first intron was used.
To assess the information contents of intronic sequences on gene expression and to extract associated informative features, his study also includes a Random Forest (RF) classifier model for prediction of mRNA expression levels based on intron sequence information. Strongly expressed genes have been associated with IME, indicating that intron features that coincide with high gene expression of the respective gene might be related to IME (20) . A number of sequence characteristics of the respective first intron, such as intron length, nucleotide composition, distance to TSS, distance to the translation start codon, and the IMEter score served as features for the Random Forest classifier. In addition, folding energetics of intronic RNA, cross-species conservation, and presence of transpons was considered as well. The goal was not only to create an accurate model, but also to extract features that contribute the most to the prediction accuracy in addition to the more targeted k-mer motif approach.
We report the identification of 16 motifs, collapsing to five consensus motifs. While all five motifs constitute new motif definitions, two resemble previously reported IMEter motifs, and three appear altogether novel. The RF-models confirm the predictive potential of introns with regard to the expression level of their host genes and suggest features associated with base composition as particularly informative. Our results shed new light on the possible mode of action responsible for IME and may serve as a starting point for further approaches examining IME in the future.

Extraction of intron positions and sequences
The Arabidopsis Information Resource (TAIR)10 (40) General Feature Format version 3 (GFF3) file was used to extract the sequence coordinates of all mRNA introns within the Arabidopsis thaliana genome via exon positions to infer intron positions. All introns shorter than ten base pairs (bp) were excluded. A FASTA file containing all introns was created by using bedtools (41) and the complete TAIR10 genome as a reference. The intron set was then split into first, i.e. the promoter-proximal intron set, and the set of other introns. Introns located in the 5'UTR of a gene were detected by an overlap between an artificially length-extended (5bp at either end) intron and 5'UTR coordinates.

Extraction of relevant single nucleotide polymorphisms (SNPs)
SNPs were extracted from the 1001 genome project variance calling file (VCF) (34) . All variants that were positioned in one of the introns were extracted. A threshold of 50 was set as the minor allele frequency for SNP positions to be considered and 500 valid (i.e. non-"N") alleles called, with alleles counted as haploid counts (i.e. counts per chromosome). With VCFtools (42) , the resulting VCF file was used to extract all SNP positions. In total, 2,426,458 SNPs were used, of which 382,016 were located in introns.

Selection of k-mer size
As a compromise between specificity of motifs (favoring longer motifs) and the combinatorial increase associated with increasing motif-length, a k-mer size of k=6 was chosen, from here on

Relative frequency of hexamers
Similar to IMEter (32) , the frequency of hexamers in first introns compared to other introns was taken as the initial criterion for the identification of potential regulatory hexamers. For both intron sets, first and other introns, the total occurrence of each hexamer, H, over all introns in the Col-0 reference genome was determined, and then normalized by the total occurrence of all hexamers for each intron group, respectively. Afterwards, the relative frequency, F, was calculated by dividing the normalized frequency of hexamers in the first by the normalized frequency of hexamers in the other introns, with , where C stands for counts, f and o for first and others, respectively. N is the total number of observed hexamers (N=2080).

Degree of conservation of hexamers, conservation rate
To assess the degree of conservation of each hexamer, the total number of occurrences of each hexamer introns was compared to the occurrence of the same hexamer with SNP positions masked, performed separately for first and other introns. The masking was done by replacing each position containing a SNP with a symbol not in the nucleic acid notation, in this case "*". The degree of conservation was calculated as the ratio of hexamer counts, C H , with SNPs masked and the counts without masking. This provides a position and alignment independent measure of conservation with ratio-values near 1 suggesting high conservation and smaller ratios suggesting increasing variability. For comparison, the randomly expected conservation was computed as: where N SNP is the number of SNP-positions found in introns and N bp is the total number of positions in respective introns, computed separately for first and other introns. C r corresponds to the probability of a k-mer not containing any SNP position given the background SNP-density.

Positional distribution of hexamers in introns
Two factors were considered for the location distribution of hexamers within introns. First, since most cis-regulatory elements show preferences for specific localization, we hypothesized that relevant hexamers should show a characteristic distribution, which significantly differs from a uniform distribution. To examine this, the relative positioning of each occurrence of a hexamer in an intron was determined by dividing the first position of each hexamer occurrence by the length of the respective intron. These relative start positions were then binned into ten bins covering an interval of (0, 1). Based on the binned occurrence counts, positional preferences were expressed as position entropies, S m , withfigure where p H,b is the relative frequency of hexamer motif (k-mer) H occurring in bin b.
For each hexamer, 10,000 random uniform distributions with the same number of occurrences were simulated and the entropy for each distribution was calculated. Since uniform distributions have the largest possible entropy (over a finite interval), non-uniform distributions should be significantly smaller. By comparing the entropy of the actual hexamer entropy relative to the random entropy, an empirical p-value was calculated.
As a second criterion, to be considered a candidate hexamer motif, the distribution of hexamers was required to be significantly different in first introns compared to the distribution in other introns. A Fisher's exact test on the binned data was used to determine whether there was a significant difference between the two distributions.
For both metrics, the Benjamini-Hochberg method of False Discovery Rate (FDR) adjustment was applied (43) .

Multiple sequence alignments/ Consensus motif generation
For the identification of a consensus motif from candidate hexamers, a Multiple Sequence Alignment (MSA) on a subset of hexamers considered candidate motifs was performed. The multiple alignment using fast Fourier transform (MAFFT) tool (44) was used to perform the alignment. JalView (45) was utilized for tree visualization. For comparison of consensus motifs, the STAMP tool (46) was used. Collapse of hexamer motifs into consensus motifs is, by its nature, to some degree, arbitrary and was performed requiring a minimum support per consensus position of two individual motifs and guided by the dendrogram of sequence-distance-clustered motifs ( Figure 3a) with the objective to group similar motifs together, while unique motifs separate..

Calculation of IMEter score
IMEter (31) is a tool scoring the similarity of a sequence to introns close to the TSS. IMEter version 2.2 was downloaded from the KorfLab/IME github repository. IMEter was trained with the Phytozome dataset as described in the IMEter use manual (47) . The IMEter score for each first intron was then calculated. The introns were subsequently ranked by their IMEter score.

Comparison of gene expression
As a gene expression dataset, microarray expression data from Craigon et al. Potential motifs were compared to high IME-scoring introns as judged by the IMEter tool.
The correlation of the hexamer gene set was compared to the set of genes with the highest IMEter score with equal set size by calculating Cohen's d effect size.

Analysis of differentially methylated regions
For the analysis of differential methylation, information on differentially methylated regions

Identification of new motifs and motif binding comparison
The tool Tomtom was used to compare candidate motifs to a set of 872 sequence motifs reported as part of the published DAP-seq motif dataset for Arabidopsis thaliana (49) . DAP-seq motifs correspond to transcription factor binding sites motifs derived from binding assays of transcription factors to "naked" genome DNA segments.

GO-term enrichment
Gene Ontology (GO)-term enrichment analysis was performed based on a Fisher's exact test with FDR correction. The terms were extracted from the GO-slim-term subset available from TAIR10 (40) .

Selected features
All features chosen to characterize introns were directly or indirectly linked to information contained in first introns. Table 1 lists all features along with a short description. The length of the first intron, the distance of the first intron to the coding sequence, the distance of the first intron to the transcription start site and intron retainment of the proximal intron were derived from the extracted intron GFF3 file. The relative base-type frequencies were derived from the extracted fasta file of the first introns, with the flanking three bp bordering the splice sites masked. The relative dimer counts were calculated in a similar fashion as the hexamers described above, but with k=2. All possible dimers were determined, their occurrence in each first intron, excluding the splice sites, were assessed, and the count of reverse complementary dimers were combined.
Finally, the counts were normalized by dividing by the respective intron length.
Information about differentially methylated regions (DMRs) was derived as described above. Similarly, the IMEter score for the first introns was calculated as described above. The SNP-frequency per bp was calculated using the VCF file.
The minimum folding energy was calculated using mfold (50) . The mfold script was Transposable elements were extracted from the TAIR10 transposable element dataset (40) . The total number of transposable elements per intron was normalized by intron length.
As an indication of functional relevance, we probed introns for evidence of retention in annotated splice variants as reported in the GFF-file. If an intron sequence was found to overlap with an exon of an alternative transcript, it was considered retainable (retention=1), otherwise not (retention=0).

Classification
As a target variable for prediction, gene expression level was chosen. For the expression set, the microarray data (48) was utilized. The median expression for each gene across all samples was determined. A binary classification into high/low expression was chosen using the median as a set division threshold. To potentially increase prediction performance, models were also created for a modified dataset, which contained only genes found in the upper and lower quartile of RNA expression levels. The goal was to create two more distinct groups to allow better classification (increased contrast).

Model selection
For creating the actual prediction model, the Random Forest (RF) classifier as implemented in the sklearn (52)

Dataset selection
For training the Random Forest model, the dataset for the introns was randomly split into training and test dataset with a ratio of 80% and 20%. For the creation of learning and ROC curve analysis, ten-fold cross-validation on the whole set was performed.

Feature importance
For determining the feature importance, permutation feature importance was selected. It has been suggested that this method provides better results than the "Mean Decrease in Gini" method, which is used by the sklearn classifier (53) . After training the classifier, one feature of the test set was permuted randomly and the accuracy was scored. This was repeated five times for each feature, and the mean decrease in accuracy (MDA) was calculated, respectively. This process was repeated for all features.

SHAP importance
The Shapley Additive explanation (SHAP) method explains individual predictions of a model (54) .

Statistical analysis and visualization
All statistical analyses were done in Python 3.7 (56) . The modules scipy (57) , numpy (58) , and pandas (59) were used. Visualization and plotting was performed with the modules matplotlib (60) and seaborn (61) . In cases of single test statistics, reported p-values less than p=0.001 are not specified further and indicated as p<0.001.

Comparison of SNP-frequencies in first versus other introns
Since it has been shown that specifically the first intron bears the capacity to influence expression High sequence conservation, as reflected by a low SNP-density, can be an indicator of functionality (62) . This agrees well with IME-function predominantly being found in introns close to the TSS, and therefore close to, or even within, the 5'-UTR, indicating a possible correlation between conservation and IME function, but within CDS regions, first and other introns do not follow the expected conservation pattern.

Selection criteria for potential cis-regulatory intron motifs
For identifying candidate intron motifs associated with IME, a k-mer-based strategy similar to IMEter was applied, with additionally utilizing conservation and relative position in introns as informative criteria, similarly as described by Korkuc et al.
(2014) (36) . As a compromise between specificity of a sequence motif and combinatorial explosion, a k-mer length of k=6 was chosen. All counts of reverse complement hexamers were combined, leading to a total of 2080 unique

Evolutionary conservation of hexamers
Our approach builds on the rationale that functional motifs show increased conservation.
Therefore, and if indeed IME is associated specifically with first introns, we expect potential motifs to be more evolutionarily conserved in first introns than in other introns. The mean conservation rate (see Methods for definition) over all hexamers was determined as 0.9131, higher than the randomly expected rate, C r , Eq. 2, of 0.905 (Figure 2a). Similarly, other introns had an average hexamer conservation of 0.915 compared to the expected value of 0.907 (Figure 2b). At first, it may be surprising that the average observed hexamer conservation is higher than that based on the expected background conservation (Eq. 3). This apparent contradiction can be explained as an indication that SNPs are not completely randomly distributed within introns, but tend to positionally cluster. Similar observations have previously been reported (64) . This could be due to either a bias in the sequencing technology or some biological process. Also, hexamers with very low occurrences tend to have higher SNP-rates (Figures 2a, b). This may point to a sequencing artifact as well (homo-oligomeric stretches). A total of 929 hexamers were determined to have a higher conservation in first introns relative to other introns, while 1151 hexamers were more conserved in other introns, which reflects the observed higher SNP frequency, and hence, lower conservation, in first vs. other introns (Fig 2a).

Relative occurrence of hexamers in first vs. other introns
Under the assumption that functional sequence motifs induce IME, it appears plausible to expect that these motifs show a higher relative occurrence in first introns compared to other introns, since the vast majority of reported IME-introns are first introns of a gene (31) . Inspecting relative hexamer counts (count of a particular hexamer divided by the total number of detected hexamers), 843 hexamers were detected with higher relative occurrence in first compared to other introns, while for 1237 hexamers, the inverse was true. A closer examination of the relative count distribution of hexamers revealed a significant difference between the distribution of hexames with lower relative frequency versus those with higher relative frequency in first introns (Figure 2c that may point to the ones that are functionally significant and, thus, enriched.

Non-uniform positional distribution of hexamers in introns
Studies have shown that functional sequence motifs often exhibit a positional preference (36,63) , including signals associated with IME (31) . Assuming that potential functional motifs in introns exhibit this preference as well, hexamer positional distributions were tested for deviation from uniformity (see Methods), yielding 1448 hexamers detected with significantly non-uniform positional distributions in first introns.

Dissimilar positional distributions of hexamers in first vs. other introns
Finally, to exclude positional preferences unrelated to hexamer IME function, only hexamers with significantly different positional preferences in first and other introns were considered further. A Fisher's Exact test comparing positionally binned distribution of hexamers (ten bins, see Methods) within first introns to other introns respectively yielded a subset of 459 introns, which were significantly differently distributed in first vs. other introns.
In total, 81 hexamers met all four requirements laid out above, and were investigated further.

Analysis of identified candidate hexamers Expression correlation of genes containing candidate intronic hexamer motifs
To test for any regulatory effects of the identified 81 candidate first-intron motifs, correlation of gene expression level was taken as an indicator. Under the assumption that an intron motif regulates gene expression, those genes that harbor a particular motif should exhibit a higher correlation of gene expression amongst them than a comparable set of random genes. However, increased correlation among genes with a specific intron motif could not only indicate regulatory effects, but also originate from the genes being homologous. Closely related genes might exhibit a similar expression profile and will also be more sequence-similar to one another with a correspondingly increased probability to find the same hexamer in their introns. Therefore, candidate motifs were compared to hexamers with similar occurrences as the one under consideration (within a 10% interval of higher/lower occurrence) to account for this effect. Gene expression correlation of the gene subset containing the hexamer of interest was computed, and then compared to the correlation of genes observed to each contain a comparable hexamer in their first intron. Of note, as a control, we compared the matching k-mer approach to the naive approach to simply use all other genes and found concordant results (Supplementary Figure S1).

Analysis of candidate hexamer subset with evidence of expression regulation
For each of those 16 hexamers, the average gene expression level of genes harboring them in their first introns was significantly higher than the average of the whole set (p<0.001), with Cohen's d effect size ranging from 0.18 for ACCCTA to 0.45 for AGATCG. This result is in line with motif-mediated IME being associated with highly expressed genes.

Candidate consensus motifs
The set of hexamers contained sequence-related hexamer sequences, which may be equivalent in function or part of a larger consensus motif, e.g. AGATCG and TCGATC (with its reverse  signalling and transcription factor activities appear to be less present in the gene sets associated with the five identified IME-candidate-motifs.

Comparison of potential regulatory motifs to IMEter
To further evaluate the newly identified motifs, they were compared to the most commonly used tool for identifying potentially IME introns. IMEter scores whole introns (32) , or, in a new version, a sliding window of 50bp (31) . For all first introns, the IMEter score was calculated, and then sorted by score. Genes with the highest scoring introns were correlated amongst themselves at significantly higher levels than a subset of random genes of equal set size (p<0.001), with an average Cohen's d of 0.183 (Figure 4a). The mean expression level of the top 2000 IMEter score genes was significantly higher than that of the whole gene set (p<0.001, Cohen's d effect size of 0.43). By comparison, correlation of expression amongst genes containing either one of the five consensus motifs reported in this study was either comparable to or only slightly below that of the IMEter set (Figure 4b, |Cohen's d| <0.1), suggesting a potentially cis-regulatory role.
The overlap between the candidate motif gene sets and the corresponding IMEter sets of equal size was large, with an average overlap of 34% (p<0.001, regarding the overlap in percent, note that sets were always of the same size). This is expected, since both approaches partly employ similar strategies for identifying IME. When the candidate motifs were compared to the IMEter set not containing overlapping genes, the effect size associated with the candidate motif gene set generally increased, resulting in comparable gene expression correlation of genes within the respective gene sets as observed for the size-matched top-IMEter gene sets, with effect size ranging from 0 to 0.1 (Figure 4c), suggesting an even stronger, albeit slightly, regulatory effect associated with the identified five consensus motifs compared to IMEter-selected gene sets. Taken together, our consensus motifs resulted in similar effects as compared to the IMEter-based intron scoring and the two IMEter-motifs, and yielded novel motif definitions and/or altogether novel motifs that may function in a cis-regulatory fashion.

Effect of differential methylation in first introns on gene expression
A study of vertebrates by Anastasiadi et al. (2018) has shown a strong inverse correlation between methylation in the first intron and gene expression (18) . They also showed that first introns exhibit the highest density of differentially methylated regions (DMRs) of any genomic feature, and that certain DMRs could positively correlate with gene expression. These findings suggest a potential influence of DMRs on IME and therefore on gene expression, which was further investigated here using published DMR data (35) . The gene sets associated with the two methylation contexts, CG-and C-DMRs, that each were found with sufficient numbers of observations (CH-DMRs were not considered as fewer than 100 cases of overlaps with first introns were observed) had very little overlap with either the top scoring IMEter genes, or any of the potential hexamer motif gene sets. Genes containing C-DMRs in their first intron were significantly more correlated than a set of random genes, with an average effect size of 0.1.
However, C-DMR genes had a significantly lower gene expression than the set average (p<0.001) with an effect size of -0.74. Conversely, genes with intronic CG-DMRs were expressed at significantly higher levels than the set average (p<0.001) with an effect size of 0.07. Yet, the CG-DMR subset showed a comparatively lower expression correlation than the C-DMR set, with only 0.054 as the average effect size compared to a subset of random genes of equal set size.
With regard to overlap of DMR-set genes and the gene sets associated with any of the five candidate motifs reported here, for C-DMR, no significant overlap was detected. By contrast, the CG-DMR genes overlapped significantly with all five consensus motifs (p<0.004), with enrichment factors of 1.2-fold and higher.
In conclusion, no coherent picture emerges with regard to the role of DMRs in IME. While genes with CG-DMRs in their first introns are expressed at higher than average levels, the corresponding set of genes does not show correlated gene expression, a feature, which we considered evidence of regulation used to identify IME motifs.

Random Forest model for prediction of expression level based on intron features
IME has been connected to highly expressed genes such as housekeeping genes. Thus, it appears possible to cast the problem of identifying features responsible for IME as a feature extraction problem with Machine Learning methods applied to the prediction of expression level.
By only including features of the first intron, the goal was to investigate the predictive value of first introns with regard to expression level of their respective genes. Random Forest Classifiers were trained for the prediction of expression level. Initially, genes were binned into two groups, sets with high and low expression level, respectively, used as classes for building the classification models, with the global median expression level taken as the threshold value. To increase contrast, binning of genes was performed based on the upper and lower quartile of expression levels. A spectrum of sequence-dependent and sequence-independent intron features, which we considered potentially predictive, were selected and tested (Table 1).
Using the median-split gene classes, the achieved model performance was modest (area under the ROC (AUC) of 0.68 and an average accuracy in a tenfold cross-validation of 63%.
When increasing the gene expression difference between the two considered gene sets by using the upper and lower quartile of expressed genes to train models, a substantial increase of model performance was observed (AUC=0.78, accuracy=72%) (Figure 5a).

Feature importance
For the trained Random Forest models, feature importance, judged as MDA, the mean decrease in prediction accuracy, was determined. For the best performing model, sequence composition features seemed most important, with the percentage of guanine (G) and adenine (A) having the highest impact on model performance (Figure 5b). IMEter score, which is derived from the distribution of pentamer-motifs in introns, the relative occurrence of TC-dimers and percentage of Cytosine (C) were also found to have high feature importance, further suggesting sequence-dependent effects. Finally, intron-length and distance to the TSS had only a small positive effect on prediction performance. This is surprising, since IME has been closely associated with distance to the TSS. However, MDA, while powerful, is susceptible to correlated features, as influences can be masked. MDA importance and SHAP importance were observed for the features distance-to-the-TSS and CDS-start, which were considered less important by SHAP, and number of differentially methylated regions (methylation C), which had a stronger effect on model prediction according to SHAP. In the case of C-DMRs, an interesting pattern was observed. While a low number of differentially methylated sites had no effect on the model prediction, high numbers resulted in a negative SHAP value, indicating that the model associated them with lower gene expression. This is consistent with the significantly lower mean gene expression level of C-DMR genes reported above. The low impact of both distance features (distance_TSS, distance_CDS) was yet again surprising, since IME has been associated with both, a short distance to the TSS, as well as being positioned in the UTR. Even more surprising is that very small feature values (distances <200b) were associated with negative SHAP values, i.e. the model was more likely to classify the respective gene as expressed at low levels ( Figure 5d).
As considered above for the relevance of motifs, it needs to be considered whether there is indeed a specific signal in introns that causes increased gene expression of the associated genes, or whether our classifier simply picks up on features associated with genes that are expressed at high levels, such as housekeeping genes. To test for that, we extracted the same set of features, as considered for introns, for first exons of the same genes and built RF-models using exon-only, intron-only, and exon-intron-combined features. As shown in Figure 6a, while the performance of exon-only and intron-only features is comparable (AUC=0.78), considering both combined leads to a significant increase of predictability (AUC=0.81). We interpret this as evidence that introns hold information over and above that, which is associated with recognition of highly-expressed gene families alone, for which exon-only and intron-only serve as a suitable point of reference. Furthermore, both exon and intron features were considered equally important (Figures 6b, c) Taken together, these classification results imply that there is indeed relevant intronic information for determining the expression level class (high vs. low) of genes and suggest a number of informative features.

Discussion
Intron-mediated enhancement (IME) has been discussed as an important regulator of gene expression, found in nearly all eukaryotic systems tested so far [23]. It has been associated with highly expressed genes [20,67,68]. However, the exact mode of action for inducing expression enhancement is not yet understood. Several different mechanisms have been proposed, with the seemingly biggest open question being the importance of splicing [23]. Several studies suggest that the recruitment of splicing factors, even in the absence of splicing, are the major determinants for the induction of enhancement [23,27]. Conversely, studies have shown that some introns and intronic motifs were able to induce enhancement irrespective of splicing and splicing factors, and suggested a mechanism acting at the level of genomic DNA [20,65,67].
Here, we reported the identification of 16 hexamer and five resulting consensus DNA sequence motifs that may be related to IME in the plant Arabidopsis thaliana . Building on previous studies on sequence signals associated with IME (31,32) , our study exploited the available deep sequencing information of more than one thousand Arabidopsis thaliana accessions allowing us to probe for conservation as a hallmark of functional relevance.
Furthermore, we imposed more explicitly that motifs should show positional preferences within introns, an assumption that appears supported by prior findings (36,63) , and tested as evidence of a functional effect that motif harboring genes exhibit correlated gene expression in addition to elevated expression level, making use of the plethora of available expression information. Thus, we postulated that IME not only leads to increased expression, but also includes a regulatory component, leading to concerted gene expression of subsets of genes. In line with this assumption, out of 81 motifs, identified based on conservation and evidence of preferential intron locations alone, 71, i.e. almost all, were associated with increased correlation (positive, rather than negative effect size, p bionial, prior=0.5 =1.8E-12, Supplementary Table S1) and 16 (19.7%) were found associated with significantly elevated co-expression (effect size, Cohen's d>5%) ( Table 2).
Contributed by the seminal studies on IME (31,32) , IME has been associated with whole introns (IMEter tool and score) and two motifs (CGATT and TTNGATYTG, with the first being a sub-motif of the other) have been implied as functional. Our study enlarges this set to 16 hexamer and five consensus motifs that now can be explored further and experimentally characterized. The IMEter tool has been shown to be a good indicator of IME, with experimentally identified IME introns having consistently high IMEter scores, and the level of enhancement of known IME introns correlating with IMEter score (25,31,65,66) . The motifs identified here and based on conservation, relative occurrence, and positional distribution were comparable with regard to their effect on correlation of gene expression and expression level to genes with the highest IMEter score (Figures 4b,c). Therefore, it seems likely that the discovered hexamer and consensus motifs are truly related to IME. Building on conservation using intra-species sequence variation, as done here, also is supported by previous observations indicating that regions with high IMEter score were conserved among different species (24,31) .
As the mode of action of IME still is a big unknown, the fact that we did succeed in identifying motifs that are, based on our filter and test criteria, associated with IME, suggests that either a molecular recognition event -such as binding by proteins -may be at work, or that the motifs play a RNA-structural role relevant for splicing. At this point, using the approaches presented here, we cannot interpret the data in favor of either of the two alternatives. However, our study provides novel candidates for targeted follow-up studies.
In addition to a search for sequence motifs, we performed a Random-Forest-based classification of genes with regard to gene expression level. Here, the goal was to a) prove predictability of expression level using intron-based information, and b) to identify additional features relevant for IME. Indeed, we were able to show that introns hold information on expression level over and above the information provided by the gene context (exon-related information), (Figure 6a). Overall, base compositional features (most significantly, G-contents) were found most informative, and more important than other parameters such as distance to the TSS or other parameters that would allow to arrive at more interpretable conclusions with regard to mode of action of IME (such as DMRs, folding energy and others) (Figures 5b, 6b). Low A-and high G-content of the first intron were pivotal for classification as a highly expressed gene. In contrast to the k-mer-based IMEter score, A-and G-content are more general features, describing the composition of the intron and the pre-mRNA. This could indicate a motif-independent influence of first introns on gene expression. Studies have shown that intron composition can regulate splicing by influencing pre-mRNA folding around the splice sites (67) , which could explain the observed effects. Compositional effects have also been reported to influence mechanical properties of genomic DNA, such as bending flexibility (68) . However, initial attempts by us to use machine learning to associate reported sequence-dependent DNA flexibility measures to IME proved unsuccessful (not shown). However, considering mechanical properties of pre-mRNA may offer a fruitful avenue for further research.
Confirming the validity of prior approaches, the previously published IMEter score was among the most important features. Correlation between IMEter score and enhanced gene expression by IME has been established, also experimentally for selected gene sets (31) . The results of this study show that this also applies to the whole genome.
Regarding the role of DNA-methylation, more specifically, differential methylation (DMRs), given the data and approaches used here, no consistent picture emerged. While C-DMR regions in introns were found associated with increased correlation of the corresponding genes, they were expressed at low levels. Conversely, CG-DMR intron genes showed higher expression, but low correlation. Hence, a cis-regulaotry role of DMRs in introns related to IME appears unlikely.
With regard to intron-related cis-regulatory functions, first intons (5'-most) have been considered most relevant (69) . Our observation that first introns, on average, show a slightly increased SNP-density compared to the remaining introns ( Figure 1) appears counter-intuitive.
However, introns located in 5'-UTRs exhibit a reduced SNP-density, and hence increased conservation. Therefore, UTR-located introns may play a different functional role than introns embedded in coding regions, which is consistent with previous reports that several UTR introns have been reported to induce IME (69) .
Also, when considered as a predictive feature, the distance of the intron to the transcription start site (TSS), with close distances having been discussed as more associated with IME, was not found to be particularly informative. This relatively low impact of the distance to the TSS on the expression level prediction (Figures 5b,c) is surprising. Previous studies had suggested that proximity to the TSS is an essential property of IME introns [70], and that IME effect declines with distance to the TSS [32].  Figure S2), and, as reported previously [66], also a high correspondence of pairwise correlation (r=0.49). It should be noted that the probed conditions were very different.
Hence, expression level and pairwise correlation proved robust, reflecting condition-independent, coherent expression regulation. Thus, as expression level and pairwise correlation were the two criteria tested for in this study, the microarray data used here can be considered representative.
We performed our analysis within one species (Arabidopsis thaliana) with sequence variations amounting to single nucleotide polymorphisms (SNPs). While this intra-species approach eliminates the alignment challenges associated with inter-species studies, evolutionary conservation is confined to a relatively short divergence time (about 5mya (71) ). The associated limitations have been discussed before in a study on promoter elements (36) and correlated mutations (72) and apply here as well.
Concerning the employed classification methodology, we employed Random Forest classifiers. While recently, deep learning architecture (Recurrent and/ or Convolutional Neural Networks (RNNs, CNNs), have proven to be powerful sequence-based classification approaches (73) , RFs, in addition to being a powerful classification engine, allow for a more direct assessment of feature importance, which specifically was the goal of our study.

Conclusions
Exploiting deep sequencing and broad gene expression information and on a genome-wide scale, this study confirmed the regulatory role on first-introns, characterized their intra-species conservation, and identified a set of novel sequence motifs located in first introns of genes in the genome of the plant Arabidopsis thaliana that may play a role in inducing high and correlated gene expression of the genes harboring them.

Declarations Ethics approval and consent to participate
Not applicable

Consent for publication
Not applicable

Availability of data and materials
All relevant data are made public as part of this publication or are publicly available.

Competing interests
None.

Max Panck Society
Authors' contributions