Promoter Motif And Conserved Module Analysis In DNA And RNA Using AIModules

Identification of transcription factor binding sites (TFBSs) elucidates regulation and control of transcription and gene expression. Overprediction of TFBSs due to the high abundance of small binding motifs is a notorious problem. For this reason, comparative evaluation of TFBSs is important. The tool AIModules presented in this paper offers an integrated solution for TFBS analysis as a web service and as downloadable software. It works for other DNA and RNA motifs as well. Several sequences (DNA and RNA) are searched for TFBSs using predefined matrices from the JASPAR database or user-supplied matrices for DNA or RNA motif discovery. Furthermore, AIModules can find TFBSs common to two or more sequences. Whether high or low conservation is demanded, AIModules outperforms other solutions in speed and finds more modules (specific combinations of TFBSs) than the alternatives. The web application also searches RNA motifs (e.g. the polyadenylation motif) as well as other DNA motifs (e.g. enhancers, silencers) and motif combinations. The application is free and open source and can be used on-site or locally. These functionalities allow a number of biological applications, in particular the analysis of promoters, including conserved modules and shared TFBSs in conserved gene families, and the detection of modules, motifs or regulatory elements in mRNA.


Introduction
During transcription of DNA to RNA, the transcription preinitiation complex is built up by transcription factors (TFs) that bind specific regions of the DNA. These regions are called transcription factor binding sites (TFBSs); on the one hand they make up the promoter itself, and on the other they are conserved across species as well as gene families. For example, TFs make up about 8% of all human genes (1). Due to mutation, the sequences of TFBSs differ between species. To search for the equivalent TFs among them, techniques were developed to estimate with high confidence not only the corresponding TFs but also coupled TFs, which are called modules. For that purpose, databases such as GenBank, Prodoric, MotifMap, etc. were developed, as well as TRANSFAC, which offers a free public but outdated version from the year 2005, and the up-to-date JASPAR (3) database, which uses high-quality matrices generated from SELEX, protein binding microarrays (PBM), ChIP-based assays, etc. (1) (see Supplementary Table S9 for more on matrix generation). Via these techniques TFs could be predicted with high confidence. However, because the TF-TF protein interactions in the preinitiation complex and other details of TF binding to the DNA are not known, predictions rely on a single DNA motif per TFBS, and false positives are generated. It is therefore important to search for conserved TFBSs and to consider cell-type-specific as well as functional modules. The software "Genomatix" was created to do exactly that (TRANSFAC, as another tool for module discovery, is not part of this paper and is mentioned in the Supplementary section "TRANSFAC"). However, the tool was commercialized, and an open source alternative is currently not available. Thus, we present a free and open source tool which fills this gap, and we present its different biological uses (e.g. motif and module search).
In addition, this allowed us to generalize the tool so that other DNA motifs (silencers, enhancers) as well as more complex combinations of them can also be searched. Moreover, RNA motifs and conserved binding sites in RNA can be searched efficiently. This is demonstrated for a poly(A)-tail search in RNA.

Architecture
We used a three-layered architecture (Figure 1): front-end, back-end, database. The searches for TFBSs are performed in the back-end, whereas the module search and the rendering of the results take place in the front-end. Furthermore, we prepared a Docker-based solution (Supplementary section "Build and Deploy").
Remark to editor: Insert Figure 1 here!

Motif searches
At the time of writing this manuscript, only two commercial products (the Genomatix database tool and the TRANSFAC database) were available that could find modules (coupled motifs) on a stretch of DNA. Therefore, we sought to make available an open source tool that allows the user to insert their own DNA stretches as well as their own matrices, and that also provides a comprehensive collection of matrices from JASPAR. The tool is a web service and therefore allows for easy integration and application of new features. Motif search is done on the server side, whereas module discovery is done on the client to decrease the load on the server.
The user interface is designed to support the user so that a quick start is possible (see Supplementary section "User Interface"). After module calculation, the result of the module search is depicted visually and can also be downloaded as an Excel file (Figure 2; for details see Supplementary Fig. S6).
AIModules offers three types of searches: TFBS, RNA motif and module search.

TFBS search
The most basic search AIModules offers is the search for TFBSs. For that, the user inserts the DNA sequences for analysis and selects the thresholds (La and Ld) and the matrices. The matrices can either be selected from the database or be inserted individually. Matrices for AIModules can also be generated in the tool itself. An example of the TFBS search is depicted in Figure 2a.
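How a matrix can be derived from a set of aligned binding sites is sketched below. This is only an illustration of the general idea of a position frequency matrix, not the code used inside AIModules; the function name `build_pfm`, the row-per-base layout and the example hexamers are assumptions made for this sketch.

```python
BASES = "ACGT"

def build_pfm(aligned_sites):
    """Count bases per position in equally long, aligned sites.

    Hypothetical helper for illustration; returns {base: [count per position]}.
    Uracil is mapped to thymine so that RNA sites can be used as well.
    """
    sites = [site.upper().replace("U", "T") for site in aligned_sites]
    if not sites:
        raise ValueError("at least one site is required")
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")

    pfm = {base: [0] * width for base in BASES}
    for site in sites:
        for position, base in enumerate(site):
            if base not in pfm:
                raise ValueError(f"unexpected character {base!r}")
            pfm[base][position] += 1
    return pfm

# Illustrative input only: three hexamers, one of them written as RNA.
example_pfm = build_pfm(["AATAAA", "AUUAAA", "AATAAA"])
```

Each column of the resulting matrix sums to the number of input sites, which is a quick sanity check before the matrix is used for a search.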

RNA Motif search
Furthermore, our tool AIModules also allows for the analysis of RNA motifs, so that the user can analyze an RNA sequence of choice. Uracils are internally converted to thymines. After the parameters La and Ld and the matrix have been selected, the result is obtained. La is the actual log-odds ratio of the match, whereas Lm is the maximum possible log-odds ratio for a match, i.e. the information content of the consensus sequence. Ld, on the other hand, is the maximum log-likelihood deficit (Lm - La). Put simply, La acts as a lower bound on the score of a match and Ld as an upper bound on how far a match may fall below the consensus score; a TFBS is valid only if it satisfies both. The score is calculated via position weight matrices (PWMs) (5) (7). For matrices, the user may select JASPAR matrices from a drop-down menu within the web application or insert their own. We did not include the TRANSFAC database as it is commercial; however, older public versions of it can easily be uploaded (8). To illustrate the capabilities for RNA searches, a suitable RNA motif search matrix of poly(A) motif sequences from (9) was generated in AIModules itself and used for the analysis. The result of a poly(A) search is depicted in Figure 2b.
The result can also be downloaded as an Excel table (see Table 1).

Module Search
Moreover, we implemented the module search. For that, the user may insert their own sequences, select the parameters La and Ld, and activate the checkbox for module filtering. The conservation of the TF can be selected by the user via a stepper menu. The user may then either select JASPAR matrices or insert their own matrices as well. Results from module searches are shown in Figure 2c and Figure 2d.

Performance comparison to other tools (TFBS search)
We compared our tool to the web application conTraV3 (10). Whereas the conTraV3 solution was in production state, we tested our tool in a VirtualBox virtual machine with Debian, 4.5 GB of RAM and two CPUs (Intel i7-3520M, 2 x 2.9 GHz).
Although run locally, our tool performed well compared to conTraV3. We analyzed one sequence (AJ223836.

Comparison to commercial products (TFBS and Module search)
To analyze how well our solution performs, we compared our results to those of a commercial tool, Genomatix.
We compared the results of both AIModules and Genomatix (11). Genomatix is a commercially available tool with a one-week free trial period (available tools in the field of promoter analysis are listed in the Discussion section). We analyzed and compared the results from AIModules and Genomatix by looking at an example set of selected genes. The homologous promoters were taken from GenBank and are, in the first example, cathepsins. The resulting modules from AIModules and Genomatix were compared by hand, as the modules from Genomatix were low in number. The results of the plain TFBS search, on the other hand, were handled differently. Due to the high number of findings and the differences in the naming of the TFBSs between the two systems, we chose a semi-automatic approach. Each result set was put into Python arrays, and the Genomatix result sets were copied unchanged into another array. The corresponding arrays from AIModules and Genomatix were then compared for string equality by a Python script (see Supplements.zip). The matches were put in separate arrays with the syntax AIModules_TFBS_name::Genomatix_TFBS_name, where Genomatix_TFBS_name can consist of multiple hits, which are separated by commas. The resulting arrays were then printed to standard output and refined manually in a LibreOffice spreadsheet. We saw that AIModules found more TFBSs than Genomatix and that some motifs are common to both systems. Regarding modules, AIModules found many more (ten-fold) modules than Genomatix. The parameters used for both solutions and the statistics of the found TFBSs and modules are given in Supplementary Tables S1-S8.
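The semi-automatic name comparison described above can be sketched as follows. The actual script is part of Supplements.zip and is not reproduced here; the function name `match_tfbs_names` and the normalisation step (trimming whitespace and ignoring case before testing for equality) are assumptions of this sketch, as are the example names.

```python
def match_tfbs_names(aimodules_names, genomatix_names):
    """Pair TFBS names that are equal after a simple normalisation.

    Hypothetical helper; returns strings of the form
    AIModules_TFBS_name::Genomatix_TFBS_name[,Genomatix_TFBS_name...].
    """
    def normalise(name):
        # Assumption: names are compared case-insensitively.
        return name.strip().upper()

    matches = []
    for ai_name in sorted(set(aimodules_names)):
        hits = sorted({g for g in genomatix_names
                       if normalise(g) == normalise(ai_name)})
        if hits:
            matches.append(f"{ai_name}::{','.join(hits)}")
    return matches

# Illustrative names only, not taken from the result sets of the paper.
pairs = match_tfbs_names(["SP1", "Gata1", "FOXA2"], ["Sp1", "SP1", "GATA1"])
for line in pairs:
    print(line)
```

Names without a counterpart in the other system are simply left out of the list, which mirrors the manual refinement step that follows in a spreadsheet.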
The numbers of TFBSs found by the two methods differ. This is due to differences in the available matrices and in the search parameters, which are La and Ld for AIModules, and 0.75 and Optimized for Genomatix. In the Methods section we discuss how these parameters can be compared and where they differ. Regarding the module search, the differences between the system parameters are similar to the ones for TFBSs. In AIModules the parameters are La, Ld and the activated checkbox for module search, whereas Genomatix uses a threshold for the number of elements and a maximum number of matches.
As these parameters are difficult to compare directly and to normalize to each other, the found matches have only a small overlap. Additionally, some of the TFBSs are specific to the system used.
We show that our tool is the only one that can detect common modules within the analyzed sequences. Moreover, we combine this feature with a TFBS search as well as RNA motif discovery. Our tool allows the user to insert not only their own sequences but also their own matrices.
Each of the motif discovery tools mentioned in Table 2 and Table 3 is the result of excellent work and has its own use cases. For specific uses, however, our tool has been shown to find many more modules than even a commercial product. A detailed comparison of the tools from Table 2 and Table 3 is given in the Supplementary section "An Overview of the tools for TFBS discovery".

Discussion
AIModules allows efficient module searches on more than one sequence and filters the common modules to render them clearly on the website. The application is complemented by TFBS searches and RNA motif discovery. This is possible due to our generic algorithmic approach. Therefore, we believe that this is a powerful package for finding modules in silico. We have prepared the application in a way that allows for extensions without much effort, not only due to the architecture but also due to the chosen free and open-source license (GPLv2). Moreover, the application is provided on our own server so that the user does not have to use complicated scripts or commercial software.
Module analysis in AIModules follows a strict algorithm to find shared modules between the input sequences. A TFBS has to have a match in N input sequences to be valid and hence to be included in the module search. The number N can be defined by the user via a stepper control and defines the conservation of the TF. The strand orientation of TFBSs is relevant in this step. A module consists of two TFBSs with a fixed offset of +/-200 bp. Every pairwise combination of TFBSs is tested for validity (AB, AC, AD, …, BC, BD, ...). A module is valid when it is shared between at least two input sequences. In Genomatix, all sequences are analyzed for known modules independently, i.e. Genomatix does not show common modules. These differences in the module finding process lead to different numbers of found modules. Where AIModules finds all possible modules algorithmically, Genomatix relies on known co-citations. This means that AIModules may overrepresent modules, whereas Genomatix only shows co-citations and may miss modules that are included in AIModules. However, the number of found modules can be refined in AIModules by increasing La, decreasing Ld, using only user-supplied matrices, or a combination of these. The analyzed sequences for cathepsin and IL-10 showed no overlap between AIModules and Genomatix regarding modules. For either system, in silico searches do not make experimental validation obsolete.
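The module filtering described in this paragraph can be sketched in a few lines. This is a simplified illustration and not the front-end code of AIModules: the function name `find_shared_modules`, the input layout, the reading of the +/-200 bp offset as a maximum distance between start positions, and the exclusion of pairs made of the same TFBS and strand are all assumptions of this sketch.

```python
from collections import defaultdict
from itertools import combinations

def find_shared_modules(hits_per_sequence, min_sequences=2, max_offset=200):
    """Return TFBS pairs (modules) shared by at least two input sequences.

    hits_per_sequence: {sequence_id: [(tf_name, strand, position), ...]}
    min_sequences:     the conservation N a single TFBS must reach.
    """
    # Step 1: keep TFBSs (name plus strand) matched in at least N sequences.
    sequences_with_tf = defaultdict(set)
    for seq_id, hits in hits_per_sequence.items():
        for tf, strand, _position in hits:
            sequences_with_tf[(tf, strand)].add(seq_id)
    conserved = {key for key, seqs in sequences_with_tf.items()
                 if len(seqs) >= min_sequences}

    # Step 2: test every pair of conserved TFBSs within each sequence.
    sequences_with_module = defaultdict(set)
    for seq_id, hits in hits_per_sequence.items():
        kept = [h for h in hits if (h[0], h[1]) in conserved]
        for first, second in combinations(kept, 2):
            key_a, key_b = (first[0], first[1]), (second[0], second[1])
            if key_a == key_b:
                continue  # assumption: a module pairs two different TFBSs
            if abs(first[2] - second[2]) <= max_offset:
                sequences_with_module[tuple(sorted((key_a, key_b)))].add(seq_id)

    # Step 3: a module is valid when at least two sequences share it.
    return {module: sorted(seqs)
            for module, seqs in sequences_with_module.items()
            if len(seqs) >= 2}

# Illustrative hits only, not taken from the analyses in the paper.
example_hits = {
    "s1": [("SP1", "+", 10), ("TBP", "+", 100), ("GATA1", "-", 500)],
    "s2": [("SP1", "+", 40), ("TBP", "+", 90), ("GATA1", "-", 900)],
    "s3": [("SP1", "+", 5), ("GATA1", "-", 700)],
}
modules = find_shared_modules(example_hits)
```

In this toy input only the SP1/TBP pair lies within 200 bp in two sequences, so it is the only module reported; raising `min_sequences` to 3 removes TBP in the first step and leaves no module at all.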
The other commercial product, TRANSFAC, is available as a free version after registration. However, its matrices are from 2005, hence outdated, and limited in number (398 matrices). Furthermore, this free version is functionally constrained (12), and the professional version is only available after licensing. AIModules offers 1920 matrices from the JASPAR DB, whereas the public version of TRANSFAC contains only 398. Therefore, we decided for the more up-to-date and sensitive matrices from the JASPAR DB, which also provides a REST API and the JASPAR R/Bioconductor package (3). Moreover, the matrices from TRANSFAC cannot be downloaded but have to be extracted from the website manually, which is time-consuming as well as error-prone. Since the application is open source, TRANSFAC matrices can be added when needed.
Compared to those two most popular databases (Genomatix and TRANSFAC public), AIModules offers the possibility to find many more patterns. In addition, AIModules can search multiple sequences and provides comprehensive visualization and statistical results. Moreover, AIModules allows the user to deselect and select each of the found TFBSs and to assemble TF patterns manually (for more see Supplementary Fig. S2).
At the time this manuscript was assembled, only two products on the market could predict modules. These tools are Genomatix (from Intrexon Bioinformatics Germany GmbH) and TRANSFAC. ModuleMaster (13) is another tool that could predict modules, but we were unable to start the WebStart application on different operating systems. The originating lab could not provide any assistance, as bioinformatics research is no longer part of it. Furthermore, for the end user it is easier to use a website than a Java WebStart application that is, firstly, not up to date and, secondly, shipped without a valid certificate, which may pose a security hazard. None of the other tools in Table 2 and Table 3 has a common module search functionality, but they are the result of excellent work and have their own uses. Moreover, AIModules is not only available as a web application but can also be deployed as an on-site application or locally on a PC or notebook.
Remark to editor: Insert Table 2 here! Remark to editor: Insert Table 3 here! Table 2 and Table 3 are discussed in more detail in Supplementary section "An Overview of the tools for TFBS discovery".
Furthermore, R packages from Bioconductor (14) are also available for TFBS analyses (e.g. TFBSTools (15), RcisTarget (16), enrichTF (17)). These, however, must be wrapped in new code before they can be used on sequences for TFBS identification, in particular to determine conserved TFBSs between different promoter regions and DNA sequences, or to establish a web server and a visualization of the TFBSs found.
Compared to the tool conTraV3, AIModules was considerably faster in our test: for the same sequence and with more matrices to analyze, AIModules needed only seconds, whereas the analysis with conTraV3 was cancelled after one hour without results.
Additionally, we have shown that AIModules is able to detect polyadenylation sites (see Table 1) which were previously described in (9). Our tool AIModules is not only faster but also presents features such as module search and RNA motif discovery, whereas the sequences as well as the matrices can be inserted by the user individually. An overview of tools and features can be found in Table 2 and Table 3.
The resulting picture (see Figure 2 or Supplementary Fig. S5) shows that binding site matches frequently overlap. These matches are filtered beforehand by the backend via the parameters La and Ld, and therefore only strong binding sites and high-scoring matches are shown. However, it has to be considered that these results mean that a TF would bind the binding site in vitro, and this has to be validated experimentally. Even if the TF binds in vitro, this does not necessarily mean that the TF plays a role in gene regulation in vivo.
The other side of the coin is that if a TFBS is not shown in the result for an input sequence, this does not necessarily mean that there is none. It could mean that the match was excluded by a high La or a low Ld. Furthermore, the JASPAR database of matrices, while being more current and more thorough than e.g. the public version of TRANSFAC, is not exhaustive and is enriched and optimized over time. A matrix for the TF may therefore not be available, in which case the TF cannot appear in the result. These restrictions also apply to commercial products.
Predictions should be treated as such. A match means that in vitro the corresponding TF is very likely to bind the TFBS. In vivo there are factors like interactions of the TF with chromatin (conformation) which play a crucial role. Furthermore, the quantity of available TF relative to its TFBS and the quantity of cofactors contribute greatly to TF-TFBS interactions.
The use of machine learning should also be assessed. Since there is an abundance of data, it should be possible to train a model to recognize different TFBSs and modules in promoters. Important work on that has already been done using deep learning (18), but this approach can be improved, for example with machine learning frameworks such as TensorFlow or PyTorch. Moreover, models based on hidden Markov models such as Transcription Factor Flexible Models (TFFMs), which can model positional interdependence within TFBSs and variable-length motifs (19; 20), can be included. The high-quality matrices from JASPAR can also be complemented with matrices from CisBP (21) or UniPROBE (22).
Furthermore, the filtering for modules can be improved in quality by saving co-cited TFBSs (modules) together with their offsets in a fast database and applying these to the found TFBSs.
Since the architecture is also available as a Docker-based solution (and a Docker swarm), it should not be difficult to deploy this system onto a Kubernetes provider with high throughput. In this environment, add-ons should be considered for deployment as microservices or serverless components to increase the capacity for load balancing and failover functionality.
We present AIModules as a stand-alone promoter analysis tool. The user does not have to use complex scripts or commercial tools. Our solution provides a general approach to analyse protein factors binding to DNA as well as RNA. We do not only predict TFBSs on DNA; the tool can also be applied to proteins recognizing RNA motifs and to the analysis of RNA stretches. It is also a means to search for modules (conserved TFBS combinations) and thus fills the gap that arose regarding academic software when Genomatix was commercialized. AIModules is the only free and open source tool that can predict common modules on two or more sequences, and it includes high-quality matrices from the JASPAR 2022 DB.
This database is more recent than e.g. the public TRANSFAC version, which is from the year 2005 (8). Moreover, within the web application we include a functionality to calculate matrices to be used in AIModules, so that any user can compile their own matrices. Additionally, we included a matrix for the search of poly(A)-tails, which can be extended by further RNA motifs. We show that our solution finds more TFBSs and many more modules than publicly available alternatives.
Furthermore, our application is much faster than the solution conTraV3 (a few seconds vs at least 1 hour). When searching for modules, the degree of conservation of TFBSs can be determined by the user so that less conserved modules can be found as well as individually conserved TFBSs. Moreover, found TFBSs can be selected manually to compile custom modules. With our solution we not only provide a free service that can be deployed in a containerized environment for load-balanced, high-availability setups, but also a framework for TF and module searches that can be extended easily due to the chosen design.

Conclusions
AIModules is a versatile software package which identifies transcription factor binding sites (TFBSs) as well as combinations of them and compares their conservation across several genes. The software performs better than current alternatives and is generic, i.e. it can also be applied to look for other motifs in DNA, such as enhancers or silencers, or for regulatory motifs in RNA. The user can also look into conserved motifs of their own choice. We provide both a web server and stand-alone software for installation. We thus offer a particularly flexible and easy-to-use solution for the interested researcher. Most notably, it is completely free, non-commercial and open source, filling a gap for such a free solution in the academic community.
AIModules functionalities allow a number of biological applications, in particular the analysis of promoters, including conserved modules and shared TFBSs in conserved gene families, and the detection of modules, motifs or regulatory elements in mRNA.

Methods
Architecture and parameters: The backend calls tessWms to search for TFBSs with the parameters selected in the front-end. These are La and Ld. La is the log-odds ratio of the match from a PWM, whereas Ld is the maximum log-likelihood deficit, i.e. the difference between the maximum score of a PWM (Lm), which is attained by the consensus sequence, and the log-odds ratio of the match (Lm - La) (5) (7). Each position in a binding site can contribute up to a value of two to the score. Thus, the best possible La corresponds to the consensus sequence, and Ld defines how much worse the La of a TFBS may be compared to Lm.
The parameters for MatInspector in the tool Genomatix, on the other hand, are Optimized and 0.75. A perfect match to the matrix means that the binding site is equal to the consensus sequence and hence gets a score of 1.00. Optimized, in the context of matrix similarity for Genomatix, means that the binding site is valid if the score is greater than 0.75 (23).
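The relation between La, Lm and Ld can be made concrete with a small sketch. It is not the tessWms implementation: the function names, the pseudocount and the uniform background of 0.25 per base are assumptions chosen so that a position contributes at most log2(1/0.25) = 2 to the score, in line with the description above.

```python
import math

BASES = "ACGT"

def pfm_to_pwm(pfm, pseudocount=0.01, background=0.25):
    """Turn counts {base: [count per position]} into log2-odds columns."""
    width = len(pfm["A"])
    pwm = []
    for i in range(width):
        total = sum(pfm[b][i] for b in BASES) + 4 * pseudocount
        pwm.append({b: math.log2((pfm[b][i] + pseudocount) / total / background)
                    for b in BASES})
    return pwm

def max_score(pwm):
    """Lm: the score of the consensus sequence."""
    return sum(max(column.values()) for column in pwm)

def scan(pwm, sequence, min_la, max_ld):
    """Report windows with La >= min_la and Ld = Lm - La <= max_ld."""
    seq = sequence.upper().replace("U", "T")  # RNA input is read as DNA
    width, lm = len(pwm), max_score(pwm)
    hits = []
    for start in range(len(seq) - width + 1):
        site = seq[start:start + width]
        if any(base not in BASES for base in site):
            continue  # skip windows with ambiguous characters
        la = sum(column[base] for column, base in zip(pwm, site))
        ld = lm - la
        if la >= min_la and ld <= max_ld:
            hits.append((start, site, la, ld))
    return hits

# Illustrative matrix whose consensus is TATA (ten counts per position).
toy_pfm = {"A": [0, 10, 0, 10], "C": [0] * 4, "G": [0] * 4, "T": [10, 0, 10, 0]}
toy_pwm = pfm_to_pwm(toy_pfm)
toy_hits = scan(toy_pwm, "GGTATAGG", min_la=6.0, max_ld=1.0)
```

For the toy matrix the consensus window is the only hit and has a deficit Ld of zero; raising `min_la` or lowering `max_ld` makes the filter stricter, which is the behaviour the two thresholds are described as having.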

Implementation:
Our tool AIModules (3) uses position frequency matrices (PFMs), introduced in 1982 (4), to predict TFBSs, either from the JASPAR 2022 database or from user-supplied PFMs. We decided against including TRANSFAC matrices, as these are from 2005, hence old, and the 398 matrices cannot be downloaded but have to be extracted manually from the website one by one from HTML, which is time-consuming and error-prone. The result of the analyses is presented to the user in graphical and file form. Apart from a conservative three-layered architecture (see Figure 1), AIModules is also implemented and prepared as a Docker container based solution (see Supplementary section "Build and Deploy").
The architecture of AIModules consists of three loosely coupled layers. For the view we chose the single page application (SPA) framework AngularJS to reduce calls to the backend. The SPA is served by an Apache web server. The frontend communicates with the backend via JSON; the backend itself is a JAX-RS REST service running on Apache Tomcat 9.0.0.M1 and Java EE. In this layer the executable tessWMS (5; 6; 7) is called to find TFBSs. We modified tessWms to provide a JSON interface, hence communication from tessWMS to the Java backend is done via JSON. The user can select TFBS classes in the frontend. Those are read from the REST backend, which communicates with the third layer (a PostgreSQL 11 DB with the JASPAR 2022 TFBSs) and presents them via specific URLs (see Supplementary section "Build and Deploy").
Furthermore, we offer a completely docker packaged environment (see Supplementary section "Build and Deploy").
Moreover, all mentioned databases and tools of the paper are described in Supplementary section "Development".
Abbreviations
TF: transcription factor; TFBS: transcription factor binding site

Declarations
MA developed the web application in its three-layered form and as a Docker-based solution. MA generated the results. CL provided software expertise and checked the software. MA wrote the manuscript with the assistance of TD. All authors read and approved the final version of the manuscript.

Data availability statement
All data and materials are fully available from the paper and its supplementary materials. The program sources are available via https://github.com/muharrem-aydinli/AIModules.git or https://zenodo.org/badge/latestdoi/363702392. Sources are available and uploaded as soon as the paper is accepted. Further project and software information:

Figures
Figure 1
Architecture of AIModules.
The three layers (frontend, backend, DB) and the flow of information are illustrated.

Figure 2
Transcription factor binding and module searches using AIModules.
Subfigures a-d are only excerpts; for full results visit https://bioinfo-wuerz.de/aimodules/; the black line represents the sequence itself. Above the line you will find the motifs of the (+) strand and below the line are the motifs from the (-) strand. The results can be downloaded from the web application as an Excel file as well; the application is described in detail in the Supplementary section "User Interface"; the sequences can be found in the Supplementary section "Used sequences"; a