RPINaptaBASE: a Database for RNA-Protein Interaction Network Analysis and Aptamer Design

Abstract

Background: RNA-protein interactions play crucial roles in biological processes. Recent progress in characterizing the structural features of RNAs and proteins has created an urgent need for databases that capture the specificity and mechanism of the interactions between a protein and an RNA molecule. Most existing databases focus on RNA or protein macromolecules independently and cannot run integrated queries on the RNA-protein complex. These databases also have a linear query structure. Furthermore, they contain only interacting (positive) samples and no non-interacting (negative) samples.

Results: We developed a Database for RNA-Protein Interaction Network Analysis and Aptamer Design (RPINaptaBASE). RPINaptaBASE uses a nested query approach that enables users to perform nonlinear query analysis. The query engine module covers a wide range of features related to RNA and protein sequences and to the secondary structure elements of these macromolecules, which helps users generate custom datasets, especially for machine learning approaches. In this version, more than 175 features have been calculated and made available to users. The web interface provides download management services that allow users to generate datasets of unique RNA or protein sequences as independent lists. Furthermore, the web service lets users create artificial datasets of positive and negative samples from RNA-protein complexes. To provide negative samples, protein sequences are distinguished by their clans and families so that non-interacting pairs can be generated efficiently.

Conclusion: This database offers a user-friendly platform for studying RNA-protein interactions. It also simplifies the oligonucleotide-aptamer design process that uses machine learning algorithms. RPINaptaBASE is freely available at http://rpinbase.com

NDB [9] contains structural information on oligonucleotides, such as nucleic acid structural conformation, RNA 3D motifs and RNA 3D interactions. It also supports searching with predefined parameters and logical operators. RNA Strand [10] includes general and structural information on RNAs, such as the length of structural motifs, the type of structural elements and the number of non-canonical base pairs per molecule. RNA CoSSMos [11] consists of structures derived from the Protein Data Bank (PDB) [20] and a collection of motifs extracted from the structural elements. It allows searching by general parameters, experimental parameters, structural motifs, and nucleic acid sequences. RNA FRABASE [12] includes RNA structures (derived from the PDB and NDB databases), the dot-bracket format of the RNA sequences, and a comprehensive series of parameters related to structural elements.
URSDB [13] contains structural elements of RNAs derived from the PDB database. It also presents pseudoknots in secondary structures and supports queries on structural features. NPIDB [14] includes information derived from PDB complexes. It provides a user interface and a set of tools for extracting features, such as the calculation of intermolecular interactions, the classification of proteins based on the Structural Classification of Proteins (SCOP) [21] and Protein Families (Pfam) [22] databases, and some information on DNA-protein interactions. RBPDB [15], a database of RNA-binding proteins, consists of experimental data for four species (human, mouse, fly and worm). To capture the similarity of proteins across species, RBPDB also provides their orthology relationships. NPInter [16] includes functional interactions between noncoding RNAs and other biomolecules. It covers the largest number of interaction data on tissues or cell lines, binding sites, conservation, and other factors. RIAN [17] contains noncoding RNA associations in four organisms (human, mouse, rat, and yeast). These are generated from curated samples, experiments, interaction predictions and text mining. It also provides an integrative scoring scheme to determine the reliability of each interaction. RNAct [18] covers predicted and experimentally identified RNA-protein interactions in three organisms (human, mouse, and yeast). Moreover, it supports pairwise search for sets of proteins and RNAs. PRIDB [19] contains information on RNA-protein complexes, including the sequential motifs identified in them. Additionally, three benchmark datasets are available for evaluating machine learning methods. The latest version of this database offers RNA-protein interaction analysis using only sequence information (PRIseq) [23,24], based on Support Vector Machine (SVM) [25] and Random Forest (RF) [26] algorithms.
The majority of these databases focus on RNA or protein macromolecules independently; moreover, the RNA-protein complex databases cannot run integrated queries on the complex. In addition, existing databases have a linear query structure, in which all conditional statements have the same priority. Consequently, the linear query structure is inconvenient for complex requests. A nested query structure, by contrast, supports such requests: the evaluation order of nested queries is analogous to that of parentheses in mathematical expressions, where subqueries are evaluated first. Therefore, one complex query can be broken down into a series of logical subqueries.
These subqueries are created from the characteristics of the primary and secondary structure of the macromolecules and are then aggregated into one nested query, which is sent to the database. Another problem facing most databases in the field of nucleic acid-protein interaction is that they contain only interacting (positive) samples and no validated non-interacting (negative) samples. Therefore, most studies have addressed this problem by using an atomic distance with an arbitrary threshold or by random pairing [24,27-32]. In another study, Cheng et al. [33]

RPINaptaBASE is a repository of all RNA-protein complexes stored in the PDB database. The query executor module of RPINaptaBASE covers a wide range of features related to the primary and secondary structural elements of RNA and protein macromolecules. At the primary structure level, queries can request protein and RNA sequences of different lengths and search for substrings within the stored sequences. Moreover, phylogeny data on the family and clan of protein sequences are available. At the secondary structure level, queries can combine different secondary structural features. Furthermore, the query structure of RPINaptaBASE is designed around the nested object concepts of object-oriented programming [36][37][38] in order to respond to diverse demands related to the study of different aspects of RNA-protein binding. Moreover, the importance of machine learning as a fast-growing approach in this field is undeniable. Therefore, RPINaptaBASE offers an option to create downloadable positive and negative datasets with nonredundant sequences for training and testing classifiers. Non-interacting (negative) samples in this database were generated using protein clans, whereas interacting (positive) samples were generated using atomic distances.

Results
RPINaptaBASE provides a 'Help' section with comprehensive information and a practical example of RPINaptaBASE's web interface usage (Additional file 1: Help).
The homepage has a menu bar that represents four stages: 1-'Start', 2-'Select mode', 3-'Make a query' and 4-'Check the result & download'. 'Start' is the entry point of the database; clicking its start button takes the user to the second stage. On the 'Select mode' page, users can choose one of three types of datasets: 'Complex', 'Protein' and 'RNA'. Figure 1 shows the 'Select mode' page. After one of these modes is chosen, the 'Make a query' page appears.
In the 'Make a query' stage, users create a query that requests a dataset from the database using specific features. These pages contain logical blocks for building specific queries. The blocks are executed according to the rules and precedence of parentheses: all items in parentheses are evaluated separately, and items in nested parentheses are evaluated from the inside out.
By selecting the 'Complex' mode on the 'Select mode' page, users can create positive and negative datasets. Each of them is divided into four parts: 'Query setting', 'General', 'RNA' and 'Protein'. In 'Query setting', users can specify the number of desired samples and select the desired threshold. In the 'General', 'RNA' and 'Protein' parts, users can build their requests from the general information and from the primary and secondary structure of RNA and protein macromolecules. Figure 2 shows the page of the complex mode.
By selecting the 'RNA' mode on the 'select mode' page, users can create RNA datasets. This page is divided into two parts: 'Query setting' and 'RNA'. In Query setting, users can specify the number of output sequences in their RNA samples. In the RNA part, users can create their queries based on the primary and secondary structure of RNA sequences. Figure 3 shows the RNA mode.
By selecting the 'Protein' mode on the 'Select mode' page, users can create protein datasets. This page is divided into two parts: 'Query setting' and 'Protein'. In 'Query setting', users can specify the number of output sequences in their protein samples. In the 'Protein' part, users can build their queries from the protein family and from the primary and secondary structure of protein sequences. Figure 4 shows the protein mode.
By selecting the 'Check result' button in any of the three modes, users reach the 'Check the result & download' page. On this page, users can view the statistics of the created set and download the results in one of two formats (CSV and Text). Figure 5 shows the result page.

Discussion
For the study of RNA-protein interactions, RPINaptaBASE provides a nested query mechanism. It allows users to easily search sequential motifs, collect phylogeny data on protein sequences and collect various types of statistical data related to RNA and protein macromolecules. Furthermore, RPINaptaBASE can be used to create three types of datasets: 'RNA', 'Protein' and 'RNA-protein Complex'. In addition, the negative datasets of RPINaptaBASE are created based on the family and clan of protein sequences.
In most RNA-protein databases [14,15,19], users cannot simultaneously query the primary and secondary structural features of both RNA and protein macromolecules. It is essential for an RNA-protein database to be able to run complex queries that consist of many conditional statements and subqueries. In addition, there are databases [10][11][12][13] that perform primary and secondary feature queries on RNAs but have a linear query mechanism, meaning that all required conditions are evaluated at a single level. Therefore, it is impossible to design a query that uses the results of subqueries, and users have to issue multiple queries to achieve what a single nested query achieves. Consequently, the linear query structure is inconvenient for complex requests. The web interface of RPINaptaBASE, in contrast, allows one to build a nested query over different features of both RNA and protein macromolecules.
Regarding negative samples, PRISeq [23,24] provides benchmark datasets that only focus

This database application simplifies the oligonucleotide-aptamer design process that uses machine learning algorithms. Aptamers are short oligonucleotide sequences with high specificity and affinity that can bind to dedicated targets ranging from small to large [39][40][41][42]. In a future version of RPINaptaBASE, we intend to add sequence and structural information on DNA-protein complexes. In addition, the feature vectors of these macromolecules will be accessible.

Conclusions
RPINaptaBASE provides a user-friendly and effective tool for researchers who need to create accurate, minimal datasets of proteins, RNAs and RNA-protein complexes easily and quickly from the structures of RNA-protein complexes. The extracted features are classified by the primary and secondary structures of protein and RNA sequences and are available in the database. Users can prepare datasets for 'complex', 'protein' and 'RNA' targets that contain specific features for machine learning purposes. Users can also select a negative dataset, which is generated according to the family of protein sequences. RPINaptaBASE is updated regularly from the PDB, which is considered an accepted source for users requiring information on these complexes.

Methods
The data stored in RPINaptaBASE were prepared as follows. First, the contents of PDB structures were extracted, cleaned and converted to a format suitable for database design. Then, the values of the macromolecules' structural features were calculated and added to the database. Finally, positive and negative datasets were constructed.

Data gathering and preprocessing
RPINaptaBASE is a repository of RNA-protein complexes. First, all structures were extracted from the PDB, and the PDB files that contain at least one protein and one RNA sequence were stored as target samples. The analysis was performed on 2258 complexes (as of June 2018). In the preprocessing step, duplicated chains and sequences containing unknown letters were identified and discarded. For example, the complex with the PDB id '3oij' has two protein chains (A, B) and two RNA chains (C, D). Protein chains A and B were identical, as were RNA chains C and D. Consequently, we removed one protein chain and one RNA chain and retained the B-C pair as a positive, non-redundant sample. Table 1 shows the number of raw and preprocessed data used to develop RPINaptaBASE.
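The preprocessing step above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function name `preprocess_chains`, the alphabets and the toy sequences are assumptions made for illustration, and the sketch simply keeps the first chain seen for each distinct sequence, whereas the choice of which duplicate to keep is arbitrary.

```python
from typing import Dict, Set

# Assumed alphabets: the 20 standard amino acids and the four RNA bases.
PROTEIN_ALPHABET: Set[str] = set("ACDEFGHIKLMNPQRSTVWY")
RNA_ALPHABET: Set[str] = set("ACGU")


def preprocess_chains(chains: Dict[str, str], alphabet: Set[str]) -> Dict[str, str]:
    """Drop sequences with unknown letters and keep one chain per distinct sequence."""
    seen: Set[str] = set()
    kept: Dict[str, str] = {}
    for chain_id, sequence in chains.items():
        # Skip empty sequences and sequences containing letters outside the alphabet.
        if not sequence or set(sequence) - alphabet:
            continue
        # Skip a chain whose sequence duplicates one already kept.
        if sequence in seen:
            continue
        seen.add(sequence)
        kept[chain_id] = sequence
    return kept


# Toy complex with two identical protein chains and two identical RNA chains
# (made-up sequences, not the real content of any PDB entry).
protein_chains = {"A": "MKVLA", "B": "MKVLA"}
rna_chains = {"C": "GGACU", "D": "GGACU"}
unique_proteins = preprocess_chains(protein_chains, PROTEIN_ALPHABET)
unique_rnas = preprocess_chains(rna_chains, RNA_ALPHABET)
```

With these toy inputs, one protein chain and one RNA chain remain, which gives a single non-redundant protein-RNA pair for the complex.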
To determine which protein chains directly interact with the RNA chains, the 3D structures of the macromolecules in the complexes were analyzed. Different thresholds have been used to identify interactions between chains of macromolecules [24,28-32]. Here, if the atomic distance between an RNA chain and a protein chain in a PDB file was less than the chosen threshold, the two chains were considered an interacting pair. Thresholds of 3.4 Å, 3.7 Å, 5 Å, 7 Å, and 10 Å are supported, and users can select one of these distances to pick out strongly interacting pairs. Afterward, the secondary structure of the protein sequences, assigned by Define Secondary Structure of Proteins (DSSP) [43], was extracted from the PDB.
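The distance criterion described above can be sketched as follows. This is a minimal illustration rather than the authors' code: it assumes that "atomic distance" means the minimum distance over all protein-RNA atom pairs, and the names `min_distance` and `is_interacting` as well as the toy coordinates are hypothetical.

```python
import math
from typing import Sequence, Tuple

Atom = Tuple[float, float, float]  # x, y, z coordinates in Å

# Thresholds (in Å) listed in the text.
THRESHOLDS = (3.4, 3.7, 5.0, 7.0, 10.0)


def min_distance(chain_a: Sequence[Atom], chain_b: Sequence[Atom]) -> float:
    """Smallest Euclidean distance between any atom of chain_a and any atom of chain_b."""
    return min(math.dist(a, b) for a in chain_a for b in chain_b)


def is_interacting(protein: Sequence[Atom], rna: Sequence[Atom], threshold: float = 3.4) -> bool:
    """Assumed rule: the chains interact if their closest atoms are nearer than the threshold."""
    if threshold not in THRESHOLDS:
        raise ValueError(f"threshold must be one of {THRESHOLDS}")
    return min_distance(protein, rna) < threshold


# Toy coordinates: the closest protein-RNA atom pair is 3.5 Å apart.
protein_atoms = [(0.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
rna_atoms = [(0.0, 0.0, 3.5)]
```

In this toy case the pair is not counted as interacting at 3.4 Å but is at 3.7 Å, which shows how the choice of threshold changes the positive set. The brute-force loop over all atom pairs is chosen for clarity; a spatial index would be preferable for large complexes.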
The Protein Secondary Structure Prediction server (JPred) [44] was used to predict the secondary structure of sequences that did not include any DSSP assignment. In addition, RNAfold from the Vienna package [45] was used to predict the secondary structure of RNA sequences. Finally, the family and clan of protein sequences were extracted from Pfam. The process of storing the sequence information is shown in Figure 6.
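RNAfold reports predicted structures in dot-bracket notation, where matching parentheses mark base pairs and dots mark unpaired bases. The sketch below shows one way such a string could be turned into simple searchable values; it is an illustration under assumptions, not the feature extraction used in RPINaptaBASE, and the function names are hypothetical. Pseudoknot brackets are not handled.

```python
from typing import List, Tuple


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return the (i, j) index pairs of matching parentheses in a dot-bracket string."""
    stack: List[int] = []
    pairs: List[Tuple[int, int]] = []
    for index, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(index)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: unmatched ')'")
            pairs.append((stack.pop(), index))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: unmatched '('")
    return sorted(pairs)


def count_hairpins(structure: str) -> int:
    """Count base pairs that enclose only unpaired bases, i.e. hairpin loops."""
    return sum(
        1
        for i, j in base_pairs(structure)
        if j - i > 1 and set(structure[i + 1:j]) == {"."}
    )


example = "((...))..(((....)))"
pairs = base_pairs(example)
hairpins = count_hairpins(example)
```

For the example string, the parser finds five base pairs and two hairpin loops.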

Structural processing of macromolecules
The main aim of RPINaptaBASE is to provide a suitable resource for constructing datasets based on different features of the RNA or protein macromolecule in a complex. Therefore, the data stored in the previous step were processed to extract all possible features of the primary and secondary structures. Furthermore, the phylogeny information of the macromolecules was extracted from the Pfam database, which makes it possible to create a dataset based on a specific family or clan. At the primary structure level, RPINaptaBASE supports substrings and lengths of RNA and protein sequences. At the secondary structure level, the database contains information on RNA secondary structure elements (Stem, Hairpin, Bulge, Internal loop, and Multi-Loop) and protein secondary structure elements (Alpha Helix, Beta Strand, and Coil). The protein secondary structure assignment of the DSSP algorithm has eight states. These eight letters were translated into three letters for ease of interpretation [46]. The results of the study [47] were used to calculate the number of parallel and antiparallel beta sheets. In addition, general information related to the structural elements was processed using different algorithms [48][49][50], and the results were stored as searchable data in the database. In the current version of RPINaptaBASE, more than 175 features of protein and RNA macromolecules (Additional file 2: Macromolecules features) have been calculated and can be used in queries.
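The translation from eight DSSP states to three can be sketched as below. The text does not spell out the mapping taken from [46], so the sketch assumes one commonly used reduction (H, G, I to helix; E, B to strand; everything else to coil); the names `DSSP_TO_THREE`, `reduce_dssp` and `state_counts` are hypothetical.

```python
from collections import Counter
from typing import Dict

# Assumed reduction; the mapping actually used by the database may differ.
DSSP_TO_THREE: Dict[str, str] = {
    "H": "H", "G": "H", "I": "H",            # helices -> Alpha Helix
    "E": "E", "B": "E",                      # strand and bridge -> Beta Strand
    "T": "C", "S": "C", "-": "C", " ": "C",  # turn, bend, unassigned -> Coil
}


def reduce_dssp(eight_state: str) -> str:
    """Translate an eight-state DSSP string into the three-state alphabet H/E/C."""
    try:
        return "".join(DSSP_TO_THREE[state] for state in eight_state)
    except KeyError as error:
        raise ValueError(f"unknown DSSP state {error.args[0]!r}") from None


def state_counts(three_state: str) -> Dict[str, int]:
    """Count helix, strand and coil positions, a simple per-sequence feature."""
    counts = Counter(three_state)
    return {state: counts.get(state, 0) for state in "HEC"}


three_state = reduce_dssp("HHGGEEB-TS")
counts = state_counts(three_state)
```

Here the ten-residue example reduces to four helix, three strand and three coil positions. Counts like these are the kind of value that could be stored as a searchable feature.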

Generating Negative and Positive datasets
In this study, the family and clan of protein sequences were used to generate non-interacting pairs. To this end, an infrastructure was built to derive interacting RNA-protein pairs from experimentally observed complexes and to construct non-interacting RNA-protein pairs based on the family and clan of protein sequences. The Pfam database contained 9,367 protein families, classified into 604 clans. Interestingly, RNA-protein complexes cover only 111 clans and 620 families in Pfam. Owing to this fact, we introduced the idea of using the family and clan of protein sequences to generate non-interacting pairs (negative samples). We take into account only clans of RNA-binding proteins to ensure that only related features are considered. Moreover, positive and negative samples are discriminated on the basis of specific features of the macromolecules that are capable of forming an RNA-protein complex. Positive samples were composed by combining protein chains with RNA sequences of complexes whose distance was less than the selected threshold (3.4 Å, 3.7 Å, 5 Å, 7 Å, or 10 Å). Negative samples were generated by taking the RNA sequences of a complex and combining them with protein sequences that had not been observed in the same clan as that complex. Consequently, users can construct positive and negative datasets based on the desired family and clan of protein sequences. Figure 7 shows an overview of the formation of positive and negative samples.
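The clan-based construction of negative samples can be sketched as follows. This is one possible reading of the procedure, not the authors' implementation: it assumes that an RNA is paired with every known protein whose clan differs from the clans of all proteins observed to interact with that RNA. The identifiers (`R1`, `P1`, `CL_A`, ...) and the function name `negative_pairs` are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (rna_id, protein_id)


def negative_pairs(positives: List[Pair], clan_of: Dict[str, str]) -> List[Pair]:
    """Pair each RNA with proteins whose clan was never observed with that RNA."""
    # Collect, for every RNA, the clans of the proteins it is known to bind.
    clans_by_rna: Dict[str, Set[str]] = {}
    for rna_id, protein_id in positives:
        clans_by_rna.setdefault(rna_id, set()).add(clan_of[protein_id])

    proteins = sorted({protein_id for _, protein_id in positives})
    negatives: List[Pair] = []
    for rna_id in sorted(clans_by_rna):
        for protein_id in proteins:
            # A protein from a clan not seen with this RNA gives a negative sample.
            if clan_of[protein_id] not in clans_by_rna[rna_id]:
                negatives.append((rna_id, protein_id))
    return negatives


# Toy positive set: three interacting pairs whose proteins fall into two clans.
positives = [("R1", "P1"), ("R2", "P2"), ("R3", "P3")]
clan_of = {"P1": "CL_A", "P2": "CL_A", "P3": "CL_B"}
negatives = negative_pairs(positives, clan_of)
```

In the toy data, R1 and R2 bind proteins of clan CL_A, so each is paired only with P3 from CL_B, while R3 is paired with P1 and P2. Pairing across clans rather than at random avoids labelling a close homolog of a true binding partner as a negative.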

Nested query
In this database, recursive functions are used to evaluate the subqueries of nested structures. Subqueries can be created from the characteristics of the primary and secondary structure of the macromolecules; they are then aggregated and sent to the database. Nested queries are defined in the form of nested objects, in which each query object is the parent of its own child objects. In other words, given the parent's conditions, each child node adds more detail to the total query and refines its parent. This type of query was implemented using recursive logic [51,52]. The nonlinear query structure is helpful for complex queries and for mixed conditional statements on different types of features.
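The recursive evaluation of nested query objects can be illustrated with a short Python sketch. The database itself is implemented in ASP.NET, so this is a stand-in written under assumptions, not the production code: the classes `Condition` and `Group`, the operator set and the feature names (`rna_length`, `protein_clan`, `rna_seq`) are all hypothetical.

```python
import operator
from dataclasses import dataclass, field
from typing import Dict, List, Union

Record = Dict[str, object]

# Hypothetical comparison operators; the real interface may offer others.
OPS = {
    "==": operator.eq,
    ">=": operator.ge,
    "<=": operator.le,
    "contains": lambda value, needle: needle in value,
}


@dataclass
class Condition:
    """Leaf node: one test on one feature."""
    feature: str
    op: str
    value: object


@dataclass
class Group:
    """Parent node: combines its children with AND or OR, like a pair of parentheses."""
    logic: str
    children: List[Union["Group", Condition]] = field(default_factory=list)


def evaluate(node: Union[Group, Condition], record: Record) -> bool:
    """Evaluate a nested query against one record; inner groups are resolved first."""
    if isinstance(node, Condition):
        return OPS[node.op](record[node.feature], node.value)
    if node.logic not in ("AND", "OR"):
        raise ValueError(f"unknown logic {node.logic!r}")
    results = (evaluate(child, record) for child in node.children)
    return all(results) if node.logic == "AND" else any(results)


# rna_length >= 20 AND (protein_clan == "CL_X" OR rna_seq contains "GGAC")
query = Group("AND", [
    Condition("rna_length", ">=", 20),
    Group("OR", [
        Condition("protein_clan", "==", "CL_X"),
        Condition("rna_seq", "contains", "GGAC"),
    ]),
])
record = {"rna_length": 25, "protein_clan": "CL_Y", "rna_seq": "AAGGACUU"}
matches = evaluate(query, record)
```

The example record satisfies the query because the inner OR group is true through the motif match, even though the clan test fails. A linear query, in which all conditions sit at one level with equal priority, could not express this grouping in a single request.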

Implementation
The database-first approach was chosen, and the data models were created from tables. The RPINaptaBASE application was developed as a standard three-layer architecture. The first layer is the data access layer, which was implemented using Microsoft SQL Server and is responsible for storing and retrieving data. The second layer is the business layer. It was developed using Active Server Pages (ASP.NET), runs on Internet Information Services (IIS), and is based on the Model-View-Controller (MVC) pattern and Entity Framework. All queries are sent from the business layer to the data access layer, which responds to each query. The third layer is the presentation layer. Every query issued in the presentation layer is received by the business layer via web services, validated and forwarded to the data access layer. The results returned by the database are converted to JavaScript Object Notation (JSON) format and sent to the presentation layer. The presentation layer (web interface) was created with HyperText Markup Language (HTML) and AngularJS, which provides dynamic script execution in the clients' browsers.

Figure: The process of storing macromolecules; contains external data, processes and conditions.
Figure: The RPINaptaBASE architecture.