2.1. Construction of the Putative Exon-Intron Junction and Putative Intron-Exon Junction Database from the Ensembl Core Database
Typically, three classes of mass spectra (or RNA-Seq reads) represent possible intron retention events (Figure 1): (A) a mass spectrum (or an RNA-Seq read) that maps to an exon-intron junction; (B) a mass spectrum (or an RNA-Seq read) that maps to an intron-exon junction; (C) a mass spectrum (or an RNA-Seq read) that maps to an intron only (intron internal).
The Ensembl Core Database (homo_sapiens_core_89_38h) was downloaded from the Ensembl website (http://www.ensembl.org) into a local MySQL database. The putative exon-intron and putative intron-exon junction database (PEIJ_PIEJ DB) is constructed to cover all three classes of mass spectra (or RNA-Seq reads) that represent possible intron retention events.
The PEIJ_PIEJ DB is a protein database, constructed as follows:
(1) For Class A intron retention events, each entry is a 50-amino-acid (50 AA) sequence: the 25 amino acids translated from the end of an exon, joined to the 25 amino acids translated from the beginning of the following intron.
(2) For Class B intron retention events, each entry is a 50 AA sequence: the 25 amino acids translated from the end of an intron, joined to the 25 amino acids translated from the beginning of the following exon.
(3) For Class C intron retention events, the strand is known in advance, so a three-frame translation, analogous to six-frame translation [45] but restricted to the annotated strand, is adopted to exhaustively account for all possible in-frame [4], frame-shift [21] and partial intron-internal retention events [29, 46-48].
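The three entry types can be illustrated with a minimal Python sketch. It is not the authors' Perl implementation: the function names are hypothetical, the input sequences are assumed to be given 5'->3' on the transcript strand and to start on a codon boundary (exon phase is ignored here), and stop codons are simply reported as `*` because the text does not say how they are handled.

```python
from itertools import product

# Standard genetic code, laid out in the usual TCAG order.
_BASES = "TCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(_BASES, repeat=3), _AMINO)
}


def translate(dna, frame=0):
    """Translate complete codons of `dna`, starting at offset `frame` (0, 1 or 2).

    Codons containing characters other than A/C/G/T are reported as 'X'.
    """
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(frame, len(dna) - 2, 3)
    )


def junction_entry(upstream, downstream, half=25):
    """Class A/B style entry: `half` residues from the end of `upstream`
    joined to `half` residues from the start of `downstream`.

    Assumption: both sequences are in frame (phase 0) at the junction.
    """
    left = translate(upstream[-3 * half:])
    right = translate(downstream[:3 * half])
    return left + right


def three_frame_translation(intron):
    """Class C style entries: the intron translated in all three forward frames."""
    return [translate(intron, frame) for frame in range(3)]
```

For a Class A entry, `upstream` would be the exon and `downstream` the following intron; for Class B the roles are swapped. Only three frames are needed for Class C because the strand of the intron is already known from the annotation.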
The choice of 50 amino acids per entry for Class A and Class B intron retention events follows the rationale given for the PEEJ DB [2]. Briefly, a typical ion-trap mass spectrometer detects peptides with molecular weights from 500 to 3000 daltons. A peptide of 25 amino acids has a molecular weight of about 3000 daltons, which covers the upper end of the MS detection range [2].
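As a rough check of the 25-residue figure, the peptide mass can be estimated from the number of residues. The average residue mass of about 110 Da used here is a common rule of thumb and an assumption of this sketch, not a value taken from [2]:

```latex
\[
m_{\text{peptide}} \approx n \cdot \bar{m}_{\text{residue}} + m_{\mathrm{H_2O}}
\approx 25 \times 110\ \text{Da} + 18\ \text{Da}
\approx 2.8\ \text{kDa}
\]
```

This estimate is close to the 3000-dalton upper limit, so 25 residues on each side of the junction is about the longest flank a single detectable peptide could cover.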
As with the PEEJ DB [2] for exon skipping events, the phase and strand information of all exons in the human genome, as annotated in the Ensembl database, was taken into account in the construction of the PEIJ_PIEJ DB.
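To show why phase and strand matter, the following Python sketch builds an in-frame nucleotide window across an exon-intron junction. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, strand is +1 or -1 as in Ensembl, and `end_phase` is taken to be the number of bases of an incomplete codon at the end of the exon (0, 1 or 2). Exons whose end phase is not one of these values are skipped here, on the assumption that they mark non-coding boundaries.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def to_transcript_strand(genomic_seq, strand):
    """Return the sequence read 5'->3' along the transcript.

    Minus-strand features are reverse-complemented; plus-strand ones are unchanged.
    """
    if strand == 1:
        return genomic_seq
    if strand == -1:
        return genomic_seq.translate(_COMPLEMENT)[::-1]
    raise ValueError("strand must be 1 or -1")


def exon_intron_window(exon_seq, intron_seq, end_phase, n_codons=25):
    """In-frame nucleotide window across an exon-intron junction.

    The window starts on a codon boundary inside the exon and is exactly
    2 * n_codons codons long. If end_phase > 0, one codon straddles the
    junction. Returns None when the phase is not 0, 1 or 2, or when either
    sequence is too short.
    """
    if end_phase not in (0, 1, 2):
        return None
    left_len = 3 * n_codons + end_phase   # full exon codons plus the partial codon
    right_len = 3 * n_codons - end_phase  # intron bases that complete the window
    if len(exon_seq) < left_len or len(intron_seq) < right_len:
        return None
    return exon_seq[-left_len:] + intron_seq[:right_len]
```

Translating the returned window in frame 0 yields a 50-residue sequence whose first 25 residues come entirely from the exon. The intron-exon case would be handled symmetrically using the start phase of the downstream exon.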
Table 1 depicts the database structure of the putative exon-intron and putative intron-exon junction database (PEIJ_PIEJ DB).
Table 1. Database Structure of PEIJ_PIEJ DB
| Database Schema | Description |
|---|---|
| Entries for Class A intron retention events | 25 amino acids translated from the end of an exon, joined to 25 amino acids translated from the beginning of the following intron |
| Entries for Class B intron retention events | 25 amino acids translated from the end of an intron, joined to 25 amino acids translated from the beginning of the following exon |
| Entries for Class C intron retention events | Three-frame translation of the intron sequences to account for all possible in-frame, frame-shift and partial intron-internal retention events |
The database is implemented with Perl, BioPerl, MySQL and the Ensembl API. The result is a FASTA file with 4,280,722 entries, in which the gene symbol, chromosome, strand, exon symbol, exon phase information and junction position are clearly defined for each entry. The Perl source code that builds the database is freely available at https://sourceforge.net/projects/peij-piej. It should be noted that, although the database is based on annotations from Ensembl (homo_sapiens_core_89_38h), users can easily modify the script to rebuild the database when a new Ensembl build and new annotations become available, so that newly annotated intron retention events are included.
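The exact header layout of the FASTA file is not given in the text. The Python sketch below shows one way the listed annotations could be carried in a FASTA header. The `key=value` layout, the entry identifier and the example values are all hypothetical, chosen only for illustration, and do not describe the actual PEIJ_PIEJ DB file.

```python
def fasta_record(entry_id, sequence, line_width=60, **annotations):
    """Format one FASTA record with annotations as key=value pairs in the header.

    The header layout is an assumption of this sketch, not the PEIJ_PIEJ DB format.
    """
    fields = " ".join(f"{key}={value}" for key, value in annotations.items())
    header = f">{entry_id} {fields}".rstrip()
    # Wrap the sequence at a fixed width, as is conventional for FASTA files.
    body = "\n".join(
        sequence[i:i + line_width] for i in range(0, len(sequence), line_width)
    )
    return f"{header}\n{body}\n"


# Hypothetical Class A entry with placeholder annotation values.
record = fasta_record(
    "ENTRY_000001",
    "M" * 25 + "V" * 25,
    gene="GENE1",
    chr="1",
    strand=1,
    exon="EXON1",
    end_phase=0,
    junction=25,
    cls="A",
)
```

Keeping the junction position in the header means a downstream script can decide whether an identified peptide actually crosses the junction without querying the Ensembl database again.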
2.2. Utility
Like the putative exon-exon junction database (PEEJ DB) [2, 3], which is specific to exon skipping events, the PEIJ_PIEJ DB can be searched for known and novel intron retention events in mass spectrometry data using the open-source protein identification program X!Tandem [5] or TurboSEQUEST [6]. For a workflow with mass spectrometry data, users can follow Mo et al. (2008) and simply replace the PEEJ DB with the PEIJ_PIEJ DB constructed in this study. In addition, when RNA-Seq data are available, researchers can align the RNA-Seq reads to the PEIJ_PIEJ DB to identify known or novel intron retention events; for a workflow with RNA-Seq data, users can refer to [7], again replacing the PEEJ DB with the newly constructed PEIJ_PIEJ DB.
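When interpreting search results against Class A or Class B entries, one natural post-processing step is to check that an identified peptide actually crosses the junction rather than lying entirely within the exon or the intron half. The text does not describe this step, so the Python sketch below is an assumption of this illustration. The function name, the 25-residue junction position and the `min_flank` parameter are hypothetical.

```python
def spans_junction(entry, peptide, junction=25, min_flank=1):
    """Return True if `peptide` occurs in `entry` and covers at least
    `min_flank` residues on each side of the junction.

    `junction` is the number of residues before the junction in the entry.
    """
    start = entry.find(peptide)
    while start != -1:
        end = start + len(peptide)  # exclusive end of this occurrence
        if start <= junction - min_flank and end >= junction + min_flank:
            return True
        # Look for further occurrences, since peptides may match more than once.
        start = entry.find(peptide, start + 1)
    return False


# Toy 50-residue entry: 25 exon-derived residues followed by 25 intron-derived ones.
entry = "M" * 25 + "V" * 25
crossing = spans_junction(entry, "MMMVVV")   # covers both sides of the junction
exon_only = spans_junction(entry, "MMMM")    # lies entirely in the exon half
```

Raising `min_flank` makes the check stricter, since a peptide with only a single residue on one side of the junction gives weak evidence for that side.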