1) Physicochemical profiles at the exon-intron boundaries
Figure 1A-1E, Figure S1, and Figure S2 (Supplementary File 4) depict the numerical profiles of all 28 parameters, including the nine Backbone angle parameters, six Inter-BP parameters, six Intra-BP parameters, four BP-axis parameters, and three energy parameters. These profiles were generated for coding sequences (CDS), exon-start (ES), and exon-end (EE) sequences in H. sapiens. The results of this study indicate that the physicochemical profiles linked to each parameter in the five main categories for both ES and EE sequences display unique patterns, which differ significantly from those observed in CDSs. The results demonstrate that while the structural and energy properties of the CDS remain
relatively constant throughout the sequences, a distinct shift occurs in the biophysical profiles at the exon-intron boundary within sequences harboring them. The structural trends observed at the exon-intron sites in Fig. 1A-1D, for each parameter, emphasize the presence of a transient thermodynamically unstable boundary. This boundary is crucial for facilitating classical splicing events, with exons demonstrating higher stability than neighboring intronic sites51,52. Additionally, the energy plots in Fig. 1E, at the exon-intron junctions within these DNA sequences, support the classical hypothesis that boundary elements play a crucial role in secondary structure formation in RNA, thereby facilitating splicing. The Hydrogen bond energy exhibited a rapid rise followed by a drop, implying an initial instability at the boundary position that gradually balances out as the junction site progresses. In contrast, the Stacking energy reached its maximum value at the border junction, leading to an increased flexibility in the DNA by reducing its stiffness. The observed decreased Solvation energy could indicate the transiently formed stable structure at the interfaces between exons and introns.
Moving further with the idea that the combined effect of smaller features brings about a concerted change, we combined the individual structural parameters belonging to the respective major categories to provide us with the actual Backbone, Inter-BP, Intra-BP, and BP-axis profiles. The synergistic visualization of these categories and the three energy parameters at the exon junctions are available in Fig. 1F. These results provide us with the evident change at the boundaries for the seven structural and energy parameters. The trend, initially widespread over a region of 50–100 length within the individual parameters, is now contained uniformly within a region of ~ 50 for all the categories. The shaded region within the combined plots shows the site undergoing major structural and energy changes. Together, these individual and combined profiles offer valuable insights into the potential utility of the combined parameters for effective exon-intron boundary identification within any given gene sequence.
2) Correlation analyses and feature importance
A correlation analysis was conducted to examine the interrelationships among the seven final parameters within the 50-nucleotide regions of both the ES and EE profiles in humans. The primary objective of these analyses was to assess the degree of correlation between parameters and identify any redundancy that may exist. Figure 2 shows the correlation results. Different pairs of parameters exhibited varying degrees of correlation, ranging from moderate to high. Some parameters were dependent on each other, while others showed no correlation. Furthermore, an examination of feature importance through principal component analysis (PCA) was performed to retain the significant features without compromising on information for the downstream analysis. As summarized in Fig. 2 and Methodology S1 (Supplementary File 2), these results emphasize the significance of using all seven parameters. The methodology, as outlined in Fig. 3, was thus followed, leading to the development of the novel physicochemical property-based exon-intron boundary prediction method, ChemEXIN.
Figure 2. (A) Correlation matrices depicting the relationships among the seven final parameters at exon-start and exon-end. (B) Feature importance analysis conducted through PCA, revealing significant contributions of all seven parameters.
3) Performance evaluation
To arrive at an optimal exon-intron boundary prediction method, various Machine Learning (ML)/DL models were deployed during the initial development phases of ChemEXIN. The performance of these models underwent comparison using both the training-testing split dataset and the evaluation dataset in humans. Model assessment and comparison were conducted using five key criteria: sensitivity, specificity, F1-score, precision, and accuracy. The conclusive results of the training-testing are presented in Table S1 (Supplementary File 3). These results indicate that all the models exhibit the capability to predict exon-intron boundary sites, with accuracy levels and F1-scores spanning from 54% when utilizing basic models to an improved performance of ~ 80% when employing DL model on the test set. On the same dataset, it is worth noticing that parameters such as specificity, which examines the model’s ability to accurately detect true negatives, and sensitivity (true positive rate or recall), which evaluates the system’s proficiency in predicting true positives within each category or class, demonstrated notably strong performance values.
The findings from the H. Sapiens evaluation set comprising 60,000 held-out sequences further validate the efficacy of utilizing biophysical parameters for accurately identifying boundary regions (Table S1, Supplementary File 3). Further, the area under the receiver operating characteristic curve (AUROC) scores on the evaluation dataset presented in Fig. 4 show that the three-dimensional CNN (3D-CNN) and Support Vector Machine (SVM) classifiers surpass other models in predictive performance across all three classes. Notably, the 3D-CNN exhibits higher area under the curve (AUC) values across all classes, signifying its efficacy in distinguishing between diverse classes. Following the comparison results of the above models and the three-dimensional nature of our datasets, we decided to proceed with the 3D-CNN53 trained model for subsequent analysis and its independent implementation in all the three organisms under study. The architecture of the 3D-CNN model employed here is detailed in Fig. 5 and Methodology S2 (Supplementary File 2).
4) Comparison with the state-of-the-art tools
Five widely used gene structure organization prediction tools —Spliceator, Fgenesh, geneid, Genscan, and Augustus, were benchmarked against each of our three organism-specific trained models. The results are presented in Fig. 6 and Tables S3, S4, and S5 (Supplementary File 3), with details on the outputs available in Methodology S3 (Supplementary File 2). To ensure an unbiased comparison of our approach, we used three benchmarking datasets, each comprising 2,000 randomly selected sequences from the respective organism. The majority of these tools are available as web servers, which tend to crash on large input sequences and/or require input sequences in batches. Henceforth, this reasonable-sized comparison data ensured the efficient working of all the tools.
Spliceator, available as a web server23, employs CNN in conjunction with a user-defined reliability parameter and a sequence search window to predict the gene organization within input sequences. Instead of treating individual input sequences separately, it processes them as a unified input string with a maximum length of approximately 200,500 bases. To adhere to this constraint, we divided the input sequences for ES and EE for organisms under study into two batches. We employed a default reliability parameter score of 98% and a model tailored to a 400-length search window (as our individual input sequences are 401 nucleotides long) to predict donor and acceptor sites. The output files obtained for each batch were combined into their respective categories and processed to provide a final confusion matrix. From the results, it is evident that Spliceator results are less than satisfactory for all three organisms. The observed high level of misclassification is primarily attributed to Spliceator's consensus-based approach to identifying donor and acceptor sites, resulting in an over-representation of these sites in the predictions. This over-representation tends to increase with a decrease in the reliability parameter score due to non-specific pattern matching. Moreover, there is no noticeable improvement at a 100% reliability parameter score, suggesting its high sequence specificity.
Fgenesh is available both as a web server and as a local downloadable version. Due to the requirement of several genomic feature files in processing the downloadable version, we tested our sequence with the online web server20. The method accepts a single file as input, with each sequence represented in FASTA format. In addition to a Hidden Markov model (HMM) based gene prediction model trained over several eukaryotic species, though not within the scope of our research, it provides numerous user-specific advanced options. Operating it with organism-specific default parameters, the method generated output files, which were then processed into a confusion matrix. Similar to Spliceator, Fgenesh yielded comparable results for C. elegans and M. musculus. However, for humans, the precision and accuracy notably improved, reaching close to 40%. Although there is a potential for improved outcomes by utilizing targeted training with specific feature files in the downloadable version (Fgenesh++), we opted not to pursue these advanced options.
geneid, another tool in our evaluation, employs position weight arrays, scoring, and Markov models to identify gene features in DNA sequences. Although available as a web server and a GitHub repository22, we faced challenges with the online version, prompting us to resort to the local version downloaded from GitHub22. Despite processing input sequences in a manner similar to Fgenesh, the processed results more closely resemble those of Spliceator, exhibiting a high misclassification rate ranging from 80–90% for the organisms under consideration. The misclassification observed can be attributed to the overrepresentation of the ES and EE sites. Regardless of being trained on multiple species from all four eukaryotic kingdoms, geneid did not yield satisfactory results in our study.
Continuing our benchmarking efforts, we evaluated Genscan, a widely used tool for identifying exon-intron structures in genomic sequences. Genscan16 employs general probabilistic models to annotate gene features within input sequences. While Genscan can process nearly one million bases, our dataset, comprising approximately 0.8 million bases, posed a challenge to its processing capabilities. Hence, following a similar strategy employed with Spliceator, we partitioned the input sequences into two batches for both the ES and EE. Regrettably, akin to the outcomes observed with the other tools, the results were not encouraging.
Transitioning to our last tool, Augustus, our objective was to evaluate its proficiency in predicting gene structures. Beyond being accessible as a straightforward pre-trained web server and a GitHub repository, Augustus offers an improved web server option. This server allows training sequences not listed in their database using annotation files containing information for complementary DNA (cDNA) sequences and/or hints for donor and acceptor sites (hint files). We utilized the pre-trained web server19 and prepared a basic hint file (GTF) for input sequences (FASTA), adhering to the required format. Using a generalized HMM with an additional probabilistic model for gene structure prediction, Augustus also provides information on alternate SSs. Augustus exhibited relatively favorable performance compared to other tools, achieving a specificity of approximately 85%, notably attributed to the utilization of a hint file. In the case of humans, the misclassification rate decreased significantly to approximately 45%. However, while a similar trend was observed for other organisms, the outcomes were less favorable.
In a similar manner to the above-reported comparisons, we examined how well our models performed by looking at their predictions on the benchmarking datasets. This evaluation was essential for understanding how accurately our approach could predict exon-intron boundaries. Our analysis unequivocally demonstrates that our approach clearly outperforms other tools (Fig. 6) across all major evaluation criteria in all three organisms. The results indicate a notably low misclassification rate, ranging approximately from 0.075 to 0.20, and high precision, ranging from approximately 0.796 to 0.92. These findings indicate the reliability and accuracy of the predictions obtained through our technique. This exhaustive comparison underscores the presence of substantial sequence alternatives. However, despite these variations, the biophysical profiles at the junction sites remain largely conserved. This conservation suggests the potential utility of these profiles in facilitating precise recognition and prediction by our physicochemical property-driven 3D-CNN models.
Expanding the scope of our comparison, we further assessed the performance of the reported method alongside two top-performing tools identified in the previous benchmarking step, namely Fgenesh and Augustus. This extended comparison focused on predicting exon-intron junctions in non-protein coding genes, including long-non-coding RNA (lncRNA) genes; transfer RNA (tRNA) genes; and ribosomal RNA (rRNA) genes in humans.
Despite its widespread usage, Fgenesh failed to generate results in our comparative assessment. Unlike Augustus, while our method has not undergone specific training on the exon-intron characteristics of these genes, the results documented in Table 1 and Fig. 7 underscore a notable performance of our framework against Augustus, indicating its adaptability and efficacy even in contexts beyond the specialized training domain. This underscores the robustness and versatility of our approach, particularly in addressing gene prediction tasks across varied genomic contexts.
Leveraging biophysical parameters and the DL method, our approach exhibited superior performance compared to existing gene annotation tools across the three organisms. Moving forward, we developed ChemEXIN, a consolidated prediction framework combining the three organism-specific models. This approach holds significant potential for enhancing the efficiency of exon-intron boundary annotation.
5) Exon-intron boundary prediction through ChemEXIN
ChemEXIN, available as an open-source tool, can be downloaded and used within a conda environment, offering an accessible platform for researchers. After the initial setup of the virtual environment through
cloning, users can activate and run ChemEXIN using a Python 3 interpreter via a command prompt. This process involves providing essential inputs: a file containing the gene sequence of interest, the associated
organism, and a threshold value that defines the probability at which prediction windows are refined. Upon receiving these inputs, ChemEXIN performs its analysis and delivers the prediction results in a comma-delimited file. For detailed instructions on setting up and using ChemEXIN, researchers can refer to the user manual (Supplementary File 5).
To assess the performance of ChemEXIN, we tested it on random gene sequences of varying lengths from the studied organisms, using a default probability score of 0.75. The specific outcomes of this analysis are cataloged in Table 2. Additionally, to assess ChemEXIN's compatibility across different computing environments, we executed predictions on the same gene set but on systems with various configurations. The results of this compatibility assessment are detailed in Table S6 (Supplementary File 3). Collectively, these results demonstrate that ChemEXIN is highly efficient in processing sequences of diverse lengths, a feat it accomplishes using minimal computational resources and without depending on the Operating system (OS). A detailed examination of our prediction outcomes, particularly with human and mouse gene sequences, reveals that a significant number of boundary sites are predicted with remarkable accuracy. Even in many instances where predictions deviate, they do so by a margin of only five to ten nucleotides from the established boundary windows. Notably, Despite the promising results from training-testing and benchmarking analyses, the predictions made by the C. elegans model were not reliable and, hence, not reported. This irregularity could be attributed to inadequate training from imbalanced positive and negative datasets. Enhancing the accuracy for C. elegans thus requires refining the filters and extending the training process. These improvements will yield better results and broaden the applicability of ChemEXIN across various eukaryotes, paving the way for future advancements.
Table 2
Performance evaluation of ChemEXIN on random genes from H. sapiens and M. musculus.
Organism | Gene | Length (nt) | Predicted Sites | Average time (sec) |
H. sapiens | DMD | 2,220,382 | 47 | 221.77 |
BDNF | 188,307 | 28 | 24.84 |
NEU1 | 10,881 | 8 | 8.24 |
M. musculus | RP1 | 409,685 | 26 | 41.30 |
CDK6 | 189,524 | 9 | 22.46 |
SCAF8 | 83,888 | 15 | 12.78 |