PySML and other similar tools
Numerous tools have been implemented for producing concept and entity SS scores. Most of them are context dependent, restricted to GO with a specific fixed version, and implement only SS measures shown to perform well in a specific application 44. We refer the interested reader to 38, where these tools are described in terms of the SS measures and input size that each tool supports. PySML is implemented in Python, one of the most productive programming languages, with many built-in libraries and fast prototyping capabilities that make it easy to learn. In addition, Python code moves easily between computers, which makes PySML portable and an effective framework for developing, assessing and testing existing and novel SS measures. As such, PySML provides an interface for developers to easily include source code implementing their own SS models. The library facilitates the retrieval of scores for any existing or new SS measure regardless of the ontology, and also provides 7 non-ontology-based SS measures. These non-ontology SS measures apply to other types of data structures that can be translated into Boolean vector profiles, e.g., clinical records, gene expression profiles, population- or individual-based single nucleotide polymorphism profiles and protein-biological pathway profiles.
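As an illustration of a non-ontology measure on Boolean vector profiles, the following is a minimal sketch of the Jaccard index. The function name `jaccard_boolean` and the example profiles are hypothetical and are not taken from the PySML source; the choice to return 0.0 for two empty profiles is an assumption made here.

```python
from typing import Sequence


def jaccard_boolean(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Jaccard index between two equal-length Boolean profiles."""
    if len(a) != len(b):
        raise ValueError("profiles must have the same length")
    # Positions set in both profiles, and positions set in at least one.
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumption: two all-False profiles share nothing, so score them 0.0.
    return both / either if either else 0.0


# Hypothetical profiles: presence/absence of five features (e.g. pathways).
p1 = [True, True, False, True, False]
p2 = [True, False, False, True, True]
print(jaccard_boolean(p1, p2))  # 2 shared out of 4 present: 0.5
```

Any record that can be encoded as presence or absence of a fixed list of features can be compared this way, which is what makes such measures independent of an ontology.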
Among existing tools dedicated to computing SS scores, the SML library 52 is the most similar to PySML; SML is implemented in Java, whereas PySML is implemented in Python. Both languages are widely used and available for most operating systems. Java, as a compiled language, offers better running time than interpreted languages such as Python. However, we consider that the benefits of Python outweigh the running-time advantage of Java, and Python has become one of the most popular programming languages, if not the most popular. This popularity is due to several factors, including high productivity compared to other programming languages such as C, C++ and Java, simple syntax, code readability and English-like commands. Python is therefore advantageous in terms of learning, given its many built-in libraries, and is well suited to machine learning and artificial intelligence tasks, where its use continues to expand. This suggests that the PySML library is likely to be readily adopted.
Assessing SS score integrity
We closely analyzed PySML scores in comparison to those produced by SML. For both tools, the BMA scores are higher than the SimGIC (Jaccard index) scores in almost all cases (Fig. 3a), as expected from their mathematical expressions. Comparing the two tools, SML produces higher scores than PySML (Fig. 3b). The higher scores produced by SML are mainly caused by the contribution of the ontology root to the SS computation: proteins sharing only the ontology root receive a score greater than 0. This suggests that SML overestimates SS scores by treating the root as an informative ontology concept, which ultimately biases these scores 38 and thus negatively impacts the performance of these SS models 38,54.
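The effect of the root can be made concrete with a small sketch. The toy ontology, the IC values, and all function names below are hypothetical and do not reproduce the PySML or SML code; SimGIC is written as the IC-weighted Jaccard index over ancestor-closed annotation sets, and BMA uses Lin's term similarity in one common formulation (the mean of the two averaged best-match directions).

```python
from typing import Dict, Set

# Hypothetical toy ontology: each term maps to its ancestors, itself included.
ANCESTORS: Dict[str, Set[str]] = {
    "root": {"root"},
    "A": {"A", "root"},
    "B": {"B", "root"},
    "A1": {"A1", "A", "root"},
}


def closure(terms: Set[str]) -> Set[str]:
    """Extend an annotation set with all ancestors of its terms."""
    out: Set[str] = set()
    for t in terms:
        out |= ANCESTORS[t]
    return out


def simgic(p: Set[str], q: Set[str], ic: Dict[str, float]) -> float:
    """IC-weighted Jaccard index over ancestor-closed annotation sets."""
    cp, cq = closure(p), closure(q)
    union = sum(ic[t] for t in cp | cq)
    shared = sum(ic[t] for t in cp & cq)
    return shared / union if union else 0.0


def lin(a: str, b: str, ic: Dict[str, float]) -> float:
    """Lin term similarity based on the most informative common ancestor."""
    mica = max(ic[t] for t in ANCESTORS[a] & ANCESTORS[b])
    denom = ic[a] + ic[b]
    return 2.0 * mica / denom if denom else 0.0


def bma(p: Set[str], q: Set[str], ic: Dict[str, float]) -> float:
    """Best-match average: mean of the two averaged best-match directions."""
    best_p = [max(lin(a, b, ic) for b in q) for a in p]
    best_q = [max(lin(a, b, ic) for a in p) for b in q]
    return (sum(best_p) / len(best_p) + sum(best_q) / len(best_q)) / 2.0


# Hypothetical term-IC maps: root uninformative vs. root given a positive IC.
ic_zero_root = {"root": 0.0, "A": 1.0, "B": 1.0, "A1": 2.0}
ic_pos_root = {**ic_zero_root, "root": 0.5}

# Two proteins whose annotations share only the root.
print(simgic({"A1"}, {"B"}, ic_zero_root))  # 0.0
print(simgic({"A1"}, {"B"}, ic_pos_root))   # greater than 0

# Two proteins sharing an informative term: compare BMA and SimGIC.
print(bma({"A1"}, {"A"}, ic_zero_root), simgic({"A1"}, {"A"}, ic_zero_root))
```

With the root treated as uninformative (IC of 0), the pair sharing only the root scores exactly 0; as soon as the root carries a positive weight the same pair receives a non-zero score, which is the kind of inflation described above. How SML actually weights the root is not shown here.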
Retrieving, reproducing and reusing SS scores for any ontology in any application is still challenging 39,44,54. This is mainly due to the lack of a tool that exhaustively implements existing SS models and related assumptions to produce consistent scores on demand and in real time for use in related applications and for testing hypotheses. PySML bridges this gap, providing an effective framework for developing, assessing and testing existing and novel SS measures, with the possibility of computing a customized IC-based SS approach. This framework is practical and of immediate interest to SS end-users and developers, helping them to retrieve SS scores consistently and easing comparisons of existing and novel SS models.
Quantifying running time performance
As observed in Fig. 4, the PySML average running time increases linearly with the input size. Compared to PySML, SML takes less time to produce SS scores (Fig. 4a), as expected. This is mainly due to the programming languages used, as pointed out previously: SML is implemented in Java, a compiled language, and is therefore expected to run faster than PySML, which is implemented in Python, an interpreted language. Given the simple syntax of Python, PySML is easier to understand than Java-based software, and we anticipate wide acceptance of this library in the scientific and computational communities.
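A measurement of this kind can be sketched as follows. This is a hypothetical timing harness, not the benchmark used for Fig. 4: the set-based Jaccard score stands in for an SS measure, and the input sizes, random annotation sets and number of repeats are arbitrary assumptions.

```python
import random
import time
from typing import List, Set, Tuple


def jaccard(a: Set[int], b: Set[int]) -> float:
    """Stand-in similarity measure on two annotation sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def average_runtime(n_pairs: int, repeats: int = 3) -> float:
    """Average wall-clock time (seconds) to score n_pairs random pairs."""
    rng = random.Random(0)  # fixed seed so every run scores the same pairs
    pairs: List[Tuple[Set[int], Set[int]]] = [
        (set(rng.sample(range(50), 10)), set(rng.sample(range(50), 10)))
        for _ in range(n_pairs)
    ]
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        for a, b in pairs:
            jaccard(a, b)
        timings.append(time.perf_counter() - start)
    return sum(timings) / repeats


# Hypothetical input sizes; doubling them lets the growth trend be inspected.
sizes = [100, 200, 400]
results = [(n, average_runtime(n)) for n in sizes]
for n, seconds in results:
    print(f"{n} pairs: {seconds:.6f} s")
```

Averaging over repeats reduces the influence of transient system load; absolute timings depend on the machine and are not comparable to those reported in the figure.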
Taken together with the average running time for computing different SS scores, these results indicate that PySML allows a large audience to benefit from its functionalities, producing scores in realistic timeframes (Fig. 4b). It is worth mentioning that Python performance issues are being overcome by libraries such as SciPy and NumPy, whose core routines are implemented in C and Fortran, compiled languages well known for performance and speed. These libraries yield performance boosts that can range from a few percent to several orders of magnitude, depending on the task at hand. This is an area of potential future expansion for PySML, which will be explored to optimize its running time as much as possible.
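As a sketch of the kind of gain such libraries offer, the pairwise Jaccard index over a matrix of Boolean profiles can be expressed with NumPy array operations instead of Python loops. The function name `pairwise_jaccard` and the example matrix are hypothetical; this is not part of PySML and illustrates only one possible vectorization.

```python
import numpy as np


def pairwise_jaccard(profiles: np.ndarray) -> np.ndarray:
    """All-against-all Jaccard index for the rows of a Boolean matrix."""
    x = profiles.astype(np.int64)
    # Matrix product counts positions set in both rows, for every pair.
    inter = x @ x.T
    counts = x.sum(axis=1)
    # |A or B| = |A| + |B| - |A and B|, broadcast over all pairs.
    union = counts[:, None] + counts[None, :] - inter
    out = np.zeros(inter.shape, dtype=float)
    # Assumption: pairs of empty profiles keep the default score 0.0.
    np.divide(inter, union, out=out, where=union > 0)
    return out


# Hypothetical profiles: three entities described by five Boolean features.
profiles = np.array(
    [
        [1, 1, 0, 1, 0],
        [1, 0, 0, 1, 1],
        [0, 0, 0, 0, 0],
    ],
    dtype=bool,
)
scores = pairwise_jaccard(profiles)
print(scores)
```

The loops over pairs and features run inside compiled code here, which is where the speed-up over a pure Python implementation comes from; the size of the gain depends on the number and length of the profiles.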
Summary
PySML is a flexible, easy-to-use and expandable open Python library for handling SS measures for any ontology in any application, with clear benefits when compared to similar solutions. It provides the large community interested in SS measures with an interface that eases SS measure implementation, testing, evaluation and comparison, enabling reproducibility and reusability of SS scores. Moreover, the PySML library supports the implementation of new SS models and enables end-users to compute customized IC-based SS scores by providing a term-IC value map and freely choosing the SS models used to produce the scores. Thus, PySML provides an effective platform for the replication and independent assessment of previously reported models and results.