Model building features
Flame can build predictive models starting from a single file in SDFile format containing the structures and the biological properties of the training series. The default model building workflow takes care of reading the structures, normalizing them, generating molecular descriptors, scaling their values and building a machine-learning model, which is saved in a format suitable for predicting the properties of new compounds.
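The following minimal sketch illustrates such a workflow for a quantitative endpoint, using RDKit and scikit-learn. The descriptor set, scaler, estimator and file names shown here are illustrative choices, not Flame's actual defaults.

```python
# Minimal model-building sketch: read an SDFile, compute descriptors,
# scale them and fit an estimator (illustrative, not Flame's exact defaults).
import numpy as np
import joblib
from rdkit import Chem
from rdkit.Chem import Descriptors
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor

def build_model(sdfile, activity_field="activity"):
    X, y = [], []
    for mol in Chem.SDMolSupplier(sdfile):
        if mol is None:                      # skip unreadable structures
            continue
        X.append([Descriptors.MolWt(mol),
                  Descriptors.MolLogP(mol),
                  Descriptors.TPSA(mol)])    # toy descriptor set
        y.append(float(mol.GetProp(activity_field)))
    X, y = np.array(X), np.array(y)

    scaler = StandardScaler().fit(X)         # autoscaling
    model = RandomForestRegressor(n_estimators=200).fit(scaler.transform(X), y)

    # persist both the scaler and the estimator so predictions can reuse them
    joblib.dump({"scaler": scaler, "model": model}, "estimator.pkl")
    return scaler, model
```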
Flame provides defaults for methods and parameters, but the user can customize them, either by editing the parameter file parameters.yml when using Flame in command-line mode or through the model building dialogue (Figure 4) when using the Flame GUI.
Table 3 describes the methods implemented natively in Flame. All of them make use of open source libraries. The choice of methods can be easily extended to include commercial products or external tools, using the code-overriding technique described in the Implementation section.
Table 3. Overview of the main modeling methods and tools implemented natively in Flame.
Modeling task                      | Method                     | Source
Structure normalization            | Standardiser               | [23]
                                   | ChEMBL pipeline            | [24, 25]
Molecular descriptors calculation  | RDKit properties           | [26]
                                   | RDKit md                   | [26]
                                   | RDKit Morgan fingerprints  | [26]
Scaling                            | Raw                        | -
                                   | Autoscaling                | [7]
Machine learning                   | RF                         | [7, 27]
                                   | SVM                        | [7, 28]
                                   | PLS                        | [7, 29]
                                   | XGBOOST                    | [30]
                                   | Conformal regression       | [31, 32]
Typically, models are built starting from a collection of annotated chemical structures, but Flame can also use as input a tab-separated (TSV) table with pre-calculated molecular descriptors and annotations. Another option, rarely found in other modeling frameworks (but present in OCHEM [33]), is the possibility to use as input the prediction results of other models present in the repository. This option, called "model ensemble" in Flame, is useful for combining the results of multiple qualitative models into a single result representing the majority vote. The prediction results of an ensemble of quantitative models can also be combined using the mean or median of the individual predictions. Regressors and classifiers can also be applied to the model ensemble output to generate a more sophisticated result combination and obtain better predictions. When the ensemble models provide an estimation of the individual prediction error, this information is taken into account, using appropriate probabilistic methods, to generate an estimation of the final prediction error. The description of these algorithms is beyond the scope of the present work and will be published in a separate article.
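As an illustration of the simplest combination strategies mentioned above (majority vote for qualitative models, mean or median for quantitative ones), a minimal sketch is shown below; it is not Flame's actual ensemble code.

```python
# Illustrative combination of low-level model outputs in an ensemble
# (not Flame's actual implementation).
import numpy as np

def combine_qualitative(predictions):
    """Majority vote over binary predictions of shape (n_models, n_compounds)."""
    votes = np.asarray(predictions)
    return (votes.sum(axis=0) > votes.shape[0] / 2).astype(int)

def combine_quantitative(predictions, use_median=False):
    """Mean (or median) of quantitative predictions, per compound."""
    values = np.asarray(predictions)
    return np.median(values, axis=0) if use_median else values.mean(axis=0)

# Example: three classifiers predicting two compounds
print(combine_qualitative([[1, 0], [1, 1], [0, 0]]))   # -> [1 0]
```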
The last step of the model building workflow is an estimation of the model's quality using cross-validation, which presents to the user information about the model's goodness of fit, its predictive quality and some characteristics of the training series (e.g., value distribution). Since Flame can use diverse ML methods, we tried to generate comparable output to facilitate the selection of the best methods and parameters. The values shown for qualitative and quantitative endpoints are summarized in Table 4.
The Flame GUI provides additional information, oriented to diagnosing the quality of the model and the training series, as shown in Figure 5. For qualitative endpoints (left side of Figure 5), the confusion matrix is shown as a 2x2 matrix and as a "radar plot" in which the radius of each section expresses the relative number of true positive, true negative, false positive and false negative results. This information is shown separately for the model fitting and prediction, the latter being calculated using cross-validation methods selected by the user (defaulting to 5-fold cross-validation). In addition, Flame displays a scatterplot of the training series, obtained by running a Principal Component Analysis (PCA) on the calculated molecular descriptors and showing the first two principal components (PCs). Objects (compounds) are colored red or blue according to their biological annotations (positive or negative, respectively). The ratio of positive to negative substances in the training series is depicted using a pie chart.
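A minimal sketch of such a PCA projection is shown below, using scikit-learn and matplotlib. The prior autoscaling step is an illustrative choice, and the plot is static rather than interactive like the one rendered by the Flame GUI.

```python
# Sketch of the PCA projection used to visualize the training series
# (illustrative; the Flame GUI renders this plot interactively).
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def plot_training_space(X, y):
    """X: descriptor matrix, y: binary annotations (1 positive, 0 negative)."""
    scores = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
    colors = ["red" if label == 1 else "blue" for label in y]
    plt.scatter(scores[:, 0], scores[:, 1], c=colors, alpha=0.6)
    plt.xlabel("PC1")
    plt.ylabel("PC2")
    plt.show()
```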
For quantitative endpoints (right side of Figure 5), apart from the parameters mentioned in Table 4, the interface shows scatterplots of fitted/predicted values versus the experimental annotations. For conformal models, the confidence interval for the defined confidence level is also shown. In a separate tab, Flame displays a scatterplot of the training series, like the one shown for qualitative endpoints, but in this case the substances are colored using the continuous scale included in the plot. The distribution of the annotation values is shown using a violin-type plot, which offers valuable information for diagnosing skewed value distributions or the presence of outliers. All the graphics representing the training series are interactive, and hovering the mouse cursor over the dots displays the 2D structure of the compounds they represent.
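A violin plot of the annotation values can be produced with a few lines of matplotlib, as sketched below; this is an illustrative stand-in for the interactive plot shown in the GUI.

```python
# Minimal violin plot of the training annotations, useful to spot skewed
# distributions or outliers (illustrative, using matplotlib directly).
import matplotlib.pyplot as plt

def plot_annotation_distribution(y):
    """y: list or array of quantitative annotation values."""
    fig, ax = plt.subplots()
    ax.violinplot(y, showmedians=True)
    ax.set_ylabel("annotation value")
    plt.show()
```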
The model quality reports described above are persistent. All this information is stored within the model folder and can be retrieved and shown for every model in the repository at a later time.
Error handling
A very important feature of any modeling software aiming to solve real-life problems is error handling. A workflow can fail for many reasons: molecules can have incorrect structures or contain metals or water, and model building can also fail when the annotations are not correct. For this reason, a lot of effort was devoted in Flame to implementing appropriate error-handling methods, removing molecules that cannot be processed and producing suitable output, which is shown in the command line or the GUI. Modelers know that the number of potential sources of error is high, and Flame cannot claim to handle all error types. However, years of development and use by different modeling teams have established Flame as rather robust software.
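The sketch below illustrates the general idea of per-molecule error handling (failed compounds are skipped and reported instead of aborting the whole workflow); it is not Flame's actual error-handling code, and the helper names are hypothetical.

```python
# Sketch of per-molecule error handling: compounds that cannot be processed
# are removed and reported, and the workflow continues (illustrative).
from rdkit import Chem

def safe_process(sdfile, compute_descriptors):
    results, failures = [], []
    for idx, mol in enumerate(Chem.SDMolSupplier(sdfile, sanitize=False)):
        try:
            if mol is None:
                raise ValueError("unreadable structure")
            Chem.SanitizeMol(mol)                 # raises on invalid valences, etc.
            results.append(compute_descriptors(mol))
        except Exception as err:
            failures.append((idx, str(err)))      # keep going, report later
    return results, failures
```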
Model predictions
Any model stored in the repository can be used to predict the properties of new substances by simply entering an SDFile with the structure of the compound(s) to predict. The prediction workflow then applies to this file the same pretreatment, molecular descriptor calculation and X-matrix scaling used for the training series, using exactly the same source code, thus guaranteeing maximum consistency of the results. The molecular descriptors obtained are then projected using the stored estimator. The prediction results can be qualitative or quantitative, depending on the nature of the training series annotations. Models built using conformal regression [32] generate additional information about the prediction uncertainty: for quantitative endpoints they provide a confidence interval, while for qualitative (binary) endpoints the result of the prediction can be "uncertain", meaning that the model cannot ascertain whether the result is positive or negative. In both cases, the uncertainty is reported at a given probability (the confidence level of the CI or the probability that the result is correct, respectively), which must be defined by the user.
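A minimal prediction sketch, consistent with the model-building sketch shown earlier, is given below. The descriptor list and the file name "estimator.pkl" are hypothetical; the key point is that the stored scaler and estimator are reused unchanged.

```python
# Prediction sketch: the same descriptor code and the stored scaler/estimator
# are applied to the query structures (illustrative; file names are hypothetical).
import numpy as np
import joblib
from rdkit import Chem
from rdkit.Chem import Descriptors

def predict(sdfile, model_path="estimator.pkl"):
    saved = joblib.load(model_path)
    scaler, model = saved["scaler"], saved["model"]
    X = []
    for mol in Chem.SDMolSupplier(sdfile):
        if mol is None:
            continue
        X.append([Descriptors.MolWt(mol),
                  Descriptors.MolLogP(mol),
                  Descriptors.TPSA(mol)])         # must match the training descriptors
    return model.predict(scaler.transform(np.array(X)))
```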
Models are watermarked during the building process and have an associated unique ID. This ID is read when the model is used for prediction. This means that predictions keep a record of the model version used to generate them, guaranteeing full traceability.
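The sketch below shows one simple way such a watermark could work: a unique ID generated at build time and copied into every prediction record. The file and field names are hypothetical and this is not Flame's actual mechanism.

```python
# Sketch of a model "watermark": a unique ID generated at build time and
# copied into every prediction record (illustrative; names are hypothetical).
import datetime
import json
import uuid

def watermark_model(meta_path="model-meta.json"):
    model_id = uuid.uuid4().hex
    with open(meta_path, "w") as fh:
        json.dump({"model_id": model_id,
                   "built": datetime.datetime.utcnow().isoformat()}, fh)
    return model_id

def tag_prediction(result, meta_path="model-meta.json"):
    with open(meta_path) as fh:
        result["model_id"] = json.load(fh)["model_id"]   # full traceability
    return result
```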
As stated in the introduction, prediction results are often difficult to understand and interpret for users not involved in the model building. For this reason, the Flame GUI presents the prediction results in different formats, decorated with extra information aimed at facilitating the interpretation of the results and their use for decision making.
As shown in Figure 6, results are displayed in three alternative views. First, they are presented as a list, including for every predicted compound its name, 2D structure and prediction result, as well as uncertainty information when available. This list is paged, searchable and can be reordered. It can also be exported to Excel or PDF formats, printed, or copied to the clipboard. Clicking on any of the list items displays a more detailed report for a single compound, showing again the compound name, structure and prediction result, but also information about how to interpret the result (extracted from the model documentation) and a list of the closest compounds in the training series with their biological annotations. The similarity is computed using the same molecular descriptors used for building the model.
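A minimal sketch of such a nearest-neighbour lookup is shown below, here using Morgan fingerprints and Tanimoto similarity as an example descriptor space; in Flame the similarity is computed with whatever descriptors the queried model uses.

```python
# Sketch of the "closest training compounds" lookup, using Morgan fingerprints
# and Tanimoto similarity as an illustrative descriptor space.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

def closest_compounds(query_mol, training_mols, n=5):
    fp = AllChem.GetMorganFingerprintAsBitVect(query_mol, radius=2, nBits=2048)
    train_fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, 2048)
                 for m in training_mols]
    sims = DataStructs.BulkTanimotoSimilarity(fp, train_fps)
    ranked = sorted(zip(sims, range(len(training_mols))), reverse=True)
    return ranked[:n]          # (similarity, index into the training series)
```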
When the model used for the prediction is an ensemble, the prediction report shows the individual results of the low-level models and the combined result (Figure 7). For conformal binary classifiers (left of Figure 7), the graphic shows the low-level model prediction results, indicating whether the query compound is assigned to class 0 (negative), class 1 (positive), both (inconclusive type I) or neither (inconclusive type II). For conformal quantitative models (right of Figure 7), the predictions are shown with the corresponding confidence intervals.
The prediction results are also projected onto a scatterplot of the training series PCA scores, generated as explained in the previous section (Figure 5). The aim of this representation is to show whether the predicted compound belongs to a region of the chemical space well represented by the training series or falls in a scarcely populated region. In this representation, the training series compounds can be displayed as grey dots or colored by the biological annotation. Predicted compounds can be displayed as green circles labelled with the compound names, as red dots, or as dots colored by the compound's distance to model (DModX, see [34]). A high DModX value indicates that the predicted compound has "original features" not present in the training series, which can be detrimental to the prediction quality.
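A simplified sketch of the DModX idea is given below: the residual distance of a query compound to the PCA model of the training descriptors, relative to the typical residual size of the training set. The exact normalization used in the original definition differs (see [34]); this is only an illustration of the concept.

```python
# Simplified DModX sketch: residual distance of a new compound to the PCA
# model of the training descriptors (illustrative, see [34] for the exact
# normalization).
import numpy as np
from sklearn.decomposition import PCA

def dmodx(X_train, x_new, n_components=2):
    pca = PCA(n_components=n_components).fit(X_train)
    # residuals of the training set define the "normal" residual size
    resid_train = X_train - pca.inverse_transform(pca.transform(X_train))
    s0 = np.sqrt((resid_train ** 2).sum() / resid_train.size)
    # residual of the new observation after projection onto the model
    resid_new = x_new - pca.inverse_transform(pca.transform(x_new.reshape(1, -1)))
    s_new = np.sqrt((resid_new ** 2).sum() / x_new.size)
    return s_new / s0          # >1: more residual variance than the training set
```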
Finally, it should be mentioned that predictions are stored in a persistent prediction repository, making it possible to revisit previous predictions until they are actively removed by the user.
Model management
Once a model is built, it is stored in a separate folder of the model repository. This folder can contain multiple versions of the model. As a minimum, there is a dev version, which must be considered a "sandbox" used only for model development and which is overwritten every time the model is rebuilt. Precisely for this reason, the dev version cannot be used for prediction. Model versions which the model developer considers worth storing should be 'published' to generate version 1, 2, etc.
The main GUI window shows a list (Figure 8) where models can be browsed and selected. Every model is identified by a name and version and labelled by Maturity, Type, Subtype, Endpoint and Species. The labels are defined by the end-user and can be used to filter the models shown, thus making it easier to find models for a certain endpoint, species, organ, etc.
Both the command mode interface and the GUI provide model management commands for creating new models, publishing a model version, deleting a whole model tree with all the versions or any single model version, etc.
Models can be exported using a command that produces a compressed version of the whole model folder. This file can be easily stored, backed up, or sent in electronic formats (e.g., as an e-mail attachment). Once imported into any Flame instance, the model is copied to the model repository and becomes fully functional. During the import step, the versions of the software libraries used for generating the model are checked and, in case of version mismatches, a warning message is issued.
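The sketch below illustrates this export/import pattern (a compressed archive of the model folder plus a library-version check on import). The archive layout, metadata file and field names are hypothetical, not Flame's actual format.

```python
# Sketch of model export/import as a compressed archive, with a check of the
# library versions recorded at build time (illustrative; layout is hypothetical).
import json
import tarfile
import warnings
import sklearn

def export_model(model_dir, archive="model.tgz"):
    with tarfile.open(archive, "w:gz") as tar:
        tar.add(model_dir, arcname="model")

def import_model(archive, destination="models"):
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(destination)
    with open(f"{destination}/model/model-meta.json") as fh:
        built_with = json.load(fh).get("sklearn_version")
    if built_with and built_with != sklearn.__version__:
        warnings.warn(f"model built with scikit-learn {built_with}, "
                      f"running {sklearn.__version__}")
```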
Model documentation
Flame models are documented using a template based on the QMRF [35], reusing our previous experience in model documentation [36]. When the model is built, Flame automatically completes the template fields describing the modeling methodology and quality. This partially completed document should then be edited by the modeler, either using the GUI or by editing a documentation file in YAML format with a text editor and re-importing it into the model. In either case, the model documentation is stored in the model folder and is included when the model is exported or published.
The model documentation has been split into three sections: General Model Information, Algorithms and Other Information. The first and third sections should be completed by the modeler, while Flame automatically completes most of the second section. The additional file 'BZR.yaml' contains an example of a human-readable YAML file, suitable for import into a Flame model, with all the items included in these three sections. The additional file 'model documentation GUI.pdf' contains a PDF file showing how the model documentation is presented to the user in the Flame GUI.
Performance
In a typical modeling workflow, the same code (structure normalization, molecular descriptor calculation) is run for every compound in the input series, both for training and prediction series. This makes it simple to speed up the computation by splitting the series into n sub-series and assigning them to different computation threads, which run on different CPUs. Flame has an option for running the molecular descriptor calculation in parallel, obtaining a nearly linear speedup. Another time-consuming step is model building and validation. By default, Flame applies the multitasking implemented in the ML libraries (e.g., in scikit-learn it is used for cross-validation and grid search, and XGBOOST makes excellent use of multi-CPU computing power). Use of GPUs is under development, and a special Flame version supporting GPUs is planned to be released in the future, facilitating the efficient use of deep learning within the framework.
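The sketch below illustrates this per-compound parallelization pattern with Python's multiprocessing module; the descriptor function is an illustrative stand-in, not Flame's actual code.

```python
# Sketch of the parallelization strategy: the input series is split into chunks
# and descriptors are computed in separate processes (illustrative).
from multiprocessing import Pool
from rdkit import Chem
from rdkit.Chem import Descriptors

def descriptors_from_smiles(smiles):
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return [Descriptors.MolWt(mol), Descriptors.MolLogP(mol), Descriptors.TPSA(mol)]

def parallel_descriptors(smiles_list, n_workers=4):
    with Pool(n_workers) as pool:
        # each worker processes a chunk of the series
        return pool.map(descriptors_from_smiles, smiles_list, chunksize=100)

if __name__ == "__main__":
    print(parallel_descriptors(["CCO", "c1ccccc1", "CC(=O)O"]))
```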
Additionally, during model development it is common practice to rebuild the model repeatedly with diverse machine-learning settings in order to optimize them. To speed up this process, Flame stores intermediate results of the calculation (e.g., the molecular descriptor matrix), thus avoiding re-computing them in every cycle.
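One simple way to implement such caching is sketched below: the descriptor matrix is keyed by a hash of the input file and reused when the file has not changed. The cache layout and key are hypothetical, shown only to illustrate the idea.

```python
# Sketch of descriptor caching: reuse a stored descriptor matrix when the
# input series has not changed (illustrative; the cache key is hypothetical).
import hashlib
import os
import numpy as np

def cached_descriptors(sdfile, compute, cache_dir=".cache"):
    os.makedirs(cache_dir, exist_ok=True)
    with open(sdfile, "rb") as fh:
        key = hashlib.sha1(fh.read()).hexdigest()     # same file -> same key
    cache_file = os.path.join(cache_dir, key + ".npy")
    if os.path.exists(cache_file):
        return np.load(cache_file)                    # skip recomputation
    X = compute(sdfile)                               # expensive descriptor step
    np.save(cache_file, X)
    return X
```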