Data extraction
The off-target interaction data of the compounds (expressed as percentage of inhibition, PCT) for the 50 targets listed in Table 1 were extracted from our in-house databases according to the following criteria: (a) the percentage of inhibition was measured at a compound concentration of 10 µM; (b) the assay type was either a radioligand binding displacement assay or an enzymatic assay; (c) only human recombinant receptors were used in these assays (except for the GABA-A, glutamate, glycine and Ca2+ channel receptors, where rat brain tissue was used). These in vitro safety assays were conducted over a period of ~15 years (the assays collected span the interval between September 2004 and January 2020). The panel comprised diverse target classes: 22 GPCRs, 6 ion channels, 5 kinases, 4 nuclear receptors, 2 transporters, 2 other receptors, and 9 other enzymes.
Table 1 Roche optimized off-target panel (50 targets), arranged in descending order of the total number of compounds screened per target.
Target | Symbol | Compounds screened | Positives | Hit percent
Adenosine A3 | ADORA3 | 2834 | 832 | 29.35
Dopamine D2S | D2S | 2606 | 346 | 13.27
Mu-type opioid | MOP | 2573 | 364 | 14.14
Muscarinic M1 | M1 | 2533 | 916 | 36.16
Serotonin 5-HT2A | 5-HT2A | 2391 | 526 | 21.99
Adenosine A1 | ADORA1 | 2378 | 102 | 4.28
NE transporter | NET | 2360 | 393 | 16.65
Serotonin 5-HT2B | 5-HT2B | 2316 | 823 | 35.53
Dopamine D1 | D1 | 2301 | 241 | 10.47
Serotonin 5-HT1A | 5-HT1A | 2290 | 290 | 12.66
Histamine H1 | H1 | 2286 | 214 | 9.36
Muscarinic M2 | M2 | 2253 | 513 | 22.76
Adrenergic β1 | ADRB1 | 2252 | 60 | 2.66
Acetylcholinesterase | ACHE | 2245 | 418 | 18.61
5HT transporter | SERT | 2241 | 799 | 35.65
GABA A (Cl− channel) | GABA A (Cl− channel) | 2238 | 448 | 20.01
Adrenergic α1A | ADRA1A | 2210 | 380 | 17.19
Monoamine oxidase | MAOA | 2204 | 20 | 0.90
Serotonin 5-HT3 | 5-HT3 | 2130 | 50 | 2.34
HIV-1 Protease | HIV1-PR | 2110 | 46 | 2.18
Adrenergic α2A | ADRA2A | 2020 | 196 | 9.70
Adrenergic β2 | ADRB2 | 1973 | 76 | 3.85
PPARgamma | PPARG | 1955 | 194 | 9.92
Ca2+ channel (Diltiazem site) | CACNA1C | 1942 | 473 | 24.35
Nicotinic muscle-type | CHRNA1 | 1899 | 173 | 9.11
Prostaglandin F | FP | 1897 | 73 | 3.84
Histamine H3 | H3 | 1864 | 215 | 11.53
Xanthine oxidase | XDH | 1831 | 49 | 2.67
Glucocorticoid | GR | 1816 | 76 | 4.18
Cyclooxygenase 2 | COX2 | 1799 | 192 | 10.67
Cholecystokinin 1 | CCK1 | 1792 | 139 | 7.75
Matrix metallopeptidase 9 | MMP-9 | 1787 | 38 | 2.12
Angiotensin receptor II | AT1 | 1762 | 45 | 2.55
GABA-A (Benzo) | GABA-A (Benzo) | 1645 | 329 | 20
Histamine H2 | H2 | 1644 | 123 | 7.48
Cannabinoid CB1 | CB1 | 1609 | 67 | 4.16
Phosphodiesterase 3B | PDE3B | 1563 | 66 | 4.22
ZAP70 Kinase | ZAP70 | 1484 | 25 | 1.68
CDK2 Kinase | CDK2 | 1465 | 61 | 4.16
GSK3 beta | GSK3B | 1394 | 65 | 4.66
Estrogen alpha | ERalpha | 1389 | 7 | 0.50
ABL1 Kinase | ABL1 | 1328 | 236 | 17.77
GSK3 alpha | GSK3A | 1285 | 97 | 7.54
Androgen | AR | 1234 | 86 | 6.96
Glutamate (PCP) | PCP | 1187 | 6 | 0.50
Glycine receptor (Strychnine insensitive) | Glycine | 1145 | 2 | 0.17
Nicotinic neuronal-type (alpha-BGTX insens.) | CHRNA4 | 951 | 9 | 0.94
Angiotensin converting enzyme | ACE | 832 | 58 | 6.97
Kappa-type opioid | KOP | 428 | 89 | 20.79
Phosphodiesterase 4D2 | PDE4D2 | 133 | 15 | 11.27
Hit percent calculation and binary classification
A cut-off of 50% inhibition at 10 µM was applied in order to classify the compounds as active (≥ 50% inhibition) or inactive (< 50% inhibition) against each target. The hit percent was then calculated according to Eq. 1, and a cut-off of 20% was specified: a hit percent > 20% was defined as high and a hit percent < 20% as low.
\(\text{Hit percent} = \frac{\text{No. of active compounds}}{\text{Total compounds screened}} \times 100\) (Eq. 1)
Targets with a hit percent below 1 (glutamate, glycine and the neuronal-type nicotinic receptor) were removed, since constructing models for such extremely imbalanced datasets would be infeasible. The final assembled dataset thus consisted of 3928 unique compounds and 47 targets.
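As a minimal R sketch of this step (pct_data is an assumed data frame of percentage-of-inhibition values at 10 µM, with compounds in rows and targets in columns; it is not the actual in-house extract):
# Binarize the percentage-of-inhibition values at the 50% cut-off and
# compute the hit percent per target (Eq. 1); NA marks unscreened pairs.
binarize <- function(pct) ifelse(pct >= 50, 1L, 0L)
binary_activity <- as.data.frame(lapply(pct_data, binarize))
hit_percent <- sapply(binary_activity, function(x) {
  screened <- sum(!is.na(x))              # total compounds screened on this target
  100 * sum(x, na.rm = TRUE) / screened   # Eq. 1
})
# Targets with a hit percent below 1 would then be dropped
kept_targets <- names(hit_percent)[hit_percent >= 1]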
Compounds’ descriptors
Structural features of the compounds were used as descriptors. Extended-connectivity fingerprints with radius 4 (ECFP4) [17] were selected for this study, since they are considered among the best performing features for bioactivity prediction [13, 18–20]. Standardization of canonical SMILES and duplicate removal were performed in Pipeline Pilot (version 9.1.0) [21]. Fingerprint calculations were done with the rcdk package in R version 3.5.1 [22], where the canonical SMILES were parsed into the ‘mol’ format and the 1024-bit ECFP4 fingerprint was generated for each compound. A matrix of 3928 compounds × 1024 binary structural bits was obtained.
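A minimal rcdk sketch of the fingerprint generation (assuming a recent rcdk release; the SMILES shown are placeholders, not study compounds):
library(rcdk)
library(fingerprint)
smiles <- c("CCO", "c1ccccc1O")                 # illustrative SMILES, not study compounds
mols   <- parse.smiles(smiles)                  # SMILES -> CDK 'mol' objects
# 1024-bit circular (ECFP-type) fingerprints for each parsed molecule
fps <- lapply(mols, get.fingerprint, type = "circular", size = 1024)
# Convert the fingerprint list into a compounds x 1024 binary matrix
fp_matrix <- fp.to.matrix(fps)
dim(fp_matrix)                                  # n_compounds x 1024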
Since compounds are usually screened only against the suspected off-targets (not necessarily the full panel), the compound/off-target interaction matrix was not fully populated and contained missing values. A single-task modelling approach was therefore adopted, in which one model was constructed for each target separately.
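To illustrate the per-target assembly, a minimal R sketch (using the fp_matrix and binary_activity objects from the sketches above; both are assumed, illustrative names):
# For each target, keep only the compounds that were actually screened against it
# (non-missing values) and assemble one single-task dataset per target.
targets <- colnames(binary_activity)
single_task_data <- lapply(targets, function(tgt) {
  screened <- !is.na(binary_activity[[tgt]])    # drop compounds never tested on this target
  data.frame(fp_matrix[screened, , drop = FALSE],
             binary_activity = binary_activity[screened, tgt])
})
names(single_task_data) <- targets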
Train-test splits
Due to the high imbalance observed in the datasets, and to ensure a uniform presence of positive samples in both training and test sets, a stratified activity splitting approach was used with the caret package in R (version 3.5.1). A ratio of 80:20 was used for the training and test splits, respectively. The 80% training set was automatically split by keras into training and validation portions (overall, 60% training, 20% validation and 20% test). Test sets were used as external held-out sets and were completely excluded from the training procedure. To allow a comparison between the modelling approaches, the same training and test sets were used for each approach.
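As an illustration, a stratified 80:20 split for one target could be obtained with caret as follows (a minimal sketch; target_data stands for one of the per-target data frames assembled above, and the seed is an arbitrary illustrative choice):
library(caret)
set.seed(42)                                             # illustrative seed
target_data <- single_task_data[["Muscarinic M2"]]       # one per-target data frame (assumed name)
train_idx <- createDataPartition(factor(target_data$binary_activity), p = 0.8, list = FALSE)
train_set <- target_data[train_idx, ]                    # 80% training (stratified on activity)
test_set  <- target_data[-train_idx, ]                   # 20% external held-out test set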
Machine learning approaches
A deep learning approach (feedforward neural network) and two automated machine learning (AutoML) approaches (H2O Driverless AI [23] and AutoGluon Tabular [24]) were used to construct the models. A comparative analysis was subsequently performed between the three methods with respect to their robustness, the resources consumed, and the expertise needed to build the models.
Deep learning with feedforward neural networks
A feedforward neural network was employed using the tensorflow [25] (version 2.3) and keras [26] (version 2.3) packages in RStudio (version 1.1.456) with R (version 3.5.1). As explained above, the ECFP4 fingerprints were used as the input layer of the network to predict the binary activity of the molecules in the output layer (inactive = 0, active = 1). A sigmoid activation function was applied to the output layer, while the rectified linear unit (ReLU) function was used to activate the input and hidden layers. Dropout was applied to the input and hidden layers to avoid overfitting, and an L2 penalty was applied to the input and hidden layers via a kernel regularizer. The numbers of hidden units, dropout rates, learning rates and batch sizes were varied according to Table 2. The fixed parameters of the network are presented in Table 3.
Table 2 Values of the grid search parameters used in the neural network.
Parameter | Value
Hidden units | [256, 512, 1024, 2048]
Dropout (input layer) | [0, 0.1, 0.2]
Dropout (hidden layers) | [0.2, 0.3, 0.4]
Learning rate | [0.01, 0.001, 0.0001]
Batch size | [64, 128, 256]
Table 3 Fixed parameters of the neural network.
Parameter | Value
Optimizer | Adam
Loss | Binary cross-entropy
Hidden layers | 2
Activation | ReLU
Kernel regularizer (L2) | 0.001
Output activation | Sigmoid
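Combining the grid values in Table 2 with the fixed parameters in Table 3, a minimal R keras sketch of this architecture could look as follows (the hidden-unit count and dropout rates shown are one illustrative grid combination, not the finally selected hyperparameters):
library(keras)
model <- keras_model_sequential() %>%
  layer_dropout(rate = 0.1, input_shape = c(1024)) %>%   # dropout on the 1024-bit fingerprint input
  layer_dense(units = 1024, activation = "relu",         # hidden layer 1
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1024, activation = "relu",         # hidden layer 2
              kernel_regularizer = regularizer_l2(0.001)) %>%
  layer_dropout(rate = 0.3) %>%
  layer_dense(units = 1, activation = "sigmoid")         # binary activity output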
In addition to the binary accuracy metric provided by keras, a custom metric function (balanced accuracy) was created to monitor the training statistics and avoid the misleading outcomes that could be caused by the imbalanced nature of the data. The Adam optimizer was employed to minimize the selected loss function (binary cross-entropy), with a varying learning rate. The learning rate was further reduced when no improvement in the validation balanced accuracy was observed over 10 epochs, to help reach a global minimum and to reduce random noise.
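A sketch of how such a custom balanced-accuracy metric can be defined with keras backend operations and attached at compile time (the learning rate shown is one illustrative grid value, not the tuned one):
balanced_accuracy <- custom_metric("balanced_accuracy", function(y_true, y_pred) {
  y_hat <- k_round(y_pred)                               # threshold predictions at 0.5
  tp <- k_sum(y_true * y_hat)
  tn <- k_sum((1 - y_true) * (1 - y_hat))
  fn <- k_sum(y_true * (1 - y_hat))
  fp <- k_sum((1 - y_true) * y_hat)
  sens <- tp / (tp + fn + k_epsilon())                   # true positive rate
  spec <- tn / (tn + fp + k_epsilon())                   # true negative rate
  (sens + spec) / 2
})
model %>% compile(
  optimizer = optimizer_adam(lr = 0.001),                # illustrative learning rate
  loss = "binary_crossentropy",
  metrics = list("binary_accuracy", balanced_accuracy)
)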
Since fitting the same neural network to 47 datasets of varying size and hit percent is a complex task, a grid search was performed over a selection of hyperparameters and a network architecture that could accommodate these diverse datasets. The variable flags used are listed in Table 2 and were searched using the tuning function from the tfruns library provided by tensorflow (version 2.3).
A 50% randomized sample of the grid search combinations was trained for 250 epochs (a full Cartesian search was not performed due to time and resource limits). To counter overfitting, early stopping was applied: training was stopped when no improvement in the validation balanced accuracy was observed for 20 epochs.
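A sketch of the randomized grid search with tfruns (train_target.R is an assumed per-target training script that reads these values through tfruns flags):
library(tfruns)
runs <- tuning_run(
  "train_target.R",
  sample = 0.5,                                          # train a 50% random sample of the grid
  flags = list(
    hidden_units   = c(256, 512, 1024, 2048),
    dropout_input  = c(0, 0.1, 0.2),
    dropout_hidden = c(0.2, 0.3, 0.4),
    learning_rate  = c(0.01, 0.001, 0.0001),
    batch_size     = c(64, 128, 256)
  )
)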
In addition to the callbacks already used during model fitting (early stopping, reduce learning rate on plateau), a model checkpoint callback was also implemented: models were saved after each epoch and only the model with the best validation balanced accuracy was retained. To promote model generalization, a 20% random validation split was applied during training (the 80% training sets were therefore further divided into 60% training and 20% validation).
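A sketch of the fitting call with these callbacks (x_train and y_train denote the fingerprints and binary activities of the 80% training portion; the file path and batch size are illustrative):
callbacks <- list(
  callback_early_stopping(monitor = "val_balanced_accuracy", mode = "max", patience = 20),
  callback_reduce_lr_on_plateau(monitor = "val_balanced_accuracy", mode = "max", patience = 10),
  callback_model_checkpoint("best_model.h5", monitor = "val_balanced_accuracy",
                            mode = "max", save_best_only = TRUE)
)
history <- model %>% fit(
  x_train, y_train,                                      # 80% training portion
  epochs = 250,
  batch_size = 128,                                      # illustrative grid value
  validation_split = 0.2,                                # 20% random validation split
  callbacks = callbacks
)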
Finally, for each target, three models were preserved: those with the best validation loss, validation balanced accuracy and validation binary accuracy. The models were then further evaluated on the external held-out test sets.
AutoML with H2O Driverless Artificial Intelligence (DAI)
H2O Driverless AI (DAI) models were implemented using the DAI R client library (version 1.8.5.1) in RStudio (version 1.1.456) with R (version 3.5.1). A license is required to access H2O Driverless AI.
Datasets for the AutoML models require less complex handling than for the tensorflow models. As mentioned previously, the same datasets were used for each target, except for a difference in shape: unlike the neural network models, where the chemical structures (input) and the binary activities (output) are fed to the network separately, a single dataset comprising both the structures and the binary activities is fed directly to the H2O system. For each target, the train and test sets were uploaded as follows:
train <- dai.upload_dataset("filepath/MuscarinicM2_train.csv")
test <- dai.upload_dataset("filepath/MuscarinicM2_test.csv")
A “target_col” is specified, which is the column to predict; in our case this is the ‘binary_activity’ column. H2O automatically detects the column named ‘ID’ as the identifier column of the dataset (in our case the compound IDs). No feature engineering or hyperparameter optimization is required in H2O; only a few parameters need to be configured for the training procedure, namely accuracy, time, interpretability and the model scorer. As there is no implementation of balanced accuracy within H2O, the area under the precision-recall curve (AUCPR) was used for training the models. AUCPR is also considered a suitable metric for imbalanced problems, since it is not affected by the true negatives and is more sensitive towards the true positives, false positives and false negatives. Further elaboration on the use of different evaluation metrics and how to overcome evaluation bias is given in the next section.
Models were set to be reproducible and the default expert settings were used.
model <- dai.train(training_frame = train, testing_frame = test, target_col = "binary_activity",
                   is_timeseries = FALSE, is_classification = TRUE,
                   accuracy = 10, time = 6, interpretability = 4,
                   seed = 30, enable_gpus = TRUE,
                   config_overrides = "make_mojo_scoring_pipeline = 'off' \n min_num_rows = 50 \n make_python_scoring_pipeline = 'off' \n make_autoreport = false \n ",
                   scorer = "AUCPR")
For less complex tasks (e.g. models for targets with a high hit percent), different configurations were explored, such as setting a low accuracy level (accuracy = 1). The algorithms implemented in H2O are XGBoost GBM models, LightGBM models, XGBoost GLM models, TensorFlow models and RuleFit models. A final ensemble of the top-performing models is created in order to optimize the overall performance. More details on the models integrated in H2O and the model-stacking process can be found in the H2O booklet documentation.
The final ensemble confusion matrices (TP, FP, TN, FN), ensemble ROC data (TPR, FPR, precision, recall) and ensemble scores (accuracy, MCC, AUCPR) of the held-out test sets were retrieved for all 47 target models in the form of .json files and analyzed in R (version 3.5.1). Balanced accuracy was calculated according to Eq. 5.
The process was automated to train and retrieve the results for the models of the 47 targets.
AutoML with AutoGluon Tabular
Similar to the neural networks and H2O, we utilized the open-source AutoGluon library (version 0.0.13) in a Jupyter notebook (version 7.12.0) with Python (version 3.6.5) to construct models for the 47 off-targets. AutoGluon implements several algorithms, such as neural networks, LightGBM boosted trees, CatBoost boosted trees, Random Forests, Extremely Randomized Trees and k-Nearest Neighbors. A multi-layer stacking strategy is adopted, in which AutoGluon fits the various models individually and creates an optimized model ensemble that outperforms the individual models. Further details on AutoGluon’s strategy for model training, tuning and ensembling can be found in the work of Erickson et al. [24].
Similar to H2O, AutoGluon does not require any additional data processing. The same train and test csv files (containing the compounds’ fingerprints and binary activities) used for H2O were also used for AutoGluon and were imported as follows:
train_data = task.Dataset(file_path=train_filename)
test_data = task.Dataset(file_path=test_filename)
A ‘label’ column (containing the binary activities) and an ‘id_column’ (containing the compound IDs) are defined. The trade-off between predictive accuracy, disk usage and training time is controlled by a list of presets. Several presets are available, such as ‘best_quality’, which trains the models for the best accuracy regardless of training time or disk usage, and ‘best_quality_with_high_quality_refit’, which follows the same strategy as ‘best_quality’ but trades a small amount of accuracy for much higher efficiency (approximately 10× less disk space and 10× faster training). Other presets with different time and accuracy specifications are available and are explained in detail in the AutoGluon Tabular documentation. We selected ‘best_quality_with_high_quality_refit’ to ensure a reasonable balance between accuracy and efficiency; no time limit needs to be specified for this option (it is assigned automatically by AutoGluon), and the best models are automatically stacked in this preset to produce the final model (auto_stack = TRUE). The ‘optimize_for_deployment’ preset was also enabled in order to delete the unused/unstacked models and save disk space, with no effect on the final model accuracy. The training commands can be summarized as follows:
presets = ['best_quality_with_high_quality_refit', 'optimize_for_deployment']
predictor = task.fit(train_data=train_data, label=label_column, id_columns=['ID'],
                     output_directory=output_directory, presets=presets,
                     problem_type='binary', eval_metric='balanced_accuracy',
                     verbosity=1, random_seed=42)
AutoGluon models were evaluated on the external test sets using performance metrics implemented in the scikit-learn library (version 0.22.2).
Evaluation of models and overcoming assessment bias
Classification models are assessed through different metrics applied to the confusion matrix of the test set, which is composed of four quadrants summarizing actual versus predicted values: true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). To ensure a fair evaluation of machine learning models, the performance metrics must be carefully selected and adapted to the nature of the dataset. In our study, for example, most targets have highly imbalanced datasets. Using the accuracy metric (Eq. 2) could therefore be misleading, since high values could be driven by correct prediction of the true negatives rather than the true positives. The area under the receiver operating characteristic curve (AUC), which measures the area under the plot of the true positive rate (TPR, Eq. 3) vs the false positive rate (FPR, Eq. 4), might also lead to erroneous conclusions for the same reason.
Capturing both the true positives and the true negatives is essential in off-target modelling, and therefore balanced accuracy (Eq. 5), which accounts for both outcomes, was utilized as the primary metric to assess the performance of all models and methods. Other metrics adapted to data imbalance were also investigated through case studies, such as the Matthews correlation coefficient (MCC, Eq. 6), which weighs the four quadrants of the confusion matrix equally, and the area under the precision-recall curve (AUCPR), which excludes the true negatives from the calculation by measuring the area under the plot of the positive predictive value (PPV, or precision, Eq. 7) vs the recall (Eq. 3). An overall comparison of the balanced accuracy of the three methods (neural networks, H2O and AutoGluon) is presented in Additional file 1 (Table S1). Furthermore, a detailed analysis of all the mentioned performance metrics for the neural networks is provided in Additional file 2 (Table S2).
\(\text{Accuracy} = \frac{TP + TN}{P + N}\) (Eq. 2)
\(TPR\ (\text{recall}) = \frac{TP}{TP + FN}\) (Eq. 3)
\(FPR = \frac{FP}{FP + TN}\) (Eq. 4)
\(\text{Balanced accuracy} = \frac{\frac{TP}{P} + \frac{TN}{N}}{2}\) (Eq. 5)
\(MCC = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}\) (Eq. 6)
\(\text{Precision}\ (PPV) = \frac{TP}{TP + FP}\) (Eq. 7)
where P = TP + FN and N = TN + FP denote the total numbers of actual positives and actual negatives, respectively.
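For reference, a short R helper consistent with Eqs. 2–7, applicable to any set of confusion-matrix counts (not tied to a specific model output format):
confusion_metrics <- function(tp, tn, fp, fn) {
  p <- tp + fn                                   # actual positives
  n <- tn + fp                                   # actual negatives
  list(
    accuracy          = (tp + tn) / (p + n),                                  # Eq. 2
    tpr               = tp / (tp + fn),                                       # Eq. 3 (recall)
    fpr               = fp / (fp + tn),                                       # Eq. 4
    balanced_accuracy = (tp / p + tn / n) / 2,                                # Eq. 5
    mcc               = (tp * tn - fp * fn) /
                        sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)),  # Eq. 6
    precision         = tp / (tp + fp)                                        # Eq. 7
  )
}
# Example usage with illustrative counts:
# confusion_metrics(tp = 80, tn = 400, fp = 30, fn = 20)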