Multi-task convolutional neural networks for predicting in vitro clearance endpoints from molecular images

Optimization of compound metabolic stability is a highly topical issue in pharmaceutical research. Accordingly, application of predictive in silico models can potentially reduce the number of design-make-test-analyze iterations and consequently speed up the progression of novel candidate molecules. Herein, we have investigated whether multiple in vitro clearance endpoints can be accurately predicted from image-based molecular representations. To this end, compound measurements for four commonly investigated clearance endpoints were curated from AstraZeneca internal sources, providing a sound basis for building multi-task convolutional neural network models. Application of several increasingly challenging data splitting strategies confirmed that convolutional neural network models were successful at capturing implicit chemical relationships contained in training and test data, similar to what is commonly observed for structural fingerprints. Furthermore, model benchmarking against state-of-the-art machine learning methods, including deep neural networks and graph convolutional neural networks trained with structure- and graph-based representations, respectively, revealed on-par or better accuracy for convolutional neural networks, with a clear benefit of multi-task learning across all clearance endpoints. Our findings indicate that image-based molecular representations can be applied to predict multiple clearance endpoints, suggesting a potential follow-up to investigate model interpretability from molecular images.


Introduction
Prospective drug candidates are typically required to exhibit not only desirable biochemical and physiological effects, but also optimal absorption, distribution, metabolism, and excretion (ADME) characteristics to ensure adequate exposure at the site of action [1,2]. Drug exposure is a focal component of pharmacokinetics (PK), which plays an essential role in drug discovery and, consequently, in compound attrition during clinical development [3,4,5].
A suite of in vitro assays and animal PK studies is commonly applied to evaluate the ADME properties of drugs [2]. Among these properties, drug metabolism describes the rate at which a molecule is modified by degrading or metabolizing enzymes (i.e., its metabolic stability). Metabolic stability can be quantified using intrinsic clearance (CL), which together with volume of distribution defines other, so-called secondary PK parameters. As these parameters ultimately define the pharmacological and toxicological profiles of compounds, optimization of drug metabolic stability has long been considered a fundamental objective in drug discovery [3]. Hepatic matrices, such as liver microsomes and hepatocytes, are regularly employed to estimate the hepatic metabolic stability of compounds for subsequent extrapolation to in vivo [2,4]. Accurate prediction of CL from hepatocytes and microsomes increases confidence in estimating total in vivo CL in animals, and eventually humans [6]. In addition, early application of in silico tools, such as machine learning, has the potential to reduce the number of compounds progressed to synthesis and experimental evaluation [7-10]. Traditionally, these models learn from chemical structure, which can be encoded as either qualitative (e.g., presence or absence of a carbonyl group) or quantitative (e.g., number of hydrogen bond acceptors) descriptors/fingerprints [8,9]. Due to high experimental variability, but also intrinsic limitations of conventional structural fingerprints, a number of ADME in silico models (e.g., for different clearance endpoints) suffer from low predictability, which ultimately increases the number of design-make-test-analyze (DMTA) iterations and delays candidate progression [9-11]. While experimental error is inherent to the conducted experimental procedure, different chemical structure representations could be explored to potentially increase the information gain responsible for successful predictions.
In recent years, several chemoinformatics studies have investigated the use of alternative, low-resolution image-based molecular representations to train predictive convolutional neural network (CNN) models for different tasks. For example, Fernandez et al. explored a few two-dimensional representations of molecules with CNNs using known benchmarking data sets comprising seven nuclear receptor and five toxic stress response endpoints [12]. They concluded that the predictability of CNN models was on par with state-of-the-art methodologies, which was further prospectively validated using an external test set from which several compounds with predicted antiandrogen activity were experimentally verified [12]. In another study, Cortés-Ciriano and Bender used compound images to build CNN regression models for the prediction of compound potency against 25 different protein targets and eight cytotoxicity endpoints [13]. Different CNN architectures were evaluated against conventional machine learning methods and deep neural networks (DNNs) trained with structural representations. Once more, the benchmarked models displayed comparable performance, which improved greatly once random forest and CNN model outputs were averaged. Such averaged outputs yielded lower prediction errors compared to individual estimates, demonstrating the complementarity of features extracted from orthogonal methodological frameworks and representations [13]. In a proof-of-concept study, Iqbal et al. evaluated activity cliffs (pairs of structurally similar compounds with large differences in biological activity) against structural analogs that did not form activity cliffs, both of which were represented as images modelled with CNNs [14]. The molecular images displayed structures of shared cores and exchanged substituents that were assigned one of two class labels (i.e., 'forms an activity cliff' or 'does not form an activity cliff'). Surprisingly, high prediction accuracy was achieved, which enabled mapping of gradient weights from CNN layers back onto the images of test compounds, revealing structural features essential for successful predictions [14]. In a study by Yoshimori, the so-called Molecular Topographic Map (MTM) approach was developed, which applied generative topographic mapping of atomic features to derive two-dimensional maps for each chemical compound [15]. In a subsequent benchmarking study, the performance of MTM-based CNN models was shown to be equal to or better than state-of-the-art machine learning methods trained on different structural representations [15]. In yet another study by Iqbal et al., CNNs were applied to predict structure-activity relationship characteristics of activity landscapes [16]. Activity landscapes were depicted as visuals of both similarity and activity relationships of compounds. Once again, high prediction accuracy was achieved, distinguishing between activity landscapes with significantly different topographies [16]. Evidently, recent advances in deep learning have enabled the investigation of novel molecular representations with the ultimate aim to further increase model predictability for different molecular endpoints, particularly those with significant impact on the DMTA cycle and yet limited prediction accuracy.
Herein, we systematically explored several molecular image representations and configurations to build predictive multi-task CNN models for different in vitro clearance endpoints from AstraZeneca internal sources. These models were thoroughly evaluated using a number of increasingly difficult data splitting strategies commonly applied in chemoinformatics. Lastly, we benchmarked the performance of our single- and multi-task CNN models against the respective configurations of existing machine learning approaches trained with established structural fingerprints, as well as against graph convolutional neural networks that learn molecular representations directly from chemical structure to predict endpoints in an end-to-end manner.

Data curation
Experimental data for four clearance endpoints, namely human hepatocyte intrinsic clearance (HH CLint), human liver microsome intrinsic clearance (HLM CLint), rat hepatocyte intrinsic clearance (RH CLint), and dog hepatocyte intrinsic clearance (DH CLint), were extracted from AstraZeneca internal sources. Standard operating procedures for the different clearance endpoints are described in the Supplementary Information.
Compound annotations given as SMILES strings were standardized using the RDKit [17] and ChEMBL Structure Pipeline [18] packages in Python; this included removal of salts and solvent molecules, neutralization of acids and bases, conversion of isotopes, removal of stereochemistry information, and canonicalization of aromatic structures. Only compounds with a molecular weight (Mw) between 200 and 900 Da were retained.
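A minimal sketch of such a standardization step is shown below. It combines the ChEMBL Structure Pipeline with RDKit as described above; the exact internal procedure may differ, and the function name and ordering of steps are illustrative assumptions.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from chembl_structure_pipeline import standardizer

def curate_smiles(smiles: str):
    """Return a curated canonical SMILES, or None if the compound is rejected."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # ChEMBL Structure Pipeline: standardize, then strip salts/solvents
    mol = standardizer.standardize_mol(mol)
    mol, _ = standardizer.get_parent_mol(mol)
    # Drop stereochemistry and isotope labels (in place)
    Chem.RemoveStereochemistry(mol)
    for atom in mol.GetAtoms():
        atom.SetIsotope(0)
    # Retain only compounds with 200 <= Mw <= 900 Da
    if not 200 <= Descriptors.MolWt(mol) <= 900:
        return None
    return Chem.MolToSmiles(mol)  # canonical SMILES
```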
Only measurements within the range of 1 to 300 μL/(min × 10⁶ cells) were kept for HH, RH, and DH CLint, whereas for HLM CLint the range was 3 to 300 μL/(min × mg). A decadic logarithm (log10) transformation was applied to convert the qualifying experimental values. If a compound was measured several times for a particular endpoint, the mean and standard deviation of its measurements were calculated. The standard deviation distribution for each endpoint was used to derive experimental error estimates, set at the 95th percentile of the distribution [19]. These derived experimental errors were applied to gauge model strength, with root-mean-square error (RMSE) values of model performance compared against the estimates. To ensure that compounds with high experimental variability were excluded from model building, the linear correlation between the mean and standard deviation of measurements was explored: if a compound's standard deviation was larger than two times the expected standard deviation of the linear fit, the measurement was removed [19]. Following these steps, the curated clearance data proceeded to modelling.
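The sketch below illustrates the core of this curation step (range filter, log10 transform, replicate aggregation, and the percentile-based error estimate); the column names and the function interface are assumptions for illustration, not the internal implementation.

```python
import numpy as np
import pandas as pd

def curate_endpoint(df: pd.DataFrame, lower: float, upper: float):
    """df: long-format table with 'compound_id' and 'value' columns (assumed names)."""
    # Keep only measurements within the assay range, then log10-transform
    df = df[df["value"].between(lower, upper)].copy()
    df["log_value"] = np.log10(df["value"])
    # Aggregate replicate measurements per compound
    agg = df.groupby("compound_id")["log_value"].agg(["mean", "std", "count"])
    # Experimental error estimate: 95th percentile of the std distribution
    exp_error = agg.loc[agg["count"] > 1, "std"].quantile(0.95)
    return agg, exp_error

# e.g., HH CLint measurements between 1 and 300 uL/(min x 10^6 cells):
# hh_agg, hh_error = curate_endpoint(hh_df, lower=1.0, upper=300.0)
```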

Image representations
The canonical SMILES of each compound was converted to a two-dimensional molecular representation using the OpenEye library in Python [20]. The image size was fixed at 240 × 240 pixels to accommodate the depiction of larger molecules present in the data set while preserving the information gain from smaller compounds. Aromatic bonds were represented using ring-centered circles to maximize the information content and the number of pixels associated with chemical bonds.
The default pixel range was between 0 (black) and 255 (white), where pixels with a value of zero contained no chemical information. Atoms and bonds were encoded using two alternative molecular image representations. In the first representation, which we termed color coded, different color values were assigned to different atoms (e.g., black for carbon, red for oxygen, blue for nitrogen, etc.). This information was converted into pixel space so that atom and bond values retained their original color but were normalized between 0 and 1. The second image representation, named property gradient, consisted of atom and bond properties depicted as grayscale gradients (atomic number, MMFF94 [21] partial charge, bond order, or valence). Pixel values (from 0 to 255) were mapped back to the original feature values, whereas empty pixels were assigned a value of zero. The presence of property gradients reduced the number of pixels devoid of chemical information. Images from the different properties were then combined into a multi-channel image. Figure 1 shows an example of both the color coded and property gradient representations for the drug imatinib.
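The study used the OpenEye toolkit for rendering; as an open-source illustration of the same idea, the sketch below produces a 240 × 240 color-coded depiction with RDKit instead. RDKit's default palette and white background differ from the setup described above, so this is a stand-in, not a reproduction of the original pipeline.

```python
import io
import numpy as np
from PIL import Image
from rdkit import Chem
from rdkit.Chem.Draw import rdMolDraw2D

def render_color_coded(smiles: str, size: int = 240) -> np.ndarray:
    """Render a molecule as a size x size RGB array normalized to [0, 1]."""
    mol = Chem.MolFromSmiles(smiles)
    drawer = rdMolDraw2D.MolDraw2DCairo(size, size)
    rdMolDraw2D.PrepareAndDrawMolecule(drawer, mol)
    drawer.FinishDrawing()
    png_bytes = drawer.GetDrawingText()
    image = Image.open(io.BytesIO(png_bytes)).convert("RGB")
    # Normalize pixel values between 0 and 1, as described for color coded images
    return np.asarray(image, dtype=np.float32) / 255.0
```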
Two different configurations of molecular size were explored for image derivation: fixed and scaled. In the fixed configuration, a consistent resolution of 0.03 Å/pixel was applied, which ensured equally sized renditions of atoms and bonds across molecules but consequently introduced a greater number of pixels devoid of chemical information for small molecules. In the scaled configuration, molecules were adapted to fit the image dimensions so that both atom and bond contents were maximized; here, small molecules were depicted as relatively larger so as to minimize the number of empty pixels. Figure 2 depicts both the fixed and scaled configurations for the drug clozapine.
Following the definition of representations and configurations, different data augmentation strategies were applied to reduce model bias towards the single two-dimensional pose rendered for each molecule. Here, we applied 90° or 180° rotations, vertical flips, and horizontal flips to expand the number of molecular poses and allow better model generalization independent of the rendered orientation, as sketched below. During the training process, for each SMILES provided per batch, an augmented version of the original molecular representation was submitted for learning, efficiently carried out by a PyTorch [22] data loader.
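A minimal sketch of such an on-the-fly augmentation, applying one randomly chosen transformation per image; the sampling scheme (uniform choice including a no-op) is an assumption for illustration.

```python
import random
import torch

def augment(image: torch.Tensor) -> torch.Tensor:
    """Randomly rotate (90/180 degrees) or flip a (channels, height, width) tensor."""
    op = random.choice(["rot90", "rot180", "hflip", "vflip", "none"])
    if op == "rot90":
        return torch.rot90(image, k=1, dims=(1, 2))
    if op == "rot180":
        return torch.rot90(image, k=2, dims=(1, 2))
    if op == "hflip":
        return torch.flip(image, dims=(2,))   # horizontal flip (width axis)
    if op == "vflip":
        return torch.flip(image, dims=(1,))   # vertical flip (height axis)
    return image
```

In practice, a function like this would be called inside the dataset's `__getitem__` so that each epoch sees freshly augmented poses via the PyTorch data loader.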

Structural representations
State-of-the-art structural representations were used to benchmark the CNN models built on images. Here, different configurations of extended-connectivity fingerprints (ECFPs) [23] were applied, including diameters of four (ECFP4) and six (ECFP6) hashed to either 1024- or 2048-bit versions. ECFPs were obtained using the RDKit [17] package in Python.
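For reference, ECFP generation in RDKit amounts to a Morgan fingerprint: ECFP4 corresponds to radius 2 (diameter 4) and ECFP6 to radius 3, hashed to the chosen bit length.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp(smiles: str, radius: int = 2, n_bits: int = 2048) -> np.ndarray:
    """Morgan/ECFP bit vector; radius=2 gives ECFP4, radius=3 gives ECFP6."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp, dtype=np.uint8)

# ecfp("CCO", radius=2, n_bits=2048)  # 2048-bit ECFP4
```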

Data splitting strategies
Prior to model building, the clearance data set was divided into training and test sets using three different strategies, given in increasing order of prediction difficulty: random split, scaffold-based split, and time-based split. For the random split, compounds were randomly divided into the respective training and test sets, which may therefore share common chemical space given the stochastic nature of this approach. The scaffold-based split, on the other hand, ensured that compounds containing the same Bemis-Murcko scaffold [24] were placed in the same (training or test) set; this reduced the chemical similarity between training and test sets and thus increased prediction difficulty. Further up the scale, the time-based split defined a time point threshold by which the data were divided into training and test sets. As our internal data were all associated with sampling dates, the latest date of clearance measurement was defined as the time stamp for each compound. Time thresholds of 30, 12, 6, and 4 months were established to evaluate different time split scenarios.
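A sketch of a scaffold-based split in the spirit described above: compounds sharing a Bemis-Murcko scaffold stay on the same side of the split. Assigning the largest scaffold groups to training first is one common convention and an assumption here; the internal procedure may differ in detail.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Group compound indices by Bemis-Murcko scaffold, then split group-wise."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        scaffold = MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)
        groups[scaffold].append(idx)
    train, test = [], []
    n_train_target = (1.0 - test_fraction) * len(smiles_list)
    # Largest scaffold groups fill the training set first; the rest form the test set
    for group in sorted(groups.values(), key=len, reverse=True):
        (train if len(train) < n_train_target else test).extend(group)
    return train, test
```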
Training sets were used for model cross-validation and hyperparameter optimization, whereas the test sets were employed to independently evaluate final model performance. Three-fold cross-validation was applied for all three data splitting strategies. For both random and scaffold-based splits, 80% of the data set was dedicated to model building while the remaining 20% was set aside as a test set. For the time-based split, the different time thresholds produced different ratios of training-test data, with the 30-month time point being closest to the 80/20 split.

Loss functions and performance metrics
Models were optimized using either the Huber [25] loss or RMSE. The Huber loss balances between the L1 and L2 losses, where L1 minimizes the sum of all absolute differences between true and predicted values while L2 minimizes the sum of all squared differences between true and predicted values. For individual predictions, it can be calculated as:

$$
L_{\delta}(y_i, \hat{y}_i) =
\begin{cases}
\frac{1}{2}\,(y_i - \hat{y}_i)^2 & \text{for } |y_i - \hat{y}_i| \le \delta \\
\delta \left( |y_i - \hat{y}_i| - \frac{1}{2}\,\delta \right) & \text{otherwise,}
\end{cases}
$$

where the parameter δ regulates the application domains of the L1 and L2 penalties, set to a default of 1.
The RMSE presents the measure of spread of prediction errors around the line of best fit and can be calculated using the following equation:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2},
$$

where $y_i$ is the true value for the i-th sample, $\hat{y}_i$ is the predicted value for the i-th sample, and $n$ is the number of samples. The model configuration with the lowest RMSE value was carried forward.
Model performance was evaluated using both RMSE and the coefficient of determination (R²) score. The R² presents the proportion of variance of the dependent variable explained by the independent model variables. It provides an indication of goodness of fit and can be defined as:

$$
R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2},
$$

where $\bar{y}$ is the arithmetic average of all true values.
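The three quantities above, written out as a plain NumPy sketch for clarity:

```python
import numpy as np

def huber(y_true, y_pred, delta=1.0):
    err = np.abs(y_true - y_pred)
    quad = 0.5 * err ** 2              # L2 regime, |error| <= delta
    lin = delta * (err - 0.5 * delta)  # L1 regime, |error| > delta
    return np.where(err <= delta, quad, lin).mean()

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot
```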

Convolutional neural networks
Different CNN architectures, coded in PyTorch [22], were explored to learn in vitro clearance endpoints from molecular images. The basic concept of CNNs is as follows: a series of convolutional layers comprising convolution and pooling operators is repeatedly applied to extract features at different scales, which are then processed by several fully connected layers (activated by a non-linear function) ending in an output layer whose size corresponds to the number of tasks (e.g., clearance endpoints) [26]. For each training epoch, prediction outputs are evaluated against the corresponding experimental values to calculate the training loss, which is then minimized using a pre-defined loss function.
Model weights were initialized either randomly or from pre-trained networks built on the commonly applied ImageNet [32] data set. The latter approach, commonly known as transfer learning, is often practiced in computer vision to speed up convergence and improve predictability on novel tasks. Weight imports from these architectures could be partial, as in a simple configuration we termed SimpleNet, where only the convolutional operators were acquired from a pre-trained VGG [33], or total, where all the pre-trained weights of the Chemception [34], ResNet [35], and EfficientNet [36] architectures were applied. As we primarily focused on multi-task predictions, CNNs were optimized in multi-task mode to derive an optimal configuration, which was later used to train single-task CNN variants and enable a fair comparison between single- and multi-task learning.
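A hedged sketch of the "total" weight-import setup: an ImageNet-pre-trained ResNet (resnet18 chosen here purely for brevity; torchvision ≥ 0.13 API assumed) with its classification head replaced by a four-output regression head, plus a masked loss so that missing endpoint labels do not contribute to training. The masking scheme is our illustration of how sparse [compound, endpoint] matrices are typically handled, not a description of the authors' exact implementation.

```python
import torch
import torch.nn as nn
from torchvision import models

# Import all pre-trained ImageNet weights, then replace the classification
# head with a four-output regression head (one unit per clearance endpoint).
model = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 4)  # HH, HLM, RH, DH CLint

def masked_huber_loss(pred: torch.Tensor, target: torch.Tensor,
                      mask: torch.Tensor, delta: float = 1.0) -> torch.Tensor:
    """mask: 1 where an endpoint was measured for a compound, 0 otherwise.
    Missing entries in `target` should be filled with a placeholder (e.g., 0)."""
    loss = nn.functional.huber_loss(pred, target, reduction="none", delta=delta)
    return (loss * mask).sum() / mask.sum()
```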

Machine learning for structural representations
Machine learning methods frequently employed in ECFP-based learning tasks, such as the random forest regressor [37], support vector regressor [38], and deep neural network [39], were optimized and validated using these structural representations. To enable a fair comparison between image- and structure-based representations, the training-test splits were kept consistent across the different splitting strategies. A random forest regressor (RFR) is an ensemble of decision trees, each built from a randomly sampled subset of the training data, a procedure commonly known as bootstrapping. Herein, we optimized the number of decision trees (50, 100, or 150) and the minimum number of samples left in terminal nodes (1 to 5) using the Scikit-learn [40] implementation. A support vector regressor (SVR) maps training data (or features) into a high-dimensional space (often using a non-linear kernel function), as close as possible to their quantitative output (or label), to derive a regression function. The hyperplane margin (ε) controls the tolerable deviation between observed and predicted values, beyond which errors are penalized. The hyperparameter C, also known as the regularization slack variable, further regulates the relaxation of the error minimization problem by penalizing large deviations from the ε tube. The SVRs were optimized for the hyperparameters ε (0.001, 0.01, and 0.1), C (0.1, 1, and 10), and the kernel function (radial basis or Tanimoto [41]). These as well were implemented via Scikit-learn [40], with the Tanimoto kernel supplied as sketched below.
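Scikit-learn has no built-in Tanimoto kernel, so one plausible route, shown here as a sketch, is to precompute the Gram matrix and pass it to an SVR with `kernel="precomputed"`. The fingerprint arrays (`X_train`, `X_test`) are assumed to be binary matrices of shape (n_samples, n_bits).

```python
import numpy as np
from sklearn.svm import SVR

def tanimoto_kernel(A: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Tanimoto similarity for binary vectors: |a & b| / (|a| + |b| - |a & b|)."""
    inner = A @ B.T
    counts = A.sum(axis=1)[:, None] + B.sum(axis=1)[None, :]
    return inner / (counts - inner)

# svr = SVR(kernel="precomputed", epsilon=0.01, C=1.0)
# svr.fit(tanimoto_kernel(X_train, X_train), y_train)
# preds = svr.predict(tanimoto_kernel(X_test, X_train))
```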
A deep neural network (DNN) applies a series of learned matrix multiplications and non-linear activation functions in units known as neurons, commonly organized in several fully connected layers. A loss function (such as one of those described previously) evaluates the difference between predicted and experimental values and is minimized via an iterative learning process. The DNNs were implemented in PyTorch [22] and optimized for batch size, learning rate, hidden layer configuration, and dropout rate using a hyperparameter optimization algorithm [43], for our purposes set at 30 sampled experiments. The methodological framework of RFRs and SVRs allows only for single-task learning, whereas DNNs can be trained in both single- and multi-task fashion.

Graph convolutional neural networks
In addition to the above-mentioned state-of-the-art machine learning methods trained on structural representations, we also explored the predictive potential of graph convolutional neural networks (GCNNs) in both multi- and single-task settings. In this study, the directed message-passing neural network framework Chemprop [44] was used, which learns a molecular representation directly from the chemical structure (represented as a graph) to provide endpoint predictions in an end-to-end fashion. It consists of a message-passing phase, which uses GCNNs to extract molecular representations, and a readout phase, which learns and predicts the endpoints of interest [44]. Of all the techniques introduced, GCNNs are conceptually the closest to CNNs, which allows for a direct comparison of model performance. Three-fold cross-validation using 20 trials of the built-in hyperparameter optimization function was performed for 50 epochs to optimize a set of hyperparameters: the number of message-passing steps, the size of the GCNN layers, the number of feedforward layers, and the dropout rate.
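For orientation, a hedged sketch of the Chemprop 1.x Python interface (per its public documentation; option names may differ between versions). The file `clint.csv` is an assumed input with a SMILES column and one column per clearance endpoint, which Chemprop treats as a multi-task regression target.

```python
import chemprop

arguments = [
    "--data_path", "clint.csv",
    "--dataset_type", "regression",
    "--num_folds", "3",
    "--epochs", "50",
    "--save_dir", "chemprop_ckpts",
]
args = chemprop.args.TrainArgs().parse_args(arguments)
mean_score, std_score = chemprop.train.cross_validate(
    args=args, train_func=chemprop.train.run_training
)
```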

Clearance data
All compounds associated with experimental data for HH CLint, HLM CLint, RH CLint, and DH CLint were obtained from internal AstraZeneca sources and curated to combine assay data that followed equivalent experimental procedures (see Supplementary Information). Basic statistics for the curated intrinsic clearance data are summarized in Table 1.
As we were interested in exploring multi-task modelling for the four clearance endpoints, we derived a [compound, endpoint] matrix in which each compound corresponded to its curated experimental value for a particular endpoint. In total, we identified 139,907 compounds with at least a single clearance endpoint measurement that qualified for our analysis. As is common in experimental sciences, not all compounds were investigated for all clearance endpoints, leaving only 36.91% of the potential compound-endpoint pairs available for model building and evaluation. Out of 139,907 compounds, only 922 (0.66%) had clearance measured for all four endpoints, whereas 79,470 (56.70%), more than half of the curated compounds, had only a single clearance value reported. This revealed that our data set consisted of compounds spanning different stages of drug development, which is further corroborated by the individual clearance statistics in Table 1. The clearance endpoints most often measured upon compound synthesis are HLM CLint and RH CLint, present for 101,913 and 79,043 compounds in our data set, respectively, followed by HH CLint (21,616 compounds) and DH CLint (6747 compounds), which are evaluated comparatively later.
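As a small illustration, such a [compound, endpoint] matrix can be derived from the long-format curated records with a single pivot; the variable and column names here are assumptions.

```python
import pandas as pd

# 'records' is assumed to be a long-format DataFrame with columns
# compound_id, endpoint, and log_value (one row per curated measurement).
matrix = records.pivot_table(index="compound_id", columns="endpoint",
                             values="log_value")
coverage = matrix.notna().to_numpy().mean()  # fraction of measured pairs (~0.37 here)
```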
Our curated clearance set was also characterized by high chemical diversity. In total, we detected 64,262 Bemis-Murcko scaffolds, with an average of 2.18 compounds per scaffold; the compound-to-scaffold ratio for each individual endpoint (Table 1) corresponded to this global average. About 34.18% (or 47,821) of the compounds were singletons, i.e., they contained a scaffold not shared with any other compound. The value distributions for all clearance endpoints had wide ranges, typical of compounds present at different stages of drug development. HH CLint and DH CLint had overall the lowest distribution averages, a main characteristic of compounds with optimized clearance properties. Errors estimated from the clearance measurements were consistent with the standard two-fold margin expected for clearance experiments.
Altogether, qualified compounds and clearance measurements were of adequate quality to warrant further exploration via machine learning.

Evaluation of transfer learning, image configurations, and representations
Prior to evaluating different CNN architectures and hyperparameters, we first investigated the impact of transfer learning, image configurations, and representations on model accuracy using cross-validation results. Figure 3 details the performance of SimpleNet models trained on either the scaled or fixed image configuration, with or without transfer learning. To fairly compare the effect of transfer learning and the different image configurations, the remaining parameters were kept consistent: color coded images, a maximum of 150 training epochs, four convolutional layers, two fully connected layers with a 0.50 dropout rate, batch size of 128, learning rate of 10^-3, RMSE as a loss function, Adam optimizer, and batch normalization. As shown in the top panel of Fig. 3, average R² scores for all clearance endpoints were greater for transfer learning models than for those initialized with random weights (without transfer learning). This held true for both scaled and fixed image configurations, where transfer learning increased the R² score by 0.36 and 0.40, respectively. The difference was less evident for RMSE (Fig. 3, bottom), where transfer learning models displayed slightly lower values (ΔRMSE = 0.10 for scaled and ΔRMSE = 0.11 for fixed). Model accuracy was only marginally better for the fixed configuration.

Furthermore, we conducted a sensitivity analysis to compare mean absolute error (MAE) values of prediction for small molecules (Mw < 350 Da) using both fixed and scaled configurations trained via transfer learning. As the information content (number of 'filled' pixels) differs for small molecules depicted with the respective configurations, we expected the fixed configuration (lower information content) to result in inferior model performance compared to the scaled configuration (higher information content). However, we noted little difference in model performance between the two configurations: on average, MAE was 0.34 and 0.35 for fixed and scaled configurations, respectively. Based on these results, we focused our further investigation on models trained using transfer learning and the fixed image configuration.

Next, we analyzed the performance of SimpleNet models trained with different molecular representations (Fig. 4). Once more, the other parameters remained unchanged for fair comparison: fixed image configuration trained with transfer learning, a maximum of 150 training epochs, four convolutional layers, two fully connected layers with a dropout rate of 0.50, batch size of 128, learning rate of 10^-3, RMSE as a loss function, Adam optimizer, and batch normalization. Although we initially defined two image representations, color coded and property gradient, we also evaluated the individual components of the multi-channel property gradient representation: atomic number, partial charge, bond order, and valence. As shown in Fig. 4, the color coded representation showed a slight advantage over individual property gradients, with an average of ΔR² = 0.08 and ΔRMSE = 0.03. On the other hand, the multi-channel property gradient, which combines all four property gradients, scored similarly to color coded images. Taken together, the differences between individual property gradients and the multi-channel representation were minor, demonstrating that plain, single-channel images were often a good substitute for more complicated, multi-channel representations. Once again, RMSE scores revealed only very minor differences among the evaluated representations.
Given their simplicity and initial performance, we further proceeded with color coded images for CNN architecture assessment.

Evaluation of CNN architectures and hyperparameters
Having identified transfer learning models with the fixed configuration and color coded image representation as the preferential setup, we continued by investigating various hyperparameter-optimized CNN architectures using cross-validation results. The best-performing configuration, the ResNet architecture summarized in Table 2, was chosen to further assess performance on hold-out test sets originating from the different data splitting strategies.

Model performance
Next, we built and evaluated ResNet models using the previously defined data splitting strategies: random, scaffold-, and time-based splits. Here, we first divided the complete clearance data into training and test sets for each split. Then, we built multi-task models using the designated training sets. Finally, we validated each model using the respective hold-out test sets.

Figure 5 shows model accuracy for individual clearance endpoints as evaluated on 20%-worth of test data for random, scaffold-, and time-based splits (last 30 months). The designated order of splitting strategies corresponds to an increase in prediction difficulty, where training-test data for the random split are expected to have greater chemical overlap than time-based training-test data of equivalent ratio. As shown in Fig. 5, model performances decline in the expected order of prediction difficulty. On average, only a minor accuracy decrease was noted between the random and scaffold-based splits (ΔR² = 0.05 and ΔRMSE = 0.03), where RH CLint recorded the greatest drop (ΔR² = 0.07 and ΔRMSE = 0.03) and DH CLint the smallest (ΔR² = 0.02 and ΔRMSE = 0.02). A significant performance decrease was observed for time-based splits, with an average decline of ΔR² = 0.29 and ΔRMSE = 0.11 compared to the random split. Here, HH CLint and RH CLint showed the greatest change in R² and RMSE, respectively (ΔR² = 0.43 and ΔRMSE = 0.14), whereas DH CLint and HLM CLint showed the smallest change in R² and RMSE, respectively (ΔR² = 0.13 and ΔRMSE = 0.09). It should be noted that a time threshold of 30 months (2.5 years) is not commonly evaluated in practice; yet, for the purpose of comparison between data splitting strategies, 20% of the data were chosen to form hold-out test sets, which corresponds to a 30-month threshold for the time-based split.

Interestingly, model performance correlated with the expected increase in prediction difficulty imposed by the different data splitting strategies. This indicates that CNN models could extract implicit chemical information from molecular images that was relevant for predictions. As the chemical overlap narrows with each strategy, it becomes more difficult for CNNs to exhibit good performance. This is further supported by the performance results from time-based splits at different time thresholds. Figure 6 depicts changes in model accuracy as the time threshold is gradually relaxed, from the initially displayed 30 months (Fig. 5) to 12, 6, and 4 months. Model accuracy across all four clearance endpoints improves as the time threshold becomes less stringent (from 30 to 4 months). Clearance models at the 30-month threshold displayed overall the lowest accuracy, followed by the 12-, 6-, and 4-month thresholds. The minimal differences in performance between the 6- and 4-month threshold models were due to the highly similar composition of their data sets. As these models differ by only two months' worth of data (either in the training set, as for the 4-month threshold, or in the test set, as for the 6-month threshold), model performances may display either a slight increase or decrease. For example, we observed a decrease in performance for the 4-month threshold models for HH CLint and DH CLint, which are commonly the least frequently measured endpoints. We note that the trend displayed in Fig. 6 reflects the common observation made for structure-based models, which in practice require continuous updates to retain reasonable performance on the latest chemical data.

Benchmarking with structural representations
Finally, we compared the performance of single- and multi-task CNN models against a number of different machine learning methods (RFR, SVR, DNN) that were trained using structure-based representations. In addition, we compared CNNs with GCNN models trained on graph-based representations of molecules. Given that RFR and SVR support only single-task learning while DNNs and GCNNs support both single- and multi-task configurations, single- and multi-task CNNs were evaluated against the respective, equivalent setups.

Figure 7 compares the model performance (hold-out test set from the same random split) of multi-task DNNs and GCNNs with multi-task CNNs trained on molecular images. Optimal DNN models were trained using 2048-bit ECFP4 fingerprints with the following set of hyperparameters: batch size of 64, learning rate of 10^-4, two hidden layers with a [4000, 2000] neuron configuration, RMSE as a loss function, and a dropout rate of 0.40. This setup was used to train both single- and multi-task DNNs. On the other hand, optimal GCNN models consisted of the following hyperparameter combination: four message-passing steps, GCNN layer size of 900, two feedforward layers, and a dropout of 0.15. Similar to the DNNs, this setup was applied to both single- and multi-task GCNNs. All three models (DNN, GCNN, and CNN) displayed overall good performance, in line with observations made in practice. We note that molecular images demonstrated a minor advantage over conventional structure-based fingerprints, whereas GCNNs were either on par with or slightly better than CNNs (Fig. 7). The average improvement of CNNs over DNNs was only ΔR² = 0.03 and ΔRMSE = 0.01, with HLM CLint showing the greatest and HH CLint the lowest increase in accuracy. This shows that multi-task CNNs trained on molecular images are, at the very least, on par with multi-task DNNs using the structural fingerprints commonly applied in predicting similar tasks. On the other hand, the average performance difference between GCNNs and CNNs was even smaller, ΔR² = 0.01 and ΔRMSE < 0.01, and no clear performance benefit of GCNNs over CNNs was noted.

Further, we considered the estimated experimental errors for each of the clearance endpoints. We conceived of experimental errors as upper bounds on potentially achievable model performance, given that models should not be able to predict more accurately than the error margins imposed by the experiments. As shown in Fig. 7, three out of four clearance endpoints (HH CLint, HLM CLint, and RH CLint) had experimental errors that were still considerably lower than the multi-task model performances. In the case of DH CLint, the multi-task CNN model achieved accuracy that was only marginally worse than its estimated experimental error (ΔRMSE = 0.03). This suggests that further modelling of HH CLint, HLM CLint, and RH CLint data could potentially reduce the performance-error margin, whereas for DH CLint this would not yield tangible improvement.
However, a different picture emerged for the comparison of single-task models. Figure 8 shows model performance for RFR, SVR, and single-task DNN, GCNN, and CNN models as evaluated on hold-out test sets from the same random split. The optimal RFR was trained on 2048-bit ECFP4 fingerprints, using an ensemble of 100 bootstrapped decision trees with a minimum of one sample per terminal node. On the other hand, optimal SVR models used 2048-bit ECFP6 fingerprints, a radial basis kernel, a hyperplane margin ε of 0.01, and a regularization slack variable C of 1. We applied these optimal configurations (RFR and SVR) across all individual clearance endpoint models. As depicted in Fig. 8, single-task CNNs demonstrated equivalent or inferior performance compared to the other machine learning methods trained on structural representations. Here, SVR and GCNN models displayed on average the best accuracy, with only minor differences noted across all clearance endpoints. Of the three machine learning methods that utilized structural fingerprints, RFR showed overall the lowest performance, on average weaker by ΔR² = 0.09 and ΔRMSE = 0.04 compared to the best performing models. A clear disadvantage of single-task CNNs was observed for the HH CLint (R² = 0.45, RMSE = 0.35) and DH CLint (R² = 0.37, RMSE = 0.40) models, which were among the lowest performing. Unlike in the multi-task setting, GCNNs provided a greater advantage over CNNs when individual endpoints were modelled: on average, the recorded ΔR² and ΔRMSE were 0.06 and 0.01, respectively. It is also worth mentioning that single-task deep learning-based models usually provide no clear benefit over classical machine learning methods trained on structural fingerprints when small data sets are considered (such as DH CLint). In such cases, methods such as RFRs and SVRs are advised to be applied first, mainly due to their speed and simplicity.

Compared to multi-task CNNs (Fig. 7), the single-task equivalents for HH CLint and DH CLint recorded an R² decrease of 0.13 or more, underscoring the performance increase afforded by multi-task learning. Although SVRs showed a considerable advantage in the single-task setting over other, more complex methods, the methodological framework of neural networks enables knowledge transfer between tasks, which ultimately enhances the performance of individual endpoints, especially those with fewer data points. Given the high similarity among the clearance endpoints, this joint knowledge transfer becomes even more pronounced. Altogether, CNN models learning from molecular images showed comparable performance to standard machine learning methods trained on structural representations. Moreover, multi-task learning showed a clear advantage over single-task variants, suggesting it as the preferential setup for modelling multiple clearance endpoints.

Furthermore, cross-validation accuracies for both multi- and single-task models (Figs. S1 and S2 of the Supplementary Figures) corroborated the results obtained from test set predictions. In addition, the small standard deviation across different performance measures confirmed that the evaluated models were robust. Using Welch's t test on the R² distributions (which displayed greater variance than the RMSE distributions), the following CNN vs. other model comparisons showed similar cross-validation performance (p > 0.05): for HH CLint, multi-task CNN vs. multi-task DNN, multi-task CNN vs. multi-task GCNN, and single-task CNN vs. RFR; for HLM CLint, multi-task CNN vs. multi-task GCNN, single-task CNN vs. single-task DNN, and single-task CNN vs. single-task GCNN; for RH CLint, single-task CNN vs. single-task DNN; and for DH CLint, multi-task CNN vs. multi-task DNN and multi-task CNN vs. multi-task GCNN. For the remaining comparisons, the performance distributions were significantly different (p << 0.05).
Another topical issue to consider is model training time, which can differ greatly between machine learning methods. Here, cross-validation coupled with hyperparameter optimization, a common bottleneck during model building, varied in completion time between the methods, including multi- and single-task setups. For example, multi- and single-task CNNs were fully trained after approximately 4 and 3 days, respectively, whereas multi- and single-task DNNs were obtained after 8 and 6 hours, respectively. Multi- and single-task Chemprop GCNN models were overall the fastest among the deep learning-based methods, completing the full training protocol in 4 and 3 hours, respectively. Interestingly, the simpler methods (SVRs and RFRs) displayed considerably different training times: for HLM CLint (the largest data set), SVR training completed after one hour and RFR training after only three minutes. Not surprisingly, RFRs are generally preferred over other, more complex machine learning methods for their training speed. However, inference time (i.e., the time required to predict a test set) was fairly similar for all the evaluated models, averaging three milliseconds per compound.

Conclusion
In this study, we have investigated the use of multi-task convolutional neural networks to predict a series of in vitro clearance endpoints using different molecular image representations. Prediction of in vitro ADME parameters, including different clearance endpoints, is commonly achieved through the use of structure-based fingerprints that encode chemical structure as fixed-length binary vectors which indicate presence or absence of particular substructural patterns. With the recent advent of novel deep learning architectures, end-to-end representation learning together with iterative model refinement have expanded capabilities of predictive models beyond pre-defined, knowledge-based structural representations.
To evaluate this, we trained convolutional neural networks on low-level two-dimensional image representations of molecules to predict four in vitro clearance endpoints commonly applied at AstraZeneca. First, we evaluated the impact of transfer learning, different image configurations, and representations to choose an appropriate modelling setup for the follow-up evaluation of different CNN architectures and hyperparameters. We then applied different data splitting strategies to assess the performance of the selected model configuration. Finally, we benchmarked our single- and multi-task CNN models against the respective machine learning methods trained using conventional structural fingerprints and graph-based representations, typically employed in learning tasks such as clearance prediction. Based on these results, we conclude that CNNs were successful at deriving implicit chemical information from raw molecular images, resulting in models with accuracy on par with state-of-the-art machine learning approaches using standard structure-based fingerprints. Interestingly, the performance of CNN models reflected changes in prediction difficulty associated with the designated data splitting strategies, similar to the observations made for structure-based machine learning models typically applied in practice. This was further supported by model performances estimated at different time thresholds. While experimental error estimates revealed that model accuracy could still potentially be improved, it remains questionable whether further in-depth investigation of different modelling aspects would yield a measurable increase in performance. In addition, we demonstrated that multi-task learning greatly improved the predictability of our models, particularly those with a limited number of accessible data points.
Although structure-based representations are generally less complex than molecular images, and graph-based representations aim to extract high-quality features at lower computational cost, several potential advantages of image representations should be mentioned. First, CNN models could potentially extract plausible chemical features that extend beyond fixed-length structural rules. Derivation of novel molecular features via imaging could improve the learning capability of chemoinformatics-based models where conventional structural features show only limited success; however, such models often require a large corpus of data to confidently engineer learnable features. Secondly, in cases where insufficient modelling data exist, transfer learning could be applied to fine-tune pre-trained models for increased endpoint predictability. This approach has been successfully employed for a variety of deep learning applications, including the study discussed herein. Thirdly, CNNs could potentially be applied to hand-drawn images of molecules, where an experimentalist could simply sketch a desired compound to obtain estimates directly on the display. This would, however, require a large image collection comprising several drawing styles with assigned experimental values to successfully train CNNs for the prediction of newly drawn molecules. Lastly, features extracted from molecular images could be mapped back onto the two-dimensional structures of molecules to investigate their relevance for compound design. Applied to clearance endpoints, such feature contribution maps could suggest potential structural modifications to lower metabolic clearance and consequently increase the stability of compounds. These, and other potential applications of CNNs to molecular images, will be the focus of our future work.