Intrinsic clearance
Experimental data for four clearance endpoints, namely human hepatocyte intrinsic clearance (HH CLint), human liver microsome intrinsic clearance (HLM CLint), rat hepatocyte intrinsic clearance (RH CLint), and dog hepatocyte intrinsic clearance (DH CLint), were extracted from AstraZeneca internal sources.
The standard operating procedures for the hepatocyte intrinsic clearance assays (HH CLint, RH CLint, and DH CLint) follow a similar protocol, which is outlined as follows:
- The test compound is prepared as a 10 mM solution in dimethyl sulfoxide (DMSO) at 37°C and then diluted in acetonitrile to a concentration of 100 μM
- Frozen hepatocytes are thawed in a 37°C bath and incubated in medium until an approximate viable cell concentration of 1.0 × 10^6 cells/mL is reached, before the suspension is transferred to incubation well plates
- 2.5 μL of test compound solution is combined with the hepatocyte solution
- At time points of 0.5, 5, 15, 30, 45, 60, 80, 100, and 120 minutes, 20 μL of the solution is transferred to a separate well and centrifuged for 20 minutes at 4000g, and the percentage of disappearance of the test compound is then estimated
- A linear correlation between the percentage of disappearance (in natural logarithm units) and time is fitted, from which the slope k is extracted to calculate intrinsic clearance with the following equation:

$$ \mathrm{CL_{int}} = \frac{k \times V}{N} $$

where V is the volume of incubation (μL) and N is the number of hepatocytes in the well (in 10^6 cells). Hepatocyte intrinsic clearance is reported in μL/(min × 10^6 cells) units.
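As a numerical illustration of this last step, the sketch below fits the depletion slope and converts it to an intrinsic clearance value. It assumes the slope is taken from ln(% compound remaining) versus time; the incubation volume, cell count, and depletion values are hypothetical and chosen only for demonstration.

```python
import numpy as np

def hepatocyte_clint(times_min, pct_remaining, incubation_volume_ul, n_million_cells):
    """Estimate intrinsic clearance from substrate-depletion data (sketch).

    The elimination rate constant k (1/min) is the negative slope of
    ln(% compound remaining) versus time; CLint = k * V / N in
    uL/(min * 10^6 cells).
    """
    slope, _intercept = np.polyfit(times_min, np.log(pct_remaining), 1)
    k = -slope
    return k * incubation_volume_ul / n_million_cells

# Hypothetical depletion time course
times = [0.5, 5, 15, 30, 45, 60, 80, 100, 120]       # minutes
remaining = [100, 96, 88, 76, 66, 57, 47, 39, 32]     # % of test compound remaining
print(hepatocyte_clint(times, remaining, incubation_volume_ul=250, n_million_cells=0.25))
```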
For HLM CLint, the following standard operating procedure was applied:
- The test compound is prepared as a 10 mM solution in DMSO and then diluted in acetonitrile to a concentration of 100 μM
- Solution containing liver microsomes is prepared at a concentration of 1.1236 mg/mL in 100 mM phosphate-buffered saline solution at pH 7.4
- 222.5 μL of the microsomal solution is mixed with 2.5 μL of the test compound solution and 25 μL of a 10 mM nicotinamide adenine dinucleotide phosphate (NADPH) solution in a separate plate at 37°C
- At time points of 0.5, 5, 10, 15, 20, and 30 minutes, 20 μL of the solution is sampled and centrifuged for 15 minutes at 3000g, and the percentage of disappearance of the test compound is then estimated
- A linear correlation between the percentage of disappearance (in natural logarithm units) and time is fitted, from which the slope k is extracted to calculate intrinsic clearance with the following equation:

$$ \mathrm{CL_{int}} = \frac{k \times V}{m} $$

where V is the volume of incubation (μL) and m is the amount of microsomal protein (mg). Microsomal intrinsic clearance is reported in μL/(min × mg) units.
Data curation
Compound annotations given as SMILES strings were standardized using the RDKit [17] and ChEMBL Structure Pipeline [18] packages in Python; standardization included removal of salts and solvent molecules, neutralization of acids and bases, conversion of isotopes, removal of stereoisomers, and canonicalization of aromatic structures. Only compounds with a molecular weight between 200 and 900 Da were retained.
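A minimal sketch of this standardization step is shown below. It assumes the chembl_structure_pipeline package's standardize_mol and get_parent_mol functions with default settings; the exact configuration used internally may differ.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors
from chembl_structure_pipeline import standardizer

def curate(smiles):
    """Standardize a SMILES string and apply the 200-900 Da weight filter (sketch)."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    mol = standardizer.standardize_mol(mol)              # normalize and neutralize
    mol, _excluded = standardizer.get_parent_mol(mol)    # strip salts and solvents
    smiles_flat = Chem.MolToSmiles(mol, isomericSmiles=False)  # drop stereochemistry
    mol = Chem.MolFromSmiles(smiles_flat)
    if mol is None or not 200 <= Descriptors.MolWt(mol) <= 900:
        return None
    return Chem.MolToSmiles(mol)                         # canonical SMILES
```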
Only measurements within the range of 1 to 300 μL/(min × 10^6 cells) were kept for HH, RH, and DH CLint, whereas for HLM CLint the range was 3 to 300 μL/(min × mg). A decadic (base-10) logarithm transformation was applied to the qualifying experimental values. If a compound was measured several times for a particular endpoint, the mean and standard deviation of its measurements were calculated. The standard deviation distribution for each endpoint was used to derive an experimental error estimate, set at the 95th percentile of the distribution [19]. These experimental error estimates were used to assess model strength by comparing the root-mean-square errors of model predictions against them. To ensure that compounds with high experimental variability were excluded from model building, the linear correlation between the mean and standard deviation of measurements was examined: if a compound's standard deviation exceeded twice the standard deviation expected from the linear fit, the measurement was removed [19]. Following these steps, the curated clearance data were carried forward to modelling.
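The following pandas/numpy sketch illustrates this aggregation and variability filter. The column names ("smiles", "log_clint") are placeholders, and the linear fit of standard deviation versus mean is a simple least-squares fit, which may differ from the exact procedure used.

```python
import numpy as np
import pandas as pd

def aggregate(df: pd.DataFrame):
    """Aggregate replicate log10 clearance values per compound (sketch)."""
    stats = df.groupby("smiles")["log_clint"].agg(["mean", "std", "count"]).reset_index()

    # Experimental error estimate: 95th percentile of the standard deviation distribution
    exp_error = np.nanpercentile(stats["std"], 95)

    # Remove compounds whose variability exceeds twice the standard deviation
    # expected from a linear fit of std versus mean
    replicated = stats.dropna(subset=["std"])
    slope, intercept = np.polyfit(replicated["mean"], replicated["std"], 1)
    expected_std = slope * stats["mean"] + intercept
    keep = stats["std"].isna() | (stats["std"] <= 2 * expected_std)
    return stats[keep], exp_error
```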
Image representations
The canonical SMILES of each compound was converted to a two-dimensional molecular depiction using the OpenEye library in Python [20]. The image size was fixed at 240 × 240 pixels to accommodate depiction of the larger molecules present in the data set while preserving the information gained from smaller compounds. Aromatic bonds were represented using ring-centered circles to maximize the information content and the number of pixels associated with chemical bonds.
The default pixel range was between 0 (black) and 255 (white), where pixels with a value of zero contained no chemical information. Atoms and bonds were encoded using two alternative molecular image representations. In the first representation, which we termed color coded, different color values were assigned to different atoms (e.g., black for carbon, red for oxygen, blue for nitrogen). This information was converted into pixel space so that atoms and bonds retained their original colors, with values normalized between 0 and 1. The second representation, termed property gradient, depicted atom and bond properties (atomic number, MMFF94 [21] partial charge, bond order, or valence) as grayscale gradients. Pixel values (from 0 to 255) were mapped back to the original feature values, whereas empty pixels were assigned a value of zero. The use of property gradients reduced the number of pixels devoid of chemical information. Images derived from different properties were then combined into a multi-channel image. Fig. 1 shows the color-coded and property gradient representations of the drug imatinib.
Two configurations of molecular size were explored for image derivation: fixed and scaled. In the fixed configuration, a constant resolution of 0.03 Å/pixel was applied, which ensured equally sized renditions of atoms and bonds across molecules but introduced a greater number of pixels devoid of chemical information for small molecules. In the scaled configuration, molecules were fitted to the image dimensions so that the atom and bond content was maximized; here, small molecules were depicted as relatively larger, minimizing the number of empty pixels. Fig. 2 depicts both the fixed and scaled configurations for the drug clozapine.
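A minimal OEDepict sketch of image generation is shown below. It uses OEScale_AutoScale for the scaled configuration; the fixed-scale value is a placeholder, as the exact scale constant corresponding to 0.03 Å/pixel is not reproduced here, and the aspirin SMILES is only an example input.

```python
from openeye import oechem, oedepict

def depict(smiles, filename, scaled=True):
    """Render a 240 x 240 pixel molecular depiction (illustrative sketch)."""
    mol = oechem.OEGraphMol()
    oechem.OESmilesToMol(mol, smiles)
    oedepict.OEPrepareDepiction(mol)
    # AutoScale fits the molecule to the canvas; a constant value would give the fixed configuration
    scale = oedepict.OEScale_AutoScale if scaled else 30.0  # placeholder fixed scale
    opts = oedepict.OE2DMolDisplayOptions(240, 240, scale)
    disp = oedepict.OE2DMolDisplay(mol, opts)
    oedepict.OERenderMolecule(filename, disp)

depict("CC(=O)Oc1ccccc1C(=O)O", "example.png")  # aspirin, example only
```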
Following the definition of representations and configurations, data augmentation strategies were applied to reduce model bias towards the single two-dimensional pose rendered for each molecule. Here, we applied 90° or 180° rotations, vertical flips, and horizontal flips to expand the number of molecular poses and allow better model generalization independent of the rendered orientation. During training, for each SMILES provided per batch, an augmented version of the original molecular representation was submitted for learning, carried out efficiently by a PyTorch [22] data loader.
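A minimal sketch of such an augmentation transform, applied per sample inside a PyTorch dataset, could look as follows; the random flip probabilities are assumptions for illustration.

```python
import random
import torchvision.transforms.functional as TF

class MoleculeAugment:
    """Randomly rotate (90 or 180 degrees) and/or flip a molecular image."""
    def __call__(self, img):
        angle = random.choice([0, 90, 180])
        if angle:
            img = TF.rotate(img, angle)
        if random.random() < 0.5:
            img = TF.hflip(img)   # horizontal flip
        if random.random() < 0.5:
            img = TF.vflip(img)   # vertical flip
        return img
```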
Structural representations
State-of-the-art structural representations were used to benchmark CNN models built on images. Here, different configurations of extended-connectivity fingerprints (ECFPs) [23] were applied, which included diameters four (ECFP4) and six (ECFP6) hashed to either 1024- or 2048-bit versions. ECFPs were obtained using RDKit [17] package in Python.
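For reference, hashed ECFPs of a given diameter and bit length can be generated with RDKit as sketched below; note that an ECFP diameter of four corresponds to a Morgan radius of two.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

def ecfp(smiles, diameter=4, n_bits=1024):
    """Hashed extended-connectivity fingerprint as a numpy array (sketch)."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius=diameter // 2, nBits=n_bits)
    return np.array(fp)

x = ecfp("CC(=O)Oc1ccccc1C(=O)O", diameter=6, n_bits=2048)  # ECFP6, 2048 bits
```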
Data splitting strategies
Prior to model building, the clearance data set was divided into training and test sets using three strategies of increasing prediction difficulty: random split, scaffold-based split, and time-based split. For the random split, compounds were randomly assigned to the training and test sets, which may therefore share common chemical space given the stochastic nature of this approach. The scaffold-based split, in contrast, ensured that compounds containing the same Bemis-Murcko scaffold [24] were placed in the same (training or test) set, which reduced chemical similarity between the training and test sets and thus increased prediction difficulty. The most challenging, the time-based split, defined a time point threshold by which the data were divided into training and test sets. As all our internal data were associated with sampling dates, the latest date of clearance measurement was taken as the time stamp for each compound. Time thresholds of thirty, twelve, six, and four months were established to evaluate different time split scenarios.
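The scaffold-based split can be sketched as below using RDKit's Bemis-Murcko scaffold utility. The heuristic of assigning the largest scaffold groups to training first is an assumption for illustration; the original assignment rule is not specified here.

```python
from collections import defaultdict
from rdkit.Chem.Scaffolds import MurckoScaffold

def scaffold_split(smiles_list, test_fraction=0.2):
    """Group compounds by Bemis-Murcko scaffold so no scaffold spans both sets (sketch)."""
    groups = defaultdict(list)
    for idx, smi in enumerate(smiles_list):
        groups[MurckoScaffold.MurckoScaffoldSmiles(smiles=smi)].append(idx)

    train, test = [], []
    train_target = (1.0 - test_fraction) * len(smiles_list)
    for _, idxs in sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True):
        # Fill the training set with the larger scaffold groups first
        (train if len(train) + len(idxs) <= train_target else test).extend(idxs)
    return train, test
```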
The training sets were used for model cross-validation and hyperparameter optimization, whereas the test sets were employed to independently evaluate final model performance. Three-fold cross-validation was applied for all three data splitting strategies. For both the random and scaffold-based splits, 80% of the data set was dedicated to model building while the remaining 20% formed the test set. For the time-based split, different time thresholds produced different training-test ratios, with the thirty-month threshold being closest to the 80/20 split.
Loss functions and performance metrics
Models were optimized using either the Huber [25] loss or the root-mean-square error (RMSE). The Huber loss balances between the L1 and L2 losses, where L1 minimizes the sum of absolute differences between true and predicted values, while L2 minimizes the sum of squared differences. For an individual prediction it can be calculated as:

$$ L_{\delta}(y, \hat{y}) = \begin{cases} \tfrac{1}{2}\left(y - \hat{y}\right)^{2} & \text{if } |y - \hat{y}| \le \delta \\ \delta\left(|y - \hat{y}| - \tfrac{1}{2}\delta\right) & \text{otherwise} \end{cases} $$

where y is the true value, ŷ is the predicted value, and the parameter δ regulates the application domains of the L1 and L2 penalties (set to the default of 1).
The RMSE measures the spread of prediction errors around the line of best fit and can be calculated using the following equation:

$$ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}} $$

where y_i is the true value for the i-th sample, ŷ_i is the predicted value for the i-th sample, and n is the number of samples. The model configuration with the lowest RMSE value was carried forward.
Model performance was evaluated using both the RMSE and the coefficient of determination (R²). The R² represents the proportion of the variance of the dependent variable explained by the independent model variables. It provides an indication of goodness of fit and can be defined as:

$$ R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^{2}} $$

where ȳ is the arithmetic mean of all true values.
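These loss and metric definitions translate directly into PyTorch, as sketched below; nn.HuberLoss is available in recent PyTorch versions (SmoothL1Loss is an equivalent alternative for δ = 1).

```python
import torch
import torch.nn.functional as F

huber = torch.nn.HuberLoss(delta=1.0)        # Huber loss with the default delta of 1

def rmse(y_true, y_pred):
    """Root-mean-square error."""
    return torch.sqrt(F.mse_loss(y_pred, y_true))

def r2_score(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = torch.sum((y_true - y_pred) ** 2)
    ss_tot = torch.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot
```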
Convolutional neural networks
Different CNN architectures, coded in PyTorch [22], were explored to learn the in vitro clearance endpoints from molecular images. The basic concept of CNNs is as follows: a series of convolutional layers, composed of convolution and pooling operators, is applied repeatedly to extract features at different scales before being processed by several fully connected layers (activated by a non-linear function), ending in an output layer whose size corresponds to the number of tasks (e.g., clearance endpoints) [26]. For each training epoch, the prediction outputs are evaluated against the corresponding experimental values to compute the training loss, which is then minimized according to a pre-defined loss function.
The following hyperparameters were optimized: batch size (32, 64, or 128), learning rate (10^-3 to 10^-5), number of convolutional layers (2-6), and number of fully connected layers (1-3). Other hyperparameters were kept fixed: 150 epochs with early stopping [27], gradient descent optimization with the Adam optimizer [28], ReLU [29] as the non-linear activation function, batch normalization [30], and a dropout [31] rate of 0.5 in the fully connected layers.
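A minimal sketch of such a configurable multi-task CNN is given below; the channel and layer sizes are hypothetical examples rather than the optimized values.

```python
import torch.nn as nn

class MolecularImageCNN(nn.Module):
    """Illustrative multi-task CNN for molecular images (hypothetical sizes)."""
    def __init__(self, in_channels=1, n_tasks=4,
                 conv_channels=(32, 64, 128), fc_sizes=(512, 128)):
        super().__init__()
        convs, prev = [], in_channels
        for ch in conv_channels:
            convs += [nn.Conv2d(prev, ch, kernel_size=3, padding=1),
                      nn.BatchNorm2d(ch), nn.ReLU(), nn.MaxPool2d(2)]
            prev = ch
        self.features = nn.Sequential(*convs)
        self.pool = nn.AdaptiveAvgPool2d(1)

        fcs, prev = [], conv_channels[-1]
        for size in fc_sizes:
            fcs += [nn.Linear(prev, size), nn.ReLU(), nn.Dropout(0.5)]
            prev = size
        fcs.append(nn.Linear(prev, n_tasks))   # one output per clearance endpoint
        self.head = nn.Sequential(*fcs)

    def forward(self, x):
        x = self.pool(self.features(x)).flatten(1)
        return self.head(x)
```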
Network weights were initialized either randomly or from networks pre-trained on the commonly used ImageNet [32] data set. The latter approach, known as transfer learning, is often practiced in computer vision to speed up convergence and improve predictive performance on novel tasks. Weight imports from these architectures could be partial, as in a simple configuration we termed SimpleNet, where only the convolutional operators were acquired from a pre-trained VGG [33], or total, where all the pre-trained weights from the Chemception [34], ResNet [35], and EfficientNet [36] architectures were applied. As we primarily focused on multi-task predictions, the CNNs were optimized in multi-task mode to derive an optimal configuration, which was later used to train single-task CNN variants and enable a fair comparison between single- and multi-task learning.
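Partial weight transfer can be sketched with torchvision as below; the regression head shown here is a hypothetical stand-in for the SimpleNet configuration, and the weights argument applies to torchvision ≥ 0.13 (older versions use pretrained=True).

```python
import torch.nn as nn
from torchvision import models

# Reuse the pre-trained VGG-16 convolutional blocks and attach a freshly
# initialized regression head; VGG expects 3-channel input images.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
backbone = vgg.features                      # pre-trained convolutional operators only

head = nn.Sequential(
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(512, 256), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(256, 4),                       # four clearance endpoints
)
model = nn.Sequential(backbone, head)
```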
Machine learning for structural representations
Machine learning methods frequently employed with ECFPs, such as random forests [37], support vector regressors [38], and deep neural networks [39], were optimized and validated on these structural representations. To enable a fair comparison between image- and structure-based representations, the training-test splits were kept consistent across the different splitting strategies.
Random forest (RF) is an ensemble of decision trees, each built from a randomly sampled subset of the training data, a procedure commonly known as bootstrapping. Herein, we optimized the number of decision trees (50, 100, or 150) and the minimum number of samples left in terminal nodes (1-5) using the Scikit-learn [40] implementation.
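For illustration, the RF grid described above could be searched as follows; a plain grid search is shown here for simplicity, whereas the study used BOHB (see below), and X_train/y_train are placeholders for ECFP features and log10 clearance labels.

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [50, 100, 150], "min_samples_leaf": [1, 2, 3, 4, 5]}
rf_search = GridSearchCV(RandomForestRegressor(random_state=0), param_grid,
                         cv=3, scoring="neg_root_mean_squared_error")
# rf_search.fit(X_train, y_train)
```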
A support vector regressor (SVR) maps the training data (features) into a high-dimensional space, often via a non-linear kernel function, and derives a regression function lying as close as possible to the quantitative outputs (labels). The margin parameter ε controls the tolerable deviation between observed and predicted values, beyond which errors are penalized. The hyperparameter C, also known as the regularization slack variable, regulates how strongly deviations outside the ε tube are penalized. The SVRs were optimized over ε (0.001, 0.01, and 0.1), C (0.1, 1, and 10), and the kernel function (radial basis or Tanimoto [41]), and were likewise implemented via Scikit-learn [40].
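A Tanimoto kernel for binary fingerprints can be supplied to Scikit-learn's SVR as a callable, as sketched below; this is an illustrative implementation and may differ in detail from the one used in the study.

```python
import numpy as np
from sklearn.svm import SVR

def tanimoto_kernel(X, Y):
    """Tanimoto similarity between binary fingerprint matrices (sketch)."""
    dot = X @ Y.T
    norm_x = (X * X).sum(axis=1)[:, None]
    norm_y = (Y * Y).sum(axis=1)[None, :]
    return dot / (norm_x + norm_y - dot)

# Scikit-learn accepts a callable kernel that returns the Gram matrix
svr = SVR(kernel=tanimoto_kernel, C=1.0, epsilon=0.01)
# svr.fit(X_train, y_train); svr.predict(X_test)
```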
A deep neural network (DNN) applies a series of learned matrix multiplications and non-linear activation functions in units known as neurons, commonly organized in several fully connected layers. A loss function (such as one of those described previously) evaluates the difference between predicted and experimental values, and this difference is minimized through an iterative learning process. The DNNs were implemented in PyTorch [22] and optimized for batch size (32, 64, 128, 256, and 512), learning rate (5 × 10^-5, 10^-4, and 5 × 10^-4), neuron and layer configuration ([1000, 500], [1500, 750], [2000, 1000], [4000, 2000], [4000, 2000, 1000], or [5000, 2500]), and dropout rate (between 0.25 and 0.5).
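A minimal fingerprint-based DNN of this kind could be sketched as below; the hidden sizes and dropout value are taken from the search ranges above purely as examples.

```python
import torch.nn as nn

class FingerprintDNN(nn.Module):
    """Fully connected network on hashed fingerprints (illustrative sizes)."""
    def __init__(self, n_bits=2048, hidden=(2000, 1000), n_tasks=4, dropout=0.25):
        super().__init__()
        layers, prev = [], n_bits
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU(), nn.Dropout(dropout)]
            prev = h
        layers.append(nn.Linear(prev, n_tasks))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```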
Models were optimized with the Ray Tune [42] library, which implements the Bayesian Optimization with HyperBand (BOHB) [43] algorithm, set to 30 sampled experiments for our purposes. The methodological framework of RFs and SVRs allows only single-task learning, whereas DNNs can be trained in both single- and multi-task fashion.
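A BOHB-driven search with 30 sampled trials could be set up roughly as below. The module paths vary between Ray versions (older releases expose TuneBOHB under ray.tune.suggest.bohb), BOHB additionally requires the ConfigSpace and hpbandster packages, and the trainable and search space shown are placeholders.

```python
from ray import tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB

def train_model(config):
    # Placeholder trainable: in practice this trains a model with
    # config["lr"] / config["batch_size"] and reports the validation RMSE.
    tune.report(rmse=1.0)

search_space = {"lr": tune.loguniform(1e-5, 1e-3), "batch_size": tune.choice([32, 64, 128])}
analysis = tune.run(
    train_model,
    config=search_space,
    num_samples=30,                                   # 30 sampled experiments
    search_alg=TuneBOHB(metric="rmse", mode="min"),
    scheduler=HyperBandForBOHB(metric="rmse", mode="min"),
)
```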