Computational Prediction of the Isoform Specificity of Cytochrome P450 Substrates by an Improved Bayesian Method

Cytochrome P450 (CYP) is the most important drug-metabolizing enzyme in human beings. Each CYP isoform is able to metabolize a large number of compounds, and if patients take more than one drugs during the treatment, it is possible that some drugs would be metabolized by the same CYP isoform, leading to potential drug-drug interactions and side effects. Therefore, it is necessary to investigate the isoform specificity of CYP substrates. In this study, we constructed a data set consisting of 10 major CYP isoforms associated with 776 substrates, and used machine learning methods to construct the predictive models based on the features of structural and physicochemical properties of substrates. We also proposed a new method called Improved Bayesian method, which is suitable for small data sets and is able to construct more stable and accurate predictive models compared with other traditional machine learning models. Based on this method, the predictive performance of our method got the accuracy of 86% for the independent test, which was significantly better to the existing models. We believe that our proposed method will facilitate the understanding of drug metabolisms and help the large-scale analysis of drug-drug interactions.


Introduction
Cytochrome P450 (CYP450, CYP) is the most important drug-metabolizing enzymes in human beings [1] , which involves in the metabolism of more than 90% clinical drugs [2,3]. CYP450 is a superfamily of monooxygenases, which contains heme and can increase the water solubility and polarity of drugs to make them easy to excrete by oxidation reaction. 57 CYP isoforms have been discovered in human beings [4], and based on the similarity of amino acids, CYP450 can be divided to different families and subfamilies, in which xenobiotics are mainly metabolized by the families of CYP1, CYP2 and CYP3 [5].
The number of CYP isoforms is limited but the xenobiotics are various, so each CYP isoform is able to metabolize large number of compounds, which is quite different from the high specificity of most enzymes. In many cases, people will take more than one drugs during the treatment. Then it is possible that some drugs are metabolized by the same CYP isoform and lead to drug-drug interactions [6,7], in which the safety and efficacy of a drug can be altered by the co-administration of another drug, and this may result in serious side effects [6] and even death of the patients [8,9]. Therefore, it is quite important to know which CYP isoform will metabolize a drug candidate in the drug discovery stage [10].
In the past decades, both experimental and computational approaches have been developed to identify the isoform specificity of CYP450 substrate. Experimental assay is accurate but the process such as the bioluminescent and fluorescent assays [11] are quite resource-consuming. A variety of computational models including structure-based and feature-based methods have also been developed, in which the structure-based methods such as molecular dynamics simulation [12][13][14], quantum mechanical methods [15,16] and pharmacophore modeling [17,18] are based on the crystal structures of CYPs. However, some CYP structures are still unavailable, and because of the heme in the active site, the force field is difficult to construct, which makes these methods quite complex and time-consuming. The feature-based methods employ various physiochemical and structural features of substrates to predict which CYP isoforms will participate in the metabolism reactions based on machine learning technologies, which have been widely used in bioinformatics A large number of predictive models have been developed to predict the isoform specificity of CYP450 substrate using various machine learning techniques such as logistic regression, decision tree, naïve Bayes, k-nearest neighbors (KNN), support vector machine (SVM) and artificial neutral network (ANN) [27][28][29][30][31][32]. Block and Henry [27] constructed independent classifiers using the descriptors including molecular connectivity, E-state indices, Spartan descriptors and structural keys, but these classifier cannot get the isoform specificity straight and the accuracy is just about 70-80%. Michielan et al.
[28] developed several single-label and multi-label classification models with the accuracy of 75-95%, but the results seemed to come from the large number of negative samples because the sensitivity and precision was usually low and the more positive samples were used, the worse predictive performance was gotten. Zhang et al. [32] used decision tree, BP neutral network and ML-kNN to construct classification models, and if a substrate was metabolized by only one CYP isoform, the predictive performance is more than 90%, but if a substrate could be metabolized by more CYP isoforms, the performance would be down obviously. And above of all, these studies used small number of samples to construct the models, which limited the application fields and cause difficulties when the approaches were scaled to larger data set.
In this paper, we developed a new method called Improved Bayesian method, which not only used the samples whose biological activities are known, but also the samples whose activities are unknown, and two-steps training was used in the model construction. This method may be quite suitable for small data set, and is able to construct more stable and accurate predictive models to small data set compared with other machine learning models. Based on the Improved Bayesian method, we used the metabolism information of 776 compounds and up to 10 CYP isoforms (CYP 1A1, 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, 3A4) from Meta-CYP database to construct the predictive models of isoform specificity of CYP450 substrates. Compared with naïve Bayesian method, k-nearest neighbors and support vector machine, we found the predictive performance of our method was obviously better than others and got the accuracy of 86% for the independent test, which was a significant improvement to the existing models.

Feature selection
In this paper, a total of 176 descriptors representing various structural and physicochemical properties of substrates were subject to the feature selection. As we constructed 10 separate models for different CYP isoforms, the selected descriptors were different for each model. As shown in Figure   1, CYP2A6 and CYP2E1 prefer to metabolize small molecules and CYP3A4 prefers to metabolize large molecules, but it has little correlation between molecular weight and whether a molecule can be metabolized by CYP2D6, so in the model of CYP2A6, 2E1 and 3A4, the descriptor molecular weight was selected, but in the model of CYP2D6, this descriptor was not selected. Some descriptors were selected by most models. As shown in Table 1, the descriptors about molecular size and shape seem to appear frequently, such as chi0, VDistMa, weinerPol, Kier1, zagreb and Std_dim3, which implies molecular size and shape may have significant influence over the isoform specificity of CYP substrates. Other features such as hydrogen bond or molecular surface area seem also important, but the descriptor logP (Log of the octanol/water partition coefficient) did not be selected by any model, which is quite different from the studies on the other biological activities of CYP [18,[33][34][35][36][37][38]. Maybe the mechanism of isoform specificity of CYP substrates is somewhat different from other biological activities.

Model comparison
Four methods including Improved Bayesian method, naïve Bayesian method, k-nearest neighbors (kNN) and support vector machine (SVM) were used to construct our classification models based on the validation set. The selected descriptors were used as input and whether a molecule can be metabolize by the CYP isoform was output (represented by 1 or 0). Leave-one-out cross-validation was used to evaluate the predictive performance, and the result is represented by AUC of ROC curve, which is shown in Table 2, Figure 2 and Figure 3.
We can see Improved Bayesian method is superior to other methods in all models, while naïve Bayesian method seems to make the worst predictive performance. The results support that our method is a significant improvement to naïve Bayesian method and it is even better than some other machine learning methods such as SVM and kNN. The predictive performance also seems different among the models of different CYP isoforms. The accuracy of CYP2C8, 2C9 and 2C19 is lower than other models while CYP1A1, 2D6 and 2E1 make the best performance. This is possibly due to the different selected descriptors in each model.

Independent test
Besides the leave-one-out cross validation test, we also conducted an independent test on the models constructed by the improved Bayesian method, which simulates a true prediction since the model trained on one dataset is tested on an unseen dataset. The 600 molecules in the validation set were used to train the models and the other 176 molecules were used to test the models. As shown in Table 3, the predictive performance is similar to those of the leave-one-out cross validation test and the accuracy is up to 86% on average, suggesting that our method performed well on the independent test set.

Discussion
As the major drug-metabolizing enzymes in human beings, cytochrome P450s have been considered as the important factor of drug-drug interactions [6]. In many cases, people will take multiple drugs during the treatment, and if two drugs are metabolized by the same CYP isoform, mutual inhibition may occur and the blood concentration will abnormally increase when the patient still follow the scheduled treatment doses, which may lead to serious side effect [6]. In China, tradition Chinese medicines (TCM) are widely used, and sometimes, doctors will prescribe TCM combined with other medicines to improve the curative effect [39], but the problem of drug-drug interactions is usually neglected. The absence of rapid analysis tools is obviously a significant reason, and the computational methods with high accuracy are certainly a good solution.
In this work, we constructed novel predictive models to predict the isoform specificity of CYP substrates based on Improved Bayesian method. This method not only used the samples whose biological activities are known, but also used the samples whose activities are unknown (background samples). Compared with naïve Bayesian method, the relationship of features is taken into account and we no longer need to assume the independence of conditional probability. Compared with most machine learning methods, as the relationship of features is determined by the background samples, only small amount of model variables will be changed during the training, which may lead to more stable and reliable predictive performance and less probability of overfitting. In biostatistics and bioinformatics, we often face the dataset with small sample size but large number of features, which is usually hard to analyze, but with the help of background samples, we may construct much better models. Of course, there is still some limitation in our method. We have to assume that the correlation of all features is independent of the biological activity. If this assumption is not tenable, our method is inappropriate.
The common CYP isoform of two drugs is certainly a cause of drug-drug interactions, but there are still many other causes [40]. For example, some drugs can inhibit or induce the activity of CYP isoform straight, and these drugs are called inhibitors or inducers. Some drugs can interact with the transcription factors and influence the expression of CYP isoforms. All these phenomena increase the difficulty in accurate prediction of drug-drug interactions caused by CYPs, which need further investigation in future.

Improved Bayesian method
In this paper, we will introduce a new method to construct our classification models, which is called Improved Bayesian method. In bioinformatics, due to the limitation of experimental technology, we often face the dataset with small sample size. For example, if we had 1000 drug candidates, maybe only 100 candidates were known the corresponding CYP isoforms that metabolize them, and the activities of the other 900 candidates were unknown. In most machine learning methods, we will only use the 100 compounds as samples to construct the model, but in our method, we will use the other 900 samples as well. In other words, we will not only used the samples whose biological activities are known (activity samples), but also the samples whose biological activities are unknown (background samples). Figure 4, the model will be trained by two steps. First, the model is trained by the background samples, and we can get the distribution of each feature and correlations of all features.

As shown in
Next, after training with the activity samples, we can get the relationship between the target and all features. Thus, the activity samples will just change small amount of variables in the model, which may lead to more stable and reliable predictive performance and less probability of over-fitting. We do not have to construct simple model for small data set, and more accurate model may be built.
We will introduce the procedures of our Improved Bayesian method in detail.  Then (2) Thus, X̂ follows standard normal distribution. Note that F(x) is calculated based on the background samples.
Next, we need to calculate the correlation coefficient matrix R based on the normalized features. Of course, R is also calculated based on the background samples. Take note that the correlation coefficient may be meaningless if the variables do not follow normal distribution, so the normalization procedure is essential.

(b) Estimate the joint distribution function of all features
Here, we need two basic assumptions in our Improved Bayesian method. First, the correlation coefficient of two features is not dependent on the biological activity. For example, if a molecule has more atoms, its molecular weight is usually larger, and the positive correlation between the atomic number and molecular weight is irrelevant to the CYP specificity of these molecules. Second, if random variables X 1 , X 2 , … , X n follow normal distribution, n dimensional random variable (X 1 , X 2 , … , X n ) follows n dimensional normal distribution approximately. Obviously, the two assumptions are reasonable and easy to satisfy.
Based on the assumptions, the joint distribution function of (X 1 , X 2 , … , X n ) is (3) (c) Construct the classification models based on the activity samples.
Denote P(A) is the probability of event A. Each sample is described by n features X 1 , X 2 , … , X n , and belongs to one of the two classes C 0 or C 1 . Then, for a sample x = (x 1 , x 2 , … , x n ), we can calculate the probability that it belongs to C k (k = 0 or k = 1) based on Bayesian formula: (4) In naïve Bayesian classification, we should assume the independence of conditional probability, and then (5) But in our method, based on equation (1), (3), (4), we can calculate the probability straight. (6) where F i (x) and f i (x) are the distribution function and density function of X i respectively. Φ(x) and φ(x) are the distribution function and density function of standard normal distribution respectively. Φ n (x) and φ n (x) are the joint distribution function and joint density function of n dimensional normal distribution respectively. Take note that φ n (x) can be expressed as (7) where Σ = diag(σ 1 , … ,σ n ) is standard deviation matrix, μ = (μ 1 , … , μ n ) is mean vector, R is correlation coefficient matrix and x = (x 1 , x 2 , … , x n ).
where μ i and σ i are the mean and the standard deviation of feature i in all samples respectively. μ ki and σ ki are the mean and the standard deviation of feature i in the samples belonging to class Ck.
Plug equation (7)-(11) into equation (6), and take logarithm, we can get (12) Then, the larger Hk is, the more probability that x = (x 1 , x 2 , … , x n ) belongs to C k . We can compare H 0 and H 1 to determine whether x belongs to C 0 or C 1 , or we can plot receiver operating characteristic curve (ROC curve) and select the best threshold to determine the classification.
The number of compounds metabolized by each CYP isoform and their distribution are shown in Table   4. We can see the total number is up to 1646, which means each compound will be metabolized by about two CYP isoforms on average. The set of 776 compounds were randomly divided into two data sets. The first set was composed of 600 samples, taken as the validation set for feature selection and model construction. The rest 176 samples constituted the independent test set for the model test.
Meta-CYP database and the analysis tools based on our proposed methods are currently under construction by our laboratory, in which the information used in this paper was mainly form SuperCYP It should be noted that inorganic compounds and polypeptides are not including in our dataset, because the features describing these compounds are quite different from other small molecules. So the models constructed in this study are not suitable for the two types of compounds.

Molecular descriptors calculation
We employed the molecular descriptors representing various structural and physicochemical properties of the compounds including atomic composition, molecular size, shape, weight, hydrogen bond, hydrophobicity, acidity or basicity, charge distribution and energy of chemical bonds to construct our models. To accurately calculate these molecular descriptors, the chemical structures of all compounds were first drawn by SAMM, an in-house molecular mechanics and drug design package.
Next, all structures were further energy minimized by the MMFF94 force field [47] . At last, the molecular descriptors were calculated based on the energy-minimized structures by SAMM. After removing the descriptors that were highly correlated or had the same value in almost all compounds, 176 molecular descriptors including 134 two-dimensional and 42 three-dimensional descriptors were gotten and would be subject to further feature selection.

Feature selection
Feature selection is performed prior to the construction of the predictive models in order to eliminate the redundant features and identify the most relevant features [48][49][50] . Minimum Redundancy Maximum Relevance (MRMR) [51] was applied to select the best features in this paper, which is a powerful method for getting useful information from complex systems through maximizing the mutual information between the selected features and the classification target, and in the meanwhile, minimizing the correlation of the features themselves [52] . In this work, we selected the top 20 features with the highest scores in the model construction.

Model comparison and evaluation
Besides Improved Bayesian method, we also used naïve Bayesian method, k-nearest neighbors and support vector machine [53] to construct our classification models based on the validation set for model comparison. In naïve Bayesian method, the Laplace smoothing factor was 1. The parameters of k-nearest neighbors was k = 5, Euclidean distance and distance weighting function = 1/d. The kernel function in support vector machine was radial basis function and other parameters were optimized by LIBSVM [54] .
For the constructed models, we would use AUC (area under the curve) of receiver operating characteristic curve (ROC) for model evaluation. Besides that, we would calculate specificity, sensitivity, accuracy as well.
where TP, FP, TN, FN means true positive, false positive, true negative and false negative respectively.

Conclusions
In this study, we employed machine learning methods to predict the CYP isoform specificity of 776 compounds with their corresponding CYP enzymes, including CYP 1A1, 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1 and 3A4. The improved Bayesian method was proposed and applied in the model construction based on the features of structural and physicochemical properties of substrates. In comparison with the naïve Bayesian method, k-nearest neighbors and support vector machine, it was found that the predictive models built by our method achieved the best accuracy, which is up to 86% accuracy on average on the independent test. Using such a model, we can efficiently and reliably estimate the metabolic paths of a given compound involved in CYP-mediated metabolism of these ten CYPs. We believed that our proposed method will be potentially useful in large-scale in silico drug screening and drug-drug interaction analysis.       Figure 1 The The comparison of the validation results by four methods and ten CYP models ROC curves of ten CYP models with four methods Figure 4 Illustration of the two-steps training in Improved Bayesian method. The gray circle represents each feature and the black circle represents the target. The gray lines and the black arrows represent the relationship among features and the relationship between