Computational Prediction of the Isoform Specificity of Cytochrome P450 Substrates by an Improved Bayesian Method
Cytochrome P450 (CYP) is the most important drugmetabolizing enzyme in human beings. Each CYP isoform is able to metabolize a large number of compounds, and if patients take more than one drugs during the treatment, it is possible that some drugs would be metabolized by the same CYP isoform, leading to potential drugdrug interactions and side effects. Therefore, it is necessary to investigate the isoform specificity of CYP substrates. In this study, we constructed a data set consisting of 10 major CYP isoforms associated with 776 substrates, and used machine learning methods to construct the predictive models based on the features of structural and physicochemical properties of substrates. We also proposed a new method called Improved Bayesian method, which is suitable for small data sets and is able to construct more stable and accurate predictive models compared with other traditional machine learning models. Based on this method, the predictive performance of our method got the accuracy of 86% for the independent test, which was significantly better to the existing models. We believe that our proposed method will facilitate the understanding of drug metabolisms and help the largescale analysis of drugdrug interactions.
Figure 1
Figure 2
Figure 3
Figure 4
Cytochrome P450 (CYP450, CYP) is the most important drugmetabolizing enzymes in human beings [1] , which involves in the metabolism of more than 90% clinical drugs [2, 3]. CYP450 is a superfamily of monooxygenases, which contains heme and can increase the water solubility and polarity of drugs to make them easy to excrete by oxidation reaction. 57 CYP isoforms have been discovered in human beings[4], and based on the similarity of amino acids, CYP450 can be divided to different families and subfamilies, in which xenobiotics are mainly metabolized by the families of CYP1, CYP2 and CYP3 [5]. The number of CYP isoforms is limited but the xenobiotics are various, so each CYP isoform is able to metabolize large number of compounds, which is quite different from the high specificity of most enzymes. In many cases, people will take more than one drugs during the treatment. Then it is possible that some drugs are metabolized by the same CYP isoform and lead to drugdrug interactions [6, 7], in which the safety and efficacy of a drug can be altered by the coadministration of another drug, and this may result in serious side effects [6] and even death of the patients [8, 9]. Therefore, it is quite important to know which CYP isoform will metabolize a drug candidate in the drug discovery stage [10].
In the past decades, both experimental and computational approaches have been developed to identify the isoform specificity of CYP450 substrate. Experimental assay is accurate but the process such as the bioluminescent and fluorescent assays [11] are quite resourceconsuming. A variety of computational models including structurebased and featurebased methods have also been developed, in which the structurebased methods such as molecular dynamics simulation [1214], quantum mechanical methods [15, 16] and pharmacophore modeling [17, 18] are based on the crystal structures of CYPs. However, some CYP structures are still unavailable, and because of the heme in the active site, the force field is difficult to construct, which makes these methods quite complex and timeconsuming. The featurebased methods employ various physiochemical and structural features of substrates to predict which CYP isoforms will participate in the metabolism reactions based on machine learning technologies, which have been widely used in bioinformatics [1921] and drug discovery [2224]. These methods only employ the structures of substrates and are easy to develop and use, which are suitable for the analysis of drug metabolism [25] and drugdrug interactions [26] in large scale.
A large number of predictive models have been developed to predict the isoform specificity of CYP450 substrate using various machine learning techniques such as logistic regression, decision tree, naïve Bayes, knearest neighbors (KNN), support vector machine (SVM) and artificial neutral network (ANN) [2732]. Block and Henry [27] constructed independent classifiers using the descriptors including molecular connectivity, Estate indices, Spartan descriptors and structural keys, but these classifier cannot get the isoform specificity straight and the accuracy is just about 7080%. Michielan et al. [28] developed several singlelabel and multilabel classification models with the accuracy of 7595%, but the results seemed to come from the large number of negative samples because the sensitivity and precision was usually low and the more positive samples were used, the worse predictive performance was gotten. Zhang et al. [32] used decision tree, BP neutral network and MLkNN to construct classification models, and if a substrate was metabolized by only one CYP isoform, the predictive performance is more than 90%, but if a substrate could be metabolized by more CYP isoforms, the performance would be down obviously. And above of all, these studies used small number of samples to construct the models, which limited the application fields and cause difficulties when the approaches were scaled to larger data set.
In this paper, we developed a new method called Improved Bayesian method, which not only used the samples whose biological activities are known, but also the samples whose activities are unknown, and twosteps training was used in the model construction. This method may be quite suitable for small data set, and is able to construct more stable and accurate predictive models to small data set compared with other machine learning models. Based on the Improved Bayesian method, we used the metabolism information of 776 compounds and up to 10 CYP isoforms (CYP 1A1, 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, 3A4) from MetaCYP database to construct the predictive models of isoform specificity of CYP450 substrates. Compared with naïve Bayesian method, knearest neighbors and support vector machine, we found the predictive performance of our method was obviously better than others and got the accuracy of 86% for the independent test, which was a significant improvement to the existing models.
Feature selection
In this paper, a total of 176 descriptors representing various structural and physicochemical properties of substrates were subject to the feature selection. As we constructed 10 separate models for different CYP isoforms, the selected descriptors were different for each model. As shown in Figure 1, CYP2A6 and CYP2E1 prefer to metabolize small molecules and CYP3A4 prefers to metabolize large molecules, but it has little correlation between molecular weight and whether a molecule can be metabolized by CYP2D6, so in the model of CYP2A6, 2E1 and 3A4, the descriptor molecular weight was selected, but in the model of CYP2D6, this descriptor was not selected.
Some descriptors were selected by most models. As shown in Table 1, the descriptors about molecular size and shape seem to appear frequently, such as chi0, VDistMa, weinerPol, Kier1, zagreb and Std_dim3, which implies molecular size and shape may have significant influence over the isoform specificity of CYP substrates. Other features such as hydrogen bond or molecular surface area seem also important, but the descriptor logP (Log of the octanol/water partition coefficient) did not be selected by any model, which is quite different from the studies on the other biological activities of CYP [18, 3338]. Maybe the mechanism of isoform specificity of CYP substrates is somewhat different from other biological activities.
Model comparison
Four methods including Improved Bayesian method, naïve Bayesian method, knearest neighbors (kNN) and support vector machine (SVM) were used to construct our classification models based on the validation set. The selected descriptors were used as input and whether a molecule can be metabolize by the CYP isoform was output (represented by 1 or 0). Leaveoneout crossvalidation was used to evaluate the predictive performance, and the result is represented by AUC of ROC curve, which is shown in Table 2, Figure 2 and Figure 3.
We can see Improved Bayesian method is superior to other methods in all models, while naïve Bayesian method seems to make the worst predictive performance. The results support that our method is a significant improvement to naïve Bayesian method and it is even better than some other machine learning methods such as SVM and kNN. The predictive performance also seems different among the models of different CYP isoforms. The accuracy of CYP2C8, 2C9 and 2C19 is lower than other models while CYP1A1, 2D6 and 2E1 make the best performance. This is possibly due to the different selected descriptors in each model.
Independent test
Besides the leaveoneout cross validation test, we also conducted an independent test on the models constructed by the improved Bayesian method, which simulates a true prediction since the model trained on one dataset is tested on an unseen dataset. The 600 molecules in the validation set were used to train the models and the other 176 molecules were used to test the models. As shown in Table 3, the predictive performance is similar to those of the leaveoneout cross validation test and the accuracy is up to 86% on average, suggesting that our method performed well on the independent test set.
As the major drugmetabolizing enzymes in human beings, cytochrome P450s have been considered as the important factor of drugdrug interactions [6]. In many cases, people will take multiple drugs during the treatment, and if two drugs are metabolized by the same CYP isoform, mutual inhibition may occur and the blood concentration will abnormally increase when the patient still follow the scheduled treatment doses, which may lead to serious side effect [6]. In China, tradition Chinese medicines (TCM) are widely used, and sometimes, doctors will prescribe TCM combined with other medicines to improve the curative effect [39], but the problem of drugdrug interactions is usually neglected. The absence of rapid analysis tools is obviously a significant reason, and the computational methods with high accuracy are certainly a good solution.
In this work, we constructed novel predictive models to predict the isoform specificity of CYP substrates based on Improved Bayesian method. This method not only used the samples whose biological activities are known, but also used the samples whose activities are unknown (background samples). Compared with naïve Bayesian method, the relationship of features is taken into account and we no longer need to assume the independence of conditional probability. Compared with most machine learning methods, as the relationship of features is determined by the background samples, only small amount of model variables will be changed during the training, which may lead to more stable and reliable predictive performance and less probability of overfitting. In biostatistics and bioinformatics, we often face the dataset with small sample size but large number of features, which is usually hard to analyze, but with the help of background samples, we may construct much better models. Of course, there is still some limitation in our method. We have to assume that the correlation of all features is independent of the biological activity. If this assumption is not tenable, our method is inappropriate.
The common CYP isoform of two drugs is certainly a cause of drugdrug interactions, but there are still many other causes [40]. For example, some drugs can inhibit or induce the activity of CYP isoform straight, and these drugs are called inhibitors or inducers. Some drugs can interact with the transcription factors and influence the expression of CYP isoforms. All these phenomena increase the difficulty in accurate prediction of drugdrug interactions caused by CYPs, which need further investigation in future.
Improved Bayesian method
In this paper, we will introduce a new method to construct our classification models, which is called Improved Bayesian method. In bioinformatics, due to the limitation of experimental technology, we often face the dataset with small sample size. For example, if we had 1000 drug candidates, maybe only 100 candidates were known the corresponding CYP isoforms that metabolize them, and the activities of the other 900 candidates were unknown. In most machine learning methods, we will only use the 100 compounds as samples to construct the model, but in our method, we will use the other 900 samples as well. In other words, we will not only used the samples whose biological activities are known (activity samples), but also the samples whose biological activities are unknown (background samples).
As shown in Figure 4, the model will be trained by two steps. First, the model is trained by the background samples, and we can get the distribution of each feature and correlations of all features. Next, after training with the activity samples, we can get the relationship between the target and all features. Thus, the activity samples will just change small amount of variables in the model, which may lead to more stable and reliable predictive performance and less probability of overfitting. We do not have to construct simple model for small data set, and more accurate model may be built.
We will introduce the procedures of our Improved Bayesian method in detail.
(a) Make all features follow normal distribution similarly, and calculate the correlation coefficient matrix R based on the normalized features
Denote F(x) is the cumulative distribution function of feature X. Φ(x) and Φ^{1}(x) are the distribution function and inverse distribution function of standard normal distribution respectively. Let
(1) 
Then
(2) 
Thus, X̂ follows standard normal distribution. Note that F(x) is calculated based on the background samples.
Next, we need to calculate the correlation coefficient matrix R based on the normalized features. Of course, R is also calculated based on the background samples. Take note that the correlation coefficient may be meaningless if the variables do not follow normal distribution, so the normalization procedure is essential.
(b) Estimate the joint distribution function of all features
Here, we need two basic assumptions in our Improved Bayesian method. First, the correlation coefficient of two features is not dependent on the biological activity. For example, if a molecule has more atoms, its molecular weight is usually larger, and the positive correlation between the atomic number and molecular weight is irrelevant to the CYP specificity of these molecules. Second, if random variables X_{1}, X_{2}, … , X_{n} follow normal distribution, n dimensional random variable (X_{1}, X_{2}, … , X_{n}) follows n dimensional normal distribution approximately. Obviously, the two assumptions are reasonable and easy to satisfy.
Based on the assumptions, the joint distribution function of (X_{1}, X_{2}, … , X_{n}) is
(3) 
(c) Construct the classification models based on the activity samples.
Denote P(A) is the probability of event A. Each sample is described by n features X_{1}, X_{2}, … , X_{n}, and belongs to one of the two classes C_{0} or C_{1}. Then, for a sample x = (x_{1}, x_{2}, … , x_{n}), we can calculate the probability that it belongs to C_{k} (k = 0 or k = 1) based on Bayesian formula:
(4) 
In naïve Bayesian classification, we should assume the independence of conditional probability, and then
(5) 
But in our method, based on equation (1), (3), (4), we can calculate the probability straight.
(6) 
where F_{i}(x) and f_{i}(x) are the distribution function and density function of X_{i} respectively. Φ(x) and φ(x) are the distribution function and density function of standard normal distribution respectively. Φ_{n}(x) and φ_{n}(x) are the joint distribution function and joint density function of n dimensional normal distribution respectively.
Take note that φ_{n}(x) can be expressed as
(7) 
where Σ = diag(σ_{1}, … ,σ_{n}) is standard deviation matrix, μ = (μ_{1}, … , μ_{n}) is mean vector, R is correlation coefficient matrix and x = (x_{1}, x_{2}, … , x_{n}).
As R is real symmetric matrix, it must exist orthogonal matrix P = (α_{1}, α_{2}, … , α_{n}), which makes

 (8)  
and  
 (9) 
where λ_{i} and α_{i} are the eigenvalues and corresponding eigenvectors of R.
Let
(10)  
(11) 
where μ_{i} and σ_{i} are the mean and the standard deviation of feature i in all samples respectively. μ_{ki} and σ_{ki} are the mean and the standard deviation of feature i in the samples belonging to class Ck.
Plug equation (7)(11) into equation (6), and take logarithm, we can get
(12) 
Then, the larger Hk is, the more probability that x = (x_{1}, x_{2}, … , x_{n}) belongs to C_{k}. We can compare H_{0} and H_{1} to determine whether x belongs to C_{0} or C_{1}, or we can plot receiver operating characteristic curve (ROC curve) and select the best threshold to determine the classification.
Dataset
We collected 776 compounds from the MetaCYP database, and each compound can be metabolized by at least one of the ten CYP isoforms: CYP1A1, 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1 and 3A4. The number of compounds metabolized by each CYP isoform and their distribution are shown in Table 4. We can see the total number is up to 1646, which means each compound will be metabolized by about two CYP isoforms on average. The set of 776 compounds were randomly divided into two data sets. The first set was composed of 600 samples, taken as the validation set for feature selection and model construction. The rest 176 samples constituted the independent test set for the model test. MetaCYP database and the analysis tools based on our proposed methods are currently under construction by our laboratory, in which the information used in this paper was mainly form SuperCYP (http://bioinformatics.charite.de/supercyp/index.php) [41] , CYP450 Engineering Database (https://cyped.biocatnet.de) 42[, 43] and PubChem Compound (https://www.ncbi.nlm.nih.gov/pccompound) [44] . The detailed information about this data set is available in Supplementary Material. In this paper, background samples were gotten from DrugBank database (http://www.drugbank.ca) 45[, 46] , including the approved, experimental and investigational drugs, and 7114 compounds in all.
It should be noted that inorganic compounds and polypeptides are not including in our dataset, because the features describing these compounds are quite different from other small molecules. So the models constructed in this study are not suitable for the two types of compounds.
Molecular descriptors calculation
We employed the molecular descriptors representing various structural and physicochemical properties of the compounds including atomic composition, molecular size, shape, weight, hydrogen bond, hydrophobicity, acidity or basicity, charge distribution and energy of chemical bonds to construct our models. To accurately calculate these molecular descriptors, the chemical structures of all compounds were first drawn by SAMM, an inhouse molecular mechanics and drug design package. Next, all structures were further energy minimized by the MMFF94 force field[47] . At last, the molecular descriptors were calculated based on the energyminimized structures by SAMM. After removing the descriptors that were highly correlated or had the same value in almost all compounds, 176 molecular descriptors including 134 twodimensional and 42 threedimensional descriptors were gotten and would be subject to further feature selection.
Feature selection
Feature selection is performed prior to the construction of the predictive models in order to eliminate the redundant features and identify the most relevant features [4850] . Minimum Redundancy Maximum Relevance (MRMR)[51] was applied to select the best features in this paper, which is a powerful method for getting useful information from complex systems through maximizing the mutual information between the selected features and the classification target, and in the meanwhile, minimizing the correlation of the features themselves[52] . In this work, we selected the top 20 features with the highest scores in the model construction.
Model comparison and evaluation
Besides Improved Bayesian method, we also used naïve Bayesian method, knearest neighbors and support vector machine[53] to construct our classification models based on the validation set for model comparison. In naïve Bayesian method, the Laplace smoothing factor was 1. The parameters of knearest neighbors was k = 5, Euclidean distance and distance weighting function = 1/d. The kernel function in support vector machine was radial basis function and other parameters were optimized by LIBSVM[54] .
For the constructed models, we would use AUC (area under the curve) of receiver operating characteristic curve (ROC) for model evaluation. Besides that, we would calculate specificity, sensitivity, accuracy as well.
In this study, we employed machine learning methods to predict the CYP isoform specificity of 776 compounds with their corresponding CYP enzymes, including CYP 1A1, 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1 and 3A4. The improved Bayesian method was proposed and applied in the model construction based on the features of structural and physicochemical properties of substrates. In comparison with the naïve Bayesian method, knearest neighbors and support vector machine, it was found that the predictive models built by our method achieved the best accuracy, which is up to 86% accuracy on average on the independent test. Using such a model, we can efficiently and reliably estimate the metabolic paths of a given compound involved in CYPmediated metabolism of these ten CYPs. We believed that our proposed method will be potentially useful in largescale in silico drug screening and drugdrug interaction analysis.
This work was supported by the fundings from National Key Research and Development Program of China (Contract No. 2016YFA0501703), and National Natural Science Foundation of China (Grant No. 31601074, 61872094, and 61832019).
Table 1. The features selected by most models and their descriptions
Name 
Description 
a_heavy 
Number of heavy atoms 
a_acc 
Number of hydrogen bond acceptor atoms 
a_IC 
Atom information content (total). Let n_{i} be the number of occurrences of atomic number i in the molecule, p_{i} = n_{i} / n where n is the sum of the n_{i}. Then a_IC = n∑p_{i} log p_{i} 
b_single 
Number of single bonds (not including aromatic bonds). 
bpol 
Sum of the absolute value of the difference between atomic polarizabilities[55] of all bonded atoms in the molecule 
chi0 
Atomic connectivity index (order 0) [56, 57]. The sum of 1/sqrt(d_{i}) over all heavy atoms i with d_{i} > 0, where d_{i} is the number of bonded neighbors of the atom i in the hydrogen suppressed graph 
VDistMa 
Sum of log_{2} m  D_{ij} log_{2} D_{ij} / m over all i and j, where D_{ij} is the length of the shortest path from atoms i to j, and m is the sum of D_{ij} 
weinerPol 
Wiener polarity number: half the sum of all the distance matrix entries with a value of 3 as defined in [58] 
Kier1 
First kappa shape index: (n1)^{2} / m^{2}, where n and m denotes the number of atoms and the number of bonds in the hydrogen suppressed graph respectively[56] 
PEOE_VSA_PPOS 
Total positive polar van der Waals surface area. Sum of the v_{i} such that q_{i} is greater than 0.2. The v_{i} is calculated using a connection table approximation and q_{i} is the partial charge of atom i calculated by the Partial Equalization of Orbital Electronegativities (PEOE) method[59] 
vdw_area 
Area of van der Waals surface 
mr 
Molecular refractivity 
zagreb 
Zagreb index: the sum of d_{i}^{2} over all heavy atoms i 
Std_dim3 
Standard dimension 3: the square root of the third largest eigenvalue of the covariance matrix of the atomic coordinates. A standard dimension is equivalent to the standard deviation along a principal component axis 
Table 2. The validation result of ten CYP predictive models by four methods, represented by AUC.
1A1 
1A2 
2A6 
2B6 
2C19 
2C8 
2C9 
2D6 
2E1 
3A4 

Improved Bayesian method 
0.92 
0.84 
0.87 
0.88 
0.84 
0.80 
0.83 
0.92 
0.91 
0.88 
naïve Bayesian method 
0.79 
0.70 
0.81 
0.76 
0.72 
0.69 
0.71 
0.82 
0.85 
0.82 
kNN 
0.82 
0.75 
0.84 
0.80 
0.75 
0.73 
0.77 
0.84 
0.88 
0.81 
SVM 
0.85 
0.73 
0.83 
0.80 
0.75 
0.76 
0.79 
0.88 
0.89 
0.82 
Table 3. The results of independent test set calculated by the improved Bayesian method
1A1 
1A2 
2A6 
2B6 
2C19 
2C8 
2C9 
2D6 
2E1 
3A4 
Avg. 

Specificity 
0.95 
0.93 
0.95 
0.94 
0.87 
0.91 
0.90 
0.86 
0.95 
0.72 
0.90 
Sensitivity 
0.58 
0.48 
0.75 
0.53 
0.71 
0.56 
0.66 
0.76 
0.75 
0.86 
0.66 
Accuracy 
0.90 
0.81 
0.93 
0.86 
0.85 
0.86 
0.85 
0.83 
0.93 
0.78 
0.86 
Table 4. The number of substrates metabolized by each CYP450 isoform
1A1 
1A2 
2A6 
2B6 
2C19 
2C8 
2C9 
2D6 
2E1 
3A4 
Total 
87 
167 
83 
109 
144 
115 
166 
225 
127 
423 
1646 
This is a list of supplementary files associated with the primary manuscript. Click to download.
Posted 23 May, 2019
Computational Prediction of the Isoform Specificity of Cytochrome P450 Substrates by an Improved Bayesian Method
Cytochrome P450 (CYP) is the most important drugmetabolizing enzyme in human beings. Each CYP isoform is able to metabolize a large number of compounds, and if patients take more than one drugs during the treatment, it is possible that some drugs would be metabolized by the same CYP isoform, leading to potential drugdrug interactions and side effects. Therefore, it is necessary to investigate the isoform specificity of CYP substrates. In this study, we constructed a data set consisting of 10 major CYP isoforms associated with 776 substrates, and used machine learning methods to construct the predictive models based on the features of structural and physicochemical properties of substrates. We also proposed a new method called Improved Bayesian method, which is suitable for small data sets and is able to construct more stable and accurate predictive models compared with other traditional machine learning models. Based on this method, the predictive performance of our method got the accuracy of 86% for the independent test, which was significantly better to the existing models. We believe that our proposed method will facilitate the understanding of drug metabolisms and help the largescale analysis of drugdrug interactions.
Figure 1
Figure 2
Figure 3
Figure 4
Cytochrome P450 (CYP450, CYP) is the most important drugmetabolizing enzymes in human beings [1] , which involves in the metabolism of more than 90% clinical drugs [2, 3]. CYP450 is a superfamily of monooxygenases, which contains heme and can increase the water solubility and polarity of drugs to make them easy to excrete by oxidation reaction. 57 CYP isoforms have been discovered in human beings[4], and based on the similarity of amino acids, CYP450 can be divided to different families and subfamilies, in which xenobiotics are mainly metabolized by the families of CYP1, CYP2 and CYP3 [5]. The number of CYP isoforms is limited but the xenobiotics are various, so each CYP isoform is able to metabolize large number of compounds, which is quite different from the high specificity of most enzymes. In many cases, people will take more than one drugs during the treatment. Then it is possible that some drugs are metabolized by the same CYP isoform and lead to drugdrug interactions [6, 7], in which the safety and efficacy of a drug can be altered by the coadministration of another drug, and this may result in serious side effects [6] and even death of the patients [8, 9]. Therefore, it is quite important to know which CYP isoform will metabolize a drug candidate in the drug discovery stage [10].
In the past decades, both experimental and computational approaches have been developed to identify the isoform specificity of CYP450 substrate. Experimental assay is accurate but the process such as the bioluminescent and fluorescent assays [11] are quite resourceconsuming. A variety of computational models including structurebased and featurebased methods have also been developed, in which the structurebased methods such as molecular dynamics simulation [1214], quantum mechanical methods [15, 16] and pharmacophore modeling [17, 18] are based on the crystal structures of CYPs. However, some CYP structures are still unavailable, and because of the heme in the active site, the force field is difficult to construct, which makes these methods quite complex and timeconsuming. The featurebased methods employ various physiochemical and structural features of substrates to predict which CYP isoforms will participate in the metabolism reactions based on machine learning technologies, which have been widely used in bioinformatics [1921] and drug discovery [2224]. These methods only employ the structures of substrates and are easy to develop and use, which are suitable for the analysis of drug metabolism [25] and drugdrug interactions [26] in large scale.
A large number of predictive models have been developed to predict the isoform specificity of CYP450 substrate using various machine learning techniques such as logistic regression, decision tree, naïve Bayes, knearest neighbors (KNN), support vector machine (SVM) and artificial neutral network (ANN) [2732]. Block and Henry [27] constructed independent classifiers using the descriptors including molecular connectivity, Estate indices, Spartan descriptors and structural keys, but these classifier cannot get the isoform specificity straight and the accuracy is just about 7080%. Michielan et al. [28] developed several singlelabel and multilabel classification models with the accuracy of 7595%, but the results seemed to come from the large number of negative samples because the sensitivity and precision was usually low and the more positive samples were used, the worse predictive performance was gotten. Zhang et al. [32] used decision tree, BP neutral network and MLkNN to construct classification models, and if a substrate was metabolized by only one CYP isoform, the predictive performance is more than 90%, but if a substrate could be metabolized by more CYP isoforms, the performance would be down obviously. And above of all, these studies used small number of samples to construct the models, which limited the application fields and cause difficulties when the approaches were scaled to larger data set.
In this paper, we developed a new method called Improved Bayesian method, which not only used the samples whose biological activities are known, but also the samples whose activities are unknown, and twosteps training was used in the model construction. This method may be quite suitable for small data set, and is able to construct more stable and accurate predictive models to small data set compared with other machine learning models. Based on the Improved Bayesian method, we used the metabolism information of 776 compounds and up to 10 CYP isoforms (CYP 1A1, 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, 3A4) from MetaCYP database to construct the predictive models of isoform specificity of CYP450 substrates. Compared with naïve Bayesian method, knearest neighbors and support vector machine, we found the predictive performance of our method was obviously better than others and got the accuracy of 86% for the independent test, which was a significant improvement to the existing models.
Feature selection
In this paper, a total of 176 descriptors representing various structural and physicochemical properties of substrates were subject to the feature selection. As we constructed 10 separate models for different CYP isoforms, the selected descriptors were different for each model. As shown in Figure 1, CYP2A6 and CYP2E1 prefer to metabolize small molecules and CYP3A4 prefers to metabolize large molecules, but it has little correlation between molecular weight and whether a molecule can be metabolized by CYP2D6, so in the model of CYP2A6, 2E1 and 3A4, the descriptor molecular weight was selected, but in the model of CYP2D6, this descriptor was not selected.
Some descriptors were selected by most models. As shown in Table 1, the descriptors about molecular size and shape seem to appear frequently, such as chi0, VDistMa, weinerPol, Kier1, zagreb and Std_dim3, which implies molecular size and shape may have significant influence over the isoform specificity of CYP substrates. Other features such as hydrogen bond or molecular surface area seem also important, but the descriptor logP (Log of the octanol/water partition coefficient) did not be selected by any model, which is quite different from the studies on the other biological activities of CYP [18, 3338]. Maybe the mechanism of isoform specificity of CYP substrates is somewhat different from other biological activities.
Model comparison
Four methods including Improved Bayesian method, naïve Bayesian method, knearest neighbors (kNN) and support vector machine (SVM) were used to construct our classification models based on the validation set. The selected descriptors were used as input and whether a molecule can be metabolize by the CYP isoform was output (represented by 1 or 0). Leaveoneout crossvalidation was used to evaluate the predictive performance, and the result is represented by AUC of ROC curve, which is shown in Table 2, Figure 2 and Figure 3.
We can see Improved Bayesian method is superior to other methods in all models, while naïve Bayesian method seems to make the worst predictive performance. The results support that our method is a significant improvement to naïve Bayesian method and it is even better than some other machine learning methods such as SVM and kNN. The predictive performance also seems different among the models of different CYP isoforms. The accuracy of CYP2C8, 2C9 and 2C19 is lower than other models while CYP1A1, 2D6 and 2E1 make the best performance. This is possibly due to the different selected descriptors in each model.
Independent test
Besides the leaveoneout cross validation test, we also conducted an independent test on the models constructed by the improved Bayesian method, which simulates a true prediction since the model trained on one dataset is tested on an unseen dataset. The 600 molecules in the validation set were used to train the models and the other 176 molecules were used to test the models. As shown in Table 3, the predictive performance is similar to those of the leaveoneout cross validation test and the accuracy is up to 86% on average, suggesting that our method performed well on the independent test set.
As the major drugmetabolizing enzymes in human beings, cytochrome P450s have been considered as the important factor of drugdrug interactions [6]. In many cases, people will take multiple drugs during the treatment, and if two drugs are metabolized by the same CYP isoform, mutual inhibition may occur and the blood concentration will abnormally increase when the patient still follow the scheduled treatment doses, which may lead to serious side effect [6]. In China, tradition Chinese medicines (TCM) are widely used, and sometimes, doctors will prescribe TCM combined with other medicines to improve the curative effect [39], but the problem of drugdrug interactions is usually neglected. The absence of rapid analysis tools is obviously a significant reason, and the computational methods with high accuracy are certainly a good solution.
In this work, we constructed novel predictive models to predict the isoform specificity of CYP substrates based on Improved Bayesian method. This method not only used the samples whose biological activities are known, but also used the samples whose activities are unknown (background samples). Compared with naïve Bayesian method, the relationship of features is taken into account and we no longer need to assume the independence of conditional probability. Compared with most machine learning methods, as the relationship of features is determined by the background samples, only small amount of model variables will be changed during the training, which may lead to more stable and reliable predictive performance and less probability of overfitting. In biostatistics and bioinformatics, we often face the dataset with small sample size but large number of features, which is usually hard to analyze, but with the help of background samples, we may construct much better models. Of course, there is still some limitation in our method. We have to assume that the correlation of all features is independent of the biological activity. If this assumption is not tenable, our method is inappropriate.
The common CYP isoform of two drugs is certainly a cause of drugdrug interactions, but there are still many other causes [40]. For example, some drugs can inhibit or induce the activity of CYP isoform straight, and these drugs are called inhibitors or inducers. Some drugs can interact with the transcription factors and influence the expression of CYP isoforms. All these phenomena increase the difficulty in accurate prediction of drugdrug interactions caused by CYPs, which need further investigation in future.
Improved Bayesian method
In this paper, we will introduce a new method to construct our classification models, which is called Improved Bayesian method. In bioinformatics, due to the limitation of experimental technology, we often face the dataset with small sample size. For example, if we had 1000 drug candidates, maybe only 100 candidates were known the corresponding CYP isoforms that metabolize them, and the activities of the other 900 candidates were unknown. In most machine learning methods, we will only use the 100 compounds as samples to construct the model, but in our method, we will use the other 900 samples as well. In other words, we will not only used the samples whose biological activities are known (activity samples), but also the samples whose biological activities are unknown (background samples).
As shown in Figure 4, the model will be trained by two steps. First, the model is trained by the background samples, and we can get the distribution of each feature and correlations of all features. Next, after training with the activity samples, we can get the relationship between the target and all features. Thus, the activity samples will just change small amount of variables in the model, which may lead to more stable and reliable predictive performance and less probability of overfitting. We do not have to construct simple model for small data set, and more accurate model may be built.
We will introduce the procedures of our Improved Bayesian method in detail.
(a) Make all features follow normal distribution similarly, and calculate the correlation coefficient matrix R based on the normalized features
Denote F(x) is the cumulative distribution function of feature X. Φ(x) and Φ^{1}(x) are the distribution function and inverse distribution function of standard normal distribution respectively. Let
(1) 
Then
(2) 
Thus, X̂ follows standard normal distribution. Note that F(x) is calculated based on the background samples.
Next, we need to calculate the correlation coefficient matrix R based on the normalized features. Of course, R is also calculated based on the background samples. Take note that the correlation coefficient may be meaningless if the variables do not follow normal distribution, so the normalization procedure is essential.
(b) Estimate the joint distribution function of all features
Here, we need two basic assumptions in our Improved Bayesian method. First, the correlation coefficient of two features is not dependent on the biological activity. For example, if a molecule has more atoms, its molecular weight is usually larger, and the positive correlation between the atomic number and molecular weight is irrelevant to the CYP specificity of these molecules. Second, if random variables X_{1}, X_{2}, … , X_{n} follow normal distribution, n dimensional random variable (X_{1}, X_{2}, … , X_{n}) follows n dimensional normal distribution approximately. Obviously, the two assumptions are reasonable and easy to satisfy.
Based on the assumptions, the joint distribution function of (X_{1}, X_{2}, … , X_{n}) is
(3) 
(c) Construct the classification models based on the activity samples.
Denote P(A) is the probability of event A. Each sample is described by n features X_{1}, X_{2}, … , X_{n}, and belongs to one of the two classes C_{0} or C_{1}. Then, for a sample x = (x_{1}, x_{2}, … , x_{n}), we can calculate the probability that it belongs to C_{k} (k = 0 or k = 1) based on Bayesian formula:
(4) 
In naïve Bayesian classification, we should assume the independence of conditional probability, and then
(5) 
But in our method, based on equation (1), (3), (4), we can calculate the probability straight.
Comments (0)