Improved Bayesian method
In this paper, we introduce a new method for constructing classification models, which we call the Improved Bayesian method. In bioinformatics, owing to the limitations of experimental technology, we often face datasets with small sample sizes. For example, given 1000 drug candidates, perhaps only 100 have known CYP isoforms that metabolize them, while the activities of the other 900 remain unknown. Most machine learning methods would use only the 100 annotated compounds to construct the model, but our method uses the other 900 as well. In other words, we use not only the samples whose biological activities are known (activity samples), but also the samples whose biological activities are unknown (background samples).
As shown in Figure 4, the model is trained in two steps. First, the model is trained on the background samples, which yields the distribution of each feature and the correlations among all features. Next, after training on the activity samples, we obtain the relationship between the target and the features. Thus, the activity samples change only a small number of variables in the model, which may lead to more stable and reliable predictive performance and a lower probability of over-fitting. We therefore do not have to restrict ourselves to simple models for small datasets, and more accurate models may be built.
Below, we describe the procedure of our Improved Bayesian method in detail.
(a) Transform all features to approximately follow the standard normal distribution, and calculate the correlation coefficient matrix R based on the normalized features
Denote by F(x) the cumulative distribution function of feature X, and let Φ(x) and Φ⁻¹(x) be the distribution function and inverse distribution function of the standard normal distribution, respectively. Let

$$\hat{X} = \Phi^{-1}(F(X)) \quad (1)$$

Then, since F and Φ are monotonically increasing and F(X) is uniformly distributed on (0, 1),

$$P(\hat{X} \le x) = P\left(\Phi^{-1}(F(X)) \le x\right) = P\left(F(X) \le \Phi(x)\right) = \Phi(x) \quad (2)$$

Thus, X̂ follows the standard normal distribution. Note that F(x) is estimated from the background samples.
Next, we calculate the correlation coefficient matrix R based on the normalized features; R is likewise calculated from the background samples. Take note that the correlation coefficient may be meaningless if the variables do not follow a normal distribution, so the normalization procedure is essential.
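To make step (a) concrete, here is a minimal Python sketch of the normalization and the computation of R, assuming the empirical CDF of the background samples is used as the estimate of F(x) (function and variable names are ours):

```python
import numpy as np
from scipy.stats import norm, rankdata

def normalize_to_gaussian(background):
    """Map each feature to an approximately standard normal variable
    via x_hat = Phi^{-1}(F(x)), with F estimated by the empirical CDF
    of the background samples (shape: n_samples x n_features)."""
    n = background.shape[0]
    # Column-wise ranks give the empirical CDF; rescale to (0, 1) so
    # the inverse normal CDF stays finite at the extremes.
    ranks = np.apply_along_axis(rankdata, 0, background)
    u = ranks / (n + 1)
    x_hat = norm.ppf(u)                    # Phi^{-1}(F(x))
    R = np.corrcoef(x_hat, rowvar=False)   # correlations of normalized features
    return x_hat, R
```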
(b) Estimate the joint distribution function of all features
Here, we need two basic assumptions in our Improved Bayesian method. First, the correlation coefficient between two features does not depend on the biological activity. For example, if a molecule has more atoms, its molecular weight is usually larger, and this positive correlation between the number of atoms and the molecular weight is irrelevant to the CYP specificity of the molecule. Second, if random variables X1, X2, … , Xn each follow a normal distribution, the n-dimensional random variable (X1, X2, … , Xn) approximately follows an n-dimensional normal distribution. These two assumptions are reasonable and easy to satisfy.
Based on these assumptions, the joint distribution function of (X1, X2, … , Xn) is

$$F(x_1, x_2, \dots, x_n) = \Phi_n\!\left(\Phi^{-1}(F_1(x_1)), \dots, \Phi^{-1}(F_n(x_n))\right) \quad (3)$$

where Φn denotes the n-dimensional normal distribution function with correlation matrix R.
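A small sketch of how such a joint distribution function could be evaluated, assuming scipy's multivariate normal CDF for Φn and the estimated marginals from step (a):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def joint_cdf(x, marginal_cdfs, R):
    """Joint distribution function F(x1, ..., xn) assembled from the
    marginal CDFs F_i and the correlation matrix R of the normalized
    features (a Gaussian-copula-style construction).

    x:             length-n point at which to evaluate the joint CDF
    marginal_cdfs: list of callables, marginal_cdfs[i](xi) = F_i(xi)
    R:             (n, n) correlation matrix from the background samples
    """
    # Move each coordinate to the standard normal scale, then evaluate
    # the n-dimensional normal CDF with zero mean and correlation R.
    z = np.array([norm.ppf(F(xi)) for F, xi in zip(marginal_cdfs, x)])
    return multivariate_normal(mean=np.zeros(len(x)), cov=R).cdf(z)
```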
(c) Construct the classification models based on the activity samples.
Denote by P(A) the probability of event A. Each sample is described by n features X1, X2, … , Xn and belongs to one of two classes, C0 or C1. Then, for a sample x = (x1, x2, … , xn), we can calculate the probability that it belongs to Ck (k = 0 or 1) using the Bayesian formula:

$$P(C_k \mid x) = \frac{P(x \mid C_k)\, P(C_k)}{P(x)} \quad (4)$$
In naïve Bayesian classification, the conditional probabilities are assumed to be independent, so that

$$P(x \mid C_k) = \prod_{i=1}^{n} P(x_i \mid C_k) \quad (5)$$
But in our method, based on equations (1), (3) and (4), we can calculate the probability directly (equation (6)),
where Fi(x) and fi(x) are the distribution function and density function of Xi, respectively; Φ(x) and φ(x) are the distribution function and density function of the standard normal distribution, respectively; and Φn(x) and φn(x) are the joint distribution function and joint density function of the n-dimensional normal distribution, respectively.
Take note that φn(x) can be expressed as

$$\varphi_n(x) = \frac{1}{(2\pi)^{n/2}\,|\Sigma R \Sigma|^{1/2}} \exp\!\left(-\frac{1}{2}\,(x-\mu)\,\Sigma^{-1} R^{-1} \Sigma^{-1}\,(x-\mu)^T\right) \quad (7)$$

where Σ = diag(σ1, … , σn) is the standard deviation matrix, μ = (μ1, … , μn) is the mean vector, R is the correlation coefficient matrix and x = (x1, x2, … , xn).
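As a sketch, this density can be evaluated from μ, Σ and R with scipy (the helper name is ours):

```python
import numpy as np
from scipy.stats import multivariate_normal

def phi_n(x, mu, sigma, R):
    """n-dimensional normal density as in equation (7): the covariance
    matrix is Sigma @ R @ Sigma, with Sigma = diag(sigma)."""
    Sigma = np.diag(sigma)
    return multivariate_normal(mean=mu, cov=Sigma @ R @ Sigma).pdf(x)
```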
As R is a real symmetric matrix, there must exist an orthogonal matrix P = (α1, α2, … , αn) such that

$$P^T R P = \Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n) \quad (8)$$

and

$$R^{-1} = P \Lambda^{-1} P^T = \sum_{i=1}^{n} \frac{1}{\lambda_i}\, \alpha_i \alpha_i^T \quad (9)$$

where λi and αi are the eigenvalues and corresponding eigenvectors of R.
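Equations (8) and (9) can be checked numerically with numpy's symmetric eigendecomposition; a minimal sketch:

```python
import numpy as np

def decompose_R(R):
    """Symmetric eigendecomposition of R: eigh returns the eigenvalues
    (the diagonal of Lambda) and an orthogonal eigenvector matrix P."""
    lam, P = np.linalg.eigh(R)
    assert np.allclose(P.T @ R @ P, np.diag(lam))  # equation (8)
    R_inv = P @ np.diag(1.0 / lam) @ P.T           # equation (9)
    assert np.allclose(R_inv, np.linalg.inv(R))
    return lam, P, R_inv
```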
Let

$$y_i = \frac{x_i - \mu_i}{\sigma_i} \quad (10)$$

$$y_{ki} = \frac{x_i - \mu_{ki}}{\sigma_{ki}} \quad (11)$$

where μi and σi are the mean and the standard deviation of feature i over all samples, respectively, and μki and σki are the mean and the standard deviation of feature i over the samples belonging to class Ck.
Plugging equations (7)-(11) into equation (6) and taking the logarithm, we obtain the classification score Hk of each class (equation (12)).
Then, the larger Hk is, the more probable it is that x = (x1, x2, … , xn) belongs to Ck. We can compare H0 and H1 to determine whether x belongs to C0 or C1, or we can plot the receiver operating characteristic (ROC) curve and select the best threshold to determine the classification.
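To illustrate the decision step, here is a hedged sketch in which the scores from equation (12), not reproduced here, are assumed available as arrays H0 and H1; the threshold is chosen from the ROC curve via Youden's J statistic, one common criterion for "the best threshold":

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def classify_by_scores(H0, H1, y_true=None):
    """Assign samples to C1 when H1 - H0 exceeds a threshold.
    Without labels, simply compare H0 and H1; with labels, pick the
    ROC threshold maximizing sensitivity + specificity (Youden's J)."""
    score = np.asarray(H1) - np.asarray(H0)
    if y_true is None:
        return (score > 0).astype(int)
    fpr, tpr, thresholds = roc_curve(y_true, score)
    best = thresholds[np.argmax(tpr - fpr)]
    print("AUC:", roc_auc_score(y_true, score))
    return (score > best).astype(int)
```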
Dataset
We collected 776 compounds from the Meta-CYP database; each compound can be metabolized by at least one of ten CYP isoforms: CYP1A1, 1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1 and 3A4. The number of compounds metabolized by each CYP isoform and their distribution are shown in Table 4. The total count is 1646, which means each compound is metabolized by about two CYP isoforms on average. The 776 compounds were randomly divided into two data sets: the first set, composed of 600 samples, was taken as the validation set for feature selection and model construction; the remaining 176 samples constituted the independent test set. The Meta-CYP database and the analysis tools based on our proposed methods are currently under construction in our laboratory; the information used in this paper was drawn mainly from SuperCYP (http://bioinformatics.charite.de/supercyp/index.php) [41], the CYP450 Engineering Database (https://cyped.biocatnet.de) [42, 43] and PubChem Compound (https://www.ncbi.nlm.nih.gov/pccompound) [44]. Detailed information about this data set is available in the Supplementary Material. Background samples were obtained from the DrugBank database (http://www.drugbank.ca) [45, 46], including approved, experimental and investigational drugs, 7114 compounds in all.
It should be noted that inorganic compounds and polypeptides are not included in our dataset, because the features describing these compounds are quite different from those of other small molecules. The models constructed in this study are therefore not suitable for these two types of compounds.
Molecular descriptors calculation
We employed molecular descriptors representing various structural and physicochemical properties of the compounds, including atomic composition, molecular size, shape, weight, hydrogen bonding, hydrophobicity, acidity or basicity, charge distribution and chemical bond energies, to construct our models. To calculate these molecular descriptors accurately, the chemical structures of all compounds were first drawn with SAMM, an in-house molecular mechanics and drug design package. Next, all structures were energy-minimized with the MMFF94 force field [47]. Finally, the molecular descriptors were calculated from the energy-minimized structures with SAMM. After removing descriptors that were highly correlated or had the same value in almost all compounds, 176 molecular descriptors (134 two-dimensional and 42 three-dimensional) remained and were subjected to further feature selection.
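The descriptor-pruning step could be sketched as follows; the cutoffs are illustrative assumptions, not the values used in this study:

```python
import numpy as np
import pandas as pd

def prune_descriptors(X, corr_cutoff=0.95, max_mode_frac=0.95):
    """Drop descriptors that take the same value in almost all
    compounds, then one member of each highly correlated pair.
    X: DataFrame of shape (compounds, descriptors)."""
    # Near-constant descriptors: the most frequent value dominates.
    keep = [c for c in X.columns
            if X[c].value_counts(normalize=True).iloc[0] < max_mode_frac]
    X = X[keep]
    # Upper triangle of the absolute correlation matrix; drop any
    # descriptor highly correlated with an earlier (kept) one.
    corr = X.corr().abs()
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    drop = [c for c in upper.columns if (upper[c] > corr_cutoff).any()]
    return X.drop(columns=drop)
```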
Feature selection
Feature selection is performed prior to the construction of the predictive models in order to eliminate redundant features and identify the most relevant ones [48-50]. Minimum Redundancy Maximum Relevance (MRMR) [51] was applied to select the best features in this paper. It is a powerful method for extracting useful information from complex systems: it maximizes the mutual information between the selected features and the classification target while minimizing the correlation among the selected features themselves [52]. In this work, we selected the top 20 features with the highest scores for model construction.
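A greedy MRMR-style selection might be sketched as below; mutual information estimates relevance and absolute correlation approximates redundancy, which is one common variant of the scoring in [51] rather than necessarily the exact implementation used here:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mrmr_select(X, y, n_select=20):
    """Greedy minimum-redundancy-maximum-relevance selection.
    X: (samples, features) array, y: class labels."""
    relevance = mutual_info_classif(X, y)         # relevance to the target
    corr = np.abs(np.corrcoef(X, rowvar=False))   # pairwise redundancy
    selected = [int(np.argmax(relevance))]        # seed with the most relevant
    while len(selected) < n_select:
        rest = [j for j in range(X.shape[1]) if j not in selected]
        # Relevance minus mean redundancy with already-selected features.
        scores = [relevance[j] - corr[j, selected].mean() for j in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected
```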
Model comparison and evaluation
Besides the Improved Bayesian method, we also used the naïve Bayesian method, k-nearest neighbors and support vector machine [53] to construct classification models on the validation set for comparison. In the naïve Bayesian method, the Laplace smoothing factor was 1. For k-nearest neighbors, the parameters were k = 5, Euclidean distance and a distance weighting function of 1/d. For the support vector machine, the kernel function was the radial basis function, and the other parameters were optimized with LIBSVM [54].
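For reference, roughly equivalent baseline configurations could be set up in scikit-learn as follows (scikit-learn here stands in for the actual tools, e.g. LIBSVM; the naïve Bayesian model is omitted because its Laplace-smoothed form depends on how the features are discretized):

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# k-NN: k = 5, Euclidean distance, neighbor weights proportional to 1/d.
knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean",
                           weights="distance")

# SVM: radial basis function kernel; C and gamma would then be tuned
# by grid search, the role LIBSVM's tools play in the paper.
svm = SVC(kernel="rbf")
```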
For the constructed models, we used the area under the receiver operating characteristic curve (AUC of the ROC curve) for model evaluation. In addition, we calculated specificity, sensitivity and accuracy:

$$\text{Sensitivity} = \frac{TP}{TP + FN}, \quad \text{Specificity} = \frac{TN}{TN + FP}, \quad \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

where TP, FP, TN and FN denote true positives, false positives, true negatives and false negatives, respectively.
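These quantities follow directly from the confusion matrix; a brief sketch:

```python
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_pred, y_score):
    """Sensitivity, specificity, accuracy and AUC from predictions."""
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "auc": roc_auc_score(y_true, y_score),
    }
```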