A machine learning model for the prediction of drug permeability across the Blood-Brain Barrier: a comparative approach

Background: Drug permeability across the blood-brain barrier (BBB) is a critical challenge for successful drug discovery, which has led to multiple efforts to develop in silico predictive models. Most in silico models are based on the molecular descriptors of the drugs. In this work, we compare the ability of sequential feature selection and genetic algorithms to select the most relevant descriptors and hence enhance the permeability prediction accuracy. Methods: Five different classifiers were initially trained on a dataset using eight molecular descriptors. Then, sequential feature selection and genetic algorithms were performed separately and the same classifiers were trained using the descriptors chosen by each algorithm. Results: The highest overall accuracy obtained without feature selection was 93.35%. This accuracy increased with sequential feature selection and genetic algorithms on multiple classifiers. The highest accuracy (96.23%) was obtained after applying the genetic algorithm to the feature vector. Moreover, a genetic algorithm with a fitness function based on the performance of a support vector machine led to an increase in the accuracy of all the tested classifiers, unlike sequential feature selection. Conclusions: The findings show that the genetic algorithm is a more robust approach than sequential feature selection for choosing the most relevant molecular descriptors involved in permeability across the blood-brain barrier. The results also highlight the importance of the polar surface area of drugs in crossing the BBB.


Background
The Blood-Brain Barrier (BBB) is a physiological barrier that maintains brain homeostasis by controlling the exchange of molecules between the blood and the brain [1]. Consequently, the BBB blocks the passage of multiple molecules towards the brain, including administered drugs. This is beneficial when the target of the drug resides outside the brain, since it prevents undesirable drug interactions and the ensuing phenotypic side effects. However, in the case of drugs targeting central nervous system (CNS) diseases, transport across the BBB is mandatory [2]. Therefore, the ability of drug candidates to cross the BBB has to be studied by all pharmaceutical companies during drug discovery. In this context, numerous in silico BBB models have been implemented in order to predict the behavior of drugs across the barrier [3]. These predictive models can be used during the early phases of drug discovery, allowing companies to save the time and money otherwise lost on failed drug investigations.

Two different types of in silico BBB models exist in the literature: binary models, which aim at qualitatively predicting whether drugs cross the BBB (BBB+) or not (BBB-), and quantitative models, which attempt to quantify the permeability of the barrier to a given drug by computing the logarithm of the ratio of the concentration of the drug in the brain to that in the blood (logBB) or its penetration rate (PR) [3]. In this context, K. Raja et al. [4] proposed two different stepwise regression models, one for the prediction of logBB values and the other for PR values. Other quantitative models are reviewed in [3]. While such models assign specific logBB/PR values to each drug, binary models have so far reached a higher prediction accuracy and provide a preliminary insight into the behavior of candidate drugs, which is sufficient in early drug discovery stages. Predominantly, binarization of drug permeability across the BBB is performed by setting empirical thresholds on logBB values [5][6][7][8][9]. However, S. Kunwittaya et al. [6] have shown that varying logBB thresholds leads to differences in prediction accuracy. Therefore, binary BBB models based on logBB values are prone to biases introduced by the threshold setting. On the other hand, Adenot and Lahana [10] introduced a dataset based on the activity of the drug in the CNS: if a drug is CNS active, then it is necessarily BBB+. However, some drugs can cross the BBB yet show no activity in the CNS. Even though finding BBB- drugs based on CNS activity is consequently a challenging task, CNS activity-based datasets require no threshold setting and hence do not introduce the previously mentioned biases.
Machine learning is ubiquitously applied in the case of binary BBB models. In this context, different types of classifiers have been trained in the literature, including Support Vector Machines (SVM) [6,8,11,12], Linear Discriminant Analysis (LDA) [13], Artificial Neural Networks (ANN) [6], Multi-Layer Perceptrons (MLP) [8,9], k-Nearest Neighbors (k-NN) [8], Decision Trees (DT) [6,7] and Random Forests (RF) [5,8,9]. Other studies apply consensus models, training and combining multiple classifiers [8,9]. While consensus models mitigate the overfitting problem of single classifiers, they naturally require high computational power, especially when dealing with high-dimensional data. The features used to train these classifiers are often molecular descriptors, i.e., chemical properties describing the drugs [3]. Some studies also add the fingerprints of the molecules in order to reach better predictions [8,9,12]. On the other hand, a novel approach uses drug side effects and indications for BBB penetration prediction [14]. That model achieved excellent prediction performance but relies on high-level phenotypes, which prevents the extraction of significant biological explanations concerning drug interaction with the BBB.
Molecular descriptors remain the staple of classification-based BBB models. However, to this day, the high dimensionality of data based on molecular descriptors remains challenging. The selection of the most relevant features is crucial since it guarantees improved prediction performance on one hand and faster computation on the other, by reducing the size of the feature vectors. In order to study the effect of the chosen features on classification performance, Y. Yuan et al. [12] compared the performance of SVM models trained on feature vectors containing different molecular descriptors, fingerprints, or a combination of both. Since trying all possible combinations of feature vectors dramatically increases the required computational time and power, an effective feature selection algorithm is needed. In this context, D. Zhang et al. [9] applied a genetic algorithm (GA) for the selection of the appropriate features and the optimization of SVM parameters. Nevertheless, choosing the most suitable algorithm for a given application is an important step, since different algorithms may converge to different feature subsets and consequently affect the prediction results. This study hence compares the effect of GA to that of the sequential feature selection (SFS) algorithm on different classifiers applied in the reported in silico BBB models.

Methods
The workflow, including the use of feature selection, is summarized in Fig. 1. In this study, we began by collecting the drug dataset. Then, in order to compare the performance of GA to that of SFS, we first trained and evaluated multiple classifiers without applying any feature selection algorithm. Subsequently, the same classifiers were implemented while applying each algorithm separately. Finally, the performance of each classifier was evaluated individually.

Dataset preparation
We built and compared the models using a drug permeability dataset made publicly available by Zhao et al. [15]. The dataset is composed of 1593 drugs: 1283 that cross the BBB (BBB+) and 310 that do not (BBB-). The authors used the previously described dataset of Adenot and Lahana [10]. For each drug, Zhao et al. [15] calculated a set of molecular descriptors that are also listed in the dataset.
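To make the later sketches concrete, the following is a minimal loading step, assuming the Zhao et al. dataset has been exported to a CSV file with one column per molecular descriptor and a binary permeability label; the file name and column names are illustrative, not part of the original publication.

```python
# Hypothetical loading of the Zhao et al. dataset; "zhao_bbb_dataset.csv" and
# the "BBB" column name are assumptions for illustration only.
import pandas as pd

df = pd.read_csv("zhao_bbb_dataset.csv")
y = (df["BBB"] == "BBB+").astype(int).to_numpy()  # 1 = BBB+, 0 = BBB-
X = df.drop(columns=["BBB"]).to_numpy()           # molecular descriptor matrix

print(f"{len(y)} drugs: {y.sum()} BBB+ / {len(y) - y.sum()} BBB-")
```

The arrays X and y defined here are reused in the sketches below.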

Feature selection
In order to obtain a highly predictive model, feature engineering should be followed by feature selection as a means to choose the most relevant features. In fact, while some features hold exclusive information regarding the permeability of the drugs across the BBB, others are simply irrelevant or hold misleading information. In this work, GA and SFS algorithms were evaluated and compared separately on the same dataset.

Part 1: sequential feature selection
This is an iterative algorithm that aims at finding the combination of predictors leading to the best prediction capacity of a specific classifier. The algorithm may run in two opposite directions. On one hand, it can start with the entire input feature set and iteratively remove features that mislead a predefined classifier until reaching the predictors' subset with the best classification performance; in this case, the algorithm is running in the backward direction. On the other hand, it may run in the forward direction by starting with an empty predictors' subset and successively adding features that improve the classifier's predictive performance until reaching the optimal predictors' subset. The steps of the forward algorithm are summarized in the flowchart of Fig. 2.

The algorithm starts by creating an empty feature subset. It then adds one feature to the subset and performs 10-fold cross-validation, which returns a criterion value expressing the loss of the classifier. In this work, the criterion used by the algorithm for each combination of features is the number of misclassified observations in the test set. The previously selected feature is then removed and a new feature is added to the subset to compute a new criterion value. Once all the features have been tried, the algorithm keeps the feature with the lowest criterion value as a permanent member of the subset. Then, with this feature permanently present, the algorithm adds a second feature to the subset and a new criterion value is computed; this is repeated until all remaining features have been tried. If the lowest criterion value of the two-feature subset is smaller than that of the originally chosen one-feature subset, the algorithm repeats the same steps by testing the addition of a third feature. Otherwise, the algorithm stops and the previous feature subset is deemed optimal.

For example, suppose one needs to choose the optimal features out of three initial ones. The algorithm successively calculates the criterion value obtained with each: if feature 1 leads to a criterion value of 0.056, feature 2 to 0.065 and feature 3 to 0.078, the algorithm permanently keeps feature 1. It then tests the addition of feature 2 (classifier built with features 1 and 2) and of feature 3 (classifier built with features 1 and 3). If at least one of the two criterion values is lower than 0.056 (initially obtained with feature 1 alone), it tests the addition of the third feature to the subset. Otherwise, feature 1 alone is selected as the optimal feature.
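The following is a minimal sketch of this forward procedure, assuming the X and y arrays defined earlier. It uses the 10-fold cross-validated misclassification rate as the criterion (proportional to the number of misclassified observations used in the paper) and is an illustration rather than the authors' implementation.

```python
# Forward sequential feature selection wrapped around an arbitrary classifier.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def forward_sfs(X, y, estimator, cv=10):
    n_features = X.shape[1]
    selected = []                 # indices of permanently kept features
    best_loss = np.inf
    while len(selected) < n_features:
        trial_losses = {}
        for j in range(n_features):
            if j in selected:
                continue
            cols = selected + [j]
            # criterion: mean 10-fold cross-validated misclassification rate
            acc = cross_val_score(estimator, X[:, cols], y, cv=cv).mean()
            trial_losses[j] = 1.0 - acc
        j_best = min(trial_losses, key=trial_losses.get)
        if trial_losses[j_best] >= best_loss:
            break                 # no candidate improves the criterion: stop
        selected.append(j_best)
        best_loss = trial_losses[j_best]
    return selected, best_loss

# Example usage with a linear SVM as the wrapped classifier
subset, loss = forward_sfs(X, y, SVC(kernel="linear"))
print(subset, loss)
```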

Part 2: genetic algorithm
Under the umbrella of feature selection, GA was also applied, following its motto "the fittest survives" [16]. In fact, GA mimics genetic evolution by setting an initial population of binary chromosomes; each gene is hence a binary digit in the chromosome. Afterwards, at each new generation, the chromosomes undergo three different phenomena:
• Selection: the fittest chromosomes of the initial population are preserved for the next generation
• Cross-over: pairs of chromosomes exchange complementary parts of their genes to produce new chromosomes
• Mutation: randomly chosen genes of a chromosome are flipped
These three phenomena are repeated during each transition from one generation to the next in order to progressively decrease the fitness value, until the predefined number of generations is reached. In this work, the fitness value was calculated using two different fitness functions: the classification loss of an SVM or that of a k-NN classifier. The population is initially composed of 5 chromosomes, in which each gene indicates whether a feature is included (1) or rejected (0). The mutation probability is 10% and the cross-over probability is 80%; the selection probability is hence 10%.
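The sketch below illustrates this GA setup, reusing X and y from above. The population size (5), mutation rate (10%) and cross-over rate (80%) follow the text; the one-point cross-over and elitist selection details are our assumptions, and the fitness is the 10-fold cross-validated classification loss of an SVM.

```python
# GA-based feature selection: each chromosome is a binary inclusion mask.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)

def fitness(chrom, X, y):
    cols = np.flatnonzero(chrom)
    if cols.size == 0:
        return 1.0                                # empty subset: worst loss
    acc = cross_val_score(SVC(kernel="linear"), X[:, cols], y, cv=10).mean()
    return 1.0 - acc                              # SVM classification loss

def ga_select(X, y, pop_size=5, generations=20, p_mut=0.10, p_cross=0.80):
    n = X.shape[1]
    pop = rng.integers(0, 2, size=(pop_size, n))  # random initial population
    for _ in range(generations):
        losses = np.array([fitness(c, X, y) for c in pop])
        pop = pop[np.argsort(losses)]             # fittest chromosomes first
        new_pop = [pop[0].copy()]                 # selection: keep the fittest
        while len(new_pop) < pop_size:
            a, b = pop[rng.integers(0, pop_size, 2)]
            child = a.copy()
            if rng.random() < p_cross:            # one-point cross-over
                cut = rng.integers(1, n)
                child[cut:] = b[cut:]
            mask = rng.random(n) < p_mut          # mutation: flip random genes
            child[mask] ^= 1
            new_pop.append(child)
        pop = np.array(new_pop)
    losses = np.array([fitness(c, X, y) for c in pop])
    return np.flatnonzero(pop[np.argmin(losses)]), losses.min()
```

Swapping the SVC inside fitness for a k-NN classifier yields the second fitness function described above.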

Classification
In this study, the dataset was divided into two subsets:
• 80% of the dataset was used as the training set: a set of feature vectors with known outputs, employed to build the classifier.
• 20% was used as the testing set: the classifier is tested by predicting the outputs of the test set and comparing the predicted results to the actual ones. This step is important to evaluate the performance of any classifier used.
Once both sets were ready, the following classifiers were applied for performance comparison: SVM [17] (a linear SVM, and SVMs using polynomial and Radial Basis Function (RBF) kernels), LDA [18], Quadratic Discriminant Analysis (QDA) [19], and k-NN.
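A sketch of this step, assuming X and y from above, is shown below; the 80/20 split follows the paper, while the stratification, the polynomial degree and the k of k-NN are illustrative assumptions.

```python
# Training the compared classifier families on an 80/20 train/test split.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.neighbors import KNeighborsClassifier

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42)

classifiers = {
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (polynomial)": SVC(kernel="poly", degree=3),
    "SVM (RBF)": SVC(kernel="rbf"),
    "LDA": LinearDiscriminantAnalysis(),
    "QDA": QuadraticDiscriminantAnalysis(),
    "k-NN": KNeighborsClassifier(n_neighbors=5),
}

for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(f"{name}: test accuracy = {clf.score(X_test, y_test):.4f}")
```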

Performance evaluation
The performance of each classifier was individually evaluated using the confusion matrix technique [20], which allows the computation of the following parameters:
• The sensitivity (SE), which reflects the capacity of the classifier to detect BBB+ drugs in the entire dataset
• The positive predictive value (PP), which expresses its ability not to deem non-crossing drugs as BBB+
• The specificity (SP), which expresses the ability of the model to detect BBB- drugs in the dataset
• The overall accuracy (ACC), which expresses the total number of true predictions over the total number of predictions
The receiver operating characteristic (ROC) curve [20] was also applied in our study.
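As a sketch, the four parameters and the ROC curve can be derived from the confusion matrix of any fitted classifier above; the metric definitions are standard, while the choice of the RBF SVM here is only an example.

```python
# Confusion-matrix-based evaluation, with labels encoded 1 = BBB+, 0 = BBB-.
from sklearn.metrics import confusion_matrix, roc_curve, auc

y_pred = classifiers["SVM (RBF)"].predict(X_test)
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

se = tp / (tp + fn)                     # sensitivity: found BBB+ among BBB+
pp = tp / (tp + fp)                     # positive predictive value
sp = tn / (tn + fp)                     # specificity: found BBB- among BBB-
acc = (tp + tn) / (tp + tn + fp + fn)   # overall accuracy
print(f"SE={se:.4f}  PP={pp:.4f}  SP={sp:.4f}  ACC={acc:.4f}")

# The ROC curve needs a continuous score; SVC exposes decision_function
scores = classifiers["SVM (RBF)"].decision_function(X_test)
fpr, tpr, _ = roc_curve(y_test, scores)
print(f"AUC = {auc(fpr, tpr):.4f}")
```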

Results
In this study, the first step was to train the classifiers without applying any type of feature selection algorithm. The results are tabulated in Table 1. The highest accuracy value obtained on the test set when training the classifiers with all the initial features was 93.35%, acquired with the SVM (RBF kernel function).

Feature selection
After obtaining the initial results, the feature selection algorithms were performed and led to the following results.

Part 1: sequential feature selection
As described in Fig. 2, the convergence of the SFS towards the final feature subset is based on a criterion value extracted from the classifier itself: in our study, the number of misclassified observations. Therefore, a specific feature subset was obtained with each classifier, as reported in Table 2.

Prediction performance evaluation

Discussion
The dataset used in this study was chosen because its drug permeability stratification is CNS-based [10], and hence independent of logBB thresholds. The dataset was used to classify drugs into BBB+ and BBB- while comparing two types of feature selection algorithms, SFS and GA.
After running the feature selection algorithms, the same classifiers reported in Table 1 were trained separately using the selected features. The overall accuracy obtained on the test set is reported in Table 3. In the case of SFS, the overall accuracy increased with the linear SVM as well as with LDA, QDA and k-NN. The highest accuracy value reached was 94.98%, obtained with the QDA. On the other hand, GA resulted in an increase of the overall accuracy of all the classifiers in the case of a fitness function based on the classification loss of an SVM. The highest accuracy value reached was 96.23%, with the SVM (polynomial kernel function). Nevertheless, with the k-NN-based fitness function, the SVM (polynomial kernel) and the k-NN witnessed a decrease in overall accuracy. The highest accuracy value was also 96.23%, obtained with the QDA, which is higher than that reached with the SFS. Table 2 compares in detail the performance of the two classifiers that led to the 96.23% overall accuracy.

Fig. 4 ROC curves of the SVM classifier (polynomial kernel) before (red) and after (blue) feature selection using the GA

Figure 4 represents the ROC curves of the SVM classifier trained with the entire feature set and after the application of GA for feature selection. It is clear that the area under the curve is much higher after applying GA.

In this study, the SFS resulted in larger feature subsets. Since at each iteration SFS performs 10-fold cross-validation and returns a criterion value specifically based on the classifier being trained, it is in principle tailored to optimize the results of a given classifier. It is to be noted that the number of hydrogen bond donors was selected with all the classifiers. This result is in line with previously reported findings on the significant role of hydrogen bonding characteristics in predicting drug permeability across the BBB [15].
However, GA showed that relying exclusively on the polar surface area (PSA) and the number of hydrogen bond donors leads to better results than those obtained when including other features. This is reflected by the overall accuracy (Table 3) as well as the ROC curve (Fig. 4).
Moreover, it is to be noted that GA (with the SVM-based fitness function) led to an improvement of the results with all the reported classifiers, unlike SFS. The highest overall accuracy was also found with GA.
Comparing the best two classifiers, one can note that the QDA trained using the PSA and pKa (strongest acid) has a slightly lower sensitivity and a higher specificity than the SVM trained with the PSA and the number of hydrogen bond donors (HD). Nevertheless, both classifiers achieve a better balance between predicting BBB+ and BBB- drugs than the binomial partial least squares model implemented in [15] on the same data with selected molecular descriptors, although the division of the dataset between training and test sets differs between the two studies. Furthermore, PSA was selected by GA in the two models that led to the highest overall accuracy. This finding is in accordance with previously reported results underlining the importance of PSA in stratifying drugs into BBB+ and BBB- [13]. In fact, this descriptor appeared in the best four RF models created in [5] using up to four descriptors. Moreover, PSA was used in a classification tree model [7] as the criterion allowing the first split of the tree. In addition, D. Zhang et al. [11] listed PSA among the ten most significant descriptors. All these findings reflect the robustness of GA in selecting the most relevant descriptors.
Given that the QDA trained with the PSA and pKa (strongest acid) led to fewer false negatives than the SVM trained with the PSA and the HD, we can speculate that this model is more useful when the research aim is to detect BBB+ drugs, such as drugs targeting CNS diseases. In fact, this model has a lower risk of classifying BBB+ drugs as BBB-, a misclassification that would prevent a pharmaceutical company from moving forward with a BBB+ drug candidate that could otherwise have made it through the drug discovery phases. On the other hand, the latter model has fewer false positives than the former, which makes it more valuable for the detection of BBB- drugs whose target is located outside the brain. However, these findings should be confirmed by increasing the size of the test set.

Conclusion
This work implements two different in silico BBB models that are mainly useful during the early phases of drug discovery. In fact, in silico BBB models allow pharmaceutical companies to reduce the number of hits that will undergo lengthy and expensive in vitro testing. Therefore, increasing the accuracy of in silico models is key. For this purpose, this paper applies and compares two different types of feature selection algorithms on a CNS-based dataset. The results show that GA enables an improvement of prediction accuracy over SFS. The best classifiers obtained after performing GA gave an accuracy of 96.23% and show a relatively good balance between predicting BBB+ and BBB- drugs. Highly accurate in silico BBB models are needed in order to improve the stratification of drug candidates into BBB+ and BBB- drugs at the early phases of drug discovery and consequently save the time and money associated with this complex, laborious process.

Fig. 2 Flowchart of the forward sequential feature selection algorithm: the algorithm iteratively selects features to be added to a feature subset until reaching the subset with the lowest criterion value. In this work, the criterion value used is the number of misclassified observations in the test set.