2.1 Antibacterial compound collection
Compounds tested for antimicrobial activity were collected from the ChEMBL (version 25, https://www.ebi.ac.uk/chembl/) and PubChem (https://pubchem.ncbi.nlm.nih.gov/) databases. A total of 83768 compounds were obtained; 8001 of these compounds had explicit IC50 values, while the others carried only an inactive label. The IC50 cutoff for antibacterial activity was defined by curve fitting the IC50 values of all compounds. Compounds with IC50 below 10 µmol/L are generally considered active antibacterial compounds [23–27], and the curve-fitting results also suggest that this cutoff is reasonable (Supplementary Figure 1). Based on these results, compounds with IC50 above 10 µmol/L were considered inactive antibacterial compounds. Pybel, a Python wrapper for Open Babel [28, 29], was used to obtain the SMILES strings of the compounds and to calculate molecular fingerprints, which encode the presence or absence of particular substructures in a molecule. Multiple types of molecular fingerprints were calculated for all compounds. Benchmark datasets were built in the following steps: (1) remove duplicate compounds; (2) remove compounds with a molecular weight greater than 1000; (3) remove compounds whose molecular fingerprint similarity between the active and inactive sets exceeded 0.9. Finally, we obtained a positive dataset of 2708 active antibacterial compounds and a negative dataset of 78620 inactive antibacterial compounds. All active antibacterial compounds have IC50 values, whereas only 1893 of the inactive compounds do.
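Filtering step (3) can be sketched as follows. This is a minimal illustration, not the authors' code: fingerprints are modeled as sets of "on" bit indices (in the actual pipeline they would come from pybel, e.g. `mol.calcfp("FP2").bits`), and an inactive compound is dropped if its Tanimoto similarity to any active compound exceeds 0.9.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    inter = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - inter
    return inter / union if union else 0.0

def filter_inactives(actives: dict, inactives: dict, cutoff: float = 0.9) -> dict:
    """Keep only inactive compounds below the similarity cutoff to every active."""
    return {
        cid: fp
        for cid, fp in inactives.items()
        if all(tanimoto(fp, afp) <= cutoff for afp in actives.values())
    }
```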
2.2 Construction of the benchmark dataset
There is a large difference in size between the positive and negative datasets. The positive dataset contains 2708 active antibacterial compounds. To balance the two datasets, the filtered negative dataset always includes the 1893 inactive antibacterial compounds with IC50 values, and the remaining compounds needed to match the size of the positive dataset were randomly selected from the inactive compounds carrying only an inactive label. To account for the uncertainty of random selection, the negative dataset extraction was repeated 10 times. The filtered datasets therefore comprise one positive dataset and 10 negative datasets; each negative dataset was combined with the positive dataset for subsequent analysis. Next, molecular fingerprints were calculated for the positive dataset and all 10 negative datasets. The following fingerprint types were calculated: FP2, FP3, FP4, DLFP, MACCS, ECFP2, ECFP4, ECFP6, FCFP2, FCFP4, and FCFP6. Several state-of-the-art cheminformatics representations were also computed, such as mol2vec [30], SMILES2Vec [31], and FP2VEC [32]. The features of each compound were represented by the binary bits of the different fingerprint types or by the corresponding vectors, and these features were used for machine learning modeling (Supplementary Table 1). All these benchmark datasets were used for preliminary screening of applicable machine learning models.
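The negative-set construction above can be sketched as follows (a simplified stand-in, assuming compound identifiers rather than real records): the 1893 IC50-labeled inactives are always included, the shortfall relative to the 2708 positives is drawn at random from the label-only inactives, and the draw is repeated 10 times.

```python
import random

def build_negative_sets(ic50_inactives, label_only_inactives,
                        n_positive, n_repeats=10, seed=0):
    """Build n_repeats negative sets, each matching the positive set in size.

    Every set contains all IC50-labeled inactives plus a fresh random sample
    of label-only inactives to make up the difference.
    """
    rng = random.Random(seed)  # seeded here only for reproducibility of the sketch
    n_extra = n_positive - len(ic50_inactives)
    return [
        list(ic50_inactives) + rng.sample(list(label_only_inactives), n_extra)
        for _ in range(n_repeats)
    ]
```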
2.3 Initial screening of machine learning methods
To choose appropriate machine learning methods for constructing the antibacterial compound prediction model, we evaluated the predictive performance of several methods: k-nearest neighbors (KNN), logistic regression (LR), linear support vector classifier (LSVC), random forest (RF), gradient boosting regression tree (GBRT), support vector machine (SVM), and multi-layer perceptron (MLP). In the initial screening, each method was trained with default parameters on the benchmark datasets constructed from the different molecular fingerprints. Each benchmark dataset was split into a training set (80%) and a validation set (20%), and a 5-fold cross-validation test was performed. The results suggested that the benchmark dataset based on FP2 fingerprints, combined with the SVM, RF, and MLP methods, showed excellent prediction accuracy among all method-fingerprint combinations, whereas accuracy fluctuated greatly across methods on the benchmark datasets based on vector features (Supplementary Figure 2). Therefore, the benchmark dataset based on FP2 fingerprints and the RF, SVM, and MLP methods were selected for subsequent analysis.
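A minimal sketch of this screening loop is shown below, using scikit-learn with default-parameter classifiers and 5-fold cross-validation. The feature matrix here is a synthetic stand-in for the fingerprint bit matrices, and the label rule is artificial; only the workflow mirrors the text.

```python
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic binary "fingerprints": 200 compounds x 16 bits (placeholder data)
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 16)).astype(float)
y = X[:, 0].astype(int)  # toy activity label for illustration only

# 80/20 split, then 5-fold CV on the training portion with default parameters
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "KNN": KNeighborsClassifier(),
    "LR": LogisticRegression(max_iter=1000),
    "LSVC": LinearSVC(),
    "RF": RandomForestClassifier(random_state=0),
    "GBRT": GradientBoostingClassifier(random_state=0),
    "SVM": SVC(),
    "MLP": MLPClassifier(max_iter=1000, random_state=0),
}
scores = {name: cross_val_score(m, X_train, y_train, cv=5).mean()
          for name, m in models.items()}
```

Comparing `scores` across fingerprint types and methods reproduces the kind of accuracy grid summarized in Supplementary Figure 2.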
2.4 Parameter selection of the SVM, RF, and MLP models
The SVM, RF, and MLP models for antibacterial compound prediction were built using the svm, ensemble, and neural_network modules of the scikit-learn Python library (version 0.20.0, https://scikit-learn.org/stable/). A grid search strategy was used to choose the optimal kernel parameter "gamma" and regularization parameter "C" for the SVM model, the optimal number of trees (parameter "n_estimators") for the RF model, and the optimal hidden layer sizes and regularization parameter "alpha" for the MLP model. All other parameters of the three models used default values. The benchmark dataset was randomly split into a training-and-validation set (80%) and a test set (20%) using the train_test_split function in scikit-learn. Five-fold cross-validation was used to evaluate the generalization performance of each model with the specified parameters on the training-and-validation set, and the cross-validation accuracy was calculated for model evaluation. After cross-validation, a temporary model was built on the training-and-validation set, and the area under the receiver operating characteristic (ROC) curve (AUC) was calculated on the test set. Because similar compounds may fall into different splits, the dataset split and cross-validation were repeated 10 times, which may reduce the impact of similar compounds on the measured prediction performance. For each parameter setting, the mean cross-validation accuracy and mean AUC were calculated. The optimal model was selected as the one with the maximum mean cross-validation accuracy across parameter settings; if multiple models had the same mean accuracy, the model with the maximum AUC was considered optimal.
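The search procedure can be sketched with scikit-learn's GridSearchCV for the SVM case; the RF ("n_estimators") and MLP ("hidden_layer_sizes", "alpha") searches follow the same pattern. The data and the parameter grid values below are illustrative placeholders, not the grids used in the paper.

```python
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

# Synthetic placeholder data: 200 compounds x 8 fingerprint bits
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 8)).astype(float)
y = X[:, 0].astype(int)  # toy label for illustration

# 80% training-and-validation, 20% held-out test
X_trval, X_test, y_trval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# 5-fold CV grid search over "C" and "gamma" (example values only)
grid = GridSearchCV(
    SVC(probability=True),
    param_grid={"C": [0.1, 1, 10], "gamma": [0.01, 0.1, "scale"]},
    cv=5, scoring="accuracy",
)
grid.fit(X_trval, y_trval)

# Temporary model refit on training-and-validation; AUC on the test set
auc = roc_auc_score(y_test, grid.best_estimator_.predict_proba(X_test)[:, 1])
```

In the described protocol this split-search-AUC cycle is repeated 10 times and the mean cross-validation accuracy and mean AUC are compared across parameter settings.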
2.5 Performance evaluation
The optimal SVM, RF, and MLP models were used for performance evaluation. The confusion matrix was calculated from the results of the optimal cross-validation test. True positives (TP) are correctly predicted active antibacterial compounds; true negatives (TN) are correctly predicted inactive compounds; false positives (FP) are inactive compounds predicted as active; and false negatives (FN) are active compounds predicted as inactive. We calculated the following quality indices: accuracy = (TP + TN) / (TP + TN + FP + FN), precision = TP / (TP + FP), sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and F1 score = 2 * TP / (2 * TP + FP + FN). The mean squared error (MSE) was also calculated for all three models. Because the filtered datasets comprise one positive dataset and 10 negative datasets, the averages of these quality indices and the AUC over the 10 runs were used to evaluate the performance of the SVM, RF, and MLP models. A model with high scores (≥ 0.8) for accuracy, precision, F1 score, and AUC was considered effective.
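The quality indices above map directly onto confusion-matrix counts; a small helper (the counts in the usage are illustrative, not results from the paper) makes the definitions concrete:

```python
def quality_indices(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute the quality indices defined in the text from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": tp / (tp + fp),
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }
```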
2.6 Antibacterial small-molecule drug prediction
The final SVM, RF, and MLP models were built on the benchmark dataset with the optimal parameters. All three models were used to predict the antibacterial activity of approved small-molecule drugs, and we compared the prediction performance of single models against combinations of models. Candidate antibacterial drugs were defined as drugs predicted to show antibacterial activity by all of the SVM, RF, and MLP models. Drug information was acquired from the DrugBank database (https://www.drugbank.ca/) [33]. We first retained drugs with approved status that had not been withdrawn, then removed drugs with molecular weight ≥ 1000. Finally, 2315 approved small-molecule drugs were screened for antibacterial activity prediction. Predicted active antibacterial drugs that are not FDA-approved antibacterial drugs were defined as novel antibacterial drugs.
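The consensus rule (active in all three models) can be expressed as a simple intersection; the prediction dictionaries here are stand-ins, assuming each model outputs a 0/1 label per drug ID.

```python
def consensus_actives(svm_pred: dict, rf_pred: dict, mlp_pred: dict) -> set:
    """Return drug IDs predicted active (label 1) by all three models."""
    return {d for d in svm_pred
            if svm_pred[d] == rf_pred.get(d) == mlp_pred.get(d) == 1}
```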
2.7 Structural similarity analysis
FP2 molecular fingerprint similarity was calculated between all novel antibacterial drugs and the FDA-approved antibacterial drugs. The overlap between fingerprints was quantified as a measure of molecular similarity using the Tanimoto coefficient (Tc). Predicted drugs with an average fingerprint similarity below 0.1 and a maximum similarity below 0.2 were considered structurally novel. Furthermore, previous literature has reported several core scaffolds shared by most antibacterial compounds [22]. The flexible maximum common substructure algorithm implemented in the fmcsR package for R [34] was used to determine whether these core scaffolds exist in the predicted antibacterial drugs.
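The structural-novelty filter can be sketched as below. This is an illustrative reimplementation, with fingerprints again modeled as sets of on-bit indices rather than pybel FP2 objects; the 0.1/0.2 cutoffs are those stated in the text.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    inter = len(fp_a & fp_b)
    union = len(fp_a) + len(fp_b) - inter
    return inter / union if union else 0.0

def is_structurally_novel(candidate_fp: set, approved_fps: list,
                          avg_cut: float = 0.1, max_cut: float = 0.2) -> bool:
    """Novel if average similarity to approved antibacterials is < avg_cut
    and maximum similarity is < max_cut."""
    sims = [tanimoto(candidate_fp, fp) for fp in approved_fps]
    return sum(sims) / len(sims) < avg_cut and max(sims) < max_cut
```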