Materials and Methods
Receptors and Ligands used in docking calculations
The docking scores reported in this work were produced in in-house drug discovery research projects. Two different ligand sets were screened against several target proteins using the AutoDock Vina program [18]. 2D structures of the ligands were downloaded from the ZINC15 database in SDF format. Ligand set 1 was obtained from an extended version of the lead-like, ready-to-purchase, clean tranche of ZINC15; its molecules have molecular weights between 200 and 425 Da and logP values ranging from -1 to 3.5, and the set contains roughly 3.4 million molecules. Ligand set 1 was screened against the GTP binding site of Drp1 GTPase, the dantrolene binding site of the Ryanodine Receptor, and the MiD49 interaction interface of Drp1 GTPase using AutoDock Vina. Ligand set 2 was compiled by randomly selecting 400 000 molecules from the ~6.2 million molecules in the drug-like, clean, ready-to-purchase tranche of ZINC15. These 400 000 molecules were docked to the angiotensin-converting enzyme (ACE) side of the spike-ACE interaction interface, the spike side of the spike-ACE interaction interface, the SARS-CoV-2 non-structural protein 16 (Nsp16)/non-structural protein 10 (Nsp10) interaction interface, and the S-adenosylmethionine binding site of Nsp16. The exact number of docking scores varies by target because some molecules occasionally fail to dock due to technical problems. For example, for RyR2, roughly 1.8 million molecules were docked into the intended binding site, while the remaining molecules were docked into a different binding site. Table 1 summarizes the receptors and the molecules docked to them.
Table 1 Targets and their aliases

Target protein | Alias | Number of available docking scores | Ligand set
Drp1 GTPase/GTP binding site | Drp1_GTPase | 3 500 000 | Ligand set 1
Ryanodine Receptor/dantrolene binding site | RyR2 | 1 756 665 | Ligand set 1
Drp1 GTPase/MiD49 interaction interface | Drp1_MiD49 | 3 334 534 | Ligand set 1
Angiotensin-converting enzyme (ACE)/spike-ACE interaction interface | ace | 400 000 | Ligand set 2
Spike/spike-ACE interaction interface | spike | 400 000 | Ligand set 2
SARS-CoV-2 non-structural protein 16 (Nsp16)/non-structural protein 10 (Nsp10) interaction interface | Nsp16-Nsp10 | 400 000 | Ligand set 2
SARS-CoV-2 non-structural protein 16 S-adenosylmethionine binding site | Nsp16-sam | 400 000 | Ligand set 2
Descriptors
The simplified molecular-input line-entry system (SMILES) is used to represent ligands, and descriptors were produced from the SMILES strings. Machine and deep learning algorithms were experimented with using different descriptors created with the DeepChem library [15]. The descriptors used in this work are Extended Connectivity Circular Fingerprints (ECFPs) [16], MACCS (Molecular ACCess System) keys [6], and One Hot Encoding. ECFPs describe a molecule iteratively, combining atomic properties such as atomic number with hashing. The MACCS descriptor starts with a 167-bit vector of zeros; each position is then set to one if the corresponding molecular property (e.g., the presence of a nitrogen atom) is satisfied. The last descriptor, One Hot Encoding, is also a fixed-length bit vector. Each character of the SMILES string is transformed into a 35-bit vector: a one is placed at the index corresponding to the character's position in the predefined set of characters ['#', ')', '(', '+', '-', '/', '1', '3', '2', '5', '4', '7', '6', '8', '=', '@', 'C', 'B', 'F', 'I', 'H', 'O', 'N', 'S', '[', ']', '\', 'c', 'l', 'o', 'n', 'p', 's', 'r'], with zeros elsewhere, and the last bit is set to one if the input character is not in the list. The default maximum length for the SMILES string is 100; shorter inputs are padded with zeros to ensure a consistent output size. Table 2 shows the information on the descriptors used, and a minimal sketch of the one-hot scheme follows the table.
Table 2 Descriptors used and respective number of features

Descriptor identifier | Featurizer | Total features
d1 | One Hot Encoding | 3500
d2 | MACCS | 167
d3 | CircularFingerprint + One Hot Encoding + MACCS | 4691
Selecting a Percentage of Molecules for Training
Since we aim to use only a small fraction of the docking results to predict the docking scores of a large library, we experimented with various training set sizes; it is important to understand how much data is sufficient to train a machine learning model that can screen a large library of molecules. The training set size is chosen from the set {7000, 10000, 20000, 50000, 100000, 350000}. Five-fold CV is then used in training, where each training size is multiplied by five so that training is performed on one fifth of the data and the CV evaluation on the remaining four fifths. For example, for training on 7000 molecules, 35 000 (5 x 7000) molecules are randomly selected, and, in each fold, 7000 molecules are used for training while the remaining 28 000 molecules are used for evaluation. This works like the inverse of traditional CV, where each fold trains on four fifths of the data and evaluates on the remaining fifth; CV ensures a good estimate of a model's prediction performance. For final testing after the CV process, a separate test set whose size depends on the training set size is used. To maintain consistency, the final test set size is capped at 3 000 000 for target proteins where applicable. In the case of a training size of 350 000, 1 750 000 molecules fall into the CV set, leaving roughly 1 750 000 molecules (3 500 000 - 1 750 000) for the final test set. The ligand set 1 training and test set sizes are summarized in Table S1.
For ligand set 2, due to dataset size limitations, the training set size is chosen from the set {7000, 10000, 50000}; 5-fold CV is used for the size of 7000, and 3-fold CV for the remaining sizes. The ligand set 2 training and test set sizes are summarized in Table S1. A sketch of the inverted CV split is shown below.
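The inverted split can be reproduced from an ordinary K-fold splitter by swapping the roles of the two index sets it returns. The following is a minimal sketch using scikit-learn's KFold; the random data, the choice of model, and the variable names are illustrative assumptions, not the in-house pipeline:

```python
# Inverted 5-fold CV: train on one fifth, evaluate on the remaining four fifths.
import numpy as np
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(35_000, 167))   # e.g., 5 x 7000 MACCS descriptors (illustrative)
y = rng.normal(size=35_000)          # docking scores (illustrative)

kf = KFold(n_splits=5, shuffle=True, random_state=0)
for eval_idx, train_idx in kf.split(X):   # index sets swapped: the small fold trains
    model = DecisionTreeRegressor()
    model.fit(X[train_idx], y[train_idx])           # 7000 molecules for training
    r2 = model.score(X[eval_idx], y[eval_idx])      # 28 000 molecules for evaluation
```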
Machine learning models
In order to obtain the best result, several machine learning models were trained. One of the models experimented with was a bidirectional LSTM with an attention mechanism [5]. An LSTM is a type of recurrent neural network (RNN) designed to learn and retain long-term dependencies in sequential data. Bidirectional LSTMs process the input sequence in both forward and backward directions, allowing the model to capture context from both past and future time steps. The attention mechanism enhances the bidirectional LSTM by dynamically focusing on the most relevant parts of the input, improving its handling of long sequences and complex relationships. In addition, a vanilla neural network takes the output of the attention-enhanced bidirectional LSTM and makes the final prediction.
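For concreteness, a minimal PyTorch sketch of this architecture follows; the hidden size, the attention formulation (a learned softmax weighting over time steps), and all names are illustrative assumptions rather than the exact in-house model:

```python
import torch
import torch.nn as nn

class BiLSTMAttentionRegressor(nn.Module):
    """Bidirectional LSTM + attention + feed-forward head (illustrative sizes)."""
    def __init__(self, input_dim=35, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)   # scores each time step
        self.head = nn.Sequential(                 # vanilla network for the prediction
            nn.Linear(2 * hidden_dim, 64), nn.ReLU(), nn.Linear(64, 1))

    def forward(self, x):                          # x: (batch, seq_len, input_dim)
        out, _ = self.lstm(x)                      # (batch, seq_len, 2*hidden_dim)
        weights = torch.softmax(self.attn(out), dim=1)  # attention over time steps
        context = (weights * out).sum(dim=1)       # weighted sum of LSTM outputs
        return self.head(context).squeeze(-1)      # predicted docking score
```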
After hyperparameter tuning, mean squared error (MSE) was settled on as the loss function. The Adam optimizer was used with a learning rate of 0.001, and a batch size of 128 was used during training. A larger batch size could have been used, but 128 is small enough to run on a home machine, and ease of use was a consideration. The experiments for this project were conducted on a Linux-based machine equipped with 126 GB of memory and a 28-core processor (Intel Xeon E5-2680 v4). PyTorch [12] was used to train the LSTM. The other models are from the scikit-learn library [13]: a decision tree regressor and stochastic gradient descent (SGD) regression. The XGBoost regressor from the XGBoost library was also experimented with.
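The corresponding training setup can be sketched as follows, using the stated loss, optimizer, learning rate, and batch size; the synthetic data, the epoch count, and the baseline models' default hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import SGDRegressor
from xgboost import XGBRegressor

# LSTM training setup: MSE loss, Adam with lr=0.001, batch size 128.
model = BiLSTMAttentionRegressor()               # defined in the sketch above
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

X = torch.randn(7000, 100, 35)                   # one-hot SMILES (illustrative data)
y = torch.randn(7000)                            # docking scores (illustrative data)
loader = DataLoader(TensorDataset(X, y), batch_size=128, shuffle=True)

for epoch in range(10):                          # epoch count is illustrative
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()

# Baseline regressors trained on flattened descriptors (default settings shown).
baselines = [DecisionTreeRegressor(), SGDRegressor(), XGBRegressor()]
for reg in baselines:
    reg.fit(X.reshape(7000, -1).numpy(), y.numpy())
```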