In this paper, a deep learning and molecular simulation based hybrid strategy is proposed for virtual drug screening against RdRp over the TargetMol-Approved-Drug-Library, an approved drug library with 1906 compounds collected by TargetMol, resulting in four candidates (Pralatrexate, Azithromycin, Sofosbuvir, Amoxicillin) for drug repurposing. qRT-PCR assay, indirect immunofluorescence assay (IFA) and CCK-8 assay were carried out to validate the efficacy for Pralatrexate, Azithromycin which inhibit SARS-CoV-2 replication in vitro. Surface plasmon resonance (SPR) assay was used to evaluate the RdRp-drug binding affinity.
Structural modeling of RdRp and drug compound dataset
The RdRp sequence and its modelled structure were obtained from https://zhanglab.ccmb.med.umich.edu/C-I-TASSER/2019-nCov/. The RdRp-ligand model was constructed by I-TASSER. The ligand was taken from the template protein (PDB ID: 3BR9) by COFACTOR algorithm within the I-TASSER using structure comparison and protein-protein networks. We extract the amino acids within 1 nm of the ligand as the binding pocket. RMSD between the modeled structure and the recent experimental RdRp structure (PDB ID 6M71) is calculated (~0.516 Å), shown in Supplementary Fig. 11a). RNA polymerase superfamily region is also very similar between these two structures (RMSD=0.456 Å, Supplementary Fig. 11a).
TargetMol-Approved_Drug_Library, which contains 1906 compounds, was used as virtual screening library. These 1906 compounds collected by TargetMol are drugs approved by Food and Drug Administration (FDA), the European Medicine Agency (EMA), or China Food and Drug Administration (CFDA), or included in the US Pharmacopeia (USP) Dictionary, the British Pharmacopoeia (BP), the European Pharmacopoeia (EP), the Japanese Pharmacopoeia (JP), or Chinese Pharmacopoeia (CP) Dictionary.
Molecular vector-based drug screening
A deep learning-based method, DFCNN (Dense fully Connected Neural Network), has been developed for predicting protein-drug binding probability  and used in this paper for the initial drug screening (Fig. 1a). DFCNN utilizes the concatenated molecular vector of protein pocket and ligand as input representation, and the molecular vector are generated by Mol2vec which is inspired by the word2vec model in natural language processing. DFCNN model was trained on a dataset extracted from PDBbind database . Negative data samples in the dataset were generated by cross-combination of proteins and ligands from PDBbind database and positive data samples were taken from protein-ligand pairs in experimental structure. The details of the method were described in our previous paper , and DFCNN achieved an AUC value around 0.9 for the independent testing set . The model is about ~100,000 times faster than Autodock Vina in predicting protein-ligand binding probability (range 0~1), because it does not rely on the protein-drug complex conformation.
To examine the efficiency and effectiveness of the DFCNN method, we screen a large scale chemical compound dataset (about 10 million compounds) targeting 8 representative protein targets taken from the DUD.E diverse data set. For each target, the corresponding dataset contains some active compounds (between 40 and 536) in the DUD.E dataset and 10,402, 895 drug-like compounds from ZINC database. The effectiveness is measured by the prediction-random ratio (Ratio0.9), defined as TPR0.9/Random0.9, where TPR0.9 indicates the ratio (N0.9/Active_num) between the number of active compounds with a DFCNN score larger than 0.9 (N0.9) and the active number of compounds (Active_num). The total number of the compounds (Total_num) with score above 0.9 is defined as NN. The random selection rate (Random0.9) is defined as NN/Total_num. Using cutoff score of 0.9, the prediction-random ratio measures the ratio of predicted TPR and random selection TPR.
Structure-based drug screening
DeepBindBC, an in-house deep learning-based software, is used for structure based drug screening. Unlike the DFCNN, the input of DeepBindBC includes both the physical-chemical information and spatial information between the protein-ligand interfaces (Fig. 1a), hence DeepBindBC is able to achieve higher accuracy, but requires protein-drug complex structure information as input generated by Autodock Vina.
Autodock Vina is used to dock the target with the potential ligands . The pocket is determined by the location of ligand in the template protein (PDB ID: 3BR9) . We set the cavity volume space with 3.5 nm, 3.5 nm and 3.5 nm in x, y, z dimensions from the pocket mass center. AutoDock Tools were used to convert PBD file format to PDBQT file format . The exhaustiveness was set to 8; the num_modes was set to 20, and energy_range was set to 3. The scoring function and optimization algorithm of Autodock Vina have been well discussed in a previous article. In this study, we selected the most likely targets for further validation by setting a binding energy threshold value of -7 kcal/mol.
The DeepBindBC is a ResNet model trained over the PDBbind database. In DeepBindBC, the protein-ligand interaction interface information will be converted into figure-like metric, similar to DeepBindRG . By incorporating the cross-docking (docking proteins and ligands from different experimental complexes) conformation as negative training data, DeepBindBC is highly possible to distinguish non-binders. Since DeepBindBC relies on docking conformation and DFCNN only uses molecular vector information, these two methods are complementary to each other and DeepBindBC takes much more time than DFCNN.
Force field-based screening
Further drug screening was carried out by force field based molecular dynamic (MD) simulations. In this study, we selected 14 drug binding complexes for MD simulation, including Adenosine, Amenamevir, Amoxicillin, Azithromycin, Clofarabine, Fipronil, Gemcitabine, Nitisinone, Pralatrexate, Raltegravir, Romidepsin, Sofosbuvir, Teriflunomide and Vidarabine, respectively.
We also proposed a pocket molecular dynamics simulation (pocket MD, Supplementary Fig. 11b) to facilitate the simulation process by only keeping the binding pocket region for simulation. Binding free energy calculation can be estimated by metadynamics simulations to explore whether protein-ligand will bind in solution. Metadynamics relies on addition of a bias potential to sample the free energy landscape along a specific collective variable of interest ,. Noted that the binding free energy calculations from Metadynamics may only be suitable for detect the general trend of binding in virtual screening.
The pocket MD is same as the classical MD simulation, except that we only using the pocket region to reduce system size for simulation (Supplementary Fig. 11b), which is inspired by a previous dynamic undocking (DUck) method . An in-house script was used to extract the pocket region of the protein (1nm within the binding ligand), the N terminal and C terminal ends were capped with the ACE and NHE, respectively. Terminals will be applied a position restrain to maintain the relative conformation of the pocket. MD simulation was carried out by Gromacs with AMBER-99SB force field [34,35]. The topology of ligand and the partial charges of ligand was generated by ACPYPE , which relies on Antechamber . Firstly, we created a dodecahedron box and put the target-ligand complex at the center. A minimum distance from the protein to box edge was set to 1 nm. We filled the dodecahedron box with TIP3P water molecules , the counter ions was added to neutralize the total charge using the Gromacs program tool . The long-range electrostatic interactions under the periodic boundary conditions was calculated with Particle Mesh Ewald approach . A cutoff of 14 Å was used for van der Waals non-bonded interactions. Covalent bonds involving hydrogen atoms were constrained by applying the LINCS algorithm .
We performed the energy minimization steps with a step-size of 0.001ns, 100 ps simulation with isothermal-isovolumetric ensemble (NVT), and 10ns simulation with isothermal-isobaric ensemble (NPT) for water equilibrium. After that, a 100ns NPT production run (step size 2 fs) was carried out. The Parrinello-Rahmanbarostat and the modified Berendsen thermostat were used for simulation with a fixed temperature of 308 K and a pressure of 1 atm. RMSD and hydrogen bond number of the trajectory were calculated using Gromacs tools.
The simulation was continued using the metadynamics approach for exploring the free energy landscape. The interface coordination number of atoms of protein ligand complex was used as collective variable (CV). The protein-ligand interface coordination numbers correlate with the numbers of atom contact, and larger coordination number usually indicates protein-ligand is in binding state.
The coordination number C is defined as follows by Plumed: see formulas 1 and 2 in the supplementary files.
Tools used in analysis
The USCF Chimera, VMD, ICM-browserPro and Discovery Studio Visualizer 2019 were used to generate the structure and to visualize the 2D protein-ligand interactions [44–47]. Clusfps (https://github.com/kaiwang0112006/clusfps) which depends on RDKit was used to cluster the drugs in the dataset. The drug fingerprint was used as inputs with algorithm of Murtagh being used for clustering 1906 drugs into 20 groups.
Cell line and drugs
Vero cell (ATCC, CCL-81) was cultured at 37°C in Dulbecco’s modified Eagle’s medium (DMEM, Gibco) supplemented with 10% fetal bovine serum (FBS, Gibco) in the atmosphere with 5% CO2. Cells were seeded in 96-well plates and cultured overnight with a density of 5 × 104 cells/well prior infection or drug feeding. Remdesivir, Azithromycin, Pralatrexat, Sofosbuvir and Amoxicillin were obtained from Selleck Chemicals. All drugs were dissolved in DMSO to prepare 50 mM stock solutions, and stored at -20℃. DMSO was used in the controls.
Viral stock titration by 50% tissue culture infective dose (TCID50)
TCID50 was measured as previously reported . In brief, Vero cells in 96-well plates were grown to 80% confluence and infected with 10-fold serial dilutions of the stock SARS-CoV-2 (BetaCoV/Shenzhen/SZTH-003/2020, GISAID No. EPI_ISL_406594) for 1 h at 37°C. The inoculum was removed, and cells were overlaid with fresh DMEM plus 2% FBS. At 5 days post infection (d.p.i), plates were assessed for the lowest dilution in which 50% of the wells exhibited cytopathic effects. The values of TCID50 were calculated according to the Reed-Muench method.
Evaluation of antiviral activities of the drugs in Vero cells
Firstly, the cytotoxicity of the five drugs on Vero Cells were determined by CCK8 assays (Sangon). Then the antiviral activities of the drugs were evaluated as previously reported with some modification. Vero cells seeded in 96-well plates were pre-treated with the different doses of the indicated drugs for 1 h, and then virus was subsequently added at multiplicity of infection (MOI) of 0.02 to allow infection for 2 h. Then, the virus-drug mixture was removed and cells were further cultured with fresh DMEM with 2% FBS and the indicated concentrations of drugs. At 48 hours post infection (h.p.i), the cell supernatant was collected and viral RNAs were extracted using the QIAamp RNA Viral Kit (Qiagen, Heiden, Germany) for further quantification analysis. The cells were collected for indirect immunofluorescence assay (IFA). All the experiments involving infectious SARS-CoV-2 were handled in BSL-3 facilities at the Shenzhen Third People's Hospital.
Quantitative reverse transcription polymerase chain reaction
This assay was carried out as described previously . Viral RNAs were extracted from the samples using the QIAamp RNA Viral Kit (Qiagen, Heiden, Germany), and quantitative reverse transcription polymerase chain reaction (qRT-PCR) was performed using a commercial kit (Genrui-bio) targeting the S and N genes. The specimens were considered positive if the Ct value was ≤ 38.0, and negative if the results were undetermined. Specimens with a Ct higher than 38 were repeated. The specimen was considered positive if the repeat results were the same as the initial result and between 38 and 40. If the repeat Ct was undetectable, the specimen was considered negative.
Indirect immunofluorescence assay (IFA)
IFA was carried out as previously reported [53,54]. Vero cells were fixed in 4% formaldehyde at 48 hours post infection. Then cells were permeabilized in 0.5% Triton X-100, blocked in 5% BSA in PBS, and then probed with the plasma of this patient or healthy control at a dilution of 1:500 for 1 h at room temperature. The cells were washed three times with PBS and then incubated with either goat anti-human IgG conjugated with Alexa fluor 488 at a dilution of 1:500 for 1 h (Invitrogen). The cells were then washed and stained with hoechest-33342 (Invitrogen) to detect nuclei. Fluorescence images were obtained and analyzed using EVOS FL Auto Imaging System (Invitrogen).
Protein expression and purification
The genes for nsp12 of SARS-CoV-2 isolate BetaCov/Wuhan/WH01/2019 (EPI_ISL_406798) was chemically synthesized with codon optimization for insect cells (Spodoptera frugiperda) by Synbio Technologies. The sequence was fused with a C-terminal thrombin cleavage site, a 6×His-tag and a 2×Strep-tag, and incorporated into pFastbac-1 plasmid. Recombinant protein was expressed with Hi5 cells at 27℃. Cells were harvested at 48 hpi（hour post infection）and resuspended in 25 mM HEPES pH 7.4, 1 M NaCl, 1 mM MgCl2 and 2mM TCEP. An equal volume of the same buffer supplemented with 0.2% (v/v) Igepal CA-630 (Anatrace) was added and incubated at 4 °C for 10 min. Cells were lysed by sonication and the lysate was clarified by ultracentrifugation. Cleared lysates were passed through a 0.22-μm filter film before further purification. The protein was purified by tandem affinity chromatography and SEC.
Surface plasmon resonance (SPR) assay
The affinities between nsp12 and drugs were measured at room temperature (r.t.) using a Biacore 8K system with CM5 chips (GE Healthcare). The nsp12 protein was immobilized on the chip with a concentration of 100 μg/mL (diluted by 0.1mM NaAc, PH 4.0).
Drug samples were prepared according to procedure 29264621AA of GE Healthcare Life Sciences. 1×PBS solution plus 5% DMSO and 0.005% p20 was used for running and diluting drugs. A blank channel of the chip was used as the negative control. Serial diluted drugs were then flowed through the chip surface. The LMW multi-cycle kinetics was analyzed with the Biacore 8K Evaluation Software (version22.214.171.12442) and fitted with a 1:1 binding model.
Data are presented as the mean ± SD (Standard Deviation). All analyses were performed using GraphPad Prism version 7.0 for Windows (GraphPad Software, San Diego California, USA). Data were subjected to statistical analysis by two-way ANOVA or two-tailed Student’s t-test. The P values less than 0.05 were considered statistically significant.
The structure for SARS-CoV-2 RdRp has been deposited in the Protein Data Bank with accession number 6M71, 6WTT. The template structure we used to construct RdRP-ligand complexes are deposited in the Protein Data Bank with accession number 3BR9).