Structure based drug discovery by virtual screening of 3699 compounds against the crystal structures of six key SARS-CoV-2 proteins

Background The current Novel Coronavirus (SARS-CoV-2) pandemic is the third major outbreak of the 21st century which emerged in December 2019 from Wuhan, China. At present there are no known treatments or vaccines to cure or prevent the illness. Objective The objective of this study was to explore a list of potential drugs (herbal and antivirals) for their role in inhibiting activity and or replication of SARS-CoV-2 by using molecular docking onto the crystal structures of key viral proteins. Methodology In this study, we used molecular docking to estimate the binding affinities of 3699 drugs on the potential active sites of the 6 main SARS-CoV-2 proteins (Papain like protease, Main protease, ADP Ribose phosphatase, Spike protein, NSP-9 and NSP-10 to 16 complex). While other studies have mostly been performed on the homology models, we obtained the most recently submitted crystal structures of all 6 proteins from the protein data bank for this analysis. Results Our results showed the top ligands as Theasinensin A, Epigallocatechin, Theaflavin, Theasinensin A, Epigallocatechin and Favipiravir showing the highest binding affinities against papain-like protease, ADP ribose phosphatase, main protease, spike protein, RNA replicase (NSP-9) and methyl-transferase (NSP-16) respectively. Conclusion We show that the compounds from our list with the greatest inhibitory potential against SARS-CoV-2 activity or replication include Theasinensin A, Epigallocatechin-3-gallate, Theaflavin, Favipiravir, Curucumin, Quercetin, Mitoxantrone, Amentoflavone, Colistin, Cimicifugic acid, Theaflavin, Silymarin and Chebulagic. We recommend further wet-lab and clinical testing of these compounds to further explore their role against SARS-CoV-2. These results show the potential role of these ligands in inhibiting NSP-9 thus they may hinder RNA replication.


Background
The beginning of the 21st Century has seen two major Coronavirus related outbreaks. These include Severe acute respiratory syndrome (SARS-CoV;2002-2003, Middle East respiratory syndrome (MERS-CoV; 2012) (1). The current Novel Coronavirus (SARS-CoV2) or COVID-19 illness is the third major outbreak which emerged in December 2019 from Wuhan, China (2). The significant mortality associated with the illness and its rapid transmission rate has caused a global health crisis. The mortality rate that was estimated on the 3rd of March by the World Health Organization (WHO) was 3.4% (3). At present there are no known treatments or vaccines to cure or prevent the illness.

Biological Structure
Coronaviruses (CoVs) have an enveloped positive ribonucleic acid (RNA) genome (7). The positive-sense single stranded RNA (+ssRNA) virus has a single linear RNA segment (7). The genome sequencing as of late 27th march 2020 has yielded 1,495 SARS-CoV-2 genomes (8). In addition to this,there are four different viral proteins that are common in CoVs among others which include 1) M proteins, 2) S-Proteins, 3) E-Proteins and 4) N-Proteins (9).

Structure Based Drug discovery
There are two common ways among others in which drugs can modulate the course of a viral disease activity: 1) Effects on the human immune response and 2) The direct inhibitory effects on the viral activity / replication by binding to specific protein sites.
It is difficult to gauge the effects of immune response due to the complex interplay of different immune pathways and usually takes time to be discovered, however, the latter is possible early in the disease epidemiological course by computational modeling and wet lab tests. The true efficacy and side effects profile however can only be determined in a clinical setting such as by a randomized controlled trial.
Computational modeling provides a way to simulate a compound's interaction with a given viral protein using molecular docking. Such a docking model can be used as a precursor for potential drug discovery. In order to discover a potential drug there are two types of virtual screenings which are usually performed. Structure based and Ligand based. We used the former method to screen a large list of potential inhibitory ligands against 6 main viral proteins.

Objective
The objective of this study hence was to explore a list of potential drugs (herbal and antivirals) for their role in inhibiting activity and or replication of SARS-CoV-2 by using molecular docking onto the crystal structures of key viral proteins.

Selecting the main SARS-CoV-2 viral proteins from protein data bank
The protein data bank was searched for the most recently resolved crystal structures submitted related to SARS-CoV-2 virus (10) . A total of 114 submissions were found (date: 14th april 2020). Out of the total, 105 structures were resolved using X-RAY diffraction method while 81 structures from this subset had a resolution less than 3Å. Since our candidate proteins were not all resolved by other methods including cryo electron microscopy or Nuclear magnetic resonance spectroscopy (NMR) thus in order to keep it consistent over all the proteins, we selected x-ray diffraction to be the only method of choice. This was the final subset which was used to find the most recently submitted structures for papain-like protease (PLpro)(PDB code 6w9c), NSP-15 endonuclease (PDB code 6vww), spike protein complexed with ACE receptor (PDB code 6m0j), main protease (Mpro) (PDB code 5rek), NSP-3 in complex with AMP (PDB code 6w6y), NSP-9 RNA replicase bound to peptide molecule (PDB code 6w9q), NSP-10 to NSP-16 complex (PDB code 6w75) [ Table 1] [Fig 2].
SARS-CoV-2 genome consists of ORF1a sequence of about 14 kilobase long which encodes for polypeptide 1a which is eventually cleaved to produce the two replicase proteins namely PLpro (NSP-2-3) and 3CLpro (NSP-5). By ribosomal frameshifting, the virus also encodes for polypeptide 1ab which contains the RNA polymerase (NSP-12), helicase (NSP-13), exonuclease (NSP-14), endoribonuclease (NSP-15) and methyl-transferase (NSP-16) domains (11). These molecules were selected because each of them is found to play a critical role in either viral entry or replication in the human host. Some of the submissions (NSP-15, spike protein, Mpro, NSP-3, NSP-16) were found to be complexed with existing ligands which provided clues for potential receptor sites while for PLpro, literature review and docking-site search algorithm were used to help in potential receptor site identification. There have been previous studies testing several potential ligands against some of the mentioned proteins (PLpro and Mpro) but most of them have been performed on the homology modelled protein structures (12,13). We used the X-RAY diffraction resolved structures because they are the closest representations of the true protein structure.

Searching the compound database for the potential herbal and antiviral ligands
We obtained a list of more than 3457 potential antiviral phytochemicals and traditional Chinese medicinal compounds through a thorough search on Pubchem, Chembl and Zinc databases (20,21). Several freely available online databases (MPD3, MAPS) specializing in medicinal drugs were also used for the purpose of drug selection (22)(23)(24). After literature review, we also ran a separate CHEMBL search query using search strings "antiviral" and "antiviral medications". The query resulted in a total of 2350 compounds. We then filtered the results by choosing only small molecules and ligands with RO5 violations less than 2 (Lipinski's rule of five) to keep the most orally active drug-like agents selected. This resulted in a final selection of 242 antiviral medications which were added to our final drug

Molecular docking
First, all the candidate protein structures were analysed for sequence completion. Then, literature review was performed for each in order to select the potential docking sites. If the crystal structure already came with an attached ligand [ Table 1], the site of ligand attachment was selected for subsequent docking, else (NSP-16), the ligand-binding pockets are determined based on a transformation of the Lennard-Jones potential by convolution with a Gaussian kernel of a certain size, a grid map of a binding potential and construction of equipotential surfaces along the maps (ICM pro version 3.8-7c) (25,26). Once the candidate pockets were listed, the pocket with the highest buriedness, hydrophobicity and drug-likeness was selected [ Fig 15] (27). After selecting the target pocket, structure-based virtual screening method was used for ligand docking procedure. Using Intel i7 8750 processor (ICM pro version 3.8-7c, MolSoft LLC, San Diego, CA, USA), Biased probability Monte Carlo (BPMC) procedure was used to perform conformational sampling of the ligand table against the selected receptor pocket with 2 conformer samples being generated for each ligand (28). An all-atom vacuum force field ECEPP/3 with appended terms to account for solvation free energy and entropic contribution was used for the final binding affinity score calculation (29). Visual iNSP-ection was done to identify the compounds which are outside the receptor or weekly binding. From the subsequent list, ligand conformations with the lowest overall ICM and mfscores were selected for the final reporting. As per ICM's user manual, an ICM score of <-32 or a mfscore <-100 is considered to reflect a high binding affinity (30).

Papain-like Proteinase (PLpro, NSP-2-3) (PDB code 6w9c)
PLpro is responsible for cleavaging the N-terminus of the replicase polyprotein and thus production of NSP-1-3. This way it plays an important role in viral replication. It is also shown to be suppressing type 1 interferon signalling thus coMpromising the host immunity (31). For all these reasons, it is a valuable target for inhibiting viral replication.
We first performed amino acid sequence alignment between the selected new crystal structure (PDB code 6w9c) and its closest match in humans SARS-CoV PLpro (PDB code 3e9s). There was a high sequence identity of about 82% with several conserved regions [ Fig 3].
In order to find the docking site, we used the above information of highly conserved sites and an existing PDB crystal structure containing a bound ligand called 5-Amino-2-Methyl-N-[(1r)-1-Naphthalen-1-Ylethyl]benzamide (an engineered hydrolase of class: papain-like protease deubiquitinase inhibitor) (32) locked in with the crystal structure of SARS CoV-1 PLpro (PDB code 3e9s) (33).
We then performed structural residue alignment. The cartesian RMSD value between the new SARS-CoV-2 PLpro and earlier reported SARS-CoV PLpro came out to be 1.48 which shows a close structural similarity between the two structures [ Fig 4]. It also implies that the ligand binding site identified in the later can be used for the new crystal structure of SARS-CoV-2 as well. The ligand receptor site was thus selected by analysing this existing crystal structure (A-chain) and its attached ligand. This site was prepared by adding hydrogen ions, electrostatic charges and removing water molecules.
The screening results were plotted with ICM score on x-axis while the mfscore was on y-axis. This provided the best selection of ligands which performed equally well on both scoring systems [ Fig 5].
The results are displayed in Table 2 which shows a selection of lowest scoring ligands on ICM and mfscores.
It was found that Theasinensin A which is a polyphenol flavonoid found in black tea (Camellia sinensis) happened to score the lowest. It was found to be surrounded by 9 amino acid residues in the pocket with G266 being the closest with a distance between it and the ligand's oxygen-8 calculated to be around 2.9 Å. The ligand was also forming at least 1 more hydrogen bond with E161 and several hydrophobic interactions (D164, L162, P248, Y264, N267, G163, K157) [ Fig 6].
Curucumin (principal curcuminoid of turmeric), Quercetin (found in red onions and kale), Mitoxantrone (an anthracenedione antineoplastic agent), Amentoflavone (a biflavonoid constituent of a number of plants including Ginkgo biloba) and Colistin (an antibiotic for multi-resistant gram-negative bacteria) were some of the most notable ligands with high affinity towards the receptor pocket. Table 2. Potential PLpro inhibitors from the curated database of herbal and antiviral drugs.

ADP Ribose phosphatase macro domain of NSP-3 SARS-CoV-2 (PDB code 6vww)
Coronaviridae (genera Coronavirus and Torovirus) possess one of the largest single-stranded RNA genomes (27 to 31.5 kb) (34). One of the non structural proteins made by polypeptide 1a of SARS-CoV-2 is NSP-3 which contains certain well preserved macro domains (35). One such domain has been identified as ADP ribose phosphatase which is believed to be involved in diverse pathways, including ADP-ribose metabolism and posttranslational protein modification. The SARS NSP-3 domain readily removes the 1'' phosphate group from Appr-1''-p in in vitro assays, confirming its phosphatase activity (36).
We screened multiple potential ligands against the AMP binding site of this protein [Fig 7].
The results show [ Table 3] that Epigallocatechin gallate, also known as epigallocatechin-3-gallate (most abundant catechin in tea) came on top with the highest ICM score and shows very high promise in binding this active receptor site. This drug has been shown to be effective in treatment of colon cancer cells as well (37) and is known for its antioxidant properties. The ligand was shown to be forming 6 hydrogen bonds (G46, A50, G47, S128, W360, A129) with the closest one with G46 (distance 2.6 Å) [Fig 8]. There were also several hydrophobic interactions as well (I131, A38, G130, F132, G48).
The second notable mention is of Cimicifugic acid (extracted from cimicifuga racemosa). Its role has been well documented in preventing collagen degradation by collagenases or collagenolytic enzymes under pathological conditions, wound healing, or inflammation (38). Its role has also been documented for its antiviral activity against enterovirus A71 infection (39). Table 3. Potential NSP-3 inhibitors from the curated database of herbal and antiviral drugs.

Main protease (Mpro also called 3CL pro) (PDB code 6w6y)
The protein crystal structure used in this analysis represents the PanDDA analysis of the crystallographic fragment screening of SARS-CoV-2 main protease A-chain. The structure came with a ligand molecule docked around the active catalytic site. This site was subsequently selected for a further docking experiment. Mpro along with PLpro is considered an essential protein for SARS-CoV-2 replication and forms a part of the replicase polyprotein complex. It is found to operate at at-least 11 active cleavage sites on the large polyprotein 1ab (~790 kDa). If the action of this protein is blocked, it would most certainly hinder the replication of this virus. After literature review and using the ligand position that came with the crystal structure, the protein receptor site was identified. It was made sure that the receptor site also includes Cys145 and His41 (catalytic dyad) which are the main amino acid residues of the active catalytic site.
The ligands with the lowest scores on ICM and mfscore scales included Theaflavin, 5,7,3′,4′-Tetrahydroxy-2'-(3,3dimethylally and Silymarin. This is shown in Fig 9 and Table 4]. Table 4. Potential Mpro inhibitors from the curated database of herbal and antiviral drugs. Theaflavin, an antioxidant polyphenol which is often formed from the condensation of flavan-3-ols in tea leaves, had the lowest ICM score of -25. Theaflavin and its derivatives have also been shown to be equally effective as an antioxidant in comparison with catechins (40). Furthermore, it was shown to have lower docking scores against RNA dependent RNA polymerase of SARS-CoV-2 in a recent study (41). Though the most important finding was a strong hydrogen bond between its hydrogen (H8) and oxygen of His41 residue which also serves a critical role in forming the active catalytic site. Cys145 was also in close proximity, involved in hydrophobic interactions, thus the ligand was able to fully block this active catalytic site. Apart from these, Theaflavin was forming a total of four more strong hydrogen bonds (T26, T24, Q189, S46) with the closest being 2.7Å in distance. [Fig 9]. There were also several hydrophobic interactions (M49, T25, T45).
The second most interesting ligand was Sillymarin as it had equally low ICM and mfscore. It is.known mostly for its hepatoprotective functions, but recently has been shown possessing antiviral properties against hepatitis C virus (42). Sillymarin is a standardized extract of the milk thistle seeds, containing a mixture of flavonolignans (43) and has also been shown to exert membrane-stabilizing and antioxidant activity (44). It also possesses antifibrotic, immunomodulating and anti-inflammatory effects as mentioned in another study (45). Our findings suggest that these two ligands can act as potential viral Mpro inhibitors of SARS-CoV-2 and can be selected for subsequent testing.

Spike Protein (PDB code 6m0j)
The selected viral protein crystal structure represents the main interaction site of SARS-CoV-2 spike protein bound with human ACE-2 receptor (hACE2). For our target receptor site for docking, we chose the hACE2 interaction domain on the spike protein (receptor binding domain). Spike protein is considered to be the key for entry into the body cells. It has been shown to be interacting with hACE2 receptors on several human cells including type-2 pneumocytes and enterocytes where it helps in the protein mediated membrane fusion (46). It has also been shown to infect T lymphocytes despite a low expression of hACE2 receptors (47). Thus by blocking this receptor binding domain by a ligand, we can potentially inhibit its entry into human cells.
The docking results showed C20H19F2N3O5 (Dolutegravir) with the lowest ICM score of -19.29 while Chebulagic had the lowest mfscore of -133.7 [ Table 5]. The ligands with equally good scores on both scales included Theasinensin with an ICM score of -19.22 and an mfscore of -98 and Colistin (ICM score -16.69, mfscore -126) [Fig 11]. Table 5. Potential spike protein inhibitors from the curated database of herbal and anti-viral drugs.
having a distance of 2.47Å from T500. Theasinensin A is among those herbal agents which have already been shown to possess antiviral activity (48).

RNA Replicase NSP-9 (PDB code 6w9q)
Non structural protein 9 has been found to be playing a key role in binding of RNA-Polymerase to the RNA strand and interacts with NSP-8, thus forming an important component of the RNA-polymerase complex (49). Inhibiting its action may play a role in keeping the RNA-polymerase from binding to the RNA strand thus inhibiting the viral replication. The receptor pocket for action was chosen to be around the bound peptide site of the crystal structure. This was then verified by the pocket finder screening results of the ICM software which proved the selected pocket to be showing maximum hydrophobicity and receptivity for a drug like ligand.
Our docking results showed Epigallocatechin with the lowest ICM score of around -26 while Chebulagic, a benzopyran tannin, had the lowest mfscore of -117 [ Fig 13 and Table 6]. Epigallocatechin (found in black tea) has been shown to be an effective antioxidant in several studies (50) while Chebulagic acid has been shown to be inhibiting enterovirus 71 replication in some studies (51). Cimicifugic acid, Scutellarin and Rosemarinic acid were other compounds with lower ICM and mfscores. Epigallocatechin is also found to have a similar zinc ionophore activity as shown by chloroquine in one of the studies (50) thus being a herbal agent, it can be safely used in higher concentrations to achieve better inhibition compared to chloroquine.
Epigallocatechin had the lowest overall score in both metrics. It was shown to be forming 2 hydrophobic interactions (T77, V76) and 4 hydrogen bonds (R111, D78, A107, V110) with the smallest distance of 2.8Å with R111 [FIG 14]. Table 6. Potential NSP-9 inhibitors from the curated database of herbal and antiviral drugs.
These results show the potential role of these ligands in inhibiting NSP-9 thus they may hinder RNA replication.

Non structural proteins 10-16 (PDB code 6w75)
NSP-10-16 is a complex of several individual proteins including RNA polymerase (NSP-12), helicase (NSP-13), exonuclease (NSP-14), endoribonuclease (NSP-15) and methyl-transferase (NSP-16) domains. Each one plays its role in bringing about the replication of the virus. The ICM pocket finder was run on this complex to locate the most suitable target pockets for ligand docking [Fig 15]. The selected pocket was found on the A-chain of NSP-16 molecules that had the highest hydrophobicity and DLID scores (Drug-Likeness) (27) and a volume of 348A 3 .
After selecting the target pocket, virtual screening was performed [Fig 16] which showed the antiviral Favipiravir (Avigan) a novel RNA polymerase inhibitor having the highest affinity (ICM score -36, mfscore -42) with receptor site [ Table 7]. Favipiravir is a novel broad-spectrum antiviral drug and has been found to be effective against all the subtypes of influenza virus including those resistant to neuraminidase and M2 inhibitors (52). In a recently published study it was found to be reducing viral infection (53). Table 7. Potential NSP-16 inhibitors from the curated database of herbal and antiviral drugs.
Favipiravir was developed by Fujifilm Toyama Chemical and approved in Japan in 2014 for the treatment of influenza and is currently under trial for its potential role in inhibiting viral replication of SARS-CoV-2. In our analysis, it was found to be making 5 hydrophobic interactions (D6897, D6898, D6912, M6929, F6947) while it was forming 4 hydrogen bonds (G6869, G6911, C6913, S6896) with the closest one of 2.8Å with G6869 [ Fig 17].
Among other ligands, Amentoflavone (a biflavonoid constituent of a number of plants including Chinese plant Selaginella tamariscina and Ginkgo biloba) also showed a high affinity with an ICM score of -33. It has been shown to possess antiviral and anti cancer effects in some studies (54,55). Theaflavin was also among potential inhibitors of this site.
The following table summarizes the key ligands found for all 6 proteins [ Table 8].

Comparison with other classes as controls
We also tested some of the most mentioned (12,(78)(79)(80)(81) ligands from different drug classes (Antivirals, Antibacterials, Antifungals, Antimalarials, Antihypertensive, Cholesterol lowering, Anticancer, Antiallergics, AntiInflammatory and membrane stabilizers) as controls for each of the protein targets. Favipiravir and Ribavirin in general, Atoquone and Rosuvastatin against NSP-3, Mitoxantrone against PLpro, Montelukast against Spike protein and Silymarin against NSP-3 and NSP-16 showed high binding affinities (Table 9). However, as mentioned above, most of these control ligands did not score the best affinity scores when compared to other phytochemicals in each protein receptor docking experiment except for Favipiravir.

Conclusion
Using structure based ligand screening method, we identified Theasinensin A (for PLpro and spike protein), Epigallocatechin (for ADP ribose phosphatase), Theaflavin (for Mpro) and Favipiravir (for NSP-16) as the key probable inhibitors for their respective receptor protein pockets.
Furthermore, there were few drugs that bound with more than one protein with low mfscores (Theasinensis A with PL pro and spike protein, Quercetin with PLpro and ADP ribose phosphatase, Amentoflavone with PLpro and NSP-16, Cimicifugic acid with ADP ribose phosphatase and NSP-16, Chebulagic with spike protein and NSP-9). This may suggest their role in viral inhibition by more than one mechanism of action.
Based on the finding of our study, we suggest further analysis of our selected potential ligands in a wet-lab or clinical setting in order to validate our findings.     Sequence alignment of SARS-CoV-2 PLpro sequence (PDB code 6w9c) with SARS-CoV PLpro sequence (PDB code 3e9s). Only a segment is shown here. The sequence identity is preserved up to 82%. Greener shades represent better sequence conservation.

Figure 5
Selection plot for PLpro receptor site. X-axis represents the ICM score while Y-axis represents the mean force potential score.  Selection plot for ADP Ribose phosphatase macro domain of NSP-3 receptor site. X-axis represents the ICM score while Y-axis represents the mean force potential score.  Selection plot for Mpro receptor site. X-axis represents the ICM score while Y-axis represents the mean force potential score.  Selection plot for spike protein receptor binding domain as a docking site. X-axis represents the ICM score while Y-axis represents the mean force potential score.

Figure 12
Left figure shows the Theasinensin A molecule present in the ACE-2 receptor binding domain of spike protein.
Right figure shows a 2d dendrogram showing the ligand molecule in the centre, receptor pocket amino acid residues forming hydrophobic interactions with the ligand (green circles) and receptor pocket amino acid residues forming hydrogen bonds with the ligand (blue circles).

Figure 13
Selection plot for NSP-9 protein receptor binding domain as a docking site. X-axis represents the ICM score while Y-axis represents the mean force potential score.   Selection plot for NSP-16 protein receptor binding domain as a docking site. X-axis represents the ICM score while Y-axis represents the mean force potential score.