In Silico high-throughput virtual screening and molecular dynamics simulation study to identify 3CLpro inhibitor of COVID-19

COVID-19 is an ongoing coronavirus pandemic with more than 260,000 deaths worldwide so far. In the event of no targeted drug and vaccine for SARS-CoV-2, causative agent of COVID-19, finding newer targets becomes indispensable. 3CLpro is a processing enzyme of viral replicase-transcriptase complex, chosen as a potential drug target. 3CLpro of COVID-19 is structurally similar to SARS-CoV and MERS-CoV 3CLpro. Here, we built up an in-house inhibitor library for 3CLpro of SARS-CoV and MERS-CoV. The library revealed 7 non-toxic compounds with 5 compounds showing significant protease and enzyme inhibition activity. Molecular docking of these compounds with 3CLpro shown lower binding energy profile of compounds 1 and 2. The compactness and flexibility analysis of compounds 1 and 2 complexes with 3CLpro showed lower stability profile as compared to unbound 3CLpro. The results cite the potential of compounds 1 and 2 as probable inhibitors of 3CLpro as a prospective therapeutic target for COVID-19. 30kb long of coronavirus for replicases, proteins, nucleocapsid, and it an essential role in attaching to the host cell receptor, that the enzyme2 a bind the host cell the cell nucleocapsid epitope at B cell and T cell these play important role in COVID-19 replicases polyprotein a ribosomal frameshift ORF1a ORF1b pp1a pp1ab replicase-transcriptase” for Proteases play essential role in the life cycle at various stages, thus proteases an important molecule for study and developing therapeutics target.


Introduction
World is suffering from another very serious outbreak of coronavirus within a period of two decades. The 2019 coronavirus outbreak has been named as COVID-19 by the WHO to demark it from the previous Severe Acute Respiratory Syndrome (SARS) outbreak and is declared as pandemic owing to its spread across the continents. As of 8 th may, around 3,800,000 cases of infection have been reported with death tool climbing to more than 260,000 worldwide (https://coronavirus.jhu.edu/map.html). The recent outbreak has presented a serious challenge to global health care system for handling any such outbreaks. The scientific community worldwide is working day night to develop the preventive vaccine and drugs to contain the outbreak at the earliest and save people at risk. At present, National Health Commission of the People's Republic of china has recommended ritonavir and lopinavir both are HIV-1 protease inhibitors as well as chloroquine phosphate for the COVID-19 treatment. Remdesivir, an antiviral drug against the RNA virus including Ebola 1 , SARS 2 and Middle East Respiratory Syndrome (MERS) 2 has potential to inhibit COVID-19 virus 3 . At present, some traditional Chinese medicines (Lianhuaqingwen and Shuanghuanglian oral liquid) 4,5 are also tested which are found to be effectively eliminating the virus. COVID-19 is a (+) sense single-stranded enveloped RNA virus codes an ~30kb long genome as reported using different sequencing methods on the patient-derived virus samples 6,7 . COVID-19 virus belongs to the order of Nidovirdales, family Cornidovirineae, subfamily Coronaviridae genus Betacoronavirus, and subgenus Sarbecovirus. Previous coronavirus outbreaks have been caused by the same genus of the virus which included SARS outbreak of 2002 in China and MERS outbreak in 2012 near the Arabian Peninsula. All these diseases are marked by respiratory illness ultimately resulting in fatal pneumonia. Phylogenetic studies conducted between the causative agents of all these three outbreaks suggest of closer association between COVID-19 to SARS-CoV (79% identity) in comparison to the MERS-CoV (50% identity) 8 . However, studies also show a very close association with two other strains of the SARS-like coronaviruses, bat-SL-CoVZC45 and bat-SL-CoVZXC21 8 .
Approximately 30kb long genome of coronavirus encodes for viral replicases, structural proteins, spike, nucleocapsid, membrane and envelope protein along with it the genome codes for several non-structural proteins (nsp) which are often described as the accessory proteins or essential for various other functions 9 . Viral spike protein plays an essential role in attaching to the host cell receptor, studies suggest that the virus uses Angiotensin-converting enzyme2 as a receptor to bind to the host cell and enter into the cell 10,11 . Spike protein and nucleocapsid are derived epitope at B cell and T cell suggests, these protein play important role in vaccines development against the COVID-19 12 . Most of the viral genome is occupied by replicases (~21kb) an overlapping polyprotein with coding varying by a ribosomal frameshift of -1 resulting in ORF1a and ORF1b which encode pp1a (486KDa) and pp1ab (790kDa) respectively 13 . Proteolytic processing of this polyprotein results in release of a number of functional protein required by virus life cycle. This proteolytic cleavage is mainly accompanied by two proteases that are papain-like cysteine protease (PL PRO ) and 3chymotrypsin-like protease (3CL Pro ) 13 . These two proteins upon proteolytic processing are responsible for the formation of the multi-subunit complex known as "viral replicase-transcriptase" required for viral replication, hence making these proteases indispensable for replication. Proteases thereby play an essential role in the viral life cycle at various stages, thus making these proteases an important molecule for study and developing therapeutics target. Recently, PL PRO and 3CL Pro of COVID-19 have been used as the therapeutics target by in slico as well as in vitro drugs screen with the FDA approved drugs [14][15][16][17] . In this study, we attempt a virtual screening of 3CLpro inhibitor as it being the main protease of COVID-19. The inhibitor library was based on SARS and MERS 3CLpro inhibitor. The toxicity and drug durability of inhibitors were assigned by Osiris and Molinspiration. Further, interaction of inhibitor with 3CLpro were analysis by molecular docking which followed by analysis of stability of 3CLpro with inhibitors by molecular simulation dynamics.

Phylogenetic analysis
The new outbreak of coronavirus (COVID-19) is the highly infectious and causative agent of respiratory diseases like the SARS. COVID-19 is positive-stranded RNA viruses that contain ~30kb long genome encodes 10 open reading frames. Almost two-thirds genome of COVID-19 encodes two large ORFs (1a and 1b) viral replicase polyproteins 13 . These two polyproteins are extensively cleaved by two viral proteases into 16 mature proteins 13 . These two proteases namely, 3CL Pro and PL Pro are potential drug targets for the coronavirus. In respect of SARS, it is predicted that PLP and 3Clpro cleave the polyprotein at four and eleven places respectively which is predicted the same function in COVID-19. To determine the evolutionary relationship of the COVID-19 and previously discovered CoVs, we construct the phylogenetic tree based on the amino acid sequence of 3Clpro, figure 2. Our phylogenetic tree analysis revealed, 3Clpro of COVID-19 was clustered with the SARS more closely than the MERS. Although, 3Clpro from other coronavirus was distinctly related to COVID-19 and clustered in the different clades. The most striking observations were that 3Clpro of COVID-19 and SARS originated by closely related ancestors. Structural comparison of 3Clpro of COVID-19, SARS (PDB ID: 3VB3) 18 , and MERS (PDB ID: 5C3N) 19 revelled the perfect superimposition with each other. Analysis of superimposed structure shown similar structural fold, figure  3A. The multiple sequence alignment of this protein presented 3Clpro of COVID-19 has 96% and 56% similarity with 3Clpro of SARS and MERS as shown in figure 3B. The similarity of structural and sequence of this protein gives the possibility the protease inhibitor of SARS and MERS would inhibit 3Clpro.

Molinspiration calculations:
Druglikenees and molecular properties of compounds were investigated using the Molinspriration server which is based on Lipinski rules of five. The Lipinski rules of five screens potential oral drug accordingly, the potential drug must have following properties (a) logP ≤5, (b) molecular weight ≤ 500 Dalton, (c) the number of H-bond acceptors ≤ 10, (d) number of H-bond donor ≤ 5 20 . Molecular properties of the selected compounds were tested by Molinspiration properties as shown in table 1. For the further screening the compounds that violated more than one Lipinski rules of five were rejected. All selected compounds have a molecular weight less than 500 except compound 2. Although, high molecular weight compounds are difficult to transported and absorbed in respect to low molecular weight compounds and it also affects the action of the drug. We observed that the number of H-bond donors as well as H-bond acceptors in these selected compounds were follow the Lipinski's rules which were less than 5 and 10 respectively. Lipophilicity (log P) and Topological Polar Surface area (TPSA) are essential properties of the potential drug molecules, which defines the oral bioavailability of the compounds. Penetration of drug molecule through biomembranes require less than 5 logP value, the compound shown in table 1 have acceptable logP value. The higher lipophilicity value of compound helps the compound to interact with the membrane. TPSA was estimated using consideration of O-and N-polar fragments and the related hydrogen atom. It depends on the hydrogen bond-forming potential of the compounds. It is also a defined percentage of absorption using % absorption = 109-0.345×TPSA. Compound shows in table 1 all have TPSA≤140 A 0 which showed higher oral bioavailability. Another important factor is the number of rotatable bond ≤10, present in all selected compound. A higher rotatable bond increases the flexibility of the compound for the interaction with protein and provides good bioavailability. Bioactivity of all selected compounds were analyzed against, GPC (GPCR ligand), ICM (ion channel modulator), KI (Kinase inhibition), NRL (Nuclear receptor ligand), PI (Protease inhibitors), and EI (Enzyme inhibitors) showed in table 2. The compounds have more than 0.00 score are biologically high active, -0.50 to 0.00 score are moderately active and less than -0.50 score represent inactive compounds (r). Bioactivity score is summarized in table 2, in which compound 7 and 10 were biologically inactive for all proteins while compound 8 was inactive for ICM and compound 12 was active for only NRL. Compound 9 was biologically inactive for ICM and NRL proteins. Compound 1 to 5 showed positive bioactive score for all proteins except KI in which compound 1 only showed a positive score for KI other compounds shows more than -0.50 score. Compound 1 to 6 showed good bioactive scores represents that these molecules were effective against GPC, ICM, KI, NRL, PI, and EI.

Toxicology studies
The toxicity risks, drug-likeness (DL), drug score (DS), and Physico-chemical property were estimated by OSIRIS property explorer shown in table 3. Toxicity is predicted based on the screening of fragments which probably cause toxicity. Prediction of toxicity risks were presented by different colour where red colour shows a high risk, yellow colour shows moderate risk, and green colour specifies the drug conform behavior. From table 3, compound 8-10 have a high risk of mutagenicity and tumorigenicity except 10 which also has a high risk of irritation. Further, compound 11 showed the mild risk of mutagenicity and tumorigenicity whereas compound 12 was mild irritants. Compound 1-7 were established by drug-like molecules. LogS represent aqueous solubility of the molecules. In which low solubility of the compound shows poor absorption. Most of the drugs available have logs > -4.00. LogS values of 1-7 were around -4, which shows good solubility and comparable standard drug zone. The drug score is calculated by the combination of toxicity risks, drug-likeness, cLogP, LogS and molecular weight which provide a single value for the potential drug candidate. This drug score is calculated by the following equations: DS = π(1/2+1/2Si)πti (1) Where Si = (1+s ap+b ) -1 (2) Si is the contribution of cLogP, LogS, molecular weight, and drug-likeness calculated from Eq. (2), which is a spline curve. Parameters 'a' and 'b' are (1, -5), (1,5), (0.012, -6) and (1, 0) for cLogP, LogS, molecular weight, and druglikeness respectively. On the other hand, 'ti' represent the contribution of toxicity risk types, which have 1.0, 0.8, and 0.6 values for no risk, moderate and high risk respectively. The potential drug candidate has a positive drug score and indicates a molecule contains pharmacophoric group. All selected compounds exhibited positive drug score. The compounds which did not have toxicity risk only compound 1-3 showed moderate drug-likeness values which indicating these compounds are potential drug candidates.

Molecular Docking analysis:
The structural based virtual screenings of all the selected compounds were done using molecular docking which defines the confirmations of the ligand in 3Clpro protein.  (Supplementary table 2). However, Drug likeness and Molinspiration analysis displayed only compound 1 and 2 that have drug-like properties. Docking of the compounds 1 and 2 are consistent because these compounds shows the interaction with the active site residues of the catalytic pocket of the 3Clpro protein as shown in figure 4. However, compound 2 exhibited the lowest binding energy -7.8 kcal/mol with respect to compound 1 that have binding energy -7 kcal/mol. Docking of the compound 2 established hydrogen bond with Asn142, His164, Glu166, and Gln189 whereas compound 1 has hydrogen bond with His164 and Gln189 at the active site. Compounds 1 and 2 interact with several amino acid residues with hydrophobic as well as van der Waal bonds. The toxicity and docking study established both compound 1 and 2 have a drug like property and could be a potential inhibitor for 3Clpro protein.

3Clpro protein-ligand complex stability analysis:
The stability of protein and the protein-ligand complexes is done by using the molecular dynamics simulation to validate the structure in the context of compound 1 and 2. The RMSD values are monitored under this study by the side of several other factors like temperature, pressure, density, and potential energy. RMSD of the trajectories were calculated using only the backbone atoms. RMSD in MD simulation was calculated by Eq. (3).

Discussion
The recent COVID-19 is a serious threat for the whole world. Outbreaks in 2002 and 2012 SARS outbreak lead to several deaths and in 2020 COVID-19 has already resulted in more than 260,000 deaths world-wide till 8 th April. Non-availability of any specific drug for treatment has put a great emphasis on the new drug discovery to target COVID-19. In our study, we have done the extensive virtual screening and molecular dynamic of 3CLpro protein. Only 7 compounds were satisfied with Lipinski's rule and toxicity study. However, out of 7, only 5 compounds have bioactivity with GPCR, NRL, KI, PI, and ICM. These compounds were docked with 3CLpro, where compound 1 and 2 had low binding energy with respect to the other compounds exhibited good interaction with protein. We carried out MD simulation of unbound 3CLpro and complex with compound 1 and 2. The RMSD and RMSF complexes of compound 1 and 2 with 3CLpro loss stability respect to the unbound 3CLpro. The flexibility of the 3CLpro complex with compound suggested compound 1 and 2 would be a useful inhibitor of 3CLpro which would be helpful drugs for the prevention of COVID-19.

Sequence retrieval and structure modeling
The sequence of 3CLpro of COVID-19 was retrieved from NCBI (https://www.ncbi.nlm.nih.gov/gene/?term=wuhan+coronavirus). The multiple sequence alignment of 3CLpro was done by ClustelOmega 21 and analyze by BoxShade (https://embnet.vital-it.ch/software/BOX_form.html). 3CLpro sequence from several viruses were retrieved from NCBI and UniProt (https://www.uniprot.org/) (Supplementary Table1), which were used in phylogenetic tree construction using maximum like hood method with 1000 bootstraps using Mega7 22 . Homology modeling is important techniques in case of structure-based drug designing applications, especially when a crystallographic structure is unavailable 23 . Tertiary structure of 3CLpro was predicted by using the SWISS-MODEL 24 that is Comparative structure modeling technique by using (PDB id : 3VB3) as a template. The model structure was validated by PROCHECK 25 . The discovery studio visualizer was used for the visualization and superimposition of the 3CLpro structures.

Docking protocol and Drug scanning
The ligand library of protease inhibitors was created from PDB database from the structure of protease inhibitors of SARS and MERS (Supplementary Table2). Active site of the protein is predicted from the ProDRG server 26 . Docking was performed by AutoDock 4.0 program 27 , using the implemented empirical free energy function and the Lamarckian Genetic Algorithm (LGA) 28 . The grid maps were calculated using Auto Grid. In all dockings, a grid map with 44× 44 × 44 points and grid-point spacing from the centre 12.053 x -0.797 x 19.586 was applied. The drug scans of the selected compounds were screened by Molinspiration (https://www.molinspiration.com/cgibin/properties) and Osiris Property Explorer. Molinspiration analysed Lipinski's Rule of Five 20 and drug-likeness such as LogP, molecular weight, protease, enzyme inhibition, and other property. Osiris Property Explorer analyses drug-likeness and toxicity of the compounds.

Molecular dynamics simulations
The structure of 3CLpro was solvated in a cubic box with TIP3P water molecules using AMBER99sb-ILDN. The systems were neutralized by the addition of counter ions (Na + and Cl -). The Steepest Descents energy minimization was performed for 1000 steps. Then, the system was equilibrated by two steps: pressure and temperature equilibration for 1000ps. After the equilibration of systems, the 3CLpro was simulated for 20ns. Then, molecular dynamics simulation data were calculated and analyzed by Gromacs 29 and qtGrace 30 (https://sourceforge.net/projects/qtgrace/) in respect of RMSD (root-mean-square deviation) and RMSF (rootmean-square fluctuation). The methodological workflow is depicted in figure 1.