Mining peptides against SARS-CoV-2 entry from constructed RBMs-hACE2 isomer libraries using machine learning and molecular docking

Cellular entry of SARS-CoV-2 initiates from the protein-protein interactions (PPIs) between viral surface protein S and human angiotensin converting enzyme 2 (hACE2). Peptide-based drugs have the advantage of small molecule compounds to block such viral-host PPIs. Thus the viral targetregions on hACE2 have been believed as promising templates for designing specific inhibitory peptides against SARS-CoV-2 infection. However, starting from a few potential templates, in silico design and prediction between binding affinity and bioactivities in vivo are very challenging, herein a novel design strategy was implemented by mining constructed template isomer libraries using feature filters, supervised classifier and peptide protein docking. Applying these methods and the isomer libraries, 4 peptides were identified from 12 millions candidates owing to their distinct stability, interaction activity, inhibitory specificity, binding affinity, transmembrane potentials and effective conformation. These results have supplied a panel of specific anti-COVID19 leads for further drug development, supporting a new feasible antiviral strategy for targeting both intracellular and extracellular SARS-CoV-2 S proteins simultaneously. The methods have provided a useful tool for mining antiviral-peptides against viral diseases.


Introduction
The viral disease COVID-19 has severely threatened global health and life with high morbidity and mortality because there are currently no approved drugs or vaccines that specifically target severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2) [1][2][3][4] . Relying on its envelope glycoprotein spike (S), this single stranded positive-sense RNA virus is able to rapidly locate on and effectively attach to host cell surface, and consequently initiates cellular entry and triggers organ injury 5,6 . At molecular level, the multidimensional process of the virus invasion has been proven as proteinprotein interactions (PPIs) between the viral protein S and human cellular angiotensin converting enzyme 2 (hACE2) [7][8][9][10] , and indeed such essential ligand-receptor binding has readily gained broad attention as critical targets for blocking the virus infection [11][12][13][14] .
To date, many vaccines, repurposing FDA approved drugs and various monoclonal antibodies have been tested and relevant trials have been conducted 3,[17][18][19][20][21][22][23] , whereas the prophylactic vaccines of ongoing trials do not play role in therapeutic intervention to the patients with SARS-CoV-2 infection 18,19 and there are no independent clinical reports so far to prove the efficacy of drugs for the antiviral therapy 3,17,22,23 , thus specific targeted treatments are urgently needed to combat the pandemic. Recently, several experimental and computational studies have been involved in the development of peptides as hACE2 blockaders or SARS-CoV-2 S inhibitors based on the primary sequence of S-ACE2 binding domains [11][12][13][14]16,24 . Particularly SARS-CoV-2 S RBM-primed regions on hACE2 (RBM-hACE2) have been believed as specific templates for in silico design of the potential inhibitors against CoV-2 infection [11][12][13][14] .
Clinically, peptide derived drugs, for instance insulin, ciclosporin and many others, have represented a new class of therapeutics for different type of diseases [25][26][27] . The molecular mechanisms of action in peptide drugs have received more attention recently for their antiviral potentials 28,29 . Peptide drugs have the advantage of small molecule compounds to block PPIs, especially in different kinds of viral-host PPIs, because peptides are capable of binding to the larger grooves or clefts on an interface, which are hard to target with traditional small molecule drugs due to large size, simple surface or shallow pocket [28][29][30] . Extensive studies demonstrated peptides effectively blocked virus attachment or the entry of a virus into host cells [31][32][33] . These efforts along with recent studies have offered a great hope for discovery of the peptide-based prophylactic and therapeutic agents against COVID-19 [11][12][13][14]31,32 , however, starting from a few potential templates, in silico peptide design and properties prediction across binding affinity, specificity, stability and membrane permeability are still very challenging. Herein a novel design strategy was introduced by mining constructed RBMs-hACE2 isomer libraries using feature filters, supervised classifier and peptide-protein docking [34][35][36] .
Applying these methods and the isomer libraries, 4 peptides were identified from 12 millions of candidates owing to their distinct bioactivities and inhibitory potentials against SARS-CoV-2 entry.
These results have supplied a panel of specific anti COVID19 leads for further drug development, which promisingly support a new feasible antiviral strategy for targeting both intracellular and extracellular SARS-CoV 2 S proteins simultaneously. The pipelined-protocols have provided a useful tool for mining antiviral-peptides against other viral diseases.

Results
Development and application of a traversal method for collecting RBMs-hACE2 homolog templates As shown in Figure 1, three RBMs-hACE2 segments, covering 20 specific binding sites, were selected as primary templates for designing the inhibitors due to the corresponding druggable sites on the 3D S. With desired linkage at N and C terminal, the 3 templates were extracted as: 24QAKTFLDKFNHEAEDLFYQ42, 72FLKEQSTLAQMYPLQ86 and 347TAWDLGKGDFRIL-M360. Using a traversal method, all homolog residues within these 3 extracts ranging in length from 5 to 19 amino acids were collected. As indicated in Figure 2-3, a total of 120, 66 and 55 homolog templates were generated from the RBMs-hACE2 of Q24-Q42, F72-Q86 and T347-M360 respectively. The sequences of these homolog templates were detailed in Table S1, Supplementary data. Obviously the possibility for mining satisfactory inhibitors was increased 80 times since the number of templates was enlarged from 3 to 241.

CoV-2 S protein
For screening the stability and interaction activity of the homolog templates, a feature filter was set up based on computing the values of two physico-chemical properties, Boman and Instaindex 37,38 .
Through this filter, a total of 12 homolog templates were selected from above 241 peptide segments, including 5 from Q24-Q42 and 7 from D347-M360 ( Figure 4). All peptide homologs from Q72-Q86 were excluded for downstream analysis due to low stability and interaction activities (Figure 3-4 and Table S1 Supplementary data). In parallel, structural datasets of these homologs were generated from the extracts of hACE2 chain E ( Figure 4). These 12 filtered-templates possessed high stabilities and interaction activities.

Development and application of a permutation program for constructing RBMs-hACE derived isomer libraries
In order to retain the binding specificity and minimize undesired bioactivities from the wild type template, an isomer library of each template generated using a permutation method. One million isomers were sampled from billions of isomers randomly, this approach provided a large number of candidates for further optimisation. As summarized in Figure 3, a total of 10^6 inhibitory candidates were settled by the computational sieving again, and hence promisingly possessed not only high stability and interaction activity but also low potentials of unwanted bioactivities.

Application of a machine learning tool for predicting transmembrane potentials of the inhibitors
Doubtlessly in vivo distribution, a key element of the well-known ADMET, has critically affected clinical application of the peptide drugs and has strongly challenged all steps in the development of effective peptides 39 . To address this issue, transmembrane potentials were predicted among the selected peptide isomers by SVM modeling. As shown in Figure 3 and Table S2 Supplementary data, 618 peptides, labelled with either extracellular or transmembrane potentials, were selected from 10^5 isomers of 12 wild type homologs for further molecular docking assay. This classification provided fundamental support to carry out a new antiviral strategy by using a panel of the peptides for targeting both intracellular and extracellular SARS-CoV-2 S proteins simultaneously.

Application of a peptide-protein docking approach for validating binding affinity and effective conformation between the inhibitors and SARS-CoV-2 S
The inhibitory potentials of the 618 selected peptides were finally validated by in silico docking each of them with the known binding sites on SARS-CoV-2 S protein ( Figure 1). Since these peptide sequences were originally derived from 12 different homolog templates ( Figure 4D-E), each templates tertiary structure was extracted as a reference ligand for the relevant docking. Using the reference ligands and the docking receptor SARS-CoV-2 S chain E ( Figure 4B, 4D-E), individual affinity maps were computed by the ARDF model in AutoDock CrankPep. As shown in Figure 5 and in Table S2 Supplementary data, following 618 docking assays (all dots), 4 peptide ligands (red dots) were finally determined as the most promising leads against SARS-CoV-2 S because the mean values of all atoms Root-Mean-Square Deviation (RMSD) and binding affinity were less than 2.5 Å and -10 kcal/mol respectively among all the first 10 docking ranks 36,40 . These outputs were the first demonstration of the effective binding activities between RBMs-hACE2 isomers and SARS-CoV-2 S and suggested that these 4 peptide isomers retained inhibitory potentials against SARS-CoV-2 infection.

Interface analysis of the docking complexes
Based on the spatial extract of SARS-CoV-2 S ( Figure 4B) and docking conformation of the isomer ligand, the interface contacts within each docking complex were produced by a Python program using the Pymol package. In order to understand the significant docking outputs mostly and facilitate the comparisons with previous studies, data was generated from the contacts within 3.5 ( Figure 6 and Table 1) and 3.8 angstroms separately (Table S3 Supplementary data). As shown in Figure 6, the total contacts exhibited higher density and more evenly distributed between L4 and SARS-CoV-2 S RBM compared with that in wild type W2. Similar differences were discovered in each pairwise comparison between W1 and L1, L2, L3. Specifically, on the SARS-CoV-2 S sides, all contacted residues belonged to the known binding sites (Table 1 and Table S3 Supplementary data).
In line with previous findings 8 (Table 1 and   Table S3 Supplementary data). Notably, these observations provided the evidence that all networks among the contacts, including all strengths, contributed to the distinct RMSD and binding affinity of these 4 isomer leads.

Retrospective comparison of the isomer inhibitor and the corresponding wild type template
The structure of each selected isomer was folded and generated from sequential input by Autodock modeling. As displayed in Figure 7A, each helical backbone of the isomer is similar to the structure of wild type extract, and it is indicated as reliable right-handed alpha helix by further Ramachandran plots 41 since 100% of residues were outlined in the blue areas at -180~ 0 degrees and 98.7% (1/72) of them were located within the dark blue islands ( Figure 7B). Indeed these short helical structures were well retained in all isomers according to the primary compositions of wild types despite the random permutation process. Typical hydrophobic face is not seen in Lead 1, 3 and 4 though it seems to be in W1, W2 and L2 ( Figure 7C). As emphasized before, the 4 distinct isomers were determined and selected by the experiential thresholds of RMSD and binding affinity 36,40 though there were no statistical differences of RMSD and binding affinity among the isomers and wild types ( Figure 8). In line with early studies 42 , various physicochemical properties were found in the millions of isomers (data not shown). In particular, the stability of isomers L1-L4 was increased 30-50 times compared to the wild types. Isomers with or without transmembrane potentials were obtained from the constructed libraries (Table 2), which were failed to select from wild types due to the limited sample size (data not shown).

Discussion
Preventing the binding between SARS-CoV-2 S and hACE2 is an effective action to fight COVID-19, however, to fit this, there are neither new approved compounds nor repurposing existing drugs from large number of basic studies and clinical trials 3,4,20,21,23 . This insufficiency might suffer from an essential difficulty that small molecule compounds were not effective at blocking PPIs where a deep binding pocket might be missing at the interface 11,25,43 . While other studies brought advantages of peptide-based inhibitor for preventing or interfering the viral host PPIs 31-33 , which has been further extended by recent researches of peptide-based inhibitors against SARS-CoV 2 entry 11-14 . However, along with the structure to binding affinity, in silico peptide design across bioactivity, stability and distribution profile in vivo are critical elements for the development of effective inhibitors 44,45 . To address these challenges, a novel design strategy was carried out in this study.
In line with the attempts of previous studies [11][12][13][14] , three sequential peptide segments were initially extracted from the significant druggable regions of the hACE2 interface, including Q24-Q42, F72-Q86 and D347-M360 (Figure 1). These 3 wild type residues were promising primary templates for starting the design because of the potential binding affinity and specificity 8,[11][12][13][14] . Furthermore, necessary optmizations were made to tackle three major difficulties in the template-based design: (i) Properties of the wild type origins could cause undesirable effects to host or be affected by endogenous ACE2 substrates since both membrane bound and circulating form of hACE2 play important enzymatic roles in vivo 15,44,45 (ii) A panel of antiviral peptides with and without membrane permeability are indispensable for effectively targeting the viral protein S because functional exposures of this surface protein are involved into both intracellular and extracellular distribution through whole viral life cycle 39,46,47 . (iii) The small sample size of templates could limit the selection of satisfactory interaction activity, binding affinity, membrane permeability and stability.
Logically all peptide homolog segments within these 3 wild type templates may retain specific binding affinity since the full-length templates possess the known binding sites 8,9 , thus a total of 241 homologous sequences ranging in length from 5 to 19 amino acids were collected and pooled for the selection by feature filter. In theory, position change of amino acids in peptide isomers strongly modulated the relative stabilities of topologically similar regions in the energy landscape, rather than redefined the topology space 48 . Experimentally, certain isomeric versions of antimicrobial peptide significantly increased in vitro activity against bacterial pathogens 49 . These demonstrations raised a hypothesis that the isomer pools established from wild type templates could reserve a certain number of candidates retaining desired target specificity, binding affinity, distribution properties and stability. This was finally verified by mining constructed isomer libraries using a SVM classifier and peptide docking [34][35][36] (Figure 3 and 5). Figure 2 and 3, based on the 3D structures of RBMs-hACE2, millions of the homolog isomers were generated by computational traversal and permutation methods. These candidates were screened through a feature filter, a Python program using modlamp and AutoDock CrankPep 35,36,40 , 4 peptide leads with potential inhibitory activity against SARS CoV-2 S were identified by their distinct binding activity, stability, binding affinity, effective conformation and transmembrane potentials (Table 1-2, Figure 3 and 5). Interface analysis unveiled that these peptide leads exhibited specific binding activities against SARS-CoV-2 S, encompassing whole RBM regions through K417, Y453, L455, F456, A475, N487, Y489, Q493, S494 to Y505. It was also found that the molecular-atomic actions of these bindings were involved the formation of N O hydrogen bonds, salt bridges and many other bonds. Most significantly, these salt bridges were all linked to the unique residue K417 in SARS-CoV-2 S, which has been thought to contribute to the high contagiosity of SARS-CoV-2 in humans compared to SARS-CoV 8,9 , therefore, the salt bridge associated bindings between the druggable target and the inhibitory isomers might enhance the blockage of viral entry. In addition, these findings correlate with the early studies 50 and thus support that either strong or weak bonds can make an important contribution to the association and stability of protein complexes 50,51 .

As indicated in
In sum, this work has provided specific and effective leads as therapeutic agents against COVID-19 before the synthesis process and wet experiments. The peptides with different transmembrane potentials promisingly support a combination antiviral therapy for targeting both intracellular and extracellular SARS-CoV-2 S. The computational design has combined multiple approaches, exploring a new way to identify effective and specific antiviral peptides by mining viral-target isomers.

Extraction of sequential and spatial RBMs-hACE2
Initially, an X-ray crystal structure data of SARS-CoV-2 RBD / hACE2 complex was retrieved from RCSB Protein Data Bank (PDB, https://www.rcsb.org /structure/6MOJ) 8 . Three segments of RBMs-hACE2, Q24-Q42 / F72-Q86 / T347-M360, were determined and selected as primary templates for the peptide design according to a literature search and the definition of sequential and spatial motifs 8,52 (Figure 1-3). All sequential residues from three original RBMs-hACE2 templates were denoted as wild-type segments in this study. Both primary sequence and tertiary structure of each wild type, SARS-CoV-2 S chain E and hACE2 chain A, were extracted respectively by a Python program using libraries such as PyMol, Biopython and Pandas [53][54][55] Search and collection of RBMs-hACE2 homolog by traversal method All homolog segments from RBMS-hAC2 were mined to maximise the chances of finding potential bioactivity and binding affinity. Another Python program using a custom traversal method was developed. Using this program, fragments within the sequence RBMS-hACE2 ranging in length from 5 to 19 amino acids were filtered and collected as primary templates (Figure 2 and 3).

Construction of RBMs-hACE2 homolog isomer libraries by permutation method
In order to use the selected templates to generate isomers with different physico-chemical properties, RBMs-hACE2 homolog isomer libraries were built using a permutation method. Using a random permutation algorithm, millions of peptide segments were generated by reordering amino acid positions of the RBMs-hACE2 homolog. Datasets were specifically sampled according to the wild type template origins (Figure 2 and 3).

Selection of inhibitory candidates against SARS-CoV-2 S via a feature filter
A custom feature filter was written using the R package Peptides. With this feature filter, each peptide sequence from the isomer library was mapped to numerical data which incorporated two physico-chemical properties, Boman and Instaindex. The candidate was selected for downstream analysis by having a Boman value of over 2.0 and an Instaindex value less than 0.0 or 40.0 37,38,56 .

Prediction of transmembrane potentials for the inhibitory candidates using SVM classifier
In order to obtain the inhibitory candidates with and without membrane permeability, transmembrane potentials of each peptide isomer were predicted by a supervised machine learning tool. A support vector machine was written in Python and optimised using the Modlamp package. 34,35 . Peptides with or without transmembrane potentials were classified by inputting the selected peptide sequences with the built-in training dataset TM-AMP provided by the Modlamp package 35 .
The threshold of possibility was defined as 0.99.

Validation of the inhibitory specificity and affinity by peptide-protein docking
According to the known binding sites, the selected peptides were validated by peptide-protein docking using Autodock CrankPep 1.1 36 . Each docking assay was performed between the receptor SARS-CoV-2 S and the ligand peptide using a corresponding reference grid box. Each docking grid box (the affinity map) was constructed using the S receptor and an individual reference ligand. The reference ligand was determined by the wild type template and extracted from the structure of hACE2 chain A 8 . Missing side chain atoms in both receptor and ligand were completed using the Reduce module in the AutoDock CrankPep 36,57 . Large batch docking assays were achieved by Bash scripts. Large datasets of docking outputs were processed by a Python program, which analyzed the docking interface using PyMol-API.

22
As shown in Figure 5, outputs of 618 docking assays were plotted by RMSD against binding affinity. Each dot represents the docking result from each peptide candidate. As shown in left, 4 isomers in red were determined as promising inhibitors against SARS-CoV-2 S because both values of RMSD and binding affinity were < 2.5 and < 0. 23 These images show the general contacts between peptides (green) and SARS CoV-2 S (pink) within 3.5 angstroms.
Data represent the contact as red-dotted line and the distance label in black. The two group images include the selected Lead 1-3 (L1-3) derived from wild type 1 (W1) and Lead 4 (L4) derived from wild type 2 (W2).

24
As shown in Table 1., the general contacts within 3.   The tertiary, secondary and primary structure of each selected inhibitor and wild type origin were visualized by PyMol spatial extraction, Ramachandran and Wenxiang plots. As shown in A, each helical backbone of the isomer is similar to the structure of wild type extract. These 3D structures are indicated as reliable righthanded alpha helixes by the Ramachandran plots (B). The well-retained helical backbone can be observed in all isomers by Wenxiang plots (C). Typical hydrophobic face is not seen in Lead 1, 3 and 4 though seems to be in W1, W2 and L2.
Four RBMs-hACE2 isomers L1-L4 were identified as the potential inhibitors against SARS-CoV-2 S. Leads L1-L3 were derived from the wild type W1 and L4 from W2.