ASCoVPred: a machine learning-based platform for quantitative prediction of anti-SARS-CoV-2 activity and human cell toxicity of molecules

There is an urgent need to accelerate the discovery of effective drugs for COVID-19. We have developed machine learning models for rapid discovery of molecules potentially inhibitory to SARS-CoV-2 and negligible or no human cell toxicity. The machine learning (ML) QSAR models were trained and optimized with features (descriptors and fingerprints) of the experimentally validated SARS-CoV-2 inhibitory compounds. Several molecular descriptors and fingerprints were calculated to select the decisive ones for the training and evaluation of thousands of ML models. The best-optimized models are deployed as ASCoVPred webserver and standalone software, that provides easy and free access to the models. The feature selection for selecting the best descriptors for ML models training helped identify a set of decisive descriptors and fingerprints that correlate positively or negatively with the anti-SARS-CoV-2 activity and toxicity of the compounds. Systematic prediction and optimization of compounds with the help of ASCoVPred can facilitate the discovery of novel anti-SARS-CoV-2 compounds. The ASCoVPred web server and standalone software are freely available at http://14.139.62.220/ascovpred/.


Introduction
According to the World Health Organization (WHO) online dashboard 1 , as of October 13, 2021, severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has infected more than 238.22 million people and is responsible for >4.85 million deaths globally. Despite having high morbidity and mortality rates, the world is limited to a few medicines to treat Coronavirus disease or COVID-19, an infectious disease caused by SARS-CoV-2. Although several vaccines have been developed to reduce the disease burden, effective antivirals are still required to treat infected and hospitalized patients. We are also not sure of another impending wave of viral infections. An antiviral drug Veklury (remdesivir), is the only U.S. Food and Drug Administration (FDA) approved available for adult and certain pediatric COVID-19 patients. Also, several drugs are in different stages of trials as the quest for an effective anti-SARS-CoV-2 continues (e.g., molnupiravir) 2 . FDA has also issued emergency use authorization (EUA) for other types of treatments such as monoclonal antibody-based treatments 3 . However, treatment options, such as the development of antivirals, immunomodulators, neutralizing antibody therapies, cell therapy, etc., are ongoing and yet to pass through different clinical trials 4 . Thus, efforts with multiple approaches are continuing to discover novel anti-SARS-CoV-2 drugs as quickly as possible.
However, in vitro discovery of novel inhibitors is tedious, labor-intensive, time-consuming, apart from being costly exercise. However, computational predictions facilitate in vitro discovery by shortlisting the most effective chemical entities, saving time and cost.
Recently, several drug targets have been explored to design novel drug molecules against SARS-CoV-2. Few of the identi ed drug targets are essential for the viral entry into target cells and survival into host cells, while others play an indispensable role in viral growth. For example, the viral surface Spike protein (Sprotein) is essential for the attachment of the virus to human angiotensin-converting enzyme 2 (ACE-2), present on the target human cells 5,6 . Thus, S-protein plays a vital role in the SARS-CoV-2 entry into human target cells is an attractive drug target. Furthermore, a human serine protease named transmembrane serine protease 2 (TMPRSS2) also plays an essential role in the entry of SARS-CoV-2 into human target cells through S-protein priming. Hence, inhibiting the TMPRSS2 activity may lower the risk of disease progression 6-8 . Thus, as S-protein and TMPRSS2 play a vital role in SARS-CoV-2 entry into human target cells, many efforts are going on to develop drugs against these viral targets 9 . There are reports of computational approaches, including molecular docking and machine learning (ML)-based classi cation algorithm development, to identify suitable anti-SARS-CoV02 inhibitors [10][11][12][13] . Moreover, drug repurposing approaches have been explored to quickly identify the approved drug molecules which may bind to SARS-CoV-2 drug targets [14][15][16] . Most of such computational studies have used docking or molecular dynamics, and combinations of the two, to explore FDA-approved drugs as potential SARS-CoV-2 inhibitors 10,14,17−21 . However, systematic attempts to develop ML-based models through quantitative structure-activity relationship (QSAR) approaches are lacking, which motivated us to develop ML-based QSAR models that could rapidly screen large chemical libraries to identify anti-SARS-CoV-2 compounds.
In the present study, we have explored the publicly available National Centre for Advancing Translational Sciences (NCATS) chemical assays data 9,22 to develop ML-based QSAR models to predict anti-SARS-CoV-2 activity (primarily related to preventing viral entry into the target human cells) and toxicity of compounds (against HEK293 and Vero E6 cells). The best models are deployed on an online webserver and standalone software "ASCoVPred" that predicts the activities of user-submitted compounds in an automated way. In the future, we plan to update the webserver and standalone software with additional MLbased prediction models with improved prediction performance.

Data mining
The data from the nine assays of high-throughput screening (HTS) were passed through rigorous pre-processing steps. As mentioned in the materials and methods section, only the compounds with SMILES structures and maximum response values were used in the study, while others were discarded. Furthermore, after the descriptors and ngerprints (FPs) calculations, only the compounds with the calculated descriptor values were selected for dataset preparation.
The duplicate records were removed from the training datasets to eliminate the over tting of models during training steps. The duplicate records correspond to the compounds with identical descriptors/FPs values and maximum response values. For the details of records after the ltering steps, see Supplementary   Tables 1-9. The compounds listed in the supplementary les were used to train and evaluate the machine learning (ML)-based QSAR models. Moreover, model building and training of models was performed using only the datasets with the ratio of the number of compounds to the number of descriptors at least ve, as suggested for best practices for QSAR studies 23 . The number of molecules used for the training and evaluation of best prediction models and ML-based models' performances is described in the subsequent sections.
Molecules used for training and evaluation of best models The assay-wise summary of the molecules used for models' training and evaluation is shown in Table 1. For further details of the training dataset molecules, refer to Supplementary les 1-9. Table 1 The number of molecules used for training and evaluation of best prediction models. Performances of best prediction models The performance measures of the best machine learning (ML) models, trained and evaluated with nine different assays' experimental data, are given in Table   2. The regression techniques (along with parameter optimization) available in the Weka (v3.8.2) 24 package were used to train and evaluate multiple ML models (for details, please refer to methods). However, the best results are achieved with RandomForest only, either as a main "Classi er" or a hyper-parameter for another "Classi er." Thus, ve out of the best nine ML models (based on assay IDs 1, 2, 8, 15 and 20) performed best. Hence, we deployed them at the ASCoVPred webserver and standalone software. The ve best ML models achieved Pearson Correlation Coe cient value (R; between actual and predicted maximum response values), greater than 0.50 for training (in ve-fold cross-validation), as well as external validation dataset molecules. The ML models which did not perform well are based on the four assays, namely the "ACE2 enzymatic activity" assay (Assay ID-6), the "3CL enzymatic activity" assay (Assay ID-9), the "SARS-CoV-2 cytopathic effect (CPE)" assay (Assay ID-14) and the "Human broblast toxicity" assay (Assay ID-21). Due to poor performance of ML models with these four assays, the models were rejected. Pro le for the desired multi-target molecule For any molecule to be an ideal multi-target hit (against SARS-CoV-2 and human TMPRSS2), it must be predicted as highly active by AlphaLISA assay (Assay ID-1), least active by TruHit counterscreen assay (Assay ID-2), highly active by TMPRSS2 enzymatic activity assay (Assay ID-8), least active by "SARS-CoV-2 cytopathic effect (host tox counterscreen)" assay (Assay ID-15) and HEK293 cell-line assay (Assay ID-20). The desired activity prediction pro le for an ideal multi-target molecule is shown in Table 3. In addition, the scales used to assign "Activity Class", are given on the ASCoVPred webserver's prediction results page. Descriptors or ngerprints associated with activities of molecules For a QSAR study, it is vital to identify the putative descriptors or ngerprints associated with the biological activity of the molecules. Therefore, in the present study, Pearson Correlation Coe cient (R) based-analysis of molecules was performed to understand the types of descriptors or ngerprints present in highly active (most potent) versus least active molecules. Supplementary les 10-19 contain the results of these analyses.
The analysis with human cell-line toxicity assay named "HEK293 cell line toxicity" (Assay ID-20) provided top 10 negatively correlated and top 10 positively  Figure 1) and positively (Supplementary le19.xlsx, Tables2-4, Figure   1) correlated FPs showed considerable differences in their average FP values amongst highly active and least active compounds.

Utility of ASCoVPred
The in silico designing of inhibitors is an essential task for any drug discovery strategy. Also, the real-life utility of a computational study lies in its practical day-to-day usage. Thus, to enhance the utility of the present study and best use the developed ML-based prediction models, a webserver named "ASCoVPred" has been developed. Moreover, a standalone version of the ASCoVPred has also been developed for the high-throughput screening (HTS) of compounds. Currently, the standalone version of ASCoVPred has been tested for the Linux (Ubuntu 18.04 LTS) and Mac operating system (OS) by supplying thousands of compounds as input (at a time) in SMILES format. Thus, the standalone software can be run in parallel mode to screen multiple chemical libraries simultaneously. Whereas the webserver version of ASCoVPred provides a user-friendly web interface to the best prediction models. Users interested in anti-SARS-CoV-2 drug-discovery projects can use the webserver to rapidly discover or lter a single compound within a minute. Hence, the webserver possesses the potential to speed up the SARS-CoV-2 inhibitor discovery. The current version of the ASCoVPred webserver is freely accessible from http://14.139.62.220/ascovpred/. The server and standalone software provide various modules to perform specialized tasks such as virtual screening of large chemical libraries, designing potent inhibitors, improving the biological activity of molecules, etc. Figure 1 provides the screenshots to display the functioning of ASCoVPred webserver and standalone software. For further details, refer to the online help on ASCoVPred webserver.

Discussion
The discovery of novel hits is a tedious process in drug discovery that requires several years of massive investments of time, manpower, resources, and money. However, using computational drug discovery resources can minimize the usage of valuable resources, improve e ciency to accelerate the drug discovery process. To facilitate drug discovery for COVID-19, the present study focuses on developing a computational resource or webserver "ASCoVPred" and standalone software that helps design e cacious and safe to use inhibitors against SARS-CoV-2. The valuable experimental activity data of compounds from HTS assays (downloaded from the NCATS portal) were utilized for training and evaluating ML-based prediction models. Finally, the best performing models are deployed as a publicly available prediction webserver, ASCoVPred and standalone software (for Linux and Mac OS).
The analysis performed with the compounds (used in the training of best models) has provided some interesting descriptors and FPs with potential roles in the determination of activities of compounds. For example, in AlphaLISA assay (Assay ID-1), the top 10 negatively correlated FPs (with the maximum response value of compounds) includes a count of chiral center speci ed (SubFPC307), aromatic functional groups (SubFPC274), conjugated double bonds (SubFPC287), vinylogous esters (SubFPC137), phenols (SubFPC169), arylchlorides (SubFPC171), vinylogous acids (SubFPC136), tertiary mixed amines (SubFPC33), etc. Furthermore, the highly active compounds possess higher average values for these FPs than the least active compounds (Supplementary le10.xlsx, Figure 1). Therefore, a compound with higher values corresponding to these FPs might act as a suitable inhibitor (highly active) to disrupt the Spike-ACE2 interaction, thus preventing viral entry into the human host cells.
The TruHit counterscreen assay (Assay ID-2) ensures the identi cation of false positives from the AlphaLISA assay, thus being helpful in the identi cation of non-speci c inhibitors. For any compound to be a true positive, it must be detected as inactive or least active in the TruHit counterscreen assay 9,13 . From the analysis of training dataset compounds, the top 10 negatively correlated 1D and 2D descriptors (with the maximum response value of compounds) possess higher average values for descriptors, namely, doubly bound carbon bound to two other carbons (C2SP2), doubly bound carbon bound to three other carbons (C3SP2), Crippen's LogP (CrippenLogP), simple path, order 7 (SP-7), excessive molar refraction (MLFER_E), number of 10-membered fused rings (nF10Ring), the coe cient sum of the last eigenvector from topological distance matrix (VE1_D), maximum atom-type E-State: =C< (maxdssC), maximum atom-type E-State: =CH-(maxdsCH) and the coe cient sum of the last eigenvector from Barysz matrix / weighted by rst ionization potential (VE1_Dzi), in case of highly active compounds than least active compounds (Supplementary le12.xlsx, Figure 1). Therefore, for any compound to be least active or inactive in the TruHit counterscreen assay, the ideal values for these 1D and 2D descriptors must be low. Whereas, in the case of top positively correlated 1D and 2D descriptors (with the maximum response value of compounds), a few descriptor types such as MATS2c, nAcid and MATS2e, showed signi cant differences of average values amongst highly active and least active molecules (Supplementary le13.xlsx, Figure 1). Therefore, for any compound to be least active or inactive in TruHit counterscreen assay, the compound may possess higher values for the descriptors, namely, Moran autocorrelation -lag 2 / weighted by charges (MATS2c), number of acidic groups (nAcid) and Moran autocorrelation -lag 2 / weighted by Sanderson electronegativities (MATS2e), than the highly active compounds.
The TMPRSS2 enzymatic activity (Assay ID-8) is speci c to identi cation of compounds with the potential to inhibit the activity of the human TMPRSS2 enzyme, thus preventing the viral entry into the human cells 9 . Although the trained models with this assay data didn't show high performance in ve-fold cross-validation (R = 0.52), better performance is observed with the external validation dataset compounds (R = 0.73, Table 2). These kinds of results point towards the reliability and good performance of a model in real-life practices. Further analysis with training dataset compounds provided top 10 negatively correlated FPs having very poor correlation (R-value) between FP values and the maximum response value of the compounds (Supplementary le14.xlsx, Table 1). However, a few of the top 10 positively correlated FPs, namely, primary arom amine (SubFPC28), hetero N nonbasic (SubFPC181), heteroaromatic (SubFPC184) and heterocyclic (SubFPC275), achieved slightly higher values of R (Supplementary le15.xlsx, Table 1). Furthermore, these FPs possess lower average values for the highly active compounds than the least active compounds (Supplementary le15.xlsx, Figure 1). Therefore, for any compound to be highly active in TMPRSS2 enzymatic assay, the values for these FPs should be ideally low.
For any compound to act as a drug molecule, it must be non-toxic for normal cells in the human body. Therefore, the toxicity prediction of compounds against normal human cells is the most important task in a drug discovery pipeline. The SARS-CoV-2 cytopathic effect assay (host tox counterscreen, Assay ID-15) can help in the estimation of toxicity of compounds against Vero E6 cells. From the FPs level analysis of training dataset compounds, few negatively correlated FPs with the maximum response values are identi ed. For example, FPs, namely "nAromBond", "naAromAtom", "nBondsM" and "SwHBa", showed signi cant differences in their average maximum response values amongst the highly active and least active compounds (Supplementary le16.xlsx, Figure  1). Thus, for a compound to act as a drug, it should ideally have lower predicted maximum response values for this assay. Moreover, the compound might possess the least number of FPs, namely, aromatic bonds (nAromBond), aromatic atoms (naAromAtom), bonds that have bond order greater than one (nBondsM); and least value for the sum of E-States for weak hydrogen bond acceptors (SwHBa). Furthermore, the HEK293 cell-line toxicity assay (Assay ID-20) also ful lls the goal of testing the compounds' activity or toxicity against human HEK293 cell lines. For any compound to behave like a drug molecule, it must be least active or inactive in this assay. Furthermore, the analysis with training dataset molecules (used in the training of best prediction model) provided the FPs that are negatively and positively correlated with the maximum response value of compounds (Supplementary les18-19.xlsx, Table 1). Interestingly, some of the top 10 negatively correlated and all top 10 positively correlated FPs (with the maximum response value of compounds) show differences amongst the highly active and least active compounds (Supplementary les18-19.xlsx, Figure 1). The negatively correlated FPs, namely, aromatic groups (SubFPC274), hetero N nonbasic groups (SubFPC181), heteroaromatic groups (SubFPC184), heterocyclic groups (SubFPC275), chiral center speci ed (SubFPC307), amine groups (SubFPC23), tertiary aliphatic amine groups (SubFPC26) and hetero N basic no H groups (SubFPC180), show lower average values for least active molecules w.r.t. highly active molecules (Supplementary le18.xlsx, Figure 1). Whereas the top 10 positively correlated FPs show higher average values of FPs, namely, carboxylic acid groups (SubFPC84), 1,2-Diol groups (SubFPC41), alcoholic groups (SubFPC12), secondary alcohol groups (SubFPC14), sugar pattern 2 beta (SubFPC286), sugar pattern 2 alpha (SubFPC285), sugar pattern 2 (SubFPC282), sugar pattern 1 (SubFPC281), sugar pattern combi (SubFPC283) and primary alcohol groups (SubFPC13), for the least active compounds w.r.t. highly active compounds (Supplementary le19.xlsx, Figure 1). Therefore, designing the compounds with higher counts of these FPs values would help predict non-toxic compounds (against normal HEK293 cells).
The descriptors and FPs determined through the chemicals data analysis can be very useful in designing highly (against SARS-CoV-2) or least active molecules (against normal human cells) in the future. Thus, essentially deterministic features have been identi ed from experimental data. However, manual optimization of multiple descriptors and FPs is a challenging task that must be performed to design the compounds with desired properties.
Although, the current version of ASCoVPred predicts the activity of input compounds with the calculated e ciency, it may not predict e ciently for targets/assays it is not trained for. This limitation can be addressed by including additional models trained with compound activity datasets representing other targets that are not included in the current version of ASCoVPred. We will continue to update ASCoVPred with additional models, apart from improving the e ciency of existing models.

Conclusion
The designing of strategies for the rapid discovery of anti-SARS-CoV-2 compounds is an urgent need of the hour. Machine learning-based approaches in drug discovery and design are time-saving and cost-effective. The present study is based on computational designing of anti-SARS-CoV-2 compounds and estimates their toxicity against normal human cells. ASCoVPred webserver and standalone software are very useful in rapidly discovering inhibitors against SARS-CoV-2 and preventing viral entry into the human host cells. Also, the toxicity of molecules against normal human cells (HEK293 and Vero E6 cell-line) can be estimated with the help of toxicity prediction models deployed on the ASCoVPred platform.
Furthermore, the functional groups associated with the anti-SARS-CoV-2 activity of the molecules may provide better insights while designing the better hit molecules. In the future, the development of more ML models (trained and evaluated with more NCATS assays data) could enhance the utility of the ASCoVPred platform. We will also continue to improve the performance of the deployed models and update those on the ASCoVPred platform.

Data source
In the present study, a total of nine high-throughput screening (HTS) assays data were downloaded from National Centre for Advancing Translational Sciences (NCATS) website and used for the machine learning (ML)-based models training and evaluation (Table 4). Initially, nine tab-separated les were downloaded from the NCATS website. Of the nine les, each contained information such as "Sample ID", "Sample Name", "PubChem SID", "Primary MOA", "Assay Name", "CAS ID", "AC50 value", "E cacy", "Maximum Response", "Drug Name", "SMILES", etc. Of the given columns, two columns, namely "SMILES" and "MAX_RESPONSE", were extracted from each tab-separated le for further processing and machine learningbased models training and validation. According to the description given on the NCATS data browser 25 , a molecule with a MAX_RESPONSE value approaching -100 cab be considered as highly active while the one with MAX_RESPONSE value approaching zero or towards positive values can be considered as least active or inactive.

Description of HTS assays
The nine HTS assays (Table 4) used to test compounds' bio-activities by NCATS can be broadly categorized into four different types viz. assay used to determine the molecules that (i) prevent viral entry into host cells (ii) prevent viral replication into host cells (iii) reverse the cytopathic effect of host cells (caused by SARS-CoV-2 virion) and (iv) show toxic effects against normal human/host cells.
AlphaLISA assay (Table 4, Assay ID-1) contains the therapeutic molecules that can potentially disrupt the interaction between SARS-CoV-2 Spike protein and the human host ACE2 receptor. The interaction between SARS-CoV-2 Spike protein and host ACE2 receptor plays a vital role during the viral entry into the human host 6 . Thus, this assay aims to identify the molecules with the potential to prevent SARS-CoV-2 entry into the human host. The second assay, i.e., TruHit counterscreen (Table 4, Assay ID-2), helps identify false-positive compounds that interfere with the AlphaLISA readout in a non-speci c manner.
ACE2 enzymatic activity assay (Table 4, Assay ID-6) measures the ACE2 inhibitory potential of compounds to prevent the disruption of endogenous enzyme function. The TMPRSS2 enzymatic activity assay (Table 4, Assay ID-8) is speci c to measure the TMPRSS2 inhibitory potential of compounds. Recent studies have highlighted the importance of human TMPRSS2 as an antiviral drug target and made attempts to screen the molecules with inhibitory potential against TMPRSS2 19,26 .
The 3CL enzymatic activity (Table 4, Assay ID-9) measures the inhibitory potential of molecules against viral 3-chymotrypsin like protease (3CLpro), the main protease of SARS-CoV-2. The enzyme is pivotal for the viral replication within human host cells 17,27 . Thus, it is an important anti-SARS-CoV-2 drug target in COVID-19 drug discovery.
CPE assay ( Data pre-processing The pre-processing of data is an important task for any machine learning study. Therefore, a systematic approach is applied to process the molecular structure data while molecular descriptors and ngerprints calculation (Figure 2). The parameters opted on PaDEL software (before actually starting the descriptors / FPs calculation) are "Remove salt", "Detect aromaticity", "Standardize nitro groups", "Max. threads -1", "Max. waiting jobs -1", "Max. Running time per molecule: 12,00,000 milliseconds", and "Retain molecules order". Furthermore, only those molecules for which all the descriptor/ ngerprint values are calculated that have been used for ML-based models training, external validation and further analysis.

Preparation of datasets for models training and validation
Data pre-processing and ltering are followed by redundancy removal to retrieve the dataset of unique molecules.

Cross-validation technique used
The prioritization of best models (with the highest e cacy) is essential in any ML-based study. In the present study, thousands of models (the results for best models from each assay are given online at http://14.139.62.220/ascovpred/supple.php) were trained with all regression techniques available in WEKA, however, selection of the best models are made through the ve-fold cross-validation technique. This technique has been widely used in previous studies to prioritize the best prediction/classi cation models 34,35,38−42 . In the ve-fold cross-validation technique, the whole training dataset is divided into ve almost equal-sized parts and each part is used in the training and testing of the prediction models.
Formulae used to evaluate the models' performance The in-built functions available with WEKA (v3.8.2), such as Pearson Correlation Coe cient (R; equation (1)), mean absolute error (MAE; equation (2)) and root mean squared error (RMSE; equation (3)), have been used to evaluate the models' performance through ve-fold cross-validation technique. In both internal and external validation, the models with the highest R-value and lowest MAE and RMSE values are selected as the best prediction models. The latter ve models are hosted on the ASCovPred webserver and standalone software for public use. The equations for the used formulae are as given below: 2 1 N For i th compound, Yi and Xi represent predicted and actual maximum response values, respectively. N is the total number of compounds. The value of R is used to measure the quality of a model. The value of R varies from −1 to +1. The negative value of R shows the negative correlation with a particular property or feature. Thus, higher the value of R, the better the model's quality in terms of the predicted maximum response value of the compounds.

Identi cation of descriptors of FPs associated with the maximum response
To understand the relative contribution of different descriptors and ngerprints (FPs) in determining the molecules' maximum response values, assay-wise, R values were calculated between these. The molecules with top 10 positively and negatively correlated descriptors or FPs values (Supplementary les 10-19,  Table 1) were shortlisted from the training dataset molecules along with their maximum response values and sorted in the ascending order of maximum response values. Further, the top 100 records (highly active compounds) (Supplementary les 10-19, Table 2) and the lowest 100 records (least active compounds) (Supplementary les 10-19, Table 3) were selected for both positively and negatively correlated descriptors and FPs. Finally, the average values were calculated for each descriptor and FP (Supplementary les 10-19, Table 4). Grouped bar charts were plotted to compare the average values of descriptors / FPs amongst highly active and least active compounds (Supplementary les 10-19, Figure 1). Thus, the grouped bar charts depict the differences of descriptors / FPs average values amongst highly active and least active compounds.