The discovery of novel hits is a tedious step in drug discovery that requires several years and a massive investment of manpower, resources, and money. Computational drug discovery resources can reduce the consumption of these resources and improve efficiency, thereby accelerating the drug discovery process. To facilitate drug discovery for COVID-19, the present study focuses on developing a computational resource, the webserver “ASCoVPred”, and standalone software that help design efficacious and safe inhibitors against SARS-CoV-2. The experimental activity data of compounds from HTS assays (downloaded from the NCATS portal) were used to train and evaluate ML-based prediction models. Finally, the best-performing models were deployed as a publicly available prediction webserver, ASCoVPred, and as standalone software (for Linux and Mac OS).
The analysis of the compounds used to train the best models revealed several descriptors and FPs that may help determine compound activity. For example, in the AlphaLISA assay (Assay ID-1), the top 10 FPs negatively correlated with the maximum response value of compounds include counts of chiral center specified (SubFPC307), aromatic functional groups (SubFPC274), conjugated double bonds (SubFPC287), vinylogous esters (SubFPC137), phenols (SubFPC169), aryl chlorides (SubFPC171), vinylogous acids (SubFPC136), and tertiary mixed amines (SubFPC33), among others. Furthermore, the highly active compounds have higher average values for these FPs than the least active compounds (Supplementary file10.xlsx, Figure 1). Therefore, a compound with higher values for these FPs might act as a suitable (highly active) inhibitor that disrupts the Spike-ACE2 interaction and thus prevents viral entry into human host cells.
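The ranking of FPs by their correlation with the maximum response value can be illustrated with a minimal sketch. It assumes a plain Pearson correlation and uses only the Python standard library. The function names (`pearson_r`, `rank_features`) and the toy values are hypothetical and are not taken from the study or from the ASCoVPred code.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant column: correlation is undefined, treated as 0 here
    return cov / sqrt(var_x * var_y)


def rank_features(features, max_response, k=10):
    """Return the k most negatively and the k most positively correlated features.

    features: dict mapping feature name -> list of per-compound values
    max_response: list of per-compound maximum response values
    """
    scored = [(name, pearson_r(vals, max_response)) for name, vals in features.items()]
    scored.sort(key=lambda item: item[1])  # most negative R first
    negative = [item for item in scored if item[1] < 0][:k]
    positive = [item for item in reversed(scored) if item[1] > 0][:k]
    return negative, positive


# Toy data (hypothetical values, for illustration only).
toy_features = {
    "SubFPC274": [0, 1, 2, 3, 4],
    "SubFPC84": [4, 3, 2, 1, 0],
    "SubFPC12": [1, 1, 1, 1, 1],
}
toy_max_response = [-10.0, -30.0, -50.0, -70.0, -90.0]

negative, positive = rank_features(toy_features, toy_max_response, k=10)
```

In this toy case, `negative` contains only SubFPC274 and `positive` contains only SubFPC84. The constant column SubFPC12 appears in neither list.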
The TruHit counterscreen assay (Assay ID-2) identifies false positives from the AlphaLISA assay and therefore helps flag non-specific inhibitors. For a compound to be a true positive, it must be inactive or least active in the TruHit counterscreen assay [9,13]. In the training dataset, the top 10 1D and 2D descriptors negatively correlated with the maximum response value of compounds have higher average values in highly active compounds than in least active compounds (Supplementary file12.xlsx, Figure 1). These descriptors are: doubly bound carbon bound to two other carbons (C2SP2), doubly bound carbon bound to three other carbons (C3SP2), Crippen's LogP (CrippenLogP), simple path, order 7 (SP-7), excess molar refraction (MLFER_E), number of 10-membered fused rings (nF10Ring), the coefficient sum of the last eigenvector from the topological distance matrix (VE1_D), maximum atom-type E-State: =C< (maxdssC), maximum atom-type E-State: =CH- (maxdsCH), and the coefficient sum of the last eigenvector from the Barysz matrix / weighted by first ionization potential (VE1_Dzi). Therefore, for a compound to be least active or inactive in the TruHit counterscreen assay, its values for these 1D and 2D descriptors should ideally be low. In contrast, among the top 1D and 2D descriptors positively correlated with the maximum response value of compounds, a few, such as MATS2c, nAcid and MATS2e, showed significant differences in average values between highly active and least active molecules (Supplementary file13.xlsx, Figure 1).
Therefore, for a compound to be least active or inactive in the TruHit counterscreen assay, it may need higher values than the highly active compounds for the following descriptors: Moran autocorrelation - lag 2 / weighted by charges (MATS2c), number of acidic groups (nAcid), and Moran autocorrelation - lag 2 / weighted by Sanderson electronegativities (MATS2e).
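The comparison of average descriptor values between highly active and least active compounds can be sketched as follows. This is a minimal illustration, assuming that each compound has already been labelled "high" or "low". The grouping rule, the function name `compare_group_means`, and the toy values are assumptions and do not come from the study.

```python
from statistics import mean


def compare_group_means(descriptors, labels):
    """Mean of each descriptor in the 'high' and 'low' activity groups.

    descriptors: dict mapping descriptor name -> list of per-compound values
    labels: list of 'high' or 'low', one per compound (the grouping rule is assumed)
    Returns a dict mapping name -> (mean_high, mean_low, mean_high - mean_low).
    """
    result = {}
    for name, values in descriptors.items():
        high = [v for v, lab in zip(values, labels) if lab == "high"]
        low = [v for v, lab in zip(values, labels) if lab == "low"]
        mean_high, mean_low = mean(high), mean(low)
        result[name] = (mean_high, mean_low, mean_high - mean_low)
    return result


# Toy data (hypothetical values, for illustration only).
toy_descriptors = {
    "CrippenLogP": [4.0, 5.0, 1.0, 2.0],
    "nAcid": [0, 0, 1, 2],
}
toy_labels = ["high", "high", "low", "low"]

summary = compare_group_means(toy_descriptors, toy_labels)
```

A positive difference (as for CrippenLogP in this toy case) means the descriptor is higher on average in the highly active group. A negative difference (as for nAcid) means it is higher in the least active group.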
The TMPRSS2 enzymatic activity assay (Assay ID-8) identifies compounds with the potential to inhibit the human TMPRSS2 enzyme and thereby prevent viral entry into human cells [9]. Although the models trained on this assay data did not perform well in five-fold cross-validation (R = 0.52), they performed better on the external validation dataset compounds (R = 0.73, Table 2). This result suggests that the model is reliable and may perform well on unseen compounds in practical use. Further analysis of the training dataset compounds showed that the top 10 negatively correlated FPs have a very poor correlation (R-value) between FP values and the maximum response value of the compounds (Supplementary file14.xlsx, Table 1). However, a few of the top 10 positively correlated FPs, namely, primary aromatic amine (SubFPC28), hetero N nonbasic (SubFPC181), heteroaromatic (SubFPC184) and heterocyclic (SubFPC275), achieved slightly higher values of R (Supplementary file15.xlsx, Table 1). Furthermore, these FPs have lower average values for the highly active compounds than for the least active compounds (Supplementary file15.xlsx, Figure 1). Therefore, for a compound to be highly active in the TMPRSS2 enzymatic assay, its values for these FPs should ideally be low.
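The two R values above come from two different evaluation schemes, which the following minimal sketch contrasts. It assumes a single-feature least-squares line as a stand-in model and an index-modulo fold assignment. Neither is the study's actual ML model or splitting procedure, and all function names and toy values are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return 0.0 if var_x == 0 or var_y == 0 else cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (stand-in model)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = 0.0 if var_x == 0 else cov / var_x
    return slope, mean_y - slope * mean_x


def cross_validated_r(xs, ys, n_folds=5):
    """Pearson R between observed values and out-of-fold predictions."""
    predictions = [0.0] * len(xs)
    for fold in range(n_folds):
        # Assumed fold assignment: compound i goes to fold i % n_folds.
        train_idx = [i for i in range(len(xs)) if i % n_folds != fold]
        test_idx = [i for i in range(len(xs)) if i % n_folds == fold]
        slope, intercept = fit_line([xs[i] for i in train_idx],
                                    [ys[i] for i in train_idx])
        for i in test_idx:
            predictions[i] = slope * xs[i] + intercept
    return pearson_r(predictions, ys)


def external_r(train_x, train_y, ext_x, ext_y):
    """Pearson R on held-out compounds, using a model fit on all training data."""
    slope, intercept = fit_line(train_x, train_y)
    return pearson_r([slope * x + intercept for x in ext_x], ext_y)


# Toy data (hypothetical values, for illustration only).
train_x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
train_y = [1.2, 2.9, 5.1, 7.0, 8.8, 11.1, 13.0, 14.9, 17.2, 19.0]
ext_x = [10, 11, 12]
ext_y = [21.1, 22.8, 25.0]

cv_r = cross_validated_r(train_x, train_y, n_folds=5)
ext_r = external_r(train_x, train_y, ext_x, ext_y)
```

In cross-validation every training compound is predicted by a model that never saw it, whereas the external R is computed on a separate set with a model fit on the full training data. The two numbers can therefore differ, as they do for Assay ID-8.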
For a compound to act as a drug molecule, it must be non-toxic to normal cells in the human body. Therefore, predicting the toxicity of compounds against normal human cells is one of the most important tasks in a drug discovery pipeline. The SARS-CoV-2 cytopathic effect assay (host tox counterscreen, Assay ID-15) helps estimate the toxicity of compounds against Vero E6 cells. The FP-level analysis of the training dataset compounds identified a few FPs that are negatively correlated with the maximum response values. For example, the FPs “nAromBond”, “naAromAtom”, “nBondsM” and “SwHBa” showed significant differences in their average values between the highly active and least active compounds (Supplementary file16.xlsx, Figure 1). Thus, for a compound to act as a drug, it should ideally have lower predicted maximum response values for this assay. Moreover, the compound should ideally have few aromatic bonds (nAromBond), aromatic atoms (naAromAtom), and bonds with bond order greater than one (nBondsM), and a low sum of E-States for weak hydrogen bond acceptors (SwHBa). Similarly, the HEK293 cell-line toxicity assay (Assay ID-20) tests the activity or toxicity of compounds against human HEK293 cell lines. For a compound to behave like a drug molecule, it must be least active or inactive in this assay. The analysis of the training dataset molecules (used to train the best prediction model) provided the FPs that are negatively and positively correlated with the maximum response value of compounds (Supplementary files18-19.xlsx, Table 1). Interestingly, some of the top 10 negatively correlated FPs and all of the top 10 positively correlated FPs show differences between the highly active and least active compounds (Supplementary files18-19.xlsx, Figure 1).
The negatively correlated FPs, namely, aromatic groups (SubFPC274), hetero N nonbasic groups (SubFPC181), heteroaromatic groups (SubFPC184), heterocyclic groups (SubFPC275), chiral center specified (SubFPC307), amine groups (SubFPC23), tertiary aliphatic amine groups (SubFPC26) and hetero N basic no H groups (SubFPC180), show lower average values for the least active molecules than for the highly active molecules (Supplementary file18.xlsx, Figure 1). In contrast, the top 10 positively correlated FPs, namely, carboxylic acid groups (SubFPC84), 1,2-diol groups (SubFPC41), alcoholic groups (SubFPC12), secondary alcohol groups (SubFPC14), sugar pattern 2 beta (SubFPC286), sugar pattern 2 alpha (SubFPC285), sugar pattern 2 (SubFPC282), sugar pattern 1 (SubFPC281), sugar pattern combi (SubFPC283) and primary alcohol groups (SubFPC13), show higher average values for the least active compounds than for the highly active compounds (Supplementary file19.xlsx, Figure 1). Therefore, designing compounds with higher counts of these positively correlated FPs may help obtain compounds that are non-toxic to normal HEK293 cells.
The descriptors and FPs identified through this analysis of the chemical data can be very useful for designing molecules that are highly active against SARS-CoV-2 or least active against normal human cells. Thus, features that are important determinants of activity have been identified from experimental data. However, designing compounds with the desired properties requires manual optimization of multiple descriptors and FPs, which remains a challenging task.
Although the current version of ASCoVPred predicts the activity of input compounds with the reported performance, it may not predict well for targets or assays on which it has not been trained. This limitation can be addressed by adding models trained on compound activity datasets for targets that are not covered by the current version of ASCoVPred. We will continue to update ASCoVPred with additional models and to improve the performance of the existing models.