Creating a screening set of potential anthelmintic compounds using ChEMBL CURRENT STATUS: POSTED

Many existing anthelmintic compounds are compromised by low efficacy, serious side-effects and/or rising resistance in parasite populations. Thus, to reveal potential new drug targets and drugs, we describe a protocol to identify the most promising nematode/flatworm targets, and compounds likely to interact with them from the ChEMBL database (which includes both targets of known drugs and of other biologically active compounds). This provides a potential screening set of drug-like compounds.


Introduction
Many existing anthelmintic compounds are compromised by low efficacy, serious side-effects and/or rising resistance in parasite populations. Thus, to reveal potential new anthelmintic drug targets and drugs, we describe a protocol to identify the most promising nematode/flatworm targets, and compounds likely to interact with them from the ChEMBL database (Gaulton et al. 2017), includes both targets of known drugs and of other biologically active compounds. Thus, this provides a potential screening set of drug-like compounds that can be tested for anthelmintic activity.
Our protocol (Figure 1) starts by using predicted proteins from nematode and platyhelminth genomes to search using BLASTP (E ≤ 1e-10) (Altschul et al. 1997) against single-protein targets from ChEMBL.
To remove redundant hits, the nematode and platyhelminth proteins with significant BLAST hits are collapsed by gene family using an in-house database of gene families (e.g. built the Ensembl Compara pipeline; Vilella et al. 2009). Then, a 'target score' is calculated for each worm gene, taking into account similarity to known drug targets; lack of human homologues; and whether C. elegans and D.
melanogaster homologues have lethal phenotypes, amongst other factors.
Next, ChEMBL is used to identify hundreds of thousands of compounds with activities against ChEMBL targets to which worm proteins had BLAST matches. To calculate 'compound scores' for those compounds, we prioritise compounds in high clinical development phases, oral/topical administration, crystal structures, properties consistent with oral drugs, and lacking toxicity. At this stage, we focus on the compounds for the top 15% of highest-scoring worm targets, as these are the most promising novel targets. The set of compounds is filtered by selecting compounds that co-appear in a PDBe (Velankar et al.) structure with the ChEMBL target; or have high median pChEMBL score, reflecting high potency/affinity for the ChEMBL target.
Lastly, the candidates are placed into chemical classes, based on molecular fingerprints. Then they are further filtered by (i) checking availability for purchase in ZINC 15 (Sterling et al. 2015);and (ii) for each worm target, taking the highest-scoring compound from each chemical class.
This gives a relatively small diverse screening set of a few thousand compounds.
Many previous analyses of nematode and platyhelminth genomes have identified potential novel drug targets among their predicted proteins, and prioritised these based on factors such as lethal knockout phenotypes. However, relatively few previous studies have identified compounds predicted to target those nematode/platyhelminth proteins, and those that have done so (e.g. Berriman et al. 2009) have generally been limited to small numbers of compounds. By analysing the huge wealth of data in the ChEMBL database, this protocol produces much larger sets of compounds to screen for anthelmintic activity.

Procedure
Step A: identifying potential nematode/platyhelminth drug targets The inputs for Step A are gene predictions for a set of nematode/platyhelminth genome assemblies (as well as the corresponding predicted proteins), and the protein sequences for single-protein targets from the ChEMBL database (Gaulton et al. 2017).

1.
To identify potential drug targets in ChEMBL, use BLASTP (Altschul et al. 1997) to search all predicted nematode/platyhelminth proteins against the sequences of single-protein targets from the ChEMBL database.

The output for
Step A is a set of nematode/platyhelminth genes with BLASTP hits to single-protein ChEMBL targets.
Step B: ranking potential nematode/platyhelminth drug targets The input for Step B is set of nematode/platyhelminth genes with BLASTP hits to single-protein ChEMBL targets (output from Step A); a database of gene families containing all the genes from all those nematode/platyhelminth species and also human genes (e.g. built using the Compara pipeline, Vilella et al. 2009); and a set of chokepoint enzymes predicted in the nematode/platyhelminth genomes (e.g. using the approach of Tyagi et al. 2018). In Step B, to rank potential nematode/platyhelminth targets, target properties are considered that would be most attractive in a potential new drug target, including: • Similarity to a known drug target in any species; • Lack of a human homologue, to avoid toxicity issues in humans;

5
• Whether C. elegans or D. melanogaster homologues have lethal or sterile phenotypes when disrupted.
Thus, each of the nematode/platyhelminth genes (output from Step A) is assigned an overall target score, by following these steps:

1.
A BLAST score is set to 1 for high sequence similarity hits to ChEMBL (E ≤e-85, and ≥80% of the residues in the ChEMBL protein covered by the BLAST match), otherwise it is set to zero.

2.
A ChEMBL non-human score of 1 is applied to nematode/platyhelminth proteins that only match non-human targets within ChEMBL, because for these targets, developing drug selectivity to avoid toxicity issues in humans should be easier. The human proteins in the in-house Compara database are searched against ChEMBL 21 proteins using BLASTP. Where the top ChEMBL BLAST hit of a helminth gene has itself a match with human (with E ≤0.05) or belongs to a Compara family containing a human member, it is assigned a score of zero; otherwise 1.

3.
A phenotype score is calculated. Large-scale mutant phenotype data are not available for parasitic worms so essentiality is inferred from model organisms. A list of C. elegans genes with 'lethal', 'sterile', 'L3 arrest', 'molt defect', or 'paralysed' phenotypes (based on knockouts/knockdowns/variants) is downloaded from WormBase (Howe et al. 2016). In addition, a list of D. melanogaster genes with 'lethal', 'sterile' or 'paralytic' (paralysed) phenotypes is downloaded from FlyBase (Millburn et al. 2016). Lethal phenotypes are weighted more heavily than other phenotypes. If a nematode/platyhelminth gene belongs to a Compara family containing both C. elegans and Drosophila genes with lethal phenotypes it is assigned a phenotype score of 1. Otherwise, nematode genes that belong to families with C. elegans genes that have lethal phenotypes score 0.9 and those in families with essential Drosophila (but not C. elegans) genes score 0.8. Platyhelminth genes 6 belonging to families with essential C. elegans or Drosophila genes score 0.8. If a nematode/platyhelminth gene belongs to a Compara family containing both C.
elegans and Drosophila genes that both have 'sterile' phenotypes, or both have 'paralysed' phenotypes, it is assigned a score of 0.7. Otherwise, for a nematode gene, if it belongs to a family with a C. elegans gene with a paralysed/sterile/L3 arrest/molt defect phenotype, it is assigned 0.6; and if it belongs to a family with a fly (but no C. elegans) gene of paralysed/sterile phenotype, it is assigned 0.5. A platyhelminth gene that belongs to a family with a C. elegans or Drosophila gene of paralysed/sterile/L3 arrest/molt defect phenotype is assigned 0.5. The distribution of phenotype scores for nematode/platyhelminth genes is typically something like this: 1:5%, 0.9:22%, 0.8:8%, 0.7:0.5%, 0.6:7%, and 0.5:4%.

4.
A chokepoint score is set to 1 for nematode/platyhelminth genes predicted to encode chokepoint enzymes and belonging to Compara families with ≥3 predicted chokepoints.
Because chokepoint predictions are not very accurate, predicted chokepoints that do not belong to such families are assigned a score of 0.1. Non-chokepoint enzymes are assigned a zero score.

5.
A multi-species score. To penalise species-specific nematode/platyhelminth gene predictions that could include residual unfiltered contamination (e.g. from bacteria), this score is set to zero for nematode/platyhelminth genes belonging to Compara families with only a single species, but otherwise set to 1. The reason for including this score is that typically 99% of the nematode/platyhelminth genes have multi-species scores of 1.
6. An expression score. For the majority of nematode/platyhelminth species, including filarial species, drugs should target the adult stage. However, for some species, targeting other stages is important, for instance: the metacestode stage for cestodes, larvae of Trichuris and Strongyloides, and somules of Schistosoma. In the absence of expression data for many stages from many species, adult and metacestode (for cestodes) expression data are used here. For each nematode/platyhelminth gene, expression data from the most closely related 'reference'

7.
A species-distribution score is calculated. To give greater preference to potential nematode/platyhelminth targets that are found in multiple species (so that the same drug may target multiple species), a score of 1 is assigned if a nematode/platyhelminth gene is present in a Compara family containing ≥90% of the species from a major group of helminths. The major groups were defined as (in order of preference): all nematodes and platyhelminths, just 8 nematodes, just platyhelminths, cestodes, filaria, trematodes, or schistosomatids. Otherwise, it is assigned a score of 0. Typically, ~90% of the nematode and platyhelminth genes have a species distribution score of 1.
8. An invertebrate biology score is calculated. If the ChEMBL BLAST hit is from a closely related animal, it is more likely that the nematodes/platyhelminths have conserved biology, for example, nematodes/platyhelminths may share processes involved in moulting and life cycle control with arthropods. Therefore, if the top ChEMBL BLAST hit of a nematode/platyhelminth gene is to a UniProt protein from a non-chordate metazoan, it is assigned an 'invertebrate biology score' of 1, otherwise 0. We do not assign a score of 1 for matches to chordate proteins, in order to downweight targets that may be shared with the vertebrate hosts of nematodes/platyhelminths. Typically ~3% of the nematode/platyhelminth genes have an invertebrate biology score of 1.

9.
A singleton score is calculated. Developing drugs against simple single-copy targets is likely to be easier than developing drugs for multigene families. The score is therefore set to 1 for nematode/platyhelminth genes that lack within-species paralogues (according to the Compara database), otherwise zero score is applied. Typically ~53% of the nematode/platyhelminth genes have singleton scores of 1.

10.
A PDB score of 1 is assigned to helminth proteins that match ChEMBL targets with available structures in the PDBe (Velankar et al. 2016). This is to reflect the possibility of structure-aided drug development. Typically ~51% of nematode/platyhelminth genes have a PDB score of 1.
11. An alignment conservation score is calculated. Drugs that act against multiple species are highly desirable and this is more likely to occur if the target is sufficiently conserved between nematode/platyhelminth species. Across each column of the alignment, for each Compara family, a score is calculated using the approach of Capra & Singh (2007) using the Jensen-Shannon divergence, with a window size of 3 (on either side of the residue) and the BLOSUM62 background distribution. The overall score for a family is taken as the median of the scores for 9 all columns that have scores of > -1000. If a nematode/platyhelminth gene belongs to a family with an alignment score of ≥0.68, it is assigned an 'alignment conservation score' of 1, and otherwise 0. Typically, ~5% of nematode/platyhelminth genes have alignment conservation scores of 1.

12.
Thus, each of the nematode/platyhelminth genes (output from Step A) is assigned an overall target score based on the sum of scores above, weighted as follows: (50 x BLAST score) + (40 x ChEMBL non-human score) + (30 x phenotype score) + (15 x chokepoint score) + (10 x multi-species score) + (10 x expression score) + (10 x species-distribution score) + (10 x invertebrate-biology score) + (5 x singleton score) + (5 x PDB score) + (5 x alignment-conservation score) The output from Step B is a 'target score' for each of the nematode/platyhelminth genes with BLASTP hits to single-protein ChEMBL targets (output from Step A).
Step C: collapsing the list of potential helminth drug targets by gene family The input for Step C is the set of nematode/platyhelminth genes with BLASTP hits to single-protein targets, along with their target scores from Step B.
These nematode/platyhelminth proteins usually include many with BLAST matches to the same ChEMBL protein, so the following steps are taken to reduce redundancy.

1.
The list is collapsed to just take the top-scoring nematode/platyhelminth protein (using the 'target score'; see above) matching each ChEMBL protein, leaving fewer nematode/platyhelminth proteins.

2.
The list of nematode/platyhelminth proteins usually still contains redundancy, since many nematode/platyhelminth proteins are from the same Compara family and match homologous ChEMBL proteins (e.g., from human, mouse, rat). The list is further collapsed to just the highestscoring nematode/platyhelminth protein from each Compara family.
The output from Step C is a reduced list of nematode/platyhelminth proteins with BLASTP matches to single-protein ChEMBL targets (n nematode/platyhelminth proteins matching n ChEMBL targets).
Step D: identifying potential new anthelmintic drugs in ChEMBL The input to Step D is the reduced list of nematode/platyhelminth proteins with BLASTP matches to single-protein ChEMBL targets, from Step C. In Step D, The ChEMBL database (Gaulton et al. 2017) is first used to identify compounds (including approved drugs, compounds in clinical development and bioactive compounds from the medicinal chemistry literature) that have activities against ChEMBL single-protein targets to which our nematode/platyhelminth proteins have significant (E ≤ 1e-10) BLAST matches (ie. the helminth proteins from Step C). For each of the nematode/platyhelminth genes in the reduced list (from Step C), we take the compounds with activities to its best ChEMBL BLAST hit and also the corresponding compounds for all nematode/platyhelminth genes in the same Compara family (including those that were discarded by Step C and/or have a different top BLAST hit in ChEMBL).
To assess the likely suitability of each compound as a potential new anthelmintic drug, an overall 'compound score' is generated. Compound properties considered are those that would be most advantageous for a neglected disease drug (Oprea & Overington 2015) including: • Compounds that could be more quickly and cheaply developed into drugs: we prioritise compounds in high clinical development phases and those where a crystal structure is available to help inform molecule design; • Compound properties: by focussing on compounds with properties consistent with those of oral drugs (Quantitative Estimate of Drug-likeness) and lacking known toxicity issues using Black Box warning information and toxic effect predictors; • Preferred route of administration for the ideal anthelmintic: compounds with oral or topical administration are considered most desirable.
Thus, each of the ChEMBL compounds with activities against the ChEMBL targets to which our nematode/platyhelminth proteins (output from Step C) have BLAST matches is assigned an overall compound score, by following these steps: 1.
The Quantitative Estimate of Drug-likeness (QED, Bickerton et al. 2012) is retrieved from the ChEMBL database. QED values can range from 0-1, and the closer a QED value is to 1, the more 11 oral-drug-like is the compound under consideration. Typically here QED values range from 0.01-0.95.

2.
The maximum development phase score for a compound is set equal to its maximum development phase taken from the ChEMBL database. Typically here the distribution of scores is: 0:99.7%, I:0.02%, II:0.03%, III:0.05%, and IV:0.22%.

3.
A route of administration score is calculated based on the route of administration, based on information obtained from the FDA orange book, and retrieved here from the ChEMBL database.
Compounds with oral or topical routes are given a score of 1. Typically ~0.17% of compounds have a route of administration score of 1.

A black box warning score is calculated based on Black Box Warnings. The FDA provides Black
Box Warnings for approved drugs where use of the drug is associated with serious or life threatening side effects. Here Black Box Warnings are retrieved from the ChEMBL database, and a score of -1 is used to penalise any compounds with Black Box Warnings. Typically ~0.07% of compounds have a black box warning score of -1.

5.
A molecule structural information availability score is calculated. A score of 1 is assigned to compounds with at least one structure deposited in the Protein Data Bank in Europe (PDBe) (www.ebi.ac.uk/pdbe), using PDBe information retrieved from the ChEMBL database. Typically here ~0.8% of compounds have a molecule structural information availability score of 1.

6.
A toxicology target interaction prediction score is calculated based on predictions (retrieved from the ChEMBL database) of compounds predicted to interact with known toxicology targets.
These predictions originate from the ChEMBL toxicology target prediction pipeline, with prediction models created using activity information in the database collected as part of the HeCatos project (www.hecatos.eu). Compounds predicted to interact with known toxicology targets are penalised with a score of -1. Typically here ~0.004% of compounds have toxicology target interaction scores of -1.

7.
The overall compound score is calculated as the weighted score: (5 x QED) + (5 x Maximum development phase) + (5 x Route of administration) + (5 x Black box warning information) + (5 x Molecule structural information availability) + (2.5 x Toxicology target interaction prediction) The output from Step D is the nematode/platyhelminth proteins with BLASTP matches to singleprotein ChEMBL targets (from Step C) and corresponding compounds from ChEMBL that have activities against those ChEMBL targets, and a 'compound score' for each of those compounds.
Step E: filtering the list of compound-target pairs, by target score The input to step E is the list of ChEMBL compound-target pairs output from Step D.
To reduce down the number of compounds and targets considered, this list is now filtered to just retain the top 15% of highest-scoring nematode/platyhelminth targets (using the target scores from Step B).

The output from
Step E is a reduced list of ChEMBL compound-target pairs.
Step F: filtering the list of compounds, using information on compound-target pairs The input to step F is the reduced list of ChEMBL compound-target pairs output from Step E. In Step F, to filter these ChEMBL compound-target pairs, two additional parameters were retrieved from ChEMBL and used for filtering. The first is pCHEMBL, a parameter calculated by ChEMBL that provides a consistent measure of the affinity of a compound for its (ChEMBL) target and is defined as: −log10 (molar IC50, XC50, EC50, AC50, Ki, Kd or Potency). For example, pChEMBL of 5 corresponds to an activity of 10 μM. The second is whether the compound and target co-appear in a structure: where possible, structures are retrieved from the Protein Data Bank in Europe (PDBe) (www.ebi.ac.uk/pdbe) that contained both the compound and a ChEMBL target.
The compounds (from Step E) are filtered by selecting compounds that (i) co-appear in a PDBe (Velankar et al. 2016) structure with the ChEMBL target; or (ii) have a high pChEMBL score (median >5), reflecting high potency/affinity for the ChEMBL target).
The output from step F is a further reduced list of compound-target pairs.
Step G: curating a list of known anthelmintic drugs and compounds for 'anthelmintic OR antihelminthic OR antinematodal OR antitrematodal' in DrugBank; (k). finding compounds from PubChem identified by searching for 'anthelmintics OR anthelmintic OR nematocide OR nematicide', and with low structural similarity to the list of compounds from (a)-(j) above (similarity ≤0.4 when calculated using the Tanimoto coefficient of ECPF4 fingerprints as implemented in the RDKit toolkit (www.rdkit.org)).
Where the source (see (a)-(k) above) only gives a name (not structure) for the compound, the OPSIN software (Lowe et al. 2011) is used to convert IUPAC names to SMILES strings that are then used to search PubChem and ChEMBL for specific compounds. The resulting list of anthelmintic drugs (human and veterinary) and compounds typically includes >260 compounds.
The output from Step G is a curated list of known anthelmintic compounds and their corresponding SMILES strings.
Step H: classifying known anthelmintic compounds into clusters The input for Step H is the curated list of known anthelmintic compounds from Step G, and their SMILES strings.
To classify these compounds into clusters:

2.
The similarities have range 0.0-1.0, and are converted to distances using 1.0 -similarity. The distance matrix is read into R, and hierarchical clustering performed using Ward's minimum variance method (using 'ward.D' in the 'hclust' function). The dendrogram is cut at a height which separates the well known chemical classes of anthelmintic compounds avermectins, milbemycins, and imidazothiazoles from other compounds. If the members of the clusters defined in this way have little in common with respect to chemical structure, the cluster is 15 further split up. This typically results in ~44 chemical classes.

The output from
Step H is a chemical class for each of the known anthelmintic compounds.
Step I: classifying the top drug candidates into chemical classes To create a 'diverse screening set', we perform a diversity analysis to classify the compounds from Step F (which we refer to as our 'top drug candidates') into chemical classes.
The data set consisted of the compounds from Step F for which a SMILES string is available in Some of the connected components in this graph may be very large and include relatively dissimilar compounds among the known anthelmintic compounds. Therefore, we further split the connected components by using a community (cluster) detection algorithm to find clusters within the subgraph corresponding to each connected component. To find the communities, we use the 'community' Python module (Aynaud 2009) to find optimal communities in terms of the modularity measure (as described previously by Blondel et al. 2008), using the similarities between compounds as the edge weights. When the community detection algorithm is applied to the connected components in the graph of top drug candidates and anthelmintic compounds, it splits the components into smaller communities (clusters). Upon manual examination, it is usually found that some of these clusters still contains relatively dissimilar compounds among the known anthelmintic compounds. Therefore, for each cluster, we perform hierarchical clustering in R using Ward's minimum variance method, and cut the dendrogram at a height of y (e.g. 0.85). The value of y is chosen by trying various heights, and finding the height at which anthelmintic compounds known to have similar structures (based on Step H) are placed in the same cluster.
We consider that the resultant clusters represent the chemical classes to which the top candidate compounds belong.
Step J: further filtering the list of compounds, using information on compound-target pairs At this stage, some of the 'top drug candidates' are further filtered, by discarding the medicinal chemistry compounds that do not co-appear in a PDBe structure with the ChEMBL target or have a median pCHEMBL score of >7, which leaves a smaller number of top drug candidates.
The same chemical classes identified above for the compounds in Step I are still used for the smaller set of candidates from Step J.
Step K: further filtering the list of compounds, using information on availability for

purchase, and chemical class
This set of candidates is further filtered by: (a). just taking compounds listed as available for purchase in ZINC (see below), and (b). for each nematode/platyhelminth target, just taking the highest-scoring compound (according to our compound score; see Step D above) from each chemical class.
ZINC 15 (Sterling & Irwin 2015) is used to identify compounds available for purchase in the data set.
The subset of ZINC 15 with the highest level of availability ('in stock') is used. Identity-matching should use the parent compound (a single largest component) and standard InChIs, first directly and then after removing the charge and atom-based stereochemistry layers.
This gives the final 'diverse screening set' (typically several thousand compounds).