Genomic landscape of drug binding and pharmacogenetic variation across diverse populations using SNPdrug3D

doi:10.21203/rs.3.rs-2377190/v1

This work is licensed under a CC BY 4.0 License

Version 1

One of the promises of precision medicine is to understand and act on inter-individual genetic differences in drug responses. SNPdrug3D contains the complete genomic landscape of missense single nucleotide variants (SNV) across the human proteome and at population-wide level that could affect drug binding. In the Singapore SG10K Health and global gnomAD cohorts, comprising variations in over 80,000 individuals, we identified ~ 1.17 million variants mapped to residues near bound drug molecules in protein-drug complexes relative to ~ 6000 drug molecules and experimentally verified effects of selected SNVs, including previously uncharacterized variants, on drug binding in relevant proteins ranging from kinases to cytochrome P450s (CYPs). The latter led to a specific predictor for interpreting variants in the CYP family that vastly outperforms existing tools in the prediction of pharmacogenetic effects. By placing variants and drugs in structural context, SNPdrug3D aids drug development by early flagging of resistance potential through population-specific variability.

The advent of next generation sequencing technology has generated abundant genetic variation data with potential clinical utility. This includes datasets capturing global variations like gnomAD¹ to a recent genomic database of 10,000 healthy Singaporeans (SG10K Health) that mainly contains variants occurring in the three major ethnic groups of Chinese, Malay and Indians². While substantial effort has been devoted to predict^3–7 and understand the clinical relevance of these variants, over 98% of variants have unknown consequences⁵. Specifically, much emphasis has been placed on studying missense SNVs (i.e., missense variants) and their effects in the context of disease, but they can also affect drug response by altering drug pharmacokinetic or pharmacodynamic properties. Such pharmacogenetic (PGx) variants occur at least 5 times in an individual’s genome⁸, are generally more common⁹ than monogenic variants and are documented to be actionable for medication prescribing for a small but growing list of drugs¹⁰. During drug treatment, they are a cause of adverse drug reactions (ADR) (e.g. 3–5% of all hospital admissions are estimated to be due to ADRs)¹¹ and therapeutic failure both at the clinic and in clinical trials when the underlying population genetic diversity is underappreciated¹².

Given their clinical importance and implications for drug development, there is a need to relate genotypes to drug response phenotypes by investigating the molecular mechanisms leading to such phenotypes. One of the ways this may arise is due to variant-induced single amino acid substitutions at or near protein-drug binding sites or long-distance allosteric effects¹³. For example, some substitutions are known to cause tyrosine kinase inhibitor resistance¹⁴ during anticancer therapy. Similarly, these substitutions can also affect drug metabolism by abrogating binding of drugs such as antidepressants to metabolic enzymes such as CYP2C9 and CYP2C19¹⁵ which are members of the cytochrome P450 (CYP) gene family that is responsible for metabolizing the vast majority of drugs (e.g. 80% of phase I metabolism is mediated by CYP enzymes)¹⁶. With an estimated 6,000 to 10,000 non-synonymous SNVs¹⁷ per genome, it is expected that some of these missense variants and their associated amino acid substitutions will affect drug binding sites or drug metabolism, but the effects of most remain unknown.

To explore this, there have been past efforts mapping variants to protein structures for both germline and somatic variants^18–25. However, there has been limited coverage of protein structure space as these works were mostly limited to protein crystal structures and reference proteomes from PDB and UniProt respectively and few (e.g. PhyreRisk, mutation3D) were supplemented with protein structure data from prediction models and none are focused on large-scale drug molecule binding sites. As a result, proteome coverage is limited to UniProt (e.g. 70% coverage by PhyreRisk¹⁹ compared with 72% in SNPdrug3D) and consideration for protein isoforms is lacking (e.g. MISCAST²⁰ covers 1330 genes as it only includes data from precise reference sequence matches to PDB). With the recent advent of high-quality predictions from AlphaFold, mapping of variants to protein residues is now available for 98.5% of proteins in the UniProt human reference proteome (one sequence per gene)²⁶. However, there exists no mapping yet of SNV-drug 3D interaction within these structures. To address these challenges, we introduce SNPdrug3D, an interactive tool that maps and visualizes missense variants in protein structures for 20,442 human genes for interpretation of variant effects in protein-drug binding sites and apply it to variants in SG10K Health and gnomAD.

Specifically, SNPdrug3D can be used to identify putative functional PGx variants while also providing a predicted molecular mechanism for their effects (e.g. disruption of protein-drug binding, described at an atomistic level for the binding pocket). Here, we demonstrate that its use can improve understanding of functional PGx variants by i) identifying previously uncharacterized amino acid substitutions that affect drug binding to protein targets, which were validated using the cellular thermal shift assay (CETSA)²⁷, and ii) deriving useful features from protein-drug complexes, such as the proximity of substituted amino acids to ligands, for building a machine-learning based, PGx-inclusive cytochrome P450 (CYP) variant prediction tool (CYPVarPred). Overall, our work provides a comprehensive picture of such variants with an unprecedented annotation depth that captures not only variation in the Singapore population but also global variation via linkage to gnomAD data, giving broad coverage of the world population because the ethnic mixes of the two cohorts are complementary. This provides the first complete map of SNV-drug 3D interactions across the human proteome and at a population-wide level.

Overview

SNPdrug3D integrates population-level missense variant data from the SG10K Health² and gnomAD²⁸ cohorts, 3D protein structure data derived from the Protein Data Bank (PDB)²⁹, HHPred, Modbase³⁰ and the AlphaFold Protein Structure Database³¹, sequence data from Ensembl, UniProt and NCBI RefSeq and information from DrugBank³².

The workflow for combining different data sources to create SNPdrug3D is shown in Fig. 1a. In total, the SNPdrug3D reference proteome comprises 102,532 unique human proteins (including isoforms) from 20,442 genes, with 65% (~38 million) of amino acid residues in the proteome being mapped to at least one protein structure from PDB (based on a cut-off of 40% sequence similarity), ModBase, HHPred or AlphaFold (residues with a predicted local distance difference test score (pLDDT) of at least 70). These protein structures may have a ligand or drug as defined by DrugBank, and we found that a total of 5,962 drug molecules present in this database were associated with a structure. The remaining 35% (~20 million) of proteome residues either mapped to missing residues in crystallized structures and had no homologous residues in non-human proteins (Fig. 1a, bottom), or were in low confidence regions of AlphaFold structures (pLDDT < 70) that are possibly associated with intrinsic disorder or are structured only in a complex²⁶. Overall, each covered residue has associated protein structure and sequence data along with information regarding its proximity (i.e. within 8Å) to one or more drug molecules.
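The 8Å proximity annotation described above can be illustrated with a minimal sketch. The coordinates, residue labels and the function name `residues_near_ligand` are hypothetical, and the use of the minimum atom-to-atom distance is an assumption; the proximity data in SNPdrug3D itself were calculated with VMD, and the exact atom selection is not described in this section.

```python
import math

def residues_near_ligand(residue_atoms, ligand_atoms, cutoff=8.0):
    """Return IDs of residues with at least one atom within `cutoff` Å of the ligand.

    residue_atoms: dict mapping residue ID -> list of (x, y, z) coordinates
    ligand_atoms:  list of (x, y, z) coordinates
    (Minimum atom-atom distance is an assumed criterion, not the source's.)
    """
    near = []
    for res_id, atoms in residue_atoms.items():
        # Smallest distance between any atom of the residue and any ligand atom
        d_min = min(math.dist(a, b) for a in atoms for b in ligand_atoms)
        if d_min <= cutoff:
            near.append(res_id)
    return sorted(near)

# Toy coordinates (hypothetical, not taken from any real structure)
residues = {
    "R97": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "A297": [(10.0, 0.0, 0.0)],
    "T76": [(20.0, 0.0, 0.0)],
}
ligand = [(3.0, 4.0, 0.0), (4.0, 4.0, 0.0)]

within_8 = residues_near_ligand(residues, ligand, cutoff=8.0)
within_5 = residues_near_ligand(residues, ligand, cutoff=5.0)
print(within_8, within_5)
```

In this toy case the residue labelled "T76" lies outside both pockets, which is the situation in which only a long-range effect could explain altered binding.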

Integrating information from various sources allowed SNPdrug3D to reach a full proteome coverage of ~65% (up to 70.5% if low confidence residues in AlphaFold structures are considered), whereas use of protein structure databases individually did not allow for such high coverage (Fig. 1). To illustrate, if we considered only structures that had sequences with very close similarity (i.e. above 98%, B98 group) to our reference sequences, a maximum coverage of less than 20% was obtained (Fig. 1b). If only coverage of the UniProt reference proteome (UP000005640_9606) is considered, we show that including AlphaFold predictions alone allowed us to reach a coverage of more than 62% of the residues in the ~20,000 protein sequences found in this proteome (Fig. 1c). If all residues regardless of prediction quality are considered, AlphaFold predictions cover 99% of the UniProt proteome. However, AlphaFold does not include drug binding information, and when only reliable predictions are considered, the consensus of multiple structure prediction methods used here provides the highest usable coverage of sites in the human proteome. Taken together, we have mapped residues to both experimental and predicted structures while also considering homologous structures (i.e. B40 group, similarity of 40–98%), which greatly expanded coverage of the SNPdrug3D proteome from 15% (i.e. B98 only) to 65% and of the UniProt proteome from 18.7% to 72%.

Overall, an unprecedented level of annotation per variant is created in SNPdrug3D by mapping variations in 20,442 genes to 102,532 reference protein sequences and to 202,299 protein structures that contained 28,949 ligands, of which 5,962 are considered as “drugs” by DrugBank. Amongst existing efforts, SNPdrug3D has the largest reference proteome size and number of protein structures, with the next largest being PhyreRisk, which included 18,874 experimental and 84,818 predicted structures with a coverage of 42,485 UniProt proteins¹⁹. Furthermore, a key differentiating factor for SNPdrug3D is that it contains binding site information derived from PDB protein structures for ~42% (5,962) of DrugBank drug molecules (the remainder either have no 3D structures of protein complexes or are unspecific in binding), which is at least 4-fold more than previous efforts such as mutLBSgeneDB²⁴ (1,324 drugs) and DrugVar²⁵ (235 drugs).

Mapping variants to structures

Following the above, SNPdrug3D was used to annotate natural missense variants present in the SG10K Health and gnomAD cohorts. As summarized in Fig. 2, analysis of the latest SG10K Health genetic variation catalogue identified 158,331,366 single nucleotide variants (SNVs). Of these, 6.2% (9,770,964) were common (allele frequency, AF, of more than 1%), while 49% (77,625,433) and 44.8% (70,934,969) were private (seen in only one individual) or rare (AF ≤ 1% and in more than one individual) respectively. Each genome carried an estimated median of over 11,000 missense SNVs, including over 200 unknown variants (Fig. 2, right), which is higher than previously estimated¹⁷.
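The frequency classes defined above, and the percentages reported for them, can be reproduced with a short sketch. The function name `frequency_class` and its arguments are hypothetical; only the class definitions and the counts are taken from the text.

```python
def frequency_class(allele_freq, n_carriers):
    """Assign a variant to the common / private / rare classes defined in the text."""
    if allele_freq > 0.01:      # AF of more than 1%
        return "common"
    if n_carriers == 1:         # seen in only one individual
        return "private"
    return "rare"               # AF <= 1% and seen in more than one individual

# SNV counts reported for the SG10K Health catalogue
counts = {"common": 9_770_964, "private": 77_625_433, "rare": 70_934_969}
total = sum(counts.values())

# Share of each class as a percentage of all SNVs, to one decimal place
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
print(total, shares)
```

The three class counts sum exactly to the 158,331,366 SNVs quoted for the catalogue, so the classes partition the full variant set.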

Including gnomAD variation data, there were ~ 5.8 million unique missense variants with ~ 5.19 million missense variants from gnomAD (Fig. 3a, left) and ~ 1.25 million missense variants in SG10K Health (Fig. 3a, middle). Of these 5.8 million variants, around 0.6 million variants were exclusive to the SG10K Health dataset (i.e. not in gnomAD) and further comparison of these variants with dbSNP v151 revealed that ~ 0.3 million of these were entirely novel (Fig. 3a, right). Around 4.5 million variants were not present in SG10K Health (i.e. were found in gnomAD only) and the rest (~ 675,000) were found in both datasets. While these missense variants may cause amino acid substitutions in the corresponding reference proteins, not all were amenable to be mapped to amino acid residues in protein structures as they may mutate residues in disordered protein regions that were not crystallized or had low confidence predictions if mapped to AlphaFold structures. In total, 68.5% (~ 3.94 of 5.8 million) of variants had protein structure coverage (i.e. mapped to at least one structure with or without a drug molecule bound nearby) with similar proportions of variants structurally annotated in both the gnomAD (~ 67.8%) and SG10K Health (~ 71.1%) datasets.
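The partition of variants described above (shared, cohort-exclusive and novel) amounts to simple set operations, sketched below. The variant keys are made-up "chrom-pos-ref-alt" strings used only for illustration; they are not real variants from either cohort or from dbSNP.

```python
# Hypothetical variant keys in "chrom-pos-ref-alt" form (illustrative only)
sg10k = {"1-100-A-G", "1-200-C-T", "2-300-G-A", "3-400-T-C"}
gnomad = {"1-200-C-T", "2-300-G-A", "4-500-A-C"}
dbsnp = {"1-100-A-G", "1-200-C-T"}

shared = sg10k & gnomad          # found in both cohorts
sg10k_only = sg10k - gnomad      # exclusive to SG10K Health
gnomad_only = gnomad - sg10k     # found in gnomAD only
novel = sg10k_only - dbsnp       # SG10K-exclusive and absent from dbSNP

# The three disjoint groups add up to the number of unique variants
unique_total = len(sg10k | gnomad)
print(len(shared), len(sg10k_only), len(gnomad_only), len(novel), unique_total)
```

Applied to the reported figures, the same bookkeeping gives roughly 0.6 + 4.5 + 0.675 ≈ 5.8 million unique missense variants.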

To understand the contribution of protein-ligand interaction disruptions to disease phenotypes we sorted a subset of SG10K Health variants (i.e. 1,312 missense variants from 508 genes) into three groups (i.e. benign, damaging and VUS) based on annotations from the SG10K Med dataset³³ where variants are already curated to identify highly pathogenic variants. Here, all likely pathogenic or pathogenic variants were consolidated into the damaging group and similarly, likely benign or benign variants were grouped into the benign class. Using SNPdrug3D, we observed that ~ 97% of these variants were mapped to a protein structure and of these, ~ 48% resulted in amino acid substitutions within 8Å of a drug binding site (Fig. 3b) which is > 2x higher compared to the whole SG10K Health set (21%, Fig. 3a).

By variant pathogenicity class (Fig. 3b), we show that more than half of the damaging variants (~ 51%) affected amino acids in drug binding sites. As endogenous ligands and drug molecules may share similar binding sites, an amino acid change at these sites may possibly disrupt such protein-ligand interactions, leading to a loss in protein function and manifestation of the disease phenotype. If protein-drug interactions are affected, the drug effect may be diminished, and therapeutic failure may occur in addition to toxicity if drug metabolism is decreased.

Another important role for SNPdrug3D is to help improve annotation of variants of unknown significance (VUS) with potential effects on drug binding. We observe that ~ 40% of VUS in this dataset (n = 204) affect residues near a drug in a ligand-binding site suggesting that disruption of protein-ligand interactions may be a possible molecular phenotype of these variants and besides providing enhanced annotation, makes them prime candidates for further experimental verification of the proposed effect mechanism.

Using SNPdrug3D to identify missense variants that affect protein-drug binding experimentally

The extensive mapping of missense variants to protein structures allowed us to pinpoint variants that can affect drug binding, possibly leading to varying drug response phenotypes. In the combined global data, we found 1.17 million mapped variants that altered amino acid residues within 8Å of one or more drugs in at least one protein structure. To delineate and validate the effects of such variants on protein-drug binding, we examined the effects of several variant-induced amino acid substitutions in drug targets (human HCK and DHFR) and well-known drug metabolizing enzymes (CYP2C19 and CYP2D6). The substitutions were selected because the SNPdrug3D annotation suggested that they could directly disrupt protein-drug interactions in the drug binding pockets, and they comprise both known and novel examples.

In CYP2C19, the investigated amino acid substitutions were NP_000760.1:p.R97T, p.A297V and p.R433W. All substitutions were located close to the binding cavity of the drug clopidogrel and the heme group of the holoprotein (Fig. 4a, upper panels) in the homologous CYP2B4. Further validation with a CETSA-luciferase binding assay revealed that p.R97T and p.R433W, but not p.A297V, impaired binding of the drug to the enzyme (Fig. 4a, bottom panels). CYP2C19:p.R433W is an amino acid substitution with known effect and is part of the no-function CYP2C19*5 haplotype, which has been demonstrated to abrogate metabolism of clopidogrel by decreasing protein stability³⁴. A similar disruption of clopidogrel binding to CYP2C19 by p.R97T was also demonstrated (Fig. 4a, bottom right). As with p.R433W, this may reflect decreased enzyme stability in addition to a direct effect on binding, given the lowered thermostability of the mutated enzymes compared to the wild-type (WT) and A297-mutated enzymes (Extended Data Fig. 1). Although both SNVs associated with p.R97T (NM_000769.4:c.290G>C) and p.R433W (NM_000769.4:c.1297C>T) are rare (< 1% AF in both SG10K Health and gnomAD datasets), we found the latter SNV to be more common by an order of magnitude in the Malay population (0.21% AF) compared to other ethnic groups like the Chinese (< 0.1% AF) and Indian (variant not observed) populations.

Similarly, we also investigated protein variants in the drug targets HCK and DHFR and found two amino acid substitutions that impaired protein-drug binding according to the CETSA-luciferase assay (Fig. 4b, c). The first, NP_000782.1:p.F180S (NM_000791.4:c.539T>C), was located within 5Å of the drug methotrexate and mapped to a structure homologous to human DHFR (Fig. 4b) while the other, NP_001165601.1:p.N385K (NM_001172130.3:c.1155C>A), was also in proximity and within 5Å of the drug staurosporine in the HCK protein structure homolog (Fig. 4c). Unlike for the CYP2C19 substitutions (p.R433W and p.R97T), protein variants of DHFR and HCK did not decrease the thermal stability of the enzymes (Extended Data Fig. 1). Conversely, p.F180S increased the thermal stability of DHFR compared to WT. These observations suggest that impairment of drug-protein binding may be due to a direct effect on binding rather than protein destabilization. This may occur through steric hindrance or changes in the biochemical properties of the drug binding cavity such as the introduction of a positive charge in HCK:p.N385K. Further analysis of the occurrence of these variants in the two cohorts studied revealed that both variants were very rare and are present only in SG10K Health but not in any gnomAD-associated populations. This is relevant since it means DHFR:p.F180S may promote methotrexate and other anticancer drug resistance in a population-specific context.

In our attempt to find missense VUS that can disrupt protein-drug binding, we also found two SNVs in CYP2D6 that either increased drug binding or disrupted binding indirectly via allosteric mechanisms, as has been observed for drugs targeting kinases and G protein–coupled receptors³⁵. The first is an AEU08335.1:p.F120I substitution (JF307778.1:c.358T>A) (Fig. 5a) that is within 5Å of prinomastat in the enzyme’s catalytic core and, instead of disrupting binding, increases binding of the drug to the enzyme according to the CETSA-luciferase assay results (Fig. 5b). The other amino acid substitution, p.T76M (c.227C>T) (Fig. 5a), is located more than 8Å away from the drug and heme group but nevertheless disrupted prinomastat-CYP2D6 binding (Fig. 5b), possibly through a long-range effect and enzyme destabilization (Extended Data Fig. 1). The missense variant (c.227C>T, p.T76M) was found exclusively, albeit rarely, in the Chinese population, while the other variant (c.358T>A, p.F120I) is enriched in the East Asian and Chinese populations, with allele frequencies of 0.7% and 0.8% in the gnomAD and SG10K Health cohorts respectively.

Annotation and prediction of population-specific effects on drug binding is invaluable for drug development efforts which can study and address such effects early in the development pipeline avoiding costly surprises at later stages^36–38. This especially includes effects in the important CYP family of drug metabolizing enzymes³⁹.

Deriving predictive features from SNPdrug3D to build a PGx-inclusive variant pathogenicity prediction tool

Beyond mapping and visualization of missense variants in protein-drug complexes, we also hypothesized that information from SNPdrug3D, such as amino acid substitution proximity to drug molecules (or ligands in general), could serve as useful predictive features for building variant effect prediction tools that discriminate PGx variants as well as disease variants from neutral variants, rather than only performing binary classification of variants into damaging (i.e. disease and PGx) and neutral categories. To the best of our knowledge, tools capable of such ternary classification of missense VUS have not been successfully developed, although a previous attempt using a random forest classifier found that PGx and neutral variants had overlapping characteristics⁴⁰. To this end, we focused on variants in the cytochrome P450 (CYP) superfamily of enzymes, as variation in CYP genes is known to cause disease (e.g. CYP21A2 and congenital adrenal hyperplasia⁴¹) and also to alter drug responses (e.g. CYP2D6 variation and tamoxifen metabolism⁴²). The CYP family has a dominating role in drug metabolism, with 80% of FDA-approved drugs from 2005 to 2016 being metabolized by 4 of the 57 CYP gene family members⁴³. Using supervised machine learning, we built a ternary classifier based on linear discriminant analysis with a manually curated dataset of 1,301 known CYP missense variants (Supplementary Table 1), of which 712 were disease variants, 397 were neutral variants and 192 were PGx variants. Two categories of predictive features were used: i) predictions and rank scores from dbNSFP (including consensus and input of other methods), and ii) distance-based annotations from SNPdrug3D based on whether a mapped variant is near (< 8Å) or further from at least one drug after being mapped on to protein-drug complexes (see Supplementary Table 2).

To assess the classifier (‘CYPVarPred’), a held-out test set containing 20% of the total variants in the CYP dataset was used. Comparison of CYPVarPred with a dummy classifier that made stratified predictions based on training set class distributions (i.e. 55% disease: 30% neutral: 15% PGx) showed that CYPVarPred vastly outperformed it, with a Matthews correlation coefficient (MCC) of 0.746 and an average class-specific accuracy of 80% across the three predicted variant classes (i.e. disease, neutral and PGx) (Fig. 6a). To allow direct comparison with classical binary variant pathogenicity predictors, we binarized the outputs of CYPVarPred into neutral/non-neutral (i.e. PGx or disease) predictions and tested all predictors against two separate datasets consisting of either PGx/neutral (n = 18/51) or disease/neutral (n = 113/51) variants. Using defined cut-offs (see Methods) for the various tools, we found that CYPVarPred performed the best in classification of variants in both datasets (MCC = 0.83 and 0.69 on the disease and PGx sets respectively) (Fig. 6b). Overall, all predictors performed better on the disease/neutral set than on the PGx/neutral set, where CYPVarPred has a clear advantage over the other tools. To further validate CYPVarPred with experimentally assayed CYP variants (such data are available for disease effects but not for PGx effects), we used an independent dataset of 41 CYP39A1 variants from a recent study of rare CYP39A1 variants and their association with exfoliation syndrome⁴⁴, along with an additional 14 CYP39A1 variants uniquely contributed by additional experiments in this study (see Supplementary Table 3), for a total of 55 variants (14 neutral, 41 disease). Across various thresholds and for the specific task of disease vs neutral classification (Fig. 6c, left), CYPVarPred performed comparably to the state-of-the-art PrimateAI-3D (PAI3D) (G. Liang, personal communication, July 6, 2022), while both outperformed other algorithms (Fig. 6c, right). For its main task of PGx vs neutral classification, CYPVarPred vastly outperforms all other prediction tools including the latest version of PAI3D.
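The two evaluation steps described above, scoring a three-class prediction with MCC and collapsing the classes into neutral/non-neutral for comparison with binary predictors, can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and toy labels are hypothetical, and the multi-class generalization of MCC used here is the standard one, which the text does not explicitly confirm as the variant used for CYPVarPred.

```python
import math
from collections import Counter

def multiclass_mcc(y_true, y_pred):
    """Multi-class Matthews correlation coefficient (reduces to binary MCC for 2 classes)."""
    s = len(y_true)                                      # total samples
    c = sum(t == p for t, p in zip(y_true, y_pred))      # correctly predicted samples
    t_k = Counter(y_true)                                # true count per class
    p_k = Counter(y_pred)                                # predicted count per class
    labels = set(t_k) | set(p_k)
    cov_tp = c * s - sum(t_k[k] * p_k[k] for k in labels)
    cov_tt = s * s - sum(t_k[k] ** 2 for k in labels)
    cov_pp = s * s - sum(p_k[k] ** 2 for k in labels)
    denom = math.sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0

def binarize(label):
    """Collapse disease and PGx into a single non-neutral class."""
    return "neutral" if label == "neutral" else "non-neutral"

# Toy labels (hypothetical, not taken from the CYP dataset)
y_true = ["disease", "disease", "disease", "neutral", "neutral", "PGx", "PGx", "disease"]
y_pred = ["disease", "disease", "neutral", "neutral", "neutral", "PGx", "disease", "disease"]

ternary_mcc = multiclass_mcc(y_true, y_pred)
binary_mcc = multiclass_mcc([binarize(y) for y in y_true],
                            [binarize(y) for y in y_pred])
print(round(ternary_mcc, 3), round(binary_mcc, 3))
```

Note that a PGx variant predicted as disease counts as an error in the ternary score but as a correct non-neutral call after binarization, so the two scores are not directly interchangeable.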

SNPdrug3D distinguishes itself from past work of mapping sequence variants to protein-ligand binding sites^18–25,45 because of its unparalleled coverage (~ 65% of proteome in 3D, 20,442 genes) and size of reference proteome (~ 100,000 protein sequences including isoforms), as well as variant annotation depth due to inclusion of homology models (ModBase and HHPred), PDB structures and predicted structures from AlphaFold. It must be noted however that ~ 20% of the residues in AlphaFold structures have pLDDT of less than 70 and were excluded in our analyses. These scores indicate regions of low confidence in the structure prediction, but this is expected in residues falling in loops, linkers and termini that are structurally flexible/disordered and not restricted to rigid conformation. Further, SNPdrug3D also contains information about 6,000 drugs from DrugBank that were also found in the PDB structures and where appropriate, drug-variant proximity data was calculated using VMD⁴⁶. For further improvements, structural summaries from PDBsum⁴⁷ and more complex binding site predictions (e.g. GenProBiS¹⁸) can be considered.

The data in SNPdrug3D are also of clinical relevance, as variation data in the webserver were obtained from the whole-genome sequencing data of more than 80,000 individuals from both gnomAD and SG10K Health cohorts (Fig. 2). For the Singapore population, the unique variants that do not overlap with dbSNP and gnomAD (~0.3 million variants) (Fig. 3c) are of particular interest, as functional variants in this group may explain population-specific differences in disease susceptibility and drug response. Specifically, after mapping a small subset of 1,312 SG10K Health variants with curated pathogenicity annotations³³ to protein structures, we found that about half (~51%) of the variants annotated as damaging were also mapped to residues near one or more bound drugs in at least one protein structure, while around 40% of variants of unknown significance (VUS) were mapped to residues near a bound drug, hence providing a rationale for reannotation and testing of these variants. For the VUS group, the use of SNPdrug3D would support more detailed investigations and biochemical characterization of the possible effects of these variants on protein-drug or ligand binding sites, whereas for damaging variants, SNPdrug3D results can complement existing pathogenicity predictions (e.g. CADD⁴⁸ or REVEL⁴⁹ predictions) or annotations by providing a mechanistic explanation behind the predicted deleterious outcome if the damaging variant was found to possibly disrupt a residue important for drug or ligand binding. Complementary assessment with stability predictors may also be useful to investigate if the variant-induced loss of binding is due to protein destabilization in addition to a direct steric disturbance at the binding site. We explored the prediction ability of two well-known tools^50,51, and observed that more than half of the predictions did not correlate with experimental data in our study (Supplementary Table 4), suggesting that further performance improvements may be needed to predict variant impact on protein stability in these proteins.

By combining sequence variant and structure data, SNPdrug3D provides user-friendly access to further variant information that is specific to the respective cohorts (e.g. population-specific allele frequencies) in addition to structure and sequence-based information. A user just has to provide a query (e.g. gene name, drug name or SNV coordinates) in the webserver’s search page (Extended Data Fig. 2) and select the relevant variant of interest from the results tables to enter the variant results page (Extended Data Fig. 3). At the variant results page and in the sequence feature viewer, the user can inspect up to 14 protein sequence-based features (see Methods) to determine if the variant falls within a protein domain, in a disordered region or a predicted targeting or post-translational modification site. Additionally, nearby amino acid substitutions are also indicated in the viewer along with predictions from SIFT⁷ and PolyPhen2⁶ to allow the user to find out if a variant of interest is clustered with other potentially damaging variants. In the structure feature viewer, the queried residue (in red) and any nearby drug (in yellow if present and within 8Å) are visualized along with the 5 or 8Å binding pocket when this function is toggled. The structure containing a nearby drug with sequence closest to the reference sequence is returned although the user can choose to display other structures (experimental or predicted) from a list below the structure viewer. Finally, for each variant, a direct link out to the relevant variant page in gnomAD or the SG10K Health Chorus Variant Browser page is also provided.

Importantly, we show that SNPdrug3D can be used to identify variants in protein targets that may affect protein-drug binding. Focusing on two drug targets (HCK and DHFR) and two drug metabolizing enzymes (CYP2C19 and CYP2D6) (Figs. 4 and 5), we validated predictions for a total of 7 selected SNVs resulting in amino acid substitutions, 6 of which altered binding of drugs to the protein according to the CETSA-luciferase binding assay results. Of the protein variants impairing drug binding to the protein targets, the DHFR:p.F180S substitution is particularly interesting, as a reduction in binding affinity of methotrexate, an anticancer drug, to DHFR may promote drug resistance as has been demonstrated in other DHFR single amino acid variants^52,53. Although the variant is very rare (< 0.001%) and found only in the SG10K Health cohort, it nevertheless may have implications for methotrexate dosing in the clinical setting. For the CYP enzymes, we unexpectedly found a single amino acid substitution (CYP2D6:p.F120I) that increased the binding affinity of the drug to the enzyme (Fig. 5). Functional CYP variants are known to be unequally distributed amongst the world populations^9,54, and here we found that this variant is more common in the East Asian or Chinese populations, but its effects on drug metabolism remain unreported. If the variant effect is substrate-specific, potential drug-drug interactions involving CYP2D6 substrates may arise if the drug has a stronger affinity for the enzyme than concomitantly administered CYP2D6 substrates. Given this observation, further work will be needed to analyse all variants in this study with other drugs, particularly focusing on established drugs where genetic variations have been shown to affect clinical outcomes (e.g. paroxetine-CYP2D6 and voriconazole-CYP2C19)⁵⁵.

We also demonstrate that information derived from SNPdrug3D provides useful predictive features for building PGx-inclusive missense variant prediction tools such as CYPVarPred. While most conventional tools (e.g. SIFT⁷, PolyPhen2⁶) rely heavily on evolutionary or conservation-based scores to distinguish pathogenic from neutral variants, such features may not be enough to discriminate PGx from neutral variants, especially if disease and PGx variants follow dissimilar evolutionary patterns⁵⁶. Here, we show that conventional tools performed reasonably in distinguishing disease but not PGx variants from neutral CYP variants (Fig. 6b), with some tools such as PrimateAI having a high default cut-off (i.e. 0.803)⁵⁷ that may be unsuitable for classifying variants in our CYP dataset unless the threshold is adjusted downwards to 0.486 and 0.527 for PrimateAI and PAI3D respectively (Figs. 6b and c, see Methods). Consistent with these observations, there have been a limited number of attempts to predict and classify PGx variants more generally by redefining cut-offs^56,58, and the inclusion of other features, such as information derived from protein-drug complexes (e.g. from docking models), has also been suggested as a way to improve the performance of PGx-based classifiers⁴⁰. We show that the latter approach is feasible in a CYP-specific context, with good performance in discriminating both disease and PGx variants from neutral variants in both ternary (Fig. 6a) and binary (Fig. 6b, c) classification tasks. However, we observe that performance across all tools was considerably lower when experimentally assayed variants (Fig. 6c), instead of database-curated variants (Fig. 6b), were used for benchmarking the predictors in a disease versus neutral comparison. This has been similarly demonstrated in past studies^59,60, where it has been suggested that truth sets consisting of functional assay-based variants should be used to evaluate prediction models so that their predictions are robust and clinically relevant.
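
To illustrate what re-deriving a predictor cut-off on a gene-family-specific set can look like, the following is a minimal sketch. It assumes the cut-off is chosen by maximising Youden's J on labelled variants; this criterion, the function name and the toy scores are assumptions for illustration and are not necessarily the procedure described in Methods.

```python
from typing import Sequence, Tuple


def youden_cutoff(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Return (cutoff, J) maximising Youden's J = sensitivity + specificity - 1.

    A variant is called non-neutral when score >= cutoff.
    labels: 1 = non-neutral (disease or PGx), 0 = neutral.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one variant of each class")

    best_cutoff, best_j = float("nan"), float("-inf")
    # Every observed score is a candidate cut-off; ties keep the lowest one.
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= cutoff)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < cutoff)
        j = tp / positives + tn / negatives - 1.0
        if j > best_j:
            best_cutoff, best_j = cutoff, j
    return best_cutoff, best_j


# Toy scores invented for illustration only (not data from this study).
toy_scores = [0.20, 0.35, 0.45, 0.50, 0.60, 0.70, 0.85, 0.90]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]
cutoff, j = youden_cutoff(toy_scores, toy_labels)
```

On this toy set a high fixed threshold such as 0.803 would miss two of the four non-neutral variants, whereas the re-derived cut-off recovers all four at the cost of one false positive. Any cut-off tuned this way should be checked on variants that were not used to choose it.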

It is expected that as more human variation is captured and additional protein structures are obtained, the size and scale of SNPdrug3D will grow along with its utility (e.g. Phase 2 of Singapore’s National Precision Medicine (NPM) programme aims to generate a total of 100,000 Singaporean genomes). We therefore hope that SNPdrug3D will be an integral tool for investigating missense variants and their effects on ligand binding to proteins, and that it will aid the integration of variant information into the clinical setting for more precise and personalized drug treatments. SNPdrug3D also supports earlier and more effective drug development by flagging potential population-specific variability in the respective drug binding pocket. Overall, SNPdrug3D will support future precision medicine initiatives through comprehensive interrogation of the molecular phenotype associated with missense VUS and their relation to inter-individual variation in drug responses and diseases.

Data availability

Protein sequence data underlying the reference proteome are freely available from Ensembl (http://ftp.ensembl.org/pub/release-96/), NCBI RefSeq (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/) and UniProt (https://www.uniprot.org/proteomes/?query=human&sort=score), while protein structure data can be obtained from PDB (https://www.rcsb.org/), ModBase (https://modbase.compbio.ucsf.edu/), HHPred (https://toolkit.tuebingen.mpg.de/tools/hhpred) and the AlphaFold Protein Structure Database (https://alphafold.ebi.ac.uk/). Drug-related information is available via DrugBank (https://go.drugbank.com/releases/latest). Access to variation data from the SG10K Health cohort is limited and requires approval from the National Precision Medicine data access committee of A*STAR (https://npm.a-star.edu.sg/help/), but data from gnomAD are available for download at the consortium’s website (https://gnomad.broadinstitute.org/downloads). Finally, the entire dataset supporting CYPVarPred is available in Supplementary Table 1.

Code availability

SNPdrug3D with gnomAD data is freely available at https://snpdrug3d-gnomad.bii.a-star.edu.sg/ but access to SNPdrug3D with SG10K Health data is limited and will require approval from the National Precision Medicine data access committee of A*STAR (https://npm.a-star.edu.sg/help/).

Acknowledgements

This project was funded by the A*STAR Industry Alignment Fund (Pre-Positioning) (IAF-PP: H17/01/a0/007). The SG10K Health project is also funded by the Industry Alignment Fund (Pre-Positioning) (IAF-PP: H17/01/a0/007). The project made use of participating study cohorts supported by the following funding sources: (1) HELIOS study by grants from a Strategic Initiative at Lee Kong Chian School of Medicine, the Singapore Ministry of Health (MOH) under its Singapore Translational Research Investigator Award (NMRC/STaR/0028/2017) and the IAF-PP: H18/01/a0/016; (2) GUSTO study by the Singapore National Research Foundation under its Translational and Clinical Research (TCR) Flagship Program and administered by the Singapore MOH's National Medical Research Council (NMRC) Singapore (NMRC/TCR/004-NUS/2008, NMRC/TCR/012-NUHS/2014), with additional funding support available through the Agency for Science, Technology and Research (A*STAR) and IAF-PP: H17/01/a0/005; (3) SEED study by NMRC/CIRG/1417/2015, NMRC/CIRG/1488/2018 and NMRC/OFLCG/004/2018; (4) MEC by individual research and clinical scientist award schemes from the Singapore National Medical Research Council (NMRC, including MOH-000271-00) and the Singapore Biomedical Research Council (BMRC), the Singapore Ministry of Health (MOH), the National University of Singapore (NUS) and the Singapore National University Health System (NUHS); (5) PRISM cohort study by NMRC/CG/M006/2017_NHCS, NMRC/STaR/0011/2012, NMRC/STaR/0026/2015, Lee Foundation and Tanoto Foundation; (6) TTSH cohort study by NMRC/CG12AUG2017 and CGAug16M012.

Authors’ contributions

A.M. and D.K. developed annotation tables for SNPdrug3D. D.K. designed the webpage and implemented the front-end and back-end of the database. C.S.C. developed CYPVarPred. J.K., M.B.O. and C.S.H.T. performed the CETSA assays. A.M., D.K., S.M.S. and C.S.C. wrote the paper. Idea conception and supervision were provided by C.S.V. and S.M.S. All authors were involved in performing the analyses, and all reviewed and approved the final manuscript.

Competing interests

The authors declare no competing interests.

Additional information

Correspondence and requests for materials should be addressed to Chandra S. Verma and Sebastian Maurer-Stroh.

SG10K Consortium authorship list

Rob M Van Dam, Yik Ying Teo, Calvin Woon Loong Chin, Stuart Alexander Cook, Chee Jian Pua, Chengxi Yang, Chia Wei Lim, Pi Kuang Tsai, Wen Jie Chew, Wey Ching Sim, Li-xian Grace Toh, Yap Seng Chong, Peter D Gluckman, Yung Seng Lee, Fabian Yap, Kok Hian Tan, Charumathi Sabanayagam, Yih Chung Tham, Lavanya Raghavan, Tin Aung, Miao Ling Chee, Hengtong Li, Miao Li Chee, Eng Sing Lee, Paul Eillot, Jimmy Lee, Elio Riboli, Irfahan Kassam, Lakshmi Lakshman, Tock Han Lim, Hong Kiat Ng, Theresia Mina, Darwin Tay, Wansaicheong Khin-lin Gervais, Yik Weng Yew, Justin Jeyakani, Rodrigo Toro, Hui Juan Joanna Tan, Shyam Prabhakar, Claire Bellis, Wee Yang Meah, Shi Qi Mok, Bitong Clarabelle Alexandrine Lin

References

Karczewski, K. J. et al. Variation across 141,456 human exomes and genomes reveals the spectrum of loss-of-function intolerance across human protein-coding genes. bioRxiv, 531210, doi:10.1101/531210 (2019).
Wu, D. et al. Large-Scale Whole-Genome Sequencing of Three Diverse Asian Populations in Singapore. Cell 179, 736-749.e15, doi:10.1016/j.cell.2019.09.019 (2019).
Rentzsch, P., Witten, D., Cooper, G. M., Shendure, J. & Kircher, M. CADD: predicting the deleteriousness of variants throughout the human genome. Nucleic Acids Res 47, D886-D894, doi:10.1093/nar/gky1016 (2019).
Tian, Y. et al. REVEL and BayesDel outperform other in silico meta-predictors for clinical variant classification. Sci Rep 9, 12752, doi:10.1038/s41598-019-49224-8 (2019).
Frazer, J. et al. Disease variant prediction with deep generative models of evolutionary data. Nature 599, 91-95, doi:10.1038/s41586-021-04043-8 (2021).
Adzhubei, I., Jordan, D. M. & Sunyaev, S. R. Predicting functional effect of human missense mutations using PolyPhen-2. Curr Protoc Hum Genet Chapter 7, Unit7.20, doi:10.1002/0471142905.hg0720s76 (2013).
Ng, P. C. & Henikoff, S. Predicting deleterious amino acid substitutions. Genome Res 11, 863-874, doi:10.1101/gr.176601 (2001).
Tabor, H. K. et al. Pathogenic variants for Mendelian and complex traits in exomes of 6,517 European and African Americans: implications for the return of incidental results. Am J Hum Genet 95, 183-193, doi:10.1016/j.ajhg.2014.07.006 (2014).
Zhou, Y., Ingelman-Sundberg, M. & Lauschke, V. M. Worldwide Distribution of Cytochrome P450 Alleles: A Meta-analysis of Population-scale Sequencing Projects. Clin Pharmacol Ther 102, 688-700, doi:10.1002/cpt.690 (2017).
Relling, M. V. & Evans, W. E. Pharmacogenomics in the clinic. Nature 526, 343-350, doi:10.1038/nature15817 (2015).
Fattinger, K. et al. Epidemiology of drug exposure and adverse drug reactions in two swiss departments of internal medicine. Br J Clin Pharmacol 49, 158-167, doi:10.1046/j.1365-2125.2000.00132.x (2000).
Dugger, S. A., Platt, A. & Goldstein, D. B. Drug development in the era of precision medicine. Nat Rev Drug Discov 17, 183-196, doi:10.1038/nrd.2017.226 (2018).
Tan, Z. W., Tee, W. V., Guarnera, E., Booth, L. & Berezovsky, I. N. AlloMAPS: allosteric mutation analysis and polymorphism of signaling database. Nucleic Acids Res 47, D265-D270, doi:10.1093/nar/gky1028 (2019).
Chen, Y.-f. & Fu, L.-w. Mechanisms of acquired resistance to tyrosine kinase inhibitors. Acta Pharmaceutica Sinica B 1, 197-207, doi:10.1016/j.apsb.2011.10.007 (2011).
Attia, T. Z. et al. Effect of Cytochrome P450 2C19 and 2C9 Amino Acid Residues 72 and 241 on Metabolism of Tricyclic Antidepressant Drugs. Chemical and Pharmaceutical Bulletin 62, 176-181, doi:10.1248/cpb.c13-00800 (2014).
Zhou, S. F., Liu, J. P. & Chowbay, B. Polymorphism of human cytochrome P450 enzymes and its clinical impact. Drug Metab Rev 41, 89-295, doi:10.1080/03602530902843483 (2009).
Kumar, S., Dudley, J. T., Filipski, A. & Liu, L. Phylomedicine: an evolutionary telescope to explore and diagnose the universe of disease mutations. Trends Genet 27, 377-386, doi:10.1016/j.tig.2011.06.004 (2011).
Konc, J., Skrlj, B., Erzen, N., Kunej, T. & Janezic, D. GenProBiS: web server for mapping of sequence variants to protein binding sites. Nucleic Acids Res 45, W253-W259, doi:10.1093/nar/gkx420 (2017).
Ofoegbu, T. C. et al. PhyreRisk: A Dynamic Web Application to Bridge Genomics, Proteomics and 3D Structural Data to Guide Interpretation of Human Genetic Variants. J Mol Biol 431, 2460-2466, doi:10.1016/j.jmb.2019.04.043 (2019).
Iqbal, S. et al. MISCAST: MIssense variant to protein StruCture Analysis web SuiTe. Nucleic Acids Res 48, W132-W139, doi:10.1093/nar/gkaa361 (2020).
Laskowski, R. A., Stephenson, J. D., Sillitoe, I., Orengo, C. A. & Thornton, J. M. VarSite: Disease variants and protein structure. Protein Sci 29, 111-119, doi:10.1002/pro.3746 (2020).
Meyer, M. J. et al. mutation3D: Cancer Gene Prediction Through Atomic Clustering of Coding Variants in the Structural Proteome. Hum Mutat 37, 447-456, doi:10.1002/humu.22963 (2016).
Jubb, H. C., Saini, H. K., Verdonk, M. L. & Forbes, S. A. COSMIC-3D provides structural perspectives on cancer genetics for drug discovery. Nature Genetics 50, 1200-1202, doi:10.1038/s41588-018-0214-9 (2018).
Kim, P., Zhao, J., Lu, P. & Zhao, Z. mutLBSgeneDB: mutated ligand binding site gene DataBase. Nucleic Acids Res 45, D256-D263, doi:10.1093/nar/gkw905 (2017).
Yan, C. et al. Impact of germline and somatic missense variations on drug binding sites. Pharmacogenomics J 17, 128-136, doi:10.1038/tpj.2015.97 (2017).
Tunyasuvunakool, K. et al. Highly accurate protein structure prediction for the human proteome. Nature 596, 590-596, doi:10.1038/s41586-021-03828-1 (2021).
Jafari, R. et al. The cellular thermal shift assay for evaluating drug target interactions in cells. Nat Protoc 9, 2100-2122, doi:10.1038/nprot.2014.138 (2014).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434-443, doi:10.1038/s41586-020-2308-7 (2020).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res 28, 235-242, doi:10.1093/nar/28.1.235 (2000).
Pieper, U. et al. ModBase, a database of annotated comparative protein structure models, and associated resources. Nucleic Acids Res 39, D465-474, doi:10.1093/nar/gkq1091 (2011).
Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583-589, doi:10.1038/s41586-021-03819-2 (2021).
Wishart, D. S. et al. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res 36, D901-906, doi:10.1093/nar/gkm958 (2008).
Chan, S. H. et al. Analysis of clinically relevant variants from ancestrally diverse Asian genomes. Nat Commun 13, 6694, doi:10.1038/s41467-022-34116-9 (2022).
Takahashi, M. et al. Functional characterization of 21 CYP2C19 allelic variants for clopidogrel 2-oxidation. Pharmacogenomics J 15, 26-32, doi:10.1038/tpj.2014.30 (2015).
Wenthur, C. J., Gentry, P. R., Mathews, T. P. & Lindsley, C. W. Drugs for allosteric sites on receptors. Annu Rev Pharmacol Toxicol 54, 165-184, doi:10.1146/annurev-pharmtox-010611-134525 (2014).
Limviphuvadh, V. et al. Discovering novel SNPs that are correlated with patient outcome in a Singaporean cancer patient cohort treated with gemcitabine-based chemotherapy. BMC Cancer 18, 555, doi:10.1186/s12885-018-4471-x (2018).
Heersche, N. et al. Clinical implications of germline variations for treatment outcome and drug resistance for small molecule kinase inhibitors in patients with non-small cell lung cancer. Drug Resist Updat 62, 100832, doi:10.1016/j.drup.2022.100832 (2022).
Keshava, N. et al. Defining subpopulations of differential drug response to reveal novel target populations. NPJ Syst Biol Appl 5, 36, doi:10.1038/s41540-019-0113-4 (2019).
Chong, C. S., Limviphuvadh, V. & Maurer-Stroh, S. Global spectrum of population-specific common missense variation in cytochrome P450 pharmacogenes. Hum Mutat, doi:10.1002/humu.24243 (2021).
Li, B. et al. In silico comparative characterization of pharmacogenomic missense variants. BMC Genomics 15, S4, doi:10.1186/1471-2164-15-S4-S4 (2014).
Simonetti, L. et al. CYP21A2 mutation update: Comprehensive analysis of databases and published genetic variants. Hum Mutat 39, 5-22, doi:10.1002/humu.23351 (2018).
Muroi, Y. et al. Functional characterization of wild-type and 49 CYP2D6 allelic variants for N-desmethyltamoxifen 4-hydroxylation activity. Drug Metab Pharmacokinet 29, 360-366, doi:10.2133/dmpk.dmpk-14-rg-014 (2014).
Saravanakumar, A., Sadighi, A., Ryu, R. & Akhlaghi, F. Physicochemical Properties, Biotransformation, and Transport Pathways of Established and Newly Approved Medications: A Systematic Review of the Top 200 Most Prescribed Drugs vs. the FDA-Approved Drugs Between 2005 and 2016. Clin Pharmacokinet 58, 1281-1294, doi:10.1007/s40262-019-00750-8 (2019).
Genetics of Exfoliation Syndrome Partnership et al. Association of Rare CYP39A1 Variants With Exfoliation Syndrome Involving the Anterior Chamber of the Eye. JAMA 325, 753-764, doi:10.1001/jama.2021.0507 (2021).
Stephenson, J. D., Laskowski, R. A., Nightingale, A., Hurles, M. E. & Thornton, J. M. VarMap: a web tool for mapping genomic coordinates to protein sequence and structure and retrieving protein structural annotations. Bioinformatics 35, 4854-4856, doi:10.1093/bioinformatics/btz482 (2019).
Humphrey, W., Dalke, A. & Schulten, K. VMD: Visual molecular dynamics. Journal of Molecular Graphics 14, 33-38, doi:10.1016/0263-7855(96)00018-5 (1996).
Laskowski, R. A., Jablonska, J., Pravda, L., Varekova, R. S. & Thornton, J. M. PDBsum: Structural summaries of PDB entries. Protein Sci 27, 129-134, doi:10.1002/pro.3289 (2018).
Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nat Genet 46, 310-315, doi:10.1038/ng.2892 (2014).
Ioannidis, N. M. et al. REVEL: An Ensemble Method for Predicting the Pathogenicity of Rare Missense Variants. Am J Hum Genet 99, 877-885, doi:10.1016/j.ajhg.2016.08.016 (2016).
Pires, D. E., Ascher, D. B. & Blundell, T. L. DUET: a server for predicting effects of mutations on protein stability using an integrated computational approach. Nucleic Acids Res 42, W314-319, doi:10.1093/nar/gku411 (2014).
Delgado, J., Radusky, L. G., Cianferoni, D. & Serrano, L. FoldX 5.0: working with RNA, small molecules and a new graphical interface. Bioinformatics 35, 4168-4169, doi:10.1093/bioinformatics/btz184 (2019).
Srimatkandada, S., Schweitzer, B. I., Moroson, B. A., Dube, S. & Bertino, J. R. Amplification of a Polymorphic Dihydrofolate Reductase Gene Expressing an Enzyme with Decreased Binding to Methotrexate in a Human Colon Carcinoma Cell Line, HCT-8R4, Resistant to This Drug. Journal of Biological Chemistry 264, 3524-3528, doi:10.1016/s0021-9258(18)94097-4 (1989).
Dicker, A. P., Volkenandt, M., Schweitzer, B. I., Banerjee, D. & Bertino, J. R. Identification and characterization of a mutation in the dihydrofolate reductase gene from the methotrexate-resistant Chinese hamster ovary cell line Pro-3 MtxRIII. Journal of Biological Chemistry 265, 8317-8321, doi:10.1016/s0021-9258(19)39074-x (1990).
Chong, C.-S., Limviphuvadh, V. & Maurer-Stroh, S. Global spectrum of population-specific common missense variation in cytochrome P450 pharmacogenes. Human Mutation 42, 1107-1123, doi:10.1002/humu.24243 (2021).
Relling, M. V. et al. The Clinical Pharmacogenetics Implementation Consortium: 10 Years Later. Clin Pharmacol Ther 107, 171-175, doi:10.1002/cpt.1651 (2020).
Gerek, N. Z. et al. Evolutionary Diagnosis of non-synonymous variants involved in differential drug response. BMC Med Genomics 8 Suppl 1, S6, doi:10.1186/1755-8794-8-S1-S6 (2015).
Sundaram, L. et al. Predicting the clinical impact of human mutation with deep neural networks. Nat Genet 50, 1161-1170, doi:10.1038/s41588-018-0167-z (2018).
Zhou, Y., Mkrtchian, S., Kumondai, M., Hiratsuka, M. & Lauschke, V. M. An optimized prediction framework to assess the functional impact of pharmacogenetic variants. The Pharmacogenomics Journal 19, 115-126, doi:10.1038/s41397-018-0044-2 (2019).
Mahmood, K. et al. Variant effect prediction tools assessed using independent, functional assay-based datasets: implications for discovery and diagnostics. Hum Genomics 11, 10, doi:10.1186/s40246-017-0104-8 (2017).
Miosge, L. A. et al. Comparison of predicted and actual consequences of missense mutations. Proc Natl Acad Sci U S A 112, E5189-5198, doi:10.1073/pnas.1511585112 (2015).
