Overview
SNPdrug3D integrates population-level missense variant data from the SG10K Health2 and gnomAD28 cohorts, 3D protein structure data derived from the Protein Data Bank (PDB)29, HHPred, Modbase30 and the AlphaFold Protein Structure Database31, sequence data from Ensembl, UniProt and NCBI RefSeq and information from DrugBank32.
The workflow for combining different data sources to create SNPdrug3D is shown in Fig. 1a. In total, the SNPdrug3D reference proteome comprises of 102,532 unique human proteins (including isoforms) from 20,442 genes with 65% (~ 38 million) of amino acid residues in the proteome being mapped to at least one protein structure from PDB (based on a cut-off of 40% sequence similarity), ModBase, HHPred or AlphaFold (residues with a predicted local distance difference test score (pLDDT) of at least 70). These protein structures may have a ligand or drug as defined by DrugBank and we found that a total 5,962 drug molecules present in this database were associated with a structure. The remaining 35% (~ 20 million) of proteome residues mapped to missing residues in crystallized structures and had no homologous residues in non-human proteins (Fig. 1a, bottom) or were in low confidence regions in AlphaFold structures (pLDDT < 70) possibly associated with intrinsic disorder or are structured only in a complex26. Overall, each residue covered will have associated protein structure and sequence data along with information regarding its proximity (i.e. within 8Å) to one or more drug molecules.
Integrating information from various sources allowed SNPdrug3D to reach a full proteome coverage of ~ 65% (up to 70.5% if low confidence residues in AlphaFold structures are considered), whereas use of protein structure databases individually did not allow for such high coverage (Fig. 1). To illustrate, if we considered only structures that had sequences with very close similarity (i.e. above 98%, B98 group) to our reference sequences, a maximum coverage of less than 20% was obtained (Fig. 1b). If only coverage of the UniProt reference proteome (UP000005640_9606) is considered, we show that including AlphaFold predictions alone allowed us to reach a coverage of more than 62% of the residues in the ~ 20,000 protein sequences found in this proteome (Fig. 1c). If all residues regardless of prediction quality are considered, AlphaFold predictions cover 99% of the UniProt proteome, however, AlphaFold does not include drug binding information and when considering reliable predictions only, the consensus of multiple structure prediction methods used here provides the highest usable coverage of sites in the human proteome. Taken together, we have mapped residues to both experimental and predicted structures while also considering homologous structures (i.e. B40 group – similarity of at least 40–98%) which greatly expanded coverage of the SNPdrug3D proteome from 15% (i.e. B98 only) to 65% and the UniProt proteome from 18.7–72%.
Overall, an unprecedented level of annotation per variant is created in SNPdrug3D by mapping variations in 20,442 genes to 102,532 reference protein sequences and to 202,299 protein structures that contained 28,949 ligands of which 5,962 are considered as “drugs” by DrugBank. Amongst existing efforts, SNPdrug3D has the largest reference proteome size and number of protein structures with the next largest being PhyreRisk which included 18,874 experimental and 84,818 predicted structures with a coverage of 42,485 UniProt proteins19. Furthermore, a key differentiating factor for SNPdrug3D is that it contains binding site information for ~ 42% (5,962) of DrugBank drug molecules (remainder does not have 3D structure protein complexes or are unspecific in binding) derived from PDB protein structures which is at least 5-fold more than previous efforts such as mutLBSgeneDB24 (1,324 drugs) and DrugVar25 (235 drugs).
Mapping variants to structures
Following the above, SNPdrug3D was used to annotate natural missense variants present in the SG10K Health and gnomAD cohorts. As summarized in Fig. 2, analysis of the latest SG10K Health genetic variation catalogue identified 158,331,366 single nucleotide variants (SNV). Of these, 6.2% (9,770,964) were common (allele frequency, AF of more than 1%), while 49% (77,625,433) and 44.8% (70,934,969) were private (seen in only one individual) or rare (AF < = 1% and in more than one individual) respectively. For each genome, an estimated median of over 11,000 missense SNVs per genome including over 200 unknown variants (Fig. 2, right) were found which is higher than previously estimated17.
Including gnomAD variation data, there were ~ 5.8 million unique missense variants with ~ 5.19 million missense variants from gnomAD (Fig. 3a, left) and ~ 1.25 million missense variants in SG10K Health (Fig. 3a, middle). Of these 5.8 million variants, around 0.6 million variants were exclusive to the SG10K Health dataset (i.e. not in gnomAD) and further comparison of these variants with dbSNP v151 revealed that ~ 0.3 million of these were entirely novel (Fig. 3a, right). Around 4.5 million variants were not present in SG10K Health (i.e. were found in gnomAD only) and the rest (~ 675,000) were found in both datasets. While these missense variants may cause amino acid substitutions in the corresponding reference proteins, not all were amenable to be mapped to amino acid residues in protein structures as they may mutate residues in disordered protein regions that were not crystallized or had low confidence predictions if mapped to AlphaFold structures. In total, 68.5% (~ 3.94 of 5.8 million) of variants had protein structure coverage (i.e. mapped to at least one structure with or without a drug molecule bound nearby) with similar proportions of variants structurally annotated in both the gnomAD (~ 67.8%) and SG10K Health (~ 71.1%) datasets.
To understand the contribution of protein-ligand interaction disruptions to disease phenotypes we sorted a subset of SG10K Health variants (i.e. 1,312 missense variants from 508 genes) into three groups (i.e. benign, damaging and VUS) based on annotations from the SG10K Med dataset33 where variants are already curated to identify highly pathogenic variants. Here, all likely pathogenic or pathogenic variants were consolidated into the damaging group and similarly, likely benign or benign variants were grouped into the benign class. Using SNPdrug3D, we observed that ~ 97% of these variants were mapped to a protein structure and of these, ~ 48% resulted in amino acid substitutions within 8Å of a drug binding site (Fig. 3b) which is > 2x higher compared to the whole SG10K Health set (21%, Fig. 3a).
By variant pathogenicity class (Fig. 3b), we show that more than half of the damaging variants (~ 51%) affected amino acids in drug binding sites. As endogenous ligands and drug molecules may share similar binding sites, an amino acid change at these sites may possibly disrupt such protein-ligand interactions, leading to a loss in protein function and manifestation of the disease phenotype. If protein-drug interactions are affected, the drug effect may be diminished, and therapeutic failure may occur in addition to toxicity if drug metabolism is decreased.
Another important role for SNPdrug3D is to help improve annotation of variants of unknown significance (VUS) with potential effects on drug binding. We observe that ~ 40% of VUS in this dataset (n = 204) affect residues near a drug in a ligand-binding site suggesting that disruption of protein-ligand interactions may be a possible molecular phenotype of these variants and besides providing enhanced annotation, makes them prime candidates for further experimental verification of the proposed effect mechanism.
Using SNPdrug3D to identify missense variants that affect protein-drug binding experimentally.
The extensive mapping of missense variants to protein structures allowed us to pinpoint variants that can affect drug binding, possibly leading to varying drug response phenotypes. In the combined global data, we found 1.17 million mapped variants that altered amino acid residues within 8Å of one or more drug in at least one protein structure. To delineate and validate the effects of such variants on protein-drug binding, we examined the effects of several variant-induced amino acid substitutions in drug targets (human HCK and DHFR) and well-known drug metabolizing enzymes (CYP2C19 and CYP2D6). The substitutions were selected based on the SNPdrug3D annotation suggesting they could directly disrupt protein-drug interaction in the protein drug binding pockets and they comprise both known and novel examples.
In CYP2C19, the investigated amino acid substitutions were NP_000760.1:p.R97T, p.A297V and p.R433W. All substitutions were located close to the binding cavity of the drug clopidogrel and the heme group of the holoprotein (Fig. 4a, upper panels) in the homologous CYP2B4. Further validation with a CETSA-luciferase binding assay revealed that p.R97T and p.R433W impaired binding of the drug to the enzyme but not p.A297V (Fig. 4a, bottom panels). CYP2C19:p.R433W is an amino acid substitution with known effect and is part of the no-function CYP2C19*5 haplotype which has been demonstrated to abrogate metabolism of clopidogrel by decreasing protein stability34. A similar disruption of clopidogrel binding to CYP2C19 by p.R97T was also demonstrated (Fig. 4a, bottom right). Like p.R433W, this may be due to decreased enzyme stability in addition to a direct effect on binding due to the lowered thermostability of the mutated enzymes, compared to the wild-type (WT) and A297-mutated enzymes (Extended Data Fig. 1). Although both SNVs associated with p.R97T (NM_000769.4:c.290G > C) and p.R433W (NM_000769.4:c.1297C > T) are rare (< 1% AF in both SG10K Health and gnomAD datasets), we found that the latter SNV to be more common by an order of magnitude in the Malay population (0.21% AF) compared to other ethnic groups like the Chinese (< 0.1% AF) and Indian (variant not observed) populations.
Similarly, we also investigated protein variants in the drug targets HCK and DHFR and found two amino acid substitutions that impaired protein-drug binding according to the CETSA-luciferase assay (Fig. 4b, c). The first, NP_000782.1:p.F180S (NM_000791.4:c.539T > C), was located within 5Å of the drug methotrexate and mapped to a structure homologous to human DHFR (Fig. 4b) while the other, NP_001165601.1:p.N385K (NM_001172130.3:c.1155C > A), was also in proximity and within 5Å of the drug staurosporine in the HCK protein structure homolog (Fig. 4c). Unlike for the CYP2C19 substitutions (p.R433W and p.R97T), protein variants of DHFR and HCK did not decrease the thermal stability of the enzymes (Extended Data Fig. 1). Conversely, p.F180S increased the thermal stability of DHFR compared to WT. These observations suggest that impairment of drug-protein binding may be due to a direct effect on binding rather than protein destabilization. This may occur through steric hindrance or changes in the biochemical properties of the drug binding cavity such as the introduction of a positive charge in HCK:p.N385K. Further analysis of the occurrence of these variants in the two cohorts studied revealed that both variants were very rare and are present only in SG10K Health but not in any gnomAD-associated populations. This is relevant since it means DHFR:p.F180S may promote methotrexate and other anticancer drug resistance in a population-specific context.
In our attempt to find missense VUS that can disrupt protein-drug binding, we also found two SNVs in CYP2D6 that either increase binding to the drug or disrupted binding indirectly via allosteric mechanisms, such as has been observed in drugs targeting kinases and G protein–coupled receptors35. The first is a AEU08335.1:p.F120I substitution (JF307778.1:c.358T > A) (Fig. 5a) that is within 5Å of prinomastat in the enzyme’s catalytic core and, instead of disrupting binding, increases binding of the drug to the enzyme according to the CETSA-luciferase assay results (Fig. 5b). The other amino acid substitution, p.T76M (c.227C > T) (Fig. 5a), is located more than 8Å away from the drug and heme group but nevertheless disrupted priminostat-CYP2D6 binding (Fig. 5b), possibly through a long-range effect and enzyme destabilization (Extended Data Fig. 1). The missense variant (c.227C > T, p.T76M) was found exclusively, albeit rarely, in the Chinese population while the other variant (c.358T > A, p.F120I) is enriched in the East Asian and Chinese population with allele frequencies of 0.7% and 0.8% in the gnomAD and SG10K Health cohorts respectively.
Annotation and prediction of population-specific effects on drug binding is invaluable for drug development efforts which can study and address such effects early in the development pipeline avoiding costly surprises at later stages36–38. This especially includes effects in the important CYP family of drug metabolizing enzymes39.
Deriving predictive features from SNPdrug3D to build a PGx-inclusive variant pathogenicity prediction tool.
Beyond mapping and visualization of missense variants in protein-drug complexes, we also hypothesized that information from SNPdrug3D such as amino acid substitution proximity to drug molecules (or ligands in general) could serve as useful predictive features to build variant effect prediction tools that are able to discriminate PGx variants in addition to disease variants from neutral variants rather than just binary classification of variants into damaging (i.e. disease and PGx) and neutral categories. To the best of our knowledge, tools capable of such ternary classification of missense VUS have not been successfully developed but a previous attempt was made using a random forest classifier which found that PGx and neutral variants had overlapping characteristics40. To this end, we focused on variants in the cytochrome P450 (CYP) superfamily of enzymes as variation in CYP genes are known to cause disease (i.e. CYP21A2 and congenital adrenal hyperplasia41) and also alter drug responses (i.e. CYP2D6 variation and tamoxifen metabolism42). The CYP family has a dominating role in drug metabolism with 80% of FDA-approved drugs from 2005 to 2016 being metabolized by 4 of the 57 CYP gene family members43. Using supervised machine learning, we built a ternary linear-discriminant analysis-based classifier with a manually curated dataset of 1,301 known CYP missense variants (Supplementary Table 1) of which 712 were disease variants, 397 were neutral variants and 192 were PGx variants. Two categories of predictive features were used: i) predictions and rank scores from dbNSFP (including consensus and input of other methods), ii) distance-based annotations from SNPdrug3D based on whether a mapped variant is near (< 8Å) or further from at least one drug after being mapped on to protein-drug complexes (see Supplementary Table 2).
To assess the classifier (‘CYPVarPred’), a held-out test set containing 20% of the total variants in the CYP dataset was used. Comparison of CYPVarPred with a dummy classifier that made stratified predictions based on training set class distributions (i.e. 55% disease: 30% neutral: 15% PGx) showed that CYPVarPred vastly outperformed with a Matthew’s correlation coefficient (MCC) of 0.746 and an average class-specific accuracy of 80% across the three predicted variant classes (i.e. disease, neutral and PGx) (Fig. 6a). To allow direct comparison with classical variant pathogenicity binary predictors, we binarized the outputs of CYPVarPred into neutral/non-neutral (i.e. PGx or disease) predictions and tested all predictors against two separate datasets consisting of either PGx/neutral (n = 18/51) or disease/neutral (n = 113/51) variants. Using defined cut-offs (see Methods) from the various tools to compare with CYPVarPred, we found that CYPVarPred performed the best in classification of variants in both datasets (MCC = 0.83 and 0.69 on the disease and PGx set respectively) (Fig. 6b). Overall, all predictors performed better on the disease/neutral set compared to the PGx/neutral set where CYPVarPred has a clear advantage over the other tools. To further validate CYPVarPred with data consisting of experimentally assayed CYP variants (not available for PGx but disease effect predictions), we sought an independent dataset of 41 CYP39A1 variants from a recent study of rare CYP39A1 variants and their association with exfoliation syndrome44 along with an additional CYP39A1 14 variants uniquely contributed by additional experiments in this study (see Supplementary Table 3) for a total of 55 variants (14 neutral, 41 disease). Across various thresholds and for the specific task of disease vs neutral classification (Fig. 6c, left), CYPVarPred performed comparably to the state-of-art PrimateAI-3D (PAI3D) (G. Liang, personal communication, July 6, 2022) while both outperform other algorithms (Fig. 6c, right). For its main task of PGx vs neutral classification, CYPVarPred vastly outperforms all other prediction tools including the latest version of PAI3D.