PremPLI: Predicting the Effects of Missense Mutations on Protein-Ligand Interactions

Protein-ligand interactions trigger a multitude of signal transduction processes and resistance to small-molecule drugs is the main cause of the failure of therapeutic drugs in clinical practice. Missense mutations altering the binding of ligands to proteins are one of the critical mechanisms that result in genetic disease and drug resistance. Computational methods have made a lot of progress for predicting binding affinity changes and identifying resistance mutations, but they are still not satisfied and need to be further improved in both accuracy and speed. To address these issues, we introduced PremPLI, a structure-based machine learning method for quantitatively estimating the effects of single mutations on ligand binding affinity changes. A comprehensive comparison of the predictive performance of PremPLI with other available methods on two benchmark datasets confirms that our approach performs robustly and presents similar or even higher predictive accuracy than the approaches relying on first-principle statistical mechanics and mixed physics- and knowledge-based potentials while requires much less computational resources. PremPLI can be used for guiding the design of ligand-binding proteins, identifying and understanding disease driver mutations, and finding potential resistance mutations for different drugs. PremPLI is freely available at https://lilab.jysw.suda.edu.cn/research/PremPLI/ and allows to do large-scale mutational scanning.


INTRODUCTION
Protein-ligand interactions are of fundamental importance in myriad processes occurring in living organisms, triggering a multitude of signal transduction processes. [1][2][3] Many genetic diseases are caused by missense mutations through altering binding between small-molecule ligands and proteins. 4,5 For example, the association between disease-related mutations in the type I collagen and the ligand binding sites was observed 6 and the ligand-binding-domain mutations in human androgen receptor gene led to disrupted interaction between the N-and C-terminal domains. 7 Moreover, drug resistance is the main cause of the failure of therapeutic drugs in clinical practice, particularly for the treatment of infectious diseases and cancer, [8][9][10] and mutation in drug target is a critical mechanism that results in drug resistance. [11][12][13] For instance, in hormone-resistant breast cancer, mutations in estrogen receptor 1 gene (ESR1) are associated with acquired endocrine resistance, 14 and two ESR1 ligand-binding site mutations produced partial resistance to the currently available endocrine therapies. 15 During the last decade, next-generation sequencing techniques have detected vast amounts of genetic mutations in humans, leaving clinicians and researchers without knowledge of whether these mutations are associated with genetic diseases or the emergence of drug resistance. Experimental methods can accurately measure the effects of missense mutations on proteins and their complexes, but they are time-consuming and expensive and do not have the capability to tackle large amounts of data. 16 Therefore, the development of reliable computational methods to reveal molecular effects of missense mutations would pave the way for the identification of pathogenic or drug resistance mutations and contribute to many biomedicine related fields and drug discovery. Some attempts to model the effects of mutations on protein-ligand interactions have been made but with limited success and applicability. In summary, the methods can be classified into three main categories: (i) statistical/machine learning approaches using experimentally measured free energy data to parameterize regression models, [17][18][19] such as mCSM-lig, 17 which usually require little computational cost; (ii) simulation-based methods that rely on mixed physics-and knowledge-based potentials to sample side-chain rotamers with restrained backbone motion, such as Rosetta, 20 which have proven successful in the early stages of the discovery process; 21,22 (iii) alchemical free energy calculations that introduce a series of intermediate alchemical states in a thermodynamic cycle, [23][24][25][26][27] which have become increasingly popular in the lead optimization stage of small molecule drug discovery and started being used prospectively by the pharmaceutical industry. 28,29 Several studies have been conducted to examine the performance of these methods on different test cases. 19,25,27 For instance, Aldeghi et al. evaluated the potential of all above three kinds of computational approaches to predict changes in binding affinity of eight different inhibitors with cancer target Abl kinase upon 144 mutations. 19 Overall, the prediction performance of Rosetta and alchemical free energy calculations are better than the current available machine learning approaches but they are still not satisfied and need to be further improved in both accuracy and speed. Alchemical calculations considering the full conformational flexibility of proteinligand complex and the discrete nature of solvent are much more computationally expensive compared to statistical and (semi)-empirical approaches, moreover the workflow of which is complex, tedious, and error-prone. Therefore, developing prediction methods with tradeoffs between computational cost and accuracy is urgently required.
With a large amount of experimental data available, many data-driven machine learning methods have been developed to assess the impact of mutations on protein stability 30-36 and protein-protein interactions, [37][38][39][40][41][42][43][44][45][46][47] including PremPS, 30 MutaBind, 39 and MutaBind2 40 introduced by us. In addition, we also proposed PremPDI 48 and PremPRI 49 to evaluate the effects of mutations on binding between protein and nucleic acid. These approaches have achieved relatively high predictive accuracy with low computational cost. However, very few machine learning methods were proposed to assess the impact of mutations on protein-ligand binding and their prediction accuracy is very limited. [17][18][19] One reason hindering the development is due to the complexity of small molecule chemistry and binding interaction. Another major limitation is the availability of high-quality experimental data that can be used for training. Recently, a manually curated database Platinum was created, 50 which associates experimentally measured effects of mutations on proteinsmall molecule binding with three-dimension structures of the corresponding complexes, allowing us to propose a structure-guided computational method to estimate the affinity changes upon mutations quantitatively. Therefore, we developed a new machine learning computational method, PremPLI, to estimate the effects of single point mutations on protein-ligand interactions by calculating the binding affinity changes quantitatively. PremPLI uses a random forest regression scoring function and consists of 11 sequence-and structure-based features. It performs well across different types of cross-validation and independent tests, with similar predictive accuracy to Rosetta and alchemical free energy calculations but significantly lower computational costs. PremPLI can be used to aid in the development of new drugs to combat rising drug resistance, finding diseasecausing or cancer driver missense mutations, and the design of proteins with novel ligand-binding functionalities and specificities.

Experimental datasets used for training PremPLI
Platinum, 50  coordinates; (f) structures with a resolution lower than 3 Å. Besides, we only retained the interactions between one protein and one type of ligand. For mutations with multiple ∆∆ 9E8 , we first preserved all values, resulting in 859 single mutations (it will be referred to as S859). Second, only one ∆∆ 9E8 value is chosen according to the following criteria in the order specified below: (1) pH is closest to neutral and/or the temperature (T) is closest to room temperature; (2) the value of ∆∆ 9E8 calculated by * has the highest priority, followed by -; (3) the experimental method of fluorimetry is preferred over other measurements. As a result, 796 unique single mutations from 360 complex structures were retained (it will be referred to as S796). The majority of complexes contain only one mutation (Fig. S1a). More information about the training set of S796 is shown in Fig. S1 and Table S1.
The last step is to process PDB coordinate files of complexes, which were obtained from the Protein Data Bank (PDB). 51 First, the biological assembly was used for each complex. Second, only coordinates of proteins and ligands studied were retained. Third, the ligand was further removed from the PDB file when the ratio of the number of its contact heavy atoms to total heavy atoms is less than 0.5. We defined an atom in ligand as a contact atom when it is located within 6 Å from any heavy atom in protein. That is, only ligands that bind relatively strongly to protein were retained. As a result, four types of protein-ligand complex structures were included in the training set: monomer with one ligand; monomer with two or more ligands; homomer with one ligand; homomer with two or more ligands (Fig. S2). Please note, we only study the interactions between one protein and one type of ligand, but they may have multiple chains or multiple ligands.

Experimental datasets used for testing
Recently, Hauser et al. compiled a benchmark dataset consisting of reliable inhibitor ∆p IJ data for 144 clinically identified mutants of human kinase Abl and examined the potential for alchemical free-energy calculations to predict resistance of these mutations to eight FDAapproved tyrosine kinase inhibitors (TKIs). 25  (such as adding missing residues and loops) including two docking models, which were used to evaluate our method and perform comparisons with other methods (this dataset will be referred to as S144 including 144 mutations and eight processed complex structures). Since most users using our method will not or will not be able to produce the structures as Hauser et al. did for performing alchemical free-energy calculations in which the workflow is relatively complex and tedious, we also tested the performance of PremPLI on structures obtained from the Protein Data Bank directly 7 (129 mutations from six Abl-TKI complexes, it will be referred to as S129). This can also check the sensitivity of our method to the initial input structures. The criteria for processing the training dataset have been applied in checking the test sets.
Another dataset proposed by Aldeghi et al. 27 includes 134 single and multiple mutations across 17 proteins and 27 ligands from the Platinum database. Only affinities determined by isothermal titration calorimetry and surface plasmon resonance were selected and proline mutations were excluded. Moreover, it is a diverse and challenging benchmark set, which includes large and flexible ligands, proteins with different folds, and many small-to-large/large-to-small and charge-changing mutations. Aldeghi et al. performed free energy calculations on these 134 mutations using first-principles statistical mechanics and Rosetta protocols, respectively. In our study, 25 multiple mutations were first removed from this benchmark. Then, ten single mutations for which ligand cannot be found in the given complex structures and protein interacts with multiple types of ligands were further removed. As a result, 99 single mutations from 42 complexes were retained (it will be referred to as S99), which is a subset of S796.
The 3D structures of protein-ligand complexes were either taken from Hauser et al. study 25 (S144 dataset) or from the Protein Data Bank (S796, S129 and S99 datasets). Mutant structures were produced using the BuildModel module of FoldX. 34 Missing heavy side-chain and hydrogen atoms in proteins were added via VMD program 52 using the topology parameters of CHARMM36 force field. 53 Hydrogen atoms of ligands were added via Chimera. 54

The model of PremPLI
PremPLI employs Random Forest (RF) regression scoring function and is composed of 11 distinct features that contribute significantly to the quality of the model (Table S2).

•
, a Position-Specific Scoring Matrix (PSSM) score for wild-type residue type at mutated site. PSSMs was calculated from PSI-BLAST 55 searches of protein sequences in NCBI non-redundant database and the resulting profile was constructed using the default parameters. represents the change in conservation induced by a single mutation, which was calculated by PROVEAN program. 56 • 40 and ./0 are the number of contacts between mutated site and ligands in wildtype and mutant structure, respectively. If the distance between two atoms is more than the Van der Waals interaction distance and within 5 angstroms, they were defined in contact.
These terms were calculated by Arpeggio. 57 • presents the molecular weight of ligand which was calculated using XLOGP3. 58 XY is the number of nitrogen and oxygen atoms of ligand.
• term accounts for the hydrophobicity of mutated site. The hydrophobicity for each type of amino acid residue was obtained from 38 .
• [\*] is the fraction of charged residues (R, K, D and E) in the wild-type structure.
, for instance, [ is the number of all arginine residues, and e55 is the total number of amino acids. f = X g X cdd , f is the number of glutamine residues buried in the protein core. A residue is defined as buried if the ratio of solvent accessible surface area of this residue in the protein and in the extended tripeptide is less than 0.2. 59 • eeh is the matrix of single residue interchanges derived from spatially conserved motifs 60 and eei is the matrix of amino acid substitutions produced by superposition of homologous protein structures. 61 They represent the more/less favorable trends of residue exchange in protein structures, obtained from Amino Acid Index Database with identifiers of AZAE970102 and RISJ880101 (AAindex, https://www.genome.jp/aaindex/).

Cross-validation procedures
We first performed 10 times 5-fold and 10-fold cross-validations on the mutations from training dataset (named as CV1 and CV2), and then evaluated the performance of our approach on two low redundancy sets; low redundant at (i) complex (named as leave-one-complex-out validation) and (ii) ligand (named as leave-one-type-ligand-out validation). There are 360 complexes and 168 types of ligands in the training dataset ( Fig. S1a and S1b). Namely, we leave all mutations from one complex or from the complexes having the same type of ligand out as a test set and use the remaining complexes/mutations to train the model, repeating this procedure for each complex or each type of ligand.

Statistical analysis and evaluation of performance
Pearson correlation coefficient (PCC) and root-mean-square error (RMSE) were used to measure the agreement between experimentally determined and predicted values of changes in binding affinity. T-test was used to assess whether the correlation coefficient is statistically significant from zero. Hittner2003 62  PremPLI predictive model is composed of 11 features and built using Breiman's random forest regression algorithm implemented in the R randomForest package. Hyperparameter tuning in a balanced 5-fold cross validation determined that the optimal settings for the number of decision trees and the number of features considered by each tree when splitting a node are 300 and 3 respectively. As described in the Methods section, we compiled two datasets of S859 and S796, and the difference between them is using all values or only one selected binding affinity change for these mutations with multiple ∆∆ 9E8 . The performance trained and tested on S859 and S796 are shown in Table S3 and no significant differences were observed. However, when tested on the blind set of S144, the Pearson correlation coefficients using the models trained by S796 and S859 are 0.46 and 0.36, respectively, the difference is statistically significant (p-value < 0.01, Hittner2003 test). Therefore, the unique single mutation dataset of S796 was used to parameterize PremPLI model. The Pearson correlation coefficient between experimental and calculated changes in binding affinity is 0.70 and the corresponding RMSE is 1.08 kcal mol -1 when trained and tested on S796 (Fig. 1a). Besides, we also tried two other popular learning algorithms of Support Vector Machine (SVM) and eXtreme Gradient Boosting (XGBoost), and the results shown in Table S4 verified that the random forest regression algorithm shows the best performance.

Performance on four types of cross-validation
In supervised machine learning, overfitting is a common problem in which the model  (Table S3). Upon the "leave-onecomplex-out" procedure, the Pearson correlation coefficient between experimental and computed binding affinity changes reach the value of 0.55 and the corresponding RMSE is 1.26 kcal mol -1 (Fig. 1c). Last, we performed a validation by leaving all complexes with the same type of ligand out of the training set ("leave-one-type-ligand-out" validation), and the correlation coefficient is 0.53, remaining statistically significant (Fig. 1d).
In addition, we analyzed 5% outliers to better understand strengths and limitations of PremPLI. Studentized residuals were used in detecting outliers. The performance after removing outliers is presented in Fig. S4a, showing significantly improved PCC after removing 5% outliers.
Through the analysis (Fig. S4b and S4c), we found that the outliers are more likely to occur in complexes with lower-affinity ligand binding and at sites with higher number of proximal atoms and hydrogen bonds. Consistent with the observation from the study of 17 , outliers correspond to mutations with extreme experimental values (highly decreasing or increasing affinity). PremPLI could relatively correctly classify the highly decreasing mutations as decreasing but lose the ability to classify highly increasing mutations (Fig. S4d).

Validation on independent test sets and comparison with other methods
First, the benchmark dataset of S144 was used to assess the predictive performance of  17 The PCC and RMSE values shown in Table 1 and Fig. S5a indicate that R15 has the best performance, followed by PremPLI and FEP+ methods, and ML1 performs the worst. In addition, Aldeghi et al. also built a model of ML2 trained on the S144 and the PCC and RMSE are 0.57 and 0.68 kcal mol -1 , respectively, upon 8-fold nested cross-validation (similar to the leave-one-complex-out validation used in our study). To compare with ML2, we also used S144 as the training set and built a new model. The PCC reaches 0.70 and RMSE is 0.60 kcal mol -1 upon leave-one-complex-out validation, outperforming ML2.
Second, since most users using machine learning methods, such as PremPLI or mCSM-lig, will not or will not be able to process complex structures as Hauser et al. did for performing alchemical free-energy calculations, the workflow of which is relatively complex and tedious, we also calculated binding affinity changes of PremPLI and mCSM-lig using structures obtained from the Protein Data Bank directly (S129, 129 mutations from six Abl-TKI complexes). The results presented in Table 1 demonstrate that PremPLI has significantly better performance than mCSMlig and is not very sensitive to the initial input structures. Moreover, to further compare with mCSM-lig, we also retrained our model on S763 dataset (a training set of mCSM-lig, Table S1).
Even though our features selected were not based on this dataset, we still obtained a significantly higher correlation coefficient (PCC = 0.73 and RMSE = 1.07 kcal mol -1 ) than mCSM-lig (PCC = 0.63 and RMSE = 2.06 kcal mol -1 ), 17 both of which were upon 10-fold cross-validation.
Third, we assessed the predictive performance of PremPLI on S99 dataset. Since S99 is a subset of S796, we retrained the model after removing all mutations in the overlapped complexes with S99 from the training dataset, and then tested it on S99 (named as PremPLI C , trained on 671 mutations from 318 complexes). Binding affinity changes for S99 calculated by 11 Rosetta protocols, first-principles statistical mechanics in which MD simulations used six different force fields, and their combinations were available from the study of 27 . The results of the best performing Rosetta protocol (R14) and MD calculation (A14), and their combination (RMD) were provided in Table 1 and Fig. S5b. The correlation coefficient of PremPLI C is 0.69, significantly higher than other methods (p-value < 0.01, Hitter2003 test). On the S99, a diverse and challenging benchmark set, Rosetta did not perform as well as on the S144. In addition, we found that the prediction values of Rosetta for half of mutations from the S144 and S99 are around zero (Fig. S5).
Last, it is important to put the performances of approaches in the context of their computational cost. The running time of PremPLI for a single mutation per protein with ~ 400 residues is about ten minutes on a single CPU core, and it requires ~ 20 seconds for each additional mutation introduced in the same complex. For instance, calculating ∆∆ |}9.|~• for a mutation and 26 mutations in Abl-axitinib complex takes ~10 and 20 minutes, respectively. However, each ∆∆ estimate takes up to 32 hours on a single CPU core and 72 hours on a single GPU core for Rosetta (R15) and FEP+ calculations, respectively (Table S5).

Performance on predicting resistance mutations
Generally, resistance can be defined according to the resistance fold (RF): RF ≤ 1, no resistance; 1 < RF ≤ 10, low resistance; RF > 10, resistance. 19,25 Following the equation of ∆∆ 9E8 = , no resistance, low resistance and resistance mutations correspond to 9E8 ≤ 0, 0 < 9E8 ≤ 1.36 and 9E8 > 1.36 kcal mol -1 , respectively. Here we examined the potential for PremPLI and other methods to predict resistance mutations. ROC and PR curves using different approaches to distinguish resistance from no and low resistance mutations are shown in Fig. 2.
The values of area under the receiver operating characteristics curve (AUC-ROC) and precisionrecall curve (AUC-PR) and maximal MCC values are provided in Table 1, which are the most relevant performance measures to examine when the objective is to identify resistance mutations rather than assessing their impact on ligand binding quantitatively. The maximal MCC value was calculated for each method through calculating the MCC across a range of thresholds. Given the fraction of resistance mutations in the datasets of S144 and S99, a random classifier would return an AUC-PR of 0.13 and 0.19, respectively. In terms of the results provided in Table 1 and Fig. 2, the best classifiers are PremPLI, Rosetta and MD calculations.
As are shown in Fig. S7, mutations located on binding interface have on average larger effects on protein-ligand interactions than non-interface mutations. The interface residues were defined if any heavy atoms of them are within 5 Å distance from any heavy atoms in ligands.
Rosetta and MD calculations perform well on interface mutations but lose the ability to predict non-interface mutations, while PremPLI yields statistically significant positive correlations in predicting both interface and non-interface mutations ( Fig. 3 and Fig. S7).

Performance on predicting different types of ligands
Previous studies have demonstrated that the impacts of mutations on different types of ligands cannot be equally well predicted. 19,25 Here we further analyzed the predictive performance of PremPLI on each type of ligand. The S144 dataset is composed of eight TKIs, six of which have co-crystal structures and each has more than 20 mutations, and inhibitors of erlotinib and gefitinib adapted docking complex structures and each has less than 10 mutations (Fig. S3). Performance for each inhibitor calculated by six methods is shown in Fig. 4a and Fig. S8. PremPLI and FEP+ perform well on four TIKs with statistically significant PCC values, and Rosetta produces statistically significant PCCs for six TKIs including two using docking models. All approaches lose the ability to predict the effects of mutations on ponatinib binding. By observing experimental and predicted values (Fig. S8), we found that three mutations increasing Abl and axitinib binding are predicted by PremPLI as decreasing mutations. After excluding these three mutations, the correlation coefficients increase up to 0.43 for PremPLI (p-value < 0.05, t-test) but decrease for all other methods (Fig. 4b). Overall, Rosetta, FEP+, and PremPLI alternately present the best performance on different types of ligands, so it is necessary to develop new computational methods to complement each other.
In summary, Rosetta performs well on S144 dataset-with strong correlation, low absolute errors, and good classification performance, while it performs poorly on S99 dataset. MD calculations with a nonequilibrium free energy calculation protocol perform robustly on different data with moderate prediction accuracy. However, for Rosetta or MD calculations, different scoring functions or force fields will present different prediction accuracies, and here we only chose and provided the results of the best performing protocols, so it would be an issue for applying them to unseen data. In addition, both approaches lose the ability to predict non-interface mutations and require heavy computational resources, especially for MD calculations. However, PremPLI presents similar predictive accuracy with the best performing Rosetta and MD calculation protocols while requiring much less computational resources. Therefore, a potentially integrated protocol might be PremPLI calculations for an initial large-scale mutational scan, followed by refinement of the most promising results via the combination of free energy calculations by Rosetta or MD simulations, which would allow for higher predictive power and provide dynamic insight into the impact of mutations.

Online Webserver
The 3D structure of a protein-ligand complex is required by the webserver, and the user could input Protein Data Bank (PDB) code or upload the coordinate file in the standard PDB format.
Biological assemblies or asymmetric unit can be chosen and retrieved from the Protein Data Bank when the user inputs the PDB code (Fig. 5). Next, the server will display a 3D view colored by protein chains and types of ligands and provide the corresponding protein and ligand names (Fig.   5). Given that PremPLI only calculates the impact of mutations on the interaction between a protein and one type of ligand, one or multiple chains that must belong to one protein and one or multiple ligands that must belong to one type can be assigned to the following calculation. The last step is to select mutations and three options are provided: "Specify One or More Mutations", in this option, users can not only perform calculations for the specified single mutations but also be allowed to view the mutated residue in the complex 3D structure; "Upload Mutation List", allows users to upload a file including a list of mutations for performing large-scale mutational scans; "Alanine Scanning", allows users to perform alanine scanning for the selected protein chain (Fig. 5).
For each mutation, the PremPLI server provides ∆∆ (kcal mol -1 ), predicted binding affinity change induced by this mutation (positive and negative sign corresponds to mutations decreasing and increasing affinity, respectively), and location of the mutation (interface/noninterface).  for both S144 and S99 datasets.   All correlation coefficients are statistically significantly different from zero (p-value < 0.01, t-test) except a p-value = 0.14 and b p-value = 0.02. * and ** indicate statistically significant difference between PremPLI and other methods in terms of PCC (Hitter2003 test) and AUC-ROC (DeLong test) with p-value < 0.05 and p-value < 0.01, respectively. PremPLI C : PremPLI was retrained after removing all mutations in the overlapped complexes with S99 from the training dataset. R15: Rosetta using the flex_ddg protocol and REF2015 scoring function. R14: Rosetta using the flex_ddg protocol and talaris2014 scoring function. A14: Amber14sb and GAFF(v2.1)/AM1-BCC force fields were used for proteins and ligands, respectively. RMD: the combination of R14 and A14. The results of more combinations are shown in Figure S6.  Receiver Operating Characteristics (ROC) and Precision recall (PR) curves for different methods to distinguish resistance from other mutations. The number of resistance mutations is 19 for both S144 and S99 datasets.  PremPLI server. Three steps and results pages are provided. "Processing time" refers to the running time of a job without counting the waiting time in the queue.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. PremPLISM21.4.05.pdf