Protein engineering in the big data era: harnessing near-redundant structural data

Motivation: Predicting the effect of mutations on protein-protein interactions is important for relating structure to function, as well as for in silico affinity maturation. The effect of mutations on protein-protein binding energy (ΔΔG) can be predicted by a variety of atomic simulation methods involving full or limited flexibility, and explicit or implicit solvent. Methods which consider only limited flexibility are naturally more economical, and many of them are quite accurate, however results are dependent on the atomic coordinate set used. In this work we perform a sequence and structure based search of the Protein Data Bank to find additional coordinate sets and repeat the calculation on each. Results: . We improve increase precision and Positive Predictive Value, and decrease Root Mean Square Error and higher Positive Predictive Value, compared to using single structures. Given the ongoing growth of near-redundant structures in the Protein Data Bank, our method will only increase in applicability and accuracy.


Introduction
In this work we are interested in predicting the change in protein-protein binding energy (ΔΔG) resulting from amino acid substitutions at the protein-protein interface. This quantity determines the change in protein-protein binding affinity and is thus important for understanding signaling, complex assembly, catalysis, host-pathogen interaction, and other functions. It is also industrially interesting, as it is key to designing proteins for therapy and diagnosis, as well as for purification and catalysis.

Methods of computing change in protein-protein binding energy (ΔΔG)
ΔΔG = ΔG mutant -ΔG wild-type . ΔG wild-type is the free energy change upon binding in the wild-type complex, while ΔG mutant is the same quantity for the mutant complex. The equilibrium constant widely used in pharmacology is computed as Kd = exp(ΔG/RT), where R is the universal gas constant. Many computational methods exist to calculate ΔΔG predicted , an estimate of the (known or unknown) experimental value, ΔΔG experimental . Some methods use reduced representations (Dehouck et al., 2009(Dehouck et al., , 2013, while others include all atoms. (Guerois et al., 2002) such methods compute the protein-protein binding enthalpy (including electrostatic and van der Waals interactions) using physical formulae, but differ in the way they estimate the effect of solvent and side-chain entropy. The most successful and widely-used methods use implicit solvent to estimate these latter terms, namely they compute the solvent-accessible surface area on an atomic basis, then combine this quantity with the atom type and empirically-adjusted weight factors. (Guerois et al., 2002;Cornell Cieplak, P., Bayley, C. I., Gould, I. R., Merz, K. M., Ferguson, D. M., Spellmeyer, D. C., Fox, T., Caldwell, J. W. & Kollman, P. A., 1995;Massova and Kollman, 2000;Li et al., 2016) In recent years such approaches have made relatively small gains in accuracy.
Many limited-flexibility, implicit-solvent methods including FoldX offer good accuracy and economy. (Guerois et al., 2002;Lowegard Anna U. AND Frenkel, 2020;Pires et al., 2014) Multiple workers have found it is actually counterproductive to minimize the structure globally (e.g. by Molecular Dynamics or MD); (Petukh et al., 2015a;Beard et al., 2013) rather it is better to model the substitution and limit further modifications to those required for annealing the resulting steric clashes (a mostly-local minimization). All force fields are inevitably biased by the data they are fitted to, and most are classically formulated, considering quantum mechanical effects only indirectly. Thus modeling can only reduce the accuracy of 3D atomic coordinates, compared to X-ray crystallography or other high-resolution experimental methods. Also, potentials which successfully predict changes in protein-protein binding energy upon mutation (ΔΔG) are typically trained on experimental structures (Guerois et al., 2002). All of this argues in favor of limited, local minimization. The downside of local minimization is that it makes the results dependent on the idiosyncrasies of the experimental coordinates which may reflect crystallization conditions, which would change upon mutation, or which represent only one of many thermodynamically accessible configurations. This work addresses this limitation of local minimization, by identifying and using additional structural data. The Protein Data Bank (PDB) is the main repository of protein structural data. Growth of data in the PDB is accelerating, and this includes growth in near-redundant structures which reveal multiple observations of the same or closely related complexes under different experimental conditions. We can repeat the ΔΔG calculation over all such structures, and average over the results to increase precision.

Near-redundant structures in the Protein Data Bank (PDB)
Although the era of fold discovery is over, the growth of structural data is still accelerating. Many of the new structures differ only slightly from existing entries, having been obtained to e.g. seek higher resolution, determine the effect of a mutation, or add a ligand or subunit. In this work "nearredundant structures" refer to those which have the same (or nearly the same) composition, at least in the proteinprotein interface of interest, but which were resolved in separate experiments or crystallographic units. The experimental effort is dramatically reduced when following proven protocols as opposed to solving proteins of previously unknown structure. Thus it is economical to solve the same complex to probe structural variations, improve resolution, etc. As an illustrative example, the complex of the human Growth Hormone (hGH) bound to two copies of its Receptor (hGHR) was resolved in (Receptor et al., 1996). There exists a mutant of hGH which binds one one copy of hGHR, and the same work also reports the 1:1 complex structure (Receptor et al., 1996). The same lab then remodeled part of the interface by point and phage display mutagenesis and reported the structure (Atwell et al., 1997). Lastly, they solved the 1:1 complex again at improved resolution (Clackson et al., 1998). Even minor differences in biopolymer sequence, number of additional subunits, experimental conditions, and fitting procedure can be expected to introduce differences in atomic positions on the order of tenths of Ångströms. While for some purposes these differences may be insignificant, they lead to variations in predicted ΔΔG when using FoldX and other limited-flexibility methods.

Why does averaging improve precision?
Accuracy is the closeness of the prediction or measurement to the correct value, whereas precision refers to the closeness of independent predictions or measurements to each other. The principle underlying homologyScanner is that ZEMu and other methods which perform only a local minimization are subject not only to limitations in accuracy (due to biases in the ΔΔG prediction potential, and to errors in the ΔΔG experimental used as a gold standard) but also in precision (due to particularities of the coordinate sets used). According to the central limit theorem, for N independent random variables distributed with mean μ and standard deviation σ, the sample mean approaches a normal distribution with mean μ and standard deviation √ ⁄ . Here the ΔΔG predicted computed using a single structure would be a random variable sampled from an underlying normal distribution of unknown mean μ and standard deviation . A ΔΔG predicted averaged from multiple calculations would have a smaller standard deviation, √ ⁄ , about the same mean μ. Thus averaging multiple calculations obtained using independent structures provides a more-accurate estimate of the underlying mean μ. Note that even in the hypothetical case of very large N, μ would still be the mean of many individual values of ΔΔG predicted , subject to biases in the FoldX potential, and may not converge to an accurately measured ΔΔG experimental . Also the latter number depends on experimental conditions. And so though precision is increased by our method, one must consider limitations in the force fields and experimental measurements.

Methods
In our method, the user proposes a mutation, the PDB identifier of a single "query" structure, and two lists of chains, one for each subunit in the interaction of interest. Lastly, the user specifies the chain ID, residue position, and substituted residue type, for one or multiple simultaneous substitutions. The user typically needs not perform any further actions until the calculation completes, the remaining steps are automated, per the flowchart (Figure 1). The procedure is illustrated graphically in Figure 2. homologyScanner is an extension of MacroMoleculeBuilder (MMB).

MacroMoleculeBuilder (MMB)
MMB is a general-purpose, multiscale modeling code. Its internal-coordinate framework (Flores et al., 2011) gives us full control over the flexibility of our molecular system, thus one can have chains which are fully rigid, fully flexible, or which are rigid in some parts and flexible in others.
In past work we have used MMB for applications as diverse as morphing (Tek et al., 2016), local minimization (Dourado and Flores, 2014), and fitting to low-resolution electron density maps (Flores, 2014). MMB can also do homology modeling, which may be considered as alignment of a flexible chain (of presumed unknown structure) to a rigid chain of known structure (Flores et al., 2010). In a related technique, both chains can be partially-flexible or fully-rigid, and here the alignment can be called template docking (Dourado and Flores, 2016) or simply rigid alignment, depending on how constraints are applied. MMB can also use the Kabsch algorithm to compute the minimum Root Mean Square Deviation (RMSD) with which two complexes can be alignedthis is a very fast operation.

Sequence search and alignment
homologyScanner starts by searching the PDB (Lopez et al., 2014) for chains which match the sequence of the userspecified chains with very high statistical significance (evalue <= 10 -11 ). Only structures which contain all the userspecified chains, each satisfying the e-value requirement, are kept and the rest are discarded. homologyScanner then uses the SeqAn-based (Döring et al., 2008) alignment tools in MacroMoleculeBuilder (MMB) to compute the sequence identity for each corresponding chain, viz: Matching residues / minimum(query chain length, subject chain length) Structures which do not satisfy the minimum sequence identity (> 90%) are discarded. We thus know that the remaining structures have the relevant chains, but do not yet know whether they have the correct tertiary and quaternary structure. For that we perform the final structural check using MMB as follows.

Structural alignment
The Kabsch structural alignment is based on residue-residue (and ultimately atom-atom) correspondence between the query and subject structures, which we obtain from the mentioned sequence alignment. MMB can robustly deal with missing or non-canonical atoms (often encountered in PDB structures), usually without user intervention. (Tek et al., 2016) The Kabsch alignment gives us the RMSD of the query vs. subject complex. If the RMSD meets a cutoff threshold ( <= 6Å), the homolog complex is then passed on to the ΔΔG calculation. Note that this differs from the procedure of (Sippl, 1993), in which the PDB is searched on structure but not sequence; we wished to use only structures of very similar sequence to maintain accuracy.

ΔΔG calculation
The user specifies a mutation(s) with chain ID(s) and residue number(s) in the context of the query structure. However different subject structures may employ different residue numbering conventions. We translate the user-specified mutation into the numbering system of the subject structure on the basis of the sequence alignment. We the compute ΔΔG predicted on the query and all subject structures, using FoldX.

Results
We benchmarked homologyScanner on the dataset used in (Dourado and Flores, 2014), comprising 1243 mutations (see Table 1, dataset A). This is a very diverse dataset of 1243 mutants, including some mutants with single-substitutions and some with multiple simultaneous substitutions. We first tried using only single structures, as done in (Dourado and Flores, 2014), and then repeated using multiple structures and quantified the improvement in correlation and Root Mean Square Error (RMSE) (Dourado and Flores, 2014): RMSE=1N∑Ni=1(∆∆Gi,predicted−∆∆Gi,experimental)2 We also tested a subset of the above, comprising singleand multiple-substitution mutants for which multiple structures were available (  Figure 5).

Discussion
As noted previously many ΔΔG prediction methods that limit structural rearrangements, particularly in regions distant from the mutation site, can yield good results, compared to no minimization or minimization of entire structures. (Guerois et al., 2002) However a local minimization leaves us vulnerable to structural idiosyncrasies of the structure employed. In the present work, we diversify the coordinate data by identifying additional complexes including the relevant chains, in the relevant quaternary arrangement. Significantly, we introduce no adjustable parameters and so our results should apply to other potentials and localized minimization methods. The part of the error due to experimental uncertainty and limitations of the DDG potential, remains unchanged in our work but is the topic of ongoing research in the field. (Brender et al., 2015;Petukh et al., 2015b;Guerois et al., 2002) Our benchmarking was done using the dataset of (Dourado and Flores, 2014), itself compiled from SKEMPI (Moal and Fernández-Recio, 2012). The full dataset A contains mutants with single and multiple simultaneous substitutions. More to the point, for some mutants in dataset A only one structure is available, whereas for others there are several similar structures available (in one case 18 were found, see Supplementary Table S1). When only one structure is available of course homologyScanner makes no improvement, but since for many there were multiple structures, ho-mologyScanner reduced RMSE by about 0.1 kcal/mol. The more relevant comparison is the case for which multiple structures are available (datasets B, C, and D). Dataset C comprises only single-substitution, D comprises multiple-substitution mutations, while B contains both.
homologyScanner presented the largest RMSE improvement for D, but error was high overall, so in general we do not recommend using homologyScanner for multiple-substitutions. For the single-substitutions (dataset C), the best RMSE of all, 1.04 kcal/mol, was obtained, better than reported in related work. (Guerois et al., 2002;Dourado and Flores, 2014) Aggregate results however are only part of the story. Figure  3 highlights another important feature of homologyScanner:F whereas several outliers are evident when using single structures, there are arguably zero outliers when using multiple structures. For a given ΔΔG experimental , ΔΔG predicted is consistently closer to the trendline for multiple than for single structures. One may note that the slope ΔΔG predicted /ΔΔG experimental is not unity, this is a characteristic of FoldX which we do not reparameterize here; in any case the slope itself is not as important as the statistical measures of accuracy. ROC curves are commonly used to evaluate binary classifiers with an adjustable thresholdin this case, we can classify mutations into those predicted to decrease ΔΔG (improve affinity), vs. those that should increase ΔΔG or leave it neutral. In ROC curves, Area Under the Curve (AUC) and slope at the point TNR = 1, TPR = 0 are two important measures. Larger AUC's correspond to more significant classifiers, whereas steeper slope indicates better performance for the highest-confidence cases, here those with lowest computed ΔΔG. Both quantities are larger when using multiple structures (Figure 4). But perhaps the most important statistic, again for the purposes of design, is PPV, plotted in Figure 5. For single structures, PPV fluctuates around 0.5 for the highest-confidence mutants, that is to say in the range of ΔΔG predicted < -1 kcal/mol. For multiple structures, in contrast, PPV is a solid 1.0 in the same range of ΔΔG predicted . To reiterate, in an experiment all such mutants would have been found to improve affinity. While we believe this result is important and impressive, we also urge the reader to be cautious. This dataset is compiled from published data, which we strongly suspect contains more affinity-improving mutations than would be obtained by random mutagenesis. We reason that may investigators are looking to improve affinity, and will use tools at their disposalpublished and unpublished data, structural calculations, bioinformatics, etc., prior to attempting a new mutationand if they do not succeed they may decide not to publish. So while the case is strong for using multiple structures, we believe PPV will be less than unity in new applications.
In conclusion, we have presented a protocol for taking advantage of the growing accumulation of near-redundant structures in the PDB to improve prediction of ΔΔG. Though the approach is simple, it provides an improvement which is remarkable since clearly demonstrable improvements in ΔΔG accuracy have been slow in recent years. The method should be compatible with perturbative ΔΔG potentials other than FoldX. As multiple programs are required to implement homologyScanner efficiently, and since there is considerable incentive to reuse calculations, we make the method publicly available on an easy to use web server.

Distribution
HomologyScanner is available on a public server at biodesign.scilifelab.se. The setup is shown in Figure 6. To request a ΔΔG calculation, the user goes to the Submit tab and provides the PDB ID of one suitable structure, and specifies the relevant chains in subunit 1 and subunit 2 of the interaction of interest (chains not in the interface can be left out). The user then specifies the mutation to be computed (one to four simultaneous substitutions). homology-Scanner is then invoked, meaning the PDB is searched for structurally similar complexes, the FoldX ΔΔG calculation is performed for all such complexes found, and the user is notified by email (usually within a few hours, depending on queue status and job characteristics) when the job is done. The server saves all results, so the structure search needs be done only once per PDB ID and definition of subunits 1 and 2, and each FoldX calculation is only done once in total. There is also a View tab where the public can browse all results by selecting a PDB ID, subunits, and mutation (all from drop-down lists). They will then see the ΔΔG for all structural homologs, as well as the average ΔΔG. A Jsmol window displays the protein structure in Cartoon render style, with the mutation highlighted in Sticks style. The web server itself comprises a computer with at least two CPU cores, one of which is responsible for running Apache, MySQL, and other web services. The remaining cores are managed by the SLURM queueing system. The web server submits homologyScanner jobs to this queue upon user request. These jobs call homologyScanner itself, which interacts with the PDB to perform the sequence and structure search. homologyScanner spawns one breeder job for each suitable homologous structure found. Breeder is a program introduced in ( Figure 1 . Program flow. The user must provide an initial Protein Data Bank (PDB) ID, specify which relevant chains are in which of two interacting complexes (irrelevant chains may be left out). 1. The fasta_lwp program searches the PDB for structures containing chains homologous (E-value below eValue-Cutoff, here 10 -11 ) to those specified by the user. 2. We group the thus-discovered homolog chains by PDB ID, each such PDB ID is referred to as a "homolog." We loop over the homologs, performing three checks on each. 3. As a first check, we determine whether the thus-discovered homologs contain chains corresponding to all those specified by the user; those not having all such chains are discarded. 4. Homologs in which all chains do not have at least 90% sequence identity vs. the corresponding user-specified chain are discarded. 5. We perform a rigid alignment of the entire homolog against the user-specified structure, based only on the user-specified chains. Noncorresponding (extraneous) chains are moved along with the rest of the complex. This is the most computationally-expensive process, but only needs to be done once for homolog that makes it to this step; results of all three checks are saved persistently. 6. If RMSD > 6.0 Å (again based on corresponding chains), we discard the homolog. Most homologs which are rejected at this step contain the correct chains but in a different configuration. 7. We then compute the ΔΔG for the user-requested mutation, using the homolog structure and FoldX4.

Figures and table
Steps 3-7 are repeated for each homolog. 8. We average ΔΔG over all homologs that reached and completed step 7 and report the result.     Note that in all cases use of multiple structures yields lower RMSE and higher (or equal) correlation than use of single structures. Best results are with single-substitution mutants (as found in (Dourado and Flores, 2014)), for which multiple structures are available (dataset A). p-values give the probability that the difference in performance could have been observed by chance, and were computed using the Wilcoxon signed-rank test, comparing the squared-errors when using single vs. multiple structures. *Where multiple structures were available and only one was used, choice was made at random, and RMSE and correlation averaged over five randomizations. Program ow. The user must provide an initial Protein Data Bank (PDB) ID, specify which relevant chains are in which of two interacting complexes (irrelevant chains may be left out). 1. The fasta_lwp program searches the PDB for structures con-taining chains homologous (E-value below eValueCutoff, here 10-11) to those speci ed by the user. 2. We group the thus-discovered homolog chains by PDB ID, each such PDB ID is re-ferred to as a "homolog." We loop over the homologs, per-forming three checks on each. 3. As a rst check, we determine whether the thus-discovered homologs contain chains corre-sponding to all those speci ed by the user; those not having all such chains are discarded. 4. Homologs in which all chains do not have at least 90% sequence identity vs. the corresponding user-speci ed chain are discarded. 5. We perform a rigid alignment of the entire homolog against the user-speci ed structure, based only on the user-speci ed chains. Non-corresponding (extraneous) chains are moved along with the rest of the complex. This is the most computationally-expensive process, but only needs to be done once for homo-log that makes it to this step; results of all three checks are saved persistently. 6. If RMSD > 6.0 Å (again based on corre-sponding chains), we discard the homolog. Most homologs which are rejected at this step contain the correct chains but in a different con guration. 7. We then compute the ΔΔG for the user-requested mutation, using the homolog structure and FoldX4. Steps 3-7 are repeated for each homolog. 8. We aver-age ΔΔG over all homologs that reached and completed step 7 and report the result.

Figure 2
Illustration of the sequence and structure matching proce-dure.

Figure 3
Scatterplot of ΔΔGpredicted vs. ΔΔGexperimental, for Dataset C (sin-gle-position substitutions, where more than one structure was available). Green circles: mutants averaged over multiple structures, N=522.
Black dots: mutants computed on a single structure --as multiple structures were available for each mutant, this has a higher N=4028. Note the clear outliers are all single-structure points. Note the third quadrant is populated with True Positives --ΔΔGpredicted and ΔΔGexperimental both negative. On the other hand, the fourth quadrant, representing False Positives, does not have any multiple-structure results below ΔΔGpredicted < -0.65 kcal/mol. The improvement in Positive Predictive Value is discussed elsewhere in this work.

Figure 4
Receiver Operating Characteristic, comparing homologyScan-ner vs. calculation on single structures.

Figure 5
Positive Predictive Value (PPV) for single vs. multiple struc-tures. TP + FP is the denominator of PPV, so we emphasize that this quantity becomes small for ΔΔGpredicted < -1 kcal/mol (crosses). This is why the PPV becomes erratic, at least for sin-gle structures.

Figure 6
The homologyScanner public web server. Users can provide PDB ID, chose chains in each of two subunits, and specify a mu-tation to be computed. FoldX ΔΔG is computed for the query and all matching complexes and reported to the user. The re-sults are available for browsing by others. Compute nodes are needed only for high-throughput runs. The software compo-nents are available on github, simtk.org, and dockerhub. A server has also been set up on a single-board computer for private deployment.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download. supplementardata.pdf