PocketAnchor: Learning Structure-based Pocket Representations for Protein-Ligand Interaction Prediction

Modeling and predicting protein-ligand interactions have a wide range of applications in drug discovery and biological research. Appropriate and effective protein feature representations are of vital importance for developing computational approaches, especially data-driven methods, for predicting protein-ligand interactions. However, existing sequence-based protein representation methods often fail to explicitly learn the spatial features of proteins, while current structure-based methods do not fully investigate the ligand-occupying regions in protein pockets. In this work, we propose a novel structure-based protein representation method, named PocketAnchor, for capturing the local environmental and spatial features of protein pockets to facilitate protein-ligand interaction-related learning tasks. We define “anchors” as probe points reaching into the cavities and those located near the surface of proteins, and we design a specific message passing strategy for gathering local information from the atoms and surface neighboring these anchor points. Comprehensive evaluation of our method demonstrated that it can be successfully applied to detect the ligand binding sites on a protein surface and greatly outperform existing baseline methods. Our anchor-based model also achieved state-of-the-art performance in the protein-ligand binding affinity prediction task and exhibited great generalization ability for novel proteins. Further analyses illustrated that the anchor features learned by PocketAnchor can successfully capture the geometric and chemical properties of subpockets. In summary, our anchor-based approach can provide effective protein feature representations for developing computational methods to improve the prediction of protein-ligand interactions.

sequences of proteins are typically encoded by k-mers (i.e., fragments of length k), one-hot encodings, and matrices containing evolutionary information from the blocks substitution matrix

In this paper, we propose a novel 3D structure-based protein representation method, named PocketAnchor, for addressing protein-ligand interaction prediction problems. Our method employs anchor-based protein feature representations, in which "anchors" are defined to represent the locations and features of the potential ligand-occupying regions. This method for the first time learns the substructure-level feature representations of protein pockets in an end-to-end manner. We design a new information aggregation strategy for anchor-based protein feature representa-

More details about the PocketAnchor module can be found in Methods. In the remaining part of this paper, we will introduce two applications of our anchor-based protein representation method, namely protein-ligand binding site prediction and binding affinity prediction.

[13]), which consisted of 9,444 training samples after excluding the proteins homologous to those in the test set, was used as training data. We mainly used the DCC (i.e., distance from the predicted pocket center to its nearest ligand center) and DCA (i.e., distance from the predicted pocket center to its nearest ligand atom) as

methods, DeepSite [12] and DeepSurf [13], are 3D-CNN-based models that are specifically designed for grid-based protein feature representation, while another, P2Rank [14], uses a random forest classifier that mainly takes the descriptors of the solvent accessible surface as input. As shown in Figure 2c, our PocketAnchor-site model achieved the best performance on both datasets according to the DCC-based success rates, and it achieved the best DCA-based success rates on the COACH420 dataset and the best comparative DCA-based results on the HOLO4k dataset. According to its definition, DCC is a stricter metric than DCA, and our model achieved 9.7% and 7.6% increases in the success rate defined by DCC-(n+2) over the best baseline method on the COACH420 and HOLO4k datasets, respectively.
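
The two metrics defined above can be illustrated with a minimal Python sketch. It is not the evaluation code used in this work: the function names, the 4 Å default threshold taken from the DCC-(n + 2) < 4 Å criterion mentioned later in the text, and in particular the per-ligand way of counting a success over the top n + 2 ranked pockets are assumptions made for illustration.

```python
import math

def centroid(points):
    """Geometric center of a list of (x, y, z) coordinates."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def dcc(center, ligands):
    """Distance from a predicted pocket center to the nearest ligand center."""
    return min(math.dist(center, centroid(lig)) for lig in ligands)

def dca(center, ligands):
    """Distance from a predicted pocket center to the nearest ligand atom."""
    return min(math.dist(center, atom) for lig in ligands for atom in lig)

def success_rate(ranked_centers, ligands, metric, threshold=4.0, extra=2):
    """Fraction of ligands hit by any of the top (n + extra) predicted centers.

    Assumption: n is the number of ligands in the structure, and a ligand
    counts as detected when the chosen metric is below `threshold` (in Å).
    """
    if not ligands:
        raise ValueError("at least one ligand is required")
    top = ranked_centers[: len(ligands) + extra]
    hits = sum(1 for lig in ligands if any(metric(c, [lig]) < threshold for c in top))
    return hits / len(ligands)
```

For a two-atom ligand at (0, 0, 0) and (2, 0, 0) and a predicted center at (5.5, 0, 0), this sketch gives a DCA of 3.5 Å but a DCC of 4.5 Å, so the same prediction passes the DCA criterion and fails the DCC criterion.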

To illustrate the contributions of the two sources of information employed by our model, i.e.,

PocketAnchor-site model.

We also noticed that not all pockets of a protein were occupied by ligands, resulting in potential missing labels in the benchmark datasets. Through a case analysis, we observed that our model found additional binding sites that were not labeled in the benchmark dataset. Figure 2e

In addition, ligands in the successfully predicted pockets (i.e., those with DCC-(n + 2) < 4 Å) tended to have smaller B factors in the crystal structures, which may indicate that the pockets with more stably bound ligands were easier to detect.

In conclusion, our anchor-based method achieved the best performance in detecting the ligand binding sites of novel proteins, and is thus a useful tool for identifying potential ligand-binding pockets in structure-based drug design, especially for protein targets without known protein-ligand

indicating that the current protein representation methods may not generalize well to novel proteins.

We speculate that our anchor-based representation method could directly extract the rich structural information from protein pockets, and thus help alleviate the current generalization issue. In this work, we design an anchor-based model, named PocketAnchor-affinity, for predicting protein-ligand binding affinities (Figure 3a and Methods). More specifically, given a protein pocket, anchors covering the potential ligand binding regions in the pocket were first generated. The anchor and ligand features were then extracted by the PocketAnchor module and a ligand encoder module, respectively. Finally, the protein-ligand binding affinities were predicted through a binding affinity prediction module (more details can be found in Methods).

To thoroughly evaluate the prediction performance as well as the generalization ability of our binding affinity prediction model, we designed three comprehensive evaluation scenarios with dif-

The results indicate that, compared with most data-driven protein-ligand affinity prediction models, which suffer from decreased performance when applied to novel proteins, our anchor-based model has a much better generalization capacity in practical scenarios. This suggests that our model generalizes well and can be potentially applied in real-world drug discovery scenarios for first-in-class protein targets.

In the previous sections, we demonstrated that our anchor-based protein representation methods performed well on the tasks related to protein-ligand interaction prediction. We speculate that this performance may have benefited from the subpocket-level anchor features learned by our model, which encoded the ligand-binding properties of the corresponding subpocket regions. To provide evidence to support this hypothesis, we examined whether the learned anchor features were highly associated with the ligand-binding patterns of the surrounding biophysical environment.

We first visualized the relationship between anchor features and local characteristics of protein pockets (Figure 4a). All the anchors that were close to a high-affinity ligand in the PDBbind-v2020 dataset were included in the visualization (i.e., anchors with distances to ligand fragment centers < 1 Å and affinities ≤ 100 nM). Among these anchors, we found that the anchor features were associated with certain protein surface geometric (i.e., shape index, reflecting the curvature of the local protein surface) and chemical (i.e., hydrophobicity) properties. In addition, the anchor

examined the local subpocket environments. As expected, these anchors were located in the inner sides of protein pockets with concave shapes, and there was also at least one negatively charged amino acid (i.e., aspartic or glutamic acid) close to these anchors (Figure 4c). We also noticed that these anchors, though exhibiting similar patterns, were actually from three distinct proteins (furin, anti-dabigatran antibody, and urokinase-type plasminogen activator). This indicated that

In this section, we describe how to obtain the anchor features using our PocketAnchor method. Basically, each anchor gathers the information from protein atoms and the surface within a sphere of radius 6 Å to represent the corresponding subpocket environment. More specifically, given a protein, let $a_i$, $i = 1, \dots, n_a$ denote the anchors, $u_j$, $j = 1, \dots, n_u$ denote the atoms of the protein, and $s_k$, $k = 1, \dots, n_s$ denote the vertices of the surface mesh, where $i$, $j$, and $k$ stand for the indices, and $n_a$, $n_u$, and $n_s$ stand for the numbers of anchors, atoms, and surface vertices in the protein, respectively. Let $F_a$, $F_u$, and $F_s$ denote the feature vectors of anchors, atoms, and vertices, respectively. The initial feature vector $F$

Figure 1e), according to the following formulas:

where $\mathrm{MPNN}^h_u(\cdot)$ and $\mathrm{MPNN}^h_s(\cdot)$ stand for the MPNN layers for updating atom and surface fea-

combined, that is,

where $W_u$ and $W_s$ stand for the learnable parameters and $\mathrm{Cat}(\cdot, \cdot)$ stands for the concatenation operation. Finally, the anchor features $F_{a_i}$ of anchor $a_i$ are aggregated from both the protein atoms and surface, that is,

where $w_{ij}$ and $w_{ik}$ stand for the normalized distances as weights, $\mathrm{Dist}(\cdot, \cdot)$ stands for the Euclidean distance function, and $\exp(\cdot)$ stands for the exponential function.

specifically, given a ligand, let $g_i$, $i = 1, \dots, n_g$ denote its fragments, and $t_j$, $j = 1, \dots, n_t$ denote

averaging over all the atoms within the fragment, that is,
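
The distance-weighted aggregation of anchor features from nearby atoms and surface vertices, described above, can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the formula of this work: the weight form exp(-Dist) normalized over the neighbors within 6 Å, the separate normalization for atoms and surface vertices, the element-wise sum of the two aggregated vectors, and all function names are hypothetical choices.

```python
import math

def aggregate(anchor, points, feats, radius=6.0):
    """Distance-weighted average of the features of points near an anchor.

    Assumption: weights are exp(-distance), normalized over the neighbors
    that lie within `radius` (in Å). `feats` must be non-empty, with one
    feature vector per point.
    """
    dim = len(feats[0])
    near = []
    for p, f in zip(points, feats):
        d = math.dist(anchor, p)
        if d <= radius:
            near.append((math.exp(-d), f))
    if not near:
        return [0.0] * dim  # no neighbor inside the sphere
    total = sum(w for w, _ in near)
    return [sum(w / total * f[k] for w, f in near) for k in range(dim)]

def anchor_feature(anchor, atom_xyz, atom_feats, vert_xyz, vert_feats):
    """Combine atom-derived and surface-derived information for one anchor.

    Assumption: both feature sets were already projected to the same
    dimension, and the two aggregated vectors are simply added.
    """
    from_atoms = aggregate(anchor, atom_xyz, atom_feats)
    from_surface = aggregate(anchor, vert_xyz, vert_feats)
    return [a + s for a, s in zip(from_atoms, from_surface)]
```

With this weighting, a neighbor at 1 Å contributes more than one at 2 Å, and points outside the 6 Å sphere contribute nothing.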

The ligand binding site prediction module takes the anchor features extracted by PocketAnchor as input. More specifically, given an anchor $a_i$, its features $F_{a_i}$ are first converted into an embedding space through linear projection followed by a leaky ReLU layer, that is,

where the superscript "site" stands for the notation of ligand binding site prediction, $\mathrm{LeakyReLU}(x) = \max(0.1x, x)$ stands for the leaky ReLU activation function, and $W_a^{(\mathrm{site})}$ and $b_a^{(\mathrm{site})}$ stand for the learnable parameters of the linear projection layer. The binding site score $\hat{s}_{a_i}$ is then predicted through a linear projection followed by a sigmoid function, that is,

where $W^{(\mathrm{site})}$ and $b^{(\mathrm{site})}$ stand for learnable parameters, and $\sigmoid(\cdot)$ stands for the sigmoid function.
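
The two-step scoring described above (linear projection with a leaky ReLU, then linear projection with a sigmoid) can be written as a small, dependency-free Python sketch. The parameter names (`W_a`, `b_a`, `w_out`, `b_out`), the list-based matrix layout, and the hidden size are hypothetical and chosen only for illustration; only the activation definitions follow the text.

```python
import math

def leaky_relu(x, slope=0.1):
    """LeakyReLU(x) = max(0.1x, x), as defined in the text."""
    return max(slope * x, x)

def sigmoid(z):
    """Logistic function mapping a real number to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def linear(W, b, x):
    """Affine map W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def site_score(f_anchor, W_a, b_a, w_out, b_out):
    """Binding site score of one anchor from its feature vector.

    W_a, b_a: parameters of the embedding projection (hypothetical names).
    w_out, b_out: parameters of the final scalar projection.
    """
    hidden = [leaky_relu(v) for v in linear(W_a, b_a, f_anchor)]
    logit = sum(w * h for w, h in zip(w_out, hidden)) + b_out
    return sigmoid(logit)
```

For example, `site_score([1.0, -1.0], [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 10.0], 0.0)` yields a hidden vector of approximately [1.0, -0.1], a logit of about 0, and therefore a score of about 0.5.
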
where the superscript "atom" stands for the notation of the atom-level features, $\mathrm{LeakyReLU}(x) = \max(0.1x, x)$ stands for the leaky ReLU activation function, and $W$

where $w_i$ and $w_j$ stand for the weights for individual features, which are calculated as follows:

where $\mathrm{Softmax}(x_i) = \exp(x_i) / \sum_j \exp(x_j)$, and $W$

the outer product between protein atom features and ligand atom features, that is,

The substructure-level features $F^{(\mathrm{sub})}$ are obtained in a similar way by using the protein anchor features $F_{a_i}$, the ligand fragment features $F_{g_j}$, and the ligand global features $F_c$. Finally, the binding affinity $\hat{a}$ is predicted through a linear projection of the above feature vectors:

where $W^{(\mathrm{aff})}$ and $b^{(\mathrm{aff})}$ stand for the learnable parameters.

for DeepSurf [13]. During the training process, the anchors within a radius of 4 Å from any ligand atom were assigned as positive training samples while the rest were assigned as negative ones.
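
The labeling rule for training anchors stated above can be sketched as follows. The function name and the inclusive comparison (distance ≤ 4 Å) are assumptions; the text only states that anchors within a radius of 4 Å from any ligand atom are positives.

```python
import math

def label_anchors(anchors, ligand_atoms, radius=4.0):
    """Return 1 for anchors within `radius` (Å) of any ligand atom, else 0.

    anchors, ligand_atoms: lists of (x, y, z) coordinates.
    """
    return [
        1 if any(math.dist(a, t) <= radius for t in ligand_atoms) else 0
        for a in anchors
    ]
```

For anchors at (0, 0, 0), (3, 0, 0), and (10, 0, 0) and ligand atoms at (0, 0, 1) and (5, 0, 0), the first two anchors are labeled positive and the third negative.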

During the evaluation process, to determine the centers of binding pockets based on the predicted scores of anchors, we first selected the anchor points with prediction scores that were two standard

in the benchmark test datasets were removed to prevent data leakage (20 and 475 samples were removed from COACH420 and HOLO4k, respectively). Two proteins were considered as homologous

4.7 Data processing and evaluation for the binding affinity prediction

The biological assembly information was retrieved from the lines of the .pdb files starting with "REMARK 350".
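
Reading the "REMARK 350" records mentioned above can be sketched with a few lines of Python. This is a simplified illustration rather than the pre-processing code of this work: the function name is hypothetical, the parser relies on whitespace splitting instead of fixed columns, and chains from all biomolecules in a file are merged into one list.

```python
def parse_remark_350(pdb_text):
    """Collect chain IDs and BIOMT transformation rows from REMARK 350 lines.

    Returns (chains, rows), where each row is
    (matrix_row_index, operator_serial, [r1, r2, r3], translation).
    Simplification: records of different biomolecules are not separated.
    """
    chains, rows = [], []
    for line in pdb_text.splitlines():
        if not line.startswith("REMARK 350"):
            continue
        body = line[10:].strip()
        if body.startswith(("APPLY THE FOLLOWING TO CHAINS:", "AND CHAINS:")):
            listed = body.split(":", 1)[1]
            chains += [c.strip() for c in listed.split(",") if c.strip()]
        elif body.startswith("BIOMT"):
            tok = body.split()
            rows.append((
                int(tok[0][5]),               # 1, 2, or 3 from BIOMT1/2/3
                int(tok[1]),                  # serial number of the operator
                [float(v) for v in tok[2:5]], # one row of the rotation matrix
                float(tok[5]),                # translation component
            ))
    return chains, rows
```

Applied to a file whose REMARK 350 section lists chains A and B with a single identity operator, the sketch returns the two chain IDs and three BIOMT rows.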

For the baseline methods, we followed the same pre-processing protocols and recommended hyper-parameters as in the original papers. Note that since the protein sequences were not provided by the PDBbind database, the sequences retrieved using distinct schemes might be different. Here, we trained the sequence-based baseline models using protein sequences from either PDB or the Uniprot database separately and reported the best performance. More specifically, to extract a protein sequence from its structure file obtained from the PDB, we first selected a chain with the largest number of atoms within the 8 Å neighborhood of the ligand. Then the sequence of the chain was used as the PDB sequence, in which the non-standard residues were marked with "X". The sequences with non-standard residues making up more than 50% of the total length were considered abnormal and thus removed. We also adopted the mappings from the PDB IDs

a. Visualization of anchor features using t-SNE. Colors indicate the protein properties, namely the shape index of protein surface, hydrophobicity, amino acid charge types, and hydrogen bond types. The former two properties were obtained by averaging the corresponding properties of surface points within a 6 Å distance from the anchor, while the latter two were collected from the amino acid closest to each anchor. Here, the "H-bond acceptor" group represents the amino acids that can only serve as hydrogen bond acceptors and do not contain any hydrogen bond donor atoms. b. Visualization of anchor features using t-SNE, with those anchors occupied by specific types of ligand fragments colored in red. The diagram and SMILES strings of these ligand fragments are shown. The region covering the majority of the colored anchors in the last example is magnified, and three anchors from three distinct samples are marked in different colors. c. The three selected anchors from the zoomed-in panel in b, in which three anchors from different samples were occupied by a specific ligand fragment. Only the corresponding ligands and the residues located within 4 Å of the selected anchors in the protein pockets are shown.
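
The PDB sequence extraction protocol described for the sequence-based baselines (chain selection within 8 Å of the ligand, "X" for non-standard residues, removal of sequences with more than 50% "X") can be sketched as follows. The function names and the input layout (pre-parsed chain IDs, coordinates, and residue names) are hypothetical; parsing of the structure file itself is left out.

```python
import math

# Three-letter to one-letter codes of the 20 standard amino acids.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}

def pick_chain(protein_atoms, ligand_atoms, cutoff=8.0):
    """Chain with the most atoms within `cutoff` Å of any ligand atom.

    protein_atoms: list of (chain_id, (x, y, z)); returns None if no
    protein atom lies inside the neighborhood.
    """
    counts = {}
    for chain, xyz in protein_atoms:
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            counts[chain] = counts.get(chain, 0) + 1
    return max(counts, key=counts.get) if counts else None

def chain_sequence(residue_names):
    """One-letter sequence with non-standard residues marked as 'X'."""
    return "".join(THREE_TO_ONE.get(name, "X") for name in residue_names)

def is_abnormal(sequence, max_fraction=0.5):
    """True if 'X' makes up more than `max_fraction` of the sequence."""
    if not sequence:
        return True  # assumption: an empty sequence is also discarded
    return sequence.count("X") / len(sequence) > max_fraction
```

For instance, residues ALA, MSE, and GLY give the sequence "AXG", which is kept because only one third of its positions are non-standard.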