PinMyMetal: A hybrid learning system to accurately model metal binding sites in macromolecules

Metal ions are vital components in many proteins for the inference and engineering of protein function, with coordination complexity linked to structural (predominantly 4-residue), catalytic (predominantly 3-residue), or regulatory (predominantly 2-residue) roles. Computational tools for modeling metal ions in protein structures, especially for transient, reversible, and concentration-dependent regulatory sites, remain immature. We present PinMyMetal (PMM), a hybrid machine learning system for predicting zinc ion localization and environment in macromolecular structures. Compared to other predictors, PMM excels in predicting regulatory sites (median deviation of 0.34 Å) and demonstrates superior accuracy in locating catalytic sites (median deviation of 0.27 Å) and structural sites (median deviation of 0.14 Å). PMM assigns a certainty score to each predicted site based on local structural and physicochemical features, independent of homolog presence. Interactive validation through our server, CheckMyMetal, expands PMM’s scope, enabling it to pinpoint and validate diverse functional zinc sites from different structure sources (predicted structures, cryo-EM, and crystallography). This facilitates residue-wise assessment and robust metal binding site design. The lightweight PMM system demands minimal computing resources and is available at https://PMM.biocloud.top. While currently trained on zinc, the PMM workflow can easily adapt to other metals through expanded training data.


Introduction
Metal ions play a crucial role in the structure and function of macromolecules 1, acting as essential cofactors for many enzymes and influencing various molecular and cellular processes 2. About one-third of proteins in known genomes require metal ions to maintain their natural structure and function. However, only a small fraction of metal binding proteins have been elucidated 3,4. Understanding the location of metals in proteins and their interactions is essential for designing new drug synthesis pathways and modifying biological functions 5,6,7,8. For example, the most abundant metal ion in the Protein Data Bank (PDB) is zinc, which is crucial in diseases, drug targeting, stability, and regulation 9.
Studies of metal-protein complexes benefit from experimental methods, yet face artifacts such as incorrect metal incorporation and ion removal during purification 10. In addition, experimental methods face resolution limitations when determining metal binding structures, particularly in cryo-electron microscopy (cryo-EM). Despite the success of cryo-EM with large and complex macromolecules, electron penetration depth and scattering effects hinder high-resolution imaging of metal ions 11. Computational predictions offer advantages including cost-effectiveness, scalability, and high throughput. Combining both approaches provides a more comprehensive understanding of metal sites in proteins. Metal sites typically comprise amino acids close in 3D structure but distant in sequence, and sites with short amino acid spacers between ligands, such as regulatory sites, are also challenging to identify 12. Hence, structure-based predictions are expected to outperform sequence-based methods 13,14,15. Advancements in protein structure prediction, exemplified by AlphaFold2, show promise for accurate predictions of protein structures, offering opportunities and challenges in annotating metal sites in computational models 16.
Existing structure-based metal site predictors employ diverse approaches. BioMetAll 5, TEMSP 17, and GRE4Zn 13 use geometric features, such as metal-ligand distances. CHED 18 focuses on triads of metal-coordinating ligand residues in apoprotein structures. ZincBindDB classifies zinc sites into ten classes, employing machine learning models based on structural characteristics 19. MIB 20,21 and AlphaFill 22 infer the presence of metal ions based on homology to known metal binding structures. Metal3D employs a deep learning algorithm with a voxelized protein environment representation 23. These predictors can be divided into three categories: (I) binding site predictors for metal binding residues (CHED 18, ZincBindDB 19); (II) binding position predictors for metal ion coordinates (Metal3D 23, BioMetAll 5, AlphaFill 22); (III) predictors that identify both residues and coordinates (TEMSP 17, GRE4Zn 13, MIB 20,21). However, these methods have significant drawbacks. BioMetAll lacks templates and a confidence metric but provides many potential binding site locations on a grid, a strategy that finds the site at the cost of increased site uncertainty 5. CHED, TEMSP, GRE4Zn, and MIB exclude metal sites with two or fewer coordinating ligands. Metal3D can only predict the coordinates of metal ions and has a long prediction time, making it unsuitable for large-scale predictions 23. Homology-based predictors like MIB, AlphaFill, and ZincBindDB can successfully find sites that match known metal site patterns, but identifying metal binding sites in proteins lacking sufficient homologous structural domains or motifs remains challenging. The structure-based hybrid machine learning system developed herein, named PinMyMetal (PMM), overcomes these drawbacks to predict both metal location and coordinating ligands.
Metal ions in proteins are typically coordinated by cysteine (C), histidine (H), glutamate (E), and aspartate (D) 24, with sulfur and nitrogen donors, specifically C and H ligands 26,27, increasing site stability according to the principle of hard and soft acids and bases 25. While alkali and alkaline earth metals commonly serve structural roles, transition metals are more versatile in function, as exemplified by zinc, the most abundant transition metal and the focus of PMM. Zinc binding sites with different functions exhibit distinct structures 14,28. Functionally, these sites are generally divided into structural, catalytic, and regulatory sites, predominantly coordinated by four, three, and two residues, respectively 29,30,31. The PMM system uses C and H residues (CH) as the primary measure and E and D residues (ED) as an auxiliary measure. It categorizes binding sites by functionality into three groups, employing a different optimal strategy for each functional group to formulate a hybrid machine learning approach to predict zinc binding sites. Trained on 20,979 nonredundant high-quality zinc binding sites validated by CheckMyMetal (CMM) 32,33, the PMM system incorporates predicted sites into protein structures and further validates them using CMM. It efficiently screens and validates both metal ion locations and coordinating ligands throughout the protein based on amino acid type, location coordinates, structural characteristics, and surrounding hydrophilic profile.
While the current workflow of PMM is trained using zinc binding sites, it also gives informative cues about other common transition metal binding sites (Mn, Fe, Co, Ni, Cu, Cd). Yet, the modeling of non-zinc metal ions should be interpreted carefully. The underlying workflow of PMM is readily extensible to alkali and alkaline earth metal ions by modifying the training data in the model. Our current algorithm using CH as the primary measure can be applied to transition metals by swapping the training set. The application of our algorithm to alkali and alkaline earth metals requires, besides the corresponding training set, the use of carboxyl side chains from glutamic acid and aspartic acid (ED) as the primary measure and hydroxyl side chains from serine and threonine (ST) as the auxiliary measure.
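As a rough illustration of the CH-based grouping described above, the following minimal sketch labels a site by its number of coordinating Cys/His residues and attaches the predominant functional role. The function name `ch_group`, the role table, and the choice to cap the count at four are assumptions made for illustration only; they are not taken from the PMM code.

```python
from collections import Counter

# Predominant role by number of coordinating Cys/His residues, as stated in the text.
ROLE_BY_CH_COUNT = {2: "regulatory", 3: "catalytic", 4: "structural"}


def ch_group(ligands):
    """Return (group label, predominant role) for one-letter ligand residue codes.

    Only C and H count towards the group; E and D are treated as auxiliary.
    Sites with fewer than two C/H ligands return (None, None).
    Capping at four is an illustrative assumption.
    """
    counts = Counter(ligands)
    n_ch = counts["C"] + counts["H"]
    if n_ch < 2:
        return None, None
    n = min(n_ch, 4)
    return f"CH{n}", ROLE_BY_CH_COUNT[n]


print(ch_group(["C", "C", "C", "H"]))
print(ch_group(["C", "H", "E"]))
```

The role attached to each group is only the predominant one; as the text notes, the functional assignment of an individual site can differ.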

PinMyMetal workflow
The PMM workflow features four modules: (a) CMM validation module; (b) Data analysis and summary module; (c) PMM hybrid learning system; (d) Interactive frontend module. While the latter three modules (b-d) connect sequentially, the validation module interacts with all three of the others, providing a validated dataset before data analysis to generate a benchmark dataset and, as a utility module, confirming the validity of the predicted metal binding sites (Fig. 1).
Neighborhood processes all zinc-containing protein structure files, followed by validation of zinc binding sites using CMM. PMM is trained using the CMM-validated benchmark dataset, utilizing geometric characteristics such as ligand amino acid properties, interatomic distance, angle, and atomic type of the binding site. PMM takes the protein structure as input, searches the entire structure based on ligand type, atomic type, and interatomic distances, and predicts candidate zinc binding sites by constraining geometric features (Fig. 1a).
The zinc ion coordinates are deduced from the coordinates of the amino acid atoms that coordinate the zinc. Using the zinc ion coordinates as the center of a sphere, the hydrophilicity profile of atoms within a surrounding 7 Å radius is derived (Fig. 1b).
Predictors addressing these features are then used to judge the likelihood that a zinc ion is bound at each candidate zinc binding site. The PMM frontend, described in more detail in section 2.7, features a web server that allows users to input a protein sequence or protein structure to predict the zinc ion location and the corresponding coordinating ligands.
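The 7 Å neighbourhood used for the hydrophilicity profile can be pictured with a short sketch. The code below only selects atoms inside the sphere and tallies them by element. The flat `(element, (x, y, z))` atom representation and the function name are hypothetical, and the element tally is a stand-in for the actual hydrophilicity profile, whose definition is not given in this section.

```python
import math
from collections import Counter


def neighbourhood_profile(zinc_xyz, atoms, radius=7.0):
    """Tally the elements of atoms lying within `radius` Å of the zinc position.

    `atoms` is an iterable of (element, (x, y, z)) tuples, a hypothetical
    stand-in for a parsed protein structure.
    """
    return Counter(
        element
        for element, xyz in atoms
        if math.dist(zinc_xyz, xyz) <= radius
    )


atoms = [
    ("N", (2.0, 0.0, 0.0)),
    ("O", (0.0, 6.9, 0.0)),
    ("C", (0.0, 0.0, 7.5)),  # outside the 7 Å sphere
    ("S", (1.0, 1.0, 1.0)),
]
profile = neighbourhood_profile((0.0, 0.0, 0.0), atoms)
print(dict(profile))
```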

2.2 The predictive capability and accuracy of the PMM system
PMM first predicts the pairs of residues that could potentially bind zinc according to the geometric characteristics from the CMM-validated benchmark dataset, obtaining candidate zinc binding sites. Subsequently, the binding positions of zinc ions are deduced from the ligand residues of the candidate zinc binding sites. Finally, a hybrid learning system verifies the candidate zinc binding sites within the CH2 and CH3/CH4 groups using different methods. The CH2 group of zinc binding sites is verified with the ensemble model, while the CH3 and CH4 groups are verified with values of hydrophobicity contrast functions (C) and atomic solvation parameters (Δσ). This is done to determine whether the identified zinc ions represent genuine zinc binding sites or are merely false positive hits lacking evidence of zinc binding properties.
PMM uses an innovative algorithm to deduce the coordinates of zinc ions. CH2, CH3, and CH4 groups are self-contained in a relatively early classification stage, while each of the CH groups is further divided into six subgroups and uses six subgroup-specific strategies to deduce the most probable location of zinc ions using the known locations of coordinating atoms. These strategies also consider some fundamental measures and complications, including the composition of the coordinating ligands, the orientation of the CH sidechain, and possible sidechain rotamer conformations. PMM's accuracy is evaluated by measuring the distance between the predicted zinc ion location and the experimentally determined location. For the CH2, CH3, and CH4 groups, the median zinc deviation is 0.34 Å, 0.27 Å, and 0.14 Å, respectively (with average zinc deviations of 0.46 Å, 0.34 Å, and 0.17 Å, respectively) (Fig. 2a).
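The deviation measure above is the Euclidean distance between each predicted zinc position and its experimental counterpart, summarized by median and mean. A minimal sketch follows, assuming the predicted and experimental positions have already been paired one-to-one; the function name and the toy coordinates are illustrative and do not come from the PMM dataset.

```python
import math
from statistics import mean, median


def deviation_summary(predicted, experimental):
    """Median and mean distance (Å) between paired predicted/experimental positions.

    Both arguments are equal-length sequences of (x, y, z) tuples, already
    matched site by site.
    """
    deviations = [math.dist(p, e) for p, e in zip(predicted, experimental)]
    return {"median": median(deviations), "mean": mean(deviations)}


# Toy coordinates chosen so the distances are exactly 5, 1 and 2 Å.
predicted = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
experimental = [(3.0, 4.0, 0.0), (0.0, 0.0, 1.0), (0.0, 2.0, 0.0)]
print(deviation_summary(predicted, experimental))
```

Reporting both statistics is useful because a few poorly placed sites raise the mean much more than the median, which matches the pattern of mean deviations exceeding medians in the values quoted above.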
An ensemble model is used to verify CH2 candidate zinc binding sites. The receiver operating characteristic curve (ROC curve) and the Precision-Recall curve (P-R curve) are employed to assess the prediction performance of different machine learning or deep learning models. Better performance is indicated by convexity towards the upper left corner in the ROC curve and convexity towards the upper right corner in the P-R curve. The area under the ROC curve (AUC) and the area under the P-R curve (AP) are used as additional measures of prediction performance, with an area score in the range [0, 1] and a higher score indicating better performance (Fig. 2b, c). The ensemble model exhibits an AUC value of 0.994 and an AP value of 0.997, achieving the highest precision and recall when compared with any of its five base learners. The prediction of the ensemble model on the test set is represented by a confusion matrix (Fig. 2d). According to the confusion matrix, the ensemble model exhibits a recall of TP / (TP + FN) = 0.981; a precision of TP / (TP + FP) = 0.978; an F1-score of 2 × precision × recall / (precision + recall) = 0.980; and an accuracy of (TP + TN) / (TP + TN + FP + FN) = 0.973.
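The four confusion-matrix formulas quoted above translate directly into code. The sketch below is a plain restatement of those formulas; the counts in the usage example are made up for illustration and are not the values behind Fig. 2d, and the function assumes the denominators are non-zero.

```python
def classification_metrics(tp, fp, tn, fn):
    """Recall, precision, F1 and accuracy from confusion-matrix counts.

    Assumes tp + fn, tp + fp and precision + recall are all non-zero.
    """
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {
        "recall": recall,
        "precision": precision,
        "f1": f1,
        "accuracy": accuracy,
    }


# Illustrative counts only.
metrics = classification_metrics(tp=8, fp=2, tn=9, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```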
The prediction of candidate zinc sites is conducted using the CMM-validated benchmark dataset (Supplemental Table 1), and the prediction accuracy is evaluated by IoUR, defined in formula (6). For the CH2, CH3, and CH4 groups of sites, 3,627, 3,827, and 11,171 sites could be accurately predicted from 4,348, 4,428, and 12,203 experimental sites, indicating a recall rate of 83.4%, 86.4%, and 91.5%, respectively (Fig. 2e). Using IoUR = 1 instead of IoUR ≥ 0.5 as the threshold results in only a slightly reduced number of 3,457 CH2, 3,627 CH3, and 10,466 CH4 sites, suggesting a somewhat reduced recall rate of 79.5%, 79.7%, and 85.8%, respectively (Supplemental Table 2). The procedure may exclude some experimental sites from consideration due to certain complications, e.g., sites with a distance between ligands exceeding 4.5 Å or sites whose coordinating atoms are backbone N or O. Using a hybrid learning system, different strategies are employed for assigning a certainty score to each candidate site within the CH2 and CH3/CH4 groups. The certainty score ranges from 0 to 1, and candidates with a score greater than 0.5 are considered verified sites.
As a result, PMM recovers 94.7%, 98.3%, and 98.8% verified zinc sites from a total of 3,627 CH2, 3,827 CH3, and 11,171 CH4 candidate zinc sites, respectively (Fig. 2e, Supplemental Table 2). PMM demonstrates high accuracy and recall in predicting the experimental zinc binding sites within the structure. For instance, in the cryo-EM complex structure of CasPhi-2 (Cas12j) bound to crRNA and phosphorothioate-DNA (PDB code: 7lyt) 34, PMM successfully predicted the zinc binding site coordinated by residues C670, C667, C685, and C688 (Fig. 2f), exhibiting a minimal distance deviation of 0.025 Å from the zinc site determined by the experimenter.

Prediction of unknown functional sites supported by experimental data
In addition to accurately predicting known experimental binding sites, PMM identifies a large number of previously unknown, putative zinc binding sites, including 2,035 CH2 group, 1,013 CH3 group, and 486 CH4 group zinc binding sites that are not determined in experimental structures. Of these predicted metal binding sites, 425, 98, and 50 sites are from structures determined by cryo-EM, and 445, 304, and 42 sites are metal binding sites that contain a transition metal other than zinc, respectively (Supplemental Table 3).
The CH2 group of zinc binding sites is a typical regulatory site that reversibly binds zinc ions, depending on the zinc concentration or the presence of a chaperone protein in the environment. Therefore, the absence of a zinc site under certain experimental conditions does not necessarily exclude its suitability to bind zinc. PMM predicts a zinc binding site in the ORF1ab protein of the MERS-CoV papain-like protease complex with the C-terminal domain of human ISG15 (PDB code: 5w8t) 35, coordinated by ligands C32 and H81. Although the zinc ion is not determined in the experimental structure, electron density is observed at the proposed zinc location (Fig. 3a).
CH3 and CH4 groups of zinc binding sites could escape experimental determination for several reasons: (1) the similarity of zinc ions to other commonly observed transition metal ions such as Fe, Cu, and Mn can cause promiscuity and thus the presence of ions other than zinc at the predicted zinc location; (2) the experimenter lacks the expertise or accidentally overlooks the modeling of some zinc binding sites during model building; and (3) limited-resolution structures usually exhibit uncertainty in metal ion modeling (Supplemental Table 3). For example, in the low-resolution X-ray structure of wild-type RNA polymerase II (PDB code: 1nik) 36, determined to a resolution of 4.1 Å, a CH4 site with four cysteine residues does not have the zinc ion modeled despite the presence of electron density (Fig. 3b). Another example is the TRAP-Anti-TRAP complex structure with a resolution of 3.2 Å (PDB code: 2zp9) 37. On the tryptophan RNA-binding attenuator protein-inhibitory protein (Anti-TRAP) within this structure, PMM predicted a CH4 site with four cysteine residues. While electron density is observed at this site, the zinc ion is not modeled in the experimental structure (Fig. 3c).
The number of transition metal ions per 100 amino acids is used as a metric to assess metal annotation efficiency due to the association between lower resolution and higher uncertainty in metal ion modeling. Structures with resolutions better than 2.5 Å are excluded due to the scarcity of atomic-resolution cryo-EM structures (41 structures). The cryo-EM method is commonly used for determining large, complex, or challenging-to-crystallize structures. However, the annotation efficiency for transition metal ions is lower in cryo-EM structures compared to X-ray structures of the same resolution range, consistently decreasing from 0.25 metal ions per 100 amino acids at 3 Å to 0.05 metal ions per 100 amino acids at 5 Å (Fig. S1). PMM is well suited to routinely model missing metal binding sites or annotate candidate metal binding sites in cryo-EM structures. For example, in the structure of the E. coli 50S ribosomal subunit complex with unmodeled metal ions (PDB code: 6xzi) 38, PMM predicts a zinc binding site on 50S ribosomal protein L36 (Chain e). This site is coordinated by residues C11, C14, C27, and H33, and is supported by an observed peak in the charge density map (Fig. 3d). In the structure of mammalian RNA polymerase II subunit RPB7 (PDB code: 6exv) 39, PMM predicts a zinc site coordinated by residues C17, C20, and C42. Although this site is not experimentally modeled, the prediction gives an educated estimate of a candidate zinc binding site that does not contradict the charge density map. Conversely, a nearby zinc binding site modeled by the experimenter is not reasonably coordinated and lacks experimental support (Fig. 3e). These discrepancies underscore the challenges in cryo-EM structural determination, while PMM's prediction suggests its potential in supplementing metal binding site modeling. In the structure of the human SMG1-8-9 kinase complex (PDB code: 7pw5) 40, PMM predicts a zinc binding site of unknown function on the SMG8 protein coordinated by the residues C566, C576, H581, and H601 (Fig. 3f). While the insufficient resolution may not support the direct atomic modeling of metal ions in this model, PMM provides an alternative approach to model coordination bonds pertaining to metal ions in medium-to-low resolution cryo-EM structures.

Comparison with other predictors
The benefits of PMM over other existing predictors (Table 1) can be summarized as follows: (a) PMM predicts both metal binding residues and metal ion coordinates; (b) PMM achieves superior prediction accuracy with minimal error between the coordinates of predicted zinc ions and the actual zinc ions; (c) PMM is faster than other metal predictors. For a protein consisting of 350 amino acids, the prediction using the PMM online web service takes approximately 15 seconds. When utilizing Metal3D for prediction and relying on local CPU processing, the process takes about 130 seconds. However, when using "Huggingface Spaces" for web-based online prediction, despite not requiring downloads and registrations, Metal3D demands more runtime, taking approximately one day; (d) PMM embeds validation for all steps from dataset construction to result verification; (e) PMM specializes in predicting regulatory sites coordinated by only two amino acids, in addition to 3- or 4-coordinated catalytic and structural sites; (f) PMM employs an objective and thorough search strategy to select a negative dataset, in contrast to ZincBindDB and znMachine, which randomly chose arbitrary sites with reasonable geometry yet no experimental zinc as a negative dataset. This minimizes the inclusion of false negative sites or the exclusion of true negative sites in the negative dataset; (g) PMM innovatively uses CH as the major criterion and ED as the auxiliary criterion to predict all possible zinc binding sites. The zinc location is used as the center to find amino acids other than CH residues within the range of the first coordination sphere (2.5 Å) and the second coordination sphere (4 Å) (Fig. 4f, Supplemental Table 4).
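The coordination-sphere search in point (g) amounts to binning neighbouring atoms by their distance from the zinc position. The following minimal sketch uses the 2.5 Å and 4 Å radii given in the text; the `(label, (x, y, z))` atom representation, the function name, and the example labels are hypothetical.

```python
import math


def coordination_spheres(zinc_xyz, atoms, first=2.5, second=4.0):
    """Split atoms into first (<= 2.5 Å) and second (<= 4 Å) coordination spheres.

    `atoms` is an iterable of (label, (x, y, z)) tuples; atoms beyond the
    second sphere are ignored.
    """
    spheres = {"first": [], "second": []}
    for label, xyz in atoms:
        distance = math.dist(zinc_xyz, xyz)
        if distance <= first:
            spheres["first"].append(label)
        elif distance <= second:
            spheres["second"].append(label)
    return spheres


atoms = [
    ("HOH O", (2.1, 0.0, 0.0)),
    ("E45 OE1", (0.0, 3.5, 0.0)),
    ("L10 CD1", (0.0, 0.0, 6.0)),  # beyond both spheres
]
print(coordination_spheres((0.0, 0.0, 0.0), atoms))
```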
Metal3D is a recently published metal ion position predictor based on 3D convolutional neural networks, which is currently the most accurate metal location predictor, with a deviation of 0.70 ± 0.64 Å between predicted positions and experimental locations. PMM features a deviation between predicted positions and experimental locations of 0.323 Å, which is 54% lower than that of Metal3D. The dataset reported in Metal3D includes 189 zinc binding sites from 59 structures and is evaluated using: True Positives (TP) for predictions within 5 Å of an experimental metal site, False Positives (FP) for predictions beyond 5 Å from both actual and other false positive sites, and False Negatives (FN) for experimental sites lacking a predicted metal within the 5 Å threshold. PMM uses the same TP, FP, and FN definitions as Metal3D to define a corresponding dataset that contains 205 validated zinc binding sites from the same 59 structures, and achieves a better prediction precision of 0.983, a better recall of 0.571, and a better average zinc deviation of 0.166 Å when compared with the corresponding values from Metal3D (Supplemental Table 5).
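One possible reading of the TP, FP, and FN definitions above is sketched below. It counts a prediction as a true positive when it lies within 5 Å of any experimental site, collapses false positives that lie within 5 Å of an already counted false positive, and counts an experimental site as a false negative when no prediction lies within 5 Å. The function name and the tie-handling details are assumptions; the exact bookkeeping used by Metal3D or PMM may differ.

```python
import math


def score_against_experiment(predicted, experimental, cutoff=5.0):
    """Count TP, FP and FN for predicted metal positions using a distance cutoff (Å)."""

    def near(point, sites):
        return any(math.dist(point, site) <= cutoff for site in sites)

    # TP: predictions within the cutoff of an experimental site.
    tp = sum(1 for p in predicted if near(p, experimental))

    # FP: predictions far from every experimental site and from earlier FPs.
    fp_sites = []
    for p in predicted:
        if not near(p, experimental) and not near(p, fp_sites):
            fp_sites.append(p)
    fp = len(fp_sites)

    # FN: experimental sites with no prediction within the cutoff.
    fn = sum(1 for e in experimental if not near(e, predicted))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {"TP": tp, "FP": fp, "FN": fn, "precision": precision, "recall": recall}


experimental = [(0.0, 0.0, 0.0), (20.0, 0.0, 0.0)]
predicted = [(1.0, 0.0, 0.0), (50.0, 0.0, 0.0), (51.0, 0.0, 0.0)]
print(score_against_experiment(predicted, experimental))
```

In the toy example the two stray predictions near x = 50 Å count as a single false positive, because the second one falls within 5 Å of the first.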
To compare the selectivity for other common transition metals, a dataset of 292 metal binding sites (38 for Mn, 66 for Fe, 31 for Co, 30 for Ni, 66 for Cu, 61 for Zn) is chosen to evaluate both PMM and Metal3D using a precision and recall distribution map. The PMM prediction results are generally better, with a precision that consistently outperforms that of Metal3D (Fig. 4a). Evaluation of average metal deviation indicates that zinc is the metal with the most accurate prediction in both PMM and Metal3D. The average error of 0.257 Å in PMM is better than that in Metal3D (0.52 ± 0.45 Å) (Supplemental Table 5). An extended CMM-validated dataset with a resolution better than 2 Å is used to evaluate the stability of PMM against different data (Fig. 4b). The precision of PMM consistently increases when using the high-resolution dataset, while the recall values for Mn, Fe, Co, and Ni remain unstable.
Trained on zinc, PMM excels in Zn precision and recall, while the similarity between Zn and Cu makes it also a good Cu predictor in terms of both precision and recall.The relatively low recall for Mn, Fe(III), and Co could be attributed to the higher selectivity of the current PMM model trained on the characteristics of Zn data, e.g., towards tetrahedral geometry against octahedral geometry (Fig. 4a,b).
Tools like AlphaFill use structural homology to transplant metals from similar PDB structures to the predicted structure and cannot be used to predict novel metal binding sites. For example, PMM predicts a novel metal binding site coordinated by four cysteine residues in a tryptophan RNA-binding attenuator protein-inhibitory protein (PDB code: 2zp9), which is further verified by the presence of electron density (Fig. 4c,d,e). Since this site was not experimentally observed in either 2zp9 or any other homologous protein, AlphaFill fails to predict its presence (Fig. 4e). Metal3D can predict two zinc locations in this structure with errors of 0.9 Å and 0.6 Å, comparable to the errors in PMM of 0.9 Å and 0.5 Å (Fig. 4c,d). However, Metal3D predicts two additional zinc locations in the same structure, where no electron density is observed, indicating a higher rate of false positive hits for Metal3D when compared with PMM.

Biological implication of zinc binding site prediction for different types of zinc binding sites
Although zinc ligands and coordination geometries are largely different among regulatory, catalytic, and structural sites, PMM achieves high accuracy with commendable biological implications in all scenarios (Fig. 5).Zinc ions at the regulatory (inhibitory) and catalytic sites in zinc-containing enzymes require two or three coordinating ligands for full activity (Fig. 5a, b, c).PMM can accurately predict zinc ion location at cocatalytic sites containing two or three metals in close proximity with two of the metals bridged by a side chain moiety of a single amino acid residue, such as Asp, Glu, or His and sometimes a water molecule (Fig. 5d).The application of PMM is not limited to a single polypeptide chain, but also includes protein interface zinc sites formed from ligands supplied from amino acid residues residing in the binding surface of two polypeptide chains (Fig. 5e).Similar to other zinc ions, zinc binding sites on the protein interface can be regulatory, catalytic, or structural.

Open-Source PMM predictor: local and web access
The code for the PMM predictor is open-source, allowing peers to download, compile, and run it locally. Additionally, an online version is provided for convenient web-based predictions, enhancing flexibility and ease of use in practical applications.
The PMM web server is publicly available and freely accessible at https://PMM.biocloud.top. Even though PMM is a structure-based method, it implements an automated structure-retrieval interface that allows users to search by protein name or sequence as identified by UniProt ID. The server provides three input methods for the acquisition of protein structures for zinc binding site prediction: (1) PDB ID from the PDB website; (2) UniProt ID of the target protein, which is used to retrieve protein structures from the UniProt database for further analysis. If multiple experimentally determined structures are found for the same UniProt entry, the structure with the highest sequence completeness and highest resolution is chosen. If no experimentally determined structure is found, a computational model from AlphaFold2 is selected; and (3) a PDB or CIF format coordinate file uploaded by the user (Fig. 6).
Pre-processing of the protein structure prompts a chain selection page containing the chain ID, name, source organism, and length for each chain, allowing the user to choose one or more chains of interest to conduct metal binding site prediction.
After submission, users can typically expect to receive a response in about 20 seconds or less. The submitted protein structure, along with all experimental and predicted zinc binding sites, is displayed on an interactive NGL 3D view page (Fig. 6). The output of PMM is divided into two panels: the right panel features the predicted zinc ion location and the coordinating amino acid type and residue sequence number (resseq), while the left panel features the experimentally determined zinc ion location and the coordinating amino acids, annotated with whether or not the site passes the validation criteria. Experimental zinc binding sites that have not passed the validation criteria are compared with predicted zinc binding sites using IoUR ≥ 0.5 as the criterion to determine if they are the same site, as defined in section 2.2. A "CheckMyMetal" button is provided on the PMM output interface to allow the seamless validation of the predicted zinc binding site on the sister CMM website, with an '@' indicating predicted sites. The experimenter may download the coordinates in PDB or CIF format, with the predicted sites annotated in the ATOM and LINK records. A certainty score between 0 and 1, indicating the confidence value of the zinc binding site, is provided in the occupancy field. The NGL interface also allows the visualization of other non-CH amino acids or small molecule ligands within 4 Å of the metal center. Careful examination of the interactions of the zinc coordinating ligands beyond the first coordination sphere could reveal other global characteristics of the protein structure.
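Since the certainty score is written to the occupancy field, it can be read back from a downloaded PDB-format file with fixed-column parsing. The sketch below assumes the standard PDB column layout (coordinates in columns 31-54, occupancy in columns 55-60) and assumes that zinc records carry the residue name `ZN` on either ATOM or HETATM lines; the function name is hypothetical and the exact records written by the PMM server have not been checked here.

```python
def zinc_certainty_scores(pdb_text):
    """Return [((x, y, z), occupancy)] for zinc records in PDB-format text.

    Assumes fixed-column PDB records with residue name "ZN" in columns 18-20
    and the certainty score stored in the occupancy field (columns 55-60).
    """
    scores = []
    for line in pdb_text.splitlines():
        if not line.startswith(("ATOM", "HETATM")) or len(line) < 60:
            continue
        if line[17:20].strip() != "ZN":
            continue
        xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        scores.append((xyz, float(line[54:60])))
    return scores


# Build one illustrative HETATM record with the standard column widths.
record = "HETATM%5d %-4s%1s%3s %1s%4d%1s   %8.3f%8.3f%8.3f%6.2f%6.2f" % (
    1, "ZN", "", " ZN", "A", 501, "", 1.0, 2.0, 3.0, 0.87, 20.0
)
for xyz, score in zinc_certainty_scores(record):
    verdict = "verified" if score > 0.5 else "not verified"
    print(xyz, score, verdict)
```

The 0.5 cut-off in the usage example follows the verification threshold stated in section 2.2.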

Discussion
PMM adopts CH as the major classification scheme and ED as the auxiliary measure, ensuring sufficient training data for each class of coordination motifs. The biological implications of this classification scheme are validated through the analysis of zinc-containing enzyme structures from the PDB. Considering metal ions in macromolecular structures requires a multidisciplinary approach, coherently considering chemical, crystallographic, biological, and experimental aspects 24. PMM's validation procedure, specifically the CMM validation, effectively identifies incorrect metal assignments and suboptimal modeling of metal binding sites. Addressing potential complications, such as geometric distortions of the first coordination sphere, the quality of the diffraction data (e.g., the resolution), and sample preparation concerns, ensures the robustness of PMM in predicting zinc binding sites.
PMM introduces an innovative algorithm that significantly reduces the computational resources required for screening the hydrophobicity contrast function and determining candidate zinc ion locations. By deducing the most probable location before applying the contrast function, PMM maintains accuracy while enhancing efficiency, making it a powerful tool for predicting optimal zinc ion locations within protein structures. Validated zinc ions undergo redundancy removal based on the distance between two zinc ions: two zinc ions are taken to represent the same site if the distance between them falls below a threshold. Compared to Metal3D's threshold of 5 Å, we employ a 2.5 Å threshold, which removes redundancy while still allowing accurate annotation of binuclear zinc sites.
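The distance-based redundancy removal can be sketched as a greedy filter. The version below keeps the highest-scoring site first and discards any later site within 2.5 Å of one already kept. The greedy ordering by certainty score and the function name are assumptions made for illustration; the text specifies only the distance threshold.

```python
import math


def remove_redundant_sites(sites, threshold=2.5):
    """Greedy de-duplication of predicted zinc sites.

    `sites` is a list of ((x, y, z), certainty_score) tuples. Sites are
    visited from highest to lowest score, and a site is kept only if it is
    farther than `threshold` Å from every site already kept.
    """
    kept = []
    for xyz, score in sorted(sites, key=lambda s: s[1], reverse=True):
        if all(math.dist(xyz, kept_xyz) > threshold for kept_xyz, _ in kept):
            kept.append((xyz, score))
    return kept


sites = [
    ((1.0, 0.0, 0.0), 0.8),  # duplicate of the first site (1 Å away)
    ((0.0, 0.0, 0.0), 0.9),
    ((3.0, 0.0, 0.0), 0.7),  # 3 Å away: a separate metal of a binuclear site
]
print(remove_redundant_sites(sites))
print(remove_redundant_sites(sites, threshold=5.0))
```

With the 5 Å threshold all three toy sites collapse into one, which illustrates why a tighter threshold is needed to retain both metals of a binuclear site.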
As a signal transduction messenger, zinc regulates protein activities, including the inhibition of enzymatic activities, yet this occurs only when the concentration of zinc ions rises to a certain level. Nevertheless, the inhibitory sites at the active or allosteric sites of enzymes seem to share similar coordination environments with the typical ligand environments of catalytic zinc in zinc metalloenzymes. The only notable distinction is a tendency for lower coordination numbers in regulatory zinc sites. While the Kd for zinc ions can range from millimolar to micro- or nanomolar concentrations, how zinc regulates enzyme activity is not clearly defined from the structural perspective. The CH2 algorithm provides a one-stop solution to propose a hypothetical mechanism for such inhibition by predicting candidate regulatory (inhibitory) zinc sites and other zinc binding sites coordinated by two CH residues.
Many enzyme active sites feature two metal binding amino acid side chains, such as Cys-Cys, His-His, Cys-His, Glu(Asp)-His, and Cys-Glu(Asp), to form a catalytic dyad. Yet not all of them contain two catalytic cysteine or histidine residues, as seen in enzymes like cysteine proteases, protein tyrosine phosphatases (PTPs), aldehyde dehydrogenases, and glyceraldehyde 3-phosphate dehydrogenase 12. Therefore, failure to predict a zinc binding site due to the lack of two CH residues does not necessarily invalidate a possible inhibitory or regulatory role of zinc via an alternative mechanism.
Evaluation of PMM for its selectivity against other transition metals reveals that both Cu and Zn exhibit high precision and recall (Fig. 4a,b). This can be attributed to the general promiscuity of transition metal ion binding according to the Irving-Williams series 41, with a particular similarity between Cu and Zn (Fig. S2), so that the PMM predictor trained on Zn also works with high accuracy for Cu. Most Zn binding sites could also bind Cu under competitive binding conditions, and selectivity in such cases is not determined solely by the binding site; the contributions of environmental factors, such as chaperones or compartmentalization, should not be underestimated or overlooked. For the CH4 group of structural sites, it is not uncommon to spot incorrectly assigned zinc ions in metalloprotein structures, especially between Zn and other transition metals. For example, Zn has been assigned as Cu (Fig. S3a, PDB code: 3mnd) or Fe (Fig. S3b, PDB code: 1jyb). However, for the CH3 group of catalytic sites and the CH2 group of regulatory sites, the border between Zn and other transition metals is rather thin or even overlapping. Therefore, the fact that no algorithm can uniquely determine a specific metal identity among different transition metals for certain metal binding sites stems from the binding site itself being naturally versatile and lacking selectivity, even from a physiological perspective. After all, a protein predicted by PMM to be zinc binding could also bind multiple metals in vivo due to other environmental factors. This also explains why the PMM predictor trained on Zn possesses a relatively high recall rate for most other transition metals besides Zn and Cu (Fig. 4a,b).
In conclusion, PMM can predict metal ion locations and coordinating ligands based on local geometrical and chemical microenvironments. Applied to zinc binding sites, PMM exhibits superior accuracy and efficiency compared to other predictors, providing the scientific community with a quick way to predict zinc binding sites with easy accessibility, high confidence, and minimal latency. The high efficiency also allows PMM to excel in large-scale prediction of metal binding sites, whether for a superfamily of metal binding proteins or at genomic scale. PMM also specializes in predicting regulatory (transient) metal binding sites (2-residue predominate), which are not specifically handled by any other zinc predictor, and exhibits much higher prediction accuracy than Metal3D 23. Experimentally determined protein structures generally represent a single snapshot of the protein, and the zinc-bound state may not be observed under a specific experimental condition. Therefore, the absence of a zinc binding site in a given crystal structure does not imply its absence in the associated biological processes. In this sense, PMM opens up a new window of opportunity to examine candidate zinc binding proteins from a perspective not accessible using any known experimental or computational methodologies. We have also demonstrated the effective routine use of PMM to annotate metal binding sites in cryo-EM structures with limited resolution. PMM offers a complementary and accurate solution for modeling metal ions in cryo-EM structures, which would otherwise be challenging due to the limitations of electron penetration depth and scattering effects.

Data acquisition, validation, and redundancy elimination
The set of metal-containing protein structures was downloaded from the April 22, 2023 version of the PDB 42 and processed using the Neighborhood database as described earlier 24. The intermolecular interaction between metal ions and proteins is stored in the form of coordination bonds and represents the metal binding site. A total of 55,120 experimentally determined zinc ions from 18,082 protein structures were further inspected to remove free zinc ions or zinc ions coordinated only by water, resulting in a dataset of 38,976 zinc binding sites with two or more coordinating ligands from either cysteine or histidine.
The quality of each zinc binding site is evaluated using CheckMyMetal (CMM) 32, with modifications based on the previously described algorithm used to validate magnesium binding sites in nucleic acid structures 43. Since the previous algorithm was tested for magnesium ions, the validation parameters are adapted to be applicable to other metal binding sites. Three parameters are used to quantitatively evaluate the agreement with the expected valence (oxidation state) (Qv) (2), the completeness of the first coordination sphere (Qc) (3), and the experimental agreement (B factor and occupancy) with the environment (Qe) (6). In all formulas, vi represents the bond valence vector of coordination bond i. In formulas (1)-(3), Vi represents the magnitude of bond valence vector vi; Vox represents the expected oxidation state. In formulas (4)-(5), Bm and Be represent the B factor of the metal (m) or environment (e), while Om and Oe represent the occupancy of the metal (m) or environment (e).
The validation procedure is fine-tuned based on the number of coordinating ligands, assuming that four ligands comprise a stable zinc coordination sphere that adopts a tetrahedral coordination geometry 44. For zinc with 3 or 4 coordinating ligands, a threshold of half of the optimal quality is set as the validation criterion: Qv > 0.5, Qc > 0.5, and Qe > 0.5. For zinc with two coordinating ligands, while the expected oxidation state Vox stays at 2, the optimal theoretical bond valence summation (∑Vi) is 1, and the optimal theoretical vector sum is |v1 + v2| = 0.58. Therefore, the optimal Qv would be 0.5 according to formula (1), and the optimal Qc would be 0.71 according to formula (2). Using a threshold of half of the optimal quality results in different validation criteria: Qv > 0.25, Qc > 0.355, and Qe > 0.5. Structures containing zinc binding sites that pass our validation criteria are subjected to clustering using CD-Hit 45 at a 30% sequence identity cutoff to determine homologous zinc binding sites. For clusters containing more than one zinc binding site, the site with the best quality is chosen as the representative zinc binding site for further analysis. A CMM-validated benchmark dataset was ultimately obtained, comprising 15,353 non-redundant structures and 20,979 zinc binding sites. This benchmark dataset is used to train PMM (Supplemental Table S1).
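The two-ligand optima quoted above (Qv = 0.5, Qc = 0.71, vector sum 0.58) can be checked with a short calculation. The validation formulas are not reproduced in this section, so the sketch below assumes forms that are consistent with the quoted numbers: Qv = 1 - |∑Vi - Vox| / Vox and Qc = 1 - |∑vi| / Vox. It also assumes an ideal tetrahedral site in which each bond carries a valence of Vox / 4 = 0.5. The function names are hypothetical and are not taken from CMM or PMM.

```python
import math

V_OX = 2.0  # expected oxidation state of zinc, as stated in the text


def q_v(valences, v_ox=V_OX):
    """Valence agreement. ASSUMED form: 1 - |sum(V_i) - V_ox| / V_ox."""
    return 1.0 - abs(sum(valences) - v_ox) / v_ox


def q_c(vectors, v_ox=V_OX):
    """Coordination completeness. ASSUMED form: 1 - |sum(v_i)| / V_ox."""
    total = [sum(v[k] for v in vectors) for k in range(3)]
    return 1.0 - math.sqrt(sum(c * c for c in total)) / v_ox


def ideal_bond_vectors(n_ligands, valence=0.5):
    """Bond valence vectors pointing at the first n vertices of a regular
    tetrahedron; 0.5 per bond is V_ox / 4 (an assumption for an ideal site)."""
    vertices = [(1, 1, 1), (1, -1, -1), (-1, 1, -1), (-1, -1, 1)]
    scale = valence / math.sqrt(3)
    return [tuple(scale * c for c in v) for v in vertices[:n_ligands]]


def passes_validation(n_ligands, qv, qc, qe):
    """Thresholds quoted in the text (half of the optimal quality)."""
    if n_ligands == 2:
        return qv > 0.25 and qc > 0.355 and qe > 0.5
    return qv > 0.5 and qc > 0.5 and qe > 0.5  # 3 or 4 ligands


two = ideal_bond_vectors(2)
print(round(q_v([0.5, 0.5]), 2), round(q_c(two), 2))  # 0.5 0.71
```

Under these assumptions, two bonds of valence 0.5 separated by the tetrahedral angle give a vector sum of 1/√3 ≈ 0.58, which reproduces the optimal Qc of 0.71 and hence the relaxed threshold of 0.355.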

Classification of metal binding sites
CHED residues (cysteine, histidine, glutamic acid, aspartic acid) are the most common coordinating residues of metal ions, while the use of donor atoms of other amino acids, such as serine, threonine, or lysine, is rare and accounts for less than 1% of all cases of metal-ligand interactions 46. The hard and soft acids and bases (HSAB) principle implies that zinc proteins containing sulfur and nitrogen donors in the coordination sphere are more stable than those containing oxygen donors 25,26, which also applies to other transition metals, including Mn, Fe, Co, Ni, and Cu. Coordinating ligand analysis of the high-quality non-redundant dataset also reveals that cysteine and histidine are the major contributors to zinc binding sites, with 34,536 zinc ions coordinated by two or more CH residues (85.9%) and 5,690 zinc ions coordinated by zero or one CH residue together with ED residues (14.1%) (Fig. S2). While copper exhibits a similar preference towards CH residues as zinc, the other commonly observed transition metals exhibit a preference towards HED residues, except for iron-sulfur clusters (Fig. S2). Moreover, while Cu and Zn predominantly adopt tetrahedral coordination geometry, Mn, Fe, Co, and Ni adopt both octahedral and tetrahedral geometries.
To reduce the number of classes and ensure sufficient training data for each class of coordination motifs, PMM uses CH as the major classification scheme and ED as the auxiliary measure. This metal ion classification approach fundamentally differs from the principles used in existing metal coordination motif classifiers such as ZincBindDB 19. ZincBindDB considers all CHED combinations and is only able to predict sites with a sufficient number of cases, such as the top 10 most populated classes (C2H1, C2H2, C3, C3H1, C4, D1H1, D1H2, E1H1, E1H2, H3). For CHED combinations with fewer experimentally determined structures, ZincBindDB is either unable to build a prediction model, or the prediction accuracy would be seriously compromised. PMM formulates a straightforward classification scheme using the total number of cysteine and histidine residues as the major criterion. S1). PMM does not overlook the auxiliary measure of ED residues but rather postpones its consideration until after the location of the zinc ion is determined. For example, the structure of a metallopeptidase (PDB code: 2qvp) contains a zinc binding site B460 coordinated by 2 histidine residues, while a third and a fourth coordinating ligand, Glu and water, are also identified after the location of the zinc ion is predicted (Fig. S4).
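The CH-first, ED-later scheme described above can be illustrated with a minimal sketch. The function name, the three-letter residue codes as input, and the choice to cap the group at CH4 are assumptions made for illustration; this is not PMM's actual code.

```python
from collections import Counter


def classify_site(ligands):
    """Assign a site to the CH2/CH3/CH4 group from its coordinating residues.

    ligands: iterable of three-letter residue names, e.g. ["HIS", "HIS", "GLU"].
    Returns (group, auxiliary_ED_residues), or None if fewer than two CH
    residues are present (such sites fall outside the CH scheme).
    """
    counts = Counter(name.upper() for name in ligands)
    n_ch = counts["CYS"] + counts["HIS"]
    if n_ch < 2:
        return None
    group = f"CH{min(n_ch, 4)}"  # capping at CH4 is an assumption
    # ED residues are only an auxiliary measure, recorded after the CH group.
    auxiliary = sorted(n for n in counts.elements() if n in ("GLU", "ASP"))
    return group, auxiliary


# The 2qvp example from the text: two His, plus Glu and water found afterwards.
print(classify_site(["HIS", "HIS", "GLU", "HOH"]))  # ('CH2', ['GLU'])
```

Because only the cysteine and histidine count decides the class, rare CHED combinations still land in one of three well-populated groups, which is the point of the scheme.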
The validity of the CH classification scheme is further verified by its biological implications. Zinc is a ubiquitous cofactor for all six major classes of enzymes, and zinc-containing enzyme structures from the PDB are analyzed. Sites from the CH4 group lack catalytic capability and are considered structural sites, featuring cysteine as the most prominent coordinating ligand, followed by histidine, with the most common combinations being C4 and C3H1. Zinc may contribute to the catalytic activity in sites from the CH3 or CH2 group, featuring histidine as the most prominent coordinating ligand, followed by cysteine, with many common CH combinations in different scenarios (Supplemental Table S6). Catalytic zinc generally forms complexes with any three nitrogen, oxygen, and sulfur donors from CHED residues, with histidine (usually the Nε2 nitrogen) being the predominant amino acid because of its capacity to disperse charge through H-bonding of the other non-liganding nitrogen (usually the Nδ1 nitrogen) 14.

Prediction of candidate zinc binding sites
of Cβ to dodge possible clash (Fig. 7a). A scoring function is used to evaluate the deviation from a Zn-Sγ-Cβ angle of 109° for each point on the abovementioned circle. The highest-scored point is chosen as the optimal location of the target zinc ion.
(c) HH subgroup: The gravity centers Gc1 and Gc2 are calculated using the five atoms forming the corresponding five-membered ring. All four atoms Cδ2, Nε2, Cε1, Nδ1 on the five-membered ring of the histidine side chain are considered as candidate coordinating atoms.
Four rays Gc1-Cδ2, Gc1-Nε2, Gc1-Cε1, Gc1-Nδ1 are drawn for the first five-membered ring, with 2.1 Å segments Gc1z1, Gc1z2, Gc1z3, Gc1z4 aligned with each ray, and z1, z2, z3, z4 being the respective candidate zinc locations. The candidate zinc locations for the second five-membered ring are deduced using the same procedure and denoted as y1, y2, y3, y4. The distances between each candidate zinc location from z1, z2, z3, z4 and each candidate zinc location from y1, y2, y3, y4 are calculated to determine the closest pair of candidate zinc ions (Fig. 7c). The average coordinate of this pair is chosen as the optimal zinc location.
(d) HHH subgroup: Three candidate zinc locations are deduced using strategy c for the HH subgroup. The average coordinate of these three locations is chosen as the optimal zinc location.
(e) Other CH3 subgroup: Three candidate zinc locations are deduced using strategies a-c for the CC, CH, and HH subgroups. A voting mechanism is implemented in this scenario since cysteine is more liable than histidine to adopt a conformation not suitable for coordinating metal. Three distances are calculated, one for each pair of candidate zinc locations, with the closest pair considered the majority vote (2 out of 3). The average coordinate of these two candidate zinc locations is chosen as the optimal zinc location.
(f) CH4 group: The center of the four zinc-coordinating atoms is chosen as the optimal location of a potential zinc ion (Fig. 7d).
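Strategies (c) and (e) above are purely geometric and can be sketched in a few lines. The sketch below is illustrative only: the function names are hypothetical, the ring atoms are passed under their PDB names (CG, ND1, CD2, CE1, NE2), and the 2.1 Å segment is measured from the ring center, following the wording above literally.

```python
import math
from itertools import combinations, product


def centroid(points):
    """Average coordinate of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def ring_candidates(ring_atoms, length=2.1):
    """Candidate zinc positions for one histidine ring (strategy c).

    ring_atoms maps the five imidazole atom names to coordinates. One
    candidate is placed `length` angstroms from the ring center along each
    of the rays through CD2, NE2, CE1, and ND1.
    """
    g = centroid(list(ring_atoms.values()))
    candidates = []
    for name in ("CD2", "NE2", "CE1", "ND1"):
        d = [a - b for a, b in zip(ring_atoms[name], g)]
        norm = math.sqrt(sum(c * c for c in d))
        candidates.append(tuple(gc + length * c / norm for gc, c in zip(g, d)))
    return candidates


def hh_site(ring1, ring2):
    """Closest pair between the two candidate sets; return their average."""
    pair = min(product(ring_candidates(ring1), ring_candidates(ring2)),
               key=lambda p: math.dist(p[0], p[1]))
    return centroid(list(pair))


def vote(candidates):
    """Strategy e: of three candidate locations, average the closest two."""
    pair = min(combinations(candidates, 2),
               key=lambda p: math.dist(p[0], p[1]))
    return centroid(list(pair))


print(vote([(0, 0, 0), (1, 0, 0), (10, 0, 0)]))  # (0.5, 0.0, 0.0)
```

The same `centroid` helper covers strategy (d), which averages three HH-derived locations, and strategy (f), which averages the four coordinating atoms. In the voting step, the outlying third candidate is simply discarded, so a single misoriented cysteine does not shift the predicted position.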

Calculation of hydrophobic profiles
The zinc ion location is used as the center of the sphere to calculate the hydrophobicity contrast function values (C) and mean atomic solvation parameter values (Δσ) 48. For each identified zinc ion location, a series of 21 radii ranging from 2 Å to 7 Å, with a step size of 0.25 Å (2, 2.25, 2.5, ..., 7), is chosen to generate hydrophobicity contrast curves (Fig. S9a, c) and mean atomic solvation parameter curves (Fig. S9b, d). The hydrophobic profiles are used not only in calculating the certainty score for each predicted zinc ion, but also as parameters in the ensemble model.

Verification of candidate zinc binding sites
Predicted candidate zinc sites are subject to different verification strategies for the CH2 versus CH3/CH4 groups. For zinc binding sites from the CH2 group, the structural characteristics of each ligand residue and the hydrophilic characteristics of amino acids within a radius of 7 Å from zinc ions are used to construct an ensemble model for further verification of the candidate zinc sites. For zinc binding sites from the CH3/CH4 groups, the Pearson correlation coefficient is used to evaluate the similarity in hydrophilic characteristics between the predicted and experimental binding sites, contributing to the further verification of the candidate zinc sites. Because different strategies and verification methods are applied to distinct site types, we refer to the approach as a hybrid learning system.
The prediction of the CH3 and CH4 groups of zinc binding sites is generally straightforward since most zinc ions adopt a typical tetrahedral conformation. We use the proximal interaction network of 3 or 4 CH residues as a strong signal to procure a candidate list of zinc binding sites. The prediction accuracy can easily reach 85% or higher with geometric restrictions on amino acid type, atom type, and coordination bond distance (Supplemental Table S2). The hydrophobicity profile is used for further processing and evaluation, analyzing the values of hydrophobicity contrast functions (C) and atomic solvation parameters (Δσ) (Fig. S9c, d). The certainty score of the predicted site is determined by calculating the Pearson correlation coefficient between the C and Δσ curves of the predicted site and the corresponding curves obtained from the experimental site. A certainty score higher than 0.5 is used as the criterion to further verify the identity of the zinc binding site. The calculated certainty score is annotated in the occupancy field of each atom record for the zinc ion in the output coordinate file.
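The certainty score for CH3/CH4 sites can be sketched as a Pearson correlation over the 21-point profiles. The text does not say how the C and Δσ correlations are combined into one score, so the sketch below assumes a simple mean of the two coefficients. The function names and the way the reference curves are supplied are hypothetical.

```python
import math

# 21 radii from 2 to 7 angstroms in 0.25 steps, as described in the text.
RADII = [2.0 + 0.25 * i for i in range(21)]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a flat curve carries no shape information
    return cov / (sx * sy)


def certainty_score(pred_c, pred_dsigma, ref_c, ref_dsigma):
    """ASSUMPTION: mean of the C-curve and delta-sigma-curve correlations."""
    return 0.5 * (pearson(pred_c, ref_c) + pearson(pred_dsigma, ref_dsigma))


def is_verified(score, cutoff=0.5):
    """Cutoff of 0.5 as quoted in the text."""
    return score > cutoff
```

Because the Pearson coefficient depends only on the shape of each curve and not on its absolute scale, the score measures whether the predicted site's hydrophobic environment varies with radius in the same way as an experimentally observed site.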
The prediction of the CH2 group of zinc binding sites requires a more sophisticated ensemble model to achieve optimal prediction accuracy. Predictors used in the ensemble model can generally be categorized as ligand type, geometrical parameters, and hydrophobic profiles (Supplemental  8). The ensemble model is built from five base learners encompassing both machine learning and deep learning learners to prevent potential underfitting or overfitting due to the use of a single algorithm. The four machine learners are logistic regression (LR), decision tree (DT), multilayer perceptron (MLP), and support vector classifier (SVC), while the deep learner is a fully connected neural network (FCNN) architecture implemented with the Keras library. Individual predictors using the five different algorithms are trained with 10x cross-validation to pick the optimal parameters. Results of the five base learners are combined to form a strong learner ensemble model based on a majority voting method (3 or more out of 5) using an in-house script. The ensemble model performs classification to distinguish between zinc and non-zinc binding sites and outputs a probability value for each site as a certainty score.
The calculated certainty score based on the hydrophobic profile is then annotated in the occupancy field of each atom record for the zinc ion in the output coordinate file.
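The majority voting step (3 or more out of 5) can be sketched as follows. The in-house script is not shown in the text, so everything here is an assumption made for illustration: the function name, the 0.5 per-learner decision threshold, and the use of the mean base-learner probability as the reported certainty.

```python
def ensemble_vote(base_probs, threshold=0.5, min_votes=3):
    """Combine zinc-class probabilities from the five base learners.

    base_probs: one probability per base learner (LR, DT, MLP, SVC, FCNN).
    Returns (is_zinc, certainty). The site is called zinc when at least
    `min_votes` learners exceed `threshold`; the certainty is the mean
    probability, which is an ASSUMPTION and not the documented PMM behavior.
    """
    if not base_probs:
        raise ValueError("at least one base learner probability is required")
    votes = sum(p >= threshold for p in base_probs)
    certainty = sum(base_probs) / len(base_probs)
    return votes >= min_votes, certainty


print(ensemble_vote([0.9, 0.8, 0.6, 0.2, 0.1])[0])  # True
print(ensemble_vote([0.9, 0.9, 0.4, 0.4, 0.4])[0])  # False
```

The second call shows why the vote and the averaged probability are kept separate in this sketch: two very confident learners raise the mean to 0.6, yet the site is still rejected because only two of five learners vote for zinc.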

Web service implementation
The PMM web server is deployed using an Ubuntu Linux virtual machine running Nginx 1.
Figure 6 PMM web prediction flow chart.
Bm and Be represent the B factor of the metal (m) or environment (e), while Om and Oe represent the occupancy of the metal (m) or environment (e). Each of the three validation parameters Qc, Qv, and Qe ranges from 0 to 1, with 1 indicating the best quality and 0 indicating the worst quality.

Figures

Figure 1 Work

Table 1
17M is compared with representative predictors from each of the three categories with PMM in more detail, including Category I predictors ZincBindDB, znMachine, CHED; Category II predictors Metal3D, AlphaFill; and Category III predictors GRE4Zn, TEMSP. For an apples-to-apples comparison, the same TP and FN definitions and the corresponding datasets used in Metal3D and TEMSP are also used to evaluate PMM. When comparing PMM, ZincBindDB, GRE4Zn, TEMSP, and CHED, the evaluation uses a dataset comprising 136 experimentally determined zinc binding sites derived from 100 protein structures 17. While these data are excluded from the training set of the PMM algorithm to eliminate biases, PMM still identifies 129 out of the total 136 actual zinc binding sites using the same 0.5 IoUR cutoff, achieving a sensitivity/recall value of 94.9%, which notably exceeds the sensitivities predicted by ZincBindDB
Table S7). Ligand types, including coordinating amino acid residue names (C or H) and coordinating atom names (Sγ, Cδ2, Nε2, Cε1, Nδ1), are enumerated using one-hot encoding. Geometrical parameters are numeric values including coordinating atom distance, Cα distance, Cβ distance, and four angles representing the relative positions and orientations of the Cα and Cβ atoms. Hydrophobic profiles feature 21 hydrophobicity contrast function values (C) and 21 mean atomic solvation parameter values (Δσ). Compiling the three categories of data results in a total of 61 predictors used for further model training. After excluding multi-conformational sites, a total of 4,151 experimentally determined sites from the CH2 group are used as the positive dataset, including 134 CC, 3,495 HH, and 570 CH sites. To obtain a negative dataset, the potential zinc binding sites predicted in the first step are screened for the absence of another metal ion within 4 Å of the site and for conformity to the criterion of either Qc < 0.355 or Qv < 0.25. A total of 2,246 sites from the CH2 group that fail one of the validation criteria are used as the negative dataset, including 108 CC, 1,543 HH, and 547 CH sites. The data are stratified according to the CH group and split with 70% of the data as the training set and the remaining 30% as the test set to evaluate the effect of the classification model (Supplemental Table