VTR: an algorithm for identifying analogous contacts on protein 1 structures and their complexes 2

12 Evolutionarily related proteins can present similar structures but very dissimilar sequences. 13 Hence, understanding the role of the inter-residues contacts for the protein structure has been 14 the target of many studies. Contacts comprise non-covalent interactions, which are essential to 15 stabilize macromolecular structures such as proteins. Here we show VTR, a new method for 16 the detection of analogous contacts in protein pairs. VTR performs structural alignment 17 between proteins and detects interactions that occur in similar regions. To evaluate our tool, 18 we proposed three case studies: (i) we compared a vertebrate myoglobin and a truncated 19 invertebrate hemoglobin; (ii) analyzed interactions between the spike protein RBD of SARS- 20 CoV-2 and the cell receptor ACE2; and (iii) compared a glucose-tolerant and a non-tolerant β- 21 glucosidase enzyme used for biofuel production. The case studies demonstrate the potential of 22 VTR for the understanding of functional similarities between distantly sequence-related 23 proteins, as well as the exploration of important drug targets and rational design of enzymes 24 for industrial applications. We envision VTR as a promising tool for understanding differences 25 and similarities between homologous proteins with similar 3D structures but different 26 sequences. VTR is available at http://bioinfo.dcc.ufmg.br/vtr. 27 electrostatic surface by Poisson-Boltzmann analysis.


31
Proteins are macromolecules responsible for most functions in living beings, such as transport, 32 immune protection, control growth, and so on 1 . The function of a protein is directly related to 33 their three-dimensional structure. Previous studies have demonstrated that structural changes 34 are directly related to sequence changes 2 . Even small modifications, such as mutations, 35 insertions, or deletions, can change the structure 3-5 . However, evolutionarily related proteins 36 may present similar structures but different sequences 6 . Sequence alignments can 37 unambiguously distinguish similar and non-similar structures when the identity is upper to 38 40%. Even sequences with identities between 20-35% may generate false negatives for 39 homology identification 7 . Also, studies have reported similarities in structures with sequence 40 identity lower than 20% 2 . Until now, the understanding of how the polypeptide sequences fold 41 into a particular three-dimensional shape after synthesis remains a mystery 8,9 . It has motivated 42 the search for computational algorithms to predict protein structures from their sequences 10  interface. Also, VTR determines the statistics of differences between amino acids in contact.

91 92
Case studies 93 To evaluate the VTR tool, we proposed three pairwise comparative case studies: (i) eukaryotic 94 myoglobin and a truncated non-vertebrate hemoglobin chain; (ii) interactions among the cell 95 receptor ACE2 and both SARS-CoV-2 and SARS-CoV; and (iii) glucose-tolerant and non-96 tolerant β-glucosidases. Contact maps generated by VTR demonstrate similarities in the contact 97 pattern between protein pairs in each case study. For instance, Figure 2A is more similar to 98 Figure 2B, Figure 2C to D, and Figure 2E to F. We aim with these case studies to demonstrate 99 a simple use for VTR (case study 1) and show the tool's effectiveness in finding shared and 100 unshared contacts in two systems already described in the literature (case study 2). In the third 101 case study, we propose a real use of VTR at detecting possible mutation sites for enhancing 102 enzymes with biotechnological applications. 103

107
Bgl1B. For C and D, we show only contact maps for chain A. We consider all contacts for determination 108 of the contact maps, including contacts between atoms of the main-chain of closer residues (such as 109 those present in alpha-helices).

Case study 1: Myoglobin and hemoglobin 112
In the first case study, we performed the contacts alignment between a sperm whale myoglobin 113 (PDB ID: 1a6m) and a truncated single-chain hemoglobin from Paramecium caudatum ciliated 114 protozoan (PDB ID: 1dlw). Myoglobins and hemoglobins are both oxygen-binding proteins, 115 belonging to the widespread and distantly related globin family 25 . For this case study, the 116 evaluated myoglobin structure (1a6m) was at the oxy state (i.e., oxygen bound), while the 117 hemoglobin structure (1dlw) was at the de-oxy one (i.e., oxygen unbound). 1a6m presents a 118 sequence length of 151 amino acids, while 1dlw, as a typical non-vertebrate truncated 119 hemoglobin chain, has a minor amino acid content, with just 116 residues. Both proteins 120 present a similar folding but a sequence identity of only 18%. The literature has described that 121 structures of homologous sequences with an identity lower than 20% may present large 122 structural differences 2 . However, the discrepancy about this typical strong relationship 123 between sequence and folding similarity for the globin family has long been known. In fact, 124 globins form a family substantially conserved in structural topology, despite the distant 125 sequence relationship, being this one of the higher conundrums in biochemistry 3,25 . Hence, we 126 believe these proteins to be potential targets for comparison of analogous contacts using VTR. 127 After processing both structures with default parameters, VTR detected the following 128 contacts for 1a6m: 85 hydrogen bonds, 81 attractive interactions, 32 repulsive interactions, nine 129 aromatic stacking, and 364 hydrophobic interactions. We found the following contacts in 1dlw: 130 65 hydrogen bonds, 20 attractive interactions, one repulsive interaction, two aromatic stacking, 131 and 221 hydrophobic interactions. However, we obtained only 13 main contact matches in 132 analogous positions using a 2 Å AVD cutoff (Table S2). The contact matches increase to 60 133 when considering conserved contact matches of hydrophobic type. 134 From the contacts predicted in analogous positions, we found 12 of different types 135 (Figure 3; Table S2-4), such as V21-V26 and Q13-N43 ( Figure 3A). For 1a6m, VTR detected 8 an attractive contact between H36 and E38. However, it detected a hydrogen bond between 137 D26 and T28 in analogous positions of 1dlw ( Figure 3B). On the other hand, in 1a6m, I75 138 performs hydrophobic interactions with L86, while F48 performs the same type of interaction 139 with V109 ( Figure 3C). Interestingly, I75 and L86 are located at a distance of 11 amino acids, 140 while F48 and V109 are at a distance of 61 positions. This highlights the VTR's ability to detect 141 contacts even in different sequence positions. Most of the contact matches are located at 142 compatible distances of 2 to 4 amino acids, such as K102-F106 and A76-T80 ( Figure 3D), 143 L104-S108 and F78-I82 ( Figure 3E), L89-H93 and L64-H68 ( Figure 3F). Only the hydrogen 144 bond detected in the contact L89-H93 of 1a6m was considered a match with Q13-N43. This 145 may suggest that both contacts present similar importance for the stability of both proteins. 146

153
It is essential to highlight that for this analysis, we did not consider the interactions 154 between atoms of the main-chain of closer residues. We did it to reduce the number of contacts 155 detected in alpha-helices, which can explain the low number of hydrogen bonds conserved. 156 Enabling the advanced option "detection of structural contacts in α-helices", the number of 157 hydrogen bond contact matches increases to 129. This option is disabled by default because 158 VTR desires highlight conserved interactions with the side-chain atoms. Also, enabling this 159 option increases the computational cost once that all contacts of similar secondary structures 160 in close regions will be computed. interaction of SARS-CoV-2 with the receptor. It was noted that most of these residues are 172 highly conserved and shared with SARS-CoV. Here, we verified the ability of VTR to find the 173 known contacts between different chains of both structures. We compared the structures of 174 SARS-CoV-2 spike RBD (PDB ID 6m0j) and SARS-CoV spike RBD (PDB ID 2ajf), both in 175 complex with the ACE2 receptor. We maintain the standard 2 Å AVD cutoff, and we use chain 176 Compared to the Lan et al. 27 study, the VTR could find all 16 atomic contacts between 178 SARS-CoV and ACE2 (3 salt bridges and 13 hydrogen bonds). VTR could also find 15 contacts 179 between SARS-CoV-2 and ACE2 (2 salt bridges and 13 hydrogen bonds). Of these, VTR 180 calculated and presented which contacts were shared between receptor and SARS-CoV or 181 SARS-CoV-2. We detected all contacts described in 27 (Table 1). 182 Table 1. Analogous contacts between proteins with PDB entries SARS-CoV (6m0j) and 2ajf. R1:    (Table 1). Although the hydrophobic 189 interaction between Y41 (CE2) and T487 (CG2) was not described in 27 , the hydrogen bond between Y41 and T487 was presented (Table 1; lines 1 and 2). VTR also detected a permuted 191 interaction between atoms of D38 and Y449 (6m0j) and D38 and Y436 (2ajf) ( Table 1; lines 3  192   and 7). Moreover, the interactions between G354 (O) and G502 (N) of 6m0j and their 193 equivalents in 2ajf appear to be probably detected due to the cutoff-based strategy (Table 1;

204
For SARS-CoV-2, VTR detected a hydrogen bond between H34 and Y453 that seems 205 to be equivalent to H34 and Y440 in SARS-CoV ( Figure 4A). The same applies to the following contact pairs: E37-Y505 (6M0J) and E37-Y491 (2AJF) ( Figure 4B); F28-Y489 207 (6M0J) and F28-436 (2AJF) ( Figure 4C); and Q42-Y449 (6M0J) and Q42-Y436 (2AJF) 208 ( Figure 4D). It is important to highlight that besides the interaction between E37 and Y505 209 ( Figure 4B), the residue E37 possibly interacts with R403 in 6M0J, but there is no equivalent 210 contact in 2AJF. Also, Lan et al. 27 described an interaction between Y83 (OH) and Y489 (OH) 211 of 6M0J. Besides this interaction, VTR also pointed out that Y489 (OH) interacts with the 212 main-chain atoms of F28. It demonstrates that the cutoff-based strategy used by VTR also has 213 advantages when compared to more restrictive methods. This possible interaction could, in 214 future studies, be better comprehended through molecular dynamics experiments. 215 We expected that Bgl1B's mutant present characteristics similar to Bgl1A (i.e., characteristics 250 that could lead to glucose tolerance). As a control, we modeled a Bgl1A's mutant that we 251 expected present characteristics like Bgl1B (i.e., negative characteristics for biofuel 252 production). Then, we inspected the mesoscopic influence of such single amino acid 253 modification at the regional electrostatic surface by Poisson-Boltzmann analysis. 254

280
Residues 41 and 57 are depicted as spheres. Regions whose electrostatic potential is affected by the 281 H57D or D57D permutations are depicted as "x" for Bgl1A or "*" for Bgl1B.

283
Indeed, all the topological context around loop A in Bgl1B accumulates a substantial 284 amount of negative electrical potential (red region in Figure 6B). On the other hand, Bgl1A 285 presents a surface with a more neutral electrical potential (blue and white regions in Figure  286 6A). While the D41-D57 contact appears to be a hot spot of negative charge accumulation in 287

Vibrational modulation of loop A 300
The representativity of the eigenvectors (Supplementary Figure S1) demonstrates the 301 differences between the PCA recovered fluctuation between loops A of Bgl1A and Bgl1B. 302 Bgl1A fluctuation is more homogeneously represented by the two first eigenvectors (PCs) of 303 loop A, with a fractional eigenvalue of 63% for the PC1 and 29% for PC2. On the other hand, 304 the fluctuations at the Bgl1B's loop A are strongly located at PC1, that alone accounts for 82% 305 of the sum of all eigenvalues. This can indicate that a single kind of movement, with a 306 substantial fugacity from the middle structure at the minima, account for most of the vibratory 307 mobility of Bgl1B's loop A. This is consonant with a local instability caused by the highly 308 repulsive environment at this loop ( Figure 6B). 309 In Figure S1, we can see higher proximity among the lines that represent wild Bgl1B, For the D41-D57 repulsive contact, the vibrational movement of both acids is 330 unbalanced, with one moving more or faster than the other ("*" in Figure S1 D-E). On the other 331 hand, the movement involving the D41-H57 contact is more balanced (mainly for Bgl1A's 332 context), indicating a more equilibrated vibration again. 333

Implications for protein engineering 334
The intensity of some of these modifications or changes at the rest of loop A seems to be 335 context-dependent. This agrees with the fact that the permutation of the entire loop A between 336 Bgl1A and Bgl1B was able to revert the glucose tolerant/inhibited behavior, but with poor 337 results for local single amino acid substitutions 36 . Despite this, a single substitution detected 338 by VTR was able to promote vibrational changes with the same pattern at almost half of the 339 extension of this functionally crucial loop. 340 All these mentioned facts illustrate a potential use for this tool. If properly combined 341 with electrostatic and mobility patterns detection techniques, such as the APBS and molecular 342 mechanics/PCA here employed, VTR can be useful for rational protein engineering. The 343 integrated use of VTR with computational or experimental tools can be promising to find the 344 topologically minimum modifications that need to be carried at proteins to introduce some 345 desirable characteristics found in other homologous.

Structural superposition 359
VTR performs structural alignments between protein pairs using the default parameters of the 360 TM-align algorithm (https://zhanglab.ccmb.med.umich.edu/TM-align). TM-align receives as 361 input two PDB files and returns the coordinates of a superimposed structure. TM-align will 362 return the best alignment possible, and VTR will return a warning informing if contact matches 363 could not be found. 364

Contacts computation 366
We compute the Euclidean distance among all-atom coordinates using in-house Python scripts. 367 The tool identifies five types of possible interactions: hydrophobic, attractive/repulsive, 368 hydrogen bonds, aromatic stacking, and disulfide bonds. A pair of residues are in contact if any 369 atom pair meets the cutoff ranges presented in Table 2. 370

372
Contact must occur between atoms of two different residues. The atoms involved are 373 described in Table S1. For the detection of aromatic stacking, we determined the centroids of 374 the rings (cutoff 2-6Å). For phenylalanine and tyrosine, we calculate the median coordinate 375 between the atoms CG and CZ for determining the ring centroid. For tryptophan, we used the 376 median coordinate between the atoms CD2 and CE2. For histidine, we determined the median 377 coordinate of the atoms CG, ND1, CE1, NE2, and CD2. 378 By default, VTR ignores contacts between atoms of the main-chain of neighborhood 379 residues (until four positions) to remove contacts that compose the structures of α-helices. 380 However, through the web tool, users can enable the detection of these contacts (increase 381 processing time). 382

Search for contact matches 383
We used in-house scripts to compare the contacts in analogous positions. We proposed the 384 AVD (average vector distance) metric to measure the distance between contacts and detect 385 contacts matches. AVD is calculated by equation (1): 386 where P represents the contact between atoms p1 and p2 of protein A; Q represents the contact 389 between the atoms q1 and q2 of protein B; and D is a function that returns the Euclidean 390 distance between atomic coordinates. To detect a contact match between P and Q, the 391 AVD(P, Q) should be the lowest value. We determine the VTR score based on equation (2) Herein, we presented VTR, a novel approach with a visual web interface that can be used to 425 analyze, compare, and scrutinize analogous contacts in protein pairs. We explored the tool 426 features with three case studies, where we demonstrated the potential of VTR to shed some 427 light on the mechanisms of topology conservation on phylogenetically related but sequentially 428 distant proteins. We also evaluated contact similarities and dissimilarities on pharmacologically relevant targets. We suggested the use of our tool to the rational design of 430 proteins with biotechnological applications. Concerning the second case study, both the 431 confirmation of contacts already reported in the literature and the finding of four hydrogen 432 bond matches still not described are promising finds for the future rational design of anti-Sars-433 Cov-2 drugs. In the last case study, we compared a glucose-tolerant beta-glucosidase enzyme 434 with a non-tolerant one. VTR detected several changes in contact types of analogous positions. 435 Called our attention a change in an attractive contact by a repulsive: D41-H57 of Bgl1A that 436 corresponds to D41-D57 of Bgl1B. We explored the importance of this contact using molecular 437 mechanics minimization and vibrational inspection by principal component analysis.    1 (a-b), 2 (c-d), and 3 (e-f). Each point represents a contact.