A database of calculated solution parameters for the recently released AlphaFold predicted protein structures

Recent spectacular advances by AI programs in 3D structure prediction from protein sequences have revolutionized the field in terms of accuracy and speed. The resulting "folding frenzy" has already produced predicted protein structure databases for the entire human proteome and those of other organisms. However, rapidly ascertaining a predicted structure's reliability based on measured properties in solution should be considered. Shape-sensitive hydrodynamic parameters such as the diffusion and sedimentation coefficients and the intrinsic viscosity ([η]) can provide a rapid assessment of the overall structure likeliness, and SAXS would yield the structure-related pair-wise distance distribution function p(r) vs. r. Using the extensively validated UltraScan SOlution MOdeler (US-SOMO) suite, we have calculated from the AlphaFold structures the corresponding hydrodynamic parameters, p(r) vs. r distributions, and other parameters. Circular dichroism spectra were also computed. The resulting US-SOMO-AF database should aid in rapidly evaluating the consistency in solution of AlphaFold-predicted protein structures.

The Anfinsen dogma, that a protein's sequence dictates its three-dimensional (3D) structure, was postulated nearly fifty years ago 1. It set in motion a quest to find methods to reliably and accurately predict 3D protein structures from their sequence, which became even more important with the full sequencing of the human and other genomes (see https://www.ncbi.nlm.nih.gov/genome). Recent spectacular advances in 3D structure prediction from protein sequences by Artificial Intelligence (AI) programs such as AlphaFold (AF) and RoseTTAFold appear to have revolutionized the field in terms of accuracy and speed 2,3. Boosted by their success in predicting structures to near (and sometimes even better than) crystallographic accuracy, the AlphaFold consortium (https://alphafold.ebi.ac.uk) has already made publicly available a series of databases of predicted protein structures for the entire human proteome and those of several other organisms 4. However, these AI programs have not tackled the folding issue from a thermodynamic/mechanistic approach, but rather by combining many different observations in a deep learning process 5. Apart from simple cases of highly homologous sequences, or clearly recognized folding classes, we believe that rapidly ascertaining the degree of confidence of a predicted structure, based on a few measured properties in solution, should become a necessary step. For instance, besides known occurrences of multi-chain proteins, determining a molecular mass M in solution can immediately verify the protein oligomerization state and prompt the need for further modeling. On a different level, circular dichroism (CD) spectroscopy, possible on very small sample amounts 6, would permit a rapid check of the actual secondary structure content of a predicted 3D structure.
Particularly useful for known single-chain proteins in the AF databases, shape-sensitive hydrodynamic parameters such as the translational diffusion and sedimentation coefficients and the intrinsic viscosity ([η]) could provide a robust assessment of the overall fold likeliness. These measurements, requiring little material and with a reasonably quick turnaround, are usually accessible in most research endeavors, especially in core facilities where analytical ultracentrifugation 7,8, multi-angle static and dynamic light scattering (MALS and DLS) coupled to size-exclusion chromatography (SEC) 9,10 or directly on plate readers 11, and SEC-coupled

Database generation and website implementation
The steps leading to the implementation of the US-SOMO-AF database are fully described in the Methods and Online Methods sections. Briefly, each entry in the entire AF database was first compared with the corresponding entry in the UniProt database to find the (putative) signal peptide regions, which were subsequently removed from the AF PDB files. Potential disulfides were identified (allowing a better evaluation of the partial specific volume v and of M) and written as SSBOND records in the cured PDBs, together with HELIX and SHEET information identified using the DSSP 25 implementation in UCSF Chimera 26. Batch-mode US-SOMO was used to compute the hydrodynamic parameters, the derived Stokes' radius R_s, [η], R_g, the maximum extensions along the principal X, Y and Z axes of the molecule, and to generate the p(r) vs. r distributions (normalized by the M of the structure). SESCA was used to generate 170-270 nm CD spectra.

[...] and US-SOMO/SESCA computations were performed appear in their corresponding fields, and in between the "Mean confidence" field reports the calculated mean % per-residue confidence, based on the values present in the AF-generated PDB file. The following ten fields report the US-SOMO computed parameters. Since the hydrodynamic parameters were computed with the statistically-based ZENO method 27,28 [...] and M, it could also be used to compute an experimental M from SAXS data 16. The bottom two entries report the per-residue % of α-helix and β-sheet as calculated from HELIX and SHEET fields in the cured PDB. They could be compared with CD-derived values, besides comparing experimental and calculated spectra (see below). External links for the current entry to both UniProt and AlphaFold websites are placed after the parameters listings. Cured PDB- and mmCIF-formatted files for the entry can be retrieved from the provided hyperlinks, as well as text files with the p(r) vs.
r distribution and CD spectrum, and a csv-formatted file containing all the identifying information and the single-value parameters. All these files can also be retrieved as single compressed files (zip or tar.xz). Below these hyperlinks, the computed p(r) vs. r distribution and CD spectrum graphs are presented, followed by a JSmol (https://sourceforge.net/projects/jsmol) representation of the structure (see Fig. 2). Controls for the visualization and copying as an image of both graphs are provided. JSmol commands are also available to change the representation and export it. The default representation colors the structure according to the per-residue confidence level (red, lowest; blue, highest), but for a more in-depth analysis we refer the user to the original AF website. In the end, parameters for a total of 365,198 structures were generated from the AF databases (which include multiple predicted segments for certain sequences), and are stored in the freely accessible US-SOMO-AF database.

[...] prediction can be seen in Fig. 3c, becoming, however, much less defined when the confidence level goes below 50%. Fig. 3d shows in 3D how combining two parameters, R_s and [η], can effectively yield an increased discriminatory power. Another important parameter is R_g, but it can rarely be determined by MALS techniques, which have a lower detection limit of ~10-11 nm. While SAXS can determine R_g, it can also be used to derive the p(r) vs. r distribution 16, which contains more information and can be directly compared with the one computed from the structure. Note that the effect of not taking into account the hydration water in the computation of the p(r) vs. r distribution is relatively minor, and its importance decreases as M increases. Therefore, plots involving R_g are not presented here, but could be easily generated from the Supplementary Data 1 spreadsheet.
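To make the relation between a structure, its p(r) vs. r distribution and R_g concrete, the following is a minimal sketch in plain Python. It is not the US-SOMO implementation: every point is given equal weight (an assumption; a SAXS-oriented calculation would weight each pair by scattering contrast), and the function names are hypothetical.

```python
import math
from itertools import combinations

def pair_distances(coords):
    """All N(N-1)/2 distances between points given as (x, y, z) tuples."""
    return [math.dist(a, b) for a, b in combinations(coords, 2)]

def pr_histogram(coords, bin_width=1.0):
    """Unweighted p(r) vs. r: number of pairs falling in each distance bin."""
    dists = pair_distances(coords)
    n_bins = int(max(dists) // bin_width) + 1
    counts = [0] * n_bins
    for d in dists:
        counts[int(d // bin_width)] += 1
    r_mid = [(i + 0.5) * bin_width for i in range(n_bins)]  # bin centres
    return r_mid, counts

def rg_from_coords(coords):
    """Radius of gyration as the RMS distance from the centroid (equal weights)."""
    n = len(coords)
    centroid = tuple(sum(p[k] for p in coords) / n for k in range(3))
    return math.sqrt(sum(math.dist(p, centroid) ** 2 for p in coords) / n)

def rg_from_pairs(coords):
    """Same R_g recovered from pair distances: R_g^2 = sum_{i<j} d_ij^2 / N^2."""
    n = len(coords)
    return math.sqrt(sum(d * d for d in pair_distances(coords)) / n ** 2)

# Toy example: three points forming a 3-4-5 right triangle (arbitrary units)
toy = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
r_mid, counts = pr_histogram(toy, bin_width=1.0)
```

The two R_g routines agree because the pair distances carry the same information as the distances from the centroid, which illustrates why the full p(r) vs. r distribution contains more information than R_g alone.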
In Table 1, we have listed 14 entries chosen from the 41,200 mentioned above. They were initially selected to represent intervals from 2.2 to 0.66 in the computed R_g/R_s ratio, indicating deviation from globular shape (R_g/R_s ~0.7 for a sphere). A suitable range of [η] values was also sought, as well as a good representation of the organisms present in the AF databases, the presence or absence of a signal peptide, and some spread in the mean % confidence. M, R_g, R_s, and [η] were chosen as the calculated parameters, and the entries are ordered by decreasing M. Connected to Table 1 is Fig. 4, which displays snapshots of the 3D structures for each entry colored according to the per-residue confidence level, followed by the p(r) vs. r and CD plots.

Fig. 4. [...] Table 1, together with the calculated p(r) vs. r and CD plots.

Table 1 and Fig. 4 provide an insightful glimpse into the great variety of predicted structures and their associated calculated parameters, suggesting that performing some of these checks can indeed boost, or question, their reliability. As expected, CD spectra display differences between most structures, and they are a robust check on the predicted secondary structure content. The variability in [η] values in Table 1 appears to confirm its discriminating power above that of R_s, but clearly it is the p(r) vs. r distribution that would provide the best test, although it is the least rapidly experimentally accessible parameter among those considered.
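The reference value quoted for a sphere can be checked with a short derivation, sketched here under the assumption of an ideal, uniform solid sphere of radius R whose Stokes' radius equals its geometric radius (no hydration layer, an idealization not stated in the text):

```latex
R_g^2 = \frac{\int_0^R r^2 \, 4\pi r^2 \, dr}{\int_0^R 4\pi r^2 \, dr}
      = \frac{R^5/5}{R^3/3} = \frac{3}{5} R^2
\quad\Longrightarrow\quad
\frac{R_g}{R_s} = \sqrt{\frac{3}{5}} \approx 0.775 \qquad (R_s = R)
```

Ratios well above this value point to elongated or extended shapes, in line with the use of R_g/R_s as an indicator of deviation from globular shape.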
To provide an additional measure of the predictive power of the hydrodynamic parameters and of the p(r) vs. r distribution, we have selected the O88338 Cadherin-16 from M. musculus structure (see Table 1 and Fig. 4) [...] the AF-predicted O88338 structure. Thus, even for such a restricted structural variation, comparing experimental and calculated parameters can provide reliable tests of the predicted structures.
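As an illustration of such a comparison, the sketch below converts a measured translational diffusion coefficient into a Stokes' radius through the standard Stokes-Einstein relation, R_s = k_B T / (6 π η_0 D), and reports the percent deviation from a computed value. The function names are hypothetical, and the default solvent conditions (water at 20 °C, viscosity 1.002 mPa·s) are assumptions that should be replaced by the actual experimental conditions.

```python
import math

K_B = 1.380649e-23  # Boltzmann constant, J/K

def stokes_radius_nm(d_cm2_per_s, temp_k=293.15, viscosity_pa_s=1.002e-3):
    """Stokes' radius (nm) from a translational diffusion coefficient (cm^2/s).

    Defaults assume water at 20 degrees C; both are assumptions of this sketch.
    """
    d_m2_per_s = d_cm2_per_s * 1e-4  # cm^2/s -> m^2/s
    rs_m = K_B * temp_k / (6.0 * math.pi * viscosity_pa_s * d_m2_per_s)
    return rs_m * 1e9  # m -> nm

def percent_deviation(experimental, computed):
    """Signed deviation of an experimental value from the computed one, in %."""
    return 100.0 * (experimental - computed) / computed

# Example: D = 1.0e-6 cm^2/s corresponds to a Stokes' radius of about 2.14 nm
rs_exp = stokes_radius_nm(1.0e-6)
```

What deviation should count as a disagreement depends on the experimental uncertainty of the technique used, so no threshold is suggested here.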

Discussion
We have presented here a new database stemming from the AlphaFold predicted protein structures database. It contains calculated hydrodynamic and structural parameters whose experimental determination should be within the reach of scientists working with a particular protein for which a "hard" structure is either currently unavailable or in the making. Indeed, it is interesting to note that crystallographers and cryo-electron microscopists are already suggesting using AF-predicted structures to solve experimental structures by molecular replacement methods 33. Performing some rapid tests and comparing the results with those we provide in the US-SOMO-AF database could save them valuable time and perhaps hint at twists that should be applied to a predicted structure to better fit the X-ray, cryo-EM, and NMR data. In this respect, we would like to point out a tool present in the US-SOMO program that allows one to color-code a visualized structure based on the contribution of residues to a particular set of distances in a p(r) vs. r distribution 34. For instance, this could provide an easier identification of domains that under- or over-contribute to that set of distances. This is another reason why we chose to produce p(r) vs. r distributions instead of simulated SAXS intensity vs. scattering vector curves, for which a wide variety of methods, often quite computationally intensive, exist 35. More in-depth analyses could be subsequently performed on a case-by-case basis.

For a more general use, assessing the reliability of a predicted structure could lead to better designed function/structure relationship experiments. The availability of the US-SOMO-AF database has the distinctive advantage of allowing a quick comparison without the need to master the expertise necessary to soundly calculate the relevant solution parameters.

There are, of course, a series of drawbacks associated with these computations.
First and foremost, all the AF predicted structures consider all proteins as single-chain entities. Efforts are apparently underway (see ref. 5) to cope with this issue by allowing multi-chain predictions, and when an evolution in that sense appears in the AF database, we can re-calculate all parameters for a new set. A second evident drawback resides in the post-translational modifications that many proteins undergo. None were considered by the AF team, and we have just scratched the surface by removing the signal peptides. The most important modification affecting the calculated parameters is glycosylation (e.g., see Table 1 in ref. 36). While UniProt provides a list of potential glycosylation sites for entries, and publications describing them when available, there is no direct way to obtain the composition of each carbohydrate associated with a particular site. This is a pity, as methods for building complex carbohydrates are already available and/or under development (see ref. 37), and it should be relatively straightforward to automatically add them at the appropriate sites. Indeed, this has just been independently advocated in a very recent letter to this journal 38. Even in the absence of time-consuming molecular dynamics minimization steps, this simple addition could increase the reliability of calculated hydrodynamic and structural parameters. While we hope that such an important step will be taken at the UniProt database level, users who need to refine the calculations on a predicted structure after having manually added any prosthetic group can easily do so by using one of the downloadable (http://somo.aucsolutions.com) US-SOMO versions.

The third drawback is the handling of flexibility, especially if large unstructured parts are predicted. Here the US-SOMO-AF database can only raise red flags, such as very high predicted [η] values associated with visualized extended, unstructured parts.
Dealing with these issues requires much longer calculations involving either Monte Carlo methods or Brownian dynamics simulations (see ref. 39), which would require a major effort to be applied systematically to >365,000 structures.

All considered, we believe that the publicly available (https://somo.genapp.rocks) US-SOMO-AF database described here will become a useful tool allowing the research community, by comparing one or more experimentally determined parameters with the corresponding computed ones, to quickly evaluate the compatibility in solution of an AlphaFold-predicted protein structure.

Methods
Production of the results presented in this paper required five major steps: collecting the AlphaFold entries and additional metadata; preparing the structures for hydrodynamic, structural and CD calculations; computing the hydrodynamic, structural and CD properties; building a database containing the hydrodynamic properties and additional metadata; and finally building a website allowing users convenient access to the database.

After downloading the AlphaFold database, we prepared the structures by removing the signal peptide regions, where present, identified from the UniProt website. We utilized US-SOMO 21,22,23 to compute hydrodynamic and structural properties. The US-SOMO suite uses a bead modeling strategy which takes into account the theoretical amount of bound hydration water, and the ZENO computational algorithm 27,28,29 was employed to calculate the hydrodynamic parameters in a rigid-body frame. US-SOMO was also used to compute the p(r) vs. r distribution on non-hydrated structures, using SAXS-related parameters. To compute the CD spectra, we used SESCA 18. All the computed results were collected and inserted into a database. Full descriptions for all these steps can be found in the Supplementary Methods section.
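The structure preparation step can be illustrated with a minimal sketch, which is not the pipeline code used for the database. It assumes that the last residue number of the signal peptide has already been obtained from the UniProt annotation, and it relies on two facts about the input files: PDB ATOM records use fixed columns, and AlphaFold stores its per-residue confidence in the B-factor column. All function names, including the `atom_line` helper that builds demonstration records, are hypothetical.

```python
def atom_line(serial, name, res_name, res_seq, b_factor):
    """Build a fixed-column PDB ATOM record (demo helper; coordinates set to 0)."""
    return (f"ATOM  {serial:5d}  {name:<3s} {res_name:>3s} A{res_seq:4d}    "
            f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{b_factor:6.2f}")

def strip_signal_peptide(pdb_lines, signal_end):
    """Drop ATOM/HETATM records whose residue number is <= signal_end.

    Other record types are passed through unchanged.
    """
    kept = []
    for line in pdb_lines:
        # Residue sequence number occupies columns 23-26 (slice 22:26)
        if line.startswith(("ATOM", "HETATM")) and int(line[22:26]) <= signal_end:
            continue
        kept.append(line)
    return kept

def mean_confidence(pdb_lines):
    """Mean per-residue confidence, read from the B-factor column of CA atoms.

    Atom name is in columns 13-16 (slice 12:16), B-factor in 61-66 (slice 60:66).
    """
    values = [float(line[60:66]) for line in pdb_lines
              if line.startswith("ATOM") and line[12:16].strip() == "CA"]
    return sum(values) / len(values)

# Demo: four residues, the first two treated as a (hypothetical) signal peptide
demo = [atom_line(i, "CA", "ALA", i, b)
        for i, b in [(1, 30.0), (2, 40.0), (3, 90.0), (4, 80.0)]]
mature = strip_signal_peptide(demo, signal_end=2)
```

In this toy case, removing the low-confidence N-terminal residues raises the mean confidence of the remaining chain, which shows why the order of the two operations matters when the value is reported.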