Rotamer-free protein sequence design based on deep learning and self-consistency

Several previously proposed deep learning methods to design amino acid sequences that autonomously fold into a given protein backbone yielded promising results in computational tests but did not outperform conventional energy function-based methods in wet experiments. Here we present the ABACUS-R method, which uses an encoder–decoder network trained using a multitask learning strategy to predict the sidechain type of a central residue from its three-dimensional local environment, which includes, besides other features, the types but not the conformations of the surrounding sidechains. This eliminates the need to reconstruct and optimize sidechain structures, and drastically simplifies the sequence design process. Thus, iteratively applying the encoder–decoder to different central residues can produce self-consistent overall sequences for a target backbone. Results of wet experiments, including five structures solved by X-ray crystallography, show that ABACUS-R outperforms state-of-the-art energy function-based methods in success rate and design precision.

A deep learning method for protein sequence design on given backbones, ABACUS-R, is proposed in this study. ABACUS-R shows improved performance compared with conventional energy function-based methods in wet experiments.


Articles
NATURe CompUTATioNAl SCieNCe

three-dimensional protein structure, and an iterative process (Fig. 1b) of applying this encoder–decoder to update the sidechain type of each residue in a given backbone to obtain self-consistent overall sequences (starting from random initial sequences). Taking the two parts together, the overall approach is effectively a backbone-based sequence generator, or an encoder that encodes backbone structures into amino acid sequences (see the Methods for more detailed descriptions of the model, discussions of design decisions and computational costs).
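The iterative self-consistency process can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' code: `predict_logits` is a hypothetical stand-in for the trained encoder–decoder (returning per-type logits for one position given the current sequence context), and the subset fraction and iteration count are arbitrary choices.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def design_sequence(backbone_len, predict_logits, n_iters=200, subset_frac=0.5, seed=0):
    """Iteratively update residue types until the sequence is self-consistent.

    `predict_logits(seq, pos)` stands in for the trained encoder-decoder: it
    returns a dict mapping each amino acid type to a logit for position `pos`
    given the current sequence (the sequence-dependent local environment).
    """
    rng = random.Random(seed)
    # Start from a random initial sequence, as in the ABACUS-R protocol.
    seq = [rng.choice(AMINO_ACIDS) for _ in range(backbone_len)]
    best_seq, best_score = list(seq), float("inf")
    for _ in range(n_iters):
        # Update a randomly chosen subset of residues in each iteration.
        subset = rng.sample(range(backbone_len), max(1, int(subset_frac * backbone_len)))
        for pos in subset:
            logits = predict_logits(seq, pos)
            seq[pos] = max(logits, key=logits.get)  # greedy: most probable type
        # Track the sequence with the lowest total -logit as the design result.
        score = -sum(predict_logits(seq, p)[seq[p]] for p in range(backbone_len))
        if score < best_score:
            best_score, best_seq = score, list(seq)
    return "".join(best_seq)
```

Because each update is conditioned on the current types of the surrounding residues, repeated sweeps drive the sequence toward a fixed point in which every residue is (near-)optimal in the context of all the others.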

Accuracy of the encoder-decoder for individual residues.
A set of non-redundant Protein Data Bank (PDB) structures has been used to train the encoder–decoder network (see Methods). We have learned two groups of network parameters by splitting the PDB structures into training and test sets in two different ways. The first group of parameters (Model_eval) was learned by using about 95% of the structures for training and the remaining 5% for testing, with the structures for testing belonging to single-domain topology classes (according to the CATH 4.2 classification of protein structures [36]) in the dataset, so that none of the test structures belonged to the same CATH topology as a training structure; Model_eval can therefore be used for unbiased computational evaluations. The second group of parameters (Model_final) was learned by randomly splitting the protein structures into roughly 95% for training and 5% for testing, disregarding their CATH structure classification. As Model_final was trained on more complete data, it is expected to be applied in future real applications; Model_final was therefore used to design the experimentally examined sequences.
The encoder–decoder has been trained by multitask learning [37], in which several attributes of the central residue besides its sidechain type have been used as decoding objectives. The decoding accuracies of Model_eval for the various attributes are reported in Fig. 2a–c and Supplementary Fig. 1. Similar results for Model_final are reported in Supplementary Table 1.
Both Model_eval and Model_final are able to recover the sidechain type of the central residue with an accuracy of around 50%. Moreover, for a substantial fraction of the decoding results that did not recover the exact native sidechain types, the decoder produces sidechain types physicochemically similar to the native types (for example, lysine for arginine, phenylalanine for tyrosine and so on; see Fig. 2b). Considering attributes other than the sidechain type in training substantially improved the accuracy of decoding the sidechain type (Fig. 2a, from 45.3% to 49.3% for Model_eval, and Supplementary Table 1, from 47.2% to 53% for Model_final), probably by compensating for 'noise' in the training (natural) sequences (that is, the native sidechain types are not always the optimal choices for the given backbone).

For the categorical sidechain type attribute, the actual output of the decoder is a vector of logits values (see Methods). To examine whether the negatives of the logits values could be interpreted as effective energies for different residue types in a specific local three-dimensional environment, we chose a dataset of protein mutants with experimentally measured changes in protein stability (the ProTherm dataset [38]), and examined the correlation between the stability changes (as measured by ΔΔG) of different mutants and the corresponding −ΔΔlogits values (see Methods); the results are shown in Fig. 2d. Supplementary Table 2 shows that the correlation coefficients of around 0.52 for ABACUS-R are comparable with the results of 0.39–0.59 produced by a variety of models (those models producing higher correlation coefficients than ABACUS-R have been trained on the specific task of predicting ΔΔG). Supplementary Fig. 2 also shows that ABACUS-R produces results comparable to earlier methods [30] for another dataset of protein stability versus sequence variations [11].
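As a minimal illustration of reading negative logits as effective energies, a single-site pseudo-ΔΔG for a point mutation can be taken as the drop in the decoder logit at the mutated position. This is a hedged simplification: the full procedure described in the Methods also accounts for the changed environments of neighboring residues, and `pseudo_ddg` is a hypothetical helper, not part of ABACUS-R.

```python
def pseudo_ddg(logits_wt_env, wildtype_aa, mutant_aa):
    """Read the decoder's logits as negative effective energies.

    `logits_wt_env` maps amino acid types to the decoder's logits for the
    mutated position in its wild-type environment.  A mutant whose logit is
    lower than the wild type's receives a positive pseudo-energy change,
    analogous to a destabilizing ddG.  (Single-site simplification: the
    actual ABACUS-R evaluation also sums contributions from the environments
    of neighboring residues.)
    """
    return logits_wt_env[wildtype_aa] - logits_wt_env[mutant_aa]
```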
Convergence of the sequence design iterations. We have applied Model_eval and self-consistent iterations to design overall sequences for 100 target structures taken from the test set of Model_eval. These target structures cover three main CATH classes (Supplementary Table 3).

[Fig. 1 | An overview of the ABACUS-R method. a, The framework of the encoder–decoder model. The input of the encoder includes backbone-only structure features and the sidechain types of surrounding residues (residue neighborhood modeling). The encoder is a transformer of 12 blocks, the output of which constitutes a general, integrated representation of the input features (local environment representation). To benefit from multitask learning, this representation is decoded into several different attributes (including the sidechain type) of the central residue (decoding central node). b, The iterative approach for designing self-consistent overall sequences. In each iteration, the encoder–decoder network is applied to each residue in a randomly chosen subset of residues to update its sidechain type according to its current, sequence-dependent local environment.]

For each target, ten sequences have been designed using runs starting from different random initial sequences. As the iterative approach is effectively a greedy algorithm to maximize the (predicted) probabilities of the sidechain types, we monitored the negative logarithms of these probabilities (the −logP values) during the iterations. Figure 3a shows how the averaged per-residue −logP value decreased and converged to a plateau. Meanwhile, the sidechain types of most residues converged towards those of the final sequences (Fig. 3b). For all of the target structures, the iterative runs can produce self-consistent sequences that either completely converge or retain only a very small (target-dependent) number of fluctuating positions (see the inset of Fig. 3b and Supplementary Fig. 3a).
If not all positions converge, the sequence with the lowest total −logP value recorded during a run is taken as the final design result. This choice is for convenience and is not critical, as both the number of unconverged positions and the fluctuations of −logP are very small during the plateau stage (see the insets of Fig. 3a,b), suggesting that all of the sequences at this stage are equally acceptable.
For the same target structure, runs starting from different random initial sequences usually lead to highly similar (although not identical) sequences (sequence identities from 0.76 to 0.89 for the 100 test targets), with the per-residue −logits values varying only narrowly (that is, within ±0.50); thus, the sequences designed by different runs can be considered equally plausible.
For the 100 test targets, the self-consistent design results obtained by using Model_final are highly similar to those obtained by using Model_eval in terms of −logits values and sequence identities (Supplementary Fig. 4). Thus, the ABACUS-R design results do not seem to be sensitive to which individual structures have been included in or excluded from the training data, as the underlying network only models local statistical distributions, to which each individual training structure should contribute only a very small fraction.
Designed sequences compared with native sequences. For the 100 test targets, the average identity between the designed sequences and the corresponding native sequences is 43.1% ± 5.4% (this identity varied from 32% to 61% across individual targets). Although the native sidechain type recovery rate does not exhibit a strong dependency on the CATH class (Fig. 3c), this recovery rate has the expected tendency of decreasing with increasing solvent accessibility (see Supplementary Fig. 3b). The Pearson correlation coefficient between the amino acid type compositions of the designed and the native sequences is 0.93 (Fig. 3d). Nevertheless, some sidechain types such as glutamic acid, alanine and lysine have been used more frequently in the designed than in the native sequences, while sidechain types with markedly reduced usage in the designed sequences include glutamine, histidine and methionine (Fig. 3d).
We have also compared the native sidechain type recovery rate of ABACUS-R with those of related methods. The results are summarized in Supplementary Table 4 (for single-residue redesign) and Supplementary Tables 5 and 6 (for complete sequence redesign). Comparisons on the same groups of target backbones indicate that ABACUS-R produces higher native recovery rates than the model of Ingraham et al. [29], 3DCNN [17] and ProteinSolver [30], and lower rates than two more recent models [39,40].
We note that the native sidechain recovery rate metric, although very useful, has severe limitations for judging a sequence design method: first, for a substantial portion of residues in natural sequences, the native sidechain types may not be optimal for the given structures; and second, a few incorrectly designed residues in a sequence with a high native recovery rate could disrupt protein folding (and lead to experimental failure).
The designed sequences and the native sequences have been further compared in terms of their per-residue Rosetta energies (computed after backbone relaxation) [8,41]. For the same target backbone, the Rosetta energies of the designed and the native sequences are similar (Fig. 3e). We also applied the statistical energy function-based method ABACUS2 to design sequences for the 100 test backbones and considered their Rosetta energies and ABACUS-R −logits values (Fig. 3e,f). Although the ABACUS2 sequences have systematically lower Rosetta energies than both the native and the ABACUS-R sequences, they have systematically higher −logits than the other two groups of sequences. These results indicate that the −logits metric has constituents that are orthogonal to the Rosetta energy. This partial orthogonality is also exemplified by the moderate correlations between the per-residue −logits values and the Rosetta energies of the designed and the native sequences (the respective Pearson correlation coefficients are 0.24 and 0.17; see Supplementary Fig. 3c).
We applied ABACUS-R to the four target backbones considered in the 3DCNN work [17]. The native sidechain type recovery rates of ABACUS-R and 3DCNN are compared in Supplementary Table 6. Although Supplementary Table 7 shows that the ABACUS-R −logP values of the 3DCNN sequences are clearly above those of the native sequences, which in turn are clearly above the −logP values of the ABACUS-R sequences, Supplementary Fig. 5 shows that the groups of native residues retained by ABACUS-R and 3DCNN largely overlap, and that these residues, as well as the residue-wise variations of the −logP values among the various sequences, are dispersed throughout the entire sequence.
Structure prediction on designed sequences. We applied AlphaFold2 [28] to predict the structures of all of the 1,000 sequences designed for the 100 test targets. The distributions of the template modeling scores (TM-scores) for aligning the predicted structures with the respective design targets [42], and of the predicted local distance difference test scores (pLDDT scores) [28] of the predicted structures, are shown in Supplementary Fig. 6. The predictions have been performed without using multiple sequence alignment information, and in two modes [28]: one disallowing the use of any structural templates (template-off) and another allowing the use of structural templates (template-on). For comparison, we also applied AlphaFold2 to the native sequences in the template-off mode. In this mode, the agreements between the predicted and the target structures are poor for both the native and the designed sequences, with only a few TM-score values above 0.6 (Supplementary Fig. 6a). We note that none of the low-TM-score predictions has pLDDT scores high enough to be considered reliable (Supplementary Fig. 6b); thus, at least for the 100 targets considered here, the template-off AlphaFold2 predictions are not useful for judging sequence–structure compatibility.
In the template-on mode, 811 of the 1,000 structures predicted for the designed sequences agree well with the respective design targets, with TM-scores above 0.6 (and mostly around 0.9). The pLDDT scores of these high-TM-score predictions are also high (Supplementary Fig. 6a), indicating high sequence–structure compatibility. For the twelve target structures for which all of the designed sequences failed to produce predicted structures with TM-scores above 0.4, we inspected the structures and found that most of them had been determined as parts of larger complexes (the structures of these targets are shown in Supplementary Fig. 6c). These targets may therefore be stabilized either by special interactions (for example, coordination with multiple metal ions) or by interactions with other macromolecules, and thus could be out of reach of a general structure prediction or sequence design method. Supplementary Fig. 5c shows the AlphaFold2 prediction results for the ABACUS-R-designed and the 3DCNN-designed sequences for the four target backbones considered by 3DCNN [17]. Although all of the predicted results have TM-scores above 0.6, the pLDDT scores of the ABACUS-R sequences are more concentrated in the high-score region than those of the more varied 3DCNN sequences.
Although ABACUS-R has been learned from natural proteins, we expect it to be applicable to designed backbones, as the core network of ABACUS-R encodes the dependence of sidechain type on local rather than global three-dimensional environments. Learned with statistically sufficient samples, such a local dependence should be applicable to backbones globally distant from the training proteins. We have examined the generalizability of ABACUS-R to designed backbones by considering several backbones of novel, non-natural overall architectures [43]. The obtained sequences are far from any known natural protein sequences according to HHblits searches [44] (Supplementary Table 8). Applying AlphaFold2 to these sequences leads to predicted structures of high quality (as indicated by high pLDDT scores) and in close agreement with the respective design targets (as indicated by high TM-scores; Supplementary Fig. 7).

[Table 1 summary rows: total number of designs that are soluble and well-folded based on (at least) 1H NMR: 49; total number of designs with HSQC spectrum measured: 6; total number of designs with solved crystal structures: 5 (RMSD: 0.51–0.88 Å). Table footnote: From left to right, the columns correspond to the PDB IDs of the target backbones; the average −logits per residue of the designed sequences; the average identities between native and designed sequences; the average identities between designed sequences; the numbers of examined/expressed/soluble proteins; the numbers of proteins for which crystallization was attempted/crystals were obtained; the numbers of crystal structures solved; and the numbers of proteins for which NMR HSQC spectra were measured. For each target backbone, the first and second rows correspond to designs examined in the first and second experimental batches, respectively. Each of the bottom four rows summarizes the total number of proteins in a certain group as described in the text. Note that the group of proteins with solved crystal structures does not overlap with the group of proteins with HSQC measured. RMSD, root mean square deviation of Cα atom positions of crystal structures from design models.]

We note that passing the AlphaFold2 structure prediction test cannot guarantee that a designed sequence will actually fold as expected. As AlphaFold2 was shown to be unable to correctly predict the disruption of folded protein structures by single mutations [45], a designed sequence that contains only a few folding-disrupting residues (and thus fails to fold in wet experiments) may still pass this computational test.
Experimental analysis of designed sequences. Only a few studies so far have reported experimental examinations of protein sequences designed on the basis of deep learning [17,30]. Proteins designed by 3DCNN [17] and ProteinSolver [30] were shown to have the desired secondary structure contents and to fold cooperatively according to circular dichroism signatures. So far, the only experimentally solved atomic structures for de novo sequences designed using a deep learning method were for two sequences designed for a de novo TIM-barrel by 3DCNN [17]. That backbone consisted of structural elements likely to be far more regular or ideal than those in natural backbones; thus, when it comes to de novo sequence design for (natural) backbones that abound with diverse, non-ideal structure elements, we are not aware of previous validations of a deep learning method by experimentally solved atomic structures.
Here we examined the sequences designed by ABACUS-R for three natural backbones using wet experiments. These backbones (PDB IDs: 1r26, 1cy5 and 1ubq) were chosen because they had been used to experimentally evaluate the ABACUS model [9,10,22,23]. To assess the method in an unbiased way, the sequences automatically generated by ABACUS-R have been used as is; that is, without any post-design selection or adjustment.
Two batches of wet experiments have been carried out to examine proteins designed using two different protocols. The first protocol employs the self-consistent iterations described above to maximally converge the sidechain types of all residues. Although this protocol effectively minimizes the −logits values, the designed sequences are highly similar to each other (the identities between different sequences designed for the same backbone are around 80%; see Table 1 and Supplementary Table 9). In the second protocol, the sidechain types are sampled from distributions derived from the output of the decoders (see Methods), generating more diverse sequences from different sequence design runs, albeit with somewhat higher −logits values (see Table 1 and Supplementary Table 10).
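The second protocol replaces the greedy update with a stochastic one. A minimal sketch, assuming the sampling distribution is simply the softmax of the decoder logits (the `temperature` knob is an illustrative assumption, not a parameter stated in the paper):

```python
import math
import random

def sample_residue(logits, temperature=1.0, rng=random):
    """Sample a sidechain type from the softmax of the decoder logits.

    Replacing the greedy argmax with this sampling step (as in the paper's
    second design protocol) yields more diverse sequences at the cost of
    somewhat higher -logits.  `temperature` is an assumed knob: below 1.0 it
    sharpens the distribution toward the greedy choice.
    """
    types = list(logits)
    scaled = [logits[t] / temperature for t in types]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]
    return rng.choices(types, weights=probs, k=1)[0]
```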
Among the 27 sequences examined in the first batch, 26 led to successful protein expression in Escherichia coli (E. coli). All of the expressed proteins could be readily purified in soluble form. The purified proteins have been subjected to a range of experimental analyses, including size-exclusion chromatography, 1H NMR spectrum measurement, NMR heteronuclear single quantum coherence (HSQC) spectrum measurement, differential scanning calorimetry (DSC), and protein crystallization and X-ray structure determination (Table 1). Complete data from these experiments are reported in Supplementary Figs. 8-11. Representative results for three designed sequences (one for each backbone target) are shown in Fig. 4a-c. The size-exclusion chromatography results and the 1H NMR spectra cover all of the purified proteins in the first batch (Supplementary Figs. 8-10 and Supplementary Table 11). They indicate that all of these proteins are monomers and are likely to fold into well-defined three-dimensional structures. Five proteins covering the three target backbones have been measured by DSC experiments, which showed that all of these proteins unfold cooperatively at high temperatures, from 97 to 117 °C (Supplementary Fig. 11). High-resolution structures of four proteins, including three designed for the 1r26 backbone and one designed for the 1cy5 backbone, have been solved by X-ray crystallography. All of the solved structures agree with the corresponding design targets with high precision (the RMSDs of Cα atoms ranged from 0.51 to 0.88 Å; see Supplementary Figs. 8 and 9). For proteins designed on the 1ubq backbone, we only obtained crystals for one protein (1ubq-A4), which were twinned, and no X-ray structure could be solved; however, the NMR HSQC spectrum and the DSC results of 1ubq-A4 (Fig. 4c) prove that this protein folds into a well-defined three-dimensional structure. Taking the experimental data together with the AlphaFold2 prediction results (Fig. 4c; see also Supplementary Table 12, which shows that the predicted structures have high TM-scores relative to 1ubq as well as high pLDDT scores), we expect the proteins designed for the 1ubq backbone to also fold into the target backbone with high precision.
In the second batch of experiments, we examined proteins obtained by using the modified protocol that designs more diverse sequences. Table 1 shows that the identities between the different sequences designed for the same target are around 58%. Among the 30 designed proteins analyzed, 25 led to successful protein expression in E. coli, and 23 could be purified in soluble form. The purified proteins have been subjected to the aforementioned experimental analyses (Table 1 and Supplementary Table 13). The results are reported in Supplementary Figs. 11-15. Again, the size-exclusion chromatography and NMR data (Supplementary Figs. 12-15) suggest that all of these proteins are monomeric and fold into well-defined three-dimensional structures. The DSC curves for five proteins covering the three target backbones indicate that these proteins unfold cooperatively at high temperatures, from 85 to 118 °C (see Supplementary Fig. 11). One high-resolution structure of a second-batch protein has been solved (1r26-B4), whose Cα RMSD from the 1r26 target backbone is 0.67 Å (Supplementary Fig. 12); thus, the modified protocol leads to a somewhat decreased but still acceptable success rate. Moreover, those successfully designed proteins are still very stable, although their unfolding temperatures seem to be somewhat lower than those of the proteins designed using the maximum convergence protocol for the same targets (see the DSC curves of the first- and second-batch proteins in Supplementary Fig. 11; note that there are exceptions to this observation).

Closer inspection of the crystal structures indicates that ABACUS-R is able to design alternative sidechain combinations that form equally well-packed structures. As examples, Fig. 5a,b show how a group of core residues with different sidechain type combinations are packed in native 1r26 and in the crystal structures of the four ABACUS-R-designed proteins; Fig. 5c shows the structural details of some of the sidechains that form (non-native) inter-residue polar interactions in the structure of the designed 1cy5-A7 (in Supplementary Fig. 16, the crystal structures of the native 1r26 and of the designed 1r26 proteins are compared in similar detail). The ABACUS-R model has therefore learned how to find sidechain type combinations with good packing qualities without explicitly going through atomic sidechain coordinates. As shown in Supplementary Fig. 17, the inter-residue (pseudo)coevolution information derived using Direct Coupling Analysis [46] on ABACUS-R-designed sequences is correlated with inter-residue distances, similar to the coevolution information from natural sequences.

Discussion
A number of previous studies have investigated deep learning-based approaches to inverse protein folding [17,29-34]. Among the various models, the encoder–decoder part of ABACUS-R is conceptually most similar to those using deep learning networks to predict the sidechain type of a central residue on the basis of its local environment [17,29].
Compared with earlier methods, several features of ABACUS-R are unique and may be important for overcoming some of the main difficulties in solving the inverse folding problem. Notably, ABACUS-R is rotamer-free and does not require the explicit modeling of sidechain conformations. This eliminates the use of sidechains approximately reconstructed from discrete rotamers [47] or sampled from continuous distributions [17]. Although using atomically detailed sidechains could be advantageous for certain molecular modeling tasks, for de novo sequence design accurate coordinates of sidechain atoms are not available from the input. When sidechains have to be approximately reconstructed and fed to a model trained on experimentally determined atomic structures, the overall accuracy of the method may degrade.
Another unique feature of ABACUS-R is that complete sequences are designed through self-consistent iterations. This is possible because sidechains are not explicitly modeled. Should sidechain conformations have to be constructed and updated, the number of degrees of freedom would be substantially increased and the energy landscape would become too rugged for the iterative updating algorithm to converge meaningfully. Moreover, on a rugged, high-dimensional energy landscape, it can be hard to produce a sequence that is free of structure-disrupting local interactions. We note that such design errors may not much affect computational metrics such as native recovery rates or even AlphaFold2 prediction results [45]. However, they will considerably degrade the success rates (or robustness) in wet experiments. Another property of the self-consistent iterative approach is that the sidechain type of every residue is repeatedly updated according to the evolving environmental contexts. This is different from a 'one-pass' sequence generation process, in which any improperly chosen residue in an early generation step will be accumulated and propagated in subsequent steps, which can lead to an exponentially decaying probability of generating truly foldable overall sequences with increasing sequence length [48].
An encouraging finding of the current study is that, in experimental tests on the three natural backbones, ABACUS-R exhibited much higher success rates and design precision than those reported for the energy function-based models ABACUS [9,10] and RosettaDesign [8,14]. For example, more than half of the original ABACUS designs for these three targets could not be expressed or purified; only after introducing mutations through directed evolution were structures of ABACUS designs for the 1r26 and 1cy5 targets successfully solved [9,22]. Marin and colleagues [14] experimentally examined proteins designed for eight different backbones of thioredoxin proteins (the 1r26 target used here is also a thioredoxin), and reported that in the most favorable case (the target backbone for which RosettaDesign exhibited the highest success rate), only three out of eight experimentally examined designed proteins could be purified as monomers. In an earlier study by these authors, the crystal structure of one protein designed by Rosetta for a thioredoxin backbone had been determined, with RMSDs of 1.8 to 2.0 Å from the target backbone [13]. There could be multiple reasons for the improved performance of ABACUS-R over energy function-based methods. One is that, as a deep learning method, ABACUS-R avoids the use of linear combinations of energy terms. A second reason could be that ABACUS-R does not explicitly use sidechain structures, and thus does not suffer from the incompatibility between inaccurately reconstructed positions of sidechain atoms (for example, by using rotamers) and the high sensitivity of some of the energy terms (for example, the van der Waals repulsion terms) to atomic positions.
There are still several limitations to our approach. One is that the current decoder predicts the sidechain torsional angles with only moderate accuracy (Fig. 2a and Supplementary Fig. 1); thus, atomic models of sidechain structures, if they are needed after sequence selection, have to be constructed with other tools. Another limitation is that ABACUS-R by itself does not take into account requirements on sidechain types for properties other than the protein structure, such as protein function. In principle, restraints on sidechain type choices for protein function or other desired properties may be derived before sequence selection from other considerations or analyses (for example, the prediction of functionally important protein residues from evolutionary information), and then imposed during ABACUS-R sequence selection. The application and validation by wet experiments of ABACUS-R in sequence design for functional proteins remains to be carried out in future studies.

Methods
Modeling the three-dimensional local environment. For the encoder–decoder network of ABACUS-R (see Fig. 1a), the input comprises the sidechain types and three-dimensional backbone structure information of all of the structural neighbors of a central residue. We emphasize that the sidechain type of the central residue is not used as input; neither are the sidechain conformations of the neighboring residues.
The set of residues that forms the three-dimensional local environment of a central residue is defined as its k-nearest-neighbor residues according to the Cα-Cα distances. As input to the encoder, we have considered the following three types of features for each neighboring (surrounding) residue: its location and orientation relative to the central residue (X_SPA), its sequence position relative to the central residue (X_RSP), and its sidechain type (X_AA). The raw inputs for the different types of information are defined as below.
The components of X_SPA are defined using (1) the Cα-Cα distance from the environment residue to the central residue; (2) the Cartesian coordinates of the environment residue's Cα in a local coordinate frame defined using the backbone N, Cα and C atoms of the central residue (in this frame, Cα is at the origin, the Cα-N direction is along the x axis, and the N-Cα-C plane corresponds to the x-y plane); and (3) the rigid body rotation that aligns the orientations of the two local coordinate frames of the central and the environment residues. The Cα-Cα distance is mapped to a vector of 16 components using a set of Gaussian radial basis functions centered around 16 distance values ranging from 0 to 20 Å; the Cartesian coordinates are used as is; the rigid body rotation is mapped to a three-dimensional vector whose direction corresponds to the axis of the rotation and whose length corresponds to the magnitude of the rotation angle. After these mappings, X_SPA for each residue is a vector of 22 real-valued components.
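The X_SPA construction above can be sketched in numpy. This is an illustrative sketch: the exact RBF widths are an assumption (the text states only 16 centers spanning 0-20 Å), and the local frames are taken as precomputed (origin, rotation matrix) pairs rather than derived here from N, Cα and C coordinates.

```python
import numpy as np

def x_spa(ca_env, frame_central, frame_env, n_rbf=16, d_max=20.0):
    """Build the 22-dimensional X_SPA feature for one environment residue.

    `frame_central` and `frame_env` are (origin, R) pairs: the Ca position
    and a 3x3 rotation matrix for the local N-Ca-C frame of each residue.
    RBF widths (set to the center spacing) are an assumption.
    """
    origin_c, R_c = frame_central
    origin_e, R_e = frame_env
    # (1) Ca-Ca distance, expanded over 16 Gaussian radial basis functions.
    d = np.linalg.norm(ca_env - origin_c)
    centers = np.linspace(0.0, d_max, n_rbf)
    width = centers[1] - centers[0]
    rbf = np.exp(-((d - centers) ** 2) / (2 * width ** 2))
    # (2) Environment residue's Ca coordinates in the central residue's frame.
    local_xyz = R_c.T @ (ca_env - origin_c)
    # (3) Relative rotation between the two frames, as an axis-angle vector
    #     whose direction is the rotation axis and whose length is the angle.
    R_rel = R_c.T @ R_e
    angle = np.arccos(np.clip((np.trace(R_rel) - 1.0) / 2.0, -1.0, 1.0))
    if np.isclose(angle, 0.0):
        axis_angle = np.zeros(3)
    else:
        axis = np.array([R_rel[2, 1] - R_rel[1, 2],
                         R_rel[0, 2] - R_rel[2, 0],
                         R_rel[1, 0] - R_rel[0, 1]])
        axis_angle = angle * axis / np.linalg.norm(axis)
    return np.concatenate([rbf, local_xyz, axis_angle])  # 16 + 3 + 3 = 22
```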
The components of X_RSP are defined based on the difference between the position indices of the neighboring residue and the central residue, that is, Δi = i − i_central. X_RSP is a one-hot vector that encodes 129 states: each integer value of Δi from −64 to +63 corresponds to one of 128 states, and the remaining state encodes that the corresponding neighboring residue is not on the same peptide chain as the central residue.
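A minimal sketch of this one-hot encoding (clamping out-of-range offsets to the boundary bins is an assumption; the paper does not state how offsets beyond ±64 are handled):

```python
import numpy as np

def x_rsp(delta_i, same_chain=True):
    """One-hot X_RSP over 129 states: 128 states for delta_i in [-64, +63]
    (clamped at the boundaries, an assumption), plus one state for residues
    on a different peptide chain than the central residue."""
    v = np.zeros(129)
    if not same_chain:
        v[128] = 1.0
    else:
        v[int(np.clip(delta_i, -64, 63)) + 64] = 1.0
    return v
```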

X_AA is a one-hot vector that encodes the 20 sidechain types plus a special MASKED type (for the central residue; see below).
Features of the central residue are mapped to vectors of the same dimensions as those of the neighboring residues, except that all components of its X_SPA vector are zeros and its X_AA value is mapped to the MASKED type. Aside from these, the backbone conformation centered around the central residue (X_BB) is also considered. The components of X_BB are the 15 backbone torsional angles {φ_j, ψ_j, ω_j; j = i − 2, …, i + 2}, in which i is the position index of the central residue. The angles are divided by π to obtain values in [−1, +1). To maintain the same input dimensions for vectors encoding the central and the neighboring residues, the X_BB vectors are also formally considered for all of the neighboring residues, but the values of all of their components are zero.
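The X_BB mapping is a straightforward rescaling; a sketch (with the five-residue φ/ψ/ω window taken as given above) is:

```python
import numpy as np

def x_bb(torsions_rad):
    # 15 backbone torsion angles around the central residue (phi, psi and omega
    # of residues i-2 to i+2), each divided by pi to lie in [-1, +1).
    t = np.asarray(torsions_rad, dtype=float)
    assert t.shape == (15,)
    return t / np.pi
```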
The above input features of the encoder have been defined to include complete information on the relative backbone positions between the central and the surrounding residues while remaining invariant with respect to three-dimensional translations and rotations. Moreover, the transformer architecture of the encoder (see below) is invariant to permutations of the ordering of the neighboring residues.
To produce input for the transformer encoder from the above raw input feature vectors, the one-hot-encoded feature vectors X_AA and X_RSP are linearly transformed, whereas the feature vectors X_SPA and X_BB, which have real-valued components, are transformed using a linear layer. The resulting intermediate vectors are noted, respectively, as E_AA, E_RSP, E_SPA and E_BB, which are simply concatenated together to form the overall input vector E, namely,

E_AA = W_AA X_AA, E_RSP = W_RSP X_RSP, E_SPA = W_SPA X_SPA + b_SPA, E_BB = W_BB X_BB + b_BB,
E = [E_AA; E_RSP; E_SPA; E_BB],

where the matrices W_AA, W_RSP, W_SPA, W_BB and the biasing vectors b_SPA and b_BB comprise trainable elements, which are shared between all residues.
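The construction of the per-residue input vector E can be illustrated as follows. The per-feature embedding width H = 32 is an arbitrary placeholder (the true hidden dimensions are not stated in this excerpt), and random weights stand in for the trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
# Raw feature dimensions from the text: 20 types + MASKED, 129 states, 22, 15.
D_AA, D_RSP, D_SPA, D_BB = 21, 129, 22, 15
H = 32  # hypothetical per-feature embedding width

W_AA  = rng.standard_normal((H, D_AA))
W_RSP = rng.standard_normal((H, D_RSP))
W_SPA, b_SPA = rng.standard_normal((H, D_SPA)), np.zeros(H)
W_BB,  b_BB  = rng.standard_normal((H, D_BB)),  np.zeros(H)

def embed(x_aa, x_rsp, x_spa, x_bb):
    # One-hot features get a plain linear map; real-valued features a linear
    # layer with bias. The four pieces are concatenated into E.
    return np.concatenate([W_AA @ x_aa, W_RSP @ x_rsp,
                           W_SPA @ x_spa + b_SPA, W_BB @ x_bb + b_BB])
```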
Architecture of the transformer-based encoder. We note the set of E vectors for all residues as {E_n; n = 0, 1, 2, …, k}, in which E_0 is the vector for the central residue. This set of vectors is fed to a transformer of 12 blocks, as shown in Fig. 1. Each block of the transformer is composed sequentially of a multihead self-attention module 49, a layer normalization transformation 50, a feed-forward module and a second layer normalization transformation, as formulated below:

H_n^(i) = LN(MHA(F_n^(i−1), {F_m^(i−1); m = 0, …, k}) + F_n^(i−1)),
F_n^(i) = LN(FFN(H_n^(i)) + H_n^(i)),

where F_n^(0) = E_n, the residue index n runs from 0 to k, the block index i runs from 1 to 12, and the MHA(⋅,⋅), LN(⋅) and FFN(⋅) operations refer to the multihead self-attention module, the layer normalization transformation and the feed-forward module, respectively. From the output of the above transformer, the vector F_0^(12) is taken as a general representation of the local environment of the central residue, which is to be decoded into various attributes of the central residue.
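The per-block computation (self-attention, residual, layer normalization, feed-forward, residual, layer normalization) can be sketched in plain NumPy. This illustration uses a single attention head and omits trainable layer-norm gains and biases; the actual model uses eight heads of dimension 128 and PyTorch modules, so all shapes here are illustrative assumptions.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each representation vector to zero mean and unit variance.
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def self_attention(F, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention over the k+1 residues.
    Q, K, V = F @ Wq, F @ Wk, F @ Wv
    A = Q @ K.T / np.sqrt(K.shape[-1])
    A = np.exp(A - A.max(-1, keepdims=True))
    A /= A.sum(-1, keepdims=True)
    return A @ V

def block(F, Wq, Wk, Wv, W1, W2):
    # One post-LN transformer block, matching the order stated in the text:
    # attention + residual + LN, then feed-forward + residual + LN.
    H = layer_norm(self_attention(F, Wq, Wk, Wv) + F)
    return layer_norm(np.maximum(H @ W1, 0.0) @ W2 + H)
```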

The decoders and training losses.
The encoder-decoder network has been trained using a multitask learning strategy 37, in which multiple attributes of the central residue have been simultaneously considered as decoding objectives, including the sidechain type, the secondary structure state, the solvent accessible surface area (SASA), the sidechain torsional angles (that is, χ1 and χ2), and the crystallographic B factor summed over the mainchain atoms. We note that for secondary structure states, both the three-state and the eight-state schemes as used by the DSSP program 51 have been simultaneously considered. For sidechain torsions, we have considered only χ1 and χ2 but not χ3 or χ4, as χ3 or χ4 should be useful for encoding sidechain structures only when associated with correct and accurate χ1 and χ2. It seems that the decoding accuracy for χ1 and χ2 of our current model (see the 'Results' section) cannot guarantee that the additional consideration of χ3 or χ4 would be beneficial.
Our main reason for using multitask learning is that the so-called ground-truth native sidechain type labels in the training natural sequences are actually quite noisy, as the native sidechain types are not always the optimal choices for the given structures. The inductive biases introduced by the auxiliary tasks can help to reduce the encoder-decoder network's Rademacher complexity, or its ability to fit these noises 37. The secondary structure state and SASA auxiliary tasks are expected to introduce useful inductive biases, as these attributes are known to strongly affect sidechain type preferences, and they have been considered as key features in our previous (experimentally validated) ABACUS method 9. The sidechain conformations of the central residues have been considered as auxiliary tasks as they are expected to correlate with the conformations of the surrounding sidechains. Thus they may implicitly help more structural information about the local environment to be retained in the output representation. The B factor has been included for a similar reason.
Different attributes of the central residue are decoded by passing F_0^(12) through different perceptron networks (decoders), each decoder network having a single hidden layer.
For the sidechain type decoder, the output layer contains 20 perceptron nodes; the output values (the logits) are transformed using Softmax to yield normalized probabilities predicted for the 20 sidechain types. Namely, for sidechain type a,

p_a = exp(logits_a) / Σ_{b=1}^{20} exp(logits_b).

The cross-entropy loss function is used to measure the decoding loss, namely, for each training item,

L_AA = −Σ_{a=1}^{20} y_a log p_a,

where a is the sidechain type index, y = (y_1, y_2, ⋯, y_20) is the one-hot vector produced by the ground-truth label (that is, the native sidechain type) and p = (p_1, p_2, ⋯, p_20) is the vector of the predicted probabilities.
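With a one-hot target, the cross-entropy over the 20 types reduces to the negative log probability of the native type, as the following sketch shows:

```python
import numpy as np

def softmax(logits):
    # Numerically stable Softmax over the 20 sidechain-type logits.
    z = np.exp(logits - logits.max())
    return z / z.sum()

def cross_entropy(p, native_type):
    # y is one-hot at the native sidechain type, so the sum over types
    # collapses to -log p_a for the native type a.
    return -np.log(p[native_type])
```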
The decoders and losses for the secondary structure state are defined in the same way as those for the sidechain type. Both the three-state and the eight-state categorization schemes are used at the same time. The corresponding cross-entropy losses are noted as L_SS3 and L_SS8, respectively.
For the numerical attributes, including the SASA and the X-ray B factor, we first normalized the attributes using the means and standard deviations computed from the training data. The corresponding decoders regress the normalized attributes. The losses L_SASA and L_Bfactor are defined as the L1 errors of the regressions.
The total loss for the multitask learning task is given by

L_total = L_AA + λ1 L_SS3 + λ2 L_SS8 + λ3 L_SASA + λ4 L_Bfactor,

where L_AA is the cross-entropy loss for the sidechain type and λ1−4 are the trade-off parameters that balance the different losses. As the final decoding accuracies are not sensitive to the exact values of these parameters (see also the ablation results in Fig. 2a), and also because of the computational costs (training the entire model from scratch takes about 15 h on a single GeForce RTX 3090 GPU), we did not carry out extensive explorations of the various combinations of the trade-off parameters. Instead, we separately and gradually increased the individual λ parameters from small starting values until the decoding accuracy for the corresponding attribute was maximized while the decoding accuracy for the sidechain type did not decrease. The finally used values for λ1−4 are 0.2, 0.2, 0.2 and 0.5, respectively. Supplementary Figs. 18 and 19 show the training (and validation) curves of the various losses. Although all of the losses decreased and gradually plateaued, the sidechain type validation loss plateaued first during training.
The self-consistent iterations. To design complete sequences for a given target backbone, we start from an initial sequence with randomly chosen sidechain types for all residues. In each iteration, we randomly select a number of residues, consider each of them as the central residue, apply the pre-trained encoder-decoder network to its current three-dimensional local environment, and update the sidechain type of the central residue according to the output of the (sidechain type) decoder. Within one iteration, the calculations for all of the selected residues are executed in parallel to take advantage of parallel computation. The initial number of residues selected for parallel sidechain type updating is set to 80% of the sequence length, and this number is gradually reduced in subsequent iterations until the final iterations, in which the residues are considered one by one.
During the iterations, we monitor the total −logP value of the evolving sequence, which is defined as

−logP = −Σ_{i=1}^{L} log p_{i,a_i},

where L is the length of the sequence, i is the residue index, a_i is the index of the sidechain type at position i in the current sequence, and p_{i,a_i} is the probability of the corresponding sidechain type predicted by the encoder-decoder network according to the current local environment of residue i (which depends on the current sidechain types of i's neighboring residues). Please note that these −logP values are not related to the P values considered in common statistical hypothesis tests.
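The iteration scheme and the −logP monitor can be sketched as follows. This is a toy reconstruction, not the published code: `predict_probs` stands in for the trained encoder-decoder, and the schedule is approximated as a linear decay of the updated fraction from 80% to one residue over 600 steps.

```python
import numpy as np

def neg_logP(seq, predict_probs):
    # predict_probs(i, seq) -> length-20 probability vector for position i,
    # computed from the current local environment (a stand-in for the network).
    return -sum(np.log(predict_probs(i, seq)[seq[i]]) for i in range(len(seq)))

def design(L, predict_probs, n_iter=1000, seed=0):
    rng = np.random.default_rng(seed)
    seq = list(rng.integers(0, 20, L))           # random initial sequence
    best, best_seq = np.inf, seq[:]
    for t in range(n_iter):
        # Fraction of residues updated per iteration: 0.8 decaying to 1/L.
        frac = max(0.8 * (1.0 - t / 600.0), 1.0 / L)
        for i in rng.choice(L, max(1, int(frac * L)), replace=False):
            # 'Normal' protocol: pick the maximum-probability sidechain type.
            seq[i] = int(np.argmax(predict_probs(i, seq)))
        score = neg_logP(seq, predict_probs)
        if score < best:
            best, best_seq = score, seq[:]
    return best_seq, best
```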
For simplicity, we have used a fixed number of 1,000 self-consistent iteration steps to perform all of the design runs in this study. It has been retrospectively confirmed that in all of the design runs, the lowest −logP values were always reached within this number of steps (Supplementary Fig. 20a shows the step numbers at which the lowest −logP values are reached versus protein sizes, and the largest of these step numbers is less than 700; see also Fig. 3a,b). As the computational time costs of the design runs are relatively small (designing a sequence for a 350-residue backbone using 1,000 iterations takes only about 150 s on a workstation with a single GeForce RTX 3090 GPU), we did not try to further reduce the computational costs by finding the minimally required total number of iteration steps for individual targets. Supplementary Fig. 20b,c shows that both the time and the memory costs (for a fixed number of iterations) scale linearly with protein size, a useful property for designing proteins of larger sizes.
The schedule schemes, or the portions of residues selected for sidechain type updating in the earlier iteration steps, are not expected to substantially affect the final lowest −logP values or amino acid sequences. This is confirmed by the results in Supplementary Fig. 20d,e, which show that the sequences designed with the default schedule scheme and those designed using an extreme scheme of always updating the residues one by one have (on average) 86% sequence identity and similar −logits (the variations are comparable to the same quantities measured for the various sequences designed using the same schedule scheme but different random starting sequences). We note that the default schedule scheme does indeed increase the convergence rate over the one-by-one scheme (see Supplementary Fig. 20a); however, we did not try to further optimize the schedule scheme, as the computational time cost is only a minor concern for our method.
In generating the sequences for examination by wet experiments, two different protocols have been applied to update the sidechain type of a central residue during the iterations. In the first 'normal' protocol, the sidechain type of the maximum probability (as predicted by the corresponding decoder) is always chosen. This protocol was used to generate the maximally self-consistent overall sequences examined in the first batch of experiments. In the second protocol, the sidechain type is sampled according to a probability distribution derived from the output of the encoder. To control the diversity of the sampled sidechain types, we mapped the original distribution predicted by the decoder into a new distribution using the following formula:

p̃_a = p_a^α / Σ_{b=1}^{20} p_b^α.

The value of the parameter α controls the diversity of the sampled sidechain types: for a very large positive value of α, the first and the second protocols become equivalent; decreasing the value of α increases the diversity of the designed sequences, but it also causes the −logits metric to increase (Supplementary Fig. 21). To design the sequences (which have inter-sequence identities of around 60% for sequences designed for the same target) for examination in the second batch of experiments, we used a value of 1.5 for α, which still led to −logits values below those of the native sequences (Supplementary Fig. 21).
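The behavior of this diversity-controlling map can be sketched as a power-and-renormalize transform (equivalently, a temperature-like rescaling of the logits); the exact published formula is not reproduced in this excerpt, so this form is an assumption consistent with the described limits.

```python
import numpy as np

def sharpen(p, alpha):
    # Raise each predicted probability to the power alpha and renormalize.
    # Large alpha concentrates mass on the argmax (recovering protocol 1);
    # smaller alpha flattens the distribution and increases sequence diversity.
    q = np.asarray(p, dtype=float) ** alpha
    return q / q.sum()
```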

Other implementation and computation details.
The ABACUS-R model has been implemented in Python. The encoder-decoder network has been implemented using the PyTorch package 52. Hyperparameters that have been explored during the optimization of the model include the number of nearest-neighbor residues to be considered (that is, the value of k), the number of blocks of the transformer encoder, the dimensions of all of the internal layers or modules, and the weights of the different losses. The finally used values are 20 for k and 12 for the number of transformer blocks. The part that encodes the local three-dimensional environments has around 151.21 M parameters, whereas the part decoding the information into various features including the sidechain type has 35,840 parameters. In each transformer layer, eight attention heads were used; the dimension of each head is 128. To avoid overfitting during training, the embedded vectors of the input, the attention matrices and the outputs of the multihead attention layers in each block are randomly dropped with a probability of 0.1. No further regularization term was considered. The networks have been trained using the Adam optimizer 53 as implemented in PyTorch with the parameters β1 = 0.9, β2 = 0.99 and with a learning rate gradually decreasing from 10−4 to 10−6. The PDB structures for training and testing are 25,234 X-ray structures selected using the PISCES 54 server with a resolution cutoff of 2.5 Å and a sequence-identity cutoff of 50%. For the default schedule scheme, the initial fraction of residues selected for parallel sidechain type updating is set to 80% of the sequence length, and the fraction linearly decreases to 1/(number of residues) in 600 steps. The total number of parameters of the model is around 151.24 M.
The Rosetta energies have been computed using the RosettaDesign program 8 v.3.12. As ABACUS-R does not explicitly reconstruct sidechain conformations, all sidechains of all of the evaluated sequences (that is, the native sequences and the designed sequences) have been repacked using the FixBB option of Rosetta, and then the complete structures have been relaxed using Rosetta RelaxBB to determine the minimized energies.
Figures of protein structures have been produced using PyMOL version 1.8 (ref. 55).
Experimental methods. DNA sequences of the designed proteins were constructed into the NdeI and XhoI sites of pET-22b(+) by General Biotech and TsingKe Biotech. Plasmids encoding the designs were transformed into E. coli BL21(DE3) cells. Protein expression was induced at OD600 = 0.7 with 1 mM IPTG for 20 h at 16 °C. For the 1H-15N HSQC-NMR study, uniformly 15N-labeled proteins were prepared by growing the bacteria in inorganic medium (24 g l−1 KH2PO4, 5 g l−1 NaOH, 0.5 g l−1 NH4Cl, 2.2 mM MgSO4, 0.1 mM CaCl2 and 2.5 g l−1 glucose) using 15NH4Cl as the isotope source. Cells were harvested and sonicated in buffer containing 20 mM Tris and 500 mM NaCl at pH 7.8. The soluble supernatant was purified by Ni2+ affinity chromatography, concentrated in buffer containing 20 mM Tris, 300 mM NaCl and 1 mM EDTA, pH 7.8, and then subjected to size-exclusion chromatography on a Superdex 75 column with the ÄKTA purifier system (GE Healthcare).
The thermal stability of the designed proteins was evaluated by nano differential scanning calorimetry (nanoDSC) (TA Instruments). The protein samples were prepared in buffer containing 300 mM NaCl, 20 mM Tris and 1 mM EDTA, pH 8.0, with a protein concentration of 3-6 mg ml−1. The protein melting temperatures were monitored in a 0.33 ml cell from 25 to 125 °C at a heating rate of 1 °C min−1. Responses of the proteins were measured by reheating and scanning three times. The nanoDSC scans were background-corrected and analyzed with the Launch NanoAnalyze software.
Crystallization screening was conducted at 289 K by the hanging-drop vapor diffusion method using various screening kits. Purified and concentrated proteins (15-20 mg ml−1) were used for crystallization. The crystals for data collection were obtained in 24-48 h in PEG 1500 for 1r26-A3; in 2.5 M ammonium sulfate and 0.1 M BIS-TRIS propane, pH 7.0, for 1r26-A6; in 2% v/v 1,4-dioxane, 0.1 M BICINE, pH 9.0, and 10% w/v PEG 20,000 for 1r26-A7; and in 27% w/v PEG 2,000 MME and 0.1 M sodium cacodylate, pH 6.5, for 1r26-B4. Crystals of 1cy5-A7 appeared in seven days in buffer containing 12% w/v polyethylene glycol 3,350 and 4% v/v Tacsimate, pH 7.0. All crystals were briefly soaked in reservoir solution supplemented with 30% v/v glycerol and then flash-frozen in liquid nitrogen. The diffraction data were collected on the BL19U1 and BL02U1 beamlines 58 at 0.9785 to 0.9791 Å. Datasets were processed using the HKL2000 program 59 for 1r26-A3, 1r26-A6 and 1r26-A7, and the XDS program 60 for 1cy5-A7 and 1r26-B4. The designed structures served as search models for molecular replacement with MOLREP 61. The final structures were refined with PHENIX 62. The statistics for data collection and structural refinement are summarized in Supplementary Table 14.

Software and code. Protein structures for training the models were downloaded from the Protein Data Bank. The PISCES server (http://dunbrack.fccc.edu/Guoli/pisces_download.php) was used to select non-redundant data with a resolution cutoff of 2.5 Å and a sequence-identity cutoff of 50%. NMR data were collected using Bruker's standard NMR software TopSpin 4.0.9. The X-ray diffraction data were collected on the BL19U1 and BL02U1 beamlines at 0.9785 to 0.9791 Å.
Fig. 2 | Performance of Model_eval in computational tests. a, Accuracy of the encoder-decoder. The tick symbols in the 'Settings' columns mark which attributes of the central residue have been included as decoding objectives in multitask learning; the 'Results' columns give the decoding accuracies for various attributes of the central residue ('Accuracy' for the sidechain type; 'SS3' and 'SS8' for the secondary structure state categorized according to the three-state scheme and the eight-state scheme, respectively; and χ1 and χ2 for the sidechain torsional angles, with the accuracies computed by tolerating decoding errors of less than 40°). b, The confusion matrix for decoding the sidechain type of the central residue. Darker cells indicate that higher fractions of residues of the corresponding native sidechain types (horizontal axis) have been decoded into the corresponding decoded sidechain types (vertical axis). c, Scatter plot of the actual versus the decoded solvent accessible surface areas of the central residues. d, Scatter plot of the experimentally measured ΔΔG versus −ΔΔlogits from the decoder for the ProTherm dataset of protein mutants. R is the Pearson correlation coefficient in c and d.

Fig. 4 | Results of experimental analysis of designed proteins. a-c, For each of the three backbone targets considered, the results of one designed protein are shown (a, 1r26-A3; b, 1cy5-A7; and c, 1ubq-A4). From left to right: results of size-exclusion chromatography experiments; 1H NMR spectra; thermal capacity versus temperature curves from DSC experiments; the experimentally solved structures (1r26-A3, 1cy5-A7) or the NMR HSQC spectrum (1ubq-A4). The Cα RMSDs of the structures from the corresponding backbones are 0.57 Å for 1r26-A3 and 0.88 Å for 1cy5-A7. For 1ubq-A4, the superposition of the crystal structure of 1ubq (cyan) with the AlphaFold2-predicted structure of 1ubq-A4 (pink) is shown together with the HSQC spectrum, in which ω1 and ω2 refer to the chemical shifts of the (covalently bonded) 15N and 1H nuclei, respectively.

Fig. 5 | Variable sidechain types and sidechain packing designed by ABACUS-R. a, Crystal structures of the different proteins showing (as sticks) the sidechains of residue 24 and its surrounding residues (22, 26, 36, 37, 40, 52, 54 and 75; 1r26 is the native sequence; 1r26-A3, 1r26-A6, 1r26-A7 and 1r26-B4 are ABACUS-R-designed sequences). b, Diagrams showing the sidechain types of residue 24 and its surrounding residues in the different proteins. In each diagram, the orange circle indicates the central residue; the green circles indicate residues of the native sidechain types; the yellow circles indicate residues for which non-native sidechain types have been designed; and the circles with bold borders indicate residues whose sidechain types did not completely converge in the self-consistent iterations. c, Structural details of some sidechains forming non-native inter-residue polar interactions in the crystal structure of the designed 1cy5-A7. A superposition of the crystal structures of 1cy5 and 1cy5-A7 is shown on the left; regions of interest are marked by numbers encircled by dashed lines, with corresponding enlarged views shown on the right. In each enlarged view, the structure of the native protein is shown in cyan on the left side and the structure of the designed sequence is shown in green on the right side. The text labels above the enlarged views indicate the region number and the sidechain substitutions from the native to the designed sequence.
Nature Research Reporting Summary linked to this article. Corresponding author(s): Houqiang Li; Quan Chen; Haiyan Liu. Last updated by author(s): Jun 2, 2022.