The α-helix and β-sheet protein secondary structures were discovered as early as 1951, based on chemical bond theory, stereochemistry and early crystallographic studies of small polypeptides and amino acids (1, 2). Since then, a number of α-helix and β-sheet sub-types and other important secondary structures have been discovered. The secondary structure classification system Define Secondary Structure of Proteins (DSSP), first developed in 1983 (3), classifies seven types of secondary structure: α-helix, β-sheet (parallel and anti-parallel), 3₁₀-helix, π-helix, turn, bend, and polyPro-helix. Everything else is classified as unstructured coil by default, which is not considered a secondary structure. An additional feature identified by DSSP, the β-bridge, does not represent a standalone secondary structure but is a pair of hydrogen-bonded residues that forms part of some β-sheets. Ramachandran diagrams show distinct clusters of dihedral angles for α-helix and β-sheet types, and, in addition to distinct backbone hydrogen bonding, this repetition of similar angles defines them as regular secondary structures. For the same reason, 3₁₀-helices and π-helices are also defined as regular secondary structures. Conversely, turns, bends and polyPro-helices, which do not have both repeating dihedral angles and distinct hydrogen bonding networks, are classified as irregular secondary structures.
Some variation in precise secondary structure definitions exists across different classification systems, most prominently DEFINE (4), STRIDE (5), and SST (6). However, regular structures are universally defined by regular dihedral and hydrogen bond patterns (e.g., α-helices have a pattern of i -> i + 4, meaning that the hydrogen bond acceptor of residue i interacts with the amide donor four positions further along the chain). While individual hydrogen bonds are often crucial for the formation of irregular secondary structures, they do not typically form a recognizable repeating pattern (9), and therefore shape is often used for classification. Irregular secondary structures such as turns are also commonly classified by terminus separation length (e.g., δ-turn, γ-turn, β-turn, α-turn, and π-turn, which have i -> i + 1 to i -> i + 5 patterns, respectively (7)). Longer turns can be classified as Ω-loops, a motif of 6 to 16 residues in length with a shape that resembles the upper-case Greek letter Ω (8). Given the minuscule fraction of protein sequence space estimated to ever have been sampled by evolutionary processes (10), it is certainly conceivable that additional secondary structures remain to be discovered.
Discovery of new secondary structures can potentially be facilitated by tools developed for protein design. Protein design has been accelerated by computational methods such as molecular dynamics, fragment-based approaches, and more recently, generative AI (18, 19). De novo protein design could be particularly useful for secondary structure discovery, as it allows the elucidation of amino acid sequences that fold into a given target structure. Protein folds with symmetry and repeating structural sub-units have been designed this way, including coiled-coils (11–15), TIM-barrels (16), and β-propellers (17). It can also be used to generate sequences for novel structure targets not previously observed in nature (20–22). More recently, neural networks trained on the PDB have been used to build generative adversarial networks (GAN), variational autoencoders (VAE), and large language models (LLM) that can generate new protein backbone structures (23, 24). However, all novel proteins generated to date have exclusively been constructed from the seven secondary structures observed in nature and established since the 1990s.
In this study, we co-opted these well-established computational methods to probe the vast realm of unexplored protein sequence space for undiscovered secondary structures. Physical and mathematical modelling enabled the discovery of a spiral structure unprecedented in nature. Further in silico work uncovered amino acid sequences that support spiral folding, and the most stable was used to realize the spiral structure in vitro using NMR spectroscopy. We named this eighth secondary structure “theta-spiral” (ϑ-spiral) because of its pleasing phonetic flow, spiral-like form, and position of ϑ as the eighth letter of the Greek alphabet. Theta-spirals have the potential for size and sequence variation, as well as potential to assemble and interact with canonical secondary structures and other spirals. Their utility and impact are impossible to predict but could be very far reaching.
Spiral polypeptides have secondary structure potential and are not found in nature
In order to create a starting point for the exploration of new secondary structures, we experimented with different shapes not observed in nature (i.e., not in the Protein Data Bank (PDB)). A physical poly-alanine model was constructed with removable hydrogen bonds and used to stimulate exploration by anyone who passed through our office. A spiral shape emerged as one of the simplest and most elegant geometries with secondary structure potential. It had the potential to be formed and stabilized by both hydrogen bonding networks (regular secondary structure) and side chain interactions (irregular secondary structure). A spiral is a single arc with progressively decreasing curvature as it propagates outward. Allowable dihedral torsion angles φ and ψ were tuned to approximate a spiral backbone structure. Dihedral angles of regular secondary structure elements have been well characterized (25) and visualized using Ramachandran plots. In these elements, torsion angles are constrained to a single region on the Ramachandran plot for all the residues that make up a structural element. Inspired by the work of Mannige et al. (26) on peptoid nanosheets using two alternating torsion angle zones on the Ramachandran plot, we found that taking a set of dihedral angles and alternating the signs every residue induced the backbone to adopt an arc-like structure (e.g., given a φ/ψ set of 50°/-10° for one residue, the dihedral angles for the next residue would be -50°/10°, then 50°/-10°, and so forth). Arc curvature could be controlled by varying the magnitude of one of the angles. Initial spiral structures were thereby constructed using the Schrödinger modelling software (Fig. 1).
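The sign-alternation rule described above can be illustrated with a minimal sketch. The function name `alternating_dihedrals` and its parameters are hypothetical and are not taken from the modelling workflow used in this study; the sketch only enumerates target φ/ψ pairs and does not build atomic coordinates.

```python
def alternating_dihedrals(phi, psi, n_residues):
    """Return (phi, psi) targets in degrees whose signs flip at every residue.

    Hypothetical helper for illustration only: it lists target angles and
    does not construct a backbone.
    """
    angles = []
    for i in range(n_residues):
        # Even-indexed residues keep the input signs, odd-indexed ones invert them.
        sign = 1 if i % 2 == 0 else -1
        angles.append((sign * phi, sign * psi))
    return angles


# Example from the text: a 50/-10 pair followed by -50/10, and so on.
print(alternating_dihedrals(50, -10, 4))
# [(50, -10), (-50, 10), (50, -10), (-50, 10)]
```

Under this scheme, changing the magnitude of one of the two input angles along the chain would be one way to vary the curvature of the arc, in line with the observation above.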
Spiral structures were made by modifying polypeptide dihedral angles to produce progressively less curvature. Backbone atoms of residues in different concentric rings of the spiral were orientated to face each other (see Fig. 1B) to maximize the chance for hydrogen bond acceptor-donor pair proximity, one of several interaction types that could stabilize novel secondary structures (1, 27). In order to minimize any steric penalties, initial structures were generated using backbone dihedral angles corresponding to the sterically favorable zones of right-handed and left-handed helices on the Ramachandran plot (28). These choices ensure that the requirement of alternating signs to create arc-like chains is satisfied (Fig. 1C & 1E). A set of 20 structures was constructed, each with a length of 34 amino acids. This was the minimum number of amino acids that, using favorable φ/ψ angles, could produce a reasonable spiral with two concentric rings.
Spiral backbone conformational range explored using molecular dynamics
All 20 structures were simulated for 100 ns using molecular dynamics (MD) to generate a diverse set of spiral backbones. Three replicates were performed for each structure using different force fields, including one specifically designed for simulating intrinsically disordered proteins. The resulting trajectories were clustered to generate a representative structural ensemble of spiral backbone conformations. Diverse conformations allow more diverse amino acid sampling and a better chance of uncovering high-quality Rosetta sequence predictions (29). Two clustering algorithms, DBSCAN (32) and K-means (33), were evaluated on the MD data using clustering efficiency and accuracy metrics. MD trajectories were aggregated and data points were taken every five frames. The resulting data were clustered using DBSCAN, and an ensemble of 27 structures representing the conformational space sampled by spirals was determined (Fig. 1F).
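The subsampling and density-based clustering steps described above can be sketched as follows. This is a minimal, pure-Python illustration of the DBSCAN idea and not the implementation used in this study; the per-frame feature vectors, the `eps` and `min_samples` values, and the function names are all assumptions. A real analysis would more likely rely on an established library implementation.

```python
from math import dist


def subsample(frames, stride=5):
    """Keep every `stride`-th frame of an aggregated trajectory."""
    return frames[::stride]


def dbscan(points, eps, min_samples):
    """Minimal DBSCAN: return one cluster label per point, with -1 for noise.

    `points` are hypothetical per-frame feature vectors; `min_samples`
    counts the point itself, as is conventional.
    """
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisional noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, so keep expanding
    return labels


# Toy 2D "frames": two dense groups and one outlier.
frames = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(frames, eps=1.5, min_samples=2))
# [0, 0, 0, 1, 1, 1, -1]
```

One representative structure per cluster (for example, the member closest to the cluster centre) could then be taken to form the ensemble; how representatives were actually chosen is not specified here.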
Spiral sequence discovered using computational protein design tools
Using the set of 27 spiral backbone structures as queries, we searched for novel sequences with the potential to stabilize and fold into them. To avoid canonical secondary structure formation, we narrowed our exploration to sequences that are predicted to be random coil. Each structure from the 27-member conformational ensemble was prepared as an input file for Rosetta’s fixbb sequence exploration algorithm (29). Rosetta exploration produced one thousand sequences for each of the 27 seed structure queries, for a total of 27,000 sequences predicted to support a stable spiral. Removing duplicates resulted in 16,173 unique sequences. Additional tests were performed to investigate what bias, if any, the starting sequence has on the spiral sequences identified by Rosetta. Figure 2A shows logo plots generated from three different sequence position weight matrices (PWMs): the top plot was made from the 16,173 unique sequences generated from poly-GLU input seeds, and the middle and bottom plots were made from input seeds with two different randomly generated sequences (16,098 and 15,899 sequences, respectively). This confirmed that the starting sequence does not bias the discovery process.
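The pooling and duplicate-removal step can be sketched in a few lines. The function name and the toy input below are hypothetical; the sketch assumes that designs are plain one-letter sequence strings and that only exact duplicates are removed.

```python
from itertools import chain


def pool_unique(designs_per_backbone):
    """Pool the designs from every backbone and drop exact duplicates.

    `designs_per_backbone` is a list of lists of sequence strings, one inner
    list per seed backbone. First-seen order is preserved.
    """
    pooled = chain.from_iterable(designs_per_backbone)
    return list(dict.fromkeys(pooled))


# Toy input: two backbones whose design runs share one sequence.
designs = [["GSEK", "PSEK", "GSEK"], ["PSEK", "GTEK"]]
print(pool_unique(designs))
# ['GSEK', 'PSEK', 'GTEK']
```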
Spiral sequences predicted to exhibit canonical secondary structure removed
We next sought ways to reduce the sequence set to an experimentally testable number. Among the 16,173 spiral sequences predicted, it was theorized that sequences with any capacity to form canonical secondary structures could compete with spiral formation (35). Secondary structure prediction algorithms were therefore used to filter out sequences with the potential to compete. Four different secondary structure prediction algorithms were applied serially to identify sequences in which any portion was predicted to form secondary structure. Only sequences in which every residue was either predicted to form unstructured coil or received no prediction were progressed to the next algorithm. The prediction algorithms, in the order used, were PSSFinder, SSPRED, SPIDER3, and JPRED4 (36–39). This filter produced 207 sequences predicted to have no canonical secondary structure in any of the 34 residue positions (Fig. 2B).
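The serial filtering logic can be sketched as below. The predictors are represented as hypothetical callables that return one label per residue; the label alphabet (`C` for coil, `-` for no prediction) and the function names are assumptions for illustration and do not reflect the actual output formats of the four tools.

```python
def is_all_coil(prediction, allowed=frozenset("C-")):
    """True if every per-residue label is coil ('C') or unpredicted ('-')."""
    return all(label in allowed for label in prediction)


def serial_filter(sequences, predictors):
    """Pass sequences through each predictor in order, keeping all-coil ones.

    `predictors` is an ordered list of callables mapping a sequence string to
    a per-residue label string of the same length (a hypothetical interface).
    """
    survivors = list(sequences)
    for predict in predictors:
        # Only sequences that pass this predictor are seen by the next one.
        survivors = [seq for seq in survivors if is_all_coil(predict(seq))]
    return survivors
```

Because the filters are applied in series, a sequence is retained only if all four predictors agree that no residue adopts canonical secondary structure, and later predictors run on progressively smaller sets.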
Thermodynamically stable spiral sequences identified using MD
To further reduce the spiral sequences to an experimentally testable number, all 207 sequences were subjected to MD simulations of increasing length. After an initial round of 50 ns simulation, RMSD values were calculated and trajectories exceeding a stringent 4.0 Å RMSD cutoff were discarded (supplementary figure). The remainder were subjected to an additional 50 ns simulation and the pruning process was repeated. Sequences still in contention after 100 ns were subjected to a final round of 50 ns simulation with three replicate trajectories. Six sequences remained that consistently had RMSD values lower than 4 Å and were therefore determined to be the most thermodynamically stable. All six candidates had very similar sequences, so an additional sequence that did not meet the final RMSD cutoff was added to the pool for extra sequence diversity (Fig. 2C). Very long (1 µs) MD simulations were run for each of the seven final sequences, with c11_0280 and c11_0869 predicted to be the most thermodynamically stable (Fig. 2D).
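The RMSD-based pruning can be sketched as follows. The sketch assumes that coordinates have already been superimposed on the reference structure and that a trajectory is discarded if any frame exceeds the cutoff; both are assumptions, since the exact criterion is not spelled out above, and the function names are hypothetical.

```python
from math import dist, sqrt


def rmsd(coords_a, coords_b):
    """RMSD in the input units between two equal-length, pre-aligned
    lists of (x, y, z) coordinates."""
    squared = [dist(a, b) ** 2 for a, b in zip(coords_a, coords_b)]
    return sqrt(sum(squared) / len(squared))


def prune_by_rmsd(rmsd_by_sequence, cutoff=4.0):
    """Keep sequences whose RMSD never exceeds `cutoff` (in angstroms).

    `rmsd_by_sequence` maps a sequence name to its per-frame RMSD values.
    """
    return {
        name: series
        for name, series in rmsd_by_sequence.items()
        if max(series) <= cutoff
    }


# Toy per-frame RMSD series for two hypothetical candidates.
series = {"seq_a": [1.2, 2.8, 3.9], "seq_b": [1.5, 4.6, 3.0]}
print(list(prune_by_rmsd(series)))
# ['seq_a']
```

Repeating `prune_by_rmsd` after each extension of the simulations would reproduce the staged pruning described above.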
CD spectroscopy suggests noncanonical secondary structure character
All seven of the sequences predicted to be the most thermodynamically stable were synthesized, and their secondary structure characteristics were investigated with circular dichroism (CD) spectroscopy (40). All seven were confirmed to be thermostable, with minimal deviation observed in the spectra across three temperatures (5°C, 20°C, and 40°C; Fig. 3A). A negative control sample was made by randomly scrambling the sequence of one of the seven peptides. The scrambled sequence was not predicted to form any canonical secondary structure. CD spectra for canonical secondary structure elements are well established. Predominantly α-helical proteins have two distinct negative peaks at around 220 nm and 210 nm, and positive ellipticity at wavelengths lower than 200 nm (41). β-sheets produce a single broad negative peak at around 220 nm and a positive band near 195 nm (42). Disordered proteins show close to zero ellipticity above 210 nm and negative values at wavelengths below 200 nm (43). CD spectra for all seven spiral candidate sequences were closer to those of disordered proteins than to those of canonical secondary structures (Fig. 3A). To quantify secondary structure character, CD data were uploaded to the BeStSel (44) webserver. Results in Fig. 3B indicate that spiral candidate sequences are mostly expected to produce a combination of β-sheet, turn and ‘other’ structural features. This did not rule out the existence of novel secondary structure types with a mix of different canonical structure character.
Spiral structure confirmed in vitro using NMR Spectroscopy
Two-dimensional NMR spectroscopy (46) was used to obtain the three-dimensional structure of candidate sequence c11_0869. This sequence was selected because it was the most stable in MD simulation (Fig. 2D) and also exhibited the least canonical secondary structure character in CD (Fig. 3B). Homonuclear 1H 2D Total Correlation Spectroscopy (TOCSY) and Nuclear Overhauser Effect Spectroscopy (NOESY) NMR spectra were obtained. Cross-peaks in the characteristic fingerprint region of the 2D TOCSY spectra, spanning the amide hydrogen band along the f2 axis and the α-proton to aliphatic proton band along the f1 axis, were first analyzed to assign residue identities. The 2D TOCSY experiment restricts cross-peak correlations to a single spin system (47), which in the context of proteins corresponds to a single amino acid. Following identification of residue identities via 2D TOCSY, 2D NOESY spectra of the corresponding peptides were employed to assign the exact sequence position of the identified amino acid cross-peaks (48, 49). Intensities of the NOE peaks in a NOESY spectrum are correlated with the distance between the two correlating atoms (49). Signal intensities were converted into a list of distance restraints, which were used to model an ensemble of three-dimensional structures (supplementary figure). After assignment of cross-peaks was completed, all fully assigned cross-peaks and a number of cross-peaks that could not be confidently assigned were picked and integrated to calculate their peak volumes. Using the ARIAWeb webserver (50), which implements the ARIA structure calculation program (51), these peak volumes were converted into distance restraints between corresponding atoms. A 2D phase-sensitive COSY (psCOSY) spectrum was also obtained. The multiplet peaks in the psCOSY were processed using NMRPipe (52) and analyzed following the ACME protocol (53) to derive a set of vicinal backbone HN-HA J-coupling constants.
These values were used to derive backbone dihedral angles via the Karplus equation during the structure calculation process (54). These restraints, along with a simulated annealing protocol were used to calculate the structural model. Figures 3C & 3D show the calculated structure. The best structure from this ensemble (model #4) aligned very closely to the in silico structure (backbone RMSD of NMR structure superimposed with the in silico starting structure is 3.0 Å). Most of the difference is due to N- and C-termini not fully engaging with the spiral.
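The Karplus relationship used to relate HN-HA coupling constants to the backbone φ angle can be sketched as follows. The functional form J = A cos²θ + B cos θ + C with θ = φ − 60° is the standard one; the default coefficients below are one widely used literature parameterization and are an assumption here, since the coefficients applied during the structure calculation are not stated. The function name is hypothetical.

```python
import math


def karplus_j_hn_ha(phi_deg, a=6.51, b=-1.76, c=1.60):
    """Predicted 3J(HN-HA) coupling in Hz for a backbone phi angle in degrees.

    The coefficients a, b and c (in Hz) are assumed defaults for illustration;
    other published parameterizations differ slightly.
    """
    theta = math.radians(phi_deg - 60.0)
    return a * math.cos(theta) ** 2 + b * math.cos(theta) + c


# Extended-like phi values give large couplings, helix-like values small ones.
for phi in (-120.0, -60.0):
    print(phi, round(karplus_j_hn_ha(phi), 2))
# -120.0 9.87
# -60.0 4.11
```

Because the curve is not one-to-one, a single measured coupling can be consistent with more than one φ value, which is one reason such restraints are typically combined with NOE-derived distances during structure calculation.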
Spiral structure dihedral angles and hydrogen bonding indicate irregular class
Hydrogen bonding analysis was carried out on the NMR-determined spiral structure. Figure 3D shows that just two hydrogen bonds were identified, both resulting from a single donor atom (yellow lines). This was confirmed by MD simulations of the NMR structure (supplementary figure). The lack of repeated similar dihedral angles (Fig. 3E & 3G) and of a distinct hydrogen bonding network (Fig. 3D) suggests that this spiral should be classified as an irregular secondary structure. It is comparable to the Ω-loop irregular secondary structure, which is characterized by a single hydrogen bond that holds the loop in place and does not have a hydrogen bond network or repeated dihedral angle pattern.
NMR spiral structure and sequence confirmed as having no precedent in nature
To assess whether spiral structures already exist in the known repertoire of natural and engineered proteins, in silico and NMR spiral structures were screened for similarity against every structure in the PDB using Distance Matrix Alignment (via the DALI webserver). DALI evaluates 3D structural similarity between protein structures by performing a pair-wise comparison between user-specified input structures and the entire PDB. This method involves a flexible superimposition of two structures, followed by calculation and comparison of intra-molecular distances. DALI returned zero matches in the PDB for both the in silico and the NMR-determined spiral structures. Running spiral sequences on the NCBI protein BLAST (BLASTP) server returned “no significant similarity found”. Combining this with the DALI result, we conclude that polypeptide spirals are novel in sequence and structure.