Case Study I: FUS, LLPS regulated in the context-dependent tuning of preferred forces
Fused in sarcoma (FUS) is one of the earliest discovered biological systems that self-organize by liquid-liquid phase separation (LLPS) 38. Since then, the protein has been the subject of extensive experimental and computational research to understand the molecular mechanisms and interactions driving this phenomenon. FUS can be found in the BIAPSS service by the UniProt identifier (P35637), gene (FUS), or using the “RNA-binding” search key. The summary page contains a high-quality image of the experimentally confirmed cellular location (left panel in Figure 3A). Due to its multifunctionality in RNA processing, FUS is mostly observed in the nucleus 39. Under physiological conditions, low levels of the protein are distributed in the cytoplasm 40, where FUS transports and manages RNA through dynamic liquid-like subcellular compartments, such as ribonucleoprotein or stress granules 41. However, the cytoplasmic concentration of FUS increases significantly when deleterious mutations lead to aggregation 42. This progressively aberrant process manifests as neurodegenerative disease in humans 42. Although plenty of accumulated evidence points to the influence of distinct factors on the cellular behavior of FUS, its primary sequence still holds many cues. To frame the physicochemical properties of full-length FUS, we used the analytical approach offered by the SingleSEQ module of BIAPSS.
The average metrics indicate that the 526-residue sequence of FUS contains over 80% disorder and only 8% order. The solvent accessibility predictions show the same ratio between exposure and burial. The contents of aromatic, hydrophobic, polar, and charged residues are 10%, 42%, 40%, and 17%, respectively, with a slight excess of positive charge. Such a rough overview described by a set of averages gives some general insight into the protein properties but conceals local distributions that are important for identifying the preferential interactions. Therefore, we conducted a detailed analysis of the composition and complexity of the FUS sequence and present the resulting patterns in Figure 3B. Compared to any reference set of proteins, FUS is extremely enriched in glycine, which makes up nearly 1/3 of the full sequence. Another 20% of the amino acid content consists primarily of serine and glutamine. Although the dominance of these three amino acids suggests a generally low-complexity sequence, their distribution along the sequence is strongly heterogeneous. Indeed, the calculated low information content is mainly localized around the protein termini and clearly corresponds to three fragments with high glycine concentration (LCR2: 164-267, LCR3: 370-420, LCR4: 454-507). These regions also concentrate the arginines, which together with glycine form a series of repeating RGG motifs known to bind RNA specifically 44. Both serine and glutamine are mostly localized at the N-terminus and are most clearly clustered within LCR1 (1-163). LCR1 additionally contains 24 of the 35 tyrosines and thus shows a distinct enrichment (SQYG) known to occur in prion-like domains (PLD) 45.
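The composition and motif statistics discussed above can be illustrated with a minimal Python sketch. It is not the BIAPSS implementation: the function names and the toy sequence are hypothetical and chosen only for illustration, and the real FUS sequence would have to be retrieved from UniProt (P35637).

```python
from collections import Counter
import re


def composition(seq):
    """Return the fraction of each residue type in a one-letter protein sequence."""
    seq = seq.upper()
    counts = Counter(seq)
    return {aa: n / len(seq) for aa, n in counts.most_common()}


def motif_positions(seq, motif="RGG"):
    """Return 1-based start positions of every (possibly overlapping) motif occurrence."""
    pattern = re.compile("(?=" + re.escape(motif) + ")")
    return [m.start() + 1 for m in pattern.finditer(seq.upper())]


def residue_fraction_in_region(seq, residue, start, end):
    """Fraction of all `residue` occurrences that fall inside the 1-based, inclusive region."""
    seq = seq.upper()
    total = seq.count(residue)
    inside = seq[start - 1:end].count(residue)
    return inside / total if total else 0.0


# Hypothetical toy sequence used only to exercise the functions (NOT the FUS sequence).
toy = "MSYGQSSYGQRGGRGGSGRGG"

print(composition(toy))                            # per-residue fractions, most common first
print(motif_positions(toy, "RGG"))                 # start positions of RGG repeats
print(residue_fraction_in_region(toy, "Y", 1, 10)) # share of tyrosines in residues 1-10
```

Applied to a full-length sequence, the last function would give the kind of ratio quoted above for tyrosines in LCR1, provided the region boundaries are supplied by the user.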
Using the Domains, Motifs, Repeats application, we also found that the remaining, compositionally more complex regions of the C-terminus (I287-L365 and R422-D453) match the PF00076 and PF00641 Pfam domains, i.e., the RNA recognition motif (RRM) and RNA-binding zinc finger (ZnF), respectively. The robust predictions (for details see Methods) unanimously show that RRM is a well-folded FUS domain, while the other fragments remain disordered. The seed MSAs prepared for FUS within the Sequence Conservation application further confirm that both domains are evolutionarily conserved members of Pfam families: RRM_1 and zf-RanBP, respectively (see bottom rows in Figure 3B).
The visual inspection of the amino acid content and distribution of FUS allows us to identify and isolate specific regions in the protein (Figure 3C). Furthermore, we performed a physicochemical featurization of these segments, which reveals preferred interactions when coupled with biomolecular conditionals known from experiments. Recent experimental reports showed that the isolated prion-like domain (PLD, 1-214 or even 1-163) can self-organize into liquid droplets at high protein levels or high salt concentrations 46,47. This N-terminal fragment is enriched in amino acids whose side chains are multivalent, as shown in Figure 3C. The dense pattern of polarity comes from enrichment in S, Q, and Y, where Y, Q, and G also provide π-electron centers for π-π-stacking. Most of them can also act as both donors and acceptors in side-chain hydrogen bonding (HB). In line with this, the intermolecular interaction profiles derived from simulations of the 120-163 region indicated the most frequent contacts in the order QQ > QY > YY > SY, followed by other pairs of enriched amino acids 48. All of these observations suggest that homotypic phase separation of wild-type PLD monomers is driven by balanced contributions from hydrogen bonding and π-stacking. Indeed, several mutagenesis studies showed that the Y➝A substitution disrupts phase separation by removing both components of the interaction, while Y➝F mutants are significantly more aggregation-prone, owing to tighter hydrophobic F-π-stacking at the cost of losing the HB contributions of the polar tyrosine 46,48. It is also worth noting that the PLD region completely lacks positive charge and has a minor net charge per residue of -0.01 (M1-S165: -0.012 and M1-G212: -0.024), which places it within the weak polyelectrolytes region on the CIDER diagram (right panel in Figure 3A) 43.
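The net charge per residue quoted above can be sketched as follows. This is a minimal illustration rather than the CIDER or BIAPSS code, and it assumes the common convention in which K and R count as +1, D and E as -1, and histidine as neutral. The function names are hypothetical.

```python
POSITIVE = "KR"  # assumed +1 at neutral pH
NEGATIVE = "DE"  # assumed -1 at neutral pH (histidine treated as neutral)


def net_charge_per_residue(seq):
    """(number of positive residues - number of negative residues) / sequence length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n_pos = sum(seq.count(aa) for aa in POSITIVE)
    n_neg = sum(seq.count(aa) for aa in NEGATIVE)
    return (n_pos - n_neg) / len(seq)


def fraction_charged_residues(seq):
    """(number of positive residues + number of negative residues) / sequence length."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    n_charged = sum(seq.count(aa) for aa in POSITIVE + NEGATIVE)
    return n_charged / len(seq)


# Hypothetical toy fragment (NOT the FUS PLD): one aspartate in five residues.
print(net_charge_per_residue("GSYQD"))
print(fraction_charged_residues("GSYQD"))
```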
However, an excess of serine and threonine in this region makes it possible to introduce a strongly negative charge through multiple phosphorylations. After phosphorylation, the dominant force becomes electrostatic repulsion, which is known to disrupt both phase separation and aggregation 38. The central region of PLD (39-95) was proposed as the core of aberrant fibrils, which in the solid state form structured cross-β-sheets 38. The same structural properties have not been unambiguously confirmed in the condensed phase of liquid-liquid mixing. Our algorithms did, however, detect along this region structural motifs known as low-complexity, amyloid-like, reversible, kinked segments (LARKS) 31. In our analysis, the most effective predictors of structural properties showed for these motifs some tendency toward extended secondary structure and a slightly increased probability of burial (bottom panel in Figure 3C). Interestingly, the prediction in 8-letter notation detected a turn or bend within each of the structural motifs, which explains their flexible nature. These findings, together with ambiguous experimental results, may suggest some variation of the structural state in the PLD core, and specifically a disorder-to-order transition driven by biomolecular conditionals.
The remaining part of the FUS sequence, referred to as the C-terminus, contains two well-known domains (RRM and ZnF) and three glycine-arginine-rich regions (GARs). All of these components are significant players in binding RNA. The zinc finger supports only the recognition of the specific GGU motif, while the RRM domain and RGG repeats are universal towards a variety of RNAs 49. Both folded domains of FUS are much less polar than PLD, as seen from the BIAPSS-based physicochemical featurization in Figure 3C. They also have a lower content of side chains that are able to engage in π-stacking or hydrogen bonding. However, charged residues are quite abundant in the composition of RRM and ZnF, which explains the functional role of electrostatic interactions in binding nucleic acids or stabilizing folds via salt bridges 50,51.
All three GARs are the least polar regions of the protein (see Figure 3C). The dense patterning of hydrophobicity arises from the glycine excess. The rich π-electron-containing systems, other than aromatic side chains, originate mainly from the abundance of the guanidino group of arginine. Arginine is also a source of excess positive charge at the C-terminus. Experimental studies consistently confirm that the isolated C-terminus does not undergo phase separation 46. However, liquid droplets rapidly form when it is mixed with N-terminal monomers 46. Moreover, the LLPS of full-length wild-type FUS is more robust than heterotypic mixing of N- and C-termini and homotypic self-assembly of N-terminal monomers 46. This suggests the higher priority of cation-π (R-Y) stacking over π-π (Y-Y) stacking, while both are reinforced by hydrogen bonds. Another experimental study showed that R➝K mutants, which no longer have the ability of π-π-stacking but retain charge, can still undergo phase separation. In turn, R➝A substitutions prevent phase separation because they lose the π-system, the cation, and the ability of side-chain hydrogen bonding. Interestingly, a recent report indicates that stacking interactions, including cation-π (e.g., RY, KF) and especially π-π (e.g., YY and RY, and even RQ), are most robust over a wide range of salt concentrations 47. The hydrophobic contribution from π-electron-containing systems becomes the main force that strengthens the contact in high salt. In these conditions, the screening of the usually dominant electrostatic contributions is significant. Surprisingly, changing the partitioning of the different forces makes the interaction of two positively charged arginines attractive under these conditions 47. The set of diverse chemical groups in arginine is a unique feature among the amino acid side chains.
Such high reactivity calls for precise regulation, and arginine can be tuned to a preferred state by posttranslational methylation. Thus, under physiological conditions, FUS is highly methylated 46. This limits self-assembly via interactions with tyrosine and promotes a functional role of intermolecular interactions with other proteins and nucleic acids. Therefore, phase separation and gelation of FUS can be enhanced by hypomethylation of arginines within RGG-rich regions or by insertion of additional arginines in the C-terminus 46. All of these findings come together to demonstrate the significant role of the arginine side chain in phase separation. Tyrosine and glutamine are similarly relevant. They also contain multifunctional chemical groups that make them reactive and multivalent. These features aid in context-dependent tuning between preferred forces of interactions, which can work synergistically or alternatively. Their regulation depends on environmental conditions, the state of posttranslational modifications, and the presence of binding partners.
Case Study II: hnRNPDL - cellular behavior regulated by the priority of interactions
Heterogeneous nuclear ribonucleoproteins are a large and diverse family of proteins that play active roles at every stage of RNA regulation 52. An important member of this family, the human heterogeneous nuclear ribonucleoprotein D-like (hnRNPDL), occurs in the nucleus in three isoforms generated by alternative splicing. These isoforms differ by the presence or absence of structurally flexible terminal regions and, depending on this, phase separate (DL1, full-length), aggregate (DL2, missing N-terminus), or remain soluble (DL3, missing both N- and C-terminus) 53. It is therefore thought that a unique combination of these regions is used for regulating the function and cellular behavior (phase state, mobility, location) of the isoforms. It is known that various short sequential or structural motifs, especially those located in the disordered regions, enhance multivalent interactions and tune the ability of proteins to form higher-order assemblies 31,54. It has been shown that even a single mutation in a key region of hnRNPDL can significantly promote aberrant aggregation 53. To understand the differences in the self-assembly properties of the isoforms, we carried out an in-depth biophysical and bioinformatic analysis of full-length hnRNPDL using the BIAPSS SingleSEQ web applications.
The protein can be found in the BIAPSS online repository by the UniProt identifier (O14979), gene (hnRNPDL), or common name. The brief summary shows that the entry is covered in most primary LLPS databases except PhaSePro. Interestingly, the PSPredictor score is 0.27, which indicates a lack of phase separation, contrary to the experimental reports. The simplified biophysical characteristic of the sequence shows 49% disorder and 31% order, the latter consisting mostly of helical structure. Solvent accessible surface area analysis shows that 64% of the structure is likely exposed to the solvent. The contents of polar, hydrophobic, and aromatic residues are 30%, 42%, and 12%, respectively.
The Composition and Complexity app allows us to carry out a more in-depth analysis of the patterning of LLPS-promoting residues in the full-length sequence of hnRNPDL. We find that the protein is enriched in polar (Y, Q) and positively charged (R) residues and in π-electron-containing systems (Y, R, Q, G) at the expense of structure-promoting hydrophobic residues (I, L, V), relative to the reference sets of globular proteins and SwissProt sequences, and even to the disordered proteins collected in DisProt or the LLPS-driving sequences collected in BIAPSS (Figure 4A). In particular, the consensus of low-complexity region (LCR) predictors shows that the G > Y > Q clusters are mostly localized in the C-terminal region, with enrichment typical for prion-like domains. The N-terminus, on the other hand, is enriched in P and R residues. The K residues are mostly localized in the linker between the two folded domains located in the central part of the sequence. In line with these observations, the Shannon entropy reveals that the C-terminal region is much lower in complexity than the N-terminus. However, their amino acid composition indicates that both are still classified as LCRs. Additionally, robust predictions of structural disorder (taken from the Structural Disorder app) further confirm that both LCRs contain intrinsically disordered regions (IDRs). Specifically, IDR1 (M1-Q145) is located at the N-terminus, while IDR2 (V313-F350) and IDR3 (S397-Y420) are located at the C-terminus. Thus, we can already see a significant asymmetry between the termini of the protein, in harmony with experiments showing distinct physicochemical and functional implications for these two LCRs 53.
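The Shannon entropy used above as a complexity measure can be illustrated with a short sliding-window sketch. The window length and function names are assumptions made for this example, and the LCR predictors behind BIAPSS may use different parameters.

```python
import math
from collections import Counter


def shannon_entropy(fragment):
    """Shannon entropy (in bits) of the residue distribution in a sequence fragment."""
    n = len(fragment)
    if n == 0:
        raise ValueError("empty fragment")
    counts = Counter(fragment.upper())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def sliding_entropy(seq, window=20):
    """Entropy of every full window along the sequence; low values flag low complexity."""
    if window > len(seq):
        raise ValueError("window longer than sequence")
    return [shannon_entropy(seq[i:i + window]) for i in range(len(seq) - window + 1)]


# Hypothetical toy sequence: a glycine-rich stretch followed by a more diverse one.
toy = "GGGGYGGGGQGGGG" + "MKVLAEDTPRSFHW"
profile = sliding_entropy(toy, window=10)
print([round(h, 2) for h in profile])
```

With 20 amino acid types, the entropy of a window cannot exceed log2(20), about 4.32 bits. Windows dominated by one or two residue types, such as the G-rich stretch in the toy sequence, fall far below that bound.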
It has been shown that glycine, because it lacks a side chain, can stack via the π-electrons of the peptide bond and form hydrogen bonds via the backbone carbonyl or amide 55.
For a more in-depth analysis of the sequence-specific nature of the two terminal LCRs and their physicochemical properties, we next turn to the Chemical Properties Patterns app (Figure 4B). For the C-terminal region, we find a compact cluster of mostly polar and π-stacking (including many aromatic) residues that are capable of being a donor and/or acceptor of side-chain hydrogen bonds, thereby supporting attractive interactions and explaining the observed burial tendency in this region. More interestingly, when combining in Figure 4B the detailed outcome of structural properties and detected short motifs, we can easily correlate the burial bias with a predicted propensity for extended structure. This region is located between two IDRs (N347 to Y396) and aligns with the Y-enriched segment that contains two LARKS-like structural motifs (GYDYTG: 376-381 and GYADYS: 392-397). Furthermore, the unstructured and exposed flanking IDRs (IDR2 and IDR3) correlate well with the G-rich region (G323-G411), where several glycine-arginine motifs (GARs) have been detected. The LARKS motifs are known to shape cross-β-sheets through weak attractive interactions, including π-stacking, van der Waals interactions, and hydrogen bonding, which are all possible according to the amino acid composition of the C-terminus 31. Notably, despite its apparent similarity to the steric zipper, the LARKS motif has a smaller buried interface and looser packing of the side chains between the mating sheets, which means a weaker binding force and higher flexibility. In this case, the enrichment in flexible glycine, aromatic tyrosine, and polar glutamine effectively supports multivalent interactions, since all of them can engage in π-stacking and form hydrogen bonds (Y and Q also through their side chains, being both donor and acceptor of a proton).
Generally, the resultant of binding forces and environmental conditions tends to decide whether such low-complexity regions, decorated with some functional motifs, remain disordered, undergo phase separation, or even aggregate. Our findings are also very much in line with the ATR-FTIR experiments of Taylor-Ventura et al., showing that the isoform DL2, which possesses only the C-terminal one of the two LCRs, leads to aggregates containing intermolecular β-sheets 53.
For the N-terminal region, we find that the content of residues capable of π-stacking and side-chain hydrogen bonding is much smaller, which together with the enrichment in positively charged residues (mostly R) and proline (known to be a secondary structure breaker) reduces the number of possible short-range interactions. As a result, this region appears unstructured (with a slight tendency towards short α-helices) and almost completely solvent-exposed, consistent with the predictions of local structure and solvent accessibility (Figure 4B). When analyzing the predicted contact map in the left panel of Figure 4C (taken from the Contact Map app), we can again easily point out the differences between the LCRs at the two ends of the protein. In particular, besides the enhanced intra-region contacts for the domains, the algorithms detected the Q320-G410 region with a slight probability of internal contacts (score ~ 0.35, while the contact threshold is ⩾0.5), which covers almost the entire C-terminal LCR, while the N-terminus appears without any internal contacts. Interestingly, using the outcome of the Structural Disorder app (row “BINDING” in Figure 4B), we see that the N-terminal region, despite being structurally featureless, is detected by the ANCHOR method as a protein binding region (A20-I95) with three peaks scoring above 0.8. This region coincides with a region enriched in arginine, which is known to be abundant in LLPS-driving motifs and RNA binding regions 44. Therefore, we suspect that the experimentally observed dissolution of DL1 droplets in the presence of highly concentrated RNA 53 is most likely due to strong electrostatic attraction between the negatively charged nucleic acid and the positively charged arginines. When the environment lacks ionic interaction partners, the DL1 isoform undergoes phase separation driven by weaker but still favorable π-cation interactions between C-terminal tyrosines and N-terminal arginines.
And finally, the absence of the R-rich N-terminus in the isoform DL2 leads to fibril formation, driven by π-π-stacking of C-terminal tyrosines in two monomers.
Thus, using a programming analogy, we can say that the final behavior of hnRNPDL isoforms appears as the result of a biomolecular algorithm composed of a series of if-then-else conditionals that depend on the presence of the N-terminus and on environmental conditions. Furthermore, the experiments confirmed that substitution of the cation (R➝K) and/or the aromatic system (Y➝F) significantly weakens the phase separation of DL1 53, which, we think, points to the stronger matching of the R-Y pair. This complementarity results from the high cooperativity of multivalent interactions, which in addition to the π-cation component also include π-π-stacking (CZ=NH1 π-electrons donated by arginine compared to lysine) and side-chain hydrogen bonding (polar hydroxyl donated by tyrosine compared to nonpolar phenylalanine, and multiple nitrogen-bound protons donated by arginine).
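The if-then-else analogy can be written out literally as a small Python sketch. It only encodes the qualitative outcomes described in the text for the three isoforms. The function name, argument names, and return strings are hypothetical, and the combination that the text does not cover (N-terminal LCR present, C-terminal LCR absent) is deliberately left undetermined.

```python
def predicted_behavior(has_n_terminal_lcr, has_c_terminal_lcr, rna_excess=False):
    """Qualitative outcome for an hnRNPDL-like isoform, following the text's description."""
    if has_n_terminal_lcr and has_c_terminal_lcr:
        # DL1 (full-length): arginines of the N-terminus pair with C-terminal tyrosines ...
        if rna_excess:
            # ... unless highly concentrated RNA competes for the arginines.
            return "soluble (droplets dissolve)"
        return "phase separation"
    elif has_c_terminal_lcr:
        # DL2 (no N-terminus): tyrosine-tyrosine stacking between monomers.
        return "aggregation"
    elif not has_n_terminal_lcr:
        # DL3 (neither terminus): only the folded domains remain.
        return "soluble"
    else:
        # N-terminus without C-terminus: no such isoform is discussed.
        return "not described in the source"


for name, n_lcr, c_lcr in [("DL1", True, True), ("DL2", False, True), ("DL3", False, False)]:
    print(name, "->", predicted_behavior(n_lcr, c_lcr))
```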
 The “II-STRUCT.”, “SA”, and “MOTIFS” rows in Figure 3B are taken from the outcome of the Secondary structure, Solvent Accessibility, and Domains, Motifs, and Repeats apps, respectively.
Finally, while the focus understandably tends to be on the low-complexity regions decorated with motifs that confer binding and/or switching structural properties, it is nevertheless also important to consider the well-folded domains, especially in cases like hnRNPDL, where they act as a mediator between the two IDRs. The Domains, Motifs, Repeats app reveals two HMMER-detected Pfam domains (both with ID: PF00076) located in the M151-D219 and V235-I305 regions (Figure 4A, 4B). The presence of the domains is also well captured by the contact map prediction, which reveals details about likely binding activities between distinct regions of the sequence (left panel in Figure 4C). Both the predictors of secondary structure and the known structures (PDBs) assigned to the sequences used in the Pfam seed-MSA indicate that each domain has a well-structured globular α+β topology (composed of 2 α-helices and a single β-sheet), which explains the solubility of the third isoform (DL3), which lacks both flexible ends. The list of corresponding PDBs can be found on the interactive labels assigned to each sequence in the Pfam seed-MSA in the Sequence Conservation & MSA app (e.g., 1P1T, residues: 18-88), and/or by following the link of the “STRUCTURE” button on the main page of the SingleSEQ tab. Specifically, for human hnRNPDL (see the right panel in Figure 4C), BIAPSS provides a direct link to the entry in the novel AlphaFold database, which collects high-quality structure predictions that proved to be competitive with the experimental ones 26.