A comprehensive BIAPSS platform for the physicochemical featurization of phase separating proteins

The liquid-liquid phase separation (LLPS) of biological macromolecules has emerged as a foundational mechanism underlying the formation of a myriad of membraneless organelles (MLOs), such as stress granules, transcription factor condensates, and chromatin compartments. A molecular grammar of sequences, which would enable a quantitative prediction and understanding of protein phase separation from first principles, is currently missing. A major challenge in the field is the sparsity of bioinformatics data and the lack of computational, data-driven tools for biophysical and statistical analysis of proteins capable of phase separation. Here we present the utility of web applications framed within a novel open-source platform for BioInformatic Analysis of liquid-liquid Phase-Separating protein Sequences, https://biapss.chem.iastate.edu/. BIAPSS combines high-throughput interactive data analytics of physicochemical and evolutionary features with a comprehensive repository of bioinformatic data for on-the-fly research of the sequence-dependent properties of proteins with known LLPS behavior. To facilitate exploration of the services and provide interpretation guidelines, we present two illustrative case studies of FUS and hnRNPDL. This should help the LLPS community uncover the nature of interactions driving the formation of membraneless organelles.


Introduction
In the past few years, the liquid-liquid phase separation (LLPS) of biomolecules has become a unifying physical mechanism for understanding principles of intracellular compartmentalization, formation of membraneless organelles, and gene regulation [1][2][3][4][5] . The ability of proteins to phase separate appears to be encoded primarily in the peculiarities of their primary sequences, which often contain low complexity regions and are enriched in charged and multivalent interaction centers 6-8 . While some general sequence trends have emerged, the quantitative aspects of how amino acid sequences encode and decode phase separation still remain largely unknown [9][10][11] . This is because many different combinations of relevant interactions seem to contribute to phase separation without any one being universally necessary 12 .
As a consequence (with a few exceptions [13][14][15][16] ), mostly case-by-case studies of different sequences are performed, with the broader context of many findings, including their statistical significance, remaining unknown.
In recent years, several databases have emerged which collect LLPS-related protein sequence data and metadata, with prominent examples being PhaSepDB 14 , PhaSePro 15 , LLPSDB 16 , and DrLLPS 17 . These databases collect and annotate partially overlapping sets of phase-separating protein sequences including data on experimental conditions and significant annotations. In particular, PhaSePro, LLPSDB, and a subset of PhaSepDB contain manually curated proteins, which are recognized for driving the formation of subcellular compartments.
Accumulation of high-quality datasets is certainly a necessary condition for making progress towards uncovering driving forces of protein phase separation. However, one needs a biophysically motivated computational infrastructure to be able to harness the data from carefully and manually curated sets of decoration parameters that recently emerged as an informative measure for electrostatic interactions of intrinsically disordered proteins (IDPs) 18 . All of the pre-processed bioinformatics data that are passed to BIAPSS visualizers are available for download (these options are included in the DOWNLOAD tab). The available data include raw predictions pre-calculated using well-established tools, as well as the findings based on our deep statistical analysis. Finally, the DOCS tab provides detailed documentation describing all the components of BIAPSS.

SingleSEQ web applications for robust analysis of an individual LLPS sequence
The SingleSEQ component of BIAPSS (https://biapss.chem.iastate.edu/single_seq.html) is designed for high-throughput biophysical and evolutionary analysis, visualization, and interactive exploration of sequence features (Figure 2). A comprehensive featurization of sequences was performed, which provides users with informative metrics with known impact on protein phase separation and conformational propensities. Below we summarize the capabilities of each web application, followed by their performance in case studies on two proteins, FUS and hnRNPDL. For more details on the tools and approaches used, see the Methods section.

Entry Summary and Annotation web application
The first interactive application of SingleSEQ analysis provides a compact one-page summary for a single LLPS protein sequence. A brief overview of all key parameters gives insight into the essential characteristics of an individual sequence. The extended cross-references section shows coverage in external resources and integrates available metadata for a particular LLPS protein. There, the user can (i) easily follow the corresponding entries in the primary LLPS-related databases, such as PhaSepDB 14 , DrLLPS 17 , LLPSDB 16 , and PhaSePro 15 , (ii) explore the principal annotation data and functional information in the UniProt resource 19 , (iii) identify manually curated intrinsically disordered regions in the DisProt database 20 , (iv) check subcellular location through Compartments 21 or display it on the enlarged image, (v) see experimental mapping in The Human Protein Atlas 22 . If an experimental structure is known, (vi) the Structure button links to the PDBe-KB repository 23 , otherwise the computationally modeled structure is accessible via the MODBASE database 24 and/or SWISS-MODEL server 25 , and the newly provided AlphaFold database 26 . The summary also lists the PSPredictor score estimating the propensity for phase behavior 27 , empirically determined LLPS-related region(s), and PubMed identifiers referring to the literature.

Sequence Complexity & Composition web application
The application allows (i) exploring the sequence composition (amino acid content and patterning along the sequence), and (ii) detecting the low-complexity regions (LCRs), accompanied by the overall sequence information content (Shannon entropy). Since LCRs are known to comprise imperfect sequence repeats, and polar or hydrophobic blocks of residues, the detection and systematic analysis of these regions can facilitate the identification of the nature of multiple weak interaction motifs in phase-separating proteins 28 . The AA content section provides the detailed contents and decoration of the 20 biogenic amino acids (by default, those in which the sequence is enriched) in a given LLPS protein referenced to the user-selected benchmark dataset of various groups of proteins. Finally, the low-complexity section displays along the sequence the merged and unified regions of low complexity and the Shannon information content calculated on the fly based on the user-selected length of the frame.
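The windowed information content described above can be illustrated with a minimal sketch. It assumes the standard Shannon entropy of the amino acid frequencies within a frame that slides one residue at a time; the function names and the default frame length are hypothetical and are not taken from the BIAPSS code base, whose exact procedure is described in the Methods.

```python
import math
from collections import Counter


def shannon_entropy(fragment: str) -> float:
    """Shannon entropy (in bits) of the amino acid frequencies in a fragment."""
    counts = Counter(fragment)
    n = len(fragment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(sequence: str, window: int = 20) -> list[float]:
    """Entropy of every full frame of length `window`, shifted by one residue.

    The default frame length of 20 is an illustrative assumption.
    """
    if window < 1 or window > len(sequence):
        raise ValueError("window must be between 1 and the sequence length")
    return [
        shannon_entropy(sequence[i:i + window])
        for i in range(len(sequence) - window + 1)
    ]


if __name__ == "__main__":
    # Toy sequence: a glycine-rich stretch followed by a more varied one.
    toy = "GGGGGGSGGGGGAKDELVYQ"
    for start, value in enumerate(entropy_profile(toy, window=8), start=1):
        print(f"{start:>3} {value:.2f}")
```

Frames dominated by one or two residue types give values close to zero, so low-complexity stretches appear as minima in the profile.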

Patterns of Chemical Decoration web application
The Chemical Patterns application allows combining several key biophysical properties for an individual LLPS-driver protein sequence, all within a single multi-row chart. This includes chemical properties, such as polarity, hydrophobicity, aromaticity, and charge. Then, the robust consensus predictions of secondary structure, solvent accessibility, structural disorder, and low-complexity regions are provided. The position-based vertical alignment of all features gives a deeper glimpse into the correlation of various chemical decorations and structural properties. Furthermore, the all-in-one chart facilitates the detection of regions sensitive for key interactions. For example, parallel comparison of specific patterns of charged or aromatic residues and unique insertion in a structurally disordered region helps to identify hot spot residues or short motifs, which may play an important role in the LLPS behavior.

Domains, Motifs, Repeats web application
This multi-row application is designed to detect Pfam functional domains and offers a broad screening of short motifs, both sequential and structural, in the individual sequences of phase-separating proteins. The findings are displayed on the background of regions experimentally confirmed as LLPS-related and/or predicted to be structurally disordered or low-complexity. In the subsequent rows, additional custom feature patterns can be displayed, e.g., physicochemical decoration to facilitate the identification of the nature of interactions, or evolutionary conservation of short motifs within intrinsically disordered fragments, which clearly points to their significance for the regulatory function and hence may also be essential for the collective formation of MLOs. Additionally, short fragments frequently repeated along an individual sequence are detectable, including random and tandem repeats. Detailed information on the location, motif class, and instance is displayed on the interactive labels.
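As an illustration of how short sequential motifs can be located and reported with their position and instance, the following sketch scans a sequence with regular expressions. It is a simplified stand-in and not the BIAPSS implementation: the function name, the pattern dictionary, and the toy sequence are hypothetical, and only the RGG pattern is taken from the text.

```python
import re


def find_motifs(sequence: str, patterns: dict[str, str]) -> list[tuple[str, int, int, str]]:
    """Return (motif class, start, end, instance) with 1-based inclusive positions."""
    hits = []
    for name, pattern in patterns.items():
        # A lookahead around the capturing group reports overlapping instances too.
        for match in re.finditer(f"(?=({pattern}))", sequence):
            start = match.start(1) + 1  # convert 0-based index to 1-based position
            end = match.end(1)          # exclusive 0-based end equals inclusive 1-based end
            hits.append((name, start, end, match.group(1)))
    return sorted(hits, key=lambda hit: (hit[1], hit[0]))


if __name__ == "__main__":
    toy = "ARGGSRGGRGG"  # toy sequence, not a real protein fragment
    for hit in find_motifs(toy, {"RGG": "RGG"}):
        print(hit)
```

Each tuple carries the same kind of information the interactive labels show (location, motif class, instance), which makes it straightforward to overlay hits on other per-residue tracks.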

Sequence Conservation web application
The conservation strength of a sequence region may be a consequence of its functional or structural importance. Therefore, the unique insertions, deletions, or substitutions identified within the LLPS protein sequence may prove to influence the chemical properties of the polypeptide chain towards multivalent interactions. The Sequence Conservation application delivers (i) the multiple sequence alignment (MSA) of the LLPS sequence against the reference databases of protein sequences, (ii) the consensus sequence profile, and (iii) evolutionary conservation (of LOGO type) at the level of individual amino acids. Also, the identification of functional domains along the query sequence is provided. For each individual domain, the Pfam seed-MSA can be loaded, which significantly increases the reliability of evolutionary conservation in this region of the LLPS protein. The MSA section contains HMMER-aligned 29 protein sequences colored by the chemical character of amino acids and sorted by the E-value and score. Each position in the sequence is annotated interactively by corresponding positions in the query sequence, consensus profile, and full-MSA. The labels also include amino acid, confidence level, and secondary structure assignment taken from known PDBs.

The Secondary Structure web application
Conformational preferences of local sequence fragments are conditioned by the local composition of amino acids through the formation of regular patterns of hydrogen bonds along the protein backbone. Recent findings imply that some intrinsically disordered fragments of the sequence tend to get structurally organized into steric zippers or aromatic-rich kinked β-sheets [30][31][32][33] . The most advanced sequence-based secondary structure predictors provide not only the binary assignment of ordered regions but also the probability fractions for the helical (H), extended (E), or coiled (C) structures of the individual amino acids along the sequence 34,35 . This opens up the possibility to determine factors supporting multivalent interactions, as well as to detect regions showing some tendency to form ordered structure, which can act as a switch depending on the environmental conditions and molecular crowding. The Secondary Structure application presents the sequence-based predictions and agreed consensus of secondary structure elements assigned along the selected LLPS sequences. In the last section, the detailed probabilities to form each type of HEC element are attributed along the sequence with amino acid resolution.

The Solvent Accessibility web application
The Solvent Accessibility application presents the sequence-based prediction of structurally buried or exposed regions along the protein sequence. Since the structural properties of many phase-separating proteins are hardly graspable, the analysis of surface solvent accessibility may support, to a certain extent, the identification of regulatory regions prone to form temporarily stabilized interactions at the contact interface 36,37 . For compatibility and user-friendliness, the overall layout of the interactive graph is similar to that of the Secondary Structure app and contains three main sections: (i) on the top, the agreed consensus of solvent accessibility predictions provided at residue-resolution level, (ii) followed in the next few rows by the original results derived and unified from the individual methods, (iii) below which there is a section of detailed probabilities for exposure or burial of individual residues along the sequence. The interactive annotations include residue index, amino acid type, and solvent accessibility in three-state notation or particular probabilities.

The Structural Disorder web application
Identification of low-complexity and unstructured fragments within proteins has recently gained prominence, following the discovery that liquid-liquid phase separation emerges as a common functioning mechanism for partially or fully disordered proteins 1,2,4,5 . From these findings, we know that protein sequences not only encode for well-ordered tertiary structures but can also explicitly encode intrinsic disorder. The Structural Disorder application provides predictions of disordered regions resulting from the benchmark of numerous widely used methods. Within the app, we provide both the detailed probability distributions of disorder along the sequence and the agreed binary (order/disorder) outcome of all tools assigned per amino acid position.
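One simple way to obtain an agreed binary outcome from several per-residue predictors is a majority vote, sketched below. This is an assumption made for illustration only; the consensus rule actually used by BIAPSS is described in the Methods and may differ. The function name, the "D"/"-" encoding, and the tool labels are hypothetical.

```python
def consensus_disorder(predictions: dict[str, str], min_fraction: float = 0.5) -> str:
    """Per-residue consensus of binary predictions ('D' = disordered, '-' = ordered).

    A position is called disordered when more than `min_fraction` of the
    tools vote 'D' (strict majority by default; an illustrative choice).
    """
    lengths = {len(p) for p in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all predictions must exist and share the same length")
    n_tools = len(predictions)
    consensus = []
    for column in zip(*predictions.values()):
        votes = sum(1 for state in column if state == "D")
        consensus.append("D" if votes / n_tools > min_fraction else "-")
    return "".join(consensus)


if __name__ == "__main__":
    toy = {"tool_a": "DDD---", "tool_b": "DD----", "tool_c": "D--D--"}
    print(consensus_disorder(toy))
```

Raising `min_fraction` makes the consensus more conservative, which is one way to trade sensitivity for specificity when predictors disagree.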

Contact Map web application
The Contact Map application provides sequence-based predictions of spatial contacts between residues distant in a protein sequence. The most powerful contact predictors, such as those selected for this work, employ convolutional neural networks, machine learning, and evolutionary coupling techniques to prepare the list of residue pairs occurring close enough in space with a certain probability. An in-depth analysis of the contacts predicted for the LLPS proteins, despite their largely disordered nature, can help identify regions essential for maintaining structure, dynamics, and function, and may indicate a lead relevant to phase behavior.

Discussion
Case Study I: FUS, LLPS regulated in the context-dependent tuning of preferred forces
Fused in sarcoma (FUS) is one of the early discovered biological systems undergoing self-organization by liquid-liquid phase separation (LLPS) 38 . Since then, the protein has been the subject of extensive experimental and computational research to understand the molecular mechanisms and interactions driving this phenomenon. FUS can be found in the BIAPSS service by the UniProt identifier (P35637), gene (FUS), or using the "RNA-binding" search key. The summary page contains a high-quality image of the experimentally confirmed cellular location (left panel in Figure 3A). Due to its multifunctionality in RNA processing, FUS is mostly observed in the nucleus 39 . In physiological conditions, the low levels of the protein are distributed in the cytoplasm 40 , where FUS transports and manages RNA through the dynamic liquid-like subcellular compartments, such as ribonucleoprotein or stress granules 41 . However, the cytoplasmic concentration of FUS significantly increases when noxious mutations lead to aggregation 42 .
This progressively aberrant process is manifested by neurodegenerative diseases in humans 42 . Although plenty of accumulated evidence points to the influence of distinct factors on the cellular behavior of FUS, its primary sequence still holds many cues. To frame the physicochemical properties of full-length FUS, we used the analytical approach offered by the SingleSEQ module of BIAPSS.
The average metrics indicate that the 526-residue-long sequence of FUS contains over 80% disorder and only 8% order. The solvent accessibility predictions show the same ratio between exposure and burial. The contents of aromatic, hydrophobic, polar, and charged residues are 10%, 42%, 40%, and 17% respectively, with a slight excess of positive charge. Such a rough overview described by a set of averages gives some general insight into the protein properties but conceals local distributions that are important for the identification of the preferential interactions. Therefore, we conducted a detailed analysis of the composition and complexity of the FUS sequence and present the resulting patterns in Figure 3B. Compared to any reference set of proteins, this one is extremely enriched in glycine, which makes up nearly 1/3 of the full sequence. Another 20% of the amino acid content consists primarily of serine and glutamine. Although the dominant content of these three amino acids suggests generally low complexity of the sequence, their distribution along the sequence is strongly heterogeneous. Indeed, the calculated low information content of the sequence is mainly localized around the protein termini and clearly corresponds to three fragments with high glycine concentration (LCR2: 164-267, LCR3: 370-420, LCR4: 454-507). These regions also exclusively accumulate total arginines, which together with glycine form a series of RGG repeating motifs known to bind RNA specifically 44 . Both serine and glutamine are mostly localized at the N-terminus, being more clearly clustered within LCR1 (1-163). LCR1 additionally gathers 24/35 available tyrosines, and thus, it has a visibly distinct enrichment (SQYG) known to occur in prion-like domains (PLD) 45 .
Using the Domains, Motifs, Repeats application we also found that the remaining compositionally more complex regions of the C-terminus (I287-L365 and R422-D453) match the PF00076 and PF00641 Pfam domains, i.e., the RNA recognition motif (RRM) and RNA binding zinc finger (ZnF), respectively. The robust predictions (for details see Methods) unanimously show that RRM is a well-folded FUS domain, while the other fragments remain disordered. The seed MSAs prepared for FUS within the Sequence Conservation application further confirm that both domains are evolutionarily conserved members of Pfam families: RRM_1 and zf-RanBP, respectively (see bottom rows in Figure 3B).
The visual inspection of the amino acid content and distribution of FUS allows us to identify and isolate specific regions in the protein (Figure 3C). Furthermore, we have performed a physicochemical featurization of these segments, which reveals preferred interactions when coupled with biomolecular conditionals known from experiments. Recent experimental reports showed that the isolated prion-like domain (PLD, 1-214 or even 1-163) can undergo self-organization forming liquid droplets when kept at high protein levels or high salt concentrations 46,47 . This N-terminal fragment is enriched in amino acids whose side chains are multivalent, as shown in Figure 3C. Thus, the dense pattern of polarity comes from enrichment in S, Q, Y, where Y, Q, and G also provide π-electron centers for π-π-stacking. Most of them are also able to be both donors and acceptors of side-chain protons for hydrogen bonding (HB). In line with this, the intermolecular interaction profiles derived from simulations of the 120-163 region indicated the most frequent contacts between QQ > QY > YY > SY and other pairs of enriched amino acids 48 . All of these observations suggest that homotypic phase separation of wild-type PLD monomers is driven by balanced contributions from hydrogen bonding and π-stacking. Indeed, several mutagenesis studies showed that the Y→A substitution disrupts phase separation by removing both components of the interaction, while Y→F mutants are significantly more aggregation-prone, due to strengthening of binding via tighter hydrophobic F-π-stacking at the cost of losing HB contributions of polar tyrosine 46,48 . It is also worth noting that the PLD region is completely deficient in positive charge, with a minor net charge per residue of -0.01 (M1-S165: -0.012 and M1-G212: -0.024), which places it within the weak polyelectrolytes region on the CIDER diagram (right panel in Figure 3A) 43 .
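The net charge per residue quoted above can be reproduced with a few lines of code. The sketch below assumes the common convention in which K and R count as +1, D and E as -1, and all other residues (including histidine) as neutral; the function names are hypothetical and do not come from BIAPSS or CIDER.

```python
def net_charge_per_residue(sequence: str) -> float:
    """Net charge per residue, assuming K/R = +1 and D/E = -1 (His neutral)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    positive = sum(sequence.count(aa) for aa in "KR")
    negative = sum(sequence.count(aa) for aa in "DE")
    return (positive - negative) / len(sequence)


def charged_fractions(sequence: str) -> tuple[float, float]:
    """Fractions of positively and negatively charged residues (f+, f-)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    f_plus = sum(sequence.count(aa) for aa in "KR") / len(sequence)
    f_minus = sum(sequence.count(aa) for aa in "DE") / len(sequence)
    return f_plus, f_minus


if __name__ == "__main__":
    toy = "SYGQSDSYGQ"  # toy fragment, not the FUS sequence
    print(round(net_charge_per_residue(toy), 3), charged_fractions(toy))
```

Under this convention, a fragment of 165 residues with two acidic and no basic residues would give -2/165 ≈ -0.012, which is consistent in magnitude with the value reported for M1-S165.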
However, an excess of serine and threonine in this region provides an ability to introduce a strongly negative charge through multiple phosphorylations. After phosphorylation, the dominant force becomes electrostatic repulsion, which is known to disrupt both phase separation and aggregation 38 . The central region of PLD (39-95) was proposed as the core of aberrant fibrils, which in solid-state form structured cross-β-sheets 38 . The same structural properties have not been unambiguously confirmed in the condensed phase of liquid-liquid demixing. Nevertheless, our algorithms detected along this region structural motifs known as low-complexity, amyloid-like, reversible, kinked segments (LARKS) 31 . In our analysis, the most effective predictors of structural properties showed for these motifs some tendency toward extended secondary structure and a slightly increased probability of burial (bottom panel in Figure 3C). Interestingly, the prediction in 8-letter notation detected a turn or bend within each of the structural motifs, which explains their flexible nature. These findings, together with ambiguous experimental results, may suggest some variations of structural state in the PLD core and specifically a disorder-to-order transition driven by biomolecular conditionals.
The remaining part of the FUS sequence, referred to as the C-terminus, contains two well-known domains (RRM and ZnF) and three glycine-arginine-rich regions (GARs). All components are significant players in binding RNA. The zinc finger supports only the recognition of the specific GGU motif, while the RRM domain and RGG repeats are universal towards a variety of RNAs 49 . Both folded domains of FUS are much less polar than PLD, as seen from the BIAPSS-based physicochemical featurization in Figure 3C. They also have a lower content of side chains that are able to engage in π-stacking or hydrogen bonding. However, charged residues are abundant in the composition of RRM and ZnF, which explains the functional role of electrostatic interactions towards the binding of nucleic acids or stabilizing folds via salt bridges 50,51 .
All three GARs are the least polar regions of the protein (see Figure 3C). The dense patterning of hydrophobicity arises from glycine excess. The rich π-electron-containing systems, other than aromatic side chains, originate mainly from the abundance of the arginine's guanidino group. Arginine is also a source of excess positive charge at the C-terminus. The experimental studies consistently confirm that the isolated C-terminus does not undergo phase separation 46 . However, liquid-liquid droplets rapidly occur when it is mixed with N-terminal monomers 46 . Moreover, the LLPS of full-length wild-type FUS is more robust than heterotypic mixing of N- and C-termini and homotypic self-assembly of N-terminal monomers 46 . This suggests the higher priority of cation-π (R-Y) stacking over π-π (Y-Y) stacking, while both are reinforced by hydrogen bonds. Another experimental study showed that R→K mutants, which no longer have the ability of π-π-stacking but retain charge, can still undergo phase separation. In turn, R→A substitutions prevent phase separation because they lose the π-system, cation, and ability of side-chain hydrogen bonding. Interestingly, a recent report indicates that stacking interactions, including cation-π (e.g., RY, KF) and especially π-π (e.g., YY and RY, and even RQ), are most robust over a wide range of salt concentrations 47 . The hydrophobic contribution from π-electron-containing systems becomes the main force that strengthens the contact in high salt. In these conditions, the screening of usually dominant electrostatic contributions is significant. Surprisingly, changing the partitioning of the different forces makes the interaction of the two positively charged arginines attractive under these conditions 47 . The set of diverse chemical groups in arginine is a unique feature among the amino acid side chains.
With its high reactivity comes the need for precise regulation, and so arginine can be tuned to a preferred state by posttranslational methylation. Thus, under physiological conditions, FUS is highly methylated 46 . This limits self-assembly via interactions with tyrosine and promotes a functional role of intermolecular interactions with other proteins and nucleic acids. Therefore, phase separation and gelation of FUS can increase by hypomethylation of arginines within RGG-rich regions or insertion of additional ones in the C-terminus 46 . All of these findings come together to demonstrate the significant role of the arginine side chain in phase separation. Tyrosine and glutamine are similarly relevant. They also contain multifunctional chemical groups that make them reactive and multivalent. These features aid in context-dependent tuning between preferred forces of interactions, which can work synergistically or alternatively. Their regulation depends on environmental conditions, the state of posttranslational modifications, and the presence of binding partners.
Case Study II: hnRNPDL, cellular behavior regulated by the priority of interactions
Heterogeneous nuclear ribonucleoproteins are a large and diverse family of proteins that play active roles at every stage of RNA regulation 52 . An important member of this family, the heterogeneous ribonucleoprotein D-like sequence of humans (hnRNPDL), occurs in the nucleus in three isoforms generated by alternative splicing. These isoforms differ by the presence or absence of structurally flexible terminal regions and, depending on this, phase separate (DL1, full-length), aggregate (DL2, missing N-terminus), or remain soluble (DL3, missing both N- and C-terminus) 53 . It is therefore thought that a unique combination of these regions is used for regulating the function and cellular behavior (phase state, mobility, location) of the isoforms. It is known that various short sequential or structural motifs, especially located in the disordered regions, enhance multivalent interactions and tune the ability of proteins to form higher-order assemblies 31,54 . It has been shown that even a single mutation in a key region of hnRNPDL can significantly promote an aberrant aggregation 53 . To understand the differences in self-assembly properties of different isoforms we carry out deep biophysical and bioinformatic analysis of full-length hnRNPDL by using BIAPSS SingleSEQ web applications.
The protein can be found in the BIAPSS online repository by the UniProt identifier (O14979), gene (hnRNPDL), or common name. The brief summary shows that the entry is covered in most primary LLPS databases except PhaSePro. Interestingly, we find a PSPredictor score of 0.27, which indicates a lack of phase separation, contrary to the experimental reports. The simplified biophysical characteristic of the sequence shows 49% disorder and 31% order consisting mostly of helical structure. Solvent accessible surface area analysis shows that 64% of the structure is likely exposed to the solvent. The contents of polar, hydrophobic, and aromatic residues are 30%, 42%, and 12%, respectively.
Using the Composition and Complexity app allows us to carry out a more in-depth analysis of the patterning of LLPS-promoting residues in the full-length sequence of hnRNPDL. We find the protein is enriched with polar (Y, Q) and positively charged (R) residues, and with π-electron-containing systems (Y, R, Q, G), at the expense of structure-promoting hydrophobic ones (I, L, V) with respect to the all, globular proteins, SwissProt sequence set, and even the disordered proteins collected in DisProt or LLPS-driving sequences collected in BIAPSS (Figure 4A). In particular, the agreed consensus of low-complexity (LCR) predictors shows that the G > Y > Q clusters are mostly localized in the C-terminal region, with enrichment typical for prion-like domains. The N-terminus, on the other hand, is enriched with P and R residues. The K residues are mostly localized in the linker of two folded domains located in the central part of the sequence. In line with these observations, the Shannon entropy reveals that the C-terminal region is much lower in complexity than the N-terminus; however, their amino acid composition indicates that both are still classified as LCRs. Additionally, robust predictions of structural disorder (taken from the Structural Disorder app) further confirm that both LCRs contain intrinsically disordered regions (IDRs). Specifically, IDR1:M1-Q145 is located on the N-terminus, while IDR2:V313-F350 and IDR3:S397-Y420 are located on the C-terminus. Thus we can already see that there is significant asymmetry between the termini of the protein, which is in harmony with experiments showing distinct physicochemical and functional implications for these two LCRs 53 .
For a more in-depth analysis of the sequence-specific nature of the two terminal LCRs and their physicochemical properties we next turn to the Chemical Properties Patterns app (Figure 4B). For the C-terminal region, we find a compact cluster of mostly polar and π-stacking (including many aromatic) residues which are capable of being a donor and/or acceptor of side-chain hydrogen bonds, thereby supporting attractive interactions and explaining the observed burial tendency in this region. More interestingly, when combining in Figure 4B the detailed outcome of structural properties and detected short motifs, we can easily correlate the burial bias with a predicted propensity for extended structure.
This region is located in between two IDRs (N347 to Y396) and aligns with the enrichment by Y that contains two LARKS-like structural motifs (GYDYTG: 376-381 and GYADYS: 392-397). Furthermore, the unstructured and exposed flanking IDRs (IDR2 and IDR3) correlate well with the G-rich region (G323-G411), where several glycine-arginine motifs (GARs) have been detected. The LARKS motifs are known to shape cross-β-sheets through weak attractive interactions, including π-stacking, van der Waals and hydrogen bonding, which are all possible according to the amino acid composition of the C-terminus 31 .
Notably, despite its apparent similarity to the steric zipper, the LARKS motif has a smaller buried interface and looser packing of the side chains between the mating sheets, which means weaker binding force and higher flexibility. In this case, the enrichment with flexible glycine, aromatic tyrosine, and polar glutamine effectively supports multivalent interactions since all of them can engage in π-stacking and form hydrogen bonds (Y and Q also through their side chains, being both donor and acceptor of a proton). Generally, the resultant of binding forces and environmental conditions tends to decide whether such low-complexity regions, decorated with some functional motifs, remain disordered, undergo phase separation, or even aggregate. Our findings are also very much in line with the ATR-FTIR experiments of Taylor-Ventura et al., showing that the sequence isoform DL2, possessing only the C-terminal of the two LCRs, leads to aggregates containing intermolecular β-sheet 53 . For the N-terminal region, we find that the content of residues capable of π-stacking and side-chain hydrogen bonding is much smaller, which together with the enrichment in positively charged residues (mostly R) and proline (known to be a secondary structure breaker) leads to a reduction in the number of possible short-range interactions. As a result, this region seems unstructured (with a slight tendency towards short α-helices) and almost completely solvent-exposed, consistent with the predictions of local structure and solvent accessibility (Figure 4B). When analyzing the predicted contact map on the left panel in Figure 4C (taken from the Contact Map app), again we can easily point out the differences between the LCRs at the two ends of the protein.
In particular, despite the enhanced intra-regions contacts for domains, the algorithms detected Q320-G410 inter-region with the slight probability of internal contacts (score ~ 0.35 while contact threshold is ⩾0.5), which covers almost the entire C-terminal LCR, while the N-terminus appears without any internal contacts. Interestingly, using the outcome of the Structural Disorder app (row "BINDING" in Figure 4B), we see that the N-terminal region despite being structurally featureless is detected by the ANCHOR method as a protein binding region (A20-I95) with three picks above 0.8 score.
This region coincides with a region enriched by arginine, which is known to be abundant in the LLPSdriving motifs and RNA binding regions 44 . Therefore, we suspect that the experimentally observed dissolution of DL1 droplets in presence of highly concentrated RNA 53 , is most likely due to strong electrostatic attraction between negatively charged nucleic acid and positively charged arginines. When the environment lacks ionic interaction partners, the DL1 isoform undergoes phase separation driven by weaker but still favorable π-cation interactions between C-terminal tyrosines and N-terminal arginines.
And nally, the absence of the R-rich N-terminus in the isoform DL2 leads to bril formation, driven by ππ-stacking of C-terminal tyrosines in two monomers.
Thus, using a programming analogy, we can say that the nal behavior of hnRNPDL isoforms appears as the result of a biomolecular algorithm composed of a series of if-then-else conditionals, depending on the presence of N-terminal and environmental conditions. Furthermore, the experiments con rmed that substitution of cation (R K) or/and aromatic system (Y F) signi cantly weakens the phase separation of DL1 53 , which we think, points to the stronger matching of the R-Y pair. This complementarity results from the high cooperativity of multivalent interactions, in addition to the π-cation component also includes π-πstacking (CZ=NH1 π-electrons donated by arginine compared to lysine) and side-chain hydrogen bonding (polar hydroxyl donated by tyrosine compared to nonpolar phenylalanine and multiple protons bonded to nitrogens donated by arginine).
Finally, while understandably the focus tends to be on the low-complexity regions decorated with some binding and/or switching structural properties motifs, it is nevertheless also important to consider the well-folded domains, especially in cases like hnRNPD where they play a mediator between the two IDRs.

Page 13/33

Discussion
Case Study I: FUS, LLPS regulated in the context-dependent tuning of preferred forces
Fused in sarcoma (FUS) is one of the earliest discovered biological systems undergoing self-organization by liquid-liquid phase separation (LLPS) 38 . Since then, the protein has been the subject of extensive experimental and computational research to understand the molecular mechanisms and interactions driving this phenomenon. FUS can be found in the BIAPSS service by the UniProt identifier (P35637), gene (FUS), or using the "RNA-binding" search key. The summary page contains a high-quality image of the experimentally confirmed cellular location (left panel in Figure 3A). Due to its multifunctionality in RNA processing, FUS is mostly observed in the nucleus 39 . This progressively aberrant process is manifested by neurodegenerative diseases in humans 42 . Although plenty of accumulated evidence points to the influence of distinct factors on the cellular behavior of FUS, its primary sequence still holds many cues. To frame the physicochemical properties of full-length FUS, we used the analytical approach offered by the SingleSEQ module of BIAPSS.
The average metrics indicate that the 526-residue sequence of FUS contains over 80% disorder and only 8% order. The solvent accessibility predictions show the same ratio between exposure and burial. The contents of aromatic, hydrophobic, polar, and charged residues are 10%, 42%, 40%, and 17%, respectively, with a slight excess of positive charge. Such a rough overview, described by a set of averages, gives some general insight into the protein properties but conceals local distributions that are important for the identification of the preferential interactions. Therefore, we conducted a detailed analysis of the composition and complexity of the FUS sequence and present the resulting patterns in Figure 3B. Compared to any reference set of proteins, FUS is extremely enriched in glycine, which makes up nearly 1/3 of the full sequence. Another 20% of the amino acid content consists primarily of serine and glutamine. Although the dominant content of these three amino acids suggests a generally low complexity of the sequence, their distribution along the sequence is strongly heterogeneous. Indeed, the calculated low information content of the sequence is mainly localized around the protein termini and clearly corresponds to three fragments with high glycine concentration (LCR2: 164-267, LCR3: 370-420, LCR4: 454-507). These regions also accumulate nearly all of the arginines, which together with glycine form a series of repeating RGG motifs known to bind RNA specifically 44 . Both serine and glutamine are mostly localized at the N-terminus, being more clearly clustered within LCR1 (1-163). LCR1 additionally gathers 24 of the 35 available tyrosines and thus has a visibly distinct enrichment (SQYG) known to occur in prion-like domains (PLD) 45 .
Using the Domains, Motifs, Repeats application, we also found that the remaining, compositionally more complex regions of the C-terminus (I287-L365 and R422-D453) match the PF00076 and PF00641 Pfam domains, i.e., the RNA recognition motif (RRM) and the RNA-binding zinc finger (ZnF), respectively. The robust predictions (for details see Methods) unanimously show that RRM is a well-folded FUS domain, while the other fragments remain disordered. The seed MSAs prepared for FUS within the Sequence Conservation application further confirm that both domains are evolutionarily conserved members of the Pfam families RRM_1 and zf-RanBP, respectively (see bottom rows in Figure 3B).
The visual inspection of the amino acid content and distribution of FUS allows us to identify and isolate specific regions in the protein (Figure 3C). Furthermore, we have performed a physicochemical featurization of these segments, which reveals preferred interactions when coupled with biomolecular conditionals known from experiments. Recent experimental reports showed that the isolated prion-like domain (PLD, 1-214 or even 1-163) can undergo self-organization, forming liquid droplets when kept at high protein levels or high salt concentrations 46,47 . This N-terminal fragment is enriched in amino acids whose side chains are multivalent, as shown in Figure 3C. Thus, the dense pattern of polarity comes from enrichment in S, Q, and Y, where Y, Q, and G also provide π-electron centers for π-π stacking. Most of them are also able to be both donors and acceptors of side-chain protons for hydrogen bonding (HB). In line with this, the intermolecular interaction profiles derived from simulations of the 120-163 region indicated that the most frequent contacts are QQ > QY > YY > SY, followed by other pairs of enriched amino acids 48 . All of these observations suggest that homotypic phase separation of wild-type PLD monomers is driven by balanced contributions from hydrogen bonding and π-stacking. Indeed, several mutagenesis studies showed that the Y→A substitution disrupts phase separation by removing both components of the interaction, while Y→F mutants are significantly more aggregation-prone, owing to tighter hydrophobic F-π stacking that strengthens binding at the cost of losing the HB contributions of polar tyrosine 46,48 . It is also worth noting that the PLD region is completely deficient in positive charge, with a minor net charge per residue of -0.01 (M1-S165: -0.012 and M1-G212: -0.024), which places it within the weak polyelectrolytes region on the CIDER diagram (right panel in Figure 3A) 43 .
However, an excess of serine and threonine in this region provides the ability to introduce a strongly negative charge through multiple phosphorylations. After phosphorylation, the dominant force becomes electrostatic repulsion, which is known to disrupt both phase separation and aggregation 38 . The central region of the PLD (39-95) was proposed as the core of aberrant fibrils, which in the solid state form structured cross-β-sheets 38 . The same structural properties have not been unambiguously confirmed in the condensed phase formed by liquid-liquid demixing. Undoubtedly, however, our algorithms detected along this region structural motifs known as low-complexity, amyloid-like, reversible, kinked segments (LARKS) 31 . In our analysis, the most effective predictors of structural properties showed for these motifs some tendency toward extended secondary structure and a slightly increased probability of burial (bottom panel in Figure 3C). Interestingly, the prediction in 8-letter notation detected a turn or bend within each of the structural motifs, which explains their flexible nature. These findings, together with the ambiguous experimental results, may suggest some variation of the structural state in the PLD core and specifically a disorder-to-order transition driven by biomolecular conditionals.
The remaining part of the FUS sequence, referred to as the C-terminus, contains two well-known domains (RRM and ZnF) and three glycine-arginine-rich regions (GARs). All components are significant players in binding RNA. The zinc finger supports only the recognition of the specific GGU motif, while the RRM domain and RGG repeats are universal towards a variety of RNAs 49 . Both folded domains of FUS are much less polar than the PLD, as seen from the BIAPSS-based physicochemical featurization in Figure 3C. They also have a lower content of side chains that are able to engage in π-stacking or hydrogen bonding. However, charged residues are quite abundant in the composition of RRM and ZnF, which explains the functional role of electrostatic interactions in binding nucleic acids or stabilizing folds via salt bridges 50,51 .
All three GARs are the least polar regions of the protein (see Figure 3C). The dense patterning of hydrophobicity arises from the glycine excess. The rich π-electron-containing systems, other than aromatic side chains, originate mainly from the abundant guanidino groups of arginine. Arginine is also the source of the excess positive charge at the C-terminus. Experimental studies consistently confirm that the isolated C-terminus does not undergo phase separation 46 . However, liquid droplets rapidly appear when it is mixed with N-terminal monomers 46 . Moreover, the LLPS of full-length wild-type FUS is more robust than the heterotypic mixing of N- and C-terminal fragments and the homotypic self-assembly of N-terminal monomers 46 . This suggests a higher priority of cation-π (R-Y) stacking over π-π (Y-Y) stacking, while both are reinforced by hydrogen bonds. Another experimental study showed that R→K mutants, which no longer have the ability of π-π stacking but retain the charge, can still undergo phase separation. In turn, R→A substitutions prevent phase separation because they lose the π-system, the cation, and the ability of side-chain hydrogen bonding. Interestingly, a recent report indicates that stacking interactions, including cation-π (e.g., RY, KF) and especially π-π (e.g., YY and RY, and even RQ), are the most robust over a wide range of salt concentrations 47 . The hydrophobic contribution from π-electron-containing systems becomes the main force that strengthens the contact in high salt. Under these conditions, the screening of the usually dominant electrostatic contributions is significant. Surprisingly, this change in the partitioning of the different forces makes the interaction of two positively charged arginines attractive under these conditions 47 . The set of diverse chemical groups in arginine is unique among the amino acid side chains.
With high reactivity comes the need for precise regulation, and so arginine can be tuned to a preferred state by posttranslational methylation. Thus, under physiological conditions, FUS is highly methylated 46 . This limits self-assembly via interactions with tyrosine and promotes a functional role of intermolecular interactions with other proteins and nucleic acids. Therefore, phase separation and gelation of FUS can be enhanced by hypomethylation of arginines within RGG-rich regions or by insertion of additional arginines in the C-terminus 46 . All of these findings come together to demonstrate the significant role of the arginine side chain in phase separation. Tyrosine and glutamine are similarly relevant. They also contain multifunctional chemical groups that make them reactive and multivalent. These features aid in the context-dependent tuning between preferred forces of interaction. They can work synergistically or alternatively, and their regulation depends on environmental conditions, the state of posttranslational modifications, and the presence of binding partners.

Case Study II: hnRNPDL, cellular behavior regulated by the priority of interactions
Heterogeneous nuclear ribonucleoproteins are a large and diverse family of proteins that play active roles at every stage of RNA regulation 52 . An important member of this family, the human heterogeneous nuclear ribonucleoprotein D-like protein (hnRNPDL), occurs in the nucleus in three isoforms generated by alternative splicing. These isoforms differ by the presence or absence of structurally flexible terminal regions and, depending on this, phase separate (DL1, full-length), aggregate (DL2, missing the N-terminus), or remain soluble (DL3, missing both the N- and C-terminus) 53 . It is therefore thought that a unique combination of these regions is used for regulating the function and cellular behavior (phase state, mobility, location) of the isoforms. It is known that various short sequential or structural motifs, especially those located in the disordered regions, enhance multivalent interactions and tune the ability of proteins to form higher-order assemblies 31,54 . It has been shown that even a single mutation in a key region of hnRNPDL can significantly promote aberrant aggregation 53 . To understand the differences in the self-assembly properties of the isoforms, we carried out a deep biophysical and bioinformatic analysis of full-length hnRNPDL using the BIAPSS SingleSEQ web applications.
The protein can be found in the BIAPSS online repository by the UniProt identifier (O14979), gene (hnRNPDL), or common name. The brief summary shows that the entry is covered in most primary LLPS databases except PhaSePro. Interestingly, we find a PSPredictor score of 0.27, which indicates a lack of phase separation, contrary to the experimental reports. The simplified biophysical characteristic of the sequence shows 49% disorder and 31% order, consisting mostly of helical structure. Solvent accessible surface area analysis shows that 64% of the structure is likely exposed to the solvent. The contents of polar, hydrophobic, and aromatic residues are 30%, 42%, and 12%, respectively.
Using the Composition and Complexity app allows us to carry out a more in-depth analysis of the patterning of LLPS-promoting residues in the full-length sequence of hnRNPDL. We find that the protein is enriched in polar (Y, Q) and positively charged (R) residues and in π-electron-containing systems (Y, R, Q, G [1] ) at the expense of structure-promoting hydrophobic ones (I, L, V) with respect to all reference sets: globular proteins, the SwissProt sequence set, and even the disordered proteins collected in DisProt or the LLPS-driving sequences collected in BIAPSS (Figure 4A). In particular, the consensus of low-complexity region (LCR) predictors shows that the G > Y > Q clusters are mostly localized in the C-terminal region, with an enrichment typical for prion-like domains. The N-terminus, on the other hand, is enriched in P and R residues. The K residues are mostly localized in the linker between the two folded domains located in the central part of the sequence. In line with these observations, the Shannon entropy reveals that the C-terminal region is much lower in complexity than the N-terminus; however, their amino acid composition indicates that both are still classified as LCRs. Additionally, robust predictions of structural disorder (taken from the Structural Disorder app) further confirm that both LCRs contain intrinsically disordered regions (IDRs). Specifically, IDR1 (M1-Q145) is located at the N-terminus, while IDR2 (V313-F350) and IDR3 (S397-Y420) are located at the C-terminus. Thus, we can already see a significant asymmetry between the termini of the protein, which is in harmony with experiments showing distinct physicochemical and functional implications for these two LCRs 53 .
[1] It has been shown that glycine, owing to the lack of a side chain, can stack via the π-electrons of the peptide bond and form hydrogen bonds via the backbone carbonyl or amide 55 .
For a more in-depth analysis of the sequence-specific nature of the two terminal LCRs and their physicochemical properties, we next turn to the Chemical Properties Patterns app (Figure 4B). For the C-terminal region, we find a compact cluster of mostly polar and π-stacking (including many aromatic) residues that are capable of being donors and/or acceptors of side-chain hydrogen bonds, thereby supporting attractive interactions and explaining the observed burial tendency in this region. More interestingly, when combining in Figure 4B the detailed outcome of structural properties and detected short motifs [1] , we can easily correlate the burial bias with a predicted propensity for extended structure. This region is located between two IDRs (N347 to Y396) and aligns with the Y-enriched stretch that contains two LARKS-like structural motifs (GYDYTG: 376-381 and GYADYS: 392-397). Furthermore, the unstructured and exposed flanking IDRs (IDR2 and IDR3) correlate well with the G-rich region (G323-G411), where several glycine-arginine motifs (GARs) have been detected. The LARKS motifs are known to shape cross-β-sheets through weak attractive interactions, including π-stacking, van der Waals interactions, and hydrogen bonding, which are all possible according to the amino acid composition of the C-terminus 31 . Notably, despite its apparent similarity to the steric zipper, the LARKS motif has a smaller buried interface and looser packing of the side chains between the mating sheets, which means a weaker binding force and higher flexibility. In this case, the enrichment in flexible glycine, aromatic tyrosine, and polar glutamine effectively supports multivalent interactions, since all of them can engage in π-stacking and form hydrogen bonds (Y and Q also through their side chains, being both donors and acceptors of a proton).
Generally, the balance of binding forces and environmental conditions tends to decide whether such low-complexity regions, decorated with some functional motifs, remain disordered, undergo phase separation, or even aggregate. Our findings are also very much in line with the ATR-FTIR experiments of Taylor-Ventura et al., showing that the isoform DL2, which possesses only the C-terminal of the two LCRs, forms aggregates containing intermolecular β-sheets 53 .
For the N-terminal region, we find that the content of residues capable of π-stacking and side-chain hydrogen bonding is much smaller, which, together with the enrichment in positively charged residues (mostly R) and proline (known to be a secondary structure breaker), leads to a reduction in the number of possible short-range interactions. As a result, this region seems unstructured (with a slight tendency towards short α-helices) and almost completely solvent-exposed, consistent with the predictions of local structure and solvent accessibility (Figure 4B). When analyzing the predicted contact map in the left panel of Figure 4C (taken from the Contact Map app), we can again easily point out the differences between the LCRs at the two ends of the protein. In particular, apart from the enhanced intra-region contacts for the domains, the algorithms detected the Q320-G410 region with a slight probability of internal contacts (score ~ 0.35, while the contact threshold is ⩾0.5), which covers almost the entire C-terminal LCR, while the N-terminus appears without any internal contacts. Interestingly, using the outcome of the Structural Disorder app (row "BINDING" in Figure 4B), we see that the N-terminal region, despite being structurally featureless, is detected by the ANCHOR method as a protein-binding region (A20-I95) with three peaks above a score of 0.8. This region coincides with a region enriched in arginine, which is known to be abundant in LLPS-driving motifs and RNA-binding regions 44 .
Therefore, we suspect that the experimentally observed dissolution of DL1 droplets in the presence of highly concentrated RNA 53 is most likely due to strong electrostatic attraction between the negatively charged nucleic acid and the positively charged arginines. When the environment lacks ionic interaction partners, the DL1 isoform undergoes phase separation driven by weaker but still favorable π-cation interactions between C-terminal tyrosines and N-terminal arginines. And finally, the absence of the R-rich N-terminus in the isoform DL2 leads to fibril formation, driven by π-π stacking of C-terminal tyrosines in two monomers.
Thus, using a programming analogy, we can say that the final behavior of hnRNPDL isoforms appears as the result of a biomolecular algorithm composed of a series of if-then-else conditionals, depending on the presence of the N-terminus and on environmental conditions. Furthermore, the experiments confirmed that substitution of the cation (R→K) and/or the aromatic system (Y→F) significantly weakens the phase separation of DL1 53 , which, we think, points to the stronger matching of the R-Y pair. This complementarity results from the high cooperativity of multivalent interactions, which in addition to the π-cation component also include π-π stacking (CZ=NH1 π-electrons donated by arginine compared to lysine) and side-chain hydrogen bonding (polar hydroxyl donated by tyrosine compared to nonpolar phenylalanine, and multiple protons bonded to nitrogens donated by arginine).
[1] The "II-STRUCT.", "SA", and "MOTIFS" rows in Figure 3B are taken from the outcome of the Secondary Structure, Solvent Accessibility, and Domains, Motifs, and Repeats apps, respectively.
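The if-then-else analogy used in this case study can be written out literally. The Python sketch below only restates the qualitative observations discussed above (DL1 phase separates, its droplets dissolve at high RNA concentration, DL2 forms fibrils, DL3 stays soluble). The function name and its boolean arguments are hypothetical, and the rule is an illustration rather than a predictive model.

```python
def hnrnpdl_behavior(has_n_term: bool, has_c_term: bool, high_rna: bool = False) -> str:
    """Toy decision rule paraphrasing the hnRNPDL case study (illustration only)."""
    if has_n_term and has_c_term:
        # DL1, full-length: R-rich N-terminus plus Y-rich C-terminus.
        if high_rna:
            # Arginines are assumed to be engaged by RNA, so droplets dissolve.
            return "dissolved droplets"
        return "phase separation"
    if has_c_term:
        # DL2, N-terminus missing: tyrosine stacking without arginine partners.
        return "aggregation"
    if has_n_term:
        # This combination is not discussed in the case study.
        return "not described"
    # DL3, both termini missing: only the folded domains remain.
    return "soluble"


for isoform, args in {"DL1": (True, True), "DL2": (False, True), "DL3": (False, False)}.items():
    print(isoform, "->", hnrnpdl_behavior(*args))
```

The nesting order mirrors the text: the presence of the N-terminus is checked first, and the environmental condition (RNA concentration) only matters for the full-length isoform.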
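Finally, while the focus understandably tends to be on the low-complexity regions decorated with motifs that confer binding and/or structural-switching properties, it is nevertheless also important to consider the well-folded domains, especially in cases like hnRNPDL, where they act as a mediator between the two IDRs. The Domains, Motifs, Repeats app reveals two HMMER-detected Pfam domains (both with ID: PF00076) located in the M151-D219 and V235-I305 regions (Figure 4A, 4B). The presence of the domains is also well captured by the contact map prediction, which reveals details about likely binding activities between distinct regions of the sequence (left panel in Figure 4C). Both the predictors of secondary structure and the known structures (PDBs) assigned to the sequences used in the Pfam seed-MSA indicate that each domain has a well-structured globular α+β topology (composed of 2 α-helices and a single β-sheet), which explains the solubility of the third isoform (DL3), which lacks both flexible ends. The list of corresponding PDBs can be found on the interactive labels assigned to each sequence in the Pfam seed-MSA in the Sequence Conservation & MSA app (e.g., 1P1T, residues: 18-88), and/or by following the link of the "STRUCTURE" button on the main page of the SingleSEQ tab. Specifically, for human hnRNPDL (see the right panel in Figure 4C), BIAPSS provides a direct link to the entry in the AlphaFold database, which collects high-quality structure predictions that have proved competitive with experimental ones 26 .
Therefore, the lower the Shannon entropy, the less complex the sequence is.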
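The relation between Shannon entropy and sequence complexity stated above can be illustrated with a sliding-window calculation. This is a minimal sketch, assuming entropy in bits (log base 2) over the amino acid frequencies of each window; the window length of 20 and the function names are illustrative choices, not the settings used by BIAPSS.

```python
import math
from collections import Counter


def shannon_entropy(fragment: str) -> float:
    """Shannon entropy (bits) of the amino acid composition of a fragment."""
    counts = Counter(fragment)
    n = len(fragment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_profile(seq: str, window: int = 20) -> list:
    """Entropy of every full window along the sequence (assumed window size)."""
    return [shannon_entropy(seq[i:i + window]) for i in range(len(seq) - window + 1)]


# A glycine-only stretch has zero entropy; a mixed stretch scores higher.
print(shannon_entropy("GGGGGGGG"))
print(shannon_entropy("GGSY"))
```

Windows dominated by one or two residue types, such as glycine-rich stretches, yield values near zero and would be flagged as low-complexity under any reasonable cutoff.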

Physicochemical decoration
To examine the physicochemical properties of LLPS-driving proteins, we identified along each sequence the patterns of polarity (Ser, Thr, Tyr, Gln, Asn, Cys, Met) and hydrophobicity (Gly, Ala, Val, Ile, Leu, Pro, Phe), and detected π-stacking centers (Arg, Asn, Asp, Gln, Glu, Gly*), including those within aromatic rings (Phe, Tyr, Trp, His). We also provide the charge distributions split between positively (His, Lys, Arg) and negatively (Glu, Asp) charged residues. For each feature, both the arrangement along the sequence and the fraction of residues are provided.
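The residue classes listed above translate directly into per-residue masks and fractions. The sketch below is one possible way to compute them; the class memberships are taken from the text, while the dictionary keys, the function name, and the output layout are assumptions made for illustration.

```python
# Residue classes as listed in the text (one-letter codes).
CLASSES = {
    "polar": set("STYQNCM"),
    "hydrophobic": set("GAVILPF"),
    "pi_nonaromatic": set("RNDQEG"),
    "pi_aromatic": set("FYWH"),
    "positive": set("HKR"),
    "negative": set("ED"),
}


def decorate(seq: str) -> dict:
    """Return, for each class, the per-residue mask and the residue fraction."""
    seq = seq.upper()
    result = {}
    for name, members in CLASSES.items():
        mask = [aa in members for aa in seq]
        result[name] = {"mask": mask, "fraction": sum(mask) / len(seq)}
    return result


features = decorate("RGGY")
print({name: data["fraction"] for name, data in features.items()})
```

Because a residue can belong to several classes at once (for example, tyrosine is both polar and aromatic), the fractions are not expected to sum to one.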

Electrostatics
It is well established that electrostatic interactions often affect the solubility and stabilize the binding interface in the liquid-liquid demixing of biomolecules. The recently proposed charge decoration parameters emerged as a measure of charge distribution along the protein sequence. In addition to the overall charge content, these descriptors are seen as important factors shaping protein conformations, especially within low-complexity regions 18 . Following these discoveries, we calculated and compared the following charge decoration parameters:
SCD, sequence charge decoration, implemented following the formulation by Sawle & Ghosh 63
OCS, overall charge symmetry, implemented following the formulation by Das & Pappu 64
FCR, the fraction of charged residues, defined as the sum of the fractions of positive and negative charges.
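Two of these parameters can be sketched in a few lines. The code below assumes that SCD takes the commonly quoted form SCD = (1/N) Σ_{i<j} q_i q_j |j - i|^(1/2) and that every charged residue carries a unit charge; treating histidine as fully positive follows the charge classes used in this document and is a simplification. OCS is omitted because its exact formulation is not given here. All names are illustrative.

```python
import math

# Assumption: unit charges by residue type (His, Lys, Arg = +1; Asp, Glu = -1).
DEFAULT_CHARGES = {"H": 1, "K": 1, "R": 1, "D": -1, "E": -1}


def charge_vector(seq, charges=DEFAULT_CHARGES):
    """Per-residue charges; uncharged residues map to 0."""
    return [charges.get(aa, 0) for aa in seq.upper()]


def fcr(seq, charges=DEFAULT_CHARGES):
    """Fraction of charged residues (positive plus negative fractions)."""
    q = charge_vector(seq, charges)
    return sum(1 for x in q if x != 0) / len(q)


def scd(seq, charges=DEFAULT_CHARGES):
    """Sequence charge decoration under the assumed pairwise formula."""
    q = charge_vector(seq, charges)
    total = 0.0
    for j in range(1, len(q)):
        if q[j] == 0:
            continue
        for i in range(j):
            total += q[i] * q[j] * math.sqrt(j - i)
    return total / len(q)


print(fcr("KGE"), scd("KGE"))
```

Under this definition, well-mixed opposite charges give negative SCD values, whereas a sequence carrying only like charges gives a positive value.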

HMMER-based sequence conservation and functional domains detection
The multiple sequence alignment (MSA) and consensus profile were prepared using the efficient HMMER method (phmmer + hmmalign and hmmbuild, respectively), which employs a probabilistic hidden Markov model (HMM) 29 and is significantly more accurate than BLAST-based searches. Because some of the LLPS sequences are highly unique (so detection of remote homologs is needed) and because an MSA is reliable only if at least several dozen homologous sequences are available, we used sequences selected from various UniProt subsets. Specifically, SwissProt, UniRef50, and UniRef90 differ in size and in the increasing sequence identity of entries 19 . To identify sequence regions with significant evolutionary conservation, we derived three additional MSA-based parameters: strength, diversity, and character. The MSA strength of the sequence conservation indicates how strongly a specific position is preserved by evolution. This measure normalizes results from the hmmlogo tool to a discrete range from 0 (poorly conserved) to 5 (highly conserved). The hmmlogo tool computes letter heights along the sequence depending on the information content of each position. The MSA diversity defines the number of different amino acids detected at a given position in the MSA and is provided on a discrete scale from 0 (highly conserved) to 5 (poorly conserved): 0 - one, 1 - two, 2 - three, 3 - four, 4 - five or six, 5 - seven or more amino acids at the aligned position. The MSA character describes the chemical nature of the most common amino acids at a given position in the multiple sequence alignment. We distinguished the following attributes: polar, charged, aromatic, another π-system, hydrophobic, other (G or P).
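The discrete MSA diversity scale described above can be expressed as a small mapping from the number of distinct amino acids in an alignment column to a score. This is a sketch under two assumptions that the text does not specify: gap characters ("-" and ".") are ignored, and a column consisting only of gaps is scored 0. Function names are illustrative.

```python
def diversity_score(column: str, gap_chars: str = "-.") -> int:
    """Map the number of distinct residues in one MSA column to the 0-5 scale."""
    distinct = {aa for aa in column.upper() if aa not in gap_chars}
    n = len(distinct)
    if n <= 1:
        return 0          # one residue type (or, by assumption, only gaps)
    if n <= 4:
        return n - 1      # two -> 1, three -> 2, four -> 3
    if n <= 6:
        return 4          # five or six
    return 5              # seven or more


def msa_diversity(alignment: list) -> list:
    """Score every column of an alignment given as equal-length strings."""
    return [diversity_score("".join(col)) for col in zip(*alignment)]


print(msa_diversity(["GYS", "GFT", "GWA"]))
```

In the toy alignment the first column is invariant and scores 0, while the other two columns each contain three residue types and score 2.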
Some LLPS proteins are composed of one or more well-known domains. The identification of these functional regions alongside regions of low complexity or disorder can provide additional insights into the regulatory role of phase separation. Therefore, we have performed a Pfam search for all LLPS proteins, reporting the detected domains and incorporating the original Pfam seed-MSAs for the corresponding regions of each LLPS sequence (instead of full-length ones) to derive more reliable evolutionary conservation descriptors.

Short sequential and structural motifs specific for LLPS sequences
Short Linear Motifs (SLiMs) are short fragments along the sequence, often situated in intrinsically disordered regions, that generally show high structural flexibility and evolutionary conservation. We systematically detected various short sequential and structural motifs. The implemented algorithms used lists of grouped motif instances defined by regular expressions as the keys to search protein sequences prone to phase separation. Among motifs known from the literature as relevant for phase behavior, our analysis includes short structural stretches of protein sequence such as LARKS 31 and steric zippers 32 , glycine-arginine-rich regions (GARs) 44 , S--D/S--E motifs specific for phosphosites 65 , and new sequential repetitive n-mers. In addition, the eukaryotic linear motifs (ELMs) 66 are indicated along the sequence, with discrimination of the main ELM classes.
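A regular-expression motif scan of the kind described above might look as follows. The two patterns are hypothetical placeholders chosen for illustration (an RGG tripeptide and a G-aromatic-G tripeptide); they are not the motif definitions used by BIAPSS. A lookahead wrapper is used so that overlapping instances are also reported.

```python
import re

# Hypothetical example patterns, not the BIAPSS motif library.
EXAMPLE_MOTIFS = {
    "RGG_repeat": r"RGG",
    "G-aromatic-G": r"G[YF]G",
}


def find_motifs(seq: str, motifs: dict = EXAMPLE_MOTIFS) -> list:
    """Return (name, start, end, match) tuples with 1-based inclusive positions."""
    hits = []
    for name, pattern in motifs.items():
        # Zero-width lookahead with a capture group finds overlapping matches.
        for m in re.finditer(f"(?=({pattern}))", seq):
            hits.append((name, m.start(1) + 1, m.end(1), m.group(1)))
    return sorted(hits, key=lambda hit: hit[1])


for hit in find_motifs("GRGGRGGYGYG"):
    print(hit)
```

Positions are reported in the 1-based residue numbering used throughout the case studies, so a hit can be compared directly with ranges such as 376-381.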

Structural properties derived from sequence-based predictions
Bearing in mind the predictive nature of sequence-based methods and, hence, their limited accuracy, comparing several of them and choosing the final consensus has proven successful in many approaches 67 . In our study, we combined predictions from at least three to six widely used tools for each biomolecular characteristic. While almost every method is available as a web server, due to the size and complexity of our analyses, we employed standalone versions. The raw data derived from these standalone tools during high-performance computing were initially parsed, filtered, and simplified to a uniform CSV format and deposited in our online repository at https://biapss.chem.iastate.edu/download.html.
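The idea of taking a consensus over several predictors can be illustrated with a per-residue majority vote. This is only a sketch: the text does not state how the BIAPSS consensus is formed, so the strict-majority rule and the "?" placeholder for positions without a majority are assumptions, as are the function and argument names.

```python
from collections import Counter


def consensus(predictions: list, undecided: str = "?") -> str:
    """Per-position strict-majority vote over equal-length label strings."""
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    result = []
    for column in zip(*predictions):
        label, votes = Counter(column).most_common(1)[0]
        # Keep the label only if more than half of the tools agree (assumption).
        result.append(label if votes > len(predictions) / 2 else undecided)
    return "".join(result)


# Three hypothetical secondary-structure predictions in 3-letter notation.
print(consensus(["HHCC", "HCCC", "HHCE"]))
```

With three tools, a label is retained whenever at least two of them agree, which is why disagreement at a single position by one tool does not change the outcome.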

Solvent accessibility
Solvent accessibility gives some insight into protein structural flexibility, indicating the exposed patches on the protein surface available for interactions with solvent molecules. Some surface sites have high evolutionary conservation, which is suggestive of functional or structural importance. Since not many structures of phase-separating IDPs are known, the robust prediction of solvent accessibility can help to identify flexible regions prone to conformational changes upon binding. The assignment of solvent accessibility is usually provided in a 3-letter code: B - buried, E - exposed, M - medium. In our benchmark study, we employed three well-established solvent accessibility predictors: RaptorX 72 , PaleAle5 68 , and SPOT-1D 73 .

Structural disorder
The sequence-based predictions indicate regions of increased structural flexibility, usually estimating the disorder probability at a given position in the sequence. Detecting highly flexible regions may support the identification of short sequence stretches of multivalent interactions that can be relevant to phase separation. In our benchmark study, we employed seven well-established predictors of structural disorder. Usually, a residue is considered ordered when the score is below 0.5. The protein-binding regions in disordered fragments were estimated using the ANCHOR method 81 .
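The 0.5 cutoff mentioned above can be turned into residue ranges such as the IDR1-IDR3 segments quoted in the case study. The following sketch assumes a list of per-residue disorder scores from a single predictor; the function name and the optional minimum segment length are illustrative additions.

```python
def disordered_regions(scores, threshold=0.5, min_length=1):
    """Return 1-based inclusive (start, end) segments with score >= threshold."""
    regions = []
    start = None
    for pos, score in enumerate(scores, start=1):
        if score >= threshold:
            if start is None:
                start = pos              # a disordered segment opens here
        elif start is not None:
            if pos - start >= min_length:
                regions.append((start, pos - 1))
            start = None                 # the segment closed at pos - 1
    # Close a segment that runs to the end of the sequence.
    if start is not None and len(scores) - start + 1 >= min_length:
        regions.append((start, len(scores)))
    return regions


print(disordered_regions([0.9, 0.8, 0.2, 0.6, 0.7, 0.1]))
```

Raising `min_length` suppresses isolated high-scoring residues, which is one simple way to keep only stretches long enough to be treated as regions.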

Contact map
The Contact Map application provides a reduced representation of a protein structure using a binary two-dimensional matrix derived from the distances between all possible amino acid residue pairs. The commonly used definition takes a threshold of 6-10 Å on the distance between a pair of Cα or Cβ atoms for the two residues to be in contact. The contact number of protein residues limits the number of possible protein conformations and helps encode the three-dimensional structure. In our benchmark study, we employed three state-of-the-art
side-chain proton for hydrogen bonding, π-electron-containing systems (blue) with separation of aromatic ones (dark-blue), hydrophobicity, predicted solvent accessibility (SA) in 3-letter notation (magenta - exposed, green - buried, blue - medium). The zoom of the PLD core is shown in the bottom panel, where the SS and SA rows contain the predicted probabilities of secondary structure and solvent accessibility. The green arrows indicate the side chains buried in the fibril core, while the black frames highlight segments that form strands of a cross-β motif 38 .
(sand), C-terminus (green). The full-length N-terminus is disordered and contains a PR-rich region detected to show high binding affinity. The C-terminus contains a Y-rich region decorated with two LARKS-like motifs, which is flanked by two disordered regions enriched with glycine-arginine-rich motifs. The low-complexity termini show different physicochemical patterning (the content of polar, charged, π-electron systems) and structural properties (SA - solvent accessibility, II-STRUCT - secondary structure elements).
Figure 4C. The structure and internal contacts of hnRNPDL. The left panel shows the RAPTOR-predicted contact map with the highlighted regions corresponding to the N-terminus (blue), folded domains (black with two sand squares), and C-terminus (green). The upper rows contain the PORTER-predicted secondary structure and the secondary structure assignment taken from the known protein structures (corresponding to the sequences used in the seed-MSA for the detected Pfam domains). The right panel shows the structure of hnRNPDL predicted using the AlphaFold method and colored by model confidence.
