Conserved molecular recognition by an intrinsically disordered region in the absence of sequence conservation

Intrinsically disordered regions (IDRs) are critical for cellular function yet often appear to lack sequence conservation when assessed by multiple sequence alignments. This raises the question of if and how function can be encoded and preserved in these regions despite massive sequence variation. To address this question, we have applied coarse-grained molecular dynamics simulations to investigate non-specific RNA binding of coronavirus nucleocapsid proteins. Coronavirus nucleocapsid proteins consist of multiple interspersed disordered and folded domains that bind RNA. Here, we focus on the first two domains of coronavirus nucleocapsid proteins: the disordered N-terminal domain (NTD) and the folded RNA binding domain (RBD). While the NTD is highly variable across evolution, the RBD is structurally conserved. This combination makes the NTD-RBD a convenient model system for exploring the interplay between an IDR adjacent to a folded domain and how changes in IDR sequence can influence molecular recognition of a partner. Our results reveal a surprising degree of sequence-specificity encoded by both the composition and the precise order of the amino acids in the NTD. The presence of an NTD can – depending on the sequence – either suppress or enhance RNA binding. Despite this sensitivity, large-scale variation in NTD sequences is possible while certain sequence features are retained. Consequently, a conformationally-conserved dynamic and disordered RNA:protein complex is found across nucleocapsid protein orthologs despite large-scale changes in both NTD sequence and RBD surface chemistry. Taken together, these insights shed light on the ability of disordered regions to preserve functional characteristics despite their sequence variability.


Introduction
The classical structure-function paradigm states that sequence dictates structure, and structure dictates function 1 .This understanding has driven extensive study of protein structure and dynamics.Understanding the 3D structures that proteins adopt provides insight into their normal function.It also allows us to interpret how and why mutations that disrupt those structures and/or dynamics impair function [2][3][4] .However, in recent years, there has been a growing focus on understanding if and how disordered regions can contribute to cellular function [5][6][7][8][9] .Intrinsically disordered regions (IDRs) are poorly described by a single 3D structure; instead, they exist as a collection of structurally distinct interconverting conformations known as an ensemble [9][10][11] .Despite lacking a defined 3D structure, IDRs play critical roles in many aspects of cellular function 9 .Consequently, emerging work suggests that just as folded domains follow a sequence-structure-function relationship, IDRs can follow an analogous sequence-ensemble-function relationship 9,12,13 .Given the importance that structure-function analysis has played in understanding the molecular basis for cellular function, there is a promising and analogous opportunity to understand IDR function through the lens of ensembles 9,[14][15][16][17][18] .
A major goal of modern molecular biology is to accurately predict protein function directly from amino acid sequence.Rooted in the general assumption that similar protein sequences will exhibit similar molecular behavior, one strategy is to compare the sequence of a protein of interest to those of other known proteins [19][20][21][22] .In many cases, multiple sequence alignment of orthologous folded domains reveals high sequence conservation and, therefore, conserved protein function 19,23,24 .This relationship enables us to predict structures of previously unsolved protein structures and infer function by aligning the sequences of an uncharacterized protein against sequences of functionally-characterized folded domains [25][26][27][28] .In sum, applying evolutionary information, directly and indirectly, is a central pillar in our modern toolkit for protein sequence analysis.
While IDR sequences can be aligned, their conservation at the residue level is typically lower than their structured counterparts [29][30][31] .However, even without strict sequence conservation, the presence of disordered regions in a protein is often conserved across orthologs 14,15,[31][32][33][34] .Assuming orthologous proteins provide equivalent functions, this presents a question: "Can apparently divergent IDRs confer the same molecular functions?".For some IDRs, the only feature that matters may be the existence of Short Linear Motifs (SLiMs), such that a large IDR may appear poorly conserved, yet functional conservation is maintained as long as a few short (5-15 residue) regions are present [35][36][37] .More recent work has shown that retaining specific physicochemical properties in disordered regions can be sufficient to preserve function 14,15,29,[31][32][33][34][38][39][40][41][42] . Ultimaely, the absence of a specific 3D structure serves to loosen the relationship between sequence and function.
Viruses provide good test systems for exploring evolutionary conservation in IDRs.Eukaryotic viruses use IDRs extensively, and their rapid evolutionary rates -driven by a combination of fast replication times, massive numbers, and strong fitness selection -mean that even between serotypes Given the relatively autonomous behavior of the NTD-RBD domains compared to the rest of the protein, our more recent experimental and computational work focussed on assessing the interaction of a minimal NTD-RBD construct with RNA 57 .While the RBD alone binds (rU) 25 with a binding affinity of ~0.6 µM -1 , the addition of the NTD enhances this affinity around 30-fold.This work also established our ability to obtain near quantitative agreement between coarse-grained molecular dynamics simulations and single-molecule RNA binding experiments in the context of non-specific binding across a range of RNA lengths and in response to small perturbations in the NTD sequence.While we cannot exclude other potential roles for the NTD, our work to date suggests that one of its functions is to enhance N-protein:RNA interactions, presumably to facilitate genome packaging.Despite our prior progress, many questions regarding the molecular details surrounding NTD-RBD:RNA interaction remain.While the NTD is highly variable in sequence and length across N protein orthologs, a disordered NTD of some type is always present (Fig. 1B) 55 .In contrast, the RBD is extremely structurally conserved among orthologs, exhibiting a characteristic right-handed fist structure.This is formed by a four-strand antiparallel β-fold core and a protruding β-hairpin, which we refer to as the β3 extension 59 .Despite this structural conservation, RBD sequences vary across coronaviruses, leading to changes in surface chemistry (Fig. 1C).As such, despite its pivotal role in coronavirus replication, N protein NTD-RBD sequences vary substantially across different coronaviruses.Given the structurally similar RBDs but differing NTDs, we wondered whether different coronavirus NTD-RBDs bind single-stranded RNA (ssRNA) in the same way or whether they have distinct modes of interaction.Naively, given the large variation in NTD sequence, one might expect fundamentally different modes of recognition.However, recent work has shown that the conservation of IDR ensemble properties is possible despite large changes in IDR sequence 14,60,61 .More broadly, the molecular basis for how the NTD provides a 30-fold increase in binding affinity remains unclear, especially given the NTD-RBD binds RNA almost 60-fold more tightly than the NTD in isolation 57 .
To address these questions, we performed coarse-grained molecular dynamics (MD) simulations of NTD-RBD constructs with poly-(rU) 25 to assess how changes in NTD sequence influence RNA binding.Using this approach, we sought to understand how the sequence properties of an RNA binding domain and flanking disordered region enable them to cooperate to bind nucleic acids and achieve specific binding affinities.Our findings demonstrate that the ability of the SCO2 nucleocapsid protein NTD to potentiate ssRNA binding is determined by a combination of sequence composition and the relative positioning of positively charged amino acids.Our work supports a model in which the NTD and RBD are two halves of a single RNA binding domain, where the two halves make up either side of a conserved RNA binding groove.The disordered nature of the NTD substantially relaxes evolutionary constraints on the NTD, allowing many different sequences to form structurally equivalent bound-state conformations.We suggest that such bi-partite binding domains -made up of both folded and disordered regions -may be a common mode of evolutionarily labile molecular recognition.Our study highlights that disordered regions can enable the conservation of specific binding modes, even in the absence of precise sequence conservation.

"Inert" Intrinsically Disordered Regions Diminish RNA Binding
Our previous work used coarse-grained MD simulations paired with smFRET-based RNA binding experiments to characterize the ability of the SCO2 NTD-RBD to bind ssRNA 57 .These simulations using the Mpipi forcefield were able to qualitatively recapitulate the conformational behavior of the NTD-RBD in the presence and absence of RNA, as well as capture with semi-quantitative accuracy the binding affinity observed for the RBD and NTD-RBD with ssRNAs of differing lengths 57,62 .Simulations and experiments showed that the addition of the disordered NTD SCO2 to the folded RBD resulted in a 30-fold increase in the binding affinity for (rU) 25 compared to the RBD alone.Importantly, this work identified a subregion in the NTD (residues 30-50) that is predicted to interact directly with RNA.
We first sought to establish the relationship between the NTD and RNA binding.We hypothesized that substituting the NTD SCO2 with an inert IDR that interacts negligibly with RNA would result in a binding affinity similar to that of the RBD alone.To our surprise, our simulations showed this was not the case.
In the Mpipi model, glycine and serine residues have negligible interactions with RNA or other amino acids.This agrees with prior experimental work that suggests GS-repeat sequences behave as relatively inert Gaussian-like chains [63][64][65] .We took advantage of this and replaced the 50-residue NTD SCO2 with a length-matched GS repeat -(GS) 25 -and performed simulations with this (GS) 25 -RBD SCO2 chimera (Fig. 2A) 25,26,66 .Our simulations revealed repeated association and dissociation events between (rU) 25 and the (GS) 25 -RBD constructs (Fig. 2B), enabling us to calculate an apparent binding association constant, K A , as done previously (see Methods for details) 57 .For convenience, we normalize this apparent binding affinity by the binding affinity associated with wildtype NTD-RBD binding (rU) 25 , reporting this normalized binding affinity as K A *. K A * > 1 reflects tighter binding than wildtype, while K A * < 1 reflects weaker binding.E.An apparent binding affinity (K A ) is calculated by utilizing the fraction of bound and unbound frames and Eq. 1.This is then converted to a relative apparent binding affinity (K A *) by normalizing all values by dividing by the K A calculated from the SCO2 NTD-RBD + (rU) 25 simulations.Blue points represent each individual simulation K A *, while the red point is the mean of all of the replicate simulations for a given construct.The error bars are the ratio propagated standard error of the mean calculated using Eq. 2. Significance is determined by a Mann-Whitney-Wilcoxon test two-sided with Bonferroni correction.p-value annotation legend: (ns: 5.00e-02 < p <= 1.00e+00), (*: 1.00e-02 < p <= 5.00e-02), (**: 1.00e-03 < p <= 1.00e-02), (***: 1.00e-04 < p <= 1.00e-03), (****: p <= 1.00e-04).
To our surprise, the (GS) 25 -RBD construct bound half as tightly as the RBD alone ((GS) 25 -RBD K A * = 0.020 ± 0.003, RBD K A * = 0.037 ± 0.004) (Fig. 2D).This result is driven by an entropic excluded volume effect, whereby the (GS) 25 impedes the ability of RNA molecules to interact with the RBD by occupying space adjacent to positively charged residues on the RBD.Importantly, this result suggests that the tighter binding affinity associated with NTD SCO2 -RBD SCO2 compared to RBD SCO2 alone is due to a cooperative interplay between the NTD and the RBD with RNA 57,67 .
Next, we sought to understand how the NTD SCO2 enhanced the binding affinity.Given our prior work identified residues 30-50 in the NTD SCO2 as an RNA interacting region, we replaced this region with a (GS) 10 linker.While we anticipated a reduction in binding affinity compared to wildtype, we expected this construct to be stronger than that of the RBD alone.In actuality, we again observed weaker RNA binding compared to the RBD alone with a K A * = 0.021 ± 0.003 (Fig. 2D), statistically indistinguishable from the (GS) 25 -RBD construct.With this in mind, our results suggest residues 30-50 are critical for robust RNA binding.
It is widely known that sequence composition and patterning govern the properties adopted by intrinsically disordered regions 9,68 .However, for IDRs adjacent to RNA binding domains and their binding interfaces, our results illustrate that sequence properties can either enhance or diminish RNA binding affinity, depending on the specific IDR sequence.Taken together, our results suggest that the sequence of the N-terminal IDR adjacent to coronavirus RBDs needs to be relatively specific and is most likely conserved, albeit not in the traditional sense of direct sequence alignment; otherwise, without specific residues, the IDR could interfere with RNA binding to the extent of diminishing binding affinity.

Coronavirus Nucleocapsid Protein NTDs have Conserved Sequence Composition
While NTD's in coronavirus nucleocapsid proteins appear to always be disordered, their absolute sequence conservation is poor (Fig. 1B, Supplementary Fig. 3).If NTDs exist to enhance RNA binding affinity, and disordered NTDs can diminish RNA binding if the 'wrong' sequence is present, then how do coronavirus NTDs ensure tight RNA binding is conserved despite large scale variation in sequence?
The decrease in binding affinity caused by (GS) 10 and (GS) 25 mutant NTDs indicates that any enhancement in RNA binding provided by the NTD SCO2 is sequence-dependent.This conclusion is consistent with our prior work, in which we found even small changes in NTD sequence had measurable effects on RNA binding affinity as measured both by single-molecule experiments and by simulations 57 .
Operating under the assumption that the NTD SCO2 has a role in enhancing RNA binding affinity of the RBD (Supplementary Fig. 4), we reasoned there may be some selective pressure towards NTD sequences that result in a consistent macroscopic RNA binding affinity for the NTD-RBD.Additionally, while RBD structures are highly conserved across coronaviruses, their charged surface residues vary (Fig. 1D) 69 .As such, we also wondered if there may be a co-evolutionary coupling between the NTD sequence and the RBD surface.Thus, despite the diverging surface charge of the RBDs, conserved interactions between the NTDs and their respective RBDs could lead to a consistent macroscopic RNA binding affinity.
To investigate this hypothesis, in addition to the NTD-RBD taken from SCO2, we examined NTD-RBD constructs from five other coronaviruses: human coronaviruses OC43, HKU1, and 229E, the Middle East Respiratory Syndrome Coronavirus (MERS), and the Mouse Hepatitis Virus (MHV1).We reasoned that focusing on coronaviruses that predominantly infect the same host would ensure host selective pressures are consistent, thereby minimizing this as a confounding factor to explain differences in RNA binding affinities.
We first examined NTD physicochemical properties that are routinely used to describe IDRs (Supplementary Table S3-S6).Despite the large variation in NTD length, all NTDs possess a net positive charge, with the least positive NTD possessing a net charge per residue of +0.056.Expanding this analysis to 45 different coronavirus NTDs, we found no examples in which the net charge was lower than +0.056 (Supplementary Fig. 5).This is consistent with RNA binding proteins typically binding RNA through positive electrostatic surfaces that interact with negatively charged RNA 70 .
Next, we examined solvent-accessible residues on the RBD surface.We generated five RBD structures for each of the coronaviruses using AlphaFold2, and then took the average of our calculated properties across the five structures 66 .The net charge per residue (NCPR) of the RBD surface residues stratified into three categories: relatively positively charged (229E = 0.126, SCO2 =0.066, MERS = 0.052,) neutral (HKU1 = 0.0, MHV1 = -0.011),and negatively charged (OC43 = -0.053).However, in all case we found that the β3 extension surface was positively charged, albeit to different extents (Fig. 1D).
In summary, while the surface charge of the RBD domains appears more variable, our analysis suggests two key features conserved across coronavirus N proteins: (1) a net positive NTD and (2) a positive charge on the structurally conserved β3 extension.Compositional conservation in the NTD (i.e., the retention of specific physicochemical features, such as net charge) could enable conserved interactions despite the lack of absolute sequence conservation.We next sought to determine if composition was sufficient or if other sequence properties were important to determine RNA binding.

Sequence Composition Alone Does Not Determine NTD Contribution to Binding Affinity
One possible interpretation of our analysis is that the only factor that matters for NTD function is a net positive charge.To test if composition is the only thing that matters, we designed sequence variants that moved residues 30-50 (which contain several positively charged residues) to different locations across the NTD. 57,67.We placed residues 30-50 at positions 1, 6, 11, 16, 21, 26 (referred to as mutants T1, T6, T11, T16, T21, T26) and 31 (wildtype) of the NTD SCO2 (Fig. 3A).We then performed simulations with (rU) 25 and calculated apparent binding affinities of each variant.These sequences maintain the same sequence composition but rearrange the amino acids, which allows us to determine whether there are positional contributions to RNA binding or if sequence composition alone is sufficient to achieve RNA binding.
To our surprise, the relative position of residues 30-50 has a significant impact on the apparent binding affinity (Fig. 3B).Two mutants showed wild-type-like binding affinities, yet the others bound RNA more weakly.This suggests that the relative location of positive charge with respect to the RBD tunes RNA binding affinity.
Why do the T6 and T11 variants show wild-type-like binding?Our results thus far suggest that placing a cluster of positively charged residues either directly adjacent (as is the case in the wild-type sequence) or ~30-40 residues (as is the case in the T11 and T6 variants) from the RBD are optimal for tight binding.Indeed, in the wild-type sequence, a pair of arginine residues is found around residues 10-14.However, why such a pattern matters for RNA binding was initially unclear.
To further test how the relative position of positively charged residues impacts RNA binding, we generated 172 scrambled NTD SCO2 sequences in which the sequence composition is identical, yet the order of the amino acids has been changed.These scrambles were generated in four ways: The first by randomly shuffling the NTD SCO2 ; the second by shuffling the NTD SCO2 while also making each amino acid change as chemically different from the wild-type sequence as possible in terms of charge and aromaticity; third, by shuffling the NTD SCO2 while forcing positively charged residues from falling in the 30-50 residue region; and fourth, by shuffling the NTD SCO2 while restricting the majority of charged residues to the 30-50 region or a region spanning residues 4-17.Using these scrambled sequences, we performed coarse-grained MD simulations and calculated K A * with (rU) 25 .Binding affinities were calculated for each of the scrambled sequences and compared with one another (Fig 3D, Supp.Table 7).The dynamic range of K A * observed here spans five orders of magnitude, demonstrating the dramatic impact relative amino acid position can have on binding affinity.However, for the majority of the scrambled sequences, the binding affinity is fairly similar, and, importantly, this "average" binding affinity is almost an order of magnitude weaker than the wild-type NTD-RBD.
Taken together with our simulations that shifted the 30-50 amino acid region around the NTD SCO2 , these results suggest composition is not the sole determinant of how the NTD SCO2 influences RNA binding.While 172 scrambled sequences are only a fraction of the total number of possible sequence shuffles that could be generated for the NTD SCO2, the observation that the wild-type NTD SCO2 sequence is among those with the highest apparent affinity suggests that the ordering of the residues in the NTD SCO2 is specific.

Disordered Region Residue Sequence Positioning Dictates RNA Binding Capacity
While most scrambled sequences had similar binding affinities that were much weaker than the wild-type sequence, we identified a subset of sequences that had binding affinities equal to or greater than that of the wild-type sequence.Based on our simulations testing positioning of the 30-50 amino acid region, we reasoned that the relative position of positively charged residues might underlie the increased binding affinity of these select sequences, highlighting regions of the NTD that are more binding-competent.
To assess how the position of positively charged residues correlates with binding affinity, we plotted binding affinity versus the average position of all positively charged residues in each scrambled sequence that we initially tested (Fig 3E, blue circles are the binned means of each sequence).The average position is calculated as the mean of the location of the arginine and lysine residues in the linear sequence of the NTD SCO2 .This analysis revealed a correlation between strong binders and the average position of positively charged residues.When the average position of positive residues is around residues 30-40, binding affinity is drastically increased in comparison to the other regions.This same region is relatively positively charged in the wild-type NTD SCO2 .
The importance of the position of positively charged residues offers a 'structural' explanation for the enhanced binding affinity afforded by the wild-type NTD.Charged residues within this region enable the formation of a positively charged 'groove.' One-half of this groove is made of the positively charged surface of the RBD β3 extension, while the other half comes from the disordered NTD.This charged groove enables simultaneous multivalent interactions between the NTD SCO2 and the RBD SCO2 with RNA and, thus, tight RNA binding (Fig 3C).To further explore if a disordered charged groove underlies high-affinity NTD binding, we examined the relationship between charge clustering and RNA binding.
The average position of positive charges along the NTD does not capture the clustering of positively charged residues.To address this, we used the inverse weighted distance (IWD+) metric to calculate the clustering of positively charged residues 71,72 .Our initial set of scrambles showed relatively similar charge clustering, although in almost all cases, sequences with a greater degree of positively charged residue clustering bound more tightly than those where residues were less clustered (Fig. 3E).
To more systematically investigate the impact of positive charge clustering, we designed a second library of 214 additional scrambles.In this library, sequences were designed such that all positively charged residues were locally clustered at a specific location (Fig. 3F).Sequences with clusters of positive charge generally exhibited increased binding affinities.Moreover, sequences where positively charged residues were clustered towards the C-terminus of the NTD showed -in general -tighter binding than those where positively charged clusters were N-terminal.These results confirm that the presence of a positively charged cluster on the NTD adjacent to the RBD provides the highest affinity binding interface.
Our results thus far are consistent with a model in which the local density of positively charged residues forms one-half of a positively charged binding grove (Fig. 3C).While we conventionally think of binding clefts as forming between two folded domains, here we propose a binding interface that straddles the folded RBD surface and the disordered NTD, akin to a flexible thumb and a structured hand.This disordered binding groove model makes several predictions.First, this model predicts that the NTD should remain disordered upon binding RNA.This prediction is supported by recent nanosecond FCS experiments in which no loss of conformational heterogeneity was seen upon RNA binding 57 .Second, very different sequences should be compatible with RNA binding, a prediction supported by results from our scrambles, which show that if appropriate sequence constraints are met, there are many NTDs with wild-type-like binding (Fig. 3F).Third, this model predicts that across different coronaviruses we should expect the mode of RNA binding by the NTD-RBD to be conserved.In other words, even as the surface and sequence of the RBD and NTD vary, we should expect the conformational features of the bound-state ensemble to be preserved.To test this prediction, we next performed simulations of five additional orthologous NTD-RBD constructs with (rU) 25

NTD-RBD:RNA Behavior in the Bound State is Conserved Across Orthologs
Our scrambles confirm that the NTD sequence has a substantial impact on NTD-RBD RNA binding affinity.We therefore asked if natural NTD sequences encode a similar positively charged "groove" binding mode despite seemingly large-scale variation in NTD sequence and RBD surface chemistry.In this model, specific subregions of the NTD come into closer proximity to the RBD driven by favorable NTD-RNA interactions on one side and RBD-RNA interactions on the other (Fig. 3C).To test this, we performed simulations of each of the six ortholog NTD-RBD constructs with (rU) 25 and assessed the bound-state conformational ensemble of the NTD.
Bound-state ensembles were quantified using scaling maps.Scaling maps capture the average inter-residue distance between all pairs of residues for RNA-bound conformers, and offer a way to quantify the conformational ensemble of an IDR 55,[73][74][75] .Here, scaling map values are calculated as the inter-residue distance measured in the RNA-bound state normalized by the inter-residue distance of sequence-matched NTD-RBD simulations performed in the absence of RNA (Fig. 4A).Shades of purple reflect distances that are closer together in the bound state, while shades of green denote regions that are further apart in the bound state.In this way, the scaling map provides a quantitative description of the RNA-bound ensemble of the NTD.
For SCO2, this analysis identified two regions in the NTD that are closer to the RBD in the bound state ensemble centered around residues 10-20 and residues 30-50, similar to our simulations that shifted the 30-50 amino acid region around the NTD SCO2 and as reported previously 57 .This analysis can be done selectively for one of the residues in the NTD to visualize where it increases RBD interactions when bound to RNA by mapping its distances across the entire NTD-RBD construct with RBD residues colored with respect to NTD distance (Fig. 4B).Doing so shows that in the bound state, the NTD moves closer to the positively charged RBD β3 extension, highlighting the formation of a positively charged groove between the positive β3 extension and the positive region spanning amino acids 30-50, as well as contributions from the region spanning amino acids 10-20 in the NTD SCO2 .This positive groove effectively envelopes RNA, facilitating a specific bound-state ensemble.
We repeated this analysis for the remaining five orthologs, as well as the (GS) 10 -RBD and (GS) 25 -RBD constructs, to determine if these NTDs also move closer to the RBD.This analysis reveals that the same two specific subregions within the NTD come closer to the RBD across coronavirus orthologs.Despite large-scale variation in both folded-domain surface charge and NTD sequence, the bound-state ensemble (and hence RNA binding mode) appears to be largely conserved across the six coronavirus NTD-RBD constructs examined.However, for the GS mutant NTDs this conformational conservation is lost (Supplementary Fig. 6), highlighting the sequence dependence of these interactions.(C-H) Representative snapshots from RNA-bound-state ensembles.In all cases, RBD configuration is aligned in the same way, enabling conservation of binding mode to be directly visualized across six distinct orthologs.
To better visualize this result, we generated structural models of the bound-state ensemble with six conformers each (Fig. 4C).While these make up a tiny fraction of the bound-state frames, it should be clear that in all cases, RNA binding occurs through a conserved bound-state ensemble, whereby RNA lies along a disordered groove generated between the β3-extension on one side and the NTD on the other.Taken together, our work suggests that co-evolution of the NTD-RBD occurs at the level of preserving a bound-state ensemble, as opposed to sequence or conformational properties in the unbound state.

Discussion and Conclusion
Intrinsically disordered proteins and protein regions are prevalent across eukaryotic, prokaryotic, and viral proteomes 9 .They play a wide variety of essential roles yet -perhaps paradoxically -often appear to be relatively poorly conserved sequences by alignment [29][30][31] .In this study, we sought to understand how a specific molecular function (RNA binding) could be conserved despite large-scale changes in amino acid sequence.We utilized two domains of various coronavirus nucleocapsid protein orthologs as a convenient model that contains both a disordered region (NTD) and a folded domain (RBD) that binds RNA.Despite poor sequence conservation assessed by alignment across NTDs, we found that the orthologs were compositionally conserved.That is, the orthologs have similar charge properties in both the NTD and portions of the RBD.Specifically, NTDs harbor a net positive charge, while RBDs retain specific positively charged regions on a specific region of their surface.Despite this conservation, the length and sequence of N protein NTDs vary dramatically, and while RBDs maintain the same 3D structure, orthologous RBDs showed a diverse set of surface properties, including negatively charged patches and changes in positive regions.
To assess how the sequence composition of the disordered NTDs influences interactions with the RBDs and impacts RNA binding, we performed coarse-grained molecular dynamics simulations of coronavirus nucleocapsid proteins with single-stranded RNA.These simulations enabled us to interrogate the role of sequence composition and residue positioning in coronavirus NTDs ability to increase binding affinity of the NTD-RBD.We first showed that RNA binding could be enhanced (NTD-RBD) or suppressed ((GS) 25 -RBD) compared to the RBD in isolation depending on the IDR sequence.Further, by testing hundreds of different sequences with the same overall composition, we determined that composition alone does not dictate RNA binding affinity.Instead, our simulations highlight the importance of clusters of positively charged residues, and that the relative position of positive clusters along the NTD also matter.Specifically, our simulations reveal the mode of binding occurs via a disordered, positively charged grove that forms between the NTD and the positive surface of the RBD (specifically the β3 extension).In this way a 'structural' basis for RNA binding emerges, despite the fact the bound state is highly heterogeneous (a result we previously confirmed via ns-FCS experiments) 57 .Moreover, this specific binding mode is conserved across five additional orthologous NTD-RBD constructs, despite largescale variation in sequence.This charged groove and the dynamic nature of the RNA-protein interaction is potentially similar to the high affinity yet highly dynamic interactions that have been observed for polyelectrolyte complexes formed by charged polymers or the H1-Prothymosin alpha interaction, and for other IDR:RNA interactions [76][77][78] .Here the NTD is able to remain highly dynamic and disordered yet still maintain relatively tight binding affinity.Our rationally-designed sequences suggest tighter binding is certainly possible, but whether tighter binding would be functionally advantageous for viral replication is unclear.
Our work here implicates synergistic cooperation between a folded domain and a disordered region to enable high-affinity binding.The exceptional structural conservation of RBDs across coronaviruses may reflect their crucial role in virion structural stability, perhaps enabled via a network of stacked aromatic residues in the RBD core.While RNA binding domains often posses binding clefts, our work here suggests that such clefts need not be fully structuctured, and that a partially disordered binding groove can also enable evolutionarily-labile RNA binding.
Recent work identified arginine-rich motifs within disordered regions adjacent to DNA binding domains across transcription factors, implicating these regions as mediating RNA binding in concert with the DNA binding domain 79 .Given the conserved binding mode uncovered in our work here, we speculate that while defining RNA/DNA binding domains in terms of their folded domains is convenient, the full 'domain' could in some cases be extended to include flanking IDRs that potentiate and/or regulate binding.In particular, we have explicit examples in which adjacent IDRs enhance 80 , supress 81,82 , or have no effect 83 on DNA binding affinity.These observations dovetail with our own work that suggests the amino acid chemistry of IDRs adjacent to nucleic acid binding domains impacts the macroscopic binding affinity.Furthermore, highly charged flanking IDRs can lubricate interactions between folded domains and nucleic acids by competing with the folded domain for nucleic acid interaction, or nucleic acids for folded domain interactions, a model proposed by the Levy lab almost fifteen years ago [84][85][86][87][88] .
The cooperative effect of NTD and RBD binding with loose structural coupling opens the door to compensatory changes in either domain.While identifying such couplings is inherently challenging, we note that the ortholog with the least prominent NTD:RBD interaction profile (229E; Fig. 4A) also has the most positively charged RBD, pointing to a potential mechanism to compensate for a 'weaker' (less positively charged) NTD.
Our simulations also hint at the presence of a second RNA binding region in the NTD, centered around residues 12 in SCO2.This is highlighted by the appearance of two local subregions that are close to the RBD in the bound state -one around residues 30-50 but a second around residues 5-15 (Fig. 4A).This region is clearly insufficient to enable RNA binding in isolation, because replacing residues 30-50 in NTD SCO2 with a (GS) 10 yielded a binding affinity indistiguishable from one where the entire NTD was replaced by (GS) 25 (Fig. 2E).Nevertheless, designs that repositioned residues NTD SCO2  to the location of this potential second hotspot (T6 and T11) recovered wildtype-like affinity, suggesting this relative position from the RBD may also be well-poised for RNA binding (Fig. 3). We spculate this may reflect an optimal distance between loop-closure entropy, electrostatic repulsion between binding regions, and effective concentration; i.e. that at this number of residues away from the RBD, the NTD can 'fold' back on itself and interact with RNA that is bound to the RBD surface.While they lack absolute sequence conservation, the conserved nature of these hotspots across five of the six orthologs implicates these regions as potentially playing an auxiliary regulatory role in RNA binding.
Recent work has suggested that small-molecules that target specific IDR ensembles may provide a route for sequence-specific pharmacological interventions 89,90 .Given the essential role N protein:RNA interaction has in coronavirus lifecycles, our work here hints at principles to enable the rational design of bivalent molecules that might enable specific NTD-RBD inhibition by outcompeting with RNA to bind in a conformationally-conserved manner.If conventional antiviral structure-guided drug design focusses on conserved structural features, targeting conserved conformational features offers an alternative but conceptually analogous route to pharmacological intervention against regions traditionally considered 'undruggable' 91,92 .
While this study focused on the NTD-RBD from coronavirus nucleocapsid proteins, we expect the insights gleaned here will be widely applicable to a range of disordered nucleic acid-binding proteins.While absolute sequence conservation may not be present, there is still the possibility of conserved behavior encoded into diverging sequences.Rather than solely focusing on sequence alignments to provide information on conservation and important residues, quantitatively describing the ensemble that a disordered region takes on and assessing how it behaves with and without its ligand(s) may provide better insight into the residues that are important and sequence features that need to be maintained to ensure proper biological function.

Molecular Dynamics Simulations
All simulations were performed using the LAMMPS simulation engine 93 .We performed molecular dynamics simulations in the NVT ensemble using the default parameters of the physics-driven coarse-grained force-field Mpipi developed by Joseph et al. 62 The model represents both amino acid residues and nucleotides as chemically unique singular beads and was parameterized to recapitulate the behavior of disordered proteins in isolation as well as their ability to undergo phase separation with and without RNA.Inter-bead interactions consist of a combination of short-range contributions from a Wang-Frenkel potential, which captures a combination of Van der Waals, cation-pi, and pi-pi interactions, and a long-range Coulombic potential for amino acids with net charge and RNA nucleotides.The ability of the Mpipi force field to recapitulate disordered protein dimensions has been previously shown 62,94 .Simulations were performed under an effective ionic strength of 50 mM NaCl, conditions we previously found to engender good agreement between simulation and experiment when comparing with experimentally-measured RNA binding affinities using single-molecule experiments 57 .
We also assessed the ability of the Mpipi forcefield to recapitulate single-stranded RNA (ssRNA) dimensions by comparing simulations of (rU) 40 with scattering data from small-angle X-ray (SAXS) experiments for the same construct 95 .This comparison revealed excellent agreement across the full scattering curve and in terms of the scattering-derived radius of gyration; using the Molecular Form Factor approach of Riback et al., R g sim = 30.9± 0.1 Å while R g exp = 30.2± 0.3 Å) (Supplementary Fig. 1) 96 .
Simulations were performed in a 30 nm 3 simulation box with periodic boundary conditions.Protein and RNA are allowed to diffuse freely throughout the box.Disordered regions and ssRNA behave as dynamic flexible polymers, sampling an ensemble of conformations 62 .However, as done previously, folded domains were made rigid, and residues buried within folded domains experienced downscaled non-bonded interactions 57,62 .Unless otherwise specified, all simulations were run for 300 million steps per replicate.The exceptions are the 'scrambled' simulations, which were run for 100 million steps per replicate.Protein and RNA configurations were saved every 10,000 steps, and the first 0.2% was removed for equilibration.Visualization of protein-RNA complexes was done with Protein Imager and VMD 97,98 .Simulations were analyzed using SOURSOP and MDTraj 74,99 .Small angle X-ray scattering was analyzed using the Molecular Form Factor (MFF) (http://sosnick.uchicago.edu/SAXSonIDPs),while synthetic scattering data for simulations were generated using FOXS default settings 96,100 .
We performed simulations of the NTD-RBD, NTD, and RBD of six coronavirus orthologs.Specifically, we examined five coronaviruses that infect humans: SARS-CoV-2 (SCO2), Middle Eastern Respiratory Syndrome virus (MERS), Human Coronaviruses OC43, Human Coronavirus HKU1, and Human Coronavirus 229E, as well as Murine Hepatitis Virus (MHV1).Sequence alignments were compared to determine a region of the RBD that was well conserved between all orthologs to delineate the start and end positions of the NTD and RBD's of each ortholog 58,[101][102][103] .For simulations with ssRNA, all simulations were done using (rU) 25.
To capture conformational heterogeneity in an artificially rigid structure, we utilized Colabfold to generate five different starting structures for each coronavirus orthologous RBD 25,66 .For simulations of wild-type versions of each ortholog's NTD-RBD all five starting structures are used, to enable conclusions to be less biased by a specific starting conformation.As expected, certain RBD conformers bind RNA better than others, but in all cases where different NTDs are compared, the same sets of RBD conformers are used, such that any RBD conformation-specific biases are consistent across the set (Supplementary Fig. 2).For the large scrambled library, 1 conformation for the SCO2 RBD is used.All simulations were run with multiple replicates per starting RBD structure, with a minimum of five replicates per RBD conformation.

Limitations of Coarse-Grained Simulations
Our use of the Mpipi model should not be taken to imply that RNA or proteins are faithfully represented at one bead per residue/nucleotide resolution.Both proteins and RNA are complex biomolecules with many degrees of freedom, a chemically heterogeneous structure, and can engage in a variety of sequence and structure-specific interactions that are not captured by a simplified coarse-grain model.Our goal in using a simplified coarse-grain model is to enable high-throughput biophysical assessment in a system that, based on prior work, we have good reason to believe is semi-quantitative in terms of relative accuracy 57,62 .While we refer to the molecules in our simulations as protein and RNA, in reality, they are better thought of as RNA-and protein-flavored polymers.The simplicity of this model enables us to address questions that would be intractable using either higher-resolution simulation approaches or experiments.Despite this, we are under no illusion regarding the simplifying assumptions made for a coarse-grain model.

Calculating Apparent Association Constants From Simulations
We determined apparent association constants (K A ) by using an updated version of our previous center of mass (COM) calculations that were able to qualitatively recapitulate SCO2 NTD-RBD single-stranded RNA binding 57 .To do this, post-equilibration simulation frames were divided into bound and unbound states.This delineation was achieved by first taking the intermolecular center-of-mass distances between the protein and the RNA and plotting the distribution of distances.The histogram of intermolecular distances follows a bimodal distribution that reports on the bound and unbound states, and can be fit with two Gaussians (Fig. 2C).We then determined the intersection that minimizes the overlap of the two distributions to define a cutoff distance.The cutoff distance varies based on the size of the protein and RNA.Finally, as done previously, we classify frames as bound or unbound by assessing the linear intermolecular COM distance trajectory and delineating frames as bound when five or more frames are below the cutoff distance.This minimum number of consecutive frames allows us to distinguish between transient random interactions between protein and RNA vs. encounters with a reasonable "lifetime", implying direct and continuous interaction.The distributions and distance cutoffs are calculated for every set of NTD a -RBD b + (rU) n simulations, where a and b represent specific NTD or RBD sequences and n the length of the single-stranded (rU), allowing us to determine protein-RNA specific distance thresholds for each simulation.
The resultant fraction of bound frames is used to calculate an apparent K D with the equation: manner, this approach is analogous to that of Tesei et al. 104 .
It is important to note that the K A s determined from these simulations are not meant to represent absolute values that would be comparable to those determined from experiment.Our prior work has shown that K A s calculated from Mpipi simulations for this system lack absolute agreement with experimentally measured values.Despite this, when experiment and simulation-derived K A values are normalized by an internally consistent reference (i.e., the K A obtained from NTD-RBD binding (rU) 25 ), we see good agreement between simulations and experiment, both as a function of RNA length and as a function of the presence/absence of the NTD 57 .To that end, binding affinity here is reported as K A *, a normalized binding affinity we define as the ratio of the apparent K A of a given protein + RNA simulation divided by the corresponding K A for the analogous SCO2 NTD-RBD binding to (rU) 25 .This enables the SCO2 NTD-RBD + (rU) 25 simulations binding affinity to be a reference point with which to understand the strength of interactions of other orthologs.All K A * values are thus greater than 1 (stronger binding than the SCO2 NTD-RBD + (rU) 25 ) or less than 1 (weaker binding than the SCO2 NTD-RBD + (rU) 25 ).
Error is propagated for our ratio (K A *) using: = (Eq.2) R and R error here represent the ratio and the error of the ratio.A and B represent the numerator and denominator of our ratios, respectively, and A error and B error are their associated errors (standard error of the mean).

Calculating Charge Clustering in Disordered Regions
Charge clustering is quantified by the inverse weighted distance (IWD), a metric that has been applied to study amino acid clustering in several systems 71,72,106,107 .Unlike the patterning parameters κ ("kappa") or sequence charge decoration (SCD), which quantify the patterning of oppositely charged residues with respect to one another, here our interest is on the clustering of positive residues only 68,108 .The IWD score allows us to quantify the clustering of a specific subset of residues.When residues are clustered together, the IWD score is high, whereas when residues are evenly distributed, the IWD score is low.IWD scores were calculated using sparrow (https://github.com/idptools/sparrow).

Statistical Analysis
Every simulation has a minimum of five independent replicates, and calculated values are presented as 95% confidence intervals (box plots, with medians marked), mean and standard error of the mean, or geometric mean and geometric standard deviation (clarified in text below figures).Fitting of Gaussian distributions was done in Python using scipy.optimize.curve_fit 109.

Data Availability and Software
Analysis code and data (calculated distance distributions and contact map information) are deposited at https://github.com/holehouse-lab/supportingdata/tree/master/2023/alston_2023.For further information on using the code, please refer to the deposited Jupyter notebooks.

Figure 1 .
Figure 1.Coronavirus nucleocapsid proteins possess a disordered, poorly-conserved N-terminal domain (NTD) and a more well-conserved folded RNA binding domain (RBD).A. Schematic showing full-length nucleocapsid protein architectures from coronaviruses.The nucleocapsid protein contains three IDRs (NTD, Linker, CTD) and two folded domains (RBD, and Dimerization domains).B. Per-residue conservation calculated over 45 orthologous NTD-RBD constructs, including SCO2, MERS, OC43, HKU1, 229E, and MHV1.Conservation is calculated based on the positional Shannon entropy, with values shown only for residues where 80% or more of orthologs possess a residue.The NTD contains many gaps in a relatively poor alignment, while the RBD is almost uniformly populated with relatively highly conserved residues.Disorder propensity is calculated using metapredict.C. Overlay of RBD structures for SCO2, MERS, OC43, HKU1, 229E, and MHV1, revealing a high degree of structural conservation in the RBD fold.

Figure 2 .
Figure 2.An inert disordered region can suppress a folded domain's RNA binding ability.A. A snapshot of the bound state from a (GS) 25 -RBD + (rU) 25 simulation trajectory.Simulations utilize the Mpipi forcefield 62 .The model represents both amino acids and nucleotides as single beads with specific amino acid-amino acid and amino acid-nucleotide interactions.Folded domains are rigid, and both disordered regions and nucleic acids are dynamic.B. The distances between the COM of the (GS) 25 -RBD and (rU) 25 are plotted over the course of the simulation.A distance threshold (black line) is determined in C (see also

Figure 3 .
Figure 3. Clusters of positively charged residues determine the affinity enhancement provided by the NTD on RNA binding A. Schematic showing the wild type and mutants that systematically reposition residues 30-50 from the wild-type sequence.B. Binding affinity for mutants schematized in panel A. Mutant T6 and T11 show wildtype-like binding affinity, whereas all other variants show binding affinity less than the wild type.C. Graphical schematic highlighting the positively-charged and dynamic 'groove" that can form upon RNA binding between the positively-charged β3 extension on the RBD and the cluster of positively charged residues on the NTD.In the RBD positively charged surfaces are colored blue, negatively charged surfaces are colored red, and neutral surfaces are colored white.A representative NTD is drawn with the blue circles representing the relative positions of the positively charged residues.D. Binding affinities for 172 scramble variants.Orange bars within each plotted box represent the value of the mean KA* of each scrambled sequences replicate simulations.Each variant reports on the binding affinity for an NTD-RBD construct, where for each variant the NTD sequence was randomly scrambled.Despite having an identical amino acid composition, sequence order enables a four-order-of-magnitude change in binding affinity, highlighting the importance of sequence in dictating binding affinity.E. Scramble sequences plotted with

Figure 4 .
Figure 4. Orthologous nucleocapsid proteins show similar bound-state ensembles despite variations in RBD surface charge residues and NTD sequence.A. Scaling maps quantify the average inter-residue distance between NTD residues (X-axis, colored pink) and NTD or RBD residues (Y-axis, colored pink and light blue respectively) in the bound state.Heatmap values are calculated by calculating the average inter-residue distance in the RNA-bound state and dividing that distance by the average inter-residue distance in the RNA-unbound state.Purple colors report on inter-residue distances that are closer together in the bound state while green colors report on inter-residue distances that are further apart in the unbound state.In all six orthologs, the NTD is closer to the β3 extension in the bound state, reporting on the formation of a positively charged groove in the bound state.B. Regions closer to the NTD in the RNA-bound state are Here refers to the fraction of frames where the protein and RNA are determined to be in the   bound state from our COM-COM distribution analysis.refers to Avagodro's constant, and V is   the simulation box volume in liters, which returns a in mol/L. is then calculated using the     expression = 1/ .While we determine if two molecules are bound or unbound in a different