Robust RNA structure refinement by a nucleobase-centric sampling algorithm coupled with a backbone rotameric and quantum-mechanical-energy-scaled base-base knowledge-based potential.


 Refining modelled structures to approach experimental accuracy is one of the most challenging problems in molecular biology. Despite many years’ efforts, the progress in protein or RNA structure refinement has been slow because native structures are often not the global minimum of existing approximate energy scores. Here, we propose a fully knowledge-based energy function that captures the full orientation dependence of base-base, base-oxygen and oxygen-oxygen interactions with the RNA backbone modelled by rotameric states and internal energies. A total of 4,000 quantum-mechanical calculations were performed to reweight base-base statistical potentials for minimizing possible effects of indirect interactions. The resulting BRiQ knowledge-based potential, equipped with a nucleobase-centric sampling algorithm, provides a robust improvement in refining near-native RNA models generated by a wide variety of modelling techniques.


Introduction
Three-dimensional structures of RNAs offer the best clues for their functions and yet only a few thousand structures have been determined 1 , compared to 24 million of non-coding RNA sequences collected in RNAcentral (as of October 25 2020) 2 . This huge gap continues to expand exponentially because sequencing entire transcriptomes is only a fraction of the cost and time required for structure determination of a single RNA by the techniques such as Xray crystallography, nucleic magnetic resonance, and cryo-electron microscopy.
One possible solution to this growing problem is to predict RNA structures from their sequences by computational template-based homology modelling and de novo folding 3,4 . Homology modelling such as RNABuilder 5 and ModeRNA 6 attempts to predict the structure of a target RNA by mapping the query sequence onto the template structure(s) of its homologs. However, most RNAs do not have their respective homologous templates. Thus, method development was mainly devoted to predicting structures by de novo folding from sequences through fragment or motif assembling. Examples are FARNA 7 , MC-fold/MC-Sym 8 , Rosetta-SWM 9 , ifoldRNA 10,11 , Vfold 12,13 , 3dRNA 14,15 , and SimRNA 16 . Evaluation of these methods by RNApuzzles [17][18][19] indicates that reasonable predictions are possible only for RNAs with either simple topology or existing homologous templates. Moreover, reasonable near-native predictions were often ranked poorly by the predictors. Clearly both homology and de novo models would greatly benefit from a structure-refinement technique that can bring the models closer to native structures locally and globally and improve near-native ranking.
Structure refinement, however, is challenging with few methods available for RNAs. Even for protein structure refinement, despite a long history of method development, only a moderate local improvement can be made by combining molecular mechanics force fields and knowledge-based potentials with pre-defined restraints to prevent large deviation away from the native basin 20,21 . For RNAs, most all-atom refinements were performed after coarse-grained sampling for removing steric clashes and non-ideal bond lengths, bond angles, or torsional angles. For example, the QRNAS refinement 22 utilizes a modified AMBER force field with enforced base-pair planarity, explicit hydrogen bonds, and backbone regularization. FARFAR 23,24 improves model accuracy through all-atom refinement after FARNA coarsegrained sampling. It employed a Rosetta all-atom force field that mixes physical and knowledge-based scores with RNA-specific terms for Watson-Crick base pairing, base stacking, and torsional potentials 25 .
Here, we build a knowledge-centric refinement energy score. Atomic-level knowledge-based energy functions, derived from known three-dimensional structures, have traditionally focused on distance dependence in protein structures 26 with similar statistical potentials developed for RNA [27][28][29] . Unlike these atomic knowledge-based scores and previous all-atom RNA refinement energy scores 22,23 , the statistical potential in this work is tailored specifically to RNA interactions that were dominant by orientation-dependent base-pairing and stacking interactions 30 with rotameric backbone 31 , in contrast to dominant distance-dependent hydrophobic interactions 32 with rotameric sidechains 33 in protein folding. Base-base interactions were described by fine grids in a six-dimensional orientational space and scaled by quantum mechanical calculations allowing better capture of both local and global interactions. This Backbone Rotameric and Quantum-mechanical-energy-scaled base-base knowledge-based potential (BRiQ) is integrated with a new nucleobase-centric tree algorithm that samples backbone conformations around predicted or known base pairs. The resulting refinement technique can consistently improve model structures for the majority of nearnative structural models as demonstrated by refining Rosetta-SWM motif 34 , RNA puzzles [17][18][19] , and FARFAR2 24 models.

Results
BRiQ refinement energy score. The BRiQ refinement energy score is designed to capture stably stacked bases linked by a more flexible ribose and phosphate backbone. In particular, the interactions associated with bases and oxygen atoms in backbones are strongly orientation-dependent. To capture this unique structural property of RNA chains, the energy score (E) is separated into six terms: orientation-dependent interactions between two bases ( ), between a base and a main-chain oxygen atom ( ), and between two main-chain oxygen atoms ( ), backbone rotameric energy ( rot ), internal energy ( internal ), and atomic clash energy ( clash ). That is, where we employed five oxygen atom types including OP for OP1 and OP2 and O2', O3', O4', and O5' in ribose. To capture the orientation dependence, the whole nucleobase (A, C, G, and U) is treated as a single rigid group with a local coordinate system and the relative position between two bases is described by the distance vectors and their orientations ( Figure  1A). The densities in the six-dimensional space were derived from known RNA structures using kernel density estimations, rather than histogram statistics (See Methods). This allows a smoother energy landscape ( Figure 1B) and the near-native local minima are closer to the corresponding native structure in a fine orientational space. To remove the impact of indirect base-base interactions, statistical energy scores are scaled by quantum mechanical calculations according to the correlation coefficient between representative hydrogen-bonded base pairs ( Figures 1B and 1C). Similarly, kernel density estimations were also employed to obtain base-oxygen ( Figure 1D) and oxygen-oxygen interactions whereas statistical rotameric states were obtained for ribose backbones ( Figure 1E). Empirical internal bond angle, bond length, and atom-crash energies were also developed to ensure appropriate bond geometry and packing density (Methods). Dihedral angles α, β, γ, bond angles θPOC, θOCC and bond length of C5'O5' were required to calculate the internal energy. (F). Nucleobase-centric fold-tree (NuTree) algorithm for refinement by defining bases as the nodes and locally or globally connected bases as the edges, illustrated by the GCAA tetraloop.
NuTree sampling algorithm. The above nucleobase-centric energy score is coupled with a nucleobase-centric tree (NuTree) algorithm for conformational sampling, in which each node is a base and each edge represents the relative position between two bases ( Figure 1F). Two connected bases in the folding tree could be sequential neighbours or hydrogen-bonded base pairs. Instead of sampling in backbone-dihedral angle space, conformations are sampled according to the pre-defined orientation space of each edge type because the orientations around more stable bases are the determinants of backbone orientations. For the same reason, coordinates of sidechains were built prior to those of the backbones. Only local moves were allowed for each node whereas both local and global moves were allowed for each edge, making both local and global structural improvements possible. In the process of Monte Carlo sampling, we only need to calculate the energy change upon a random structure move, without knowing the first or second derivative of the energy function. Therefore, this sampling algorithm could be used for optimizing an RNA structure with a non-differentiable energy function in atomic accuracy.

Figure 2. (A)
Refining Rosetta-SWM motif models by BRiQ improves the majority of those model structures with RMSD<2Å as demonstrated by RMSD comparison (Lowest RMSD of top 1% before refinement on Y-axis versus after refinement on X-axis). (B) Fixing all native base pairs with random initial conformations for all other regions and then folding the remaining structure leads to more accurate motif models for the majority of the motifs than refining Rosetta-SWM models that have pre-assigned, partially fixed base pairs (Lowest RMSD of top 1% BRiQ-refined models on Y-axis versus lowest RMSD of top 1% BRiQ models with fixed native base pairs on X-axis).
Refining Rosetta-SWM conformations for RNA motifs: We first test our method using a benchmark of 48 motifs. This benchmark 34 was developed to examine the ability of Rosetta-SWM to recover the loop region given some of the base pairs and environment contacts fixed (See Methods and Supplementary Table S1). We test our refinement method by refining all 88,352 conformations generated by Rosetta-SWM with the NuTree algorithm and the BRiQ energy score. The NuTree was constructed with the base-pairing information extracted from Rosetta-SWM model, and the BRiQ energy was optimized by Monte Carlo sampling at low temperature. The lowest RMSD values of top 1% predicted models by the Rosetta and BRiQ energies are compared in Figure 2A. The majority of predicted motif conformations (31/48, 64.5%) have an improved RMSD after BRiQ refinement with a median reduction of 0.2Å RMSD change among all 48 motifs. The improvement is more impressive on the refinement from near-native models (RMSD of Rosetta-SWM model less than 2Å), 81% (17/21) predicted conformations have an improved RMSD after BRiQ refinement. The RMSDs of the top 1% predicted models for the remaining four motifs are essentially unchanged.
Generate motif structure with base-pairing information: As the NuTree sampling algorithm is a refinement protocol, near-native conformations will be difficult to obtain if initial conformations contain false base pairs generated by other methods. Here, we further test whether or not near-native structures can be generated from the NuTree algorithm if all motif base pairs are known. Starting from a random backbone conformation with base pairs fixed, we run Monte Carlo sampling at different temperatures (Methods) to optimize the BRiQ energy score. The lowest RMSD of top 1% predicted models by the BRiQ score are compared in Figure 2B. The majority of predicted conformations (33/48, 68.8%) have an improved RMSD with correct and complete base-pairing information, compared to partial fixation by Rosetta-SWM. The median improvement in the best RMSD value within top 1% predicted models is 0.29Å.
Four representative motifs were chosen for illustration in Figure 3. They are a standard helix made of GC pairs ( Figure 3A), two tetraloops (GCAA_tetra loop, Figure 3B, UUCG_tetra loop, Figure 3C) commonly used for testing molecular mechanics force fields 35 , and an example of refinement leading to worse structures (j55a_P4P6_fixed, Figure 3D). In these figures, the top to bottom panels represents the results from Rosetta-SWM, refinement of Rosetta-SWM conformations by BRiQ and the structure prediction by BRiQ with native base-pairing information. For the GC helix ( Figure 3A), minimum RMSD sampled by Rosetta-SWM is 0.58Å, compared to 0.14Å by BRiQ. For GCAA_tetra loop ( Figure 3B), near-native local minimal improved from around 1.5Å RMSD by Rosetta-SWM to about 1Å by BRiQ. UUCG tetraloop ( Figure 3C) is an example that <1Å near-native structure is the global minimum in BRiQ, whereas the lowest energy conformation in Rosetta-SWM is around 3.5 Å. For all three motifs, BRiQ refinement of Rosetta-SWM models leads to an improved backbone fitness to the native backbone. Motif j55a_P4P6_fixed ( Figure 3D) is an example of worse RMSD values after refining Rosetta-SWM conformations. We noticed that this native structure has one base protruded into the solvent, which may be the reason for the failure of BRiQ as the solvation effect is only implicitly accounted for in a statistical potential. We further found that if native conformations are directly refined by BRiQ, the lowest energy conformations of the most motifs (45/48, 94%) remain <2.5 Å away from the native structure. Refining models from the native structures and the Rosetta-SWM models finds the same minimum for most motifs with similar energy. As a result, most motifs have high-quality near-native conformations as the global minimum.

RNA puzzle refinement.
A real-world test of any refinement techniques is to refine previously predicted models. Here we refined all submitted models in 24 RNA puzzle experiments (PZ1-PZ25, Supplementary Table S2). For each submitted model, we generated 20 refinement models. Then, the lowest energy model within 20 refinement models is treated as the BRiQ predicted model. The lowest RMSD model from all BRiQ predicted models for a given RNA puzzle is compared to the lowest RMSD value among all submitted models in Figure 4A. As the figure shown, if initial models contain models with RMSD<4Å, refinement always leads to improvement, with the largest RMSD reduction of 1.5Å. Figure 4B further compares the lowest RMSD value among all submitted models to those lowest RMSD values among all BRiQ refined models. Essentially all RNAs have better models sampled after refinement with the largest RMSD reduction at 2.2Å. Comparison to FARFAR2. We also test our refinement method starting from the recently developed FARFAR2 predicted models 24 but only for those 12 RNA puzzles with RMSD of best FARFAR2-predicted models less than 6 Å as shown in Figure 4A. This is because it is not possible to refine those models that are far from native. For each puzzle, we select 2000 lowest energy models according to FARFAR2 (1547 models for PZ20) to run BRiQ refinement, and compare the best RMSD values of top 1% and 5% predicted models by FARFAR2 and BRiQ energy scores, respectively. As shown in Supplementary Table S3

Discussion
This work presents a protocol for refining RNA model structures produced by other modelling tools. The protocol is nucleobase-centric in its knowledge-based energy score as well as the sampling algorithm. The fully knowledge-based score (BRiQ) captures the orientation dependence of base-base and base-oxygen, and oxygen-oxygen interactions by using a local coordinate system for each nucleobase, whereas the structures of ribose and phosphate backbones was governed by rotameric statistical potentials and empirical internal energies. Moreover, the dominant base-base interaction was scaled by quantum mechanical calculations for removing possible indirect interactions contained in structural statistics. This statistical potential was coupled with the NuTree algorithm that samples the conformations around pre-determined base pairs. This refinement protocol improves 81% Rosetta-SWM models with less than 2Å RMSD, 100% RNA puzzle models with RMSD<4Å and 83% FARFAR2 models with RMSD<6Å. The CPU time cost of refining a single RNA puzzle model is around 40 minutes for a model of 50 nt, and 2 hours for a model of 100 nt on Inter Xeon Gold 6242 CPU 2.80GHz.
In order to secure as many RNA structures as possible, we included all high-resolution RNA structures to collect base and backbone statistics for our BRiQ knowledge-based score. This is because a few redundancies will have an insignificant effect on all statistics collected at the atom or base level. Indeed, we found that the difference in the energy scores of a single base pair before and after excluding all homologous structures of RNA puzzles based on CD-HIT with 70% sequence identity (the sequence similarity for the RNA "twilight zone" is 80%) 36 ) is mostly negligible (the difference is <0.001 for most cases). Indeed, the difference before and after removing homologs for the refined FARFAR2 structures is only 0.02Å for rp04 and 0.2Å for rp18 for the best in top 1%, respectively.
The NuTree algorithm samples the RNA conformations around predetermined base pairs from a given model. These preformed base pairs serve as the nucleus to speed up the folding of the rest of an RNA chain. However, if a base pair was incorrectly modelled, it will be energetically difficult to correct the mistake. This is the main reason why the current refinement protocol works best for those near-native structures, in which the majority of base pairs were correctly identified. Unlike Rosetta, the NuTree algorithm can fix both nested and non-nested (pseudoknot) base pairs. In other words, this algorithm will become more useful as secondary and tertiary base pairs are increasingly more accurately predicted by employing co-variational analysis of RNA homologous sequences [37][38][39][40] and deep learning techniques 41 .
The BRiQ statistical energy score is built on the knowledge that backbone conformations are determined by more stable stacked base pairs. This was done by focusing on base-base, baseoxygen, and oxygen-oxygen interactions while treating the backbone as rotameric states around the bases. The robustness of the scoring function is illustrated by the consistent improvement in refining a diverse set of RNA models produced from different methods participated in RNA puzzles including ModeRNA 6 , FARNA 7 , MC-Fold/MC-Sym 8 , ifoldRNA 10,11 , Vfold 12,13 , 3dRNA 14,15 , and SimRNA 16 .
This work is limited to RNA structural refinement. Another question is whether or not the RBiQ statistical energy function coupling with the NuTree algorithm can serve as an effective method for ab initio structure prediction. Initial studies suggest that a new sampling algorithm is likely required because ab initio folding requires more frequent breaking and forming of base pairs than what are typically involved in the NuTree algorithm. The work in this area is in progress.

BRiQ Energy Score
(1) Base-base interaction energy ( ) Relative orientation between two bases. The planar shaped bases make the orientation dependence a must for base-base interactions. Here, we consider each base i as a rigid body with a local coordinate system CSi whose origin is located at atom C1'. The x-axis is in the direction of C1'N9 and the z-axis is in the direction of C1'N9×N9C4 for A and G whereas the x-axis is in the direction of C1'N1 and the z-axis is in the direction of C1'N1×N1C2 for U and C. As shown in Figure 1A, the orientation and distance between bases i and j can be described by 1) the distance rij between the origins of CSi and CSj, 2) the rotational angle ωij around rij, 3) the directional vector rij in CSi and 4) the directional vector rji in CSj (a total of 6 dimensions). The distance rij has a range of 0 to 15 Å with a uniform grid space of 0.3 Å. The rotational angle ωij varied from -180° to 180° with a uniform grid of 8°. We represent the orientation of the directional vector rji by 2000 uniformly distributed points on a sphere from Monte Carlo simulated annealing of points with repulsive interactions proportional to the inverse of squared distances between the points. Thus, the space between the two bases is separated into 50×45×2000×2000 discrete regions. Once the energy values for all these regions are known, the energy value at a given distance and orientation can be linearly interpolated. The above coordinate system for representing the base-base relative orientation was similar to but is more sophisticated than what was proposed for a coarse-grained twoparticle representation of RNA chain 42 . The 6-dimensional base-base interactions were precalculated and stored in a table and only negative values were loaded into computer memory for quick access.

Base-orientation representation.
To define an orientation-dependent density, it is necessary to measure the structural similarity first. One common measure is the root-mean-squared distance (RMSD) between two base pairs. However, it is too time-consuming to calculate RMSD. To speed up the calculation, we assumed that four fixed points (T1, T2, T3, T4) in the local coordinate system of a base (Supplementary Figure S1a) are sufficient to represent the orientation of each base and the mean squared distance between these points in two bases ) can approximate RMSD. We have optimized T1, T2, T3, T4 so that DDM has the highest correlation coefficient to RMSD. The final four points in each base coordinate system are: T1 = (2.158, 3.826, 1.427), T2 = (-0.789, -0.329, -1.273), T3 = (4.520, -3.006, 1.586) and T4 = (6.018, 1.903, -1.638). The resulting correlation coefficient is 0.974 (Supplementary Figure S1b).

Orientation distribution density.
We employed a modified radial basis function kernel h(d) to calculate the orientation distribution of one base around another base f(x) by the following equations: and where N is the number of base pairs in the database of RNA structures and DDM(x,xi) is the distance between a target base pair and a base pair in the database. If ( , ) between two bases is less than 0.15Å, we consider two bases shared the same orientation. Here and below, we have employed an empirical value (0.1) for defining the spread of the kernel based on statistics from the PDB structures and a few trials. An example is shown in Supplementary  Figure S2.
The orientation distribution densities of base pairs with different sequence-separation distances (1, 2 or >2) were calculated separately. For non-local base pairs (separation >2), the density was re-weighted by quantum mechanical energy.
Quantum-mechanical-energy-weighted orientation distribution density. One issue associated with a statistical energy function is that a high population of specific orientation between two bases may not be due to a strong direct interaction at this specific orientation but due to the orientation preference of the interaction with another base. Here, we minimize the possibility of this type of indirect interactions by a weighting factor w(xi) according to quantum calculations. The assumption is that the strength of the directional interaction is related to the strength of the quantum interaction energy. Here, where � is the nearest representative configuration to , ( � ) is the quantum energy. The modified orientation distribution ′ ( ) satisfies Here, quantum mechanical (QM) calculations were performed by using Guassian 09 43 Figure 1C shows some examples of the calculation results. The scaling parameter (4.32kcal/mol) in Eq. (4) between QM calculations and statistical energy functions is obtained from the slope of the regression analysis of hydrogen-bonded base pairs ( Figure 1C).
Base-base interaction energy. The final energy function for base-base interactions is calculated by the following equations: and, Where sep is the sequence separation distance, its value could be 1, 2 or 2+. fref(2+) was employed to scale the lowest energy value of non-local base pair to -8.0 based on the lowest energy from QM calculations and the scaling between statistical and QM calculations ( Figure  1C). fref (1) and fref (2) were employed to scale the lowest energy value of local base pair to -4.0. Here, we also set the maximum base-base interaction energy to zero because repulsive interactions from the above six-dimensional statistical analysis are not reliable. Figure 1B shows a schematic example of before and after QM scaling.

(2) Interaction energy between a base and a main-chain oxygen atom ( )
The interaction energy between a base and a main-chain oxygen atom is the interaction between a polar group and a polar atom. The orientation of a base is still represented by 4 points obtained previously. The relative orientation and position between an oxygen atom and a base can be described by the distance between the oxygen atom and the four points representing the base. The kernel function ℎ( ) and the density ( ) for ( ) are given by (8) and where x and xi represent the target base-oxygen and a base-oxygen pair in the database, respectively, and ( , ) are the RMSD between the two base-oxygen pairs. Unlike , the repulsive (positive) interaction was not set to zero but reduced by a decaying weighting function � � = 1 − 1/(1 + (3.7− )/0.08 ), where is the minimum distance between an oxygen atom and any atoms in a base, 3.7Å is the approximate interaction distance between a mainchain oxygen atom and a base and 0.08 is an empirical parameter to control the rate of decay. We also impose an orientation dependence for the hydrogen bonding between an oxygen atom and the corresponding atoms in a base according to the θ angle between the base atom N or O, the main chain atom O (O2', OP1, OP2) and the main chain atom C or P bonding to O. This was done by locating the angle range in RNA structures after removing top 3% angles from each side and defining s(θ)=0 for outside the angle range, s(θ)=1 when θ=θM, the median value of θ, and where θu and θL are the upper and lower bounds for θ, respectively. The final expression for ( ) is where is employed to scale the lowest energy value empirically to -3.0, which we set to be about one-third strength of GC pairs. As an example, Figure 1D shows the distribution of a nucleobase C and an oxygen atom in a phosphate group and that of a nucleobase A and O4 in ribose.
(3) Hydrogen-bonding energy between main-chain oxygen atoms ( ) We consider hydrogen bonding interactions between O2'-O2', O2'-OP and OP-OP. The configuration of hydrogen bonds is determined by four atoms (two oxygen atoms and their connecting C or P atoms). The distance d between two hydrogen-bonding configurations is calculated by RMSD between the four atoms. Similar to ( ), we have is employed to scale the lowest energy value empirically to -3.0 for O2'-O2', -2.0 for O2'-OP, -1.5 for OP-OP and the decay function ( ) = 1 − 1/(1 + (3.3− )/0.07 ) with 3.3Å is the typical hydrogen bond length between two oxygen atoms and 0.07 is the empirical parameter to control the decay rate. Here the angle dependence is implicitly accounted for by using 4 atomic coordinates (e.g. atoms C2'-O2'-OP-P for the hydrogen bond between O2' and OP) to define RMSD.

(4) Energy for Atomic clashes ( clash )
Due to lack of adequate statistics for hard-core exclusion in all statistical energy functions, we have set repulsive terms to 0 in ( ) or quickly decay to 0 by weighting functions � � and ( ) in ( ) and ( ), respectively, when the distance is less than a preset value. To avoid direct atomic clashes, we introduce an empirical energy function as below.
where d is the atomic distance between two atoms, r0 is the shortest statistical distance from RNA structures between the two atoms and ℎ is an empirical parameter to control the increasing rate of repulsion when two atoms approach to each other. We set ℎ = 3 after the energy minimized structure does not change much for ℎ between 2 and 5. r0 is orientation independent for atoms in SP 3 hybridization but is orientation-dependent for atoms in SP 2 hybridization. That is, r0 is orientation-dependent between any atom with an atom in a base but is orientation-independent between two main-chain atoms. For orientationindependent r0, it is set as the top 5% shortest distance between two atoms found in RNA structures. For orientation-dependent r0 between a given atom a and an atom b in base B, the atomic position of a is transformed to the local coordinate system of the base B and clustered around the atom b. The top 5% shortest distance in different orientations is set as r0 in different orientations. The functional form of the clash energy is shown in Supplementary  Figure S3.

(5) Main-chain rotameric energy ( )
For protein structure prediction, the sidechain conformational space is discretized into rotamers. Here, we introduce main-chain rotameric states because stacked bases are more rigid than the main chains. The conformational space of a ribose rotamer is mainly determined by two dihedral angles ( Figure 1E). The first is the rotational angle χ rotating around the chemical bond connecting the base and ribose (N9-C1' for bases A and G and N1-C1' for bases U and C). The second is the improper angle ν between the C2'-C4'-O4' plane and the C2'-C4'-C3' plane within the connected ribose. Each dihedral angle can be separated into two regions, and each region into 300 to 500 rotamers, and a total of 1500 rotamers is employed to describe a ribose rotameric state.
The ribose rotamers are clustered according to RMSD. This is done by transforming atomic coordinates of eight ribose atoms to the local coordinate system for the base. RMSD is the average distance of ribose atoms in the local coordinate system. Once RMSD is known, we can use the kernel density estimation method to calculate the density of each rotamer, and then take the negative logarithm and convert it to the rotamer energy ( ) as below.
, xi denotes a rotamer in the database and the summation is over all the rotamers in the database.

(6) The internal energy
We introduce the following internal energies ( ) between a phosphate group and two connecting riboses to account for energetics associated to changes in bond lengths , bond angles , and dihedral angles where ( ) = ( − 1.422) and bl denotes the bond length between O5'i+1 and C5'i+1. We made this bond slightly flexible so that we can add a phosphate group (atom P, OP1, OP2, O5') between two fixed riboses at fixed dihedral angles, other bond lengths and bond angles. The average bond length found in the RNA structures (1.422Å) is employed here. Parameter is set to 5.0, which is optimized by backbone modelling (rebuilt main-chain atoms after fixing base positions).
POC is between Pi+1-O5'i+1-C5'i+1 and OCC is between O5'i+1-C5'i+1-C4'i+1. Similar to the bond length, we made these two angles flexible so that we can add a phosphate group between two fixed riboses at fixed dihedral angles, other bond lengths and bond angles. The average values for these two angles in RNA structures are 120.7° and 111.1°, respectively. Parameter is set to 0.1, after a few trials of backbone modelling. For both bond-angle and bond-length energies, a harmonic function with linear extrapolations at both ends was employed to improve conformational sampling efficiency.
is calculated from three statistical energy functions (see Figure 1E for angle definitions).

Structure Database
All statistical energy terms in Eq. (1) were derived from 2247 RNA structures obtained by Xray crystallography and cryogenic electron microscopy with a resolution higher than 3.0Å downloaded on January 23, 2020. We did not remove any redundant sequences because we need a database as large as possible. Moreover, RNAs can have different structures in different complexes and we want to capture all possible conformations. We can do this because statistics are collected at the atomic or base level. See the discussion in more details.

1) Confirmational representation by the NuTree
For conformational sampling, each RNA structure is represented by the NuTree. Each node in the NuTree denotes an RNA residue including its base position (the local coordinate system), the ribose rotameric state and the phosphate position attached to the 3' position of the ribose (determined by dihedral angles ε and ζ, Figure 1E). The edge of the NuTree represents relative positions between the local coordinate systems of two bases, which are described by a 3x4 coordinate transformation matrix. There are ten types of coordinate transformation matrices to describe nine types of edges: 1) sequence-neighboring base pairs along 5' to 3' and 3' to 5' directions in the loop regions, 2) sequence-neighboring base pairs along 5' to 3' and 3' to 5' directions in the Watson-Crick pairing regions, 3) sequenceneighboring base pairs along 5' to 3' and 3' to 5' directions in the non-Watson-Crick pairing regions or connection between Watson-Crick pairing and non-Watson-Crick pairing regions, 4)Watson-Crick pairs, 5) Non-Watson Crick pairs, 6) chain connection without a specific positional relation (jump). Except for jump, each edge has a corresponding move set. The move set was collected from all possible conformations of the same type in the structural database that were discretized into 60 representative conformations and each representative conformation was further discretized into 60 sub-representative conformations. These 3600 representative conformations form the move set for each edge type. Figure 1F shows the NuTree for the GCAA tetraloop with a sequence of G0C1G2C3A4A5G6C7 where the edge 0-1 is sequence-neighboring base pairs along the 5' to 3' direction in the Watson-Crick pairing region, edges 1-2 is sequence-neighboring base pairs along the 5' to 3' direction in non-Watson-Crick pairing regions, 2-3, and 3-4 are sequence-neighboring base pairs along the 5' to 3' direction in the loop regions, edges 0-7, 1-6 are Watson-Crick pairs and 2-5 is non-Watson-Crick pair. Although there is no edge between 4 and 5, we need to build the phosphate connection and calculate the internal energy between them.

2) Conformational Sampling Moves.
Conformational sampling involves node and edge sampling. Node sampling allows either a random selection of ribose rotamers or a small adjustment of the base local coordinate system (<0.2Å translational and <2° rotational motions). Both moves are local. They do not change the positions of other nodes and riboses. Edge sampling makes both local and global moves. The local move makes minor adjustment (<0.2Å translational and <2° rotational motions). The global move randomly selects a move from the set according to the edge type. Edge sampling moves will change the positions of all downstream nodes but not the relative positions between the nodes.
The above sampling may change the relative positions of some neighbouring riboses and make it necessary to rebuild the phosphate connection. Such connection can be rapidly built by the grid search from representative points for the lowest value in the ε-ζ torsional angle space. According to the distribution of conformations (Supplementary Figure S4), we have divided ε-ζ space into 50 representative points, each representative point into 50 subrepresentative points and each sub-representative point with 25 local points in 1° extension. That is, a total of 125 calculations is conducted for searching the lowest value.

3) Monte Carlo Simulated Annealing.
All nodes in a NuTree are divided into three types: those with positions fixed (A), those with relative internal positions fixed but not absolute positions (B), and those with both relative and absolute positions changed (C). The energy change after each move is calculated by ∆ = ( ) + ( ) + ( ) + ( ) + + (16) where dE(AB), dE(AC), dE(BC) and dE(CC) denote the changes in interaction energies with the AB, AC, BC, and CC regions, respectively, and and are the changes in rotameric and internal energies, respectively. Each move is accepted or rejected according to the Metropolis criterion. We employed simulated annealing for energy optimisation with an initial temperature set at 2.5, which is decreased by a factor of 0.95 at each round until temperature reached 0.01. In the refinement protocol, the initial temperature is set at 0.5 which is decreased to 0.01 by a factor of 0.9. The sampling steps at each temperature Nstep is proportional to the edge number in the NuTree.
Nstep =400*n(wc edge)+2000*(nwc edge)+4000*n(other edge) where n(wc edge) is the edge number of Watson-Crick pair or helix neighbor, n(nwc edge) is the edge number of Non-Watson-Crick pair or NWC neighbor, n(other edge) is the number of other edges.
To increase the efficiency of conformational sampling, we give Eclash and Einternal a low weight of 0.05 at the initial temperature and gradually increase the weight to 1.

Motif Test set
The test set is obtained from the benchmark of Rosetta-SWM 34 , downloaded from https://purl.stanford.edu/fq893cm4516. After removing redundant RNAs and those with more than 50 bases and non-standard RNA bases, we have 48 motifs listed in Supplementary Table  S1.

Data availability
The structural data and test sets used by BRiQ is publicly available at http://servers.sparkslab.org/downloads/BRiQ-dataset.tar.gz