Principles and limitations of SELEX
When screening large nucleic acid libraries, detection of unspecific background sequences is inevitable (e.g., due to non-specific binding to targets and immobilization matrix, or incomplete supernatant removal). Moreover, the sequencing depth of the current HTS platforms (~ 108–9 reads per run) does not allow the detection of differences in sequence abundance (e.g., between binders and background) in a random pool of ~ 1014–15 molecular species. This combination has so far prevented the development of non-iterative methods for the discovery of high-affinity nucleic acid ligands. In SELEX, binders are iteratively amplified to distinguish them from the background by their sheer abundance, as determined based on HTS datasets (Extended Data Fig. 1a). However, the abundance (or its change between two iterations) is a useful indicator for the strength of target binding only if it can be ruled out that other parameters (such as substrate acceptance by the three polymerases used) have a strong influence on the composition of the library. Due to the iterative nature of SELEX, even small differences in amplifiability may translate into vastly different copy numbers, which will purge much of the initial pool (including excellent binders with poor amplifiability), and the final, enriched pool may consist of mediocre binders with good amplifiability. Another weakness is the constant switching between bead-based and solution-based steps, which makes automation and parallelization of high-affinity binder discovery more difficult. This last point has been addressed by improved technologies such as capillary SELEX or microfluidic SELEX, but they have their own limitations 22.
Design of UltraSelex
In contrast, UltraSelex depletes the background while retaining the binders without intermittent amplifications. Due to the much lower abundance of individual sequences, identifying high-affinity binders requires a sensitive trend analysis. UltraSelex involves both experimental and computational sections that complement each other, enabling the single-step discovery of high-affinity binders (Fig. 1).
(1) The initial RNA pool (containing 1013–14 unique sequences, Extended Data Fig. 1b) is dissolved in buffer and incubated with the target immobilized on magnetic beads. The beads are subjected to (e.g., 10) consecutive partitioning steps in which they are equilibrated with increasing volumes of buffer individually. In the final step, the remaining bead-bound RNAs are recovered by chemical or enzymatic release. Serial partitioning should sequentially dilute the background non-binders, leading to a monotonic decrease in their abundance (black species in Fig. 1a), while the binders (blue and golden) peak in relative abundance at characteristic partitioning steps related to their binding affinities and initial abundance.
(2) RNA from each partitioning (group 1–10) and the final release (group 11) step are separately collected, reverse-transcribed and amplified individually by PCR for HTS library construction (Extended Data Fig. 1b-e).
(3) All HTS libraries are multiplexed and sequenced. The quality-controlled reads (Supplementary Table 1) are combined and their abundance information is added to a single data frame, consisting of 107–8 rows (one for each sequence) and 12 columns (sequence ID, read counts n in g1-g11).
(4) The HTS reads can be used as a proxy for the identification of high-affinity binders only after several adjustments: As initial sequence abundancies in randomized RNA pools differ (Extended Data Fig. 2a) 23, 24, for quantitative comparison within a partitioning group, read numbers in g2-g11 are normalized by division by read numbers in g1, which generates the fold-change (fc) data frame. Quantitative comparison between partitioning groups requires consideration of systematic changes in pool composition across the partitioning process. With each successive partitioning, the amount of RNA released into the supernatant decreases, requiring more PCR cycles for HTS library preparation, resulting in more reads per unique sequence (Extended Data Fig. 2d). Furthermore, the composition of the groups (non-binders vs. weak vs. intermediate vs. high-affinity binders) is expected to change across fractions. To account for these two effects, the fc values of each group g2 - g11 must be corrected with a different factor. As all groups contained at least 1% sequences with ≥ 2 reads (gray dashed line in Extended Data Fig. 2b), to distinguish non-binders from weak binders we selected the RNA species representing the boundary of the top 1% fraction and noted its fc value fcRef.. Because strong binders are expected to elute in late groups, increasing weighting factors γ are assigned to the fractions, using a linear γ-function as a default (Extended Data Fig. 2c). fcRef. and γ are combined to obtain a normalization factor by which the fc values of all sequences in this partitioning fraction are divided. In this way, a data frame is obtained with the intra- and intergroup-adjusted fold change (afc) values of 107–8 sequences in groups g2 – g11. Extended Data Fig. 2k illustrates this data treatment which we refer to as SGREELI. The line plot in Fig. 1b shows the afc trend for three RNAs with different binding characteristics.
Furthermore, to distinguish a binder from background, a sequence must either occur very frequently in one late group or appear in multiple groups. To favor multiple appearances, the afc values for each sequence are integrated, using the γ values as the x-parameters. In this way, a single auc (area under the curve) value is obtained for each species. The auc values are added to the data frame and the sequences sorted by auc. High auc values indicate strong enrichment in later fractions, a feature that is thought to reflect strong binding.
(5) Optionally, the top percentile value, as well as the slope and shape of the gamma function, can be optimized (See Supplementary Note for details).
(6) The sequences with the highest rank are then selected for KD determination and more detailed functional characterization.
Serial partitioning enriches known aptamers
We first tested the wet-lab partitioning procedure on the identification of silicon rhodamine-binding aptamers (See Extended Data Fig. 1 for pool design). For UltraSelex experiment SiR-A, 20 µg RNA (52 randomized positions, ~2x1014 RNA molecules) were subjected to the partitioning procedure and all fractions were analyzed by HTS. This analysis revealed that in each partitioning step, the diversity decreased, whereas the proportion of frequently occurring sequences increased (Extended Data Fig. 2b, d). Still, even in the g10 library, ~ 90% of the sequences occurred only once, indicating that the complexity is still larger than the sequencing depth.
To test whether this wet-lab partitioning procedure can uncover strong binders, we monitored the relative abundance (as reads per million reads – RPM) of 3 known silicon rhodamine binding aptamer sequences that had been discovered previously from a different aliquot of the same RNA library by conventional SELEX 9. Two of them (SiR2, SiR3) started appearing in the HTS raw reads in g5 and reached their peak relative abundance in g10 (Fig. 2a). Notably, the relative abundance in the release fraction g11 was lower than in the peak partitioning fractions. The third aptamer (SiR1) showed a hardly detectable peak at g10. Out of the 42 aptamers identified by Sanger sequencing in our previous SELEX study 9, 39 were detected in the UltraSelex SiR-A dataset, most of them with low abundancies (Extended Data Fig. 2e), illustrating the need for a ranking system that does not only rely on read abundance.
Two additional UltraSelex experiments were carried out against the same target in which unspecific binding of target RNA libraries to the beads was reduced by the prior addition of passivating, as well as partially competing, RNA species (tRNAs in experiment SiR-B, and a SELEX-derived SiR-aptamer in SiR-C). A similar distribution of the known SELEX-derived aptamers was observed in the three experiments, however, with slight shifts in the peak locations (Extended Data Fig. 2e,f) but similar auc values (Extended Data Fig. 2g), attesting to the high reproducibility of the partitioning process.
To investigate whether the libraries created in the partitioning process converge away from randomness (defined as the composition of g1) towards a specific composition (defined as the composition of g11), we analyzed the distribution of all 4096 6mers throughout the randomized part of the library for the top 1% abundant RNA species for each partitioning group and for the bead-bound group. This analysis revealed rapid convergence, as judged by the Pearson correlation coefficient (Pearson’s r). (Fig. 2b, Extended Data Fig. 2f-g). Collectively, these results indicate that non-iterative serial partitioning can identify known high-affinity aptamers. Simple read counting, however, finds only a few, very abundant, aptamers, such as SiR2 and SiR3.
The SGREELI algorithm allows for sensitive scoring and ranking of ligands
We applied the SGREELI intra- and intergroup normalization to the UltraSelex dataset SiR-A, assigning auc values to ~ 4 × 107 RNA sequences. The plot of the adjusted enrichment factors of SiR1, SiR2 and SiR3 clearly identifies SiR1 as a binder, but with a lower auc value than the other two aptamers (Fig. 2c). Out of the 39 known aptamers that were detected in the dataset, 13 were among the top 0.01% group by auc ranking (Fig. 2d). In comparison, only 6 were in this top group when ranked by abundance in the final (bead-bound) fraction, g11.
Of note, none of these 39 sequences was at the very top by auc; the best-ranking known aptamer ranked 140 (Extended Data Fig. 2l). Similarly, 37 of the known sequences were found in UltraSelex dataset SiR-B (top score 75) and 38 in SiR-C (top score 32). Consistently, those sequences in which a fluorescence turn-on was reported after addition of a SiR-quencher conjugate as confirmation of SiR binding ranked much better than those without a turn-on (likely representing false-positive non-binders), confirming the relative enrichment of true binders in UltraSelex (Fig. 2d and Extended Data Fig. 2m). These results raised the question of what the sequences at the very top (by auc) are, and how their performance compares with the previously known aptamers.
UltraSelex enriches high-affinity SiR-binding aptamers for live-cell imaging
Comparison of the top sequences (by auc) from the three UltraSelex experiments revealed significant overlap (Fig. 3a,b). 6 sequences occurred among the top 25 of all 3 data sets, and ~ 300 (60%) among the top 500. The top 25 RNAs from the 3 experiments were individually synthesized and assayed for fluorescence turn-on (Fig. 3c, Extended Data Fig. 3a). 52%, 100%, and 68% of these sequences showed turn-on in SiR-A, SiR-B, and SiR-C. The SiR-B set (tRNA background passivation) not only showed the highest percentage of binders, but also the highest average turn-on (3.5-fold) and the two highest individual turn-on values (9 and 10-fold). For these two promising candidates, SiR-B7 and SiR-B15, the binding dissociation constants KD were determined by titration with SiR-PEG-NH2, to be ~ 250 and ~ 195 nM, respectively (Fig. 3d). Both light-up and KD values compare favorably with the top aptamer previously identified by conventional SELEX (7-fold and 490 nM) 9.
Sequences SiR-B7 and SiR-B15 were then cloned into a vector containing the Tornado scaffold 25 and transfected into human embryonic kidney (HEK-293T) cells. Upon addition of cell-permeable SiR-PEG-NH2 to the cells, aptamer-expressing cells showed strong, mostly cytosolic fluorescence and compared favorably with the SELEX-derived control aptamer-expressing cells (Fig. 3e). These findings highlight the potential of UltraSelex-derived aptamers for live-cell imaging.
Most UltraSelex-identified aptamers are not found by other means
The unique ranking system of UltraSelex identifies many promising aptamers that are ranked very low in other approaches. E.g., of the ~ 5000 sequences in the 0.01% top group of the SiR-B dataset by auc ranking, only one-tenth (500) belong in the same top group when sorted by read abundance in the bead-bound group. (Fig. 3g, top panel). The Sankey plot illustrates that the top auc group derives from different read abundance groups. Looking only at the top 25 sequences, the ratio is better, but 15 of those sequences are ranked much lower by abundance than by auc (Fig. 3g, bottom). Similar relations were observed in the SiR-A and SiR-C datasets (Extended Data Fig. 3b).
Among the top 25 aptamers from SiR-A, SiR-B and SiR-C (47 unique sequences), only 5 could be detected in the HTS data set of the final round of conventional SELEX for SiR-binding aptamers 9. There, they ranked between 3rd and 484th. (Extended Data Fig. 3c).
UltraSelex identifies aptamers that bind to and inhibit the SARS-CoV-2 nsp12 polymerase with picomolar affinity
To test the general applicability of UltraSelex, we applied it to a much larger (around 200-fold MW) macromolecular target, the enzymatic RNA polymerase subunit of SARS-CoV-2, known as non-structural protein 12 (nsp12), which is a central component of the RNA-dependent RNA polymerase (RdRp) complex that replicates the viral single-stranded, positive-sense RNA genome and transcribes its genes 26. Aptamers that bind to this protein target with high affinities could, in principle, provide new tools and therapeutics that disrupt the replication cycle of the virus.
Based on this rationale, we used three aliquots of the same starting library as in the SiR selections, and carried out UltraSelex experiments with bead-bound nsp12 protein. In the experimental setup Nsp-A, no additional background binding reduction measures were taken, while in Nsp-B, the beads were pretreated with tRNA, and in Nsp-C with an RNA hairpin reported to be a substrate for the RdRp complex 27. The protein-derivatized beads were subjected to the same serial partitioning procedure as detailed above, and all fractions were sequenced and analyzed computationally.
Overall, the read counts per unique sequence were much lower than in the SiR experiments, which made binder identification more difficult. However, the SGREELI algorithm consistently ranked the sequences by auc, regardless of the absolute number of reads (For examples, see Extended Data Fig. 4b) and resulted in substantial overlap between the three data sets, demonstrating robust performance of UltraSelex even for more complex, macromolecular targets. Notably, some of the top-ranked sequences had as few as 5 reads in all 11 fractions combined (Supplementary Table 3). 16 sequences were in the top 25 of all three data sets (64%), with a decreasing trend for lower-ranked sequences (Fig. 4a,b). The Sankey plots (Extended Data Fig. 4a) indicate that only a very small portion of the 0.001% auc-ranking fraction also ranked first by read abundance in the last fraction (g11), although there was greater agreement within the top 25 group.
84 sequences that belonged (with 3 exceptions) to the top 250 group in all 3 data sets (Fig. 4e) were individually prepared by in vitro transcription and assayed for the binding of the nsp12 protein by bio-layer interferometry (BLI). 76 showed potentially relevant binding affinities (KD < 1 µM, Fig. 4d) with the majority of the aptamers exhibiting nanomolar and 10 aptamers even picomolar KD’s. Within this group of top-ranked sequences, there was no correlation between rank and KD. The binder with the highest affinity, Nsp-B113, had a KD of ~ 58 pM, which is one of the lowest values for protein-binding aptamers reported to date (Fig. 4c, Extended Data Fig. 4c,d).
We then screened 50 of these nsp12-binding aptamers for their potential to inhibit the RNA polymerase activity of the RdRp, of which nsp12 is one component, using an assay that monitors the extension of an RNA hairpin template, in the presence of the accessory factors nsp7 and nsp8 27. In this experiment, 7 aptamers showed a notable inhibitory effect (Extended Data Fig. 4g). To preclude the aptamers from being extended by the polymerase, which would result in heterogeneous mixtures that would be difficult to analyze, we destroyed their 3’-ends by a brief periodate oxidation step, rendering them unextendable. This treatment had only a minor effect on the KD (Extended Data Fig. 4e), but markedly improved the reproducibility of the assay and the inhibitory effect of 3 of the aptamers. 6 of the 3’-blocked aptamers inhibited the extension by more than 90%, based on single-time point measurements (Fig. 4f, g). Kinetic analysis revealed that aptamers Nsp-B30 and Nsp-B68 showed the most pronounced inhibitory effects (Fig. 4h, i), which were found to be stoichiometry-dependent (Extended Data Fig. 4i). These experiments highlight the potential of UltraSelex for the discovery of protein-binding aptamers with conceivably therapeutic potential.
SGREELI ranking reveals high-performance aptamers at default parameters
In our analysis, we used default values assuming 1% as a suitable cut-off for discriminating between weak binders and non-binders, and a linear enrichment trend (θ = 1). Using multi-parameter optimization with the f1 score as a criterion, we determined the extent to which the choice of θ and the top percentile affected the ranking (see Supplementary Note for details). This analysis revealed a negligible effect of the top percentile and different optimal θ values for the SiR and nsp12 selections. Comparison of the rankings for each UltraSelex dataset calculated once with the default and once with the optimized hyper-parameter settings showed a large overlap (e.g., 67–80% among the top 100 of the three SiR selections and > 94% in the three nsp12 datasets). Similar results were obtained when we examined the agreement of the auc-based rankings with the binding affinity of the nsp12-aptamers using the Normalized Discounted Cumulative Gain (NDCG) score (see Supplementary Note). This high level of agreement indicates that the rankings based on the default parameters produce high-performance aptamer candidates and should be suitable for most applications, and additionally highlights the robustness of this computational approach.