RNA polymerase inaccuracy underlies SARS-CoV-2 variants and vaccine heterogeneity

Both the SARS-CoV-2 virus and its mRNA vaccines depend on RNA polymerases (RNAP)1,2; however, these enzymes are inherently error-prone and can introduce variants into the RNA3. To understand SARS-CoV-2 evolution and vaccine efficacy, it is critical to identify the extent and distribution of errors introduced by the RNAPs involved in each process. Current methods lack the sensitivity and specificity to measure de novo RNA variants in low input samples like viral isolates3. Here, we determine the frequency and nature of RNA errors in both SARS-CoV-2 and its vaccine using a targeted Accurate RNA Consensus sequencing method (tARC-seq). We found that the viral RNA-dependent RNAP (RdRp) makes ~1 error every 10,000 nucleotides – higher than previous estimates4. We also observed that RNA variants are not randomly distributed across the genome but are associated with certain genomic features and genes, such as S (Spike). tARC-seq captured a number of large insertions, deletions and complex mutations that can be modeled through non-programmed RdRp template switching. This template switching feature of RdRp explains many key genetic changes observed during the evolution of different lineages worldwide, including Omicron. Further sequencing of the Pfizer-BioNTech COVID-19 vaccine revealed an RNA variant frequency of ~1 in 5,000, meaning most of the vaccine transcripts produced in vitro by T7 phage RNAP harbor a variant. These results demonstrate the extraordinary genetic diversity of viral populations and the heterogeneous nature of an mRNA vaccine fueled by RNAP inaccuracy. Along with functional studies and pandemic data, tARC-seq variant spectra can inform models to predict how SARS-CoV-2 may evolve. Finally, our results may help improve future vaccine development and study design as mRNA therapies continue to gain traction.


23
Since the COVID-19 pandemic began, we have witnessed the repeated emergence of new  CoV-2 lineages and viral variants of concern (VOC) with the potential to escape vaccine 25 protection. mRNA vaccines based on the Spike protein have been widely used to prevent 26 COVID-19 illness and have been shown to elicit a protective immune response against VOCs 27 after multiple doses 5,6 . Both SARS-CoV-2 and the mRNA vaccines rely on viral RNA 28 polymerases (RNAPs) for their respective replication or synthesis. RNAPs misincorporate 29 nucleotides at much higher frequencies than their DNA counterparts 3 , generating RNA variants 30 during synthesis which can fuel viral evolution. The mutation frequency of RNA viruses ranges 31 from 10 -4 to 10 -6 per base 7 . SARS-CoV-2 is commonly believed to acquire mutations more 32 slowly than other RNA viruses due to the proofreading activity of its exonuclease during RNA 33 synthesis 8 . However, to this point, no empirical studies have directly measured the frequency of 34 RdRp errors during replication (Extended Data Fig. 1a), which is a key parameter for modeling 35 virus evolution. Similarly, the nature and frequency of RNA variants generated during vaccine 36 synthesis remains unknown (Extended Data Fig. 1b). 37 Technical artifacts introduced during library preparation and next generation sequencing (NGS) 38 pose a major challenge to RNA variant detection. To remove these artifacts, new methods that 39 combine RNA-seq with aspects of circle sequencing and molecular barcoding were developed 40 (ARC-seq) to enable accurate variant calling 9,10 . While these advances in consensus sequencing 41 have reduced technical noise, they typically require considerable substrate (≥1 µg of RNA) and 42 are not feasible for low input samples 11 . Another major constraint is that variant discovery is 43 directly correlated with sequencing depth, and depth can be difficult to achieve for rare 44 transcripts or organisms with large and complex genomes. To overcome these limitations, we 45 developed targeted Accurate RNA Consensus sequencing (tARC-seq) (Fig. 1). tARC-seq 46 combines the basic features of ARC-seq with hybrid capture technology for target enrichment to 1 enable deep variant interrogation of low input SARS-CoV-2 samples. 2 tARC-seq validation in E. coli 3 We first validated tARC-seq in Escherichia coli, where hybrid capture produced a >30-fold 4 enrichment on average in unique consensus reads across a panel of twelve genes (Extended Data 5 Fig. 2). This enrichment allows us to make high confidence measurements of variant frequencies 6 by gene, depending on baseline expression levels. 7

RNA variants in WT SARS-CoV-2 8
We next used tARC-seq to examine SARS-CoV-2 RNA isolated from infected Vero cells. Since 9 the yield is low, E. coli mRNA was added during library preparation to serve as a carrier. With 10 hybrid capture some carrier RNA and host sequences are found in the final library; these can be 11 analyzed separately and serve as internal technical controls. When aligned to the E. coli genome, 12 these off-target reads recapitulate the known variant frequency for bulk E. coli mRNA (Fig.  13 2a) 12 . 14 Using tARC-seq, we achieved on average >16,000X depth in consensus reads across the 29,903-15 nucleotide genome of WT SARS-CoV-2. Positions were filtered for ≥50X depth, which excluded 16 only 0.1% of the genome. Clonal and subclonal variants present at >5% allele frequency were 17 discounted to enrich for de novo events (Extended Data Fig. 3a, Supplementary Table 1). We 18 determined that three or more cDNA copies (i.e. minimum family size) was sufficient to filter 19 out most technical artifacts during consensus calling without compromising read depth 20 (Extended Data Fig. 3b). The overall RNA variant frequency in WT virus was 1.16 x 10 -4 , or 21 approximately one in 10,000 nucleotides, meaning new virions harbor on average three novel 22 mutations each (Fig. 2b). These de novo variants arise from RdRp errors during genome 23 replication. Host RNA editing by enzymes like APOBECs may also contribute, as evidenced by 24 the elevated frequency of C>T transitions ( Fig. 2c and Extended Data Fig. 4)  enzymes appear more resistant. These results suggest that some genomic regions in SARS-CoV-38 2 may mutate faster or could be under higher selective pressure. To find a molecular basis for 39 this variability, nucleotide identity was analyzed across all hot and cold spots and we observed a 40 strong GC bias at positions with significantly elevated variant frequencies (Fig. 2h). 41

RNA variants in Alpha and Delta 42
As SARS-CoV-2 has evolved into several different lineages characterized by specific mutations 43 and VOCs, we next examined whether variant frequencies differ between viral lineages. 44 Applying tARC-seq to the B.1.1.7 isolate (Alpha), we measured an RNA variant frequency of 1 1.51E-4 (Fig. 2b). Significantly more point mutations were observed in Alpha, particularly G>A 2 substitutions, which occurred twice as often in Alpha as in WT (Fig. 2c). For DNA-dependent 3 RNAPs (DdRp), these G>A errors are a signature of reduced proofreading 9 . We speculate that 4 the increase in G>A events may reflect less RdRp proofreading in the Alpha lineage. While there 5 are a few candidate mutations in Nsp12 and other replication/transcription complex interactors 6 (Supplementary Table 1 Non-programmed template switching has also been implicated in insertion events and the 30 emergence of novel coronavirus strains 19 . Analyzing tARC-seq data, we observe many recurrent 31 junctions outside canonical TRSs in fusion transcripts (Extended Data 7a-b) as previously 32 reported 17 . Compellingly, a number of large insertions and deletions were observed in WT virus 33 (Fig. 3d) and related lineages (Extended Data Fig. 8a), many of which appear templated from 34 within the SARS-CoV-2 genome. We model two deletions by RdRp slippage at neighboring 35 repeat sequences during transcription (Fig. 3e). Also shown is a large 41-nucleotide insertion 36 (Fig. 3f), which can be modeled by microhomology-mediated template switching involving three 37 sequential jumps between discrete genomic loci. These templated indels represent the rare 38 scenario where RdRp realigns to the correct sequence after a template switching event. Sequence 39 complementarity between donor and acceptor sites facilitates the jump; however, it is unclear 40 what other features are involved in promoting template switching. As further evidence, we found 41 that many indels are clustered around certain sequences, which we've termed transcription "skip 42 sites" (Fig. 3c, Extended Data 8c, Supplementary Tables 5-6). Jumpy RdRp activity at skip sites 43 fuels a diverse repertoire of indels detectable by tARC-seq at a single locus (Extended Data Fig.  44 8b). For example, the indel frequency at position 23308, pictured in Fig. 3e-f, is elevated ten-fold 45 over the genome-wide average in both WT (g.23303: 2.97E-4) and Alpha (g.23308: 3.11E-4). 1 Skip sites often sit adjacent to regions of microhomology (Extended Data 8d) and 2 homopolymeric nucleotide runs (Extended Data 8e), which drive up local indel frequencies. 3

Non-programmed template switching drives pandemic genomic epidemiology 4
Signatures of aberrant RdRp template switching are present in sequences from real-world 5 pandemic data as well (Fig. 4a). One event, a GGG to AAC substitution, defines the 20B clade 6 ( Fig. 4b) from which the Alpha, Gamma, Lambda and Omicron lineages evolved. All of these 7 viral variants also contain lineage-specific multiple nucleotide alterations that can be modeled as 8 single RdRp misalignment and realignment events templated from within the SARS-CoV-2 9 genome ( Fig. 4c-e; Extended Data Fig. 9). Thus, template switching represents a major driver of 10 virus evolution. 11

RNA variants in the Pfizer-BioNTech COVID-19 vaccine 12
The Pfizer-BioNTech COVID-19 vaccine was the first mRNA vaccine approved for use and has 13 been an instrumental tool in our public health arsenal against SARS-CoV-2 20 . Vaccine mRNA is 14 the product of in vitro transcription (IVT) by T7 phage polymerase from a codon-optimized, 15 Spike-encoding DNA construct (Extended Data Fig. 1b) 21 . As a DdRp, T7 polymerase also 16 commits errors during transcription at frequencies ranging from 10 -4 to 10 -6 [22,23] . These errors 17 have not been studied in the context of vaccine production or vaccine-induced immunity. As 18 vaccine mRNA is abundant and amenable to sequencing by bulk RNA consensus sequencing 19 (ARC-seq) ( Fig. 1), we characterized the frequency and spectrum of RNA variants in the Pfizer 20 vaccine. 21 The overall variant frequency in vaccine mRNA is 2.34 x 10 -4 , or double that of WT SARS-22 CoV-2 ( Fig. 5a). At that frequency, full-length Spike transcripts contain on average one novel 23 variant each. The spectrum and clonality of variants in the S gene detected by (t)ARC-seq differs 24 between vaccine and WT virus (Fig.5, Extended Data Fig. 3b), likely reflecting differences 25 between the two RNAPs. Fewer deletions, but more insertions, were observed in vaccine mRNA 26 (Fig. 5a). The C>T base substitution frequency is lower in the vaccine given the absence of RNA 27 editing in vitro (Fig. 5b). Additionally, G>A transitions were more frequent compared to any of 28 the viral isolates tested (Fig. 5b), which may reflect the intrinsic lack of proofreading activity of 29 T7 RNAP 22 . Among the point mutations, 67% were nonsynonymous (Fig. 5c). Sequencing depth 30 was more consistent across the S gene in vaccine mRNA and position-wise tests revealed a more 31 even landscape of variant frequencies compared to viral samples ( Fig. 5d-e). 32

RNA errors during T7 in vitro transcription 33
The high variant frequency in the Pfizer vaccine could stem from a number of variables, 34 including template codon optimization, T7 polymerase biology, synthesis with modified 35 nucleotides, IVT conditions, and vaccine packaging and storage. To address this, a series of T7 36 IVT reactions was performed in parallel over a range of temperatures on two different templates: 37 (1) the native S gene from WT SARS-CoV-2, and (2) the codon-optimized Spike construct from 38 the Pfizer vaccine. The two templates differ significantly by nucleotide content, with the 39 modified vaccine template being 57% GC compared to 37% for the viral template. mRNA from 40 either IVT reaction had a reduced variant frequency compared to the vaccine ( Fig. 5f), but 41 differences between the IVT reactions were apparent as well (Extended Data Fig. 10a-b). 42 Significantly, IVT from the vaccine template produced fewer RNA variants of all types (1.06E-43 4) at an overall rate comparable to WT virus, suggesting that high GC content is protective 44 against transcription errors in vitro. We observed that IVT reactions from the viral template are 1 seven times more prone to insertion errors (4.9E-5 for insertions; 1.8E-4 overall) compared to the 2 vaccine template, implicating low GC content in template switching (Extended Data Fig. 10a). 3 We also found evidence of template switching in IVT transcripts (Extended Data Fig. 10d), 4 corroborating our findings in viral isolates. Modulating the temperature during IVT did not 5 appear to affect the variant frequency (Extended Data Fig. 10a-b). 6 After controlling for GC enrichment and standard T7 IVT conditions, we conclude that the 7 intrinsic error-prone nature of T7 RNAP is the main source of vaccine variants. However, other 8 features specific to Pfizer vaccine production, such as the use of N1-methylpseudouridine 24,25 , 9 mRNA purification, liposomal packaging, vaccine freezing and storage, or the scale up for 10 production, also contribute to the RNA variant frequencies observed in the vaccine samples.. 11

Discussion 12
Herein, we describe a targeted sequencing method for detecting RNA variants in rare transcripts 13 and low abundance samples. We have sequenced three SARS-CoV-2 isolates and established a 14 baseline variant frequency of ~1 in 10,000 per nucleotide for the virus. While higher than other 15 predictions 4 , this frequency is comparable to similar observations in poliovirus, which also 16 utilizes an RdRp for replication but lacks an associated proofreading activity 8,11,26 . The error 17 frequency estimations were previously based on the presence of a proofreading 3'-to-5' 18 exoribonuclease (ExoN, nsp14) in SARS-CoV-2 that is distinct from the viral RdRp 27 . The same 19 proofreading activity has been implicated in promoting template switching 28 , which we show 20 here is error-prone. Thus, our work highlights the promiscuous nature of SARS-CoV-2's RdRp 21 driven by nucleotide misincorporation and erroneous template switching, both controlled by the 22 same exonuclease. ExoN may be a key protein involved in tuning viral evolution. Together, 23 these results showcase the fundamental biology propelling viral diversity and evolution on a 24 massive scale during the COVID-19 pandemic. 25 In conjunction with viral data, we also measured RNA variants in the Pfizer BioNTech COVID-26 19 vaccine using ARC-seq. At a frequency of 1 in 5,000 nucleotides, the pace of vaccine variants 27 appears balanced against viral evolution and suggests that the majority of mRNA produced 28 encodes variant Spike proteins. The role of vaccine heterogenicity in the immune response is 29 currently unknown. Our data may provide insight to explain how mRNA vaccines against 30 COVID-19 offer broader protection against novel strains upon boosting 22,29-31 . Vaccine variants 31 could promote a more diverse immune repertoire, which offers benefits in the context of a 32 rapidly evolving virus. However, other uses like cancer vaccines or mRNA drugs may require 33 high fidelity transcription to reduce the risk of autoimmunity or improve clinical efficacy 2 .  Nature (2022)    across the SARS-CoV-2 genome reveals an uneven landscape. g, Base substitution frequencies 8 by codon mapped against Spike protein illustrate the distribution of hot and cold spots for RNA 9 variants. h, RNA variant hot spots show strong GC bias in vivo. Error bars represent Wilson 10 score 95% confidence intervals. For analysis, a maximum 5% clonality cutoff was applied to the 11 data and positions were filtered for ≥50X depth. A more stringent depth filter (≥10,000X) was 12 applied to the position-wise analyses (f, g) to minimize skewing due to inadequate sampling. connecting the 5' and 3' ends of each chimeric junction. These jumps signify programmed RdRp 5 template switching that functions in viral gene expression. b, TRS regions had a higher RNA 6 variant frequency compared to control regions in both WT and Alpha, suggesting that 7 programmed polymerase jumping reduces overall fidelity in these regions. c, Among other low-8 fidelity regions are indel hot spots, or loci with significantly elevated frequencies of insertions 9 and deletions. Indel hot spots are calculated by Fisher exact test, filtered for ≥10,000X depth, and 10 graphed by position for both Alpha and WT. d, The size spectrum of insertions and deletions in 11 WT virus reveals rare, large events, many of which appear templated from within the SARS-1 CoV-2 genome. e, f, Templated indels can be explained by non-programmed RdRp jumping and 2 realignment at sites of sequence complementarity outside of canonical TRSs. Three events from 3 tARC-seq data are modeled, all occurring at the same indel hot spot in the S gene (g.23308, 4 indicated by the red arrow in panel c). The full sequence of the 41-nt insertion (f) is: 5 TGGTTAAAAACAAATGTGTCAATTTCAACTTCAATGGTTTA. 6 7 1 Figure 4 | On the origin of Omicron. Pandemic data shows that many complex mutations in 2 SARS-CoV-2 appeared suddenly (red arrows). They likely did not accumulate gradually but 3 were driven by a single event: RdRp template switching. In the events modeled here, 3' 4 complementarity facilitates the misalignment and realignment of RdRp, creating complex 5 mutations that have fueled viral evolution. a, Phylogenetic tree based on sequence alterations 6 that define the 20B and Omicron clades; not drawn to any scale. Discrete, coordinated nucleotide 7 alterations are coded by color, and each template switching event is mapped out below (b-e). b, 8 A GGG>AAC mutation in the N gene occurred once early in the pandemic and helped define the 9 20B clade. c, A small hairpin in S has spawned the same 6-nt deletion on more than 3 separate 10 occasions, while other single events are specific to Omicron (d-e). Phylogenetic trees were 11 constructed in Nextstrain v2.35.0 32 from genomes sequenced between Dec. 2019 and March 12 2022. 13 1

Figure 5 | Spectrum and frequency of RNA variants in the Pfizer-BioNTech COVID-19
2 mRNA vaccine. Total ARC-seq was applied to SARS-CoV-2 Spike mRNA purified from the 3 Pfizer-BioNTech COVID-19 vaccine to assess the fidelity of T7 RNA polymerase in vaccine 4 production. a, The variant frequency for vaccine mRNA is 2.24 x 10 -4 (n=2). Samples come from 5 separate vials (labeled Pfizer "A" and "B") of the same lot. G>A transition was the dominant 6 event subtype in vaccine samples (b) and most variants were nonsynonymous (c). Overall, the 7 type of variants observed in Spike differs between vaccine and viral samples. d, e, Positional 8 frequencies are less variable and fewer hot spots are observed in the vaccine (VAF = variant 9 allele fraction). f, Compared to T7 IVT transcripts, vaccine mRNA has significantly more 1 variants, and high GC content was associated with fewer polymerase errors in vitro. Analysis as 2 in Figure 2. 3 4 Methods 1 2 RNA extraction 3 4 All RNA samples were processed under RNase-free conditions using dedicated equipment and 5 reagents. To maintain RNA integrity, samples were limited to ≤1 freeze-thaw cycle, kept on ice 6 whenever possible, and not subjected to high temperatures (≥65 C) in the presence of metal 7 cations 9 . RNA integrity was confirmed via Agilent Tapestation™ prior to sequencing library 8 preparation. 9 10 SARS-CoV-2 11 Ancestral SARS-CoV-2 was received from the World Reference Center for Emergeing 12 Viruses and Arboviruses at Synthesis Kit (NEB) and the ProFlex PCR System Thermal Cycler (Applied Biosystems). 7 Starting from 200 ng of template, reactions were allowed to proceed for two hours before 8 TURBO™ DNase treatment. Transcription products were purified by PCA for sequencing. 9 Notably, while IVT from the vaccine template was less efficient and had lower yields, vaccine 10 transcripts were more resistant to RNaseIII fragmentation during library preparation. 11 12 Library preparation and sequencing 13 14 Total accurate RNA consensus sequencing (ARC-seq) 15 1 µg of RNA was enzymatically fragmented with RNaseIII for 7 min to an average size 16 of 450 nt 38 . Following PCA cleanup, the remaining steps of library construction were guided by 17 ARC-seq design 10 . Briefly, RNA was 5' adenylated and ligated to unique molecular identifiers 18 (UMIs). Barcoding individual fragments before any amplification reactions downstream is 19 critical for error correction during consensus calling. The library was then circularized and 20 primed for rolling-circle reverse-transcription, yielding multiple conjoined copies of the original 21 fragment. After digesting these long cDNA oligomers, the individual monomers were PCR 22 amplified with tailed primers to add sequencing adapters and additional indexes. The final library 23 was analyzed by Tapestation prior to paired-end sequencing on the Illumina NextSeq 550 24 system. Importantly, reaction conditions were optimized throughout to reduce RNA damage 25 from heat and metal ion catalysis. 26 27 Targeted accurate RNA Consensus sequencing (tARC-seq) 28 For concentrated samples (≥ 100 ng/µL), fragmentation can proceed as described above. 29 However, for low-abundance samples it is recommended to add carrier RNA up to 1 µg. For the 30 SARS-CoV-2 experiments, a previously sequenced E. coli sample was mixed 4:1 with viral 31 RNA and served as both the carrier and internal control. Once fragmented, the targeted libraries 32 are prepared by the total RNA protocol up through the last indexing step. SARS-CoV-2 reads 33 were then enriched using the COVID xGen Hybrid Capture Kit (IDT) and 7-9 cycles of post-34 capture PCR to generate the final library. 35 36 Data analysis 37 38 Error-correction and variant calling 39 Illumina BCL files were converted to Fastq and demultiplexed from the 6 nt sample 40 barcode in the i7 index read (base masking: Y*,I6N*,Y*,Y*). Next, UMIs were extracted and 41 appended to the read headers before converting to unaligned BAM format. Specifically, the 11 nt 42 PCR UMI was read from the i5 index, and the 16 nt cDNA UMI was clipped from the leading 16 43 bases of Read 2. Reads with identical UMI tags, originating from the same RNA fragment, were 44 grouped into families and collapsed into consensus reads using a custom python script. 45 Consensus FASTQ sequences were aligned to the appropriate reference genome using BWA-46 MEM (accessions: NC_000913.3, NC_045512.2). Reads were then quality filtered, clipped, and 47 realigned with GATK v3.8. Finally, variants were called using mpileup and tabulated via a 1 custom R script. (Scheme S1) 2 3 Statistical information 4 Frequencies were calculated as variant count / total consensus nucleotides sequenced. 5 Proportion confidence intervals (two-tailed) were calculated for each frequency and error bars 6 represent Wilson scores of 95% confidence. To detect positions in the SARS-CoV-2 genome that 7 have a base substitution frequency different from the sample average, we constructed a 8 contingency table and performed Fisher's exact test comparing each position to the genome-wide 9 average. All positions that passed our initial filter (clonality ≤0.05, raw depth ≥50, fraction of N 10 ≤0.05) were included in this analysis. To control the false discovery rate, p-values were adjusted 11 using the Benjamini-Hochberg procedure. Positions with adjusted p-values <0.05 and 12 substitution frequencies that are elevated or reduced compared to the genome-wide average were 13 called as hot and cold spots, respectively. E. coli hybrid capture was performed on 2 biological 14 replicates. Viral analysis was performed on 1 biological replicate per genotype. Vaccine 15 sequencing data represents 2 vials of the same lot (n=2). In vitro experiments were performed in 16 triplicate at 30, 37, and 42 C (n=1 per temperature). 17 18 3D protein map of variant frequencies by codon 19 Average RNA base substitution frequencies were calculated for each codon in the Spike 20 protein by sum(base substitution count)/sum(adjusted depth). All positions that passed our initial 21 filter (clonality ≤0.05, raw depth ≥50, fraction of N ≤0.05) and were sequenced at a raw depth 22 ≥10000 were included in this analysis. The Spike protein full-length model 6vsb_1_1_1 39 was 23 then colored based on the codon average RNA base substitution frequency using Pymol 40 . 24 25 26 Code availability Python and R code are available on reasonable request. 27 28 29 Data availability Sequencing data are available through the Sequence Read Archive under 30 BioProject PRJNA824595. All other data are available from the corresponding author upon 31 reasonable request. 32 Hubert's writings https://berthub.eu/articles/posts/reverse-engineering-source-code-of-the-11 biontech-pfizer-vaccine/ (2020). 12

NAalytics. Assemblies-of-putative-SARS-CoV2-spike-encoding-mRNA-sequences-for-13
vaccines- BNT-162b2-and-mRNA-1273. (2021  (RNAP, in blue) that is responsible for both replication and gene expression. RNAP errors (red 6 'X') generate genetic diversity in SARS-CoV-2 and fuel the evolution of novel strains. b, Pfizer-7 BioNTech COVID-19 vaccine production begins with a SARS-CoV-2-Spike-encoding sequence 8 that is GC-enriched and codon-optimized. The template is placed downstream of the T7 9 promoter in a plasmid expression vector and linearized for in vitro transcription (IVT duplicates account for most of the pre-consensus nucleotides sequenced, and fold-enrichment 4 drops during consensus calling as duplicates of the same parent RNA fragment are collapsed into 5 a single read. The drop in enrichment between pre-and post-consensus reads is more pronounced 6 for low-expression genes like marR. This reflects both the efficacy of probe binding and the 7 scarcity of marR transcripts. Fold enrichment was calculated from the cumulative, normalized 8 sequencing depth across each gene in tARC-seq samples versus matched bulk ARC-seq controls. 9 b-c, RNA variant frequency analysis by gene is poorly powered using the original ARC-seq 10 method 10 . Coverage is highly variable between targets leading to inaccurate estimates of the true 11 variant frequency. Combining ARC-seq with hybrid capture (tARC-seq) significantly enriches 12 for reads across a 12-gene panel in E. coli, increasing the statistical power of the study (n=2, 13 reported separately). 14 15 1 Extended Data Figure 3 | Empirical validation of tARC-seq data analysis parameters. a, In 2 contrast to de novo variants, clonal and subclonal variants are not independent events and should 3 be filtered out during analysis. They typically arise from a single RdRp error and are 4 subsequently propagated through viral replication, inflating the true error frequency. To 5 determine an appropriate cutoff, all variants were graphed by the cumulative base substitution 6 frequency as a function of each variant's clonality. Relatively few clonal outliers were 7 discovered in the Pfizer vaccine compared to viral samples. A cutoff of 0.05 -or ≤5% allele 8 fraction -counted most variants on the curve while excluding clonal outliers. b, The overall 9 variant frequency (left y-axis, grey bars) in WT SARS-CoV-2 is graphed by consensus read 10 depth (right y-axis, purple line) over a series of minimum cDNAcs family sizes (minmem2). 11 Minmem2 is an expression of the minimum number of PCR copies required to form a cDNA 12 consensus sequence during consensus calling. A family size of 1 is equivalent to traditional 13 RNA-seq without error correction, while a family size of 3 was previously found to sufficiently 14 correct for technical artifacts 11 . exonuclease, and proteinase were relatively protected. Error bars represent 95% Wilson 5 confidence intervals. b, The various ORFs and features were previously graphed by variant 6 frequency (panel a above, and Fig. 2e). Functional regions with significantly elevated or reduced 7 frequencies relative to the sample average are indicated here (p-value <0.05 by Fisher's exact 8 test with Benjamini-Hochberg correction). Column 5 indicates whether the results from each 9 region were concordant between Alpha and WT, with overall agreement exceeding 66%. As with 10 the calculations for hot and cold spots, differences in sequencing depth between samples, 11 particularly for smaller regions, can impact the results. 12 13 1

Extended Data Figure 7 | Interaction between transcription regulatory sequences and 2
RdRp fidelity. tARC-seq detected chimeric junctions between canonical TRSs in WT SARS-3 CoV-2 (Fig. 3a). a, However, many of the observed junctions lay outside canonical regions, 4 suggesting non-programmed template switching by a promiscuous polymerase. b, While tARC-5 seq has the sensitivity to detect single events, the data was filtered further to include only high 6 confidence junctions with ≥100 observations. Even after increasing the stringency to remove 7 potential ligation artifacts (Fig. 1, step 2), many non-canonical junctions remained. Each arc 8 represents a chimeric alignment where the left and right x-intercepts correspond to the 5' and 3' 9 junction coordinates and line shading reflects frequency. c-d, While TRS regions comprise only 10 ~3.5% of the SARS-CoV-2 genome, they incur RNA variants at a higher frequency in both WT 11 and Alpha virus. Each TRS region (n=10) is small (<115 nt) and composed of one canonical 12 TRS site plus 100 flanking nucleotides. 13 14 1 Extended Data Figure 8 | Characterizing indel hot spots in SARS-CoV-2. a, In Alpha, single 2 nucleotide insertions and deletions predominate with additional peaks around multiples of 3 that 3 preserve the reading frame, as expected. Many large indels suggestive of RdRp template 4 switching were also observed. b, Indels are mapped by size (y-axis) and count (dot size) across 5 the SARS-CoV-2 genome. Promiscuous RdRp activity at skip sites fuels a diverse repertoire of 6 indels detectable by tARC-seq at a single locus. c, Indel hot spots observed in WT virus across 1 the S gene are graphed by coordinate and frequency. Overall, Spike had higher rates of indels 2 compared to the genome-wide average. d-e, The spectrum of indels at two particular hot spots 3 (c.1907 and c.3538, indicated by colored arrows) are reviewed in detail. d, With adjacent regions 4 of microhomology (red) to drive local template switching, an array of large deletions was 5 discovered at c.1907. This position was also identified as an indel hot spot in the Alpha variant 6 and many of the same deletions (-9del, -18del, -19del, -20del) were observed in that sample. This 7 pattern is suggestive of a transcription skip site where the sequence and local genome 8 architecture promote RNAP template switching and high variant frequencies. Interestingly, none 9 of the large deletions we found at c.1907 in Spike, including the in-frame events, are seen in 10 pandemic data, suggesting they are deleterious. In support of this, Nextstrain data shows this 11 region to be highly conserved across SARS-CoV-2 lineages and other members of the 12 Coronaviridae family. e, Homopolymeric nucleotide runs (red) can also trigger indel hot spots 13 through transcriptional slippage events, as shown at c.3858 in Spike. Hot spots represent 14 positions with significantly elevated indel frequencies as determined by Fisher's exact test (p-15 value <0.05 with Benjamini-Hochberg correction). 16 1