Development and Validation of a High-Throughput Next-Generation Sequencing Work ow for SARS- CoV-2 Whole Genome Sequencing: Results from Over 65,000 Clinical Cases


 Monitoring new mutations in SARS-CoV-2 provides crucial information for identifying diagnostic and therapeutic targets and important insights to achieve a more effective COVID-19 control strategy. Next generation sequencing (NGS) technologies have been widely used for whole genome sequencing of SARS-CoV-2. While various NGS methods have been reported, one chief limitation has been the complexity of the workflow, limiting the scalability. Here, we overcome this limitation by designing a workflow optimized for high-throughput studies. The workflow utilizes modified ARTIC network v3 primers for SARS-CoV-2 whole genome amplification. NGS libraries were prepared by a 2-step PCR method, similar to a previously reported tailed PCR method, with further optimizations to improve amplicon balance, to minimize amplicon dropout for viral genomes harboring primer-binding site mutation(s), and to integrate robotic liquid handlers. Validation studies demonstrated that the optimized workflow can process up to 2,688 samples in a single sequencing run without compromising sensitivity and accuracy and with fewer amplicon dropout events compared to the standard ARTIC protocol. We additionally report results for over 65,000 SARS-CoV-2 whole genome sequences from clinical specimens collected in the United States between January and September of 2021, as part of an ongoing national genomics surveillance effort.


Introduction
Coronavirus disease 2019 , caused by the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), emerged in the Chinese province of Wuhan in November 2019 1 and was declared a global pandemic by the World Health Organization after its rapid spread around the world (https://www.who.int/emergencies/diseases/novel-coronavirus-2019). Genomic surveillance has been employed to track the evolution of SARS-CoV-2 over the course of the pandemic, including the emergence of new variants that may affect viral transmissibility, infectivity, immune evasion, and vaccine e cacy [2][3][4] . It further informs public health decisions by facilitating the tracking of SARS-CoV-2 transmission, outbreak detection, and contact tracking, and has been used to help trace the origin of the pandemic [5][6][7] . As of Sept 2021, more than three million SARS-CoV-2 genomic sequences worldwide have been deposited and made publicly available in the global initiative on sharing all in uenza data (GISAID) database (https://www.gisaid.org/).
The initial SARS-CoV-2 genomic sequence was obtained through a metagenomic approach and con rmed by Sanger sequencing and PCR [8][9][10] . As the pandemic progressed, multiple NGS approaches have been utilized for SARS-CoV-2 sequencing, including shotgun metagenomics, hybrid capture enrichment, and amplicon-based sequencing [11][12][13][14] . Shotgun sequencing requires no prior knowledge of the targeted viral genome 15 but is limited by requirements for a high viral load and a higher sequencing depth. Hybridization approaches target genomic regions of interest by using biotinylated probes and ensure a more complete pro ling of regions of interest 16 . Since the initial release of the ARTIC SARS-CoV-2 sequencing protocol early in the outbreak (Jan 22, 2020, https://artic.network/ncov-2019), amplicon-based sequencing has become the primary choice for many labs around the world and a number of commercial kits are also available 14 . This approach utilizes rst-strand cDNA synthesis followed by genome ampli cation using viral genome speci c primers to produce amplicons that are tiled across the entire genome. Sequencing adaptors and barcode indices are added, using either ligation or tagmentation-based approaches. However, the complexity of those work ows limits their scalability.
Recently, Gohl et al (2020) 17 reported a cost-effective and highly scalable tailed amplicon method for SARS-CoV-2 sequencing, which bypasses costly and time-consuming library preparation steps.
In this study, we report an automated, high-throughput work ow for SARS-CoV-2 whole genome sequencing utilizing a 2-step PCR NGS library preparation method with modi ed ARTIC v3 primers. Our work ow is similar to the method described by Gohl et al (2020) 17 , where the Illumina sequencing primerbinding sites were added to ARTIC v3 gene-speci c primers for use in the subsequent PCR step to add the sequencing adapters and barcode sequences. However, we made further optimizations to improve amplicon coverage balance for high-multiplexing, to minimize amplicon dropout by employing touchdown PCR 18 , and to integrate automated liquid handlers for high-throughput surveillance studies. Validation studies performed on clinical specimens achieved robust whole genome sequencing coverage in a highly multiplexed setting with much reduced percent amplicon dropout even for the recently reported variants of concern. Using this method, we successfully converted our lower-throughput SARS-CoV-2 whole genome sequencing methodology to an automated, high-throughput process. We additionally report results for over 65,000 SARS-CoV-2 whole genome sequences from clinical specimens collected in the United States from January to September 2021 as part of the ongoing US Center for Disease Control (CDC) National SARS-CoV-2 Strain Surveillance (NS3) system (https://www.cdc.gov/coronavirus/2019ncov/variants/cdc-role-surveillance.html).

High-throughput NGS work ow optimization
Our laboratory has developed an automated, high-throughput SARS-CoV-2 NGS work ow ( Figure 1). This work ow utilizes a 2-step PCR NGS library preparation method: (1) gene-speci c PCR to amplify the SARS-CoV-2 whole genome, using primers published by the ARTIC network, with modi cations to add Illumina sequencing primer binding sites; and (2) index PCR to add specimen-speci c barcoded sequencing adapters by fusion PCR using the Illumina sequencing primer binding sites. We rst optimized primer pools to give even coverage across the SARS-CoV-2 genome. Gene speci c primers were pooled into 4 pools (pool 1A, 1B, 2A, 2B, Supplemental Table S1), and positive patient specimens with RT-qPCR cycle threshold (Ct) value of 24 (n=11) were tested. To evaluate the coverage uniformity, we computed the average coverage of each amplicon at a normalized depth of 200,000 mapped reads. All 98 amplicons produced adequate coverage with a mean coverage of 973X (SD, 719; CV, 73.9%) ( Figure  2A). However, some amplicons (n=7, 3 in the spike protein coding region) resulted in relatively lower coverage. When the same sample set was tested with a standard ARTIC v3 method, even coverage across the whole genome was achieved with a mean coverage of 1,390X (SD, 658; CV, 47.3%) at 200,000 mapped reads ( Figure 2B).
Next, to improve coverage for these low-performing amplicons, we increased the corresponding primer concentration in the same primer pool or added the primer set to a different pool. When tested with RT-qPCR positive patient specimens (Ct 24, n=5), the optimized primer pools improved the coverage of these low-performing amplicons by 2-to 5-fold, resulting in improved amplicon balance with a mean coverage of 893X (SD, 514; CV, 57.6%) ( Figure 2C). For the same sample set, standard ARTIC v3 method achieved 1,470X mean coverage depth (SD, 692; CV 47.1%) ( Figure 2D).
As the SARS-CoV-2 genome has evolved, various mutations were found at the ARTIC v3 primer binding sites. Upon sequence analysis of 3,506 positive samples collected and processed in January and February of 2021, we found that 96% (3,367/3,506) of the samples had at least one primer binding site mutation with a median of 3.0 (range: 1-9) mutations with the potential to impair PCR e ciency per affected sample (Supplemental Table S3).
To minimize adverse effects on sequencing coverage, we employed a touchdown PCR method by gradually reducing the annealing temperature from 65°C to 55°C (0.7°C/second). We compared 79 samples with two different annealing temperature settings. With this approach, the percent amplicon dropout due to primer binding site mutations was decreased from 0.50-0.01% (Table 1). The percent amplicon dropout was calculated from the number of amplicons that did not generate coverage divided by the total amplicon number. Each dropout amplicon was manually reviewed for the presence of a mutation at the affected primer binding site.

Assay Precision
We assessed intra-assay precision using 188 unique specimens with Ct values between 10 and 25 run in 3 replicates. Most (523/564; 92.8%) sequences met quality control metrics ( Figure 3A); 175/188 (93.1%) unique samples met quality control requirements for 2 or more replicates and were utilized for lineage and clade comparison. All 175 samples had concordant clade and lineage assignments for all replicates (  Figure S1A). The average % CV for variant frequency between replicates was 0.5% (min CV: 0.0%; max CV: 6.7%). We then assessed inter-assay precision for 168 unique specimens with Ct values between 10 and 25 over 3 independent runs. Overall, 479/504 (95.0%) sequences met quality control requirements ( Figure 3B). Over 95% (160/168) of samples had valid sequence data for 2 or more replicates and were utilized for clade and lineage comparison. All 160 samples had concordant clade and lineage assignments ( Table 2). On average, there were 5.9 ± 3.2 spike protein amino acid substitutions per sample with 99.5% (946/951, 95% CI: 98.8-99.8%) qualitative agreement between runs when the minority variant frequency was present in >20% of the reads (Supplemental Figure S1B). The average % CV for the variant frequency between replicates was 0.9% (min CV: 0.0%; max CV: 20.3%).

Assay Sensitivity
To demonstrate that our high-throughput work ow offers adequate sensitivity to yield complete SARS-CoV-2 genomes, we determined the limit of detection. The limit of detection was de ned as the highest Ct value that yielded valid sequence data with ≥97% SARS-CoV-2 genome coverage for at least 95% of the specimens tested. A total of 39 unique samples with Ct values between 17 and 32 were serially diluted, yielding 186 samples with Ct values between 17 and 35. The percent of samples that yielded ≥97% consensus sequence ranged between 91% and 100% up to a Ct value of 27 ( Figure 4A). Only 50% ( Figure 4B). When over 90% consensus sequence was generated, there was 100% concordance for clade and lineage assignments.

Accuracy study
Next, we assessed the accuracy of the consensus sequences generated by this work ow. Three commercial synthetic RNA positive controls (clade Alpha, Beta, and Gamma variants; Twist Bioscience) were tested 12 times each, using 4 replicates per set-up in 3 independent set-ups, by 3 different scientists. Of note, the synthetic RNA controls do not cover 100% of the SARS-CoV-2 genome, and amplicon dropout was expected. In 36 trials, the mean percent consensus sequence was 95.1% (min 93%; max 98%). All controls resulted in 100% positive percent agreement for clade and lineage assignments (95% CI: 90.4%, 100.0%) and 100% for spike protein mutations that were detected when the minor allele frequency was over 20% (95% CI: 98.8%, 100.0%).
In addition, a total of 84 qRT-PCR negative samples were evaluated, in 3 independent set-ups using 28 negative samples per run, by 3 different scientists. All negative samples gave negative results, with mean amplicon coverage of 0.2 (min 0.0%, max 0.6%) relative to the PCR plate average, yielding 100% samplelevel negative percent agreement (95% CI: 95.6%, 100.0%).

Robustness study
To assess assay robustness, a total of 2,688 samples ( Table S4, S5). Moreover, there was 100% clade and lineage concordance for all positive samples processed in the same 384-well plates with the 2 higher-coverage NTC wells. These results indicate that a low degree of NTC contamination is unlikely to interfere with accurate consensus sequence generation or with clade and lineage assignments.
Next, we analyzed the proportion of samples that generated 100% consensus coverage out of the samples that passed coverage QC (Table 3). With the optimized high-throughput work ow, over 95.9% of samples generated a complete consensus (range: 83.0-100% per clade). When the same set of samples were analyzed with a standard ARTIC v3 work ow, the proportion of samples with complete consensus sequence was much lower, 66.1% (range 0-100.0% per clade). In some variants (clades Beta, Delta, Epsilon, and Lambda) when sequenced with the ARTIC v3 work ow, amplicon dropouts were observed in almost all cases; dropout was resolved by employing our high-throughput work ow utilizing touchdown PCR. Large-scale surveillance study From January to September 2021, over 65,000 SARS-CoV-2 qRT-PCR-positive or transcription-mediated ampli cation (TMA)-positive specimens were successfully sequenced in our laboratory. Between January and the end of May 2021, the sequencing libraries were constructed using an in-house validated modi ed ARTIC v3 protocol 19 ; subsequently, libraries were constructed using the high-throughput work ow designed to improve scalability, reported in this study. Cumulatively, the Delta variant of concern (VOC) was detected at the highest frequency (35%), because of the steep increase in prevalence of this variant after June 2021 (https://covid.cdc.gov/covid-data-tracker/#variant-proportions). Viruses in clade Alpha were the second-most prevalent, accounting for 29% of specimens tested, re ecting their high prevalence before June 2021. Other variants being monitored (VBMs) were detected at much lower frequencies, including Epsilon (5%), Iota (5%), and Gamma (4%) ( Table 4). Beta, Lambda, Eta, and Kappa variants were found in less than 1% of the specimens sequenced. samples and controls on a NovaSeq 6000 instrument using an S4 ow cell 31 . However, this study described only a single sequencing run and did not utilize an automated high throughput work ow. In the current study, we were able to use the lower output and less expensive SP ow cell to sequence over 2,600 samples.
Here, we report an automated, high-throughput work ow for SARS-CoV-2 whole genome sequencing optimized for large scale surveillance studies, utilizing a 2-step PCR NGS library preparation method with modi ed ARTIC protocol v3 primers. Validation studies performed on 1,711 unique clinical samples demonstrated high precision (100% inter-and intra-assay precision) and accuracy (100% PPA and 100% NPA). Slightly reduced, but adequate sensitivity was achieved in comparison to the lower-throughput inhouse validated ARTIC v3 protocol: near complete consensus sequence was generated for samples of qRT-PCR Ct values less than 28 using the 2-step PCR work ow ( Figure 4) and with samples of Ct less than 30 by the ARTIC v3 work ow 19 .
For large-scale surveillance studies, integration of robotic liquid handlers is a crucial component. Various automation platforms are available for NGS library preparation. Upon evaluation of frequently utilized liquid handlers, the Agilent Bravo platform produced the most consistent results for our work ow in a 384-well format and was used for library preparation. As noted by Gohl et al (2020) 17 , a 2-step PCR method increased primer dimer formation which can result in lower sequencing quality. To remove any primer dimer present, we implemented 3 library cleaning steps: 1) Ampure bead clean up using a centrifugal washer in 384-well format; 2) manual Ampure bead clean up of a pooled library; 3) size selection using 1.5% gel on a BluePippin station prior to loading on the Illumina NovaSeq 6000. With these stringent clean-up steps, our validation runs achieved over 90% (range 90.6%-93.4%) high base call Phred quality scores (Q30 or higher).
The ARTIC amplicon-based whole-genome ampli cation of SARS-CoV-2 is considered a highly sensitive and accurate method and is currently employed by many laboratories around the world. One problem associated with ampli cation of the SARS-CoV-2 genome is amplicon dropout which can be caused when a given primer cannot stably hybridize to its speci c complementary sequence binding site because of novel mutations at the primer binding site. As the SARS-CoV-2 genome has evolved, various mutations were found in ARTIC v3 primer binding sequence 32 . This type of dropout often requires redesign and revalidation of primers. To minimize amplicon dropout, we employed a touchdown PCR method which reduced amplicon dropout ( Table 1). The effects of this improvement were most prominent for Beta, Delta, Epsilon, and Lambda clade specimens, achieving 100% consensus sequence coverage in most trials (Table 3). On the other hand, for these 4 clades, when analyzed by the standard ARTIC v3 work ow, nearly all samples failed to achieve full consensus sequence coverage due to primer binding site mutations causing amplicon dropout (Table 3, Supplemental Table S6). We extended our analysis to samples collected between May and September of 2021 to show that the bene t of using the touchdown PCR method is not limited to the validation sample set. Similar result was observed for Beta, Delta, Epsilon, and Lambda clade specimens with 95% of samples generating a complete consensus sequence using the high-throughput work ow whereas only 2.9% of samples generated complete consensus sequence by the ARTIC v3 work ow (Supplemental Table S7). The ARTIC network has published a newer ARTIC v4 primer set, which was designed to avoid current high-frequency mutations, with the goals of minimizing amplicon dropout and producing high-accuracy variant calling. However, analytical validation is required prior to using this new primer set in surveillance studies. We envision that a touchdown PCR method can be easily adapted to any work ow and can avoid frequent primer redesign and validation.
The 2-step PCR method generated 30% lower coverage compared to the ARTIC v3 method when the per sample coverage was averaged across all amplicons and normalized to 200,000 mapped reads per sample (Figure 2A, 2B). We postulate that this difference may be due to the higher primer dimer formation during the 2-step PCR as mentioned by Gohl et al (2020) 17 . Although the normalized coverage was reduced by approximately 30%, on average, for the 2-step PCR method, the normalized coverage still exceeded the minimum coverage necessary to reliably generate a viral genome consensus sequence or to detect viral variants present in at least 50% of the reads (consensus calling threshold). Indeed, the absolute coverage obtained even when processing >2,600 samples on the NovaSeq sequencer (median coverage 2,478X) greatly exceeded the absolute coverage obtained for ARTIC v3 runs on the MiSeq sequencer (add median coverage 1,098X with batch size of 192). Therefore, these differences in normalized coverage have minimal practical impact on the ability of the 2-step PCR method to generate reliable whole genome consensus sequences or to detect variants.
In summary, we report on the development and the validation of an automated 2-step PCR, highthroughput NGS work ow for SARS-CoV-2 whole genome sequencing with high sensitivity and accuracy with much less amplicon dropout, which we have implemented for high-throughput SARS-CoV-2 genomic surveillance. The combination of automation and optimized bench and analysis work ows has enabled e cient, large-scale SARS-CoV-2 surveillance studies.

Sample Collection
We obtained remnant RNA stored at -80°C, extracted from deidenti ed clinical specimens previously For surveillance studies, a random set of residual nasopharyngeal or oropharyngeal swab specimens from SARS-CoV-2 qRT-PCR positive or transcription mediated ampli cation (TMA) positive specimens, performed at Quest Diagnostics, were collected across the United States between January and September 2021. For SARS-CoV-2 qRT-PCR positive sample selection, Ct <25 was applied for quali cation. In accordance with ethical requirements and in accordance with US Department of Health and Human Services guidelines for the use of deidenti ed specimen remnants in research studies, specimens were deidenti ed prior to the study and were limited to remnant extracted RNA from discarded specimens previously submitted for commercial testing.

Two-Step PCR library preparation
For cDNA synthesis a mixture of 10.8 µl extracted RNA and 2.7 µl SuperScript IV VILO master mix (Thermo Fisher Scienti c, Waltham, MA) was incubated at 25°C for 10 min followed by 50°C for 10 min and 80°C for 5 min. To amplify the entire SARS-CoV-2 genome, four gene speci c PCR reactions were performed using each of the four primer pools (Supplemental Table S1). The primers were based on the ARTIC v3 primer set 33  for 50 min, and at 70°C for 10 min followed by cooling to 4°C. For the multiplex PCR reactions, 2.5 µl of the cDNA was used in 25 µl reaction mixture of Q5 Hot START DNA Polymerase kit (New England Biolabs, Ipswich, MA): 5 µl 5X buffer, 0.5 µl 10 mM dNTPs, 0.25 µl polymerase and 3.6 µl 10 µM primer pool 1 or pool 2, and 13.15 µl nuclease-free water. The primer pools were based on the ARTIC v3 primer pool scheme with further optimization to enhance low-performing amplicons (Supplemental Table S2 R.R.V. participated in study design and validation, viral genome sequencing, sample collection, assay validation and preparation and editing of the manuscript.
K.L. Designed, developed and validated the high-throughput laboratory automation used for viral genome sequencing in this study.
R.M.K designed and developed the viral genome database used in this study, implemented automated lineage and clade assignments to populate the database, performed data analysis and assisted with sample collection and edited the manuscript. B.A. assisted with high-throughput laboratory automation design and development and participated in viral genome sequencing.
R.O. provided scienti c oversight, resources and guidance and contributed to the design of the study.
L.B., A.S., D.X., R.C. and A.Gr. Designed, developed and deployed the informatics infrastructure and software necessary to operationalize the assay pipeline and to provide and interface to the laboratory scientists who performed the testing and reporting.
P.T. and F.L. provided medical oversight and assisted in securing necessary resources and funding for this study.
3 . Quinlan, A. & Hall, I. BEDTools: a exible suite of utilities for comparing genomic features. Bioinformatics 26, 841-842 (2010). Figure 1 High-throughput NGS work ow. Sequencing library preparation was performed on robotic liquid handlers. A. Extracted RNA of SARS-CoV-2 positive specimens were converted to cDNA and subjected to touchdown PCR with primers published by the ARTIC network with modi cations to add Illumina sequencing primer binding sites. Specimen-speci c indexed sequencing adapters were added by subsequent fusion PCR using the primer binding sites. Agilent Bravo was used for each process. B. The nal indexed PCR products were puri ed using Ampure XP beads using BlueCat BlueWasher, pooled into a single library using Hamilton Starlet, and size selected using Sage Science Blue Pippin. C. The nal library was sequenced on an Illumina NovaSeq 6000. D. An in-house developed bioinformatics pipeline Page 24/26 was utilized to generate consensus genomes and for variant calls relative to the reference genome MN908947.3, Wuhan-Hu-1.