The CUTseq method, which we previously described16, enables a cost-effective preparation of highly multiplexed DNA sequencing libraries, by using restriction enzymes to barcode multiple samples before pooling them together into a single library. When the SARS-CoV-2 pandemic started, we wondered whether we could adapt CUTseq to sequence a large number of SARS-CoV-2 genomes in parallel, at an affordable cost. We first determined the SARS-CoV-2 genome breadth of coverage that could be achieved theoretically using restriction enzymes compatible with CUTseq16. In silico double digestion with two 4-base cutters, MseI and NlaIII, predicted that 74.1%, 94.4% and 98.8% of the SARS-CoV-2 genome would be covered at least once using 75, 150 and 300 nucleotides (nt) single-end sequencing (SE75/150/300), respectively (Fig. 1a, Supplementary Fig. 1a, Table 1 and Methods). Furthermore, in silico analysis showed that 18 of the 19 most frequent SARS-CoV-2 single-nucleotide variants (SNVs) reported worldwide to date1, including the A23403G variant, as well as 12 of the 13 SNVs characterizing the new SARS-CoV-2 lineage recently emerged in the UK2, could be theoretically covered using SE150, while 31 out of 32 of them could be covered by SE300 (Supplementary Fig. 1b).
Table 1
Theoretical fraction (%) of the SARS-CoV-2 genome covered by CUTseq using one or two restriction enzymes (MseI and NlaIII) and different sequencing read lengths (nt).
Read length
|
Region
|
NlaIII
|
MseI
|
NlaIII + MseI
|
75
|
Whole genome
|
34.9%
|
61.8%
|
74.1%
|
150
|
Whole genome
|
61.2%
|
87%
|
94.4%
|
300
|
Whole genome
|
84.4%
|
96.5%
|
98.8%
|
75
|
S
|
22.4%
|
63.7%
|
75.1%
|
150
|
S
|
40.6%
|
92.1%
|
98.5%
|
300
|
S
|
67.5%
|
100%
|
100%
|
In order to achieve high SARS-CoV-2 genome coverage even in samples containing a small fraction of viral RNA, we implemented a workflow that begins with a multiplexed PCR step in order to selectively amplify the whole SARS-CoV-2 genome (Fig. 1a, b). To this end, we adopted an assay developed by the U.S. Centers for Disease Control and Prevention (CDC) based on a previously published protocol for WGS of the Zika virus3 (Fig. 1a, Supplementary Table 1 and Methods). We first tested the performance of the multiplexed PCR assay, by preparing a sequencing library from one sample consisting of RNA extracted from the supernatant of a cell culture inoculated with SARS-CoV-2, using a standard library preparation kit (NEBNext) (Methods). Sequencing of this library on Illumina’s NextSeq 500 resulted in complete genome coverage (Supplementary Fig. 2a, b and Supplementary Table 2), although at non-uniform depth, most likely due to the uneven distribution of the amplicons along the SARS-CoV-2 genome (Fig. 1a). We then prepared a COVseq library from the same RNA sample, using MseI and NlaIII either alone or in combination to digest the pre-amplified SARS-CoV-2 genome (Supplementary Table 3 and Methods). In line with our theoretical expectations (Table 1), single enzyme digestion and SE150 resulted only in partial SARS-CoV-2 genome coverage, including the S region (Supplementary Table 2). However, double digestion resulted in almost complete genome coverage, with ~ 99% of the whole genome and S region covered at least once and ~ 95% covered at least ten times (Fig. 1c, d and Supplementary Table 2). Based on our calculations (Table 1), we anticipate that COVseq followed by SE300 would provide full coverage of the SARS-CoV-2 genome and S region.
We then sought to validate COVseq while implementing a low-input version of the method in order to reduce the cost per sample (see COVseq workflow #1 in Methods). To this end, we leveraged on the I-DOT nanodispensing device that we previously utilized for high-throughput CUTseq16 and used it to prepare a single multiplexed library using as little as 50 ng of each of 30 SARS-CoV-2 positive RNA samples (samples 1–30 in Supplementary Table 4 and Methods). As a comparison, we prepared NEBNext libraries from a recommended higher amount (250 ng) of each of the same 30 samples individually and sequenced all the libraries on NextSeq 500 (Supplementary Table 2). The number of reads per sample inversely scaled with the corresponding Ct value, both in the case of COVseq and NEBNext, and it was well correlated (Pearson’s correlation coefficient: 0.78) between the two methods (Fig. 1e, f and Supplementary Fig. 2c), suggesting that both methods quantitatively capture inter-sample differences in similar ways. Moreover, the proportions of reads aligned to the SARS-CoV-2 genome vs. other genomes and the fraction of unmapped reads were very similar between matched COVseq and NEBNext samples (Fig. 1g). In samples with high Ct (> 35)17, a sizable fraction of the reads were aligned to the human genome, independently of the method used to prepare the libraries (Fig. 1g). This reflects the fact that these samples were not treated with DNase during the RNA extraction procedure (Methods), thereby allowing for human genomic DNA present in the sample to become incorporated into the library when the viral load is low. To further validate COVseq, we assessed how many SNVs identified by NEBNext in the above samples were also detected by COVseq (Methods). In 21 samples with low Ct (≤ 35) and consequently a high percentage of sequencing reads aligned to the SARS-CoV-2 genome (Fig. 1g), the number of SNVs per sample was highly correlated between COVseq and NEBNext (Pearson’s correlation coefficient: 0.93). In total, 159 out of 228 (70%) SNVs identified by NEBNext in these samples were also detected by COVseq (Fig. 1h, i). Importantly, the five most frequent SNVs in these samples—including 4 out of the 19 most frequent SNVs reported worldwide2—were detected by both COVseq and NEBNext (Supplementary Fig. 3a). Only three frequent variants affecting three consecutive bases (G28881A, G28882A and G28883C) inside the N gene were not detected by COVseq, most likely due to the fact that they are located inside one of two ‘dark’ genomic regions more than 300 nt away from the closest NlaIII or MseI recognition site (Supplementary Fig. 3b). However, using one additional restriction enzyme (BfaI) that can cut within these regions is expected to result in complete coverage of these variants (Supplementary Fig. 3b).
Having demonstrated the feasibility and analytical validity of COVseq, we then sought to assess its scalability and reproducibility. To this end, we omitted the time-consuming step of purification of the multiplexed PCR amplicons (see COVseq workflow #2 in Methods) and prepared three replicate (Rep) COVseq libraries, each derived from 55 additional SARS-CoV-2 positive RNA samples (samples 31–85 in Supplementary Table 4 and Methods). This faster workflow allowed us to prepare COVseq libraries from a total of 165 samples in only two days with less than two hours hands-on time. To confirm the validity of this approach, we generated a reference (Ref) library from the same 55 samples, using COVseq workflow #1 (Methods). SE150 sequencing of the Rep and Ref libraries on NextSeq 500 confirmed the ability of COVseq to quantitatively capture inter-sample differences in the number of reads as well as percentage of reads aligned to the SARS-CoV-2 genome, and to do so in a reproducible manner (Supplementary Fig. 4a-d, Supplementary Fig. 5a-e, and Supplementary Table 2). Comparison of each Rep library with the Ref library showed a high degree of overlap (mean: 89.4%; range: 88.6– 90.0%) between the number of SNVs detected in samples with low Ct (≤ 30) (Fig. 1j and Supplementary Fig. 6a, c, e), confirming the validity of COVseq workflow #2. Pair-wise comparisons of the three Rep libraries showed that the breadth of genome coverage was highly correlated (Pearson’s correlation coefficient > 0.88) between replicates (Supplementary Fig. 6a-c), highlighting the reproducibility of COVsEq. This was further confirmed by the observation that 81.5% of all the SNVs identified in samples with low Ct (≤ 30) were detected in all three replicates, and 86.2% were detected in at least four of them (Fig. 1j and Supplementary Fig. 6d-j).
We then assessed whether COVseq could be potentially utilized for genomic surveillance of the ongoing pandemic. To this end, we examined the depth of coverage at 32 genomic sites encompassing the locations of the 19 most prevalent SNVs detected thus far worldwide2, and of the 13 SNVs in the recently emerged UK lineage18, in the 74 low Ct (≤ 35) samples sequenced by COVseq (Methods). On average, these sites were covered at a relative depth of 2,930 reads per million reads per sample (2,930 ± 7,291, mean ± s.d.) (Fig. 2a), suggesting that COVseq would be able to detect SNVs at these locations. One site inside the S region (C23709T) was covered at low depth in all the 74 samples (2 ± 3 reads per million reads per sample, mean ± s.d.) (Fig. 2a). The low coverage at this location can be explained by its distance from the nearest MseI and NlaIII site, which is larger than the sequencing read length (130 nt). However, sequencing with longer reads or using an additional restriction enzyme (BfaI) is expected to result in a higher coverage at this genomic position.
Next, we explored whether COVseq data are compatible with Nextstrain19, a popular tool for phylogenetic analysis of viral genomes. We applied Nextstrain to the same 74 low Ct samples sequenced by COVseq together with 991 samples randomly selected among 285,390 samples available for download from GISAID (https://www.gisaid.org/) as of 24 December 2020, including 4 sequences from the original Wuhan lineage and 292 from Italy (Methods). COVseq samples formed two distinct clusters within clades 20A and 20B, which also contained most of the GISAID sequences from Italy (Fig. 2b, c). Each cluster corresponds to the different phase of the 2020 pandemic when the samples were collected at two close-by hospitals in Italy (Mar-Apr and Oct-Nov, respectively, see Supplementary Table 4), and therefore possibly reflects the evolution of the virus in Italy during that period. Notably, none of the two clusters included the variants characterizing the recently emerged UK lineage18.
Lastly, we assessed the cost effectiveness of applying COVseq for genomic surveillance. Assuming to use COVseq workflow #1 (see Methods) and pool 96 samples into the same library, the average library preparation cost per sample would reach $21.48 and $20.28 for 10,000 and 100,000 samples processed, respectively (Fig. 2d and Supplementary Notes). Pooling 384 samples per library would result in a modest decrease in the cost per sample, with a further decrease when COVseq workflow #2 is used (Fig. 2d and Supplementary Notes), similar to what we did to prepare the replicate libraries described above. According to our analysis, the most cost-effective approach would be performing all reactions in nanoliter volumes, starting from the RT-PCR step until pooling all the samples for in vitro transcription (IVT) (COVseq workflow #3 in Methods). Indeed, a proof-of-principle experiment using two different amounts of synthetic SARS-CoV-2 RNA (5,000 and 10,000 copies) as input and SE150 sequencing, showed that near complete SARS-CoV-2 genome coverage can be achieved also with this workflow (Supplementary Fig. 7a, b, Supplementary Table 2). We then compared these costs to those that would have to be faced if sequencing libraries were prepared from individual samples using commercially available kits compatible with Illumina platforms (Supplementary Notes). Using the NEBNext library preparation kit or the transposase based Nextera kit would result, respectively, in one and two orders of magnitude higher cost per sample compared to any of the three COVseq workflows (Fig. 2e and Supplementary Notes). The cost would be one extra order of magnitude higher using the TruSeq kit from Illumina, even though this does not require a pre-amplification step by multiplexed PCR (Fig. 2e and Supplementary Notes). Collectively, our results indicate that COVseq is a sensitive, reproducible, scalable and highly cost-effective method that is suitable for mass-scale SARS-CoV-2 genomic surveillance.