RGEN-seq relies on Illumina sequencing workflow and is similar to CIRCLE/CHANGE-seq and SITE-seq in that it is based on tagging RGEN cut sites and selectively sequencing cleaved genomic DNA but it employs fewer biochemical steps and avoids the use of DNA circularization, PCR, or affinity capture enrichment (Fig.1a; Supplementary Protocol).
The high sensitivity in RGEN-seq is achieved by ligating cut ends with a truncated Y-type adaptor that lacks the p5 flowcell binding site (p7L-adapter). After a fragmentation step, the matching p5L adapter is ligated to remaining fragment ends. As a result, DNA strands containing ligated adapter oligonucleotides with both p5 and p7 sequences at their ends hybridize to the flowcell surface and generate clusters. All other DNA fragments, either without p5 and p7 sites (no ligation) or only with the p5 sequence on both fragment ends (those with two naked ends after fragmentation), cannot hybridize to the flowcell and are washed away during the clusterization process.
For calculating required library loading amounts, we ignore the concentration of p5/p5 molecules and use only the concentration of p5/p7 molecules measured by qPCR. A typical p5/p7 fragments yield from an RGEN-seq library prepared with the use of a single sgRNA varies from ~0.5 fM for a highly specific sgRNAs, to 2-3 fM for a promiscuous sgRNAs or for pooled sgRNAs. Considering that each library is prepared in triplicate and with a control library, there are enough p5/p7 fragments for loading on iSeq100 or NextSeq550 instruments (both require ~1.7 -2 fM for optimal cluster density). Typically, we achieve >80% occupancy rate on iSeq100, which results in over 6 M read output (8 M is the instrument’s maximum) with Q30 >90%. In case of NextSeq550, upon loading of the required amount of productive library fragments (~1.7 fM), we get a normal cluster density in the range of 180-260 K/mm2 with over 70% of them passing filters. Thus, the excessive non-productive p5/p5 molecules have little (if any) effect on the RGEN-seq library QC metrics.
Once sequencing is complete, reverse (R2) read sequences are mapped to a corresponding reference genome (Fig.1a). The use of R1 reads is optional, and generally we perform only 26 R1 sequencing cycles required for generating clusters. An alternative RGEN-seq variant, which requires only R1 reads can be used too (Supplementary Fig. s1); though in this case non-productive adapters can anneal to flowcell oligos and potentially reduce the sequencing run output in case of a pool of dozens of libraries.
In the resulting alignment, sites cleaved by the RGENs yield sequence read pileups that terminate at the cut site, ~3 nucleotides proximal to the PAM, producing a distinct signature that can be detected computationally. This signature is similar to those observed with DIGENOME-Seq23, SITE-Seq28, and DISCOVER-Seq22, which allowed us to use previously developed software for the analysis of RGEN-seq experimental data22,25,28.
RGEN-Seq was extensively tested on a Chinese hamster CHO-K1 clone bearing a 180 kb E. coli insert30. Isolated genomic DNA was targeted with seven sgRNAs, six of which tile across a 40 kb portion of the integrated E. coli region (Fig. 1b), and one targeting an integrated vector sequence. Cas9 cut sites were identified with BLENDER22 pipeline. Figure 1b shows an IGV snapshot of the read coverage within the aforementioned genomic region - note the absence of reads between target cut sites. For comparison, about 40X whole genome coverage is required to reliably detect the sgRNA target sites in the same CHO-K1 clone using DIGENOME-seq (Fig.1b, bottom panel). With the most sensitive library preparation protocol, we have identified ~100-450 cleavage sites per sgRNA, per million reads. Depending on a given cut site, RGEN-seq results in several hundred to several thousand-fold higher sensitivity than DIGENOME-seq; a result which is on par with other established in vitro methods.
Although RGEN-seq was designed as an amplification-free method, we didn’t want to compromise on its sensitivity and high-throughput capability. Therefore, we performed a great deal of protocol optimization in order to reduce the background read coverage and to enable the identification of off-target sites across a broad range of site cleavage sensitivity in vitro. A total of one hundred and fifty CHO-K1 and HEK293 RGEN-seq libraries using twenty sgRNAs were prepared testing different aspects of RGEN-seq protocol and sequenced on several Illumina sequencers (iSeq100, NextSeq 550, and NextSeq 2000). A list of applied laboratory reagents and biochemical steps (library components) used during the protocol optimization are listed in Supplementary Table s1. Ideally, in an RGEN-seq library, only genomic fragments terminated by genome editor cuts would generate clusters on the flowcell. In reality, noise is introduced from several sources as there are random background reads due to incomplete elimination of p7L-adapter, the presence of pre-existing non-specific DSBs or due to newly generated ones over the course of the procedure. Indeed, Figure 2a shows that blocking 3’ends of pre-existing single- and double-stranded breaks in the genomic template with terminal transferase and ddATP together with dephosphorylation of 5’ ends increase the percentage of the productive reads, i.e., those aligned to the RGEN-generated DNA ends. However, reducing the number of cleanup steps in the protocol has an even greater impact on the percentage of the productive reads (Fig. 2a) and so end-blocking as a pre-treatment was abandoned.
Figure 2 also demonstrates the beneficial effect of end repair (DNA Pol I, Large (Klenow) Fragment (3'→5' exo-) (LKFexo-) + dNTPs) of the Cas9-generated breaks before A-tailing compared to A-tailing alone. It is widely believed based on early observations made with plasmids and oligonucleotides that Cas9 predominantly generates blunt ends at a position three base pairs upstream of the PAM32,33. Given this assumption, one would expect on average an even read coverage on both sides of the cut. However, recent genome-wide screening of Cas9-generated breaks by PacBio’s and ONT’s nanopore real time sequencing34 methods revealed highly asymmetrical read coverage of the PAM proximal and distal regions. Figure 2b shows the coverage of the PAM proximal and distal regions is skewed in RGEN-seq too but can be leveled if the end repair (ER) step is done along with A-tailing. Together these data suggest blunt fragments are not the majority (or at least not always the majority27) of all ends in the Cas9 reaction and the actual structure of Cas9-generated breaks is more in line with temporal bidirectional degradation of Cas9 cleavage products, resulting in heterogeneous ends following initial DNA cleavage35,36. It is also worthy to note that this ER step hardly increase the chance of adapter ligation to random DNA breaks. As it was recently shown37, over 80% of mechanically-sheared DNA 3’-ends (MSD3Es) carry either phosphate groups or have unidentified abnormal chemical structures, which require an exonuclease activity along with DNA polymerase for efficient repair. Thus, we believe our ER approach, which is solely based on LKFexo- DNA polymerase, is rather specific for Cas9 induced breaks.
RGEN-seq demonstrates strong correlation of cut site scores between experiments. We performed independent library preparations with 6 guide RNAs targeted to E. coli insert sequence in the CHOK1 clone AD49ZG30 and found that BLENDER22 score of cut sites (roughly equal to the number of reads starting around a cut site) correlate with R2 > 0.92 (Fig. 3a). Figure 3b shows that although the number of recovered cut sites in CHO-K1 DNA is almost a linear function of sequencing reads used for the analysis, all target sites and off-targets with up to three mismatches (up to four mismatches in case of sgRNA6 and sgRNA7) can be recovered with just about 1 million reads. Figure 3c demonstrates the cleavage sensitivity of RGEN-seq and presents RGEN concentration-dependent recovery of target and off-target sites in HEK293 DNA using previously characterized sgRNAs18,23–26,28. The curves thus obtained resemble a sigmoid growth function with the logarithmic accumulation of cut sites in the ~32-256 nM RGEN concentration range, mirroring results of SITE-seq. Off-target sites with three and more mismatches contribute to the growing number of recovered sites, while in most cases cleavage sites with 0-2 mismatches are recovered at the lowest RGEN concentration used. It is interesting to note that the BLENDER score demonstrates a different dynamic (Fig. 3d) – the median score of putative cut sites tends to decrease after initial increase as the RGEN concentration increases for all cut sites regardless of the numbers of mismatches. This tendency is not RGEN-seq specific because it is observed in the SITE-seq method too (Supplementary Fig. s2). Such score dynamics might appear counterintuitive, but still can be explained by considering that S. pyogenes Cas9 is a single-turnover enzyme38, which initially finds high affinity genomic sites and cleaves them efficiently without apparent dissociation39,40. As such sites are occupied irreversibly by Cas9, and upon further increase of Cas9 concentration, multiple low affinity sites with a high number of mismatches and corresponding low cleavage efficiency contribute to the pool of cut sites, resulting in an overall decrease of cleavage score median (Fig. 3c,d). More details on the protocol optimization, such as effects of p7 duplex adapter versus p7L adapter, mechanical DNA shearing versus enzymatic fragmentation, can be found in Supplementary Information.
Comparison of RGEN-seq with other in vitro off-target methods.
To benchmark RGEN-seq against other biochemical genome-wide off-target detection methods, we compiled and re-analyzed data for eight different sgRNAs targeted to human genome sequences that had been previously characterized by DIGENOME-seq, CIRCLE-seq, CHANGE-seq, and SITE-seq across different human cell lines23,25–28. Of those eight sgRNAs, the two targeting the VEGFA site and the FANCF site had been tested by all five methods. The UpSet plot31,41 in Figure 4 shows shared and exclusive off-target sites for all method combinations. Sequencing read datasets from original publications were used to compare CIRCLE-seq, CHANGE-seq, and SITE-seq. To directly compare the performance of the methods, we sampled sequencing reads to ~2 million to match the RGEN-seq sequencing depth and used BLENDER22 to identify off-target cut sites; for DIGENOME, only originally identified off-target sites were used25. Figure 4A demonstrates that in terms of the number of identified off-target sites with FANCF-specific sgRNA, RGEN-seq is on a par with SITE-seq (211 and 228 sites, respectively), while CHANGE-seq is somewhat behind with 164 sites, and CIRCLE-seq with DIGENOME producing by far the fewest sites (75 and 46). It is interesting that CHANGE-seq detects the highest number of off-target sites if used with promiscuous sgRNAs targeting VEGFAs1, S4, and EMX1 sites (Fig. 5a,b). In the case of more specific sgRNAs, like S1, S2, RNF2, CD151, and XRCC5, RGEN-seq supersedes other methods in terms of the number of determined off-targets (Fig. 4b,c).
Though it is obvious RGEN-seq, SITE-seq, and CHANGE-seq are in the same league in regard to their sensitivity, which is substantially higher than that of the remaining two methods, they are not comprehensive, as it is apparent from method comparisons in the intersect and distinct modes (Fig. 4). Indeed, in the three aforementioned methods using FANCF-specific sgRNA, 88 out of 221, 83 out of 223, and 87 out of 164 off-target sites are unique to RGEN-seq, SITE-seq, and CHANGE-seq, respectively. Overall, the percentage of unique off-targets in RGEN-seq, SITE-seq, and CHANGE-seq varies from 25% to 67% for different sgRNAs (Fig. 4a,b). It is also true across different method combinations, for example, RGEN-seq and SITE-seq share 116 (FANCF panel) and 457 (VEGFAs1 panel) sites, from which 54 and 96, respectively, are unique for these two methods (Fig. 4a). However, all compared methods are much more consistent in determining off-target sites with 1-3 mismatches (Supplementary Fig. s3a,b). The latter data also reveal RGEN-seq produces the most output of off-target sites with 6-7 mismatches. It is also worthwhile to note that the number of distinct off-target sites recovered with different software pipelines can vary substantially (Supplementary Fig. s4).
Besides the analysis of intersecting and distinct off-target sites in individual methods and/or methods combinations, we examined the performance of scores assigned to off-targets by BLENDER22, which is a sum of the read ends in a 10-bp window centered on the cut site. Figure 5A compares RGEN-seq and SITE-seq in regard with the score correlation between off-target sites identified at different Cas9 RGEN concentrations in the range of 16-1024 nM. For each method only intersecting cut sites at all four concentrations were used for correlation calculation. Overall, RGEN-seq demonstrates a better score correlation, especially in the experiments performed at the lowest and highest RGEN concentrations. Of the four methods producing the highest number of off-targets, RGEN-seq shows the most monotonic change in scores, closely following with SITE-seq, which is reflected by the lowest score variance across off-target sites grouped by number of mismatches (Fig. 5b). It is interesting that if we stratify off-target sites by supporting score, the target sites in the CHANGE-seq and CIRCLE-seq methods always occur on top with substantial gap in score between the target and off-target cuts for seven out of eight tested sgRNAs (Supplementary Table s2); however, this is rarely the case for RGEN-seq or SITE-seq. The crucial difference between RGEN-seq and SITE-seq on one side and CHANGE-seq and CIRCLE-seq on the other side is that in the latter two methods RGENs operate on closed circular DNA rather than on linear DNA, and the used RGEN-to-DNA ratio is much higher (~ 30 fold) than one used in RGEN-seq and SITE-seq.