Whole-genome epigenetic function annotation through sgRNA libraries synthesized by controlled template-dependent elongation


The epigenome is the set of DNA-associated proteins and chemical modifications to DNA that regulates gene expression during development and disease. While recent advances allow researchers to routinely profile the epigenome of a given sample, our understanding of the functions of individual epigenetic hallmarks remains nonspecific at best. Applying CRISPR screening to interrogate the function of individual epigenetic hallmarks genome-wide demands massive sgRNA libraries, which are unaffordable through commercial synthesis. Here we develop a high-throughput, cost-effective controlled template-dependent elongation (CTDE) approach that converts source DNA into sgRNA templates. Our screens encompass 3.8M CTDE-generated sgRNAs targeting all major H3K4me3 and CTCF hallmarks in mESCs and HepG2 and identify 20K essential epigenetic hallmarks, providing the first batch of functional epigenome annotation of H3K4me3 and CTCF hallmarks in mammals. As an application example, we show that an H3K4me3 hallmark orchestrates CDC42 levels and cell-cycle progression by promoting LINC00339 expression in HepG2.


Introduction
Humans have over 20,000 protein-coding genes, which account for about 2% of the genome [1]. [...] mapping through assays such as DNase hypersensitivity assays, DNA methylation assays and chromatin immunoprecipitation sequencing (ChIP-seq) assays [4-8]. These refined techniques allow most labs to routinely profile the epigenomic information of interest under defined experimental conditions.

The general function of many epigenetic modifications is known. For example, H3K4me3 (tri-methylation of the 4th lysine residue of the DNA packaging protein histone H3) is involved in the positive regulation of nearby gene transcription [9,10]. [...] However, functional epigenome annotation of these hallmarks is still scarce, because genetic screens have prioritized protein-coding genes or expressed non-coding loci [18,19].

Numerous reports have described pooled CRISPR screens that interrogate the function of regulatory elements using dense tiling sgRNA libraries designed against a limited number of genomic loci [8,20,21]. While these studies provide proof of concept for applying pooled CRISPR screening to the functional characterization of regulatory elements, the cost of synthesizing a dense tiling sgRNA library that covers an epigenetic hallmark genome-wide (around half a million USD for H3K4me3 or CTCF) is unrealistic and a major hurdle to further functional epigenome studies.

We have developed a simple and cost-effective controlled template-dependent elongation (CTDE) approach that can convert any DNA sample into a sgRNA library, which covers 98.47% of the effective CRISPR/Cas9 targeting sites within the source DNA. Significantly, over 99% of the CTDE-sgRNA targeting sequences have a protospacer adjacent motif (PAM) [22-24]. We have generated sgRNA libraries targeting all H3K4me3 and CTCF hallmarks in mESCs and HepG2.
In total, we have screened 3.8M sgRNAs and identified 14K (14,265) H3K4me3 and 6K (6,235) CTCF hallmarks that are essential for the proliferation of mESCs and HepG2. The mESC CTCF dataset shows that mESCs maintain a high proportion of non-essential cell-type specific CTCF hallmarks, which may be important for the implementation of pluripotency. Importantly, the HepG2 H3K4me3 dataset helps confirm that an essential H3K4me3 hallmark inside the intron of LINC00339 orchestrates cell-cycle progression and the expression of CDC42, a pivotal factor for the proliferation and invasion of cancer cells, through promoting [...]

Converting DNA to a sgRNA library through controlled template-dependent elongation (CTDE)

The CRISPR/Cas9 system has been developed into genome mutating tools with wide-ranging applications [25]. [...]

To directly generate large-scale sgRNA templates from source DNA, we fragmented the DNA template into pieces within 1 kb in length (Figure 1a). We then ligated the DNA fragments with A1 adaptors (red), immobilized them on streptavidin beads and washed away their positive strands under denaturing conditions (Figure 1a). Next, we annealed the priming primer (the positive strand of the A1 adaptor) onto the immobilized minus strand and extended the primer using DNA polymerase coupled with reversible terminator (RT) nucleotides (3'-O-N3-dNTP), which allow only a single nucleotide to be incorporated before their 3′-hydroxyl groups are restored (Figure 1a) [28]. We restored the 3′-hydroxyl group by tris(2-carboxyethyl)phosphine (TCEP) treatment and then carried out another round of nucleotide incorporation on top of the previous extension (Figure 1a) [28]. After 23 cycles, we blunted the 3′ terminus to obtain 23 bp DNA fragments (not including the adaptor) (Figure 1a). The most widely used Cas9, from Streptococcus pyogenes, recognizes the 5′-NGG-3′ PAM sequence (where "N" can be any nucleotide base).
If the last two nucleotides at the 3′ end of the 23 bp DNA are GG, they form an AscI cutting site after A2 adaptor ligation, which is used to select the DNA fragments with a 5′-NGG-3′ PAM (Figure 1a). Following PAM selection, we remove the NGG triplet using a type IIS restriction endonuclease (BbsI) and add an A4 adaptor onto the 3′ terminus for subsequent Gibson assembly into a sgRNA-expressing vector (Figure 1a) [29]. Since only two rare-cutting endonucleases are employed, the dropout rate of sgRNA templates caused by endonuclease cutting is very low (0.86% per mouse genome and 0.82% per human genome).

To test the efficiency of this technique, we used a 14.9 kb plasmid (lentiCRISPR v2) as a model DNA template. After implementing the CTDE steps described above (Figure 1a and S1a), we generated sgRNAs targeting 98.47% of the sites with a 5′-NGG-3′ PAM sequence (Figure 1b and S1b). As expected, few sgRNAs are generated from AT-rich regions because of the low complementary binding affinity between AT-rich sequences (Figure S1b). We also checked the capability of CTDE to enrich DNA fragments with a 3′ NGG triplet. Of the input DNA fragments, 6.23% are adjacent to a 3′ NGG triplet; after enrichment, the rate rises to 99.7% (15.98-fold) (Figure 1c). As expected, the sgRNA templates are predominantly 20 bp long (83.93%), and functional sgRNA templates (17-22 bp) account for 98.66% (Figure 1d) [26]. [...] perfectly match a position in their template, and 1.6% of the sgRNAs in the CTDE library carry one mismatch (Figure 1e). The error rate of the CTDE procedure is around 0.97 bases per 1000 bases. Around 90% of the CBDS sgRNAs perfectly match their targeting positions and around 7.5%/9.5% of the sgRNAs carry one mismatch (Figure 1e). The CBDS library, with an error rate of around 4.5-5.7 bases per 1000 bases, is thus comparable in quality to the CTDE library.
We also compared the amplification bias between the CTDE library and the CBDS library. The bias in sgRNA abundance is similarly low in both libraries (Figure 1f). Taken together, these data show that CTDE consistently converts source DNA into a sgRNA library whose quality is comparable to that of a library generated via chip-based DNA synthesis (CBDS).
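The selection logic described in this section can be illustrated with a minimal sketch. It enumerates every 20 nt protospacer that is followed by an NGG PAM on either strand of a template, and checks whether a 23 nt fragment ending in GG would complete an AscI site (GGCGCGCC) once an adaptor is ligated. The function names and the adaptor prefix `CGCGCC` are assumptions made for illustration; the actual A2 adaptor sequence is not given in the text.

```python
from typing import Iterator, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")
ASCI_SITE = "GGCGCGCC"  # AscI recognition sequence


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper- or lower-case DNA string."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def ngg_sites(seq: str, spacer_len: int = 20) -> Iterator[Tuple[str, int, str]]:
    """Yield (strand, start, protospacer) for each spacer followed by NGG.

    For the "-" strand, start is an offset on the reverse-complemented
    sequence, not on the original one.
    """
    seq = seq.upper()
    window = spacer_len + 3  # protospacer plus the NGG triplet
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for i in range(len(s) - window + 1):
            site = s[i:i + window]
            # Keep windows ending in GG and made only of unambiguous bases.
            if site.endswith("GG") and set(site) <= set("ACGT"):
                yield strand, i, site[:spacer_len]


def forms_asci_site(fragment: str, adaptor_5prime: str = "CGCGCC") -> bool:
    """True if the fragment's last two bases plus the (hypothetical)
    adaptor prefix spell out the AscI site at the ligation junction."""
    junction = fragment[-2:] + adaptor_5prime[:6]
    return junction.upper() == ASCI_SITE


if __name__ == "__main__":
    template = "A" * 20 + "TGG"
    for strand, start, spacer in ngg_sites(template):
        print(strand, start, spacer)
    print(forms_asci_site(template))
```

Dividing the number of such sites recovered in a sequenced library by the number enumerated here gives a coverage figure of the kind reported for the lentiCRISPR v2 test.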

Validations of CTDE sgRNA library screening
To implement mega-level screens, we infected cells with the lentiviral CTDE library expressing sgRNAs along with Cas9 protein (around 40K sgRNAs per batch) (Figure 2a). We then sequenced each sgRNA template at two time points (after 3 days of selection in puromycin medium, P1, and after 20 days of expansion following puromycin selection, P10) to calculate the abundance change of each sgRNA template from P1 to P10 (Figure 2a; details in Methods).

Current bioinformatics efforts for the analysis of pooled CRISPR screens are devoted to identifying genes rather than non-coding genomic loci [30]. In traditional CRISPR screens, a coding gene is usually targeted by multiple sgRNAs with similar mutational abilities, and the essentiality of a gene is evaluated through an integrative analysis of the abundance changes of these sgRNAs by commonly used algorithms such as MAGeCK [31]. When screening non-coding regulatory loci inside epigenetic hallmarks using a CTDE library, however, the objective becomes the identification of narrow essential genomic sites, the majority of which can only be efficiently targeted by one sgRNA. Traditional calling algorithms are unreliable in this scenario because they report numerous false-positive significantly changed sgRNAs (ssgRNAs) (Figure 2b).

During the screen, many minor uncertain factors randomly bias the abundance of each sgRNA, so that in the absence of selection pressure the abundance changes of the sgRNAs follow a normal distribution [32]. Some sgRNA disruptions cause a negative selection pressure, systematically bias their abundance in P10 (<20%), and cause their abundance changes to follow another normal distribution.
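The idea of separating a depleted population from a normally distributed null can be sketched as follows. This is not NSgRNAShot, whose details are given in the Methods. It is a simplified stand-in that assumes the input is one log2 fold change (P10 over P1) per sgRNA, fits the null distribution robustly from the median and the median absolute deviation, and applies a Benjamini-Hochberg cutoff at FDR < 0.1. The function name and the choice of a robust fit are assumptions made for illustration.

```python
from statistics import NormalDist, median


def call_depleted(log2fc, fdr=0.1):
    """Return sorted indices of sgRNAs called as negatively selected.

    log2fc: list of log2 fold changes, one per sgRNA (assumed input format).
    """
    # Robust null fit: the median and MAD are barely moved by a small
    # depleted subpopulation, unlike the mean and standard deviation.
    center = median(log2fc)
    mad = median(abs(x - center) for x in log2fc)
    sigma = 1.4826 * mad  # MAD-to-sigma factor for a normal distribution
    if sigma == 0:
        raise ValueError("no spread in the data; cannot fit a null model")
    null = NormalDist(center, sigma)

    # One-sided p-value: probability of a change at least this negative.
    pvals = [null.cdf(x) for x in log2fc]

    # Benjamini-Hochberg step-up: find the largest rank that passes.
    order = sorted(range(len(pvals)), key=pvals.__getitem__)
    m = len(pvals)
    last_pass = 0
    for rank, idx in enumerate(order, start=1):
        if pvals[idx] <= fdr * rank / m:
            last_pass = rank
    return sorted(order[:last_pass])


if __name__ == "__main__":
    # 200 evenly spaced null quantiles plus 10 strongly depleted sgRNAs.
    null_part = [NormalDist(0, 0.3).inv_cdf((i + 0.5) / 200) for i in range(200)]
    data = null_part + [-4.0] * 10
    print(call_depleted(data))
```

The farther apart the two distributions are, the cleaner the separation, which matches the condition stated in the text that the distance between the two normal distributions must be large enough.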
While the 180 distance of above two normal distribution is large enough ( Figure S2a), we can 181 efficiently identify the sgRNAs (FDR<0.1) that leads to negative selections using a self-182 developed straight-forward approach (NSgRNAShot; details in method), which has 183 96.42% precision rate and 96.27% recall rate on a simulation dataset ( Figure S2a; Table  184 S1). Most importantly, on identical testing datasets, NSgRNAShot has significantly 185 lower false positive rates than MAGeCK (Figure 2b). Additionally, we compared the 186 ability of NSgRNAShot to call true positive ssgRNA with MAGeCK using two 187 published essential gene screen datasets. Due to significant false positive rates, 188 MAGeCK reports a noticeably higher amount of ssgRNAs than NSgRNAShot, most of 189 which target non-essential genes (Figure 2b  H3K4me3 hallmarks are essential regulatory elements for mESCs self-renewal ( Figure  229 3a-d; Table S4). We verified three randomly picked ssgRNAs in non-coding regions. Given that the H3K4me3 elements are generally wide, the location of these ssgRNAs 242 should reflect the core regulatory sites within the elements. 243 Many H3K4me3 hallmarks locate on exons ( Figure 3c; Table S4). Exon mutations 244 will inactivate their genes and cause stronger phenotypes rather than disrupting 245  Table S5). 261 Thus, we have successfully performed a genome-wide CRISPR-screening to 262 interrogate the essential H3K4me3 regulatory elements for mESCs self-renewal.  Table S3 and 277 S6). The sgRNA density is 24 sgRNAs per kb in CTCF strong binding sites (top 100). 278 As CTCF elements display a consistent size, whilst maintaining a moderate diversity 279 of the input DNA amount, that of which is significantly smaller than the input amount 280 of H3K4me3, the abundance distribution of sgRNAs inside CTCF hallmarks is notably 281 more even than that of H3K4me3, although the pattern of their template DNAs is still 282 observed (Figure 4b and S4a). 
We identified 3038 CTCF ssgRNAs, which indicates that the corresponding CTCF hallmarks (47.02% in intergenic regions) are essential for mESC self-renewal (Figure 4a-c and S4b; Table S4). ssgRNAs appear in regions ranging from weak to strong CTCF elements. Among all essential CTCF elements, 87.07% are targeted by one ssgRNA, whilst 12.93% are targeted by multiple ssgRNAs (Table S3). [...]

Distal promoters, introns and proximal promoters are the major regions in which CTCF hallmarks are located (Figure S4b; Table S4). The GO terms of genes close to ssgRNAs targeting introns and proximal promoters relate to essential biological processes and tissue development (Figure 4e and S4e; Table S7). Because the expression of differentiation- and development-related genes generally antagonizes mESC pluripotency and self-renewal, we believe that these essential CTCF hallmarks inhibit the expression of those genes. Unlike H3K4me3 ssgRNAs, only a minor fraction of CTCF ssgRNAs target exons (Figure S4b; Table S4). Although most GO terms of the expressed genes containing these exons are also related to essential biological processes, their significance is much lower than that of H3K4me3 ssgRNAs (Figure S4e and S3d; Table S5 and S7). We reason that the major functions of these essential CTCF hallmarks go beyond promoting the expression of the genes in which they sit.

mESCs differentiate into various cell types during embryonic development. Previous studies have shown that the spatial structure of chromosomes rearranges to fit the changes in gene expression patterns during differentiation [38]. We compared the CTCF hallmarks with those of 16 mouse cell types/tissues and found that 59.63% of the CTCF hallmarks in mESCs are cell-type specific and 40.37% are common (Figure 4f; Table S8) [37,39].
The common CTCF hallmarks should help maintain the universal spatial structure of chromosomes, while the cell-type specific CTCF hallmarks should be either mESC specific or pre-loaded hallmarks for cells at later stages of differentiation. Consistent with this supposition, the percentage of cell-type specific hallmarks among the essential CTCF hallmarks of mESCs (28.85%) is significantly smaller than the percentage of cell-type specific hallmarks among all CTCF hallmarks (59.63%) (Figure 4f; Table S8).

Annotation of the essential H3K4me3 hallmarks in human liver cancer cells
Whole-genome sequencing has surveyed large sets of cancer genomes and studied the role and extent of single-nucleotide variants (SNVs), small insertions/deletions (indels) and larger structural variants in cancers [40,41]. While the initial focus on genetic variation in protein-coding regions has dramatically expanded our knowledge of cancer genetics, the remaining (>90%) non-coding variation is much more difficult to understand and has remained largely unexplored [42], owing to a lack of functional annotation of the regulatory elements it contains.

To interrogate essential activating regulatory elements genome-wide in human liver cancer cells (HepG2) (Figure S5a-b), we performed an H3K4me3 CRISPR screen as described above (Figure 1a and 2a). In total, we screened 1.19M sgRNAs targeting 80.91% of the H3K4me3 hallmarks in HepG2 (Figure 5a-c and S5c-d; Table S3 and S9). In highly H3K4me3-enriched regions (top 100), the sgRNA density is [...] (Table S3). Detailed positions of ssgRNAs inside H3K4me3 elements can be found in Supplemental Table 4, indicating the core regulatory sites of these essential H3K4me3 elements. We also verified three randomly picked ssgRNAs targeting non-coding regions, and all of them significantly inhibit HepG2 growth (Figure 5d and S5e-f). The H3K4me3 ssgRNAs are evenly distributed on most chromosomes, with regions on [...]

To interrogate essential CTCF hallmarks in HepG2, we generated CTCF sgRNA libraries and performed the CRISPR screen as described above (Figure 1a and 2a). [...] Table S3 and S11). In strong CTCF binding regions (top 100), the sgRNA density is 63 sgRNAs per kb. As expected, the abundance distribution of sgRNAs inside CTCF hallmarks is notably more even than that of H3K4me3, while the pattern of their template DNAs is still observed (Figure 6b and 5b).
We identified 4628 CTCF ssgRNAs, which represent 3583 CTCF hallmarks (44.63% inside intergenic regions) essential for HepG2 growth (Figure S6b; Table S4). ssgRNAs appear in regions ranging from weak to strong CTCF elements (Figure 6b). Among all essential CTCF elements in HepG2, 78.76% are targeted by one ssgRNA, while 21.24% are targeted by multiple ssgRNAs (Table S3). [...] Table S12).

Only a minor portion of CTCF ssgRNAs target exons (Figure S6b). The majority of GO terms of the expressed genes whose exons are targeted by ssgRNAs are related to essential biological processes (Figure S6f; Table S12). Because the major functions of CTCF hallmarks on exons go beyond promoting the expression of the genes in which they sit [15], the GO term significance is lower than that of H3K4me3 ssgRNAs targeting exons (Figure S6f and S5g; Table S12 and S10).

The majority of CTCF hallmarks in HepG2 (79.79%) are common among 55 human cell types (Figure 6f; Table S13). As HepG2 is a tissue-specific cell line, its spatial chromosome structure has been adapted to the requirements of liver functions, and it does not need to keep many spatial chromosome structures that are specific to other cell types. Unlike in mESCs, the percentage of cell-type specific hallmarks among the essential CTCF hallmarks corresponds with the percentage of cell-type specific hallmarks among all CTCF hallmarks in HepG2 (Figure 6f and 4f). [...]

CDC42 also transduces growth and adhesion signals to drive cell-cycle progression from G1 to S phase [49,50]. Therefore, LINC00339 may promote CDC42 expression in HepG2. As expected, we found that both disruption by the ssgRNA (chr1-22352881) and knockdown of LINC00339 significantly decrease the expression of CDC42 (Figure 7g [...]

[...] This type of approach needs more than three paired-guide RNAs for each hallmark, and the library needed to cover all CTCF and H3K4me3 hallmarks (79K in mESCs and 69K in HepG2) would be too costly.
In addition, once asynchronous cutting happens, the sequence of the first cutting site will change [53] and the hallmark can no longer be deleted by the paired-guide RNAs. The paired-guide RNA approach therefore requires CRISPR/Cas9 to cut both sites simultaneously, which leads to a lower efficiency than the single-sgRNA system. Thus, screening based on a paired-guide RNA library requires a more sensitive readout than the survival readout used in this work.

The essential hallmark datasets of H3K4me3 and CTCF in mESCs and HepG2 [...] shows that three randomly selected CTCF ssgRNAs significantly inhibit mESC self-renewal.

ssgRNA is named by three factors (chromosome number; targeted strand, + or -; mapping position).

(e) Top 10 GO terms of the genes whose introns are targeted by CTCF ssgRNAs.

(f) Cell-type specific analysis of essential CTCF hallmark (targeted by ssgRNA) in mESCs self-

(e) Top 10 GO terms of the genes whose proximal promoters are targeted by H3K4me3 ssgRNAs.

(f) Top 10 GO terms of the genes whose UTRs are targeted by H3K4me3 ssgRNAs.

Real-time cell numbers are plotted.

(e) All three GO terms of the genes whose proximal promoters are targeted by CTCF ssgRNAs.

(f) Top 10 GO terms of the genes whose exons are targeted by CTCF ssgRNAs.
NGG PAM selection
We ligate the A2 adapter and amplify the library for ten cycles. After gel extraction of the amplified library, we digest the DNA with AscI (NEB, R0558). Then we capture the library onto streptavidin beads.

NGG PAM removal
We ligate the A3 adapter and amplify the library for ten cycles, and then digest the library with BbsI (NEB, R3539). We fill in the gap with T4 DNA polymerase (NEB, M0203), ligate the A4 adapter, and amplify the library with the KAPA HiFi polymerase mix (Roche, KK2631) for ten cycles. We use 20% TBE-PAGE to size-select the library (61 nt). After releasing the DNA from the gel, we amplify the library by PCR with KAPA HiFi polymerase and primers (the sequence is below) and size-select it on a 2% agarose gel 1

Commercial synthesis Neo screen library construction
From the 795 bp neo fragment sequence, we picked all gRNAs with an NGG PAM, 115 gRNAs in total. We also chose 20 gRNAs from the GeCKO negative controls (they cannot target the hg19, mm9 or neo reference sequences). We ordered these gRNAs from an oligo synthesis company (Shangya) in a uniform format: 5'-GTGGAAAGGACGAAACACCGNNNNNNNNNNNNNNNNNNNNGTTTTAGAGCTAGAAATAGC-3', where N20 represents a neo gRNA or negative control sequence. We combined the 115 neo gRNAs and 20 negative controls in equal amounts, and amplified 5 µl of the library (10 µM) with 2X HiFi DNA polymerase and the Array-F and Array-R primers for 5 cycles. The 140 bp library was gel extracted and assembled into the lentiCRISPR v2 plasmid. The 20 negative controls were separately prepared as above and assembled into the lentiCRISPR v2 plasmid, and this library was spiked into the CTDE-neo library.

[...] display the ratio of sgRNAs in each "NGG" locus and AT in each bin of the lentiCRISPR v2 sequences.

2. We calculated the coverage of the detected sgRNAs on the "NGG" loci in the designed reference with the following steps:
Step 1, we count all possible "NGG" loci in the plasmid lentiCRISPR v2 sequences as $N_{\mathrm{all}}$.
Step 2, we count the number of "NGG" loci covered by sgRNAs from the sgRNA set described above as $N_{\mathrm{cov}}$. Then we calculated the coverage of the detected sgRNAs on the "NGG" loci in the designed reference (Cover) as follows:
$\mathrm{Cover} = N_{\mathrm{cov}} / N_{\mathrm{all}}$

3. We next analyzed the enrichment of the detected sgRNAs on the "NGG" loci with the following steps:
Step 1, we sequenced the plasmid lentiCRISPR v2, then trimmed the sequenced reads to 23 bp and aligned them to the reference set using bowtie, not allowing any base mismatch (-v 0). The reads mapped on the plasmid lentiCRISPR v2 sequences and those mapped on the "NGG" loci of the sequences were counted. Next, we calculated the percentage of reads on the "NGG" loci among the mapped reads.
Step 2, we sampled one million reads from each of the three sgRNA libraries, then obtained the sgRNAs from the "Sequencing data pre-processing" analysis and extracted the sgRNAs of 20 bp length.
Step 3, we processed the sgRNAs in two different ways. In one, we removed their "NGG" tails and mapped them to the reference set by bowtie, not allowing any base mismatch (-v 0). In the other, we mapped them to the reference set by bowtie, allowing one base mismatch (-v 1). After alignment, we counted the number of mapped [...]

[...] sgRNAs of different length with the following steps:
Step 1, for each of the three sgRNA libraries, we obtained sgRNAs from the "Sequencing data pre-processing" analysis and mapped them to the reference set by bowtie, allowing one base mismatch (-v 1).
Step 2, we grouped sgRNAs by their length. For the sgRNAs of length $i$, we counted the number as $N_i$, and calculated their proportion as follows:

$P_i = N_i / N$

Here $N$ is the number of all mapped sgRNAs. We repeated the proportion calculation for the sgRNAs of 16-24 bp from the three sgRNA libraries.

5. To evaluate how faithful the sgRNA synthesis was, we mapped 16-24 bp sgRNAs to the reference set in the following priority order: with no mismatch, with one mismatch, with two mismatches (not including mismatches in the PAM site). For the PAM site, we only tolerate a mismatch at the first base. The error rate was calculated as the total number of mismatched bases divided by the total number of bases of mapped sgRNAs (not including the PAM site).

[...] method should not detect many significantly selected sgRNAs and genes between these samples [5]. We therefore used NSgRNAShot and MAGeCK to detect possible false-positive ssgRNAs with FDR < 0.1 between two replicates of ESC treatment samples, based on benchmark data. When multiple sgRNAs are available for one gene, MAGeCK demonstrated the best performance in detecting essential genes [6]. We therefore evaluated the sensitivity of the ssgRNAs identified by our method with the following strategy: taking the essential genes reported by MAGeCK as the gold standard set, a good method should largely report ssgRNAs from essential genes but not from other genes.
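The library quality metrics described above (coverage of "NGG" loci, proportion of sgRNAs by length, and per-base error rate) are simple ratios, and a minimal sketch may make the bookkeeping concrete. The function names and input formats below are assumptions made for illustration; in practice the inputs would be derived from the bowtie alignments.

```python
from collections import Counter


def coverage(covered_loci, all_loci):
    """Fraction of all NGG loci that are covered by at least one sgRNA."""
    all_set = set(all_loci)
    return len(set(covered_loci) & all_set) / len(all_set)


def length_proportions(mapped_lengths, lo=16, hi=24):
    """Proportion of mapped sgRNAs at each length from lo to hi.

    The denominator is the number of all mapped sgRNAs, so the values
    sum to less than 1 if some lengths fall outside the lo-hi range.
    """
    counts = Counter(mapped_lengths)
    total = len(mapped_lengths)
    return {i: counts.get(i, 0) / total for i in range(lo, hi + 1)}


def error_rate(alignments):
    """Mismatched bases divided by aligned bases.

    alignments: iterable of (n_bases, n_mismatches) pairs, one per mapped
    sgRNA, with PAM bases already excluded from both numbers.
    """
    total_bases = 0
    total_mismatches = 0
    for n_bases, n_mismatches in alignments:
        total_bases += n_bases
        total_mismatches += n_mismatches
    return total_mismatches / total_bases


if __name__ == "__main__":
    print(coverage({1, 2, 3}, {1, 2, 3, 4}))
    print(length_proportions([20, 20, 19, 17]))
    print(error_rate([(20, 0), (20, 1)]))
```

Multiplying the result of `error_rate` by 1000 gives the errors-per-1000-bases figure used when comparing the CTDE and CBDS libraries.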

ChIP-seq data analysis
For ChIP-seq of H3K4me3 and CTCF in mESC and HepG2 cell lines, libraries were sequenced using Illumina HiSeq X Ten and paired-end 150 bp reads were obtained. ChIP-seq reads were analyzed with the following steps:
Step 1, we trimmed the ChIP-seq reads to 100 bp (from 5' to 3'), and for redundant reads with the same sequence, only one was retained. Reads were then mapped to the reference genome (mm9 for mESC or hg19 for HepG2) by bowtie, and only uniquely mapped reads were reported (-m 1).
Step 2, ChIP-seq peaks for H3K4me3 and CTCF were detected by MACS2 with default settings. For each mouse ES CTCF peak, we calculated its length and [...]

[...] sgRNAs. The scatter plot and distribution plot showed that the simulation is successful because they tend to fit the two plots in the real dataset (Figure S1a).

Up to this step, we had generated a series of simulated datasets. We then tested the performance of NSgRNAShot in detecting the negatively-selected sgRNAs in the simulated datasets. We used two indicators, precision ($P$) and recall ($R$), calculated with the following formulas, to benchmark NSgRNAShot:

$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}$

where $TP$ is the number of negatively-selected sgRNAs identified as ssgRNAs; $FP$ is the number of non-selected sgRNAs identified as ssgRNAs; $FN$ is the number of negatively-selected sgRNAs that failed to be identified as ssgRNAs.

We ran NSgRNAShot on the simulated datasets to detect negatively-selected sgRNAs. We can see that in the datasets whose parameter pairs are (0.05,8), (0.1,6), (0.1,8),