Flow of the sgRNA screening and evaluation procedures
To overcome the limitations of currently available sgRNA design tools, we designed a new online sgRNA predictor, OPT-sgRNA, suitable for editing the human or mouse genome using SpyCas9. As shown in Fig. 1, for any provided gene name or DNA sequence, OPT-sgRNA first scans the sequences of all exons (located by genomic coordinate) and their corresponding promoters for PAM sequences. sgRNAs containing polyT or with low GC-content are filtered out before further evaluation. Next, for each sgRNA, all of its off-target sites are retrieved by aligning it against the whole genome, excluding the regions of the query gene, using SeqMap [57] with the mismatch setting at 4; the off-target effect of the sgRNA is then measured as the sum of the effects of all its off-target sites, each evaluated by a linear regression model. All sgRNAs are then sorted in ascending order of off-target effect, and only the top M sgRNAs with the lowest off-target effect are selected for activity scoring (the default M is 100 and can be set by the user). These sgRNAs are then scored and ordered by activity, and the top N sgRNAs with the highest activity are presented as output (the default N is 10 and can also be set by the user).
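The screening flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the OPT-sgRNA implementation: it assumes a 20-nt protospacer followed by the canonical SpyCas9 NGG PAM, treats TTTT as polyT, and uses a 40% GC cutoff. The function names (`candidate_sgrnas`, `select_sgrnas`) and the two scoring callables are hypothetical placeholders for the off-target and activity models.

```python
import re

# Assumption: 20-nt protospacer immediately followed by an NGG PAM.
# The lookahead lets overlapping candidates be reported.
PAM_PATTERN = re.compile(r"(?=([ACGT]{20})[ACGT]GG)")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def candidate_sgrnas(dna, min_gc=0.40):
    """Scan both strands for PAM-adjacent spacers, dropping polyT and low-GC ones."""
    dna = dna.upper()
    found = []
    for strand in (dna, reverse_complement(dna)):
        for match in PAM_PATTERN.finditer(strand):
            spacer = match.group(1)
            if "TTTT" in spacer:            # polyT filter
                continue
            if gc_content(spacer) < min_gc:  # low GC-content filter
                continue
            found.append(spacer)
    return found


def select_sgrnas(candidates, off_target_score, activity_score, m=100, n=10):
    """Keep the M candidates with the lowest off-target effect,
    then return the N of those with the highest activity."""
    most_specific = sorted(candidates, key=off_target_score)[:m]
    return sorted(most_specific, key=activity_score, reverse=True)[:n]
```

Here `off_target_score` and `activity_score` are any callables that map a spacer sequence to a number, so the two-stage ranking stays independent of how either model is implemented.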
Evaluation of off-target effect
Several computational methods already exist to predict off-target sites and/or evaluate the specificity of sgRNAs [18, 28, 33, 41-50]. Two main features are used to predict sgRNA specificity: the number and positions of mismatches, and the binding energy between the sgRNA and the target DNA. However, previous structural and biochemical studies have shown that the sgRNA-Cas9 complex divides the target DNA into several distinct regions: the linker, seed, middle and tail regions [8, 10, 12, 15]. Our previous biochemical studies showed that mismatches between the sgRNA and the target DNA in these regions have very different effects on target cleavage efficiency [15]. Therefore, a simple count of mismatches is too crude for accurate evaluation. Here we use the GUIDE-Seq datasets from the Keith Joung group and the Jennifer Doudna group to train the off-target model [58, 59]. The dataset contains 753 off-target sequences reported for 19 different gRNAs. We chose the numbers of mismatches in 5 different regions as factors for training the off-target prediction model (Fig. 2). When the number of mismatches is higher than 4, the cleavage activity of Cas9 decreases significantly (Fig. 2A). As expected, the position of a mismatch is also a very important parameter for evaluating the specificity of sgRNAs: the seed region shows the most significant effect on target cleavage (Fig. 2C). We then constructed a new linear regression model to calculate the off-target score of a specific off-target site (Fig. 2D).
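To make the region-wise scoring concrete, the sketch below counts mismatches per region and combines them linearly, then sums the per-site scores into an sgRNA-level off-target effect. Everything numeric here is an assumption for illustration: the region boundaries, the coefficients and the intercept are hypothetical placeholders, not the fitted values of the model in Fig. 2D, and only the four regions named in the text are used because the exact five-region partition is not specified here.

```python
# Hypothetical region boundaries (0-based, 5'->3', PAM follows position 19).
REGIONS = {
    "tail": range(0, 7),
    "middle": range(7, 12),
    "seed": range(12, 19),
    "linker": range(19, 20),
}

# Placeholder coefficients: NOT fitted values, chosen only so that
# seed mismatches reduce the predicted cleavage the most.
WEIGHTS = {"tail": -0.05, "middle": -0.10, "seed": -0.25, "linker": -0.10}
INTERCEPT = 1.0


def region_mismatches(sgrna, site):
    """Count mismatches between an sgRNA and an aligned site, per region."""
    if len(sgrna) != len(site):
        raise ValueError("sgRNA and site must have the same length")
    return {
        name: sum(sgrna[i] != site[i] for i in positions)
        for name, positions in REGIONS.items()
    }


def site_score(sgrna, site):
    """Linear combination of region-wise mismatch counts, floored at zero."""
    counts = region_mismatches(sgrna, site)
    raw = INTERCEPT + sum(WEIGHTS[name] * n for name, n in counts.items())
    return max(raw, 0.0)


def off_target_effect(sgrna, sites):
    """Off-target effect of an sgRNA: sum of the scores of all its sites."""
    return sum(site_score(sgrna, site) for site in sites)
```

With real training data, the weights would instead come from fitting a linear regression of measured cleavage on these mismatch counts.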
sgRNA activity evaluation and selection
After identification of potential off-target sites, candidate sgRNAs with minimized off-target effects are further evaluated by their activity. As shown in Fig. 3, the sgRNA activity dataset contains 118,862 sgRNAs targeting 22,329 genes. sgRNAs containing polyT (TTTT) or with GC-content lower than 40% are excluded, and only the most potent sgRNAs (top 10% in ranking) and the least potent sgRNAs (bottom 10%) are retained as the final sgRNA dataset, which comprises 40,234 sgRNAs targeting 19,561 genes. Next, features such as single nucleotides, neighboring di-nucleotides and tri-nucleotides, and GC-content are extracted and transformed by one-hot encoding (e.g. A1, A1T2, A1T2T3; Fig. 3, right panel) for feature selection by L1-SVM. Finally, 10-fold cross-validation is performed for model selection. To build an efficient prediction model, we need to retain the important features and discard the irrelevant ones. Various existing machine learning approaches, such as wrapper or filter methods, can be applied to this task. With L1-SVM applied for feature selection, the number of features decreases from 1538 (original) to 240 (final). Interestingly, accuracy does not increase when more features are used (parameters set to 0.1, 0.05 and 0.01) (Fig. S1, Fig. 4A). Considering the different algorithms applied in recent sgRNA selection tools, we trained each model on the selected features and evaluated its performance by 10-fold cross-validation. Overall, the logistic regression classifier performs best in terms of accuracy and robustness in predicting sgRNA activity. To our surprise, in addition to the single- and di-nucleotide preferences reported before (Fig. S2) [54, 55], we also observed tri-nucleotide preferences in our model (Fig. 4B). Feature selection and 10-fold cross-validation were implemented with the scikit-learn Python module.
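The feature encoding and selection steps can be illustrated as follows. This is a sketch under assumptions rather than the code used in this work: it assumes a 20-nt spacer, names features in the A1 / A1T2 / A1T2T3 style, and represents GC-content as a single numeric column. Under these assumptions the encoding yields 80 + 304 + 1152 = 1536 position-specific indicators; how GC-content is encoded to reach the reported 1538 features is not stated, so that detail is a guess. The helper names are hypothetical, and interpreting the reported parameters (0.1, 0.05, 0.01) as the SVM regularization strength `C` is also an assumption.

```python
from itertools import product

BASES = "ACGT"


def _name(kmer, start):
    # e.g. kmer "AT" starting at 0-based position 0 -> "A1T2"
    return "".join(f"{base}{start + i + 1}" for i, base in enumerate(kmer))


def feature_names(length=20):
    """Position-specific mono-, di- and tri-nucleotide indicators plus GC."""
    names = []
    for k in (1, 2, 3):
        for start in range(length - k + 1):
            for kmer in product(BASES, repeat=k):
                names.append(_name(kmer, start))
    return names + ["GC"]


def encode(spacer, names):
    """One-hot encode a spacer against `names`; the last column is GC fraction."""
    present = {
        _name(spacer[start:start + k], start)
        for k in (1, 2, 3)
        for start in range(len(spacer) - k + 1)
    }
    gc = (spacer.count("G") + spacer.count("C")) / len(spacer)
    return [1.0 if name in present else 0.0 for name in names[:-1]] + [gc]


def select_and_evaluate(X, y, C=0.1, folds=10):
    """L1-SVM feature selection followed by logistic regression,
    scored by k-fold cross-validated accuracy."""
    # Imported here so the encoding helpers work without scikit-learn.
    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    selector = SelectFromModel(
        LinearSVC(penalty="l1", dual=False, C=C, max_iter=5000)
    )
    model = make_pipeline(selector, LogisticRegression(max_iter=1000))
    return cross_val_score(model, X, y, cv=folds, scoring="accuracy")
```

Placing the selector inside the pipeline refits feature selection within each fold, which avoids leaking held-out data into the selected feature set; this is a design choice of the sketch and may differ from the procedure used here.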
Webserver
We next created a webserver that packages these two models into a web tool for sgRNA selection: OPT-sgRNA, a user-friendly website (Fig. 5, http://bigdata.ibp.ac.cn/OPT-sgRNA/). It consists of two parts: sgRNA searching and downloading of pre-constructed libraries. The web portal accepts one or more genes (official gene symbols) or sequences in FASTA format as input, which helps users design their own libraries from selected genes. Users can also set the desired number of off-target sites to be evaluated and the number of candidate sgRNAs in the output, and select sgRNAs for either Homo sapiens or Mus musculus. The background gene sequences are based on genome assemblies hg38 (human) and mm10 (mouse), and pre-constructed libraries covering about 50,000 genes with 10 sgRNAs per gene, for both human and mouse, are available for download. The web portal is developed in HTML and CSS and implemented in Python on the Django web framework; all backend scripts are also written in Python.