Deep learning improves the ability of sgRNA off-target propensity prediction

doi:10.21203/rs.2.18444/v3

Methodology article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 10 Feb, 2020

Read the published version in BMC Bioinformatics →

Background The CRISPR/Cas9 system, as the third-generation genome editing technology, has been widely applied in target gene repair and gene expression regulation. Selecting an appropriate sgRNA can improve the on-target knockout efficacy of the CRISPR/Cas9 system with high sensitivity and specificity. However, when the CRISPR/Cas9 system is operating, unexpected cleavage may occur at some sites, which is known as off-target cleavage. A number of methods have been developed to predict the off-target propensity of an sgRNA at a specific DNA fragment. Most of them rely on manual feature extraction and machine learning techniques to obtain off-target scores. With the rapid expansion of off-target data and the rapid development of deep learning theory, the existing prediction methods can no longer deliver the prediction accuracy required at the clinical level.

Results Here, we propose a prediction method named CnnCrispr to predict the off-target propensity of an sgRNA at specific DNA fragments. CnnCrispr automatically learns the sequence features of sgRNA-DNA pairs with the GloVe model, and embeds the trained word vector matrix into a deep learning model that includes a biLSTM and a CNN with five hidden layers. We verified its performance on the data set provided by DeepCrispr, and found that the auROC and auPRC in the "leave-one-sgRNA-out" cross-validation reached 0.957 and 0.429 respectively (the Pearson and Spearman values reached 0.495 and 0.151 respectively under the same settings).

Conclusion Our results show that CnnCrispr has better classification and regression performance than the existing state-of-the-art models. The code of CnnCrispr can be freely downloaded from https://github.com/LQYoLH/CnnCrispr.

Bioinformatics

sgRNA

off-target

deep learning

GloVe model

The CRISPR/Cas9 system[1-4] (clustered regularly interspaced short palindromic repeats/CRISPR-associated 9 system), originally derived from the immune defense mechanism of archaea, is one of the most popular gene editing technologies today. Compared with zinc-finger nucleases (ZFNs)[5, 6] and transcription activator-like effector nucleases (TALENs)[6-8], the CRISPR/Cas9 system has a clear mechanism, simple operation and high efficiency. It has therefore gradually replaced the earlier methods and is now applied in biology, clinical medicine and other fields.

The CRISPR/Cas9 system requires three important components for gene editing: the Cas9 protein, the guide RNA and the PAM (protospacer adjacent motif)[9]. Among them, the guide RNA that recognizes a target DNA sequence through complementary base pairing is generally referred to as an sgRNA[10, 11] (single guide RNA, generally an RNA sequence 20 nt in length). The PAM[11-13] is a 3 nt motif on the target sequence and a prerequisite for Cas9 cleavage at a specified site. A common type of PAM is NGG[14-16] (N represents any of the bases A, T, C, G). During editing, the Cas9 protein, guided by the sgRNA sequence, cleaves the target DNA three bases upstream of the PAM, after which the break is repaired in one of two ways: nonhomologous end-joining (NHEJ) introduces insertions or deletions (indels) that mutate the gene at the target position, or homology-directed repair (HDR) uses a “donor template” provided by foreign DNA to recombine with the target and achieve template-directed editing of the genome[17-19].

Studies have found that when the CRISPR/Cas9 system operates, several mismatches may be tolerated in the complementary pairing of the sgRNA with the target DNA sequence, resulting in unintended cleavage of DNA, which is called “off-target” cleavage[16, 20, 21]. Fu et al.[20] confirmed that an sgRNA tolerates 1-5 base mismatches during the guiding process, which in turn causes unintended sequences to be erroneously edited. The off-target phenomenon has greatly hindered the clinical application and further promotion of CRISPR technology. How to assess the off-target propensity of specific sgRNAs and minimize the off-target risk has become a focus of CRISPR/Cas9 research.

A variety of off-target detection methods have been developed. The GUIDE-Seq[22-24] method created by Tsai et al. can effectively identify 0.1% of mutations in cells and predict the cleavage activity of the system based on sequencing results. The HTGTS[25] method fuses known DNA double-strand breaks with other cleaved DNA ends, detects the breaks by PCR amplification, and from these infers off-target sites. On this basis, Frock et al.[26] developed a higher-throughput off-target detection method. The BLESS[27] technique infers off-target sites by detecting DNA double-strand breaks. However, this method is complicated to operate and cannot detect a break that has not yet occurred or has already been repaired. In addition, the IDLV[28, 29] method can detect off-target sites genome-wide without bias, but with an accuracy of only 1%.

The above detection methods cannot detect all off-target sites of a specific sgRNA, and have disadvantages such as high cost, difficult operation, and low detection accuracy. As the core of artificial intelligence, machine learning and deep learning can effectively analyze empirical data and provide important technical support for bioinformatics. To date, machine learning has been gradually applied to off-target site prediction[14, 30], sgRNA activity prediction[14] and sgRNA design optimization[31, 32]. Various machine learning based sgRNA design models[30, 33-36] have been developed and put into application. Their main design idea is to introduce sgRNA sequence features and secondary structure features, rank all possible sgRNAs for a specific target DNA sequence by off-target score, and select the sgRNA with high cleavage efficiency and low off-target propensity.

The above machine learning methods were based on sequence features. At the time of writing, only three existing prediction models have introduced deep learning into the sgRNA off-target propensity prediction problem.

DeepCpf1[31], based on the convolutional neural network (CNN), introduced sgRNA sequence features and chromatin accessibility to predict the editing efficiency of sgRNAs for Cpf1. This method does not require manual feature construction, which simplifies the model and makes it convenient for researchers to use. DeepCrispr[37] introduced four epigenetic features in addition to DNA sequence features and automatically extracts valid information using an auto-encoder. Several models, covering sgRNA on-target cleavage and off-target propensity prediction, were established. However, it is still unknown whether the four epigenetic features have a positive impact on the prediction results. CNN_std[38] used only sequence features, constructing a two-dimensional input matrix by means of an "XOR" coding design and using a CNN for prediction. This deep learning method also achieved higher accuracy on the CRISPOR dataset[39]. In addition, Dimauro et al. proposed a model named CRISPRLearner[40] for predicting sgRNA on-target knockout activity. Although its purpose is different from ours, its application of deep learning to sgRNA-related prediction tasks provided us with ideas.

Most of the existing prediction methods are still based on machine learning and make predictions through complex manual feature extraction[41-46]. However, the internal mechanism of CRISPR gene editing is not yet fully understood, and manually designed sgRNA features may have a negative impact on the prediction results. Therefore, we present CnnCrispr, a novel computational method that uses deep learning to predict sgRNA off-target cleavage propensity. In CnnCrispr, the GloVe embedding model was introduced to extract global and statistical information from input sequences by constructing the co-occurrence matrix of an sgRNA and its corresponding DNA sequence. Integrated with a deep neural network model, it can predict the off-target propensity of a given sgRNA at a specific DNA fragment. We trained CnnCrispr with the data set used by DeepCrispr[37], and showed through performance comparison with four state-of-the-art models that CnnCrispr has a competitive advantage in predicting sgRNA off-target propensity. It is therefore expected to become a useful tool for research on the CRISPR system.

2.1 Model Structure and Prediction

In our initial design, we combined a biLSTM with a CNN framework in the final prediction model; the model structure is shown in Fig. 1. We also constructed several similar models by removing different network parts, in order to compare the test results and select the final prediction model. All candidate network frameworks considered for model selection are briefly described in Table 1.

The structure of the benchmark framework of CnnCrispr is described in detail below:

The first layer of CnnCrispr is an embedding layer, which takes as input the vectors obtained by the GloVe model. Since the vector dimension of the GloVe model is set to 100, the input of the embedding layer is a two-dimensional matrix of size 16×100. We used the mittens package in Python to train the GloVe model on the GloVe co-occurrence matrix that we constructed.
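As a rough illustration of the co-occurrence step, the sketch below assumes that each aligned sgRNA-DNA position is treated as one of 16 base-pair tokens, which would be consistent with the 16 rows of the embedding matrix. The tokenization, the window size, the 1/distance weighting (borrowed from the original GloVe formulation) and the names `PAIR_INDEX`, `encode_pair` and `cooccurrence` are assumptions made for illustration, not details taken from the CnnCrispr code.

```python
from itertools import product

BASES = "ACGT"
# Hypothetical vocabulary: one token per (sgRNA base, DNA base) pair -> 16 tokens.
PAIR_INDEX = {a + b: i for i, (a, b) in enumerate(product(BASES, repeat=2))}


def encode_pair(sgrna, dna):
    """Map two aligned, equal-length sequences to a list of pair-token indices."""
    if len(sgrna) != len(dna):
        raise ValueError("sequences must be aligned and equal in length")
    return [PAIR_INDEX[a + b] for a, b in zip(sgrna.upper(), dna.upper())]


def cooccurrence(token_seqs, window=5):
    """Symmetric co-occurrence counts; tokens d positions apart contribute 1/d."""
    n = len(PAIR_INDEX)
    counts = [[0.0] * n for _ in range(n)]
    for seq in token_seqs:
        for i, centre in enumerate(seq):
            for j in range(max(0, i - window), i):
                weight = 1.0 / (i - j)
                counts[centre][seq[j]] += weight
                counts[seq[j]][centre] += weight
    return counts


# Toy 4-nt example; real inputs would be full-length sgRNA/target pairs.
tokens = encode_pair("GACT", "GACA")
matrix = cooccurrence([tokens])
print(tokens)  # [10, 0, 5, 12]
```

A 16×16 matrix of this kind is the sort of input a GloVe trainer could factorize into 100-dimensional vectors, one per pair token.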

The second layer is a biLSTM network, which is mainly used to extract the context features of the input. Five convolution layers follow, each with a different number of kernels and a different kernel size. Two fully connected layers, of sizes 20 and 2 respectively, are placed after the last convolution layer.

In addition to the framework mentioned above, Batch Normalization and Dropout layers are added between layers to prevent overfitting. The dropout rate is set to 0.3. For the output layer, softmax and sigmoid functions are used as the activation functions of the classification model and the regression model, respectively.
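The two output activations can be compared with a minimal sketch. It assumes two output units for classification and, as a simplification, a single output unit for regression; the logit values are made up for illustration.

```python
from math import exp


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    """Squash a single logit into the open interval (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


# Classification head: probabilities for (negative, off-target).
class_probs = softmax([0.0, 2.0])
# Regression head: a bounded propensity score.
propensity = sigmoid(2.0)
```

With softmax the two class probabilities compete and sum to one, whereas the sigmoid output is a single bounded score that can be fitted to a normalized cleavage frequency.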

During training, the initial learning rate was set to 0.01, and we used the Adam algorithm to optimize the loss function. We set the batch size to 256, balancing the need to extract information from the negative data against the risk of overfitting. Too large a batch size may increase the risk that some positive samples occur multiple times in a single batch, while too small a batch size may slow training and extend the training time.

Our experiment was divided into two parts. First, we compared the performance of different models. Then, the final prediction model was compared with the existing models with better performance to evaluate the practical application ability of our model. Detailed network descriptions can be found in Additional file 2.

2.2 Model Selection

The experimental data come from the supplementary material of the DeepCrispr article, and the data are described in detail in Section 5.1. For training, 20% of the data in the Hek293t and K562 data sets were randomly selected to form the test sets (the Hek293t test set, the K562 test set and the Total test set). Different prediction models were trained on all the remaining data, and the prediction performance of each model on the three test sets was evaluated. During training, we generated the batch training data using the data sampling method described in Section 5.6.
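The hold-out procedure can be sketched as follows, assuming each record carries a `cell_line` field and that the 20% is drawn separately within each cell line. The record layout, the function name `split_by_cell_line` and the fixed seed are hypothetical.

```python
import random


def split_by_cell_line(records, test_fraction=0.2, seed=0):
    """Randomly hold out a fraction of the records within each cell line."""
    rng = random.Random(seed)
    by_line = {}
    for rec in records:
        by_line.setdefault(rec["cell_line"], []).append(rec)
    train, test = [], []
    for rows in by_line.values():
        rows = rows[:]  # copy so the caller's list is not shuffled
        rng.shuffle(rows)
        n_test = round(len(rows) * test_fraction)
        test.extend(rows[:n_test])
        train.extend(rows[n_test:])
    return train, test


# Toy data: 10 Hek293t records and 5 K562 records.
records = [{"cell_line": "Hek293t", "id": i} for i in range(10)]
records += [{"cell_line": "K562", "id": i} for i in range(5)]
train_set, test_set = split_by_cell_line(records)
```

Filtering `test_set` by cell line gives the per-cell-line test sets, and the unfiltered list plays the role of the Total test set.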

We built two models, for classification and regression prediction respectively. The first three models listed in Table 1 were trained to verify the influence of different parts on prediction performance. The structure of the benchmark model CnnCrispr is introduced in Section 2.1. CnnCrispr_No_LSTM was obtained by removing the LSTM part from CnnCrispr, and CnnCrispr_Conv_LSTM was obtained by swapping the order of the convolution layers and the recurrent layer in CnnCrispr. The purpose of the latter two models was to show whether the CNN and RNN layers improve performance, and whether the order of the two frameworks affects performance.

We initially trained the three models mentioned above and obtained the prediction results. The model performance is shown in Table 2.

Because the data set is highly unbalanced, it was easy for a model to obtain a high auROC value. Therefore, we set aside the comparison of auROC values and focused on the auPRC and Recall values on the test set. The results in Table 2 were used to draw the histogram in Fig. 2, from which it can be seen that CnnCrispr has better predictive performance. Therefore, we took CnnCrispr as the benchmark network framework and further fine-tuned the network structure.
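The effect of class imbalance on the two measures can be seen in a small synthetic example. The labels and scores below are invented for illustration (2 positives among 500 negatives, roughly the 250:1 ratio of the test set) and are not the paper's data; average precision is used here as one common estimator of auPRC.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Synthetic scores: negatives spread over [0, 1), positives near the top.
negatives = [i / 500 for i in range(500)]
positives = [0.981, 0.961]
labels = [0] * len(negatives) + [1] * len(positives)
scores = negatives + positives

roc = auroc(labels, scores)
prc = average_precision(labels, scores)
print(f"auROC={roc:.3f}  auPRC={prc:.3f}")  # auROC=0.972  auPRC=0.098
```

Only a handful of negatives outrank each positive, so auROC stays close to 1, yet those few false positives are enough to push precision down to about 10%.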

Starting from CnnCrispr, the Dropout layer and the Batch Normalization layer were each removed in turn to verify their influence on performance. A brief description of the network structures is given in Table 1. The recall of CnnCrispr_No_Dropout was 0.810 on the total test set, slightly lower than that of CnnCrispr, which showed that the Dropout layer does improve performance and prevent over-fitting, although the improvement is small. In addition, after adding the Dropout layer, the training parameters of the model were greatly reduced, which saved training time, so we kept the Dropout layer in the final model. We then trained the model without the Batch Normalization layer several times and evaluated it on the test set, but every time the entire test set was classified as negative. This indicated that the model without the BN layer had lost its ability to classify, so the BN layer is essential in the final model. The importance of the BN layer for neural network models is also discussed in Section 5.5.

2.3 Model Comparison

We selected four sgRNA off-target propensity prediction models for model comparison, namely CFD[33], MIT[16], CNN_std[38] and DeepCrispr[37].

CFD is short for Cutting Frequency Determination. As a scoring model for evaluating the off-target propensity of an sgRNA-DNA interaction, CFD assigns a score to each mismatch according to its position and type. When multiple mismatches appear in a sequence pair, the corresponding scores are multiplied to obtain the final score. For example, if the sgRNA-DNA pair has an rG-dA mismatch at position 6 and an rC-dT mismatch at position 10, it receives a CFD score of 0.67×0.87=0.583. Haeussler et al.[39] compared the performance of CFD with that of MIT, and showed that CFD performed slightly better than MIT on the CRISPOR data set. CNN_std is a CNN-based sgRNA off-target propensity prediction model developed by Jiecong Lin. Each sgRNA and its corresponding DNA sequence are encoded together by the “XOR” principle and passed to a multi-layer convolution network for prediction. DeepCrispr is a deep learning method that takes sgRNA-DNA sequence information combined with genomic epigenetic features as input. DeepCrispr was trained on the largest data set available and introduced an auto-encoder to automatically acquire latent features of the sgRNA-DNA sequence, which was a good attempt at deep learning in sgRNA-related prediction problems.
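The multiplicative rule behind the CFD example can be written out in a few lines. The lookup table below contains only the two penalties quoted in the text; a full CFD table would cover every position and mismatch type, and the key format and the function name `cfd_score` are chosen here for illustration.

```python
from math import prod

# Only the two penalties quoted in the text, keyed by (mismatch type, position).
MISMATCH_SCORES = {
    ("rG:dA", 6): 0.67,
    ("rC:dT", 10): 0.87,
}


def cfd_score(mismatches):
    """Multiply the per-mismatch scores; with no mismatches the product is 1."""
    return prod(MISMATCH_SCORES[m] for m in mismatches)


score = cfd_score([("rG:dA", 6), ("rC:dT", 10)])
print(round(score, 3))  # 0.583
```

Because the scores are multiplied, every additional mismatch can only lower the result or leave it unchanged, as long as each per-mismatch score is at most 1.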

To make a more comprehensive comparison with the four models above, we tested the performance of the classification and regression models under two test patterns. We downloaded the prediction models of CFD, MIT and CNN_std from the relevant websites and obtained their prediction results on the same test set as CnnCrispr. Because the training methods of CnnCrispr and DeepCrispr were consistent, we used the test results reported by DeepCrispr directly for the comparison.

2.3.1 Test pattern 1 -- withheld 20% as an independent testing set

Consistent with the training method in the “Model Selection” section, we randomly divided the data set of each cell line in the proportion 8:2 and compared the performance of CnnCrispr with that of the current leading prediction models. Fig. 3 shows the comparison results under the classification schema. CnnCrispr achieved an auROC value of 0.975 and an auPRC value of 0.679 on the total test set, both higher than the values of CFD, MIT and CNN_std. Similar trends appeared on the Hek293t and K562 test sets, where CnnCrispr achieved auROC values of 0.971 and 0.995 and auPRC values of 0.686 and 0.688, respectively. The AUC values of the ROC and PRC curves of CnnCrispr on the three test sets were all higher than those of CFD, MIT and CNN_std, which indicated that CnnCrispr had stronger prediction ability. In addition, the PRC curve obtained by CnnCrispr on the total test set and the Hek293t test set completely enclosed the PRC curves obtained by the other three models, while on the K562 test set only a small portion of the curve was covered by that of CNN_std. Overall, CnnCrispr performed better than the other three models; since the training and test sets were extremely unbalanced, the PRC curve and the area under it were the more important measures for model evaluation, and here CnnCrispr had a strong competitive advantage.

In addition to the comparison with the above three models, we compared the testing performance of CnnCrispr with that of DeepCrispr. Since the training methods and data sets were consistent, we directly used the test results given in ref. [37]; the results are shown in Table 3. The auROC values of DeepCrispr were slightly better than those of CnnCrispr (most visibly on the Hek293t test set), but the auPRC values obtained by CnnCrispr on all three test sets were higher than those of DeepCrispr. Taken together, CnnCrispr showed better performance than DeepCrispr under test pattern 1.

Unlike the classification schema, the regression schema was evaluated mainly with the Pearson correlation coefficient and the Spearman rank correlation coefficient of the prediction results. The Pearson correlation coefficient between CnnCrispr’s predictions and the real labels was consistently higher than that of the three comparison models. (Since the Pearson coefficient was not used as an evaluation measure in DeepCrispr, we compared only the Spearman values of CnnCrispr with those of DeepCrispr.)
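For reference, both coefficients can be computed from scratch as in the sketch below, where Spearman is simply Pearson applied to average ranks. The toy labels and predictions are invented; they imitate a test set in which most true values are tied at zero.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = average
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))


# Four negatives tied at zero and one true off-target site.
y_true = [0.0, 0.0, 0.0, 0.0, 0.8]
y_pred = [0.10, 0.05, 0.20, 0.15, 0.90]
rho = spearman(y_true, y_pred)
print(round(rho, 3))  # 0.707
```

Even though the single positive is ranked first, the arbitrary ordering of the tied negatives keeps the Spearman value well below 1, which is one way to see why heavily unbalanced labels limit this measure.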

The Pearson value of CnnCrispr on the Hek293t test set reached 0.712 (higher than 0.371 for CFD, 0.153 for MIT and 0.33 for CNN_std). On the entire test set, CnnCrispr also demonstrated better predictive ability, with a Pearson value of 0.682, higher than 0.343 for CFD, 0.150 for MIT and 0.321 for CNN_std. For the Spearman correlation coefficient, the negative data in the test set far outnumbered the positive data (about 250:1), so a high Spearman value could not be achieved. Nevertheless, the prediction ability of CnnCrispr was still better than that of the four models above: the Spearman values of CnnCrispr on the Hek293t, K562 and Total test sets were 0.154, 0.160 and 0.134 respectively, compared with 0.140, 0.143 and 0.128 for CFD; 0.085, 0.084 and 0.086 for MIT; 0.141, 0.144 and 0.132 for CNN_std; and 0.136, 0.126 and 0.133 for DeepCrispr. In addition, following CRISTA’s evaluation method, we treated the predicted results as probabilities of the classification labels and compared the AUC values under the ROC and PRC curves of the five models. The auROC and auPRC values obtained by CnnCrispr on the total test set were as high as 0.986 and 0.601 respectively, superior to 0.942 and 0.316 for CFD and 0.947 and 0.208 for CNN_std, and the same pattern was observed on the Hek293t and K562 test sets. Based on these results, we concluded that CnnCrispr had better prediction ability.

2.3.2 Test pattern 2 – “Leave-one-sgRNA-out”

To examine the accuracy and generalization ability of CnnCrispr when predicting the off-target propensity of new sgRNAs, we set up a "leave-one-sgRNA-out" experiment, which is a good evaluation method for off-target propensity prediction. In each round, one sgRNA and all its corresponding off-target sequences (those with true cleavage propensity as well as the potential sites obtained from the whole genome) were withheld from training and used for model testing. Since there are 29 sgRNAs, model training and performance evaluation were conducted a total of 29 times. Through this 29-fold cross-validation, we were able to comprehensively evaluate the generalization ability of CnnCrispr and to check for over-fitting or under-fitting when predicting for particular sgRNAs.
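The fold construction can be sketched as a small generator, assuming each record stores the sgRNA it belongs to under a hypothetical `sgrna` key. The toy records and the function name `leave_one_sgrna_out` are illustrative only.

```python
def leave_one_sgrna_out(records):
    """Yield (held_out_guide, train, test), holding out all sites of one sgRNA per fold."""
    guides = sorted({rec["sgrna"] for rec in records})
    for guide in guides:
        train = [rec for rec in records if rec["sgrna"] != guide]
        test = [rec for rec in records if rec["sgrna"] == guide]
        yield guide, train, test


# Toy data: three guides with different numbers of candidate sites.
records = [
    {"sgrna": "g1", "site": "s1", "label": 1},
    {"sgrna": "g1", "site": "s2", "label": 0},
    {"sgrna": "g2", "site": "s3", "label": 0},
    {"sgrna": "g3", "site": "s4", "label": 1},
    {"sgrna": "g3", "site": "s5", "label": 0},
    {"sgrna": "g3", "site": "s6", "label": 0},
]
folds = list(leave_one_sgrna_out(records))
```

The number of folds equals the number of distinct sgRNAs, so the 29 sgRNAs in the data set give 29 train/test rounds, and no site of the held-out sgRNA is ever seen during training in its own fold.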

For classification, CnnCrispr achieved an average auROC of 0.957 and an average auPRC of 0.429, both higher than the results of the four models above (CFD: auROC 0.903, auPRC 0.319; MIT: auROC 0.848, auPRC 0.115; CNN_std: auROC 0.925, auPRC 0.303; DeepCrispr: auROC 0.841, auPRC 0.421). In the 29-fold cross-validation, CnnCrispr’s overall competitive advantage was more pronounced, and its auPRC was higher than that of the other four models, which is essential to prevent the model from missing actual off-target sites (see Fig. 5).

To make a more comprehensive evaluation, we also considered the distribution of the auROC and auPRC values obtained in the 29-fold cross-validation and drew violin plots. (Because we were unable to obtain the test data of DeepCrispr, we could not draw a violin plot for it.) A violin plot shows the kernel density estimate of the underlying distribution as its outer shape. Fig. 6 shows that the auROC values of CnnCrispr were generally higher and more concentrated, with 75% of the results greater than 0.9. In contrast, there were obvious outliers among the auROC values of the other three models, indicating that they do not predict the off-target propensity of some individual sgRNAs well; in particular, the auROC values of CFD and CNN_std were visibly dispersed (the lower whiskers were longer). The violin of CnnCrispr was also wider at high auROC values, showing that more of its auROC values fell in this range, which further indicates the good prediction performance of CnnCrispr. For auPRC, the median of the results obtained by CnnCrispr was clearly larger than that of the other three models, and 75% of the auPRC values obtained by CnnCrispr were greater than 0.2. The results of CnnCrispr were concentrated at higher scores, indicating that its overall predictive performance was indeed better than that of CFD and CNN_std (see Fig. 6).

We further compared the 29-fold cross-validation results under the regression schema; the results are visualized in Fig. 5-6. We first compared the average Pearson correlation coefficient and the average Spearman correlation coefficient (see Fig. 5). CnnCrispr achieved higher mean Pearson and Spearman values, which showed that CnnCrispr had better fitting ability. Furthermore, we drew the 29 sets of Pearson and Spearman values as violin plots. As shown in Fig. 6, the Pearson values obtained by CnnCrispr were concentrated in the high score range. The Spearman scores of all four models were low, but the distribution of CnnCrispr’s scores was still clearly better than that of the other three models. We conclude that CnnCrispr had a higher probability of obtaining well-fitted predictions of off-target propensity (detailed results are in Additional file 1).

As a classical neural network architecture, the RNN has the following features: memory, shared parameters and Turing completeness. It therefore has advantages in learning the nonlinear features of sequences and plays an important role in the study of sequence problems with temporal structure. Studies of CRISPR editing have shown that the base types at different positions have a certain influence on the cleavage propensity of an sgRNA[11, 21, 41, 42, 47]. Therefore, we introduced an RNN framework into the prediction model to extract context information from sgRNA-DNA pairs.

The convolution kernel of the CNN is smaller than the input matrix, so the convolution operation extracts local features, as it does in image processing. It is not necessary for each neuron to perceive the whole input; each neuron only needs to perceive a local region, and the local information is then integrated at a higher level to obtain global information. The parameter sharing of the CNN also greatly reduces computation. In addition, we set convolution kernels of different sizes at different levels of the convolution part, and used multiple convolution kernels on the input, in order to extract local features as comprehensively as possible. Furthermore, the GloVe method uses the statistical information of global word co-occurrence to learn word vectors, combining the advantages of statistical information with those of the local context window method. We used this method to replace the traditional "one-hot" representation, giving the input sequence of CnnCrispr better representational power.

In the initial structural design of the model, we considered the need to extract both sequence context information and local region information, so we integrated RNN and CNN models to improve feature extraction; the strong prediction ability of the final network model, CnnCrispr, was demonstrated by comparison with the different candidate models. The final network structure is shown in Fig. 1. After the GloVe embedding, the biLSTM extracts context features, and five convolutional layers then extract further information from the resulting two-dimensional matrix. In the output layer of the network, the model is split into a classification schema and a regression schema by setting different activation functions (softmax or sigmoid).

In the “Model Selection” section, we also saw that the order of the RNN and CNN had a great impact on test performance, and that the model CnnCrispr_Conv_LSTM did not perform well in feature extraction and prediction (see Section 2.2 and Table 2). We suggest the following explanation: the RNN can fully extract the contextual features of the input sequence, whereas applying the convolution operation first breaks the internal connections of the sequence and impairs the function of the RNN. In CnnCrispr, the RNN is applied first to extract the context features of the sequence, the CNN is then used to extract local features, and the local information is integrated at a higher level to obtain global feature information, which improves the prediction ability of CnnCrispr.

Compared with the four existing state-of-the-art prediction models, CnnCrispr had better prediction ability on the highly unbalanced test sets from DeepCrispr. In the “leave-one-sgRNA-out” experiment, a mean auPRC of 0.471 and a mean Pearson value of 0.502 were achieved, which showed that CnnCrispr has a competitive advantage. In addition, CnnCrispr used only the sequence information of the sgRNA and the corresponding potential DNA segments and did not construct manual features, thus avoiding the introduction of invalid or interfering information and making the prediction results more convincing.

We hope that CnnCrispr can help clinical researchers narrow down the range of candidate sites to be tested for off-target effects, saving them time and effort.

Since 2014, the number of open source data sets and online resources available for studying the application of machine learning to the CRISPR/Cas9 system has been increasing. At the time of writing, the data set we used for model training is the largest data set available. With the continuous development of biological research technology, the number of available open source data sets will gradually increase, which should further improve the generalization ability of CnnCrispr in the future.

In this paper, we built a novel sgRNA off-target propensity prediction model, CnnCrispr. By introducing the GloVe model, CnnCrispr adopted a new feature representation to embed sequence information into the deep learning model, combined an RNN with a CNN, and used only sequence information to predict the off-target propensity of an sgRNA at specific sites. Comparison with existing prediction models confirmed the superior prediction ability of CnnCrispr. Our model used deep learning to learn the sequence features of an sgRNA and its corresponding potential off-target site automatically, avoiding the unknown influence of manual feature construction on the prediction results, which is a new attempt at deep learning for sgRNA off-target propensity prediction.

(see Methods in the Supplementary Files)

CRISPR/Cas9 system: Clustered regularly interspaced short palindromic repeats /CRISPR-associated 9 system; GloVe: Global vector; CNN: Convolutional neural network; biLSTM: Bi-directional Long-Short Term Memory; PAM: Protospacer adjacent motif; sgRNA: Single-guide RNA; RNN: Recurrent neural network; auROC: Area under the ROC curve; auPRC: Area under the PRC curve.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and materials

All data used during this study are included in the published article “DeepCrispr” and its supplementary information files (DOI: 10.1186/s13059-018-1459-4). We obtained 29 sgRNAs and their corresponding DNA sequences from its additional file 2. The data can also be downloaded from https://github.com/LQYoLH/CnnCrispr under the file name “off-target_data”.

Competing interests

The authors declare that they have no competing interests.

Funding

This work was supported by grants from the Fundamental Research Funds for the Central Universities (No.FRF-BR-18-008B). The funders had no role in the design of the study, the collection, analysis, and interpretation of data and in writing the manuscript.

Author Contributions

X.L and Q.L conceived and designed the experiments. G.L and B.L carried out the investigation. Q.L and X.C performed the experiments. Q.L wrote the paper. X.L revised the manuscript. All authors have read and approved the manuscript.

Acknowledgments

The authors are grateful to the anonymous reviewers for their insightful comments and suggestions.

Devaki B, Michelle D, Rodolphe B: CRISPR-Cas systems in bacteria and archaea: versatile small RNAs for adaptive defense and regulation. Annual Review of Genetics 2011, 45:273-297.
Terns MP, Terns RM: CRISPR-based adaptive immune systems. Current Opinion in Microbiology 2011, 14(3):321-327.
Blake W, Sternberg SH, Doudna JA: RNA-guided genetic silencing systems in bacteria and archaea. Nature 2012, 482(7385):331-338.
Ishino Y, Shinagawa H, Makino K, Amemura M, Nakata A: Nucleotide sequence of the iap gene, responsible for alkaline phosphatase isozyme conversion in Escherichia coli, and identification of the gene product. Journal of Bacteriology 1987, 169(12):5429-5433.
Miller JC, Holmes MC, Wang J, Guschin DY, Lee YL, Rupniewski I, Beausejour CM, Waite AJ, Wang NS, Kim KA: An improved zinc-finger nuclease architecture for highly specific genome editing. Nature Biotechnology 2007, 25(7):778-785.
Wood AJ, Te-Wen L, Bryan Z, Pickle CS, Ralston EJ, Lee AH, Rainier A, Miller JC, Elo L, Xiangdong M: Targeted genome editing across species using ZFNs and TALENs. Science 2011, 333(6040):307.
Dirk H, Haoyi W, Samira K, Lai CS, Qing G, Cassady JP, Cost GJ, Lei Z, Yolanda S, Miller JC: Genetic engineering of human pluripotent cells using TALE nucleases. Nature Biotechnology 2011, 29(8):731-734.
Michelle C, Tomas C, Doyle EL, Clarice S, Feng Z, Aaron H, Bogdanove AJ, Voytas DF: Targeting DNA double-strand breaks with TAL effector nucleases. Genetics 2010, 186(2):757-761.
Makarova KS, Haft DH, Rodolphe B, Brouns SJJ, Emmanuelle C, Philippe H, Sylvain M, Mojica FJM, Wolf YI, Yakunin AF: Evolution and classification of the CRISPR-Cas systems. Nature Reviews Microbiology 2011, 9(6):467-477.
Elitza D, Krzysztof C, Sharma CM, Karine G, Yanjie C, Pirzada ZA, Eckert MR, Jörg V, Emmanuelle C: CRISPR RNA maturation by trans-encoded small RNA and host factor RNase III. Nature 2011, 471(7340):602-607.
Martin J, Krzysztof C, Ines F, Michael H, Doudna JA, Emmanuelle C: A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science 2012, 337(6096):816-821.
Mojica FJM, Díez-Villaseñor C, García-Martínez J, Almendros C: Short motif sequences determine the targets of the prokaryotic CRISPR defence system. Microbiology 2009, 155(3):733-740.
Sternberg SH, Sy R, Martin J, Greene EC, Doudna JA: DNA interrogation by the CRISPR RNA-guided endonuclease Cas9. Nature 2014, 507(7490):62-67.
Cem K, Sevki A, Ritambhara S, Jeremy T, Mazhar A: Genome-wide analysis reveals characteristics of off-target sites bound by the Cas9 endonuclease. Nature Biotechnology 2014, 32(7):677-683.
Zhang Y, Ge X, Yang F, Zhang L, Zheng J, Tan X, Jin ZB, Qu J, Gu F: Comparison of non-canonical PAMs for CRISPR/Cas9-mediated DNA cleavage in human cells. Scientific Reports 2014, 4:5405.
Hsu PD, Scott DA, Weinstein JA, F Ann R, Silvana K, Vineeta A, Yinqing L, Fine EJ, Xuebing W, Ophir S: DNA targeting specificity of RNA-guided Cas9 nucleases. Nature Biotechnology 2013, 31(9):827-832.
Lu XJ, Xue HY, Ke ZP, Chen JL, Ji LJ: CRISPR-Cas9: a new and promising player in gene therapy. Journal of Medical Genetics 2015, 52(5):289-296.
Rouet P, Smih F, Jasin M: Introduction of double-strand breaks into the genome of mouse cells by expression of a rare-cutting endonuclease. Molecular & Cellular Biology 1994, 14(12):8096-8106.
Rouet P, Smih F, Jasin M: Expression of a site-specific endonuclease stimulates homologous recombination in mammalian cells. Proc Natl Acad Sci U S A 1994, 91(13):6064-6068.
Yanfang F, Foden JA, Cyd K, Maeder ML, Deepak R, J Keith J, Sander JD: High-frequency off-target mutagenesis induced by CRISPR-Cas nucleases in human cells. Nature Biotechnology 2013, 31(9):822-826.
Vikram P, Steven L, Guilinger JP, Enbo M, Doudna JA, Liu DR: High-throughput profiling of off-target DNA cleavage reveals RNA-programmed Cas9 nuclease specificity. Nature Biotechnology 2013, 31(9):839-843.
Tsai SQ, Zongli Z, Nguyen NT, Matthew L, Topkar VV, Vishal T, Nicolas W, Cyd K, A John I, Long P, Le: GUIDE-seq enables genome-wide profiling of off-target cleavage by CRISPR-Cas nucleases. Nature Biotechnology 2015, 33(2):187-197.
Kleinstiver BP, Prew MS, Tsai SQ, Nguyen NT, Topkar VV, Zheng Z, Joung JK: Broadening the targeting range of Staphylococcus aureus CRISPR-Cas9 by modifying PAM recognition. Nature Biotechnology 2015, 33(12):1293-1298.
Kleinstiver BP, Prew MS, Tsai SQ, Topkar VV, Nguyen NT, Zheng Z, Gonzales APW, Li Z, Peterson RT, Yeh JRJ: Engineered CRISPR-Cas9 nucleases with altered PAM specificities. Nature 2015, 523(7561):481-485.
Chiarle R, Zhang Y, Frock R, Lewis S, Molinie B, Ho YJ, Myers D, Choi V, Compagno M, Malkin D: Genome-wide Translocation Sequencing Reveals Mechanisms of Chromosome Breaks and Rearrangements in B Cells. Cell 2011, 147(1):107-119.
Frock RL, Jiazhi H, Meyers RM, Yu-Jui H, Erina K, Alt FW: Genome-wide detection of DNA double-stranded breaks induced by engineered nucleases. Nature Biotechnology 2015, 33(2):179-186.
Crosetto N, Mitra A, Silva MJ, Bienko M, Dojer N, Wang Q, Karaca E, Chiarle R, Skrzypczak M, Ginalski K: Nucleotide-resolution DNA double-strand break mapping by next-generation sequencing. Nature Methods 2013, 10(4):361-365.
Xiaoling W, Yebo W, Xiwei W, Jinhui W, Yingjia W, Zhaojun Q, Tammy C, He H, Ren-Jang L, Jiing-Kuan Y: Unbiased detection of off-target cleavage by CRISPR-Cas9 and TALENs using integrase-defective lentiviral vectors. Nature Biotechnology 2015, 33(2):175-178.
Osborn MJ, Webber BR, Knipping F, Lonetree CL, Tennis N, Defeo AP, Mcelroy AN, Starker CG, Lee C, Merkel S: Evaluation of TCR Gene Editing Achieved by TALENs, CRISPR/Cas9, and megaTAL Nucleases. Molecular Therapy 2016, 24(3):570-581.
Listgarten J, Weinstein M, Kleinstiver BP, Sousa AA, Joung JK, Crawford J, Gao K, Hoang L, Elibol M, Doench JG: Prediction of off-target activities for the end-to-end design of CRISPR guide RNAs. Nature Biomedical Engineering 2018, 2(1):38-47.
Hui KK, Min S, Song M, Jung S, Choi JW, Kim Y, Lee S, Yoon S, Kim H: Deep learning improves prediction of CRISPR–Cpf1 guide RNA activity. Nature Biotechnology 2018, 36(3):239-241.
Yanni L, Cradick TJ, Brown MT, Harshavardhan D, Piyush R, Neha S, Wile BM, Vertino PM, Stewart FJ, Gang B: CRISPR/Cas9 systems have off-target activity with insertions or deletions between target DNA and guide RNA sequences. Nucleic Acids Research 2014, 42(11):7473-7485.
Doench JG, Fusi N, Sullender M, Hegde M, Vaimberg EW, Donovan KF, Smith I, Tothova Z, Wilen C, Orchard R: Optimized sgRNA design to maximize activity and minimize off-target effects of CRISPR-Cas9. Nature Biotechnology 2016, 34(2):184-191.
Pei FK, Powers S, He S, Li K, Zhao X, Bo H: A systematic evaluation of nucleotide properties for CRISPR sgRNA design. BMC Bioinformatics 2017, 18(1):297.
Abadi S, Yan WX, Amar D, Mayrose I: A machine learning approach for predicting CRISPR-Cas9 cleavage efficiencies and patterns underlying its mechanism of action. PLoS Computational Biology 2017, 13(10):e1005807.
Rahman MK, Rahman MS: CRISPRpred: A flexible and efficient tool for sgRNAs on-target activity prediction in CRISPR/Cas9 systems. PLoS ONE 2017, 12(8):e0181943.
Chuai G, Ma H, Yan J, Chen M, Hong N, Xue D, Zhou C, Zhu C, Chen K, Duan B: DeepCRISPR: optimized CRISPR guide RNA design by deep learning. Genome Biology 2018, 19(1):80.
Jiecong L, Ka-Chun W: Off-target predictions in CRISPR-Cas9 gene editing using deep learning. Bioinformatics 2018, 34(17):i656-i663.
Haeussler M, Kai S, Eckert H, Eschstruth A, Mianné J, Renaud JB, Schneider-Maunoury S, Shkumatava A, Teboul L, Kent J: Evaluation of off-target and on-target scoring algorithms and integration into the guide RNA selection tool CRISPOR. Genome Biology 2016, 17(1):148.
Dimauro G, Colagrande P, Carlucci R, Ventura M, Bevilacqua V, Caivano D: CRISPRLearner: A Deep Learning-Based System to Predict CRISPR/Cas9 sgRNA On-Target Cleavage Efficiency. Electronics 2019, 8:1478.
Henriette OG, Henry IM, Bhakta MS, Meckler JF, Segal DJ: A genome-wide analysis of Cas9 binding specificity using ChIP-seq and targeted sequence capture. Nucleic Acids Research 2015, 43(6):3389-3404.
Xuebing W, Scott DA, Kriz AJ, Chiu AC, Hsu PD, Dadon DB, Cheng AW, Trevino AE, Silvana K, Sidi C: Genome-wide binding of the CRISPR endonuclease Cas9 in mammalian cells. Nature Biotechnology 2014, 32(7):670-676.
Doench JG, Ella H, Graham DB, Zuzana T, Mudra H, Ian S, Meagan S, Ebert BL, Xavier RJ, Root DE: Rational design of highly active sgRNAs for CRISPR-Cas9-mediated gene inactivation. Nature Biotechnology 2014, 32(12):1262-1267.
Tim W, Wei JJ, Sabatini DM, Lander ES: Genetic screens in human cells using the CRISPR-Cas9 system. Science 2014, 343(6166):80-84.
Nathan W, Weijun L, Xiaowei W: WU-CRISPR: characteristics of functional guide RNAs for the CRISPR/Cas9 system. Genome Biology 2015, 16(1):218.
Alkhnbashi OS, Fabrizio C, Shah SA, Garrett RA, Saunders SJ, Rolf B: CRISPRstrand: predicting repeat orientations to determine the crRNA-encoding strand at CRISPR loci. Bioinformatics 2014, 30(17):489-496.
Prashant M, John A, P Benjamin S, Esvelt KM, Mark M, Sriram K, Luhan Y, Church GM: CAS9 transcriptional activators for target specificity screening and paired nickases for cooperative genome engineering. Nature Biotechnology 2013, 31(9):833.
Pennington J, Socher R, Manning CD: GloVe: Global Vectors for Word Representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014: 1532-1543.
Hochreiter S, Schmidhuber J: Long Short-Term Memory. Neural Computation 1997, 9(8):1735-1780.
David F, Benjamin R: Estimation of the area under the ROC curve. Statistics in Medicine 2002, 21(20):3093-3106.

Due to technical limitations the Tables are available as a download in the Supplementary Files.

Additional file 1. Detailed comparison results for sgRNA off-target propensity prediction. (XLSX 20 kb)

Additional file 2. A detailed description of the model structure for model selection. (PDF 184 kb)

Submission checks completed at journal
04 Feb, 2020
Editorial decision: Accept
04 Feb, 2020
