**2.1 Model Structure and Prediction**

In our initial design, we combined a biLSTM with a CNN framework as the final prediction model; the model structure is shown in **Fig. 1**. We also constructed several variant models by removing different network parts, then compared their test results to select the final prediction model. All candidate network frameworks considered during model selection are briefly described in **Table 1**.

The structure of the benchmark framework of CnnCrispr is described in detail below:

The first layer of CnnCrispr is an embedding layer, which takes as input the vectors obtained from the GloVe model. Since the vector dimension of the GloVe model is set to 100, the input of the embedding layer is a two-dimensional matrix of size 16×100. We used the mittens package in Python to train the GloVe model after constructing the GloVe co-occurrence matrix.

The second layer is a biLSTM network, which is mainly used to extract contextual features from the input. Five convolution layers follow, each with a different number of kernels and a different kernel size. Two fully connected layers, of sizes 20 and 2 respectively, are placed after the last convolution layer.

In addition to the framework described above, Batch Normalization and Dropout layers are added between layers to prevent overfitting. The dropout rate is set to 0.3. For the output layer, *softmax* and *sigmoid* activation functions are used to obtain the predictions of the classification model and the regression model, respectively.

During training, the initial learning rate was set to 0.01, and we used the *Adam* algorithm to optimize the loss function. We set the batch size to 256 to balance two requirements: extracting the potential information in the negative data and avoiding overfitting. Too large a batch size may increase the risk that some positive samples occur multiple times in a single batch during training, while too small a batch size slows training and extends the training time.

Our experiment was divided into two parts. First, we compared the performance of the different candidate models. Then, we compared the final prediction model with existing well-performing models to evaluate its practical applicability. Detailed network descriptions can be found in Additional file 2.

**2.2 Model Selection**

The experimental data come from the supplementary material of the DeepCrispr article, and the data are described in detail in Section 5.1. For training, 20% of the data in the Hek293t and K562 data sets were randomly selected to form the test sets (the Hek293t test set, the K562 test set and the Total test set, respectively). The prediction models were trained on all the remaining data, and the prediction performance of each model on the three test sets was evaluated. During training, we generated the batch training data using the data sampling method described in **Section 5.6**.
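The random 8:2 division can be sketched as follows. This is a minimal illustration only: the record layout, the fixed seed, and the assumption that the Total test set is the union of the two per-cell-line test sets are ours and are not stated in the text.

```python
import random

def split_80_20(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of `records` and return (train, test)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

# Toy stand-ins for the two cell-line data sets (hypothetical records).
hek293t = [("Hek293t", i) for i in range(50)]
k562 = [("K562", i) for i in range(30)]

hek_train, hek_test = split_80_20(hek293t)
k562_train, k562_test = split_80_20(k562)

# Assumption: the "Total" test set is the union of the per-cell-line test sets.
total_test = hek_test + k562_test
train = hek_train + k562_train
```

Splitting each cell line separately keeps the 20% proportion in both the Hek293t and K562 test sets, so neither cell line dominates the combined set by accident.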

We built two models, one for classification and one for regression. The first three models listed in **Table 1** were trained to verify the influence of the different network parts on prediction performance. The structure of the benchmark model CnnCrispr is introduced in **Section 2.1**. CnnCrispr_No_LSTM was obtained by removing the LSTM part from CnnCrispr, and CnnCrispr_Conv_LSTM was obtained by swapping the order of the convolution layers and the recurrent layer in CnnCrispr. The latter two models were mainly intended to show whether the CNN and RNN layers improve performance, and whether the order of the two components affects it.

We initially trained the three models mentioned above and obtained the prediction results. The model performance is shown in **Table 2**.

Because the data set is highly unbalanced, it was easy for a model to obtain a high auROC value. We therefore set aside the comparison of auROC values and focused on the auPRC and Recall values on the test sets. The results in **Table 2** were used to draw the histogram in **Fig. 2**, which shows that CnnCrispr has better predictive performance. We therefore took CnnCrispr as the benchmark network framework and further fine-tuned the network structure.
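To illustrate why a high auROC is easy to obtain on highly unbalanced data, the following minimal sketch computes both measures on a toy ranking. The labels and scores are invented for illustration, and average precision is used here as a stand-in estimate of auPRC; this is an assumption and not necessarily how the values in **Table 2** were computed.

```python
def au_roc(labels, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank  # precision at each recalled positive
    return total / hits

# Toy data: 2 positives among 200 negatives; 10 negatives outrank a positive.
labels = [1, 1] + [0] * 200
scores = [0.6, 0.4] + [0.7] * 5 + [0.5] * 5 + [0.1] * 190

roc = au_roc(labels, scores)            # 0.9625
ap = average_precision(labels, scores)  # about 0.167
print(f"auROC={roc:.4f}  auPRC~={ap:.4f}")
```

Only ten of the 200 negatives are ranked above a positive, so auROC stays close to 1, yet those few false positives are enough to pull the precision-based measure down to about 0.17.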

Starting from CnnCrispr, we removed the Dropout layer and the Batch Normalization (BN) layer in turn to verify the influence of each on performance. A brief description of the resulting network structures is given in **Table 1**. The recall value of CnnCrispr_No_Dropout was 0.810 on the total test set, slightly lower than that of CnnCrispr. This showed that the Dropout layer did improve performance and help prevent overfitting, although the improvement was not very noticeable. However, after adding the Dropout layer, the training parameters of the model were greatly reduced, which saved training time; hence we kept the Dropout layer in the final model. We then trained the model without the Batch Normalization layer several times and evaluated it on the test set, but each time the entire test set was classified as negative. This indicated that the model without the BN layer had lost its ability to classify. The BN layer is therefore essential, and we retained it in the final model; its importance for neural network models is also discussed in **Section 5.5**.

**2.3 Model Comparison**

We selected four sgRNA off-target propensity prediction models for model comparison, namely CFD[33], MIT[16], CNN_std[38] and DeepCrispr[37].

CFD is short for Cutting Frequency Determination. As a scoring model for evaluating the off-target propensity of sgRNA-DNA interactions, CFD assigns a score to each mismatch between the sgRNA and the corresponding DNA sequence according to its location and type. When multiple mismatches appear in a sequence pair, the corresponding scores are multiplied to obtain the final score. For example, if the sgRNA-DNA sequence pair has an rG-dA mismatch at position 6 and an rC-dT mismatch at position 10, it receives a CFD score of 0.67×0.87=0.583. Haeussler *et al.* [39] compared the performance of CFD with that of MIT and showed that CFD performed slightly better than MIT on the CRISPOR data set. CNN_std is a CNN-based sgRNA off-target propensity prediction model developed by *Jiecong Lin*. It encodes each pair of sgRNA and corresponding DNA sequence using an “XOR” principle and makes predictions with a multi-layer convolutional network. DeepCrispr is a deep learning method that takes sgRNA-DNA sequence information combined with genomic epigenetic features as input. DeepCrispr used the largest data set available for model training and introduced an auto-encoder to automatically learn latent features of the sgRNA-DNA sequence, which was a good application of deep learning to sgRNA-related prediction problems.
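The multiplicative rule in the CFD example above can be sketched in a few lines. The lookup table below holds only the two mismatch scores quoted in the text; the key layout and the function name are hypothetical, and any other factor that the full CFD scheme may apply is left out.

```python
from math import prod

# Only the two penalties quoted in the text; the full CFD table is not reproduced.
# Key layout (RNA base, DNA base, position) is an assumption made for this sketch.
MISMATCH_SCORES = {
    ("rG", "dA", 6): 0.67,
    ("rC", "dT", 10): 0.87,
}

def cfd_score(mismatches, table=MISMATCH_SCORES):
    """Multiply the per-mismatch scores; with no mismatches the product is 1.0."""
    return prod(table[m] for m in mismatches)

score = cfd_score([("rG", "dA", 6), ("rC", "dT", 10)])
print(round(score, 3))  # 0.583
```

Because the scores are multiplied and each lies between 0 and 1, every additional mismatch can only lower the final score.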

To make a more comprehensive comparison with the four models above, we tested the performance of the classification and regression models under two test patterns. We downloaded the prediction models of CFD, MIT and CNN_std from the relevant websites and obtained their prediction results on the same test set as CnnCrispr. Because the training methods of CnnCrispr and DeepCrispr were consistent, we used the test results reported for DeepCrispr directly in the comparison.

2.3.1 Test pattern 1 -- withheld 20% as an independent testing set

Consistent with the training method in the “Model Selection” section, we randomly divided the data set of each cell line in the proportion 8:2. We compared the performance of CnnCrispr with the current leading prediction models. **Fig. 3** shows the comparison results under the classification schema. CnnCrispr achieved an auROC value of 0.975 and an auPRC value of 0.679 on the total test set, both higher than the values of CFD, MIT and CNN_std. Similar trends were observed on the Hek293t and K562 test sets, where CnnCrispr achieved auROC values of 0.971 and 0.995 and auPRC values of 0.686 and 0.688, respectively. The areas under the ROC and PRC curves of CnnCrispr on all three test sets were higher than those of CFD, MIT and CNN_std, which indicated that CnnCrispr had stronger prediction ability. In addition, the PRC curve of CnnCrispr on the total test set and the Hek293t test set completely enclosed the PRC curves of the other three models, while on the K562 test set only a small portion of the CnnCrispr curve was covered by that of CNN_std. Overall, CnnCrispr performed better than the other three models. Since the training and test sets were extremely unbalanced, the PRC curve and the area under it were the more important measures for model evaluation, and here CnnCrispr had a strong competitive advantage.

In addition to the comparison with the above three models, we compared the test performance of CnnCrispr with that of DeepCrispr. Since the training methods and data sets were consistent, we directly used the test results given in ref. [37]; the results are shown in **Table 3**. The auROC values of DeepCrispr were slightly better than those of CnnCrispr (most visibly on the Hek293t test set), but the auPRC values obtained by CnnCrispr on all three test sets were higher than those of DeepCrispr. Taken together, CnnCrispr showed better performance than DeepCrispr under test pattern 1.

Unlike the classification schema, the regression schema was evaluated mainly by the Pearson correlation coefficient and the Spearman rank correlation coefficient of the prediction results. The Pearson correlation coefficient between CnnCrispr’s predictions and the true labels was higher than that of each of the three comparison models. (Since the Pearson coefficient was not used as an evaluation measure in DeepCrispr, we compared only the Spearman values of CnnCrispr and DeepCrispr.)

The Pearson value of CnnCrispr on the Hek293t test set reached 0.712 (higher than the 0.371 obtained by CFD, 0.153 by MIT and 0.33 by CNN_std). On the total test set, CnnCrispr also demonstrated better predictive ability, with a Pearson value of 0.682, higher than 0.343 for CFD, 0.150 for MIT and 0.321 for CNN_std. For the Spearman correlation coefficient, the negative data in the test set far outnumbered the positive data (about 250:1), so a high Spearman value could not be achieved. Nevertheless, the prediction ability of CnnCrispr was still better than that of the four models above: the Spearman values of CnnCrispr on the Hek293t, K562 and Total test sets were 0.154, 0.160 and 0.134 respectively, compared with 0.140, 0.143 and 0.128 for CFD; 0.085, 0.084 and 0.086 for MIT; 0.141, 0.144 and 0.132 for CNN_std; and 0.136, 0.126 and 0.133 for DeepCrispr. In addition, following CRISTA’s evaluation method, we treated the predicted values as probabilities of the classification labels and compared the areas under the ROC and PRC curves of the five models. The auROC and auPRC values obtained by CnnCrispr on the total test set were 0.986 and 0.601 respectively, superior to 0.942 and 0.316 for CFD and 0.947 and 0.208 for CNN_std, and the same pattern was observed on the Hek293t and K562 test sets. Based on these results, we concluded that CnnCrispr had better prediction ability.
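The two regression measures, and the reason a large share of negative (zero-valued) labels limits the Spearman value, can be sketched as follows. The implementation is a plain standard-library version written for illustration, and the toy labels and predictions are invented; it is not the evaluation code used for the results above.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Toy data: eight negatives (label 0) and two positives, both ranked on top.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0.5, 1.0]
y_pred = [0.08, 0.07, 0.06, 0.05, 0.04, 0.03, 0.02, 0.01, 0.6, 0.9]

rho = spearman(y_true, y_pred)  # roughly 0.70
print(f"Pearson={pearson(y_true, y_pred):.3f}  Spearman={rho:.3f}")
```

In this toy case both positives are ranked above every negative, yet the Spearman value is only about 0.70: the tied zero labels carry no ordering, so whatever order the predictor gives the negatives counts against it. With a ratio near 250:1 this effect is much stronger.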

2.3.2 Test pattern 2 – “Leave-one-sgRNA-out”

To examine the accuracy and generalization ability of CnnCrispr when predicting the off-target propensity of a new sgRNA, we set up a “leave-one-sgRNA-out” experiment, which is a good evaluation method for off-target propensity prediction. In each round of training, one sgRNA and all of its corresponding off-target sequences (those with true cleavage propensity as well as the potential sites obtained from the whole genome) were completely withheld for model testing. With one round per sgRNA, model training and performance evaluation were conducted 29 times in total. Through this 29-fold cross-validation, we were able to comprehensively evaluate the generalization ability of CnnCrispr and avoid over-fitting or under-fitting of the model when predicting for particular sgRNAs.
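The fold construction described above can be sketched as a grouped split in which every record of one sgRNA is withheld together. The record layout `(sgrna_id, off_target_site, label)` and the toy data are assumptions made for this sketch.

```python
from collections import defaultdict

def leave_one_sgrna_out(records):
    """Yield (held_out_sgrna, train, test) once per distinct sgRNA.

    `records` is an iterable of (sgrna_id, off_target_site, label) tuples;
    this layout is hypothetical and chosen only for illustration.
    """
    by_sgrna = defaultdict(list)
    for rec in records:
        by_sgrna[rec[0]].append(rec)
    for held_out in by_sgrna:
        test = by_sgrna[held_out]
        train = [r for sg, recs in by_sgrna.items() if sg != held_out for r in recs]
        yield held_out, train, test

# Toy data with three sgRNAs; the real experiment has 29, giving 29 folds.
toy = [
    ("sg1", "siteA", 1), ("sg1", "siteB", 0),
    ("sg2", "siteC", 0),
    ("sg3", "siteD", 1), ("sg3", "siteE", 0),
]
folds = list(leave_one_sgrna_out(toy))
for held_out, train, test in folds:
    print(held_out, len(train), len(test))
```

Grouping by sgRNA, instead of splitting records at random, guarantees that no off-target site of the tested sgRNA is seen during training, which is what makes the test a measure of generalization to a new sgRNA.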

For classification, CnnCrispr achieved an average auROC of 0.957 and an average auPRC of 0.429, both higher than the results of the four models above (CFD: auROC 0.903, auPRC 0.319; MIT: auROC 0.848, auPRC 0.115; CNN_std: auROC 0.925, auPRC 0.303; DeepCrispr: auROC 0.841, auPRC 0.421). In the 29-fold cross-validation, CnnCrispr’s overall competitive advantage was more pronounced, and its auPRC was higher than that of the other four models, which is essential to prevent the model from missing actual off-target sites (see **Fig. 5**).

To make a more comprehensive evaluation, we also considered the distribution of the auROC and auPRC values obtained in the 29-fold cross-validation and drew violin plots. (Because we were unable to obtain the test data of DeepCrispr, we could not draw a violin plot for it.) The outer shape of a violin plot is the kernel density estimate of the underlying distribution. **Fig. 6** shows that the auROC values of CnnCrispr were generally higher and more concentrated: 75% of them were greater than 0.9. By contrast, the auROC values of the other three models contained obvious outliers, indicating that these models cannot reliably predict the off-target propensity of individual sgRNAs; in particular, the auROC values of CFD and CNN_std were visibly dispersed (their lower whiskers were longer). The violin of CnnCrispr also grew wider as the auROC value increased, which showed that more of its auROC values fell in the high range, further indicating the good prediction performance of CnnCrispr. For auPRC, the median of the results obtained by CnnCrispr was significantly larger than that of the other three models, which showed that CnnCrispr had a higher overall score, and 75% of the auPRC values obtained by CnnCrispr were greater than 0.2. The CnnCrispr values were concentrated at higher scores, indicating that the overall predictive performance of CnnCrispr was indeed better than that of CFD and CNN_std (see **Fig. 6**).
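A statement such as "75% of the values were greater than 0.9" corresponds to the first quartile of the per-fold values lying above 0.9. The sketch below shows how such a summary can be read off a list of per-fold scores; the nine values are hypothetical and are not the 29 values plotted in **Fig. 6**.

```python
from statistics import quantiles

# Hypothetical per-sgRNA auROC values, for illustration only.
fold_auroc = [0.91, 0.93, 0.95, 0.96, 0.97, 0.98, 0.99, 0.88, 0.94]

# Quartile cut points: first quartile, median, third quartile.
q1, median, q3 = quantiles(fold_auroc, n=4)

# Fraction of folds whose auROC exceeds 0.9.
share_above = sum(v > 0.9 for v in fold_auroc) / len(fold_auroc)

print(f"Q1={q1:.3f}  median={median:.3f}  Q3={q3:.3f}  share>0.9={share_above:.2f}")
```

The box and whiskers drawn inside a violin plot are typically built from these same quartiles, while the outer shape adds the kernel density estimate.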

We further compared the 29-fold cross-validation results under the regression schema; the results are visualized in **Fig. 5** and **Fig. 6**. We first compared the mean Pearson correlation coefficient and the mean Spearman correlation coefficient (see **Fig. 5**). CnnCrispr achieved higher mean Pearson and Spearman values, which showed that CnnCrispr had better fitting ability. Furthermore, we drew violin plots of the 29 sets of Pearson and Spearman values. As shown in **Fig. 6**, the Pearson values obtained by CnnCrispr were distributed more in the high score range. The Spearman scores of all four models were lower, but the distribution of CnnCrispr scores was still significantly better than that of the other three models. We conclude that CnnCrispr had a higher probability of obtaining well-fitting predictions of off-target propensity (detailed results are in **Additional file 1**).