1. Knowledge of the reaction mechanism can guide the parameter estimation.
The binding of Cas9 begins with interaction between the Cas9 PI domain and the PAM sequence. Then, the first five nucleotides of the guide RNA hybridize to the target DNA. Hybridization of the rest of the guide is guarded by base stacking with Tyr450. Removal of the Tyr450 leads to conformational changes of the REC domain. When RNA-DNA hybridization length reaches ~ 12 nucleotides, HNH domain becomes disordered. At the hybridization length of ~ 16 nucleotides, the HNH domain refolds. To cut DNA, the HNH domain has to re-orient. It becomes possible at the hybridization length of 18–20 nucleotides. Cas9 also binds ~ 6 nucleotides upstream of the target sequence. For successful cutting, these nucleotides have to be paired [11].
The model includes three rate constants. One rate constant can be estimated from training data. An estimator of the association rate could be confirmed with the help of the validation dataset, if we could devise one. Luckily, the association reaction is the easiest to understand. Cas9 binding is mediated by RNA-DNA triplex formation. It seems plausible that the association rate depends upon the free energy of hybridization. The simple picture is complicated by the fact that relative contributions of RNA-DNA hybridization and DNA melting are unknown. If DNA melting is rate limiting, a more stable duplex would lead to lower association rates. If RNA-DNA hybrid formation is rate limiting, a more stable duplex (because DNA-DNA and RNA-DNA nearest neighbor parameters are highly correlated) would lead to higher association rates. In other words, association is not an elementary reaction. Because data are insufficient, RNA-DNA hybridization and DNA melting have to be considered as the association reaction. The estimator is regarded as the best, if it produces the largest correlation with experimentally measured association rates from the validation data set.
There are no means to validate estimators of cutting or locking rates. The validation dataset contains rates of accumulation of the cut product; i.e., macroscopic rates. The rate of product accumulation depends upon both absolute and relative values of microscopic locking and cutting rates. Consequently, unless cutting or locking rates are constant, no correlation is to be expected between \({\widehat{k}}_{l}\) or \({\widehat{k}}_{c}\) and macroscopic cutting rates.
The last assumption that is inherent to the model is that cutting and locking rates depend upon gRNA sequence. Before Cas9 is able to cut DNA, considerable conformational changes occur. Locking leads to Cas9 RNP being stuck in an unproductive conformation. The model Eq. (2) shows that at equilibrium the editing efficiency is equal to \(\frac{1}{1+\frac{{k}_{l}}{{k}_{c}}}\). It follows that if locking and cutting rates were independent of gRNA sequence, all gRNAs would have had the same efficiency at equilibrium. It contradicts observed efficiency variations in the validation data set. At least one of the rates must vary. Whether it is one or both rates is impossible to say because of insufficient data.
It is important to note that rate constant estimations in this work have arbitrary units. Actually, these constants are known only to some factor. Since reactions are modeled as irreversible, equations are monotonically increasing or decreasing functions of parameters. Even if the true values of the parameters are not known, their relative values can be used to rank gRNAs by efficiency.
2. The performance of the model is on par with contemporary machine learning models.
To find out the best estimator of the association rate constant, the correlation between free energy of hybridization for sequences of \(2\le n\le 20\) PAM-proximal nucleotides \(\varDelta {G}_{n}\) and association rate (for the validation dataset) or editing efficiency (for the training dataset) was computed (Fig. 1). Two features of the graph draw attention. In the training dataset the first five nucleotides have the largest correlation. The PAM proximal 10 nucleotides of the guide are called the seed sequence. Structural studies revealed that in Cas9 guides, as in many other RNA-guided nucleases, the seed is bi-partite. The first five nucleotides mostly control association rates, while the last five nucleotides define binding stability [12]. Large correlation coefficient at \(n=5\) reflects the importance of this check mechanism. Second, the sign of the correlation coefficient differs between training and validation data set. Knowledge of the reaction mechanism helps explain the difference. The correlation in the validation data set is negative because Cas9 RNPs were present in large (at least 10-fold) excess relative to targets. In such conditions, DNA melting becomes rate limiting. Cas9 collides with a target so often that many collisions occur during the time DNA is being melted. On the contrary, in cellulo collisions are much rarer and association rates would mostly depend upon how efficient the guide sequence can hybridize to the target. It can’t be also excluded that in cells the DNA duplex is less stable.
Table 2
|
\({{c}}_{0}\)
|
\({t}\)
|
\({{k}}_{{c}}\)
|
\({{k}}_{{l}}\)
|
Model#1
|
1.0
|
20.0
|
1.1
|
unknown
|
Model#2
|
1.0
|
40.0
|
unknown
|
1.1
|
Figure 1 Correlation between the logarithm of the estimator of the association rate and the logarithm of editing efficiency (for the training dataset) or association rate (for the validation dataset). The free energy of hybridization was evaluated using RNA-DNA [9] or DNA-DNA [10] nearest neighbor parameters.
Next, cutting and locking rates are to be estimated. Since it is possible to estimate only one additional parameter, one rate is to be fixed (Table 2). Parameter estimation is described in Methods. Coefficients of determination were 0.443 and 0.272 for locking and cutting rate models, respectively. Accuracy of models was evaluated on the same datasets that were used for testing in [1] (Table 3).
Table 3
Correlation between experimental and estimated gRNA efficiencies. All estimations are for pseudo-time \(t=1.0\).
Test dataset
|
Model#1 Pearson r
|
Model#1 Spearman r
|
Model#2 Pearson r
|
Model#2 Spearman r
|
Chari 293T
|
-0.254485
|
-0.355162
|
-0.278352
|
-0.355199
|
Hart Rpe1
|
-0.126022
|
-0.146672
|
-0.155291
|
-0.148228
|
Hart Hct1162-2 Lib1
|
-0.221989
|
-0.265629
|
-0.249242
|
-0.265974
|
Hart2016 Hela Lib1
|
-0.181994
|
-0.227824
|
-0.220271
|
-0.228232
|
Hart2016 Hela Lib2
|
-0.219137
|
-0.285188
|
-0.271233
|
-0.286214
|
Xu Kbm7
|
0.372877
|
0.453884
|
0.472507
|
0.454516
|
Wang/Wu HL60
|
0.37075
|
0.447187
|
0.44882
|
0.448447
|
Doench_2014 Human
|
-0.150492
|
-0.233777
|
-0.2356
|
-0.234539
|
Doench_2014 Humans CD13
|
-0.274975
|
-0.339741
|
-0.345057
|
-0.341134
|
Doench_2014 Human CD15
|
0.0254086
|
-0.065822
|
-0.0352259
|
-0.0678037
|
Doench_2014 Human CD33
|
-0.249523
|
-0.257942
|
-0.236194
|
-0.255413
|
Doench_2014 Mouse
|
-0.27991
|
-0.335631
|
-0.34711
|
-0.334186
|
Doench_2016
|
-0.205275
|
-0.269808
|
-0.295493
|
-0.269609
|
Datasets of Hart, Xu and Wang/Wu are libraries of gRNAs targeting essential genes. More efficient knock-outs of essential genes lead to more frequent cell death. Consequently, the fraction of gRNAs in the library decreases with time proportionally to their efficiency. Of course, any gRNA targeting an essential gene would be removed from the library after some time. Because of saturation, when depletion is measured too late, its degree might not reflect gRNA efficiency well. For example, the ‘Hart Rpe1’ dataset has the smallest (by absolute value) correlation coefficient. The mean efficiencies for ‘Hart Rpe1’, ‘Hart Hct1162-2 Lib1’, ‘Hart2016 Hela Lib1’, ‘Hart2016 Hela Lib2’ are 2.78, 1.81, 1.67 and 1.74, respectively. Efficiency values are log2-transformed fold changes of the number of reads. The large mean value indicates that gRNA efficiencies of the ‘Hart Rpe1’ dataset were measured closer to the saturation point. This might be the reason for the low correlation coefficient.
Datasets ‘Xu Kbm7’ and ‘Wang/Wu HL60’ are from the same work. In these datasets, signs of efficiency values were reversed by Xu (see Methods in [5]). This is why their correlation coefficients are positive. The reason for the relatively high values of correlation coefficients is not clear. Ploidy does not seem to be the reason. Haploid KBM7 and diploid HL60 cell lines have equally high correlation.
Doench_2014 datasets use the ‘gene percent rank’ as an efficiency measure. Most importantly, the ‘gene percent rank’ is gene-specific. Each gRNA targeting a given gene is assigned a number from 0 to 1, such that the most efficient gRNA is assigned 1. Consequently, it is improper to compute the correlation across all targeted genes. Correlations were estimated for three target genes separately (datasets ‘Doench_2014 Humans CD13’, ’Doench_2014 Humans CD15’, ‘Doench_2014 Humans CD33’). As can be seen, dataset ’Doench_2014 Humans CD15’ has nearly zero correlation. The possible cause is low expression of CD15. Protein Atlas expression levels in the NB4 cell line are 355.4, 24.2 and 55.2 for CD13, CD15 and CD33, respectively [13]. Hypothetically, the degree of KO of CD15 gene was poorly identifiable due to a low expression level.