Training data for language models
We collected protein sequences from NCBI databases to train the ESM-2 650-million- and 15-billion-parameter language models. We first used the keyword ‘CRISPR-associated protein’ to download all matching gene IDs and then screened the 'gene' annotation for ‘cas’. After removing redundant sequences, we obtained 77684 Cas protein sequences (11248 Cas1, 15148 Cas2, 12309 Cas3, 7708 Cas4, 8656 Cas5, 11281 Cas6, 340 Cas7, 299 Cas8, 6706 Cas9, 3525 Cas10, 334 Cas12 and 130 Cas13) together with 13047 non-Cas proteins. We split these sequences into two datasets, with 80% used as training data and 20% as validation data.
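The 80/20 split described above can be sketched as follows. This is a minimal illustration only: the text does not state whether the split was random or stratified by class, so a seeded random shuffle is assumed, and the function name, seed and toy records are hypothetical.

```python
import random


def split_train_validation(records, train_fraction=0.8, seed=0):
    """Shuffle records reproducibly and split them into training and validation lists.

    Assumption: a plain random split; the source does not say whether
    the split was stratified by Cas family.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Toy stand-in for (sequence ID, label) pairs; not the real dataset.
records = [(f"SEQ{i}", "Cas9" if i % 2 else "non-Cas") for i in range(10)]
train, validation = split_train_validation(records)
```

With strongly imbalanced classes such as Cas13 (130 sequences), a stratified split would keep every class represented in both subsets; that would be a variation on this sketch, not something the text confirms.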
Training ESM language models
We fine-tuned the open-source sequence classifier provided by ESM to adapt it to our application. The classification head consists of two fully connected layers. The first layer applies a linear transformation to the output features of the ESM model, mapping them to a space of the same dimensionality, followed by a hyperbolic tangent (Tanh) activation function to introduce non-linearity. The second layer projects the resulting features to the number of target classes. This design combines the feature extraction of ESM with a fully connected classifier, retaining the high-dimensional sequence features while learning the classification task.
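The two-layer head described above might look like the following PyTorch sketch. It is not the authors' code. The class name, the hidden size of 1280 (the embedding width of the 650-million-parameter ESM-2 model), the 13 output classes (12 Cas families plus non-Cas, inferred from the dataset description) and the use of one pooled feature vector per sequence are all assumptions; any dropout in the original classifier is omitted.

```python
import torch
from torch import nn


class ClassificationHead(nn.Module):
    """Linear -> Tanh -> Linear head applied to per-sequence ESM features."""

    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        # First layer keeps the feature dimensionality unchanged.
        self.dense = nn.Linear(hidden_size, hidden_size)
        self.activation = nn.Tanh()
        # Second layer projects to one logit per target class.
        self.out_proj = nn.Linear(hidden_size, num_classes)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.out_proj(self.activation(self.dense(features)))


# Assumed sizes: 1280-dimensional features, 13 classes, batch of 4 sequences.
head = ClassificationHead(hidden_size=1280, num_classes=13)
logits = head(torch.randn(4, 1280))
```

The random input only stands in for features produced by the ESM encoder; in fine-tuning, the head would be trained together with the encoder.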
To improve training efficiency and model performance, we employed the DeepSpeed training acceleration framework, in particular its ZeRO-3 offload feature, to reduce memory usage and accelerate training. AdamW was used as the optimizer for its weight decay and momentum, which improve training stability and efficiency. WarmupLR was adopted as the learning rate scheduler; it gradually increases the learning rate in the early stages of training to facilitate model convergence. FocalLoss was used as the training loss function; it reweights samples to mitigate the class imbalance problem. The hyperparameter α of FocalLoss was determined from the ratio of class sizes in the Cas data. The model architecture and training were implemented in PyTorch.
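To make the role of the focal loss and its class-balancing hyperparameter (written alpha here) concrete, the following pure-Python sketch computes the per-sample loss -alpha_t (1 - p_t)^gamma log(p_t). The text says only that the hyperparameter was set from the ratio of class sizes; the inverse-frequency rule, the value gamma = 2.0 and the function names below are assumptions for illustration, and only three of the classes are shown.

```python
import math


def inverse_frequency_alpha(class_counts):
    """Per-class weights proportional to 1/count, normalised to sum to 1.

    Assumption: one plausible way to derive alpha from class sizes.
    """
    inverse = {name: 1.0 / count for name, count in class_counts.items()}
    total = sum(inverse.values())
    return {name: value / total for name, value in inverse.items()}


def focal_loss(probabilities, target, alpha, gamma=2.0):
    """Focal loss for one sample: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = probabilities[target]
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)


# Class sizes taken from the dataset description (subset of classes only).
counts = {"Cas1": 11248, "Cas13": 130, "non-Cas": 13047}
alpha = inverse_frequency_alpha(counts)

# A well-classified sample contributes far less loss than an uncertain one.
confident = focal_loss({"Cas1": 0.05, "Cas13": 0.90, "non-Cas": 0.05}, "Cas13", alpha)
uncertain = focal_loss({"Cas1": 0.40, "Cas13": 0.30, "non-Cas": 0.30}, "Cas13", alpha)
```

Under this weighting, the rare Cas13 class receives a much larger alpha than Cas1, while the (1 - p_t)^gamma factor down-weights samples the model already classifies confidently.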
The training was conducted on the Zhejiang Lab Alkaid Intelligent Computing Operating System using NVIDIA A100 GPUs.
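The DeepSpeed settings named in this section (ZeRO stage 3 with offload, AdamW and WarmupLR) are normally supplied as a JSON configuration. The sketch below builds such a configuration as a Python dictionary. The key names follow the DeepSpeed configuration schema, but every numeric value (batch size, learning rate, weight decay, warm-up steps) and the choice of CPU as the offload device are placeholders, because the text does not report them.

```python
import json

# Placeholder hyperparameters: none of these values are reported in the text.
LEARNING_RATE = 1e-5
WARMUP_STEPS = 1000

ds_config = {
    "train_micro_batch_size_per_gpu": 4,
    "optimizer": {
        "type": "AdamW",
        "params": {"lr": LEARNING_RATE, "weight_decay": 0.01},
    },
    "scheduler": {
        "type": "WarmupLR",
        "params": {
            "warmup_min_lr": 0.0,
            "warmup_max_lr": LEARNING_RATE,
            "warmup_num_steps": WARMUP_STEPS,
        },
    },
    "zero_optimization": {
        "stage": 3,
        # Offloading optimizer state and parameters to CPU memory is assumed.
        "offload_optimizer": {"device": "cpu"},
        "offload_param": {"device": "cpu"},
    },
}

# Serialised form, as it would be written to a DeepSpeed JSON config file.
ds_config_json = json.dumps(ds_config, indent=2)
```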
Protein Expression and Purification
The discovered Cas12a candidates were expressed and purified as previously described48. In brief, the coding sequences of the Cas proteins were codon-optimized, synthesized by Tsingke Biotech (China) and cloned into pET28a (Novagen) with a C-terminal 10×His tag. The pET28a-Cas12a plasmid was transformed into E. coli Rosetta, and expression was induced with 0.2 mM IPTG for 16 h at 18 °C before the cells were harvested. After lysis of the cell pellets, the Cas12a protein was purified using a Ni-NTA resin column and a Heparin Sepharose column according to the manufacturer's instructions (GE Healthcare). The purified Cas12a protein was then concentrated in storage buffer (50 mM Tris-HCl, pH 7.5, 500 mM NaCl, 10% (v/v) glycerol, 2 mM DTT), quantified by absorbance at 280 nm and frozen at -80 °C until use.
Nucleic acid preparation
The double-stranded DNA fragments of the Cas12a variants were synthesized by Tsingke Biotech (Nanjing, China) and cloned into the pUC57 vector with a T7 primer. The crRNAs were synthesized by GenScript (Nanjing, China), and their sequences are listed in Table S3.
Cas12a-mediated nucleic acid detection
The detection assays were performed as previously reported, with minor modifications48. Each 20 μL detection assay contained 200 ng Cas12a protein, 25 pM ssDNA FQ probe sensor, 50 nM crRNA and 10 ng of target dsDNA in reaction buffer (100 mM NaCl, 50 mM Tris-HCl, 100 µg/mL BSA, pH 7.9) supplemented with 10 mM MgCl2 or MnSO4, and was incubated at 37 °C until detection. A PerkinElmer EnSpire reader with excitation at 485 nm and emission at 520 nm was used for fluorescence detection. The divalent metal ion preference screen was performed as previously described32. In brief, the CRISPR-Cas12 detection assay was supplemented with 10 mM CaCl2, CoCl2, CuSO4, NiSO4, MgSO4, MnSO4 or ZnSO4.
In vitro RNA and DNA binding assays
For RNA binding assays, Cas12a (100 nM) was incubated with Cy3-DNA (10 nM) at room temperature for 10 min in the reaction buffer. The reaction was quenched with glycerol loading buffer (10 mM Tris-HCl [pH 8.0], 10% glycerol). Reaction products were resolved by 12% PAGE and visualized by phosphorimaging (GE Healthcare).
For DNA binding assays, Cas12a protein was first complexed with crRNA at a 1:2 ratio at room temperature for 10 min in reaction buffer. The Cas12a complex (100 nM) was then incubated with annealed FAM-DNA (25 nM) for 10 min at room temperature. The reaction was quenched with glycerol loading buffer (10 mM Tris-HCl [pH 8.0], 10% glycerol). Reaction products were resolved by 12% PAGE and visualized by phosphorimaging (GE Healthcare).
PAM preference assay
Six short dsDNA target arrays were constructed by annealing primer pairs carrying 256 different PAM sequences in individual wells. The arrays target EMX1 site1, DNMT1 site1, FANCF site1, MerS site1, eGFP site1 and eGFP site3 (Table S4). Next, as in the nucleic acid detection assay, Cas12a protein (200 ng), ssDNA FQ probe sensor (25 pM), crRNA (50 nM) and short target dsDNA (8.5 nM) were mixed in reaction buffer supplemented with 10 mM MgCl2 (EvCas12a_2 and RspCas12a_2) or MnSO4 (AmCas12a, CAGCas12a and RbrCas12a_1) and incubated at 37 °C. A ViiA 7 Real-Time PCR system was used for fluorescence tracing. Each detection plate included three replicates of dsDNA with a TTTG PAM as the fluorescence correction.
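The figure of 256 PAM sequences is consistent with every possible 4-nucleotide combination of A, C, G and T (4^4 = 256). The sketch below enumerates them under that assumption; the text does not state the PAM length explicitly, and the variable names are illustrative.

```python
from itertools import product

BASES = "ACGT"
PAM_LENGTH = 4  # Assumption: inferred from 4**4 == 256, not stated in the text.

# All 4-nt PAM sequences in lexicographic order: AAAA, AAAC, ..., TTTT.
pams = ["".join(bases) for bases in product(BASES, repeat=PAM_LENGTH)]

# Position of the TTTG PAM used as the fluorescence correction.
tttg_index = pams.index("TTTG")
```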
Phylogenetic analysis
The phylogenetic tree in Fig. 2a was constructed from a dataset of 87 sequences, including 30 Cas9 proteins, 43 Cas12 proteins and 14 Cas13 proteins. The tree in Fig. 2f was constructed from 300 Cas12a variants together with FnCas12a, LbCas12a and AsCas12a. The sequences were aligned with MAFFT-linsi (v7.480)49. Phylogenetic trees were constructed with FastTree50 using default parameters and annotated with iTOL51.
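The alignment and tree-building steps above could be scripted roughly as follows. This sketch assumes that the `mafft-linsi` and `FastTree` executables are on the PATH under those names (some installations call the latter `fasttree`) and that both write their result to standard output. The file names and helper functions are hypothetical, and nothing is executed until `run_to_file` is called.

```python
import subprocess


def alignment_command(input_fasta):
    """MAFFT L-INS-i alignment of a protein FASTA file; output goes to stdout."""
    return ["mafft-linsi", input_fasta]


def tree_command(alignment_fasta):
    """FastTree with default parameters (protein mode); Newick tree goes to stdout."""
    return ["FastTree", alignment_fasta]


def run_to_file(command, output_path):
    """Run a command and capture its standard output in output_path."""
    with open(output_path, "w") as handle:
        subprocess.run(command, stdout=handle, check=True)


# Hypothetical file names, for illustration only.
align_cmd = alignment_command("cas_proteins.fasta")
tree_cmd = tree_command("cas_proteins.aln.fasta")
# run_to_file(align_cmd, "cas_proteins.aln.fasta")
# run_to_file(tree_cmd, "cas_proteins.nwk")
```

The resulting Newick file is the kind of input that iTOL accepts for annotation.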
Targeted deep sequencing
HEK293T cells were obtained from the Cell Bank/Stem Cell Bank, Chinese Academy of Sciences, and cultured in Dulbecco's modified Eagle's medium (GIBCO) supplemented with 10% fetal calf serum (v/v) (Gemini) and 1% penicillin–streptomycin at 37 °C with 5% CO2. For plasmid transfection, cells were seeded in 24-well plates in three biological replicates and transfected with 1.2 μg plasmids (900 ng editor and 300 ng sgRNA) per well when they reached approximately 70-90% confluency. Transfections were carried out with EZ Trans reagent (Life-iLab; Cat. No.: AC04L091) according to the manufacturer's protocol. Three days after transfection, cells were harvested for deep sequencing. Target sites were amplified from extracted genomic DNA using Phanta® Max Super-Fidelity DNA Polymerase (Vazyme). PCR products with different barcodes were pooled for deep sequencing on the Illumina HiSeq X Ten platform (2 × 150 PE) by Annoroad Gene Technology (Beijing, China). Different experimental conditions were distinguished by barcodes, and experimental replicates were included in different pools. Sequencing reads were demultiplexed using AdapterRemoval (version 2.2.2), and paired-end reads with an overlapping alignment of 11 bp or more were merged into a single consensus read. All processed reads were then mapped to the target sequences using the BWA-MEM algorithm (BWA v0.7.16). Indel frequency was calculated as the number of indel-containing reads divided by the total number of mapped reads. The targets and primers used in this study are provided in Supplementary Table 5 and Supplementary Table 6.
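The indel frequency defined above (indel-containing reads divided by total mapped reads) can be illustrated with a short sketch that works on the CIGAR strings of BWA-MEM alignments. The text does not say how indel-containing reads were identified. Here a read is counted if its CIGAR contains any insertion (I) or deletion (D) operation, which is an assumption; a real analysis would often restrict this to a window around the cut site. The function names and example CIGAR strings are illustrative.

```python
import re

# Matches an insertion or deletion operation in a SAM CIGAR string, e.g. "3D".
INDEL_OPERATION = re.compile(r"\d+[ID]")


def has_indel(cigar):
    """Return True if the alignment contains an insertion or deletion."""
    return bool(INDEL_OPERATION.search(cigar))


def indel_frequency(cigars):
    """Indel-containing reads divided by total mapped reads.

    Unmapped reads, whose CIGAR is "*" in SAM, are excluded from the total.
    """
    mapped = [cigar for cigar in cigars if cigar != "*"]
    if not mapped:
        return 0.0
    return sum(has_indel(cigar) for cigar in mapped) / len(mapped)


# Toy example: four mapped reads (two with indels) and one unmapped read.
cigars = ["150M", "70M3D77M", "60M2I88M", "150M", "*"]
frequency = indel_frequency(cigars)
```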
Reconstruction of AmCas12a-crRNA complex
AmCas12a was expressed and purified as described above and, for complex preparation, further purified on a size exclusion column (Superdex 200 Increase 10/300, GE Healthcare) in SEC buffer 1 (10 mM Tris-HCl, pH 7.5, 500 mM NaCl). The crRNA was diluted to 100 μM in refolding buffer (50 mM KCl, 5 mM MgCl2) and refolded at 72 ℃ for 5 min. The AmCas12a-crRNA binary complex was reconstituted by incubating 25 μM AmCas12a and 30 μM crRNA for 30 min at room temperature in a total volume of 450 μL assembly buffer (10 mM Tris-HCl, pH 7.5, 500 mM NaCl, 10 mM MgCl2). Subsequently, the mixture was purified by size exclusion chromatography in SEC buffer 2 (10 mM Tris-HCl, pH 7.5, 500 mM NaCl, 1 mM MgCl2). The purified complex was concentrated to 2 mg/mL, flash-frozen and stored at -80 ℃.
Cryo-EM sample preparation and data collection
Sample vitrification was performed using a Vitrobot Mark IV (Thermo Fisher) operating at 4 ℃ and 100% humidity. A 4 μL sample was applied to holey amorphous nickel–titanium alloy foil (ANTA foil 1.2/1.3) that had been glow-discharged for 30 s. The grids were blotted for 4 s at a ‘blot force’ of -2 with standard Vitrobot filter paper (Ted Pella) and then plunge-frozen in liquid ethane. Cryo-EM data were collected using EPU on a Titan Krios electron microscope operated at 300 kV and equipped with a Falcon4 direct electron detector and a Quantum energy filter. Micrographs were recorded in counting mode at a nominal magnification of 165000×, resulting in a physical pixel size of 0.74 Å per pixel. The defocus was set between -0.6 μm and -1.8 μm. Each movie stack received a total accumulated dose of 46.73 electrons per Å2, fractionated into 32 frames. Further data collection parameters are shown in Supplementary information, Table S7.
Image processing and 3D reconstruction
The raw dose-fractionated image stacks were 2× Fourier binned, aligned, dose-weighted and summed using MotionCor2.52 CTF estimation, blob particle picking, reference-free 2D classification, initial model generation, final 3D refinement and local resolution estimation were performed in cryoSPARC.53 The details of data processing are summarized in Supplementary information, Fig. S6 and Table S7.
Model building and refinement
The initial protein model was generated using AlphaFold2 and manually revised in UCSF-Chimera and Coot.26,54,55 The crRNA was manually built in Coot based on the cryo-EM density. The complete model was refined against the EM map in real space using PHENIX with secondary structure and geometry restraints.56 The final model was validated with the PHENIX software package. The structural validation details for the final model are summarized in Supplementary information, Table S7.
RPA and fluorescence detection
The RPA assay was performed with a GenDx ERA Kit (Suzhou GenDx Biotech, China). Following the kit manual, the 50 µl RPA reaction contained 2 µl DNA template, 2.5 µl forward primer (10 µM), 2.5 µl reverse primer (10 µM), 10 µl ERA basic buffer, 20 µl reaction buffer and 2 µl activator, with ddH2O added to the final volume. Three microliters of the RPA reaction product was transferred to the Cas12a reaction. The 20 μL Cas12a reaction additionally contained 100 ng Cas12a protein, 25 pM ssDNA FQ probe sensor and 50 nM crRNA in reaction buffer (100 mM NaCl, 50 mM Tris-HCl, 100 µg/mL BSA, pH 7.9) supplemented with 2.5 mM MnSO4, and was incubated at 37 °C until detection. A PerkinElmer EnSpire reader with excitation at 485 nm and emission at 520 nm was used for fluorescence detection.
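As a worked check of the RPA recipe above, the volume of ddH2O needed to reach the 50 µl final volume follows from the listed components. The sketch below only restates that arithmetic; the dictionary keys are descriptive labels, not kit terminology.

```python
# Component volumes of the 50 µl RPA reaction, in microliters, as listed in the text.
components_ul = {
    "DNA template": 2.0,
    "forward primer (10 µM)": 2.5,
    "reverse primer (10 µM)": 2.5,
    "ERA basic buffer": 10.0,
    "reaction buffer": 20.0,
    "activator": 2.0,
}

FINAL_VOLUME_UL = 50.0

# Volume already accounted for by the listed components.
listed_volume_ul = sum(components_ul.values())

# Remaining volume to be made up with ddH2O.
water_ul = FINAL_VOLUME_UL - listed_volume_ul
```

The listed components add up to 39 µl, so 11 µl of ddH2O completes the reaction.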
Quantification and statistical analysis
All values in the text and figures are presented as mean ± SEM of independent experiments with the given n. For image analysis, images were collected from at least three independent experiments. Graphs were compiled and statistical analyses were performed with Prism software (GraphPad) and Excel. Statistical significance was evaluated with a two-tailed unpaired t-test when comparing two groups. Differences among more than two groups were evaluated using one-way analysis of variance (ANOVA). Statistical details, including sample sizes (n), are indicated in the figures and legends.
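For reference, the mean ± SEM summary used throughout can be computed as in the sketch below, where SEM is the sample standard deviation divided by the square root of n. The analyses in the study were run in Prism and Excel; this standard-library version and its toy replicate values are only an illustration.

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, SEM), with SEM = sample standard deviation / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("SEM requires at least two independent measurements")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem


# Hypothetical readings from three independent experiments.
mean, sem = mean_sem([10.0, 12.0, 14.0])
```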