Tailored identification of Cas9 proteins with specific PAM requirements using a computational prediction pipeline

doi:10.21203/rs.3.rs-1652795/v1


Brief Communication

This work is licensed under a CC BY 4.0 License

Published in Nature Communications on 29 Oct, 2022. This is preprint Version 1.

The identification of the protospacer adjacent motif (PAM) sequences of Cas9 nucleases is crucial for their exploitation as genome editing tools. Here we interrogated a massively expanded dataset of metagenome and virome assemblies for accurate and comprehensive PAM predictions, in order to identify novel Cas9s selected for their PAM requirements. As a test bed we selectively isolated a Cas9 which uses a PAM sequence corresponding to a disease-causing mutation, P23H in the RHO gene. Our PAM prediction pipeline will be instrumental to generate a Cas9 nuclease repertoire responding to any PAM requirement leading towards a natural PAM-free genome editing toolbox.

Repositioning CRISPR-Cas systems from prokaryotes to mammalian cells has greatly accelerated genome editing applications in the clinic¹. Yet this technology is still limited by major constraints, mainly the small number of CRISPR-Cas systems active in mammalian cells, which can hardly meet the complexity of gene therapy applications. PAM sequences, necessary for nuclease recognition and activity, are key in this context, as they dictate the compatibility of each Cas tool with specific genomic target sites. Molecular engineering has significantly improved Cas9 properties for genome editing, including relaxation of PAM requirements², but this approach impacts the activity of the nucleases and still does not cover all PAM sequences needed to tackle disease-causing mutations³. As a substantial number of yet unexplored CRISPR-Cas systems can still be found in microbial genomes⁴, we propose that tailor-made genetic tools for specific applications should be identified by searching for the most suitable PAM-Cas9 combination in prokaryotic systems rather than by engineering a limited number of already characterized nucleases. However, although novel Cas9 genes can be easily identified by whole-gene nucleotide homology, their PAM requirements cannot readily be determined in advance, and this is currently the major limitation to efficiently navigating the natural PAM-Cas9 diversity.

Here we demonstrate that the interrogation of massive metagenomic datasets combined with a newly developed computational method allows the identification of a vast number of unreported CRISPR-Cas loci and their respective PAM requirements (pipeline schematized in Fig. 1a). We focused our search on Type II systems since their simplicity facilitates their biotechnological exploitation⁴. From 825,698 bacterial and archaeal genomes reconstructed via metagenomic assembly of human, host-associated, and non-host-associated environmental microbiomes (see Methods and the Blanco-Miguez et al. study, related doc) and 257,670 genomes from microbial isolates retrieved from the NCBI database, we identified 92,140 CRISPR-Cas9 loci. Cas9 proteins were clustered at multiple sequence identity levels (from 100% to 95%) and the PAM prediction analysis was performed for each level to identify the best approach for PAM prediction accuracy (clustering at 98% nucleotide identity, see Methods).

To identify protospacer flanking sequences, 613,478 unique spacers were aligned to phage genomes of the human microbiome: 142,809 from the Gut Phage Database⁵, 189,680 from the Metagenomic Gut Virus catalog⁶ and 45,872 from de novo assembled gut phages from highly enriched viromes as profiled via ViromeQC⁷ (see Methods). Only full-length, near-perfect matches were retained (at most 4 nucleotide variations), resulting in a total of 39,109,402 putative protospacers. Cas9 clusters with fewer than 10 mapped spacers were discarded to retain only highly reliable PAM predictions. Upstream and downstream sequences flanking the matches, up to 30 nt, were retrieved. For each Cas9 cluster, sequences flanking the same spacer were realigned and the multiple sequence alignment was collapsed into a single consensus flanking sequence, to normalize the match counts. Nucleotide frequencies in the consensus flanking sequences were computed and represented as sequence logos. A PAM was predicted for a Cas9 cluster if there was at least one conserved position in either the upstream or the downstream flanking sequence. In total, for the 98% identity clustering, we obtained PAM predictions for 2,546 out of 2,779 Cas9 clusters with at least 10 mapped spacers (91.6%), representing a total of 61,095 Cas9 sequences.

To validate our approach and predictions, we then searched our dataset for gene sequences coding for proteins with high sequence identity (> 98%) to previously characterized Cas9s: SpCas9⁸, SaCas9⁹, St1Cas9¹⁰, St3Cas9¹¹ and SmCas9¹². For these Cas9s we obtained sequence predictions corresponding to the described PAMs (Fig. 1b). We further tested our method by cross-checking the PAM predictions obtained with our pipeline against the sequences experimentally identified and recently characterized by Gasiunas et al.¹³. Of the 79 Cas9s reported, 21 could be used for the evaluation here as they have a close ortholog in our dataset (> 98% identity). For these we confirmed the accuracy of our prediction strategy by obtaining PAM logos with high identity (assessed by Jensen-Shannon distance on nucleotide frequencies, see Methods) to the sequences determined experimentally (Fig. 1c and Supplementary Fig. 1). Overall, 85% of PAM predictions generated by our method are correct and the remaining 15% are partial predictions with at least one base correctly identified. Our method exhibits a much higher prediction accuracy than Spacer2PAM¹⁴, the best PAM prediction method reported so far (45% correct predictions, 55% partial predictions). Based on the median PAM distance between these 21 PAMs and our predictions, we determined that the Cas9 clustering at 98% identity generates slightly better PAM predictions than the other clustering levels, leading us to choose this clustering for subsequent analyses.

To further test experimentally the reliability and potential of the PAM prediction pipeline in expanding the Cas9 toolbox from our databank, we searched for Cas9 candidates using parameters favoring the identification of functionally active enzymes (with preserved domain structures and located in complete CRISPR-Cas loci) and with reduced molecular size (< 1,100 amino acids), thus potentially more convenient for genome editing applications. We identified four previously undescribed Cas9s from poorly characterized species (Supplementary Fig. 2) and predicted their PAM logos, which were subsequently validated experimentally through an in vitro assay¹⁵. The in silico and in vitro results agreed closely, as indicated by the small distance (less than 2 bits for 3 out of 4 Cas9 variants) between predicted and experimentally determined PAMs (Fig. 1d-e), thus further confirming the accuracy and the potential of this PAM prediction pipeline. Overall, our method allows PAM prediction for the vast majority of Cas9 proteins identified in our databank with 10 or more mapped spacers, across all Cas9 subtypes (93.6% for A, 93.0% for B and 87.9% for C) (Fig. 1f).

We then applied our PAM predictor to the metagenomically extended set of 2,546 Cas9 protein families (98% identity clustering) to identify all PAM requirements and explore whether specific PAM clusters may exist. Hierarchical clustering on pairwise distance of the predicted PAMs retrieved 32 clusters with at least 20 members (see Methods). For each PAM cluster, a consensus PAM was generated (Fig. 2a). Interestingly, the most prevalent PAM sequences represent only a small fraction of all possible PAMs. Therefore, even though the PAM variability is high for Type II Cas9¹⁶, only definite combinations of nucleotides were identified.

We further evaluated whether there might be an association between the PAM clusters identified in Fig. 2a and specific Cas9 subtypes. After generating a phylogenetic tree of the identified Cas9, we found that almost every PAM cluster is associated with specific clades of Cas9 proteins (Supplementary Fig. 3), thus suggesting a non-random organization of PAM recognition sequences. For instance, the most abundant PAM cluster (NGG) is found in a specific branch of type II-A and in almost all type II-B Cas9s.

A promising and simple therapeutic application of the CRISPR-Cas technology is the knock-out of mutations causing autosomal dominant genetic diseases. Nonetheless, allelic discrimination is difficult to achieve with CRISPR-Cas because Cas9 tolerates sgRNA mismatches to varying degrees. Conversely, since PAM sequences are stringent requirements for Cas9 activity, targeting mutated alleles that generate novel PAMs would allow the mutated and wild-type alleles to be distinguished specifically. Consequently, a paramount application of our PAM prediction pipeline is the identification of novel Cas9s recognizing PAM sequences generated by pathogenic mutations, offering specific targeting options for the mutated allele with highly reliable allelic discrimination. By interrogating the ClinVar database¹⁷ for mutations corresponding to PAMs associated with Cas9s from our metagenomic analysis, we found that a large fraction of pathogenic mutations (98.6% of 89,751 substitutions and small indels with known mode of inheritance) are included in at least one of the identified PAMs, thus providing allelic discrimination, with 48.7% of them being autosomal dominant alterations (Fig. 2b). As a proof of concept for the potential of our PAM prediction method, we chose a specific dominant-negative mutation, the P23H mutation in the rhodopsin (RHO) gene¹⁸, which is the most common mutation causing RHO-dependent retinitis pigmentosa¹⁹. We identified PrCas9, a Cas9 found in an unclassified Proteobacteria species, which has a predicted PAM N₅T corresponding to the P23H mutation in RHO (CGAAGT, wild-type sequence CGAAGG), and experimentally validated its PAM preferences in vitro (PAM NRVNRT, Fig. 2c and Supplementary Fig. 4). PrCas9 editing activity was first tested in an EGFP disruption assay to verify its activity in mammalian cells, generating nearly 50% EGFP-disrupted cells (Supplementary Fig. 5), and then towards RHO, either wild-type or carrying the P23H mutation. We obtained up to 15.8% InDels at the RHO P23H locus and a complete absence of InDels in the wild-type sequence, thus demonstrating the efficacy of the selected Cas9 in targeting the specific RHO mutation in mammalian cells (Fig. 2d).

In conclusion, by interrogating an extended microbiome databank with an accurate computational pipeline, we identified a large variety of new Cas9 nucleases together with their PAM requirements. This analysis revealed that PAM sequences follow defined nucleotide patterns which are associated with specific Cas9 subtypes and overlap with 98.6% of the pathogenic mutations reported in ClinVar¹⁷. The precise PAM prediction driven by a specific sequence-mutation query allows the identification of tailored Cas9s, such as PrCas9 targeting the P23H RHO mutation. This approach opens the way to expanding the genome editing toolbox with mutation-tailored nucleases and supports the strategy of an application-specific search for suitable natural prokaryotic genome editing tools requiring minimal or no engineering.

Methods

Catalog of reference and metagenomic-assembled genomes

The catalog of bacterial and archaeal genomic sequences used in this work was retrieved from: (i) 257,670 publicly available isolated sequences from the NCBI database²⁰ (available as of January 2021), (ii) 771,529 metagenome-assembled genomes (MAGs) from the Blanco-Miguez et al. study (see related doc), and (iii) 54,169 additional MAGs obtained with a validated assembly-based pipeline similarly to Pasolli et al.²¹. For retrieving these 54,169 additional MAGs, 8,487 metagenomic samples (Supplementary Table 1) were assembled using metaSPAdes²² if paired-end metagenomes were available, and MEGAHIT²³ otherwise. In both cases, default parameters were used. Contigs longer than 1,500 nucleotides were binned into MAGs using MetaBAT2²⁴.

Viral genomes retrieval from highly enriched viromes

A total of 45,872 viral genomes were metagenomically assembled from 3,044 human gut virome datasets as described previously²⁶. The efficacy of viral enrichment in each virome was evaluated with ViromeQC⁷. A total of 255 samples had an enrichment higher than 50X and were retained as highly viral samples. Reads were preprocessed with TrimGalore (version 0.4.4)²⁷ to remove low-quality and short reads (parameters: --stringency 5 --length 75 --quality 20 --max_n 2 --trim-n). Reads aligning to the human genome hg19 were also removed with Bowtie2 (version 2.4.1)²⁸. High-quality reads were assembled into contigs with metaSPAdes (version 3.10.1)²² (k-mer sizes: -k 21,33,55,77,99,127), or MEGAHIT (version 1.1.1)²³.

To reduce non-viral contaminants, we removed contigs that mapped to microbial genomes by using the collection of metagenome-assembled genomes from Pasolli et al.²¹. Only contigs that were a) longer than 1,500 bp; b) found within the same microbial species-level genome bin in fewer than 30 metagenomes; and c) found in the unbinned assembled fraction of more than 20 metagenomes, were retained. Contigs from i) the remaining non-highly enriched viromes, and ii) the human gut metagenomes used in Pasolli et al.²¹, and that were similar to a potentially highly enriched viral genome, were also mapped against the unbinned contigs of Pasolli et al. with mash (version 2.0)²⁹. Contigs with a distance lower than 10% (p-value ≤ 0.05) were retained. Finally, we selected 699 complete viral genomes from RefSeq, release 99³⁰, by selecting genomes that could be found in at least 20 samples within the unbinned contigs of Pasolli et al.²¹. All mappings were performed with blastn (version 2.6.0)³¹ (identity > 80%, alignment length > 1,000 bp). Contigs were clustered at 95% identity with VSEARCH³², with each cluster needing to contain at least one contig originating from highly enriched viromes.

PAM prediction

CRISPRCasTyper (version 1.5.0, default parameters)³³ was used to identify 131,941 CRISPR-Cas loci. Loci containing Cas9 proteins shorter than 950 aa were excluded from the analysis. The resulting 92,140 Cas9 proteins were clustered at 100, 99, 98, 97, 96 and 95% identity using UCLUST (version 11.0.667)³⁴ resulting in 27,062, 14,332, 10,475, 8,568, 7,538, and 6,898 clusters respectively.

In total, 613,478 spacers were retrieved from CRISPR arrays and were aligned to 366,233 viral genomes (142,809 from Gut Phage Database⁵, 189,680 from Metagenomic Gut Virus catalog⁶ and 45,872 from de novo assembled gut phages from highly enriched viromes) using blastn (version 2.5.0)³¹ to identify putative protospacers. Matches with more than 4 mismatches or gaps were filtered out. For each Cas9 clustering level, clusters with fewer than 10 mapped spacers were discarded, resulting in 7,177 (26.52%), 3,908 (27.27%), 2,779 (26.53%), 2,169 (25.32%), 1,814 (24.06%), and 1,594 (23.11%) clusters. Since the orientation of CRISPR arrays is unknown, both upstream and downstream flanking sequences, up to 30 nt, were retrieved for each putative protospacer. For each Cas9 cluster, protospacers and their flanking sequences, found using the same spacer, were aligned to each other using MUSCLE (version 3.8.31)³⁴ and the alignment was collapsed into a single consensus sequence by taking the most frequent base at each position and discarding columns composed mostly (> 50%) of gaps. Spacers were aligned exactly to the consensus sequence to define up- and downstream regions, which were then used to compute nucleotide frequencies and generate sequence logos using Logomaker (version 0.8)³⁵.
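The consensus-collapse and frequency steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `collapse_alignment` and `position_frequencies` are hypothetical, gaps are assumed to be written as `-`, gap characters are assumed to be ignored when picking the most frequent base, and ties are broken by first occurrence.

```python
from collections import Counter


def collapse_alignment(aligned, gap="-", max_gap_fraction=0.5):
    """Collapse equal-length aligned sequences into one consensus string.

    Columns made mostly of gaps (> max_gap_fraction) are discarded;
    otherwise the most frequent non-gap base is kept (an assumption).
    """
    consensus = []
    for column in zip(*aligned):
        gap_fraction = column.count(gap) / len(column)
        if gap_fraction > max_gap_fraction:
            continue  # column composed mostly of gaps: drop it
        counts = Counter(base for base in column if base != gap)
        consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


def position_frequencies(sequences, alphabet="ACGT"):
    """Per-position nucleotide frequencies over a set of flanking sequences."""
    length = min(len(seq) for seq in sequences)
    frequencies = []
    for i in range(length):
        counts = Counter(seq[i] for seq in sequences)
        total = sum(counts[base] for base in alphabet)
        frequencies.append(
            {base: (counts[base] / total if total else 0.0) for base in alphabet}
        )
    return frequencies


if __name__ == "__main__":
    # Toy alignment of protospacer flanks matched by the same spacer.
    flanks = ["ACGT-A", "ACGTTA", "AC-T-A", "ATGT-A"]
    print(collapse_alignment(flanks))
    # One consensus flank per spacer feeds the frequency table.
    print(position_frequencies(["AGG", "TGG", "CGG", "GGG"]))
```

Collapsing all matches of one spacer into a single consensus means that a spacer hitting many near-identical phage genomes contributes once to the frequency table, which is the normalization of match counts mentioned in the main text.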

For each Cas9 cluster, a PAM was considered predicted if there was at least one highly conserved base in only one of the two flanking regions (the PAM can be either upstream or downstream, not both). We defined a highly conserved base as a position in the logo with more information than the maximum between 1 bit and the third quartile plus 1.5 times the interquartile range of the distribution of information in both flanking sequences (i.e. the conserved position is an outlier with at least 1 bit of information). For each clustering level, a PAM was predicted for 6,758 (94.16%), 3,622 (92.68%), 2,546 (91.62%), 1,944 (89.63%), 1,601 (88.26%), and 1,387 (87.01%) clusters with at least 10 mapped spacers.
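The conservation rule can be sketched as follows. This is an illustration under stated assumptions rather than the published code: information content is taken as 2 bits minus the Shannon entropy of each logo position (no small-sample correction), quartiles are computed with the `inclusive` method of Python's `statistics.quantiles`, and the names `information_bits`, `conserved_positions` and `predict_pam_side` are hypothetical.

```python
import math
import statistics


def information_bits(freqs):
    """Information content (bits) of one DNA logo position: 2 - entropy."""
    entropy = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return 2.0 - entropy


def conserved_positions(upstream_info, downstream_info, min_bits=1.0):
    """Return conserved indices in each flank and the threshold used.

    Threshold = max(1 bit, Q3 + 1.5 * IQR), with quartiles taken over
    the information values of both flanks pooled together.
    """
    pooled = list(upstream_info) + list(downstream_info)
    q1, _, q3 = statistics.quantiles(pooled, n=4, method="inclusive")
    threshold = max(min_bits, q3 + 1.5 * (q3 - q1))
    up = [i for i, bits in enumerate(upstream_info) if bits > threshold]
    down = [i for i, bits in enumerate(downstream_info) if bits > threshold]
    return up, down, threshold


def predict_pam_side(upstream_info, downstream_info):
    """Name the flank carrying the PAM, or None if no call can be made."""
    up, down, _ = conserved_positions(upstream_info, downstream_info)
    if up and not down:
        return "upstream"
    if down and not up:
        return "downstream"
    return None  # no conserved base, or conserved bases on both sides


if __name__ == "__main__":
    upstream = [0.05] * 10
    downstream = [0.05, 1.9, 1.9] + [0.05] * 7  # NGG-like signal
    print(predict_pam_side(upstream, downstream))
```

Pooling both flanks when computing the quartiles makes the background of uninformative positions define what counts as an outlier, while the 1-bit floor prevents a call when every position is only weakly informative.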

tracrRNA identification

tracrRNA sequences of the novel Cas9 orthologs were identified computationally, searching for sequences starting with a putative anti-repeat and ending with a Rho-independent transcription terminator (RIT). Putative anti-repeats were identified aligning CRISPR repeats to sequences flanking the CRISPR-Cas locus (up to 1,000 nt) using blastn (version 2.5.0)³¹ and RITs were predicted using RNIE³⁶.

In vitro PAM determination

In vitro PAM evaluation of the novel Cas9 orthologs was performed according to the protocol from Karvelis et al.¹⁵. In brief, for each Cas9 ortholog the human codon-optimized version of its coding sequence was ordered as a synthetic construct (Genscript) and cloned into an expression vector for in vitro transcription and translation (IVT) (pT7-N-His-GST, Thermo Fisher Scientific). The reaction was performed according to the manufacturer's protocol (1-Step Human High-Yield Mini IVT Kit, Thermo Fisher Scientific). The Cas9-guide RNA RNP complex was assembled by combining 20 µL of the supernatant containing soluble Cas9 protein with 1 µL of RiboLock RNase Inhibitor (Thermo Fisher Scientific) and 2 µg of guide RNA. The resulting Cas9-guide RNA complex was used to digest 1 µg of a plasmid (p11-lacY-wtx backbone, Addgene #69056) containing an 8-nucleotide randomized PAM sequence flanking the gRNA target. The digestion reaction was incubated for 1 hour at 37°C.

A double-stranded DNA adapter was then ligated to the DNA ends generated by the targeted Cas9 cleavage and the final ligation product was purified using a GeneJet PCR Purification Kit (Thermo Fisher Scientific).

One round of a two-step PCR (Phusion HF DNA polymerase - Thermo Fisher Scientific) was performed to enrich the sequences that were cut using a set of forward primers annealing on the adapter and a reverse primer designed on the plasmid backbone downstream of the PAM (Supplementary Table 2). A second round of PCR was performed to attach the Illumina indexes and adapters. PCR products were purified using Agencourt AMPure beads in a 1:0.8 ratio.

The generated library was analyzed with a 71-bp single read sequencing, using a flow cell v2 micro, on an Illumina MiSeq sequencer.

PAM sequences were extracted from Illumina MiSeq reads and used to generate PAM sequence logos. PAM heatmaps³⁷ were used to display PAM enrichment, computed dividing the frequency of PAM sequences in the cleaved library by the frequency of the same sequences in a control uncleaved library.
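The enrichment calculation amounts to a ratio of relative frequencies, as in the following hedged sketch. The function name `pam_enrichment` and the toy read lists are hypothetical, and PAMs absent from the control library are simply skipped here to avoid division by zero, which is an assumption rather than a documented choice.

```python
from collections import Counter


def pam_enrichment(cleaved_pams, control_pams):
    """Ratio of PAM frequency in the cleaved library to the control library."""
    cleaved_counts = Counter(cleaved_pams)
    control_counts = Counter(control_pams)
    total_cleaved = sum(cleaved_counts.values())
    total_control = sum(control_counts.values())
    enrichment = {}
    for pam, control_n in control_counts.items():
        control_freq = control_n / total_control
        cleaved_freq = cleaved_counts[pam] / total_cleaved
        enrichment[pam] = cleaved_freq / control_freq
    return enrichment


if __name__ == "__main__":
    # Toy 3-nt PAMs; the real library uses 8 randomized nucleotides.
    cleaved = ["AGG"] * 8 + ["ATT"] * 2
    control = ["AGG"] * 5 + ["ATT"] * 5
    for pam, ratio in sorted(pam_enrichment(cleaved, control).items()):
        print(pam, round(ratio, 2))
```

Values above 1 indicate PAMs over-represented among cleaved molecules relative to the uncleaved control, and values below 1 indicate depletion.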

PAM comparison and hierarchical clustering

Differences between PAM sequences were quantified using the Jensen-Shannon distance (defined as the square root of the Jensen-Shannon divergence)³⁸. PAM predictions resulting from the 98% identity Cas9 clustering showed the lowest median distance from the in vitro determined PAMs of Cas9 orthologs characterized by Gasiunas et al.¹³ and were therefore chosen for subsequent analyses. An all-to-all PAM prediction distance matrix was computed and hierarchical clustering was performed to generate PAM clusters, using usearch (version 11.0.667, parameters -cluster_aggd -id 0.6 -linkage avg)³⁴. Consensus PAMs for each cluster were generated using the protospacer flanking sequences of each cluster member.
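For reference, the Jensen-Shannon distance between two nucleotide distributions can be computed as in this minimal sketch. Base-2 logarithms are assumed, and averaging the per-position distances into a single PAM-to-PAM value in `pam_distance` is a hypothetical aggregation chosen for illustration; the text does not specify how positions were combined.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) of two discrete distributions."""
    mixture = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Kullback-Leibler divergence; zero-probability terms contribute 0.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, mixture) + 0.5 * kl(q, mixture)


def js_distance(p, q):
    """Jensen-Shannon distance: square root of the divergence."""
    return math.sqrt(max(js_divergence(p, q), 0.0))


def pam_distance(pam_a, pam_b):
    """Mean per-position JS distance between two PAM frequency matrices.

    Each PAM is a list of [A, C, G, T] frequency rows of equal length.
    Averaging over positions is an assumption of this sketch.
    """
    distances = [js_distance(row_a, row_b) for row_a, row_b in zip(pam_a, pam_b)]
    return sum(distances) / len(distances)


if __name__ == "__main__":
    ngg = [[0.25, 0.25, 0.25, 0.25], [0, 0, 1, 0], [0, 0, 1, 0]]
    nga = [[0.25, 0.25, 0.25, 0.25], [0, 0, 1, 0], [1, 0, 0, 0]]
    print(pam_distance(ngg, ngg), pam_distance(ngg, nga))
```

With base-2 logarithms the distance is bounded between 0 (identical distributions) and 1 (non-overlapping distributions), which makes values comparable across positions.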

PAM clusters association with Cas9 phylogenetic tree

Cas9 proteins with a predicted PAM (98% identity clustering) were aligned using mafft (version 7.490, with parameters --maxiterate 10)³⁹ and a phylogenetic tree was built using FastTree (version 2.1.11, with parameters -spr 4 -mlacc 2 -slownni)⁴⁰. Cas9 clades were defined using TreeCluster (version 1.0.3, with parameters -m max_clade)⁴¹ and a range of thresholds (0.3 to 4). Associations between PAM clusters and Cas9 clades were assessed using Fisher’s exact test, computing p-values by Monte Carlo simulation with 100,000 replicates and a 0.001 significance level.

Identification of PAM-matching mutations in ClinVar

Mutations in the ClinVar database (accessed March 6, 2022)¹⁷ were filtered to select single nucleotide variants and short InDels (10 or less nucleotides) annotated as pathogenic or likely pathogenic and associated with pathologies with known mode of inheritance, for a total of 89,751 mutations. To compute the fraction of mutations that can be targeted by at least a Cas9 in our databank with allelic discrimination, predicted PAM sequences resulting from the 98% identity Cas9 clustering were aligned exactly to wild-type and mutated alleles.
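The exact matching of a PAM to the wild-type and mutated alleles can be sketched with standard IUPAC nucleotide codes. The helper names `pam_matches` and `discriminates` are hypothetical; the example sequences are the RHO P23H site (CGAAGT) and its wild-type counterpart (CGAAGG) reported in the main text, tested against the predicted PAM N₅T (written `NNNNNT`) and the in vitro PAM NRVNRT.

```python
# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pam_matches(pam, site):
    """True if every base of `site` is allowed by the IUPAC code in `pam`."""
    pam, site = pam.upper(), site.upper()
    if len(pam) != len(site):
        return False
    return all(base in IUPAC[code] for code, base in zip(pam, site))


def discriminates(pam, mutant_site, wildtype_site):
    """True if the PAM is present on the mutated allele only."""
    return pam_matches(pam, mutant_site) and not pam_matches(pam, wildtype_site)


if __name__ == "__main__":
    mutant, wildtype = "CGAAGT", "CGAAGG"  # RHO P23H vs wild-type
    for pam in ("NNNNNT", "NRVNRT"):
        print(pam, discriminates(pam, mutant, wildtype))
```

A mutation counts as targetable with allelic discrimination when at least one predicted PAM satisfies this condition, that is, when the PAM is created by the mutation and absent from the wild-type sequence.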

Cell culture and InDels analysis

HEK293T/17 cells obtained from ATCC were cultured in DMEM supplemented with 10% fetal bovine serum, 2 mM L-Glutamine, 100 U/ml penicillin and 100 µg/ml streptomycin (Life Technologies) and incubated at 37°C and 5% CO₂ in a humidified atmosphere. Cells tested mycoplasma negative (PlasmoTest, Invivogen). For InDels analyses, cells were seeded in 24-well plates and transfected after 24 hours with 1,000 ng pX-PrCas9-sgRNA-RHO-P23H, 50 ng pCMV-TO-RHO-P23H or pCMV-TO-RHO-WT and 50 ng pEGFP-IRES-Puro using TransIT-LT1 transfection reagent (Mirus Bio) according to the manufacturer's instructions. At 48 hours post-transfection, cells were pool-selected with 1 µg/ml puromycin and collected after 72 hours. Genomic DNA was obtained from cell pellets using the QuickExtract DNA extraction solution (Lucigen) according to the manufacturer's instructions. The RHO P23 locus was amplified using the HOT FIREPol Multiplex Mix (Solis Biodyne) with primers RHO-TO-F (CAGTGATAGAGATCTCCCTATC) and RHO-int1-R (GAGATAGATGCGGGCTTCCA). PCR amplicons were purified using CleanNGS beads (CleanNA) and Sanger sequenced (Microsynth) using the RHO-TO-F primer. InDel levels were evaluated using TIDE⁴².

Plasmids

A pX330-derived plasmid was used to express the Cas9 orthologs in mammalian cells. Briefly, pX330 (Addgene) was modified by substituting SpCas9 and its sgRNA scaffold with the human codon-optimized coding sequence of the variant of interest and its sgRNA scaffold. The coding sequences of the Cas9 variants, human codon-optimized and modified, as described before, by the addition of an SV5 tag at the N-terminus and two nuclear localization signals (one at the N-terminus and one at the C-terminus), as well as the sgRNA scaffolds, were obtained as synthetic fragments from either Genscript or Genewiz. Spacer sequences were cloned into the pX-Cas9 plasmids as annealed DNA oligonucleotides containing a variable 20- or 24-nt spacer sequence using a double BsaI site present in the plasmid. The list of spacer sequences used in the EGFP disruption assay and in the evaluation of editing activity against the RHO P23H mutation is reported in Supplementary Table 3.

pCMV-TO-RHO-WT plasmid was obtained by cloning the human rhodopsin (RHO) gene into the pCDNA5/TO plasmid (Addgene). The hRHO gene was PCR-amplified using the primers RHO_gene_F (attaggatccAGAGTCATCCAGCTGGAGCCC) and RHO_gene_R (taatctcgagTGGGGTTTTTCCCATTCCCAGG) from genomic DNA extracted from HEK293T/17 cells using the Phusion high fidelity DNA Polymerase (ThermoFisher Scientific). The P23H mutation was further inserted by site-directed mutagenesis using primers mut-P23H-F (GTGTGGTACGCAGCCaCTTCGAGTACCCACAG) and mut-P23H-R (CTGTGGGTACTCGAAGtGGCTGCGTACCACAC) to generate pCMV-TO-RHO-P23H plasmid. All the oligonucleotides were purchased from Eurofins Genomics.

Data availability. The data that support the findings of this study are available from the corresponding author upon reasonable request.

Acknowledgements

We are grateful to Cereseto’s and Segata’s lab members for helpful discussion throughout the project. We thank the Next Generation Sequencing facility at the University of Trento for technical support.

This work was supported by the European Union’s Horizon 2020 innovation programme through the UPGRADE (Unlocking Precision Gene Therapy) project (grant agreement No 825825) to AC and by the European Research Council (ERC-STG project MetaPG) to NS; by MIUR ‘Futuro in Ricerca’ (grant No. RBFR13EWWI_001) to NS; and by the National Cancer Institute of the National Institutes of Health (1U01CA230551) to NS.

Author contributions

M.C., L.S. and N.S. developed the PAM prediction pipeline; M.D., E.P., E.V. and L.P. designed and performed the experiments; M.D., E.P., E.V., L.P., A.Ca. and M.C. collected and analyzed the data; A.C., N.S., M.C. and A.Ca. conceived and designed the study, wrote and edited the paper; A.C. and N.S. were responsible for the coordination of the study. All authors read, corrected, and approved the final manuscript.

Additional information

The authors declare competing financial interests: A.Ce. is a co-founder and holds stocks of Alia Therapeutics, a genome editing company. A.Ca. is a co-founder, holds stocks and is currently an employee of Alia Therapeutics. L.P. is an employee of Alia Therapeutics. A patent application has been filed covering certain aspects of the presented work.

References

1. Doudna, J. A. The promise and challenge of therapeutic genome editing. Nature 578, 229–236 (2020).
2. Christie, K. A. & Kleinstiver, B. P. Making the cut with PAMless CRISPR-Cas enzymes. Trends Genet. 37, 1053–1055 (2021).
3. Collias, D. & Beisel, C. L. CRISPR technologies and the search for the PAM-free nuclease. Nat. Commun. 12, 555 (2021).
4. Makarova, K. S. et al. Evolutionary classification of CRISPR-Cas systems: a burst of class 2 and derived variants. Nat. Rev. Microbiol. 18, 67–83 (2020).
5. Camarillo-Guerrero, L. F., Almeida, A., Rangel-Pineros, G., Finn, R. D. & Lawley, T. D. Massive expansion of human gut bacteriophage diversity. Cell 184, 1098–1109.e9 (2021).
6. Nayfach, S. et al. Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome. Nat. Microbiol. 6, 960–970 (2021).
7. Zolfo, M. et al. Detecting contamination in viromes using ViromeQC. Nat. Biotechnol. 37, 1408–1412 (2019).
8. Jinek, M. et al. A Programmable Dual-RNA–Guided DNA Endonuclease in Adaptive Bacterial Immunity. Science 337, 816–821 (2012).
9. Ran, F. A. et al. In vivo genome editing using Staphylococcus aureus Cas9. Nature 520, 186–191 (2015).
10. Garneau, J. E. et al. The CRISPR/Cas bacterial immune system cleaves bacteriophage and plasmid DNA. Nature 468, 67–71 (2010).
11. Horvath, P. et al. Diversity, Activity, and Evolution of CRISPR Loci in Streptococcus thermophilus. J. Bacteriol. 190, 1401–1412 (2008).
12. Shields, R. C. et al. Repurposing the Streptococcus mutans CRISPR-Cas9 System to Understand Essential Gene Function. PLOS Pathog. 16, e1008344 (2020).
13. Gasiunas, G. et al. A catalogue of biochemically diverse CRISPR-Cas9 orthologs. Nat. Commun. 11, 5512 (2020).
14. Rybnicky, G. A., Fackler, N. A., Karim, A. S., Köpke, M. & Jewett, M. C. Spacer2PAM: A computational framework to guide experimental determination of functional CRISPR-Cas system PAM sequences. Nucleic Acids Res. 50, 3523–3534 (2022).
15. Karvelis, T., Young, J. K. & Siksnys, V. Chapter Ten - A pipeline for characterization of novel Cas9 orthologs. in Methods in Enzymology (ed. Bailey, S.) vol. 616 219–240 (Academic Press, 2019).
16. Vink, J. N. A., Baijens, J. H. L. & Brouns, S. J. J. PAM-repeat associations and spacer selection preferences in single and co-occurring CRISPR-Cas systems. Genome Biol. 22, 281 (2021).
17. Landrum, M. J. et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 46, D1062–D1067 (2018).
18. Dryja, T. P., Hahn, L. B., Cowley, G. S., McGee, T. L. & Berson, E. L. Mutation spectrum of the rhodopsin gene among patients with autosomal dominant retinitis pigmentosa. Proc. Natl. Acad. Sci. U. S. A. 88, 9370–9374 (1991).
19. Hamel, C. Retinitis pigmentosa. Orphanet J. Rare Dis. 1, 40 (2006).
20. Benson, D. A. et al. GenBank. Nucleic Acids Res. 41, D36–D42 (2013).
21. Pasolli, E. et al. Extensive Unexplored Human Microbiome Diversity Revealed by Over 150,000 Genomes from Metagenomes Spanning Age, Geography, and Lifestyle. Cell 176, 649–662.e20 (2019).
22. Nurk, S., Meleshko, D., Korobeynikov, A. & Pevzner, P. A. metaSPAdes: a new versatile metagenomic assembler. Genome Res. 27, 824–834 (2017).
23. Li, D., Liu, C.-M., Luo, R., Sadakane, K. & Lam, T.-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics 31, 1674–1676 (2015).
24. Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
25. Parks, D. H., Imelfort, M., Skennerton, C. T., Hugenholtz, P. & Tyson, G. W. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 25, 1043–1055 (2015).
26. Karcher, N. et al. Genomic diversity and ecology of human-associated Akkermansia species in the gut microbiome revealed by extensive metagenomic assembly. Genome Biol. 22, 209 (2021).
27. Krueger, F., James, F., Ewels, P., Afyounian, E. & Schuster-Boeckler, B. FelixKrueger/TrimGalore: v0.6.7 - DOI via Zenodo. (Zenodo, 2021). doi:10.5281/zenodo.5127899.
28. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
29. Ondov, B. D. et al. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 132 (2016).
30. Brister, J. R., Ako-adjei, D., Bao, Y. & Blinkova, O. NCBI Viral Genomes Resource. Nucleic Acids Res. 43, D571–D577 (2015).
31. Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).
32. Rognes, T., Flouri, T., Nichols, B., Quince, C. & Mahé, F. VSEARCH: a versatile open source tool for metagenomics. PeerJ 4, e2584 (2016).
33. Russel, J., Pinilla-Redondo, R., Mayo-Muñoz, D., Shah, S. A. & Sørensen, S. J. CRISPRCasTyper: Automated Identification, Annotation, and Classification of CRISPR-Cas Loci. CRISPR J. 3, 462–469 (2020).
34. Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).
35. Tareen, A. & Kinney, J. B. Logomaker: beautiful sequence logos in Python. Bioinformatics 36, 2272–2274 (2020).
36. Gardner, P. P., Barquist, L., Bateman, A., Nawrocki, E. P. & Weinberg, Z. RNIE: genome-wide prediction of bacterial intrinsic terminators. Nucleic Acids Res. 39, 5845–5852 (2011).
37. Walton, R. T., Christie, K. A., Whittaker, M. N. & Kleinstiver, B. P. Unconstrained genome targeting with near-PAMless engineered CRISPR-Cas9 variants. Science 368, 290–296 (2020).
38. Nettling, M. et al. DiffLogo: a comparative visualization of sequence motifs. BMC Bioinformatics 16, 387 (2015).
39. Katoh, K. & Standley, D. M. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Mol. Biol. Evol. 30, 772–780 (2013).
40. Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE 5, e9490 (2010).
41. Balaban, M., Moshiri, N., Mai, U., Jia, X. & Mirarab, S. TreeCluster: Clustering biological sequences using phylogenetic trees. PLOS ONE 14, e0221068 (2019).
42. Brinkman, E. K., Chen, T., Amendola, M. & van Steensel, B. Easy quantitative assessment of genome editing by sequence trace decomposition. Nucleic Acids Res. 42, e168–e168 (2014).

CicianietalSupplementaryInformation.pdf
Supplementary Information
