Repositioning CRISPR-Cas systems from prokaryotes to mammalian cells gave a major acceleration to genome editing applications in the clinic1. Yet, this technology is still limited by major constraints, mainly related to the reduced number of CRISPR-Cas systems active in mammalian cells which hardly respond to the complexity of gene therapy applications. PAM sequences, necessary for nuclease recognition and activity, are key in this context, as they dictate the compatibility of each Cas tool towards specific genomic target sites. Molecular engineering significantly improved Cas9 properties for genome editing including relaxation of PAM requirements2, but this approach impacts the activity of the nucleases and still does not respond to all PAM sequences needed to tackle disease causing mutations3. As a substantial number of yet unexplored CRISPR-Cas systems can still be found in microbial genomes4, we propose that tailor-made genetic tools for specific applications should be identified by searching for the most suitable PAM-Cas9 combination in prokaryotic systems rather than engineering a limited number of already characterized nucleases. However, the current major limitation toward efficiently navigating through the natural PAM-Cas9 diversity is the pre-determination of PAM requirements of novel Cas9 genes that can instead be easily identified by whole-gene nucleotide homology.
Here we demonstrate that the interrogation of massive metagenomic datasets combined with a newly developed computational method allows the identification of a vast number of unreported CRISPR-Cas loci and their respective PAM requirements (pipeline schematized in Fig. 1a). We focused our search on Type II systems since their simplicity facilitates their biotechnological exploitation4. From 825,698 bacterial and archaeal genomes reconstructed via metagenomic assembly of human, host-associated, and non-host associated environmental microbiomes (see Methods) (Blanco-Miguez et al. study related doc) and 257,670 genomes from microbial isolates retrieved from the NCBI database, we identified 92,140 CRISPR-Cas9 loci. Cas9 proteins were clustered at multiple sequence identity levels (from 100–95%) and the PAM prediction analysis was performed for each level to identify the best approach for PAM prediction accuracy (clustering at 98% nucleotide identity, see Methods). To identify protospacer flanking sequences, 613,478 unique spacers were aligned to phage genomes of the human microbiome: 142,809 from the Gut Phage Database5, 189,680 from the Metagenomic Gut Viral catalog6 and 45,872 from de-novo assembled gut phages from highly enriched viromes as profiled via ViromeQC7 (see Methods). Only full-length, near-perfect matches were retained (at most 4 nucleotide variations), resulting in a total of 39,109,402 putative protospacers. Cas9 clusters with less than 10 mapped spacers were discarded to retain only highly reliable PAM predictions. Upstream and downstream sequences flanking the matches, up to 30 nt, were retrieved. For each Cas9 cluster, sequences flanking the same spacer were realigned and the multiple sequence alignment was collapsed into a single consensus flanking sequence, to normalize the match counts. Nucleotide frequencies in the consensus flanking sequences were computed and represented as sequence logos. A PAM was predicted for a Cas9 cluster if there was at least one conserved position in either the upstream or the downstream flanking sequence. In total, for the 98% identity clustering, we obtained PAM predictions for 2,546 out of 2,779 Cas9 clusters (representing a total of 61,095 Cas9 sequences) with more than 10 mapped spacers (91.6%).
To validate our approach and predictions, we then searched in our dataset for gene sequences coding for proteins with high sequence identity (> 98%) to previously characterized Cas9s: SpCas98, SaCas99, St1Cas910, St3Cas911 and SmCas912. For these Cas9s we obtained sequence predictions corresponding to the described PAMs (Fig. 1b). We further tested our method by cross checking the PAM predictions obtained with our pipeline with the sequences experimentally identified and recently reported and characterized by Gasiunas et al 13. Of the 79 Cas9s reported, 21 could be used for the evaluation here as they have a close ortholog in our dataset (> 98% identity), and for them we confirm the accuracy of our prediction strategy by obtaining PAM logos with high identity (assessed by Jensen-Shannon distance on nucleotide frequencies, see Methods) with the sequences determined experimentally (Fig. 1c and Supplementary Fig. 1). Overall, 85% of PAM predictions generated by our method are correct and the remaining 15% are partial predictions with at least one base correctly identified. Our method exhibits a much higher prediction accuracy compared to Spacer2PAM14, the best PAM prediction method reported so far (45% correct predictions, 55% partial predictions). Based on the median PAM distance between these 21 PAMs and our predictions, we determined that the Cas9 clustering at 98% identity generates slightly better PAM predictions than the other clustering levels, leading us to choose this clustering for subsequent analyses.
To further test experimentally the reliability and potential of the PAM prediction pipeline in expanding the Cas9 toolbox from our databank, we searched for Cas9 candidates using parameters favoring the identification of functionally active enzymes (with preserved domain structures and located in complete CRISPR-Cas loci) and with reduced molecular size (< 1,100 amino acids), thus potentially more convenient for genome editing applications. We identified four Cas9 never described before from poorly characterized species (Supplementary Fig. 2) and predicted their PAM logos which were subsequently experimentally validated through an in vitro assay15. Results demonstrated a very close identity between in silico and in vitro results as indicated by the small distance (less than 2 bits for 3 out of 4 Cas9 variants) between predicted and experimentally determined PAMs (Fig. 1d-e), thus further confirming the accuracy and the potential of this PAM prediction pipeline. Overall, our method allows PAM prediction for the vast majority of Cas9 proteins identified in our databank with 10 or more mapped spacers, across all Cas9 subtypes (93.6% for A, 93.0% for B and 87.9% for C) (Fig. 1f).
We then applied our PAM predictor to the metagenomically extended set of 2,546 Cas9 protein families (98% identity clustering) to identify all PAM requirements and explore whether specific PAM clusters may exist. Hierarchical clustering on pairwise distance of the predicted PAMs retrieved 32 clusters with at least 20 members (see Methods). For each PAM cluster, a consensus PAM was generated (Fig. 2a). Interestingly, the most prevalent PAM sequences represent only a small fraction of all possible PAMs. Therefore, even though the PAM variability is high for Type II Cas916, only definite combinations of nucleotides were identified.
We further evaluated whether there might be an association between the PAM clusters identified in Fig. 2a and specific Cas9 subtypes. After generating a phylogenetic tree of the identified Cas9, we found that almost every PAM cluster is associated with specific clades of Cas9 proteins (Supplementary Fig. 3), thus suggesting a non-random organization of PAM recognition sequences. For instance, the most abundant PAM cluster (NGG) is found in a specific branch of type II-A and in almost all type II-B Cas9s.
A promising and simple therapeutic application of the CRISPR-Cas technology is the knock-out of mutations causing autosomal dominant genetic diseases. Nonetheless, allelic discrimination is hardly obtained through CRISPR-Cas due to various grades of sgRNA mismatch tolerance by Cas9. Conversely, since PAM sequences are stringent requirements for Cas9 activity, targeting mutated alleles generating novel PAMs would allow a specific target separation between the mutated and the wild-type alleles. Consequently, a paramount application of our PAM prediction pipeline is the identification of novel Cas9s recognizing PAM sequences generated by pathogenic mutations to offer specific targeting options for the mutated allele with a highly secured allelic discrimination. By interrogating the ClinVar database17 for mutations corresponding to PAMs associated with Cas9s from our metagenomic analysis, we found that a large fraction of pathogenic mutations (98.6% of 89,751 substitutions and small indels with known mode of inheritance) are included in at least one of the identified PAMs, thus providing allelic discrimination, with 48.7% of them being autosomal dominant alterations (Fig. 2b). As a proof of concept for the potential of our PAM prediction method, we chose a specific dominant-negative mutation, the P23H mutation in the rhodopsin (RHO) gene18, which is the most common mutation causing RHO-dependent retinitis pigmentosa19. We identified PrCas9, a Cas9 found in an unclassified Proteobacteria species, which has a predicted PAM N5T corresponding to a P23H mutation in RHO (CGAAGT, wild-type sequence CGAAGG) and experimentally validated in vitro its PAM preferences (PAM NRVNRT, Fig. 2c and Supplementary Fig. 4). PrCas9 editing activity was first tested in an EGFP disruption assay to verify its activity in mammalian cells generating near 50% EGFP disrupted cells (Supplementary Fig. 5) and then towards RHO wild-type or carrying the P23H mutation. We obtained up to 15.8% InDels at the RHO P23H locus and the complete absence of indels in the wt sequence, thus demonstrating the efficacy of the selected Cas9 in targeting the RHO specific mutation in mammalian cells (Fig. 2d).
In conclusion, by interrogating an extended microbiome databank with an accurate computational pipeline, we identified a large variety of new Cas9 nucleases accompanied by their identified PAM requirements. This analysis revealed that PAM sequences follow defined nucleotide patterns which are associated with specific Cas9 subtypes and overlap with 98.6% of the pathogenic mutations reported in ClinVar17. The precise PAM prediction driven by a specific sequence-mutation query allows the identification of tailored Cas9s, such as PrCas9 targeting the P23H RHO mutation. This approach opens to the expansion of the genome editing toolbox with mutation-tailored nucleases and supports the strategy of an application-specific search for suitable natural prokaryotic genome editing tools requiring minimal or no engineering.