CrisPam: SNP-derived PAM analysis web tool and human pathogenic SNPs database for CRISPR allele-specific targeting

Background: CRISPR is a promising technology for treating genetic conditions, so improving the safety and specificity of CRISPR-based treatments is essential. Although the guide RNA offers position-specific DNA targeting, it may tolerate small changes such as single-nucleotide polymorphisms (SNPs). An allele-specific targeting approach is therefore needed for the future treatment of heterozygous patients suffering from genetic conditions caused by a SNP. The SNP-derived PAM approach allows highly allele-specific DNA cleavage by relying on a protospacer adjacent motif (PAM) sequence that is present only at the target allele.

Description: Here we present CrisPam, a tool that detects SNP-derived PAMs for allele-specific targeting by the CRISPR/Cas system. For a given DNA sequence and its variations, the algorithm checks whether each reported PAM is generated. A result is successful when at least one PAM is generated by a SNP. The PAM is then part of the variant allele only, so the Cas protein can bind exclusively to the variant allele for gene editing while the wildtype allele remains unchanged.

Conclusion: CrisPam is available online for researchers and also offers access to CrisPamDB, a database that contains the CrisPam analysis for every reported pathogenic SNP in humans.

Background
The clustered regularly interspaced short palindromic repeat (CRISPR) system enables precise genome editing mediated by a single-guide RNA (sgRNA) that guides the CRISPR-associated (Cas) protein to the target DNA in the genome. Cas9, the catalytic unit of the CRISPR system, generates a double-strand break (DSB) in the DNA in the presence of a DNA:sgRNA match and a protospacer-adjacent motif (PAM) in immediate proximity to the target DNA1,2. The diverse Cas proteins, derived from different bacterial strains, differ in several properties such as PAM sequence, cleavage pattern and position, size, activity in mammalian cells, off-targets and substrate (DNA or RNA). The standard Cas protein has been modified to broaden its applications to base-editing3,4, transcription repression and activation5-7, epigenomic modifications8, visualization of genomic loci9 and DNA nicking10 (single-strand cleavage). When designing an experiment, the PAM sequence and the size of the chosen Cas should be taken into consideration: the presence of a PAM is a limiting factor in targeting unique loci, and the Cas size restricts the choice of delivery systems.

SNP-derived PAM
The CRISPR/Cas system can tolerate some mismatches between the CRISPR RNA (crRNA) and the target DNA. The bases at positions 8 to 13 at the 3′ end of the spacer (for type II Cas proteins), along with the first base at the 5′ end, are termed the seed sequence. Mismatches in the seed sequence are thought not to be tolerated and to abolish DNA cleavage. For pathogenic single-nucleotide polymorphisms (SNPs), however, previous studies have shown that targeting an allele caused by a SNP by choosing a gRNA sequence that contains the variant nucleotide is seemingly insufficient, resulting in a non-specific knockdown of both the mutant allele and the wildtype allele in some proportion11,12. A SNP-derived PAM approach overcomes this potential limitation by targeting the disease-causing allele while leaving the wildtype allele intact. This method dramatically increases the specificity of targeting the mutant allele alone by choosing a PAM sequence that is present only in the mutant sequence; in other words, the mutant SNP generates the PAM sequence12,13.
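The notion of a PAM that exists only in the mutant sequence can be illustrated with a minimal Python sketch. It is not the CrisPam source code: the helper names (`pam_regex`, `matches_pam`) and the three-nucleotide example site are hypothetical, and NGG is used only because it is the canonical SpCas9 PAM. PAMs are written in IUPAC degenerate codes, so each code is translated into a regular-expression character class before matching.

```python
import re

# IUPAC nucleotide codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def pam_regex(pam):
    """Compile a PAM written in IUPAC codes into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in pam.upper()))


def matches_pam(site, pam):
    """Return True if `site` (same length as the PAM) satisfies the PAM."""
    return len(site) == len(pam) and pam_regex(pam).fullmatch(site.upper()) is not None


# Hypothetical site: the wildtype allele reads TGT, the variant allele reads TGG.
wildtype_site = "TGT"
variant_site = "TGG"
pam = "NGG"  # canonical SpCas9 PAM

# The PAM is SNP-derived when it exists in the variant allele only.
snp_derived = matches_pam(variant_site, pam) and not matches_pam(wildtype_site, pam)
print(snp_derived)
```

In this toy case the wildtype allele offers no NGG at the site, so a Cas protein that requires NGG would be expected to recognise the variant allele only.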
When targeting a gene without a preference for a particular DNA cleavage location, almost any Cas protein is an option. However, when targeting a SNP in general, and when utilizing the SNP-derived PAM approach in particular, the selection of Cas is limited, mostly by the requirement that a PAM be present in proximity to the SNP or be generated by the SNP.

CrisPam
CrisPam is a Python program that scans DNA sequences for 30 candidate PAMs from 19 Cas proteins (Table 1). It obtains the data of a given SNP and tests whether the SNP generates a unique PAM sequence in the DNA of the mutant allele only. CrisPam thus generates a list of matching Cas proteins for targeting the pathogenic allele. Here we present a database of all known clinically significant pathogenic or likely pathogenic SNPs that generate a PAM. Furthermore, we developed a bioinformatics tool for researchers, available at http://CRISPR.tau.ac.il, to detect SNP-derived PAMs at their SNPs of interest regardless of taxonomy and clinical significance.

Due to technical limitations, Table 1 is only available as a download in the supplemental files section.

Implementation
The CrisPam tool is web-based, so no software installation is required. The CrisPam DB is a .xlsx file and can be opened in Excel. The CrisPam script is written in Python 3.6 and uses standard libraries (xml.etree.ElementTree, csv and time). Biopython is used in the web-based tool29.
Parsing and SNP data analysis
The guiding principle of the SNP-derived PAM concept is having a PAM present in the desired target allele. The following workflow detects unique PAMs generated by a SNP. Parameters are parsed from the data (wildtype sequence, mutation sequence, gene name and ID, SNP ID and chromosome) into a list of SNPs. The code analyses a given SNP by obtaining the DNA sequence upstream and downstream of the SNP, the wildtype nucleotide (reference nucleotide) and the variation nucleotide. The anti-sense strand is analysed as well, to detect unique PAMs generated on the complementary strand. 16 PAM sequences of 14 Cas proteins are scanned for in the DNA sequence (table 1). For each Cas, CrisPam searches for its PAM at the position of the SNP (figure 2). Some SNPs have more than one variation nucleotide, so CrisPam considers every variation of a SNP and scans each one of them. Once a PAM is found, it is accepted as a match only if it exists in a variation allele and not in the wildtype allele. More than one PAM may be generated for a given SNP; CrisPam therefore presents all the matches for that SNP. The suggested sgRNA sequence for each matching Cas is the 20-23 nt upstream or downstream of the PAM, according to the Cas type (type II or type V, respectively).
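The workflow above can be sketched in a few lines of Python. This is an illustrative reimplementation under stated assumptions, not the published CrisPam code: the function name `snp_derived_pams`, the tuple layout of the results and the three-entry PAM table (taken from the Figure 2 caption rather than from Table 1) are hypothetical, the spacer length is fixed at 20 nt although the text allows 20-23 nt, and positions are reported on the strand that was scanned.

```python
import re

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the anti-sense strand read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def snp_derived_pams(upstream, ref, alt, downstream, pams, spacer_len=20):
    """List PAMs that overlap the SNP and exist in the variant allele only.

    `pams` maps a Cas name to (IUPAC PAM, PAM side), where the side is
    "3prime" (type II, spacer lies 5' of the PAM) or "5prime" (type V).
    """
    wildtype = (upstream + ref + downstream).upper()
    variant = (upstream + alt + downstream).upper()
    snp_pos = len(upstream)
    hits = []
    for strand in "+-":
        wt, var, pos = wildtype, variant, snp_pos
        if strand == "-":
            wt, var = reverse_complement(wt), reverse_complement(var)
            pos = len(wt) - 1 - snp_pos
        for cas, (pam, pam_side) in pams.items():
            pattern = re.compile("".join(IUPAC[b] for b in pam))
            k = len(pam)
            # Only windows that contain the SNP position are considered.
            for start in range(max(0, pos - k + 1), min(pos, len(var) - k) + 1):
                in_variant = pattern.fullmatch(var[start:start + k])
                in_wildtype = pattern.fullmatch(wt[start:start + k])
                if in_variant and not in_wildtype:
                    if pam_side == "3prime":
                        spacer = var[max(0, start - spacer_len):start]
                    else:
                        spacer = var[start + k:start + k + spacer_len]
                    hits.append((cas, pam, strand, start, spacer))
    return hits


# Illustrative subset of PAMs (assumption: all three are type II, PAM 3' of spacer).
PAMS = {
    "VQR SpCas9": ("NGAN", "3prime"),
    "EQR SpCas9": ("NGAG", "3prime"),
    "SaCas9": ("NNGRRT", "3prime"),
}

# Hypothetical SNP: reference C, variant G, flanked by made-up sequence.
hits = snp_derived_pams("ACGT" * 6 + "T", "C", "G", "AGTCCATCCA", PAMS)
for hit in hits:
    print(hit)
```

Because a window is reported only when it matches in the variant and fails in the wildtype, a SNP that falls on a fully degenerate position (N) of a PAM is never counted as PAM-generating. A SNP with several variation nucleotides would be handled by calling the function once per variant.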
The PAM sequences were determined according to previous studies that characterized the unique properties and PAM compatibility for each Cas1-2,14-28.
We obtained a database of all known pathogenic and likely-pathogenic SNPs in humans from NCBI's dbSNP (SNP database). The code is written to analyse dbSNP's data in XML format and each SNP that is found to be PAM generating is represented in a row of a CSV file.
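The XML-to-CSV step can be sketched with the two standard-library modules named above. The XML layout below is a deliberately simplified, hypothetical schema (the real dbSNP export is richer and structured differently), the SNP and gene identifiers are placeholders, and the `keep` predicate stands in for the PAM-generation test, which is omitted here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified input; NOT the actual dbSNP XML schema.
SAMPLE_XML = """<snps>
  <snp id="rs0000001" gene="GENE1" gene_id="0001" chromosome="1">
    <upstream>ACGTACGTT</upstream>
    <reference>C</reference>
    <variants>G,T</variants>
    <downstream>AGTCCATCC</downstream>
  </snp>
</snps>"""

FIELDS = ["snp_id", "gene", "gene_id", "chromosome", "wildtype", "variants"]


def parse_snps(xml_text):
    """Yield one record per <snp> element of the hypothetical schema."""
    root = ET.fromstring(xml_text)
    for node in root.iter("snp"):
        up = node.findtext("upstream")
        down = node.findtext("downstream")
        ref = node.findtext("reference")
        alts = node.findtext("variants").split(",")
        yield {
            "snp_id": node.get("id"),
            "gene": node.get("gene"),
            "gene_id": node.get("gene_id"),
            "chromosome": node.get("chromosome"),
            "wildtype": up + ref + down,
            # One sequence per variation nucleotide, separated by ";".
            "variants": ";".join(up + alt + down for alt in alts),
        }


def write_csv(records, handle, keep=lambda record: True):
    """Write one CSV row per record accepted by `keep`."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    for record in records:
        if keep(record):
            writer.writerow(record)


buffer = io.StringIO()
write_csv(parse_snps(SAMPLE_XML), buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch self-contained; in practice the handle would be a file opened with `newline=""`, as the csv module documentation recommends.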

Results
A database of PAM-generating SNPs:

The CrisPam algorithm scanned 49,634 pathogenic SNPs and 14,722 likely pathogenic SNPs (64,356 in total). Successful matches, i.e. SNPs that generate at least one PAM, were found for 84% of the total SNPs: 41,162 of the pathogenic SNPs and 12,940 of the likely pathogenic SNPs (figure 1).
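The reported proportion can be recomputed from the counts in the paragraph above. The short check below uses only those four numbers; the variable names are illustrative.

```python
# Counts as reported in the text.
pathogenic_total = 49_634
likely_total = 14_722
pathogenic_hits = 41_162
likely_hits = 12_940

total = pathogenic_total + likely_total
hits = pathogenic_hits + likely_hits


def percent(part, whole):
    """Return part/whole as a percentage."""
    return 100 * part / whole


print(f"scanned SNPs: {total}, PAM-generating SNPs: {hits}")
print(f"overall: {percent(hits, total):.1f}%")
print(f"pathogenic: {percent(pathogenic_hits, pathogenic_total):.1f}%")
print(f"likely pathogenic: {percent(likely_hits, likely_total):.1f}%")
```

The totals agree with the text (64,356 SNPs scanned, 84% of them PAM-generating), and the per-class rates derived from the same counts are roughly 83% for pathogenic and 88% for likely pathogenic SNPs.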
The SNP-derived PAM targeting approach is well suited to heterozygous patients suffering from a disease caused by a SNP. Figure 2 presents a case-study SNP (rs63750526 of PSEN1) that generates 7 PAMs. Such SNPs make it possible to choose the most suitable Cas according to the application's constraints (vector size, activity efficiency, lab stock etc.).
The full database of PAM-generating pathogenic and likely pathogenic SNPs is available at http://crispr.tau.ac.il/DBs/CrisPam_results.xlsx
The CrisPam algorithm is available at https://github.com/ristllin/CrisPam

CrisPam, an online SNP-derived PAM finding tool:
We established a web tool that performs CrisPam's SNP-derived PAM targeting abilities on user data.
Since many SNPs are yet to be reported and included in NCBI's dbSNP, and since non-pathogenic SNPs may also be of interest as research targets, our web tool offers a platform for researchers to enter their sequences of interest for CrisPam analysis.

Discussion
The SNP-derived PAM targeting approach for promoting allele specificity is a promising method for bringing novel CRISPR-based therapies into the clinic. As most patients suffering from genetic conditions are heterozygous, carrying one copy of a pathogenic allele, developing SNP-customized treatments is essential for increasing treatment safety by reducing unintended cleavage of the well-functioning wildtype allele. While many web tools offer gRNA designs for CRISPR-based experiments, none of them, to our knowledge, offers an allele-specific gRNA design. Since CrisPam does not currently offer scoring or off-target assessment, we strongly suggest a further off-target prediction analysis of the gRNA of interest. Moreover, gRNA length may vary between Cas proteins; we therefore strongly recommend using CrisPam as the first step in the experiment design. Further assessments of activity in the target organism, gRNA length and off-target prediction are required. In addition, for SNPs that generate multiple PAMs, considerations such as delivery vector capacity (e.g. AAV or lentivirus) and efficiency may also determine the most suitable Cas protein for the experiment. In the future, more features will be added to CrisPam: alternative input options (txt files and rsID) and customized PAMs.

Conclusions
As CRISPR applications have expanded widely, the SNP-derived PAM approach may be utilized for gene silencing (using inactive Cas), genetic screening and other applications beyond allele-specific DNA cleavage. This study emphasizes the emerging importance of broadening the PAM compatibility of Cas proteins to enable allele-specific targeting and overcome the PAM limitation.
Furthermore, CrisPam offers a simple interface to design an allele-specific targeting experiment using the CRISPR/Cas system.

Figure 1
The proportion of PAM-generating pathogenic and likely-pathogenic SNPs in humans. Representation of PAM generators compared to PAM non-generators. PAM generators are SNPs that generated at least one PAM. Although a SNP may generate more than one PAM, a PAM-generating SNP is counted once regardless of the number of PAMs it generates. 84% of the total SNPs checked were found to be PAM generators. The average number of PAMs generated by a SNP is 6.97.

Figure 2
A case study: the rs63750526 SNP of PSEN1 as an example of a SNP-derived PAM. The rs63750526 SNP of PSEN1, known as a risk factor for early-onset Alzheimer's disease, is an example of multiple PAMs generated by a SNP. A) The variant nucleotide A generated 7 PAMs: EQR SpCas9's NGAG, VQR SpCas9's NGAN, SaCas9's NNGRRT, KKH SaCas9's NNNRRT, xCas9's HGA and cCas9's NNVRRN and NNVACT on the complement strand. B) The data obtained for rs63750526 from the database generated by CrisPam, showing a pathogenic SNP of PSEN1 and the PAMs it generates.

Supplementary Files
This is a list of supplementary files associated with this preprint.