Data prerequisites for the design of sgRNA library
To design an sgRNA library, GLiDe requires two types of files: a FASTA file containing the entire DNA sequence, and an annotation file providing details such as genomic names, coding types, locations, strands, phases, and functional attributes. Two annotation formats are accepted: (i) generic feature format (GFF) files and (ii) protein/RNA table (PTT/RNT) files. These files can be obtained from publicly available databases such as NCBI [28], EMBL [29], and DDBJ [30]. For a newly sequenced organism, they can also be generated using a genome annotation pipeline such as PGAP [31].
sgRNA library design workflow
As shown in Fig. 2A, upon receiving the files, GLiDe constructs a coding list that connects each gene feature with its corresponding sequence. Genes with high sequence similarity are considered to share the same function [32, 33] and are grouped into a cluster. This is achieved by aligning sequences with BLASTN [34] under default parameters and applying a strict threshold (E-value < 0.001, identity > 95%, hit coverage > 95%, and query coverage > 95%). Each cluster may contain one or more genes, and genes within the same cluster are treated as multiple copies of the same gene. Consequently, potential hits within a cluster are not considered off-target. For example, Table S1 lists all clusters of multi-copy coding genes in E. coli MG1655. Candidate sgRNAs are identified using regular expressions that target two categories of guides: N20NGG and N20NAG (N = A, T, C or G), corresponding to canonical and non-canonical PAM sequences [9]. These candidates are divided into three groups: (i) sgRNAs with an NGG PAM targeting the non-template strand; (ii) sgRNAs with an NGG PAM targeting the template strand; and (iii) sgRNAs with an NAG PAM. Three FASTA files (FASTA1, FASTA2, and FASTA3) are generated, corresponding to groups (i), (ii), and (iii), respectively. These files contain the PAM-proximal 12-bp regions (seed regions) of all candidate sgRNAs, as these regions are pivotal for binding [14]. An illustrative example is presented in Figure S1. Candidate sgRNAs then undergo quality control to eliminate those with potential off-target hits (see “Quality control of sgRNA library” section). Subsequently, only sgRNAs whose GC content falls within the user-specified lower and upper limits are retained. Finally, the designed library is presented to the user, as detailed in the “Visualization of results” section.
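The candidate search described above can be illustrated with a minimal Python sketch. It is not GLiDe's actual code: the function names, the default GC limits (0.3 and 0.7), and the output fields are assumptions made for illustration. The sketch scans both strands for N20NGG and N20NAG sites with a lookahead regular expression (so overlapping sites are found), records the PAM-proximal 12-nt seed, and applies the GC filter in the same pass, whereas the workflow above applies it after quality control. Assigning hits to the template or non-template strand would additionally require the gene annotation and is omitted here.

```python
import re

# Lookahead so that overlapping candidates are all reported.
# Group 1: 20-nt protospacer; group 2: PAM (NGG or NAG).
CANDIDATE = re.compile(r"(?=([ACGT]{20})([ACGT][AG]G))")
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def find_candidates(dna, gc_min=0.3, gc_max=0.7, seed_len=12):
    """Return candidate guides on both strands (hypothetical helper).

    gc_min and gc_max are illustrative defaults, not GLiDe's.
    'start' is the 0-based offset on the strand that was scanned.
    """
    dna = dna.upper()
    hits = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for match in CANDIDATE.finditer(seq):
            guide, pam = match.group(1), match.group(2)
            if not gc_min <= gc_content(guide) <= gc_max:
                continue
            hits.append({
                "strand": strand,
                "start": match.start(),
                "guide": guide,
                "pam": pam,
                "pam_type": "NGG" if pam[1] == "G" else "NAG",
                "seed": guide[-seed_len:],  # PAM-proximal region
            })
    return hits
```

Calling `find_candidates` on a gene sequence returns one dictionary per candidate; the `seed` values are what would be written to the three seed FASTA files once the candidates are split by PAM type and target strand.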
Quality control of sgRNA library
To improve the accuracy of off-target hit identification, our approach relies on the strand invasion model, which is grounded in the natural binding process [35–37] and extensive experimental data [8, 38]. The fundamental concept is to identify sequences that are more likely to result in off-target hits, focusing on those with fewer mismatches to the target sequence in the PAM-proximal region. This process involves three alignments, chosen according to the user-defined target strand (template or non-template) and performed with SeqMap [39]. For instance, if the non-template strand is selected, FASTA1 is aligned with FASTA1, FASTA2, and FASTA3. We implement two general rules to evaluate the impact of each off-target hit (Fig. 2B).
The first is the seed region rule, where the seed region refers to the PAM-proximal 7–12 bp region. Mismatches within the seed region have a notable impact on binding affinity [14, 40]. Our previous study revealed that two mismatches in the seed region substantially weaken binding affinity [27]. Despite its effectiveness, this principle has not been integrated into existing tools. Consequently, in our design, off-target binding is not deemed to occur when there are more than two mismatches within the 12 bp PAM-proximal region.
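As a minimal sketch of the seed region rule as stated above, the following Python functions count mismatches in the 12-nt PAM-proximal region and keep a site as a potential off-target only when that count does not exceed two. The function names are hypothetical, and both sequences are assumed to be 20-nt protospacers written 5’ to 3’ with the PAM-proximal end on the right.

```python
def seed_mismatches(guide, site, seed_len=12):
    """Count mismatches in the PAM-proximal seed_len nucleotides."""
    return sum(a != b for a, b in zip(guide[-seed_len:], site[-seed_len:]))


def is_potential_off_target(guide, site, max_mismatches=2, seed_len=12):
    """Seed region rule: more than max_mismatches seed mismatches
    means the site is not treated as an off-target."""
    return seed_mismatches(guide, site, seed_len) <= max_mismatches
```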
The second rule is the penalty scoring rule, which weights each mismatch by its distance to the PAM, since mismatches are generally better tolerated at the 5’ end of the sgRNA than at the 3’ end [18]. The sgRNA is divided from the 3’ end to the 5’ end into region I (7 nt), region II (5 nt), and region III (the remaining sgRNA sequence). This division is informed by experimental findings [15, 24, 41]. The mismatch penalties for regions I, II, and III are 8, 4.5, and 2.5 (for NGG PAM), and 10, 7, and 3 (for NAG PAM), respectively (Figure S2). The effectiveness of this strategy has been demonstrated in a previous study [10]. An off-target site is identified when the penalty score falls below the user-defined threshold, with a recommended range of 18–21 for designing genome-scale libraries.
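The penalty scoring rule can be sketched in Python as follows. This is an illustration under stated assumptions rather than GLiDe's implementation: we assume the score is the sum of the per-mismatch penalties, that `pam_type` refers to the PAM of the site being evaluated, and that sequences are equal-length protospacers written 5’ to 3’. The function names and the default threshold of 20 (a value inside the recommended 18–21 range) are hypothetical.

```python
# Penalties for regions I, II, III (counted from the 3' end), from the text.
PENALTIES = {"NGG": (8.0, 4.5, 2.5), "NAG": (10.0, 7.0, 3.0)}


def penalty_score(guide, site, pam_type="NGG"):
    """Sum mismatch penalties by distance from the PAM (assumed additive)."""
    if len(guide) != len(site):
        raise ValueError("guide and site must have the same length")
    region_1, region_2, region_3 = PENALTIES[pam_type]
    score = 0.0
    # dist = 0 is the nucleotide adjacent to the PAM.
    for dist, (a, b) in enumerate(zip(reversed(guide), reversed(site))):
        if a == b:
            continue
        if dist < 7:        # region I: 7 nt next to the PAM
            score += region_1
        elif dist < 12:     # region II: the following 5 nt
            score += region_2
        else:               # region III: remaining 5' nucleotides
            score += region_3
    return score


def is_off_target(guide, site, pam_type="NGG", threshold=20.0):
    """A site counts as off-target when its score falls below the threshold."""
    return penalty_score(guide, site, pam_type) < threshold
```

For example, a site with a single PAM-adjacent mismatch and an NGG PAM scores 8, which is below a threshold of 20 and is therefore flagged, whereas three region I mismatches score 24 and are not.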
Design of negative control sgRNAs
Negative control sgRNAs are designed to have no specific targets across the genome and can be used to assess the influence of external factors on cellular phenotype. GLiDe designs these sgRNAs by generating random N20 sequences and then removing those with notable target sites. For both NGG and NAG PAMs, a penalty score threshold of 25 is applied, and the GC content limits match those of the sgRNA library. Additionally, GLiDe ensures that these negative control sgRNAs contain no runs of five or more consecutive identical bases.
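The sequence-level filters for negative controls can be sketched as below. This is a hedged illustration, not GLiDe's code: the function names, the default GC limits, the fixed random seed, and the attempt cap are assumptions, and the genome-wide target check is left as a user-supplied callback (`has_target`, hypothetical) because it depends on the alignment step.

```python
import random


def gc_content(seq):
    return (seq.count("G") + seq.count("C")) / len(seq)


def has_homopolymer(seq, run=5):
    """True if seq contains `run` or more consecutive identical bases."""
    return any(base * run in seq for base in "ACGT")


def negative_controls(n, gc_min=0.3, gc_max=0.7,
                      has_target=lambda seq: False,
                      seed=0, max_attempts=100_000):
    """Draw random N20 sequences and keep those passing all filters.

    has_target stands in for the genome-wide penalty-score check
    (threshold 25 in the text); the default accepts every sequence.
    """
    rng = random.Random(seed)
    controls, seen = [], set()
    for _ in range(max_attempts):
        if len(controls) == n:
            break
        seq = "".join(rng.choice("ACGT") for _ in range(20))
        if seq in seen:
            continue
        seen.add(seq)
        if not gc_min <= gc_content(seq) <= gc_max:
            continue
        if has_homopolymer(seq) or has_target(seq):
            continue
        controls.append(seq)
    return controls
```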
Visualization of results
GLiDe provides the final sgRNA library in a table, where each sgRNA is associated with its target gene and ranked according to its proximity to the start codon. This ranking is used because sgRNAs targeting sites closer to the transcription start site (TSS) have shown higher knock-down activity [9]. In addition to the table, GLiDe generates an interactive graphical interface using D3GB [42], which gives users an overview of the entire genome and the designed sgRNA library.
DNA manipulation and reagents
Plasmid extraction and DNA purification were carried out using kits provided by Vazyme. PCR reactions were performed using KOD One™ PCR Master Mix from TOYOBO Life Science. The PCR primers were ordered from Azenta (Table S2). Plasmids were constructed by Gibson Assembly, with a mixture comprising 10 U/µL T5 exonuclease, 2 U/µL Phusion High-Fidelity DNA polymerase, and 40 U/µL Taq DNA ligase, all sourced from New England Biolabs. Kanamycin and ampicillin were used at 50 and 100 mg/L, respectively. All strains and plasmids used in this study are listed in Table S3.
Plasmid construction
The reporting system was established using two separate plasmids. The first, pdCas9-J23111, harbored dCas9 and was constructed in prior work [10]. The second was derived from pN20test-114mCherry-r0-m1 [27] and was responsible for the expression of sgRNA and mcherry. In this plasmid, sgRNA expression was regulated by the J23119 promoter, while the N20 sequence was inserted upstream of the −35 region of the J23114 promoter, which controls mcherry expression. A total of 21 distinct pN20test-114mCherry plasmids were constructed (Figure S3). The plasmids were assembled by Gibson Assembly from PCR products (primers listed in Table S2), using the original pN20test-114mCherry-r0-m1 plasmid as the template. All constructed plasmids were confirmed by Sanger sequencing.
Cell cultivation
Strains were initially cultured overnight at 37°C and 220 rpm in a 48-well deep-well plate, with each well containing 1 mL of LB medium with kanamycin and ampicillin. The grown cells were transferred to fresh LB medium at a 0.5% dilution and grown again under the same conditions for 10 hours. This subculture process was repeated to ensure the stability of mCherry expression and avoid cell adhesion. In preparation for cytometry assays, cultures were then diluted in fresh LB medium with antibiotics to OD600 = 0.02 and grown for 4 hours to the logarithmic phase. After cultivation, 5 µL of culture from each well was diluted into 200 µL of phosphate-buffered saline. Three independent biological replicates were prepared for each strain.
Flow cytometry assay and data processing
The flow cytometry assay was performed on an LSRFortessa flow cytometer (BD Biosciences) using a 96-well plate. Gating based on the FSC area and SSC area was carried out to exclude non-cell particles. To ensure accurate measurements, autofluorescence was quantified using the MCm/PdCas9-J23111 strain and subsequently subtracted during the data analysis process.
In the cytometry analysis, the fluorescence intensity distribution was log10-transformed and fitted to a two-component Gaussian mixture model [43] with parameters \((\lambda, \mu_{1}, \mu_{2}, \sigma_{1}, \sigma_{2})\) through the expectation-maximization algorithm. Here, \(\lambda\) and \(1-\lambda\) are the mixing coefficients of the two Gaussian components, and \(\mu_{1}\), \(\mu_{2}\), \(\sigma_{1}\) and \(\sigma_{2}\) are the means and standard deviations of the first and second Gaussian components, respectively (Eq. 1).
$$f(x)=\lambda N\left(\mu_{1},\sigma_{1}^{2}\right)+\left(1-\lambda\right)N\left(\mu_{2},\sigma_{2}^{2}\right) \tag{1}$$
The mean expression strength was calculated with Eq. 2.
$$\mathrm{Mean}=\mathrm{Mean}_{1}\times \mathrm{Mean}_{2}=\exp\left(m_{1}+V_{1}/2\right)\times \exp\left(m_{2}+V_{2}/2\right) \tag{2}$$
where \(m_{1}=\lambda_{i}\mu_{1i}\log(10)\), \(V_{1}=\left(\lambda_{i}\sigma_{1i}\log(10)\right)^{2}\), \(m_{2}=\left(1-\lambda_{i}\right)\mu_{2i}\log(10)\) and \(V_{2}=\left(\left(1-\lambda_{i}\right)\sigma_{2i}\log(10)\right)^{2}\).
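For illustration, the fit of Eq. 1 and the mean of Eq. 2 can be sketched in pure Python. This is not the analysis code used in the study: the expectation-maximization loop below is a basic textbook version with a simple initialization (splitting the sorted data in half) and a fixed iteration count, both of which are our assumptions, and the function names are hypothetical. The input is assumed to be log10-transformed fluorescence values, and log(10) in Eq. 2 is taken as the natural logarithm of 10, as required to convert from base 10 to base e.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(xs, n_iter=100):
    """Basic EM for a two-component 1D Gaussian mixture (Eq. 1).

    Returns (lam, mu1, mu2, sigma1, sigma2).
    """
    xs = sorted(xs)
    n, half = len(xs), len(xs) // 2
    lam = 0.5
    mu1 = sum(xs[:half]) / half
    mu2 = sum(xs[half:]) / (n - half)
    sigma1 = sigma2 = (xs[-1] - xs[0]) / 4.0 or 1.0
    for _ in range(n_iter):
        # E-step: responsibility of component 1 for each point.
        resp = []
        for x in xs:
            a = lam * normal_pdf(x, mu1, sigma1)
            b = (1.0 - lam) * normal_pdf(x, mu2, sigma2)
            resp.append(a / (a + b) if a + b > 0.0 else 0.5)
        # M-step: update weights, means and standard deviations.
        w1 = max(sum(resp), 1e-12)
        w2 = max(n - w1, 1e-12)
        mu1 = sum(r * x for r, x in zip(resp, xs)) / w1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, xs)) / w2
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, xs)) / w1
        var2 = sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, xs)) / w2
        sigma1 = max(math.sqrt(var1), 1e-6)
        sigma2 = max(math.sqrt(var2), 1e-6)
        lam = w1 / n
    return lam, mu1, mu2, sigma1, sigma2


def mean_expression(lam, mu1, mu2, sigma1, sigma2):
    """Mean expression strength following Eq. 2."""
    ln10 = math.log(10.0)
    m1 = lam * mu1 * ln10
    v1 = (lam * sigma1 * ln10) ** 2
    m2 = (1.0 - lam) * mu2 * ln10
    v2 = ((1.0 - lam) * sigma2 * ln10) ** 2
    return math.exp(m1 + v1 / 2.0) * math.exp(m2 + v2 / 2.0)
```

A typical use would be `mean_expression(*fit_two_gaussians(log10_values))`, where `log10_values` is a hypothetical list of log10-transformed, gated events for one sample.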
The repression rate of each group was calculated using Eq. 3.
$$\text{repression rate}=1-\frac{\mathrm{Mean}_{Ri}-\mathrm{Mean}_{Ri\text{-}PC}}{\mathrm{Mean}_{Ri\text{-}NC}-\mathrm{Mean}_{Ri\text{-}PC}} \tag{3}$$
where i = 1–4.
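Eq. 3 amounts to a linear rescaling between two reference values, which the short Python sketch below makes explicit. The function and argument names are hypothetical, and reading PC and NC as the positive and negative controls of group Ri is our interpretation of the subscripts rather than something stated in this section.

```python
def repression_rate(mean_sample, mean_pc, mean_nc):
    """Repression rate following Eq. 3.

    mean_sample: Mean of the tested strain in group Ri
    mean_pc:     Mean of the Ri-PC reference (assumed positive control)
    mean_nc:     Mean of the Ri-NC reference (assumed negative control)
    """
    span = mean_nc - mean_pc
    if span == 0:
        raise ValueError("PC and NC means must differ")
    return 1.0 - (mean_sample - mean_pc) / span
```

With this definition, a sample whose mean equals the NC reference has a repression rate of 0, and one whose mean equals the PC reference has a repression rate of 1.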