Structure reveals why genome folding is necessary for site-specific integration of foreign DNA into CRISPR arrays

Bacteria and archaea acquire resistance to viruses and plasmids by integrating fragments of foreign DNA into the first repeat of a CRISPR array. However, the mechanism of site-specific integration remains poorly understood. Here, we determine a 560-kDa integration complex structure that explains how Pseudomonas aeruginosa Cas (Cas1–Cas2/3) and non-Cas proteins (for example, integration host factor) fold 150 base pairs of host DNA into a U-shaped bend and a loop that protrude from Cas1–2/3 at right angles. The U-shaped bend traps foreign DNA on one face of the Cas1–2/3 integrase, while the loop places the first CRISPR repeat in the Cas1 active site. Both Cas3 proteins rotate 100 degrees to expose DNA-binding sites on either side of the Cas2 homodimer, which each bind an inverted repeat motif in the leader. Leader sequence motifs direct Cas1–2/3-mediated integration to diverse repeat sequences that have a 5′-GT. Collectively, this work reveals new DNA-binding surfaces on Cas2 that are critical for DNA folding and site-specific delivery of foreign DNA. Here, using cryo-EM, the authors show how Cas1–Cas2/3 and integration host factor, by means of a U-shaped bend that traps the invading DNA and a loop that positions it for the integrase, regulate integration of foreign DNA into the first repeat of the CRISPR array.

Cas1b-b*) flanking a Cas2 homodimer (Fig. 1a,b) [9][10][11]. Foreign DNA fragments bind across one face of the Cas2 homodimer, which positions the 3′ ends into Cas1 active sites on either end of the complex (that is, Cas1a* and Cas1b*) [8][9][10]12. The CRISPR repeat sequence wraps around the opposing face of Cas2, sandwiching the Cas2 homodimer between the foreign and repeat DNA duplexes. Opposing Cas1 subunits (Cas1a* and Cas1b*) catalyze two successive strand-transfer reactions, linking the 3′ ends of the foreign DNA to opposite ends of the repeat 5,11. CRISPR integration complexes sense a 2-5 base-pair (bp) 3′ overhang, called a protospacer-adjacent motif (PAM), in the foreign DNA to determine the integration orientation 9. Correct spacer orientation is necessary to produce a functional CRISPR RNA that guides the CRISPR interference machinery (that is, Cascade) to
a, Scheme of the I-F CRISPR system of P. aeruginosa PA14. A CRISPR is composed of repeated DNA sequences (diamonds) interspersed with unique spacer sequences (black squares). The CRISPR is adjacent to six cas genes (arrows). Four Cas1 and two Cas2/3 proteins assemble into a heterohexamer in which the Cas3 and Cas1 subunits surround the central Cas2 homodimer like petals of a closed flower (Cas1₄-Cas2/3₂). Cas1-2/3 and IHF proteins cooperate with DNA upstream of the CRISPR (leader) to integrate foreign DNA at the first repeat. The leader sequence contains two IHF-binding sites and two IRs that are necessary for integration of foreign DNA at the leader-repeat junction. In addition to playing a central role in integration, the Cas2/3 fusion is recruited to the DNA-bound Cascade CRISPR surveillance complex to degrade foreign genetic parasites. b, The Cas1-2/3 heterohexamer and IHF proteins were mixed with a half-site DNA integration intermediate consisting of a foreign DNA linked to one strand of the CRISPR DNA at the leader-repeat junction. c, Cryo-EM density map of the type I-F CRISPR integration complex at ~3.5 Å resolution (Extended Data Fig. 1g-k and Table 1). d, Atomic model of the type I-F CRISPR integration complex. Cas1-2/3 proteins alone (left) are shown in cartoon representation. Cas3 domains rotate by 100°, simulating the motion of a bloomed flower and exposing DNA-binding sites on Cas2 that interact with each of the IRs (Extended Data Fig. 2a and Supplementary Video 1). DNA alone is shown in the middle (surface representation). Proteins (cartoon representations) and DNAs of the integration complex are shown on the right.
To understand how the Cas1-2/3 integrase cooperates with IHF and CRISPR leader motifs to integrate foreign DNA at the first CRISPR repeat, we purified the heterohexameric Cas1-2/3 integrase and the IHFα-β heterodimer, incubated these proteins with a DNA substrate representing a half-site integration intermediate and isolated the assembled complex using size-exclusion chromatography (SEC) (Extended Data Fig. 1a-f). The purified I-F integration complex was applied to cryo-electron microscopy (cryo-EM) grids and vitrified. We recorded 10,740 movies and picked 366,794 particles to determine an ~3.48-Å resolution structure of the integration complex. The reconstructed density was sufficient to model 90.7% of the 10 polypeptides and 88.4% of the 396 nucleotides of DNA (Fig. 1b-d, Extended Data Fig. 1g-k and Table 1). The model explains how the Cas1-2/3 subunits cooperate with two IHF heterodimers to kink and twist ~150 base pairs of host DNA into a structure that precisely positions foreign DNA for integration at the first repeat of the CRISPR (Fig. 1b-d).
The Cas1 and Cas2 subunits adopt a familiar quaternary arrangement that binds a foreign DNA on one face of the Cas2 homodimer and CRISPR repeat DNA on the other face (Figs. 1d, 2a and 3a) 8. A previously determined structure of Cas1-2/3 alone revealed that the Cas3 and Cas1 domains surround the central Cas2 homodimer like petals of a closed flower (Cas1₄-Cas2/3₂) (Fig. 1b) 54. While this structure explained how Cas1 regulates the Cas3 nuclease, the role of Cas3 during integration remained unclear 54. Here, we show that the addition of DNA drives a series of conformational changes in both the DNA and proteins. The Cas3 domains rotate ~100° to align in a planar configuration with Cas2, simulating the motion of a bloomed flower and exposing equivalent surfaces on opposite sides of the Cas2 homodimer that recognize an IR that is conserved in I-F leaders (Fig. 1d, Extended Data Fig. 2a and Supplementary Video 1) 54. Thus, the new planar conformation of Cas1-2/3 enables the simultaneous coordination of four DNA helices (IR distal, IR proximal, foreign DNA and CRISPR repeat) around the central Cas2 homodimer (Figs. 1d and 3). Further, this Cas3 rotation flips the nuclease domain from an interaction with Cas1 that suppresses the Cas3 nuclease activity to the opposite side of the complex, where the back of the Cas3 nuclease domain docks onto a groove created at the Cas1-Cas1 interface (Fig. 1b,d and Extended Data Fig. 2a,b). The structure reveals two prominent DNA bends that protrude at right angles from Cas1-2/3 (Fig. 1c,d). An IHF heterodimer is wedged at the apex of each DNA bend, consistent with the well-defined role of IHF in DNA bending 38. These two DNA protrusions extend ~75 Å from the Cas1-2 core. Flexibility of these DNA extensions limits the resolution of the regions to 4-8 Å (Fig. 1c,d and Supplementary Video 2). IHF-mediated bending of the IHF distal site positions the flanking IR sequences as symmetrical DNA pillars, which are recognized by equivalent surfaces on opposite sides of the Cas2 homodimer (Fig. 3 and Extended Data Figs. 3b and 4c) 22. Cas2 binding to these DNA pillars traps foreign DNA on one face of the Cas1-2/3 integrase. Further, Cas2 bends the IRs and steers downstream DNA away from Cas1-2/3, which would project the downstream CRISPR repeat away from the Cas1-2/3 integrase (Fig. 1d). However, Cas1-2/3 and IHF cooperate to constrict the DNA around the IHF proximal site, forming a loop that places the CRISPR repeat into the Cas1a* active site (Figs. 1d and 3).

Foreign DNA constrains Cas2/3 linker against Cas1
The type I-F Cas2 and Cas3 subunits are connected by a 21-amino-acid disordered linker (residues 90-110) 4,28,29,55. The structure explains how foreign DNA constrains the Cas2/3 linker against conserved surfaces of Cas1, which suggests that foreign DNA binding either initiates or stabilizes the Cas3 rotation (Fig. 2a) 54,55. The constrained Cas2/3 linker positions the HD nuclease domain of Cas3 (residues 111-374) against the Cas1-Cas1 interface, and facilitates Cas3 interactions with the IRs (Fig. 2a and Extended Data Figs. 2 and 4c). The foreign DNA and amino acids in the Cas2/3 linker contact conserved residues in type I-F Cas1 proteins (Fig. 2d,e and Extended Data Fig. 2c). Polar residues in the Cas2/3 linker may assist the binding or splaying of the foreign DNA duplex at the conserved histidine wedge (H25) in Cas1 (Fig. 2b,c and Extended Data Fig. 3). Mutation of the histidine wedge (Cas1 H25A) decreases Cas1-2/3 integration activity of foreign DNA that has either fully complementary or splayed DNA ends (Extended Data Figs. 5 and 6a,b). The integration defect on substrates with splayed ends suggests that H25 is more than a simple wedge that pries apart the ends of foreign DNA 10. The histidine steers the 3′ ends down a positively charged channel that positions each 3′-hydroxyl into Cas1 active sites on opposite ends of the complex (Figs. 2 and 4a,d), whereas the 5′ ends of the protospacer DNA are directed towards the back face of the Cas3 HD domain (Fig. 2) 9,10,56.

Cas2 homodimers recognize and bend inverted repeat sequences
The structure of the type I-F CRISPR integration complex reveals that Cas2 is the homodimer that binds the IRs (Fig. 3a). Mutations that scramble the order of nucleotides in either the IR distal or IR proximal motifs limit Cas1-2/3-mediated integration 22. While Cas2 does not make extensive sequence-specific contacts with nucleobases of the IR, a single residue (Cas2 R55) intercalates in the minor groove, and may participate in recognizing two conserved bases in the 10-bp long motif (Fig. 3a and Extended Data Fig. 4c,e). However, there is insufficient density for the R55 side chain to confidently assign contacts. Other conserved Cas2 residues (that is, K11, R12 and N56) form additional hydrogen bonds with the phosphate backbone of one DNA strand in each IR (Fig. 3a and Extended Data Fig. 4c). Mutation of these Cas2 residues (Cas2 K11D,R12E; Cas2 R55E,N56D; Cas2 K11D,R12E,R55E,N56D) prevents Cas1-2/3-mediated DNA integration (Extended Data Figs. 5 and 6c,d). Cas2 acts as a wedge that induces a 25-35° bend in the DNA upstream of IR distal and downstream of IR proximal (Fig. 3a). These flared IRs lean against basic residues (K381, R393, K397) on the back surface of Cas3 (Fig. 3 and Extended Data Figs. 3 and 4c). In sum, these observations reveal that the IR DNA sequences are primarily recognized by Cas1-2/3 through shape readout rather than base readout 39.

Cas2 homodimer is surrounded by four DNA helices
Cas2 is a cube-shaped homodimer at the center of the Cas integrase. The Cas2 cube is flanked by Cas1 homodimers to form an elongated DNA-binding platform that interacts with the CRISPR repeat on one face and the foreign DNA on the other (Fig. 3). Unique to the type I-F Cas1-2/3 integration complex, the IRs occupy the last two accessible surfaces of the Cas2 cube (Fig. 3a). Positively charged surfaces on Cas1-2/3 bind and shield negatively charged DNA, which enables the packing of four DNA helices around the small Cas2 homodimer (Fig. 3b and Extended Data Fig. 3). The foreign DNA-binding face of Cas2 has two electronegative pillars of leader DNA that straddle the foreign DNA, such that major grooves of the leader DNA pillars are clamped against major grooves of the foreign DNA. The two DNA pillars continue past Cas2 to flank the Cas1 active sites (Fig. 3b). At the IHF proximal loop, Cas3 packs the leader against the Cas1-bound repeat, decreasing the phosphate-to-phosphate distances between these helices to ~11-12 Å. Although the latter two-thirds of the CRISPR repeat could not be resolved, the trajectory of the repeat suggests it will follow a path that threads between the distal leader DNA duplex and the 3′-hydroxyl of the foreign DNA that rests in the Cas1b* active site (Fig. 3b). Collectively, Cas1, the Cas2/3 linker and Cas3 accommodate four DNA helices (IR distal, IR proximal, foreign DNA and repeat) around the central Cas2 homodimer to facilitate site-specific integration.

IHF and the leader direct integration into diverse sequences
The structure suggests that Cas1-2/3 is guided to the first repeat of the CRISPR by IHF-mediated folding of the I-F leader, rather than direct recognition of the repeat sequence (Fig. 1d and Extended Data Fig. 4b,d). To determine whether or how the repeat sequence impacts integration, we measured the efficiency of Cas1-2/3-catalyzed integration into DNAs containing either a I-F, I-E, I-C or II-A repeat downstream of a I-F leader (Fig. 4a-c). The type I-F leader supports Cas1-2/3-catalyzed leader-side integration at repeats derived from I-E, I-C and II-A CRISPR loci (Fig. 4a-c and Extended Data Figs. 7 and 8). Integration efficiency at non-native repeats is not correlated with sequence similarity to the I-F repeat or with GC content (Extended Data Fig. 8b). Instead, integration efficiency is correlated with the length of the repeat. I-F and I-E repeats are similar in length (28 and 29 bp, respectively), whereas the I-C and II-A repeats are 0.5 to 1 full DNA turns longer than the I-F repeat (33 and 37 bp, respectively) (Fig. 4b and Extended Data Fig. 9a,b). While leader-side integration is robust with different repeats, spacer-side integration is ~3.5-fold slower for the I-E repeat, and undetectable for the longer repeats (Extended Data Figs. 7 and 9a,b). As expected, Cas1-2/3 does not catalyze integration at a I-F repeat downstream of a scrambled I-F leader, nor does Cas1-2/3 catalyze integration at I-E, I-C or II-A repeats downstream of their respective leaders 22 (Extended Data Fig. 7). Cas1-2 sequences are diverse, such that leader-interacting residues are only conserved in a subset of proteins within a given CRISPR subtype 22. For example, the I-F Cas1 protein lacks residues required to interact with the I-E leader 22. Further, the I-E, I-C and I-F leaders have distinct nucleotide spacings between the leader motifs and the repeat. These nucleotide spacings impart a unique shape to the IHF-folded leader DNA, which is critical for integration. These experiments were performed with foreign DNA substrates, either with or without a PAM (Extended Data Fig. 7). Cas1-2/3 integration of PAM-containing DNA is more specific, but the conclusions are otherwise consistent between the two substrates. The PAM must be trimmed by an ancillary nuclease before Cas1-2/3 can catalyze spacer-side integration; therefore, we focused our discussion on results from the trimmed foreign DNA to compare differences in spacer-side integration (Extended Data Fig. 7a-d) [14][15][16]. Collectively, these integration experiments indicate that leader sequences and host factors dictate site-specific integration of foreign DNA at diverse DNA target sites.
positions. An ungapped sequence alignment reveals nine identical nucleotide positions between the I-F and I-E repeats. All four repeats have different internal palindromes and GC content (Extended Data Fig. 8b). c, Endpoint integration reactions with CRISPR repeat-swapped mutants, resolved on denaturing polyacrylamide gels. One of three representative gel images is shown (Extended Data Fig. 7). Quantification of leader- (gray circles) or spacer-side (white circles) integration events from all three replicate gels (Extended Data Fig. 7). The reactions were performed in triplicate, each dot represents one reaction, and some dots overlap. d, Four-minute time point of time-course integration reactions with I-F repeat mutants, resolved on denaturing polyacrylamide gels. One of three representative images is shown (Extended Data Fig. 9). Quantification of leader- (gray circles) or spacer-side (white circles) integration events from all three replicate gels (Extended Data Fig. 9). e, CRISPR integration model. IHF-mediated folding of the genome presents IRs as symmetric DNA pillars that recruit foreign DNA-bound Cas1-2/3. Cas3 domains of Cas1-2/3 must rotate away from Cas2 to expose IR-binding sites on Cas2. Cas1-2/3 and IHF cooperate to fold DNA into a loop, docking the leader-repeat junction at the Cas1 active site. Foreign DNA integration at the leader-repeat junction nicks the DNA duplex, releasing tension in the DNA duplex and inhibiting the reverse disintegration reaction (Extended Data Fig. 4) 61,62. 5′-GT dinucleotides are required for efficient leader- and spacer-side integration, but no strict sequence requirements are necessary in the rest of the repeat.
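The repeat-length comparison in this section can be checked with simple arithmetic. The sketch below converts the length difference between each tested repeat and the I-F repeat into helical turns. It assumes roughly 10.5 bp per turn of B-form DNA, a textbook value that is not stated in this paper, and the names `REPEAT_LENGTHS_BP`, `BP_PER_TURN` and `extra_turns` are illustrative only.

```python
# Repeat lengths (bp) as stated in the text.
REPEAT_LENGTHS_BP = {"I-F": 28, "I-E": 29, "I-C": 33, "II-A": 37}

# Assumption: ~10.5 bp per helical turn of B-form DNA (textbook value,
# not a number taken from the paper).
BP_PER_TURN = 10.5


def extra_turns(subtype: str, reference: str = "I-F") -> float:
    """Helical turns by which a repeat is longer than the reference repeat."""
    delta_bp = REPEAT_LENGTHS_BP[subtype] - REPEAT_LENGTHS_BP[reference]
    return delta_bp / BP_PER_TURN


if __name__ == "__main__":
    for name in REPEAT_LENGTHS_BP:
        print(f"{name}: {extra_turns(name):+.2f} turns relative to I-F")
```

Under this assumption the I-C and II-A repeats come out at roughly half a turn and just under one turn longer than the I-F repeat, in line with the "0.5 to 1 full DNA turns" stated in the text, while the I-E repeat differs by about a tenth of a turn.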

5′-GT is critical for Cas1-mediated integration
Repeat sequences are strongly conserved within CRISPR subtypes, but vary in sequence and length between subtypes 57,58. However, in the small subset of repeats tested above, we noticed that the 5′-GT is conserved. To determine whether conservation of the 5′-GT is a coincidence or a more widely conserved feature of repeats, we performed a bioinformatic analysis consisting of 24,940 CRISPRs. This bioinformatic analysis reveals that a 5′-GT dinucleotide is broadly conserved at the leader side of the repeat, and conserved in some CRISPR systems at the spacer side of the repeat (Extended Data Fig. 4g). Therefore, we hypothesized that the 5′-GT dinucleotide is a base-specific determinant for leader-side integration. To test this hypothesis, we mutated the 5′-GT (G1A, T2A) and repeated the integration assays. The 5′-GT to AA mutation ablates both leader- and spacer-side integration, indicating that the 5′-GT is essential and that leader-side integration is a prerequisite for spacer-side integration (Fig. 4d). Since Cas1 requires a 5′-GT at the leader side of the repeat, we hypothesized that introducing a 5′-GT at the spacer side of the repeat would increase spacer-side integration efficiency. To test this hypothesis, we replaced adenosine 28 of the I-F repeat with cytosine (A28C) and repeated the integration assays. The A28C mutation increases the rate and amount of spacer-side integration approximately twofold, relative to the wild-type (WT) I-F repeat (Fig. 4d). We do not detect integration into a I-F repeat that lacks a 5′-GT at the leader side, even if the repeat contains a 5′-GT at the spacer side. This result further supports our conclusion that leader-side integration is a prerequisite for spacer-side integration (Fig. 4d). We examined the structure of the Cas1 active site to determine whether the 5′-G is directly recognized by protein contacts. The Cas1 residue E184 is within 4 Å of the 5′-G (Extended Data Fig. 4b,d). However, a Cas1 E184A mutation destabilizes the complex, decreasing the amount of Cas1 subunits per Cas1-2/3 complex (Extended Data Fig. 5). Therefore, the decrease in integration activity of the Cas1 E184A-2/3 complex cannot be solely attributed to a decrease in recognition of the repeat (Extended Data Fig. 9d,f). Collectively, these data reveal that the 5′-GT is a conserved feature necessary for integration in most CRISPR systems, although no available structure provides a mechanism for direct recognition of the 5′-GT of the repeat 5,11,15,59.
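The repeat mutants described in this section can be expressed as simple string operations. The sketch below uses an invented 28-nt placeholder sequence, not the real I-F repeat, and assumes that the spacer-side 5′-GT is read on the strand complementary to the one written, so that it appears as a 3′-terminal AC on the written strand. Both the sequence and that reading are assumptions made for illustration, and every function name is hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def has_leader_side_gt(repeat: str) -> bool:
    """True if the written strand starts with 5'-GT."""
    return repeat.startswith("GT")


def has_spacer_side_gt(repeat: str) -> bool:
    """True if the complementary strand starts with 5'-GT.

    Assumption: the spacer-side dinucleotide is read on the opposite strand.
    """
    return reverse_complement(repeat).startswith("GT")


def mutate(repeat: str, position: int, new_base: str) -> str:
    """Substitute one base at a 1-based position."""
    i = position - 1
    return repeat[:i] + new_base + repeat[i + 1:]


# Invented 28-nt placeholder; NOT the real P. aeruginosa I-F repeat.
TOY_REPEAT = "GTCAGCTTAGCCTAGGCTAAGCTGACAA"

# Analogues of the mutants named in the text.
G1A_T2A = mutate(mutate(TOY_REPEAT, 1, "A"), 2, "A")
A28C = mutate(TOY_REPEAT, 28, "C")

if __name__ == "__main__":
    for label, seq in [("WT", TOY_REPEAT), ("G1A,T2A", G1A_T2A), ("A28C", A28C)]:
        print(label, has_leader_side_gt(seq), has_spacer_side_gt(seq))
```

The same two predicates could be applied across a collection of repeat sequences to tally how often each end carries a 5′-GT, which is the kind of count the bioinformatic survey reports; the survey's actual pipeline is not described in this section.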

Discussion
Here we demonstrate that Cas1-2/3 and IHF fold DNA into a structure that is necessary for site-specific integration of foreign DNA into CRISPRs. IHF proteins are highly expressed and most IHF-binding sites are thought to be occupied in vivo 60. Therefore, IHF may prefold the CRISPR leader into a 'landing pad' that recruits foreign DNA-bound Cas1-2/3 (Fig. 4e) 22,38. The Cas3 and Cas1 domains of the Cas1-2/3 complex are arranged like petals of a closed flower around the central Cas2 homodimer, such that the Cas3 domains occlude two of the four DNA-binding surfaces on Cas2, which precludes interactions with the leader (Fig. 1b,d). Foreign DNA binding to Cas1-2/3 physically constrains the Cas2/3 linker against the Cas1 homodimer, pulling the Cas3 HD domain against the Cas1-Cas1 interface (Fig. 2, Extended Data Fig. 2 and Supplementary Video 1). The 100° rotation of each Cas3 simulates the motion of a bloomed flower and exposes DNA-binding sites on Cas2 that interact with each of the IR motifs in the leader (Figs. 1 and 3 and Extended Data Figs. 2 and 3). Cas1-2/3 and IHF proteins fold DNA around the leader-repeat junction into a 260° loop that docks the first CRISPR repeat into the Cas1 active site under tension (Figs. 1 and 4 and Extended Data Figs. 3 and 4). The structure suggests that Cas1-mediated strand transfer releases tension in this DNA loop, which may prevent disintegration of an otherwise isoenergetic strand-transfer reaction, and thereby favor complete integration (Extended Data Fig. 4f). A similar mechanism has been proposed to favor complete integration in other systems, where both strand-transfer events occur simultaneously 61,62. We show that Cas1-2/3, IHF and the I-F leader facilitate leader-side integration at four different repeat sequences (Fig. 4b,c). These repeats are diverse in sequence identity, length, palindrome and GC content, but they share a 5′-GT (Fig. 4b and Extended Data Fig. 8b). To determine whether the 5′-GT is a universal feature of CRISPR repeats, we analyzed 24,940 CRISPRs. This analysis reveals that CRISPR repeats contain a strongly conserved 5′-GT at the leader end, and that a 5′-GT is also conserved at the spacer end of repeats from several CRISPR systems (Extended Data Fig. 4g). We demonstrate that the 5′-GT is critical for leader-side integration and that introducing a 5′-GT at the spacer end of the repeat increases spacer-side integration. The broad conservation of 5′-GT is consistent with previous reports that type I-A, II-A and I-E systems require a 5′-G for integration 11,63. Similarly, the putative evolutionary ancestor to Cas1 enzymes, casposase, requires a conserved 5′ dinucleotide in target-site DNA for integration 64. Collectively, these data suggest that Cas1 proteins retain a shared sequence preference for a 5′-G or a 5′-GT, and lack strict sequence requirements for the central body of the repeat (Fig. 4b,c) 65. The lack of strict sequence requirements may be advantageous because the CRISPR repeat is at the nexus of foreign DNA integration, processing of the transcribed CRISPR and loading the mature CRISPR RNA (crRNA) into the surveillance complexes (for example, Cascade, Cas9).
Genetic parasites commonly escape CRISPR-based immunity through point mutations 66. To counter escape mutants, many CRISPR-Cas systems use existing spacer sequences to enhance the acquisition of new spacers from the same foreign genetic element via 'primed' acquisition 4,[30][31][32][33][34][35][36][37]. In some examples of primed acquisition, the Cas3 nuclease/helicase degrades CRISPR-targeted DNAs into single-stranded (ss) DNA fragments enriched in PAM-containing termini 67. Cas1-2 has been proposed to anneal complementary ssDNA fragments and integrate these into CRISPRs 14,30,31,67. Single-molecule colocalization and bulk immunoprecipitation suggest that the type I-E Cas1-2 integrase is recruited to a Cas3-Cascade-target DNA complex to facilitate primed acquisition 34,68. The structure of the type I-F integration complex reveals conformational changes in Cas3 that may enable interactions with DNA-bound Cascade (Fig. 5a) 36,69. Cascade improves integration efficiency and fidelity in vivo 69,70, and the structure suggests a model for the formation of a primed acquisition complex (Cas1-2/3-Cascade-target DNA) that transfers new foreign DNA fragments to the integrase. Additional structures will be necessary to clarify the mechanism(s) of primed adaptation.
A comparison of the I-E and I-F integration complexes with a structure of the lambda-phage excision complex reveals structural similarities and differences. In I-E systems, the IHF protein folds the leader DNA to present an upstream motif to a lobe of Cas1. In contrast, the I-F structure highlights extensive cooperation between IHF and Cas1-2/3 in bending the leader into an energetically strained conformation that may increase the specificity of CRISPR recognition (Fig. 6a,b). This cooperation includes Cas1-2/3 kinking the IR DNA motifs presented as parallel DNA pillars by IHF, and Cas1-2/3 constricting the IHF proximal 180° bend into a 260° bend (Fig. 6b). Further, the sequestration of the IR-binding surface of Cas2 suggests a unique structural mechanism that prevents Cas1-2/3 interactions with the leader until foreign DNA binding induces a rotation of Cas3 (Fig. 4e and Supplementary Videos 1 and 3) 54.

Lambda phage excision complex
Diverse systems thus use DNA as a flexible scaffold to regulate the isoenergetic mobilization of DNA (Fig. 6a-c).
Diverse DNA-mobilizing enzymes across the tree of life co-opt DNA folding to regulate DNA mobilization [3][4][5][6][7]. In sum, these data provide a mechanistic understanding of the role of DNA as a flexible scaffold that controls DNA mobilization. These insights are critical to developing applications of DNA-mobilizing enzymes in gene therapy, genetic engineering and chronological DNA recordings 6,53,[71][72][73][74][75].

Online content
Any methods, additional references, Nature Portfolio reporting summaries, source data, extended data, supplementary information, acknowledgements, peer review information; details of author contributions and competing interests; and statements of data and code availability are available at https://doi.org/10.1038/s41594-023-01097-2.

Nucleic acid preparation
Four single-stranded DNAs (Supplementary Table 1) were synthesized (IDT) and resuspended in 1× TE buffer (10 mM Tris-HCl pH 8, 1 mM EDTA) before being used to assemble the type I-F integration complex; the assembly is detailed in a section below. The splayed foreign DNAs used in integration assays (Supplementary Table 1) were synthesized and resuspended in 1× TE buffer before use. To make 32P-labeled CRISPR integration substrates, the sequences consisting of the leader and CRISPR arrays were first synthesized and cloned into pUC57 (Genscript). These plasmids have been made available on Addgene (Supplementary Table 1). These plasmids were transformed into chemically competent E. coli DH5α cells and the transformed cells were plated onto LB agar plates containing 100 µg ml−1 ampicillin. These cells were cultured in LB medium and plasmids were purified using a ZymoPURE II Plasmid Midiprep kit (Zymo Research). Each plasmid was then digested with EcoRI-HF and BamHI-HF (NEB) restriction enzymes, and the 294-383 bp inserts of interest were separated from the vector backbone by agarose gel electrophoresis. The gel segments containing the DNA inserts of interest were excised and DNA was purified using a Zymoclean Gel DNA Recovery kit (Zymo Research) (Supplementary Table 1). The 5′ ends of the CRISPR leader and array fragments were dephosphorylated using Quick calf intestinal alkaline phosphatase (NEB), and the DNAs were purified away from protein using a DNA Clean and Concentrator kit (Zymo Research). Both 5′ ends of 1 pmol of the CRISPR leader and array fragments were then labeled with 32P by incubation with 4 pmol of [γ-32P]ATP (PerkinElmer) and polynucleotide kinase (PNK; NEB) in 1× PNK buffer at 37 °C for 45 minutes. PNK was heat denatured by incubation at 65 °C for 20 minutes. Spin column purification (G-25, GE Healthcare) was used to remove unincorporated radioactive nucleotides and to buffer exchange DNAs into 1× TE buffer.

In vitro integration assays
Endpoint integration reactions were performed in triplicate using 300 nM of splayed foreign DNA fragments, containing or lacking a PAM (IDT), 200 nM of Cas1-2/3, 300 nM of IHF heterodimer and roughly 1 nM of a given 32P-labeled CRISPR variant fragment in integration buffer (20 mM HEPES pH 7.5, 150 mM potassium glutamate, 5 mM MnCl2, 1 mM TCEP, 1% glycerol) (Supplementary Table 1). Reactions were assembled on ice and then incubated at 35 °C for 20 minutes. Time-course integration reactions were performed in triplicate using 300 nM of splayed, or fully complementary, foreign DNA fragments lacking a PAM (IDT), 200 nM of Cas1-2/3, 300 nM of IHF heterodimer and roughly 1 nM of a given 32P-labeled CRISPR variant fragment in modified integration buffer (20 mM HEPES pH 7.5, 150 mM potassium glutamate, 5 mM MnCl2, 10 mM TCEP, 1% glycerol) (Supplementary Table 1). We noticed that IHF and Cas1-2/3 protein stocks exhibited a propensity to precipitate when diluted into prechilled buffer to form working dilution stocks. Therefore, all working dilutions of IHF and Cas1-2/3 prepared for time-course assays were made by first diluting the proteins into room-temperature buffer, mixing and then chilling on ice. Reactions were assembled on ice and then incubated at 20 minutes. Time points were taken at 0, 1, 2, 4 and 8 minutes. Reactions were stopped by the addition of phenol. The aqueous (nucleic acid containing) layer was mixed 1:1 with 2× formamide loading buffer (95% formamide, 20 mM EDTA, 0.05% bromophenol blue, 0.05% xylene cyanol) and then denatured at 95 °C for 5 minutes, before resolving the 32P-labeled CRISPR substrates and integration products on a 7% (w/v) (29:1 mono:bis) polyacrylamide urea gel in 1× TBE (100 mM Tris-borate pH 8.3, 2 mM EDTA). Gels were dried and quantified using a Typhoon phosphorimager (GE Healthcare). The intensities of full-length CRISPR variant, leader-side integration fragments and spacer-side integration fragments were quantified with Multi Gauge v.3 (Fujifilm). These readings were then used to calculate leader- and spacer-side integration events as percentages of all events. Images of all gels that resolved integration reactions are shown (Extended Data Figs. 6, 7 and 9), and additional control gels show that Cas1-2/3 is required for integration, and show how the custom 32P-labeled ladder was generated by restriction enzyme digestion of 32P-labeled CRISPR variant DNAs (Extended Data Fig. 8). Time-course integration data were fit to a plateau followed by one-phase association (GraphPad Prism v.10).
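The quantification step described in this section amounts to normalizing band intensities. The sketch below assumes that "all events" means the summed intensity of the three quantified species in a lane (full-length substrate, leader-side product and spacer-side product), which the text does not state explicitly. The function names and the example intensities are hypothetical. The second function writes out the usual form of a plateau followed by one-phase association, the model named for the time-course fits, as a plain function rather than a fitting routine.

```python
import math


def integration_percentages(full_length: float,
                            leader_side: float,
                            spacer_side: float) -> dict:
    """Express leader- and spacer-side band intensities as % of the lane total.

    Assumption: the lane total is the sum of the three quantified bands.
    """
    total = full_length + leader_side + spacer_side
    if total <= 0:
        raise ValueError("band intensities must sum to a positive value")
    return {
        "leader_side_pct": 100.0 * leader_side / total,
        "spacer_side_pct": 100.0 * spacer_side / total,
    }


def plateau_then_one_phase(x: float, x0: float, y0: float,
                           plateau: float, k: float) -> float:
    """Y stays at y0 until x0, then rises exponentially toward `plateau`."""
    if x < x0:
        return y0
    return y0 + (plateau - y0) * (1.0 - math.exp(-k * (x - x0)))


if __name__ == "__main__":
    # Hypothetical intensities in arbitrary units, for illustration only.
    print(integration_percentages(full_length=70.0, leader_side=20.0,
                                  spacer_side=10.0))
    # Model evaluated at the sampled time points with made-up parameters.
    for t in (0, 1, 2, 4, 8):
        print(t, round(plateau_then_one_phase(t, x0=0.5, y0=0.0,
                                              plateau=30.0, k=0.6), 2))
```

In practice the model parameters would be estimated by nonlinear regression against the replicate measurements; the parameter values shown here are placeholders and carry no experimental meaning.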

Assembly and purification of I-F integration complex
A total of four ssDNAs synthesized to mimic a half-site integration intermediate were annealed in a stepwise manner. Two nanomoles of ssDNAs, mostly corresponding to the sense and antisense strands of the CRISPR leader ('strand_1' and 'strand_2'), were denatured at 100 °C and then slow annealed using a PCR program that cooled the samples to 25 °C, in 5 °C steps for 5 minutes each, in 100 µl of hybridization buffer (20 mM Tris-HCl pH 7.5, 100 mM monopotassium glutamate, 5 mM EDTA, 1 mM TCEP). Two nanomoles of ssDNAs, mostly corresponding to the sense and antisense strands of the foreign DNA ('strand_3' and 'strand_4'), were slow annealed using the same protocol (Extended Data Fig. 1a and Supplementary Table 1). The two sets of annealed DNAs (tube 1: 'strand_1' and 'strand_2'; tube 2: 'strand_3' and 'strand_4') were mixed together, heated to 80 °C and then slow annealed using a PCR program that cooled the samples to 25 °C, in 5 °C steps for 5 minutes each, to anneal the complementary sense and antisense regions of the CRISPR repeat included in strand_2 and strand_3. Next, 6 nanomoles of IHF heterodimer in 50 µl of hybridization buffer was warmed to 25 °C and then mixed and incubated with the annealed DNAs at 25 °C for 10 minutes. Next, 3 nanomoles of Cas1-2/3 in 250 µl of hybridization buffer was warmed to 25 °C, mixed with the prepared DNA and IHF mixture, and incubated at 25 °C for 10 minutes. The total concentration of monopotassium glutamate in the mixture at this stage was ~200 mM, due to carryover from the stored protein stocks. This sample was centrifuged at 22,000g at 4 °C for 20 minutes to remove precipitates. The type I-F CRISPR integration complex was then purified on a Superdex 200 10/300 column (Cytiva) equilibrated in SEC buffer (20 mM Tris-HCl pH 7.5, 200 mM monopotassium glutamate, 5 mM EDTA, 1 mM TCEP, 2% glycerol). Then 0.5-ml fractions were individually concentrated and stored. The sixth SEC fraction contained all DNAs and proteins of interest and was further analyzed by cryo-EM (Extended Data Fig. 1d-f).
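The stepwise cooling programs used for annealing can be written out explicitly. The following is a minimal sketch and not the thermocycler program itself: the function name `annealing_steps` is hypothetical, and it assumes that the first hold is one step below the starting temperature and that the final 25 °C step is also held for 5 minutes, neither of which is stated in the text.

```python
def annealing_steps(start_c, end_c=25, step_c=5, hold_min=5):
    """Return (temperature in °C, hold in minutes) pairs for a stepwise ramp.

    Assumption (not stated in the text): the first hold is one step below
    the starting temperature, and the final temperature is also held.
    """
    temps = range(start_c - step_c, end_c - 1, -step_c)
    return [(t, hold_min) for t in temps]


# First annealing: each strand pair is denatured at 100 °C, then cooled to 25 °C.
first_ramp = annealing_steps(100)
# Second annealing: the two duplexes are mixed, heated to 80 °C, then cooled.
second_ramp = annealing_steps(80)

for name, ramp in (("100 -> 25 °C", first_ramp), ("80 -> 25 °C", second_ramp)):
    total = sum(hold for _, hold in ramp)
    print(f"{name}: {len(ramp)} steps, {total} min of hold time")
```

Under these assumptions, the second ramp is shorter only because it starts from 80 °C, a temperature chosen to anneal the repeat regions of strand_2 and strand_3 after the two duplexes have already formed.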

Cryo-EM sample preparation and data acquisition
Purified integration complex was diluted to a concentration of 1 µM in SEC buffer lacking glycerol (20 mM Tris-HCl pH 7.5, 200 mM monopotassium glutamate, 5 mM EDTA, 1 mM TCEP), such that the final glycerol concentration was 0.2% within 1 hour of freezing. Sample was applied to Quantifoil R2/2 Cu 200 mesh grids that were glow discharged using 15 mA for 15 seconds with a 10 second hold (easiGlow, Pelco). A 4-µl portion of diluted integration complex was applied to the grids, and then the grids were blotted for 5-6 seconds using Vitrobot filter paper (Electron Microscopy Sciences) with a blot force of 6, at 100% humidity, 8 °C, followed by plunge freezing into liquid ethane using a Vitrobot (Mk. IV, ThermoFisher Scientific). A preliminary dataset of 230 movies was collected on Montana State University's Talos Arctica transmission electron microscope (ThermoFisher Scientific), with a field-emission gun operating at an acceleration voltage of 200 kV using parallel illumination conditions 79. Movies were acquired using a Gatan K3 direct electron detector operated in electron counting mode, applying a total electron exposure of 50 e⁻/Å² over 50 frames (3.995 s exposure, 0.08 s frame time). The SerialEM data collection software was used to collect micrographs at 36,000-fold nominal magnification (1.152 Å per pixel at the specimen level) with a nominal defocus set to 0.5 µm-2.0 µm (ref. 80). Stage movement was used to target the center of four 2.0-µm holes for focusing, and image shift was used to acquire high-magnification images in the center of each of the holes. A preliminary reconstruction was determined from a curated set of 160 images that had CTF fits less than 9 Å and a full-frame motion less than 40 pixels. Briefly, a round of blob picking (150-270 Å) followed by two-dimensional (2D) classification was used to identify 2D classes used as templates for template picking in cryoSPARC 81. Template picking identified 33,403 initial particles from the above 160 images. The 2D classification of these 33,403 particles into 50 classes was used to identify 5 classes with strong structural features containing 4,002 particles. Nonuniform refinement of these 4,002 particles resulted in an ~14.7 Å resolution reconstruction that appeared to contain a complete integration complex; therefore, new grids were prepared as described above and shipped to the National Center for CryoEM Access and Training (NCCAT) and the Simons Electron Microscopy Center located at the New York Structural Biology Center (NYSBC) for additional data collection (Table 1). At NCCAT, grids were imaged using a 300 kV Titan Krios G3i (ThermoFisher Scientific) equipped with a GIF BioQuantum and K3 camera (Gatan). A total of 10,740 images were recorded with Leginon v.3.5 (ref. 82) with a calibrated pixel size of 0.5335 Å per pixel (micrograph dimension of 11,520 × 8,184 pixels) over a nominal defocus range of −0.7 µm to −2.1 µm and 20 eV slit. Movies were recorded in 'super-resolution mode' (native K3 camera binning 1) with subframes of 50 ms over a 2.5 s exposure (50 frames) to give a total exposure of ~69 e⁻/Å² (Table 1).
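The acquisition parameters above imply a few derived quantities that are convenient to check. The sketch below is illustrative arithmetic only, using the values quoted in the text; the helper `per_frame` and all variable names are hypothetical, and the conversion from super-resolution to physical pixel size assumes that a super-resolution pixel is half the width of a physical detector pixel.

```python
def per_frame(total, n_frames):
    """Split a total exposure (dose or time) evenly across frames."""
    return total / n_frames


# Talos Arctica screening dataset: 50 e-/Å^2 over 50 frames in 3.995 s.
talos_dose_per_frame = per_frame(50.0, 50)
talos_frame_time_s = per_frame(3.995, 50)

# Titan Krios dataset: ~69 e-/Å^2 over 50 frames in 2.5 s.
krios_dose_per_frame = per_frame(69.0, 50)
krios_frame_time_s = per_frame(2.5, 50)

# Assumption: a super-resolution pixel is half the width of a physical pixel.
superres_pixel_A = 0.5335
physical_pixel_A = 2 * superres_pixel_A
# Nyquist limit once the data are Fourier-binned 2 x 2 back to physical pixels.
nyquist_after_binning_A = 2 * physical_pixel_A

# Field of view of one 11,520 x 8,184 super-resolution micrograph, in nm.
fov_nm = (11520 * superres_pixel_A / 10, 8184 * superres_pixel_A / 10)

print(f"Talos: {talos_dose_per_frame:.2f} e-/Å^2 per frame, {talos_frame_time_s:.4f} s per frame")
print(f"Krios: {krios_dose_per_frame:.2f} e-/Å^2 per frame, {krios_frame_time_s:.3f} s per frame")
print(f"Binned pixel {physical_pixel_A:.3f} Å, Nyquist {nyquist_after_binning_A:.3f} Å")
print(f"Field of view {fov_nm[0]:.1f} x {fov_nm[1]:.1f} nm")
```

The per-frame times recovered this way agree with the stated 0.08 s and 50 ms frame times, which is a useful consistency check on the quoted exposure settings.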

Cryo-EM image processing
Patch motion correction and patch CTF correction were performed in cryoSPARC 81. First, 3,792 of 10,740 total images (CTF < 8 Å, full-frame motion < 30 Å) were processed to build an initial template. Blob picking was used to pick particles with diameters ranging from 120 to 280 Å. These ~1.8 million particles were extracted, Fourier-binned 2 × 2 and then subjected to 2D classification (custom parameters: initial classification uncertainty factor = 3; number of online-EM iterations = 30; batchsize per class = 200) (Extended Data Fig. 1g). Particles from 82 of the 200 2D classes were selected for an initial round of ab initio reconstruction and heterogeneous refinement. Particles from one of five of these classes were selected for a second round of ab initio reconstruction and heterogeneous refinement. Particles from one of three of these classes (174,000 particles) were selected for nonuniform refinement (custom parameters: optimize per-particle defocus = true; optimize per-group CTF params = true) to create an initial reconstruction with a resolution of ~3.7 Å, which was used to calculate templates 83. These templates were used to choose particles from 9,858 of 10,740 total images (CTF < 8 Å cutoff). These ~5.85 million particles were classified into a total of six classes by heterogeneous refinement, which were seeded with one good volume and five junk volumes taken from the above heterogeneous refinement analyses. The ~1.31 million particles in the single selected class were passed through a round of 2D classification (custom parameters: batchsize per class = 200) (Extended Data Fig. 1h). The ~1.29 million particles from 49 of the 50 2D classes were selected for a round of ab initio reconstruction followed by heterogeneous refinement into two classes. The ~1.1 million particles from one of these two classes were subjected to three-dimensional (3D) classification into four classes (custom parameters: batchsize per class = 20,000; initialization mode = PCA; target resolution = 2 Å; particles per reconstruction = 500; class similarity = 0.3), followed by separate nonuniform refinements of particles from each of these four classes (custom parameters: optimize per-particle defocus = true; optimize per-group CTF params = true) 83. The final set of 366,794 particles was re-extracted and recentered, Fourier-binned 2 × 2 and subjected to nonuniform refinement to generate a final reconstruction refined to a global resolution of 3.48 Å on the basis of the 0.143 FSC cutoff (Extended Data Fig. 1h-k and Table 1) 84. The 3D FSC was calculated using the webserver 3dfsc.salk.edu (ref. 85).
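For readers unfamiliar with how a global resolution is read off at the 0.143 FSC cutoff, the calculation can be sketched with NumPy. This is a minimal, unmasked illustration on cubic volumes and is not the cryoSPARC implementation used here; the function names `fsc_curve` and `resolution_at_cutoff` are hypothetical, and the random volumes at the end are toy inputs rather than half-maps from this study.

```python
import numpy as np


def fsc_curve(half1, half2):
    """Fourier shell correlation between two cubic half-maps.

    Returns one value per integer-radius shell, from shell 0 (DC) up to,
    but not including, the Nyquist shell.
    """
    if half1.ndim != 3 or half1.shape != half2.shape or len(set(half1.shape)) != 1:
        raise ValueError("expected two cubic 3D volumes of equal shape")
    n = half1.shape[0]
    f1 = np.fft.fftn(half1)
    f2 = np.fft.fftn(half2)

    # Integer frequency index of every voxel, then its (rounded) shell radius.
    k = np.fft.fftfreq(n) * n
    kx, ky, kz = np.meshgrid(k, k, k, indexing="ij")
    shell = np.rint(np.sqrt(kx**2 + ky**2 + kz**2)).astype(int)

    n_shells = n // 2
    keep = shell < n_shells
    idx = shell[keep]
    cross = np.bincount(idx, weights=(f1 * np.conj(f2)).real[keep], minlength=n_shells)
    power1 = np.bincount(idx, weights=(np.abs(f1) ** 2)[keep], minlength=n_shells)
    power2 = np.bincount(idx, weights=(np.abs(f2) ** 2)[keep], minlength=n_shells)

    denom = np.sqrt(power1 * power2)
    return np.divide(cross, denom, out=np.zeros_like(cross), where=denom > 0)


def resolution_at_cutoff(fsc, box_size, pixel_size_A, cutoff=0.143):
    """Resolution (Å) at the first shell past DC where the FSC drops below cutoff.

    Returns None if the curve never drops below the cutoff.
    """
    below = np.nonzero(np.asarray(fsc)[1:] < cutoff)[0]
    if below.size == 0:
        return None
    shell = int(below[0]) + 1
    return box_size * pixel_size_A / shell


# Toy inputs: identical volumes correlate fully, unrelated noise does not.
rng = np.random.default_rng(0)
volume = rng.normal(size=(16, 16, 16))
same = fsc_curve(volume, volume)
noise = fsc_curve(volume, rng.normal(size=(16, 16, 16)))
```

Production software additionally applies masks and corrects for their effect on the curve, so values from a sketch like this should not be compared directly with the 3.48 Å reported above.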

Model building and validation
The map was sharpened from two half-maps using the local anisotropic sharpening job in Phenix 86. The published structure of the P. aeruginosa Cas1 homodimer was used as the starting model 56 because ColabFold consistently failed to predict the alternative fold that one Cas1 subunit adopts to form the asymmetric homodimer interface 87, even when provided with template structures. In contrast, the ColabFold-predicted models for the P. aeruginosa IHF heterodimer and the P. aeruginosa Cas2/3 subunit were used as starting models. The conformation of the DNA sequences within the E. coli IHF-DNA cocrystal structure (PDB 1IHF) 38 was used as the starting model for DNA segments within the IHF distal and IHF proximal DNA bends. For all other double-stranded DNA segments, B-form DNA was used as a starting model. Single-stranded DNA segments were built de novo. Protein and DNA segments were individually rigid-body fitted into the EM density map. The relative orientations of the Cas2, Cas2/3 linker and Cas3 domains were corrected by real-space refinement into the EM density map in WinCoot 88. The ReadySet job in Phenix was used to generate hydrogens on all proteins and nucleic acids and prepare the model for further refinement. Then, protein and DNA segments were real-space refined in WinCoot 88, restrained to ideal geometry, secondary structure and Geman-McClure distance restraints generated in ProSMART from the input models 89. The models were iteratively real-space refined in WinCoot and in Phenix using Ramachandran and secondary structure restraints 86,88. The starting model was used as a reference model, and harmonic restraints on the starting coordinates were enabled. MolProbity 90 and the PDB validation service server (https://validate-rcsb-1.wwpdb.org/) were used to identify problem regions, which were subsequently corrected in WinCoot 88. For regions of the reconstruction where side chains are not visible (resolution >4.0 Å), the atomic model was truncated to the peptide backbone. For regions of the reconstruction where the backbone was ambiguous, the corresponding sections of the peptide or DNA model were removed. Contacts and hydrogen bonds between residues were identified by ChimeraX v.1.4 using the 'contacts' and 'hbonds' commands, respectively, with default parameters 91,92. The DNAproDB webserver (https://dnaprodb.usc.edu/) was further used to analyze DNA-protein contacts (Extended Data Fig. 4d,e) 93. Structure-guided mutagenesis was used to further validate key Cas1-2/3-DNA contacts in the above biochemical assays.

Cas1, Cas2/3 and repeat conservation analysis
To build a list of type I-F Cas1 sequences, CRISPRDetect v.2.4 with default parameters was used to identify CRISPR arrays within a total of 18,225 bacterial and 376 archaeal complete genomes accessed from the NCBI Assembly database on 10 June 2019, as previously described 22,94. The 15,274 high-confidence CRISPR arrays were classified with a CRISPR subtype by CRISPRDetect v.2.4 (by matching to a list of repeats with known subtype annotations) and by genetic proximity to subtype-specific cas genes (within 20,000 bp). To identify cas genes, the 20,000 bp flanking the CRISPR were submitted to PRODIGAL v.2.6.3 (default parameters) to predict all potential open reading frames (ORFs) 95. This ORF database was then used as input to search for cas gene clusters with MacsyFinder v.1.0.5 (ref. 96). The following parameters were used: 'macsyfinder --sequence-db <peptide_database> --db-type gembase -d <CRISPR_sub-type_definitions> -p <HMM_profiles> -w 50 -vv all'. HMM profiles and classification definitions used in MacsyFinder were acquired from the local version of CRISPRCasFinder v.4.2.20 (ref. 97). Next, the first repeat and 200 nucleotides upstream of CRISPR arrays (leader), which were classified as type I-F (1,683 arrays), were collected. A nonredundant list of I-F CRISPR leaders (536 leaders) was generated using CD-HIT v.4.8.1 with a 95% identity cutoff 98. A local copy of FIMO was used to identify matches to the position weight matrix representing the I-F IHF-binding site, as previously described 22,99. I-F CRISPR arrays that possess more than one IHF site (IHF proximal and/or IHF distal) in the leader sequences were extracted for downstream analyses. Cas1 homologs were identified within the 20,000-bp flanking regions of the 444 extracted I-F CRISPR arrays by using PRODIGAL and MacsyFinder with the same parameters described above. A total of 371 Cas1 homologs associated with type I-F CRISPRs and possessing at least one IHF site in the leader sequences were identified. A nonredundant list of Cas1 sequences was generated with CD-HIT v.4.8.1 with a 95% identity cutoff, resulting in 222 sequences 98. Sequences smaller than 200 residues or larger than 500 residues were removed, and the remaining 205 sequences were further curated with MaxAlign, which selected a list of 144 unique type I-F Cas1 sequences 100. The P. aeruginosa PA14 Cas1 sequence was then added to give a final list of 145 type I-F Cas1 sequences. To build a list of type I-F Cas2/3 sequences, the P. aeruginosa PA14 Cas2/3 sequence was used as an input for HMMER to search for homologs against the UNIREF-90 database, using three iterations and an E value cutoff of 0.0001 (refs. 101,102). A list of 500 representative sequences was further curated with MaxAlign, to generate a final list of 458 unique Cas2/3 sequences. Type I-F Cas1 and Cas2/3 sequences were aligned using the MAFFT webserver with the E-INS-i iterative refinement method to give alignments with the highest number of gap-free sites 103.
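The length filter applied to the nonredundant Cas1 list is simple to reproduce. The sketch below assumes FASTA-formatted input and treats the 200- and 500-residue limits as inclusive, which the text does not specify; the functions `parse_fasta` and `filter_by_length` are hypothetical helpers and the sequences are toy placeholders, not Cas1 sequences.

```python
def parse_fasta(text):
    """Parse FASTA text into a {name: sequence} dictionary."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            # Fall back to a generated name if the header is empty.
            name = fields[0] if fields else f"record_{len(records) + 1}"
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def filter_by_length(records, min_len=200, max_len=500):
    """Keep sequences whose length lies within [min_len, max_len].

    Assumption: both limits are inclusive.
    """
    return {n: s for n, s in records.items() if min_len <= len(s) <= max_len}


# Toy placeholder sequences (not real Cas1 proteins).
toy_fasta = "\n".join([
    ">too_short", "M" * 150,
    ">kept", "M" * 300,
    ">too_long", "M" * 600,
])
kept = filter_by_length(parse_fasta(toy_fasta))
print(sorted(kept))
```

A filter of this kind removes fragments and fusion proteins before alignment, which is consistent with the drop from 222 to 205 sequences described above.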
To build an updated list of CRISPR repeat sequences, CRISPRDetect v.3.0 with default parameters was used to identify CRISPR arrays within a total of 25,502 bacterial and 398 archaeal complete genomes and chromosomes accessed from the NCBI RefSeq Assembly database (accessed on 10 June 2021) 94. This search identified CRISPR loci within 58,864 genomic and plasmid sequences, resulting in 24,940 high-confidence CRISPR loci predictions (array quality score >3). Similar to above, CRISPRDetect annotated the subtype of 14,446 of these CRISPR loci, on the basis of the sequence similarity of the repeats in these loci to known CRISPR repeats. The subtypes of the remaining 10,494 CRISPR loci were determined by their proximity to subtype-specific cas genes as described above. A total of 5,321 of the 10,494 unclassified CRISPR loci were assigned a subtype using this protocol, such that 5,173 CRISPR loci remained unclassified. The consensus repeats for each of the 24,940 CRISPR loci, as reported by CRISPRDetect, were used for downstream analyses. To ensure the repeats were arranged in the correct orientation, the 24,940 repeats were grouped by subtype, and each group was individually aligned by MAFFT using the '--adjustdirection' parameter. Sequence logos of the first and last three base pairs of CRISPR repeats were made using WebLogo v.3.7.1 for CRISPR subtypes and across all subtypes 104,105 (Extended Data Fig. 4g).
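The bookkeeping of CRISPR loci and the per-position tallies that underlie a sequence logo of repeat termini can be sketched as follows. This is an illustration rather than the WebLogo calculation: it reports plain base frequencies instead of information content, the function `terminal_frequencies` is hypothetical, and the four repeats are invented toy sequences.

```python
from collections import Counter

# Bookkeeping of CRISPR loci, using the counts reported in the text.
high_confidence = 24_940
typed_by_repeat = 14_446
typed_by_cas_genes = 5_321
untyped_after_repeat = high_confidence - typed_by_repeat
still_unclassified = untyped_after_repeat - typed_by_cas_genes


def terminal_frequencies(repeats, k=3):
    """Base frequencies at the first k and last k positions of each repeat.

    Repeats shorter than k are skipped. Returns two lists of dictionaries,
    one dictionary per position, ordered 5' to 3'.
    """
    repeats = [r.upper() for r in repeats if len(r) >= k]

    def column(i):
        counts = Counter(r[i] for r in repeats)
        total = sum(counts.values())
        return {base: n / total for base, n in counts.items()}

    first = [column(i) for i in range(k)]
    last = [column(i - k) for i in range(k)]
    return first, last


# Invented toy repeats, already oriented 5' to 3' (not real CRISPR repeats).
toy_repeats = ["GTTCAC", "GTACGA", "GAGGAC", "ATTTCC"]
first, last = terminal_frequencies(toy_repeats)
print(untyped_after_repeat, still_unclassified)
print(first[0], first[1])
```

Because the tallies are position-specific, the repeats must share a common orientation first, which is why the repeats were aligned with '--adjustdirection' before the logos were made.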

Reporting summary
Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.
Extended Data Fig. 1 | Cryo-EM sample preparation, imaging and processing for type I-F integration complex. a, Sequence-level schematic of DNA used to assemble the integration complex. The length of each motif is listed. The latter two thirds of the CRISPR repeat and second spacer (grey dashed box) could not be resolved in the cryo-EM reconstruction. See also Supplementary Table 1. b, Size-exclusion chromatography (SEC) profile (Superdex 75 16/600, Cytiva) of IHF heterodimer purified as described in methods section, and SDS-PAGE gel (inset). c, SEC profile (Superdex 200 10/300, Cytiva) of Cas1-2/3 heterohexamer purified as described in methods section, and SDS-PAGE gel (inset). d, I-F integration complex was assembled from purified DNAs, IHF and Cas1-2/3 as described in the methods section, and the assembled complex was further purified by SEC (Superdex 200 10/300, Cytiva). Individual fractions were collected along the elution profile, and were concentrated and stored separately for further analysis and imaging. e, Individual SEC fractions were analyzed by SDS-PAGE to determine which fractions contained all the proteins necessary for a complete complex. f, Individual SEC fractions were phenol-chloroform extracted, and the aqueous layer was analyzed by Urea-PAGE to determine which fractions contained all four DNA strands necessary for a complete complex. The fraction chosen for cryo-EM analysis is indicated with a dotted purple box. g, Image processing pipeline for a small subset of 10,740 total micrographs for the type I-F integration complex, to generate an initial model for template picking. Scale bar represents 100 nm. h, Final image processing pipeline for the type I-F integration complex. i, Viewing direction distribution plot depicting particle orientations present in final reconstruction. More populated views are shown in red, and less populated views are shown in blue. j, 3D Fourier Shell Correlation (3DFSC) of the final I-F integration complex reconstruction. The global resolution at 0.143 is indicated by a dashed line, 3.48 Å. k, Local resolution estimation of the cryo-EM reconstruction calculated by cryoSPARC 81. The purification of proteins, assembly of the integration complex, and analysis of these samples by SDS-PAGE or Urea-PAGE was performed once. Micrographs were collected on two separate occasions with similar results.
Extended Data Fig. 2 | Cas1-2/3 undergoes a large structural rearrangement during integration. a, The Cas3 domains of the Cas1-2/3 complex have undergone a ~100° rotation in the structure of the integration complex as compared to Cas1-2/3 alone 54. The positions of the first (90) and last residues (110) of the Cas2/3 linker are shown (Cas2/3a, red; Cas2/3b, tomato). The linker residues were not resolved for the previously determined pseudo-atomic model of Cas1-2/3 alone, and so are not shown for either complex for clarity 54. The rotation of the Cas3 domains outwards unveils new DNA binding sites on two opposing faces of the Cas2 dimer. b, Zoom-in on the atomic fit of the Cas2/3 linker to the cryo-EM map. The Cas2/3 linker is disordered in the absence of Cas1 (PDB: 5B7I) 55, but has become ordered in the I-F integration complex structure due to packing by the foreign DNA against the Cas1 beta hairpins. c, A sequence logo depicting the conservation of Cas2/3 linker residues (top). The Cas1 residues that contact the Cas2/3 linker are conserved in Cas1 proteins associated with type I-F CRISPR loci (middle), but are not conserved in the closely related Cas1 proteins associated with type I-E CRISPR loci that have similar IHF motif-containing leaders 22. Residues are numbered according to P. aeruginosa PA14 Cas1 and Cas2/3 proteins.
Extended Data Fig. 7 | PAM blocks Cas-mediated integration of foreign DNA into CRISPR repeat. a, Schematic summarizing the step-wise strand-transfer reactions catalyzed by a Cas integrase. In the absence of putative host nucleases that trim PAMs, the foreign DNA fragments that contain a PAM stall the reaction at leader-side integration. However, the foreign DNA fragments that have been trimmed (no PAM) proceed through leader- and spacer-side integration. b-d, Endpoint integration reactions performed with a PAM-containing foreign DNA in triplicate, resolved on denaturing polyacrylamide gels. The X1 and X2 lanes signify lanes that were not further analyzed for this manuscript. e-g, Endpoint integration reactions performed with a trimmed foreign DNA in triplicate, resolved on denaturing polyacrylamide gels. The X1 and X2 lanes signify integration substrates that were not analyzed for this manuscript. h, i, Quantification of leader- (grey circles) or spacer-side (white circles) integration events from all three replicate gels. Individual dots for each triplicate reaction are shown, and some dots overlap. Three independent gels were run for PAM-containing or trimmed foreign DNA integration reactions with similar results.

Extended Data Fig. 8 | Control reactions for integration assay and generation of 32P-labelled ladder. a, Scheme of nine CRISPR fragments used for in vitro integration assays. Each CRISPR locus contains two repeats and two spacers. Leader motifs are color-coded and annotated (IR, inverted repeat; DR, direct repeat; IHF, IHF binding site; LAS, leader anchoring site). To simplify the pictograms shown here and in subsequent panels, a single-colored rectangle was used to represent a given collection of leader motifs, and a single diamond was used to represent a CRISPR locus composed of two repeats and two spacers. b, The four CRISPR repeats tested in the integration assays have diverse palindromes (yellow), and a wide range of GC-content. c, Control reactions in which all components necessary for integration, except Cas1-2/3, were incubated. The overexposed gels show that the majority of the 32P signal for a given integration substrate DNA corresponds to the full-length strands. d, A custom 32P-labelled DNA ladder was made by mixing the degradation products generated by individual restriction enzyme digests of different 32P-labelled integration substrate DNAs. X1 and X2 signify integration substrates that were not analyzed for this manuscript. e, Schematics of the nine CRISPRs tested in Extended Data Fig. 5. The arrows identify locations of off-target integration reactions. Most off-target integration reactions occur by spurious integration at DNA motifs (for example, second CRISPR repeat, IHF distal site, or upstream motifs) found near the ends of the CRISPR DNA target. Previous deep-sequencing of similar integration reactions has shown that the second repeat is a common off-target integration site, and that IHF blocks integration at the IHF binding site 22. Urea-PAGE gels were run once.
Extended Data Fig. 9 | Validation of Cas1-2/3 interactions with the repeat. a, Time-course integration reactions to compare rate of integration into I-E and I-F repeats downstream of an I-F leader. b, Quantification of time-course experiments to determine the impact on integration of the 19 mutations associated with swapping the I-F repeat for the I-E repeat. Leader-side integration is indistinguishable, but spacer-side integration is slower into the I-E repeat. The mean and standard deviation of three replicate experiments are shown. c, Time-course integration reactions to measure the impact of I-F repeat mutations on integration rate. d, Time-course integration reactions to measure the impact of I-F repeat mutations on integration rate, in the context of a Cas1 E184A mutation. The Cas1 E184A mutation is expected to disrupt 5′ G recognition, but also impacts stability of the Cas1-2/3 complex (Extended Data Fig. 5). e, Quantification of time-course experiments to determine the impact of I-F repeat mutations on integration rate, in the context of WT Cas1-2/3. The mean and standard deviation of three replicate experiments are shown. f, Quantification of time-course experiments to determine the impact of I-F repeat mutations on integration rate, in the context of Cas1 E184A -2/3. In panels e and f, no integration occurs into either the 'G1A,T2A,A28C' or 'G1A,T2A' repeat mutants, and the datapoints for these plots overlap at roughly Y = 0 over the time course. The mean and standard deviation of three replicate experiments are shown. Each Urea-PAGE gel was run once.

Fig. 1 | Cryo-EM structure of the type I-F CRISPR integration complex. a, Scheme of the I-F CRISPR system of P. aeruginosa PA14. A CRISPR is composed of repeated DNA sequences (diamonds) interspersed with unique spacer sequences (black squares). The CRISPR is adjacent to six cas genes (arrows). Four Cas1 and two Cas2/3 proteins assemble into a heterohexamer in which the Cas3 and Cas1 subunits surround the central Cas2 homodimer like petals of a closed flower (Cas1 4 -Cas2/3 2 ). Cas1-2/3 and IHF proteins cooperate with DNA upstream of the CRISPR (leader) to integrate foreign DNA at the first repeat. The leader sequence contains two IHF-binding sites and two IRs that are necessary for integration of foreign DNA at the leader-repeat junction. In addition to playing a central role in integration, the Cas2/3 fusion is recruited to DNA-bound Cascade CRISPR

Fig. 2 | Foreign DNA constrains the Cas2/3 linker against conserved Cas1 residues. a, View of the foreign DNA-bound face of Cas1-2/3. The foreign DNA, Cas2 subunits, Cas2/3 linker and Cas1 beta hairpins that contact the start and end of the Cas2/3 linker are shown in solid while other parts of the complex are shown at 40% transparency for clarity. Insets outline locations of close-up views shown in panels b-e. b,c, The foreign DNA constrains the Cas2/3 linker against each Cas1 subunit (Extended Data Fig. 2b). Cas2, the Cas2/3 linker and Cas1 cooperate to bind the foreign DNA body and to splay the ends of the foreign DNA. Histidine wedges in Cas1 measure out a central foreign DNA duplex of 22 base pairs. Most DNA-binding residues are conserved or undergo conservative mutations (Extended Data Fig. 3). Close-up view of Cas1a and Cas2a interface (b) and close-up view of Cas2b and Cas1b interface (c). d,e, Conserved Cas2/3 linker residues (blue, sticks) contact residues conserved in Cas1 proteins from type I-F CRISPR systems (mauve, surface) (Extended Data Fig. 2b,c). Cas2 and Cas3 domains are shown at 90% transparency for clarity. Close-up view of Cas1a and Cas1a* interface (d) and close-up view of Cas2b and Cas1b* interface (e). Inset shows the Cas1 sequence conservation color key.

Fig. 3 | The Cas2 homodimer simultaneously coordinates four dsDNA helices critical to CRISPR integration. a, The Cas2 homodimer (pink surface) is flanked by DNA on four sides. Previous structures have shown that the CRISPR repeat (yellow) and foreign DNA (red) are bound to opposite faces of Cas2. Here, we show that symmetrical surfaces on Cas2 also bind IR (left and right) motifs in the leader. Surface representations of the IHF heterodimers, Cas1 homodimers, Cas2/3 linkers, Cas3 domains and the 3′ overhang of the foreign DNA are shown in 100% transparency for clarity. Each Cas2 inserts an arginine (R55) into the center of the IRs, which stacks between deoxyribose sugars, and additional polar residues (R54, N56, R12 and K11) contact the DNA backbone. Cas2 induces 25-35° bends in the DNA (Extended Data Fig. 4c,e). The sequence logos of the type I-F IR proximal (left) and IR distal motifs (right), and the IR sequences present in the P. aeruginosa PA14 CRISPR leader, are shown. b, Views of the foreign DNA- (left) and repeat-bound (right) faces of Cas1-2/3 are shown in surface representation and colored by Coulombic potential. For clarity, the highly electronegative DNA is shown in cartoon representation. Labels highlight highly basic and conserved surfaces of each Cas1-2/3 subunit that accommodate the packing of four dsDNA helices in proximity around the Cas2 homodimer (Extended Data Fig. 3). IHF heterodimers are shown in 100% transparency for clarity. The phosphate-to-phosphate distances of DNA helices packed around Cas2 are noted.

Fig. 4 | Sequence motifs in the leader and IHF proteins facilitate Cas1-2/3-based integration into diverse repeat sequences. a, Scheme of reactants and products of in vitro CRISPR integration assays (Extended Data Figs. 6, 7 and 9). b, Four CRISPR repeats used in the integration assays. A gapped sequence alignment highlights two identical (asterisks) and six similar (dots) positions. An ungapped sequence alignment reveals nine identical nucleotide positions between the I-F and I-E repeats. All four repeats have different internal palindromes and GC content (Extended Data Fig. 8b). c, Endpoint integration reactions with CRISPR repeat-swapped mutants, resolved on denaturing polyacrylamide gels. One of three representative gel images is shown (Extended Data Fig. 7). Quantification of leader- (gray circles) or spacer-side (white circles) integration events from all three replicate gels (Extended Data Fig. 7). The reactions were performed in triplicate, each dot represents one reaction, and some dots overlap. d, Four-minute time point of time-course integration

Fig. 6 | DNA is a flexible scaffold that controls DNA mobilization. a-c, Structures for the I-E CRISPR integration complex (a), I-F CRISPR integration complex (b) and lambda-phage excision complex (c). DNA shown as a surface, IHF (purple) and all other proteins shown as transparent cartoons. Integration and excision sites, along with DNA motifs that regulate DNA mobilization, are labeled and colored according to the schematic (bottom).

Extended Data Fig. 3 | Conservation analysis of Cas1-2/3 residues involved in DNA binding and integration. a, Conservation of Cas1 and Cas2/3 residues involved in binding the foreign DNA, or catalyzing the strand transfer reaction, or catalyzing the degradation of nucleic acids. See Fig. 2. b, Conservation of basic and polar Cas1 and Cas2/3 residues involved in accommodating the DNA duplexes bound by the Cas1-2/3 complex during integration. See Fig. 3. Residues are numbered according to P. aeruginosa Cas1 and Cas2/3 proteins.
Extended Data Fig. 4 | Cas1-2/3 predominantly recognizes IR motifs, CRISPR repeat and foreign DNA through non-sequence specific interactions. a, Splayed 3′ ends of the foreign DNA are directed into the Cas1 transesterification active site. The product of the first strand-transfer reaction is shown in the Cas1a* active site (top), and the 3′ OH of the other end of the foreign DNA is positioned in the Cas1b* active site (bottom). The cryo-EM map is shown in transparent grey. b, Zoom-in on the Cas1-2/3 contacts to the CRISPR repeat (ChimeraX contacts command with default parameters). Most protein contacts occur to the DNA backbone and minor groove. Cas1 residue E184 appears to probe nucleotide G1 of the repeat. c, Zoom-in on the Cas1-2/3 contacts to the IR leader motifs. Most protein contacts occur to the DNA backbone and minor groove. d, DNAproDB analysis of Cas1-2/3 interactions with the 3′ ends of foreign DNA and the CRISPR leader-repeat junction. For clarity, only protein interactions to the nucleobases are shown 93. e, DNAproDB analysis of Cas1-2/3 interactions with the IR leader motifs 93. For clarity, only protein interactions to the nucleobases are shown. f, Zoom-in on the atomic fit of the base-pairs around the leader-repeat junction (-3 and +3 bps, coordinated by Cas1a*), to the cryo-EM map. Tension in the DNA loop at the leader-repeat junction has been released in the post-integration structure by a physical separation of base-pairs, as measured by an increase in base step rise. This tension may further pull the leaving 3′ OH out of the Cas1 transesterification active site, to inhibit disintegration of the foreign DNA from the repeat. The approximate local base step rise was calculated using the http://web.x3dna.org/ webserver. g, A bioinformatic analysis of the first repeat from 24,940 CRISPR loci reveals that a 5′ GT dinucleotide is strongly conserved across most CRISPR subtypes. Similarly, a 5′ GT is present at the spacer-end of the repeat (seen as AC-3′ on the sense strand) within certain CRISPR subtypes (I-D, II-C, III-C, III-D, V-B, V-E, V-K, VI-A), but it is not broadly conserved.
Extended Data Fig. 5 | Purification of structure-guided mutants of Cas1-2/3. a, SEC profile of a new preparation of wildtype Cas1-2/3 and all variants purified in the same manner on a Superdex 200 16/600 (Cytiva). An excess of free Strep-tagged Cas1 elutes at approximately 82 mL. b, SDS-PAGE gel of the Cas1-2/3 hetero-hexamer peak for all purified Cas1-2/3 variants. The SDS-PAGE gel of all Cas1-2/3 samples was run twice with similar results.
Extended Data Fig. 6 | Validation of Cas1-2/3 interactions with the foreign DNA and IR motifs. a, Time-course integration reactions to test the role of Cas1 H25 in splaying the foreign DNA ends. Integration reactions were performed with trimmed foreign DNA (lacking a PAM) in triplicate, resolved on denaturing polyacrylamide gels. Timepoints were taken at 0, 1, 2, 4 and 8 minutes. Reactions were stopped by the addition of phenol. A 32P-labelled DNA that is shorter (140-160 bp) than the full-length CRISPR is present in some DNA preparations (also see Extended Data Fig. 6c). Full-length CRISPR DNA and the leader- and spacer-side integration products do not overlap with this band. Further, Cas1-2/3, foreign DNA and IHF are in excess over the 32P-labelled DNA. The 140-160 bp band does not interfere with the quantification or generation of integration products. b, Quantification of time-course experiments to determine the role of Cas1 residue H25 in integration. The mean and standard deviation of three replicate experiments are shown. The Cas1 H25A mutant integrates splayed and fully complementary foreign DNA fragments less efficiently than WT Cas1-2/3, suggesting that H25 steers the non-nucleophilic DNA strand away from the Cas1 active site. These results mirror the previously published effect of the type I-E Cas1-2 tyrosine wedge mutation 10. c, Time-course integration reactions to test the role of Cas2 residues in recognition of the IR motifs in the leader, performed as in panel a. d, Quantification of time-course experiments to determine the role of Cas2 residues K11, R12, R55 and N56 in integration. The mean and standard deviation of three replicate experiments are shown. The Cas2 R55E,N56D /3 mutant retains a small amount of integration activity. The Cas2 K11D,R12E /3 and Cas2 K11D,R12E,R55E,N56D /3 mutants do not integrate DNA into the I-F CRISPR. Quantification of leader- (grey circles) or spacer-side (white circles) integration events from all three replicate gels. Individual dots for each triplicate reaction are shown, and some dots overlap.