Distribution of RAG1L-RAG2L gene pairs across phyla
To search for additional RAGL transposons in disparate phyla, RAG1L and RAG2L amino acid sequences from invertebrate deuterostomes [15, 18, 19] were used to scan recently updated databases including Whole-Genome Shotgun Contigs (WGS), High Throughput Genomic Sequences (HTGS) and Transcriptomic Shotgun Assembly (TSA). Scans were performed on all metazoan invertebrate and jawless vertebrate projects available before February 2019. RAG1L proteins, possibly because of their direct role in catalysis, exhibit greater evolutionary conservation than RAG2L proteins, and detecting evolutionarily-distant RAG2L homologues using existing genome-wide blast techniques is challenging. For example, using tblastn [24, 25] and mouse or human RAG2 as query does not detect RAG2L from amphioxus WGS data (data not shown), and standard blast searches failed to identify RAG2L sequences in the purple sea urchin genome [15]. We overcame these difficulties using an iterative blast search approach (see Methods), which allowed the identification of new RAG1L-RAG2L gene pairs in protostomes, in both the Mollusca and Nemertea phyla, and in cnidarians (Fig. 1b, c). Notably, most RAG2L sequences identified were found to reside adjacent to RAG1L sequences, and in such gene pairs, the two genes invariably resided in transcriptionally convergent (tail-to-tail) configuration, an organization characteristic of RAGL transposons and jawed-vertebrate RAG loci (Fig. 1a).
In Mollusca, RAG1L-RAG2L gene pairs were found in class Bivalvia, subclass Pteriomorphia, in oysters (eastern oyster (Crassostrea virginica), pacific oyster (Crassostrea gigas - as previously reported by Kapitonov and Koonin [16]) and sydney rock oyster (Saccostrea glomerata)), mussels (Philippine horse mussel (Modiolus Philippinarum) and the deep-sea mussel (Bathymodiolus platifrons)) and in Pterioida (akoya pearl oyster (Pinctada imbricata)) (Fig. 1c and Additional file 1: Figure S1). In Nemertea, numerous RAG1L-RAG2L gene pairs were detected in the ribbon worm Notospermus geniculatus, the only species with DNA or mRNA sequence data reported from this phyla. In N. geniculatus, several of the RAGL genes identified were supported by mRNA transcriptomic data (Fig. 1c and Additional file 1: Figure S1).
We also identified new RAG1L-RAG2L gene pairs in recently-sequenced invertebrate deuterostome genomes from the amphioxus Branchiostoma lanceolatum and the sea urchin Hemicentrotus pulcherrimus; these new elements are quite similar in sequence (>60% protein sequence identity within the RAG1L core region) to those previously identified in amphioxus [18] and purple sea urchin [15], respectively (Additional file 1: Figure S1, and see below). Application of our search strategy to available sequence data from agnathans (jawless vertebrates) failed to identify RAG1L or RAG2L sequences.
Finally, WGS scans detected RAG1L-RAG2L gene pairs outside the Bilaterian clade in three species in the phylum Cnidaria: anthozoan stony coral Porites rus, mountainous star coral Orbicella faveolata, and moon jellyfish Aurelia aurita from the jellyfish Scyphozoan group (Fig. 1c and Additional file 1: Figure S1). Only a single RAG1L-RAG2L gene pair was detected in each species of coral, while in the moon jellyfish, one intact and two degenerate pairs were detected. Given the current low quality of the WGS data from these species, we did not attempt further analyses and interpret these findings cautiously. The intact RAG1L-RAG2L element from A. aurita is predicted to encode RAG1L and RAG2L proteins with striking conservation of functionally-important domains and sequence features (see below).
These findings indicate that RAG1L-RAG2L gene pairs are present not only in deuterostomes, but in protostomes and potentially non-bilaterians as well (Fig. 1 and Additional file 1: Figure S1).
Identification of TIRs and TSDs flanking protostome RAGL transposons
TIRs define the boundaries of a transposable element and serve as sites that direct the binding and cleavage of transposase during transposition [26]. We searched for TIRs flanking RAG1L-RAG2L gene pairs using a homology variation-based approach that was based on the expectation that copies of a transposable element inserted in different sites in the genome should share a higher degree of homology within the elements than in their flanking regions (see Methods). We also required the presence of TSDs, short direct repeats flanking the 5'-TIR and 3'-TIR that arise as a consequence of staggered attack of the transposon ends on target DNA during transposition [26]. While many of the new RAG1L-RAG2L gene pairs identified failed to satisfy our stringent TIR criteria, multiple elements with TIRs and TSDs were found in three protostome species: eastern oyster (C. virginica), akoya pearl oyster (P. imbricata), and N. geniculatus (ribbon worm) (Fig. 1 and Additional file 1: Figure S1).
While within a species TIRs contain extended regions of sequence similarity, between species the sequence similarity is largely confined to the outside termini of the TIRs, in a region of about 13 to 15 bp (Fig. 2a, Additional file 2: Figure S2a). The first 3 bp of protostome TIRs are 5'-CAC, matching the invariant and functionally critical first 3 bp of the RSS and the TIRs of Transib and deuterostome RAG transposons including ProtoRAG. The protostome TIR consensus, CACTWMCAAACKTYKBB, also includes a highly conserved AAA sequence at positions 8-10 that aligns with an A-rich region in deuterostome RAGL transposon TIRs [18, 19] (Fig. 2a). The TSDs found flanking protostome RAGL transposons are invariably 5 bp in length (Fig. 2a), similar to the predominant length of TSDs of ProtoRAG, Transib, and RAG [10, 16, 18-21].
In addition to complete RAGL transposons with the structure TSD-5'TIR-RAG1L-RAG2L-3'TIR-TSD, numerous incomplete RAGL transposons were identified which lack one or both TIRs (and hence TSDs) and/or contain a solitary RAG1L or RAG2L gene (Fig. 1c and Additional file 1: Figure S1). We also detected a small number of 5'-3' TIR pairs in which the intervening DNA lacked an intact RAG1L or RAG2L gene, structures which are known as miniature inverted repeat transposable elements (MITES) [27]. Such RAGL MITES, often flanked by TSDs, were found in the oyster P. imbricata and in several deuterostomes (Additional file 2: Figure S2b). The occurrence of complete and incomplete RAGL transposon configurations indicates the existence of potentially active, fossilized, and possibly domesticated RAGL transposon copies.
Curiously, for the mussel M. philippinarum, WGS data contain a single RAG1L-RAG2L gene pair in which the genes are incomplete and contain stop codons, as well as unpaired RAG1L and RAG2L loci that have the potential to encode full length, intact protein products (Fig. 1c and Additional file 1: Figure S1).
Phylogenetic analysis of protostome RAGL sequences suggests vertical transmission
Phylogenetic analysis of RAG1L sequences is more informative than that of RAG2L sequences because of RAG1L's greater sequence conservation between lineages [9]. We constructed phylogenetic trees of RAG1L protein sequences using several different algorithms (Fig. 2b and Additional file 3: Figure S3, Additional file 4: Figure S4), which consistently yielded a tree structure similar to that of species phylogeny (Fig. 3). This finding suggests that the RAGL transposon evolved primarily by vertical transmission in deuterostomes and protostomes, with many duplication and loss events within clades, consistent with our previous analysis of deuterostome RAG1L sequences [19]. While our results are fully consistent with the hypothesis that the RAGL transposon was present in the bilaterian ancestor, we cannot rule out alternative scenarios that include horizontal gene transfer events. As expected, the phylogenetic analysis of RAG2L protein sequences was largely uninformative, with the only branch with >50% bootstrap support being consistent with the observations deduced from the RAG1L proteins tree (Additional file 3: Figure S3e).
Our prior phylogenetic analyses indicate the existence of several families of RAGL proteins in deuterostomes [19]. The RAGL_A family, in the hemichordate P. flava, is the closest relative of vertebrate RAG. RAGL_C family members have also only been identified in P. flava. The RAGL_B family appears to be more widespread, with members identified in cephalochordates (including ProtoRAG), in several echinoderm lineages, and in P. flava [19]. Inclusion of protostome RAGL sequences in our analysis revealed four families: the RAG_A family composed of RAG in jawed vertebrates (Gnathostomata) and the RAGL_A family in P. flava; the RAGL_B family in all clades except jawed vertebrates; the RAGL_C family in P. flava; and the RAGL_D family in the nemertean N. geniculatus (Fig. 2b). Because of the relative lack of data, little can be concluded about families A, C, and D except that the RAG_A family was probably present in the deuterostomian ancestor. The most widespread and conserved family is RAGL_B, which is present in numerous copies in every species examined thus far except for jawed vertebrates. This distribution is consistent with a RAG1L_B-RAG2L_B gene pair being present in the bilaterian ancestor and its subsequent loss in jawed vertebrates.
RAG1L sequences from the class Pteriomorphia of bivalves, which includes P. imbricata, form a monophyletic group belonging to the RAGL_B family (Fig. 2b and Additional file 3: Figure S3c). Overall, the RAG1L phylogeny corresponds to the consensus species phylogeny (Fig. 3) except for some RAG1L sequences from the pearl oyster P. imbricata. The findings suggest that a duplication occurred in the Bivalvia clade, which led to the RAGL_B_Bivalvia1 and RAGL_B_Bivalvia2 subfamilies (Additional file 3: Figure S3c). Based on the RAG1L family tree, two distinct RAGL families were found in N. geniculatus, RAGL_B and RAGL_D (Fig. 1c, 2b). Interpretation of our findings (see Discussion) is limited by the availability of data in the databases (Fig. 3 and Additional file 6: Table S1). For example, we cannot be confident about the absence of RAGL sequences in some groups, as the number of sequenced species is not homogenous among clades.
Protostomes encode intact and potentially functional RAG1L and RAG2L proteins
The ProtoRAG-encoded BbeRAG1L-BbeRAG2L proteins from the cephalochordate amphioxus together constitute an active endonuclease/transposase in vitro and in vivo and contain domains corresponding to all of the functionally essential "core" subdomains of jawed vertebrate RAG1 and RAG2 (Fig. 4) [18, 22]. These core subdomains are: the nonamer binding domain (NBD), the dimerization and DNA binding domain (DDBD), the Pre-RNase H domain (PreR), the catalytic RNase H domain (RNH), two zinc binding domains (ZnC2 and ZnH2) that together coordinate a zinc ion, and the C-terminal Domain (CTD) (Fig. 4). The RAG1 C-terminal tail (CTT) is not required for catalytic activity [4], while the corresponding domain from BbeRAG1L (CTT*; an asterisk specifies domains derived from RAG1L proteins) is critical for full activity and is part of the BbeRAG1L core region [18, 22]. Both RAG1 and BbeRAG1L contain zinc finger motifs C1/C1*, C2/C2*, and C3/C3*, as well as the ring zinc finger dimerization domain (ZDD/ZDD*) in their non-essential N-terminal regions (Fig. 4a). The RAG2 core domain is a 6-bladed b-propeller composed of 6 kelch repeats [5]; this region, but not the RAG2 C-terminal domain with its acidic hinge and plant homeodomain (PHD) finger, is found in BbeRAG2L (Fig. 4) [18].
We analyzed the sequences of the predicted RAG1L and RAG2L proteins from protostomes to determine if these species, like amphioxus, had the potential to encode active RAGL complexes. Indeed, the phyla Mollusca and Nemertea each harbor multiple pairs of intact RAG1L-RAG2L open reading frames able to encode the kelch repeat domain of RAG2/BbeRAG2L and all of the essential core subdomains found in RAG1 and BbeRAG1L, including CTT* of BbeRAG1L (Fig. 4a; Fig. 5 and 6 show sequence alignments of selected RAG1L and RAG2L proteins, respectively, while alignments of all of the RAG1L and RAG2L protein sequences identified are shown in Additional file 7: Alignment S1a,b). In Mollusca, such pairs are observed in two species (C. virginica and P. imbricata), while the nemertean N. geniculatus harbors at least six different intact RAG1L-RAG2L protein pairs. Conservation of the core domains is also observed in the RAG1L-RAG2L pair identified in the cnidarian A. aurita (Fig. 4, 7 and Additional file 5: S5a,b).
Many protostome RAG1L proteins also exhibit substantial conservation with the RAG1 N-terminal non-core region. The three N-terminal zinc finger motifs are well conserved among most of the newly detected RAG1L homologs while ZDD* is readily identified in RAG1L sequences from Mollusca (Fig. 4a, Additional file 7: Alignment S1a). However, as in RAG1L proteins from echinoderms [15, 16, 19], ZDD* is absent from all RAG1L proteins from N. geniculatus and from RAG1L of A. aurita (Fig. 4a, Additional file 7: Alignment S1a). This suggests that this domain, which in RAG1 forms a tight dimer with E3 ubiquitin ligase activity [28], has undergone at least two independent loss events during RAG1L evolution.
With the exception of RAG2L from amphioxus [18], all invertebrate RAG2L proteins contain a C-terminal PHD finger (Fig. 6a, 8, 10b).
Patterns of sequence conservation in RAG1L and RAG2L domains
We analyzed the patterns of conservation of key amino acid residues and domains of protostome RAG1L and RAG2L proteins to provide further insight into their potential functional properties and evolutionary relationships. Analysis of levels of sequence identity within the core region of RAG1L proteins reveals a broad correspondence with species phylogeny, with levels of identity highest between the RAG1L_B family sequences from protostomes and deuterostome invertebrates (Fig. 7a). And in general, core RAG1L sequences from invertebrates exhibit greater identity to one another than to core RAG1 sequences from jawed vertebrates, with the exception of the RAG1L_A family member from P. flava. Transib sequences diverge most strongly, exhibiting less than 22% sequence identity with the RAG1 and RAG1L proteins analyzed. Transib's low sequence identity with RAG1/RAG1L and absence of elements corresponding to the RAG1 N-terminal non-core region allow one to distinguish between Transib and RAG1/RAG1L proteins. As was the case for RAG1/RAG1L, RAG2 and RAG2L core region sequences exhibit higher levels of identity within invertebrates than between jawed vertebrates and invertebrates (Fig. 7b). As expected, overall levels of sequence identity are lower for RAG2/RAG2L than for RAG1/RAG1L.
Numerous stretches of conservation are observed between protostome and deuterostome core RAG1L/RAG1 sequences beginning with the preR domain and extending to the CTD (Fig. 5 and Additional file 7: Alignment S1a). In addition, numerous functionally/structurally important residues are highly conserved in RAG1L sequences from protostomes. These include critical catalytic residues [4] and four zinc-coordinating residues from ZnC2 and ZnH2 that stabilize domain folding (Fig. 5, Additional file 5: Figure S5a and Additional file 7: Alignment S1a). The Cx2Cx3GHx4C motif that defines CTT* of BbeRAG1L is found in essentially all RAG1L sequences from protostomes and deuterostome invertebrates (Fig. 5 and Additional file 7: Alignment S1a). Numerous other potential zinc coordinating residues are also conserved in protostome RAG1L sequences including many in ZDD*, C1*, C2*, and C3* (Fig. 4a, and Additional file 7: Alignment S1a). Most protostome RAG1L proteins contain valine at the position equivalent to mouse RAG1 E649, and mutation of E649 to V or A increases the propensity of RAG to perform asynchronous, or "uncoupled" cleavage in vitro and in cells [22, 29] (Additional file 5: Figure S5a). Virtually all protostome RAG1L proteins contain a hydrophobic amino acid at the position equivalent to mouse RAG1 R848, a change that strongly activates the transposition activity of RAG in vitro and in cells [22] (Additional file 5: Figure S5a). Furthermore, invertebrate RAG2L proteins reported here and previously lack the acidic linker that exists between the RAG2 core and the PHD in jawed vertebrate RAG2 proteins (Figure 6 and Additional file 7: Alignment S1b), and this acidic region has been shown to inhibit RAG-mediated transposition in cells [22]. These latter observations are consistent with the idea that protostome RAGL enzymes are, or evolved from, active transposases.
The core region of protostome RAG2L proteins preserve the structural features of a kelch-type domain including the GG motif that typifies the second b-strand of each kelch repeat (Fig. 6 and Additional file 7: Alignment S1b). The core region of protostome RAG2L is invariably followed by a cysteine-rich PHD, but the pattern of C and H zinc-coordinating residues found in protostome and deuterostome invertebrate PHDs (Cx5-7Cx14-15Cx2-4Cx5Cx2Hx12-18Cx2C) differs considerably from that seen in vertebrate RAG2 PHDs (CCx2Cx22Cx5-6Hx2Hx2Cx19Cx2H) (Additional file 5: Figure S5b). The remarkable conservation of the C/H pattern in invertebrate RAG2L PHDs and its divergence from the pattern observed in its vertebrate counterparts suggest structural and functional differences that are as yet largely unexplored. The RAG2L PHD from the purple sea urchin S. purpuratus is capable of binding the tail of histone H3 when lysine 4 is methylated, although its preference for dimethylated lysine differs from the trimethylation preference of the mouse RAG2 PHD [30].
Together, these sequence analyses argue that many RAG1L-RAG2L protein pairs from protostomes have the potential to be active endonucleases with transposase activity.
Analysis of protein-DNA and protein-protein interaction surfaces
The availability of RAG1L-RAG2L sequences from protostomes provided an opportunity for a broad evolutionary examination of the conservation of interaction surfaces in the complexes formed by these proteins with each other and with DNA. This in silico analysis involved mapping representative RAG1L and RAG2L sequences from protostomes and deuterostomes (the seven species whose sequences are shown in Fig. 5 and 6) onto the recently reported BbeRAG1L-BbeRAG2L three-dimensional structure, which closely resembles that of vertebrate RAG1-RAG2 [22]. This revealed strong conservation of the RAG1L DNA binding groove in regions that interact with both the TIR heptamer and the TIR heptamer-flanking region (Fig. 8A, B). This binding region contains numerous basic amino acid residues, creating a positively charged surface for DNA interaction [22]. This observation, combined with the high sequence conservation that surrounds the active site residues D701, E764, D811, H894 and E1063 in BbeRAG1L (Fig. 5, Additional file 5: Figure S5a and Additional file 7: Alignment S1a), suggests that protostome RAGL proteins have the potential to interact with and cleave DNA in a manner similar to that of RAG and BbeRAGL.
In contrast, the portions of RAG1L predicted to interact with RAG2L are less well conserved (Fig. 8a, b), and reciprocally, the portions of RAG2L predicted to interact with RAG1L also show high variability (Fig. 8c, d). Hence, the predicted RAG1L-RAG2L protein-protein interaction surfaces appear to have evolved more rapidly than the central DNA binding groove.