Comprehensive identification of transposon-mobility genes in the C. elegans genome
To identify transposon copies on the latest C. elegans genome assembly (VC2010), we used RepeatMasker (RM), an algorithm that identifies repetitive DNA elements in a genome (https://www.repeatmasker.org/). Our RM analysis showed that repeat elements occupy 13.86% of the C. elegans genome (Table 1). This result is consistent with previous reports that 12–17% of the C. elegans genome is repeat elements (Kanzaki et al., 2018; Laricchia et al., 2017; Stein et al., 2003). After excluding simple, satellite, and low-complexity repeat elements, we identified 62,773 copies of transposons. To identify protein-coding elements among these 62,773 transposon copies, we used an ab initio gene finder, SNAP, which was shown to identify 97% of ORFs on the C. elegans genome (Korf, 2004). Our SNAP analysis identified 808 potential gene-coding regions. Among them, 428 genes conserved the complete ORF structure. To infer the function of genes encoded in these complete ORFs, we used DIAMOND (Double Index AlignMent Of Next-generation sequencing Data), an algorithm for fast and sensitive protein alignment (Buchfink et al., 2014). DIAMOND identified 285 transposon-mobility genes (80 RT, 11 IN, 189 TP, and 5 REP-HEL genes).
Transposon-mobility genes in LTRs in C. elegans
Among the 80 RT genes, 19 genes were encoded in LTRs (rtz_LTRs) and 61 genes were encoded in LINEs (rtz_LINEs) (Supplementary Tables 1 and 3). RTs, i.e. RNA-dependent DNA polymerases, have 5 evolutionarily conserved motifs (A, B’, C, D, and E) (Poch et al. 1989; Xiong and Eickbush 1990). Asp residues in Motifs A and C are widely conserved in all RNA/DNA-dependent DNA polymerases and DNA/RNA-dependent RNA polymerases (Delarue et al. 1990). Previous crystal structure analyses showed that one Asp in Motif A and 2 Asp residues in Motif C form a catalytic triad for holding a bivalent metal ion for conjugating the alpha phosphate of a new dNTP to the OH group to the 3’ end of the DNA strand (Jacobo-Molina et al. 1993; Steitz 1998; Le Grice 2012). Mutations in Asp residues of Motifs A or C reportedly abolish RT activity (Larder et al., 1987). By aligning RTZ_LTRs with the reference RTs, we identified 8 RTZ_LTRs that conserved Asp residues in Motifs A and C (red asterisks in Fig. 1A). Additionally, these 8 RTZ_LTRs preserved 1) the conserved Gly residue in Motif B’ (Smith et al. 2006), which functions for interacting with the incoming nucleotide and template strand (Mitchell et al. 2010), and 2) the Leu and Gly residues in Motif E, which function for fixing the primer strand and positions it toward the active site (Mitchell et al. 2010) (Fig. 1C).
In RTZ_LTR-11, -17, and − 19, the second Asp residue in Motif C was substituted with Asn (black asterisk in Fig. 1A). As described above, the second Asp residue is involved in a catalytic triad. Site-directed mutagenesis at the second Asp residue, including mutagenesis to Asn as found in RTZ_LTR-11, -17, and − 19, significantly reduces RT activity in human immunodeficiency virus (HIV) (Le Grice et al. 1991; Boyer et al. 1992; Kaushik et al. 1996). However, amino acid substitution at the second Asp to Asn has been found in functional RNA-dependent RNA polymerases in negative-strand RNA viruses (Poch et al. 1989; Barik et al. 1990; Delarue et al. 1990). Therefore, we considered the possibility that RTZ_LTR-11, -17, and − 19 might still be functional.
In RTZ_LTR-11 and − 19, the conserved Gly and Lys in Motif D were substituted. In RTZ_LTR-3, -8, and − 17, the conserved Gly in Motif D was substituted. Amino acid substitutions at Gly in Motif D are often observed in RNA-dependent RNA polymerases in negative-strand RNA viruses (Poch et al. 1989). Therefore, we considered the possibility that RTZ_LTR-3, -8, and − 17 might still be functional. On the other hand, Lys in Motif D is highly conserved throughout polymerases (Poch et al. 1989; Xiong and Eickbush 1990). Motif D functions for forming a phosphodiester bond with dNTPs with the 3’-OH (hydroxyl) of the primer (Canard et al. 1999; Castro et al. 2009). Motif D in RTZ_LTR-11 and − 19 may be nonfunctional. Nevertheless, given the conservation of the 2 critical amino acids holding a bivalent metal ion, and the fact that we did not perform functional testing, we held out the possibility that RTZ_LTR-11 and − 19 might be functional. Taken together, these results led us to classify all 8 rtz_LTRs as potentially functional genes (Supplementary Table 1).
Next, we identified 11 IN genes encoded in LTRs (inz_LTRs) (Supplementary Table 2). The catalytic core domain of IN has an evolutionarily conserved DD35E motif that is required for EN activity (Engelman and Craigie 1992; Kulkosky et al. 1992; Van Gent et al. 1992). DD35E holds 2 metal ions, which integrate a free transposon DNA into the genome (Hare et al. 2012; Maertens et al. 2021). Amino acid substitution at the 3 critical amino acid residues in the DD35E motif abolishes EN activity of INs in Rous sarcoma virus (RSV) and HIV (Drelich et al. 1992; Engelman and Craigie 1992; Kulkosky et al. 1992; Van Gent et al. 1992). By aligning the 11 INZ_LTRs with the reference INs, we identified 7 INZ_LTRs that conserved the DD35E triads (Fig. 1B and Supplementary Table 2). Therefore, we considered these 7 inz_LTRs as potentially functional genes. All of the inz_LTRs except inz_LTR-10 were encoded in the same LTR with the potentially functional rtz_LTR genes (Fig. 1C).
Transposon-mobility genes in LINEs in C. elegans
By aligning the 61 RTZ_LINEs with reference RTs, we identified 28 RTZ_LINEs that had Asp residues conserved in Motifs A and C (red asterisks in Fig. 2A). Additionally, these 28 RTZ_LINEs had residues conserved in Motifs B’ and C (black asterisks in Fig. 2A). These 28 RTZ_LINEs had Lys but not Gly residues conserved in Motif D. Amino acid substitution of Gly in Motif D is often observed in RNA-dependent RNA polymerases in negative-strand RNA viruses (Poch et al. 1989). Therefore, we held out the possibility that the RTZ_LTRs with amino acid substitution of Gly in Motif D were still functional. In RTZ_LINE-61, Gly residues in Motif E were substituted. However, given the conservation of the 2 critical amino acids for holding a bivalent metal ion, and the lack of a functional test, we held out the possibility that RTZ_LTR-61 might be still functional. Taken together, these results suggest that the RT domains of these 28 rtz_LINEs are potentially functional.
Next, by aligning the 61 RTZ_LINEs with reference ENs, we identified 14 RTZ_LINEs in which the critical Asp and His residues were conserved in their EN domains (red asterisks in Fig. 2B). A previous experimental test of transposition activity in 138 human L1 copies revealed that the Asp and His residues are essential, but conserved amino acid residues in other motifs tolerate multiple mutations (Kines et al. 2016). Among the 14 RTZ_LINEs, RTZ_LINE-19 and − 34 lacked a large N-terminal portion of the EN domain (Fig. 2B). Therefore, we concluded that the remaining 12 RTZ_LINEs have potentially functional EN domains. Taken together, we concluded that the 6 rtz_LINEs encoded both potentially functional RT and EN domains (rtz_LINE-5, rtz_LINE-13, rtz_LINE-22, rtz_LINE-57, rtz_LINE-59, and rtz_LINE-61; Figs. 2A and 2B and Supplementary Table 3).
Human L1 can retrotranspose via an EN-independent and RT-dependent mechanism (Morrish et al. 2002). Therefore, we concluded that the 28 rtz_LINEs with conserved RT domains (Fig. 2A) are potentially functional genes for LINE transposition. These 28 functional rtz_LINEs were distributed across the entirety of each chromosome (Fig. 2C).
Transposon-mobility genes in TIRs in C. elegans
TIR is composed of 19 super families: hAT, Tc1/mariner, CACTA (En/Spm), Mutator (MuDR), P, PiggyBac, PIF/Harbinger, Mirage, Merlin, Transib, Novosib, Rehavkus, ISL2EU, Kolobok, Chapaev, Sola, Zator, Ginger, and Academ (Yuan and Wessler 2011). The transposition of TIR elements is mediated by TP, which has a conserved DDD/E motif at the catalytic core (Yuan and Wessler 2011). Structural analysis suggests that the DDD/E motif holds 2 metal ions to cleave and integrate the TIR dsDNA element (Davies et al. 2000; Lovell et al. 2002; Richardson et al. 2009). Mutation of the conserved DDD/E motif abolishes TP activity (Bolland and Kleckner 1996; Naumann and Reznikoff 2002). By aligning the 189 TPZs with reference TPs, we identified 94 TPZs that had DDD/E motifs conserved that were homologous to those of Tc1/mariner family TPs (Fig. 3A and Supplementary Table 4). No TP aligned to DDD/E motifs of other TP super families. Thus, we considered these 94 tpzs to be potentially functional genes (Supplementary Table 4). The 94 functional tpzs were distributed across the entirety of each chromosome (Fig. 3B).
Transposon-mobility genes in Helitrons in C. elegans
Helitron transposition is mediated by a SF1 family helicase (HEL) with REP domain (REP-HEL). The 4 SF family HELs (SF1, SF2, SF3, and SF4) have conserved Motif I and Motif II domains. Motifs I and II correspond to the Walker A and Walker B domains, respectively, which are widely conserved among NTP-binding proteins (Walker et al. 1982; Hall and Matson 1999). The Walker A/Motif I and Walker B/Motif II domains exhibit conservation of the Lys and Asp-Glu residues, respectively (Walker et al. 1982; Gorbalenya and Koonin 1993; Koonin 1993; Ambudkar et al. 2006). Crystallographic analysis showed that Lys in Motif I contacts with magnesium ion and β phosphate of ATP, and functions to stabilize the transition state during ATP hydrolysis. Asp-Glu residues in Motif II are also involved in ATP hydrolysis (Velankar et al. 1999; Raney et al. 2013). Mutations at Lys in Motif I or Asp-Glu in Motif II abolish HEL activity (Brosh and Matson 1995; Weng et al. 1996; Graves-Woodward et al. 1997; Walker et al. 1997).
By aligning the 5 REP-HELs (RHZs) with reference HEL domains, we identified 5 RHZs that had Lys residues conserved in Motif I and had Asp-Glu residues conserved in Motif II (red asterisks in Fig. 4A). Additionally, these 5 RHZs had: 1) weakly conserved residues in Motif Ia (black asterisks in Fig. 4A), which functions for ssDNA binding and energy transfer from the ATP-binding site to the DNA-binding site (Raney et al. 2013); 2) conserved Gly-Asp and other resides in Motif III (black asterisk in Fig. 4A), which is involved in contacting nucleotide g-phosphates (Raney et al. 2013); 3) a conserved Arg residue in Motif IV, which may be involved in NTP hydrolysis (Raney et al. 2013); 4) conserved residues in Motif IV/V, the function of which is not well understood (Kapitonov and Jurka 2001); 5) conserved residues in Motif V, which interacts with the sugar-phosphate backbone of DNA (Raney et al. 2013); and 6) conserved residues in Motif VI, which may form part of the ATP binding cleft to couple ATPase activity to HEL activity (Raney et al. 2013) (Fig. 4A). Therefore, we considered these 5 rhzs to encode functional HEL domains.
In the REP domain, the HUH Y2 motif (in which U is a hydrophobic residue) is evolutionarily conserved. HUH holds divalent metal ions to form nicks in the DNA strand in EN activity, whereas Tyr residues form a transient covalent bond with the cleaved DNA strand to generate phospho-tyrosine for DNA strand transfer. The EN activity is abolished by mutation of either 2 His or 2 Tyr residues (Grabundzija et al., 2016). By aligning the 5 RHZs with the reference REP domains, we found that the 5 RHZs conserved the 2 His and 2 Tyr residues (red asterisks in Fig. 4B). Taken together, our results suggest that the 5 rhzs encode both potentially functional HEL and EN domains (Supplementary Table 5). These 5 rhzs are located only on chromosome II (Fig. 4C).
Transposon-mobility genes in Maverick/Polinton (MP) elements in C. elegans
Our RepeatMasker (RM) analysis did not identify MP elements. MPs are reportedly located on chromosomes I, II, III, IV, and X of the C. elegans genome (Feschotte and Pritham 2005; Gao and Voytas 2005; Pritham et al. 2007). Using the representative AL110478.1 sequence of MP in C. elegans (Pritham et al. 2007), we searched for MP copies in VC2010 by using the NCBI nucleotide BLAST program (https://blast.ncbi.nlm.nih.gov/). We identified short homologous regions > 100 bps that were densely scattered on 3 distinct regions on chromosome I (MP Ia, MP Ib, and MP Ic), and 2 regions on chromosome X (MP Xa and MP Xb) (Fig. 5A and 5B). Thus, the reference MP copy used in our homology search was aligned discontinuously in these MP copies in the VC2010 genome assembly (Fig. 5B). Additionally, MP copies were located on different chromosomes from previous reports.
The VC2010 genome assembly provides substantial advantages over its predecessors in both precision and completeness, but at the current stage, any assembly is imperfect (Yoshimura et al., 2019). Compared to the N2 genome assembly, the VC2010 assembly contains short insertions, deletions, and duplications ranging from tens to thousands of base pairs, which are distributed in all chromosomes (Yoshimura et al., 2019). These differences could be due to polymorphism in the PD1074 strain that was used to obtain the VC2010 assembly, or could be due to errors arising during sequencing and/or assembly of the N2 genome assembly. Therefore, the discontinuity within the MP copies (Fig. 5A and 5B) is likely to be the actual sequence present in the PD1074 strain. On the other hand, artifactual large structural variations in VC2010, as we found in chromosomal locations of the MP copies, have been suggested to remain even after careful correction in the VC2010 assembly (Yoshimura et al., 2019). Thus, it remains inconclusive whether the different chromosomal locations of the MP copies in VC2010 are actually present in the PD1074 strain or due to errors in the assembly process to obtain the VC2010 assembly.
To identify transposon-mobility genes encoded in these MP copies, i.e., DNA polymerase B (DNA POLB) and IN, we applied SNAP and DIAMOND to the 5 MP copies. Two DNA POLB-related genes (polB_MP) encoding 378 and 93 amino acids were located in MP Ia and MP X, respectively. Two IN genes (inz_MP) were located in MP Ib and MP Ic. Interestingly, a gene encoding the helix-turn-helix 28 (HTH28) domain-containing protein, which is conserved in some TPs (shttps://www.uniprot.org/uniprotkb?query=HTH28%20transposase), is located in MP Xa. Because TPs are functionally and evolutionarily related to INs (Capy et al. 1996, 1997), we considered these 3 IN-related genes to be inz_MPs.
DNA POLB has 5 conserved motifs from Motifs I to V (Iwai et al. 2000). By aligning PolB_MP-1 (at MP Ib) and PolB_MP-2 (at MP Xa) with the reference DNA POLBs, we found that PolB_MP-1 lacked most of the N-terminal motifs from I to III, and only had YnDTD conserved in Motif IV. Additionally, PolB_MP-2 only had conservation of a short N-terminus fragment, and did not exhibit conservation of Motifs I to V. Therefore, we considered that these PolB_MPs do not encode functional proteins. Next, by aligning INZ_MP-1 (at MP Ib), -2 (at MP Ic), and − 3 (MP Xa) with reference INs, we found that INZ_MP-1 and − 2 aligned their EEEs with the reference DDE triad (black asterisks in Fig. 5C). Asp and Glu have the same negatively charged carboxylate groups. We conclude that inz_MP-1 and − 2 are potentially functional genes. INZ_MP-3 did not align with the reference DDE triads, whereas INZ_MP-3 aligned with the DDD/E triad in the Tc1 family TPs (red asterisks in Fig. 5C). We also considered that inz_MP-3 to be a potentially functional gene.
The IN encoded in a MP copy has been identified as a cellular IN (c-integrase) that is homologous to the retrotransposon IN (Gao and Voytas 2005). Additionally, the IN in MP is highly homologous to the TP encoded in ginger DNA transposon (Bao et al. 2010). MP has been proposed to evolutionarily play a role as a hotbed that exchanges genes between eukaryotic DNA mobile elements (dsDNA viruses, adenoviruses, small ssDNA viruses, Mavirus-like virophages, icosahedral viruses) (Koonin et al. 2015; Krupovic and Koonin 2015; Yutin et al. 2015). Considering that inz_MP-3 at MP Xa encodes the HTH28 domain, and that the DDD motif is homologous to the Tc family of DNA TPs, we conclude that the Tc family of DNA transposons may be involved in MP evolution.
Transposon-mobility genes in DIRS, PLE, and Crypton elements in C. elegans
Our RM analysis did not identify copies of DIRS or Crypton. A previous homology-based search of 34 nematode species to identify Tyr-REC genes that mediate transposition of DIRS and Crypton identified an incomplete cDNA for a Tyr-REC gene in chromosome II of C. elegans (Szitenberg et al. 2014). We applied SNAP to a genomic region of this incomplete cDNA with 20-kbp 5’ flanking and 20-kbp 3’ flanking genomic regions in VC2010, but did not find any complete ORF. From this result and previous studies, we conclude that C. elegans may not have a functional Tyr-REC gene. Finally, consistent with a previous report (Arkhipova 2006), our analysis on VC2010 did not identify any PLE copies.