Without classification, a TE library is of limited use. While entries in Dfam have always been classified, in this release we have added an interactive tool to the website, displaying our classification scheme for repetitive sequences in eukaryotic genomes in the form of an identification key.
Classification of TEs poses specific problems that may prevent a universal solution from being found [18]. A purely cladistic approach is impossible because TEs are polyphyletic (they have many independent origins) and because their relationships are reticulate (sections of a TE can have entirely different evolutionary histories, due to recombination, gene capture and nested insertion). Classic SINEs, which have originated many times from the fortuitous positioning of an internal promoter (e.g. in a small RNA gene) next to the 3' fragment of an active LINE [15, 19], provide an example of both issues. Nevertheless, most currently used classification systems for eukaryotic TEs are very similar and are based on hybrids of cladistic, mechanistic and structural approaches.
In 1989, David Finnegan introduced an early classification with just four classes [20]. His basic division between TEs that transpose via an RNA intermediate (class I) and those that “transpose directly from DNA to DNA” (class II) is still used by most. Considering the impact of the cis-activity of class I proteins on their transcripts and the trans-activity of class II proteins on their genomic copies, this division is indeed fundamental. At the time, very few types of eukaryotic TEs were known, and his further divisions of class I elements into those with and without long terminal repeats (LTRs), and of class II elements into those with short and long terminal inverted repeats (TIRs), have not survived the onslaught of new data, although LTR and non-LTR (LINE) elements still form valid clades, at least from the reverse-transcriptase point of view [21].
When we introduced RepeatMasker in 1995, we needed a succinct classification to fit in the slightly modified cross_match format [22] we used to annotate genomic DNA. We chose a three-level form coded as “level1/level2-level3” (e.g. “DNA/hAT-Charlie”). For the first level, we adopted Finnegan's LTR, LINE and class II (“DNA”) divisions and added SINE and a number of non-TE classes; the use of three class I divisions reflects a bias towards the elements most frequently encountered in the human and other mammalian genomes. The second and third levels represent clades of elements based on reverse transcriptase (RT) or transposase phylogenies. Non-autonomous elements, whose movement depends upon the coding capacity of autonomous elements, were grouped within the classification of those autonomous elements, based on similarities of the LTRs or TIRs in the absence of any coding sequence. Entries in Repbase more or less inherited this simple classification hierarchy. In later years, attempts were made to reflect as much as possible of the classification in the name of the elements [23]. The classification system suggested in 2007 by researchers with a primarily plant genomics background [24] has the same basis in Finnegan and follows a similar logic; to display a compact classification on an annotation line, they suggested adding a three-letter class-order-superfamily code to each "family" classification. The "subfamily" was suggested to be used in the TE's name itself.
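As an illustration of the three-level form, the following Python sketch splits such a label into its levels. The names `RepeatClass` and `parse_repeat_class` are hypothetical and not part of RepeatMasker. The sketch assumes that the first "/" separates level 1 from the rest, that the first "-" separates levels 2 and 3, and that the lower levels may be absent.

```python
from typing import NamedTuple, Optional


class RepeatClass(NamedTuple):
    """Hypothetical container for a "level1/level2-level3" label."""
    level1: str
    level2: Optional[str]
    level3: Optional[str]


def parse_repeat_class(label: str) -> RepeatClass:
    """Split a label such as "DNA/hAT-Charlie" into its (up to) three levels.

    Assumption: the first "/" ends level 1 and the first "-" after it
    ends level 2; missing levels are returned as None.
    """
    level1, _, rest = label.partition("/")
    if not rest:
        return RepeatClass(level1, None, None)
    level2, _, level3 = rest.partition("-")
    return RepeatClass(level1, level2, level3 or None)


if __name__ == "__main__":
    print(parse_repeat_class("DNA/hAT-Charlie"))
    print(parse_repeat_class("LINE/L1"))
    print(parse_repeat_class("SINE"))
```

Returning `None` for absent levels, rather than empty strings, keeps a partially specified label distinguishable from a fully specified one.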
Our classification, like those before, combines mechanistic, cladistic and structural approaches. Where possible, the relationship of the RT in class I elements and of the transposase, helicase, or DNA polymerase in class II elements guides the tree. While non-autonomous LTR elements tend to remain dependent on the autonomous element from which they formed and can be classified with these, LINE-dependent non-autonomous elements have a variety of origins. They are separated into those with a small RNA-derived pol III internal promoter (the SINEs) and other elements. The latter category is a grab bag of sorts, classified by the type of LINE they depend upon, and ranges from elements consisting mostly of LINE material to hodgepodges like SVA [25]. The modular, classic SINEs are organized by their 5' small RNA-derived, core, and 3' LINE-derived modules. Class II elements are divided into the four fundamental mechanisms of propagation so far known in eukaryotes, "cut-and-paste" via a linear or circular dsDNA, "rolling circle", and "self-synthesizing" groups, after which the phylogenetic relationship of the transposase, recombinase, helicase, or DNA polymerase, respectively, takes over. Like non-autonomous LTR elements, most non-autonomous DNA transposons can be classified based on their TIRs combined with their target site duplication (TSD) pattern. We therefore do not provide structural categories like LARD (large LTR retrotransposon derivatives) or MITE (miniature inverted–repeat transposable elements).
Figure 2 - Dfam TE Classification System. (A) A portion of the dynamic visualization of the classification system found at the Dfam website. Filled circles represent internal nodes of the classification tree, while hollow circles are leaf nodes. A classification is specified by concatenating the node names along the path through the classification tree. For example, the classification “Interspersed_Repeat;Unknown” is highlighted in the tree. (B) Wherever possible, a mapping is provided between classification systems. The Dfam classification for the L1 group of LINEs is shown with the equivalent classifications in several other systems.
The Dfam classification system (Fig. 2) does not display a ranked hierarchy. There will never be satisfying definitions for what constitutes a class, order, family or subfamily of TEs, and with the addition of new elements and growing knowledge of their relationships, the number of branches, and therefore subdivisions, along some parts of the tree will remain in flux. Wicker et al. proposed to define a family as a group of TEs that can be aligned over at least 80 bp and show at least 80% identity covering 80% or more of the alignment [24]. Although it was meant as a pragmatic definition, it has been pointed out that applying it would lead "to an unpredictable mix of monophyletic, paraphyletic and polyphyletic groups" [26]. Strictly following this rule would also not be practical: for example, newly identified TEs intermediate between known families would force these to be merged over time, and the aforementioned reticulate relationships of TEs could join radically different TEs in one family. Also, some of the ranks are already in use for other purposes: the term "family" is often used for any group of aligned TE copies for which a consensus or HMM has been derived and, in animal TE annotation, "subfamilies" indicate either subsets of class I TE copies that share multiple co-segregating differences from the rest (Fig. 1) or sets of particular internal deletion products of an autonomous class II transposon. With its lack of taxonomic ranks, our schema avoids these issues.
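The merging problem can be made concrete with a small Python sketch. It is an illustration under stated assumptions, not the procedure of Wicker et al.: the function names are hypothetical, the three thresholds are read directly from the rule as quoted above, families are formed by simple single-linkage grouping, and the pairwise alignment values are invented toy numbers.

```python
def passes_80_80_80(aligned_bp, identity, coverage):
    """Pairwise test as quoted above: >= 80 bp, >= 80% identity, >= 80% coverage."""
    return aligned_bp >= 80 and identity >= 0.80 and coverage >= 0.80


def single_linkage_families(names, pairwise):
    """Group TEs so that every pair passing the rule shares a family.

    `pairwise` maps (name_a, name_b) -> (aligned_bp, identity, coverage).
    Single-linkage grouping is an assumption made for this sketch.
    """
    parent = {n: n for n in names}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), stats in pairwise.items():
        if passes_80_80_80(*stats):
            parent[find(a)] = find(b)

    families = {}
    for n in names:
        families.setdefault(find(n), set()).add(n)
    return sorted(sorted(f) for f in families.values())


if __name__ == "__main__":
    # Toy values: A and C are too diverged to pass the rule directly.
    before = single_linkage_families(["A", "C"], {("A", "C"): (300, 0.70, 0.90)})
    # A newly identified intermediate element B passes against both.
    after = single_linkage_families(
        ["A", "B", "C"],
        {
            ("A", "B"): (300, 0.85, 0.90),
            ("B", "C"): (300, 0.84, 0.95),
            ("A", "C"): (300, 0.70, 0.90),
        },
    )
    print(before)
    print(after)
```

In this toy case, A and C form separate families until the intermediate element B is added, after which all three collapse into a single family even though A and C still fail the rule against each other.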
TEs may also be classified by their transposition mechanism, and classification systems based on the mechanism of integration and the chemistry of the transposition reaction have been proposed [27, 28]. These have the benefit of being able to integrate the wide variety of TEs active in prokaryotes, but are somewhat hampered by the lack of knowledge of the details of transposition for new, bioinformatically discovered TEs. Furthermore, written specifically to include prokaryotic TEs, the mechanistic classifications lack the fundamental division into cis-active and trans-active elements brought about by the separation of transcription and translation in eukaryotes. While the focus of the RepeatMasker/Repbase/Wicker classification on eukaryotes and on reverse transcriptase phylogeny has been criticized [29], a unified eukaryotic/prokaryotic TE classification would be unwieldy. In the future, we will explore the use of an independent classification for prokaryotic TEs.
A TE family can be classified as belonging to any node in the classification tree by concatenating the names along the path from the root to the designated node. For example, the highlighted node in Fig. 2 is referenced with the string “Interspersed_Repeat;Unknown”. This enables partial classifications to be made and node labels to be reused. Wherever available, each classification is linked to the corresponding RepeatMasker, Repbase, Wicker et al. or Curcio-Derbyshire classification.
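The path-based naming can be sketched in a few lines of Python. This is a minimal illustration, not the Dfam data model: the `Node` class and the `is_within` helper are hypothetical, the ";" separator is taken from the example string given for Fig. 2, and the label "Example_Group" is invented to show how a node label can be reused on a different path.

```python
class Node:
    """One node of a classification tree (illustrative only)."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def add_child(self, name):
        child = Node(name, parent=self)
        self.children.append(child)
        return child

    def classification(self, sep=";"):
        """Join the node names along the path from the root to this node."""
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return sep.join(reversed(names))


def is_within(classification, ancestor, sep=";"):
    """True if `classification` equals `ancestor` or lies below it."""
    return classification == ancestor or classification.startswith(ancestor + sep)


if __name__ == "__main__":
    root = Node("Interspersed_Repeat")
    unknown = root.add_child("Unknown")
    group = root.add_child("Example_Group")      # hypothetical label
    nested_unknown = group.add_child("Unknown")  # same label, different path

    print(unknown.classification())
    print(nested_unknown.classification())
    print(is_within(nested_unknown.classification(), group.classification()))
```

Because a classification is the full path, a family can be assigned to an internal node when nothing more specific is known, and two nodes with the same label remain distinct as long as their paths differ.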
While most interspersed repeats identified by de novo repeat finding programs are derived from TEs, alternative origins include (i) simple tandem repeats, originating independently at many sites, (ii) long tandem repeats like satellites, found at multiple (sub)telomeric and centromeric sites, (iii) segmental duplications, (iv) common coding motifs like zinc fingers, and (v) gene families. In mammals, the most common non-TE sources of interspersed repeats are retro(pseudo)genes that have been accidentally copied by the LINE1 mechanism; some small structural RNAs occur with over a thousand copies [30]. While our classification system includes these categories, most such entries should not be part of Dfam. Satellites and small structural RNAs are included in Dfam, but shorter tandem repeats are better detected by specialized programs like TRF [31] and ULTRA [32], and the inclusion of segmental duplications, cellular transcripts or coding regions would lead to many false annotations.