The streptophyte mitochondrial group II intron data set
We scanned plant mitochondrial genomes for the occurrence of group II introns, including the complete phylogenetic diversity representing the seven major land plant clades: Flowering plants (Angiosperms), Gymnosperms, Ferns (Monilophytes), Lycophytes, Hornworts (Anthocerotophyta), Mosses (Bryophyta) and Liverworts (Marchantiophyta). We additionally included the available mitogenomes of streptophyte algae (“Charophytes”) representing six classes: Zygnematophyceae (Desmidiales: Closterium baillyanum, Gonatozygon brebissonii; Zygnematales: Entransia fimbriata, Roya anglica, R. obtusa and Zygnema circumcarinatum), recently considered to be most closely related to the plant lineage, as well as Charophyceae (Chara vulgaris, C. braunii, Nitella hyalina), Coleochaetophyceae (Chaetosphaeridium globosum, Coleochaete scutata), Chlorokybophyceae (Chlorokybus atmophyticus), Klebsormidiophyceae (Klebsormidium flaccidum) and the early-branching Mesostigmatophyceae (Mesostigma viride). For a clear designation of introns we use the previously suggested intron nomenclature, which is based on the intron insertion site in a given gene behind the reference position in the respective homologue of the liverwort Marchantia polymorpha [56, 57].
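The structure of these intron labels can be made explicit with a small parser. The following is only a minimal sketch: the pattern is inferred from labels that appear in this text (e.g. cox2i373g2, trnS-GCUi43g2, the twintron label atp1i1050g2ii1536g2 and the fossil label rps8i52g2f), and the names `LABEL`, `IntronLabel` and `parse_label` are hypothetical helpers, not part of the published nomenclature.

```python
import re
from typing import NamedTuple, Optional

# Assumed pattern: <gene>i<site>g<group>[ii<inner site>g<inner group>][f]
# "f" marks a fossil intron; "ii" introduces the internal intron of a twintron.
LABEL = re.compile(
    r"^(?P<gene>.+?)i(?P<site>\d+)g(?P<group>[12])"
    r"(?:ii(?P<inner_site>\d+)g(?P<inner_group>[12]))?"
    r"(?P<fossil>f)?$"
)


class IntronLabel(NamedTuple):
    gene: str                   # host gene, e.g. "cox2" or "trnS-GCU"
    site: int                   # insertion site behind this reference position
    group: int                  # intron group (1 or 2)
    inner_site: Optional[int]   # insertion site of an internal intron, if any
    inner_group: Optional[int]  # group of the internal intron, if any
    fossil: bool                # True for labels ending in "f"


def parse_label(label: str) -> IntronLabel:
    """Split an intron label into its components (hypothetical helper)."""
    m = LABEL.match(label)
    if m is None:
        raise ValueError(f"not a recognised intron label: {label!r}")
    inner_site = m.group("inner_site")
    inner_group = m.group("inner_group")
    return IntronLabel(
        gene=m.group("gene"),
        site=int(m.group("site")),
        group=int(m.group("group")),
        inner_site=int(inner_site) if inner_site else None,
        inner_group=int(inner_group) if inner_group else None,
        fossil=m.group("fossil") is not None,
    )


if __name__ == "__main__":
    for name in ("cox2i373g2", "nad4Li100g2", "atp1i1050g2ii1536g2", "rps8i52g2f"):
        print(name, parse_label(name))
```

The lazy match on the gene name is a design choice that keeps gene names ending in a letter or containing a hyphen (nad4L, rrnL, trnS-GCU) intact.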
The cox2 gene (Fig. 2) serves as an example to introduce important issues of our analyses, which address the huge diversity of mitochondrial introns in the plant lineage. Altogether twelve different group II insertion sites are presently identified in the cox2 genes of land plants and streptophyte algae. Their phylogenetic distributions vary widely, from introns present in most land plant lineages excluding liverworts (cox2i373g2 and cox2i691g2) to others presently only identified in the hornwort genus Anthoceros (cox2i98g2) or in the streptophyte alga Coleochaete scutata (cox2i550g2). The assignments of introns to core families, which comprise significantly similar intron paralogues as introduced in this work, are given below the intron names: cox2i381g2 in family F01, cox2i550g2 in F03, cox2i97g2 and cox2i98g2 in F04, cox2i564g2 in F16 and cox2i94g2 in F19 (see below). Including intron-borne maturases allows for the definition of maturase-based families and superfamilies, here cox2i373g2 in mF26 and cox2i127g2 in mF27. Remaining “solitary” introns lacking significant similarity to other paralogues are labeled “S”. Insertion sites must be considered very carefully and precisely, notably when different introns occur in a small gene region, such as introns cox2i94g2, cox2i97g2, cox2i98g2 and cox2i104g2, an issue that is occasionally overlooked and leads to mis-annotation in database entries. Introns in the same position were considered as orthologues even when lacking significant sequence similarities across large phylogenetic distances, provided they were not assigned to different families.
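The need to keep closely neighbouring insertion sites apart can be illustrated with a short check that flags intron labels in the same gene lying within a few nucleotides of each other. This is a sketch only: the 10-nucleotide window and the helper name `close_neighbours` are arbitrary assumptions made for illustration, and the labels are those mentioned in this text.

```python
import re
from collections import defaultdict
from itertools import combinations

# Gene name and insertion site, as read from the start of an intron label.
SITE = re.compile(r"^(?P<gene>.+?)i(?P<site>\d+)g[12]")


def close_neighbours(labels, window=10):
    """Return (label_a, label_b, distance) for same-gene introns whose
    insertion sites differ by at most `window` nucleotides (assumed cutoff)."""
    by_gene = defaultdict(list)
    for label in labels:
        m = SITE.match(label)
        if m is None:
            raise ValueError(f"not a recognised intron label: {label!r}")
        by_gene[m.group("gene")].append((int(m.group("site")), label))

    pairs = []
    for sites in by_gene.values():
        for (s1, l1), (s2, l2) in combinations(sorted(sites), 2):
            if s2 - s1 <= window:
                pairs.append((l1, l2, s2 - s1))
    return pairs


if __name__ == "__main__":
    labels = [
        "cox2i94g2", "cox2i97g2", "cox2i98g2", "cox2i104g2", "cox2i373g2",
        "atp9i87g2", "atp9i95g2", "cox1i652g2", "cox1i653g2",
    ]
    for a, b, dist in close_neighbours(labels):
        print(f"{a} and {b} are {dist} nt apart")
```

Pairs flagged this way are the ones where an off-by-one splice-site annotation would assign an intron to the wrong paralogue.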
Our re-evaluation of database entries during intron sampling has made us re-consider several intron and splice site annotations and allowed us to suggest yet further, and very likely functionally splicing, introns that were previously unnoticed but would reconstitute important and conserved parts of their host genes (e.g. cox3i34g2 in Gonatozygon, nad2i81g2 in Zygnema or nad9i89g2 in Closterium and Gonatozygon) as well as dysfunctional, “fossil” introns (e.g. rps8i52g2f in Phlegmariurus), as we will discuss below. Our final total mitochondrial group II intron sampling in streptophyte mitogenomes comprised 161 group II intron paralogues defined by their unique insertion sites.
The occurrence of streptophyte mitochondrial group II intron paralogues in the major clades is displayed with Euler diagrams in Fig. 3. The striking discrepancy of mitochondrial vs. chloroplast group II introns is immediately apparent with 22 land plant chloroplast introns all of which have counterparts in streptophyte algae vs. altogether 161 mitochondrial group II introns, of which only 13 are shared between embryophytes and streptophyte algae (Fig. 3A). Differentiating among the embryophytes, larger intersections are found between bryophytes and tracheophytes (Fig. 3B) than with either group and the outgroup algae and among the latter between hornworts and mosses (Fig. 3C) and between hornworts, mosses and tracheophytes (Fig. 3D), respectively. Notably, of 101 group II introns identified in embryophyte mtDNAs, only one (atp9i87g2) is shared between all three bryophyte clades (Fig. 3C).
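The intersections shown in such Euler diagrams amount to set operations on a presence table of introns per clade. The sketch below uses only a small illustrative subset of introns whose clade distributions are stated in the family descriptions of this section; it is not the full data set behind Fig. 3, and the helper `shared_by` is hypothetical.

```python
# Clade occurrence for a handful of introns, as stated in the text
# (illustrative subset only, not the complete sampling).
presence = {
    "atp9i87g2": {"liverworts", "mosses", "hornworts", "lycophytes"},
    "atp6i80g2": {"mosses", "hornworts", "lycophytes"},
    "cox2i98g2": {"hornworts"},
    "nad9i246g2": {"hornworts"},
    "cox3i625g2": {"liverworts"},
    "cox3i507g2": {"mosses"},
    "cox1i1064g2": {"mosses"},
}

BRYOPHYTES = {"liverworts", "mosses", "hornworts"}


def shared_by(presence, clades):
    """Introns present in every one of the given clades."""
    wanted = set(clades)
    return {intron for intron, found_in in presence.items() if wanted <= found_in}


if __name__ == "__main__":
    print("all three bryophyte clades:", sorted(shared_by(presence, BRYOPHYTES)))
    print("hornworts and mosses:", sorted(shared_by(presence, {"hornworts", "mosses"})))
```

Within this subset, only atp9i87g2 is returned for the three bryophyte clades, in line with the statement above.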
The group II intron family concept
Despite their highly conserved six-domain structures, group II intron paralogues a priori share no significant sequence similarities aside from mostly conserved sequence motifs at the 5’-end (mostly GUGCG), the 3’-termini (mostly AY) and their characteristic domain V comprising 34 nucleotides, which mostly folds into two base-paired regions of 9 and 5 base pairs, respectively, with a dinucleotide bulge and a terminal tetraloop (most often GNRA) in the overwhelming majority of group II introns. Only a consensus sequence representing domain V may in fact be used as an initial query to scan for group II intron candidates in sequence databases [58].
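The few conserved features named above can be written down as simple checks. The following sketch only tests the terminal motifs and the tetraloop consensus with IUPAC codes expanded (Y = C or U, R = A or G, N = any base); it is not a group II intron search tool, the function names are hypothetical, and the example sequence in the usage part is made up. The arithmetic at the top shows how the described domain V elements add up to 34 nucleotides.

```python
import re

# Domain V as described: two helices of 9 and 5 base pairs,
# a dinucleotide bulge and a terminal tetraloop.
HELIX_BP = (9, 5)
BULGE_NT = 2
LOOP_NT = 4
DOMAIN_V_LENGTH = 2 * sum(HELIX_BP) + BULGE_NT + LOOP_NT  # 2 * 14 + 2 + 4 = 34

FIVE_PRIME = "GUGCG"                      # most common 5'-end motif
THREE_PRIME = re.compile(r"A[CU]$")       # AY at the 3'-terminus
GNRA = re.compile(r"G[ACGU][AG]A")        # GNRA tetraloop consensus


def to_rna(seq: str) -> str:
    """Upper-case a sequence and convert DNA notation (T) to RNA (U)."""
    return seq.upper().replace("T", "U")


def terminal_motifs(seq: str) -> dict:
    """Report whether a candidate intron shows the common terminal motifs."""
    rna = to_rna(seq)
    return {
        "five_prime_GUGCG": rna.startswith(FIVE_PRIME),
        "three_prime_AY": bool(THREE_PRIME.search(rna)),
    }


def is_gnra(tetraloop: str) -> bool:
    """True if a four-nucleotide loop matches the GNRA consensus."""
    return bool(GNRA.fullmatch(to_rna(tetraloop)))


if __name__ == "__main__":
    print("domain V length:", DOMAIN_V_LENGTH)
    print(terminal_motifs("GTGCGACCTTGGAAAC"))  # made-up toy sequence
    print(is_gnra("GAAA"), is_gnra("UUCG"))
```

Because the motifs are only "mostly" conserved, a failed check does not exclude a sequence from being a group II intron.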
Our criteria for considering introns occupying different insertion sites for inclusion into families of related paralogues are detailed under methods. In brief, we used several rounds of identifying sequence-related intron paralogues that share significant sequence similarities (beyond short similarities of domain V and the immediate flanking regions), which can exceed those of evidently orthologous introns in the same positions in distant plant taxa. Naturally, some rare borderline cases are represented with introns occupying the same insertion site in phylogenetically very distant taxa. An independent insertion of a given position cannot be excluded as an alternative explanation to vertical transmission followed by sequence divergences obliterating recognizable similarities. Below, we will discuss such borderline cases in the context of our consideration of the now defined group II intron families. A cladogram of group II introns sorted into families is shown in Fig. 4. Subsequently, we will consider, where present, similarities of intron-borne maturases to independently verify family assignments based on the nucleotide sequence similarities alone. Furthermore, maturase similarities will occasionally allow for the inclusion of some previously “solitary” introns lacking significant nucleotide sequence similarities to other paralogues into extended families and for some fusions of primary intron families into “superfamilies” (Fig. 5).
Family F01
Among the here defined families of mitochondrial group II introns in the streptophyte lineage, family F01 is the largest one, comprising altogether 13 intron paralogues (Fig. 4, Fig. 6). It includes four isolated introns, each presently identified in only one streptophyte algae clade (atp9i209g2 and cox3i745g2 in Chlorokybus, atp9i214g2 in Charophyceae and cox1i1039g2 in Coleochaete, respectively). The “moss-specific” introns cox3i507g2 and nad9i283g2 of F01 are universally conserved in that plant clade. Phylogenetically more broadly distributed introns atp6i439g2, nad4i1399g2, nad5i230g2, nad7i209g2 and rps3i74g2 occur in up to five embryophyte clades. In two cases, the respective introns remain recognizable despite pseudogene degeneration of nad7 and rps3 in hornwort mitogenomes. Three introns in family F01 are present both in at least one embryophyte clade and in up to three classes of streptophyte algae: rps3i74g2, atp6i439g2 and cox2i381g2 (Fig. 4, Fig. 6). Notably though, family F01 does not include an intron paralogue present in liverworts. Only four of the intron paralogues in F01 carry maturases: atp9i209g2, cox1i1039g2, cox2i381g2 and nad3i211g2 (Fig. 4, Fig. 6).
Although outgroup rooting is necessarily difficult for the individual clades of intron families, the unrooted phylogeny suggests particular close relations of certain introns in F01, in this case of the “moss” intron cox3i507g2 with the “tracheophyte” intron nad4i1399g2 and of the phylogenetically wider distributed paralogues nad5i230g2 with rps3i74g2 (Fig. 6). Interestingly, the latter two and nad7i209g2 as a third intron paralogue of the F01 family have recently been demonstrated to be affected by the same nuclear-encoded splicing factor in the model angiosperm Arabidopsis thaliana, OZ2 [50]. Also quite notably, the maize NCS2 mutant (non-chromosomal stripe mutant) described long ago [59] is the result of a mitogenome recombination between two mitochondrial introns that we have now assigned to family F01: nad4i1399g2 and nad7i209g2.
The phylogeny of intron paralogues is in accordance with known species phylogenies for the well-supported nodes, while apparent phylogenetic “mis-placements”, e.g. of the diverging lycophyte sequences, remain without support (Fig. 6). However, and notably so with respect to phenomena like shared splicing factors, the possibility of concerted and converging evolution must be considered. Whereas atp6i439g2 is an example where even the distant algal orthologue in Zygnema is included with its counterparts in lycophytes and hornworts in one subclade, this is not the case for rps3i74g2, where the Charales orthologues are attracted to the isolated atp9i214g2 paralogue in Chara vulgaris. Our additional considerations of maturase protein similarities, which we will discuss below, include family F01 in a “superfamily” that also comprises the small family F22 and two “solitary” introns (Fig. 5).
Family F02
As in family F01, some intron paralogues in family F02 likewise have a very restricted occurrence: cox3i625g2 is exclusively present in liverworts and nad9i246g2 is only identified in hornworts (Fig. 4). However, despite their phylogenetically disjunct occurrence, cox3i625g2 and nad9i246g2 are closely related paralogues in F02 (Fig. 7). Yet more phylogenetically restricted, nad5i392g2 is only present in the lycophyte order Lycopodiales, here represented by Phlegmariurus squarrosus. The extreme diversity of mitogenome evolution within the lycophyte clade with retained genes and introns and a conserved genetic synteny in the P. squarrosus mtDNA versus highly recombining mitogenomes with reduced gene sets in Isoetales and Selaginellales is also fully in line with intron rps10i235g2 being retained in Phlegmariurus and shared with the seed plants. The similarity of angiosperm rps10i235g2 to liverwort introns rrnLi833g2 and cox3i625g2 had already been noted right along with its initial discovery [60]. Other than rps10i235g2, P. squarrosus intron nad5i1242g2 is shared with ferns, similarly indicating an early vascular plant ancestry. Notably, we recently found intron rrnLi833g2 universally conserved among liverworts now to be also present in the mitogenome of the leptosporangiate fern Haplopteris ensiformis [61] and in a preliminary mtDNA assembly of the fern Dryopteris crassirhizoma (database accession MW732172). Intriguingly, however, the fern rrnLi833g2 introns cluster with their nad5i1242g2 paralogues, possibly indicating concerted evolution or a loss-and-regain scenario.
Intron family F02 neither includes an extant intron paralogue present in mosses nor one carrying a maturase ORF (Fig. 4). Notably though, we were able to identify several additional fossilized group II intron sequences “F02g2f” clearly tracing back to family F02 intron paralogues both in Phlegmariurus squarrosus and in hornwort mitogenomes (Suppl. Figure 1A & B). The series of intron copying events leading to the five F02 paralogues now recognized in the Phlegmariurus mitogenome (Fig. 7B) remains unclear except for nad5i392g2 present only in Lycopodiales and likely emerging late from the more ancestral nad5i1242g2 paralogue. The newly identified fossil intron rps8i51g2f receives the ‘f’ behind the g2 intron label because we could not confirm functional splicing despite significant similarity to its functional counterparts in nad5 and rps10 extending up to the very intron 5’- and 3’-ends as opposed to another intergenic F02 intron fossil in the tatC-cox2 spacer (Fig. 7C). Altogether five intergenic F02 intron fossils could be discovered in the hornwort mitogenomes, likely originating from nad9i246g2 or a common ancestor (Fig. 7D). In this case, the two intergenic intron fossils between rps4 and tatC and between atp6 and nad6 appear more closely related to each other than the remaining three. The extension of the intron sequence similarities ending with the respective 5’- or 3’-intron termini and their identification in the conservatively evolving mitogenomes of early land plants devoid of heavy recombination as in seed plant mitogenomes strongly argues for their origin by retrotransposition rather than mtDNA recombination.
Family F03
Intron family F03 contains two intron paralogues, nad4i976g2 and nad5i1477g2, shared between hornworts and all four tracheophyte clades, potentially supporting a hornwort-tracheophyte (HT) clade (Fig. 4). However, both nad4i976g2 and another intron paralogue in F03, nad3i140g2, also reveal counterparts in Charophyceae algae. No losses are identified for nad3i140g2 among available liverwort, hornwort or lycophyte mitogenomes. Similarly, two further intron paralogues are universally conserved in mosses and shared with either lycophytes (atp9i21g2) or hornworts (sdh3i100g2). Finally, three F03 introns have a restricted occurrence among streptophyte algae, of which only one is shared with a land plant clade, cox2i250g2 present in Closterium (Zygnematophyceae) and in the liverworts. This intron is an interesting exceptional case with a maturase carried in the liverwort orthologues, but not in the algal counterpart of cox2i250g2. Among the eight intron paralogues in F03 (Fig. 4), two pairs of introns show particularly high sequence similarities: nad3i140g2 with nad5i1477g2 and sdh3i100g2 with nad4i976g2. In this case, the intron paralogues occur with a phylogenetic overlap in the hornworts, a clade characterized by high intron mobility as also reflected below with the example of family F04. Again, we were able to identify fossilized group II intron sequences (“F03g2f”) in intergenic sequences of the lycophyte Phlegmariurus squarrosus clearly tracing back to family F03 (Suppl. Figure 1C).
Family F04
Group II intron family F04 is dominated by introns exclusively occurring in hornworts, indicating a pronounced intron mobility in that land plant clade: cox2i98g2, nad5i881g2, nad6i444g2 and nad9i502g2 (Fig. 4). This is further supported by the observation that the respective internal introns of two recently discovered twintrons [13], atp1i1050g2ii1536g2 and cox1i1116g1ii207g2 in the hornwort genus Anthoceros, likewise belong to the here defined F04 family. In fact, the extraordinary sequence similarity of 98% between atp1i1050g2ii1536g2 and cox2i98g2 is the most striking case of closely related paralogues in our entire intron sampling, indicating a very recent copying event. The fact that the atp1 twintron is present in all hornworts whereas cox2i98g2 exists only in the genus Anthoceros immediately suggests a copying from the former to the latter insertion site. Notably, however, internal splicing of zombie-twintron atp1i1050g2ii1536g2 could not be detected.
At the same time, family F04 is a prime example for the necessity of careful analyses to distinguish cox2i98g2 from its paralogue cox2i97g2 in liverworts that is inserted one nucleotide upstream in the cox2 gene (Fig. 2). Intron nad7i676g2 is conserved in all tracheophyte clades. Its former presence in hornworts, and accordingly in a possible HT stem lineage, is elusive owing to the pseudogene degeneration of the downstream part of nad7 in all hornwort mtDNAs. Notably, and despite the very clear evidence of recently active retrotransposition, none of the F04 members carries a maturase-ORF (Fig. 4).
Family F05
In contrast to F04 mainly comprising intron paralogues in hornworts, group II intron family F05 (Fig. 4) is dominated by intron paralogues exclusively occurring in liverworts: nad4i548g2, nad4Li100g2, nad7i336g2 and rpl2i28g2. Liverwort-specific intron nad4i548g2 had been introduced to elucidate the liverwort phylogeny [62] and is now found to be notably similar to rpl2i28g2. Other than the four liverwort paralogues, F05 also includes lycophyte introns recently discovered and characterized as the external intron of a “zombie” twintron upon closer reinspection of the Phlegmariurus squarrosus sdh3 gene [13].
Family F06
Apart from rps3i249g2 that is shared between lycophytes, ferns and gymnosperms, the group II intron paralogues in family F06 are phylogenetically very restricted in occurrence: atp1i361g2 exclusively in ferns, cobi783g2 only in liverworts and cox1i150g2 exclusively in hornworts (Fig. 4). A detailed study of atp1i361g2 concluded that this fern-specific intron has originated from the, likely more ancestral, paralogue rps3i249g2 and also found evidence for convergent evolution of specific intron structures, mainly group II intron domain III, in the two paralogues in later emerging fern lineages [25].
Interestingly, a truncated copy of cobi783g2 had been identified earlier in the spacer between nad5 and nad4 in liverworts in an early sampling of this intergenic region among bryophytes [63]. Our extended consideration of intron families now adds support to the idea that also this case of an intron fossil has arisen through copying, likely by a retrotransposition event. Similarity of the fossil sequence starts exactly from the intron 5’-end and extends for nearly 800 bp with 99% identity in the case of Treubia lacunosa representing an early liverwort branch (Suppl. Figure 1D) whereas a higher degree of degeneration is observed in derived taxa. Yet other intron fossils have now been discovered in the intergenic region between rrn5 and trnM-CAU of the hornwort Anthoceros agrestis and the moss Sphagnum palustre mitogenomes (Suppl. Figure 1E). The only evidence for a maturase among the four members of intron family F06 are traces of a former maturase-ORF in cox1i150g2.
Family F07
Intron family F07 contains three paralogues of phylogenetically wide distribution and only one member with isolated occurrence in the liverworts alone: cobi372g2 (Fig. 4). The similarity between introns nad1i477g2 and nad2i156g2, now found to be widely conserved among vascular plants (Fig. 4), was recognized soon after the complex structures of nad1 and nad2 in flowering plants had been elucidated [54]. Quite interestingly, the nuclear-encoded splicing factor ODB1 has meanwhile been found to promote splicing of both of these paralogues [64]. The fourth paralogue in family F07, intron atp9i87g2, is particularly noteworthy for (i) being an intron now identified to be shared between all three bryophyte clades, (ii) carrying an ancestral maturase that has independently degenerated in all four plant clades where it is present and (iii) carrying the internal intron atp9i87g2ii1114g2 of family F08 (see below), turning it into a twintron in the lycophyte Phlegmariurus squarrosus. A history of intron paralogue origins immediately suggests itself for family F07 (Fig. 8). The ancestral maturase-carrying atp9i87g2 likely gave rise independently to cobi372g2 in liverworts and to nad2i156g2 in a possible non-liverwort lineage. After loss of nad2i156g2 in hornworts, it gave rise to nad1i477g2 only in the tracheophyte lineage.
Family F08
Two introns in family F08 have a phylogenetically striking distribution being conserved in liverworts but also occurring in lycophytes: nad7i1113g2 and rps14i114g2 (Fig. 4). Given that the host genes, nad7 and rps14, are frequently subject to EGT, nad7i1113g2 is at present only determined in Isoetes engelmannii and rps14i114g2 only in Phlegmariurus squarrosus among the lycophytes. However, we now found rps14i114g2 equally conserved in the mitogenome of the leptosporangiate fern Haplopteris ensiformis [61]. Yet more notably, the recently characterized inner intron of a twintron in the atp9 gene, atp9i87g2ii1114g2 [13] as a third member in family F08 is characteristically more similar to its rps14i114g2 counterpart in liverworts than in P. squarrosus.
Family F09
Intron ccmFCi829g2 in F09 has a phylogenetically wide distribution in bryophytes and tracheophytes (Fig. 4). Its former presence in lycophytes remains unclear, however, owing to the loss of the entire ccm gene suite for cytochrome c maturation in this clade. In contrast, pseudogene traces of ccmFC including ccmFCi829g2 are clearly detected in hornworts [65]. Intron rpl2i846g2 is evidently a gain in the tracheophyte stem lineage. This ancestral intron obviously gave rise to its closely related paralogue rps1i25g2 exclusively present in ferns [24]. As in family F08, no traces of maturase-ORFs are recognizable in any of the F09 intron paralogues. Highly interesting, however, the splicing of both F09 paralogues present in Arabidopsis thaliana, ccmFCi829g2 and rpl2i846g2, was found to be affected by the same nuclear-encoded splicing factor, WTF9 [66].
Family F10
Intron family F10 comprises group II introns with a particularly complex history (Fig. 4). It contains two intron paralogues with a phylogenetically disjunct distribution: cobi824g2 exclusively present in liverworts and nad1i669g2 previously assumed to be restricted to tracheophytes. Intron nad1i669g2 has received attention as being conserved in a trans-splicing arrangement in seed plants. A cis-arranged orthologue was initially identified in the fern Osmunda regalis, also noticing traces of a degenerated maturase [5]. We now found that intron nad1i669g2 has a clear orthologue also in the mtDNA of the alga Coleochaete scutata [67] with a maturase-ORF annotated in the corresponding database entry (MN613583). Intron family F10 also contains the “hypermobile invader” intron cox1i1149g2 in the lycophyte Phlegmariurus squarrosus, which gave rise to internal introns of two twintrons (in the newly determined sdh3i349g2 and in itself) and to seven further intron fossils in intergenic regions [13] (Suppl. Figure 1F).
Family F11
Intron family F11 contains introns atp6i80g2 and atp9i95g2 of phylogenetically wider distribution, shared between mosses, hornworts and lycophytes (Fig. 4). Like the above cases of the neighboring introns cox2i97g2 and cox2i98g2 in F04, also atp9i95g2 requires careful inspection given the closely neighboring intron atp9i87g2 in family F07 [65]. Despite the evident dynamics of intron insertions in atp9 of bryophytes and lycophytes, no evidence is ever found for an intron in the atp9 gene among ferns, gymnosperms or angiosperms. Intron cox1i1064g2 is exclusively present in mosses and cox1i653g2 is only present in hornworts. Again, the latter needs particular attention to distinguish it from an unrelated “solitary” intron cox1i652g2 inserted one nucleotide upstream in the Coleochaete mitogenome.
Family F12
The two intron paralogues in family F12 (Fig. 4) have a strikingly divergent distribution with cobi274g2 presently only identified in Charales algae and nad2i1282g2 present in hornworts and euphyllophytes (i.e., ferns and seed plants). No traces of former maturase reading frames can be detected in nad2i1282g2 or cobi274g2.
Family F13
Group II intron family F13 contains two intron paralogues conserved in liverworts: cox1i178g2 and cox3i171g2 (Fig. 4). The former has counterparts in the algae Klebsormidium flaccidum and Coleochaete scutata. However, only cox1i178g2 in Coleochaete shares significant sequence similarity with the liverwort homologues, clearly warranting family inclusion according to our criteria. The “positional homologue” in Klebsormidium shares similarity neither with the liverwort nor with the Coleochaete counterpart, leaving its status as a true orthologue vs. a possible “analogue” occupying the same insertion site open at present. The cox1i178g2 introns carry maturase reading frames both in algae and in liverworts, whereas no intron-borne ORFs are present in the much smaller cox3i171g2 paralogues shared between liverworts and the lycophyte Phlegmariurus squarrosus.
Family F14
As in family F12, the intron members in family F14 are also very disjunct in occurrence, with nad1i258g2 being restricted to the monilophyte (fern) clade and nad1i517g2 so far only identified in the alga Zygnema circumcarinatum (Fig. 4). The third paralogue, cox1i511g2, however, is present in liverworts, mosses, lycophytes and the algae Coleochaete and Zygnema. Like most group II introns in the algal mitogenomes, the algal cox1i511g2 copies carry maturase reading frames. The counterparts in the land plant lineage, however, are frameshifted ORFs in liverworts and degenerated or unrecognizable in the cox1i511g2 orthologues of mosses or in the lycophyte Selaginella moellendorffii. We identified an F14-type intron fossil (F14g2f) in the cox1-rrnS spacer of the Coleochaete mitogenome (Suppl. Figure 1G).
Family F15
As in F14, the intron paralogues in family F15 are likewise phylogenetically disjunct (Fig. 4). Intron cox1i835g2 is presently only identified in the algal genus Chara (and absent in the Characeae genera Nitella and Nitellopsis). Intron nad1i287g2 had initially been identified serendipitously in a screen for ancestors of trans-splicing intron nad1i394g2 [5] and has meanwhile been found to be universally conserved both in mosses and in hornworts [56, 65]. As in the above cases, a maturase reading frame is only present in the algal cox1i835g2 but not in the much smaller nad1i287g2 paralogues of less than 800 bp in the bryophytes. The subsequent additional analyses of intron-borne maturases (see below) add S-type intron atp9i145g2 in Charales to an extended family eF15, which is ultimately linked to the superfamily SF13-14 (Fig. 5).
Family F16
Family F16 contains two intron paralogues that occur exclusively in hornworts: cox2i564g2 and nad1i348g2 (Fig. 4). The latter intron is of exceptionally small size of less than 600 nt. and conserved in all hornwort genera. In contrast, cox2i564g2 is ca. five times larger and lost together with the upstream and downstream neighboring introns cox2i381g2 and cox2i691g2 in Nothoceros aenigmaticus. Despite their extended sizes of more than 2.5 kb no maturase traces are discernible in the hornwort cox2i564g2 copies.
Family F17
Intron family F17 contains two intron paralogues with a phylogenetic distribution that could have been taken as further support for an NLE (“Non-Liverwort Embryophyte”) clade: nad1i728g2 and nad4i461g2 (Fig. 4). Both introns are absent in liverworts but particularly well conserved in mosses, hornworts, lycophytes and vascular plants with only rare exceptions including the absence of nad4i461g2 in the hornwort Leiosporoceros dussii and of nad1i728g2 in the lycophyte Isoetes engelmannii. Intron nad4i461g2 counterparts in the algae Coleochaete scutata and Zygnema circumcarinatum are large introns of 3.1 and 5.7 kb with long maturase reading frames that are continuous with the upstream nad4 coding sequence. Matching the general observations, only small traces of former maturases remain in the land plant counterparts, where the sizes of nad4i461g2 copies in the mosses are reduced to less than 800 bp.
Intron nad1i728g2 is a particularly interesting case, being the only mitochondrial intron carrying a maturase reading frame in flowering plants, widely known as “matR”, now systematically labelled mat-nad1i728g2. Moreover, intron nad1i728g2 is also known for multiple independent transitions from cis- to trans-splicing with intron disruptions either up- or downstream of the maturase ORF in flowering plants [6]. Intriguingly, a gene transfer of the matR/mat-nad1i728g2 reading frame into the nuclear genome has been identified in Pelargonium [68]. Most interestingly, three different nuclear-encoded splicing factors have already been identified, which affect the two closely related F17 angiosperm paralogues nad1i728g2 and nad4i461g2 simultaneously: EMP8 [69], DEK55 [70] and SMK3 [71]. The third paralogue in F17, cobi399g2, is so far only identified in the Coleochaete scutata mitogenome and, as in most cases in the algal mitogenomes, also carries a maturase reading frame. The intergenic region between trnM-CAU and trnA-UGC contains an F17-type intron fossil (F17g2f) in the Phlegmariurus squarrosus mitogenome (Suppl. Figure 1H).
Family F18
Family F18 contains two intron paralogues presently only identified in algae of the Zygnematophyceae (Fig. 4): nad5i537g2 is present exclusively in Zygnema circumcarinatum and nad7i777g2 is present in Z. circumcarinatum and in Closterium baillyanum, of which only the latter homologue is equipped with an intron-borne maturase ORF.
Family F19
Group II intron family F19 (Fig. 4) comprises two introns present in lycophytes, where sequence similarities are blurred by the highly divergent mitogenomes in the genera Isoetes and Selaginella and largely rely on the conserved mitogenome of Phlegmariurus squarrosus. While intron cox2i94g2 is exclusively present in all three orders of lycophytes, intron paralogue cox1i323g2 is also present in the mitogenome of Sphagnum, representing a very early branch in the phylogeny of mosses. In both Phlegmariurus and Sphagnum, cox1i323g2 contains maturase remnants. Notably, cox1i323g2 has extensive similarity with the extended intergenic region between nad9 and trnI-CAU in hornwort mtDNAs, representing yet another example for traces of an intron fossil (Suppl. Figure 1I).
Family F20
Intron family F20 comprises two intron paralogues so far exclusively identified in the mitogenome of the alga Closterium baillyanum of the Zygnematophyceae: cox3i641g2 and nad9i275g2 (Fig. 4). Neither of the two intron paralogues carries a maturase reading frame.
Family F21
Intron family F21 contains introns presently only found in Coleochaete scutata: atp1i66g2, nad5i1725g2, rrnSi435g2 and trnH-GUGi32g2 (Fig. 4). Only the nad5i1725g2 paralogue carries a maturase. The size increase of the Coleochaete scutata mitogenome in comparison to that of Chaetosphaeridium globosum, also in the Coleochaetales, had in part been ascribed to the presence of 57 vs. only 11 introns [67]. We now find the here defined family F21 particularly interesting, because additional intron homologies are also present in multiple intergenic regions of the Coleochaete mitogenome: trnMf-nad9, mttB-trnL and trnV-trnD. In all three cases, homologies sharply coincide with the intron 5’-end and extend for minimally 600 bp indicating retrotransposition rather than events of DNA recombination at their origins (Suppl. Figure 1J).
Family F22
As in the case of F21, intron family F22 also includes members presently only identified in the mitogenome of Coleochaete scutata: atp1i850g2 and nad5i362g2 (Fig. 4). Of these two, only atp1i850g2 carries a maturase reading frame.
Family F23
Intron family F23 comprises two intron paralogues that are hitherto likewise only identified in streptophyte algae: nad7i925g2 in Closterium baillyanum and rrnLi2032g2 identified in Coleochaete scutata and Nitella hyalina (Fig. 4). Only the latter carries a maturase reading frame.
Family F24
Intron family F24 comprises rrnSi1148g2 present in Coleochaete scutata, rrnLi1747g2 presently only identified in Entransia fimbriata, nad7i250g2 present in the Zygnematophyceae algae Closterium baillyanum and Gonatozygon brebissonii, rrnLi629g2 present in Entransia, Coleochaete and Nitella and trnS-GCUi43g2, the only F24 paralogue shared with embryophytes (Fig. 4). The latter is present in Chlorokybus and the Zygnematophyceae genera Closterium, Gonatozygon and Roya and highly conserved among liverworts. Moreover, trnS-GCUi43g2 is evidently present as a degenerated copy in the trnS-GCU pseudogene retained in the conserved intergenic space between trnA and trnD in the mitogenomes of mosses and the lycophyte Phlegmariurus squarrosus. An independent degeneration has occurred among hornworts with functional copies present in Leiosporoceros and Anthoceros but pseudogenes in Nothoceros and Phaeoceros. Hence three independent degenerations of trnS-GCU and its intron are evident in land plants in mosses, among hornworts and in the tracheophytes (Suppl. Figure 1K).
Family F25
Family F25 comprises two intron paralogues of phylogenetically disjunct distribution: cobi537g2 present in the Charales algae and nad5i1455g2, present in all embryophytes except the liverworts (Fig. 4).
Maturases in the plant mitochondrial lineage and extended maturase-based intron families
Our categorization of 100 group II introns into the 25 “core” families F01 - F25 outlined above is based on their primary nucleotide sequence similarities alone. To cross-check for independent confirmation and to explore further and deeper relationships, also outside of the streptophytes, we independently compiled the intron-borne maturase ORFs present in 43 streptophyte mitochondrial group II introns as seeds for identifying protein homologs. This seed query data set also included several maturases that had hitherto remained unnoticed or unannotated in database entries (e.g., the spliced variants of mat-atp9i87g2 and mat-atp9i95g2 in Phlegmariurus). The search for homologs ultimately resulted in a large protein data set that also contained related maturases of distant chlorophyte algae as well as several maturase proteins in red algae, stramenopiles, fungi, animals and bacteria. The independent phylogenetic analysis of the large maturase protein sequence collection (Suppl. Figure 2) fully confirmed the identified core families F01, F03, F10, F11, F17 and F25, all of which contain at least two intron paralogues carrying maturases (Fig. 5). Moreover, the maturase similarities identified four additional, “maturase-based” group II intron families mF26-mF29 and helped to define superfamilies (SF) of higher order that combine the core intron families and include additional, previously solitary, introns.
Owing to space limitations, we focus here on the examples of the large superfamily SF01-22 comprising families F01, F03, F22 and the two previously solitary introns cox1i652g2 and cox1i769g2 (Fig. 9A) and on superfamily SF10-28 comprising families F10, F17 and mF28 (Fig. 9B). The independent protein analysis fully confirms monophyly of the maturases in F03 and a well-supported clade of maturases in F01, now extended to include mat-cox1i769g2, mat-atp1i850g2, the free-standing maturases in the mitogenomes of liverworts and a nuclear maturase copy (mat-nuc1) in the moss Physcomitrium (Fig. 9A). Notably, the extended SF01-22 superfamily also includes homologs in fungi that have identical insertion sites and cluster with mat-cox1i652g2 in Coleochaete with high support. The extended F03 maturase clade likewise includes fungal mitochondrial maturases and a rhodophyte plastid maturase and, perhaps even more notably, a cluster of nuclear maturases in tracheophytes (Fig. 9A).
As in the above case, the independent maturase phylogeny perfectly confirms the intron assignments to families F10 and F17 and adds the maturase-based family mF28 for a joint inclusion in superfamily SF10-28 (Fig. 9B). Particularly intriguing further cases of introns giving rise to fossil paralogues are the “liverwort” introns atp1i989g2 and atp1i1050g2 (also present in hornworts), which are now jointly placed in mF28 (Fig. 4, Fig. 9B). An extended and significant sequence similarity (with perfect intron domain V and VI ends) to atp1i989g2 is present in the mitogenome of the lycophyte Phlegmariurus squarrosus, embedding the trnI-rps11 region with all nine genes in the same direction, fitting the intron orientation (Suppl. Figure 1L). Hence, it appears that a huge block of genes was inserted into an intergenic fossil paralogue of intron atp1i989g2. Interestingly, intron atp1i989g2 is absent in the early-branching liverwort genera Treubia and Haplomitrium, which could have indicated a gain within the liverworts only after the split of the Haplomitriopsida, but this now seems unlikely given the unorthodox Phlegmariurus fossil paralogue. Along the same lines, intron atp1i1050g2 has fossil intron paralogues not only in liverwort mitogenomes but also in the Phlegmariurus mtDNA behind cox1 in the spacer towards the trnW gene, running in the opposite direction (Suppl. Figure 1M).
A further, very notable insight emerges from the maturase phylogeny: mitochondrial group II introns with intron-borne maturases at the same insertion sites appear distributed across very distantly related lineages of eukaryotes. The most striking example concerns group II introns inserted into position 1147 of the cox1 gene. The solitary-type intron cox1i1147g2 is present in the mitogenome of the streptophyte alga Coleochaete, but also in those of chlorophytes, rhodophytes, fungi and metazoa. The RT domains, the X-domain and the D/En domain of the associated mat-cox1i1147g2 are highly conserved. Similarly, the peculiar intron cox1i748g2 in Equisetum arvense [72] (but not in E. diffusum), which has no similarity to its Chlorokybus counterpart, is also found in the brown alga Pylaiella littoralis and the red alga Pyropia fucicola. A third intriguing example along those lines is cox2i373g2 of Coleochaete (now placed in mF26), which has maturase-free orthologs in mosses, hornworts and tracheophytes but forms a well-supported maturase-based clade with mat-cox2i373g2 in ascomycetous fungi, e.g., the endophytic symbiont Epichloe.
The remaining “solitary” introns lacking significantly similar paralogs
Altogether 61 streptophyte group II introns lacked paralogues with significant nucleotide sequence similarity, precluding their assignment to our 25 core families. Extending the analysis to characteristic maturase similarities placed 15 of them in the four additional families mF26-mF29 and included another four previously solitary introns in superfamilies (cox1i44g2, cox1i245g2, cox1i769g2 and cox1i652g2). This leaves 42 streptophyte mitochondrial group II introns solitary, lacking both primary nucleotide similarity to paralogues and an intron-borne maturase of significant similarity to protein homologues (Fig. 5).
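The tally in the preceding paragraph can be retraced with a few lines of Python. This is only an illustrative check of the arithmetic; the counts and intron names are taken from the text, whereas the variable names are our own and do not refer to any data file of the study.

```python
# Counts as stated in the text; variable names are illustrative only.
without_nucleotide_paralogues = 61   # introns not assignable to core families F01-F25
placed_in_maturase_families = 15     # assigned to mF26-mF29 via maturase similarity

# Previously solitary introns included in superfamilies (named in the text).
added_to_superfamilies = ["cox1i44g2", "cox1i245g2", "cox1i769g2", "cox1i652g2"]

remaining_solitary = (
    without_nucleotide_paralogues
    - placed_in_maturase_families
    - len(added_to_superfamilies)
)
print(remaining_solitary)  # 42
```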
Many of the core solitary group II introns are presently very restricted in occurrence, most notably many introns presently identified in only one algal genus. This category, for example, also includes the unique trans-splicing introns nad3i84g2 and nad3i301g2 in the nad3 gene in Mesostigma viride. Among land plants this includes solitary introns exclusively restricted to mosses (atp1i1127g2 and cox1i1200g2), to hornworts (atp1i805g2, atp1i1019g2, cobi838g2, cox1i1298g2, cox2i281g2 and cox3i109g2) and to lycophytes (cobi693g2, cox1i227g2, cox1i266g2 and cox1i995g2), respectively, whereas no solitary introns are identified that are restricted to either ferns, gymnosperms or angiosperms.
Other solitary introns, however, are shared between at least two major plant clades (Fig. 5), e.g., nad4Li283g2 in liverworts and mosses or nad3i52g2 in hornworts and lycophytes. Yet others are shared among all euphyllophytes (nad5i1872g2), all tracheophytes (nad1i394g2, nad2i542g2 and nad7i917g2) or tracheophytes and at least one bryophyte clade (cox2i691g2, nad2i709g2 and nad7i140g2). An interesting further case is trnN-GUUi38g2, which is present in three algal classes likely close to the land plant lineage, absent in liverworts but clearly recognizable in pseudogenized form in mosses, hornworts and Phlegmariurus among the lycophytes (Suppl. Figure 1N).
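Because the intron labels used throughout encode the host gene, the insertion site relative to the Marchantia polymorpha homologue and the intron group, lists such as those above can be sorted by gene programmatically. The following Python sketch shows one way to do this. The regular expression and the helper names `parse_intron` and `group_by_gene` are our own assumptions for illustration, not part of the published nomenclature or of any existing tool.

```python
import re
from collections import defaultdict

# Assumed label pattern: <gene> + "i" + <insertion site> + "g" + <intron group>,
# e.g. "nad4Li283g2" -> gene nad4L, site 283, group II.
INTRON_NAME = re.compile(r"^(?P<gene>.+)i(?P<site>\d+)g(?P<group>[12])$")

def parse_intron(name):
    """Split an intron label into (gene, insertion site, intron group)."""
    match = INTRON_NAME.match(name)
    if match is None:
        raise ValueError(f"not a recognizable intron label: {name!r}")
    return match["gene"], int(match["site"]), int(match["group"])

def group_by_gene(names):
    """Collect the sorted insertion sites of the given introns per host gene."""
    sites = defaultdict(list)
    for name in names:
        gene, site, _group = parse_intron(name)
        sites[gene].append(site)
    return {gene: sorted(positions) for gene, positions in sites.items()}

# Lycophyte-restricted solitary introns named in the text.
lycophyte_solitary = ["cobi693g2", "cox1i227g2", "cox1i266g2", "cox1i995g2"]
print(group_by_gene(lycophyte_solitary))
```

Gene names that themselves contain a hyphen or a trailing capital letter (trnS-GCU, nad4L) are handled because the pattern only anchors on the final lowercase "i" followed by digits and the "g" suffix.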