Comparative plastid genomics
Sequencing of liverwort plastid genomes is gaining ground. The GenBank database currently holds complete plastome sequences of five complex thalloid liverworts, three simple thalloid liverworts and 14 leafy liverworts. Here we present 22 newly sequenced chloroplast genomes of 10 Calypogeia species (leafy liverworts). The structure of these plastomes is typical of most plants: two inverted repeat (IR) regions separated by a large single-copy (LSC) and a small single-copy (SSC) region (Fig. 1). Chloroplast genome length appears to vary not only at the genus level, but also at the species level and even between individuals. The plastome of Calypogeia is 824–1 366 bp shorter than the longest known leafy liverwort chloroplast genome, that of Gymnomitrion concinnatum [48], and 1 880–2 422 bp shorter than the longest known thalloid liverwort chloroplast genome, that of Dumortiera hirsuta [49]. Plastid genome length also varied among Calypogeia species, similarly to the cryptic species of Aneura pinguis [32]. Moreover, plastome length differed within a single species (Table S1). A similar case was recorded for Marchantia polymorpha subsp. ruderalis, for which two independent research teams obtained two different chloroplast genome lengths: 120,457 bp [50] and 120,304 bp [51].
The GC content of the Calypogeia plastome is 34.6%, almost the same as in Gymnomitrion concinnatum (34.5%), the most recent leafy liverwort for which a complete chloroplast genome has been sequenced [48]. This value falls within the range known for other liverwort species, from 28.8% in Marchantia paleacea [39] to 40.6% in Aneura mirabilis [52].
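As an illustration of how a GC content value such as those quoted above is obtained, the following minimal sketch computes the percentage of G and C among the unambiguous bases of a sequence. The function name and the toy sequence are hypothetical and are not taken from the study; skipping gaps and ambiguous bases is an assumption about how such positions would be handled.

```python
def gc_content(sequence: str) -> float:
    """Return GC content in percent, counting only unambiguous bases (A, C, G, T)."""
    # Gaps ('-') and ambiguity codes (e.g. 'N') are skipped: an assumption of this sketch.
    counted = [base for base in sequence.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return 100.0 * gc / len(counted)


# Hypothetical toy sequence, not real plastome data.
toy_sequence = "ATGCATTTAA"
print(f"GC content: {gc_content(toy_sequence):.1f}%")
```

For a real plastome the same function would be applied to the full sequence read from a FASTA or GenBank record.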
A comparative analysis of three leafy liverwort plastomes (Ptilidium pulcherrimum, Gymnomitrion concinnatum and the Calypogeia sp. sequenced here) revealed similar gene content and order. All of them contain 122 unique genes: 81 protein-coding genes, 6 genes of unknown function (ycf), 31 tRNAs and 4 rRNAs. However, in Ptilidium pulcherrimum cysA (in the LSC region) and cysT (in the SSC region) are pseudogenes [53], whereas in the chloroplast genomes of Calypogeia and Gymnomitrion concinnatum, as in complex thalloid liverworts (Marchantia paleacea, Marchantia polymorpha, Dumortiera hirsuta), these genes are functional. The ycf68 motif has been annotated as a pseudogene in the Calypogeia plastid genome. In other liverwort species it is likewise recorded as a nonfunctional gene (e.g. Aneura mirabilis, Gymnomitrion concinnatum) or omitted from the plastome description (e.g. Pellia endiviifolia, Aneura pinguis, Dumortiera hirsuta, Marchantia sp.). Forrest et al. [53], on the other hand, are uncertain about the functionality of the hypothetical ycf68 gene in Ptilidium pulcherrimum. The motif is reported in many vascular plants, but as a functional gene only in a few lineages: Stipa sp. [28], Lolium multiflorum and Festuca pratensis [54]. Raubeson et al. [55] suggest that ycf68 could be a pseudogene, but that RNA editing, which occurs in the chloroplast genomes of many plants, may restore a fully functional gene.
Phylogenetic relationships
The phylogenetic relationships among the studied Calypogeia species inferred from whole chloroplast genomes are, in general, consistent with previous studies of the genus [56]. The whole-plastome analysis confirmed a close relationship between C. muelleriana and C. integristipula (Fig. 3). Previous studies [41, 46] revealed that C. muelleriana is an allopolyploid, whereas C. integristipula is a haploid species; it can therefore be assumed that C. integristipula was one of the parents of C. muelleriana and the donor of its chloroplast genome. C. sphagnicola and C. paludosa, which were originally considered forms of C. sphagnicola, i.e. C. sphagnicola f. sphagnicola and C. sphagnicola f. paludosa [57, 58], belong to two different clades, which supports the hypothesis that they represent genetically distinct species [44]. Calypogeia sphagnicola belongs to the same clade as C. suecica (both haploid species), while the allopolyploid C. paludosa forms its own distinct clade, sister to the clade containing C. azurea, C. azorica and C. fissa. Moreover, our study revealed high variation within C. suecica, indicating cryptic speciation within this species. C. suecica is an obligate xylicole, almost restricted to moist decorticated logs, and shows low morphological variability [57]. However, two cytoforms of C. suecica, n = 9 and n = 18, have been reported in Europe by Lorbeer [59] and Paton [60], respectively, which may support the hypothesis that an unrecognized species is present within C. suecica. Our results indicate that C. suecica requires further molecular and morphological study.
Hot-spots and DNA barcoding
Analyses of variability across whole liverwort chloroplast or mitochondrial genomes are rare. To date, such an analysis has been carried out only among cryptic species of Aneura pinguis [32]. Our results, obtained for a group of species belonging to one genus, are therefore difficult to compare with those for the A. pinguis species complex.
Research on A. pinguis has shown that, among protein-coding regions, ycf1 and ycf2 are among the most variable genes [32]. Similarly, in our study of Calypogeia, three genes of the ycf family (ycf1, ycf2 and ycf66) were among the five most variable coding regions, which makes them potential DNA barcodes. In the past few years, the usefulness of ycf1 and ycf2 for plant species identification has been reported increasingly often [29, 61, 62, 63]. Two regions of the ycf1 gene in particular, ycf1a and ycf1b, are highly variable and can serve as effective barcodes for land plants. In woody plants, the ycf1b fragment has been shown to perform better than matK, rbcL or trnH-psbA applied individually, and slightly better than the combination of matK and rbcL [64]. On the other hand, in the discrimination of Paris species the ycf1a fragment was more effective than the ycf1b fragment alone [65]. Nevertheless, the discrimination success of the ycf1b fragment (about 72%) reported by Dong et al. [64], and of the ycf1a fragment both alone (52.63%) and in combination with ycf1b (89.47%) [65], was lower than that of the entire ycf1 gene sequence in our study of Calypogeia species (over 95%). However, anyone intending to use ycf1 and ycf2 as barcodes should keep in mind the limitations of these sequences in amplification. Both genes are quite long (in Calypogeia, ycf1 is 3 147 bp and ycf2 is 6 216 bp), and recovering their entire sequences in a PCR reaction would be a challenge. It is not without reason that Dong et al. [64] used only the most variable ycf1 fragments with the highest resolution power as barcodes for woody plants. Our results indicate that 500 bp-long fragments of ycf2 are the most promising for Calypogeia species delimitation.
The first 13 positions in the list of the most variable 500 bp fragments of the Calypogeia plastome were all occupied by pieces of the ycf2 gene (Table 1), which could therefore serve as potential DNA barcodes.
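A ranking of fixed-length fragments by variability, as described above, can be produced with a sliding window over a plastome alignment. The sketch below is only an illustration under stated assumptions: variability is measured as the mean pairwise proportion of differing sites (a simple nucleotide diversity estimate), the step size is arbitrary, and the function names and toy alignment are hypothetical. The software and settings actually used are not described in this section.

```python
from itertools import combinations


def pairwise_difference(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap or an ambiguous base are skipped.
    """
    compared = 0
    differences = 0
    for x, y in zip(a, b):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            differences += x != y
    return differences / compared if compared else 0.0


def sliding_window_diversity(alignment, window=500, step=100):
    """Return a list of (window_start, mean_pairwise_difference) tuples.

    `alignment` is a list of at least two equal-length aligned sequences.
    The default window matches the 500 bp fragments discussed in the text;
    the default step of 100 bp is an assumption made for this sketch.
    """
    length = len(alignment[0])
    if len(alignment) < 2 or any(len(seq) != length for seq in alignment):
        raise ValueError("need at least two aligned sequences of equal length")
    results = []
    for start in range(0, length - window + 1, step):
        chunks = [seq[start:start + window].upper() for seq in alignment]
        pairs = list(combinations(chunks, 2))
        mean_diff = sum(pairwise_difference(a, b) for a, b in pairs) / len(pairs)
        results.append((start, mean_diff))
    return results


# Hypothetical toy alignment (8 columns), scanned with 4-column windows.
toy_alignment = ["AAAAAAAA", "AAAAAAAT", "AAAAAACT"]
windows = sliding_window_diversity(toy_alignment, window=4, step=4)
most_variable = max(windows, key=lambda item: item[1])
print(windows)
print("most variable window starts at column", most_variable[0])
```

Sorting the returned list by the second element gives a ranking of fragments comparable in spirit to the list of most variable 500 bp fragments.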
The discriminatory power of the 10 most variable protein-coding regions in the genus Calypogeia was in most cases high, at 95.45% (Table 1). Only ndhB, ycf66 and rpoC2 had different values: 100%, 90.9% and 90.9%, respectively. While ndhB occurs commonly in plant chloroplast genomes, ycf66 does not necessarily: its presence was not reported in Aneura mirabilis [52] or Aneura pinguis [32]. Similarly, the occurrence of cysT, the most variable protein-coding sequence in Calypogeia, varies among liverworts. Despite the large potential of this gene as a barcode, it is lacking in Ptilidium pulcherrimum [53] and Aneura species [32, 52]. In these species cysT functions as a pseudogene, so high variability is justifiable in this case [28]. Nevertheless, the literature does not mention cysT or ycf66 as DNA barcodes. Likewise, no one has reported ndhB as an effective plant barcode, yet it appears to be one of the core sequences for species discrimination in Calypogeia. On the other hand, Krawczyk et al. [28] pointed out the potential of the ndh gene family in species identification, indicating ndhH as the best-performing locus for Stipa. Although ndhH was not listed among the most variable coding regions in the Calypogeia chloroplast genome (and was therefore not tested alongside them), its discrimination power was 100%. The rpoC2 gene, reported to belong to the relatively fast-evolving rpo genes [66], was slightly less effective but still performed well in species identification. Its fast evolution was confirmed in our analyses by the high polymorphism of this sequence (Table 1). Reports of rpoC2 as a barcode have recently become more frequent [22, 67, 68, 69].
The Consortium for the Barcode of Life (CBOL) Plant Working Group recommended two loci, rbcL and matK, as core DNA barcodes for plants [23]. In our study matK was third on the list of coding regions with the highest variability and correctly identified 21 out of 22 sequences. Unfortunately, matK is reported to be troublesome to amplify in bryophytes and ferns [23, 70, 71]. It is therefore inconclusive whether the use of matK in species identification can be extended to bryophytes [71]. In contrast, PCR success for rbcL is reported to be high [23, 70, 71], but its capacity to distinguish specimens at the species level is mediocre [23]. Although rbcL was not among the most variable coding regions in Calypogeia, its discriminatory power (90.9%) was almost the same as that of matK (Table 1, Table S2). High resolution power of rbcL has also been reported among bryophyte species [21, 70, 71], and some researchers have noted its potential as a barcoding marker for bryophytes [25, 32, 72]. However, Stech and Quandt [73] state that in bryophytes generally rbcL exhibits low variation at the family level and is therefore not useful for DNA barcoding among the early land plants. CBOL, on the other hand, recommended the combination of these two loci as the pivotal barcode for land plants. In our tests, combining rbcL and matK did not raise the discriminatory power, which remained the same as for matK alone (Table 1).
Liu et al. [70] also mentioned the rpoC1 and rps4 regions as good potential barcodes for mosses. Indeed, the resolution success of these sequences in Calypogeia was considerable (95.45%; Table S5).
Among non-coding regions, a resolution success of 100% in the genus Calypogeia was achieved by the trnT-trnL spacer, previously tested in the tribe Stipeae [28, 74, 75]. There, however, this spacer alone was not variable enough to give satisfactory results. In the literature the trnT-trnL spacer is not mentioned as a potential barcode in bryophytes (only as a phylogenetic marker [73]), in contrast to the following regions: trnH-psbA [20, 70], atpF-atpH, psbK-psbI [76] and trnL-trnF [20, 73]. The trnH-psbA spacer is one of the plant DNA barcodes recommended by the CBOL Plant Working Group [23]. In Calypogeia, however, as in the genus Solidago [77], it is not informative because of its short length (only 131 bp). As a consequence, trnH-psbA is proposed for use in two- or three-locus barcodes to provide acceptable resolution [77, 78].
Similarly, the remaining spacers proposed for bryophytes were too short (71–288 bp) for the identification of Calypogeia species. Moreover, it is questionable whether the atpF-atpH and psbK-psbI spacers could be amplified by PCR without problems: low amplification rates of these regions were reported in mosses by Liu et al. [70]. The trnL-trnF spacer, on the other hand, is reported to be longer in some liverwort species and to amplify with high success [20, 71]. In the genus Calypogeia, however, trnL-trnF is only 71 bp long. Consequently, we tested fragments of non-coding regions at least 400 bp long, following Hebert et al. [19], who reported that the standard barcode is 400–800 bp long. Theoretically, shorter sequences can also be applied as DNA barcodes: so-called mini-barcodes (100–250 bp) or even micro-barcodes (within 100 bp) [79, 80]. However, these types of barcodes are taxon-specific rather than universal [81]. Currently, it is realistic to search the whole chloroplast genome for the fragments most informative for species identification. As the sliding window analysis demonstrated, mini-barcodes for Calypogeia should be sought within the ycf2 gene or between the psbA and ycf2 genes.
Our research shows that the Calypogeia plastome contains many regions with potential as barcodes, and the best match/best close match analysis demonstrated that the whole chloroplast genome can be used as a barcode. On the basis of the entire plastome data, we found that most species tested in our study showed a barcoding gap. Only one individual, belonging to C. suecica, was incorrectly identified on the basis of both entire plastome sequences and selected chloroplast regions. PTP analysis indicated the two representatives of C. suecica as two separate species (Table 2), which is in accordance with the results on the variability of Calypogeia chloroplast genomes. The plastomes of the two representatives of C. suecica were 97.7% identical, which indicates quite substantial differences given that the pairwise identity of all studied plastome sequences of the genus Calypogeia was 95.6%. This is probably due to cryptic speciation within C. suecica. However, three regions, the ndhB and ndhH genes and the trnT-trnL spacer, resolved the species correctly (100% discrimination power).
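The logic of a best close match test can be sketched as follows. Each sequence is used in turn as a query against all the others; it is scored as correct when its closest sequences all belong to its own species, ambiguous when equally close sequences include both its own and other species, incorrect when the closest sequences belong only to other species, and unmatched when the closest distance exceeds a threshold. This is a simplified illustration, not the procedure used in the study: the uncorrected p-distance, the threshold value, the function names and the toy data are all assumptions.

```python
def p_distance(a: str, b: str) -> float:
    """Uncorrected proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def best_close_match(query_id, sequences, species, threshold):
    """Classify one query as 'correct', 'ambiguous', 'incorrect' or 'no match'."""
    query = sequences[query_id]
    distances = {
        other_id: p_distance(query, seq)
        for other_id, seq in sequences.items()
        if other_id != query_id
    }
    best = min(distances.values())
    if best > threshold:
        return "no match"
    # Species of all sequences tied for the smallest distance.
    best_species = {species[i] for i, d in distances.items() if d == best}
    if best_species == {species[query_id]}:
        return "correct"
    if species[query_id] in best_species:
        return "ambiguous"
    return "incorrect"


# Hypothetical toy data set: two species with two individuals each, one singleton.
toy_sequences = {
    "a1": "AAAAAAAAAA",
    "a2": "AAAAAAAAAT",
    "b1": "CCCCCAAAAA",
    "b2": "CCCCCAAAAT",
    "c1": "GGGGGGGGGG",
}
toy_species = {"a1": "sp_A", "a2": "sp_A", "b1": "sp_B", "b2": "sp_B", "c1": "sp_C"}

outcome = {
    seq_id: best_close_match(seq_id, toy_sequences, toy_species, threshold=0.2)
    for seq_id in toy_sequences
}
print(outcome)
```

In this toy set the singleton cannot be matched to a conspecific sequence, which illustrates why species represented by one divergent individual are hard to identify with distance-based tests.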
Super-barcoding turned out to be slightly less effective for the studied liverwort genus (95.45% discrimination power) than the traditional barcoding approach (100% discrimination power). However, some of the plastid regions with 100% efficiency are very long (ndhB: 1503 bp; ndhH: 1182 bp), and amplifying their whole, intact sequences could be problematic. On the other hand, using the plastid genome as a marker solves the problems of low PCR efficiency and gene loss [82]. Li et al. [27] proposed a new approach to plant DNA barcoding (the so-called “1+1 Model”) that combines super-barcoding and single-locus barcoding. This method consists in developing a “specific barcode”, which is derived from the chloroplast genome of the target plant group and is variable enough to allow species recognition. Testing the 10 most variable DNA fragments, we found the most specific barcodes for Calypogeia species among the protein-coding regions (Table 1). Seven genes correctly assigned 21 out of 22 sequences to species, two loci (ycf66, rpoC2) assigned 20 out of 22 sequences, and one locus (ndhB) correctly identified all individuals. Protein-coding regions were the least variable in comparison with non-coding regions and fragments generated by the sliding window approach. The latter method yielded the most variable plastid DNA fragments longer than 400 bp, but unfortunately its efficiency in species discrimination, like that of non-coding regions, was lower. Our results show that even a region of average variability can be a good barcode, such as the ndhH gene, which ranks 51st among the most variable protein-coding regions in the genus Calypogeia.
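The discrimination power percentages quoted in this section follow directly from the counts of correctly identified sequences out of the 22 analysed. The short sketch below reproduces that arithmetic; the function name is hypothetical.

```python
def discrimination_power(correct: int, total: int) -> float:
    """Percentage of sequences assigned to the correct species."""
    if total <= 0 or not 0 <= correct <= total:
        raise ValueError("counts must satisfy 0 <= correct <= total and total > 0")
    return 100.0 * correct / total


TOTAL_SEQUENCES = 22  # number of plastomes analysed, as stated in the text

for correct in (22, 21, 20):
    power = discrimination_power(correct, TOTAL_SEQUENCES)
    print(f"{correct}/{TOTAL_SEQUENCES} correct -> {power:.2f}%")
```

Thus 21 of 22 correct identifications correspond to 95.45%, and 20 of 22 correspond to about 90.9%.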