Complete plastid genomes of S. vulgaris
We assembled five complete plastid genomes of S. vulgaris from Eurasia – the haplotypes D11, VS1, ZE2, KRA and KOV (Table 1). They ranged in length from 151,463 bp to 151,572 bp and contained a long single copy region (LSC), a short single copy region (SSC), and two inverted repeats (IRs) (Table 2). The boundaries between repeat and single copy regions and gene content were identical to the previously published plastid genome of S. vulgaris SD2 [14].
We identified the positions of simple sequence repeats (SSR), short arrays of tandem repeat units, in the six plastid genomes of S. vulgaris under study. We found 871 mononucleotide repeats longer than five nucleotides, but only 46 of them were longer than nine nucleotides (Additional file 1: Data Set 1). In addition, there were 62 dinucleotides, eight trinucleotides, eight tetranucleotides, and only one pentanucleotide. SSRs represent useful markers in population genetic studies owing to their variability. Within-individual variations in the number of mononucleotide units, arisen due to heteroplasmy, was observed in most mononucleotide regions. SSRs with a repeat unit higher than two nucleotides, not affected by heteroplasmy, can therefore be recommended for plastid genotyping in S. vulgaris. We found 19 positions of di-, tri-, or tetranucleotides, which varied among the analyzed plastid genomes of S. vulgaris.
Phylogenetic relationships and sequence polymorphism
Plastid haplotypes VS1 and D11 of the high mountain S. vulgaris populations, occurring at the altitudes above 1200 m a. s. l, diverged first on the phylogenetic tree constructed on the basis of concatenated protein-coding sequences and with S. latifolia as an outgroup (Fig. 1b). The same topology was confirmed when entire plastid sequences except for homopolymers larger than five were used (Fig. 1a). In the latter phylogenetic tree, the position of the KRA haplotype from Siberia and D11 haplotype from the Alps was not resolved. The results obtained by maximum-likelihood (ML) method were consistent with the outputs generated by MrBayes, except that the position of KRA and D11 was always resolved (Additional file 2: Fig. 1, Additional file 3: Fig. 2).
The number of indels in pairwise alignments of individual plastid genomes of S. vulgaris varied from 33 to 112, and nucleotide substitutions ranged from 35 to 212 (Table 2). The highest number of the polymorphisms was found in the alignments with the haplotype VS1 from the Jeseníky Mts, consistent with its divergence’s basal position within phylogenetic tree of S. vulgaris haplotypes. The indels occurred primarily in intergenic regions, but the deletion of nine bp in length, which shortened the rpl20 protein by the last three amino acids was identified in the haplotypes SD2 and ZE2. All other haplotypes (KOV, KRA, D11, VS1) harbored the longer version of the rpl20 protein.
S. vulgaris plastid genes varied in their degree of polymorphism. Thirty of 77 unique protein coding genes were identical, and additional 25 genes carried only synonymous segregating sites and therefore encoded proteins identical among the six plastid haplotypes. Only 22 genes, including accD, matK, rpoB or ycf2, carried at least one non-synonymous segregating site (Fig 2). Two highly polymorphic genes – ycf1 and ndhF – contained the vast majority of polymorphisms: 48 of the total 69 non-synonymous segregating sites. (Additional file 1. Data Set 2). A detailed inspection of Ka/Ks for these two genes revealed values > 1.0, which indicate relaxed or positive selection. This was the case in all the pairwise comparisons among the S. vulgaris haplotypes for ycf1. However, only the ndhF alignments which comprised the mountain haplotypes VS1 and D11 exhibited high Ka/Ks (Additional file 1: Data Set 3).
Plastid transcriptomes of S. vulgaris KRA and KOV
We generated plastid transcriptomes of two haplotypes of S. vulgaris KRA [17, 18] using the data sets previously employed to construct the mitochondrial transcriptomes. We compared gene coverages and RNA editing rates in flower buds between F and H individuals and between the two haplotypes.
We compared depth of coverage of protein coding genes, because rRNA was removed before cDNA library preparation and small RNAs (< 100 nt) including tRNAs were lost in the course of RNA extraction. Depth of coverage was similar in F and H plants in both haplotype KOV and KRA, no gene was differentially expressed between the genders. The depth of coverage could not be directly compared between the KOV and KRA plastid genomes, because Illumina sequencing was performed on different platforms and produced reads of different lengths for each plastid transcriptome. We therefore compared the sets of highly and lowly covered genes between the two haplotypes. The genes psbA, rbcL, psbE, rps14 and rps16 were among the most highly expressed, whereas the ndhF and psbN were the least expressed genes both in KRA and KOV plants (Additional file 1: Data Set 4.), which indicates general similarity between the two plastid transcriptomes. Introns were covered to a lower extent than adjacent exons in most plastid genes, but their coverage reached levels comparable to exons in some genes, for example in ndhB (Fig. 3a).
Antisense non-coding transcripts in plastid transcriptomes of S. vulgaris KRA and KOV
We identified five antisense non-coding transcripts > 100 nt with transcript abundance comparable to protein coding genes, overlapping trn genes (Additional file 1: Data Set 5.), three of them were found in both KRA and KOV transcriptomes, two were revealed in KOV only. The trnS-GGA and trnW-CCA - trnP-UGG antisense transcripts corresponded to the 3’UTRs of the rps4 gene and petG genes, respectively. The trnS-UGA antisense transcript colocalized with the 5’UTR of the psbZ gene (Fig. 3b, c, d).
The antisense transcript spanning the 5’UTR and the start of the rps14 gene (Fig. 4a), was the most abundant atisense transcript derived from protein coding genes in KOV, but it was absent in the KRA transcriptome. Another antisense transcript derived from the rps19 and rpl2 genes was revealed in both S. vulgaris transcriptomes under study (Fig. 4b). The psbN gene coding for a small transmembrane protein necessary for the assembly of photosystem II [23] was transcribed from minus DNA strand in sense orientation and from the opposite strand in antisense orientation as a part of the longer psbT-psbH transcript. The antisense psbN transcript exhibited much higher depth of coverage than the sense psbN transcript coding for the psbN protein (Fig. 4c).
Similarly with protein-coding genes, no statistically significant differences in antisense transcript levels were found between F and H plants. In contrast, the abundance of antisense transcripts differed between the KOV and KRA transcriptomes of S. vulgaris more than the transcript levels of protein coding genes. The most remarkable distinction was found in the antisense rps14 transcript (Fig. 4a), which was highly covered in KOV, but completely missing in KRA.
The depth of coverage in the KRA and KOV plastid transcriptome estimated by the ChloroSeq pipeline [24] was in agreement with the results stated above, which were obtained using GSNAP according to [18] (Additional file 1: Data Set 4.).
RNA editing positions in plastid genomes of S. vulgaris KRA and KOV
We identified 51 unique C to U editing sites in the plastid genomes of S. vulgaris KRA and KOV, 38 of them located in protein coding regions, 2 of them in introns, and 11 in intergenic regions. Editing sites in rRNAs and tRNAs were not evaluated due to their biased coverage caused by the sample preparation methods. Most edits (95%) in coding sequences were non-synonymous, changing the amino acid composition; only two editing sites were silent. The most frequently edited genes were ndhB (9 edits), ndhD (4 sites), and ndhA (4 sites).
We compared S. vulgaris editing sites with eight angiosperm species, for which plastid editome had comprehensively been studied: Amborella trichopoda, Cucumis sativus [25], Spirodela polyrhiza [26], Aegilops tauschii [27], Arabidopsis thaliana [28], Hevea brasiliensis [29], Nicotiana tabacum [30], Vigna radiata [31] (Table 3). The majority of the 38 edits in protein coding regions identified in S. vulgaris were either edited, or C was replaced with T at the DNA level in most angiosperms under comparison. The two silent edits were not conserved across angiosperms. A highly edited position in the rps16 intron was also edited in A. tauschii and replaced with T in DNA of A. trichopoda and A. thaliana, which may indicate its functional importance. The intergenic regions could not have been reliably aligned across the angiosperm species under comparison.
The rate of RNA editing in plastid transcriptomes of S. vulgaris KRA and KOV
The editing rate higher than 80% in at least one of the two S. vulgaris transcriptomes was determined in 26 of 38 edits in protein coding genes, all of them were non-synonymous (Additional file 1: Data Set 6). Both silent sites were edited only about 50% or less. An editing event introduced a premature stop codon in about 10% of the ndhJ transcripts, but this position was not edited in other angiosperms under comparison. Editing is necessary to create a start codon in the ndhD gene in all angiosperms, where C is present in the second position of the coding region. However, all KRA and KOV plants were edited < 15% in this position, which means that only a small portion of the ndhD transcripts encoded a functional protein.
In contrast with protein coding genes, only two of 11 editing positions in intergenic regions were edited more than 80%. One of them was located in 3’UTR of the atpH gene, the second one in the position 64,933 of the KRA plastid genome in the trnW-CCA - trnP-UGG antisense transcript.
Editing rates in the KRA and KOV plastid transcriptomes were mutually congruent, exhibiting moderate diffrences in the positions with intermediate rates 40 – 70% (Fig. 5) . The most remarkable difference was found in the position 50 of the psbZ coding region, which changed leucine for serine. No editing was observed in this position in the KRA haplotype, whereas approximately 6 % of psbZ transcripts, which represented about four hundred reads, were edited in each of six KOV plants (Table 3). Editing of this position varied across angiosperms. The same position was edited in A. thaliana and S. polyrhiza, while no editing was reported in A. tauschii, H. brasiliensis, or C. sativus. T replaced C in this position in the plastid genomes of N. tabacum and V. radiata. To verify editing of this position in Caryophyllales, we downloaded the transcriptomic data of four Silene species, Agrostemma githago and Spinacia oleracea from the SRA archive and mapped them against the psbZ sequence. We found high editing of the position 50 of psbZ in S. conica and no editing in spinach, S. noctiflora and Silene paradoxa. This position was edited to a lower extent in Silene latifolia and A. githago (Additional file 1: Data Set 7). Although the coverage of psbZ was low in most data sets, it showed variable editing across close relatives of S. vulgaris, the pattern similar to scattered editing at high taxonomic level.
No statistically significant differences of editing rates between F and H individuals were observed between the KOV or KRA haplotypes. The estimates of editing rates provided by the GSNAP [18] and the ChloroSeq pipeline [24] were consistent (Additional file 4: Figure S3).