Genome sequencing and assembly
A summary of the sequencing data obtained from Illumina sequencing and assembly of A. retroflexus, B. muricata, B. cycloptera, B. sinuspersici, H. ammodendron, S. aralocaspica, S. eltonica, and S. maritima chloroplast genomes is presented in Table 1. Three large contigs with overlapping 5’ and 3’ regions were generated during genome assembly for A. retroflexus, B. muricata, B. cycloptera, B. sinuspersici, H. ammodendron, S. aralocaspica, and S. maritima. These three contigs were identified as LSC, SSC, and IR via BLAST homology alignment [42], GE-Seq - Annotation of Organellar Genomes[43] and DOGMA gene identity prediction [44]. The overlapping regions were present at all four possible junctions when the IR region was reverse complemented (LSC-IR, IR-SSC, SSC-IR, and IR-LSC). These overlapping areas ranged from 19 to 51 nt (illustrated in Fig. S1 with B. cycloptera). The directionality of the LSC, SSC and IR, and all overlapping aligned junctions were validated via Sanger sequencing of both strands of the amplicons generated from these regions (Table S1; Fig. S1). For S. eltonica, the LSC-IRa and IRb-LSC overlapping regions were 23 nt long and were validated with Sanger sequencing (Table S1). The IRa-SSC and SSC-IRb sections were both missing a 1,475 nt section in the IRa and IRb borders. The 300 nt sequence contiguous to the 1,475 nt section had a low GC content of 19%. A possible cause of the shortened contig flanking the IR-1475 area may be due to the low GC content value which could impact the accuracy of the HTS genome assembly [45]. The 1,475 nt section was sequenced by primer walking and Sanger sequencing (Table S1). The GC content in the 1,475 nt region and IR was 31.3% and 42.1%, respectively.
The average base depth of coverage for the eight assembled chloroplast genomes ranged from 1553 to 5998-fold. For accurate assembly a minimum of 30-40x sequence coverage is recommended [46–48]. In this study, the only areas with less than 40x average coverage were identified in the last 1 to 3 nucleotides of the IRb sequence for each of the eight genomes. This is expected due to the assembler algorithm parameters. The end of the IRb and the beginning of the LSC were concatenated and these sections were remapped. Remapped coverage results were reported to be above 40x for the IRb ends and surrounding areas. The eight assembled genomes (0.8/0.9 for the read length fraction/similarity fraction mapping) were also compared with a more stringent remapping of the reads to the contigs of 0.99/0.99 length fraction/similarity fraction. Analyses with both levels of stringency show almost identical assembly minimum-coverage and average-coverage for the eight species sequenced in this study (Fig. S2).
Overall, the assembly and subsequent Sanger sequencing-based validation generated high quality and complete chloroplast genomes with all possessing a quadripartite structure as reported in other land plant species.
Size, organization and gene content of the chloroplast genomes
The size of the chloroplast genomes from the eight species ranged from 146,634 to 161,251 nt (Table 2). As expected, each chloroplast genome included a pair of inverted repeat regions, IRa and IRb, separated by an SSC and an LSC region (Table 2 and Fig. S3). With one exception, the size of the IRs ranged from 23,461 to 25,213 nt. (Table 2). The H. ammodendron inverted repeat sequence presented an instance of IR length expansion (29,061 nt) compared to the other seven species. The GC content was similar among the eight species and for all the plastomes, LSC, SSC and IRs it ranged from 36.4-36.6, 34.1-34.6, 29.1-30.2 and 42.1-43.0%, respectively (Table 2). All chloroplast genomes contained a similar number of protein coding, ribosomal, and tRNA genes. The number of genes and tRNAs ranged from 113 to 116 and 27-29, respectively in the eight genomes (Table 3 and Fig. S3). For seven out of eight species, 60.1-61.9% of the chloroplast sequence consisted of coding region, which included 52.7-54.3% of protein coding genes and 7.4-7.9% of RNA genes. The S. eltonica chloroplast genome was composed of 56.8% coding region including 48.9% of protein coding genes and 7.9% of RNA genes. This difference between S. eltonica and the rest of chloroplast genomes is possibly due to the higher repeat content in intergenic sequences of the S. eltonica chloroplast genome (Table 4 and Fig. 1).
Gene order and content were largely conserved among the eight chloroplast genomes in this study. However, some structural rearrangements, gene losses and IR expansions were identified. The genes ycf15, ycf68, and rpl23 were identified as pseudogenes due to the presence of internal stop codons. The ycf15 and ycf68 genes are quite commonly classified as pseudogenes in angiosperms [23,49]. The rpl23 is also classified as a pseudogene in some species such as the Fagopyrum spp., buckwheat, and spinach as well as Suaeda and Haloxylon species [22,23,50,51]. In S. eltonica, rpl23 was not predicted to be in the chloroplast genome by GeSeq but it was identified as a pseudogene via the BLAST sequence analysis [42]. No stop codons were identified in the rpl23 of a previously published B. sinuspersici chloroplast genome [52]. In this study, 4 stop codons were identified at the same locations for B. sinuspersici and its close relative B. cycloptera.
At least one complete copy of the ycf1 gene was identified in the eight chloroplast genomes (total length of 5.3-5.6 Kb). In seven out of the eight chloroplast genomes, a duplicated ycf1 pseudogene (1,000-1,300 nt) was found at the IRa-SSC boundary. This is a common feature found in other species [23,53]. In the case of H. ammodendron, there is a complete duplication of the ycf1 gene, therefore the H. ammodendron chloroplast genome has two full copies in the IR-SSC borders. The complete duplication of the ycf1 gene in H. ammodendron leads to the previously mentioned IR expansion (Fig. S3). This phenomenon has also been observed in Amphilophium, Adenocalymma, Anemopaegma, and Fagopyrum species; these species possess an expanded IR region and two full-length copies of ycf1 gene [23,54,55]. The IRs for the other seven species are variable in length. In A retroflexus, B. muricata, B. cycloptera, B. sinuspersici, S. aralocaspica and S. maritima, the IR includes the duplicated ycf1 pseudogene (1-1.3 kb) (Fig. S3). A small segment of the ycf1 gene is also duplicated in V. vinifera, S. oleracea and B. vulgaris. In S. eltonica, the IR has expanded to include the trnH-GTG and a fragment of the psbA gene (Fig. S3). The biological significance of this duplication remains unknown.
Annotation of the ycf15 gene with the Dual Organellar Genome Annotator (DOGMA) [44] shows variability in terms of its physical location. In A. retroflexus, B. vulgaris and S. eltonica, the ycf15 is located between the rps12 and trnV-GAC. In B. cycloptera, B. muricata, B. sinuspersici, H. ammodendron, S. aralocaspica, and S. maritima the ycf15 is located between ycf2 and trnL-CAA. The ycf15 as well as other genes, such as the ycF2, psbA, clpP, and matK, have been reported to have variable physical location in different plants [56–59].
The genes ycf3, clpP, rpoc1, and rpl2 have been found to have a variable number of introns among and within some taxonomic groups [23]. The gain or loss of introns in these genes have occurred independently in several linages of flowering plants [23,60]. However, no differences were found in the number of introns among the eight species; the ycf3, clpP, rpoc1, and rpl2 contain 2, 2, 1, and 0 introns, respectively.
The orientation of the SSC region in A. retroflexus, and B. muricata differs from the orientation of the SSC in B. cycloptera, B. sinuspersici, H. ammodendron, S. aralocaspica, S. maritima and S. eltonica (Fig. S3). The SSC orientation has been shown to exist in the two different states within individual plants [61–64]. Therefore, SSC variation observed among taxa in this study is likely due to alternative states of the SSC region within individual plants. Although there was some variation in the SSC orientation, the number and content of genes was the same among the eight species. The only exception is the presence of a trnU-TCA in the SSC of H. ammodendron.
Repeat structures and microsatellites
Seven out of the eight chloroplast genomes had 45 to 58 repeats, which ranged in length from 30 to 73 nt per repeat (Fig 1). The majority of these repeats were shown to be between 30 and 40 nt in length. In the S. eltonica chloroplast genome, repeat analysis with REPuter [65] found a total of 174 repeats which ranged from 30 to 145 nt in length (Fig 1). The number of repeats was similarly distributed among species for repeats found in intergenic regions and intron/exons (Table 4). An exception was S. eltonica in which a majority (80%) of repeats were located in the intergenic regions. Four species possessed reverse repeats; S. maritima and S. aralocaspica had one, B. muricata had two, and S. eltonica had four.
The presence of repeats varied for the genes ycf1, ycf2, ycf3, and psaA. Repeats were present in the gene ycf1 except for A. retroflexus, S. aralocaspica and S. maritima. All chloroplast genomes possessed repeats in the ycf2 gene except for H. ammodendron. Repeats in the introns of the ycf3 gene were only present in the A. retroflexus, B. cycloptera, and B. sinuspersici. All species presented at least one repeat in the psaA gene and H. ammodendron presented the highest number with six repeats.
Microsatellites, or simple sequence repeats (SSRs), were identified in the eight chloroplast genomes. The total number of microsatellites ranged from 41 to 72 of which the majority, 36 to 64, represent mononucleotide repeat microsatellites (Table 5). The complete list of microsatellites identified for each of the eight chloroplast genomes and their positions in the respective genomes is provided in Table S2.
Comparison of Amaranthus retroflexus chloroplast genome with previously sequenced Amaranthus spp. chloroplast genomes
A. retroflexus, commonly known as pigweed, is used as a vegetable for human consumption as well as for fodder. It is the most widely distributed and damaging Amaranthus weed in the US and the world [66]. Availability of the A. retroflexus chloroplast genome provides an important tool for accurately monitoring the spread of this species and identifying possible hybridizations. Microsatellites were previously identified for Amaranthus spp. [67]. Six out of the nine polymorphic microsatellites were shown to be polymorphic between A. hypochondriacus and A. retroflexus (Table 6). Most of these microsatellites were located in the LSC regions and represented A or T mononucleotide repeats. SSRs can serve as molecular markers for future molecular breeding for Amaranthus spp. which are considered as emerging crops [67]. The chloroplast genomes of four Amaranthus spp; A. hypochondriacus, A. cruentus, A. caudatus, and A. hybridus, have been reported previously [67]. The A. hypochondriacus genome (GenBank accession KX279888.1) is 150,725 nt and the quadripartite regions of LSC, SSC and 2 IRs consist of 83,873, 17,941 and 24,352 nts, respectively. These sizes are very similar to the lengths of the A. retroflexus chloroplast genome reported in this study (Table 2). BLAST analysis showed a 99% sequence similarity between the chloroplast genomes of A. hypochondriacus and A. retroflexus.
Comparative analysis of the B. sinuspersici chloroplast genomes
Kim et al. (2016), and Caburatan et al., (2018) previously reported the chloroplast genome of B. sinuspersici (GenBank accession no. KU726550). Compared to our results with B. sinuspersici (Table 2), the size of their genome (153,472 nt) is 138 nt larger; the LSC and SSC in their study are 84,560 nt and 19,016 nt in size, respectively which is 70 nt larger than in our study (Table 2). The IR was reported to be 24,948 nt in length, versus 24,949 nt length in this study. The increase in length in the published B. sinuspersici chloroplast genome [52] is predominantly located at the LSC-IRa and SSC-IRb junctions, which has a repeat of 72 and 13 nts respectively. The two repeats are separated by spacer sequences of 1nt in the LSC-IRa junction and 48 nt in the SSC-IR junction. The 72 and 13 nt sequences were present just once in the B. sinuspersici chloroplast genome presented in the current study. The presence of a single occurrence of the 72 and 13 nt sequence in the genome was validated by Sanger sequencing of loci in question for both IRb-LSC and LSC-IRa loci (Table S1). Further comparison of the two B. sinuspersici genomes identified 18 SNPs and 9 indels. In the published B. sinuspersici chloroplast genome, the LSC is inverted with respect to the rest of the sequence (IRa+SSC+IRb). In our study, the orientation of the LSC was validated using Sanger sequencing of PCR amplicons spanning the junctions IRb-LSC and LSC-IRa (Table S1). As described above, there were also differences in the presence of stop codons in the rpl23 gene. In the previous study [68] a total of 110 unique genes were reported; a total of a total of 114 genes were identified in the current study (Fig. S3).
Differences between the previously reported chloroplast genome of B. sinuspersici compared to the current study likely stems from how the Celera assembler algorithm and the CLC algorithm process the read data. Each of these algorithms have their inherent pros and cons [69]. The assembly parameters for the previous B. sinuspersici chloroplast genome were not reported. Also, the chloroplast genome loci that were found to be different within the two previous versions [52,68] were not resequenced. The chloroplast genome of B. sinuspersici presented in this study showed a minimum, maximum and average coverage of 37, 23,533, 3,204.28 nt. Furthermore, areas of ambiguity were validated via Sanger sequencing of PCR amplicons generated from selected loci. The combination of the assembly strategy utilized, and resequencing of loci, resulted in the generation of an improved version of the B. sinuspersici chloroplast genome.
Analysis of the two closest SCC4 related species, B. cycloptera and B. sinuspersici, chloroplast genomes showed a 99.70% sequence similarity between both sequences. B. cycloptera and B. sinuspersici chloroplast genomes differed in overall length by seven nt. B. sinuspersici IR, and SSC regions were larger than the B. cycloptera by 44 nt and B. cycloptera’s LSC region was larger by 51 nt. The difference in size was due to changes in the intergenic region, length, and number of repeat regions. Number of genes with introns and repeats was the same between the two species. B. cycloptera had two larger repeats, one between 40 to 44nt and the second greater than 45nt. B. sinuspersici had one smaller repeat of 30 to 34nt. Both species had the same number and identity of protein-coding, tRNA, and rRNA genes.
Comparative analysis of Haloxylon ammodendron chloroplast genomes: a case of transfer of mitochondrial DNA to the plastid genome
The chloroplast genome of H. ammodendron was published recently (GenBank accession no. KF534478) [70]. The size of the chloroplast genome was reported to be 151,570 nt, with a LSC of 84,214 nt, SSC of 19,014 nt and two IRs of 24,171 nt [70]. In our study, the genome assembled to a size of 161,251 nt, which is 9,681 nts larger. BLAST alignment of the two genomes indicated that the additional 9,681 nts were derived from the expansion of the IR, which is 4,868 nt in size. The IRs of H. ammodendron chloroplast genome in our study were 29,061 nt long. This represents an expansion of the IR that is also observed in S. eltonica (Table 2). Expansion and gene duplication are common phenomenon in the IR regions of chloroplast genomes [71,72]. In grasses, the junctions between the IR and SSC regions are highly variable with the ends of genes ndhF, rps19, and ndhH repeatedly migrating into and out of the adjacent IR regions [73]. BLAST alignment between the two genomes revealed that the first 115 nt showed 78% homology with chloroplast sequences of H. persicum, and H. ammodendron present in the IRs of the published genomes [70]. The following region of 671 nt did not show any significant similarity and the last 4,028 nt showed homology to mitochondrial genome sequences. The highest significant hit (94%; E value = 0.0) for this 4,028 nt section resembled Beta vulgaris and Spinacia oleraceae. Interestingly, annotation identified the mitochondrial gene Cytochrome b (cob) in this 4,814 nt section, although the plastid copy had a nonsense mutation that resulted in a premature stop codon.
Evidence showing transfer of mitochondrial DNA (mtDNA) or nuclear DNA (nucDNA) to the plastid genome in plants had been lacking until recently. A few recent reports indicate that plastid genomes of carrot [74], milkweed [75], and bamboo [73] show evidence of gene transfer from mitochondria to the plastid. Daucus carota has a 1.5 kb region of mitochondrial origin located in the rps12-trnV intergenic space of the chloroplast genome. Only Daucus species and the close relative cumin (Cuminum cyminum) show the mitochondrion-to-chloroplast gene transfer [74]. It was concluded that a mitochondria-located DNA segment present in the ancestor of the Apiaceae subsequently moved to the plastid genome in the common ancestor of Daucus and Cuminum. Asclepias syriaca, the common milkweed, has a 2.4 kb mtDNA-like insert in the chloroplast genome. The mtDNA-like insert contains an intact exon of the mitochondrial ribosomal protein (rpl2) as well as a noncoding region [75]. There was a 92% sequence identity between the mitochondrial and plastid version of rpl2 in A. syriaca whereas the plastid copy had a nonsense mutation resulting in a premature stop codon. Similarly, the IR region in three herbaceous bamboo species of the Pariana genus had a 2.7 kb insertion [73]. The insertion was located in the trnI-CAU-trnL-CAA intergenic spacer region. Potential variations of this insertion in another Pariana species and species from the sister genus Eremitis were also reported. These studies suggest that the transferred sequence may have originated as a single event in a common ancestor; however, the inserted sequence evolved rapidly [73].
In our study, the inserted section in H. ammodendron had an average coverage of 1,320 x reported from the stringent 0.99-0.99 length fraction/similarity mapped to the assembly. The coverage corresponded well to the average coverage of 1,269 X for other regions. Five kb regions flanking the 4.8kb section had a similar coverage of 929 and 1,066 reads. The Illumina reads from H. ammodendron (0.99-0.99 99 length fraction/similarity fraction) were mapped to three randomly selected intronless mitochondrial genes identified from the H. ammodendron assembly [73]. The mitochondrial genes ccmFN, matR and rrn26 showed a much lower average coverage of 242, 211, and 447, respectively. Thus, the mapping results supported the result that the insertion in the H. ammodendron chloroplast genome was not an artifact of the assembly.
Since the H. ammodendron chloroplast genome reported in this study was assembled from reads obtained using total cellular DNA, the origin of 4.8 kb insert was confirmed using a complementary Sanger sequencing approach. Amplified segments flanking the entire 4,814 nt insertion were 6,607, 7,172 and 8,132 nt long with the forward and the reverse primers flanking the ycf1 and ndhF genes, respectively (Fig. 2; Table S3). Primers flanking both the ycf1 and ndhF genes coupled with a primer annealing to the middle section of the inserted region produced amplicons of predicted sizes of 3,810 and 4,458 nt (Fig. 2; Table S3). The PCR results were the first line of confirmation since no PCR amplification should be expected from the published H. ammodendron chloroplast genome due to primer mismatch. Interestingly, expected DNA amplicons were also obtained when PCR was performed on Haloxylon persicum, a close relative of H. ammodendron (Fig. 2). A total section of 6.2 kb, including the 4,814nt inserted section, was sequenced and validated via primer walking (Table S3). The sequenced amplicon results produced a 100% alignment match to the H. ammodendron chloroplast genome assembly obtained in this study. Amplification and sequence homology validation of the 4,814 nt section confirmed the presence of the insertion in the H ammodendron chloroplast genome. The integration of intracellularly transferred DNA into the intergenic region of ycf1 and ndhF would be expected as insertion in the coding region would have disrupted gene function.
This is the first report to document mitochondria-to-chloroplast interorganellar gene transfer in the Chenopodiaceae family and the fourth example in angiosperms. However, the mechanisms underlying the transfer of genomic DNA fragments remains to be elucidated [73–75].
Chloroplast genomes among different types of C4 species versus C3 species
The 8 chloroplast genomes studied, include the C3 species S. maritima and 7 forms of C4 species. The results indicate the chloroplast genomes are very similar in the number (82-84) and type of CDS genes encoding proteins. Despite some differences in gene content and organization among the chloroplast genomes, these differences do not coincide with the type of oxygenic photosynthesis (C3 or C4) that these 8 species represent. There is a general conservation of genes present in the C3 species B. muricata and the C4 species. This suggests nuclear genes encode most chloroplast-targeted proteins that are needed to support the C4 pathway. Both Kranz type and single-cell type C4 species have dimorphic chloroplasts (relative to function in carbon assimilation, starch synthesis, and in relative expression of photosystem I and photosystem II for balancing requirements for ATP and NADPH). In carbon assimilation one type of chloroplast supports fixation of atmospheric CO2 by PEPC with synthesis of C4 acids. They generate energy to support conversion of pyruvate to phosphoenolpyruvate utilizing pyruvate, Pi dikinase, adenylate kinase, and inorganic pyrophosphatase, and they support reduction of oxaloacetate to malate by NADP-malate dehydrogenase. The other type of chloroplast has the Calvin-Benson cycle with Rubisco fixing CO2 that is generated by decarboxylation of C4 acids (utilizing plastid-targeted NADP-malic enzyme in some C4 species). Currently all enzymes required in chloroplasts to support the C4 cycle and Calvin-Benson cycle are considered to be nuclear encoded except the gene for the large subunit of Rubisco which is in the chloroplast genome, while the small subunit gene is in the nucleus [39,76–79]. In the dual-cell Kranz type C4 plants, cell specific control of transcription of nuclear genes may contribute to development of dimorphic chloroplasts. Other mechanisms must control development of dimorphic chloroplasts in SCC4 species [see hypotheses, selective protein import, selective mRNA targeting, selective protein degradation; [77]]. Future studies are needed to determine how dimorphic chloroplasts develop to coordinate function of C4 in carbon assimilation, metabolite transport between chloroplasts, and requirements of energy from photochemistry.