The genomic features of chloroplast genomes are diverse and dynamic
A study of 2511 chloroplast genomes was conducted to gain insight into the genomic structure and evolution of the chloroplast genome. The analysis included the complete genome sequences of algae (335), austrobaileyales (2), bryophytes (24), chloranthales (2), corals (2), eudicots (1314), Flacourtiaceae (1), gymnosperms (95), magnoliids (67), monocots (570), nymphaeales (14), opisthokonta (1), protists (31), pteridophytes (52), and an unclassified chloroplast genome (1) (Supplementary File 1). A comparison of the analysed genomes indicated that Haematococcus lacustris encoded the largest chloroplast genome, comprising 1.352 Mbs; while Asarum minus encoded the smallest chloroplast genome, comprising 0.0155 Mbs (Figure 1). The overall average size of the chloroplast genome was found to be 0.152 Mbs. The order of the average size (Mbs) of the chloroplast genome in different plant groups was 0.164 (algae), 0.160 (Nymphaeales), 0.154 (eudicot), 0.154 (Magnoliid), 0.149 (pteridophyte), 0.144 (monocot), 0.134 (bryophyte), 0.131 (gymnosperm), and 0.108 (protist). The average chloroplast genome size in algae (0.164 Mbs) and the Nymphaeales (0.160 Mbs) was larger than it was in eudicots (0.154 Mbs), monocots (0.144 Mbs), and gymnosperms (0.131 Mbs). The average size of the protist chloroplast genome (0.108 Mbs) was found to be the smallest. Principal component analysis (PCA) of the chloroplast genome size of algae, bryophytes, eudicots, gymnosperms, magnoliids, monocots, Nymphaeales, protists, and pteridophytes reveals a clear distinction between the different plant groups (Figure 2). The size of the chloroplast genome of gymnosperm and bryophytes grouped together; and eudicots, magnoliids, and pteridophytes grouped together. In contrast, the algae and protists were independently grouped (Figure 2).
The number of coding sequences (CDS) in the analysed chloroplast genomes ranged from 273 (Pinus koraiensis) to 4 (Monoraphidium neglectum) (Figure 1). The average number of CDS across all the studied chloroplast genomes was 91.67 per genome. Several other species also contained a high number of CDS in the chloroplast genome, including Grateloupia filicina (233), Osmundaria fimbriata (224), Porphyridium purpureum (224), Lophocladia kuetzingii (221), Kuetzingia canaliculata (218), Spyridia filamentosa (218), Bryothamnion seaforthii (216) and others (Supplementary File 1). Similarly, some species encoded a low number of CDS in the chloroplast genome, including Asarum minus (7), Cytinus hypocistis (15), Sciaphila densiflora (18), Gastrodia elata (20), Burmannia oblonga (22), Orobanche gracilis (24), and others (Supplementary File 1). PCA indicated that the number of CDS in bryophytes, eudicots, magnoliids, monocots, and pteridophytes grouped together (Figure 3). The number of CDS in algae, gymnosperms, and protists grouped very distantly from the above-mentioned grouping (Figure 3).
The GC content of the analysed chloroplast genomes ranged from a high of 57.66% (Trebouxiophyceae sp. MX-AZ01) to a low of 23.25% (Bulboplastis apyrenoidosa) (Figure 4, Supplementary File 1). The average GC content in the chloroplast genome was 36.87%. Some species contained a higher percentage of GC content, including Paradoxia multiseta (50.58%), Haematococcus lacustris (49.88%), Chromerdia sp. RM11 (47.74%), Elliptochloris bilobata (45.76%), Choricystis parasitica (45.44%) and others. On the other hand, some species had a lower percentage of GC content, including Ulva prolifera (24.78%), Ulva linza (24.78%), Ulva fasciata (24.86%), Ulva flexuosa (24.97%), and others (Supplementary File 1). PCA analysis revealed that the percentage GC content of eudicots, gymnosperms, magnoliids, monocots, and Nymphaeales grouped together, and the percentage of GC content in algae and protists grouped together (Figure 5). The percentage GC content in bryophytes and pteridophytes did not group with the algae and protists or the eudicots, gymnosperms, magnoliids, monocots, or Nymphaeales (Figure 5).
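The GC percentages reported above follow from a simple base count. The text does not state how ambiguous bases were handled, so the sketch below makes an explicit assumption: only unambiguous A, C, G, and T are counted, and any other symbol (for example N) is ignored. The function name `gc_content` is illustrative and is not taken from the study.

```python
def gc_content(sequence: str) -> float:
    """Return the GC percentage of a nucleotide sequence.

    Assumption: only unambiguous A/C/G/T bases are counted, so symbols
    such as N are excluded from both numerator and denominator.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T bases")
    return 100.0 * gc / (gc + at)


if __name__ == "__main__":
    # Toy sequence for illustration only; not taken from any genome in the study.
    print(f"{gc_content('ATGCGCNNAT'):.2f}%")
```

Applied to a complete chloroplast genome sequence, the same function would give a value comparable to the per-genome percentages listed in Supplementary File 1, provided the same convention for ambiguous bases was used.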
PsaM, Psb30, ChlB, ChlL, ChlN, and RPL21 are chloroplast genes characteristic of algae, bryophytes, pteridophytes, and gymnosperms
The PsaM protein is subunit XII of photosystem I. Among the 2511 studied species, 94 were found to possess the PsaM gene. All of the species found to possess the PsaM gene were algae, bryophytes, pteridophytes, or gymnosperms (Supplementary File 2). Notably, no species in the angiosperm lineage possessed the PsaM gene, indicating that the PsaM gene was lost in the angiosperm lineage. The size of the deduced PsaM proteins ranged from 22 to 33 amino acids. The deduced molecular weight (MW) of the PsaM protein ranged from 1.59 kDa in P. bungeana to 3.568 kDa in Mesotaenium endlicherianum, while the isoelectric point ranged from 3.37 in Taxodium distichum to 9.95 in P. bungeana. The MW of PsaM proteins in species of Picea was smaller than in Cycas, Ginkgo, and species of bryophytes, pteridophytes, and algae. The PsaM protein was found to contain the characteristic conserved amino acid motif Q-x3-A-x3-A-F-x3-I-L-A-x2-L-G-x2-L-Y (Supplementary Figure 1). A few species, however, did not contain this conserved motif, including Cephalotaxus, Podocarpus totara, Retrophyllum piresii, Dacrycarpus imbricatus, Glyptostrobus pensilis, T. distichum, Cryptomeria japonica, Pinus contorta, Pinus taeda, and Ptilidium pulcherrimum. Instead, they possessed the conserved motif F-x-S-x3-C-F-x4-F-S-x2-I (Supplementary Figure 1). Phylogenetic analysis revealed that PsaM genes grouped into five independent clusters, suggesting that they have evolved independently from multiple common ancestral nodes (Supplementary Figure 2A). Duplication and deletion analysis of PsaM genes revealed that deletion events were more predominant than duplication or co-divergence events (Table 1). Among the 84 analysed PsaM genes, 12 had undergone duplication and 34 had undergone deletions, while 34 genes had undergone co-divergence (Table 1, Supplementary Figure 2B).
The upper and lower boundaries of the time of each duplication and co-divergence event revealed that the Jungermanniopsida, Pinaceae, and Streptophyta were in the upper boundary, while the Eukaryota, Pinaceae, Streptophyta, Cycadales, Podocarpaceae, Apopellia endiviifolia, and Zygnematophyceae were in the lower boundary (Supplementary Figure 2B). The lower boundary represents the oldest species where duplications must have occurred and the upper boundary represents the most recent species where duplication events are not present.
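The dash-separated motif notation used for PsaM (a residue letter, `x` for any residue, `xN` for N arbitrary residues, and `F/Y` for alternatives) maps directly onto a regular expression. The following is a minimal sketch of that mapping, not the method used in the study. The helper name `motif_to_regex` and the test peptide are hypothetical, and the peptide is synthetic rather than a real PsaM sequence.

```python
import re


def motif_to_regex(motif: str) -> str:
    """Translate a dash-separated motif such as 'Q-x3-A' into a regex string."""
    parts = []
    for token in motif.split("-"):
        gap = re.fullmatch(r"x(\d*)", token)
        if gap:
            # 'x' is one arbitrary residue, 'xN' is N arbitrary residues.
            count = int(gap.group(1) or 1)
            parts.append("." if count == 1 else ".{%d}" % count)
        elif "/" in token:
            # 'F/Y' means either residue is accepted at this position.
            parts.append("[" + token.replace("/", "") + "]")
        else:
            parts.append(token)
    return "".join(parts)


# Motif string copied from the text above.
PSAM_MOTIF = "Q-x3-A-x3-A-F-x3-I-L-A-x2-L-G-x2-L-Y"
PSAM_REGEX = re.compile(motif_to_regex(PSAM_MOTIF))

# Synthetic peptide built to satisfy the motif; it is NOT a real PsaM sequence.
SYNTHETIC = "MQSSSASSSAFSSSILASSLGSSLYK"

if __name__ == "__main__":
    print(PSAM_REGEX.pattern)
    print(bool(PSAM_REGEX.search(SYNTHETIC)))
```

The same helper handles the alternative-residue notation that appears in other motifs in this section, such as E-K-F/Y-A-R-Q-Q.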
Psb30 encodes the Ycf12 protein, which is essential for the functioning of the photosystem II reaction centre. The size of the translated protein in the analysed chloroplast genomes ranged from 24 (Pinus nelsonii) to 34 amino acids (Isoetes flaccida). The MW of Ycf12 ranged from 2.46 kDa in P. nelsonii to 3.75 kDa in Schizaea pectinata, and the predicted isoelectric point ranged from 3.13 in P. nelsonii to 10.613 in Cylindrocystis brebissonii. A multiple sequence alignment revealed the presence of a conserved consensus amino acid sequence, N-x-E-x3-Q-L-x2-L-x6-G-P-L-V-I (Supplementary Figure 3). A total of 164 species were found to possess the Psb30 gene, and all of them belonged to algae, bryophytes, pteridophytes, or gymnosperms (Supplementary File 2). Psb30 was absent from the chloroplast genomes of angiosperms. Phylogenetic analysis of Psb30 genes resolved two major clusters and six minor clusters, suggesting that the gene evolved from multiple common ancestral nodes (Supplementary Figure 4A). Deletion/duplication analysis indicated that 39 Psb30 genes had undergone a duplication event and 120 had undergone a deletion event, while 49 were found to be co-diverged (Table 1, Supplementary Figure 4B). The upper boundary species, where duplication events were absent, belonged to the Streptophyta, Pinaceae, Polypodiopsida, Mesotaeniaceae, Zygnematophyceae, Zamiaceae, and Eukaryota. Species in the lower boundary group, where duplication events must have occurred, belonged to the Eukaryota, Streptophyta, Pinaceae, Cathaya argyrophylla, Cycadales, Dioon spinulosum, Anthoceros formosae, Pteridaceae, Aspleniaceae, Polypodiales, Zygnematales, C. brebissonii, and Viridiplantae.
ChlB encodes a light-independent protochlorophyllide reductase. A total of 288 of the examined chloroplast genome sequences were found to possess a ChlB gene (Supplementary File 2) among protists, algae, bryophytes, pteridophytes, and gymnosperms. The ChlB gene was absent in species in the Chloranthales, corals, or angiosperm lineage. The predicted size of the ChlB proteins ranged from 724 (Gonium pectorale) to 177 (Welwitschia mirabilis) amino acids. The MW of ChlB protein ranged from 20.99 (W. mirabilis) to 79.89 (G. pectorale) kDa, while the isoelectric point ranged from 5.10 (Sequoia sempervirens) to 10.74 (S. pectinata). A multiple sequence alignment revealed the presence of several highly conserved amino acid motifs (Supplementary Figure 8). At least seven conserved motifs were identified, including A-Y-W-M-Y-A, L-P-K-A-W-F, E-N-Y-I-D-Q-Q, S-Q-A-A-W-F, H-D-A-D-W-F, E-P-x2-I-F-G-T, E-K-F/Y-A-R-Q-Q, and E-V-M-Y-A-A (Supplementary Figure 5). A phylogenetic analysis indicated that ChlB genes grouped into two major clusters and thirteen minor clusters, reflecting multiple evolutionary nodes (Supplementary Figure 6A). ChlB genes were comprised of a few groups. Specifically, deletion and duplication analysis revealed that 35 ChlB genes had undergone duplications and 126 had undergone deletions, while 116 exhibited co-divergence in their evolutionary history (Table 1, Supplementary Figure 6B). The lower time boundary contained species where duplication events must have occurred. These taxa included members of the Viridiplantae, Streptophyta, Cycadales, D. spinulosum, Ampelopteris prolifera, Polypodiales, A. endiviifolia, Zygnematophyceae, Chlorophyta, Trebouxiophyceae, Hydrodictyaceae, Bryopsidales, and Pseudochloris wilhelmii. The upper time boundary included taxa where duplication events were not present. 
These included members of the Streptophyta, Zamiaceae, Thelypteridaceae, Jungermanniopsida, Zygnematophyceae, Chlorophyta, Trebouxiophyceae, Sphaeropleales, and Ulvophyceae.
ChlL is a light-independent protochlorophyllide reductase iron-sulphur ATP-binding protein that functions in the reduction of ferredoxin, reducing the D ring of protochlorophyllide to form chlorophyllide. It plays a role in the light-independent reaction of photosynthesis, and the L component serves as the electron donor to the NB-component. This protein is also involved in the light-independent biosynthesis of chlorophyll. Analysis of the chloroplast genome sequences identified 303 species that possess ChlL genes (Supplementary File 2). All of the species that possess a ChlL gene belonged to algae, bryophytes, gymnosperms, protists, or pteridophytes. None of the taxa in the angiosperm or magnoliid lineage were found to possess a ChlL gene. Within the protist lineage, only species in the genera Nannochloropsis, Vaucheria, Triparma, and Alveolata encode a ChlL gene. The length of the predicted ChlL proteins ranged from 186 amino acids in Pycnococcus provasolii to 299 amino acids in Macrothelypteris torresiana. The MW of ChlL proteins ranged from 20.19 (Pycnococcus provasolii) to 33.08 (Retrophyllum piresii) kDa, while the isoelectric point ranged from 4.53 (Leptosira terrestris) to 11.74 (Hypodematium crenatum). A multiple sequence alignment revealed the presence of several highly conserved amino acid motifs, including K-S-T-T-S-C-N-x-S, W-P-E-D-V-I-Y-Q, K-Y-V-E-A-C-P-M-P, C-D-F-Y-L-N, Q-P-E-G-V-V/I, and S-D-F-Y-L-N (Supplementary Figure 7). The phylogenetic analysis indicated that ChlL genes grouped into one major independent cluster and eleven minor clusters, suggesting that they also evolved independently from different common ancestors (Supplementary Figure 8A). Deletion and duplication analysis indicated that 49 ChlL genes had undergone duplication events and 184 had undergone deletions, while 100 ChlL genes exhibited co-divergence (Table 1, Supplementary Figure 8B). The lower boundary contains taxa where duplication events must have occurred.
These include members of the Viridiplantae, Streptophyta, Cycadales, D. spinulosum, Polypodiopsida, Pteridaceae, Diplaziopsidaceae, Zygnematophyceae, Chlorophyta, Hydrodictyaceae, Trebouxiophyceae, and P. wilhelmii. The upper boundary includes taxa where deletion events must have occurred. These include members of the Streptophyta, Zamiaceae, Polypodiopsida, Thelypteridaceae, Chlorophyta, and Trebouxiophyceae.
ChlN protein is a dark-operative, light-independent, protochlorophyllide reductase. It utilizes Mg2+-ATP mediated reduction of ferredoxin to reduce the D ring of protochlorophyllide that is subsequently converted to chlorophyllide. At least 289 of the analyzed chloroplast genomes possess ChlN genes. These genomes were from taxa within the protists, algae, bryophytes, pteridophytes, and gymnosperms (Supplementary File 2). The length of the predicted ChlN proteins range from 373 (Toxarium undulatum) to 523 (Chlorella mirabilis) amino acids and have a predicted MW ranging from 44.93 (P. provasolii) to 58.86 (C. mirabilis) kDa, while the isoelectric point ranges from 4.92 (Ginkgo biloba) to 9.83 (R. piresii). A multiple sequence alignment revealed the presence of highly conserved amino acid motifs, including N-Y-H-T-F, A-E-L-Y-Q-K-I-E-D-S, M-A-H-R-C-P, and Q-I-H-G-F (Supplementary Figure 9). Phylogenetic analysis revealed that ChlN genes group into two independent clusters (Supplementary Figure 10A). No lineage specific grouping, however, was identified in the phylogenetic tree. Deletion and duplication analysis indicated that 8 ChlN genes had undergone duplication events, 46 had undergone deletion events and 34 genes exhibited co-divergence (Table 1, Supplementary Figure 10B). The lower boundary, which indicates where duplication events must have occurred, contained members of the Chlorophyta, Polypodiales, and A. prolifera. The upper boundary contains taxa where deletion events must have occurred and included members of the Thelypteridaceae, Polypodiales, and Chlorophyta.
Rpl21 protein is a component of the 60S ribosomal subunit. The chloroplast genomes of at least 137 of the examined species were found to possess an Rpl21 gene; these included taxa within the algae, bryophytes, pteridophytes, and gymnosperms (Supplementary File 2). In the majority of cases, however, the CDS of the Rpl21 genes were truncated. Therefore, only 22 full-length CDS were used to identify deletion and duplication events. Rpl21 proteins were found to contain the conserved amino acid motifs Y-A-I-I-D-x-G-G-x-Q-L-R-E-V-x-G-R-F, R-V-L-M-I, G-x-P-W-L, R-I-L-H, and K-x2-I/V-x5-K-K (Supplementary Figure 11). Phylogenetic analysis showed the presence of three clusters, reflecting their origin from multiple common ancestral nodes (Supplementary Figure 12A). Deletion/duplication analysis indicated that 3 Rpl21 genes had undergone duplication events, 8 had undergone deletion events, and 9 exhibited co-divergence (Table 1, Supplementary Figure 12B).
The Rbcl gene has been lost in parasitic and heterotrophic plant species
Ribulose-1,5-bisphosphate carboxylase (Rbcl), the most abundant enzyme on Earth, functions in carbon fixation during the dark reaction of photosynthesis to produce carbohydrate. The presence of the Rbcl gene and its role in carbon assimilation contributes to the role of plants as producers in natural ecosystems. Therefore, a characteristic feature of almost all photosynthetic plants is that their chloroplast genome encodes an Rbcl gene to modulate photosynthesis. However, not all plants with chloroplasts/plastids possess an Rbcl gene. Among the plant taxa analysed, we identified at least 17 species that did not encode an Rbcl gene in their chloroplast genome. The species lacking an Rbcl gene were Alveolata sp. CCMP3155, A. minus, Bathycoccus prasinos (picoplankton), Burmannia oblonga (orchid), Codonopsis lanceolata (eudicot), Cytinus hypocistis (parasite), Gastrodia elata (saprophyte), Monotropa hypopitys (myco-heterotroph), Orobanche austrohispanica (parasite), Orobanche densiflora (parasite), Orobanche gracilis (parasite), Orobanche pancicii (parasite), Phelipanche purpurea (parasite), Phelipanche ramosa (parasite), Prototheca cutis (parasitic algae), Prototheca stagnorum (parasitic algae), and Sciaphila densiflora (myco-heterotroph). The length of the RBCL protein in taxa that did possess an Rbcl gene ranged from 334 (Asterionellopsis glacialis) to 493 (Gossypium australe) amino acids, and the MW of the RBCL protein ranged from 36.809 (Nannochloropsis oceanica) to 54.87 (G. australe) kDa, while the isoelectric point ranged from 4.47 (Primula sinensis) to 7.79 (Oogamochlamys gigantea) (Supplementary Figure 13). An analysis of the composition of the RBCL protein revealed that Ala and Gly were the most abundant amino acids and that Trp was the least abundant.
Deletion of inverted repeats (IRs) has occurred across all plastid lineages
Inverted repeats (IRs) are one of the major characteristic features of chloroplast genomes. The analysis conducted in the present study revealed the deletion of inverted repeats in the chloroplast genomes of 259 (10.31%) of the 2511 species examined (Supplementary File 3). IR deletion events were identified in protists (14), protozoans (1), algae (126), bryophytes (1), gymnosperms (64), magnoliids (1), monocots (9), and eudicots (43). The average size of the deleted IR region in algae was 0.177 Mb, which is larger than the overall size of the chloroplast genome in the respective taxa. The average size of the deleted IR region in eudicots, monocots, and gymnosperms was 0.124, 0.131, and 0.127 Mb, respectively, which is smaller than the overall size of the chloroplast genome in the respective lineages.
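The overall proportion of IR-deleted genomes can be reproduced from the per-lineage counts given above. The short sketch below uses only numbers stated in the text; the variable names are illustrative.

```python
# Per-lineage counts of chloroplast genomes with deleted IRs, as listed in the text.
ir_deletions = {
    "protists": 14,
    "protozoans": 1,
    "algae": 126,
    "bryophytes": 1,
    "gymnosperms": 64,
    "magnoliids": 1,
    "monocots": 9,
    "eudicots": 43,
}

TOTAL_GENOMES = 2511  # number of chloroplast genomes analysed

total_deleted = sum(ir_deletions.values())
percent_deleted = round(100.0 * total_deleted / TOTAL_GENOMES, 2)

if __name__ == "__main__":
    print(total_deleted, percent_deleted)
```

The per-lineage counts add up to the stated total of 259 genomes, which corresponds to the stated 10.31% of the 2511 genomes.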
Phylogenetic analysis of chloroplast genomes containing deleted IR regions produced three major clusters (Figure 6). Gymnosperms were in the upper cluster (cyan), while the lower cluster (red) comprised the algae, bryophytes, eudicots, gymnosperms, and pteridophytes. No chloroplast genomes from monocot plants were present in the lower cluster (Figure 6). The middle cluster contained at least four major phylogenetic groups (Figure 6). Monocot plants were present in two groups (pink) in the middle cluster. Gymnosperm (cyan) and eudicot (green) chloroplast genomes were also present in two of the groups in the middle cluster. Although there was some sporadic distribution of algae in the different groups of the phylogenetic tree, the majority of the algal species were present in a single group (yellow) (Figure 6). A phylogenetic tree combining taxa with an IR-deleted chloroplast genome and taxa whose chloroplast genomes had not lost the IR region (Floydiella terrestris, Carteria cerasiformis, B. apyrenoidosa, E. grandis, O. sativa and others) did not reveal any specific difference in their clades. Instead, the IR-containing genomes grouped with the genomes in which the IR region was deleted. Inverted repeats stabilize the chloroplast genome, and the loss of a region of inverted repeats most likely leads to a genetic rearrangement in the chloroplast genome. The lower cluster (red) contained the oldest group. Genomic recombination analysis revealed that the chloroplast genomes across different lineages had undergone extensive recombination (Supplementary Figure 14A and 14B). In addition, the IR-deleted chloroplast genomes had also undergone extensive recombination (Supplementary Figure 15).
Several genes in the chloroplast genome have been subject to deletion events
The chloroplast genome encodes genes for photosynthesis, amino acid biosynthesis, transcription, protein translation, and other important metabolic processes. The major genes involved in such events are AccD (acetyl-coenzyme A carboxylase carboxyl transferase), AtpA, AtpB, AtpE, AtpF, AtpH, AtpI, CcsA (cytochrome C biogenesis protein), CemA (chloroplast envelope membrane), ChlB (light independent protochlorophyllide reductase), ChlL, ChlN, ClpP (ATP-dependent Clp protease), MatK (maturase K), NdhA (NADPH-quinone oxidoreductase), NdhB, NdhC, NdhD, NdhE, NdhF, NdhG, NdhH, NdhI, NdhJ, NdhK, Pbf1 (photosystem biogenesis factor 1), PetA (cytochrome precursor), PetB, PetD, PetG, PetL, PetN, PsaA (photosystem I protein), PsaB, PsaC, PsaI, PsaJ, PsaM, Psb30, PsbA (photosystem II protein), PsbB, PsbC, PsbD, PsbE, PsbF, PsbH, PsbI, PsbJ, PsbK, PsbL, PsbM, PsbT, PsbZ, Rbcl (ribulose 1,5- biphosphate carboxylase), Rpl2 (60S ribosomal protein), Rpl14, Rpl16, Rpl20, Rpl21, Rpl22, Rpl23, Rpl32, Rpl33, Rpl36, RpoA (DNA directed RNA polymerase), RpoB, RpoC1, RpoC2, Rps2 (40S ribosomal protein), Rps3, Rps4, Rps7, Rps8, Rps11, Rps12, Rps14, Rps15, Rps16, Rps18, Rps19, Ycf1, Ycf2, Ycf3, and Ycf4. Our analysis of 2511 chloroplast genomes revealed that a number of these genes were lost in one or more species (Table 2). The analysis indicated that the ribosomal proteins Rpl and Rpo were lost less frequently than the other chloroplast genes (Table 2). Ndh genes were lost in a number of different species. Several other genes had been deleted in a considerable number of species across different lineages. These included AccD (402), AtpF (217), Clp (194), Ycf2 (226), Ycf4 (111), PetL (248), PetN (125), PsaI (129), PsbM (166), PsbZ (145), Rpl22 (137), Rpl23 (221), Rpl32 (182), Rpl33 (163), Rps15 (263), and Rps16 (372), where the number in parentheses indicates the number of taxa in which the gene has been deleted from the chloroplast genome (Table 2).
Chloroplast genomes possess genes that encode at least six different ATP synthase subunits, encoded by the AtpA, AtpB, AtpE, AtpF, AtpH, and AtpI genes. Among the 2511 analysed chloroplast genomes, AtpA, AtpB, AtpE, AtpF, AtpH, and AtpI were found to be lost in 8, 8, 12, 14, 13, and 12 species, respectively (Table 2, Supplementary File 4). The loss of Atp genes occurred in algae, eudicots, magnoliids, and monocots, while no loss of Atp genes occurred in any species of bryophytes, pteridophytes, or gymnosperms (Supplementary File 4). The AccD gene in chloroplasts, which encodes the beta-carboxyl transferase subunit of the acetyl-CoA carboxylase (ACC) complex, was found to be lost from 387 plant species (Supplementary File 4). The AccD gene was lost in taxa belonging to algae (92 species), eudicots (32 species), gymnosperms (7 species), magnoliids (1 species), monocots (227 species), and protists (27 species), while it was found to be present in all bryophytes and pteridophytes.
The CcsA gene in the chloroplast genome encodes a cytochrome C biogenesis protein. CcsA genes were found to be lost in at least 29 species (Table 2). It was lost in taxa belonging to members of algae (8 species), bryophyte (3 species), eudicots (6 species), magnoliids (1 species), monocots (5 species), and protists (6 species), while no evidence of a loss was observed in members of the pteridophytes and gymnosperms. The CemA gene encodes a chloroplast envelope membrane protein and was found to be lost in 29 species (Table 2). The loss of the CemA gene was found in algae, eudicots, magnoliids, monocots, and protists, while no evidence of deletion was observed in taxa of bryophytes, pteridophytes, and gymnosperms. The ClpP gene encodes an ATP-dependent Clp protease proteolytic subunit that is necessary for ATP hydrolysis. It was observed to be lost in at least 142 species (Supplementary File 4) belonging to the algae (108 species), eudicots (2 species), gymnosperms (3 species), magnoliids (1 species), and protists (28 species). Loss of the ClpP gene was not found in members of bryophytes, pteridophytes, and monocots (Supplementary File 4). The chloroplast genome possesses at least six different Psa genes, PsaA, PsaB, PsaC, PsaI, PsaJ, and PsaM (Table). The PsaA gene was absent in 16 species (3 algae, 8 eudicots, 1 magnoliid, and 4 monocots). PsaB was lost in 10 species (3 algae, 2 eudicots, 1 magnoliid, and 4 monocots). PsaC was lost in 19 species (2 algae, 10 eudicots, 1 magnoliid, and 6 monocots). PsaI was absent in 72 species (42 algae, 18 eudicots, 1 magnoliid, 6 monocots, and 5 protists), and PsaJ was lost in 24 species (6 algae, 12 eudicots, 1 gymnosperm, 1 magnoliid, 3 monocots, and 1 protist). The PsaM gene was lost in 2214 species distributed amongst all of the angiosperm lineages examined in our analysis. The chloroplast encodes 16 different Psb genes (PsbA, PsbB, PsbC, PsbD, PsbE, PsbF, PsbH, PsbI, PsbJ, PsbK, PsbL, PsbM, PsbN, PsbT, PsbZ, and Psb30). 
The analysis indicated that PsbA, PsbB, PsbC, PsbD, PsbE, PsbF, PsbH, PsbI, PsbJ, PsbK, PsbL, PsbM, PsbN, PsbT, PsbZ, and Psb30 genes were lost in a variety of species. Evidence for the deletion of these genes was observed as follows: PsbA was lost in 12 species (2 algae, 6 eudicots, and 4 monocots); PsbB was lost in 18 species (2 algae, 11 eudicots, 1 magnoliid, and 4 monocots); PsbC was lost in 16 species (2 algae, 7 eudicots, 1 magnoliid, and 6 monocots); PsbD was lost in 17 species (2 algae, 9 eudicots, 1 magnoliid, and 5 monocots); PsbE and PsbF were lost in 21 species (2 algae, 12 eudicots, 1 magnoliid, and 6 monocots in both cases); PsbH was lost in 20 species (2 algae, 14 eudicots, 1 magnoliid and 3 monocots); PsbI was lost in 18 species (3 algae, 10 eudicots, and 5 monocots); PsbJ was lost in 21 species (2 algae, 12 eudicots, 1 magnoliid, 6 monocots); PsbK was lost in 13 species (2 algae, 5 eudicots, and 6 monocots); PsbL was lost in 22 species (2 algae, 12 eudicots, 1 magnoliid, 6 monocots, and 1 protist); PsbM was lost in 158 species (115 algae, 9 eudicots, 6 monocots, and 27 protists); PsbN was lost in 22 species (2 algae, 14 eudicots, 1 magnoliid, and 5 monocots); PsbT was lost in 23 species (3 algae, 14 eudicots, and 6 monocots); and PsbZ was lost in 31 species (14 algae, 7 eudicots, 1 magnoliid, 4 monocots, and 5 protists) (Supplementary File 4).
The Ndh genes encode a NAD(P)H-quinone oxidoreductase that shuttles electrons from plastoquinone to quinone in the photosynthetic chain reaction. The analysis indicated that the chloroplast genome encodes at least 11 Ndh genes, NdhA, NdhB, NdhC, NdhD, NdhE, NdhF, NdhG, NdhH, NdhI, NdhJ, and NdhK (Table 2). Evidence for the deletion of these genes was observed as follows: NdhA was lost in 339 species (207 algae, 1 bryophyte, 35 eudicots, 37 gymnosperms, 2 magnoliids, 27 monocots, 28 protists, and 2 pteridophytes); NdhB was lost in 258 species (211 algae, 5 eudicots, 7 gymnosperms, 1 magnoliid, 3 monocots, 29 protists, and 2 pteridophytes); NdhC was lost in 339 species (212 algae, 38 eudicots, 7 gymnosperms, 2 magnoliids, 49 monocots, 29 protists, and 2 pteridophytes); NdhD was lost in 293 species (214 algae, 28 eudicots, 9 gymnosperms, 1 magnoliid, 9 monocots, 30 protists, and 2 pteridophytes); NdhE was lost in 322 species (218 algae, 30 eudicots, 19 gymnosperms, 1 magnoliid, 20 monocots, 30 protists, and 4 pteridophytes); NdhF was lost in 346 species (207 algae, 37 eudicots, 37 gymnosperms, 1 magnoliid, 33 monocots, 30 protists, and 2 pteridophytes); NdhG was lost in 335 species (213 algae, 1 bryophyte, 35 eudicots, 37 gymnosperms, 2 magnoliids, 16 monocots, 30 protists, and 1 pteridophyte); NdhH was lost in 322 species (213 algae, 1 bryophyte, 34 eudicots, 15 gymnosperms, 1 magnoliid, 26 monocots, 30 protists, and 2 pteridophytes); NdhI was lost in 378 species (213 algae, 1 bryophyte, 43 eudicots, 37 gymnosperms, 2 magnoliids, 50 monocots, 30 protists, and 2 pteridophytes); NdhJ was lost in 340 species (215 algae, 40 eudicots, 37 gymnosperms, 2 magnoliids, 15 monocots, 30 protists, and 2 pteridophytes); and NdhK was lost in 331 species (204 algae, 1 bryophyte, 39 eudicots, 7 gymnosperms, 2 magnoliids, 46 monocots, 30 protists, and 2 pteridophytes) (Supplementary File 4).
The loss of Ndh genes was found to occur in members of algae, bryophytes, pteridophytes, gymnosperms, monocots, eudicots, magnoliids, and protists.
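Each per-gene total reported above should equal the sum of its per-lineage counts. The sketch below checks this for three of the eleven Ndh genes, using only figures quoted in the text. The dictionary layout and the helper name `breakdown_matches` are illustrative assumptions, not part of the original analysis.

```python
# (reported total, per-lineage counts) for three Ndh genes, copied from the text.
NDH_LOSSES = {
    "NdhA": (339, {"algae": 207, "bryophytes": 1, "eudicots": 35,
                   "gymnosperms": 37, "magnoliids": 2, "monocots": 27,
                   "protists": 28, "pteridophytes": 2}),
    "NdhB": (258, {"algae": 211, "eudicots": 5, "gymnosperms": 7,
                   "magnoliids": 1, "monocots": 3, "protists": 29,
                   "pteridophytes": 2}),
    "NdhC": (339, {"algae": 212, "eudicots": 38, "gymnosperms": 7,
                   "magnoliids": 2, "monocots": 49, "protists": 29,
                   "pteridophytes": 2}),
}


def breakdown_matches(total: int, per_lineage: dict) -> bool:
    """Return True when the per-lineage counts add up to the reported total."""
    return sum(per_lineage.values()) == total


if __name__ == "__main__":
    for gene, (total, per_lineage) in NDH_LOSSES.items():
        print(gene, breakdown_matches(total, per_lineage))
```

Extending the dictionary to the remaining genes would allow the same check to be run against every entry in Supplementary File 4.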
The chloroplast genome encodes PetA (cytochrome f precursor), PetB (cytochrome b6), PetD (cytochrome b6-f complex subunit 4), PetG (cytochrome b6-f complex subunit 5), PetL (cytochrome b6-f complex subunit 6), and PetN (cytochrome b6-f complex subunit 8) genes. Evidence for deletion of these genes was observed as follows: PetA was lost in 33 species (8 algae, 10 eudicots, 1 magnoliid, 6 monocots, and 8 protists); PetB was lost in 15 species (2 algae, 8 eudicots, 1 magnoliid, and 4 monocots); PetD was lost in 36 species (7 algae, 13 eudicots, 1 magnoliid, 6 monocots, and 9 protists); PetL was lost in 71 species (39 algae, 11 eudicots, 1 magnoliid, 4 monocots, and 16 protists); and PetN was lost in 135 species (106 algae, 5 bryophytes, 11 eudicots, 1 magnoliid, 6 monocots, and 6 protists) (Supplementary File 4). PetA was lost in members of the algae, eudicots, magnoliids, monocots, and protists, while it was found to be present in bryophytes, pteridophytes, and gymnosperms. PetB was lost in members of the algae, eudicots, magnoliids, and monocots, while it was found to be present in bryophytes, pteridophytes, and gymnosperms (Supplementary File 4). PetD was lost in members of the algae, eudicots, magnoliids, monocots, and protists, while it was found to be intact in bryophytes, pteridophytes, and gymnosperms. PetL was lost in members of the algae, eudicots, magnoliids, monocots, and protists, while it was found to be intact in bryophytes, pteridophytes, and gymnosperms. PetN was lost in members of the algae, bryophytes, eudicots, magnoliids, monocots, and protists, while it was intact in pteridophytes and gymnosperms.
The chloroplast genome encodes at least nine Rpl genes, Rpl2, Rpl14, Rpl16, Rpl20, Rpl22, Rpl23, Rpl32, Rpl33, and Rpl36 (Table 2). Deletion of these genes was found in taxa of different lineages (Table 2). Rpl2 was lost in two species (1 eudicot and 1 magnoliid). Rpl14 was lost in four species (2 algae, 1 eudicot, and 1 magnoliid). Rpl16 was lost in three species (2 algae and 1 magnoliid). Rpl22 was lost in 127 species (107 algae, 12 eudicots, 3 magnoliids, 2 monocots, and 3 protists). Rpl32 was lost in 114 species (21 algae, 73 eudicots, 5 gymnosperms, 1 magnoliid, 6 monocots, 8 protists). Rpl33 was lost in 133 species (111 algae, 7 eudicots, 1 magnoliid, 4 monocots, and 10 protists). Rpl23 was lost in 24 species (8 algae, 4 eudicots, 6 gymnosperms, 1 magnoliid, and 5 monocots) (Supplementary File 4).
The chloroplast genome encodes 12 Rps genes, Rps2, Rps3, Rps4, Rps7, Rps8, Rps11, Rps12, Rps14, Rps15, Rps16, Rps18, and Rps19 (Table 2). Our analysis indicated that different Rps genes were lost from a variety of species (Supplementary Table 1). Specifically, Rps2 was lost in three species (1 algae, 1 eudicot, and 1 magnoliid); Rps3 was lost in three species (2 algae and 1 magnoliid); Rps4 was lost in four species (3 algae and 1 magnoliid); Rps7 was lost in three species (1 algae, 1 gymnosperm, and 1 magnoliid); Rps8 was lost in three species (1 eudicot, 1 magnoliid, and 1 protist); Rps11 was lost in two species (1 eudicot and 1 magnoliid); and Rps12 was lost in two species (1 algae and 1 magnoliid). The chloroplast genome encodes four Rpo genes, RpoA, RpoB, RpoC1 and RpoC2 (Table 2). RpoA and RpoC1 encode the alpha-subunit and RpoB and RpoC2 encode the beta-subunit of DNA-dependent RNA polymerase. The analysis revealed the loss of RpoA, RpoB, RpoC1, and RpoC2 genes from the chloroplast genome of several taxa (Supplementary File 4). Specifically, RpoA was lost in 26 species (5 algae, 6 bryophytes, 7 eudicots, 1 magnoliid, 4 monocots, and 3 protists); RpoB was lost in 19 species (1 algae, 14 eudicots, 1 magnoliid, and 3 monocots); RpoC1 was lost in 21 species (15 eudicots, 1 magnoliid, and 5 monocots); and RpoC2 was lost in 13 species (1 algae, 7 eudicots, 1 magnoliid, and 4 monocots). The loss of RpoA occurred across diverse lineages including algae, bryophytes, eudicots, magnoliids, monocots, and protists. Additionally, RpoB was lost in algae, eudicots, magnoliids, and monocots; RpoC1 was lost in eudicots, magnoliids, and monocots; and RpoC2 was lost in eudicots, magnoliids, and monocots (Supplementary File 4).
The majority of chloroplast genomes encode four Ycf genes, Ycf1, Ycf2, Ycf3, and Ycf4. Our analysis indicated a dynamic loss of Ycf genes from the chloroplast genome of a variety of taxa (Supplementary File 4). Ycf1 was lost in 161 species (125 algae, 4 eudicots, 1 magnoliid, 3 monocots, and 28 protists). Ycf2 was lost in 219 species (185 algae, 1 eudicot, 1 magnoliid, 2 monocots, and 30 protists). Ycf3 was lost in 30 species (7 algae, 7 eudicots, 1 magnoliid, 6 monocots, and 9 protists). Ycf4 was lost in 39 species (6 algae, 24 eudicots, 1 magnoliid, 5 monocots, and 3 protists). Although researchers have yet to elucidate the function of all Ycf genes, Ycf3 and Ycf4 have been reported to be photosystem I assembly factors. The loss of Ycf1 and Ycf2 was most prominent in algae and was not found in bryophytes, pteridophytes, or gymnosperms. The loss of Ycf4 was most prominent in eudicots, and the loss of Ycf3 and Ycf4 was not observed in bryophytes, pteridophytes, or gymnosperms (Supplementary File 4).
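The per-lineage Ycf loss counts above can be put on a common scale by dividing by the number of genomes sampled in each lineage. The following is a minimal sketch, assuming the lineage totals given in the dataset description (335 algae, 1314 eudicots, 67 magnoliids, 570 monocots, and 31 protists); the names `LINEAGE_TOTALS`, `YCF_LOSSES`, `loss_fraction`, and `total_losses` are illustrative and are not part of the original analysis.

```python
# Number of chloroplast genomes sampled per lineage (from the dataset description).
LINEAGE_TOTALS = {
    "algae": 335,
    "eudicots": 1314,
    "magnoliids": 67,
    "monocots": 570,
    "protists": 31,
}

# Number of species lacking each Ycf gene, per lineage (counts reported in the text).
YCF_LOSSES = {
    "Ycf1": {"algae": 125, "eudicots": 4, "magnoliids": 1, "monocots": 3, "protists": 28},
    "Ycf2": {"algae": 185, "eudicots": 1, "magnoliids": 1, "monocots": 2, "protists": 30},
    "Ycf3": {"algae": 7, "eudicots": 7, "magnoliids": 1, "monocots": 6, "protists": 9},
    "Ycf4": {"algae": 6, "eudicots": 24, "magnoliids": 1, "monocots": 5, "protists": 3},
}


def total_losses(gene):
    """Total number of species lacking the gene, summed over lineages."""
    return sum(YCF_LOSSES[gene].values())


def loss_fraction(gene, lineage):
    """Fraction of sampled genomes in a lineage that lack the gene."""
    return YCF_LOSSES[gene].get(lineage, 0) / LINEAGE_TOTALS[lineage]


if __name__ == "__main__":
    for gene in YCF_LOSSES:
        fractions = ", ".join(
            f"{lineage}: {loss_fraction(gene, lineage):.3f}"
            for lineage in LINEAGE_TOTALS
        )
        print(f"{gene} (lost in {total_losses(gene)} species) -> {fractions}")
```

On this scale, Ycf2 is absent from roughly 55% of the sampled algal genomes (185 of 335) and from 30 of the 31 protist genomes, but from fewer than 0.1% of the eudicot genomes (1 of 1314).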
The loss of genes in chloroplast genomes is dynamic
When all of the lost genes were considered together, it was evident that a large number of genes had been lost in algae, eudicots, magnoliids, and monocots (Supplementary Table 1). Only a small number of genes were lost in bryophytes, gymnosperms, protists, and pteridophytes (Supplementary Table 1). When the species of algae, gymnosperms, monocots, eudicots, magnoliids, and bryophytes were grouped together, NdhA, NdhC, NdhD, NdhE, NdhF, NdhG, NdhH, NdhI, NdhJ, and NdhK were found to be lost in all six lineages; while AtpB, AtpE, AtpH, AtpI, CemA, PetA, PetB, PetD, PetG, PetL, PsaA, PsaB, PsaC, PsaI, PsbA, PsbB, PsbC, PsbD, PsbE, PsbF, PsbH, PsbJ, PsbL, PsbZ, Psbf1, Rpl22, Rpl33, RpoB, and RpoC2 had been lost in algae, eudicots, magnoliids, and monocots (Supplementary Figure 16, Supplementary Table 1). AccD, NdhB, PsaJ, Rpl23, and Rpl32 were absent only in species of algae, eudicots, gymnosperms, magnoliids, and monocots. When species of algae, bryophytes, gymnosperms, angiosperms (monocot and dicot), pteridophytes, and protists were grouped together, at least 11 genes were found to be lost in all of the lineages (Supplementary Table 1, Supplementary Figure 17). The most commonly lost genes were NdhA, NdhC, NdhD, NdhE, NdhF, NdhG, NdhH, NdhI, NdhJ, NdhK, and Rps16. The NdhB gene, however, was lost in algae, angiosperms, gymnosperms, protists, and pteridophytes, while it was present in all species of bryophytes. When the higher plant lineages (eudicots, gymnosperms, magnoliids, and monocots) were grouped together, it was found that AccD, NdhA, NdhB, NdhC, NdhD, NdhE, NdhF, NdhG, NdhH, NdhI, NdhJ, NdhK, PsaJ, Rpl23, and Rpl32 had been lost in all four lineages (Supplementary Figure 18, Supplementary Table 1).
AtpB, AtpE, AtpH, AtpI, CcsA, CemA, PetA, PetB, PetD, PetG, PetL, PetN, PsaA, PsaB, PsaC, PsaI, PsbA, PsbB, PsbC, PsbD, PsbE, PsbF, PsbH, PsbJ, PsbL, PsbZ, Psbf1, Rpl22, Rpl33, RpoB, RpoC1, RpoC2, and Rps19 were found to be lost in eudicots, magnoliids, and monocots. ClpP was found to be lost in eudicots, gymnosperms, and magnoliids. A comparative analysis of gene loss in eudicot and monocot plants revealed that gene loss was more frequent in eudicots (69 genes) than in monocots (59 genes). Eudicots and monocots share the loss of 59 genes in their chloroplast genomes. The loss of ClpP, Rpl2, Rpl14, Rpl36, RpoA, Rps2, Rps8, Rps11, Rps14, and Rps18 occurred only in eudicots and not in monocots. A comparative analysis of gene loss in eudicots, gymnosperms, and monocots indicated that the loss of Rps7 was unique to the gymnosperms. The loss of at least 17 genes (AccD, NdhA, NdhB, NdhC, NdhD, NdhE, NdhF, NdhG, NdhH, NdhI, NdhJ, NdhK, PsaJ, Rpl23, Rpl32, Rps15, and Rps16) was found to be common to eudicots, gymnosperms, and monocots.
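The eudicot and monocot comparison above amounts to set arithmetic on per-lineage lists of lost genes. The sketch below illustrates this with the ten eudicot-specific genes named in the text and a small subset of the shared losses; the complete lists (69 and 59 genes) are in Supplementary Table 1 and are not reproduced here, so the shared set is only a stand-in and the variable names are illustrative.

```python
# Genes reported as lost in eudicots but not in monocots (named in the text).
EUDICOT_ONLY = {
    "ClpP", "Rpl2", "Rpl14", "Rpl36", "RpoA",
    "Rps2", "Rps8", "Rps11", "Rps14", "Rps18",
}

# Partial stand-in for the 59 losses shared by eudicots and monocots.
# Only a few genes named in the text are included; this is NOT the full list.
SHARED_SUBSET = {
    "AccD", "NdhA", "NdhB", "PsaJ", "Rpl23", "Rpl32", "Rps15", "Rps16",
}

lost_in_eudicots = EUDICOT_ONLY | SHARED_SUBSET
lost_in_monocots = set(SHARED_SUBSET)

# Set operations corresponding to the comparisons made in the text.
shared = lost_in_eudicots & lost_in_monocots            # lost in both lineages
eudicot_specific = lost_in_eudicots - lost_in_monocots  # lost only in eudicots
monocot_specific = lost_in_monocots - lost_in_eudicots  # lost only in monocots

# Reported totals: 69 losses in eudicots, 59 in monocots, 59 shared.
REPORTED_EUDICOT_LOSSES = 69
REPORTED_SHARED_LOSSES = 59

if __name__ == "__main__":
    print("eudicot-specific losses:", sorted(eudicot_specific))
    print("monocot-specific losses:", sorted(monocot_specific))
    print("reported difference:", REPORTED_EUDICOT_LOSSES - REPORTED_SHARED_LOSSES)
```

The ten eudicot-specific genes account exactly for the difference between the 69 losses reported in eudicots and the 59 shared with monocots, and the monocot-specific set is empty, which is consistent with every monocot loss also occurring in eudicots.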
Chloroplast-derived genes are present in the nuclear genome
It has been speculated that genes lost from chloroplast genomes may have moved to the nuclear genome and are regulated as nuclear-encoded genes. Therefore, the fully sequenced and annotated genomes of 145 plant species were analysed genome-wide to explore this question. Results indicated the presence of almost all of the chloroplast-encoded genes in the nuclear genome. We found 189,381 putative nuclear-encoded chloroplast genes in the 145 plant species (Supplementary File 5). Some of the chloroplast-derived genes that were found in the nuclear genome were: Rubisco accumulation factor, 30S ribosomal proteins (1, 2, 3, S1, S2, S3, S5, S6, S7, S8, S9, S10, S11, S12, S13, S14, S15, S16, S17, S18, S19, S20, S21, and S31), 50S ribosomal proteins (5, 6, L1, L2, L3, L4, L5, L6, L9, L10, L11, L12, L13, L14, L15, L16, L17, L18, L19, L20, L21, L22, L23, L24, L27, L28, L29, L31, and L32), Psa (A, B, C, I, and J), Psb (A, B, D, E, F, H, I, J, K, L, M, N, P, Q, T, and Z), Rpl (12 and 23), RpoA, RpoB, RpoC1, RpoC2, Rps7, Rps12, Ycf (1, 2, and 15), YlmG homolog, Ribulose bisphosphate carboxylase small chain (1A, 1B, 2A, 3A, 3B, 4, F1, PW9, PWS4, and S4SSU11A), Ribulose bisphosphate carboxylase/oxygenase activase A and B, (-)-beta-pinene synthase, (-)-camphene/tricyclene synthase, (+)-larreatricin hydroxylase, (3S,6E)-nerolidol synthase, (E)-beta-ocimene synthase, 1,4-alpha-glucan-branching enzyme, 10 kDa chaperonin, 1,8-cineole synthase, 2-carboxy-1,4-naphthoquinone phytyltransferase, 2-C-methyl-D-erythritol 2,4-cyclodiphosphate synthase, 2-C-methyl-D-erythritol 4-phosphate cytidylyltransferase, ABC transporter B family, AccD, acyl carrier protein, adenylate kinase, ALBINO protein, allene oxide cyclase, anion transporter, anthranilate synthase, APO protein, aspartokinase, ATP synthase, Atp (A, B, E, F, H, I), ATP-dependent Clp protease, beta carbonic anhydrase, calcium-transporting ATPase, Calvin cycle protein CP12, carbonic
anhydrase, cation/H(+) antiporter, chaperone protein Clp (B, C, and D), DnaJ, chaperonin 60 subunit, chlorophyll a-b binding protein (1, 2, 3, 4, 6, 7, 8, 13, 15, 16, 21, 24, 26, 29, 36, 37, 40, 50, 80, M9, LHCII, and P4), chlorophyll(ide) b reductase (NOL and NYC), chloroplastic acetyl coenzyme A carboxylase, chloroplastic group IIA intron splicing facilitator CRS (S1, A, and B), chorismate mutase, cytochrome b6/f complex subunit (1, 2, IV, V, VI, and VIII), cytochrome c biogenesis protein CCS1, DEAD-box ATP-dependent RNA helicase, DNA gyrase A and B, DNA polymerase A and B, DNA repair protein recA homolog, DNA-(apurinic or apyrimidinic site) lyase, DNA-damage-repair/toleration protein, DNA-directed RNA polymerase, early light induced protein, fatty acid desaturase, ferredoxin--NADP reductase, fructokinase, gamma-terpinene synthase, geraniol synthase, geranylgeranyl pyrophosphate synthase, glucose-1-phosphate adenylyltransferase small and large subunit, glutathione S-transferase, GTP diphosphokinase CRSH, inactive ATP-dependent zinc metalloprotease FTSHI, inactive shikimate kinase, kinesin protein KIN (D, E, K, L, and M), L-ascorbate peroxidase, light-harvesting complex protein, light-induced protein, light-regulated protein, lipoxygenase, magnesium transporter, magnesium-chelatase, MATE efflux family protein, multiple organellar RNA editing factor, N-(5'-phosphoribosyl)anthranilate isomerase, NAD Kinase, NAD(P)H-quinone oxidoreductase subunits (1, 2, 3, 4, 5, 6, H, I, J, K, L, M, N, O, S, T, and U), NADH dehydrogenase subunits (1, 2, 3, 4, 5, 6, 7, I, J, and K), NADH-plastoquinone oxidoreductase subunits (1, 2, 3, 4, 5, 6, 7, I, J, and K), NADPH-dependent aldehyde reductase, nifU protein, nudix hydrolases, outer envelope pore proteins, oxygen-evolving enhancer proteins, pentatricopeptide repeat-containing protein (CRP1, DOT4, DWY1, ELI1, MRL1, OTP51, PPR5), peptide chain release factor, peptide methionine sulfoxide reductase, peptidyl-prolyl cis-trans isomerases, 
Pet (A, B, G, and L), phospholipase, photosynthetic NDH subunit of lumenal location, photosynthetic NDH subunit of subcomplex B, protochlorophyllide reductase subunits (B, L, and N), phytol kinase, plastid-lipid-associated proteins, protease Do 1, protein cofactor assembly of complex c subunits, protein CutA, DCL, pyruvate dehydrogenase E1 component subunits, sodium/metabolite cotransporter BASS, soluble starch synthase, stearoyl-[acyl-carrier-protein] 9-desaturase, thioredoxins, thylakoid luminal proteins, translation initiation factor, transcription factor GTE3, transcription termination factor MTERF, translocase of chloroplast, zinc metalloprotease EGY, and others (Supplementary File 6).
The ratio of nucleotide substitution is highest in Pteridophytes and lowest in Nymphaeales
The rate of nucleotide substitution in the chloroplast genome is an important parameter that needs to be understood more precisely to further elucidate the evolution of the chloroplast genome. Single-base substitutions and insertion and deletion (indel) events play an important role in shaping the genome. Therefore, an analysis was conducted to determine the rate of substitution in chloroplast genomes grouped according to their respective lineages. Results indicated that the transition/transversion substitution ratio was highest in pteridophytes (k1 = 4.798 and k2 = 4.043) and lowest in Nymphaeales (k1 = 2.799 and k2 = 2.713) (Supplementary Table 2). The ratio of nucleotide substitution in species with deleted IR regions was 2.951 (k1) and 3.42 (k2) (Supplementary Table 2). The rate of A > G transition was highest in pteridophytes (15.08) and lowest in protists (8.51), and the rate of G > A transition was highest in protists (22.15) and lowest in species with deleted IR regions (16.8). The rate of T > C substitution was highest in pteridophytes (14.01) and lowest in protists (8.95) (Supplementary Table 2). The rate of C > T substitution was highest in protists (22.34) and lowest in Nymphaeales. Transversions were about two times less frequent than transitions. The rate of A > T transversion was highest in protists (6.80) and lowest in pteridophytes (4.64), while the rate of T > A transversion was highest in algae (6.98) and lowest in pteridophytes (Supplementary Table 2). The rate of G > C substitution was highest in Nymphaeales (4.31) and lowest in protists (2.46), while the rate of C > G substitution was highest in Nymphaeales (4.14) and lowest in protists (2.64) (Supplementary Table 2).
Based on these results, it is concluded that the highest rates of transition and transversion occurred in lower eukaryotic species, including algae, protists, Nymphaeales, and pteridophytes, whereas high rates of transition/transversion were not observed in bryophytes, gymnosperms, monocots, and dicots (Supplementary Table 2). Notably, G > A transitions were more prominent in chloroplast genomes with deleted IR regions (Supplementary Table 2).
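For readers unfamiliar with the distinction, a transition exchanges two purines (A and G) or two pyrimidines (C and T), whereas a transversion exchanges a purine for a pyrimidine or vice versa. The sketch below is a naive count-based illustration on a pair of aligned sequences; it is not the estimation procedure used to obtain the k1 and k2 values in Supplementary Table 2, and the function names and example sequences are hypothetical.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def classify_substitution(ref, alt):
    """Classify a single-base change as 'transition' or 'transversion'."""
    if ref == alt:
        raise ValueError("ref and alt must differ")
    pair = {ref, alt}
    # Purine<->purine or pyrimidine<->pyrimidine changes are transitions.
    if pair <= PURINES or pair <= PYRIMIDINES:
        return "transition"
    return "transversion"


def count_substitutions(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences.

    Positions containing gaps or ambiguous bases are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    counts = {"transition": 0, "transversion": 0}
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        counts[classify_substitution(a, b)] += 1
    return counts


def ti_tv_ratio(seq_a, seq_b):
    """Raw transition/transversion count ratio for two aligned sequences."""
    counts = count_substitutions(seq_a, seq_b)
    if counts["transversion"] == 0:
        return float("inf") if counts["transition"] else float("nan")
    return counts["transition"] / counts["transversion"]


if __name__ == "__main__":
    # Hypothetical toy alignment, for illustration only.
    print(count_substitutions("AACCGGTT", "GACTGCTA"))
    print(ti_tv_ratio("AACCGGTT", "GACTGCTA"))
```

A raw count ratio of this kind ignores multiple substitutions at the same site and unequal base frequencies, which is why model-based estimates are normally preferred for divergent genomes.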
Chloroplast genomes have evolved from multiple common ancestral nodes
A phylogenetic tree of all 2511 studied species was constructed to obtain an evolutionary perspective of chloroplast genomes (Figure 7). The phylogenetic analysis produced four distinct clusters, indicating that chloroplast genomes evolved independently from multiple common ancestral nodes. Lineage-specific groupings of chloroplast genomes were not present in the phylogenetic tree. The genomes of algae, bryophytes, gymnosperms, eudicots, magnoliids, monocots, and protists grouped dynamically in different clusters. Although the chloroplast genomes of protists were far smaller than those of other lineages, they were still distributed sporadically throughout the phylogenetic tree. Time tree analysis indicated that the cyanobacterial species used as the out-group in this study date back to ~2180 Ma, that the endosymbiosis of the cyanobacterial genome occurred ~1768 Ma, and that it was incorporated into the algal lineage ~1293-686 Ma (Figure 8); this lineage then further evolved into the Viridiplantae ~1160 Ma, Streptophyta ~1150 Ma, Embryophyta ~532 Ma, Tracheophyte ~431 Ma, Euphyllophyte 402 Ma, and Spermatophyta 313 Ma (Figure 8). The molecular signature genes PsaM, ChlB, ChlL, ChlN, Psb30, and Rpl21, present in algae, bryophytes, pteridophytes, and gymnosperms, were lost ~203 (Cycadales) and ~156 (Gnetidae) Ma and, as a result, are not found in the subsequently evolved angiosperm lineage (Figure 8).