Convergent Molecular Evolution of PEPC Gene Family in C4 and CAM Plants

Background: Phosphoenolpyruvate carboxylase (PEPC), the key enzyme of initial carbon fixation in the C4 and crassulacean acid metabolism (CAM) pathways, is thought to have undergone convergent adaptive changes that contributed to the convergent evolution of C4 and CAM photosynthesis in vascular plants. However, the overall evolutionary history and convergence of PEPC in plants remain poorly understood.
Results: In the present study, we identified the members of the PEPC gene family across green plants using genomic data, found ten conserved motifs, and modeled the three-dimensional protein structures of 90 plant-type PEPC genes. After reconstructing the PEPC gene family tree and reconciling it with the species tree, we found that PEPC genes underwent 71 gene duplications and 16 gene losses, which might have resulted from whole-genome duplication events in plants. Based on the phylogenetic tree of the whole PEPC gene family, we detected four convergent evolution sites of PEPC in C4 species but none in CAM species.
Conclusions: The PEPC gene family is ubiquitous and highly conserved in green plants. After originating by gene duplication from an ancestral C3-PEPC, C4-PEPC isoforms underwent convergent molecular substitutions that facilitated the convergent evolution of C4 photosynthesis in angiosperms, whereas PEPC genes showed no molecular convergence corresponding to the multiple independent origins of CAM photosynthesis. Our findings help to understand the origin and evolution of the C4 and CAM pathways and shed new light on the adaptation of plants to drought and high-temperature habitats.
Abbreviations: PEPC: Phosphoenolpyruvate carboxylase; CAM: Crassulacean acid metabolism; CCMs: Carbon-concentrating mechanisms; PTPC: Plant-type PEPC; BTPC: Bacterial-type PEPC; WGD: Whole-genome duplication; 3D: Three-dimensional; RuBisCO: Ribulose 1,5-bisphosphate carboxylase/oxygenase.


Background
Plant photosynthesis is the most important biochemical process on our planet, providing energy, food and oxygen for the survival and reproduction of all heterotrophic organisms, including humans [1]. Since the origin of photosynthesis, oxygen levels in the Earth's atmosphere have gradually risen and the carbon dioxide (CO2) concentration has decreased [2], which enhanced photorespiration and led to losses of energy and carbohydrate [3]. To obtain adequate CO2 and improve photosynthetic efficiency, plants developed several carbon-concentrating mechanisms (CCMs), such as C4 photosynthesis and crassulacean acid metabolism (CAM), to adapt to the sudden decline of the atmospheric CO2 concentration at ~350 Mya [4,5]. C4 photosynthesis, a notable example of convergent evolution, has evolved more than 60 times in flowering plants [6,7], whereas the CAM pathway has arisen independently in ~35 families of vascular plants [8]. Recently, several comparative genomic studies provided new insights into the genetics and evolution of CCMs [4,9-11]. However, the molecular mechanisms underlying the convergent evolution of CCMs remain poorly understood.
In plants, phosphoenolpyruvate carboxylase (PEPC; EC 4.1.1.31) is a key enzyme that catalyzes the primary fixation reaction of photosynthetic CO2 assimilation in C4 and CAM photosynthesis [12-14]. PEPC genes also play crucial roles in a variety of non-photosynthetic functions, such as carbon and nitrogen metabolism [15], seed development and germination [16], abiotic stress responses [17,18] and others [12,19,20]. Because of these so-called housekeeping functions, PEPC genes are highly conserved, which makes them valuable molecular markers for the phylogenetic reconstruction of plants [21]. Although the PEPC genes recruited during the evolution of C4 and CAM photosynthesis share common origins, they are distinct copies generated by whole-genome duplication in angiosperms [22].
Interestingly, PEPC genes are essential for the regulation of the core circadian clock in CAM photosynthesis [23] and share convergent amino acid changes in diverse CAM species [9]. In C4 grasses, PEPC genes also underwent parallel adaptive genetic changes [10,24]. PEPC genes are therefore crucial for elucidating the origin and convergent evolution of C4 and CAM photosynthesis. However, PEPC genes are distributed across all plants, and previous studies, which were performed in only a few lineages of angiosperms, could not sufficiently elucidate the origin and evolution of the PEPC gene family [25]. Furthermore, CAM photosynthesis evolved in the major clades of vascular plants, including pteridophytes (lycopods and ferns), gymnosperms and angiosperms, whereas C4 photosynthesis is restricted to angiosperms. Therefore, understanding the convergent molecular evolution of the PEPC gene family in the plant kingdom requires a sample of genomic datasets that adequately represents C3, C4 and CAM species across the major lineages of plants.
Fortunately, more than 400 plant genomes have been sequenced to date (collected in plaBiPD, https://www.plabipd.de/index.ep), which provides an excellent opportunity to resolve the origin and convergent evolution of C4 and CAM in plants [4,11,26]. In the present study, we extracted PEPC genes from 17 genomic datasets, which covered C3, C4 and CAM species across algae, bryophytes, pteridophytes, gymnosperms and angiosperms, and reconstructed the evolutionary history and molecular convergence of PEPC genes in C4 and CAM plants. Our findings help to elucidate the origin and evolution of C4 and CAM plants and shed new light on the adaptation of plants to drought and high-temperature environments.

Results And Discussion
PEPC gene family was ubiquitous and conserved in plants
PEPC is a highly regulated enzyme that catalyzes the irreversible β-carboxylation of phosphoenolpyruvate in the presence of bicarbonate and Mg2+ to yield oxaloacetate and Pi, a reaction that serves a variety of physiological functions in plants [12,27,28]. As genes encoding a key carboxylase, PEPC genes were widely distributed in green plants (Table 1). In the present study, we identified 264 homologous genes using the Arabidopsis PEPC genes (At1g53310, At1g68750, At2g42600 and At3g14940) as BLAST queries, and 179 homologous genes by searching with the Pfam seed of the PEPCase domain (PF00311), in both cases across 17 genomic datasets spanning the green plants. After combining the two gene sets, we obtained 179 common PEPC homologs and searched them for conserved domains in the conserved domain database. A total of 109 genes contained specific conserved domains: the plant-type PEPC (PTPC) domain (PEPCase) was identified in 90 genes and the bacterial-type PEPC (BTPC) domain (PRK00009) in 19 genes, while the remaining 70 genes contained other PEPCase superfamily domains (Table 1, S1). The copy number of PEPC genes varied among the clades of green plants. BTPC genes were present in 11 of the 17 species in this study and had fewer copies than PTPC genes (Table 1). We did not detect any PEPC gene with the specific PEPCase conserved domain in Norway spruce (Picea abies), perhaps owing to the extensive pseudogenization and transposable element insertion in conifers [36]. However, 10 PEPC homologs in Norway spruce contained the PEPCase superfamily domain and probably perform PEPC-like physiological functions. Owing to numerous whole-genome duplications [44], the PEPC gene family had relatively more copies in angiosperms, especially in maize, which contained 22 PEPC gene copies.
Interestingly, the moss Physcomitrella patens also retained 22 gene copies, but its sister clades, hornworts (Anthoceros angustus) and liverworts (Marchantia polymorpha), each retained only a single member of the PEPC gene family (Table 1). This extreme difference in gene content corresponded to their different strategies of adaptation to life on land. Mosses underwent whole-genome duplication events (WGDs) that increased gene family complexity for coordinating multicellular growth and the dehydration response [31]. Liverworts, in contrast, have ancient dimorphic sex chromosomes, which resulted in a lack of WGDs and reduced proliferation of regulatory genes [30]. The genome of A. angustus was notably simplified and acquired stress-response and metabolic pathway genes through horizontal gene transfer from bacteria or fungi, which probably assisted its survival in a terrestrial environment [32].
PEPC genes displayed highly conserved amino acid sequences in all green plants. Here, we predicted ten conserved motifs without overly similar pairs from 109 PEPC proteins and found that every motif was longer than 29 amino acids and was present in more than 104 of the 109 PEPC protein sequences (Table 2). Additionally, most amino acid sites were identical within each motif, and the linear order of the motifs, especially in PTPC genes, was also identical across all green plants, although some motifs were repeated in various genes (Fig. 1). These results clearly indicated that the PEPC gene family has been extremely conserved throughout an evolutionary history of more than 500 million years since its origin in algae [14,20,28]. Nevertheless, this highly conserved gene family performs highly diverse housekeeping functions, both photosynthetic and non-photosynthetic, that support survival in terrestrial environments [19]. To understand the function and molecular mechanisms of the PEPC gene family, the three-dimensional (3D) structures of PEPC proteins have been elucidated by X-ray crystallographic analysis [45-48], which revealed many structure-function relationships of PEPC catalysis, allosteric control and regulatory phosphorylation [49]. In this study, we modeled the 3D structures of 90 PTPC proteins using the SWISS-MODEL server (Figure S1) and identified three template structures, 5vyj.1.A (Fig. 2A) [48], 3zgb.1.A (Fig. 2B) [46] and 5fdn.1.A (Fig. 2C) [47], with the highest sequence identity (Table S2). All PTPC proteins were tetrameric enzymes with one of three kinds of 3D structure: the 3zgb.1.A template was found in all species of this study, the 5vyj.1.A template in seven species, and the 5fdn.1.A template only in Arabidopsis (Fig. 3, Table S2).
In the widespread template 3zgb.1.A, C4-PEPC isoforms carried two amino acid substitutions relative to C3-PEPC, one increasing PEP saturation kinetics and the other reducing inhibitor affinity [46,50]. As a result, the efficiency of photosynthetic carbon fixation was greatly improved in C4 plants.

PEPC was convergent in C4 but not in CAM photosynthesis
The C4 and CAM photosynthetic pathways each evolved independently more than 60 times, in different lineages of angiosperms and vascular plants, respectively [4,6,8]. However, the molecular mechanisms underlying the convergent evolution of these two carbon-concentrating mechanisms are poorly understood. Recently, comparative genomic analyses indicated that specific amino acid substitutions at a few key sites can lead to highly predictable convergent evolution events [9,51,52]. In C4 and CAM plants, the PEPC enzyme catalyzes the primary carbon fixation that raises the CO2 concentration at the active site of ribulose 1,5-bisphosphate carboxylase/oxygenase (RuBisCO), which increases water-use efficiency and the utilization of nitrogen and other mineral nutrients [3]. Interestingly, PEPC 1 is very important for regulating the core circadian clock in the CAM photosynthetic pathway of Kalanchoe laxiflora [23], and PEPC 2 of Kalanchoe fedtschenkoi shares a convergent amino acid substitution with diverse CAM species [9]. Moreover, PEPC genes also underwent parallel adaptive genetic changes in C4 grasses [10,24]. Therefore, PEPC genes probably played crucial roles in the convergent evolution of C4 and CAM photosynthesis [12-14, 19, 25, 28, 49].
Here, we reconstructed the phylogenetic tree of the PEPC gene family in green plants with a relatively adequate sampling that included C3, C4 and CAM species across the major lineages of green plants (Fig. 1). The PEPC gene family consisted of two major subfamilies, PTPC and BTPC, the former of which performs the critical role of initial carbon fixation in C4 and CAM photosynthesis [12,13,25,27]. Therefore, the evolutionary history of PTPC genes was reconstructed by maximum likelihood and reconciled with the species tree using duplication-loss reconciliation (Fig. 3). The reconciled tree showed that PTPC genes underwent at least 71 duplications and 16 losses in the evolutionary history of the species sampled in the present study, which indicated that PTPC genes experienced multiple large-scale duplication events, possibly caused by whole-genome duplication during the evolution of green plants, especially in angiosperms (Fig. 3) [53]. Gene duplication is critical for plants to respond to new environments through neo-functionalization of gene copies [54,55]. Previous studies assumed that the PEPC isoforms of C4 and CAM species originated from a non-photosynthetic PEPC gene that already existed in C3 ancestral species [14,22]. Our results, however, indicated that PEPC gene duplications correlated with the presence of C4 but not of CAM photosynthesis (Fig. 3), a pattern that was also found in orchids [56]. In other words, PEPC gene duplications were very important for the evolutionary origin of C4 photosynthesis but not for that of the CAM pathway, in which posttranslational regulation of PEPC possibly played the key role [19,27,57].
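How duplications are counted in a duplication-loss reconciliation can be illustrated with a minimal sketch of the classic lowest-common-ancestor (LCA) mapping. This is not the Treerecs implementation used in this study, and loss counting is omitted. The nested-tuple tree encoding, the function names, and the toy species and gene names are assumptions made purely for illustration.

```python
def collect_clades(tree, out):
    """Append the leaf set of every clade of a nested-tuple species tree to `out`."""
    if isinstance(tree, str):
        leaves = frozenset([tree])
    else:
        leaves = frozenset()
        for child in tree:
            leaves |= collect_clades(child, out)
    out.append(leaves)
    return leaves


def lca(species_set, clades):
    """Return the smallest species-tree clade that contains all given species."""
    return min((c for c in clades if species_set <= c), key=len)


def count_duplications(gene_tree, clades, gene_to_species):
    """Count gene-tree nodes that map to the same species clade as one of their children."""
    duplications = 0

    def visit(node):
        nonlocal duplications
        if isinstance(node, str):
            return frozenset([gene_to_species[node]])
        child_sets = [visit(child) for child in node]
        here = frozenset().union(*child_sets)
        mapped = lca(here, clades)
        # LCA rule: a node is a duplication if a child maps to the same clade.
        if any(lca(cs, clades) == mapped for cs in child_sets):
            duplications += 1
        return here

    visit(gene_tree)
    return duplications


# Toy data (hypothetical): one duplication before the maize/sorghum split.
species_tree = (("maize", "sorghum"), "rice")
gene_tree = ((("maize_a", "sorghum_a"), ("maize_b", "sorghum_b")), "rice_a")
gene_to_species = {name: name.split("_")[0] for name in
                   ["maize_a", "sorghum_a", "maize_b", "sorghum_b", "rice_a"]}

clades = []
collect_clades(species_tree, clades)
n_duplications = count_duplications(gene_tree, clades, gene_to_species)
```

In the toy gene tree, the node joining the two maize/sorghum pairs maps to the same species clade as both of its children, so it is the only node scored as a duplication.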
To test whether convergent molecular evolution at the amino acid level existed in PTPC proteins, we performed a comprehensive search for convergent evolution sites in C4 and CAM photosynthesis based on the PTPC gene tree of green plants with the PCOC pipeline, which can detect not only convergent substitutions to exactly the same amino acid but also convergent shifts that correspond to convergent phenotypic changes [58]. We first searched for convergent sites across all phenotypically convergent clades of CAM and of C4 photosynthesis separately. The results showed that two convergent shifts existed in the PTPC proteins of CAM species and three convergent shifts in those of C4 species, but no identical convergent substitution was detected in the clades of either photosynthetic pathway (Fig. 4), which indicated that convergent molecular evolution at the amino acid level did not exist across all copies of PTPC proteins.
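The notion of an identical convergent substitution can be made concrete with a deliberately naive scan of an alignment. This is not the PCOC method, which works on the gene tree with a model-based approach; the sketch ignores the tree entirely and only flags columns where all foreground (for example C4) sequences share one residue that no background sequence carries. The function name, sequence names and toy sequences are hypothetical.

```python
def candidate_convergent_columns(alignment, foreground, gap="-"):
    """Return (column, residue) pairs where all foreground sequences share a
    residue that is absent from every background sequence (1-based columns)."""
    length = len(next(iter(alignment.values())))
    hits = []
    for col in range(length):
        fg = {seq[col] for name, seq in alignment.items() if name in foreground}
        bg = {seq[col] for name, seq in alignment.items() if name not in foreground}
        if len(fg) == 1 and gap not in fg and not (fg & bg):
            hits.append((col + 1, next(iter(fg))))
    return hits


# Toy alignment (hypothetical sequences, equal length).
alignment = {
    "c4_a": "MASK",
    "c4_b": "MSSK",
    "c3_a": "MAAK",
    "c3_b": "MTAK",
}
hits = candidate_convergent_columns(alignment, foreground={"c4_a", "c4_b"})
```

Only column 3 qualifies here: both foreground sequences carry S, and neither background sequence does. A scan of this kind cannot distinguish convergence from shared ancestry, which is why a tree-aware method is needed in practice.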
In addition to photosynthetic functions, PEPC genes also perform highly diverse non-photosynthetic functions in, for example, abiotic stress responses, fruit maturation, and seed formation and germination [12, 15, 17-19, 28, 59]. In other words, different isoforms of the PEPC gene family might perform different functions, and only a few PEPC isoforms may be associated with the convergent evolution of C4 and CAM photosynthesis. Therefore, we further searched for convergent evolution sites in different gene groups, in which each phenotypically convergent species retained one clade or one gene copy (one-to-one). Interestingly, four convergent amino acid sites were discovered in C4 species in one gene group (Ahy03/Zma11), two of which had been confirmed in previous studies [10,24,46,50]. The convergent amino acid mutations at the active site Ala774 (position 898 in Fig. 5) and the inhibitory site Arg884 (position 1010 in Fig. 5) were sufficient to switch the photosynthetic function from C3 to C4 activity [46]. However, when we searched for convergent evolution sites in the one-to-one gene groups of CAM species, no identical convergent sites were found, which indicated that the PEPC gene had no identical convergent sites that could account for the photosynthetic conversion from the C3 to the CAM pathway.
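A residue number in a protein and its column in a gapped alignment differ whenever gaps precede the residue, which is presumably why Ala774 and Arg884 appear at positions 898 and 1010 in Fig. 5. A minimal sketch of the conversion follows; the function name and the toy sequence are hypothetical.

```python
def alignment_column(aligned_seq, residue_number, gap_chars="-."):
    """Return the 1-based alignment column of the given 1-based residue number."""
    seen = 0
    for column, char in enumerate(aligned_seq, start=1):
        if char not in gap_chars:
            seen += 1
            if seen == residue_number:
                return column
    raise ValueError("sequence has fewer residues than requested")


# Toy aligned sequence (hypothetical): residues M, A, K, S with three gaps.
toy = "M--AK-S"
column_of_third = alignment_column(toy, 3)
```

In the toy sequence, the third residue (K) sits in column 5 because two gap columns precede it.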

Conclusions
In summary, the PEPC gene family, which encodes a key enzyme, was ubiquitous and highly conserved in green plants and consisted of two major subfamilies, BTPC and PTPC, the latter of which plays a crucial role in the initial carbon fixation of the C4 and CAM photosynthetic pathways. During the evolutionary history of the PEPC gene family, gene duplications occurred frequently owing to multiple whole-genome duplication events; these duplications, followed by gene neo-functionalization, resulted in the emergence of C4 photosynthesis but were not correlated with CAM photosynthesis. Additionally, four convergent substitution sites were detected in C4-PEPC isoforms, two of which were the key functional positions that switch the photosynthetic pathway from C3 to C4 activity, whereas no convergent sites were detected in CAM-PEPC genes. Our results indicated that convergent molecular substitutions in the PEPC gene played key roles in the origin and convergent evolution of C4 photosynthesis, but that the convergent evolution of CAM photosynthesis was not caused by PEPC gene convergence at the amino acid level.

Gene family identification
To identify the members of the PEPC gene family, we created a local BLAST database with the protein sequences of 17 plant species and then performed BLASTp searches with default parameters using the Arabidopsis PEPC amino acid sequences (At1g53310, At1g68750, At2g42600 and At3g14940) as queries. Furthermore, the seed file of the PEPCase domain (PF00311), obtained from the Pfam website (http://pfam.xfam.org/), was used to build an HMMER profile, and candidate PEPC genes were searched in the 17 genomic datasets with HMMER v3.2.1 [60]. After combining the two PEPC gene sets, a conserved domain search was performed using the Batch CD-search Tool with default parameters against the conserved domain database [61], and the genes with the PEPCase conserved domain were identified as reliable plant-type PEPC genes.
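The combination and filtering steps described above can be sketched as simple set operations. This is an illustrative sketch, not the scripts used in this study: the function names and gene identifiers are hypothetical, and it assumes that the BLASTp hits, the HMMER hits and the per-gene domain assignments have already been parsed into Python collections.

```python
def combine_candidates(blast_hits, hmmer_hits):
    """Keep only the genes recovered by both the BLASTp and the HMMER search."""
    return sorted(set(blast_hits) & set(hmmer_hits))


def classify_by_domain(candidates, domain_of):
    """Group candidates by the conserved domain reported for each gene."""
    groups = {"PTPC": [], "BTPC": [], "other": []}
    for gene in candidates:
        domain = domain_of.get(gene)
        if domain == "PEPCase":        # plant-type PEPC domain
            groups["PTPC"].append(gene)
        elif domain == "PRK00009":     # bacterial-type PEPC domain
            groups["BTPC"].append(gene)
        else:                          # other superfamily hit or no domain
            groups["other"].append(gene)
    return groups


# Hypothetical gene identifiers, for illustration only.
blast_hits = ["geneA", "geneB", "geneC", "geneD"]
hmmer_hits = ["geneB", "geneC", "geneD"]
domain_of = {"geneB": "PEPCase", "geneC": "PRK00009", "geneD": "PEPCase superfamily"}

common = combine_candidates(blast_hits, hmmer_hits)
groups = classify_by_domain(common, domain_of)
```

Taking the intersection follows the description of the 179 "common" homologs in the Results; a union would instead keep genes supported by only one of the two searches.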

Conserved motifs prediction and proteins structures modeling
After the reliable PEPC genes had been identified with this strict pipeline, ten conserved motifs were predicted by MEME v5.1.0 with default parameters [62]. Motif alignment was performed by MAST v5.1.0 with default parameters [63], and the conserved motifs were visualized with TBtools v1.046 [64]. The three-dimensional structures of PEPC proteins were predicted by SWISS-MODEL [65]; for each protein, the model with the highest GMQE score and sequence identity, which indicate the highest reliability, was selected as the optimal model.

Phylogenetic reconstruction and gene-species tree reconciliation
To understand the evolution of the PEPC gene family, PEPC amino acid sequences were aligned with MAFFT v7.453 using the most accurate strategy, L-INS-i, with a maximum of 1000 iterative refinements [66]. Conserved blocks were selected by GBLOCKS 0.91b [67] with the parameters -b4=5 -b5=h. We then reconstructed the PEPC gene family tree by maximum likelihood with IQ-TREE v1.6.11 [68], using the best-fit amino acid model JTTDCMut+R5 selected by ModelFinder [69] according to the Bayesian information criterion, and calculated the ultrafast bootstrap approximation with 1000 replicates [70]. Phylogenetic reconciliation of the gene tree and the species tree was performed by Treerecs [71] with default parameters; the species tree was based on a recent phylogenomic reconstruction of green plants [72].
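The alignment, trimming and tree-building steps above could be assembled into command lines roughly as follows. This is a hedged sketch rather than the exact commands run in this study: the file names are hypothetical, the executable names (`mafft`, `Gblocks`, `iqtree`) and the `-t=p` protein flag are assumptions about a typical installation, and the function only builds the argument lists without executing anything.

```python
def build_phylogeny_commands(fasta, aligned="pepc.aln.fasta"):
    """Return argument lists for MAFFT L-INS-i, Gblocks and IQ-TREE (not executed)."""
    # MAFFT L-INS-i with up to 1000 refinement cycles; MAFFT writes the
    # alignment to standard output, so it must be redirected to `aligned`.
    mafft = ["mafft", "--localpair", "--maxiterate", "1000", fasta]

    # Gblocks on a protein alignment with the block parameters from the text.
    gblocks = ["Gblocks", aligned, "-t=p", "-b4=5", "-b5=h"]

    # Assumption: Gblocks names its output by appending "-gb" to the input name.
    trimmed = aligned + "-gb"

    # IQ-TREE 1.x with the reported model and 1000 ultrafast bootstrap replicates.
    iqtree = ["iqtree", "-s", trimmed, "-m", "JTTDCMut+R5", "-bb", "1000"]

    return {"mafft": mafft, "gblocks": gblocks, "iqtree": iqtree}


commands = build_phylogeny_commands("pepc_proteins.fasta")
```

Each list could be passed to `subprocess.run`, with the MAFFT output redirected to the alignment file. Fixing the model with `-m JTTDCMut+R5` reproduces the reported choice; running ModelFinder again would instead re-select the model from the data.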

Convergent substitution detection
Because the PEPC gene family has multiple copies, putative convergent clades or gene copies were labeled on the phylogenetic tree under three kinds of gene combinations, including (1)

Availability of data and materials
All data generated or analyzed during this study are included in this published article.

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.