General features of the O. gaubae plastid genome
It was previously reported that the plastomes of Papilionoideae, particularly IR-loss clade, are not conserved in their genomic structure in terms of gene order and gene content and exhibit numerous rearrangements and gene/intron losses5,24,25. The plastid genome of O. gaubae with 123,645 bp in length and having only one copy of the IR region, which is in accordance with the reports and its genome structure is similar to those of other IRLC species. In this context, the lack of rps16 and rpl22 genes and intron 1 of clpP in the plastome of O. gaubae should be noted; these genes, are absent from the chloroplast genomes of entire IRLC24,26. The assembled chloroplast genome of O. gaubae contained 110 genes, including 76 protein-coding genes, 30 transfer RNA (tRNA) genes and four ribosome RNA (rRNA) genes (Fig. 1, Table 1). The LSC (81,066 bp), SSC (13,777 bp) and IR (28,802 bp) regions along with the locations of 110 genes in the chloroplast genome are shown in Fig. 1. A total of 16 genes contained a single intron, whereas ycf3 exhibits two introns (Supplementary Table S1). The rps12 gene is a trans-splicing gene which does not have introns in the 3'-end. The trnK-UUU has the largest intron encompassing the matK gene, with 2,495 bp, whereas the intron of trnL-UAA is the smallest (542 bp). The overall GC content of the O. gaubae chloroplast genome sequence was 35.1%, which is consistent with other IRLC species, whose plastomes have GC-contents ranging from 33.6–35.1% (Table 2). Different GC content occurs in the LSC (34.4%), SSC (30.8%) and IR (39.2%) regions (Supplementary Table S2). Higher GC content was usually detected in the IRs compared to the other regions of plastome, which is mainly due to the presence of rRNA genes (rrn23, rrn16, rrn5, rrn4.5) with high GC content (50%-56.4%) in IRs 6,27. The GC content of the protein-coding regions of O. gaubae chloroplast genome comprised 36.02%. Within these regions, the GC contents for the first, second and third positions of the codons were 42.4%, 36.1% and 29.5%, respectively.
Table 1
Genes predicted in the chloroplast genome of O. gaubae. The number of asterisks after the gene names indicates the number of introns contained in the genes.
Category of genes
|
Group of genes
|
Name of genes
|
Self-replication
|
Large subunit of ribosomal proteins
|
rpl14, rpl16*, rpl2*, rpl20, rpl23, rpl32, rpl33, rpl36
|
|
Small subunit of ribosomal proteins
|
rps2, rps3, rps4, rps7, rps8, rps11, rps12*, rps14, rps15, rps18, rps19
|
|
DNA-dependent RNA polymerase
|
rpoA, rpoB, rpoC1*,rpoC2
|
|
Ribosomal RNA genes
|
rrn16S, rrn23S, rrn 4.5S, rrn 5S
|
|
Transfer RNA genes
|
30 trn genes (5 contain an intron)
|
Genes for photosynthesis
|
Subunits of NADH-dehydrogenase
|
ndhA*, ndhB*, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK
|
|
Subunits of photosystem I
|
psaA, psaB, psaC, psaI, psaJ
|
|
Subunits of photosystem II
|
psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ
|
|
Subunits of cytochrome b/f complex
|
petA, petB*, petD*, petG, petL, petN
|
|
Subunits of ATP synthase
|
atpA, atpB, atpE, atpF*, atpH, atpI
|
|
Subunit of rubisco
|
rbcL
|
Other genes
|
Maturase K
|
matK
|
|
Envelope membrane protein
|
cemA
|
|
Subunit of Acetyl-CoA-carboxylase
|
accD
|
|
C-type cytochrome synthesis gene
|
ccsA
|
|
Protease
|
clpP*
|
Genes of unkown function
|
Conserved hypothetical chloroplast
open reading frames
|
ycf1, ycf2, ycf4, ycf3**
|
Table 2
Chloroplast genome information from sampled IRLC species and the newly assembled O. gaubae. LSC: Large Single Copy, SSC: Small Single Copy, IR: Inverted Repeat.
Species
|
Size (bp)
|
LSC (bp)
|
SSC (bp)
|
IR (bp)
|
GC (%)
|
Astragalus mongholicus
|
123,582
|
80,986
|
13,773
|
28,823
|
34.1%
|
Caragana microphylla
|
130,029
|
85,436
|
14,106
|
30,487
|
34.3%
|
Carmichaelia australis
|
122,805
|
80,588
|
14,074
|
28,143
|
34.3%
|
Cicer arietinum
|
125,319
|
82,583
|
13,820
|
28,916
|
33.9%
|
Glycyrrhiza glabra
|
127,943
|
84,714
|
14,187
|
29,042
|
34.2%
|
Hedysarum semenovii
|
123,407
|
80,288
|
13,679
|
29,440
|
34.9%
|
Lens culinaris
|
122,967
|
81,659
|
13,833
|
27,604
|
34.4%
|
Lessertia frutescens
|
122,700
|
80,698
|
13,750
|
28,252
|
34.2%
|
Medicago sativa
|
125,330
|
83,756
|
13,383
|
28,191
|
34%
|
Melilotus albus
|
127,205
|
84,279
|
13,806
|
29,120
|
33.6%
|
Meristotropis xanthioides
|
127,735
|
84,629
|
14,150
|
28,956
|
34.2%
|
Onobrychis gaubae
|
123,645
|
81,066
|
13,777
|
28,802
|
35.1%
|
Oxytropis bicolor
|
122,461
|
80,170
|
14,017
|
28,274
|
34.2%
|
Tibetia liangshanensis
|
123,372
|
79,916
|
13,513
|
29,943
|
34.7%
|
Wisteria floribunda
|
130,561
|
87,193
|
14,127
|
29,628
|
34.4%
|
Codon usage bias
The total coding DNA sequences (CDSs) were 81,121 bp in length and encoded 75 genes including 24,765 codons which belonged to 61 codon types. Codon usage was calculated for the protein-coding genes present in the O. gaubae cp genome. Phenylalanine was the most abundant amino acid, whereas arginine showed the least abundance in this species (Supplementary Table S3). Most protein-coding genes employ the standard ATG as the initiator codon. Among the O. gaubae protein-coding genes, three genes used alternative start codons; ACG for psbL, GTG for rps8 and ACG for ndhD.
The chloroplast genomes of the IRLC were analyzed for their codon usage frequency according to sequences of protein-coding genes and relative synonymous codon usage (RSCU). RSCU is an important indicator to measure codon usage bias in coding regions. This value is the ratio between the actual observed values of the codon and the theoretical expectations. If RSCU = 1, codon usage is unbiased; if RSCU > 1, specific codon frequency is higher than other synonymous codons, otherwise, the frequency is low28,29. The total number of codons among protein-coding genes in the IRLC species varies from 20,381 codons in Hedysarum taipeicum (as the smallest number) to 24,765 codons in O. gaubae. The most often used synonymous codon was AUU, encoding isoleucine, and the least used was CGC/CGG, encoding arginine (Supplementary Table S4). In the IRLC, the standard AUG codon was usually the start codon for the majority of protein-coding genes and UAA was the most frequent stop codon among three stop codons. Methionine (AUG) and tryptophan (UGG) showed RSCU = 1, indicating no codon bias for these two amino acids. The highest RSCU value was for UUA (~ 2.04) in leucine and the lowest was GGC (~ 0.35) in glycine. Leucine preferred six codon types (UUA, UUG, CUU, CUC, CUA, and CUG) and actually showed A or T (U) bias in all synonymous codons (Supplementary Table S4). The result of distributions of codon usage in the IRLC species showed that RSCU > 1 was recorded for most codons that ended with an A or a U, except for UUG codon, resulting in the bias for A/T bases. As well as, more codons with the RSCU value less than one, ended with base C or G. So, there is high A/U preference in the third codon of the IR-loss clade coding regions, which is a common phenomenon in cp genomes of higher plants30.
Analysis of repeats
Repeat analysis of the O. gaubae plastome identified 28 repeat structures with lengths ranging from 30 bp to 81 bp. These structures included 12 palindromic repeats with lengths in the range of 33–50 bp, 11 forward repeats of 30–81 bp, four reverse repeats which varied from 30 bp to 31 bp and one complementary repeats with a length of 30 bp (Supplementary Table S5). Among the 28 repeats, 75% are located in the LSC region, 14.28% in the SSC region and 10.71% in the IR region. Also, most of the repeats (60.71%) were found in the intergenic spacer regions (IGS), 25% were distributed in coding region (psaA, psaI, psbJ, trnR-UCU, trnS-UGA) and 14.28% were located in the introns (ndhA and ycf3). In the majority of the studied IRLC species, the most frequently observed repeats were forward, then palindromic, and the least was the reverse (Fig. 2). The forward type was the most abundant repeat with length in the range 30–50 bp in all the IRLC species. The longest repeats were also of the forward type, with length of 560 bp were detected in the Hedysarum taipeicum, followed by Vicia sativa of 517 bp and Caragana microphylla of 455 bp, which were much longer than other species studied. Furthermore, in the IRLC, repeat sequences involved in genome rearrangement, were mainly distributed in non-coding regions (IGS). Repeat structures induce indels and substitutions resulting in the mutation hotspot in the reconfiguration of genome6; therefore, these repeats can provide valuable information for phylogenetic and population studies29.
Simple sequence repeats (SSRs), or microsatellites, are a type of tandem repeat sequences which contain 1–6 nucleotide repeat units and have wide distribution throughout the genome29,31. Accordingly, microsatellites play a crucial role in the genome recombination and rearrangement. These nucleotide motifs show a high level of polymorphism that can be widely used for phylogenetic analysis, population genetics and species authentication29,32. A total of 89 SSRs were detected in the O. gaubae plastome, which were composed by a length of at least 10 bp. Among them, 53 (59.55%) were mono-repeats, 21 (23.55%) were di-repeats, 10 (11.23%) were tri-repeats and five (5.61%) were tetra-repeats. No penta- and hexanucleotide SSRs existed in O. gaubae genome (Supplementary Table S6). These SSR loci were located primarily in the LSC region (68.53%), followed by the SSC region (16.85%) and IR (14.6%) (Fig. 3A). In the mononucleotide repeats, A/T motifs were the most abundant but no G/C motif was detected in the cp genome. Likewise, the majority of the dinucleotides and trinucleotides were found to be particularly rich in AT sequences. Therefore, the AT richness in the SSRs of the chloroplast genome of O. gaubae is consistent with the results of previous studies27,29 which have shown that in the cp genome, SSRs generally composed of polythymine (poly T) or polyadenine (poly A) repeats33. The number of SSRs in the cp genomes (cpSSRs) ranged from 68 (Vicia sativa and Lens culinaris) to 151 (Melilotus albus) across the IRLC species (Fig. 3B). The mononucleotide repeats (P1) were identified at a much higher frequency, which varied from 45 (Tibetia liangshanensis, Glycyrrhiza glabra) to 93 (Melilotus albus). In all cases, the P1s were AT-rich (Fig. 3C). Strong AT bias in SSR loci was also observed in other legumes such as Vigna radiate34, Arachis hypogaea35 and Stryphnodendron adstringens27 which, like other plastomes of species, may contribute to the bias in base composition6. The results showed that SSR loci of LSC regions appeared more frequently than in SSC or IR regions, which may be hypothesized that this phenomenon is relevant to the lack of one IR region in IR-loss clade. In general, cpSSRs show abundant variation and might provide useful information for detecting intra- and interspecific polymorphisms at the population level31,33.
Divergent hotspots in the IRLC chloroplast genomes
The average nucleotide diversity (Pi) among the protein-coding genes of 19 species of the IRLC was estimated to be 0.05736. Furthermore, comparison of nucleotide diversity in the LSC, SSC and IR regions indicated that the IR region exhibits the highest nucleotide diversity (0.11549) and the SSC region shows the least (0.04132). We detected three hyper-variable regions with Pi values > 0.1 among the IRLC species; ycf1 and ycf2 from IR region and clpP from LSC region (Fig. 4). These genes might be undergoing rapid nucleotide substitution in IR-loss clade at the genus and species levels. Among these, ycf1 encoding a protein of 1800 amino acids has the highest nucleotide diversity (0.18745). ycf1 gene is more variable than matK gene and it can be useful for molecular systematics at low taxonomic levels36,37. Numerous studies14,18,38 analyzed the phylogenetic reconstructions of IRLC species at various taxonomic levels based on different fragments of plastid coding genes such as matK, ndhF and rbcL, the nuclear ribosomal ITS and the combined sequences of these genes/spacers. We could use the highly variable regions acquired from this study to develop the potential phylogenetic markers which can be useful for species authentication and reconstruction of phylogeny within different tribes/genera of IR-loss clade in further studies.
The non-synonymous (Ka) to synonymous (Ks) rate ratio (Ka/Ks) was estimated for 75 protein-coding genes across the 28 IRLC species analyzed (Supplementary Table S7). In general, the Ka/Ks values were lower than 0.5 for almost all genes. The ycf4 gene which is involved in regulating the assembly of the photosystem I complex had the highest nonsynonymous rate, 0.165691, while the ycf1 gene with unknown functions had the highest synonymous rate, 0.181067. The Ka/Ks ratio (denoted as ω) is widely used as an estimator of selective pressure for protein-coding genes. An ω > 1 indicates that the gene is affected by positive selection, ω < 1 indicates purifying (negative) selection, and ω close to 1 indicates neutral mutation39. In present study, the Ka/Ks ratio was calculated to be 0 for psbL gene which encodes one of the subunits of photosystem II. The Ka/Ks ratio indicates purifying selection in 73 protein-coding genes. The highest Ka/Ks ratio which indicates positive selection was observed in accD gene which encodes a subunit of the acetyl-CoA carboxylase (ACCase) enzyme. Some studies have investigated whether selective pressure is acting on a particular protein-coding gene in different genera/tribes of IR-loss clade. For instance, tests for positive selection suggested that Lathyrus, Pisum and Vavilovia, all belonging tribe Fabeae, have undergone adaptive evolution in the ycf4 gene15,16.
Legumes chloroplast genome, and in particular IRLC, have regions with high mutation rates, including rps16-accD-psaI-ycf4-cemA region. rps16 gene was lost from cpDNA in the common ancestor of the IR-loss clade15. accD coding region was completely absent in the Trifolium subgenus Trifolium and has nuclear copies in Medicago truncatula and Cicer arietinum25. Three consecutive genes psaI-ycf4-cemA is situated in a local mutation hotspot and has been lost in some species of Lathyrus15,16.
Comparative analysis of genome structure
We compared whole chloroplast genome sequences of different taxa of IRLC to analyze gene order and content. We found that, similar to other plant species, the gene coding regions were more conserved than the noncoding regions (Fig. 5). High nucleotide variations were observed across IRLC for the protein-coding regions ycf1, ycf2 and clpP. Similar results were also obtained from calculation of nucleotide diversity (Pi). Papilionoideae, in particular the IRLC, displays structural variations which provide informative characters to increase phylogenetic resolution and make the taxon an excellent model for genome evolution studies5,25. The plastomes of several members of the IRLC have regions with significant variation and rearrangement and accelerated mutation rates, including loss of introns from rps12 and clpP genes24, absence of rps16 gene26 and transfer/loss of rpl22 to the nucleus24. Numerous studies have also shown some other rearrangements in some IRLC taxa, such as loss of accD gene in six species of Trifolium10,25, loss of rpl23 and rpl33 genes in some species of Lathyrus, Pisum and Vicia32 and loss of ycf4 gene in some species of Lathyrus and Pisum15,16. As revealed in other studies, there are several reasons for occurrence of rearrangements in the plastome, such as the lack of one IR region, variable IR region size and many tandemly repeated sequences40. For example, the loss of the rps16 gene was probably due to the presence of a nuclear rps16 copy, which contributed to pseudogenization of the plastid copy41. Likewise, the lack or expansion of the accD gene was explained by the presence of tandemly repeated sequences6,15.
Plastid RNA editing prediction
RNA editing is one of the post-transcriptional events which converts cytidine (C) to uridine (U) or U to C at specific sites of RNA molecules and modifies the genetic information from the genome in the plastids and mitochondria of land plants. RNA editing serves as a mechanism to correct missense mutations of genes by inserting, deleting and modifying nucleotides in a transcript42. RNA editing sites of O. gaubae plastid genes were predicted using Prep-CP prediction tool (Supplementary Table S8). In total, 58 editing sites were present in 19 chloroplast protein-coding genes and all of the editing sites were C-to-U conversions (Supplementary Table S8). Among them, nine editing sites, the highest number, were found in the region encoding ndhB gene followed by seven editing sites in petB (Fig. 6). There were six editing sites detected each in ndhA and rpoB genes. accD, ndhG and petD had three editing sites, and ndhD and ndhF had two editing sites. Two editing sites were also found in ccsA, matK and rpoC1 genes. The remaining seven genes had only one editing site. The results showed that ndh genes exhibited the most abundant editing sites which were nearly 39.6% of the total editing sites. In flowering plants, the highest number of plastid editing sites was found in the ndh group genes42. Moreover, the ndh genes encoding for a thylakoid Ndh complex, have been lost or pseudogenized in different species of algae, bryophytes, pteridophytes, gymnosperms, monocots, eudicots, magnoliids, and protists43–45. The RNA editing is probably important for the NDH protein complex function and may also lead to improved photosynthesis and display positive selection during evolution42.
Phylogenetic relationships
Phylogenetic relationships within the IRLC were reconstructed using the representative taxa (27 species from different tribes) and two species as outgroup based on 75 protein-coding genes of their chloroplast genomes. The total concatenated alignment length from the 75 protein-coding genes was 87,455 bp. The reconstructed phylogeny is in agreement with previous studies5,14,25,46 indicated that IRLC was monophyletic and consisted of several clades. As shown in the previous studies, Glycyrrhiza + Meristotropis were monophyletic, along with the tribe Wisterieae was sister to the rest of the IRLC19,47. Then, there are two major clades: clade I and II (Fig. 7). Clade I comprises tribes Caraganeae17, Hedysareae20 and Coluteae18 as well as genera Oxytropis and Astragalus. Our results showed a close relationship between O. gaubae and O. viciifolia and Hedysarum species and confirmed O. gaubae phylogenetic position in the tribe Hedysareae. Furthermore, our plastid DNA analyses which are consistent with the previous study46, show that Oxytropis is sister to the tribe Coluteae. In clade II, tribe Cicereae is the basal branch and formed a sister group relationship with the paraphyletic Trifolieae and the monophyletic tribe Fabeae. The results of the present study suggest that there is no conflict between the phylogeny made by whole cp genome and that inferred by individual gene datasets. Therefore, a phylogenetic reconstruction for IR-loss clade species studied here showed that plastid genome database will be a helpful resource for molecular phylogeny at the higher taxonomic level (generic to tribal rank).