Characteristics of the chloroplast genome of O. gaubae
The number of paired-end raw reads obtained by the Illumina HiSeq 2000 system is 43,189,861 for O. gaubae sample. The plastid genome of O. gaubae with 122,873 bp in length and having only one copy of the IR region, which is in accordance with the reports and its genome structure is similar to those of other IRLC species. In this context, the lack of rps16 and rpl22 genes and intron 1 of clpP in the plastome of O. gaubae should be noted; these genes, are absent from the chloroplast genomes of entire IRLC24-26. The assembled chloroplast genome of O. gaubae contained 110 genes, including 76 protein-coding genes, 30 transfer RNA (tRNA) genes and four ribosome RNA (rRNA) genes (Fig. 1, Table 1). The LSC (81,066 bp), SSC (13,777 bp) and IR (28,802 bp) regions along with the locations of 110 genes in the chloroplast genome are shown in Fig. 1.
Table 1
Genes predicted in the chloroplast genome of O. gaubae.
Category of genes
|
Group of genes
|
Name of genes
|
Self-replication
|
Large subunit of ribosomal proteins
|
rpl14, rpl16*, rpl2*, rpl20, rpl23, rpl32, rpl33, rpl36
|
|
Small subunit of ribosomal proteins
|
rps2, rps3, rps4, rps7, rps8, rps11, rps12*, rps14, rps15, rps18, rps19
|
|
DNA-dependent RNA polymerase
|
rpoA, rpoB, rpoC1*,rpoC2
|
|
Ribosomal RNA genes
|
rrn16S, rrn23S, rrn 4.5S, rrn 5S
|
|
Transfer RNA genes
|
30 trn genes (5 contain an intron)
|
Genes for photosynthesis
|
Subunits of NADH-dehydrogenase
|
ndhA*, ndhB*, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK
|
|
Subunits of photosystem I
|
psaA, psaB, psaC, psaI, psaJ
|
|
Subunits of photosystem II
|
psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ
|
|
Subunits of cytochrome b/f complex
|
petA, petB*, petD*, petG, petL, petN
|
|
Subunits of ATP synthase
|
atpA, atpB, atpE, atpF*, atpH, atpI
|
|
Subunit of rubisco
|
rbcL
|
Other genes
|
Maturase K
|
matK
|
|
Envelope membrane protein
|
cemA
|
|
Subunit of Acetyl-CoA-carboxylase
|
accD
|
|
C-type cytochrome synthesis gene
|
ccsA
|
|
Protease
|
clpP*
|
Genes of unkown function
|
Conserved hypothetical chloroplast
open reading frames
|
ycf1, ycf2, ycf4, ycf3**
|
The number of asterisks after the gene names indicates the number of introns contained in the genes. |
A total of 16 genes contained a single intron, whereas ycf3 exhibits two introns (Additional File 1: Table S1). The rps12 gene is a trans-splicing gene which does not have introns in the 3'-end. The trnK-UUU has the largest intron encompassing the matK gene, with 2,495 bp, whereas the intron of trnL-UAA is the smallest (542 bp). The O. viciifolia (the only other species of Onobrychis which cp genome has been sequenced) plastome with 121,932 bp in length showed similar conserved gene contents, order and orientations to O. gaubae. The most important structural differences in the O. viciifolia plastid genome compared to the O. gaubae are the lack of the atpF intron and inversion of ycf2 gene.
The length of plastome in the IRLC taxa in this study ranging from 121,020 to 130,561 bp. All plastomes exhibited the typical structure of IR-loss clade composed of LSC region (79,916 to 87,193), SSC region (13,383 to 14,187) and only one inverted repeat region (27,604 to 30,487) (Table 2).
Table 2
Chloroplast genome information from sampled IRLC species and the newly assembled O. gaubae.
Species
|
Size (bp)
|
LSC (bp)
|
GC (%)
(LSC)
|
SSC (bp)
|
GC (%)
(SSC)
|
IR (bp)
|
GC (%)
(IR)
|
GC (%)
Total
|
Astragalus mongholicus
|
123,582
|
80,986
|
33.4%
|
13,773
|
29.9%
|
28,823
|
38.1%
|
34.1%
|
Caragana microphylla
|
130,029
|
85,436
|
33.3%
|
14,106
|
30.4%
|
30,487
|
38.8%
|
34.3%
|
Carmichaelia australis
|
122,805
|
80,588
|
33.5%
|
14,074
|
30.2%
|
28,143
|
38.6%
|
34.3%
|
Cicer arietinum
|
125,319
|
82,583
|
33%
|
13,820
|
29.9%
|
28,916
|
38.3%
|
33.9%
|
Galega officinalis
|
125,086
|
82,915
|
33.2%
|
13,347
|
30.5%
|
28,824
|
38.7%
|
34.2%
|
Glycyrrhiza glabra
|
127,943
|
84,714
|
33.1%
|
14,187
|
30.1%
|
29,042
|
39.6%
|
34.2%
|
Hedysarum semenovii
|
123,407
|
80,288
|
34.1%
|
13,679
|
30.5%
|
29,440
|
38.9%
|
34.9%
|
Lens culinaris
|
122,967
|
81,659
|
33.7%
|
13,833
|
30.2%
|
27,604
|
38.7%
|
34.4%
|
Lessertia frutescens
|
122,700
|
80,698
|
33.4%
|
13,750
|
29.9%
|
28,252
|
38.4%
|
34.2%
|
Medicago sativa
|
125,330
|
83,756
|
32.9%
|
13,383
|
30.2%
|
28,191
|
38.6%
|
34%
|
Melilotus albus
|
127,205
|
84,279
|
32.7%
|
13,806
|
29.8%
|
29,120
|
38.1%
|
33.6%
|
Meristotropis xanthioides
|
127,735
|
84,629
|
33.1%
|
14,150
|
30.1%
|
28,956
|
39.6%
|
34.2%
|
Onobrychis gaubae
|
122,873
|
79,968
|
33.7%
|
13,805
|
30.5%
|
29,100
|
38.8%
|
34.5%
|
Onobrychis viciifolia
|
121,932
|
78,986
|
33.8%
|
13,821
|
30.4%
|
29,125
|
38.8%
|
34.6%
|
Oxytropis bicolor
|
122,461
|
80,170
|
33.5%
|
14,017
|
30%
|
28,274
|
38.3%
|
34.2%
|
Tibetia liangshanensis
|
123,372
|
79,916
|
33.9%
|
13,513
|
30.6%
|
29,943
|
38.6%
|
34.7%
|
Wisteria floribunda
|
130,561
|
87,193
|
33.2%
|
14,127
|
30%
|
29,628
|
39.4%
|
34.4%
|
LSC: Large Single Copy, SSC: Small Single Copy, IR: Inverted Repeat. |
Gene order and gene/intron content in plastomes of all IRLC taxa are highly conserved. The overall GC content of the O. gaubae chloroplast genome sequence was 34.5%, which is consistent with other IRLC species, whose plastomes have GC-contents ranging from 33.6% to 35.1% (Table 2). Different GC content occurs in the LSC (32. 7% - 34.1%), SSC (29.8% - 30.6%) and IR (38.1% - 39.6%) regions (Table 2).
Codon usage bias
The total coding DNA sequences (CDSs) were 81,121 bp in length and encoded 75 genes including 24,765 codons which belonged to 61 codon types. Codon usage was calculated for the protein-coding genes present in the O. gaubae cp genome. Phenylalanine was the most abundant amino acid, whereas Alanine showed the least abundance in this species (Additional File 1: Table S2). Most protein-coding genes employ the standard ATG as the initiator codon. Among the O. gaubae protein-coding genes, three genes used alternative start codons; ACG for psbL and ndhD, and GTG for rps8. Similar codon usage pattern was exhibited in the O. viciifolia species (Additional File 1: Table S3).
The chloroplast genomes of the IRLC were analyzed for their codon usage frequency according to sequences of protein-coding genes and relative synonymous codon usage (RSCU). RSCU is an important indicator to measure codon usage bias in coding regions. This value is the ratio between the actual observed values of the codon and the theoretical expectations. If RSCU=1, codon usage is unbiased; if RSCU>1, specific codon frequency is higher than other synonymous codons, otherwise, the frequency is low27,28. The total number of codons among protein-coding genes in the IRLC species varies from 20,381 codons in Hedysarum taipeicum (as the smallest number) to 24,765 codons in O. gaubae. The most often used synonymous codon was AUU, encoding isoleucine, and the least used was CGC/CGG, encoding arginine (Additional File 2: Table S4). In the IRLC, the standard AUG codon was usually the start codon for the majority of protein-coding genes and UAA was the most frequent stop codon among three stop codons. Methionine (AUG) and tryptophan (UGG) showed RSCU = 1, indicating no codon bias for these two amino acids. The highest RSCU value was for UUA (~2.04) in leucine and the lowest was GGC (~0.35) in glycine. Leucine preferred six codon types (UUA, UUG, CUU, CUC, CUA, and CUG) and actually showed A or T (U) bias in all synonymous codons (Additional File 2: Table S4). The result of distributions of codon usage in the IRLC species showed that RSCU>1 was recorded for most codons that ended with an A or a U, except for UUG codon, resulting in the bias for A/T bases. As well as, more codons with the RSCU value less than one, ended with base C or G. So, there is high A/U preference in the third codon of the IR-loss clade coding regions, which is a common phenomenon in cp genomes of higher plants29.
Analysis of repeats
Repeat analysis of the O. gaubae plastome identified 50 repeat structures with lengths ranging from 30 bp to 179 bp. These structures included 29 forward repeats with lengths in the range of 30-179 bp, 19 palindromic repeats of 30-81 bp and two reverse repeats with a length of 31 and 37 bp (Additional File 3: Table S5). Among the 50 repeats, 66% are located in the LSC region, 18% in the IR region and 16% in the SSC region. Also, most of the repeats (42%) were found in coding region (accD, psaA, psaB, psbC, psbJ, ycf1, ycf2, ycf4, rps12, trnR-UCU, trnK-UUU), 40% were distributed in the intergenic spacer regions (IGS) and 18% were located in the introns (ndhA, rpl16, rps12, petB, ycf3). The pattern of repeat structures (both in frequency and location) in O. gaubae is similar to that of O. viciifolia (Additional File 3: Table S6). In the majority of the studied IRLC species, the most frequently observed repeats were forward, then palindromic, and the least was the reverse (Fig. 2).
The forward type was the most abundant repeat with length in the range 30-50 bp in all the IRLC species. The longest repeats were also of the forward type, with length of 560 bp were detected in the Hedysarum taipeicum, followed by Vicia sativa of 517 bp and Caragana microphylla of 455 bp, which were much longer than other species studied.
Simple sequence repeats (SSRs), or microsatellites, are a type of tandem repeat sequences which contain 1-6 nucleotide repeat units and have wide distribution throughout the genome28,30. Accordingly, microsatellites play a crucial role in the genome recombination and rearrangement. These nucleotide motifs show a high level of polymorphism that can be widely used for phylogenetic analysis, population genetics and species authentication28,31. A total of 83 SSRs were detected in the O. gaubae plastome, which were composed by a length of at least 10 bp. Among them, 47 (56.62%) were mono-repeats, 24 (28.91%) were di-repeats, 6 (7.22%) were tri-repeats, five (6.02%) were tetra-repeats and one were penta-repeats (1.2%). No hexanucleotide SSRs existed in O. gaubae genome (Additional File 3: Table S7). Onobrychis viciifolia with 101 SSRs including 50 mono-repeats (49.5%), 30 (29.7%) di-repeats, nine (8.91%) tri-repeats, 11 (10.89%) tetra-repeats and one penta-repeat (0.99%), exhibited similar SSR distribution pattern in the plastome (Additional File 3: Table S8). The number of SSRs in the IRLC cp genomes (cpSSRs) ranged from 68 (Vicia sativa and Lens culinaris) to 151 (Melilotus albus) across the IRLC species (Fig. 3A). The mononucleotide repeats (P1) were identified at a much higher frequency, which varied from 45 (Tibetia liangshanensis, Glycyrrhiza glabra) to 93 (Melilotus albus) (Fig. 3B).
In the mononucleotide repeats, A/T motifs were the most abundant but no G/C motif was detected in the cp genome. Likewise, the majority of the dinucleotides and trinucleotides were found to be particularly rich in AT sequences. Therefore, the AT richness in the SSRs of the chloroplast genome of O. gaubae is consistent with the results of previous studies28,32 which have shown that in the cp genome, SSRs generally composed of polythymine (poly T) or polyadenine (poly A) repeats33.
Sequence divergence analysis
The average nucleotide diversity (Pi) among the protein-coding genes of 23 species of the IRLC was estimated to be 0.05736. Furthermore, comparison of nucleotide diversity in the LSC, SSC and IR regions indicated that the IR region exhibits the highest nucleotide diversity (0.11549) and the SSC region shows the least (0.04132). We detected three hyper-variable regions with Pi values>0.1 among the IRLC species; ycf1 and ycf2 from IR region and clpP from LSC region (Fig. 4).
These genes might be undergoing rapid nucleotide substitution in IR-loss clade at the genus and species levels. Among these, ycf1 encoding a protein of 1800 amino acids has the highest nucleotide diversity (0.18745). The average nucleotide diversity was also investigated between two Onobrychis plastid genome sequences. The average value of Pi between the Onobrychis species was estimated to be 0.05632 (Additional File 4: Fig. S1). High nucleotide variations were observed for the protein-coding regions ycf1, ycf2, clpP, accD and ycf4 and intergenic regions such as trnL(UAA)-trnT(UGU) and trnT(GGU)-trnE(UUC). Sliding window analysis results revealed the same variable regions in the cp genome of the two Onobrychis species.
Moreover, we compared whole chloroplast genome sequences of different taxa of IRLC to analyze gene order and content with mVISTA. We found that, similar to other plant species, the gene coding regions were more conserved than the noncoding regions (Additional File 5: Fig. S2). High nucleotide variations were observed across IRLC for the protein-coding regions ycf1, ycf2 and clpP. Similar results were also obtained from calculation of nucleotide diversity (Pi).
Ka/Ks analysis
In this study, the non-synonymous (Ka) to synonymous (Ks) rate ratio (Ka/Ks) was estimated for 75 protein-coding genes across the 28 IRLC species analyzed (Additional File 6: Table S9). In general, the Ka/Ks values were lower than 0.5 for almost all genes. The ycf4 gene which is involved in regulating the assembly of the photosystem I complex had the highest nonsynonymous rate, 0.165691, while the ycf1 gene with unknown functions had the highest synonymous rate, 0.181067. The Ka/Ks ratio (denoted as ω) is widely used as an estimator of selective pressure for protein-coding genes. An ω > 1 indicates that the gene is affected by positive selection, ω < 1 indicates purifying (negative) selection, and ω close to 1 indicates neutral mutation34. In the present study, the Ka/Ks ratio was calculated to be 0 for psbL gene which encodes one of the subunits of photosystem II. The Ka/Ks ratio indicates purifying selection in 73 protein-coding genes. The highest Ka/Ks ratio which indicates positive selection was observed in accD gene which encodes a subunit of the acetyl-CoA carboxylase (ACCase) enzyme.
Prediction of RNA editing sites
RNA editing as a post-transcriptional modification process, mainly occurs in chloroplasts and mitochondrial genomes. In higher plants, some chloroplast RNA editing sites which provide a way to create transcript and protein diversity are conserved28.
RNA editing sites of O. gaubae plastid genes were predicted using Prep-CP prediction tool (Additional File 7: Table S10). In total, 58 editing sites were present in 19 chloroplast protein-coding genes and all of the editing sites were C-to-U conversions (Additional File 7: Table S10). Among them, nine editing sites, the highest number, were found in the region encoding ndhB gene followed by seven editing sites in petB. There were six editing sites detected each in ndhA and rpoB genes. accD, ndhG and petD had three editing sites, and ndhD and ndhF had two editing sites. Two editing sites were also found in ccsA, matK and rpoC1 genes. The remaining seven genes had only one editing site. The results showed that ndh genes exhibited the most abundant editing sites which were nearly 39.6% of the total editing sites. Furthermore, we predicted 65 RNA editing sites out of 22 plastid genes in chloroplast genomes of O. viciifolia. In this species, the highest number of editing sites belongs to the petB, rpoC1 and ndhB genes with 9, 8 and 7 sites, respectively (Additional File 7: Table S11). In these two Onobrychis species, the amino acid conversion from S to L occurred most frequently, while R to W occurred least.
Phylogenetic analysis
Phylogenetic relationships within the IRLC were reconstructed using the representative taxa (28 species from different tribes) and two species as outgroup based on 75 protein-coding genes of their chloroplast genomes. The total concatenated alignment length from the 75 protein-coding genes was 87,455 bp. The maximum likelihood (ML) analysis resulted in a well-resolved tree and the Bayesian inference yielded a well-resolved topology with high support values (Fig. 5).
The ML and Bayesian trees were largely congruent. The reconstructed phylogeny is in agreement with previous studies5,14,25,35 indicated that IRLC was monophyletic and consisted of several clades. As shown in the previous studies, Glycyrrhiza + Meristotropis were monophyletic, along with the tribe Wisterieae was sister to the rest of the IRLC19,36. Then, there are two major clades: clade I and II (Fig. 5). Clade I comprises tribes Caraganeae17, Hedysareae20 and Coluteae18 as well as genera Oxytropis and Astragalus. Our results showed a close relationship between O. gaubae and O. viciifolia and Hedysarum species and confirmed O. gaubae phylogenetic position in the tribe Hedysareae. Furthermore, our plastid DNA analyses which are consistent with the previous study35, show that Oxytropis is sister to the tribe Coluteae. In clade II, tribe Cicereae is the basal branch and formed a sister group relationship with the paraphyletic Trifolieae and the monophyletic tribe Fabeae.