Features of the T. alexandrium and T. resupinatum cp genomes
More than 20 million ReadSum (pair-end reads) were yielded for T. alexandrium and T. resupinatum, with the Q20 and Q30 (the percentage of bases whose mass value is greater than or equal to 20, 30) were higher than 94% and 87%, respectively. We assembled them successfully based on the alignment of paired-end sequences to the reference of T. medeseum (Fig 1). The cp genomes of T. alexandrium and T. resupinatum were detected with IR lacking and have a size of 148,545 bp and 149,026 bp, respectively (Table 1). The GC content in the two cp genomes was about 34.09% and 33.80% overall, and 37.05% and 36.64% in coding sequences (CDS). A total of 112 and 109 genes were consisted in the complete cp genomes of T. alexandrinum and T. resupinatum, which were contained 31 and 37 tRNA, 75 and 66 mRNA, and 6 rRNA, and 13 and five genes possessing introns, respectively.
There were 46 genes related to photosynthesis in cp genomes of T. alexandrinum and T. resupinatum (Table 2), of which four genes psbN, atpF, ndhA and ndhB were specific for T. alexandrinum. These genes include the ones encoding subunit of Rubisco, subunits of photosystem I, subunits of photosystem II, subunits of ATP synthase, cytochrome b/f complex, c-type cytochrome synthesis and subunits of NADH dehydrogenase. Thirty-one genes were related to self-replication, including four ribosomal RNA genes and 27 transfer RNA genes, in which trnT-CGU was unique in T. alexandrinum. Besides, ten genes encoding ribosomal proteins and twelve were associated with transcription. Among them, rps18, rpl2 and rpoC1 were unique in T. alexandrinum. Furthermore, two genes clpP and ycf3 with other functions were particular for T. alexandrinum (Additional file1: Table S1).
Introns are not subject to natural selection thus theoretically accumulate more mutations than exons. In this study, a total of seven genes (atpF, clpP, ndhA, ndhB, rpoC1, rps18 and tRNA-CGU) only contained an intron in T. alexandrinum (Additional file1: Table S1). Other five genes tRNA-UAA, tRNA-UAC, tRNA-UGC, tRNA-UUC and tRNA-UUU all had an intron in T. alexandrinum and T. resupinatum. The exons length of those five genes was more conserved compared with intron. In particular, ycf3 had two introns in T. alexandrinum.
Repeats analysis
Scattered repeating sequences (palindrome repeats and direct repeats) and simple sequencing repeats (SSRs) were analyzed respectively. Over all, percentage of repetitive sequences in IR containing species (2.035%, Table 1) was less than those IR lacking species (11.956%). A total of 1941 scattered repetitive sequences in T. alexandrinum cp genome were annotated, which was greater than T. resupinatum (1250). The percentages of palindrome repeats (type P, 50.49%, Fig 2B) of T. alexandrinum were slightly larger than T. resupinatum (46.4%). A total of 370 (Fig 2A) and 383 SSRs (sizes ranged from 8 – 81 bp and 8 – 36 bp) were predicted in T. alexandrinum and T. resupinatum and 30.54% and 23.24% of them were distributed in genic regions. In particular, the majority of SSRs were located in ycf1 (18 for T. alexandrinum and T. resupinatum), followed by rpoC2 (11 for T. resupinatum and 9 for T. alexandrinum). The mononucleotide repeats were dominant (65.41% in T. alexandrinum and 74.93% in T. resupinatum), followed by trinucleotide repeats (25.68% in T. alexandrinum and 22.19% in T. resupinatum), in which the repeats of the polyadenine (poly A, 37.34% for T. resupinatum and 35.95% for T. alexandrinum) and polythymine (poly T, 36.55% for T. resupinatum and 37.84% for T. alexandrinum) were much more than guanine (G) or cytosine (C) repeats (less than 1.35%). A total of 24 SSRs were identified to be shared by T. alexandrinum and T. resupinatum (Additional file2: Table S2; Fig 2A). The common repeat sequences larger than 30 bp with the longest length of 117 bp was showed in Fig 2C.
The relative synonymous codon usage analysis
The relative synonymous codon usage analysis (RSCU), which is considered to be a combination result of natural selection, species mutation and genetic drift, was analyzed (Fig 3; Additional file3: Table S3). The value for initiation codon AUG (RSCU = 2.9745) was much greater than GUG (RSCU = 0.0129). The values of three termination codons UAA, UAG and UGA were 1.6215, 0.5676 and 0.8109. A total of 46.97% (31 of 66, include three termination codons) of the codons were with the greater codon frequency (RSCU values more than one), in which 93.55% (29 of 31) were prefer A or U in the third sites. In the other codons with RSCU values less than one (including one), C or G were more general in the third position (88.57%, 31 of 35).
Ka/Ks, single nucleotide polymorphisms (SNPs) and insertions/deletions (In/Dels)
Single nucleotide polymorphisms (SNPs), mainly including transversion (Tv) and transition (Tn), along with insertions/deletions (In/Dels) could lead the non-synonymous (Ka) or synonymous (Ks) substitution. The total SNPs and In/Dels in every gene varied from 1 (ndhE and psaC) to 677 (atpB) with the total of 8560. Additionally, more In/Dels, Tn and Tv were detected in intergenic regions (5.66%, 17.11% and 38.70%) than genic regions (3.05%, 10.40% and 25.08%) (Fig 4; Additional file4: Table S4). The 62 shared protein-coding genes with variations were used to calculate the Ka/Ks (Additional file5: Table S5). The values of Ka and Ks were ranged from 0 (ndhE, petD, psaI, psbA, psbB and so on) to 3.0151 (rps4) and 0 (petG, petN, trnR-ACG, trnL-CAA, trnI-CAU, trnI-CAU and so on) to 2.9415 (rps8) and Ka/Ks varied from 0 (ndhE, psbZ, psbA, psbJ and so on) to 3.7723 (rps4, Fig 5), respectively. Nine genes including rpoC2, ndhG, trnK-UUU, ccsA, ndhF, ycf1, rps4, psaC and rrn4.5 have Ka/Ks values above one, implying positive selection on these genes. The Pi values calculated by 88 common genes of T. alexandrinum and T. resupinatum were from 0 to 0.7867 (trnl-CAU). Twenty-one genes had a Pi values of 0. Ninteen of them were tRNA. The nine genes with Ka/Ka above one also possessed relatively high Pi values (Fig 6; Additional file6: Table S6).
Whole-cp genome comparison with other Trifolium species
In order to excavate the sequence divergence of Trifolium genus and further shed light on the evolutionary events, such as pseudogenization, gene mutation, rearrangement and gene loss, cp genomes of ten species including four IR containing species (T. aureum, T. boissieri, T. glanduliflerum and T. strictum) and six IR lacking species (T. alexandrinum, T. resupinatum, T. repens, T. pratense, T. subterraneum and T. meduseum) were compared. The results indicated that the average size of cp genomes of these IR lacking species (126173.25 bp, Table 1) and genic density (7.63E-04) were lower than the ones containing IR (139704.5 bp, 8.70E-04), and the latter held higher mean GC content of whole cp genome (35.00%). Only minor variations were detected in the total numbers of genes, tRNA and mRNA among the selected species. T. pratense possessed the smallest numbers of tRNA (28), mRNA (58) and total number of genes (90).
Furthermore, abundant gene rearrangements were detected among ten Trifolium species using MAUVE program and the T. subterraneum as the reference sequence (Fig 7). Compared with CDS, non-coding sequences (CNS) showed most significant variation among selected Trifolium species (Fig 8). In other words, sequences variation in genic regions was lower than intergenic regions in the cp genomes of those ten species.
Phylogenetic divergence time estimation
The CDS of 76 genes shared in cp genomes of the 20 species (18 of Papilionoideae, one of Caesalpinioideae and one of Mimosaceae) were subjected to phylogeny analysis and divergence times estimation (Fig 9). The topological structure of phylogenetic tree was almost consistent with the classification of Leguminosae with strong bootstrap support. Three subfamilies Papilionoideae, Caesalpinioideae and Mimosaceae were clearly separated. It is worth noting that Glycine max and Lotus japonicus, belonging to Papilionoideae, were evolutionally grouped with Ceratonia siliqua (Caesalpinioideae) and Albizia odoratissima (Mimosaceae). Trifolium species split from Medicago species during the Early Cretaceous (127.2288 Mya). It seems that during the Late Cretaceous (83.5049 Mya), the IR lacking Trifolium species diverged with IR containing Trifolium species.