The chloroplast genome sequencing of two important annual Trifolium species T. alexandrinum and T. resupinatum and comparative analysis with other congeneric species

Background: Chloroplast (cp) genome of most plant species has two typical inverted-repeats (IRs) regions. However, in some species this IR structure is lost for unknown reasons and the consequence still needs to be revealed . Here, we present whole cp genome sequencing of Trifolium alexandrinum (Egyptian clover) and T. resupinatum (Persian clover) from the IR lacking clade (IRLC) . Results: Global aligning of T. alexandrinum and T. resupinatum to other eight Trifolium species revealed a large amount of rearrangement and repetitive events in these ten species. We found that IR lacking species have lower GC content and higher percentage of repetition than IR containing species. Abundant single nucleotide polymorphisms (SNPs) and insertions/deletions (In/Dels) were discovered between those two species. As hypothetical cp open reading frame (ORF) and RNA polymerase subunits severally, two genes ycf1 and rpoC2 in the cp genomes, which both contain vast repetitive sequences and high Pi values (0.6656, 0.455) between T. alexandrinum and T. resupinatum , possessed highly variation among ten Trifolium species. Thus they could greatly influence evolutionary process of Trifolium species. In addition, IR containing and IR lacking Trifolium species were estimated to split during the upper Cretaceous period, which was potentially related to the violent crustal movement and sea-land changes. Conclusions: Cp genomes of T. alexandrinum and T. resupinatum , which belong to IRLC were sequenced and annotated in present study, and compared with cp genomes of other eight Trifolium species reported previously. This valuable information will provide insight into the evolution of IR lacking species. Nevertheless, further investigating of the detailed reason of IR lacking is still challenging, but it may be related to the violent crustal movement and sea-land changes of the Cretaceous period presented in this study.

been reported yet. Phylogenetic relationships between IR containing and IR lacking species of Trifolium were well estimated using 58 protein-coding genes in cp genomes [10]. However, T. alexandrinum and T. resupinatum have not been studied yet. Variation among different species could provide a fascinating glimpse into the understanding of plant biology and diversity [7].
Here, cp genomes of T. alexandrinum and T. resupinatum were sequenced and annotated using next generation sequencing (NGS). We compared the sequence differences caused by nucleic acid polymorphism (Pi), In/Del and repetitive sequences, as well as the evolution pressure reflected by non-synonymous/synonymous (Ka/Ks) between these two species. Furthermore, these two species were compared with other eight (four IR containing species and four IR lacking species of Trifolium) congeneric species and divergent times were estimated. This study provides insights into evolution of IR lacking cp genomes.

Features of the T. alexandrium and T. resupinatum cp genomes
More than 20 million ReadSum (pair-end reads) were yielded for T. alexandrium and T. resupinatum, with the Q20 and Q30 (the percentage of bases whose mass value is greater than or equal to 20,30) were higher than 94% and 87%, respectively. We assembled them successfully based on the alignment of paired-end sequences to the reference of T. medeseum (Fig 1). The cp genomes of T. alexandrium and T. resupinatum were detected with IR lacking and have a size of 148,545 bp and 149,026 bp, respectively ( Table 1). The GC content in the two cp genomes was about 34.09% and 33.80% overall, and 37.05% and 36.64% in coding sequences (CDS). A total of 112 and 109 genes were consisted in the complete cp genomes of T. alexandrinum and T. resupinatum, which were contained 31 and 37 tRNA, 75 and 66 mRNA, and 6 rRNA, and 13 and five genes possessing introns, respectively. There were 46 genes related to photosynthesis in cp genomes of T. alexandrinum and T. resupinatum (Table 2), of which four genes psbN, atpF, ndhA and ndhB were specific for T. alexandrinum. These genes include the ones encoding subunit of Rubisco, subunits of photosystem I, subunits of photosystem II, subunits of ATP synthase, cytochrome b/f complex, c-type cytochrome synthesis and subunits of NADH dehydrogenase. Thirty-one genes were related to self-replication, including four ribosomal RNA genes and 27 transfer RNA genes, in which trnT-CGU was unique in T. alexandrinum.
Besides, ten genes encoding ribosomal proteins and twelve were associated with transcription.
Among them, rps18, rpl2 and rpoC1 were unique in T. alexandrinum. Furthermore, two genes clpP and ycf3 with other functions were particular for T. alexandrinum (Additional file1: Table S1).
Introns are not subject to natural selection thus theoretically accumulate more mutations than exons.
The exons length of those five genes was more conserved compared with intron. In particular, ycf3 had two introns in T. alexandrinum. Note: * , Genes containing a single intron; ** , Genes containing two introns; (ale), Genes that are particular for Trifolium alexandrinum; */ale , Genes that only have an intron in Trifolium alexandrinum.

Repeats analysis
Scattered repeating sequences (palindrome repeats and direct repeats) and simple sequencing repeats (SSRs) were analyzed respectively. Over all, percentage of repetitive sequences in IR containing species (2.035%, Table 1) was less than those IR lacking species (11.956%). A total of 1941 scattered repetitive sequences in T. alexandrinum cp genome were annotated, which was greater than T. resupinatum (1250). The percentages of palindrome repeats (type P, 50.49%, Fig 2B) of T. alexandrinum were slightly larger than T. resupinatum (46.4%). A total of 370 (Fig 2A)  resupinatum and 37.84% for T. alexandrinum) were much more than guanine (G) or cytosine (C) repeats (less than 1.35%). A total of 24 SSRs were identified to be shared by T. alexandrinum and T.
resupinatum (Additional file2: Table S2; Fig 2A). The common repeat sequences larger than 30 bp with the longest length of 117 bp was showed in Fig 2C.

The relative synonymous codon usage analysis
The relative synonymous codon usage analysis (RSCU), which is considered to be a combination result of natural selection, species mutation and genetic drift, was analyzed (Fig 3; Additional file3: Table   S3). The value for initiation codon AUG (RSCU = 2.9745) was much greater than GUG (RSCU =    Table S6).

Whole-cp genome comparison with other Trifolium species
In order to excavate the sequence divergence of Trifolium genus and further shed light on the evolutionary events, such as pseudogenization, gene mutation, rearrangement and gene loss, cp genomes of ten species including four IR containing species (T. aureum, T. boissieri, T. glanduliflerum and T. strictum) and six IR lacking species (T. alexandrinum, T. resupinatum, T. repens, T. pratense, T. subterraneum and T. meduseum) were compared. The results indicated that the average size of cp genomes of these IR lacking species (126173.25 bp, Table 1) and genic density (7.63E-04) were lower than the ones containing IR (139704.5 bp, 8.70E-04), and the latter held higher mean GC content of whole cp genome (35.00%). Only minor variations were detected in the total numbers of genes, tRNA and mRNA among the selected species. T. pratense possessed the smallest numbers of tRNA (28), mRNA (58) and total number of genes (90).
Furthermore, abundant gene rearrangements were detected among ten Trifolium species using MAUVE program and the T. subterraneum as the reference sequence (Fig 7). Compared with CDS, non-coding sequences (CNS) showed most significant variation among selected Trifolium species (Fig   8). In other words, sequences variation in genic regions was lower than intergenic regions in the cp genomes of those ten species.

Phylogenetic divergence time estimation
The CDS of 76 genes shared in cp genomes of the 20 species (18 of Papilionoideae, one of Caesalpinioideae and one of Mimosaceae) were subjected to phylogeny analysis and divergence times estimation (Fig 9). The topological structure of phylogenetic tree was almost consistent with the

Base mutation of T. alexandrinum and T. resupinatum to IR containing species
Point mutation was generally more common than frame shift for natural mutation [19]. As expected, more SNPs (21963, 6618 Tn and 15345 Tv; Additional file4: Table S4) than In/Del (2097) were found between T. alexandrinum and T. resupinatum. What's more, 60% of them occurred in intergenic regions, which was consistent with the conservatism of CDS displayed in mVISTA (Fig 6). This discovery was in agreement with the hypothesis that CDS had a slower rate of evolution compared with CNS [20]. Furthermore, minuscule SNPs (159 between Oryza sativa and O. nivara [21], 330 between Citrus sinensis and C. aurantiifolia [22] and 231 between Machilus yunnanensis and M. balansae [23]) were identified in IR containing species, which were exceptionally smaller than the SNPs between two IR lacking species T. alexandrinum and T. resupinatum calculated in present study.
As an important structure in stabilizing cp genome, IR region can hold from deviating by selective force [24]. Thus the observed abundant SNPs/Indels in T. alexandrinum and T. resupinatum are not surprising.
The comparison between the Ka and Ks of genes is an important content of molecular evolution [25].
Most genes were subjected to the neutral selection and purification selection, however, there are also limited genes whose rate of Ka is higher than that of Ks because the function of the gene has been dramatically changed, called Darwinian positive selection [26]. Lacking one IR region is believed to directly enhance the nucleotide substitution rate of the single repeat sequence. Previous studies have shown that in the IR lacking cp genome, the nucleotide substitution rate in the remaining repeat region is comparable to that of the single repeat region, which is 2.3 times higher than that in the IR containing cp genome [27]. Here, seven protein-coding genes in the cp genome of T. alexandrinum rps4 [28] and rpoC2 [29] have been reported to be under positive selection in previous studies.
However, beneficial mutations might be fixed in those genes and, thus, reduce genetic polymorphism at selected sites [30].

Global alignment between IR containing and IR lacking Trifolium species
Comparing the cp genomes of T. alexandrinum and T. resupinatum sequenced in present study to other eight (four IR containing species) Trifolium species showed relatively high conserved genome length and gene content though T. pretense possessed only 90 genes (Table 1). Furthermore, the high average GC content, which forms a more stable structure of genome, were observed in four IRs containing species (T. strictum, T. aureum, T. boissieri and T. glanduliferum) compared with IR lacking species (Table 1). According to Millen et al. [31], the vast of angiosperms held in shared 74 codingprotein genes but other five genes (accD, ycf1, ycf2, rpl23 and infA) were only existed in some specific species. Ycf1, with the premature stop codons in the CDS thus always be defined as pseudogene or ORF in other angiosperm [32], was present in all the ten Trifolium species and with the biggest value of Pi (0.6656; Additional file6: Table S6) among 76 common genes in ten species. This is closely related to the fact that most pseudogenes have undergone the processes of accelerated mutation rate, decreased GC content, and decreased secondary structure stability [33]. Therefore, ycf1 possesses considerable variation among different species thus could be considered as the good candidate gene for phylogenic study among Trifolium species.
As the driving force of evolution, repetitive sequences indicate that the genetic material of a species is continuously self-replicating in the process of evolution, thus greatly expanding and enriching the genetic information [34]. Present study revealed higher repetitive percentage (11.96%, Table 1) in IR lacking species than IR containing species (2.04%). The number of repeated sequences in cp genome are associated with rearrangement in some species [35]. However, the driving force of repetitive sequence was seemly related to nuclear genes and genomic recombination [10], such as homologous recombination and microhomology-mediated break-induced replication acting on more than 50 bp and less than 30 bp repeats, respectively. Known as "hotspots" for variation [36], ycf1 and rpoC2, highly varied among ten Trifolium species (Fig 8), possessed the majority of repetitive sequences in T.
alexandrinum and T. resupinatum and high values of Pi, thus could have essential function in the evolutionary process of Trifolium species.
In general, there is a strong correlation between the presence of IR and structurally stabilization of cp genomes. Substantial rearrangement was usually found in cp genomes lacking IR [9]. Among those IR lacking species of Leguminosae such as alfalfa, subclover, pea, etc, some are structurally stable and have not been rearranged, some undergo intermediate rearrangements, while others experienced a series of complex rearrangements [9]. This study found abundant rearrangements within six IR lacking species or four IR containing cp genomes of Trifolium (Fig 7). According to Palmer and Thompson [8], IR could prevent the rearrangement of cp sequence to some extent, so the rearrangement probability will be increased in the species lacking IR sequence. However, some species of legumes were reported having obtained the ability to increase rearrangement, this could be why there are many rearrangement events detected among those ten Trifolium species [8].
However, lacking IR leads to increased rearrangement is only one of the explains.

Phylogeny analysis and divergent time
The topological structure of other eight Trifolium species using 76 protein coding genes in this study was generally in agreement with the report of Sveinsson and Cronk [9] by 58 protein coding genes. In addition, the phylogenetic location of tested T. alexandrinum and T. resupinatum was confirmed ( Fig   9). Furthermore, T. alexandrinum and T. pratense, both belonging to Trifolium Sect. Trifolium, were clustered together and T. resupinatum was grouped with T. repens though those two species were in a separate group in Malaviya's study based on isozyme data [37]. Two IR containing species T.

Cp genome assembly, annotation and visualization
Both the raw data for two Trifolium species were filtered according to the following criterion: reads of less than 5% unidentified nucleotides and more than 50% of its bases with the quality score of 20 were retained after adapter trimming. With the reference genome of T. meduseum [10] [11]) and visualized in OGDRAW (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html).

The relative synonymous codon usage analysis (RSCU) and simple repeating sequences (SSRs) prediction
The RSCU was analyzed using MEGA v7.0 to reflect the relative preference of a particular codon encoding the corresponding amino acid codon [12]. The values of RSCU more than one was considered as greater codon frequency. SSRs with the same repeats units and times and distributed in the genic regions were considered as shared repeats, the repetitive sequences were distinguished using VMATCH V2.3.0 (http://www.vmatch.de/) and MISA v1.0 (http://pgrc.ipkgatersleben.de/misa/misa.html) based on the genomic data, which was also utilized to determine the mono-, di-, tri-, tetra-, penta-and hexa-nucleotides.
resupinatum tested in present study were utilized for Ka/Ks and nucleotide diversity (Pi) calculation.
Ka/Ks, which was generally considered as a reflection of selection pressures, was computed via KaKs_Calculator v2.0 [16]. Pi, which could be used to estimate the degree of nucleotide sequences variation and further provide potential molecular markers for population genetics, was calculated using VCFTOOLS by sequences comparison of the CDS of the common genes of different species by MAFFT version 7.017 [17]. Finally, single nucleotide polymorphisms (SNPs) and insertions/deletions (In/Dels) of T. alexandrium and T. resupinatum were also identified using Mafft program [17].

Consent for publication
Not applicable

Consent for publication
Not applicable

Availability of data and materials
The annotated chloroplast genomes of T. alexandrinum and T. resupinatum have been deposited in the NCBI GenBank with the accession numbers MN857160 and MN857161.

Competing interests
The authors declare that they have no competing interests.

Funding
This research was supported by the earmarked fund for Modern Agro-industry Technology Research System (No. CARS-34) and National Natural Science Foundation of China (3177131276).

Authors' contributions
XM, YP and YLX conceived and designed the study. YLX and YX participated in assembly and sequence analysis and wrote the draft paper. QQY and JMZ cultured and maintained the experimental material. ZXD and JY participated in the assembly of the genome sequences. XQZ and YX performed sequence analyses and generated the figures. JH and XL analyzed the data and prepared tables. All authors read and approved the final manuscript.
resupinatum, the numbers of "Number" mean number of repeats in T. alexandrinum and T.
Additional file6: Table S6. The nucleic acid polymorphism (Pi) computed using 88 common genes of T. alexandrinum and T. resupinatum. Figure 1 Gene maps of T. alexandrinum and T. resupinatum. Notes: Genes drawn inside and outside of the circle are transcribed clockwise and counterclockwise, respectively. Genes belonging to different functional groups are color coded. The darker gray color and lighter gray color in the inner circle corresponds to the GC content and the AT content, respectively.  The relative synonymous codon usage (RSCU) for the T. alexandrinum and T. resupinatum.  The synonymous/synonymous substitution rates (Ka/Ks) calculated using 62 shared genes in T. alexandrinum and T. resupinatum.

Figure 6
The nucleic acid polymorphism (Pi) computed using 88 common genes of T. alexandrinum and T. resupinatum. Genes with more than one Ka/Ks were red coded.

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.