Complete plastid genome of Angelica, Ostericum and related species (Apiaceae): genomic evolution and phylogenetic implications

Subtribe Angelicinae is a large and taxonomically complex group of Apiaceae, encompassing Angelica, Archangelica, Coelopleurum, Conioselinum, Czernaevia, Glehnia, Levisticum and Ostericum that are distributed in the Northern Hemisphere, and whether this taxa is natural is debatable, especially between Angelica and Ostericum. To determine genommic evolution and phylogenetic relationships between Angelica, Ostericum, and related species, we newly assembled the complete plastid genome sequences of eight subtribe Angelicinae species and Melanosciadium pimpinelloideum using next-generation sequencing technology. The nine plastid genomes we sequenced were conserved, and their size ranged from 146765 bp to 164329 bp, showing the typical quadripartite circular structure with an overall GC content of 37.5–37.8%. IR boundary analyses showed that the genes in the LSC region transfer into the IR regions and the SSC region was relatively stable. Codon usage patterns were similar among these species and we identied 66–86 SSRs, with the most abundant SSR being mononucleotide. The Pi analyses showed that petA-psbJ(0.02778), atpI-atpH(0.17333) and petA-psbJ(0.04726) intergenic regions had the highest Pi values in Angelica, Ostericum, and ten species, respectively. trees. And the cpDNA tree was inconsistent with the ITS tree and this may have been caused by incomplete lineage sorting (ILS), hybridization and gene ow. The current plastid genomic dataset and the detailed analysis of subtribe Angelicinae species provide abundant genetic resources for the future molecular phylogeny, evolution and population genetic studies of subtribe Angelicinae and Apiaceae.

Plastids are plant organelles responsible for photosynthesis, producing organic matter and storing energy [25]. Plastid genomes are very conserved, mainly in their genome structure, gene sequence and gene type. A typical plastid genome can be divided into four regions: two inverted repeats regions (IRs), a large single-copy region (LSC) and a small single-copy region (SSC) [26,27]. In higher plants, plastid DNA is a circular molecule of double-stranded DNA and in multi-copy form, which ranges from 120 kb to 170 kb in length and usually encodes 120 to 130 genes. Plastid DNA has many features, such as monolepsis, small subfractions, multiple replications, and slow molecular evolution. Because of these speci c features, the plastid DNA sequence has been widely applied in molecular phylogenetic studies [28][29][30][31][32][33][34][35]. With the development of sequencing technology, especially the widespread use of next-generation sequencing, comparative genomics has become commonly used for the research of evolution, phylogeny, genome structure and population genetics [33,[36][37][38].
To date, there are 11 complete plastid genomes of Angelica, Ostericum and related species (Table S1). Although 11 species plastid genomes have been published, there are many questions unanswered. Such as, are there variations in plastid genome structure between species within genera of subtribe Angelicinae, especially between the Angelica and Ostericum? Based on plastid genome data, what are the phylogenetic relationships among species of Angelica and Ostericum and related species? Consequently, in this study we sequenced de novo eight species of four genera in the subtribe Angelicinae and 1 related species (3 of Angelica, 3 of Ostericum, 1 of Coelopleurum, 1 of Conioselinum and 1 of Melanosciadium) ( Table 1). We conducted structural and comparative analyses of ten plastid genome sequences (nine de novo and downloaded Glehnia littoralis of subtribe Angelicinae), including repetitive sequences, SSRs, codon usages, IR contraction and expansion, and nucleotide sequence diversity to de ne the relationship of these species. Additionally, we used 34 plastid genomes and 44 internal transcribed spacers (ITSs) to reconstruct phylogenetic relationships of subtribe Angelicinae based on maximum likelihood (ML) and Bayesian inference (BI) methods.

Analysis of inverted repeat contraction and expansion
To assess the Angelica, Ostericum and related species' (including Glehnia littoralis) expansion and contraction of the IR regions, we identi ed and focused on the junctions of IR/LSC and IR/SSC ( Figure 2). Gene ycf2 of 277-766 bp in the junction of the LSC and the IRb region was located in the IRb region in Angelica sp., Coelopleurum saxatile, Conioselinum chinense, and G. littoralis. In Melanosciadium pimpinelloideum the junction of the IRb and LSC coincided with the petB gene, with 118 bp in the LSC region and 1293 bp located in the IRb region. In Ostericum sp. the rps19 gene extended 81 bp into the IRb region. The ψycf1 gene was located in the junction of the SSC and IRb region in all ten species. Similarly, the ndhF gene was located in the SSC region 25-126 bp away from the IRb/SSC border except for G. littoralis. In G. littoralis, the ndhF gene extended 9 bp into the IRb region at the border of IRb/SSC. The ycf1 gene crossed the SSC and IRa border in all ten species, with 3439-3592 bp in the SSC region and 1856-1975 bp located in IRa region. In all species, the trnH gene was closest to the IRa/LSC border and was located in the LSC region being 3-1155 bp away from the IRa region. However, there were differences in IRa region away from the IRa/LSC border. The trnL gene was in the IRa region in Angelica sp., Coe. saxatile, Con. chinense, and G. littoralis, but was closest to the IRa/LSC border being 619-1329 bp away. Whereas the petD gene was closest to the IRa/LSC border and in the IRa region in M. pimpinelloideum, being 1472 bp away from the border. In Ostericum sp., it was the ψrps19 gene that was closest to the IRa/LSC border and located in the IRa region with 81 bp.

Analysis of codon usage and amino acids frequency
The 20 amino acids were encoded by 64 codons in the ten complete plastid genomes (i.e. Angelica amurensis, Angelica biserrata, Angelica tianmuensis, Coelopleurum saxatile, Conioselinum chinense, Melanosciadium pimpinelloideum, Glehnia littoralis, Ostericum grosseserratum, Ostericum huadongense, Ostericum sieboldii) ( Figure 3). Methionine (Met) and tryptophan (Trp) only had the minimum type of codons with one codon, while leucine (Leu) and serine (Ser) had the maximum type of codons with six. The total number of codons of these species ranged from 22489 in G. littoralis to 26569 in M. pimpinelloideum. Among the amino acids, Leu had the maximum codon numbers ranging between 2399-2826, and cysteine (Cys) had the minimum codon numbers ranging from 240-285. The most used codon was AUU ranging from 927 to 10744, with the second being AAA ranging from 905 to 1105. The three least used were the termination codons UAA, UGA and UAG, with UGA using the least and ranging from 12 to14, while UAA was the most used and ranged from 33-43. Excluding the termination codons, the least used codon was UGC and ranged from 58-60. The codon with the highest RSCU value was UUA, then AGA, and GCU (deep red). The codon with the lowest RSCU value was AGC, followed by (in order) CGC, CUG, GAC, UAC, and CUC (deep blue) ( Figure 3, Table S2). Furthermore, RSCU values of third position codons A or U were mostly greater than 1, whereas the third position codons C or G were mostly less than 1 (Table S2).

Analyses of repeats sequences and single sequence repeats (SSRs)
We detected forward, palindromic, reverse, and complementary repeats in these ten species. After ltering out duplicate data, we found that forward and palindromic repeats were typical, while reverse and complementary repeats were rare. Most repeat sequence lengths were 30-45 bp ( Figure 4). Among these species, Conioselinum chinense had the smallest total number of repeats with 24, whereas Melanosciadium pimpinelloideum and Ostericum sieboldii shared the largest number of repeats 40.

Analysis of sequence diversity nucleotide
Nucleotide diversity (Pi) of the ten plastid genomes was calculated to estimate the sequence divergence level of different regions. Among all species, the SSC and LSC regions exhibited high divergence levels and IR regions exhibited low divergence (Table S3. Figure 5). The results indicated that the IR region was relatively conserved. The sequences with high Pi values were predominantly in intergenic spacers in all species. There were two exceptions, gene regions of ycf2 and ycf1, which were in the boundaries of IR/SSC and IR/LSC.

Phylogenetic analysis
We used 39 complete plastid genomes and 44 nrITS sequences to construct ML and BI trees to investigate the phylogenetic relationships in subtribe Angelicinae ( Figure 6). The plastid genome and ITS trees produced incongruent topology trees. The plastid tree indicated that Angelica is monophyletic but subtribe Angelicinae is non-monophyletic. Angelicasp. were closely related to Coelopleurum saxatile and Melanosciadium pimpinelloideum. Ostericum sp. were closely related to Pternopetalum, and belonged to the Acronrma Clade [21]. Conioselinum chinense was in the Sinodielsia Clade [21]. In addition, Angelica sinensis was closely related to Con. chinense and also belonged to the Sinodielsia Clade [21]. In particular, Glehnia littoralis was embedded in Angelica (Figure6B, 6D, BS=69, PP=1).
However, these species and genera were in parallel branches in the nrITS tree, including Melanosciadium, Coelopleurum, Glehnia, Peucedanum, Angelica acutiloba and Angelica. The parallel branches and low support in BI analysis (Figure6A, 6C, BS=100, PP=0.5591) indicated that nrITS provided insu cient information to determine phylogenetic relationships and more nuclear gene and cpDNA sequences are needed to increase certainty in models. The ITS tree results from ML and BI analysis were a little different regarding the placement of Peucedanum japonicum where it was not one of the parallel branches of Angelica in BI analysis, but in ML analysis is one of the parallel branches of Angelica. Nevertheless, all trees indicated that the subtribe Angelicinae is non-monophyletic, and relatedness between Angelica and Ostericum is more distant.

Discussion
Plastid genome evolution Plastid genomes are highly conserved in genome structure, GC content, gene numbers and gene order [29,45]. Previous studies have con rmed that genome rearrangement, gene pseudogenization or deletion, expansion and contraction of IRs have occurred during the evolutionary process of other plants [46][47][48][49]. In this study, the highly conserved nature of the plastic genome does not mean that they are alike, there are important differences and diversity between plastid genomes. We found that the longest genome was 164329 bp in Melanosciadium pimpinelloideum, followed by Ostericum sp. with lengths around 160000 bp, while the other species (Angelica amurensis, Angelica biserrata, Angelica tianmuensis, Coelopleurum saxatile and Conioselinum chinense) had genomes with lengths about 148000 bp. The order of most to least number of genes and protein-coding genes was the same as length, with M. pimpinelloideum having the most, then Ostericum sp., followed by all other species. The tRNA genes were divided into two groups, M. pimpinelloideum and Ostericum sp. had 37 tRNAs, the other species had 36 tRNAs. We compared our sequenced O. grosseserratum against downloaded O. grosseserratum (Genebank NO.: KT852844), where we compared their size, gene number, GC content, and tRNAs (Table S4). We conclude that O. grosseserratum (Genebank NO.: KT852844) is probably an identi cation error.
Codon usage bias is a crucial indicator for studying the evolution of genomes [50]. Codon usage bias may be caused by nucleotide mutation, random genetic drift, or the affection of translation e ciency [50][51][52]. In addition, gene sequence length, gene expression level, GC distribution position and tRNA abundance can affect the preference of synonymous codon usage [52][53][54][55][56]. We found that the genes of studied species preferred the codons ending with A or U, and this is a universal phenomenon in the plastid genome of higher plants [54,57,58].
Plastid genomes contain repeats (forward, palindromic, reverse, and complementary) and in most studies of angiosperm plastids, the authors reported repeats ≥ 30 bp [40,[59][60][61]. Repeat sequences are important in the identi cation of mutational hotspots and play an important role in genome rearrangement [62,63]. Hence, these repeats are used extensively for population genetics and biogeography studies [37,40,41]. Overall, we found that forward and palindromic repeats were more common in our studied species and reverse and complementary repeats were rare, indicating these species tend to generate the forward and palindromic repeats rather than the reverse and complement repeats (Figure 4b).
Simple sequence repeats (SSRs) are valuable markers to detect variability within the same species and have been widely used in plant population genetics and evolutionary studies [40,61,64]. The most abundant SSRs were mononucleotide in our studied species, followed by (in decreasing abundance) dinucleotide, tetranucleotide, trinucleotide, pentanucleotide, and hexanucleotide repeats. This phenomenon is common in Allium [35,36,42] and Apiaceae [34,37,40,41]. In our species, mononucleotide and dinucleotide compositions were similar, but trinucleotide, tetranucleotide and pentanucleotide were different across clades. For example, Ostericum was signi cantly different from Angelica, such as in the type of Trinucleotide, in Angelica amost is ATT and ATA, but in Ostericum is TAT, AAT and TTC (Table S3). All dinucleotide repeats were AT/TA, while other SSRs had motifs that were mostly A/T, causing overall AT richness in the studied plastid genomes [37,65].
Contraction and expansion of IR regions are important for genome size variations and play a crucial role in evolution [66,67]. The IR regions of our studied species can be divided into 3 groups by length. The rst group has IR regions ranging from 17797-18467 bp and contains Angelica, Co. saxatile, Con. chinense and G. littoralis. The second group only contains M. pimpinelloideum because its IR regions were the largest with 35157 bp. The Ostericum species are grouped together with IR region lengths ranging from 26033-26322 bp. Through the length of LSC regions, SSC regions and IR regions, we can conclude that size of IR regions is the key to affecting the size of genomes, the IR regions larger, the size of genomes larger. There are several models that have been proposed to explain the in uence of IR regions on genome size. By examination of IR/LSC junctions in 13 Nicotiana species, Goulding et al. proposed a stepwise model involving a single-strand break, heteroduplex formation via a Holliday junction, and then small IR expansions via gene conversion [68]. In the study of N. acuminata, Wang et al. proposed a different model of double-strand break (DSB) followed by strand invasion and recombination that may cause the expansion and contraction of IR [69]. When studying Pelargonium, Zhu et al. proposed a model involving multiple inversions promoted by the dispersed repeats, expansions and contractions with several rounds of ebb-and-ow [46].
In this study there was no obvious change in SSC/IR borders, but we did detect an obvious shift of LSC/IR borders, indicating that these IR regions were undergoing a contraction and expansion. In Apiaceae the SSC region almost the same length, similar results were observed in the previously reported Chamaesium, Ligusticum and Bupleurum [34,37,40]. However, the size of the LSC region in our studied species was similar, ranging from 76450-94111 bp. We observed the transfer of LSC sequences and genes into the IR regions or the IR regions contracted in the IR/LSC border. We observed Ostericum had more had more ycf2, trnI-CAU, rpl23, rpl2 in IRb and Ira regions than Angelica, and part rps19 gene in IRb region in Ostericum. Compared to Angelica, M. pimpinelloideum had more ycf2, trnI-CAU, rpl23, rpl2, rps19, rpl22, rps3, rpl16, rpl14, rps8, infA, rpl36, rps11, rpoA, petD, petB in IRb region, and more ycf2, trnI-CAU, rpl23, rpl2, rps19, rpl22, rps3, rpl14, rps8, infA, rpl36, rps11, rpoA, petD in IRa region. This result is different from Plantago where genes are transferred from the SSC region into the IR region [60,70]. During land plant evolution, the IR expansion or contraction that have transfered genes from the SC regions into the IR or vice versa is generally [44,70,71]. The IR contraction and expansion has been considered important evolutionary phenomena that resulted in the origin of pseudogenes, gene duplication, or the reduction of duplicate genes to single copy [61,[72][73][74].

Phylogenetic relationship analysis
Until now molecular phylogenetic studies based on nrITS and a few cpDNA sequences did not support the monophyly of subtribe Angelicinae [21,23,24]. In this study, we performed phylogenetic analyses for subtribe Angelicinae using protein-coding genes of the plastid genomes and ITS sequences, and we conclude that subtribe Angelicinae is non-monophyletic. Angelica is closely related to Coelopleurum saxatile, Melanosciadium pimpinelloideum, and Angelica acutiloba belongs to the tribe Peucedaneae (tribe Selineae) [21]. Regardless of which tree, ITS or cpDNA, A. acutiloba is one of the parallel branches of Angelica and A. acutiloba may not belong to Angelica. The speci c phylogenetic position of A. acutiloba needs more research to ascertain its correct position. Ostericum is closely related to Pternopetalum and it belongs to the Acronrma Clade [21]. We found that Pterygopleurum neurophyllum was nested within Ostericum and there are three possible reasons for this unexpected result. The rst is that P. neurophyllum should belong to Ostericum and that the plastid tree suggests that P. neurophyllum clusters with Ostericum forming a monophyletic taxon. The second is that the downloaded ITS (GeneBank No.: AY509127) data is from a species misidenti ed as P. neurophyllum and since there is only one sequence in GeneBank this cannot be compared against other sequences of the species. The third is that it was caused by the con ict between nuclear and plastid, like Glehnia littoralis and Angelica sp.. Nevertheless, more data and sequences of eld collected materials is needed to solve this unexpected result.
The plastid genome and ITS trees produced incongruent tree topologies. The main difference is in Angelica and related species, especially the phylogenetic position of G. littoralis. In the cpDNA tree, G. littoralis was embedded in Angelica, but in the nrITS tree G. littoralis forms a parallel branch of Angelica. The organellar relationships and nuclear relationships discord is common in angiosperms, such as in Rosidae [75], Temperate Bamboos [76], and Heuchera [77]. Given that nuclear and plastid genomes have distinct evolutionary histories, organellar genomes have experienced unique population genetic pressures. Several factors can drive such con ict, including incomplete lineage sorting (ILS), hybridization and gene ow [76][77][78]. Together, our results suggested that subtribe Angelicinae species, and possibly Apiaceae, may have experienced a complex evolutionary history and speciation process.

Conclusion
In this study, we used the next-generation sequencing to sequence and annotate the complete plastid genomes of Angelica amurensis, Angelica biserrata, Angelica tianmuensis, Coelopleurum saxatile, Conioselinum chinense, Melanosciadium pimpinelloideum, Ostericum grosseserratum, Ostericum huadongense, and Ostericum sieboldii. The structure and organization of these plastid genomes are similar to each other and previously reported genomes from Apiaceae. We compared these nine genomes to another Angelicinae species, Glehnia littoralis, for repetitive sequences, SSRs, codon usages, IR contraction and expansion, and nucleotide sequence diversity. We found that Ostericum exhibited striking differences in size of genomes, content of genes and tRNAs, GC content, some type of SSRs, and IR boundaries to Angelica, and con rmed that the relatedness between Angelica and Ostericum is more distant in protein-coding genes of the plastid genomes trees. And the cpDNA tree was inconsistent with the ITS tree and this may have been caused by incomplete lineage sorting (ILS), hybridization and gene ow. The current plastid genomic dataset and the detailed analysis of subtribe Angelicinae species provide abundant genetic resources for the future molecular phylogeny, evolution and population genetic studies of subtribe Angelicinae and Apiaceae.

Methods
DNA sequencing, assembly, and annotation Fresh green leaves were collected from adult plants of nine species from the eld, and then they were desiccated and stored in silica gel. Qiu-Ping Jiang underwork the formal identi cation of the plant materials in this study. The herbarium specimens of these species were stored in the Herbarium, College of Life Sciences, Sichuan University (SZ). Specimen voucher details can be found in the table S1. Permission is not required to sample these plants because they are not key protected plants.
We extracted the total genomic DNA from the stored dry leaves, using the modi ed CTAB method [79]. sieboldii. Preliminary genome annotation was conducted using PGA [81], manual modi cations for uncertain genes, and uncertain start and stop codons were corrected based on comparison with other a nis' plastid genomes using GENEIOUS R11 [82]. The nine species' annotated genome sequences were submitted to GenBank, and their corresponding accession numbers are listed in (Table S1). We also downloaded G. littoralis from NCBI, GenBank accession No.: MH142518, to enhance genome comparative analyses.
Genome comparative analyses of codon usage Circular gene maps of the annotated genomes were drawn by the online program OGDRAW [83]. Genome features were compared between the nine species of subtribe Angelicinae and one related species based on the program GENEIOUS R11. We removed protein-coding genes (PCGs) less than 300 bp from the ten species' genomes, leaving protein-coding genes (PCGs) (>300 bp) that were analysed for codon usage bias in the program codon W [84]. We also calculated the relative synonymous codon usage (RSCU) [85] of the ten species.

Characteristics of cpSSRs and repetitive sequences
The plastid simple sequence repeats (cpSSRs) of the 10 species were generated using Perl script MISA [86] with the same settings: 10 repeats for mononucleotide, 5 repeats for dinucleotide, 4 repeats for trinucleotide and 3 repeats for tetranucleotide, pentanucleotide and hexanucleotide. Furthermore, the repetitive sequences (i.e. complementary, forward, palindromic, and reverse repeats) of plastid genomes were identi ed using REPuter [87]. The parameters were set as follows: (1) minimum repeat size of sequence was 30 bp; (2) sequence identity was more than 90% between two repeats; and (3) Hamming distance was equal to 3. Manual modi cations of the data included removal of all overlapping repeat sequences.

Sequence diversity analysis
The alignment of the ten species' plastid genomes were visualized using MAFFT v7.402 [88] and calibrated manually. The single nucleotide polymorphism (SNP) analysis was generated using DnaSP v5 [89]. The parameters were set as follows: the windows length was 600 bp, and step size was 200 bp.

Phylogenetic analysis
To infer phylogenetic relationships within subtribe Angelicinae, the nine plastid genomes (nine do novo sequenced including one non-subtribe) were compared to another 25 Apiaceae Apioideae plastid genomes, with Bupleurum and Chamaesium as the outgroup. All plastid genome sequences were aligned using MAFFT v7.402, and we then extracted protein-coding genes. Maximum likelihood (ML) analyses was undertaken using RAxML v7.2.8 [90] and the best-t model with GTR+G model and 1000 bootstrap replicates. Bayesian Inference (BI) analyses were performed in MrBayes version 3.2 [91], the best-t model for the sequence (GTR+G+I) was conducted using Modeltest 3.7 [92] with optimized parameters. Four simultaneous runs were performed using Markov chain Monte Carlo (MCMC) simulations for 10 million generations, starting from a random tree and sampling one tree every 1000 generations. The rst 20% of obtained trees were discarded as burn-in and the remaining were used to calculate a 50% majority-rule consensus topology and posterior probability (PP) values. Declarations