Chloroplast genomic characteristics of L. sinense and L. jeholense
The number and sequence of coding genes in the chloroplast (cp) genomes of higher plants are highly conserved [18]. Variation in genome composition may be related to changes in base usage at the third codon position, which affect amino acid composition but seem to have little effect on the overall physicochemical properties of the amino acids [19]. Both the average GC content of whole genes and the average GC content at each of the three codon positions are higher in nuclear genes than in cp genes, indicating that nuclear and cp genes are subject to different pressures of genome organization and mutation [20]. The total cp genome length of L. sinense was 148,515 bp and that of L. jeholense was 148,493 bp, a small difference. The GC contents of L. sinense and L. jeholense were 36.67% and 37.25%, respectively. These are the distinguishing features of the two genomes. For comparison, the GC content of cp DNA in sweet potato is 38.45%, which is similar to that of other cp genomes reported in Convolvulaceae [21]. The difference in cp genome length is mainly caused by the change in LSC length. The SSC regions are 17,607 bp and 17,629 bp long, and the IR regions are 51,781 bp and 36,932 bp long, respectively (Table 1). Regional constraints strongly affect the sequence evolution of cp genomes, whereas functional constraints affect it only weakly [22].
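The GC measures discussed above (overall GC content and GC content at each codon position) can be illustrated with a minimal Python sketch. The function names and the toy sequence are hypothetical and are not taken from this study; the sketch assumes an in-frame coding sequence and counts only unambiguous A, C, G and T bases.

```python
def gc_content(seq):
    """Return the GC percentage of a sequence, counting only A, C, G and T."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        return 0.0
    gc = sum(base in "GC" for base in counted)
    return 100.0 * gc / len(counted)


def gc_by_codon_position(cds):
    """Return (GC1, GC2, GC3) percentages for an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore an incomplete trailing codon
    return tuple(gc_content(cds[pos:usable:3]) for pos in range(3))


# Hypothetical toy coding sequence, not from either genome.
toy_cds = "ATGGCGAAGTTCGGCTAA"
overall = gc_content(toy_cds)
gc1, gc2, gc3 = gc_by_codon_position(toy_cds)
print(f"GC = {overall:.2f}%, GC1 = {gc1:.2f}%, GC2 = {gc2:.2f}%, GC3 = {gc3:.2f}%")
```

Applied to a whole genome sequence, `gc_content` would give a single percentage comparable to the GC column of Table 1; `gc_by_codon_position` is only meaningful for individual coding sequences.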
Table 1
Comparison of general characteristics of chloroplast genomes of two species in Umbelliferae
Species | Size (bp) | GC content (%) | LSC length (bp) | SSC length (bp) | IR length (bp) | Gene number | Gene number in IR regions | Protein-coding gene number | rRNA gene number | tRNA gene number |
L. sinense | 148,515 | 36.67 | 93,978 | 17,607 | 51,781 | 127 | 28 | 83 | 8 | 36 |
L. jeholense | 148,493 | 37.25 | 93,932 | 17,629 | 36,932 | 127 | 28 | 83 | 8 | 36 |
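Because a quadripartite cp genome consists of the LSC, the SSC and the two IR copies, the region lengths in Table 1 can be cross-checked against the total size. The following is a small sketch of such a check, assuming (this is not stated in the table) that the IR column gives the combined length of IRa and IRb.

```python
# Values copied from Table 1 (bp).
table1 = {
    "L. sinense": {"total": 148_515, "LSC": 93_978, "SSC": 17_607, "IR": 51_781},
    "L. jeholense": {"total": 148_493, "LSC": 93_932, "SSC": 17_629, "IR": 36_932},
}


def region_gap(row):
    """Total length minus (LSC + SSC + IR); zero when the columns are consistent.

    Assumption: the IR value is the combined length of both IR copies.
    """
    return row["total"] - (row["LSC"] + row["SSC"] + row["IR"])


gaps = {species: region_gap(row) for species, row in table1.items()}
for species, gap in gaps.items():
    print(f"{species}: total - (LSC + SSC + IR) = {gap} bp")
```

Under this assumption the L. jeholense row balances exactly, whereas the L. sinense row differs by 14,851 bp, so the L. sinense IR entry may be worth re-checking against the annotation.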
L. sinense and L. jeholense differ markedly in the IR region (Table 1), and both genomes contain 8 rRNA genes and 36 tRNA genes. Sequencing showed that the LSC regions of the two cp genomes are very similar in size and that the total number of genes and the number of protein-coding genes are the same, which indicates a close relationship between the two species. Organization of the spacer in Umbelliferae is consistent with a general pattern evident for angiosperms [23]. Cp DNA has been used extensively to infer plant phylogenies at different taxonomic levels [24].
L. sinense and L. jeholense contain the same types of genes, and the difference in genome length between them is very small, which supports a very close genetic relationship. In general, cp genomes are divided into a large single-copy (LSC) region, a small single-copy (SSC) region and two inverted repeat (IR) regions [25]. Specifically, the IRb region of L. sinense is longer than that of L. jeholense: the ycf2 gene of L. jeholense lies entirely in the LSC region, whereas in L. sinense ycf2 lies partly in the LSC region and partly in the IRb region. Because IRb and IRa are symmetrical, L. sinense has a corresponding blank (gene-free) stretch in the IRa region. The rest of the LSC and SSC regions are essentially identical. This also explains why the total length of the L. sinense genome (148,515 bp) is greater than that of L. jeholense (Fig. 1–2). Prangos fedtschenkoi (Regel & Schmalh.) Korovin and P. lipskyi Korovin (Apiaceae) have also incorporated novel DNA in the LSC region near the LSC/IRa junction [26]. The predicted ycf94 protein has a distinct transmembrane domain but no sequence homology to other proteins of known function [27].
In the cp genomes of L. sinense and L. jeholense, 127 genes were detected. Among the photosynthesis genes, six (atpA, atpB, atpE, atpF, atpH, atpI) encode ATP synthase subunits. The Chlamydomonas reinhardtii chloroplast atpB mRNA contains sequences at its 3′ end that can form a complex stem/loop structure [28]. Twelve genes encode NADH dehydrogenase subunits (including ndhA, ndhB, ndhC, ndhD, ndhE, ndhF, ndhG and ndhK; Table 2), with ndhB present in two copies, and six genes encode cytochrome subunits (petA, petB*, petD*, petG, petL, petN). Twenty coding genes were detected for the photosystems: 6 genes (psaA, psaB, psaC, psaI, psaJ, psaL) in photosystem I and 14 genes (psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbK, psbL, psbM, psbN, psbT, psbZ) in photosystem II. The psbA gene has been located in the large single-copy region in some monocotyledons, very close to the end of an inverted repeat in the cp genome of most terrestrial plants [29]. The cp intergenic psbA-trnH spacer has recently become a popular tool in plant molecular phylogenetic studies at low taxonomic levels and is suitable for DNA barcoding studies [23]. Other genes include accD (acetyl-CoA carboxylase subunit), ccsA (c-type cytochrome synthesis), cemA (envelope membrane protein), clpP (protease) and matK (maturase). There are also 75 self-replication genes and 7 genes of unknown function (ycf1, ycf2, ycf3, ycf4 in two copies, ycf5, ycf15) (Table 2). The sequence and structure of chloroplast genes are highly conserved [30, 31]. Among the 127 genes in L. sinense, eight contain one intron (rps16, atpF, rpoC1, petB, petD, rpl16, ndhB-D2, ndhH) and three contain two introns (ycf3, clpP, rps12). Compared with L. sinense, L. jeholense lacks the intron in ndhH but has one more intron-containing ndhB gene. The genes with two introns are the same in type and number in the two species. Exons showed more random behaviors than introns [32] (Additional file 1: Table S1).
Table 2
List of the genes in the chloroplast genomes of two species of Ligusticum
Gene category | Gene group | Gene name |
Photosynthesis | Subunits of ATP synthase (6) | atpA, atpB, atpE, atpF*, atpH, atpI |
| Subunits of NADH dehydrogenase (12) | ndhA, ndhB*(x2), ndhC, ndhD, ndhE, ndhF, ndhG, ndhH*, ndhI, ndhJ, ndhK |
| Subunits of cytochrome (6) | petA, petB*, petD*, petG, petL, petN |
| Subunits of photosystem I (6) | psaA, psaB, psaC, psaI, psaJ, psaL |
| Subunits of photosystem II (14) | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbK, psbL, psbM, psbN, psbT, psbZ |
| Subunit of rubisco (1) | rbcL |
Other genes | Subunit of Acetyl-CoA-carboxylase (1) | accD |
| c-type cytochrome synthesis gene (1) | ccsA |
| Envelope membrane protein (1) | cemA |
| Protease (1) | clpP** |
| Maturase (1) | matK |
Self-replication | Large subunit of ribosome (9) | rpl2, rpl14, rpl16*, rpl20, rpl22, rpl23, rpl32, rpl33, rpl36 |
| DNA dependent RNA polymerase (4) | rpoA, rpoB, rpoC1*, rpoC2 |
| Small subunit of ribosome (13) | rps2, rps3, rps4, rps7 (x2), rps8, rps11, rps12**(x2), rps15, rps16*, rps18, rps19 |
| rRNA Genes (8) | rrn4.5 (x2), rrn5 (x2), rrn16 (x2), rrn23 (x2) |
| tRNA Genes (36) | trnA-UGC (x2), trnC-GCA, trnD-GUC, trnE-UUC, trnF-GAA, trnG-GCC, trnG-GCC, trnH-GUG, trnI-CAU, trnI-CAA, trnK-UUU, trnL-UAG, trnL-UAA, trnL-GAU (x2), trnL-CAA, trnfM-CAU, trnM-CAU, trnN-GUU (x2), trnP-UGG, trnQ-UUG, trnR-UCU, trnR-ACG (x2), trnS-GCU, trnS-GGA, trnS-UGA, trnT-GGU, trnT-UGU, trnV-GAC (x2), trnV-UAC, trnW-CCA, trnY-GUA |
Unknown function | Conserved open reading frames (7) | ycf1, ycf2, ycf3**, ycf4 (x2), ycf5, ycf15 |
* contains one intron
** contains two introns
Numbers in parentheses after the name of each gene group give the number of genes in that group; (x2) marks genes present in two copies
Repeat sequences analysis
SSR sequence analysis
Site-specific recombinase (SSR) technology allows the manipulation of gene structure to explore gene function and has become an integral tool of molecular biology [33]. When SSR technology is used to analyze closely related genotypes, the high polymorphism of many microsatellites is of special value, as in the casein breeding project working within a narrowly adapted gene pool [34]. Polymorphic SRAPs and SSRs were abundant in genetic diversity analyses among closely related cultivars [35]. SSRs behave as neutral markers, have highly variable repeat numbers and relatively conserved flanking sequences, and are widely distributed in the genomes of organisms. The technique is easy to operate, highly repeatable and shows codominant inheritance among alleles. SSR markers are therefore regarded as the best choice for evaluating the genetic diversity of crops [36, 37] (Additional file 2: Table S2).
The availability of complete chloroplast genome sequences enhances their use for genetic engineering. In chloroplast transformation, finding appropriate intergenic spacer regions is very important for efficient integration of transgenes [38]. There are 169 SSRs in the cp genome of L. sinense and 166 in that of L. jeholense; the main difference between them is the number of mononucleotide repeats. The cp genome of L. tenuissimum contains 174 SSRs. L. sinense and L. jeholense have six SSR types: mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide repeats (Table 3, Fig. 3). In contrast, L. tenuissimum contains no hexanucleotide repeats, and its number of dinucleotide repeats differs from that of the other two species. Dinucleotide repeat sequences show high polymorphism in eukaryotic DNA. These sequences are convenient genetic markers and can be analyzed by PCR [39] (Additional file 3: Table S3, Additional file 4: Table S4).
Table 3
SSRs in the chloroplast genomes of three species of Ligusticum
Unit size | Mononucleotide | Dinucleotide | Trinucleotide | Tetranucleotide | Pentanucleotide | Hexanucleotide |
L. sinense | 136 | 19 | 3 | 7 | 2 | 2 |
L. jeholense | 134 | 19 | 3 | 7 | 1 | 2 |
L. tenuissimum | 137 | 23 | 3 | 7 | 4 | 0 |
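The classification of SSRs by unit size used in Table 3 can be illustrated with a minimal Python sketch that scans a sequence for perfect repeats of 1 to 6 bp motifs. The minimum repeat counts below are hypothetical placeholders, since the thresholds used in this study are not stated here, and the function names and toy sequence are likewise illustrative only.

```python
import re

# Hypothetical minimum repeat counts for mono- to hexanucleotide motifs.
# These are placeholders, not the thresholds used in this study.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repetition of a shorter unit (e.g. 'ATAT')."""
    return motif not in (motif + motif)[1:-1]


def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs (0-based start)."""
    seq = seq.upper()
    hits = []
    for unit, minimum in sorted(min_repeats.items()):
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, minimum - 1))
        pos = 0
        while True:
            match = pattern.search(seq, pos)
            if match is None:
                break
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // unit))
                pos = match.end()
            else:
                # A run of a shorter motif (e.g. 'AA' inside poly-A): step past it.
                pos = match.start() + 1
    return sorted(hits)


# Hypothetical toy sequence containing one mono-, one di- and one trinucleotide SSR.
toy = "GC" + "A" * 12 + "CGC" + "AT" * 6 + "GGC" + "AAG" * 4 + "TC"
for start, motif, count in find_ssrs(toy):
    print(start, motif, count)
```

Counting the hits by motif length would give one row of a table like Table 3. The sketch handles only perfect repeats; compound or interrupted SSRs would need additional rules.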
Large Repeat Analysis
Many repeats occur throughout the cp genomes. In this study, L. sinense and L. jeholense had 39 and 37 pairs of large repeat sequences, respectively, each with sequence identity exceeding 90%. The large repeats range from 30 to 102 bp in L. sinense and from 30 to 66 bp in L. jeholense. A total of 16 and 13 large repeats were located in the genic regions of the two Ligusticum species, respectively (Additional file 5: Table S5, Additional file 6: Table S6).
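As a rough illustration of how large repeats of at least 30 bp can be located, the sketch below indexes every 30-mer and extends matching pairs to the right. It is a simplification under stated assumptions: it reports only exact forward repeats, whereas the analysis above allows identity above 90% and is not limited to forward repeats, and the tool actually used is not named in this section. The function name and toy sequence are hypothetical.

```python
from collections import defaultdict


def exact_forward_repeats(seq, min_len=30):
    """Return sorted (start1, start2, length) for exact forward repeats >= min_len.

    Each repeat is extended as far as possible to the right; seeds that only
    continue a repeat starting one base earlier are skipped.
    """
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        index[seq[i:i + min_len]].append(i)

    repeats = []
    for positions in index.values():
        for n, i in enumerate(positions):
            for j in positions[n + 1:]:
                if i > 0 and seq[i - 1] == seq[j - 1]:
                    continue  # already covered by the pair starting at (i - 1, j - 1)
                length = min_len
                while j + length < len(seq) and seq[i + length] == seq[j + length]:
                    length += 1
                repeats.append((i, j, length))
    return sorted(repeats)


# Hypothetical toy sequence: one 34 bp unit occurring twice.
unit = "ACGTTGCAAGCTTAGGCTAACCGGTTAAGCCTAG"
toy = "TTT" + unit + "GGGGG" + unit + "CC"
print(exact_forward_repeats(toy))
```

Tolerating mismatches, as an identity cut-off of 90% implies, would require an alignment-based or Hamming-distance extension step instead of the exact comparison used here.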
Analysis of the LSC, SSC, and IR Border Regions
The inverted repeat (IR) is a feature of most plant cp genomes [40]. The cp genome is roughly divided into three regions, LSC, IR and SSC, and the IR consists of two symmetrical copies, IRa and IRb. The results showed little difference in the length of the IR region among the three Ligusticum species; the length of the ycf2 gene at the boundary of LSC and IRb differed between L. sinense and L. tenuifolia. In addition, the ndhF gene lay entirely in the SSC region in all of them. For the ycf1 gene at the boundary between SSC and IRa, the reported length in the SSC region was 17,607 bp in L. sinense, 17,692 bp in L. jeholense and 17,661 bp in L. tenuissimum (Fig. 4). L. tenuissimum differs considerably from the other two species at the boundary of LSC and IRb. After the rps3 gene and the second rpl2 gene in the LSC region, the SSC region connects the rps19 and rpl2 genes. As in Escherichia coli and Euglena, the C. reinhardtii rps12 gene is continuous, in contrast to its trans-spliced structure in higher plants [41]. In addition, in L. tenuissimum the ycf1 gene in the IRb region is separated from the SSC region, and, after the rps19 gene ends at the border of the IRa region, the trnH gene is connected to the LSC region by 7 bp. During the evolution of the cp genome, the length of the IR region is not constant [42]. The intron boundary sequences do not follow the G-U/A-G rule, but are similar to those of the tobacco cp split genes for tRNA-Gly (UCC) and ribosomal proteins L2 and S12 [43].
Nucleotide Diversity Analysis
Coding and noncoding regions can both reflect homologous relationships, and because these regions are highly variable, they can be used to explore and resolve relationships among plants of the same genus. Additionally, fewer SSRs are distributed in protein-coding sequences than in non-coding regions, indicating an uneven distribution of SSRs within cp genomes [44]. In this study, the coding and noncoding regions of the two Ligusticum species were compared. A total of 151 coding regions (Fig. 5a) and 152 noncoding regions (Fig. 5b) were obtained from the cp genomes of the two species. Comparative analysis of Pi values (Additional file 7: Table S7) showed that the Pi values of coding genes were almost all zero, ranging from 0 to 0.0087719 (petG), in the 66 genes found. In the noncoding regions, Pi values in the LSC region varied more than in the coding regions, indicating that the coding regions are more stable and conserved. In the IRa, SSC and IRb portions of the noncoding regions, Pi values range from 0 to 0.0044843 (ccsA-ndhD), and most of them are zero (51 genes). The two most pronounced variations were in the petG gene and in psaI, ycf4 in the LSC region.
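For reference, nucleotide diversity (Pi) is the average number of pairwise differences per site among aligned sequences; with only two sequences it reduces to the fraction of differing sites. The sketch below is a minimal illustration that assumes pre-aligned sequences of equal length and skips columns containing gaps or ambiguous bases. The function name and toy sequences are hypothetical, and the software and window settings used for Fig. 5 are not assumed here.

```python
from itertools import combinations


def nucleotide_diversity(aligned):
    """Average pairwise differences per site (Pi) for equal-length aligned sequences.

    Columns with a gap or ambiguous base in any sequence are skipped.
    """
    if len(aligned) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    seqs = [seq.upper() for seq in aligned]
    columns = [col for col in zip(*seqs) if all(base in "ACGT" for base in col)]
    if not columns:
        return 0.0
    pairs = list(combinations(range(len(seqs)), 2))
    diffs = sum(col[i] != col[j] for col in columns for i, j in pairs)
    return diffs / (len(pairs) * len(columns))


# Hypothetical toy alignment: one difference in ten sites.
pi_toy = nucleotide_diversity(["ATGGCTAAGC", "ATGGCTACGC"])
print(pi_toy)
```

With two sequences, a Pi value of 0.0087719 would correspond, for example, to a single difference in 114 compared sites (1/114 ≈ 0.0087719).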