Comparative analysis of whole chloroplast genomes of Ligusticum sinense and L. jeholense, Umbelliferae

Background


Conclusions
In this study, the cp genomic characteristics of Ligusticum sinense and L. jeholense were identi ed, which provided a theoretical basis and documentation for the identi cation and phylogenetic analysis of Ligusticum.

Background
Gaoben it has a long cultivation history in China. It has been recorded in "Shennong Classic of Materia Medica", which has the effect of dispelling wind and dampness, dispersing cold and relieving pain [1]. Gaoben comes from Ligusticum sinense and L. jeholense. Gaoben is mainly distributed in the south of China, while Liaogaoben mostly occurs in the north. There are obvious differences in root morphology and microscopic characteristics between the two. The traditional Chinese medicine Gaoben is commonly used in the treatment of wind cold headache, pain on the top of the mountain, wind cold dampness, etc. Gaoben is included in the traditional Chinese medicine, while Liaogaoben as the source of Ligusticum is recorded in it, not included in it [2]. Because of the close relationship between them, it is di cult to distinguish them, so we use the methods of chloroplast genome to identify them.
In recent years, with the rapid development of plant molecular systematics, new methods have been widely used in the localization of chloroplast (cp) restriction sites in cp genome, but limited to the applicable classi cation level. The change of gene sequence and content in cp genome is rare, which is usually caused by a part of genome or gene deletion. For example, differences in gene sequence and content [3]. With a few exceptions, the chloroplast genome contains two reverse repeat sequences (IR). These repeat regions separate the rest of the molecule into large single copy (LSC) and small single copy (SSC) regions [4]. Recent studies on more than 200 owering plants in 30 families have shown that the chloroplast genome is highly conserved in its size, tissue and primary sequence [5]. Umbelliferae is a large and complex family of owering plants Phylogenetic relationships, especially at the lower taxonomic level, are usually fuzzy based on the molecular markers currently widely used. Therefore, it is very advantageous to attach the information of phylogenetic utility of molecular markers at these levels [6].
The cp genomes of two Ligusticum species were studied. To explore the differences between the molecular level, and to analyze and identify L. sinense and L. jeholense by cp genome and other content. The relationship between L. sinense and L. jeholense was analyzed.

DNA extraction and sequencing
DNA extraction from plant tissues is a very critical process. Many time-consuming steps in plant molecular biology are involved [7]. We collected the fresh leaves and roots of L. sinense and L. jeholense from the key varieties protection park of Liaoning Province in Dalian campus of Liaoning University of traditional Chinese medicine (N 39°06´, E 121°87´, Dalian, Liaoning Province, China). Five mg fresh leaves and 5 mg roots were collected by improved extraction method for cp DNA separation [8]. After DNA separation, 1 µg of puri ed DNA was segmented and used to construct a short insertion Library (430 bp insertion size) according to the manufacturer's instructions (Illumina), and then sequenced on Illumina Hiseq 4000 [9]. Genome assembly and annotation Using Illumina truseq Gamma The library was constructed by nano DNA sample prep kit method. Using assembly software to assemble clean data, adjusting parameters and lling holes for many times to obtain the genome assembly sequence cp genome is reconstructed by using the combination of denovo and reference guided assembly. Before assembly, rst lter the original reading. This ltering step is performed to remove reads with adapters, reads with a display quality score of less than 20 (Q < 20), reads that contain a percentage of an uncalled base ("n" character) equal to or greater than 10%, and repeat sequences. The following three steps are used to assemble the cp genome [10]. First, use soapdenovo 2.04 [11] to assemble the ltered reads into contrags. Secondly, we use blast to compare the contigs with the reference genomes of the two species, and rank the contigs according to the comparison of the reference genomes (both similarity and query coverage are greater than 80%). Third, map clean reads to the assembled cp genome sketch to correct the wrong base and ll in most of the gaps through local assembly. Cp gene uses online dogma tool annotation [12]. using default parameters to predict protein coding genes, transfer RNA (tRNA) genes and ribosomal RNA (rRNA) genes. The general function database compares the gene function annotation information of the samples, and classi es the speci c functions. The sequencing data and gene annotations were then submitted to GenBank and assigned accession numbers (L. sinense: MT512897, L. jeholense: MT512888).

Comparative analysis of genomes
The differences between samples and reference genomes were compared from two levels of genome and gene. SSR sequences were identi ed by SSR software microsatellite (MISA) (http://pgrc.ipk-gatersleben.de/misa/), and the tandem repeats of 1-6 nucleotides were regarded as microsatellites. For single, two, three, four, ve, and six nucleotides, the minimum number of repeats is set to 10, 6, 5, 5, 5, and 5, respectively. The maximum base number of two SSR interruptions in composite microsatellites is 100. We compared these data with Ligusticum tenuissimum (NC_029394), and focused on the perfect repeat sequence [14]. The long repeat sequences were analyzed by using the network reputer (http://bibiserv.techfak.uni-bielefeld.de/reputer/), including forward, reverse, complement and palindrome repeat sequences. The minimum sequence length was 30 bp, and the editing distance was 3 bp [15]. The variation characteristics of two kinds of Ligusticum were screened out. The average number of nucleotide differences and the total number of mutations were measured by dnasp v5.10 to analyze nucleotide diversity (polymorphic information, Pi) [16].

Phylogenetic analysis
In order to analyze the phylogenetic relationship between Ligusticum and othergenera of Umbelliferae, phylogenetic tree was constructed from cp genome sequences of 25 species of Umbelliferae, 23 of which were downloaded from GenBank. Among the 28 plants, there are three exotaxa: Ginkgo biloba (NC_016986), Panax ginseng (NC_006290) and Arabidopsis thaliana (NC_000932). The phylogeny includes 25 species of Umbelliferae, 1 species of Cruciferae, 1 species of Ginkgo family and 1 species of Acanthopanax family. Single nucleotide polymorphisms (SNPs) in the whole cpgenome of 28 species of plants were analyzed. Phymlv3.0 software was constructed by maximum likelihood method (ML), GTR + I + GML model was established, bootstrap value was calculated by 100 bootstrap replications [17].

Results And Discussion
Chloroplast genomic characteristics of L. sinense and L. jeholense The number and sequence of coding genes in chloroplast genome of higher plants are highly conserved [18]. The variation of genome composition may be related to the change of base selection at the third codon, which affects the composition of amino acids, but it seems to have little effect on the overall physical and chemical properties of amino acids [19]. The average GC content of the whole gene, as well as the average GC content at three codon positions, the nuclear gene is higher than the cp gene, indicating that the pressure of genome organization and mutation of nuclear gene and cp gene is different [20]. The total cp gene length of L. sinense was 148,515 bp, and that of L. jeholense was 148,493 bp. The difference between them was small. The content of GC in L. sinense and L. jeholense were 36.67% and 37.25%. These are the distinct features of the two genomes. The GC content of cp DNA in sweet potato is 38.45%, which is similar to other cp genomes reported in Convolvulaceae [21]. The difference of cp genome length is mainly caused by the change of LSC length. The length of SSC is 17,607 bp and 17,629 bp. The length of IR is 17,607 bp and 17,629 bp ( Table 1). The regional constraints strongly affect the sequence evolution of the cp genomes, while the functional constraints weakly affect the sequence evolution of cp genomes [22]. Among them, L. sinense and L. jeholense are obviously different in IR region, and 8 rRNA and 36 tRNA are in IR region. By sequencing, we found that the size of LSC region of cp genome of L. sinense and L. jeholense is very similar, and the number of total genes and coding protein genes are basically the same, which proves that there is a close relationship between them. Organization of the spacer in Umbelliferae is consistent with a general pattern evident for angiosperms [23]. Cp DNA has been used extensivelyto infer plant phylogenies at different taxonomic levels [24].
The genetic types of L. sinense and L. jeholense are the same, the difference of bp length between them is very small, which proves that the genetic relationship between them is very close. In general, cp are divided into a large single copy (LSC) area, short single copy (SSC) area and two reverse repeat (IR) areas [25]. Speci cally, the length of the IRb region of L. sinense is longer than that of L. jeholense, which shows that all the ycf2 genes of L. jeholense are in the LSC region, while in L. sinense, part of the ycf2 genes are in the LSC region and part of the IRb region. According to the symmetry of IRb and IRA, L. sinenseis has a blank gene in IRa area. The remaining LSC and SSC areas are identical. This also explains the reason why the total length of L. sinenseis 148,515 bp longer than that of L. jeholense ( Fig. 1-2). Prangos fedtschenkoi (Regel & schmalh.) Korovin and P. lipskyi Korovin (Apiaceae) also combined new DNA in LSC region near LSC/IRa connection [26]. The ycf94 predicted protein has a distinct transmembrane domain but with no sequence homology to other proteins with known function [27].
Cp intergenic psbA-trnH spacer has recently become a popular tool in plant molecular phylogenetic studies at low taxonomic level and as suitable for DNA barcoding studies [23]. It includes accD gene of acetylCoA carboxylase subunit, accD gene of type C cytochrome, ccsA gene of synthetic gene, cemA gene of envelope protein, clpP gene of protease and matK gene of maturity. There are also 75 self replicating genes and 7 unknown genes (ycf1, ycf2, ycf3, ycf4, ycf5, ycf15) ( Table 2). The sequence and structure of chloroplast genes are highly conserved [30,31]. Among 127 genes in L. sinense, one intron gene is rps16, atpf, ropc1, petb, petd, rpl16, ndhb-d2, ndhH genes and two introns of ndhB gene are ycf3, clpP and rps12 gene. Among the 127 genes with one intron compared with L. sinense, L. jeholense does not contain the intron gene of ndhH, in addition, there is one more intron gene of ndhB compared with L. sinense. The two introns have the same kind and quantity. Exons showed more random behaviors than introns [32] (Additional le 1: Table S1).  Site-speci c recombinase (SSR) technology allows the manipulation of gene structure to explore gene function and has become an integral tool ofmolecular biology [33]. When SSR technology is used to analyze closely related genotypes, the high polymorphism of many microsatellites has special value, just like the casein breeding project working in the narrow sense adaptive gene pool [34]. Polymorphic SRAPs and SSRs were abundant in genetic diversity analysis among closely related cultivars [35]. Because of the characteristics of neutral markers, the highly variable numbers of repeats and the relative conservatism of anking sequences of SSRs, it is widely distributed in the genome of organisms. The technique is easy to operate and has high repeatability and codominant inheritance among alleles. SSRs marker technique is the best choice for evaluating genetic diversity of crops [36, 37] (Additional le 2: Table S2).
The availability of complete sequences of chloroplast genomes enhances their use for genetic engineering. In chloroplast transformation, nding appropriate intergenic spacer regions is very important for e cient integration of transgenes [38].
There are 169 SSRs in the cp genome of L. sinense and 166 SSRs in the cp genome of L. jeholense. The main difference between them is the number of single nucleotide. The cp genome of L. tenuissimum contains 174 SSRs. There are six SSR types in L. sinense and L. jeholense, including single nucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide and hexanucleotide (Table 3, Fig. 3). In addition, L. tenuissimum does not contain six nucleotides for identi cation. The number of dinucleotide of L. tenuissimum is different from that of the other two. Dinucleotide repeat sequence shows high polymorphism in eukaryotic DNA. These sequences are convenient as genetic markers and can be analyzed by PCR [39] (Additional le 3: Table S3, Additional le 4: Table S4). Large Repeat Analysis Many repeats occur in whole cp genes. In this study, L. sinense and L. jeholense have 39 and 37 pairs of large repeat sequences, respectively. It was found in the cp genome with sequence identity exceeding 90%. The large repeat range of L. sinense is 30 to 102 bp, and L. jeholense is 30 to 66 bp. A total of 16 and 13 large repeats were located in the genic regions of the two Ligusticum, respectively (Additional le 5: Table S5, Additional le 6: Table S6).

Analysis Of The LSC, SSC, And IR Border Regions
Inversion repeat (IR) is a feature of most plant cp genomes [40]. The cp is roughly divided into three regions, LSC, IR and SSC, among which IR is divided into IRa and IRb, which are symmetrical. The results showed that there was little difference in the length of infrared region among the three species of L. sinense, among which the length of ycf2 gene was different between L. sinense and L. tenuifolia at the boundary of LSC and IRb. In addition, all ndhF gene fragments of them were in SSC region. The length of ycf1 gene at the boundary between SSC and IRa was 17,607 bp in the SSC region of L. sinense and 17,692 bp in the SSC region of L. jeholense. The length of ycf1 gene in SSC region of L. tenuissimum is 17,661 bp (Fig. 4). L. tenuissimum and the other two are quite different at the boundary of LSC and IRb. After connecting rps3 gene and rpl2 gene 2 in LSC region, the SSC region connects rps19 gene and rpl2 gene. As in Escherichia coli and Euglena, the C. reinhardtii rps12 gene is continuous, in contrast to its trans-spliced structure in higher plants [41]. In addition, L. tenuissimum separated ycf1 gene from SSC region in IRb region, and connected 7 bp trnH gene with LSC region after rps19 gene ended at the border of IRa region. In the evolutionary process of cp genome, the length of IR region is not constant [42]. The intron boundary sequence does not follow the G-U / A-G rule, but is similar to the tobacco cp division gene trnagly (UCC) and ribosomal proteins L2 and S12 [43] .

Nucleotide Diversity Analysis
The coding region and noncoding region of gene can better re ect homologous kinship, and these regions are highly variable, according to the coding region and noncoding region of genome to explore and solve the relationship between the same genus of plants. Additionally, fewer SSRs are distributed in the protein-coding sequences compared to the non-coding regions, indicating an uneven distribution of SSRs within the cp genomes [44]. In this experiment, the coding region and noncoding region of two kinds of L. sinense were compared and analyzed. 151 coding genes (Fig. 5a) and 152 noncoding genes (Fig. 5b) were generated in cp genome of two kinds of Ligusticum. Through the comparative analysis of Pi value (Additional le 7: Table S7), it was found that the Pi value of coding genes was almost zero between 0-0.0087719 (petG gene), and 66 genes were found. In the non coding region, the change of Pi value in LSC region is more than that in LSC region, indicating that the coding region is more stable and conservative. The genes of IRa, SSC and IRb in the non-coding region range from 0 to 0.0044843 (ccsA < ndhD gene), and most of the Pi values are zero, including 51 genes. The rst two signi cant gene mutations were petG gene and psaI, ycf4 gene in LSC region.

Conclusion
By comparing the cp genomes of L. sinense umbelliforme and L. jeholense. We found that they have the same cp genome sequence, mainly different in the length of a single gene and the length of IR region. There are also differences between the two annotated genes, including different types and numbers of an intron, which can be used to identify the two. The difference of cp length of the total gene is that L. sinense in IR region isolates the ycf2 gene at the boundary of LSC and IRb region, while L. jeholense completely exists in LSC region, and 83 coding genes and 44 non-coding genes are produced in cp genome of both L. sinense and L. jeholense. There are obvious differences in the ycf2 gene which can be used for marker development. There are 88 repeats in the repeats, which can be used to develop markers. Based on the cp genome of Umbelliferae and SNPs of shared coding proteins, the phylogeny of two L. sinense species in Umbelliferae was analyzed, which provided a reference for further study on the evolutionary history of Umbelliferae.

Declarations
Availability of data and materials All data generated or analyzed during the course of this study are included in this document or obtained from the appropriate author(s) at reasonable request.
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.