Comparative Complete Chloroplast Genome of Nine Species in Litsea (Lauraceae) and Uncover Evolutionary Dynamic Patterns

Weicai Song Qingdao University of Science and Technology Zimeng Chen Qingdao University of Science and Technology Qi Feng Qingdao University of Science and Technology Chuxuan Ji Imperial College London Chengbo Wei Peking University Shenzhen Graduate School Chao Shi (  chaoshi@qust.edu.cn ) Qingdao University of Science and Technology Michael S. Engel University of Kansas Shuo Wang Qingdao University of Science and Technology


Introduction
Litsea is an evergreen tree or shrub and is one of the most diverse genera (about 400 species) in the family Lauraceae (Mesangiospermae: Magnoliids: Laurales). It is widely distributed in tropical and subtropical Asia, North and South America [1,2], and 74 species of which are located in China, at a maximum elevation of 2,700 m above sea level [3]. Species of Litsea are utilized in a wide range of applications, covering medical, agricultural, industrial, and many other elds. Litsea can be used to treat a variety of conditions such as diarrhea, stomach pain, indigestion, the common cold, gastroenteritis, diabetes, edema, arthritis, asthma, pain, and trauma [2]. In addition, Litsea is also known for the highly effective properties of its essential oil against food-borne pathogens [4]. Its essential oils can also be resistant to several types of bacteria, has antioxidant, anti-parasitic, acute toxicity, genotoxic, and cytotoxic properties, and can even prevent several types of cancer [5][6][7]. Despite the pharmaceutical applications of Litsea, it is also widely used as feed for silkworm pupae, especially for muga silk worms (Antheraea assama) [5]. By comparison with ordinary silk produced from other food sources, muga silk produced from Litsea possesses a higher value and is considered of better quality, as re ected in its The cp genome features of nine species were analyzed and the total length ranged from 152,051 bp to 152,747 bp (Fig. 1). 128 genes were found in these complete cp genomes, including 36 tRNA genes, 8 rRNA genes, and 84 protein-coding genes. These genes can be divided into three categories: selfreplication related, photosynthesis related, and other genes. Large subunit of ribosomal proteins, small subunit of ribosomal proteins, DNA-dependent RNA polymerase, rRNA genes, and tRNA genes belong to the Self-replication category; Photosystem I, Photosystem II, NADH oxidoreductase, Cytochrome b6/f complex, ATP synthase, and Rubisco belong to the Photosynthesis category; while the remaining genes that have not been authorially classi ed yet were attributed to the other genes category (Table 1) [19].
Typical quadripartite and circular structures were discovered. These cp genomes contain a large singlecopy (LSC) of 93,093 -93,631 bp, a small single-copy region (SSC) of 18,813 -18,902 bp, separated by two identical interspersed regions (IRs) of 20,014 -20,117 bp. Among four types of regions, the LSC region contains the largest number of genes, including 66 protein-coding genes and 23 tRNA genes. The SSC region contained only 11 protein-coding genes and one tRNA gene, but its average gene length was the longest at 1,100 bp, far exceeding that of the LSC and IR regions, each with an average length of 625 and 639 bp, respectively. Two identical IR regions contained 5 protein-coding genes, 6 tRNA genes, along with 4 rRNA genes ( Table 1). The genome features of Litsea are consistent with the basic structure of chloroplasts reported by other studies [20].
We also analyzed the GC content of the complete cp genome for the nine species of Litsea, as well as the values of each region (Table 2). We discovered that the average GC content of the full cp genome was 39.2% for all species except for L. sericea, which was 39.1%. In addition, the GC content of the IR region was rmly consistent at 44.4% and signi cantly higher than the other two regions, which was assumed to be related to the presence of many rRNA genes [21].

Codon usage analysis
All organisms share a common codon table, re ective of the shared ancestry of all life, but through time disproportionate biases have evolved in various clades. Different species exhibit certain preferences for different synonymous codons, and even different proteins within the same species may show a preference for the same amino acid, a phenomenon called codon bias. A measurement called Relative Synonymous Codon Usage (RSCU) removes the effect of amino acid composition on codon usage [20]. The codon usage and RSCU value of coding sequences (CDSs) are reported for the species examined in this study (Table S1). The protein-coding genes in the complete chloroplast genome of L. moupinensis consist of 84 genes coded by 61 codons, which encode 20 amino acids. The results showed that Leu (UUA), Ala (GCU), and Arg (AGA) are the most frequently used amino acids, while Ser (AGC) and Arg (CGC) were the least abundant amino acids (Fig. 2). RSCU values greater than 1 mean that there is signi cant codon bias. This results in a different use of amino acids, which correlates with protein properties and functions [21]. Analysis of RSCU values of the codons encoding each amino acid revealed that most codons with RSCU > 1 contained either an A-or G-terminal. By contrast, RSCU values for codons that ended with a C-terminal, such as CGC (Arg), UGC (Cys), CAC (His), and AGC (Ser), are relatively low. This result was consistent with previous studies [22].

Long repeat and SSR analysis
Simple sequence repeats (SSRs), also known as microsatellites, commonly exist throughout the cp genome, consisting of one to six nucleotide repeats (Cui et al., 2019). Due to its variability at the intraspeci c level, SSRs are commonly used as markers in population genetic analyses [23,24]. In the cp genome of nine species, the total number of repeats ranged from 109 (L. chuni) to 119 (L. auriculata) ( Table S2). 111 SSRs were detected from the cp genome of the representative species L. moupinensis, including 62 mononucleotide, 36 dinucleotide, 3 trinucleotide, 8 tetranucleotide, 1 pentanucleotide, and 1 hexanucleotide repeats. In general, the SSR number decreased along with the increase in nucleotide number. The percentage of tri-, tetra-, penta-, and hexa-nucleotide repeat sequences detected were remarkably lower than that of mono-and di-nucleotide repeats, a phenomenon reported previously [25]. Mono-nucleotide repeats, the largest class of SSRs and consisting of 56.97% of all repeats, are notably rich in A/T bases, causing the differences in terms of base content, which is quite similar to that of other angiosperm species [26]. We also analyzed the distribution of SSRs in LSC/SSC/IR regions. The number of SSR markers in the LSC region of nine species of Litsea ranged from 79 to 87, far exceeding that of SSC (19) and IR regions (12). In particular, IR region contains the lowest number of SSRs, which further demonstrates the high degree of conservatism of IR regions. Some repeats larger than 30 bp in length are called long repeat sequences, which increase the rearrangement of the cp genome [27]. We detected interspersed repeated sequences (IRs) including four types of long repeat sequences: complement repeats (C), forward repeats (F), palindromic repeats (P), reverse repeats (R). Among all types of repeats detected, palindromic repeats (P, 16) were richest in most species, followed by forward repeats (F), with an average number of 12.5 and reverse repeats (R) at 5.8 (Table S3). Complement repeats (C) were notably rare among all species. However, in the cp genome of L. ichangensis, the number of forward repeats (F, 17) were slightly higher than that of palindromic repeats (P, 16). What more, in the cp genome of L. auriculata, the ratio of reverse repeats (R) was more than that of forward repeats (F), which is also different from the other eight species (Fig. 3B). We also measured the number of long repeat sequences with different lengths (Fig. 3C). It was found that long repetitive sequences of length of 20 -21 bp were most common, while the remainder decreased in number with an increase in length, with two exceptions for 33 bp and 44 bp. Notably, the repeats with 29, 31, and 38 bp in length were almost absent, indicating that mutations may have occurred during evolution in the corresponding species.

IR contraction analysis and genome divergence between the Litsea species
The contraction and expansion of IR regions contribute greatly for to variations of cp genomes among different species, resulting in gene duplication, deletion, and the generation of pseudogenes. Studying the characteristic genes of the border region contributes to species identi cation and phylogenetic analyses [28]. In this study, we analyzed and visualized the genes located in the junction region of LSC and IRa (JSa) as well as the junction of SSC and IRb (JSb) in the cp genome of the nine species of Litsea (Fig. 4). JLa represents the junction between LSC and IRa, and the same applies for JLb. In this study, we observed that genes located in the junction of four regions were highly conserved, with only a few variations. Most genes located at cp genome junctions in all nine species differed only in the distance to their corresponding boundaries, such as ycf2, ndhF, trnH, and psbA. To be more speci c, the ycf2 gene spans LSC/IRb and is distributed in both regions of similar length, with the LSC region being slightly longer. The ndhF gene existed among nine species, completely located in SSC and a short distance from IRb except for that of L. sericea, which was longer and closer to the JSb boundary. The trnH gene was located in the LSC region, adjacent to the IRa/LSC border, and was 21 -22 bp in length. PsbA was located entirely in the LSC region. Yet, notable variations were found. The ycf1 gene was absent in this junction, while the remaining eight species contain ycf1Ψ (pseudocopy, 5' end missing) in JSb, which spans JSb with only 4 -5 bp of length located in SSC. Apart from that, the contraction and expansion event located in the JSb was greater than that of the JLa boundary. This pattern is consistent with previous studies [29].
The whole sequence identity plot of nine species within Litsea was analyzed using mVISTA with L. garretti set as the reference sequence for comparison (Fig. 5). Genome sequences of the nine Litsea exhibited a high degree of concordance. In this study, we revealed that most of the variations in the cp genome of different species were distributed in CNS (non-coding sequences) region, and there were two distinct slips in the genome sequence alignment diagram. Notable high-divergent regions in CNS were atpF -atpH and ndhC -trnV-UAC, the divergent value of which exceeded 100%. Other variant regions include rps16 -trnQ-UUG, ycf4 -cemA, rps8 -rpl14, rps12 -trnV-GAC. Some of the coding genes, such as ndhK, ndhF, and ycf1, were found to be highly divergent. In general, the divergence in the IR region was signi cantly smaller than that in the LSC and SSC regions, a result comparable to the divergence analysis.

Nucleotide divergence and SNPs
The highly variable area in the cp genomes of the nine species was studied by a sliding window analysis and variation patterns are depicted in gure 6. The average Pi value in the SSC fragment was the highest, whereas the LSC region uctuated at a little lower value. However, a spike containing psbJ, psbL, psbF, and psbE genes appeared in the LSC region, which is the peak with the largest Pi value in the entire cp genome. At the same time, the nucleotide diversity at the IR fragment was the lowest, indicating a high degree of conservatism, a conclusion supported by previous studies [30]. Moreover, the nucleotide divergent value of some genes was relatively high as well, such as ycf1, ycf2, matK, ndhA, ndhF, rpoC2, trnG-UCC, and trnK-UUU. These results were in line with the ndings of previous studies [31,32] To further explore the divergence of nucleotides, we compared and analyzed Single Nucleotide Polymorphisms (SNPs) and Insertions and Deletions (InDels) of nine species within Litsea. The polymorphism ratio of transition substitution (Ts) was higher than transversion substitutions (Tv) in the LSC region of nine cp genomes ( Table 3). The most substitutions were located in the LSC region, while IR regions contained the lowest rate of polymorphisms. This result is consistent with previous studies [33].
In terms of transition substitutions, the polymorphism ratios of A/G and C/T were almost the same, although the former took up a slightly larger proportion, with only three exceptions (L. auriculata, L. chunii, and L. tsinlingensis). As for transversion substitutions, the polymorphism ratios of A/T and C/G were greatly lower than that of A/C and G/C substitutions. The same pattern applied for InDels (Table 4). LSC presented the largest number of InDels in comparison with IR and SSC regions, while the average length of InDels in IR regions was the longest, with the longest variation length at 678 bp (L. tsinlingensis). It is worth mentioning, in the cp genome of L. auriculata, the average length of InDels located in the IR regions contained a considerable number of small InDels rather than only several long InDels, as was found in the other eight species, causing its average length to be three times shorter than others, indicating that L. auriculata may have experienced some degree of mutation during its evolution that differed from related species.

Phylogenetic analysis
The expanding cp genome database provides an important basis for determining evolutionary relationships [30,34]. Phylogenetic trees based on different data had slightly varied topologies, with trees based on the whole cp genome and CDS data having the same topology, and being more credible than trees based on the IR area and introns [35][36][37][38]. We found two similar topological structures with few changes based on the full cp genome and the protein-coding sequences of 23 selected species, with Neolitsea sericea and Actinodaphne obovate as outgroup species (Table S3, Fig 7). We assumed that the variations presented by the different trees may be rooted in the changes occurring in the IGS (intergenic spacer) regions, clarifying the importance of a greater knowledge of non-coding regions. Despite minor differences, the phylogenetic relationships of most species in the two topologies were consistent, showing similar genetic a nities in the topology, and which aligned nicely with the elevational distribution of the species [39][40][41].

Conclusion
We sequence and report complete cp genome sequences from nine species of Litsea, revealing typical quadripartite and circular structures. The cp genome size of nine Litsea ranged from 152,051 bp to 152,747 bp. We performed long repeat and SSR analyses of the complete cp genomes of nine species of Litsea to better study practical gene markers. Litsea auriculata contained the most SSR sequences, while L. chunii had the least. Codon bias was observed in the protein-coding region. In the comparative analysis, differences between species occurred mainly in non-coding sequences, except for a few highly divergent coding genes, such as ycf1. We also observed the contraction and expansion of IR boundaries, which caused gene loss, changes in gene length, and the occurrence of pseudogenes, resulting in differences between the species. In terms of nucleotide differences, the LSC region had the largest number of nucleotide variants, but the variation frequency was lower than that of the SSC region. The IR regions showed a high degree of conservation. Phylogenetic relationships within the genus were explored using two sets of data from the complete cp genome and another from 84 protein-coding genes for 21 species of Litsea and two outgroup species. Essentially the same conclusions were obtained: L. moupinensis and L. rubescena, L. chunii and L. tsinlingensis were sisters in the phylogenies and showed similar genetic relationships consistent with their elevational distributions. This study provides aid to taxonomic studies for Litsea, providing speci c genetic markers for taxon identi cation and for inferring evolutionary relationships among the species. These data may also contribute to future conservation efforts as well as the practical use of these species.

Sample collection, DNA extraction, and sequencing
In this study, the nine species of Litsea were collected from Plant Germplasm and Genomics Center, Kunming Institute of Botany, the Chinese Academy of Sciences, and was approved by Kunming Institute of Botany and local policy. Fresh leaf tissue was collected without apparent disease symptoms and preserved in silica gel. Total genomic DNA was extracted from fresh leaves using modi ed CTAB [42], and the quantity and quality of the extracted DNA was assessed by spectrophotometry while the integrity was evaluated using a 1% (w/v) agarose gel electrophoresis [18]. The Illumina TruSeq Library Preparation Kit (Illumina, San Diego, CA, USA) was used to prepare approximately 500 bp of paired-end libraries for DNA inserts, according to the manufacturer's protocol. These libraries were sequenced on the Illumina HiSeq 4000 platform in Novogene (Beijing, China), generating raw data of 150 bp paired-end reads. About 22.5 Gb high quality, 2 × 150 bp pair-end raw reads were obtained and were used to assemble the complete cp genome of Litsea.

Chloroplast genome de novo assembly and annotation
The raw data were preprocessed using Trimmomatic 0.39 software [43], including removal of adapter sequences and other sequences introduced during sequencing, removal of low-quality and over-N-base reads, etc. The quality of newly produced clean short reads was assessed using FASTQC v0.11.9 [44] and MULTIQC software [45], and high-quality data with Phred scores averaging above 35 were screened out. According to the reference sequence (Litsea glutinosa), the chloroplast-like reads were isolated from clean reads by BLAST [46]. Short reads were de novo assembled into long contigs with SOAPdenovo 2.04 [47] by setting kmer values of as 35, 44, 71, and 101. Finally, the long-contigs complete sequence expansion and gap lling using Geneious ver 8.1 [48], which forms the complete cp genome. The complete cp genome was further validated and calibrated by using de novo splicing software NOVOplsty 4.2 [49]. GeSeq [50] was used to annotate the assembled genomes, and tRNAscanSE ver 1.21 [51] was applied to detect tRNA genes with default settings, and RNAmmer [52] was used to validate rRNA genes with default settings. As a nal check, we compared the results with the reference sequence and corrected misannotated genes by GB2Sequin [53] in an arti cial manner. The circular map of the genomes was drawn by using Organellar Genome DRAW (OGDRAW) [54]. The nine newly assembled Litsea cp genomes were deposited in GenBank with the accession numbers MW802253-MW802261.

Analysis of chloroplast genome characteristics
Information regarding GC content, genome length, and number of each region in cp genomes was obtained using Geneious Prime software [48]. RSCU (Relative synonymous codon usage) was calculated by Computer Codon Usage Bias function in MEGA X [55]. SSRs were identi ed using MISA [56], with a setting of ten repeats for mononucleotide SSRs, four for dinucleotide and trinucleotide SSRs, and three for tetranucleotide, pentanucleotide, and hexanucleotide SSRs. REPuter [57] was used to identify four types of repeats with the minimum repetition unit set as 20 bp and the maximum as 300, and the remaining options set to default parameters.

Whole genome alignment
To compare the gene differences among the nine species, L. garrettii was selected as the reference species and the online comparison tool mVISTA [58] was used for sequence alignment. IRscope [39] was used to detect and visualize the contraction and expansion of IRs boundaries. We also used DNAsp 6 software [60] to calculate the nucleotide diversity (Pi) of the cp genome with 1000 bp window length and 50 bp step size.

Phylogenetic analysis
We downloaded 12 further chloroplast genomes of Litsea from NCBI (National Center for Biotechnology Information). Two species from Lauraceae but in different genera, Actinodaphne obovate and Neolitsea sericea, were selected as outgroups to root our phylogenetic networks. A total of 23 species were compared for phylogenetic evaluation. MAFFT v7 was used to perform multiple genome alignment [61], and we used the complete cp genome sequence data as well as a separate dataset of 84 protein-coding genes to construct individual maximum likelihood (ML) topologies. The MPL analyses were performed using MEGA X [55], and bootstrap tests were performed with 1000 replicates with tree bisectionreconnection branch swapping.

Declaration of Competing Interest
The authors declare no con ict of interest associated with the work described in this manuscript.   Complete genome map of the chloroplast genome for representative species L. moupinensis. The inner grey ring is divided into four areas, Clockwise, and they are: SSC, IRb, LSC, and IRa. The genes in the outer ring region are transcribed clockwise, while those in the inner ring are transcribed counterclockwise. In addition, this gure also re ects the GC content, the inner ring dark gray indicates the GC content, the light gray reaction AT content. In the lower left is a legend that classi es chloroplast genes according to their functions. The frequency (Freq) and preference of codon usage (RSCU) in L. moupinensis protein-coding region.
Abscissae indicate each amino acid and its abbreviation as well as the respective codon, the blue bar in the gure is the frequency of codon usage, and the orange line represents codon preference.