Comparative Chloroplast Genome Analyses of Ilex (Aquifoliaceae): Insights Into Evolutionary Dynamics and Phylogenetic Relationships

Despite many species of Ilex (Aquifoliaceae) are of horticultural importance and are widely grown in parks and gardens throughout the world for their foliage and decorative berries, limited genetic information has greatly hampered our understanding of the chloroplast genome evolution and phylogenetic relationships within the genus. This study attempted to address these problems by comparing the chloroplast genomes and analyzing phylogenetic relationships within the genus. In this study, analyses of chloroplast genome structure, codon usage, GC content, gene rearrangement, nucleotide diversity, inverted repeats (IR) boundary, repeat sequence, and SSR component were conducted by comparing 41 chloroplast genomes of Ilex. The results showed that these Ilex chloroplast genomes were evolutionary conserved at the genome level and no rearrangement of the complete cp genome in the 41 Ilex genomes was recorded. On the contrary, there were still a few mutational hotspots identied from these chloroplast genomes, which were considered as hypervariable regions useful for future phylogenetic studies and DNA barcoding. Using the complete chloroplast genome sequences, we reconstructed a highly supported phylogeny of Ilex and well-resolved the complicated relationships among the different lineages within Ilex. The present study increased our understanding of the chloroplast genome evolution and phylogenetic relationships within Ilex. The availability of these genetic resources will be helpful for the future studies in DNA barcoding, species delimitation, phylogenetic reconstruction as well as the hybridization and introgression events between distantly related lineages within Ilex.


Introduction
Ilex L., comprises of ca. 600 evergreen or deciduous tree and shrub species, is the only genus in the family Aquifoliaceae [1]. Members of the genus are mostly distributed in the tropics, but also extend into some temperate regions, making its center of species diversity to be located at the tropical America and the south-east Asia regions [2,3]. Most species of Ilex such as, I. cornuta Lindl. et Paxt., I. purpurea Hassk., I. paraguariensis A. St.-Hil., and I. rotunda Thunb., contain important economic values [4][5][6], while many of them contain high endemism and are limited to certain geographic distributions. To date, as much as 250 species of Ilex have been classi ed as endangered species based on the International Union for Conservation of Nature (IUCN) red list [7].
In the past two decades, the advances in sequencing technology and analytical methods have contributed to the phylogenetic reconstruction of the genus Ilex using several plastid DNA fragments, including rbcL, trnL-trnF, and atpB-rbcL as well as the nuclear ribosomal DNA internal transcribed spacers (nrITS) and chloroplast glutamine synthetase gene (nepGS) sequences [8][9][10][11][12][13][14]. Although a large taxon sampling of Ilex has been achieved, the phylogeny of Ilex is still not completely resolved [11,14]. Substantial incongruence between the nuclear phylogeny and the plastid phylogeny of the genus Ilex has been found in recent phylogeny studies, but only a few plastid or nuclear gene fragments were used in these studies and the resolution was generally poor due to their relatively conserved features in some plastid genes [8, [11][12][13]. Furthermore, the ndings obtained from molecular phylogenetic analysis did not complement to those previously proposed traditional classi cation of Ilex based on morphological features [15,16]. At present, the phylogenetic relationships among lineages in genus Ilex remain uncertain, thus, further investigations are requisite to decipher the status and evolution pattern of this genus.
In addition to the commonly used short plastid gene sequences that have been useful in analyzing the phylogenetic relationship in most land plants, the complete chloroplast genome sequences have been tested to be more promising in resolving the relationships of many genera at different taxonomic levels [17][18][19]. In general, the chloroplast genome is relatively stable, and usually contains four extremely evolutionarily conserved parts: a pair of inverted repeat regions (IRa and IRb), a large single-copy region (LSC), and a small single-copy region (SSC) in most land plants [20]. At the same time, it contains a large amount of effective information with a suitable mutation rate that are su cient for phylogeny reconstruction and species delimitation [21].
To date, the complete chloroplast genome sequences of a total of 34 Ilex species have been made available so far in the NCBI GenBank database (accessed on 1 August 2021). Yet, this number only accounts for ca. 5.7% of the total number of species recorded in the genus. Despite the limited number of the complete chloroplast genome reported, phylogenetic analysis of Ilex based on the complete chloroplast genome sequence could still provide some insights to the relationship between species of the genus. In the effort to expand the current genetic resource of Ilex at chloroplast genome-scale, herein, we newly sequenced the chloroplast genomes of seven species of Ilex, including I. dasyphylla, I. fukienensis, I. lohfauensis, I. venusta, I. viridis, I. yunnanensis, and I. zhejiangensis. Three of them, Ilex fukienensis, I. venusta, and I. zhejiangensis, were known to have a very narrow distribution in China [13,22], while the other four species are widely distributed in China and adjacent regions. We aimed: (i) to investigate the structural and compositional variations of the chloroplast genomes in genus Ilex by involving more lineages; (ii) to identify highly variable regions in the chloroplast genomes that could be used as potential markers in resolving the interspeci c relationship and species delimitation of Ilex; and (iii) to reconstruct a high resolution phylogenetic tree based on the complete chloroplast genome sequences that is further tested for its cyto-nuclear discordance by comparing to existing nuclear-based phylogenetic tree of the genus Ilex.

Results
Chloroplast genome structure of Ilex species All the seven newly sequenced chloroplast genomes obtained through this study possess a typical quadripartite structure of a vascular plant, which consisted of two single-copy regions (LSC and SSC) that are separated by a pair of inverted repeats (IRa and IRb) ( Figure 1). By taking account of the 34 chloroplast genomes downloaded from the GenBank, the length of the chloroplast genome of Ilex ranged from 157,119 bp (I. gracili ora) to 158,020 bp (I. kwangtungensis). The length of the LSC region was ranged from 86,506 bp (I. gracili ora) to 87,414 bp (I. crenata), the SSC regions was ranged from 18,228 bp (I. yunnanensis) to 18,447 bp (I. lohfauensis), and the IR regions ranged from 26,005 bp (I. vomitoria) to 26,121 bp (I. rotunda) ( Table 1). The GC content was from 37.60-37.69% (Table 1). All the chloroplast genomes obtained in this study contained 116 genes, including 82 protein-coding, 31 tRNA, and four rRNA genes, except for I. dasyphylla, which had 83 protein-coding genes ( Table 2 and 3). All chloroplast genomes had the same gene order and gene arrangement.  Table 2 List of annotated genes in the chloroplast genomes of the Ilex species.

SSR Polymorphisms and Long Repeat Sequence Analysis
A total of 2,146 simple sequence repeat (SSR) loci were detected among the 41 Ilex chloroplast genomes, with a length variation from 10 to 168 bp ( Figure 6, Additional les 2: Table S2). The mononucleotide repeats were most abundant (1,771), while tetranucleotide repeats had the least occurrence (49). In contrast, the number of di-, trinucleotide and compound repeats were 109, 79, and 138, respectively. For mononucleotide repeats, the A/T repeats occurred the most (1769) Table S4). For stop codon, the amino acid UAA was like preferred when compared to UAG. Overall, codons with A/U-ends were more preferred than those with G/C-ends among the amino acids in Ilex species (Figure 8, Additional les 4: Table S4).

Phylogenomic analyses
The phylogenetic trees of both the maximum likelihood (ML) and Bayesian inference (BI) were reconstructed based on the 52 complete chloroplast genomes. The closely related species Helwingia himalaica (NC031370) was selected as outgroup [23]. The total alignment length of the dataset used was 157,836 bp after trimmed. A total of 8,869 variable sites and 1,735 parsimony informative sites were detected in the alignment. The phylogenetic trees based on the ML and BI displayed similar topologies; thus, the posterior probability (PP) values were merged into the ML tree for clearer presentation (Figure 9). The monophyly of Ilex was well supported by the two analyses ( Fig. 8; ML BS: 100%; BI PP: 1.00).
Based on our phylogenetic analysis and with consideration of macro-morphological and distribution information, the genus Ilex was categorized into ve highly supported clades (Figure 9; ML BS: 100%; BI PP: 1.00). The relationships among the ve major clades were well resolved with high resolution (ML BS: 100%; BI PP: 1.00) in the present phylogenetic study. Members of clade D and clade E were sister to the rest, followed by the clade C, which is sister to a clade containing clade A and clade B. The relationships within each clade were also well resolved with high support values.

Comparison of the chloroplast genome in genus Ilex
The chloroplast genomes of Ilex were found to possess the typical quadripartite structure with similar genome size of most land plant genomes [20]. At genus level, all 41 chloroplast genomes showed high conservatism in term of genome structure, but still exhibited moderate variations among different species. The genome length of the chloroplast genes of Ilex was similar among different species with a maximum 901 bp difference in length. Expansion and contraction events at the SC/IR boundaries usually give rise to the change of the chloroplast genome length [24]. Although small variations around the IR junctions were detected in the present study, the IR regions of the chloroplast genomes of these Ilex species showed a modest expansion or contraction and the length of IR regions varied from the longest 26,121 bp to the shortest 25,080 bp.
In addition, variation in gene spacer region, the loss or gain of genes and introns also play an important role in determining the chloroplast genome size of most plants [20,25]. In the seven newly sequenced chloroplast genomes, except for I. dasyphylla, all species lacked the gene psbI. The event of gene loss in genus Ilex (e.g., deletions in the trnT-trnL and ycf4-cemA spacers of I. gracili ora) has been identi ed in an earlier report [26], which suggested that the gene loss in Ilex could be a common phenomenon. The length of the SSC and the IR regions were very similar, while that of LSC region showed a difference of about 900 bp (Table 1).

SSRs and simple sequence repeat analyses
SSRs derived from the plastid region were commonly known to be the markers of choice by population genetics and evolutionary studies owing to their high polymorphism rate and abundant variation at the species level [27]. In the present study, a total of 2146 SSR loci were identi ed from the 41 Ilex chloroplast genomes (Figure 6, Additional les 2: Table S2). Few studies have involved in the population genetics using data of SSRs in genus Ilex, and these SSR loci will contribute to further studies of genetic diversity and polymorphisms at the population, intraspeci c and cultivar levels, as well as phylogeography for a certain species in Ilex.
Long repeat sequences with length longer than 30 bp are important because they might play important roles in the insertion/deletion mismatches and rearrangements that led to genomic variation [28][29][30] and the increase in the genetic diversity of populations [31]. We found that the number of long repeat sequences in Ilex is comparably higher to some other angiosperm species (e.g., 364 long repeats in Oxalidaceae [32]; 403 in Veratrum [33]; 32 in Oresitrophe rupifraga, and 34 in Mukdenia rossiiand [34]). Among these long repeats, the forward, palindromic and tandem repeats were rather common, accounting for 33.84%, 30.81%, and 34.44% of the total number of repeats, respectively, while the complementary and reverse repeats were quite rare, only accounting for 0.42% and 0.49%, respectively.
Hypervariable regions for future studies of phylogeny and DNA barcoding Hypervariable regions are usually used as effective evidence for delimitation of closely related taxa as well as to provide a wealth of phylogenetic information [35,36]. In general, sequence divergence in IR regions is more conserved than that in the SSC and the LSC regions [37]. In the present study, the eight hypervariable regions including four genes and four gene with their adjacent and overlapped regions were detected and selected as the most divergent loci (Table 4). All the hypervariable loci were distributed in the SC regions, and also showed lower levels of variation in the IR regions, which was consistent with most angiosperm genera [32,33].
Previous phylogenetic analyses of Ilex that were based on a handful of plastid markers (mainly including atpB-rbcL, psbA-trnH, rbcL, and trnL-trnF) did not generate well-resolved interspeci c relationships within genus Ilex [1,2,8,11,13,[38][39][40]. When comparing these markers to those of highly variable regions identi ed from this study, only one of them, which is the rbcL region, was present. Therefore, we proposed the use of the eight highly variable regions reported in this study to be used as potential markers for reconstruction of the phylogenetic tree based on short plastid regions. Furthermore, with the high resolution at species level, a powerful DNA barcode can be obtained for Ilex that is suitable for species identi cation and delimitation. However, further studies are required to evaluate the delimitation strength in these regions prior to conducting studies on the DNA barcoding of Ilex.

Phylogenetic inference
Researchers have attempted to resolve the phylogenetic relationships among major lineage of Ilex and to test the consistency between the molecular phylogeny and the traditional taxonomic systems based on morphology evidence [8-13, 23, 38]. However, these studies failed to elucidate the phylogenetic relationships effectively, especially at the species level. The presences of poor species resolution and weak support at most branch nodes were due to lack of su cient variable information sites when using short DNA markers [8, 10-12, 23, 38]. Undoubtedly, this limitation can be addressed by using longer DNA sequences that contain an abundant of variable sites [41]. The complete chloroplast genome sequences containing su cient informative sites and can obviously resolve the phylogenetic relationships at almost any taxonomic level [18]. Earlier phylogenetic analyses using the complete chloroplast genome sequences based on a small quantity of samples achieved the same conclusion [14,26,42].
At present, the relationships within the genus Ilex were well resolved based on the current phylogeny, and ve major clades (lineages), clade A to clade E, were identi ed with consideration of macro-morphological and distribution information (Figure 9).
The main lineages were largely consistent with the previous plastid phylogeny, while the relationships among these major lineages were largely discrepant [8, 11,13]. Our results showed that the American groups (clade E) and the Eurasia groups (clade F) were sister to each other, while both of them formed the basal clade of the genus. These two basal lineages were sister to the rest (clade A-C), which were mostly from the Asia region. This nding was not consistent to the previous results produced by Manen [11], of which the American groups (group 3) and the Eurasia groups (group 4) were the most recent diversi ed groups. The discordance between our results and the previous conclusions mainly owed to the fact that lacking genes from hypervariable regions by their phylogenetic studies and the low resolution among major clades [8,11].
The inconsistency between the plastid phylogeny and the traditional taxonomic systems was also revealed by the present studies ( Figure 9). Almost all the subgenera, sections and series included in the analysis were support as non-monophyly (both the two subgenera, ve out of the six sections, and six out of eleven series). We did not rule out the possibility that more non-monophyly groups will be present in this plastid-based phylogenetic tree when more taxa are involved. In addition, though the resolution of phylogenetic tree in early studies was quite low, they indicated that the nuclear and plastid phylogeny were highly incongruent and the nuclear phylogeny was more consistent with traditional morphological classi cation than the plastid one [11]. We here con rmed their conclusions by improving the resolution of plastid phylogenetic tree using the complete chloroplast genome data.
In general, species of the same origin should be clustered next to each other in the same clade due to them having the same inheritance. This is proven accurate for some of the Ilex species present in our study, including I. cornuta, I. dasyphylly, I. latifolia, and I. integra. However, the accessions from I. pubescens and I. lohfauensis displayed otherwise; the two accessions of I. pubescens were resolved in two distinct clades (clade A and clade B), and the two accesions of I. lohfauensis were separated from each other, with I. championii placed in between them. Three samples of I. viridis were clustered with morphologically similar species I. tri oral. Some of the non-monophyly species might be cause by genomic capture or hybridization events, as well as misidenti cation caused by taxonomic confusion among different taxa. Howsoever, our well resolved phylogeny will help to clarify the taxonomic confusion among closely related species in the genus.

Conclusions
In this study, the characterization and phylogenetic analysis of the chloroplast genome of Ilex have been carried out, and as a result, an in-depth understanding on the evolution of Ilex at genome-scale level was obtained. However, we proposed the future studies on Ilex should focus on the phylogenetic reconstruction based on the nuclear region. We propose the capturing of low-copy nuclear genes from the genome-skimming data, which is thought to provide a better resolution for most plant species when compared to short nuclear DNA markers, i.e. ITS, and incorporate it with the phylogenetic tree based on the complete chloroplast genomes as well as morphology analyses, to enhance our understanding of the complex evolutionary history of genus Ilex.

Materials And Methods
Taxon sampling, DNA extraction, and sequencing Seven species of Ilex, including I. dasyphylla, I. fukienensis, I. lohfauensis, I. venusta, I. viridis, I. yunnanensis, I. zhejiangensis. Ilex fukienensis, I. venusta, and I. zhejiangensis, were collected from their natural accessions in China. Fresh leaf specimens were collected in the eld and stored in silica gels prior to be sent to the laboratory for DNA extraction. Voucher specimens were prepared and deposited at the herbarium of Nanjing Forestry University (NF). In addition, 34 complete chloroplast genomes of Ilex species that are publicly available in the NCBI GenBank were downloaded for their sequences in fasta format and kept for subsequent comparative analyses (Additional les 5: Table S5). Based on the classi cation of Ilex that is generally accepted [22], the current dataset comprised of species from six sections and 11 series of the genus Ilex.
Total genomic DNA was extracted using the Plant Genomic DNA Kit (Tiangen Biotech, China) following the manufacturer's protocol. The DNA extract was inspected with agarose gels, and quanti ed using Qubit 2.0 (Life Technologies, Country) for its integrity, purity, and concentration. The quali ed DNA (≥50 ng) was used to construct a paired-end (2×150 bp) library, and sequencing was conducted on a HiSeq X Ten platform (Illumina, USA). The generated raw data were ltered with fastp v.0.20.0 software [43] to remove low-quality reads.

Chloroplast genome assembly and annotation
The ltered raw data were fed into the NOVOPlasty 2.6.3 [44] pipeline for genome assembly, with the rbcL gene sequence of I. latifolia (Accession number: KX897017) as the seed sequence and the chloroplast genome sequence of I. latifolia (Accession number: MN688228) as reference genome. A contig was obtained at the end of the process, and annotation was conducted using Plann [45], in which the annotated chloroplast genome of I. latifolia (Accession number: MN688228) was set as reference. Start and stop codons in the chloroplast genomes were manually corrected using DOGMA [46], while the tRNA genes were veri ed using tRNA scan-SE v2.0.3 embedded in GeSeq [47] based on default parameters. The circular chloroplast genome maps were visualized using OrganellarGenomeDRAW [48].

Genome comparative analyses
Sequence alignment of the 41 complete chloroplast genomes was carried out using MAFFT v.7 [49] and the alignment was further trimmed using trimAI v1.2 using the "-gappyout" setting [50]. The expansion and contraction of the IR regions were visualized using online IRscope [51] and then was manually checked. The nucleotide diversity (Pi) was estimated using DnaSP v.5 [52] with a step size of 200 bp and a window length of 800 bp. The genome variability across the 41 species of Ilex was assessed using mVISTA [53], under Shu e-LAGAN mode. Mauve version 2.3.1 [54] plug-in that was available in Geneious version 11.0.3 [55] was used to identify locally collinear blocks presented among the chloroplast genomes based on default parameters.

Repeat sequence identi cation
The number of large repeats, including forward, palindromic, reverse, and complement repeats were identi ed using the onlineREPuter program [56] according to the following criteria: sequence identities of 90%, cutoff point at ≥30 bp, Hamming distance set at 3, and a minimum repeat size of 30 bp. Tandem Repeat Finder [57] was used to analyze tandem repeat sequences with the default parameters. SSRs were identi ed using web-MISA [58], with minimum repeat number set at 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexanucleotides, respectively. Compound SSRs were detected by identifying independent SSRs that are separated by less than 100 nucleotides and were combined into one.
Codon Usage Analysis RSCU is de ned as the ratio of the observed frequency of codons to the expected frequency. The sequences of 77 shared proteincoding genes were extracted and concatenated using PhyloSuite v1.2.1 [59]. The value of RSCU was calculated using MEGA X [60].
A dendrogram based on the value of RSCU was constructed using pheatmap, which is available in R package (https://cran.rproject.org/web/packages/pheatmap/).

Phylogenetic Analyses
Phylogenetic analysis was conducted based on 51 complete chloroplast genome sequences representing 39 Ilex species from six sections and 11 series. Based on the nding from Yao et al. [23], Helwingia himalaica (Accession number: NC031370) was selected as the outgroup. Genome alignment was carried out using MAFFT v.7 [49] and then trimmed using trimAI v1.2 with the "-gappyout" setting [50] Maximum likelihood (ML) analyses were conducted using IQ-tree  Sequence alignment of 41 Ilex chloroplast genomes using the mVISTA program with I. szechwanensis as a reference. The vertical scale indicates the percent identity, ranging from 50% to 100%. The horizontal axis indicates the location within the plastomes.
Genome regions are color-coded as exon, intron and untranslated regions (UTRs).

Figure 4
Mauve multiple alignment of 41 Ilex chloroplast genomes revealing no interspeci c rearrangements.   Phylogenetic trees inferred from maximum likelihood (ML) and Bayesian inference (BI) analyses of 52 accessions based on the complete chloroplast genomes. Numbers near the nodes are ML bootstrap support values (BS, left of the slashes) and Bayesian posterior probabilities (PP, right of the slashes). 100% BS or 1.00 PP were indicated by asterisks. Incongruences between the BI and ML trees were indicated by dashes. Hu's classi cation was illustrated by color graphic pattern. Recognized groups (major clades) were also marked by the right-hand black bar.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.