Comparative analysis of the complete chloroplast genome of seven Nymphaea species

Background: Although there has been a long history of cultivation and research on Nymphaea, the taxonomic relationships and evolutionary relationships among Nymphaea species remain controversial. The chloroplast (cp) genome can provide a new method to determine species origin, evolution, and phylogenetic relationships of Nymphaea. However, there are few studies on the cp genomes of Nymphaea, and the data of their genomes are very scarce. The complete cp genomes of seven Nymphaea species were sequenced by high-throughput sequencing technology, and the structural characteristics and phylogenetic relationships of cp genomes were analyzed. Results: A total of 126–129 genes were annotated in seven Nymphaea species, including 81–84 protein coding genes, eight rRNA, and 37 tRNA genes. A comparative cp genomic analysis of seven Nymphaea species showed that the cp gene sequence of Nymphaea was consistent, with no signs of reverse rearrangement. The codons preferentially ended with A/U. The cp genomes of seven Nymphaea species contained 147–168 dispersed repeats with a length of 15–19 bp and 979 simple sequence repeats (SSRs). Using N. colorata as the reference sequence, a total of 8,328 single nucleotide polymorphisms (SNPs) and 1,579 insertions/deletions were obtained. The degree of variation of the cp genome of the seven Nymphaea species in rpoA–rpl20, rbcL–ndhC, ndhD–ndhF, and trnN-GUU–ndhA regions is relatively high; related regions can be used as potential molecular markers for population genetics research. KaKs analysis showed that the ycf2 gene was positively selected. The results of the phylogenetic analysis showed that the genus Nymphaea can be divided into ve branches: subgenus Nymphaea, subgenus Hydrocallis, subgenus Lotos, subgenus Anecphya, and subgenus Brachyceras. Conclusions: The cp genome of Nymphaea is very conservative in structure and composition, but it has rich variation in LSC and SSC regions. The phylogenetic analysis showed that Nymphaea could be further divided into ve subgenera, and Euryale and very conservative in structure and composition, but it has rich variation in LSC and SSC regions, which can be used for germplasm identication, phylogenetic analysis, and evolution research. The phylogenetic analysis showed that Nymphaea could be further divided into ve subgenera, and Euryale and Victoria were most closely related to Nymphaea. This study not only enriched the genetic information on Nymphaea, but also laid a theoretical foundation for the evaluation of water lily germplasm resources, molecular breeding, development of SSR molecular markers, and research into genetic diversity. Moreover, this study provided support for the resolution of the phylogenetic relations among Nymphaea species.

some controversies on the phylogenetic classi cation of Nymphaea. For example, based on the noncoding cp markers, some studies have supported the division of Nymphaea genera into three subgenera Subgenus Nymphaea followed by Subgenus Anecphya-Brachyceras and another clade comprising Subgenus Hydrocallis-Lotos [6,7]. In addition, relationships among species within Nymphaea remain unclear. For example, using the cp trnT-trnF sequences to construct phylogenetic trees of Nymphaeaceae, a previous study has supported three subbranches within Subgenus Nymphaea [7]. However, a phylogeny was constructed using the internal transcribed spacers (ITS) region of nrDNA and the study supported the division of Subgenus Nymphaea into two subbranches [8]. There are different understandings of the evolutionary relationships among Nymphaea species; therefore, it is of great signi cance to study its phylogenetic relationship.
The cp genome is relatively conservative in gene composition and structure compared with the nuclear and mitochondrial genomes [9,10]. Because of its small size and relatively conservative structure, the cp genome has become an ideal model for evolutionary and comparative genomic research [11], and its structure and sequence information are of great value in revealing the origin, evolution, and relationships among different species. With the rapid development of high-throughput sequencing technology, research on the cp genome has increased [12,13]. However, there are few studies on the cp genomes of Nymphaea, and the data of their genomes are very scarce.
In this study, the cp genomes of seven species of Nymphaea were sequenced by high-throughput sequencing technology. The full length of cp genomes was obtained and the composition, structure, and phylogenetic relationships of cp genomes were analyzed in order to enrich the genetic information of Nymphaea and provide a theoretical basis for future research on phylogeography, phylogenetics, and conservation of Nymphaea.

Results
The basic structural characteristics of the cp genome of seven species of Nymphaea After high-throughput sequencing, raw data was obtained. Low quality reads were removed, leaving the remaining paired-end reads of seven Nymphaea species. The paired-end reads ranged from 19,642,319 reads for N. odorata to 26,112,273 reads for N. micrantha. After performing de novo genome sequencing and assembly, seven complete cp genome sequences for N. odorata, N. rubra, N. gigantea, N. potamophila, N. colorata, N. tetragona, and N. micrantha were obtained (Fig 1; Table 1) and submitted to GenBank under the following accession numbers: MT107636, MT107632, MT107637, MT107633, MT107631, MT107634, and MT107635, respectively.
These seven novel cp genome sequences had the classical quadripartite structure that contained one LSC, one SSC, and two IR (IRa and IRb) regions (Fig 1). The cp genomes of seven species of Nymphaea had different sizes, among which the largest was N. gigantea at 160,179 bp long and the smallest was N. potamophila at 159,232 bp (Table 1). The length of the LSC ranged from 89,450-90,266 bp, SSC ranged from 19,340-1,965 bp, and IR ranged from 25,163-25,232 bp (Table 1). Therefore, the variation of the length of the LSC is greater than that of SSC and IR, and the difference in genome length was mainly due to the varying lengths of the LSC and SSC. The GC content of the seven Nymphaea cp genomes was 39% (Table 1); the GC contents in IR (~43%) of these seven Nymphaea species were higher than that of LSC and SSC regions (~37% and 34%).
The IR and SC boundaries of the cp genomes of the seven Nymphaea species were compared and analyzed. As shown in Figure 2, the LSC/IRb boundaries of the seven Nymphaea species were all within the rpl2 gene, and the length of the overlapping region between the boundary and the rpl2 gene was 15-39 bp. The IRb/SSC boundaries of the six species were between trnN and ndhF, and the length of the ndhF gene entering SSC was 49-80 bp. However, the SSC region of N. odorata lacked an ndhF gene and there was a 522 bp gap between the boundary and the trnN gene in the IRb region. The SSC/IRa boundaries of ve species were all located in ycf1gene, the length of the gene in the SSC region was between 5,659-5,719 bp, and was between 155-167 bp in the IRa region. No ycf1 gene was found in the SSC regions of N. odorata or N. tetragona, and there were 521 bp and 545 bp gaps between the trnN gene in the border and IRa regions, respectively.

Codon preference
The codon preferences of the cp genome of seven Nymphaea species were analyzed, and there was no signi cant difference among the genomes. N. colorata was used as an example to determine the codon preference of the cp genome of Nymphaea. A total of 25,995 codons were detected in 79 protein coding genes of the N. colorata cp genome and their RSCU values were calculated (Additional le 1). Among these codons, 2,636 were used to encode leucine (Leu), accounting for 10.1% of all codons, with UUA being the most commonly used. The content of codons used to code Tryptophan (Trp) was the lowest, 447 in total, accounting for 1.7% of all codons. The RSCU values of 32 codons were greater than 1, of which 28 ended with A/U, indicating that the codon ending preference was A/U. The RSCU of tryptophan (Trp) was equal to 1, indicating that Trp has no codon preference.

Analysis of dispersed repeats and simple repeat sequences (SSRs)
The dispersed repeats of the cp genome include forward, reverse, palindrome, and complementary types.
Using Vmatch v2.3.0 software, the dispersed repeat sequences of seven Nymphaea species were identi ed. For N. odorata, N. rubra, N. gigantea, N. potamophila, N. colorata, N. micrantha, and N. tetragona, 163, 147, 151, 158, 153, 155, and 168 repeat sequences, respectively, with a length of 15-19 bp were detected in the cp genome and were distributed in the coding region, the gene interval region, and the gene intron region (Table 3). There was little difference in the total number of dispersed repeats of seven Nymphaea species, and the length of the sequences was concentrated between 15-19 bp, accounting for 83.4-88.0% of the total dispersed repeats (Table 3).

Collinearity, KaKs, Pi, SNP, and indel analysis
Although the cp genome of plants is conservative, there will be rearrangement over the course of evolution. In this study, we selected the cp genome of N. colorata as the reference sequence for comparison with the other six cp genomes of Nymphaea. The results showed that there was no reverse rearrangement in the cp genomes of the seven species of Nymphaea, indicating a collinear relationship (Fig 3). In addition, the sequence differences of 105-115 kb of the cp genome of the seven species of Nymphaea are large (Fig 3); therefore, it can be used to develop molecular markers for future genetic analyses of Nymphaea.
The synonymous substitution rate (Ka), missense substitution rate (Ks), and Ka/Ks of seven species of Nymphaea were calculated by using the cp genome of N. colorata as the reference sequence (Additional le 2). Most of the cp genes have low Ka and Ks, indicating the conservation of the evolutionary process of the cp genome. Ka/Ks < 1 of most genes indicate that they may undergoing purifying selection, i.e., the mutation is harmful and eliminated in the population. Among the genomes, the Ka/KS of ycf2 in the six comparison groups, N. colorata vs. N. gigantea, N. colorata vs. N. micrantha, N. colorata vs. N. odorata, N. colorata vs. N. potamophila, N. colorata vs. N. rubra, and N. colorata vs. N. tetragona were all greater than 1, indicating that ycf2 was subject to positive selection effect.
The Pi values of eight genes (petN, petG, psaI psbF psbL psbM psbN and rps18) in the LSC region, all tRNAs, the psaC gene in SSc region, and ve genes (rrn4.5s rrn5s rrn16s rps7 and rpl23) in the IR region were 0 (Additional le 3), indicating that these genes were relatively conserved in the seven Nymphaea species. Some genes had relatively high Pi values, for example, rpl32, rps19, rpl20, and ycf1 had Pi values of 0.012579, 0.011947, 0.009195, and 0.006307, respectively, indicating that the greater the number of polymorphisms in these genes in Nymphaea, the more abundant their genetic diversity. These regions of rpl32-ndhF, rps19-psbN, and atpB-ndhJ in the cp genome of Nymphaea were highly variable, indicating that they may be potential molecular markers (Fig 4).

Phylogenetic analysis
Nymphaeaceae was divided into ve main branches, Brasenia, Nelumbo, Nuphar, Nymphaea, and Cabomba, but Euryale and Victoria were closely related to Nymphaea (Fig 5). Nymphaea can be further divided into ve subbranches, corresponding to the ve Nymphaea subgenera (Fig 5). The rst branch consists of N. gigantea from the subgenus Anecphya; the second subfamily consists of N. colorata, N. micrantha, N. capensis, and N. ampla which compose subgenus Brachyceras. The third branch is the subgenus Nymphaea, which is composed of N. mexicana, N. odorata, N. alba var. rubra, N. alba, and N. tetragona; N. potamophila and N. jamesoniana compose the fourth branch, subgenus Hydrocallis. The fth branch is the subgenus Lotos, which contains of N. lotus and N. rubra.

Discussion
The structure and composition of cp genome of Nymphaea The cp genome of Nymphaea is composed of LSC, SSC, IRa, and IRb regions, and the overall length of the cp genome within middle and upper ranges of angiosperms (159-160 kb) [14]. Although the structure and composition of the cp genome of seven Nymphaea are very similar, a comparative analysis shows that there are also many variations in the microstructure. The SSC and LSC regions contain the most variation, while multiple copies of the two IR regions make the IR region less prone to mutation [15]. In this study, N. gigantea and the other six species had a large difference in cp genome length, while N. odorata and N. tetragona, as well as N. colorata and N. micrantha had similar cp genome length, which revealed that N. gigantea was distant from the other six species, while N. odorata and N. tetragona, N. colorata and N. micrantha were closely related. The difference in the cp genome size was mainly caused by the contraction and expansion of the IR [14]. However, the difference of the cp genome size of Nymphaea was primarily related to the length variation of LSC and SSC. Gene loss in the cp genome was common, for example, gene ycf1 and ycf2 were completely lost in Gramineae [16], and gene infA was lost in many angiosperm cp genomes [17,18]. In this study, the cp genome of seven Nymphaea species also had gene loss and the ycf1 gene disappeared completely in N. odorata and N. tetragona. It is unknown whether ycf1 was transferred to the nuclear or mitochondrial genome. The GC content of the cp genomes of seven Nymphaea species showed differences in four regions, with the highest content in the IR region (43.35-43.44%), while the GC content in the LSC and SSC regions was lower, which is due to the existence of rRNA and tRNA in IR region. When the GC content in genome is greater, it has a greater the density of DNA bases, making the DNA sequence more stable and di cult to mutate [14].
The IR region is the most conservative region in the cp genome, but it also shrinks and expands during evolution [19]. Gene trnH-GUG and ndhF often appeared at the boundary of IR and SC region of cp gene, which were considered to be footprints of cp genome evolution [20]. We analyzed the contraction and expansion of IR region and found that the gene distribution at the boundary of the four regions of the cp genome had the same rule, but differences were observed in the microstructure, especially the locations of rps19, ndhF, ycf1, and trnH-GUG. According to the different locations of these four genes, N. tetragona and N. odorata can be identi ed from the other Nymphaea species. The boundary of SSC and IRa region is located in gene ycf1. The expansion of the IR region into ycf1 had also been described in Cardiocrinum [21] and Amana [22]. Therefore, rps19, ndhF, ycf1, and trnH-GUG can be considered part of the evolutionary footprint of the cp genome of Nymphaea.

Codon preference
In this study, 67 codons were used in seven Nymphaea species, with 32 preferred codons (RSCU>1). The most commonly used codon in the cp genome is the codon with a higher AU content [23], and the preferred codon tends to end with A/U [24]. In this study, only UUG and AUG codons of the 32 preferred codons were G-terminal and the other preferred codons were A/U-terminal. Leucine (Leu) was the most encoded amino acid, and its synonymous codon preference was UUA>UUG>CUU>CUA>CUG>CUC, which is inconsistent previous studies conducted in other families [16,25]. The codon usage is related to gene expression level, base composition, asymmetric mutation of DNA strands, and selection [26], but the mechanism of codon usage requires further study. The results of this study provide insight into the evolution of the cp genome of Nymphaea.

Repeat sequence analysis
In this study, the length of the genome repeat sequence of seven Nymphaea species was between 15-19 bp and no large segment repeat sequence (>100 bp) reported in other species was found [27,28]. Large fragment repeats are related to gene rearrangement [29]. For example, Erodium, Pelargonium, and Geranium (Geraniaceae) have gene rearrangements that all contain large fragment repeats (>100 bp) [9]. The results of the collinearity analysis of seven Nymphaea species showed that all the genes are collinear and no rearrangement occurs, which is consistent with the result that there is no large segment repeat sequence in cp genome. SSR markers are highly polymorphic and are therefore used as molecular markers in population genetics and studies of evolution. Most SSRs in the cp genomes of seven Nymphaea species are trinucleotide SSRs (48.2%), which is inconsistent with the main SSR types of other plants [30,31]. For example, most SSRs in the cp genomes of Primula are single nucleotide SSRs [25]. In addition, in the gene coding region, the number of SSRs in ycf1 is the largest, which is consistent with the research conducted in Primula [25] and Cardiocrinum [21]. The SSRs in the cp genome reported in this study can be used as potential molecular markers in future studies on Nymphaea.

Nucleotide polymorphism, KaKs, SNP and indel analysis
The Pi values of the LSC and SSC regions were signi cantly higher than that of the IR region, which was related to the conservation of the IR region. Three gene spacer regions, rpl32-ndhF, rps19-psbN, and atpB-ndhJ, as well as four protein coding regions, rpl32, rps19, rpl20, and ycf1, have been reported in other plants as highly variable regions, and they are also considered potential molecular markers [27,[32][33]. These regions may undergo faster nucleotide substitution rates during the course of evolution, which is of great signi cance to phylogenetic analysis and identi cation of Nymphaea.
Our result showed that ycf2 in the cp genome of Nymphaea was under positive selection. ycf2 was located in the reverse repeat region of cp genome, which is an important enzyme coding gene in the cp [34]. The evolutionary rate of ycf2 is relatively fast, its gene function is controversial, and it has not been classi ed into a main functional gene type (genetic system gene and photosynthetic system gene) [35].
However, because of its cp ATPase coding function [34], it is believed to have a regulatory role in the development of fruit and ower organs [35], which is considered one of the important genes in the cp genome. Sites undergoing positive selection were detected in gymnosperms and other angiosperms, indicating that the evolution of ycf2 is widespread in plants [36]. The seven Nymphaea species in this study are distributed on different continents and grow under quite different environmental conditions. Therefore, the positive selection of ycf2 in Nymphaea may be related to the difference of their optimal growth environment and the different demand for light intensity. The discovery of positive selection on ycf2 con rmed that there was evolution at the molecular level in the process of environmental adaptation for Nymphaea.
Large indels are uncommon among the cp genomes of the same genus [15]. Similarly, for the cp genome of Nymphaea, most of indels exist in the noncoding region, and the largest indel is only 28 bp long.
Compared with the small number of indels, we found 8,328 SNPs in the cp genome of Nymphaea. These SNPs and indel regions have a rapid rate of evolution and the degree of variation was the greatest in the rpoA-rpl20, rbcL-ndhC, ndhD-ndhF, and trnN-GUU-ndhA intergenic regions. These data indicate that these regions may be suitable DNA barcodes for Nymphaea.

Phylogenetic analysis
In this study, a phylogenetic tree were constructed with 21 species of seven genera in Nymphaeaceae. Cabomba was found to be at the base of the phylogenetic tree. Euryale, Victoria, and Nymphaea  [8]. The genetic relationship between subgenus Nymphaea and subgenus Brachyceras, subgenus Brachyceras and subgenus Anecphya, subgenus Hydrocallis and subgenus Lotos is relatively close, which is also supported by their cross compatibility [40][41][42].

Conclusions
The complete cp genomes of seven Nymphaea species were sequenced and successfully constructed. The cp genome of Nymphaea is very conservative in structure and composition, but it has rich variation in LSC and SSC regions, which can be used for germplasm identi cation, phylogenetic analysis, and evolution research. The phylogenetic analysis showed that Nymphaea could be further divided into ve subgenera, and Euryale and Victoria were most closely related to Nymphaea. This study not only enriched the genetic information on Nymphaea, but also laid a theoretical foundation for the evaluation of water lily germplasm resources, molecular breeding, development of SSR molecular markers, and research into genetic diversity. Moreover, this study provided support for the resolution of the phylogenetic relations among Nymphaea species.

Plant materials
Seven Nymphaea species, N. odorata, N. rubra, N. gigantea, N. potamophila N. tetragona, N. colorata, and N. micrantha, were included in this study. Herbarium specimens are all preserved at the Zhenjiang Agricultural Science and Technology Innovation Center and live plants were introduced to the Shanghai Chenshan Botanical Garden. These plants were identi ed by Nianjun Teng and Yingchun Xu of Nanjing Agricultural University according to their key morphological characteristics provided in Huang et al. [5], and the voucher ID were listed in Table 1. In August 2019, fresh leaves were collected, wrapped with tinfoil, frozen in liquid nitrogen, and stored at -80°C.
Extraction and sequencing of whole genome DNA Whole genome DNA was extracted with the plant genomic DNA Extraction Kit (TIANGEN Beijing China). After testing the genomic DNA of the sample, DNA was fragmented by sonication, and then puri ed by fragment puri cation, terminal repair, 3′-terminal plus A, linked sequencing adapter, and agarose gel electrophoresis for fragment size selection, and PCR ampli ed into a sequencing library. After passing quality inspection, the constructed library was sequenced with an Illumina Novaseq platform. After sequencing, raw data was ltered out to remove the joint sequence and low-quality reads to obtain highquality clean data.

Cp genome assembly
First, we used Bowtie 2 V2.2.4 (http://bowtie-bio.sourceforge.net/bowtie2/index.shtml) software to compare the cp genome database built by Nanjing Jisi Huiyuan Biotechnology Co., Ltd. and selected the matching clean reads for subsequent assembly. We then assembled the ltered clean reads with SPAdes V3.10.1 (http://cab.spbu.ru/software/spades/) software. If the cyclic gene sequence could be directly obtained, the sequence was genome corrected, and then the corrected genome was rearranged to obtain the complete cyclic cp genome sequence. If the complete circular genome was not directly obtained, we used SSPACEV2.0 (http://www.baseclear.com/services/bioinformatics/basetools/sspace-standard/) software to connect the contig sequences to obtain the scaffold sequence; Gap ller V2.1.1 (http://www.sourceforge.net/projects/gap ller/) software was used to supplement gaps If gaps still exist, primers were designed, sequenced by PCR, reassembled until the complete pseudo genome sequence was obtained, and then the sequences were matched to the pseudo genome for genome correction.
Finally, the corrected sequence was rearranged according to the cp structure to obtain the complete circular cp genome sequence. There are four boundaries between IR (Inverted repeat), LSC (Large single-copy region), and SSC (Small single-copy region). In the process of genome evolution, the IR boundary will expand and contract, pushing some genes into the IR region or a single copy region. We used Adobe Illustrator CS5 mapping software to map the annotation information of the cp genome of seven species of Nymphaea to the map of the simpli ed structure of the cp genome, which was used to compare the boundaries of four regions of the cp genome of Nymphaea, in order to show the extent of the shrinking and expanding of IR regions. The number of codons encoded by all protein coding genes and the relative synonymous codon usage (RSCU) in the cp genome of seven species of Nymphaea were calculated using Codonw software (http://codonw.sourceforge.net/). RSCU value > 1 indicates that the codon is a preferred codon and is frequently used. RSCU = 1 indicates that the codon has no preference. The value of RSCU < 1 indicates that the frequency of using the codon is low.

Repeat sequence and SSR Analysis
We

Declarations
Ethics approval and consent to participate Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
The seven cp genomes sequences we obtained from this study were archived in NCBI. The accession numbers are MT107636, MT107632, MT107637, MT107633, MT107631, MT107634, and MT107635.

Competing interests
The authors declare that they have no competing interests.    Figure 1 Seven novel cp genome sequences had the classical quadripartite structure that contained one LSC, one SSC, and two IR (IRa and IRb) regions The LSC/IRb boundaries of the seven Nymphaea species were all within the rpl2 gene, and the length of the overlapping region between the boundary and the rpl2 gene was 15-39 bp.  A relatively large number of SNPs in some highly variable regions located in rpoA-rpl20, rbcL-ndhC, ndhD-ndhF, and trnN-GUU-ndhA