The complete chloroplast genome and characteristics analysis of Musa basjoo Siebold

An ornamental plant often seen in gardens and farmhouses, Musa basjoo Siebold is also used in Chinese herbal medicine. Its pseudostem and leaves are used as diuretics; its root is decocted with ginger and licorice as a traditional treatment for gonorrhea and diabetes; a decoction of its pseudostem is used to relieve heat, and a decoction of its dried flowers is used to treat cerebral hemorrhage. Few chloroplast genome studies of M. basjoo Siebold have been reported. We characterized its complete chloroplast genome using NovaSeq 6000 sequencing. The chloroplast genome of M. basjoo Siebold is 172,322 bp long, with a GC content of 36.45%. It comprises a large single-copy region of 90,160 bp, a small single-copy region of 11,668 bp, and a pair of inverted repeats of 35,247 bp each. Comparing the genomic structure and sequence data with those of closely related species revealed a conserved gene order in the IR and LSC/SSC regions, which provides a useful basis for future phylogenetic research. This study also constructed, for the first time, an evolutionary tree of the genus Musa from complete chloroplast genome sequences. The tree shows no obvious multi-branching within the genus, and M. basjoo Siebold and Musa itinerans are the closest relatives.


Introduction
Musa basjoo Siebold, a perennial herb of the genus Musa in the family Musaceae, is one of the main tropical plants. In China, it is mainly distributed in subtropical areas such as Guizhou, Guangdong, Guangxi, Hainan, Sichuan, Yunnan, and Taiwan [1,2]. More than 40 species of the genus Musa can be found in the southeast of India and Thailand, with slightly fewer in Indonesia [3]. Since ancient times, plants of the genus Musa have been widely used in China as food and medicine. The sweet and refreshing pulp makes an excellent fruit that stimulates the appetite and aids digestion. The flowers, leaves, and roots also have high medicinal value and are mainly used to treat rheumatism and diseases of the cardiovascular, cerebrovascular, digestive, and circulatory systems [4][5][6][7][8][9]. The pulp, flowers, leaves, and roots of Musa species are rich in sugars, amino acids, cellulose, minerals, selenium, other trace elements, and various compounds. The seventy-six main compounds reported from M. acuminata, M. balbisiana, M. sapientum, and M. nana include phenylphenalenones, triterpenoids, xanthones, and alkaloids [10][11][12]. Most Musaceae species are similar in shape, and their origin, evolution, and phylogeny remain controversial. Scholars have identified Musa species based on morphological characteristics, physical and chemical analysis, tissue anatomy, and molecular markers, all of which are significant for Musa species classification. In recent years, DNA barcoding has gained popularity because it is not affected by the external environment and can identify species quickly and accurately [13,14]. At present, internal transcribed spacer (ITS) 1 and 2 sequences and the chloroplast matK, rbcL, rpoB, trnH-psbA, psbK-psbI, and atpB-rbcL DNA barcodes have been reported for the molecular identification of species [15][16][17][18].
The chloroplast genome is the second-largest genome and contains rich genetic information. Its introns, which mutate relatively quickly, are interspersed among highly conserved coding sequences. Barcode fragments with small molecular weight, simple structure, horotelic evolution, low mutation rate, and stable heredity have clear advantages in determining phylogenetic, genetic, and homologous relationships among species. They are also used in species identification, molecular geology, and species origin research [19]. At present, chloroplast DNA barcoding is widely used in plants. The combination rbcL + psbA-trnH is a universal DNA barcode for terrestrial plants [20]. The CBOL Plant Working Group has proposed the rbcL + matK combination as the core barcode for terrestrial plant identification [21]. The plant chloroplast genome has considerable advantages in studies of phylogeny, population dynamics, and species evolution. It is suitable for plant taxonomy and adaptive evolution studies, especially those concerning interspecific identification and the phylogeny of related species [16,22]. With the development of high-throughput DNA sequencing technology, an increasing number of chloroplast genome sequences are available, which provide essential references for chloroplast genome research on M. basjoo Siebold.
In this study, the whole chloroplast genome sequence of M. basjoo Siebold was obtained by sequencing. The chloroplast genome of M. basjoo Siebold was assembled, annotated, and analyzed. The chloroplast genome sequences of M. basjoo Siebold were compared with other published chloroplast genomes of genus Musa.
The fundamental characteristics and variation patterns of M. basjoo Siebold chloroplast genome were investigated to compare the interspecific and intraspecies variation of the sequences and select the high variation sequence among species. The chloroplast phylogenetic analysis of representative medicinal plants of genus Musa could provide a reference for the classification and identification, conservation genetics, resource development, and utilization of M. basjoo Siebold.
This study used Illumina sequencing technology to characterize the whole chloroplast genome of M. basjoo Siebold and explore its relationship with other species of the genus. The results will shed light on the genetic structure and phylogenetic history of the natural populations of this species. They will also contribute to the understanding of the structural diversity of plastids and the phylogeny of Musaceae.

Sampling, filtering of raw reads, and sequencing
We collected fresh leaves from one M. basjoo Siebold plant growing in the Botanical Garden of Medicinal Plants, Xianlin Campus of Nanjing University of Chinese Medicine, China. Total genomic DNA was extracted by the modified CTAB method [23]. DNA was sheared by compressed nitrogen atomization to produce fragments of 300 bp. Fragment quality was checked on the Bioanalyzer 2100 (Agilent Technology). Short-insert library preparation and sequencing were performed by Genepioneer (Nanjing, China). Genomic DNA was sequenced on the Illumina Novaseq PE150 platform.
The fastp software (version 0.20.0, https://github.com/OpenGene/fastp) was used to filter the raw data. The filtering criteria were as follows: (1) sequencing adaptor and primer sequences in reads were removed; (2) reads with an average quality value lower than Q5 were filtered out; (3) reads containing more than five N bases were discarded.
After these quality control steps, 16,108,597 high-quality 150-bp paired-end (PE) reads were obtained. Paired-end reads were used in order to improve assembly accuracy.
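Criteria (2) and (3) above can be illustrated with a minimal Python sketch. It is not fastp's implementation: the function names, the Phred+33 quality encoding, and the toy reads are assumptions made for illustration, and adaptor/primer trimming (criterion 1) is left out.

```python
from statistics import mean

def mean_phred(quality: str, offset: int = 33) -> float:
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding assumed)."""
    return mean(ord(ch) - offset for ch in quality)

def passes_filter(seq: str, quality: str,
                  min_mean_q: float = 5.0, max_n: int = 5) -> bool:
    """Keep a read if its mean quality is at least Q5 and it has at most five N bases."""
    if not seq or not quality:
        return False
    return mean_phred(quality) >= min_mean_q and seq.upper().count("N") <= max_n

# Hypothetical toy reads: (sequence, quality string)
reads = [
    ("ACGTACGTAC", "IIIIIIIIII"),  # Q40 throughout, no N: kept
    ("ACGTNNNNNN", "IIIIIIIIII"),  # six N bases: rejected
    ("ACGTACGTAC", "##########"),  # Q2 throughout: rejected
]
kept = [seq for seq, qual in reads if passes_filter(seq, qual)]
```

For paired-end data, one would typically drop both mates when either fails the filter; that pairing logic is omitted here.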

Assembly, annotation, and analysis of the plastid genome sequences
After quality filtering, reads were mapped to the database of Musaceae species with a chloroplast genome available (NCBI download), using Bowtie2 v.2.2.6 to exclude reads of nuclear and mitochondrial origins. All putative chloroplast reads were mapped to the above reference sequence and then used for de novo assembly.
With published Musaceae species as references, we assembled the chloroplast-like reads into a genomic sequence with NOVOPlasty [24], which assembles a subset of reads and extends the sequence as far as possible until a circular genome is formed. The assembled chloroplast sequence was annotated with CPGAVAS [25], and the annotation results were checked with DOGMA and BLAST [26]. The circular gene map of the Musa basjoo Siebold plastid genome was drawn with OGDRAW [27]. Codon usage was analyzed with the software CodonW (University of Texas, Houston, TX, USA) using the relative synonymous codon usage (RSCU) ratio. The Ka/Ks value of each gene was calculated using the KaKs_calculator [28] with the following settings: genetic code table 11 (bacterial and plant plastid code); method of calculation: MLWL. Microsatellites (mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats) were detected using MISA with thresholds of eight repeat units for mononucleotide SSRs, five repeat units for di- and trinucleotide SSRs, and three repeat units for tetra-, penta-, and hexanucleotide SSRs. The online tool IRscope (https://irscope.shinyapps.io/irapp/) was used to visualize the genes on the boundaries of the LSC, SSC, and IRs according to their annotations. Subsequently, a sliding window analysis was conducted to evaluate the nucleotide variability (Pi) of the cp genome using DnaSP version 5.1.
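The SSR thresholds listed above can be sketched as a simple regular-expression search. This is an illustrative approximation only, not MISA itself: the function name and the toy sequence are hypothetical, positions are 0-based, and compound SSRs are not merged.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text
MIN_REPEATS = {1: 8, 2: 5, 3: 5, 4: 3, 5: 3, 6: 3}

def find_ssrs(seq: str):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One motif copy followed by at least (min_rep - 1) further copies
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            unit = match.group(1)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA" is mono-A)
            if any(unit == unit[:k] * (unit_len // k)
                   for k in range(1, unit_len) if unit_len % k == 0):
                continue
            hits.append((match.start(), unit, len(match.group(0)) // unit_len))
    return sorted(hits)

# Hypothetical toy sequence: ten A's and six AT units flanked by short spacers
toy = "GC" + "A" * 10 + "CG" + "AT" * 6 + "GC"
ssrs = find_ssrs(toy)
```

On the toy sequence the search reports one mononucleotide and one dinucleotide SSR; shorter runs that fall below the thresholds are ignored.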
Long repeat regions and their genome coordinates were identified and checked with the REPuter tool [29]. The annotated sequence was deposited in GenBank (accession No. MW376865; BankIt2410783 M. _basjoo_Siebold).

Genome comparison
The complete cp genome of M. basjoo Siebold was compared with those of related species, M. acuminata subsp (HF677508.  [30,31]. With M. basjoo Siebold as the reference, ten Musaceae plastid genomes were compared.

Phylogenetic position analysis
We constructed phylogenies by the maximum likelihood (ML) method using 10 Musaceae species to study the coding region evolution. The whole-genome analysis of the evolutionary tree was run with default settings after setting the same starting point for each circular sequence. Multiple sequence alignment across species was performed with MAFFT [32]. The alignment was processed with trimAl, and the maximum likelihood tree was constructed with RAxML v8.2.10 (https://cme.h-its.org/exelixis/software.html) under the GTRGAMMA model with rapid bootstrap analysis (Bootstrap = 1000).

(Table S1) [33]. The plastid genome of M. basjoo Siebold is a typical quadripartite circular molecule (Fig. 1), consisting of a large single-copy region (LSC; 90,160 bp) and a small single-copy region (SSC; 11,668 bp) separated by a pair of inverted repeat regions (IRs; 35,247 bp each) (Fig. 1 and Table S1).
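As a quick consistency check on the region lengths reported above, the quadripartite structure implies that the total genome length equals LSC + SSC + 2 × IR. The short sketch below (variable names are illustrative) reproduces the 172,322 bp total given in the abstract.

```python
# Region lengths reported for the M. basjoo Siebold plastid genome (bp)
LSC = 90_160          # large single-copy region
SSC = 11_668          # small single-copy region
IR = 35_247           # length of each inverted repeat copy (IRa and IRb)
REPORTED_TOTAL = 172_322

# A quadripartite plastome contains one LSC, one SSC, and two IR copies
total = LSC + SSC + 2 * IR
ir_fraction = 2 * IR / total  # share of the genome occupied by the two IR copies
```

The computed total matches the reported genome size, which confirms that 35,247 bp is the length of each IR copy rather than of the pair.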

Chloroplast genome features and guanine-cytosine of M. basjoo Siebold
We annotated 139 different genes with the same arrangement order in M. basjoo Siebold, including 89 messenger RNAs (mRNA), 8 ribosomal RNAs (rRNA), 38 transfer RNAs (tRNA), and 4 pseudogenes (Table S1). The LSC region contains 60 protein-coding and 21 tRNA genes, while the SSC region contains 9 protein-coding genes and 1 tRNA gene. The IRa and IRb regions include 10 protein-coding genes, 8 tRNA genes, and 4 rRNA genes. In the anti-clockwise direction, the intermediate region from LSC to SSC is defined as IRa, and the intermediate region from SSC to LSC is IRb (Table 1 and Fig. 1).
Chloroplast genes mainly function in photosynthetic pathways and self-replication. Some genes with other or unknown functions are listed in Table 2. We found that four genes are duplicated in the IRa and IRb regions, namely rpl2, ndhB, trnI-GAU, and trnA-UGC (Table 3), identical to those found in M. basjoo Siebold. Introns occur in 23 genes of the M. basjoo Siebold plastid genome; each of these genes has a single intron, except for ycf3, rps12, and clpP, which contain two introns (Table 3). The ycf3 gene has been shown to be essential for the stable accumulation of photosystem I complexes [40].

SSR and long repeat analysis
Simple sequence repeats (SSRs) show high intraspecific variability in the chloroplast genome and are often used as genetic markers in population genetics and evolutionary research. We analyzed the SSRs in the cp genome of M. basjoo Siebold (Table S4, Fig. 3) and detected 246 SSRs of six kinds: mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeats. Mononucleotide repeats are the most abundant (51.82%) and contribute more to genetic variation than the other SSR types. Mononucleotide A and T repeats account for 96.09%, the highest proportion among all SSRs identified. Mono-, di-, tri-, tetra-, and pentanucleotide repeats are rich in poly-A or poly-T, consistent with the overall A-T abundance of the chloroplast genome (82.59%). Plastid SSRs are usually composed of A and T and rarely contain tandem guanine (G) or cytosine (C) repeats [42]. This might be because A-T changes occur more readily than G-C changes in the chloroplast genome [43].

Long repeats, unlike SSRs, are dispersed in the genome. We identified 372 long repeats in the M. basjoo Siebold chloroplast genome (Table S5, Fig. 4), including 219 forward (F) repeats, 139 palindromic (P) repeats, 12 reverse (R) repeats, and 2 complement (C) repeats. Most of the long repeats range from 30 to 214 bp, and intergenic spacer regions (IGS) contain the most, with 110 repeats. M. basjoo Siebold has fewer SSRs and long repeats than other species.
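The four long-repeat categories can be made concrete with a small sketch that classifies a pair of repeat copies by how the second copy relates to the first. This is an illustration under simplifying assumptions, not REPuter's algorithm: only exact matches are considered, and the function name and toy sequences are hypothetical. The final lines check that the reported category counts sum to the reported total.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def classify_repeat(first: str, second: str) -> Optional[str]:
    """Classify an exact repeat pair as F, R, C, or P; None if unrelated."""
    first, second = first.upper(), second.upper()
    if second == first:
        return "F"  # forward: identical copy in the same orientation
    if second == first[::-1]:
        return "R"  # reverse: same bases read backwards
    if second == first.translate(COMPLEMENT):
        return "C"  # complement: complementary bases, same direction
    if second == first.translate(COMPLEMENT)[::-1]:
        return "P"  # palindromic: reverse complement
    return None

# Category counts reported in the text for M. basjoo Siebold
reported = {"F": 219, "P": 139, "R": 12, "C": 2}
total_long_repeats = sum(reported.values())
```

For example, `classify_repeat("AACG", "CGTT")` returns `"P"` because CGTT is the reverse complement of AACG.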

KaKs Analysis of M. basjoo Siebold and M. itinerans
Previous studies have revealed that non-synonymous and synonymous nucleotide substitution patterns serve as critical markers in gene evolution. We aligned the gene sequences with MAFFT v7.310 and calculated the Ka/Ks value of each gene with the KaKs calculator [28]. The non-synonymous (Ka)/synonymous (Ks) substitution ratios of rpoC2, rpoC1, rpoB, and rpoA between Musa basjoo Siebold and Musa itinerans (NC_035723.1) were calculated to study the functional constraint on these four genes at the DNA sequence level (Table S6). The results indicated that the gene with the

IRscope analysis of Musaceae
In this study, the contraction and expansion of the IR boundaries in 10 Musaceae genomes were visualized across the four chloroplast genome regions (LSC/IRb/SSC/IRa). The plastid genome has a circular structure with four IR boundaries: LSC-IRb, IRb-SSC, SSC-IRa, and IRa-LSC. We compared the IR/LSC and IR/SSC junction sites among Musaceae in detail. The analysis showed that the lengths of the LSC, IRa, IRb, and SSC regions are similar across the plastids of the ten species. The JLA (IRa-LSC) border lies between the rps19 and psbA genes in Musaceae, similar to that of Musa balbisiana var. balbisiana and Musa ornata (Fig. 5). The JLB (LSC-IRb) boundary lies between the rpl22 and rps19 genes. The ndhA gene generally spans the JSA (SSC-IRa) region. In previously studied species, IR boundary displacement was relatively slight and involved only a small number of genes. The ndhA gene was found at the JSA (SSC-IRa) boundary and showed only irregular translocation. In this study, the ndhA sequence of Musa itinerans is the longest at 1191 bp, and the ndhH and ndhF genes deviate from the JSB (IRb-SSC) region.

Pi analysis of nucleotide diversity
The nucleotide diversity (π, Pi) of the chloroplast genome reflects differences in nucleotide sequence among species. Highly variable regions provide potential molecular markers for population genetics (Fig. 6). The Pi analysis highlights variation among genes, with SNPs and indels serving as the points of comparison across the genome. Our results revealed that the Pi values of the matK, rps16, psaC, rpl16, ndhF, rpl36, rpl32, ccsA, accD, rps15, ycf1, and trnG-UCC genes are higher, making these genes candidates for subsequent barcodes.
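The sliding-window Pi analysis can be sketched as follows: within each window of an alignment, Pi is the average proportion of differing sites over all pairs of sequences. The sketch below is an illustration and not DnaSP's implementation; the window and step sizes are assumptions (the text does not state them), gaps and ambiguous bases are skipped pairwise as a simplification, and the toy alignment is hypothetical.

```python
from itertools import combinations

def pairwise_diff(a: str, b: str) -> float:
    """Proportion of differing sites, skipping columns with a gap or ambiguity."""
    compared = diffs = 0
    for x, y in zip(a, b):
        if x in "ACGT" and y in "ACGT":
            compared += 1
            diffs += x != y
    return diffs / compared if compared else 0.0

def nucleotide_diversity(seqs) -> float:
    """Pi: mean pairwise difference per site over all sequence pairs."""
    pairs = list(combinations(seqs, 2))
    return sum(pairwise_diff(a, b) for a, b in pairs) / len(pairs)

def sliding_pi(alignment, window: int = 600, step: int = 200):
    """Return (window midpoint, Pi) for each window along an alignment."""
    length = len(alignment[0])
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        block = [s[start:start + window] for s in alignment]
        results.append((start + window // 2, nucleotide_diversity(block)))
    return results

# Hypothetical toy alignment of three equal-length sequences
toy_alignment = ["AAAAAAAA", "AAAAAAAT", "AAAAAATT"]
profile = sliding_pi(toy_alignment, window=4, step=4)
```

Windows with the highest Pi values correspond to the most variable regions, which is how candidate barcode loci such as those listed above would be picked out of a whole-plastome alignment.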

Comparative analysis of chloroplasts in Musaceae
We compared the M. basjoo Siebold chloroplast genome with those of nine other Musaceae species (Fig. 7). The Mauve program was used to compare the whole genomes of the ten Musaceae species [44,45]. The ten plastid genomes vary in length from 161,347 bp (M. textilis) to 172,322 bp (M. basjoo Siebold). The M. basjoo Siebold cp genome is similar in gene order to those of the other Musaceae species analyzed (Fig. 7). Thus, Musaceae plastid genomes are highly conserved in gene content, gene order, and genome structure, with no inversion or translocation in the species we analyzed.

Phylogenetic relationship analysis
Phylogenetic analysis based on chloroplast genome sequences is fundamental in tracing the lineages of many plant species [46,47]. We selected the complete plastid genome sequences of ten Musaceae species to study the phylogenetic position of M. basjoo Siebold. The ten complete plastome sequences were aligned using MAFFT [48]. ML analysis was carried out using RAxML [49], and most of the nodes in our ML tree had 100% bootstrap support. The GTR model and a hill-climbing algorithm were used to construct the evolutionary tree (Fig. 8).
We used the complete chloroplast genome of M. basjoo Siebold to study its phylogenetic position in Musaceae, aligning it with the complete chloroplast genomes of the other Musaceae species in a ten-species multiple sequence alignment. The results showed that the evolutionary tree could be divided into three branches. M. basjoo Siebold and M. itinerans are the closest relatives within one sister group. M. balbisiana

Discussion
In this study, we used Illumina sequencing technology to sequence the plastid genome of M. basjoo Siebold. Chloroplast genome analysis showed that the genome has a quadripartite structure, with a pair of inverted repeat regions (IRa and IRb) separated by a large single-copy region (LSC) and The cp microsatellites (cpSSRs) have often been used as molecular markers in evolutionary studies such as genetic   but few in the protein-coding gene regions. The microsatellites detected in this study will contribute to research on the evolution of Musa and to the protection and identification of the genus.
Variation in chloroplast genome size is due to the contraction and expansion of the inverted repeats (IRs). We observed contraction and expansion of the IR regions in the chloroplast genomes of M. basjoo Siebold and the other sequenced Musaceae. Among these 10 species, the IR-SC boundaries differ. Based on the positions of rps19, rpl22, ndhH, ndhF, ndhA, psbA, and trnH, we identified several types of junctions caused by the contraction and expansion of the inverted repeat regions.
The dN/dS ratio, that is, the ratio of the nonsynonymous (dN) to the synonymous (dS) substitution rate, was used to evaluate purifying selection on protein-coding genes and their sequence divergence. Our results showed that most of the gene sequences differ little (dS < 0.1). The dN/dS analysis revealed that most protein-coding genes are under negative selection (dN/dS < 1), with only a few under positive selection (dN/dS > 1). Other plastid genomes show similar patterns.
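The interpretation of the ratio described above can be written as a small helper. This is only a sketch of the conventional thresholds: the function name and the example rate values are hypothetical, and in practice a statistical test would be needed before concluding that a ratio differs from 1.

```python
def selection_regime(dn: float, ds: float) -> str:
    """Interpret a dN/dS (Ka/Ks) ratio using the conventional threshold of 1."""
    if ds == 0:
        return "undefined"  # ratio cannot be computed without synonymous changes
    ratio = dn / ds
    if ratio > 1:
        return "positive selection"
    if ratio < 1:
        return "negative (purifying) selection"
    return "neutral evolution"

# Hypothetical (dN, dS) values for illustration only; not taken from Table S6
examples = {
    "geneA": (0.002, 0.050),  # ratio 0.04
    "geneB": (0.030, 0.010),  # ratio 3.0
    "geneC": (0.000, 0.000),  # no substitutions observed
}
regimes = {name: selection_regime(dn, ds) for name, (dn, ds) in examples.items()}
```

Genes with no synonymous substitutions are reported as undefined rather than forced into a category, since very low dS values (such as the dS < 0.1 noted above) make the ratio unstable.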
The plastid genome is an excellent resource for inferring evolutionary and phylogenetic relationships. Many studies have used chloroplast sequences to analyze phylogenetic relationships at different taxonomic levels. Previously, only a few genes were used to evaluate the phylogenetic relationships and tribe classification of Musa, and the tribe classification remains unresolved. This paper used the maximum likelihood method to reconstruct the phylogenetic relationships of the chloroplast genomes of 10 species representing the four major tribes in Musa. Our phylogenetic tree shows the same topological structure, with high support values at the branches. The ten Musa species selected in this study confirmed that M. basjoo Siebold and M. itinerans are the closest relatives within one sister group.

Conclusions
Chloroplast organelles are typical of green plants and other organisms. Simple, conserved, and readily rearranged, the chloroplast genome is mainly used to analyze species origin and evolution [50]. In this study, we obtained and analyzed for the first time the complete chloroplast genome sequence of a traditional Chinese medicinal plant (M. basjoo Siebold) using Illumina high-throughput sequencing technology. Genome sequencing, assembly, annotation, and comparative analysis revealed that the M. basjoo Siebold cp genome has a typical quadripartite structure with a conserved arrangement. Its GC content, arrangement, and codon usage features are similar to those of the cp genomes of other Musaceae species. In addition, the analytical result (Ka/Ks > 1, as shown by matK and ycf2) indicated more substantial natural selection effects. The M. basjoo Siebold cp genome analysis revealed some notable features, which can pave the way for a better understanding of the plant's anti-resurrection ability. For the first time, we attempted to analyze genetic diversity using whole-genome information. We found the following genes with high nucleotide diversity: matK, rps16, psaC, rpl16, ndhF, rpl36, rpl32, ccsA, accD, rps15, ycf1, and trnG-UCC, which can serve as candidate barcodes for subsequent species identification analysis.
In conclusion, the current study has characterized the complete chloroplast genome of M. basjoo Siebold for the first time. As genus Musa and Zingiberales plants generally grow in tropical areas, they share a close relationship. Previous research used Musa and Zingiberales species together to construct evolutionary trees illustrating the relationship between the two groups. This is the first time that Musa species alone, and the largest number of Musa species so far, have been used to construct an evolutionary tree in order to determine whether multiple sister groups exist among Musa species. The analysis found three main groups. M. basjoo and M. itinerans are the closest relatives within one sister group. M. balbisiana and M. textilis belong to another sister group. In addition, M. banksii, M. ornata, M. acuminata, and M. beccarii fall into a third group. This finding may provide a relatively complete reference for future studies.