The complete chloroplast genome sequence of Clerodendranthus spicatus, a medicinal plant for preventing and treating kidney diseases from Lamiaceae family

Clerodendranthus spicatus (Thunb.) C. Y. Wu ex H. W. Li is one of the most important medicinal plants for the treatment of kidney diseases in the southeast regions of China. To understand the taxonomic classification of Clerodendranthus species and identify species discrimination markers, we sequenced and characterized its chloroplast genome in the current study. Total genomic DNA was isolated from dried leaves of C. spicatus and sequenced using an Illumina sequencing platform. The data were assembled with the NOVOPlasty software and annotated with the CpGAVAS2 web service. The complete chloroplast genome of C. spicatus was 152,155 bp, including a large single-copy region of 83,098 bp, a small single-copy region of 17,665 bp, and a pair of inverted repeat regions of 25,696 bp each. Isoleucine codons were the most abundant, accounting for 4.17% of all codons. The codons AUG, UUA, and AGA showed a high degree of usage bias. Twenty-eight simple sequence repeats, thirty-six tandem repeats, and forty interspersed repeats were identified. The distribution of the rps19, ycf1, rpl2, trnH, and psbA genes was analyzed. Analysis of the genetic distance of the intergenic spacer regions showed that the ndhG-ndhI, accD-psaI, rps15-ycf1, rpl20-clpP, and ccsA-ndhD regions have high K2p values. Phylogenetic analysis showed that C. spicatus is closely related to two Lamiaceae species, Tectona grandis and Glechoma longituba. In this study, we sequenced and characterized the chloroplast genome of C. spicatus. Phylogenomic analysis identified species closely related to C. spicatus, which represent potential candidates for the development of drugs improving renal function.


Introduction
Clerodendranthus spicatus (Thunb.) C. Y. Wu ex H. W. Li is commonly called Yanumiao, Maoxugong, and Maoxucao [1]. It is a perennial herb and is the first reported species of the genus Clerodendranthus in the Lamiaceae family. It is mainly produced in the Guangdong, Hainan, Guangxi, Yunnan, and Fujian provinces of China, and in Taiwan. It is also distributed from India, Myanmar, Thailand, Indonesia, and the Philippines to Australia and adjacent islands [2]. In the wild, C. spicatus commonly grows in wet places under forest cover. It can also be cultivated in areas up to 1050 m above sea level [3]. C. spicatus has numerous chemical constituents, such as flavonoids [4], phenols [5,6], terpenes [7], volatile oils [8], and lignans [9]. The entire plant or its aerial parts are widely used to treat chronic nephritis, cystitis, lithangiuria, rheumatoid arthritis, and kidney diseases [10,11].
The genetic study of C. spicatus is still in its infancy. A genetic clustering and genetic similarity analysis of 87 C. spicatus samples based on inter-simple sequence repeat (ISSR) markers has been reported previously [12]. That study detected 120 polymorphic bands and classified the samples into two groups using the UPGMA method. The ISSR markers proved feasible for genetic diversity analysis of C. spicatus resources and revealed the genetic characteristics and relationships within C. spicatus germplasm. That work provides a theoretical basis for species identification, variety selection, product development, and application of C. spicatus.
Chloroplasts are important organelles for plant photosynthesis [13]. Study of the chloroplast genome (cpgenome) is significant for revealing the mechanism and metabolic regulation of plant photosynthesis [14]. A large number of the proteins that function in the chloroplast are encoded and transcribed in the nuclear genome. For this reason, the chloroplast is only a semi-autonomous organelle [15]. Studies of chloroplast genomes help to understand the interaction between the nuclear and chloroplast genomes [16]. Moreover, chloroplast genomes are valuable for revealing the origin of plants, their molecular evolution, and the relationships among different species [17].
Here, the cpgenome of C. spicatus was sequenced, annotated, and characterized for the first time. After characterizing its structural features, we chose 14 other species known from clinical research to have pharmacological effects on the kidney system for phylogenetic analysis. Epimedium brevicornu and Dioscorea polystachya were used as outgroups. In addition, we compared their gene contents to identify genes possibly related to chemical biosynthesis and disease treatment [18]. We systematically identified simple sequence repeat markers that can be used in the future for genetic diversity analysis as described above [19]. The results from this study lay the foundation for taxonomic classification and species discrimination of C. spicatus and related species. This information will be indispensable for the exploration of novel medicinal materials used for treating kidney diseases.

Plant materials
Young leaves of C. spicatus were collected from the Guangxi Medical Botanical Garden, Nanning, Guangxi, China (geospatial coordinates: N22.859968, E108.383475) and immediately dried with silica gel for total genomic DNA isolation.

DNA extraction and determination of DNA quality
Total genomic DNA was extracted from the dried leaves using the plant genomic DNA kit (Tiangen Biotech, Beijing, China) [20,21]. The DNA purity was evaluated by electrophoresis on a 1.0% agarose gel, and the DNA concentration was measured using a NanoDrop 2000 spectrophotometer.

Chloroplast genome sequencing and assembly
DNA extracts were fragmented for construction of a 300 bp short-insert library. The library was sequenced in paired-end mode with a read length of 150 bp on an Illumina HiSeq 2500 platform following the manufacturer's recommendations [22,23]. The raw reads were filtered using Trimmomatic 0.35 to remove adapters and low-quality bases [24]. We used BLASTn to compare the clean reads with cpgenome sequences downloaded from GenBank. Reads that matched the cpgenome sequences at a BLASTn e-value cutoff of 1e-5 were retained. The retained reads were then assembled using NOVOPlasty (v 4.2) [25] with the following parameters (Type = chloroplast, K-mer = 39, Read Length = 150, Insert size = 300, Single/Paired = PE, Insert Range = 1.9) and the rbcL sequence as the seed.
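The read-selection step can be illustrated with a minimal sketch. It assumes the BLASTn hits were written in the standard 12-column tabular format (`-outfmt 6`), where the first column is the read identifier and the eleventh is the e-value; the function name and the example hit lines are hypothetical and are not taken from the study's pipeline.

```python
def reads_passing_evalue(blast_tab_lines, max_evalue=1e-5):
    """Collect query (read) IDs having at least one hit at or below max_evalue.

    Assumes standard BLAST tabular output: column 1 is the query ID and
    column 11 is the e-value. This is an illustrative helper, not the
    authors' script.
    """
    keep = set()
    for line in blast_tab_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if float(fields[10]) <= max_evalue:
            keep.add(fields[0])
    return keep


# Hypothetical hit lines, for illustration only.
example_hits = [
    "read_1\tref_cp\t99.3\t150\t1\t0\t1\t150\t1001\t1150\t1e-70\t270",
    "read_2\tref_cp\t85.0\t40\t6\t0\t1\t40\t5001\t5040\t0.002\t38.2",
]
print(sorted(reads_passing_evalue(example_hits)))
```

Only the read identifiers are collected here; extracting the corresponding read pairs from the FASTQ files would be a separate step.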

Genome annotation and manual curation
We annotated the genome using the CpGAVAS2 web service (http://www.herbalgenomics.org/cpgavas2/) [26], including the prediction of protein-coding genes (PCGs), tRNA genes, and rRNA genes, and the analysis of the repeat sequences. Annotation problems were manually corrected using the Apollo software [27]. Finally, the assembly and annotation results of the C. spicatus cpgenome were deposited in GenBank under the accession number MZ063774.

Characterization of cis- and trans-splicing genes
A schematic representation of the cpgenome structure and the detailed structures of the cis-splicing and trans-splicing PCGs were drawn using the CPGview-RSG software (http://www.herbalgenomics.org/cpgview/).

Analysis of relative synonymous codon usage (RSCU) and codon adaptation index (CAI)
The Relative Synonymous Codon Usage (RSCU) was calculated as the ratio of the observed codon frequency to the expected frequency of the same codon within a synonymous codon group in the entire coding sequence of the gene concerned. The RSCU patterns and quantity were estimated using CodonW (v1.4.4) (http://codonw.sourceforge.net/) [28]. The Codon Adaptation Index (CAI) value was calculated using the CAIcal server (http://genomes.urv.es/CAIcal/) [29].
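The RSCU definition above can be made concrete with a short sketch. It is not the CodonW implementation; it assumes the standard genetic code, pools all coding sequences before counting, and takes the expected frequency of a codon to be the total count of its synonymous family divided by the number of codons in that family. The function name `rscu` and the toy input are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_sequences):
    """Return {codon: RSCU} for a collection of in-frame coding sequences."""
    counts = Counter()
    for seq in cds_sequences:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1

    # Group codons into synonymous families (stop codons form one family).
    families = {}
    for codon, aa in CODON_TABLE.items():
        families.setdefault(aa, []).append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid not observed; RSCU undefined
        expected = total / len(codons)
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy input: three TTA and one CTT codon, all encoding leucine (6 synonyms).
toy = rscu(["TTATTATTACTT"])
print(toy["TTA"], toy["CTT"])
```

Under this definition a codon used exactly as often as expected has an RSCU of 1, and a single-codon amino acid such as Trp always scores 1 when present.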

Analysis of the gene contents at the borders of the LSC, SSC, and IR regions
The boundaries of the LSC, SSC, and IR regions of the cpgenomes from C. spicatus and six other species were visualized using the IRscope software (https://irscope.shinyapps.io/irapp/). These six species include five closely related Lamiales species (Cistanche deserticola, Glechoma longituba, Salvia miltiorrhiza, Salvia miltiorrhiza f. alba, and Tectona grandis) and Epimedium brevicornu from the Berberidaceae family [33].

Isolation, quality control, and sequencing of total genomic DNA
We analyzed the genomic DNA by electrophoresis on a 1% agarose gel. A clear band corresponding to the genomic DNA was observed. The OD260/OD280 ratio measured by the microspectrophotometer was 1.8-1.9, and the OD260/OD230 ratio was over 2.0, indicating that the DNA sample was of sufficient purity for PCR and library construction. A total of 11.5 GB of paired-end reads were obtained, resulting in 10.0 GB of clean data after filtering. We mapped the reads to the assembled C. spicatus cpgenome. Of a total of 15,926,314 reads, 354,188 pairs of reads (1.11%) were mapped to the genome exactly one time, corresponding to a depth of 354,188 × 151 / 152,155 ≈ 351.5.
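The depth figure can be reproduced with a few lines of arithmetic. The sketch below simply takes the three numbers as reported (354,188 uniquely mapped pairs, a length of 151 bp per counted unit, and a genome of 152,155 bp) and applies depth = mapped units × unit length / genome length; the function name is hypothetical.

```python
def mean_depth(mapped_units, unit_length_bp, genome_length_bp):
    """Average sequencing depth: total mapped bases divided by genome length."""
    return mapped_units * unit_length_bp / genome_length_bp


# Values as reported in the text.
depth = mean_depth(354_188, 151, 152_155)
print(f"{depth:.1f}")  # 351.5
```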

General features of C. spicatus cpgenome
The assembled cpgenome of C. spicatus was 152,155 bp, including a large single-copy (LSC) region of 83,098 bp, a small single-copy (SSC) region of 17,665 bp, and a pair of inverted repeat regions (IRa and IRb) of 25,696 bp each (Fig. 1). The GC content of the whole cpgenome was 37.86%, and those of the LSC, SSC, and IR regions were 35.90%, 31.75%, and 43.12%, respectively (Table 1). The GC content of the IR region was higher than those of the SSC and LSC regions.
Fig. 1 Graphic representation of features identified in the C. spicatus cpgenome using CPGview-RSG (http://www.herbalgenomics.org/cpgview). The map contains seven circles. From the center going outward, the first circle shows the distributed repeats connected with red (forward direction) and green (reverse direction) arcs. The second circle shows the tandem repeats marked with short bars. The third circle shows the microsatellite sequences as short bars. The fourth circle shows the size of the LSC and SSC. The fifth circle shows the IRa and IRb. The sixth circle shows the GC content along the plastome. The seventh circle shows the genes, colored according to their functional groups.
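As a consistency check on the reported figures, the sketch below sums the four region lengths and computes the length-weighted mean of the regional GC percentages, which should agree with the whole-genome values. It also includes a small helper showing how GC content is typically computed from a sequence. The helper name and the choice to ignore ambiguous bases are assumptions for illustration, not the procedure used in the study.

```python
def gc_content(seq):
    """Return GC content in percent, ignoring ambiguous bases such as N."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    return 100.0 * gc / (gc + at)


# Region lengths (bp) and GC percentages as reported in the text.
REGIONS = {
    "LSC": (83_098, 35.90),
    "SSC": (17_665, 31.75),
    "IRa": (25_696, 43.12),
    "IRb": (25_696, 43.12),
}

total_length = sum(length for length, _ in REGIONS.values())
weighted_gc = sum(length * gc for length, gc in REGIONS.values()) / total_length

print(total_length)          # 152155
print(f"{weighted_gc:.2f}")  # 37.86
```

Both values match the reported genome size of 152,155 bp and overall GC content of 37.86%.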

Analysis of RSCU and CAI regarding the codon usage
Codon usage bias provides critical clues to the plant evolution process and has an essential role in the expression of proteins [40]. We analyzed the codon usage in the cpgenome of C. spicatus. The PCGs contained a total of 26,421 codons. Among them, 1102 (4.17%) codons were for isoleucine, the most abundant amino acid. Lysine was the second most abundant amino acid; its codons accounted for 4.06% of all codons. The least abundant codons were for cysteine, representing only 0.24% of all codons (Table S4). RSCU bias affects gene expression because synonymous codon usage can influence the secondary structural units of proteins. The RSCU values varied from 0.336 to 2.9904 in the cpgenome of C. spicatus. Codons with an RSCU value > 1 were preferred. In this study, the codons AUG, UUA, and AGA had the higher RSCU values, indicating higher codon usage bias among the 64 codons. To understand the synonymous codon usage bias of the C. spicatus cpgenome, we calculated the CAI value, which was 0.645, indicating the presence of codon preference. It should be noted that only two amino acids (Trp and Met) were each encoded by a single codon.

Repeat sequences analysis
Repeat sequences are important genetic markers and are closely related to the origin and evolution of species. Repeat sequences can be divided into interspersed (scattered) repeats and tandem repeats. Interspersed repeats are scattered throughout the genome, whereas adjacent copies of a sequence arranged one after another are called tandem repeats. A special form of tandem repeats is the simple sequence repeat (SSR), also known as a microsatellite. SSRs are often naturally polymorphic and are the focus of chloroplast genome repeat analysis.
We analyzed three types of repetitive sequences in the cpgenome of C. spicatus: SSRs, tandem repeats, and interspersed repeats. Of the 28 SSRs, 12 contained "A" as the repeat unit, 15 contained "T" as the repeat unit, and 1 contained "TA" as the repeat unit (Table S5). The sizes of the SSRs were between 10 and 15 bp (Table S6). For the tandem repeats, 36 repeats were identified that satisfied the following conditions: length of the repeat unit ≥ 20 bp, and similarity among the repeat unit sequences ≥ 70% (Table S6). For the interspersed repeats, 40 repeats were identified, including 22 palindromic repeats and 18 direct repeats (Table S7), with the lengths of repeat units 1 and 2 being between 30 and 60 bp. The e values of the interspersed repeats varied from 7.80E−23 to 6.19E−04 (Table S7).
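To illustrate how SSRs of the kind reported here (runs of "A", "T", or "TA") can be located, the following sketch scans a sequence with regular expressions. The thresholds (at least 10 repeats for mononucleotide units and at least 5 for dinucleotide units) are assumptions chosen for illustration; the source does not state the cutoffs used, and this is not the CpGAVAS2 repeat-finding procedure. The function name and test sequence are hypothetical.

```python
import re


def find_ssrs(seq, min_repeats=None):
    """Return (start, end, unit, n_repeats) tuples with 1-based inclusive coordinates.

    min_repeats maps repeat-unit length to the minimum number of repeats.
    The defaults are illustrative assumptions, not values from the study.
    """
    if min_repeats is None:
        min_repeats = {1: 10, 2: 5}
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in sorted(min_repeats.items()):
        # A unit followed by at least (min_rep - 1) further copies of itself.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for m in pattern.finditer(seq):
            unit = m.group(1)
            if unit_len > 1 and len(set(unit)) == 1:
                continue  # homopolymer run, already covered by the 1-bp pass
            hits.append((m.start() + 1, m.end(), unit, len(m.group(0)) // unit_len))
    return sorted(hits)


# Hypothetical test sequence: a 12-bp poly-A run and a (TA)5 repeat.
demo = "GC" + "A" * 12 + "CG" + "TA" * 5 + "GG"
for hit in find_ssrs(demo):
    print(hit)
```

On the demonstration sequence this reports the poly-A run at positions 3 to 14 and the (TA)5 repeat at positions 17 to 26.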

Structures of the IR boundaries from seven selected species
IR contraction and expansion are common phenomena in the evolution of cpgenomes. We analyzed the structure of the IR boundaries and calculated the lengths of the various regions for seven species. The results showed that these cpgenomes had similar structures but different lengths of the corresponding regions (Fig. 2). Two genes, rps19 and ndhF, were found most frequently at the border regions. The rps19 genes were located at the border of the LSC and IRb in C. spicatus, S. miltiorrhiza, S. miltiorrhiza f. alba, and T. grandis. In contrast, the rps19 gene was located in the LSC of the E. brevicornu cpgenome and in the IRb of the C. deserticola cpgenome. More interestingly, a small fragment of the rps19 gene was found at the border of IRa and LSC in the Salvia species and C. deserticola. No rps19 or ycf1 gene fragments were found in the G. longituba cpgenome. The IR boundary and size analysis thus shows that the rps19 genes are distributed variously across the borders of the single-copy and inverted repeat regions [41].
In contrast, the ndhF genes were located at the border of IRb and SSC in the cpgenomes of S. miltiorrhiza [42], S. miltiorrhiza f. alba, and T. grandis, and within the SSC in the cpgenome of E. brevicornu. Most of the ycf1 genes were located at the border of SSC and IRa in the Salvia species and T. grandis, while they were also located at the border of SSC and IRb in the cpgenomes of C. spicatus, E. brevicornu, G. longituba, and T. grandis. In addition, the rpl2, trnH, and rpl22 genes were located at the IR and LSC border areas in all cpgenomes except that of G. longituba. Lastly, the psbA genes were located in the cpgenomes of the Salvia species, T. grandis, and C. spicatus (Fig. 2).
The phylogenetic tree showed that all 13 species other than the two outgroup species, E. brevicornu and D. polystachya, were clustered together. The maximum likelihood (ML) bootstrap values were fairly high: seven nodes had values of 100%, and another four nodes had values ≥ 87%. The 6 Lamiales species, E. ulmoides, and D. asper were clustered together [44]. In contrast, the 4 Polygonaceae species and C. corylifolium from the Fabaceae family were clustered together. C. deserticola formed its own branch, possibly because it is a parasitic plant. The two outgroup species, E. brevicornu and D. polystachya, were more distantly related to the others (Fig. 3 and Table S8).

Discussion
We sequenced and assembled a C. spicatus chloroplast genome for the first time, characterized its structure, and identified many unique features. A total of 131 genes were predicted, including 87 protein-coding genes, 36 tRNA genes, and 8 rRNA genes. Five hypervariable regions were identified. The IR border analysis revealed a complicated pattern of IR contraction and expansion, suggesting that the IR border areas undergo continuous recombination. Codon usage analysis identified a significant preference. In summary, these unique features can be used to understand the evolution and functions of the C. spicatus chloroplast.
We chose 15 species with common medicinal value for phylogenetic analysis. Most of them have been used to treat renal and urinary diseases, such as acute and chronic nephritis, cystitis, urinary calculi, lithangiuria, nephropyelitis, diuresis, premature ejaculation, soreness of the waist, and skelalgia. As reported, Fructus Psoraleae, the dry mature fruit of C. corylifolium, has been used clinically to treat kidney yang deficiency-induced impotence due to its strong estrogen-like activity [45]. Long-term administration of D. polygonoides can induce changes in AST and ALT activities in renal and hepatic functions [46]. A polysaccharide from the roots of D. asper can regulate renal complications through inhibition of AGE accumulation and RAGE expression in streptozotocin (STZ)-treated diabetic rats [47]. A comparison of the biological activities of the raw product and the wine steam-processed product of C. deserticola showed that both could restore the level of sex hormones in the kidney-yang deficiency model [48]. Emodin is the main active compound from the genus Rheum of the Polygonaceae family. It is reported that emodin can effectively treat proliferative glomerulonephritis and delay chronic renal failure [49]. The likely mechanism involves inhibition of the production of inflammatory factors such as IL-1, IL-6, TNF and MCIL-6. Lastly, Glechoma, the genus Salvia, Tectona grandis, and Herba Epimedii contain terpenoids, flavonoids, steroids, alkaloids, and organic acids with anti-oxidation and anti-inflammatory biological activities, and are used in the treatment of gallbladder disease, diuresis, dissolving stones, and male infertility [50-53]. Their similar clinical effects suggest that they might contain similar chemical components. This study was initiated to explore fully the therapeutic potential of these plant species.
Results obtained from this study have several applications. Firstly, the chloroplast transgenic expression system can efficiently express exogenous proteins. Several oral vaccine antigens and therapeutic proteins have been overexpressed using this system [54]. The cpgenome generated in this study can therefore be used to express endogenous and exogenous genes. In particular, it may help improve the expression of genes related to the biosynthesis of secondary metabolites in C. spicatus. Secondly, we have established the phylogenetic relationships of these 15 medicinal plant species that are potentially effective against kidney disease. In the future, we can profile the chemical compositions of these species. Through combined analysis of the phylogenetic relationships and the chemical profiles, we might identify additional therapeutic areas for these species beyond their traditional uses.

Conclusions
In the present study, we sequenced and assembled the C. spicatus cpgenome for the first time using Illumina sequencing technology. We have identified several unique features of the cpgenome, including codon usage preference, IR contraction and expansion, hypervariable regions, etc. The unique characteristics of the cpgenome provide valuable resources for subsequent genetic analysis and bioprospecting.