3.1 chloroplast genome assembly, organization, and features of Benincasa hispida
The paired-end sequencing of Benincasa hispida by Illumina HiSeq 4000 generated around 14.6 GB raw data with 82.6 million 150 bp reads. We de novo assembled its complete chloroplast genome and the data was submitted to NCBI under accession number MW362306 after a throughout check of correctness. As showing in Table 1 and Fig. 1, the size of its complete chloroplast genome is 156,758 bp in length, presenting a typical quadripartite structure with a large single copy region (LSC, 86,538 bp), a small single copy region (SSC, 18,060 bp) and two inverted repeat regions (IRa/b, 26,080 bp each).
Table 1
Chloroplast genome general features of Benincasa hispida.
Characteristics | Benincasa hispida |
Size (base pair, bp) | | 156,758 |
LSC length (bp) | | 86,538 |
SSC length (bp) | | 18,060 |
IR length (bp) | | 26,080 |
Number of genes | | 131 |
Number of protein-coding genes | | 86 |
Number of tRNA genes | | 37 |
Number of rRNA genes | | 8 |
Duplicate genes | | 18 |
GC content | Total (%) | 37.2 |
| LSC (%) | 35 |
| SSC (%) | 31.7 |
| IR (%) | 42.9 |
| CDS (%) | 37.9 |
| rRNA (%) | 55.2 |
| tRNA (%) | 53.2 |
| ALL gene % | 39.4 |
Protein coding part (CDS) (% bp) | | 51.1 |
All gene (% bp) | | 71.6 |
Non-coding region (% bp) | | 28.4 |
The cp genome of B. hispida had 131 genes (Table 2), including 86 protein-coding genes, 37 tRNA genes and 8 rRNA genes, 18 of which were duplicated genes (7 protein-coding genes, 7 tRNA genes and 4 rRNA genes). The total GC content of cp genome was 37.2%, with the IR regions having the highest GC content at 42.9%, followed by LSC (35%) and SSC (31.7%). In terms of the GC content of different gene types, the number of rRNA (55.2%) and tRNA (53.2%) was relatively high, and that of CDSs was 37.9%. In total, 18 genes contained introns, 16 of which (10 protein-coding genes and 6 tRNA genes) contained one intron and two CDSs (ycf3 and clpP1) possessed two introns (Table S1). Among these genes, 17 genes were duplicated in IR regions except one trans-splicing gene, which was observed in rps12 gene with 5’ end located in LSC region and 3’ end duplicated in IR regions. The truncation event was observed in ycf1 gene that started from IRa region and ended at the SSC region, leaving a 100 bp truncated copy in the IRb region.
Table 2
Genes predicted in the chloroplast genome of Benincasa hispida. The number of asterisks after the gene names indicates the number of introns contained in the genes.
Caregory of Genes | Group of genes | Gene name |
photosynthesis related genes | Large subunit of rubisco | rbcL |
Photosystem I | psaA, psaB, psaC, psaI, psaJ |
Assembly/srability of photosystem I | ycf3**, ycf4 |
Photosystem II | psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK,psbL, psbM, psbN, psbT, psbZ |
ATP synthase | atpA, atpB, atpE, atpF*, atpH, atpI |
Cytochrome b6/f complex | petA, petB*, petD*, petG, petL, petN |
Cytochrome c synthesis | ccsA |
NADH dehydrogenase | ndhA*, ndhB*, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK |
Transcription and translation related genes | RNA polymerase subunits / transcription | rpoA, rpoB, rpoC1*, rpoC2 |
Small subunit of ribosomal proteins | rps11, rps12*(*2), rps14, rps15, rps16*, rps18, rps19, rps2, rps3, rps4, rps7 (*2), rps8 |
Large subunit of ribosomal proteins | rpl14, rpl16*, rpl2*(*2), rpl20, rpl22, rpl23(*2), rpl32, rpl33, rpl36 |
translation initiation factor | infA |
RNA genes | ribosomal RNA | rrn16 (*2), rrn23 (*2), rrn4.5 (*2), rrn5 (*2) |
transfer RNA | trnA-UGC* (*2), trnR-ACG (*2), trnR-UCU, trnN-GUU (*2), trnD-GUC, trnC-GCA,trnQ-UUG,trnE-UUC, trnG-GCC, trnG-UCC*, trnH-GUG, trnI-CAU (*2), trnI-GAU* (*2), trnL-CAA (*2), trnL-UAA*, trnL-UAG, trnK-UUU*, trnfM-CAU, trnM-CAU, trnF-GAA, trnP-UGG, trnS-GCU, trnS-GGA, trnS-UGA, trnT-GGU, trnT-UGU, trnW-CCA, trnY-GUA, trnV-GAC (*2), trnV-UAC* |
Other genes | RNA processing | matK |
carbon metabolism | cemA |
fatty acid synthesis | accD |
proteolysis | clpP1** |
component of TIC complex | ycf1 (*2) |
hypothetical proteins | ycf2 (*2) |
* Gene with one intron, **Gene with two introns, (*2) Gene with two copies. |
3.2 Codon usage, and amino acid frequencies
The complete cp genome of Benincasa hispida contained 80,109 bp of coding sequences (CDSs) that encoded 86 genes, having 26,703 codons that fit in 64 codon types. The result of amino acid frequency analysis showing that Leucine with 10.5% occurrence was the most abundant amino acid followed by Isoleucine with 8.5%. The number of Cysteine with only 1,1% abundance was the least occurred amino acid.
Relative synonymous codon usage (RSCU) of four species was also calculated, presenting a high codon bias of A or T bases. The distribution of codon usage showing that codons ended with A or T had RSCU > 1 except GGT (Glycine, 0.96), AGT (Serine, 0.9), and CGT (Arginine, 0.68), revealing that codons ended with A or T were preferred while codons ended with C or G were non-preferred. Among all three stop codons, TAA with 64% abundance was the most frequent one (Table S2).
3.3 Putative RNA editing site within Benincaseae
RNA editing event is typical in the cp genome of most land plants and essential in understanding the chloroplast genome at the transcript level. Out of that purpose, we determined the RNA editing site in the cp genome of four species from Benincaseae. In the cp genome of Benincasa hispida, PREP-web found 58 putative RNA editing sites in 21 CDS genes (Table S3a). Among these genes, ndhB gene with 13 editing sites was determined to be the most variant gene, followed by ndhD (8 sites) and rpoB (5 sites). We also found that 81% of all RNA editing events occurred at the second nucleotide position of codons while none of these events were located in the third codon position.
Moreover, these RNA editing events result in post-transcriptional substitutions, causing amino acid conversions. In the group of these conversions, fifteen-four out of fifteen six RNA editing sites led to hydrophobic products, comprising Phenylalanine (9), Isoleucine (5), Leucine (32), Methionine (2), Valine (4), Tryptophan (2). Four exceptions led to hydrophilic (neutral) amino acid products, including Cysteine (1), Tyrosine (2), and Serine (1). What is more, Serine to Leucine was found to be the most abundant post-transcriptional substitutions with 41.82% of all RNA editing events, followed by Proline to Leucine (14.55%) and Serine to Phenylalanine (7.27%). Worth mentioning, two RNA editing events were detected that transformed ACG (Thr) to initiation codon AUG, resulting in the start of translation in ndhB and ndhD gene.
As shown in Table S3b, the total number of RNA editing sites detected was 57 in Citrullus lanatus, 55 in Lagenaria siceraria and Citrullus colocynthis. All patterns mentioned above showed high consistency in all four species analyzed with only minor differences in terms of numerical values.
3.4 Repeated sequence and SSR analysis
In this study, we analyzed microsatellites or simple sequence repeats (SSRs) in cp genome of Benincasa hispida, Citrullus lanatus, Lagenaria siceraria and Citrullus colocynthis using MISA-web (Beier et al., 2017) and high similarity was revealed among four species. We found that B. hispida contained the most abundant number of SSRs (238) while C. lanatus, with only 219 SSRs, had the least. In the cp genome of B. hispida, most of the SSRs were mononucleotide (42%), varying from 9 to 15 repeat units. Meanwhile, the abundance of dinucleotide was only 25%, which was slightly lower than that of trinucleotide (30%). The frequency of tetranucleotide and pentanucleotide were only 3% and 0.42%, and that of hexanucleotide repeats was absent in all species (Fig. 2C). Moreover, most of the mononucleotide repeats were A/T motifs while AT/TA motifs comprised 68% of dinucleotide repeats (Table S4).
We also analyzed the distribution of SSRs in two types of different regions, specifying in LSC/IR/SSC regions and intergenic spacer (IGS) /gene regions. According to the result, most of the repeats were located in LSC region, varying from 136 in C. lanatus to 148 in B. hispida. Second by SSC region (38 in B. hispida) and IR regions. Noticeably, the SSC number in IR regions in all species was 26 except L. siceraria (24), implementing that IR regions were more conserved than LSC and SSC regions (Fig. 2A). IGS regions were determined to be highly abundant of SSRs in comparison with gene regions. We found 125 SSRs within 46,150 bp IGS regions while 116 SSRs in 112,281 bp gene regions, meaning the density of SSRs in IGS regions was 2.62 times of that of gene regions (Fig. 2B). And similar results were present in all species.
Oligonucleotide repeat sequences were also analyzed using REPuter program (Kurtz et al., 2001). to detect the abundance of four types oligonucleotide repeats, including forward (F), palindromic (P), reverse (R), and complementary (C). Though minor variations presented about the total number of oligonucleotide repeats, the distribution of four types of repeats and the size of repeats presented an obvious resemblance. In terms of the number of oligonucleotide repeats and its distribution in each type, we found 42 repeats (F = 16, P = 22, R = 4) in cp genome of B. hispida; 41 (F = 16, P = 21, R = 4) in C. lanatus; 46 (F = 14, P = 26, R = 4) in L. siceraria and 42 (F = 14, P = 23, R = 5) in C. colocynthis (Fig. 2A). The length of repeats was mostly found between 20 to 24 bp (Fig. 2C). The palindromic repeats were the most abundant repeats followed by forward repeats, whereas the number of reverse repeats was few. None of the species had complementary repeats. We also located the region of each oligonucleotide repeats; significant consistency was presented among four species. The number was exactly the same in all species regarding the repeats located in the IRs regions (6) and some shared sequences, including sequences between LSC and IRa/b (4), between SSC and IRa/b (2), and from IRb to IRa crossing SSC (5, Fig. 2B).
3.5 IR contraction and expansion
The genome length of chloroplast ranged from 159,758 bp (B. hispida) to 157,147 bp (C. colocynthis). Besides, in the cp genome of B. hispida, the length of IR regions was the shortest with 260,080 bp while that of SSC region was the longest with 180,060 bp (Table S5). Thus, we inferred that the variation in size of cp genome was contributed by the expansion and contraction of IR regions with the evidence followed (Fig. 3). The junction sites between each region were denoted as: JLB (IRb /LSC), JSA (SSC/IRa), JSB (IRb/SSC), and JLA (IRa/LSC). All eight species analyzed presented functional ycf1 gene and six of which were at JSA while the other two were located in SSC region completely. Moreover, the ycf1Ψ (pseudo copy) was only presented in two species (B. hispida and L. siceraria) at JSB and were 3 bp and 25 bp into SSC region respectively. The ndhF gene was revealed in all species in JSB with the same length (2246 bp) and relatively consistent position with only few bp into IRb region, except C. hystrix with 21 bp and B. hispida (completely located in SSC region).
The rpl2 gene was found close to JLB while that of two species (C. moschata and C. lanatus) were into LSC region with 11 bp and 6 bp respectively. At the same time, the duplicate rpl2 gene were absent in the same specific two species. The rps19 gene was the most variant gene among all genes that close to the IR junction. In four species, rps19 gene were 2 bp into IRb region and the left four were completely in LSC region.
3.6 Divergence analysis of chloroplast genome
To identify the diversity in the chloroplast genomes of four Benincaseae species, we visualized the percent of identity between sequences and colored regions of high conservation using mVISTA program (Frazer et al., 2004). As showing in Fig. 4, the sequences varied remarkably among different regions. Firstly, most of the differences were located in the LSC and SSC regions while IR regions were almost identical among four species except rps12 gene, revealing that IR regions were more conserving than LSC and SSC regions. Moreover, IGS regions revealed remarkably divergent than gene regions. Notable divergent non-coding regions including: trnR-UCU – atpA, atpH – atpI, trnT-GGU – psbD, trnL-UAA –trnF-GAA, accD – pasI, ndhF – rpl32. While genes such as ycf1, ycf2, accD, psbA, ccsA, ndhF and matK were found to be highly divergent coding genes among all.
Ka/Ks ratio is an essential index to identify a mutation from neutral, purifying, and beneficial. Thus, we compared B. hispida with C. colocynthis, C. lanatus, and L. siceraria respectively to analyze the synonymous substitutions (Ks), non-synonymous substitutions (Ka) and their ratio (Ka/Ks) of 73 PCGs (Table S6). Among all, 18 genes could not be determined due to absent information (Ks = 0). After deleting these genes as well as non-substitution results, we found that genes carrying out photosynthesis functions revealed Ka/Ks = 0 or at relatively low values, indicating that these groups of genes were fairly conserved. The Ka/Ks ratio of 26 genes was lower than 0.5 and that of 96% genes was lower than one, with only three expectations. The number of accD gene were relatively close to one (1.09, 0.85 and 0.92), signifying that the accD gene experienced neutral mutation. The Ka/Ks ratio of rpl36 gene was greater than one with absent information in L. siceraria. Outstandingly, the number of rpl22 gene was the greatest number that up to 2.41 and the ratio between the smallest and Ka/Ks greatest number was 0.5. Thus, we could infer that a beneficial mutation and rapid development had occurred to the rpl22 gene among four species in Benincaseae.
To get a holistic understanding of sequence divergence, we performed slicing window analysis to visualize the nucleotide variability values of all cp genomes. We found that none of the \(\pi\)values of CDS genes exceed 0.05 and the IGSs revealed more divergent than gene regions, which result was in consistent with the previous analysis mentioned. It can be clearly seen in the figure that SSC and LSC regions were much more divergent than IR regions, the \(\pi\) value of which was remarkably low and mirror symmetrized with SSC as the center (Fig. 5).
3.7 Phylogenetic analysis
To locate the phylogenetic position of Benincasa hispida precisely, we selected 26 species and constructed two phylogenetic trees using the complete cp genome (Fig. 6a) and 73 selected CDSs (Fig. 6b) respectively. And the results all supported that B. hispida was closely related with Cucumis, Citrullus, and Lagenaria as their sister group with fairly high bootstrap values. The phylogenetic relationship results of two approaches presented highly consistency with two main variations. Firstly, in general, the bootstrap values in the tree that applied complete cp genome revealed higher than tthe tree constructed with 73 CDSs (Fig. 6b). In addition, Begoniaceae was a sister group with Coriariaceae and Corynocarpaceae according to Fig. 6a while in Fig. 6b, Coriariaceae and Corynocarpaceae were the early-diverging lineages of Begoniaceae. However, only 82 bootstrap values support the former situation (Fig. 6a) while 94 support the second (Fig. 6b).