General features of the cp genome and its organization
The complete cp genome of X. spinosum was 152,422 bp in length and showed a typical quadripartite structure: two inverted repeats (IRa and IRb; 25,075 bp each) separated by a small single-copy (SSC) region (18,083 bp) and a large single-copy (LSC) region (84,189 bp) (Fig. 1). The cp genome encoded 115 unique genes, including 80 PCGs, 31 transfer RNA (tRNA) genes, and 4 ribosomal RNA (rRNA) genes. Six protein-coding, six tRNA, and four rRNA genes were duplicated in the IR regions. The overall GC content of the cp genome was 37.4%, whereas the GC contents of the LSC, SSC, and IR regions were 35.4%, 31.2%, and 43%, respectively (Table 1).
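The reported lengths and GC contents can be cross-checked with simple arithmetic. The Python sketch below is illustrative only. It takes 25,075 bp as the length of each IR copy, the reading under which the four regions sum to the reported total. It also assumes, as a convention not stated in the text, that the sequence starts at the first base of the LSC and runs LSC, IRb, SSC, IRa. All names in the code are hypothetical.

```python
# Region lengths (bp) and GC contents (%) as reported in the text.
# Assumption: 25,075 bp is the length of EACH inverted repeat copy.
REGIONS = [
    ("LSC", 84_189, 35.4),
    ("IRb", 25_075, 43.0),
    ("SSC", 18_083, 31.2),
    ("IRa", 25_075, 43.0),
]

def region_coordinates(regions):
    """Return {name: (start, end)} using 1-based inclusive coordinates.

    Assumes the regions are laid out consecutively in the given order,
    starting at position 1 (a convention assumed here, not taken from the text).
    """
    coords, start = {}, 1
    for name, length, _gc in regions:
        coords[name] = (start, start + length - 1)
        start += length
    return coords

total_length = sum(length for _, length, _ in REGIONS)
coords = region_coordinates(REGIONS)

# Length-weighted mean of the regional GC contents.
weighted_gc = sum(length * gc for _, length, gc in REGIONS) / total_length

print("total length:", total_length)
print("weighted GC (%):", round(weighted_gc, 1))
for name, (start, end) in coords.items():
    print(name, start, end)
```

Under these assumptions the four regions sum to the reported genome length, and the length-weighted regional GC contents agree with the overall value to one decimal place.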
Comparative analyses of Xanthium species
The LSC/IRb and SSC/IRa borders of the X. spinosum cp genome were compared with those of three closely related Heliantheae species, namely X. sibiricum, Ambrosia artemisiifolia, and Parthenium argentatum [28, 29] (Fig. 2). An intact copy of the rps19 gene was present at the LSC/IRb border of X. spinosum, A. artemisiifolia, and P. argentatum, together with a shared 95 to 119 bp sequence in the IRb region adjacent to the rpl2 gene. By contrast, the rps19 gene of X. sibiricum lay entirely within the LSC region, 71 bp from the IRb region, although the rpl2 gene was present at the LSC/IRb border. In addition, a 154–175 bp fragment of the rps19 gene was present in all four species in the IRa/LSC junction region or at the border itself. ψycf1 was located at the IRa/SSC border in X. spinosum, whereas it was located in the IRb region or silenced in the SSC region in X. sibiricum and A. artemisiifolia, and in the SSC region in P. argentatum. The entire ndhF gene was present in the SSC region of all four cp genomes. Similarly, an intact ycf1 gene was present in the SSC/IRa region of all cp genomes analyzed except that of P. argentatum, which had a 565 to 583 bp fragment of ycf1 in the IRa region. However, the P. argentatum cp genome contained two copies of ψycf1. In all four cp genomes, the trnH gene was located in the LSC region, 0 to 118 bp from the IRa/LSC border.
The cp genome sequences of the four Heliantheae species were compared using mVISTA to detect sequence variation (Fig. 3). Sequence divergence differed markedly among regions: non-coding regions were more divergent than coding regions, and the IR regions were less divergent than the LSC and SSC regions in all cp genomes.
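Regional divergence of this kind is commonly visualized as percent identity in sliding windows along a pairwise alignment. The Python sketch below illustrates that idea on toy data. It is not the mVISTA implementation, and the function name, window size, step, and gap handling are all assumptions made for illustration.

```python
def window_identity(seq_a, seq_b, window=100, step=50):
    """Percent identity per window for two aligned, equal-length sequences.

    Columns where either sequence has a gap ('-') count as mismatches.
    Returns a list of (window_start, percent_identity) with 0-based starts.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        return []
    results = []
    for start in range(0, max(len(seq_a) - window, 0) + 1, step):
        a = seq_a[start:start + window]
        b = seq_b[start:start + window]
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, 100.0 * matches / len(a)))
    return results

# Toy alignment (hypothetical data): one mismatch in the first window.
profile = window_identity("ACGTACGTAC", "ACGTTCGTAC", window=5, step=5)
print(profile)
```

Windows with lower identity correspond to more divergent stretches, which is how non-coding and single-copy regions would stand out against the IR regions.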
Repeat structure and SSR analyses
Repeat sequences in the X. spinosum and X. sibiricum cp genomes were identified and compared. The X. spinosum cp genome contained 264 forward, 256 palindromic, 251 reverse, and 228 complement repeats, whereas that of X. sibiricum contained 18 forward, 15 palindromic, 6 reverse, and 2 complement repeats (Fig. 4a), giving totals of 999 and 41 repeats, respectively. Among the 999 repeats identified in X. spinosum, those 30–39 bp in length (983) were predominant, and the longest repeat was a 115 bp palindromic sequence. Similarly, in X. sibiricum, 34 repeats were 30–39 bp in length, and the longest was a palindromic sequence of 177 bp (Fig. 4b).
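The four repeat classes differ in how the second copy relates to the first. The Python sketch below illustrates the usual definitions: a forward repeat is an identical copy, a reverse repeat is the reversed sequence, a complement repeat is the base-wise complement, and a palindromic repeat is the reverse complement. It is not the repeat-detection software used in the analysis, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def repeat_partners(unit):
    """Return the sequence a second copy must equal for each repeat class."""
    return {
        "forward": unit,                                  # identical copy
        "reverse": unit[::-1],                            # reversed order
        "complement": unit.translate(COMPLEMENT),         # complemented bases
        "palindromic": unit.translate(COMPLEMENT)[::-1],  # reverse complement
    }

def classify_pair(unit, other):
    """List every repeat class under which `other` pairs with `unit`."""
    return [kind for kind, seq in repeat_partners(unit).items() if seq == other]

# Counts per class as reported in the text for the two cp genomes.
X_SPINOSUM = {"forward": 264, "palindromic": 256, "reverse": 251, "complement": 228}
X_SIBIRICUM = {"forward": 18, "palindromic": 15, "reverse": 6, "complement": 2}

print(classify_pair("AACG", "CGTT"))
print(sum(X_SPINOSUM.values()), sum(X_SIBIRICUM.values()))
```

The per-class counts sum to the reported totals for both species.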
In total, 701 and 705 simple sequence repeats (SSRs) were identified in the X. spinosum and X. sibiricum cp genomes, respectively. The 701 SSRs in the X. spinosum cp genome comprised 247 (35.24%) mono-nucleotide, 30 (4.3%) di-nucleotide, 58 (8.3%) tri-nucleotide, 67 (9.6%) tetra-nucleotide, 80 (11.4%) penta-nucleotide, 112 (15.98%) hexa-nucleotide, and 31 (4.42%) 7-nucleotide repeats, plus 76 (10.84%) other repeats ranging from 8 to 27 nucleotides (Fig. 5a). Similarly, the X. sibiricum cp genome contained 250 (35.46%) mono-nucleotide, 28 (3.97%) di-nucleotide, 63 (8.94%) tri-nucleotide, 74 (10.5%) tetra-nucleotide, 81 (11.49%) penta-nucleotide, 114 (16.18%) hexa-nucleotide, and 32 (4.54%) 7-nucleotide repeats, plus 63 (8.94%) repeats ranging from 8 to 21 nucleotides. By region, the X. spinosum and X. sibiricum cp genomes contained 483 and 481 SSRs in the LSC, 91 and 93 in the IR, and 127 and 131 in the SSC regions, respectively (Fig. 5b). By sequence type, they contained 244 and 252 SSRs in protein-coding sequences (CDS), 69 and 69 in introns, and 388 and 384 in intergenic spacer (IGS) regions, respectively (Fig. 5c).
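To illustrate how perfect SSRs of a given motif length can be located, the Python sketch below scans a sequence with a regular expression. The minimum copy numbers in the example are hypothetical because the thresholds used in the analysis are not given here, and the function is not the SSR-detection tool that produced the counts above.

```python
import re

def find_ssrs(seq, min_repeats):
    """Find perfect SSRs in `seq`.

    `min_repeats` maps motif length -> minimum number of tandem copies
    (hypothetical thresholds chosen by the caller).
    Returns a list of (start, motif, copies) with 0-based starts.
    """
    hits = []
    for motif_len, min_copies in sorted(min_repeats.items()):
        # A motif of `motif_len` bases followed by at least min_copies - 1 more copies.
        pattern = re.compile(
            r"(([ACGT]{%d})\2{%d,})" % (motif_len, min_copies - 1)
        )
        for m in pattern.finditer(seq):
            motif = m.group(2)
            # Skip motifs that are repeats of a shorter unit (e.g. "AA" is really "A").
            if any(motif == motif[:k] * (motif_len // k)
                   for k in range(1, motif_len) if motif_len % k == 0):
                continue
            hits.append((m.start(), motif, len(m.group(1)) // motif_len))
    return hits

# Toy sequence (hypothetical): a 10-copy A run and a 5-copy AT run.
toy = "GG" + "A" * 10 + "CC" + "AT" * 5 + "GG"
ssrs = find_ssrs(toy, {1: 10, 2: 5})
print(ssrs)
```

Counting hits by motif length and by the region in which each start position falls would yield summaries of the kind reported in Fig. 5.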
Nucleotide diversity analyses
Nucleotide diversity was analyzed with DnaSP for 208 regions of the X. spinosum and X. sibiricum cp genomes, comprising 79 PCGs and 129 intergenic and intron regions. Among the PCGs, the most variable was infA (0.03) (Fig. 6a). Among the introns and intergenic regions, high variability was observed for trnH-psbA (0.05), psbA-trnK (0.06), trnK exon2-matK (0.09), psbI-trnS (0.05), ycf3-trnS (0.07), trnF-ndhJ (0.21), ndhC-trnV (0.13), trnV intron (0.07), petD-rpoA (0.05), infA-rps8 (0.18), rpl14-rpl16 (0.05), rpl16-rps3 (0.03), psaC-ndhD (0.09), and trnL-rpl32 (0.08) (Fig. 6b; Table 2).
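Nucleotide diversity (π) is the average number of pairwise differences per site. When only two sequences are compared, it reduces to the proportion of aligned sites that differ. The Python sketch below shows the calculation on toy data. The exclusion of columns containing gaps or ambiguous bases is an assumption of this sketch, and the code is not DnaSP.

```python
from itertools import combinations

def nucleotide_diversity(aligned_seqs):
    """Average pairwise differences per site (pi) for aligned sequences.

    Columns containing anything other than upper-case A, C, G, or T in any
    sequence are excluded (complete deletion; an assumption of this sketch).
    """
    columns = [col for col in zip(*aligned_seqs)
               if all(base in "ACGT" for base in col)]
    if not columns or len(aligned_seqs) < 2:
        return 0.0
    pairs = list(combinations(range(len(aligned_seqs)), 2))
    diffs = sum(col[i] != col[j] for col in columns for i, j in pairs)
    return diffs / (len(pairs) * len(columns))

# Toy two-sequence alignment (hypothetical): 1 difference over 10 sites.
pi = nucleotide_diversity(["ACGTACGTAC", "ACGTTCGTAC"])
print(pi)
```

Applying such a function to each region's alignment in turn gives one π value per region, which is the form of the values listed above.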
Synonymous (KS) and nonsynonymous (KA) substitution rate analyses
The synonymous and nonsynonymous substitution rates were evaluated for 79 PCGs in the X. spinosum and X. sibiricum cp genomes. The KA/KS ratios of nearly all genes were less than 1, except for the PCG accD (1.56) (Fig. 7).
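The KA/KS ratio is conventionally read against a threshold of 1: values below 1 suggest purifying selection, values near 1 suggest neutral evolution, and values above 1 suggest positive selection. The short Python sketch below encodes that reading. The function names are hypothetical, and only the accD ratio is taken from the text.

```python
def ka_ks_ratio(ka, ks):
    """Return KA/KS, or None when KS is zero and the ratio is undefined."""
    return None if ks == 0 else ka / ks

def selection_regime(ratio):
    """Conventional interpretation of a KA/KS ratio."""
    if ratio is None:
        return "undefined"
    if ratio > 1:
        return "positive"
    if ratio < 1:
        return "purifying"
    return "neutral"

# Ratio reported in the text for accD.
print(selection_regime(1.56))
```

A ratio above 1 on its own is only suggestive, which is why a gene flagged in this way is usually examined further with codon-based likelihood models.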
Positive selection analyses of the accD gene
Positive selection on the accD gene in Heliantheae cp genomes was investigated in EasyCodeML using site-specific models with four comparisons (M0 vs. M3, M1 vs. M2a, M7 vs. M8, and M8a vs. M8) and a likelihood ratio test (LRT) threshold of p ≤ 0.05. Among these models, M2a is the positive selection model, in which p0, p1, and p2 are the proportions of sites under negative (purifying), neutral, and positive selection, respectively. The ω2 value of the accD gene was 3.70 under the M2a model. In addition, Bayes empirical Bayes (BEB) analysis under the M7 vs. M8 comparison was used to locate sites under selection in accD: one site was potentially under positive selection with a posterior probability greater than 0.95, and another with a posterior probability greater than 0.99 (Table 3). The 2ΔLnL value was 25.91 and the p-value of the LRT was 0 (Table 4).
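The LRT compares twice the log-likelihood difference between nested models with a chi-square distribution. The Python sketch below computes the p-value for the reported 2ΔLnL, assuming 2 degrees of freedom, the value conventionally used for the M7 vs. M8 comparison. Both the degrees of freedom and the function name are assumptions of this sketch, not details taken from the analysis.

```python
import math

def lrt_p_value_df2(two_delta_lnl):
    """P-value of a likelihood ratio test against chi-square with 2 d.f.

    For 2 degrees of freedom the chi-square survival function has the
    closed form exp(-x / 2), so no statistics library is needed.
    """
    if two_delta_lnl < 0:
        raise ValueError("2*delta(lnL) must be non-negative")
    return math.exp(-two_delta_lnl / 2.0)

# 2*delta(lnL) as reported in the text; d.f. = 2 is an assumption.
p_value = lrt_p_value_df2(25.91)
print(f"{p_value:.2e}")
```

Under this assumption the p-value is on the order of 10^-6, far below the 0.05 threshold, and it would appear as 0 when rounded to a few decimal places.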
Phylogenetic analyses
In all, 71 PCGs from 21 cp genome sequences were used to infer phylogenetic relationships among closely related species of Heliantheae, with Ligularia fischeri (MG729822) as the outgroup. A maximum likelihood tree was constructed from the 71 concatenated PCGs. The genus Xanthium was closely related to the genus Ambrosia (Fig. 8). Parthenium was sister to the clade comprising Xanthium and Ambrosia and was an early-diverging lineage of the subtribe Ambrosiinae, although with weak bootstrap support (57%).
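Concatenating per-gene alignments into a single matrix is a routine step before tree inference. The Python sketch below shows one simple way to do it on toy data. The data layout, the gap padding for missing genes, and the function name are assumptions made for illustration and do not describe the pipeline used here.

```python
def concatenate_alignments(gene_alignments):
    """Join per-gene alignments into one concatenated matrix.

    `gene_alignments` is a list of {taxon: aligned_sequence} dicts, one per
    gene. A taxon missing from a gene is padded with '-' over that gene's
    length. Returns {taxon: concatenated_sequence}.
    """
    taxa = sorted({taxon for aln in gene_alignments for taxon in aln})
    parts = {taxon: [] for taxon in taxa}
    for aln in gene_alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a gene must share one length")
        gene_len = lengths.pop()
        for taxon in taxa:
            parts[taxon].append(aln.get(taxon, "-" * gene_len))
    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}

# Toy input (hypothetical taxa and sequences): two genes, one missing in taxon B.
matrix = concatenate_alignments([
    {"A": "ATG", "B": "ATA"},
    {"A": "CC"},
])
print(matrix)
```

The resulting matrix has one row per taxon and the same number of columns for every taxon, which is the form maximum likelihood tree-building programs expect.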