General features of the chloroplast genome and organization
The complete chloroplast (cp) genome of X. spinosum is 152,422 bp in length. It shows the typical quadripartite structure, with two inverted repeat regions (IRa and IRb; 25,075 bp each) separated by a small single-copy (SSC) region (18,083 bp) and a large single-copy (LSC) region (84,189 bp) (Fig. 1). The cp genome encodes 115 unique genes, including 80 protein-coding genes, 31 transfer RNA (tRNA) genes and 4 ribosomal RNA (rRNA) genes. Six protein-coding, six tRNA and four rRNA genes are duplicated in the IR regions. The overall GC content of the cp genome is 37.4%, while that of the LSC, SSC and IR regions is 35.4%, 31.2% and 43%, respectively (Table 1).
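The reported region lengths and gene counts can be cross-checked with simple arithmetic. The Python sketch below uses only the figures quoted above; the variable names are illustrative and not part of the original analysis.

```python
# Region lengths (bp) as reported for the X. spinosum cp genome.
LSC = 84_189
SSC = 18_083
IR = 25_075  # length of one inverted repeat copy (IRa or IRb)

# A quadripartite genome consists of LSC + SSC + two IR copies.
total = LSC + SSC + 2 * IR
print(total)  # 152422, matching the reported genome length

# Unique genes: protein-coding + tRNA + rRNA.
unique_genes = 80 + 31 + 4
print(unique_genes)  # 115
```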
Comparative analysis of the Xanthium sp.
The LSC/IRb and SSC/IRa borders of the X. spinosum cp genome were compared with those of three closely related species of Heliantheae: X. sibiricum, Ambrosia artemisiifolia and Parthenium argentatum [20, 21] (Fig. 2). An intact copy of rps19 spans the LSC/IRb border in X. spinosum, A. artemisiifolia and P. argentatum and extends 95-119 bp into the IRb region, so that the rpl2 gene is located in the IRb region. In contrast, the rps19 gene of X. sibiricum is shifted completely into the LSC region, 71 bp away from the IRb region, although the rpl2 gene is present at the LSC/IRb border. In addition, a partial rps19 fragment of 154-175 bp is present at the IRa border in all four species. The pseudogene ψycf1 is present at the IRa/SSC border in X. spinosum, whereas it is located either in the IRb region (X. sibiricum and A. artemisiifolia) or in the SSC region (P. argentatum) in the other cp genomes. The ndhF gene lies entirely within the SSC region in all four cp genomes. The intact ycf1 gene crosses the SSC/IRa border in all the cp genomes except that of P. argentatum, with a 565-583 bp fragment of ycf1 located in the IRa region. P. argentatum, however, encodes two copies of ψycf1. The trnH gene is found in the LSC region, ~0-118 bp away from the IRa/LSC border, in all the cp genomes.
The genomic sequences of the four Heliantheae species were compared using mVISTA to detect sequence variation (Fig. 3). The level of sequence divergence was not uniform across the genomes. The plot revealed that non-coding regions were more divergent than coding regions, and that the IR regions were less divergent than the LSC and SSC regions in all the cp genomes.
Repeat structure and SSRs analysis
Repeat sequences in the X. spinosum and X. sibiricum cp genomes were analyzed and compared. The X. spinosum cp genome contains 264 forward, 256 palindromic, 251 reverse and 228 complement repeats. By contrast, a markedly different number of repeats was found in X. sibiricum, which contains 18 forward, 15 palindromic, six reverse and two complement repeats (Fig. 4a). In total, X. spinosum and X. sibiricum contain 999 and 41 repeats, respectively. Of the 999 repeats in X. spinosum, repeats of 30-39 bp predominate (983), and the longest repeat, a palindromic sequence, is 115 bp. Similarly, in X. sibiricum 34 repeats are 30-39 bp long and the longest, also a palindromic sequence, is 177 bp (Fig. 4b).
A total of 701 and 705 simple sequence repeats (SSRs) were identified in the X. spinosum and X. sibiricum cp genomes, respectively. Of the 701 SSRs in X. spinosum, 247 (35.24%) were mono-nucleotide repeats, 30 (4.3%) di-nucleotide repeats, 58 (8.3%) tri-nucleotide repeats, 67 (9.6%) tetra-nucleotide repeats, 80 (11.4%) penta-nucleotide repeats, 112 (15.98%) hexa-nucleotide repeats and 31 (4.42%) 7-nucleotide repeats, while repeats of 8 to 27 nucleotides accounted for 10.84% (76 repeats) (Fig. 5a). Similarly, in X. sibiricum, 250 (35.46%) were mono-nucleotide repeats, 28 (3.97%) di-nucleotide repeats, 63 (8.94%) tri-nucleotide repeats, 74 (10.5%) tetra-nucleotide repeats, 81 (11.49%) penta-nucleotide repeats, 114 (16.18%) hexa-nucleotide repeats and 32 (4.54%) 7-nucleotide repeats, while repeats of 8 to 21 nucleotides accounted for 8.94% (63 repeats). Furthermore, the distribution of SSRs among the LSC, IR and SSC regions was evaluated: X. spinosum and X. sibiricum contain 483 and 481 SSRs in the LSC, 91 and 93 in the IR and 127 and 131 in the SSC regions, respectively (Fig. 5b). Likewise, SSRs were analyzed in the protein-coding sequences (CDS), introns and intergenic regions (IGS): the two genomes contain 244 and 252 SSRs in the CDS, 69 and 69 in the introns and 388 and 384 in the IGS, respectively (Fig. 5c).
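The SSR percentages follow directly from the class counts divided by the genome-wide total. As a minimal sketch, the Python snippet below recomputes them for X. spinosum from the eight motif-length classes whose counts sum to the reported total of 701, and checks that the per-region counts add up to the same total. The dictionary keys and variable names are illustrative only.

```python
# SSR counts by motif length for X. spinosum, as reported in the text.
# "8-27" pools all motif lengths from 8 to 27 nucleotides.
ssr_counts = {
    "mono": 247, "di": 30, "tri": 58, "tetra": 67,
    "penta": 80, "hexa": 112, "7-nt": 31, "8-27": 76,
}

total = sum(ssr_counts.values())

# Percentage of each class, rounded to two decimals.
percentages = {k: round(100 * v / total, 2) for k, v in ssr_counts.items()}

for motif, pct in percentages.items():
    print(f"{motif:>5}: {ssr_counts[motif]:>3} ({pct}%)")

# Per-region counts (LSC, IR, SSC) should sum to the same total.
region_counts = {"LSC": 483, "IR": 91, "SSC": 127}
region_total = sum(region_counts.values())
print(total, region_total)
```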
Nucleotide diversity analysis
Nucleotide diversity was analyzed for 208 regions, comprising 79 protein-coding genes and 129 intergenic and intron regions, between the two Xanthium cp genomes (X. spinosum and X. sibiricum) using DnaSP. Among protein-coding genes, the most variable region was infA (0.034188) (Fig. 6a). Among intron and intergenic regions, the highly variable regions were trnH-psbA (0.047739), psbA-trnK (0.057143), trnK exon2-matK (0.092857), psbI-trnS (0.046667), ycf3-trnS (0.0683376), trnF-ndhJ (0.209402), ndhC-trnV (0.12551), trnV intron (0.073604), petD-rpoA (0.051813), infA-rps8 (0.181818), rpl14-rpl16 (0.046729), rpl16-rps3 (0.032258), psaC-ndhD (0.086207) and trnL-rpl32 (0.080882) (Fig. 6b; Table 2).
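When only two sequences are compared, nucleotide diversity reduces to the proportion of aligned sites at which the two sequences differ. The Python sketch below illustrates that calculation on toy sequences. It is not the DnaSP implementation: the function name `pairwise_diversity` is hypothetical, and skipping columns with gaps or `N` is an assumption here, so DnaSP's handling of gaps and missing data may differ.

```python
def pairwise_diversity(seq_a: str, seq_b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap ('-') or an ambiguous base ('N') in either
    sequence are skipped (an assumption of this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")

    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "-N" or b in "-N":
            continue  # skip gap or ambiguous columns
        compared += 1
        if a != b:
            differences += 1

    if compared == 0:
        raise ValueError("no comparable sites in the alignment")
    return differences / compared


# Toy example: one difference across ten comparable sites.
print(pairwise_diversity("ATGCATGCAT", "ATGCTTGCAT"))  # 0.1
```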
Synonymous (KS) and nonsynonymous (KA) substitution rate analysis
Synonymous and nonsynonymous substitution rates were evaluated for 79 protein-coding genes of the X. spinosum and X. sibiricum cp genomes. The KA/KS ratio was less than 1 for nearly all genes; the exception was accD (1.56) (Fig. 7).
Positive selection analysis of the accD gene
Positive selection on the accD protein-coding gene was investigated across the cp genomes of Heliantheae species. The ω2 value of accD was 3.70208 under the M2a model. In addition, Bayes empirical Bayes (BEB) analysis under the M7 vs. M8 comparison was used to locate sites under selection in accD; it identified one site under potentially positive selection with a posterior probability greater than 0.95 and one site with a posterior probability greater than 0.99 (Table 3). The 2ΔLnL value was 25.91159 (Table 4).
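A 2ΔLnL statistic is usually judged against a chi-square distribution in a likelihood ratio test. The text does not state the degrees of freedom used, so the Python sketch below assumes 2 degrees of freedom, the value conventionally used for the M7 vs. M8 and M1a vs. M2a comparisons. For 2 degrees of freedom the upper-tail probability has the closed form exp(-x/2), so no statistics library is needed. The function name `lrt_p_value_df2` is hypothetical.

```python
import math


def lrt_p_value_df2(two_delta_lnl: float) -> float:
    """Upper-tail chi-square probability for 2 degrees of freedom.

    For df = 2 the chi-square survival function is exp(-x / 2).
    """
    if two_delta_lnl < 0:
        raise ValueError("2*delta(lnL) must be non-negative")
    return math.exp(-two_delta_lnl / 2)


# Reported statistic; df = 2 is an assumption of this sketch.
p = lrt_p_value_df2(25.91159)
print(f"p = {p:.2e}")  # far below the usual 0.05 threshold
```

Under this assumption the null model would be rejected, which is consistent with the positively selected sites reported by the BEB analysis.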
Phylogenetic analysis
A total of 79 protein-coding genes from 20 cp genome sequences were used to infer the phylogenetic relationships among closely related species of Heliantheae, with Achyrachaena mollis (NC_036504) as the outgroup. The maximum likelihood tree was constructed from the concatenated 79 cp protein-coding genes. The topology of the tree showed that X. spinosum is closely related to the species of Ambrosia (Fig. 8). The analysis also showed that Parthenium is sister to the clade comprising Xanthium and Ambrosia and is an early-diverging lineage of subtribe Ambrosiinae, although with weak bootstrap support (54%).