Characterization of the complete chloroplast genome sequence and phylogenetic analysis of B. oleracea var. italica

Background: Broccoli (Brassica oleracea var. italica L.) is known as one of the most nutritionally rich vegetables and is rich in functional components that benefit health. The main purposes of this research were to sequence, assemble and annotate the chloroplast (cp) genome of broccoli using the Illumina HiSeq2500 sequencing platform.
Results: The size of the broccoli cp genome is 153,364 bp, including two inverted repeat (IR) regions of 26,197 bp each, separated by a small single copy (SSC) region of 17,834 bp and a large single copy (LSC) region of 83,136 bp. The GC content of the complete genome is 36.36%, while those of the SSC, LSC and IR regions are 29.1%, 34.15% and 42.35%, respectively. The genome harbors 134 functional genes, including 87 protein-coding genes, 39 tRNAs and 8 rRNAs, with 31 duplicates in the IRs. The most abundant amino acid in the protein-coding genes is leucine, while the least abundant is cysteine. Codon usage frequency showed a bias for A/T-ending codons in the cp genome. In the repeat structure analysis, a total of 34 repeat sequences and 291 simple sequence repeats (SSRs) were detected. Although cp genomic structure and size are highly conserved, the SC-IR boundary regions vary among the 7 cp genomes compared. The phylogenetic relationships based on complete cp genomes from 9 species suggest that B. oleracea var. italica is closely related to Brassica juncea.
Conclusions: The complete cp genome sequence of broccoli was obtained and annotated for the first time. The information acquired from this research will be useful for further species identification, population genetics and biological research of broccoli.
[...] codon-anticodon recognition patterns of the cp genome indicated that 29 tRNAs contained codons corresponding to the 20 essential amino acids for protein biosynthesis.
The AT content at the first, second and third codon positions was 55.3%, 62.53% and 71.21%, respectively. Moreover, of all 65 codons, the RSCU values of 31 codons were > 1, and most of those (90.3%) end with base A or U, whereas 33 codons had RSCU < 1, and most of those (90.91%) end with base C or G. Trp is encoded only by the UGG codon, which therefore shows no usage bias (RSCU = 1).
[...] 0–3 bp located in the SSC region. Overlaps between ycf1 and ndhF were detected at the JSB boundary in all studied cp genomes, with lengths varying from 35 bp to 38 bp. The ycf1 gene crossed the JSA boundary except in the cp genome of B. juncea, and its length reflected changes in the JSA region. The non-coding tRNA gene trnH-GUG fell within the LSC region in all 7 species, at a distance of 3–30 bp from the JLA boundary. These results suggest that IR border shifts are relatively minor and involve only a small number of genes; the gene overlap lengths and the distance of trnH-GUG from the JLA boundary shift irregularly among species.

In the current research, we de novo sequenced and assembled the complete cp genome of broccoli on an Illumina HiSeq2500 sequencing platform, then analyzed its gene annotation and structure. We then identified a number of SSR markers in our new assembly and reconstructed the evolutionary relationships within the family to investigate the phylogenetic position of broccoli among several Brassicaceae species. The results are expected to improve understanding of the cp genome and to provide a theoretical basis for future scientific research on broccoli.

Repeat sequences and SSR analysis
A total of 34 repeat sequences, including 12 forward (F), 20 palindromic (P) and 2 reverse (R) repeats, were detected by REPuter in the broccoli cp genome (Table 5). Most of these repeats were between 30 and 47 bp long; the longest repeat, 26,197 bp in length, was located in the IR region. The LSC, SSC and IR regions harbored 20, 7 and 13 repeats, respectively. Most repeats were located in the intergenic spacers (IGS), ycf genes and intron sequences, whereas 13 repeats were located in psaA, psaB, rrn4.5, trnS-GGA, trnS-UGA, rps19 and trnS-UGA. Most of the detected complex SSRs were within noncoding regions: 15 were located in IGS regions, 2 in intron regions and one in the ycf1 gene. Moreover, 144 repeats were located in different genes, and the remainder were all located in intergenic regions.
Table 7 Distribution of SSRs in the broccoli cp genome.

Phylogenetic analysis
cpDNA-based phylogenetic analysis has improved understanding of evolutionary relationships, population analysis and classification in rice species [25], the Brassica genus [26], Myrtales species [27] and Aristolochia species [3]. To investigate the taxonomic status and evolutionary relationships of broccoli, 9 complete cp genomes downloaded from the NCBI GenBank database were aligned, and a maximum likelihood (ML) tree was constructed with FastTree. The phylogenetic analysis corroborated the traditional taxonomy of the Brassicaceae with 100% bootstrap support (Figure 6). Specifically, 6 species, Alliaria grandifolia (NC_034286.1), Arabidopsis thaliana (NC_000932.1), Capsella bursa-pastoris (NC_009270.1), Braya humilis (NC_035515.1), Bunias orientalis (NC_036111.1) and Brassica nigra (NC_030450.1), were clustered into one group, Brassica napus (NC_016734.1) formed a separate group, and Brassica oleracea var. italica was found to be closely related to Brassica juncea (NC_028272.1). The complete cp genome reported in this paper provides valuable data for future research on broccoli chloroplasts, and could also be used in phylogeny, molecular marker development, DNA barcoding and conservation genetics.

Discussion
The cp genome of Brassica oleracea var. italica has the typical organization, with two identical IRs separating the SSC and LSC regions, and its gene content and order are identical to those of land plant cp genomes. [...] 153,363 bp)). The results obtained in this study indicated that GC content was not distributed evenly among the four regions. The GC content in the IR regions is higher than that in the other regions, possibly because of the high GC content of the four rRNA genes located there. GC content is commonly regarded as an important indicator of species affinity [28-31].
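As a quick consistency check on the figures reported in this paper, the region lengths and per-region GC contents can be combined into a length-weighted, genome-wide GC content. The sketch below is illustrative only: the dictionary layout and the function name `weighted_gc` are assumptions made for the example, and only the numbers are taken from the Results.

```python
# Region lengths (bp) and GC contents (%) as reported for the broccoli cp genome.
# The two IR copies are entered separately because each contributes 26,197 bp.
regions = {
    "LSC": (83_136, 34.15),
    "SSC": (17_834, 29.10),
    "IRa": (26_197, 42.35),
    "IRb": (26_197, 42.35),
}

def weighted_gc(regions):
    """Return (total length, length-weighted GC %) for a dict of regions."""
    total = sum(length for length, _ in regions.values())
    gc_bases = sum(length * gc / 100 for length, gc in regions.values())
    return total, 100 * gc_bases / total

total_bp, genome_gc = weighted_gc(regions)
print(total_bp, round(genome_gc, 2))
```

The four regions sum to 153,364 bp and the weighted GC content rounds to 36.36%, which agrees with the genome-wide values reported above.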
The broccoli cp genome contains 134 genes that are highly conserved in composition and arrangement, including self-replication genes, photosynthetic genes, other functional genes and some genes of unknown function, which is consistent with previous research [32]. Among the distinct genes, 25 genes contain one or two introns, and the gene trnR-UGG has the largest intron. Introns play crucial roles in the regulation of the expression of some genes [33]. They may enhance gene expression levels in specific situations and at specific positions [34][35]. Codon usage plays a key part in shaping cp genome evolution. Among the codons of the broccoli cp genome, the most and the least frequently used amino acids are leucine and cysteine, respectively, as reported in other angiosperm genomes such as Ananas comosus, Decaisnea insignis, Nasturtium officinale and M. zenii [36][37][38]. The broccoli cp genome codons preferred AT over GC, especially at the second and third positions (62.53% and 71.21%, respectively), which is consistent with results widely observed in many terrestrial species [39][40].
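For readers unfamiliar with the RSCU measure used in the codon analysis, it is the observed count of a codon divided by the count expected if all synonymous codons of that amino acid were used equally. The following minimal sketch illustrates the calculation; the function name and the toy counts are hypothetical and are not data from this study.

```python
from collections import defaultdict

def rscu(codon_counts, codon_to_aa):
    """Compute RSCU for each codon.

    RSCU = observed count / (total count of the amino acid / number of
    synonymous codons), so 1.0 means no bias within that codon family.
    """
    family = defaultdict(list)
    for codon, aa in codon_to_aa.items():
        family[aa].append(codon)
    values = {}
    for aa, codons in family.items():
        total = sum(codon_counts.get(c, 0) for c in codons)
        if total == 0:
            continue  # amino acid not observed; RSCU is undefined
        expected = total / len(codons)
        for c in codons:
            values[c] = codon_counts.get(c, 0) / expected
    return values

# Toy example (hypothetical counts): Cys has two codons, Trp has one.
codon_to_aa = {"UGU": "Cys", "UGC": "Cys", "UGG": "Trp"}
counts = {"UGU": 30, "UGC": 10, "UGG": 25}
result = rscu(counts, codon_to_aa)
print(result)
```

In this toy case the A/U-ending codon UGU has RSCU 1.5, the C-ending UGC has 0.5, and the single Trp codon UGG has exactly 1, which is why a one-codon amino acid can never show usage bias.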
Repeat analysis revealed 12 forward, 20 palindromic and 2 reverse repeats in the broccoli cp genome. Most of these repeats were located in intron sequences, intergenic spacers and ycf genes, but several occurred in CDS and tRNAs. Repeat sequences have been reported to contribute to sequence variation and genome rearrangement, and to occur at many rearrangement endpoints in rearranged algal and angiosperm genomes [41][42][43][44]. Because the organization of cp genome sequences is highly conserved, SSR primers for cp genomes can be transferred across genera and species. SSRs are widely used as molecular markers in genetic linkage map construction, population genetics, polymorphism research and plant breeding, and also play an important role in plant taxonomy [45][46][47]. A total of 291 SSRs were obtained in this study; 196 (67.4%) belonged to the P1 type, of which 190 (65.3% of all SSRs) were A and T repeat units, while TA and AT repeats made up all of the P2 type. These findings agree with the results of other studies [30,31,36]. Several complex SSRs were also detected; they might be formed by two or more individual simple sequence repeats that lie adjacent to each other and are separated by a certain length of sequence [7,48].
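The idea of scanning a sequence for perfect SSRs of the P1 and P2 types discussed above can be sketched with regular expressions. This is not the tool or the parameter set used in this study: the function name `find_ssrs`, the minimum repeat thresholds and the example sequence are all assumptions chosen only for illustration.

```python
import re

# Minimum repeat counts per motif length; hypothetical thresholds for illustration.
MIN_REPEATS = {1: 8, 2: 5, 3: 4}

def find_ssrs(seq, min_repeats=MIN_REPEATS):
    """Return (start, motif, repeat_count) for perfect SSRs in seq (0-based start)."""
    seq = seq.upper()
    hits = []
    for motif_len, min_rep in min_repeats.items():
        # ([ACGT]{k}) captures a motif; \1{n,} requires at least n more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (motif_len, min_rep - 1))
        for m in pattern.finditer(seq):
            motif = m.group(1)
            # Skip motifs that are repeats of a shorter unit, e.g. "AA" or "ATAT".
            if motif_len > 1 and any(
                motif == motif[:k] * (motif_len // k)
                for k in range(1, motif_len) if motif_len % k == 0
            ):
                continue
            hits.append((m.start(), motif, len(m.group(0)) // motif_len))
    return sorted(hits)

example = "GC" + "A" * 10 + "CG" + "AT" * 6 + "GGC"
print(find_ssrs(example))
```

For the example sequence this reports one mononucleotide SSR, (A)10 at position 2, and one dinucleotide SSR, (AT)6 at position 14. Complex SSRs, which combine neighbouring simple repeats, would need an additional merging step that is not shown here.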
Plant cp genomes are regarded as highly conserved, but their sizes and LSC/IRb/SSC/IRa boundaries change through contraction and expansion at the borders of the IR regions [49][50]. Our results indicated that the IR border variations among the 7 species arise mainly because the borders fall at different positions relative to 4 genes, rps19, ycf1, ndhF and trnH-GUG, which agrees with the results of previous studies [36][51][52].

Conclusions
In this research, we assembled the complete cp genome of broccoli using the Illumina HiSeq platform. Annotation and comparison of the obtained data with the reference allowed us to identify and verify 134 functional genes, including 47 RNA genes and 87 protein-coding genes. Codon usage was biased toward A/T-ending codons. The repeat sequences and SSRs detected in this work could be used for molecular marker development and phylogenetic analysis. Phylogenetic reconstruction based on complete cp genomes showed that B. oleracea var. italica is closely related to Brassica juncea. The information presented in this paper will facilitate further biological research and genetic engineering of broccoli.

Chloroplast genome assembly and annotation
High-quality clean reads were generated by trimming and filtering out low-quality reads and sequencing adapters with Trimmomatic v 0.3649. The clean reads were mapped onto the available cp genome reference of B. oleracea var. capitata (NCBI accession: KX681654.1) using Bowtie2 [53] with its default parameters and preset options. All of the cp-like reads were assembled into contigs by SPAdes [54]. The resulting contigs were then aligned again to the B. oleracea var. capitata reference using the BLAST algorithm. The contigs and mate-pair reads were scaffolded with the SSPACE program [55] until a circular genome was formed.
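The mapping and assembly steps described above could be scripted roughly as follows. This is a sketch under stated assumptions rather than the commands actually run in this study: it assumes paired-end FASTQ input, assumes that cp-like reads are taken to be the pairs that align concordantly to the reference (kept with the Bowtie2 `--al-conc` option), and uses hypothetical file names. The function only builds the command lines; trimming, the BLAST check and SSPACE scaffolding are not shown.

```python
import shlex

def build_commands(reference_fa, reads_1, reads_2, out_dir):
    """Return command lines for mapping cp-like reads and assembling them."""
    index = f"{out_dir}/cp_ref"
    return [
        # Build a Bowtie2 index from the reference cp genome.
        ["bowtie2-build", reference_fa, index],
        # Map reads with default settings; --al-conc writes pairs that align
        # concordantly, with % replaced by 1 or 2 for the two mates.
        ["bowtie2", "-x", index, "-1", reads_1, "-2", reads_2,
         "--al-conc", f"{out_dir}/cp_like_%.fq", "-S", f"{out_dir}/mapped.sam"],
        # Assemble the retained cp-like reads into contigs with SPAdes.
        ["spades.py", "-1", f"{out_dir}/cp_like_1.fq",
         "-2", f"{out_dir}/cp_like_2.fq", "-o", f"{out_dir}/spades"],
    ]

# Hypothetical input names, used only to show the resulting command lines.
commands = build_commands("reference.fa", "reads_1.fq", "reads_2.fq", "out")
for cmd in commands:
    print(shlex.join(cmd))
```

Keeping the commands as argument lists makes them easy to pass to `subprocess.run` once the tools are installed, without shell quoting problems.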

Declarations
Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Funding
ZCZ gratefully acknowledges support from the National Natural Science Foundation of China (NSFC, grant No. 31401871) and the Zhenjiang Science and Technology Innovation Fund Project (No. NY2019001). The funders did not take part in the design of the study, in the assembly, annotation and analysis of data, or in writing the manuscript.

Authors' contributions
ZCZ and ZLD assembled and annotated the chloroplast genome, analysed the data and wrote the draft paper; YMY and YFP prepared and sequenced DNA libraries; SGS and XS designed and coordinated the research and finalized the paper. All authors read and approved the final manuscript.