Assembly and comparative analysis of chloroplast genome of wheat K-CMS line and maintainer line CURRENT STATUS:

The sterile line is vital for the heterosis utilization of wheat (Triticum aestivum L.), and the cytoplasmic male sterility (CMS) line has more practical significance in the utilization of heterosis. To reveal the sterile mechanism of wheat K-CMS line K519A from the perspective of the chloroplast genome, the chloroplast genomes of the K519A and its same nuclear genotype maintainer line 519B were sequenced using second-generation high-throughput technology and assembled. Then the chloroplast genomic contents, SSR sequences, long repeat fragments, and qPCR were analyzed. The results showed that the total length of K519A and 519B were 136,996 bp and 136,235 bp, respectively. Both chloroplast genomes encode 126 genes, including 89 protein-coding genes, 8 rRNA genes, and 39 tRNA genes. There were 186 and 188 SSRs in K519A and 519B, respectively. And a developed SSR maker, which from atpF, can be used to distinguish the cytoplasmic type of Aegilops kotschyi and Triticum aestivum, respectively. There were 50 and 52 long repeat fragments, which come from the gene sequences, in K519A and 519B, respectively. A total of 107 mutations between K519A and 519B, which from 45 genes, were identified, of which, 16 genes (matK, rps16, rpoB, rpoC2, atpI, ndhK, atpB, rbcL, psbB, psbH, rpl14, ndhH, ndhF, rpl32, ccsA, ndhA) contained non-synonymous mutations. Further, the qPCR analysis was performed, and the gene expression levels of selected six genes for K519A compared to 519B were mostly downregulated at the binucleate and trinucleate stages of pollen development, thus, these non-synonymous mutant genes might affect the fertility of K519A.

of complete abortion, stable sterility, wide recovery source, good agronomic traits, so it has great application value. There have been some researches about K-CMS in physiology [26], cytology [27], protein expression [28] and gene function analysis [29]. For instance, Yang et al. [29] cloned and analyzed the functions of MADS-box genes, TaAG-A and TaAG-B, from a wheat K-CMS line. They found that TaAG-A and TaAG-B had relatively high expression levels in spikes. In the maintainer line, both genes showed downregulation during the uninucleate to the trinucleate stage. Then the TaAG-A and TaAG-B were silenced in fertile wheat lines using the VIGS method, resulting in green and yellow striped leaves, emaciated spikes, and decreased seed set rates. Zhang et al. [28] found a total of 41 protein spots differentially expressed through proteomic analysis of pollen in K-CMS line and maintainer, among which, all seven identified proteins were downregulated in CMS line, but there were no significant differences in the maintainer line at the uninucleate stage. The caffeic acid Omethyltransferase was downregulated in CMS and the other five proteins were upregulated in CMS at the binucleate stage, except glutamine synthetase isoform GS1a. However, genome sequence analysis of chloroplast of K-CMS in wheat has not been published.
K519A is a non-1B/1R K-CMS line, and 519B has the same nuclear genotype maintainer line of K519A.
Their nuclear genomes are derived from the same paternal species, but their cytoplasm genomes are from different maternal species. Thus, the differences in fertility come from the differences between cytoplasmic genomes. To understand the sterile mechanism of K519A, the chloroplast genomes of K519A and 519B were sequenced, assembled, annotated and compared in this study. Then the nonsynonymous mutant sites of coding genes between K519A and 519B were verified, and the expressed patterns were analyzed. Besides, SSR, long repeat sequences, and phylogeny were analyzed. This study can provide an important reference value for the sterile mechanism study of K519A and its genetic relationship.
2. Materials And Methods 2.1 Material planting are provided by our laboratory (Northwest A&F University, Yangling, Shaanxi, China). After pollens matured, the opening angle of the glumes of K519A become larger than 519B, the anthers stretch out from the glume ( Fig. 1-A1, B1). And the anther is wizened, short, and crooked, and the pollens can't spread out from the anther ( Fig. 1-A2). After pollens matured, the anthers of 519B are plump and cracked, and amount of fertile pollens spread out from the anthers ( Fig. 1-B2). The test materials were planted in the experimental field of the Northwest A&F University (108°4'E, 34°16'N) on October 3, 2018. The two materials were planted in 4 rows (25 cm between rows), respectively, with 15 seeds in 1 m. Planting with general field management measures. Young leaves were stored in -80℃ refrigerator in March 2019.

CTAB extraction of DNA and chloroplast genome sequencing
The extraction of genomic DNA from young leaves of K519A and 519B wheat was by the CTAB method. The whole-genome DNA sequences were obtained using second-generation high-throughput technology, with the Illumina HiSeq XTM Ten sequencing platform and paired-end 150 bp sequencing strategy. Firstly, the 10G original sequencing data for each sample was obtained, then low-quality reads and connector sequences are removed. And the clean data was used for further genome assembly of the chloroplast.

Genome assembly and annotation
The chloroplast assembly was carried out by the assembly method of Hahn et al. [30]. The assembly steps are as follows: Firstly, the clean data is spliced, using the SPAdes [31] software, setting default parameters except that the cut off parameter is not selected. The clean data are spliced into scaffolds. Using the published wheat chloroplast DNA sequences and protein-coding gene sequences as a reference, then sequences are aligned using Blastn and Exonerate, respectively. The evalue threshold of DNA alignment is 1e − 10 , and the protein similarity threshold is 70%, respectively. Using PRICE and MITObim to splice the collected fragmented sequences to extend scaffolds. The number of iterations generally is 50. The original sequencing reads are aligned to scaffolds using bowtie2, then the matched paired reads are picked out and spliced by SPAdes. Look at the path to see if a distinct circle diagram can be formed. If so, the circle genome is extracted, otherwise, the above 3-5 steps are repeated until a circular genome is formed.
The organelle genome annotation is mainly divided into three parts: protein-coding gene annotation, RNA annotation, and structural annotation. Three methods are commonly used for protein-coding gene prediction. The DNA sequences or the protein-coding sequences directly align in NCBI to confirm the genes. Submitting sequences to the online service annotation tools DOGMA (http://dogma.ccbb.utexas.edu/) and CpGAVAS [32], predicting the ORFs, and then annotating genes in the nr database. In combination with the three methods and choose the most accurate annotation results.
Chloroplast tRNA annotations are performed using tRNAscan-SE online site (http://lowelab.ucsc.edu/tRNAscan-SE/). The rRNA annotations are performed using the RNAmmer 1.2 Server (http://www.cbs.dtu.dk/services/RNAmmer/), in combination with homologous alignments, and the boundaries correcting. After the sequence annotation is completed, the structure note is edited by Sequin to generate a submission file that can be submitted to the GenBank database. Submit the file to OGDRAW (https://chlorobox.mpimp-golm.mpg.de/OGDraw.html) to draw the annotated map.

Codon preference analysis of chloroplast gene
A codon is a link between genetic information from DNA and protein, and every three adjacent codons encode an amino acid. In all organisms in nature, there are only 20 amino acids in total, however, each amino acid corresponds to at least one or up to six codons. Synonymous codons mean that different codons can encode the same amino acid [33]. The analysis of codon preference is performed using the CHIPS program in the EMBOSS software package [34].

Simple sequence repeat (SSR) and long repeat sequences analysis
The Microsatellite identification tool (MiSa) [35] is used for finding simple sequence repeat (SSR) in the genome. We used this software to detect SSR with eight times for minimal repeat number of mononucleotides, five times for dinucleotides, and at least three times for three or more bases.
The polymorphism SSR markers at the same position in the same gene of K519A and 519B were selected. Primer Premier 6.0 software was used to design SSR primers to verify the polymorphism between K519A and 519B.
Using REPuter online software [36] to find long repeat sequences. All options: forward, reverse, complement, and palindromic are selected for "Match Direction". Set the parameter "Hamming distance" to 3, "maximum computed repeats" to 50, and "minimal repeat size" to 15.

Gene alignment analysis and phylogenetic analysis
We used DNAMAN software to align the homology of each coding gene in K519A and 519B, then we selected all non-synonymous mutant sites to verify the accuracy of second-generation sequencing results. According to the company's sequencing sequence, Primer Premier 6.0 software is used to design primers of non-synonymous mutant sites, then PCR products were sand to the company for first-generation sequencing (FGS). The DNAMAN software was used to align the gene sequences and amino acids of K519A and 519B.
The genome-wide alignment of chloroplast genomes A, B and other 100 chloroplasts of near-source species were performed using the HomBlocks (https://github.com/fenghen360/HomBlocks) software with the trimAl trim method. Using the NJ method to build the evolutionary tree with the 1000 bootstrap value.

RNA extraction and qPCR
The extraction of RNA from anthers using the TaKaRa miniBEST Plant RNA Extraction Kit (Takara NO.9769). The DNase I reaction solution was used to effectively remove genomic DNA during RNA extraction so that the extracted RNA generally does not contain genomic DNA. Then, agarose gel electrophoresis was used for quality testing. The total RNAs were reverse-transcribed into cDNA using PrimeScript RT Reagent Kit (Takara). The TB GreenTM Premix Ex TaqTM II (Tli RNaseH Plus) Kit (Takara Biological Engineering) was used for the qPCR to study gene expression patterns with three biological repeats. The target gene-specific primers were designed using Premier6.0 software. qPCR was performed with Q7 Real-Time PCR instrument produced by Life Technologies, USA. Using Actin as the reference gene and the expressions of target genes were calculated by the 2 −ΔΔCT method.

Data Availability Statement
All data of chloroplast genomes generated in this study have been deposited in NCBI GenBank, with accession number MN605257 (K519A) and MN605258 (519B).

Genome content and characteristics
Like the chloroplast genomes of other higher plants, the two chloroplast genomes in this experiment contained two inverted repeats, IRA and IRB, which divided the entire genome into four parts, and the remaining regions were the large single-copy region (LSC) and the small single-copy region (SSC). The total length of the chloroplast genome of K519A was 136,996 bp, and the lengths of the four parts were 81,136 bp (LSC), 12,776 bp (SSC), and 21,542 bp (IRA and IRB) (Fig. 2). The total length of the 519B was 136,235 bp, and the lengths of the four parts were 80,335 bp (LSC), 12,796 bp (SSC), and 21,552 bp (IRA and IRB) (Fig. 3). The GC contents of genomes were 38.27% (K519A) and 38.28% (519B) respectively. The structure of the two chloroplast genomes was different, such as the expansion and contraction of the boundaries of the four parts, and the sequence insertion of some regions.
The regions of exons and introns for each gene were predicted, and the codon preferences were analyzed based on the sequence of exons. The codon preferences of K519A and 519B were different.
And the codon preferences of genes in the K519 and 519B were shown in Table 2 and Table 3, respectively. Among them, both Met and Trp were encoded by only one codon. Ile and stop codons were encoded by three. Ala, Gly, Pro, Thr, and Val were encoded by four. Arg, Leu, and Ser were encoded by six kinds.

Discovery of SSRs
The SSR markers of chloroplasts have been applied for the studies of population genetics, evolution processes and phylogeny [37,38]. In this study, we found 186 SSR repeat motifs in K519A (Table 4) and 188 in 519B (Table 5). Of SSRs, the mononucleotide repeats were the most abundant compared to polynucleotide repeats, and the number of A/T repeat motifs was significantly higher than the number of C/G, which was consistent with the results of six legume plastid genomes [39]. In K519A, the proportion of single nucleotide repeat motifs was 68.28% and was 67.02% in 519B. The SSR number composed of dinucleotide was 8, of which, 6 AT/TA repeat motifs and 2 TC in K519A. The SSR number composed of dinucleotide was 9, of which, 7 AT/TA repeat motifs and 2 TC in 519B. Among the trinucleotides, AAC/TTC most frequently appeared. And the SSR number of the remaining repeat motifs type appeared only once.
After analysis, it was found that there was one SSR marker with the polymorphic difference between K519A and 519B from the five coding genes, respectively. Thus, PCR was performed after primers were designed for the five sites (Table 6), and the bands of SSR markers from the atpF gene were polymorphic between K519A (200 bp) and 519B (197 bp) according to the polyacrylamide gel electrophoresis (Fig. 4).

Discovery of Long repeat sequences
Besides to SSR analysis, we also analyzed long repeats of the chloroplast genome. Long repeat sequences are special DNA sequences that occur repeatedly in the genome and usually occupy a large proportion in the genome. In general, the proportion of repeat sequences is positively related to the size of the genome. For example, the Arabidopsis genome is 120 Mb, and the repeat sequences account for 10%. The wheat genome is 17,000 Mb, and the repeat sequences account for 81% [40,41]. Most repeat sequences exist in the intergenic region, and a small number of repetitive sequences exist in the gene coding region [42]. Repetitive fragments have important molecular significance for plant evolution research [43]. In the K519A, a total of 50 long repeats were found, and the sizes were 20-286 bp. Of the 50 repeats, the positive repeats, palindrome repeats, and inverted replicates were 37, 11, and 2, respectively. The longest repeat was the forward repeat with 286 bp in length (Table 7). In 519B, a total of 52 long repeats were found, the sizes of which were also 20-286 bp, which was consistent with K519A, of which, positive repeats, palindrome repeats, and inverted repeats were 39, 10, and 3, respectively (Table 8).

Gene alignment analysis between K519A and 519B
A comparative analysis of the K519A and 519B revealed that 107 SNPs were identified in the proteincoding sequence in 45 genes (Table 9). Among them, 75 were synonymous mutations, 32 were nonsynonymous mutations, the proportion of non-synonymous mutations was 29.9% of the total mutations. Of non-synonymous mutations, the RNA polymerase-related genes are the most, and six genes occurred mutations. And followed by the ATP synthase-related genes and the NADH dehydrogenase-related genes, with five mutations occurred, respectively. Among the 45 genes that had mutated, 29 gene sequence differences were synonymous mutations that the encoded amino acid sequences were identical. The base conversion occurred 73 times and the transversion occurred 34 times. Among the 126 genes, the gene with the largest difference between K519A and 519B sequence was the rpoC2 gene, and there were 12 SNPs with 5 amino acid differences. The number of tRNA and rRNA genes was completely consistent between the K519A and 519B chloroplast genomes, respectively.

The verification of non-synonymous mutant sites by first-generation sequencing
To verify the accuracy of second-generation sequencing results, 32 non-synonymous mutant sites from 16 genes, based on the results of the above alignment analysis, were identified via FGS technology. These genes included matK, rps16, rpoB, rpoC2, atpI, ndhK, atpB, rbcL, psbB, psbH, rpl14, ndhH, ndhF, rpl32, ccsA, and ndhA. The corresponding primers were listed in Table 10. Then, we designed primers for all non-synonymous mutant sites and performed FGS analysis to compare the sequences of the FGS with those of the second-generation sequencing, and it was concluded that the amino acids of 16 genes were all consistent under two methods, excepting some mutated bases.
Some comparison results were listed in Fig. 5

Analysis of qPCR
To further study the effect of non-synonymous mutant genes in the chloroplasts, we performed a qPCR analysis to assess the expression patterns of genes. According to the existing research results, we selected six non-synonymous mutant genes (atpB, ccsA, matK, rbcL, rpoC2, and ndhH) for analysis of gene expression levels of K519A compared with 519B. The corresponding primers of qPCR were listed in Table 11.
It can be seen from the figures that the expression levels of all genes were downregulated at the binucleate and trinucleate stages, except that the rbcL gene was downregulated only at the trinucleate stage. And the rpoC2 gene was upregulated at the late uninucleate stage (Fig. 6).

Phylogenetic analysis of K519A and 519B
The cluster analysis and evolutionary tree of the chloroplast genome of K519A, 519B, and other 100 species that have near relatives were performed. The Bromus vulgaris was used as the outgroup. It can be concluded that the chloroplasts of K519A and 519B were from two independent taxa, the chloroplast of 519B was closely related to the Triticum timopheevii, Aegilops spelta L., and other Triticum species. However, K519A was closely related to the Aegilops genus and other Aegilops species (Fig. 7), which was consistent with the characteristic that K519A's cytoplasm was derived from Aegilops kotschyi. This is consistent with the characteristics of the materials themselves, the K519A has the cytoplasm of Aegilops genus, 519B has a common wheat cytoplasm.

Discussion 4.1 Plant chloroplast genomics research and phylogenetic analysis
The chloroplast genome sequencing technology has developed from the FGS to the current thirdgeneration sequencing [44][45][46]. Currently, most of the chloroplast genome sequences on NCBI are performed by second-generation sequencing technology [47,48]. Due to the fast improvement of sequencing technology, results in that the sequencing costs continue to decrease, so chloroplast genome data is largely supplemented. In the past few decades, chloroplasts studies have made a breakthrough. More than 2400 plant chloroplast genomes have been published on the NCBI database.
Of which, the chloroplast genome of tobacco is the first genome to be sequenced in higher plants [45]. The chloroplast genome encodes 100-120 genes, including 70-88 protein structural genes, 30-32 tRNA genes, and 4 rRNA genes. According to the functional classification of chloroplast genes, they can be divided into three categories: the first are genes encoding photoreactive structural proteins, such as psaA, psbB, petD, etc. The second are the coding genes of energy metabolism-related enzymes, such as atpA, ndhB, rbcL, and other genes. The third are the encoding genes of the transcriptional translation system, mainly including the ribosome large subunit encoding rpl family, the ribosome small subunit encoding rps family, the transport RNA encoding trn family, and transport RNA polymerase encoding rpoA family and so on. Besides, there are some genes whose functions have not yet been determined, such as the ycf family [49]. Researches have shown that many genes of plant chloroplast are involved in some important biological processes. For instance, Zhong et al. [50] demonstrated that chloroplast heat shock protein HSP21 is involved in the plastid-encoded RNA polymerase (PEP)-dependent transcription in Arabidopsis thaliana. Yu et al. [51] found that the downregulated expression of ribosomal protein S1 (RPS1) in rps1 mutants negatively modulates the expression of heat-responsive gene HsfA2 and its target genes in Arabidopsis thaliana.
Our study has completed the sequencing of the chloroplast genome of the K-type CMS K519A and maintainer 519B. This is similar to the chloroplast genome structure of most plants, which is a typical four-segment structure. The chloroplast genome of angiosperms belongs to the maternal inheritance and is generally not affected by genetic recombination. The chloroplast genome is much smaller than the nuclear genome and there are more copies in normal cells, so the chloroplast genome is easy to obtain. And the structure and coding genes of the chloroplast genome are relatively conserved [52,53]. In recent years, with the continuous completion of plant chloroplast genome sequencing, phylogenetic researches based on chloroplast genomes or genes have been greatly promoted.
Kallersjo et al. [54] utilized the rbcL chloroplast gene to analyze the phylogenetic relationship of 2538 species, covering the prokaryotic cyanobacteria to higher flowering plants, using parsimony jackknife analysis method. Bruneau et al. [55] analyzed the evolutionary relationship of 223 legume materials by analyzing the introns of the chloroplast trnl gene. Kim [56] sequenced the chloroplast genome of Panax schinseng Nees and analyzed the evolutionary relationship about vascular plants. Liu [57] analyzed and reported the chloroplast genome of Morella rubra and the evolution patterns of Fagales, proved that Myricaceae is a sister to Juglandaceae.
When analyzing evolutionary relationships, the sequencing depth and coverage, annotation methods, gene losses, and pseudogenizations possibly affect the results [58]. It is not difficult to see from the successful application of next-generation sequencing technology in the acquisition of chloroplast genomic data in recent years that the emergence of this new technology will greatly promote the development of chloroplast phylogenetic analysis, and the phylogenetic research of plants will have a bright prospect [59,60].

SSR molecular marker
Molecular marker technology is to detect the polymorphism of DNA sequences based on PCR, which can directly compare the mutations of DNA sequences. With the continuous improvement of molecular biology technologies, it is becoming more and more mature and is applied to many research directions [61,62]. Among the molecular markers, the SSR marker has high variability, codominant inheritance, easy operation and easy analysis [63]. It is widely used in genetic mapping construction, gene mapping, genetic diversity analysis and variety identification of various crops.
Such as, Stachel et al. [3] successfully analyzed the genetic diversity of 60 wheat cultivated varieties originating from three agro-ecological zones using microsatellite markers. Salameh et al. [64] used a molecular marker to transfer the resistant QTL (Fhb1) on the 5A chromosome into nine European winter wheat lines to develop new scab-resistant lines. Besides, Shim et al. [65] sequenced the whole chloroplast genome of millet and compared it with other reported chloroplast genomes, and obtained 125 SSR loci and 34 indels changes, which can provide effective information for the phylogeny. In our study, SSR polymorphic analysis between the K519A and 519B on the chloroplast genes was performed. Among identified 5 SSR polymorphic markers, the SSR marker from atpF was deeply demonstrated by polyacrylamide gel electrophoresis. Thus, this SSR marker can be used for the chloroplast and cytoplasmic type identification for the Aegilops kotschyi and Triticum aestivum L.

Analysis of chloroplast gene related to cytoplasmic male sterility
Some nuclear male sterility genes have been identified. For example, Xing et al. [66] cloned the gene TaAPT2 related to thermosensitive sterility in wheat, and the gene expression levels of the young anthers, which under fertility conversion stage, were significantly reduced under the sterile condition by Northern analysis, indicating that the TaAPT2 gene may be related to the fertility. Xia et al. [67] confirmed that the male-sterile ms2 mutant in wheat was caused by an insertion of terminal-repeat retrotransposons in miniature (TRIM) element in the promoter region of Ms2 gene. In addition to this, some studies have shown that the deletion, duplication or mutation of plant chloroplast DNA may cause the wrong transmission of genetic information between chloroplast, mitochondrial and nuclear, leading to male sterility [68,69].
Some studies have found that some chloroplast genes in plants may affect the fertility of plants. In this study, it was found that some genes had non-synonymous mutant sites between K519A and 519B, and these genes might be related to male sterility. For example, the matK gene is located in the intron of the chloroplast trnK gene. Jia et al. [70] extracted DNA of abortive flower buds from radish male sterile line BT-18 and normal flower buds using DD-PCR technology to study the differential expression of RNA between abortive and normal buds in Radish. And it was found that the matK gene appeared three times in different primers of PCR amplification, and the amplification bands were enhanced in abortive buds, which showed upregulated expression, indicating that the abortion of radish buds had a great relationship with matK gene. Ou et al. [71] studied the intron sequences of ORFs and rps16 genes in chloroplasts of different CMS lines in rice. And they found that in the gametophyte sterility line, the rps16 gene intron contains a GMTGAG sequence and a unique G at position 595, which can be used as a molecular marker to distinguish sporophytic sterility and gametophyte sterility.
The RNA polymerase is a key enzyme controlling chloroplast genomic transcription and gene expression. It is encoded by the copy genes rpoA, rpoB, rpoC1 and rpoC2 genes [72]. The α subunit and β subunit of RNA polymerase are encoded by rpoA and rpoB genes respectively and the β' and β" subunit by rpoC1 and rpoC2 genes. Chen et al. [73] found a new chloroplast DNA deletion in the sorghum CMS line, which occurs in the middle of the rpoC2 gene, which may be associated with cytoplasmic male sterility in sorghum. We infer that the non-synonymous mutations of rpoB and rpoC2 genes may affect the production of β and β" subunits of RNA polymerase.
In K519A, the expression levels of atpB, ccsA, matK, rbcL, rpoC2, and ndhH genes were mostly downregulated at the binucleate and trinucleate stages. Previous studies found that the key periods of pollen abortion in non-1B/1R and 1B/1R wheat male-sterile lines were binucleate and trinucleate stages [74]. Therefore, the result of this study was coincident with the previous research [26].

Chloroplast genome fragment transfer phenomenon
In the chloroplast of higher plants, a phenomenon is that the chloroplast gene can be transferred to the nucleus, which is called chloroplast genome fragment transfer. The phenomenon has been found in tobacco [64]. However, the genes in the nucleus are not transferred to the chloroplast [66].
Besides, the chloroplast gene can also be transferred to mitochondria, and the genes that are transferred to mitochondria often evolve into non-functional pseudogenes [75]. We speculate that the combination of different types of cytoplasm and nucleus may destroy the intrinsic balance between chloroplast, nucleus, and mitochondria, which may lead to the formation of cytoplasmic male sterility.

Conclusion
In this study, we obtained and analyzed two chloroplast genomes of CMS line K519A and maintainer line 519B in wheat, which encoded 126 genes, respectively. Based on a comparative analysis of K519A and 519B, a total of 107 mutations were found in 45 genes. 16

Conflicts of Interest
The authors declare no conflicts of interest.