Genome sequencing, assembly and annotation
For the de novo assembly of full-length nuclear, mitochondrial and chloroplast genomes of Dongxiang wild rice line 159 (DXWR159), one 500 bp and one 10 kb DNA libraries were prepared using fresh leaves from a few (<4) seedlings of DXWR159 and sequenced on the Illumina and PacBio platforms, respectively. In the subsequent data analyses, 217,263 subreads extracted from the PacBio DNA-seq data were used to assemble the DXWR159 cpDNA with a total length of 134,509bp at an extremely high depth of approximately 13,441×. Then, the two IRs, long poly(GC),low complexity, and other repeat regions (Supplementary file 1) were exactly determined by manual curation (Methods and Materials). In particular, 13,117 long (> 20 Kb) PacBio subreads were to validate the structure of DXWR159 cpDNA. However, the draft genome using high-depth PacBio data may still contain two types of errors in the low complexity and STR regions, respectively. Finally, 6,480,424 bp cleaned Illumina DNA-seq data were aligned to the DXWR159 cpDNA and only one error in a STR region was corrected. The DXWR159 cpDNA is a full-length genome (Figure 1), which is defined to has no gaps and ambiguous nucleotides [10].
In a previous study [1], 4 rRNA, 30 tRNA and 76 protein-coding genes (Table 1) have been annotated in the Onobrychis gaubae cpDNA (GenBank: LC647182) with a length of 122,688bp and also in the O. viciifolia (GenBank: MW007721) with a length of 121,932bp. 4 rRNA, 30 tRNA and 90 protein-coding genes or open reading frames (ORFs) have been annotated in the cpDNA ofO. sativassp. japonica (Nipponbare). Both DXWR159 and Nipponbare have almost identical chloroplast rRNA, tRNA and protein-coding genes. Based on the previous annotations, 73 of the 76 protein-coding genes are common genes which are present in both rice and Onobrychis cpDNAs (RefSeq: NC_001320), while three protein-coding genes (accD, ycf1 and ycf2) and three genes (infA, rpl22 andrps16) are absent in rice and Onobrychis cpDNAs, respectively. Furthermore, 15 ORFs had been annotated in rice cpDNAs. By sequence alignment, we updated the annotations of rice and Onobrychis cpDNAs with correction (Table 1), particularly: (1) the annotation of psbF was added into the Onobrychis cpDNAs, respectively, (2) rps16, ORF23, ORF28, ORF56, ORF72, ORF82, ORF85, ORF100, ORF137 and the third exon of rps12 in rice were removed, and (3) ORF44 in rice was renamed as psaJ.
Finally, 4 rRNA, 30 tRNA, 75 protein-coding genes and 6 ORFs were annotated in the DXWR159 cpDNA (Supplementary file 2). The 75 protein-coding genes includes the 73 common genes which are also present in Onobrychis cpDNAs, and two new genes(infA andrpl22). Among the 30 tRNAs, six are multi-exon genes which contain two exons, and they are tRNALys(AAA), tRNAGly(GGA), tRNALeu(UUA), tRNAVal(GUA), tRNAIle(AUC), tRNAAla(GCA). Among the 75 protein-coding genes, nine multi-exon genes are rpl2, rps12, ndhA, ndhB, petB, petD, atpF, and clpP with two exons, and ycf3 with three exons. The biological function of the 6 ORFs (ORF63, 70, 91, 106, 133, and 249) remains unknown.
Short tandem repeats between individuals
Blasting the DXWR159 cpDNA sequence to the NCBI NT database, we found that the best hit is the cpDNA (GenBank: CP056064) of Zhenshan97, a cultivar of O. sativa ssp. indica.The length of DXWR159 cpDNA was determined to be 134,509 bp, which is very close to the Zhenshan97 and Nipponbare cpDNA (RefSeq: NC_001320) lengths of 134,501 bp and 134,525 bp, respectively. The GC contents of the DXWR159, Zhenshan97, and Nipponbare cpDNAs were also very close (approximately 39 %). Furthermore, the LSC, IR1, SSC, and IR2 of the DXWR159 cpDNA had lengths of 80,553, 20,805, 12,346, and 20,805 bp, and shared nt sequence identities of 99.93 % (80505/80565), 99.99 % (20804/20805), 99.99 % (12346/12347), and 99.99 % with those of the Zhenshan97 cpDNA, respectively. Likewise, the LSC, IR1, SSC, and IR2 of the DXWR159 cpDNA shared the nt sequence identities of 99.54 % (80338/80709), 99.81 % (20779/20817), 99.76 % (12320/12350), and 99.81 % with those of the Nipponbare cpDNA, respectively. These results are consistent with those from the previous studies [2]: cpDNAs across spermatophytes exhibit a higher degree of conservation with respect to their gene content, structure and organization. Multiple sequence alignment of the three cpDNAs demonstrated that about 76% (1519/2000) of the variation sites were single nucleotide polymorphisms (SNPs), while the others were small insertions and ieletions (InDels). Furthermore, almost all the InDels were attributed to the presence of STRs [11]. STRs, which are widely used by forensic geneticists and genealogy experts, are often referred to as simple sequence repeats (SSRs) by plant geneticists or microsatellites by oncologists. The minimum repeat unit length of STR is evidently 1 bp, this type of STR is predominantly referred to as a polynucleotide (e.g. polyAs and polyTs).
The above finding thus prompted us to investigate the STRs between different lines of DXWR. Subsequently, we performed the detection of STRs between the cpDNAs of DXWR159 and DXWR line 44 (DXWR44). Accordingly, aligning the NGS data of DXWR44 to the DXWR159 cpDNA demonstrated that the DXWR159 and DXWR44 cpDNAs were identical. Moreover, most of the detected STRs between the DXWR159 and DXWR44 cpDNAs were polyAs or polyTs. We then examined the STRs in cpDNAs among individuals of DXWR159 and DXWR44, respectively. Our analysis revealed that the DXWR159 and DXWR44 cpDNAs share at least nine polyA or polyT sites. They are 11291[T]7-8, 31462[T]7-8, 36488[T]9-10, 49216[T]10-15, 63462[A]6-9, 80673[A]2-3, 102303[A]6-8, 107081[A]6-7 and 111165[A]6-8, wherea STR is described by its genomic position in numbers, the repeat unit [in brackets], and its copy number as subscripts, as described in our previous study [11]. These results thus suggest that polyAs or polyTs occur frequently among the individuals of each rice line.
Discovery of two large structural variations
The comparative genomics analysisrevealed many SVs between rice andOnobrychis cpDNAs and most of them concentrated in the 70% parts of the LSCs(Figure 2A).Among SVs in other regions, two large deletions resulted in the loss of two genes (ycf1 and ycf2) in the rice IRs. As for the other twohypothetical chloroplast reading frames(ycf3 and ycf4),Nipponbare (rice) and Onobrychis gaubae share a comparatively highnucleotide (nt) sequence identityof 88% (447/507)forycf3, while they share a very lowidentity (far less than 70%) for ycf4. The gene ycf3 is the only gene containing three exons in cpDNAs(Table 1), however, it is stillhighly conservedat the nt sequence level, indicating thatycf3 may has important biological functions. As a notable SV, a large inversion of SSC is present between rice andOnobrychis cpDNAs, although the SSCsare highly conserved with respect to their gene content, structure and organization.
Subsequently, we discovereda large inversion of SSC and a large duplication of IR inDXWR159 cpDNAs using the long PacBio subreads.Particularly, the large inversion of SSC results in a reverseorientationof SSC. As the reference cpDNA of rice, the organization of the Nipponbare cpDNA (RefSeq: NC_001320) is characterized by the common cpDNA structure LSC-IR1-SSC-IR2, which was defined as the forward SSC (SSC-F) type structure in the present study. Accordingly, the organization of the DXWR159 cpDNA with the reverseorientationof SSC was defined as the reverse SSC (SSC-R) type structure (denoted as LSC-IR1-ssc-IR2). The SSC-F and SSC-R structures were also designated as wild types and mutants of cpDNAs (Figure 2). As the most significant finding,both SSC-F and SSC-R type cpDNAs were detected in several seedlings of DXWR159. Then, we used long (> 20 Kb) PacBio subreads as junction reads spanning the LSC-IR1-SSC and LSC-IR1-ssc regions (Figure 2B) for validation (Materials and Methods). As a result, these junction reads (Supplementary file 2) demonstrated the presence of both SSC-R and SSC-F type cpDNAs with a ratio of 1.2 (99/82). This high ratio suggested that the inversion of SSCis not a rarely occurring event and may occur more frequently in cell growth. We named the frequent inversion of SSCas SSC switching, as the SSC in cpDNA is analogous to the mating-type (MAT) region in a yeast genome [9]. The SSC with a length of ~12.4 Kbp between two IRs has a structure of IR-SSC-IR, while the MAT region with a length of ~19 Kbp between two IRs has a structure of IR-MAT-IR. During MAT switching, the two copies of the IR recombine, inverting the MAT region relative to the rest of the chromosome. Neither MAT nor SSC switching changes the sequences of genes, as the structure of IR-MAT-IR or IR-SSC-IR is symmetric. By MAT switching, yeasts initiate the transcription of some mating related genes, determining the sexual identity of a cell. This process may be regulated by centromeric heterochromatin [9]. However, the underlying mechanism of SSC switching is unknown.
Based on this novel finding, the structures of cpDNAs can be classified into the SSC-F and SSC-R types.Further analysis showed that rich cpDNAs are highly conserved on the junction site between IR1 and SSC (IR1-SSC) or that between SSC and IR2 (SSC-IR2). As rice cpDNAs are linearized for their representation, starting at the first three nts “CCC” of LSC, almost all the IR1-SSC junction sites contain the highly conserved sequences of TGGAAAAAATCG|GCAAATAGGAAA and TGGAAAAAATCG|CGGAAAACCGAA for the SSC-F and SSC-R types, respectively. This feature was then used to classify the structures of rice cpDNAs from the NCBI GenBank database. The results revealed that the structures of most rice cpDNAs belong to the SSC-F type, whereas very few (e.g. Zhenshan97) belong to the SSC-R type. Notable SSC-F type cpDNAs included the cpDNAs of O. sativa L. spp. indica (GenBank: JN861110), O. nivara (KF359901), O. glaberrima (KF359903), O. barthii (KF359904), O. glumaepatula (KF359905), O. meridionalis (KF359906), O. punctata (MT726932) and O. brachyantha(MT726939). However, some of SSC-F type cpDNAs are chimeras. For instance, O. nivaraandO. punctatahave IR1-SSC junction sites for the SSC-R type, but SSC bodies for the SSC-F type. This indicated that these SSCs are likely to contain sequence errors, resulting from the assembly of hybrid data. This suggested that both SSC-F and SSC-R type cpDNAs were present in these species. As many of these cpDNAs were not absolutely de novo assembled using short NGS or Sanger sequencing data, the orientation of large segments was determined according to the common structure LSC-IR1-SSC-IR2 without validation using junction reads (Figure 2B). According to the classical definition, de novoassembly refers to the process of reconstructing a genome without knowing its structure. Using PacBio DNA-seq,the Zhenshan97 cpDNA (GenBank: CP056064) was absolutely de novo assembled, indicating the presence of SSC-R type cpDNA in the sequenced sample. The SSC-F type cpDNA of Zhenshan97 was also present, but its proportion was less than the SSC-R type cpDNA. Therefore, the assembled genome only represents the dominant SSC-R type cpDNA of Zhenshan97.
The other large SV reported in the present study is a large duplication of IR in cpDNAs of severalDXWR159 seedlings. Compared to the wild-type cpDNA, the mutant acquired a 22804-bp insertion and a 319-bp deletion. The 22804-bp insert includes an entire IR and a segment of LSC (noted as e1), while the 319-bp deleted segment is the 3' end of the IR and the 319-bp lacking IR is noted as IRa (Figure 2C). Six long (> 20 Kb) PacBio subreads (Supplementary file 2) were used as junction reads for manual curation (Materials and Methods). As a result, these junction reads span the IR-e1-IR regions (Figure 2C), validating this new finding. Although the underlying mechanism remains unknown, we have proposed a recombination model to explain these results. At the 5' end of the 319-bp deleted segment, we discovered the high-score segment pairs (HSPs) that may be involved as recombination sites (Figure 2C). Accordingly, in this model, a recombination event between two cpDNA molecules, results in an exchange between e1+IR2+SSC and the 319-bp segment plus SSC. Consequently, one of the two cpDNA molecules acquired a large duplication of IR, whereas the other lost an IR. Although the IR‑lacking sequence was not detected in the present study, it had been previously reported in Onobrychisspp. [1]. Using this model, we also explained the SSC switching (Figure 2B).