KitaakeX flowers significantly earlier than other sequenced rice varieties
Kitaake has long been recognized as a rapid life-cycle variety [26], but it has yet to be systematically compared to other rice varieties. We compared the flowering time of KitaakeX with that of other sequenced rice varieties under long-day conditions (14 h light/10 h dark). Consistent with other studies, we found that KitaakeX flowers much earlier than other varieties (Fig. 1a, 1b), heading at 54 days after germination. By comparison, Nipponbare, 93-11 (ssp. indica), IR64 (ssp. indica), Zhenshan 97, Minghui 63 (ssp. indica), and Kasalath (aus rice cultivar) start heading at 134, 99, 107, 79, 125, and 84 days after germination, respectively (Fig. 1b).
We next assessed how KitaakeX is related to other rice varieties using a phylogenetic approach based on the rice population structure and diversity published for 3,010 varieties [2]. The 3,010 sequenced accessions were classified into nine subpopulations, most of which could be connected to geographical origins. The phylogenetic tree reveals that KitaakeX and Nipponbare are closely related within the same subpopulation (Fig. 1c).
Genome sequencing and assembly
To obtain a high-quality, de novo genome assembly, we sequenced the KitaakeX genome using a strategy that combines short-read and long-read sequencing. Sequencing reads were collected using Illumina, 10x Genomics, PACBIO, and Sanger platforms at the Joint Genome Institute (JGI) and the HudsonAlpha Institute. The current release is version 3.0, which combines a PACBIO-based assembly produced with MECAT (Mapping, Error Correction and de novo Assembly Tools) and an Illumina-sequenced 10x Genomics SuperNova assembly. The assembled sequence contains 377.6 Mb, consisting of 33 scaffolds (476 contigs) with a contig N50 of 1.4 Mb, with 99.67% of assembled bases placed in chromosomes (Table 1).
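For readers unfamiliar with the N50 statistic quoted above, the following is a minimal sketch of how it is conventionally computed: sort the sequence lengths from longest to shortest and report the length at which the running total first reaches half of the assembly. The function name `n50` and the toy lengths are illustrative assumptions, not part of the KitaakeX pipeline or its data.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths (in bp).

    N50 is the length L such that sequences of length >= L together
    contain at least half of all assembled bases.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical toy contig lengths (NOT KitaakeX data).
toy_contigs = [80, 70, 50, 40, 30, 20, 10]
toy_n50 = n50(toy_contigs)  # running total reaches 150 of 300 at the 70 bp contig
```

Applied to real contig or scaffold lengths, the same function would yield values comparable to the contig N50 and scaffold N50 reported in Table 1.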
We assessed the quality of the KitaakeX assembly for sequence completeness and accuracy. Completeness of the assembly was assessed by aligning the 34,651 annotated genes from the v7.0 Nipponbare annotation to the KitaakeX assembly using BLAT [27]. The alignments indicate that 98.94% (34,285) of the genes aligned completely to the KitaakeX assembly, 0.75% (259 genes) aligned partially, and 0.31% (107 genes) were not detected. To assess accuracy, a bacterial artificial chromosome (BAC) library was constructed and a set of 346 BAC clones (9.2x clone coverage) was sequenced using PACBIO sequencing. A range of variants was detected by comparing the BAC clones to the assembly. Alignments were of high quality (<0.1% error) in 271 clones (Additional file 1: Figure S13). Sixty BACs show a higher error rate (0.45% error), due mainly to their placement in repetitive regions (Additional file 1: Figure S14). Fifteen BAC clones indicate either a rearrangement (10 clones) or a putative overlap on adjacent contigs (5 clones) (Additional file 1: Figure S15). The overall error rate in the BAC clones is 0.09%, indicating the high quality of this assembly (for detailed information, see Additional file 1).
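The completeness percentages above follow directly from the reported gene counts. The short sketch below reproduces that arithmetic and checks that the three BAC categories add up to the 346 sequenced clones. The helper name `percent_breakdown` is an assumption introduced for illustration; only the counts come from the text.

```python
def percent_breakdown(counts):
    """Convert a dict of category counts into percentages of the total."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 2) for name, n in counts.items()}


# Gene alignment counts as reported in the text (Nipponbare v7.0 genes vs. KitaakeX).
alignment_counts = {"complete": 34285, "partial": 259, "not_detected": 107}
alignment_percent = percent_breakdown(alignment_counts)

# BAC clone categories as reported in the text.
bac_counts = {"high_quality": 271, "higher_error": 60, "rearranged_or_overlap": 15}
bac_total = sum(bac_counts.values())
```

The gene counts sum to the 34,651 annotated Nipponbare genes, and the resulting percentages match the 98.94%, 0.75%, and 0.31% quoted above.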
Genome annotation
We predicted 35,594 protein-coding genes in the KitaakeX genome (Table 1, Additional file 2: Table S12), which together occupy 31.5% of the assembled genome (Table 1). There is some transcriptome support for 89.5% (31,854/35,594) of the KitaakeX genes, and 81.6% (29,039/35,594) of the genes are fully supported by the transcriptome (Additional file 2: Table S11). The predicted protein-coding genes are distributed unevenly across each chromosome; gene density tends to be higher toward chromosome ends (Fig. 2f). The average GC content of the genome is 43.7% (Fig. 2e, Table 1).
To assess the quality of the annotation of KitaakeX genes, we compared the KitaakeX annotation to those of other completed rice genomes using the BUSCO v2 method, which is based on a set of 1,440 conserved plant genes. The results confirm 99.0% completeness of the KitaakeX genome annotation (Table 1, Additional file 2: Table S7). To further evaluate the quality of the annotation, we studied the extent of conservation of functional genes in KitaakeX. We selected 291 genes (Additional file 3: Table S13) from three pathways associated with stress resistance, flowering time and response to light [8], and then searched for orthologous genes in the KitaakeX genome. We found that 275 of the 291 selected KitaakeX genes (94.5%) show greater than 90% identity with the corresponding Nipponbare genes at the protein level. Twenty-three of the 291 show 100% identity at the genome level but not at the protein level. For 16 of these 23 genes, the KitaakeX gene model has better transcriptomic evidence than the Nipponbare gene model. One of the 291 KitaakeX genes is slightly shorter than its Nipponbare ortholog due to an alternative transcript (Additional file 3: Table S13). These results indicate the high quality of the annotation, and conservation between the KitaakeX and Nipponbare japonica rice varieties.
Using SynMap, we identified 2,469 pairs of colinear genes (88 blocks) in the KitaakeX genome (Fig. 2g). These results are consistent with previously published findings [28]. We used RepeatMasker and Blaster to identify transposable elements (TEs) in the KitaakeX genome, and identified 122.2 Mb of sequence corresponding to TEs (32.0% of the genome). DNA transposons account for ~33 Mb; retrotransposons account for ~90 Mb. Most of the TEs belong to the Gypsy and Copia retroelement families, which together account for 23% of the genome (Additional file 2: Table S8), as is also the case in the Nipponbare and Zhenshan97 genomes [6].
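The TE figures above can be cross-checked against Table 1 with a few lines of arithmetic. The sketch below assumes that the denominator behind "32.0% of the genome" is the 381.6 Mb of assembled scaffolds; the text does not state this explicitly, but that choice reproduces the reported percentage. The function name `te_summary` is hypothetical.

```python
# Values as reported in the text and Table 1 (in Mb).
RETROTRANSPOSONS_MB = 89.6
DNA_TRANSPOSONS_MB = 32.6
ASSEMBLED_SCAFFOLDS_MB = 381.6  # ASSUMED denominator, not stated in the text


def te_summary(retro_mb, dna_mb, genome_mb):
    """Return total TE content (Mb) and its share of the genome (%)."""
    total = retro_mb + dna_mb
    return {
        "total_te_mb": round(total, 1),
        "te_percent": round(100 * total / genome_mb, 1),
    }


summary = te_summary(RETROTRANSPOSONS_MB, DNA_TRANSPOSONS_MB, ASSEMBLED_SCAFFOLDS_MB)
```

Under this assumption the two TE classes sum to the 122.2 Mb total and give a TE share of 32.0%.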
Table 1 Summary of the KitaakeX genome assembly and annotation
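| Category | Metric | Value |
|---|---|---|
| Genome assembly | Estimated genome size | 409.5 Mb |
| Genome assembly | Assembled contigs | 377.6 Mb |
| Genome assembly | Contig N50 | 1.4 Mb |
| Genome assembly | Longest contig | 8.6 Mb |
| Genome assembly | Assembled scaffolds | 381.6 Mb |
| Genome assembly | Scaffold N50 | 30.3 Mb |
| Genome assembly | Longest scaffold | 44.3 Mb |
| Genome assembly | GC content | 43.7% |
| Transposable elements | Retrotransposons | 89.6 Mb |
| Transposable elements | DNA transposons | 32.6 Mb |
| Transposable elements | Total | 122.2 Mb |
| Genome annotation | Protein-coding genes | 35,594 |
| Genome annotation | Complete BUSCOs | 99.0% |
| Genome annotation | Average transcript length | 1,874 bp |
| Genome annotation | Average coding sequence length | 1,222 bp |
| Genome annotation | Functionally annotated | 33,226 |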
Genomic variations between KitaakeX and other rice varieties
We compared the genome of KitaakeX to the Nipponbare and Zhenshan97 genomes using MUMmer [29] to detect genomic variations, including single nucleotide polymorphisms (SNPs), insertions and deletions under 30 bp (InDels), presence/absence variations (PAVs), and inversions. We found 331,335 variations between KitaakeX and Nipponbare (Additional file 4), and more than eight times as many (2,785,991) between KitaakeX and Zhenshan97 (Additional file 5). There are 253,295 SNPs and 75,183 InDels between KitaakeX and Nipponbare, and 2,328,319 SNPs and 442,962 InDels between KitaakeX and Zhenshan97 (Additional file 6 and Additional file 2: Table S3). Among SNPs in both the intersubspecies (japonica vs. indica) and intrasubspecies (japonica vs. japonica) comparisons, transitions (Tss) (e.g., G→A and C→T) are about twice as abundant as transversions (Tvs) (e.g., G→C and C→G) (Additional file 2: Table S10). Genomic variations between KitaakeX and Nipponbare are highly concentrated in some genomic regions (Fig. 2b), but variations between KitaakeX and Zhenshan97 are spread evenly through the genome (Fig. 2c). Intersubspecies genomic variations, then, are much more extensive than intrasubspecies variations. We also detected multiple genomic inversions using comparative genomics (Additional files 4 and 5).
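The transition/transversion comparison above rests on a simple classification rule: a substitution within the same base class (purine to purine, or pyrimidine to pyrimidine) is a transition, and a substitution between classes is a transversion. The sketch below illustrates that rule on a hypothetical list of SNPs; the function names and the toy data are assumptions for illustration and are not taken from the KitaakeX variant files.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}
BASES = PURINES | PYRIMIDINES


def classify_snp(ref, alt):
    """Classify a single-base substitution as 'transition' or 'transversion'."""
    ref, alt = ref.upper(), alt.upper()
    if ref not in BASES or alt not in BASES or ref == alt:
        raise ValueError(f"not a valid SNP: {ref!r} -> {alt!r}")
    same_class = {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES
    return "transition" if same_class else "transversion"


def ts_tv_ratio(snps):
    """Return the ratio of transitions to transversions for (ref, alt) pairs."""
    kinds = [classify_snp(ref, alt) for ref, alt in snps]
    n_tv = kinds.count("transversion")
    if n_tv == 0:
        raise ValueError("no transversions; ratio is undefined")
    return kinds.count("transition") / n_tv


# Hypothetical toy SNPs (NOT KitaakeX data): four transitions, two transversions.
toy_snps = [("G", "A"), ("C", "T"), ("A", "G"), ("T", "C"), ("G", "C"), ("A", "T")]
toy_ratio = ts_tv_ratio(toy_snps)
```

A ratio near 2, as in this toy example, corresponds to the "about twice as abundant" relationship described in the text.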
For variations occurring in the genic regions, we found that single-base and 3 bp (without frame shift) InDels are much more abundant than others (Additional file 7: Figure S16a), suggesting that these genetic variations have been functionally selected. We carried out a detailed analysis of the gene structure alterations that result from SNPs and InDels between KitaakeX and Nipponbare and between KitaakeX and Zhenshan97. Between KitaakeX and Nipponbare, we identified 2,092 frameshifts, 78 changes affecting splice-site acceptors, 71 changes affecting splice-site donors, 19 lost start codons, 161 gained stop codons, and 15 lost stop codons. In the comparison of KitaakeX to Zhenshan97, 6,809 unique genes in KitaakeX are affected by 8,640 frameshifts, 531 changes affecting splice-site acceptors, 530 changes affecting splice-site donors, 185 lost start codons, 902 gained stop codons, and 269 lost stop codons (Additional file 7: Figure S16b).
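The distinction between frame-preserving and frameshifting InDels mentioned above comes down to whether the net length change is a multiple of three. The following is a deliberately simplified sketch of that rule; it assumes the InDel lies entirely within a coding sequence and ignores splice sites, start and stop codons, and annotation details that a real variant-effect pipeline would consider. The function names and example alleles are hypothetical.

```python
from collections import Counter


def indel_effect(ref, alt):
    """Classify an InDel in coding sequence by its effect on the reading frame.

    ASSUMPTION: the variant lies entirely inside a coding exon.
    """
    delta = len(alt) - len(ref)
    if delta == 0:
        raise ValueError("alleles have equal length; not an InDel")
    return "in-frame" if delta % 3 == 0 else "frameshift"


def summarize_indels(indels):
    """Count in-frame vs. frameshift InDels in a list of (ref, alt) pairs."""
    return Counter(indel_effect(ref, alt) for ref, alt in indels)


# Hypothetical toy InDels (NOT KitaakeX data).
toy_indels = [
    ("A", "AT"),      # +1 bp insertion
    ("ACGT", "A"),    # -3 bp deletion
    ("A", "ACGG"),    # +3 bp insertion
    ("ACG", "A"),     # -2 bp deletion
]
toy_summary = summarize_indels(toy_indels)
```

In this toy set, the 3 bp changes are classified as in-frame and the 1 bp and 2 bp changes as frameshifts.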
Based on PAV analysis, we identified 456 loci that are specific to KitaakeX compared with Nipponbare (Additional file 4). Pfam analysis of the KitaakeX-specific regions revealed 275 proteins. Of the 275 corresponding genes, 148 belong to 19 different gene families with more than 2 genes in those regions. These gene families include protein kinases, leucine-rich repeat proteins, NB-ARC domain-containing proteins, F-box domain-containing proteins, protein tyrosine kinases, Myb/SANT-like DNA binding domain proteins, transferase family proteins, xylanase inhibitor C-terminal protein, and plant proteins of unknown function (Additional file 7: Figure S16c). We identified 4,589 loci specific to KitaakeX compared with Zhenshan97 (Additional file 5).
We also compared our de novo assembly of the KitaakeX genome with Kitaake resequencing reads using an established pipeline [17]. This analysis revealed 219 small variations (200 SNPs and 19 InDels) between the two genomes (Additional file 8). These variations affect 9 genes in KitaakeX besides the Ubi-Xa21 transgene, including the selectable marker encoding a hygromycin B phosphotransferase on chromosome 6 (Additional file 8, Additional file 9: Figure S17).