Morphology and genetic structure analysis of an O. rufipogon accession
Firstly, we investigated more than 1000 accessions from their original habitats all over China and have high geographical and agricultural phenotypic diversity. The biotic stress resistance and abiotic stress tolerance of this panel were evaluated. One accession Y476, which was collected from Sanya city, Hainan province had excellent overall resistance to various biotic and abiotic stresses. It has the typical characteristics of wild rice, including creeping and vigorous growth capacity, long awns, purple and completely exserted stigma, black-hulls, and reddish-brown pericarps (Fig. 1A). It was immune to the rice blast pathogen, and survived after treated with 200 mM NaCl for 5 days (Fig. 1B, 1C).
To characterize the genome composition of Y476, we generated 26.2 Gb paired-end (PE) Illumina short reads (Supplementary Table 1) and performed population structure analysis combined with whole-genome re-sequencing data from 444 O. rufipogon accessions, 268 O. sativa aus, 1882 O. sativa indica, and 1194 O. sativa japonica accessions2,13. We performed a phylogenetic and principal component analysis (PCA) of the rice pool. Japonica, indica and Aus rice formed completely isolated clusters, whereas the genomes of O. rufipogon were classified into three types, Or-I, Or-II and Or-III. The phylogenetic analysis and PCA showed that Y476 was closed to Or-II type clusters (Fig. 2A, Supplementary Fig. 1A), which mainly distribute in South and South-East Asia. We further analyzed the admixture pattern of Y476 to confirm its genomic composition using the program ADMIXTURE. With a K-value of 6, six genome types (Or-I to Or-III, aus, japonica-type, and indica-type) were identified. Y476 comprises 69.5% of the wild rice components, and 8.4% of indica, 9% of japonica, and 4% of aus component. (Supplementary Fig. 1B).
Sequencing, assembly and annotation of a gap-free wild rice genome
Different sequencing platforms were applied to develop a high-quality genome assembly for Y476. Approximately, 26.2Gb 150 bp Illumina PE reads (~ 62.2X) was used for K-mer analysis. The genome size of Y476 was estimated to be ~ 429Mb with a heterozygosity of 1.06% (Supplementary Fig. 2). The quality of long reads sequencing determines the quality of genome assembly, and the sequencing data generated in this study is particularly remarkable. 29.7Gb (~ 70.5X) HiFi reads were generated by the PacBio sequel II platform (Supplementary Table 1), with 99.9% accuracy of long reads and the N50 length of greater than 15 kb (Supplementary Fig. 3A). For the PacBio HiFi read assemblies, the preliminary assembly applied Hifiasm on HiFi reads with -hic mode, and generated contigs with the N50 length of 33.2Mb, which is 3 ~ 30 times larger than that of the reported wild rice genomes3,9 (Table 1). Nanopore ONT sequencing applies the latest technology Ultra-long sequencing, and the N50 is about 100 kb (Supplementary Fig. 3B), which is especially helpful for the assembly of highly repetitive regions, such as centromeric and telomeric regions, and finally generated telomere-to-telomere genome. About 40X Nanopore Ultra-long reads (~ 17Gb) were generated to assemble Y476. The NextDenovo was used to assemble the ONT data, forming 50 contigs with the N50 length of 9.4Mb. For highly heterozygous species with more heterozygous fragments in the genome sequence, the assembled contig sequences were further filtered using purge_dups. To correctly anchor the contigs of the Y476 primary assembly, 68.9Gb chromosome conformation capture sequencing (Hi-C) data of Y476 were generated. HiC-Pro was applied to generate chromosomal interaction maps and confirmed the correct ordering and orientation of all the pseudomolecules. With interactive signals, de novo assembled contig was anchored to each of 12 pseudomolecules. The chromosome ID and orientation were tuned in accordance with R49814, the gapless Y476 genome was generated, which contained 12 chromosomes and 3 gaps in chromosome (Chr.) 5, 8, and Chr. 12. The ONT-assembled genome was used to patch the HiFi genome with the three gaps by RagTag, which successfully closed 3 gaps (Supplementary Fig. 4). The final assembly was comprised of 12 gap-free chromosomes and 3 gaps achieved extremely high mapping rate (> 99.90%) and coverage of Illumina sequencing reads, further supporting the high quality of gap-closing. The de novo Bionano optical maps were aligned to the genome to verify the correct sequence and direction of the genome assemblies (Supplementary Fig. 5). Finally, the gap-free Y476 genome with a total length of 421.1Mb was assembled. HiCPlotter was applied to generate chromosomal interaction heatmaps of 12 chromosomes (Fig. 2B).
Table 1
Comparison of the Y476 genome with previously published assemblies of the wild rice genomes. * indicated data lack.
Accession number | Y476 | IRGC1061629 | Yuanjiang wild rice of China3 |
Assembly | | | |
Genome size (Mb) | 421.1 | 377.1 | 376.5 |
Number of Chromosome(gaps) | 12(0) | * | 12 (1219) |
Contig N50 (Mb) | 33.2 | 13.2 | 1.1 |
Genome BUSCO (%) | 98.80 | 96.20 | 97.36 |
QV | 64.61 | * | * |
LAI | 24.76 | 10.19 | 16.55 |
Telomeres number | 18 | 0 | 0 |
Centromere number | 12 | 0 | 0 |
Annotation | | | |
Number of predicted genes | 38,055 | 33,903 | 34,830 |
Average gene length(bp) | 2726 | 2397 | 2921 |
Average CDS length(bp) | 1168 | * | 1,125 |
Protein BUSCO (%) | 98.00 | 93.40 | 94.20 |
Repeat sequences (%) | 56.80 | 49.37 | 44.14 |
For the genome annotation, our pipeline is shown in Supplementary Fig. 6. First, we screened the genome repetitive sequences to annotate transposable elements (TEs). The Extensive de novo TE Annotator (EDTA) was used to generate a high-quality repetitive sequence library and then the RepeatMasker was used to annotate TEs. A total of 687,167 TEs accounted for 59.21% of Y476 genome (Supplementary Table 2). After that, we used the sequences after repetitive sequence mask to predict the gene structure. Several de novo prediction software, homologous prediction, and transcript prediction with different principles produced an accurate gene model. The EVM integrated the results of different software outputs and filtered out some low-quality gene models (Supplementary Table 3). Finally, 38,055 protein-coding genes were predicted in our assembled genome, with an average coding sequence size of 1,168 bp and an average of 4.42 exons per gene (Supplementary Table 4). Homology-based annotation of the non-coding RNA sequences predicted 772 tRNA and 465 ribosomal RNA genes for the Y476 genome (Supplementary Table 5). Using MUMmer to further compare Y476 and MSU7.0 genomes (Nipponbare), we identified 2,726,973 SNPs and 404,723 InDels between the two genomes. Using 500kb intervals across the 12 chromosomes, we plotted the densities of GC content, genes, TEs, SNPs, and InDels in the genome (Fig. 2C).
Quality assessment and validation of Y476 assembly
The quality and completeness of the Y476 assembly were evaluated in multiple ways. First, Illumina short reads and long reads were mapped to the assembled genome, and mapping rates and coverage were used to evaluate the accuracy of the genome (Table 1). For the Illumina reads mapping rate was 99.15% and the coverage was 99.97%. Secondly, we used BUSCO to evaluate genomic completeness. About 98.8% of the core conserved plant genes (1,595 out of 1,614 BUSCOs) were found complete in the Y476 genome assembly (Supplementary Table 6), and 98.0% (1,581 out of 1,614) were found complete in the Y476 predicted genes (Supplementary Table 7). Of the protein-coding genes, 94.8% were predicted based on the information from NR (non-redundant, NCBI), Swissprot, InterPro, GO, and the KEGG database (Supplementary Table 8). Collectively, these results illustrated the high quality, reliability and accuracy of the Y476 assembly. We also assessed the k-mer-based quality estimations (k = 17) for the Y476 genome using Merqury with both Illumina PCR-free and HiFi reads15,16. The Merqury-estimated quality value (QV) of the Y476 gap-free genome was 64.61 (Table 1). Finally, the LTR assembly index (LAI) value for the Y476 genome was 24.76 (Table 1), which is higher than all the published wild rice genomes8,17. In addition, we identified the centromeric region on each chromosome (Supplementary Table 9). Using seven base telomeric repeats (CCCATTT at 5 'end and TTTAGGG at 3' end) as sequence query, we identified 18 telomeres in the Y476 genome (Supplementary Table 10). These results showed that our Y476 genome assembly has high reliability and quality. We also evaluated the two published wild rice genomes3,9 using the same analysis, and the quality of Y476 was significantly better in comparison (Table 1).
Global comparison, identification of unique gene family and novel genes of Y476
To dissect the genome variation between Y476 and cultivated rice, we compared the Y476 genome and indica (Xian) variety ‘9311’ and japonica (Geng) variety ‘Nipponbare’ (Nip). The Y476 genome contains a higher proportion of repeat sequences, which resulted in larger genome size. Synteny comparison of the genomic structure revealed high collinearity between Y476 and Nip and 9311, and the fragment inversions, duplications, and translocations were showed in Fig. 3A. Three long fragment inversions, 4 Mb on Chr. 9 of 9311, 4.5 Mb on Chr. 6 and 3.3 Mb on Chr. 9 of Nip, were confirmed by PCR and sequencing (Supplementary Fig. 7). SNPs, small InDels, and the presence-absence variations (PAVs) were listed in Table 2. The genome similarity between Y476 and Nip is higher than between Y476 and 9311. PAV analysis identified 47,284 PAVs between Y476 and Nip, including 80 Mb deletion and 128 Mb insertion, and some randomly selected large PAVs were also verified by PCR amplification (Fig. 3B; Supplementary Fig. 8; Supplementary Table 11). Moreover, 18,656 genes were affected by these PAVs. 50,889 PAVs were identified between Y476 and 9311, (Fig. 3B; Supplementary Table 12), 11,205 genes were affected by these PAVs.
Table 2
Global comparison of Y476 and Nip and 9311
Statistics of variation | Nip vs Y476 | 9311 vs Y476 |
No. SNPs | 2,726,973 | 3,262,275 |
No. small indels | 404,723 | 522,076 |
Deletion number | 21,340 | 24,030 |
Deletion total length (bp) | 80,447,894 | 91,882,432 |
Deletion length range (bp) | 50 − 1,076,270 | 50–838,432 |
Insertion number | 25,944 | 26,859 |
Insertion total length (bp) | 128,011,421 | 127,762,275 |
Insertion length range (bp) | 50 − 2,624,538 | 50 − 1,024,568 |
Number of genes affected by PAV | 18, 656 | 11,205 |
Duplications | 527 | 467 |
Inversions | 226 | 218 |
Translocations | 320 | 357 |
We performed orthologous clustering on Y476, Nip and 9311. In the Y476 genome, 27,454 gene families were identified. Among these, 20,675 of the gene families were common to all three genomes, whereas 303 gene families were specific to the Y476 genome (Supplementary Fig. 9). 690 gene families were expanded and 247 gene families were contracted in Y476 genome, including NBS-LRR and RNA_pol_Rpc4 gene family, which play important roles in disease resistance and grain regulation, respectively. An analysis of Gene Ontology (GO) terms for these specific families revealed that several biological processes, such as DNA integration, viral genome integration into host DNA, and ion transport are enriched in the Y476 genome (Supplementary Table 13). A total of 5,984 novel protein-coding genes absent in the MSU7(Nip)/9311 were predicted in the Y476 genome. Of these novel genes, 18.1% (1,085) were annotated by Pfam, 6.9% (415) genes were annotated by GO terms. Prominently, 155 of 1,085 genes are potential disease resistance genes (R genes), including 106 NB-ARC genes and 49 genes with LRR domain. Some of these domains may lead to the excellent disease resistance of Y476 (Supplementary Table 14).
We identified two large structural variations on Chr. 4 and 11, which contained tandem repeats of gene clusters. We found a gene cluster on Chr. 4 (32 genes) and LOC_Os04g32350 belongs to RNA_pol_Rpc4 gene family and can regulate the expression of genes involving in grain development. The gene cluster on Chr.11 (39 genes) and three NBS-LRR genes in Nip belong to a common gene family, and were discovered by tandem repeat duplication of these three genes (Supplementary Table 15). The number of NBS-LRR resistance genes in Y476 was significantly increased. By RNA-seq with the conditions of rice blast challenge, we identified that the gene expression of these two gene clusters was significantly changed compared with that of Nip gene (Fig. 3C-F).
Development of chromosome segments substitution lines in different cultivated rice genetic backgrounds
Based on the observation that there is wide genetic variation between wild rice accession Y476 and cultivated rice, it is anticipated that some of these variations could be responsible for phenotypic differences that include resistance to biotic and abiotic stresses. In order to dissect the wild rice genomic information and manipulate these variations for directional breeding, we developed two sets of chromosome segment substitution lines (CSSLs) using Y476 as the donor parent, 9311 and Nip as the receipt/recurrent parent, respectively. The procedure for the development of two sets of CSSLs was summarized in Supplementary Fig. 10. The CSSL/9311 population is an advanced backcrossed generation population, contains 198 lines and covers 85% of the wild rice genome, each line has an average of 96.5% recurrent parent genome, and harbor 7.3 substitution segments (Fig. 4A-C, Supplementary Table 16,17). The CSSL/Nip population, a less backcrossed population, contains 225 lines and covers the whole genome of wild rice. Each line has an average of 80.6% recurrent parent Nip genome, and harbor 21 substitution segments (Fig. 4D). The genetic constitution of CSSL/Nip population was analyzed. In the CSSL/Nip population, the sizes of the substituted segments ranged from 0.37 to 106 cM, with an average of 14.4 cM. Twenty-six percent of substituted segments were smaller than 5 cM, 22% of substituted segments were from 5 to 10 cM, and 11% were over 30 cM. 40% of CSSLs contained no more than 20 substituted segments (Fig. 4D-F, Supplementary Table 18,19).
The QTLs mapping was performed based on the phenotype and genotype association using these two CSSL populations. Some domestication-related trait associated genes, such as sd1 for plant height18,19, sh4 for seeds shattering20, and C1 for red or purple traits21,22, were observed in these two CSSL populations and easily fine mapped on their exact positions (Supplementary Fig. 11). Variates in these three loci from CSSLs have the wild-type haplotypes which were previously reported. This demonstrated that CSSLs were useful resources for identification of wild rice genes.
QTL mapping of favorable agronomic traits using CSSLs
Using the CSSL/Nip population, nine agronomical traits, including grain length, grain width, 1000-grain weight, grain length to grain width ratio, plant height, panicle length, tiller number, length of flag leaf and width of flag leaf were investigated in three environments and showed a large range of variation (Supplementary Fig. 12, Supplementary Table 20). We identified 244 QTLs associated with these nine agronomic traits in three environments (Fig. 5A, Supplementary Fig. 13, Supplementary Table 21). Among these QTLs, 214 were previously uncharacterized. Approximately 130 novel genes which were not found in the cultivated rice were identified in these QTLs, the new loci from wild rice provided resources for further studies. PAV identification was conducted on the QTL interval, and 1,323 PAVs were identified, which distributed on 199 QTLs on 12 chromosomes (Fig. 5B, Supplementary Table 22).
Several documented genes in QTLs were confirmed for their PAVs between Y476 and Nip. For example, a known OsSWEET14 gene (LOC_Os11g31190), which encodes a sugar transporter and negatively regulated grain weight23, harbors a 663bp PAV upstream. CSSL with this allele from Y476 showed a reduced expression level and increased grain weight compared with Nip (Supplementary Fig. 14). Therefore, we believe that the existence of these PAVs may be the cause for significant differences in the expression of these genes and thus affect the phenotype.
Identification of chromosome loci controlling salt tolerance using CSSLs
To identify beneficial alleles for abiotic stresses, the salt tolerance was investigated under 0.5% NaCl stress during the whole growth period using the CSSL/Nip population. Phenotypic transgressive variation was observed in this population, nearly half of CSSLs were more salt tolerant than their recurrent parent Nip and half were the same (Supplementary Fig. 15A). Three QTLs were identified and one QTL near loci S2_4579633 associated with a large LOD value of 12.3 were selected (Fig. 6A, Supplementary Table 23), four genes were found in this interval (Supplementary Table 24), and the variation between wild rice and Nip were investigated. One CSSL line, N133, which harbors this locus and is highly salt tolerant (Level 3), was selected for further study (Supplementary Fig. 15B, C). The 20-day seedling of N133 also showed higher salt tolerance than Nip, the survival rate, the fresh and dry weight of above-ground seedlings of N133 were much higher than those of Nip under salt stress (Fig. 6B-F).
Based on the transcriptomic data, we detected the relative expression levels of four genes in this QTL near loci S2_4579633, only one gene, LOC_Os02g08540 has a significantly different transcriptional level between N133 and Nip under salt treatments (Supplementary Fig. 15D-G). LOC_Os02g08540 encodes an unknown expressed protein and was strongly induced by salt stress in N133 plants, which was confirmed with Real-time PCR (Fig. 6G). According to Y476 and Nip genomic sequence, two PAVs, 87-bp and 240-bp deletion, were found in the promoter and downstream regions of this gene respectively (Fig. 6H). Therefore, integrating the transcriptome and genome variant data identified LOC_Os02g08540 as a candidate gene associated with salt tolerance.
As an innovative rice germplasm harbor wild rice gene, line N133 showed a comprehensive excellent salt tolerance, both at the seedling and adult plant stages in the field, whereas N133 survived under 0.5% salt stress during the whole growth period (Supplementary Fig. 15B, C). Additionally, the yield traits of line N133 were also better than those of Nip (Supplementary Fig. 16). Our results provided promising gene and germplasm resources for future rice salt tolerance breeding.
Identification of rice blast resistant gene from wild rice using CSSLs
Seven QTLs for rice blast resistance were identified, and contained 70 opening read frames. Among them, the QTL located near S7_21365207 has largest LOD value of 10.9 and phenotypic variance explained (PVE) of 12.1% (Fig. 7A, Supplementary Table 25). One CSSL/Nip line, N154, which harbors this QTL and was highly resistance to rice blast, was selected for further study. The lesion length on line N154 was almost zero (Fig. 7B). Three genes located near the S7_21365207 locus, i.e. LOC_Os07g35660, LOC_Os07g35670, LOC_Os07g35680 (Supplementary Table 26). Only LOC_Os07g35680 had significant expression level difference between N154 and Nip, and was confirmed by Real-time PCR (Fig. 7C; Supplementary Fig. 17). A 7.8-kb PAV was found in the LOC_Os07g35680 intron region between N154 and Nip (Fig. 7D). qPCR analysis revealed that the expression level of LOC_Os07g35680 was significantly higher in the leaves of N154 than that in Nip, and was significantly increased after rice blast treatment (Fig. 7C, E). Combining with genome variant data, LOC_Os07g35680 was identified as a candidate gene from wild rice for rice blast resistance. LOC_Os07g35680 encoding a receptor-like kinase (RLK). Phylogenetic analysis revealed that LOC_Os07g35680 belonged to RLK family in rice germplasm and clustered with several known RLK genes (Supplementary Fig. 18).
We next compared the transcriptomes of N154 with Nip after rice blast infection using RNA-seq analysis. Global gene expression differences were found between Nip and N154. The analysis revealed 3,788 differentially expressed genes (DEGs) in N154 without blast infection when compared to Nip. Among this set of genes, with blast infection, 841 DEGs were found in N154 when compared to Nip, including 87 DEGs enriched in the GO term of “Cell death”, ‘‘response to stimulus”, ‘‘defense response”, and ‘‘immune response” (Fig. 7F; Supplementary Fig. 19A, B). There are four cloned resistance genes related to rice blast among the 841 DEGs, Pi924 were significantly up-regulated while OsWAK112d 25, OsMADS26 26,27 and Pish were significantly down-regulated in N154 (Supplementary Fig. 19C). The expression pattern of these four rice blast resistance genes were confirmed by Real-time PCR (Fig. 7G).
Using CRISPR/Cas9 technology, we generated knock-out mutants of LOC_Os07g35680 in the N154 background (N154-KO) (Supplementary Fig. 20). As expected, the rice blast resistance of the knock-out plants decreased significantly. The N154-KO lines displayed enhanced susceptibility to Magnaporthe oryzae similar to Nip, and the lesion length on the N154-KO lines was significantly increased compared with N154 (Fig. 7H, I). Subsequently, we detected the expression levels of these four rice blast resistance genes in N154-KO lines, and found that only OsMADS26 expression in N154-KO lines was significantly changed, and the expression alteration was consistent with Nip, which is significantly different from N154 (Fig. 7J). These results demonstrated that the natural variation in the LOC_Os07g35680 locus is critical for the rice resistance to M. oryzae and, led to the transcriptional changes of rice blast resistance gene OsMADS26, which might contribute to an adaptive molecular response to the rice blast resistance in line N154 and wild rice.