Phenotypic variation of GPC in the association panel
The GPC was investigated using 529 rice accessions collected worldwide (Data S1). Structural analysis of these accessions revealed that the population had a unique structure, and was divided into several distinct subpopulations, including indI (95 accessions), indII (74 accessions), indIII (13 accessions), indica intermediate (117 accessions), Tej (93 accessions), Trj (43 accessions), japonica intermediate (20 accessions), aus (46 accessions), and a mixture (Chen et al., 2014). Rice GPC was normally distributed in the full population and varied widely over the 3 years (Fig. 1a, Data S3). The protein content ranged from 95.92 to 122.96 mg/g in 2014; from 92.11 to 121.13 mg/g in 2015; and from 92.92 to 122.10 mg/g in 2016 in the full population. Large variations were found in the indica accessions, and japonica accessions presented a similar distribution (Fig. 1b, c, Data S3).The highest mean values for GPC were observed in 2014 and the lowest mean values were observed in 2015 in the full population and the two subpopulations (Fig. 1, Data S3). The correlation coefficients for GPC were significant among the 3 years, being 0.57 between 2014 and 2015, and 0.77 between 2014 and 2016. The correlation coefficient of GPC between 2015 and 2016 was the highest, at 0.93 (Data S4).
Identification Of Loci Associated With Gpc By Gwas
In this study, 3,916,415 SNPs in the whole rice genome were selected for GWAS, which was performed separately for different populations. In the following analysis, we analyzed GWAS results from the full population and in indica and japonica subpopulations over the 3 study years (Fig. 2, Data S5). To avoid structural noise, 313 indica and 156 japonica accessions were used for GWAS. Manhattan plots for the association analysis of protein content and quantile-quantile plots of protein content over 3 years for the full population are illustrated in Fig. 2. Any two lead SNPs of around or less than 200 kb were identified as being caused by a common gene and treated as one association locus. A lot of associations have been found in different populations, and some associations were identified in different years (Fig. 2, Data S5).
Many lead SNPs were detected; details of these significant association signals with a threshold value at 5.0E-05 are listed in Data S5. More detailed information on lead SNPs, such as population, chromosomes, physical positions, proportion of phenotypic variance explained, P-values calculated using LMM, MAF, and neighboring known genes, are described in the list (Data S5). To further analyze the association results, GWAS were performed using the full association panel, indica, and japonica subpopulations over the 3 years (Table 1, Data S5). The detected association loci were widely distributed on 12 chromosomes based on their physical positions in the rice genome, with chromosomes 1, 4, 7, and 8 exhibiting the highest numbers of associations. Of these, 56, 36, and 43 loci were detected in 2014, 2015, and 2016, respectively. The significance levels of the associations (except those located close to known genes) ranged from P = 4.8E-05 to P = 3.6E-08 in the LMM for protein content. Lead SNP sf0102043947, located on chromosome 1, presented the most significant effect. The proportion of phenotypic variance explained of each locus ranged from 0.11 to 75.88%. In addition, 67 associations were detected in different environments or populations, explaining more than 10% of the phenotypic variation (Table 1, Data S5).
To better understand the comprehensive association results, the number of loci detected was counted (Table 1). Significant genetic heterogeneity was observed among the 12 chromosomes, the different groups, and the three years. Compared with GWAS results over the three years, most of the loci were detected in only one year, demonstrating that environmental variation can greatly impact the performance of GWAS. However, five lead SNPs, sf0102342328, sf0400950425, sf0618816229, sf0709202668, and sf0817958573, were detected in the three years and the different groups, and sf0401819925 was detected in the three years and only in the full population. In addition, 16 lead SNPs were detected in two years. Three lead SNPs that were simultaneously identified in the full population and two subpopulations were merged for protein content. For example, loci sf0102103441, sf0400950425, and sf0709202668 were detected in different populations. We found that a multiple significant GWAS signal on chromosome 1, lead SNP sf0102103441, in a hot spot at 2.04–2.17 Mb, explaining 8.18–58.3%, was detected eight times in the different populations and environments, indicating that the loci exerted important roles in the protein content phenotype. Similarly, lead SNPs sf0102342328, sf0401819925, and sf0618816229 were detected five times (Table 1).
Co-localization Of Associated Sites With Previously Reported Qtls And Genes Related To Grain Quality
In the past decade, dozens of genes related to grain quality have been reported and cloned in rice. To evaluate significant GWAS signals, the localization of associated sites were compared with QTLs previously detected in cultivated rice, and genes related to grain quality reported in a previous study (Chen et al. 2018; Lou et al. 2009; Ryoo et al. 2007; Wang et al. 2007; Wang et al. 2015). Overlaps were identified between associated sites detected by GWAS and the reported QTLs or intervals related to grain quality genes. In this study 120 associated sites were co-localized with nine reported QTLs from the corresponding references (Table 1). Most of the QTLs that co-localized with reported QTLs were detected on chromosomes 2 and 7. Lead SNPs sf0707834002 and sf0708340365 overlapped with the reported QTL for both qPC-7 (Lou et al. 2009) and 7 − 4(5) (Wang et al. 2007). The lead SNP sf0719598121 was detected four times in or overlapping the reported QTL 7–8(9) (Wang et al. 2007) and lead SNP sf0719625273 (Chen et al. 2018) in both 2014 and 2016 in different populations (Table 1).
In the present study, nine association loci were found in regions of genes involved grain quality (Table 1). Two genes, Susy2 (LOC_Os03g28330) and Flo5 (LOC_Os08g09230), were less than 50 kb away from the lead SNPs and have been shown to play vital roles in synthesis of rice grain starch (Ryoo et al. 2007; Wang et al. 2015). In our study, lead SNPs were detected near six genes: Sar1a (LOC_Os01g23620) (Tian et al. 2013), GluB6 (LOC_Os02g15070) (Uemura et al. 2003), OsTudor-SN (LOC_Os02g32350) (Chou et al. 2017), and Glb1 (LOC_Os05g41970) (Morita et al. 2009), which were confirmed to participate in the biosynthesis and accumulation of storage protein. The six storage protein-related genes were less than 150 kb from lead SNPs (Table 1), likely because of the strong correlations among these SNPs. In addition to the genes described above, OsAGPL4 (LOC_Os07g13980) and GBSSII (LOC_Os07g22930) are involved in the starch synthesis pathway (Maung et al. 2021; Ryoo et al. 2007; Wang et al. 2015).
Validation Of Gwas Signals With Qtl Mapping
Interestingly, we found that many loci are co-located with amino acid content QTLs (Table 1). Referring to studies on amino acid content QTLs, two F9 recombinant inbred line populations were used to validate the authenticity of significant GWAS signals (Wang et al. 2007). The F9 recombinant inbred line population was hybridized between ZS97B and DL208 and named ZS97B/DL208, while the other population was derived from ZS97B and NYZ and named ZS97B/NYZ. To investigate loci from the GWAS results that were feasible and efficient, and to evaluate the genetic effects, NIL-F2 populations were constructed from the genetic background of ZS97B (Table 2).
QTL (2–4(5)) co-localized with lead SNP sf0206873340 was renamed qPC1.1; QTL (2–4(5)) co-localized with lead SNP sf0706126055 was renamed qPC1.2; QTL (7 − 4(5)) co-localized with lead SNP sf0709202668 was renamed qPC1.3; and QTL (1–12) co-localized with lead SNP sf0113186602 was renamed qPC1.4 (Table 2). The QTL qPC1.2 detected between the markers RM125 and RM214 on chromosome 7 presented the highest phenotypic variation (43.4%) and a logarithm of odds (LOD) score (13.04). The QTL qPC1.3, detected between the markers RM1186 and RM5499 on chromosome 7, explained 29.2% of the phenotypic variation, and had the highest LOD score (22.8). The QTL was previously reported and validated, and found to be reliable (Chen et al. 2018). The QTL qPC1.1, flanked by RM555 and RM492 on chromosome 2, presented a LOD score of 7.9 and phenotypic variation of 9.82%. The QTL qPC1.4, was only detected in ZS97B/DL208, explained 13.6% of the phenotypic variation, and had a LOD score of 6.7. The QTLs, qPC1.2 and qPC1.3, underlined a dominant allele from NYZ and showed a positive additive effect on protein content, suggesting that the allele of ZS97B increased protein content; however, the other two QTLs, underlining a dominant allele from ZS97B, showed a negative additive effect on protein content, and the allele at ZS97B decreased the protein content (Table 2). Thus, all four QTLs controlling protein content were stable, indicating that GWAS signals were reliable.
Phenotypic characterization of the flo5 mutants
The GWAS results identified Flo5 in the total population in 2014 (Table 1; Fig. 3a). We performed haplotype analysis of SNPs in the promoter (2 kb) and gene regions, except for the intron of Flo5, and analyzed the GPC of different haplotypes. Ten haplotypes of Flo5 were identified, and the GPC of the haplotypes differed over the 3 years (Fig. 3b). The GPC of Hap2 in 2014 and Hap5 in 2015 was significantly higher than that of the other haplotypes in the same year. In 2016, the GPC of Hap2 and Hap5 was significantly higher than that of Hap3 and Hap7 (Fig. 3b). The results suggested that Flo5 may affect the GPC in rice. Flo5 has been cloned and affects the quality of rice grain (Ryoo et al. 2007). However, the effect of Flo5 on the GPC of rice remains unknown. In this study, a CRISPR/Cas9-knockout construct expressing three guide RNAs targeting two exons was introduced into ZH11 (Fig. 3c, d), and three homozygous mutants, flo5-1, flo5-2, and flo5-3, were generated (Fig. 3d, e). Compared to wild-type ZH11, flo5-1 had a 3 and 4 bp deletion at the first and third targets, respectively. flo5-2 had a one-base C insertion and a 2 bp deletion at the first and third targets, respectively. flo5-3 had a 6 bp deletion at the first target, and a large 43 bp deletion between the second and third target sites. These results led to the production of frameshift mutations in all three mutants (Fig. 3d). The chalkiness rates of flo5-1, flo5-2, and flo5-3 mutants were significantly higher than that of ZH11, and the chalkiness rate of flo5-1 was the highest (nearly 80%) (Fig. 3e, f). The rice GPCs of mutant flo5-1, flo5-2, and flo5-3 were significantly higher than that of ZH11, with that of flo5-2 being the highest (Fig. 3g), indicating that flo5 could increase rice GPC.
Analysis Of Candidate Genes
GWAS detected many SNPs in different years and populations (Table 1; Fig. 4a, b). Candidate gene analysis of repeatedly detected loci can provide a basis for cloning rice GPC genes. We analyzed the lead SNP sf0706126055, located on chromosome 7. Lead SNP sf0706126055, which was repeatedly detected in 2015 and 2016, was verified by the genetic population and was detected in all and indica populations (Tables 1, 2). By analyzing the candidate genes within 100 kb upstream and downstream of the lead SNP sf0706126055, multiple haplotypes of three genes, LOC_Os07g11120, LOC_Os07g11150, and LOC_Os07g11250, were found in the full population. The GPC of different haplotypes was significantly different in 2015 and 2016, and three genes were expressed in the endosperm of Minghui 63 (MH63) and ZS97B (Fig. 4c-k). The candidate gene LOC_Os07g11120 encodes a hydrolase belonging to the NUDIX family, and Hap7 showed a significantly higher GPC than Hap4, Hap5, and Hap6 in two years. LOC_Os07g11120 was highly expressed in the endosperms of MH63 and ZS97B (Fig. 4c-e). The candidate gene LOC_Os07g11150, which encodes an expressed protein, was found in the endosperm of MH63 and ZS97 (Fig. 4f-h). The GPC of Hap3 of LOC_Os07g11150 was significantly higher than that of Hap4, Hap7, and Hap10 in 2015, and the GPC of Hap7 was significantly lower than that of other haplotypes in 2016. The candidate gene LOC_Os07g11250 also encodes an expressed protein and was highly expressed in the endosperm of MH63 and ZS97B. The GPC of Hap4 was significantly higher than that of Hap1 and Hap2 in two years (Fig. 4i-k). Genetic variations of three candidate genes with different haplotypes were further analyzed (Fig. 5). LOC_Os07g11120 and LOC_Os07g11250 had more SNPs than LOC_Os07g11150, but most of SNPs were distributed in the promoter, and intragenic regions of candidate genes contained eleven and twelve SNPs, respectively (Fig. 5a, b). LOC_Os07g11150 had few SNPs, with only two SNPs in the intragenic region (Fig. 5b). The haplotype distributions in different subpopulations were analyzed (Fig. 5). Hap4 of LOC_Os07g11120 with significant phenotypic differences was only found in japonica subpopulation, while Hap6 and Hap7 were mostly found in indica subpopulation (Fig. 5a). Hap7 of LOC_Os07g11150 with significant phenotypic differences was also only distributed in japonica subpopulation, while Hap2, Hap3, Hap9, and Hap10 of LOC_Os07g11150 were mostly found in indica subpopulation (Fig. 5b). Hap2 of LOC_Os07g11250 with significant phenotypic differences was only existed in japonica subpopulation. While Hap4 and Hap5 were basically contained in the indica subpopulation, Hap6 and Hap7 were mainly found in Aus subpopulation (Fig. 5c).