Characteristics of SNPs
High-quality re-sequencing raw data of 191 accessions derived from the USDA Rice Mini Core (Supplementary Table S1), was retrieved from NCBI SRA database (Accession: PRJNA301661) [52]. Genotyping of the 191 accessions were performed by GATK software. A total of 3,259,478 SNPs was obtained after filtration by minor allele frequencies (≥0.05) and integrity (≥0.4). Imputed SNPs, which were generated by Beagle 5.0 software [53], were used for further analyses. Distribution of these SNPs in the genome is summarized in Table 1 and Figure 1a, and the overall SNP density in the genome was 114.51 (bp/SNP). The number of SNPs ranged from 212,238 to 375,296 across the twelve rice chromosomes. Chromosome 4 held the minimum marker density with 127.23 (bp/SNP), while chromosome 11 exhibited a maximum marker density with one SNP per 100.40 bp.
Population Structure and Linkage Disequilibrium
Admixture analysis divided the 191 accessions into four ancestries, including Indica (63 accessions), Aus (37 accessions), Temperate Japonica (28 accessions), and Tropical Japonica (31 accessions) under the best K model (K = 4) (Fig. 1b), which was determined by the lowest CV (cross-validation error) score (Fig. 1c). Thirty-two accessions are classified as admixture (ADM) since the ratio from each single subpopulation is below 70%.
In order to reduce the amount of calculations, high-quality SNPs (SNPs integrity above 0.8) were selected to construct a maximum likelihood (ML) tree to illustrate the phylogenetic relationship of the 191 rice accessions (Fig. 1d). The population was divided into four subpopulations and the color for each clade was determined according to the Admixture analysis results. The relationship obtained from phylogenetic tree is in line with the Admixture analysis.
Principal component analysis (PCA) was performed based on the 3,259,478 SNPs. Four conceivable subpopulations were separated by PC1, PC2, and PC3. The first three principal components (PCs) explained over 50% of the genetic variation. The first PC separates Indica and Japonica subpopulations (35.70%), the second PC distinguishes the Aus and Indica varieties, while the third PC separates Temperate Japonica and Tropical Japonica varietal groups (Fig. 1e and 1f). Based on the results from the Admixture analysis, phylogenetic tree and PCA, the population was divided into four subgroups. In addition, the decay of LD with the physical distance between SNPs in all population occurred at 191 kb (r2= 0.2) (Fig. 1g), which is similar to that of a previous study [54]. Indica subpopulation exhibited the most rapid LD decay and Temperate Japonica showed the most extended LD.
Correlation of different traits
Correlation analyses between grain ionomics in flooded environment and agronomic traits (Supplementary Fig. S1a), between grain ionomics in unflooded environment and agronomic traits (Supplementary Fig. S1b), and between grain ionomics in flooded versus unflooded growth conditions (Supplementary Fig. S1c) were conducted. The results showed that days to flowering has strong correlation with Rb in flooded (0.53) and unflooded (0.57) (Supplementary Fig. S1a and S1b) environments. The accumulation of Cd, Mo, and Rb in rice grain in flooded environment and unflooded environment are correlated at r = 0.52, 0.81, and 0.60, respectively (Supplementary Fig. S1c).
Genome-wide association study by univariate GWAS and multivariate GWAS
Sixteen grain ionomic traits (As, Ca, Co, Cd, Cu, Fe, K, Mg, Mn, Mo, Ni, P, Rb, S, Sr, and Zn) under flooded and unflooded conditions were the same as reported [17]. Thirteen agronomic traits, including AMYLOSE, AWNTYPE, DAYSFLOWER, HULLCOLOR, HULLCOVER, KERNELLEN, KERNELWID, KERNELRAT, KERNELWT, LODGING, PANICLETYPE, PLANTHT, and PLANTTYPE [18] were shared by Yan as reported [55, 56] and recorded using the methods described by Li et al [57–59]. All these traits were analyzed using two univariate GWAS (GLM and MLM) and two multivariate GWAS (MLMM and FarmCPU) methods to identify QTLs. A total of 106 significant QTLs (p-value <1.53×10-8) were detected to be associated with concentrations of 9 ionomic traits (Cd, Co, Cu, K, Mo, Ni, Rb, Sr, and Zn) in rice grain under flooded condition, in which 63, 68, 17, and 44 significant QTLs were identified by GLM, MLM, MLMM, and FarmCPU, respectively (Fig. 2 and Supplementary Fig. S4b). For Cd, twenty-eight significant QTLs were identified. Three of them located near published genes (CAL1 [14], OsHMA2 [60], rgMT [61]) which have shown to be related to Cd accumulation or resistance. Seven of them were identified in previous mapping studies using univariate methods (Supplementary Table S2). All of the seven QTLs were identified by univariate GWAS methods (GLM and MLM) but only two of the seven were also detected by multivariate methods (MLMM and FarmCPU) in our study. For Co, a total of eleven significant QTLs were detected. Two (one was identified by univariate methods and the other was detected by FarmCPU) of them co-located with previous reported QTLs. Nine of them were new QTLs discovered in the current study, MLMM method discovered 2 significant QTLs and FarmCPU method identified 7 QTLs, respectively. Three QTLs were detected to be significantly associated with K, one of which (only detected by FarmCPU) was also detected in previous studies [62]. For Zn, ten significant QTLs were identified, three of which co-located with previously reported loci [12, 13, 62, 63]. Among them, one significant QTL posited around 1,8001,929 bp of Chromosome 7 was detected by both univariate and multivariate methods, which located near reported QTL qZN-7 [13]. The other two QTLs were detected by FarmCPU method only.
In the unflooded environment, only 47 QTLs were detected to be significantly associated with Cd, Fe, Mo, Ni, and Zn concentration. Specifically, 29, 25, 10, and 20 significant QTLs were identified by GLM, MLM, MLMM, FarmCPU, respectively (Fig. 2 and Supplementary Fig. S4c). Twenty-three identified QTLs were related to Cd, one of which located near CAL1 gene [14], eight QTLs co-located with previous studies (Supplementary Table S2). Among the eight co-localization QTLs, five were detected by univariate methods and three were identified by multivariate methods. For Fe, seven significant QTLs were identified, two of which were also reported by previous studies [62, 63], and both were detected by FarmCPU only in the current study. We noticed that for the traits that many QTLs were identified using GLM and MLM methods, the numbers of QTLs identified by MLMM and FarmCPU were less as shown in the case of Cd and Mo. When GLM and MLM method failed to identify or only identified a few significant QTLs, QTLs were successfully identified by MLMM and FarmCPU methods as shown in the case of identifying QTLs for Co, Fe, K, and Zn concentration regulation (Supplementary Fig. S2). The QQ-plots in Fig. S2 (c) shows the power of MLMM and FarmCPU, which indicated no evidence for inflation but strong evidence for real effects. In contrast, the QQ-plots of GLM and MLM in Fig S2 (c) shows the tendency of false positive peaks. This observation was further confirmed when mining the key candidate genes controlling ionomic and agronomic traits as shown in the section below. Interestingly, only 3 of the 106 (ionomic) QTLs identified in flooded growth condition were shared with the QTLs identified in unflooded condition. The three QTLs (QTLs marked with an asterisk on Chromosome 5, 6, and 12; Figure 2, Supplementary Fig. S2, and Supplementary Table S2) share by both growth condition were associated with Cd and Mo concentration regulation, indicating that the traits of these three QTLs were not impacted by water conditions. Furthermore, several loci were shown to be associated with more than one trait, indicating these QTLs may be pleiotropy. For example, the region around 15.5 Mb on chromosome 2 is associated with Cd and Mo (Fig. 2).
For agronomic traits, a total of 97 significant QTLs (p-value <1.53×10-8) were detected for the thirteen agronomic traits described above except for KERNELWT and PLANTTYPE (Fig. 3, Supplementary Table S2). In detail, 39, 16, 29, and 50 significant QTLs were identified by GLM, MLM, MLMM, and FarmCPU, respectively (Supplementary Fig. S4a). Wax [64] and ALK [65] genes were shown to be significantly associated with amylose content, which is consistent with previous reports. Gain size is a key agronomic trait that strongly linked to yield and quality. Many QTLs have been reported associating with rice grain size, which is decomposed into grain length, width, and thickness (GS3 [66], GS5 [67], GW5 [68], GW8 [69] , GL7 [70], TGW6 [71], etc.). In this study, four types of rice grain size-related traits included kernel length (KERNELLEN), kernel width (KERNELWID), kernel rate (KERNELRAT), and kernel weight (KERNELWT) were analyzed. A total of 13 QTLs were detected by univariate and multivariate GWAS methods. Among them, three were detected by univariate (GLM or MLM) GWAS methods and twelve of them were detected by multivariate (MLMM and FarmCPU) GWAS methods. Six, two, and five of the 13 QTLs were found to be associated with KERNELLEN, KERNELWID, and KERNELRAT, respectively. No significant QTL was shown to be associated with KERNELWT. Remarkably, one QTL (Chromosome 3 position 16,733,441) was detected by all the four methods (Supplementary Fig. S3f). The QTL locates on gene GS3 [66], which is a major gene regulating grain size and organ size. It is worth noting that, five more significant SNPs were identified by FarmCPU and one of them located on chromosome 4 situated nearby the NAL1 gene, which has been shown to be related to rice yield [72]. For the trait of KERNELWID, two significant QTLs were only detected by MLMM method solely and other methods failed to identify candidate QTLs. One of the identified QTLs positioned around 5,364,561 bp of chromosome 5, which was found approximately 0.56 kb apart from a well-known GW5 gene controlling rice grain width (Supplementary Fig. S3g) [73]. These results demonstrated the power of GWAS, especially the power of the multivariate (MLMM and FarmCPU) GWAS methods.
Mining candidate genes of agronomic-related traits
Lodging and Plant height are both related to cell wall properties, which could impact rice yield. Appropriate plant height and the strong stem are required for stable production [74]. A cluster of SNPs around 33.4 Mb on chromosome 1 (Lodging: 33,010,693 to 33,975,764 with leading SNP at 33,469,251; Plant height: 33,181,529 to 33,730,067 with leading SNP at 33,363,796) is shown to be significantly associated with lodging and plant height (Supplementary Fig. S3j and S3k). Through LD block analysis, we defined a 72.37 kb blocks (33,458,683- 33,531,049) containing 12 genes to be the candidate locus. Among these genes, OsPME6 (Os01g0788400) is annotated as pectin methylesterase 6, which is related to the cell wall modification process. We further conducted blastP analysis with Arabidopsis thaliana and found that it shares high homology (E value = 3E-178) with Arabidopsis gene PME18 (AT1G11580) (Supplementary Table S3). The expression of PME18 increased dramatically under hyper gravity stimulus. It is speculated that pectin esterases induced pectin demethylation of carboxyl groups which increased the rigidity of pectin gel in the cell wall through calcium bridges [75]. Therefore, it is worth to test if OsPME6 regulates rice lodging and height.
Flowering time is another important trait critical to rice production. Rice is a typical short-day (SD) flowering plant whose flowering is greatly affected by day length. A number of genes [76–79] have been reported to regulate rice flowering-time. In the current study, a total of three QTLs were significantly associated with the flowering time. Two of them were detected by FarmCPU exclusively on Chromosome 7 and 10. The other QTL on Chromosome 6 was detected by all the four different GWAS methods (Supplementary Fig. S3c). The haplotype analysis showed that this region only harbored 2 genes (OsPLL9 and OsPLL10). Among them, OsPLL9 (Os06g0583900) located 7.15 kb away from the leading SNP. This gene is a homolog of pectate lyase gene, which may play crucial roles during rice panicle development [80]. OsPLL9 is highly expressed in Stamen (one day before flowering), Palea (one day before flowering), and Panicle5 (heading stage) (Supplementary Fig. S5). Thus, OsPLL9 has the potential to be a candidate gene with a critical role in rice flowering.
Mining candidate genes of ionomic traits
As a result, 28 and 23 significant QTLs were detected to be associated with Cd concentration in the flooded and unflooded environment, respectively. QTLs near CAL1 (Chr2:25,190,487-25,191,188) were associated with rice grain Cd accumulation in both flooded (Leading SNP; Chr2: 24,968,588) and unflooded condition (Leading SNP; Chr2: 25,143,071). CAL1 was annotated as a defensin-like protein, which could regulate Cd accumulation of rice leaves through translocating Cd from cytosol into extracellular spaces, but not rice grains [14]. We then further analyzed the genes around the QTLs and found there is an ABC transporter (Os01g0121700), its phosphorylation level was up-regulated under high Cd treatment (100 μM CdCl2·2.5H2O) and it has been shown that the transporter reduces the concentration of Cd 2+ through transporting PCs-Cd into vacuole [81]. Another QTL (Chr6: 29,733,715) is also showed strongly related to Cd concentration in rice grain. This QTL locates near a known gene OsHMA2 (about 253kb away from the leading SNP, but not in LD region), which may decrease rice grain Cd concentration through suppressing the expression level of OsHMA2 [60]. In addition, significantly associated SNP (Chr11: 29,014,045) posited near rgMT gene, which was a metallothionein protein responded to the Cd stress in E. coli [61]. Comparing the QTLs detected in this study with previously reported studies, we found that over fifteen QTLs were co-localized with reported loci. The details were shown in Supplementary Table S2. Meanwhile, thirty-two new QTLs were identified.
By applying the procedure mentioned in the methods section, we obtained a list of genes that represent plausible candidates for the causal gene for each of the loci controlling elemental concentrations in rice (Supplementary Table S3 and S5). We select three loci associated with Cd for further investigation with the aim of identifying the causal novel genes. A significant QTL associated with Cd was identified on Chromosome 1 around nucleotide at 4,348,829 bp with p-value 3.37E-10 (MLM method). This QTL posited within a 9.97 kb block (Chr1: 4,345,517 - 4,355,489) containing only one candidate gene OsWRKY102 (Os01g0182700) (Fig. 4a and 4b). BlastP analysis showed that the OsWRKY102 (Os01g0182700) was highly homologous (1.00E-58) to Arabidopsis WRKY13 (AT4G39410) gene (Supplementary Table S4). WRKY13 activates the expression of gene PDR8 that encodes a Cd2+ extrusion pump, resulting in reduced Cd accumulation [82]. The expression profile from public data showed that OsWRKY102 was intensively higher expressed in stem comparing to other tissues (Supplementary Fig. S6a). When treated with a high concentration of cadmium, the expression level of OsWRKY102 increased rapidly in both shoot and root (Supplementary Fig. S6b). Overall, the results suggested that OsWRKY102 responds at high-level cadmium treatment and regulates cadmium uptake and accumulation in rice. Another QTL (Chromosome 5 posited around 14,941,717) was identified in a flooded environment. Through LD analysis, we defined an 18.65 kb block (Chr5: 14,930,444 - 14,949,090) containing two genes, Os05g0321600 and Os05g0321900. Among them, Os05g0321900 (OsWRKY75) was annotated as DNA-binding WRKY domain-containing protein (Fig. 5a). BlastP analysis found that this gene shared high homology (4E-52) with WRKY55 (AT2G40740) in Arabidopsis thaliana, which regulated gold uptake and tolerance. Remarkably, one QTL (Chromosome 6 around position 11,906,590) was identified under both growth environments (Fig. 5a and 5b). This SNP is a singleton, and we failed to define haplotype blocks for it. For further mining causal gene, we expanded the region as we mentioned in the method section. Finally, we found an important gene OsMan07 (Os06g0311600) was only 24.75 kb away from the leading SNP. This region was reported in previous study [83]. BlastP analysis found this gene had a high similarity (6E-108) to Man3 (AT3G10890) gene in Arabidopsis thaliana (Supplementary Table S4). Overexpression of MAN3 enhanced Cd accumulation and tolerance, whereas loss‐of‐function of MAN3 led to decreased Cd accumulation and tolerance [84]. All the genes’ expression patterns located in the haplotype region associated with Cd were showed in Supplementary Figure S7. Overall, thirty-two new QTLs were identified in addition to precise identification of the loci reported in previous Cd studies.
PIP2;6 has been suggested to be involved in arsenic concentration control [85]. Although no SNP with Bonferroni-corrected significant thresholds -log 10(p) above 7.81 was discovered, there was an SNP peak with -log(p) around 6 on the chromosome 4 near the published gene PIP2;6(Supplementary Fig. S2a), Suggesting that PIP2;6 is located near a candidate QTL revealed by GWAS analysis.
Comparison of univariate and multivariate GWAS Methods
Our results demonstrated that there was not a single method that was able to detect all the QTLs while many loci were detected by all methods. The GS3 gene was shown to be associated with grain length by all the four tested methods. However, the GW5 gene was detected to be related to grain width only by multivariate GWAS. Similarly, the Cu related QTLs in flooded conditions and Fe related QTLs in unflooded condition were also detected by multivariate method only. Interestingly, when a large number of QTLs were identified by the univariate method, the QTLs identified by multivariate method were substantially reduced. For example, over 29 QTLs for Cd were identified by univariate methods in flooded and unflooded environment but only six QTLs were identified by multivariate. Further, it appeared that the multivariate methods were able to pinpoint the location of the QTLs more precisely on the chromosome compared with the univariate methods in many cases. As shown in Supplementary Figure 3f, the peaks identified by univariate method were much broad than the peaks identified by multivariate methods. On the other hand, the univariate methods also identified many loci exclusively. In the case of LODGING, the QTL (candidate gene: OsPME6) located on chromosome 1 was only detected by GLM (Supplementary Figure 3j) and this gene was related with cell wall formation biological process, indicating that some of candidate QTLs might be ignored when we pursued lower FDR in multivariate methods.