Integrating GWAS and transcriptomics to identify genes involved in seed dormancy in rice

Several QTLs and genes responsible for seed dormancy were detected and SNP candidates were shown to cause changes in seed germination. Seed dormancy is a key agricultural trait to prevent pre-harvest sprouting in crop plants such as rice (Oryza sativa L.), wheat (Triticum aestivum), and barley (Hordeum vulgare L.). However, our knowledge of seed dormancy is hampered by the complexities of studying a trait that changes over time after seed harvest, and is complicated by interactions between phytohormones, seed coat components and the environment. Here, we have conducted a genome-wide association study using a panel of 311 natural accessions of cultivated rice, examining a total of 519,158 single nucleotide polymorphisms (SNPs). Eight quantitative trait loci (QTLs) were found to associate with seed dormancy in the whole panel and five in the Japonica and Indica subpanel; expression of candidate genes within 100 kb of each QTL was examined in two published, germination-specific transcriptomic datasets. Ten candidate genes, differentially expressed within the first four days post-imbibition, were identified. Five of these genes had previously been associated with awn length, heading date, yield, and spikelet length phenotypes. Two candidates were validated using Quantitative Reverse Transcription (qRT)-PCR. In addition, previously identified genes involved in hormone signaling during germination were found to be differentially expressed between a japonica and an indica line; SNPs in the promoter of Os9BGlu33 were associated with germination index, with qRT-PCR validation. Collectively, our results are useful for future characterization of seed dormancy mechanism and crop improvement, and suggest haplotypes for further analysis that may be of use to boost PHS resistance in rice.


Introduction
Rice (Oryza sativa) is the staple food for over half the world's population, so safeguarding its production in the face of global warming is critical for food security. Changing environmental conditions can severely affect rice growth and yield; for example, prolonged rainfall and high humidity during rice maturation can lead to seed germination in rice panicles prior to harvest, giving rise to heavy economic and production losses (Zhu et al. 2019). This phenomenon is termed pre-harvest sprouting (PHS) and can occur due to low seed dormancy. Further understanding of seed dormancy is required to avoid PHS in different environments.
Seed dormancy is "one of the least understood phenomena in seed biology" (Finkelstein et al. 2008). Medium seed dormancy is regarded as a desirable domestication trait (Lohwasser et al. 2005), as it can ensure uniform germination of crops and prevent the risk of PHS (Gubler et al. 2005). Dormancy is known to be affected by the balance between two phytohormones, gibberellic acid (GA) and abscisic acid (ABA), but they are not the sole regulators of dormancy (Finkelstein et al. 2008). Other factors, such as seed condition, seed coat color, and environmental signals all influence dormancy (Furukawa et al. 2007;Sweeney et al. 2006). Many quantitative trait loci (QTLs) for seed dormancy have been identified in crops and some model plants, such as Arabidopsis (Atwell et al. 2010;Basbouss-Serhal et al. 2016), soybean (Dougherty and Boerma 1984), barley (Gao et al. 2003;Gawenda et al. 2015;Vanhala and Communicated by Takuji Sasaki.
Genome-wide association studies (GWAS) have been applied in plants to accelerate crop improvement and QTL study (Atwell et al. 2010;Huang et al. 2010;Yang et al. 2014). Using GWAS to study rice dormancy has been performed by several groups using germination rate and index as the key phenotypic traits, but due to differences in populations and data processing, different genes and QTNs have been identified. Magwa et al. (2016) detected 16 loci significantly associated with germination percent in freshly harvested seeds and 38 loci in after-ripened seeds (Magwa et al. 2016). Lu et al. (2018) found 9 known and new QTNs to be significant . These studies show great power and credence, but not all SNPs and loci mapped can be resolved as not all genes in the genetic maps have been annotated, so candidate genes are difficult to select. More information, and cross-validation from different techniques and studies, will help in improving the accuracy of selecting candidate genes associated with dormancy.
Here, we use GWAS on a new rice association panel paired with existing transcriptomic datasets from the literature to identify candidate genes and SNPs affecting gene expression (eSNPs) in different cultivars. Candidate genes containing SNPs identified from GWAS are checked for their differential expression in two different RNA-seq datasets. Relationships with associated traits are determined by cross-checking SNPs in candidate gene with the RiceVar-Map database (Zhao et al. 2015), and eSNPs for previously reported genes are assessed for effect on germination. Our results provide a basis for further elucidating the molecular mechanism behind PHS and seed dormancy of rice.

Plant materials
The association mapping population consisted of 311 rice accessions previously described in detail by . They can be classified into three subgroups: japonica (n = 160), indica (n = 138) and the mixed group (n = 13). These cultivars were grown in paddy fields around Shanghai, China in May 2018. Cultivars with phenotypes of interest were re-planted in May 2019. Each accession was planted in one row with 6 hills per row, and grown to maturity in the field.
The heading time of each accession was recorded when the first panicle emerged. To minimize the influence of different heading time, spikes from different plants were harvested in September (2018 or 2019), approximately 30-40 days after heading.

Phenotype assessment for seed dormancy
After harvest, seeds were immediately removed from the spike and sterilized using 0.6% (v/v) sodium hypochlorite solution for 15 min, and rinsed with distilled water. Two replicates of 30 seeds were placed on two layers of Whatman #1 filter paper in 9 cm Petri dishes, wet with 20 ml sterile water. The Petri dishes were incubated for six days at 28 °C/100% humidity in the light condition.
Germination was defined as emergence of the rootlet. From the second day after imbibition, germinated seeds were counted and removed daily. Germination index (GI) on day 6 was determined as the proportion of seeds (out of the original 30) that had germinated. GI results are presented as the mean of two replicates (Online Resource 1).
We applied two models for the GWAS analysis for the whole panel: Factored Spectrally Transformed Linear Mixed Model (fastLMM) was conducted using FastLmm (v2.07) (Lippert et al. 2014); and Multi-Locus Mixed Model (MLMM) was conducted using TASSEL for format conversion and GAPIT package in R (Bradbury et al. 2007;Tang et al. 2016). Manhattan plots were generated using ggplot2 package in R (Villanueva and Chen 2019). Only fastLMM was conducted for japonica and indica subpanel.

Fixation index analysis
The fixation index between subgroups was calculated using vcftools (v0.1.13) (Danecek et al. 2011). The window size was set as 20 kb and window step set as 10 kb. Fixation index plot was made using pheatmap package in R.

Principal component analysis
The principal component analysis was conducted using plink (v1.9) -pca command (Purcell et al. 2007). The first two principal components were extracted. PCA plot was made in R with subpopulations assigned with different colors.

Transcriptome analysis
Two rice transcriptome datasets (SRP277875, GSE115371) were downloaded from the Gene Expression Omnibus (GEO) database using the Sequence Read Archive (SRA) toolkit (Narsai et al. 2017;Yang et al. 2020). For GSE115371 dataset, only timepoints from aerobically grown samples were used, which includes 0 h, 1 h, 3 h, 12 h, 1 d, 2 d, 3 d and 4 d timepoints after imbibition. The raw data were quality controlled using Trimmomatic (v0.39) software, and mapped to MSU 7.0 rice reference genome using Hisat2 (v2.1.0) (Bolger et al. 2014;Kawahara et al. 2013;Kim et al. 2015). The aligned files were sorted and turned into bam files using samtools (v1.9) (Li et al. 2009). Reads aligned were counted using HTSeq package (Anders et al. 2015). For differential expression analysis, R package DeSeq2 was used. Genes with a false discovery rate < 0.05 and |log 2 (fold change)|> 1 were considered as DEGs (Love et al. 2014). RPKM data were calculated using read counts, and heatmap plots were made based on the RPKM data using pheatmap package in R.

Haplotype analysis
SNP data for the candidate genes were extracted from the obtained SNP dataset. This dataset only contained biallelic SNPs. Heterozygous and missing alleles are excluded from further haplotype analysis. For genes found in QTLs, only non-synonymous SNPs within the coding regions of these genes were applied for haplotype analysis, while for other five genes, which have been previously identified to be involved in hormone signaling during germination, SNPs within the promotor region (3 kb upstream of the gene) were used for haplotype analysis to find eSNPs. Haplotype analysis was conducted in R to determine whether the loci can cause germination index change by Student' s t test. The results were visualized using GraphPad Prism (v8.4.2). The number of accessions correspond to each SNPs was calculated and checked manually.

Phylogenetics of the AIRP genes
The homologs of LOC_Os07g17400 in rice (Oryza Sativa), sorghum (Sorghum bicolor) and Arabidopsis (Arabidopsis thaliana) were extracted from SALAD database (v3) (Mihara et al. 2010). To further include sequences from barley (Hordeum vulgare) and wheat (Triticum aestivum), the sequence of LOC_Os07g17400 was used for blast search on NCBI. All of the obtained sequences were aligned using MUSCLE algorithm (Edgar 2004). MEGA6 software was used to create a phylogenetic tree. Neighbor-joining methods were used with a p-distance model and pairwise detection and bootstrap with 1000 replicates (Tamura et al. 2013).

RNA isolation and Quantitative Reverse Transcription (qRT)-PCR Validation
After haplotype analysis, lines with different haplotypes were selected for expression validation. Seed samples were placed on two layers of Whatman #1 filter paper in 9 cm Petri dishes, wet with 20 ml sterile water. After three days' imbibition, whole coleoptiles were sampled and snap frozen. TRIzol® reagents (Invitrogen, Carlsbad, CA, United States) were used to isolate RNA from each of three biological samples. RNA quality was assessed by 1% agarose gel electrophoresis, Nanodrop (ThermoFisher), and Bioanalyzer (Agilent 2100), with A260/A280 = 1.8-2.2. The cDNA was synthesized using the Fast RT Kit (Tiangen Biotech, Beijing), following the manufacturer's instructions. qRT-PCR analysis was performed on the LightCycler 96 (Roche). The Ubiquitin gene (LOC_Os03g13170) was used as internal control, and the relative expression level of target genes was calculated based on the 2 -ΔΔCt method (Livak and Schmittgen 2001). Primer sequences for qRT-PCR were listed in Online Resource 11. The expression comparison of different groups was conducted using Two-way Anova method by Python script.

Population structure and germination differences
Analysis of SNPs in an association panel of 311 natural rice accessions using phylogenetics and fixation index analysis revealed that the japonica (n = 160) and indica (n = 138) varieties are easily differentiated (] Fig. 1). The mixed subgroup (n = 13), comprising mainly Australian and Basmati cultivars, had more members similar to the indica subpopulation.
The germination index (GI) of harvested seeds was used to test the degree of seed dormancy for each accession. The GI average across all 311 lines was 0.24, with the japonica GI slightly higher, at 0.25, while the indica GI was 0.23, with a higher standard deviation between accessions Table 1. The mixed group had the lowest GI (0.15): while their group was too small to draw any general conclusions, their inclusion in the subsequent GWAS analysis may provide insights into the mechanisms of seed dormancy and pre-harvest sprouting.

GWAS and candidate gene selection
A total of 519,158 SNPs were included for association analysis. We used two different models for GWAS, fastLMM and MLMM for the whole panel Fig. 2. The QTLs were assigned when the peak SNP correlation exceeded the calculated threshold. Eight QTLs, each of 200 kb, were identified for the whole panel ( Fig. 2; Table 2).
To further analyze the genetic basis of seed dormancy, the whole panel was divided into two subpanels: japonica subpanel and indica subpanel. fastLMM was used for GWAS (Online Resource 3). Two QTLs were identified for the japonica panel and three QTLs were identified for the indica panel. (Online Resource 4).
Expression of candidate genes within those QTLs was examined using two transcriptomic datasets from the GEO database. Dataset GSE115371 provided gene expression data at 0 h, 1 h, 3 h, 12 h, 1 d, 2 d, 3 d and 4 d timepoints after imbibition, for an Australian variety (Amaroo); only timepoints from aerobically grown samples were used (Narsai et al. 2017). DEGs were identified by comparing expression at different timepoints to expression at 0 h. Dataset SRP277875 provided expression data for a japonica variety (02,428) and an indica variety (YZX) at 2, 3, and 4 days after imbibition compared with a 0 h timepoint (Yang et al. 2020). The data of the japonica variety were used for expression comparison of candidate genes (Online Resource 5).
Ten candidate genes from five QTLs identified by GWAS in the whole panel were differentially expressed across the first four days post-imbibition (Table 2; Online Resource 5). Five genes had similar expression patterns in both datasets (Online Resource 5), namely LOC_Os07g17400, OsNFYA5 (expressed at 0 h and soon after imbibition); LOC_Os08g06090, PPT1 (2 d post-imbibition); and ROMT9 (3 d post-imbibition). Two

Bioinformatic analysis of the candidates and qRT-PCR validation
SNP variations in five candidate genes had previously been associated with phenotypes such as awn length, heading date, yield, and spikelet length (RiceVarMap;  Haplotype analysis confirmed that three SNPs in our candidate genes associated with GI caused non-synonymous mutations (Online Resource 7). In ROMT9, a glycine to serine change caused GI to drop. Similarly, a threonine to alanine change in PPT1 reduced GI, but the number of T-haplotype varieties was very small so further verification is required. An asparagine to aspartate change in LOC_Os09g12650 caused GI to increase. The expression difference of ROMT9 and LOC_Os09g12650 in the population was validated using qRT-PCR method. Fig. 3 Phylogenetic analysis of the predicted protein encoded by LOC_Os07g17400 reveals homology to AIRP family proteins in maize, Triticum, and Arabidopsis (Online Resource 9).

eSNPs associated with seed germination
To further find eSNPs associated with seed germination, we compared expression in the japonica and indica varieties in dataset SRP277875 (Yang et al. 2020). More than twice as many genes were up-regulated in the japonica than in the indica variety (Online Resource 8), suggesting that the germination difference between these two rice subpopulations can be caused by the differential expression of genes.
Five genes known to be involved in seed germination were differentially expressed in this dataset Table 3. OsGA20ox1, OsGAP, and Os9BGlu33 were up-regulated in japonica 02,428, while OsGA3ox2 and OsPP2C51 were up-regulated in indica YZX. The timing and extent of differential expression differed for each gene. The largest change was in OsGAP, whose expression was 6.5 × higher in the japonica variety at 0 h, and 2.3 × higher at the 3 d timepoint. We found four mutations in the promoter of Os9BGlu33 that are associated with a change in GI Fig. 4; Online Resource 10). The AACG-type cultivars have a higher expression level than that of GGGT-type. (p < 0.001, Two-way Anova test; Online Resource 10). 1 3

Candidate genes for seed dormancy
Seed dormancy is not a very stable trait, as the ability of seeds to break dormancy increases over time. Here, we have combined GWAS with transcriptomics analyses to cross-validate phenotyping results. Three out of ten candidate genes associated with dormancy had previously been linked with awn length (Online Resource 6); as awns tend to reduce germination rate (Ntakirutimana et al. 2019), we suggest that these genes may regulate seed germination by regulating awn development in natural populations. As long awns were important for seed dispersal in wild rice and gradually lost during domestication (Luo et al. 2013), awn loss may cause loss of desired dormancy in domesticated lines. Thus, GWAS from natural populations can reveal mechanisms that may be lost using bi-parental mapping on inbred lines, which in this example, may have been selected for short or missing awns. For the awn-associated candidate LOC_Os09g12650, its homolog in Arabidopsis, AtAIRP2, is a cytosolic RING-type E3 ubiquitin ligase that can positively regulate ABA response in Arabidopsis, while its knockout mutant is insensitive to ABA during seed germination (Oh et al. 2017). The involvement of LOC_Os07g17400 in ABA response is consistent with its proposed role in rice seed dormancy.

Differential expression of seed dormancy-related genes in japonica and indica rice
The GI of japonica rice was generally higher than for indica rice varieties Table 1, and nearly twice as many genes were up-regulated in the 4 days post-imbibition in japonica 02,428 than were up-regulated in indica YZX (Online Resource 8). To examine whether these genomic variations affected the expression of seed dormancyrelated genes, we examined five DEGs previously reported to be involved in seed germination Table 3. OsGA20ox1 and OsGA3ox2 are genes involved in GA biosynthesis (Kaneko et al. 2002;Oikawa et al. 2004), whose expression would affect overall levels of GA, influencing seed dormancy (Guo et al. 2013). OsGAP is a GTPase-activating protein that can integrate ABA and reactive oxygen species (ROS) signaling (Xu et al. 2019); its knockout leads to ABA and H 2 O 2 accumulation (Xu et al. 2019). OsPP2C51 is a member of the phosphatase 2C clade A protein family, which can interact with PYL family to transmit ABA signals to positively regulate seed dormancy (Bhatnagar et al. 2017). OsPP2CC51 is down-regulated in japonica 02,428, consistent with our results that japonica lines have a higher GI. Mutants of Os9BGlu33 exhibit a lower germination rate than wild type seeds, but which are less inhibited than wild type in the presence of ABA, a germination inhibitor (Ren et al. 2019).
Here, we have paired GWAS and transcriptomics analyses to identify eight QTLs associated with seed dormancy, and selected ten candidate genes within five of those QTLs that exhibited differential expression patterns during germination Table 2. Two of the candidates show expression differences in the population. Figure 3 Further, five genes related to hormone-responsive seed dormancy were found to be differentially expressed between japonica and indica varieties, and SNPs in the promoter of one of them, Os9BGlu33, were shown to cause changes in GI. Further experiments to validate these findings will lead to new haplotypes for inclusion into breeding programs to generate cultivars with stronger resistance to PHS.