GWAS identies a wheat orthologue of the rice D11 gene as an important contributor to grain size in an international collection of hexaploid wheat

Grain size is a key agronomic trait that contributes to grain yield in hexaploid wheat. Grain length and width were evaluated in an international collection of 159 wheat accessions. These accessions were genetically characterized using a genotyping-by-sequencing (GBS) protocol that produced 73,784 single nucleotide polymorphism (SNP) markers. GBS-derived genotype calls obtained on Chinese Spring proved extremely accurate when compared to the reference (> 99.9%) and showed > 95% agreement with calls made at SNP loci shared with the 90K SNP array on a subset of 71 Canadian wheat accessions for which both types of data were available. This indicates that GBS can yield a large amount of highly accurate SNP data in hexaploid wheat. The genetic diversity analysis performed using this set of SNP markers revealed the presence of six distinct groups within this collection. A GWAS was conducted to uncover genomic regions controlling variation for grain length and width. In total, seven SNPs were found to be associated with one or both traits, identifying three quantitative trait loci (QTLs) located on chromosomes 1D, 2D and 4A. In the vicinity of the peak SNP on chromosome 2D, we found a promising candidate gene (TraesCS2D01G331100), whose rice ortholog (D11) had previously been reported to be involved in the regulation of grain size. These markers will be useful in breeding for enhanced wheat productivity. the With the latter accessions, eld trials were conducted in two different trial sites in the bimodal humid forest zone of Cameroon, during the 2015–2016 wheat-growing seasons in Mbankolo (1057 m above sea level) and during 2016–2017 in Nkolbisson (650 m a. s. l.). In Mbankolo, the average temperature is 18–20°C, bimodal rainfall with an annual average of 1600 mm. In Nkolbisson, the annual average temperature is 23.5°C, the rainfall is bimodal with an annual average of 1560 mm. At each trial site, an incomplete alpha-lattice design with two replications was used. Each accession was planted in ve-row plots, in 3-m rows with 5 cm between plants and 25 cm between rows. Then, elds trials were managed in accordance with the technical recommendations and standard agricultural practices for wheat 28 . Grain length (Gle), grain width (Gwi), 1000-grain weight (Gwe) and grain yield (Gyi) were recorded for each accession. Gle and Gwi were measured by a digital Vernier caliper on 20 seeds per variety randomly picked from a pool of grains from each harvested area 18 . wheat accessions which genotypic for SNPs 90K SNP Innium iSelect array 35 the 135 SNPs called in using both were using in-house


Introduction
The grain size, which is associated with yield and milling quality, is one of the essential traits that have been subject to selection during domestication and breeding in hexaploid wheat 1 . During the domestication process from ancestral (Einkorn) to common wheat (Triticum aestivum L.) going through tetraploid species, wheat abruptly changed, from a grain with greater variability in size and shape to grain with higher width and lower length 2,3 . However, grain yield is determined by two components namely, the number of grains per square meter and grain weight. Following, grain weight is estimated by grain length, width, and area, which are components showing higher heritability than mainly yield in wheat 4 .
Larger grains may have a positive effect on seedling vigor and contribute to increased yield 5 . Geometric models have indicated that changes in grain size and shape could result in increases in our yield of up to 5% 6 . Consequently, quantitative trait loci (QTLs) or genes governing grain shape and size are of interest for domestication and breeding purposes 7,8 . Many genetic mapping studies have reported QTLs for grain size and shape in wheat cultivars 1,2,8−10 and some studies have revealed that the D genome of common wheat, derived from Aegilops tauschii, contains important traits of interest for wheat breeding 11,12 .
At the genomic level, Okamoto et al. 13 performed QTL analyses for grain size and shape-related traits using four synthetic wheat F2 populations to identify the genetic loci responsible for grain size and shape variation in hexaploid wheat and found QTLs for grain length and width on chromosomes 1D and 2D. This is particularly interesting as the tenacious glume gene Tg-D1 on chromosome 2D is a well-known locus that has been recruited for the domestication of wheat grain size and shape. During allohexaploid wheat speciation, a dramatic change in grain shape occurred due to a mutation in the Tg-D1 gene 14 .
Furthermore, Yan et al. 15 reported a genomic region associated with grain size on chromosome 2D.
New advances in genomics technologies has revolutionized research in plants by developing new high throughput genotyping methods to increase knowledge of the genetic basis of diversity in large core collection of genetic materials through genome-wide association studies (GWAS). Based on such highdensity SNP markers, GWAS can be used for the description and high-resolution mapping of genetic variance from collections of genetic ressources that have derived from several historical recombination cycles 16 . Furthermore, Genotyping-by-sequencing (GBS) is a Next-Generation Sequencing (NGS) technology for high-throughput and cost-effective genotyping, that provides a great potential for applying GWAS to reveal the genetic bases of agronomic traits in wheat 17 . Arora et al. 18 conducted GWAS in a collection of Ae. tauschii accessions for grain traits, using SNP markers based on GBS. They identi ed a total of 17 SNPs associated with granulometric characteristics distributed over all seven chromosomes, with chromosomes 2D, 5D, and 6D harboring the most important marker-trait associations. On the other hand, most studies on germplasm of hexaploid wheat have focused on understanding the genetic and morphological diversity of this species. No studies have used GWAS based on GBS for economically important and essential grain yield components traits such as grain length and width in an international collection of hexaploid wheat. The present investigation aimed to identify QTLs and candidate genes governing grain length and width in an international collection of hexaploid wheat using a GBS-GWAS approach.

Results
Phenotypic characterization of grain yield components. To explore components of grain yield in wheat, we measured four phenotypes: grain length (Gle), grain width (Gwi), 1000-grain weight (Gwe) and grain yield (Gyi) over two years at two sites. As shown in Table 1, means (± standard deviation) observed for these traits corresponded to: 3.2 mm (± 0.08) for grain length, 1.6 mm (± 0.04) for grain width, 25.7 g (± 0.80) for 1000-grain weight and 2.6 t/ha (± 0.11) for grain yield. The broad-sense heritability estimates were 90.6% for grain length, 97.9% for grain width, 61.6% for 1000-grain weight and 56.0% for grain yield.
An analysis of variance revealed signi cant differences due to genotypes (G) for all traits and, for two traits (Gwe and Gyi), the interaction between genotype and environment (GxE) proved signi cant. A correlation analysis showed a high signi cant positive correlation between grain yield and grain weight (r = 0.96; p < 0.01) and also between grain length and grain width (r = 0.88; p < 0.01). Also, signi cant positive correlations were identi ed between grain yield and grain length (r = 0.51; p < 0.01) and between grain yield and grain width (r = 0.54; p < 0.01). Interestingly, a bimodal distribution was observed for grain length and width (Fig. 1). Together, these results suggest that a major gene controls two important characters related to grain size with a high heritability within this collection. After imputation of the missing genotype calls, we observed a mean concordance of 93.8% between the CS individuals and the CS reference genome. Furthermore, 76.7% of genotypes were called initially and 23.3% of genotypes were imputed. It should be noted that the accuracy rate for imputing missing data is 73.4%. More details of SNP data set are provided in supplementary Table S1.
As a further examination of data quality, we compared the genotypes called using both GBS and a SNP array on a subset of 71 Canadian wheat accessions that had been previously genotyped using the 90K SNP array. A total of 77,124 GBS-derived and 51,649 array-derived SNPs were discovered in these 71 accessions (Supplementary Table S2). Of these, only 135 SNP loci were common to both platforms and among these potential 9,585 datapoints (135 loci x 77 lines), only 8,647 genotypes could be compared because the remaining 938 genotypes were missing in the array-derived data. As shown in Fig. 2, a high level of concordance (95.1%) was seen between genotypes called by both genotyping approaches. To better understand the origin of discordant genotypes (4.9%), we inspected the set of 429 discordant SNP calls and observed that: 1) 3.5 % of discordant calls corresponded to homozygous calls of the opposite allele by the two technologies; and 2) 1.4 % of discordant calls were genotyped as heterozygous by GBS while they were scored as homozygous using the 90K SNP array. More details are provided in Supplementary Table S3. From these comparisons, we conclude that GBS is a highly reproducible and accurate approach for genotyping in wheat and can yield a greater number of informative markers than the 90K array.
Genome coverage and population structure. For the full set of accessions, a total of 129,940 SNPs was distributed over the entire hexaploid wheat genome. The majority of SNPs were located in the B (61,844) and A (50,106) sub-genomes compared to the D (only 17,990 SNPs) sub-genome (Table 2). Although the number of SNPs varied 2-to 3-fold from one chromosome to another within a sub-genome, a similar proportion of SNPs was observed for the same chromosome across sub-genomes. Typically, around half of the markers were contributed by the B sub-genome (47.59 %), 38.56 % by the A sub-genome and only 13.84 % by the D sub-genome. The analysis of population structure for the 159 accessions of the association panel showed that K = 6 best captured population structure within this set of accessions and these clusters largely re ected the country of origin (Fig. 3). The number of wheat accessions in each of the six subpopulations ranged from 6 to 44. The largest number of accessions was found in northwestern Baja California (Mexico) represented here by Mexico 1 (44) and the smallest was observed in East and Central Africa (6).
GWAS analysis for marker-trait associations for grain size. To identify genomic loci contributing to grain size in wheat, we performed a GWAS analysis (159 accessions, 73,784 SNPs, grain length and width) using an CMLM approach.
As seen in Fig. 3, both Q-Q plots suggest that the confounding effects of population structure and relatedness were well controlled. For both traits, the greatest marker-trait associations were detected at the end of chromosome 2D, while another weaker association was shared at the beginning of chromosome 1D. For grain width only, a marker-trait association was detected on chromosome 4A. In total, seven SNPs were found to be associated with one or both traits, with respectively one, ve and one signi cant SNPs being located on chromosomes 1D, 2D and 4A. Except for two SNPs (chr2D:442798939 and chr4A:713365388), all other SNPs were signi cant for both grain length and grain width. The SNP at 4A:713365388 was signi cant only for grain width while the SNP at 2D:442798939 was signi cant only for grain length.
The most signi cant association was observed on chromosome 2D and contributed to both grain length and grain width (Table 3; Fig. 3). For this QTL, a total of four SNPs was observed and the SNP most signi cantly associated to both traits was located at position 2D:452812899. A fth SNP located at 2D:442798939 was signi cantly associated to grain length only, but was just below the signi cance threshold (p-value = 5.09E-05) for grain width. A high degree of LD was detected among some of the seven SNPs from chromosome 2D displaying association with grain traits. These formed one discontinuous linkage block as the LD between markers belonging to this block was higher (mean of r 2 = 0.90). For this reason, we considered these to de ne one quantitative trait locus (QTL) on chromosome 2D ( Supplementary Fig. S2). This QTL included 5 SNP markers (chr2D:403935865, chr2D:442798939, chr2D:444560418, chr2D:452644656 and chr2D:452812899) and the peak SNP (chr2D:452812899) explained between 7% and 13% of the phenotypic variation for grain length and width. The minor allele frequencies (MAF) at this locus was 0.30 and exerted an allelic effect from − 0.27 to -0.12 mm (Table 3).
On chromosome 1D, the SNP marker chr1D:166874041 de ned a QTL for both grain length and width. The percentage of phenotypic variation explained by this marker for grain length and width was 6% each, with a MAF of 0.29 and allelic effects of 0.25 and 0.11 mm for grain length and width, respectively. Furthermore, a high degree of interchromosomal LD was observed among the peaks SNPs between chromosomes 1D and 2D (r 2 = 0.94) displaying association with grain traits. In addition, almost all accessions which have the major allele on chromosome 1D are the same which have the major allele on chromosome 2D. Thus, the combined impact of these two loci could explain the observed bimodal distribution.
On chromosome 4A, the SNP marker chr4A:713365388 de ned a QTL for grain width only and it explained 7 % of the variation, had a MAF of 0.14 and exerted an allelic effect of 0.13 mm. However, we reported a very weak LD between this peak SNP marker and the two others on chromosomes 1D and 2D.
In summary, a total of three QTLs signi cantly associated with grain length and/or width were identi ed on chromosomes 1D, 2D and 4A.
Candidate gene detection for grain size. To identify candidate genes contributing to grain size within the studied wheat collection, we investigated the genes residing in the same linkage block as the peak SNP for each QTL. On chromosome 2D, the QTL with the largest number of associated SNPs (chr2D:403935865 to chr2D:452811303) included a total of 315 high-con dence genes. On chromosomes 1D and 4A, the SNP markers chr1D:166874041 and chr4A:713365388, de ning each a QTL, respectively, doesn't included high-con dence genes. Upon examination of the annotations for these genes, the most promising candidate appears to be the TraesCS2D01G331100 gene in the QTL on chromosome 2D, an ortholog of the rice CYP724B1 gene, commonly known as the D11 gene. The D11 gene was previously reported as being involved in the regulation of internode elongation and seed development due to its role in the synthesis of brassinosteroids, key regulators of plant growth promoting the expansion and elongation of cells. More details are provided in Supplementary Table S4.
Haplotypes at the wheat orthologue of the rice D11 gene and their phenotypic effects. To provide a useful breeding tool for the main QTL identi ed in this research, we de ned SNP haplotypes around our candidate gene. Using HaplotypeMiner, we identi ed two SNPs (chr2D:423365752 and chr2D:425474599, Supplementary Fig. S3) that best captured the SNP landscape in the vicinity of the candidate gene. These markers reside in the same haplotype block as the SNP markers, but were not individually found to be signi cantly associated with grain width and length. These SNP markers de ne three haplotypes (AT, CT or CC) around the candidate gene, with 100, 18 and 41 individuals carrying these haplotypes, respectively.
To investigate the phenotypes associated with these haplotypes, we analyzed the trait value for each haplotype. Interestingly, we observed that for all traits, the mean values of accessions with haplotype AT were signi cantly larger (p < 0.001) than those obtained for the other haplotypes. As shown in Fig. 4 Fig. S4, Supplementary Table S5). To conclude, we suggest that SNP markers corresponding to haplotype AT will provide a useful tool in marker-assisted breeding programs to improve wheat productivity. Therefore, we point out that the relationship between yield and haplotypes around the D11 gene would allow the selection of high-yielding wheat lines in a breeding program.

Discussion
The goal of our study was to identify genomic regions controlling variation for grain size in an international collection of 159 hexaploid wheat accessions through a GWAS approach. Thus, we collected the phenotypes for three grain traits (length, width, weight) in addition to grain yield. A statistical analysis revealed that the genotype was a major source of variance for all traits and that these exhibited a high heritability. In agreement with Arora et al. 18 in Ae. tauschii and Rasheed et al. 19 in wheat, we observed that grain length, grain width and grain weight were positively correlated to grain yield.
Interestingly, a bimodal distribution was observed for both the grain length and width phenotypes, suggesting that one to a few major genes control these traits in our collection.
To assess the reproducibility and accuracy of genotypes called via the GBS approach, we genotyped 12 different plants of Chinese Spring (i.e. biological replicates), which were added to the set of 288 wheat samples for SNP calling and bioinformatics analysis, which yielded a total of 129,940 loci. Among the 12 biological replicates of CS, we found a very high reproducibility (~ 100%) in our genotype calls. Firstly, we veri ed the quality of our SNP data by investigating the reproducibility and accuracy of GBS-derived SNPs calls, and found that GBS-derived genotypes were in agreement with the reference genome in 99.9% of cases in over 1M comparisons for non-imputed data and 93.8% after imputation of the missing genotype calls. Recently, Abed et Belzile 20 reported that the accuracy of SNP calls was 99% for nonimputed and 89% for imputed SNPs dataset in Barley. In our study, 76.7% of genotypes were called initially, and only 23.3% were imputed. Thus, we conclude that the imputed data are of lower reliability.
As a further examination of data quality, we compared the genotypes called by GBS and a 90K SNP array on a subset of 71 Canadian wheat accessions. Among the 9,585 calls available for comparison, 95.1% of calls were in agreement. It is likely that both genotyping methods contributed to cases of discordance. It is known, however, that the calling of SNPs using the 90K array is challenging because of the presence of three genomes in wheat and the fact that most SNPs on this array are located in genic regions that tend to be typically more highly conserved, thus allowing for hybridization of homoeologous sequences to the same element on the array 21,22 . The fact that the vast majority of GBS-derived SNPs are located in noncoding regions makes it easier to distinguish between homoeologues 21 . This likely contributed to the very high accuracy of GBS-derived calls described above. We conclude that GBS can yield genotypic data that are at least as good as those derived from the 90K SNP array. This is consistent with the ndings of Elbasyoni et al. 23 as these authors concluded that "GBS-scored SNPs are comparable to or better than array-scored SNPs" in wheat genotyping. Likewise, Chu et al. 24 observed an ascertainment bias for wheat caused by array-based SNP markers, which was not the case with GBS.
Con dent that the GBS-derived SNPs provided high-quality genotypic information, we performed a GWAS to identify which genomic regions control grain size traits. A total of three QTLs located on chromosomes 1D, 2D and 4A were discovered. Under these QTLs, seven SNPs were found to be signi cantly associated with grain length and/or grain width. Five SNPs were associated to both traits and two SNPs were associated to one of these traits. The QTL located on chromosome 2D shows a maximum association with both traits. Interestingly, previous studies have reported that the sub-genome D, originating from Ae.
tauschii, was the main source of genetic variability for grain size traits in hexaploid wheat 11,12 . This is also consistent with the ndings of Yan et al. 15 who performed QTL mapping in a biparental population and identi ed a major QTL for grain length that overlaps with the one reported here. In a recent GWAS on a collection of Ae. tauschii accessions, Arora et al. 18 reported a QTL on chromosome 2DS for grain length and width, but it was located in a different chromosomal region than the one we report here. With a view to develop useful breeding markers to improve grain yield in wheat, SNP markers associated to QTL located on chromosome 2D appear as the most promising.
To identify a candidate gene contributing to grain length and width, we examined the genes residing in the same linkage block as the peak SNP for each QTL. In the genomic interval spanned by the QTL contributing the most to the phenotypic variation for grain size (2D_40.4-45.1 Mb), a total of 315 highcon dence genes were observed. The TraesCS2D01G331100 gene seems like a highly promising candidate as it is orthologous to the rice CYP724B1 gene, commonly known as the D11 gene. The latter has been reported as involved in the regulation of internode elongation and seed development due to his role in brassinosteroid synthesis 25 . Brassinosteroids are a group of plant hormones and are key regulators of plant growth and development (including seeds) that promote cell expansion and elongation 26 .
To further re ne the association between the TraesCS2D01G331100 gene and grain width and length, we de ned SNP haplotypes. An analysis of haplotypes surrounding this gene identi ed three distinct haplotypes, and we observed that, for all grain size traits, the phenotypes corresponding to haplotype AT displayed signi cantly higher values than those of other haplotypes. We therefore suggest that SNP markers anking TraesCS2D01G331100 could provide a useful tool in marker-assisted breeding programs to improve wheat productivity by selecting alleles leading to larger grain size and higher yield. In the longer term, it would be interesting to de ne more precisely the exact nature of the alleles at this gene through targeted re-sequencing of this gene in a broader collection of accessions. Single nucleotide polymorphism calling and bioinformatics analysis. DNA sequences of the full set (n = 300) of wheat samples obtained from GBS were analyzed using the Fast-GBS pipeline 32 to align reads on the wheat reference genome (Chinese Spring v1.0) and to call SNPs. Fast-GBS results were rst ltered to (i) keep only SNPs having the label "PASS" and SNPs positioned on chromosomes (i.e. not on scaffolds), (ii) remove indels and multiallelic SNPs, (iii) convert all heterozygous calls with genotype quality (GQ) < 30 to missing data, (iv) keep only SNPs with a minor allele count (MAC) ≥ 4, (v) remove accessions with more than 80% of missing data, (vi) exclude SNPs with more than 10% heterozygotes and (vii) exclude SNPs with missing data (N) > 80%. Finally, missing data were imputed using BEAGLE v5 33  Validation of SNP call accuracy. The SNP genotypes for 12 different cv. Chinese Spring plants were used to assess the accuracy and reproducibility of GBS-derived SNP calls. Before and after imputation of missing data, we measured both the degree of agreement in SNP calls between replicates and the agreement between the GBS-derived SNP calls and the Chinese Spring reference genome V1.0 using an in-house script. To compare the accuracy of GBS-based and array-based genotype calls, we used a set of 71 Canadian wheat accessions for which genotypic data for 51,649 SNPs had been obtained previously using the 90K SNP In nium iSelect array 35 . For the 135 SNPs called in common using both methods, genotype calls were compared using an in-house script.

Materials And Methods
Population structure and linkage disequilibrium analyses. An analysis of population structure was performed on the collection of 159 wheat accessions using fastSTRUCTURE version 1.0 36 on SNP markers ltered at MAF ≥ 0.05 as recommended by Sobota et al. 37 . Population structure was evaluated using the ltered set of SNP markers using a simple prior and 1,000 iterations for K ranging from 1 to 12.
The optimal range of K was determined based on model complexity using the marginal likelihood method using the fastSTRUCTURE script chooseK.py, as well as on visualization of the log marginal likelihood, and population visualization using Distruct version 1.1 38 .
Genome-wide linkage disequilibrium (LD) analysis was performed using PLINK version 1.9 39 , via the Gabriel method 40 . This method is based on a con dence interval and a normalized measure of D′. The pattern and distribution of intrachromosomal LD were visualized with LD plots generated using Haploview version 4.2 41 to investigate the average LD decay along chromosomes. The smoothed seconddegree LOESS curve was tted to determine the critical D' and r 2 between loci.
Genome-wide association study for grain traits. GWAS for grain traits was performed on the subset of 159 wheat accessions via the Genomic Association and Prediction Integrated Tool (GAPIT) version 2 42 . This approach, based on associations between the estimated genotypic values (BLUEs) for each trait and individual SNP markers 43 was conducted with a compressed mixed linear model 44 . A matrix of genomic relationships among individuals (Supplementary Fig. S5) was calculated using the Van Raden method 42 .
The statistical model used was: Y = Xβ + Zu + , where is the vector of phenotypes; is a vector of xed effects, including single SNPs, population structure (Q), and the intercept; is a vector of random effects including additive genetic effects as matrix of relatedness between individuals (the kinship matrix), ~N(0, 2 ), where 2 is the unknown additive genetic variance and is the kinship matrix; and are the design matrices of β and , respectively; and is the vector of residuals, ~N(0, 2 ), where 2 is the unknown residual variance and is the identity matrix. Association analysis was performed while correcting for both population structure and relationships among individuals with a combination of either the Q + K matrices; K matrix was computed using the Van Raden method 42 . The Pvalue threshold of signi cance of the genome-wide association was based on false discovery rate (FDRadjusted p < 0.05).
Identi cation of candidate genes for grain size. To identify candidate genes affecting grain size in wheat, we de ned haplotype blocks containing the peak SNP. Each region was visually explored for its LD structure and for genes known to reside in such regions. Then, we cross-referenced them against genes reported as controlling grain size in rice and other cereals in previous association and linkage mapping studies 15,18,45 . By using the genome browser available for the wheat reference genome v1.0 on the International Wheat Genome Sequencing Consortium (IWGSC) website (https://urgi.versailles.inra.fr/jbrowseiwgsc/gmod_jbrowse), we extracted FASTA sequences representing the putative genes. Furthermore, the selected genes were further evaluated for their likely function based on publicly available genomic annotation. The function of these genes was also inferred by a BLAST of their sequences to the UniProt reference protein database (http://www.uniprot.org/blast/).
Identi cation of haplotypes around a candidate gene. To better de ne the possible alleles in a strong candidate gene, we used HaplotypeMiner 46 to identify SNPs anking the TraesCS2D01G331100 gene. For each haplotype, we calculated the trait mean (grain length, width, weight and yield) for lines sharing the same haplotype using the R ggpubr program 47 .

Declarations
Ethics declarations