Genome wide association mapping of yield and various desirable agronomic traits in Rice

Rice (Oryza sativa L.) is one of the staple foods worldwide. To feed the growing population, the improvement of rice cultivars is important. To make the improvement in the rice breeding program, it is imperative to understand the similarities and differences of the existing rice accessions to find out the genetic diversity. Previous studies demonstrated the existence of abundant elite genes in rice landraces. A genome-wide association study (GWAS) was performed for yield and yield related traits to find the genetic diversity. Experimental study. A total of 204 SSRs markers were used among 17 SSRs found to be located on each chromosome in the rice genome. The diversity was analyzed using different genetic characters i.e., the total number of alleles (TNA), polymorphic information content (PIC), and gene diversity by Power markers, and the values for each genetic character per marker ranged from 2 to 9, 0.332 to 0.887 and 0.423 to 0.900 respectively across the whole genome. The results of population structure identified four main groups. MTA identified several markers associated with many agronomically important traits. These results will be very useful for the selection of potential parents, recombinants, and MTAs that govern the improvements and developments of new high yielding rice varieties. Analysis of diversity in germplasm is important for the improvement of cultivars in the breeding program. In the present study, the diversity was analyzed with different methods and found that enormous diversity was present in the studied rice germplasm. The structure analysis found the presence of 4 genetic groups in the existing germplasm. A total of 129 marker-trait associations (MTAs) have been found in this study.

the world's food security [1]. In a crop population, natural diversity or genomic variation is the outcome of mutation, outcrossing, crossing over and unsystematic genetic drift [2,3]. Conservation and characterization of germplasm are essential to screen the desired genetic material for crop improvement programs [4]. Moreover, crop improvement is highly dependent on the availability of genetic variation in existing germplasm and the identification of genetic variation for target traits [5]. Therefore, exploiting the genetic diversity in rice is of utmost importance.
Rice breeders have used various kinds of diversity panels like old cultivars, wild species, modern cultivars, advanced lines, and commercially released varieties for the improvement of cultivated rice varieties. As a result of both natural and artificial selection under diverse habitats, the cultivated rice varieties possess high levels of morphological, physiological, and genetic diversity resulting in > 120,000 distinct rice varieties been developed to date [6,7]. Therefore, it is necessary to assess this genetic diversity to make a decision about its conservation and to select the genotypes which could be utilized as potential parents in an upcoming breeding program. This kind of information will help to select the genetically diverse parents that can be used for the development of new hybrids of rice varieties.
The microsatellites markers also known as simplesequence repeats (SSRs) hase now been gaining importance for the assessment of genetic diversity. The main key features of SSRs over other molecular markers are genetic co-dominance, high level of polymorphism, multi-allelic, relatively abundant, easily, and automatically scored, and more interesting widespread across the genome. Over the past decades, these polymorphic SSR markers have been extensively used in assessing genetic diversity analysis [8,9]. Along with that SSR has also been utilized for species identification [10], and mapping genetic linkage and association mapping [11]. The present study will also focus on the assessment of genetic diversity and association mapping using polymorphic SSR markers.
The conventional method of characterization of genetic diversity is the use of bi-allelic mapping populations or QTL mapping [12][13][14]. However, with the advancements in whole genome sequencing technology, genome wide association study (GWAS) has now become extremely popular to map natural genetic variation in diverse populations, not only in rice but in other crop species as well [15,16]. In the recent years the GWAS has been successfully applied to the genetic dissection of complex traits in plants, e.g., Arabidopsis thaliana, zea mays, wheat and cotton [17][18][19][20]. GWAS has an advantage over linkage mapping as it enables the plant breeder to exploit the genetic diversity by using modern genetic technologies [21]. Since association mapping utilizes diverse germplasm, QTL for many traits can be detected with high resolution in a single study, making the method more efficient and less expensive than bi-parental QTL mapping. GWAS also maximizes the diversity of the alleles and identifies the elite genes in the genome [22]. Furthermore, many candidate genes controlling agronomically important traits have been identified in rice through GWAS [23,24]. Apart of this, many salinity tolerant and drought tolerant regions have have been identified using the GWAS approach [25][26][27]. In the present study GWAS approach was conducted to identify the agronomically important gene/ QTLs associated with rice yield and yieldrelated agronomic traits using a diverse collection of genotypes. Moreover, this study will also identify the traits and specific genotypes that would be utilized in the rice breeding program for the development of superior rice. The main purpose of the study id to assess the genetic diversity of rice germplasm and to identify the associated markers related to yield and its parameters.

Plant material
A panel of 100 rice genotypes representing different ecotypes and geographical origins was used (Supplementary Table 1). Seeds of the rice were collected from rice research institute, kala shah kaku, united states department of agriculture (USDA) the USA.

Field evaluation
The chosen rice accessions were sown for the screening of rice lines based on various morphological characters and genetic diversity studies. Seeds of the rice lines were soaked and sown in puddled field conditions through the broadcasting method. After 25 days the nursery of each rice line was prepared for transplantation into the next field with three replications under a randomized complete block design (RCBD). Twenty plants of each accession were planted in each row and row distance was 30 cm and plant to plant distance was 15 cm. Fertilizer applications were done 25-40 days after the transplant of seedlings. NPK, the primary crop stimulants were applied with a ratio of 120:60:60 kg/ ha. However, P and K were smeared basally after the transplant; N was smeared in three splits. Crop protection measures were practiced preventing pests and unwanted wild plants occasionally when necessary. The data was collected for two consecutive years from 2016 to 17 and 2017-18.

Phenotypic data collection
The traits measured in the current study include Days to 50% flowering (DF), Days to maturity (DM), Plant Height (PH), Tillers/Plant (TP), Panicle length (PL), Number of grains/panicle (G/P), Unfilled grain/panicle (UG/P), Seed setting %age (SS), Thousand grain weight (TGW), Yield kg/plot (Y/P), Yield kg/ha (Y/H). On average, the data from 5 plants per row were collected, and the mean value is taken for final measurement and data analysis. Details of measurement procedures of each trait are given in the supplementary file (Supplementary file 1).

Statistical analysis of the phenotypic data
Mean, range, and standard deviation were determined using statistical tools offered by Microsoft Excel. Analysis of variance was completed using the SPSS statistical package to check the genetic variance amongst the rice lines for all their characters studied.

Isolation of genomic DNA
Genomic DNA of each rice genotype was extracted from 2 g of freshly taken leaf tissues using the modified Cetyl-Trimethyl Ammonium Bromide (CTAB) method [28]. The extracted genomic DNA samples were dissolved and stored in TE buffer (10 mMTris-base, 1 mM EDTA, pH 8.0). The quality of DNA was checked by running on agarose gel electrophoresis and further the concentration and quality were measured by nanodrop (ND1000, Thermo Scientific, USA).

Genotyping by SSR
A collection of 204 SSR markers was screened to check polymorphism between the 100 rice accessions. The SSR markers were chosen from previously available rice genetic linkage and genome sequence (physical) maps [29,30] (http://www.gramene.org). The PCR was carried out in a PCR tube with a 20 µL reaction volume. The PCR components included 2 µL genomic DNA (template), 2 µL 10 X buffer containing 1.5 mg MgCl 2 , 0.1 µL Taq polymerase, 0.4 µL 10 mMdNTPs, and 2 µL (forward and reveres) primer. The PCR reactions were carried out on ABI thermal cycler (ProFlex PCR) using the following cycler conditions: the initial denaturing temperature at 94ºC for 4 min, followed by 35 cycles of 94 °C for 45 s, specific temperature for each primer for 45 s, 72 °C for 60 s with a final extension at 72 °C for 10 min. Amplified PCR products were examined by using 2-3% metaphor gel electrophoresis ) (Bio whittaker molecular applications, vallensbaek strand, denmark) and visualized under UV light.

Molecular data analysis
Polymorphic bands amplified from SSR markers were recorded in the numerical format 1 for presence and 0 for absence. Each amplified band from the respective marker is designated as an allele. This data was further utilized for assessing the genetic diversity in the form of allelic variation, the total number of total alleles (TNA), unique alleles (UA), private alleles (PA), expected alleles (Ne), expected heterozygosity (He) and these are measured by Power-Marker Ver 3.25 [31]. The polymorphism information content (PIC) was calculated using the formula, "PICi = 1−∑ni = 0P2ij" where Pij is the frequency of the jth allele for the ith marker, and summation extends over k alleles.

Population structure analysis
The genetic structure (Q) prediction was analyzed with the program STRUCTURE version 2.2 (Pritchard et al., 2000) to identify clusters of genetically similar lines. The process of STRUCTURE organizes the diversity panel into n clusters to form a structure matrix (Q-matrix). The putative number of K (groups) was estimated from K = 1 to 10 using 100,000 burn-in iterations and MCMC (Markov chain Monte Carlo). Web-based analysis "Structure Harvester v0.6.93" was applied to obtain the maximum value or peak of ''K.'' for validation to understand the STRUCTURE results which were based on ad-hoc techniques [32]. The K values ranged from 1 to 10 and 6 independent runs to attain reliable effects. The dissimilarity coefficient was analyzed by the un-weighted pair group method with arithmetic mean (UPGAMA) through DARWIN (dissimilarity analysis and representation for windows) Version 5.0.148 software (http://darwin.cirad.fr/darwin). The linkage disequilibrium (LD) in the studied diversity panel was evaluated using squared Pearson's correlation coefficients (r 2 ) using the r 2 command in the software PLINK [33].

Association mapping
The association mapping method was used to map QTLs for the yield and other agronomic traits examined in the two growing seasons for a total of 204 polymorphic SSR markers. The average phenotypic data of studied traits during both years were used in genome-wide association mapping. The mix linear model (MLM) approach was used for linear unbiased estimates using the equation described previously [34,35]. The marker tarit association was measured by incorporating Q + K matrices using TASSEL version v2.0.1 [36]. The association analysis was performed at 50,000 times per mutation for the correction of multiple testing [37]. Markers with the P-value ≥ 0.05 were regarded as significant. The default run parameters of the convergence criterion set at 1.09 10 − 4 and the maximum number of iterations set at 200 were used.

Phenotypic variation
The grain yield is one of the most important traits in rice crop improvement which is interlinked with other agronomic traits. Significant differences between the genotypes for yield and yield-related traits were observed in the analysis of variance (ANOVA). Significant differences were observed in DF in season-II and DM in both seasons, and the remaining traits were highly significant differences in both seasons among studied rice germplasm (Table 1). This showed that for the studied attributes, a high level of phenotypic variability was observed among all accessions in the field condition. Phenotypic data showed a large variance in the studied rice diversity panel, making it useful for recording genotypic variability in the population.

Genome wide allelic pattern
The genome wide allelic pattern showed enormous diversity like the number of average alleles per locus (Na) is 6.015, the number of different alleles with a frequency ≥ 5% is 4.735, the number of effective alleles (Ne) is 4.162, Shannon's information index (I) is 1.491, expected heterozygosity (He) is 0.723 and unbiased expected heterozygosity (uHe) is 0.727. Private allele means that are particularly unique to a single population, and in the present study there were no private alleles found ( Table 2).

Genetic diversity based on PIC, TNA, and SRBP
Out of 204 polymorphic SSRs microsatellite markers, 17 SSRs were found to be located on each chromosome which showed that SSRs are covering the whole genome of rice to check the diversity in the rice genome. The TNA values per

Population structure analysis
The web-based analysis using 'Structure Harvester v0.6.93' showed the peak value of 4 (K = 4) indicating the presence of four subgroups among the studied rice accessions (Fig. 1a). The population structure predicted by 'STRUCTURE' Software also divided the germplasm into four groups which are depicted in four different colors (Fig. 1b). Population structure is a consequence of departures from random mating in the sampling population that result in some individuals being more closely related than others. Based on the different structure analysis, the current diversity panel of 100 rice genotypes was clustered into 4 groups (Fig. 1). Further assessment of each cluster or group revealed that genotypes G1-G10 and G27, G28 were found to fall into the first group; whereas the G11-G26 and G29, G33 genotypes completely appeared in the second group the third group contained G34-G72 genotypes. The fourth group having that is G73-G100 genotypes. Distance between each cluster specifies the differences among 100 rice genotypes and all clusters are genetically different from each other. The distribution of rice germplasm in the four groups was illustrated by the UPGAMA DARWIN tree (Fig. 2).

Linkage disequilibrium
The linkage disequilibrium (LD) pattern for 100 rice accessions was generated using genotypic data (Fig. 3). Each cell in the figure represents a relationship between two markers with the color codes for the presence of significant linkage disequilibrium (LD). The LD values were plotted, above the diagonal displays r 2 values and below the diagonal displays P values with color codes such as > 0.01 represented by white color, < 0.01 shown in blue color, < 0.001 represented by green color, and values < 0.0001 are in red. The frequency of linkage disequilibrium is associated with a value at P < 0.0001, there is strong LD between two markers, and is represented with red color. The green color shows the LD between the two markers at P < 0.001. Similarly, the blue color is associated with the lowest LD between two markers with P < 0.01. At P > 0.01, the white color represents no linkage between any two markers. The r 2 indicates the percentage of total variation explained. The values of r 2 range from 0.00-to 1.00 represented by different colors. Similar findings were obtained from the earlier reported results in diverse rice genotypes [38].
marker ranged from 2 to 9, PIC values ranged from 0.332 to 0.887, and gene diversity values from 0.423 to 0.900 across the whole genome (Supplementary Table 2 Table 2).

Association mapping
Among the 100 rice genotypes, significant phenotypic and genotypic differences were observed for all characters in the two growing seasons. A total of 129 marker-trait associations (MTAs) have been found in this study at P < 0.05, P < 0.01, P < 0.001, and P < 0.0001 significant levels. Among them, 41 MTAs were observed at P < 0.05 significant levels, while 57 MTAs at P < 0.01 followed by 21 MTAs at P < 0.001 and 10 MTAs at P < 0.0001 significant level (Table 3). In this study, the total phenotypic variance represented by r 2 ranged from 6.08 to 27.15% among 11 studied traits using 204 polymorphic SSR markers (Supplementary Table 3). Trait-wise highest numbers of markers-traits associations (MTAs) were identified in TGW (15) and Y/H (15) followed by TP (14), PH (14), UG/P (13), SS (13), G/P (11) PL (11), DM (10), Y/P (8), and DF (5). Chromosome-wise  Table 3). For this trait, the total PV of these markers ranged from 6.53 to 23.23%. On chromosome 12, the marker (RM6973) had the highest PVE (23.23%) while the marker (RM3455) on the same chromosome explained the least proportion of 6.53% of the variability of the trait at the position.

Days to 50% flowering (DF)
In total, five SSR markers were found to be significantly associated with DF which are located on chromosomes 1, 2, 4, 10, and 12 (Supplementary Table 3). The markers namely RM6973, RM6745, RM207, RM5853, and RM5886 were significantly associated with this trait. The phenotypic variation explained (PV) by the DF associated markers was 27.15-9.63% of the total phenotypic variation (PV) of this trait. The marker (RM6973) explained the maximum value of trait variability (27.15%) on chromosome 12 while the marker (RM5886) on chromosome 4 explained the minimum value (9.63%) of trait variability.

Days to maturity (DM)
In this analysis, a total of 10 SSR markers were strongly correlated with days to maturity (DM). Out of them, three markers from chromosomes 3 (RM7389, RM1004, and

Number of tillers per plant (TP)
A total of 14 significant SSR markers was strongly linked with the number of tillers per plant including, 3 markers located on chromosomes 10 and 2 while, 2 SSR on chromosome 4 and 9 and the other on 1, 3, 5, and 7chromosomes (Supplementary Table 3). These markers explained  Table 3).

Number of unfilled grain per panicle (UG/P)
In association analysis, 13 SSR markers were found to be significantly associated with the UG/P. Overall, 13 significant SSRs distributed across 8 chromosomes, Out of them, 2 SSRs on each chromosome were 2 4, 5, 9, and 11 while other SSRs on 1, 7, and 8. The markers namely RM7033, RM7390, RM7396, RM264, RM116, RM4552, RM317, RM3412, RM3838, RM421, RM300, RM125, and RM279 were significantly associated with the number of unfilled grains per panicles as mentioned in Table 4.9. The phenotypic variation explained (PV) by the UG/P associated markers was 8.01-21.02% of the total phenotypic variation represents a simple and effective process for assessing the genetic variability of germplasm [50]. The analysis of population structure based on the Bayesian clustering model is one of the most frequent methods of population clustering [45,51,52]. In the present study, the selected genotypes were grouped into 4 sub-populations, as viewed in STRUC-TURE analysis (Fig. 1b). But according to the given origin and pedigree record, there were three types of the population but on a genetic basis, the population was divided into four clusters, so the results are unexpected. Consistent with the STRUCTURE analysis results, the DARWIN tree (UPGMA cluster) analysis also distributes the diversity panel into 4 clusters (Fig. 2). Distance between each cluster specifies the differences among 100 rice genotypes and all clusters are genetically different from each other. The more genetic distance among clusters showed the genetic diversity of the genotypes. The presence of 4 groups in the germplasm under study showed that there might be some genetic diversity present which needs to be known and yet investigated. Genome wide association study (GWAS) also known as association analysis, is a method that relies on non-random combination of alleles at two genetic loci to study the natural variation present among the population. Recently, it has been widely used for the identification of elite gene / QTLs related to various traits such as yield, quality, biotic, and abiotic stress tolerance in cultivated crops [53,54]. Therefore, the present research is analogous to previous studies which focused on the small population with limited markers [49,55]. There are two models used in an association study to find the marker-trait associations (MTAs), one is GLM and the other one is MLM. The GLM identified the more false positive association while MLM is more accurate to find the MTAs. So, in the present study MLM approach was utilized to find MTA. In total, 129 MTAs have been found in this study at different (P < 0.05, P < 0.01, P < 0.001, and P < 0.0001) significant levels (Table 3). Among them 15 MTA were found for TGW, 15 for Y/H, 14 for each PH and TP, 13 for each UG/P and SS, 11 for each G/P and PL, 10 for DM, 8 for Y/P, and 5 for DF. Similar studies have been conducted previously and several MTAs have been found associated with yield and yield related traits like a marker, RM5749 present on chromosome 4 was associated with numbers of the fertile tiller and TGW [56]. The RM8257, RM6745, and RM569 were found to be linked to yield and yield-related traits [57,58]. Similarly, RM207 and RM540 were found to be associated with G/P in rice [59]. Moreover, some pleiotropic locus has also been identified such as pleiotropic locus RM5951 on chromosome 10 was found to be associated with PH, TGW, TP, Y/H, and SS. Similarly, the pleiotropic locus (RM233B) on chromosome 2 was found to be associated with PL, TGW, TP, and Y/H. these MTAs will be helpful in the marker assisted selection. RM5603, RM569, RM3625, RM5348, RM5853, RM113, RM237, RM207, RM5886, RM5951, RM3153, and RM233B. The marker was distributed across 8 chromosomes, including 3 SSRs on chromosome 1 and 5, while 2 SSRs on chromosomes 2, 4, and 10, and the other on the chromosomes 3, 8 and 11 (Supplementary Table 3). All these SSRs explained 6.28-21.04% of the phenotypic variation in this trait. The marker (RM6987) on chromosome 10 explained maximum phenotypic variation (21.04%) while the marker (RM233B) on 2 explained minimum variation (6.28%) for yield per hectare in 100 rice genotypes.

Discussion
Assessments of genetic diversity have been carried out in recent decades using diverse approaches [39][40][41]. The SSR markers are the type of co-dominance markers that are commonly being used for the assessment of genetic diversity and association analysis in many crop species including rice [42][43][44][45]. The markers with higher PIC values have a greater ability to identify the cultivars having more genetic diversity. A locus with a PIC > 0.5 is considered highly diverse [46]. In the current study, the lowest PIC values were observed against RM317 marker and highest PIC value was 0.887 against RM439 marker. The genetic diversity can also be detected by total number of alleles (TNA). Marker detecting the lowest total number of alleles showed the lowest gene diversity than those detected maximum total number of alleles revealed the higher gene diversity [39,47]. In this study, the highest number of alleles per marker was 9 which was detected on chromosomes 1, 5, and 6, while the lowest total number of alleles 2 at chromosomes 4, 2, and 1. The PIC value also exhibited a significant correlation with the TNA and gene diversity for microsatellites SSR markers (Supplementary Table 2). Shannon's Information Index mean value is 1.491 having the standard value 0.027. Similar results were also reported by many researchers to exhibit the genetic diversity [38,4148,49]. In this study, a significant amount of rare alleles were also identified which indicates that these rare alleles contribute well to the overall genetic diversity of the population. Collectively, all the results from PIC value, TNA, Shannon's Information Index, and the presence of rare alleles showed that there was enormous diversity present in the current collection of rice germplasm, which in turn can be used in the future breeding programs for the selection of diverse parents.
The analysis of population structure provides information about genetic similarities and dissimilarities within a set of genotypes under study. The population structure analysis formed the cluster based on these genetic similarities and dissimilarities. So, clustering a large number of germplasms