Sequencing rice diversity from Vietnam
Whole-genome sequencing was carried out on 616 rice accessions. 511 of the accessions were obtained from the PRC (Plant Resource Centre, Hanoi, Vietnam, http://csdl.prc.org.vn), together with their passport data, which shows that they were collected from all eight administrative regions of Vietnam (Table S1). The remaining samples were obtained from AGI’s collection (Agricultural Genomics Institute, Hanoi, Vietnam). Three reference accessions (Nipponbare, a temperate Japonica; Azucena, a tropical Japonica; and two accessions of IR64, an Indica) obtained from the PRC, were included in the dataset. A total of 1,174 Giga base-pairs (Gbps) of data was generated for the 616 samples representing an average sequencing depth of 30x for 36 “high coverage” samples and 3x for 580 “low coverage” samples (Table S1). These 616 newly-sequenced accessions were classified into 379 Indica and 202 Japonica subtypes, with the remaining 35 (including the Aus and Basmati varieties) being classified as admixed, based on the STRUCTURE (Pritchard et al. 2000) output for K=2 using a subset of 163,393 SNPs.
Population structure of rice within Vietnam
The population structure of rice within Vietnam was analysed using the diversity panel of 672 samples, comprising 616 newly sequenced accessions and 56 Vietnamese genotypes from the 3K RGP. We assigned the 672 samples to four Japonica subpopulations and five Indica subpopulations (Table S1) using (i) the population structure information obtained from the STRUCTURE analysis (Fig. 1), (ii) the previous characterisation of a panel of Vietnamese native rice varieties using GBS (Phung et al. 2014), and (iii) the assessment of the optimal number of subpopulations (Fig. S1) using the method described in Evanno et al. (2005). Subpopulations were named as in Phung et al. (2014), except that we considered the I6 subpopulation to be part of the I3 subpopulation. Although the previous study used a limited number of GBS markers, 129 of the 164 common samples were assigned to the same subpopulations in both studies. Most differences were due to samples being classified as admixed in either one of the studies. We classified 48 (11%) of the Indica (Im), and eight (4%) of the Japonica samples (Jm) as admixed. The reference varieties Nipponbare (Temperate Japonica), Azucena (Tropical Japonica), and IR64 (Indica) were classified as J4, J1 and I1, respectively.
Each Indica subpopulation contained shared ancestry (admixed components) with other Indica subpopulation (Fig. 1a). The admixed components are shown in detail for the 43 samples in the I5 subpopulation (Fig. 1c) namely 38 samples from our dataset and the following five samples from the 3K RGP; IRIS 313-11384 (IRGC 127275), B184 (IRGC 135862), IRIS 313-11383 (IRGC 127274), IRIS 313-10751 (IRGC 127577) and IRIS 313-11893 (IRGC 127519). The Japonica subtropical J1 subpopulation shared ancestry (between 0 and 25% of the genome) with the Japonica tropical J3 subpopulation, whereas the two temperate subpopulations, J2 and J4 shared ancestry dominantly with each other. The tropical J3 subpopulation contained four samples with around 20% of the haplotypes in common with the temperate J4 subpopulation. Using the passport information available from the PRC, the proportion of each subpopulation originating from each of the “administrative regions” of Vietnam is shown in Fig. 1d. Only the I1 and I2 Indica subpopulations were collected from the Mekong River Delta regions, I2 being almost exclusively grown there whereas I1 was more widespread than I2. The I4 and J4 subpopulations were mainly collected from the Red River Delta areas. The J1 and J3 subpopulations were closely related; the J1 subpopulation was predominantly from the North of Vietnam whereas the J3 subpopulation was concentrated around the South-Central Coast region. Small variations in the percentage of reads mapping were observed for each of the subpopulations (Fig. S2).
A Principal Component Analysis (Fig. 2a and 2b) showed the relationship between these nine Vietnamese subpopulations. Concerning the Vietnamese genotypes from the 3K RGP dataset included in the diversity panel, the Indica I1 subpopulation included two XI-1B modern varieties and eight admixed (XI-adm) accessions. I2 included fourteen XI-3B1 genotypes, which comprises Southeast Asian accessions, and similarly, I3 and I4 included one and ten XI-3B2 genotypes, respectively. Finally, I5 included five XI-adm accessions and clustered distinctly away from all the other subpopulations (Fig. 2a). On the other hand, J1 included the two subtropical (GJ-sbtrp) accessions from the Vietnamese 3K RGP genotypes, and J3 included one tropical (GJ-trp1) accession from the Vietnamese 3K RGP genotypes (Fig. 2b). These results correlate well with the latitudinal distinction between these subpopulations. J2 and J4 included two and one temperate (GJ-tmp) accessions, respectively; and split into two clear subpopulations in Vietnam compared with the East Asian temperate subpopulation described by the 3K RGP.
Population structure of the combined 3,635 Asian cultivated rice genomes
612 of the 616 newly sequenced accessions from this study and the 3,023 accessions from the 3K RGP were combined and classified into 9 and 15 subpopulations (Table S2), and compared with the subpopulations from the 3K RGP analysis (Wang et al. 2018; Zhou et al. 2020). For clarity, we used the prefix Jap- and Ind- to label these subpopulations from our analysis.
When the combined dataset of 3,635 samples was classified into nine subpopulations (Fig. S3a), we found that 95% of the 3K RGP accessions (2,882 out of 3,023) were assigned into the same subpopulations. The remaining 5% lines were either (i) previously classified as admixture and our analysis placed into a subpopulation, or (ii) were previously classified in a subpopulation and were now classified as admixture. The 612 newly sequenced Vietnamese accessions were placed in three Indica clusters (187 accessions), three Japonica clusters (176 accessions), the Basmati and Sadri aromatic cB group (11 accessions), or the Aus cA subpopulation (one accession). In more detail, the three Indica clusters included three Im accessions in the East Asian cluster (Ind-1A), seventy-six I1 accessions in the cluster of modern varieties of diverse origins (Ind-1B), and 108 accessions (I2, I3 and Im) in the Southeast Asian cluster (Ind-3). Whereas, the three Japonica clusters included 54 accessions (J2, J4 and Jm) in the primarily East Asian temperate cluster (Jap-tmp), 119 accessions (J1, J3 and Jm) in the Southeast Asian subtropical cluster subpopulation (Jap-sbtrp) and three J3 accessions in the Southeast Asian Tropical subpopulation (Jap-trp). Any remaining accession with admixture components over 65% either Indica or Japonica were classified as Ind-adm (191 accessions) or Jap-adm (27 accessions), respectively. Finally, the remaining accessions were considered as Admix (19 accessions). Notably, all thirty-seven I5 accessions were placed in Ind-adm, and ten of the sixteen J3 accessions were placed in Jap-adm.
When the combined dataset of 3,635 samples was reclassified into 15 subpopulations (K15_new, Fig. S3b), we noticed the following differences in the distribution of subpopulation compared to the 3K RGP analysis for the same number of 15 subpopulations (K15_3KRGP); we did not observe the division of the Aus samples into cA-1 and cA-2, and we subdivided the Indica subtypes and Japonica subtypes into eight and five subpopulations, respectively. A Principle Coordinate (PCO) analysis of the Indica and Japonica subpopulations is shown in Fig. 3, highlighting our new eight Indica and five Japonica subpopulations (In addition the Vietnamese and 3K RGP subpopulations are shown in Figs. S5 and S6).
The relation between the subpopulations in our comprehensive analysis (3,635 accessions) and the 3K RGP (3,023 accessions) was as follows: (i) The Ind-1A, Ind-1B.1 and Ind-1B.2 were equivalent to XI-1A, XI-1B1 and XI-1B2, respectively. Forty-three of the Vietnamese I1 accessions were in the Ind-1B.1 subpopulation, and the remaining 102 I1 accessions were classified as admixed. (ii) The Ind-2 was equivalent to XI-2A and XI-2B, and as expected, this geographically distant South Asian subpopulation was not present in Vietnam. (iii) The previously observed split of the Indica-3 subpopulation into 3A and 3B was also observed in our analysis, where Ind-3.1 was equivalent to XI-3A and did not contain any Vietnamese accessions. (iv) The remaining Ind-3.2, Ind-3.3 and Ind-3.4 were a rearrangement of the XI-3B1 and XI-3B2 subpopulations. (v) The 89 Vietnamese I2 accessions belonged to Ind-3.2, which was a subset of XI-3B1. (vi) Ind-3.3 contained 16 of the 37 Vietnamese I3 accessions. (vii) 72% of the accessions in Ind-3.4 were from Vietnam, which contained 13 of the 37 I3 accessions, 61 of the 62 I4 accessions, and all I5 accessions. Within Ind-3.4, the admixture components of I3, I4 and I5 subpopulations (Fig. S7) showed that I3 accessions were highly admixed, some I4 and I5 accessions were completely within Ind-3.4, while other I4 and I5 accessions showed admixture with Ind-3.3 (I5) or Ind.2, Ind-3.2, and Ind-3.3 (I4). To clarify these relations, a principle component analysis (PCA) with a reduced number of accessions was carried out using the 723 sample dataset (672 Vietnamese accessions and 51 genotypes from neighbouring Southeast Asian Countries; Fig. S8), this supported the close relationships of I2 with XI-3B1, I4 with XI-3B2, I5 with XI-adm, J1 with GJ-sbtrp, and that both J2 and J4 were within GJ-tmp.
Phenotypic and genetic diversity analysis of the Vietnamese Indica and Japonica subpopulations
Phenotypic measurements for 19 traits were scored in field conditions in the Hanoi area by breeders from the Agricultural Genomics Centre (AGI) for approximately two-thirds of the samples in our study. For five of these traits, additional scores were also included from trials by the Vietnamese Plant Resource Centre. In addition, phenotypic data were available for eleven of the traits in 38 of the 56 samples sourced from the 3K-RGP dataset (Table S3, S4). Finally, the grain length to grain width ratio (GL/GW) was calculated to give a total of 20 traits (Table S5). Scores were available for between 328 and 503 of the 672 samples (Indica subpanel, 170 – 297 samples and Japonica subpanel, 134 – 178 samples).
There were significant differences in measurements between the Indica and Japonica subtypes for ten of the traits; these are detailed in Table S5 and histograms are shown in Fig. 4 for selected phenotypes. The Indica subtypes had significantly (p-value <0.0001) higher values for grain length to width ratio, leaf pubescence, culm number, culm length, and floret pubescence. In contrast, the Japonica subtypes had significantly higher values for grain width, leaf width, flag leaf angle, panicle length, and floret colour. The Indica I1 subpopulation (mostly elite varieties) was the most phenotypically distinct when compared to the rest of the Indica samples (mostly native landraces). I1 samples had longer grains (p-value = 2.2e-16), earlier heading date (p-value = 9.9e-12), higher culm strength (p-value = 2.2e-16), shorter leaf length (p-value = 2.7e-14) and shorter culm length (p-value < 2.2e-16). Similar values were obtained when comparing I1 to just the I5 subpopulation (Fig. 4). The I5 subpopulation was not phenotypically distinct (p-value < 0.001) from the other landrace subpopulations I2, I3 and I4, except for a significantly lower measurement of leaf pubescence (p-value = 0.0007). The Japonica J2 subpopulation had a significantly lower grain length to width ratio than J1 (p-value = 1.8e-13) and J3 (p-value = 5.7e-07). A correlation analysis carried out between the 20 phenotypes (Fig. S9) showed that the highest correlation (r = 0.6) was between leaf length and culm length (excluding the correlation between grain length to width ratio and grain length and grain width). Histogram and correlation plots are available for the 13 traits used for the GWAS analysis in Fig. S10 comparing the Indica and Japonica subtypes and in Fig. S11 comparing subpopulations I1 and I5. Further boxplots showing the phenotypic distribution according to subpopulation for culm length, grain length, grain width and heading date are available in Fig. S12.
The Japonica subtypes had a lower nucleotide diversity (p = 0.000912) than the Indica subtypes (p = 0.00167). Looking at the individual subpopulations (Table S6), the elite I1 subpopulation is the most diverse (p = 0.00144), and the I5 subpopulation is the least diverse (p = 0.00103). Regions of the genome with low diversity in all Indica subpopulations, and regions with low diversity in specific subpopulations, were observed when plotting diversity along each chromosome (Fig. S13). The J3 subpopulation is the most diverse of the four Japonica subpopulations. (p = 0.000697). Large genomic regions with very low diversity were observed in chromosomes 2, 3, 4 and 5 in all Japonica subpopulations (Fig. S14).
Genome-wide genotype-phenotype association analysis
Three independent GWAS were conducted using the full panel (672 samples, 361,191 SNPs), the Indica subpanel (426 samples, 334,935 SNPs) and the Japonica subpanel (211 samples, 122,881 SNPs). Thirteen (13) of the 20 traits were suitable for GWAS based on the variance (CV < 56% for the full panel). The full list of phenotypic measurements is available in Table S3. We found 643 significant phenotype-genotype associations. These associations were organised into 21 QTLs (Table 1, Table S7). The GWAS Manhattan and Quantile-Quantile plots are available in Fig. S17 and Fig. S18. The QTLs ranged from 41 kb (16_FP) to 3,148 kb (5_GS). The 21 QTLs contained 1,730 genes and covered a total of 11 Mbp over ten chromosomes, and contained 453 SNPs with a significant association to a trait in at least one diversity panel (Fig. 5). The list of genes within each QTL is available in Table S8. Functional enrichment was found within 9 of the QTL (Table S9).
Seventeen QTLs were identified in the full diversity panel significantly associated with eight traits: grain length, grain width, grain length-to-width ratio, leaf width, panicle length, floret pubescence, heading date and internode diameter. A further 4 QTLs associated with grain length and grain width were observed only in the Japonica subpanel. Three of the QTLs, which were found in the full panel, were also observed in the Indica subpanel.
The set of 3.8M SNPs (see methods), representing one SNP every 99 bases, was annotated based on the potential effect of each SNP in protein function using SnpEff (Table S12). 526,138 (4.79%) of the SNPs were in genes. There were 21,639 (0.197%) SNPs in 11,125 genes classified as having a putative “High impact” effect (E.g. Exon changes, frameshifts, gene fusions or rearrangements, protein structural changes, etc.). Following additional minimal allele frequency (MAF) filtering, in the Indica dataset (MAF 5%, 2,027,294 SNPs), there were 11,906 "High impact” SNPs in 7,396 genes and the Japonica dataset (MAF 5%, 1,125,716 SNPs), there were 6,240 “High impact” SNPs in 4,439 genes of which 2,818 were present in both Indica and Japonica.
None of the 453 SNPs with a significant association was annotated as resulting in protein changes (“High impact” SNPs). However, “High impact” effects were identified in other SNPs within the QTL. Among the total 1,730 genes in the 21 QTLs, we annotated 309 genes with “High impact” SNPs in the Indica subpanel, 248 genes with “High impact” SNPs in the Japonica subpanel, including 137 “High impact” SNPs common between the two sets. 129 of the 309 genes and 94 of the 248 genes had functional annotations in PhytoMine (Goodstein et al. 2012), but no functional overrepresentation was found for these sets of genes. In addition, we looked for overlaps with the QTL in five published Vietnamese studies (Hoang et al. 2019a; Hoang et al. 2019b; Phung et al. 2016; Ta et al. 2018; To et al. 2019), which used 25,971 SNPs in 182 samples (164 in common). We found that 2_GL and 6_GS overlapped with QTL for panicle morphological traits (Ta et al. 2018); 2_GL overlapped with QTL9 for secondary branch number, and spikelet number (SBN and SpN), and 2_GS overlapped with QTL12 for secondary branch average length (SBL). 4_GW_jap overlapped with “q1” for longest leaf length (LLGHT) (Phung et al. 2016).