Population Structure of 93 Varieties of Rice ( Oryza Sativa L. subsp. Hsien Ting) in Qinba, China

Abstract


Background
The Qinba region is a climate transition area between North and South China, and it is also the transition area from indica varieties to japonica varieties and the best suitable planting area for indica varieties.To date, no genetic research studies on the rice germplasm resources in this region have been conducted.The population genetic structure is the non-random distribution of genes or genotypes in space and time, including genetic variations within populations and genetic differentiation among populations.Population structure analysis is essential to explore the biological adaptability, the population formation process, evolutionary mechanism, protection, and development of biological resources [17].At the same time, the population with identical or similar genetic backgrounds is most suitable for genome-wide association studies (GWAS), therefore, the study of population genetic structure plays an important role in the eld of biology.Genetic markers have played an important role in studying population structure, ranging from earlier morphological markers to more recent DNA molecular markers.Eukaryotic genomes have simple sequence repeats (SSRs) that span approximately 10-50 kb.In the last few decades, SSR molecular markers have become important tools in the eld of biology, particularly in terms of population structure, genetic mapping, and other related elds.SSR markers have also become the designated markers of the International Fingerprint Mapping Center.These are employed in judicial identi cation, identi cation of new varieties of plants, such as rice, rape, and corn [5,14,23,26].SSR markers are used in DNA ngerprinting for breed protection [24].However, SSR markers are few in number, show unbalanced distribution in the genome, have weak electrophoretic resolution, and are relatively time-consuming and laborintensive, and thus it is di cult to construct high-density genetic maps.With the recent development of next-generation sequencing technology, most biological studies have rapidly improved.In particular, the use of single nucleotide polymorphisms(SNPs) based on genome-wide scans [1], and with the release of extensive rice genome sequencing data, one SNP in every hundreds of base pairs or even every dozens of base pair has been identi ed, indicating that there are numerous SNPs in the rice genome [4,9,18] .A small number of SNPs can be used to resolve many problems, so the sequencing technology was born based on simpli ed genome by restriction site-associated DNA (RAD) tags.The frontrunner among these technologies is genotyping-by-sequencing (GBS), which has recently gained attention because it utilizes methylation-sensitive restriction endonucleases (type II enzyme), thereby avoiding repetitive regions of the genome (methylated regions) , GBS technology can rapidly identify high-density mutations, especially SNPs mutations.Currently, there are fewer commercially available restriction enzymes for selecting RAD tags for sequencing.In this study, two type II enzymes (NlaIII and MseI) were selected for digestion of the genome, which generated RAD tags for sequencing to obtain SNPs.Simultaneously, 48 core primer pairs of SSRs from NY/T1433-2014 that originated the Agricultural Standards of the People's Republic of China and 15 agronomic traits, namely, sowing date, plant height, leaf length, leaf width, effective number of panicles per plant, panicle length, total number of grains per panicle, number of lled grains per panicle, 1,000-grain weight, browning rate, milled rice rate, head milled rice rate, chalky grain rate, chalkiness degree, and length/width ratio, were employed to explore gene ow and population genetic structure of 93 rice varieties and to provide reference for future research studies using different genetic markers employed in related elds.

SSR genotyping
A total of 378 bands was detected using 48 core SSR primer pairs (Table 1).Among these, 336 polymorphic bands were detected.The average number of polymorphic fragments was 7 ranging from 1 to 14.The maximum number of 14 polymorphic bands was detected by RM278 while RM311 is the least.The average PPB was 88.87% ranging from 50% to 100%.The average PIC value was 0.77 ranging from 0.19 to 0.88.Those data showed that core SSR in rice can produces rich bands and high polymorphic rate.

Linkage disequilibrium (LD) decay and Haplotype construction
In a total of 6,288,753 loci, among which 326,873 (5.198%) were heterozygous based on 67,621 SNPs.The 67,621 SNPs were unevenly distributed among the 12 chromosomes (Fig. 1a); chromosome 1 contained the largest amount of makers (8,425), while chromosome 8 included the least (3,953).Among the 84,255 SNP pairs, R 2 value had a minimum of 0.2 and an average of 0.73.46,322 SNP pairs (54.98%) had R 2 values higher than 0.8, while 7,841 pairs (9.31%) were in complete LD (R 2 =1).The inter-marker genetic distance between all pairs, between pairs of SNPs with R 2 inferior to 0.8, and between pairs with R 2 equal or superior to 0.8 had average values of 154,130 bp, 171,768 bp, and 139,686 bp, respectively.LD, as represented by inter-loci R 2 values, decreases as the physical distance between loci increases (Fig. 1b).The 12 chromosomes yielded a total of 6568 predicted haplotypes (Fig. 1c), with chromosome 1 possessing the most haplotypes (776) and chromosome 10 possessing the least (349).The largest haplotype was composed of 95 SNPs.The longest haplotype spanned 200.0 kb, the average length of the haplotype is 33.71kb.

Population genetic structure analysis
We employed two types of genetic markers including three kinds of DNA markers and 15 agronomic traits to perform cluster analysis.

PC analysis
Principal component analysis was performed to select the rst three PC(eigenvalue) and their cumulative contribution of variance accounted for 40.69%,39.76%40.10% and 15.76% by NlaIII-GBS only, by MseI-GBS only, by mergered NlaIII-GBS and MseI-GBS data and by SSR, respectively.PCA separated the 93 genotypes into two subgroups (Fig. 4) which were consistent with the UPGMA and STRUCTURE results.W366 and W367 has been separately clustering (Fig. 2).

UPGMA
The unweighted pair-group method with arithmetic means (UPGMA) algorithm was performed on the 93 genotypes, which demonstrated that the 93 genotypes could be divided into 2 subgroups (Fig. 3).Group I included 1 to 3 samples, respectively.and group II contained 92 to 90 samples.The average genetic distance was 0.29 ranging from 0.02 to 0.55 based on mergered NlaIII-GBS and MseI-GBS data,which the two most closely related materials were W710 and W711, the two most greatest materials were W366 and W740.
Bayesian clustering MAF <5% , 70824 SNPs were used to assess the population structure of the entire pool of rice germplasms.Delta K reached a maximum value at K=2, suggesting that the 93 rice samples were divided into two subgroups (consisting of 70 and 23 samples) (Fig. 4).In the population structure analysis, the results from K = 2 to K = 5 revealed the occurrence of gene introgression between groupІ and groupII, accounting for approximately 76.34% of the observed variations (calculated by K = 2).
The analysis performed using PCA, UPGMA and Bayesian clustering got similar results, The population structure is relatively simple,the matrix delamination is not distinctive.

Agronomic traits clustering
Clustering result based on 15 agronomic traits show in g. 5a. the 92 materials are gathered together in addition to W669, showed a single genetic basis for the population.

Clustering of different category materials
We analyze population genetic information of the different category materials including 57 restoring lines, 19 maintainer lines and 17 special rices, respectively (Table 2), and UPGMA clustering (Fig. 5b, 5c, 5d) .The results showed the genetic basis of the restorer line was more abundant than that of the maintainer line,and the genetic basis of the special rice was wider than that of the conventional rice.In the present study, the coe cients of correlation (R 2 ) between the genetic distance matrices of NlaIII-GBS only and MseI-GBS only, NlaIII-GBS only and SSR, MseI-GBS only and SSR, mergered NlaIII-GBS and MseI-GBS and SSR were 0.88, 0.35, 0.27,0.33,respectively (Fig. 6), this may be due to the different number of markers used.

AMOVA and gene ow
Tajima's D value was 1.66, which signi es low levels of both low-and high-frequency polymorphisms, indicating a decrease in population size and/or balancing selection.This resulted in more haplotypes and lacked rare alleles in the population.Analysis of molecular variance showed that the genetic variation within the population was 98% and between populations was 2%, which indicated the existence of slight genetic variation among 93 samples.The genetic differentiation coe cient (F ST ) between the two populations was 0.61, and gene ow (Nm) was 0.16.Further investigation showed that the gene ow of sel ng crops was the smallest, and that of annual herbaceous plants was the lowest.If Nm > 1, which indicates that the level of gene ow between populations is high, then genetic differentiation among populations is small; if Nm > 4, then gene communication between populations is more adequate and genetic differentiation is smaller; and Nm < 1 indicates population differentiation may have occurred due to genetic drift.The gene ow was 0.16, which indicates that the gene ow among rice populations in the Qinba region is lower, but nearly 2.5-fold larger than that of conventional inbred plants, which may result in long-term arti cial selection, leading to reduced genetic differentiation.

Discussion
Germplasm resources are the basis of rice breeding.The analysis of genetic diversity and genetic structure of rice germplasm resources is bene cial to mining excellent breeding materials and improving breeding e ciency.The results showed that 48 pairs core primers of SSRs had rich bands and high polymorphism.
It has been widely used in the study of genetic diversity and identi cation of varieties [27,17,12,3].A total of 72824 SNPs were obtained by genome sequencing based on two enzymes (NlaIII and MseI), indicating that there are abundant SNPs in rice.Previous sequencing results showed that there is one SNP in every hundreds or even tens of base pair in rice genome [8, 16, 13, 14, 24, 19, 21], indicating that there were a large number of SNPs in the rice genome, which had been applied in gene mapping, functional gene detection, assisted breeding selection and other aspects of rice [21,29].The results of this study are consistent with previous sequencing results.
Different DNA markers re ect different ranges of polymorphisms in the genome.In recent decades, SSR markers, which represent second-generation DNA molecular markers, have been widely used for plant population genetic analysis, phylogenetic reconstruction, and quantitative trait mapping.Theoretically, the more markers used in a study, the more accurate the results are.SSR markers are mostly distributed in centromeres, telomeres, introns, and 3'UTR regions, and most of these are non-functional markers of genes.If combined with functional SNP markers, the genetic structure of the population revealed would be more accurate.With the rapid development of the next-generation genome sequencing technology, it is now possible to obtain a large number of sequence data at a low cost as well as increase the reliability of the results.The PCA, UPGMA and Bayesian clustering are three commonly used methods for population genetic structure, the analysis of the population genetic structure is based on the estimation of genetic distance or genetic similarity coe cient matrix between samples, For example, in PC analysis, the accumulative contribution rate of the rst three major factors analyzed by SSR was only 15.76%, far less than that using SNP data, it indicates that the more DNA polymorphism, the more accurate the population variation can be explained.The results showed that the genetic structure of 93 materials was relatively simple and the matrix strati cation was not obvious.The cluster based on 15 agronomic traits also showed that the genetic basis of the other 92 materials was single except W669.The above studies showed that the parents of rice breeding in Qinba area had higher genetic similarity, single genetic background and low genetic polymorphism.The research results are consistent with the previous conclusions of our research group [28].This may be due to many reasons, such as the wide exchange of variety resources among breeding units in the process of breeding, and the similar breeding goals,which makes the genetic basis of breeding varieties similar.The narrow genetic basis limits the cultivation of rice varieties, the breeding of excellent characters and increasing rice yield.
Through the analysis of different types of materials (57 restorer lines, 19 sterile lines and 17 special rice), the results showed that the genetic basis of restorer lines was richer than that of maintainer lines, which was consistent with the conclusion of Ying Jiezheng et al. [27].The main reason may be that most cytoplasmic male sterile lines currently used in production are related to Zhenshan 97B, -32B, zhong9a and gang46a, or they are derived from Chinese Dwarf early rice varieties aizazhan and aijiaonante.At present, the restorer lines used in combination production are from the Yangtze River Basin of China, Sichuan, Southeast Asia, South Korea, etc. and the restorer lines created by crossing indica and japonica rice.Compared with other rice germplasm resources, special rice has rich genetic basis and high breeding potential.
By analyzing different types of materials (57 restorer lines, 19 sterile lines and 17 special rice), the results showed that the genetic basis of restorer lines was richer than that of maintainer lines, which was consistent with the conclusion of Ying Jiezheng et al. [27].Perhaps the main cause is the production on the application of cytoplasmic male sterile lines most with Jane shanyou 97 b, -32 b, 9 a and 46 a post in the blood, or China's variety of dwarf rice dwarf is accounted for and short feet nantes derivative.The restorer line is rich in sources.Currently, the restorer line used in combination production is consanguineous from Yangtze River Basin, Sichuan, Southeast Asia, Korea, etc., as well as restorer line created by crossing indica and japonica.Special rice has an abundant genetic basis compared to other rice germplasm resources and has high breeding potential.

Conclusions
Among the three commonly used methods of population structure analysis, Bayesian algorithm is more practical than UPGMA and PC analysis, considering the known pedigree knowledge.At the same time, the size of gene ow of each samples can be seen from the population genetic structure graph based on Bayesian algorithm.Genetic effects in populations depend on the opportunity distribution of MAFs across the genome-wide, and different populations have different MAF values.Although gene ow in the population consisting of 93 rice varieties is large, the average MAF of the population was only 0.21.It is an important measure to improve the genetic diversity of rice varieties and the main way to improve the yield of rice varieties in Qinba area.Any one of the two sets of GBS data used in this study can be utilized in the analysis of population genetic diversity.However, particularly for subsequent QTL mapping, the greater the number of polymorphic sites, the higher the mapping resolution, especially for NlaIII-GBS data, which possess CATG recognition sequences, so the selected RAD tags digested by NlaIII enzyme may be the functional regions of the gene and will be studied in subsequent QTL mapping efforts.Genomic DNA extraction, primer synthesis, PCR ampli cation,and GBS

Methods
The genomic DNA of 93 rices was extracted from fresh leaves using the SDS technique and detected with 0.8 % agarose gel electrophoresis.The 48 SSR primers were synthesized by Beijing Aoke Biotechnology Co., Ltd.(Beijing, China).PCR were carried out in a 10 μL volume containing 1 μL DNA template, 2μL(10μM)of forward and reverse primers(1μL each), 5μL 2×Taq Master Mix, 2 μL RNase-free water.the reactions were programed as follows: initial denaturation at 94.0 °C for 5 minutes, denaturation at 94.0 °C for 1 minute, annealing at 50-60.0 °C for 1 minute, and extension at 72.0 °C for 1 minute, for a total of 35 cycles.Electrophoresis was performed using 8% non-denaturing polyacrylamide gel under 95V voltage; the bands were visualized via silver staining.

LD decay and Haplotype construction
Genotype data were then used to calculate LD between SNPs and to construct haplotypes using the EM algorithm implemented by PLINK1.07 (https://www.cog-genomics.org/plink2).The command line "--r2" and "--blocks" were used to calculate LD and assign SNPs to their respective haplotypes by calculating inter-maker LD within a 200kb window, respectively.The gures were constructed using the Origin8 platform(http://www.originlab.com/).

SSR marker e ciency analysis
Following electrophoresis, each ampli cation band corresponded to a primer hybridization locus and was considered as an effective molecular marker.Each polymorphic band detected by a same given primer represented an allelic mutation.In order to generate molecular data matrices, clear bands for each fragment were scored in every accession for each primer pair and recorded as 1 (presence of a fragment), 0 (absence of a fragment), and 9 (complete absence of band).The value of the polymorphism information content (PIC) was calculated by PIC_Calc 0.6 program, where PIC represents the PIC value of the ith locus and Pij represents the frequency that allele j appears in the ith locus.The value of PIC varies from 0 to 1, with 0 indicating an absence of polymorphism at a given locus and 1 re ecting multiple alleles at a given locus.The level of polymorphism of each marker was assessed by the polymorphism information content [2], which measures the extent of genetic variation: PIC values smaller than 0.25 indicates low levels of polymorphism associated to a locus, PIC values between 0.25 and 0.5 imply moderate levels of polymorphism, while PIC values greater than 0.5 indicate high levels of polymorphism PC analysis PC analysis was performed under Eigen module by NTSYS-pc2.10e[20].

Bayesian clustering
The Bayesian clustering algorithm implemented in STRUCTURE 2.3.4 (http://taylor0.biology.ucla.edu/structureHarvesteroybase.org/tools.php) was used to simulate population genetic structure with the assumption that all of the genetic markers were independent.To obtain an estimate of the most probable number of population (K), K values from 1 to 5 were simulated with 5 iterations for each K, using 100000 burn-in periods followed by 100000 Markov Chain Monte Carlo iterations.Delta K was plotted against K values and the best number of clusters was determined following the method proposed by Evanno et al [7].
In this study, STRUCTURE 2.3.4,which applies a Bayesian clustering algorithm, was used to simulate population genetic structure based on the assumption that the 72824 loci were independent.Using a membership probability threshold of 0.60, population K values from 1 to 5 were simulated with 5 iterations for each K using 10,000 burn-in periods followed by 10,0000 Markov Chain Monte Carlo iterations in order to obtain an estimate of the most probable number of population.Delta K was plotted against K values; the best number of clusters was determined following the method proposed by Evanno et al and obtained via the Structure Harvester platform (http://taylor0.biology.ucla.edu/structureHarvester/)[6].
Agronomic traits clustering SPSS20 was employed to perform agronomic traits clustering by nearest neighbor analysis.

Correlation analysis among genetic distance matrices by diffrent DNA marker dataset
Mantel tests were used to measure the correlation between the genetic distance matrices generated by NlaIII-GBS only, MseI-GBS only, merged NlaIII-GBS and MseI-GBS data and SSR.It was carried out by the GenAlEx software with permutation 9999 times [21].r ≥ 0.9, 0.8 ≤ r < 0.9, 0.7 ≤ r < 0.8, and r < 0.7 represented signi cant correlation, moderate correlation, weak correlation, and no correlation, respectively.

AMOVA and Gene ow
For the analysis of molecular variance (AMOVA), 93 accessions were classi ed into two groups by 72824 SNPs.The components of variance attributable to different varieties and breeding lines were estimated from the genetic distance matrix using Tajima & Nei method, as speci ed in the AMOVA procedure in ARLEQUIN 3.1[8].A nonparametric permutation procedure with 9999 permutations was used to test the signi cance of variance components associated with the different possible levels of genetic structure in this study.The pairwise Fst values, a value of F statistic analogs computed from AMOVA, were used to compare genetic distances between any two groups.

Figures
Figure 1 In    The correlation between the genetic distance matrices by different genetics markers 93 accessions, representing most of the Oryza sativa L. subsp.Hsien Ting varieties (lines) germplasm resource of the Qinba area in China, were collected from the rice experimental farm during the 2018 growing season in the Shaanxi Rice Research Institute and comprised 57 restoring lines, 19 maintainer lines, and 17 special rice.
a total of 6,288,753 loci, among which 326,873 (5.198%) were heterozygous based on 67,621 SNPs.The 67,621 SNPs were unevenly distributed among the 12 chromosomes (Fig.1a); chromosome 1 contained the largest amount of makers (8,425), while chromosome 8 included the least (3,953).Among the 84,255 SNP pairs, R2 value had a minimum of 0.2 and an average of 0.73.46,322 SNP pairs (54.98%) had R2 values higher than 0.8, while 7,841 pairs (9.31%) were in complete LD (R2=1).The inter-marker genetic distance between all pairs, between pairs of SNPs with R2 inferior to 0.8, and between pairs with R2 equal or superior to 0.8 had average values of 154,130 bp, 171,768 bp, and 139,686 bp, respectively.LD, as represented by inter-loci R2 values, decreases as the physical distance between loci increases (Fig.1b).The 12 chromosomes yielded a total of 6568 predicted haplotypes (Fig.1c), with chromosome 1 possessing the most haplotypes (776) and chromosome 10 possessing the least (349).The largest haplotype was composed of 95 SNPs.The longest haplotype spanned 200.0 kb, the average length of the haplotype is 33.71kb.

Figure 2 PCA 3 UPGMA 4
Figure 2 PCA plots by the different DNA markers.a. by NlaIII-GBS only; b. by MseI-GBS only; c. by merged NlaII-GBS and MseI-GBS data; d. by SSR

Table 1
Information and ampli cation results of SSR primers SNPs and 35,547 SNPs passed the minor allele frequency (MAF) lower limit of 0.05 by NlaIII-GBS only and MseI-GBS only, respectively.Then mergered NlaIII-GBS and MseI-GBS data, a total of 72,824 SNPs including 67,621 SNPs that aligned speci c chromosomes and 5,023 SNPs that do not aligned speci c chromosomes.The average MAF was 0.21 of 93 samples.

Table 2
Population genetic analysis of different category materials