Bulked Segregant Analysis Coupled With Whole-Genome Sequencing (BSA-Seq) and Identication of a Novel Locus, qGL3.5, That Regulates Grain Length

Rice grain length (GL) directly affects the yield and quality of this species. Very few GL-related genes cloned are applied in production because their yield-increasing effect was not obvious. In this study, the two bulk-DNA pools (L-pool and S-pool) and their parents’ (KJ01 and Huaye 4) DNAs were subjected to high-throughput sequencing. After assessing the quality of the data, we obtained a total of 100.22 Gb of high-quality data; the average coverage depth was 55x, and the genome coverage was 96.51%. After combining the association results of the ED and SNP index methods, we mapped the GL genes to a 0.34 Mb “hotspot” region on chromosome 3, which contains 37 genes related to various traits. The 37 predicted genes were further analyzed by the use of the Gene GO, COG database and so on. Thirty-three genes were annotated by GO functions. According to the GO annotations, three genes whose molecular function involved in the plasma membrane and intracellular membrane-bound organelles were detected via CRISPR/ Cas9 editing technology. ORF33 was veried to regulate GL and was the target gene qGL3.5. These results provides a new gene resource for rice grain shape breeding and a starting point for functional characterization of the wild rice GL gene. qGL3.5, was found via clustered, regularly interspaced, short palindromic repeat (CRISPR)/CRISPR-associated 9 (Cas9) to regulate GL. Our study provides a new gene resource for rice grain shape breeding and a starting point for functional characterization of the wild rice GL gene. and genotype between wild-type and mutant plants. A, Plant type comparison between the wild type and mutant; B, GL comparison between the wild type and mutant. C, BLAST-reported sequence of the two target sites. and Compared with the of SNPs in reference genome, 1,225,946 SNPs were selected for SNP index analysis, and 1,773,999 SNPs were selected for ED association analysis. By combining the association results of the ED and SNP index methods, we found that four associated regions on chromosome 3 were closely correlated with the GL trait. In these genomic regions, no genes controlling GL have been reported. The total length of these regions was 0.34 Mb, containing 37 genes. GO enrichment analysis revealed that 33 ORFs were annotated. Among them, three genes whose molecular function involved the plasma membrane and intracellular membrane-bound organelles were detected via CRISPR/Cas9 gene-editing technology. ORF33 was veried to regulate GL and the qGL3.5 target gene.


Introduction
Rice is among the most important staple crop species and feeds approximately one-third of the global population. Rice grain yield is determined by three main characteristics: grain weight, number of grains per panicle, and number of panicles per plant ). In terms of grain weight, the critical factor that affects yield is grain size, which includes grain length (GL), grain width, grain thickness, and the degree of lling ). Among these four factors, GL contributes the most to grain weight (Lin and Wu, 2003), and previous studies have shown that thin grains have better quality (Chen et al. 1997; Shi and Zhu 1997). In addition, GL is also an important factor of rice quality.
GL is a complex quantitative trait that is controlled by 2~3 or more genes (Mckenize and Rutger 1983). To date, at least 102 QTLs controlling GL have been located ). However, only eleven genes have been cloned and conducted further functional studies. The genetic mechanism underlying the effects of these genes has been elucidated and is very complex, and their modes of regulation are also diverse. qGL3/GL3.1 encodes a PPKL family serine/threonine phosphatase and restricts cell division in spikelet hulls by dephosphorylating cyclin T 1;3 (Cyct 1;3 ) (Qi et al. 2012;Zhang et al. 2012). GS3, a major gene controlling GL and grain weight, encodes a putative Gγ protein and inhibits interactions between the Gβ protein RGB1 (G protein γ subunit 1) and two other Gγ proteins to regulate grain size (Sun et al. 2018). GLW7 encodes the plant-speci c transcription factor OsSPL13, which positively regulates GL and yield (Si et al. 2016). GL7 regulates GL by affecting the expression of two linked genes . GL6 encodes a plant AT-rich sequence and zinc-binding (PLATZ) transcription factor; this protein has a premature stop codon and reduces the expression or causes the loss of function of GL6, affecting rice GL ).
Although some genes regulating GL have been cloned, few genes have been applied in production, and the overall regulatory networks underlying the associated processes remain poorly understood. With the development of high-throughput sequencing in recent years, high-e ciency, low-cost whole-genome sequencing has become available. Approaches based on bulked segregant analysis (BSA) coupled with whole-genome sequencing (BSA-seq) have been widely applied for mapping several important agronomic-related genes in rice , and gene cloning and functional research have become easier. BSA was used to construct two pools by mixing the DNA of individuals with extreme traits in a segregating population (Takagi et al. 2013). By analyzing the differences between single-nucleotide polymorphism (SNPs) and insertions/deletions (InDels) between the two pools, we can quickly locate the molecular markers tightly linked to a target gene (Klein et al. 2018). This method is appropriate for qualitative traits controlled by one gene and quantitative traits controlled by major QTLs. Currently, BSAseq, a rapid approach for identifying QTLs, is widely used in rice (Yang et  Interestingly, we obtained a wild rice inbred line Huaye 4 (Oryza ru pogon Griff) that presented very short GL-only 7.03 mm. An F 2 population was then generated from a cross of the cultivar KJ01 with a long GL of 13.33 cm and Huaye 4. QTL genetic linkage analysis showed that there was a novel gene that controlled GL and explained 54.85% of the phenotypic variation (Zheng et al. 2020). In this study, BSAseq was used to narrow the gap of 300 kb on chromosome 3 by the use of an F 2 population comprising 1,307 individuals, and no reported GL (GL) genes were found in this interval, which included 37 candidate genes according to annotation analysis of SNP variation. Among these genes, ORF33, namely, qGL3.5, was found via clustered, regularly interspaced, short palindromic repeat (CRISPR)/CRISPR-associated 9 (Cas9) to regulate GL. Our study provides a new gene resource for rice grain shape breeding and a starting point for functional characterization of the wild rice GL gene.

Plant materials
The wild rice inbred line Huaye 4, cultivar KJ01, and their F 2 progeny were used. These materials were grown in the eld at South China Agricultural University during the late season of 2015. Ten rows were planted in each of the test plots, and the row spacing was 18 cm×21 cm.

Measurements of characteristics
The rice grains were harvested at the mature stage by collecting three major panicles of each plant, after which the materials were put into marking envelopes. Samples of 30 randomly selected grains were assessed for GL, and the length of 10 grains placed end to end was measured using a ruler with a precision of ± 0.01 mm. The grains of each plant were replicated three times. The average length of a single rice grain was then calculated.
Construction of bulked DNA pools for sequencing Two bulk DNA pools for sequencing were rst constructed by selecting individuals exhibiting an extreme phenotype from among 1,307 F 2 segregating plants. The individuals that produced the top 50 longest grains were grouped into the L-pool, and the individuals that produced the top 50 shortest grains were grouped into the S-pool from among the F 2 segregating lines.
Total genomic DNA was extracted from young healthy leaves of both parents and 100 selected F 2 segregating individuals via the sodium dodecyl sulfate (SDS) procedure and puri ed by RNase A. The DNA concentration and quality were estimated with a Nanodrop 2000 UV-Vis spectrophotometer (NanoDrop, Wilmington, DE, USA), and the DNA concentration was diluted to 3 μg/μL; the total DNA was not less than 150 μg. The individual genomic DNA of the 50 L-pool individuals was mixed together equally, constituting the long-seed bulk DNA pool (L-pool), and the individual genomic DNA of the 50 Spool individuals was mixed together equally, constituting the short-seed bulk DNA pool (S-pool). The genomic DNA of the two bulk DNA pools and both parents was prepared for subsequent high-throughput sequencing.

Sequencing data analysis and comparison
The four samples were subjected to high-throughput sequencing on an Illumina HiSeq TM 2500 instrument (Illumina, Inc., San Diego, CA, USA) at Biomarker Technologies Corporation in Beijing. Real-time monitoring was performed for each cycle during sequencing, and the percentage of high-quality reads with quality scores greater than Q30 (a quality score of 30, indicating a 0.1% chance of an error and thus 99.9% con dence) among the raw reads and the GC content were calculated for quality control. Afterward, the sequencing data were compared with the sequence of the reference genome of Nipponbare According to the positioning results of clean reads in the reference genome, Picard (http://sourceforge.net/projects/picard/) was used for preprocessing, such as marking duplicates, and GATK was used for local realignment and base recalibration to ensure the accuracy of detected SNPs.
GATK was then used for SNP detection and ltering, after which a nal set of SNPs were obtained.

Association analysis
The relative marker abundance in the L-pool was calculated as the number of reads of the maternal allele divided by the number of reads of the paternal allele, whereas in the S-pool, the relative marker abundance was calculated as the number of reads of the paternal allele divided by that of the maternal allele. It was expected that the larger the relative abundance was, the greater the possibility that the marker was associated with GL. SNP index association analysis (Abe et al. 2012) and ED association analysis (Deza et al. 2009) were used in this research, and in the present study, L refers to the KJ01 parent, S refers to the parent of Huaye 4, aa represents the L-pool and ab represents the S-pool.

SNP index association analysis
The idea behind SNP index association analysis was recently published; this analysis is a type of method used to calculate genotype frequency differences between two bulks satis ed by Δ(SNP index). A relatively close marker is associated with traits, while a relatively close Δ(SNP index) is associated with the value 1. The Δ(SNP index) was calculated as follows: Laa is the depth of the aa group derived from L, while Saa indicates the depth of the aa group derived from S. Lab means the depth of the ab group derived from L, while Sab represents the depth of the ab group derived from S.
ED association analysis ED association analysis is a method for calculating ED (the quadratic sum root of differences between bulks based on the depth of the four types of bases) and is satis ed by ED. In theory, the higher the ED value is, the closer the object site.
The ED was calculated as described below. Aaa, Caa, Taa, and Gaa represent the depth of bases A, C, T and G, respectively, at a particular site in the large seed bulk. Aab, Cab, Tab, and Gab represent the depths of bases A, C, T and G, respectively, at a particular site in the small seed bulk.
In this study, high-throughput sequencing combined with BSA was used to detect SNP tags between the two bulked DNA pools and to rapidly identify marker-intensive hotspots associated with GL in the rice genome.

GO analysis of selected candidate genes
The GO database is a structured standard biological annotation system that produces standard vocabulary for the functions of genes and their products. The database structure is divided into multiple levels. The lower the level is, the more speci c the functions the nodes represent. The candidate genes that were selected were subjected to GO-based functional enrichment analysis, and they were classi ed according to three categories: biological processes, cellular components and molecular functions.

CRISPR/Cas9 transgene analysis
Two targets were designed for the potential candidate genes ORF14, ORF33 and ORF25 via the website CRISPR-GE (http://skl.scau.edu.cn/). The expression cassette vector was constructed by multiple rounds of PCR (Ma et al. 2015). The target sequences were introduced into sgRNA expression cassettes by overlapping PCR. The puri ed PCR products (sgRNA expression cassettes) were inserted into a pYLCRISPR/Cas9 vector based on Golden Gate system. The ligated products harboring the sgRNA expression cassettes were directly used to transform E. coli DH5α competent cells. The CRISPR/Cas9 constructs that were successfully generated were introduced into Agrobacterium tumefaciens strain EHA105. Agrobacterium tumefaciens-mediated gene transfer experiments were subsequently carried out in the background of Huaye 4. The results were analyzed via the sequential decoding method (http://skl.scau.edu.cn/) to identify positive transgenic plants. Three wild-type plants and twelve mutant plants were selected at maturity to measure their GL.

Results
Phenotypic analysis of both parents and genetic analysis of GL within the F 2 population All of the traits of the wild rice inbred line Huaye 4, including plant height and GL, remained stable by continuous self-crossing and selective breeding for several years ( Figure 1A). The plant height was approximately 41.20 cm, and the GL was approximately 7.03 mm. Huaye 4 was used as one of the parents and crossed with the rice cultivar KJ01, which has a long GL-13.33 mm. There were signi cant differences in GL, grain width, plant height, and panicle length between the two parents (Table 1, Figure  1B). Genetic analysis of GL for 1,307 separate F 2 individuals showed that the frequency distribution of GL was continuously distributed, that the means deviated from those of the long-grain parent, and that the variation range of the GL was very large, from 7.07 mm to 13.33 mm, in the F 2 population ( Figure 1C). Therefore, the GL of the two parents was a typical quantitative trait controlled by multiple genes. Analysis of the quality of the bulked DNA sequence data Fifty individuals that produced extremely long grains (11.57-13.33 mm) and 50 individuals that produced extremely short grains (7.07-8.13 mm) from the F 2 population were selected for preparation of the longseed pool (L-pool) and short-seed pool (S-pool). In total, 100.60 Gb of raw data were obtained from four samples (KJ01, Huaye 4, the L-pool and the S-pool), and 100.22 Gbp of clean data were attained after ltering. The Q30 percentage was 93.86%, and the guanine-cytosine (GC) content was 45.83%. Of the high-quality data, 55,999,600 reads were obtained from KJ01, and 45,059,434 reads were obtained from Huaye 4. The read numbers for the S-pool and L-pool were 127,409,412 and 106,871,512, respectively. The average percentage between the sample and the reference genome was 96.64%, the average coverage depth was 55x, and the genome coverage was 96.51% (at least one base coverage). Four samples (one each of KJ01, Huaye 4, the L-pool and the S-pool), with 16.72 Gb, 13.46 Gb, 31.93 Gb and 38.09 Gb, respectively, were sequenced, and their GC contents were 45.72%, 45.54%, 46.04% and 46.03%, respectively.
The coverage of the genome can re ect the amount of variation from the reference genome. The more regions are covered, the more mutation sites can be detected. The coverage was in uenced mainly by sequencing depth and distance between the sample and reference genome. The depth of genome coverage affects the accuracy of mutation detection, and the accuracy of mutation detection is relatively high in regions with relatively high coverage depths (nonrepetitive sequence regions).
Sequencing quality statistics showed that our sequencing data were of high quality, and good consistency, and could be used for subsequent bioinformatics analysis.
Detection of SNPs between samples and reference genomes SNP mutations can be divided into transition and transversion types. Mutations involving the same type of bases, such as changes between a purine and another purine, and between a pyrimidine and another pyrimidine, are called transitions. Mutations involving different types of bases, such as changes between a purine and pyrimidine, are called transversions. Generally, transitions are more likely to occur than transversions, so the transition/transversion (Ti/Tv) ratio is generally greater than one. For diploid or polyploid species, if a certain SNP site on homologous chromosomes is the same base, the SNP site is referred to as homozygous. If a SNP site on a homologous chromosome involves different types of bases, the SNP site is referred to as heterozygous SNP.
In the present study, the higher the number of homozygous SNPs was, the greater the difference between the sample and the reference genome were, and the greater the number of heterozygous SNPs was, the higher the degree of heterozygosity of the sample was. These results are related to the material used. Compared with the number of SNPs in the reference genome, the number of SNPs in KJ01 and Huaye 4 were 1,443,316 and 1,245,956, respectively. For the two bulked pools, the numbers of SNPs in the S-pool and L-pool were 2,216,645 and 2,136,482, respectively (Table S2). The number of SNPs common to both the L-pool and S-pool was 2,111,496 ( Figure 2). Finally, 2,337,510 SNPs were selected for further analysis after two rounds of sequencing and the exclusion of low-quality fragments.
For SNP index association analysis, 5,201 SNPs with multiple mutations were ltered, 772,175 SNPs were the same in the two bulk pools, 23,119 SNPs had a read support of less than four, and 311,069 SNPs were nonexistent in the parents. A total of 1,225,946 SNPs were ultimately selected for further analysis. Similarly, 1,773,999 SNPs were selected for euclidean distance (ED) association analysis.

SNP index and ED association analysis
Two BSA-seq analysis methods, the SNP index and ED, were used to identify the candidate genomic regions associated with GL. After ltering, a total of 1,225,946 SNPs were used for association analysis through the SNP index method. A SNP index of the L-pool and S-pool was calculated for each identi ed SNP in the genome, and an average SNP index was computed for each 1 Mb interval via a 10 kb sliding window ( Figure 3A, 3B). By combining the SNP index information of the L-pool and S-pool, we calculated a Δ(SNP index), and Δ(SNP index) trends were visualized by means of a sliding window ( Figure 3C). Using the association threshold of 0.99, we identi ed only one genomic region distributed on chromosome 3 that was signi cantly correlated with GL. The length of the region was 25.53 Mb and contained 4,734 genes.
A total of 1,773,999 SNPs were used for association analysis through the ED method. The association threshold was 2.5958 of the ED data, and 4 associated regions on chromosome 3 were closely correlated with the GL trait ( Figure 3D). By analyzing the 4 associated regions, we found that the rst region was 0.03 Mb in the interval from 23,160,000-23,190,000 bp and contained 6 candidate genes, the second region had only one gene at the 23,210,000 bp position, the third region was 0.11 Mb in the interval from 23,250,000-23,3360,000 bp and contained 16 genes, and the fourth region was 0.20 Mb in the interval from 23,830,000-24,030,000 bp and contained 34 genes. The total length of these regions was 0.34 Mb, and they contained a total of 57 genes ( Table 2).
By combining the association results of the ED and SNP index methods, we found that the intersections of the genomic regions were consistent with the associated regions identi ed via the ED method. The 0.34 Mb region of chromosome 3, which contained 57 genes, identi ed via BSA-seq was the "hotspot" for GL of rice. There were no genes controlling GL reported in these genomic regions.
These results in turn allowed us to narrow the genomic region of the QTL linkage analysis, which identi ed the major QTLs as being within the interval of markers PSM379-RID24455, spanning 7.4 Mb (Zheng et al, 2020). Of these candidate genes, 37 were annotated in the NR database; 37 were listed in the NT database; 37 were listed in the TrEMBL database; 23 were listed in the SwissProt database; 33 were listed in the GO database; 3 were identi ed via KEGG pathway analysis; and 9 were functionally annotated in the COG database (Table S3). Afterward, 33 ORFs annotated via GO enrichment were further analyzed, of which 29 were related to cell components, 23 were related to molecular functions, and 18 were related to biological processes (Table 3).
Directed acyclic graphs generated by topGO intuitively show GO terms and their hierarchical relationship with gene enrichment in an associated region. An acyclic graph is a graphical representation of the results of GO enrichment analysis of genes in associated regions; branches represent inclusion relationships, and the functional scope de ned from top to bottom becomes increasingly speci c ( Figure  S1). By performing a topGO analysis of the associated regions in cell components, molecular functions, and biological processes, we found eight GO terms associated with the 37 genes, of which ve of the terms were related to cell components, two were related to biological processes, and only one was related to molecular functions (Table 4). For these results, the smaller the Kolmogorov-Smirnov (KS) value is, the more signi cant the enrichment. The plasma membrane had the most signi cant topGO enrichment (GO:0005886). The three nonsynonymous mutated genes in the associated region, ORF13, ORF14 and ORF19, were involved in plasma membrane enrichment according to the GO annotation.
In addition, ORF33 and ORF25 were involved in intracellular membrane-bound organelles according to the GO annotation (GO:0043231; GO:0044444). The nonsynonymous mutated ORF33 gene was annotated as being associated with extensin in the SwissProt database. Moreover, this gene was found to be related to extensor proteins in tobacco. Studies have shown that extensor proteins are important structural proteins in the plant cell wall and play a role in cell wall strengthening. Studies have also shown that the expression of extensor proteins is negatively correlated with the degree of cell expansion, and increasing the expression of extensor proteins may promote an increase in cell density in local areas of the tissues or organs in which they are expressed (Roberts et al. 2006). ORF25 encodes a protein that participates in carbohydrate transport and metabolism, and is considered a formin-like protein according to the SwissProt database. Among the molecular functions, ATP binding was most signi cant (GO:005524), and ORF13 and ORF14, two nonsynonymous mutated genes, were also enriched in ATP binding.  On the basis of the results of the GO annotation, ORF14, ORF33 and ORF25 were selected to verify their molecular function via CRISPR/Cas9 gene-editing technology. Interestingly, after the ORF33 gene was knocked out in Huaye 4 by CRISPR/Cas9 gene-editing technology, the target gene was inserted such that there was an extra base in the exon, which resulted in a frameshift mutation, and the GL increased by 0.11 compared with that of its wild type in the T0 generation. The edited homozygous T0 seeds were planted, and T1 seeds were retrieved, which were subsequently planted; the genotype and phenotype of T1 were investigated. The mutant phenotype of the T1 plants was stable in the segregating population ( Figure 4). The GL of T1 plants became longer if the ORF33 gene was edited, with or without the presence of the hygromycin gene. These results indicated that ORF33 regulated GL, and was the qGL3.5 target gene.
ORF33 of Huaye 4 was knocked out via CRISPR/Cas9. Shown are the differences in phenotype and genotype between wild-type and mutant plants. A, Plant type comparison between the wild type and mutant; B, GL comparison between the wild type and mutant. C, BLAST-reported sequence of the two target sites.

Discussion
BSA-seq enhances mapping accuracy and saves time The high-throughput Illumina sequencing platform can be used to obtain a large amount of data in a short time. It can be used to construct a double terminal sequencing library, and its base accuracy rate exceeds 98.5% (Metzker 2010). Similarly, BSA is a good time-saving method for the identi cation of QTLs .
In this study, a total of 100.22 Gb (clean data) was attained after ltering of four different materials. The average percent similarity between the sample and the reference genome (version RGAP7 of Oryza sativa japonica) was 96.64%, the average coverage depth was 55x, and the genome coverage was 96.51%. For the two bulked pools, there were 2,216,645 and 2,136,482 SNPs in the S-pool and L-pool, respectively (Table S2). Ultimately, 1,225,946 SNPs were selected for SNP index analysis, and 1,773,999 SNPs were selected for ED association analysis. By combining the association results of the ED and SNP index methods, we mapped the candidate genes of GL to an interval of 0.34 Mb on chromosome 3 and narrowed the mapping region of a previous QTL genetic linkage analysis (Zheng et al. 2020).
Wild rice contains many valuable genes As research has advanced, the genetic resources derived from cultivated rice have become increasingly narrower. Because it has excellent genes associated with resistance and yield, wild rice has become a good candidate to explore valuable genes to enhance rice productivity (Rahman et al. 2007;Guo et al. 2009 The biological function of most cloned GL genes was involves extension, and there is a correlation between cell elongation and the expression of extensor proteins. Previous studies have shown that the expression level of extensors is negatively correlated with the degree of cell extension, and increasing the expression level of extensors may promote an increase in cell density in the local area of tissues or organs in which these genes are expressed (Roberts et al. 2006  that GS3, GW2 and qSW5 interact, qSW5 and GW2 positively regulate the expression of GS3, and GS3 explains the GL effect of qSW5. Perhaps qGL3.5 interacts with these genes, which needs further veri cation. It is very important to clearly elucidate the regulatory network of these GL-related genes, which is valuable to rice grain shape breeding. GL is a complex train that is usually controlled by multiple genes. Zheng et al. (2020) identi ed three major QTLs in a gene mapping region. There may be other genes in addition to qGL3.5 that regulate GL in our mapping region. Therefore, the molecular functions of other candidate genes need to be veri ed.

Conclusion
In this study, BSA-seq was used to rapidly and successfully explore candidate genes controlling rice GL.
Fifty individuals that produce extremely long grains and 50 individuals that produce extremely short grains among an F 2 population of 1,307 individuals were selected to construct an L-pool and an S-pool.
In total, 100.60 Gb of raw data were obtained from four samples (KJ01, Huaye 4, the L-pool and the Spool), and 100.22 Gbp (clean data) was used after ltering. The average percent similarly between the sample and the reference genome was 96.64%, the average coverage depth was 55x, and the genome coverage was 96.51%. Four samples (KJ01, Huaye 4, the L-pool and the S-pool) were sequenced, with 16.72 Gb, 13.46 Gb, 31.93 Gb and 38.09 Gb, respectively. Compared with the number of SNPs in the reference genome, 1,225,946 SNPs were selected for SNP index analysis, and 1,773,999 SNPs were selected for ED association analysis. By combining the association results of the ED and SNP index methods, we found that four associated regions on chromosome 3 were closely correlated with the GL trait. In these genomic regions, no genes controlling GL have been reported. The total length of these regions was 0.34 Mb, containing 37 genes. GO enrichment analysis revealed that 33 ORFs were annotated. Among them, three genes whose molecular function involved the plasma membrane and intracellular membrane-bound organelles were detected via CRISPR/Cas9 gene-editing technology. ORF33 was veri ed to regulate GL and was the qGL3.5 target gene.