Construction and Genetic Analysis of a Worldwide Core Collection of Rapeseed


Rapeseed (Brassica napus) is an important oilseed crop that is widely planted around the world. In a previous study, we collected 991 rapeseed accessions from worldwide germplasm and revealed genetic polymorphisms within this germplasm by whole-genome resequencing. However, managing such a large number of accessions is time-consuming, laborious and costly. Therefore, we constructed a core collection of rapeseed consisting of 300 worldwide accessions based on their genetic diversity. Compared with the 991 accessions, the worldwide core collection showed a similar geographic distribution, proportion of the three ecotypes, nucleotide diversity and set of SNPs associated with flowering time. In addition, using the rapeseed core collection, we identified an FT ortholog (BnaA02g12130D) and an FLC ortholog (BnaA10g22080D) responsible for flowering time and ecotype differentiation through selective sweep analysis and a genome-wide association study (GWAS) of flowering time. FT and FLC are two well-known genes regulating flowering time in Arabidopsis. These results indicate that the worldwide core collection can represent the genetic diversity of the 991 worldwide accessions and could be used more efficiently for phenotypic and genetic studies in rapeseed.


Introduction
Rapeseed (Brassica napus L., AACC, 2n=38) is one of the most economically important oilseed crops in the world. It originated about 7,500 years ago through hybridization between B. rapa (AA, 2n=18) and B. oleracea (CC, 2n=18) (Chalhoub et al., 2014). With the industrial development of society, rapeseed was widely planted in the late nineteenth century because of its high level of erucic acid. However, excessive erucic acid and glucosinolates in rapeseed are undesirable components of edible oil and are harmful to health (Li et al., 2012). Consequently, single-low (low erucic acid) and double-low (single-low plus low glucosinolate content) cultivars were generated (Li et al., 2012). These cultivars and their derivatives were then spread to different countries to develop new canola cultivars adapted to the local climate and environment (Li et al., 2012).
To meet the demands of modern agriculture, many new cultivars have been released with improved tolerance to diverse biotic and abiotic stresses (Rygulla et al., 2007; Wang et al., 2005).
In a previous study, we collected 991 accessions of rapeseed from 39 countries (Wu et al., 2019). However, managing such a large number of rapeseed accessions is time-consuming, laborious and costly. To improve efficiency, a worldwide core collection of rapeseed urgently needed to be constructed. A core collection is a limited set of accessions selected to represent the genetic diversity of the whole collection (Frankel et al., 1984). Core collections have been constructed for many crop species, such as rice (Oryza sativa), cotton (Gossypium barbadense) and soybean (Glycine max) (Li et al., 2003; Xu et al., 2006; Wang et al., 2006). Establishing a core collection is an efficient way to maintain and evaluate germplasm collections (Holbrook et al., 2000).
Construction of a core collection generally requires three steps. First, the size of the core collection is decided (Van Hintum et al., 2000). Then, the accessions are clustered into non-overlapping groups based on the available information, such as the origin of the accessions, morphological and agronomic characters, and molecular markers (Van Hintum et al., 2000; Odong et al., 2013). Finally, accessions are selected from each cluster using a certain allocation strategy (Van Hintum et al., 2000; Brown et al., 1989). As the theory and practice of core collections have matured, many strategies have been used to determine the number of entries per group, such as the constant (C) strategy, the proportional (P) strategy, the logarithmic (L) strategy and the maximum (M) allelic richness strategy (Van Hintum et al., 2000). Among them, the C-, P- and L-strategies are mainly based on morphological and agronomic characters, such as grain yield, leaf area and plant height (Van Hintum et al., 2000; Odong et al., 2013). The M-strategy is based on marker diversity, such as simple sequence repeats (SSRs) and single nucleotide polymorphisms (SNPs) (Van Hintum et al., 2000; Odong et al., 2013). With the development of bioinformatic methods, many software packages for selecting core collections have been developed and released based on the above strategies, such as MSTRAT, PowerCore and Core Hunter3 (De Beukelaer et
In this study, we established a worldwide core collection of rapeseed using the Core Hunter3 software and compared the genetic diversity of the 300-accession core collection with that of the 991 accessions using genome-wide approaches, including principal component analysis (PCA), patterns of linkage-disequilibrium (LD) decay, genome-wide scans of nucleotide diversity, selective sweep analysis and genome-wide association studies (GWAS). The results demonstrate that the worldwide core collection of rapeseed can represent the genetic diversity within the 991 accessions. The construction of this core collection will improve the efficiency of management and utilization of rapeseed resources in the future.
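The constant, proportional and logarithmic allocation strategies mentioned above can be sketched as follows. This is a minimal illustration, not the procedure used in this study (which relied on Core Hunter3): the function name `allocate`, the largest-remainder rounding and the example group sizes are assumptions introduced here.

```python
import math


def allocate(group_sizes, core_size, strategy="P"):
    """Return how many entries to draw from each group.

    strategy: "C" (constant), "P" (proportional to group size)
              or "L" (proportional to the log of group size).
    """
    if strategy == "C":
        weights = [1.0] * len(group_sizes)
    elif strategy == "P":
        weights = [float(n) for n in group_sizes]
    elif strategy == "L":
        weights = [math.log(n) for n in group_sizes]
    else:
        raise ValueError("strategy must be 'C', 'P' or 'L'")

    total = sum(weights)
    quotas = [core_size * w / total for w in weights]
    counts = [int(q) for q in quotas]  # round down first

    # Hand out the remaining entries to the largest fractional parts
    # so that the allocation sums exactly to core_size.
    leftover = core_size - sum(counts)
    order = sorted(range(len(quotas)),
                   key=lambda i: quotas[i] - counts[i], reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts


# Hypothetical group sizes that sum to 991 (not taken from the study).
groups = [600, 250, 141]
constant = allocate(groups, 300, "C")
proportional = allocate(groups, 300, "P")
logarithmic = allocate(groups, 300, "L")
```

The sketch does not cap an allocation at the size of its group, which a real implementation would need when groups are small.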

Materials And Methods
Construction of a worldwide core collection of rapeseed
Based on the nucleotide diversity of our 991 re-sequenced worldwide accessions of rapeseed (Wu et al., 2019), the Core Hunter3 software was used to construct a core collection consisting of 300 accessions (De Beukelaer et al., 2018). In total, 4,124,743 SNPs from the 300 accessions were extracted from the VCF file of the 991 accessions using VCFTOOLS (http://vcftools.sourceforge.net/; v1.17). From these, 2,291,926 high-quality SNPs were obtained using a minor allele frequency (MAF) > 5% and a missing data rate threshold of 50% in PLINK (https://www.cog-genomics.org/plink2; v1.9). These high-quality SNPs were used in the following analyses.
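The SNP quality filter described above can be illustrated with a short sketch. It is not the PLINK implementation; the function names and the toy genotypes are hypothetical, and the direction of the missing-rate comparison (strictly below 50%) is an assumption, because the comparison symbol is not legible in the text.

```python
def site_stats(genotypes):
    """Return (maf, missing_rate) for one biallelic SNP.

    genotypes: alternate-allele count per accession (0, 1 or 2),
               or None where the genotype call is missing.
    """
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return 0.0, missing_rate
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq), missing_rate


def keep_site(genotypes, min_maf=0.05, max_missing=0.5):
    """Keep a SNP if MAF > min_maf and missing rate < max_missing (assumed)."""
    maf, missing_rate = site_stats(genotypes)
    return maf > min_maf and missing_rate < max_missing


# Toy genotype vectors (hypothetical).
common = [0, 0, 1, 2, None, 0, 0, 0, 1, 0]   # MAF = 4/18, 10% missing
rare = [0] * 19 + [1]                        # MAF = 1/40
sparse = [1, None, None, None]               # 75% missing
```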
Population structure, phylogenetic and principal component analysis
To minimize the interference of extensive strong LD, we filtered out SNPs with pairwise LD (r²) > 0.2 (sliding window of 100 kb, step of 10 kb) using PLINK (https://www.cog-genomics.org/plink2; v1.9), and 311,716 high-confidence SNPs were obtained. These high-confidence SNPs were used to analyze the population structure of the 300-accession core collection using ADMIXTURE (v1.30) in haploid mode, with five different random seeds for each number of clusters K (K = 2-10) (Alexander et al., 2009). The 'delta K' was calculated for each K value and used to assess the optimal number of populations (Evanno et al., 2005). The Q matrices with the same K value were merged using CLUMPP (https://web.stanford.edu/group/rosenberglab/clumpp.html), and bar plots were drawn using the R package Pophelper (http://royfrancis.github.io/pophelper).
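The delta K statistic can be sketched as below. This follows the commonly used form ΔK = |L(K+1) - 2L(K) + L(K-1)| / sd(L(K)), computed from the mean log-likelihood over replicate runs; whether the authors computed it in exactly this way is an assumption, and the log-likelihood values are invented for illustration.

```python
from statistics import mean, stdev


def delta_k(loglik_by_k):
    """Compute delta K for every K that has both neighbours K-1 and K+1.

    loglik_by_k: dict mapping K to a list of log-likelihoods,
                 one value per replicate run (different random seeds).
    """
    means = {k: mean(v) for k, v in loglik_by_k.items()}
    result = {}
    for k in sorted(loglik_by_k):
        if k - 1 in means and k + 1 in means:
            second_diff = abs(means[k + 1] - 2 * means[k] + means[k - 1])
            # stdev is zero if all replicates agree; real data would
            # need a guard against division by zero here.
            result[k] = second_diff / stdev(loglik_by_k[k])
    return result


# Hypothetical log-likelihoods from three replicate runs per K.
runs = {
    2: [-1000, -1002, -998],
    3: [-900, -901, -899],
    4: [-880, -884, -876],
    5: [-870, -875, -865],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
```

In this toy input the likelihood gain flattens after K = 3, so delta K peaks there.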

Selective sweep analysis and genome-wide association studies
The fixation statistic (Fst) between different rapeseed ecotypes was calculated using VCFTOOLS (http://vcftools.sourceforge.net/; v1.17) with 100-kb sliding windows and a 10-kb step size. Sliding windows whose mean Fst exceeded the 97% threshold were considered significant windows. The genes located in the significant windows were considered selective sweep candidate genes.
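The window scheme and the selection of significant windows can be sketched as follows. The sketch assumes that the 97% threshold refers to the 97th percentile of the genome-wide distribution of window Fst values (that is, the top 3% of windows); this reading, the function names and the toy values are assumptions rather than details confirmed by the text.

```python
def window_starts(chrom_length, size=100_000, step=10_000):
    """Start coordinates of sliding windows that fit on a chromosome."""
    last_start = max(chrom_length - size, 0)
    return list(range(0, last_start + 1, step))


def significant_windows(windows, percent=97):
    """Return the windows ranked above the given percentile of mean Fst.

    windows: list of (chrom, start, mean_fst) tuples.
    """
    ranked = sorted(windows, key=lambda w: w[2])
    cut = len(ranked) * percent // 100   # integer arithmetic, no rounding error
    return ranked[cut:]


# Toy example: 100 windows with mean Fst 0.00, 0.01, ..., 0.99 (hypothetical).
toy = [("A02", i * 10_000, i / 100) for i in range(100)]
top = significant_windows(toy)
starts = window_starts(1_000_000)
```

Genes overlapping the returned windows would then be taken as candidates, as described above.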
A total of 2,291,926 high-quality SNPs were used to perform GWAS of flowering time in the 300-accession core collection. The flowering time data were recorded in our previous study (Wu et al., 2019). The kinship matrix (K) was calculated in TASSEL (http://www.maizegenetics.net/tassel; v5.0). The PCA matrix (Q) was calculated using the smartPCA software (https://github.com/DReichLab/EIG; v.6.0.1). The K and Q matrices were used in the weighted mixed linear model (MLM) in TASSEL (http://www.maizegenetics.net/tassel; v5.0). The -log10 P value was calculated for each SNP, and SNPs with -log10 P > 6 were defined as significant SNPs. Manhattan plots were drawn using the R package CMplot (https://github.com/YinLiLin/R-CMplot).
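The significance rule for GWAS hits can be expressed in a few lines. The SNP identifiers and P values below are hypothetical; only the -log10 P > 6 cutoff comes from the text.

```python
import math


def significant_snps(results, threshold=6.0):
    """Return (snp_id, chrom, -log10 P) for SNPs passing the cutoff.

    results: list of (snp_id, chrom, p_value) tuples from a GWAS run.
    """
    hits = []
    for snp_id, chrom, p_value in results:
        score = -math.log10(p_value)
        if score > threshold:
            hits.append((snp_id, chrom, score))
    return hits


# Hypothetical association results.
gwas = [
    ("snp1", "A02", 1e-8),
    ("snp2", "A02", 1e-3),
    ("snp3", "A10", 5e-7),
]
hits = significant_snps(gwas)
```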

Results
The core collection can represent the genetic diversity of 991 worldwide accessions of rapeseed
The core collection consists of 300 accessions from 29 countries and includes 55 semi-winter, 58 spring and 187 winter type accessions, a proportion of the three ecotypes similar to that in the 991 worldwide accessions (Fig. 1a, b). At the genomic level, the core collection contains 98% of the InDels and 97% of the SNPs in the 991 accessions (Fig. 1c). Moreover, the core collection represents more than 94% of the SNPs in the 991 accessions on each chromosome (Table 1) and shows a similar distribution of accessions along PC1 and PC2 (Fig. 1d). This indicates that the 300-accession core collection can represent the genetic diversity of the 991 worldwide accessions of rapeseed.
We further analyzed the LD structures of the genomes in the core collection and the 991 accessions (Fig. 1e; Table 1). LD decayed faster in the A-subgenome than in the C-subgenome in both populations (Fig. 1e), suggesting that the frequency of recombination is higher in the A-subgenome than in the C-subgenome. The LD decay distance in the core collection (15.94 kb, half-maximum r² = 0.31) was similar to that in the 991 accessions (12.10 kb, half-maximum r² = 0.30) (Table 1). In addition, 456,666 LD blocks (< 1 kb: 440,555; 1-10 kb: 15,340; > 10 kb: 771) and 553,398 LD blocks (< 1 kb: 544,441; 1-10 kb: 8,505; > 10 kb: 452) were detected in the core collection and in the 991 accessions, respectively (Table 1), indicating a high frequency of genetic recombination in both populations. In short, the LD structure of the core collection strongly resembles that of the 991 accessions of rapeseed (Fig. 1e; Table 1).
The core collection has similar patterns of nucleotide diversity, geographic distribution and the proportion of three ecotypes with 991 worldwide accessions
In this study, we compared nucleotide diversity between the core collection and the 991 accessions (Fig. 2, Table S1). The statistic pi was used to calculate nucleotide diversity for SNPs located on the 19 chromosomes. In the A-subgenome, chromosome A04 had the highest pi value (1.34 × 10⁻³ for the core collection and 1.28 × 10⁻³ for the 991 accessions; values are given in this order hereafter) and A09 showed the lowest pi value (9.0 × 10⁻⁴ and 8.59 × 10⁻⁴) (Table S1). In the C-subgenome, the highest pi value was found on C01 (1.42 × 10⁻³ and 1.38 × 10⁻³), and the lowest pi value was detected on C05 (6.62 × 10⁻⁴ and 6.51 × 10⁻⁴) (Table S1). These results show that the core collection has similar nucleotide diversity to, and can represent that of, the 991 worldwide accessions (Fig. 2, Table S1).
To investigate the genetic structure of the core collection, we performed population structure analysis, PCA and phylogenetic analysis. The core collection can be divided into three groups (K = 3) (Fig. 3), which is similar to the population structure of the 991 accessions (Wu et al., 2019). Groups 1, 2 and 3 include 62, 65 and 173 accessions, respectively, and mainly correspond to the semi-winter, spring and winter types (Table S2). In detail, 82% (51/62) of the accessions in group 1 are semi-winter type, 83% (54/65) of the accessions in group 2 are spring type, and 95% (168/173) of the accessions in group 3 are winter type (Table S2). The PCA showed that PC1 distinguished group 3 from groups 1 and 2, while PC2 separated group 2 from group 1 (Fig. 3c). This demonstrates that groups 1, 2 and 3 are genetically different from each other.
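For readers unfamiliar with the statistic, pi can be computed from allele counts as the average number of pairwise differences per site. The sketch below uses this standard definition for biallelic SNPs without missing data; it is not necessarily the exact computation used in this study, and the function name and toy counts are hypothetical.

```python
def nucleotide_diversity(alt_counts, n_chrom, seq_len):
    """Average pairwise differences per site (pi) for a region.

    alt_counts: alternate-allele count at each biallelic SNP in the region.
    n_chrom:    number of sampled chromosomes (2 x accessions for diploid calls).
    seq_len:    length of the region in base pairs, including invariant sites.
    """
    n_pairs = n_chrom * (n_chrom - 1) / 2
    # A site with j alternate and (n - j) reference alleles differs
    # in j * (n - j) of all chromosome pairs.
    differing_pairs = sum(j * (n_chrom - j) for j in alt_counts)
    return differing_pairs / n_pairs / seq_len


# Toy region: 4 chromosomes, 1,000 bp, two SNPs with alt counts 2 and 1.
pi_toy = nucleotide_diversity([2, 1], n_chrom=4, seq_len=1000)
```

Here the two SNPs contribute 4/6 and 3/6 differences per pair, so pi is (7/6)/1000, about 1.17 × 10⁻³ per site.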

Selective regions were identified by selective sweep analysis in the core collection
To determine genomic regions with strong selective sweep signals caused by natural or artificial selection pressure in the core collection, the Fst method was employed to measure genomic differentiation between rapeseed ecotypes (Fig. 4a). The Fst values between spring and winter types, semi-winter and spring types, and semi-winter and winter types ranged from 0.33 to 0.59, 0.27 to 0.52, and 0.36 to 0.66, respectively (Tables S3-5). This indicates that there are regions under strong selection among the three ecotypes in the core collection. The candidate genes located in the selective sweep regions are probably associated with ecotype differentiation. In particular, we identified a FLOWERING LOCUS T (FT) ortholog (BnaA02g12130D) between spring and winter types and a FLOWERING LOCUS C (FLC) ortholog (BnaA10g22080D) between semi-winter and winter types (Fig. 4a); FT and FLC are two well-known genes regulating flowering time in Arabidopsis. Consistently, the same FT and FLC orthologs were identified in our previous study (Wu et al., 2019).
Two key genes associated with flowering time were identified by GWAS in the core collection
Flowering time is an important trait associated with yield and ecotypes of B. napus (Amasino et al., 2010). In this study, we recorded the flowering time in the core collection of rapeseed. It ranged from 125 to 201 days, similar to that of the 991 accessions (125 to 206 days) (Table 2). In addition, the summary statistics and coefficient of variation (CV) of flowering time in the core collection were similar to those in the 991 accessions (Table 2). This indicates that the core collection can represent the 991 accessions in terms of flowering time. The FT ortholog (BnaA02g12130D) and FLC ortholog (BnaA10g22080D), located on A02 and A10, respectively, showed the strongest selection signals (Fig. 4a). We therefore performed GWAS for flowering time using the core collection and identified 37 and 111 significant SNPs (-log10 P ≥ 6) on A02 and A10, respectively (Fig. 4b, Table S6). Interestingly, the FT ortholog (BnaA02g12130D) was located within 100 kb of an SNP with a -log10 P of 7.45 (Fig. 4b). Five SNPs significantly associated with flowering time were found in the 5 kb upstream region of the FLC ortholog (BnaA10g22080D) (Fig. 4b). Taken together, FT (BnaA02g12130D) and FLC (BnaA10g22080D), identified by GWAS, are two key genes associated with flowering time in the core collection of rapeseed.

Discussion
Constructing a good core collection can improve the efficiency of management and utilization of accessions stored in germplasm banks (Frankel et al., 1984). In this study, we constructed a worldwide core collection of rapeseed that includes 300 accessions representing 97% of the SNPs from 991 worldwide accessions of rapeseed (Fig. 1c). A series of analyses indicates that this core collection is of high quality and useful for phenotypic and genetic studies in the rapeseed community.
The quality of a core collection determines the quality of subsequent research. Thus, the most important task is to verify the quality of a core collection (Odong et al., 2013). With the improvement of theory and practice, a set of criteria for determining the quality of a core collection has come into use, such as summary statistics, the Shannon diversity index (SH) and class coverage (Odong et al., 2013). Different criteria are suitable for different kinds of traits. Summary statistics such as the mean, range and CV are widely used to evaluate the quality of a core collection based on continuous data, such as quantitative traits (Hu et al., 2000). SH and class coverage are suitable for evaluating core collections based on categorical data, such as qualitative traits and genetic markers (Odong et al., 2013). In addition, PCA can be used to visualize how the core collection is distributed within the whole collection (Mahajan et al., 2007). In this study, we found that the summary statistics of flowering time in the 300-accession core collection were similar to those in the 991 accessions (Table 2). In the PCA, the plot of PC1 and PC2 indicated that the accessions of the core collection were evenly distributed among the 991 accessions (Fig. 1d). Moreover, the geographic distribution, the proportion of ecotypes (spring, semi-winter and winter), the number of indels and SNPs, nucleotide diversity and LD decay showed that the core collection can represent the genetic diversity of the 991 accessions of rapeseed (Fig. 1, 2; Table 2).
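Two of the evaluation criteria named above, the coefficient of variation for a continuous trait and the Shannon diversity index for categorical data, can be sketched as follows. The flowering-time values are hypothetical; the ecotype counts (55, 58 and 187) are those reported for the core collection, and the use of the natural logarithm is an assumption.

```python
import math
from statistics import mean, stdev


def coefficient_of_variation(values):
    """CV in percent: sample standard deviation divided by the mean."""
    return stdev(values) / mean(values) * 100


def shannon_index(counts):
    """Shannon diversity index H = -sum(p * ln p) over non-empty classes."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


# Hypothetical flowering times in days (not data from the study).
cv_toy = coefficient_of_variation([150, 160, 170])

# Ecotype counts of the core collection: semi-winter, spring, winter.
h_ecotypes = shannon_index([55, 58, 187])
```

Comparing such values between the core and the whole collection is one way to check that the core retains the spread of a trait or the balance of classes.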
The major objective of creating a core collection is the efficient use of genetic resources (Mackay et al., 1995 Mini-Core Collection were used to perform a GWAS for tillering, and OsTCP19 was identified as a major gene responding to rice tillering and nitrogen use (Liu et al., 2021). In another study, 203 rice varieties from a rice mini-core collection were used in a GWAS for amylose content, seed length and pericarp color. For amylose content and seed length, the causal SNPs associated with these traits were identified. Recently, a core collection of upland cotton (Gossypium hirsutum) comprising 419 accessions was used in a GWAS for 13 fiber-related traits, which identified several important genes for fiber length, fiber strength and flowering time (Ma et al., 2018). These studies imply that a high-quality core collection can be used for exploiting genetic resources in crops.
Flowering time is an important trait that plays a crucial role in the response of plants to environmental cues and endogenous pathways (Amasino et al., 2010). Rapeseed is the second largest oil crop and has spread to different countries with different climates and environmental conditions (Chalhoub et al., 2014). To adapt to various growth conditions, rapeseed has formed three ecotypes (winter, semi-winter and spring) by regulating its flowering time. To detect natural or artificial selection among the three ecotypes and candidate genes responsible for flowering time, we investigated genomic divergence among the three ecotypes in the core collection by calculating pairwise fixation index (Fst) values and performing GWAS for flowering time. As expected, both the selective sweep analysis and the GWAS identified the FT ortholog (BnaA02g12130D) and the FLC ortholog (BnaA10g22080D) in the core collection (Fig. 4). FT and FLC are two key genes regulating flowering time and ecotype differentiation in Arabidopsis (Michaels et al., 1999; Corbesier et al., 2007).
Undoubtedly, the construction of a worldwide core collection greatly improved efficiency and was successfully used to identify important genes (Wang et

Conclusions
In this study, we constructed a high-quality core collection of rapeseed based on genetic diversity, which includes 300 worldwide accessions from 29 countries. Through PCA, LD and nucleotide diversity analyses, we confirmed that the core collection can represent the genetic diversity of the 991 worldwide accessions. Furthermore, we identified an FT ortholog (BnaA02g12130D) and an FLC ortholog (BnaA10g22080D) responsible for the flowering time of rapeseed through selective sweep analysis and GWAS. These results indicate that the worldwide core collection of rapeseed improves the efficiency of phenotypic and genetic experiments and is useful for research and rapeseed breeding.

Figure 1 Comparison of geographic distribution, ecotypes and genetic diversity between the core collection and the 991 worldwide accessions of rapeseed. a The geographic distribution of the core collection (300 accessions) and the 991 worldwide rapeseed accessions. b The proportion of spring, semi-winter and winter types in the core collection and the 991 accessions, respectively. c The number of indels and SNPs in the core collection, which account for 98% and 97% of those in the 991 accessions. d The PCA plot of the accessions in the core and whole collections. The blue and red points indicate the accessions in the whole and core collections, respectively. e LD decay of the A- and C-subgenomes in the core and whole collections. The blue and red lines represent the rate of LD decay in the A- and C-subgenomes, respectively. The solid and dashed lines represent the core collection and the 991 accessions, respectively.

Figure 3 Population structure, PCA and phylogenetic tree of the core collection of rapeseed. a Model-based clustering analysis of the core collection (300 accessions) with different numbers of groups (K = 2-10) using ADMIXTURE. The x-axis represents the accessions, and the y-axis quantifies cluster membership. Different colors indicate different clusters. b The ΔK estimated for the population structure analysis. c The PCA plot of the core collection along PC1 and PC2. d Phylogenetic tree of the core collection.