Identification of new QTLs for resistance to Plasmodiophora brassicae in Brassica napus using genome wide association mapping


 Background Clubroot of canola ( Brassica napus ), caused by the obligate pathogen Plasmodiophora brassicae Woronin, is a major disease worldwide. Genetic resistance remains the best strategy to manage this disease. The objective of the study was to identify and map new sources of resistance to clubroot in B. napus using genome-wide association mapping. The reaction of a collection of 177 accessions to four highly virulent pathotypes of P. brassicae was assessed. These pathotypes were selected because they were most recently identified and showed different virulence patterns on the Canadian clubroot differential (CCD) lines. The collection was then genotyped using genotyping by sequencing (GBS) method. Multi-locus mixed linear model (MMLM) was used to perform the association analysis. Results The majority of accessions were highly susceptible (70 –100 DSI), while few individual accessions showed strong resistance (0–20 DSI) to 5X (2 accessions), 2B (7 accessions), 3A (8 accessions) and 3D (15 accessions). In total, 301,753 SNPs were mapped to 19 chromosomes. Population structure analysis indicated that the 177 accessions belong to two major populations. SNPs were associated with resistance to each pathotype using MLMM. In total, 23 significant SNP loci were identified, with 14 SNPs mapped to the A-genome and 9 to the C-genome. The SNPs were associated with resistance to pathotypes 5X (4 SNPs), 2B (9), 3A (5) and 3D (5). A blast search of 2 Mb upstream and downstream identified 61 disease resistance genes, of which 24 belonged to TIR-NBS-LRR proteins and 20 belonged to CC-NBS-LRR proteins. The distance between a SNP locus and the nearest resistance genes ranged from 0.11–1.66 Mb. This indicated that NBS-LRR gene family might have an important role in clubroot resistance in B. napus . Conclusion The resistant B. napus lines and the SNP markers identified in this study can be used for breeding for resistance to clubroot and contribute to understanding the genetic mechanism of resistance to clubroot.

tags from each file were captured and merged to produce a master tag file of 4,253,499 sequence tags. The tags were then aligned to B. napus reference genome v4.1, using the TASSEL-GBS pipeline. A total of 2,217,292 (52.1%) tags were uniquely aligned to the reference, 1,220,090 (28.7%) aligned to multiple positions and 816,117 (19.2%) were not aligned. Uniquely mapped tags were used to calculate the tag density distribution at each site in the B. napus genome and for SNP calling.
The raw sequence data for SNP calling were also analysed using the TASSEL-GBS pipeline.
A total of 399,234 unfiltered SNPs and 355,680 filtered SNPs were called for the 177 accessions, with a mean of individual depth of 8.5 ± 2 SD and mean site depth of 6.7 ± 11.4 SD. Of the 355,680 filtered SNPs, 301,753 SNPs were mapped to the 19 chromosomes; the remaining SNPs were randomly distributed without specific chromosome assignment. Only variants mapped to chromosomes were kept for further analyses.

Variant analysis and annotation
There were more SNPs in the C-genome (160,174 SNPs) than the A-genome (141,579 SNPs). Chromosome A03 had the highest number of SNPs within the A-genome, while C03 algorithm produced two major clusters and six subclusters ( Figure S4).

Analysis of molecular variance
Analysis of molecular variance on the six subclusters (SCA-I, II, III,SCB-I, II and III) identified significant genetic differences between major clusters, among subclusters, and among individuals within sub-clusters (p < 0.001). Variance within subclusters accounted for 87.7% of the total variance, with only 7.5% among sub-cluster and 4.7% among major clusters. The fixation index (F st ) value was 0.21, which indicated that the accessions belonged to two closely related groups (Table S3). Sub-cluster pairwise F st values ranged from 0.03 between SCB-I and SCB-II to 0.16 between SCA-I and SCB-II (Table S4).

Linkage disequilibrium analysis
Linkage disequilibrium in the association panel was calculated using Pearson's r 2 statistic on pairwise combinations of SNPs present across the 19 chromosomes of B. napus ( Figure   S5). The average LD (r 2) across the genome was 0.15. The mean LD was 0.10 in the Agenome and 0.19 in the C-genome. LD values ranged from 0.01 in A09 to 0.19 in C01 (Table 1). Across the genome, LD decayed very rapidly (r 2 = 0.20) within 300 Kb ( Figure   S5).

Association analysis
Genome-wide association analysis for clubroot severity was conducted using the following models: general linear model (GLM), mixed linear model (MLM), compressed mixed linear model (CMLM), enriched compressed mixed linear model (ECMLM), and multi-locus mixed model (MLMM). The quantile-quantile (Q-Q) plots, from all models revealed that, save for significant SNPs, the distribution of observed -log10(p) was closest to the expected distribution in the MLMM compared to other models, therefore associations were identified using this model. A significance threshold of P < 0.5/N (N: number of SNPs) was used for detecting significant SNPs. The MLMM-genome-wide association study (GWAS) detected 23 SNPs associated with resistance to the four P. brassicae pathotypes including four SNPs associated with resistance to 5X, nine SNPs to 2B, five to 3A and five to 3D. The name, physical position, P value and -log(P value) are presented in Table 2. Across genome, the A-genome carried 14 SNP loci and the C-genome carried 11 loci (Table 2, Fig. 3).
Candidate resistance genes A Blast search identified 61 nucleotide binding site/leucine-rich repeat (NBS-LRR) resistance proteins and non-NBS-LRR resistance genes within the 2 Mb sequence upstream and downstream of 19 out of 23 significant SNP loci detected in our study ( Table 3). The majority of resistance genes appeared as clusters of 2 to 10 genes, while they appeared as a single gene in other cases. On A01, one resistance gene (BnaA01g28560D) was found at ~0.5 Mb from the A01_19406286 locus associated with resistance to pathotype 5X. On A03, one Enhanced Disease Resistance 2-like (BnaA03g03110D) gene and two TIR-NBS-LRR resistance (BnaA03g03260D, BnaA03g03270D) genes were detected at 0.11-0.2 Mb distance from A03_651104541 locus associated with resistance to pathotype 2B.

Discussion
GWAS has been widely used to identify and map QTLs for quantitatively inherited traits in a wide range of plant species. In the current study, GWAS was used to identify and map new sources of resistance to four highly aggressive pathotypes (5X, 2B, 3A, 3D) of P. brassicae in 177 accessions of B. napus. The majority of the accessions were highly susceptible to all four pathotypes (80-100 DSI), while ~10% showed high levels of resistance (0-25 DSI). This supported previous reports that sources of high levels of resistance to clubroot were much less common in B. napus than in B. rapa [11,12]. In total, 23 SNPs were identified: 14 SNPs on the A-genome and 9 on the C-genome. This indicated that the A-genome (from B. rapa) carried more QTLs for clubroot resistance, but the C-genome (from B. oleracea) could be a potential source for clubroot resistance improvement [13].
One of the major factors that may affect the accuracy of GWAS analysis is the existence of population structure within the population used for GWAS. The analysis confirmed that the core collection of accessions represented two different populations. A multi-locus mixed linear model (MMLM) was used to analysis the association between the phenotypes and the SNP markers because it provided the best fit in Q-Q plots between SNP markers and the DSI for the four pathotypes for the models assessed.
QTLs for clubroot resistance have been identified previously in B. napus [21,22] and several have been mapped to chromosomes C03, C06, and C09 [13]. We believe that all 23 of the QTLs identified in the current study are novel because they were located at different physical locations on the chromosomes from QTLs identified previously and were associated with resistance to different pathotypes.
The majority of plant disease resistance genes identified to date have been classified as toll-interleukin-1 receptor/nucleotide binding site/leucine-rich repeat (TIR-NBS-LRR or TNL) proteins or coiled coil /nucleotide binding site/leucine-rich repeat (CC-NBS-LRR or CNL) proteins. The ratio of TNLs to CNLs differs among plant species, likely because their R genes are adapted to different pathogens [23,24]. About 70% of NBS-LRR genes in Brassicaceae family belongs to TNLs [25,26,27].
In the current study, 61 resistance genes were identified within 2 Mb upstream and 2 Mb downstream of the SNPs associated with resistance to the four pathotypes. The resistance genes belonged mainly to the TNL family (24 genes) or the CNL family (20 genes). The frequency of TNLs and CNLs was highest on A09 (11 genes) followed in decreasing order by A03, A08, C09, C01 and C03. The uneven distribution of TNLs and CNLs is not uncommon in other plant species [30,31,32,33]. The vast majority of TNLs and CNLs appeared in clusters of 2 to 10 TNLs and CNLs, which is similar to the results of previous studies in B. napus [29], Arabidopsis, Medicago truncatula and Solanum tuberosum [30,34,35].
The remaining resistance genes were non-TNL genes (nTNL), comprised of four enhanced disease resistance-like genes, two disease resistance-responsive (dirigent-like protein) family genes, and seven putative disease resistance proteins. A set of nTNLs with RPP13 domain (called RNLs) was also detected. A group of nTNL genes with RPW8 domain (RNL) had been identified in previous studies [27,36,37], but was not observed in the current study.
A previous phylogenetic analysis of nTNLs and CNLs from five Brassicaceae species indicated that RNLs are likely derived from the CNL lineage [27,29]. The function of RNLs is yet to be determined, but they have no direct response to the pathogen and may have not the same duplication rates as TNLs and CNLs, which explains their lower abundance in the genome [27,29]. They may have a role in defence-signal transduction [38] or as helpers of other NBS genes [38]. The role of other nTNLs is also unknown.

Conclusion
The current study identified several accessions of B. napus with high levels of resistance to four pathotypes of P. brassicae. Genome-wide association mapping analysis detected and mapped 23 SNP loci associated with resistance to the four pathotypes. This information will be used in subsequent genetic analysis of bi-parental populations to verify the SNPs and fine map the functional genes responsible for resistance to each pathotype and for marker-assisted breeding of resistance to clubroot in canola.  (Table S1). These accessions represented collections from Europe (123 accessions), Asia (29), North America (20), Oceania (2), South America (1), Africa (1), and one accession of unknown origin (Table S1). The accessions were oilseed rape (146 accessions), fodder rapa (21), Swede rape (7), rutabaga (2) and turnip (1). The growth habit was predominantly winter type (129), with some spring type (48) accessions (Table   S1).

Plant and pathogen materials
Plants for GBS analysis were grown in a growth chamber up to the 3-4 leaf stage. A total of 100 mg of leaf tissue was collected from each accession, immediately frozen in liquid nitrogen and then lyophilized in a freeze dryer for approximately 48 h. The freeze-dried tissues were ground to a fine powder using a tissue lyser (Qiagen, Newtown City, USA).

Evaluation of clubroot reaction
Seed of each host genotype was pre-germinated on moistened filter paper in a Petri dishes. One-week-old seedlings of each host line and pathotype were inoculated by dipping the entire root system in the resting spore suspension for 10 s. The inoculated seedlings were then immediately planted in 6 × 6 × 6 cm plastic pots filled with Sunshine LA4 potting mixture, with one seedling per pot. The pots were thoroughly watered and transferred to a greenhouse at 21°C ± 2°C with a 16 h photoperiod. The potting mixture was kept saturated with tap water at pH 6.5 for the first week after inoculation and then watered and fertilized as required.
Six weeks after inoculation, the seedlings were gently removed from the potting mix, the roots of each plant were washed with tap water, and each root was rated for clubroot symptom development on a 0 to 3 scale [41], where: 0 = no clubs, 1 = a few small clubs on less than one-third of the roots, 2 = moderate clubs (small to medium-sized clubs on 1/3 to 2/3 of the roots), and 3 = severe clubs (medium to large-sized clubs on > 2/3 of the roots). A DSI was then calculated using the formula of [42] as modified by [41]: Where n is the number of plants in a class; N is the total number of plants in an experimental unit; and 0, 1, 2 and 3 are the symptom severity classes.

Sequence analysis and SNP discovery
The accession sequences were analyzed using GBS. In brief, GBS involves four major steps: DNA sample preparation, library construction, library sequencing and SNP calling.
DNA extraction was performed using the DNeasy 96 plant kit as per the manufacturer's instruction (Qiagen). To reduce the genome complexity, DNA was digested with ApeKI, a methylation-sensitive restriction enzyme. The fragments produced by digestion were directly ligated to enzyme-specific adapters followed by PCR amplification. The samples divided into two pools of 96 samples each followed by two runs of Illumina HiSeq 2500 (Illumina Inc., USA). DNA alignment was generated with BWA software version 0.7.8-r455.
The GBS-TASSEL pipeline [43] was used for SNP calling, and VCF and HapMap genotype files were generated. Initial SNP filtration was performed with the following settings: MAF > 0.01 and missing data per site < 90%. Accessions with too much missing data were removed. Depth, missingness and heterozygosity were calculated using VCFtools V.0.1.12 [44]. Genotyping and SNP calling was performed at the Genomic Diversity Facility, Cornell University (http://www.bio-tech. cornell.edu/brc /brc/ services).

Variant annotation
Variants were annotated to regions of the B. napus reference genome using R, implemented using "VariantAnnotation" [45], and Variant Effect Predictor (VEP, [46]), and variant locations were characterised as coding, intron, splice site, promoter and intergenic regions.
Genetic diversity and population structure Population-based genetic diversity, including allele frequencies, MAF, and average heterozygosity, were computed using TASSEL 5.2.18 software [47]. Polymorphic information content (PIC) values [48] was calculated for SNP markers using the formula ). The ratio of transitions to transversions was calculated using the [49] 2-parameter model, implemented in MEGA7 [50].
Structure analysis of the accessions was conducted using STRUCTURE software v2.2 [51].
A subset of 10,094 SNPs was selected that was evenly distributed across the genome with one SNP per 100 Kb. The admixture model and correlated allele frequency were applied with a burn-in period of 50,000 iterations and 100,000 replications of Markov Chain Monte Carlo (MCMC). Five runs were performed to calculate the mean likelihood for the number of populations K, ranging from 1 to 10, and the mean of the log-likelihood estimates LnP(D) for each K. The ad-hoc statistic ∆K was used to determine optimal number groups [52]. Structure output was visualized using STRUCTURE HARVESTER web-based software [53]).

Analysis of molecular variance
Analysis of molecular variance (AMOVA) was conducted using Arlequin v.3.5 software [54] to estimate the genetic variance among clusters and sub-clusters of the A and C genome haplotypes. In this analysis, the distance matrix among samples was computed to estimate the genetic structure of the haplotypes. Genetic variance components were estimated, and the total variance was partitioned among major clusters, among subclusters within major clusters, and within subclusters. The significance of the variance components was tested using 1,000 permutations. The fixation index (F st ), an estimation of population differentiation and genetic distance based on genetic polymorphism data, was calculated.
Linkage disequilibrium (LD) analysis LD decay across the B. napus genome was measured and a correlation matrix of r 2 values was computed between all pairs of polymorphic SNPs with MAF ≥ 5% using the GAPIT V2 package [55].

Association analysis
Data for the disease DSI were transformed using rank-based inverse normal transformation implemented as the rntransform function in the GenABEL R [56].
Association was analyzed for a subset of 10,094 SNP markers with MAF ≥ 5% using the

Additional files
Table S1 List of accessions, their growth habits, type, origin and DSIs.

Table S2
Pearson correlation coefficient (r 2 ) between DSIs of the four pathotypes.

Evaluation of clubroot reaction
At 6 weeks after seeding, the germplasm was evaluated for resistance to four P. brassicae pathotypes; 5X, 2B, 3A and 3D. The disease severity index (DSI) ranged from 0 to 100. The majority of accessions were highly to completely susceptible (70-100 DSI), but several were highly resistant (0-20 DSI) to pathotypes 5X (21 accessions), 2B (7 accessions), 3A (8 accessions), and 3D (15 accessions) (Fig.1, Table S1). The correlation coefficient of severity among the four pathotypes was strongest between 2B and 3A (r 2 = 0.77) and weakest between 5X and 3D (r 2 = 0.27, Table S2). Phenotypic data were transformed using rank-based inverse normal transformation to make the DSI values nearly fit the normal distribution required for parametric model-based association analysis ( Figure S1). The raw sequence data for SNP calling were also analysed using the TASSEL-GBS pipeline.

Sequence analysis and SNP discovery
A total of 399,234 unfiltered SNPs and 355,680 filtered SNPs were called for the 177 accessions, with a mean of individual depth of 8.5 ± 2 SD and mean site depth of 6.7 ± 11.4 SD. Of the 355,680 filtered SNPs, 301,753 SNPs were mapped to the 19 chromosomes; the remaining SNPs were randomly distributed without specific chromosome assignment. Only variants mapped to chromosomes were kept for further analyses.

Variant analysis and annotation
There were more SNPs in the C-genome (160,174 SNPs) than the A-genome (141,579 SNPs). Chromosome A03 had the highest number of SNPs within the A-genome, while C03 contained the highest number of SNPs in the C-genome ( of SNPs were bi-allelic (90%), and only 10% were multi-allelic ( Figure S2). There was a positive correlation (r 2 = 0.80) between chromosome length and the number of SNPs, but only a weak correlation (r 2 = 0.3) between the number of SNPs and the number of SNPs per Kb.
The SNPs were annotated using the VariantAnnotation package of R. About 37% of SNPs were annotated within coding regions, 22% within introns, 31% within promoter regions, 0.3% within splice sites, and 9.7% mapped to other genetic regions ( Figure S3). A more detailed SNP annotation was performed using the Variant Effect Predictor ( Figure S3). For SNPs within coding regions, 17% were non-synonymous, 18% were upstream-gene variants, 9% were downstream-gene variants, 23% were synonymous variants, 14% were intron variants, 15% intergenic variants, and 4% were located in the splice site regions and 5' and 3' UTRs ( Figure S2C). Overall, more SNPs were annotated to the A-genome than the C-genome ( Figure S3).

Genetic diversity and population structure
For genetic diversity analysis, the SNP markers were filtered at a minor allele frequency (MAF) of 0.05 and minimum sample count of 80%, which resulted in 140,195 good quality SNPs. The mean MAF was the same for the A-and C-genomes (MAF = 0.14). Chromosome C01 had the highest MAF (0.16), followed by C03 and A07 (0.15), and lowest in chromosomes A09 and C09 (0.12) ( Table 1). The mean marker heterozygosity (H e ) was 0.06 and the mean accession heterozygosity was 0.14. The average polymorphic information content (PIC) was the same for A and C-genomes (0.26). PIC was highest in chromosome C01 (0.27) and lowest (0.24) in A09 ( Table 1). The ratio of transitions (changes from A <-> G and C <-> T) to transversions (changes from A <-> C, A <-> T, G <-> C or G <-> T) was 3.22.
Population structure analysis indicated the existence of two major group populations, and analysis using the Evanno criterion supported this result (Fig. 2). Population 1 contained 63 accessions (35.6%) representing all continents, while population 2 contained 114 accessions (64.4%), mainly from Europe. A phylogenetic tree using the neighbour-joining algorithm produced two major clusters and six subclusters ( Figure S4).

Analysis of molecular variance
Analysis of molecular variance on the six subclusters (SCA-I, II, III,SCB-I, II and III) identified significant genetic differences between major clusters, among subclusters, and among individuals within sub-clusters (p < 0.001). Variance within subclusters accounted for 87.7% of the total variance, with only 7.5% among sub-cluster and 4.7% among major clusters. The fixation index (F st ) value was 0.21, which indicated that the accessions belonged to two closely related groups (Table S3). Sub-cluster pairwise F st values ranged from 0.03 between SCB-I and SCB-II to 0.16 between SCA-I and SCB-II (Table S4).

Linkage disequilibrium analysis
Linkage disequilibrium in the association panel was calculated using Pearson's r 2 statistic on pairwise combinations of SNPs present across the 19 chromosomes of B. napus ( Figure   S5). The average LD (r 2) across the genome was 0.15. The mean LD was 0.10 in the Agenome and 0.19 in the C-genome. LD values ranged from 0.01 in A09 to 0.19 in C01 (Table 1). Across the genome, LD decayed very rapidly (r 2 = 0.20) within 300 Kb ( Figure   S5).

Association analysis
Genome-wide association analysis for clubroot severity was conducted using the following SNPs associated with resistance to the four P. brassicae pathotypes including four SNPs associated with resistance to 5X, nine SNPs to 2B, five to 3A and five to 3D. The name, physical position, P value and -log(P value) are presented in Table 2. Across genome, the A-genome carried 14 SNP loci and the C-genome carried 11 loci (Table 2, Fig. 3).

Candidate resistance genes
A Blast search identified 61 nucleotide binding site/leucine-rich repeat (NBS-LRR) resistance proteins and non-NBS-LRR resistance genes within the 2 Mb sequence upstream and downstream of 19 out of 23 significant SNP loci detected in our study ( Table 3). The majority of resistance genes appeared as clusters of 2 to 10 genes, while they appeared as a single gene in other cases. On A01, one resistance gene (BnaA01g28560D) was found at ~0.5 Mb from the A01_19406286 locus associated with resistance to pathotype 5X. On A03, one Enhanced Disease Resistance 2-like (BnaA03g03110D) gene and two TIR-NBS-LRR resistance (BnaA03g03260D, BnaA03g03270D) genes were detected at 0.11-0.2 Mb distance from A03_651104541 locus associated with resistance to pathotype 2B.

Discussion
GWAS has been widely used to identify and map QTLs for quantitatively inherited traits in a wide range of plant species. In the current study, GWAS was used to identify and map new sources of resistance to four highly aggressive pathotypes (5X, 2B, 3A, 3D) of P. brassicae in 177 accessions of B. napus. The majority of the accessions were highly susceptible to all four pathotypes (80-100 DSI), while ~10% showed high levels of resistance (0-25 DSI). This supported previous reports that sources of high levels of resistance to clubroot were much less common in B. napus than in B. rapa [11,12]. In total, 23 SNPs were identified: 14 SNPs on the A-genome and 9 on the C-genome. This indicated that the A-genome (from B. rapa) carried more QTLs for clubroot resistance, but the C-genome (from B. oleracea) could be a potential source for clubroot resistance improvement [13].
One of the major factors that may affect the accuracy of GWAS analysis is the existence of population structure within the population used for GWAS. The analysis confirmed that the core collection of accessions represented two different populations. A multi-locus mixed linear model (MMLM) was used to analysis the association between the phenotypes and the SNP markers because it provided the best fit in Q-Q plots between SNP markers and the DSI for the four pathotypes for the models assessed.
QTLs for clubroot resistance have been identified previously in B. napus [21,22] and several have been mapped to chromosomes C03, C06, and C09 [13]. We believe that all 23 of the QTLs identified in the current study are novel because they were located at different physical locations on the chromosomes from QTLs identified previously and were associated with resistance to different pathotypes.
The majority of plant disease resistance genes identified to date have been classified as toll-interleukin-1 receptor/nucleotide binding site/leucine-rich repeat (TIR-NBS-LRR or TNL) proteins or coiled coil /nucleotide binding site/leucine-rich repeat (CC-NBS-LRR or CNL) proteins. The ratio of TNLs to CNLs differs among plant species, likely because their R genes are adapted to different pathogens [23,24]. About 70% of NBS-LRR genes in Brassicaceae family belongs to TNLs [25,26,27].
In the current study, 61 resistance genes were identified within 2 Mb upstream and 2 Mb downstream of the SNPs associated with resistance to the four pathotypes. The resistance genes belonged mainly to the TNL family (24 genes) or the CNL family (20 genes). The frequency of TNLs and CNLs was highest on A09 (11 genes) followed in decreasing order by A03, A08, C09, C01 and C03. The uneven distribution of TNLs and CNLs is not uncommon in other plant species [30,31,32,33]. The vast majority of TNLs and CNLs appeared in clusters of 2 to 10 TNLs and CNLs, which is similar to the results of previous studies in B. napus [29], Arabidopsis, Medicago truncatula and Solanum tuberosum [30,34,35].
The remaining resistance genes were non-TNL genes (nTNL), comprised of four enhanced disease resistance-like genes, two disease resistance-responsive (dirigent-like protein) family genes, and seven putative disease resistance proteins. A set of nTNLs with RPP13 domain (called RNLs) was also detected. A group of nTNL genes with RPW8 domain (RNL) had been identified in previous studies [27,36,37], but was not observed in the current study.
A previous phylogenetic analysis of nTNLs and CNLs from five Brassicaceae species indicated that RNLs are likely derived from the CNL lineage [27,29]. The function of RNLs is yet to be determined, but they have no direct response to the pathogen and may have not the same duplication rates as TNLs and CNLs, which explains their lower abundance in the genome [27,29]. They may have a role in defence-signal transduction [38] or as helpers of other NBS genes [38]. The role of other nTNLs is also unknown.

Conclusion
The current study identified several accessions of B. napus with high levels of resistance to four pathotypes of P. brassicae. Genome-wide association mapping analysis detected and mapped 23 SNP loci associated with resistance to the four pathotypes. This information will be used in subsequent genetic analysis of bi-parental populations to verify the SNPs and fine map the functional genes responsible for resistance to each pathotype and for marker-assisted breeding of resistance to clubroot in canola.  (Table S1). These accessions represented collections from Europe (123 accessions), Asia (29), North America (20), Oceania (2), South America (1), Africa (1), and one accession of unknown origin (Table S1). The accessions were oilseed rape (146 accessions), fodder rapa (21), Swede rape (7), rutabaga (2) and turnip (1). The growth habit was predominantly winter type (129), with some spring type (48) accessions (Table   S1).

Plant and pathogen materials
Plants for GBS analysis were grown in a growth chamber up to the 3-4 leaf stage. A total of 100 mg of leaf tissue was collected from each accession, immediately frozen in liquid nitrogen and then lyophilized in a freeze dryer for approximately 48 h. The freeze-dried tissues were ground to a fine powder using a tissue lyser (Qiagen, Newtown City, USA).
Resting spores of field collections of strains L-G02, F.183-14, F.3-14 and F.1-14 representing pathotypes 5X, 2B, 3A and 3D respectively of P. brassicae (Canadian Clubroot Differential) system, [39] were increased on canola and stored as frozen clubbed roots at -20°C until needed. Resting spores were extracted from the frozen clubs as described by [40], and adjusted to a concentration of 1 × 10 7 resting spores/mL. Spores of each pathotype were applied separately to the host entries.

Evaluation of clubroot reaction
Seed of each host genotype was pre-germinated on moistened filter paper in a Petri dishes. One-week-old seedlings of each host line and pathotype were inoculated by dipping the entire root system in the resting spore suspension for 10 s. The inoculated seedlings were then immediately planted in 6 × 6 × 6 cm plastic pots filled with Sunshine LA4 potting mixture, with one seedling per pot. The pots were thoroughly watered and transferred to a greenhouse at 21°C ± 2°C with a 16 h photoperiod. The potting mixture was kept saturated with tap water at pH 6.5 for the first week after inoculation and then watered and fertilized as required.
Six weeks after inoculation, the seedlings were gently removed from the potting mix, the roots of each plant were washed with tap water, and each root was rated for clubroot symptom development on a 0 to 3 scale [41], where: 0 = no clubs, 1 = a few small clubs on less than one-third of the roots, 2 = moderate clubs (small to medium-sized clubs on 1/3 to 2/3 of the roots), and 3 = severe clubs (medium to large-sized clubs on > 2/3 of the roots). A DSI was then calculated using the formula of [42] as modified by [41]:

Sequence analysis and SNP discovery
The accession sequences were analyzed using GBS. In brief, GBS involves four major steps: DNA sample preparation, library construction, library sequencing and SNP calling.
DNA extraction was performed using the DNeasy 96 plant kit as per the manufacturer's instruction (Qiagen). To reduce the genome complexity, DNA was digested with ApeKI, a methylation-sensitive restriction enzyme. The fragments produced by digestion were directly ligated to enzyme-specific adapters followed by PCR amplification. The samples divided into two pools of 96 samples each followed by two runs of Illumina HiSeq 2500 (Illumina Inc., USA). DNA alignment was generated with BWA software version 0.7.8-r455.
The GBS-TASSEL pipeline [43] was used for SNP calling, and VCF and HapMap genotype files were generated. Initial SNP filtration was performed with the following settings: MAF > 0.01 and missing data per site < 90%. Accessions with too much missing data were removed. Depth, missingness and heterozygosity were calculated using VCFtools V.0.1.12 [44]. Genotyping and SNP calling was performed at the Genomic Diversity Facility, Cornell University (http://www.bio-tech. cornell.edu/brc /brc/ services).

Variant annotation
Variants were annotated to regions of the B. napus reference genome using R, implemented using "VariantAnnotation" [45], and Variant Effect Predictor (VEP, [46]), and variant locations were characterised as coding, intron, splice site, promoter and intergenic regions.
Structure analysis of the accessions was conducted using STRUCTURE software v2.2 [51].
A subset of 10,094 SNPs was selected that was evenly distributed across the genome with one SNP per 100 Kb. The admixture model and correlated allele frequency were applied with a burn-in period of 50,000 iterations and 100,000 replications of Markov Chain Monte Carlo (MCMC). Five runs were performed to calculate the mean likelihood for the number of populations K, ranging from 1 to 10, and the mean of the log-likelihood estimates LnP(D) for each K. The ad-hoc statistic ∆K was used to determine optimal number groups [52]. Structure output was visualized using STRUCTURE HARVESTER web-based software [53]).

Analysis of molecular variance
Analysis of molecular variance (AMOVA) was conducted using Arlequin v.3.5 software [54] to estimate the genetic variance among clusters and sub-clusters of the A and C genome haplotypes. In this analysis, the distance matrix among samples was computed to estimate the genetic structure of the haplotypes. Genetic variance components were estimated, and the total variance was partitioned among major clusters, among subclusters within major clusters, and within subclusters. The significance of the variance components was tested using 1,000 permutations. The fixation index (F st ), an estimation of population differentiation and genetic distance based on genetic polymorphism data, was calculated.
Linkage disequilibrium (LD) analysis LD decay across the B. napus genome was measured and a correlation matrix of r 2 values was computed between all pairs of polymorphic SNPs with MAF ≥ 5% using the GAPIT V2 package [55].

Association analysis
Data for the disease DSI were transformed using rank-based inverse normal transformation implemented as the rntransform function in the GenABEL R [56].
Association was analyzed for a subset of 10,094 SNP markers with MAF ≥ 5% using the following models: general linear model (GLM), mixed linear model (MLM), compressed mixed linear model (CMLM), enriched compressed mixed linear model (ECMLM), and multilocus mixed model (MLMM) implemented in the GAPIT V2 package of R [55]. A kinship matrix of the accessions was calculated and principle components analysis was used to account for population structure and accessions relatedness.
Candidate resistance genes Using Blast2Go software [57], the sequence region neighboring (2 Mb upstream and downstream) of the significant SNPs were searched for candidate genes encoding disease resistance proteins potentially responsible for resistance to each pathotype of P.  Table 2 List of significant SNPs, chromosomes, physical location and P values

Supplementary Files
This is a list of supplementary files associated with the primary manuscript. Click to download.
Additional files-final version.docx