Plant materials
We collected 500 flax accessions and their wild relatives from North Central Regional Plant Introduction Station (NCRPIS), Ames, Iowa, USA. All genotypes were grown in the field as single rows. We discarded the heterogeneous rows and kept the homogeneous lines for parental stock. Finally, we made a core collection of 337 flax germplasm accessions, which comprises homogeneous lines from NCRPIS, NDSU-released varieties and advanced breeding lines, and varieties developed by different institutes in the USA and Canada (Supplementary Table S1). The advanced breeding lines (F7 generation) were obtained by crossing different parents in various combinations. The core collection is being maintained through selfing.
DNA extraction, sequencing, and SNP calling
Young leaves collected from 30-day old plants were used as the source of DNA. The collected leaf samples from each genotype were lyophilized and subsequently pulverized using stainless beads in a plate shaker. DNA was extracted from the ground leaf tissue using a Qiagen DNeasy Kit (Qiagen, CA, USA) according to the manufacturer’s protocol. A NanoDrop 2000/2000c Spectrophotometer (Thermofisher Scientific) was used to measure DNA concentrations and for the genotype by sequencing (GBS) approach. The GBS library was prepared using ApekI enzyme 65 and sequencing of the library was accomplished using an Illumina HiSeq 2500 sequencer at the University of Texas Southern Medical Center, Dallas, Texas, USA. Identification of SNPs was based on a 120-base kmer length and minimum kmer count of ten using the TASSEL 5 GBSv2 pipeline 66 and the flax reference genome 67 (available at: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/224/295/GCA_000224295.2_ASM22429v2/). The reads were aligned to the reference genome using Bowtie 2 (version 2.3.0) 68 alignment tool, which identified 243,040 SNPs that passed all required steps of the TASSEL 5 GBSv2 pipeline. Though flax is a strictly self-pollinating crop and inbred lines were used, there was a possibility of heterozygous SNPs due to artefactual collapse of homologous sites during alignment. We removed the heterozygous SNPs and filtered the row SNP set using VCFtools 69 following the criteria: minor allele frequency (MAF) ≥ 0.05, missing values (max-missing) ≤ 25%, depth (minDP) ≥ 3, min-alleles = 2 and max-alleles = 2. This filtering process yielded 26,171 bi-allelic high-quality SNP markers.
Phenotyping
We planted 337 genotypes following augmented row-column design 70 with three standard checks (ND Hammond, Gold ND, and Omega), and the checks were diagonally placed to cover spatial heterogeneity (Supplementary Fig. S1). Each check was replicated 20 times per trial and experiments were conducted at Fargo, ND, USA (46.8772° N, 96.7898° W) in three consecutive years (2018, 2019, and 2020) and at Carrington, ND, USA (47.4497° N, 99.1262° W) in two consecutive years (2019, 2020). Hereafter, we referred to the location-year combinations as environments: E1 (Fargo, 2018), E2 (Fargo, 2019), E3 (Carrington, 2019), E4 (Fargo, 2020), and E5 (Carrington, 2020). Within environments, for each genotype, we used 8 (4m×2m) m2 four-row plots and 30 gm of seeds per plot for planting except E1. In the case of E1, we used a single row per genotype, 7 gm of seeds per row. Standard fertilization and cultural practices were used throughout the experiment. Data for nine agronomic traits was recorded in all environments and seed yield in four environments. For each genotype, data measured on different traits was based on the criteria and methods described by Nôžková et al. (2011) 71 with minor modifications. Days to flowering was determined as the number of days from planting to when approximately 50% of plants per plot start flowering. Plant height was measured as the length of the main stem at maturity from the hypocotyl ending point to the plant’s top. Technical length was measured as the length of the main stem at maturity from the end of the hypocotyl to the point where branching starts. The branch number per plant was counted as the lateral branches of the main stem inflorescence. In this case, only the primary lateral branches were considered. Boll number per plant was the capsule number carried by the main stem inflorescence. Thousand seed weight was the weight of exactly 1000 seeds. Seed area, seed width, and seed length were calculated as an average of 1000 seeds. We used a MARVIN seed analyzer (GTA Sensorik GmbH) to measure seed-related traits. We harvested each plot separately and measured the grain weight in grams as yield per plot.
Phenotypic data analysis
In this study, a two-stage analysis of phenotypic data was performed. In stage-I, the best linear unbiased estimates (BLUEs) and other statistics for all genotypes within each environment were determined using the following model:
$$y=X\tau +{e}_{R}+{e}_{C}$$
1
Where X is the design matrix, τ is the fixed effect of genotype, eR is the random effect of row and eC is the random effect of the column.
In stage II, we fitted the BLUEs and weights from stage-I analysis in equation-2 and estimated the best linear unbiased predictions (BLUPs) of genotypes across all environments.
$${y}_{ij}=\mu +{G}_{i}+{E}_{j}+{GE}_{ij}+{e}_{ij}$$
2
Where \({y}_{ij}\)is the observed phenotypic value of the ith genotype in the jth environment, µ is the overall mean, \({G}_{i}\)is the random effect of the ith genotype, \({E}_{j}\) is the fixed effect of the jth environment, \({GE}_{ij}\) is the G×E interaction term and \({e}_{ij}\) is the residual error. The analysis was done using the R-shiny app MrBean (https://beanteam.shinyapps.io/MrBean/).
We calculated the heritability of each trait using the following formula proposed by Cullis et al. (2006) 72:
$${H}_{Cullis}^{2}=1-\left(\frac{PEV}{md\text{*}{V}_{g}}\right)$$
3
Where the genotypic predicted error variance is \(PEV\), \({V}_{g}\) is the genotypic variance and \(md\) is mean values from the diagonal of the relationship matrix. The heritability calculation was done using the R package Sommer 73.
We calculated Pearson correlation among different traits within each environment and combined all environments using observed unadjusted phenotypic values. The correlation of phenotypic values of a trait observed in different environments were also calculated; for this purpose, we used the R package corrplot 74.
Structure and linkage disequilibrium (LD) analysis
An admixture model-based structure analysis of the whole germplasm set was conducted using STRUCTURE 75 software utilizing the whole SNP marker set (26,171). To strengthen the result, structure analysis was run at various combinations of burn-in lengths (5000 to 50000) and Monte Carlo Markov Chain (MCMC) lengths (5000 to 100000). Each combination was replicated 10 times per K (K1 to K10). Along with the DeltaK approach 76, four alternative statistics (MedMedK, MedMeaK, MaxMedK, and MaxMeaK) 77 were used to identify optimum cluster numbers – StructureSelector 78 was used for this purpose. We assembled 10 replicates of the Q-matrix for the best-fitted cluster number using CLUMPP 79. Principal component analysis (PCA) was run using a covariance standardized approach in TASSEL 80. To show the genetic divergence among identified clusters, pairwise Fst81 was calculated using Arlequin3.5 82 at 10000 permutations. Gaussian finite mixture model-based clustering of the collection was fitted via the EM algorithm in R package mclust 83 using phenotypic data. We also visualized the PCA and phenotype-based clustering output using the ggplot2 R package 84.
Chromosome-wise LD (r2 values) among SNPs was calculated using the 26,171 SNP markers in PopLDdecay software 85. The LD decay rate was defined as a half-decay distance, at which observed r2 between sites decays to less than half of the maximum r2 value 86. For this purpose, we wrote R scripts (available on personal communication).
Genomic prediction models’ comparison
In the case of genomic prediction, the linear model equation is unsolvable as the explanatory variables (marker number) exceed the observation number. Researchers solve this problem by using ridge regression or Bayesian computations or parametric method and penalized regression or semi-parametric method 87. Generally, all methods are fitted to the basic skeleton (Eq. 4) with modifications:
Where y is the phenotypic value (BLUP), µ is the fixed intercept, Z is the marker matrix, u is the marker effect vector and ɛ is a residual vector.
In this study, we calculated the prediction ability for different traits using 26,711 markers and BLUP values based on 14 different parametric and semi-parametric models such as GBLUP 88–90, EGBLUP 91, RR 23,26,92, LASSO 93, EN 94, BRR 95, BA 23, BB 96, BC 97, BL 29, RKHS 98, RF 99, SVM 100,101 and MKRKHS 33. Details of the models are available in previously published research articles 34,95. For this purpose, we used the R package BWGS pipeline 34. For each model, we calculated the predictive ability as Pearson correlation between genomic estimated breeding values (GEBVs) and phenotypic values (BLUPs) of the validation set (VS). For this, a fivefold cross-validation approach was used, i.e., we randomly selected 80% of the collection as a training set (TS) and the remaining 20% as a validation set (VS). The process was repeated 100 times and finally predictive ability was reported as average across 100 replicates. The model that gave the best predictive ability for a particular trait was declared as the best-fitted model for the corresponding trait. For subsequent analyses, we only used the best-fitted model identified in this stage for specific traits.
Optimal marker density determination
Predictive ability for each trait was calculated using different subsets of the whole marker set. The marker subset was based on linkage disequilibrium (LD) decay distance and random selection and the marker subset was thinned using chromosome-wise LD decay distance, which yielded 5362 markers. We also selected subsets of 20, 200, 1000, 3000, 7000, and 13000 markers based on random sampling to minimize or avoid any biases. Using a five-fold cross-validation approach, predictive ability in each subset was measured. The process was repeated 100 times and finally, predictive ability was reported as average across 100 replicates.
Marker selection based on marker-trait association
Significant markers for each trait were grouped based on genome-wide association mapping (GWAS) results. This was done five different ways. In the case of scenario-I, we conducted SNP-based GWAS for all traits within each environment (BLUEs) and combined all environments (BLUPs) using different single locus and multi-locus models such as general linear model (GLM) 102, mixed linear model (MLM) 103, compressed MLM (CMLM) 104, enriched compressed MLM (ECMLM) 105, settlement of MLM under progressively exclusive relationship (SUPER) 106, multiple loci MLM (MLMM) 107, fixed and random model circulating probability unification (FarmCPU) 108 and Bayesian information and linkage-disequilibrium iteratively nested keyway (BLINK) 109. The R package GAPIT (version 3) 110 was used to run GWAS and the best-fitted model for a particular trait was determined based on the mean of squared difference (MSD) values and QQ plots. Using the best-fitted model output, we identified significant SNPs associated with a particular trait based on a p-value threshold. The p-value threshold was calculated by dividing the type-I error rate (α) by the effective number of independent tests (Meff) at α = 0.05 111. In this study, the p-value threshold was 0.000103. Any marker having a p-value ≤ p-value threshold was declared as a significant marker. We grouped the significant markers for each trait considering each environment (BLUEs) and combined environment (BLUPs). In the case of scenario II, GWAS was conducted using only an MLM model based on combined environment BLUP values. The top 20 markers for each trait having the lowest p-value were grouped. Then the selected marker set from scenario-I & II were used to calculate predictive ability 100 times using a five-fold cross-validation approach and predictive ability was reported as average across 100 replicates. In the case of scenario III, we randomly divided the whole collection into TS (80% of the whole collection) and VS (20% of the whole collection) 15 times. Each time we did GWAS using TS only and grouped the markers as described in scenario II. Then, predictive ability was calculated and reported as the average across 15 replicates. In the case of scenario IV, GWAS was conducted as one-way ANOVA using the R function lm, and every marker was tested one at a time using phenotype (BLUP values) considering the whole collection. The markers having a p-value less than the predefined p-value (0.001, 0.01, and 0.05) were used for predictive ability calculation 100 times with five-fold cross-validation. In the case of scenario V, we followed scenario III and used one-way ANOVA instead of MLM model-based GWAS to select the marker set having a p-value less than the predefined p-value. The selected marker set was then used to calculate predictive ability 100 times using five-fold cross-validation and predictive ability was reported as average across 100 replicates.
Prediction ability considering population structure
To investigate confounding effects of population structure on predictive ability, we calculated predictive ability using a subset of the whole collection after discarding the genotypes belonging to the most divergent (clusters showing the highest pairwise Fst value to other clusters) genetic clusters. Divergent clusters were removed to increase the genetic similarities among the remaining individuals. We incorporated the Q-matrix of structure analysis output into the RR-BLUP model to nullify the effect of structure and calculated predictive ability considering the whole collection. Albeit a smaller sample size, cluster-wise predictive ability. In all cases, we used five-folds cross-validation 100 times. To cover the genetic variance, we also did stratified sampling and calculated predictive ability following two different methods (M-I and M-II). The whole collection was arranged into small groups of 25, 50, 75, 100, 125,150, 175, 200, 250, and 300 genotypes by randomly selecting genotypes from each cluster proportional to the size of the cluster. In the case of method-I (M-I), 100 times five-fold cross-validation within each subset was done to report the predictive ability as the average across all runs. In the case of method -II (M-II), each subset was used as TS and the remaining genotypes as VS. For this purpose, we made 20 replicates of each subset. The predictive ability was calculated as Pearson correlation between GEBVs and BLUP values of VS and was reported as an average across 20 times. There was overlapping of genotypes among replicates as genotypes were selected randomly from each cluster proportional to the size of the cluster each time.
Indirect predictive ability calculation
Indirect prediction ability for each trait, considering other traits separately, was calculated using the five-fold cross-validation schemes 100 times. For example, we estimated the GEBVs of the validation set for plant height. Then we calculated the indirect prediction ability for seed yield by plant height as Pearson correlation between GEBVs of the validation set based on plant height and BLUP values of that set considering seed yield. The same was repeated 100 times and prediction ability was reported as average across 100 replicates. The same procedure was followed for different trait combinations.