2.1 Plant material and field trials
A total of 250 faba bean accessions were collected from the Centre for Genetic Resources (CGN) located in Netherlands. In addition, six commercial varieties were provided by the breeding companies Limagrain (LG) and NPZ. This complete set of 256 accessions is referred to as the “CGN population”. The CGN population comprises accessions originated from 43 countries, including 158 landraces, 84 varieties, 3 research/breeding lines, and 9 accessions with an unknown genetic status (Supplementary Table 1).
The plants were grown in Winschoten in the Netherlands (53°08’28.7’’N 6°59’33.7’’E) in 2021 and 2022. They were sown on March 24 and harvested on August 17–18 in 2021, and sown on April 13 with a harvest on August 22–23 in 2022. The difference in sowing dates was due to differences in annual climatic conditions.
The experimental design was a resolvable row-column design (RRC) with two complete blocks (i.e., replicates), each featuring 11 rows and 25 columns. The six commercial varieties were planted in four plots within each block as controls to better evaluate the field and season conditions. The design was generated by using the FielDHub R package (29). The randomization within the RRC design was different in 2021 and 2022. The plots were 1.8 m² plots and hosted 20 plants per accession, maintaining a spacing between them of 0.3 m.
Field management adhered to the local farmers’ practices and irrigated exclusively through rainfall. 45 kg of KAS fertilizer (N 27%, NH4 13,5%, NO3 13,5%, and MgO 4%), and 26 kg of KALI60 (K2O 60%), were applied in both years.
2.2 Phenotyping of agronomic traits
The phenotypic traits recorded included flowering time (FT), plant height (PH), pod length (PL), seeds per pod (SP), single seed weight (SSW), and yield (YLD). The trait PL was only recorded in 2021. The data were recorded on the six inner plants of each plot to minimize potential border effects.
The FT was recorded when 50% of the plants per plot exhibited at least one open flower, and quantified in day after sowing (DAS). The PH was measured from the collar to the top of the plant in centimetre (cm) by using a standard ruler at harvest. After drying in the greenhouse, pods from each plot were photographed at 50 cm by using a smartphone in a holder. The pictures were analysed using ImageJ software to determine PL in cm, using a known size label as a reference. The SP was estimated by dividing the total seed count by the total pod count. The SSW was calculated by dividing the total seed weight by the number of seeds, providing an average weight per seed in grams. The total yield referred to plot yield (grams/plot), corresponding to the total weight of seeds per plot.
2.3 DNA extraction, sequencing, and single nucleotide polymorphisms (SNPs) typing
Young leaves from ten plants per accession were collected, lyophilized separately, and pooled in equal weights to capture intra-accession allelic variation in the partially allogamous Vicia faba L. DNA was extracted by using the NucleoMag® DNA (Bioké, the Netherlands) extraction kit. The company IGAtech performed DNA quality check, normalization, and library preparation using the NuGEN Technologies protocol, followed by sequencing on the Illumina NovaSeq 6000 with pair ends (2x150bp) (Illumina, San Carlos, CA). The 90k-SPET® technology (Allegro® Targeted Genotyping) was adopted as a targeted method to select and sequence ~ 90,000 genomic areas.
The sequencing reads were aligned to the faba bean reference genome (Faba bean cv. Hedin) using BWA-MEM v0.7.17 (30), selecting uniquely aligned pairs of reads with a mapping quality > 10. Variant calling was executed with freeBayes v1.3.2 (31) within a 300 bp range from SPET probes, treating samples as haplotype mixes (--pooled-continuous option). Variant normalization was done using BCFtools (32).
Allelic frequencies at each biallelic locus were calculated using a custom Python script, assigning a score s to SNPs as \(\:s=\frac{{A}_{r}}{{A}_{r}+{A}_{a}}\), where Ar and Aa are the reads counts of the reference and alternative alleles, respectively. To minimize errors and bias, s was calculated only for SNPs with an average coverage > 100x across all samples and > 30x in over 15% of the samples. Additionally, the least represented allele (either Ar or Aa) needed at least 2 reads per pool and presence in at least 5% of other samples.
The final SNPs retained for GWAS were selected based on a minor allele frequency (maf) filter of at least 0.04 and a maximum of 6% missing values (NA) per SNP. For Genomic Selection, a stricter maf of 0.07 was used. Missing values were imputed using the mean imputation method. The filtering and imputation were done with a custom R script.
The unique SNP identification code used in this study is: “chromosome_position” (e.g., chr1S_1370573844 indicates the SNP at position 1370573844 on chromosome 1S).
2.4 Genomic relationship matrix (G), PCA, and homozygosity level using SNPs data
The Genomic Relationship Matrix (G) and Principal Component Analysis (PCA) were computed using a pruned set of SNPs. SNP pruning was done using a custom R script, retaining only one SNP per pair that showed a Pearson correlation > 0.4 within a 200 Kbp window. In total, 2631 markers out of 38,014 were discarded.
The G matrix was computed by using the VanRaden method (33) implemented in the AGHmatrix R package (34), which is adapted for pool genotypes and allelic frequency as marker score. Subsequently, the G matrix was blended using the formula described by Nazarian and Gezan (35): G*=G(1 − p) + pI, where G* is the new blended matrix, G is the original matrix, p is the proportion of the identity matrix I to be considered (0.02), and I is the identity matrix to be blended (36). Henceforth, the term G denotes its blended form G*, which is invertible.
PCA was performed by using the base stats R package. Four principal components (PCs) were retained based on the variance-explained criteria. The PCs loadings were inspected to ensure that they represented genome-wide variation, based on the assumption that if SNPs with high loadings are all concentrated in a single or few regions of the chromosome, they are likely representing local genomic structure rather than population structure (37). Four clusters were identified by an iterative K-means clustering on the four PCs using the factoextra R package (38). The goodness of clustering was assessed by examining the within-cluster sums of squares (WSS) of 1 to 15 clusters by using the elbow method in a scree plot.
Later, the CGN population’s homozygosity was assessed by defining a within-pool locus as ‘fully’ homozygous when the frequency of a unique allele > 0.95.
2.5 Statistical analysis of the field trials data
The phenotypic data analysis was performed by fitting different linear mixed models via ASReml 4.2-R package (39), which estimates variance components using the restricted maximum likelihood (REML) methodology.
In order to estimate the narrow sense heritability (h2), the following single-trait and single-environment analysis was performed using raw data based on the following generic model:
y = 1µ + Xβ + Z1u + Z2r(b) + Z3c(b) + e (1)
where y represents a vector of raw phenotypes, µ the overall mean, β the fixed effect coefficients for blocks, u the random genetic effects, with u ∼ N(0, σg2G); r(b) the random rows effects within blocks, with r(b) ∼ N(0, σr2Ir); c(b) the random columns effect within blocks, with c(b) ∼ N(0, σc2Ic); e represent the random residual errors with e ∼ N(0, Σ), and Σ = σe2(AR1r⊗AR1c). The sign ⊗ is the Kronecker product of two autoregressive structures of the rows and columns coordinates, respectively. The matrix AR1 accounts for micro-environmental variations between residuals. X and Z are incidence matrices of their respective orders, I identity matrices of their respective orders, and G is the Genomic Relationship Matrix capturing the additive relationship among individuals, described earlier.
The residuals were assessed for normality by exploring diagnostic plots after model fitting. Outliers were defined as residuals exceeding three times the standardized residual's standard deviation. Excluding outliers, models were re-fitted to calculate the narrow sense heritability (h2) as follows: \(\:{h}^{2}=1-\frac{\left[mean\left(\text{P}\text{E}\text{V}\text{g}\right)\right]}{{\sigma\:}_{g}^{2}}\) (40), with mean(PEVg) being the average prediction error variance (PEV) for the genotypes and σg2 the estimated additive genetic variance.
To estimate an adjusted mean for each genotype (Yadj), a similar model to (1) was fitted with u as a fixed genetic effect.
The type-B additive genetic correlations (41) between 2021 and 2022 were estimated by fitting the following single-trait and multi-environment trial (MET) model using raw data:
y = 1µ + X1t + X2β(t) + Z1u(t) + Z2r(bt) + Z3c(bt) +e (2)
where y is a vector of raw phenotypes and µ is their overall mean across years, t is a vector of fixed year effect, β(t) is a vector of fixed blocks effects within the years, u(t) is a vector of random genetic effects within years, with u(t) ∼ N(0, σg2G⊗U0) with G representing the genomic relationship matrix and U0 is a 2x2 variance-covariance (vcov) matrix between the two years with a heterogeneous structure (corgh), allowing for year-specific σg2 and genetic correlations (ρtype−B) estimation; r(bt) is the vector of random effects for rows within blocks within years, with r(bt) ∼ N(0, σr2Dr), c(bt) is the vector of random effects for columns within blocks within years, with c(bt) ∼ N(0, σc2Dc); e is the vector of residual errors, with e ∼ N(0, σe2De) and with heterogeneity defined as Σ = σe2(AR1r⊗AR1c). Dr, Dc, and De are block diagonal matrices where each year level has its own component, σr2, σc2, and σe2 respectively. All other terms were previously described. A value of ρtype−B near one (obtained from the matrix U0) indicates minimal Genotype-by-Year (GxY) interaction. The approximated standard errors of ρtype−B were estimated by using the Delta method (42). The significance of the year as a covariate was assessed by the incremental Wald statistic.
To estimate the type-A additive genetic correlations among traits (41), the following multi-trait (multivariate) and single-environment model was fitted using raw data:
y = 1µm + X1t + X2β(t) + Z1 u(t) + Z2r(bt) + Z3c(bt) + e (3)
where y is the stacked vector of phenotypes including all traits y = [y1, y2,…yn]; µm is the multivariate version of intercept (µ); X2β(t), Z2r(bt), Z3c(bt) are stacked design effects defined as previously in the model (2), but with ‘t’ now denoting the trait, u(t) is a stacked vector of random genetic effects within trait, with u(t) ∼ N(0, σg2G⊗U0), G is previously described, and U0 is a 6x6 vcov matrix between all traits (5x5 in 2022 because one trait less was analyzed) with a heterogenous structure (corgh), allowing for trait-specific σg2 and unique genetic correlations across traits (ρtype−A), e is a stacked vector of random residual effects, with e ∼ N(0, σe2I⊗R0) and R0 is the vcov between errors that is heterogeneous (corgh), introducing residual correlations between traits and unequal σe2 across traits. The approximated standard errors for correlations were obtained by using the delta method (42). Convergence in REML for the multivariate model was aided by using the starting values for σg2 and σe2 estimated from the univariate model (1) for each trait.
2.6 Linkage disequilibrium decay
Simple linear regression models were fitted between pairs of allelic frequency at each SNP with maf > 0.04, treating one SNP as the response variable and the other as the predictor variable. The strength of linkage disequilibrium (LD) was approximated by the ratio of regression sum of squares (SSR) to the total sum of squares (SST), expressed as \(\:{r}^{2}\:\)(43). To represent the LD decay, \(\:{r}^{2}\) between markers was plotted against their physical distance. The genome was segmented into 300 Kbp bins, and the within bins 99th percentile \(\:{r}^{2}\) values and mean marker distances were chosen as representative data points to fit a Piecewise Cubic Hermite Interpolating Polynomial (PCHIP) via the scipy.interpolate module in Python (44, 45). LD decay was defined as the physical distance where \(\:{r}^{2}\) halved from its maximum (LDmax1/2).
2.7 Genome-wide association study (GWAS)
GWAS was performed using a linear mixed model approach implemented in the statgenGWAS R package (46), as follows: y = Xβ + Zu + e, where y represents the vector of the Yadj; X is the design matrix for fixed effects including the intercept, SNPs, and the genetic clusters to which each accession belongs; β denotes the vector fixed effect coefficients, with βSNP specifying the SNP-effect; Z is the incidence matrix; u is the random additive genetic effects, with var(u)= σg2G. G was defined before (section 2.4), and it captures the shared additive genetic effects among individuals. Finally, e is the vector of residual errors, with var(e) = σe2I.
StatgenGWAS estimates σg2 and σe2 using the Efficient Mixed Model Association (EMMA) algorithm (47). Initially, a base model calculates the REML-variance components excluding the SNP fixed effects. Then variance components are fixed, and the null hypothesis βSNP = 0 is tested for each SNP individually using a general F-statistic within the generalized least squares (GLS) framework.
The model's fit was gauged using the genomic inflation factors from statgenGWAS. SNPs exceeding a 4.5 LOD threshold were deemed significant. Furthermore, adjusted p-values were obtained using the Benjamini-Hochberg method (48).
2.8 Candidate gene identification and putative SNP effects
The candidate genes were identified near significant SNPs considering the LD decay. The genes’ function were obtained from the public available annotation data generated using Prot-Scriber (https://github.com/usadellab/prot-scriber). For intragenic SNPs, SNPeff tool (49) was used to predict the SNPs effect on the protein produced. SNPeff categorizes SNP impacts as “high” (disruptive), “moderate” (potentially altering protein function), “low” (likely harmless), or “modifier” (affecting non-coding regions or genes).
2.9 Genomic selection based on a Bayesian modelling approach
Genomic Selection (GS) models were fit using the Bayes B method. The models were simply formulated as y = 1µ + Zβ + e, where y represents the vector of Yadj; µ is the intercept; Z denotes the marker incidence matrix, and β the corresponding vector of marker effects; lastly, e stands for the vector of residual effects.
In Bayes B, each marker's effect (β for maker j) has a prior that is a mixture of point of mass at zero and scaled-t density slab. The zero-point mass suggests no effect with probability π, while the scaled-t slab allows for an effect with probability 1 − π. The hyperparameter π follows a Beta distribution \(\:B\left(\pi\:|{p}_{0},{\pi\:}_{0}\right)\), with shape \(\:{p}_{0}\) > 0 and \(\:{\pi\:}_{0}\) ∈ [0, 1]. The variance of the marker effects βj are unknown and modelled by assuming a prior distribution from an inverse chi-squared (\(\:{\chi\:}^{-2}\)) with 5 degrees of freedom and scale parameters Sβ. The Sβ’s uncertainty is represented by a gamma density with rate (r) and shape (s) parameters, \(\:G\left({s}_{\beta\:}|r,s\right)\). Bayes B model was fit using the BGLR R package with default hyperparameters (50). 30,000 iterations were performed for sampling and the initial 6,000 sampled values were discarded as burn-in.
The GS models were evaluated through a 5-fold cross-validation (CV), repeated 10 times, for the 2021 and 2022 data separately. The predictive ability (PA) and prediction accuracy (ACC) were used to evaluate the models’ performance. The PA is the correlation between the Yadj and predicted genomic estimated breeding values (GEBV), denoted as corr(yadj, ĝ). For each CV round, the heritability (ℎ2) was estimated from the GS models as \(\:{h}^{2}=1-\frac{{\sigma\:}_{e}^{2}}{{\sigma\:}_{g}^{2}}\), where \(\:{\sigma\:}_{e}^{2}\) and \(\:{\sigma\:}_{g}^{2}\) represent the error variance and genetic variance, respectively. Then, ℎ2 was used to calculate the ACC as: \(\:\text{A}\text{C}\text{C}=\frac{PA}{\sqrt{{h}^{2}}}\). The ACC is the correlation between the true genomic breeding value and the predicted breeding value, denoted as corr(g, ĝ). Furthermore, GS models were evaluated for cross-year prediction by calculating the PA, using 2021 data to predict 2022 and vice versa.