New enzyme combinations for an efficient and uniform capture of the genome
In this study, sixteen DNA samples that had been previously genotyped with the original ApeKI-based GBS protocol were used to produced 3D-GBS libraries. The 16-plex GBS and 3D-GBS libraires produced ~ 21.1M (ranging from ~ 800K to ~ 2.9M reads/sample) and ~ 10.4M (ranging from ~ 300K to ~ 800K reads/sample) reads, respectively. First, the distribution of the SNPs derived from PstI-MspI and Nsil-MspI reads was investigated to assess the relevance of this enzyme combination (Fig. 1a and Supplementary Fig. 1). We found 76.5% and 23.5% of Nsil-MspI and PstI-MspI reads, respectively, encompassing 4,206 and 620 SNPs, respectively. The higher proportion of NsiI-MspI-derived fragments and SNPs could be expected because of the methylation insensitivity of NsiI and lower GC content compared to PstI. Nevertheless, PstI-MspI-derived fragments ensured the coverage of large gaps devoid of NsiI-MspI-derived fragments (e.g., on chromosomes 9, 11 and 20).
To perform a meaningful comparison, the same overall number of reads for each 16-plex library was used to compare the two protocols; as the number of reads per accession varied, an identical number of mapped reads for a given accession in each of the two libraries was used to compare GBS and 3D-GBS (Table 1). As expected, with 3D-GBS, a lower fraction of the genome was captured compared to GBS (genome coverage of 1.2% vs 4.7%, respectively). As the sequencing effort (i.e., the number of reads per sample) was focused on a smaller fraction of the genome, the mean depth of coverage was 3-fold higher in 3D-GBS compared to GBS (14.5X vs 5.1X, respectively) resulting in a lower proportion of missing data (15.3% vs 33.7%, respectively). Fortunately, while the genome coverage was 75% lower for 3D-GBS than GBS data, the number of SNPs identified was only 40% lower (4,826 vs 7,904 SNPs, respectively), showing that 3D-GBS either captures more polymorphic regions of the genome or improves the genotyping efficiency for the same sequencing effort. As expected, highly similar metrics were obtained for mapping quality, proportion of heterozygous genotypes, average minor allele frequency and nucleotide diversity in both datasets. This suggests that 3D-GBS data is as relevant as GBS data for performing different genetic analyses.
The density of SNPs captured by 3D-GBS (5.1 SNPs/Mb and 2.3 SNPs/cM with no gap > 30 cM) represents an adequate density to perform QTL mapping and GS analysis. To confirm this, the distribution of the SNPs across the physical and genetic maps has been evaluated (Fig. 1b and c, Supplementary Fig. 2). Compared to GBS-derived SNPs, the distribution of the 3D-GBS-derived SNPs was more uniform across the genome (Fig. 1b and Supplementary Fig. 2). This can be easily illustrated by (i) several regions > 5Mb on chromosomes 1, 5, 6, 12, 16 and 18 that are missed by GBS while they were covered by 3D-GBS; and (ii) the more uniform distribution of SNPs which rarely exceeds 25 SNPs/Mb in 3D-GBS, compared to GBS where many regions are covered with an “excessive” number of SNPs (25 to more than 41 SNPs/Mb; e.g. on chromosomes 4, 5, 6, 16, 18, etc.). Finally, regarding the genetic map, the 3D-GBS SNPs were well distributed with only one gap close to 20 cM on Chr11, in a region that was also poor in GBS-derived SNPs (Fig. 1c), suggesting that 3D-GBS data are as efficient as GBS data to conduct genetic analyses such as GS or QTL mapping.
The appropriate choice of enzyme(s) is an essential step in developing a GBS protocol (Hamblin and Rabbi 2014). In the original GBS protocol (Elshire et al. 2011b), the ApeKI enzyme was used as frequent cutter with sensitivity to methylation to obtain SNPs mainly distributed in gene-rich regions (hypomethylated fraction of the genome) corresponding to a coverage of ~ 4–5% of the genome (Table 1). A two-enzyme strategy using a rare (e.g. PstI) and a frequent cutter (e.g. MspI) sensitive to methylation has also been developed to significantly reduce genome complexity in species with a very large genome (e.g., barley (5 Gb) (Poland et al. 2012)). However, this approach did not show enough efficiency with species with small to medium genome size (e.g., soybean (⁓1 Gb)) as it captured relatively few genomic regions (Torkamaneh et al. 2021). Moreover, due to the palindromic nature of enzyme’s restriction sites, this produces a bias in GC content, making the two-enzyme strategy using a rare cutter (none available with 50% GC content) impossible to obtain uniform distribution of fragments in the context of a universal use. Indeed, since there is natural variation in GC content across chromosomes (Nishida 2008; Hodgkinson and Eyre-Walker 2011; Melamed-Bessudo et al. 2016; Li 2016a) and between species (Li 2016b; Karimi et al. 2018), using a rare cutter with either 33% or 66% of GC will inevitably induce variable density of restriction fragments across chromosomes and species. On the other hand, frequent cutters can have a 50% GC content, such as MspI (CCGG) or BfaI (CTAG), allowing a more even distribution of restriction fragments throughout the genome, as illustrated by Torkamaneh et al. (2021). However, when they have been used alone, these frequent cutters induce too many restriction fragments across the genome, which is contrary to the objective of reducing genome coverage.
In light of the above, we explored the idea of improving the two-enzyme approach by using a second rare cutter, such as NsiI (Fu et al. 2016), with a cutting site differing in GC content and exploiting methylation insensitivity to capture hypermethylated regions missed by PstI. The combination of NsiI with PstI and MspI presented a good opportunity to obtain a sufficient and efficient low density of SNPs distributed more evenly in the genome. While ApeKI would be expected to cut every ~ 512 bp (44.5), here, a combination of three enzymes that include PstI and NsiI (two 6-bp-cutter with differing methylation sensitivity), with a predicted cutting frequency of one site every ~ 4,096 bp (46), and MspI, a methylation-insensitive 4-bp cutter with an expected cutting frequency of one site every 256 bp (44) were used jointly to reduce the fraction of the genome that is captured. The high cutting frequency of MspI allows to generate more fragments of 100-400bp [24] that are ideal for short-read sequencing. Together, these enzymes span a broad GC, 33% for NsiI, 66% for PstI and 100% for MspI, thus creating a suitable condition to reduce genome coverage and uniformly sample different genomic regions. Finally, by focusing on fewer but well-distributed genomic regions, 3D-GBS offers an efficient and cost-effective approach for discovery and genotyping of SNPs across the genome in species with small to medium-sized genome.
Table 1
Sequencing and SNP-calling data generated from GBS and 3D-GBS libraries of 16 soybean samples.
Steps | Measured parameters | GBS | 3D-GBS |
Sequencing | Mean read count (M) | 0.6 | 0.6 |
Coverage (%)* | 4.7 | 1.2 |
Mean depth of coverage (X)¥ | 5.1 | 14.5 |
Mean mapping quality | 41 | 42 |
SNP calling | SNP count | 7,904 | 4,826 |
Proportion of missing data (%) | 33.7 | 15.3 |
Proportion of heterozygous genotypes (%) | 4.4 | 3.8 |
Average minor allele frequency (%) | 33.8 | 31.3 |
Nucleotide diversity (p per bp) | 0.43 | 0.42 |
Physical map | SNP/Mb | 8.3 | 5.1 |
Number of gaps > 5 Mb | 9 | 6 |
Number of gaps > 10 Mb | 1 | 0 |
Genetic map | SNP/cM§ | 3.7 | 2.3 |
Number of gaps > 10 cM | 7 | 10 |
Number of gaps > 20 cM | 0 | 1 |
*Fraction of the genome captured across all 16 libraries |
¥Average number of read at each sequenced position. |
§Inferred from the closest corresponding SNP on the consensus genetic map (Fallah et al. 2022). |
Optimizing the number of reads per sample to maximize multiplexing
Different numbers of reads (i.e. 50K, 100K, 200K and 300K reads) were randomly sampled three times for each accession from the 16-plex 3D-GBS library. For each metric investigated, the coefficient of variation between replicates based on the same number of reads was < 5% (not significantly different (Tukey HSD test p-value > 0.1)). For this reason, the mean value (across all three replicates) for each metric is reported in Table 2. With increasing the sequencing effort from 50K to 300K reads per sample, the fraction of the genome captured increased from 0.6 to 1%, the number of SNPs increased from 1,314 to 4,082, and the proportion of missing data decreased from 37 to 20%. Even at the smallest value tested (50K reads/sample), the proportion of missing data was still reasonable and would allow for an accurate imputation (Torkamaneh et al. 2018). For average minor allele frequency and nucleotide diversity values, equivalent results were provided across the entire range of reads per sample, suggesting that even with a very limited sequencing effort one can perform high-quality genetic diversity analysis.
The distribution of the SNPs on the genetic map was very similar from 100K to 300K reads while, with only 50K reads per sample, large gaps were detected (e.g., ~ 10 cM on Chr01, ~ 80 cM on Chr03, ~ 60 cM on Chr06) and some chromosome extremities were missed (Fig. 2). While the density of markers doubled between 100K and 300K reads, the distribution of the SNPs across the genetic map remained very similar with some regions that were denser in SNPs using 300K reads (e.g., ~ 40 cM on Chr01, ~ 60 cM on Chr04, ~ 90 cM on Chr06). This very promising result suggests that one can run 3D-GBS with only 100K reads per sample, a significant reduction in the sequencing cost, to achieve sufficient resolution (~ 2,300 SNPs, 1.1 SNP/cM) to perform GS.
In the case of mapping studies using biparental populations (i.e. QTL mapping), the number of polymorphic marker loci can significantly vary based on the relatedness of parents. To ensure that the proposed number of reads would still offer a sufficient number of markers for biparental QTL mapping, we determined the number and distribution of SNPs between the least and most genetically distant pairs of accessions within this collection. A matrix of pairwise genetic distance among the 16 accessions was produced and identified QS4054 and OAC Bright as the most genetically similar, while QS5008 and QS4067 proved to be the most distant (Supplemental Table 1). The number of polymorphic markers using 100K and 300K reads varied from 426 to 669 for the closest lines and from 677 to 1,325 for the most distant ones (Table 3). This means that for the closest lines, doubling or tripling the number of reads from 100K reads had only allowed the discovery of 32% and 36% more SNPs, respectively. In contrast, in the most distant lines, doubling or tripling of the number of reads from 100K has doubled the density of markers on the genetic map. Thus, as similar results were obtained between 200K and 300K, 200K reads per sample seems as suitable as 300K reads to perform QTL mapping in a biparental population. This represents a significant gain compared to current studies where ApeKI-based GBS protocol was used with over 1M reads per sample to conduct QTL mapping studies (de Ronne et al. 2020; St-Amour et al. 2020).
Maximizing multiplexing to minimize the sequencing cost per sample
Thanks to its efficiency and low cost, the GBS approach is commonly used to perform genome-wide genotyping for a large number of species (animal (Gurgul et al. 2019), plant (Boudhrioua et al. 2020), insect (Dupuis et al. 2017) and microorganism (Leboldus et al. 2015)) and different applications (association studies (Torkamaneh et al. 2020b) and GS (Jarquín et al. 2014; Qin et al. 2022)). Nevertheless, the cost associated with high-throughput screening for genome-wide markers remains the most limiting factor in the context of large-scale studies such as GS, genetic fingerprinting and genetic diversity studies. In association studies (GWAS), in general, the denser the catalog of SNPs, the higher the mapping resolution will be. However in contrast, in most of genetic studies (e.g., GS), linkage disequilibrium (LD) is very extensive and a low density SNP catalog is sufficient to capture linkage blocks and perform the analysis (Waldmann et al. 2008; Quiroz et al. 2019). Recent studies based on reducing the total number of SNPs by focusing on a subset with significant marker-trait associations (Spindel et al. 2016; Li et al. 2018) or based on functional annotations (Koufariotis et al. 2018), suggest that a lower-density catalog could generate prediction accuracies as high or better than dense catalogs (e.g., WGS-based genotyping) (Li et al. 2022). This has been well illustrated for GS in barley, where Abed et al. (2018) showed that a catalog of 2K GBS-SNPs provided a very similar prediction accuracy compared to 35K SNPs.
As documented before [64], to reduce the genotyping cost, one can decide to increase the multiplexing level by decreasing the sequencing effort per sample, which can, however, lead to a higher proportion of missing data that need to be imputed correctly and a non-uniform distribution of SNPs across the genome [65,66]. Here, using 3D-GBS, we showed that it is possible to produce a lower number of restriction fragments, well and uniformly distributed across the genome, to reduce the number of reads needed to provide sufficient read coverage to call genotypes efficiently. Here, we found that 100K reads is sufficient to conduct GS with 3D-GBS, and that is significantly lower compared to previous studies where GBS has been used (e.g., Qin et al., (2022) with ~ 3.3 M reads/sample, Jarquín et al. (2014) with ~ 2.6 M reads/sample and Jean et al. (2021) with ~ 1.2 M reads/sample). Similarly, we estimated the optimal number of reads per sample for an efficient genotyping of bi- and multi-parent populations. In the context of biparental populations, we estimated that 200K reads/sample is suitable for performing QTL mapping. 3D-GBS allowed a drastic reduction compared to equivalent studies using GBS where a much larger number of reads per sample were used (e.g., Yoon et al. (2019) ~ 3.2 M, Heim and Gillman et al. (2014) ~ 2.4 M, St-Amour et al. (2020) ~ 1.4 M, de Ronne et al. (2020) ~ 1.0 M and Vuong et al. (2021)) ~ 843K).
To estimate the gain of 3D-GBS over the standard GBS approach, we selected two studies conducted internally, using ApeKI-based GBS protocol and the lowest number of reads per sample for GS (Jean et al. 2021) and QTL mapping (de Ronne et al. 2020). In these study cases, with the same population, experimental design and goal, the application of 3D-GBS for GS and QTL mapping would have led to a significant reduction in per-sample sequencing cost: ~92% (~ 1.2M vs 100K reads/sample) and ~ 86% (~ 1.4M vs 200K reads/sample), respectively. All without taking into account the miniaturization of sequencing libraries which alone can reduce library preparation costs by 67% (Torkamaneh et al. 2020a). Overall, the combination of recent improvements in miniaturizing GBS library preparation procedure (i.e., NanoGBS [25]) and 3D-GBS provides a unique opportunity to dramatically reduce per-sample genotyping costs.