3D-GBS: A universal genotyping-by-sequencing approach for genomic selection and other high-throughput low-cost applications in species with small to medium-sized genomes

doi:10.21203/rs.3.rs-2226166/v1


Method Article


This work is licensed under a CC BY 4.0 License

Journal Publication

published 05 Feb, 2023

Read the published version in Plant Methods →

You are reading this latest preprint version

Despite the increased efficiency of sequencing technologies and the development of reduced-representation sequencing (RRS) approaches allowing high-throughput sequencing (HTS) of multiplexed samples, the per-sample genotyping cost remains the most limiting factor in the context of large-scale studies. For example, in the context of genomic selection (GS), breeders need genome-wide markers to predict the breeding value of large cohorts of progenies, requiring the genotyping of thousands of candidates. Here, we introduce 3D-GBS, an optimized GBS procedure, to provide an ultra-high-throughput and ultra-low-cost genotyping solution for species with small to medium-sized genomes and illustrate its use in soybean. Using a combination of three restriction enzymes (PstI/NsiI/MspI), the portion of the genome that is captured was reduced 4-fold (compared to a “standard” ApeKI-based protocol) while reducing the number of markers by only 40%. By better focusing the sequencing effort on a limited set of restriction fragments, 4-fold more samples can be genotyped at the same minimal depth of coverage. This GBS protocol also resulted in a lower proportion of missing data and provided a more uniform distribution of SNPs across the genome. Moreover, we investigated the optimal number of reads per sample needed to obtain an adequate number of markers for GS and QTL mapping (500-1,000 markers per biparental cross). This optimization allows sequencing costs to be decreased by ~ 92% and ~ 86% for GS and QTL mapping studies, respectively, compared to previously published work. Overall, 3D-GBS represents a unique and affordable solution for applications requiring extremely high-throughput genotyping where cost remains the most limiting factor.

Genotyping-by-sequencing

ultra-high-throughput genotyping

multiplexing

next-generation sequencing

genomic selection

single-nucleotide polymorphism

Genome-wide genotyping of large populations, an essential component in quantitative trait loci (QTL) mapping or genomic selection (GS) studies, is constantly improving to minimize the cost of genotyping per individual sample. The identification of large numbers of molecular markers has been paralleled by the simultaneous development of high-throughput approaches such as microarray- (Ganal et al. 2012) or sequencing-based genotyping (Rasheed et al. 2017). However, new needs related to applied breeding programs require the development of an ultra-high-throughput and cost-effective genotyping platform. SNP arrays are a popular approach (e.g., BARCSoySNP6K in soybean (Song et al. 2020) and C7AIR in rice (Morales et al. 2020)), providing robust genotype calling of multiple known polymorphic sites at the same time and across different populations, which allows for a direct comparison of data between experiments, germplasm and studies (Carvalho et al. 2007; Hyten et al. 2010). However, SNP arrays present ascertainment issues (Moragues et al. 2010), cannot target loci that were not included during array development, and need to be developed independently for each species and population (Darrier et al. 2019). In addition, the cost of genotyping using SNP arrays, even after development, is considerably higher than that of sequencing-based approaches (Elshire et al. 2011a).

While whole-genome sequencing (WGS) based genotyping remains expensive and sometimes unnecessary in the context of large-scale studies, high-throughput sequencing (HTS) of multiplexed samples combined with reduced-representation sequencing (RRS) approaches allows for cost-effective genotyping of millions of SNPs in large sets of individuals (Hirsch et al. 2014). Among RRS approaches, genotyping-by-sequencing (GBS) is the most widely used method thanks to its speed, flexibility and cost-effectiveness (Poland and Rife 2012; Narum et al. 2013; He et al. 2014). In the last decade, GBS has been widely applied in animals (Luca et al. 2011; Chen et al. 2013), plants (Zhu et al. 2016; Begali 2018) and fungi (Leboldus et al. 2015), where other genotyping tools (e.g., SNP arrays; (Ganal et al. 2012)) were not adapted (da Fonseca et al. 2016). The attractiveness of GBS has led to many optimizations related to the choice of enzymes (Sonah et al. 2013), the pipeline for calling SNPs (Torkamaneh et al. 2020c), improved marker density (double-digest GBS (Wang et al. 2017) and High-density GBS (Torkamaneh et al. 2021)), and an improved library-preparation procedure (Torkamaneh et al. 2020a). Although GBS is the most cost-effective genome-wide genotyping approach, it can still be expensive for routine screening of large populations as required in breeding programs (Thomson 2014; Pértille et al. 2016; Rasheed et al. 2017). Nevertheless, GBS could be optimized by focusing sequencing on a smaller fraction of the genome, allowing more samples to be multiplexed at a lower average sequencing coverage and thus reducing the sequencing cost per sample. Reducing genome coverage in this way will necessarily result in a lower number of markers; however, a uniform distribution of these markers is crucial for an efficient and effective genetic study. The appropriate choice of restriction enzymes can also be challenging, as their recognition sites (depending on the size of the site, the enzyme's sensitivity to methylation, and the GC content of the site) are not uniformly distributed across genomes (Nishida 2008; Hodgkinson and Eyre-Walker 2011; Melamed-Bessudo et al. 2016; Li 2016a).

The number of required reads is another determining factor in multiplexing and throughput. Despite the various improvements in GBS methods, the estimation of the number of reads for each sample required to achieve an efficient genotyping needs to be determined on a case-by-case basis (Torkamaneh et al. 2020d). An insufficient number of reads per sample will result in a high proportion of missing data, a reduced number of SNP loci at which genotypes can be successfully called and, possibly, an uneven distribution of markers across the genome (Torkamaneh and Belzile 2015; Huang and Lacey Knowles 2016; Eaton et al. 2017). In contrast, an excessive number of reads results in an inefficient use of the sequencing effort and therefore, unnecessarily increases per-sample cost (Beissinger et al. 2013). Thus, finding an optimal number of reads per sample can also help minimize per-sample sequencing cost.

To optimize the multiplexing capacity of GBS, a combination of three restriction enzymes, hence 3D-GBS, was tested on soybean to reduce the number of digested DNA fragments and improve the distribution of markers across the genome. Also, we investigated the optimal number of reads per sample to maximize multiplexing capacity on a single sequencing run and thereby, significantly minimize the sequencing cost per sample. This approach will greatly facilitate the adoption of ultra-high-throughput genome-wide genotyping where the per-sample cost remains a limiting factor for various applications.

Biological materials

To compare the GBS (Elshire et al. 2011a) and 3D-GBS methods, sixteen soybean accessions (QS4049, QS4054, QS4067, QS5008, QS4028, QS4043, QS5017, OAC Klondike, OAC Bright, Altesse, OAC Inwood, OAC Thames, OAC 08-18C, OAC Morris, OAC Embro and OAC McCall; provided by Dr. Louise O’Donoughue at CEROM, Quebec, QC, Canada) were used in this study. These accessions were selected based on the availability of GBS data (Sonah et al. 2015). For each accession, seeds were grown in a growth chamber. Then, approximately 100 mg of young leaf tissues were collected for DNA extraction. Collected leaf tissues were dried for 4 days using a desiccating agent (Drierite; Xenia, OH, USA) and then ground with metallic beads in a RETSCH MM 400 mixer mill (Fisher Scientific, MA, USA). DNA was extracted using the DNeasy Plant Mini Kit (Qiagen, MD, USA) according to the manufacturer’s protocol. DNA quantification was done with a Qubit fluorometer using the dsDNA HS assay kit (Thermo Fisher Scientific, MA, USA) and subsequently adjusted to 10 ng/µl for each sample.

3D-GBS library preparation

Choice of enzymes. The restriction enzymes for 3D-GBS were selected based on their sensitivity to methylation and the size of their recognition site compared to ApeKI, the enzyme used in the standard GBS protocol for soybean. ApeKI is a 5-bp cutter with one degenerate position and 80% GC content (G*CWGC). Here, we used the following enzymes: PstI, a 6-bp cutter with 66% GC content (CTGCA*G), NsiI, a 6-bp cutter with 33% GC content (ATGCA*T), and MspI, a 4-bp cutter with 100% GC content (C*CGG). ApeKI and PstI are partially sensitive and sensitive to cytosine methylation, respectively, while NsiI and MspI are not sensitive to cytosine methylation.
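The GC contents quoted above can be checked directly from the recognition sequences. The following is a minimal sketch, assuming that the degenerate base W (A or T) contributes no G/C; the dictionary and function names are illustrative only and are not part of any published pipeline.

```python
# Expected G/C contribution of each IUPAC symbol used below
# (W = A or T, so it never contributes a G or C).
GC_WEIGHT = {"A": 0.0, "T": 0.0, "G": 1.0, "C": 1.0, "W": 0.0}

# Recognition sites as quoted in the text, without the cut-position marker (*).
SITES = {"ApeKI": "GCWGC", "PstI": "CTGCAG", "NsiI": "ATGCAT", "MspI": "CCGG"}


def gc_content(site: str) -> float:
    """Return the GC fraction of a recognition site."""
    return sum(GC_WEIGHT[base] for base in site) / len(site)


for enzyme, site in SITES.items():
    print(f"{enzyme}: {site} ({len(site)} bp), GC fraction = {gc_content(site):.3f}")
```

PstI and NsiI come out at 4/6 and 2/6, which the text reports as 66% and 33%.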

Library preparation. 3D-GBS libraries were prepared on a reduced scale (5 µL reaction volume) according to the NanoGBS protocol (Torkamaneh et al. 2020a) with the three selected enzymes (PstI, NsiI and MspI). Briefly, a total of 10 ng of genomic DNA of each sample was used for digestion with the restriction-enzyme mix and then ligation with sample-specific barcoded adapters. The 5’ adapters had an overhang compatible with the common overhang produced by PstI and NsiI, while the 3’ adapters had an overhang compatible with that produced by MspI. Then, individual libraries were pooled and a size-selection (50–350 bp) step was done using a BluePippin apparatus (Sage Science, MA, USA). PCR amplification (12 cycles), enrichment, and PCR clean-up were performed before quality control, quantification, and purity assessments of DNA libraries with a spectrophotometer (Nanodrop 1000, Fisher Scientific, MA, USA) and a Bioanalyzer 2100 (Agilent Technologies, CA, USA). The 3D-GBS libraries were then sequenced on an Ion Torrent instrument (Thermo Fisher Scientific, MA, USA) on Ion Proton 540 chips at the Genomic Analysis Platform of the Institut de Biologie Intégrative et des Systèmes (Université Laval, QC, Canada).

Data analysis

Sequencing and genotyping. Sequencing data were processed using the Fast-GBS v2.0 pipeline (Torkamaneh et al. 2020c) and the Wm82.a2 soybean reference genome (Gmax_275_Wm82.a2.v1, (Schmutz et al. 2010)) for SNP calling. For GBS and 3D-GBS analyses, variant calls were filtered with VCFtools (Danecek et al. 2011) to remove low-quality SNPs (QUAL < 10 and MQ < 30), variants residing on unassembled scaffolds and indels. Then, only biallelic markers with missing data < 0.8 and heterozygosity < 0.1 were retained. The genome coverage (fraction of the genome captured) was determined with the function ‘coverage’ in Samtools (Danecek et al. 2021) while the mean depth of coverage (sequencing coverage) was calculated using VCFtools with the function ‘--depth’. The proportion of missing data and heterozygous calls, average minor allele frequency and nucleotide diversity (PiPerBP) were estimated using TASSEL v.5 (Bradbury et al. 2007).
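The site-level thresholds described above (missing data < 0.8 and heterozygosity < 0.1) can be illustrated with a small, self-contained sketch. It is not the VCFtools or Fast-GBS implementation: the genotype encoding, the function names and the toy data are assumptions made for illustration, and heterozygosity is assumed here to be computed among called genotypes only.

```python
MISSING = "./."  # assumed encoding of an uncalled genotype


def site_stats(genotypes):
    """Return (missing fraction, heterozygous fraction among called genotypes)."""
    called = [g for g in genotypes if g != MISSING]
    missing = 1 - len(called) / len(genotypes)
    # A call such as "0/1" is heterozygous when its two alleles differ.
    het = sum(g[0] != g[-1] for g in called) / len(called) if called else 0.0
    return missing, het


def keep_site(genotypes, max_missing=0.8, max_het=0.1):
    """Apply the two thresholds quoted in the text to one biallelic site."""
    missing, het = site_stats(genotypes)
    return missing < max_missing and het < max_het


# Toy genotype calls for 16 samples (invented values, not data from this study).
toy_sites = {
    "site_A": ["0/0"] * 10 + ["1/1"] * 4 + [MISSING] * 2,  # passes both filters
    "site_B": ["0/1"] * 8 + ["0/0"] * 8,                   # too heterozygous
    "site_C": ["1/1"] * 2 + [MISSING] * 14,                # too much missing data
}
retained = [name for name, gts in toy_sites.items() if keep_site(gts)]
print(retained)
```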

Distribution of markers on the physical and genetic maps. The distribution of markers across the physical map was based on the VCF files generated after Fast-GBS analysis and SNP filtration, using the rMVP package in R (Yin et al. 2021). For genetic maps, the genetic position of each SNP was inferred from the closest corresponding SNP on the consensus genetic map based on GBS-derived SNP markers (Fallah et al. 2022). Then, the distribution of markers across the genetic maps was evaluated using the QTL IciMapping v4.1 software (Meng et al. 2015).
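Assigning each SNP the genetic position of its closest neighbour on a consensus map amounts to a nearest-neighbour lookup on sorted physical positions. The sketch below shows one way to do this for a single chromosome; the function name and the toy map values are hypothetical and do not come from the consensus map cited above.

```python
from bisect import bisect_left


def nearest_genetic_position(pos_bp, map_bp, map_cm):
    """Return the cM position of the map marker physically closest to pos_bp.

    map_bp must be sorted in ascending order; map_cm holds the matching
    genetic positions. Ties are resolved in favour of the upstream marker.
    """
    i = bisect_left(map_bp, pos_bp)
    if i == 0:
        return map_cm[0]
    if i == len(map_bp):
        return map_cm[-1]
    before, after = map_bp[i - 1], map_bp[i]
    return map_cm[i] if after - pos_bp < pos_bp - before else map_cm[i - 1]


# Toy consensus map for one chromosome (invented values).
map_bp = [100_000, 500_000, 900_000]
map_cm = [0.0, 2.5, 7.0]

for snp in (120_000, 480_000, 760_000):
    print(snp, "->", nearest_genetic_position(snp, map_bp, map_cm), "cM")
```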

Random sampling of reads. Different subsets of reads (i.e., 50K, 100K, 200K and 300K reads) were randomly sampled three times for each of the 16 accessions using the ‘sample’ function of seqtk (Li 2012). Then, sequencing data as well as the number and distribution of SNPs were assessed as mentioned above to compare results generated from each read subgroup. To investigate 3D-GBS results for biparental crosses, the two genetically closest and the two most distant accessions were determined by using a matrix of pairwise distances generated with TASSEL v.5 [51].
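The idea of drawing a fixed number of reads at random, repeated with different seeds to obtain replicates, can be sketched with classic reservoir sampling. This is an illustration of the general technique only and is not meant to describe how seqtk works internally; the function name, seeds and toy read identifiers are assumptions.

```python
import random


def sample_reads(reads, k, seed):
    """Draw k reads uniformly at random from an iterable in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, read in enumerate(reads):
        if i < k:
            reservoir.append(read)
        else:
            # Replace a stored read with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = read
    return reservoir


# Toy example: 1,000 read identifiers, three replicate subsets of 50 reads.
toy_reads = [f"read_{i}" for i in range(1000)]
replicates = [sample_reads(toy_reads, 50, seed) for seed in (1, 2, 3)]
print([len(r) for r in replicates])
```

Using a different seed for each replicate yields independent subsets, while re-using a seed reproduces the same subset.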

New enzyme combinations for an efficient and uniform capture of the genome

In this study, sixteen DNA samples that had been previously genotyped with the original ApeKI-based GBS protocol were used to produce 3D-GBS libraries. The 16-plex GBS and 3D-GBS libraries produced ~ 21.1M (ranging from ~ 800K to ~ 2.9M reads/sample) and ~ 10.4M (ranging from ~ 300K to ~ 800K reads/sample) reads, respectively. First, the distribution of the SNPs derived from PstI-MspI and NsiI-MspI reads was investigated to assess the relevance of this enzyme combination (Fig. 1a and Supplementary Fig. 1). We found that 76.5% and 23.5% of reads were NsiI-MspI and PstI-MspI reads, respectively, encompassing 4,206 and 620 SNPs, respectively. The higher proportion of NsiI-MspI-derived fragments and SNPs could be expected because of the methylation insensitivity of NsiI and the lower GC content of its recognition site compared to PstI. Nevertheless, PstI-MspI-derived fragments ensured the coverage of large gaps devoid of NsiI-MspI-derived fragments (e.g., on chromosomes 9, 11 and 20).

To perform a meaningful comparison, the same overall number of reads for each 16-plex library was used to compare the two protocols; as the number of reads per accession varied, an identical number of mapped reads for a given accession in each of the two libraries was used to compare GBS and 3D-GBS (Table 1). As expected, with 3D-GBS, a lower fraction of the genome was captured compared to GBS (genome coverage of 1.2% vs 4.7%, respectively). As the sequencing effort (i.e., the number of reads per sample) was focused on a smaller fraction of the genome, the mean depth of coverage was 3-fold higher in 3D-GBS compared to GBS (14.5X vs 5.1X, respectively) resulting in a lower proportion of missing data (15.3% vs 33.7%, respectively). Fortunately, while the genome coverage was 75% lower for 3D-GBS than GBS data, the number of SNPs identified was only 40% lower (4,826 vs 7,904 SNPs, respectively), showing that 3D-GBS either captures more polymorphic regions of the genome or improves the genotyping efficiency for the same sequencing effort. As expected, highly similar metrics were obtained for mapping quality, proportion of heterozygous genotypes, average minor allele frequency and nucleotide diversity in both datasets. This suggests that 3D-GBS data is as relevant as GBS data for performing different genetic analyses.

The density of SNPs captured by 3D-GBS (5.1 SNPs/Mb and 2.3 SNPs/cM with no gap > 30 cM) represents an adequate density to perform QTL mapping and GS analysis. To confirm this, the distribution of the SNPs across the physical and genetic maps was evaluated (Fig. 1b and c, Supplementary Fig. 2). Compared to GBS-derived SNPs, the distribution of the 3D-GBS-derived SNPs was more uniform across the genome (Fig. 1b and Supplementary Fig. 2). This is illustrated by (i) several regions > 5 Mb on chromosomes 1, 5, 6, 12, 16 and 18 that were missed by GBS but covered by 3D-GBS; and (ii) the more uniform distribution of SNPs, which rarely exceeds 25 SNPs/Mb in 3D-GBS, compared to GBS where many regions are covered with an “excessive” number of SNPs (25 to more than 41 SNPs/Mb; e.g., on chromosomes 4, 5, 6, 16 and 18). Finally, regarding the genetic map, the 3D-GBS SNPs were well distributed with only one gap close to 20 cM on Chr11, in a region that was also poor in GBS-derived SNPs (Fig. 1c), suggesting that 3D-GBS data are as efficient as GBS data to conduct genetic analyses such as GS or QTL mapping.

The appropriate choice of enzyme(s) is an essential step in developing a GBS protocol (Hamblin and Rabbi 2014). In the original GBS protocol (Elshire et al. 2011b), the ApeKI enzyme was used as a frequent cutter with sensitivity to methylation to obtain SNPs mainly distributed in gene-rich regions (hypomethylated fraction of the genome) corresponding to a coverage of ~ 4–5% of the genome (Table 1). A two-enzyme strategy using a rare (e.g., PstI) and a frequent cutter (e.g., MspI) sensitive to methylation has also been developed to significantly reduce genome complexity in species with a very large genome (e.g., barley (5 Gb) (Poland et al. 2012)). However, this approach was not efficient enough in species with a small to medium genome size (e.g., soybean (~ 1 Gb)) as it captured relatively few genomic regions (Torkamaneh et al. 2021). Moreover, the palindromic nature of restriction sites produces a bias in GC content (no rare cutter is available with 50% GC content), which makes it impossible for a two-enzyme strategy based on a single rare cutter to yield a uniform distribution of fragments in the context of universal use. Indeed, since there is natural variation in GC content across chromosomes (Nishida 2008; Hodgkinson and Eyre-Walker 2011; Melamed-Bessudo et al. 2016; Li 2016a) and between species (Li 2016b; Karimi et al. 2018), using a rare cutter with either 33% or 66% GC content will inevitably induce a variable density of restriction fragments across chromosomes and species. On the other hand, frequent cutters can have a 50% GC content, such as BfaI (CTAG), allowing a more even distribution of restriction fragments throughout the genome, as illustrated by Torkamaneh et al. (2021). However, when used alone, these frequent cutters produce too many restriction fragments across the genome, which is contrary to the objective of reducing genome coverage.

In light of the above, we explored the idea of improving the two-enzyme approach by using a second rare cutter, such as NsiI (Fu et al. 2016), with a cutting site differing in GC content and exploiting its methylation insensitivity to capture hypermethylated regions missed by PstI. The combination of NsiI with PstI and MspI presented a good opportunity to obtain a sufficient yet low density of SNPs distributed more evenly across the genome. While ApeKI would be expected to cut every ~ 512 bp (4^4.5), here, a combination of three enzymes was used jointly to reduce the fraction of the genome that is captured: PstI and NsiI (two 6-bp cutters with differing methylation sensitivity), each with a predicted cutting frequency of one site every ~ 4,096 bp (4⁶), and MspI, a methylation-insensitive 4-bp cutter with an expected cutting frequency of one site every 256 bp (4⁴). The high cutting frequency of MspI makes it possible to generate more fragments of 100–400 bp [24] that are ideal for short-read sequencing. Together, these enzymes span a broad range of GC content (33% for NsiI, 66% for PstI and 100% for MspI), thus creating suitable conditions to reduce genome coverage and uniformly sample different genomic regions. Finally, by focusing on fewer but well-distributed genomic regions, 3D-GBS offers an efficient and cost-effective approach for the discovery and genotyping of SNPs across the genome in species with small to medium-sized genomes.
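The expected cutting intervals quoted above (4^4.5, 4⁶ and 4⁴) follow from the probability of finding each recognition site at a given position. The sketch below reproduces them under a simplifying assumption that is not a claim about real genomes: a random sequence with equal base frequencies and no methylation effects. The names used are illustrative.

```python
import math

# Number of nucleotides matched by each IUPAC symbol (W = A or T).
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1, "W": 2}


def expected_interval(site: str) -> float:
    """Expected spacing (bp) between occurrences of a recognition site,
    assuming a random sequence with equal base frequencies."""
    p_match = math.prod(DEGENERACY[base] / 4 for base in site)
    return 1 / p_match


for enzyme, site in {"ApeKI": "GCWGC", "PstI": "CTGCAG",
                     "NsiI": "ATGCAT", "MspI": "CCGG"}.items():
    print(f"{enzyme} ({site}): one site every ~{expected_interval(site):,.0f} bp")
```

The degenerate W position in the ApeKI site matches two of the four bases, which is why that site behaves like a 4.5-bp cutter (4^4.5 = 512).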

Table 1

Sequencing and SNP-calling data generated from GBS and 3D-GBS libraries of 16 soybean samples.
Steps	Measured parameters	GBS	3D-GBS
Sequencing	Mean read count (M)	0.6	0.6
	Coverage (%)*	4.7	1.2
	Mean depth of coverage (X)^¥	5.1	14.5
	Mean mapping quality	41	42
SNP calling	SNP count	7,904	4,826
	Proportion of missing data (%)	33.7	15.3
	Proportion of heterozygous genotypes (%)	4.4	3.8
	Average minor allele frequency (%)	33.8	31.3
	Nucleotide diversity (π per bp)	0.43	0.42
Physical map	SNP/Mb	8.3	5.1
	Number of gaps > 5 Mb	9	6
	Number of gaps > 10 Mb	1	0
Genetic map	SNP/cM^§	3.7	2.3
	Number of gaps > 10 cM	7	10
	Number of gaps > 20 cM	0	1
*Fraction of the genome captured across all 16 libraries
^¥Average number of reads at each sequenced position.
^§Inferred from the closest corresponding SNP on the consensus genetic map (Fallah et al. 2022).

Optimizing the number of reads per sample to maximize multiplexing

Different numbers of reads (i.e., 50K, 100K, 200K and 300K reads) were randomly sampled three times for each accession from the 16-plex 3D-GBS library. For each metric investigated, the coefficient of variation between replicates based on the same number of reads was < 5% (not significantly different; Tukey HSD test, p-value > 0.1). For this reason, the mean value (across all three replicates) for each metric is reported in Table 2. As the sequencing effort increased from 50K to 300K reads per sample, the fraction of the genome captured increased from 0.6 to 1%, the number of SNPs increased from 1,314 to 4,082, and the proportion of missing data decreased from 37 to 20%. Even at the smallest value tested (50K reads/sample), the proportion of missing data was still reasonable and would allow for accurate imputation (Torkamaneh et al. 2018). For average minor allele frequency and nucleotide diversity, equivalent results were obtained across the entire range of reads per sample, suggesting that even with a very limited sequencing effort one can perform high-quality genetic diversity analysis.

The distribution of the SNPs on the genetic map was very similar from 100K to 300K reads while, with only 50K reads per sample, large gaps were detected (e.g., ~ 10 cM on Chr01, ~ 80 cM on Chr03, ~ 60 cM on Chr06) and some chromosome extremities were missed (Fig. 2). While the density of markers doubled between 100K and 300K reads, the distribution of the SNPs across the genetic map remained very similar with some regions that were denser in SNPs using 300K reads (e.g., ~ 40 cM on Chr01, ~ 60 cM on Chr04, ~ 90 cM on Chr06). This very promising result suggests that one can run 3D-GBS with only 100K reads per sample, a significant reduction in the sequencing cost, to achieve sufficient resolution (~ 2,300 SNPs, 1.1 SNP/cM) to perform GS.

In the case of mapping studies using biparental populations (i.e., QTL mapping), the number of polymorphic marker loci can vary significantly depending on the relatedness of the parents. To ensure that the proposed number of reads would still offer a sufficient number of markers for biparental QTL mapping, we determined the number and distribution of SNPs between the least and most genetically distant pairs of accessions within this collection. A matrix of pairwise genetic distances among the 16 accessions identified QS4054 and OAC Bright as the most genetically similar, while QS5008 and QS4067 proved to be the most distant (Supplementary Table 1). The number of polymorphic markers using 100K and 300K reads varied from 426 to 669 for the closest lines and from 677 to 1,325 for the most distant ones (Table 3). This means that, for the closest lines, doubling or tripling the number of reads from 100K increased the number of SNPs by only 48% and 57%, respectively (the additional SNPs representing 32% and 36% of those detected at 200K and 300K reads). In contrast, for the most distant lines, doubling or tripling the number of reads from 100K increased the density of markers on the genetic map by up to twofold (from 0.33 to 0.57 and 0.65 SNP/cM, respectively). Thus, as similar results were obtained between 200K and 300K reads, 200K reads per sample seems as suitable as 300K reads to perform QTL mapping in a biparental population. This represents a significant gain compared to current studies where the ApeKI-based GBS protocol was used with over 1M reads per sample to conduct QTL mapping studies (de Ronne et al. 2020; St-Amour et al. 2020).
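The gain in polymorphic markers obtained by adding reads can be recomputed from the SNP counts in Table 3. The short sketch below reports two different ratios that are easy to confuse: the increase relative to the 100K-read count, and the share of the larger SNP set that is new. Function and variable names are illustrative.

```python
# SNP counts from Table 3, keyed by reads per sample (in thousands).
snp_counts = {
    "closest": {100: 426, 200: 630, 300: 669},
    "farthest": {100: 677, 200: 1165, 300: 1325},
}


def gain_vs_baseline(counts, baseline=100):
    """Relative increase in SNP count compared to the baseline read number."""
    base = counts[baseline]
    return {k: (n - base) / base for k, n in counts.items() if k != baseline}


def share_of_new_snps(counts, baseline=100):
    """Fraction of the SNPs found at a higher read number that were not
    already counted at the baseline read number."""
    base = counts[baseline]
    return {k: (n - base) / n for k, n in counts.items() if k != baseline}


for cross, counts in snp_counts.items():
    gains = {k: f"{v:.0%}" for k, v in gain_vs_baseline(counts).items()}
    shares = {k: f"{v:.0%}" for k, v in share_of_new_snps(counts).items()}
    print(cross, "increase vs 100K:", gains, "| share of new SNPs:", shares)
```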

Maximizing multiplexing to minimize the sequencing cost per sample

Thanks to its efficiency and low cost, the GBS approach is commonly used to perform genome-wide genotyping for a large number of species (animal (Gurgul et al. 2019), plant (Boudhrioua et al. 2020), insect (Dupuis et al. 2017) and microorganism (Leboldus et al. 2015)) and different applications (association studies (Torkamaneh et al. 2020b) and GS (Jarquín et al. 2014; Qin et al. 2022)). Nevertheless, the cost associated with high-throughput screening for genome-wide markers remains the most limiting factor in the context of large-scale studies such as GS, genetic fingerprinting and genetic diversity studies. In association studies (GWAS), in general, the denser the catalog of SNPs, the higher the mapping resolution will be. In contrast, in most genetic studies (e.g., GS), linkage disequilibrium (LD) is very extensive and a low-density SNP catalog is sufficient to capture linkage blocks and perform the analysis (Waldmann et al. 2008; Quiroz et al. 2019). Recent studies that reduced the total number of SNPs, either by focusing on a subset with significant marker-trait associations (Spindel et al. 2016; Li et al. 2018) or by relying on functional annotations (Koufariotis et al. 2018), suggest that a lower-density catalog could generate prediction accuracies as high as or better than dense catalogs (e.g., WGS-based genotyping) (Li et al. 2022). This has been well illustrated for GS in barley, where Abed et al. (2018) showed that a catalog of 2K GBS-SNPs provided a very similar prediction accuracy compared to 35K SNPs.

As documented before [64], to reduce the genotyping cost, one can decide to increase the multiplexing level by decreasing the sequencing effort per sample, which can, however, lead to a higher proportion of missing data that need to be imputed correctly and a non-uniform distribution of SNPs across the genome [65,66]. Using 3D-GBS, we showed that it is possible to produce a lower number of restriction fragments, uniformly distributed across the genome, and thereby reduce the number of reads needed to provide sufficient read coverage to call genotypes efficiently. We found that 100K reads are sufficient to conduct GS with 3D-GBS, which is significantly lower than in previous studies where GBS has been used (e.g., Qin et al. (2022) with ~ 3.3 M reads/sample, Jarquín et al. (2014) with ~ 2.6 M reads/sample and Jean et al. (2021) with ~ 1.2 M reads/sample). Similarly, we estimated the optimal number of reads per sample for efficient genotyping of bi- and multi-parent populations. In the context of biparental populations, we estimated that 200K reads/sample is suitable for performing QTL mapping. 3D-GBS allowed a drastic reduction compared to equivalent studies using GBS where a much larger number of reads per sample was used (e.g., Yoon et al. (2019) ~ 3.2 M, Heim and Gillman et al. (2014) ~ 2.4 M, St-Amour et al. (2020) ~ 1.4 M, de Ronne et al. (2020) ~ 1.0 M and Vuong et al. (2021) ~ 843K).

To estimate the gain of 3D-GBS over the standard GBS approach, we selected two studies conducted internally, using the ApeKI-based GBS protocol and the lowest number of reads per sample for GS (Jean et al. 2021) and QTL mapping (de Ronne et al. 2020). In these case studies, with the same population, experimental design and goal, the application of 3D-GBS for GS and QTL mapping would have led to a significant reduction in per-sample sequencing cost: ~ 92% (~ 1.2M vs 100K reads/sample) and ~ 86% (~ 1.4M vs 200K reads/sample), respectively. These estimates do not take into account the miniaturization of sequencing libraries, which alone can reduce library preparation costs by 67% (Torkamaneh et al. 2020a). Overall, the combination of recent improvements in miniaturizing the GBS library preparation procedure (i.e., NanoGBS [25]) and 3D-GBS provides a unique opportunity to dramatically reduce per-sample genotyping costs.
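The cost reductions quoted above can be reproduced with simple arithmetic, assuming that per-sample sequencing cost is proportional to the number of reads allocated to each sample. The sketch below also shows how multiplexing capacity scales with reads per sample; the run output of 60 million reads is a purely hypothetical figure chosen for illustration, not a value reported in this study.

```python
def read_reduction(reads_before: int, reads_after: int) -> float:
    """Fractional reduction in reads (and, by assumption, in sequencing cost)."""
    return 1 - reads_after / reads_before


def samples_per_run(run_reads: int, reads_per_sample: int) -> int:
    """Number of samples that fit in one run at a given read allocation."""
    return run_reads // reads_per_sample


# Read numbers per sample as quoted in the text.
gs_reduction = read_reduction(1_200_000, 100_000)   # GS: ~1.2M vs 100K
qtl_reduction = read_reduction(1_400_000, 200_000)  # QTL mapping: ~1.4M vs 200K
print(f"GS: {gs_reduction:.1%}, QTL mapping: {qtl_reduction:.1%}")

HYPOTHETICAL_RUN_READS = 60_000_000  # assumed run output, for illustration only
for reads in (1_200_000, 200_000, 100_000):
    n = samples_per_run(HYPOTHETICAL_RUN_READS, reads)
    print(f"{reads:>9,} reads/sample -> {n} samples per run")
```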

Recent advances in NGS technologies have enabled the massively parallel processing of hundreds of samples efficiently and cost-effectively, a prerequisite for genetic studies such as QTL mapping and GS. However, genotyping remains costly in the context of large-scale studies such as GS, as breeding programs typically produce many thousands of selection candidates each year. In the continuous effort to reduce the genotyping cost for scientific research and applied needs, 3D-GBS makes it possible to maximize multiplexing capacity, achieve the ultra-high throughput needed in a wide range of applications, and thus decrease the sequencing cost per sample. While we demonstrated the efficiency of 3D-GBS using soybean samples, this method could easily be used across a wide range of species with small to medium genome size.

Table 2

Variant calling using different subsets of reads derived from 3D-GBS on 16 soybean samples.
Step	Measured parameters	50K reads	100K reads	200K reads	300K reads
Sequencing	Coverage (%)*	0.6	0.7	0.9	1
	Mean depth of coverage (X)^¥	2.7	4.1	6.3	8.4
SNP calling	SNP count	1,314	2,299	3,587	4,082
	Proportion of missing data (%)	37.1	29.3	23.3	20.3
	Proportion of heterozygous genotypes (%)	6.1	5.2	5	4.6
	Average minor allele frequency (%)	27.3	27.2	26.2	25.9
	Nucleotide diversity (π per bp)	0.36	0.36	0.35	0.35
Physical map	SNP/Mb	1.4	2.4	3.8	4.3
	Number of gaps > 5 Mb	23	9	5	7
	Number of gaps > 10 Mb	6	1	1	0
Genetic map	SNP/cM^§	0.6	1.1	1.7	2
	Number of gaps > 10 cM	29	18	12	9
	Number of gaps > 20 cM	2	1	1	1
*Total genome fraction captured by the 16 libraries
^¥Average number of reads at each sequenced position.
^§Inferred from the closest corresponding SNP on the consensus genetic map (Fallah et al. 2022).

Table 3

Analysis of SNP density obtained with different numbers of reads for two hypothetical biparental crosses. The genetically closest accessions were QS4054 and OAC Bright; the farthest were QS5008 and QS4067.
Reads per sample (K)	100	100	200	200	300	300
Crossing	Closest accessions	Farthest accessions	Closest accessions	Farthest accessions	Closest accessions	Farthest accessions
SNP count	426	677	630	1,165	669	1,325
SNP/Mb	0.5	0.7	0.7	1.2	0.7	1.4
SNP/cM	0.21	0.33	0.31	0.57	0.33	0.65

Next-generation sequencing (NGS), High-Throughput sequencing (HTS), Single-nucleotide polymorphism (SNP), Quantitative trait loci (QTL), Genome-wide association study (GWAS), Marker-assisted selection (MAS), Genomic selection (GS), Whole-genome sequencing (WGS), Reduced-representation sequencing (RRS), Restriction-associated DNA (RAD), complexity reduction of polymorphic sequences (CRoPS), Genotyping-by-sequencing (GBS), Linkage disequilibrium (LD).

Ethical Approval

Not applicable.

Competing Interests

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Authors’ Contributions

MDR, BB, FB and DT conceived the concept of 3D-GBS. MDR and GL conducted the experiments. MDR conducted data analysis. MDR, DT and FB contributed to writing the manuscript. All authors read and approved the manuscript.

Funding

This work was funded by Genome Canada [#6548] under the Genomic Applications Partnership Program (GAPP).

Availability of data and materials

The VCF files generated from the sequencing data and used for the analyses of this study are on FigShare.com and will be accessible after acceptance of the manuscript.

Acknowledgments

The authors wish to thank Génome Québec, Genome Canada, the government of Canada, the Ministère de l’Économie et de l’Innovation du Québec, the Canadian Field Crop Research Alliance, Semences Prograin Inc., Sollio Agriculture, Grain Farmers of Ontario, Barley Council of Canada, and Université Laval.

Abed A, Pérez-Rodríguez P, Crossa J, Belzile F (2018) When less can be better: How can we make genomic selection more cost-effective and accurate in barley? Theor Appl Genet 131:1873–1890. https://doi.org/10.1007/s00122-018-3120-8
Bastien M, Sonah H, Belzile F (2014) Genome Wide Association Mapping of Sclerotinia sclerotiorum Resistance in Soybean with a Genotyping‐by‐Sequencing Approach. Plant Genome 7:0. https://doi.org/10.3835/plantgenome2013.10.0030
Begali H (2018) A Pipeline for Markers Selection Using Restriction Site Associated DNA Sequencing (Radseq). J Appl Bioinforma Comput Biol 07: https://doi.org/10.4172/2329-9533.1000147
Beissinger TM, Hirsch CN, Sekhon RS, et al (2013) Marker density and read depth for genotyping populations using genotyping-by-sequencing. Genetics 193:1073–1081. https://doi.org/10.1534/genetics.112.147710
Boudhrioua C, Bastien M, Torkamaneh D, Belzile F (2020) Genome-wide association mapping of Sclerotinia sclerotiorum resistance in soybean using whole-genome resequencing data. BMC Plant Biol 20:1–24. https://doi.org/10.1186/s12870-020-02401-8
Bradbury PJ, Zhang Z, Kroon DE, et al (2007) TASSEL: Software for association mapping of complex traits in diverse samples. Bioinformatics 23:2633–2635. https://doi.org/10.1093/bioinformatics/btm308
Carvalho B, Bengtsson H, Speed TP, Irizarry RA (2007) Exploration, normalization, and genotype calls of high-density oligonucleotide SNP array data. Biostatistics 8:485–499. https://doi.org/10.1093/biostatistics/kxl042
Chen Q, Ma Y, Yang Y, et al (2013) Genotyping by Genome Reducing and Sequencing for Outbred Animals. PLoS One 8:e67500. https://doi.org/10.1371/journal.pone.0067500
da Fonseca RR, Albrechtsen A, Themudo GE, et al (2016) Next-generation biology: Sequencing and data analysis approaches for non-model organisms. Mar Genomics 30:3–13. https://doi.org/10.1016/j.margen.2016.04.012
Danecek P, Auton A, Abecasis G, et al (2011) The variant call format and VCFtools. Bioinformatics 27:2156–2158. https://doi.org/10.1093/bioinformatics/btr330
Danecek P, Bonfield JK, Liddle J, et al (2021) Twelve years of SAMtools and BCFtools. Gigascience 10:. https://doi.org/10.1093/gigascience/giab008
Darrier B, Russell J, Milner SG, et al (2019) A comparison of mainstream genotyping platforms for the evaluation and use of barley genetic resources. Front Plant Sci 10:544. https://doi.org/10.3389/fpls.2019.00544
de Ronne M, Labbé C, Lebreton A, et al (2020) Integrated QTL mapping, gene expression and nucleotide variation analyses to investigate complex quantitative traits: a case study with the soybean–Phytophthora sojae interaction. Plant Biotechnol J 18:1492–1494. https://doi.org/10.1111/pbi.13301
Dupuis JR, Brunet BMT, Bird HM, et al (2017) Genome-wide SNPs resolve phylogenetic relationships in the North American spruce budworm (Choristoneura fumiferana) species complex. Mol Phylogenet Evol 111:158–168. https://doi.org/10.1016/j.ympev.2017.04.001
Eaton DAR, Spriggs EL, Park B, Donoghue MJ (2017) Misconceptions on missing data in RAD-seq phylogenetics with a deep-scale example from flowering plants. Syst Biol 66:399–412. https://doi.org/10.1093/sysbio/syw092
Elshire RJ, Glaubitz JC, Sun Q, et al (2011a) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:1–46. https://doi.org/10.1371/journal.pone.0019379
Elshire RJ, Glaubitz JC, Sun Q, et al (2011b) A robust, simple genotyping-by-sequencing (GBS) approach for high diversity species. PLoS One 6:1–10. https://doi.org/10.1371/journal.pone.0019379
Fallah M, Jean M, Boucher St-Amour VT, et al (2022) The construction of a high-density consensus genetic map for soybean based on SNP markers derived from genotyping-by-sequencing. Genome 65:413–425. https://doi.org/10.1139/gen-2021-0054
Fu YB, Peterson GW, Dong Y (2016) Increasing genome sampling and improving SNP genotyping for genotyping-by-sequencing with new combinations of restriction enzymes. G3 Genes, Genomes, Genet 6:845–856. https://doi.org/10.1534/g3.115.025775
Ganal MW, Polley A, Graner EM, et al (2012) Large SNP arrays for genotyping in crop plants. J Biosci 37:821–828. https://doi.org/10.1007/s12038-012-9225-3
Gurgul A, Miksza-Cybulska A, Szmatoła T, et al (2019) Genotyping-by-sequencing performance in selected livestock species. Genomics 111:186–195. https://doi.org/10.1016/j.ygeno.2018.02.002
Hamblin MT, Rabbi IY (2014) The effects of restriction-enzyme choice on properties of genotyping-by-sequencing libraries: A study in Cassava (Manihot esculenta). Crop Sci 54:2603–2608. https://doi.org/10.2135/cropsci2014.02.0160
He J, Zhao X, Laroche A, et al (2014) Genotyping-by-sequencing (GBS), An ultimate marker-assisted selection (MAS) tool to accelerate plant breeding. Front Plant Sci 5:1–8. https://doi.org/10.3389/fpls.2014.00484
Heim CB, Gillman JD (2017) Genotyping-by-sequencing-based investigation of the genetic architecture responsible for a ~sevenfold increase in soybean seed stearic acid. G3 Genes, Genomes, Genet 7:299–308. https://doi.org/10.1534/g3.116.035741
Hirsch CD, Evans J, Buell CR, Hirsch CN (2014) Reduced representation approaches to interrogate genome diversity in large repetitive plant genomes. Briefings Funct Genomics Proteomics 13:257–267. https://doi.org/10.1093/bfgp/elt051
Hodgkinson A, Eyre-Walker A (2011) Variation in the mutation rate across mammalian genomes. Nat Rev Genet 12:756–766. https://doi.org/10.1038/nrg3098
Huang H, Lacey Knowles L (2016) Unforeseen consequences of excluding missing data from next-generation sequences: Simulation study of rad sequences. Syst Biol 65:357–365. https://doi.org/10.1093/sysbio/syu046
Hyten DL, Choi IY, Song Q, et al (2010) A high density integrated genetic linkage map of soybean and the development of a 1536 universal soy linkage panel for quantitative trait locus mapping. Crop Sci 50:960–968. https://doi.org/10.2135/cropsci2009.06.0360
Jarquín D, Kocak K, Posadas L, et al (2014) Genotyping by sequencing for genomic prediction in a soybean breeding population. BMC Genomics 15:1–10. https://doi.org/10.1186/1471-2164-15-740
Jean M, Cober E, O’Donoughue L, et al (2021) Improvement of key agronomical traits in soybean through genomic prediction of superior crosses. Crop Sci 61:3908–3918. https://doi.org/10.1002/csc2.20583
Karimi K, Wuitchik DM, Oldach MJ, Vize PD (2018) Distinguishing Species Using GC Contents in Mixed DNA or RNA Sequences. Evol Bioinforma 14:. https://doi.org/10.1177/1176934318788866
Koufariotis LT, Chen YPP, Stothard P, Hayes BJ (2018) Variance explained by whole genome sequence variants in coding and regulatory genome annotations for six dairy traits. BMC Genomics 19:. https://doi.org/10.1186/s12864-018-4617-x
Leboldus JM, Kinzer K, Richards J, et al (2015) Genotype-by-sequencing of the plant-pathogenic fungi Pyrenophora teres and Sphaerulina musiva utilizing Ion Torrent sequence technology. Mol Plant Pathol 16:623–632. https://doi.org/10.1111/mpp.12214
Li H (2012) seqtk: Toolkit for processing sequences in FASTA/Q formats. In: GitHub 767. https://github.com/lh3/seqtk/. Accessed 17 Aug 2022
Li X, Guo T, Mu Q, et al (2018) Genomic and environmental determinants and their interplay underlying phenotypic plasticity. Proc Natl Acad Sci U S A 115:6679–6684. https://doi.org/10.1073/pnas.1718326115
Li XQ (2016a) Somatic genome variation in animals, plants, and microorganisms. Somat Genome Var Anim Plants, Microorg 1–419. https://doi.org/10.1002/9781118647110
Li XQ (2016b) Genome variation in archaeans, bacteria, and asexually reproducing eukaryotes. Somat Genome Var Anim Plants, Microorg 253–266. https://doi.org/10.1002/9781118647110.ch10
Li Y, Ruperao P, Batley J, et al (2022) Genomic prediction of preliminary yield trials in chickpea: Effect of functional annotation of SNPs and environment. Plant Genome 15:e20166. https://doi.org/10.1002/tpg2.20166
Luca F, Hudson RR, Witonsky DB, Di Rienzo A (2011) A reduced representation approach to population genetic analyses and applications to human evolution. Genome Res 21:1087–1098. https://doi.org/10.1101/gr.119792.110
Melamed-Bessudo C, Shilo S, Levy AA (2016) Meiotic recombination and genome evolution in plants. Curr Opin Plant Biol 30:82–87. https://doi.org/10.1016/j.pbi.2016.02.003
Meng L, Li H, Zhang L, Wang J (2015) QTL IciMapping: Integrated software for genetic linkage map construction and quantitative trait locus mapping in biparental populations. Crop J 3:269–283. https://doi.org/10.1016/j.cj.2015.01.001
Moragues M, Comadran J, Waugh R, et al (2010) Effects of ascertainment bias and marker number on estimations of barley diversity from high-throughput SNP genotype data. Theor Appl Genet 120:1525–1534. https://doi.org/10.1007/s00122-010-1273-1
Morales KY, Singh N, Perez FA, et al (2020) An improved 7K SNP array, the C7AIR, provides a wealth of validated SNP markers for rice breeding and genetics studies. PLoS One 15:e0232479. https://doi.org/10.1371/JOURNAL.PONE.0232479
Narum SR, Buerkle CA, Davey JW, et al (2013) Genotyping-by-sequencing in ecological and conservation genomics. Mol Ecol 22:2841–2847. https://doi.org/10.1111/mec.12350
Nishida H (2008) Genome DNA Sequence Variation , Evolution , and Function in Bacteria and Archaea Number of genes Escherichia coli Streptomyces griseus GC content (%). 19–24
Pértille F, Guerrero-Bosagna C, Silva VH Da, et al (2016) High-throughput and Cost-effective Chicken Genotyping Using Next-Generation Sequencing. Sci Rep 6:1–12. https://doi.org/10.1038/srep26929
Poland JA, Brown PJ, Sorrells ME, Jannink JL (2012) Development of high-density genetic maps for barley and wheat using a novel two-enzyme genotyping-by-sequencing approach. PLoS One 7:. https://doi.org/10.1371/journal.pone.0032253
Poland JA, Rife TW (2012) Genotyping‐by‐Sequencing for Plant Breeding and Genetics. Plant Genome 5:. https://doi.org/10.3835/plantgenome2012.05.0005
Qin J, Wang F, Zhao Q, et al (2022) Identification of Candidate Genes and Genomic Selection for Seed Protein in Soybean Breeding Pipeline. Front Plant Sci 13:. https://doi.org/10.3389/fpls.2022.882732
Quiroz M, Kohn R, Villani M, Tran MN (2019) Speeding Up MCMC by Efficient Data Subsampling. J Am Stat Assoc 114:831–843. https://doi.org/10.1080/01621459.2018.1448827
Rasheed A, Hao Y, Xia X, et al (2017) Crop Breeding Chips and Genotyping Platforms: Progress, Challenges, and Perspectives. Mol Plant 10:1047–1064. https://doi.org/10.1016/j.molp.2017.06.008
Schmutz J, Cannon SB, Schlueter J, et al (2010) Genome sequence of the palaeopolyploid soybean. Nature 463:178–183. https://doi.org/10.1038/nature08670
Sonah H, Bastien M, Iquira E, et al (2013) An Improved Genotyping by Sequencing (GBS) Approach Offering Increased Versatility and Efficiency of SNP Discovery and Genotyping. PLoS One 8:1–9. https://doi.org/10.1371/journal.pone.0054603
Sonah H, O’Donoughue L, Cober E, et al (2015) Identification of loci governing eight agronomic traits using a GBS-GWAS approach and validation by QTL mapping in soya bean. Plant Biotechnol J 13:211–221. https://doi.org/10.1111/pbi.12249
Song Q, Yan L, Quigley C, et al (2020) Soybean BARCSoySNP6K: An assay for soybean genetics and breeding research. Plant J 104:800–811. https://doi.org/10.1111/tpj.14960
Spindel JE, Begum H, Akdemir D, et al (2016) Genome-wide prediction models that incorporate de novo GWAS are a powerful new tool for tropical rice improvement. Heredity (Edinb) 116:395–408. https://doi.org/10.1038/hdy.2015.113
St-Amour VTB, Mimee B, Torkamaneh D, et al (2020) Characterizing resistance to soybean cyst nematode in PI 494182, an early maturing soybean accession. Crop Sci 60:2053–2069. https://doi.org/10.1002/csc2.20162
Thomson MJ (2014) High-Throughput SNP Genotyping to Accelerate Crop Improvement. Plant Breed Biotechnol 2:195–212. https://doi.org/10.9787/pbb.2014.2.3.195
Torkamaneh D, Belzile F (2015) Scanning and filling: Ultra-dense SNP genotyping combining genotyping-by-sequencing, SNP array and whole-genome resequencing data. PLoS One 10:e0131533. https://doi.org/10.1371/journal.pone.0131533
Torkamaneh D, Boyle B, Belzile F (2018) Efficient genome-wide genotyping strategies and data integration in crop plants. Theor Appl Genet 131:499–511. https://doi.org/10.1007/s00122-018-3056-z
Torkamaneh D, Boyle B, St-Cyr J, et al (2020a) NanoGBS: A Miniaturized Procedure for GBS Library Preparation. Front Genet 11:1–8. https://doi.org/10.3389/fgene.2020.00067
Torkamaneh D, Chalifour FP, Beauchamp CJ, et al (2020b) Genome-wide association analyses reveal the genetic basis of biomass accumulation under symbiotic nitrogen fixation in African soybean. Theor Appl Genet 133:665–676. https://doi.org/10.1007/s00122-019-03499-7
Torkamaneh D, Laroche J, Belzile F (2020c) Fast-gbs v2.0: An analysis toolkit for genotyping-by-sequencing data. Genome 63:577–581. https://doi.org/10.1139/gen-2020-0077
Torkamaneh D, Laroche J, Boyle B, et al (2021) A bumper crop of SNPs in soybean through high-density genotyping-by-sequencing (HD-GBS). Plant Biotechnol J 19:860–862. https://doi.org/10.1111/pbi.13551
Torkamaneh D, Laroche J, Boyle B, Belzile F (2020d) DepthFinder: A tool to determine the optimal read depth for reduced-representation sequencing. Bioinformatics 36:26–32. https://doi.org/10.1093/bioinformatics/btz473
Vuong TD, Sonah H, Patil G, et al (2021) Identification of genomic loci conferring broad-spectrum resistance to multiple nematode species in exotic soybean accession PI 567305. Theor Appl Genet 134:3379–3395. https://doi.org/10.1007/s00122-021-03903-1
Waldmann P, Hallander J, Hoti F, Sillanpää MJ (2008) Efficient Markov chain Monte Carlo implementation of Bayesian analysis of additive and dominance genetic variances in noninbred pedigrees. Genetics 179:1101–1112. https://doi.org/10.1534/genetics.107.084160
Wang Y, Cao X, Zhao Y, et al (2017) Optimized double-digest genotyping by sequencing (ddGBS) method with highdensity SNP markers and high genotyping accuracy for chickens. PLoS One 12:. https://doi.org/10.1371/journal.pone.0179073
Yin L, Zhang H, Tang Z, et al (2021) rMVP: A Memory-efficient, Visualization-enhanced, and Parallel-accelerated Tool for Genome-wide Association Study. Genomics, Proteomics Bioinforma 19:619–628. https://doi.org/10.1016/j.gpb.2020.10.007
Yoon MY, Kim MY, Ha J, et al (2019) QTL analysis of resistance to high-intensity UV-B irradiation in soybean (Glycine max [L.] merr.). Int J Mol Sci 20:3287. https://doi.org/10.3390/ijms20133287
Zhu WY, Huang L, Chen L, et al (2016) A high-density genetic linkage map for cucumber (Cucumis sativus L.): Based on specific length amplified fragment (SLAF) sequencing and QTL analysis of fruit traits in cucumber. Front Plant Sci 7:437. https://doi.org/10.3389/fpls.2016.00437

Supplemental.Figure1.tiff
Supplementary Figure 1: Distribution of the SNPs derived from NsiI-MspI and PstI-MspI reads across the physical map. The heatmap colors correspond to the number of SNPs within each 1-Mb window.
Supplemental.Figure2.tiff
Supplementary Figure 2: Distribution of the SNPs derived from GBS and 3D-GBS libraries across the physical map. The heatmap colors correspond to the number of SNPs within each 1-Mb window.
Supplemental.Table1.docx

Editorial decision: Major revision (16 Dec, 2022)
Reviews received at journal (14 Dec, 2022)
Reviewers agreed at journal (21 Nov, 2022)
Reviewers invited by journal (18 Nov, 2022)
Editor assigned by journal (04 Nov, 2022)
Submission checks completed at journal (04 Nov, 2022)
First submitted to journal (01 Nov, 2022)
