Samples
Altogether, 12 Nordic Red dairy cows from the Luke research barn were selected for GBS and WGS, pipeline optimization and benchmarking. For each cow sample, three repeated GBS libraries and one WGS library were created, starting from the same extracted DNA, so that in total 36 GBS libraries and 12 WGS libraries of cow samples were sequenced (Figure S1).
In addition, 42 European whitefish were used for pipeline validation and repeatability testing. The fish samples consisted of 27 randomly picked, unrelated individuals and 5 families of trios (parents and one offspring). From the set of random individuals, 12 whitefish were sequenced three times: twice with technical replicates of the same library and once with an entirely new library prepared from the same extracted DNA. The European whitefish originate from the national breeding program maintained by Luke at the inland, freshwater fish farm located in Enonkoski [25, 26]. The broodstock was established in 1998 from an anadromous wild strain of the river Kokemäki. Currently, the breeding program is based on a traditional sire-dam-offspring pedigree, maintained by the use of family tanks during the early phase of growth [25, 26], but the development of SNPs will also enable the implementation of genomic selection.
Cow DNA was extracted from blood (ethical permission ESAVI/16348/2019), while fin tissue preserved in 100% ethanol was used for DNA extraction from fish. DNA was extracted using the DNeasy Blood & Tissue Kit (Qiagen, Germany) following the manufacturer's protocol.
Enzyme selection in silico
Restriction enzyme pairs for genome reduction were selected i) to generate a number of fragments yielding more than 5 000 GBS variants and ii) to leave a suitable overhang for library preparation. Assuming a proportion of variable sites of approximately 0.005 [24] and aiming for paired-end (PE) sequencing with a total of 150 bp (2x75 bp) of sequence read length per fragment, the number of variable sites was expected to be 0.75 times the fragment number. This suggested the inclusion of at least 10 000 fragments, assuming all variable sites pass all quality ascertainment steps. The considered restriction enzyme pairs were EcoRI with MspI, SphI, MseI or NlaIII, or SphI with MluCI. These enzymes had previously been used successfully for GBS in other species [21, 71, 72]. For wider applicability, six reference genomes were included in the restriction enzyme evaluation: Bos taurus (ARS-UCD1.2), Coregonus supersum ‘balchen’ (AWG_v2), Gallus gallus (GRCg6a), Hermetia illucens (iHerIll2.2), Oncorhynchus mykiss (Omyk_1.0), and Salmo salar (ICSASG_v2). ddRAD library construction was simulated using SimRAD version 0.96 [73], with the functions adjusted to use the full cut site. Digestion was simulated using both the full reference genome contigs and reduced genomes of 10 random 10% genome subsamples. The fragments predicted from the full genomes (Bos taurus and Coregonus supersum) for the chosen EcoRI–SphI enzyme pair were used for quality evaluation of the GBS analysis. The obtained sequence data were used to estimate the effective size window and, as a consequence, the size selection window was set to 150–400 bp for consistency. The effective size window thresholds were roughly estimated as the values where the slope of the density curves of the aligned fragments turned to +1 (lower size threshold) and −1 (upper size threshold).
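The fragment-count target above follows from a simple expectation calculation; a minimal sketch (the function name and defaults are ours, not part of the pipeline):

```python
def expected_variants(n_fragments, bp_sequenced_per_fragment=150,
                      prop_variable_sites=0.005):
    """Expected number of variable sites captured, given the total
    sequenced length per fragment (here 2x75 bp PE reads) and an assumed
    proportion of variable sites (0.005)."""
    return n_fragments * bp_sequenced_per_fragment * prop_variable_sites
```

With 10 000 fragments this gives an expectation of 7 500 variants (0.75 per fragment), comfortably above the 5 000 target even if some sites fail quality ascertainment.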
ddRAD library preparation
The workflow (Figure S6) for the ddRAD library preparation was adapted from [29]. In detail, 250 or 500 ng of DNA was double-digested with the two restriction enzymes EcoRI-HF (G^AATTC) and SphI-HF (GCATG^C) (New England Biolabs, USA). The restriction reaction was performed in a volume of 20 µL, containing 17 µL of DNA (250 ng or 500 ng in total), 0.25 µL of EcoRI-HF (5 units), 0.25 µL of SphI-HF (5 units), 2 µL of CutSmart buffer (10x) and 0.5 µL of molecular grade water, at 37°C for 2 h, followed by heat-inactivation for 15 min at 65°C. Two non-barcoded restriction site specific adapters (Table S3) were ligated by adding 1 µL of each adapter (adapter P1, EcoRI: 1 µM; adapter P2, SphI: 10 µM) to the restriction mixture, together with 0.5 µL of T4 ligase (200 units) and 1.5 µL of ligation buffer (New England Biolabs, USA). Ligation was performed at 16°C for 14 h, followed by heat-inactivation at 65°C for 15 min. DNA fragments between 200 bp and 700 bp were selected using SPRIselect magnetic beads (Beckman Coulter, USA) with a left-right ratio of 1x-0.56x. In detail, the volume of each sample was adjusted with molecular grade water to 50 µL; 28 µL of SPRIselect beads were then added to achieve a 0.56x ratio for the selection of fragments shorter than 700 bp, followed by the selection of fragments longer than 200 bp by adding 22 µL of SPRIselect beads to reach a ratio of 1x. The size-selected DNA was resuspended in 25 µL of molecular grade water. Samples were barcoded by adding Illumina Nextera v2 (Illumina, San Diego, CA, USA) combinatorial dual-indexed barcodes (i7 and i5). For each individual sample, a PCR mix containing 6 µL of 5x Phusion HF buffer, 0.4 µL dNTP (10 mM), 0.2 µL of Phusion HF DNA polymerase (0.4 units) (Thermo Fisher Scientific, USA), 1.5 µL of i5 barcode primer, 1.5 µL of i7 barcode primer, 5 µL of sample and 15.4 µL of molecular grade water was prepared, and two PCR reactions per sample were performed.
The cycling conditions were as follows: initial denaturation at 98°C for 30 sec, followed by 18 cycles of 10 sec at 98°C, 20 sec at 61°C and 15 sec at 72°C, and a final elongation step at 72°C for 10 min. The two PCR reactions per sample were pooled, the volume was adjusted to 50 µL, and small-fragment removal was carried out with 40 µL (0.8x) of SPRIselect beads. The size-selected PCR products were resuspended in 25 µL of molecular grade water and quantified using a Qubit Flex with the 1x dsDNA HS assay (Thermo Fisher Scientific, USA). Only products with a significantly higher concentration than the No Template Control (NTC) (> 3 ng/µL) were used for sequencing.
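The SPRIselect volumes used above (28 µL and 22 µL for the double-sided selection, 40 µL for the 0.8x cleanup) follow from cumulative bead-to-sample ratios; a small sketch of the arithmetic (the helper function is ours):

```python
def spri_bead_volume(sample_ul, target_ratio, beads_already_ul=0.0):
    """Bead volume (uL) to add so that the cumulative bead:sample ratio
    reaches target_ratio; for a double-sided selection the second
    addition only tops up the beads already present."""
    return target_ratio * sample_ul - beads_already_ul
```

For a 50 µL sample, the 0.56x right-side cut needs 28 µL of beads, and topping up to 1x for the left-side cut adds a further 22 µL.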
Sequencing
Single ddRAD libraries were pooled in equimolar amounts. The pool was size-selected with SPRIselect beads to a length between 300 and 700 bp (ratio 0.75x-0.56x), corresponding to the combined length of a 150–550 bp restriction insert and 147 bp of adapter. The quality and size of the pooled sequencing library were evaluated on the TapeStation 4150 (Agilent, USA) using the DNA HS1000 assay. Quantification of the library was done using a Qubit 4 (1x dsDNA HS assay) (Thermo Fisher Scientific, USA). Following the guidelines from the NextSeq System denature and dilute libraries guide (Document # 15048776 v09, December 2018; Illumina, San Diego, CA, USA), the library was diluted for sequencing to a final concentration of 1.4 pM, containing 10% PhiX control to increase complexity at the start of the sequencing. The PE sequencing (2x75 bp) was performed on the NextSeq 550 (Illumina, San Diego, CA, USA) using a mid-output flow cell.
The WGS of the cow samples was performed at the Finnish Functional Genomics Centre (Turku, Finland) using the TruSeq® DNA PCR-Free Library kit (Illumina, San Diego, CA, USA) and PE sequencing (2x150 bp) on the Illumina NovaSeq 6000 platform.
Mock-reference genome
Analyzing GBS data without a preexisting reference genome necessitates creating a technical (mock) reference. For this, various sample selection methods were considered: the sample with the highest read count (mock-strategy 1), a sample with an average read count (mock-strategy 2), a random subset of three samples (mock-strategy 3), or all samples (mock-strategy 4).
As a first step, the raw PE sequences were checked for overlap, which can occur in the case of short inserts. Overlapping reads were merged into single-end (SE) reads using PEAR [74], with two tuning parameters optimized here: the p option (values between 0.001 and 0.1), the significance threshold of the statistical test that determines read-pair merging, and the pl option (values 30 to 70), defining the minimum accepted total length of the merged construct. PE reads that could not be merged were then stitched together with a spacer of 20 N bases, the pipeline default. Stitching of reads was controlled by the parameter rl: reads were stitched if the length of read1 was larger than (rl − 19) and the length of read2 was larger than (rl − 5); otherwise the reads were not used for the mock generation. The resulting SE reads were used to construct the de-novo mock reference genome using vsearch [75]. In the de-novo building phase, two vsearch options were fine-tuned: the id option (values between 0.8 and 0.99), defining the minimum pairwise identity for merging two clusters, and the min option (values between 80 and 160), setting the minimum cluster length for inclusion in the mock reference. The in-silico simulated protocol described in “Enzyme selection in silico” was used to evaluate the mock reference constructs.
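The stitching rule for unmerged read pairs can be sketched as follows (a simplified illustration operating on plain sequence strings, not fastq records):

```python
N_SPACER = "N" * 20  # pipeline-default spacer between unmerged mates

def stitch_pair(read1, read2, rl):
    """Stitch an unmerged PE read pair with a 20 N spacer, following the
    length rule described above; returns None if the pair is discarded
    from mock generation."""
    if len(read1) > rl - 19 and len(read2) > rl - 5:
        return read1 + N_SPACER + read2
    return None
```

For rl = 75, read1 must exceed 56 bp and read2 must exceed 70 bp for the pair to enter the mock reference construction.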
Following the de-novo mock reference creation, an additional refinement step was applied in which clusters with low coverage were removed from the mock reference. The tuning parameters were totalReadCoverage and minSampleCoverage. The first defines the minimum number of reads that need to be aligned to a cluster across all samples for the cluster to be kept in the mock reference. The second defines the minimum number of samples that need to have at least a single read aligned to a cluster for that cluster to remain in the mock. For the tuning of totalReadCoverage we tested the values 6, 12, 24, 60 and 120, and for minSampleCoverage 2 (10%), 4 (25%), 6 (50%), 8 (75%), 10 (90%) and 12 (100%) of the total number of samples in the study.
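A minimal sketch of this coverage-based cluster filter, assuming per-cluster, per-sample read counts have already been tallied (names are ours):

```python
def filter_clusters(cluster_counts, total_read_coverage, min_sample_coverage):
    """Keep only clusters that satisfy both coverage thresholds.
    cluster_counts maps a cluster id to its per-sample aligned read
    counts; a cluster is kept if the summed read count across all samples
    reaches total_read_coverage and at least min_sample_coverage samples
    contribute one or more reads."""
    return {
        cid: counts for cid, counts in cluster_counts.items()
        if sum(counts) >= total_read_coverage
        and sum(c > 0 for c in counts) >= min_sample_coverage
    }
```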
Variant calling
The GBS variant calling was done using Snakebite-GBS [76], a Snakemake pipeline extension based on the existing GBS-SNP-CROP [38] pipeline and part of the Snakebite framework Snakepit [77]. First, the quality-trimmed reads were aligned with BWA-mem [78] against the mock and/or preexisting reference genome(s). Then, samtools mpileup [79] was used for variant calling, and various filters were applied to obtain the final variant set. The underlying GBS-SNP-CROP pipeline allows for eight different filters: (1) mnHoDepth0 (value: 5), the minimum depth required for calling a homozygote when the alternative allele depth equals 0; (2) mnHoDepth1 (value: 20), the minimum depth required for calling a homozygote when the alternative allele depth equals 1; (3) mnHetDepth (value: 3), the minimum depth required for each allele when calling a heterozygote; (4) altStrength (value: 0.8), the minimum proportion of non-primary allele reads that are the secondary allele; (5) mnAlleleRatio (value: 0.25), the minimum required ratio of the less frequent allele depth to the more frequent allele depth; (6) mnCall (value: 0.75), the minimum acceptable proportion of genotyped individuals to retain a variant; (7) mnAvgDepth (value: 3), the minimum average read depth of an acceptable variant; (8) mxAvgDepth (value: 200), the maximum average read depth of an acceptable variant.
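Filters (1)-(3) and (5) act on a single sample's allele depths at a site; an illustrative, simplified sketch of that decision logic (not the pipeline's exact implementation):

```python
def call_genotype(primary_depth, secondary_depth,
                  mn_ho_depth0=5, mn_ho_depth1=20,
                  mn_het_depth=3, mn_allele_ratio=0.25):
    """Sketch of per-sample genotype filters (1)-(3) and (5).
    Depths are read counts of the two most frequent alleles at a site;
    returns 'hom', 'het', or None (no call)."""
    if secondary_depth == 0:
        # (1) homozygote with no alternative reads
        return "hom" if primary_depth >= mn_ho_depth0 else None
    if secondary_depth == 1:
        # (2) a single alternative read is tolerated at high primary depth
        return "hom" if primary_depth >= mn_ho_depth1 else None
    # (3) both alleles sufficiently covered and (5) depth ratio balanced
    ratio = min(primary_depth, secondary_depth) / max(primary_depth, secondary_depth)
    if min(primary_depth, secondary_depth) >= mn_het_depth and ratio >= mn_allele_ratio:
        return "het"
    return None
```

Filters (6)-(8) then operate across samples on the retained variant matrix.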
The cattle WGS variant calling was performed following the GATK4 best practices [80], implemented as a Snakemake [37] workflow called Snakebite-WGS [81]. The implemented steps include, among others, building a base quality score recalibration model with the GATK base recalibrator and applying it to adjust the base quality scores. Variant calling is done via the haplotype caller. The pipeline also utilizes BWA-mem to align the data, but includes a refinement step using Picard before the GATK4 software suite is used for the final variant calling with default filters applied.
GBS quality evaluation
The generated cow GBS variant data were mapped against the in-silico digested ARS-UCD1.2 reference genome to evaluate the size selection performance. Following variant calling, sample-wise genotype concordance between the GBS and WGS sequencing strategies was assessed using Picard.
The repeatability of the GBS runs was tested by intersecting the variant locations on the corresponding reference genomes. Here, bcftools [82] was used to intersect the three vcf-files, and the corresponding intersection numbers were calculated. Further, samtools mpileup was run on the GBS data aligned to the reference genome, and for each sample contiguous regions with a minimum coverage of three reads were identified and stored in bed format. The individual sample-wise bed-files were then merged, and only regions with read support from at least 10 samples were kept. This bed-file was then used to intersect the WGS-based vcf file using bedtools [83] and to extract WGS variants only from the corresponding intersecting genome regions.
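The extraction of contiguous covered regions can be sketched as follows, assuming per-position read depths have already been parsed from the mpileup output:

```python
def covered_regions(depths, min_cov=3):
    """Collapse per-position read depths (index = 0-based position) into
    contiguous half-open [start, end) intervals with depth >= min_cov,
    akin to deriving bed intervals from samtools mpileup output."""
    regions, start = [], None
    for pos, depth in enumerate(depths):
        if depth >= min_cov and start is None:
            start = pos                     # open a new covered interval
        elif depth < min_cov and start is not None:
            regions.append((start, pos))    # close the current interval
            start = None
    if start is not None:
        regions.append((start, len(depths)))
    return regions
```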
In cattle, the GBS variant-based variability and relatedness were compared against resampled WGS variants with restricted variant numbers from 50 to 30 000, to assess how the variant number influences the classical Genomic Relatedness Matrix (GRM), calculated using the R-package BGData [84]. The GRM based on the full WGS variant matrix was compared to smaller bootstrap samples of WGS and GBS data.
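A classical GRM of the kind computed here can be sketched with numpy (VanRaden-style scaling; this mirrors, but is not, the BGData implementation):

```python
import numpy as np

def vanraden_grm(genotypes):
    """Classical (VanRaden) genomic relatedness matrix from a
    samples x markers matrix of 0/1/2 allele counts."""
    p = genotypes.mean(axis=0) / 2.0        # per-marker allele frequencies
    z = genotypes - 2.0 * p                 # centre genotypes by 2p
    denom = 2.0 * np.sum(p * (1.0 - p))     # scaling constant
    return z @ z.T / denom
```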
The lift-over between mock reference and pre-existing reference genome to compare variants from both methods based on their chromosomal was done by using the tool transanno. Here, first the mock reference was aligned against the reference genome and the resulting file in pairwise mapping format (paf) was then used in transanno to create the lift-over chain and eventually to perform the lift-over. Chromosomal locations between the lift-overed mock reference-based variants and their pre-existing reference genome based counterparts were then again matched via bcftools isec.
Differences in GRM structure were quantified by measuring the variability in different directions, using the distance between the eigenvalues of the matrices calculated with the Frobenius matrix norm.
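A sketch of this eigenvalue-based distance, assuming both GRMs are symmetric and of equal dimension:

```python
import numpy as np

def eigen_frobenius_distance(grm_a, grm_b):
    """Distance between two GRMs measured on their eigenvalue spectra:
    the Frobenius norm of the difference of the sorted eigenvalues."""
    ev_a = np.sort(np.linalg.eigvalsh(grm_a))
    ev_b = np.sort(np.linalg.eigvalsh(grm_b))
    return np.linalg.norm(ev_a - ev_b)  # for a vector this is the L2 norm
```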
For the whitefish data, relatedness in trios was assessed using the full whitefish data set to overcome bias in the small data set caused by the few closely related individuals in the parental generation. In addition to the genomic relatedness, the genotype quality was assessed by evaluating non-Mendelian inheritance of the GBS variants in five families of trios, which included the parents and one offspring.
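The per-variant Mendelian consistency check in a trio can be sketched as follows (genotypes coded as 0/1/2 allele counts; a simplification that ignores missing genotypes):

```python
def mendelian_consistent(sire, dam, offspring):
    """Check whether an offspring genotype (0/1/2 allele counts) is
    compatible with Mendelian transmission from its parents."""
    gametes = {0: {0}, 1: {0, 1}, 2: {1}}   # alleles a parent can pass on
    possible = {s + d for s in gametes[sire] for d in gametes[dam]}
    return offspring in possible
```

For example, two opposite homozygote parents (0 and 2) can only produce a heterozygous (1) offspring; any other call counts as a non-Mendelian error.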