Testing pipelines for genome-wide SNP calling from Genotyping-By-Sequencing (GBS) data for Pinus ponderosa

Abstract


Conclusion
The present case study provides a comprehensive comparison between four commonly-used SNP calling pipelines, and identi es the Stacks reference-based approach as the best overall for conifers (or other species with large repetitive genomes) that do not have a published reference genome for the same species.However, all four pipelines had distinct bene ts and limitations, with Stacks for instance being less user-friendly than some of the other pipelines.In addition, researchers studying other conifer species using similar approaches should be prepared to analyze very large numbers of SNPs.

Background
Single Nucleotide Polymorphisms (SNPs) have been widely used for plant genomic studies, including genome-wide association studies, marker-assisted breeding and genomic selection, because of their abundance in most genomes and amenability to high-throughput, cost effective genotyping technologies [1][2][3].Reduced-representation sequencing approaches using multiplexed next-generation sequencing (NGS) and restriction enzyme based genome complexity reduction, often referred to as Genotyping-by-Sequencing (GBS) or RADseq (restriction site-associated DNA sequencing) have emerged as a costeffective strategies for genome-wide SNP discovery and genotyping without the need for a reference genome [4][5][6][7].Moreover, they have the potential to reach regions of the genome involved in transcription regulation that are inaccessible to sequence capture approaches that target coding sequences [8].While "GBS" and "RADseq" are sometimes used as umbrella terms for all such techniques, here we use GBS to refer to a speci c approach that uses fewer steps than RADseq and lower sequencing depth [5].
However, the use of GBS on conifers species is still largely limited by the di culty of genome-wide SNP calling from the massively parallel short-read sequences [11,12].Even though GBS only sequences a fraction of the genome, because conifer genomes are so large and repetitive the datasets produced still present a computational challenge.
Studies now commonly use advanced analysis pipelines to lter, sort, and align the GBS raw data to get SNP data.There are two general types of pipelines for handling GBS data: reference-based and de novo approaches.Reference-based pipelines call SNPs by mapping the raw GBS data to an existing reference genome to identify the position of sequences and compare the sequences from the same position to call SNPs [13].Several reference-based pipelines have been widely used, including: TASSEL-GBS (v1 and v2), Stacks, IGST, and Fast-GBS [14][15][16][17].In the absence of a reference genome, de novo pipelines identify pairs of nearly identical reads (presumed to represent alternative alleles of a locus) to call SNPs.Two de novo pipelines are commonly used: the Universal Network Enabled Analysis Kit (UNEAK) [18], and Stacks [16].
Previous studies have generally found that alignment to a reference genome from the same species increases the number of identi able SNPs compared to the de novo pipelines [19].However, it is unknown which pipeline is best for SNP calling in species that lack a sequenced genome.This includes most conifer species [20][21][22].Though aligning sequences to the reference genome of a closely related species could allow for more SNPs to be identi ed if sequences are fairly conserved, it could also result in many sequence fragments being rejected (and therefore SNPs in these fragments not being identi ed) if this is not the case.
No reference genome is available for ponderosa pine (Pinus ponderosa), but one does exist for loblolly pine (Pinus taeda) [22,23].Of the conifers that have been sequenced to date, P. taeda is the most closely related to P. ponderosa; both are classi ed in the subgenus Pinus as opposed to Strobus [24,25], which contains the other sequenced pine, P. lambertiana [21].Furthermore, the P. taeda reference genome was used to successfully used to design probes for sequence capture in P. contorta [26,27].Recent studies show that within this subgenus, P. taeda and P. ponderosa diverged more recently from each other than either did from lodgepole pine (P.contorta) [28,29], suggesting that there is likely substantial sequence similarity between P. taeda and P. ponderosa as well.Previous studies have used de novo pipelines such as UNEAK to identify > 10,000 SNP loci in conifers that lack a full genome sequence [9,10].However, these earlier studies were based on a small number of samples, usually 6 individuals.Inclusion of more individuals will likely increase the number of SNPs identi ed -but by how much, and will the inclusion of more individual-level variation change the relative e ciency of different pipelines?Despite the many advantages of GBS data, its reliability for SNP calling is compromised by the presence of paralogous genomic regions.Especially for the large genomes of conifers, involving both polyploidy and repetitive element activity [30], it is challenging to separate multiple copies in a genome (e.g.paralogs) from variants at a single locus due to sequence similarity and the short sequences obtained.Moreover, it is largely unknown which pipeline does a better job at ltering out the paralogs.
In this study, we sequenced 94 individual P. ponderosa using GBS and compared four pipelines for SNP calling, including two reference based pipelines (TASSEL-GBS V2, Stacks), and two de novo pipelines (UNEAK, Stacks).We rst tested the performance of various restriction enzymes for fragmentation of P. ponderosa genome, and then used the best for GBS library construction.Then we applied the TASSEL-GBS V2 [17] and Stacks [16] pipelines using the reference genome of P. taeda, as well as the Stacks and UNEAK [18] pipelines without a reference genome.Our aim was to determine which method produced the most SNPs, which produced the least amount of missing data for the SNPs identi ed, and how much overlap there is in the SNPs called between method, as well as the proportion of paralogs among the SNPs called by different pipelines.

Restriction enzyme selection
Figure 1 shows the ampli ed fragment size distributions of libraries from ponderosa pine DNA digested with different restriction enzymes.ApeKI yielded a high smooth curve of fragment sizes between 150 and 500, the sequencing size range for GBS.There were no discrete peaks suggesting repetitive DNA fragments present in ApeKI library, while other three REs had a few discrete peaks.PstI performed similarly, though the curve was more jagged.EcoT22I produced a lower, more jagged curve, while EcoT22I + PstI had the worst performance of all.We therefore selected ApeKI.This enzyme does not cut CpG methylated sequences [31], and therefore tends to avoid stably silenced portions of the genome, hopefully increasing the proportion of SNPs from more actively transcribed regions.

Sequence quality of raw reads
Quality control for the raw reads involves the analysis of sequence quality, GC content, sequence length distribution, the presence of adaptors, overrepresented sequences, sequence duplication levels in order to detect sequencing errors, PCR artifacts, or contamination.Reducing the error rate of base calls and improving the accuracy of the per-base quality score are integral to having reliable GBS raw data [13].The sequence quality of the raw reads was high, with the per base sequence quality score over 32 and the most frequently observed mean quality score per sequence over 40.This indicates that the sequencing error is less than 0.1%.
The per base GC content module measures the GC content across the whole length of each sequence in a le and compares it to a modeled normal distribution of GC content.For the raw data of our study, the per base GC content is a roughly normal distribution with both the shape and peak corresponding to the distribution of GC content of the underlying genome, which indicates a normal random library without any bias.
According to the sequence length distribution plot, the length for all the sequences is, as expected, 90 bp for one set of samples and 100 bp for the other.The duplicate sequences analysis issues an error since non-unique sequences make up more than 50% of the total, which is in line with the high proportions of repetitive sequences in conifers [32,33].This is a feature that could be problematic for SNP calling; however, each pipeline has its own method for cleaning the data that can be more or less effective at removing repetitive sequences.No sequence represented more than 0.1% of the total, indicating that the library was not contaminated.

Comparison of 4 SNP-calling pipelines
Even with reduced representation sequencing, due to the large genome size of pines, the raw data for each of the two sets of 47 samples was over 19 GB after compression.Large computing resources (a remote cluster or supercomputer) are needed to run these pipelines.The performance of the four SNPcalling pipelines differed in many respects (Table 1).Of the two de novo pipelines, Stacks identi ed fewer SNPs than UNEAK (62,882 vs. 196,698) and took much longer to run than any of the other three pipelines (over 53 hours).Of the two reference-based analyses, Stacks identi ed 25% more SNPs than TASSEL-GBS V2 and took about 57% longer to run.The two reference-based pipelines identi ed 1-2 orders of magnitude more SNPs than the two de novo pipelines.For the Stacks pipeline, the reference-based version identi ed over forty times as many SNPs as the de novo one with a shorter run time.
Paralogs (%) 1.5 1.1 18.5 1.0 The SNP quality data includes good reads, missing data, average MAF, and average observed and expected heterozygosity, average read depth per individual, and the proportion of paralogs.There were 7.8 billion total reads for the 94 samples.All the ve pipelines used the same quality score (20) and same length (64 bp) to clean and trim the raw data.However, the number of reads considered "good" differed between pipelines, with TASSEL-GBS V2 keeping only 76.9% of reads, while the others kept at least 93.6% (Table 1).This resulted in TASSEL-GBS V2 having a much lower missing genotype rate (47.4% vs > 72%).
The TASSEL-GBS V2 pipeline produced the largest average read depth per individual (22.5 vs. < 5).The relatively low read depth of Stacks reference-based pipeline (5.8) and Stacks de novo pipeline (4.6) is consistent with their high percentages of missing genotype calls.
The UNEAK pipeline produced a much smaller average MAF than the other pipelines (0.093 vs. > 0.21).This is likely due to UNEAK employing a network lter to discard repeats and paralogs.Accordingly, UNEAK produced a small proportion of paralogs (1.1%).The proportion of paralogs of TASSEL-GBS V2 pipeline is much higher than the other pipelines (18.5% vs. < 1.5%).The higher numbers of SNPs identi ed by TASSEL-GBS V2 pipeline is partly due to paralogs.
Interestingly, reference-based Stacks identi ed a very low average observed heterozygosity despite having SNPs with a relatively high minor allele frequency.Stacks de novo and TASSEL-GBS V2 had similar minor allele frequencies and expected heterozygosity, but the observed heterozygosity was higher for TASSEL-GBS V2.For all pipelines, the average observed heterozygosity is lower than expected heterozygosity, which suggests that at least some loci are out of Hardy Weinberg equilibrium.This may be due to selection or genetic drift operating across the Sierra Nevada mountains, as the sampled individuals are widely distributed and do not represent a single random-mating population.
There are 1,888,913 overlapping SNPs identi ed by both reference-based pipelines (Fig. 2).Of the SNPs identi ed by TASSEL-GBS V2 11.4% were unique, while of those identi ed by the Stacks reference pipeline 30.2% were unique.Because the positions of SNPs were identi ed based on the reference genome, using the vcf-compare function, we were only able to compare the SNPs found using the two reference-based pipelines.Efforts to map SNPs identi ed by the de-novo approaches to the genome were stymied by the fact that the loblolly genome has not been fully assembled into chromosomes, and we were not able to develop a work-around for this that would enable software like VCFtools to be used.

Discussion
The repetitive DNA content in conifers affects the e ciency of SNP calling [10] and requires strategies for reducing the complexity and repetitive DNA content of GBS libraries.Selection of a restriction enzyme is one of the critical steps in GBS [5,34].In our study, the commonly-used restriction enzyme ApeKI performed well for ponderosa pine, with PstI offering a decent second choice.Similarly, for lodgepole pine (Pinus contorta) and white spruce (Picea glauca), Chen et al. ( 2013) found that the size distribution curve was smoothest for ApeKI compared to PstI and EcoT22I.ApeKI was also used for other conifers such as interior spruce, a hybrid complex of white spruce (Picea glauca) and Engelmann spruce (Picea engelmannii) [35].Thus, ApeKI seems to be a good choice for conifers in general.As Table 2 indicates, no one pipeline was superior for all criteria that might be of concern for a researcher, though the reference-based Stacks did the best in terms of identifying a large number of SNPs with a low number of paralogs.However, it was more complex to use.All the steps in TASSEL-GBS V2 could deal with all the 94 samples together and assign the SNP data into each sample in the nal VCF le.However, some steps in Stacks (e.g.ustacks, SAMtools) require separate codes for each sample instead of the one code for the whole library, which takes much more effort due to individual code assignment.Some of the difference in performance can be explained by the alignment methods used.
All these pipelines assemble identical reads as tags/stacks before the alignment.The reference-based pipelines then align the tags/stacks with the reference genome to nd their position, and then compare the tags/stacks in the same positions to identify SNPs with 1 bp mismatch (Fig. 3).Thus, the reference genome helps to ensure that tags from the same position are compared to identify SNPs.The de novo pipelines directly compare the tags/stacks with each other to identify SNPs with 1 bp mismatch (Fig. 4).In this situation, some of the reads from the same general position may not be identi ed as pairs because not enough of their short sequences overlap, and therefore some of the SNPs are missed.
Torkamaneh et al. ( 2016) conducted a comparison between different SNP calling pipelines on soybean (Glycine max) and found that four reference-based pipelines (TASSEL-GBS V1, IGST, TASSEL-GBSV2, Fast-GBS) identi ed more SNPs than either of two de novo pipelines (Stacks, UNEAK).However, the differences between the methods were much smaller than the differences found in our study.This suggests that for other non-model species without available genome sequences, SNP calling using a reference genome from closely related species can be an effective option.
The number of SNPs identi ed by the two de novo pipelines were very different.Torkamaneh et al. ( 2016) also found that the UNEAK pipeline identi ed more SNPs than the Stacks de novo pipeline.One possible explanation for this difference is the different way of assembling the identical reads as tags/stacks.For Stacks, the default setting for the maximum number of stacks at a single de novo locus in the program ustacks is 3.If there are over 3 stacks in the same locus, it will be blacklisted, meaning that locus will not be available for insertion into, or matching against, the catalogue.This is done as a means of rejecting repetitive sequences.However, the UNEAK merges the identical reads as tags without this limit.As a result, UNEAK pipeline can potentially identify most of SNPs because fewer stacks are rejected, but could also have more errors involving not properly separating paralogs.However, as discussed below, this did not appear to be the case; the percentage of paralogs was similar.Given this, and the more e cient identi cation of SNPs, we would recommend UNEAK over Stacks for de novo SNP identi cation.As noted in the methods, however, UNEAK is available in TASSEL V3.0, but not in the more recently updated versions of this software.
The different number of SNPs identi ed by the two reference-based pipelines is likely caused by a difference in how they assemble tags/stacks.TASSEL-GBS V2 assembles the identical reads rst as tags rst, and then align the tags to the reference genome.Stacks directly aligns the trimmed reads directly to the reference genome, which may lead to more alignments and a greater number of SNPs identi ed.TASSEL-GBS V2 rejected a higher proportion of reads initially (lower % considered "good") but produced a much lower percentage of missing data by either locus or individual than the other methods, which would mean less imputation will be needed at later steps in an association or genetic structure analysis.However, despite this thinning of reads, TASSEL-GBS V2 appears to be more likely to incorrectly identify SNPs from paralogs than the other three methods.Thus, for reference-based assembly, we would recommend Stacks based on lower paralog percentages and higher SNP number, with the caveat that it is somewhat less user-friendly.
There are 1,888,931 SNPs identi ed by both of the reference-based methods.These SNPs comprised most (88.6%) of those identi ed by TASSEL-GBS V2.This pipeline exhibited the highest heterozygosity, MAF and proportion of paralogs, so some of the loci identi ed that did not overlap (11.4%) likely had unusually high heterozygosity.Stacks, which produced the highest number of loci, did so in part by identifying 816,107 SNPs that were not identi ed by TASSEL-GBS V2.
Finally, while earlier studies making use of < 10 individual conifers identi ed < 20,000 SNPs [9,10], this study identi ed between 62,882 and 2,705,038 SNPs from 94 individuals.This indicates the high degree of genetic variation that is present in ponderosa pine [36], consistent with observations in other widespread conifer species [37].While these individuals came from multiple populations within the Sierra Nevada mountains, this represents only a tiny fraction of the total range of this species, which extends from northern Mexico to southern Canada and from the Paci c to the Rocky Mountains.Future studies, especially those considering range-wide variation, should be prepared to analyze very high numbers of SNPs.

Conclusion
With the Illumina data generated by GBS from ponderosa pine, we compared four SNP calling pipelines, and identi ed the Stacks reference-based pipeline as the best in terms of identifying large numbers of SNPs while reducing false calls from paralogs.Use of a reference genome from a related pine species greatly increased the number of SNPs identi ed.Researchers studying other conifer species should be prepared to analyze very large numbers of SNPs, and to consider the bene ts and limitations of different pipelines.

Sample preparation
In the 1970's, the Forest Service's Paci c Southwest Regional Genetic Resources Program planted clones of 302 wild ponderosa pines in Chico, California.They came from diverse climate conditions in the central portion of California's Sierra Nevada mountains and are now reproductively mature, thus presenting an excellent resource for genetic studies.Although P. ponderosa contains multiple subdivisions, with the most important being between the Paci c and Rocky Mountain groups, based on their source locations the trees within the orchard likely do not cross any subdivision boundaries [24,36,38,39].
For this study, we chose 94 individual P. ponderosa genotypes from the orchard collection.The source locations of these 94 genotypes are shown in Fig. 5 and the location information are listed in Table S1.The sample preparation included three steps: dry needle preparation, DNA extraction, and quanti cation.
Fresh needles were collected and dried with silica gel desiccant.Total genomic DNA was extracted from the dried needle tissue using DNeasy Plant Mini Kit (250) following the protocol from the manufacturer (Qiagen, Hilden, Germany) with two main modi cations.First, to reduce protein contamination, for the step of grinding needles we added 1.5 ul of Proteinase K (20 mg/ml) along with the Buffer AP and RNase A. The MiniG 1600 from SPEX SampePrep (Metuchen, NJ, USA) was used to grind needles with automated mechanical disruption through bead beating.Second, at the very last step, the amount of AE elution buffer was changed from 100 µl to 50 µl to get a higher concentration of DNA (averagely 200 ng/ µl).The DNA concentration was quanti ed using an Eppendorf BioSpectrometer (Eppendorf, AG, Germany).

Restriction enzyme selection
When working on a new species, it is bene cial to determine which enzyme produces the most fragments within the desired size range (100-400 bp).For optimization of the GBS protocol, 1000 ng samples of P. ponderosa genomic DNA were digested separately with ApeKI, PstI, and EcoT22I, and with a combination of PstI and EcoT22I (double digest) following the instructions of the enzyme manufacturer (New England Biolabs).These three restriction enzymes are methylation sensitive and have been previously used for construction of reduced complexity GBS libraries in conifers [9,10].Fragment size distributions of each test library were visualized using an Agilent BioAnalyzer 2100 (Agilent Technologies, Santa Clara, CA) with the High Sensitivity DNA Kit for quanti cation.For each test library, we used three samples with the same DNA concentration (50 ng/ul).We selected the enzyme based on the smoothness of the distribution and the size of the fragment sequences produced.Once this was done, all of the postextraction steps were carried out at the UC Davis Genome Center.

Illumina libraries preparation and sequencing
For Pinus contorta and Picea glauca 47-plex GBS libraries yielded good results [9].Therefore, a 48-plex GBS library consisting of 47 DNA samples and a negative control (no DNA) was prepared in our study.
The GBS protocol was slightly modi ed from the standard protocol [5] and that of Chen et al. (2013).The library preparation and sequencing includes 6 steps: digestion, ligation, pooling samples, PCR, clean-up, and single-end read sequencing.DNA extracts (100 ng) were digested with the restriction enzyme ApeKI for at 75 °C for 2 hours.Each of the 47 ponderosa pine DNA samples was tagged with a unique barcode.The barcodes for each individual are listed in Table S2.Sequences for the ApeKI barcode adapters and the common adapters, and the temperature cycles, were as described in Chen et al. (2013).After the digestion, the samples were cooled to 4 °C, and then adapters were ligated onto restriction fragments.This was done using T4 DNA Ligase (Life Technologies, Burlington, ON, Canada) at 16 °C for 1 hour, after which samples were "heat killed" at 65 °C for 20 minutes.The pool was quanti ed via qPCR using the KAPA Library Quanti cation Kit (Kapa Biosystems, Wilmington, MA, USA) for Illumina sequencing platforms, with 0.9X bead cleanup to remove small fragments (< 250 bp).Additional DNA puri cation using the Zymo DNA Clean & Concentrator kit (Zymo Research, Irvine, CA) was performed to further increase the purity of the extracted DNA.The fragments from all 47 samples were then sequenced (single-end read 90 bp) on one lane of an Illumina HiSeq 4000 (Illumina, San Diego, CA).The same procedure was repeated for the other 47 samples (single-end read 100 bp).We then assessed the sequence quality of the raw reads using FastQC analysis [40].
TASSEL-GBS V2 is implemented in TASSEL V5.0, a program originally developed for maize to facilitate genotype-phenotype comparisons (https://bitbucket.org/tasseladmin/tassel-5source/wiki/Tassel5GBSv2Pipeline).This pipeline requires a reference genome to call SNPs.The steps involved are illustrated on the left side of Fig. 3.The raw sequence data in FASTQ le are rst trimmed to same length (64 bp) and then identical reads are assembled into tags (unique DNA sequences).These distinctive tags are saved into FASTQ le.Then alignment program bowtie2 [41] is used to align these tags with the reference genome of loblolly pine.Based on the position of the tags against the reference genome, SNPs are produced by identifying tags aligned in the same position that have a 1 bp mismatch.Finally, the SNP information within each tag for each sample is output as a VCF le.Each step is performed internally with TASSEL-GBS V2 plugins except for alignment, which is carried out externally with bowtie2.We used the default parameter settings for our analysis except that the minimum quality score was set to 20 to make the base call accuracy more than 99%.
Stacks is a software package developed for restriction site-associated DNA sequencing that identi es SNPs and calculates population statistics from any restriction enzyme-based, reduced-representation sequence data with short-read sequences (http://catchenlab.life.illinois.edu/stacks/).It was developed with population genomics in mind, and so aims to assemble loci in large numbers of individuals and read haplotypes from them.Stacks allows for SNP calling with or without a reference genome; we chose to do both.The detailed steps of Stacks reference pipeline are represented on the right side of Fig. 3.There are two main differences from the TASSEL-GBS V2 pipeline.First, the Stacks reference pipeline aligns the reads directly against the reference genome, while TASSEL-GBS V2 assembles the same reads into tags and then performs the alignment.Second, the BWA alignment program [42] is used instead of bowtie2.Each step in the Stacks reference pipeline is performed internally in Stacks algorithms except alignment with BWA and the SAMtools [43] step used to get read position.
The steps involved in the Stacks de novo pipeline are shown on the right side of Fig. 4. First, reads are demultiplexed, cleaned and trimmed to 64 bp, and identical reads are assembled as "stacks".The stacks in each sample are merged as catalogs, which then are grouped together across samples.Third, SNPs are identi ed by matching reads to the catalogs and assigning SNPs to each sample when there is a 1 bp mismatch.SNP information is saved in a VCF le.Optional additional steps include the creation of genetic maps and calculation of population statistics.Every step in Stacks de novo pipeline uses the Stacks internal algorithms.For both Stacks reference and de novo pipeline, we used the default

Figures Figure 1 Fragment
Figures

Figure 3 Comparison
Figure 3

Figure 4 Comparison
Figure 4

Table 1
Comparison of different SNP-calling approaches.