The repetitive DNA content in conifers affects the efficiency of SNP calling [10] and requires strategies for reducing the complexity and repetitive DNA content of GBS libraries. Selection of a restriction enzyme is one of the critical steps in GBS [5, 34]. In our study, the commonly-used restriction enzyme ApeKI performed well for ponderosa pine, with PstI offering a decent second choice. Similarly, for lodgepole pine (Pinus contorta) and white spruce (Picea glauca), Chen et al. (2013) found that the size distribution curve was smoothest for ApeKI compared to PstI and EcoT22I. ApeKI was also used for other conifers such as interior spruce, a hybrid complex of white spruce (Picea glauca) and Engelmann spruce (Picea engelmannii) [35]. Thus, ApeKI seems to be a good choice for conifers in general.
Table 2
SNP-calling approaches ranked
Approach | de novo | reference-based |
Pipeline | Stacks | UNEAK | TASSEL-GBS V2 | Stacks |
Run time | Highest | Lowest | Medium | Medium |
Ease of use | Poor | Best | Medium | Poor |
# of SNPs identified | Lowest | Low | High | Highest |
Missing data | High | High | Lowest | High |
% paralogs | Low (good) | Low (good) | Highest (poor) | Lowest (best) |
As Table 2 indicates, no one pipeline was superior for all criteria that might be of concern for a researcher, though the reference-based Stacks did the best in terms of identifying a large number of SNPs with a low number of paralogs. However, it was more complex to use. All the steps in TASSEL-GBS V2 could deal with all the 94 samples together and assign the SNP data into each sample in the final VCF file. However, some steps in Stacks (e.g. ustacks, SAMtools) require separate codes for each sample instead of the one code for the whole library, which takes much more effort due to individual code assignment. Some of the difference in performance can be explained by the alignment methods used.
All these pipelines assemble identical reads as tags/stacks before the alignment. The reference-based pipelines then align the tags/stacks with the reference genome to find their position, and then compare the tags/stacks in the same positions to identify SNPs with 1 bp mismatch (Fig. 3). Thus, the reference genome helps to ensure that tags from the same position are compared to identify SNPs. The de novo pipelines directly compare the tags/stacks with each other to identify SNPs with 1 bp mismatch (Fig. 4). In this situation, some of the reads from the same general position may not be identified as pairs because not enough of their short sequences overlap, and therefore some of the SNPs are missed.
Torkamaneh et al. (2016) conducted a comparison between different SNP calling pipelines on soybean (Glycine max) and found that four reference-based pipelines (TASSEL-GBS V1, IGST, TASSEL-GBSV2, Fast-GBS) identified more SNPs than either of two de novo pipelines (Stacks, UNEAK). However, the differences between the methods were much smaller than the differences found in our study. This suggests that for other non-model species without available genome sequences, SNP calling using a reference genome from closely related species can be an effective option.
The number of SNPs identified by the two de novo pipelines were very different. Torkamaneh et al. (2016) also found that the UNEAK pipeline identified more SNPs than the Stacks de novo pipeline. One possible explanation for this difference is the different way of assembling the identical reads as tags/stacks. For Stacks, the default setting for the maximum number of stacks at a single de novo locus in the program ustacks is 3. If there are over 3 stacks in the same locus, it will be blacklisted, meaning that locus will not be available for insertion into, or matching against, the catalogue. This is done as a means of rejecting repetitive sequences. However, the UNEAK merges the identical reads as tags without this limit. As a result, UNEAK pipeline can potentially identify most of SNPs because fewer stacks are rejected, but could also have more errors involving not properly separating paralogs. However, as discussed below, this did not appear to be the case; the percentage of paralogs was similar. Given this, and the more efficient identification of SNPs, we would recommend UNEAK over Stacks for de novo SNP identification. As noted in the methods, however, UNEAK is available in TASSEL V3.0, but not in the more recently updated versions of this software.
The different number of SNPs identified by the two reference-based pipelines is likely caused by a difference in how they assemble tags/stacks. TASSEL-GBS V2 assembles the identical reads first as tags first, and then align the tags to the reference genome. Stacks directly aligns the trimmed reads directly to the reference genome, which may lead to more alignments and a greater number of SNPs identified. TASSEL-GBS V2 rejected a higher proportion of reads initially (lower % considered "good") but produced a much lower percentage of missing data by either locus or individual than the other methods, which would mean less imputation will be needed at later steps in an association or genetic structure analysis. However, despite this thinning of reads, TASSEL-GBS V2 appears to be more likely to incorrectly identify SNPs from paralogs than the other three methods. Thus, for reference-based assembly, we would recommend Stacks based on lower paralog percentages and higher SNP number, with the caveat that it is somewhat less user-friendly.
There are 1,888,931 SNPs identified by both of the reference-based methods. These SNPs comprised most (88.6%) of those identified by TASSEL-GBS V2. This pipeline exhibited the highest heterozygosity, MAF and proportion of paralogs, so some of the loci identified that did not overlap (11.4%) likely had unusually high heterozygosity. Stacks, which produced the highest number of loci, did so in part by identifying 816,107 SNPs that were not identified by TASSEL-GBS V2.
Finally, while earlier studies making use of < 10 individual conifers identified < 20,000 SNPs [9, 10], this study identified between 62,882 and 2,705,038 SNPs from 94 individuals. This indicates the high degree of genetic variation that is present in ponderosa pine [36], consistent with observations in other widespread conifer species [37]. While these individuals came from multiple populations within the Sierra Nevada mountains, this represents only a tiny fraction of the total range of this species, which extends from northern Mexico to southern Canada and from the Pacific to the Rocky Mountains. Future studies, especially those considering range-wide variation, should be prepared to analyze very high numbers of SNPs.