Benchmarking Variant Identification Tools for Plant Diversity Discovery

doi:10.21203/rs.2.9666/v1


Research article


This work is licensed under a CC BY 4.0 License

Journal Publication

published 09 Sep, 2019

Read the published version in BMC Genomics →


Background

The ability to accurately and comprehensively identify genomic variations is critical for plant studies utilizing high-throughput sequencing. Most bioinformatics tools for processing next-generation sequencing data were originally developed and tested in human studies, raising questions as to their efficacy for plant research. A detailed evaluation of the entire variant calling pipeline, including alignment, variant calling, variant filtering, and imputation was performed on different programs using both simulated and real plant genomic datasets.

Results

A comparison of SOAP2, Bowtie2, and BWA-MEM found that BWA-MEM was consistently able to align the most reads with high accuracy, whereas Bowtie2 had the highest overall accuracy. Comparative results of GATK HaplotypeCaller versus SAMtools mpileup indicated that the choice of variant caller affected precision and recall differentially depending on the levels of diversity, sequence coverage and genome complexity. A cross-reference experiment using the S. lycopersicum and S. pennellii reference genomes revealed the inadequacy of a single reference genome for variant discovery that includes distantly related plant individuals. A machine-learning-based variant filtering strategy outperformed the traditional hard-cutoff strategy, resulting in a higher number of true positive variants and fewer false positive variants. The 2-step imputation method, which utilized a set of high-confidence SNPs as the reference panel, showed up to 60% higher accuracy than the direct LD-based imputation method.

Conclusions

Programs in the variant discovery pipeline perform differently on plant genomic datasets. The choice of programs depends on the goal of the study and the available resources. This study serves as an important guide for plant biologists utilizing next-generation sequencing data for diversity characterization and crop improvement.

Epigenetics & Genomics

Read alignment

variant calling

machine learning

variant filtering

imputation

Genomic technologies provide unprecedented opportunities to reveal the history of crop domestication, to discover novel genetic diversity, and to understand the genetic basis of economically important traits, collectively contributing to crop improvement and food security [1]. One of the most important steps in genomic analyses is the accurate and comprehensive identification of genetic variations. As sequencing costs continue to decrease, whole genome sequencing (WGS) strategies are increasingly employed for plant diversity and domestication studies [2-5]. Accompanying improvements in sequencing technology is the need to not only improve but also better understand the algorithms that enable variant calling from sequencing data. Many of the algorithms used in the processing of sequencing data were originally developed for human WGS studies, yet are frequently used by plant genomic researchers [6]. The underlying assumption is that the performance of a given algorithm on human data will be similar on plant data, in spite of significant differences between human and plant genomes.

The variant discovery pipeline for WGS datasets can be roughly divided into four steps: read mapping, variant calling, variant filtering, and imputation. Sequence aligners for the read mapping step can be grouped according to their indexing methodologies [7]. Programs such as Novoalign (http://www.novocraft.com) and GSNAP [8] use hash-table indexing methods, whereas BWA [9], SOAP2 [10] and Bowtie2 [11] use Burrows-Wheeler Transform indexing algorithms. Variant calling programs can be categorized into alignment-based programs, such as SAMtools [12] and FreeBayes [13], and assembly-based programs, such as GATK HaplotypeCaller [14] and FermiKit [15]. Variant filtering steps remove low-quality variants based on various quality metrics such as base quality, read depth, and mapping quality. The purpose of this step is to remove false positive variants while minimizing false negative variants, a source of "hidden diversity". The basic filtering strategy, termed "hard-filtering", sets empirical cutoffs on quality metrics to eliminate false positive variants.

Over the past decade, human genomic studies have developed and applied machine-learning based variant filtering methods [14], which use adaptive cutoffs tailored to a specific dataset, often by finding variants within the dataset that were previously identified with high confidence. The final step in variant discovery often employs imputation methods, which leverage external information to infer genotypes that are missing due to technical limitations. The standard approach to imputation in human genomic studies utilizes a reference panel [16, 17], where a previously identified set of haplotypes links missing variants with successfully genotyped variants. Many of these advanced methods have yet to be readily adopted by plant researchers. In some instances, there are clear obstacles to implementation, such as the lack of extensive plant haplotype panels of similar quality to the 1000 Genomes Project [18] or HapMap [19], although these resources are rapidly being built for some species, such as maize [20] and rice [21]. Even though plant and human genomics share a similar computational workflow, the structure and composition of plant genomes pose unique challenges that are not present in humans. As a result, the evaluation of these emerging computational genomics technologies is urgently needed in agriculture.

A major challenge for crop genomics is the ability to accurately and comprehensively characterize genetic diversity in domesticated crops, diverse landraces, and wild crop relatives. Genetic diversity in plants can be much greater than that found in human genomes. These sources of diversity, especially in the wild species, provide a reservoir of genetic variation for future crop improvement [22-24]. For example, introgression from related wild species into domesticated tomato has been used to improve agronomic performance such as abiotic tolerance [25-28]. Similarly, a gene from a wild relative of bread wheat has been shown to confer resistance to one of the most destructive stem rust pathogen races, Ug99 [29]. Characterizing these rich pools of diversity is an important challenge facing plant genomics because the regions containing this diversity may pose the most challenges for algorithms designed and optimized for human studies.

The second challenge for variant discovery in plant genomics is the quality of available reference genomes. The human reference genome has been in a constant state of improvement for decades (https://www.ncbi.nlm.nih.gov/grc/human). Once released, however, most plant reference genomes see little improvement, resulting in references that are less accurate and less complete than the human reference. Other key challenges are the large amounts of repetitive sequences, structural variation and, in some crops, complex polyploid genomes [30, 31]. Diversity may be underestimated because of presence-absence variations (PAV) that are common to most plant genomes [32]. The diverse nature of plant genomes together with low quality or incomplete reference assemblies can negatively affect read alignment and variant calling steps, leading to inaccurate genotypes and missing variants [1, 33, 34].

Here, we benchmarked the performance of programs that are commonly used for variant discovery in plant studies. The comparison included three highly-cited sequence aligners, BWA-MEM, Bowtie2 and SOAP2, and two popular variant callers, GATK HaplotypeCaller (GATK-HC) and SAMtools mpileup (SAMtools-mpileup) using domesticated tomatoes, wild relatives and simulated genomic datasets. We show that as diversity and genome complexity increased, the ability of these algorithms to identify variants varied significantly. In addition, the inadequacy of a single reference genome was uncovered after a cross-reference comparison was performed. Finally, we evaluated the performance of machine learning based variant filtering method and reference panel assisted imputation methods on the high diversity plant datasets.

Simulated multi-species genomic dataset and real tomato genomic dataset

We used 602 publicly available WGS datasets representing 514 domesticated and 88 related wild species of tomato. The data were retrieved from the NCBI BioProjects under accessions PRJNA259308, PRJNA353161 and PRJEB5235. Simulated tomato sequencing reads were generated from the S. lycopersicum reference genome using a custom Python script that introduced 0-20 SNPs per read, with fragment sizes ranging from 200-10000 nt and INDELs ranging from 0-40 nt. In order to evaluate the performance of BWA-MEM on multiple crop species, Illumina sequencing reads were also simulated for rice, soybean, tomato, maize and wheat using mason [46]. The mutation rate, including SNPs and INDELs, was simulated at 0.001, 0.005, 0.02, 0.04, 0.08, 0.1, 0.15, and 0.2. The proportions of SNPs and INDELs were 0.85 and 0.15, respectively. Sequencing errors were modeled using the default settings.
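The custom simulation script is not reproduced in this text. As a minimal sketch of the SNP-introduction step only, assuming reads over the alphabet A/C/G/T and a hypothetical helper name `introduce_snps`, substitutions could be placed at random positions while recording the ground truth:

```python
import random

BASES = "ACGT"

def introduce_snps(read, n_snps, rng):
    """Return (mutated_read, positions) with n_snps single-base substitutions.

    Hypothetical helper for illustration; not the authors' script.
    """
    if n_snps > len(read):
        raise ValueError("cannot introduce more SNPs than there are bases")
    # Sample distinct positions so each SNP changes a different base.
    positions = sorted(rng.sample(range(len(read)), n_snps))
    mutated = list(read)
    for pos in positions:
        # Choose a base that differs from the original one.
        alternatives = [b for b in BASES if b != mutated[pos]]
        mutated[pos] = rng.choice(alternatives)
    return "".join(mutated), positions

rng = random.Random(42)  # fixed seed so the example is reproducible
read = "ACGTACGTACGTACGTACGT"
mutated, positions = introduce_snps(read, 3, rng)
```

Keeping the list of mutated positions alongside each read is what later allows an aligned location or a called variant to be checked against a known truth.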

Evaluation of read alignment programs

Different aligners were evaluated using high-coverage datasets from PRJEB5235 and simulated datasets. BWA-MEM, SOAP2, SOAP2-tuned, Bowtie2 and Bowtie2-tuned were tested. SOAP2-tuned was run with the following options: -m 100 -x 888 -s 35 -l 32 -v 3. Bowtie2-tuned was run with the following options: --very-sensitive -N 1 -I 100 -X 888. To determine mapping percentages, these five aligner settings were used to align one million reads that were randomly selected from the high-coverage genomes of 52 domesticated and 30 wild relative samples. The IBS (Identity-By-State) distance was calculated using SNPRelate [47]. The true positive alignment ratio was calculated by comparing the known ground-truth location with the aligned location. BWA-MEM was also evaluated on multiple crop species with a mixture of SNPs and INDELs in the simulated datasets.
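The comparison of ground-truth and aligned locations can be illustrated with a short sketch. The labels follow the TP/FP/FN terminology used in the Results; treating an unmapped read as FN, and the `tolerance` parameter, are assumptions made here for illustration rather than details taken from the study.

```python
from collections import Counter

def classify_alignment(truth, aligned, tolerance=0):
    """Label one read as TP, FP or FN.

    truth and aligned are (chromosome, position) tuples;
    aligned is None when the aligner reported no alignment.
    """
    if aligned is None:
        return "FN"  # assumption: unaligned reads count as false negatives
    same_chrom = aligned[0] == truth[0]
    close_enough = abs(aligned[1] - truth[1]) <= tolerance
    return "TP" if same_chrom and close_enough else "FP"

def alignment_ratios(pairs, tolerance=0):
    """Return the fraction of TP, FP and FN alignments over all reads."""
    counts = Counter(classify_alignment(t, a, tolerance) for t, a in pairs)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no reads to evaluate")
    return {label: counts[label] / total for label in ("TP", "FP", "FN")}

# Four simulated reads: two placed correctly, one misplaced, one unmapped.
pairs = [
    (("chr1", 1000), ("chr1", 1000)),
    (("chr1", 2000), ("chr1", 2000)),
    (("chr2", 500), ("chr5", 77)),
    (("chr3", 900), None),
]
ratios = alignment_ratios(pairs)
```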

SNP discovery comparison

Eighty-two high-coverage datasets from PRJEB5235 were used for SNP discovery comparisons. SNPs were called with SAMtools-mpileup and GATK-HC using BWA-MEM and Bowtie2-tuned alignments. Default settings were applied for both variant calling programs. Only polymorphic SNPs were used as data for the Venn diagram. Simulated datasets with known variants were generated for tomato, rice, soybean, and maize using mason. Each crop species was simulated at different coverages (5x, 15x, 30x, and 50x) and mutation rates (0.001, 0.01, 0.05, 0.1, 0.15). In addition to individual simulated datasets, population-level simulated datasets were also generated with varied diversity (low diversity: 0.001 mutation rate; high diversity: 0.1 mutation rate), population size (25, 50, 75, and 100) and sequencing coverage (1x, 5x, and 10x). SAMtools-mpileup and GATK-HC were evaluated on both individual and population simulated datasets by comparing the precision and recall ratios.

Due to technical limitations, Equation 1 has been placed in the Supplementary Files section.

Due to technical limitations, Equation 2 has been placed in the Supplementary Files section.
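The two equations are not reproduced in this text. As a hedged illustration only, the sketch below uses the standard definitions, precision = TP / (TP + FP) and recall = TP / (TP + FN), which may or may not match the exact form given in the Supplementary Files. Variants are represented as hypothetical (chromosome, position, ref, alt) tuples.

```python
def precision_recall(called, truth):
    """Compute precision and recall of a call set against known variants.

    Uses the standard definitions; this is an assumption, since the
    original equations are only available in the Supplementary Files.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # called and real
    fp = len(called - truth)   # called but not real
    fn = len(truth - called)   # real but not called
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    return precision, recall

truth = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "A"),
    ("chr2", 90, "T", "C"),
}
called = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr1", 300, "G", "C"),  # not in the truth set
}
precision, recall = precision_recall(called, truth)
```

In this toy case two of three calls are correct (precision 2/3) and two of four true variants are recovered (recall 1/2), which is the kind of trade-off compared between callers in the Results.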

Imputation algorithm comparison

Beagle v4.1 [16] direct imputation and the 2-step imputation method were compared using 602 tomato genomes. The raw SNPs were called using the BWA-MEM and GATK-HC pipeline, and then hard filtered using the GATK recommended options: “QD < 2.0 || FS > 60.0 || MQ < 40.0 || SOR > 3.0”. The high-confidence set of SNPs for the 2-step imputation was identified from the 82 high-coverage datasets using BWA-MEM and GATK-HC. GATK hard-filtering and VCFtools [48] with the options --missing 1 and --mac 2 were then applied. SNPs with heterozygosity above 20% were removed. Beagle v4.1 was used to phase the high-confidence set of SNPs. The comparison was performed on four groups of samples: 200 random tomato and wild samples (RANDOM), 100 domesticated tomato samples (DOM), 50 Solanum pimpinellifolium samples (PIM), and 36 distantly related wild species (WILD). The 100 domesticated samples from PRJNA353161 only, 15 DOM-REF samples from PRJEB5235 only, 50 PIM samples and 36 WILD samples were randomly selected for generating simulated datasets. Polymorphic SNPs in each dataset were randomly masked using a custom Python script if there were more than 7 reads supporting the genotypes. Both Beagle v4.1 and the 2-step imputation method were used to impute missing genotypes in the five simulated datasets. The concordance (R²) between genotyped and imputed values was calculated as the imputation accuracy using BCFtools [49].
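The masking and scoring steps can be sketched as follows. This is not the authors' script or the BCFtools implementation: it assumes genotypes encoded as alternate-allele dosages (0, 1, 2), selects maskable sites by the read-depth rule stated above, and scores accuracy as the squared Pearson correlation between true and imputed dosages. The function names are hypothetical.

```python
import random

def choose_masked(depths, fraction, rng, min_reads=7):
    """Pick genotype indices to mask among those supported by > min_reads reads."""
    eligible = [i for i, depth in enumerate(depths) if depth > min_reads]
    k = int(len(eligible) * fraction)
    return sorted(rng.sample(eligible, k))

def r_squared(truth, imputed):
    """Squared Pearson correlation between two equal-length dosage lists."""
    n = len(truth)
    mean_t, mean_i = sum(truth) / n, sum(imputed) / n
    sxy = sum((t - mean_t) * (i - mean_i) for t, i in zip(truth, imputed))
    sxx = sum((t - mean_t) ** 2 for t in truth)
    syy = sum((i - mean_i) ** 2 for i in imputed)
    if sxx == 0 or syy == 0:
        return float("nan")  # undefined when either list has no variation
    return sxy * sxy / (sxx * syy)

rng = random.Random(0)
depths = [12, 3, 9, 30, 7, 8, 15, 2]
masked = choose_masked(depths, fraction=0.5, rng=rng)

truth = [0, 1, 2, 0, 2, 1]
perfect = r_squared(truth, [0, 1, 2, 0, 2, 1])    # every genotype recovered
one_error = r_squared(truth, [0, 1, 2, 0, 2, 2])  # one heterozygote miscalled
```

Restricting masking to well-supported genotypes means the values hidden from the imputation program are themselves trustworthy, so disagreement after imputation can be attributed to the imputation step.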

Variant filtering algorithms comparison

The 602 tomato datasets were used to generate raw SNPs using the BWA-MEM and GATK-HC pipeline. Hard-filtered (“QD < 2.0 || FS > 60.0 || MQ < 40.0 || SOR > 3.0”), machine-learning based and combined filtering methods were individually applied to the raw dataset. The machine-learning based method followed the GATK Best Practice Workflow (https://software.broadinstitute.org/gatk/documentation/article.php?id=2805). The prior likelihood assigned to SolCap markers and filtered-SolCap markers was Q10, and a threshold of 99.9 was used to generate tranches. Polymorphic SNPs in the first 10 million base pairs of Chromosome 1 were selected to test the performance of the different filtering methods. PCA was performed using SNPRelate after LD pruning (R² > 0.2). LD decay was calculated using the PopLDdecay package (https://github.com/BGI-shenzhen/PopLDdecay.git) with default parameters.
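To make the hard-filter expression concrete, the sketch below evaluates the same four thresholds on per-variant annotations held in plain dictionaries. It is an illustration of the logic only, not a substitute for GATK: a variant is removed when any one condition is true, and every annotation is assumed to be present.

```python
def fails_hard_filter(ann):
    """True if the variant matches QD < 2.0 || FS > 60.0 || MQ < 40.0 || SOR > 3.0."""
    return (
        ann["QD"] < 2.0      # low quality normalized by depth
        or ann["FS"] > 60.0  # strand bias (Fisher's exact test)
        or ann["MQ"] < 40.0  # low mapping quality
        or ann["SOR"] > 3.0  # strand bias (strand odds ratio)
    )

# Hypothetical annotation values for three variants.
variants = [
    {"id": "snp1", "QD": 25.1, "FS": 1.2, "MQ": 60.0, "SOR": 0.7},
    {"id": "snp2", "QD": 1.4, "FS": 0.0, "MQ": 60.0, "SOR": 0.9},
    {"id": "snp3", "QD": 18.0, "FS": 75.3, "MQ": 58.0, "SOR": 3.4},
]
kept = [v["id"] for v in variants if not fails_hard_filter(v)]
```

Because the cutoffs are fixed, a variant such as `snp2` is discarded on a single metric regardless of how well its other annotations look, which is the behavior the machine-learning approach is intended to relax.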

Alignment program evaluation

The performance of three different aligners, BWA-MEM, Bowtie2, and SOAP2, was evaluated using Illumina paired-end read datasets from 52 domesticated tomatoes and 30 wild relatives (Supplemental Table 1) [35], and simulated genomic sequences from different crops. Mapping percentage, alignment accuracy, and processing time for each aligner were evaluated.

The ability to align reads to a domesticated tomato reference genome, Solanum lycopersicum [36], was assessed using default and tuned parameters on Bowtie2 (Bowtie2 and Bowtie2-tuned), SOAP2 (SOAP2 and SOAP2-tuned), and default parameters for BWA-MEM. Parameter tuning (see details in Methods) for Bowtie2 and SOAP2 was necessary to attempt to match the mapping percentage to the default used by BWA-MEM. BWA-MEM showed the highest alignment percentage, 99.54% and 95.95% in domesticated and wild relatives, respectively, while SOAP2 showed the lowest alignment percentage, 91.25% and 40.58%, respectively (Supplemental Table 2). In the domesticated tomato datasets, all of the five alignment settings resulted in more than 90% mapping percentage with standard deviation ranging from 0.34% to 3.77% (Figure 1A). Greater variation in mapping percentage existed when analyzing the sequences from wild species with standard deviation ranging from 1.91% to 24.25%. The mapping percentage in the wild tomato samples displayed a bimodal distribution (Figure 1A). The distribution of the group with higher alignment percentage contained wild species that were closely related to domesticated tomatoes, whereas the lower group contained distantly related wild species based on previous domestication and diversity studies [3, 37]. Alignment percentage was found to be negatively correlated with the IBS distance of each sample to the S. lycopersicum reference genome (Figure 1B). When the sample was distantly related to the reference genome, BWA-MEM resulted in the highest mapping percentage and SOAP2 resulted in the lowest mapping percentage. In terms of processing time, SOAP2 was the fastest aligner in both domesticated and wild tomato datasets, and it was up to five times faster than the slowest alignment setting, Bowtie2-tuned (Supplemental Figure 1A).

We next determined whether a greater alignment percentage or a shorter alignment time came with tradeoffs in accuracy and sensitivity by using simulated datasets and calculating the ratio of true positive (TP), false positive (FP) and false negative (FN) alignments. Simulated datasets were derived from the reference genome by permuting fragment sizes and the number of SNPs or the size of small indels per read. For all alignment methods, the ratio of FP alignments increased as more SNPs or indels were introduced per read (Figure 1C-D) when the fragment size was fixed at 600 nt. When the number of introduced SNPs was less than or equal to 2, the average percentage of FP alignments for BWA-MEM, Bowtie2-tuned and SOAP2-tuned was 0.94%, 1.15% and 0.88%, respectively (Figure 1C). When the number of introduced SNPs was greater than or equal to 4, the average FP alignment rate of BWA-MEM and Bowtie2-tuned increased to 6.41% and 2.54%, respectively, while SOAP2 and SOAP2-tuned were no longer able to find alignments. BWA-MEM was the only aligner capable of finding TP alignments with 15 SNPs per read, with an FP alignment rate of 18.26%. Similar results were also observed in the indel simulation experiment (Figure 1D). Only BWA-MEM was able to find TP alignments of reads with INDELs up to 40 nt in size, at the cost of 26% false alignments.

To indirectly determine the true versus false positive rates of BWA-MEM and Bowtie2 in real data, one million randomly selected reads from six samples (2 S. lycopersicum, 2 S. pennellii and 2 other wild relatives) were aligned to both the S. lycopersicum and Solanum pennellii reference genomes [38]. The positions of alignments with mapping quality (MQ) ≥ 40 were compared against the synteny map of the genomes generated by nucmer [39]. When the alignment position of a read matched the nucmer conversion of the S. lycopersicum coordinate to the S. pennellii coordinate, the read was considered syntenic. If the positions did not match, the read was considered non-syntenic. BWA-MEM was able to align approximately 4.22 times more reads per sample than Bowtie2 (Supplemental Table 3), but only 65.71% (SD ± 2.68%) of these alignments were considered syntenic, compared to 88.17% (SD ± 1.59%) of the Bowtie2 alignments.
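The synteny check can be sketched with a simplified coordinate map. The block structure below is an assumption for illustration: each block is (S. lycopersicum start, S. lycopersicum end, S. pennellii start) on one chromosome, gap-free and in the same orientation. Real nucmer output also contains inversions and indels, and the `tolerance` value is hypothetical.

```python
from bisect import bisect_right

def convert(blocks, starts, pos):
    """Map a source coordinate to the target genome, or None if unmapped."""
    i = bisect_right(starts, pos) - 1  # last block starting at or before pos
    if i < 0:
        return None
    src_start, src_end, dst_start = blocks[i]
    if pos > src_end:
        return None  # falls in a gap between syntenic blocks
    return dst_start + (pos - src_start)

def is_syntenic(blocks, starts, lyc_pos, pen_pos, tolerance=100):
    """True if the two alignment positions agree with the synteny map."""
    expected = convert(blocks, starts, lyc_pos)
    return expected is not None and abs(expected - pen_pos) <= tolerance

# Two toy syntenic blocks, sorted by source start.
blocks = [(1000, 1999, 5000), (3000, 3999, 9000)]
starts = [block[0] for block in blocks]

agrees = is_syntenic(blocks, starts, lyc_pos=1500, pen_pos=5520)
disagrees = is_syntenic(blocks, starts, lyc_pos=1500, pen_pos=80000)
in_gap = is_syntenic(blocks, starts, lyc_pos=2500, pen_pos=7000)
```

A read whose two placements disagree with the map is not necessarily wrong in both genomes, which is why the text describes this as an indirect estimate of false positive alignments.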

To extend the study to other crop species, simulated sequencing datasets were generated from rice, soybean, maize and wheat reference genomes by varying the mutation rate from 0.001 to 0.2 (Figure 1E). In these simulations, both SNPs and INDELs were included. When the mutation rate was equal to or lower than 0.04, BWA-MEM was able to align at least 92% of the sequences correctly for rice, tomato and soybean, whereas it was only able to correctly align 81.5% and 82% of the sequences for maize and wheat, respectively. As the mutation rate increased, differences in both true positive and false positive alignments were seen among the different crops. On average, BWA-MEM was able to find 18.1%, 20.2% and 17.0% more true positive alignments in rice, tomato and soybean than in wheat and maize at mutation rates of 0.08, 0.1, and 0.15, respectively. On the other hand, BWA-MEM generated 18.8%, 22.5%, and 24.5% fewer false positive alignments in rice, tomato and soybean than in wheat and maize at mutation rates of 0.08, 0.1, and 0.15, respectively.

Variant Calling Program Comparison

Four variant datasets were produced from the permutation of the aligners Bowtie2-tuned and BWA-MEM and the variant callers SAMtools-mpileup and GATK-HC using 52 domesticated and 30 wild tomatoes. Results showed a nearly two-fold difference in the number of unfiltered SNPs, ranging from 69.2M to 133.7M. A greater difference in the variant count was observed in wild species than in domesticated ones (Table 1). In domesticated species, dataset sizes ranged from 11.8M to 17.8M unfiltered SNPs, while in wild species they ranged from 66.4M to 128.3M. The primary determinant of variant count between datasets was whether Bowtie2 or BWA-MEM was used.

To further evaluate the differences in the ability to identify variants, both individual-level and population-level simulated datasets were generated with varied mutation rate, sequencing coverage and population size. In the simulated population-level datasets, evaluation was performed on both raw and filtered variants. In the comparison of raw variants, GATK-HC was able to identify more true SNPs at the cost of accuracy as sequencing coverage increased in diversity populations. At 5x and 10x coverage, SAMtools-mpileup achieved a similar recall ratio with a higher precision ratio than GATK-HC in the low diversity population. When dealing with high diversity populations, GATK-HC always outperformed SAMtools-mpileup in both precision and recall (Supplemental Figure 2A). In the comparison of raw INDELs, GATK-HC always outperformed SAMtools-mpileup in terms of precision and recall in the low diversity population. In the high diversity populations, GATK-HC was able to identify a greater number of true INDELs at the cost of accuracy (Supplemental Figure 2B).

In the filtered SNP results, at 5x and 10x sequencing coverage, GATK-HC achieved a higher precision ratio in all coverage and diversity permutations without compromising the recall ratio (Figure 2A). In the 1x coverage simulation dataset, even though SAMtools-mpileup identified variants with a lower precision ratio, it achieved a higher recall ratio. In the filtered INDEL results, GATK-HC always outperformed SAMtools-mpileup in terms of precision and recall ratio in the low diversity population. In the high diversity population, SAMtools-mpileup achieved a higher precision ratio at the cost of a much lower recall ratio (Figure 2B). Notably, SAMtools-mpileup achieved recall ratios of only 3.08% and 1.61% in the high diversity populations for SNPs and INDELs, respectively.

In the individual-level simulated datasets, a consistent pattern of trade-off between precision and recall was observed. SAMtools-mpileup achieved a higher precision ratio for both SNPs and INDELs; however, GATK-HC achieved a higher recall ratio for both SNPs and INDELs as coverage and mutation rate increased in most cases (Supplemental Figure 3A-D). Among the four crop species, rice, tomato and soybean had similar results with both variant calling programs, whereas the simulated maize datasets had lower precision and recall ratios. Notably, at mutation rates of 0.1 and 0.15, both variant calling programs showed a lower precision ratio for SNP detection as coverage increased. The maize datasets had the largest reduction in precision, whereas the other crop species showed similar reductions to one another.

Wild reference genome alignment and variant calling

The large increase in the number of SNPs in wild samples was expected due to both greater distance from the domesticated reference genome and increased diversity relative to the domesticated samples. As expected, as the distance from the reference genome increased, a greater proportion of reads was unmapped. The variants in these unmapped reads, especially in the wild species, could represent “missing diversity”. To test this hypothesis, we evaluated how variant discovery in these 82 tomato samples changed when a wild reference genome (S. pennellii) [38] was used.

The read alignment to the S. pennellii reference was performed under identical settings as above. As previously seen, BWA-MEM showed the highest mapping percentage and SOAP2 showed the lowest (Figure 3A). In general, mapping percentage in domesticated and wild tomato groups were similar regardless of aligner settings used (Figure 3A). The single outlier with high alignment percentage was a S. pennellii sample with an alignment of 95.13% (or 99.69%) as opposed to 34.22% (or 94.87%) against the S. lycopersicum reference using SOAP2-tuned (or BWA-MEM). Concurrent with these results, the 82 samples, except the S. pennellii sample, had similar IBS distances to the reference genome. As with the S. lycopersicum reference, alignment percentage to the S. pennellii reference was inversely proportional to IBS distance to the reference genome (Figure 3B), suggesting this relationship was independent of reference genome used.

To investigate how diversity estimation varied by reference genome, reads from eight randomly selected domesticated tomatoes and eight wild relative accessions were aligned to the S. pennellii reference. Alignment to the S. pennellii reference genome generated a total of 96,712,749 unfiltered SNPs and 59,944,499 filtered SNPs, while a total of 77,718,102 raw SNPs and 53,036,666 filtered SNPs were identified using the S. lycopersicum reference genome. Compared to using the S. lycopersicum reference genome, significantly more SNPs were identified from the 8 domesticated tomato samples when the S. pennellii reference genome was used for variant discovery (Supplemental Figure 4A).

To further investigate the source of this additional variation, a cross-reference comparison was performed between SNPs identified using the S. pennellii and S. lycopersicum reference genomes. One hundred nucleotides of DNA sequence flanking each filtered SNP identified using one reference genome were aligned to the other reference genome. The results in Figure 3C showed that the majority of the filtered SNPs identified in the S. pennellii sample were located on the synteny path of the S. lycopersicum genome. Similarly, in the S. lycopersicum sample, the majority of filtered SNPs identified using the S. pennellii reference were located on the synteny path of the S. lycopersicum genome. This result indicated that by using the S. pennellii reference genome, we were able to identify SNPs that were fixed in the S. lycopersicum domesticated varieties.

Since these SNPs were fixed in S. lycopersicum, they would not have been identified from alignment to the S. lycopersicum reference. Outside of these fixed SNPs in the domesticated species, 4.55% of the flanking sequences of SNPs identified using the S. pennellii genome in chromosome 1 could not be mapped to the S. lycopersicum reference. Similarly, 11.15% of the flanking sequences of SNPs identified in the S. pennellii sample using the S. pennellii genome were not found in the S. lycopersicum genome (Figure 3C). Switching to the domesticated reference genome, 7.13% of the downstream sequences of SNPs identified in an S. lycopersicum sample using the S. lycopersicum genome could not be found in the S. pennellii genome (Supplemental Figure 4B). These results indicated that a large portion of the variation in the wild species would be missed if a single domesticated genome was used as the reference, and vice versa.

Hard-filtering and machine-learning based variant filtering

Variant filtering is required to minimize both false positive and false negative genotype calls. Comparisons were made between three variant filtering methods: setting empirical hard-cutoffs (HARD) on metrics such as read depth, strand bias, and variant quality; a newly implemented machine-learning based (ML) variant filtering method [14]; and a combination of HARD and ML (COMBINED) filtering. Filtered datasets generated from the 602 WGS tomato datasets, including a wide range of domesticated and wild tomato samples [25], were analyzed. A training dataset of 8,401 markers from SolCap was used for the training phase of ML [40]. SolCap is a high-confidence dataset consisting of verified markers previously used in genetic studies. In the COMBINED method, the HARD filters were first applied to SolCap to remove low-confidence markers and yield a training set of 7,633 variants. Results indicated that the HARD-filtered method retained the fewest SNPs (94.2M), which was 26.3% and 7.1% fewer than the ML-filtered (127.8M) and COMBINED-filtered (101.4M) datasets, respectively (Supplemental Table 4). SNPs in the first 10 million bases of Chromosome 1 (Supplemental Table 4) were cross-compared between the three datasets. Seventy percent of the SNPs in this segment were shared among all three filtered datasets (Supplemental Figure 5A), while each dataset had a subset of unique variants.
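The percentage differences quoted above follow directly from the reported SNP counts. A brief worked check, using only the rounded counts in millions given in the text:

```python
def percent_fewer(smaller, larger):
    """Percentage by which `smaller` falls short of `larger`."""
    return 100.0 * (larger - smaller) / larger

hard, ml, combined = 94.2, 127.8, 101.4  # retained SNPs, in millions

hard_vs_ml = round(percent_fewer(hard, ml), 1)
hard_vs_combined = round(percent_fewer(hard, combined), 1)
```

Both values are expressed relative to the larger dataset, so they describe how many fewer SNPs HARD retains, not how many more the other methods keep.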

Two methods were used to indirectly infer the quality of filtered datasets: recapitulation of diversity estimates generated by a “gold standard” set of 22,336,965 SNPs (See details in Methods) in the form of PCA (Supplemental Figure 5B) and IBS analyses (Supplemental Figure 5C), and calculation of LD decay distance for each filtered dataset. SNPs identified by all three filtering methods were removed for this analysis so that the efficacy of each method could be evaluated independently. The underlying assumption of these analyses is that true diversity would recapitulate the known population structure, whereas the population structure would begin to break down as the number of artifacts increased. Using the “gold standard” variant dataset, samples were grouped into four clusters based on PCA and IBS results. All three filtering methods were able to resolve Cluster 1 and Cluster 4, whereas the HARD and ML filtering methods were not able to clearly resolve Cluster 2 from Cluster 3 (Figure 4A-B). In contrast, the COMBINED filtering method was able to identify all four original clusters to reconstruct the population structure of 82 Solanum genomes (Figure 4C).

Next, the contribution of false positive SNPs in each filtered dataset was evaluated by calculating the rate of LD decay. The assumption was that false positive SNPs were random noise that would not be in LD with nearby SNPs. Therefore, the apparent rate of LD decay in a dataset would increase as the number of false positives increased. As predicted, a greater rate of LD decay was found in all three filtered datasets than in the high-confidence dataset. Of the three filtered datasets, however, the COMBINED method had the lowest rate of LD decay (Figure 4D), approximating the rate of LD decay seen in the high-confidence SNP dataset.
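The reasoning behind this test can be illustrated with the classical pairwise LD statistic, r² = D² / (pA(1 - pA) pB(1 - pB)), where D = pAB - pA·pB. The sketch below assumes phased haplotypes coded 0/1 and uses toy data; it is not the PopLDdecay implementation.

```python
def ld_r2(hap_a, hap_b):
    """Pairwise LD (r squared) between two biallelic sites from phased haplotypes."""
    n = len(hap_a)
    p_a = sum(hap_a) / n
    p_b = sum(hap_b) / n
    # Frequency of haplotypes carrying the 1 allele at both sites.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return float("nan")  # undefined for a monomorphic site
    return d * d / denom

site = [1, 1, 1, 1, 0, 0, 0, 0]
linked_neighbor = [1, 1, 1, 1, 0, 0, 0, 0]  # inherited together with `site`
noise_call = [1, 0, 1, 0, 1, 0, 1, 0]       # independent of `site`

r2_linked = ld_r2(site, linked_neighbor)
r2_noise = ld_r2(site, noise_call)
```

A nearby true SNP contributes a high r² at short distance, while an artifact behaves like `noise_call` and pulls the average toward zero, so a dataset with more false positives appears to decay faster.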

To quantitatively measure the difference between hard filtering and machine-learning based filtering, simulated datasets with varied population size, mutation rate and sequencing coverage were generated (Supplemental Figure 6A-B). In the low diversity population datasets, machine-learning based SNP filtering always outperformed hard SNP filtering, by 7.38% and 14.14% on average for precision and recall ratio, respectively. For INDEL filtering in the low diversity dataset, machine-learning based filtering and hard filtering gave comparable precision; however, machine-learning based filtering achieved a 12.49% higher recall ratio than hard filtering. In the high diversity population, SNPs and INDELs showed similar results across the filtering methods. A minor difference was observed in the recall ratio between machine-learning based and hard filtering. No difference was found in the precision ratio between machine-learning based and hard filtering in the high diversity population.

Two-step imputation method

Missing genotypes, possibly due to technical limitations, are commonly resolved via imputation. In human studies, standard imputation methods leverage linkage disequilibrium (LD) and reference panels [41]. Beagle 4.1 is a commonly used imputation algorithm in plant studies that can function with or without a reference panel. To determine the importance of a reference panel for SNP imputation, both LD-based and reference panel-assisted imputation were applied to several datasets. A reference panel of 22,336,965 high-confidence, phased SNPs was generated from 82 high-coverage (30x) WGS tomato datasets. Imputation results were compared between the two methods. In the first method, missing SNPs were imputed without a reference panel. In the second method, imputation was performed in two steps: in the first step, a reference panel was used to impute missing calls only for missing reference variants; in the second step, the remaining missing, non-reference SNPs were imputed. Samples were placed in four groups and varying percentages of high-confidence genotypes were masked to act as “missing” data (see details in Methods). The concordance (r-squared) between the original masked and imputed genotypes was calculated to estimate imputation accuracy.

No difference between LD-based and 2-step imputation was observed in the 100 domesticated (DOM) tomato samples (Figure 5A) or the 50 Solanum pimpinellifolium (PIM) samples (Supplemental Figure 7A). In the dataset of 200 randomly selected tomato samples (RANDOM), a 4% difference was observed at 47% missing data (Supplemental Figure 7B). In the dataset of 36 wild tomato species (WILD), 2-step imputation showed 60% higher accuracy than LD-based imputation when the missing percentage was set at 72% (Figure 5B). High LD between SNPs may reduce the need for a reference panel in imputation. The calculated LD decay for each dataset showed that DOM had the slowest LD decay and WILD had the fastest (Supplemental Figure 7C). Because limited samples of wild tomato were available, the number of samples used in the simulation was also considerably higher in DOM (100) than in WILD (36). As such, considerably more information was available for imputation in the DOM dataset than in the WILD dataset, which not only had fewer samples but also contained multiple species. To determine whether LD remained sufficient for imputation in small domesticated panels when the amount of missing data was considerable, 15 randomly selected domesticated tomato samples that were also included in the reference panel (15-DOM-REF) had up to 85% of their genotypes masked. Both methods were applied to the 15-DOM-REF dataset. Two-step imputation was 9.25 times more accurate than direct imputation with Beagle v4.1 when the missing percentage was 85% (Figure 5C).

Discussion

The ability to accurately and comprehensively identify genetic variation is critical for studying diversity, trait mapping and breeding in plant genomics. Many plant studies involve high levels of genetic diversity and, in some instances, incorporate distantly related varieties and wild relatives. Neither of these conditions is common in human studies, and as such pipelines designed and evaluated on human data may perform differently than expected. Therefore, we evaluated programs that are commonly used in plant genomic studies for each SNP discovery step, including read alignment, variant calling, variant filtering and missing data imputation, in the context of plant diversity discovery.

One of the first computational steps in the variant discovery pipeline is the alignment of reads to a suitable reference genome. Previous aligner evaluation studies have been performed using either human or microbial genomic datasets [42, 43], which may not represent the levels or types of diversity expected in plant studies. We performed alignment using both real and simulated tomato datasets and found that aligners differed considerably in their tolerance of sequence variation in paired-end reads. BWA-MEM outperformed the four other alignment settings in mapping percentage while still maintaining high mapping accuracy. Neither SOAP2 nor Bowtie2 was able to align as many reads, even after their settings were optimized to account for increased variation. In this study, we chose not to tune BWA-MEM, mostly because the mapping percentage was high with the default settings and because it has no obvious parameter, such as the number of mismatches allowed or the fragment size, like those found in Bowtie2 or SOAP2. In addition, many users, especially those who are not experts in bioinformatics, are likely to keep the default settings of a program.

BWA-MEM’s increased sensitivity may come at a cost: as the number of SNPs or the size of INDELs per read increased, its false positive rate also became slightly higher than that of Bowtie2-tuned (Figure 1C-D). The increased number of false positive alignments may, in turn, result in erroneous variant identifications. Nevertheless, given the relatively high sensitivity and accuracy of BWA-MEM, our results indicate that under most circumstances it is probably the most suitable algorithm for read mapping in plant datasets, especially when distantly related samples are included in the analysis. If high accuracy is required at the cost of lower sensitivity, Bowtie2 may be the better choice. Although SOAP2 was the fastest aligner tested, its difficulty in aligning reads with high variance from the reference genome makes it unsuitable for studies where significant levels of genomic diversity may be present.
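The two quantities traded off in this comparison, mapping percentage and false positive alignments, can be computed for simulated reads whose true origin is known. The Python sketch below is illustrative only. The SimulatedRead record, the 5 bp position tolerance, and the definition of a false positive as a mapped read placed away from its simulated origin are assumptions, not the evaluation code used in this study.

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple


@dataclass
class SimulatedRead:
    """Hypothetical record pairing a read's simulated origin with its alignment."""
    true_chrom: str
    true_pos: int
    aligned_chrom: Optional[str]  # None when the aligner left the read unmapped
    aligned_pos: Optional[int]


def evaluate_alignments(
    reads: Iterable[SimulatedRead], tolerance: int = 5
) -> Tuple[float, float]:
    """Return (mapping percentage, false positive rate among mapped reads)."""
    total = mapped = correct = 0
    for read in reads:
        total += 1
        if read.aligned_chrom is None or read.aligned_pos is None:
            continue
        mapped += 1
        same_chrom = read.aligned_chrom == read.true_chrom
        if same_chrom and abs(read.aligned_pos - read.true_pos) <= tolerance:
            correct += 1
    mapping_pct = 100.0 * mapped / total if total else 0.0
    false_positive_rate = (mapped - correct) / mapped if mapped else 0.0
    return mapping_pct, false_positive_rate


# Four invented reads: two placed correctly, one misplaced, one unmapped.
reads = [
    SimulatedRead("chr1", 1000, "chr1", 1002),
    SimulatedRead("chr1", 5000, "chr1", 5000),
    SimulatedRead("chr2", 300, "chr5", 7000),
    SimulatedRead("chr3", 800, None, None),
]
mapping_pct, fpr = evaluate_alignments(reads)
print(f"mapped: {mapping_pct:.1f}%  false positive rate: {fpr:.3f}")
```

An aligner that maps more reads raises the first number, but every additional misplaced read raises the second, which is the tradeoff discussed for BWA-MEM and Bowtie2-tuned.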

The next step in an analysis pipeline is variant calling. Comparisons between aligner-variant caller combinations indicated that the alignment algorithm had a greater impact on the number of variants discovered than the variant caller used. For a given aligner, SAMtools-mpileup and GATK-HC identified similar total numbers of SNPs in the real tomato genomic dataset. This further emphasizes the importance of selecting an aligner appropriate to the goals of a study, especially when high diversity samples such as wild relatives and related species are included. According to the simulation results, GATK-HC identified more true positive variants at higher precision in most cases. In the high diversity population simulation in particular, GATK-HC was preferable to SAMtools-mpileup because SAMtools-mpileup yielded very low recall in both SNP and INDEL detection.

In general, we recommend GATK-HC for variant calling and filtering for several reasons. First, GATK-HC outperformed SAMtools-mpileup in most of our test scenarios, yielding higher precision and recall for SNP and INDEL detection. Second, by using the GVCF system, GATK-HC allows new samples to be added to a dataset rapidly without recalling genotypes from aligned reads for all samples, including previously genotyped ones. This saves considerable time and computational expense when adding samples to a dataset. Third, GATK-HC supports multi-thread processing, which is not available in SAMtools-mpileup. On high-performance clusters, multi-threading can save considerable processing time, especially for large studies. Finally, the GATK package supports sophisticated machine-learning-based variant filtering (VQSR), which showed superior performance to empirical hard cutoffs. We did, however, find situations in which SAMtools-mpileup is preferable, depending on the goal of the study. For example, for a low diversity population with very low sequencing coverage (1x), SAMtools-mpileup identified more true SNPs than GATK-HC, but at the cost of lower precision. If the purpose of the experiment is to identify as many true positive SNPs as possible, then SAMtools-mpileup could be used in this particular situation. Another situation in which SAMtools-mpileup may be preferable is identifying SNPs in a closely related sample. According to the simulation results from single samples, SAMtools-mpileup yielded slightly higher precision and recall than GATK-HC when the mutation rate was lower than 0.05. If the experiment aims to characterize SNPs in a line that is closely related to the reference genome, SAMtools-mpileup could be used.

Variant filtering is the third step in a diversity assessment pipeline. Three approaches were evaluated: hard filters on various quality metrics, machine learning as implemented in GATK (VQSR), and a combined approach. The combined approach, which used hard-filtered SolCap markers as the training dataset, showed significant improvements over the other variant filtering methods. According to the PCA plots (Figure 3A-C) and the LD decay figure (Figure 3D), the combined method generated more true positives, with fewer false negative and fewer false positive SNPs, when an appropriate training dataset was used. This indicates that machine-learning-based methods may be better suited to identifying true positives and eliminating false positive SNPs than empirical hard filtering. The difference between the combined and VQSR results underscores the importance of the training dataset: the machine learning model will learn from errors in the training dataset, which may contribute to false positive variants. The downside of machine-learning-based filtering is that its implementation is complicated and requires an experimentally validated (high-confidence) training set. In human studies, this information can be obtained from numerous genomic resources such as HapMap, the 1000 Genomes Project and omni SNP array datasets. Such resources are available for only a few major crops, such as maize [20], rice [44] and soybean [45]. The simulation tests led to similar conclusions: VQSR outperformed hard filtering in general. Nevertheless, only a minor difference was found for both SNP and INDEL filtering when the simulated population had high diversity, suggesting that the quality metrics used by VQSR may not be sophisticated enough to differentiate true variants from false positive variants. This also indicates that new quality metrics may be necessary, especially for genomic regions that can be hyper-variable.
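One plausible reading of the combined approach is that calls at known marker positions are first passed through hard filters, and only the survivors are used to train the machine learning model. The Python sketch below illustrates that selection step under stated assumptions. The cutoffs on QD, FS and MQ are the generic SNP values from GATK's hard-filtering guidance and are not necessarily the thresholds used in this study. The function names, the site records and the known marker set are hypothetical, and the sketch does not run VQSR itself.

```python
from typing import Dict, Set, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def passes_hard_filter(info: Dict[str, float]) -> bool:
    """Apply fixed cutoffs to per-site quality annotations.

    ASSUMPTION: generic GATK SNP cutoffs (QD >= 2.0, FS <= 60.0, MQ >= 40.0).
    A missing annotation is treated as passing in this sketch.
    """
    return (
        info.get("QD", float("inf")) >= 2.0
        and info.get("FS", 0.0) <= 60.0
        and info.get("MQ", float("inf")) >= 40.0
    )


def build_training_set(
    calls: Dict[Site, Dict[str, float]], known_markers: Set[Site]
) -> Set[Site]:
    """Keep called sites that are known markers and also pass the hard filter."""
    return {
        site
        for site, info in calls.items()
        if site in known_markers and passes_hard_filter(info)
    }


# Invented call annotations and marker positions.
calls = {
    ("chr1", 100): {"QD": 25.0, "FS": 1.2, "MQ": 60.0},
    ("chr1", 200): {"QD": 1.1, "FS": 3.0, "MQ": 58.0},   # fails QD
    ("chr2", 50): {"QD": 18.0, "FS": 75.0, "MQ": 60.0},  # fails FS
    ("chr2", 90): {"QD": 30.0, "FS": 0.5, "MQ": 59.0},   # passes, but not a known marker
}
known_markers = {("chr1", 100), ("chr1", 200), ("chr2", 50)}

training_sites = build_training_set(calls, known_markers)
print(sorted(training_sites))
```

Restricting the training set to markers that also pass hard filters reduces the chance that the model learns from miscalled marker positions, which is the risk noted for training data that contain errors.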

The final step in the variant discovery pipeline is imputation. Reference panels are routinely employed in human studies but not in plant genomics. To evaluate the importance of a reference panel for imputation, Beagle v4.1 [16] was used to impute masked genotypes in four sample groups, both without a reference panel and with a reference panel in a two-step process in which SNPs contained in the reference panel were imputed first and imputation was then extended to the entire dataset. Our results showed that the two-step imputation method was able to use a de novo reference panel of SNPs generated from high coverage sequencing data to assist imputation in the low coverage samples. The two-step imputation method was superior to the LD-based imputation method in sample groups that contained wild species. In addition, even for closely related samples, a certain number of samples must be present for LD-based imputation to produce valid results; if there are insufficient samples, a reference panel may be required (Figure 4C). The tradeoff was that 2-step imputation doubled the running time and would incorrectly impute SNPs that were missing because of structural variation rather than technical issues. Therefore, care must be taken not to introduce false positives, since presence-absence variations are common in plants. These genomic regions could be identified prior to imputation to avoid this pitfall.
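The control flow of the two-step scheme can be sketched in Python as follows. This is not Beagle and does not use LD: a simple majority-genotype fill stands in for the imputation engine in both steps, purely to show which sites are handled in which step. The data layout (site identifier mapped to per-sample alternate-allele counts, with None for missing calls), the function names and the example values are all hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional

# site id -> per-sample alternate-allele counts; None marks a missing call
Genotypes = Dict[str, List[Optional[int]]]


def most_common_genotype(values: List[Optional[int]]) -> Optional[int]:
    """Most frequent observed genotype, or None if nothing was observed."""
    observed = [v for v in values if v is not None]
    return Counter(observed).most_common(1)[0][0] if observed else None


def fill_missing(calls: List[Optional[int]], fill: Optional[int]) -> List[Optional[int]]:
    """Replace missing calls with the fill value, leaving observed calls untouched."""
    return [fill if c is None else c for c in calls]


def two_step_impute(genotypes: Genotypes, panel: Dict[str, List[int]]) -> Genotypes:
    """Step 1: fill sites present in the panel from the panel.
    Step 2: fill any remaining missing calls from the study samples themselves."""
    step1: Genotypes = {}
    for site, calls in genotypes.items():
        if site in panel:
            step1[site] = fill_missing(calls, most_common_genotype(list(panel[site])))
        else:
            step1[site] = list(calls)
    return {
        site: fill_missing(calls, most_common_genotype(calls))
        for site, calls in step1.items()
    }


# Invented example: site s1 is in the reference panel, site s2 is not.
study = {"s1": [0, None, 0], "s2": [2, 2, None]}
reference_panel = {"s1": [2, 2, 0]}

print(two_step_impute(study, reference_panel))
```

In this toy example the missing call at s1 is filled from the panel rather than from the study samples, which would have suggested a different genotype. A region that is absent because of presence-absence variation would still be filled in, which is the pitfall described in this paragraph.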

The effect of presence-absence variation (PAV) on identifying missing genetic diversity is a special concern in studies that include high diversity samples. This issue can be seen in the results of the cross-reference experiment. Up to 11.15% of the variations identified using the wild reference could not be mapped back to the domesticated S. lycopersicum genome, and vice versa. These results indicate the inadequacy of a single reference genome for comprehensive variant discovery. They also indicate that employing multiple reference genomes could identify additional sources of diversity that go undetected when using a single reference. These results have implications for the utility of pan-genomes. Multiple references or pan-genomes would likely increase the detection of “missing diversity” that is due primarily to PAV between samples. Moreover, using a distantly related reference genome may allow the detection of SNPs that would be undetected using a closely related reference genome. These species-specific, fixed variants have implications for the evolutionary history of plant species, such as domestication events.

Conclusions

In conclusion, we found that BWA-MEM was better overall at detecting more true-positive alignments, especially in distantly related samples, while Bowtie2 was better at minimizing incorrect alignments. Incorporating multiple reference genomes gave a more complete picture of variation, especially when the samples showed considerable presence-absence variation. For filtering, the optimal approach in our tests was a combination of machine learning and hard filtering, in which a set of “known” SNPs was used as the training set for machine learning. This, however, requires a panel of known, high-quality SNPs, which may be unavailable for many plant species. Finally, high-quality reference panels proved important during the imputation step, especially when genotype imputation was challenging because of small LD blocks or too few samples. Above all, the computational pipeline used to discover variation from plant sequencing data will depend upon the diversity of the datasets, whether the goals of the experiment benefit from higher sensitivity or accuracy, the depth of sequence coverage, and the availability of external resources such as reference panels and gold-standard SNPs.

CONFLICTS OF INTEREST

There are no conflicts of interest.

ACKNOWLEDGEMENTS

We thank Dr. Christopher Fragoso for valuable discussions and critical reading of the manuscript. We acknowledge support from the Yale University High Performance Computing Center and the Yale Center for Genome Analysis. This project has been supported in part by the National Science Foundation Plant Genome Research Program to SLD (#1444478).

References

[1] Bevan MW, Uauy C, Wulff BB, Zhou J, Krasileva K, Clark MD: Genomic innovation for crop improvement. Nature 2017; 543(7645):346-354.
[2] Zhou Z, Jiang Y, Wang Z, Gou Z, Lyu J, Li W, Yu Y, Shu L, Zhao Y, Ma Y et al: Resequencing 302 wild and cultivated accessions identifies genes related to domestication and improvement in soybean. Nat Biotechnol 2015; 33(4):408-414.
[3] Lin T, Zhu G, Zhang J, Xu X, Yu Q, Zheng Z, Zhang Z, Lun Y, Li S, Wang X et al: Genomic analyses provide insights into the history of tomato breeding. Nat Genet 2014; 46(11):1220-1226.
[4] Callaway E: Domestication: The birth of rice. Nature 2014; 514(7524):S58-59.
[5] Hufford MB, Xu X, van Heerwaarden J, Pyhajarvi T, Chia JM, Cartwright RA, Elshire RJ, Glaubitz JC, Guill KE, Kaeppler SM et al: Comparative population genomics of maize domestication and improvement. Nat Genet 2012; 44(7):808-811.
[6] Torkamaneh D, Boyle B, Belzile F: Efficient genome-wide genotyping strategies and data integration in crop plants. Theor Appl Genet 2018.
[7] Li H, Homer N: A survey of sequence alignment algorithms for next-generation sequencing. Brief Bioinform 2010; 11(5):473-483.
[8] Wu TD, Nacu S: Fast and SNP-tolerant detection of complex variants and splicing in short reads. Bioinformatics 2010; 26(7):873-881.
[9] Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 2010; 26(5):589-595.
[10] Li R, Yu C, Li Y, Lam TW, Yiu SM, Kristiansen K, Wang J: SOAP2: an improved ultrafast tool for short read alignment. Bioinformatics 2009; 25(15):1966-1967.
[11] Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nat Methods 2012; 9(4):357-359.
[12] Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R, Genome Project Data Processing S: The Sequence Alignment/Map format and SAMtools. Bioinformatics 2009; 25(16):2078-2079.
[13] Garrison E, Marth G: Haplotype-based variant detection from short-read sequencing. arXiv 2012, 1207.3907.
[14] Poplin R, Ruano-Rubio V, DePristo MA, Fennell TJ, Carneiro MO, Van der Auwera GA, Kling DE, Gauthier LD, Levy-Moonshine A, Roazen D et al: Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv 2017.
[15] Li H: FermiKit: assembly-based variant calling for Illumina resequencing data. Bioinformatics 2015; 31(22):3694-3696.
[16] Browning BL, Browning SR: Genotype Imputation with Millions of Reference Samples. Am J Hum Genet 2016; 98(1):116-126.
[17] Howie BN, Donnelly P, Marchini J: A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet 2009; 5(6):e1000529.
[18] Genomes Project C, Abecasis GR, Auton A, Brooks LD, DePristo MA, Durbin RM, Handsaker RE, Kang HM, Marth GT, McVean GA: An integrated map of genetic variation from 1,092 human genomes. Nature 2012; 491(7422):56-65.
[19] International HapMap C, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P et al: A second generation human haplotype map of over 3.1 million SNPs. Nature 2007; 449(7164):851-861.
[20] Bukowski R, Guo X, Lu Y, Zou C, He B, Rong Z, Wang B, Xu D, Yang B, Xie C et al: Construction of the third-generation Zea mays haplotype map. Gigascience 2018; 7(4):1-12.
[21] project rg: The 3,000 rice genomes project. Gigascience 2014; 3:7.
[22] Jacob P, Avni A, Bendahmane A: Translational Research: Exploring and Creating Genetic Diversity. Trends Plant Sci 2018; 23(1):42-52.
[23] Migicovsky Z, Myles S: Exploiting Wild Relatives for Genomics-assisted Breeding of Perennial Crops. Front Plant Sci 2017; 8:460.
[24] Wulff BB, Moscou MJ: Strategies for transferring resistance into wheat: from wide crosses to GM cassettes. Front Plant Sci 2014; 5:692.
[25] Zhu G, Wang S, Huang Z, Zhang S, Liao Q, Zhang C, Lin T, Qin M, Peng M, Yang C et al: Rewiring of the Fruit Metabolome in Tomato Breeding. Cell 2018; 172(1-2):249-261 e212.
[26] Zhang S, Yu H, Wang K, Zheng Z, Liu L, Xu M, Jiao Z, Li R, Liu X, Li J et al: Detection of major loci associated with the variation of 18 important agronomic traits between Solanum pimpinellifolium and cultivated tomatoes. Plant J 2018.
[27] Krause K, Johnsen HR, Pielach A, Lund L, Fischer K, Rose JKC: Identification of tomato introgression lines with enhanced susceptibility or resistance to infection by parasitic giant dodder (Cuscuta reflexa). Physiol Plant 2018; 162(2):205-218.
[28] Rambla JL, Medina A, Fernandez-Del-Carmen A, Barrantes W, Grandillo S, Cammareri M, Lopez-Casado G, Rodrigo G, Alonso A, Garcia-Martinez S et al: Identification, introgression, and validation of fruit volatile QTLs from a red-fruited wild tomato species. J Exp Bot 2017; 68(3):429-442.
[29] Periyannan S, Moore J, Ayliffe M, Bansal U, Wang X, Huang L, Deal K, Luo M, Kong X, Bariana H et al: The gene Sr33, an ortholog of barley Mla genes, encodes resistance to wheat stem rust race Ug99. Science 2013; 341(6147):786-788.
[30] Michael TP, VanBuren R: Progress, challenges and the future of crop genomes. Curr Opin Plant Biol 2015; 24:71-81.
[31] Schatz MC, Witkowski J, McCombie WR: Current challenges in de novo plant genome sequencing and assembly. Genome Biol 2012; 13(4):243.
[32] Wang W, Mauleon R, Hu Z, Chebotarov D, Tai S, Wu Z, Li M, Zheng T, Fuentes RR, Zhang F et al: Genomic variation in 3,010 diverse accessions of Asian cultivated rice. Nature 2018; 557(7703):43-49.
[33] Huang X, Han B: Natural variations and genome-wide association studies in crop plants. Annu Rev Plant Biol 2014; 65:531-551.
[34] Morrell PL, Buckler ES, Ross-Ibarra J: Crop genomics: advances and applications. Nat Rev Genet 2011; 13(2):85-96.
[35] Tomato Genome Sequencing C, Aflitos S, Schijlen E, de Jong H, de Ridder D, Smit S, Finkers R, Wang J, Zhang G, Li N et al: Exploring genetic variation in the tomato (Solanum section Lycopersicon) clade by whole-genome sequencing. Plant J 2014; 80(1):136-148.
[36] Tomato Genome C: The tomato genome sequence provides insights into fleshy fruit evolution. Nature 2012; 485(7400):635-641.
[37] Strickler SR, Bombarely A, Munkvold JD, York T, Menda N, Martin GB, Mueller LA: Comparative genomics and phylogenetic discordance of cultivated tomato and close wild relatives. PeerJ 2015; 3:e793.
[38] Bolger A, Scossa F, Bolger ME, Lanz C, Maumus F, Tohge T, Quesneville H, Alseekh S, Sorensen I, Lichtenstein G et al: The genome of the stress-tolerant wild tomato species Solanum pennellii. Nat Genet 2014; 46(9):1034-1038.
[39] Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, Salzberg SL: Versatile and open software for comparing large genomes. Genome Biol 2004; 5(2):R12.
[40] Sim SC, Durstewitz G, Plieske J, Wieseke R, Ganal MW, Van Deynze A, Hamilton JP, Buell CR, Causse M, Wijeratne S et al: Development of a large SNP genotyping array and generation of high-density genetic maps in tomato. PLoS One 2012; 7(7):e40563.
[41] Das S, Forer L, Schonherr S, Sidore C, Locke AE, Kwong A, Vrieze SI, Chew EY, Levy S, McGue M et al: Next-generation genotype imputation service and methods. Nat Genet 2016; 48(10):1284-1287.
[42] Thankaswamy-Kosalai S, Sen P, Nookaew I: Evaluation and assessment of read-mapping by multiple next-generation sequencing aligners based on genome-wide characteristics. Genomics 2017; 109(3-4):186-191.
[43] Shang J, Zhu F, Vongsangnak W, Tang Y, Zhang W, Shen B: Evaluation and comparison of multiple aligners for next-generation sequencing data analysis. Biomed Res Int 2014; 2014:309650.
[44] Thomson MJ, Singh N, Dwiyanti MS, Wang DR, Wright MH, Perez FA, DeClerck G, Chin JH, Malitic-Layaoen GA, Juanillas VM et al: Large-scale deployment of a rice 6 K SNP array for genetics and breeding applications. Rice (N Y) 2017; 10(1):40.
[45] Song Q, Hyten DL, Jia G, Quigley CV, Fickus EW, Nelson RL, Cregan PB: Development and evaluation of SoySNP50K, a high-density genotyping array for soybean. PLoS One 2013; 8(1):e54985.
[46] Reinert K, Dadi TH, Ehrhardt M, Hauswedell H, Mehringer S, Rahn R, Kim J, Pockrandt C, Winkler J, Siragusa E et al: The SeqAn C++ template library for efficient sequence analysis: A resource for programmers. J Biotechnol 2017; 261:157-168.
[47] Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS: A high-performance computing toolset for relatedness and principal component analysis of SNP data. Bioinformatics 2012; 28(24):3326-3328.
[48] Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST et al: The variant call format and VCFtools. Bioinformatics 2011; 27(15):2156-2158.
[49] Danecek P, McCarthy SA: BCFtools/csq: haplotype-aware variant consequences. Bioinformatics 2017; 33(13):2037-2039.

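Table 1. Summary of SNPs identified by combinations of aligners and variant calling programs

Aligner + variant caller	Unfiltered SNPs: Total	Unfiltered SNPs: Domesticated tomatoes	Unfiltered SNPs: Wild tomatoes	Unfiltered SNPs: Common	Filtered SNPs: Total	Filtered SNPs: Domesticated tomatoes	Filtered SNPs: Wild tomatoes	Filtered SNPs: Common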
BWA-MEM + GATK-HC	131,449,946	17,771,072	128,294,973	14,616,099	93,739,759	13,628,974	91,482,115	11,371,330
Bowtie2-tuned + GATK-HC	73,393,338	11,813,500	70,453,383	8,873,545	30,307,811	8,261,729	28,243,136	6,197,054
BWA-MEM + SAMtools-mpileup	133,734,683	17,268,821	130,886,221	14,420,359	80,709,232	10,366,835	78,727,565	8,385,168
Bowtie2-tuned + SAMtools-mpileup	69,219,499	12,390,916	66,416,422	9,587,839	46,436,709	8,832,598	44,626,459	7,022,348
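
The counts in Table 1 are arithmetically consistent with reading “Common” as the SNPs found in both the domesticated and the wild groups, so that Domesticated + Wild - Common equals Total in every row. That reading is an inference from the numbers, not a definition stated with the table. The short Python sketch below checks the relation for all rows and reports the share of SNPs retained after filtering; the variable and function names are illustrative.

```python
from typing import Dict, Tuple

# (total, domesticated, wild, common) SNP counts copied from Table 1.
Counts = Tuple[int, int, int, int]

TABLE1: Dict[str, Dict[str, Counts]] = {
    "BWA-MEM + GATK-HC": {
        "unfiltered": (131_449_946, 17_771_072, 128_294_973, 14_616_099),
        "filtered": (93_739_759, 13_628_974, 91_482_115, 11_371_330),
    },
    "Bowtie2-tuned + GATK-HC": {
        "unfiltered": (73_393_338, 11_813_500, 70_453_383, 8_873_545),
        "filtered": (30_307_811, 8_261_729, 28_243_136, 6_197_054),
    },
    "BWA-MEM + SAMtools-mpileup": {
        "unfiltered": (133_734_683, 17_268_821, 130_886_221, 14_420_359),
        "filtered": (80_709_232, 10_366_835, 78_727_565, 8_385_168),
    },
    "Bowtie2-tuned + SAMtools-mpileup": {
        "unfiltered": (69_219_499, 12_390_916, 66_416_422, 9_587_839),
        "filtered": (46_436_709, 8_832_598, 44_626_459, 7_022_348),
    },
}


def is_consistent(counts: Counts) -> bool:
    """Inclusion-exclusion check: domesticated + wild - common == total."""
    total, domesticated, wild, common = counts
    return domesticated + wild - common == total


def retained_fraction(stages: Dict[str, Counts]) -> float:
    """Share of the unfiltered total that remains after filtering."""
    return stages["filtered"][0] / stages["unfiltered"][0]


for pipeline, stages in TABLE1.items():
    ok = all(is_consistent(counts) for counts in stages.values())
    print(f"{pipeline}: consistent={ok}, retained={retained_fraction(stages):.1%}")
```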

Editorial decision: Major revision
11 Jul, 2019
Review #2 received at journal
07 Jun, 2019
Review #1 received at journal
07 Jun, 2019
Reviewer #2 agreed at journal
06 Jun, 2019
Reviewers invited by journal
17 May, 2019
Reviewer #1 agreed at journal
17 May, 2019
Submission checks completed at journal
15 May, 2019
Editor invited by journal
15 May, 2019
Editor assigned by journal
15 May, 2019
First submitted to journal
29 Apr, 2019
