Micro-haplotyping in polyploids using massively parallel amplicon sequencing

Haplotyping in polyploids is a very relevant but challenging task. Most methods based on short-read technologies require haplotype assembly, which is error-prone and computationally intensive. However, most (auto)polyploids have high SNP densities. Therefore, haplotyping could already be done in regions of a size that can be entirely covered by reads generated by current short-read sequencing techniques. Here, we used a method that enables massively parallel amplicon sequencing in order to generate such micro-haplotypes in autotetraploid potato and autohexaploid chrysanthemum. Results


Abstract Background
Haplotyping in polyploids is a very relevant but challenging task. Most methods based on short-read technologies require haplotype assembly, which is error-prone and computationally intensive. However, most (auto)polyploids have high SNP densities. Therefore, haplotyping could already be done in regions of a size that can be entirely covered by reads generated by current short-read sequencing techniques.
Here, we used a method that enables massively parallel amplicon sequencing in order to generate such micro-haplotypes in autotetraploid potato and autohexaploid chrysanthemum.

Results
For potato, we generated 79 million reads to sequence 412 regions in 96 samples. The average region size was 154 base-pairs. On average, we found 2.4 haplotypes per locus per individual. For chrysanthemum, we generated a similar amount of reads to sequence 940 regions in 92 samples. The average region size here was 171 base-pairs. On average, 2.7 haplotypes per locus per individual were found. Concordance with dosage data based on SNP-arrays was up to 96.8% in potato and 94.2% in chrysanthemum.

Conclusions
Interpretation of genotyping data generated with next-generation sequencing can be challenging in polyploids, because estimation of allele dosages is often a requisite. Taking advantage of high SNP densities in many autopolyploids we show that massively parallel amplicon sequencing can generate high-quality data with high information content. This has a broad range of applications, including linkage mapping, diversity studies, and genomic selection.

Background
Innovation in DNA sequencing techniques have had a large in uence on our understanding of genetics.
Due to such innovations, it has become affordable to genotype a large number of samples for a large number of variants for many different species, including polyploids. However, genotyping polyploids is more challenging compared to diploids. This is because one individual can carry more than two alleles at a locus and these alleles can be present in different dosages. The information about allele dosage is important for genetic studies, because a wide range of allelic combinations can occur at a locus that can result in different phenotypes [1][2][3][4] Similar to diploids, SNP genotyping has become state-of-art in polyploids. High-throughput genotyping often is performed by the use of conventional SNP arrays [5][6][7][8][9][10]. By making use of allelic dosage estimation in single SNPs, signi cant progress has been made in GWAS [11,12] and in linkage and QTL mapping [1,[13][14][15]. However, a limitation of such an approach is that a single bi-allelic SNP can only separate two alleles, whereas most polyploid crops are outbreeding and can carry multiple alleles at the same locus. Therefore, assays that tag multiple alleles enable better capture of the genetic diversity of a locus.
The identi cation of haplotypes enables the characterization of multiple alleles within a locus. To obtain haplotypes, sequencing data is often phased using haplotype assembly [16][17][18]. In diploids, this is quite straightforward because one haplotype de nes the phasing of the other. However, in polyploids, this is much more complex. For example, de ning one haplotype in a tetraploid leaves three other haplotypes to phase. This makes haplotype assembly much more computationally intensive.
Direct haplotyping within sequencing reads alleviates the di culties associated with haplotype assembly.
The length of the regions needed for direct haplotyping depends on the SNP density in the regions of interest. In general, SNP densities are high in polyploids. Potato for example has an estimated populationwide SNP density in genic regions of 1 in 24 base pairs, whereas each individual potato variety contains a single variant each 42 bp [19]. This means that direct multi-SNP haplotyping can be done already in relatively short regions of 150 to 200 base pairs, which is in the range of most short-read sequencing technologies. This concept is referred to as micro-haplotyping [20][21][22], and has great potential in polyploid organisms due to their high SNP densities.
With common PCR, short regions can be easily ampli ed in a sample and the resulting amplicons can be sequenced. With massively parallel PCR this can be done for a high number of regions in a single reaction. The AgriSeq technology [23] is developed to automate this for a large number of regions in a large number of genotypes, enabling high-throughput micro-haplotyping. In this study we evaluate this methodology for its suitability for haplotyping in polyploids. We applied this technology in tetraploid potato (Solanum tuberosum) and hexaploid chrysanthemum (Chrysanthemum x morifolium) to generate microhaplotypes. We used loci and individuals that previously have been genotyped with SNP arrays [1,24], enabling the comparison of SNP dosage estimation from both methods.

Results
In order to investigate the potential of massively parallel amplicon sequencing, we used AgriSeq to sequence amplicons in 96 samples for potato and 92 samples for chrysanthemum. Samples from both species including the parents, members of an F1 population, and a small set of cultivars (Additional le 1 & 2).

Sequencing results
For potato, 79 million reads were generated with a mean read length of 152 bp which approached the mean target region size of 154 bp (Table 1). With these reads, we identi ed 2467 SNPs in the target regions. In total, 71.2 million (90.1%), entirely spanned all identi ed SNPs within a region and were considered useful for direct micro-haplotyping. These reads mapped to 409 out of the 412 (99.3%) regions selected on the potato genome [25]. This resulted in an average read depth of 1764 per sample on the target regions. The average number of SNPs on the target regions was 6.0, resulting in an average SNP density of 1 in 26 base pairs. For chrysanthemum, 79 million reads were generated with a mean read length of 176 bp, which exceeded the mean target region size of 171 bp (Table 1). With these reads, we identi ed 6600 SNPs in the target regions. In total, 70.7 million (89.4%) entirely spanned all called SNPs within a target region. These reads mapped to 931 out of the 940 (99.0%) target regions selected on the C. seticuspe reference genome [26]. This resulted in an average read depth per sample on the target regions of 822. The average number of SNPs in the target regions was 7.1, resulting in an average SNP density of 1 in 24 base pairs. Haplotyping Aligned sequence reads were ltered for read alignments that spanned all the SNPs within a target region.
Therefore, we could directly construct and count haplotypes from the sequence alignment les. The haplotype distributions per individual per locus had maxima at the expected allele frequencies of 0, ¼, ½ and ¾ and 1 for potato and 0, 1/6, 1/3, ½, 2/3, 5/6 and 1 for chrysanthemum ( Fig. 1). Dosages of haplotypes were based on haplotype counts using a maximum likelihood method. With this method, any regions and haplotypes that were unlikely to t the expected dosage distributions within an individual were ltered out. Since most haplotype frequencies approached the expected distributions, estimating haplotype dosage was straightforward in most regions ( Fig. 2; Fig. 3). Haplotype identi cation and dosage scoring resulted in 5.4 and 7.1 haplotypes per target region on average for potato and chrysanthemum, respectively. On average, each individual carried 2.4 haplotypes per locus for potato and 2.7 haplotypes per locus for chrysanthemum.

Comparison to array data
Because genomic regions were selected based on regions around previously characterized SNP markers, the SNP dosages of these particular SNPs could be compared between the two genotyping methods. For potato we could compare 346 SNPs. Not all used individuals were previously genotyped with the array resulting in 63 common individuals. For chrysanthemum, we could compare 82 genotypes for 869 SNPs.
Concordance was calculated as the number of equal non-missing dosage calls between the two methods by dividing the total number of non-missing calls. In general, concordance was over 90% (Table 3) for both crops. If the dosage did not match with the array, it differed usually by single dosage (Fig. 4).
The used datasets have high average coverages per locus (1764 for potato and 822 for chrysanthemum), and this might not be cost-effective in future applications. In order to investigate whether lower coverages would result in similar output, we sub-sampled the datasets to have an average coverage per locus of 100, 200 and 400 (Table 3). This did not decrease concordance with the SNP array. For potato, lower read depths actually increased the concordance slightly, as low-frequency spurious haplotypes were not detected anymore.

Discussion
In this study, we used massively parallel amplicon sequencing in order to identify and genotype microhaplotypes. Once their allelic conformation is characterized, micro-haplotypes can be used as multi-allelic genotypic markers. The use of multi-allelic markers instead of bi-allelic markers can signi cantly increase the amount of genetic information in diploids [20,27], and for polyploids in particular [28]. The reason is that polyploid species can carry multiple different alleles at the same locus, which are di cult to capture by bi-allelic markers. Multi-allelic markers can, for example, improve the construction of genetic linkage maps or detection of alleles associated with traits of interest using genome-wide association studies (GWAS). In this study, we show how a simple and straightforward short-read-sequencing-based technology can provide multi-allelic markers based on micro-haplotypes.
We employed amplicon-based sequencing (Agriseq) in two polyploid crops: tetraploid potato (Solanum tuberosum) and hexaploid chrysanthemum (Chrysanthemum x morifolium). The design of the PCR target regions involved a selection from high-quality SNPs that had previously been used to generate genetic maps of both potato [24] and chrysanthemum [1] by SNP arrays. In general, the conversion rates between SNPs used in these arrays and the design considered here are high, despite the fact that the sequences used for marker design are different between the SNP arrays and amplicon sequencing. In a SNP array, a sequence directly anking the SNP is used. For amplicon sequencing, primer annealing sites at both sides of the target amplicon are used. These locate at larger distances from the original focal SNP.
Nevertheless, conversion rates between SNP locus and primer design were over 99%.
Haplotypes were directly identi ed from sequence reads that aligned to the reference. This was straightforward because all used reads entirely spanned the SNPs within a target region. Therefore, there was no need for haplotype assembly of overlapping reads, which is often needed in longer genomic regions [16][17][18]. This strongly reduced the sensitivity to errors and computational time.
After identi cation, we could estimate the allelic composition and dosage within an individual based on the haplotype counts. The individual SNP dosages that where estimated based on the haplotype dosages as estimated from the maximum likelihood approach showed a high concordance with previous results. In both crops, concordance of the dosages estimated from amplicon sequencing with the SNP array data was high: up to 96.8% for potato and 94.2% for chrysanthemum. The used maximum likelihood estimation method enabled us not only to estimate the most likely allele dosage, but also a con dence measure of the estimation in the form of a log likelihood. Filtering based on the log likelihood slightly increased the concordance with genotyping arrays, suggesting that low log likelihood estimates can be used to lter out erroneous dosage calls. In addition, in the cases in which there was a discordance in allelic dosages, in most cases this was a difference of one, which will have lower impact on downstream analyses compared to larger dosage differences. The relatively marginal discordance between the array data and current sequencing data therefore shows that the error rates of haplotype dosage estimations was low and are at least comparable to state-of-art SNP genotyping methods in polyploids.
When comparing the two crops, it appears that likelihood ratios and concordance with the SNP array were lower in chrysanthemum compared to potato. One important cause is the ploidy level. The expected read frequencies per dosage are closer together for hexaploid chrysanthemum compared to tetraploid potato. A small allele bias would therefore more likely result in errors in haplotype dosage estimation in a hexaploid compared to a tetraploid. Therefore, while employing techniques as AgriSeq for genotyping, it should be considered that uncertainty caused by allelic bias or low read depth is higher in higher ploidy levels.
In order to make an application for AgriSeq more cost effective, one could consider to use lower average read depth compared to this study (1764 for potato and 822 for chrysanthemum). For the described dataset, lowering the average read-depth by sub-sampling did not decrease the concordance with the array, and in some cases it even slightly increased the number of high-quality regions because some lowfrequency spurious haplotypes did not occur. Even coverage distributions are key for the success of using a lower number of sequence-reads per sample.
We think that the Agriseq method can be applied for analyses in polyploids that require high-quality marker information, like linkage mapping. It could be particularly interesting for linkage map integration, since this can be a serious challenge in polyploids [13]. A substantial proportion of the developed markers are multi-allelic, so they anchor loci relative to each other between homologous linkage groups. The accuracy of integration of linkage groups will therefore most likely strongly increase. Likewise these panels of reasonable high numbers of amplicons could potentially be used in applications such as genomic prediction, genetic diversity analyses and genotype validation of DNA samples.

Conclusions
Methods for genotyping by sequencing have had and still have a large impact on genetic research. This is because they have enabled the generation of large amounts of data at low costs per data point. A challenge is coping with the balance between costs and data quality. This is certainly the case in polyploids, in which for most applications, dosage scoring is a requisite. In this paper we have described a genotyping by sequencing method that can deliver high-quality data with high information content at relatively low costs. The method is particularly useful for polyploids, because haplotyping is straightforward in small regions with high SNP density. Future applications of the method are broad, ranging from diversity studies to linkage mapping and genomic selection.

Plant Material
For chrysanthemum and potato, we genotyped 91 and 96 samples, respectively. For potato, these included 75 members of the mapping population resulting from the cross between 'Altus' and 'Colomba' [24], three replicates of both parents and 15 cultivars (Additional le 1). For chrysanthemum, these included 74 members of the population resulting from the cross between DB36451 and DB39287 [1,10], two replicates of the maternal parent, four replicates of the paternal parent and twelve cultivars (Additional le 2).

Region selection and primer design
For chrysanthemum, 1152 SNPs were selected that properly mapped according to Van Geest et al.
(2017a). The regions around selected SNPs were converted into regions with an average length of 171 base pairs. Primer design was based on the C. seticuspe genome sequence [26]. In total 940 SNP regions passed quality control. For potato 569 SNPs were selected that were originally part of the SOL-STW 20 k SNP array [6]. Like with chrysanthemum these markers were converted into amplicon regions of 154 bp. In total 412 regions passed quality control. Primer design was based on the potato DM genome sequence V4.03 [25].

Sequencing
Samples were prepared for sequencing using the AgriSeqTM HTS Library Kit (A34143-Life Technologies).
In short, DNA concentrations were normalized to 3.3 ng/µL for a total of 10 ng DNA per 10 µL reaction. Normalized DNA was combined with the Ion AgriSeq primer panel and AgriSeq ampli cation master mix. For ampli cation of genomic targets, the following thermocycling programs were used; 99ºC for 2 min, then 15 cycles of 99ºC for 15 s and 60ºC for 4 min for chrysanthemum and 99ºC for 2 min, then 16 cycles of 99ºC for 15 s and 60ºC for 4 min for potato. Amplicons were then prepared for ligation with preligation enzyme digestion at 50ºC for 10 min, 55ºC for 10 min, and 60ºC for 20 min. IonCodeTM Barcode Adapters 1-384 Kit (A29751-Life Technologies) were ligated to the digested products with barcoding enzyme and buffer. Labeled amplicons were then pooled, cleaned up, ampli ed and normalized.
Following library preparation, libraries were loaded onto an Ion 540TM sequencing Chip Kit (A27765) via the Ion 540TM Kit-Chef (A30011-Life Technologies) and Ion Chef. Sequencing was then performed on the Ion S5 system (Thermo Fisher, Inc. Waltham, MA).

Bioinformatics and data analysis
SNPs were called using freebayes [29] and ltered using vcf lter for SNPs with a quality greater than 100 and at least one heterozygous individual. Sequence alignment les were ltered for reads entirely covering the rst and last SNP in the target regions using bedtools [30]. The variant call format le resulting from the SNP call was used to identify SNP alleles in each read using pysam (https://github.com/pysam-developers/pysam).
In order to lter for spurious haplotypes that resulted from off target ampli cation, we ltered for haplotypes with a high frequency of occurrence in the individuals (> 80%) but low average frequency within individuals (less than 1/8 for potato and less than 1/12 for chrysanthemum). Filtering steps and data visualisation were done using R [31].
Likelihood-based test for genotyping In this study we estimated haplotype dosages by taking into account all possible haplotype conformations of the detected haplotypes within a genotype. aplotype dosages were estimated for each individual in each region using a likelihood-based method in line with well-established techniques (Li 2011; Gerard et al. 2018). The possible dosage of each allele was calculated based on ploidy level (k = 4, k = 6) and the number of haplotypes found in a single individual. If we consider a region with e.g. three haplotypes (A, B, C) we easily inferred the dosages of homozygous classes to be AAAA, BBBB or CCCC.

Exclusion of erroneous haplotypes
Before genotyping each haplotype was tested if it was likely to originate from a sequencing error. This avoids situations where only a minor proportion of the reads leads to the incorrect genotype assignment.
These reads are likely to represent sequencing error or contamination from other samples.
With the assumption of independence between sequencing read observations we can estimate the probability of observing k erroneous reads at n depth, given a sequencing error rate ε using a binomial model.
Where k represents the read counts of the most frequent haplotype in an indivudual, and n represents the total haplotype count/read depth at that position. Hence, we can estimate the probability of observing k reads given the total read depth n.
For instance, with sequence error rate at 2%, two errors at 100 depth occur in 32% of the cases. Likewise 10 errors at read depth of 1000 occur in 98% of the cases. This allows to separate the erroneous haplotypes from real haplotypes at a userspeci ed probability threshold. In this study we used a threshold of 0.9.

Genotype likelihoods
Subsequently, we calculated genotype likelihoods. In case of heterozygosity we have multiple genotype classes: In this example we have three genotype possibilities (G i ) in a tetraploid: AABC, ABBC and ABCC, with corresponding expected frequencies per haplotype ( Table 1). As the individual haplotype counts are known, we can simply estimate the posterior probability of each genotypes by Bayes rule as: From this follows that the likelihood of observing a genotype is equal to: binomially distributed. We consider multi-allelic variants to be to be multinomially distributed. Hence we can compute the conditional probabilities of each by calculating the multinomial probability of each genotype.
For the homozygous classes (in example AAAA we estimate the probability of observing a homozygous classes with given a sequencing error rate of ε: where A represents the count of the major allele, and α represents the counts of the minor allele. This avoids that genotypes in which only a few alternative reads are present are seen as heterozygous. In any case the solution with highest likelihood is selected. As con dence measurement we use the log likelihood ratio between the best and second best solution.

Declarations
Ethics approval and consent to participate Not applicable Consent for publication Not applicable Availability of data and materials The datasets on potato generated and/or analysed during the current study are available in the Sequence Read Archive repository under accession number PRJNA635347, and can be accessed through: https://www.ncbi.nlm.nih.gov/sra/PRJNA635347 The datasets on chrysanthemum used and/or analysed during the current study are available from the corresponding author on reasonable request.
Python scripts written for haplotype calling and dosage estimation can be found at https://github.com/GeertvanGeest/dirHap. Project name: dirHap Project home page: https://github.com/GeertvanGeest/dirHap.   Concordance between SNP dosage estimated using a SNP array and using AgriSeq in potato (A) and chrysanthemum (B). The numbers on the sides depict SNP dosages as estimated by the SNP array and AgriSeq. Non-analyzed dosages (due to too unexpected signal ratios are low read coverage) are depicted with NA.

Supplementary Files
This is a list of supplementary les associated with this preprint. Click to download.