Genotype calling and haplotype inference from low coverage sequence data in heterozygous plant genome using HetMap

This study developed a new genotyping method that can accurately infer heterozygous genotype information from the complex plant genome sequence data, which helped discover new alleles in the association studies. Many software packages and pipelines had been developed to handle the sequence data of the model species. However, Genotyping from complex heterozygous plant genome needs further improvement on the previous methods. Here we present a new pipeline available at https://github.com/Ncgrhg/HetMapv1) for variant calling and missing genotype imputation from low coverage sequence data for heterozygous plant genomes. To check the performance of the HetMap on the real sequence data, HetMap was applied to both the F1 hybrid rice population, which consists of 1495 samples and the wild rice population with 446 samples. The high coverage sequence data of four hybrid rice accessions and two wild rice accessions, which were also included in low coverage sequence data, were used to validate the accuracy of genotype inference. The validation results showed that HetMap archieved significant improvement in heterozygous genotype inference accuracy (13.65% for hybrid rice, 26.05% for wild rice) and total accuracy compared with similar software packages. The application of the new genotype with the genome-wide association study also showed improvement of association power in wild rice awn length phenotype. It could archive high genotype inference accuracy in low sequence coverage in a small population with both the natural and constructed recombination population. HetMap provided a powerful tool for the heterozygous plant genome sequence data analysis, which may help to discover new phenotype regions for the plant species with the complex heterozygous genome.


Introduction
Next-generation sequencing (NGS) technology has been widely applied in the genome sequencing area to explore the mysteries in the genome. The short reads of NGS data can be used for assembly of the reference genome, exploration of population structure (Huang et al. 2012), conduction of GWAS (Huang et al. 2011;Flint et al. 2012), and so on. Besides the important model species like human and mouse, genome research on some other plant species, which are of great economic value, such as maize (Lai et al. 2010;Tian et al. 2011) and rice, is also important. One primary step of NGS data analysis is to do genotyping from the short reads data to generate the genome variation map. There are already some excellent pipelines like SNPtools (Wang et al. 2013), GotCloud (Jun et al. 2015), and OutcrossSeq (Chen et al. 2021) that provide the whole workflow from raw sequence data to imputed genotype matrix. However, most popular software and pipelines are devote to tackling the model species genome sequence data, such as human and mouse, with a high recombination rate and medium repeat sequences. While some heterozygous plant genomes have rich repeat sequences and low recombination rates (For example, wild rice and hybrid maize). There is a great difference in the heterozygous rate of genome variants between the plants and animals (Jaramillo-Correa JP et al. 2010). For example, the heterozygous genotype rate in plants can be reduced dramatically due to the existence of self-pollination. Many software packages designed for model species cannot directly be used to tackle the plant genome sequence data. Here we have developed a pipeline HetMap to solve this problem. Normal cultivated rice often has a low heterozygous rate. A high heterozygous genotype rate in rice can be found in the F 1 hybrid rice population, which is caused by the direct cross of two homozygous parent lines with different alleles in the genome. It can also be found in a natural heterozygous population, which is a combination of genetic variants from different wild rice accessions. The sequence data of hybrid rice (Huang et al. 2015) and wild rice (natural population, Huang et al. 2012) are used to test our software package. We archive great improvement in genotype inference accuracy compared with other popular software packages.

Plant material, DNA isolation and RNA isolation
The detailed information for the four high coverage sequence hybrid rice accessions is described in the previous study (Huang et al. 2012). The DNA extraction, library preparation, and genome sequence with Illumina sequencing machine are also described in the previous publication.  Li 2011), andGATK (DePristo et al. 2011). They are processed as described in the user manual to generate input file for the downstream analysis (Yao et al. 2020, Sandmann et al. 2017. In this study, HetMap first parses the BAM file using Samtools API. We search for the bases in the sequencing reads that are different from the reference genome. The reads with mapping quality lower than 60 and reads mapping ratio smaller than 0.82 are discarded. HetMap only outputs the polymorphic sites that at least appear twice in the population. All the bases that cover the polymorphic sites detected in the previous step and their related sequencing quality score are extracted from the BAM file. HetMap summarizes all the extracted bases for each polymorphic site to generate the missing rate and minor allele frequency. Only the polymorphic sites with minor allele frequency higher than 0.05, missing rate lower than 0.4, and physical distance with neighboring SNPs bigger than 10 bp are kept. Using the chosen polymorphic sites, we calculate genotype likelihood probability using the reads depth and sequence quality score information. Beagle is used to infer the genotype from the raw genotype matrix files generated by all six variant callers.

SNP validation
HetMap uses the high coverage sequence data of four hybrid rice accession and two wild rice accessions for the SNP inference accuracy validation. They are also included in the low coverage sequence data. We compare the genotype called in the low coverage sequence data with the genotype called in the high coverage sequence data to check the inference accuracy.

The penalty of reads count based on Phred score information
The Phred quality score (q) for the base in the fastq format was calculated using the log-transformed error probability (p) by the following equation.
we can calculate the error probability of the base using the equation p = 10 q/(− 10) . For example, if we get three reads supporting the reference allele (Phred score: a1, a2, and a3) for a specific position. If the Phred score for a base is higher than 19, there is no penalty, and the base count is set as 1. If the Phred score of a base is smaller than 20, we integrate the penalty by reducing the base count based on the error probability (refined count = 1*(1− 10 q/(− 10) )). If the Phred score for all the three bases (a1, a2, and a3) were higher than 19, the base count is 3 for our example datasets. If all the three bases were below 20, the base q = −10 × log 10(p) 1 3 count is calculated as follows. refined count = 1*(1− 10 a1/ (− 10) ) + 1*(1− 10 a2/(− 10) ) + 1*(1− 10 a3/(− 10) ).

Genotype likelihood calculation
For all the three possible kinds of genotype likelihood, we set them as the homozygous reference allele, homozygous alternative allele, and heterozygous allele. The base coverage information for the different alleles is integrated with an improved Binomial model. As there is a high sequencing error for the second-generation sequence data, we give a penalty score based on the base Phred score information in the integration of reads coverage for the genotype likelihood calculation. After we calculate the genotype likelihood for all three kinds of genotypes, we normalize them to get the final probability.
ProbAA, ProbAB, and ProbAB represent the final normalized value for the three kinds of genotype likelihood.

Cultivation of the wild rice and the phenotyping of the traits
In this study, we have planted and phenotyped the 446 wild rice accession in Lin shui country Hai Nan province China from 2012 to 2014. Only few cultivated rice accessions have a short awn length, and most cultivated rice accessions have no awn. However, Most wild rice accessions have a long awn length. The awn is stripped from the seed before we measure the awn length and the seed length. The awn is pulled straight and measured with a ruler. We measure the awn length for five seeds, and their average value is used as the final awn length for each wild rice accession.

Data simulation
Reads data of different sequence coverage, which ranges from the 0.25 fold coverage to 30 fold coverage, are simulated by randomly sampling from the high coverage sequence data. The sequence reads are aligned against the reference genome to generate the SAM file. Samtools is used to convert the SAM format to the sorted BAM file. HetMap and SNPtools are used to generate the genotype likelihood files using these sorted BAM files. The genome variants are called for each sample using the default parameters for the rest of four different genotype callers (GATK, freebayes, Samtools, and deepvariant). Self-customized perl scripts are used to merge the sample genotype information to generate the raw genotype matrix for the population with missing data. All the genotype likelihood files and raw genotype matrix are submitted to beagle imputation. The genotypes called in the simulated data are compared against those called in the high coverage sequence data to get the accuracy.

General workflow of the HetMap pipeline
HetMap pipeline is implemented in C++ language. It takes standard BAM files as input. The output file is in VCF format. The workflow from raw sequence reads to the final imputed genotype is outlined in Fig. 1. Many genotype imputation software packages can be used to handle the output file of HetMap. Here Beagle is chosen as an example.
The whole pipeline can be divided into five sections. The HetMap pipeline starts with the sorted BAM file. Generally, multiple sequence alignment software packages can be used to align the short reads to the reference genome to generate the input file. Firstly, the "Call_snp" program is used to detect the genetic variants from the sequence data that are different from the reference genome. Secondly, the "Sum_snp" program can summarize all the genetic variants of the whole population to generate the candidate polymorphic sites by using some filtering parameters. Thirdly, the "Chose_pile" program can extract all the bases that are aligned to the candidate polymorphic sites. Fourthly, the "filter_snp" program extracts the whole base information that's aligned to the candidate polymorphic sites. Fifthly, the "Call_prob" program can calculate the three kinds of genotype likelihood in the entire population for the whole polymorphic sites. The output file in VCF format (version 4.0) can be used as an input file for many popular genotype imputation software packages.

Detection of polymorphic sites
HetMap pipeline uses the application programming interface (API) provided by the Samtools to parse the compressed BAM file. Before detecting genetic variants from the next-generation sequence data, HetMap first filters the sequence reads that has low mapping score (default, < 60) and small reads matching ratio (default, < 0.82). All the variant sites in the sequence data, which are different from the reference genome, can be genotyped for each accession. As some variant sites in the sequence reads may be caused by sequencing error or mismatch of sequence reads, only non-reference alleles that appear at least twice in the whole population are kept to generate the primary polymorphic sites. As all the genetic variants are considered in the determination of primary polymorphic sites, we need all the four base information to get the final polymorphic sites. All the sequenced bases, which are aligned to the primary polymorphic sites, are pulled down from the sequence alignment files. The base information and Phred score information are all included in the output file.
A hash function is often used to get selected polymorphic sites from the sequence reads in many software packages. However, it needs a large amount of calculation to get all the bases aligned to the candidate polymorphic sites by scanning whole sequence data. As the sample size used for the population genetics analysis is often very huge, counting the frequency of all the four bases for the whole genome becomes a challenging problem. Sort and compare is another method to pull down chosen polymorphic sites from the massive sequence data. By comparing the sorted reads mapping position with the polymorphic sites, the bases aligned to the polymorphic sites can also be extracted. It only needs a simple comparison instead of the complex hash function for the input BAM file. The simple sort and compare method is impractical for overlapping the mapping position with the sequence reads. In the HetMap pipeline, an improved sort Fig. 1 General workflow for the HetMap pipeline. Short sequence reads are aligned to the rice reference genome to generate the sorted BAM alignment files. The "Call_snp" program is used to identify the non-reference allele in the reads alignment to generate the candidate polymorphic site. We take 6 candidate polymorphic sites (A-E) as examples. The underline green sequence is an example reference sequence. The marked characters in the line are non-reference alleles. HetMap filters the candidate polymorphic sites to keep those non-singleton SNPs. The E polymorphic site is filtered for being singleton. The "Choose_pile" program is used to pull down all the bases that are aligned to the candidate polymorphic sites. HetMap calculates the minor allele frequency and missing rate for each polymorphic site to filter the polymorphic sites again with the "Filter_snp" program. The "?" stands for missing for the polymorphic locus with no sequence coverage. The C and D polymorphic sites are filtered for this reason. HetMap uses the "Call_prob" program to calculate the genotype likelihood for all the three possible genotypes for all the polymorphic sites in the population. Beagle is then used to infer the missing genotype 1 3 and compare method is used. We use a dual sorting method to compare the mapping position of the reads with the polymorphic site. The genome position of the polymorphic sites can increase or decrease dynamically with the change of reads mapping range. After we get all the four base information for each candidate polymorphic site, we can filter the polymorphic sites with multiple criteria to get the final polymorphic sites.

Genotype likelihood calculation using improved binomial model
There are three kinds of genotype for each polymorphic site that consists of two alleles. They are homozygous reference allele (RR), homozygous alternative allele (AA), and heterozygous allele (RA). HetMap doesn't consider the tri-allele polymorphic sites here, as they only occupy a small part of the genome. However, HetMap outputs all four base information for the polymorphic site. Tri-allele sites can still be detected from the output file with some easy customization. HetMap calculates genotype likelihood for all possible three kinds of genotypes (Eqs. 1-6, methods). As there is high sequencing error for the short reads sequence data, some bases with low Phred quality score (< 20) are often discarded in the genotype likelihood calculation process. However, the number of reads covering a specific polymorphic site is small in the low coverage sequence data. Hence HetMap integrates the penalty method based on their Phred score information to take the bases that have low Phred scores into consideration. If both an alternative allele and a reference allele are detected in the polymorphic site. The probability for this allele to be a homozygous reference allele is a combination of the probability of this reference allele plus the error probability of the alternative allele. HetMap outputs the genotype likelihood file in VCF format. Multiple software packages can be used for downstream analysis.

Application of HetMap to the F 1 hybrid rice population
The published sequence data for the F 1 population (Huang et al. 2015) are used to test the application of HetMap to the F 1 population. The data contain 1495 pair-end sequenced accessions with the reads length being 96 bp and the average sequence coverage being 2.2 fold. The short sequence reads are aligned to the rice reference genome to generate standard BAM file (Li et al. 2009). After detecting the polymorphic sites with HetMap, it results in 21,016,942 nonsingleton SNPs. Of them 17,670,585 SNPs have a minor allele frequency bigger than 0.05, with the average missing rate being 0.29 across the whole population (Supplementary Fig. 1). For all the polymorphic sites, only those SNPs with the missing rate smaller than 0.35 and the minor allele frequency bigger than 0.05 are kept for the downstream analysis. After filtering all these positions, we got a total of 1,432,221 SNPs and of them, 125,768 are coding SNPs ( Supplementary Fig. 2, Supplementary Table 1). The genotype likelihood for the polymorphic sites is submitted to Beagle (version 4.2) to infer the missing genotype (Browning et al. 2007).
To validate the accuracy of genotype inference, four high coverage sequence hybrid rice accessions are used as the gold standard, which is sequenced with an average of 47.5 fold coverage. The four accessions are also included in the low coverage sequence data. Genotypes called by the HetMap pipeline from the low coverage sequence data are compared with those inferred from high coverage sequence data to check genotype inference accuracy. The average total accuracy is found to be 98.20%. The heterozygous genotype accuracy is 98.31%, and the homozygous genotype accuracy is 98.15% (Table 1).
To compare with other software packages in the genotype inference accuracy, another 5 different genotype callers are selected (SNPtools, GATK, deepvariant, and Samtools mpileup) to analyze the whole F 1 samples again. We run the software as they are suggested in its reference manual with the default parameters. Except for SNPtools, which can directly generate genotype likelihood file for beagle imputation, the left four software packages call genotype for each sample separately. Self-customized perl scripts are used to merge the sample data to generate the raw genotype matrix with missing data. All the matrix datasets are also submitted to beagle for genotype imputation with the same parameter. SNPtools was chosen as an example to make a  Fig. 1, Supplementary  Fig. 3). Of the two pipelines, 1,268,169 SNPs are detected by both methods (Fig. 2). A similar overlapping rate was also found for each of the six genotype callers (Supplementary Fig. 4). The genotype inference accuracy is also validated using the high coverage sequence data. The average total accuracy is 95.43% (98.66% for the homolog genotype and 86.50% for the heterozygous genotype, Table 1, Supplementary Table 2). Except for the SNPtools, the accuracy for the rest four genotype callers is much lower than the HetMap method. While the homozygous genotype inference accuracy of our pipeline is similar to that of SNPtools, HetMap gets significant improvement in the heterozygous genotype accuracy (chi-square test, P < 0.01) and the total accuracy (chi-square test, P < 0.01). About an average 21.75% of polymorphic sites in the genome of hybrid rice are heterozygous genotypes. Hence, a high accuracy rate for the heterozygous genotype is essential for the downstream analysis.

Application of HetMap to natural wild rice population
As hybrid rice is the F 1 population, we also test our pipeline on the natural heterozygous population. HetMap is applied to the 446 published wild rice genome sequence data (Huang et al. 2012), which is sequenced with average two fold coverage. Different from the normal cultivated rice there is existence of outcross in wild rice. Former studies estimated that 1.56% of the wild rice genome was heterozygous, which was much smaller than that of the hybrid rice (Huang et al. 2015). The sequence data of wild rice are processed in the same way as that of hybrid rice. After filtering the SNPs with default parameters, it results in 2,360,987 SNPs. Of them, 126,930 are coding SNPs ( Supplementary Fig. 5, Supplementary Table 1). The average SNP density across the genome is 6.17/kb. The minor allele frequency of 86% SNPs is bigger than 0.05 ( Supplementary Fig. 6).
To validate the accuracy of the HetMap in the wild rice population, two high sequence coverage wild rice accessions are used as the golden standard. The total genotype inference accuracy for wild rice is average 93.6% (95.17% for the homozygous genotype, 74.47% for the heterozygous genotype).
The other five different genotype callers are also applied to the 446 wild rice samples. The SNPtools is chosen as an example to be studied. It results in a total of 30,179,926 SNPs. The minor allele frequency for 22.30% of the polymorphic sites is bigger than 0.05 (Supplementary Figs. 6 and 7). The total accuracy for the wild rice sample is 88.80% (90.78% for the homozygous genotype, 58.87% for the heterozygous genotype). It's found that the HetMap pipeline significantly surpasses the SNPtools accuracy in the wild rice population in the genotype inference including both the heterozygous genotypes and homozygous genotypes. Compared with HetMap, the rest of the four genotype callers generally showed a much lower genotype calling accuracy (Supplementary Table 2).

Assessment of sequencing reads coverage on the genotype inference accuracy through computational simulation
To check the effects of sequence coverage on the genotype inference accuracy in the wild rice population and hybrid rice population, we simulate sequence reads of different fold coverage to submit them to all the six different genotype callers. The simulated sequence data ranges from 0.25 fold coverage to 30 fold coverage. (Supplementary Figs. 8 and  9). The simulation result shows that HetMap can archive an average accuracy of 85% with only one fold coverage (Fig. 3) in the hybrid rice sample. The accuracy rate doesn't increase much after fourfold coverage. However, for the wild rice HetMap needs about 1.3 fold coverage to get an average accuracy of 85%. The slow increase of accuracy only occurs after sixfold sequence coverage. Compared with the other five genotype callers, HetMap can archive better accuracy with low coverage sequence data when the sequence coverage is below 16 (Fig. 3, Supplementary Fig. 9). However, when sequence coverage is bigger than 16 there is no significant difference in the genotype inference accuracy.
Besides the difference in genotype inference accuracy within the different software packages, the hybrid rice Fig. 2 The venn diagram for the four polymorphic datasets. We use two methods to separately characterize the polymorphic loci in the two populations to generate four SNP datasets. The two SNP datasets for hybrid rice are detected with HetMap and SNPtools separately. The same two SNP datasets for wild rice are detected with HetMap and SNPtools separately. The R package "venn-diagram" is used to generate this diagram 1 3 population shows consistently higher genotype inference accuracy than the wild rice population. Only when the sequence coverage for the sequenced accession is high (> 18 fold) no significant difference exists between the wild rice population and hybrid rice population. There is a big difference in the genotype inference accuracy between the wild rice population and hybrid rice population for all six variant callers. The difference in genotype inference accuracy between the two populations is smaller when the HetMap pipeline is applied to the two populations than that of the other software packages.

Improvement of genome-wide association power with the addition of heterozygous information
In this study, we have characterized heterozygous genotype in the wild rice population with a new method. A previous study, which generated the homozygous genotype variation for the wild rice, had performed genome-wide association with one phenotype. In this study, we decide to conduct association with the phenotypes using both the homozygous genotype variation map generated in the previous research and the heterozygous genotype used in this study to check their difference. To compare their difference in performance, we check the association result for the traits that detect known phenotype-related genes around the associated peaks. The genome-wide association results for the heterozygous genotype surpass the results generated using the homozygous genotype in nearly both cases (Fig. 4, Supplementary  Fig. 10). The association result using the heterozygous genotype has better association accuracy and signal detection rate than the homozygous genotype. Genotype matrix generated by the HetMap package is found to archive the highest association signal near the two known awn related genes than the rest of five genotype callers.
Besides the improvement of GWAS association power for the heterozygous genotype, it can also detect the association signal that is missed by using the homozygous genotype. We take the awn phenotype GWAS association result as an example to show this improvement. We can detect five significant association signals for the awn phenotype by using the heterozygous genotype. While only one significant signal was detected in the awn phenotype association using the homozygous genotype. Previous studies have cloned two awn length related genes using one near inbreed line constructed using W1943 (wild rice) and GLA4 (cultivated rice, Luo et al. 2013;Gu et al. 2015). The two known awn related genes are located near the peak association signal of awn phenotype association result using the heterozygous genotype (Supplementary Table 3). The peak signal of awn phenotype in chromosome 10 also overlaps with one awn phenotype QTL locus mapped in the W1943 and GLA4 near inbreed line. However, no QTL mapping locus of awn phenotype in the W1943 and GLA4 population overlaps with the peak signal in chromosome 4 detected using the homozygous genotype. Besides, more association signals in the awn phenotype result are archived with the heterozygous genotype than that of the homozygous genotype, the heterozygous genotype also explains more phenotype variation in the highest association signals (Supplementary Table 3). For awn phenotype association result, the highest association signal with the heterozygous genotype can explain 40.04 percent phenotype variation. In comparison, the highest association signal with the homozygous genotype can only Fig. 3 Simulation of different sequence coverage in different populations to check the genotype inference accuracy. The horizontal line stands for the simulated reads coverage range from 0.25 fold coverage to 30 fold coverage. The vertical line represents the genotype inference accuracy for simulated data compared with the high coverage sequence. The simulated reads are submitted to both the HetMap and SNPtools pipeline separately 1 3 explain 27.77 percent phenotype variation. The heterozygous rate for the highest association signal in awn phenotype is as high as 30.71%. So, the heterozygous genotype plays a vital role in the association of awn phenotype. Compared with the other five variant callers, the HetMap developed in this study showed the highest association signal. For the awn length phenotype, the association studies using the heterozygous genotype can detect more association signals and archive a better explanation of the phenotype variation than that of the homozygous genotype for the wild rice awn length phenotype.

Discussion
In this study, we show that the HetMap pipeline can be used to tackle heterozygous plant genome in both the F 1 population (hybrid rice) and natural heterozygous population (wild rice). Construction of the F 1 population is important for gene mapping and dissection of hybrid vigor, which exists in the species like rice and maize (Duvick et al. 2001, Zhang 2007, Huang et al. 2015. For example, many natural plants like Miscanthus Sinensis have a high heterozygous genotype rate (Ma et al. 2012). It is not suitable to submit them to popular genotype processers (DePristo et al. 2011, Li 2011). This study shows that a fine variation map can be obtained by HetMap pipeline for the heterozygous plant genome sequence data using shallow coverage (Supplementary Figs. 9,10 and 11). Compared with other software packages, the HetMap pipeline shows a higher genotyping accuracy rate in both the F 1 population of hybrid rice and the natural wild rice population (Fig. 3, Supplementary Fig. 9). Our software also offers more options for users to customize the input data and filter the variation data by their own needs.
Besides the difference in genotyping accuracy between the HetMap and other five genotype callers, the genotyping accuracy between the natural wild rice and F 1 hybrid rice populations is also different (Fig. 3). They may be caused by several reasons. Firstly, the population size of hybrid rice is larger than that of the wild rice population. There is more reads coverage for a specific polymorphic site in a larger population. The missing genotype has more references in the missing genotype inference process. Secondly, there is significant population structure and high population differentiation in the wild rice population. The genetic distance within the hybrid rice population is small Fig. 4 Comparison of the GWAS results for the awn phenotype using different kinds of genotype. Figure a and Figure b are the manhattan plots for the awn phenotype using the heterozygous genotype and homozygous genotype, respectively. The horizontal line represents the 12 chromosomes of the wild rice genome. The vertical line represents the transformed associated value for SNP with the phenotype. Figure c and Figure d are Quantile-quantile plots for the awn pheno-type using the heterozygous genotype and the homozygous genotype, respectively. The horizontal line represents the transformed expected P values. The vertical line represents the transformed observed P value. There are two known genes, An-1 and An-2, located around the two peak SNPs on chromosome 4 using the heterozygous genotype. However, no related associated peaks appear in the association result using the homozygous genotype and all the accessions generally cluster together in the evolution phylogenetic tree analysis (Huang et al. 2015). The missing genotype will have more reference haplotypes to impute from with smaller genetic distance. Thirdly, the heterozygous genotype of the F 1 population is a raw combination of the different alleles between the two parents. It's evenly distributed across the genome. However, heterozygous genotype in wild rice is more complex. They can come from genetic drift, genetic introgression, or contamination from other wild rice accessions (Phan et al. 2012;Hufford et al. 2013). Fourthly, as we use the cultivated rice as the reference genome, there exists more difference between the wild rice genome and reference genome than that of the hybrid rice. There is more reads coverage for the hybrid rice than the wild rice when mapping the short reads to the reference genome.
Except for the genotype calling accuracy, the running time for the genotype caller to finish a sample is also important. As the HetMap package developed in this study provides the whole workflow from the input BAM files to the raw genotype matrix, the time to generate the raw genotype matrix from 50 randomly chosen samples is evaluated ( Supplementary Fig. 12, Supplementary Table 4). The calling time nearly shows exponential growth with the increase of sequence coverage. Though the calling time of Pileup and HetMap are both fast compared with other software packages, less time is needed for HetMap to finish our test dataset.
Many studies had proposed possible solutions for calling the genotype from the low coverage short sequence data recently (Hickey et al. 2019;Chen et al. 2021). However, they are often limited to model species like human or confined to special recombination populations. Our study provides a powerful tool for calling the genotype from both the nature heterozygous population and the recombination populations. Association studies with wild rice awn phenotypes had shown that the improvement of the genotype calling accuracy could help facilitate the discovery of the associated phenotype-related regions. In total, our study provides a powerful tool to detect the accurate genotype from the low coverage heterozygous genotype, which can help facilitate the discovery of the associated regions in the complex heterozygous genome species.