CandiHap: a haplotype analysis toolkit for natural variation study

Haplotype blocks greatly assist association-based mapping of casual candidate genes by significantly reducing genotyping effort. The gene haplotype could be used to evaluate variants of affected traits captured from the gene region. While there is a rising interest in gene haplotypes, much of the corresponding analysis was carried out manually. CandiHap allows rapid and robust haplotype analysis and candidate identification preselection of candidate causal single-nucleotide polymorphisms and InDels from Sanger or next-generation sequencing data. Investigators can use CandiHap to specify a gene or linkage sites based on genome-wide association studies and explore favorable haplotypes of candidate genes for target traits. CandiHap can be run on computers with Windows, Mac, or UNIX platforms in a graphical user interface or command line, and applied to any species, such as plant, animal, and microbial. The CandiHap software, user manual, and example datasets are freely available at BioCode (https://ngdc.cncb.ac.cn/biocode/tools/BT007080) or GitHub (https://github.com/xukaili/CandiHap).


Introduction
With the rapid development of next-generation sequencing (NGS) technologies, genome sequencing is becoming inexpensive, routine, and convenient to obtain large numbers of single-nucleotide polymorphisms (SNPs) (Goodwin et al. 2016). Whole-genome re-sequencing (WGRS), genotyping-by-sequencing (GBS), and restriction siteassociated DNA (RAD) are essential strategies in medical, biological, and agricultural research to elucidate the genetic basis of phenotypic traits, such as disease or economically important features (Visscher et al. 2017Patil et al. 2019;Thudi et al. 2016;Tinker et al. 2016;Miller et al. 2007). These strategies are based on sequencing of whole genomes, or representative genome fractions, across many individuals to determine loci with sequence variations. SNPs can alter the amino acid sequence of a protein directly (e.g., via non-synonymous SNPs, alterations of stopcodons, frameshift SNPs, or SNPs in splice sites) or can change gene expression patterns by affecting gene regulatory regions. Using many genomewide variants, genome-wide association studies (GWAS) generally identify SNPs that are statistically associated with certain traits. Such SNPs provide the basis to understand mechanisms that drive a trait; however, a key challenge is to rapidly and robustly identify causal SNPs (McCarthy and Hirschhorn 2008).
In GWAS, millions of genetic variations are tested across numerous genomes to identify those statistically associated with a specific trait or disease. Interpreting these associations in a biological and genomic context is very difficult ). Due to linkage disequilibrium, previous GWAS have demonstrated that most traits are significantly correlated with both causal and noncausal variants that are physically close (Slatkin 2008). This limits the identification of causative variants without additional research . Based on observed patterns of linkage disequilibrium and association statistics, fine-mapping is an in silico procedure created to prioritize the set of variants within each genetic locus found by GWAS that are most likely to be causal to the target phenotype (Raychaudhuri 2011;Schaid et al. 2018). The fine-mapping approaches are based on stepwise conditional, such as GCTA-COJO  and Bayesian models, including CAVIAR , FIN-EMAP (Benner et al. 2016), PAINTOR (Kichaev et al. 2014), and SuSIE (Wang et al. 2018a). Prioritizing the likely affected gene is perhaps the most crucial part of the functional interpretation of GWAS. GWAS fine-mapped to coding variants, tools such as ANNOVAR (Wang et al. 2010) or SnpEff (Cingolani et al. 2012) can be used to infer their potential effect on genes and to determine the affected gene (Visscher et al. 2017). However, the vast majority of the tools used for these analyses are web-based or command lines implemented and mainly focused on human and rice traits, which severely limit wider applications. In fact, researchers would benefit from identifying candidate causal variants of the most significant SNPs from the species on which GWAS was conducted. The corresponding manual tasks are laborious, time consuming, and prone to errors and omission. To resolve these problems, we aimed to develop a software for fast identification of candidate causal variants or gene(s) from GWAS data.
Haplotype blocks greatly assist associationbased mapping of casual candidate genes by significantly reducing the genotyping effort (Zhang et al. 2002a). Different definitions are used to define the haplotype block structure (Patil et al. 2001;Gabriel et al. 2002;Wang et al. 2002;Zhang et al. 2002b). Here, we refer to the haplotype not as the strong inter-marker LD, but rather the SNPs and indels within a gene region, including upstream, downstream, exonic, and intronic regions. We term this as the gene haplotype, which could be adopted to evaluate the variants of the affected traits captured from the gene region. While there is a rising interest in gene haplotypes, much of the corresponding analyses were carried out manually (Supplementary material). In addition, DnaSP is a software package for comprehensive analysis of DNA polymorphism data and also allow analysis on haplotype phasing (Rozas et al. 2017). HaplotypeMiner is an R package developed for exploring allelic diversity at genes of interest in a plant breeding context. The program minimally takes as input a dataset of SNP markers and the genomic position of a gene of interest, and outputs a set of possible haplotypes defined by the genotypes of a reduced number of neighboring SNPs (Tardivel et al. 2019). The RFGB (Rice Functional and Genomic Breeding) contains a set of core collection of about 3000 rice accessions (3 K rice genome). This dataset provides a base for the large-scale novel allele mining for important traits in rice with various bioinformatics and genetic approaches. It also contains Haplotype module, which was designed to fulfill the increasing demands of mining the associations between sequence variations within certain gene/ region/SNP and target traits ). However, a graphical haplotype analysis tool (no basic programming skill is required) that can handle any species is in urgent need.

Materials, genotyping, and annotation
A set of 398 accessions, including 360 foxtail millets and 38 S. viridis, was used to examine and calibrate our approach. These accessions have recently been analyzed via whole-genome sequencing (WGS) (Li et al. 2022). The clean reads were mapped onto the xiaomi reference genome (Yang et al. 2020). SNPs and InDels were detected using GATK (McKenna et al. 2010) (ver. 3.8.0). The SNPs and InDels were considered valid if they met the following requirements: (1) two alleles only; (2) excluding sites on the basis of the proportion of missing data > 0.9; (3) minor allele frequency ≥ 0.05; (4) mean depth values ≥ 5. SNPs that did not meet these four criteria were excluded from the study. We generated a final set of 4,158,075 filtered single-nucleotide polymorphisms (SNPs) and insertion-deletion polymorphisms (InDels) for further analysis. All identified SNPs that passed quality screening were further annotated with ANNOVAR (Wang et al. 2010) or SnpEff (Cingolani et al. 2012) based on the gene annotation of the reference genome (Wang et al. 2010). In practical application, users can adjust the above parameters. When a VCF file was submitted, ANNOVAR or SnpEff was computed to rapidly categorize the effects of variants in the reference genome sequence. ANNOVAR or SnpEff annotates variants based on their genomic locations (annotated genomic locations can be intronic, exonic, or intergenic) and predicts coding effects (mainly synonymous or non-synonymous amino-acid replacement). The process can be applied to any plant, animal, and bacteria species, by providing the genome file and its GFF (generic feature format) annotation file. The annotated vcf file was converted to Hap-Map using a Perl script vcf2hmp.pl, and this step would normally take several hours (~ 3 h for 4 million SNPs). Finally, a haplotypes.hmp file was generated for further haplotype analysis.
Put vcf2hmp.pl test.gff, test.vcf, and genome.fa files in the same directory, then run:

Platforms and install
CandiHap was written in Perl 5 and R, which supported Windows, Mac, or UNIX platform computers in the graphical user interface (GUI) or command lines. Graphics were created by R. In addition to the GUI, users can also run CandiHap through command lines and install the R software environment (https:// www.r-proje ct. org). The code was compiled for the UNIX platforms and Windows 64-bit environment, and tested with CentOS 7, Windows 10/11 as well as Mac OS 10/13. For a given SNP that was found significantly in GWAS, the run time was about 1 min for a set of 398 samples with 4,158,075 marks. The CandiHap tool is open source, available on multiple platforms, and freely available online (https:// github. com/ xukai li/ Candi Hap or https:// bigd. big. ac. cn/ bioco de/ tools/ BT007 080).
A user-friendly graphical user interface software package of CandiHap, installable on Windows platforms, is implemented using electron development toolkit, which is freely available and not required for registration. For Windows users' convenience, the installation package integrates Perl and R modules needed to run independently, meaning no more software installation is required. However, for the Mac Os or UNIX platform users, installation of the R software environment is required, followed by three packages by command install.packages (c("ggplot2","agricolae", "ggbeeswarm")) in R.

Parameter for CandiHap
Intergenic SNPs are SNPs that are located at least 5 kb up-or downstream of a gene. In general, they are not associated with a gene and not located in a known regulatory region. We set a default parameter in CandiHap, which limits the mapping SNPs to 2000 bp upstream and 500 bp downstream of a gene. The default settings ensure that the result is based on the association signals in gene(s) with statistical significance. Users may also adjust the parameter in "CandiHap.pl." Using Perl and R, the analysis provided and displayed various statistical results for haplotypes such as annotation statistics, types of variation, number of varieties, variety ID and phenotypes, mean, SD (standard deviation) of phenotypes, and significant phenotype differences. A boxplot of the gene showed a significant difference in haplotype-phenotype association analysis. The least significant difference (LSD) test is used to determine whether or not the difference between or among group means is significant.
Put CandiHap.pl and Phenotype.txt, Your.hmp, and genome.gff files in the same directory, then run:

Graphical user interface
In addition to the command-line environment, we also provide user-friendly graphical interface software package of CandiHap on Windows platforms. For the convenience of the Windows platform, the installation package integrates necessary Perl and R modules for running independently by PyInstaller (5.7.0, https:// pyins taller. org/ en/ stable/), meaning no more software installation required. The installation package also contains the test dataset as the default value of the GUI version.
Sanger data for haplotype analysis The "sanger_CandiHap.sh" was written in Shell, Perl 5, and R (with sangerseqR), which only supported the UNIX platforms in command lines. We developed a Perl script "ab1-fastq.pl" for reading ABI Sanger sequencing trace file and simulating the primarySeq and second-arySeq to fastq reads by extracting 90-bp blocks from Sanger sequence and shifting 1 bp in turns. As an example, a 200-bp Sanger sequence would obtain 110 fastq reads within the length of 90 bp. Then, mapping the new fastq reads to reference gene sequence is transferred into

Results
We developed a user-friendly software, CandiHap, that may be operated on a range of computer platforms. In CandiHap, users can identify polymorphisms based on the models of gene haplotypes within vcf file and to report results in a variety of formats, including tables and figures. CandiHap allows researchers to explore favorable haplotypes of candidate genes for target traits, providing a guide to study underlying genetic mechanisms. In addition, some researchers use Sanger sequences to detect the mutations that underlie a number of traits, yet it is challenging to determine heterozygotes from Sanger ab1 files and conduct haplotype analysis. The "Sanger_ CandiHap.sh" in CandiHap allows fast identification of the haplotype from Sanger ab1 files. An overview of the process is presented in Fig. 1. Starting from a VCF file as an entry point, Candi-Hap first annotates the variants using an annotated reference genome to produce a new VCF file. This new VCF file is then used to mine variants and genotyping data, and sent into a series of modules in charge of various processes. Users can then analyze variants ranging from genome to single gene levels. The GWAS results of genomic regions (Fig. 1a) and LD can be defined by entering the limits, and the application would loop and process all genes in the LD regions. The CandiHap implements a threestage analysis (Fig. 1b): the first annotates the VCF file for GWAS by ANNOVAR (table_annovar.pl); the second converts the txt result of ANNOVAR to hapmap format (vcf2hmp.pl); and the third stage requires input data of hapmap file, GFF file of your reference genome, the phenotype data, the LD, and the most significant SNP position of GWAS result. If users need only to run one gene, the vcf, phenotype, gff, and gene ID need to be input. Besides the graphical user interface (GUI) software, users can run CandiHap through command lines on UNIX, Mac, or DOS platforms. The output includes a txt file of haplotypes with detailed information and three pdf files of figures (Fig. 1c-f). The results of haplotypes include references allele, alternative allele, allele frequency, SNP annotation, SNP positions, and haplotypes (Fig. 1d). The information for each haplotype also includes number of varieties, varieties ID and its phenotype, average, SD of phenotype, and significant difference (Fig. 1d). For the graphical user interface (GUI), CandiHap analytical pipeline is divided into three functional modules, vcf2hmp, CandiHap, and GWAS_LD2haplotypes, which correspond to the command line steps. First, the "txt" and "vcf" files of ANNOVAR results with genotype information are required as input for module vcf2hmp to convert the txt result of ANNOVAR to hapmap format. Then, CandiHap module can detect a single specific gene or GWAS_LD2haplotypes module for an LD region.
To further test the universality of CandiHap in haplotype analysis, we analyzed the haplotypes of the ARE1 gene in rice (Wang et al. 2018b), and the same result was obtained except that five more SNPs and two errors were identified in our study (highlighted by blue and red boxes). The discrepancy is due to the fact that there are 276 more rice varieties used in our study, and authors analyze the haplotype of ARE1 gene manually (Fig. 2).
In addition to NGS data, Sanger sequencing technology is also widely used in natural variation analysis. To meet this demand, "sanger_CandiHap.sh" was developed for process of Sanger sequencing data (Fig. 3). Starting from ab1 files as the entry point, the process first simulates ABI Sanger sequencing trace data to fastq reads and then maps the fastq reads to reference gene sequence as for re-sequencing program (Fig. 3a). The output is a txt file of haplotypes with detailed information, including references allele, alternative allele, SNP positions, and haplotypes. The information for each haplotype consists of the number of samples and sample ID (Fig. 3c). As an example, the Page 7 of 11 21 Vol.: (0123456789) PHYC gene has a mutation at 4475 and has homozygous samples (Fig. 3b). Runtime is ∼20 min for a set of 100 samples of Sanger sequences. The results are filtered to retain the homozygous mutation sites, allowing users to find the mutation easily (Fig. 3b,  c). Moreover, the heterozygous mutation was also shown in the results (Fig. 3c). There are 57 wild types and 43 variations, and 40 homozygous samples showed wild-type phenotype (Fig. 3c). The test data files can be freely downloaded (https:// github. com/ xukai li/ Candi Hap/ tree/ master/ Sanger_ ab1 or https:// ngdc. cncb. ac. cn/ bioco de/ tools/ 7080/ relea ses/ v1.0. 1/ file/ downl oad? filen ame= sanger_ Candi Hap-1. 0.1. zip).

Discussion
We refer to the haplotype not as the strong intermarker LD, but rather the SNPs and indels within a gene region, including upstream, downstream, exonic, and intronic regions. We term this as the gene haplotype, which specifically aims to select a collection of SNPs that are useful for defining SNP haplotypes around a target gene and could be adopted to evaluate the variants of the affected traits captured from the gene region. Also, all the individuals that share a given gene are systematically pooled into haplotypes. The CandiHap can be widely used for the investigations of natural variations. It should be noted that CandiHap is not intended to be used to predict true causal SNPs and gene(s) for complex traits. Therefore, CandiHap outputs are candidate causal SNPs and gene(s), which allows users to screen for useful ones. An essential application of the CandiHap results is to allow investigators to test "a priori" hypothesis by using candidate causal SNPs as a practical starting point.
In terms of other software, the DnaSP input data are required in FASTA format that is not suitable for analyzing large-scale population data, such as vcf (Rozas et al. 2017); the HaplotypeMiner is a command-line tool using parameters (Tardivel et al. 2019), and RFGB can only analyze the 3 K rice genome data ). Due to the above limitations, the CandiHap is a graphical haplotype analysis tool (no basic programming skill is required) that can handle any species, which is in urgent need. Several factors can affect the best settings for haplotype definition of a given gene; we have therefore allowed users to select the desired settings for some parameters: (1) the P-value of Wilcox test, (2) the LD threshold, (3) the length of gene upstream and downstream, and (4) the phenotype data. Collapsing SNPs to retain a single tag SNP by using a higher threshold (P-value of Wilcox test = 1) leads to no loss of information from the dataset but has the disadvantage of being sensitive to genotyping errors. On the contrary, a lower threshold, as was used here (P-value of Wilcox test = 0.01), will mask small genotyping errors but may also lead to the loss of haplotypes specific to rare alleles. The ability to capture informative markers in the vicinity of the gene is dependent on the density of genotyping. For this approach to work best, the genotyping should be sufficiently dense to deliver at least one marker at a gene. We successfully identified gene haplotypes offering a very good match with the allelic status at Si9g49990 and LOC_Os08g12780 (ARE1) genes under study. Because of manual analysis on ARE1 gene in rice, these five alleles were misidentified, and two errors were identified in Wang et al. (2018a, b) (Fig. 2).
Breeding aims to produce new lines that carry an array of alleles that jointly produce a superior phenotype. A key part of this work is to assess the allelic diversity present in germplasm collections and to identify individuals carrying favorable alleles at these genes. In this work, we demonstrate how our approach can provide accurate and essential information for breeding by delivering a quick and clear picture of the allelic diversity for a gene within a given germplasm collection. This approach was also found to be reproducible on two distinct genotypic datasets (next-generation sequencing and Sanger sequencing) with similar results. Ultimately, our approach for identification of gene haplotypes from large SNP datasets represents a promising approach to routinely assess allelic variation in a gene region. In the future, CandiHap will be regularly updated and expanded to perform more functions with more user-friendly options.