We developed user-friendly software, CandiHap, that can be run on a range of computer platforms. With CandiHap, users can identify polymorphisms based on gene haplotype models within a VCF file and report the results in a variety of formats, including tables and figures. CandiHap allows researchers to explore favourable haplotypes of candidate genes for target traits, providing a guide for studying the underlying genetic mechanisms. In addition, some researchers use Sanger sequencing to detect the mutations that underlie a number of traits, yet it is challenging to identify heterozygotes from Sanger ab1 files and to conduct haplotype analysis. The ‘Sanger_CandiHap.sh’ script in CandiHap allows fast identification of haplotypes from Sanger ab1 files.
An overview of the process is presented in Fig. 1. Starting from a VCF file as the entry point, CandiHap first annotates the variants against an annotated reference genome to produce a new VCF file. This annotated VCF file is then mined for variants and genotype data, which are passed to a series of modules that handle the individual processing steps. Users can subsequently analyze variants at any level from the whole genome down to a single gene. Genomic regions from GWAS results (Fig. 1a) and the LD range can be defined by entering their limits, and the application then loops over and processes all genes in the LD region. CandiHap implements a three-stage analysis (Fig. 1b): the first stage annotates the GWAS VCF file with ANNOVAR (table_annovar.pl); the second converts the ANNOVAR txt result to hapmap format (vcf2hmp.pl); and the third takes as input the hapmap file, the GFF file of the reference genome, the phenotype data, the LD range, and the position of the most significant SNP in the GWAS result. To analyze a single gene, users need to supply only the VCF file, the phenotype data, the GFF file and the gene ID. Besides the graphical user interface (GUI), users can run CandiHap from the command line on UNIX, Mac or DOS platforms. The output consists of a txt file of haplotypes with detailed information and three pdf files of figures (Fig. 1c-f). The haplotype results include the reference allele, alternative allele, allele frequency, SNP annotation, SNP positions and haplotypes (Fig. 1d). The information for each haplotype also includes the number of varieties, the variety IDs and their phenotypes, the phenotype mean and SD, and the significance of differences (Fig. 1d). In the GUI, the CandiHap analytical pipeline is divided into three functional modules, vcf2hmp, CandiHap and GWAS_LD2haplotypes, which correspond to the command-line steps. First, the ANNOVAR result txt file and the VCF file with genotype information are required as input for the vcf2hmp module, which converts the ANNOVAR txt result to hapmap format.
Then, the CandiHap module can be used to analyze a single specified gene, or the GWAS_LD2haplotypes module to analyze an LD region.
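The per-haplotype summary described above (number of varieties, variety IDs, phenotype mean and SD) can be illustrated with a minimal Python sketch. It is not CandiHap's implementation: the function name `summarize_haplotypes`, the input layout (one allele tuple per sample) and the toy data are assumptions made for illustration only, and the significance test is omitted.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize_haplotypes(genotypes, phenotypes):
    """Group samples by their allele combination across a gene's variant sites.

    genotypes:  {sample_id: tuple of allele calls, one per variant site}
    phenotypes: {sample_id: numeric phenotype value}
    Returns per-haplotype summaries, most frequent haplotype first.
    """
    groups = defaultdict(list)
    for sample, alleles in genotypes.items():
        if sample in phenotypes:  # samples without a phenotype are skipped
            groups[tuple(alleles)].append(sample)

    summaries = []
    for alleles, samples in groups.items():
        values = [phenotypes[s] for s in samples]
        summaries.append({
            "alleles": alleles,
            "n": len(samples),
            "samples": sorted(samples),
            "mean": mean(values),
            # the sample SD is undefined for a single observation
            "sd": stdev(values) if len(values) > 1 else None,
        })

    # Name haplotypes by decreasing frequency (ties broken by allele string).
    summaries.sort(key=lambda s: (-s["n"], s["alleles"]))
    for rank, summary in enumerate(summaries, start=1):
        summary["name"] = f"Hap{rank}"
    return summaries


# Hypothetical toy data: six samples genotyped at three variant sites.
GENOTYPES = {
    "S1": ("A", "G", "T"),
    "S2": ("A", "G", "T"),
    "S3": ("A", "C", "T"),
    "S4": ("A", "G", "T"),
    "S5": ("A", "C", "T"),
    "S6": ("G", "C", "T"),
}
PHENOTYPES = {"S1": 10.0, "S2": 12.0, "S3": 20.0,
              "S4": 11.0, "S5": 22.0, "S6": 15.0}

HAPLOTYPES = summarize_haplotypes(GENOTYPES, PHENOTYPES)
for hap in HAPLOTYPES:
    print(hap["name"], "".join(hap["alleles"]), hap["n"], hap["samples"],
          hap["mean"], hap["sd"])
```

Naming haplotypes by decreasing frequency is a choice made for this sketch; the ordering CandiHap uses is not stated in the text.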
To exemplify CandiHap, we performed a GWAS analysis of foxtail millet (unpublished). Approximately 3679 K SNPs were tested, of which 531 passed the threshold of P-value < 9.42 × 10⁻⁷. The most significant SNP was located on chr9 at position 54583294 with P-value = 1.23 × 10⁻⁸ (Fig. 1a), and CandiHap identified one candidate causal gene (Si9g49990) within 50 kb LD of this SNP. We also identified a signal at position 54605172 (P-value = 1.03 × 10⁻⁷) that leads to a stop gain in Si9g49990 (Fig. 1d). The boxplot for Si9g49990 from the haplotype-phenotype association analysis showed significant phenotypic differences between Hap 1, 2, 6 and Hap 3, 4, 5, 7, 8, 9, providing intuitive supporting evidence (Fig. 1f). Only the SNPs and haplotypes found in ≥ 2 accessions were used to construct the haplotype network for Si9g49990 (Supplementary Fig. 1). The results for the other genes in the LD region are not shown because of space limitations. Users can run the test data to check those results.
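The selection logic in this example (filter SNPs by the P-value threshold, take the most significant SNP, and collect the genes within the LD distance) can be sketched in Python as follows. The SNP positions, P-values, threshold and 50 kb distance are taken from the text; the gene coordinates, the two extra genes, the third SNP and all function names are hypothetical placeholders, not CandiHap's code or the real annotation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gene:
    gene_id: str
    chrom: str
    start: int
    end: int


def significant_snps(snps, threshold):
    """Keep (chrom, pos, p_value) records below the threshold, smallest P first."""
    return sorted((s for s in snps if s[2] < threshold), key=lambda s: s[2])


def genes_in_window(genes, chrom, pos, flank):
    """Return genes on `chrom` that overlap [pos - flank, pos + flank]."""
    lo, hi = max(1, pos - flank), pos + flank
    return [g for g in genes
            if g.chrom == chrom and g.start <= hi and g.end >= lo]


THRESHOLD = 9.42e-7   # significance threshold from the text
LD_FLANK = 50_000     # 50 kb LD distance from the text

SNPS = [
    ("chr9", 54583294, 1.23e-8),  # most significant SNP (from the text)
    ("chr9", 54605172, 1.03e-7),  # stop-gain signal (from the text)
    ("chr9", 54700000, 3.0e-4),   # hypothetical non-significant SNP
]
GENES = [
    # Gene coordinates below are hypothetical, for illustration only.
    Gene("Si9g49990", "chr9", 54_603_000, 54_607_000),
    Gene("hypothetical_gene_A", "chr9", 54_650_000, 54_655_000),
    Gene("hypothetical_gene_B", "chr2", 54_580_000, 54_585_000),
]

HITS = significant_snps(SNPS, THRESHOLD)
LEAD = HITS[0]
CANDIDATES = genes_in_window(GENES, LEAD[0], LEAD[1], LD_FLANK)
print(LEAD, [g.gene_id for g in CANDIDATES])
```

A symmetric window around the lead SNP is an assumption of this sketch; how CandiHap delimits the LD region internally is not described here.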
To further test the general applicability of CandiHap in haplotype analysis, we analyzed the haplotypes of the ARE1 gene in rice (Wang et al., 2018). The same result was obtained, except that our analysis identified five additional SNPs and two errors (highlighted by blue and red boxes). The discrepancy arises because our study used 276 more rice varieties and because the authors analyzed the haplotypes of the ARE1 gene manually (Fig. 2).
In addition to NGS data, Sanger sequencing is also widely used in natural variation analysis. To meet this demand, ‘sanger_CandiHap.sh’ was developed for processing Sanger sequencing data (Fig. 3). Starting from ab1 files as the entry point, the process first converts the ABI Sanger sequencing trace data into simulated fastq reads, and then maps these reads to the reference gene sequence as in a re-sequencing workflow (Fig. 3a). The output is a txt file of haplotypes with detailed information, including the reference allele, alternative allele, SNP positions and haplotypes. The information for each haplotype consists of the number of samples and the sample IDs (Fig. 3c). As an example, the PHYC gene has a mutation at position 4475, for which homozygous samples were found (Fig. 3b). The runtime is ∼20 min for a set of 100 Sanger-sequenced samples. The results are filtered to retain the homozygous mutation sites, allowing users to find the mutation easily (Fig. 3b,c). Moreover, heterozygous mutations are also shown in the results (Fig. 3c). There are 57 wild types and 43 variations, and 40 homozygous samples showed the wild-type phenotype (Fig. 3c). The test data files can be freely downloaded at https://github.com/xukaili/CandiHap/tree/master/Sanger_ab1.
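The filtering step for Sanger results (retain sites that carry a homozygous mutation while still reporting heterozygous calls) might look like the following Python sketch. It assumes, purely for illustration, that each sample's call at a site is a single base and that heterozygous calls are written as two-base IUPAC ambiguity codes; the function names and the bases shown are hypothetical, and only position 4475 comes from the text. This is not the logic of ‘sanger_CandiHap.sh’ itself.

```python
# Two-base IUPAC ambiguity codes, used here to mark heterozygous calls.
IUPAC_HET = {
    "R": {"A", "G"}, "Y": {"C", "T"}, "S": {"C", "G"},
    "W": {"A", "T"}, "K": {"G", "T"}, "M": {"A", "C"},
}
BASES = {"A", "C", "G", "T"}


def classify_call(call, ref):
    """Classify one genotype call relative to the reference base."""
    call, ref = call.upper(), ref.upper()
    if call in IUPAC_HET:
        return "het"
    if call in BASES:
        return "ref" if call == ref else "hom_alt"
    return "missing"  # e.g. "N" or an empty call


def homozygous_variant_sites(ref_alleles, calls_by_sample):
    """Keep sites where at least one sample is homozygous for a non-reference base.

    ref_alleles:     {position: reference base}
    calls_by_sample: {sample_id: {position: call}}
    Returns {position: {sample_id: class}} for the retained sites.
    """
    kept = {}
    for pos, ref in sorted(ref_alleles.items()):
        classes = {
            sample: classify_call(calls.get(pos, "N"), ref)
            for sample, calls in calls_by_sample.items()
        }
        if "hom_alt" in classes.values():
            kept[pos] = classes
    return kept


# Position 4475 is from the text; the bases and position 120 are hypothetical.
REF_ALLELES = {120: "C", 4475: "G"}
CALLS = {
    "sample1": {120: "C", 4475: "G"},  # reference at both sites
    "sample2": {120: "Y", 4475: "A"},  # heterozygous at 120, homozygous variant at 4475
    "sample3": {120: "C", 4475: "R"},  # heterozygous at 4475
}

SITES = homozygous_variant_sites(REF_ALLELES, CALLS)
print(SITES)
```

Site 120 is dropped in this toy input because it carries only a heterozygous call, whereas the heterozygous sample at the retained site is still reported alongside the homozygous one.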