CandiHap: a haplotype analysis toolkit for natural variation study

doi:10.21203/rs.3.rs-1741665/v1

Research Article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 15 Mar, 2023

Read the published version in Molecular Breeding →

You are reading this latest preprint version

Haplotype blocks greatly assist association-based mapping of causal candidate genes by significantly reducing the genotyping effort. A gene haplotype, defined here as the set of variants within a gene region, can be used to evaluate how those variants affect a trait. Although interest in gene haplotypes is rising, much of the corresponding analysis is still carried out manually. CandiHap allows fast and robust haplotype analysis and preselection of candidate causal SNPs and InDels from Sanger or next-generation sequencing data. Investigators can use CandiHap to specify a gene or linkage sites based on GWAS results and explore favourable haplotypes of candidate genes for target traits. CandiHap runs on Windows, Mac, and UNIX platforms through a graphical user interface or the command line, and can be applied to any plant, animal, or microbial species. The CandiHap software, user manual, and example datasets are freely available on GitHub: https://github.com/xukaili/CandiHap.

Keywords: CandiHap, haplotype, GWAS, SNPs, InDels

With the rapid development of next-generation sequencing (NGS) technologies, genome sequencing has become inexpensive and routine, making it convenient to obtain large numbers of single nucleotide polymorphisms (SNPs) (Goodwin et al., 2016). Whole-genome re-sequencing (WGRS), genotyping-by-sequencing (GBS), and restriction site-associated DNA (RAD) sequencing are essential strategies in medical, biological and agricultural research to elucidate the genetic basis of phenotypic traits, such as disease or economically important features (Miller et al., 2007; Patil et al., 2019; Thudi et al., 2016; Tinker et al., 2016; Visscher et al., 2012; Visscher et al., 2017). These strategies are based on sequencing whole genomes, or representative genome fractions, across many individuals to determine loci with sequence variations. SNPs can alter the amino acid sequence of a protein directly (e.g., via non-synonymous SNPs, alterations of stop codons, frameshift SNPs or SNPs in splice sites), or can change gene expression patterns by affecting gene regulatory regions. Using many genome-wide variants, genome-wide association studies (GWAS) generally identify SNPs that are statistically associated with certain traits. Such SNPs provide the basis for understanding the mechanisms that drive a trait; however, a key challenge is to rapidly and robustly identify causal SNPs (McCarthy and Hirschhorn, 2008).

Most GWAS infer candidate SNPs (SNPs with a P value below a certain threshold) by linkage disequilibrium (LD) analysis and the functional annotations of the corresponding genes (Li et al., 2012). However, the vast majority of the tools used for these analyses are web-based or command-line only and mainly focus on human and rice traits, which severely limits wider application. In fact, researchers would benefit from identifying candidate causal variants around the most significant SNPs in whichever species the GWAS was performed on. The corresponding manual tasks are laborious, time-consuming, and prone to errors and omissions. To resolve these problems, we aimed to develop software for fast identification of candidate causal variants or gene(s) from GWAS data.

Haplotype blocks greatly assist association-based mapping of causal candidate genes by significantly reducing the genotyping effort (Zhang et al., 2002a). Different definitions are used to define the haplotype block structure (Gabriel et al., 2002; Patil et al., 2001; Wang et al., 2002; Zhang et al., 2002b). Here, we use haplotype to mean not a block of strong inter-marker LD, but rather the SNPs and indels within a gene region, including the upstream, downstream, exonic and intronic regions. We term this the gene haplotype, which can be used to evaluate how the variants captured from the gene region affect a trait. Although interest in gene haplotypes is rising, much of the corresponding analysis is still carried out manually.

CandiHap was written in Perl 5 and R and runs on Windows, Mac, and UNIX platforms, either through the graphical user interface (GUI) or from the command line. Graphics are created in R. To run CandiHap from the command line on UNIX or Mac, users need to install the R software environment (https://www.r-project.org) and three packages with the command install.packages(c("ggplot2", "agricolae", "pegas")) in R. The code was compiled for UNIX platforms and the Windows 64-bit environment, and was tested on CentOS 7, Windows 10, and Mac OS 10. For a given SNP found to be significant in a GWAS, the runtime was about 1 min for a set of 400 samples with ~3 million SNPs. CandiHap is open source, available on multiple platforms, and freely available at https://github.com/xukaili/CandiHap.

Intergenic SNPs are SNPs located at least 5 kb upstream or downstream of a gene. In general, they are not associated with a gene and are not located in a known regulatory region. CandiHap uses a strict default parameter that restricts the SNPs assigned to a gene to the region from 2000 bp upstream to 500 bp downstream of that gene. This default ensures that the result is based on statistically significant association signals within the gene(s). Users may adjust the parameter in ‘CandiHap.pl’.
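The default window described above can be illustrated with a minimal Python sketch. It is not taken from ‘CandiHap.pl’: the `Gene` class, the function names, and the choice to measure "upstream" relative to the direction of transcription (so that the flanks swap for minus-strand genes) are assumptions made for illustration. Only the 2000 bp and 500 bp defaults come from the text.

```python
from dataclasses import dataclass

UPSTREAM_BP = 2000   # default upstream flank stated in the text
DOWNSTREAM_BP = 500  # default downstream flank stated in the text


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record with 1-based, inclusive coordinates."""
    gene_id: str
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def gene_window(gene, upstream=UPSTREAM_BP, downstream=DOWNSTREAM_BP):
    """Return the (start, end) interval searched for variants of a gene.

    Assumption: upstream is relative to transcription, so the flanks
    swap sides for genes on the minus strand.
    """
    if gene.strand == "-":
        lo, hi = gene.start - downstream, gene.end + upstream
    else:
        lo, hi = gene.start - upstream, gene.end + downstream
    return max(1, lo), hi  # clamp at the start of the chromosome


def snp_in_gene_window(chrom, pos, gene, **flanks):
    """True if the SNP lies on the gene's chromosome and inside its window."""
    lo, hi = gene_window(gene, **flanks)
    return chrom == gene.chrom and lo <= pos <= hi


if __name__ == "__main__":
    gene = Gene("geneA", "chr1", 10000, 12000, "+")  # made-up coordinates
    print(gene_window(gene))
    print(snp_in_gene_window("chr1", 8500, gene))
```

Widening the flanks, for example `snp_in_gene_window("chr1", 7000, gene, upstream=5000)`, mimics the effect of relaxing the default parameter.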

A user-friendly GUI package of CandiHap, installable on both Windows and Mac platforms, is implemented with the Electron development toolkit; it is freely available and requires no registration. For the convenience of Windows users, the installation package bundles the necessary Perl and R modules so that it runs independently, with no further software installation required. Mac OS and UNIX users, however, need to install the R software environment (https://www.r-project.org), followed by three packages with the command install.packages(c("ggplot2", "agricolae", "pegas")) in R.

The ‘sanger_CandiHap.sh’ script was written in Shell, Perl 5 and R (with sangerseqR) and supports only UNIX platforms from the command line. We developed a Perl script, ‘ab1-fastq.pl’, that reads ABI Sanger sequencing trace files and converts the primarySeq and secondarySeq into simulated fastq reads by extracting 90 bp blocks from the Sanger sequence, shifting by 1 bp each time. As an example, a 200 bp Sanger sequence yields about 110 fastq reads of 90 bp each. Mapping these simulated reads to the reference gene sequence turns the task into a standard SNP-calling process for next-generation sequencing data. The Burrows-Wheeler aligner (Li and Durbin, 2010) (BWA mem, ver. 0.7.17) was used to map the fastq reads of all samples onto the gene reference sequence with default parameters. Mapped reads were converted into BAM files using SAMtools (ver. 1.7). Variants, including SNPs and indels, were detected using GATK (McKenna et al., 2010) (ver. 3.8.0), and hard filtering was applied to the raw variant set using GATK. The results are filtered to retain the homozygous mutation sites. The script is open source and freely available at https://github.com/xukaili/CandiHap/tree/master/Sanger_ab1_Linux.
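The read simulation described above can be sketched in Python as a plain sliding window. This is an illustration rather than a port of ‘ab1-fastq.pl’: the function names, the read-naming scheme and the constant quality character are assumptions, and the sketch starts from a base-called sequence string instead of parsing an ab1 trace. A plain window of length 90 with a 1 bp shift produces L − 90 + 1 reads from a sequence of length L (111 for 200 bp), so the count reported in the text may reflect how the original script handles the sequence ends.

```python
def sliding_reads(sequence, read_len=90, step=1):
    """Yield (offset, read) for every full-length window over the sequence."""
    for offset in range(0, len(sequence) - read_len + 1, step):
        yield offset, sequence[offset:offset + read_len]


def to_fastq(name, sequence, read_len=90, step=1, quality_char="I"):
    """Format the simulated reads as fastq text.

    Assumption: every base gets the same placeholder quality ("I" is
    Phred 40 in the Phred+33 encoding); a real converter could derive
    qualities from the trace instead.
    """
    records = []
    for offset, read in sliding_reads(sequence, read_len, step):
        records.append(
            f"@{name}_{offset}\n{read}\n+\n{quality_char * len(read)}\n"
        )
    return "".join(records)


if __name__ == "__main__":
    toy_sequence = "ACGT" * 50  # 200 bp toy sequence, not real data
    reads = list(sliding_reads(toy_sequence))
    print(len(reads), "reads of", len(reads[0][1]), "bp")
```

Because every base away from the ends is covered by up to 90 overlapping reads, a variant caller sees the Sanger base call at a depth comparable to short-read data, which is what allows the standard mapping and calling tools to be reused.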

The variants in the VCF file were further filtered using VCFtools (Danecek et al., 2011) (ver. 0.1.15). SNPs and indels were considered valid for the study if they met the following requirements: (1) two alleles only; (2) a missing-data threshold of 0.9 on the VCFtools scale (which runs from 0 to 1, where 0 allows sites that are completely missing and 1 allows no missing data); (3) minor allele frequency ≥ 0.05; and (4) mean depth ≥ 5. Variants not meeting these four criteria were excluded from the study. In practical applications, users can adjust these parameters. All identified SNPs that passed quality screening were further annotated with ANNOVAR (ver. 2015 Dec 14) based on the gene annotation of the reference genome (Wang et al., 2010). When a VCF file was submitted, ANNOVAR was run to rapidly categorize the effects of variants in the reference genome sequence (this step took more than 2 hours for 3 million SNPs). ANNOVAR annotates variants based on their genomic locations (which can be intronic, exonic or intergenic) and predicts coding effects (mainly synonymous or non-synonymous amino-acid replacements). The process can be applied to any plant, animal or bacterial species by providing the genome file and its GFF (generic feature format) annotation file. The annotated VCF file was converted to HapMap format using the Perl script vcf2hmp.pl; this step normally takes several hours (~3 h for 3 million SNPs). Finally, a haplotypes.hmp file was generated for further haplotype analysis.
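To make the four site-level criteria concrete, here is a minimal Python sketch that applies them to one variant site. It does not call VCFtools and is not part of CandiHap; the function name, the in-memory representation of genotypes and depths, and the reading of criterion (2) as a minimum call rate of 0.9 are assumptions for illustration.

```python
def passes_site_filters(alt_alleles, genotypes, depths,
                        min_call_rate=0.9, min_maf=0.05, min_mean_depth=5.0):
    """Apply the four criteria from the text to a single variant site.

    alt_alleles: list of alternative alleles at the site
    genotypes:   one tuple of allele indices per sample (0 = reference,
                 1 = first alternative), or None for a missing call
    depths:      per-sample read depth, in the same order as genotypes
    """
    # (1) two alleles only: the reference plus exactly one alternative
    if len(alt_alleles) != 1:
        return False

    # (2) missing data: assumed here to mean a call rate of at least 0.9
    called = [g for g in genotypes if g is not None]
    if not called or len(called) / len(genotypes) < min_call_rate:
        return False

    # (3) minor allele frequency, computed over called alleles only
    alleles = [allele for g in called for allele in g]
    alt_freq = sum(1 for allele in alleles if allele == 1) / len(alleles)
    if min(alt_freq, 1.0 - alt_freq) < min_maf:
        return False

    # (4) mean depth across all samples
    return sum(depths) / len(depths) >= min_mean_depth


if __name__ == "__main__":
    # Toy site: 8 homozygous reference, 1 homozygous alternative, 1 missing
    toy_genotypes = [(0, 0)] * 8 + [(1, 1)] + [None]
    print(passes_site_filters(["T"], toy_genotypes, [6] * 10))
```

Keeping the thresholds as keyword arguments mirrors the statement that users can adjust these parameters in practice.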

Using Perl and R, the analysis computes and displays various statistics for the haplotypes, such as annotation statistics, types of variation, the number of varieties, the variety IDs and their phenotypes, the mean and standard deviation (SD) of the phenotypes, and the significance of phenotypic differences. A boxplot for the gene shows the result of the haplotype-phenotype association analysis. The least significant difference (LSD) test is used to determine whether the differences between or among the group means are significant.
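The per-haplotype statistics listed above can be sketched in Python as a grouping step followed by a summary. This is an illustration only, not CandiHap's Perl/R code: the function name, the input dictionaries, and the rule of numbering haplotypes by decreasing sample count are assumptions, and the LSD test itself is not implemented here.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarise_haplotypes(genotypes, phenotypes):
    """Group samples by haplotype and summarise their phenotypes.

    genotypes:  {sample_id: tuple of alleles at the gene's variant sites}
    phenotypes: {sample_id: phenotype value}
    Samples without a phenotype are skipped.
    """
    groups = defaultdict(list)
    for sample, alleles in genotypes.items():
        if sample in phenotypes:
            groups[alleles].append(sample)

    # Assumption: haplotypes are numbered from most to least frequent.
    ranked = sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))

    summary = []
    for index, (alleles, samples) in enumerate(ranked, start=1):
        values = [phenotypes[s] for s in samples]
        summary.append({
            "haplotype": f"Hap{index}",
            "alleles": alleles,
            "n": len(samples),
            "samples": sorted(samples),
            "mean": mean(values),
            # The sample SD needs at least two values.
            "sd": stdev(values) if len(values) > 1 else 0.0,
        })
    return summary


if __name__ == "__main__":
    # Toy data, not taken from the paper
    toy_genotypes = {
        "s1": ("A", "G"), "s2": ("A", "G"), "s3": ("A", "G"),
        "s4": ("T", "G"), "s5": ("T", "G"),
    }
    toy_phenotypes = {"s1": 10.0, "s2": 12.0, "s3": 14.0,
                      "s4": 20.0, "s5": 22.0}
    for row in summarise_haplotypes(toy_genotypes, toy_phenotypes):
        print(row["haplotype"], row["n"], row["mean"], round(row["sd"], 3))
```

The group means and SDs produced this way are the inputs that a multiple-comparison procedure such as the LSD test would then compare.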

We developed user-friendly software, CandiHap, that can be operated on a range of computer platforms. With CandiHap, users can identify polymorphisms based on gene haplotype models from a VCF file and report the results in a variety of formats, including tables and figures. CandiHap allows researchers to explore favourable haplotypes of candidate genes for target traits, providing a guide for studying the underlying genetic mechanisms. In addition, some researchers use Sanger sequencing to detect the mutations that underlie a number of traits, yet it is challenging to determine heterozygotes from Sanger ab1 files and to conduct haplotype analysis. The ‘Sanger_CandiHap.sh’ script in CandiHap allows fast identification of haplotypes from Sanger ab1 files.

An overview of the process is presented in Fig. 1. Starting from a VCF file as the entry point, CandiHap first annotates the variants using an annotated reference genome to produce a new VCF file. This new VCF file is then used to mine variants and genotyping data, and is passed to a series of modules that handle the individual processing steps. Users can subsequently analyze variants at levels ranging from the whole genome to a single gene. The genomic region of a GWAS signal (Fig. 1a) and its LD extent can be defined by entering the limits, and the application then loops over and processes all genes in the LD region. CandiHap implements a three-stage analysis (Fig. 1b): the first stage annotates the GWAS VCF file with ANNOVAR (table_annovar.pl); the second converts the ANNOVAR txt result to HapMap format (vcf2hmp.pl); and the third takes as input the HapMap file, the GFF file of the reference genome, the phenotype data, the LD distance, and the position of the most significant SNP in the GWAS result. To analyze a single gene, users only need to provide the VCF, phenotype and GFF files and the gene ID. Besides the GUI, users can run CandiHap from the command line on UNIX, Mac or DOS platforms. The output includes a txt file of haplotypes with detailed information and three PDF files of figures (Fig. 1c-f). The haplotype results include the reference allele, alternative allele, allele frequency, SNP annotation, SNP positions and haplotypes (Fig. 1d). The information for each haplotype also includes the number of varieties, the variety IDs and their phenotypes, the mean and SD of the phenotype, and the significance of differences (Fig. 1d). In the GUI, the CandiHap analytical pipeline is divided into three functional modules, vcf2hmp, CandiHap and GWAS_LD2haplotypes, which correspond to the command-line steps. First, the vcf2hmp module takes the ANNOVAR result txt file and the VCF file with genotype information as input and converts the ANNOVAR result to HapMap format. Then, the CandiHap module analyzes a single specified gene, or the GWAS_LD2haplotypes module analyzes all genes in an LD region.
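The step of looping over all genes in an LD region around the most significant SNP can be sketched in Python as follows. This is a hedged illustration, not the GWAS_LD2haplotypes implementation: the function names are hypothetical, the parser reads only `gene` features and the `ID` attribute of a GFF3-style file, and the LD region is assumed to be a symmetric window of fixed size around the lead SNP.

```python
def parse_gff_genes(gff_text):
    """Collect gene features from GFF3-style text (tab-separated, 9 columns)."""
    genes = []
    for line in gff_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) < 9 or cols[2] != "gene":
            continue  # keep only gene-level features
        attributes = dict(
            field.split("=", 1) for field in cols[8].split(";") if "=" in field
        )
        genes.append({
            "id": attributes.get("ID", ""),
            "chrom": cols[0],
            "start": int(cols[3]),
            "end": int(cols[4]),
            "strand": cols[6],
        })
    return genes


def genes_in_ld_region(genes, chrom, lead_pos, ld_bp):
    """Return genes overlapping the window lead_pos ± ld_bp on one chromosome."""
    lo, hi = max(1, lead_pos - ld_bp), lead_pos + ld_bp
    return [
        gene for gene in genes
        if gene["chrom"] == chrom and gene["start"] <= hi and gene["end"] >= lo
    ]


if __name__ == "__main__":
    # Made-up annotation; gene IDs and coordinates are placeholders.
    toy_gff = "\n".join([
        "##gff-version 3",
        "chr1\ttoy\tgene\t100000\t103000\t.\t+\t.\tID=geneA",
        "chr1\ttoy\tgene\t160000\t162000\t.\t-\t.\tID=geneB",
        "chr1\ttoy\tgene\t400000\t405000\t.\t+\t.\tID=geneC",
    ])
    hits = genes_in_ld_region(parse_gff_genes(toy_gff), "chr1", 120000, 50000)
    print([gene["id"] for gene in hits])
```

Each gene returned by such a selection would then be passed through the single-gene haplotype analysis in turn.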

To exemplify CandiHap, we performed a GWAS analysis of foxtail millet (unpublished). Approximately 3679 K SNPs were tested, of which 531 passed the threshold of P-value < 9.42 × 10^−7. The most significant SNP was located on chr9 at position 54583294 with P-value = 1.23 × 10^−8 (Fig. 1a), and CandiHap identified one candidate causal gene (Si9g49990) within 50 kb LD of this SNP. In this gene, we identified a signal at position 54605172 (P-value = 1.03 × 10^−7) that leads to a stop gain in Si9g49990 (Fig. 1d). The boxplot of Si9g49990 from the haplotype-phenotype association analysis showed significant phenotypic differences between Hap 1, 2, 6 and Hap 3, 4, 5, 7, 8, 9, providing intuitive supporting evidence (Fig. 1f). Only the SNPs and haplotypes found in ≥ 2 accessions were used to construct the haplotype network for Si9g49990 (Supplementary Fig. 1). The results for other genes in the LD region are not shown because of space limitations. Users can run the test data to check those results.

To further test the generality of CandiHap in haplotype analysis, we analyzed the haplotypes of the ARE1 gene in rice (Wang et al., 2018). The same result was obtained, except that five more SNPs and two errors were identified in our study (highlighted by blue and red boxes; Fig. 2). The discrepancy arises because 276 more rice varieties were used in our study and because the original authors analyzed the haplotypes of the ARE1 gene manually.

In addition to NGS data, Sanger sequencing is also widely used in natural variation analysis. To meet this demand, ‘sanger_CandiHap.sh’ was developed to process Sanger sequencing data (Fig. 3). Starting from ab1 files as the entry point, the process first converts the ABI Sanger sequencing trace data into simulated fastq reads, and then maps the fastq reads to the reference gene sequence as in a re-sequencing workflow (Fig. 3a). The output is a txt file of haplotypes with detailed information, including the reference allele, alternative allele, SNP positions and haplotypes. The information for each haplotype consists of the number of samples and the sample IDs (Fig. 3c). As an example, the PHYC gene carries a mutation at position 4475, for which some samples are homozygous (Fig. 3b). The runtime is ∼20 min for a set of 100 Sanger-sequenced samples. The results are filtered to retain the homozygous mutation sites, allowing users to find the mutation easily (Fig. 3b,c). Moreover, the heterozygous mutation is also shown in the results (Fig. 3c). There were 57 wild types and 43 variants, and 40 homozygous samples showed the wild-type phenotype (Fig. 3c). The test data files can be freely downloaded at https://github.com/xukaili/CandiHap/tree/master/Sanger_ab1.

CandiHap can be widely used to investigate natural variation. It should be noted that CandiHap is not intended to predict the true causal SNPs and gene(s) for complex traits. Its outputs are therefore candidate causal SNPs and gene(s), from which users can screen for the useful ones. An essential application of the CandiHap results is to allow investigators to test ‘a priori’ hypotheses by using candidate causal SNPs as a practical starting point.

In the future, CandiHap will be regularly updated and extended with more functions and more user-friendly options.

Availability of data and materials

The raw sequencing data reported in this paper have been deposited in the NCBI under BioProject accession no. PRJNA633413. These data are also available in the BIG Data Center under the accession number CRA002636. SNP data for the ARE1 coding region of 3023 rice varieties were downloaded from RFGB (http://www.rmbreeding.cn). The CandiHap code is available at https://github.com/xukaili/CandiHap.

Acknowledgements

The authors thank the CandiHap users for helpful comments and discussions. We also thank Dr. Staffan Persson (University of Melbourne) and Dr. Yiwei Jiang (Purdue University) for their critical reading of the manuscript.

Funding

This work has been supported by the National Natural Science Foundation of China [32001608]; the National Key R&D Program of China [2019YFD1000700 and 2019YFD1000704]; the Major Special Science and Technology Projects in Shanxi Province [202101140601027].

Authors' contributions

Xukai Li supervised the study. Xukai Li, Xingchun Wang and Kai Guo conceived of the study idea. Jianhua Gao carried out the sampling process and participated in the material preparation. Xukai Li and Kai Guo performed most of the experiments. Xukai Li, Zhiyong Shi and Kai Guo analyzed the data. Xukai Li drafted the manuscript. All of the authors discussed the results and commented on the manuscript.

Corresponding authors

Correspondence to Xukai Li ([email protected]) or Kai Guo ([email protected]).

Ethics declarations

Competing interests

The authors declare no competing financial interests, and no conflict of interests.

Publisher's note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Danecek P, Auton A, Abecasis G, Albers CA, Banks E, DePristo MA, Handsaker RE, Lunter G, Marth GT, Sherry ST et al (2011) The variant call format and VCFtools. Bioinformatics 27:2156–2158. 10.1093/bioinformatics/btr330
Gabriel SB, Schaffner SF, Nguyen H, Moore JM, Roy J, Blumenstiel B, Higgins J, DeFelice M, Lochner A, Faggart M et al (2002) The structure of haplotype blocks in the human genome. Science 296:2225–2229. 10.1126/science.1069424
Goodwin S, McPherson JD, McCombie WR (2016) Coming of age: ten years of next-generation sequencing technologies. Nat Rev Genet 17:333–351. 10.1038/nrg.2016.49
Li H, Durbin R (2010) Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics 26:589–595. 10.1093/bioinformatics/btp698
Li M-X, Yeung JMY, Cherny SS, Sham PC (2012) Evaluating the effective numbers of independent tests and significant p-value thresholds in commercial genotyping arrays and public imputation reference datasets. Hum Genet 131:747–756. 10.1007/s00439-011-1118-2
McCarthy MI, Hirschhorn JN (2008) Genome-wide association studies: potential next steps on a genetic journey. Hum Mol Genet 17:R156–R165. 10.1093/hmg/ddn289
McKenna A, Hanna M, Banks E, Sivachenko A, Cibulskis K, Kernytsky A, Garimella K, Altshuler D, Gabriel S, Daly M et al (2010) The genome analysis toolkit: a mapreduce framework for analyzing next-generation DNA sequencing data. Genome Res 20:1297–1303. 10.1101/gr.107524.110
Miller MR, Dunham JP, Amores A, Cresko WA, Johnson EA (2007) Rapid and cost-effective polymorphism identification and genotyping using restriction site associated DNA (RAD) markers. Genome Res 17:240–248. 10.1101/gr.5681207
Patil GB, Lakhssassi N, Wan J, Song L, Zhou Z, Klepadlo M, Vuong TD, Stec AO, Kahil SS, Colantonio V et al (2019) Whole-genome re-sequencing reveals the impact of the interaction of copy number variants of the rhg1 and Rhg4 genes on broad-based resistance to soybean cyst nematode. Plant Biotechnol J 17:1595–1611. 10.1111/pbi.13086
Patil N, Berno AJ, Hinds DA, Barrett WA, Doshi JM, Hacker CR, Kautzer CR, Lee DH, Marjoribanks C, McDonough DP et al (2001) Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science 294:1719–1723. 10.1126/science.1065573
Thudi M, Khan AW, Kumar V, Gaur PM, Katta K, Garg V, Roorkiwal M, Samineni S, Varshney RK (2016) Whole genome re-sequencing reveals genome-wide variations among parental lines of 16 mapping populations in chickpea (Cicer arietinum L.). BMC Plant Biol 16(Suppl 1):10–10. 10.1186/s12870-015-0690-3
Tinker NA, Bekele WA, Hattori J (2016) Haplotag: software for haplotype-based genotyping-by-sequencing analysis. G3 6:857–863. 10.1534/g3.115.024596
Visscher PM, Brown MA, McCarthy MI, Yang J (2012) Five years of GWAS discovery. Am J Hum Genet 90:7–24. 10.1016/j.ajhg.2011.11.029
Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, Yang J (2017) 10 years of GWAS discovery: biology, function, and translation. Am J Hum Genet 101:5–22. 10.1016/j.ajhg.2017.06.005
Wang K, Li M, Hakonarson H (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data. Nucleic Acids Res 38:e164–e164. 10.1093/nar/gkq603
Wang N, Akey JM, Zhang K, Chakraborty R, Jin L (2002) Distribution of recombination crossovers and the origin of haplotype blocks: the interplay of population history, recombination, and mutation. Am J Hum Genet 71:1227–1234. 10.1086/344398
Wang Q, Nian J, Xie X, Yu H, Zhang J, Bai J, Dong G, Hu J, Bai B, Chen L et al (2018) Genetic variations in ARE1 mediate grain yield by modulating nitrogen utilization in rice. Nat Commun 9:735. 10.1038/s41467-017-02781-w
Zhang K, Calabrese P, Nordborg M, Sun F (2002a) Haplotype block structure and its applications to association studies: power and study designs. Am J Hum Genet 71:1386–1394. 10.1086/344780
Zhang K, Deng M, Chen T, Waterman MS, Sun F (2002b) A dynamic programming algorithm for haplotype block partitioning. Proc Natl Acad Sci USA 99:7335–7339. 10.1073/pnas.102186799

CandiHapSupplementaryMaterial.pdf

Reviewers agreed at journal: 17 Jun, 2022
Reviewers invited by journal: 14 Jun, 2022
Editor assigned by journal: 09 Jun, 2022
First submitted to journal: 09 Jun, 2022
