Whole-genome sequencing of a Brazilian naturalized horse breed resistant to arid climate for identifying single nucleotide variants and insertions/deletions

Background: In this study, we perform a search for variants (SNVs and InDels) in the genome of a Brazilian Naturalized horse breed, using FreeBayes and GATK variant calling tools. This breed presents exclusive adaptive traits of extreme importance to semi-arid conditions, such as those that allow survival under excessive sunlight, rainfall, low forage availability and stony ground. Moreover, these traits are expressed without any detriment to the performance and perpetuation of the breed. Results: A total of 305,588,364 reads were mapped in the horse reference genome, 1,598,210 single nucleotide variations and 138,139 insertions/deletions were detected by FreeBayes, 88,838 (SNVs) and 25,232 (InDels) by GATK. Both have been used in order to increase the safety of variant calls, identify in which regions of the genome they are present and check for variants in genes possibly associated with the peculiar traits exhibited by the breed. Conclusions: The variants annotation identified numerous non-synonymous SNVs and frameshift InDels, which could affect phenotypic variation. We found 28 and 392 Emsembl gene IDs containing high and moderate impact SNVs, including GTPase family members, olfactory receptors, mitochondrial complex and defense genes. Functional enrichment analysis was performed and revealed that variants in the olfactory transduction pathway were overrepresented.


Background
The Nordestino horse breed is a Brazilian naturalized breed, developed from the introduction of Iberian and Barb-Arabic breeds in the Northeast of Brazil, after Portuguese colonization. It presents adaptive traits of extreme importance to semi-arid conditions. The animals representing the breed exhibit greater resistance to disease and phenotypic rusticity in the racial pattern, such as rigid hooves, medium and arcuate body conformation, strong musculature and bones, rectilinear cephalic profile and dilated nostrils (Mariante et al. 2011;Melo et al. 2013) which allow survival under excessive sunlight, rainfall, low forage availability and stony ground, without any detriment to the performance and perpetuation of the breed.
It is unlikely that other non-specialized horse breeds would develop activities in this environment, which requires strength and endurance. Understanding the molecular basis associated with these traits becomes essential to advance breeding programs and breed conservation, since it is in danger of disappearing and, therefore, requires urgent actions for its conservation (Pires et al. 2014). To date, there are no studies about this breed on genetic structural variations, including single nucleotide variations (SNVs) and insertions/deletions (InDels), or the association to the resistance exhibited by the breed to arid conditions, or even functional studies related to the genetic peculiarity exhibited by the breed.
Phenotypic variations associated with mutations co-occur with domestication as an effect of the impact between selective breeding, controlled by human actions, and the performance of natural selection (Andersson 2012). As a result, most equine breeds present populations with high phenotypic and genotypic uniformity within the breed, but there is a lot of variation between breeds.
The equine genome project (Equus caballus) has made publicly available a full and high quality genome data of a Thoroughbred female, representing a breakthrough in genomics and veterinary medicine (Wade et al. 2009). As consequence, several studies have performed complete genome sequencing of several equine breeds, including domestic breeds, in order to understand the genetic mechanisms associated with pattern and racial establishment, from pursuit of structural variations, including the first analysis of re-sequencing of a complete genome, identifying significant variations in the Quarter Horse breed (Doan et al. 2012) and in Chinese horses (Lichuan and Kazakh breeds) (Zhang et al. 2018).
In early 2018, a new and improved equine genome assembly (EquCab3.0) was made available (Kalbfleisch et al. 2018), boosting future searches for variants. Furthermore, the recent availability of complete genome sequences of horse breeds allowed the development of a next generation, highdensity equine SNP array (670 k), comprising genomic information from individual representatives of 24 different equine breeds. The study cataloged 23 million new genetic variants (Schaefer et al. 2017). High-density SNP array enables the enhancement of population-based approaches to identify selection signals and diversity indexes.
There are several studies of SNPs associated with traits of interest in domestic animals, from high density Chips, such as equines (McCue et al. 2012), sheep (Kijas et al. 2012), cattle (Zhan et al. 2011; When intending to start the study of complex and peculiar traits of a particular breed, for which there is no genomic information, starting from whole genome sequencing, the identification of all variant genes in the genome is a first crucial step for the discovery of causal variants, possibly associated with these traits (Das et al. 2015). There is a great interest in genetic variants in new equine breeds, especially SNVs, for the creation of a SNPs database and integration of quantitative and linkage maps, as performed for the Thoroughbred breed, in order to contribute to breeding strategies (Lee et al. 2014).
Although the whole genome re-sequencing has become an accessible and easy-to-perform technique for variant search, most of these studies focus on equine breeds aimed at sports practices. Thus, it is necessary to search for variants in naturalized breeds, in order to elucidate genomic and conservation, since these may contain rich genetic information associated with peculiar adaptive traits. This information can be inserted into equine breeding programs through the development of genomic technologies.
Here, we present the first complete genomic sequence and characterization of the genetic variations of a Brazilian naturalized breed specimen, a male of the Nordestino horse breed, including SNVs and InDels, with genetic annotation analysis. That annotation allows the identification, location and association of variations related to the complex resistance traits that are peculiar to the breed, as well as subsequent studies on origin, genomic characterization and population studies, especially about the segregation of variants in the remaining population.

Materials And Methods Ethics statement
The blood sample was collected from a male horse, in a private property, with written consent from the owner, without experimental planning on the property or experimental interventions that cause damage or non-momentary pain and suffering to the animal. Therefore, no specific ethical approval is needed (Brazil law number 11794, from October 8th, 2008, Chapter 1, Art. 3, paragraph III).

Sample collection
The DNA sample was extracted from a blood aliquot of a male specimen typical of the Nordestino horse breed, from Pernambuco state, belonging to the Caatinga biome (semi-arid climate). The specimen presents all the phenotypic traits of the breed, such as mean weight, height (138 cm ± 8), hull type, characteristic stiffness and head size (ABCCN, 1987). The sample was kindly provided by our partner research group from the Instituto Federal de Educação, ciência e Tecnologia (IFPE) and Universidade Estadual do Sudoeste da Bahia (UESB).
Genomic DNA extraction, library preparation and genome sequencing Genomic DNA samples were extracted in replicates, using the DNeasy Blood & Tissue Kit (QIAGEN Pty. Ltd., Venlo, Netherlands), according to the protocol provided by the manufacturer. The Filtering and mapping processes Before the mapping process of the sequenced reads, raw reads were filtered, using FastQC software, version 0.11.7 (cutoff read length for high quality, 70%; cutoff quality score, 20) (Andrews, 2010). For the reads mapping, the horse reference genome (Ensembl EquCab3.0) was used. Clean sequencing reads were mapped to the reference assembly, using the Burrows-Wheeler Aligner tool (BWA, version 0.7.10-r789) with default parameters (Li & Durbin, 2009). PCR duplicates were detected and removed, using the Picard tools (version 1.54) (http://broadinstitute.github.io/picard/). Then, a re-alignment of the reads, using one of the Genome Analysis tools, Toolkit (GATK, version 3.8), was carried out to improve the mapping quality (McKenna et al. 2010). Downstream processing was carried out using typical GATK pipeline, according to parameters applied by Cornish and Guda (2015), for the GATK base quality score recalibration (BQSR) step.
The SNVs and InDels were functionally annotated with the SnpEff software (Cingolani et al. 2012

Functional enrichment
For functional enrichment, we selected all Emsembl IDs containing SNVs of high and moderate impact, present in the variant call analysis, according to both GATK and FreeBayes. The Gene Ontology (GO) terms were obtained using the Databank for Annotation, Visualization and Integrated Discovery (DAVID) (Huang et al. 2009). This databank was used to evaluate enrichment in the GO terms, using known annotations of horse genes, with Equus caballus selected as background. For GO term analysis, we considering a 10% FDR (False Discovery Rate) threshold for significance.

Results And Discussion
Genomic variants in Nordestino horse breed A total of 28 Gb of paired-end sequence data were produced from whole-genome sequence data of a male of the Nordestino horse breed, with 11.2-fold genome coverage, considering the sum of the sequencing runs performed. A total of 305,588,364 reads were mapped to the horse reference genome (EquCab3.0 from the Ensembl database), with a mapping rate of 96.05%.
At an effective genome size of 2,462,676,227 bases, a total of 1,741,210 variants were identified, using FreeBayes; therefore, a variation rate of 1 variant per 1,414 bases, relative to the reference genome. Among these, 1,598,210 were classified as SNVs; 57,580 as insertions and 80,529 as deletions. Particularly in InDels analysis, a total of 4,964 were classified as structural variants, being 54 insertions, 3 deletions and 4,907 mixed.
When we applied the Genome Analysis Toolkit (GATK), we identified 88,848 variants, classified as SNVs (1 variant  Using the recommended quality metrics for each software, it was found that the total number of variants detected by FreeBayes was higher than GATK, which is expected, since GATK exhibits higher sensitivity while maintaining a lower number of false positive SNVs (Cornish & Guda, 2015 region (values next to 0.4%) (Fig 1). The actual number of effects is greater than the number of variants because those found between genes positioned in close proximity can have their effects categorized as both downstream and upstream.
The percentage of SNVs effects per genotypic region of the Nordestino horse presented here was very similar to the average percentage found in native Chinese breeds (Lichuan and Kazakh breeds, small and rugged horses), as demonstrated by Zhang et al. (2018) by SNVs calling conducted with GATK and functional annotation, based on SnpEff software, from whole-genome sequencing data. From the total of single nucleotide variations, defined by the authors as single nucleotide polymorphisms (SNPs), the most intense effects were transcript, intron and intergenic (29.07%, 28.12% and 27.02%, respectively), and the smaller effects were exactly the same as found here, with very similar percentages.
Using the SnpEff program [24], we also classified the effects of variants (SNVs and InDels) by impact as modifiers of mostly high, moderate and low impact of variants called by GATK and FreeBayes. We showed the additional effect of SNVs and InDels on FreeBayes data and the exclusive effect of SNVs on GATK data (Fig 2). The effects of SNVs and InDels in all categories were much higher in FreeBayes analysis due to the combined effect of these variations. GATK analysis revealed greater sensitivity for exclusion of "false-positive" (PF) variants, as previously mentioned.

Selection of genes containing high and moderate effects SNVs
Based on the effect of the variants and their annotation by SnpEff, we have identified all genes or Emsembl IDs in which high, low, moderate and modifier impact effects (SNVs and InDels) occur, based in variants calling results by both GATK, and FreeBayes. In order to prioritize single nucleotide variants, which can be characterized as SNPs in subsequent population studies, we have screened the genes that have at least one such variation that has high impact (disruptive impact in the protein, causing protein truncation, loss of function or triggering nonsense mediated decay) and then checked which genes are present in the analysis by both variants calling tools. We found 28 Emsembl IDs from multigenic families containing at least one high impact SNVs, in accordance with both GATK and FreeBayes (Table 2). Among these, a pseudogene and GTPase Family members 7, 4, 2 and 1. Ontology enrichment analysis for these genes revealed eight GO biological processes and nineteen GO molecular functions, acting in nine metabolic pathways (Table 3).   The GTPase gene family was the only one that presented high impact SNVs in agreement with the variant calling between both softwares. Members of this family are key regulators of most cell processes, including proliferation, differentiation, vesicle and organelle dynamics, transport and regulation of the cytoskeleton (Bos et al. 2007). These are evolutionarily conserved proteins, associated with the activation of extracellular signals to various cellular responses, due to the ability to undergo conformational changes in response to the alternate binding of GDP and GTP. The GDPbound "off" or "on" states recognize distinct effector proteins, thereby allowing these proteins to function as binary molecular switches (Bos et al. 2005). Considering the functional importance of this gene family in basal cellular processes, the presence of SNVs with possible disruptive impact in the protein, causing protein truncation or loss of function, should be carefully evaluated. A study conducted by Zhang et al. (2018) also identified high impact SNVs in the metabolic pathway associated with GTPases in enriched biological processes from GO analysis in horses of 14 breeds.
The highest count of genes with this type of SNVs impact effect (55) are G protein-coupled receptor signaling pathway (GO:0007186), in which we also observed high gene representation, but with SNVs of moderate effect (7 out of 33 selected for functional analysis). G protein-coupled receptors activate signal transduction pathways and, coupling with G proteins, they pass through the cell membrane seven times, being called seven-transmembrane receptors (Trzaskowski et al. 2012).
When we analyzed moderate impact SNVs, the variety of gene families in which they occur is wide, drawing attention to the olfactory receptors family, which had high representation. These genes, involved in the Olfactory Signaling pathway, act in the perception of odor through olfact, interact with odorant molecules in the nasal epithelium, to initiate a neuronal response that triggers the perception of a smell (Antunes et al. 2014). The biochemical signaling events related to this (super pathway) act in food recognition and consequently food preference (Ma 2007), identification of sexual partners (Kang et al. 2015), mother-infant bonding (Doucet et al. 2009) and several other aspects of animal survival. Among them, we can also highlight the variable susceptibility to intranasal infections, as the study by Kupke et al. (2016), which analyzed the proteins expression of this pathway in the equine nasal epithelium, in association with this susceptibility. The SNVs present in genes associated with this pathway, once validated in more individuals of the Nordestino horse breed, may be associated with resistance, including respiratory diseases and the phenotypic profile of rusticity exhibited by Nordestino horse, a profile characterized by Melo et al. (2013). The high representation of this genetic family in our moderate impact SNVs calling can be explained by the fact that Mammalian Olfactory Receptor (OR) Genes constitute a large family. In humans, for example, there are 390 OR genes and 465 pseudogenes (Olender et al. 2008), since these receptors recognize varied binders, from chemical compounds to peptides (Ma 2007).
In a cattle variant calling study (Stafuzza et al.2017 ), the olfactory transduction pathway was over represented, in all four important cattle breeds in Brazil: Guzerat, Gyr, Girolando and Holstein. (Metzger et al. 2014) identified InDels with codon shift effect of OR genes on horses of the Hanoverian and Arabian breeds, including the O56A3 gene, in which we also identify SNVs with moderate impact on the Nordestino horse breed (Table 3). They also investigated codon changes due to private InDels occurrence in breed horses, compared to non-breed (Przewalski) horses, and revealed higher occurrence of these variants in genes involved in immune system processes in breed horses. Immune regulation and metabolic processes also contained variants of impact on gene transcription.
As mentioned in the study by Metzger et al. (2014), the high density of mutations in domestic equine breeds seems to be concentrated in metabolic pathways related to the signaling of basal cellular mechanisms, known as housekeeping, to the signaling of the immune system and mostly in olfactory genes, also associated with the perception of chemical stimuli. This variability, specifically in these last two gene classes, seems to have great importance in promoting the adaptation of these domestic breeds to specific environments, being exactly the one observed for the breed studied in the present work, which exhibits high adaptation to the inhospitable environment of the semi-arid region of northeastern Brazil.
It should be considered that, due to the small sampling of this research, care is needed in the interpretation of the over-represented pathways and the terms and results of the GO. However, these results provide genomic information of extreme importance to investigate the genetic mechanisms associated with the exclusive phenotypic differences of the Nordestino horse breed.

Conclusions
This is the first genomic data for a Naturalized Brazilian horse breed, and it is an invaluable resource for future studies of genetic variation associated with the exclusive phenotype of the Nordestino horse breed. Comparing its genome to the horse reference genome, approximately 89 thousand SNVs and 10 thousand InDels were identified. We prioritized variants of high (affecting splice-sites, stop and start codons) and moderate impacts (non-synonymous), especially SNVs, and identified 28 Ensembl IDs, in which high impact SNVs are present, and 392 Ensembl IDs containing moderate impact SNVs.
The functional enrichment analysis indicated that the GTPase IMAP Family was the only one that presented high impact SNVs and the genes with non-synonymous SNVs in coding regions were highly enriched in olfactory functions, sensory perception of smell and metabolic processes. It is possible that the variability in these gene families has relevant importance in the gorgeous adaptation of the breed to the semi-arid climate of the Brazilian Northeast. Therefore, this study provides the basis for validation of variants in a population study of this breed to identify genomic markers, such as SNPs, associated with the exclusive phenotype, and the molecular mechanisms involved. The genomic insights may aid in breed conservation and in the development of resistance markers to arid climate conditions.

Ethics approval
The blood sample was collected from a male horse, in a private property, with written consent from the owner, without experimental planning on the property or experimental interventions that cause damage or non-momentary pain and suffering to the animal. Therefore, no specific ethical approval is needed (Brazil law number 11794, from October 8th, 2008, Chapter 1, Art. 3, paragraph III).

Availability of data and materials
The datasets generated and/or analysed during the current study are not publicly available due they belong to the germplasm bank of Embrapa Recursos Genéticos e Biotecnologia, but are available from the corresponding author on reasonable request.

Figure 1
SNVs effects percentage by genomic region through the FreeBayes and GATK variant calling tools. SNPEff software was used to categorize the variants' effects, based on position in the genome. These include exons, introns, untranslated regions (5 'UTR and 3' UTR), splice donor site, splice acceptor site, splice site region, transcripts and intergenic regions.
"Downstream" and "Upstream" are defined as regions 5 kilobase (kb) downstream of the most distal polyA addition site and 5 kilobase (kb) upstream of the most distal transcription start site, respectively (Cingolani et al. 2012). Splice region means that a variant is within 2 bp of a splice junction. Splice acceptor means that the variant hits a splice acceptor site (defined as 2 bases before the exon start site, except for the first exon). Splice donor means that the variant hits a splice donor site (defined as 2 bases after the end of the coding exon, except for the last exon (Zhang et al. 2018).

Supplementary Files
This is a list of supplementary files associated with this preprint. Click to download.