Genomic variants in Nordestino horse breed
A total of 28 Gb of paired-end sequence data were produced from whole-genome sequence data of a male of the Nordestino horse breed, with 11.2-fold genome coverage, considering the sum of the sequencing runs performed. A total of 305,588,364 reads were mapped to the horse reference genome (EquCab3.0 from the Ensembl database), with a mapping rate of 96.05%.
At an effective genome size of 2,462,676,227 bases, a total of 1,741,210 variants were identified, using FreeBayes; therefore, a variation rate of 1 variant per 1,414 bases, relative to the reference genome. Among these, 1,598,210 were classified as SNVs; 57,580 as insertions and 80,529 as deletions. Particularly in InDels analysis, a total of 4,964 were classified as structural variants, being 54 insertions, 3 deletions and 4,907 mixed.
When we applied the Genome Analysis Toolkit (GATK), we identified 88,848 variants, classified as SNVs (1 variant every 27,470 bases), for an effective genome size of 2,440,521,205 bases. In the search for InDels, 10,006 insertions and 15,226 deletions (1 InDel per 96,300 bases) were identified, for an effective genome size of 2,429,851,222 bases (Table 1).
Table 1. Number of variants in the Nordestino horse genome by FreeBayes and GATK variant calling tools.
Variants
|
FreeBayes
|
GATK
|
SNV
|
1,598,210
|
88,838
|
INS
|
57,580
|
10,006
|
DEL
|
80,559
|
15,226
|
MIXED
|
4,861
|
_
|
SNV Rate
|
1/1,540
|
1/27,470
|
INDEL Rate
|
1/17,221
|
1/96,300
|
Using the recommended quality metrics for each software, it was found that the total number of variants detected by FreeBayes was higher than GATK, which is expected, since GATK exhibits higher sensitivity while maintaining a lower number of false positive SNVs (Cornish & Guda, 2015).
We do not intend to compare variant calling tools. However, both have been used in order to increase the safety of variant calling and to identify in which regions of the genome they are present and to check for variants in genes possibly associated with the peculiar traits exhibited by the breed. From these data, in a future study, we intend to validate SNPs in a population of nordestinos horses. In addition, we do not intend, at this time, to compare new SNVs with already known and present in SNPs arrays, since this is the first study about the breed and we initially searched for variants from the complete genome of a single specimen of the breed.
Characterization of SNVs and InDels
Calling variants were distributed through 3,113 supercontigs in the analysis of FreeBayes andthrough 1,520 supercontigs, using the GATK, which constitute, respectively, 98.23% and 97.34% of the horse genome (Additional file1). We found 12.23% of SNVs effects in intergenic regions, 37.63% of introns and 1.02% of exons in FreeBayes analysis. According to GATK, 19.19% of SNVs are located in intergenic regions, 33.05% in introns and 1.16% in exons. In both variants call tools, SNVs and InDels had very low occurrence in donor and acceptor splice-site sequences (values close to 0.01%), splice-site regions (approximately 0.08%) and UTR 5’ (0.17%), and somewhat higher occurrence in UTR 3’ region (values next to 0.4%) (Fig 1). The actual number of effects is greater than the number of variants because those found between genes positioned in close proximity can have their effects categorized as both downstream and upstream.
The percentage of SNVs effects per genotypic region of the Nordestino horse presented here was very similar to the average percentage found in native Chinese breeds (Lichuan and Kazakh breeds, small and rugged horses), as demonstrated by Zhang et al. (2018) by SNVs calling conducted with GATK and functional annotation, based on SnpEff software, from whole-genome sequencing data. From the total of single nucleotide variations, defined by the authors as single nucleotide polymorphisms (SNPs), the most intense effects were transcript, intron and intergenic (29.07%, 28.12% and 27.02%, respectively), and the smaller effects were exactly the same as found here, with very similar percentages.
Using the SnpEff program [24], we also classified the effects of variants (SNVs and InDels) by impact as modifiers of mostly high, moderate and low impact of variants called by GATK and FreeBayes. We showed the additional effect of SNVs and InDels on FreeBayes data and the exclusive effect of SNVs on GATK data (Fig 2). The effects of SNVs and InDels in all categories were much higher in FreeBayes analysis due to the combined effect of these variations. GATK analysis revealed greater sensitivity for exclusion of "false-positive" (PF) variants, as previously mentioned.
Selection of genes containing high and moderate effects SNVs
Based on the effect of the variants and their annotation by SnpEff, we have identified all genes or Emsembl IDs in which high, low, moderate and modifier impact effects (SNVs and InDels) occur, based in variants calling results by both GATK, and FreeBayes. In order to prioritize single nucleotide variants, which can be characterized as SNPs in subsequent population studies, we have screened the genes that have at least one such variation that has high impact (disruptive impact in the protein, causing protein truncation, loss of function or triggering nonsense mediated decay) and then checked which genes are present in the analysis by both variants calling tools. We found 28 Emsembl IDs from multigenic families containing at least one high impact SNVs, in accordance with both GATK and FreeBayes (Table 2). Among these, a pseudogene and GTPase Family members 7, 4, 2 and 1.
Table 2. Total of Nordestino horse genes containing High Impact SNVs in accordance with both GATK and FreeBayes variant calling tools.
|
|
|
|
GATK
|
FreeBayes
|
GeneName
|
Transcript
|
Product
|
BioType
|
SNVs
impact
HIGH
|
SNVs impact
HIGH
|
LOC102149342
|
gene32392
|
|
pseudogene
|
1
|
1
|
LOC100146699
(id196201)
|
rna15620
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196202
|
rna15621
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196203
|
rna15622
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196204
|
rna15623
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196205
|
rna15624
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196206
|
rna15625
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196207
|
rna15626
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196208
|
rna21264
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196209
|
rna21265
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196210
|
rna21266
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196211
|
rna25375
|
GTPase family member 7
|
|
1
|
1
|
id196213
|
rna38835
|
GTPase family member 7
|
protein_coding
|
1
|
2
|
id196214
|
rna38989
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196215
|
rna40155
|
GTPase family member 7
|
protein_coding
|
1
|
2
|
id196216
|
rna43275
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196217
|
rna51848
|
GTPase family member 7
|
protein_coding
|
1
|
1
|
id196219
|
rna63380
|
GTPase family member 7
|
|
1
|
1
|
id196220
|
rna63381
|
GTPase family member 7
|
|
1
|
1
|
LOC111773116
(id196221)
|
rna71433
|
GTPase family member 4
|
protein_coding
|
1
|
1
|
id196222
|
rna71664
|
GTPase family member 4
|
|
1
|
1
|
id196223
|
rna73843
|
GTPase family member 4
|
protein_coding
|
1
|
1
|
id196224
|
rna73844
|
GTPase family member 4
|
protein_coding
|
1
|
1
|
LOC100054458
(id196225)
|
rna73845
|
GTPase family member 2
|
protein_coding
|
1
|
1
|
id196226
|
rna76474
|
GTPase family member 2
|
protein_coding
|
2
|
5
|
id196227
|
rna76774
|
GTPase family member 2
|
protein_coding
|
1
|
1
|
id196228
|
rna76804
|
GTPase family member 2
|
protein_coding
|
1
|
1
|
LOC100063777
(id196229)
|
rna76872
|
GTPase family member 1
|
protein_coding
|
1
|
2
|
Considering also the relevance of moderate impact effects, we identified 392 Ensembl IDs containing SNVs with this effect, in accordance with both GATK and FreeBayes (Additional file 2). Among these, we selected 70 genes for functional enrichment analysis and Gene Ontology (GO) terms, using the Databank for Annotation, Visualization and Integrated Discovery (DAVID). The screening for 70 genes was performed with the purpose of allowing the data presentation in a non-additional table. These data allowed the timely identification of regions where there are variations of high and moderate impacts, including variations in genes in which impacts on gene transcription can occur, and verify the occurrence of these variations in candidate genes, possibly related to the arid conditions resistance traits exhibited by the Nordestino horse breed.
The 70 Emsembl IDs with SNVs of moderate impact, selected from the 392, are representative from 33 human orthologous genes (ABCD2, ARHGAP20, ARHGAP28, ATP6, CHFR, COX2, COX3, CYT, ERP27, EXOC6, FDFT1, GBP7, GIMAP7, HYAL4, KIF1, KLRK1, LCORL, LY49F, NBPF7, OR10D4, OR52L2, OR56A3, OR56A4, OR6B2, OR7G2, OR8S1, PDPR, RNASEL, RWDD3, SLC45A1, SWT1, WAPL, WD40). We explored these genic functions (which contained high and moderate SNVs impact) associated with various biological processes. The P value of .05 was considered significant for GO annotations. Gene Ontology enrichment analysis for these genes revealed eight GO biological processes and nineteen GO molecular functions, acting in nine metabolic pathways (Table 3).
Table 3. Gene ontology (GO) terms and enriched KEGG pathways (False Discovery rate (FDR)<0.10) of the selected gene set, containing high and moderate impact SNVs, in accordance with both GATK and FreeBayes
|
ID
|
Name
|
FDR
|
Genes with high and moderate impact SNVs
|
GO
Molecular Function
|
GO:0004984
|
olfactory receptor activity
|
5,37E-01
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0015078
|
proton transmembrane transporter activity
|
1,81E+00
|
MT-ATP6,MT-CO2,MT-CO3,MT-CYB
|
|
GO:0007186
|
G protein-coupled receptor signaling
|
1,48E+01
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0009055
|
electron transfer activity
|
1,48E+01
|
MT-CO2,MT-CO3,MT-CYB
|
|
GO:0004888
|
transmembrane signaling receptor activity
|
1,48E+01
|
KLRK1,OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0016676
|
oxidoreductase activity
|
1,48E+01
|
MT-CO2,MT-CO3
|
|
GO:0004129
|
cytochrome-c oxidase activity
|
1,48E+01
|
MT-CO2,MT-CO3
|
|
GO:0015002
|
heme-copper terminal oxidase activity
|
1,48E+01
|
MT-CO2,MT-CO3
|
|
GO:0016675
|
oxidoreductase activity
|
1,48E+01
|
MT-CO2,MT-CO3
|
|
GO:0038023
|
signaling receptor activity
|
1,81E+01
|
KLRK1,OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0051996
|
squalene synthase activity
|
1,81E+01
|
FDFT1
|
|
GO:0004310
|
farnesyl-diphosphate farnesyltransferase activity
|
1,81E+01
|
FDFT1
|
|
GO:0015077
|
transmembrane transporter activity
|
3,02E+01
|
MT-ATP6,MT-CO2,MT-CO3,MT-CYB
|
|
GO:0032394
|
MHC class Ib receptor activity
|
3,10E+01
|
KLRK1
|
|
GO:0060089
|
molecular transducer activity
|
3,89E+01
|
KLRK1,OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0004741
|
[pyruvate dehydrogenase phosphatase activity
|
3,89E+01
|
PDPR
|
|
GO:0022857
|
transmembrane transporter activity
|
3,89E+01
|
SLC45A1,MT-ATP6,MT-CO2,ABCD2,MT-CO3,MT-CYB
|
|
GO:0016491
|
oxidoreductase activity
|
4,77E+01
|
PDPR,MT-CO2,MT-CO3,MT-CYB,FDFT1
|
GO Biological Process
|
GO:0050911
|
chemical stimulus in sensory perception of smell
|
1,42E+00
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0007608
|
sensory perception of smell
|
1,42E+00
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0050907
|
sensory perception
|
1,42E+00
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0009593
|
detection of chemical stimulus
|
1,67E+00
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0007606
|
sensory perception of chemical stimulus
|
1,67E+00
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0050906
|
detection of stimulus involved in sensory perception
|
1,67E+00
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0051606
|
detection of stimulus
|
8,72E+00
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
GO:0022900
|
electron transport chain
|
4,92E+01
|
MT-CO2,MT-CO3,MT-CYB
|
KEEG Pathway
|
1269583
|
Olfactory Signaling Pathway
|
9,52E-01
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
1270121
|
The citric acid cycle and respiratory electron transport
|
9,52E-01
|
PDPR,MT-ATP6,MT-CO2,MT-CO3,MT-CYB
|
|
1270127
|
Respiratory electron transport, ATP synthesis
|
3,36E+00
|
MT-ATP6,MT-CO2,MT-CO3,MT-CYB
|
|
|
and heat production.
|
|
|
|
82942
|
Oxidative phosphorylation
|
3,36E+00
|
MT-ATP6,MT-CO2,MT-CO3,MT-CYB
|
|
93344
|
Cardiac muscle contraction
|
7,00E+00
|
MT-CO2,MT-CO3,MT-CYB
|
|
1270128
|
Respiratory electron transport
|
1,27E+01
|
MT-CO2,MT-CO3,MT-CYB
|
|
83087
|
Olfactory transduction
|
1,27E+01
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2
|
|
1269574
|
GPCR downstream signaling
|
2,66E+01
|
OR8S1,OR56A3,OR56A4,OR6B2,OR7G2,OR52L2P,OR10D4P
|
|
142287
|
epoxysqualene biosynthesis
|
3,56E+01
|
FDFT1
|
Gene Family
|
167
|
Olfactory receptors, family 56
|
1,31E+00
|
OR56A3,OR56A4
|
|
643
|
Mitochondrial complex: cytochrome c oxidase subunits
|
3,10E+00
|
MT-CO2,MT-CO3
|
|
721
|
Rho GTPase activating proteins
|
1,44E+01
|
ARHGAP28,ARHGAP20
|
|
808
|
ATP binding cassette subfamily D
|
3,03E+01
|
ABCD2
|
|
580
|
GTPases, IMAP
|
4,31E+01
|
GIMAP7
|
|
1055
|
Exocyst complex
|
4,31E+01
|
EXOC6
|
|
642
|
Mitochondrial complex III
|
4,31E+01
|
MT-CYB
|
|
621
|
CD molecules|Killer cell lectin like receptors
|
4,52E+01
|
KLRK1
|
The GTPase gene family was the only one that presented high impact SNVs in agreement with the variant calling between both softwares. Members of this family are key regulators of most cell processes, including proliferation, differentiation, vesicle and organelle dynamics, transport and regulation of the cytoskeleton (Bos et al. 2007). These are evolutionarily conserved proteins, associated with the activation of extracellular signals to various cellular responses, due to the ability to undergo conformational changes in response to the alternate binding of GDP and GTP. The GDP-bound “off” or “on” states recognize distinct effector proteins, thereby allowing these proteins to function as binary molecular switches (Bos et al. 2005). Considering the functional importance of this gene family in basal cellular processes, the presence of SNVs with possible disruptive impact in the protein, causing protein truncation or loss of function, should be carefully evaluated. A study conducted by Zhang et al. (2018) also identified high impact SNVs in the metabolic pathway associated with GTPases in enriched biological processes from GO analysis in horses of 14 breeds. The highest count of genes with this type of SNVs impact effect (55) are G protein–coupled receptor signaling pathway (GO:0007186), in which we also observed high gene representation, but with SNVs of moderate effect (7 out of 33 selected for functional analysis). G protein–coupled receptors activate signal transduction pathways and, coupling with G proteins, they pass through the cell membrane seven times, being called seven-transmembrane receptors (Trzaskowski et al. 2012).
When we analyzed moderate impact SNVs, the variety of gene families in which they occur is wide, drawing attention to the olfactory receptors family, which had high representation. These genes, involved in the Olfactory Signaling pathway, act in the perception of odor through olfact, interact with odorant molecules in the nasal epithelium, to initiate a neuronal response that triggers the perception of a smell (Antunes et al. 2014). The biochemical signaling events related to this (super pathway) act in food recognition and consequently food preference (Ma 2007), identification of sexual partners (Kang et al. 2015), mother-infant bonding (Doucet et al. 2009) and several other aspects of animal survival. Among them, we can also highlight the variable susceptibility to intranasal infections, as the study by Kupke et al. (2016), which analyzed the proteins expression of this pathway in the equine nasal epithelium, in association with this susceptibility. The SNVs present in genes associated with this pathway, once validated in more individuals of the Nordestino horse breed, may be associated with resistance, including respiratory diseases and the phenotypic profile of rusticity exhibited by Nordestino horse, a profile characterized by Melo et al. (2013). The high representation of this genetic family in our moderate impact SNVs calling can be explained by the fact that Mammalian Olfactory Receptor (OR) Genes constitute a large family. In humans, for example, there are 390 OR genes and 465 pseudogenes (Olender et al. 2008), since these receptors recognize varied binders, from chemical compounds to peptides (Ma 2007).
In a cattle variant calling study (Stafuzza et al.2017 ), the olfactory transduction pathway was over represented, in all four important cattle breeds in Brazil: Guzerat, Gyr, Girolando and Holstein. (Metzger et al. 2014) identified InDels with codon shift effect of OR genes on horses of the Hanoverian and Arabian breeds, including the O56A3 gene, in which we also identify SNVs with moderate impact on the Nordestino horse breed (Table 3). They also investigated codon changes due to private InDels occurrence in breed horses, compared to non-breed (Przewalski) horses, and revealed higher occurrence of these variants in genes involved in immune system processes in breed horses.
Jun et al. (2014) characterized the genome of the Marwari horse (from the complete genome sequencing of a male Marwari horse), an Indian rare breed with unique phenotypic traits. The variant calling results by SAMtools software and functional enrichment analysis also showed that the genes with Nonsynonymous SNVs and/or InDels in coding regions were highly enriched in olfactory functions.
Immune regulation and metabolic processes also contained variants of impact on gene transcription. As mentioned in the study by Metzger et al. (2014), the high density of mutations in domestic equine breeds seems to be concentrated in metabolic pathways related to the signaling of basal cellular mechanisms, known as housekeeping, to the signaling of the immune system and mostly in olfactory genes, also associated with the perception of chemical stimuli. This variability, specifically in these last two gene classes, seems to have great importance in promoting the adaptation of these domestic breeds to specific environments, being exactly the one observed for the breed studied in the present work, which exhibits high adaptation to the inhospitable environment of the semi-arid region of northeastern Brazil.
It should be considered that, due to the small sampling of this research, care is needed in the interpretation of the over-represented pathways and the terms and results of the GO. However, these results provide genomic information of extreme importance to investigate the genetic mechanisms associated with the exclusive phenotypic differences of the Nordestino horse breed.