Number and distribution of CNV
LUMPY detected 8,563 duplications and 230,497 deletions while Manta detected 24,088 duplications and 320,374 deletions. A combined data set with 244,876 deletions and 8,677 duplications (totaling 253,553, translating into an average of 1,393 CNV per animal) was derived from the intersection of the LUMPY and Manta sets after removal of variants shorter than 50 bp or longer than 3 Mb. The combined data set had more observations than the LUMPY data set (which had fewer raw CNV) because for some individuals, many short CNV from Manta intersected with few long CNV from LUMPY.
The CNV were distributed across the 29 autosomes as shown in Figure 1. A vast majority of the CNV (96.6%) were losses. This is not unexpected, because all CNV detection methods suffer from an inherent deficiency in detecting insertions. In the case of CNV detection using WGS data, this limitation is even more pronounced with PEM methods, because they detect insertions when the mapped reads are at a distance shorter than the fragment length, so they are not able to detect insertions larger than the insert size of the reference library [43].
Overall, the mean CNV length was about 3.3 kb, with a median of 1.3 kb. The distribution of the lengths of the CNV for each population are shown in Figure 2 by CNV length category. A summary of the descriptive statistics of the CNV for the populations are given in Table 1. Most of the CNV losses (99.92%) were less than 100 kb long while 6.3% of CNV gains were longer than 100kb. Despite the overwhelming proportion of losses over gains, there were more CNV gains observed over 100kb than losses. Similarly, only 1.04% of the loss CNV were longer than 10 kb, while almost one-quarter (22.99%) of all gain CNV were over 10kb. As a result, CNV gains were longer than CNV losses and had larger range in length. Deletions and duplications averaged about 2.3 and 31.5 kb long, with median lengths of 1.3 and 1.4 kb, respectively. There were no significant differences in the distribution of CNV across the five populations as shown in the percentile and sample QQ plots in Figure 3.
Table 1: Descriptive statistics of CNV and CNV length for each population
Population
|
Number of samples
|
CNV
|
CNV length (bp)
|
State
|
Number
|
Mean
|
Median
|
Minimum
|
Maximum
|
BOE
|
9
|
Loss
|
9079
|
2227.1
|
1326
|
67
|
254129
|
Gain
|
331
|
20165.9
|
1500
|
161
|
631262
|
Overall
|
9410
|
2858.1
|
1330
|
67
|
631262
|
EAF
|
80
|
Loss
|
108051
|
2244.7
|
1293
|
52
|
2161018
|
Gain
|
3544
|
30979.2
|
1316.5
|
118
|
2777398
|
Overall
|
111595
|
3157.2
|
1293
|
52
|
2777398
|
MAD
|
27
|
Loss
|
31426
|
2475.3
|
1295
|
84
|
2069909
|
Gain
|
1078
|
28384.1
|
1446
|
84
|
1660243
|
Overall
|
32504
|
3334.6
|
1296
|
84
|
2069909
|
SAF
|
44
|
Loss
|
67099
|
2368.9
|
1285
|
51
|
2539701
|
Gain
|
2514
|
31000.7
|
1192
|
101
|
1959154
|
Overall
|
69613
|
3402.9
|
1283
|
51
|
2539701
|
WAF
|
22
|
Loss
|
29221
|
2491.4
|
1280
|
52
|
2457795
|
Gain
|
1210
|
40255.3
|
1234
|
65
|
2788546
|
Overall
|
30431
|
3993
|
1280
|
52
|
2788546
|
Population CNV differentiation
Analysis of population differentiation (VST) as described by Redon et. al. [11] showed that several CNV were highly differentiated between and across the populations. Some of these CNV overlapped with genes of importance in goats. Results for the pairwise population VST tests and the VST test across all the populations with their respective 99th percentile CNV VST thresholds are given in Supplementary Table 1 (Additional file 1). VST values for the pairwise tests are given in Supplementary Figures 1-10 (Additional file 2). The VST values for genes that were in CNV that were highly differentiated across all populations are shown in Figure 4. The gene DST was in a CNV with a very high VST threshold across all the populations. DST has been associated with herpes virus and respiratory disease (BRD) in cattle [44]. Some CNV were highly differentiated both between and across populations. CNV with high differentiation between only some populations include the CNV corresponding to the genes BCO2, CCSER1 (FAM190A), COL24A1, CPNE4, CWC22, IMMP2L, KBTBD12, LAMA3, NAALADL2, RFX3, SEMA3D, SLC2A13, STPG2 (C4orf37), TAFA2 (FAM19A2), TMEM117, TMEM161B and VPS13B. The rest of the genes were in CNV that were highly differentiated across all populations.
Number and distribution of CNV regions
The lists of CNV regions (CNVR) by population are given in Supplementary Table 2 (Additional file 1) and their locations on the goat genome are shown in Figure 5. Plots of the CNVR for each breed (with more than 2 animals) are given in Supplementary Figures 11 to 39 (Additional file 2). Descriptive statistics of the CNVR for each population are given in Supplementary Table 3 (Additional file 1) while a distribution of CNVR by size and populations is given in Figure 6. Over 92% of the CNVR were copy losses. There was a wide variation in the number and sizes of the CNVR between and among the populations. The fraction of copy gains or gains and losses was highest in the group of CNVR of at least 10kbp, with 25% copy gains and 19% for losses/gains (Figure 6).
Number and distribution of global CNVR
A total of 6,231 global CNVR were found across all animals. A list of the global CNVR is given in Supplementary Table 4 (Additional file 1) and a summary is given in Table 2. There were 5,742 CNVR with copy losses, 280 with copy gains and 209 with both copy losses and gains in different individuals. The locations of the global CNVR are given in Figure 7. CNVR with both gains and losses were much longer (mean 185.8 kb) and constituted a significant proportion of the total CNVR coverage (65.6%).
Overall, the CNVR covered about 59.2 Mb of the goat genome. Previous work on genome-wide CNV discovery in goats using SNP data done by Liu et al. [18] showed that CNVR cover approximately 262 Mb of the goat genome. Of the 978 CNVR reported in that study, 540 CNVR intersected with 819 CNVR identified in our study. The amount of the overlap between the CNVR in the two studies was 217.1 Mb, covering 38.6 Mb (65.1%) in this study, and 194.2 Mb (74.1%) in the other study.
Table 2: CNVR summary statistics for each CNV state based on CNV occurring in at least 2 individuals.
Copy state
|
Number of CNVR
|
Length (bp)
|
CNVR coverage (bp)
|
Mean
|
Median
|
Minimum
|
Maximum
|
Loss
|
5742
|
3041.3
|
1140.5
|
52
|
1177087
|
17463236
|
Gain
|
280
|
10377.9
|
1008.0
|
302
|
236347
|
2905806
|
Both
|
209
|
185755.2
|
1731.0
|
616
|
2956746
|
38822839
|
Overall
|
6231
|
9499.6
|
1157.0
|
52
|
2956746
|
59191881
|
Common and rare CNVR
Most of the CNVR (>95.9 %) were found in at least 2 breeds. Out of the 6,231 CNVR, 98 (1.6%) were present in all the 34 breeds and 1,790 (28.7%) were present in all the populations (Figure 8A and Figure 8B). The most frequent CNVR observed was on chromosome 6 from 115,822,332 bp to 115,825,687 bp with a frequency of 96.2%. There were 259 CNVR private to 30 breeds, and 1,018 private to all 5 populations, distributed as shown in Figure 8C and Figure 8D. BOE (Tanzania and Zimbabwe), KEF (Ethiopia) and MLY (Tanzania) breeds had the highest numbers of private CNVR (20, 21 and 31, respectively).
Functional annotation and gene enrichment analysis
Functional annotation was carried out for genes in global and private CNVR. Up to 2,980 genes overlapped with the 6,321 CNVR identified in this study. Up to 755 of these genes formed 24 clusters, with enrichment scores ranging from 0.0 to 1.89. Higher enrichment scores imply higher overrepresentation of the genes in the gene set for the gene enrichment term [45]. The top 3 clusters with the highest enrichment scores are given in Table 3 while the full list is given in Supplementary Table 5 (Additional file 1). The most significant GO terms identified in the analysis included retrograde endocannabinoid signaling; glutamatergic synapse; circadian entrainment; dopaminergic synapse; gastric acid secretion; long-term potentiation; salivary secretion; and calcium signaling pathway.
CNVR private to populations and breeds overlapped with 172 and 620 genes, respectively. The GO terms associated with these genes based on functional analysis are listed in Supplementary Table 6 (Additional file 1). The genes that overlapped with the CNVR private to breeds were not significantly enriched in biological processes, molecular functions and cellular components, while the ones that overlapped with the CNVR private to populations were significantly enriched (P ≤ 0.05) with such terms as aldosterone synthesis and secretion; glucagon signaling pathway; insulin secretion; glutamatergic synapse; thyroid hormone synthesis; gastric acid secretion and phosphatidylinositol signaling system. The most common CNVR (chr6:115,822,332-115,825,687) includes the gene TMEM129 (transmembrane protein 129) that has been reported to be responsible for ubiquitination and proteasome-mediated degradation of misformed or unassembled proteins in the cytosol [46–48], and belongs to a network responsible for cellular assembly and organization, cellular function and maintenance, and cell cycle [49].