Using pedigree-derived inbreeding coefficients, we estimated the effective population size of the sire and dam line of the SLW breed to 44 and 72, respectively. In order to assess sequence variation within the two lines, we prioritized 70 boars for whole-genome sequencing based on their marginal genetic contributions to the active breeding populations with a key ancestor approach. Of the 70 boars, 38 and 32 represent the sire and dam line, respectively, explaining 95.35 and 87.95% of the genetic diversity of the active breeding populations.
Following quality control (removal of adapter sequences, reads and bases of low sequencing quality), between 81.15 and 377.01 million read pairs (2 x 150 bp) per sample (mean: 165.55 ± 60.32 million read pairs) were aligned to the SSC11.1 assembly of the porcine genome. Using reads with high mapping quality (reads with mapping quality < 10 and SAM bitwise flag 1796 were not considered), the average sequencing coverage of the 70 boars was 16.69 ± 5.93-fold across all autosomes. Raw sequence read data of 70 pigs have been deposited at the European Nucleotide Archive (ENA) of the EMBL at BioProject PRJEB38156 and PRJEB39374.
A reference-guided multi-sample variant discovery and genotyping approach yielded genotypes at 28,407,060 sites (22,191,375 biallelic SNP, 4,379,470 biallelic INDEL, and 1,836,215 others, Table 1). We applied GATK’s VariantFiltration module for site-level hard filtration using parameters recommended in the best practice guidelines [15]. Subsequently, we applied Beagle (version 4.1; [16]) phasing and imputation to improve the genotype calls from GATK and to impute sporadically missing genotypes. Following the imputation, we retained 26,862,369 variants including 21,592,583 SNP and 5,269,786 INDEL. The number of polymorphic sites that were seen in the heterozygous (singletons) and homozygous (doubletons) state only once was 2,026,088 (7.54%) and 72,100 (0.27%), respectively. To prevent bias resulting from flawed genotypes in repetitive regions, we excluded 1,710,337 variants for which an excess of sequencing coverage was evident for downstream analyses. The transition/transversion (Ti/Tv)-ratio estimated from filtered and imputed variants was 2.28.
The resulting data were separated into two datasets containing 23,774,053 and 23,531,919 autosomal variants detected in 32 and 38 boars from the dam and sire line, respectively. Of the variants, 1,049,689 and 1,594,775 were fixed for the alternate allele in the dam and sire line, respectively. On average, we detected 11,119,760 ± 176,113 biallelic variants per animal (Fig. 1A), of which 6,258,456 ± 280,127 and 4,861,304 ± 135,524 were heterozygous and homozygous for the reference allele, respectively. The average nucleotide diversity (π) across 452,444 overlapping windows (10 kb in size with 5 kb steps), spanning 22,840,217 and 22,529,446 biallelic variants, respectively, was 2.81 x 10− 3 in the dam and 2.72 x 10− 3 in the sire line.
Table 1
Variants detected in 70 sequenced key ancestor animals.
|
Raw
|
Filtered & imputed
|
Dam line
|
Sire line
|
Number of animals
|
70
|
70
|
32
|
38
|
Sequence coverage1
|
16.69
(8.72–36.85)
|
18.02
(9.31–36.85)
|
15.57
(8.72–27.73)
|
Number of variants
|
All
|
28,407,060
|
26,862,369
|
24,358,047
|
24,093,052
|
Biallelic SNP
|
22,191,375
|
21,209,725
|
19,456,000
|
19,232,692
|
Biallelic INDEL
|
4,379,470
|
4,339,947
|
3,960,976
|
3,928,684
|
Others2
|
1,836,215
|
1,312,697
|
941,071
|
931,676
|
Autosomal variants
|
All
|
27,582,843
|
26,198,587
|
23,774,053
|
23,531,919
|
Biallelic SNP
|
21,553,323
|
20,715,354
|
19,015,058
|
18,808,294
|
Biallelic INDEL
|
4,248,742
|
4,211,012
|
3,846,008
|
3,817,622
|
Others2
|
1,780,778
|
1,272,221
|
912,987
|
906,003
|
1 estimated from the autosomes |
2 this category contains multi-allelic SNP, multi-allelic INDEL, as well as sites that may contain both SNP and INDEL |
Comparison between array-called and sequence-called genotypes
Sixty-eight boars (32 and 36 from the dam and sire line, respectively) that had average sequencing coverage between 8.72 and 36.85-fold (average: 16.79-fold) also had Illumina PorcineSNP60 BeadChip-called genotypes. Using the array-called genotypes at 54,600 autosomal SNP for which we were able to determine reference and alternate alleles as a truth set, we calculated genotype concordance, non-reference sensitivity and non-reference discrepancy between array-called and sequence-called genotypes as proposed by DePristo et al. [17]. Of the 54,600 SNP, 6,376 and 1,029 were fixed for the reference and alternate allele, respectively, and 47,195 were polymorphic in the array-called genotypes of the 68 pigs.
Of the 48,224 SNP that were either polymorphic or fixed for the alternate allele in the array-called genotypes, 46,009 (95.41%) and 45,951 (95.29%) were also present in the raw and filtered sequence variants, respectively. 1,232 SNP of the Illumina PorcineSNP60 BeadChip complement were missing in the sequenced set because they were either genotyped as INDEL or multiallelic sites using GATK and thus excluded from the comparison due to incompatible alleles. 983 and 1,041 SNP were not among the raw and filtered sequence variants, respectively, although the frequency of the minor allele was > 5% in the array-called genotypes for most (> 80%) of them. It is likely that these variants could not be matched with the sequence set due to either incompatible or ambiguous map coordinates.
Non-reference sensitivity was greater than 99% and non-reference discrepancy around 1% for the raw genotypes called by the GATK, suggesting that the high sequencing coverage facilitated accurate variant discovery (Table 2). The concordance between sequence- and array-called genotypes improved slightly after applying site-level hard filtration. Beagle phasing and imputation further increased the concordance and non-reference sensitivity as well as decreased the non-reference discrepancy of the filtered sequence variant genotypes.
Table 2
Comparison between sequence- and array-called genotypes at corresponding positions.
Dataset
|
Genotype concordance (%)
|
Non-reference sensitivity (%)
|
Non-reference discrepancy (%)
|
Raw
|
99.18
|
99.75
|
1.11
|
Filtered
|
99.19
|
99.77
|
1.09
|
Filtered & imputed
|
99.82
|
99.95
|
0.24
|
Population structure and genetic diversity
To investigate the population structure, ancestry and genetic diversity among the 70 sequenced pigs, we performed principal component, admixture and FST analyses. The principal components were extracted from a genomic relationship matrix constructed from 23,691,198 autosomal sequence variants that had minor allele frequency greater than 0.01.
The first principal component of the genomic relationship matrix explained 8.61% of the variation and separated the animals by lines (Fig. 1B). The second principal component explaining 2.68% of the variation revealed variability within the sire line. Five outlier animals along the second axis of variation descended from imported Large White boars, thus reflecting an introgression of foreign genes into the sire line.
We performed an admixture analysis using 1,207,189 independent biallelic SNP to assess gene flow between both lines. As expected, K = 2 was the most plausible number of genetically distinct clusters (Supplementary Fig. S1, Supplementary File 1). The cross-validation error for K = 1, K = 2 and K = 3 was 0.561, 0.546 and 0.564, respectively. At K = 2, six individuals (one from the dam and five from the sire line) showed some admixture.
In order to investigate if pronounced allele frequency differences exist between both lines, we performed a SNP-based genetic differentiation analysis using the Weir and Cockerham weighted FST statistic [18]. We observed multiple 10 kb sliding windows scattered throughout the genome with FST values greater than 0.25, indicating genetic divergence of both lines (Supplementary Fig. S2, Supplementary File 2). The average weighted FST value across all windows was 0.07.
We estimated runs of homozygosity (ROH) with BCFtools/ROH [19] based on GATK-derived Phred-scaled likelihoods for 19,146,365 biallelic SNP to investigate genomic inbreeding in both lines. In total 111,201 ROH with an average length of 391.28 kb (ranging from 50 kb to 11.1 Mb) were detected (Phred-scaled likelihood > 70). The ROH contained an average number of 3,176 SNP (ranging from 29 to 87,699). The boars from the dam and sire line had 1,604 ± 133 and 1,575 ± 91 ROH with an average size of 377,928 and 402,731 bp, respectively. The genomic inbreeding (FROH, i.e., the fraction of the autosomal genome covered by ROH), was 0.26 ± 0.03 and 0.28 ± 0.03 for the dam and sire line, respectively. We classified the ROH into short (50–100 kb), medium (100 kb − 2 Mb) and long ROH (above 2 Mb) (Fig. 2). Most ROH belonged to the medium length class. The average FROH was similar in both lines for small and medium ROH. However, FROH was higher for long ROH in the sire line.
Variant annotation
In 32 boars from the dam line, we annotated 23,774,053 (19,087,807 SNP; 4,038,170 INDEL) variants, including 2,567,754 variants that were not detected in the sire line. In 38 boars of the sire line, we annotated 23,531,919 (18,881,067 SNP; 4,009,043 INDEL) variants, including 2,325,620 that were not detected in the dam line. When compared to 63,832,658 germline variants listed for Sus scrofa in the Ensembl database (release 101), 5,745,790 (24.17%, dam line) and 5,693,068 (24.19%, sire line) variants were novel, of which the majority were INDEL and 14.66% and 14.64% were biallelic SNP.
We used the Ensembl Variant Effect Predictor software (VEP, release 98; [20]) to predict functional consequences for the sequence variants (Table 3). In total, 2.96% (dam line) and 2.94% (sire line) of the variants were in exons. Putative impacts of missense variants on protein function were predicted using the SIFT (sorting intolerant from tolerant) scoring algorithm [21] as implemented in the VEP software. The scoring algorithm classified 12,024 and 11,958 amino acid substitutions in the dam and sire line, respectively, as “deleterious” (SIFT score < 0.05).
Table 3
Predicted consequences of variants segregating in two lines. The table shows only the most sever consequence for a variant.
Consequence type (most severe)
|
Dam line
|
Sire line
|
Splice donor variant
|
1,396
|
1,421
|
Splice acceptor variant
|
1,126
|
1,096
|
Stop gained
|
1,615
|
1,604
|
Frameshift variant
|
10,912
|
11,043
|
Stop lost
|
595
|
587
|
Start lost
|
423
|
421
|
Inframe insertion
|
990
|
987
|
Inframe deletion
|
1,164
|
1,186
|
Protein altering variant
|
62
|
62
|
Missense variant
|
70,758
|
69,983
|
Splice region variant
|
22,493
|
22,148
|
Incomplete terminal codon variant
|
12
|
11
|
Synonymous variant
|
76,977
|
75,279
|
Stop retained variant
|
149
|
135
|
Start retained variant
|
4
|
4
|
Coding sequence variant
|
98
|
96
|
Mature miRNA variant
|
12
|
16
|
5’ - UTR variant
|
168,000
|
164,866
|
3’ - UTR variant
|
348,135
|
344,514
|
Non-coding transcript exon variant
|
277,002
|
275,909
|
Intron variant
|
12,213,614
|
12,092,056
|
Non-coding transcript variant
|
11
|
10
|
Upstream gene variant
|
878,779
|
869,207
|
Downstream gene variant
|
757,364
|
750,548
|
Intergenic variant
|
8,942,362
|
8,848,730
|
Known trait-associated variants
The catalogue of Mendelian traits in Sus scrofa curated in the OMIA database (https://omia.org/home/, [22]) contained records of 47 likely causal variants (as of September 2020). However, the genomic coordinates were available for only 33 likely causal variants. Using functional annotations and sequence coverage analyses, we detected OMIA-listed variants affecting the KIT, MC1R and FUT1 genes in the sequenced key ancestor animals that occurred at alternate allele frequencies between 0.013 and 1 (Supplementary Table S1, Supplementary File 3).
A duplication of the KIT gene and a splice site variant in intron 17 of the KIT gene are associated with the dominant white phenotype [23, 24]. Because the genotyping of larger structural and copy number variants from short-read sequencing data is notoriously difficult, we visually inspected the depth of sequencing coverage at the SSC8 region encompassing KIT. An increase in coverage between 41.22 and 41.78 Mb confirmed the presence of a previously reported 560 kb duplication (DUP1; Supplementary Fig. S3, Supplementary File 4, [25, 26]. The duplication also encompasses a copy of KIT that carries a splice donor site variant (SSC8: 41486012G>A, rs345599765) which manifests in a dominant white phenotype [23, 24]. The splice variant segregated at a frequency of 0.49 and 0.42 in the sire and dam line, respectively. Seven animals that carried either one or two copies of DUP1 did not carry the splice site variant and all others were heterozygous carriers. Because this variant is located within the 560 kb duplication, we observed allelic imbalance in heterozygous animals.
We detected three OMIA-listed pigmentation-associated variants in the MC1R gene in the sequenced pigs. All boars were homozygous carriers of a 2-bp insertion (SSC6: 182,120 - 182,121 bp), that causes a frameshift and premature translation termination, which is associated with recessive white color [27]. All animals were also homozygous carriers of two missense variants in the MC1R gene (SSC6: 181461T>C, ENSSSCP00000027395.1: p.Thr243Ala and SSC6: 181697A>G, ENSSSCP00000027395.1: p.Val164Ala), for which the reference alleles had been associated with red color in the Duroc breed [28].
A missense variant (SSC6: 54079560T>C; ENSSSCP00000062180.1: p.Thr102Ala; rs335979375) in the FUT1 gene enables adhesion of enterotoxigenic Escherichia Coli F18 fimbriae (ETEC F18) to receptors at the brush border membranes of the intestinal mucosa [29]. The allele that facilitates ETEC F18 adhesion causes diarrhea in neonatal and recently weaned piglets. Since a strong selection against the ETEC F18 susceptible allele takes place in both SLW lines, we observed the disease-associated allele only in one boar from the sire line in the heterozygous state.
Signatures of selection
We detected signatures of past selection using the composite likelihood ratio (CLR) test. Signatures of ongoing selection were identified by the integrated haplotype score (iHS) test. For both analyses, we used biallelic autosomal SNP (Ndam = 19,015,058, Nsire = 18,808,294) that were grouped into non-overlapping 100 kb windows. For the CLR tests, we considered an empirical 0.5% significance threshold to identify putative signatures of selection (Figure 3A). The number and length of candidate selection regions was higher in the dam than the sire line (14 vs. 7; 38.1 Mb vs. 26.1 Mb). Two regions on SSC3 (from 122.6 to 124.9 Mb) and SSC13 (from 140.0 to 146.1 Mb) showed evidence of selection in both lines. For the iHS analyses, we used an empirical 0.1% significance threshold to detect putative signatures of selection (Figure 3B). We detected 14 and 16 candidate regions of selection in the dam and sire line, respectively, encompassing 28.5 Mb and 32.5 Mb. Four regions on SSC1 (from 51.1 to 53.7 Mb, from 142.7 to 146.2 Mb), SSC6 (from 64.9 to 69.3 Mb) and SSC13 (from 148.0 to 150.6 Mb) were shared between both lines.
Considering both statistics, we detected more signatures of selection in the dam than sire line (28 vs. 23). Only 6 regions, detected by either CLR or iHS, overlapped between both lines. A strong signature of selection was detected in both lines with both methods on SSC13 between 140 and 152.4 Mb. The candidate region encompassed 125 genes (Supplementary Table S2, Supplementary File 5), as well as 63,480 and 55,835 polymorphic sites in the dam and sire line, respectively, precluding to readily prioritize candidate genes and variants responsible for the sweep.
Reference-based genotyping from low-coverage sequencing data
In order to investigate if the 70 sequenced key ancestor animals may serve as a reference panel for genotyping by low-coverage sequencing, we sequenced the genomes of 175 pigs (84 from the sire line and 91 from the dam line) at low coverage using Gencove’s low-pass sequencing solution. The pigs also had Illumina PorcineSNP60 BeadChip-called genotypes. A principal component (Supplementary Fig. S4, Supplementary File 6) analysis of a genomic relationship matrix constructed from microarray-derived genotypes showed that the 175 pigs cluster with the 70 key ancestor animals.
Following quality control, we aligned a median number of 16,153,314 (between 5,950,534 and 21,168,683) read pairs (2 x 150 bp) to the porcine reference genome, achieving an average depth of coverage of 1.11-fold (from 0.38 to 1.51). On average, 54% of the reference nucleotides were covered with at least one read. Following the reference-guided low-pass sequence variant genotyping approach (GLIMPSE) proposed by Rubinacci et al. [6], we utilized the haplotypes of the 70 sequenced key ancestor animals as a reference panel to call genotypes at 22,618,811 polymorphic sites in the 175 low-pass sequenced samples.
We assessed the accuracy of genotyping by low-pass sequencing based on Illumina PorcineSNP60 BeadChip-called genotypes at 54,600 SNP, for which we were able to determine reference and alternate alleles. Of the 54,600 SNP, 6,176 and 965 were fixed for the reference and alternate allele, respectively, in the 175 pigs according to the array-called genotypes. Of 48,424 SNP that were either polymorphic or fixed for the alternate allele, 46,001 (94.99%) were also among the GLIMPSE-imputed genotypes. 2,423 SNP had microarray-derived genotypes but were missing in the GLIMPSE-imputed genotypes because these SNP were missing in the haplotype reference panel constructed from the key ancestor animals.
The genotype concordance, non-reference sensitivity and non-reference discrepancy between GLIMPSE-imputed and array-called genotypes at 46,001 autosomal SNP was 97.60, 98.73 and 3.24% in 175 low-pass sequenced pigs (Table 4, Figure 4A). When the sequence variant calling of the 175 samples was performed together with the 70 key ancestor animals using the multi-sample approach implemented in the GATK, all concordance metrics were considerably worse. Although, Beagle imputation improved the genotype calls of GATK for the low-pass sequenced samples, the genotype concordance and non-reference sensitivity was lower and non-reference discrepancy higher using GATK than GLIMPSE. Using the GLIMPSE approach improved the genotype concordance over GATK filtered & Beagle imputed variants by 13.83% and this improvement is mostly due to a lower non-reference discrepancy (Table 4).
Table 4
Accuracy of sequence variant genotyping in low-coverage (1.11-fold) sequencing data
Variant genotyping approach
|
Genotype concordance
|
Non-reference sensitivity
|
Non-reference discrepancy
|
GLIMPSE
|
97.60
|
98.73
|
3.24
|
GATK raw
|
75.90
|
52.35
|
30.20
|
GATK filtered
|
75.89
|
52.36
|
30.22
|
GATK filtered & imputed
|
85.74
|
96.56
|
19.34
|
We constructed genomic relationship matrices (GRM) from the microarray-derived and GLIMPSE-imputed genotypes of the 175 sequenced pigs based on a subset of 44,268 SNP that were detected at minor allele frequency greater than 0.01 in both datasets. Both the off-diagonal and the diagonal elements of the GRM constructed from array-derived genotypes had greater variance (σ2diag = 3.37 x 10-3, σ2off = 9.34 x 10-3) than corresponding elements of the GRM constructed from low-pass sequencing data (σ2diag = 3.30 x 10-3, σ2off = 9.15 x 10-3). While the correlation of the off-diagonal (r = 0.99) and diagonal (r = 0.96) elements was high between both GRMs, the values of the diagonal elements were higher for all samples using the GLIMPSE-imputed than microarray-derived genotypes (Figure 4B and 4C). On average, the 175 boars were homozygous for 65.58 ± 1.39% and 67.27 ± 1.49% of the 44,268 SNP when the genotypes were called from the microarray and low-pass sequencing data, respectively.