Genome-Wide Analysis Reveals Genetic Structure and Selective Sweeps in Chinese Indigenous Pig Breeds

Background Chinese indigenous pigs exhibit considerable phenotypic diversity, but their population structure and the genetic basis of agriculturally important traits have not been explored. Results Here, we sequenced the whole genomes of 24 individual pigs representing 22 breeds distributed throughout China. For comparison with European and commercial breeds (one pig per breed), we integrated seven published pig genomes with our new genomes. Our results showed that pig domestication occurred at three places in Southeastern Asia, namely the Mekong region, the middle to downstream regions of the Yangtze River, and Tibetan highlands. Moreover, we demonstrated that classic morphological characteristics such as coat color are not consistent with genetic data. We found that genetic material from European pigs likely introgressed into ve Chinese breeds. Two new subpopulations of domestic pigs have been identied in South and North China that encompass morphology-based criteria. The Southern Chinese subpopulation comprises the classical Southern China Type and part of the Central China Type, whereas the Northern Chinese subpopulation comprises the North China Type, the Lower Yangtze River Basin Type, the Southwest Type, the Plateau Type, and the remainder of the Central China Type. Eight haplotypes and two recombination sites were identied within a conserved 40.09 Mb linkage-disequilibrium block on the X chromosome. Potential selection and domestication signatures were identied, mainly inuencing body size, along with adaptations to cold and hot temperature environments. Conclusions Our ndings provide insights into the phylogeny of Chinese indigenous pig breeds, and will be of enormous benet in identifying benecial genes to develop superior pig breeds.

With the development of genome sequencing and SNP chip technologies, the past decade has seen an increase in data on genome-wide variation. Indeed, comparative genomic analyses have identified genes involved in a wide variety of agriculturallyimportant traits, including coat color [9,10], ear morphology [9], body size [11][12][13], meat yield [11], and disease resistance [11]. DNA-based techniques provide a good opportunity to clarify Chinese pig classification. Recent studies investigated only a few breeds with highly desirable agronomic traits, and they tended to focus on genetic mining [10,13] and on identifying selective sweeps during domestication [14,15]. Such research included genome-wide analyses of domestic breeds (e.g., Tibetan, [14]; Tongcheng, [10]; Enshi Black, [13]; Rongchang [16]) with a focus on tolerance to harsh environments, high fertility, and body size. Currently, too few Chinese pig breeds have been studied to provide a conclusive investigation of porcine evolution in China. Specific loci and genes underlying common phenotypic variation among Chinese domestic pig breeds have not yet been studied.
To address these deficiencies, we performed whole-genome resequencing of pigs representing 22 breeds distributed across different geographical areas in China. This new sequence data was integrated with publically-available sequence data from seven other pig breeds, including European ones. We uncovered population structures among Chinese indigenous pigs, along with selection and domestication signatures associated with body size and temperature-related adaptations.

Results And Discussion
Sequencing and variation identification Twenty-four animals representing twenty-two pig breeds were individually resequenced (Additional file 1: Figure S1 and Table S1). The average effective sequencing depth was 17.54× and genomic coverage was 94.74% (Additional file 1: Figure S2 and Table S2), resulting in a high-quality resequencing resource for pigs. Page 4/25 To these data, we integrated publically-available genomic data from seven pigs of wild and commercial European and Chinese breeds (Additional file 1: Table S1). The combined dataset had 14.09 billion high-quality raw reads (1281.12 Gb raw bases, >90% Q30 bases) (Additional file 1: Figure S3).
A strict quality-filter pipeline resulted in 19 685 697 single-nucleotide polymorphisms (SNPs) from 31 pigs (Additional file 1: Table S4). Of these SNPs, 13 430 360 (68.22%) were in intergenic regions, 1 223 834 (6.22%) were 5-Kb upstream or downstream of gene regions, and 5 031 503 (25.56%) were within gene regions. The last group contained 46 618 non-synonymous (NS) and 53 028 synonymous (S) SNPs (Additional file 1: Figure S4), leading to a NS/S ratio (ω) of 0.88, which is higher than a previously reported ratio of 0.68 [14]. This result suggests that positive selection may be a more important factor in the evolution of domestic pigs in China than previously reported.
Homozygous SNPs were more common in all European pigs than in Chinese pigs, especially in two European wild boars that had Hom/Het SNP ratios of 1.627-4.460 (Additional file 1: Table S4). Furthermore, with the exception of the Large White (LW) pig, the Hom/Het indel ratio was consistent with SNP variations between European and Chinese pigs (Additional file 1: Table S5). These results suggest that population bottlenecks may be responsible for the reduced genetic diversity observed in European pigs compared with Chinese pigs [17].
Additionally, numerous specific alleles appear to have been fixed in European and Chinese populations after isolation.

Population structure and introgression
We constructed a non-rooted phylogenetic tree based on 9.2 million population SNPs ( Fig. 1a) to understand the genetic relationships and structure among Chinese pigs with different geographical distributions. The estimated phylogeny revealed that the primary division was between European and Chinese pigs, consistent with previous studies [14,17].
Our results lend further support to the hypothesis that pig domestication occurred independently in western Eurasia and East Asia. Moreover, Chinese domestic breeds split on geographical grounds, namely into South and North China (CnSouth and CnNorth) subpopulations. The former encompassed all individuals from the classical South Chinese Type and some of the Central China Type. The latter comprised the remainder of Central China Type and all those from the remaining four types (North China, Lower Yangtze River Basin, Southwest, and Plateau) (Fig. 1a). The genetic relationships among Chinese indigenous pig breeds were remarkably congruent with geographic distribution.
Respectively, South Chinese and Lower Yangtze River Basin Types were closest to breeds Dahua Bai (DH; Xingning City, Guangdong Province, South China) and Jinhua (JH; Jinhua City, Zhejiang Province, Yangtze River lower reaches) ( Fig. 1a and Additional file 1: Table   S1). Notably, DH and JH are considered to be of the Central China Type, a consideration based on coat color phenotypes [4].  Figure S1) confirmed the three Southeast-Asian centers of pig domestication originally identified through mitochondrial DNA. These centers are the Mekong region [18], middle and downstream regions of the Yangtze River [19,20], and Tibetan highlands [18,20]. Thus, our study provides evidence that the classical classification scheme [4,21] requires updating with genetic information.
We examined nucleotide variation (θπ and θw) to measure genetic diversity across three populations (wild pigs, European domestic pigs, and Chinese domestic pigs) and the two Chinese subpopulations (CnNorth and CnSouth). Tested populations were more genetically-diverse (θw/Kb: 2.01-2.80, θπ/Kb: 2.12-3.11; Additional file 1: Table S8 and Our results support the hypotheses of expansion from a relatively small ancestral population [14,17] and a large reduction of effective population size under intensive breeding [26]. Wild pigs had slightly higher nucleotide diversity than did Chinese domestic pigs (Additional file 1: Figure S7f). Within a short LD decayed distance (<30 Kb), wild pigs had lower r 2 than Chinese pigs, but higher r 2 at long distances (≥30 Kb), suggesting narrow bottlenecks in wild-pig evolutionary history. Finally, CnNorth and CnSouth exhibited similar nucleotide diversity (θw/Kb: 2.68, θπ/Kb: 2.82) (Additional file 1: Table S8 and Figure S7d-f and Figure S8b).

Selective signatures from CnSouth and CnNorth
To reduce the impact of genetic admixture when analyzing selective footprints, admixed individuals MP, LWU, NX, and YH were removed. We then identified selective regions on autosomes and the X chromosome of CnSouth and CnNorth animals to evaluate the effects of local adaptation.
The nearly ten-fold greater length of genomic regions (note: windows with distances ≤50 Kb were merged into a single selected locus) indicated stronger selective pressure in CnNorth than in CnSouth (10.82 Mb versus 1.06 Mb), but the former had approximately five-fold fewer genes (167 versus 863 genes) (Fig. 2). This outcome probably reflects differences in selective pressure on the two subpopulations that caused variations in distribution, size, and gene content of genomic footprints. Our results lend further support to the native environment as a major force for rapid evolution, rather than mixture effects.
In CnSouth, 863 genes were under selection, most with functions linked to heat adaption. These include those associated with a specialized organelle, the lysosome, intestinal permeability (tight junction), central nervous system regulation (GABAergic synapse), as well as glycan biosynthesis and metabolism (glycosaminoglycan degradation) ( Table 1 and Additional file 2: Table S9). Heat stress reduces intestinal blood flow because splanchnic-region blood is diverted to perfuse skin for heat dissipation [27,28]. This decrease in blood flow can cause dysfunction of the intestinal epithelial tight junction barrier, resulting in inflammation from the passage of unwanted substances (e.g., endotoxins and pathogenic bacteria) into the internal environment [29,30]. Thus, maintaining tight-junction function is crucial for proper intestinal health and metabolism, protecting CnSouth against heat stress-induced problems. Three selective genes (AQP10, HspBP1 and GABA) , functional genes that protect cells against heat stress [31][32][33], were observed in CnSouth pigs.
The 167 strongly selected genes in CnNorth pigs were related mainly to vasopressinregulated water reabsorption, bladder cancer, as well as gastric-acid secretion (Table 1), suggesting selection for extremely cold environments. Cold temperatures increase vasopressin secretion in the kidney, elevating water reabsorption while also boosting blood content and osmotic pressure to adequately supply blood to the heart and brain [34]. After water reabsorption, urinary concentration is very high, and therefore a potential risk factor for bladder cancer [35]. Moreover, cold exposure stimulates gastric acid secretion and causes stress ulcerations [36]. Thus, selection on genes related to regulating gastric acid secretion might prevent stomach damage. Moreover, we identified three important candidate genes (AQP3, AQP7, and CA10) for cold adaptation (Additional file 2: Table S9).
Genes AQP3 and AQP7 are water and glycerol transporters in the skin, protecting against desiccation of the stratum corneum during cold stress [37]. CA10 is associated with the molecular structure of hemoglobin and allows oxygen to reach respiring cells more easily [38].
Characterization of a large-scale LD block in the X chromosome Using SNP data, we identified a large-scale LD block (40. found only in northern Chinese domestic pigs (Fig. 3). These LD patterns indicate that northern Chinese domestic pigs exhibit more haplotype diversity, and they corroborate findings of a 14 Mb X-linked sweep region [12,15].
We then detected local breakdown spots in the recombinant haplotype set using all SNP markers from the LD block. We identified two intervals of recombination: spot 1 (left) was  Figure S11), a highly conserved portion of haplotype N. This region was only in northern Chinese domestic pigs and is inside the 14 Mb X-linked sweep [15]. Overall, we found the eight haplotypes within the 40.09 Mb LD block, and a shorter conserved region (10.40 Mb) than described in previous reports [12,15,39]. This difference between studies is likely due to our use of high-density genetic markers from data with high sequencing depths, and from obtaining a greater number of Chinese pig breeds.  [40]. Meanwhile, all overlapping Reproduction QTLs were assigned to the Reproductive organ, reflecting between-subpopulation (CnNorth, CnSouth, European) differences in reproductive characters. In autosomes, overlapping Reproduction QTLs containing selection regions differed greatly between CnSouth (95/1259, 7.55%) and CnNorth pigs (1/871, 0.11%) (Additional file 1: Table S13).
Across CnNorth and CnSouth pigs, we identified 4169 population-level indels in CDS regions of functional genes. After filtering out markers with χ 2 < 5 for one group, 2711 indels remained. Six differed significantly between the two subpopulations, and five of these were distributed in three gene loci (ENSSSCG00000012830, HUWE1, and ITIH5L) on the Page 10/25 10.40 Mb conserved region (Additional file 1: Table S14). The first locus contained three indels that were matched against the InterPro database to reveal two specific cold-shock protein domains (IPR002059 and IPR011129). Variants of these genes in the CnNorth pigs were also found in northern Chinese wild pigs and European domestic and wild pigs.
We next selected the top 100 SVs out of 64 876 population-level SVs that exhibited significantly non-random distribution (χ 2 test with FDR correction, P < 0.01). Thirty-four of these SVs were located in the X chromosome (Additional file 2: Table S15 Table S16) on EDA and found that they were fixed only in CnNorth pigs. The EDA signaling pathway is involved in ectodermal-organ (hair, teeth, and exocrine glands) development [41,42], and EDA defects result in Tooth Agenesis [42]. Our findings are consistent with archaeological evidence of different tooth structural characters between CnNorth and CnSouth pigs [4].

Identification of candidate genes for body size
Our sample was split into small pigs (adult body length ≤100 cm, height ≤ 50 cm; N = 7) and large pigs (adult body length ≥ 120 cm, height ≥ 65 cm; N = 7), based on early phenotype characterization records [21] and our own measurements (Additional file 1: Table S3). We then identified 115 nonsynonymous substitutions, distributed in 95 gene regions, that varied in allele frequency between large versus small pigs (>80% in one group, approaching fixation; <20% in the other) (Additional file 1: Table S17). These nonsynonymous substitutions were putative candidate polymorphisms that resulted in size differences. Indeed, two genes (LEPR and FANCC) overlapping with nonsynonymous substitutions are reported as associated with body growth and development in some mammals [43,44]. In humans, impaired LEPR function exerts a strong negative effect on ponderal index at birth and height in adolescence [44]. Likewise, FANCC plays a major role in skeletal formation, and thus affects human height [45,46].
We then analyzed differences (χ 2 -test with Bonferroni's correction) in frequency of indels and SVs between large and small pigs, to understand their effects on body size. We found significant (P < 0.05) between-size-group differences for 10 indels and 20 SVs, located within 7 and 10 functional genes, respectively (Additional file 1: Tables S18 and S19). Among small pigs, we identified a 4 bp insertion in the third exon of COL1A1.
COL1A1 is an α1(I) protein chain of type I collagen and a major structural component of bone. Nonfunctional COL1A1 markedly reduces skeletal mineral density and body height [47,48]. We also found a 430 bp deletion in the third intron of the gene encoding propionyl CoA caboxylase α subunit (PCCA). A genetic defect in PCCA causes propionic acidemia, a condition that can lead to bone disease and growth failure [49].

Conclusions
In this population-genomic study on Chinese pig breeds, we performed whole-genome

Samples
To clarify the genetic structure of Chinese pigs across different geographical locations, we selected individuals that represent all six Chinese indigenous breeds [4]: South China (n = 9), North China (n = 3), Lower Yangtze River Basin (n = 2), Central China (n = 3), Southwest (n = 1), Plateau (n = 2). We also included samples from southern and northern Chinese wild pigs (n = 4), as well as European wild and commercial pigs (n = 7) (Additional file 1: Figure S1 and Table S1). Altogether, data from 31 individual animals were used in this study: (i) 24 sampled from 22 breeds, which were handled by the South China Agricultural University, Guangzhou, People's Republic of China (Additional file 1: Table S2 and Figure S2) and (ii) seven (one pig per breed) downloaded from the Wageningen University Porcine Re-sequencing Phase 1 Project (http://www.ebi.ac.uk/ena/data/view/ERP001813) [12,17] with the highest sequencing depths to supplement the breeds sampled here (Additional file 1: Table S2). Seven small pigs and seven large pigs were used to detect candidate genes for body size (Additional file 1: Table S3). Body size data were obtained for 14 pigs, 11 from the book Animal genetic resources in China: pigs [21], and three were measured according to the technical specifications for the registration of breeding pigs (NY/T 820-2004, 2004). A completed ARRIVE guidelines checklist is included in Checklist S1.

DNA isolation and genome sequencing
Genomic DNA was extracted from ear tissue of live collection using a phenolchloroform-based method. For each sample, 1-15 µg of DNA was sheared into 200-800 bp fragments using the Covaris system (Life Technologies). Fragments were then treated according to the Illumina DNA-sample-preparation protocol. For library construction, fragments were end-repaired, A-tailed, ligated to paired-end adaptors, and PCR-amplified with 500 bp inserts. Sequencing was performed to generate 100 bp paired-end reads on the HiSeq 2000 platform (Illumina), following the manufacturer's protocol.

Sequence alignment and genotype calling
Filtered reads were aligned to the Wuzhishan pig draft genome assembly (minipig_v1.0) [50] using the Burrows-Wheeler Aligner [51]. This genome was selected as the reference [6,50] after considering the geographical distance and genetic divergence among the 31 breeds (Additional file 1: Table S1 and Figure S1). To detect structural variations, we followed an existing method [54], with some modifications. Reads were assembled into contigs and scaffolds using default parameters in SOAPdenovo. The assembled scaffold was mapped to the reference genome in BLAT [55], with the -fastmap option.

Phylogenetic and population genetic analyses
Genetic structure was inferred from high-density SNP data in FRAPPE [56], a program that applies maximum likelihood and expectation-maximization to estimate individual ancestry and admixture proportions. To explore individual convergence, we predefined the number of genetic clusters from K = 2 to K = 5. The maximum iteration of the expectationmaximization algorithm was set to 10,000.
A phylogenetic tree was generated from population-level SNPs in TreeBeST (http://treesoft.sourceforge.net/treebest.shtml), under the p-distances model. Populationlevel SNPs were then subjected to PCA in EIGENSOFT [57], and eigenvectors were obtained using the R function eigen.
To evaluate LD decay, Haploview [58] was used to calculate the squared correlation (r 2 ) between any two loci. Average r 2 was calculated for pairwise markers in a 5 Kb window and averaged across the whole genome.

SNP diversity and selective sweep analysis
Within-population, whole-genome SNP diversity was estimated with average pairwise divergence (θπ) and Watterson's estimator (θw) [59]. These two parameters and Tajima's D [60] were estimated using sliding windows of different sizes (10 Kb, 100 Kb, and 500 Kb) with 50% overlap between adjacent windows. Parameters were calculated per window with an in-house PERL script. A representative window of 500 Kb was used to visualize wholegenome patterns. Next, population differentiation (F ST ) was calculated [61].
Regions and genes under artificial or natural selective sweeps should differ significantly in diversity across groups (with one being less diverse). We therefore calculated the ratio of genetic diversity between South and North groups (πw/πc) in 50-Kb sliding windows with a step size of 25 Kb. This ratio was used to identify regions with significantly fewer polymorphisms per group. Windows containing the top 5% of πw/πc values were considered candidate regions of significantly lower diversity that likely experienced selection sweeps. If πw was <0.002, the window was removed from the analysis. Windows with distances ≤50 Kb were merged into a single selected locus.
Gene and QTL annotation

Introgression analysis
Methods followed previous studies [62]. We applied a likelihood ratio test to study the ancestral contribution of groups to the genome of each sample. All putative introgressions between group pairs were examined. For every 100 Kb window containing at least 10 SNPs and when at least three comparisons were possible per group, we calculated the ratio of average sharing per variety with itself and with another group. Regions with an average sharing ratio of >0.8 were defined as introgressions. Shared introgression frequency was plotted and tabulated. Introgression length and number per sample were also tabulated.
Regions of extensive haplotype sharing (≥90% shared SNPs) were considered introgression regions for each group pair.

Consent for publication
Not applicable Availability of data and material The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.

Tables
Checklist S1: A completed ARRIVE guidelines checklist for reporting animal data in this manuscript.(PDF 1122kb)  Distribution of θπ ratios and Fst values and selected regions from CnSouth and CnNorth. Red and green data points represent the selected windows for CnSouth and CnNorth, respectively.