Sampling
The blood samples from all the 60 individual ducks (10 per breed) were collected from the wing vein using vacuum tubes containing EDTA-K2 as an anticoagulant. The spot-billed ducks, mallards and Fenghua ducks were captured in Fenghua City, Zhejiang Province, China (29°35′ N, 121°24′ E). The Shaoxing and Shanma ducks were collected in Zhuji City, Zhejiang Province, China (29°38′ N, 120°10′ E), and the Cherry Valley Pekin duck were raised in Huzhou City, Zhejiang Province, China (30°41′ N, 120°19′ E). From Shaoxing ducks and mallards, 3 randomly selected ducklings were killed by rapid decapitation and sterile dissection, and muscle and liver tissues were sampled and immediately snap frozen in liquid nitrogen.
Sequencing and Quality Control
A total of 60 ducks, which were sampled from Eastern China, were sequenced on the Illumina HiSeq 2000 platform (Illumina, San Diego CA, USA). We generated a total of 401.491 Gb of raw sequence data (supplementary table S1).
To make sure reads reliable and without artificial bias in the following analyses, raw reads of fastq format were firstly processed through a series of quality control (QC) procedures in-house C scripts. The low-quality reads were filtered out based on the following [28]:
(1) Reads with ≥10% unidentified nucleotides (N);
(2) Reads with >50% bases having phred quality <5;
(3) Reads with >10 nt aligned to the adapter, allowing ≤10% mismatches;
(4) Putative PCR duplicates generated by PCR amplification in the library construction process (read 1 and read 2 of two paired-end reads that were completely identical).
Consequently, 387.88 Gb were retained for assembly, of which the quality of 96.06% and 91.52% of the bases were ≥Q20 and ≥Q30, respectively.
Reads Mapping and SNP Calling
The remaining high quality reads were mapped to the mallard (Anas platyrhynchos) reference genome [29] using Burrows-Wheeler Aligner (Version: 0.7.8) [30] with the command line was ‘BWA mem -t 4 -k 32 –M’. SAMtools was used to remove the duplicated reads to reduce mismatch generated by PCR amplification before sequencing.
After alignment, we used SAMtools [31] to carry out SNP calling. The ‘mpileup’ command was used to identify SNPs with the parameters as ‘-q 1 -C 50 -S -D -m 2 -F 0.002’. The following filtering steps were applied in order to obtain high quality SNPs:
(1) Quality score > = 20.
(2) Coverage depth > =2 and <=1000.
Annotation of Genetic Variants
Using the ANNOVAR package [32], 2,809,077 high-quality SNPs were annotated according to the genome. Based on the genome annotation, SNPs were classified into several categories, such as exonic regions, intronic regions, splicing sites, upstream and downstream regions and intergenic regions. The SNPs in coding exons were fall into two categories: synonymous SNPs and nonsynonymous SNPs.
Principal Component Analysis
The software GCTA [33] was used for PCA with the population scale SNPs. The significance level of the eigenvectors was determined using the Tracey-Widom test to clarify the phylogenetic relationship among 60 individuals. The first three significant components were plotted (supplementary fig. S2), and the discrete points to a degree reflect the real structure of population.
Phylogenetic Tree
We also inferred an individual-based neighbor-joining (NJ) tree in TreeBeST (http://treesoft.sourceforge.net/treebest.shtml#inno) based on the p-distance. The bootstrap was set to 1,000 times to evaluate the reliability of branch..
Population Structure and TreeMix analysis
The population genetic structure of 60 individuals was inferred by FRAPPE [34]. We set the number of cluster (K) from 2 to 6 and ran analysis with 10,000 iterations. Migration events among the populations were inferred using TreeMix [35]. We converted plink output for TreeMix by plink2treemmix module in TreeMix software. We set the number of migration events from 1 to 3. For each number of migration events, we ran 100 random input orders.
Linkage Disequilibrium Analysis
We compared the pattern of linkage disequilibrium (LD) among 6 breeds using the 2.8 million high-quality SNPs. To estimate LD decay, we calculated the squared correlation coefficient (r2) between pairwise SNPs using the software Haploview [36]. The average r2 value was calculated for pairwise markers in a 500-kb window and averaged across the whole genome.
Effective Population Size
We used a hidden Markov model (HMM) of pairwise sequentially Markovian coalescence (PSMC) to reconstructed demographic history of 60 individuals. Firstly, we called genotype each individual using the package SamTools [30] based on the command ‘mpileup’ with the parameter ‘-C 50 -D -S -m 2 -F 0.002’. Then, we performed the program `fq2psmcfa' with the parameter ‘−N30, −t15, −r5 and −p ‘4+25*2+4+6’’ to convert the consensus sequence to the required input format. A mutation rate (μ) of 1.6×10-9 per bp per generation [7] and a generation time of 1 year were used for analysis. In addition, we applied a bootstrapping approach, repeating sampling 100 times on two samples to estimate the variance of simulated results.
Selective Sweep Analysis
The nucleotide diversity (θπ), population-differentiation statistic (FST), Tajima’s D statistic and Watterson estimator (θW) were calculated with sliding windows of 40 kb that had 20 kb overlap between adjacent windows. The putative genomic regions under positive selection during domestication were extracted based on being the highest differences in genetic diversity (log2(θπ ratio)) and the top 5% of FST. We identified a total of 665 potential selective-sweep regions overlapping with 387 candidate genes in merging domestic ducks and 491 potential selective-sweep regions overlapping with 311 candidate genes in Shaoxing ducks, which would be used for subsequent analysis and discussion.
Functional Enrichment Analysis
Gene Ontology term enrichment analysis was processed with those selective genes by goseq packages in R software. We used the GOSeq R package, in which gene length bias was corrected, to perform GO and functional pathway analysis on the candidate genes. The Gene ontology (GO) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways with a Benjamini adjusted P-values less than 0.05 were considered significantly enriched.
RNA-seq and Gene Expression Analysis
To infer whether the genes under selection could also affecting gene expression between Shaoxing ducks and mallards, we compared gene expression in breast muscle, liver and cerebellum between this two groups. 3 females from Shaoxing ducks and mallards respectively were selected for transcriptomics analysis.
All samples were individually sequenced by Illumina HiSeq 4000 sequencing platform. Perl scripts was used to ensure the quality of raw data. The reference genomes and the annotation file were downloaded from ENSEMBL database (http://www.ensembl.org/index.html). We used Bowtie/Bowtie 2 to build the genome index and TopHat v2.0.12 to map clean data to reference genome. And HTSeq v6.0 was used to count the number of fragments for each gene in each sample. The expression level of genes in each sample was estimated by FPKM (Fragments Per Kilobase Per Million Mapped Fragment). We used DEGseq v1.18.0 to analyze differential gene expression between Shaoxng and Shanma ducks. Genes with q≤0.05 and |log2Ratio|≥1 are identified as differentially expressed genes.