Animals
Ear tissues from four pigs of three breeds were collected, including an Asian wild pig (A1), two unrelated Diannan small-ear pigs (D1 and D2) and a Tibetan pig (T1). Total DNA was extracted by QIAamp DNA Investigator kit (QIAGEN, Hilden, Germany) following the manufacturer’s instruction. DNA quality was evaluated by spectrophotometry and agarose gel electrophoresis. The guidelines of the experimental animal management of China Agricultural University (CAU) were followed, and the experimental protocols were approved by the Experimental Animal Care and Use Committee of CAU.
DNA sequencing
DNA templates were ultrasonically sheared using a Covaris E220 (Covaris, Woburn, USA), and were prepared for DNA libraries following the NEBNext Ultra DNA Library preparation protocol. Multiple Ampure Bead XP cleanups (Beckman Coulter, Brea, CA, USA) were conducted to remove any adapter dimer that might have developed. The quality and concentration of libraries were determined on an Agilent Bioanalyzer 2100 (Agilent Technologies, Santa Clara, CA). Subsequently, the quality-controlled genomic library for each sample was PE100 sequenced using the Illumina HiSeq 2000 sequencing system.
The traditional sequencing approach, Sanger sequencing, was also performed on the samples to represent a gold standard in variation detection. The mitochondrial DNA was PCR-amplified by the 16 primer pairs used in the previous study [26]. Amplicons were bi-directionally sequenced using BigDye Terminator version 3.1 technology on an ABI 3730 system (Applied Biosystems, Foster City, CA). Mitogenomes were analyzed by the softwares of MEGA6 [27] and DnaSP v5 [28].
Quality control
Read quality was assessed using the FastQC software focusing on base quality scores and sequence length. To ensure quality, a quality control of NGS data was conducted. Adapters and low-quality bases were removed by the Clip&Merge software. Reads were filtered to exclude those of a nucleotide length of shorter than 35bp and a Phred quality score of lower than 20. And then forward and reverse reads were merged into single sequences if they overlapped by at least 8 bp. These tools were integrated into the EAGER-pipeline [29] .
Alignment to different reference sequences
Mapping was performed by BWA [30] with default “aln” and “samse” parameters, except considering the overall number of mismatches tolerated in the alignment by setting the expected fraction of misalignments to 0.04 (-n). Three reference sequences on behalf of different genetic distances between the reference sequences and the sample were used as follows.
(1) The commonly used reference sequence (CU-ref), which referred to a frequently-used sequence from RefSeq project at NCBI [31] database, and here was NC_000845.1 used.
(2) The breed-specific reference sequence (BS-ref), which referred to a sequence of the same breed as the sample downloaded from NCBI database. For the Asian wild pig, KP765605.1 was used as BS-ref, which is from a Changbai mountains wild boar and 16720 bp in length; for the Diannan small-ear pigs, KM044240.1 was used, which is a complete mitogenome of 16720 bp obtained from a Diannan small-ear pig in Yunnan Province; and for the Tibetan pig, KM073256.1 was used, which is a Tibetan complete mitogenome of 16710 bp.
(3) The sample-specific reference sequence (SS-ref), which referred to the consensus sequence obtained from the NGS reads of the sample through de novo assembly.
The produced BAM files from BWA were filtered for sequences with a mapping quality of at least 30. Duplicate removal was carried out on those reads that showed identical start and end coordinates only by the DeDup software. The tools were integrated into the EAGER-pipeline [32].
The qualities of mitogenome mapping were measured by the mapping ratio, average coverage and run time. The mapping ratio refers to the ratio of reads mappable to the mitochondrial reference. The average coverage means the number of times the mitogenome is sequenced. The runtime counts the consumed CPU time consumption of mapping processing, instead of using elapsed time, which includes for example, waiting for input/output operations or entering low-power mode.
De novo assembly for SS-ref
In order to produce the optimal SS-ref, three modified de novo assembly strategies were compared based on the NGS data from A1. They were different in the NGS read sets, including all clean read sets, homologous read sets filtered by BLAST or by BWA mapping. De novo assembly was performed using SOAPdenove2 [33] by default parameters with the best k-mer size estimated by KmerGennie [34]. The detailed assembly information was as follows.
(1) “Denovo”: the de novo assembly directly by all clean [22, 34]. All clean data from NGS were put into SOAPdenovo2, and produced contigs. Then contigs were aligned to NC_000845.1 using MEGA6, and produced a consensus [22].
(2) “BLAST_denovo”: the de novo assembly by homologous read sets filtered from clean data by BLAST. Clean data were filtered against a reference panel composed of all complete Sus Scrofa mitochondrial genome sequences (219) downloaded from NCBI database by the BLAST tool with the blastn command, and then these sets were put into SOAPdenovo2 for de novo assembly.
(3) “BWA_denovo”: the de novo assembly by homologous read sets filtered from clean data by BWA. Clean reads were mapped against the above-mentioned reference panel to filter homologous sequences of each sample by BWA, and then these sequences were assembled by SOAPdenovo2.
To assess the three de novo assembly strategies, the indicators including the N50, consensus length, coverage and sequence polymorphism resulted from each strategy were measured.
SNP calling of mitochondrial genomes
Three variation callers, i.e, SAMtools 1.3.1 [35], VarScan 2.3.9 [36] and GATK 3.7 [37], were respectively applied with different parameter combinations detailed in Table 2 to bam files resulting from mitogenome alignments. These parameters were selected to ensure comparability among different callers. The minimum base quality required to consider a base for calling was set to be 30.
The performance of SNP calling was evaluated in the overall genotype concordance by comparing the NGS results with the Sanger data, with the assumption that the Sanger sequencing gave the correct calling [38-40]. Only the positions where a Sanger sequence were available were kept, and the concordance SNPs between Sanger and NGS data for each individual were considered as true SNPs, while the discrepancies were considered as errors. When NGS data identified an alternate homozygote not observed by Sanger sequence, it was considered as a false positive. Accordingly, when NGS data did not see an alternate homozygote found with Sanger sequence, it was considered as a false negative. The number of true SNPs, false positives and false negatives were analyzed.