Young SINE retrotransposon insertions are highly polymorphic in the pig genomes
Three pig-specific SINE families (SINEA, SINEB, and SINEC) with different evolutionary histories were identified in a previous study, which showed that SINEA is the youngest family and that some of its subfamilies have remained active within the last 10 million years [10]. Eleven SINEA subfamilies (A1–A11) were identified previously. They display high sequence similarity with minor differences: SINEA1–SINEA3 share six specific nucleotides, SINEA1 and SINEA2 share two specific nucleotides, and SINEA1 contains the longest polyA sequence (Additional file 1: Fig. S1), a feature that is unique among the subfamilies and might contribute to its transposition activity. Insertion age analysis revealed that SINEA1–SINEA3 displayed activity in the last 2 million years (<2 Mya, million years ago); the activity of SINEA4 was barely detectable in this period, while the other subfamilies (SINEA5–SINEA11), SINEB, and SINEC showed no activity in this period (Fig. 1A and Additional file 1: Fig. S2). Overall, SINEA1 showed the dominant current activity (<2 Mya), followed by SINEA2, while SINEA3 exhibited very weak current activity, indicating that these subfamilies, particularly SINEA1, might still jump and contribute to genomic variation in pigs.
To investigate the jumping activity of these SINE elements, 1,400 SINE insertions in intragenic regions and 1,400 in intergenic regions of the reference genome were selected randomly from seven SINE subfamilies (SINEA1–SINEA4, SINEB2, SINEB6, and SINEC4) representing different insertion ages, and their polymorphism was predicted by local BLAST searching as described in the methodology. As expected, the predicted polymorphic ratio varied significantly across subfamilies. SINEA1 showed the highest polymorphic ratios, at 22.50% and 26.50% in intragenic and intergenic regions, respectively. The SINEA2 and SINEA3 subfamilies showed polymorphism rates ranging from 5.00% to 12.50%, while the other subfamilies displayed very low insertion polymorphism rates (<2%) (Fig. 1B and Additional file 1: Table S1). Furthermore, 25 predicted polymorphic and 25 predicted non-polymorphic insertions between a non-reference (Meishan) genome and the reference (Duroc) genome were tested by PCR to evaluate the accuracy of local BLAST searching (Fig. 1C). The accuracies of predicting polymorphic and non-polymorphic insertions were 88.00% (22/25) and 84.00% (21/25), respectively (Fig. 1D and Additional file 1: Table S2), indicating that the local BLAST protocol for SINE RIP prediction is reliable. These findings support the conclusion that SINEA1–SINEA3 are still active and can jump within the pig genome, and indicate that SINEA1, the youngest element, is the most active and tends to generate highly polymorphic insertions.
Development of the genome-wide SINE RIP screening protocol
To identify SINE RIPs in all assembled pig genomes (15 non-reference and one reference), we developed a genome-wide SINE RIP mining protocol, summarized in Fig. 2A and described in detail in the methodology. Approximately 100,000 SINEA1–SINEA3 insertions were mapped in each genome by RepeatMasker. On average, more than 95% of these insertions in the non-reference genomes were mapped successfully to the reference genome. By comparing SINE insertion positions between the non-reference and reference genomes, we obtained 263,837 putative SINE RIPs from all genomes; each was submitted to local BLAST searching and checked manually (Additional file 1: Table S3). Ambiguous SINE RIPs were discarded based on their alignment patterns (Fig. 2A), and 94,074 SINE RIPs remained for further analysis (Additional file 1: Table S3).
Because the assembly levels of the non-reference genomes were lower than that of the reference genome, gaps in the non-reference genomes could produce false-positive calls of SINE RIP deletion alleles. Therefore, we discarded predicted SINE RIP deletion alleles that were detected in only one non-reference genome, and used PCR to verify those present in two, three, and four non-reference genomes (Fig. 2B). As expected, we found a high rate of false positives when the SINE deletion alleles occurred in only two or three non-reference genomes, with prediction accuracies of only 32.14% and 37.50%, respectively, so these sites were removed from further analysis. However, the accuracy improved markedly (81.25%) when SINE RIP deletions were detected in four non-reference genomes. The SINE RIP insertion alleles identified in one, two, 14, or 15 non-reference genomes were also verified by PCR, and all of them showed high accuracy (>80%) (Fig. 2C; Additional file 1: Table S4). These data indicate that SINE deletion alleles identified in more than three non-reference genomes, and all SINE RIP insertion alleles (detected in one or more non-reference genomes), were at least 80% accurate.
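The retention rule derived from these PCR results can be written as a short filter. The following is a minimal sketch only: the `PutativeRip` record, its field names, and the toy locus identifiers are hypothetical and are not taken from the actual pipeline, which is described in the methodology.

```python
from dataclasses import dataclass

# Threshold suggested by the PCR validation above: deletion alleles seen in
# fewer than four non-reference genomes were often false positives.
MIN_GENOMES_FOR_DELETION = 4


@dataclass(frozen=True)
class PutativeRip:
    locus_id: str          # hypothetical identifier
    allele: str            # "insertion" or "deletion"
    n_nonref_genomes: int  # non-reference genomes in which the allele was detected


def keep_rip(rip: PutativeRip) -> bool:
    """Return True if the putative RIP passes the support filter."""
    if rip.allele == "insertion":
        # Insertion alleles were accurate even when seen in a single genome.
        return rip.n_nonref_genomes >= 1
    if rip.allele == "deletion":
        return rip.n_nonref_genomes >= MIN_GENOMES_FOR_DELETION
    raise ValueError(f"unknown allele type: {rip.allele!r}")


# Toy candidates (illustrative values only).
candidates = [
    PutativeRip("toy-1", "deletion", 1),
    PutativeRip("toy-2", "deletion", 3),
    PutativeRip("toy-3", "deletion", 4),
    PutativeRip("toy-4", "insertion", 1),
]
retained = [rip.locus_id for rip in candidates if keep_rip(rip)]
```

In this toy set only `toy-3` and `toy-4` are retained, mirroring the asymmetric treatment of deletion and insertion alleles.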
Large-scale RIPs generated by SINE jumping in the pig genomes
After removing the inaccurate and redundant RIPs, a total of 36,284 SINE RIPs were obtained at the genome level (Table 1). Then, 230 SINE RIPs were selected randomly for PCR verification: 185 RIPs were confirmed as positive, 30 were false positives, and 15 were uncertain, giving a prediction accuracy of >80% (Fig. 3A, Additional file 1: Table S5). Thus, our genome-wide SINE RIP screening protocol was reliable. Overall, 74.34%, 20.21%, and 5.45% of the SINE RIPs came from the SINEA1, SINEA2, and SINEA3 subfamilies, respectively, which generally corresponds to their age distributions in the genome (Fig. 3B). Furthermore, SINE RIPs were distributed relatively evenly across chromosomes, with an average of 14.5 (range 11.28–21.63) SINE RIPs per 1 Mb window (Fig. 3C, Additional file 1: Table S6). Chromosomes 10, 11, 12, 17, and 18 tended to be slightly enriched for SINE RIPs (>18 RIPs/Mb, Fig. 3C), which is generally consistent with the retrotransposon coverage on each chromosome (Fig. 3D), whereas chromosomes 1, 13, and X tended to be slightly depleted of SINE RIPs (<13 RIPs/Mb, Fig. 3C). The Y chromosome was excluded from the analysis because its highly repetitive content makes sequencing and assembly difficult and leaves too many gaps.
Over 65% of SINE RIPs overlap with genes
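The validation accuracy and the per-window density reported here follow from simple counting. The sketch below assumes that uncertain PCR results are counted as not confirmed (a conservative choice that is an assumption, not a statement from the study), and the positions used for the window count are toy values.

```python
from collections import Counter

WINDOW_BP = 1_000_000  # 1 Mb windows, as in the density analysis


def validation_accuracy(confirmed: int, false_positive: int, uncertain: int) -> float:
    """Fraction of tested loci confirmed by PCR (uncertain counted as not confirmed)."""
    total = confirmed + false_positive + uncertain
    return confirmed / total


def rips_per_window(positions, window_bp: int = WINDOW_BP) -> Counter:
    """Count RIPs per fixed-size window on one chromosome (0-based positions)."""
    return Counter(pos // window_bp for pos in positions)


# Counts reported in the text: 185 confirmed, 30 false positives, 15 uncertain.
accuracy = validation_accuracy(185, 30, 15)

# Toy positions on a single hypothetical chromosome.
window_counts = rips_per_window([120_000, 950_000, 1_200_000, 3_450_000])
```

With the reported counts, `accuracy` is 185/230, just above 0.80, which is consistent with the >80% figure.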
By comparing the genomic position of each SINE RIP with genic regions (NCBI annotated genes and NONCODE annotated lncRNA genes), we found that 66.08% of the SINE RIPs (21,596/32,684) overlapped with genic regions, which represent 23.09% of the total genes. In all, 51.36% of the SINE RIPs (16,787/32,684) overlapped with protein-coding genes, affecting 29.78% (6,154/20,666) of all protein-coding genes, and most of these RIPs (99.09%; 16,635/16,787) are in introns. A further 13.59% of the SINE RIPs (4,443/32,684) overlapped with lncRNA genes, affecting 17.30% (2,504/14,477) of all lncRNA genes, and most of these (96.89%; 4,305/4,443) were also located in introns (Table 2). Furthermore, significant biases in the genic locations of SINE RIPs within lncRNA and protein-coding genes and their transcripts were observed. SINE RIPs tended to be enriched in the first and second introns of protein-coding and lncRNA genes compared with other introns and the flanking sequences (Fig. 3E). In addition, a total of 260 SINE RIPs were identified in the exon regions of protein-coding genes. These SINE RIPs appear to be significantly enriched in the 3′ UTRs (151/260) of mRNAs compared with the 5′ UTRs (98/260) and CDS (8/260) (Fig. 3F).
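The overlap calculation can be illustrated with a simple interval test. This is a sketch under stated assumptions: coordinates are taken to be 0-based and half-open, the function names are hypothetical, and the intervals are toy values rather than real annotations. The actual annotation sources are those named above.

```python
from collections import defaultdict


def build_gene_index(genes):
    """Group gene intervals (chrom, start, end) by chromosome."""
    index = defaultdict(list)
    for chrom, start, end in genes:
        index[chrom].append((start, end))
    return index


def overlaps_gene(index, chrom, start, end) -> bool:
    """True if [start, end) overlaps any gene interval on the same chromosome."""
    return any(g_start < end and start < g_end
               for g_start, g_end in index.get(chrom, []))


def genic_fraction(rips, genes) -> float:
    """Fraction of RIP intervals overlapping at least one gene."""
    index = build_gene_index(genes)
    hits = sum(overlaps_gene(index, chrom, start, end) for chrom, start, end in rips)
    return hits / len(rips)


# Toy data (illustrative only).
toy_genes = [("chr1", 100, 500), ("chr2", 0, 50)]
toy_rips = [
    ("chr1", 450, 750),  # overlaps the chr1 gene
    ("chr1", 500, 800),  # starts exactly at the gene end: no overlap
    ("chr2", 10, 20),    # inside the chr2 gene
    ("chr3", 5, 10),     # chromosome without annotated genes
]
fraction = genic_fraction(toy_rips, toy_genes)
```

A linear scan per chromosome is adequate for a toy example; for tens of thousands of RIPs against full annotations, a sorted-interval or interval-tree approach would be the more practical choice.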
Nearly half of all SINE RIPs are common in pig genomes
Of the 36,284 SINE RIPs, approximately 10,000 (range 6,612–12,703) appeared as insertion alleles in each breed’s genome, while the rest were identified as deletion alleles (Fig. 4A). Deletion or insertion alleles of the predicted SINE RIPs detected in >12 or <4 breed genomes were designated as rare RIPs. In contrast, deletion or insertion SINE alleles present in 4–12 genomes were considered to be common RIPs. Based on this classification, we identified 16,694 common RIPs, representing 46.01% of all SINE RIPs identified (Table 1). These common RIPs are polymorphic across most breeds and have great potential for genetic analysis and QTL mapping. In addition, a pairwise comparison of SINE RIPs across the assembled genomes revealed that, on average, 11,482 alleles (range 7,532–14,751) differed between genomes (Additional file 1: Fig. S3). The commercial pig breed genomes (Duroc, Landrace, Large White, Pietrain, Hampshire, Berkshire) differed from one another at relatively few alleles, about 8,000 SINE RIP alleles, ranging from 7,817 between Berkshire and Hampshire pigs to 9,044 between Duroc and Hampshire (Fig. 4B). By contrast, the Chinese native pigs displayed more SINE RIP alleles that differed between breeds, with an average of 11,103 (range 9,721–12,622) (Fig. 4C). Comparison of the most important commercial pig breeds (Duroc, Landrace, and Large White) revealed that 23,189 RIP loci shared the same alleles, with each genome containing about 4,000 (range 4,051–4,793) breed-specific RIP alleles (Fig. 4D).
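The rare/common classification and the pairwise allele comparison can be sketched as follows. The presence/absence encoding (1 for the insertion allele, 0 for the deletion allele) and the function names are assumptions made for illustration, and the genotype vectors are toy values, not real calls.

```python
from itertools import combinations

N_GENOMES = 16  # 15 non-reference genomes plus the reference


def classify_rip(n_with_insertion: int) -> str:
    """Label a RIP as 'common' if either allele is present in 4-12 of 16 genomes."""
    if not 0 < n_with_insertion < N_GENOMES:
        raise ValueError("a polymorphic locus must carry both alleles")
    # With 16 genomes, insertion in 4-12 genomes implies deletion in 4-12 as well.
    return "common" if 4 <= n_with_insertion <= 12 else "rare"


def pairwise_differences(calls):
    """Count loci at which each pair of genomes carries different alleles."""
    return {
        (a, b): sum(x != y for x, y in zip(calls[a], calls[b]))
        for a, b in combinations(sorted(calls), 2)
    }


# Toy presence/absence calls at four loci (illustrative only).
toy_calls = {
    "Duroc":    [1, 0, 1, 1],
    "Landrace": [1, 0, 0, 1],
    "Meishan":  [0, 0, 1, 0],
}
differences = pairwise_differences(toy_calls)
labels = [classify_rip(n) for n in (1, 4, 12, 13)]
```

Because the two alleles at a locus partition the 16 genomes, counting only the insertion allele is sufficient to apply the 4–12 rule to both alleles.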
PCA and cluster analysis of the SINE RIPs
Cluster analysis revealed two main groups of pig breeds. All Western pigs (Large White, Landrace, Duroc, Pietrain, Hampshire, Berkshire, and the cross-breeds) formed a clade that was well separated from the one comprising all Chinese pigs (Rongchang, Jianghua, Meishan, Bamei, Tibet, Bama, and Wuzhishan) together with Göttingen pigs, which contain Asian pig genetic material (Fig. 5A). As expected, the SINE RIP-based clusters were also well supported by PCA (Fig. 5B), in which the two clusters are separated along the horizontal axis, in accord with the direction of maximal variance.
Analysis of the population structure and genetic diversity of some Chinese native pigs based on SINE RIP molecular markers
To evaluate the potential application of SINE RIPs in population genetic analysis, 16 SINE RIPs were selected to detect polymorphisms in 22 native Chinese pig breeds and in one native Italian pig breed. The PCR analysis revealed that all the markers were polymorphic and biallelic. Detection of SINE RIPs in each breed and their primers are summarized in Additional file 1: Table S7 and Additional file 2.
The effective number of alleles (Ne) per locus ranged between 1.537 (REF-16266) and 2.000 (ESA1-16), with a mean across loci of 1.765. The expected heterozygosity was higher than the observed heterozygosity at most loci. Observed heterozygosity ranged from 0.166 (DR-68328) to 0.468 (REF-3992), and expected heterozygosity ranged from 0.350 (REF-16266) to 0.500 (ESA1-16), with overall means of 0.354 ± 0.088 and 0.423 ± 0.055, respectively. The PIC values, which reflect the usefulness of a marker for diversity analysis of a breed, showed that all 16 SINE RIPs are moderately informative (PIC 0.25–0.5), ranging from 0.288 to 0.375 with an overall mean of 0.335 ± 0.031. The mean FIS value was negative (–0.106 ± 0.153; range –0.315 to 0.328), indicating a low level of inbreeding in the breeds examined. The FST values ranged from 0.117 (REF-14902) to 0.369 (ESA1-33), with a mean FST of 0.252 across all loci, indicating that 74.8% of the genetic variation was caused by differences between individuals and 25.2% arose from differentiation between breeds. Agreement with Hardy–Weinberg equilibrium was tested for each locus within each breed at P < 0.05. Across all loci, on average about one-third of the breed–locus combinations deviated from Hardy–Weinberg equilibrium (Additional file 1: Table S7).
The UPGMA method was used to construct a phylogenetic tree (Fig. 6A) based on Nei's unbiased genetic distance. The tree shows three clusters that generally correspond to the geographic origins of the breeds (Fig. 6B), especially for southern Chinese breeds (Bamaxiang, Wuzhishan, Dahuabai, and Lantang) and most pig breeds of central China (Qingping, Hanjiang Black, Shaziling, Tongcheng, Lepinghua, Ningxiang, Erhualian, Laiwu Black, Dapulian, Dingyuan, and Mingguang Small Ear), with the exception of Bamei, Wei, and Anqinliubai. Bamei is a northern Chinese breed but clustered with the southern Chinese pigs, while Wei and Anqinliubai, which originate from central China, clustered with the northern Chinese pigs (Mashen and Dongbei Min) and the Italian pig breed (Nero Siciliano pig), which also has the greatest genetic distance from the Chinese pig breeds.
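For a biallelic marker, the per-locus statistics above reduce to simple functions of the allele frequency. The sketch below uses the standard textbook definitions (He = 1 − Σp², Ne = 1/Σp², PIC = He − 2p²q², FIS = 1 − Ho/He); whether the software used in this study applies exactly these estimators, including any sample-size corrections, is an assumption, and the genotype counts are toy values.

```python
def locus_statistics(n_ins_ins: int, n_ins_del: int, n_del_del: int) -> dict:
    """Diversity statistics for one biallelic RIP from genotype counts.

    ins/ins and del/del are the two homozygotes, ins/del the heterozygote.
    """
    n = n_ins_ins + n_ins_del + n_del_del
    p = (2 * n_ins_ins + n_ins_del) / (2 * n)  # insertion allele frequency
    q = 1.0 - p                                # deletion allele frequency
    homozygosity = p * p + q * q
    ho = n_ins_del / n                         # observed heterozygosity
    he = 1.0 - homozygosity                    # expected heterozygosity
    return {
        "Ne": 1.0 / homozygosity,              # effective number of alleles
        "Ho": ho,
        "He": he,
        "PIC": he - 2.0 * p * p * q * q,       # polymorphism information content
        "FIS": 1.0 - ho / he if he > 0 else float("nan"),
    }


# Toy genotype counts with equal allele frequencies (p = q = 0.5).
stats = locus_statistics(25, 50, 25)
```

At p = q = 0.5 these definitions give Ne = 2.000, He = 0.500, and PIC = 0.375, which are the upper bounds for a biallelic locus and match the maximum values reported above.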