In silico characterization of putative PERVs
Using previously reported PERV sequences as queries, we mined 14 pig genomes (Additional files 2: Table S1) available in GenBank and showed a detailed genome-wide distribution of full-length PERVs (i.e., containing two LTRs). We initially compiled a PERV data set that included 185 putative PERVs (containing at least one LTR) (Additional files 2: Table S2). A total of 84 classified (30 PERV-A, 39 PERV-B, and 15 PERV-C) and 18 unclassified PERVs (i.e., lacking the env gene) were retrieved from pig genomes. We identified 2–10 full-length PERVs in most pig breeds (12/14), including Meishan, Goettingen, and Large White (Additional files 2: Table S2). After removing 19 previously classified PERV sequences that were low-quality fragments (> 200 “N” bases), the final data set comprised 65 high-quality classified PERVs (27 PERV-A, 29 PERV-B, and 9 PERV-C). Their viral genomic structures are summarized in Figure 1.
Most of the PERVs exhibited large-scale genetic alterations induced by indels and stop codons (Fig. 1), suggestive of a relatively long evolutionary history. PERV LTRs were classified by the presence (LTR B) or absence (LTR A) of the 18 bp and 21 bp repeat structure reported previously [8, 14, 15], as shown in Fig. 1a. Three different types of B LTRs in the PERV were identified, distinguished by the number of 18 bp and 21 bp repeat sequences: LTR B1 (two 18-bp and one 21-bp repeats), LTR B2 (three 18-bp and two 21-bp repeats), and LTR B3 (four 18 bp and three 21 bp repeats). Of the high-quality PERVs we analyzed, 32 PERVs (>55%) carried LTR A, 10 carried LTR B1, 13 carried LTR B2, and 2 carried LTR B3. LTR A was identified in PERV-A and -C, and LTR B1 was identified in PERV-A and -B. LTR B2 and LTR B3 were only identified in PERV-B. The remaining eight PERVs contained different types of 5’- and 3’- LTR, which may reflect recombination between PERVs with different LTR types (Fig. 1). For example, we discovered one PERV (AEMK02000536.1) with LTR B2 at the 5’ end and LTR A at the 3’ end (Fig. 1).
Genomic rearrangement via PERVs
Retrovirus integration creates a short duplication called target site duplication (TSD) flanking the LTR [16, 17]. If chromosomal rearrangement through homologous recombination between distant proviruses occurred, the flanking TSDs should be different, as mentioned in a previous study of genomic rearrangement in primates via ERVs [18]. To identify pig genomic rearrangement via PERVs, we first constructed a maximum likelihood (ML) tree representing the 5’- and 3’- LTR sequences of full-length PERVs (Additional files 1: Fig. S1). The phylogenetic tree was divided into three large clusters (Additional files 1: Fig. S1), suggesting that three major integration events had occurred. We collected PERVs in which the 5’- and 3’- LTR sequences did not cluster together in the phylogenetic tree. Remarkably, 11 PERVs did not share the same TSD (4bp in length) (Table 1, Additional files 2: Table S3); thus, these PERVs could reflect porcine genomic rearrangement via PERV recombination during evolution.
Table 1. PERVs with different target site duplications (TSDs)
Name
|
Accession number
|
Divergent LTRs based on structure *
|
Divergent LTR based on tree
|
Flanking TSD **
|
5'
|
3'
|
AEMK02000536.1
|
AEMK02000536.1
|
yes
|
yes
|
AGCC
|
CTTT
|
CM000818.5_a
|
CM000818.5
|
no
|
yes
|
GTTC
|
CTTC
|
CM000826.5_c
|
CM000826.5
|
no
|
no
|
ACCA
|
AATC
|
CM000828.5_d
|
CM000828.5
|
no
|
yes
|
CCAC
|
CACC
|
KQ001967.1
|
KQ001967.1
|
no
|
yes
|
CCAC
|
CACC
|
LIDP01000017.1
|
LIDP01000017.1
|
no
|
yes
|
CCAC
|
CACC
|
LUXR01004647.1
|
LUXR01004647.1
|
yes
|
yes
|
GTTC
|
CTTC
|
LUXR01022139.1
|
LUXR01022139.1
|
yes
|
yes
|
CCAC
|
CACC
|
LUXX01045907.1
|
LUXX01045907.1
|
no
|
yes
|
CCAC
|
CACC
|
LUXX01080744.1
|
LUXX01080744.1
|
no
|
yes
|
GTTC
|
CTTC
|
LUXY01101100.1
|
LUXY01101100.1
|
no
|
yes
|
CCAC
|
CACC
|
*LTRs of PERVs comprise four types (LTR A, B1, B2, and B3). If two different types of LTRs are flanking the PERV, the LTRs are considered "divergent".
**Only TSDs flanking the intact 5’ and 3’ LTRs sequences were analyzed
Detection of PERV-related sequences in mammalian genomes
After screening 304 mammalian genomes (Additional files 2: Table S4) available on GenBank using tBLASTn and choosing three major proteins (Gag, Pol, and Env) of PERVs as queries, a significant sequence (accession number: NW_004504334.1) was found in the genome of lesser Egyptian jerboa (Jaculus jaculus) that exhibited strong sequence similarity (for gag and pol: >75% nucleotide identity over 95% region; for env: >75% nucleotide identity over 55% region) to PERVs. Using this PERV-like sequence as a query, three other possible PERV-like sequences were identified in J. jaculus with >85% nucleotide identity over 80% of the query sequence. The four PERV-like sequences identified in J. jaculus were designated as ERV-Gamma.n-Jja (Additional files 2: Table S5) (where n = 1–4). These four significant hits are located in large scaffolds > 5 Mb in length and are flanked by several host genes, indicating that the ERVs-Gamma-Jja sequences were relatively reliable (Additional files 2: Table S6).
We were only able to identify one pair of ERV-Gamma-Jja LTRs. This full-length ERV-Gamma.1-Jja (containing 2 LTRs) is annotated in Additional files 1: Fig. S2. The length of the 3’- LTR of this ERV-Gamma-Jja is 674 bp, while the 5’- LTR is 932 bp with a 258 bp insertion. We aligned ERV-Gamma-Jja LTRs with PERV LTRs. The start of the U3 region and the end of the U5 region were distinct and were not included in the alignment (Additional files 1: Fig. S3). The ERV-Gamma-Jja LTRs include a similar repeat structure (three 18 bp and two 21 bp repeat sequences) in the U3 region, the sequence of which is identical to that of the PERV LTR B2 (identity ~73%). Alignment analysis revealed a closer relationship between the LTRs of the ERV-Gamma-Jja and LTR B2 of PERVs (Additional files 1: Fig. S3). Notably, the alignment of the conserved R region supported a close evolutionary relationship between the ERV-Gamma-Jja and PERVs (Fig. 2a). To highlight the similarity between PERVs and ERVs-Gamma-Jja, we generated pairwise alignments of ERV-Gamma.1-Jja and PERV nucleotides using the full-length ERVs and performed a sliding window analysis of these pairwise alignments (Fig. 2b) [19, 20]. For comparison, we determined the similarity of the HIV-1 provirus sequence to that of its closest relative (chimpanzee SIVcpz) [21, 22]. Interestingly, gag and pol were more similar between ERV-Gamma-Jja and PERV-A, -B, and -C than HIV-1 and SIVcpz (Fig. 2b). Therefore, our results indicate that ERVs-Gamma-Jja and PERVs are homologous. However, the RBD and the proline rich-region (PRR) of the surface subunit (SU) of env were dissimilar between ERV-Gamma-Jja and PERV-A, -B, and -C, as also seen in HIV and SIVcpz. As RBD determines the host range [23-26], this observation suggests that ERVs-Gamma-Jja and PERVs have distinct host ranges.
We used the RBD amino acid sequences from PERV-A, -B, and -C as queries to screen for homologous viral elements. The eight significant hits (>60% amino acid identity over 80% region) were obtained in rock hyrax (Procavia capensis) of Procaviidae, and all eight hits flanking with genes were located in large scaffolds >0.3 Mb in length (Additional files 2: Table S5). We examined the sequences flanking the eight hits (especially pol) and found that ERVs including these hits were endogenous gamma-retroviruses. These hits were therefore designated ERVs-Gamma.n-Pca (where n = 1–8). We aligned the RBDs of PERVs and ERVs-Gamma-Pca and found that ERVs-Gamma-Pca were highly similar to PERVs (Fig. 3). Pairwise comparisons revealed that ERV-Gamma.1-Pca and ERV-Gamma.2-Pca have high identity to PERV-B (63%) but a low identity with PERV-A, -C, and PERV-IM (40–43%). Therefore, the RBD of ERVs-Gamma-Pca and PERV-B are homologous.
We also found homologous LTRs of PERVs (~73% identity) in eight Muroidea species (Mus caroli, M. pahari, M. musculus, M. spretus, Apodemus speciosus, A. sylvaticus, Rattus norvegicus, and Phodopus sungorus). The coding genes (gag, pol, and env) near these homologous LTRs were identified (Additional files 2: Table S5). Notably, Muridae ERVs also showed a similar repeat structure in the U3 region and share a 70% identity to LTR B2 (Fig. 2a; Additional files 1: Fig. S3), which indicates that PERV and these rodent ERVs may share the same origin. The large indels in Muridae ERVs LTR further suggest that they may have an ancient origin.
To characterize the relationships between ERV-Gamma-Jja, ERVs-Gamma-Pca, Muridae ERVs, and PERVs, we produced phylogenetic trees of Gag, Pol and Env, first by removing the variable RBD (Fig. 2c-e). Forty ERVs from two Muroidea species and one primate (M. musculus, R. norvegicus, and Microcebus murinus) found in a previous study showing the close relationship with PERVs were also included [27, 28]. Our maximum likelihood (ML) phylogenetic tree revealed that ERVs-Gamma-Jja and PERVs clustered together with high bootstrap supports in three phylogenies (Fig. 2c-e), suggesting that they share the most recent common ancestry. However, the Gag, Pol and Env (without RBD) of ERV-Gamma-Pca were distantly related to ERVs-Gamma-Jja and PERVs, while RBD was similar to PERV, which might indicate recombinant events among ancient retroviruses. As GALV and KoRV may also have a rodent origin [29, 30], Muridae ERVs, ERV-Gamma-Jja, and PERV might share the same ancestral rodent retrovirus via one or more intermediate hosts (Fig. 2c-e; Fig. 4). Failure to detect any other ERV-Gamma-Jja and PERV-like elements in the remaining rodent genomes indicated that these viruses were not vertically transmitted, suggesting instead that ancient horizontal transmission occurred during evolution.
Remarkably, a new lineage close to PERV-A and PERV-C was observed, named PERV-IM (designating a PERV-intermediate type), and presented in all 14 pig genomes (Fig. 2e, Additional files 2: Table S2). The Env proteins of PERV-IMs showed relatively low similarity to PERV-A, -B, and -C, and they were clearly distinct in the RBD region (Fig. 3).
Molecular dating analysis
To roughly estimate the integration time of PERVs, ERVs-Gamma-Jja and ERVs-Gamma-Pca, we used an LTR-divergence method based on the divergence between 5’- and 3’- LTR of ERVs with a known host nucleotide substitution rate [17, 31]. However, since the nucleotide substitution rates of S. scrofa, J. jaculus, and P. canpensis are unknown, we used an average mammal neutral substitution rate (2.2 × 10-9 per site per year) [32] for these three species. Our results indicated that PERV-A first invaded the Suidae ~6.6 million years ago (MYA), while PERV-B first invaded ~6.4 MYA. In contrast, the invasions of PERV-C and PERV-IM were relatively recent (~3.4 MYA and ~4.4 MYA, respectively) (Fig. 5, Additional files 2: Table S2). Thus, the oldest PERV-A and PERV-B invaded the host just after the Suidae split from the ancestral group (~7.3 MYA) [33]. PERV-A, -B, and -C have continued to integrate into pig genomes, resulting in increasing numbers of insertions. However, the LTRs of another three ERVs-Gamma-Jja were incomplete, and the time estimation of these ERV-Gamma-Jja was based on only one provirus. ERVs-Gamma.1-Jja was estimated to have integrated ~17.2 MYA, which is well before J. jaculus speciated (~11.1 MYA), but later than the speciation of Dipodidae (~42.7 MYA) [34]. ERV-Gamma-Pca integration time was calculated based on two full-length ERV-Gamma-Pcas. ERVs-Gamma-Pca insertions were estimated to be much older than PERVs (~10.7 MYA and ~8.4 MYA).