Complex evolution of porcine endogenous retroviruses

Background: Porcine endogenous retroviruses (PERVs) are proviruses that can replicate in human cells. However, the evolutionary process leading to the generation of modern PERVs is not well understood. Results: We mined 14 pig genomes and other available 304 mammalian genomes in silico, which led to the documentation of 185 full-length PERVs. Notably, we found two novel ERVs in the lesser Egyptian jerboa ( Jaculus jaculus ) and rock hyrax ( Procavia capensis ) named ERV-Gamma-Jja and ERV-Gamma-Pca, respectively, which were the source of the modern PERVs. Phylogenetic analyses provided evidence for the multiple origins of PERVs involving hosts of rodents, rock hyrax, and pigs. Conclusion: These new findings help us to understand the complex evolution of the modern PERVs.

3 However, the 39-bp repeat structure is absent in some PERV-A and all PERV-C. Thus, these PERVs have low transcriptional activity [8,9].
While several studies have examined the evolutionary relationships between PERVs and other viruses [7, 10-12], the origin and evolution of the modern PERVs remain uncertain. BLAST search analysis confirmed that the R and U5 regions of the PERV LTRs are highly conserved in the pig and mouse genomes (74-87% identity) [13], suggesting a murine origin of PERVs. At least two species that belong to the same order as pigs, Tayassu pecari (of Eocene origin) and Babyrousa babyrussa (of Miocene origin), lack PERVs [7], raising the possibility of a non-porcine origin of the PERVs. However, the common warthog (Phacochoerus africanus) carries PERVs, suggesting that an ancestral porcine species could have carried PERVs [7]. Here, we expand the genomic mining to all available mammalian genomes, aiming to discover non-porcine and non-murine ERVs that have high similarity to PERVs. Combined with our comprehensive phylogenetic analyses, we illustrate the evolutionary journey of the modern PERVs and reveal that the genesis of PERVs is much more complex than previously thought.

In silico characterization of putative PERVs
Using previously reported PERV sequences as queries, we mined 14 pig genomes (Additional files 2: Table S1) available in GenBank and showed a detailed genomewide distribution of full-length PERVs (i.e., containing two LTRs). We initially compiled a PERV data set that included 185 putative PERVs (containing at least one LTR) (Additional files 2: Table S2). A total of 84 classified (30 PERV-A, 39 PERV-B, and 15 PERV-C) and 18 unclassified PERVs (i.e., lacking the env gene) were retrieved from pig genomes. We identified 2-10 full-length PERVs in most pig breeds (12/14), including Meishan, Goettingen, and Large White (Additional files 2: Table   S2). After removing 19 previously classified PERV sequences that were low-quality fragments (> 200  (three 18-bp and two 21-bp repeats), and LTR B3 (four 18 bp and three 21 bp repeats). Of the high-quality PERVs we analyzed, 32 PERVs (>55%) carried LTR A, 10 carried LTR B1, 13 carried LTR B2, and 2 carried LTR B3. LTR A was identified in PERV-A and -C, and LTR B1 was identified in PERV-A and -B. LTR B2 and LTR B3 were only identified in PERV-B. The remaining eight PERVs contained different types of 5'and 3'-LTR, which may reflect recombination between PERVs with different LTR types (Fig. 1). For example, we discovered one PERV (AEMK02000536.1) with LTR B2 at the 5' end and LTR A at the 3' end ( Fig. 1).

Genomic rearrangement via PERVs
Retrovirus integration creates a short duplication called target site duplication (TSD) flanking the LTR [16,17]. If chromosomal rearrangement through homologous recombination between distant proviruses occurred, the flanking TSDs should be different, as mentioned in a previous study of genomic rearrangement in primates via ERVs [18]. To identify pig genomic rearrangement via PERVs, we first constructed a maximum likelihood (ML) tree representing the 5'-and 3'-LTR sequences of full-length PERVs (Additional files 1: Fig. S1). The phylogenetic tree was divided into three large clusters (Additional files 1: Fig. S1), suggesting that three major integration events had occurred. We collected PERVs in which the 5'and 3'-LTR sequences did not cluster together in the phylogenetic tree.
Remarkably, 11 PERVs did not share the same TSD (4bp in length) (

Alignment analysis revealed a closer relationship between the LTRs of the ERV-
Gamma-Jja and LTR B2 of PERVs (Additional files 1: Fig. S3). Notably, the alignment of the conserved R region supported a close evolutionary relationship between the 7 ERV-Gamma-Jja and PERVs (Fig. 2a). To highlight the similarity between PERVs and ERVs-Gamma-Jja, we generated pairwise alignments of ERV-Gamma.1-Jja and PERV nucleotides using the full-length ERVs and performed a sliding window analysis of these pairwise alignments ( Fig. 2b) [19,20]. For comparison, we determined the similarity of the HIV-1 provirus sequence to that of its closest relative (chimpanzee SIVcpz) [21,22]. Interestingly, gag and pol were more similar between ERV-Gamma-Jja and PERV-A, -B, and -C than HIV-1 and SIVcpz (Fig. 2b). Therefore, our results indicate that ERVs-Gamma-Jja and PERVs are homologous. However, the RBD and the proline rich-region (PRR) of the surface subunit (SU) of env were dissimilar between ERV-Gamma-Jja and PERV-A, -B, and -C, as also seen in HIV and SIVcpz. As RBD determines the host range [23][24][25][26], this observation suggests that ERVs-Gamma-Jja and PERVs have distinct host ranges.
We used the RBD amino acid sequences from PERV-A, -B, and -C as queries to screen for homologous viral elements. The eight significant hits (>60% amino acid identity over 80% region) were obtained in rock hyrax (Procavia capensis) of Procaviidae, and all eight hits flanking with genes were located in large scaffolds >0.3 Mb in length (Additional files 2: Table S5). We examined the sequences flanking the eight hits (especially pol) and found that ERVs including these hits were endogenous gamma-retroviruses. These hits were therefore designated ERVs-Gamma.n-Pca (where n = 1-8). We aligned the RBDs of PERVs and ERVs-Gamma-Pca and found that ERVs-Gamma-Pca were highly similar to PERVs (Fig. 3). Pairwise comparisons revealed that ERV-Gamma.1-Pca and ERV-Gamma.2-Pca have high identity to PERV-B (63%) but a low identity with PERV-A, -C, and PERV-IM (40-43%).
Therefore, the RBD of ERVs-Gamma-Pca and PERV-B are homologous. near these homologous LTRs were identified (Additional files 2: Table S5). Notably, Muridae ERVs also showed a similar repeat structure in the U3 region and share a 70% identity to LTR B2 ( Fig. 2a; Additional files 1: Fig. S3), which indicates that PERV and these rodent ERVs may share the same origin. The large indels in Muridae ERVs LTR further suggest that they may have an ancient origin.
To characterize the relationships between ERV-Gamma-Jja, ERVs-Gamma-Pca, Muridae ERVs, and PERVs, we produced phylogenetic trees of Gag, Pol and Env, first by removing the variable RBD ( Fig. 2c- Table S2). The Env proteins of PERV-IMs showed relatively low similarity to PERV-A, -B, and -C, and they were clearly distinct in the RBD region (Fig. 3).

Molecular dating analysis
To roughly estimate the integration time of PERVs, ERVs-Gamma-Jja and ERVs-Gamma-Pca, we used an LTR-divergence method based on the divergence between 5'-and 3'-LTR of ERVs with a known host nucleotide substitution rate [17,31].
However, since the nucleotide substitution rates of S. scrofa, J. jaculus, and P.
canpensis are unknown, we used an average mammal neutral substitution rate (2.2 × 10 -9 per site per year) [32] for these three species. Our results indicated that

Discussion
Using systematic large-scale genome mining, we revealed hundreds of PERVs or PERV-like sequences, including some previously unidentified viruses from rodents and rock hyrax. Phylogenetic reconstruction, as well as sequence analysis, provided evidence that ERVs-Gamma-Jja from lesser Egyptian jerboa share the most recent ancestry with the modern PERVs. Muridae ERVs (with PERV-like LTRs) and ERVs-Gamma-Pca, from rock hyrax, also contributed to the origin of LTRs and the RBD region of modern PERVs, respectively (Fig. 2). Env phylogeny (Fig. 2c) revealed that the ancestral PERV was then speciated into different classes, in which the PERV-B emerged earlier than the other classes (Fig. 6). Previously, the insertion time of the most ancient PERV was estimated at 7.6 MYA [14], similar to our estimation that the oldest PERV was dated back to 6.6 MYA. In particular, the most ancient insertion times of ERVs-Gamma-Jja and ERVs-Gamma-Pca were roughly estimated as 17.2 and 10.7 MYA, respectively, which are much older events than the origination time of ERVs-Gamma-Jia were the closest relatives to PERVs (Fig. 2c- and Suidae fossils is in East Africa. We speculate that the ancestral PERV was generated in this region via multiple recombination events involving Rodentia and Dipodidae species during Miocene (Fig. 6). This is also consistent with our dating results of the PERVs, and two PERV-like viruses-ERVs-Gamma-Jja and ERVs-Gamma-Pca.
We also checked the flanking regions of ERVs-Gamma-Jja, ERVs-Gamma-Pca, and PERVs, and they showed no similarity with each other, indicating that these ERVs were not vertically transmitted. However, we identified 29 full-length PERVs as ortholog sequences, while most PERVs, including some with same TSD (Additional files 2: Table S3), were not orthologs, revealing that PERVs were more invasive after the speciation of different pig breeds.

Conclusion
In sum, we decipher a complex evolutionary history for the modern PERVs. The ancestral PERV is likely to be derived from ancient retroviruses carried by nonporcine species via multiple recombination events. We also suggest that pig genomes have been shaped by PERVs, as specifically reflected by the PERVassociated genomic rearrangements that have occurred during porcine evolution. In other words, prior to their appearance in pigs, modern PERVs had an evolutionary history more complex than previously thought.

In silico identification of PERV and PERV-related proviruses.
To identify PERV-related elements in Sus scrofa, tBLASTn [35]  screen the 14 pig genomes available in GenBank. A 50% identity over 50% of the match region was used to filter significant hits. It has been shown that PERVs harbor two LTR structures, one with and one without a repeat structure in the U3 region [8,14]. Using two typical LTRs as queries we extended flanking sequences of coding 13 domains of PERVs to identify LTRs with BLASTn, and TSDs were used to define PERV boundaries. LTR lengths were defined as 100-1,000 bp. PERVs with at least one LTR and one coding gene were used in the evolutionary analysis.
To identify PERV-related proviruses in mammals, tBLASTn was used with the queries described above in 20 representative PERV proviruses to search the 304 mammal genomes available in GenBank. A 50% identity over 80% region was used to filter significant hits. LTRs were identified using LTR finder [36], LTRharvest [37] and consisted with previously study [38].

Detection of genomic rearrangement via PERVs.
To search for proviruses involved in recombination and genomic rearrangement, we constructed a maximum likelihood (ML) tree of the 5'-and 3'-LTRs of full-length PERVs using PhyML 3.1 [39] with GTR+I+Γ nucleotide substitution model. LTRs less than 250 bp were not considered. Sequence alignment was performed with MAFFT 7.222 [40].

Phylogenetic analyses.
To determine the evolutionary relationship among PERVs, ERVs-Gamma-Jja, ERVs-Gamma-Pca, muridae ERVs and representative gammaretroviruses (Additional files 2: Table S7), phylogenetic trees were inferred using the amino acid sequences of full-length PERVs and PERVs with one LTR and at least one coding gene. The length of protein sequences < 70% of the alignment are not considered for phylogenetic analyses. Significant bat viruses including Eptesicus serotinus bat retrovirus (EsRV) and the Megaderma lyra retrovirus (MlRV) are too short to include in analysis.
However, their phylogenetic relationship with PERVs has been shown in previous study [41]. All Gag, Pol and Env protein sequences (additional files 3: dataset S1) were aligned in MAFFT 7.222 and confirmed manually in MEGA7/MEGA X [42,43].
The phylogenetic history of these gammaretroviruses was then determined using the maximum likelihood (ML) method available in PhyML 3.1 [39], incorporating 100 bootstrap replicates to assess node robustness. The best-fit JTT+Γ amino acid substitution model was selected for Gag, Pol and JTT+Γ+I for Env using the ProtTest

Dating estimation of PERV, ERV-Gamma-Jja and ERV-Gamma-Pca.
The 5' and 3' LTRs of ERVs are identical at the point of integration, and then diverge and evolve independently [31]. So the ERV integration time can be calculated using the following relation: T = (D/R)/2, in which T is the invasion time Availability of data and materials. All datasets used and analyzed during the current this study are included in this published article (and its additional files).
Competing interests. The authors declare that they have no competing interests.    S4. The complete phylogenetic tree of Gag. Fig. S5. The complete phylogenetic tree of Pol. Fig. S6. The complete phylogenetic tree of Env.
Additional files 2. Table S1. The information of pig, rodent, and rock hyrax genomes used for data mining. Table S2. The information of full-length and near full-length PERVs. Table S3. The recombination-related information of full-length PERVs shown in Fig. S1. Table S4. The information of 304 mammals used for PERVlike sequences mining. Table S5. The matching contigs identified in mammal genomes. Table S6. The information of genes flanking the ERVs-Gamma-Jja and ERVs-Gamma-Pca. Table S7. The information of representative retroviruses used for phylogenetic analysis.
Additional files 3. Data set S1. The alignments used to build the phylogenetic trees of Gag, Pol and Env represented in Fig. S4, S5, and S6, respectively.

12.
Benveniste RE, Todaro GJ: Evolution of type C viral genes: preservation of ancestral murine type C viral sequences in pig cellular DNA. Proc Natl