The results presented below are based on the genomes of the salmonids Atlantic salmon, Brown trout, Rainbow trout, Sockeye salmon, Coho salmon, Chinook salmon and Charr (see Material and Methods for details). All genomes, apart from those of Charr and Northern pike, originated from completely homozygous, so-called double haploid, animals, thus eliminating the added confusion of allelic gene variants. To understand the evolution of the genes, the salmonid data are compared against results from the genome of Northern pike, a species that is basal to salmonids but lacks the 4WGD [22] (Fig. 1). Genomes from the three Salmonidae Coregonus, Hucho hucho and Thymallus thymallus were excluded, as they contained un-annotated or incomplete genomic regions that did not allow informative comparisons.
There seems to be some confusion as to the origin of the Salvelinus genome in NCBI, now annotated as Salvelinus, which may be Salvelinus malma malma and not Salvelinus alpinus as presented in the original article [24, 25]. Following the standardised nomenclature exemplified by Sasa for Salmo salar and Eslu for Esox lucius, we use Saal for Salvelinus alpinus, although the correct abbreviation may be Sama.
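The abbreviation scheme can be illustrated with a minimal sketch, assuming that each four-letter code is formed from the first two letters of the genus followed by the first two letters of the species epithet, as in Sasa and Eslu. The helper name `species_code` is hypothetical and serves only as an illustration.

```python
def species_code(binomial: str) -> str:
    """Return a four-letter code: two letters of the genus + two of the species epithet.

    Assumption: the code ignores any subspecies name after the epithet.
    """
    parts = binomial.split()
    if len(parts) < 2:
        raise ValueError(f"expected 'Genus species', got {binomial!r}")
    genus, epithet = parts[0], parts[1]
    return genus[:2].capitalize() + epithet[:2].lower()


for name in ("Salmo salar", "Esox lucius", "Salvelinus alpinus", "Salvelinus malma malma"):
    print(name, "->", species_code(name))
```

Under this assumption, Salvelinus alpinus and Salvelinus malma give different codes (Saal and Sama), which is why the species assignment of the NCBI genome affects the gene names.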
Orthology between salmonid regions is summarised from data obtained from Christensen et al. and Sutherland et al. [24, 26] and is presented in Additional file 1. For Brown trout, the linkage groups presented by Leitwein and coworkers [27] do not match the chromosome numbers in the NCBI genome, so regional orthology is currently based on BLAST matches with region-specific genes from other salmonids where these were informative.
Evolution of salmonid MHCIa and MHCIb regions
Based on previous data, we define the genomic region containing the classical UBA locus as the MHCIa region and the duplicate region containing non-classical genes as the MHCIb region [11, 12]. Genes residing within these two regions are also given an -a or -b extension. All salmonid genomes analysed in this study contained well-defined and annotated duplicated MHCIa and MHCIb regions (Additional file 2). The Ia region, containing the UBA locus, was overall identical in all species, with a few exceptions. Brown trout has a CD5-like gene in between the SLC39A7a and RING2a genes that is not present in any of the other species.
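The naming convention for region-specific genes can be sketched as follows, assuming a label is built from a species code, a gene name and an optional region extension (a for MHCIa, b for MHCIb). The function `gene_label` is a hypothetical helper and not part of any published tool.

```python
def gene_label(species: str, gene: str, region: str = "") -> str:
    """Build a label such as 'Onmy-PSMB8a' from species code, gene name and region extension.

    Assumption: region is '' (no extension), 'a' (MHCIa) or 'b' (MHCIb).
    """
    if region not in ("", "a", "b"):
        raise ValueError("region must be '', 'a' or 'b'")
    return f"{species}-{gene}{region}"


print(gene_label("Onmy", "PSMB8", "a"))  # gene in the Rainbow trout MHCIa region
print(gene_label("Sasa", "ZBA", "b"))    # gene in the Atlantic salmon MHCIb region
print(gene_label("Eslu", "UAA"))         # Northern pike gene, no region extension
```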
In Zebrafish, there are functional MHCI haplotypes with polymorphism in the proteasome subunits PSMB8 and PSMB13 as well as in TAP2 [28]. In Rainbow trout, the two allelic PSMB8 variants found in Zebrafish are encoded by two different genes in the Onmy-MHCIa region [21]. Here, the Onmy-PSMB8a gene is a pseudogene while the Onmy-PSMB8F gene is functional. PSMB8F pseudogenes have previously been found in the duplicate Atlantic salmon MHCIa and MHCIb regions [21]. However, there is a bona fide Atlantic salmon Sasa-PSMB8F sequence in Genbank (ACI66984.1), suggesting that some Atlantic salmon haplotypes may have a functional variant of this gene. Neither pike nor the other salmonids have an annotated PSMB8F gene in the MHCIa region, but Charr has a PSMB8F gene on an unplaced scaffold (XP_023998549.1).
The duplicate MHCIb region was also mostly identical in all analysed species. The LHX9-like gene found in Northern pike is present in all salmonid MHCIb regions with the exception of Salvelinus. All but Salvelinus and Northern pike also have a varying number of chitin synthase-like (CHS2) genes in between the RXRB and SLC39A7 genes. Chitin synthase is a well-known molecule in fungi and invertebrates, but also seems to have functional roles in fish and amphibians [29]. In Chinook salmon there is a duplicate of the entire MHCIb region (Genbank NW_020128813), which most likely is an assembly artefact as the sequenced animal was a double haploid.
Evolution of U lineage genes
Six Northern pike U lineage genes reside on chromosome 10, here defined as Eslu-UAA through Eslu-UFA (Fig. 2, Additional files 2–4). Based on phylogeny, there seem to have been three original genes, each of which has duplicated, giving the pairs Eslu-UAA and Eslu-UBA, Eslu-UCA and Eslu-UDA, and Eslu-UEA and Eslu-UFA (Fig. 3). Eslu-UCA is only a partial sequence and may be a pseudogene. The polymorphic content of these genes remains undefined, but there is one EST and one TSA matching the Eslu-UAA/UBA genes (Genbank GH268323 and TSA GATF010284) and one EST originating from one of the Eslu-UEA or Eslu-UFA loci (EV373903). A seventh pike U lineage gene is located on an unplaced scaffold (Eslu-UGA, NW_022995044) and seems to be a pseudogene duplicate of the Eslu-UDA gene.
As previous studies have shown that the three extracellular alpha domains of U lineage sequences display different evolutionary patterns [5, 18–20], we made phylogenetic trees of both entire mature extracellular amino acid sequences as well as trees of individual alpha 1, alpha 2 and alpha 3 domain sequences to identify orthology (Fig. 3, Additional file 3).
Alpha 1 domain lineages are shared by distantly related teleost species, and phylogenies of alpha 1 domain sequences show that non-classical genes also belong to these lineages (Fig. 2) [5, 18–20]. Non-classical UEA gene sequences share the alpha 1 domain lineage Va, UGA gene sequences share the alpha 1 domain lineage II and most UCA and UDA gene sequences cluster with the alpha 1 domain lineage I. Northern pike U lineage genes also share these alpha 1 domain lineages. Eslu-UAA and Eslu-UBA alpha 1 domain sequences cluster with alpha 1 domain lineage Vb, Eslu-UDA clusters with lineage IIIa, and Eslu-UEA and Eslu-UFA cluster with lineage IIIb sequences. In the alpha 2 domain analysis, all Northern pike sequences cluster together, although the bootstrap value is only 31 percent (Additional file 3). A similar clustering is also seen for all Northern pike alpha 3 domain sequences, with a somewhat higher bootstrap value.
Only one salmonid U lineage gene, UHA, resides outside of the two duplicated MHCIa and MHCIb regions (Table 1, Additional file 2). Sequences from this gene display strongly supported clusters in all phylogenies. Northern pike and Sockeye salmon did not display any UHA gene sequences, but the remaining salmonids all have UHA lineage genes on one homeolog of Northern pike Chr. 16 (Additional file 1). Atlantic salmon and Charr have regionally duplicated UHA lineage genes, where at least the duplicate Sasa-UHA2 gene is a pseudogene (Additional file 4). Although the two Charr UHA gene sequences are incomplete, there is an expressed UHA1/2-like sequence in Salvelinus malma (Genbank AYG86905.1), suggesting that at least one of these UHA loci is functional also in Charr. Overall, UHA gene sequences are very different from other U lineage sequences (Fig. 3, Additional file 3), suggesting an ancient origin. However, we have not been able to find orthologs in any other teleost, so these genes may instead have evolved fast in salmonids.
Table 1
Number of MHC class I lineage genes in salmonids and Northern pike
Species | U | Z | L | S | H | P |
Northern pike | 7 | 5 | 4 | 1 | 1 | 0 |
Atlantic salmon | 8 | 7 | 12 | 6 | 2 | 1 |
Brown trout | 7 | 7 | 25 | 3 | 2 | 1 |
Rainbow trout | 7 | 6 | 14 | 2 | 2 | 1 |
Coho salmon | 6 | 5 | 14 | 2 | 2 | 1 |
Sockeye salmon | 5 | 5 | 14 | 2 | 2 | 1 |
Chinook salmon | 9 | 7 | 16 | 2 | 2 | 1 |
Charr | 11 | 4 | 13 | 2 | 2 | 1 |
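The counts in Table 1 can be summarised with a short sketch. The values are transcribed directly from the table, while the names `LINEAGES`, `TABLE1`, `total_genes` and `richest` are hypothetical helpers introduced only for illustration.

```python
LINEAGES = ("U", "Z", "L", "S", "H", "P")

# Gene counts transcribed from Table 1; tuple order follows LINEAGES.
TABLE1 = {
    "Northern pike": (7, 5, 4, 1, 1, 0),
    "Atlantic salmon": (8, 7, 12, 6, 2, 1),
    "Brown trout": (7, 7, 25, 3, 2, 1),
    "Rainbow trout": (7, 6, 14, 2, 2, 1),
    "Coho salmon": (6, 5, 14, 2, 2, 1),
    "Sockeye salmon": (5, 5, 14, 2, 2, 1),
    "Chinook salmon": (9, 7, 16, 2, 2, 1),
    "Charr": (11, 4, 13, 2, 2, 1),
}


def total_genes(species: str) -> int:
    """Sum the MHC class I genes of one species across all six lineages."""
    return sum(TABLE1[species])


def richest(lineage: str) -> str:
    """Return the species with the highest count for a lineage (first one listed on ties)."""
    idx = LINEAGES.index(lineage)
    return max(TABLE1, key=lambda species: TABLE1[species][idx])


for species in TABLE1:
    print(f"{species}: {total_genes(species)} MHC class I genes in total")
print("Most L lineage genes:", richest("L"))
```

Note that these totals count all annotated loci, including pseudogenes, and so do not indicate the number of functional genes.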
Only Atlantic salmon has a duplicate annotated U lineage gene in the MHCIa region, denoted ULA, a gene that lacks the transmembrane domain (Additional files 2–4) [30]. We know that the UBA loci from Atlantic salmon, Rainbow trout, Brown trout and Sockeye salmon are classical MHCI loci with considerable polymorphism [18, 20, 31–34]. There are currently 48 Atlantic salmon and Rainbow trout UBA alleles registered in the IPD-MHC database [35], while 31 and 34 alleles have been defined in Brown trout and Sockeye salmon, respectively. The polymorphic content of UBA loci from Coho, Chinook and Charr remains undetermined.
MHC class I gene richness is most pronounced in the salmonid MHCIb regions, with Brown trout and Salvelinus having four U lineage genes surrounding the TAPBPb and PSMB8b genes (Fig. 2). Rainbow trout has three annotated U lineage genes in this region, with an additional fourth Onmy-UFA pseudogene reported previously [12]. Previous studies have shown that Rainbow trout and Atlantic salmon MHCIb regions contain non-classical MHC genes, displaying low polymorphism and more restricted expression patterns than their classical UBA counterparts [11, 12]. Sockeye, Chinook and Coho salmon all have two annotated U lineage genes in this region. This region thus resembles the three original MHCI genes found on Northern pike Chr. 10.
Salvelinus has two additional unplaced scaffolds containing U lineage genes, all clustering with UBA sequences (UXA, UZA1/2; Additional files 3 & 4). Their origin and location are unknown, but as the sequenced genome does not originate from a double haploid animal, they could be allelic variants of non-classical U lineage genes or assembly artefacts. Chinook salmon also has two additional U lineage genes residing on unplaced scaffolds, here denoted Onts-U1 and Onts-U2. Onts-U1 seems to be a pseudogene with sequence identity to Onts-UCA. Onts-U2 is a duplicate of the Onts-UEA gene sequence and most likely represents an assembly artefact, as the Chinook salmon genome originates from a double haploid.
MHCIb regions also contain a unique UGA gene that is present in all analysed salmonids, located in between the SLC39A7b and RING2Ab genes (Fig. 2). Chinook salmon lacks an annotated UGA gene, although there are expressed Chinook sequences supporting a functional UGA locus (e.g. GGDU01219126.1). The gene denoted UGA in Northern pike (Additional file 4) is not an ortholog of the salmonid UGA genes, so the salmonid UGA gene represents a gene duplication that translocated to the MHCIb region after salmonids split from pike. UGA lineage sequences show strongly supported clusters in alpha 1 and alpha 2 domain phylogenies, while the alpha 3 domain sequences are more dispersed.
Based on location and phylogenetic clustering, the UEA gene seems to have existed in a primordial salmonid, but was then lost in Atlantic and Sockeye salmon (Figs. 2 & 3, Additional files 3 & 4). All UEA alpha domain phylogenies show strongly supported clusters. Salmonid UCA and UDA gene sequences also form strongly supported clusters in the alpha 1 and alpha 2 domain sequence phylogenies, suggesting they originate from a salmonid ancestor. Duplications from a single primordial UC/DA gene to multiple UCA and UDA genes seem to have occurred individually in the Oncorhynchus and Salmo lineages based on the alpha 2 domain phylogenies, as well as in each individual species (Fig. 3, Additional file 3). The gene sequences defined as UFA in Charr and Brown trout do not cluster in phylogenies, so they represent within-species gene duplications. However, the UFA pseudogene previously reported in Rainbow trout clusters with the UFA sequence from Brown trout (data not shown), so this gene originated in a salmonid ancestor.
We have previously shown that the Atlantic salmon MHCIb region contains haplotypes with varying numbers of non-classical Sasa-UCA and Sasa-UDA genes [36]. In the Atlantic salmon genome, this haplotypic variation is more pronounced, with an 8.6 Mb region separating the Sasa-UCA pseudogene from two additional UCA and UDA genes, as opposed to the 30 kb separating the UDA and UCA genes in the previously sequenced BAC (Genbank FJ969490). Whether this represents haplotypic variation or is a study artefact remains unknown, but Brown trout, the closest relative of Atlantic salmon, does not show this UCA/UDA gene duplication 10 Mb upstream, suggesting that this MHCIb haplotype may be Atlantic salmon specific.
Oncorhynchus species also have a unique U lineage pseudogene located approximately 10 Mb upstream of their UCA genes, which we here name UMA. These Oncorhynchus regions are unique, as they do not contain the same genes as those surrounding the UDA gene in the Atlantic salmon genome 8.6 Mb upstream of the major MHCIb region. Nor does this region resemble the UIA region found in Medaka, where there is approximately 14 Mb between the classical UAA/UBA genes and a UIA gene [19]. This Oncorhynchus UMA gene thus seems to be a unique gene duplication that occurred in the Oncorhynchus lineage only. Looking at other teleost species, two S lineage genes in Astyanax mexicanus are located in an unplaced region in between the MYO1G and SGK1 genes, a region also containing the Oncorhynchus UMA region genes EYA3, CDK5R1 and XKR8.3. However, phylogeny does not support any relationship between these Mexican tetra S lineage sequences and the Oncorhynchus UMA sequences (data not shown). A plausible explanation would then be that Atlantic salmon and an ancestral Oncorhynchus species have experienced unique but similar translocations of the UCA/UDA and UMA genes.
To summarize, evolutionary orthology between individual Northern pike and salmonid MHCI gene sequences is not apparent in our phylogenies. The seven U lineage pike genes arose through duplications in pike after the split from salmonids. A similar expansion of U lineage genes in the MHCIb region has occurred in a salmonid ancestor, where a primordial UBA gene has duplicated and diversified into the non-classical genes found in the MHCIb regions today. Such a species-specific duplication of classical genes into diversified non-classical genes has also occurred in some tetrapod species [4, 37].
Z lineage evolution
In addition to the six U lineage genes, Northern pike also has five Z lineage genes on linkage group 10 (Fig. 2, Table 1, Additional files 2–4). In comparison, the salmonid MHCIa and Ib regions all have two to four Z lineage genes per region. Due to the unique position of the Salmo ZAA gene residing in the MHCIa region, we chose to reserve the ZAA name to reflect a location in between the VHSVa induced protein and ATF6a. The remaining sequences are named ZBA through ZDA regardless of phylogenetic clustering. Of the pike and salmonid Z lineage genes, only Onmy-ZDAb and Satr-ZDAb seem to be pseudogenes.
Phylogenetic trees of the entire mature extracellular amino acid Z lineage sequences display two well-supported clades, each with two sub-clades. Surprisingly, all Northern pike Z lineage gene sequences cluster together with strong bootstrap support, suggesting they are within-species gene duplications (Fig. 4). A similar strongly supported clustering of pike Z lineage sequences is also seen when we perform phylogenies of individual extracellular domains (Additional file 3). Based on the two to four Z lineage gene duplicates identified in salmonid MHCIa and MHCIb regions (Fig. 2), one would have expected some orthology between pike and salmonid gene sequences.
The first clade (Fig. 4, clade 1) consists of MHCIa region sequences, while the second clade (Fig. 4, clade 2) consists of MHCIb region sequences, suggesting the Z lineage genes evolved independently in the MHCIa and MHCIb regions (Fig. 4, Additional files 2 & 3). Clade 1 gene sequences are further divided into two subclades, one containing Oncorhynchus gene sequences (clade 1.1) and the other with Salmo and Salvelinus gene sequences (clade 1.2). Clade 1.1 suggests that one original Oncorhynchus gene expanded to the three ZBAa, ZCAa and ZDAa genes present in this region today, where ZDAa is a more recent duplicate of ZBAa. Although not as strongly supported, Salmo and Salvelinus Z lineage Ia genes within clade 1.2 also seem to be within-region duplicates of one common ancestor. It seems that the evolutionary process has repeated itself, with the ZBAa and ZDAa genes being more recent gene duplications while ZCAa is an older gene duplication. The Salmo ZAAa gene is also a more recent duplication of the ZBA or ZDA gene. Charr MHCIa Z lineage sequences show a dual clustering, with the Saal-ZBAa sequence clustering with Oncorhynchus while the Saal-ZCAa sequence clusters with Salmo ZCAa sequences.
Sequences originating from the MHCIb region split into two strongly supported subclusters (Fig. 4, subclades 2.1 and 2.2), and in this region Oncorhynchus and Salmo Z lineage genes share an evolutionary history. Subclade 2.1 contains ZCAb sequences while subclade 2.2 contains ZBAb sequences. The only exception is the Atlantic salmon sequences, where Sasa-ZBAb and Sasa-ZCAb represent a more recent gene duplication (Figs. 2 & 4). Sasa-ZBAb seems to be the only soluble Z lineage molecule, lacking the transmembrane region [36].
To summarize, the salmonid MHCIa and MHCIb Z lineage genes are not orthologs of the Northern pike Z lineage genes. Instead, the salmonid Z lineage genes have experienced unique gene duplications in the two duplicated regions, sharing an evolutionary history in the MHCIb region but evolving independently in Oncorhynchus and Salmo species in the MHCIa region. Potentially, transposable elements enabling these duplications were already present in Northern pike. As seen in other teleosts [5], the eight peptide anchoring residues are also conserved in salmonid Z lineage sequences (data not shown).
Evolution of L lineage genes
Northern pike has four L lineage genes dispersed on Chr. 2, 15 and 20, and salmonids have orthologs of the pike genes on Chr. 2 and Chr. 20 based on phylogeny and regional orthology (Table 1, Fig. 5, Additional files 1 & 4). Nomenclature is mostly based on phylogenetic clustering with previously identified L lineage gene sequences [5, 14], as exemplified by the LGA gene sequences, which form a strongly supported phylogenetic cluster. L lineage genes have expanded greatly in salmonids, with 12–13 genes in Atlantic salmon and Charr and 25 genes in Brown trout. Many Charr L lineage genes seem to be pseudogenes, while the remaining species have 9–21 bona fide genes.
The previously published Rainbow trout LAA gene [14] is also found in the other salmonid species, whereas this gene was lost in Northern pike (Additional file 2). Fragments of this gene are found on Atlantic salmon homeolog chromosomes 13 and 15 and in ortholog regions in the other salmonids (Additional file 1), flanked by ANKS1A and SARG genes. Only Rainbow trout and Sockeye salmon seem to have bona fide LAA genes, whereas the LAA genes from the other species are pseudogenes. The sequences are quite distant from the remaining L lineage sequences and form the base of the phylogenetic tree (Additional file 3).
Another older L lineage gene previously described in Atlantic salmon, LIA [5], has orthologs in all species including Northern pike (Additional files 1–4). LIA gene sequences are also quite old, forming a strongly supported branch quite basal in the phylogenetic tree. Only the Charr LIA gene seems to be a pseudogene with an internal stop codon. This LIA gene is flanked by VWA8 and F5 in all species. Although salmonid LIA regions are orthologous to Northern pike Chr. 16, the pike LIA gene resides on Chr. 20, suggesting a translocation in a salmonid ancestor. The salmonid homeolog chromosomes also hold L lineage genes in most species, represented by the LLA and LJA genes, which are mostly pseudogenes. Although not strongly supported, the Northern pike L lineage gene on Chr. 15, here called Eslu-LPA, clusters with the LIA gene sequences and is most likely a gene duplication specific to Northern pike. A similar unique gene duplication is seen for the Atlantic salmon Sasa-LKA gene, with no orthologous region in other salmonids or Northern pike.
Salmonid LDA gene sequences represent another strongly supported clade, but also cluster with the remaining gene sequences from Northern pike and salmonids (Additional files 2–4). This LDA gene is not present in pike and is located on only one of the salmonid homeologs, suggesting it translocated to this region after the 4WGD event. The LDA gene is flanked by IRAK1BP1 and IL17RD genes in all species.
Salmonid orthologs of the Northern pike Chr. 2 genes, here defined as Eslu-LBA and Eslu-LCA, have expanded considerably, with Brown trout being the most extreme with fifteen L lineage genes on Chr. 12 (Fig. 5, Additional file 1). FAH and CTXND1/ARNT2 genes, flanking the two pike L lineage genes on chromosome 2, are also present in orthologous regions represented by Atlantic salmon Chr. 11 and Chr. 26 [26]. Most likely due to regional complexity, clustering genes from Coho, Chinook, Sockeye and Charr all reside on unplaced scaffolds. Gene expansions have occurred locally after the 4WGD. For instance, Atlantic salmon Chr. 11 with the two duplicate Sasa-LCA genes is a homeolog of Rainbow trout Chr. 26 containing nine L lineage genes. Brown trout Chr. 12 and an unplaced Chinook scaffold both display a similar L gene expansion, with twelve and eight L lineage genes respectively. Most of the Chinook genes on this unplaced scaffold seem to be pseudogenes with internal stop codons, while most of those on Rainbow trout Chr. 26 seem to be bona fide genes. LEA/LMA genes as well as LFA genes also reside in strongly supported clusters, displaying a shared evolutionary history within each of these clusters.
To summarize, the L lineage genes have expanded dramatically in some salmonids, with Brown trout being the most extreme with 25 L lineage genes. The other salmonids seem to have between five and fourteen functional L lineage genes. A structural investigation of L lineage sequences found them to be able to bind quite hydrophobic structures, possibly analogous to mammalian CD1 molecules [5]. Our understanding of L gene function has since advanced with the study by Edholm and co-workers [23], showing that L lineage genes display different responses upon stimulation. Six Atlantic salmon L lineage genes were included in their study, where Sasa-LIA responded to a single-stranded RNA virus but not to a bacterial challenge. Sasa-LIA and Sasa-LGA both responded to stimulation by type I interferon A, while Sasa-LHA did not. Instead, Sasa-LHA responded to a variety of viral and bacterial TLR ligands. These results show that duplicate L lineage genes have acquired a variety of functional roles in protection against pathogens. In particular, Brown trout, with 21 potentially bona fide genes, may hold more surprises when it comes to the functional diversity of this MHCI lineage.
Evolution of S, H and P lineage genes
S lineage genes have previously been described in many teleosts [5]. An S lineage gene is also present in Northern pike on Chr. 1 (Table 1, Additional files 2 and 3). Most salmonids have duplicate S lineage genes on both homeologs, where the SBA gene has been silenced in a primordial salmonid (Table 1, Additional files 2 & 4). The gene is mostly flanked by VWA5A and CIPC/AKT2 genes in both regions. Atlantic salmon has six S lineage genes, all residing on unplaced scaffolds. Three of these six Atlantic salmon genes seem to be pseudogenes. In a previous study, we sequenced a bacterial artificial chromosome (BAC) clone originating from Chr. 09, which contained one SAA gene in addition to the flanking VWA5 and AKT2 genes [36]. We did not find other BACs positive for the SAA probe, so potentially there are individual differences in the number of SAA genes in Atlantic salmon.
A fifth MHCI lineage described in teleosts is the P lineage, which has expanded to 24 genes in the pufferfish Fugu [5]. Remnants of this P lineage are lacking in Northern pike, while all salmonid P lineage genes have been silenced (Table 1, Additional files 2 & 4). Only one homeolog has remnants of this P lineage gene, suggesting it has been deleted in the duplicated region. The PAA gene is surrounded by PPP1R12A_like and Immunoglobulin light chain (Ig-L) genes. We previously found an IgL gene linked to a UIA gene in Medaka and to Z lineage genes in stickleback [5]. IgL genes are also found linked to the shark MHC region, suggesting they were present in the primordial MHC region [38].
We recently found a sixth MHC class I lineage in teleosts, which we denoted the H lineage [13]. One HAA lineage gene is present in Northern pike, and all salmonids studied here have HAA and HBA genes on homeologs to this pike HAA gene on Chr. 3 (Table 1, Additional files 1, 2, 4). All regions have TOX and PPP1R7 genes flanking the H lineage gene. The HAA genes seem functional in all species, while the HBA gene is a pseudogene at least in Salmo species. In Coho and Chinook, there are expressed reads matching the HBA gene (GGDU01537164.1, GDQG01022515.1), suggesting the homeolog HBA gene has retained a function in some species. The fact that H lineage sequences lack the alpha 3 domain and have a cytoplasmic domain that is highly conserved also between distant teleost species suggests that teleost MHCI may have a broader functional diversity than previously envisioned. Mammalian equivalents with such a molecular structure are the ULBP/RAET genes, which interact with the NKG2D receptor upon stress or infection [39]. Whether the salmonid H lineage molecules have a similar function remains to be seen.