Alternative splicing and genetic variation of MHC-E: Implications for rhesus cytomegalovirus-based vaccines

doi:10.21203/rs.3.rs-1633558/v1

This work is licensed under a CC BY 4.0 License

Journal publication: published 19 Dec, 2022 in Communications Biology.

Rhesus cytomegalovirus (RhCMV)-based vaccination against simian immunodeficiency virus (SIV) elicits MHC-E-restricted CD8+ T cells that stringently control SIV infection in ~55% of vaccinated rhesus macaques (RM). However, it is unclear how accurately the RM model reflects HLA-E immunobiology in humans. Using long-read sequencing, we identified 16 Mamu-E isoforms, and all Mamu-E splicing junctions were detected among HLA-E isoforms in humans. We also obtained the complete Mamu-E genomic sequences covering the full coding regions of 59 RM from a RhCMV/SIV vaccine study. The Mamu-E gene was duplicated in 32 (54%) of 59 RM. Among four groups of Mamu-E alleles, three ~5% divergent full-length allele groups (G1, G2, G2_LTR) and a fourth monomorphic group (G3) with a deletion encompassing the canonical Mamu-E exon 6, the presence of G2_LTR alleles was significantly (p = 0.02) associated with the lack of RhCMV/SIV vaccine protection. These genomic resources will facilitate additional MHC-E targeted translational research.

The major histocompatibility complex (MHC) plays an essential role in host immune regulation. MHC is constitutively expressed in nearly all nucleated cells and harbors significant genomic complexity^1–4. Assigned with the critical role of distinguishing self from non-self, MHC Class I and II genes contain genetic variations that have been associated with hundreds of autoimmune and infectious diseases in human^5–7. Rhesus macaques (RMs) have been an important nonhuman primate model for the study of many of these human diseases⁸ and are critical for pre-clinical trial vaccine development for protection against human immunodeficiency virus (HIV) using SIV infection in RMs^{9, 10}. RMs also serve as vaccination models against SARS-CoV-2¹¹, Mycobacterium tuberculosis^{12, 13}, and influenza A virus¹⁴. Intriguingly, the genetic architecture and polymorphisms of MHC class I and II genes differ significantly among primates¹⁵, posing a challenge for translational interpretation of non-human primate models in general.

Among primate MHC Class I genes, the MHC-E locus has long been considered the most conserved^{16, 17} and is believed to exist without duplication in both RM and human¹⁸. As a non-classical MHC molecule, MHC-E functions in both innate and adaptive immunity by interacting with T cells in addition to NK cells¹⁹. This unconventional role of MHC-E in T-cell immunity is conserved between humans and RMs²⁰. Furthermore, human leukocyte antigen (HLA)-E, the human MHC-E ortholog, possesses the ability to present both self- and pathogen-derived sequences^{21, 22}, and its surface expression can be induced by human cytomegalovirus (hCMV)²³. Together, these unique characteristics make MHC-E a crucial target for ongoing CMV-based vaccine development^24–26.

In a recent rhesus RhCMV/SIV vaccine study, 55% of RMs were protected from a highly pathogenic strain of SIV^{9, 10}. It was later shown that this protection was driven by RM MHC-E (Mamu-E)-restricted peptide antigen recognition by CD8⁺ T cells^{27, 28}. Furthermore, Mamu-E intracellular transport is now known to be necessary for vaccine efficacy and is driven by the genetic architecture of RhCMV²⁹. We also recently showed that an Interleukin-15 response signature in whole blood predicts RhCMV/SIV vaccine efficacy³⁰, but it is still not clear if Mamu-E genetic diversity might also contribute to differences in RhCMV/SIV protection outcome.

Evidence suggests MHC-E expression and function may be regulated by alternative splicing. The most recent RefSeq annotations for HLA-E and Mamu-E contain a single transcript with the canonical MHC Class I exon/intron splicing, originally described in ³¹, and three additional HLA-E transcript variants predicted using EST and mRNA support. In contrast, HLA-G, a separate MHC Class Ib gene, has seven known transcript variants: four membrane-bound and three secreted in soluble form³². Mamu-AG, the RM ortholog, also shares this extensive alternative splicing³³. There is an increasing body of evidence linking soluble (s)HLA with downregulated T cell responses³⁴ and a variety of immune disorders^35–37. These sHLA molecules can result from surface shedding, cleavage by metalloproteinases, or secretion via alternative splicing³⁸. While a secreted sHLA-E transcript has not yet been documented, there is some support from western blotting of endothelial cells³⁹. A more recent study reported an increase of sHLA-E after Japanese Encephalitis Viral infection but did not determine the source⁴⁰. Overall, however, the documentation of this HLA-E alternative splicing is sparse, and nothing to our knowledge has been reported for Mamu-E.

Given the extreme genomic complexity of the rhesus MHC region, in this study we aimed to expand genomic resources for Mamu-E using long-read sequencing of RM RNAs and DNAs. We characterized Mamu-E and HLA-E alternatively spliced transcripts, determined their functional capacities, and examined the extent to which their alternative splicing repertoires are conserved. Separately, we interrogated the genetics of RMs from a RhCMV/SIV vaccine study, identifying for the first time extensive Mamu-E gene duplications. Finally, we show the potential of these resources by examining the relationship between the Mamu-E spliceosome, genetics, and vaccine-induced immunity in whole blood during the pre-challenge phase of a RhCMV/SIV vaccine study³⁰. These resources will provide a foundation for more comprehensive research of MHC-E in rhesus macaques and inform translational research of CMV-based vaccines for use in humans.

The gene expression of Mamu-E is regulated by extensive alternative splicing that is conserved among HLA-E isoforms

To accurately define Mamu-E transcript structures, we aimed to use high-quality, full-length transcript sequences obtained by long read transcriptome sequencing⁴¹. Since the sequences of MHC genes are very similar, it was critical that we use long read sequencing to avoid transcript sequence assembly. In our previous work⁴², using PacBio transcriptome sequencing (the Iso-Seq method) we obtained over 2.8 million Circular Consensus Sequencing (CCS) reads from four different rhesus macaque tissues (Supplementary Table 1). About 33% of these CCS reads were full length (i.e., contained the 5’ cDNA primer, 3’ cDNA primer, and polyadenylation tail), each representing a single transcript molecule⁴³. All CCS reads (full-length (FL) and non-full length) were initially clustered without a genome reference and subsequently aligned to a RM MHC Class I region reference sequence, which was previously assembled using BAC cloning (Methods). These CCS read groups were further clustered and curated, yielding an initial set of 13 unique Mamu-E isoforms (shown in Fig. 1 as Mamu-E1-10, 12, 14, and 16). The canonically spliced Mamu-E isoform (Mamu-E1) had the strongest FL CCS read support of all isoforms (92 of 123, 74.8%), while other isoforms had FL support ranging from 1 to 13 (Supplementary Table 2). Collectively, these isoforms supported a shorter 5’ UTR than previously annotated and a significantly longer 3’ UTR, and this was also supported by mRNA-seq data from RM whole blood samples described later (Supplementary Fig. 1). These isoforms exhibit several new splicing events largely concentrated at the 3’ end of the transcript, including exon skipping, alternative 3’ UTR splicing, a retained intron, and a novel exonization event between exons 5 and 6 (Fig. 1), all of which had canonical splice signals. Many of these isoforms were also predicted to encode protein sequences with different domain configurations (Fig. 1). 
For example, while nearly all isoforms (12 of 13) encode the canonical Alpha 1, 2 and 3 domains, many isoforms skip the transmembrane domain and have diverse cytoplasmic tails introduced by alternative 3’ UTR splicing. Together, these results indicate that complex alternative splicing of Mamu-E yields proteins with potentially diverse functions.

Separately from our Mamu-E analysis, we recovered 41 unique HLA-E isoforms collectively supported by 2,050 FL CCS reads from a human PacBio Iso-Seq dataset from 60 myelogenous patient samples (Supplementary Fig. 2). Interestingly, all new splicing patterns in Mamu-E isoforms were found among these HLA-E isoforms, with 4 perfect isoform matches. Additionally, the human 3’ and 5’ UTRs were of comparable length to those in RM. A similar pattern was also observed among HLA-E isoforms, where most FL reads (1,788 of 2,050, 87%) supported the canonical splicing configuration. When resampling HLA-E isoforms using sequencing depth commensurate with Mamu-E (i.e. 123 FL CCS reads), 12.6 isoforms were detected on average, suggesting HLA-E and Mamu-E may have similar spliceosome complexities. Despite the greater number of isoforms detected in human (41 vs. 13), only 25 contained the Alpha 1 and 2 domains needed for peptide binding (Supplementary Fig. 2, Supplementary Table 3). There were also many retained intron events detected between exons 1 and 2 (19 of 41 isoforms) of HLA-E compared to Mamu-E (1 of 13). While the retained intron led to a frameshift and a premature stop codon in the Mamu-E isoform, this did not affect the reading frame in human isoforms (Supplementary Fig. 2). Further, while 3’ UTR splicing diversity was evident in human, it did not impact the cytoplasmic tail, as the HLA-E open reading frame (ORF) terminates before the last splice junction (i.e. in exon 7); whereas Mamu-E isoforms terminate shortly after the junction (i.e. in exon 8) (Fig. 1, Supplementary Fig. 2). Upon closer inspection, we determined that this was due to a difference in the exon 7 reading frame (data not shown).
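
The depth-matched comparison described above can be illustrated with a small resampling sketch. Only the totals (41 isoforms, 2,050 FL reads, 1,788 canonical reads, and a target depth of 123) come from the text. The split of the remaining 262 reads across the 40 minor isoforms is a hypothetical placeholder, and drawing reads without replacement is an assumption about how the resampling was performed, so the printed value is not expected to reproduce the reported average of 12.6.

```python
import random

def mean_isoforms_at_depth(read_counts, depth, n_iter=1000, seed=0):
    """Average number of distinct isoforms observed when `depth` reads are
    drawn without replacement from the pool of full-length reads."""
    rng = random.Random(seed)
    # Expand per-isoform counts into one isoform label per read.
    pool = [iso for iso, n in enumerate(read_counts) for _ in range(n)]
    total = 0
    for _ in range(n_iter):
        total += len(set(rng.sample(pool, depth)))
    return total / n_iter

# Canonical isoform: 1,788 reads (from the text).
# Minor isoforms: HYPOTHETICAL flat split of the remaining 262 reads.
hla_e_counts = [1788] + [7] * 22 + [6] * 18

print(mean_isoforms_at_depth(hla_e_counts, depth=123))
```

With real per-isoform counts substituted for the placeholder vector, the same function gives the expected isoform count at any sequencing depth.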

Several Mamu-E isoforms with little FL CCS read support (8 of 13) captured new splicing patterns but failed to recover the complete Mamu-E 5’ end to varying degrees (Fig. 1). Given the consistency of the HLA-E and Mamu-E exon structures, we inferred 5’ ends for these incomplete isoforms. Next, we designed PCR assays to target the unique splicing features of these inferred isoforms and isolated the resulting bands for Sanger sequencing (Supplementary Table 4, Methods). The 5’ ends of most isoforms (6 of 8) were confirmed using this approach, and unexpectedly we identified three new isoforms (Fig. 1; Mamu-E11, 13, and 15). These isoforms match PacBio-derived isoforms (Mamu-E10, 12, and 14, respectively), but lack a retained intron between exons 4 and 5 (Fig. 1).

We hypothesized that this complex alternative splicing might be in part associated with transposable elements (TEs) in the Mamu-E locus. TE sequences are known to permeate the MHC region in RM⁴⁴ and human^{45, 46}, and they are believed to play a significant role in human disease^{47, 48}. Alu elements, a type of transposon, have a strong connection with transcriptional regulation, as they can influence alternative splicing^{49, 50} and function as enhancers⁵¹. We screened the Mamu-E locus and upstream and downstream genomic regions, finding eight elements on the sense strand and two on the antisense strand (Fig. 1). Interestingly, all eight of these TEs were found in the HLA-E locus in similar locations (data not shown), suggesting these were translocated prior to the split between old and new world monkeys. Two Alu elements (AluJb and AluY) were found directly upstream of the 5’ UTR (Fig. 1), suggesting a possible role in transcriptional activation. We also detected an Alu element (AluSx3) on the antisense strand between exons 5 and 6, coincidentally where six isoforms (Mamu-E8-13) have novel splicing acceptor/donor sites that result in exons partially spanning the Alu element. Another AluY and two other TEs were found in the 3’ UTR, suggesting that alternative splicing in this region might be influenced by and/or influence their function. Lastly, mammalian-wide interspersed repeat (MIR)b was found directly downstream of the transcriptional termination site with a fully intact AluYf1 element directly adjacent to it (Fig. 1). Like Alu elements, MIRs can function as enhancers to promote tissue-specific gene expression⁵² and there is also evidence that they can be transcribed in human⁵³. Taken together, the presence of complex splicing and deluge of TEs indicate that the Mamu-E and HLA-E loci are under strong transcriptional regulation.

Mamu-E gene duplications are common

Mamu-E has long been known to be polymorphic¹⁷, currently with 33 alleles in the Immuno Polymorphism MHC Database (IPD-MHC)⁵⁴. To date, it has not been investigated whether this polymorphism has any connection with Mamu-E-restricted antigen presentation in response to RhCMV/SIV vaccination. We obtained genomic DNAs from 59 of 60 animals from four RhCMV/SIV vaccine groups, three previously described in ³⁰, and used PacBio long amplicon analysis (LAA) to target and sequence Mamu-E allele sequences (Methods). Across 59 animals, we recovered 152 allele sequences (Supplementary Table 5), assigned to 17 IPD-MHC database alleles. These alleles were composed of four groups: three full-length ~5% divergent groups (G1, G2, G2_LTR) and a fourth monomorphic group missing the canonical Mamu-E exon 6 and the surrounding intronic sequence harboring an antisense AluSx3 element (G3) (Figs. 1, 2a-b). G2_LTR alleles are named for the ~700 bp solo LTR5B inserted approximately 20 bp after the expected start of the amplified sequence 5’ end (e-value ~10^−83) (Fig. 2a). G1 alleles were detected in all animals and exclusively in 27 of 59 (46%), while additional alleles from G2, G2_LTR, and G3 were found in 6, 7, and 20 animals, respectively (Table 1).

G2, G2_LTR, and G3 alleles were also found to be in complete linkage with Mamu-E*02:02, Mamu-E*02:11, and Mamu-E*02:04 (G1 alleles), respectively. We confirmed the presence of multiple Mamu-E loci in 1 of 4 selected animals (animal ID # Rh28808) using fosmid isolation followed by PacBio DNA sequencing (Methods, Supplementary Table 6). The fosmid sequence from this animal contained both the G3 allele (E*02:13V-short) and the E*02:04 allele separated by ~20 kb, supporting the linkage we observed between these alleles across multiple animals. No animals were found to have alleles from all four groups and/or > 2 alleles from any of the groups, with the exception of one animal (animal ID # Rh29659). In this animal, we recovered a third G1 allele (Mamu-E*02:03, also found in 11 other animals), which was not detected in our later expression analysis (data not shown). The presence of additional MHC-E alleles in the same animals was not associated with vaccine group (Fisher’s exact test: p = 0.569) or protection outcome (Fisher’s exact test: p = 1) (Methods, Table 1), where the E group was excluded as there was no protection observed among its animals. However, all 7 animals from groups O, S, and X with G2_LTR alleles were not protected, and the association with protection outcome was statistically significant (Fisher’s exact test: p = 0.02, Table 1).
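
For readers who want to retrace this kind of association test, the following is a minimal sketch of a two-sided Fisher's exact test on a 2x2 table, written with the standard library only. The first row (7 animals carrying G2_LTR alleles, none protected) comes from the text. The second row is a hypothetical placeholder because Table 1 is not reproduced here, so the printed p-value is illustrative and is not expected to match the reported p = 0.02.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum all tables that are as likely as, or less likely than, the observed one.
    return sum(p for p in map(prob, range(lo, hi + 1)) if p <= p_obs * (1 + 1e-9))

# Columns: protected, not protected.
# Row 1 (from the text): animals with G2_LTR alleles -> 0 protected, 7 not.
# Row 2 (HYPOTHETICAL): animals without G2_LTR alleles; use Table 1 counts instead.
p = fisher_exact_two_sided(0, 7, 20, 18)
print(f"p = {p:.3f}")
```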

Next, we investigated the segments driving the sequence differences among allele groups by separately analyzing the exons, introns, and the sequence recovered upstream of 5’ UTRs. We observed that G2_LTR alleles significantly diverged from all other alleles even when removing the inserted LTR5b sequence (Fig. 2b-c). G1 alleles tended to cluster together in the 5’ upstream region, while a small subset clustered with G2 alleles and G3 alleles shared some similarities with both clusters (Fig. 2c). We found that G1 and G2 alleles were more similar in exons 1 and 2, while both G2_LTR and G3 alleles significantly diverged (Fig. 2c). Interestingly, all 4 allele groups diverged in exon 3 (Alpha 2), intron 3, and exon 4 (Alpha 3) (Fig. 2c), suggesting these allele groups may function differently.

Mamu-E expression in whole blood is dominated by a single locus

To explore the potential functional divergences among duplicated Mamu-E alleles, we sought to determine if Mamu-E genes of these allele groups are similarly expressed. We examined Mamu-E gene expression using mRNA-seq analysis of whole blood samples collected from the same animals during the pre-challenge phase of a RhCMV/SIV vaccine study before and after the prime and boost phases (Methods). Nine samples from each of the 59 animals (531 total) were sequenced, yielding ~ 14.8 billion reads (~ 27.8 million reads per sample). For each animal, reads were aligned to the MHC Class I/II BAC reference with the Mamu-E locus masked and animal-specific Mamu-E allele sequences as separate contigs (Methods).

We calculated the relative expression of allele groups in all animals expressing at least 1 allele from more than one group based on our genomic analysis (Table 1). The proportions of expression from each allele group were fairly stable throughout the pre-challenge phase, with G2, G2_LTR, and G3 alleles composing approximately 25%, 10–15%, and 5% of expression, respectively (Fig. 3a). While G1 alleles composed most of the Mamu-E expression, both the relative (Fig. 3b) and absolute (Fig. 3c) G1 allele expression levels varied contingent on the extra allele groups present in the same animals. For example, when G2_LTR alleles were present, the absolute G1 allele expression levels were about 30% higher (Fig. 3c). However, when G3 alleles were present, the absolute G1 allele expression levels were about 30% lower.

RhCMV/SIV vaccination elicits MHC-E-restricted T-cell responses, so we next sought to determine the effect of vaccination on the expression of these alleles. We observed that in animals expressing alleles from > 1 group, the expression of allele groups was strongly correlated (Fig. 3d). When examining total Mamu-E expression (i.e. pooled allele expression), we found that Mamu-E expression increased significantly following vaccination prime and boost, regardless of protection outcome (Fig. 3e), suggesting that RhCMV/SIV vaccination may influence the functions of Mamu-E.

We also examined the relative expression of alleles expressed within the same group, finding that we could reliably recover allele-specific read counts even with little polymorphism between alleles (Supplementary Fig. 3a). We also observed fairly even allelic coverage within loci regardless of allele group that was also stable throughout the pre-challenge phase (Supplementary Fig. 3b-c). One exception to this was in one animal (animal ID # Rh28835 from the S group), where one G1 allele was found to be expressed substantially less than the other (Supplementary Fig. 3b). Interestingly, the lowly expressed G1 allele was the only allele among all animals with an insertion, which incidentally resulted in a frameshift and premature stop codon. These results indicate that G1 alleles tend to be expressed at relatively similar levels to each other and several times higher than G2 and G3 alleles.

Confirmation and extension of Mamu-E G1 alleles using mRNA-seq based haplotype phasing

We independently assessed the accuracy of our Mamu-E allele sequencing at per base level and captured additional 3’ UTR variation using the collected whole blood mRNA-seq data. Also, as shown in Figs. 1, 2a, Mamu-E transcribes a much longer 3’ UTR than the canonical annotation. This long 3’ UTR was not covered in our allele genomic sequencing which was designed to target coding regions (Fig. 2a). We focused this mRNA-seq based analysis on alleles in the G1 group since their expression was dominant, making this effort feasible (Fig. 3a-b).

We first assessed the depth of mRNA-seq read coverage of Mamu-E and the ability to capture Mamu-E polymorphism accurately using short read mRNA-seq data. We observed ~ 4–5% of total reads mapped to the Mamu Class I and II complexes and 10,000 per base Mamu-E coverage (Supplementary Fig. 1). We also found that recovery of the transmembrane domain region polymorphisms was intractable likely due to greater conservation of this region with other MHC genes using a kmer-based strategy (Supplementary Fig. 5a, Methods). Recovery of polymorphisms in 3’ UTR regions harboring TEs was also found to be intractable, leading to their exclusion (Supplementary Fig. 4b). Lastly, low coverage bases proximal to the transcriptional start and termination sites were excluded from this haplotype phasing analysis (Supplementary Fig. 4c).

For the remaining highly confident regions we generated completely contiguous haplotype blocks spanning the entire Mamu-E region, resulting in Mamu-E haplotigs (Methods). Almost all (1,146 of 1,150, 99.7%) heterozygous variant calls were successfully phased for all animals. As expected, we did not observe a lower fraction of read support for the G1 haplotype configuration in animals with additional G2 and G3 alleles (Supplementary Fig. 5a), given the dominant expression of G1 alleles. On average almost 100% of the variants identified by haplotigs derived from mRNA-seq reads were identical to the most similar G1 allele sequences within each animal where they overlap (i.e. excluding the 3’ UTR) (Supplementary Fig. 5b). For each animal, we matched haplotigs against allele sequences, determining that variant phasing was also highly concordant (> 96% of variants) with differences only arising due to mRNA-seq variant calling issues in fringe locations just passing our required per base coverage threshold (Supplementary Fig. 5c). This nearly perfect agreement between these two independent methods (DNA sequencing via PacBio LAA, haplotig recovery via mRNA-seq) shows the extremely high accuracy of sequences we obtained by LAA. We merged these G1 alleles with their matched haplotigs, producing final, complete G1 allele sequences spanning the entire Mamu-E locus including both coding regions and long 3’ UTRs.

Characteristics of Mamu-E G1 allele variants and their associations with RhCMV vaccine protection

Next, we examined the variation recovered across these merged G1 allele sequences, since all animals have at least one copy of G1 alleles and G1 alleles contributed the majority of Mamu-E expression in whole blood samples (Fig. 3a-b). Variants were identified throughout the whole Mamu-E G1 locus, protein coding regions and both UTRs (Fig. 4a-b). Single nucleotide polymorphisms (SNPs) were also found to be non-synonymous, producing a total of 42 unique single amino-acid polymorphisms (SAPs) spread across all protein domains (Fig. 4b). However, none of the SAPs located in the Alpha 1 and 2 domains were located in the predicted B and F pocket key binding sites^{55, 56} (Supplementary Fig. 6). Interestingly, G2 and G3 allele polymorphisms also did not affect key binding sites. However, those in G2_LTR alleles impacted 5 sites across Alpha 1 and 2, indicating that they likely have significantly altered function.

Given the extent of polymorphism recovered among G1 alleles (108 SNPs, 2 insertions, 3 deletions), we decided to more closely inspect individual variants and determine the extent of linkage disequilibrium (LD) within the Mamu-E G1 locus. We found that 55 of 113 (48.7%) passed a minor allele frequency (MAF) threshold of 0.1 (Supplementary Table 7). Variants passing the MAF filter were located throughout the 3’ UTR and coding region of the Mamu-E locus (Fig. 4a). We detected substantial correlation (i.e. LD) of variants both locally and between variants distant from one another in the Mamu-E G1 locus (Fig. 4a). When we grouped these variants based on their correlations, we found 2 major clusters of correlated variants of size 21 and 18 SNPs, with 10 additional clusters of size 4 or less (Fig. 4c). The only indel that passed the MAF filter (a deletion in the transmembrane (TM) domain) was in strong LD with 3 SNPs also in the TM domain (cluster 3 in Fig. 4c). Interestingly, the two large clusters of SNPs were each comprised of a set of 3’ UTR variants along with variants from the Alpha domains, TM domain, cytoplasmic domain, and 5’ UTR (Fig. 4c). We also observed that when represented in a phylogeny, final G1 allele sequences formed 3 major subgroups, with one much larger than the other two (Fig. 4d). However, there was no significant association between these G1 allele subgroups and vaccine group (Fisher’s exact test, p = 0.303) or protection outcome (Fisher’s exact test: p = 0.313). We then cross-referenced the variant clusters identified (Fig. 4c) with the three major G1 allele subgroups identified (Fig. 4d), finding that G1 subgroup 1 alleles contained the major form of both of the two large variant clusters, subgroup 2 contained the minor form of both, and subgroup 3 contained the minor and major form of the first and second, respectively (data not shown).
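
The filtering and grouping steps above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: it takes phased allele sequences encoded as 0/1 per variant, measures correlation as r² between variant pairs, and groups variants by connected components above an r² cutoff. The cutoff of 0.8, the clustering rule, and the toy haplotype matrix are all hypothetical. Only the MAF threshold of 0.1 comes from the text.

```python
def minor_allele_freq(column):
    f = sum(column) / len(column)
    return min(f, 1 - f)

def r_squared(x, y):
    """Squared correlation (LD r^2) between two biallelic variants."""
    n = len(x)
    pa, pb = sum(x) / n, sum(y) / n
    pab = sum(1 for i, j in zip(x, y) if i and j) / n
    denom = pa * (1 - pa) * pb * (1 - pb)
    return (pab - pa * pb) ** 2 / denom if denom > 0 else 0.0

def ld_clusters(haplotypes, maf_min=0.1, r2_min=0.8):
    """Group variants passing the MAF filter into connected LD clusters."""
    n_var = len(haplotypes[0])
    cols = [[h[v] for h in haplotypes] for v in range(n_var)]
    kept = [v for v in range(n_var) if minor_allele_freq(cols[v]) >= maf_min]
    clusters, seen = [], set()
    for v in kept:
        if v in seen:
            continue
        seen.add(v)
        stack, comp = [v], []
        while stack:  # depth-first walk over variants linked by high r^2
            u = stack.pop()
            comp.append(u)
            for w in kept:
                if w not in seen and r_squared(cols[u], cols[w]) >= r2_min:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(comp))
    return clusters

# HYPOTHETICAL toy data: 5 variants (rows) observed on 12 haplotypes.
variant_columns = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0],  # identical to variant 0
    [0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1],  # complement of variant 0
    [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],  # largely independent
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # singleton, fails the MAF filter
]
haplotypes = [list(row) for row in zip(*variant_columns)]
print(ld_clusters(haplotypes))
```

In the toy matrix, variants 0 to 2 are perfectly correlated and form one cluster, variant 3 stands alone, and variant 4 is removed by the MAF filter.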

We then examined the genotypes of animals across each of the variant clusters as well as individual variants, finding that there was no statistically significant association with protection outcome among vaccine groups O, S, and X (Fig. 4e, Supplementary Table 7) or with vaccine groups (Supplementary Fig. 7, Supplementary Table 7). However, we did observe a tendency for protected animals to favor the major form of variant cluster 3 and the minor forms of variant clusters 8 and 9 (p = 0.116, 0.124, 0.116 and FDR = 0.497, 0.497, 0.497, respectively) (Fig. 4e, Supplementary Table 7).

Since cluster 3 harbored 4 variants (including a deletion) in the TM domain, as did variant cluster 9 (a SNP), we examined the differences in hydrophobicity scores across all allele TM domains. We observed a reduction in N-terminal TM hydrophobicity among all G2_LTR and G2 alleles relative to G1 alleles and HLA-E (Fig. 5). We also found that G3 alleles and G1 alleles harboring variant cluster 3 had increased C-terminal TM hydrophobicity relative to other Mamu-E alleles as well as HLA-E, while Mamu-E alleles with cluster 9 had unaltered hydrophobicity (Fig. 5). These results suggest that Mamu-E polymorphisms in this region may impact Mamu-E protein transport and/or membrane stability. However, we cannot make a complete determination since variants located within 3’ UTR TEs and indels in the 3’ UTR region were not included in this analysis (Supplementary Fig. 4b).
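
A comparison of N-terminal and C-terminal TM hydrophobicity of this kind could be sketched as below. The text does not state which hydrophobicity scale was used, so the Kyte-Doolittle scale here is an assumption, as is splitting the segment into two halves. The two sequences are hypothetical placeholders and are not real Mamu-E or HLA-E transmembrane sequences.

```python
# Kyte-Doolittle hydropathy values (higher means more hydrophobic).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

def mean_hydropathy(seq):
    """Mean Kyte-Doolittle score of an amino-acid sequence."""
    return sum(KD[aa] for aa in seq) / len(seq)

def terminal_hydropathy(tm_seq):
    """Return (N-terminal half mean, C-terminal half mean) for a TM segment."""
    mid = len(tm_seq) // 2
    return mean_hydropathy(tm_seq[:mid]), mean_hydropathy(tm_seq[mid:])

# HYPOTHETICAL TM segments; the variant carries a single M -> I substitution
# near the C-terminal end.
reference_tm = "VGIIAGLVLLGAVVTGAVVAAVMW"
variant_tm = "VGIIAGLVLLGAVVTGAVVAAVIW"

for name, seq in [("reference", reference_tm), ("variant", variant_tm)]:
    n_term, c_term = terminal_hydropathy(seq)
    print(f"{name}: N-terminal {n_term:.2f}, C-terminal {c_term:.2f}")
```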

Since we determined that the Mamu-E expression in whole blood was driven by G1 alleles, we investigated if the G1 allele expression, especially isoform usage, in whole blood could be related to RhCMV/SIV vaccine protection outcome. The abundances of Mamu-E G1 allele isoforms were quantified using full G1 allele sequences and relative proportions of isoforms were determined for all pre-challenge time points for each animal (Methods). We found that Mamu-E1 isoforms were most prevalent, composing ~ 80% of isoform abundances, while all other isoforms were detected at lower levels (Supplementary Figs. 8, 9). We also observed that relative isoform usages were largely stable in whole blood throughout the pre-challenge phase, regardless of vaccine protection outcome and vaccine group (Supplementary Fig. 9). Mamu-E isoforms were observed in different strata based on their relative expression (Supplementary Fig. 8). Mamu-E2, 4, 6, 8, and 14–16 formed a second stratum after Mamu-E1, each representing ~ 1–10% of isoforms expressed. Mamu-E3, 5, 7, 9, 10, and 12 formed a third stratum, each expressing ~ 0.1-1% of isoforms. Mamu-E11 and 13 were especially rare, defined as a fourth stratum with < 0.1% of isoform expression. Incidentally, these two rare isoforms were only identified using Sanger sequencing (Fig. 1).
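
The expression strata described above amount to binning each isoform by its share of total Mamu-E isoform abundance. A minimal sketch follows, using the cutoffs quoted in the text (about 1–10%, 0.1–1%, and below 0.1%). The treatment of values exactly on a boundary and the abundance values themselves are hypothetical.

```python
def assign_stratum(proportion):
    """Map a relative isoform proportion to a stratum (1 = most abundant)."""
    if proportion >= 0.10:
        return 1
    if proportion >= 0.01:
        return 2
    if proportion >= 0.001:
        return 3
    return 4

def isoform_strata(abundances):
    """Convert raw abundances to proportions and assign each isoform a stratum."""
    total = sum(abundances.values())
    return {iso: assign_stratum(value / total) for iso, value in abundances.items()}

# HYPOTHETICAL abundance estimates for four isoforms in one sample.
example = {"Mamu-E1": 8000, "Mamu-E2": 600, "Mamu-E3": 50, "Mamu-E11": 5}
print(isoform_strata(example))
```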

Here we present the first comprehensive analysis of alternative splicing and genetic variation across the whole Mamu-E locus. We used long read sequencing of both RNAs and DNAs to address the genomic complexity in the rhesus MHC-E region and complementary mRNA-seq analysis for independent validation and expression quantification. We uncover complex Mamu-E alternative splicing that is also conserved in human. We show that the whole Mamu-E locus is polymorphic and that Mamu-E gene duplications are common, in striking contrast to the highly monomorphic HLA-E in humans.

Up to this point the standard annotation of both HLA-E and Mamu-E has been a single transcript with the canonical MHC Class I exon/intron splicing, but the evidence we describe suggests that MHC-E transcription is regulated by complex alternative splicing. Interestingly, all Mamu-E splicing junctions were also found in HLA-E splicing isoforms. The high conservation of alternative splicing between Mamu-E and HLA-E provides additional evidence that rhesus may serve as a good model for studying HLA-E immunobiology. However, it also suggests that further investigation of these isoforms is needed to better understand the regulation and function of MHC-E in both rhesus macaques and humans.

Historically, MHC genotyping analyses have focused on coding regions, yet the substantial polymorphism in the Mamu-E 3’ UTR, the presence of transposable elements (TEs), and the 3’ UTR alternative splicing we observed in this study warrant further investigation and more expanded genotyping efforts. In the future, we anticipate expanding the long amplicon analysis approach employed here to cover the full range of the 3’ UTR annotated in this study. Moreover, tapping into TEs in regions surrounding MHC genes might also inform research on MHC gene duplications in RM, as we observed with Mamu-E G2_LTR alleles in this study.

Since our original PacBio long amplicon analysis design did not cover the Mamu-E 3’ UTR, we explored the feasibility of recovering genotype information in that region using available mRNA-seq data for those animals. MHC genes, Mamu-E in particular, are constitutively expressed, making RNA-seq coverage less problematic than for typical protein coding genes. There is also a strong propensity for splicing observed here with Mamu-E and HLA-E, but also with many other MHC genes⁵⁷. Splicing uniquely provides long range haplotype information, a clear advantage over high-throughput DNA-seq. We were able to phase nearly all heterozygous variants detected despite blacklisting portions of the 3’ UTR containing repetitive TE-derived sequence. This type of in silico approach is common, originally explored for HLA typing with the seq2HLA tool⁵⁸ and more recently with the arcasHLA tool⁵⁹. We expect this strategy to be useful for re-analysis of previously published RM mRNA-seq data for examining the prevalence of known and potentially novel Mamu-E allele sequences among different RM colonies.

We present direct evidence that Mamu-E gene duplications are common, detecting them in ~50% of the 59 animals we sequenced here. Since there is no reported HLA-E gene duplication, this discovery of widespread Mamu-E gene duplications could have significant functional implications for MHC-E in rhesus macaques. For example, we found the presence of G2_LTR alleles was significantly associated with the lack of RhCMV/SIV vaccine protection. Since this set of animals was relatively small, this association needs to be confirmed independently. Wu et al. suggested potential duplications of the Mamu-E locus based on the expression of multiple MHC-E transcripts within individual rhesus macaques, but they did not observe significant functional differences among Mamu-E molecules²⁰. Interestingly, we detected dominantly expressed G1 alleles in whole blood samples, and G1 alleles were present in all animals. Further, in whole blood samples we found that the canonical Mamu-E1 isoform was most abundant, and all other isoforms collectively composed ~20% of expression. This suggests the observed functional conservation could be driven by the Mamu-E1 isoform from the common G1 alleles. However, our results show that Mamu-E expression also appears to be dose sensitive, suggesting potential interactions among Mamu-E alleles.

More work is needed to examine Mamu-E allele and isoform expression patterns in other tissues and cell types with different phenotypes. For example, Mamu-E alternative splicing could be investigated at the single-cell level, potentially with 3’ tag approaches using the 10x Genomics platform, as much of the splicing diversity is concentrated at the 3’ end. This 3’ alternative splicing was found to affect inclusion of the 3’ UTR and also altered the protein sequence of the cytoplasmic tail, but only in Mamu-E, as the HLA-E protein sequence terminates before the exon 7–8 junction. Cytoplasmic tails are believed to be important for selective export from the endoplasmic reticulum, and there is supporting evidence in the case of HLA-F⁶⁰. Manipulation of the cytoplasmic tail of another MHC Class I molecule, Patr-AL, drastically affected its surface expression⁶¹. It was also shown that splice variants of MHC class I molecules resulting in deletion of amino acids in exon 7 improved the CD8 + T-cell stimulatory capacity of dendritic cells⁶². Collectively, this body of evidence suggests that Mamu-E splice variants affecting the cytoplasmic tail could generate proteins with different functional outcomes.

This study is the first to interrogate both the genetics and alternative splicing of Mamu-E with this level of precision in the context of a RhCMV/SIV vaccine study. The surprising association of Mamu-E G2_LTR alleles with the lack of RhCMV/SIV vaccine protection and the weak associations between selected Mamu-E variants and RhCMV/SIV vaccine protection should be followed up. Our genetic analysis missed a few highly variable regions in the 3’ UTR due to limitations of mRNA-seq haplotype phasing. A closer examination of these specific variants and regions in the future will offer a better understanding of the potential impact of Mamu-E genetic variation. The analysis of isoform usage was complicated by our use of mRNA-seq data from whole blood samples, which include many cell types. Additional analysis of specific cell types or even single cells may be necessary to fully investigate whether alternative splicing plays any role in RhCMV/SIV vaccine-induced protection. We believe this study lays the groundwork needed for more comprehensive analysis of Mamu-E, which in turn will facilitate a more informed assessment of RhCMV-based vaccine translatability as we look towards hCMV/HIV vaccine development.

Rhesus full-length transcriptome sequencing and data processing

Full-length transcriptome sequencing data was generated from four rhesus tissues (whole blood, peripheral blood mononuclear cells, lymph node, and rectal biopsy) and pre-processed in our previous work to produce Circular Consensus Sequence (CCS) reads⁴². CCS reads were then aligned to Mamu Class I and II assemblies previously generated using Bacterial Artificial Chromosome (BAC) technology⁶³ (AC148696.1) and annotated using Mamu and HLA cDNA and protein sequences available in GenBank. STARlong v2.5.2b⁶⁴ was used for alignment with the following parameters specified: --alignEndsType EndToEnd --outFilterMismatchNoverReadLmax 0.05 --outFilterMatchNminOverLread 0.95 --twopassMode Basic --outFilterMultimapNmax 20 --outFilterIntronMotifs RemoveNoncanonical --outFilterType BySJout. To mitigate the splicing of reads between highly distant yet similar MHC loci, we serially aligned CCS reads, gradually increasing the maximum intron length using the --alignIntronMax parameter with the following values: 5000, 15000, 100000, 0 (no maximum). CCS reads that successfully aligned to the BAC reference were further processed using the Iso-Seq bioinformatics pipeline⁴¹ and its supporting Cupcake scripts (https://github.com/Magdoll/cDNA_Cupcake) to produce full-length (FL) consensus isoforms. FL Mamu-E isoforms were realigned to the BAC reference and then curated by correcting splice junctions misaligned due to indel events, extending 3’ ends shortened by intrapriming in the 3’ UTR, and collapsing any redundancies in the isoforms produced by these corrections. Finally, the transcriptional start and termination sites (TSS, TTS) for isoforms were clustered using a window size of 50 nucleotides. For isoforms within each cluster, the TSS or TTS was updated to match that of the isoform that extended the annotation the furthest (smallest and largest genomic coordinate for TSS and TTS, respectively).
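The TSS/TTS harmonization step can be illustrated with a minimal Python sketch. The text does not specify the clustering algorithm, so this sketch assumes a simple rule: sites sorted by coordinate are chained into one cluster whenever consecutive sites lie within the 50-nucleotide window. The function names and the example coordinates are hypothetical.

```python
def cluster_sites(sites, window=50):
    """Chain sorted coordinates into clusters; consecutive sites no more
    than `window` nt apart share a cluster (assumed clustering rule)."""
    clusters = []
    for pos in sorted(set(sites)):
        if clusters and pos - clusters[-1][-1] <= window:
            clusters[-1].append(pos)
        else:
            clusters.append([pos])
    return clusters


def harmonize(sites, window=50, extend="min"):
    """Replace each site with the cluster member that extends the
    annotation furthest: the smallest coordinate for a TSS
    (extend="min"), the largest for a TTS (extend="max")."""
    pick = min if extend == "min" else max
    mapping = {}
    for cluster in cluster_sites(sites, window):
        target = pick(cluster)
        for pos in cluster:
            mapping[pos] = target
    return [mapping[pos] for pos in sites]


# Hypothetical coordinates for four isoform start sites and three end sites.
tss = [1000, 1012, 1049, 1200]
tts = [5000, 5030, 5300]
print(harmonize(tss, extend="min"))  # [1000, 1000, 1000, 1200]
print(harmonize(tts, extend="max"))  # [5030, 5030, 5300]
```

Chaining means a cluster can span more than 50 nucleotides in total; a fixed-width binning rule would be an equally plausible reading of the description.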

Human full-length transcriptome sequencing and data processing

RNA was isolated from 60 samples of myelogenous cells obtained from human patients. Using a Clontech SMARTer kit, cDNA was produced from each RNA sample followed by PCR amplification. Libraries were prepared using the SMRTbell Express Template Prep Kit 2.0 and sequenced on the Sequel II System (Pacific Biosciences, Menlo Park, CA). Raw PacBio data were first pre-processed using the CCS protocol⁶⁵ to generate a complete set of Circular Consensus Sequence (CCS) reads. CCS reads were then aligned to the human genome (hg38) and subsequently processed using the Iso-Seq pipeline as described above, and the resulting isoforms were characterized using SQANTI⁶⁶. HLA-E isoforms were then realigned to hg38 and curated as described above for Mamu-E isoforms. Additionally, short isoforms with both a TSS and TTS located within introns (classified as genic introns or genic genomic isoforms by SQANTI) were removed, as they were likely sequencing artifacts or fragmented mRNAs. To compare HLA-E and Mamu-E spliceosome complexities, HLA-E isoforms were sampled with replacement using their respective FL read counts to estimate probability of detection. This was repeated 10,000 times using the total number of Mamu-E FL read counts each time and the mean and standard deviation of the results were recorded.
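The resampling comparison described above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: it summarizes each resampling round by the number of distinct HLA-E isoforms drawn, whereas the text only states that the mean and standard deviation "of the results" were recorded. The isoform names and read counts below are invented.

```python
import random
import statistics


def detected_isoforms(read_counts, depth, n_iter=10_000, seed=0):
    """Sample `depth` reads with replacement, weighting each isoform by its
    full-length read count, and return the mean and standard deviation of
    the number of distinct isoforms seen across `n_iter` rounds."""
    rng = random.Random(seed)
    isoforms = list(read_counts)
    weights = [read_counts[name] for name in isoforms]
    detected = []
    for _ in range(n_iter):
        draw = rng.choices(isoforms, weights=weights, k=depth)
        detected.append(len(set(draw)))
    return statistics.mean(detected), statistics.stdev(detected)


# Hypothetical HLA-E FL read counts; `depth` stands in for the total
# number of Mamu-E FL reads.
hla_e_counts = {"iso1": 900, "iso2": 60, "iso3": 25, "iso4": 10, "iso5": 5}
mean_n, sd_n = detected_isoforms(hla_e_counts, depth=200, n_iter=1000)
print(f"isoforms detected at Mamu-E depth: {mean_n:.2f} +/- {sd_n:.2f}")
```

Downsampling the deeper HLA-E data to the Mamu-E read depth in this way keeps differences in sequencing depth from being mistaken for differences in splicing complexity.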

Cross-species comparison of MHC-E isoforms

To facilitate cross-species comparison of MHC-E isoform structures, the genomic DNA sequences of HLA-E and Mamu-E were aligned to each other, and a genomic coordinate converter was generated from the alignment. Mamu-E isoform genomic coordinates were thus converted to HLA-E coordinates and compared to those of HLA-E isoforms. Mamu-E isoforms with incomplete 5’ ends but otherwise complete matches to HLA-E isoforms had their 5’ ends inferred using the HLA-E 5’ ends.
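A coordinate converter of this kind can be derived directly from a gapped pairwise alignment. The sketch below assumes the alignment is available as two equal-length strings with "-" for gaps; the function name, the 1-based start offsets, and the toy sequences are hypothetical.

```python
def build_converter(aln_a, aln_b, start_a=1, start_b=1):
    """Map positions in sequence A to positions in sequence B using a
    gapped pairwise alignment. Positions of A aligned to a gap in B have
    no entry in the returned dictionary."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    mapping = {}
    pos_a, pos_b = start_a, start_b
    for base_a, base_b in zip(aln_a, aln_b):
        if base_a != "-" and base_b != "-":
            mapping[pos_a] = pos_b
        if base_a != "-":
            pos_a += 1
        if base_b != "-":
            pos_b += 1
    return mapping


# Toy alignment: position 3 of the first sequence falls in a gap of the second.
mamu_aln = "ACG-TTA"
hla_aln = "AC-GTTA"
converter = build_converter(mamu_aln, hla_aln)
print(converter)  # {1: 1, 2: 2, 4: 4, 5: 5, 6: 6}
```

Exon boundaries converted this way can be compared directly between species. A boundary that falls in an alignment gap has no counterpart and would need separate handling.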

Validation of inferred Mamu-E 5’ ends using PCR and Sanger Sequencing

To validate inferred 5’ ends of select Mamu-E isoforms, isoform-specific PCR assays were designed. In brief, a common forward primer targeted the canonical first Mamu-E exon, while the reverse primers were isoform-specific (Supplementary Table 4). In cases where a reverse primer could not be designed uniquely for an isoform, the primer was designed to produce an amplicon of unique size for the isoform of interest. cDNA was obtained from whole blood RNA pooled from multiple rhesus macaques (Qiagen QuantiTect RT) and amplified using touchdown PCR (TD-PCR) to help limit off-target effects (Agilent Herculase II Fusion Polymerase). Most commonly, phase 1 consisted of 10 cycles starting at an annealing temperature (Ta) of 65 °C that was reduced by 1 °C per cycle. Phase 2 used a Ta of 56 °C for an additional 30 cycles. Extension was performed for 25 s. PAGE-based gel purification was performed on selected amplicons, which were eluted overnight in 100 µL 0.1X TAE at room temperature on an orbital shaker. Eluted bands were concentrated via centrivap and then re-amplified and purified to increase yield and purity. Each amplicon was examined via PCR using sequencing primers paired with the appropriate PCR primer to help eliminate any bands that were products of PCR bubbling and to reconfirm band sizing before sequencing. Purified bands were then Sanger sequenced at Eton Biosciences, Inc. using the same primers used for PCR. In cases where the band size exceeded Sanger sequencing limitations, forward and reverse primers were designed in the canonical Mamu-E exon 4 to pair with the PCR primers and produce two overlapping sequences for the band. The resulting sequence trace files were imported into SnapGene and exported to produce fastq files. In cases where multiple sequences were produced for a single band, sequences were merged using PEAR v0.9.10⁶⁷ with default parameters. Final merged sequences were then aligned to the expected amplicon sequence for the band. Unexpected novel Mamu-E isoforms generated from these assays were added to the existing isoform annotations.

Isoform functional analysis and identification of genomic transposable elements

Mamu-E and HLA-E isoforms were each analyzed for coding potential. Isoform cDNA sequences were extracted from the respective reference sequences using the isoform GTF annotation file and the gffread tool from cufflinks v2.2.1⁶⁸. Coding sequence (CDS) annotation was then generated by aligning these sequences back to the reference using GMAP v2019-21-01⁶⁹ with the --format=gff3_gene, -z sense_force, and -F parameters. These CDSs were then extracted using gffread and translated into protein sequences.

Separately, the entire Mamu Class I and II BAC reference sequences were screened for transposable elements using Dfam release 3.1⁷⁰ with the organism set to Homo sapiens. Database hits were then parsed to produce GTF records that were visualized together with Mamu-E isoform annotations using the Integrative Genomics Viewer⁷¹. These database hits were also compared to those pre-calculated at the HLA-E locus in Dfam release 3.1.

RhCMV/SIV vaccine study sample collection

Whole blood PAXgene samples were collected from 3 vaccine groups of 15 male RMs each (oral 68-1 vaccination group O, subQ 68-1 vaccination group S, and subQ 68-1 + 68-1.2 vaccination group X), as recently reported^{28, 72}. Whole blood samples were similarly collected from an additional vaccine group of 15 male RMs (subQ 68-1.2 group E) from this same study. PAXgene samples were collected prior to immunization and at days 1, 3 and 7 post-prime vaccination (W0D1, W0D3, W0D7) and post-boost (W18D0, W18D1, W18D3, W18D7). An additional sample was collected before the start of the first SIVmac239 challenge (W88D0).

PacBio Long Amplicon Analysis and data processing

We obtained genomic DNAs from 58 of 60 animals from the four RhCMV/SIV vaccine groups described above and used PacBio long amplicon analysis (LAA) to target and sequence Mamu-E allele sequences. Two Mamu-E genomic reference sequences (NW_015057580 and NC_041757) were used for the design of long-range PCR primers (Supplementary Table 7). Three different primer sets were designed from the flanking regions of Mamu-E to avoid allelic dropout due to unanticipated variation at the primer binding sites. Each set of primers generated ~ 3.2 to 3.5 kbp products. Two-stage long-range PCR was used for target generation (stage 1) and indexing of amplicons (stage 2). PCR products were pooled in equimolar quantities into a single tube, and the pooled product was processed using the SMRTbell Express Template Prep Kit 2.0 (Pacific Biosciences, Menlo Park, CA). The library was sequenced in a single SMRT cell on the Sequel II System (Pacific Biosciences, Menlo Park, CA).

Raw data were first demultiplexed with Lima and then processed with LAA to generate amplicon sequences; both tools are components of PacBio’s open-source SMRT Analysis software suite (Pacific Biosciences, Menlo Park, CA). CCS reads were mapped back to amplicon sequences using an in-house cluster/match analysis. Low-quality and recombinant sequences were filtered out to generate final amplicon sequences, and Mamu-E genes were annotated on the resulting amplicon sequences using Geneious Prime (San Diego, CA).

Confirmation of Mamu-E duplications via fosmid isolation and PacBio sequencing

Four animals were targeted for fosmid isolation and sequencing (Supplementary Table 5) using modifications of a previously described approach^{73, 74}. Sequencing was performed on the Sequel IIe System using the Sequel II Sequencing 2.0 Bundle according to the manufacturer’s protocol (Pacific Biosciences, Menlo Park, CA). CCS corrected reads of over 30 kbp were targeted for analysis and consensus sequences were derived from overlapping ZMWs using the Celera Assembler Canu 2.0. Resulting complete fosmid sequences were then screened for Mamu-E allele sequences using Geneious Prime (San Diego, CA).

Whole blood mRNA-seq and Mamu-E haplotype phasing

As previously described⁷², cDNA libraries were prepared and sequenced for all whole blood samples and resulting sequencing data was demultiplexed using Illumina bcl2fastq. Raw reads for each sample were aligned to the MHC Class I and II BAC reference using STAR v2.7.7a⁶⁴ with the following parameters set: --alignIntronMax 5000 --alignMatesGapMax 5000 --outFilterMultimapNmax 50. Reads uniquely mapped to Mamu-E (mapping score = 255) were extracted from the alignment output using samtools⁷⁵ followed by the bedtools intersect tool⁷⁶, where the Mamu-E1 GTF annotation was used. Per-base coverage of the Mamu-E1 sequence was then computed using bam-readcount v0.8.0 (https://github.com/genome/bam-readcount) with parameter -b 20 to only count bases with quality score ≥ 20 from reads with perfect mapping scores. Coverage was pooled across all nine pre-challenge samples for each animal and the ends of the Mamu-E1 sequence with coverage below 10,000 reads per base were excluded from analysis.

The uniqueness of the Mamu-E1 sequence was assessed using two strategies. Firstly, kmer libraries (k = 76bp) were generated for all Mamu-E alleles and the kmers were aligned to the BAC reference using STAR as described above. The rate of uniquely mapped, multimapped, and unmapped kmers was then assessed. Secondly, the rate of uniquely mapped reads was examined from the mRNA-seq samples aligned as described above. From the initial mapping results (from STAR), per base coverage was computed as above using the parameters -b 0 and -q 255 (uniquely mapped read quality score for STAR). This was performed a second time with -q 0 to capture the total per base coverage from which the per base multimapping rate was inferred. From these analyses, additional regions were identified and excluded from haplotype phasing.
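The k-mer library used in the first strategy can be generated with a short sliding-window routine. This is a minimal sketch: the function name is hypothetical, and writing the k-mers to FASTA for alignment with STAR is left out.

```python
def kmers(seq, k=76):
    """Return every overlapping substring of length k, in order.
    Sequences shorter than k yield an empty list."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


# Toy example with a short k for readability; the analysis used k = 76.
print(kmers("ACGTA", k=3))  # ['ACG', 'CGT', 'GTA']
```

The fraction of these k-mers that map uniquely, map to multiple loci, or fail to map gives the rates described above. A read-length-sized k of 76 bp makes the k-mers behave like error-free reads drawn from each allele.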

For each Mamu-E1 position remaining, bases were called using a threshold of 25% coverage for each animal. Positions with a single call were labelled as homozygous and those with more than one as heterozygous, and a VCF file was then manually generated and indexed using samtools^{75, 77}. Next, using the mapping results from all nine samples for each animal, haplotype blocks were generated using phASER⁷⁸, a haplotype phasing tool optimized for RNA-seq data, with the following parameters set: --paired_end 1 --mapq 255 --baseq 20. The statistical test for variant connections was disabled using the parameter --cc_threshold 0, as a small fraction of reads (< 5%) were expected from additional Mamu-E loci with lower expression. In cases where multiple haplotype blocks were produced, additional phasing was performed inferentially by comparing relative coverage of haplotypes from each block. For two blocks to be merged, a perfect consensus was required across all nine samples. Any remaining heterozygous positions not included in the largest haplotype block were assigned an ambiguous call using standard IUPAC ambiguity codes (e.g., A or C = M). Haplotypes were then screened for variants with low phasing support by assessing variant connections with at least 100 read support. Variants that on average had > 20% connections with other variants conflicting with the haplotype configuration were removed from the haplotype block and labeled as ambiguous calls. Lastly, haplotypes were expanded to include homozygous positions, resulting in complete Mamu-E1 haplotig sequences.
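The per-position base calling and the IUPAC ambiguity assignment can be sketched in a few lines of Python. Two details are assumptions here: the 25% threshold is treated as inclusive (a base is called when it accounts for at least 25% of coverage), and the per-base counts are supplied as a plain dictionary rather than parsed from bam-readcount output. The function names are hypothetical.

```python
# Standard IUPAC nucleotide ambiguity codes, keyed by the set of bases.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AC"): "M", frozenset("AG"): "R", frozenset("AT"): "W",
    frozenset("CG"): "S", frozenset("CT"): "Y", frozenset("GT"): "K",
    frozenset("ACG"): "V", frozenset("ACT"): "H",
    frozenset("AGT"): "D", frozenset("CGT"): "B",
    frozenset("ACGT"): "N",
}


def call_position(counts, min_fraction=0.25):
    """Return the set of bases whose share of total coverage at one
    position is at least `min_fraction` (threshold assumed inclusive)."""
    total = sum(counts.values())
    if total == 0:
        return set()
    return {base for base, n in counts.items() if n / total >= min_fraction}


def zygosity(called):
    """One called base is homozygous; more than one is heterozygous."""
    return "homozygous" if len(called) == 1 else "heterozygous"


def iupac_code(called):
    """Collapse a set of called bases into a single IUPAC symbol."""
    return IUPAC[frozenset(called)]


# Hypothetical pooled counts at two positions.
site1 = call_position({"A": 80, "C": 20, "G": 0, "T": 0})
site2 = call_position({"A": 60, "C": 40, "G": 0, "T": 0})
print(site1, zygosity(site1))                     # {'A'} homozygous
print(zygosity(site2), iupac_code(site2))         # heterozygous M
```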

Comparison and integration of Mamu-E alleles with mRNA-seq haplotigs

Exons 1–7 were extracted from haplotigs and aligned with all allele exonic sequences using PRANK⁷⁹, an indel-aware progressive multiple sequence aligner. In cases where haplotigs in an animal only differed by 3’ UTR SNPs, these were collapsed into a single haplotig at this step. Then, using this alignment, each haplotig was compared to each allele from the same animal, excluding indel variation captured by the alleles and regions blacklisted in the haplotype phasing analysis. Haplotigs and G1 alleles were progressively matched, taking the pairing with the fewest mismatches and subsequently pairing the remaining allele and haplotig, if any. G1 alleles were then merged with the 3’ UTRs of matched haplotigs by using the intronic sequence between exons 7 and 8 extracted from the MHC Class I/II BAC reference using gffread⁶⁸, yielding a single contiguous sequence. All SNPs and indels detected in the alleles and the 3’ UTR of haplotigs were gathered for each animal into a single VCF file for later genetic analysis. All of these variants were also enumerated, stratified by protein coding region and by UTR. Non-synonymous SNPs were counted separately.
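The progressive matching of haplotigs to G1 alleles amounts to a greedy pairing on mismatch counts. The sketch below assumes the sequences are already rows of one multiple sequence alignment (equal length, "-" for gaps), skips gap columns to mimic the exclusion of indel variation, and takes blacklisted columns as a set of 0-based indices. All names and sequences are hypothetical.

```python
from itertools import product


def mismatches(seq_a, seq_b, mask=frozenset()):
    """Count mismatched columns between two aligned sequences, ignoring
    masked columns and columns where either sequence has a gap."""
    return sum(
        1
        for i, (x, y) in enumerate(zip(seq_a, seq_b))
        if i not in mask and x != "-" and y != "-" and x != y
    )


def greedy_match(haplotigs, alleles, mask=frozenset()):
    """Repeatedly pair the haplotig and allele with the fewest
    mismatches, then continue with whatever remains."""
    haps, als = dict(haplotigs), dict(alleles)
    pairs = []
    while haps and als:
        hap, allele = min(
            product(haps, als),
            key=lambda p: mismatches(haps[p[0]], als[p[1]], mask),
        )
        pairs.append((hap, allele, mismatches(haps[hap], als[allele], mask)))
        del haps[hap]
        del als[allele]
    return pairs


# Toy aligned sequences for one animal.
haplotigs = {"hap1": "ACGTAC", "hap2": "ACGAAC"}
alleles = {"G1_a": "ACGAAC", "G1_b": "ACGTAT"}
print(greedy_match(haplotigs, alleles))
# [('hap2', 'G1_a', 0), ('hap1', 'G1_b', 1)]
```

With at most two haplotigs and two alleles per animal, the greedy choice and an exhaustive search over pairings will usually agree. They can differ when mismatch counts are tied.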

Mamu-E phylogenetic and genetic analysis

All multiple sequence alignments of alleles, including those of exonic and promoter regions, were performed using Clustal Omega⁸⁰, a progressive multiple sequence aligner. PRANK⁷⁹ was not used here, as Clustal Omega performed better when including noncoding regions. Phylogenetic analysis was then performed in R using the phangorn package⁸¹ for all multiple sequence alignments. In brief, a neighbor joining (NJ) tree was generated using the dist.ml function with the multiple sequence alignment as input followed by the NJ function. A maximum likelihood (ML) tree was generated from the NJ tree and multiple sequence alignment using the pml function and followed by the optim.pml function with optNni = T, performing Jukes-Cantor optimization. The resulting ML trees in some cases were visualized as phylograms using phangorn’s internal functionality and in others as circular tree structures using the ggtree and dendextend R packages^{82, 83}.

Linkage disequilibrium (LD) was assessed for all variants in the exonic regions of G1 alleles merged with matched 3’ UTR haplotig sequences. Variants with low minor allele frequencies (MAF) were removed by requiring a MAF > 0.1. The extent of LD was assessed by computing the D value for all remaining variant pairs, and the correlation coefficient was then computed from these D values using standard formulae for LD analysis⁸⁴. Variants were then hierarchically clustered using the R hclust function with a distance matrix produced from these correlations as input. After visual examination of the tree generated from this clustering, correlated groups of variants were split using the cutree function with a height of 0.35, implicitly requiring a minimum correlation of 0.9 within groups. Resulting groups of size two that were not in complete LD were further split into individual groups of size one.
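For a pair of biallelic variants on phased sequences, the standard quantities are D = p_AB − p_A·p_B and r = D / sqrt(p_A(1 − p_A)·p_B(1 − p_B)). The Python sketch below computes both. It assumes each variant is encoded as a list of 0/1 values (1 = minor allele), one entry per phased sequence; the function name and data are hypothetical.

```python
from math import sqrt


def ld_stats(var_a, var_b):
    """Return (D, r) for two biallelic variants given as equal-length
    lists of 0/1 alleles, one entry per phased sequence."""
    if len(var_a) != len(var_b) or not var_a:
        raise ValueError("variants must be non-empty and of equal length")
    n = len(var_a)
    p_a = sum(var_a) / n
    p_b = sum(var_b) / n
    p_ab = sum(1 for x, y in zip(var_a, var_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b
    denom = sqrt(p_a * (1 - p_a) * p_b * (1 - p_b))
    r = d / denom if denom > 0 else float("nan")
    return d, r


# Two variants in complete LD, and two in complete repulsion.
print(ld_stats([0, 0, 1, 1], [0, 0, 1, 1]))  # (0.25, 1.0)
print(ld_stats([0, 0, 1, 1], [1, 1, 0, 0]))  # (-0.25, -1.0)
```

How the correlations were converted to distances for hclust is not stated in the text, so that step is not sketched here.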

Statistical tests for association with protection outcome for individual variants were performed using Fisher’s Exact tests (fisher.test R function) followed by Benjamini-Hochberg (BH) multiple hypothesis testing correction for FDR control. 2x2 contingency tables were constructed by comparing animals with homozygous major variants against heterozygotes and those with homozygous minor variants, stratified across protection outcomes (protected, not protected). These same statistical tests for association were applied to correlated groups of variants followed by BH FDR control. Only animals in groups O, S, and X were used for all tests for association with protection outcome. This same procedure was performed for statistical tests for association with vaccine group, where all animals from groups O, S, X, and E were used. Fisher’s exact tests were also used to test these associations.
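The analysis itself used R's fisher.test and BH correction. As an illustration only, the sketch below reimplements both steps in plain Python: a two-sided Fisher's exact test for a 2x2 table (summing the probabilities of all tables no more likely than the observed one, the convention R follows) and the Benjamini-Hochberg adjustment. The function names and the counts are hypothetical and are not data from this study.

```python
from math import comb


def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x):
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(x) for x in range(lo, hi + 1)
                if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, total)


def benjamini_hochberg(pvals):
    """BH-adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # rank of this p-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical table: rows = homozygous major vs carriers of the minor
# variant, columns = protected vs not protected.
p = fisher_exact_2x2(3, 1, 1, 3)
print(round(p, 4))  # 0.4857
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```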

Expression analysis of Mamu-E loci

Expression analysis of alleles from different groups was performed by generating STAR alignment indexes tailored to each animal. The MHC Class I/II BAC reference described above was included with Mamu-E masked out. Trimmed alleles from each group were included as additional contigs. Alleles from the different G2 subgroups were kept as separate contigs when present in the same animal. G2 alleles containing an LTR in their promoter region were left untrimmed and the LTR was annotated using Dfam release 3.1⁷⁰ with the organism set to Homo sapiens. When two allele sequences were present from a group, one was chosen as the reference and the other was represented in a VCF file as the alternative. This file was generated by aligning the two allele sequences using the needle tool from the EMBOSS Suite⁸⁵ and extracting positions harboring SNPs. Needle was run using the parameters -endopen 0 -endextend 0 -gapopen 100 -gapextend 0. The reference alleles included in the index were annotated by aligning the canonical Mamu-E1 isoform sequence with exon 8 removed (not included in the allele sequences) using exonerate⁸⁶. For G3 alleles, which did not contain exon 6, Mamu-E7 was used (skips exon 6). Since no stop codon was expected in these annotations, we used the exonerate est2genome model with --showtargetgff yes to extract annotation records for alleles.

Raw whole blood mRNA-seq reads were aligned using STAR 2.7.7a⁶⁴ in WASP mode⁸⁷ with the same base parameters used above when phasing haplotypes from mRNA-seq. WASP mode removes allele-specific bias that might be introduced by selecting one allele as the reference. WASP mode was run by adding the additional parameters: --outSAMattributes NH HI AS nM vA vG --varVCFfile [VCF file] --waspOutputMode SAMtag. When a VCF file was not present (i.e. no allele group had > 1 allele), only --outSAMattributes NH HI AS nM was added as an additional parameter.

Relative expression of allele groups was determined by using the gene counts produced by STAR for each group. To examine the relative expression of alleles within a group (where applicable), we extracted reads properly paired and uniquely mapped to the Mamu-E allele contigs using samtools with -q 255 and -f 0x2 parameters and using the bedtools intersect function⁷⁶, selecting those with the vA flag set. Reads with vA set to i:1 and i:2 were assigned to the reference allele and alternate allele, respectively. Reads with vA set to i:0 were common to both alleles. Reads with vA set to any other values (i:1,2 or i:3) or with the WASP flag turned on (i.e. set to a value other than i:1) were removed from analysis.
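The read-assignment rules above can be written as a small decision function. This sketch simply mirrors the rules as stated in the text; it assumes the per-read tag values have already been parsed from the alignment (the vA values as a tuple of integers, the WASP filter value as an integer), and both the function name and that representation are hypothetical.

```python
def assign_read(va, wasp):
    """Classify a read from its variant-allele values (`va`, a tuple of
    ints) and its WASP filter value (`wasp`, where 1 means the read
    passed filtering), following the rules described in the text."""
    if wasp != 1:
        return "removed"      # failed WASP filtering
    if va == (1,):
        return "reference"    # matches the reference allele
    if va == (2,):
        return "alternate"    # matches the alternate allele
    if va == (0,):
        return "shared"       # common to both alleles
    return "removed"          # mixed (1,2) or no-match (3) values


# Hypothetical parsed tag values for five reads.
reads = [((1,), 1), ((2,), 1), ((0,), 1), ((1, 2), 1), ((2,), 4)]
print([assign_read(va, wasp) for va, wasp in reads])
# ['reference', 'alternate', 'shared', 'removed', 'removed']
```

Counting the "reference" and "alternate" reads per animal then gives the relative expression of the two alleles within a group.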

Relative Mamu-E isoform expression analysis

Isoform expression analysis was performed similarly to the Mamu-E locus analysis. To include Mamu-E isoform annotations spanning the 3’ UTR, G1 alleles merged with the 3’ UTR haplotig sequence were used in place of the trimmed G1 alleles. Annotations for all Mamu-E isoforms were generated using exonerate⁸⁶ run with the cdna2genome model. In cases where G2 and G3 alleles were present in the animal, they were still included as separate contigs as described above. Mamu-E transcriptome alignments were then generated when aligning to this index by adding “TranscriptomeSAM” to the --quantMode parameter field. Relative isoform abundances were then calculated using salmon⁸⁸ with the default VBEM algorithm and 25 bootstraps. Final isoform relative abundances were calculated by using the mean of the bootstrap estimates.

Mamu-E duplication and G1 allele association analysis

Statistical tests were separately performed to assess the significance of association between either Mamu-E duplications or G1 allele subgroups with either protection outcome or vaccine group (four tests in total). When assessing Mamu-E duplication associations, animals with either a G2 or G3 allele were grouped together, thus forming two groups of animals (G1, G1 + G2/G3). For association with protection outcome, only animals from the O, S, and X groups were used and animals were stratified by protection outcome (protected, not protected). In cases with 2x2 contingency tables, Fisher’s Exact tests were used to assess significance in R (fisher.test function).

Ethics: Parts of this study were approved by Fred Hutchinson Cancer Research Center IRB protocol number 9950.

ACKNOWLEDGEMENTS

This project has been funded in whole or in part with Federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, Department of Health and Human Services, under Contract No. HHSN272201800008C (to M.G.) and Contract No. HHSN272201600027C (to D.E.G.). Funding for this study was supported in part by the National Institutes of Health, Office of the Director P51OD010425 (to M.G.). Research reported in this publication was supported by the University of Washington / Fred Hutch Center for AIDS Research, an NIH-funded program under award number AI027757 which is supported by the following NIH Institutes and Centers: NIAID, NCI, NIMH, NIDA, NICHD, NHLBI, NIA, NIGMS, NIDDK.

AUTHOR CONTRIBUTIONS

This study was conceived and designed by: L.L., L.J.P., M.G., D.E.G., and X.P. Experiments were performed by: T.T. and C.P. Bioinformatics analyses were performed by: H.B., R.W., A.T., and E.T. The paper was written by: H.B. and X.P. The paper was reviewed and edited by: H.B., E.T., L.L, L.J.P., M.G., D.E.G., and X.P.

COMPETING INTERESTS

No potential competing interest reported by the authors.

DATA AVAILABILITY

Sequence FASTA and annotation GTF files for RM and human MHC-E isoforms, Sanger sequencing data, and Mamu-E genotyping data were deposited to Zenodo (doi: 10.5281/zenodo.5985423). Mamu-E allele sequences were deposited to GenBank under accession numbers MT221257 through MT221434. Transcriptomic data for vaccine groups O, S, and X is available in the Gene Expression Omnibus (GEO) https://www.ncbi.nlm.nih.gov/geo/ under accession number GSE160562. Transcriptomic data for vaccine group E is available under BioProject accession number PRJNA825389 in the NCBI BioProject database (https://www.ncbi.nlm.nih.gov/bioproject/).

Trowsdale, J. & Knight, J. C. Major histocompatibility complex genomics and human disease. Annual review of genomics and human genetics 14, 301–323 (2013).
Shiina, T., Hosomichi, K., Inoko, H. & Kulski, J. K. The HLA genomic loci map: expression, interaction, diversity and disease. J. Hum. Genet. 54, 15–39 (2009).
The MHC Sequencing Consortium. Complete sequence and gene map of a human major histocompatibility complex. Nature 401, 921–923 (1999).
Boegel, S. et al. HLA and proteasome expression body map. BMC medical genomics 11, 36 (2018).
Dendrou, C. A., Petersen, J., Rossjohn, J. & Fugger, L. HLA variation and disease. Nature Reviews Immunology 18, 325–339 (2018).
Blackwell, J. M., Jamieson, S. E. & Burgner, D. HLA and Infectious Diseases. Clin. Microbiol. Rev. 22, 370 (2009).
Matzaraki, V., Kumar, V., Wijmenga, C. & Zhernakova, A. The MHC locus and genetic susceptibility to autoimmune and infectious diseases. Genome Biol. 18, 76 (2017).
Bontrop, R. E. Non-human primates: essential partners in biomedical research. Immunol. Rev. 183, 5–9 (2001).
Hansen, S. G. et al. Immune clearance of highly pathogenic SIV infection. Nature 502, 100–104 (2013).
Hansen, S. G. et al. Profound early control of highly pathogenic SIV by an effector memory T-cell vaccine. Nature 473, 523–527 (2011).
Yu, J. et al. DNA vaccine protection against SARS-CoV-2 in rhesus macaques. Science, eabc6284 (2020).
Hansen, S. G. et al. Prevention of tuberculosis in rhesus macaques by a cytomegalovirus-based vaccine. Nat. Med. 24, 130–143 (2018).
Carpenter, S. M. & Behar, S. M. A new vaccine for tuberculosis in rhesus macaques. Nat. Med. 24, 124–126 (2018).
Carroll, T. D. et al. Efficacy of influenza vaccination of elderly rhesus macaques is dramatically improved by addition of a cationic lipid/DNA adjuvant. J. Infect. Dis. 209, 24–33 (2014).
Heijmans, C. M. C., de Groot, N. G. & Bontrop, R. E. Comparative genetics of the major histocompatibility complex in humans and nonhuman primates. Int J Immunogenet 47, 243–260 (2020).
Knapp, L. A., Cadavid, L. F. & Watkins, D. I. The MHC-E Locus Is the Most Well Conserved of All Known Primate Class I Histocompatibility Genes. J. Immunol. 160, 189 (1998).
Boyson, J. E. et al. The MHC E locus in macaques is polymorphic and is conserved between macaques and humans. Immunogenetics 41, 59–68 (1995).
Shiina, T., Blancher, A., Inoko, H. & Kulski, J. K. Comparative genomics of the human, macaque and mouse major histocompatibility complex. Immunology 150, 127–138 (2017).
D’Souza, M. P. et al. Casting a wider net: Immunosurveillance by nonclassical MHC molecules. PLOS Pathogens 15, e1007567 (2019).
Wu, H. L. et al. The Role of MHC-E in T Cell Immunity Is Conserved among Humans, Rhesus Macaques, and Cynomolgus Macaques. J. Immunol. 200, 49 (2018).
Joosten, S. A., Sullivan, L. C. & Ottenhoff, T. H. M. Characteristics of HLA-E Restricted T-Cell Responses and Their Role in Infectious Diseases. Journal of Immunology Research 2016, 2695396 (2016).
Grant, E. J. et al. The unconventional role of HLA-E: The road less traveled. Mol. Immunol. 120, 101–112 (2020).
Tomasec, P. et al. Surface Expression of HLA-E, an Inhibitor of Natural Killer Cells, Enhanced by Human Cytomegalovirus gpUL40. Science 287, 1031 (2000).
Sharpe, H. R., Bowyer, G., Brackenridge, S. & Lambe, T. HLA-E: exploiting pathogen-host interactions for vaccine development. Clin. Exp. Immunol. 196, 167–177 (2019).
Marshall, E. E. et al. Enhancing safety of cytomegalovirus-based vaccine vectors by engaging host intrinsic immunity. Science Translational Medicine 11, eaaw2603 (2019).
Caposio, P. et al. Characterization of a live-attenuated HCMV-based vaccine platform. Scientific Reports 9, 19236 (2019).
Hansen, S. G. et al. Broadly targeted CD8⁺ T cell responses restricted by major histocompatibility complex E. Science 351, 714 (2016).
Malouli, D. et al. Cytomegaloviral determinants of CD8 + T cell programming and RhCMV/SIV vaccine efficacy. Sci Immunol 6 (2021).
Verweij, M. C. et al. Modulation of MHC-E transport by viral decoy ligands is required for RhCMV/SIV vaccine efficacy. Science 372 (2021).
Barrenäs, F. et al. Interleukin-15 response signature predicts RhCMV/SIV vaccine efficacy. PLoS Pathog 17, e1009278 (2021).
Malissen, M., Malissen, B. & Jordan, B. R. Exon/intron organization and complete nucleotide sequence of an HLA gene. Proc. Natl. Acad. Sci. U. S. A. 79, 893–897 (1982).
Paul, P. et al. Identification of HLA-G7 as a new splice variant of the HLA-G mRNA and expression of soluble HLA-G5, -G6, and -G7 transcripts in human transfected cells. Hum. Immunol. 61, 1138–1149 (2000).
Boyson, J. E., Iwanaga, K. K., Golos, T. G. & Watkins, D. I. Identification of a novel MHC class I gene, Mamu-AG, expressed in the placenta of a primate with an inactivated G locus. J. Immunol. 159, 3311 (1997).
Zavazava, N. & Krönke, M. Soluble HLA class I molecules induce apoptosis in alloreactive cytotoxic T lymphocytes. Nat. Med. 2, 1005–1010 (1996).
Nocito, M., Montalbán, C., González-Porque, P. & Villar, L. M. Increased Soluble Serum HLA Class I Antigens in Patients with Lymphoma. Hum. Immunol. 58, 106–111 (1997).
Tsuchiya, N., Shiota, M., Yamaguchi, A. & Ito, K. Elevated serum level of soluble HLA class I antigens in patients with systemic lupus erythematosus. Arthritis & Rheumatism 39, 792–796 (1996).
Adamashvili, I. et al. Soluble Class I HLA antigens in patients with rheumatoid arthritis and their families. J. Rheumatol. 22, 1025–31 (1995).
Tabayoyong, W. B. & Zavazava, N. Soluble HLA revisited. Leuk. Res. 31, 121–125 (2007).
Coupel, S. et al. Expression and release of soluble HLA-E is an immunoregulatory feature of endothelial cell activation. Blood 109, 2806–2814 (2006).
Shwetank, Date, O. S., Kim, K. S. & Manjunath, R. Infection of human endothelial cells by Japanese encephalitis virus: increased expression and release of soluble HLA-E. PloS one 8, e79197 (2013).
Gordon, S. P. et al. Widespread Polycistronic Transcripts in Fungi Revealed by Single-Molecule mRNA Sequencing. PLoS One 10 (2015).
Brochu, H. N. et al. Systematic Profiling of Full-Length Ig and TCR Repertoire Diversity in Rhesus Macaque through Long Read Transcriptome Sequencing. J. Immunol., ji1901256 (2020).
Eid, J. et al. Real-Time DNA Sequencing from Single Polymerase Molecules. Science 323, 133 (2009).
Doxiadis, G. G. M. et al. Compound Evolutionary History of the Rhesus Macaque Mhc Class I B Region Revealed by Microsatellite Analysis and Localization of Retroviral Sequences. PLOS ONE 4, e4287 (2009).
Kulski, J. K. et al. The Evolution of MHC Diversity by Segmental Duplication and Transposition of Retroelements. J. Mol. Evol. 45, 599–609 (1997).
Andersson, G., Svensson, A., Setterblad, N. & Rask, L. Retroelements in the human MHC class II region. Trends in Genetics 14, 109–114 (1998).
Saleh, A., Macia, A. & Muotri, A. R. Transposable Elements, Inflammation, and Neurological Disease. Frontiers in neurology 10, 894 (2019).
Payer, L. M. & Burns, K. H. Transposable elements in human genetic disease. Nature Reviews Genetics 20, 760–772 (2019).
Payer, L. M. et al. Alu insertion variants alter mRNA splicing. Nucleic Acids Res. 47, 421–431 (2018).
Nakama, M. et al. Intronic antisense Alu elements have a negative splicing effect on the inclusion of adjacent downstream exons. Gene 664, 84–89 (2018).
Su, M., Han, D., Boyd-Kirkup, J., Yu, X. & Han, J. J. Evolution of Alu Elements toward Enhancers. Cell Reports 7, 376–385 (2014).
Jjingo, D. et al. Mammalian-wide interspersed repeat (MIR)-derived enhancers and the regulation of human gene expression. Mobile DNA 5, 14 (2014).
Carnevali, D., Conti, A., Pellegrini, M. & Dieci, G. Whole-genome expression analysis of mammalian-wide interspersed repeat elements in human cell lines. DNA research: an international journal for rapid publication of reports on genes and genomes 24, 59–69 (2017).
Maccari, G. et al. IPD-MHC 2.0: an improved inter-species database for the study of the major histocompatibility complex. Nucleic Acids Res 45, D860-D864 (2017).
Buxton, S. E., Benjamin, R. J., Clayberger, C., Parham, P. & Krensky, A. M. Anchoring pockets in human histocompatibility complex leukocyte antigen (HLA) class I molecules: analysis of the conserved B ("45") pocket of HLA-B27. J Exp Med 175, 809–820 (1992).
Garrett, T. P., Saper, M. A., Bjorkman, P. J., Strominger, J. L. & Wiley, D. C. Specificity pockets for the side chains of peptide antigens in HLA-Aw68. Nature 342, 692–696 (1989).
Vandiedonck, C. et al. Pervasive haplotypic variation in the spliceo-transcriptome of the human major histocompatibility complex. Genome Res 21, 1042–1054 (2011).
Boegel, S. et al. HLA typing from RNA-Seq sequence reads. Genome Med 4, 102 (2012).
Orenbuch, R. et al. arcasHLA: high-resolution HLA typing from RNAseq. Bioinformatics 36, 33–40 (2020).
Boyle, L. H., Gillingham, A. K., Munro, S. & Trowsdale, J. Selective export of HLA-F by its cytoplasmic tail. J Immunol 176, 6464–6472 (2006).
Goyos, A. et al. A distinctive cytoplasmic tail contributes to low surface expression and intracellular retention of the Patr-AL MHC class I molecule. J Immunol 195, 3725–3736 (2015).
Rodríguez-Cruz, T. G. et al. Natural splice variant of MHC class I cytoplasmic tail enhances dendritic cell-induced CD8+ T-cell responses and boosts anti-tumor immunity. PloS one 6, e22939 (2011).
Daza-Vamenta, R., Glusman, G., Rowen, L., Guthrie, B. & Geraghty, D. E. Genetic Divergence of the Rhesus Macaque Major Histocompatibility Complex. Genome Res 14, 1501–1515 (2004).
Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15–21 (2013).
Cartolano, M., Huettel, B., Hartwig, B., Reinhardt, R. & Schneeberger, K. cDNA Library Enrichment of Full Length Transcripts for SMRT Long Read Sequencing. PLOS ONE 11, e0157779 (2016).
Tardaguila, M. et al. SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res 28, 396–411 (2018).
Zhang, J., Kobert, K., Flouri, T. & Stamatakis, A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics (Oxford, England) 30, 614–620 (2014).
Trapnell, C. et al. Transcript assembly and abundance estimation from RNA-Seq reveals thousands of new transcripts and switching among isoforms. Nat Biotechnol 28, 511–515 (2010).
Wu, T. D. & Watanabe, C. K. GMAP: a genomic mapping and alignment program for mRNA and EST sequences. Bioinformatics 21, 1859–1875 (2005).
Storer, J., Hubley, R., Rosen, J., Wheeler, T. J. & Smit, A. F. The Dfam community resource of transposable element families, sequence models, and genome annotations. Mob DNA 12 (2021).
Thorvaldsdóttir, H. et al. Integrative genomics viewer. Nature biotechnology 29, 24–26 (2011).
Barrenäs, F. et al. Sustained IL-15 response signature predicts RhCMV/SIV vaccine efficacy. bioRxiv, 2021.01.11.426199 (2021).
Pyo, C. et al. Recombinant structures expand and contract inter and intragenic diversification at the KIR locus. BMC genomics 14, 89 (2013).
Roe, D. et al. Revealing complete complex KIR haplotypes phased by long-read sequencing technology. Genes and immunity 18, 127–134 (2017).
Li, H. et al. The Sequence Alignment/Map Format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
Quinlan, A. R. & Hall, I. M. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics (Oxford, England) 26, 841–842 (2010).
Li, H. A statistical framework for SNP calling, mutation discovery, association mapping and population genetical parameter estimation from sequencing data. Bioinformatics 27, 2987–2993 (2011).
Castel, S. E., Mohammadi, P., Chung, W. K., Shen, Y. & Lappalainen, T. Rare variant phasing and haplotypic expression from RNA sequencing with phASER. Nature communications 7, 12817 (2016).
Löytynoja, A. & Goldman, N. webPRANK: a phylogeny-aware multiple sequence aligner with interactive alignment browser. BMC Bioinformatics 11, 579 (2010).
Sievers, F. et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular systems biology 7, 539 (2011).
Schliep, K. P. phangorn: phylogenetic analysis in R. Bioinformatics 27, 592–593 (2011).
Yu, G., Smith, D. K., Zhu, H., Guan, Y. & Lam, T. T. ggtree: an r package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods in Ecology and Evolution 8, 28–36 (2017).
Galili, T. dendextend: an R package for visualizing, adjusting, and comparing trees of hierarchical clustering. Bioinformatics (Oxford, England) 31, 3718–3720 (2015).
Lewontin, R. C. On measures of gametic disequilibrium. Genetics 120, 849–852 (1988).
Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16, 276–277 (2000).
Slater, G. S. C. & Birney, E. Automated generation of heuristics for biological sequence comparison. BMC Bioinformatics 6, 31 (2005).
van de Geijn, B., McVicker, G., Gilad, Y. & Pritchard, J. K. WASP: allele-specific software for robust molecular quantitative trait locus discovery. Nat Methods 12, 1061–1063 (2015).
Patro, R., Duggal, G., Love, M. I., Irizarry, R. A. & Kingsford, C. Salmon: fast and bias-aware quantification of transcript expression using dual-phase inference. Nat Methods 14, 417–419 (2017).

Table 1. Allele groups detected in RhCMV/SIV study animals, stratified by protection outcome and group.

Protection outcome	Group	G1	G1+G3	G1+G2	G1+G2_LTR	G1+G2+G2_LTR
Protected	O	4	4	1	0	0
	S	2	4	1	0	0
	X	4	1	1	0	0
Not Protected	O	2	1	0	2	1
	S	3	2	0	2	0
	X	3	4	1	1	0
	E	6	4	1	1	0
Exposed-Uninfected	E	3	0	0	0	0
Total		27	20	5	6	1
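
The counts in Table 1 can be re-tallied programmatically. The sketch below transcribes the table, checks the column totals, and collapses the allele-group columns into carriers and non-carriers of G2_LTR for each protection outcome. It then computes a one-sided Fisher-style hypergeometric tail probability. Several choices here are assumptions made for illustration and are not taken from the paper: the exclusion of the Exposed-Uninfected animals, the choice of a one-sided exact test, and all function and variable names. The test the authors actually used may differ.

```python
from math import comb

# Counts transcribed from Table 1.
# Each row: (protection outcome, group, counts per allele-group column).
COLUMNS = ["G1", "G1+G3", "G1+G2", "G1+G2_LTR", "G1+G2+G2_LTR"]
ROWS = [
    ("Protected", "O", [4, 4, 1, 0, 0]),
    ("Protected", "S", [2, 4, 1, 0, 0]),
    ("Protected", "X", [4, 1, 1, 0, 0]),
    ("Not Protected", "O", [2, 1, 0, 2, 1]),
    ("Not Protected", "S", [3, 2, 0, 2, 0]),
    ("Not Protected", "X", [3, 4, 1, 1, 0]),
    ("Not Protected", "E", [6, 4, 1, 1, 0]),
    ("Exposed-Uninfected", "E", [3, 0, 0, 0, 0]),
]


def column_totals():
    """Sum each allele-group column over all rows (the 'Total' line)."""
    return [sum(counts[i] for _, _, counts in ROWS) for i in range(len(COLUMNS))]


def tally(outcome):
    """Return (animals with a G2_LTR allele, animals without) for one outcome."""
    with_ltr = without_ltr = 0
    for row_outcome, _group, counts in ROWS:
        if row_outcome != outcome:
            continue
        for column, n in zip(COLUMNS, counts):
            if "G2_LTR" in column:
                with_ltr += n
            else:
                without_ltr += n
    return with_ltr, without_ltr


def hypergeom_lower_tail(k, carriers, drawn, population):
    """P(X <= k) carriers among `drawn` animals sampled without replacement."""
    numerator = sum(
        comb(carriers, i) * comb(population - carriers, drawn - i)
        for i in range(k + 1)
    )
    return numerator / comb(population, drawn)


prot_ltr, prot_no = tally("Protected")
unprot_ltr, unprot_no = tally("Not Protected")

# Assumption: Exposed-Uninfected animals are left out of the 2x2 comparison.
population = prot_ltr + prot_no + unprot_ltr + unprot_no
p_one_sided = hypergeom_lower_tail(
    prot_ltr, prot_ltr + unprot_ltr, prot_ltr + prot_no, population
)

print("column totals:", dict(zip(COLUMNS, column_totals())))
print("protected (G2_LTR, no G2_LTR):", (prot_ltr, prot_no))
print("not protected (G2_LTR, no G2_LTR):", (unprot_ltr, unprot_no))
print(f"one-sided exact p-value: {p_one_sided:.4f}")
```

Under these assumptions, none of the 22 protected animals and 7 of the 34 unprotected animals carry a G2_LTR allele. A two-sided test, or a different handling of the Exposed-Uninfected animals, would give a different p-value.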

The authors declare no competing interests.

BrochuSupplementaryInformationfinal.pdf: Supplementary Information
NCOMMS2218194RSC.pdf: Reporting Summary
mamu.zip: Additional Custom Software
