The gene expression of Mamu-E is regulated by extensive alternative splicing that is conserved among HLA-E isoforms
To accurately define Mamu-E transcript structures, we aimed to use high-quality, full-length transcript sequences obtained by long read transcriptome sequencing41. Since the sequences of MHC genes are very similar, it was critical that we use long read sequencing to avoid transcript sequence assembly. In our previous work42, using PacBio transcriptome sequencing (the Iso-Seq method) we obtained over 2.8 million Circular Consensus Sequencing (CCS) reads from four different rhesus macaque tissues (Supplementary Table 1). About 33% of these CCS reads were full length (i.e., contained the 5’ cDNA primer, 3’ cDNA primer, and polyadenylation tail), each representing a single transcript molecule43. All CCS reads (full-length (FL) and non-full length) were initially clustered without a genome reference and subsequently aligned to a RM MHC Class I region reference sequence, which was previously assembled using BAC cloning (Methods). These CCS read groups were further clustered and curated, yielding an initial set of 13 unique Mamu-E isoforms (shown in Fig. 1 as Mamu-E1-10, 12, 14, and 16). The canonically spliced Mamu-E isoform (Mamu-E1) had the strongest FL CCS read support of all isoforms (92 of 123, 74.8%), while other isoforms had FL support ranging from 1 to 13 (Supplementary Table 2). Collectively, these isoforms supported a shorter 5’ UTR than previously annotated and a significantly longer 3’ UTR, and this was also supported by mRNA-seq data from RM whole blood samples described later (Supplementary Fig. 1). These isoforms exhibit several new splicing events largely concentrated at the 3’ end of the transcript, including exon skipping, alternative 3’ UTR splicing, a retained intron, and a novel exonization event between exons 5 and 6 (Fig. 1), all of which had canonical splice signals. Many of these isoforms were also predicted to encode protein sequences with different domain configurations (Fig. 1). For example, while nearly all isoforms (12 of 13) encode the canonical Alpha 1, 2 and 3 domains, many isoforms skip the transmembrane domain and have diverse cytoplasmic tails introduced by alternative 3’ UTR splicing. Together, these results indicate that complex alternative splicing of Mamu-E yields proteins with potentially diverse functions.
Separately from our Mamu-E analysis, we recovered 41 unique HLA-E isoforms collectively supported by 2,050 FL CCS reads from a human PacBio Iso-Seq dataset from 60 myelogenous patient samples (Supplementary Fig. 2). Interestingly, all new splicing patterns in Mamu-E isoforms were found among these HLA-E isoforms, with 4 perfect isoform matches. Additionally, the human 3’ and 5’ UTRs were of comparable length to those in RM. A similar pattern was also observed among HLA-E isoforms, where most FL reads (1,788 of 2,050, 87%) supported the canonical splicing configuration. When resampling HLA-E isoforms using sequencing depth commensurate with Mamu-E (i.e. 123 FL CCS reads), 12.6 isoforms were detected on average, suggesting HLA-E and Mamu-E may have similar spliceosome complexities. Despite the greater number of isoforms detected in human (41 vs. 13), only 25 contained the Alpha 1 and 2 domains needed for peptide binding (Supplementary Fig. 2, Supplementary Table 3). There were also many retained intron events detected between exons 1 and 2 (19 of 41 isoforms) of HLA-E compared to Mamu-E (1 of 13). While the retained intron led to a frameshift and a premature stop codon in the Mamu-E isoform, this did not affect the reading frame in human isoforms (Supplementary Fig. 2). Further, while 3’ UTR splicing diversity was evident in human, it did not impact the cytoplasmic tail, as the HLA-E open reading frame (ORF) terminates before the last splice junction (i.e. in exon 7); whereas Mamu-E isoforms terminate shortly after the junction (i.e. in exon 8) (Fig. 1, Supplementary Fig. 2). Upon closer inspection, we determined that this was due to a difference in the exon 7 reading frame (data not shown).
Several Mamu-E isoforms with few FL CCS read support (8 of 13) captured new splicing patterns but failed to recover the complete Mamu-E 5’ end to varying degrees (Fig. 1). Given the consistency of the HLA-E and Mamu-E exon structures, we inferred 5’ ends for these incomplete isoforms. Next, we designed PCR assays to target the unique splicing features of these inferred isoforms and isolated the resulting bands for Sanger (Supplementary Table 4, Methods). The 5’ ends of most isoforms (6 of 8) were confirmed using this approach, and unexpectedly we identified three new isoforms (Fig. 1; Mamu-E11, 13, and 15). These isoforms match PacBio-derived isoforms (Mamu-E10, 12, and 14, respectively), but lack a retained intron between exons 4 and 5 (Fig. 1).
We hypothesized that this complex alternative splicing might be in part associated with transposable elements (TEs) in the Mamu-E locus. TE sequences are known to permeate the MHC region in RM44 and human45, 46, and they are believed to play a significant role in human disease47, 48. Alu elements, a type of transposon, have a strong connection with transcriptional regulation, as they can influence alternative splicing49, 50 and function as enhancers51. We screened the Mamu-E locus and upstream and downstream genomic regions, finding eight elements on the sense strand and two on the antisense strand (Fig. 1). Interestingly, all eight of these TEs were found in the HLA-E locus in similar locations (data not shown), suggesting these were translocated prior to the split between old and new world monkeys. Two Alu elements (AluJb and AluY) were found directly upstream of the 5’ UTR (Fig. 1), suggesting a possible role in transcriptional activation. We also detected an Alu element (AluSx3) on the antisense strand between exons 5 and 6, coincidentally where six isoforms (Mamu-E8-13) have novel splicing acceptor/donor sites that result in exons partially spanning the Alu element. Another AluY and two other TEs were found in the 3’ UTR, suggesting that alternative splicing in this region might be influenced by and/or influence their function. Lastly, mammalian-wide interspersed repeat (MIR)b was found directly downstream of the transcriptional termination site with a fully intact AluYf1 element directly adjacent to it (Fig. 1). Like Alu elements, MIRs can function as enhancers to promote tissue-specific gene expression52 and there is also evidence that they can be transcribed in human53. Taken together, the presence of complex splicing and deluge of TEs indicate that the Mamu-E and HLA-E loci are under strong transcriptional regulation.
Mamu-E gene duplications are common
Mamu-E has long been known to be polymorphic17, currently with 33 alleles in the Immuno Polymorphism MHC Database (IPD-MHC)54. To date, it has not been investigated whether this polymorphism has any connection with Mamu-E-restricted antigen presentation in response to RhCMV/SIV vaccination. We obtained genomic DNAs from 59 of 60 animals from four RhCMV/SIV vaccine groups, three previously described in 30, and used PacBio long amplicon analysis (LAA) to target and sequence Mamu-E allele sequences (Methods). Across 59 animals, we recovered 152 allele sequences (Supplementary Table 5), assigned to 17 IPD-MHC database alleles. These alleles were composed of four groups: three full-length ~ 5% divergent groups (G1, G2, G2_LTR) and a fourth monomorphic group missing the canonical Mamu-E exon 6 and the surrounding intronic sequence harboring an antisense AluSx3 element (G3) (Figs. 1, 2a-b). G2_LTR alleles are accordingly named by the ~ 700bp solo LTR5B inserted approximately 20bp after the expected start of the amplified sequence 5’ end (e-value ~ 10− 83) (Fig. 2a). G1 alleles were detected in all animals and exclusively in 27 of 59 (46%), while additional alleles from G2, G2_LTR, and G3 were found in 6, 7, and 20 animals, respectively (Table 1).
G2, G2_LTR, and G3 alleles were also found to be in complete linkage with Mamu-E*02:02, Mamu-E*02:11, and Mamu-E*02:04 (G1 alleles), respectively. We confirmed the presence of multiple Mamu-E loci in 1 of 4 selected animals (animal ID #Rh28808) using fosmid isolation followed by PacBio DNA sequencing (Methods, Supplementary Table 6). The fosmid sequence from this animal contained both the G3 allele (E*02:13V-short) and the E*02:04 allele separated by ~ 20kb, supporting the linkage we observed between these alleles across multiple animals. No animals were found to have alleles from all four groups and or have > 2 alleles from any of the groups, with the exception of one animal (animal ID # Rh29659). We recovered a third G1 allele (Mamu-E*02:03, also found in 11 other animals), which was not detected in our later expression analysis (data not shown). The presence of additional MHC-E alleles in the same animals was not associated with vaccine group (Fisher’s exact test: p = 0.569) or protection outcome (Fisher’s exact test: p = 1) (Methods, Table 1), where the E group was excluded as there was no protection observed among its animals. However, all 7 animals from groups O, S, and X with G2_LTR alleles were not protected and the association with protection outcome was statistically significant (Fisher’s exact test: p = 0.02, Table 1).
Next, we investigated the segments driving the sequence differences among allele groups by separately analyzing the exons, introns, and the sequence recovered upstream of 5’ UTRs. We observed that G2_LTR alleles significantly diverged from all other alleles even when removing the inserted LTR5b sequence (Fig. 2b-c). G1 alleles tended to cluster together in the 5’ upstream region, while a small subset clustered with G2 alleles and G3 alleles shared some similarities with both clusters (Fig. 2c). We found that G1 and G2 alleles were more similar in exons 1 and 2, while both G2_LTR and G3 alleles significantly diverged (Fig. 2c). Interestingly, all 4 allele groups diverged in exon 3 (Alpha 2), intron 3, and exon 4 (Alpha 3) (Fig. 2c), suggesting these allele groups may function differently.
Mamu-E expression in whole blood is dominated by a single locus
To explore the potential functional divergences among duplicated Mamu-E alleles, we sought to determine if Mamu-E genes of these allele groups are similarly expressed. We examined Mamu-E gene expression using mRNA-seq analysis of whole blood samples collected from the same animals during the pre-challenge phase of a RhCMV/SIV vaccine study before and after the prime and boost phases (Methods). Nine samples from each of the 59 animals (531 total) were sequenced, yielding ~ 14.8 billion reads (~ 27.8 million reads per sample). For each animal, reads were aligned to the MHC Class I/II BAC reference with the Mamu-E locus masked and animal-specific Mamu-E allele sequences as separate contigs (Methods).
We calculated the relative expression of allele groups in all animals expressing at least 1 allele from more than one group based on our genomic analysis (Table 1). The proportions of expression from each allele group were fairly stable throughout the pre-challenge phase, with G2, G2_LTR, and G3 alleles composing approximately 25%, 10–15%, and 5% of expression, respectively (Fig. 3a). While G1 alleles composed most of the Mamu-E expression, both the relative (Fig. 3b) and absolute (Fig. 3c) G1 allele expression levels varied contingent on the extra allele groups present in the same animals. For example, when G2_LTR alleles were present, the absolute G1 allele expression levels were about 30% higher (Fig. 3c). However, when G3 alleles were present, the absolute G1 allele expression levels were about 30% lower.
RhCMV/SIV vaccination elicits MHC-E-restricted T-cell responses, so we next sought to determine the effect of vaccination on the expression of these alleles. We observed that in animals expressing alleles from > 1 group, the expression of allele groups was strongly correlated (Fig. 3d). When examining total Mamu-E expression (i.e. pooled allele expression), we found that Mamu-E expression increased significantly following vaccination prime and boost, regardless of protection outcome (Fig. 3e), suggesting that RhCMV/SIV vaccination may influence the functions of Mamu-E.
We also examined the relative expression of alleles expressed within the same group, finding that we could reliably recover allele-specific read counts even with little polymorphism between alleles (Supplementary Fig. 3a). We also observed fairly even allelic coverage within loci regardless of allele group that was also stable throughout the pre-challenge phase (Supplementary Fig. 3b-c). One exception to this was in one animal (animal ID # Rh28835 from the S group), where one G1 allele was found to be expressed substantially less than the other (Supplementary Fig. 3b). Interestingly, the lowly expressed G1 allele was the only allele among all animals with an insertion, which incidentally resulted in a frameshift and premature stop codon. These results indicate that G1 alleles tend to be expressed at relatively similar levels to each other and several times higher than G2 and G3 alleles.
Confirmation and extension of Mamu-E G1 alleles using mRNA-seq based haplotype phasing
We independently assessed the accuracy of our Mamu-E allele sequencing at per base level and captured additional 3’ UTR variation using the collected whole blood mRNA-seq data. Also, as shown in Figs. 1, 2a, Mamu-E transcribes a much longer 3’ UTR than the canonical annotation. This long 3’ UTR was not covered in our allele genomic sequencing which was designed to target coding regions (Fig. 2a). We focused this mRNA-seq based analysis on alleles in the G1 group since their expression was dominant, making this effort feasible (Fig. 3a-b).
We first assessed the depth of mRNA-seq read coverage of Mamu-E and the ability to capture Mamu-E polymorphism accurately using short read mRNA-seq data. We observed ~ 4–5% of total reads mapped to the Mamu Class I and II complexes and 10,000 per base Mamu-E coverage (Supplementary Fig. 1). We also found that recovery of the transmembrane domain region polymorphisms was intractable likely due to greater conservation of this region with other MHC genes using a kmer-based strategy (Supplementary Fig. 5a, Methods). Recovery of polymorphisms in 3’ UTR regions harboring TEs was also found to be intractable, leading to their exclusion (Supplementary Fig. 4b). Lastly, low coverage bases proximal to the transcriptional start and termination sites were excluded from this haplotype phasing analysis (Supplementary Fig. 4c).
For the remaining highly confident regions we generated completely contiguous haplotype blocks spanning the entire Mamu-E region, resulting in Mamu-E haplotigs (Methods). Almost all (1,146 of 1,150, 99.7%) heterozygous variant calls were successfully phased for all animals. As expected, we did not observe a lower fraction of read support for the G1 haplotype configuration in animals with additional G2 and G3 alleles (Supplementary Fig. 5a), given the dominant expression of G1 alleles. On average almost 100% of the variants identified by haplotigs derived from mRNA-seq reads were identical to the most similar G1 allele sequences within each animal where they overlap (i.e. excluding the 3’ UTR) (Supplementary Fig. 5b). For each animal, we matched haplotigs against allele sequences, determining that variant phasing was also highly concordant (> 96% of variants) with differences only arising due to mRNA-seq variant calling issues in fringe locations just passing our required per base coverage threshold (Supplementary Fig. 5c). This nearly perfect agreement between these two independent methods (DNA sequencing via PacBio LAA, haplotig recovery via mRNA-seq) shows the extremely high accuracy of sequences we obtained by LAA. We merged these G1 alleles with their matched haplotigs, producing final, complete G1 allele sequences spanning the entire Mamu-E locus including both coding regions and long 3’ UTRs.
Characteristics of Mamu-E G1 allele variants and their associations with RhCMV vaccine protection
Next, we examined the variation recovered across these merged G1 allele sequences, since all animals have at least one copy of G1 alleles and G1 alleles contributed the majority of Mamu-E expression in whole blood samples (Fig. 3a-b). Variants were identified throughout the whole Mamu-E G1 locus, protein coding regions and both UTRs (Fig. 4a-b). Single nucleotide polymorphisms (SNPs) were also found to be non-synonymous, producing a total of 42 unique single amino-acid polymorphisms (SAPs) spread across all protein domains (Fig. 4b). However, none of the SAPs located in the Alpha 1 and 2 domains were located in the predicted B and F pocket key binding sites55, 56 (Supplementary Fig. 6). Interestingly, G2 and G3 allele polymorphisms also did not affect key binding sites. However, those in G2_LTR alleles impacted 5 sites across Alpha 1 and 2, indicating that they likely have significantly altered function.
Given the extent of polymorphism recovered among G1 alleles (108 SNPs, 2 insertions, 3 deletions), we decided to more closely inspect individual variants and determine the extent of linkage disequilibrium (LD) within the Mamu-E G1 locus. We found that 55 of 113 (48.7%) passed a minor allele frequency (MAF) threshold of 0.1 (Supplementary Table 7). Variants passing the MAF filter were located throughout the 3’ UTR and coding region of the Mamu-E locus (Fig. 4a). We detected substantial correlation (i.e. LD) of variants both locally and between variants distant from one another in the Mamu-E G1 locus (Fig. 4a). When we grouped these variants based on their correlations, we found 2 major clusters of correlated variants of size 21 and 18 SNPs, with 10 additional clusters of size 4 or less (Fig. 4c). The only indel that passed the MAF filter (a deletion in the transmembrane (TM) domain) was in strong LD with 3 SNPs also in the TM domain (cluster 3 in Fig. 4c). Interestingly, the two large clusters of SNPs were each comprised of a set of 3’ UTR variants along with variants from the Alpha domains, TM domain, cytoplasmic domain, and 5’ UTR (Fig. 4c). We also observed that when represented in a phylogeny, final G1 allele sequences formed 3 major subgroups, with one much larger than the other two (Fig. 4d). However, there was no significant association between these G1 allele subgroups and vaccine group (Fisher’s exact test, p = 0.303) or protection outcome (Fisher’s exact test: p = 0.313). We then cross-referenced the variant clusters identified (Fig. 4c) with the three major G1 allele subgroups identified (Fig. 4d), finding that G1 subgroup 1 alleles contained the major form of both of the two large variant clusters, subgroup 2 contained the minor form of both, and subgroup 3 contained the minor and major form of the first and second, respectively (data not shown).
We then examined the genotypes of animals across each of the variant clusters as well as individual variants, finding that there was no statistically significant association with protection outcome among vaccine groups O, S, and X (Fig. 4e, Supplementary Table 7) or with vaccine groups (Supplementary Fig. 7, Supplementary Table 7). However, we did observe a tendency for protected animals to favor the major form of variant cluster 3 and the minor forms of variant clusters 8 and 9 (p = 0.116, 0.124, 0.116 and FDR = 0.497, 0.497, 0.497, respectively) (Fig. 4e, Supplementary Table 7).
Since cluster 3 harbored 4 variants (including a deletion) in the TM domain as did variant cluster 9 (SNP), we examined the differences in hydrophobicity scores across all allele TM domains. We observed a reduction in N-terminal TM hydrophobicity among all G2_LTR and G2 alleles relative to G1 alleles and HLA-E (Fig. 5). We also found that G3 alleles and G1 alleles harboring variant cluster 3 and had increased C-terminal TM hydrophobicity relative to other Mamu-E alleles as well as HLA-E, while Mamu-E alleles with cluster 9 had unaltered hydrophobicity (Fig. 5). These results suggest that Mamu-E polymorphisms in this region may impact Mamu-E protein transport and/or membrane stability. Furthermore, we cannot make a complete determination since variants located within 3’UTR TEs and indels in the 3’ UTR region were not included in this analysis (Supplementary Fig. 4b).
Since we determined that the Mamu-E expression in whole blood was driven by G1 alleles, we investigated if the G1 allele expression, especially isoform usage, in whole blood could be related to RhCMV/SIV vaccine protection outcome. The abundances of Mamu-E G1 allele isoforms were quantified using full G1 allele sequences and relative proportions of isoforms were determined for all pre-challenge time points for each animal (Methods). We found that Mamu-E1 isoforms were most prevalent, composing ~ 80% of isoform abundances, while all other isoforms were detected at lower levels (Supplementary Figs. 8, 9). We also observed that relative isoform usages were largely stable in whole blood throughout the pre-challenge phase, regardless of vaccine protection outcome and vaccine group (Supplementary Fig. 9). Mamu-E isoforms were observed in different strata based on their relative expression (Supplementary Fig. 8). Mamu-E2, 4, 6, 8, and 14–16 formed a second stratum after Mamu-E1, each representing ~ 1–10% of isoforms expressed. Mamu-E3, 5, 7, 9, 10, and 12 formed a third stratum, each expressing ~ 0.1-1% of isoforms. Mamu-E11 and 13 were especially rare, defined as a fourth stratum with < 0.1% of isoform expression. Incidentally, these two rare isoforms were only identified using Sanger sequencing (Fig. 1).