Information about MHC class I alleles is sparse in sheep, partly due to the lower research resources available compared e.g. to humans or cattle but also due to the complex haplotype and gene organization of the MHC. Less than 40 MHC class I alleles are officially assigned by the IPD database, compared to more than 20.000 in humans. This incomplete knowledge of allelic variation limits, among other issues, the development of typing systems and immunological studies. Therefore, new cost-efficient ways to expand the ovine class I allele panel are urgently sought.
MHC class I alleles are among the most polymorphic genes of all. For example, if we consider the region between nucleotide position 474 and 593 for the alleles found in this publication, the average number of nucleotide differences is 18.5 with a maximum number of 32/120 bp (15% and 27% differences, respectively) in a length range common for sequence reads from RNA-Seq experiments. This high degree of polymorphism combined with the unclear genomic organization of MHC class I genes makes the identification of new alleles in sheep very difficult.
DinoMfRS combines the alignment of RNA-Seq data to an incomplete MHC class I reference database with de novo reassembly of a subset of RNA-Seq reads (align -> extraction of RNA-Seq reads -> de novo reassemble) to get the information about new alleles. This strategy proved to be highly effective. From 16 animals 61 different MHC class I alleles could be identified of which 53 were novel. This almost doubled the number of known ovine MHC class I alleles and will facilitate the detection of further alleles in future experiments. Only 4 of 60 alleles identified in this work are shared between breeds. This reflects the high genetic plasticity of MHC genes, which leads to population-specific allele sets.
DinoMfRS enables resource-efficient expansion of the MHC class I allele database by using existing data from NGS experiments. RNA-Seq data are used for a variety of different applications, from quantification of expression to analysis of splice variants. As the cost of NGS technology continues to decrease, the number of available RNA-Seq datasets in farm animals is rapidly increasing. However, inference of class I alleles from these datasets in sheep has been largely unsuccessful because the methods used rely on strict alignment algorithms, which by their nature do not allow alignment of highly divergent sequence reads to genomic or transcriptomic reference sequences. The approach developed here overcomes these obstacles by combining the alignment to a reference sequence with a de novo alignment approach and enables the identification of MHC class I alleles from ovine RNA-Seq data despite the high degree of sequence differences.
Using DinoMfRS, all expressed alleles can be identified. While methods using specific PCR primers run the risk of missing alleles that show variation in the primer sequence region, the use of RNA-Seq data enables unbiased identification of all alleles. This is a major advantage, especially for highly polymorphic gene regions such as the MHC. In addition, MHC class I molecules are basically expressed on the surface of almost all cells that contain a nucleus and this, in combination with the high expression level greatly facilitates their identification even in RNA-Seq datasets with a limited number of reads. In addition, DinoMfRS can be easily extended to other genes such as MHC class II and allows the study of expression levels.
The efficiency of DinoMfRS increases with the number of known alleles. When little information is available about the allelic spectrum the identification of the first novel alleles using DinoMfRS is time consuming since several steps are necessary. With increasing completeness of the reference database the method speeds up, since more and more alleles can be determined by one BWA-MEM run. When the allelic spectrum is sufficiently covered by the reference database the method can be used for low or medium throughput high resolution genotyping of MHC. Several techniques have been developed for high-resolution genotyping of MHC alleles from short sequence reads, most of them for the HLA with its high information density available (26). The approaches can be divided into two groups: the de novo assembly and the alignment approach, in which short sequence sequences are aligned to known reference allele sequences. Both have their disadvantages (26). DinoMfRS combines the advantages of both approaches, and generating RNA-Seq data from individuals combined with DinoMfRS may be an alternative to current genotyping methods for MHC genes, which are complex and expensive.
When RNA-Seq data are available for the animal whose DNA was sequenced to build a genomic sequence, DinoMfRS can greatly improve genomic annotation of the MHC region, as demonstrated here for the current reference sheep genome. Although a high-quality reference sheep genome is available, annotation of the MHC region is incomplete. Using DinoMfRS derived alleles from Benz2616 RNA-Seq data, the annotation of MHC class I alleles in the genome sequence was reviewed. At least one MHC class I allele was present in the reference sequence but not annotated, some MHC class I gen-containing contigs have not yet been mapped to the ovine chromosome sequence, and the current sheep genome release contains sequences from two different haplotypes. Based on whole-genome contig data in combination with DinoMfRS it could be shown that alleles LfL2082, LfL2087, and LfL2088 likely comprise one haplotype and alleles LfL2083, LfL2084, LfL2085, LfL2086, and LfL2090 comprise the second.
Beneath completing the set of ovine MHC class I alleles and MHC genotyping DinoMfRS can contribute to a number of further research fields. One is the clarification of the evolutionary mechanisms driving the high variability of MHC genes. The evolution of MHC class I genes of the family Bovidae remains largely obscure up to now because neither the assignment of alleles to individual gene loci nor the allelic spectrum is sufficiently known. Completion of the allelic spectrum also facilitates the annotation of genes in genomic sequences and can thus contribute in two ways to elucidating the evolutionary mechanisms of the MHC and related aspects such as mate choice influenced by the MHC. DinoMfRS can help to overcome the problems in gene expression studies that include MHC class I genes. It enables the analysis of all expressed MHC class I alleles individually, overcoming methodological problems in identifying differentially expressed MHC genes in gene expression profiling. Further, it will be useful for identifying expressed alleles in other gene-families that show different copy numbers per chromosome or are highly polymorphic (e.g. odorant receptors (27)).
Finally, as is already the case in humans and experimental animals, the identification of pathogen antigens bound to MHC and their recognition will become important in the future for understanding disease resistance in livestock. The process of assigning each ligand to its presenting MHC molecule is a critical step in the analysis of MHC ligand data (28). The completion of the MHC class I allelic spectrum will be a prerequisite for wet-lab and bioinformatics analyses of the immunopeptidome in sheep.