The genomic organization of the four T-cell receptor (TR) loci, i.e. alpha (TRA), beta (TRB), gamma (TRG) and delta (TRD), is complex. The four loci are distributed over three genomic regions on two chromosomes in the human genome: TRA and TRD are intermingled on chromosome 14q11.2 (with TRD embedded within the TRA locus), whereas TRB and TRG lie on different arms of chromosome 7. The TRB and TRD loci comprise Variable (V), Diversity (D) and Joining (J) genes, whereas the TRA and TRG loci contain V and J genes only. The polypeptides encoded by functionally rearranged TRA and TRB loci combine to form the TRαβ receptor, whereas those encoded by functionally rearranged TRD and TRG loci form the TRγδ receptor; both receptors contain an antigen-recognition domain. The recombination of V(D)J genes has the potential to generate many millions of different TR molecules, each with a unique antigen-binding specificity 1. The V(D)J recombination process is directed by “recombination signal sequences” (RSS), short, highly conserved DNA stretches present at each recombination site of the TR genes, i.e. downstream of V genes, upstream of J genes, and on both sides of D genes 2,3.
TR genes harbor inter-individual germline allelic variants, so that different individuals are able to produce different receptors. Because these allelic variants are shared within confined human populations 4, they also contribute to an even greater diversity of receptors at the population level 5. These population-specific germline variations have been shown to contribute to varying disease prevalence in specific populations 6–9. For example, in Asian and Caucasian populations, TRBV17 plays a pivotal role in Influenza A virus-specific T-cell immunity 10. Consequently, a catalogue of TR alleles observed across populations is crucial for understanding (population-specific) immune responses. To date, however, only one database reports all alleles for the TR loci: the International ImMunoGeneTics information system (IMGT) 11,12. This database does not report allele frequencies or population statistics, and the reported alleles are mostly profiled from Caucasian populations 13,14.
To enrich the catalogue of TR germline genes with population information, we relied on the “1,000 Genomes (G1K)” dataset (https://www.internationalgenome.org/), derived from cell samples of 2,548 individuals across five different ethnicities. We are not the first to do so. Yu et al. created the Lym1k database for immunoglobulin (IG) and TR loci, also from the G1K data, using their AlleleMiner tool 14. They did not, however, provide any information on the reliability of the (newly) identified alleles, and the link to population information was not retained. Moreover, not all relevant components of each TR locus were stored: the D, J and C genes and the RSSs were omitted. Finally, they were not able to profile all TR genes because they used a previous version of the G1K dataset (i.e. a mapping to GRCh37 that was lifted over to GRCh38).
Here, we identify the alleles for all components of all four TR loci, i.e. the V, D, J and C genes as well as the RSSs, report reliability scores for the detected alleles together with population information for each allele, and present an online database containing this information, which we call the “population-matched germline allelic variants of T-cell receptor loci” database, or pmTR database for short. To realize this, we developed an automated pipeline that profiles all TR alleles from the G1K data. The pipeline returns the sequence and frequency of each allele, as well as its distribution among the 26 populations profiled in the G1K resource. The resulting alleles are manually curated and made available to the community via GitHub and the online database (www.pmTRIG.com), including population information and confidence levels. We have also enabled a BLAST search on the database so that our germline alleles can be used directly in further research.
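To make the kind of per-allele summary described above concrete, the following minimal Python sketch tallies allele counts, frequencies and population distributions from a toy list of allele calls for a single gene. It is an illustration under stated assumptions, not the pipeline itself: the function name `summarize_alleles`, the input layout (one tuple per observed haplotype) and the example calls are all hypothetical.

```python
from collections import Counter, defaultdict


def summarize_alleles(calls):
    """Summarize allele calls for ONE gene.

    `calls` is an iterable of (sample_id, population, allele_name) tuples,
    one tuple per observed haplotype. This layout is an assumption made
    for illustration, not the actual pipeline format.
    """
    total = Counter()               # allele -> number of observations
    by_pop = defaultdict(Counter)   # allele -> population -> observations
    for _sample_id, population, allele in calls:
        total[allele] += 1
        by_pop[allele][population] += 1

    n_observations = sum(total.values())
    return {
        allele: {
            "count": count,
            "frequency": count / n_observations,
            "populations": dict(by_pop[allele]),
        }
        for allele, count in total.items()
    }


# Toy input: the sample IDs and allele assignments are invented for illustration.
calls = [
    ("S1", "GBR", "TRBV17*01"),
    ("S1", "GBR", "TRBV17*01"),
    ("S2", "CHB", "TRBV17*01"),
    ("S2", "CHB", "TRBV17*02"),
]
summary = summarize_alleles(calls)
for allele, stats in sorted(summary.items()):
    print(allele, stats["count"], stats["frequency"], stats["populations"])
```

Frequencies here are computed relative to all observed haplotypes of the gene; a real analysis would additionally need to handle missing calls and attach a confidence level to each allele.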