The MMS algorithm
The MMS algorithm assigns human-based OR symbols by detecting hierarchical pairwise similarities between species (Fig. 1). The algorithm first analyzes the all-versus-all BLASTP identity matrix of the given OR repertoire versus human. Mutual best hits with at least 80% identity are identified first and are assigned the same symbol as the human best hit. Second-best ortholog candidates with at least 80% identity are then identified and are assigned the symbol of the human best hit with the addition of the letter B, C, etc.; the letter P is skipped because it is reserved for pseudogenes. For example, OR9A4 is the mutual best hit and OR9A4B is the second best hit. The 80% identity cutoff is based on our previous studies of sequence similarities between mutual best hits across mammals [5, 11], and on phylogenetic analyses that we performed on a randomly selected set of OR subfamilies during the current project (data not shown). We previously reported a high level of conservation of ORs across mammals [5, 30], and thus we believe that this cutoff can be used for most placental mammals. The remaining ORs are compared to non-human ORs to detect orthology relationships among non-human ORs. OR genes that have not been named in the above steps are classified into families and subfamilies in the same way as the human ORs [10, 11] and as shown in Fig. 1. We use a ≥60% identity cutoff to classify a gene as a new member of a subfamily, with the next available subfamily member number. If the sequence identity is below 60%, we use a ≥40% identity cutoff to classify the gene into an OR family, with a novel subfamily symbol (see above and Olender et al. 2013 [11] for more details). At the step of subfamily classification (Step 4, Fig. 1), the similarity matrix includes the within-species OR repertoire in addition to the other species’ OR repertoires. This step is required for the correct classification of species-specific subfamilies.
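The naming steps described above can be illustrated with a minimal Python sketch. It is not the MMS implementation: the function names, the dictionary-based identity matrix, the tie-breaking order and the treatment of the cutoff as "at least 80%" are assumptions made for illustration, and the sequence identifiers are hypothetical. The sketch also does not check that a human symbol has a mutual best hit before a suffixed symbol is derived from it.

```python
from string import ascii_uppercase

# Suffix letters for second-best hits: B, C, D, ... with P skipped,
# because P is reserved for pseudogenes.
SUFFIXES = [c for c in ascii_uppercase[1:] if c != "P"]
ORTHOLOG_CUTOFF = 80.0  # percent identity; assumed to mean "at least 80%"


def assign_human_based_symbols(identity, human_symbols):
    """Return {query_id: symbol} for queries that pass the ortholog cutoff.

    identity: {(query_id, human_id): percent identity} (hypothetical layout).
    human_symbols: {human_id: approved human symbol}.
    """
    queries = sorted({q for q, _ in identity})
    humans = sorted({h for _, h in identity})

    def score(q, h):
        return identity.get((q, h), 0.0)

    def best_human(q):
        return max(humans, key=lambda h: score(q, h))

    def best_query(h):
        return max(queries, key=lambda q: score(q, h))

    assigned = {}
    # Pass 1: mutual best hits at or above the cutoff take the human symbol.
    for q in queries:
        h = best_human(q)
        if score(q, h) >= ORTHOLOG_CUTOFF and best_query(h) == q:
            assigned[q] = human_symbols[h]

    # Pass 2: remaining candidates above the cutoff take the symbol of their
    # best human hit plus the next free suffix, highest identity first.
    next_suffix = {}
    leftovers = [q for q in queries if q not in assigned]
    leftovers.sort(key=lambda q: -score(q, best_human(q)))
    for q in leftovers:
        h = best_human(q)
        if score(q, h) >= ORTHOLOG_CUTOFF:
            i = next_suffix.get(h, 0)
            assigned[q] = human_symbols[h] + SUFFIXES[i]
            next_suffix[h] = i + 1
    return assigned


def classification_tier(best_identity):
    """Tier for genes left unnamed after the ortholog passes."""
    if best_identity >= 60.0:
        return "new member of existing subfamily"
    if best_identity >= 40.0:
        return "new subfamily in existing family"
    return "below family cutoff"  # handling below 40% is not specified here


example = {("q1", "h1"): 95.0, ("q2", "h1"): 88.0, ("q3", "h1"): 60.0}
print(assign_human_based_symbols(example, {"h1": "OR9A4"}))
# {'q1': 'OR9A4', 'q2': 'OR9A4B'}
```

In this toy input, q3 falls below the 80% cutoff and would proceed to the family and subfamily classification step.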
An exception to the above order of comparisons was made for the rat, where the repertoire was initially compared to mouse before detecting best human matches, to take into account the close evolutionary distance of the rodents. Symbols are composed of uppercase letters and Arabic numbers, except in rodents where, by convention, only the first letter is capitalized and the suffix “-ps” is used for pseudogenes in place of “P”.
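The formatting convention can be sketched as follows. The function name and its arguments are hypothetical, and the sketch assumes that the pseudogene marker is simply appended to the base symbol.

```python
def format_symbol(base_symbol, is_pseudogene=False, rodent=False):
    """Format a base symbol such as 'OR9A4' for a given species group.

    Assumption: the pseudogene marker is appended to the end of the symbol.
    """
    if rodent:
        # Rodent convention: only the first letter is capitalized,
        # and pseudogenes carry the "-ps" suffix.
        symbol = base_symbol.capitalize()
        return symbol + "-ps" if is_pseudogene else symbol
    # Other species: uppercase letters and Arabic numbers, "P" for pseudogenes.
    symbol = base_symbol.upper()
    return symbol + "P" if is_pseudogene else symbol


print(format_symbol("OR9A4", rodent=True))                      # Or9a4
print(format_symbol("OR9A4", is_pseudogene=True, rodent=True))  # Or9a4-ps
print(format_symbol("OR9A4", is_pseudogene=True))               # OR9A4P
```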
Classification of mammalian OR repertoires
We used the MMS algorithm to assign symbols to the OR repertoires of the mouse, rat, dog, cow, horse, chimpanzee and orangutan (Table 1). Across all 7 mammalian species, the OR genes and pseudogenes were classified into 18 families and 623 subfamilies (Additional file 1, Additional file 2). A comparison between our subfamily classification and the orthologous gene groups (OGGs) suggested by Niimura [31] for the cow, dog and horse genes found in both studies (4587 loci) shows that the OGG method classifies the ORs into smaller subgroups (737 OGGs versus 285 OR subfamilies), where a typical OR subfamily includes on average 2.6 OGGs.
The fraction of genes and pseudogenes assigned as putative orthologs of a human gene (including those with the B/C/D suffix) varies from 83.9% in chimpanzee to ~27% in rodents (Additional file 3: Fig. S1). These numbers are in line with the literature and reflect the rapid evolution of the OR gene family [8, 23], including the ~20% difference between human and chimpanzee OR repertoires we found, which is similar to the ~25% reported by Go [32]. As expected, genes whose symbols are shared among more than one mammal contain a significantly higher fraction of intact ORs, as compared to genes that were assigned as novel subfamily members (Additional file 3: Fig. S2).
To test whether genes assigned as best and second best hits to a human OR gene also lie in syntenic regions, we analyzed mammalian multiz genome alignments [33]. We found that 88% of the mutual best hits and 62% of the second best hits aligned to the exact genomic location of the corresponding human ortholog, and that 96% of the mutual best hits and 82% of the second best hits aligned within a distance of < 100 kb. Thus, most of the ORs that were classified by the MMS algorithm as putative orthologs also reside in the approximate expected syntenic location. We note that, because of the rapid evolution of the OR gene family, synteny is not always expected to be preserved among orthologs.
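The two synteny categories used above can be expressed as a simple interval comparison. The sketch below is an illustration only: the function name, the tuple layout and the reading of "exact genomic location" as any overlap with the human ortholog are assumptions, and the coordinates are invented.

```python
def synteny_class(aligned, ortholog, max_gap=100_000):
    """Compare the human-genome position to which a non-human OR aligns
    with the position of its assigned human ortholog.

    aligned, ortholog: (chromosome, start, end) tuples in human coordinates.
    """
    a_chrom, a_start, a_end = aligned
    o_chrom, o_start, o_end = ortholog
    if a_chrom != o_chrom:
        return "non-syntenic"
    # Distance between the two intervals; zero or negative when they overlap.
    gap = max(a_start, o_start) - min(a_end, o_end)
    if gap <= 0:
        return "exact"
    if gap < max_gap:
        return "within 100 kb"
    return "non-syntenic"


print(synteny_class(("chr7", 1_000, 2_000), ("chr7", 1_500, 2_500)))    # exact
print(synteny_class(("chr7", 1_000, 2_000), ("chr7", 52_000, 53_000)))  # within 100 kb
```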
Species | Genes | Pseudogenes | Total | Genome
chimpanzee | 396 | 427 | 823 | PanTro4
orangutan | 321 | 466 | 787 | PonAbe2
dog | 803 | 206 | 1009 | CanFam3
cow | 1110 | 695 | 1805 | BosTau8
horse | 1101 | 1372 | 2473 | EquCab2
mouse | 1142 | 247 | 1389 | Mm10
rat | 1333 | 457 | 1790 | Rn6
zebrafish | 158 | 14 | 172 | DanRer10
Total | 6365 | 3884 | 10249 |
Table 1
We used phylogenetic analysis to assess the accuracy of the nomenclature assignment. Although phylogenetic analysis cannot, in general, fully resolve the relationships within the OR superfamily because of low bootstrap values (also reported by Niimura [34], Niimura [35], Rimbault [36] and Khan [37]), this is possible within subfamilies, as shown in Fig. 2A. The relationships between the genes in subfamily 10D can immediately be recognized from the phylogenetic tree as well as from the symbols (Fig. 2A). These relationships would not be easily detected using, for example, the current approved mouse gene nomenclature (Fig. 2B). For each of the 50 largest OR subfamilies, representing 45% of the total number of ORs in these species, we generated maximum likelihood and Bayesian phylogenies (Additional files 4 and 5). We then compared the within-subfamily groupings in these phylogenies against the hierarchical naming results. We found that the nomenclature assignments were largely congruent with the phylogenetic groupings of ORs within a subfamily. The 195 exceptions for which both the maximum likelihood and the Bayesian inference trees gave strong phylogenetic support for reclassification were renamed manually (see Additional files 4, 5 and 6). We note that the majority of the ORs that were renamed manually have similarity scores just short of a cutoff that would have led them to be named in line with the phylogeny. Although it is inevitable that rigid similarity cutoffs will occasionally result in this type of classification issue, we provide evidence that the vast majority of OR genes are classified in agreement with phylogenetic analysis and that it is mostly pseudogenized loci that are incorrectly classified. We manually updated a very small proportion of the total OR symbols in this study to ensure that the classification is as accurate as possible in this initial cohort of species. This will ensure that when the OR repertoires of more species are classified using the MMS algorithm, the set of ORs that they are compared to is classified correctly. We performed phylogenetic analysis only within subfamilies, and not in larger family groupings, because it is well established in the literature that, owing to rapid evolution within the OR superfamily, deeper nodes in OR phylogenies are extremely challenging to resolve with confidence [34-37].
Given the relative difficulty of assigning deeper family and subfamily relationships using phylogenetic analysis, the MMS method is preferable to phylogenetic classification: it rapidly classifies ORs into families with a reproducible and consistent methodology. Nevertheless, as this work is ongoing, we note that updates to the genome assemblies of the studied organisms, and future refinements to the MMS algorithm, may result in minor changes to some symbols prior to final approval by nomenclature committees.
The overall congruence of our MMS nomenclature assignment algorithm with both synteny comparison across species and phylogenetic analyses within OR subfamilies strongly supports our methodology for classifying newly identified OR repertoires in a given species easily, rapidly and accurately.
Systematic gene family nomenclature as a tool for evolutionary studies
The unified nomenclature system presented here provides a framework for evolutionary studies of vertebrate ORs. We used our subfamily classification to calculate the Pearson correlation coefficients between the classified OR repertoires of each pair of species, using the number of members of each subfamily (Fig. 3, Additional file 2). The result is in line with evolutionary expectations: closely related species, namely primates (human, chimpanzee, orangutan), Laurasiatheria (dog and cow), and rodents (mouse and rat), cluster together. We observed that the human and chimpanzee subfamily classifications are more similar to each other (Pearson correlation 0.98) than those of mouse and rat (Pearson correlation 0.90). This is expected given that human and chimpanzee diverged ~6-12 million years ago [38, 39], whereas mouse and rat diverged ~12-24 million years ago [40].
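The correlation between two repertoires can be computed from per-subfamily member counts, as in the following sketch. The helper names and the example counts are hypothetical and are not taken from Additional file 2.

```python
from math import sqrt


def subfamily_vector(counts, subfamilies):
    """Member counts in a fixed subfamily order; absent subfamilies count as 0."""
    return [counts.get(s, 0) for s in subfamilies]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long count vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(var_x * var_y)


# Invented counts for two species over the union of their subfamilies.
species_a = {"4D": 10, "6Z": 2, "10D": 4}
species_b = {"4D": 12, "10D": 5, "51A": 3}
subfamilies = sorted(set(species_a) | set(species_b))
r = pearson(subfamily_vector(species_a, subfamilies),
            subfamily_vector(species_b, subfamilies))
print(round(r, 2))
```

Using the union of subfamilies keeps subfamilies that are missing from one species in the comparison as zero counts.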
This nomenclature classification also allows the immediate identification of species-specific expansions and deletions. For example, in the species studied the OR4D subfamily has up to 18 members, except in horse, where it has 57. Horse OR4D genes are found within 18 clusters, the largest of which contains 17 genes. In total, we identified 147 subfamilies for which the gene count in one of the studied organisms was at least double that in the other organisms, presumably representing species-specific expansions (Additional file 7: Table S1A). The OR subfamily 6Z is entirely absent from the primate genomes analyzed, while in other mammals members of this subfamily lie in a single genomic cluster, suggesting a large deletion in the primate lineage. Ten other mammalian OR subfamilies were not identified in the primate genomes studied, of which six are encoded in a single genomic cluster in the mammalian genome, as is subfamily 6Z (Additional file 7: Table S1B). An additional 14 subfamilies were not identified in the rodents, of which eight are encoded in a single genomic cluster (Additional file 7: Table S1B).
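The expansion criterion can be sketched as a comparison of per-species counts within each subfamily. This is one possible reading of "at least double that in the other organisms" (the count must be at least twice the largest count in any other species); the function name is hypothetical and, apart from the horse OR4D count of 57 and the maximum of 18 elsewhere, the numbers are invented.

```python
def species_specific_expansions(counts, fold=2.0):
    """Return {subfamily: species} where one species has at least `fold`
    times as many members as every other species.

    counts: {subfamily: {species: number of members}}.
    """
    expansions = {}
    for subfamily, by_species in counts.items():
        for species, n in by_species.items():
            others = [m for s, m in by_species.items() if s != species]
            if others and n > 0 and n >= fold * max(others):
                expansions[subfamily] = species
    return expansions


counts = {
    "4D": {"horse": 57, "dog": 18, "cow": 12, "mouse": 9},
    "10D": {"horse": 5, "dog": 4, "cow": 6, "mouse": 5},
}
print(species_specific_expansions(counts))  # {'4D': 'horse'}
```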
We further used the assigned OR symbols of human, chimpanzee and orangutan to perform a three-way repertoire comparison (Fig. 4). This analysis identified 437 symbols (51.1% of the human ORs) that are shared among all three apes, with a significantly higher representation of class I (“fishlike”, OR families 51-56) OR genes (p = 4e-6, chi-square test); 215 (29.6%) symbols are shared only between human and chimpanzee, and 55 symbols are shared only between human and orangutan (Fig. 4). Thus, our nomenclature shows that, despite the similarity in OR repertoire size and in pseudogene content within primates, the gene content differs, in line with previously published findings [32].
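Because orthologs share a symbol, a three-way comparison of this kind reduces to set operations on the symbol lists. The following sketch uses a hypothetical function name and invented toy symbol sets.

```python
def three_way_comparison(human, chimpanzee, orangutan):
    """Partition the human symbol set by sharing with the two other apes."""
    return {
        "all_three": human & chimpanzee & orangutan,
        "human_chimpanzee_only": (human & chimpanzee) - orangutan,
        "human_orangutan_only": (human & orangutan) - chimpanzee,
        "human_only": human - chimpanzee - orangutan,
    }


# Toy symbol sets, for illustration only.
human = {"OR51A1", "OR9A4", "OR1A1", "OR2B2"}
chimpanzee = {"OR51A1", "OR9A4", "OR9A4B"}
orangutan = {"OR51A1", "OR1A1"}

for group, symbols in three_way_comparison(human, chimpanzee, orangutan).items():
    print(group, sorted(symbols))
```

The four groups are disjoint and together cover the human symbol set, so their sizes can be reported directly as counts and percentages of the human repertoire.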
Classification of zebrafish ORs
In an effort to extend this work to non-mammalian species, we classified the OR repertoire of Danio rerio (zebrafish), a popular model organism in studies of the olfactory system [41-43; Friedrich 2014; Abreu 2016]. Previous publications [24, 25] identified a repertoire of ~140 OR genes in the zebrafish genome which, although smaller than mammalian OR repertoires, shows greater intra-species sequence diversity than is seen in mammals. As mentioned previously, the study of Alioto [24] also proposed a nomenclature for zebrafish ORs, which is based on phylogenetic classification and groups the zebrafish ORs into classes rather than into families and subfamilies.
We used version GRCz10 of the zebrafish genome to identify an updated repertoire of 172 zebrafish OR genes, including 13 pseudogenes. We then used the MMS algorithm to assign symbols to each of the zebrafish genes. However, as zebrafish ORs are very distant from mammalian ORs and are expected to be classified into different OR families, we first manually named zebrafish-specific OR family representatives, which were added to the library of classified ORs (hierarchical clustering, Additional file 3: Fig. S3). Importantly, the OR family numbers were selected to fit with the study of Alioto [24], which used phylogenetic analysis to classify the zebrafish ORs into eight classes, two of which are found in mammals (classes A and B). We assigned a distinct set of family numbers to every class (Additional file 3: Table S2), e.g., class C: OR family numbers 30-39; class D: OR family numbers 40-49. Because the mammalian classes have already been assigned the family numbers 51-59 (class A) and 1-14 (class B), classes E, F, etc. were assigned numbers starting with 60, 70 and so on, respectively. We then proceeded with the symbol assignment process by applying the MMS algorithm with the same cutoff criteria that we used for mammals (see Methods). The zebrafish OR repertoire was classified into 20 families, of which two are shared with mammals (OR families 6 and 55) (Fig. 5). We note that the similarity to mammalian ORs could not be detected using the nomenclature suggested by Alioto [24].
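The mapping between classes and family numbers can be written as a small lookup, sketched below. The ranges for classes A to D are those given in the text. The upper bounds for classes E and F are an assumption (a block of ten numbers each, by analogy with classes C and D), and the remaining classes, which would continue the same pattern, are left out. The names are hypothetical.

```python
# Inclusive family-number ranges per class. Classes A-D follow the text;
# the upper bounds of E and F are assumed (only their starting numbers
# are stated), and later classes are omitted.
CLASS_FAMILY_RANGES = {
    "A": (51, 59),
    "B": (1, 14),
    "C": (30, 39),
    "D": (40, 49),
    "E": (60, 69),  # assumed upper bound
    "F": (70, 79),  # assumed upper bound
}


def class_of_family(family_number):
    """Return the class whose range contains the family number, else None."""
    for class_name, (low, high) in CLASS_FAMILY_RANGES.items():
        if low <= family_number <= high:
            return class_name
    return None


# The two zebrafish families shared with mammals fall in the mammalian classes.
print(class_of_family(6))   # B
print(class_of_family(55))  # A
```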