The MMS algorithm
The MMS algorithm assigns human-based OR symbols by detecting inter-species hierarchical pairwise similarities (Fig. 1). The algorithm first analyzes the all-versus-all BLASTP identity matrix of the given OR repertoire versus human. Mutual best hits with 80% identity are identified first and are assigned the same symbol as the human best hit. Second-best ortholog candidates with 80% identity are then identified and are assigned the gene symbol of the human best hit with the addition of the letters B, C, etc. For example, OR9A4 is the mutual best hit and OR9A4B is the second best hit. The remaining ORs are compared to non-human ORs to detect non-human OR orthology relationships. OR genes that have not been named in the above steps are classified into families and subfamilies as shown in Fig. 1 and as explained in [11], either as a new member of an existing subfamily, taking the n+1 subfamily member number, or as a member of a novel subfamily [11]. At the step of subfamily classification (Step 4, Fig. 1), the similarity matrix includes the within-species OR repertoire in addition to the other species’ OR repertoires. This step is required for the correct classification of species-specific subfamilies. An exception to the above order of comparisons was made for the rat, whose repertoire was compared to mouse before detecting best human matches, to take into account the close evolutionary distance between the two rodents. Symbols are composed of uppercase letters and Arabic numbers, except in rodents where, by convention, only the first letter is capitalized and the suffix “-ps” is used for pseudogenes in place of “P”.
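The first two naming steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `assign_human_based_symbols`, the dictionary layout of the identity matrix, and the toy identity values are hypothetical, and the 80% figure is treated as a minimum cutoff, which is an assumption. The later steps (comparison to non-human ORs and family/subfamily classification) are omitted.

```python
from collections import defaultdict

# Percent-identity cutoff taken from the text; treating it as a minimum is an assumption.
IDENTITY_CUTOFF = 80.0


def assign_human_based_symbols(identity, cutoff=IDENTITY_CUTOFF):
    """Assign human-based symbols from a query-versus-human identity matrix.

    identity: dict mapping (query_gene, human_symbol) -> percent identity
              (hypothetical layout, e.g. parsed from BLASTP output).
    Returns a dict mapping query_gene -> assigned symbol; genes that fail the
    cutoff are left out and would be handled by the later steps.
    """
    best_human_for_query = {}
    best_query_for_human = {}
    for (query, human), pid in identity.items():
        current = best_human_for_query.get(query)
        if current is None or pid > identity[(query, current)]:
            best_human_for_query[query] = human
        current = best_query_for_human.get(human)
        if current is None or pid > identity[(current, human)]:
            best_query_for_human[human] = query

    symbols = {}
    # Step 1: mutual best hits that pass the cutoff take the human symbol.
    for query, human in best_human_for_query.items():
        if best_query_for_human[human] == query and identity[(query, human)] >= cutoff:
            symbols[query] = human

    # Step 2: remaining genes whose best human hit passes the cutoff take the
    # human symbol plus a suffix letter (B, C, ...), ordered by decreasing identity.
    # Simplification: the sketch does not check that the bare symbol was assigned in step 1.
    candidates = defaultdict(list)
    for query, human in best_human_for_query.items():
        if query not in symbols and identity[(query, human)] >= cutoff:
            candidates[human].append(query)
    for human, queries in candidates.items():
        queries.sort(key=lambda q: (-identity[(q, human)], q))
        for suffix, query in zip("BCDEFGHIJKLMNOPQRSTUVWXYZ", queries):
            symbols[query] = human + suffix
    return symbols


# Toy identity values (hypothetical), chosen to mirror the OR9A4 / OR9A4B example.
toy_identity = {
    ("g1", "OR9A4"): 92.0, ("g1", "OR2B2"): 50.0,
    ("g2", "OR9A4"): 85.0, ("g2", "OR2B2"): 45.0,
    ("g3", "OR9A4"): 60.0, ("g3", "OR2B2"): 55.0,
}
toy_symbols = assign_human_based_symbols(toy_identity)
```

In this toy input, `g1` is the mutual best hit of OR9A4, `g2` is a second-best candidate, and `g3` falls below the cutoff and is left for the subsequent classification steps.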
Classification of mammalian OR repertoires
We used the MMS algorithm to assign symbols to the OR repertoires of the mouse, rat, dog, cow, horse, chimpanzee and orangutan (Table 1). Across all 7 mammalian species, the OR genes and pseudogenes were classified into 18 families and 681 subfamilies (Additional file 1, Additional file 2).
The fraction of genes and pseudogenes assigned as putative orthologs of a human gene (including those with the B/C/D suffix) varies from 83.9% in chimpanzee to ~27% in rodents (Additional file 3: Fig. S1). These numbers are in line with the literature and reflect the rapid evolution of the OR gene family [8, 18], including the differences between human and chimpanzee OR repertoires [25]. As expected, genes whose symbols are shared among more than one mammal contain a significantly higher fraction of intact ORs than genes that were assigned as novel subfamily members (Additional file 3: Fig. S2).
To test whether genes assigned as best and second best hits to a human OR gene also lie in syntenic regions, we analyzed mammalian multiz genome alignments [26]. We found that 88% of the mutual best hits and 62% of the second best hits aligned to the exact genomic location of the corresponding human ortholog, and that 96% of the mutual best hits and 82% of the second best hits aligned within a distance of < 100 kb. Thus, most of the ORs that were classified by the MMS algorithm as putative orthologs also reside in the approximate expected syntenic location. We note that, due to the rapid evolution of the OR gene family, synteny is not always expected to be preserved among orthologs.
| Species | Genes | Pseudogenes | Total | Genome |
|---|---|---|---|---|
| chimpanzee | 396 | 427 | 823 | PanTro4 |
| orangutan | 321 | 466 | 787 | PonAbe2 |
| dog | 803 | 206 | 1009 | CanFam3 |
| cow | 1110 | 695 | 1805 | BosTau8 |
| horse | 1101 | 1372 | 2473 | EquCab2 |
| mouse | 1142 | 247 | 1389 | Mm10 |
| rat | 1333 | 457 | 1790 | Rn6 |
| zebrafish | 158 | 14 | 172 | DanRer10 |
| total | 6365 | 3884 | 10249 | |

Table 1
We used phylogenetic analysis to assess the accuracy of the nomenclature assignment. Although phylogenetic analysis in general cannot fully resolve the relationships within the OR superfamily due to low bootstrap values [27], this is possible within subfamilies, as shown in Fig. 2A. The relationships between the genes in subfamily 10D can immediately be recognized from the phylogenetic tree as well as from the symbols (Fig. 2A). These relationships would not be easily detected using, for example, the current approved mouse gene nomenclature (Fig. 2B). For each of the 50 largest OR subfamilies, representing 45% of the total number of ORs in these species, we generated Maximum Likelihood and Bayesian phylogenies (Additional files 4 and 5). We then compared the within-subfamily groupings in these phylogenies against the hierarchical naming results. We found that the nomenclature assignments were largely congruent with the phylogenetic groupings of ORs within a subfamily, and 195 exceptions for which there was strong phylogenetic support for reclassification were renamed manually (see Additional files 4 and 5). We note that the majority of the ORs that were renamed manually have similarity scores just short of a cut-off that would have led them to be named in line with the phylogeny. Although it is inevitable that rigid similarity cut-offs will occasionally result in this type of classification issue, we provide evidence that the vast majority of OR genes are classified in agreement with phylogenetic analysis. We manually updated a very small proportion of the total OR symbols in this study to ensure that the classification is as accurate as possible in this initial cohort of species. This will ensure that when more species’ OR repertoires are classified using the MMS algorithm, the set of ORs that they are compared to is classified correctly. We performed phylogenetic analysis only within subfamilies, and not in larger family groupings, because it is well established in the literature that, due to rapid evolution within the OR superfamily, deeper nodes in OR phylogenies are extremely challenging to resolve with confidence [27].
Given the relative difficulty of assigning deeper family and subfamily relationships using phylogenetic analysis, the MMS method is preferable to phylogenetic classification: it rapidly classifies ORs into families with a reproducible and consistent methodology. Nevertheless, as this work is ongoing, we note that updates to the genome assemblies of the studied organisms, and future refinements to the MMS algorithm, may result in minor changes to some symbols prior to final approval by nomenclature committees.
The overall congruence of our MMS nomenclature assignment algorithm with both synteny comparison across species and phylogenetic analyses within OR subfamilies strongly supports our methodology for easily, rapidly and accurately classifying newly identified OR repertoires in a given species.
Systematic gene family nomenclature as a tool for evolutionary studies
The unified nomenclature system presented here provides a framework for evolutionary studies of vertebrate ORs. We used our subfamily classification to calculate the Pearson correlation coefficients between the classified OR repertoires of each pair of species, using the number of members of each subfamily (Fig. 3, Additional file 2). The result is in line with evolutionary expectations: closely related species, namely primates (human, chimpanzee, orangutan), Laurasiatheria (dog and cow), and rodents (mouse and rat), cluster together. We observed that the human and chimpanzee subfamily classifications are closer (Pearson correlation 0.98) than those of mouse and rat (Pearson correlation 0.90). This is expected given that human and chimpanzee diverged ~6-12 million years ago [28, 29], whereas mouse and rat diverged ~12-24 million years ago [30].
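The correlation described above can be sketched as follows. This is an illustrative sketch only: the function names and the per-species dictionaries of subfamily counts are hypothetical, the toy counts are not data from this study, and treating a subfamily that is absent from one species as a count of zero is an assumption.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def repertoire_correlation(counts_a, counts_b):
    """Correlate two repertoires given {subfamily: member count} dictionaries.

    Subfamilies missing from one species are counted as 0 (assumption).
    """
    subfamilies = sorted(set(counts_a) | set(counts_b))
    x = [counts_a.get(s, 0) for s in subfamilies]
    y = [counts_b.get(s, 0) for s in subfamilies]
    return pearson(x, y)


# Toy subfamily counts (hypothetical values, for illustration only).
species_a = {"4D": 2, "6Z": 4, "10D": 6}
species_b = {"4D": 1, "6Z": 2, "10D": 3}
species_c = {"4D": 6, "6Z": 4, "10D": 2}

r_ab = repertoire_correlation(species_a, species_b)  # proportional counts
r_ac = repertoire_correlation(species_a, species_c)  # reversed trend
```

Computing this coefficient for every pair of species yields a symmetric matrix that can then be clustered, which is one way to arrive at a view like the one described for Fig. 3.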
This nomenclature classification also allows the immediate identification of species-specific expansions and deletions. For example, in the species studied the OR4D subfamily has up to 18 members, except in horse, where it has 57. Horse OR4D genes are found within 18 clusters, the largest of which contains 17 genes. In total, we identified 147 subfamilies for which the gene count in one of the studied organisms was at least double that in the other organisms, presumably representing species-specific expansions (Additional file 6: Table S1A). The OR subfamily 6Z is entirely absent from the primate genomes analyzed, while in other mammals members of this subfamily lie in a single genomic cluster, suggesting a large deletion in the primate lineage. Ten other mammalian OR subfamilies were not identified in the primate genomes studied, of which six are encoded in a single genomic cluster in the mammalian genome, as is subfamily 6Z (Additional file 6: Table S1B). An additional 14 subfamilies were not identified in the rodents, of which eight are encoded in a single genomic cluster (Additional file 6: Table S1B).
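The "at least double" criterion can be sketched as a simple filter over per-subfamily gene counts. This is a hypothetical illustration: the function name, the nested-dictionary layout, and the reading of "the other organisms" as the second-highest count are assumptions. Only the horse (57) and maximum non-horse (18) OR4D counts come from the text; the remaining numbers and the subfamily label "XX" are invented for the example.

```python
def species_specific_expansions(counts, fold=2.0):
    """Return {subfamily: species} where one species has at least `fold` times
    as many members as any other species.

    counts: {subfamily: {species: gene count}} (hypothetical layout).
    Note: a subfamily present in only one of several species also passes this
    test, because the runner-up count is 0 (assumption of this sketch).
    """
    expansions = {}
    for subfamily, per_species in counts.items():
        ranked = sorted(per_species.items(), key=lambda kv: kv[1], reverse=True)
        if len(ranked) < 2:
            continue  # nothing to compare against
        (top_species, top), (_, runner_up) = ranked[0], ranked[1]
        if top > 0 and top >= fold * runner_up:
            expansions[subfamily] = top_species
    return expansions


toy_counts = {
    # 57 (horse) and 18 (maximum elsewhere) are from the text; "dog": 10 is made up.
    "4D": {"horse": 57, "cow": 18, "dog": 10},
    # Entirely hypothetical subfamily with no expansion.
    "XX": {"horse": 5, "cow": 4, "dog": 4},
}
toy_expansions = species_specific_expansions(toy_counts)
```

Here `toy_expansions` flags only subfamily 4D, attributed to horse, since 57 is more than twice 18.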
We further used the assigned OR symbols of human, chimpanzee and orangutan to perform a three-way repertoire comparison (Fig. 4). This analysis identified 437 symbols (51.1% of the human ORs) that are shared among all 3 apes, with a significantly higher representation of class I (“fishlike”, OR families 51-56) OR genes (p = 4e-6, chi-square test); 215 (29.6%) symbols are shared only between human and chimpanzee, and 55 symbols are shared only between human and orangutan (Fig. 4). Thus, the use of our nomenclature shows that, despite the similarity in OR repertoire size and in pseudogene content within primates, the gene content differs, in line with previously published findings [25].
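Because orthologs share a symbol under this nomenclature, a three-way comparison of this kind reduces to set operations on symbol lists. The sketch below is illustrative: the function name and the toy symbol sets are hypothetical and do not reflect the actual repertoires.

```python
def three_way_comparison(human, chimpanzee, orangutan):
    """Partition human OR symbols by which of the other two species share them.

    Each argument is a set of gene symbols (hypothetical input format).
    """
    return {
        "all_three": human & chimpanzee & orangutan,
        "human_chimpanzee_only": (human & chimpanzee) - orangutan,
        "human_orangutan_only": (human & orangutan) - chimpanzee,
        "human_only": human - chimpanzee - orangutan,
    }


# Toy symbol sets (hypothetical membership, for illustration only).
human_symbols = {"OR51A1", "OR9A4", "OR4D1", "OR10D3"}
chimpanzee_symbols = {"OR51A1", "OR9A4", "OR6Z1"}
orangutan_symbols = {"OR51A1", "OR4D1"}

shared = three_way_comparison(human_symbols, chimpanzee_symbols, orangutan_symbols)
```

The four categories are disjoint and together cover every human symbol, so their sizes can be reported directly as counts or as percentages of the human repertoire.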
Classification of zebrafish ORs
In an effort to extend this work to non-mammalian species, we classified the OR repertoire of Danio rerio (zebrafish), a popular model organism in studies of the olfactory system [31, 32, 33]. Previous publications [19, 20] identified a repertoire of ~140 OR genes in the zebrafish genome which, although smaller than mammalian OR repertoires, shows greater intra-species sequence diversity than is seen in mammals. As mentioned previously, the study in [19] also proposed a nomenclature for zebrafish ORs that is based on phylogenetic classification and groups the zebrafish ORs into classes rather than into families and subfamilies.
We used version GRCz10 of the zebrafish genome to identify an updated repertoire of 172 zebrafish OR genes, including 13 pseudogenes. We then used the MMS algorithm to assign symbols to each of the zebrafish genes. However, as zebrafish ORs are very distant from mammalian ORs and are expected to be classified into different OR families, we first manually named zebrafish-specific OR family representatives, which were added to the library of classified ORs (hierarchical clustering, Additional file 3: Fig. S3). Importantly, the OR family numbers were selected to fit the classes of [19], with a distinct set of family numbers for every class (Additional file 3: Table S2). We then proceeded with the symbol assignment process by applying the MMS algorithm with the same cutoff criteria that we used for mammals (see Methods). The zebrafish OR repertoire was classified into 20 families, of which 2 are shared with mammals (OR families 6 and 55) (Fig. 5). We note that the similarity to mammalian ORs could not be detected using the nomenclature suggested in [19].