The current distribution of recently-recruited genes among chromosomes appears increasingly non-random along the evolutionary series from the tetrapoda to the primata

Background: The present availability of full genome sequences of a broad range of animal species across the whole range of evolutionary history enables one to ask questions as to the distribution of genes across the chromosomes. Do newly recruited genes, clade by clade, distribute at random or at non-random locations? Results: We extracted values for the “consensus” ages of the human genes and for their current chromosome locations, from published sources. A quantitative analysis showed that the distribution of newly-added genes among and within the chromosomes appears to be increasingly non-random if one observes animals along the evolutionary series from the tetrapoda through to the great apes, whereas the oldest genes are randomly distributed. Conclusions: Randomization will result from chromosome evolution, but less and less time is available for this process as evolution proceeds. Much of the bunching of recently-added genes arises from new gene formation in gene families, near the location of genes that were recruited in the preceding phylostratum. As examples we cite the KRTAP, ZNF, OR and some minor gene families. We show that bunching can also result from the evolution of the chromosomes themselves when, as for the KRTAP genes, blocks of genes that had previously been on disparate chromosomes become linked together.


Background
The study of human genome evolution using orthologous genes (or orthologs) has been much furthered by Domazet-Loso and Tautz (1), (2), (3), whose pioneering phylostratigraphic approach was based on the cladistic description of evolution. A clade can be defined as "a group of organisms that share a common evolutionary history, and are closely related, more so to members of the same group than to other organisms. These groups are recognized by sharing unique features which were not present in distant ancestors". In the Domazet-Loso and Tautz (2) formulation there are 19 successive clades that have emerged during evolution from the first living organisms to modern humans. All animals that arise in the evolutionary record between the beginning of a clade and the onset of the next clade are defined as belonging to a single phylostratum level, denoted by a number from 1 to 19. (See Table 1 at the end of this paragraph). Table 1 is based on Domazet-Loso and Tautz (2), who defined the 19 phylostrata into which they divided the genes of the organisms of the living world. (The phylostratum level concept is additionally helpful when thinking about evolution, since it has a numerical value that ranks the different levels of organismic evolution. Once one has learned that, say, the tunicates are in phylostratum level 10, while the lancelets are in phylostratum level 9, one can immediately know which preceded which in the evolutionary process). In the original 2007 formulation of Domazet-Loso and Tautz (1), a phylostratum was defined as "a set of genes from an organism that coalesce to founder genes having common phylogenetic origin." These then are the newly-appearing genes that define the relevant phylostratum level. In keeping with the original definition, we will refer to genes as being in a phylostratum, while the animals in which those genes first appeared are in the corresponding phylostratum level.
The aim of a phylostratigraphic analysis of the genes of a particular organism is to ascertain for each gene X the phylostratum in which it first appeared. To do this, one finds, among life's organisms, the set of orthologs of gene X (these being those homologs of X which can be related to it by linear descent). One ranks this ortholog set by the ages of the species in which each ortholog is found. The earliest phylostratum level represented in the set is then taken as the age of gene X. (Arendsee and colleagues present a program for performing phylostratigraphic analyses) (4).
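The age-assignment rule can be illustrated with a minimal sketch. It assumes that the age of a gene is the lowest-numbered phylostratum level among the species in which an ortholog is found. The function name `gene_age` and the small `SPECIES_LEVEL` table are hypothetical; the levels shown are only those mentioned in this paper, and the sketch is not the program of reference (4).

```python
from typing import Dict, Iterable, Optional

# Illustrative species -> phylostratum level table (hypothetical helper;
# the levels are those quoted in the text of this paper).
SPECIES_LEVEL: Dict[str, float] = {
    "lancelet": 9,
    "tunicate": 10,
    "zebrafish": 12,
    "chicken": 14,
    "macaque": 19.1,
    "human": 19.2,
}


def gene_age(ortholog_species: Iterable[str],
             species_level: Dict[str, float] = SPECIES_LEVEL) -> Optional[float]:
    """Return the earliest phylostratum level at which an ortholog is found.

    Assumption: the gene's age is the minimum level over all species that
    carry an ortholog. Returns None when no ortholog species is supplied.
    """
    levels = [species_level[s] for s in ortholog_species]
    return min(levels) if levels else None
```

For example, a gene with orthologs in human, macaque and chicken would be assigned to phylostratum 14 under this rule, whereas a gene found only in human would be assigned to 19.2.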
Applying the approach of Domazet-Loso and Tautz, we were able, for almost all of the 19,781 currently annotated protein-coding genes of the human genome, to find the consensus age of the gene (defined as the phylostratum level during which this individual gene was added to the evolving human genome) (5).
Such ages have been used to study the evolution of biological processes (6). Here we ask: Do we find the recently-added genes to be distributed at random among the chromosomes or are such genes bunched?
The two major routes by which new genes arise are, first, by the duplication of an existing gene, followed by their divergent evolution and, second, by the conversion of a non-coding sequence into a coding sequence to form a so-called "orphan" gene. We have found that along the vertebrates' evolutionary path, from the tetrapoda through to the great apes, the distribution of newly-recruited genes among the chromosomes is observed to be less and less random. We suggest a mechanism that could in part have driven this phenomenon: many newly recruited genes preferentially distribute as family members to where their older homologues had earlier arisen. As examples of this process, we consider the Zinc Finger (ZNF), the keratin-associated protein (KRTAP) and olfactory receptor (OR) gene families, and some minor families.

Results
We asked if genes that first appeared in a particular phylostratum were always distributed at random across the chromosomes of living animals, or whether there were particular phylostrata whose genes are distributed preferentially to a specific chromosome. In Figure 1, the leftmost figure shows the % of "great ape" genes (phylostratum 19.2); the middle figure shows the "fish" genes (phylostratum 12), in both cases as they are distributed across the autosomal human chromosomes, while the rightmost figure shows the phylostratum 12 genes of the zebrafish (Danio rerio) which, being a jawed fish, appears itself in phylostratum 12. The data that we report are, across the chromosomes of the specified animal species, the number of genes recruited in a particular phylostratum as a percentage of the total number of genes present on each chromosome, normalized by being divided by the median across that phylostratum. It can be seen, comparing the leftmost and middle figures, that the genes that were acquired with the appearance of the great apes (with a MAD +/-SEM value of 0.38+/-0.09, n=22) are far more unevenly distributed among the human chromosomes than are the genes that were acquired with the emergence of the fish (0.05+/-0.01, n=22). The difference is significant at P<0.05 (Holm-Sidak Pairwise Multiple Comparison). Comparing the middle and right figures of Figure 1 shows that the phylostratum 12 genes are, similarly, evenly distributed also among the chromosomes of the fish itself (where the MAD +/-SEM has a value of 0.08+/-0.02, n=25), the difference between these two distributions being non-significant using the same statistical test. The spread between the 75% and 25% limits as a function of the median in each case is 0.13 for the "fish" genes in the human genome while being 0.73 for the "great ape" genes.
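The unevenness measure can be sketched as follows. This is an illustration under stated assumptions rather than the computation actually used for the figures. It takes MAD to be the mean absolute deviation about the median, divided by that median, and it takes the per-chromosome value to be the percentage of a chromosome's genes that belong to the phylostratum. The function name `normalized_mad` and its arguments are hypothetical.

```python
from statistics import median
from typing import Sequence


def normalized_mad(stratum_counts: Sequence[int],
                   total_counts: Sequence[int]) -> float:
    """Mean absolute deviation about the median, divided by the median.

    stratum_counts[i]: genes of one phylostratum on chromosome i.
    total_counts[i]:   all genes on chromosome i.
    A value near 0 indicates an even spread across the chromosomes.
    """
    if len(stratum_counts) != len(total_counts) or not stratum_counts:
        raise ValueError("need one stratum count and one total per chromosome")
    # Percentage of each chromosome's genes recruited in this phylostratum.
    pct = [100.0 * s / t for s, t in zip(stratum_counts, total_counts)]
    med = median(pct)
    if med == 0:
        raise ValueError("median is zero; the ratio is undefined")
    mean_abs_dev = sum(abs(p - med) for p in pct) / len(pct)
    return mean_abs_dev / med
```

With three hypothetical chromosomes of 100 genes each, carrying 10, 20 and 30 genes of a phylostratum, the percentages are 10, 20 and 30, the median is 20, and the measure is (10 + 0 + 10) / 3 / 20, or about 0.33. Identical percentages on every chromosome give 0.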
The two chromosomes with the highest fraction of phylostratum 19.2 genes (left figure, these being chromosomes 21 and 19) are no longer the most enriched when the phylostratum 12 genes are considered (middle figure, where chromosome 21 is now 7th from the top and chromosome 19, 13th).
Extending this same approach, we studied the random/non-random distribution of newly appearing genes, phylostratum by phylostratum, across the full evolutionary trajectory for eight animal species chosen from a range of phylostrata. Figure 2 shows the results of this analysis, plotted as the Mean Absolute Deviation about the Median (MAD; see Methods), divided by the medians, against phylostratum number for these eight animal genomes, each animal species being shown by a different symbol. (The red circles in the figure denote the data for the human genome from which we removed all the genes that belong to a number of gene families. These results will be considered later in the paper).
The raw data and computations upon which this figure is built are presented as Tables S2 through S10 in the Supplementary Materials. Note the structure of this plot: There are three data points (other than the red circle) for phylostratum 19.2, depicting the genes added in phylostratum 19.2 for the three great apes that we have analysed. Phylostratum 19.1 shows four data points, the genes that were added in phylostratum 19.1 for the three great ape species plus the data point for the macaque monkey (Macaca mulatta) that represents for us phylostratum 19.1. As successively earlier phylostrata are considered, further data points are added at each earlier phylostratum until, from phylostratum 12 and earlier, all eight points appear at each phylostratum number, one from each of the eight species that we have used. We performed the analysis using only the autosomal genes since we were concerned that the intense evolution of, especially, the Y chromosome in the more recent phyla might affect the results. (The analysis using the whole chromosome complement is very similar to Figure 2 (data not shown)). The horizontal lines drawn are the median, and the 25% and 75% limits, computed for all the data through to phylostratum 12, the euteleostomii (the jawed fish).
Using a mixed-effects model (from the Non Linear Mixed Effect package of R), accounting for the repeated measure element of retesting species, the MAD values for the tetrapoda (phylostratum 13) do not differ significantly (at P= 0.3711) from those for the combined phylostrata 1 through 12, but significance holds (at P< 0.0001) for all of the more recent phylostrata. The values for the olfactores at phylostratum 10 lie significantly above the data for the combined phylostrata 1 through 12 (P< 0.0001).
This is in spite of the extensive chromosomal rearrangements that have taken place since the time that the olfactores appeared (see the Discussion for more on such chromosomal rearrangements). The contribution of the olfactores to the human genome is only 85 genes out of more than 19,000. With such a small sample number, their distribution among the 22 to 25 (in different species) autosomal chromosomes might be expected to be somewhat uneven, and variable from species to species. Indeed, the correlation coefficient between the human and macaque data for phylostratum 10 genes is not significant (P=0.57), whereas that for phylostratum 19.1 between these two species is significant at P<0.0001. If these outlying olfactores data are excluded from the combined phylostrata 1 through 12, the MAD values for the tetrapoda are now significantly different from those of the combined phylostrata 1 through 12 at P=0.024.
It would appear from these data that during the evolution of the vertebrates, from the amniota (and possibly the tetrapoda) through to the great apes, the distribution of newly recruited genes across the chromosomes is increasingly non-random. An important message from this plot is this: If one looks at the MAD value for the genes that emerged with, for example, the amniotes (at phylostratum 14), that value does not differ much whether those "amniote" genes are seen in the chromosomes of the chicken itself, or in those of the chimpanzee, macaque, or dog. A distribution of that particular degree of scatter is found in the chromosomes of the amniotes through to the great apes, although the chromosomes themselves have evolved and today vary so much between the species in size and number.
We tested whether the data could perhaps best be described by two straight lines, one horizontal and the other with a delayed slope, an ascending function of phylostratum age. An appendix in the Supplementary Methods provides the results of such an analysis. The delayed slope model was significantly superior to a single slope (p<0.001).
We wondered whether the increasing patchiness seen for the newly recruited genes might in part arise from a preferred localisation to those chromosomes that had preferentially recruited genes during the immediately previous phylostratum. Table 2 records as a matrix the Spearman rank order correlations between the distributions of newly recruited genes across all the chromosomes, phylostratum by phylostratum. Phylostratum names, as rows and columns, are in bold. The correlation between the chromosome distribution of the genes of any phylostratum X and the distribution of those of the succeeding phylostrata is given as the point of intersection between the rows and columns of the matrix, and is displayed as the correlation coefficient r and, directly below this, the corresponding probability P. All correlations that have P<0.05 are in bold type. It will be noticed that most of the highest correlations are between successive phylostrata. These are significant except between phylostrata 18 and 19.1, where the correlation is below significance, although the correlation between phylostrata 18 and 19.2 is significant (as is that between 19.1 and 19.2). Table S11 in the Supplementary Materials shows the similar results, again for Homo sapiens, but now using the restricted set of more consistent data for which 3 or more ortholog databases agreed with the modal value. We extended these between-phylostrata correlations to include a number of mammalian species. The full data set can be found as Table S12 of the Supplementary Materials.
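A between-phylostrata correlation matrix of this kind can be sketched with the standard library alone. The sketch assumes that each phylostratum is represented by one value per chromosome, listed in the same chromosome order. It computes Spearman's r as the Pearson correlation of average ranks and does not compute the probability P. The function names are hypothetical.

```python
from math import sqrt
from typing import Dict, List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank order correlation (Pearson correlation of the ranks)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equally long sequences of length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("a constant sequence has no rank correlation")
    return cov / sqrt(vx * vy)


def correlation_matrix(
    profiles: Dict[str, Sequence[float]]
) -> Dict[Tuple[str, str], float]:
    """Upper triangle of pairwise correlations between phylostratum profiles."""
    names = list(profiles)
    return {
        (a, b): spearman_r(profiles[a], profiles[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }
```

Only the upper triangle is returned, which matches a matrix in which each phylostratum is compared with the succeeding ones.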
We had noted in Figure 1 that phylostratum 19.2 has the highest percentage of newly recruited genes on chromosome 19. We asked whether the location of newly recruited genes along chromosome 19 itself might correlate with the location of the genes recruited in the previous phylostratum. To test this, we divided chromosome 19 into twenty successive equal sections of gene counts (this gave 70 genes in each section except for the 69 genes that remained for the twentieth section). In each section, we computed the proportion of genes from any phylostratum X in chromosome 19 as a percentage of the genes of phylostratum X in the whole genome. We then performed Spearman rank order phylostratum to phylostratum correlations between the gene distributions along the twenty successive sections across chromosome 19. We asked whether the section by section content of phylostratum X genes along the chromosome correlated with the distribution of the succeeding phylostrata. The part of the data that shows significant correlations is depicted in Table 3. The table records as a matrix the Spearman rank order correlations between the distribution of newly recruited genes across the twenty sections of chromosome 19, phylostratum by phylostratum. Phylostratum names appear as the rows and columns of the table. The correlation between any two phylostrata is given as the point of intersection between the rows and columns of the matrix, and is displayed as the correlation coefficient r and, below this, the corresponding probability P. All correlations that have P<0.05 are in bold type; the single correlation with P=0.051 is in italics. We wanted to find out if there was a particular region of chromosome 19 at which these new gene additions, phylostratum to previous phylostratum, occurred. We used a heat map showing the gene content of the successive 20 sections of chromosome 19, comparing successive phylostrata, to provide the answer.
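The sectioning step can be sketched as follows. The sketch assumes that the genes are supplied in their order along the chromosome and that every section except the last receives the same number of genes, with the remainder falling in the final section. The function names, the `gene_stratum` mapping and the `genome_total` argument are hypothetical.

```python
from math import ceil
from typing import Dict, List, Sequence


def split_into_sections(ordered_genes: Sequence[str],
                        n_sections: int = 20) -> List[List[str]]:
    """Cut an ordered gene list into n_sections successive blocks.

    Each block holds ceil(N / n_sections) genes, except that the last
    block(s) hold whatever remains.
    """
    size = ceil(len(ordered_genes) / n_sections)
    return [list(ordered_genes[i * size:(i + 1) * size])
            for i in range(n_sections)]


def section_percentages(sections: List[List[str]],
                        gene_stratum: Dict[str, float],
                        stratum: float,
                        genome_total: int) -> List[float]:
    """Per section, genes of `stratum` as a % of that stratum genome-wide.

    genome_total is the number of genes of `stratum` in the whole genome.
    """
    return [
        100.0 * sum(1 for g in sec if gene_stratum.get(g) == stratum)
        / genome_total
        for sec in sections
    ]
```

A list of 1399 placeholder genes, for instance, yields nineteen sections of 70 genes and a twentieth of 69, which is the pattern described above. The resulting per-section percentages for two phylostrata can then be compared by a rank correlation.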
The heat map (built as a percentage of the genes of a particular phylostratum, in a particular section, to all the genes of that phylostratum in the entire genome) is depicted in the upper part of Figure 3.
The Zinc Finger (ZNF) genes of chromosome 19. It would appear that the distal half of the q portion of chromosome 19 (the most distal four sections in particular) is the richest in genes that originated in recent phylostrata (17 through 19.2), those that showed the highest between-phylostrata correlations in chromosomal distributions (Table 3). Section 20 (at the distal section of the q arm of the chromosome) appears on Figure 3 as a section with a high content of both phylostrata 17 and 19.1. This section contains a high proportion of zinc finger (ZNF) genes, these being 65% of all the genes in this section of chromosome 19. Chromosome 19 contains a high proportion of ZNF genes. Indeed, of its 1396 protein-coding genes 248 or almost 18% are ZNF genes. Figure 4A below depicts the ZNF genes of the human genome as a function of phylostratum number and chromosome location.
As can be seen, the ZNF genes are, in general, recent, with most of them having been recruited in phylostrata 17 and 19 (the hoofed and pawed animals and the primates, respectively). It is apparent, too, that chromosome 19 is to a very large extent the preferred location for these genes. We have seen that it is the most distal section of the q arm of this chromosome that is especially preferred. The distribution across chromosome 19 of the ZNF genes as a percentage of all the ZNF genes of the genome was depicted as a heat map in the lower half of Figure 4. The map to map comparison is striking. The location of the successive cohorts of newly recruited ZNF genes appears, in many cases, to be coordinated. All of the phylostratum 19.2 genes are located at the sites where 19.1 genes had been formed, and many of the phylostratum 19.1 genes locate close to phylostratum 17 genes.
Keratin-associated protein (KRTAP) genes. We looked for a second gene family that could be involved in between-phylostrata relations. In Figure 1, the chromosome with the highest proportion of phylostratum 19.2 genes is chromosome 21. Now chromosome 21 is rich in genes from the KRTAP (Keratin Associated Protein) family. Figure 4B depicts the distribution of the KRTAP family genes by phylostratum age and by chromosomal location. Over 50% of these genes are located on chromosome 21 and a very high proportion of them are recent genes, largely arising with the mammals. The KRTAP genes are associated with the evolution and development of hair, a mammalian innovation (7). The KRTAP gene family can be divided into numerous sub-families of shared evolutionary history (8), and Supplementary Figure S2 depicts these sub-families as they are situated on chromosome 21.
The Olfactory Receptor (OR) genes. As an additional gene family we chose the OR genes, the olfactory receptor genes. There are 406 of these in the human genome. Figure 4C shows their distribution by age and by chromosome location. Chromosome 11 is by far the richest bearer of the OR genes. Most of the OR genes were recruited with the first mammals, next highest being the hoofed and pawed animals of phylostratum 17. Supplementary Figure S3 shows the location of the OR genes along human chromosome 11. The OR genes are indeed non-randomly distributed, being present in two major complexes, with a few other solitary cases. The major complex, in the proximal region between 4.5 and 6 Mb, is centered around the genes that were recruited in phylostratum 12, with the genes from phylostrata 15, 17, 18 and 19 being close by. Many of the more recent genes of chromosome 11, these being for the most part mammalian genes, were contributed by the Olfactory receptor (OR) gene family. Our prior familiarity with the ZNF, KRTAP, and OR gene families led to their being chosen as convenient samples of the class.
Increase in randomness with evolutionary time after excluding the gene families.
In addition to the three major gene families, the ZNF, KRTAP, and OR families, there are a number of minor gene families that we surmised might contribute to the non-randomness of gene distributions among the chromosomes. These minor families include the TRAV family of 45 genes, all located on chromosome 14, the TRBV family of 37 members on chromosome 7, the KIR2D and KIR3D families with 45 genes on chromosome 19, the LILR family with 33 genes on chromosome 19, and the NLRP family of 15 genes, 9 of which are located on chromosome 19. These latter genes are associated with the adaptive immune response which began to evolve with the origin of the jawed fish (9). In addition, we took account of the PCDH gene family of 66 members, 55 of which are on chromosome 5. This list consists of the genes in those families that contain ten or more members and comprises 923 genes.
The genes that were incorporated into the evolving human genome as members of a gene family did so at much later epochs than those that were incorporated as an individual. Table S13 in the Supplementary Materials lists these two classes of genes together with their consensus ages while Figures S4 and S5 in the Supplementary Materials depict these data plotted as the percentage of genes of that class (the set incorporated into families or the set incorporated as individuals) as a function of consensus age (depicted as phylostratum number in S4 or as thousands of years before present in S5). The two distributions, analysed by the Mann-Whitney Rank Sum Test (with medians of 6 and 15.2 in consensus ages) were statistically different at p<0.001. It would appear that, from the tetrapoda onwards, comparing the two sets, the fraction of newly-added genes that were incorporated into gene families did so significantly later than the fraction of those incorporated as individuals. At each phylostratum, the fraction of genes incorporated into a family as a proportion of all genes so incorporated increases linearly with phylostratum number (p= 0.010) while, in contrast, the fraction incorporated as individuals shows an insignificant decrease.
We wondered whether excluding all the genes that were incorporated into families might diminish the bunching of newly added genes that we had seen in Figure 2 and thus asked what the MAD versus phylostratum plot might look like after excluding these major and minor gene families. The red circles in Figure 2S in the Supplementary materials depict a plot of the uneven distribution of genes across the chromosomes of the human genome, calculated again as the appropriate MAD values, but now after excluding 923 genes, this being the total of the genes contributed by all the above-cited families. The open red triangles depict a control in which we excluded a random sample of 923 of the genes of the human genome and calculated again the appropriate MAD values.
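The two exclusions can be sketched as a pair of filters. The sketch assumes that each gene has a unique identifier and that family membership is available as a mapping from gene to family name. The function names, the `family_of` mapping and the fixed random seed are hypothetical choices made for reproducibility of the illustration, not details taken from the analysis itself.

```python
import random
from typing import Dict, List, Sequence, Set


def exclude_families(genes: Sequence[str],
                     family_of: Dict[str, str],
                     families: Set[str]) -> List[str]:
    """Drop every gene that belongs to one of the listed families."""
    return [g for g in genes if family_of.get(g) not in families]


def exclude_random(genes: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Control: drop k genes chosen at random (seeded for repeatability)."""
    rng = random.Random(seed)
    dropped = set(rng.sample(list(genes), k))
    return [g for g in genes if g not in dropped]
```

The MAD values would then be recomputed on each filtered gene list and compared with those of the full genome. Matching the size of the random sample to the number of family genes removed (923 in the text) keeps the control comparable.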
The data points computed for the human genome after excluding these gene families differ little from the other data points until the tetrapoda are reached, but then the data for the full genomes deviate increasingly upwards. This parallels the increasing fraction of genes that are in gene families, which begins to deviate upwards at much the same time period (Figure S4). The MAD values for the full human genome, for genes that were added from the tetrapoda and later, are significantly different (P=0.001, t-test, N=7) from the data where the gene families are excluded. The MAD values for the full human genome are not significantly different (P=0.301, t-test, N=17) from the data where 923 genes were excluded at random. By extending the Non Linear Mixed Effect regression model to include the set with 923 randomly-excluded genes, this set could be shown to be significantly different (P=0.0017) from the data set in which the gene families were excluded. The MAD values where the gene families are excluded remain, from the amniota onwards, significantly different (at P = 0.02, Mann-Whitney Rank Sum Test) from the data set of the MAD values through the jawed fish at phylostratum 12. This residual variance, which does not seem to arise from bunching by the previously-listed gene families, will be considered further in the discussion.
We wondered whether or not it would be useful to consider, during a particular phylostratum, the genes that were added to families already in the evolving genome separately from those genes that were added as non-related individuals. To that end we chose all the 874 genes added as the primate clade evolved and divided these into the 417 genes that appear in the gene families listed previously (Supplementary Table S14, worksheet 2) and the remaining 457 that were added as individuals (Supplementary Table S14, worksheet 3). Figure S6A in the Supplementary Materials shows that those genes that were recruited to the evolving primate genome as members of an existing gene family are very unevenly distributed among the chromosomes, with a MAD value of 0.52, whereas primate genes that were added as individuals were more evenly distributed, with a MAD value of 0.26 (Figure S6B).

Discussion
We used a recently published database (5) to ascertain the age of the protein-coding genes of the human genome (the age being determined by the phylostratum in which the earliest ortholog of the gene first appeared). These age estimates are probably the most reliable currently available, with over 97% of the ages listed being agreed upon by three or more of the ortholog databases that were the source of the age estimates. We emphasize that the approach of using orthologs to find the ages of the genes that exist in gene families does not report the single founder gene of the family but rather the earliest phylostratum at which the individual members of the family first appeared. The chromosome locations we used were drawn from the accepted source. Integrating these two sources of information, we determined the age distribution of genes across the human chromosomes. We found that genes that were added more and more recently to the evolving human genome appeared to be less and less randomly distributed in the current human chromosome complement (Fig. 2). We showed that this phenomenon (the appearance of an increasing non-random distribution) applied also to a number of other animal species along the evolutionary path from the fish to the human. At each evolutionary level, the measure of randomness of the chromosomal distribution of genes first identified in that clade was little different between an animal in that clade and those same genes when studied in animals from more recent clades.
We searched for mechanisms that might account for this phenomenon. A major effect is probably the extensive chromosomal rearrangements that have occurred over the course of evolution (10). In whatever pattern (bunched or scattered) newly-added genes were distributed in the earliest evolutionary epochs, chromosomal evolution would be likely to have randomized them by now. But on the evolutionary trajectory from the tetrapoda to the primata less and less time has been available for such chromosomal reshuffling. The current pattern of gene distribution will therefore increasingly resemble the initially-formed pattern as we approach the current epoch.
There is, moreover, a second, determining process that contributes to the observed bunching of recently added genes. An increasing portion of the genes added since the tetrapoda have been genes that were added to existing gene families (Figure S5). As we saw in Figures 4A, 4B, and 4C, the size of three such gene families (the ZNF (zinc finger), the KRTAP (keratin-associated protein), and the OR (olfactory receptor) genes) increases with evolutionary time and their distribution among the human chromosomes is uneven. Many of the newly recruited genes will have been formed in situ in their new location by mutation after gene duplication. Often, the tandem gene duplication will have arisen through unequal crossing over and hence, by definition, be localized on a particular chromosome, and thus be located within the family, bringing about bunching. Genes incorporated into gene families form an increasing proportion of newly added genes as evolution proceeds. When genes that are members of gene families are removed from our analysis, there is an almost complete elimination of the uneven distribution of genes added in the more recent phylostrata (Figure 2, large red circles). One can eliminate the contribution of reshuffling over the ages by considering only the most recently-added genes, the primate genes. Those primate genes that were added into gene families are far more unevenly distributed among the chromosomes than are those primate genes that were incorporated as individuals (Figures S4 and S5 in the Supplementary materials). Thus, it is genes joining existing gene families that provide the basis for the phenomenon of the uneven gene distributions, coupled with the fact that insufficient time has been available for their reshuffling by chromosomal rearrangements. Even after the genes from the listed gene families are removed, there remains, however, a small, but statistically-supported degree of non-randomness of the recently added genes.
This might be due to the coming together of genes that share a common regulatory mechanism. An evolutionary advantage would result from genes becoming closely located under a single regulatory control mechanism and this would extend to genes that are not in the same family. As non-coding regulatory genes increase in number with the evolution of the higher animals, the drive to locate together genes with a common regulatory control will increase, leading to a further increase in the non-random localisation of newly recruited genes. The genes that were not added to existing gene families might perhaps be considered as true orphans in the sense that they were formed by mutational conversion of non-coding DNA sequences. In contrast, most of the genes that joined existing families were most probably formed by the duplication of existing genes (these being the earlier-added family members) followed by mutational divergence.
In addition to the view that chromosome evolution can reduce bunching, we suggest that chromosome evolution can itself provide a route by which related genes can be brought together, thus reducing the random distribution of genes among the chromosomes. Studies on the synteny between the mammalian ancestral chromosomes and the human chromosomes suggest that chromosome 21 was formed by the conjunction of mammalian ancestral chromosome 4 and mammalian ancestral chromosome 16 (11), (12). In the transition from the Mammalia to the Boreoeutheria, 9 new KRTAP genes were added to the already high KRTAP content of chromosome 21 (Figure S2). This was accompanied by the conjunction that brought together the two blocks of KRTAP genes that had previously been located on separate ancestral mammalian chromosomes. The KRTAP genes comprise almost a quarter of the protein-coding genes on chromosome 21 and they are the only large family on this chromosome. The evolutionary drive for the conjunction that produced chromosome 21 might have been to bring, under a single mechanism of control and regulation, two blocks of KRTAP genes which, as we discuss further below, are so important for the development of the hair that reduces heat loss in warm-blooded animals.
If so, this control is likely to arise from the higher-level organization of the chromosome (13), (14), since we found no evidence for a newly organized sharing of enhancers between the two halves of  (19). Writing on the evolutionary sources of the KRTAP-encoding genes, Wu, Irwin, and Zhang (8) reported that the keratin-associated proteins were confined to the mammals, while in a later paper, Wu and Irwin wrote: "As the sequence composition of the KRTAP genes have no homology with other existing genes, it is likely that the KRTAP genes originated de novo from non-genic regions." (20). From this one might conclude that the first KRTAP was an orphan gene, or perhaps different orphans founded the different families. The descendant genes evolved from these founders by duplication followed by mutation. Consistent with this history is the genomic localisation of the KRTAPs, their being bunched in their families, and bunched on just a few chromosomes. The ecological niche here was the initial adoption of the nocturnal life-style, while the ramification of the mammalian life-style was associated with the development of new varieties and distributions of hair fibres.
A similar argument can be made for the olfactory receptor (OR) gene family of Fig 4C. The emergence of the tetrapoda from the sea on to dry land was associated with the initial development of receptors for airborne signalling molecules, the odours. As evolution proceeded and the range of olfactory receptors expanded, there was a corresponding expansion of the range of odours that could be detected, with an expansion of the repertoire of food-seeking behaviors. The newly-evolved OR genes would have been formed by duplication and transformation of pre-existing orphan genes or of earlier evolved family members. Thus again, the families expanded in number but remained bunched in their localisation.
Finally, as we saw in the discussion around Figure 2S, even when the members of the gene families are removed from the gene distribution computations, there still remains some residual bunching of the more recently-appearing genes. We associate this with the evolutionary advantage of having genes bunched closely together so as to be co-ordinately controlled by appropriate non-coding genes. Non-coding genes appeared in the genome later than the protein-coding genes. Thus bunching associated with non-coding genes would be more prevalent in recently-appearing genes.

Conclusions
Our conclusion from all these data is that from the tetrapoda through to the higher apes the distribution of newly-added genes across the chromosomes is increasingly non-random. Bunching arises from the preferential location of new members of gene families to where their older family members had been established. Insufficient time has been available, in these more recent epochs, for chromosomal rearrangements to have disrupted this bunching whereas, for earlier epochs, there would have been sufficient time to bring about randomization of any non-random distributions.

Methods
Definition of the 19 phylostratum levels. See Table 1 in the Background section.
The ages of the genes. To determine the ages of the genes, we have used publicly available ortholog search engines. Unfortunately, the various ortholog databases interrogated by the ortholog search engines do not have uniform criteria for identifying orthologs and, thus, for identifying the earliest ortholog and hence the age of the gene in question (see (21) for an account of ongoing progress in this field). Indeed, the age estimate for a particular gene can vary very widely between different databases.  (5) which is the source of the data on which the present study is based. The values found for the consensus ages are likely to be fairly robust: for 90% of the protein-coding genes, the mode was shared by 4 or more databases, while for 97%, 3 or more databases shared the same value for the earliest ortholog level. Genes present in gene families are a major concern of the present paper. Finding the ortholog-based ages of the genes in such families does not identify the single founder gene of the family but rather the earliest phylostratum at which the individual members of the family first appeared.
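The idea of a consensus age taken as the mode across databases can be illustrated with a minimal sketch. This is not the procedure used to build the source data set; the database names, the example values, and the tie-breaking rule (prefer the older, lower-numbered phylostratum) are all assumptions made for illustration.

```python
from collections import Counter


def consensus_age(estimates):
    """Return (modal phylostratum, number of databases sharing it).

    `estimates` maps a database name to the phylostratum (1-19) of the
    earliest ortholog that database reports for one gene.
    Assumption: ties are broken toward the older (lower-numbered)
    phylostratum; the source text does not state a tie-breaking rule.
    """
    counts = Counter(estimates.values())
    top = max(counts.values())
    mode = min(age for age, n in counts.items() if n == top)
    return mode, top


# Hypothetical estimates for one gene from five placeholder databases.
example = {"db_a": 16, "db_b": 16, "db_c": 16, "db_d": 12, "db_e": 16}
age, support = consensus_age(example)
print(age, support)
```

The second returned value is the quantity referred to above when the mode is said to be shared by 3 or 4 or more databases.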
Chromosome distributions were abstracted from the ENSEMBL database (24). We took eight representative species from various phylostratum levels, choosing those for which full assignments of genes to chromosomes are available. This excludes, for instance, the wallaby and rabbit genomes, where many genes are still assigned to scaffolds.
Statistical approaches.
To assess the random/non-random distribution of the newly recruited protein-coding genes across the chromosomes, we first computed for each chromosome in turn the ratio of genes having a particular phylostratum age to all the protein-coding genes on that chromosome (expressed as a percentage). We To measure the concordance of the age distributions between successive phylostrata, we used the Spearman Rank Order Correlation routine of SigmaPlot version 11.0 (Systat Software, Inc., San Jose, California, USA), which returns the correlation coefficient and its p value for the comparisons. We used this also to explore the concordance between successive cuts along a chromosome. To explore the form of the relation between the phylostratum age (PA) and MAD, we fitted the data with the multiregression routine of SigmaPlot, using a dummy variable PA =n to search for the breakpoint between two successive linear components. To account for the repeated-measures element of retesting species, we used a mixed-effects model (from the Non Linear Mixed Effect package of R).
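The first two steps described above (per-chromosome percentages, then a rank correlation between successive phylostrata) can be sketched as follows. This is an illustrative sketch only, not the SigmaPlot routine used in the study: the function names and the toy gene list are hypothetical, the Spearman coefficient is computed as the Pearson correlation of average ranks, and no p value is calculated.

```python
from collections import defaultdict


def percent_by_chromosome(genes, phylostratum):
    """Percentage of genes on each chromosome that have the given age.

    `genes` is an iterable of (chromosome, phylostratum) pairs.
    """
    total = defaultdict(int)
    hits = defaultdict(int)
    for chrom, age in genes:
        total[chrom] += 1
        if age == phylostratum:
            hits[chrom] += 1
    return {c: 100.0 * hits[c] / total[c] for c in total}


def average_ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman coefficient as the Pearson correlation of the ranks.

    Undefined (division by zero) if either input is constant.
    """
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


# Hypothetical toy data: (chromosome, phylostratum) for twelve genes.
genes = [("1", 16), ("1", 16), ("1", 2), ("1", 17),
         ("2", 2), ("2", 2), ("2", 16), ("2", 2),
         ("3", 17), ("3", 17), ("3", 16), ("3", 2)]

pct16 = percent_by_chromosome(genes, 16)
pct17 = percent_by_chromosome(genes, 17)
chroms = sorted(pct16)  # fixed chromosome order so the two vectors align
rho = spearman_rho([pct16[c] for c in chroms], [pct17[c] for c in chroms])
print(pct16, pct17, rho)
```

A coefficient near +1 for two successive phylostrata would indicate that chromosomes enriched in genes of one age also tend to be enriched in genes of the next age, which is the kind of concordance the comparison is meant to detect.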

Declarations
Ethics approval and consent to participate: not applicable.
Consent for publication: not applicable.
Availability of data and materials: All data generated or analysed during this study are included in this published article and its supplementary information files.
Competing interests: not applicable.
Funding: not applicable; no funding was asked for, nor was any received.
Authors' contributions: M.B.H. asked the question that this paper attempts to answer, suggested the major statistical approach that we used, and provided the Appendix. W.D.S. performed the remaining computational analyses and wrote the paper.