A global survey of eco-evolutionary pressures acting on horizontal gene transfer

doi:10.21203/rs.3.rs-3062985/v1

This work is licensed under a CC BY 4.0 License

Journal Publication

published 05 Mar, 2024

Read the published version in Nature Ecology & Evolution →

Version 1

Horizontal gene transfer, the exchange of genetic material through means other than reproduction, is a fundamental force in prokaryotic genome evolution. Genomic persistence of horizontally transferred genes has been shown to be influenced by both ecological and evolutionary factors. However, the limited availability of ecological information apart from species’ isolation sources prevented deeper exploration of ecological contributions to horizontal gene transfer. Here, we assessed extensive ecological profiles of gene-exchanging organisms, focusing on transfers detected through explicit phylogenetic methods. By analysing the observed horizontal gene transfer events, we show distinct functional profiles for recent versus old events. Although most genes transferred are accessory, genes transferred earlier in evolution tend to be more ubiquitous within present-day species. Based on environmental information, we find that co-occurring, interacting, and high-abundance species tend to exchange more genes. Finally, we show that host-associated specialist species are much more likely to exchange genes with each other, while generalist species display less of a preference towards HGT with other species in their assigned habitat. Our study covers an unprecedented scale of integrated horizontal gene transfer and environmental information, highlighting broad eco-evolutionary trends.

Biological sciences/Evolution/Molecular evolution

Biological sciences/Computational biology and bioinformatics

Biological sciences/Ecology/Microbial ecology

Biological sciences/Genetics/Prokaryote

The gene content of microbial genomes constantly changes through gain and loss of genes¹. Gene gain through horizontal gene transfer (HGT) in particular is a driving force in prokaryotic genome evolution^1,2, and most genes have been shown to undergo HGT at least once in their evolutionary history^3,4. However, foreign genes can be a burden or even toxic to the recipient⁵, typically persisting only as long as imposed by fluctuating environmental circumstances. In a simple two-class model of gene evolution⁶, such genes display high rates of turnover. By contrast, other foreign genes may provide sufficient benefit to the recipient, outweighing maintenance costs and persisting long enough to be detected in present-day genomes through computational methods⁷.

Multiple conceptually diverse approaches for computational HGT detection exist⁸. Detecting genomic regions with abnormal sequence composition has the advantage of requiring only the recipient genome. However, it is restricted to recent transfer events due to gene amelioration, whereby foreign DNA evolves to resemble that of its host species^9,10. Alternatively, HGT can be detected by comparing genomes and identifying discrepancies between gene and species evolutionary history. These comparative genomics approaches include the detection of nearly identical sequences in genomes from different species^11–16 or the more computationally intensive modelling of gene evolution through processes such as gene duplication, transfer or loss^3,4,17–19. The next-generation sequencing revolution has enabled HGT detection through comparative genomics approaches by making an abundance of high-quality prokaryotic genomes publicly available in curated databases such as proGenomes²⁰.

Previous large-scale surveys of horizontal gene transfer across different environments have showcased the contribution of shared ecology to HGT^11,13,14,21. Generally, inter-environmental transfers were found to be rare, with the possible exception of antibiotic resistance genes¹¹. The importance of shared ecology in determining HGT frequency can be explained from two different perspectives. On the one hand, similar environments may exert similar pressures, prioritising the persistence of specific functional traits. On the other hand, as most HGT mechanisms require physical proximity between the donor and the recipient²², co-occurring within the same environment may simply provide more opportunities for HGT.

In this study, we aim to elucidate both ecological and evolutionary factors that contribute to a successful gene gain event through HGT. Using the gene content of 8,790 species’ pangenomes²⁰ clustered into over a million gene families, we ran RANGER-DTL to model duplication, transfer, and loss events in gene evolution²³. In parallel, we searched for these species in the MicrobeAtlas database (https://microbeatlas.org/), obtaining more than one million microbial community profiles from diverse, globally distributed environments. By following species presence and abundance profiles across this dataset, we show that co-occurrence, abundance, and dispersal patterns all determine HGT success. By looking at functionality and ubiquity of transferred genes, we observe that recent transfers are enriched for genes involved in transcription, replication and repair, and antimicrobial resistance genes. By comparison, old transfers are enriched for genes involved in amino acid, carbohydrate, and energy metabolism, and are more likely to concern genes that are present in nearly all members of a species. This study provides an overview of global ecological trends in HGT.

The contribution of HGT to prokaryotic genome evolution is extensive

To detect HGT events, we first created pangenomes for 8,790 species based on 78,315 high-quality single isolate genomes²⁰. The resulting 45 million genes were clustered at a minimum of 80% nucleotide identity and 50% sequence overlap into 22 million clusters, 961,821 (4.4%) of which covered more than five species. For each such gene cluster, reconciliation with the species tree based on 40 universal single-copy marker genes²⁰ was performed using RANGER-DTL²³ (Fig. 1 and Methods), resulting in 2.4 million well-supported unique transfer events that involved 8,756 species and 1.7 million species pairs (4.4% of all possible species pairs). Previous studies considering trends in HGT based on thousands of genomes have focused on very recent transfer events, i.e. those involving gene pairs with ≥99% nucleotide identity^11,13,15,16. Such gene pairs comprised 4.5% of the detected events in our dataset. By using tree reconciliation for HGT event detection, we were able to investigate whether transfers that happened earlier in evolution were subject to the same trends as very recent transfer events.

At least one transfer event was detected in 634,352 gene trees out of 961,821 (~66%). The fraction of transferred genes varied between species. For example, a transfer event was detected in 61.5% of the genes considered for Acinetobacter baumannii, but only in 19.8% for Listeria monocytogenes. Across all species, this resulted in an average of 42.5% (IQR: 35.9%–50.5%) of genes per species affected by HGT. This number is lower than previously reported estimates of an average of 73%¹⁷ and 81%⁴ of genes per genome affected by HGT. Most likely, this is because we use a stricter threshold to cluster sequences: 80% nucleotide identity as opposed to 30%¹⁷ and 25%⁴ amino acid identity. We therefore do not capture the oldest transfers considered in these studies^4,17, but in return are able to assess HGT in a much larger dataset and to look at very recent transfers.

We observed no association (r = 0.01, P_Pearson = 0.17) between the average fraction of transferred genes per species and the number of genomes used for generating the pangenome (Extended Data Fig. 1a). The average fraction of transferred genes was therefore not significantly skewed towards better-studied species. More interestingly, the average fraction of transferred genes per species was weakly positively correlated (r = 0.18, P_Pearson = 7.0E-64) with the number of genes in the genome (Extended Data Fig. 1b). Previous studies comparing closely related prokaryotic genomes of different sizes have found evidence that HGT is the driving force behind genome expansion, which leads to larger genomes containing a higher fraction of transferred genes^1,2.

Previous studies have shown host-associated species to exchange more genes than those found in water or soil^11,13,14, leading us to next investigate the interspecies variability in gene transfer rates from this perspective. We mapped the species in our dataset to operational taxonomic units (OTUs) in the MicrobeAtlas database (Fig. 1), assigning “preferred” habitats based on highest average relative abundance. Restricting the analysis to transfers concerning gene pairs with ≥98% nucleotide identity, we indeed observed the highest median fraction of transferred genes in animal-associated species (1.32%). Plant-associated species had the second highest median fraction of transferred genes (0.46%), followed by soil-associated (0.16%) and finally water-associated (0.10%) species (Extended Data Fig. 1c). By contrast, when considering all transfer events in the dataset, we found no significant difference between animal-associated, water-associated, and soil-associated species (Extended Data Fig. 1d). These findings indicate that on longer evolutionary scales, the loss of transferred genes compensates for the higher rate of HGT in animal-associated species.

Transferred genes are accessory, but can spread across the species population over time

We next focused on gene pairs with detected transfer events. Such gene pairs were enriched across distantly related species when compared to gene pairs without transfer events (Fig. 2a), even after subsampling gene pairs with and without transfer events to follow the same gene distance distribution (Fig. 2b, P_{Mann-Whitney U}≤ 2.2E-16). The enrichment was especially strong at gene tree distances ≤ 0.01, which corresponded to the most recent transfer events (Fig. 2a, bottom row in 2D histogram). After subsampling gene pairs with and without transfer events to follow the same species distance distribution, we observed generally lower gene distances in gene pairs with detected transfers (Fig. 2c, P_{Mann-Whitney U}≤ 2E-16). These results confirmed that transferred genes are more similar than expected based on species similarity.

The extent of HGT and resulting within-species gene content variation leads to a common distinction between core genes (present in all genomes of a species) and accessory genes (present in some genomes of a species)²⁴. We therefore studied the pangenome ubiquity of transferred genes using previously defined thresholds for extended core, intermediate-frequency accessory (shell) and low-frequency accessory (cloud) genes²⁵ (see Fig. 2d). We observed an over two-fold enrichment of transfer events among the cloud genes (Fig. 2d and Extended Data Fig. 2a, odds ratio = 2.07 and 2.87, P_Fisher ≤ 2.2E-16 in both cases). By contrast, extended core genes were depleted in transfer events (odds ratio = 0.46 and 0.37, P_Fisher ≤ 2.2E-16 in both cases). We next used gene distance as a proxy for time since the transfer event, as genes transferred earlier in evolution have had more time to accumulate mutations and diverge from the donor. Interestingly, we observed higher fractions of extended core genes in older transfers (Extended Data Fig. 2b), implying the persistence and population spread of a subset of transferred genes during species evolution. Indeed, in the context of the two-class model of gene evolution⁶, genes with originally high turnover rates can be recruited to perform biological functions with long-term benefit. Such genes then switch to the second, slowly evolving and persistent, class.

Recent and old transfer events display distinct functional repertoires

Multiple studies have considered the function of transferred genes^{9,13,15,18,26–33}, which, in the context of very recent transfers, has been shown to be predictive of HGT success¹⁵. To explore further, we divided our landscape of detected transfer events into bins based on species and gene distance and performed functional enrichment analysis for each bin using the COG categories from eggNOG³⁴ (Fig. 2e and Methods), gene distance again acting as a proxy for time since the transfer event. Recent transfers were enriched for genes participating in defence mechanisms, intracellular trafficking, cell cycle control, transcription, replication and repair, and genes of unknown function (Fig. 2f, Fig. 3, Extended Data Fig. 3). By contrast, genes involved in various metabolic functions were depleted in recent transfer events and enriched in older transfer events (Fig. 2f, Fig. 3, Extended Data Fig. 3). Finally, we found an overall depletion of transfers in genes involved in signal transduction, cell wall biogenesis, and cell motility (Fig. 2f, Fig. 3, Extended Data Fig. 3). To validate our findings with another system of functional annotation, we repeated the analysis using KEGG pathways³⁵ (Fig. 3, Extended Data Fig. 4). As most pathways considered were associated with metabolic function, we observed a similar trend of significant depletion in recent transfers and enrichment in older transfers, although the latter was not always statistically significant (Fig. 3, Extended Data Fig. 4).

The use of various methods for defining transferred genes, different functional annotation systems and choice of background expectation complicate direct comparison between different studies. Moreover, in contrast to most previous studies, we performed functional enrichment analysis separately for gene and species pairs of varying degrees of divergence to prevent recent transfers between closely related species from dominating enrichment results (Fig. 2e, bottom left corner). Nevertheless, we were able to select ten previous studies on HGT that performed functional enrichment analysis and compared their results with our observations from recent transfers (gene distance bins c, d, and e) (see Fig. 3 and Methods). Our observations agreed with previous observations in all ten functional categories with consistent strong evidence for enrichment or depletion in transfers.

In contrast to our findings, multiple studies discuss high transfer rates in genes associated with metabolism and low transfer rates in information storage and processing^27,29,32. These studies find evidence for increased enrichment for energy and carbohydrate metabolism but disagree on other categories of metabolism (Fig. 3). Similarly, the observed low transfer rates in information storage and processing could be driven by low transfer rates in genes associated with translation; genes associated with transcription or with replication and repair display no consistent signal across studies (Fig. 3). For transcription, a case can be made for comparing genes involved in regulatory processes separately from other genes involved in transcription, as these appear to be more consistently enriched in transfers. Moreover, studies disagree on the rates of transfer of genes involved in cell cycle control, intracellular trafficking, and secondary metabolism. In light of these disagreements, Tal Pupko and others argue that it is gene connectivity rather than function that determines HGT success²⁹, which could be an interesting avenue for future follow-up.

Antimicrobial resistance genes have been previously observed to be transferred at high rates^11,13–16. We therefore focused on genes annotated as such by KEGG. The most recent transfers displayed an over three-fold enrichment in genes conferring resistance to beta-lactams, aminoglycosides, tetracyclines, macrolides, phenicols and rifamycins (Extended Data Fig. 5). The degree of enrichment increased with species distance, suggesting that aggressive environmental selection for antimicrobial resistance can help overcome mechanistic barriers to HGT²² between distantly related species. At larger gene distances, however, we generally observed a depletion in transfers or no significant signal. The low degree of divergence between antimicrobial resistance genes shared via HGT could indicate transfer event recency but could also stem from strong evolutionary selection acting on these genes. Unfortunately, we are unable to distinguish whether these transfers occurred before or after widespread antibiotic usage, with previous estimates indicating nearly-identical genes to have been transferred at any point in the last 1000¹³ or 10,000¹⁶ years.

Co-occurring and interacting species are more likely to transfer genes between each other

We moved on to study the species participating in HGT and possible associated ecological factors. By using the MicrobeAtlas database, an extensive collection of environmental samples mapped to the same 16S rRNA gene reference, we were able to observe the presence of two taxa within the same environmental sample, directly calculating co-occurrence rates. After mapping our dataset to OTUs in the MicrobeAtlas database, we observed a positive correlation between co-occurrence and the number of genes transferred for most OTUs (Fig. 4a, step 1 and pre-correction histogram). However, genetic similarity has been shown to influence the success of HGT^{11,13,15,16,30}. Indeed, the number of transferred genes negatively correlated with the phylogenetic distance between the OTU pair (Fig. 4a, step 2). In addition, we observed a decrease in co-occurrence with increasing phylogenetic distance, in accordance with closely related taxa preferring similar environments³⁶ (Fig. 4a, step 2). We thus sought to correct for the phylogenetic signal in our observations on HGT and co-occurrence.

To this end, we modelled the relationship between co-occurrence and phylogenetic distance using power-law (Fig. 4a, step 3) and exponential decay (Extended Data Fig. 6) equations. Upon comparing model residuals on co-occurrence with the number of genes transferred, the positive correlation remained for most OTUs (Fig. 4a, steps 4-5 and post-correction histogram). As a complementary approach, we compared species pairs with multiple (≥7) genes transferred to those with at most one gene transferred with respect to their phylogenetic distance and co-occurrence (Fig. 4b). After normalising for differences in phylogenetic distance distribution, we observed that pairs of species with multiple transferred genes were significantly more likely to co-occur than pairs of species with at most one transferred gene (Fig. 4c, P_{Mann-Whitney U} ≤ 2.2E-16). Correspondingly, when comparing species pairs with similar degrees of co-occurrence, we observed that pairs with multiple transferred genes were more likely to be closely related (Fig. 4d, P_{Mann-Whitney U} ≤ 2.2E-16).
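The residual-based correction described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes the power law has the form co-occurrence = a · distance^b, fits it by ordinary least squares on log-transformed values, and uses Pearson correlation purely for illustration. All function names and the toy numbers are hypothetical.

```python
import math

def fit_power_law(distances, cooccurrence):
    """Fit cooccurrence = a * distance**b by least squares in log-log space."""
    xs = [math.log(d) for d in distances]
    ys = [math.log(c) for c in cooccurrence]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b

def power_law_residuals(distances, cooccurrence):
    """Observed co-occurrence minus the value predicted from phylogenetic distance."""
    a, b = fit_power_law(distances, cooccurrence)
    return [c - a * d ** b for d, c in zip(distances, cooccurrence)]

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented values for one focal OTU paired with five partner OTUs.
distances = [0.05, 0.10, 0.20, 0.40, 0.80]
cooccurrence = [0.30, 0.15, 0.16, 0.06, 0.07]
genes_transferred = [4, 1, 9, 0, 6]

resid = power_law_residuals(distances, cooccurrence)
r_corrected = pearson_r(resid, genes_transferred)
```

A positive `r_corrected` would mean that partners co-occurring more often than their phylogenetic distance predicts also share more transferred genes with the focal OTU.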

Observing a positive relationship between co-occurrence and the number of genes transferred, we next asked whether co-occurring species also need to interact to increase the chance of a successful HGT event. Predicting ecological interactions between two species based on co-occurrence can result in spurious associations arising from shared habitats, batch effects, or interactions of both considered species with a third intermediary species. To correct for these effects, we used FlashWeave to generate an ecological interaction network³⁷ (see Methods). We then compared the number of ecological interactions between species pairs with multiple genes transferred and species pairs with at most one gene transferred after subsampling these two groups to follow the same phylogenetic distance and co-occurrence distributions. Within this data subset, we observed 1,012 interactions detected between species pairs with multiple genes transferred and 571 interactions between species pairs with zero or one genes transferred, a 1.8-fold enrichment. In contrast, correlating relative abundance profiles without statistical correction of spurious associations yielded 12,525 and 10,071 “interactions” respectively, a 1.2-fold enrichment. This suggests that there is a notable contribution of ecological interactions to HGT in an environment in addition to mere co-occurrence, but further research that considers a larger number of interactions is needed.

Analysing species relative abundance across environments reveals additional factors contributing to HGT

We used the MicrobeAtlas database not only to look at presence/absence data, but to compare relative abundance profiles across over a million samples. To our knowledge, only one study has looked into the relationship between HGT and species abundance, concluding that abundant bacteria are more likely to transfer genes to other abundant bacteria within the human gut¹⁶. Unlike the aforementioned study, we do not possess directly matched cultured isolate genomes with their relative abundance in the corresponding environmental sample, but we can determine whether a species is generally found in high or low abundances within a particular environment. We therefore assigned each OTU to its “preferred” habitat and compared HGT in OTUs lying on opposite ends of the environment’s OTU abundance distribution (Extended Data Fig. 7).

We observed a higher fraction of high-abundance OTU pairs participating in HGT when comparing pairs with similar phylogenetic distance (Fig. 5a). Interestingly, the increase in HGT probability with respect to abundance was higher in animal- and plant-associated microorganisms and was significant in all pairwise comparisons of high-high, high-low and low-low abundance OTUs (Fig. 5a, Extended Data Table 2). In water and soil-associated microorganisms, the increase in HGT probability was less apparent and not always significant (Fig. 5a, Extended Data Table 2). As HGT mechanisms require physical proximity between cells exchanging DNA²², high-abundance species have more opportunities for transfer assuming a well-mixed environment. The stronger signal observed in animal- and plant-associated organisms, however, indicates a role for host-associated factors in HGT.

Finally, we defined an index of species generalism based on the relative abundance measurements of OTUs across different environments (see Methods). Generalist species can thrive within a wide range of environments, while specialist species are confined to a particular environment. Our expectation was therefore that OTUs high on the generalism index can more easily disperse between environments, creating more opportunities for inter-environmental HGT. To this end, we selected 200 OTUs with the highest (“generalists”) and lowest (“specialists”) generalism index and compared the number of OTU pairs with at least one transfer event (Fig. 5b). Compared to all species, “generalists” exhibited a lower standard deviation (Z-score = -10.5), lower range (Z-score = -6.56) and higher mean (Z-score = 2.67) of inter-environmental transfer rates. By contrast, “specialists” exhibited a higher standard deviation (Z-score = 37) and higher range (Z-score = 31.4) of inter-environmental transfer rates. Interestingly, we observed a much higher rate of HGT between animal-associated “specialists” when compared to any other environmental and generalism index combination.

HGT is extensive and a fundamental driving force in prokaryotic genome evolution. In this study, we performed large-scale computational detection of HGT and integrated these data with an extensive microbial ecology dataset. In our dataset, an average of 42.5% of genes in the genome were at one point affected by HGT. Most transferred genes are accessory and likely subject to high turnover rates. Nevertheless, a fraction of genes transferred earlier in evolution managed to persist and become part of the extended core genome of the species. We have shown that such genes transferred earlier in evolution are enriched for metabolic functions. By contrast, genes transferred most recently are enriched for defence mechanisms and antimicrobial resistance. When considering previous knowledge on HGT and gene function, we show that 9 out of 19 COG categories display no consistent signal across studies, suggesting additional factors at play.

Using the MicrobeAtlas database, we followed the global distribution of species that participated in HGT. Even after correction for the confounding effect of phylogenetic relatedness, species co-occurrence rates were positively correlated with larger numbers of transferred genes. Additionally, we have shown that species interactions, abundance and dispersal affect HGT rates, indicating the importance of cell proximity for creating opportunities to transfer genes. These ecological factors could not have been assessed on such a global scale with previously available data, showing the value of the MicrobeAtlas database in describing high-level trends in microbial ecology and evolution.

Genome selection and pangenome generation

We based our analysis on the proGenomes v2.2 dataset containing 82,400 genomes grouped into 11,562 species (i.e., specI clusters) that were defined based on 40 single-copy marker genes²⁰. The corresponding species tree generated based on concatenated marker gene sequences was kindly provided by the authors of the proGenomes paper.

From this initial selection, we filtered out metagenome-assembled genomes, single amplified genomes, genomes flagged as chimeric by GUNC³⁸, genomes that were not taxonomically cohesive with the rest of the specI cluster according to GTDB³⁹, genomes with no 16S rRNA gene sequence detected, and genomes we could not confidently map to the MicrobeAtlas database (see “Mapping genomes to MicrobeAtlas database OTUs” below). The species tree was pruned to remove these genomes using the ETE Toolkit v3⁴⁰. As a result, we obtained 78,315 genomes grouped into 8,790 species. For each species, a pangenome was generated by clustering all gene sequences on 95% nucleotide sequence identity as described in ⁴¹.

HGT event detection

All gene sequences were clustered using MMseqs2⁴² with minimum overlap of 50%, minimum identity threshold of 80% and clustering mode “0”. The rest of the parameters were left as default. For each gene cluster, whenever sequences originated from more than one genome within a species, we only retained sequences that had the highest average nucleotide identity with sequences from other species within the gene cluster. Sequences were then aligned using the automatic strategy selection option in MAFFT v7.471⁴³, all other parameters left as default. Based on the multiple sequence alignment, a gene tree was generated using FastTree v2.1.11⁴⁴ using the GTR model of nucleotide evolution, all parameters left as default.

Prior to performing tree reconciliation, we subsampled the species tree using ETE Toolkit v3⁴⁰ to decrease computational requirements in the following manner: for each gene cluster, the species tree node corresponding to the last common ancestor of all species within the gene cluster was selected. Clades within the species tree not containing any genes from the gene cluster were collapsed. Subsequently, the subsampled species tree was used to root the gene tree using the OptRoot module from RANGER-DTL v2.0²³. We then ran RANGER-DTL with default settings to perform gene and species tree reconciliation a total of 500 times. Gene clusters in which more than 50 optimal roots were detected were not considered further. Reconciliations from each optimal root were aggregated using the AggregateRanger_recipient module from RANGER-DTL v2.0. We used a custom script to aggregate results across optimal roots and detect tree nodes that were labelled as transfers. For downstream analysis, we considered only transfer events detected in ≥80% of reconciliations that contained gene pairs with ≥0.5 minimum branch support in the gene tree. In addition, all multifurcations containing 100% identical genes from different species were considered to be transfer events.
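The final support filter can be sketched as follows. This is an illustration only, not the custom script mentioned above: the record layout and all names are hypothetical, and the sketch assumes that reconciliation support is the fraction of aggregated reconciliations labelling a node as a transfer.

```python
from dataclasses import dataclass

@dataclass
class TransferCandidate:
    node_id: str
    n_reconciliations: int     # reconciliations aggregated for this gene tree
    n_transfer_calls: int      # reconciliations labelling this node as a transfer
    min_branch_support: float  # lowest branch support among the gene pairs involved

def is_well_supported(candidate, min_fraction=0.8, min_support=0.5):
    """Keep events seen in >=80% of reconciliations with >=0.5 branch support."""
    fraction = candidate.n_transfer_calls / candidate.n_reconciliations
    return fraction >= min_fraction and candidate.min_branch_support >= min_support

# Invented candidates for illustration.
candidates = [
    TransferCandidate("n1", 500, 480, 0.92),
    TransferCandidate("n2", 500, 390, 0.95),  # 78% of reconciliations: rejected
    TransferCandidate("n3", 500, 500, 0.40),  # weak branch support: rejected
    TransferCandidate("n4", 500, 400, 0.50),  # exactly on both thresholds: kept
]
kept = [c.node_id for c in candidates if is_well_supported(c)]
```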

Calculating average fraction of genes transferred

For each genome, we counted a gene as having undergone transfer if its pangenome-representative gene was involved in a transfer event. We counted a gene as assessed by the pipeline if its pangenome-representative gene was present in the pipeline output. The number of genes transferred was then divided by the number of genes assessed by the pipeline and the average based on all genomes within a species was calculated. For the examples mentioned in the main text, we used data from specI_v3_Cluster259 for Acinetobacter baumannii and data from specI_v3_Cluster712 for Listeria monocytogenes.
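As a minimal sketch of this calculation, assuming each genome is given as a list of the pangenome-representative gene identifiers of its genes (identifiers and function names below are hypothetical):

```python
def genome_transfer_fraction(representatives, assessed, transferred):
    """Fraction of a genome's assessed genes whose representative was transferred."""
    assessed_genes = [g for g in representatives if g in assessed]
    if not assessed_genes:
        return None  # genome contributes nothing to the species average
    n_transferred = sum(1 for g in assessed_genes if g in transferred)
    return n_transferred / len(assessed_genes)

def species_transfer_fraction(genomes, assessed, transferred):
    """Average the per-genome fractions over all genomes of one species."""
    fractions = [genome_transfer_fraction(reps, assessed, transferred)
                 for reps in genomes.values()]
    fractions = [f for f in fractions if f is not None]
    return sum(fractions) / len(fractions)

# Invented example: representatives r1-r4 passed the pipeline, r1 and r3 were transferred.
assessed = {"r1", "r2", "r3", "r4"}
transferred = {"r1", "r3"}
genomes = {
    "genome_A": ["r1", "r2", "r5"],  # r5 was not assessed, so 1 of 2 assessed genes
    "genome_B": ["r1", "r3", "r4"],  # 2 of 3 assessed genes
}
species_average = species_transfer_fraction(genomes, assessed, transferred)
```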

MicrobeAtlas data retrieval

The NCBI Sequence Read Archive (SRA)⁴⁵ was searched for samples and studies containing any of the keywords “metagenomic”, “microb*”, “bacteria”, or “archaea” in their metadata. The corresponding raw sequence data (as of 7 March 2020) were downloaded and quality filtered. To assign OTU labels, quality-filtered data were mapped to MAPref v2.2.1 using MAPseq v1.0 at a ≥0.5 confidence level⁴⁶. We then filtered out samples containing fewer than 1,000 reads and 20 OTUs defined at 97% sequence identity, and retained samples with at least 90% community coverage (calculated based on the formula in ⁴⁷).
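The sample filter can be sketched as below, under two explicit assumptions that the text does not confirm: community coverage is approximated here by Good's coverage (one minus the share of reads belonging to singleton OTUs), which may differ from the formula in ⁴⁷, and a sample is retained only if it meets both the read and the OTU minimum. Function names are hypothetical.

```python
def goods_coverage(otu_counts):
    """Good's coverage: 1 - (number of singleton OTUs / total reads). Assumed formula."""
    total = sum(otu_counts)
    singletons = sum(1 for c in otu_counts if c == 1)
    return 1.0 - singletons / total

def keep_sample(otu_counts, min_reads=1000, min_otus=20, min_coverage=0.9):
    """Apply the read-depth, OTU-richness and coverage thresholds to one sample."""
    counts = [c for c in otu_counts if c > 0]
    if sum(counts) < min_reads or len(counts) < min_otus:
        return False
    return goods_coverage(counts) >= min_coverage

# Invented read counts per OTU.
deep_sample = [100] * 20               # 2,000 reads, 20 OTUs, no singletons
shallow_sample = [10] * 20             # only 200 reads
singleton_heavy = [880] + [1] * 150    # 1,030 reads, coverage about 0.85
```

With these toy profiles only `deep_sample` passes all three thresholds.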

NCBI SRA sample metadata were parsed to classify every sample into four general environments: animal, aquatic, plant, and soil. Subsequently, we calculated Bray-Curtis distances between all samples in the dataset and compared community compositions in samples from independent studies. When a sample was consistently similar to samples assigned to a different environment, we adjusted its environment label. In cases where samples with similar community compositions had no general agreement between assigned environments, we removed the environmental label.
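For reference, the Bray-Curtis dissimilarity used to compare community compositions can be written in a few lines. The sketch assumes each sample is a mapping from OTU identifier to relative abundance; the names are hypothetical and the relabelling rules themselves are not reproduced here.

```python
def bray_curtis(sample_a, sample_b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i) over all OTUs."""
    otus = set(sample_a) | set(sample_b)
    numerator = sum(abs(sample_a.get(o, 0.0) - sample_b.get(o, 0.0)) for o in otus)
    denominator = sum(sample_a.get(o, 0.0) + sample_b.get(o, 0.0) for o in otus)
    return numerator / denominator if denominator else 0.0

# Invented profiles sharing one of their two OTUs.
gut_like = {"otu1": 0.5, "otu2": 0.5}
soil_like = {"otu1": 0.5, "otu3": 0.5}
distance = bray_curtis(gut_like, soil_like)
```

The value ranges from 0 for identical profiles to 1 for profiles with no OTU in common.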

Mapping genomes to MicrobeAtlas database OTUs

We used barrnap⁴⁸ with default settings to predict 16S rRNA gene sequences in the genome selection, proceeding with sequences of ≥50% of expected length. The sequences were then mapped to MAPref v2.2.1 using MAPseq v1.0⁴⁶, retaining only sequences that mapped to an OTU with ≥0.3 confidence level. Genomes containing multiple 16S rRNA gene copies were mapped to OTUs based on majority rule (≥50% copies) or high confidence (at least one copy with 0.98 confidence level). Species containing multiple genomes were mapped to OTUs based on majority rule (≥50% genomes).
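The two majority rules can be sketched as follows. Several details are assumptions rather than statements from the text: the majority is computed over copies that passed the ≥0.3 confidence filter, the high-confidence route applies only when no OTU reaches the majority, "0.98 confidence" is read as ≥0.98, and ties at exactly 50% go to the OTU encountered first. All names are hypothetical.

```python
from collections import Counter

def map_genome(copies, min_conf=0.3, high_conf=0.98):
    """copies: list of (otu, confidence), one per 16S rRNA gene copy. Returns an OTU or None."""
    hits = [(otu, conf) for otu, conf in copies if conf >= min_conf]
    if not hits:
        return None
    otu, n = Counter(o for o, _ in hits).most_common(1)[0]
    if 2 * n >= len(hits):          # majority rule: at least 50% of copies
        return otu
    best_otu, best_conf = max(hits, key=lambda hit: hit[1])
    return best_otu if best_conf >= high_conf else None

def map_species(genome_otus):
    """genome_otus: OTU assigned to each genome of a species. Returns an OTU or None."""
    mapped = [o for o in genome_otus if o is not None]
    if not mapped:
        return None
    otu, n = Counter(mapped).most_common(1)[0]
    return otu if 2 * n >= len(mapped) else None  # at least 50% of genomes

# Invented example: two of three copies agree on OTU "A".
genome_otu = map_genome([("A", 0.9), ("A", 0.8), ("B", 0.6)])
species_otu = map_species(["A", "A", "B"])
```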

Preferred habitat assignment

For each OTU within the dataset, the average abundance was calculated separately for all samples assigned to the animal, aquatic, plant, and soil environments. The OTU was then assigned to its preferred environment based on the highest of the four numbers.
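A minimal sketch of this assignment, assuming the relative abundances of one OTU are already grouped by the environment label of each sample (the function name and numbers are hypothetical):

```python
def preferred_habitat(abundance_by_env):
    """Return the environment with the highest mean relative abundance for one OTU."""
    means = {env: sum(values) / len(values)
             for env, values in abundance_by_env.items() if values}
    return max(means, key=means.get)

# Invented relative abundances of a single OTU across labelled samples.
otu_profile = {
    "animal": [0.00, 0.02, 0.01],
    "aquatic": [0.001, 0.0],
    "plant": [0.0, 0.0],
    "soil": [0.05, 0.03],
}
habitat = preferred_habitat(otu_profile)
```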

Gene and species distances normalisation

Distances between gene and species pairs were extracted from the corresponding trees using the dist function in ETE Toolkit v3⁴⁰. To plot the distribution in Fig. 2a, only gene pairs with ≥0.5 minimum branch support values and ≥50% sequence overlap were considered. Gene pairs with and without transfer events were normalised with respect to species distance by splitting the species distance distributions into 80 bins and subsampling the group with the larger number of pairs in each bin (either “Transfer Detected” or “No Transfer Detected”) to the number of pairs in the second group in the corresponding bin (either “No Transfer Detected” or “Transfer Detected”). The same procedure was performed for normalising gene pairs with and without transfer events with respect to gene distance. After normalisation, the resulting distributions were compared using the Mann-Whitney U test.
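The bin-wise subsampling can be sketched as below. This is an illustration under stated assumptions: bins are of equal width over the pooled distance range, sampling is without replacement, and a bin populated by only one group contributes no pairs. Function names and the toy data are hypothetical.

```python
import random

def bin_index(value, lo, hi, n_bins):
    """Index of the equal-width bin containing value."""
    if hi == lo:
        return 0
    return min(n_bins - 1, int((value - lo) / (hi - lo) * n_bins))

def match_distributions(group_a, group_b, n_bins=80, seed=0):
    """Subsample two lists of (distance, label) so each bin holds equal numbers of both."""
    rng = random.Random(seed)
    values = [d for d, _ in group_a + group_b]
    lo, hi = min(values), max(values)
    bins_a, bins_b = {}, {}
    for group, bins in ((group_a, bins_a), (group_b, bins_b)):
        for item in group:
            bins.setdefault(bin_index(item[0], lo, hi, n_bins), []).append(item)
    out_a, out_b = [], []
    for i in sorted(set(bins_a) & set(bins_b)):
        n = min(len(bins_a[i]), len(bins_b[i]))  # size of the smaller group in this bin
        out_a += rng.sample(bins_a[i], n)
        out_b += rng.sample(bins_b[i], n)
    return out_a, out_b

# Invented species distances for pairs with and without detected transfers.
with_transfer = [(0.01 * i, "Transfer Detected") for i in range(100)]
without_transfer = [(0.02 * i, "No Transfer Detected") for i in range(50)]
matched_with, matched_without = match_distributions(with_transfer, without_transfer)
```

After matching, both groups follow the same binned distance distribution, so any remaining difference in the other variable cannot be attributed to the matched one.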

Pangenome analysis

To calculate gene ubiquity, we counted the number of genomes represented by a gene in each pangenome versus the total number of genomes in the species. For subsequent analysis, only species encompassing ≥10 genomes were considered. We used previously defined thresholds²⁵ to distinguish extended core genes (≥90% gene ubiquity) and cloud genes (≤15% gene ubiquity). In the species pair participating in HGT, the species with the higher gene ubiquity was labelled as the putative donor, while the species with the lower gene ubiquity was labelled as the putative recipient. To compare extended core and cloud genes with or without transfer events, Fisher's exact test was performed.
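The ubiquity thresholds and the odds ratio reported alongside Fisher's exact test can be sketched as follows. The sketch assumes that genes falling between the two thresholds form the shell class and uses the simple sample odds ratio of a 2×2 table; function names and counts are hypothetical.

```python
def ubiquity_class(n_genomes_with_gene, n_genomes_total):
    """Classify a pangenome gene by the share of genomes in which it is present."""
    if 100 * n_genomes_with_gene >= 90 * n_genomes_total:
        return "extended core"   # >= 90% gene ubiquity
    if 100 * n_genomes_with_gene <= 15 * n_genomes_total:
        return "cloud"           # <= 15% gene ubiquity
    return "shell"               # assumed: everything in between

def odds_ratio(class_transferred, class_not_transferred,
               other_transferred, other_not_transferred):
    """Sample odds ratio of transfer for genes inside versus outside a class."""
    return (class_transferred * other_not_transferred) / (
        class_not_transferred * other_transferred)

# Invented 2x2 table: cloud genes versus all other genes.
example_or = odds_ratio(20, 10, 10, 20)
```

The associated p-value would come from an exact test on the same table, for example `scipy.stats.fisher_exact`.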

Genome annotation and functional enrichment analysis

We used the COG category and KEGG pathway functional annotations provided by the proGenomes database after running eggNOG-mapper for eggNOG 5.0³⁴. Each gene cluster was annotated to the corresponding functional categories based on the union of all gene annotations within the cluster. We calculated the background expectation for each functional category as the number of genes that passed the pipeline and were annotated to this category, divided by the total number of genes that passed the pipeline.

For each detected transfer event, we calculated the average species and gene distance by averaging the pairwise distances between left descendants and right descendants of the transfer event (for gene distance calculations, only gene pairs with ≥50% sequence overlap were considered). The resulting distribution of species and gene distances can be seen in Fig. 2e. For functional enrichment analysis, minimum and maximum species and gene distance cut-offs were selected in such a way that there were no bins without observations, and the resulting area was divided into thirds. We also looked specifically at transfer events at the 0.01 and 0.05 gene distance cut-offs (approximately ≥99% and ≥95% sequence identity, respectively), as these results would be more comparable to previous studies that detected HGT events based on nearly identical sequences. For each functional category, we then calculated the fraction of transfer events in the area that were annotated to this category. The observed fraction was tested with the binomial test against the fraction of all genes that the pipeline was run on that were annotated to this function. Resulting p-values were corrected for multiple testing using the Holm-Sidak method.
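
The enrichment test and the multiple-testing correction can be sketched in pure Python. This is a stand-in written for illustration, not the code used in the study: the two-sided p-value sums the probabilities of all outcomes that are no more likely than the observed one (one common definition of the exact binomial test), the correction is the step-down Holm-Sidak procedure, and the pmf calculation is only suitable for small counts. All function names are hypothetical.

```python
from math import comb
from typing import List, Sequence


def binom_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n trials with success rate p."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)


def binomial_test_two_sided(k: int, n: int, p: float) -> float:
    """Exact two-sided p-value: sum over outcomes no more likely than k."""
    observed = binom_pmf(k, n, p)
    tolerance = observed * 1e-7  # guards against floating-point ties
    total = 0.0
    for i in range(n + 1):
        prob = binom_pmf(i, n, p)
        if prob <= observed + tolerance:
            total += prob
    return min(1.0, total)


def holm_sidak(pvalues: Sequence[float]) -> List[float]:
    """Step-down Holm-Sidak adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = 1.0 - (1.0 - pvalues[idx]) ** (m - rank)
        running_max = max(running_max, adj)  # enforce monotonicity
        adjusted[idx] = min(1.0, running_max)
    return adjusted


# Hypothetical counts: 12 of 40 transfer events in a category whose
# background fraction among all analysed genes is 0.10.
p_raw = binomial_test_two_sided(12, 40, 0.10)
print(p_raw, holm_sidak([p_raw, 0.04, 0.03]))
```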

A similar procedure was performed using KEGG ortholog annotations, grouping them into KEGG pathway maps (09101 - 09145) for Extended Data Fig. 4 and antimicrobial resistance genes (BR:ko01504) for Extended Data Fig. 5.

Functional repertoire comparison with previous studies

We compared our functional enrichment analysis results with those from Jeong H et al¹⁸, Song W et al³³, Oliveira PH et al³¹, Popa O et al³⁰, Cohen O et al²⁹, Cohen O and Pupko T³², Cordero O et al²⁸, Sheinman M et al¹³, Paquola ACM et al²⁷, and Nakamura Y et al⁹. In most of these studies, functional categories were based on the COG database, with the exception of Sheinman M et al (with categories based on the SEED⁴⁹), and Paquola ACM et al and Nakamura Y et al (both with categories based on TIGRFAMs⁵⁰). The mapping between COG categories and KEGG pathways (used in our study), SEED, and TIGRFAMs can be found in Extended Data Table 1.

For our study, we considered enrichment data from the most recent transfers i.e., gene distance bins 0.00 - 0.01, 0.00 - 0.05, and 0.00 - 0.25. These three gene distance bins together with three species distance bins provided us with nine data points to consider for each functional category. We assigned a functional category to have strong evidence for enrichment or depletion in transfers if at least seven of the nine data points showed significant enrichment or depletion. We assigned a functional category to have weak evidence for enrichment or depletion in transfers if most data points showed enrichment or depletion, but this was not always statistically significant.

For Jeong H et al, we considered the results depicted in Fig. 8d and Supplementary Table 13 of the article. We calculated the first and third quartiles of the HGT index using all genes in Supplementary Table 13. We assigned a functional category to have strong evidence for enrichment in transfers if the median HGT index from genes in this category was greater than the third quartile. We assigned a functional category to have strong evidence for depletion in transfers if the median HGT index from genes in this category was less than the first quartile. Only functional categories containing at least five genes were considered.

For Song W et al, we considered the results depicted in Fig. 9 of the article. We considered only recent HGT events (≥99% nucleotide sequence identity). We assigned a functional category to have strong evidence for enrichment in transfers if the median recent HGTs in this category was greater than the third quartile. We assigned a functional category to have strong evidence for depletion in transfers if the median recent HGTs in this category was less than the first quartile.

For Oliveira PH et al, we considered the results depicted in Fig. 4a (HTgenes row) of the article. We considered a functional category to have strong evidence for enrichment or depletion in transfers if the observed-to-expected ratio of orthologous groups was significantly different from one.

For Popa O et al, we considered the results depicted in Supplementary Fig. 7 of the article. We considered a functional category to have strong evidence for enrichment or depletion in transfers if the relative proportion of transferred genes was significantly over- or underrepresented when compared to the set of all bacterial genes.

For Cohen O et al, we considered the results depicted in the first two columns of Table 3 of the article. We considered a functional category to be enriched in transfers if its relative transferability was higher than one, and to be depleted in transfers if its relative transferability was lower than one. We used a p-value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion.

For Cohen O and Pupko T, we considered the results depicted in Table 2 of the article. In this table, the authors listed functional categories that significantly differed from the background of all gene families. We used a p-value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion.

For Cordero O et al, we considered the results depicted in Fig. 4b of the article. We used Z-score cut-offs of 2 and -2 to distinguish strong and weak evidence for enrichment or depletion.

For Sheinman M et al, we considered the results depicted in Supplementary File 6 (SEED Level 1 and SEED Level 2) of the article. We used a p-value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion. We downweighted depletion evidence for the “Transcription (regulatory)” and “Signal transduction” categories as they both mapped to “Regulation and cell signalling” in the SEED. For COG categories that mapped to multiple categories in the SEED, we indicated evidence based on the consensus from these categories.

For Paquola ACM et al, we considered the results depicted in Table 2 of the article. We downweighted depletion evidence for “Cell cycle control and mitosis” and “Cell motility” as they both mapped to “Cellular processes” in TIGRFAMs. We also downweighted enrichment evidence for “Carbohydrate transport and metabolism” as there was no one-to-one mapping for this category.

For Nakamura Y et al, we considered the results depicted in Fig. 2 of the article. We considered a functional category to be enriched in transfers if the proportion of transferred genes was greater than 10%, and to be depleted in transfers if the proportion of transferred genes was less than 3%.

Co-occurrence analysis

An OTU was detected as present in a sample if its relative abundance was at least 0.01%. To calculate the co-occurrence between two OTUs, we counted the number of samples in which both OTUs were present and divided it by the number of samples the less prevalent OTU was present in. Phylogenetic distances between OTUs were retrieved from the MicrobeAtlas database 16S rRNA tree using the dist function in ETE Toolkit v3⁴⁰.
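
The co-occurrence measure can be written as a short sketch, assuming each OTU is represented by a mapping from sample identifier to relative abundance expressed as a fraction (so 0.01% is 0.0001). The function names and the return value of 0.0 for an OTU that is present in no sample are assumptions.

```python
from typing import Dict, Set

PRESENCE_THRESHOLD = 0.0001  # 0.01% relative abundance, as a fraction


def present_samples(abundance_by_sample: Dict[str, float]) -> Set[str]:
    """Samples in which the OTU reaches the presence threshold."""
    return {s for s, a in abundance_by_sample.items() if a >= PRESENCE_THRESHOLD}


def co_occurrence(otu_a: Dict[str, float], otu_b: Dict[str, float]) -> float:
    """Shared samples divided by the sample count of the less prevalent OTU."""
    samples_a = present_samples(otu_a)
    samples_b = present_samples(otu_b)
    denominator = min(len(samples_a), len(samples_b))
    if denominator == 0:
        return 0.0  # assumption: undefined case reported as zero
    return len(samples_a & samples_b) / denominator


otu_1 = {"s1": 0.01, "s2": 0.0, "s3": 0.002}
otu_2 = {"s1": 0.5, "s2": 0.3, "s3": 0.00001, "s4": 0.2}
print(co_occurrence(otu_1, otu_2))
```

Dividing by the sample count of the less prevalent OTU means that a rare OTU found only where a common one occurs still reaches a co-occurrence of 1.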

For modelling the relationship between co-occurrence and phylogenetic distance, we only considered OTUs that exchanged at least one gene with 30 other OTUs and OTU pairs in which both OTUs were present in at least 20 environmental samples. The power law equation (1) or the exponential decay equation (2) was used to model the relationship:

$$CO = k \cdot PD^{a} \quad (1)$$

$$CO = N_0 \cdot e^{-\lambda \cdot PD} \quad (2)$$

where CO stands for co-occurrence, PD stands for phylogenetic distance, and k, a, N_0 and λ are parameters fitted using the nlstools package in R⁵¹. Model residuals were then used to calculate the Spearman correlation with the number of genes transferred. To generate the background distribution, the number of genes was shuffled prior to calculating the Spearman correlation. The resulting distributions of Spearman correlations generated based on raw co-occurrence (pre-correction), model residuals (post-correction) or background were compared to each other with the Mann-Whitney U test.

The analysis depicted in Fig. 4b, 4c and 4d was performed using a similar setup to that described in “Gene and species distances normalisation”. We used the ≥7 genes transferred cut-off to denote OTU pairs with many transfer events as this corresponded to the 80% quantile of OTU pairs with at least one gene transferred.

Interaction prediction and analysis

Global networks of predicted interactions were computed with FlashWeave v0.19.0³⁷. The parameters used were: sensitive = false, heterogeneous = true, max_k = 3 (with confounder correction) or max_k = 0 (without confounder correction). We used co-occurrence data from all 95,422 OTUs contained within the environmental sample data set, filtering the resulting network for edges between the 4380 OTUs for which transfer event data were generated. OTU pairs with a score higher than zero were considered as interacting. To normalise for differences in phylogenetic distance and co-occurrence distributions between species with at least seven genes transferred and species with zero or one gene transferred, the procedure described in “Gene and species distances normalisation” was performed with simultaneous subsampling on phylogenetic distance and co-occurrence for 80 × 80 bins.

Abundance analysis

We used the same relative abundance numbers as calculated in “Preferred habitat assignment”. For each OTU, we only considered its abundance within its preferred environment, denoting high-abundance OTUs as those whose abundance was above the 80% quantile in this environment. In contrast, we denoted low-abundance OTUs as those whose abundance was below the 20% quantile in this environment. OTU pairs were then sorted based on phylogenetic distance and the fraction of OTU pairs with at least one transfer event detected was calculated for each phylogenetic distance bin. The uncertainty was calculated using Bernoulli’s uncertainty principle. Resulting fractions were then pairwise compared between the High-High, High-Low and Low-Low groups using the Wilcoxon Rank Sum test. Resulting p-values were corrected for multiple testing using the Benjamini-Hochberg method.

Generalist and specialist analysis

We computed a generalism index for each OTU reflecting its environmental flexibility. This index was calculated based on the entropy of the OTU’s abundance values across the four major environments (animal, aquatic, soil, plant). OTUs with similar abundances across environments had higher entropy. OTUs with uneven abundances across environments (a higher abundance in one or a few of the environments compared to the rest) had lower entropy.
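
One way to compute such an index is the Shannon entropy of the OTU's abundances after normalising them to proportions. The sketch below assumes this definition and a base-2 logarithm; neither the normalisation nor the log base is specified in the text, and the function name `generalism_index` is hypothetical.

```python
from math import log2
from typing import Sequence


def generalism_index(abundances: Sequence[float]) -> float:
    """Shannon entropy (bits) of abundances across environments.

    `abundances` holds one value per environment, here in the order
    animal, aquatic, soil, plant. Zero abundances contribute nothing.
    """
    total = sum(abundances)
    if total <= 0:
        raise ValueError("OTU has no abundance in any environment")
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * log2(p) for p in proportions)


print(generalism_index([1.0, 1.0, 1.0, 1.0]))  # even across four environments
print(generalism_index([1.0, 0.0, 0.0, 0.0]))  # confined to one environment
```

With four environments and base-2 logarithms the index ranges from 0 (an OTU found in a single environment) to 2 (equal abundance in all four).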

To compare inter-environmental transfers, we selected 200 OTUs assigned to each environment (see “Preferred habitat assignment”) that displayed the highest entropy (“generalists”) and 200 OTUs that displayed the lowest entropy (“specialists”). OTU pairs were then subsampled in such a way that phylogenetic distance distributions were equal between all environments and between “generalists”, “specialists”, and all species. We then counted the fraction of OTU pairs with at least one transfer event detected. To generate the background expectation, OTU pairs from all species were subsampled to the target phylogenetic distance distribution 1000 times. We then fit a normal distribution to the generated data using the fitdistr function in R⁵² to get an estimate of the expected mean, standard deviation and range of transfer rates between different environments.

Data visualisation

Data from Figs 2, 4b, 4c and Extended Figs 1-5 were visualised using seaborn v0.11.2⁵³ and matplotlib v3.5.1⁵⁴ in Python v3.7.4. Data from Figs 3, 4a, 5 and Extended Figs 6, 7 were visualised using ggplot2 v3.3.5⁵⁵ in R v4.1.1.

Data availability

The data sets generated during the study have been uploaded to Figshare, where they will eventually be accessible through the following link: https://doi.org/10.6084/m9.figshare.22893632.

Code availability

The scripts used for data set generation and analysis can be found through the following link: https://github.com/marydmit/eco_evolutionary_factors_and_hgt.

Author contributions

M.D., J.T., J.F.M.R., L.P.C. and C.v.M. conceived and designed the study. M.D., J.T., J.F.M.R., J.H.C. and L.P.C. generated the data. M.D. performed the statistical analyses and generated the visualisations. L.P.C. and C.v.M. supervised the study. M.D. wrote the first draft of the manuscript, with input from L.P.C. and C.v.M. All authors contributed to revising and editing the final manuscript.

Acknowledgements

We would like to thank Daniel R. Mende (Amsterdam UMC), Thomas S.B. Schmidt (EMBL), Askarbek Orakov (EMBL) and Oleksandr M. Maistrenko (NIOZ) for providing data from proGenomes v2 and answering our questions about the dataset. We would also like to thank Mark Robinson (UZH), Simone Tiberi (University of Bologna) and Pierre-Luc Germain (UZH & ETH Zurich) for statistical advice. Finally, we would like to thank Anna Cuscó (Fudan University) for providing comments on an earlier version of the manuscript. M.D. and C.v.M. were supported by the Swiss National Science Foundation (No. 310030-192569). L.P.C. was partially supported by a Shanghai Municipal Science and Technology Major Project (No. 2018SHZDZX01) and the 111 Project (No. B18015). J.H.C. was supported by Project PID2021-127210NB-I00 (MCIN/AEI/10.13039/501100011033/FEDER, UE).

References

Puigbò, P., Lobkovsky, A. E., Kristensen, D. M., Wolf, Y. I. & Koonin, E. V. Genomes in turmoil: quantification of genome dynamics in prokaryote supergenomes. BMC Biol 12, 66 (2014).
Treangen, T. J. & Rocha, E. P. C. Horizontal Transfer, Not Duplication, Drives the Expansion of Protein Families in Prokaryotes. PLOS Genetics 7, e1001284 (2011).
Dagan, T. & Martin, W. Ancestral genome sizes specify the minimum rate of lateral gene transfer during prokaryote evolution. Proceedings of the National Academy of Sciences 104, 870–875 (2007).
Dagan, T., Artzy-Randrup, Y. & Martin, W. Modular networks and cumulative impact of lateral transfer in prokaryote genome evolution. Proceedings of the National Academy of Sciences 105, 10039–10044 (2008).
Sorek, R. et al. Genome-Wide Experimental Determination of Barriers to Horizontal Gene Transfer. Science 318, 1449–1452 (2007).
Wolf, Y. I., Makarova, K. S., Lobkovsky, A. E. & Koonin, E. V. Two fundamentally different classes of microbial genes. Nat Microbiol 2, 1–6 (2016).
Sela, I., Wolf, Y. I. & Koonin, E. V. Theory of prokaryotic genome evolution. Proceedings of the National Academy of Sciences 113, 11399–11407 (2016).
Ravenhall, M., Škunca, N., Lassalle, F. & Dessimoz, C. Inferring Horizontal Gene Transfer. PLOS Computational Biology 11, e1004095 (2015).
Nakamura, Y., Itoh, T., Matsuda, H. & Gojobori, T. Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat Genet 36, 760–766 (2004).
Lawrence, J. G. & Ochman, H. Amelioration of Bacterial Genomes: Rates of Change and Exchange. J Mol Evol 44, 383–397 (1997).
Smillie, C. S. et al. Ecology drives a global network of gene exchange connecting the human microbiome. Nature 480, 241–244 (2011).
Brito, I. L. et al. Mobile genes in the human microbiome are structured from global to individual scales. Nature 535, 435–439 (2016).
Sheinman, M. et al. Identical sequences found in distant genomes reveal frequent horizontal transfer across the bacterial domain. eLife 10, e62719 (2021).
Fondi, M. et al. “Every Gene Is Everywhere but the Environment Selects”: Global Geolocalization of Gene Sharing in Environmental Samples through Network Analysis. Genome Biology and Evolution 8, 1388–1400 (2016).
Zhou, H., Beltrán, J. F. & Brito, I. L. Functions predict horizontal gene transfer and the emergence of antibiotic resistance. Science Advances 7, eabj5056 (2021).
Groussin, M. et al. Elevated rates of horizontal gene transfer in the industrialized human microbiome. Cell 184, 2053–2067.e18 (2021).
Kloesges, T., Popa, O., Martin, W. & Dagan, T. Networks of Gene Sharing among 329 Proteobacterial Genomes Reveal Differences in Lateral Gene Transfer Frequency at Different Phylogenetic Depths. Mol Biol Evol 28, 1057–1074 (2011).
Jeong, H., Arif, B., Caetano-Anollés, G., Kim, K. M. & Nasir, A. Horizontal gene transfer in human-associated microorganisms inferred by phylogenetic reconstruction and reconciliation. Sci Rep 9, 1–18 (2019).
Choi, Y. et al. HGTree v2.0: a comprehensive database update for horizontal gene transfer (HGT) events detected by the tree-reconciliation method. Nucleic Acids Research 51, D1010–D1018 (2023).
Mende, D. R. et al. proGenomes2: an improved database for accurate and consistent habitat, taxonomic and functional annotations of prokaryotic genomes. Nucleic Acids Res (2019) doi:10.1093/nar/gkz1002.
Khedkar, S. et al. Landscape of mobile genetic elements and their antibiotic resistance cargo in prokaryotic genomes. Nucleic Acids Research 50, 3155–3168 (2022).
Thomas, C. M. & Nielsen, K. M. Mechanisms of, and Barriers to, Horizontal Gene Transfer between Bacteria. Nat Rev Microbiol 3, 711–721 (2005).
Bansal, M. S., Kellis, M., Kordi, M. & Kundu, S. RANGER-DTL 2.0: rigorous reconstruction of gene-family evolution by duplication, transfer and loss. Bioinformatics 34, 3214–3216 (2018).
McInerney, J. O., McNally, A. & O’Connell, M. J. Why prokaryotes have pangenomes. Nat Microbiol 2, 1–5 (2017).
Maistrenko, O. M. et al. Disentangling the impact of environmental and phylogenetic constraints on prokaryotic within-species diversity. ISME J 14, 1247–1259 (2020).
N’Guessan, A., Brito, I. L., Serohijos, A. W. R. & Shapiro, B. J. Mobile Gene Sequence Evolution within Individual Human Gut Microbiomes Is Better Explained by Gene-Specific Than Host-Specific Selective Pressures. Genome Biology and Evolution 13, evab142 (2021).
Paquola, A. C. M. et al. Horizontal Gene Transfer Building Prokaryote Genomes: Genes Related to Exchange Between Cell and Environment are Frequently Transferred. J Mol Evol 86, 190–203 (2018).
Cordero, O. X. & Hogeweg, P. The impact of long-distance horizontal gene transfer on prokaryotic genome size. Proceedings of the National Academy of Sciences 106, 21748–21753 (2009).
Cohen, O., Gophna, U. & Pupko, T. The Complexity Hypothesis Revisited: Connectivity Rather Than Function Constitutes a Barrier to Horizontal Gene Transfer. Molecular Biology and Evolution 28, 1481–1489 (2011).
Popa, O., Landan, G. & Dagan, T. Phylogenomic networks reveal limited phylogenetic range of lateral gene transfer by transduction. ISME J 11, 543–554 (2017).
Oliveira, P. H., Touchon, M., Cury, J. & Rocha, E. P. C. The chromosomal organization of horizontal gene transfer in bacteria. Nat Commun 8, 841 (2017).
Cohen, O. & Pupko, T. Inference and Characterization of Horizontally Transferred Gene Families Using Stochastic Mapping. Molecular Biology and Evolution 27, 703–713 (2010).
Song, W., Wemheuer, B., Zhang, S., Steensen, K. & Thomas, T. MetaCHIP: community-level horizontal gene transfer identification through the combination of best-match and phylogenetic approaches. Microbiome 7, 36 (2019).
Huerta-Cepas, J. et al. eggNOG 5.0: a hierarchical, functionally and phylogenetically annotated orthology resource based on 5090 organisms and 2502 viruses. Nucleic Acids Research 47, D309–D314 (2019).
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y. & Morishima, K. KEGG: new perspectives on genomes, pathways, diseases and drugs. Nucleic Acids Research 45, D353–D361 (2017).
Tamames, J., Sánchez, P. D., Nikel, P. I. & Pedrós-Alió, C. Quantifying the Relative Importance of Phylogeny and Environmental Preferences As Drivers of Gene Content in Prokaryotic Microorganisms. Frontiers in Microbiology 7, (2016).
Tackmann, J., Matias Rodrigues, J. F. & von Mering, C. Rapid Inference of Direct Interactions in Large-Scale Ecological Networks from Heterogeneous Microbial Sequencing Data. Cell Systems 9, 286–296.e8 (2019).
Orakov, A. et al. GUNC: detection of chimerism and contamination in prokaryotic genomes. Genome Biology 22, 178 (2021).
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat Biotechnol 36, 996–1004 (2018).
Huerta-Cepas, J., Serra, F. & Bork, P. ETE 3: Reconstruction, Analysis, and Visualization of Phylogenomic Data. Mol Biol Evol 33, 1635–1638 (2016).
Coelho, L. P. et al. Towards the biogeography of prokaryotic genes. Nature 601, 252–256 (2022).
Steinegger, M. & Söding, J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 35, 1026–1028 (2017).
Katoh, K. & Standley, D. M. MAFFT Multiple Sequence Alignment Software Version 7: Improvements in Performance and Usability. Molecular Biology and Evolution 30, 772–780 (2013).
Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2 – Approximately Maximum-Likelihood Trees for Large Alignments. PLOS ONE 5, e9490 (2010).
Sayers, E. W. et al. Database resources of the National Center for Biotechnology Information in 2023. Nucleic Acids Research 51, D29–D38 (2023).
Matias Rodrigues, J. F., Schmidt, T. S. B., Tackmann, J. & von Mering, C. MAPseq: highly efficient k-mer search with confidence estimates, for rRNA sequence analysis. Bioinformatics 33, 3808–3810 (2017).
Chao, A. & Jost, L. Coverage-based rarefaction and extrapolation: standardizing samples by completeness rather than size. Ecology 93, 2533–2547 (2012).
Seemann, T. barrnap 0.9: rapid ribosomal RNA prediction. (2018).
Overbeek, R. et al. The SEED and the Rapid Annotation of microbial genomes using Subsystems Technology (RAST). Nucleic Acids Research 42, D206–D214 (2014).
Haft, D. H. et al. TIGRFAMs and Genome Properties in 2013. Nucleic Acids Research 41, D387–D395 (2013).
Baty, F. et al. A Toolbox for Nonlinear Regression in R: The Package nlstools. Journal of Statistical Software 66, 1–21 (2015).
Delignette-Muller, M. L. & Dutang, C. fitdistrplus: An R Package for Fitting Distributions. Journal of Statistical Software 64, 1–34 (2015).
Waskom, M. L. seaborn: statistical data visualization. Journal of Open Source Software 6, 3021 (2021).
Hunter, J. D. Matplotlib: A 2D Graphics Environment. Computing in Science & Engineering 9, 90–95 (2007).
Wickham, H. ggplot2: Elegant Graphics for Data Analysis. (Springer-Verlag New York, 2016).

There is NO Competing Interest.

Extended Data Figure 1. Analysis of factors contributing to interspecies variability in rates of HGT. (a) The fraction of genes with a transfer event detected plotted against the total number of genomes used for generating the pangenome. The regression line and 95% confidence intervals are depicted in purple. (b) The fraction of genes with a transfer event detected plotted against the total number of genes in the genome that were analysed for the corresponding species. The regression line and 95% confidence intervals are depicted in purple. (c) The fraction of genes with a recent transfer event (i.e., involving gene pairs with ≥99% nucleotide identity) detected compared across species’ preferred environments. (d) The fraction of genes with a transfer event detected compared across species’ preferred environments. In both (c) and (d), outliers are not shown and “***” indicates P_{Mann-Whitney U} ≤ 0.001.
Extended Data Figure 2. Analysis of gene ubiquity and its relationship to HGT. (a) Distribution of gene ubiquity (expressed as the fraction of genomes in the species that carry the gene) in putative donor species for gene pairs with (green) and without (brown) transfers. (b) Comparison of gene ubiquity of gene pairs participating in HGT across different gene distance bins. More recent transfer events are depicted in dark green, while old transfers are depicted in light green.
Extended Data Figure 3. Functional enrichment analysis performed on selected bins depicted in Fig. 2e for remaining COG categories not depicted in Fig. 2f. Blue indicates a higher-than-expected number of transfers, while red indicates a lower-than-expected number of transfers. Species distance bins are 1 (0.08 - 0.72), 2 (0.72 - 1.36), and 3 (1.36 - 2.00). Gene distance bins are a (0.50 - 0.75), b (0.25 - 0.50), c (0.00 - 0.25), d (0.00 - 0.05), and e (0.00 - 0.01). Significance post-multiple testing correction is indicated with stars.
Extended Data Figure 4. Functional enrichment analysis performed on selected bins depicted in Fig. 2e for KEGG pathway maps. Blue indicates a higher-than-expected number of transfers, while red indicates a lower-than-expected number of transfers. Species distance bins are 1 (0.08 - 0.72), 2 (0.72 - 1.36), and 3 (1.36 - 2.00). Gene distance bins are a (0.50 - 0.75), b (0.25 - 0.50), c (0.00 - 0.25), d (0.00 - 0.05), and e (0.00 - 0.01). Significance post-multiple testing correction is indicated with stars.
Extended Data Figure 5. Functional enrichment analysis performed on selected bins depicted in Fig. 2e for BRITE hierarchy “Antimicrobial resistance genes”. Blue indicates a higher-than-expected number of transfers, while red indicates a lower-than-expected number of transfers. Species distance bins are 1 (0.08 - 0.72), 2 (0.72 - 1.36), and 3 (1.36 - 2.00). Gene distance bins are a (0.50 - 0.75), b (0.25 - 0.50), c (0.00 - 0.25), d (0.00 - 0.05), and e (0.00 - 0.01). Significance post-multiple testing correction is indicated with stars.
Extended Data Figure 6. After correction for phylogenetic signal with the exponential decay equation, co-occurring species are more likely to participate in HGT. For each OTU and its partners in HGT, the relationship between co-occurrence and phylogenetic distance is modelled with the exponential decay equation. Model residuals on co-occurrence are correlated with the number of genes transferred. The resulting distribution of correlations for 3755 OTUs is depicted in orange, with the distribution of correlations prior to removing the phylogenetic signal depicted in yellow (same distribution as in Fig. 4a) and the distribution using randomised HGT data depicted in grey.
Extended Data Figure 7. Distribution of OTU abundances in four main environments in the MicrobeAtlas database - animal (red), aquatic (blue), plant (green) and soil (orange). Vertical lines depict the 20% and 80% quantiles of the abundance distribution, with low abundance OTUs lying to the left and high abundance OTUs lying to the right.
Extended Data Table 1. Mapping between COG functional categories and functional categories in KEGG (used in our study), SEED (used in Sheinman M et al), and TIGRFAMs (used in Paquola ACM et al and Nakamura Y et al).
Extended Data Table 2. Multiple-testing corrected p-values obtained after performing Wilcoxon Rank Sum test on curves depicted in Fig. 5a. The two curves being compared are indicated in the first two columns.
