Genome selection and pangenome generation
We based our analysis on the proGenomes v2.2 dataset containing 82,400 genomes grouped into 11,562 species (i.e., specI clusters) that were defined based on 40 single-copy marker genes20. The corresponding species tree generated based on concatenated marker gene sequences was kindly provided by the authors of the proGenomes paper.
From this initial selection, we filtered out metagenome-assembled genomes, single amplified genomes, genomes flagged as chimeric by GUNC38, genomes that were not taxonomically cohesive with the rest of the specI cluster according to GTDB39, genomes with no 16S rRNA gene sequence detected, and genomes we couldn’t confidently map to the MicrobeAtlas database (see “Mapping Genomes to MicrobeAtlas database OTUs” below). The species tree was pruned to remove these genomes using the ETE Toolkit v340. As a result, we obtained 78,315 genomes grouped into 8790 species. For each species, a pangenome was generated by clustering all gene sequences on 95% nucleotide sequence identity as described in 41.
HGT event detection
All gene sequences were clustered using MMseqs242 with minimum overlap of 50%, minimum identity threshold of 80% and clustering mode “0”. The rest of the parameters were left as default. For each gene cluster, whenever sequences originated from more than one genome within a species, we only retained sequences that had the highest average nucleotide identity with sequences from other species within the gene cluster. Sequences were then aligned using the automatic strategy selection option in MAFFT v7.47143, all other parameters left as default. Based on the multiple sequence alignment, a gene tree was generated using FastTree v2.1.1144 using the GTR model of nucleotide evolution, all parameters left as default.
Prior to performing tree reconciliation, we subsampled the species tree using ETE Toolkit v340 to decrease computational requirements in the following manner: for each gene cluster, the species tree node corresponding to the last common ancestor of all species within the gene cluster was selected. Clades within the species tree not containing any genes from the gene cluster were collapsed. Subsequently, the subsampled species tree was used to root the gene tree using the OptRoot module from RANGER-DTL v2.023. We then ran RANGER-DTL with default settings to perform gene and species tree reconciliation for a total of 500 times. Gene clusters in which more than 50 optimal roots were detected were not considered further. Reconciliations from each optimal root were aggregated using the AggregateRanger_recipient module from RANGER-DTL v2.0. We used a custom script to aggregate results across optimal roots and detect tree nodes that were labelled as transfers. For downstream analysis, we considered only transfer events detected in ≥80% reconciliations that contained gene pairs with ≥0.5 minimum branch support in the gene tree. In addition, all multifurcations containing 100% identical genes from different species were considered to be transfer events.
Calculating average fraction of genes transferred
For each genome, we counted a gene as having undergone transfer if its pangenome-representative gene was involved in a transfer event. We counted a gene as assessed by the pipeline if its pangenome-representative gene was present in the pipeline output. The number of genes transferred was then divided by the number of genes assessed by the pipeline and the average based on all genomes within a species was calculated. For the examples mentioned in the main text, we used data from specI_v3_Cluster259 for Acinetobacter baumannii and data from specI_v3_Cluster712 for Lysteria monocytogenes.
MicrobeAtlas data retrieval
The NCBI sequence read archive (SRA)45 was searched for samples and studies containing any of the keywords “metagenomic”, “microb*”, “bacteria”, or “archaea” in their metadata. The corresponding raw sequence data (as of 7th of March 2020) were downloaded and quality filtered. To assign OTU labels, quality filtered data were mapped to MAPref v2.2.1 using MAPseq v1.0 at a ≥0.5 confidence level46. We then filtered out samples containing less than 1000 reads and 20 97% OTUs and retained samples with at least 90% community coverage (calculated based on the formula in 47).
NCBI SRA sample metadata were parsed to classify every sample into four general environments: animal, aquatic, plant, and soil. Subsequently, we calculated Bray-Curtis distances between all samples in the dataset and compared community compositions in samples from independent studies. When a sample was consistently similar to samples assigned to a different environment, we adjusted its environment label. In cases where samples with similar community compositions had no general agreement between assigned environments, we removed the environmental label.
Mapping genomes to MicrobeAtlas database OTUs
We used barrnap48 with default settings to predict 16S rRNA gene sequences in the genome selection, proceeding with sequences of ≥50% of expected length. The sequences were then mapped to MAPref v2.2.1 using MAPseq v1.046, retaining only sequences that mapped to an OTU with ≥0.3 confidence level. Genomes containing multiple 16S rRNA gene copies were mapped to OTUs based on majority rule (≥50% copies) or high confidence (at least one copy with 0.98 confidence level). Species containing multiple genomes were mapped to OTUs based on majority rule (≥50% genomes).
Preferred habitat assignment
For each OTU within the dataset, the average abundance was calculated separately for all samples assigned to the animal, aquatic, plant, and soil environments. The OTU was then assigned to its preferred environment based on the highest of the four numbers.
Gene and species distances normalisation
Distances between gene and species pairs were extracted from the corresponding trees using the dist function in ETE Toolkit v340. To plot the distribution in Fig. 2a, only gene pairs with ≥0.5 minimum branch support values and ≥50% sequence overlap were considered. Gene pairs with and without transfer events were normalised with respect to species distance by splitting the species distance distributions into 80 bins and subsampling the group with the larger number of pairs in each bin (either “Transfer Detected” or “No Transfer Detected”) to the number of pairs in the second group in the corresponding bin (either “No Transfer Detected” or “Transfer Detected”). The same procedure was performed for normalising gene pairs with and without transfer events with respect to gene distance. After normalisation, the resulting distributions were compared using the Mann-Whitney U test.
Pangenome analysis
To calculate gene ubiquity, we counted the number of genomes represented by a gene in each pangenome versus the total number of genomes in the species. For subsequent analysis, only species encompassing ≥10 genomes were considered. We used previously defined thresholds25 to distinguish extended core genes (≥90% gene ubiquity) and cloud genes (≤15% gene ubiquity). In the species pair participating in HGT, the species with the higher gene ubiquity was labelled as the putative donor, while the species with the lower gene ubiquity was labelled as the putative recipient. To compare extended core and cloud genes with or without transfer events, Fisher's exact test was performed.
Genome annotation and functional enrichment analysis
We used the COG category and KEGG pathway functional annotations provided by the proGenomes database after running eggNOG-mapper for eggNOG 5.034. Each gene cluster was annotated to the corresponding functional categories based on the union of all gene annotations within the cluster. We calculated a functional category’s background expectation fraction by counting the total number of genes that passed the pipeline that were annotated to this category divided by the total number of genes that passed the pipeline.
For each detected transfer event, we calculated the average species and gene distance by taking all average pairwise distances between left descendants and right descendants of the transfer event (for gene distance calculations, only gene pairs with ≥50% sequence overlap were considered). The resulting distribution of species and gene distances can be seen in Fig. 2e. For functional enrichment analysis, minimum and maximum species and gene distance cut-offs were selected in such a way that there were no bins without observations, the resulting area divided into thirds. We also looked specifically at transfer events at the 0.01 and 0.05 gene distance cut-offs (approximately ≥99% and ≥95% sequence identity respectively) as these results would be more comparable to previous studies that detected HGT events based on nearly identical sequences. We then counted the number of transfer events annotated to each functional category divided by the total number of transfer events in the area. The observed fraction of events annotated to a specific function was then tested with the binomial test against the fraction of all genes that the pipeline was run on that were annotated to this function. Resulting p-values were corrected for multiple testing using the Holm-Sidak method.
A similar procedure was performed using KEGG ortholog annotations, grouping them into KEGG pathway maps (09101 - 09145) for Extended Data Fig. 4 and antimicrobial resistance genes (BR:ko01504) for Extended Data Fig. 5.
Functional repertoire comparison with previous studies
We compared our functional enrichment analysis results with those from Jeong H et al18, Song W et al33, Oliveira PH et al31, Popa O et al30, Cohen O et al29, Cohen O and Pupko T32, Cordero O et al28, Sheinman M et al13, Paquola ACM et al27, and Nakamura Y et al9. In most of these studies, functional categories were based on the COG database, with the exception of Sheinman M et al (with categories based on the SEED49), and Paquola ACM et al and Nakamura et al (both with categories based on TIGRFAMs50). The mapping between COG categories and KEGG pathways (used in our study), SEED, and TIGRFAMs can be found in Extended Table 2.
For our study, we considered enrichment data from the most recent transfers i.e., gene distance bins 0.00 - 0.01, 0.00 - 0.05, and 0.00 - 0.25. These three gene distance bins together with three species distance bins provided us with nine data points to consider for each functional category. We assigned a functional category to have strong evidence for enrichment or depletion in transfers if at least seven of the nine data points showed significant enrichment or depletion. We assigned a functional category to have weak evidence for enrichment or depletion in transfers if most data points showed enrichment or depletion, but this was not always statistically significant.
For Jeong H et al, we considered the results depicted in Fig. 8d and Supplementary Table 13 of the article. We calculated the first and third quartiles of the HGT index using all genes in Supplementary Table 13. We assigned a functional category to have strong evidence for enrichment in transfers if the median HGT index from genes in this category was greater than the third quartile. We assigned a functional category to have strong evidence for depletion in transfers if the median HGT index from genes in this category was less than the first quartile. Only functional categories containing at least five genes were considered.
For Song W et al, we considered the results depicted in Fig. 9 of the article. We considered only recent HGT events (≥99% nucleotide sequence identity). We assigned a functional category to have strong evidence for enrichment in transfers if the median recent HGTs in this category was greater than the third quartile. We assigned a functional category to have strong evidence for depletion in transfers if the median recent HGTs in this category was less than the first quartile.
For Oliveira PH et al, we considered the results depicted in Fig. 4a (HTgenes row) of the article. We considered a functional category to have strong evidence for enrichment or depletion in transfers if the observed-to-expected ratio of orthologous groups was significantly different from one.
For Popa O et al, we considered the results depicted in Supplementary Fig. 7 of the article. We considered a functional category to have strong evidence for enrichment or depletion in transfers if the relative proportion of transferred genes was significantly over- or underrepresented when compared to the set of all bacterial genes.
For Cohen O et al, we considered the results depicted in the first two columns of Table 3 of the article. We considered a functional category to be enriched in transfers if its relative transferability was higher than one, and to be depleted in transfers if its relative transferability was lower than one. We used a p-value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion.
For Cohen O and Pupko T, we considered the results depicted in Table 2 of the article. In the table authors listed functional categories that significantly differed from the background of all gene families. We used a p-value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion.
For Cordero O et al, we considered the results depicted in Fig. 4b of the article. We used Z-score cut-offs of 2 and -2 to distinguish strong and weak evidence for enrichment or depletion.
For Sheinman M et al, we considered the results depicted in Supplementary File 6 (SEED Level 1 and SEED Level 2) of the article. We used a p-value cut-off of 0.05 to distinguish strong and weak evidence for enrichment or depletion. We downweighed depletion evidence for the “Transcription (regulatory)” and “Signal transduction” categories as they both mapped to “Regulation and cell signalling” in the SEED. For COG categories that mapped to multiple categories in the SEED, we indicated evidence based on the consensus from these categories.
For Paquola et al, we considered the results depicted in Table 2 of the article. We downweighed depletion evidence for “Cell cycle control and mitosis” and “Cell motility” as they both mapped to the “Cellular processes” in TIGRFAMs. We also downweighed enrichment evidence for “Carbohydrate transport and metabolism” as there was no one-to-one mapping for this category.
For Nakamura et al, we considered the results depicted in Fig. 2 of the article. We considered a functional category to be enriched in transfers if the proportion of transferred genes was greater than 10%, and to be depleted in transfers if the proportion of transferred genes was less than 3%.
Co-occurrence analysis
An OTU was detected as present in a sample if its relative abundance was at least 0.01%. To calculate the co-occurrence between two OTUs, we counted the number of samples in which both OTUs were present and divided it by the number of samples the less prevalent OTU was present in. Phylogenetic distances between OTUs were retrieved from the MicrobeAtlas database 16S rRNA tree using the dist function in ETE Toolkit v340.
For modelling the relationship between co-occurrence and phylogenetic distance, we only considered OTUs that exchanged at least one gene with 30 other OTUs and OTU pairs in which both OTUs were present in at least 20 environmental samples. The power law equation (1) or the exponential decay equation (2) were used to model the relationship:
Where CO stands for co-occurrence, PD stands for phylogenetic distance, and k, a, N0 and λ are parameters fitted using the nlstools package in R51. Model residuals were then used to calculate the Spearman correlation with the number of genes transferred. To generate the background distribution, the number of genes was shuffled prior to calculating the Spearman correlation. The resulting distributions of Spearman correlations generated based on raw co-occurrence (pre-correction), model residuals (post-correction) or background were compared to each other with the Mann-Whitney U test.
The analysis depicted in Fig. 4b, 4c and 4d has been performed using a similar set up as described in “Gene and species distance normalisation”. We used the ≥7 genes transferred cut-off to denote OTU pairs with many transfer events as this corresponded to the 80% quantile of OTU pairs with at least one gene transferred.
Interaction prediction and analysis
Global networks of predicted interactions were computed with FlashWeave v0.19.037. The parameters used were: sensitive = false, heterogeneous = true, max_k = 3 (with confounder correction) or max_k = 0 (without confounder correction). We used co-occurrence data from all 95,422 OTUs contained within the environmental sample data set, filtering the resulting network for edges between the 4380 OTUs for which transfer event data were generated. OTU pairs with a score higher than zero were considered as interacting. To normalise for differences in phylogenetic distance and co-occurrence distributions between species with at least seven genes transferred and species with zero or one gene transfer, the procedure described in “Gene and species distance normalisation” was performed with simultaneous subsampling on phylogenetic distance and co-occurrence for 80x80 bins.
Abundance analysis
We used the same relative abundance numbers as calculated in “Preferred environment assignment”. For each OTU, we only considered its abundance within its preferred environment, denoting high-abundance OTUs as those whose abundance was above the 80% quantile in this environment. In contrast, we denoted low-abundance OTUs as those whose abundance was below the 20% quantile in this environment. OTU pairs were then sorted based on phylogenetic distance and the fraction of OTU pairs with at least one transfer event detected was calculated for each phylogenetic distance bin. The uncertainty was calculated using Bernoulli’s uncertainty principle. Resulting fractions were then pairwise compared between the High-High, High-Low and Low-Low groups using the Wilcoxon Rank Sum test. Resulting p-values were corrected for multiple testing using the Benjamini-Hochberg method.
Generalist and specialist analysis
We computed a generalism index for each OTU reflecting its environmental flexibility. This index was calculated based on the entropy of the OTU’s abundance values across the four major environments (animal, aquatic, soil, plant). OTUs with similar abundances across environments had higher entropy. OTUs with uneven abundances across environments (a higher abundance in one or a few of the environments compared to the rest) had lower entropy.
To compare inter-environmental transfers, we selected 200 OTUs assigned to each environment (see “Preferred habitat assignment”) that displayed the highest entropy (“generalists”) and 200 OTUs that displayed the lowest entropy (“specialists”). OTU pairs were then subsampled in such a way that phylogenetic distance distributions were equal between all environments and between “generalists”, “specialists”, and all species. We then counted the fraction of OTU pairs with at least one transfer event detected. To generate the background expectation, OTU pairs from all species were subsampled to the target phylogenetic distance distribution 1000 times. We then fit a normal distribution to the generated data using the fitdistr function in R52 to get an estimate of the expected mean, standard deviation and range of transfer rates between different environments.