Plasmids facilitate pathogenicity, not cooperation, in bacteria

Horizontal gene transfer via plasmids could favour cooperation in bacteria, because transfer of a cooperative gene turns non-cooperative cheats into cooperators. This hypothesis has received support from both theoretical and genomic analyses. In contrast, with a comparative analysis across 51 diverse species, we found that genes for extracellular proteins, which are likely to act as cooperative 'public goods', were not more likely to be carried on either: (i) plasmids compared to chromosomes; or (ii) plasmids that transfer at higher rates. Our results were supported by theoretical modelling which showed that while horizontal gene transfer can help cooperative genes initially invade a population, it does not favour the longer-term maintenance of cooperation. Instead, we found that genes for extracellular proteins were more likely to be on plasmids when they coded for pathogenic virulence traits, in pathogenic bacteria with a broad host-range. Taken together, these results support an alternate hypothesis, that plasmid gene location confers benefits other than horizontal gene transfer.


The growth and success of many bacterial populations depends upon the production of cooperative 'public goods' [1-4]. Public goods are molecules whose secretion provides a benefit to the local group of cells. Examples include iron-scavenging siderophores [5], exotoxins that disintegrate host cell membranes [6,7], and elastases that break down connective tissues [8-10]. A problem is that cooperation can be exploited by 'cheats': cells which avoid the cost of producing public goods but can still use and benefit from those produced by cooperative cells [3,11,12]. What prevents cheats from outcompeting cooperators, and ultimately destabilising cooperation?

In bacteria, some genetic elements are able to move between cells [13]. This horizontal gene transfer has been suggested as a mechanism to help stabilize the production of cooperative public goods [14-18] (Figure 1a). If a gene coding for the production of a public good can be transferred horizontally, it would allow cheats to be 'infected' with the cooperative gene and turned into cooperators, increasing genetic relatedness at the cooperative locus. Theoretical models have shown that this can facilitate the invasion of cooperative genes, in conditions where they would not be favoured on chromosomes [14-18]. Experiments have supported this prediction [18]. In addition, bioinformatic analyses across a range of species found that genes that code for extracellular proteins, many of which act as public goods, are more likely to be found on plasmids than the chromosome [15,19,20].

There are, however, three potential problems for the hypothesis that horizontal gene transfer favours cooperation. First, previous bioinformatic analyses made important first steps, but are not conclusive. One study examined only a single species, which may not be representative of all bacteria [15].
Two additional studies examined multiple species, but assumed that genes and genomes from the same and different species can be treated as independent data points, in a way that could have led to spurious results [19,20]. Statistical tests typically assume that data points are independent, and even slight non-independence can lead to heavily biased results (type I errors) [21,22]. There is an extensive literature in the field of evolutionary biology showing that species share characteristics inherited through common descent, rather than through independent evolution, and so cannot be considered independent data points [23-25]. Genomes are nested within species, and genes are nested within genomes, multiplying this problem of non-independence, analogous to the problem of pseudoreplication in experimental studies [26-29].

[...] plasmids to facilitate rapid gain and loss of genes depending on environmental conditions, and not because they are cooperative per se. Alternatively, genes may be favoured to be on plasmids for reasons other than horizontal gene transfer (Figure 1c) [38]. For example, a higher plasmid copy number offers a mechanism for more expression of a gene, potentially even conditionally, in response to certain environmental conditions [38]. The benefit of being able to regulate gene expression in this way could be higher in genes which code for molecules that are secreted outside the cell, when different quantities of molecule are required in different environments.

We addressed all three of these potential problems for the hypothesis that horizontal gene transfer favours cooperation. We first tested two predictions that would be expected to hold if horizontal gene transfer favours cooperation. Specifically, cooperative genes would be more likely to be found on: (i) plasmids relative to chromosomes; (ii) more mobile plasmids relative to less mobile plasmids [14-20].
We used phylogeny-based statistical methods that control for the problem of non-independence, analysing 1632 genomes from 51 bacterial species, to examine the location of genes that code for extracellular proteins. We then used theoretical models to examine whether horizontal gene transfer facilitates the evolution as well as the initial spread of cooperation.

Finally, we also tested alternative hypotheses for why genes coding for extracellular proteins might be preferentially carried on plasmids. We used three measures of environmental variability to ask whether species which had more variable environments were those most likely to carry genes for extracellular proteins on their plasmids. Additionally, we examined one of these measures in more detail, to help determine whether genes for extracellular proteins were located on plasmids so that they could be gained and lost easily (Figure 1b [...]

Genomic Analyses.
We used the approach developed by Nogueira et al. [15,19,20], of using PSORTb [39] to predict the subcellular location of every protein encoded by 1632 complete genomes from 51 diverse bacterial species (Figure S1; Table S3). We also built upon the work of researchers who pointed out that extracellular (secreted) proteins are likely to provide a benefit to the local population of cells, and hence act as cooperative public goods [2,15,19,20,40]. The advantage of this method is that it allows a large number of genes to be examined, across multiple species.

Overall, we found the average bacterial genome had 2696 protein-coding genes on the chromosome(s), and 223 on the plasmid(s) (Table 1). Of these, an average of 57 genes (~2%) coded for the production of an extracellular protein. These patterns are very similar to those found previously [15,19,20]. We followed methods from previous studies by assuming that genes coding for extracellular proteins are more likely to represent public goods, since the diffusion of these secreted proteins will often mean their effects are shared among neighbouring [...]

Extracellular proteins are not overrepresented on plasmids.
We found that extracellular proteins were not more likely to be carried on plasmids compared to chromosomes (Figure 2). The difference in the proportion of genes that coded for extracellular proteins between plasmid and chromosome was not significantly different from zero across all species (MCMCglmm [41]; posterior mean = 0.004, 95% CI = -0.063 to 0.057, pMCMC = 0.87; n = 1632 genomes; R² of species sample size = 0.47, R² of phylogeny = 0.17; Table S2, row 1). This result was robust to alternative forms of analysis.
We also found no significant difference when we: (i) compared chromosomes to plasmids of only certain mobilities (Fig S4; Table S2, rows 20-22); (ii) analysed our data by two alternative methods, by looking at the ratio of proportions instead of the difference, or by considering only whether the plasmid proportion was greater than the chromosome proportion, removing any effect of the magnitude of this difference (Figure S5; Table S2, rows 2 and 3).
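The three per-genome summaries mentioned above (difference, ratio, and direction of the plasmid versus chromosome proportion) can be illustrated with a minimal Python sketch. The class name, function names and gene counts below are hypothetical and chosen only for illustration; they are not taken from the dataset, and the sketch leaves out the phylogenetic mixed models (MCMCglmm) used for the actual tests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomeCounts:
    """Gene counts for one genome (hypothetical illustration values)."""
    plasmid_extracellular: int
    plasmid_total: int
    chromosome_extracellular: int
    chromosome_total: int

    def plasmid_proportion(self) -> float:
        return self.plasmid_extracellular / self.plasmid_total

    def chromosome_proportion(self) -> float:
        return self.chromosome_extracellular / self.chromosome_total


def difference(g: GenomeCounts) -> float:
    """Plasmid proportion minus chromosome proportion."""
    return g.plasmid_proportion() - g.chromosome_proportion()


def ratio(g: GenomeCounts) -> float:
    """Plasmid proportion divided by chromosome proportion.

    Undefined (ZeroDivisionError) if the chromosome carries no
    extracellular genes; a real analysis would need to handle that case.
    """
    return g.plasmid_proportion() / g.chromosome_proportion()


def plasmid_higher(g: GenomeCounts) -> bool:
    """Direction only: ignores the magnitude of the difference."""
    return g.plasmid_proportion() > g.chromosome_proportion()


# Hypothetical genome: 10 of 200 plasmid genes and 50 of 2500
# chromosome genes code for extracellular proteins.
example = GenomeCounts(10, 200, 50, 2500)
print(difference(example), ratio(example), plasmid_higher(example))
```

The direction-only summary discards effect size, which is why it serves as a robustness check and not as a replacement for the difference-based analysis.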

The lack of an overall significant result was clear when looking at the raw data for the different species that we examined (Figure 2; Figure S5). There was considerable variation across species in the location of genes coding for extracellular proteins. Overall, extracellular proteins were more likely to be on plasmids in 51% of species (26/51), and more likely to be on the chromosome(s) in 49% (25/51) of species (Figure S5). For example, in Bacillus anthracis genes coding for extracellular proteins were three times more likely to be on plasmids, whereas in Acinetobacter baumannii genes coding for extracellular proteins were three times more likely to be on the chromosome(s) (Figure S5). Clearly, across species, genes coding for extracellular proteins are not consistently more likely to be on plasmids.

As a control, we also analysed the genomic location of the genes coding for all other classes of protein (Figure S1). Specifically, we analysed genes that coded for the production of Cytoplasmic, Cytoplasmic Membrane, Periplasmic, Outer Membrane and Cell Wall proteins. We found that none of these protein localisations were significantly overrepresented on plasmids or chromosomes across the 51 species (Figure S6; Table S2, rows 5-10). Plasmids are highly variable in the genes they carry.

[...] previous studies, which found that plasmid genes code for proportionally more extracellular proteins than chromosomes [15,19,20]. The first of these studies found this pattern across 20 Escherichia coli genomes [15]. We also found that genes coding for extracellular proteins in E. coli were more likely to be found on plasmids (Figure 2; Figure S5).
However, Figure 2 shows that this is not a consistent pattern across species: approximately half (25/51) of the species we analysed showed a pattern in the opposite direction, with genes coding for extracellular proteins more likely to be on their chromosome(s) than their plasmid(s).

Two subsequent, multi-species studies found that plasmid genes were significantly more likely to code for extracellular proteins than chromosome genes [19,20]. These studies used statistical tests such as the Wilcoxon signed-rank test to ask whether there was a consistent pattern, using bacterial genomes as independent data points. When we analysed our data with the same statistical methods used in these studies, we also obtained a significant result (Wilcoxon [...]

Why does using bacterial genomes as independent data points lead to a significant result? By using a Wilcoxon signed-rank test, at the level of the genome, we are implicitly assuming that all the genomes analysed are: (i) independent from one another; (ii) a representative sample of bacteria in nature. Neither of these is true for multi-species genomic datasets. First, due to shared ancestry, species are not independent from one another, and so neither are genomes in such analyses [24,42]. Even a slight lack of independence can lead to heavily biased results in statistical analyses and spurious conclusions [21]. Second, genomic databases tend to have a disproportionate abundance of certain species and genera. This will bias the results towards commonly sequenced species.

Consequently, when asking questions across species, it is inappropriate to treat all the genomes in genomic datasets as independent data points.
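The effect of treating genomes as independent data points can be illustrated with a toy Python sketch. It uses an exact two-sided sign test as a simple stand-in for the Wilcoxon signed-rank test, and collapses genomes to species means as a crude stand-in for the phylogenetic mixed models used in this study (it does not control for shared ancestry between species). All species names and values are hypothetical.

```python
from math import comb
from statistics import mean


def sign_test_p(differences):
    """Exact two-sided sign test p-value against a median of zero."""
    nonzero = [d for d in differences if d != 0]
    n = len(nonzero)
    positives = sum(d > 0 for d in nonzero)
    k = min(positives, n - positives)
    # Probability of a result at least this lopsided under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical plasmid-minus-chromosome differences per genome.
# One heavily sequenced species dominates the dataset.
genomes = {
    "species_A": [0.05] * 40,
    "species_B": [0.02, 0.03],
    "species_C": [-0.04, -0.02],
    "species_D": [-0.03, -0.01],
    "species_E": [0.01, 0.02],
    "species_F": [-0.05, -0.02],
}

# Genome-level analysis: every genome counted as an independent point.
genome_level = [d for ds in genomes.values() for d in ds]
genome_p = sign_test_p(genome_level)

# Species-level analysis: one mean difference per species.
species_level = [mean(ds) for ds in genomes.values()]
species_p = sign_test_p(species_level)

print(f"genome-level p = {genome_p:.2g}, species-level p = {species_p:.2g}")
```

In this toy dataset, 44 of 50 genomes show a positive difference, but only 3 of 6 species do, so the apparent genome-level pattern is driven entirely by the oversampled species.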
When we performed an analysis analogous to the Wilcoxon signed-rank test, using the same untransformed data which produced a significant result above, but controlled for the number of genomes per species and the non-independence of species, we no longer found any significant difference between the proportion of plasmid and chromosome genes coding for extracellular proteins (MCMCglmm; posterior mean = 0.017, 95% CI = -0.021 to 0.057, pMCMC = 0.332; n = 1632 plasmid-chromosome paired differences in extracellular proportion; R²: species sample size = 0.46, phylogeny = 0.34; Table S2, row 4). Furthermore, we found that the number of genomes per species and the non-independence of species explained 46% and 34% of the variation in data respectively (paired plasmid and chromosome differences across our 1632 genomes). Taken together, this illustrates that it is not our data which disagrees with previous studies, but instead our use of statistical analyses appropriate for multi-genome, multi-species datasets [23-25].

These data also illustrate the importance of examining effect sizes, and not just whether results are statistically significant. With large sample sizes it is possible to get results that are significant but not biologically important. One rule of thumb is to assume that a result is only biologically significant if the percentage of variance explained is >10% (i.e. R² > 0.1) [43]. When bacterial genomes are assumed to be independent data points in across-species analyses, this leads to inflated sample sizes. Consequently, even when results are statistically significant at P < 0.05, they can still only explain 1-2% of the variation in the data, which is clearly not biologically significant. The flip side of such considerations is that effect sizes and examination of raw data at the species level (e.g.
Figure 2) are also useful checks against non-significant results due to a lack of statistical power (type II errors).

Plasmids with higher mobility do not carry more genes for extracellular proteins.
We then tested another prediction of the cooperation hypothesis: cooperation is more likely to be favoured when coded for on more mobile plasmids [14-18]. We used data from the MOBsuite database to assign plasmids to one of three levels of mobility (Fig 3a) [...] conjugation, to be the least mobile (Fig 3a) [44,46].

Genes coding for extracellular proteins were not more likely to be on plasmids with higher transfer rates (Figure 3b). Examining the slope of the regression between plasmid mobility and the proportion of genes coding for extracellular proteins, we found no consistent pattern across species (MCMCglmm; posterior mean = 0.006, 95% CI = -0.040 to 0.052, pMCMC = 0.73; n = 40; Table S2, row 11). This lack of a significant relationship was robust to different forms of analysis, including an examination of the means of each mobility type of each species (Figure S7; Table S2, row 12). A caveat here is that our estimates of transfer rates across different types of plasmid are relative, and it would be very useful to obtain quantitative estimates of transfer rates.

[...] Cells can carry a plasmid which is transferred with probability β between paired cells, and which is costly (C_C) to carry. Individuals with the gene for cooperation produce a public good, at a cost C_G, which generates a benefit B that is shared between all members of the patch. The gene for cooperation can be on the plasmid or chromosome.

Consistent with previous analyses, we found that horizontal gene transfer on a plasmid can initially help cooperation invade (Figure 4).
Horizontal gene transfer increased the frequency of cooperation, by turning non-cooperators into cooperators, which also increases relatedness at the cooperative locus [14-18,47].

In contrast, we found that transfer on a plasmid did not increase the range of parameter space where cooperation was maintained at evolutionary equilibrium (Fig 4a) [...] invade, but then does not help maintain cooperation in the long term. As a plasmid approaches fixation, any benefit of horizontal gene transfer is lost. Consequently, competition between plasmids with and without a cooperative gene (cooperators and cheats) becomes analogous to the scenario in which the gene for cooperation is on the chromosome. An analogous result was also found in a meta-population model by Mc Ginty et al. [16]. Our prediction has been supported experimentally by Bakkeren et al. [30], who found that location on a conjugative plasmid could help a cooperative trait invade in Salmonella Typhimurium (S.Tm), but that this was only stable with strong population bottlenecks (high relatedness).

In addition, we found that, when cooperation is favoured, cooperative traits are not more likely to be favoured on, or transferred to, plasmids. The reason is that, when cooperation is favoured, [...] in our data set, pathogenicity is a key aspect of bacterial lifestyle that has been suggested to be important for plasmid gene content, such as antibiotic resistance and virulence factors [6,40,52,53]. We divided species into three categories: pathogens with broad host-range, pathogens with narrow host-range, and non-pathogens. Broad host-range pathogens are expected to encounter more variable environments than narrow host-range pathogens.

We found that pathogens with a broad host-range were more likely to carry genes coding for extracellular proteins on their plasmids, compared with both narrow host-range pathogens and non-pathogens (Fig 5). Specifically, we compared the difference in the proportion of genes coding for extracellular proteins between plasmid(s) and chromosome(s) across these three [...] Table S2, row 25). These patterns hold irrespective of whether we included species that we could not reliably classify into either category, such as opportunistic pathogens, in our analyses (Figure S10).

Plasmids of broad host-range pathogens carry many pathogenicity genes. We suspected that the additional extracellular proteins coded for by plasmids of broad host-range species, compared to narrow host-range species, may be particularly involved in facilitating pathogenicity [40,52,53]. To investigate this, we used the program MP3 [54] to assign each extracellular protein as either 'pathogenic' or 'non-pathogenic'.

We found that plasmids of broad host-range pathogens were particularly enriched with extracellular proteins involved in facilitating pathogenicity, compared to plasmids of narrow host-range species (Figure 6). Specifically, we found that pathogens with a broad host-range were significantly more likely to code for pathogenic extracellular proteins on their plasmids compared to narrow host-range species (Figure 6a [...] Table S2, row 26). In contrast, the relative location of non-pathogenic extracellular proteins did not vary between broad and narrow host-range pathogens (Figure 6b) (MCMCglmm; Narrow compared to Broad host-range pathogens: posterior mean = -0.036, 95% CI = -0.115 to 0.040, pMCMC = 0.296; n = 474 genomes; Table S2, row 27). Consequently, the excess of genes coding for extracellular proteins on the plasmids of broad host-range species (Figure 5) appears to arise due to an excess of pathogenicity genes coding for extracellular proteins (Figure 6).

Most genomic databases are biased towards species that interact with and/or infect humans, so we examined whether these species had driven the above results. In our dataset, 5 out of 10 broad host-range species and 3 out of 5 narrow host-range species can infect humans. We found no significant difference in how likely either pathogenic or non-pathogenic extracellular proteins were to be on plasmids of human pathogens compared to non-human pathogens. We also found that while host-range had a significant effect on how likely plasmids were to code for pathogenic extracellular proteins, whether a species could infect humans had no significant effect (Table S2, [...] proteins were more likely to be on plasmids that transfer at higher rates. This would be predicted by the gain and loss hypothesis, but not the beyond horizontal gene transfer hypothesis.
We found that plasmids with higher mobility did not code for more pathogenic extracellular proteins. Specifically, across broad host-range pathogen species, the slope of the regression between plasmid mobility and the proportion of genes coding for pathogenic extracellular proteins was not consistently positive (Figure S11) (MCMCglmm; posterior mean = -0.020, 95% CI = -0.224 to 0.185, pMCMC = 0.774; n = 7; Table S2, row 31). This lack of a significant relationship was robust to additional forms of analysis, such as considering all pathogenic species, including narrow host-range pathogens and those not carrying plasmids of all three mobility types (Figure S12; Table S2, rows 32 and 33).

Taken together, our results are most consistent with the hypothesis that genes coding for extracellular proteins are overrepresented on plasmids when plasmid carriage provides a benefit other than mobility (Figure 1c). A number of other factors may influence which genes are carried on plasmids, beyond horizontal gene transfer. First, there is evidence that increasing the copy number of plasmids can lead to increasing rates of evolution in the genes they carry [55], and it also may act as a mechanism to increase the expression of genes carried on plasmids [56,57].

We found no support for the hypothesis that horizontal gene transfer favours cooperation. Our genomic analyses showed that extracellular proteins are not: (i) overrepresented on plasmids compared to chromosomes; (ii) more likely to be carried by plasmids that transfer at higher rates. These patterns could be explained by theoretical modelling, which showed that while horizontal gene transfer may help cooperation to initially invade a population, it does not then help the maintenance of cooperation in the long term. Once plasmids become common, cheat plasmids that do not code for cooperation are able to outcompete cooperative plasmids, analogous to selection at the level of the chromosome [16]. Our prediction has also been supported experimentally by Bakkeren et al. [30], in Salmonella Typhimurium (S.Tm), who observed cooperation invading on a plasmid, but then being outcompeted by newly emerging non-cooperative cheats. In contrast, we found that genes coding for extracellular proteins involved in pathogenicity and virulence are preferentially located on plasmids in pathogens with a broad host-range. These pathogenic virulence genes were not preferentially located on plasmids that transfer at a higher rate, suggesting that the benefit of being located on a plasmid is something other than horizontal gene transfer, such as the ability to vary copy number.

We retrieved 1632 complete genomes comprising 51 bacterial species from GenBank RefSeq (https://www.ncbi.nlm.nih.gov) between February and November 2019. We used species on panX (http://pangenome.tuebingen.mpg.de) [69] as a list of potential species for our dataset, since these comprise the most sequenced bacterial species. To allow comparison of chromosome and plasmid genes within the same genome, we only retrieved genomes that contained at least one plasmid sequence. We included species with 10 or more RefSeq genomes with one or more plasmids available in our analysis. We retrieved up to 100 genomes for each species; this was either all complete genomes available for the species, or a random sample where more than 100 were available. Where two or more genomes had the same strain name, we randomly retrieved one genome to reduce the risk of pseudoreplication.
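The genome selection rules described above can be sketched in Python as follows. This is an illustrative reconstruction, not the pipeline used for the study: the function name, the record layout `(species, strain, accession)`, and the order in which the rules are applied (species threshold, then one genome per strain name, then the cap of 100) are all assumptions. The input is assumed to contain only complete genomes that carry at least one plasmid.

```python
import random
from collections import defaultdict


def select_genomes(records, min_genomes=10, max_genomes=100, seed=0):
    """Pick genomes per species from (species, strain, accession) records.

    Assumed rule order (hypothetical): drop species with fewer than
    `min_genomes` genomes, keep one random genome per strain name,
    then randomly sample down to `max_genomes`.
    """
    rng = random.Random(seed)
    by_species = defaultdict(list)
    for species, strain, accession in records:
        by_species[species].append((strain, accession))

    selected = {}
    for species, genomes in by_species.items():
        if len(genomes) < min_genomes:
            continue  # too few plasmid-carrying genomes for this species
        by_strain = defaultdict(list)
        for strain, accession in genomes:
            by_strain[strain].append(accession)
        # One randomly chosen genome per strain name.
        unique = [rng.choice(accessions) for accessions in by_strain.values()]
        if len(unique) > max_genomes:
            unique = rng.sample(unique, max_genomes)
        selected[species] = unique
    return selected


# Hypothetical records for three species.
records = [("sp_A", f"strain{i}", f"A{i}") for i in range(150)]
records += [("sp_B", f"strain{i}", f"B{i}") for i in range(9)]
records += [("sp_C", f"strain{i % 10}", f"C{i}") for i in range(12)]
chosen = select_genomes(records)
print({species: len(accessions) for species, accessions in chosen.items()})
```

Fixing the random seed makes the sample reproducible, which matters when the same genome set has to be regenerated for later analyses.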

We used PSORTb v.3 [39] to predict the subcellular location of every protein encoded by each genome in our dataset. We used a Docker image of PSORTb developed by the Brinkman Lab, available at: https://github.com/brinkmanlab/psortb_commandline_docker. We chose PSORTb because it is widely regarded as one of the best performing programs of its kind [70]. It has also been used in previous analyses to identify 'cooperative' genes and/or extracellular proteins in bacteria [15,20]. The program has a number of modules which are trained to recognise particular features of proteins. Results from these modules are combined to give a Final Prediction for each protein. We consulted the literature to confirm the Gram stain of each of our species. For Gram-positive species, PSORTb assigns proteins to one of four locations within the cell: cytoplasmic, cytoplasmic membrane, extracellular or cell wall (Figure S1). The locations for Gram-negative species are the same, except that cell wall is replaced with outer membrane and periplasmic, meaning there are five possible locations for proteins of Gram-negative species (Figure S1). We used these predicted locations throughout all subsequent analyses in this work. PSORTb could not reliably assign a subcellular location to 27% of proteins we analysed, giving a final prediction of 'unknown' (Table S1). Unless explicitly stated, we did not include these unknown proteins in our analyses.
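As a rough illustration of how such predictions might be summarised, the Python sketch below computes the extracellular fraction of a set of predicted locations. The label strings follow the wording of the text above and are not necessarily the exact strings PSORTb writes to its output files; the function name is hypothetical, and dropping 'unknown' predictions from the denominator is an assumption about how the exclusion was applied.

```python
from collections import Counter

# Locations expected for each Gram stain, as described in the text.
# Label spellings are assumptions, not verified PSORTb output strings.
LOCATIONS = {
    "positive": {"cytoplasmic", "cytoplasmic membrane",
                 "cell wall", "extracellular"},
    "negative": {"cytoplasmic", "cytoplasmic membrane",
                 "periplasmic", "outer membrane", "extracellular"},
}


def extracellular_fraction(predictions, gram):
    """Fraction of confidently localised proteins predicted extracellular.

    'unknown' predictions are excluded (assumed to leave the denominator).
    Returns None when no protein has a known location.
    """
    allowed = LOCATIONS[gram]
    known = [p for p in predictions if p != "unknown"]
    unexpected = set(known) - allowed
    if unexpected:
        raise ValueError(f"unexpected locations for Gram-{gram}: {unexpected}")
    if not known:
        return None
    return Counter(known)["extracellular"] / len(known)


# Hypothetical predictions for one Gram-negative replicon.
preds = (["cytoplasmic"] * 6 + ["extracellular"] * 2 + ["unknown"] * 3
         + ["periplasmic", "outer membrane"])
print(extracellular_fraction(preds, "negative"))
```

The validation step guards against applying the wrong Gram-stain setting, which would otherwise silently shift proteins between categories.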

We next examined how plasmid mobility correlates with each plasmid's extracellular protein proportion. As part of its mobility prediction, MOBsuite [44] identifies sequences within each plasmid involved with conjugation. To control for the possibility that conjugative plasmids, by definition of being conjugative, must carry genes controlling this process, we subtracted the total number of these sequences from the total number of proteins when calculating the extracellular proportion of each plasmid. This is a highly conservative control, since it assumes none of the proteins predicted as extracellular are involved in conjugation. We did all analyses on these data with and without removing these mating-pair accessions to ensure any results were not affected by factors unrelated to plasmids' extracellular protein content.

Additionally, we used the plasmid mobility predictions to ask whether differences in the mobility of species' plasmids correlated with whether genes encoding extracellular proteins are overrepresented on plasmids compared to chromosomes. We calculated the proportion of plasmids in each genome capable of transferring via conjugation (conjugative and mobilizable plasmids), and averaged across all genomes to give a general measure of the mobility of each species' plasmids.
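The two calculations described above can be sketched in Python as follows. The function names, the mobility label strings and the example numbers are assumptions made for illustration; in particular, the labels should be checked against the actual MOBsuite output before reuse.

```python
# Mobility classes treated as capable of transfer via conjugation.
# Label spellings are assumed, not verified against MOBsuite output.
CONJUGATION_CAPABLE = {"conjugative", "mobilizable"}


def extracellular_proportion(n_extracellular, n_proteins, n_conjugation=0):
    """Extracellular proportion of one plasmid.

    Conjugation-related sequences are subtracted from the denominator
    only, which assumes none of them were predicted as extracellular.
    """
    denominator = n_proteins - n_conjugation
    if denominator <= 0:
        raise ValueError("no proteins left after removing conjugation genes")
    return n_extracellular / denominator


def genome_mobility(plasmid_classes):
    """Proportion of a genome's plasmids that can transfer by conjugation."""
    capable = sum(c in CONJUGATION_CAPABLE for c in plasmid_classes)
    return capable / len(plasmid_classes)


def species_mobility(genomes):
    """Mean of the per-genome proportions across a species' genomes."""
    return sum(genome_mobility(g) for g in genomes) / len(genomes)


# Hypothetical plasmid: 120 proteins, 20 conjugation-related, 5 extracellular.
print(extracellular_proportion(5, 120, 20))

# Hypothetical species with two genomes.
print(species_mobility([["conjugative", "non-mobilizable"], ["mobilizable"]]))
```

Averaging per-genome proportions, as opposed to pooling all plasmids of a species, stops genomes with many plasmids from dominating the species-level measure.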

We classified a species as pathogenic if it was described in the literature as an obligate or facultative pathogen. Given some bacterial species only rarely act as pathogens, such as opportunistic pathogens, we only included species where we could be sure pathogenicity was a key aspect of their lifestyle and a regular selection pressure acting on their genome content. For this reason, we decided not to include species described as opportunistic pathogens in the literature and those which frequently live as commensals in their hosts. We classified non-pathogens as species which are strictly environmental (never live in hosts) or strictly mutualists and/or commensals (never cause pathogenicity in their hosts). There were 26 species we could not definitively assign to either of these categories. These were not included in our main analyses, although we carried out additional analyses to ensure that removing these species did not bias our results (Figure S10).

To estimate the host-range of pathogens, we used information from the literature to determine the maximum taxonomic level of hosts each species is able to invade. We defined narrow host-range species as those which can invade either only one host species, or host species within the same genus or family. In contrast, we defined broad host-range pathogens as those capable of invading host species within the same order, class or phylum. For example, Xanthomonas citri acts as a plant pathogen within the genus Citrus [71], while Pseudomonas syringae acts as a plant pathogen across multiple orders of flowering plants [72]. For more details and references to the literature used for this classification, please see Table S3.
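The host-range rule above amounts to a simple lookup on the highest taxonomic rank that contains all known hosts. The Python sketch below encodes it; the function name and rank strings are hypothetical, and the rank assigned to any real pathogen would have to come from the literature (Table S3), not from this code.

```python
# Highest taxonomic rank that contains all hosts a pathogen can invade.
NARROW_RANKS = {"species", "genus", "family"}
BROAD_RANKS = {"order", "class", "phylum"}


def host_range(max_host_rank):
    """Classify a pathogen as 'narrow' or 'broad' host-range."""
    rank = max_host_rank.lower()
    if rank in NARROW_RANKS:
        return "narrow"
    if rank in BROAD_RANKS:
        return "broad"
    raise ValueError(f"rank not covered by the classification: {max_host_rank}")


# A pathogen restricted to hosts within a single genus is narrow;
# one whose hosts are only united at the level of an order is broad.
print(host_range("genus"), host_range("order"))
```

Ranks above phylum are rejected deliberately, since the text does not say how such cases were treated.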

We completed additional analyses for two other measures, or proxies, of environmental variability, the details and results of which can be found in Supp. Info. 1. In brief, we used previously published data which classified the habitat diversity of species using 16S rRNA environmental datasets across five broad habitats: water, wastewater, sediment, soil and host [60,61]. We also supplemented this with information from the literature for species not included in the published data. We used this to ask whether species which lived in multiple habitats had genes encoding extracellular proteins more overrepresented on their plasmids.

We also looked at bacterial pangenomes as a proxy for environmental variability, since it has been noted that species with a high percentage of accessory genes, defined as genes found in only a subset of genomes within a species, are generally those with more variable environments. All pangenome data were collected from panX [69] (http://pangenome.tuebingen.mpg.de), since this calculates the pangenome using the same method across all of our species.

Pathogenicity categorisation of extracellular proteins
We used MP3 [54] to examine the pathogenicity of extracellular protein-coding genes in broad host-range and narrow host-range pathogens. MP3 uses two modules to produce a 'Hybrid' prediction for each protein: either 'Pathogenic' or 'Non-Pathogenic'. We used MP3 with default parameters to gain this prediction for every extracellular protein in all genomes of broad and narrow host-range species. MP3 was unable to give a prediction for approximately 9% of extracellular proteins, and so these were not included in this analysis.

[...] The response variable in all of our analyses is either a proportion or a measure calculated from proportions. Proportion data are bounded between 0 and 1 and have a non-normal distribution. To control for this, all proportion data in our analyses have been arcsine square root transformed to improve normality.

Phylogeny. To control for species relationships, we generated a phylogeny including all 51 species in our dataset (Fig S2). We used a recently published maximum likelihood tree using 16S ribosomal protein data as the basis for our phylogeny [76]. This tree of life typically had only one representative species per genus. We used the R package 'ape' to extract all branches matching species in our dataset [77].
In cases where the genus representative was different to the species in our dataset, we swapped the tip name with our species, since all members of the same genus are equally related to members of a sister genus. In cases where we had multiple species within a single genus in our dataset, we used the R package 'phylotools' to add these species as additional branches into their genus 78 . We used published phylogenies from the literature to add any within-genus clustering of species' branches. We used this phylogeny in nexus format for all our MCMCglmm analyses (Fig S2, Table S2). Methods are also available to control for uncertainty in phylogenetic reconstruction 79

Figure 1
Three hypotheses for why selection might favour genes coding for extracellular proteins to be located on plasmids. (a) Cooperation Hypothesis. Blue cells produce extracellular proteins which act as cooperative public goods, while red cells are 'cheats' which exploit this cooperation. Over time cheats grow faster than cooperators since they forgo the cost of public good production. However, because the gene for the extracellular protein is located on a plasmid, cooperators can transfer the gene to the cheats, turning them into cooperators, increasing genetic relatedness at the cooperative locus, and stabilising cooperation. (b) Gain and Loss Hypothesis. The production of the extracellular protein is required in some environments, but not others. Transitions between these environments can result from temporal or spatial change. Cells are selected to either lose (Environment A) or gain (Environment B) the plasmid coding for the production of the extracellular protein. (c) Beyond Horizontal Gene Transfer Hypothesis. The location of a gene on a plasmid could provide a number of benefits, other than the possibility for horizontal gene transfer 38 . For example, when the quantity of extracellular protein required varies across environments (A versus B), plasmid copy number could be varied to adjust production 38 .

Figure 2
Extracellular proteins are not overrepresented on plasmids. For each species we calculated the mean difference between plasmid(s) and chromosomes in the proportion of genes coding for extracellular proteins. Species in blue have a difference greater than zero, meaning their plasmid genes code for a greater proportion of extracellular proteins than chromosome genes. Species in red have a difference less than zero, meaning their chromosome genes code for a greater proportion of extracellular proteins than plasmid genes. Error bars indicate the standard error. The dot and error bar at the top of the graph indicate the mean difference and 95% Credible Interval given by a MCMCglmm analysis across all species, controlling for phylogeny and sample size. We arcsine square root transformed proportion data before calculating the difference. Overall, there is no consistent trend that genes coding for extracellular proteins are more likely to be carried on plasmids (i.e. no consistent trend towards species in blue).
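As an illustration of the quantity plotted for each species, the following is a minimal Python sketch of the arcsine square root transformation and the plasmid-minus-chromosome difference. It is not the authors' code. The function names, the example gene counts, and the choice to compute one difference per genome (pooling all plasmids in that genome) before averaging within a species are assumptions made for illustration only.

```python
import math
from statistics import mean, stdev


def arcsine_sqrt(p):
    """Arcsine square root transform of a proportion in [0, 1]."""
    if not 0.0 <= p <= 1.0:
        raise ValueError(f"proportion out of range: {p}")
    return math.asin(math.sqrt(p))


def genome_difference(plasmid_extracellular, plasmid_total,
                      chromosome_extracellular, chromosome_total):
    """Transformed plasmid proportion minus transformed chromosome proportion."""
    plasmid_prop = plasmid_extracellular / plasmid_total
    chromosome_prop = chromosome_extracellular / chromosome_total
    return arcsine_sqrt(plasmid_prop) - arcsine_sqrt(chromosome_prop)


def species_summary(genomes):
    """Mean difference and standard error across the genomes of one species."""
    diffs = [genome_difference(*g) for g in genomes]
    if len(diffs) > 1:
        std_err = stdev(diffs) / math.sqrt(len(diffs))
    else:
        std_err = float("nan")  # standard error is undefined for one genome
    return mean(diffs), std_err


# Hypothetical gene counts for three genomes of one species:
# (plasmid extracellular, plasmid total, chromosome extracellular, chromosome total)
example_genomes = [(3, 100, 120, 4000), (1, 80, 130, 4100), (6, 150, 110, 3900)]

mean_diff, std_err = species_summary(example_genomes)
# A positive mean corresponds to a "blue" species, a negative mean to "red".
colour = "blue" if mean_diff > 0 else "red"
print(f"mean difference = {mean_diff:.4f}, SE = {std_err:.4f}, colour = {colour}")
```

The across-species estimate and 95% Credible Interval in the figure come from an MCMCglmm analysis that also controls for phylogeny and sample size, which this sketch does not attempt to reproduce.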
Blue cells are potential plasmid donors, while red cells are potential recipients. Each panel shows when plasmid transfer is possible for one of the three plasmid mobility types. Non-mobilizable plasmids cannot be transferred. Mobilizable plasmids cannot be transferred alone, but they carry enough genes to 'hijack' the machinery of a conjugative plasmid that is in the same cell. Conjugative plasmids carry all genes necessary to transfer independently. (b) The 40 species which carried plasmids of all three mobilities are shown, with a panel for each of these species. Dots in each panel indicate the mean % of genes coding for extracellular proteins of all plasmids of each mobility level. The blue lines are the linear regression of these three points. We arcsine square root transformed proportion data before calculating the mean for each species, and then back-transformed these values for display of the data. Overall, there is no consistent trend for genes that code for extracellular proteins to be on more mobile plasmids.

Environmental variability and the location of genes coding for extracellular proteins. We have divided species into either pathogens or non-pathogens, with pathogens further categorised into those with a narrow or broad host-range. The y-axis shows the difference in the proportion of genes on plasmids and chromosomes coding for extracellular proteins. Each dot is the mean for all genomes in a species. Species in blue are those with extracellular proteins overrepresented on plasmids, while species in red are those with extracellular proteins overrepresented on chromosomes. The black bars indicate the mean for all species in each category. Overall, pathogens with a broad host-range are more likely to have genes coding for extracellular proteins on their plasmids.
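For the plasmid mobility panels described above (one dot per mobility level and a fitted line per species), the per-species calculation might look like the following Python sketch. It is not the authors' code. The function names, the example proportions, the coding of mobility levels as equally spaced values 0, 1 and 2, and the choice to fit the line to the back-transformed percentages are all assumptions made for illustration.

```python
import math
from statistics import mean

# Ordered from least to most mobile; the equal spacing is an assumption.
MOBILITY_LEVELS = ["non-mobilizable", "mobilizable", "conjugative"]


def arcsine_sqrt(p):
    """Arcsine square root transform of a proportion in [0, 1]."""
    return math.asin(math.sqrt(p))


def back_transform(x):
    """Inverse of the arcsine square root transform, for 0 <= x <= pi/2."""
    return math.sin(x) ** 2


def mobility_means(plasmid_props):
    """Back-transformed mean % per mobility level, in MOBILITY_LEVELS order.

    plasmid_props maps each mobility level to a list of per-plasmid
    proportions of genes coding for extracellular proteins.
    """
    percentages = []
    for level in MOBILITY_LEVELS:
        transformed = [arcsine_sqrt(p) for p in plasmid_props[level]]
        percentages.append(100.0 * back_transform(mean(transformed)))
    return percentages


def ols_slope(ys):
    """Least-squares slope of ys against x = 0, 1, 2, ..."""
    xs = list(range(len(ys)))
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx


# Hypothetical per-plasmid proportions for one species.
example_species = {
    "non-mobilizable": [0.02, 0.05, 0.03],
    "mobilizable": [0.04, 0.01],
    "conjugative": [0.03, 0.02, 0.04, 0.03],
}

dots = mobility_means(example_species)
slope = ols_slope(dots)
print("mean % per mobility level:", [round(d, 2) for d in dots])
print("slope across mobility levels:", round(slope, 3))
```

A slope near zero for most species would match the pattern described in the caption, where no consistent trend towards more mobile plasmids is seen.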

Figure 6
The location of genes coding for pathogenic and non-pathogenic extracellular proteins, in species with broad and narrow host-ranges. We categorised pathogenic species into those with either a broad or narrow host-range. The y-axes in (a) and (b) show the difference in the proportion of genes coding for extracellular proteins on plasmids and chromosomes which are predicted by MP3 as either (a) pathogenic or (b) non-pathogenic. Higher values indicate that extracellular proteins are more likely to be coded for by plasmids. Each dot is the mean for all genomes in a species. Species in blue are those with the relevant subset of extracellular proteins overrepresented on plasmids, while species in red are those with the subset of extracellular proteins overrepresented on chromosomes. Overall, there is a significant difference between broad and narrow host-range species in the location of genes coding for pathogenic extracellular proteins, but no difference for non-pathogenic extracellular proteins.

Supplementary Files
This is a list of supplementary files associated with this preprint. SupplementaryInfo.pdf