Landscape genomics of Escherichia coli in livestock-keeping households across a rapidly developing urban city.

The keeping of livestock has been posited as a risk factor for the emergence of zoonoses and the spread of antimicrobial resistance. However, quantitative evidence regarding the major sources of pathogenic and drug-resistant bacteria and transmission routes between hosts remains lacking. In the largest epidemiological study of this nature to date, we sampled Escherichia coli from humans, livestock, food, wildlife and the environment of 99 households across Nairobi, Kenya to gain a deeper understanding of sharing of bacteria among hosts and potential reservoirs. By analysing whole genome sequencing data from 1,338 E. coli isolates, we reconstruct sharing patterns for the sampled E. coli and its antimicrobial resistance determinants. We find that the diversity and sharing patterns of E. coli are heavily structured by household, which is the primary epidemiological interface for bacterial strain sharing. Strain sharing within households was strongly shaped by host type. We also find evidence for inter-household and inter-host sharing, and importantly, between humans and animals, although this occurs much less frequently. We find similar strain sharing patterns for the E. coli accessory genome, suggesting that it is shaped by recent evolutionary history and is strongly associated with the core genome. Resistome similarity, however, was quite differently distributed across host and household, consistent with its being driven by shared exposure to antimicrobials. Our results indicate that there is potential for the exchange of bacteria between humans, livestock and wildlife in the same household in a tropical urban setting, with wider mixing occurring over a period of months or years, but this does not drive the distribution of antimicrobial resistance.


The spread of bacterial pathogens and antimicrobial resistance (AMR) across human and animal populations presents a significant and growing threat to global health and economic development. Identifying risk factors for emergence and spread is one of epidemiology's most important challenges. Many recent pandemics and newly emergent infectious diseases have animal origins (1, 2) and are associated with rapidly urbanising environments (3, 4). The dynamic interfaces between humans, domestic livestock and wild animals act as conduits by which humans can be exposed to zoonotic pathogens and AMR in an environment with inadequate sanitation infrastructure, limited access to appropriate and effective drugs, and unregulated antimicrobial usage (5-8).
The importance of livestock to the transmission of bacteria and AMR remains unclear (9). The practice of keeping livestock, particularly in urban settings, has been described as a risk factor for the emergence and spread of zoonoses (10, 11). Antimicrobial agents used in human medicine are also used for growth promotion, disease prevention and treatment in livestock, enhancing selection pressures on bacterial pathogens for AMR emergence and spread.
Wild birds and mammals have also been documented to carry and exchange drug-resistant bacteria with livestock and humans (6, 12, 13). The rapid expansion of urban environments into previously pristine or sparsely populated natural landscapes also increases the potential for greater contact between wildlife, humans and livestock, which can provide conduits for microbiome sharing (14).
Fundamental to whole genome sequencing studies is the availability of systematically sampled bacterial isolates obtained from humans, livestock and wildlife across overlapping geographical regions and time-frames, yet data are lacking (15).
In this study, we sampled the bacterium Escherichia coli from humans, livestock and peri-domestic wildlife of 99 households and their environs across 33 sublocations in Nairobi, Kenya, in an epidemiologically structured study. The rapid development of Nairobi's urban landscape is comparable to that of many other cities in the developing world, making it an ideal system in which to explore how people's interactions and co-existence with animals influence pathogen transmission across species (16, 17). As a common commensal and pathogen of both human and animal populations that is easy to culture and has a wealth of available genetic information, E. coli is an ideal organism for this study. Here, we report a genomic investigation of 1,338 E. coli isolates sourced from humans, livestock and wildlife across Nairobi to elucidate patterns of bacterial strain sharing as a proxy for transmission potential. We test the hypothesis that the distributions of bacterial strains and their genetic pools are limited to particular defined ecological niches (households and hosts) versus an alternative that they display a cosmopolitan distribution, in essence recapitulating the famous tenet, "Everything is everywhere, but, the environment selects" (18). Our study aims to identify risk factors to help inform surveillance strategies that target potential hotspots for strain sharing and AMR transmission between populations in an urban setting, and more broadly, to understand risks associated with transmission of multi-host pathogens in urban settings.

E. coli from humans and animals in Nairobi originate from both global and local lineages.
A total of 1,338 Escherichia coli isolates were sequenced as part of this study (Table S1). Of these, 311 genomes were obtained from human isolates. A further 421 genomes were isolated from 63 wildlife species, primarily wild birds (n=245) and rodents and bats (n=130). The remaining 606 genomes were obtained from 13 species of livestock that can be grouped into poultry (n=324), goat and sheep (n=109), cattle (n=61), pig (n=49) and rabbit (n=38) isolates. The isolates were distributed across 33 geographic sublocations spanning the entire urban area of Nairobi, with a range of 20 to 63 isolates per sublocation (see supplementary methods). A large fraction of isolates in each sublocation (minimum 75%) were obtained from a household with livestock. The sampling protocol also ensured that there was at least one and up to ten isolates from a household without livestock in each sublocation.
The genomes represent all major lineages of the E. coli sensu stricto phylogroup in addition to members of the cryptic clade I. The isolates belong to Clermont phylogroups B1 (45%), A (38%), B2 (6%), D (4%), E (2%), and to a lesser extent clades C, F, G and clade I (>1%). Phylogroup A was strongly associated with humans (41% of human isolates) compared to the other host categories. In the livestock mammal, wild bird and wild mammal categories, phylogroup B1 was the most frequently isolated.
A total of 537 unique sequence types (STs), based on the 7-gene Achtman scheme, were represented, with the three most common being ST10 (n=93, 7%), ST48 (n=64, 5%) and ST155 (n=54, 4%) ( was used to infer the overall phylogenetic relationship between isolates (Figure 1). Associations between the clustering of isolates at the tips of this phylogeny and host types or sublocations were investigated.
We found two clusters that were associated with the Kitisuru (n=5) and Karen (n=6) sublocations, suggesting localised transmission clusters (Supplementary Figure S1). The cluster in Kitisuru involved two wild bird and three poultry isolates from the same household (KTS089), while the cluster in Karen was made up of cattle and primate isolates from one household (KAN007) and wild bird isolates sampled in other households in the same sublocation (KAN008 and KAN009). Both Kitisuru and Karen represent sublocations of the highest wealth category, consisting of detached dwellings with relatively large surrounding compounds.
Three clusters of isolates were found to be significantly associated with goats (n=18), rats (n=14) and rabbits (n=5), respectively. The longer phylogenetic distances of well over 10 core SNPs separating isolates in these hosts suggest transmission over a longer timescale, which is supported by the distribution of these isolates across multiple households and sublocations (Figure 1 and Supplementary Figure S2). The largest cluster associated with a single host type is the ST297/ST9433 caprine cluster, which is found in multiple sublocations (Figure S2a). The second largest cluster, belonging to the ST9441 lineage, is so far unique to this dataset and Nairobi. This cluster is also found across at least nine sublocations and is significantly associated with wild rodents.

We employed core genome multi-locus sequence typing (cgMLST), a high-resolution typing method that is more reproducible and comparable across larger datasets (19). First, we used cgMLST to compare the global diversity of sequenced E. coli with the isolates in Nairobi by performing an all-vs-all pairwise comparison of cgMLST distances between the 1,338 Nairobi genomes and 28,382 publicly available whole genome sequences that were annotated with source type, place and year of isolation from Enterobase (downloaded June 2018). We found that the closest related isolates to those found

Clonal strain sharing of isolates is primarily shaped by household structure.
When the frequency distribution of pairs of isolates differing by fewer than 100 cgMLST loci is plotted, we find a total of 150 pairs of isolates in our collection that differ by 10 or fewer cgMLST alleles (Figure 2). These pairs comprise 187 (14%) isolates, with some isolates involved in multiple pairs. Data on household and host type for these 150 pairs revealed that the majority occur between hosts from the same household (n=101, 67%), with the remaining 33% (n=49) involving hosts from different households. Given the low genetic distances and epidemiological context, we refer to these pairs of ≤10 cgMLST as sharing pairs. These sharing pairs are inferred to be evidence of recent strain sharing, either by direct transmission or by acquisition from a common source.
WGS studies of E. coli outbreaks in humans showed that epidemiologically linked isolates differed by up to four core genome SNPs when isolated within 30 days of each other; when isolates are separated by five to ten core SNPs, this timeframe increases to an average of 8 months (20).
Although the cgMLST genetic distance used in this analysis is not directly comparable to core single nucleotide polymorphism (SNP) distances, 96% of the sharing pairs (n=144) were separated by four or fewer core genome SNPs and almost all pairs (99%, n=149) by a maximum of 10 core SNPs. Therefore
To understand the contribution of intra-host diversity to the number of detected sharing pairs, we obtained multiple isolates per host for a subset of six households (KHW050, KIG019, KIG020, KOR058, UTH029 and VIW002). Ten isolates per host were sequenced from two adult humans, a chicken and a goat from the same household. Comparing these isolates in the context of the larger dataset increased the number of human-livestock sharing pairs within households by only 7 pairs. These pairs were due to 2 clusters of clonal isolates found in either poultry or ruminants in 2 households (UTH029 and VIW002). Between households, we found only two more sharing pairs: one involving two humans, and one between a goat and a chicken. The absence of any substantial change in the distribution of sharing pairs provides confidence in our single-sample approach. Our single-isolate-per-host approach to sampling maximises the ability to detect sharing potential among humans and livestock, while minimising the cost and effort of additional sampling and culturing.

Husbandry is a risk factor for E. coli sharing between humans and animals
We identified ten sharing pairs involving human and livestock isolates belonging to STs that are not host-restricted and have been associated with a variety of sources and host species (Table 1).
All sharing pairs involved human males (p-value = 0.003, Fisher's Exact test). Six of the ten sharing pairs involved humans and livestock in the same household, while four humans (not keeping livestock) shared bacteria with livestock from other households. Six of the seven persons for whom data were available (we lacked data for three people) had direct contact with livestock through collecting eggs, slaughter, milking or handling, but one person had no history of livestock contact (Table 1).

Sharing of E. coli core genome, accessory genome and resistome is shaped by host and households
While the sharing threshold for the core genome was a cgMLST distance of ≤10, sharing for the pangenome and resistome was based on a Jaccard similarity index (JI), where a cut-off threshold for sharing was defined in the same way as for the core genome. For the pangenome/accessory genome this was determined to be JI ≤ 0.98 (Fig 3c, d). Resistome sharing was defined as JI = 1 (Fig 3e, f), meaning that to be considered a sharing pair two isolates needed to have an identical antibiotic resistance gene profile, with a minimum of two AMR genes in each isolate. Denominator values were based on the number of pairs of isolates in each category, assuming an equal probability of sharing among isolates. We resampled the observed values to generate expected distributions of events based on the frequencies of these expected values (see Methods for details). From this we were able to assess whether our observed number of sharing pairs fell above, below, or within the range we may expect given the sampling effort.
Household and host category strongly influenced the distribution of sharing of E. coli isolates in both the core genome and the pangenome in Nairobi (Figure 3a-d). Within households, sharing of E. coli isolates was consistently higher than expected within the same host category (Figure 3a, c). No strong pattern was observed between households, where the observed shared E. coli isolates fell largely within the expected range (Figure 3b, d). Resistome sharing was predominantly low between different hosts, but high between poultry isolates, irrespective of household structure (Figure 3e, f). Sharing among poultry in the same household was particularly high across all three definitions of sharing and similarity, i.e. the core, pangenome and resistome (LB-LB in Figure 3).
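The resistome sharing definition given above (JI = 1, with at least two AMR genes in each isolate) can be illustrated with a minimal Python sketch. This is not the analysis code used in the study: the function names, the handling of empty profiles and the example gene lists are assumptions made purely for illustration.

```python
MIN_AMR_GENES = 2  # minimum number of AMR genes required in each isolate


def jaccard_index(genes_a, genes_b):
    """Jaccard similarity between two gene sets (1.0 means identical sets)."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        # Assumption: two empty profiles are treated as not similar.
        return 0.0
    return len(a & b) / len(union)


def is_resistome_sharing_pair(genes_a, genes_b, min_genes=MIN_AMR_GENES):
    """True if both isolates carry >= min_genes AMR genes and the sets are identical."""
    a, b = set(genes_a), set(genes_b)
    if len(a) < min_genes or len(b) < min_genes:
        return False
    return jaccard_index(a, b) == 1.0


# Hypothetical AMR gene profiles for three isolates (illustrative only).
isolate_1 = ["blaTEM-1", "tet(A)", "sul2"]
isolate_2 = ["sul2", "blaTEM-1", "tet(A)"]  # same genes, different order
isolate_3 = ["blaTEM-1", "sul2"]            # subset of isolate_1

shared_12 = is_resistome_sharing_pair(isolate_1, isolate_2)  # identical sets
shared_13 = is_resistome_sharing_pair(isolate_1, isolate_3)  # JI = 2/3
```

Working on sets rather than ordered lists means gene order and duplicate calls do not affect the comparison, and the minimum-gene condition prevents isolates with one or no AMR genes from counting as sharing pairs simply because their profiles match.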
To further investigate resistome similarity between hosts, we performed the same analysis with sharing classed as two isolates sharing resistance genes that confer drug resistance to a given class of antibiotics. We compared 8 classes of antibiotic whose resistance genes were found in the population and found that between households, poultry-poultry sharing continued to be much greater than the expected range (Supplementary Figure 7). Resistome sharing among poultry does not therefore appear to be driven by resistance to a single or few antibiotic classes. Human-human sharing between households was also higher than expected, suggesting similar antibiotic selection pressures on human isolates across the board.

In our study, we found that household stratification drives clonal strain sharing. Previous studies have shown an important role of the household as a driver for sharing similar microbiomes or bacteria in humans and companion animals (24-27). Our findings show that strain sharing can involve humans, livestock and wildlife found in the same household or area.
The use of contemporary isolates in our sampling increased our ability to find clonal isolates that overlap between hosts, households and sublocations. Previous work using whole genomes either found no overlap or found isolates that were separated by more than ten core SNPs, which does not provide strong evidence for a recent sharing event (28, 29). Although such sampling is challenging in practice, we have demonstrated the importance of large-scale structured sampling for understanding strain sharing at the population level.
Our comparison of the isolates in our study to isolates from a global collection (Enterobase) revealed that globally dispersed lineages differed by at least 9 cgMLST loci.
At this level of divergence or higher, isolates in Nairobi could not be differentiated from globally circulating clones found in other parts of the world. The genetic diversity in the two largest clonal lineages in the dataset (ST297 and ST9441), circulating in goats and rodents, was similar to the diversity of ST131 isolates circulating globally. This shows that beyond the 10 cgMLST threshold, epidemiological links from strain sharing or transmission events become obscured by the bacterial diversity present in the environment.
Genotype similarity of the core and accessory genome within households is posited to be driven by direct and social contact among individual hosts (30, 31). Consistent with expectation, host type was also demonstrated to be a strong driver of E. coli isolate sharing within households (Fig 3). Members of the same host category, particularly in the same household, are more likely to have direct and/or indirect contact within shared environments, creating increased opportunity for bacterial sharing (14, 23, 24, 30-32).
Eight of the ten observed human-livestock sharing pairs involved poultry. Inhalation and ingestion of faecal dust from poultry have previously been identified as a significant risk in the spread of bacteria from one host to another, both within poultry populations and to humans working in close contact with them (33). Furthermore, it has been previously hypothesised that poultry is likely to be a reservoir of the global epidemic strain, ST131 (34, 35). Humans in direct contact with livestock were more prone to sharing E. coli isolates, likely through direct contact with meat and faecal matter. Though the sample size is small, this result is consistent with previous work postulating direct contact as a risk for bacterial sharing events (26, 36). We note that the strong host type signal for E. coli sharing within a household (Figure 3a) does not hold true when examining pairs between households (Figure 3b). This could be due to a higher diversity of E. coli in the wider population, leading to a lower probability of detecting closely related strains.
Our resistome sharing analysis also suggests disproportionately higher rates of resistome similarity among poultry, irrespective of the household, compared to the other host groups. As poultry isolates are phylogenetically diverse, the presence of a common selection pressure could explain this observation. Across Nairobi, poultry are routinely exposed to a set regimen of antimicrobial agents (for therapeutic or prophylactic purposes) and such regimens vary minimally from one location to another (37). Conversely, a wider range of combinations of antimicrobials is available for use in ruminants and monogastrics, including an array of injectable formulations, and these vary greatly from one farm to another. We also find resistome similarity to be high among human and wildlife isolates, both mammals and birds.
The similar availability and usage patterns of antibiotics in the human population across the city could explain the similarity seen in humans, suggesting that resistome similarity arises from prevailing selective pressures rather than spread from a common source. The presence of manure, rubbish and human waste (all contaminated with potentially similar kinds of AMR pathogens and antimicrobials) across the urban landscape of Nairobi provides a conduit for acquisition and/or selection of similar resistomes in wildlife, which act as a sink population for AMR (12).
We observed a higher than expected level of accessory genome sharing between wild mammals (bats and rodents) between households, apparently involving divergent lineages, as we did not see the same pattern at the core genome level. Other types of wildlife, for example wild birds, have been shown around the world to carry and transmit E. coli and should be considered a public health risk (38-40). Our findings suggest that the role of rodents and bats should also be considered.
Our study design focuses on the breadth of sampling over depth, and as a single isolate is sampled from each host, our approach does not account for intra-host diversity. Previous studies found the intra-host diversity of E. coli strains to be variable across host populations, and taking single isolates has the potential to underestimate the number of sharing pairs (41). However, we showed that for a subset of six households in our study, increasing sampling tenfold had a minimal effect on the number of inter-household and inter-host sharing pairs that were detected. Higher intra-host diversity in different host populations, for example between wildlife and domesticated animals, may reduce the probability of finding sharing pairs in hosts with higher bacterial diversity.
Future studies should therefore take into account both inter- and intra-host diversity to expand on our findings.

Employing an epidemiologically structured sampling framework and using highly discriminatory whole genome sequencing, our study provides detailed insight into the strain diversity of E. coli across a fast-growing African city where livestock-keeping within households is commonplace. To our knowledge, this is one of the largest and most comprehensive surveys of the bacterial genomic landscape in an urban environment to date, and serves as a model for epidemiologically structured, targeted sampling and whole genome sequencing of human and animal-borne bacteria. We found evidence of recent clonal sharing between humans and livestock and show that the E. coli population structure in humans, livestock and wildlife in this environment is primarily shaped by household and host type, but not by animal husbandry. We also found similarities in the resistome of the isolates that did not match the patterns of shared genomes and presumably reflect common antibiotic usage practices, particularly in poultry. These findings provide empirical support for the hypothesis 'everything is everywhere' (frequent sharing of bacteria and AMR genes between households) but 'environment selects' (different households and hosts have different bacterial and resistome persistence). Further work, guided by the finding of where clonal sharing is most likely to be found, will be required to quantify spillover risk associated with the main routes of inter-host transmission.

for 24 h, and thereafter plated on to eosin methylene blue agar (EMBA) and incubated for 24 h at 37 °C. Subsequently, five colonies were selected and subcultured on EMBA, before being further subcultured on Müller-Hinton agar.
A single colony was picked at random from the plate for each original sample (hereafter referred to as an 'isolate') and a 10-parameter biochemical test (triple sugar iron agar = 4, Simmons' citrate agar = 1, motility-indole-lysine media = 3, urease production from urea media = 1, and oxidase from tetra-methyl-p-phenylenediamine dihydrochloride = 1) was used for presumptive identification of E. coli.

Phylogenetic analyses
A core genome alignment was generated using Snippy v4.6.0 (with default settings) using EC958 as a reference genome (GCA_000285655.3). A phylogenetic analysis of the core genome alignment was performed using IQ-TREE (v1.6.12) with the options -m TVM+G4 -bb 1000 -safe. The tree and metadata were visualised in iTOL v4.3 (itol.embl.de). Due to the species-level diversity of the isolate collection, positions in the alignment in recombinant regions of the genome were not removed.
Ad hoc core genome multi-locus sequence typing (cgMLST) was performed on genome assemblies using chewBBACA (v. 2.0.11) with the 2513-gene cgMLST profile from Enterobase (downloaded October 2018).
The association between metadata (sublocation, host category) and phylogeny was tested using Phylotype (46). A minimum of 5 isolates was required to define a cluster, with a maximum internal cluster distance of 200 core SNPs.

Identification of putative bacterial sharing
A genetic distance matrix was calculated from all pairwise allelic profile comparisons using the library "ape" in R (Paradis et al., 2004). The cgMLST cutoff of 11 alleles to define putative E. coli transmission clusters (defined here as sharing pairs) was based on the observed bimodal distributions of inter- and intra-household allele differences (Supplementary Figure S9). The R package "cutpointr" was used to validate this cutoff as the optimal value to differentiate pairs that occur within and between households (47).
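The pairwise comparison described above can be sketched in a few lines of Python. This is an illustration only: the study itself used the "ape" library in R, whereas the sketch below uses plain Python, applies the definition of a sharing pair as 10 or fewer allele differences given in the Results, and assumes that loci with a missing allele call (represented here as None) in either isolate are skipped. The isolate names and allele numbers are hypothetical.

```python
from itertools import combinations

SHARING_THRESHOLD = 10  # allele differences; pairs at or below this are sharing pairs


def allelic_distance(profile_a, profile_b):
    """Count loci with different allele calls, skipping loci missing in either profile."""
    return sum(
        1
        for a, b in zip(profile_a, profile_b)
        if a is not None and b is not None and a != b
    )


def find_sharing_pairs(profiles, threshold=SHARING_THRESHOLD):
    """Return (isolate_1, isolate_2, distance) for every pair within the threshold."""
    pairs = []
    for (id_a, prof_a), (id_b, prof_b) in combinations(sorted(profiles.items()), 2):
        distance = allelic_distance(prof_a, prof_b)
        if distance <= threshold:
            pairs.append((id_a, id_b, distance))
    return pairs


# Toy cgMLST profiles over 12 loci (a real scheme has thousands of loci).
profiles = {
    "iso_A": [1] * 12,
    "iso_B": [1] * 10 + [2, 2],   # differs from iso_A at 2 loci
    "iso_C": [3] * 11 + [None],   # differs from iso_A at 11 called loci
}

pairs = find_sharing_pairs(profiles)
```

With these toy profiles only iso_A and iso_B fall within the threshold. How missing allele calls are treated can change distances between incomplete assemblies, so that choice should be checked against the typing tool and distance function actually used.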

Epidemiological analysis of sharing
We established epidemiological links between every possible pair of E. coli isolates through a systematic comparison. Household-level sharing was categorised as: within household, if a sharing pair involved isolates/hosts from the same household; between household, if a sharing pair involved isolates in different households. Wildlife isolates that could not be attributed to a specific household were omitted from the sharing analysis (Table S2).
We condensed our host types into five broad categories (Tables S1, S2): (i) Humans; (ii) Livestock birds, i.e. poultry, dominated by chickens; (iii) Livestock mammals, consisting of ruminants and monogastric livestock; (iv) Wild birds, predominantly seed-eating birds such as house sparrows; and (v) Wild mammals, predominantly rodents, along with bats. Primates were omitted from the sharing analysis as they were only associated with two households, along with some samples derived from populations of bats and wild birds which could be attributed to sublocation but not household.
While the sharing threshold for the core genome was a cgMLST distance of ≤ 10, sharing for the pangenome and resistome was based on a Jaccard similarity index (JI; between 0 and 1, where 1 is identical), where a cut-off threshold was defined in a similar way to the core genome. For the pangenome/accessory genome this was determined to be JI ≤ 0.98 (Fig 3c, d). Resistome sharing was defined as JI = 1 (Fig 3e, f), with each isolate having a minimum of two AMR genes. In practice, this means that two isolates must share an identical set of two or more AMR genes.
We used the sharing thresholds for each facet of the E. coli genome and applied a correction to the number of pairs counted when multiple connections involved the same isolate(s).
For example, if 4 isolates formed a cluster in which all were within the 10 cgMLST threshold of each other, the maximum number of pairs/connections that can be drawn between these 4 isolates is 6, or n(n-1)/2, where n is the number of isolates in the cluster. However, when the correction is applied, we count only 3 connections, or n-1. This avoids the overestimation of sharing events between larger clusters of clonal isolates and provides a more realistic estimate of sharing.
Having defined the set of observed sharing events among each of our host categories within and between households, we then sought to detect whether these observed events fell above or below what might be expected given the sampling effort. These denominator values were based on the number of pairs of isolates in each category, assuming an equal probability of sharing among isolates. Within households this was calculated using the formula n(n-1)/2, where n = number of samples of a given host type. Between-household sharing was calculated as n1 × n2, where n1 = number of samples of a given host in household 1, and n2 = number of samples of a given host in household 2. This approach generated a list of all possible paired (expected) sharing events for each category type. From this we calculated the expected frequencies of each type of category sharing within and between households. We then used a resampling approach of the observed values (1000 times) to generate expected distributions (± 95% confidence intervals) of events based on the frequencies of these expected values. From this we were able to assess whether our observed sharing events fell above, below, or within the range we may expect given the sampling effort. The same approach was applied to all aspects of genome sharing (Figure 3 a-