Outbreak dynamics of foodborne pathogen Vibrio parahaemolyticus over a seventeen year period implies hidden reservoirs

Controlling foodborne diseases requires robust outbreak detection and a comprehensive understanding of outbreak dynamics. Here, by integrating large-scale phylogenomic analysis of 3,642 isolates and epidemiological data, we performed ‘data-driven’ outbreak detection and described the long-term outbreak dynamics of the leading seafood-associated pathogen, Vibrio parahaemolyticus, in Shenzhen, China, over a 17-year period. Contradictory to the widely accepted notion that sporadic patients and independent point-source outbreaks dominated foodborne infections, we found that 71% of isolates from patients grouped into within-1-month clusters that differed by ≤6 single nucleotide polymorphisms, indicating putative outbreaks. Furthermore, we showed that despite the long time spans between clusters, 70% of them were genomically closely related and were inferred to arise from a small number of common sources, which provides evidence that hidden persistent reservoirs generated most of the outbreaks rather than independent point-sources. Phylogeographical analysis further revealed the geographical heterogeneity of outbreaks and identified a coastal district as the potential hotspot of outbreaks and as the hub and major source of cross-district spread events. Our findings provide a comprehensive picture of the long-term spatiotemporal dynamics of foodborne outbreaks and present a different perspective on the major source of foodborne infections, which will inform the design of future disease control strategies. Long-term spatiotemporal dynamics of foodborne pathogen Vibrio parahaemolyticus outbreaks suggests hidden reservoirs as source of infections.

F oodborne diseases (FBDs) are major and enduring threats to public health 1 . Because of the ubiquitous nature of foodborne pathogens in the natural environment and their ability to contaminate the farm-to-fork continuum, FBDs were characterized by occasional outbreaks on the background of steady sporadic patients. Effective control requires robust detection of outbreaks from baseline sporadic patients and a comprehensive understanding of the key properties of outbreak dynamics 2,3 .
Traditional FBD outbreak surveillance was passive and triggered mainly by notification of clustered cases, which made it difficult to detect small-scale, protracted, or cross-regional outbreaks. The establishment of molecular subtyping-based networks substantially promoted FBD surveillance and enabled outbreaks to be detected in an active 'data-driven' way by identifying genetically related clusters of infections. Whole-genome sequencing (WGS) provides ultimate resolution to discriminate outbreak from sporadic isolates and is becoming the reference method in outbreak investigations. Moreover, WGS enables robust phylogenetic reconstruction to reveal the evolutionary relationships between isolates and identify isolates arising from common sources 2,3 . In practice, phylogenomic clusters have been demonstrated to have high concordance with epidemiological links in outbreak investigations 4,5 , suggesting that these clusters can be good indicators to quantify and characterize the dynamics of outbreaks. Recent studies showed the power of phylogenomic clustering analysis in detecting cryptic outbreaks that were neglected by traditional surveillance 6,7 , and in tracking and resolving international outbreaks 8,9 .
Vibrio parahaemolyticus (VP) is the leading cause of seafood-associated bacterial gastroenteritis worldwide 10 . Most pathogenic VP produce thermostable direct hemolysin (TDH) and/ or TDH-related hemolysin (TRH). Notably, since the emergence and pandemic spread of the serotype O3:K6 clonal group in 1996, VP prevalence has increased globally over the past two decades and its geographical range has extended to high-latitude areas where it was previously unreported 10 . Given the increasing disease burden, current outbreak surveillance of VP is based mainly on passive notification, whereas WGS-based epidemiology studies have focused on characterizing a small number of outbreaks [11][12][13][14][15] and local epidemic dynamics in low-prevalence regions 16,17 .
China has a high prevalence of VP infections, especially in southern coastal regions such as Shenzhen 18 , a populous city with approximately 13 million people. A well-functioning integrated surveillance network of FBDs was established in Shenzhen, which captured a dense collection of isolates from documented outbreaks, sporadic patients and the environment. We sequenced all the archived VP isolates collected by the network between 2002 and 2018. By integrating large-scale phylogenomic analysis and epidemiological data, we described the genomic baseline of pathogenic Controlling foodborne diseases requires robust outbreak detection and a comprehensive understanding of outbreak dynamics. Here, by integrating large-scale phylogenomic analysis of 3,642 isolates and epidemiological data, we performed 'data-driven' outbreak detection and described the long-term outbreak dynamics of the leading seafood-associated pathogen, Vibrio parahaemolyticus, in Shenzhen, China, over a 17-year period. Contradictory to the widely accepted notion that sporadic patients and independent point-source outbreaks dominated foodborne infections, we found that 71% of isolates from patients grouped into within-1-month clusters that differed by ≤6 single nucleotide polymorphisms, indicating putative outbreaks. Furthermore, we showed that despite the long time spans between clusters, 70% of them were genomically closely related and were inferred to arise from a small number of common sources, which provides evidence that hidden persistent reservoirs generated most of the outbreaks rather than independent point-sources. Phylogeographical analysis further revealed the geographical heterogeneity of outbreaks and identified a coastal district as the potential hotspot of outbreaks and as the hub and major source of cross-district spread events. Our findings provide a comprehensive picture of the long-term spatiotemporal dynamics of foodborne outbreaks and present a different perspective on the major source of foodborne infections, which will inform the design of future disease control strategies.
'Data-driven' outbreak detection. To obtain the genomic similarity cut-off of phylogenomic clusters (P-cluster) representing putative outbreaks, we analysed the pairwise SNP distances between 615 isolates from 126 FDOS outbreaks (Fig. 3a,b). As expected, most isolates had few SNP differences, and 93% of isolates had ≤6 SNPs before and after removing SNPs caused by recombination. The SNP distance distribution between FDOS-outbreak isolates in different PCGs was similar to the overall distribution (Extended Data Fig. 4). The entire dataset had a discontinuous SNP distance distribution, with <1% pairs of isolates having ≤6 SNPs. When SNP numbers were >6, the number of clustered isolates increased substantially, indicating that isolates without epidemiological links had been introduced (Fig. 3b). Therefore, we selected ≤6 SNPs as the cut-off. Furthermore, we analysed external datasets of outbreak isolates from different countries [11][12][13][14][15] and identified a maximum of 6 SNPs (Extended Data Fig. 4), showing the generalizability of this cut-off.

Ob-clusters vs documented outbreaks.
The numbers of Ob-clusters and associated patients were substantially larger than those of FDOS outbreaks (2.9-fold and 3.1-fold, respectively) (Fig. 3a). Notably, 68% (1,259/1,856) of isolates from outpatients that were previously considered as sporadic, were assigned into Ob-clusters. Specifically, 207 (56%) of the 370 Ob-clusters contained exclusively 843 outpatient isolates, which were not captured by FDOS, representing unrecognized cryptic outbreaks. These Ob-clusters were characterized by their small scale; 197 (95%) of them each involved ≤10 patients (Extended Data Fig. 6). However, 582 (69%) and 634 (75%) of the 843 outpatients were identified in the same sentinel hospital and district, respectively, as another patient in an Ob-cluster; that is, they were geographically close, indicating possible epidemiological links among them. Moreover, 98 (26%) Ob-clusters contained patients from both FDOS outbreaks (n = 615) and outpatients (n = 416), with the latter representing missed outbreak patients. Most (87%, 362/416) of these outpatients were not identified in the district where an FDOS outbreak was reported, which may explain the unrecognized links between them and FDOS-outbreak patients (FDOS investigations were usually limited to local districts).

Genomic links between Ob-clusters over long time spans.
Despite the long time spans between Ob-clusters, 258 (70%) of 370 Ob-clusters were genomically linked, which can be assigned into 61 P-clusters. Each of these P-clusters contained 2-20 Ob-clusters, indicating that persistent common reservoirs caused continuous outbreaks (Figs. 1b and 3d). The minimum-spanning trees of some P-clusters (for example, PC033 and PC202) were 'star-like' , indicating continuous outbreaks caused by a single lasting common source. The trees of other P-clusters (for example, PC052 and PC176) had 'multiple stars' topologies, which may indicate related outbreaks caused by different but linked sources, or strains that colonized and evolved in a local niche and then led to multiple waves of outbreaks (Fig. 3d). To test the hypothesis of ongoing evolution in persistent reservoirs, we constructed dated maximum clade credibility (MCC) trees of two representative P-clusters, PC052 and PC176 (Extended Data Fig. 7). Most isolates from the same Ob-cluster grouped together in the trees and shared a very recent common ancestor. Moreover, both MCC trees were generally staircase-like, analogous to the typology of V. cholerae isolates from the Haiti cholera outbreak between 2010 and 2012 19 , which were consistent with the hypothesis.
The identification of these long-term P-clusters indicated that the common sources had not been fully eliminated after the initial outbreak but persisted in some hidden reservoirs. To quantify preventable outbreaks/patients if the persistent reservoirs are eliminated, we defined continued Ob-clusters/patients (that is, Ob-clusters/patients identified after the initial (earliest) Ob-cluster within a P-cluster) to represent preventable outbreaks/patients; the others were defined as initial ones (Figs. 1b and 3d). There were 197 (53%) Ob-clusters and 1,308 patients (49% of all patients) assigned as continued Ob-clusters/patients. Spatiotemporal dynamics of Ob-clusters. The number of Ob-clusters fluctuated annually, and initial and continued Ob-clusters alternately became dominant, with continued Ob-clusters lagging behind the initial Ob-clusters (Fig. 4a). Since 2006, the proportion of Ob-cluster patients was stable, leading to 76%-91% of patients each year, whereas the proportions of initial and continued Ob-clusters/patients were variable, with continued exceeding initial Ob-clusters in 10 of the 13 years. Monthly, 251 (68%) Ob-clusters and 1,727 (67%) patients were concentrated at the peak of seasonal outbreaks (July to September). The number of initial Ob-clusters showed a similar changing trend as continued Ob-clusters, with the latter lagging behind the former. Notably, during the 3-month peak period, the number of continued exceeded initial Ob-clusters and accounted for 57% of the total patients.
The geographical distribution of 370 Ob-clusters was uneven (Fig. 4b,c). There were 311 (84%) Ob-clusters that accumulated in four northern districts with high VP prevalence (BA, GM, LoH and LG) (Fig. 4c), and there was a significantly higher proportion (P < 0.001, FET) of continued patients (54%) in these four districts compared with the other districts (38%). A hotspot of Ob-clusters was identified in the western coastal district, BA, where 71% of the Ob-clusters can be detected (Fig. 4b,c). Notably, this hotspot was not obvious based on FDOS results. Only 5% (6/126) of FDOS outbreaks were reported here, and 69% of Ob-clusters were cryptic and were not captured by FDOS.
There were 236 (64%) cross-district Ob-clusters identified in multiple (2-6) districts, accounting for 71%-100% of total Ob-clusters in each district (Fig. 4b), suggesting frequent spread of strains between different districts. In the hotspot district, 83% of the total cross-district Ob-clusters and 41%-85% of the cross-district Ob-clusters in other districts were detected, suggesting that the hotspot acted as the hub of cross-district spread. Cross-district Ob-clusters were most prevalent between the hotspot district and three other northern districts with high VP prevalence, where 70%-85% of the cross-district Ob-clusters were related to those of the hotspot.
To further infer the geographical source of the 236 cross-district Ob-clusters, we first analysed the locations of the earliest detected isolates from a cross-district Ob-cluster and found that 43% were in the hotspot (Fig. 3d). Furthermore, we performed phylogeographical analysis and identified the probable source district of 162 cross-district Ob-clusters and found that 79 (49%) were in the hotspot (Fig. 3d). Subsampling analysis based on a balanced number of isolates (1-3) in each district gave similar results, with 77 (49%) of 158 inferred source districts being from the hotspot (Extended Data Fig. 8). These results indicate that the hotspot was also a major source of cross-district spread. Moreover, we found a correlation between geographical and pairwise SNP distance between isolates from Ob-clusters sourced from the hotspot. Isolates with greater geographical distances to the hotspot had larger mean SNP distances (Extended Data Fig. 9), which may be associated with the extra diversity accumulated during cross-district spread of strains.

Discussion
In one of the largest bacterial WGS projects conducted so far, we provide a comprehensive picture of the long-term spatiotemporal dynamics of the leading seafood-associated bacterial pathogen in a high-prevalence city. We showed that pathogenic lineages were stable over a decade, with two major PCGs-PCG3 and PCG189driving local epidemics, which may be related to their higher virulence 20 and/or adaptability. Through active 'data-driven' analysis, we showed that 71% of patient isolates can be grouped into Ob-clusters. Furthermore, we showed that despite the long time spans between clusters, 70% of them can be genomically linked and were inferred to arise from a small number of common sources, which provides evidence that continuous outbreaks caused by hidden persistent reservoirs dominated local infections. Our findings challenge the widely accepted notion of the major source of foodborne infections, that is, sporadic patients and independent point-source outbreaks dominate infections.
Typically, an FBD outbreak is declared when a common source is identified, normally food that all the associated patients were exposed to. However, for many foodborne pathogens that are ubiquitous and persistent in the environment such as Vibrio, it is usually difficult to track back to the common food source. In this particular case, the environmental source (coastal area) could be considered the equivalent of a food facility contaminated by a persistent foodborne pathogen causing recurrent cases over a long period. For VP, a genetic variant or pathogenic strain can be dominant in a particular coastal region, causing infections through diverse routes (for example, different food products) coming from different sites within this region where they can persist for long periods. Phylogenomic clustering analysis with linked epidemiological data can solve some of the difficulties by detecting and charactering outbreaks in a 'data-driven' way. This approach has been widely used and was demonstrated to be highly concordant with epidemiological links in practice 4,5 . We defined Ob-clusters, that is, genetically and spatiotemporally closely related isolates from patients, to represent putative outbreaks. However, it should be noted that the source of a putative outbreak refers to a common environmental source, including but not necessarily a single food source. Different food sources could be linked to a common environmental source.
The inherent limitation of passive outbreak surveillance that is triggered by notification of clustered cases makes it inevitable to underestimate the magnitude of outbreaks. The numbers of Ob-clusters and associated patients we detected were approximated to be 3-fold those of the FDOS outbreaks, indicating a substantial underestimation of outbreaks. Surprisingly, a high proportion of Ob-clusters contained exclusively or included previously defined 'sporadic' patients, representing cryptic outbreaks or missed outbreak patients. Most of the cryptic Ob-clusters were small scale and most of the missed outbreak patients were not in the district where the associated FDOS outbreaks were detected, which may explain why they were unrecognized by traditional surveillance. These results highlight the advantages of centralized sequencing and analysis of data to trace outbreaks. In prospective studies of Salmonella and Listeria monocytogenes, centralized sequencing and analysis enabled the detection of multiple cryptic outbreaks and follow-up investigation established the epidemiological links between patients 6,7 .
Transient point-source outbreaks have been considered as the common pattern of FBD outbreaks, although protracted outbreaks have also been reported 7,21 . We linked 70% of the Ob-clusters to 61 P-clusters with 2-70 months duration, suggesting that most outbreaks were probably related to a small number of hidden persistent reservoirs. Because of its poor resolution, conventional subtyping-based surveillance cannot accurately link extended time-period outbreaks, which are therefore treated as independent events. Notably, there were more continued Ob-clusters/patients than initial ones, and more than half of the Ob-clusters/patients during the peak period of seasonal outbreaks were continued ones. These continued Ob-clusters/patients would be preventable if the persistent reservoirs were eliminated in time.
Persistent reservoirs can be caused by repeated introduction from external sources and/or local colonized sources. VP infections are commonly associated with consumption of contaminated seafood, and most of the seafood available in Shenzhen is imported from outside. Because the external seafood sources vary and can be affected by many determinants, the scenario of repeated external introduction may not be the general cause. Instead, the establishment of local long-term stable environmental reservoirs in, for example, seafood markets or seawater of a coastal area, is more probably the general cause. Cross-contamination of seafood by pathogenic strains from one food source to another can occur in a market or any other point in the distribution chain. In a site where seafood can be stored for short periods (for example, wet storage), pathogenic strains can even evolve under favourable conditions (for example, warmer waters). Moreover, seawater from where pathogenic strains prevail can also lead to contamination when it is used for washing and cleaning seafood in markets 22 . With the continuous circulation of contaminated seafood, pathogenic strains spread  and cause infections. Further in-depth investigations to clarify the causes and geographical locations of the hidden persistent reservoirs will inform the design of future control and prevention strategies. There were obvious geographical differences in VP infections in Shenzhen. The westernmost coastal district BA had much higher VP prevalence and PCG diversity than the other districts. We also showed that BA was the potential hotspot of outbreaks, acted as the hub of cross-district spread of strains and was a major source of outbreaks, indicating that BA should be the focus of future surveillance.
The key role of BA may be related to its special geographical location on the western coast, where many ports and seafood markets are located. A large amount of seafood is transferred from BA and transported to other districts, and pathogenic strains may spread along the way. Moreover, a recent metagenomics study showed that the abundance of harmful Vibrio on the western coast of Shenzhen was significantly higher than that on the eastern coast 23 .
The FSS system was established to prevent outbreaks and protect patients by actively detecting and eliminating contaminated food. However, we found that food isolates from FSS were distinct from the patient isolates; only 2% of the FSS isolates were from PCGs and only one isolate grouped into an Ob-cluster. This can be explained by the possibility that contaminated food that could lead to infection was not collected. However, even for strains from known contaminated food samples in epidemically confirmed FDOS outbreaks, only 23% (12/53) were grouped into Ob-clusters, suggesting that this may not be the general reason. Another explanation is the within-sample diversity of food isolates and the low abundance of pathogenic stains 24 , which were not considered in this study and only one isolate was selected and sequenced for each sample. Therefore, the few PCG isolates identified in FSS may be related to strain diversity and does not necessarily mean absence in the sample. A recent study demonstrated that the use of traditional Vibrio-selective enrichment media substantially reduced (20-fold) the relative abundance of pathogenic to non-pathogenic VP 25 , thereby reducing the possibility of successfully isolating pathogenic strains. We performed enrichment before isolating strains, which may have led to few pathogenic strains isolated in FSS.
Our study has several limitations. First and most importantly, the retrospective study design made it impossible to reinvestigate the epidemiological links among Ob-cluster patients. Further prospective studies are needed to better assess the concordance between Ob-clusters and epidemiologically defined outbreaks, as well as the difference and links between environmental and food sources. Second, the number of Ob-clusters/P-clusters represented only the minimum estimation of the magnitude of outbreaks because only outpatients from 16 sentinel hospitals were included in this study. Finally, within-sample diversity of strains was not assessed, and associated studies may provide clues for the large SNP distances observed between some FDOS-outbreak isolates and the few PCG isolates from FSS.
In summary, we integrated citywide 'big data' of genomes and epidemiological data to establish a model for mining cryptic outbreaks, possible persistent reservoirs, the spread network, and its hub and source. Our results will help not only to improve outbreak and proportion (bars, right Y axis) of initial (blue), continued Ob-cluster (red) and non-Ob-cluster (green) patients. Years/months with less than 50 patient isolates were merged. b, Spatial distribution of Ob-clusters in ten districts. Numbers in brackets next to the heat map indicate the total number of Ob-clusters in a district. Numbers in the heat map indicate the number of local Ob-clusters or shared cross-district Ob-clusters between two districts, and colours indicate the proportions of local/cross-district Ob-clusters to the total number of the upper and right districts. c, Geographical distribution of initial and continued Ob-cluster patients and non-Ob-cluster patients in ten districts. The map was created on the basis of public geographical data from OpenStreetMap. d, Earliest detection locations (red) and inferred source districts (blue) of cross-district Ob-clusters.
surveillance and prevention of VP, but also to provide a basis for the establishment of a 'data-driven' strategy for the prevention and control of other FBDs.

Sampling framework and data collection. The Shenzhen Centers for Disease
Control and Prevention (CDC) established a well-functioning FBD surveillance network consisting of three systems: Foodborne-Disease-Outbreak-Surveillance (FDOS), Infectious-Diarrhoeal-Diseases-Surveillance (IDDS) and Food-Safety-Surveillance (FSS). All hospitals in Shenzhen are required to report suspected outbreaks to district CDCs, followed by epidemiological investigation and laboratory confirmation. The FDOS was performed independently by ten district CDCs, then reported to the city CDC. Documented FDOS outbreaks met the following criteria: (1) two or more clustered cases of diarrhoea; (2) history of dining together within 3 days of disease onset; (3) positive test for VP with the same serotype. Isolates from patients and food associated with the suspected outbreaks were collected by the FDOS, and only VP isolates with the same serotype were defined as FDOS-outbreak isolates. The IDDS routinely collected isolates from outpatients with diarrhoea who presented at 16 sentinel hospitals. IDDS outpatients were previously considered as sporadic because there were no recognized epidemiological links. FSS strains were from food (mainly fish and shellfish) and the environments of aquatic markets. Only one strain from a patient/environmental sample was archived and sequenced. We sequenced all the archived VP isolates between 2002 and 2018. No statistical method was used to predetermine sample size. No data were excluded from the analyses. The experiments were not randomized. The investigators were not blinded to allocation during experiments and outcome assessment. Patients are not directly involved in the present study. Because data collection is part of infectious disease surveillance, individual informed consent was waived. Investigators were blinded to personal information.
Whole-genome sequencing. All the obtained VP isolates were stored in a −80 °C freezer. For DNA extraction, the isolates were inoculated in thiosulfate-citrate-bile-salts-sucrose (TCBS) agar plates at 37 °C. Genomic DNA was extracted using the sodium-Tris-EDTA (STE) method. Paired-end libraries with a mean insert size of 350 bp were prepared using a NEBNext Ultra DNA library preparation kit (NEB). Whole-genome sequencing was performed on Illumina HiSeq X-Ten platforms; the average read length was 150 bp, and an average of 1.5 Gb clean reads were generated for each strain. The genome quality assessment process is shown in Extended Data Fig. 10.
Phylogenomic clustering analysis. Two or more isolates were classified into CGs, P-clusters and Ob-clusters on the basis of pairwise SNP distances and/or time intervals (Fig. 1). First, CGs were defined to show the overall population structure as previously described 31,32 . We analysed the pairwise SNP distances between all the isolates and selected ≤2,500 SNPs as the cut-off to group isolates into CGs. Second, P-clusters were defined to represent isolates that arose from common sources. Currently, there is no standard definition for such clusters. We analysed pairwise SNP distances distribution between isolates (overall and for major PCGs) from FDOS outbreaks and previously reported outbreaks [11][12][13][14][15] , and selected ≤6 SNPs as the cut-off of P-clusters. After taking recombination into account, we detected 22 outlier FDOS-outbreak isolates that had >100 SNPs from any of the other isolates in the same FDOS outbreak, which were unlikely to have been derived from a recent common ancestor, and were therefore excluded from the analysis. Finally, to compare with FDOS outbreaks that took time intervals into account, we performed secondary clustering of P-cluster isolates on the basis of time intervals and defined within-1-month P-cluster/sub-P-cluster isolates as outbreak clusters (Ob-clusters, Fig. 1). A 1-month window was selected because it is a common period for outbreak analysis 33 and a practical turnaround time for routine sequencing and analysis. Sensitivity analysis was performed to assess the effect of different thresholds (SNPs ≤3 and ≤10; 1-week window) on Ob-cluster definition and interpretation.
Virulence factors. ABRicate (https://github.com/tseemann/abricate) was used to detect the presence/absence of two major virulence factor genes, tdh and trh, which encode thermostable direct hemolysin and tdh-related hemolysin, respectively. The tdh or trh gene was defined as present if the sequence coverage was >70%.
Temporal analysis. We investigated the temporal signals of two representative P-clusters, PC052 and PC172, using TempEst v1.5.3 34 . By calculating the correlation between root-to-tip distances and isolation dates, significant temporal signals were detected for both P-clusters. We further performed Bayesian evolutionary analysis by sampling trees (BEAST) analysis using BEAST v1.10.4 35 to construct dated phylogenies. We selected the GTR+Γ substitution model and used isolation date for tip date calibration. Four combinations of molecular clocks (strict and uncorrelated relaxed clock) and coalescent tree models (constant size and Bayesian skyline population size change) were tested to identify the best fit. Path sampling and steppingstone sampling results showed that the best fits of two P-clusters were strict clocks with constant size coalescent trees, which were used as the priors of further analysis. We ran three independent Markov chain Monte Carlo chains for 10 million steps, sampling every 1,000 steps. Tracer v1.7.1 36 was used to calculate the effective sample size (ESS) values, and the ESS of all the parameters were greater than 200. The tree estimates of different runs were combined to generate the maximum clade credibility trees using TreeAnnotator v1.10.4. The estimated mutation rates of PC052 and PC172 were 6.43 × 10 −7 and 6.83 × 10 −7 substitutions per site per year (3.28 and 3.49 substitutions per genome per year), respectively, which were similar to our previous estimation for two pathogenic clonal groups (PCG3 and PCG36, 95% confidence intervals): 3.6-7.2 × 10 −7 substitutions per site per year 31 .
Phylogeographical analysis. The geographical sources of cross-district Ob-clusters were inferred using SIMMAP (stochastic character mapping of discrete traits on phylogenies) implemented in R package phytools v1.0 37,38 . Geographical districts were treated as discrete traits and mapped to the ML trees using the all-rates-different (ARD) model with 100 replicates. The district of the root node with posterior probability >0.7 was considered as the inferred source. Subsampling analysis was performed by selecting 1-3 isolates in each district next to the root node.
Reporting summary. Further information on research design is available in the Nature Research Reporting Summary linked to this article. Last updated by author(s): May 26, 2022 Reporting Summary Nature Portfolio wishes to improve the reproducibility of the work that we publish. This form provides structure for consistency and transparency in reporting. For further information on Nature Portfolio policies, see our Editorial Policies and the Editorial Policy Checklist.

Statistics
For all statistical analyses, confirm that the following items are present in the figure legend, table legend, main text, or Methods section.

n/a Confirmed
The exact sample size (n) for each experimental group/condition, given as a discrete number and unit of measurement A statement on whether measurements were taken from distinct samples or whether the same sample was measured repeatedly The statistical test(s) used AND whether they are one-or two-sided Only common tests should be described solely by name; describe more complex techniques in the Methods section.
A description of all covariates tested A description of any assumptions or corrections, such as tests of normality and adjustment for multiple comparisons A full description of the statistical parameters including central tendency (e.g. means) or other basic estimates (e.g. regression coefficient) AND variation (e.g. standard deviation) or associated estimates of uncertainty (e.g. confidence intervals) For null hypothesis testing, the test statistic (e.g. F, t, r) with confidence intervals, effect sizes, degrees of freedom and P value noted
For manuscripts utilizing custom algorithms or software that are central to the research but not yet described in published literature, software must be made available to editors and reviewers. We strongly encourage code deposition in a community repository (e.g. GitHub). See the Nature Portfolio guidelines for submitting code & software for further information.
nature portfolio | reporting summary

March 2021
Data Policy information about availability of data All manuscripts must include a data availability statement. This statement should provide the following information, where applicable: -Accession codes, unique identifiers, or web links for publicly available datasets -A description of any restrictions on data availability -For clinical datasets or third party data, please ensure that the statement adheres to our policy