4.1. Description of neighbourhoods monitored and wastewater sampling regime
We have been able to advance and innovate WBS at a granular scale owing to a number of local advantages: (i) Calgary is unique as compared to many cities and maintains separate stormwater and wastewater collection systems (preventing human released ARGs from being diluted by stormwater); (ii) Calgary operates using a single municipal utility provider, enabling consistency in sample collection procedures; (iii) the Urban Alliance is a longstanding partnership between the University of Calgary and the City of Calgary allowing for rapid collaboration on matters relevant to societal health 47. Wastewater samples were collected from a total of eight neighbourhoods in Calgary, Canada (Fig. 1a). Locations were chosen based on accessibility to sampling sites, population size, and in order to achieve a diverse cross-section of the city’s population 24. Sampling was performed during the COVID-19 pandemic from December 02, 2020 to October 05, 2021 (Table 1), yielding 4–10 longitudinal ~ weekly samples for each neighbourhood. We followed previously described procedures to collect composite wastewater samples 24. In brief, wastewater autosamplers (ISCO5800 or ISCO6712) were deployed at manholes selected to sample the flow of sewage from the targeted catchment. The wastewater from these sites is exclusive to the population in each respective catchment (Fig. 1). The autosamplers were programmed to collect composite samples by obtaining 100mL in every 15 mins for 24 hours. Samples were shipped to the laboratory at the end of the 24-hour period and stored at -20°C until further analysis. Details on the sampling sites, and their respective populations is detailed in Table 1.
Supplementary sampling was conducted post-hoc from samples collected at the Bonnybrook Wastewater Treatment Plant (WWTP), serving a population of ~ 900,000 (> 80% of the total population of Calgary) mainly to assess a potential study-to-study variation when comparing our results with other published metagenomes (see 4.7). Three samples were obtained, and used as controls, e.g., Calgary-2022-Jan (26 Jan 2022), in the midst of COVID pandemic; Calgary-2022-Jul (July 06 2022) and Calgary-2022-Nov (23 Nov 2022), after the lifting of COVID restriction. The post-hoc sampling was performed in a same way, as detailed above.
4.2. Determining the socio-economic status of monitored neighbourhoods within Calgary
To establish the SES parameters of the monitored neighbourhoods we used data extracted from the Government of Canada’s 2021 Census 48, using the software Beyond 20/20 Professional Browser release 7.0.5092(32) (Beyond 20/20 Inc., Ottawa, Canada). Detailed extracted variables we assessed fell into the following specific categories: income related (average household size, and median total income of household), education related (the populations with, and without postsecondary degrees or equivalent), and unemployment related factors (unemployed population, and total labour force). Immigration status is known to be another factor that could be associated with SES 49 – immigration related factors such as generation status for the population (first, second, third or more generation), and the place of birth for the recent immigrant population were also extracted. The aforementioned data were accessible at each defined area (DA) level, as determined by the national statistical agency. These DAs were defined at a granular (zoomed-in) scale. Consequently, aggregating multiple DAs provided an imperfect approximation of the entire sewage catchment area (Fig. 1a). Notably, several sewersheds captured modest areas of parks, reactional and industrial sites in addition to the sought large residential areas of study. As a result, we calculated the median of values for multiple DAs, which was then subjected to further analysis, as detailed below.
(Eq. 1) Median total income of household per capita (CAD) = Median total household income/Average household size
(Eq. 2) Education rate = Population with a postsecondary degree (or equivalent)/Total population with and without postsecondary degrees
(Eq. 3) Unemployment rate = Unemployed population/Total labour force
(Eq. 4) Immigration rate = Number of first-generation immigrants/Total population
Subsequently, the median values for the abovementioned parameters were calculated for each neighbourhood, then used when correlated against ARG data.
4.3. Understanding the confounding of air travel in to Calgary during the study
The amount of air travel was estimated using the number of passengers arriving at the Calgary international airport from 2018 to 2022 extracted from Statistics-Canada 15. The yearly aggregated data were extracted from the database, for each sector, namely domestic (flights within Canada), transborder (flights between Canada and the USA), and international sectors (flights between Canada and countries other than the USA) (see Fig. S2).
4.4. Sample processing and DNA extraction
Samples were thawed at 4°C overnight, and after thorough mixing 80 mL aliquots were removed and centrifuged (4,200 rpm at 4°C for 20 minutes). Supernatant was removed, and the volume of the remaining pellets was recorded (~ 0.5 to 7.2mL depending on the amount of pellet) and used for further calculation (i.e., percentage of raw wastewater volume processed for DNA extraction). Aliquots of 250µL of the resuspended pellets (corresponding to 5 to 36 mL of raw wastewater) were subjected to total genomic DNA extraction using the PowerSoil DNA Isolation Kit (Qiagen, Germany) in a final elution volume of 50 µL. DNA (260/280 and 260/230 absorbance ratios) was measured using Nanodrop (Thermo Fisher Scientific, USA) and concentrations were determined by fluorometry using a Qubit™ dsDNA High-Sensitivity Kit (Thermo Fisher Scientific, USA). Extraction blanks were included in every batch to assess for contamination, resulting in concentrations always below the detection limit (< 0.1ng/µL). Extracted DNA samples were stored at -70°C until further analysis.
4.5. Shotgun-sequencing and metagenomic analysis for neighbourhood samples
Library preparation and shotgun sequencing were performed by Novogene Corporation Inc., California, USA (https://www.novogene.com/us-en/) on an Illumina Novaseq 6000 with paired-end 2×150bp strategy. Output targets were 10 gigabases (~ 33M paired reads) per sample, and the final outputs ranged from 10.1–17.8 gigabases (median = 13.0, IQR = 11.9–14.0 gigabases) (Dataset S2), yielding 0.85 (IQR: 0.82–0.88) estimated average coverage, calculated using Nonpareil (v3.303) 50 (Fig. S11). Raw reads were initially filtered by the sequencing provider under the following criteria for removal: (1) reads containing adapters, (2) reads containing N > 10%, (3) reads with low quality (Qscore < = 5). The qualities of processed reads were double-checked using FastQC (v.0.11.9) 51, and further filtered using Trimmomatic (v0.39) 52. Filtered reads were then further analyzed using the DeepARG-SL pipeline v2.0 53. This pipeline aligned reads against its database that draws ARG and 16S rRNA gene sequences from multiple public databases (i.e., CARD, ARDB, UniProt for ARGs; Greengenes for 16S rRNA genes) 53. We then calculated relative abundance (ARGs per 16S rRNA gene) according to a customized workflow written in Python (v3.5.3) with Pandas (v0.25.3). The detailed workflow has been described elsewhere 54. Specifically, ARG copies were calculated by the following equation,
\(copy= \frac{The number of reads assigned to each ARG \times Read length \left(bp\right) }{Length of reference sequence \left(bp\right)}\) (Eq. 5)
The calculated copies for ARGs were further normalized by 16S rRNA gene copies (calculated in the same way, using Eq. 5) as described in other studies 54–56 in order to account for variations in fecal bioburden between samples/sites.
4.6. Quantitative assessment of specific ARGs
Selected ARG sub-types were also detected and quantified using independent qPCR assays targeting qnrS families 57 and blaTEM families 58. Assay methodology has been described elsewhere 54,57,58. Reaction volumes of 10 µl contained the following components: TaqMan® Fast Advanced Master Mix (Applied Biosystems, USA) (5µl), primers (final concentration = 650nM for qnrS, and 400nM for blaTEM), probes (final concentration: 250nM for qnrS, and 200nM for blaTEM), 1:5 pre-diluted template (2.5µl) and DNase/RNase-free distilled water. Assays were conducted under the following conditions: 50°C for 3 mins and 95°C for 20s (qnrS) or 2 mins (blaTEM); 40–45 cycles of 95°C (qnrS) or 94°C (blaTEM) for 1s followed by 60°C for 20 s (qnrS) or 1 min (blaTEM). The following gBlocks™ gene fragments were synthesized from Integrated-DNA-Technologies, USA, and used as standards: blaTEM-1 (GenBank accession no. V00613.1; fragment size = 164bp) 58, and qnrS4 (GenBank accession no. FJ418153.1; fragment size = 197bp) 57. 16S rRNA genes were also analysed as normalization markers for overall bacterial load following our previous study 54. In short, qPCR assay were run using a universal primer/probe for V3-4 hypervariable regions of 16S rRNA gene 59 under the following conditions: 50°C for 2 min and 95°C for 2 min; 40 cycles at 95°C for 1s followed by 62°C for 20s. As a standard for 16S rRNA gene, Pseudomonas aeruginosa PA01 genomic DNA was used. Each sample and standard were run in triplicate, and all runs included no-template controls. All qPCRs were performed using QuantStudio 5 Real-time PCR systems (Applied Biosystems, USA).
The qPCR results were subject to further quality control steps, referring to relevant references 56,60,61. These steps included: (1) determining the limit of quantification (LOQ) at the last dilution where standard deviation of Ct is < 0.5, (2) Ct for non-template controls were always > LOQ’s Ct, (3) samples with Ct standard deviations > 0.5 among technical triplicates were not used (labelled as ‘detected but not quantifiable’), (4) samples with average Ct among replicates < LOQ were not used (labelled as ‘detected but not quantifiable’) and (5) samples which did not show signals or showed only 1 out of 3 signals were not used (labelled as ‘not detected’). Key quality parameters for standard curves were provided in Table S5. Relative abundance (ARGs per 16S rRNA gene) was calculated for quantifiable samples.
4.7. Analysis of global metagenomes
We searched for sewage metagenomes from around the world in the public databases (i.e., NCBI- Sequence Read Archive and European Nucleotide Archive), using keyword searching (‘sewage’), then selected studies to include that met the following criteria: (1) the raw data were generated in recent 5 years (i.e., from 2016) from the beginning of this project in 2021; (2) the project included multiple countries, across different continents; (3) the raw data included the samples with a sequencing output of > 4 giga-bases, ensuring the estimated coverage of > 0.5, predicted using Nonpareil curves of 52 neighbourhood samples in this study (Fig. S11). As a result, a total of 656 sewage metagenomic samples from two projects, representing 244 cities within 101 countries collected between 2016–2018, were downloaded 5,7. Details were provided in Dataset S2. These data included four additional raw sewage metagenomes from a wastewater treatment plant in Calgary from 2017–2018 (run accession IDs at ENA : ERR2683154, ERR4682435, ERR4682830, and ERR4678657). Samples varied significantly in sequencing output (see Dataset S2). To ensure accurate quantification of ARG reads across samples in downstream analysis, with a consistent limit of detection, the raw data were rarefied with 4 giga-bases outputs, yielding estimated average coverage of 0.60 (IQR: 0.54–0.67), using Seqtk (v1.3) 62 referring to other relevant studies 63,64.
Data for key SES parameters for global resistomes, such as Human Development Index (HDI) that represents society’s achievement in human development, broadly including dimensions of economic, education, and health 20, and unemployment rate were obtained from the World Bank 36, and United Nations Development Programme 20. In principle, the values corresponding to the year of collection for global metagenome samples were compiled and subsequently averaged for each country, resulting in a singular represented HDI value for each country. Then, samples were categorized in our sample list (a total of 101 countries) into four levels for each parameter under the following rationale: High ≥ Q3 (75%), Q3 ≥ upper middle ≥ Q2 (median; 50%), Q2 ≥ lower middle ≥ Q1 (25%), and Q1 ≥ low, as graphically detailed in Fig. S12.
To compare ARG dissimilarity at different geographical scales (i.e. intra versus inter-city variations; see 2.4), the 52 metagenomes from eight Calgary neighbourhoods were distilled into eight data points where each neighbourhood was represented by a single data point, by using the median of the longitudinal datasets (up to 10) for each ARG subtype, followed by beta-diversity analysis alongside 656 global metagenomes. Similarly, to characterize ARGs profiles specific to Calgary, the 52 Calgary metagenomes were amalgamated into a singular data point (see 2.5). This analysis was based on the assumption that the singular data point represents the entire population, considering homogenous resistomes that neighbourhood samples exhibited over the course of monitoring. Subsequently, multivariate analysis was followed to analyse key determinants, associated uniquely with the Calgary resistome. Several uncertainties may impact this comparison; (1) technical errors, arising from variations in analysis workflows between our study and other published metagenomes; (2) a bias, which may arise when comparing the resistome of sub-population (i.e., neighbourhood samples) to that of the entire population (i.e., city samples from global metagenome libraries); (3) temporal variations, caused by the diverse years during which samples were collected (i.e., 2021 in this study vs 2017–2018 for other published data). To understand the sources of uncertainty, we performed a meta-analysis, including a total of 7 additional wastewater samples, obtained from a municipal wastewater treatment plant (serving > 80% populations within Calgary) as controls. The control samples were used only in pre-test (Fig. S1), but not in the main analysis (Fig. 5–6) to prevent potential skewing of overall patterns, particularly in ordination analysis, due to unbalanced sample numbers. Four control samples were analysed during comparable periods (May-26, July-06, November-23, and December-07 2022, respectively), aligned with those obtained from global metagenome libraries (June-13 2017, June-26 2018, November-21 2017, and December-10 2018, respectively). The other three samples were analysed during a comparable period (i.e, before the major shift of pandemic-related measure on Mar-01 2022; Jan-26, Feb-09, and Feb-16 2022), aligned with our 52 neighbourhood samples. Additional details regarding the selection of control samples, as well as the rationale behind the control assessment, can be found in the Supplementary-Method. Our analysis, described in the Supplementary-Method, proved that all three errors do not exist profoundly, or at least were overshadowed by the greater influence of the overarching factor (i.e., HDI) – we therefore proceeded further downstream analysis, comparing Calgary neighbourhood resistomes with the 656 global metagenomes.
4.8. Statistical analysis
To assess for differences among Calgary neighbourhoods, Kruskal-Wallis test was performed initially to assess whether there was a difference across groups, followed by Wilcoxon test to identify which pair(s) exhibited significant differences, using the R package ‘stats (v4.1.2)’ 65. Alpha-diversity was measured using the Shannon index, and was performed in R using the package ‘vegan (v2.6-4)’ 66. Dissimilarities between samples were calculated using Bray-Curtis distance, and visualized using non-metric multidimensional scaling (NMDS), which was performed using ‘vegan (v.2.6-4)’ 66. Pairwise analysis of similarities (ANOSIM), and permutational multivariate analysis of variance (PERMANOVA) were performed to test for significant differences in dissimilarity between groups using ‘vegan (v2.6-4)’ and ‘ecole (v.0.9)’ in R 67. Procrustes analysis was performed to test for structural correlations between two different ordinations using ‘vegan (v2.6-4)’. Biplot analysis was employed to determine which ARG-subtypes correlated most to the ordination, and r and the corresponding p-value for each ARG were calculated via permutation (n = 9999) using ‘vegan (v2.6-4)’ 66,68. To identify ARGs associated uniquely with Calgary, we calculated the residual sum of squares (RSS) for each regression line against a total of 5 data points representing Calgary (a single amalgamated data point for 2021 samples in this study, as delineated in 4.7, also 4 published Calgary metagenomes), chose the top 100 ARGs with lowest RSS as distinctive determinants uniquely linked to Calgary, then further visualized using heatmap. The heatmaps were generated using ‘gplots (v3.1.3)’ in R 69. The p-values for multiple comparison were corrected according to the Benjamini-Hochberg procedure using ‘stats (v4.1.2)’ in R 65. Regression analysis was performed to test association between metagenomics and qPCR results using ‘stats (v4.1.2)’ 65. All the graphics were generated using basic embedded functions in R (v4.1.2).
4.9. Monte-Carlo Permutation Simulation
Monte-Carlo permutation simulation was performed to compare dissimilarity among Calgary-2021 wastewater samples with dissimilarity among cities from around the world as follows: (1) a total of 28 pairs (= every possible combination for the 8 neighbourhoods, \({C}_{2}^{8}\)) of Bray-Curtis distance between global cities were randomly sampled from each comparison category (Fig. 5a), e.g., distance between a high HDI city versus an upper middle HDI city, a high HDI versus a lower middle HDI city, etc, (2) the median value for each comparison category was compared with the median dissimilarity between Calgary-2021 samples (Fig. 5b), and (3) the previous steps (1–2) were repeated by n = 100, then p-value (Eq. 6) was computed (Fig. 5c).
\(p=\frac{n\left[x\right|\stackrel{-}{{d}_{City, x}} > \stackrel{-}{{d}_{Neighborborhood}}]}{n\left[x\right]}\) (Eq. 6)
Where, \(x\) indicates the number of simulations run (i.e., 1, 2, … 100), \(\stackrel{-}{d}\) indicates dissimilarity distance between two selected nodes.
The simulation was performed using a customized workflow in R modified from our previous study 70, which is available on GitHub (https://github.com/myjackson/permutation-simulation-for-intra-vs-intercity-variations).
4.10. Ethics
This study received approval from the Conjoint Research Health Ethics Board of the University of Calgary (REB20–1544 and REB21-2025).