Study area and duration
The study area is situated across most of the province of Ontario, in Canada, at and below latitudes 50–51° N (Fig. 1). The study spans the years 2003 to 2022.
Mosquito collection data
Since 2002, Public Health Ontario (PHO) has collected data on mosquito surveillance activities conducted by public health units in Ontario, which comprise mosquito capture and identification of 22 species/species groups. Capture is performed using the protocol as described in Talbot et al. (2023) 56. Briefly, Culex pipiens/restuans are prioritized over all other species, and Ae. vexans is also of high priority. RNA is extracted from each pool using RNeasy Mini Kit (Qiagen, Hilden, Germany). RNA extracts are tested by quantitative polymerase chain reaction (PCR) for arbovirus presence 69. We used all mosquito capture data available across the entire province of Ontario, Canada, comprising data from all 34 public health units, for the years 2003 to 2022. For each mosquito pool, we calculated the WNV minimum infection rate (MIR), which is the test outcome (positive = 1; negative = 0), divided by the number of mosquitoes present in the pool, multiplied by 1000.
Land cover and climatic data
We followed the approach of Talbot et al. (2023) 56 to process land cover and climate data for subsequent ecological niche models. Land cover data for the year 2013 were obtained from Agriculture and Agri-Food Canada (AAFC) and the United States Geological Survey (USGS) with a resolution of 30 meters across our study area. Land cover data for the year 2013 were chosen because it is approximately the mid-point of our study period from 2003 to 2022. Annual crop inventory data from AAFC 70 comprise seven land cover classes: open water, wetlands, agricultural croplands, natural grasslands, forests, exposed surface and residential areas. Residential areas were subdivided into four categories, according to the normalized difference vegetation index (NDVI) from USGS, created using Landsat 8 data (collection 2, level 2, maximum 50% clouds) from May to October 2013, from USGS 71,72. The goal of this procedure was to subdivide urban environments according to the presence of vegetation, which may affect habitat selection by the studied species. Residential areas with NDVI < = 0.15 were referred to as non-vegetated, with NDVI > 0.15 and < = 0.30 as low green, with NDVI > 0.30 and < = 0.60 as medium green and with NDVI > 0.60 as high green residential areas. Data from AAFC and USGS used the same resolution with matching cell frames, and therefore merging of the two datasets could be performed manually.
Therefore, we considered temperature and precipitation data across the entire study duration. Temperature and precipitation data were obtained from the National Aeronautics and Space Administration (NASA). Given their ease of access, we extracted data on annual total precipitation, and annual maximum and minimum temperature from the annual surface weather and climatological summaries from NASA 73 with a resolution of 1,000 meters. These data were averaged across the 20 years of the study duration, and then divided by number of days in a year to obtain 20-year averaged mean daily temperature and 20-year averaged total daily precipitation for the total period across our study area.
Ecological niche modeling analysis
We followed the approach of Talbot et al. (2023)56. The two studied mosquito species were observed at least once in the vast majority of the 3,010 sampling locations over the study period (2,783 for Ae. vexans, or 92%, and 2,686 for Cx. pipiens/restuans, or 90%). At least one mosquito vector species was present in a total of 2,810 sites, for which a WNV detection test could be performed, and lead to at least one positive outcome in a total of 650 sites. Therefore, the number of visits in which each species was observed at least once at each site was calculated, and divided by the total number of visits at each site over the study period. The resulting value is the frequency of observed presence, and is a form of aggregated performance measure often used in species distribution models 74. Sites were then dichotomized into a 0/1 distribution for each mosquito species and for WNV, which is a requirement of the approach.: Sites with 50% or more species occurrence for both mosquito species were classified as ‘presence, and sites with less than 50% as ‘absence. Sites with at least one positive test outcome for WNV were classified as ‘presence’, and others as ‘absence’.
The random forest algorithm was considered suitable for our analyses because the presence and absence of a species in a given sampling site visit are likely to be affected by the same sampling bias 75,76. This decision tree-based approach performs as well as the maximum entropy approach 77,78, and better than traditional regression-based approaches when using large datasets sampled over a long duration and a large spatial scale 79. We performed the analyses using the ‘biomod2’ package 80 in R 4.2.1 (R Development Core Team, Vienna, Austria). We projected all land cover and climatic datasets to Albers Conic Equal Area, which was the original projection of the land cover dataset from Agriculture Canada. All explanatory variables were resampled to a cell size of 120 x 120 meters and set to be at the same cell frame to reduce spatial bias caused by unequal resolution with the mosquito dataset 75. We computed the Pearson’s correlation coefficient among climatic variables at sampling sites in R 4.2.1. Potential collinearity problems were considered if r > 0.7. We set the prevalence parameter to 0.5 was specified, meaning “presence” and “absence” distributions are considered in equal proportions in the analysis 81. For each species, we trained 100 replicate models using 80% of data. To evaluate each model, we computed a receiver operating characteristic’s area under the curve (AUC) using the remaining 20% of data. Data was selected randomly in each model for training versus testing. We used the final model, trained by the 100 replicate models and using 100% of data, to generate a habitat suitability index (HSI) map in the study area. We kept all other parameters at default values. We used all models with AUC above 0.7 to generate an ensemble niche model 80. We generated response plots of the mean HSI across models and committee averaging 82, for each explanatory variable. We calculated variable importance for each explanatory variable, which varies from 0 to 1, using a procedure of 100 permutations from the ensemble niche model. Lastly, we created a projected HSI map from the model with the highest AUC for each mosquito species and for WNV.
Socio-economic data
Statistics Canada collects data on a large variety of socio-economic variables during the Census of Population every five years across Canada. Data for the years 2011 and 2016 were initially considered because they are near the mid-point of our study period from 2003 to 2022. Given their easier access and processing, we chose data for the year 2016, which we extracted using the Beyond 20/20 Professional Browser software. We selected all variables that relate to either sex, age, ethnicity, population density, income, shelter infrastructure, education, and occupation, to be included in analyses on the determinants of WNV infection in the human population. We uploaded data at the level of the census subdivision, to match blood donation arboviral testing data structure.
Blood donation arboviral testing data
Across the study area, since 2003, Canadian Blood Services (CBS) have tested blood donations across Ontario. These tests were performed on all blood donations from 2003 to 2015. As of December 2015, all donations from June to November were tested, but only donations from travelers to certain destinations were tested from December to May. These months usually see much less mosquito activity, and therefore likelihood of exposure to an infected mosquito is negligible. Groups of six donations were tested in 6-unit minipools. Positive minipools were then retested separately for each donation from the corresponding minipool, along with all donations from surrounding areas for the next two weeks. From June 2003, testing was performed using TaqScreen WNV test IUO (F. Hoffmann-La Roche AG, Basel, Switzerland). From June 2007, testing was performed using the IND cobas TaqScreen WNV test for use with the cobas s 201 system. From June 2008, testing was performed using the licensed cobas TaqScreen WNV test for use with the cobas s 201 system. From December 2017, testing was performed using the cobas® WNV – Nucleic acid test for use on the cobas® 6800/8800 Systems. At donor registration, the donor’s date of birth, sex and residential address are recorded. For confidentiality purposes, we used only the donor’s reported sex; year of birth, to obtain approximative age of the blood donor at the time of donation; census subdivision of residence from the Canadian Census of Population boundaries for the year 2016; and unique donor identifier.
To avoid sampling biases leading to spurious associations in our analyses of determinants of WNV infection in the human population, we chose to exclude all data from census subdivisions where fewer than 1000 WNV tests on blood donations were available, which are mostly located in sparsely populated areas mostly in northeastern parts of the study area, and north of the northern limit of the study area. This exclusion step lead to the removal of 126 subdivisions out of the total of 417 (30%) where blood donor data is available. In the remaining 291 subdivisions, we chose to exclude 4 census subdivisions where total number of residents was lower than 500, because most socio-economic variables were missing for confidentiality reasons. The final dataset contained 287 census subdivisions. Two of these census subdivisions, named “Kenora, Unorganized” and “Thunder Bay, Unorganized” had boundaries above the northern limit of the study area. However, the large majority of blood donations in these subdivisions were conducted in the south, near the Kenora and Thunder Bay townships, respectively, which is also where the large majority of the residents of these subdivisions live. For these reasons, we considered only the portion in these two subdivisions that lie within our study area (Fig. 1, Fig. 4).
Spatiotemporal WNV infection analyses
To attain our main objective, we investigated the effect of a wide range of factors related to land cover, climate, mosquito habitat and socio-economic status of residents on WNV infection in the human population, using data from WNV tests on blood donations. These factors were carefully chosen a priori to include variables most likely affecting WNV risk in the human population, including aspects related to mosquito occurrence, mosquito activity, and characteristics of residents linked to higher exposure to mosquito bites. We conducted these analyses as two different levels: one at the census subdivision, and one at the individual blood donation.
For the first level of analysis, we gathered all arboviral testing data from blood donations, for which residence information is available, from 2003 to 2022, and grouped them within the respective census subdivision of residence of the blood donor. We conducted a simple generalized linear regression, using the ‘lme4’ package 83 in R 4.2.1, for each variable separately, namely selected socio-economic variables, land cover variables (percent cover of each class in census subdivisions), climatic variables (20-year averaged daily total precipitations and 20-year averaged daily mean temperature, averaged across census subdivisions), habitat suitability index variables (for each mosquito vector species and for WNV, averaged across census subdivisions), and lastly, the census subdivision geographic area (in km2), which can be stochastically associated with number of positive cases. Prior to these analyses, we subtracted the mean and divided by the standard deviation of all values for numerical variables, namely test outcome, donor age, month of test, and year of test. We used the zero-inflated negative binomial modeling family, using the ‘pscl’ package 84 in R 4.2.1, where we used the number of positive cases within census subdivisions as outcome variable, and number of donations tested within census subdivisions as offset variable.
For the second level of analysis, we gathered all WNV testing data from blood donations, for which residence information is available, from 2003 to 2022. We conducted a simple mixed-effects generalized linear regression, using the ‘lme4’ package 83 in R 4.2.1, for each variable separately, namely donor sex, donor age, month of test, and year of test, against test outcome, where 1 is positive and 0 is negative. We used a hierarchical random-effects term, which is the unique donor number nested within the census subdivision of residence. Prior to these analyses, we subtracted the mean and divided by the standard deviation of all values for numerical variables, namely test outcome, donor age, month of test, and year of test. Given the data structure is binomial, the ‘binomial’ modeling family was most intuitive, but the extremely small number of positive cases compared to negative cases lead to numerous fitting errors. Therefore, we chose the more general ‘gaussian’ modeling family, for which no modelling problem occurred.
For both analysis levels, we used Pearson’s correlation coefficient in R 4.2.1, to identify correlation among all selected variables, and dropped one variable from each pair of variables displaying r > 0.7. Next, we selected all variables displaying a significant univariable association (P < 0.01), with WNV infection into a multivariable generalized linear regression analysis, using the same packages as previously in R 4.2.1. For the analysis at the level of the census subdivision, given the large number of variables included, we proceeded with a two-step model selection approach using the Bayesian Information Criterion (BIC), i.e. the Akaike Information Criterion using the logarithm of the number of observations as the k parameter. We computed a BIC value for the full model and for all combinations of the full model excluding one variable. A BIC value decrease of 2 or more is considered positive evidence for a variable displaying little effect on the response variable 85. We ran a new multivariable regression model, but this time excluding all variables that, when dropped from the full model, caused a decrease of the BIC value of more than 2. We then reapplied the same model selection approach as a second step on this new model. Using this model selection approach, we ensured that only the most important variables were retained in the resulting final model. In both analysis levels, we ran the final model using the full dataset, and computed incidence rate ratio (IRR) values, i.e. the exponents of the slope coefficients, 95% confidence intervals for the IRR values, and P values for each variable. Finally, we computed the R2 of the final model, using the ‘modelsummary’ package 86 in R 4.2.1.