All analyses took place in R V4.2.2 (29) using a variety of packages listed in each subsection. All spatial data used the WGS84 coordinate projection system. Conversions to this CRS, when needed, were performed using the corresponding functions available in the “raster”, “sp”, and “sf” packages. All displayed maps were created using a combination of the sf and ggplot2 packages (30, 31).
Mosquito and Arbovirus Surveillance Data: Mosquito and arbovirus data were obtained using hay-lactalbumin infusion-baited gravid traps at 87 surveillance sites in CT from 2001 – 2020. These traps are the most efficient surveillance tool for detecting WNV-infected Cx. pipiens mosquitoes (32), which is the primary enzootic and epidemic vector of WNV in the US northeast (33, 34). Briefly, sites were sampled by operating traps overnight on a 10-day rotation each summer from June through October, and all female mosquitoes were identified to species using a dichotomous key (35) and tested for nine arboviruses using viral cell culture and RT-PCR techniques (36). Collection sites were sampled more frequently (1-2 times weekly) for the remainder of a season if mosquitoes tested positive for an arbovirus of primary concern (WNV or EEEV). The average distance between analyzed sites was 61.3 km (+/- 0.42km).
Climate Data: Monthly temperature and precipitation records from 2001 to 2020 as well as monthly climate 30-year normals were obtained from PRISM (https://prism.oregonstate.edu/) (37) by extracting data to the specific surveillance site coordinates using the Data Explorer tool (https://prism.oregonstate.edu/explorer/). PRISM uses a default 4km grid system, meaning climate values were assigned to points based on the grid in which they are located. Since the CT surveillance program only operates traps from June to October, we extracted the current and 1-month lag values for each site that aligned with the mosquito collection period. We also calculated the absolute difference in the observed and normal values of these time intervals. We chose a 1-month lag in climate variables to allow for the serial interval of WNV infections in mosquitoes.
Landcover Data: Land cover data was obtained from UCONN CLEAR (https://clear.uconn.edu/projects/landscape/index.htm). Briefly, this site hosts data on twelve land cover classes for Long Island Sound from 1985 to 2015 (at intervals of 5 years). To build a suite of land cover variables for model training, shapefiles of the 2015 land cover estimate were processed in ArcMap (ESRI©) using a combination of clipping and spatial joins to produce a .csv of the area of each land cover type within ~500m of a site; percent land cover per trap buffer was then calculated in R. Published research has shown that overall land cover patterns have changed little in CT (38), so we did not include any other estimates of land cover concurrent with the surveillance period. Shapefiles for the political boundaries of CT were downloaded from CT.gov’s geospatial data hub (https://ct-deep-gis-open-data-website-ctdeep.hub.arcgis.com/search?groupIds=71c5c4a9c6ea4ea8ab54d1bf1faaeed8).
Population Density Data: Preliminary analyses with a training data set revealed one important aspect of the data: algorithms that did not directly include trapping effort (i.e., the total number of times a site was sampled in a month) as a predictor term performed poorly by explained deviance and root mean square error. Furthermore, algorithms that included trapping effort directly as a dependent variable often concluded trapping effort was the most important explanatory variable. This created a need to develop a prediction pipeline for trapping effort in unsampled spaces. To do this, we investigated the relationship between trapping effort and population density at any given surveillance site; we chose to analyze this relationship since CAES tends to choose surveillance sites near a variety of human-dominated habitats. We first downloaded the gridded global population density raster image for 2020 from Columbia University’s Socioeconomic Data and Applications Center (https://sedac.ciesin.columbia.edu/data/set/gpw-v4-population-density-adjusted-to-2015-unwpp-country-totals-rev11/data-download), cropped the image to match the spatial extent of CT, resampled the image to match the 4x4km resolution of the PRISM data, then extracted the average population density estimates within 5km of each surveillance site. We chose a 5km buffer since prior analyses with the CAES surveillance data set identified spatial patterns of synchronous WNV detection in mosquitoes up to 5km (39). We then used a Poisson-error generalized linear mixed effects model to confirm the strength of the association between trapping effort and population density while controlling for month of the year.
Hierarchical models: We used two types of modeling approaches. Our primary modeling approach was the use of gradient boosting machine, specifically boosted regression trees (BRT), in the “dismo” and “gbm” R packages (28).All processed surveillance, climate, land cover, and population density data were randomly assigned as training and test data using a 70/30 training/test ratio. Full models of Cx. pipiens trap night collections included all land cover and climate variables, Month of the year, and population density within 5 km of a surveillance site; full models of WNV detection (1 – one or more positive trap nights in a month, 0 – no positive trap nights in a month) included the same variables with the addition of the observed Cx. pipiens trap night collection. For the Cx. pipiens collections and WNV detection models, we first compared full models containing a subset of pairwise combinations of the climate variables and trapping location coordinates. In these subsets, only one climate form (average or difference from normal) per month or per climate variable (temperature or precipitation) was used; full models were also assessed with and without the x y coordinate of each surveillance site. We simplified these models through sequential elimination of the variables with lowest influence; the gbm package allows for model simplification up to X-2 variables. The best performing model for each variable was assessed from the vantage points of model deviance, root mean square error (RMSE, assessed using the test data), and simplicity (i.e., fewest variables). Once all BRT models were assessed, we compared their performance to an all pairwise interaction general linear model (GLM) using the same final terms as in the BRT models; we then simplified these GLMs using a backward selection procedure based on AIC (a similar function to model simplification for BRTs). The use of BRT and GLM models was done to compare, respectively, their ability (or lack of ability) to capture non-linear relationships. Final models utilized for validation were chosen based on lowest RMSE of the simplified BRT and GLMs.
Projecting Predictions: To generate monthly predicted WNV detection probabilities in Cx. pipiens mosquitoes, we first defined a 4 km x 4 km raster grid at the spatial extent of the state of CT. We downloaded the temperature and precipitation raster files from PRISM that matched the forms of the final models for the Years 2001 to 2022, re-projected the PRISM data to match the specified raster grid, and cropped the images to match the political boundaries of CT using a combination of functions available in the “sp”, “sf”, and “raster” packages (30, 40); we used the same methodology to generate a raster image of population density. To generate a percent land cover per raster cell for the entirety of CT, we downloaded a rasterized CT land cover image, converted all pixels to points, and listed the location and landcover value of each point in a data frame. Then for each land cover class, we counted the number of points within a 4km x 4km raster cell, and the percentage landcover within each cell was determined as the number of pixels per cell of each landcover type divided by the total number of pixels within the cell. For Cx. pipiens predictions, we created a RasterStack of all variables’ layers in the final models and generated predictions using the built in prediction function in the “raster” package for both BRT- and GLM-based models. The predicted vector abundance raster layer was then added to the RasterStack containing the variables in the final WNV model, and WNV detection probabilities were generated.
Validating predictions: We performed two different model validations. First, we compared the predicted probability of WNV detection in mosquitoes from our final candidate models to the observed presence/absence of WNV detections in mosquitoes for all sites in the CAES surveillance network in 2021 and 2022. To do this, we generated a data frame of WNV detection predictions for the summer months in towns with at least one trapping location. Because multiple raster cells could occur within a town’s political boundary, we assigned the maximum WNV detection probability to the town, rather than the average. Our rationale for this was that it communicated the maximum level of risk experienced by any resident within a town; using the maximum prediction may also better represent the actual occurrence of transmission within a town’s boundary. We then used binomial-error GLMMs to test the association between our predicted WNV detections and the observed WNV detections. Fixed effects included the current and prior month’s predicted WNV mosquito detection probability, year and month of collection, the town’s population size (log-transformed), and town name as a random effect (to control for repeated observations (41).
Our second validation approach was to compare predicted WNV detection probabilities in mosquitoes to observed detections of human cases from 2001 to 2022. A prime issue with human spillover of WNV is that cases can occur in spaces where little to no surveillance takes place. Because our primary objective was to determine if predicted WNV risk in unsampled spaces was a reliable indicator of WNV activity, testing the association between predicted mosquito detections and incidence of human cases was of great interest. We first generated a data frame of all town by month by year predictions of WNV detection in mosquitoes following a similar procedure for assigning raster values to geopolitical units as described for validating against mosquito data. We then used a GLMM to test the association between observed WNV human cases (modeled as detected 1, not detected 0) and our predicted WNV mosquito detection probabilities; we chose occurrence rather than absolute number of cases since there were rarely more than 1 human case per town for any year-month combination. Fixed effects included the current and prior month’s predicted WNV detection probability in mosquitoes, month of the year, the town’s population size (log-transformed), and town name as a random effect (to control for repeated observations).
To select the best model for both validation approaches, we first randomly split the predicted monthly WNV mosquito detection probability datasets into training and test data using a 70/30 ratio ensuring that data points were not resampled between the training and test data sets; for mosquito validation models, this included all data from 2021 – 2022 and for human validation models this included all data from 2001 - 2022. We then constructed a null model containing any seasonality terms (year and/or month), town population size, and the random effect terms. We then assessed improvements in AIC, BIC, and RMSE (assessed using the test data) due to the inclusion of single monthly detection terms (current and prior month’s predicted WNV detection). If both terms improved AIC, we then evaluated additive and interaction terms. If only one term improved AIC, we assessed whether interactions with the null model terms further improved model fit. GLMMs were built and assessed using a combination of the glmmTMB and sjPlot packages (42, 43).