2.2 Epidemiological Data
A dataset of stool samples from diarrhoeal patients reporting to the Infectious Disease Hospital (IDH) in Kolkata under their diarrhoeal surveillance system during the 21-year period 1999–2019 was obtained from the Indian Council of Medical Research - National Institute of Cholera and Enteric Diseases (ICMR-NICED). In the surveillance system, every fifth patient on two randomly selected days of the week (representing around 6% of total patients) was tested for several pathogens including O1 and O139 Vibrio cholerae. We extracted the number of samples which tested positive for either O1 or O139 Vibrio cholerae where the patient was registered as residing within the Kolkata Municipal Corporation (KMC) region. The dataset from 1999–2007 was un-digitized and pre-aggregated at a monthly resolution. We therefore considered two datasets: a 21-year monthly dataset from 1999–2019, and a 12-year dataset (2008–2019) aggregated weekly. The monthly dataset was normalised to a 30-day month to account for differences in month length (Eq. 1)
Equation 1
$$case{s}_{normalized,month}=case{s}_{raw,month}\cdot \frac{30}{days.in.month}$$
The decadal census populations of Kolkata were recorded as 4399819, 4572876, and 4496694 during the years 1991, 2001, and 2011 respectively (data were unavailable for 2021) [19]. We therefore considered the population of Kolkata to be relatively stable, and therefore ignored any marginal fluctuations in population in our analysis.
2.4 Statistical Analysis
A box plot of monthly cholera cases, temperature, rainfall and run-off was produced for visual inspection of seasonality within the monthly dataset.
To allow for the potential presence of non-linear associations in the data, we explored the association between cholera cases and environmental variables using Generalized Additive Models (GAMs) applied to the 1999–2019 monthly dataset [9]. The GAM analysis was conducted using the R package mgcv [10]. A more in-depth explanation of GAM models is given in the supplementary materials and is briefly summarized here. GAMs can be considered as an extension to generalized linear models (GLMs) which relaxes the linearity assumption and allows for significantly more flexible model fitting. Here, we model the cholera case data as sampled from a negative binomial distribution (due to the count nature and presence of over-dispersion) where the mean (count) is characterized as the sum of smooth functions of the environmental variables. We considered rainfall, modelled rainfall-runoff, temperature and coastal SST environmental variables.
The smooth functions in our model were constructed using penalized cubic regression splines. The flexibility of the smooths can be tuned by increasing the number of knots (k) where a greater number of knots permits a more flexible smooth. Since overfitting is prevented by a smoothing penalization term, we set k to the minimum value above which the influence on the results was negligible to allow sufficient degrees of freedom to describe the true relationship within a manageable computational cost.
In our model, we include the season as a factor within each smooth of an environmental variable which permits the smooths to ‘interact’ with the season factors and produce a particular smooth for each environmental variable in each season. We further accounted for long-term trends by including the smooth function of month number, and for seasonality using a smooth for month-of-the-year using a cyclic penalized cubic regression spline.
To minimise assumptions about lags in the environment/cholera relationship, we considered the following 6 lag options for each climate variable (not included, concurrent with cholera month, lagged by 1 month, lagged by 2 months, mean(concurrent, 1-month lag), mean(1-month lag, 2-month lag) resulting in\({6}^{3}-1=215\) permutations of the model. The final model was selected using Akaike Information Criterion (AIC) [11]. The models were formulated as in Eq. 1:
Equation 2
$$\text{log}\left[E\left({Y}_{t}\right)\right]={\beta }_{0}+{s}_{season}\left(temperatur{e}_{t,lag}\right)+{s}_{season}\left(rainfal{l}_{t,lag}\right)+{s}_{season}\left(SS{T}_{t,lag}\right)+s\left(month\right)+s\left(month-of-year\right)+season$$
Where \(E\left({Y}_{t}\right)\) represents the expected monthly confirmed cholera cases in month \(m\), \({\beta }_{0}\) represents the intercept, and s() represents the smooth functions.
We anticipated temporal autocorrelation in the counts, and after exploratory analysis of the autocorrelation function (ACF) and the partial ACF (PACF) of the residuals of the chosen model (Figure S1), an AR(1) process was indicated. We therefore included an additional one-month lagged term of the response variable in the final model as in Eq. 3. An ACF plot of the final model (Figure S2) confirmed this to be a reasonable assumption.
Equation 3
$$\text{log}\left[E\left(t\right)\right]={\beta }_{0}+{s}_{season}\left(temperatur{e}_{m,lag}\right)+{s}_{season}\left(rainfal{l}_{m,lag}\right)+{s}_{season}\left(SS{T}_{m,lag}\right)+s\left(month\right)+s\left(month-of-year\right)+season+{y}_{t-1}$$
We next used cross-correlation analysis to identify the lags associated with the strongest relationship between climate variables and cholera cases. Specifically, we measured the Pearson’s correlation between climate and cholera time series at lag periods from 0–25 weeks stratified by season. We utilised the 12-year weekly aggregate datasets from 2008–2019 to assess the effect of lag times at a finer temporal resolution.
Finally, we considered the effect of season timing on respective cholera cases. We did this by applying a Negative Binomial linear regression model (again due to the count nature and over-dispersion in the response variable) to the relationship between the week of season onset and the total number of cholera cases occurring within that season as in Eq. 4
Equation 4
$$\text{log}\left[E\left({y}_{season,year}\right)\right]={\beta }_{0}+{\beta }_{1,season}\left(wee{k.of.onset}_{season,year}\right)$$
Where \({y}_{season,year}\) represents the total number of cholera cases reported within a particular season and year, \({\beta }_{0}\) represents the intercept, and \({\beta }_{1,season}\) the regression coefficient for each season.
We defined the ‘week of season onset’ in summer as the first week in the year where the three-week rolling mean temperature is greater than 30°C, and for monsoon as the first week in the year where the three-week rolling mean precipitation is greater than 40mm/week.