We followed the Reporting of studies Conducted using Observational Routinely-collected health Data (RECORD) guidelines and checklist (see supplementary material) (21).
Tuberculosis data:
TB is a notifiable communicable disease in Vietnam. The national TB program collects data using a passive surveillance system covering the entire country. Data is then aggregated quarterly and available by province (n = 63). These data were publicly available from 2010-2015 (22). Classification of TB follows the WHO guidelines to include data on new cases, relapse cases, failure of treatment cases, recurrence cases and others. Any cases with both pulmonary and extra-pulmonary symptoms were classified as pulmonary TB (PTB). Cases with only extra-pulmonary symptoms are classified as extra-pulmonary TB (EPTB) (23). Whole genome sequences from MTB were not yet available from the national program 2010-2015 and therefore those cases defined as “new cases” may be from new transmissions or from latent disease activation.
Influenza-like-illness data:
In the absence of robust data for influenza, influenza-like-illness (ILI) incidence was taken as a surrogate for influenza as a potential confounder or driver of seasonality. ILI data from Vietnam have previously been shown to be highly seasonal (24). These data were available from the national passive surveillance programme (29 diseases) of the General Department of Preventive Medicine (GDPM) of the Ministry of Health, monthly totals per province are publicly available (25). These data were then aggregated by quarter.
Demographic and socio-economic data:
The General Statistics Office of Viet Nam collects and collates data on a wide variety of topics including socio-economic and health data (26). These data have been systematically organized and made publicly available for research purposes ((27), https://www.gso.gov.vn). The following variables were taken from this resource and were all defined per province per year: population density, proportion of male to female, proportion of urban to rural population, poverty ratio (ratio of the number of people whose income falls below ½ the median income of the country), literacy rates of 15 year olds and above, number of hospitals, immigration intake and prevalence of HIV infection at year end. Age structure was available divided into five year intervals up to the age of 80 for the year 2014. We grouped the ages as young, middle and old, £ 15 years, 16-64, ³ 65 years respectively, in keeping with international studies (28, 29).
Climate data:
Meteorological data was made available from the Vietnamese Institute of Meteorology, Hydrology and Environment for 67 climatic stations positioned throughout Vietnam. This includes monthly averages of daily minimum, maximum and average temperatures (°C), as well as of daily average absolute (g/L) and relative (%) humidity, cumulative precipitation (mm) and hours of sunlight. In order to compute climatic variables per province from these climatic station data, the variables where kriged (30) on a 10,000-cell regular grid (i.e. pixels of 5.75 x 5.75 km) and then aggregated by province. Local population density was taken into account by using weighted averages depending on the population density of each grid. The relative population densities were computed from the 2009 population density data available from WorldPop project ((31) http://www.worldpop.org.uk). This allowed the computation of province-aggregated climatic variables that are representative of the population of the considered province. Kriging was done by an automatic tuning of the hyperparameters of the variogram estimation as made available by the automap R package (32). Kriging performed well overall, except for rainfall, most likely because this variable typically varies widely in space, resulting in some extreme values through kriging. These outliers were identified by comparison with the original dataset and replaced by the average rainfall for the province from the other 5 years.
Seasonal factors (meteorological and ILI data) were investigated with and without a lag of one or two quarters. This was based on findings that diagnosis of TB was made at least 12 weeks from transmission or disease activation (33). Unfortunately, data on delay in diagnosis were not available from Vietnam or other similar settings. Since our data were only available quarterly we added lags so that we could examine the climate conditions at the proposed time of transmission or activation.
Analysis:
All analysis, graphical representations and maps were produced using R version 3.5.0. Data from the four databases were merged by province, year, and quarter.
Descriptive analysis:
Incidence of TB, per province per year, was mapped using the “sf” R package (34). Two heat maps of TB incidence, per province per quarter were produced to demonstrate and visualize seasonality in space and time (one normalized per province for temporal variation and one normalized by quarter for geographical variation).
Time series analysis - characterization of seasonality:
Time series data per province were decomposed into seasonality, trend and residual components using a LOESS regression. The trends indicated in this analysis were complex and province specific and could not be modeled linearly.
For each province, harmonic regression using cosine and sine waves were used to model the seasonality on the de-trended data according to the following formula:
LOESS-detrended incidence of TB ~ a*cos(time) + b*sin(time)
Where seasonal amplitude was characterized as sqrt(a2 + b2) and relative seasonal amplitude was expressed as the seasonal amplitude/mean incidence of TB ratio.
Mixed-effect Linear Model: explaining seasonality
A mixed-effect linear model was used, to allow both fixed and random effects to be incorporated, using the “lme4” R package (35).
All variables (except latitude, quarter and age proportions) were de-trended by province using the LOESS method described above. Year and province were introduced as random effects to control for clustering by unmeasured variables or potential trends that could remain when considering all the provinces together. All other variables were introduced as fixed effects with quarter coded as a factor and all others as continuous variables. Likelihood Ratio Tests (LRT) were used to determine significance level of each covariable. Potential confounding effects due to colinearities between covariables were controlled for by computing type-2 sums of squares (using the “car” R package (36)). Covariables were examined after detrending for normal distribution and if skewed, transformed using log transformation in order to linearize potential associations between covariables. Seasonal factors were examined with different lag times and combinations, and the final model chosen from the lowest Akaike information criteria. The full model reads:
TB incidencex,y ~ urbanx + sexx + population densityx + literacyx + youngx + oldx + povertyx + hospitalsx + HIVx + migrationx + quarter + influenza-like-illnessx,y-2 + average temperaturex,y-2 + absolute humidityx,y-2 + rainfallx,y-2 + sunshinex,y-2 + latitudex + (1|province) + (1|year)
Where x = province and y = year quarters. See Table 1 for full description of variables.
Model validation
The validity of the internal assumptions was assessed by examining the presence of trend, temporal and spatial auto-correlation (using the Durbin-Watson test) in the residuals.