Collection of neurodegenerative diseases and bloom data
To identify the association between harmful algal blooms and NDs, chlorophyll-a (chl-a) concentration data and the annual number of NDs for an administrative unit (‘gu’ or ‘gun’) in S. Korea were used. Since the observed sites where chl-a concentrations were measured were located along the rivers and lakes, the administrative units were not accurately matched geographically. Thus, gridded population was used to solve this problem, which are estimates of the population in a grid cell derived with a geo-statistical model using census or small area population counts and a number of other spatial datasets (Thomson et al., 2020).
For bloom data, chl-a concentrations from 939 locations were obtained from the Water Information System at the Korean Ministry of the Environment. Since monitoring of cyanotoxins was limited and has not been conducted on a regular basis, chl-a concentration was used as a bloom intensity indicator; the chl-a data are the only available bloom parameter covering the entire areas along the rivers and lakes during the study period from 2005 to 2017. Annual average chl-a concentrations per each location were used for analysis. A Korean governmental agency conducts monitoring of cyanobacterial cell counts, geosmin, and chl-a concentrations during HAB seasons. Measurements of BMAA, anatoxins, or saxitoxin are rarely conducted and data are not available. Therefore, in this study, chl-a concentration was used as a bloom indicator.
Conversion to gridded population
In order to map the annual HAB and ND data, the longitude and latitude information was collected for all 939 chl-a monitoring locations, and then the data were converted into geographical information system using QGIS software (version 3.22.2, https://qgis.org/en/site/). The entire area of S. Korea was divided into a grid of 1 km and then 127,595 points of the grid were generated. Next, the 939 observed locations were mapped onto the entire grid points. If two or more observation locations of chl-a correspond to one grid, the median value of chl-a concentration was used, and then mapped it onto the grid.
As a next step, the population number in each 1-km grid was collected using the SGIS (Statistical Geographic Information Service) Plus (https://sgis.kostat.go.kr) system. SGIS Plus is a location-based open service platform linked and fused with public and private data, and census data of population, households, houses, and businesses owned by Statistics Korea (Korean Bureau of Census). SGIS Plus provides a variety of statistical data according to the size of grids. The smallest size of the grid is 100 m, and the largest one is 100 km. Numbers of total population and numbers of over 65 year old were extracted for every 1-km grid from the entire areas of S. Korea. In Fig. 1, red dots represent all observed locations of chl-a concentration, and grey lines are 1-km grids. For the surrounding areas in Seoul, the capital, the 1-km grids are not distinguishable due to its highly dense population. Thus, it was enlarged for displaying better granularity.
Figure 2 depicts more detailed information on the Seoul area. In Fig. 2, panel (a) illustrates the number of people in the 1-km grid in Seoul. The darker the color is, the larger the number of populations in the area is. Panel (b) represents the observed location of chl-a concentration measured. It shows that chl-a concentrations were monitored along the rivers and streams. Panel (c) overlays gridded population and chl-a concentration observed locations. A benefit of using 1-km gridded population is its scalability: we can easily extend the grid size by aggregating the grids into 3-km or 5-km grids, and then match each extended grid with observed chl-a concentration data.
Cross-sectional time series data
After collecting chl-a concentration data over 13 years (from 2005 to 2017), we matched the gridded population to these observed years. However, SGIS Plus does not provide data every year. Statistics Korea conducted the Census every five years until 2015. Therefore, gridded population data in 2005, 2010, and 2015 were used. In 2015, Statistics Korea performed a register-based Census, and census data have been generated every year since then. Thus, we could have the gridded population data in 2005, 2010, 2015, 2016, and 2017 based on the Censuses. To match chl-a concentration data for those 13 years to irregularly observed gridded population, we had to expand the population data to 13 years. Thus, we utilized an interpolation method of missing values in time series by connecting successive non-missing input values with spline method (De Boor, 1978). Finally, we had population estimates for these missing years of 2006, 2007, 2008, 2009, 2011, 2012, 2013, and 2014.
In this study, three NDs were considered; MND, AD, and PD. The ICD-10-CM diagnosis codes were G12.2 for MND, G12.21 for ALS, G30.0 for AD, and G20 for PD. The annual numbers of patients of the three diseases from the National Health Insurance Service were collected and the data were rearranged according to the administrative units of the district. In S. Korea, a unit of district-level has three types: "Si" as a city, "Gu" as a district in metropolitan cities only, and "Gun" as a division within provinces. The total number of administrative units is 226 and less than the total number of the 1-km grid and less than the number of chl-a observed locations (939 locations). As the differences among 1-km grid, chl-a observed locations, and the number of administrative units, we had to covert the patient number corresponding to the observed chl-a concentration location.
The number of patients for each administrative district in the 1-km grid was evenly distributed according to the populations of each grid. Let \({x}_{it}\) be the number of senior patients in the \({i}^{th}\) administrative unit at time t and \({y}_{jt}\) be the number of ND patients in the \({j}^{th}\)1 km grid at the same time t. The modified number of patient \({y}_{jt}\)is given by in Eq. (1)
\({{y}}_{{j}{t}}={{x}}_{{i}{t}}\times \frac{{{G}}_{{j}{t}}}{{{P}}_{{i}{t}}}, {i}=\text{1,2},\dots , 226, {j}=\text{1,2},\dots , {{n}}_{{i}}, {t}=2005,\dots ,2017 (\) )
where, \({P}_{it}\)is the number of total populations in the \({i}^{th}\) administrative unit at time t, and \({G}_{jt}\)is the number of populations in the \({j}^{th}\) 1 km grid which is one of the administrative units at time t. Figure 3 depicts the heatmaps of the modified numbers of patients in a 1-km grid. From panel (b) to (d), each panel represents the heatmap of the modified number of ND patients of MND, AD, and PD, respectively. Since, many of the population lives in the Seoul metropolitan area, about one-quarter of the population of S. Korea-we zoom in on the area. The zoom-in area can be seen in the upper right corner of each panel. Other dark-colored areas are another regional capital of S. Korea. The upper left panel (a) represents the concentration of the observed spot of chl-a. We can see that the higher concentrations are located following the four big rivers in S. Korea.
As NDs are related to old age people, the number of patient in the 1-km grid needs to be adjusted after considering the population size of older people (> 65 year old). So we performed the second modification of the number of diseases. Eq. (2) represents the second modified number of NDs according to the older population.
\({{y}}_{{j}{t}}={{x}}_{{i}{t}}\times \frac{1}{{P}{65}_{{i}{t}}}\times \frac{{{G}}_{{j}{t}}}{{{P}}_{{i}{t}}}, {i}=\text{1,2},\dots , 226, {j}=\text{1,2},\dots , {{n}}_{{i}}, {t}=2005,\dots ,2017 (\)2)
where, \(P{65}_{it}\) is the number of populations over 65 in the \({i}^{th}\) administrative unit at time \(t\), the other variables are the same with Eq. (1). The first product on the right side of Eq. (2) is the corrected number of ND patients according to the number of population over 65 years in the administrative unit. The updated number was revised by multiplying the ratio of 1-km grid population and administrative unit population. The final corrected number of patients was calculated for every three ND and was used as a response variable in the statistical model.
Statistical analysis
In order to examine the relationships between the HABs and the numbers of three NDs, we had to figure out two issues first: 1) to evaluate long-lasting effects of HABs on the NDs, and 2) to confirm the spatial distance impact of HABs on the three NDs. We applied generalized linear mixed models (GLMMs) for the first issue and used generalized linear models (GLMs) for the second one. For the GLMMs, we set the log link function and the normal distribution for a random component. In addition, to identify a serial correlation that the repeated measured response variable might have, we added a one-way random effect to the model (O'Hara 2009). The main effects of HABs on the variation of NDs were considered fixed effects in the model. Let \({y}_{jt}^{k}\) be the corrected number of kth ND defined in Eq. (2), where \(k=\text{1,2},3\), 1 is for MNDs, 2 for ADs, and 3 for PDs and \({z}_{jt}\) be the observed concentration of chl-a with the same time point and location. We modeled the number of NDs for the effects of HABs in the following:
\(\mathbf{log}\left({{y}}_{{j}{t}}^{{k}}\right)= {{\beta }}_{0}+\sum _{{l}=0}^{{\infty }}{{\beta }}_{{l}+1}{{z}}_{{j},{t}-{l}}+{{u}}_{{j}}+{{ϵ}}_{{j}{t}}, {k}=\text{1,2},3, {j}=\text{1,2},\dots , 939, {t}=\text{2005,2006},\dots , 2017 (\) 3)
where \({u}_{j}\) is a random effect of \({j}^{th}\) location and we assumed that all \({u}_{j}\) have zero mean and the same variance \({\sigma }_{u}^{2}\) and each \({u}_{j}\)s is independent of each other, and \({ϵ}_{jt}\) is an independently and normally distributed white noise. We could reflect the serial correlation within a location by including a random effect in the model. We performed the Hausman test to determine whether the model had to include a random effect (Hausman 1978). The null hypothesis of the test was that there is zero correlation between the fixed effects and random effects. If we did not reject the null hypothesis, the random effect model was better than the fixed effect model. Based on the results of the Hausman test, all three ND models showed that the null hypothesis did not reject all. Therefore, we could say that all three models have random effects.
For the second issue, we only utilized the 2017 data set because it is the most recent one among the complete data sets, had very few missing observations and it may best reflect the spatial effect of HABs in the model compared to other data sets. To model the ND number response variable, and GLMs were used:
$$\mathbf{log}\left({{y}}_{{j}{r}}^{{k}}\right)= {{\beta }}_{0}+{{\beta }}_{{j}{r}}{{z}}_{{j}}+ {{ϵ}}_{{j}}, {k}=\text{1,2},3, {j}=\text{1,2},\dots , 939.$$
4
In this model, the subscript \(k\) and \(j\) are the same with Eq. (3). The subscript \(r\) represents the size of the radius. We examined the change of the effects of HABs to three ND response variables according to the radius increment by 3 km and 5 km (see Supplementary Information). When we used a 1-km grid for this purpose, many grids had no population as the locations for measuring chl-a concentrations were generally on the rivers if width of the water bodies are bigger than 1-km and very close to the rivers, lakes and strems. Thus, few people reside within those 1-km grids and those grids having the red dots on the river generate lighter blue color due to the number of lower populations (Fig. 2(c)). Therefore, we used the primary grid sizes of 3 km and 5 km, instead of 1 km, for statistical analyses.