EpiVECS: Exploring spatiotemporal epidemiological data using cluster embedding and interactive visualization.

doi:10.21203/rs.3.rs-3417276/v1

This work is licensed under a CC BY 4.0 License

This is the preprint version; the journal version was published in Scientific Reports on 01 Dec, 2023.

The analysis of data over space and time is a core part of descriptive epidemiology, but the complexity of spatiotemporal data makes this challenging. There is a need for methods which simplify the exploration of such data for tasks such as surveillance and hypothesis generation. In this paper, we use combined clustering and dimensionality reduction methods (hereafter referred to as ‘cluster embedding’ methods) to spatially visualize patterns in epidemiological time-series data. We compare several cluster embedding techniques to see which performs best along a variety of internal cluster validation metrics. We find that methods based on k-means clustering generally perform better than self-organizing maps on real-world epidemiological data, with some minor exceptions. We also introduce EpiVECS, a tool which allows the user to perform cluster embedding and explore the results using interactive visualization. EpiVECS is available as a privacy-preserving, in-browser, open-source web application at https://episphere.github.io/epivecs.

A key part of public health research is analyzing how diseases are distributed across space [1], [2]. This can help epidemiologists allocate resources, discover potential risk factors, measure health disparities, implement targeted public health policy, and more [1], [3], [4]. Typically, such analysis considers a specific point or period in time and does not explicitly model or analyze temporal structures in the data, despite the fact that many datasets in epidemiology are both spatially and temporally referenced. Analyzing time alongside space can provide useful insights that spatial analysis alone cannot [5]–[7], including an improved understanding of communicable disease transmission, a more comprehensive representation of differences in epidemic response, and a way to compare the impact of varying public health interventions [1], [8]–[11]. Despite recent advances and increased interest in spatiotemporal analysis, the complexity and scale of the data can make it difficult for analysts to perform such analysis effectively [5], [6], [12].

The complexity of spatiotemporal analysis is especially challenging for exploratory tasks, such as disease surveillance and hypothesis generation, because it makes it difficult to rapidly investigate insights at scale [11]. For this reason, it is useful to have methods that allow analysts to find and explore patterns in spatiotemporal data with speed and flexibility. One particularly effective way of exploring complex data is by using interactive visualizations [13]. A well-designed visualization takes advantage of the analyst’s visual processing skills to convey complex and nuanced information that would not be apparent in an alternative format (e.g. tabular) [13]. The addition of interaction means that the analyst can display the details which are most important to them, and follow up on any potential insights they gain from the resulting visualizations. Interaction is especially important for complex datasets because it means a lot of information can be included without needing to visualize it all at once [14], [15]. Spatiotemporal data, in particular, is challenging to visualize because both spatial and temporal data are often best represented using the ‘position’ visual channel, and it is therefore difficult to visualize both together [16]. Interaction addresses this challenge by limiting how much spatial or temporal data is visualized at once, in a manner that is guided by user engagement [16]. However, hiding too much data behind interaction can impede one of the main benefits of multivariate exploratory analysis: uncovering interesting patterns that span the dataset. This motivates the use of methods capable of simplifying the data in such a manner that it can be visualized globally, providing some high-level insights before the analyst explores more granular patterns using interaction.

Unsupervised learning techniques can be used to gain a high-level summary of spatiotemporal data by uncovering potentially informative patterns [12]. One popular technique is clustering, that is, finding subsets of the data which show some degree of internal similarity. Clusters act as a summary of the dataset, simplifying the initial information that the analyst must consider. In an interactive dashboard, additional details can then be uncovered through interaction. In epidemiology, time-series clustering has been used for tasks such as surveilling infectious diseases, estimating transmission dynamics, and disease forecasting [17]–[19]. There are many different clustering algorithms available, each having subtle strengths and weaknesses, making it difficult to choose the best method for a given problem [20].

Clustering methods provide an especially useful summarization of data when their output includes an informative representation of each cluster, e.g. the centroids in k-means [21]. Self-organizing maps, a basic machine-learning-based clustering method, are particularly useful for data summarization because they provide both a high-dimensional and low-dimensional representation of the clusters [22]. For this reason, self-organizing maps are especially useful for visualization [23], [24]. However, self-organizing maps require the low-dimensional representation of the clusters to be positioned on a fixed grid, which can impede both the clustering quality and the informativeness of the 2D cluster representation [25]. An alternative approach is to use any clustering technique which produces a vector representation of the clusters (e.g. the centroids in k-means) and then to calculate a 2D representation of those vectors using a dimensionality-reduction technique. This allows greater flexibility in the choice of both the clustering method and the dimensionality-reduction method, which, if appropriately chosen, can lead to better results. This is the approach taken by the oKMC+ method: the application of Sammon mapping (a non-linear dimensionality reduction technique) to k-means centroids, resulting in higher quality clusters and a better low-dimensional cluster representation when compared with self-organizing maps [25]. There is no standard terminology for methods that cluster data and produce a low-dimensional representation of the clusters, so we will refer to them here as ‘cluster embedding’ methods. In this paper, we will compare the performance of several cluster embedding methods on real-world epidemiological data. We will argue here that cluster embedding is a particularly effective route to explore both overall analytical context and fine-grained spatiotemporal patterns. We will also introduce a web-tool, EpiVECS, which implements an interactive approach to cluster embedding.

Method comparison

Data

The primary goal of this work is to simplify the analysis of real-world spatiotemporal epidemiological data, and therefore we have assessed the proposed methodology using publicly available datasets from the US Centers for Disease Control and Prevention (CDC). The following empirical results are best understood by experimenting with the accompanying EpiVECS web application in conjunction with the supplementary Observable Notebooks. In both cases, the proposed methodology can be tested interactively without the need to download or install any software. To ensure that the methods are tested on a variety of scenarios, we have included 14 datasets covering different concepts including cancer mortality rates, COVID-19 patient bed utilization, and age-adjusted all-cause mortality rates. The datasets include US county-level and state-level spatial resolutions, and week-level, month-level, and year-level temporal resolutions. A range of vector lengths are present, the shortest being of length 17 and the longest of length 171. The datasets were filtered to ensure that there are no missing values; specifically, for each dataset, a time-range was chosen and areas with missing values in that time-range were excluded. For information on the datasets and how they were collected and processed, see Availability.

Clustering validation

To compare the performance of the cluster embedding methods, we first compare the performance of the clustering methods (self-organizing maps and k-means) using several internal cluster validation metrics. Internal cluster validation metrics attempt to quantify cluster quality using internal properties of the clusters such as cluster separation. Many different internal validation metrics have been proposed and, like other parameters in multivariate exploratory analysis, different metrics perform best in different scenarios [26], [27]. Therefore, it is helpful to use a variety of metrics to generate a comprehensive comparison between methods. In this work, we have chosen the following internal cluster metrics based on recommendations from comparison studies: Davies-Bouldin, Calinski-Harabasz, SDbw, Silhouette Score, and mean square quantization error [26], [27].
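To make two of these metrics concrete, the following is a minimal sketch in plain Python of the mean square quantization error and the Davies-Bouldin index. It is written from the standard textbook definitions and is not the code used for the results in this paper; the function names are illustrative, and the sketch assumes Euclidean distance, integer labels from 0 to k-1, and that every cluster has at least one member.

```python
import math

def quantization_error(vectors, labels, centroids):
    """Mean squared Euclidean distance from each vector to its assigned centroid."""
    total = sum(math.dist(v, centroids[l]) ** 2 for v, l in zip(vectors, labels))
    return total / len(vectors)

def davies_bouldin(vectors, labels, centroids):
    """Davies-Bouldin index; lower values mean compact, well-separated clusters.

    Assumes every cluster is non-empty (otherwise the scatter is undefined).
    """
    k = len(centroids)
    # Scatter: average distance of each cluster's members to its centroid.
    scatter = []
    for c in range(k):
        members = [v for v, l in zip(vectors, labels) if l == c]
        scatter.append(sum(math.dist(v, centroids[c]) for v in members) / len(members))
    # For each cluster, keep the largest similarity ratio to any other cluster.
    worst = []
    for i in range(k):
        ratios = [
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i
        ]
        worst.append(max(ratios))
    return sum(worst) / k

# Two tight, well-separated clusters of two points each.
vectors = [[0, 0], [0, 2], [10, 0], [10, 2]]
labels = [0, 0, 1, 1]
centroids = [[0, 1], [10, 1]]
qe = quantization_error(vectors, labels, centroids)
db = davies_bouldin(vectors, labels, centroids)
```

Both functions take the labels and high-dimensional cluster representations produced by a clustering method, so the same code can score k-means centroids and self-organizing map weight vectors alike.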

An important parameter in both k-means and self-organizing maps is the number of clusters. In k-means, the number of clusters is set by a single parameter (k), whereas in self-organizing maps it is determined by the size and shape of the grid. In this work, we limit self-organizing maps to a square grid and therefore all tested values of k will be square numbers. There is no definitive way to find an appropriate value for k and several popular methods rely on a subjective interpretation of a visualization (e.g. a Silhouette plot [28]). In the EpiVECS tool we leave the choice of k up to the user, and so it is important that we test several values of k when comparing methods (Fig. 1). We have elected to test the following k values: 4, 16, 25, 36, 49, and 64. In cases where the test dataset is small (e.g. the state-level datasets consist of around 50 vectors) we test only 4 and 9. We keep other parameters for k-means and self-organizing maps fixed at the default values used in the chosen implementations, shown in the supplementary Observable Notebooks (see Availability).

Dimensionality reduction validation

An important part of cluster embedding methods is producing an informative representation of the clusters in a low-dimensional (usually 2D) space. Ideally, the low-dimensional cluster representation should provide the analyst with some idea of how clusters relate to one another. One way this quality can be assessed is by using dimensionality-reduction validation metrics to quantify how well certain properties of the high-dimensional cluster representation are preserved in the low-dimensional cluster representation. There are many different validation metrics, and each assesses a different aspect of the dimensionality reduction, so it is useful to use a variety of metrics when comparing methods. In this work, we use AUC of trustworthiness (tAUC), AUC of continuity (cAUC), residual variance between the distance matrices (VR and VRs), co-k-nearest neighbor size (Qglobal), and local continuity meta criterion (Qlocal), all of which are implemented in the pyDRMetrics Python library [29]. A comparison of the dimensionality reduction quality of the implemented methods can be found in Fig. 2.
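As an illustration of what such a metric measures, the sketch below computes a residual variance between the distance matrices in plain Python. It does not call pyDRMetrics; it assumes one common definition, namely one minus the squared Pearson correlation between the pairwise Euclidean distances in the high-dimensional and low-dimensional representations, which may differ in detail from the library's implementation. The function names are illustrative.

```python
import math
from itertools import combinations

def pairwise_distances(points):
    """Euclidean distances for every unordered pair of points, in a fixed order."""
    return [math.dist(a, b) for a, b in combinations(points, 2)]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def residual_variance(high, low):
    """1 - r^2 between high- and low-dimensional pairwise distances (assumed definition).

    0 means the embedding preserves distances up to a linear rescaling;
    values near 1 mean the distance structure is largely lost.
    """
    r = pearson(pairwise_distances(high), pairwise_distances(low))
    return 1.0 - r * r

# Four cluster representations that already lie in a 2D plane of a 3D space.
high = [[0, 0, 0], [1, 0, 0], [0, 2, 0], [3, 3, 0]]
perfect = [p[:2] for p in high]                 # drops the constant third coordinate
scaled = [[2 * x, 2 * y] for x, y in perfect]   # uniform rescaling of the embedding
```

In this example both `perfect` and `scaled` give a residual variance of zero, because the pairwise distances are either identical to or an exact multiple of the high-dimensional distances.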

EpiVECS web-tool

EpiVECS is an open-source, in-browser web-tool which allows the user to perform cluster embedding on their own geospatial vector data and visualize the results in an interactive dashboard. All methods described in this paper are available as interactive features in this tool, alongside basic normalization, smoothing, and interpolation options. The dashboard consists of three primary plots: a line plot showing the cluster centroids, a bubble plot showing the embedded cluster positions, and a choropleth (see Fig. 3). The plots in the dashboard are designed with ‘link and brush’ functionality, wherein the user can highlight and select elements in one plot and the interaction is reflected in all other plots. For example, if the user hovers over a cluster centroid, then all areas in the choropleth which were assigned that cluster centroid’s label are highlighted, and the position of the embedded cluster in the bubble plot is also highlighted. The user can click on elements to select multiple labels which are kept highlighted; this selection can be cleared by clicking in an empty space on any of the plots. If the user hovers over any area in the choropleth, that area’s specific (normalized) vector is shown in the line plot — this shows how well areas match their assigned cluster centroids. Tooltips provide additional information tailored to the plot. We encourage the reader to experiment with the tool to see this functionality in action.

In this manuscript, we have investigated the use of cluster embedding to simplify the exploration of spatiotemporal epidemiological data. We compared two key methods: self-organizing maps and embedded k-means. For embedded k-means we compared four different dimensionality-reduction techniques to embed the cluster centroids: PCA, Sammon mapping, t-SNE, and UMAP.

Epidemiological time-series data is often characterized by properties which render analysis challenging, such as noise and non-linearity, and therefore we performed our experiments using real-world epidemiological data. A variety of clustering and validation metrics were used to ensure that the results were not sensitive to the choice of metric. Our results show that k-means clustering generally performs better than self-organizing maps, especially in regard to clustering quality. This confirms previous results by Flexer, which showed that oKMC+, an embedded k-means method with Sammon mapping, outperforms self-organizing maps in both clustering quality and dimensionality-reduction quality [25]. Flexer took a different approach to assessment than the one described in this manuscript, using external validation metrics on simulated data rather than internal validation metrics on real-world data. This suggests that the superior performance of embedded k-means is consistent across a range of experimental designs. However, an important element in data exploration is the subjective experience and goals of the analyst, which these validation metrics are unable to assess. While self-organizing maps may result in lower quality clusters, the simplicity and convenience of the fixed grid structure may be desirable to the analyst. In the accompanying EpiVECS tool, we have included all methods discussed in this paper so that the analyst can experiment with different approaches.

There are two main issues to consider when using cluster embedding methods for data exploration: the methods will produce a result regardless of how meaningful it is, and the methods are sensitive to the choice of hyperparameters (especially the number of clusters k). These two issues are related because the choice of hyperparameters can result in a wide variety of different results. It is therefore difficult to assess the comparative merits of each configuration from numerical validation metrics alone. In contrast, presenting the results as an interactive visualization provides a mechanism to mitigate these issues because the analyst can explore the results in further detail, using analytical and domain knowledge to assess the results. To encourage this type of assessment, we have included a specific feature in the EpiVECS tool: when the user hovers over a specific area, that area’s input time-series is shown alongside the centroid (or weight) of the cluster to which it was assigned. This allows the analyst to see how well different time-series match their assigned clusters and to find out which set of clustering parameters recognizes known spatiotemporal correlations.

Applying cluster embedding methods to time-series data comes with a number of challenges. The approach allows an analyst to explore temporal patterns, but it does not alone convey any information about the distribution of these temporal patterns over space. To provide that information, the accompanying EpiVECS tool includes a choropleth plot where each area is colored according to the corresponding cluster time-series. This approach requires a method of assigning colors to clusters, for example, by treating clusters as independent parameters and using a categorical color scheme. However, while this would convey the similarities between counties in each cluster, it would not convey how counties in different clusters relate to each other. To address this, we used a key strength of cluster embedding: the summary of the similarities between clusters using positions in 2D space. This can be used to generate a more meaningful assignment of colors reflecting both relationships within and between clusters. Specifically, the EpiVECS tool maps the 2D point representation of the clusters onto a cylindrical color space. We chose the OKHSL space because it is designed to provide a consistent relationship between the distance between points in the color space and the perceived distance in corresponding colors (perceptual uniformity). This is useful because it conveys the relationship between 2D point representations across all points in a relatively consistent manner. While there are many different color spaces that are designed to provide perceptual uniformity, we chose OKHSL due to its simple cylindrical shape. Many color spaces, including OKHSL, are three-dimensional and thus future work could investigate the assignment of colors from a 3D space rather than a 2D (or conical) space.

There has been prior interest in using cluster embedding methods to uncover patterns in spatiotemporal epidemiological data, especially for the spread of COVID-19, but those studies have been limited to self-organizing maps and have not provided a comparison with other methods [30], [31]. An interactive tool similar to EpiVECS has been previously developed for the exploration of multivariate data over space using cluster embedding, but it did not consider methods other than self-organizing maps [32]. More broadly, several interactive visualization tools have been proposed for the exploration of data using self-organizing maps, but these typically are not designed for spatial analysis [33]. Together, this prior work shows the interest in using cluster embedding methods for exploration of complex data. Our work furthers the idea by considering cluster embedding methods other than self-organizing maps, as well as considering some of the challenging properties of epidemiological, spatiotemporal data. Future work could consider additional clustering techniques such as hierarchical clustering methods, which are popular in epidemiology but do not typically provide a natural high-dimensional vector representation of the clusters. That represents both a challenge and an opportunity for a cluster embedding approach. Another direction for future work could be in the pre-processing stage: the results here show that basic time-series smoothing results in higher quality clusters; more advanced techniques could lead to further improvements.

EpiVECS was developed as a web application in order to further its reach and reusability (in accordance with the FAIR principles [34]). The browser is a ubiquitous environment familiar to the vast majority of modern users. The requirement to install an application locally renders users less inclined to use it, in part because many users work in restricted computing environments in which they do not have permission to install software. EpiVECS is a fully in-browser application which means that the user’s data remains entirely within the sandbox of their browser and is not sent to a server. This is especially important for fields such as epidemiology where sensitive data is common.

We have investigated the use of cluster embedding methods for the exploration of spatiotemporal epidemiological data and have found that embedded k-means methods perform better than self-organizing maps on real-world data. This suggests that embedded k-means, and perhaps other embedded clustering techniques, may play an important role in future spatiotemporal data exploration. We have produced an interactive web-tool (EpiVECS) which allows users to perform cluster embedding on their own data and explore the results in an interactive dashboard. On the implementation science side, we have ensured the reusability of the tool through its development as an in-browser web application.

Cluster embedding

A cluster embedding method partitions a dataset into clusters and provides a low-dimensional representation of each cluster. In this work, the input is a list of m-dimensional numeric vectors $X=\left[{v}_{1},{v}_{2},\dots , {v}_{n}\right]$ where ${v}_{i}\in {\mathbb{R}}^{m}$ and a desired number of clusters k. The output is a list of labels $L=[{l}_{1},{l}_{2}, \dots , {l}_{n}]$ indicating the cluster to which each vector was assigned, a list of m-dimensional representations of the clusters $C=[{c}_{1}, {c}_{2}, \dots , {c}_{k}]$ where ${c}_{i}\in {\mathbb{R}}^{m}$, and a list of two-dimensional representations of the clusters $P=[{p}_{1}, {p}_{2}, \dots , {p}_{k}]$ where ${p}_{i}\in {\mathbb{R}}^{2}$. We have compared two types of cluster embedding methods in this work: self-organizing maps and embedded k-means.

Self-organizing maps

Self-organizing maps consist of a set of nodes where each node comprises a weight vector and a location vector. The weight vector is of equal dimensionality to the input vectors and the location vector is typically a two-dimensional vector on a grid structure. The size and shape of the grid are up to the analyst; commonly chosen shapes include rectangular and hexagonal grids. The first step in training a self-organizing map is the initialization of the weight vectors, which is typically done randomly. At a high level, training of a self-organizing map proceeds through a series of steps, at each of which a vector from the input set is chosen and each of the weight vectors is moved a step closer to the chosen input vector. The input vectors are usually chosen sequentially from a random ordering of the input vectors, over several runs (epochs). The node with weight vector closest to the chosen input vector (the ‘best matching node’) takes the largest step towards the input vector, and the steps taken by the other nodes shrink as the distance between their respective location vectors and the best matching node’s location vector grows. In essence, the change propagates through the grid. The size of the step taken by each vector is also dependent on the learning rate, which typically decreases across training. Training is stopped when a pre-chosen maximum number of steps is exceeded or when a convergence criterion is met. For a more detailed description of the specific self-organizing map implementation used in this work, see Availability. For self-organizing maps, the node weight vectors are the high-dimensional cluster representations, C, and the node location vectors are the two-dimensional cluster representations, P.
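The training loop described above can be sketched in a few lines of plain Python. This is an illustrative sketch and not the implementation used in EpiVECS: the function name `train_som`, the rectangular grid, the Gaussian neighborhood function, the linear decay of the learning rate and neighborhood width, the default parameter values, and the initialization from randomly chosen input vectors are all assumptions made for the example.

```python
import math
import random

def train_som(vectors, rows, cols, epochs=50, lr0=0.5, sigma0=None, seed=0):
    """Train a small rectangular self-organizing map.

    Returns (labels L, weight vectors C, grid locations P).
    """
    rng = random.Random(seed)
    m = len(vectors[0])
    sigma0 = sigma0 or max(rows, cols) / 2
    # Location vectors: fixed positions on a rows x cols grid.
    locations = [(r, c) for r in range(rows) for c in range(cols)]
    # Weight vectors: initialized as copies of randomly chosen inputs (assumption).
    weights = [list(rng.choice(vectors)) for _ in locations]
    total_steps = epochs * len(vectors)
    step = 0
    for _ in range(epochs):
        order = list(range(len(vectors)))
        rng.shuffle(order)  # new random ordering of the inputs each epoch
        for idx in order:
            x = vectors[idx]
            frac = step / total_steps
            lr = lr0 * (1 - frac)                # learning rate decays linearly
            sigma = sigma0 * (1 - frac) + 0.01   # neighborhood width also shrinks
            # Best matching node: weight vector closest to the chosen input.
            bmu = min(range(len(weights)), key=lambda j: math.dist(weights[j], x))
            for j, w in enumerate(weights):
                grid_d2 = ((locations[j][0] - locations[bmu][0]) ** 2
                           + (locations[j][1] - locations[bmu][1]) ** 2)
                # Influence is 1 at the best matching node and falls with grid distance.
                influence = math.exp(-grid_d2 / (2 * sigma * sigma))
                for d in range(m):
                    w[d] += lr * influence * (x[d] - w[d])
            step += 1
    # Assign each input to the node with the closest weight vector.
    labels = [min(range(len(weights)), key=lambda j: math.dist(weights[j], v))
              for v in vectors]
    return labels, weights, locations

vectors = [[0, 0], [0.1, 0], [0, 0.1], [5, 5], [5.1, 5], [5, 5.1]]
labels, weights, locations = train_som(vectors, rows=1, cols=2)
```

Note that the low-dimensional representation P is simply the grid itself, which is why it cannot adapt to the data in the way an embedding of k-means centroids can.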

Embedded k-means

An embedded k-means clustering method consists of two steps: applying k-means clustering to the input vectors, and then embedding (i.e. applying a dimensionality-reduction technique to) the centroids of the resulting clusters. The first step in k-means clustering is the initialization of the m-dimensional cluster centroids, often done randomly. Then, each input vector is assigned to the cluster whose centroid it is closest to (for this work, this means closest in Euclidean space) and the centroids are set equal to the mean of all vectors assigned to that cluster. The previous step then repeats until some stopping criterion is met: either convergence or exceeding a pre-defined maximum number of steps. The centroids are the high-dimensional cluster representations, C, introduced above. To calculate the low-dimensional cluster representations P, a dimensionality reduction algorithm is applied to the cluster centroids. In this work, we compare four dimensionality reduction methods: principal component analysis (PCA), Sammon mapping, t-SNE, and UMAP. Sammon mapping was used in the oKMC+ method which, as far as we are aware, is the first application of the embedded k-means approach to cluster embedding [25].
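The two steps can be sketched as follows in plain Python, using PCA as the embedding step because it is the simplest of the four methods to write from scratch. This is a minimal sketch under stated assumptions rather than the EpiVECS implementation: the function names are illustrative, the centroids are initialized by copying k randomly chosen input vectors, and the principal components are found by power iteration with deflation from a fixed start vector, which is adequate for a handful of centroids but not a general-purpose PCA.

```python
import math
import random

def kmeans(vectors, k, max_iter=100, seed=0):
    """Lloyd's algorithm with Euclidean distance; returns (labels L, centroids C)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]  # random initialization
    labels = None
    for _ in range(max_iter):
        # Assignment step: each vector joins the cluster with the closest centroid.
        new_labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                      for v in vectors]
        if new_labels == labels:  # converged: assignments stopped changing
            break
        labels = new_labels
        # Update step: each centroid becomes the mean of its assigned vectors.
        for c in range(k):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

def pca_2d(points, iters=200):
    """Project points onto their first two principal components."""
    n, m = len(points), len(points[0])
    mean = [sum(col) / n for col in zip(*points)]
    centred = [[x - mu for x, mu in zip(p, mean)] for p in points]
    cov = [[sum(row[i] * row[j] for row in centred) / n for j in range(m)]
           for i in range(m)]
    components = []
    for _ in range(2):
        vec = [1.0 / (i + 1) for i in range(m)]  # fixed, deterministic start vector
        for _step in range(iters):
            nxt = [sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # no variance left: use a null component
                vec = [0.0] * m
                break
            vec = [x / norm for x in nxt]
        eig = sum(vec[i] * sum(cov[i][j] * vec[j] for j in range(m)) for i in range(m))
        components.append(vec)
        # Deflation: remove the variance explained by this component.
        cov = [[cov[i][j] - eig * vec[i] * vec[j] for j in range(m)] for i in range(m)]
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components]
            for row in centred]

def embedded_kmeans(vectors, k, seed=0):
    """Cluster with k-means, then embed the centroids in 2D; returns (L, C, P)."""
    labels, centroids = kmeans(vectors, k, seed=seed)
    return labels, centroids, pca_2d(centroids)

vectors = [[0, 0, 0], [0.2, 0, 0], [0, 0.2, 0], [5, 5, 5], [5.2, 5, 5], [5, 5.2, 5]]
L, C, P = embedded_kmeans(vectors, 2)
```

Because the embedding is applied only to the k centroids, swapping PCA for Sammon mapping, t-SNE, or UMAP changes only the last line of `embedded_kmeans`; the clustering step is unaffected.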

Data processing

Pre-processing the input data is an important step in ensuring it is suitable for the cluster embedding methods. A typical first step is to normalize the data. There are many different normalization methods and the choice of which to use depends on the goals of the analysis, e.g. whether the scale of the vectors is important or just their shape. A popular choice for normalizing numerical data is z-score normalization, where values are normalized so that the mean equals 0 and the standard deviation equals 1. For vector data, z-score normalization can be done row-wise, where each vector is normalized, or column-wise, where each dimension is normalized. We are interested in retaining the shapes of the time-series, so we have chosen to apply row-wise z-score normalization when comparing the performance of the cluster embedding methods. However, row-wise, column-wise, and combined z-score normalization methods are available as pre-processing options in the EpiVECS web-tool (see Availability).
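The difference between the row-wise and column-wise variants is easiest to see on a small example. The sketch below uses only the Python standard library; the function names are illustrative, and the use of the population standard deviation and the handling of constant vectors (mapped to zeros) are assumptions made for the example.

```python
import statistics

def zscore(values):
    """Scale values to mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:  # constant input: no shape to preserve, so return zeros
        return [0.0] * len(values)
    return [(v - mean) / sd for v in values]

def zscore_rows(matrix):
    """Row-wise: normalize each vector (time-series) independently."""
    return [zscore(row) for row in matrix]

def zscore_columns(matrix):
    """Column-wise: normalize each dimension (time point) across all vectors."""
    columns = [zscore(col) for col in zip(*matrix)]
    return [list(row) for row in zip(*columns)]

# Two series with the same shape but a tenfold difference in scale.
data = [[1, 2, 3], [10, 20, 30]]
by_row = zscore_rows(data)
by_col = zscore_columns(data)
```

Row-wise normalization maps both series to the same vector, so a clustering method would treat them as identical in shape, whereas column-wise normalization keeps the difference in scale and discards the shared upward trend.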

Real-world time-series data is often noisy, and noise can have a substantial impact on the performance of clustering methods. Therefore, it can be useful to reduce noise in the data before applying a cluster embedding method. One way to reduce noise is by applying a time-series smoothing technique, the simplest of which is a moving average smoother. A moving average smoother slides a fixed-size window over a time-series and replaces the value at the center of the window with the mean of all values in the window. For a window of size 2h + 1 and vector x, the $i$-th value in the smoothed vector, ${s}_{i}$, is set as follows:

$${s}_{i}=mean({x}_{i-h},\dots ,{x}_{i},\dots ,{x}_{i+h})$$

This formula is undefined at the start and end of the vector, where the window extends past the available values, and applied as is it produces a smoothed vector which is shorter than the input vector. This can be mitigated in a number of ways; we have chosen to shorten the sliding window as needed at the ends of the vector, e.g. ${s}_{0}=mean({x}_{0},\dots ,{x}_{h})$. This means that some smoothing is still performed at the edges of the vector and the initial length of the vector is maintained.
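A minimal Python sketch of this smoother, with the window shortened at both ends of the vector, is given below. The function name `moving_average` is illustrative and zero-based indexing is assumed.

```python
def moving_average(x, h):
    """Moving average with window size 2h + 1, truncated at the vector's ends.

    The output has the same length as the input.
    """
    n = len(x)
    smoothed = []
    for i in range(n):
        lo = max(0, i - h)       # clip the window at the start of the vector
        hi = min(n, i + h + 1)   # clip the window at the end of the vector
        window = x[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed

smoothed = moving_average([1, 2, 3, 4, 5], h=1)
# smoothed == [1.5, 2.0, 3.0, 4.0, 4.5]
```

With h = 1 the interior values are averages of three neighbors, while the first and last values are averages of the two values that are available.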

Coloring

Assigning a color to each cluster is a useful way to visualize cluster assignments, especially across multiple visualizations. When a standard clustering method is applied, there is no explicit relationship between clusters, so colors are typically assigned using a categorical color scheme. However, with cluster embedding methods the 2D representation positionally conveys relationships between the clusters, and therefore can be used to assign colors in a more informative way. In the EpiVECS tool we provide the user with two color assignment options, both based on the OKHSL color space.

The OKHSL color space is a variant of the standard HSL color space with an emphasis on perceptual uniformity [35]. Perceptual uniformity is an aspirational property of certain color spaces in which perceived differences in color are proportional to the distances between the numerical representations of those colors in the space. A color in the OKHSL space is defined by three variables: hue, saturation, and lightness. Hue and saturation are polar coordinates (angle and radius respectively) on a circular plane, and lightness is the “vertical” axis of a cylinder formed by the circular planes. To color 2D points using this space, we center the color cylinder at the centroid of the points and map points onto the color space based on their angle and distance from the center. Specifically, for a set of k points P we calculate the centroid $\bar{P}$ and a radius R, defined as the maximum Euclidean distance from the centroid across all points, $R=\max(\parallel {p}_{1}-\bar{P}{\parallel }_{2}, \dots ,\parallel {p}_{k}-\bar{P}{\parallel }_{2})$. For each point $p\in P$ we then calculate its vector angle to the centroid, $\theta =\operatorname{atan2}({p}_{y}-{\bar{P}}_{y}, {p}_{x}-{\bar{P}}_{x})$, and its Euclidean distance to the centroid, $r=\parallel p-\bar{P}{\parallel }_{2}$. These values are then normalized to the [0,1] range and passed to the OKHSL color space equation alongside a fixed lightness l:

$$color\left(p\right)=OKHSL\left(\frac{\pi +\theta }{2\pi },\frac{r}{R},l\right)$$

This produces a set of points of uniform lightness, which is useful in maintaining color space uniformity but can make it hard to distinguish between visual elements. An alternative approach is to vary the lightness with r so that points further from the centroid are lighter:

$$color\left(p\right)=OKHSL\left(\frac{\pi +\theta }{2\pi },\frac{r}{R},{l}_{l}+\frac{r}{R}\cdot ({l}_{h}-{l}_{l})\right)$$

where ${l}_{l}$ and ${l}_{h}$ are the minimum and maximum lightness, respectively.
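Both color assignment options can be sketched as a single Python function that returns the normalized (hue, saturation, lightness) arguments of the two equations above. The conversion from OKHSL coordinates to a displayable RGB color is deliberately left out, since it requires a separate OKHSL implementation. The function name, the default lightness of 0.6, and the fallback radius used when all points coincide are assumptions made for the example.

```python
import math

def okhsl_coordinates(points, lightness=0.6, light_range=None):
    """Map 2D cluster positions to (hue, saturation, lightness) triples in [0, 1].

    With light_range=None every point gets the same fixed lightness;
    with light_range=(l_low, l_high) lightness grows with distance from the centroid.
    """
    k = len(points)
    cx = sum(p[0] for p in points) / k
    cy = sum(p[1] for p in points) / k
    radii = [math.hypot(p[0] - cx, p[1] - cy) for p in points]
    big_r = max(radii) or 1.0  # avoid division by zero if all points coincide
    coords = []
    for p, r in zip(points, radii):
        theta = math.atan2(p[1] - cy, p[0] - cx)      # angle in [-pi, pi]
        hue = (math.pi + theta) / (2 * math.pi)       # normalized to [0, 1]
        sat = r / big_r                               # normalized to [0, 1]
        if light_range is None:
            light = lightness
        else:
            l_low, l_high = light_range
            light = l_low + sat * (l_high - l_low)
        coords.append((hue, sat, light))
    return coords

points = [(1, 0), (-1, 0), (0, 0.5), (0, -0.5)]
fixed = okhsl_coordinates(points)
varied = okhsl_coordinates(points, light_range=(0.4, 0.8))
```

Clusters that are close together in the 2D embedding receive similar hue and saturation values, which is what allows the choropleth colors to reflect relationships between clusters as well as within them.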

Code Availability

The code for the EpiVECS modules and web-tool is available at https://github.com/episphere/epivecs . The backend functionality of the tool, including the data processing, the cluster embedding, and the color assignment, is available in ES6 modules. Details on how to import and use these modules can be found in the GitHub readme. The JavaScript code used to calculate the cluster embeddings compared in the results section can be found at https://observablehq.com/@siliconjazz/epivecs-results-calculate, and the code used to calculate the clustering and dimensionality reduction metrics can be found in the ‘comparison_metrics_calculator.ipynb’ Jupyter notebook in the GitHub repository. A notebook explaining the methods with interactive elements can be found at https://observablehq.com/@siliconjazz/epivecs-cluster-embedding-on-the-web. For more details on self-organizing maps, see https://observablehq.com/@siliconjazz/self-organizing-maps. The choice to distribute much of the code through Observable Notebooks is motivated both by the power of literate programming environments for explaining concepts and by the enhanced reproducibility of reactive notebooks over traditional notebooks [36].

Author Contributions

LM conceived the idea, wrote the code, performed the evaluations, and was the primary author of the manuscript text. JA oversaw the development of the methods, web-tool, and manuscript. JA and BH provided conceptual contributions to the idea throughout its development. JA and BH provided feedback and substantial additions and modifications to the manuscript text.

Data Availability

The datasets analyzed during the current study are available in the GitHub repository, https://github.com/episphere/epivecs/ . The code used to obtain this data can be found at https://observablehq.com/@siliconjazz/epvecs-data-retrieval, allowing interested readers to obtain more up-to-date versions of the data.

Competing Interests

The authors declare no competing interests.

Kirby, R. S., Delmelle, E. & Eberth, J. M. Advances in spatial epidemiology and geographic information systems. Ann. Epidemiol. 27, 1–9 (2017).
Eberth, J. M., Kramer, M. R., Delmelle, E. M. & Kirby, R. S. What is the place for space in epidemiology? Ann. Epidemiol. 64, 41–46 (2021).
Sun, F., Matthews, S. A., Yang, T.-C. & Hu, M.-H. A spatial analysis of the COVID-19 period prevalence in U.S. counties through June 28, 2020: where geography matters? Ann. Epidemiol. 52, 54–59.e1 (2020).
Cohen, S. A., Cook, S. K., Kelley, L., Foutz, J. D. & Sando, T. A. A Closer Look at Rural-Urban Health Disparities: Associations Between Obesity and Rurality Vary by Geospatial and Sociodemographic Factors. J. Rural Health 33, 167–179 (2017).
Pfeiffer, D. U. & Stevens, K. B. Spatial and temporal epidemiological analysis in the Big Data era. Prev. Vet. Med. 122, 213–220 (2015).
Byun, H. G., Lee, N. & Hwang, S. A Systematic Review of Spatial and Spatio-temporal Analyses in Public Health Research in Korea. J. Prev. Med. Pub. Health 54, 301–308 (2021).
Nazia, N. et al. Methods Used in the Spatial and Spatiotemporal Analysis of COVID-19 Epidemiology: A Systematic Review. Int. J. Environ. Res. Public. Health 19, 8267 (2022).
Fatima, M., O’Keefe, K. J., Wei, W., Arshad, S. & Gruebner, O. Geospatial Analysis of COVID-19: A Scoping Review. Int. J. Environ. Res. Public. Health 18, 2336 (2021).
Johnson, B. T., Cromley, E. K. & Marrouch, N. Spatiotemporal meta-analysis: reviewing health psychology phenomena over space and time. Health Psychol. Rev. 11, 280–291 (2017).
Davis, G. S., Sevdalis, N. & Drumright, L. N. Spatial and temporal analyses to investigate infectious disease transmission within healthcare settings. J. Hosp. Infect. 86, 227–243 (2014).
Blangiardo, M. et al. Advances in spatiotemporal models for non-communicable disease surveillance. Int. J. Epidemiol. 49, i26–i37 (2020).
Atluri, G., Karpatne, A. & Kumar, V. Spatio-Temporal Data Mining: A Survey of Problems and Methods. ACM Comput. Surv. 51, 1–41 (2019).
Preim, B. & Lawonn, K. A Survey of Visual Analytics for Public Health. Comput. Graph. Forum 39, 543–580 (2020).
Raidou, R. G. Visual Analytics for the Representation, Exploration, and Analysis of High-Dimensional, Multi-faceted Medical Data. in Biomedical Visualisation (ed. Rea, P. M.) vol. 1138 137–162 (Springer International Publishing, 2019).
Cui, W. Visual Analytics: A Comprehensive Overview. IEEE Access 7, 81555–81573 (2019).
Pena-Araya, V., Pietriga, E. & Bezerianos, A. A Comparison of Visualizations for Identifying Correlation over Space and Time. IEEE Trans. Vis. Comput. Graph. 1–1 (2019) doi:10.1109/TVCG.2019.2934807.
Andreo, V. et al. Time Series Clustering Applied to Eco-Epidemiology: the case of Aedes aegypti in Córdoba, Argentina. in 2019 XVIII Workshop on Information Processing and Control (RPIC) 93–98 (IEEE, 2019). doi:10.1109/RPIC.2019.8882184.
Rojas, F., Valenzuela, O. & Rojas, I. Estimation of COVID-19 dynamics in the different states of the United States using Time-Series Clustering. http://medrxiv.org/lookup/doi/10.1101/2020.06.29.20142364 (2020) doi:10.1101/2020.06.29.20142364.
Bogado, J. V., Stalder, D. H., Schaerer, C. E. & Gomez-Guerrero, S. Time Series Clustering to Improve Dengue Cases Forecasting with Deep Learning. in 2021 XLVII Latin American Computing Conference (CLEI) 1–10 (IEEE, 2021). doi:10.1109/CLEI53233.2021.9640130.
Abbas, O. A. Comparisons between data clustering algorithms. Int. Arab J. Inf. Technol. IAJIT 5, (2008).
Jain, A. K. Data clustering: 50 years beyond K-means. Pattern Recognit. Lett. 31, 651–666 (2010).
Miljkovic, D. Brief review of self-organizing maps. in 2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) 1061–1066 (IEEE, 2017). doi:10.23919/MIPRO.2017.7973581.
Flexer, A. On the use of self-organizing maps for clustering and visualization. Intell. Data Anal. 5, 373–384 (2001).
Brito da Silva, L. E. & Wunsch, D. C. An Information-Theoretic-Cluster Visualization for Self-Organizing Maps. IEEE Trans. Neural Netw. Learn. Syst. 29, 2595–2613 (2018).
Flexer, A. Limitations of self-organizing maps for vector quantization and multidimensional scaling. Adv. Neural Inf. Process. Syst. 9, (1996).
Liu, Y., Li, Z., Xiong, H., Gao, X. & Wu, J. Understanding of Internal Clustering Validation Measures. in 2010 IEEE International Conference on Data Mining 911–916 (IEEE, 2010). doi:10.1109/ICDM.2010.35.
Arbelaitz, O., Gurrutxaga, I., Muguerza, J., Pérez, J. M. & Perona, I. An extensive comparative study of cluster validity indices. Pattern Recognit. 46, 243–256 (2013).
Rousseeuw, P. J. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987).
Zhang, Y., Shang, Q. & Zhang, G. pyDRMetrics - A Python toolkit for dimensionality reduction quality assessment. Heliyon 7, e06199 (2021).
Melin, P., Monica, J. C., Sanchez, D. & Castillo, O. Analysis of Spatial Spread Relationships of Coronavirus (COVID-19) Pandemic in the World using Self Organizing Maps. Chaos Solitons Fractals 138, 109917 (2020).
Galvan, D., Effting, L., Cremasco, H. & Conte-Junior, C. A. The Spread of the COVID-19 Outbreak in Brazil: An Overview by Kohonen Self-Organizing Map Networks. Medicina (Mex.) 57, 235 (2021).
Guo, D., Chen, J., MacEachren, A. M. & Liao, K. A Visualization System for Space-Time and Multivariate Patterns (VIS-STAMP). IEEE Trans. Vis. Comput. Graph. 12, 1461–1474 (2006).
Sacha, D. et al. SOMFlow: Guided Exploratory Cluster Analysis with Self-Organizing Maps and Analytic Provenance. IEEE Trans. Vis. Comput. Graph. 24, 120–130 (2018).
García-Closas, M. et al. Moving Toward Findable, Accessible, Interoperable, Reusable Practices in Epidemiologic Research. Am. J. Epidemiol. 192, 995–1005 (2023).
Ottosson, B. Two new color spaces for color picking - Okhsv and Okhsl. https://bottosson.github.io/posts/colorpicker/ (2021).
Perkel, J. M. Reactive, reproducible, collaborative: computational notebooks evolve. Nature 593, 156–157 (2021).

Editorial decision: Major revision, 25 Oct, 2023
Reviews received at journal: 21 Oct, 2023
Reviewers agreed at journal: 10 Oct, 2023
Reviewers agreed at journal: 10 Oct, 2023
Reviewers invited by journal: 10 Oct, 2023
Editor assigned by journal: 10 Oct, 2023
Editor invited by journal: 10 Oct, 2023
Submission checks completed at journal: 10 Oct, 2023
First submitted to journal: 06 Oct, 2023