Comparative assessment of projection and clustering method combinations in the analysis of biomedical data

doi:10.21203/rs.3.rs-2658032/v1

Research Article

https://doi.org/10.21203/rs.3.rs-2658032/v1

This work is licensed under a CC BY 4.0 License

Version 1

Background

Clustering on projected data is a common component of the analysis of biomedical research datasets. Among projection methods, principal component analysis (PCA) is the most commonly used. It focuses on the dispersion (variance) of the data, whereas clustering attempts to identify concentrations (neighborhoods) within the data. These may be conflicting aims. This report re-evaluates combinations of PCA and other common projection methods with common clustering algorithms.

Methods

PCA, independent component analysis (ICA), isomap, multidimensional scaling (MDS), and t-distributed stochastic neighborhood embedding (t-SNE) were combined with common clustering algorithms (partitioning: k-means, k-medoids, and hierarchical: single, Ward's, average linkage). Projections and clusterings were assessed visually by tessellating the two-dimensional projection plane with Voronoi cells and calculating common measures of cluster quality. Clustering on projected data was evaluated on nine artificial and five real biomedical datasets.

Results

None of the combinations always gave correct results in terms of capturing the prior classifications in the projections and clusters. Visual inspection of the results is therefore essential. PCA was never ranked first, but was consistently outperformed or equaled by neighborhood-based methods such as t-SNE or manifold learning techniques such as isomap.

Conclusions

The results do not support PCA as the standard projection method prior to clustering. Instead, several alternatives with visualization of the projection and clustering results should be compared. A visualization is proposed that uses a combination of Voronoi tessellation of the projection plane according to the clustering with a color coding of the projected data points according to the prior classes. This can be used to find the best combination of data projection and clustering in a given data set.

A common goal of computational analysis of biomedical data, such as clinical signs or symptoms, laboratory values, questionnaire data, or cell surface antigens, is to identify valid subgroup or class structures, e.g., malignant versus normal cells. In fact, clustering is widely used in biomedical research. A search of the PubMed database at https://pubmed.ncbi.nlm.nih.gov on 25 October 2022 for string #1 in Table 1 (clustering) returned 493,484 hits. A typical approach in this setting is to use a particular projection method, namely Principal Component Analysis (PCA) [1, 2], to project the data into two dimensions, followed by clustering of the projected data. The projections or clusterings are then interpreted in terms of group separation or to support overlap between the data-driven grouping and an often known prior class structure, such as patients versus healthy individuals. Some projection techniques are applied prior to clustering.

Table 1

Search strings and hits obtained on October 25, 2022, from a query of the PubMed database (https://pubmed.ncbi.nlm.nih.gov).
Targeted methods	String #	Search string	Number of hits
Clustering	1	((clustering OR "cluster analysis") NOT ("cluster trial" OR "cluster randomized trial" OR "cluster randomized controlled trial" OR "cluster headache" OR "Symptom Cluster"))	493,484
PCA	2	(("principal component analysis" OR "principal components analysis" OR "principle component analysis" OR "PCA") NOT ("patient controlled analgesia" OR "patient-controlled analgesia")) NOT review[PT]	95,807
Clustering and all projection methods included	3	(((clustering OR "cluster analysis") NOT ("cluster trial" OR "cluster randomized trial" OR "cluster randomized controlled trial" OR "cluster headache")) AND ((("principal component analysis" OR "principal components analysis" OR "principle component analysis" OR "PCA") NOT ("patient controlled analgesia" OR "patient-controlled analgesia")) OR ("manifold learning" OR "manifold embedding" OR "multidimensional scaling" OR "isomap" or "t-distributed stochastic neighborhood embedding" OR "tSNE" OR "t-SNE"))) NOT review[PT]	14,582
Clustering and MDS	4	((clustering OR "cluster analysis") NOT ("cluster trial" OR "cluster randomized trial" OR "cluster randomized controlled trial" OR "cluster headache" OR "Symptom Cluster")) AND "multidimensional scaling" NOT review[PT]	1,155
Clustering and t-SNE	5	((clustering OR "cluster analysis") NOT ("cluster trial" OR "cluster randomized trial" OR "cluster randomized controlled trial" OR "cluster headache" OR "Symptom Cluster")) AND ("t-distributed stochastic neighborhood embedding" OR "tSNE" OR "t-SNE") NOT review[PT]	266
Clustering and isomap	6	((clustering OR "cluster analysis") NOT ("cluster trial" OR "cluster randomized trial" OR "cluster randomized controlled trial" OR "cluster headache" OR "Symptom Cluster")) AND "isomap" NOT review[PT]	26

PCA focuses on the dispersion (variance) of the data. Clustering, on the other hand, attempts to identify concentrations (neighborhoods) within the data. In this sense, PCA and clustering are opposing methods. The widespread use of PCA as a standard data projection method may be due to its standard use in introductory data analysis courses or at the beginning of book chapters on data projection and clustering, its inclusion in virtually every statistical software package used in the field, or even a selection bias in the sense of a Matthew effect [3]. PCA, as one of the oldest projection methods, is the most frequently used, with 95,807 hits for search string #2 in Table 1. The combination of clustering with any of the included projection methods (string #3 in Table 1) yielded 14,582 results. Other options are also available. Searching for mentions of clustering with alternative projection methods (strings #4 to #6) yielded fewer hits, i.e., 1,155 results for the combination with multidimensional scaling (MDS) [4, 5], 266 results for the combination with t-distributed stochastic neighborhood embedding (t-SNE) [6], and 26 hits for the combination with isomap [7].

This report is concerned with the data analysis setting of clustering projected data, ignoring the fact that clustering methods can also be applied to unprojected data. In particular, we consider whether the predominant use of PCA in biomedical research is justified, given that modern computer hardware and software allow the widespread use of alternative projection techniques. However, these methods may introduce new constraints on data analysis if they are to replace a method that has been established for decades. The limitations of projection and clustering methods are well known and center on the fact that no projection of high-dimensional data to lower dimensions can preserve all distance and/or neighborhood relationships of multivariate data [8]. For example, PCA fails to separate data sets with non-linear interrelationships, and most clustering algorithms assume a particular cluster shape, such as hyperspheres for partitioning-based clusters using the k-means algorithm [9, 10] or hyperellipsoids for hierarchical clusters using Ward's linkage [11]. Textbooks have pointed this out, but it is unclear whether an actual dataset is affected by a weakness in the available standard methods. Without this knowledge of the structural features of a real data set, which is rarely certain in biomedical data, the use of a well-known standard such as hierarchical clustering on principal components [12] seems safe.

To illustrate this problem, two artificial datasets are shown for which the cluster structure is well separated and known from their creation, and can be easily recognized visually as their dimensionality is less than four [13] (Fig. 1). Two projection methods are used, PCA and t-SNE, and two hierarchical clustering algorithms, single and Ward's linkage, are applied to the projected data. None of the approaches alone was sufficient to identify the correct cluster structure even in these simple data sets. Therefore, in this report, PCA and other projection methods as well as different clustering methods have been applied with the aim of finding the "best" combination of projection and clustering methods to be used in a standard workflow for the analysis of multidimensional biomedical data.

Selection of projection and clustering methods

PCA is the widely used standard projection method for clinical and other biomedical data on which the present comparative analyses were designed. PCA is essentially a rotation of the data [2]. A selection of the rotated axes, called principal components (PCs), are used to project the data into a subspace of lower dimensionality. The first PC has the largest possible variance in the data. Each subsequent orthogonal component is selected in the direction of the highest possible remaining variance. Choosing the "principal" components as the projection method reduces the dimension of the data set while retaining most of its variance. PCA is also a common starting point for cluster analysis because the PCs are orthogonal and thus remove correlations from the data set. However, subgroup structure in a data set may also be caused by independent actors in the data set, which can be better investigated using independent component analysis (ICA) [14]. PCA and ICA, and many other projection methods, are inadequate for data sets with non-linear relationships between variables. Therefore, various learning techniques have been introduced in biomedical data analysis, such as multidimensional scaling (MDS) [4, 5] or isomap [7]. Further projections focus on neighborhoods, so t-distributed stochastic neighborhood embedding (t-SNE) [6] was included as a currently popular method in biomedical research.
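
The description of PCA as a rotation of the data can be illustrated for the case of two variables. The following Python sketch is an illustration only and not the implementation used in this report, which relied on the R package "FactoMineR". The function names and the toy data are hypothetical, and the sketch uses the closed-form rotation angle of a 2 x 2 covariance matrix rather than a general eigendecomposition.

```python
import math

def variance(values):
    """Sample variance (denominator n - 1)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def pca_2d(points):
    """Rotate centered 2-D points onto their principal axes."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Angle of the axis with the largest variance for a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    c, s = math.cos(theta), math.sin(theta)
    # Pure rotation: first coordinate is PC1, second coordinate is PC2.
    return [(c * x + s * y, -s * x + c * y) for x, y in centered]

# Hypothetical toy data with two strongly correlated variables.
data = [(2.0, 1.9), (0.5, 0.7), (-1.0, -1.2), (-2.0, -1.8), (0.5, 0.4)]
scores = pca_2d(data)
pc1 = [a for a, _ in scores]
pc2 = [b for _, b in scores]
print(round(variance(pc1), 3), round(variance(pc2), 3))
```

In this sketch, the summed variance of the two components equals the summed variance of the original variables, and the components are uncorrelated. Variance is only discarded once components are dropped, which is the dimension reduction step described above.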

Among clustering methods, standard partition-based (e.g., k-means) and hierarchical algorithms (e.g., linkage methods) were included because they are widely used and implemented in software packages commonly available in biomedical research environments. Partitioning based clustering has been implemented as k-means clustering [9] and partitioning around medoids (PAM) [15]. Hierarchical clustering with Ward's linkage [11] was used along with average, single and complete linkage in analogy to the choice made in [16].
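
The agglomerative principle behind the linkage methods can be made concrete with single linkage, where the distance between two clusters is the distance between their two closest members. The following Python sketch is a simplified illustration under that definition and not the R implementation used in this report; the function name and the toy points are hypothetical.

```python
from itertools import combinations
from math import dist

def single_linkage(points, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the representative of the cluster of i.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All point pairs, shortest Euclidean distance first.
    edges = sorted(
        combinations(range(len(points)), 2),
        key=lambda e: dist(points[e[0]], points[e[1]]),
    )
    remaining = len(points)
    for a, b in edges:
        if remaining == k:
            break
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra  # merge the two clusters
            remaining -= 1
    # Relabel cluster representatives as 0, 1, 2, ...
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(points))]

# Hypothetical points: a chain of four points and a distant pair.
chain = [(0, 0), (1, 0), (2, 0), (3, 0), (10, 0), (11, 0)]
print(single_linkage(chain, k=2))
```

Because only the closest members count, single linkage follows chains of neighboring points, which is why it can recover elongated or nested structures that methods favoring compact clusters split.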

Experimentation

Programming was performed in the R language [17] using the R software package [18], version 4.2.1 for Linux, available free of charge from the Comprehensive R Archive Network (CRAN) at https://CRAN.R-project.org/. Experiments were performed on 1–64 cores (threads) of an AMD Ryzen Threadripper 3970X (Advanced Micro Devices, Inc., Santa Clara, CA, USA) computer with 256 GB of random-access memory (RAM) running on Ubuntu Linux 22.04.1 LTS (Canonical, London, UK). Parallel processing was programmed using the implementation of the "parallel" R library provided with the R base environment [18].

Comparative combinatorial evaluation of common projection and clustering methods

Artificial FCPS data set

A first dataset contained examples of low dimensionality, chosen for demonstration purposes to reduce complexity and make it easier to assess the results. It consists of a collection of artificial datasets to address a variety of different challenges for clustering and/or projection algorithms and was available as the so-called Fundamental Clustering and Projection Suite (FCPS) [13]. It provides different subsets created to address specific challenges to clustering algorithms, such as lack of linear separability, different or small inner class distances, classes defined by data density rather than data distances, outliers, or classes in contact. Nine of the ten data sets named "Atom", "Chainlink", "EngyTime", "Golfball", "Hepta", "Lsun", "Target", "Tetra", "TwoDiamonds", and "WingNut" were included in the present study (Supplemental Fig. 1). All of them have 2–3 dimensions and a classification is available, which could be used to test the classification results obtained by different projection and clustering methods in the present evaluations. The dataset collection is available as the R package "FCPS" on https://cran.r-project.org/package=FCPS [19].

Data analysis

Data projections

The projections were performed after z-standardization of the datasets, and several R packages available for this purpose were used, including "FactoMineR" (https://cran.r-project.org/package=FactoMineR [20]) for PCA, "fastICA" (https://cran.r-project.org/package=fastICA [21]) for ICA, "Rtsne" (https://cran.r-project.org/package=Rtsne [22]) for t-SNE, "MASS" (https://cran.r-project.org/package=MASS [23]) for the nonmetric version of MDS and "RDRToolbox" (https://bioconductor.org/packages/RDRToolbox/ [24]) for isomap projection.
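
The z-standardization step can be sketched as follows. This is a minimal Python illustration, assuming that each variable is centered on its mean and divided by its sample standard deviation (denominator n - 1); the function name and the example values are hypothetical and do not come from the analyzed datasets.

```python
import statistics

def z_standardize(rows):
    """Scale every column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # fails on < 2 rows
    return [
        [(value - m) / sd for value, m, sd in zip(row, means, sds)]
        for row in rows
    ]

# Hypothetical data: two variables measured on very different scales.
raw = [[1.0, 200.0], [2.0, 400.0], [3.0, 600.0]]
scaled = z_standardize(raw)
print(scaled)
```

After this step both variables contribute on the same scale to the Euclidean distances used by the projection and clustering methods. A constant variable has a standard deviation of zero and would have to be removed beforehand.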

Clustering

Clustering was performed on the projected data using the R packages "stats" and "cluster" from the R base environment (https://cran.r-project.org/package=cluster [25]). Partitioning based clustering including k-means and k-medoids as well as hierarchical clustering using Ward's, average, single and complete linkage were applied. The Euclidean distance was used.

Assessment of the agreement of the projections and clusters with prior classes

The consistency of data projections and cluster solutions with prior classes was assessed using both visualizations of the projections and cluster or class structures and common numerical measures of cluster stability and quality.

First, data projections were visualized by plotting the projected data points on a two-dimensional \({\mathbb{R}}^{2}\) plane. Points were colored according to the prior classification. Voronoi cells [26] were computed around each data point in the projection plane and colored according to the prior classes. This provided an intuitive visualization to assess the degree to which the projection of the data preserved the class structure.

Second, clusters were visualized by plotting the projected data points on a two-dimensional \({\mathbb{R}}^{2}\) plane. Points were colored according to the prior classification. Voronoi cells [26] were computed around each data point in the projection plane and colored according to the cluster membership of each data point in a cell (Fig. 1). This provided an intuitive visualization to assess the degree of consistency between the obtained clusters and the prior classes.
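
The idea behind this visualization can be sketched without a plotting library. The Voronoi cell of a data point contains all locations of the plane that are closer to this point than to any other point, so coloring the cells by cluster is equivalent to assigning every location the cluster of its nearest projected point. The following Python sketch rasterizes the plane in this way; it is an illustration with hypothetical names and toy data, not the code used for the figures, and it assumes that cluster labels have already been matched to class labels.

```python
def nearest_label(location, points, labels):
    """Label of the point whose Voronoi cell contains `location`."""
    best = min(
        range(len(points)),
        key=lambda i: (points[i][0] - location[0]) ** 2
        + (points[i][1] - location[1]) ** 2,
    )
    return labels[best]

def voronoi_raster(points, labels, steps=5):
    """Grid of labels covering the bounding box of the points (top row first)."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    grid = []
    for row in range(steps):
        y = y1 - (y1 - y0) * row / (steps - 1)
        grid.append([
            nearest_label((x0 + (x1 - x0) * col / (steps - 1), y), points, labels)
            for col in range(steps)
        ])
    return grid

# Hypothetical projected points, cluster labels, and prior classes.
projected = [(0.0, 0.0), (1.0, 0.2), (4.0, 4.0), (5.0, 3.8)]
clusters = ["A", "A", "B", "B"]
prior = ["A", "A", "B", "A"]

raster = voronoi_raster(projected, clusters)
for line in raster:
    print("".join(line))

# Points whose prior class differs from the cluster color of their cell.
mismatches = [p for p, c, k in zip(projected, clusters, prior) if c != k]
print(mismatches)
```

In the figures of this report the same comparison is made visually: a point whose color differs from the color of its surrounding cell marks a disagreement between cluster and prior class.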

Third, cluster quality and stability were assessed by calculating the cluster accuracy, the adjusted Rand index [27] against the prior classification of the data, Dunn’s index [28], and the average silhouette width, calculated using the R packages “fossil” (https://cran.r-project.org/package=fossil [29]), “clValid” (https://cran.r-project.org/package=clValid [30]) and “cluster” from the R base package. The selection corresponded to previous reports where these numerical measures of cluster quality and stability had been shown to adequately capture clustering outcomes [31]. The measures were obtained in a cross-validation scenario repeated 100 times on subsamples of the data set drawn by means of bootstrapping [32]. Method comparison was performed on the ranks per data set and cluster quality index.
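
Two of the external measures named above can be written out in a few lines. The following Python sketch computes the adjusted Rand index from the contingency table of classes and clusters, and a cluster accuracy defined as the share of correctly assigned points under the best one-to-one mapping of cluster labels to class labels. The exact definition of cluster accuracy used in this report is not spelled out, so this definition is an assumption, as are the function names and the toy labels. The brute-force mapping is only practical for a small number of classes and assumes no more clusters than classes.

```python
from collections import Counter
from itertools import permutations
from math import comb

def adjusted_rand_index(classes, clusters):
    """Chance-corrected agreement between two partitions of the same points."""
    n = len(classes)
    cells = Counter(zip(classes, clusters))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(classes).values())
    sum_cols = sum(comb(c, 2) for c in Counter(clusters).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    maximum = (sum_rows + sum_cols) / 2
    if maximum == expected:  # degenerate case, e.g. one class and one cluster
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

def cluster_accuracy(classes, clusters):
    """Share of matching points under the best cluster-to-class mapping."""
    class_names = sorted(set(classes))
    cluster_names = sorted(set(clusters))
    best = 0
    for assigned in permutations(class_names, len(cluster_names)):
        mapping = dict(zip(cluster_names, assigned))
        hits = sum(mapping[k] == c for c, k in zip(classes, clusters))
        best = max(best, hits)
    return best / len(classes)

# Hypothetical labels: three patients ("p") and three healthy subjects ("h").
classes = ["p", "p", "p", "h", "h", "h"]
clusters = [1, 1, 2, 2, 2, 2]
ari = adjusted_rand_index(classes, clusters)
acc = cluster_accuracy(classes, clusters)
print(round(ari, 3), round(acc, 3))
```

The example shows why both measures are reported: one misassigned point out of six still gives a high accuracy, whereas the adjusted Rand index, which corrects for agreement expected by chance, is considerably lower.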

Application of common projection and clustering methods to biomedical data sets

Identification of biomedical research topics involving projection and clustering

A bibliometric analysis was performed to identify biomedical research settings where projection and clustering are common components of data analysis. A search of the PubMed database at https://pubmed.ncbi.nlm.nih.gov/ was performed on October 25, 2022, using search string #3 of Table 1. The hits were filtered for the exact sequences of the queried words to occur in titles, abstracts, or key words. Subsequently, titles were analyzed for informative words, using the R libraries “easyPubMed” (https://cran.r-project.org/package=easyPubMed [33]) and “PubMedWordcloud” (https://cran.r-project.org/package=PubMedWordcloud [34]). The obtained word list was filtered against two generic texts comprising an English translation of Franz Kafka's novel "The Metamorphosis" and a generic scientific text about generic language in scientific texts [35], and additional manual filtering of uninformative words was done. The results of this analysis highlighted areas from which to select biomedical data sets that reflect a relevant range of topics, including "omics" datasets from metabolomics or genomics, cell surface markers, or chemometric analyses (Fig. 2). This also shows that most of the papers included were published recently, with a peak retrieval in 2017.
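
The word filtering step can be sketched as a set difference between the words found in the titles and the words found in the generic reference texts. The following Python sketch is an illustration only; the analysis in this report used the R libraries named above, and the function names, titles, and reference text shown here are hypothetical placeholders.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case words of at least two letters; hyphens are kept."""
    return re.findall(r"[a-z][a-z\-]+", text.lower())

def informative_words(titles, generic_texts, manual_stopwords=()):
    """Count title words that occur in none of the generic texts."""
    generic = set(manual_stopwords)
    for text in generic_texts:
        generic.update(tokenize(text))
    counts = Counter(word for title in titles for word in tokenize(title))
    return Counter({w: c for w, c in counts.items() if w not in generic})

# Hypothetical titles and a hypothetical generic reference text.
titles = [
    "Clustering of metabolomics data",
    "Analysis of metabolomics and genomics data",
]
generic_texts = ["The analysis of data and the results of a study"]
word_counts = informative_words(titles, generic_texts)
print(word_counts.most_common())
```

Words that survive this filter are candidates for the research topics summarized in the word cloud; the additional manual filtering mentioned above corresponds to the `manual_stopwords` argument of this sketch.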

Biomedical data sets

Metabolomics data set for Parkinson disease

The dataset consisted of plasma concentration values of d = 23 lipid markers analyzed in samples from n = 100 Parkinson disease patients and n = 100 healthy controls used previously [36]. The selection of lipid markers included lysophosphatidic acids (LPA16:0, LPA18:1, LPA18:2, LPA18:3, LPA20:4), ceramides (Cer16:0, Cer18:0, Cer20:0, Cer24:0, Cer24:1, GluCerC16:0, GluCerC24:1, LacCerC16:0, LacCerC24:0, LacCerC24:1, Cer = ceramide, GluCer = glucosylceramide, LacCer = lactosylceramide), and sphingolipids (sphinganine, sphingosine, S1P, SA1P, C16Sphinganine, C18Sphinganine, C24Sphinganine, C24:1Sphinganine). Previous analyses revealed significant regulation of lipidomics markers and sphingosines in patients with Parkinson disease [36, 37]. Most of the lipid markers differed at high statistical significance levels between patients and controls, supporting the easy separability of the previous classes, and the markers were highly correlated with each other (Supplementary Fig. 2).

Cancer genomics data set for leukemia

A data set designed to demonstrate the feasibility of cancer classification based solely on monitoring gene expression was available in the R package "golubEsets" (https://bioconductor.org/packages/golubEsets [38]). The data set [39] has an original size of 72 x 7,130 and consists of expression data of 7,129 genes analyzed with Affymetrix Hgu6800 chips from bone marrow samples of two classes of patients, i.e., n = 47 patients with acute lymphoblastic leukemia (ALL) and n = 25 patients with acute myeloid leukemia (AML; class information). For the present experiments, the first d = 150 gene expression data were used, sorted in decreasing order of variance as suggested in http://rstudio-pubs-static.s3.amazonaws.com/3773_0afaead59a02436889abc68753e6c20a.html.
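
The variance-based preselection of genes can be sketched as follows. This Python example is a hedged illustration of the selection rule described above (keep the d variables with the largest variance) and is not the code used in this report; the function name, the gene identifiers, and the expression values are hypothetical.

```python
from statistics import variance

def top_variance_columns(matrix, names, d):
    """Keep the d columns with the largest sample variance."""
    columns = list(zip(*matrix))
    order = sorted(
        range(len(names)),
        key=lambda j: variance(columns[j]),
        reverse=True,  # decreasing order of variance
    )
    keep = order[:d]
    reduced = [[row[j] for j in keep] for row in matrix]
    return [names[j] for j in keep], reduced

# Hypothetical expression matrix: rows are samples, columns are genes.
expression = [
    [5.0, 1.0, 10.0],
    [5.1, 3.0, 30.0],
    [4.9, 2.0, 20.0],
]
genes = ["g1", "g2", "g3"]
selected, reduced = top_variance_columns(expression, genes, d=2)
print(selected)
```

Such a filter is unsupervised because it does not use the class labels, so it does not by itself favor agreement between clusters and prior classes.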

Cancer genomics data set for breast cancer

Gene expression patterns of n = 65 surgical samples of human breast tumors were available in the supplementary materials of [16] and are based on the data set published previously [40]. The data are publicly available at https://www.omicsdi.org/dataset and were downloaded for the present analysis from the supplementary materials of a publication on data transformation [16] (https://wis.kuleuven.be/statdatascience/robust/Programs/pooledVariableScaling/pvs-r.zip). The data originate from a publication of patterns in d = 496 intrinsic genes that showed significantly greater variation between different tumors than variation between paired samples of the same tumor, resulting in four distinct tumor types by applying hierarchical clustering, including (i) ER+/luminal-like, (ii) basal-like, (iii) hereditary B2+, and (iv) normal breast [40]. Before projecting the data set, missing values were imputed using the random forest algorithm [41, 42]. This was done in the Python programming language [43], version 3.8.13 for Linux, using the Python package "miceforest" (https://pypi.org/project/miceforest/), since the analogous R implementation in the library “mice” [44] aborted the task with an error.

Cell surface marker leukemia data set

Biomedical data from flow cytometry using fluorescence-activated cell sorting (FACS) were available from a hematologic data set. For the present experiments, d = 4 variables were used, including the forward scatter (FS) and three cytological markers (CD), which for nondisclosure reasons are called a, b and d. The data were downsampled from originally n = 111,686 cells obtained from 100 patients with chronic lymphocytic leukemia (CLL) and 100 healthy control subjects to n = 3,000 instances. This data set is available in the R library "EDOtrans" as “FACSdata” and consists of a subsample of a larger data set published at https://data.mendeley.com/datasets/jk4dt6wprv/1 (accessed October 12, 2022) [45].

Chemometric wine properties data set

A dataset containing physicochemical properties of a collection of 4,898 samples of white wine and 1,599 samples of red wine was taken from https://www.kaggle.com/datasets/ruthgn/wine-quality-data-set-red-white-wine. It contains d = 11 variables on chemical properties, i.e., fixed acidity, volatile acidity, citric acidity, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, and alcohol. To speed up computation, the data set was class proportionally downsampled to n = 1,000 wine samples using our R package “opdisDownsampling” (https://cran.r-project.org/package=opdisDownsampling), which selects from 10,000–100,000 random samples the one in which the distributions of the variables are most similar to those of the original data [46].
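
The principle of class-proportional downsampling with selection of the most representative sample can be sketched as follows. This Python example is a strongly simplified stand-in and not the algorithm of the "opdisDownsampling" package: it compares only the mean of a single variable instead of the full distributions, it uses far fewer trials, and the variable values are synthetic. Only the class sizes are taken from the description above; all function names are hypothetical.

```python
import random
from collections import defaultdict
from statistics import fmean

def stratified_sample(labels, n, rng):
    """Indices of a sample of about n items that keeps the class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for members in by_class.values():
        quota = round(n * len(members) / len(labels))
        chosen.extend(rng.sample(members, quota))
    return sorted(chosen)

def best_of_trials(values, labels, n, trials=20, seed=42):
    """Draw several stratified samples and keep the most representative one."""
    rng = random.Random(seed)
    target = fmean(values)
    best, best_gap = None, float("inf")
    for _ in range(trials):
        candidate = stratified_sample(labels, n, rng)
        # Simplified similarity criterion: difference in means.
        gap = abs(fmean([values[i] for i in candidate]) - target)
        if gap < best_gap:
            best, best_gap = candidate, gap
    return best

# Class sizes as in the wine data set; the variable itself is synthetic.
labels = ["white"] * 4898 + ["red"] * 1599
values = [(i * 37) % 101 / 10 for i in range(len(labels))]
sample = best_of_trials(values, labels, n=1000)
print(len(sample), sum(labels[i] == "red" for i in sample))
```

With the class sizes given above, a sample of 1,000 contains 754 white and 246 red wines, which preserves the original class ratio up to rounding.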

Comparative clustering performance in artificial data posing known projection and clustering problems

Of the selected data projection and clustering methods, no combination provided perfect identification of the known clusters in all nine data sets analyzed from the FCPS collection which are specially defined to challenge clustering methods (Fig. 3). However, in the ranking of the methods per dataset for the four quality indices (clustering accuracy, Rand, Dunn’s and silhouette index) applied to the clusters obtained with the different methods on the projected data, t-SNE followed by any clustering algorithm was superior to the alternative projection methods (Fig. 3). A major exception was the "Target" data set where t-SNE provided the poorest basis for clustering consistent with the prior classification (Supplemental Figs. 4 and 5). Remarkably, the best median ranking obtained with any clustering algorithm when preceded by PCA based data projection was below all median overall rankings achieved when t-SNE was used for data projection instead (Fig. 3B horizontal red dashed line). Moreover, in the ranking of the combined projection and clustering methods across all data sets and indices (Fig. 3), the first occurrence of PCA was at position nine.

The clustering methods were closer when the preceding projections were neglected; however, hierarchical clustering using Ward's linkage and k-means led the rank list with respect to the consistency of the cluster assignments with the prior classes (Fig. 3). Median scores of the four cluster quality indices are shown in Table 2 for the two data sets from the introductory example of this report, i.e., “Lsun” and “Target”. A full list of numerical results of the median values of the cluster quality and stability indices is available as Supplemental Table 1.

Table 2

Median values of numerical measures of cluster quality and stability: cluster accuracy, Rand index, Dunn’s index, and average silhouette width, for clustering solutions obtained by applying different clustering algorithms (partitioning: k-means, k-medoids, hierarchical using single, Ward's, average, or complete linkage) to data projected onto the \({\mathbb{R}}^{2}\) plane by different projection methods. Projection methods included independent component analysis (ICA), isomap, multidimensional scaling (MDS), principal component analysis (PCA), and t-distributed stochastic neighborhood embedding (t-SNE). Values are shown for the two artificial datasets from the introductory example (Fig. 1), "Lsun" and "Target" from the FCPS dataset collection [13]. A complete list of median values obtained for clusterings of the included artificial datasets from the FCPS collection can be found in Supplementary Table 1.
Cluster quality measure	Clustering method	Projection methods
		“Lsun” FCPS data set					“Target” FCPS data set
		ICA	isomap	MDS	PCA	tSNE	ICA	isomap	MDS	PCA	tSNE
Cluster accuracy	Average	0.73	0.46	0.73	0.73	1	0.68	0.67	0.68	0.68	0.47
Cluster accuracy	Complete	0.7	0.46	0.72	0.72	1	0.68	0.64	0.69	0.69	0.45
Cluster accuracy	Kmeans	0.74	0.5	0.74	0.74	1	0.64	0.65	0.65	0.65	0.41
Cluster accuracy	Kmedoids	0.77	0.5	0.77	0.77	1	0.65	0.65	0.65	0.65	0.41
Cluster accuracy	Single	1	0.52	1	1	1	1	1	1	1	0.77
Cluster accuracy	Ward	0.75	0.47	0.76	0.76	1	0.66	0.66	0.66	0.66	0.45
Dunn’s index	Average	0.07	0.05	0.08	0.08	0.42	0.05	0.05	0.05	0.05	0.04
Dunn’s index	Complete	0.06	0.05	0.06	0.06	0.41	0.04	0.04	0.04	0.04	0.04
Dunn’s index	Kmeans	0.05	0.03	0.05	0.05	0.42	0.03	0.03	0.03	0.03	0.02
Dunn’s index	Kmedoids	0.04	0.02	0.04	0.04	0.41	0.02	0.02	0.02	0.02	0.02
Dunn’s index	Single	0.12	0.06	0.11	0.11	0.42	0.25	0.34	0.25	0.25	0.1
Dunn’s index	Ward	0.07	0.05	0.08	0.08	0.42	0.04	0.05	0.04	0.04	0.04
Rand index	Average	0.38	0.04	0.38	0.38	1	0.16	0.43	0.16	0.16	0.4
Rand index	Complete	0.35	0.05	0.38	0.38	1	0.35	0.5	0.38	0.38	0.39
Rand index	Kmeans	0.41	0.09	0.41	0.41	1	0.64	0.65	0.64	0.64	0.35
Rand index	Kmedoids	0.44	0.11	0.44	0.44	1	0.64	0.65	0.64	0.64	0.34
Rand index	Single	1	0.02	1	1	1	1	1	1	1	0.77
Rand index	Ward	0.42	0.06	0.43	0.43	1	0.65	0.65	0.65	0.65	0.38
Silhouette width	Average	0.51	0.46	0.51	0.51	0.64	0.4	0.51	0.4	0.4	0.46
Silhouette width	Complete	0.5	0.45	0.5	0.5	0.64	0.44	0.55	0.44	0.44	0.45
Silhouette width	Kmeans	0.53	0.48	0.53	0.53	0.64	0.59	0.65	0.59	0.59	0.48
Silhouette width	Kmedoids	0.53	0.48	0.53	0.53	0.64	0.59	0.65	0.59	0.59	0.47
Silhouette width	Single	0.41	-0.14	0.41	0.41	0.64	0.3	0.29	0.3	0.3	0.2
Silhouette width	Ward	0.51	0.46	0.51	0.51	0.64	0.58	0.64	0.58	0.58	0.47
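
The rank-based comparison of methods can be illustrated with a small excerpt of Table 2. The following Python sketch ranks the five projection methods by the median cluster accuracy obtained with k-means for "Lsun" and "Target" and then takes the median rank per method. The handling of ties (average rank) and the function name are assumptions, and because only two rows of one index are used, the resulting ranks do not reproduce the overall ranking in Fig. 3.

```python
from statistics import median

def rank_descending(scores):
    """Rank 1 = highest score; tied scores share the average of their ranks."""
    ordered = sorted(scores.items(), key=lambda item: -item[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j][1] == ordered[i][1]:
            j += 1
        shared = (i + 1 + j) / 2  # mean of the ranks i+1 ... j
        for name, _ in ordered[i:j]:
            ranks[name] = shared
        i = j
    return ranks

# Cluster accuracy with k-means, copied from Table 2.
lsun = {"ICA": 0.74, "isomap": 0.5, "MDS": 0.74, "PCA": 0.74, "tSNE": 1.0}
target = {"ICA": 0.64, "isomap": 0.65, "MDS": 0.65, "PCA": 0.65, "tSNE": 0.41}

ranks_lsun = rank_descending(lsun)
ranks_target = rank_descending(target)
median_rank = {
    name: median([ranks_lsun[name], ranks_target[name]]) for name in lsun
}
print(median_rank)
```

The excerpt also illustrates the exception discussed in the text: t-SNE ranks first for "Lsun" but last for "Target", so a ranking based on a single data set would be misleading.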

Comparative clustering performance in biomedical data sets

There were large differences in the projection of the data sets among the five different methods (Fig. 4). No combination of projection and clustering methods provided a perfect agreement between the obtained clusters and the prior classes in all data sets (Fig. 5), and an overall best performing projection method before clustering was not as obvious as for the artificial datasets (Fig. 5). However, isomap and t-SNE ranked before PCA, ICA, and MDS in the overall scores calculated for the cluster memberships against the prior classifications. Remarkably, the best median ranking obtained with any clustering algorithm when preceded by PCA data projection was below most median overall rankings achieved when isomap or t-SNE were used for data projection instead (Fig. 5B horizontal red dashed line). In the ranking of the combined projection and clustering methods across all data sets and indices (Fig. 5), the first occurrence of PCA was again at position nine.

As with the artificial data sets, it was again not immediately obvious which clustering algorithm performed best overall (Fig. 5); however, hierarchical clustering using average linkage or Ward's linkage and k-means led the rank list with respect to the consistency of the cluster assignments with the prior classes (Fig. 5). Median scores of the four cluster quality indices are shown in Table 3 for a data set with good separation of the prior classes across the clusters, i.e., the metabolomics data set for Parkinson disease, and for a data set with only modest agreement between cluster assignments and prior class assignment, i.e., the cancer genomics data set for leukemia. A full list of numerical results is available as Supplemental Table 2. Results are shown only for an example dataset in more detail below. The results obtained for the other real-life datasets are shown in the Supplemental Figures.

Table 3

Median values of formal measures of cluster quality and stability, i.e., cluster accuracy, Rand index, Dunn’s index, and average silhouette width, for clustering solutions obtained by applying different clustering algorithms (partitioning: k-means, k-medoids, hierarchical using single, Ward's, average, or complete linkage) to data projected onto the \({\mathbb{R}}^{2}\) plane by different methods: independent component analysis (ICA), isomap, multidimensional scaling (MDS), principal component analysis (PCA), and t-distributed stochastic neighborhood embedding (t-SNE). Values are shown for the two real-life datasets with either good separation of the prior classes across the clusters, i.e., the metabolomics data set for Parkinson disease [36], or only modest agreement between cluster assignments and prior class assignment, i.e., the cancer genomics data set for leukemia [38]. A full list of median values obtained for clusterings of the included real-life scientific datasets is provided in Supplementary Table 2.
Cluster quality measure	Clustering method	Projection methods
		Metabolomics data set for Parkinson disease					Cancer genomics data set for leukemia
		ICA	isomap	MDS	PCA	tSNE	ICA	isomap	MDS	PCA	tSNE
Cluster accuracy	Average	0.54	1	0.53	0.54	0.98	0.74	0.8	0.65	0.74	0.76
Cluster accuracy	Complete	0.58	1	0.6	0.58	0.98	0.69	0.82	0.72	0.67	0.74
Cluster accuracy	Kmeans	0.98	1	0.98	0.98	1	0.87	0.97	0.75	0.63	0.79
Cluster accuracy	Kmedoids	0.98	1	0.98	0.97	1	0.78	0.93	0.74	0.72	0.76
Cluster accuracy	Single	0.53	1	0.53	0.53	1	0.69	0.68	0.65	0.68	0.61
Cluster accuracy	Ward	0.98	1	0.98	0.98	0.97	0.73	0.9	0.73	0.68	0.78
Dunn’s index	Average	0.27	0.25	0.22	0.32	0.14	0.2	0.18	0.41	0.19	0.15
Dunn’s index	Complete	0.05	0.22	0.05	0.05	0.13	0.12	0.15	0.11	0.12	0.13
Dunn’s index	Kmeans	0.06	0.24	0.06	0.07	0.22	0.1	0.15	0.08	0.11	0.12
Dunn’s index	Kmedoids	0.06	0.22	0.06	0.07	0.22	0.09	0.13	0.07	0.09	0.11
Dunn’s index	Single	0.29	0.26	0.34	0.32	0.22	0.25	0.27	0.41	0.25	0.21
Dunn’s index	Ward	0.06	0.24	0.06	0.06	0.14	0.13	0.14	0.11	0.12	0.14
Rand index	Average	0	1	0	0	0.91	0.16	0.28	0.02	0.14	0.24
Rand index	Complete	0.02	0.98	0.04	0.02	0.9	0.13	0.38	0.17	0.07	0.23
Rand index	Kmeans	0.92	1	0.92	0.9	1	0.53	0.86	0.22	0.05	0.31
Rand index	Kmedoids	0.9	0.98	0.9	0.88	1	0.3	0.74	0.18	0.18	0.27
Rand index	Single	0	1	0	0	1	0.04	0.04	0.02	0.04	0.04
Rand index	Ward	0.9	1	0.94	0.92	0.88	0.19	0.63	0.16	0.11	0.31
Silhouette width	Average	0.51	0.69	0.49	0.49	0.66	0.44	0.42	0.55	0.43	0.38
Silhouette width	Complete	0.31	0.69	0.3	0.29	0.66	0.38	0.4	0.42	0.39	0.34
Silhouette width	Kmeans	0.44	0.69	0.44	0.48	0.67	0.42	0.47	0.41	0.42	0.4
Silhouette width	Kmedoids	0.44	0.69	0.44	0.48	0.67	0.41	0.46	0.38	0.41	0.4
Silhouette width	Single	0.53	0.69	0.52	0.5	0.67	0.44	0.36	0.55	0.43	0.24
Silhouette width	Ward	0.44	0.69	0.44	0.48	0.66	0.4	0.44	0.41	0.41	0.37

Sample data set for clustering results on projected data

Metabolomics data set for Parkinson disease

As an example of the projection and clustering results, the metabolomics dataset for Parkinson's disease is shown, where data points from patients or controls were well separated in the projections in the \({\mathbb{R}}^{2}\) plane. In the lipidomics data set for Parkinson's disease, projections supported the separation of patients and controls based on lipid marker pattern (Fig. 6). This agrees with published evidence of lipid regulation in Parkinson's disease for all three marker classes, for example for sphingosine kinase 1 and sphingosine-1-phosphate [47], ceramides and glucosylceramide [48–51], phospholipids [52], and lysophosphatidic acids [53–56]. The isomap projection provided the best basis for a cluster structure that matched the prior class structure, followed by t-SNE, while the other projection methods performed clearly worse (Fig. 7). While all clustering algorithms separated patients from controls in the isomap projection, this was otherwise observed only in the t-SNE projection, with the exception of hierarchical clustering using single linkage (Fig. 6D4). In this dataset, the best performing methods yielded comparatively high cluster quality and stability indices in repeated calculations (Table 3).

Discussion

The present evaluation was undertaken to review the current standard of cluster analysis, which is often performed as clustering on principal components [12]. The focus was on whether projection methods other than PCA could provide a better basis for clustering. For both artificial and real biomedical data, PCA did not prove to be the best projection method in the analytical context of clustering on projected data. Furthermore, the present experiments provided no evidence that replacing PCA with another method would sacrifice a sound basis for clustering in exchange for occasional gains in edge cases. These results were consistent across the 14 datasets analyzed, including real-world biomedical research datasets.

The t-SNE projection method often outperformed PCA as a projection method prior to clustering. Notably, the isomap method ranked highest overall in the biomedical datasets, although by a smaller margin than that by which t-SNE outperformed the other methods in the artificial datasets. t-SNE ranked second in the real-world datasets, whereas isomap came last in the artificial data benchmark tests. None of the methods was perfect. For example, t-SNE performed worse than the alternatives prior to clustering the "target" dataset, while in this dataset isomap combined with hierarchical single-linkage clustering performed as well as PCA as a projection method prior to clustering (Supplementary Figs. 4 and 5). Other limitations of t-SNE have been highlighted elsewhere [57]. While the cited paper focused on the pitfalls of using t-SNE as a currently popular method without testing alternatives, the present analysis re-emphasizes the need to compare different methods without pre-selecting one.

Given the widespread use of PCA as the standard data projection method applied prior to clustering [12], this established method was not expected to have major overall weaknesses. Therefore, the aim of this work was to re-evaluate whether the standard method is supported by superior performance. If not, the question was whether alternative methods perform better only occasionally while also producing poor results more often, which would make it safer to stick with the standard method and leave alternatives to peripheral problems of data analysis where they are known to perform better. In the present experiments, PCA never ranked first. Even on the "target" dataset, where t-SNE failed, isomap was a similarly suitable projection method to PCA (Supplementary Figs. 4 and 5). Furthermore, alternative projection methods were not found to be more likely than PCA to produce poor results.

While a ranking emerged for the projection methods with respect to the consistency of the subsequent cluster membership with the prior classes, the ranking of the clustering algorithms in this context was less clear. This is because each clustering algorithm implies an inherent form of a cluster [8, 57]. Many clustering methods could not be considered here due to practical limitations. For example, Density-Based Spatial Clustering of Applications with Noise (DBSCAN) [58] was found to require manual fine-tuning of its ε hyperparameter; without it, the algorithm could not satisfactorily cluster even the present artificial datasets. It was therefore considered unsuitable for the present analyses of standard workflows. Furthermore, the aforementioned ESOM method, which does not imply clustering, was not included because it has been addressed previously and because the use of these neural networks requires experience and expertise. The present evaluations focused instead on the common biomedical research setting in which methods are applied by field researchers using statistical software rather than by computer experts.

Importance of visualization of projection and clustering results

The present experiments and method comparisons suggest a workflow that could replace clustering based on principal components. However, it is difficult to find a "universal" combination of projection and clustering methods in terms of separating prior classes on the projection planes and then in the clusters. Therefore, visualization is highly recommended to evaluate the results of the methods. In the present example datasets, the prior class structure was known and the variables were known to reflect it, i.e. typical test datasets for projection and clustering problems were chosen. Otherwise, projections and clusters could only have been evaluated by technical quality measures, e.g. by Shepard plots [4, 5] for distance preserving projections or by calculating silhouette widths or other quality measures for clusters. Indeed, this is often the main way of assessing whether a clustering attempt has been successful when the analyses are completely unsupervised. However, it is also often the case that prior classes are present in biomedical datasets, allowing for a semi-supervised approach. If the variables were collected with the aim of supporting the prior classes, which is often the case, and there is no reasonable doubt about this support, this can be used to evaluate a data projection and clustering solution.
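
The silhouette width mentioned above as a technical quality measure can be illustrated with a short sketch. For a point i, a is its mean distance to the other members of its own cluster, b is the smallest mean distance to the members of any other cluster, and the silhouette width is (b − a) / max(a, b). The Python code below is not the implementation used in this report, whose cited software is R based. The function name `silhouette_widths` and the toy coordinates are assumptions chosen for illustration only, and at least two clusters are assumed.

```python
import math
from statistics import mean


def silhouette_widths(points, labels):
    """Per-point silhouette width (b - a) / max(a, b) using Euclidean distances.

    Assumes at least two distinct cluster labels.
    """
    widths = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from point i to every other point, grouped by cluster label.
        by_cluster = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                by_cluster.setdefault(lab, []).append(math.dist(p, q))
        if own not in by_cluster:
            widths.append(0.0)  # singleton cluster: width set to 0 by convention
            continue
        a = mean(by_cluster[own])  # cohesion within the own cluster
        b = min(mean(d) for lab, d in by_cluster.items() if lab != own)  # nearest other cluster
        denom = max(a, b)
        widths.append((b - a) / denom if denom > 0 else 0.0)
    return widths


# Hypothetical projected points: two compact groups far apart on the plane.
points = [(0, 0), (0, 1), (5, 0), (5, 1)]
good = silhouette_widths(points, [0, 0, 1, 1])  # clusters follow the groups
bad = silhouette_widths(points, [0, 1, 0, 1])   # clusters cut across the groups
print(mean(good), mean(bad))
```

In this toy case the average width is clearly positive when the clusters follow the two groups and negative when they cut across them. Such a measure judges only the geometry of the clusters and says nothing about their agreement with prior classes, which is why the report adds a visual check.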

The use of Voronoi tessellation for visualization in this report provided an intuitive means of evaluating the results. Voronoi cells are defined as follows [26]. Let \(P=\{{p}_{i}, i=1,\dots ,n\}\) be a set of n distinct points in a metric space \(D \subseteq {\mathbb{R}}^{d}\) with a distance function d(x,y) defined for all x,y in D. The Voronoi cells of P are a tessellation of D into n cells, one for each point in P. A point x lies in the cell corresponding to a (center) point \({p}_{i} \in P\) if \(d\left(x,{p}_{j}\right)> d\left(x,{p}_{i}\right)\) for each \({p}_{j} \in P, j\ne i\). In the figures presented here, the center points are the projections of the data points. Two further examples are shown in Fig. 8 to emphasize the utility of this visualization for evaluating projection and clustering results.
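
The definition above amounts to a nearest-center rule: a location on the projection plane belongs to the cell of the projected data point closest to it. The following Python sketch illustrates this rule and how a cell can inherit the cluster label of its center, as in the tessellated figures. It is a minimal illustration under stated assumptions, not the plotting code used for the figures. The names `voronoi_cell` and `cluster_at`, the Euclidean distance, and the toy coordinates are all hypothetical.

```python
import math


def voronoi_cell(x, centers):
    """Return the index i of the center p_i whose Voronoi cell contains x.

    Uses the Euclidean distance. Points exactly on a cell border are
    equidistant to several centers; here the lowest index wins.
    """
    distances = [math.dist(x, p) for p in centers]
    return min(range(len(centers)), key=distances.__getitem__)


# Hypothetical projected data points (the cell centers) and their cluster labels.
projected = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (4.0, 4.0), (5.0, 4.0)]
clusters = [0, 0, 0, 1, 1]


def cluster_at(x):
    """Cluster label shown at plane location x when cells are filled by cluster."""
    return clusters[voronoi_cell(x, projected)]


print(voronoi_cell((0.9, 0.1), projected))  # cell of the second center
print(cluster_at((4.4, 3.0)))               # location falls into a cluster-1 cell
```

Filling each cell according to `cluster_at` and then coloring the center points by their prior class gives the kind of picture described in the text, in which a class whose points lie in cells of several clusters becomes visible at a glance.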

This visualization, also called a 'political map', gives a quick overview of how the prior classes are represented on the projection layer and in the clusters. In the present datasets, this often immediately pointed to shortcomings of particular projection and clustering methods, as all datasets were selected with known agreement between the variables and the prior classes, which was the main test criterion. If the agreement between projections and prior classes is not apparent, this triggers interpretation and further analysis. The reasons may be methodological errors in projection or clustering, but the results may also indicate that the variables do not support the prior classification or clustering, which could be a scientific finding.

The two scenarios are illustrated in Fig. 8. One dataset in which the prior classes are completely known is the "Chainlink" dataset from the FCPS collection, which has already been used in the present experiments. The PCA projection (Fig. 8) splits one of the two classes into parts, and the clustering on this projection cuts directly across the classes (Fig. 8) and is therefore to be discarded. In contrast, the t-SNE projection separates the classes correctly, and the resulting cluster solution is perfect (Fig. 8A3 and A4). The other example, shown in column B of Fig. 8, is a data set in which the prior classification is less certain (Fig. 8B1–B4). The data include pain thresholds for various experimental noxious stimuli collected in a quantitative sensory testing study of n = 125 healthy Caucasian volunteers (69 males, 56 females, aged 18 to 46 years, mean 25 ± 4.4 years) [59]. No prior classification was available. Based on the knowledge that females are more sensitive to pain than males [60], one class structure could be sex. However, both the PCA and t-SNE projections show that the prior classes are not concentrated on the projection planes, and the subsequent clustering indeed did not separate the hypothesized classes (χ² test not significant). This does not preclude the possibility that other projection techniques might be more successful in separating the classes, or that the clusters might be meaningful in the research context, e.g., high or low pain sensitivity independent of sex. However, the visualization would prompt either a different projection attempt or a re-evaluation of the sex difference hypothesis and a search for a different interpretation of the clusters, which was not the aim of this report. This demonstrates that unsupervised analysis of biomedical data can reach a point where input from a domain scientist, such as a pain researcher in this example, is needed to continue the analysis as a collaborative effort between computational experts and domain experts.
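
The χ² check of whether clusters separate a hypothesized binary class can be sketched as follows. This is an illustrative Python example, not the analysis code of this report: it computes the Pearson χ² statistic without continuity correction for a 2 × 2 cross-tabulation of class against cluster, and which variant of the test was used in the report is not stated here. The function name `chi_square_2x2` and the label vectors are hypothetical and are not the pain threshold data.

```python
import math
from collections import Counter


def chi_square_2x2(classes, clusters):
    """Pearson chi-square (no continuity correction) for two binary labelings.

    Returns the statistic and the p-value for 1 degree of freedom.
    """
    n = len(classes)
    observed = Counter(zip(classes, clusters))  # cell counts of the 2 x 2 table
    row = Counter(classes)                      # class totals
    col = Counter(clusters)                     # cluster totals
    if len(row) != 2 or len(col) != 2:
        raise ValueError("expects exactly two classes and two clusters")
    chi2 = 0.0
    for r in row:
        for c in col:
            expected = row[r] * col[c] / n
            chi2 += (observed[(r, c)] - expected) ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical labels: six subjects per class in both scenarios.
sex = ["male"] * 6 + ["female"] * 6
mixed_clusters = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]     # clusters ignore the class
separated_clusters = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]  # clusters match the class

print(chi_square_2x2(sex, mixed_clusters))
print(chi_square_2x2(sex, separated_clusters))
```

A non-significant result, as in the first toy scenario, corresponds to the situation described for the pain data, where the clusters do not reflect the hypothesized class and further interpretation requires domain knowledge.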

Conclusions

PCA has been challenged here as the standard projection method prior to clustering in the analysis of multivariate biomedical data, and the results point to possible replacement methods that better capture the data structure without introducing other limitations. Decorrelation of variables in PCA is often a useful preprocessing step. However, in the present example metabolomics dataset, even a strong correlation of the variables (Supplementary Fig. 2) did not make PCA the best projection method. If cluster analysis is the goal, it should be determined from the data whether the variance or the neighborhood contains the most information. Appropriate methods need to be chosen without simply assuming that the variance in a data set is the most informative about its class or subgroup structure, which would pre-determine PCA as the most appropriate projection method. In fact, the present results point to projection methods that focus on the nonlinear relationship between variables instead of the variance, such as isomap, or on neighborhood structures, such as t-SNE. In most cases, they outperformed PCA, which never ranked first overall among the methods tested and gave the best results in only a few individual datasets, where it always shared the top spot with another method.

The experiments also showed that none of the clustering algorithms significantly and consistently outperformed the other candidates. Neither a clear favorite nor a clear "loser" could be identified. Most useful, however, is a visualization method on the projected data that combines both prior classification and clustering to identify a workflow that is best suited to the particularities of a given dataset. The visualization uses a combination of Voronoi tessellation of the projection plane according to the clustering with a color coding of the projected data points according to the prior classes. This can be used to find the best combination of data projection and clustering in a given data set.

Ethics approval and consent to participate

All data are freely available to the public, with the exception of the Parkinson's disease metabolomics dataset, which was taken from a previous publication by the authors [36]. That study followed the Declaration of Helsinki and was approved by the Ethics Committee of the Medical Faculty of the Goethe University Frankfurt am Main, Germany (reference number 197/13).

Consent for publication

All data are freely available to the public, with the exception of the Parkinson's disease metabolomics dataset, which was taken from a previous publication by the authors [36]. For that study, written informed consent to study participation and to publication of the results in anonymized form was obtained from all subjects.

Availability of data and material

The data used in the experiments in this report are publicly available and referenced accordingly.

Competing interests

The authors have declared that no competing interests exist.

Funding

JL was supported by the Deutsche Forschungsgemeinschaft (DFG LO 612/16-1).

Author contributions

JL – Conceptualization of the project, programming, writing of the manuscript, data analyses and creation of the figures.

AU – Critical revision of the manuscript for important intellectual content, writing of the manuscript, contributing to the selection of sample data sets.

Acknowledgement

Not applicable

References

1. Hotelling H: Analysis of a complex of statistical variables into principal components. Journal of Educational Psychology 1933, 24(7):498-520.
2. Pearson K: LIII. On lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 1901, 2(11):559-572.
3. Merton RK: The Matthew Effect in Science. Science 1968, 159(3810):56-63.
4. Shepard RN: The analysis of proximities: Multidimensional scaling with an unknown distance function. II. Psychometrika 1962, 27(3):219-246.
5. Shepard RN: The analysis of proximities: multidimensional scaling with an unknown distance function. I. Psychometrika 1962, 27(2):125-140.
6. Van der Maaten L, Hinton G: Visualizing Data using t-SNE. J Machine Learn Res 2008, 9:2579-2605.
7. Tenenbaum JB, de Silva V, Langford JC: A global geometric framework for nonlinear dimensionality reduction. Science 2000, 290(5500):2319-2323.
8. Ultsch A, Lötsch J: Machine-learned cluster identification in high-dimensional data. J Biomed Inform 2017, 66:95-104.
9. MacQueen J: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Statistics. Berkeley, Calif.: University of California Press; 1967: 281-297.
10. Steinhaus H: Sur la division des corps matériels en parties. Bull Acad Polon Sci 1956, 1(804):801.
11. Ward Jr JH: Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 1963, 58(301):236-244.
12. Kassambara A: Practical Guide To Principal Component Methods in R: PCA, M(CA), FAMD, MFA, HCPC, factoextra: CreateSpace Independent Publishing Platform; 2017.
13. Ultsch A, Lötsch J: The Fundamental Clustering and Projection Suite (FCPS): A Dataset Collection to Test the Performance of Clustering and Data Projection Algorithms. Data 2020, 5(1):13.
14. Hyvärinen A, Oja E: Independent component analysis: algorithms and applications. Neural Networks 2000, 13(4):411-430.
15. Kaufman L, Rousseeuw PJ: Partitioning Around Medoids (Program PAM). Finding Groups in Data 1990:68-125.
16. Raymaekers J, Zamar RH: Pooled variable scaling for cluster analysis. Bioinformatics 2020, 36(12):3849-3855.
17. Ihaka R, Gentleman R: R: A Language for Data Analysis and Graphics. Journal of Computational and Graphical Statistics 1996, 5(3):299-314.
18. R Development Core Team: R: A Language and Environment for Statistical Computing. 2008.
19. Thrun M, Stier Q: Fundamental clustering algorithms suite. SoftwareX 2021, 13:100642.
20. Lê S, Josse J, Husson F: FactoMineR: A Package for Multivariate Analysis. Journal of Statistical Software 2008, 25(1):1-18.
21. Marchini JL, Heaton C, Ripley BD: fastICA: FastICA Algorithms to Perform ICA and Projection Pursuit. 2021.
22. Krijthe JH: Rtsne: T-Distributed Stochastic Neighbor Embedding using Barnes-Hut Implementation. 2015.
23. Venables WN, Ripley BD: Modern Applied Statistics with S. New York: Springer; 2002.
24. Bartenhagen C: RDRToolbox: A package for nonlinear dimension reduction with Isomap and LLE. 2022.
25. Maechler M, Rousseeuw P, Struyf A, Hubert M, Hornik K: cluster: Cluster Analysis Basics and Extensions. 2017.
26. Voronoi G: Nouvelles applications des paramètres continus à la théorie des formes quadratiques. Premier mémoire. Sur quelques propriétés des formes quadratiques positives parfaites. Journal für die reine und angewandte Mathematik (Crelles Journal) 1908:97-102.
27. Rand WM: Objective Criteria for the Evaluation of Clustering Methods. Journal of the American Statistical Association 1971, 66(336):846-850.
28. Dunn JC: Well-Separated Clusters and Optimal Fuzzy Partitions. Journal of Cybernetics 1974, 4(1):95-104.
29. Vavrek MJ: fossil: palaeoecological and palaeogeographical analysis tools. Palaeontologia Electronica 2011, 14(1):1T.
30. Pihur V, Datta S, Datta S: clValid: An R Package for Cluster Validation. Journal of Statistical Software 2008, 25(4):22.
31. Ultsch A, Lötsch J: Euclidean distance-optimized data transformation for cluster analysis in biomedical data (EDOtrans). BMC Bioinformatics 2022, 23(1):233.
32. Efron B, Tibshirani RJ: An introduction to the bootstrap. San Francisco: Chapman and Hall; 1995.
33. Fantini D: easyPubMed: Search and Retrieve Scientific Publication Records from PubMed. 2019.
34. Fan FY: PubMedWordcloud: 'Pubmed' Word Clouds. R package version 0.3.6, https://CRAN.R-project.org/package=PubMedWordcloud. 2019.
35. DeJesus JM, Callanan MA, Solis G, Gelman SA: Generic language in scientific communication. Proceedings of the National Academy of Sciences 2019, 116(37):18370-18377.
36. Lötsch J, Lerch F, Djaldetti R, Tegeder I, Ultsch A: Identification of disease-distinct complex biomarker patterns by means of unsupervised machine-learning using an interactive R toolbox (Umatrix). BMC Big Data Analytics 2018, 3(5): https://doi.org/10.1186/s41044-018-0032-1.
37. Klatt-Schreiner K, Valek L, Kang JS, Khlebtovsky A, Trautmann S, Hahnefeld L, Schreiber Y, Gurke R, Thomas D, Wilken-Schmitz A et al: High Glucosylceramides and Low Anandamide Contribute to Sensory Loss and Pain in Parkinson's Disease. Mov Disord 2020, 35(10):1822-1833.
38. Golub T: golubEsets: exprSets for golub leukemia data. 2022.
39. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA et al: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286(5439):531-537.
40. Perou CM, Sørlie T, Eisen MB, van de Rijn M, Jeffrey SS, Rees CA, Pollack JR, Ross DT, Johnsen H, Akslen LA et al: Molecular portraits of human breast tumours. Nature 2000, 406(6797):747-752.
41. Ho TK: Random decision forests. In: Proceedings of the Third International Conference on Document Analysis and Recognition, Volume 1: IEEE Computer Society; 1995: 278.
42. Breiman L: Random Forests. Mach Learn 2001, 45(1):5-32.
43. Van Rossum G, Drake Jr FL: Python tutorial, vol. 620: Centrum voor Wiskunde en Informatica Amsterdam; 1995.
44. van Buuren S, Groothuis-Oudshoorn K: mice: Multivariate Imputation by Chained Equations in R. Journal of Statistical Software 2011, 45(3):67.
45. Thrun M, Hoffmann J, Röhnert M, von Bonin M, Oelschlägel U, Brendel C, Ultsch A: Flow Cytometry datasets consisting of peripheral blood and bone marrow samples for the evaluation of explainable artificial intelligence methods. In: Mendeley Data. 2022.
46. Lötsch J, Malkusch S, Ultsch A: Optimal distribution-preserving downsampling of large biomedical data sets (opdisDownsampling). PLoS One 2021, 16(8):e0255838.
47. Pyszko J, Strosznajder JB: Sphingosine kinase 1 and sphingosine-1-phosphate in oxidative stress evoked by 1-methyl-4-phenylpyridinium (MPP+) in human dopaminergic neuronal cells. Mol Neurobiol 2014, 50(1):38-48.
48. Xing Y, Tang Y, Zhao L, Wang Q, Qin W, Ji X, Zhang J, Jia J: Associations between plasma ceramides and cognitive and neuropsychiatric manifestations in Parkinson's disease dementia. J Neurol Sci 2016, 370:82-87.
49. France-Lanord V, Brugg B, Michel PP, Agid Y, Ruberg M: Mitochondrial free radical signal in ceramide-dependent apoptosis: a putative mechanism for neuronal death in Parkinson's disease. J Neurochem 1997, 69(4):1612-1621.
50. Boutin M, Sun Y, Shacka JJ, Auray-Blais C: Tandem Mass Spectrometry Multiplex Analysis of Glucosylceramide and Galactosylceramide Isoforms in Brain Tissues at Different Stages of Parkinson Disease. Anal Chem 2016, 88(3):1856-1863.
51. Mielke MM, Maetzler W, Haughey NJ, Bandaru VV, Savica R, Deuschle C, Gasser T, Hauser AK, Graber-Sultan S, Schleicher E et al: Plasma ceramide and glucosylceramide metabolism is altered in sporadic Parkinson's disease and associated with cognitive impairment: a pilot study. PLoS One 2013, 8(9):e73094.
52. Li Z, Zhang J, Sun H: Increased plasma levels of phospholipid in Parkinson's disease with mild cognitive impairment. J Clin Neurosci 2015, 22(8):1268-1271.
53. Ikram M, Ullah R, Khan A, Kim MO: Ongoing Research on the Role of Gintonin in the Management of Neurodegenerative Disorders. Cells 2020, 9(6).
54. Shen W, Jiang L, Zhao J, Wang H, Hu M, Chen L, Chen Y: Bioactive lipids and their metabolism: New therapeutic opportunities for Parkinson's disease. Eur J Neurosci 2022, 55(3):846-872.
55. Choi JH, Jang M, Oh S, Nah SY, Cho IH: Multi-Target Protective Effects of Gintonin in 1-Methyl-4-phenyl-1,2,3,6-tetrahydropyridine-Mediated Model of Parkinson's Disease via Lysophosphatidic Acid Receptors. Frontiers in Pharmacology 2018, 9:515.
56. Yang XY, Zhao EY, Zhuang WX, Sun FX, Han HL, Han HR, Lin ZJ, Pan ZF, Qu MH, Zeng XW et al: LPA signaling is required for dopaminergic neuron development and is reduced through low expression of the LPA1 receptor in a 6-OHDA lesion model of Parkinson's disease. Neurol Sci 2015, 36(11):2027-2033.
57. Lötsch J, Ultsch A: Current Projection Methods-Induced Biases at Subgroup Detection for Machine-Learning Based Data-Analysis of Biomedical Data. International Journal of Molecular Sciences 2019, 21(1).
58. Ester M, Kriegel H-P, Sander J, Xu X: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining. Portland, Oregon: AAAI Press; 1996: 226–231.
59. Doehring A, Küsener N, Flühr K, Neddermeyer TJ, Schneider G, Lötsch J: Effect sizes in experimental pain produced by gender, genetic variants and sensitization procedures. PLoS One 2011, 6(3):e17724.
60. Mogil JS: Sex differences in pain and pain inhibition: multiple explanations of a controversial phenomenon. Nat Rev Neurosci 2012, 13(12):859-866.
61. Arnold JB: ggthemes: Extra Themes, Scales and Geoms for 'ggplot2'. 2019.
62. R Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria; 2021.
63. Wickham H: ggplot2: Elegant Graphics for Data Analysis: Springer-Verlag New York; 2009.
64. Pedersen TL: ggforce: Accelerating 'ggplot2'. 2020.
65. Gu Z, Eils R, Schlesner M: Complex heatmaps reveal patterns and correlations in multidimensional genomic data. Bioinformatics 2016, 32(18):2847-2849.


SupplementalFiguresAndTables.pdf
Supplemental information includes 11 supplemental figures and two tables ("Supplemental Tables 1 and 2") with the complete list of numerical results of median values of cluster quality and stability indices obtained in the artificial and real data sets, respectively
