Biomedical data sets
Metabolomics data set for Parkinson disease
The data set comprised plasma concentrations of d = 23 lipid markers measured in samples from n = 100 patients with Parkinson disease and n = 100 healthy controls, as used previously [36]. The lipid markers included lysophosphatidic acids (LPA16:0, LPA18:1, LPA18:2, LPA18:3, LPA20:4), ceramides (Cer16:0, Cer18:0, Cer20:0, Cer24:0, Cer24:1, GluCerC16:0, GluCerC24:1, LacCerC16:0, LacCerC24:0, LacCerC24:1; Cer = ceramide, GluCer = glucosylceramide, LacCer = lactosylceramide), and sphingolipids (sphinganine, sphingosine, S1P, SA1P, C16Sphinganine, C18Sphinganine, C24Sphinganine, C24:1Sphinganine). Previous analyses revealed significant regulation of lipidomics markers and sphingosines in patients with Parkinson disease [36, 37]. Most of the lipid markers differed between patients and controls at high levels of statistical significance, which supports easy separability of the predefined classes, and the markers were highly correlated with each other (Supplementary Fig. 2).
Cancer genomics data set for leukemia
A data set designed to demonstrate the feasibility of cancer classification based solely on gene expression monitoring was available in the R package "golubEsets" (https://bioconductor.org/packages/golubEsets [38]). The data set [39] has an original size of 72 × 7,130 and consists of expression data of 7,129 genes plus class information, analyzed with Affymetrix Hgu6800 chips in bone marrow samples from two classes of patients, i.e., n = 47 patients with acute lymphoblastic leukemia (ALL) and n = 25 patients with acute myeloid leukemia (AML). For the present experiments, the d = 150 genes with the highest variance were used, i.e., the first 150 genes after sorting in decreasing order of variance, as suggested in http://rstudio-pubs-static.s3.amazonaws.com/3773_0afaead59a02436889abc68753e6c20a.html.
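The variance-based selection of genes described above can be illustrated with a minimal sketch. It is not the code used for the present experiments; the function names `top_variance_features` and `select_columns` and the toy matrix are hypothetical, and the sample variance from the Python standard library is assumed as the ranking criterion.

```python
from statistics import variance


def top_variance_features(rows, d):
    """Return the column indices of the d features with the highest
    sample variance, sorted in decreasing order of variance."""
    columns = list(zip(*rows))  # transpose: one tuple per feature
    variances = [variance(col) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: variances[j], reverse=True)
    return order[:d]


def select_columns(rows, indices):
    """Keep only the given columns, in the given order."""
    return [[row[j] for j in indices] for row in rows]


# Toy matrix (hypothetical values): 4 samples x 3 features.
toy = [
    [1.0, 10.0, 5.0],
    [1.1, 20.0, 5.5],
    [0.9, 30.0, 4.5],
    [1.0, 40.0, 5.0],
]

top2 = top_variance_features(toy, 2)   # indices of the two most variable features
reduced = select_columns(toy, top2)    # 4 samples x 2 features
```

For the leukemia data, the same idea would be applied with d = 150 to the 72 × 7,129 expression matrix after removing the class column.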
Cancer genomics data set for breast cancer
Gene expression patterns of n = 65 surgical samples of human breast tumors are based on a data set published previously [40]. The data are publicly available at https://www.omicsdi.org/dataset and were downloaded for the present analysis from the supplementary materials of a publication on data transformation [16] (https://wis.kuleuven.be/statdatascience/robust/Programs/pooledVariableScaling/pvs-r.zip). The data originate from a publication of patterns in d = 496 intrinsic genes that showed significantly greater variation between different tumors than between paired samples of the same tumor; hierarchical clustering of these genes yielded four distinct tumor types, namely (i) ER+/luminal-like, (ii) basal-like, (iii) Erb-B2+, and (iv) normal breast [40]. Before projecting the data set, missing values were imputed using the random forest algorithm [41, 42]. This was done in the Python programming language [43], version 3.8.13 for Linux, using the Python package "miceforest" (https://pypi.org/project/miceforest/), since the analogous R implementation in the library "mice" [44] terminated with an error.
Cell surface marker leukemia data set
Biomedical data from flow cytometry using fluorescence-activated cell sorting (FACS) were available from a hematologic data set. For the present experiments, d = 4 variables were used, comprising the forward scatter (FS) and cytological markers (CD) that, for nondisclosure reasons, are called a, b, and d. The data were downsampled from originally n = 111,686 cells, obtained from 100 patients with chronic lymphocytic leukemia (CLL) and 100 healthy control subjects, to n = 3,000 instances. This data set is available in the R library "EDOtrans" as "FACSdata" and consists of a subsample of a larger data set published at https://data.mendeley.com/datasets/jk4dt6wprv/1 (accessed October 12, 2022) [45].
Chemometric wine properties data set
A data set containing physicochemical properties of 4,898 samples of white wine and 1,599 samples of red wine was taken from https://www.kaggle.com/datasets/ruthgn/wine-quality-data-set-red-white-wine. It contains d = 11 variables on chemical properties, i.e., fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulfates, and alcohol. To speed up computation, the data set was downsampled class-proportionally to n = 1,000 wine samples using our R package "opdisDownsampling" (https://cran.r-project.org/package=opdisDownsampling), which selects from 10,000–100,000 random samples the one in which the distributions of the variables are most similar to those of the original data [46].
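The principle of this downsampling, i.e., drawing many class-proportional random samples and keeping the one whose variable distributions are closest to the original, can be sketched as follows. This is an illustrative Python sketch and not the implementation in "opdisDownsampling". The two-sample Kolmogorov-Smirnov statistic, taken as the worst case across variables, is an assumed similarity criterion, and all function names and the toy data are hypothetical.

```python
import random
from bisect import bisect_right
from collections import defaultdict


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest absolute
    difference between the two empirical distribution functions."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    return max(
        abs(bisect_right(a_sorted, x) / len(a_sorted)
            - bisect_right(b_sorted, x) / len(b_sorted))
        for x in a_sorted + b_sorted
    )


def stratified_sample(labels, size, rng):
    """Draw row indices so that each class keeps roughly its original
    proportion (per-class counts are rounded)."""
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)
    picked = []
    for indices in by_class.values():
        k = round(size * len(indices) / len(labels))
        picked.extend(rng.sample(indices, k))
    return picked


def best_downsample(rows, labels, size, trials, seed=0):
    """Among `trials` stratified random samples, return the one whose
    worst per-variable KS distance to the full data is smallest."""
    rng = random.Random(seed)
    columns = list(zip(*rows))
    best_idx, best_score = None, float("inf")
    for _ in range(trials):
        idx = stratified_sample(labels, size, rng)
        score = max(ks_statistic([col[i] for i in idx], col) for col in columns)
        if score < best_score:
            best_idx, best_score = idx, score
    return best_idx, best_score


# Toy data (hypothetical): 60 "white" and 20 "red" rows with 2 variables.
data_rng = random.Random(1)
labels = ["white"] * 60 + ["red"] * 20
rows = [[data_rng.gauss(0, 1), data_rng.gauss(5, 2)] for _ in labels]

idx, score = best_downsample(rows, labels, size=20, trials=200)
```

The toy example uses 200 trials to keep the run time short; the procedure described above draws 10,000–100,000 random samples.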