Correlations for Untargeted GC×GC-HRTOF-MS Metabolomics of Colorectal Cancer

doi:10.21203/rs.3.rs-1561376/v1

Research Article

This work is licensed under a CC BY 4.0 License

Journal Publication

published 23 Sep, 2023

Read the published version in Metabolomics →

Introduction Modern comprehensive instrumentation provides unprecedented coverage of complex matrices in the form of high-dimensional, information-rich data sets.

Objective In addition to the usual biomarker research that focuses on the detection of the studied condition, we aimed to define a proper strategy for conducting a correlation analysis on an untargeted colorectal cancer case study with a data set of 102 variables corresponding to metabolites obtained from serum samples analyzed with comprehensive two-dimensional gas chromatography coupled to high-resolution time-of-flight mass spectrometry (GC×GC-HRTOF-MS). Indeed, the strength of association between the metabolites contains potentially valuable information about the molecular mechanisms involved and the underlying metabolic network associated with a global perturbation, at no additional analytical effort.

Methods Following Anscombe’s quartet, we paid particular attention to four main aspects. First, the presence of non-linear relationships through the comparison of parametric and non-parametric correlation coefficients: Pearson’s r, Spearman’s rho, Kendall’s tau and Goodman-Kruskal’s gamma. Second, the visual control of the detected associations through scatterplots and their associated regressions and angles. Third, the effect and handling of atypical samples and values. Fourth, the role of the precision of the data in the attribution of the ranks through the presence of ties.

Results Kendall’s tau was found to be the method of choice for the data set at hand. Its application highlighted 17 correlations significantly altered in the active state of colorectal cancer (CRC) in comparison to matched healthy controls (HC), of which 10 were specific to this state in comparison to the remission state (R-CRC) investigated on distinct patients. Of the 25 unique metabolites involved in the correlations of interest, 15 were annotated (Metabolomics Standards Initiative level 2).

Conclusions The metabolites highlighted could be used to better understand the pathology, and the systematic investigation of the methodological aspects allows correlation analysis to be implemented in various fields and many specific cases.

Metabolic Correlations

GC×GC

comprehensive gas chromatography

untargeted metabolomics

QC system

colorectal cancer

Colorectal cancer (CRC) is the third deadliest (9.2%, 0.86 million deaths) and the fourth most diagnosed (6.1% of all cancers, 1.1 million new cases in 2018) cancer worldwide [1]. Because of its increase in countries in transition, and despite the stabilization and decrease in mortality observed in developed countries where the incidence rates are the highest, it is expected to reach 2.2 million new cases and 1.1 million deaths by 2030 [2]. The internationally recognized diagnosis relies on the analysis by a pathologist of a tissue sample obtained through an invasive clinical examination. Metabolic profiling [3, 4], the untargeted analysis of the metabolites in a sample, is widely performed in translational research [5, 6]. Since it aims to measure as many compounds as possible, it is mostly a discovery, hypothesis-generating approach [7]. Modern instrumentation, particularly comprehensive techniques such as comprehensive two-dimensional gas chromatography (GC×GC) [8], generates high-dimensional, complex data sets that are considerably challenging to interpret [9, 10] and therefore require adapted statistical tools to extract as much chemical information as possible and translate it into biologically relevant knowledge [11]. Centred on diagnostic capability, biomarker research provides limited biological knowledge even when the marker metabolites are linked to their metabolic pathways, an approach that has become more frequent over the years [12–14]. On the other hand, the correlations, or strengths of association, between the metabolites, calculated as the statistical dependence between them, are rarely considered.
This is despite the publication more than 15 years ago of pioneering studies [9, 15, 16], the ease of performing such an analysis through available workflows [17], and the potentially valuable information they contain about the specific metabolic changes induced in the underlying metabolic network associated with a biological process [18–20]. Indeed, correlation analysis can be seen as a fingerprint of the enzymatic and regulatory reaction network and a measure of its alterations between the biological groups compared in the biomarker discovery study [21]. It has the potential to improve the understanding of the various molecular mechanisms involved in a phenomenon (for example the occurrence, progression, remission and recurrence of a disease) through the generation of interesting hypotheses [22]. Such analysis is particularly interesting in cases where the concentrations of the metabolites are not strongly altered and when the effect measured is a combination of many low-impact factors that are thus hard to unveil individually [11]. In addition, correlations make it possible to complete the metabolic pathways as they are already known as well as to discover and model associations outside them [20, 23]. Finally, they require no additional analytical effort since they are purely data processing. On the other hand, they demand strong metabolic knowledge for their interpretation [16, 19]. This study aims to investigate the methodological aspects of correlation analysis, based on Anscombe’s quartet [24], i.e. to develop strategies to properly detect and visually control the significant correlations and variations of correlations through a case study in which a metabolomic data set of colorectal cancer was obtained from serum samples of four subgroups of patients analyzed with comprehensive two-dimensional gas chromatography (GC×GC) coupled to time-of-flight mass spectrometry (TOF-MS).

2.1. Parametric Coefficient. Presence of Atypical Samples

Regarding the samples, no outlier (consistently out of the limits defined by the 95% ellipses or clustered at a high dissimilarity (HCA)) was observed with PCA, HCA and PLS plots in the log-transformed data, but 6 samples were dissimilar (regularly out of or close to the limits) from the group they belong to: CRC18, HC23, R-CRC38, R-CRC46, R-HC56 and R-HC58 (see SI S-1A). In the raw data, we detected two outliers: CRC18 and HC37, and 4 dissimilar points: CRC17, R-CRC38, R-HC56 and R-HC65. Overall, the best homogenization of the samples in each subgroup was obtained with the log transformation, particularly when looking at the HCA plots.

Based on the robust z-scores and the relative ranks, one outlier (HC37) and 4 dissimilar points (CRC14, CRC18, HC32 and R-CRC38) were detected in the log-transformed data (SI S-1B). Similarly, 6 outliers were observed in the raw data (CRC18, HC37, R-CRC38, R-CRC39, R-CRC40 and R-HC57), along with 4 dissimilar points (CRC9, CRC17, R-HC55 and R-HC56). Overall, CRC18 and R-CRC38 were consistently found as dissimilar points in the log-transformed data, while CRC18 and HC37 were consensus outliers and CRC17, R-CRC38 and R-HC56 were consensus dissimilar points in the raw data. Again, as expected, the log transformation made the data more Gaussian and less sensitive to the atypical samples. This procedure proved efficient at highlighting the main atypical samples.
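As an illustration of the robust z-score screening mentioned above, the following minimal Python sketch flags atypical values from the median and the median absolute deviation (MAD). The cut-offs of 2.5 (dissimilar) and 3.5 (outlier), the function names and the example signal are assumptions made for illustration; they are not the settings used in this study.

```python
from statistics import median

def robust_z_scores(values):
    """Robust z-scores based on the median and the MAD."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        raise ValueError("MAD is zero; robust z-scores are undefined")
    # The factor 1.4826 makes the MAD consistent with the SD for normal data.
    scale = 1.4826 * mad
    return [(v - med) / scale for v in values]

def flag_atypical(values, dissimilar_cut=2.5, outlier_cut=3.5):
    """Label each value as 'normal', 'dissimilar' or 'outlier'.

    Both cut-offs are illustrative assumptions.
    """
    labels = []
    for z in robust_z_scores(values):
        if abs(z) >= outlier_cut:
            labels.append("outlier")
        elif abs(z) >= dissimilar_cut:
            labels.append("dissimilar")
        else:
            labels.append("normal")
    return labels

# Hypothetical peak volumes for one metabolite in one subgroup.
signal = [10.1, 9.8, 10.4, 10.0, 9.9, 10.2, 18.0]
print(flag_atypical(signal))
```

Because the median and the MAD are barely affected by a single extreme value, the last sample stands out clearly, whereas a classical z-score would be diluted by the inflated standard deviation.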

2.2. Parametric Coefficient. Presence of Atypical Values

Many atypical values were observed in all four groups. Indeed, in the 29 selected correlations, we found 36 variables that had at least one outlier and 45 variables that had at least one outlier or one dissimilar value in the raw data, for a total of 55 outliers and 69 dissimilar values. After the log transformation, 19 variables had at least one outlier and 42 had at least one outlier or one dissimilar value, for a total of 24 outliers and 53 dissimilar values. Globally, for each type of data, 27 correlations had at least one outlier or one dissimilar value while 25 (raw data) and 14 (log-transformed data) had at least one outlier. This is likely because the 29 correlations were selected precisely for the differences in value between their coefficients, which are probably due to the presence of outliers. In addition, we observed that the atypical values in the raw and log-transformed data were often high and low ones, respectively (SI S-2A). This is likely due to the application of the log transformation, which reduces the right skewness of a distribution by moving it globally to the left, yielding (almost) symmetrical data. As for the atypical samples, we also confirmed on our specific data set that the log-transformed data were less prone to atypical values, especially to outliers. From a methodological point of view, looking at the detection methods, univariate (boxplots and Grubbs and Dixon tests) and multivariate (2D scatterplots) tools were in agreement in almost half the cases. However, in agreement with previous observations [25], bivariate visualization was found useful to take into account the alignment of a value with the other points of a specific correlation as well as to properly evaluate the reality of the correlation [24, 26] (SI S-2B). Indeed, the alignment was found responsible for the different status (normal value, dissimilar one or outlier) of a single value according to the correlation involved.
As the Dixon test is limited to the detection of one outlier, it logically underestimated the presence of outliers. Therefore, it is overall recommended to use the Grubbs test and boxplots, along with scatterplots that make it possible to visualize the correlations by considering both variables simultaneously, and to pay particular attention to the non-aligned values. Finally, the percentages of outliers and dissimilar values in the selected correlations designated the CRC sample 18 as an outlier and the HC sample 19 as a dissimilar sample in the raw data, with respectively 38 and 12% of outlier values and 45 and 33% of dissimilar values. In the log-transformed data, they highlighted the CRC sample 18 as a dissimilar sample with 21% and 29% of outlier and dissimilar values, respectively.

2.3. Effect of Atypical Samples and Values on r and Z

Combining the results above, it was found that the samples CRC18 and HC37 were potential outliers in the raw data while CRC17, R-CRC38 and R-HC56 were dissimilar samples. In the log-transformed data, CRC18 and R-CRC38 were dissimilar samples. The effect of those atypical samples was assessed by removing them, which is equivalent to a skipped-correlation coefficient [25, 27], and by comparing the values of r and Z (Z_CRC and Z_R-CRC) with and without them [26]. Other options exist that were tested elsewhere and found less powerful and less efficient [25], particularly with bivariate outliers, such as the winsorizing of the data [28] or the use of alternative coefficients such as Tukey’s biweight [29] or a percentage bend Pearson coefficient [30]. As expected, the dissimilar samples were less influential than the outliers, especially for Z where the effect of atypical samples in two subgroups could combine (Tables 1 and 2). In addition, the dissimilar samples in the log-transformed data (CRC 18) and in the raw data (R-CRC 1) produced similar absolute and relative differences, for both r and Z.
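The comparison with and without an atypical sample can be sketched as follows. The sketch assumes that the differential Z is the difference between the Fisher z-transformed correlations of two subgroups divided by its standard error, sqrt(1/(n1 − 3) + 1/(n2 − 3)); this definition, the helper names and the toy data are assumptions for illustration and are not taken from the study.

```python
import math

def pearson_r(x, y):
    """Pearson's correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def differential_z(r1, n1, r2, n2):
    """Assumed definition: difference of Fisher z-transformed
    correlations divided by its standard error."""
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (math.atanh(r1) - math.atanh(r2)) / standard_error

def without(values, index):
    """Copy of values with the sample at position index removed."""
    return [v for i, v in enumerate(values) if i != index]

# Toy data (hypothetical): the last case sample is a bivariate outlier.
case_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 20.0]
case_y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.3, 1.0]
ctrl_x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ctrl_y = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 7.0]

r_ctrl = pearson_r(ctrl_x, ctrl_y)
r_with = pearson_r(case_x, case_y)
r_skipped = pearson_r(without(case_x, 6), without(case_y, 6))

z_with = differential_z(r_with, len(case_x), r_ctrl, len(ctrl_x))
z_skipped = differential_z(r_skipped, len(case_x) - 1, r_ctrl, len(ctrl_x))

print(f"r with outlier: {r_with:.2f}, without: {r_skipped:.2f}")
print(f"Z with outlier: {z_with:.2f}, without: {z_skipped:.2f}")
```

In this toy case a single misaligned sample reverses the sign of r and of the differential Z, which is the kind of influence that the absolute and relative differences in Tables 1 and 2 summarize over all correlations.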

Table 1

Effect of the atypical samples on Pearson’s r.
Subgroup	Data	Atypical samples removed	Mean absolute	Mean %	Median absolute	Median %
CRC	Raw	1 Outlier	0.21	221	0.15	58
HC	Raw	1 Outlier	0.16	313	0.08	41
CRC	Raw	1 Dissimilar	0.07	95	0.03	17
R-CRC	Raw	1 Dissimilar	0.13	197	0.06	28
R-HC	Raw	1 Dissimilar	0.04	64	0.02	10
CRC	Log	1 Dissimilar	0.09	158	0.06	23
R-CRC	Log	1 Dissimilar	0.09	169	0.06	22

Mean and median absolute and relative (in percentage of the correlation values with the outliers) differences of Pearson’s correlation coefficients r in the four subgroups of samples, for all the 5151 correlations in the data set, when removing the atypical samples.

Table 2

Effect of the atypical samples on the differential correlation coefficient Z_CRC and Z_R-CRC.
Comparison, data and atypical samples removed	Mean absolute	Mean %	Median absolute	Median %
CRC / HC Raw – 2 Outliers	0.96	332	0.74	95
CRC / HC Raw – 1 Dissimilar	0.22	102	0.10	16
R-CRC / R-HC Raw – 2 Dissimilar	0.44	249	0.23	37
CRC / HC Log – 1 Dissimilar	0.28	138	0.17	29
R-CRC / R-HC Log – 1 Dissimilar	0.25	132	0.17	25

Mean and median absolute and relative (in percentage of the correlation values with the outliers) differences of the differential Z_CRC and Z_R-CRC, for all 5151 correlations in the data set, when removing the atypical samples.

The atypical values also had a clear effect on Pearson’s r and on the differential Z, especially with the raw data (Table 3). Again, and as expected, the raw data were more sensitive to the presence of atypical values than the log-transformed ones [26], but both suffered from this phenomenon. Therefore, the log transformation proved efficient but not sufficient to make the GC×GC-TOF-MS data fully Gaussian.

This was observed visually on the scatterplots, where the deviant values biased the linear regressions and the corresponding angles between them [24, 26] (SI S-3A). When considering the angles, scaling the data prior to the calculation was reported to be important to perform a relevant analysis and to avoid misinterpretations [16]. Four types of scaling were tested, consisting of the multiplication of the slopes of the linear regressions by the ratio of the means, the medians, the maxima or the ranges of values of the subgroups considered (CRC and HC or R-CRC and R-HC). Logically, since the correlation values depend on the range of observations [31], we found that the range method was the most consistent with the associations observed in the scatterplots as well as with the angles calculated on the ranks, taken as the reference since they measure the metabolites on the same scale and thus do not require any change (SI S-3B). The results obtained confirmed what was seen with the numerical calculations: with the raw data, even the dissimilar values were influential, while with the log-transformed data, only the clear outliers represented an issue (SI S-3C). Overall, and to conclude this section, when using Pearson’s r on GC×GC-TOF-MS data, we found it appropriate to use the log-transformed data along with a proper uni- and multivariate outlier detection strategy that includes a visual control through 2D graphical representations [26], in order to look for the presence of atypical samples and values. In our view, the same procedure could be applied to biomarker research as well, particularly if non-parametric statistical methods are not included in the biomarker selection process.

Table 3

Effect of the atypical values on Pearson’s r and on the differential Z_CRC and Z_R-CRC.
Absolute differences	r		Z	
Raw	Outliers	All atypical	Outliers	All atypical
Mean	0.21	0.28	1.34	1.62
Median	0.12	0.21	0.63	0.95
Log	Outliers	All atypical	Outliers	All atypical
Mean	0.06	0.13	0.40	0.74
Median	0.00	0.08	0.00	0.68
Relative differences (%)	r		Z	
Raw	Outliers	All atypical	Outliers	All atypical
Mean	99	136	98	124
Median	31	64	60	74
Log	Outliers	All atypical	Outliers	All atypical
Mean	18	44	20	44
Median	0	17	0	29

Mean and median absolute (above) and relative (in percentage of the (variation of) correlation values with the outliers, below) differences between the correlation coefficients r and their variations Z_CRC and Z_R-CRC with and without the outliers or all atypical values, for all 5151 correlations.

2.4. Non-parametric Coefficients. Presence of Ties through Analytical Precision

To begin with, thresholds were defined to distinguish ties from non-ties. The three ways selected to assess the presence of ties agreed that a distance under 10% of the combined RSD of two consecutive values, which corresponds to a very small effect size (< 0.2), a t-value < 0.6 and a corresponding probability of rejecting the tie hypothesis (p-value) of less than 70%, could be seen as a confident tie (Table 4). On the contrary, a distance over 40% of the combined RSD, which corresponds to a large effect size (> 0.8), a t-value > 2.4 and a corresponding p-value > 99%, was a confident non-tie. In between, the decision was less clear, but it appeared that under 25% of the RSD, equivalent to a medium effect size (0.5), a t-value of 1.5 and a p-value of 90%, the points were probably tied, while over those values they were probably not. The thresholds were confirmed visually through scatterplots of the raw data against their respective ranks, constructed for 8 metabolites manually selected as representatively distributed over the data set. To take the measurement uncertainty into account, error bars equal to the combined RSD in the QC samples were drawn (Fig. 1). In addition, the signal axis was log-transformed when the data range exceeded 10, which was often the case. As observed by Anscombe [24], this allowed the lowest signals and their error bars to be visible in comparison to the highest ones. Overall, the thresholds made sense visually for our specific data set but were also consistent with the more standard notions of effect size, t-values and p-values, which suggests that the procedure and the criteria are generalizable to other studies, while the threshold values would likely have to be adapted, particularly as a function of the sample size and the dynamic range of the detector (SI S-4).

Table 4

Defined thresholds for confident and probable ties and non-ties.
	t-value	p-value* (%)	Hedges’s g	RSD (%)
Confident tie	0.60	70	0.20	10
Probable (non-) tie	1.50	92	0.50	25
Confident non-tie	2.40	99	0.80	40
* p-values for sample sizes 17 ≤ n ≤ 19.

The application of the thresholds to the entire data set showed numerous ties and groups of ties (Table 5), with frequent ‘chains’ of ties that were difficult to interpret. The resulting number of points left to calculate the correlations was reduced by 30% and more than 50% with the confident and probable ties, respectively. As a result, the precision of the data appeared to be too low in comparison to the distances between consecutive values to get confident rank attributions, despite the care taken in the measurements and the partial correction of the peak volumes for the analytical variation measured in the QC samples through a locally estimated scatterplot smoothing (LOESS) procedure [32]. This led to what we regard as an excessive attribution of the ranks that could impair the efficiency of the non-parametric correlation coefficients if this issue is not addressed, which, again, depends on the specific case at hand through both the sample size and the dynamic range of the analytical method employed. On the other hand, the frequency of such cases in the GC×GC-TOF-MS data set analyzed here, and the consequent reduction of the number of points left to evaluate the correlations, could also be a problem, which tends to favour the consideration of the confident ties over the probable ones.
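How ‘chains’ of ties arise can be illustrated with a simplified, hypothetical criterion. In the sketch below, two consecutive sorted values are declared tied when their gap is smaller than a chosen fraction of their combined standard deviation, each standard deviation being taken as the RSD times the value. The criterion, the `threshold` parameter and the example values are assumptions for illustration; they do not reproduce the three-parameter procedure of Table 4.

```python
import math

def group_ties(values, rsd, threshold):
    """Group sample indices into chains of ties.

    Consecutive sorted values are tied when their gap is below
    threshold * sqrt((rsd * a)**2 + (rsd * b)**2) (hypothetical criterion).
    """
    if not values:
        return []
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        a, b = values[prev], values[cur]
        combined_sd = math.sqrt((rsd * a) ** 2 + (rsd * b) ** 2)
        if b - a < threshold * combined_sd:
            groups[-1].append(cur)   # extend the current chain
        else:
            groups.append([cur])     # start a new group
    return groups

def tied_ranks(values, rsd, threshold):
    """Average rank shared by all members of each chain."""
    ranks = [0.0] * len(values)
    position = 1
    for group in group_ties(values, rsd, threshold):
        average = position + (len(group) - 1) / 2
        for i in group:
            ranks[i] = average
        position += len(group)
    return ranks

# Hypothetical signals with a 10% RSD and a threshold of 0.4.
signals = [100.0, 104.0, 108.0, 150.0, 151.0, 300.0]
print(group_ties(signals, rsd=0.10, threshold=0.4))
print(tied_ranks(signals, rsd=0.10, threshold=0.4))
```

In this example the first and third values end up in the same group only because each is tied with the value in between, which is one reason why long chains are difficult to interpret and why they quickly reduce the number of distinct points left.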

The effect of the ties on the correlations was moderate in absolute values (0.03 mean and median variations of the coefficients due to the confident ties, 0.05–0.1 for the probable ones), especially in comparison to the changes induced in Pearson’s r by the atypical values, but it was substantial in relative terms. Indeed, the median and mean variations were respectively around 10 and 40% when considering the confident ties and around 30 and 120% when considering the probable ones. The differential correlation coefficient Z_CRC was influenced by the ties in a similar manner, with median and mean variations both around 0.05–0.1 in absolute values and respectively around 20 and 80% in relative terms for the confident ties. For the probable ones, the median and mean variations were respectively around 0.3 and 0.7 in absolute values, representing around 40 and 200% median and mean relative variations. Logically, the significant variations of correlations were also influenced by the ties: their number remained the same (but their identity changed) in the case of Spearman, and it increased with Goodman-Kruskal as well as with Kendall, but to a lesser extent. Detailed results for this entire section are given in SI S-5. The higher inflation of Goodman-Kruskal due to the ties could be explained by the fact that it explicitly takes them into account, through an adjustment of the sample size, while Kendall’s tau leaves them aside from the correlation calculation. However, it is also argued that this phenomenon could be due to its implicit directional nature that automatically uses the narrower variable as the independent one [33, 34]. The detailed visual evaluation of the 8 selected correlations confirmed the effect of the ties, but it was found to be somewhat limited, particularly on the regressions and with the confident ties, as illustrated in Fig. 2.
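The different sensitivity of the two pair-based coefficients to ties can be seen from their textbook definitions, sketched below. Both count concordant (C) and discordant (D) pairs; gamma divides C − D by C + D only, whereas the tau-b variant of Kendall’s coefficient keeps the pairs tied on a single variable in its denominator. Whether these exact variants match the implementation used in this study is an assumption, and the example data are hypothetical.

```python
import math
from itertools import combinations

def pair_counts(x, y):
    """Count concordant, discordant and tied pairs of observations."""
    concordant = discordant = tied_x = tied_y = tied_both = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            tied_both += 1
        elif dx == 0:
            tied_x += 1          # tied on x only
        elif dy == 0:
            tied_y += 1          # tied on y only
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    return concordant, discordant, tied_x, tied_y, tied_both

def goodman_kruskal_gamma(x, y):
    """Gamma ignores every tied pair: (C - D) / (C + D)."""
    c, d, *_ = pair_counts(x, y)
    return (c - d) / (c + d)

def kendall_tau_b(x, y):
    """Tau-b keeps pairs tied on one variable in the denominator."""
    c, d, tied_x, tied_y, _ = pair_counts(x, y)
    return (c - d) / math.sqrt((c + d + tied_x) * (c + d + tied_y))

# Hypothetical ranks containing one tie on each variable.
x = [1, 2, 2, 3]
y = [1, 2, 3, 3]
print(goodman_kruskal_gamma(x, y), kendall_tau_b(x, y))
```

With these definitions the two coefficients coincide when there are no ties, and the magnitude of gamma can only be larger than or equal to that of tau-b once ties appear, which is in line with the stronger inflation reported here for Goodman-Kruskal’s gamma.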

Table 5

Ties, groups of ties and numbers of points left in the data set.
Number	Groups of ties		Ties		Points left
Number	Confident	Probable	Confident	Probable	Confident	Probable
Mean	3.5	3.8	9.0	14.0	13.0	8.3
Median	3	4	9	14	13	9
%	Confident	Probable	Confident	Probable	Confident	Probable
Mean	19	21	49	75	70	45
Median	16	22	49	76	70	49

Mean and median aggregated numbers (above) and percentages (below) of groups of ties, ties and numbers of points left when considering the confident and probable ties.

Regarding the coefficients, our results, detailed in SI S-5 and S-6, show that with no ties or only the confident ties, Spearman’s rho was the most different coefficient of the three (the highest, with a mean rank of 1.1, in Table S17, panels A and C; see also the left part of Table S11 for the absolute values). However, when the probable ties were taken into account, Goodman-Kruskal’s gamma caught up and Kendall’s tau became the coefficient with the most different values (with the lowest mean rank of 2.9, panel C of Table S17). After Kendall’s tau and Goodman-Kruskal’s gamma were transformed to be directly comparable to Pearson’s r and Spearman’s rho [35] (right part of Table S11 and panels B and D of Table S17), they were much closer to Spearman at first (no tie), and then their inflation due to the ties, in addition to separating them from Spearman as well as from each other, made them the highest coefficients. Again, this particularly affected Goodman-Kruskal’s gamma, which became the most different coefficient (mean ranks of 1.1 and 1.0 with the confident and probable ties, panel D of Table S17). Those observations were confirmed by the differential correlation coefficient Z_CRC (Tables S14 for the absolute values and S18 for the differences between coefficients). Overall, Goodman-Kruskal was found to be the most liberal coefficient, and Spearman the most conservative, both by a wide margin, especially with the ties. Globally, the mean and median correlation coefficients r and differential Z_CRC were relatively low, with respective values of 0.2–0.3 (up to 0.4 for Goodman-Kruskal after transformation and the consideration of the ties, Table S11) and 0.7 (up to 1 and more for Goodman-Kruskal when considering the ties, Table S14). The 75th percentiles were moderate, respectively around 0.4 (up to 0.6 for G-K) and a bit more than 1 (up to 1.8 for G-K).
While the comparison between the non-parametric coefficients will be continued in the next section, it already follows from this investigation of the ties and their effects that, in order to take them into account without risking artificially changing the results, it seems a good compromise to calculate those coefficients with only the confident ties. To conclude this section, when using a non-parametric correlation coefficient, it appears important to consider the presence of possible ties that could bias the results. To do so, the proposed strategy, based on the definition and application of specific thresholds for three parameters able to highlight such ties and on the evaluation of their effects on the data and the various coefficients, worked well on our GC×GC-TOF-MS data set and should be generalizable, provided that the results are properly adjusted to the specific case considered.

2.5. Determination of the appropriate coefficient

To find the appropriate coefficient to perform the correlation analysis, here on our untargeted metabolomics GC×GC-TOF-MS data set, the methodology consisted of comparing Pearson’s r calculated on the log-transformed data, without the outlier samples and values previously determined, to the non-parametric coefficients applied with the consideration of the confident ties. We first assessed the aggregated (5151 correlations) absolute values for both the correlation coefficients and their differential Z_CRC. The former and their respective ranks were quite similar for the four coefficients. Spearman’s rho and Goodman-Kruskal’s gamma were the most different coefficients with respectively the lowest and highest values (0.28 and 0.34, mean ranks of 3.2 and 1.5; Table 6). The aggregated absolute differences between the coefficients, however, showed that the parametric one, Pearson’s r, was the most different. Thus, the effect of the confident ties in separating the non-parametric measures from each other was small in comparison to the difference between parametric and non-parametric coefficients. Kendall’s tau, despite being very close to Goodman-Kruskal’s gamma, was the most central coefficient, with no maximum and only 11% of minimum values.

Table 6

Comparison of the correlation coefficients.
Distribution	P	G-K	K	Sp
25th percentile	0.13	0.15	0.14	0.12
Mean	0.30	0.34	0.32	0.28
Median	0.27	0.31	0.29	0.25
75th percentile	0.44	0.50	0.47	0.42
Max	1.00	1.00	0.99	0.98
Mean Rank	2.7	1.5	2.6	3.2
Median Rank	3	1	2	3
% of Max	35	60	0	5
% of Min	43	0	11	45
Distribution	P / G-K	P / K	P / Sp	G-K / K	G-K / Sp	K / Sp
25th percentile	0.04	0.03	0.03	0.01	0.02	0.02
Mean	0.10	0.09	0.08	0.02	0.06	0.04
Median	0.08	0.07	0.06	0.01	0.05	0.04
75th percentile	0.14	0.13	0.12	0.02	0.08	0.06
Max	0.86	0.87	0.91	0.12	0.25	0.18

Aggregated absolute values, ranks and absolute differences between the correlation coefficients, for all 5151 correlations. Goodman-Kruskal’s gamma (G-K) and Kendall’s tau (K) were transformed to be directly comparable to Pearson’s r (P) and Spearman’s rho (Sp) [35].
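One common way of putting Kendall’s tau on the scale of Pearson’s r is Greiner’s relation, r = sin(π·tau/2), which holds for bivariate normal data. The sketch below shows this relation only; that it is the transformation applied in [35] is an assumption, and no transformation of gamma is attempted here.

```python
import math

def tau_to_r(tau):
    """Pearson-scale equivalent of Kendall's tau (Greiner's relation,
    valid under bivariate normality)."""
    return math.sin(math.pi * tau / 2.0)

def r_to_tau(r):
    """Inverse of Greiner's relation."""
    return 2.0 / math.pi * math.asin(r)

# Illustrative values of tau (hypothetical).
for tau in (0.1, 0.2, 0.3, 0.5):
    print(f"tau = {tau:.2f} -> r-equivalent = {tau_to_r(tau):.2f}")
```

The relation stretches small and moderate values of tau upwards, which is why an untransformed tau is not directly comparable to r or rho.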

Those observations were confirmed, and even amplified, in the aggregated differential correlations Z_CRC, particularly regarding the difference between Pearson’s r and the non-parametric coefficients (Table 7). The number of significant variations of correlations led to the same conclusions, whatever the threshold used, with a ranking of the coefficients from the most liberal to the most conservative: Goodman-Kruskal’s gamma > Kendall’s tau > Pearson’s r > Spearman’s rho (Table 8). Regarding the thresholds for significance, the 0.05 and 0.01 p-values, used in the absence of appropriate direct thresholds for the correlation coefficients [36], led to dozens and even hundreds (with 0.05 and/or ties) of significant results. However, when Bonferroni and Benjamini-Yekutieli corrections for multiple testing were introduced (0.05 p-value threshold) [25, 37], very few results were left for further exploitation (Table 8). Given the exploratory nature of the procedure, and the fact that true significance can only be established through replication [22, 38], we found it most appropriate to compromise between the quality (necessary with low sample sizes [16]) and the quantity of the results [25] through an uncorrected p-value threshold of 0.001.
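The two corrections mentioned above can be sketched with their standard definitions. Bonferroni multiplies each p-value by the number of tests m; Benjamini-Yekutieli is a step-up procedure that additionally multiplies by the harmonic sum c(m) = 1 + 1/2 + ... + 1/m. The function names and the example p-values are illustrative assumptions, not results of this study.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_yekutieli(p_values):
    """Benjamini-Yekutieli adjusted p-values (step-up procedure)."""
    m = len(p_values)
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest p-value so that the
    # adjusted values stay monotone in the original ordering.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m * c_m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for five tests.
p_values = [0.000002, 0.0004, 0.01, 0.03, 0.2]
print(bonferroni(p_values))
print(benjamini_yekutieli(p_values))
```

With 5151 correlations tested, a Bonferroni-corrected threshold of 0.05 corresponds to an uncorrected p-value of about 0.00001 per test, which explains why so few results survive the correction.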

Table 7

Comparison of the differential correlation coefficients Z_CRC.
Distribution	P	G-K	K	Sp
25th percentile	0.31	0.36	0.34	0.29
Mean	0.80	0.92	0.87	0.76
Median	0.66	0.77	0.73	0.64
75th percentile	1.16	1.33	1.26	1.09
Max	8.64	5.13	4.77	4.18
Distribution of the diff.	P	G-K	K	Sp
Mean Rank	2.7	2.5	2.4	2.4
Median Rank	3	3	2	2
Number of Max	1836	1547	394	1374
% of Max	36	30	8	27
Number of Min	2559	1436	256	900
% of Min	50	28	5	17
Distribution of the diff.	P / G-K	P / K	P / Sp	G-K / K	G-K / Sp	K / Sp
25th percentile	0.19	0.17	0.15	0.02	0.08	0.06
Mean	0.48	0.45	0.41	0.06	0.21	0.16
Median	0.39	0.36	0.33	0.04	0.17	0.13
75th percentile	0.68	0.64	0.57	0.08	0.29	0.23
Max	10.63	10.42	10.16	1.05	1.55	0.86

Aggregated absolute values, ranks and absolute differences between the differential correlation coefficients, for all 5151 correlations. Pearson’s r (P), Goodman-Kruskal’s gamma (G-K), Kendall’s tau (K) and Spearman’s rho (Sp).

Table 8

Comparison of the significant differential correlation coefficients Z_CRC.
Threshold Uncorrected α	P	G-K	K	Sp
5×10^-2	284	492	397	221
10^-2	59	139	104	47
10^-3	9	38	17	5
10^-4	1	7	3	1
Bonferroni	1	3	1	0
BY	1	1	0	0

Numbers of significant variations of differential correlations coefficients Z between CRC and R-CRC samples, depending on the threshold considered, for all 5151 correlations in the data set. Pearson’s r (P), Goodman-Kruskal’s gamma (G-K), Kendall’s tau (K) and Spearman’s rho (Sp).

Given the recognized capacity of non-parametric correlation coefficients to be more robust to atypical observations (and to provide higher statistical power in their presence [39, 40]), to non-linearity and to non-normal, sparse or small data sets [34], they were expected to give a more accurate assessment of the differential correlations.

This was confirmed with the 29 selected correlations, through their numerical comparison and their visual control. It emerged that all coefficients, used taking into account the outliers or the ties, gave similar results (with Pearson’s r confirmed as the most different, Fig. 3) and seemed to correctly represent numerically the visual distributions and correlations. However, Goodman-Kruskal’s gamma tended to slightly inflate the correlations, contrary to Pearson’s r and Spearman’s rho, which sometimes appeared to underestimate them (Table 9). Based on those observations, as well as on the previous analysis conducted on all correlations where it was found liberal but not as much as Goodman-Kruskal’s gamma, Kendall’s tau seemed the method of choice to perform an untargeted exploratory correlation study on our GC×GC-TOF-MS metabolomics data. This is in agreement with the usual preference for robust algorithms in the case of atypical data, particularly with small sample sizes and unknown distributions [21, 36, 38, 41, 42]. However, this is in disagreement with the observation that, because of the way it handles them, Goodman-Kruskal’s gamma is generally preferred to Spearman or Kendall for data sets with many tied ranks [34]. This is possibly because we decided to use only the confident ties in the calculation of the correlations. In addition to the strategy of comparing the coefficients in multiple ways (aggregated values, significant results and representative cases), which proved appropriate, the scatterplots were found very useful from a methodological point of view, confirming previous recommendations [24, 37], despite the small sample sizes available leading to incomplete distributions. The use of regressions helped considerably to determine the tendencies in the data (SI S-7).
This was especially true with moderate correlations (between 0.3 and 0.6), whose significance was more difficult to evaluate, as well as when comparing correlations with ranges and scales that differed between the subgroups studied. Most frequently, we observed quite clear linear behaviours that were well modeled by linear regressions, which made the polynomial ones superfluous. However, the polynomial approximation, limited to the second order to avoid overfitting, could be compared to the linear regression to inform, through its curvature, about the linearity of the data. It was also able to model alternative trends present in the subgroups of the data, again through its curvature as well as through the length of the curve on each side of it. In addition to the log-transformed data used in the calculations, the graphs included the raw data for comparison purposes. The main observation was a logical increase in the linearity of the distributions through the reduction of the distances between the points.

Table 9

Differential correlations Z_CRC of the 29 selected correlations for the four coefficients tested.
Variable ID number		P	G-K	K	Sp
1	10	-2.0	-2.6	-2.5	-2.4
1	69	1.7	2.2	2.0	1.7
5	68	2.3	2.8	2.6	2.3
5	78	4.5	5.1	4.8	4.2
6	25	2.2	1.8	1.7	1.4
7	48	0.4	0.7	0.7	0.6
7	62	0.0	0.2	0.2	0.2
7	69	1.5	1.4	1.3	1.2
14	51	2.7	3.6	3.4	3.0
14	57	2.7	2.5	2.4	2.3
14	85	2.0	2.8	2.7	2.5
18	77	3.1	3.1	3.0	2.9
19	77	1.8	2.5	2.4	2.3
20	22	-1.9	-2.7	-2.5	-2.5
20	68	2.3	2.5	2.4	2.3
22	74	2.0	2.3	2.2	2.0
22	82	1.0	2.4	2.2	1.8
25	77	1.4	2.3	2.2	2.0
28	93	-1.9	-3.2	-2.7	-2.3
29	98	1.8	3.0	2.6	2.3
54	55	1.7	3.6	3.3	3.0
58	90	-1.7	-3.6	-3.3	-2.8
62	95	0.8	0.6	0.6	0.3
76	96	2.0	3.0	2.9	2.5
76	98	1.8	2.9	2.7	2.3
77	101	2.7	3.3	3.2	3.0
78	83	1.3	2.6	2.4	2.1
88	96	0.7	2.4	2.3	2.1
89	91	-1.9	-2.0	-1.8	-1.5

The values whose absolute difference from the nearest other coefficient is ≥ 0.5 or ≥ 1, the empirical thresholds chosen to represent medium and large differences, are shown in light grey and grey, respectively. Pearson’s r (P), Goodman-Kruskal’s gamma (G-K), Kendall’s tau (K) and Spearman’s rho (Sp).
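
The flagging rule of Table 9 can be reproduced with a few lines of code. The sketch below is illustrative only: the function name `flag_differences` is hypothetical, and the rounding to one decimal (the precision of the table) is an assumption made to avoid floating-point artefacts at the thresholds.

```python
def flag_differences(values, medium=0.5, large=1.0):
    """Flag each coefficient by its distance to the nearest other coefficient."""
    flags = []
    for i, v in enumerate(values):
        others = values[:i] + values[i + 1:]
        # Round to the one-decimal precision of the table before comparing.
        nearest = round(min(abs(v - o) for o in others), 1)
        if nearest >= large:
            flags.append("large")
        elif nearest >= medium:
            flags.append("medium")
        else:
            flags.append("")
    return flags


# Order of the values: P, G-K, K, Sp (rows taken from Table 9).
row_88_96 = flag_differences([0.7, 2.4, 2.3, 2.1])
row_22_82 = flag_differences([1.0, 2.4, 2.2, 1.8])
```

For variables 88 and 96, only Pearson’s r is flagged, as a large difference; for variables 22 and 82, it is flagged as a medium one.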

2.6. Specific Significant Variations of Correlations Associated to Colorectal Cancer

Using the 0.001 p-value threshold chosen above led to the selection of 17 correlations significantly altered in the “active” state of colorectal cancer (CRC samples) in comparison to the gender and age matched healthy controls (HC samples; Table 10). Among those, 10 were found specific to the active state in comparison to the CRC samples in remission (R-CRC).

Table 10

Differential correlations Z and associated p-values for the indirect comparison between CRC and R-CRC samples
Variable ID numbers		Z_CRC	p-value	Z_R−CRC	Z_CRC/R−CRC	p-value
5	78	4.8	0.000002	0.7	4.1	0.00004
45	57	-4.3	0.00002	-0.5	-3.8	0.0001
77	98	4.2	0.00003	0.6	3.6	0.0003
45	90	-3.8	0.0001	-0.1	-3.8	0.0002
7	77	3.8	0.0002	0.9	2.9	0.004
20	95	3.7	0.0002	0.9	2.8	0.005
8	43	3.6	0.0003	-0.6	4.2	0.00003
44	92	-3.5	0.0004	-1.4	-2.1	0.033
3	43	-3.5	0.0004	-1.1	-2.4	0.014
14	51	3.4	0.0007	1.9	1.5	0.14
54	55	3.3	0.0008	-0.9	4.3	0.00002
45	70	-3.3	0.0008	-0.1	-3.2	0.001
6	14	3.3	0.0009	0.7	2.6	0.009
45	51	-3.3	0.0009	-0.1	-3.2	0.001
29	77	3.3	0.0010	0.6	2.7	0.008
56	98	3.3	0.0010	-1.1	4.4	0.00001
18	71	3.3	0.0010	-2.9	6.2	< 0.00001

The differential Z between CRC and R-CRC, Z_CRC/R-CRC, is the difference between the two differential Z obtained individually against the gender and age matched healthy controls, Z_CRC and Z_R-CRC. In bold, the variations of correlations specific to CRC.
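
The indirect comparison of Table 10 can be illustrated as follows. The p-values listed are consistent with a two-sided p-value of Z under the standard normal distribution; this reading, like the function names, is our assumption and is not a formula given in the text. Small discrepancies with the table are expected because the tabulated Z values are rounded to one decimal.

```python
import math


def two_sided_p(z):
    """Two-sided p-value of z under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


def indirect_comparison(z_crc, z_rcrc):
    """Difference between the two differential Z values and its p-value."""
    z_diff = z_crc - z_rcrc
    return z_diff, two_sided_p(z_diff)


# First row of Table 10 (variables 5 and 78).
p_crc = two_sided_p(4.8)                       # about 1.6e-6
z_diff, p_diff = indirect_comparison(4.8, 0.7)  # about 4.1 and 4.1e-5
```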

The use of the mass spectra, the linear retention indices (LRI) and the exact mass led to the confident annotation (MSI level 2) of 15 of the 25 unique molecules involved in the correlations of interest (Table 11), providing 3 correlations with two annotated metabolites, 11 with one annotated metabolite and 3 with no annotated metabolite. All were found to have biological functions. Since the annotation of a correlation requires both metabolites to be annotated, the issue of identifying the metabolites of interest is even more crucial than in biomarker research [43–46] and therefore constitutes a major bottleneck. This is illustrated in the network visualization, where the two main metabolic hubs highlighted (metabolites 45 and 77 in the data set) could not be reliably annotated (Fig. 4). Therefore, there was limited interest in performing a specific network analysis [10] (such as hubs, connectivity, centrality parameters; Fig. 4).

Table 11

Annotation of the metabolites involved in the correlations of interest.
Data set	Metabolite	Molecular Formula	Match Factor	Probability (%)	Theoretical LRI	Measured LRI	Delta LRI	Exact Mass (error, ppm)		Number of criteria met
3	3-Hydroxybutyric acid	C₄H₈O₃	793	49.5	1167	1164	3	0.8	1.4	3
5	Serine	C₃H₇NO₃	821	85.0	1368–1388	1372	6	0.5	0.0	3
8	Threonine	C₄H₉NO₃	877	91.8	1375–1400	1399	12	0.2	0.3	3
14	Aminomalonic acid	C₄H₄NO₄	819	96.0	1485	1484	1	0.1	0.0	3
18	Malic acid	C₄H₆O₅	727	47.9	1478	1503	25	0.3		3
29	Phenylalanine	C₉H₁₁NO₂	704	53.5	1636	1631	5	1.7	2.3	2*
44	Glycylglycine	C₄H₈N₂O₃	731	4.0	1818	1824	6	0.2		3
54	Quinic acid	C₇H₁₂O₆	661	69.4	1851–1918	1897	13	0.6	0.7	3
56	Tyrosine	C₉H₁₁NO₃	876	80.3	1959	1957	2	0.7		3
57	Pentadecanoic acid	C₁₅H₃₀O₂	658	90.2	1950	1951	1	0.3		3
70	Uric acid	C₅H₄N₄O₃	745	60.7	2136	2140	4	4.0	4.4	2*
71	Heptadecanoic acid	C₁₇H₃₄O₂	647	76.6	2146	2147	1	2.8	3.3	2*
78	Uridine	C₉H₁₂N₂O₆	771	89.5	2375	2381	6	0.1		3
95	Erythritol / Threitol	C₄H₁₀O₄	810	43.9	1510	1525	15	0.0	0.4	3
98	Norleucine	C₉H₁₂NO₂	791	11.0	1313	1312–1328	7	0.2		3

The annotation was made through full mass spectrum (library match and probability of correct annotation), LRIs (delta between the measured and theoretical values) and exact mass (mass error in ppm). In bold, the values that met the criteria defined. *Indicates that the exact mass criterion was relaxed to 5 ppm.

Most publications on the molecular aspects of CRC have studied the subtypes of CRC as well as the CRC-initiating or tumorigenesis cellular pathways [48–53]. However, alterations of the metabolic pathways have been observed in various matrices (serum, tissues, urine, feces and breath) that can be informatively compared to the ones observed in the study (Table 12 and in bold below). They include: cell energetic metabolism according to the Warburg effect [54] (glycolysis and TCA cycle), urea cycle, structural maintenance through glycerol and ketone bodies, oxidative stress, cytochrome P450 activity and the metabolism of lipids, amino acids, fatty acids, bile acids and nucleotides [4, 14, 55–60] (including multiple potential contributions of the microbiota [51]). The metabolic pathways highlighted here suffer from two weaknesses. First, they are only very weakly enriched, except for aminoacyl-tRNA biosynthesis. Second, they are rather non-specific and seem to be only indirect consequences of colorectal cancer. However, just like the candidate biomarkers, the correlations of interest are only preliminary results, limited by the low sample size [61]. They need to be confirmed through a targeted quantitative analysis of the variations observed, followed by a validation conducted on an independent, larger set of patients, to see whether they translate to different groups of patients with the same significant distributions. After that, further biological investigation would have to be performed, which ideally would lead to a better understanding of the metabolic effects of the disease process and its remission state detectable in serum samples. In this perspective, a powerful but complex approach would be to integrate other omics data obtained on the specific biological samples analyzed for this study.

Table 12

Metabolic pathways most altered when considering the highlighted candidate biomarkers and correlations of interest.
CB / Corr	Pathway Name	Match Status
CB + Corr	Aminoacyl-tRNA biosynthesis	5/48
CB	Alanine, aspartate and glutamate metabolism	3/28
CB + Corr*	Phenylalanine, tyrosine and tryptophan biosynthesis	2/4
CB + Corr*	Phenylalanine metabolism	2/10
CB + Corr*	Butanoate metabolism	2/15
CB	Glycolysis / Gluconeogenesis	2/26
CB	Glyoxylate and dicarboxylate metabolism	2/32
CB + Corr	Purine metabolism	2/65

CB + Corr* refers to the case where the same metabolite(s) was (were) highlighted in the biomarker research and the correlation analysis; CB + Corr means that different metabolites were highlighted in the two processes that are involved in the same pathway. The match status is the number of metabolites highlighted through the biomarker research and the correlation analysis over the total number of metabolites present in the pathway.

2.7. Limitations

Besides the annotation of the metabolites mentioned above, another recurrent limitation in metabolomics that was highlighted here is the analytical stability. Indeed, the generation of quality data requires high accuracy [10] and therefore suffers from the intrinsic noise present in high-dimensional data sets [62]. In addition to its effect on the non-parametric coefficients investigated above, it reduced the number of signals available to discover candidate biomarkers and correlations of interest. Missing values, besides the problem they pose for correlation calculation by producing incomplete profiles that are difficult to process (an issue that had no particular consequence in this work), are another important cause of the loss of potentially interesting signals. Here, despite the care given to the chromatography, the implementation of an external QC system coupled to a LOESS partial correction, and the replacement of all missing values by half of the lowest signal measured for the specific metabolite, the metabolic coverage was decreased by the selection of only the stable metabolites (RSD in the QC samples under 30% [20]) and the ones present in at least half the samples of any class. Therefore, through the various preprocessing steps (summarized in SI S-8), we went from 646 quality features in the chromatographic template to 102 variables corresponding to metabolites in the final data set, in line with a previous study performed on Crohn’s disease samples where 524 quality features became 183 high-quality metabolites. While this obviously penalizes any biomarker research, where it linearly decreases the number of potential candidate biomarkers, it harms a correlation analysis even more: every additional variable can have potentially interesting links with every other, so the number of correlations available grows with the square of the number of metabolites (p(p − 1)/2 pairs for p metabolites).
However, such a selection seems necessary in order to get confident results, since even lower analytical variations were shown to have a clear effect on the non-parametric correlation estimates. This drawback is shared by all analytical platforms, to various extents, and no single instrumentation is able to provide a complete and fully stable coverage of a complex sample. Because of the exploratory nature of the analysis, even if the number of significant variations of correlations was limited with the threshold applied, it could nevertheless be sufficient to provide interesting metabolic insights and associations with colorectal cancer alterations and regulations. Because of those limitations, only a small number of annotated correlations of interest were obtained and thus a very small network. Last but not least, our study is limited by the small sample size used (17 ≤ n ≤ 19), due to clinical constraints. Among the several potential effects mentioned in SI S-9, the reduced statistical power and the inflation of the correlation values are the most critical here. We investigated not only correlations but also variations of correlation (Z_CRC or Z_R-CRC) and differences in the variations of correlation between different states of the pathology (Z_CRC/R-CRC). This makes it difficult to evaluate the power as well as the metabolic effect sizes of interest and their significance, particularly prior to the study. Indeed, the size of the variations of interest depends not only on the specific metabolites involved, like the simple correlations, but also on both the initial and final correlation values r, as investigated for Pearson’s r by Bujang and Baharum (2016) [63]. Power calculation applied to our study shows that with n ≥ 17, the minimal correlation value detectable with a power of 80% at a significance level of 0.05 is around 0.6 [64], which is already a large correlation.
Regarding the variations of correlations, we observed that the significant ones always involved a low or medium correlation and a very high one (in absolute value). While the very high correlations are most likely inflated by the low sample size [65], the statistical power can nevertheless be estimated by calculating the sample size necessary to detect the low or medium values. With a significance threshold (type I error) of 0.05, the power to detect the variations of correlations significant at 0.05, 0.01, 0.001 and 0.0001 is 32, 38, 63 and 90%, respectively.
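
The order of magnitude of the minimal detectable correlation quoted above can be checked with the classical Fisher z approximation. The sketch below is an assumption on our part about how such a figure can be obtained; it is not necessarily the calculation of reference [64], and the function name is hypothetical.

```python
import math
from statistics import NormalDist


def minimal_detectable_r(n, alpha=0.05, power=0.80):
    """Smallest |r| detectable with the given power (two-sided test),
    using the Fisher z approximation: atanh(r) * sqrt(n - 3) = z_alpha + z_beta."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.tanh((z_alpha + z_beta) / math.sqrt(n - 3))


r_17 = minimal_detectable_r(17)  # about 0.63
r_19 = minimal_detectable_r(19)  # about 0.60
```

Both values are close to the figure of around 0.6 reported for the sample sizes of this study.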

3.1. Samples

Patient recruitment and serum sample intakes were performed at the university hospital of Liège, through the Bibliothèque hospitalière universitaire de Liège (BHUL), Belgium. The sample intake, processing and storage procedures were standardized and followed our biobanking guidelines developed for proteomics studies and utilized for clinical trials [32]. We analyzed serum samples from patients with colorectal cancer (CRC; n = 18) and with colorectal cancer in remission (R-CRC; n = 17) as well as their respective controls (HC and R-HC) matched for gender and age (n = 19 and n = 17). Based on endoscopic examination confirmed on surgical resection specimens, patients with confirmed cancer lesions (primary adenocarcinoma according to the Tumor Node Metastasis (TNM) staging system, CRC group) or with a previous history of cancer and no evidence of remaining disease (R-CRC group) were included. The (“healthy”) control groups consisted of patients screened by colonoscopy and diagnosed negative for CRC (no visible lesion or hyperplastic polyp) as well as for any other pathology affecting the bowel (such as inflammatory bowel disease or diverticulitis) or any other known cancer at the time of sample intake. The internal QC samples consisted of 30 µL aliquots of the study samples, made through one freeze–thaw cycle.

3.2. Data Acquisition and Processing

The clinical data of the patients as well as the chemicals used and all the steps leading to the data set used in this study (sample preparation, GC×GC-HRTOF-MS analysis, data (pre)processing, annotation of the compounds of interest) have been described in Di Giovanni et al. 2020 [47] and a summary can be found in the Supplementary Materials (SI S-10). All calculations and plots were made in Microsoft® Excel® except for the correlation coefficients that were calculated using the Excel® add-in Tanagra [66] and the HCA, PCA, PLS and sPLS plots that were constructed using the web-resource MetaboAnalyst [67]. The correlation networks were drawn in Cytoscape® software [68].

3.3. Correlations Analysis

In a previous study, we looked for specific candidate biomarkers of the ‘active state’ of colorectal cancer (CRC) in comparison to healthy controls (HC) and patients in remission (R-CRC) [47]. As a result, in the data set of 102 metabolites, 24 metabolites were found significantly altered and able to discriminate the CRC and HC samples: Receiver Operating Characteristic (ROC) area under the curve (AUC) of 0.86, sensitivity and specificity of 0.72 and 0.78. Ten of those were found to have signals close to healthy levels also in the R-CRC samples and were therefore potentially specific to the CRC samples analysed. Here, in order to go further, take advantage of the above-mentioned benefits of correlations and complement the previous study, we aimed to investigate in a similar manner the correlations between metabolites, on the same data set. To do so, we first looked for the most appropriate way to calculate them, given the type and structure of the data at hand. Following Anscombe’s quartet [24, 69], we identified four main points to take into account: the presence of non-linear relationships, the role and handling of atypical samples and values, the effect of the analytical and biological noise in the data and the need for a visual control of the detected associations. Therefore, four types of correlation parameters were examined: Pearson’s r (parametric) as well as Spearman’s rho (Pearson’s r on the ranks), Kendall’s tau and Goodman-Kruskal’s gamma (non-parametric). The latter two belong to the same family of coefficients, Kendall’s tau being the parent measure to which the others reduce in the absence of ties, and can be interpreted in terms of the ranked pairs in agreement between two variables [33, 70]. The other coefficient of the family, Somers’ D, was contemplated at first but finally left aside since it treats the variables asymmetrically, with a dependent variable and an independent one, producing two results for each correlation.
To be useful, it would require the molecules to be completely annotated and their biological hierarchy to be determined prior to the correlation analysis, which was not the case in this exploratory work. For the same reason, partial correlations, which were also considered, were finally left aside. Indeed, without the proper prior biological knowledge, the determination of the influences between the metabolites would be purely mathematical, which is particularly dangerous with low sample sizes and non-negligible analytical errors. Besides, the use of partial correlations would also have limited the use of the visual control tools that are an important part of the protocol (Anscombe’s quartet [24]).
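
To make the relation between the four coefficients concrete, the sketch below computes them on toy data. It is an illustration only (the coefficients of this study were obtained with dedicated software, see section 3.2); the function names are hypothetical and Kendall’s tau is taken here in its tau-b form, which is an assumption.

```python
import math
from itertools import combinations


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    # Spearman's rho is Pearson's r computed on the ranks.
    return pearson(average_ranks(x), average_ranks(y))


def pair_counts(x, y):
    """Concordant, discordant, x-tied and y-tied pairs, and total pairs."""
    conc = disc = tie_x = tie_y = 0
    pairs = list(combinations(zip(x, y), 2))
    for (x1, y1), (x2, y2) in pairs:
        s = (x1 - x2) * (y1 - y2)
        conc += s > 0
        disc += s < 0
        tie_x += x1 == x2
        tie_y += y1 == y2
    return conc, disc, tie_x, tie_y, len(pairs)


def kendall_tau_b(x, y):
    conc, disc, tie_x, tie_y, n0 = pair_counts(x, y)
    return (conc - disc) / math.sqrt((n0 - tie_x) * (n0 - tie_y))


def goodman_kruskal_gamma(x, y):
    conc, disc, _, _, _ = pair_counts(x, y)
    return (conc - disc) / (conc + disc)  # tied pairs are ignored
```

Without ties, for example with x = [1, 2, 3, 4, 5] and y = [2, 1, 4, 3, 5], tau and gamma coincide (0.6). With ties, for example x = [1, 2, 2, 3] and y = [1, 2, 3, 3], gamma reaches 1.0 whereas tau-b is 0.8, because gamma discards the tied pairs from its denominator.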

3.4. Parametric Coefficient. Presence and Effect of Atypical Samples and Values

For Pearson’s r, we assessed the influence of the data structure on the results from both the raw and the log-transformed data [10, 41], the latter being the form used in the biomarker research [47]. Both atypical samples and values can be a serious practical problem for correlation calculation [22, 26] and produce misleading results [24, 25]. The appropriateness of Pearson’s correlation depends more on the presence of outliers in the data than on the shape of the empirical or theoretical distribution [39]. The atypical samples were investigated through hierarchical clustering analysis (HCA), principal component analysis (PCA) and partial least squares (PLS) plots constructed on all metabolites as well as through the aggregation of several parameters: mean and median ranks, mean and median robust z-scores, proportion of robust z-scores > 1, 2 and 3.5 [71–73]. The atypical values were studied through robust versions of Grubbs’ and Dixon’s tests [74, 75], as well as boxplots and scatterplots [24]. Two types of atypical samples and values were distinguished, with the aim to differentiate the clear outliers (to be reconsidered for inclusion in the data set) from the dissimilar samples. The outliers were defined as the samples and values well outside their respective defined thresholds (95% ellipses for the PCA, PLS and sparse partial least squares (sPLS) plots; high dissimilarity for HCA; 95% probability values for Grubbs’ and Dixon’s tests; 1.5 IQR limit [76] for the boxplots; 1, 2 and 3.5 [73, 77] for the z-scores (see SI S-11 for details)) and that were not aligned, monotonically, in the scatterplots. The dissimilar samples and values were defined as the ones inside but close to the thresholds and distant in the scatterplots so that they could affect the calculation of the correlations.
We then investigated the effects [26] of those atypical samples and values on the correlation coefficients as well as on the variations of correlations Z_CRC and Z_R-CRC defined as the difference between the correlation coefficients adjusted for the global variance [47] (SI S-12), between the serum samples from patients with colorectal cancer (CRC) or from patients in remission (R-CRC) and their matched “healthy” controls (HC and R-HC), for all 5151 correlations existing between the 102 variables corresponding to metabolites of the data set.
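
The exact expression of the variations of correlation is given in SI S-12 and is not reproduced here. As a hedged illustration for Pearson’s r only, the sketch below assumes the usual form of a differential correlation, namely the difference between Fisher z-transformed coefficients divided by its standard error; the function names are hypothetical. It also shows where the figure of 5151 correlations comes from.

```python
import math


def n_pairs(p):
    """Number of distinct pairs among p variables."""
    return p * (p - 1) // 2


def differential_z(r_case, n_case, r_control, n_control):
    """Difference between two Pearson correlations on the Fisher z scale,
    scaled by its standard error (assumed form, see the lead-in)."""
    z_case = math.atanh(r_case)
    z_control = math.atanh(r_control)
    standard_error = math.sqrt(1 / (n_case - 3) + 1 / (n_control - 3))
    return (z_case - z_control) / standard_error


pairs_in_data_set = n_pairs(102)  # 5151
# Hypothetical example: r = 0.9 in 18 CRC samples versus r = 0.1 in 19 HC samples.
z_example = differential_z(0.9, 18, 0.1, 19)  # about 3.8
```

In this hypothetical example, a very high correlation in one group against a low one in the other is needed to reach a Z close to 3.8 with such sample sizes.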

3.5. Non-parametric Coefficients. Presence and Effect of Ties

For the non-parametric coefficients, the frequent proximity of consecutive values in the GC×GC-TOF-MS data set relative to the measurement uncertainty, evaluated as the residual standard deviation in the QC samples (RSD), led us to take into account the influence of the precision of the data on the assignment of the ranks. Indeed, the QC system we implemented only partially corrects for the analytical variations that affect the data, and the selection of the stable metabolites allows their residual RSD to reach 30%, which is the conventional threshold used for GC-MS [78]. Therefore, the attribution of the ranks can be somewhat misleading, since it sets equal distances between the values whatever their differences in terms of signal measured in the data set and whatever their uncertainty. This could lead to false positive and false negative results when looking for the significant (variations of) correlations. To address this issue, we first investigated the presence of ties through three different aspects, for each pair of consecutive values. First, the ratio between their absolute difference and their combined residual standard deviation measured on the QC samples [32, 78] (see SI S-13 for the formulas). Second, the Hedges’ g effect size (the corrected Cohen’s d for sample sizes < 20) [47, 79]. Third, the one-tailed t-tests [80] corresponding to the probability of correct ranking. Then, the effect of the ties on the correlation coefficients and on the variations of correlation Z_CRC, between the CRC and HC groups of samples, was investigated, after proper transformation for Kendall’s tau and Goodman-Kruskal’s gamma [34–36, 47].
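
The first of the three aspects can be sketched as follows. The formulas actually used are those of SI S-13, which are not reproduced here; the combined standard deviation below (quadratic sum of the two uncertainties derived from the RSD), the threshold `min_ratio` and the function names are therefore assumptions chosen for illustration.

```python
import math


def separation_ratio(a, b, rsd):
    """Absolute difference of two positive signals over their combined SD,
    each SD being estimated as rsd * signal (assumed form)."""
    combined_sd = math.sqrt((rsd * a) ** 2 + (rsd * b) ** 2)
    return abs(a - b) / combined_sd


def ranks_with_uncertainty_ties(values, rsd, min_ratio=1.0):
    """Ranks in which consecutive values that cannot be separated, given the
    measurement uncertainty, are treated as ties and share an average rank.
    Expects a non-empty list of positive signals."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    groups = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if separation_ratio(values[prev], values[cur], rsd) < min_ratio:
            groups[-1].append(cur)  # indistinguishable from its neighbour
        else:
            groups.append([cur])
    ranks = [0.0] * len(values)
    start = 1
    for group in groups:
        average = start + (len(group) - 1) / 2
        for i in group:
            ranks[i] = average
        start += len(group)
    return ranks


# Hypothetical signals with a 10% RSD: 100 and 105 cannot be ranked confidently.
ranks = ranks_with_uncertainty_ties([100, 105, 200, 400], rsd=0.10)
```

Here the two lowest signals share the rank 1.5. Because the grouping only compares neighbours, a long chain of close values ends up in a single tie group, which is one reason why complementary criteria such as an effect size or a t-test are useful.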

3.6. Selection and Application of the Appropriate Correlation Coefficient

After that, considering the atypical values and the ties and knowing that no single method is optimal in all situations [22, 25], the four coefficients were compared in three different ways in order to determine the method of choice to perform correlation analysis on the untargeted metabolic GC×GC-TOF-MS data at hand. First, through their aggregated absolute values, differences and ranks for all 5151 correlations present in the data set. Second, through the visual monitoring of the scatterplots established for 29 correlations selected to be representative of the data set (detailed in Table 9). Third, by looking at the significant variations of correlation Z_CRC using multiple thresholds (p-value obtained under the assumption of normality of Z_CRC, Bonferroni correction [81] and Benjamini-Yekutieli false discovery rate (FDR) procedure [82], applied because of the possible dependency of the metabolic signals measured).
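
The two multiple-testing thresholds mentioned above can be illustrated with the following sketch, written from the standard definitions of the Bonferroni correction and of the Benjamini-Yekutieli step-up procedure. The function names and the p-values are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m, m being the number of tests."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_yekutieli(p_values, alpha=0.05):
    """Step-up FDR procedure valid under arbitrary dependency:
    reject the k smallest p-values, k being the largest rank such that
    p_(k) <= k * alpha / (m * c_m), with c_m = 1 + 1/2 + ... + 1/m."""
    m = len(p_values)
    c_m = sum(1 / i for i in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


p_values = [0.001, 0.002, 0.012, 0.5, 0.6]  # hypothetical
by_rejected = benjamini_yekutieli(p_values)
bonferroni_rejected = bonferroni(p_values)
```

In this toy case the Benjamini-Yekutieli procedure retains the first three tests and the Bonferroni correction only the first two. With the 5151 correlations of the data set, both thresholds become far more stringent.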

Finally, the selected method was applied with the aim to determine the significantly and specifically altered correlations in the active state of colorectal cancer (CRC samples), in the same way that it was done for the specific candidate biomarkers of CRC in a previous study [47]. To do so, the significant variations of correlations Z_CRC between the CRC and their age and gender matched healthy controls (HC) were highlighted. Then, their specificity to the active state in regard to the remission state (R-CRC) measured on distinct samples was assessed by comparing them to the variations of correlation Z_R-CRC measured between the R-CRC and their distinct age and gender matched healthy controls (R-HC). This comparison, indirect to avoid any bias due to the fact that the CRC and R-CRC samples were not matched, was done through the absolute difference between the two respective Z_CRC and Z_R-CRC [20] that led to Z_CRC/R-CRC as well as through the associated p-values obtained under the assumption of normality. The metabolites involved in the correlations of interest were identified using full mass spectra, exact mass and linear retention indices (LRI) [43, 83], according to a procedure described in a previous publication [47] that uses the following acceptance thresholds: match factor of > 700²³ or > 600 with a probability of > 50%, ΔLRI < 25⁸⁰ and mass error < 1 ppm for any specific fragment. Then, the biological plausibility of the annotated candidates was assessed with the Human Metabolome Database (HMDB [84]) and the Kyoto Encyclopedia of Genes and Genomes (KEGG [85]) in order to avoid analytical artefacts. After that, a literature review of the metabolic pathways potentially altered by colorectal cancer was conducted to see if, among the annotated candidates, some had already been associated to and would therefore be likely to play a role in CRC. 
Finally, a correlation network was drawn with the significantly (and specifically) altered variations along with the significantly (and specifically) altered metabolites (candidate biomarkers) already highlighted [47]. Importantly, further biological interpretation and exploration of the results are beyond the scope of the present paper that focuses on the methodological aspects of correlation analysis. Ideally, they should include a targeted confirmation of the variations observed, conducted on an independent set of patients of a much larger size. A powerful approach would be to integrate other omics data obtained on these specific biological samples.
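
The acceptance thresholds listed above for the annotation can be expressed as a small function that counts the criteria met, as reported in the last column of Table 11. This is a sketch under stated assumptions: the function name and its parameters are hypothetical, and strict inequalities are used as written in the text.

```python
def criteria_met(match_factor, probability, delta_lri, mass_error_ppm,
                 max_mass_error_ppm=1.0):
    """Count how many of the three annotation criteria are satisfied."""
    spectrum_ok = match_factor > 700 or (match_factor > 600 and probability > 50)
    lri_ok = delta_lri < 25
    mass_ok = mass_error_ppm < max_mass_error_ppm
    return sum([spectrum_ok, lri_ok, mass_ok])


# Values taken from Table 11 (first mass error of each row).
serine = criteria_met(821, 85.0, 6, 0.5)       # 3
quinic_acid = criteria_met(661, 69.4, 13, 0.6)  # 3
uric_acid = criteria_met(745, 60.7, 4, 4.0)     # 2 with the 1 ppm limit
uric_acid_relaxed = criteria_met(745, 60.7, 4, 4.0, max_mass_error_ppm=5.0)  # 3
```

Malic acid, with a ΔLRI of 25 in Table 11, sits exactly on the boundary, so whether the comparison is strict or inclusive matters for that compound.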

This study confirmed the interest of conducting correlation analysis in addition to the more frequent biomarker research in order to increase the metabolic information obtained about the changes induced by the phenomenon investigated. To do so, the issues raised by Anscombe’s quartet are of primary importance. Therefore, a global strategy has been proposed to cover them, whose methodological aspects were systematically examined in order to allow interested readers to apply it to their own specific case. The strategy includes descriptive visual tools such as scatterplots that were found necessary [24, 28, 37] to study the relationships between metabolites, either with parametric or non-parametric coefficients, in order to overcome the limitations of purely technical calculations, namely to assimilate the data in all their complexity [86], to detect the trends and irregularities and to control the summarized statistical values [41]. Although Kendall’s tau was found to be the most appropriate coefficient for our own data, obtained from serum samples analyzed with GC×GC-TOF-MS, the other coefficients tested also translated the visual tendencies of association efficiently into numerical values and should, in our view, be considered.

Acknowledgments Jeol (Europe) BV for instrumental support, Restek for GC consumables, the Bibliothèque hospitalière universitaire de Liège (BHUL) for the samples. This work was supported by the chemistry department of the University of Liège.

Author Contributions Conceptualization, N.D.G., M.A.M., E.L. and J.F.F.; methodology, N.D.G.; formal analysis, N.D.G.; investigation, N.D.G.; data curation, N.D.G.; writing—original draft preparation, N.D.G.; writing—review and editing, M.A.M, E.L. and J.F.F; supervision, M.A.M., E.L. and J.F.F.; project administration (patients enrolment, inclusion and sample collection, ethics), M.A.M., E.L and J.F.F.; funding acquisition, E.L., M.A.M. and J.F.F. All authors have read and agreed to the published version of the manuscript.

Funding This research was funded by the chemistry department of the University of Liège.

Conflicts of Interest: The authors declare no conflict of interest.

Ethical Approval The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board (or Ethics Committee) of the University of Liège (B707201213737). Informed consent was obtained from all subjects involved in the study.

Data Availability Statement: The mass spectrometry data and the corresponding data sheets have been deposited in the Harvard Dataverse repository: https://doi.org/10.7910/DVN/CSNWRF

Bray, F., Ferlay, J., Soerjomataram, I., Siegel, R. L., Torre, L. A., & Jemal, A. (2018). Global Cancer Statistics 2018: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36 Cancers in 185 Countries. CA. Cancer J. Clin, 68(6), 394–424. https://doi.org/10.3322/caac.21492
Arnold, M., Sierra, M. S., Laversanne, M., Soerjomataram, I., Jemal, A., & Bray, F. (2017). Global Patterns and Trends in Colorectal Cancer Incidence and Mortality. Gut, 66(4), 683–691. https://doi.org/10.1136/gutjnl-2015-310912
Lin, Y., Ma, C., Bezabeh, T., Wang, Z., Liang, J., Huang, Y. … Wu, R. (2019). 1H NMR-Based Metabolomics Reveal Overlapping Discriminatory Metabolites and Metabolic Pathway Disturbances between Colorectal Tumor Tissues and Fecal Samples. Int. J. Cancer, 0(0), https://doi.org/10.1002/ijc.32190
Zhang, F., Zhang, Y., Zhao, W., Deng, K., Wang, Z., Yang, C. … Li, K. (2017). Metabolomics for Biomarker Discovery in the Diagnosis, Prognosis, Survival and Recurrence of Colorectal Cancer: A Systematic Review. Oncotarget, 8(21), 35460–35472. https://doi.org/10.18632/oncotarget.16727
Mamas, M., Dunn, W. B., Neyses, L., & Goodacre, R. (2011). The Role of Metabolites and Metabolomics in Clinically Applicable Biomarkers of Disease. Arch. Toxicol, 85(1), 5–17. https://doi.org/10.1007/s00204-010-0609-6
Collino, S., Martin, F. P. J., & Rezzi, S. (2013). Clinical Metabolomics Paves the Way towards Future Healthcare Strategies. Br. J. Clin. Pharmacol, 75(3), 619–629. https://doi.org/10.1111/j.1365-2125.2012.04216.x
Kell, D. B., & Oliver, S. G. (2004). Here Is the Evidence, Now What Is the Hypothesis? The Complementary Roles of Inductive and Hypothesis-Driven Science in the Post-Genomic Era. BioEssays, 26(1), 99–105. https://doi.org/10.1002/bies.10385
Zanella, D., Focant, J. F., & Franchina, F. A. (2021). 30th Anniversary of Comprehensive Two-Dimensional Gas Chromatography: Latest Advances. Anal. Sci. Adv, 2(3–4), 213–224. https://doi.org/10.1002/ansa.202000142
Mendes, P., Camacho, D., & de la Fuente, A. G. (2005). Modelling and Simulation for Metabolomics Data Analysis. Biochem. Soc. Trans, 33(Pt 6), 1427–1429
Weckwerth, W., & Morgenthal, K. (2005). Metabolomics: From Pattern Recognition to Biological Interpretation. Drug Discov. Today, 10(22), 1551–1558. https://doi.org/10.1016/S1359-6446(05)03609-3
Tebani, A., Afonso, C., & Bekri, S. (2018). Advances in Metabolome Information Retrieval: Turning Chemistry into Biology. Part II: Biological Information Recovery. J. Inherit. Metab. Dis, 41(3), 393–406. https://doi.org/10.1007/s10545-017-0080-0
Mal, M., Koh, P. K., Cheah, P. Y., & Chan, E. C. Y. (2012). Metabotyping of Human Colorectal Cancer Using Two-Dimensional Gas Chromatography Mass Spectrometry. Anal. Bioanal. Chem, 403(2), 483–493. https://doi.org/10.1007/s00216-012-5870-5
Farshidfar, F., Weljie, A. M., Kopciuk, K., Buie, W. D., MacLean, A., Dixon, E. … Bathe, O. F. (2012). Serum Metabolomic Profile as a Means to Distinguish Stage of Colorectal Cancer. Genome Med, 4(5), 42. https://doi.org/10.1186/gm341
Tan, B., Qiu, Y., Zou, X., Chen, T., Xie, G., Cheng, Y. … Jia, W. (2013). Metabonomics Identifies Serum Metabolite Markers of Colorectal Cancer. J. Proteome Res, 12(6), 3000–3009. https://doi.org/10.1021/pr400337b
Steuer, R., Kurths, J., Fiehn, O., & Weckwerth, W. (2003). Observing and Interpreting Correlations in Metabolomic Networks. Bioinformatics, 19(8), 1019–1026. https://doi.org/10.1093/bioinformatics/btg120
Camacho, D., de la Fuente, A., & Mendes, P. (2005). The Origin of Correlations in Metabolomics Data. Metabolomics, 1(1), 53–63. https://doi.org/10.1007/s11306-005-1107-3
Xia, J., & Wishart, D. S. (2016). Using MetaboAnalyst 3.0 for Comprehensive Metabolomics Data Analysis. Curr. Protoc. Bioinforma, 55(1), 14.10.1–14.10.91. https://doi.org/10.1002/cpbi.11
Alonso, A., Marsal, S., & Julià, A. (2015). Analytical Methods in Untargeted Metabolomics: State of the Art in 2015. Front. Bioeng. Biotechnol, 3, 23. https://doi.org/10.3389/fbioe.2015.00023
Steuer, R. (2006). Review: On the Analysis and Interpretation of Correlations in Metabolomic Data. Brief. Bioinform, 7(2), 151–158. https://doi.org/10.1093/bib/bbl009
Kotze, H. L., Armitage, E. G., Sharkey, K. J., Allwood, J. W., Dunn, W. B., Williams, K. J., & Goodacre, R. (2013). A Novel Untargeted Metabolomics Correlation-Based Network Analysis Incorporating Human Metabolic Reconstructions. BMC Syst. Biol, 7(1), 107. https://doi.org/10.1186/1752-0509-7-107
Siska, C., & Kechris, K. (2017). Differential Correlation for Sequencing Data. BMC Res. Notes, 10(1), 54. https://doi.org/10.1186/s13104-016-2331-9
Wilcox, R. R., & Rousselet, G. A. (2018). A Guide to Robust Statistical Methods in Neuroscience. Curr. Protoc. Neurosci, 82(1), 8.42.1–8.42.30. https://doi.org/10.1002/cpns.41
Weckwerth, W., & Fiehn, O. (2002). Can We Discover Novel Pathways Using Metabolomic Analysis? Curr. Opin. Biotechnol, 13(2), 156–160. https://doi.org/10.1016/S0958-1669(02)00299-9
Anscombe, F. J. (1973). Graphs in Statistical Analysis. Am. Stat, 27(1), 17–21. https://doi.org/10.2307/2682899
Pernet, C., Wilcox, R., & Rousselet, G. (2013). Robust Correlation Analyses: False Positive and Power Validation Using a New Open Source Matlab Toolbox. Frontiers in Psychology, p. 606
Schober, P., Boer, C., & Schwarte, L. A. (2018). Correlation Coefficients: Appropriate Use and Interpretation. Anesth. Analg, 126(5)
Wilcox, R. (2004). Inferences Based on a Skipped Correlation Coefficient. J. Appl. Stat, 31(2), 131–143. https://doi.org/10.1080/0266476032000148821
McClelland, G. H. (2000). Nasty Data: Unruly, Ill-Mannered Observations Can Ruin Your Analysis. In H. T. Reis, & C. M. Judd (Eds.), Handbook of research methods in social and personality psychology. Cambridge University Press
Hardin, J., Mitani, A., Hicks, L., & VanKoten, B. (2007). A Robust Measure of Correlation between Two Genes on a Microarray. BMC Bioinformatics, 8(1), 220. https://doi.org/10.1186/1471-2105-8-220
Wilcox, R. R. (1994). The Percentage Bend Correlation Coefficient. Psychometrika, 59(4), 601–616. https://doi.org/10.1007/BF02294395
Janse, R. J., Hoekstra, T., Jager, K. J., Zoccali, C., Tripepi, G., Dekker, F. W., & van Diepen, M. (2021). Conducting Correlation Analysis: Important Limitations and Pitfalls. Clin. Kidney J, 14(11), 2332–2337. https://doi.org/10.1093/ckj/sfab085
Di Giovanni, N., Meuwis, M. A., Louis, E., & Focant, J. F. (2020). Untargeted Serum Metabolic Profiling by Comprehensive Two-Dimensional Gas Chromatography–High-Resolution Time-of-Flight Mass Spectrometry. J. Proteome Res, 19(3), 1013–1028. https://doi.org/10.1021/acs.jproteome.9b00535
Metsämuuronen, J. (2021). Goodman-Kruskal Gamma and Dimension-Corrected Gamma in Educational Measurement Settings. Int. J. Educ. Methodol, 7, 95–118. https://doi.org/10.12973/ijem.7.1.95
Metsämuuronen, J. (2021). Directional Nature of Goodman–Kruskal Gamma and Some Consequences: Identity of Goodman–Kruskal Gamma and Somers Delta, and Their Connection to Jonckheere–Terpstra Test Statistic. Behaviormetrika, 48(2), 283–307. https://doi.org/10.1007/s41237-021-00138-8
Walker, D. A. (2003). JMASM9: Converting Kendall’s Tau For Correlational Or Meta-Analytic Analyses. J. Mod. Appl. Stat. Methods, 2, 525–530. https://doi.org/10.22237/jmasm/1067646360
de Siqueira Santos, S., Takahashi, D. Y., Nakata, A., & Fujita, A. (2014). A Comparative Study of Statistical Methods Used to Identify Dependencies between Gene Expression Signals. Brief. Bioinform, 15(6), 906–918. https://doi.org/10.1093/bib/bbt051
Rousselet, G., & Pernet, C. (2012). Improving Standards in Brain-Behavior Correlation Analyses. Frontiers in Human Neuroscience, p. 119
Schwarzkopf, D., de Haas, B., & Rees, G. (2012). Better Ways to Improve Standards in Brain-Behavior Correlation Analysis. Frontiers in Human Neuroscience, p. 200
Chok, N. S. (2010). Pearson’s Versus Spearman’s and Kendall’s Correlation Coefficients for Continuous Data. University of Pittsburgh
de Winter, J. C. F., Gosling, S. D., & Potter, J. (2016). Comparing the Pearson and Spearman Correlation Coefficients across Distributions and Sample Sizes: A Tutorial Using Simulations and Empirical Data. Psychol. Methods, 21(3), 273–290. https://doi.org/10.1037/met0000079
Hazra, A., & Gogtay, N. (2016). Biostatistics Series Module 6: Correlation and Linear Regression. Indian J. Dermatol, 61(6), 593–601. https://doi.org/10.4103/0019-5154.193662
Armstrong, R. A. (2019). Should Pearson’s Correlation Coefficient Be Avoided? Ophthalmic Physiol. Opt, 39(5), 316–327. https://doi.org/10.1111/opo.12636
Dunn, W. B., Erban, A., Weber, R. J. M., Creek, D. J., Brown, M., Breitling, R. … Viant, M. R. (2013). Mass Appeal: Metabolite Identification in Mass Spectrometry-Focused Untargeted Metabolomics. Metabolomics, 9(1), 44–66. https://doi.org/10.1007/s11306-012-0434-4
Bingol, K., Bruschweiler-Li, L., Li, D., Zhang, B., Xie, M., & Brüschweiler, R. (2016). Emerging New Strategies for Successful Metabolite Identification in Metabolomics. Bioanalysis, 8(6), 557–573. https://doi.org/10.4155/bio-2015-0004
Nash, W. J., & Dunn, W. B. (2019). From Mass to Metabolite in Human Untargeted Metabolomics: Recent Advances in Annotation of Metabolites Applying Liquid Chromatography-Mass Spectrometry Data. TrAC Trends Anal. Chem, 120, 115324. https://doi.org/10.1016/j.trac.2018.11.022
Chaleckis, R., Meister, I., Zhang, P., & Wheelock, C. E. (2019). Challenges, Progress and Promises of Metabolite Annotation for LC-MS-Based Metabolomics. Curr. Opin. Biotechnol, 55, 44–50. https://doi.org/10.1016/j.copbio.2018.07.010
Di Giovanni, N., Meuwis, M. A., Louis, E., & Focant, J. F. (2020). Specificity of Metabolic Colorectal Cancer Biomarkers in Serum through Effect Size. Metabolomics, 16(8). https://doi.org/10.1007/s11306-020-01707-w
Rodriguez-Salas, N., Dominguez, G., Barderas, R., Mendiola, M., García-Albéniz, X., Maurel, J., & Batlle, J. F. (2017). Clinical Relevance of Colorectal Cancer Molecular Subtypes. Crit. Rev. Oncol. Hematol, 109, 9–19. https://doi.org/10.1016/j.critrevonc.2016.11.007
Wang, G., Yu, Y., Wang, Y. Z., Wang, J. J., Guan, R., Sun, Y. … Fu, X. L. (2019). Role of SCFAs in Gut Microbiome and Glycolysis for Colorectal Cancer Therapy. J. Cell. Physiol, 234(10), 17023–17049. https://doi.org/10.1002/jcp.28436
La Vecchia, S., & Sebastián, C. (2020). Metabolic Pathways Regulating Colorectal Cancer Initiation and Progression. Semin. Cell Dev. Biol, 98, 63–70. https://doi.org/10.1016/j.semcdb.2019.05.018
Dai, Z., Zhang, J., Wu, Q., Chen, J., Liu, J., Wang, L. … Wang, D. (2019). The Role of Microbiota in the Development of Colorectal Cancer. Int. J. Cancer, 145(8), 2032–2041. https://doi.org/10.1002/ijc.32017
Wan, M. L., Wang, Y., Zeng, Z., Deng, B., Zhu, B. S., Cao, T. … Wu, Q. (2020). Colorectal Cancer (CRC) as a Multifactorial Disease and Its Causal Correlations with Multiple Signaling Pathways. Biosci. Rep, 40(3), BSR20200265. https://doi.org/10.1042/BSR20200265
Dienstmann, R., Vermeulen, L., Guinney, J., Kopetz, S., Tejpar, S., & Tabernero, J. (2017). Consensus Molecular Subtypes and the Evolution of Precision Medicine in Colorectal Cancer. Nat. Rev. Cancer, 17(2), 79–92. https://doi.org/10.1038/nrc.2016.126
Warburg, O. (1956). On the Origin of Cancer Cells. Science, 123(3191), 309–314. https://doi.org/10.1126/science.123.3191.309
Qiu, Y., Cai, G., Su, M., Chen, T., Zheng, X., Xu, Y. … Jia, W. (2009). Serum Metabolite Profiling of Human Colorectal Cancer Using GC-TOFMS and UPLC-QTOFMS. J. Proteome Res, 8(10), 4844–4850. https://doi.org/10.1021/pr9004162
Qiu, Y., Cai, G., Su, M., Chen, T., Liu, Y., Xu, Y. … Jia, W. (2010). Urinary Metabonomic Study on Colorectal Cancer. J. Proteome Res, 9(3), 1627–1634. https://doi.org/10.1021/pr901081y
Seyfried, T. N., & Shelton, L. M. (2010). Cancer as a Metabolic Disease. Nutr. Metab. (Lond), 7(1), 7. https://doi.org/10.1186/1743-7075-7-7
Zhu, J., Djukovic, D., Deng, L., Gu, H., Himmati, F., Chiorean, E. G., & Raftery, D. (2014). Colorectal Cancer Detection Using Targeted Serum Metabolic Profiling. J. Proteome Res, 13(9), 4120–4130. https://doi.org/10.1021/pr500494u
Monedeiro, F., Monedeiro-Milanowski, M., Ligor, T., & Buszewski, B. (2020). A Review of GC-Based Analysis of Non-Invasive Biomarkers of Colorectal Cancer and Related Pathways. Journal of Clinical Medicine. https://doi.org/10.3390/jcm9103191
Eylem, C. C., Yilmaz, M., Derkus, B., Nemutlu, E., Camci, C. B., Yilmaz, E. … Emregul, E. (2020). Untargeted Multi-Omic Analysis of Colorectal Cancer-Specific Exosomes Reveals Joint Pathways of Colorectal Cancer in Both Clinical Samples and Cell Culture. Cancer Lett, 469, 186–194. https://doi.org/10.1016/j.canlet.2019.10.038
Dias, D. A., & Koal, T. (2016). Progress in Metabolomics Standardisation and Its Significance in Future Clinical Laboratory Medicine. EJIFCC, 27(4), 331–343
Serra, A., Coretto, P., Fratello, M., & Tagliaferri, R. (2018). Robust and Sparse Correlation Matrix Estimation for the Analysis of High-Dimensional Genomics Data. Bioinformatics, 34(4), 625–634. https://doi.org/10.1093/bioinformatics/btx642
Bujang, M. A., & Baharum, N. (2016). Sample Size Guideline for Correlation Analysis. World J. Soc. Sci. Res, 3, 37. https://doi.org/10.22158/wjssr.v3n1p37
Hulley, S. B., Cummings, S. R., Browner, W. S., Grady, D., & Newman, T. B. (2013). Designing Clinical Research: An Epidemiologic Approach (4th ed.). Philadelphia, PA: Lippincott Williams & Wilkins
Yarkoni, T. (2009). Big Correlations in Little Studies: Inflated fMRI Correlations Reflect Low Statistical Power—Commentary on Vul et al. (2009). Perspect. Psychol. Sci, 4(3), 294–298. https://doi.org/10.1111/j.1745-6924.2009.01127.x
Rakotomalala, R. (2005). TANAGRA: Une Plate-Forme d’Expérimentation pour la Fouille de Données. Rev. Modul, 70–85
Xia, J., Sinelnikov, I. V., Han, B., & Wishart, D. S. (2015). MetaboAnalyst 3.0–Making Metabolomics More Meaningful. Nucleic Acids Res, 43(W1), W251–W257. https://doi.org/10.1093/nar/gkv380
Shannon, P., Markiel, A., Ozier, O., Baliga, N. S., Wang, J. T., Ramage, D. … Ideker, T. (2003). Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res, 13(11), 2498–2504. https://doi.org/10.1101/gr.1239303
Tufte, E. R. (2001). The Visual Display of Quantitative Information. Cheshire, Conn: Graphics Press
van der Ark, L. A., & van Aert, R. C. M. (2015). Comparing Confidence Intervals for Goodman and Kruskal’s Gamma Coefficient. J. Stat. Comput. Simul, 85(12), 2491–2505. https://doi.org/10.1080/00949655.2014.932791
Gorrie, C. Three ways to detect outliers. http://colingorrie.github.io/outlier-detection.html (accessed 2020-05-19)
Miller, J. N., & Miller, J. C. (2010). Statistics and Chemometrics for Analytical Chemistry (6th ed.). Prentice Hall
Leys, C., Ley, C., Klein, O., Bernard, P., & Licata, L. (2013). Detecting Outliers: Do Not Use Standard Deviation around the Mean, Use Absolute Deviation around the Median. J. Exp. Soc. Psychol, 49(4), 764–766. https://doi.org/10.1016/j.jesp.2013.03.013
Grubbs, F. E., & Beck, G. (1972). Extension of Sample Sizes and Percentage Points for Significance Tests of Outlying Observations. Technometrics, 14(4), 847–854. https://doi.org/10.1080/00401706.1972.10488981
Rorabacher, D. B. (1991). Statistical Treatment for Rejection of Deviant Values: Critical Values of Dixon’s “Q” Parameter and Related Subrange Ratios at the 95% Confidence Level. Anal. Chem, 63(2), 139–146. https://doi.org/10.1021/ac00002a010
Dekking, F. M., Kraaikamp, C., Lopuhaä, H. P., & Meester, L. E. (2005). A Modern Introduction to Probability and Statistics (1st ed.). London: Springer-Verlag. https://doi.org/10.1007/1-84628-168-7
Iglewicz, B., & Hoaglin, D. C. (1993). How to Detect and Handle Outliers. Milwaukee, Wis.: ASQC Quality Press
Dunn, W. B., Broadhurst, D., Begley, P., Zelena, E., Francis-Mcintyre, S., Anderson, N. … Goodacre, R. (2011). Procedures for Large-Scale Metabolic Profiling of Serum and Plasma Using Gas Chromatography and Liquid Chromatography Coupled to Mass Spectrometry. Nat. Protoc, 6(7), 1060–1083. https://doi.org/10.1038/nprot.2011.335
Borenstein, M., Hedges, L. V., Higgins, J. P. T., & Rothstein, H. R. (2009). Introduction to Meta-Analysis. https://doi.org/10.1002/9780470743386
Student (1908). The Probable Error of a Mean. Biometrika, 6(1), 1–25. https://doi.org/10.2307/2331554
Broadhurst, D. I., & Kell, D. B. (2006). Statistical Strategies for Avoiding False Discoveries in Metabolomics and Related Experiments. Metabolomics, 2(4), 171–196. https://doi.org/10.1007/s11306-006-0037-z
Benjamini, Y., & Yekutieli, D. (2001). The Control of the False Discovery Rate in Multiple Testing under Dependency. Ann. Stat, 29(4), 1165–1188
Hyötyläinen, T. (2010). Analytical Methodologies Utilized in the Search for Chronic Disease Biomarkers. Bioanalysis, 2(5), 919–923. https://doi.org/10.4155/bio.10.38
Wishart, D. S., Feunang, Y. D., Marcu, A., Guo, A. C., Liang, K., Vázquez-Fresno, R. … Scalbert, A. (2018). HMDB 4.0: The Human Metabolome Database for 2018. Nucleic Acids Res, 46(D1), D608–D617. https://doi.org/10.1093/nar/gkx1089
Kanehisa, M., Furumichi, M., Tanabe, M., Sato, Y., & Morishima, K. (2017). KEGG: New Perspectives on Genomes, Pathways, Diseases and Drugs. Nucleic Acids Res, 45(D1), D353–D361. https://doi.org/10.1093/nar/gkw1092
Motulsky, H. J. (2019). Biostatistique (3rd ed.; M. Vanthemsche, Trans.). Bruxelles: Deboeck

No competing interests reported.

SI.pdf
floatimage4.png
TOC

Editorial decision: Major revision
30 Jan, 2023
Reviews received at journal
06 Oct, 2022
Reviewers agreed at journal
27 Sep, 2022
Reviewers invited by journal
19 Apr, 2022
Submission checks completed at journal
16 Apr, 2022
Editor assigned by journal
16 Apr, 2022
First submitted to journal
15 Apr, 2022
