Performance Determinants of Unsupervised Clustering Methods for Microbiome Data

doi:10.21203/rs.3.rs-151018/v2


Methodology


This work is licensed under a CC BY 4.0 License

Journal publication: published 04 Feb, 2022 in Microbiome. This page is the latest preprint version (version 2).

Background: In microbiome data analysis, unsupervised clustering is often used to identify naturally occurring clusters, which can then be assessed for associations with characteristics of interest. In this work, we systematically compared beta diversity and clustering methods commonly used in microbiome analyses. We applied these to four published datasets in which sample groups showed highly distinct microbiome profiles, as well as a clinical dataset with less clear separation between groups.

Results: Although no single method consistently outperformed the others, we identified key scenarios in which certain methods underperform. Specifically, the Bray-Curtis (BC) metric resulted in poor clustering in a dataset where high-abundance OTUs were relatively rare. In contrast, the unweighted UniFrac (UU) metric clustered poorly on a dataset with a high prevalence of low-abundance OTUs. To test whether these dataset properties explain the poor performance, we systematically modified them in the poorly performing datasets and found that doing so improved BC and UU performance. Based on these observations, we rationally combined BC and UU to generate a novel metric. We tested its performance while varying the relative contributions of each metric and also compared it with another combined metric, the generalized UniFrac distance. The proposed metric showed high performance across all datasets.

Conclusions: Our systematic evaluation of clustering performance in these five datasets demonstrates that no existing clustering method performs best across all datasets. We propose a combined metric of BC and UU that capitalizes on the complementary strengths of the two metrics.
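The abstract does not state how BC and UU are combined, so the following is only a minimal sketch under one assumption: that the two distance matrices are mixed as a convex combination with a weight `alpha`. The function names, the direction of the weighting, the example counts, and the placeholder UU values are all hypothetical and are not taken from the paper; in practice the UU matrix would come from a phylogeny-aware tool rather than being typed in by hand.

```python
from typing import List, Sequence

Matrix = List[List[float]]


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    # Two empty samples are treated as identical (a convention chosen here).
    return numerator / denominator if denominator else 0.0


def bray_curtis_matrix(samples: Sequence[Sequence[float]]) -> Matrix:
    """Pairwise Bray-Curtis dissimilarities for a samples-by-OTUs table."""
    return [[bray_curtis(s, t) for t in samples] for s in samples]


def combine_distances(bc: Matrix, uu: Matrix, alpha: float) -> Matrix:
    """ASSUMED form of a combined metric: alpha * BC + (1 - alpha) * UU."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [
        [alpha * b + (1.0 - alpha) * u for b, u in zip(bc_row, uu_row)]
        for bc_row, uu_row in zip(bc, uu)
    ]


# Hypothetical OTU counts for three samples (rows) and three OTUs (columns).
counts = [[10, 0, 5], [8, 2, 5], [0, 20, 1]]
bc = bray_curtis_matrix(counts)

# Placeholder unweighted UniFrac distances; real values require a tree.
uu = [[0.0, 0.2, 0.9], [0.2, 0.0, 0.8], [0.9, 0.8, 0.0]]

combined = combine_distances(bc, uu, alpha=0.5)
```

Under this assumed form, `alpha = 1` recovers BC and `alpha = 0` recovers UU, so sweeping `alpha` between the two mirrors the idea of varying the relative contribution of each metric.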

General Microbiology

Keywords: unsupervised clustering, Bray-Curtis distance, unweighted UniFrac distance, beta diversity

S1.pdf
Figure S1: An illustrative plot of commonly used clustering methods
S2.pdf
Figure S2: Heatmap of the most abundant 300 OTUs for the four example datasets. This figure shows the most abundant 300 OTUs for the four published datasets, plotted on a log10 scale. Red indicates high abundance, whereas white indicates low abundance. As shown in the plot, the abundances of the "high-abundance" OTUs in the Schnorr dataset are lower than those in the other three datasets. Gray and blue indicate the different clusters in each dataset.
S3.pdf
Figure S3: Heatmap of the most abundant 300 OTUs for the Schnorr dataset with 0, 19, and 29 levels trimmed off. This figure plots the most abundant 300 OTUs of the Schnorr dataset with 0, 19, and 29 levels trimmed off. As trimming proceeds, the abundant OTUs aggregate sequences from distant OTUs. In each subpanel, the gray left part is the Italian sample set, while the blue right part is the Hadza sample set.
S4.pdf
Figure S4: Heatmap of the most abundant 300 OTUs of the Martínez dataset, with descendants. This figure plots the most abundant 300 OTUs of the Martínez dataset with their first-, second-, and third-generation descendants. As the tree branches diverge, fewer sequences are left in the most abundant 300 OTUs. In each subpanel, the gray left part is the Papua sample set, while the blue right part is the US sample set.
S5.pdf
Figure S5: Converting low-abundance OTUs to 0s improves the performance of UU for the Gopalakrishnan dataset. (A) Shannon diversity of the dataset decreases as more OTUs are converted to 0s. (B) The Rand index rises as OTUs are converted to 0s; however, excessive count removal eventually degrades the performance of UU.
S6.pdf
Figure S6: Rand indices with different α values for the proposed metric and comparison with the generalized UniFrac metric. This figure shows the Rand indices with different values of α (0.2, 0.4, 0.6, 0.8) for the proposed metric and compares the results with the generalized UniFrac metric under different values of its parameter (0, 0.25, 0.5, 0.75).
S7.pdf
Figure S7: Performance of other metrics. The figure shows the performance of other, less common metrics provided by QIIME2. Among them, the city-block distance and the species-by-species Euclidean distance are modifications of the Euclidean distance, while the Jaccard and Canberra distances are similar to the Bray-Curtis distance. As pointed out in the vegan package manual [27], Euclidean and Manhattan distances are not good at separating groups. More details of beta diversity metrics can be found in QIIME2 [23].
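The Rand index reported in Figures S5 and S6 scores how well a clustering agrees with the known sample groups. As a minimal sketch, the plain (unadjusted) Rand index can be computed as the fraction of sample pairs on which the two labelings agree; whether the paper uses this form or an adjusted variant is not stated on this page, and the function name and toy labels below are hypothetical.

```python
from itertools import combinations
from typing import Hashable, Sequence


def rand_index(labels_true: Sequence[Hashable], labels_pred: Sequence[Hashable]) -> float:
    """Plain Rand index: share of sample pairs treated consistently.

    A pair counts as an agreement when both labelings put the two samples
    in the same group, or both put them in different groups.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        raise ValueError("need at least two samples")
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Toy example: four samples, one of which is assigned to the wrong cluster.
known_groups = ["Italian", "Italian", "Hadza", "Hadza"]
cluster_ids = [0, 0, 1, 0]
score = rand_index(known_groups, cluster_ids)
```

Because only pair co-membership matters, the index is unaffected by how the clusters happen to be numbered.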


Editorial decision: Accept (15 Nov, 2021)
Review #2 received at journal (26 Oct, 2021)
Review #1 received at journal (17 Oct, 2021)
Reviews received at journal (22 Sep, 2021)
Reviewer #2 agreed at journal (21 Sep, 2021)
Reviewers invited by journal (21 Sep, 2021)
Reviewer #1 agreed at journal (21 Sep, 2021)
Editor assigned by journal (13 Sep, 2021)
First submitted to journal (13 Sep, 2021)
Submission checks completed at journal (12 Sep, 2021)
Editor invited by journal (12 Sep, 2021)
