Combined Beta Metric for Unsupervised Clustering of Microbiome Data

Background: In Microbiome data analysis, unsupervised clustering is often used to identify naturally occurring clusters, which can then be assessed for associations with characteristics of interest. In this work, we systematically compared beta diversity and clustering methods commonly used in microbiome analyses. We applied these to four published datasets where highly distinct microbiome proﬁles could be seen between sample groups. Results: Although no single method outperformed the others consistently, we did identify key scenarios where certain methods can underperform. Speciﬁcally, the Bray Curtis metric resulted in poor clustering in a dataset where high-abundance OTUs were relatively rare. In contrast, the unweighted UniFrac metric clustered poorly when used on a dataset with a high prevalence of low-abundance OTUs. To test our proposition, we systematically modiﬁed properties of the poorly performing datasets and found that this approach resulted in improved Bray Curtis and unweighted UniFrac performance. Conclusions: Based on these observations, we rationally combined the Bray Curtis metric and the unweighted UniFrac metrics and found that this new beta diversity metric showed high performance across all datasets.


Background
The development of next generation sequencing in the past decade has increased access for researchers to analyze microbial communities [1,2]. A common method involves deep sequencing of 16S rRNA genes and grouping bacteria at a certain level of 16S rRNA gene similarity [3]. The highest resolution bacterial sequence is referred to as an operational taxonomic unit (OTU). A challenge with microbiome analyses is that these sequences are derived from bacteria that can be incredibly diverse, and simultaneously have a relatedness structure that is internally associated, both genetically and phenotypically. Metrics to account for phylogenetic closeness have been proposed specifically for microbiome data [4,5,6].
Microbiome data analysis can suffer from overgeneralization (e.g., comparing observations by alpha diversity measures such as Shannon [7] or Simpson [8] diversity) or overspecification (e.g., performing association tests for each taxon, which incurs a heavy penalty for multiple testing). A useful and common strategy to address both of these limitations is to first apply an unsupervised clustering methodology.
After considering the microbiome data as a whole and identifying naturally occurring clusters of samples, these clusters can then be assessed for associations with sample characteristics of interest. Previous studies, for instance, have shown that human stool microbiome samples naturally form clusters that are associated with dietary and geographic factors [9,10]. A problem that often confuses researchers is that clustering performance results often vary depending on the algorithm or the beta diversity metrics used, observed previously by Koren, et al. [11] and Claesson, et al. [12].
When performing microbiome sample clustering, both model-based methods and machine learning methods have been used. Machine learning methods, which rely on defined distance metrics, are used more frequently than model-based statistical methods, due to their efficient implementation and easy interpretation. In this paper, we focused on the Partition Around Medoids (PAM) [13] clustering method, which is related to but considered more robust than K-means. In contrast to Kmeans, which can be sensitive to the effects of outliers, PAM's optimization goal is to minimize the sum of distances to the medoids instead of minimizing the sum of the squared distances to the cluster centers. We also tested another frequently used method for non-Euclidean metrics, hierarchical clustering with complete linkage.
This method initially bundles the closest observations (with distances defined by the longest distance between any two observations in two clusters), and gradually results in a binary tree that combines the two closest clusters.
Supplemental Figure S1 illustrates differences between these clustering methods.
The choice of clustering algorithm and distance metric together determine the per-formance of a machine learning clustering method. The metrics considered in this study are commonly used and include the Bray-Curtis dissimilarity metric [14], the unweighted UniFrac distance [4], the weighted UniFrac distance [5], and the Aitchison distance [15], which is a Euclidean distance quantified following centered log ratio (CLR) transformation of abundances.
In addition to PAM and hierarchical clustering, we also included a third approach, the model-based Dirichlet Multinomial model (DMM) [16]. This algorithm assumes that observations of the same cluster come from the same multinomial distribution, and that parameters of the multinomial distribution have a Dirichlet distribution. This model better captures overdispersion than simply assuming a multinomial.
In this paper, we systematically compared methods for clustering microbiome observations from four published studies with either geographical or seasonal variables as the true cluster label, which enables biological interpretation of the group separation. We first applied clustering with five methods, and assessed performance of the various methods using the adjusted Rand index with the true clustering assignment. The adjusted Rand index is based on a pairwise membership agreement and is corrected for the expected value, where a score of 0 is expected with random clustering, and a score of 1 indicates perfect clustering. We then explored relationships between differences in performance with properties of the datasets considered. With the findings obtained from the method comparisons, we proposed a novel combined metric that provided high performance for all four datasets.

Result
We used four published stool microbiome datasets to test the performance of the clustering methods. A brief summary of these datasets is shown in Table 1. Each study comprises two groups of samples with relatively distinct microbiome profiles.
The dataset Shannon diversity is the sum of Shannon diversities over all samples in the dataset, and details related to this alpha diversity are provided in a later section. The percent of sequences in high abundance OTUs is the percent of sequences in OTUs with average abundance greater than 0.001. While the individual studies varied in their 16S rRNA gene deep sequencing methods, we processed the downloaded raw sequences using the same pipeline, including VSEARCH [17] to dereplicate and remove singletons and the UNOISE2 function in usearch [18] to identify OTUs. More information about the four example datasets is provided in Supplemental Table S1. A heatmap of the high abundance OTUs can also be found in Supplemental Figure S2.
The first three studies compare cohorts separated by geography. The De Filippo The fourth study, Smits et al. [20], comprises fecal samples longitudinally collected from a cohort of Hadza. Their diets are significantly affected by seasonal food availability, with more berries and honey during the wet season and more meat in the dry season. In this study, the authors found that some taxa disappear when certain foods become scarce and reappear when the seasons change. For the purpose of unsupervised learning, we used samples from the two most distinct seasons, the late dry and the early wet seasons.
The first column in Figure 1 illustrates the adjusted Rand indices of the four datasets, with different colors indicating the beta diversity metrics using PAM clustering or Dirichlet multinomial mixture model (DMM). Because of its limited capacity to handle high-dimensional data, when fitting the DMM model, we binned the OTUs present in less than 20% of the observations into one OTU when calculating Rand indices. Also, due to the fact that Dirichlet multinomial mixture is modelbased, there is no corresponding distance matrix that can be shown in a PCoA plot.
The black bars in the Rand index plots and the last column in the PCoA plots are generated from our newly proposed method introduced later in this paper. Due to the generally similar results between PAM and hierarchical clustering (see Figure   2), we presented results in our main text only using PAM. Among these existing methods, we found that there is no clearly superior one that universally performs well in all four datasets. Interestingly, unweighted UniFrac performs well in three of four datastes, despite lacking information regarding bacterial abundances. Similar observations for presence/absence methods have been made by Martino, et al. [21].
Moreover, the Schnorr dataset has a perfect Rand index with UniFrac distances, but a low Rand index with Bray-Curtis, Aitchison, and DMM, all of which ignore phylogenetic relationships between OTUs. We devote the next section to exploring aspects of these differences in performance.
In this section, we will focus on the underperformance of Bray Curtis for the Schnorr dataset, followed by the underperforamce of unweighted UniFrac for the Smits dataset. Because in reality researchers may not know the information of the true cluster assignment, we will focus on a priori properties of the dataset, rather than the differences between the two assigned clusters.

High Abundance OTUs Drive the Performance of the Bray-Curtis Dissimilarity
For a p×n table with p OTUs and n observations, the Bray-Curtis distance between two observations is calculated with the actual counts, By examining the definition, we find that the dissimilarity of the Bray-Curtis metric is driven by differences in high-abundance rather than low-abundance OTUs. Here we chose to define a metric, to potentially identify poor Bray-Curtis performance by summing the average abundance of "high abundance OTUs", which are those with a mean abundance greater than 0.001, though similar results can be obtained across a range of thresholds.
As shown in the last column of Table 1 and the last column of the Supplemental   Table S1, the sum of average abundance and the high-abundance OTUs for the Schnorr dataset is far below that of the other three datasets. A visual description is also provided as a heatmap in Supplemental Figure S3. Thus the Schnorr dataset is characterized by both unusually poor clustering performance by Bray Curtis and few high-abundance OTUs (Figure 1).
To calculate this association, we attempted to improve the performance of Bray Curtis by generating novel OTUs with higher mean abundances. We did this by first generating a phylogenetic tree of the aligned OTUs, and then systematically merging OTUs starting with those most distal to the tree root. In essence, we performed "trimming" of branches from the tree, by combining the abundances of the distal OTUs to generate new, more proximal OTUs with higher mean abundances, as shown in Figure 3 (A).
In Figures 3 (B) and (C), the X-axes indicate the number of levels trimmed, ranging from the first most distal branch to, at most, 34 levels proximal. Notably, as we progressively reduced the resolution of microbiome data during the trimming process, reflected by the decreasing Shannon diversity in Figure 3 (D), the PAM clustering showed improved performance. Shannon diversity is a commonly used alpha diversity metric used to quantify the information within a dataset. The Shannon diversity for observation j is defined as: where p ij is the abundance of the ith taxa for observation j. A reasonable explanation for the increase in Rand indices is that the trimming process actually utilizes the tree phylogenetic information, and similar thinking was previously mentioned in Gopalakrishnan et al. [22] and Peled et al. [23]. We encourage researchers to explore their datasets by trimming until the histogram of the pairwise Bray Cur-tis dissimiarities shows a bimodal shape with the tools provided in our R package "MicrobiomeCluster".
As a counter example, we chose to examine in more detail the Martínez dataset for its high clustering performance when using the Bray-Curtis metric. We reasoned that a high degree of discriminating information was being provided by OTUs with high mean abundances. To simulate OTUs with low mean abundances progressively, we took counts assigned to an OTU and assigned those to a new OTU in randomly selected samples from each tip of the original tree to form first-generation descendants, then grew two new branches from each of the newly formed tree tips to obtain the second-generation descendants, and so on. As zero counts do not contribute to the Shannon diversity, we essentially "grew" branches without increasing the information from the original dataset, as shown in Figure 4 (A). Sequences of each OTU were randomly assigned to either of the two daughter branches. As shown in Figure 4

Prevalence of Low Abundance OTUs Inhibits Clustering Performance of the Unweighted UniFrac Distance
Another key observation from our examination of clustering performance in the four datasets is that the unweighted UniFrac distance performs well for three of the datasets, but markedly underperforms in the Smits dataset. This dataset is distinct from the others in that samples were collected at different time points from the same group of individuals, rather than being collected from two distinct groups of subjects. Thus the association of the microbiome is with time (distinct in two seasons), rather than the source. We would anticipate that, during the late dry season, many bacteria more suited for the early wet season will be reduced in abundance but may not be totally eliminated and can recover when the host diet eventually becomes more hospitable, and vice versa. The geographically separated clusters from the other studies are less likely to have an extensive overlap in presence of bacterial taxa, with many taxa being present only in one population and absent in the other.
To explore determinants of when the unweighted UniFrac metric could lead to reduced clustering performance, we focused on the unweighted nature of the measure, where only the presence or absence of an OTU is quantified rather than the abundance of an OTU. In the Smits dataset, we expected that a large number of OTUs are shared by both clusters. To identify a metric that captures the information uncaptured by unweighted UniFrac, we propose using the total Shannon diversity, which is the sum of the Shannon diversities over all the samples of the dataset.
Because zero entries of the OTU table do not contribute to total Shannon diversity, the total Shannon diversity in a dataset quantifies the amount of information uncaptured by unweighted UniFrac metric across samples in that dataset. To examine this, we manipulated the Smits dataset by gradually switching non-zero entries to 0, in essence adjusting the criteria by which an OTU would be considered "present" in a sample, as shown in Figure 5 (A). In both Figures 5 (B) and (D), the x-axis is the threshold going from 0 to 70. During the process, entries with a number of sequences less than or equal to the threshold will be converted to 0. In Figure 5 (B), the performance of unweighted Unifrac goes up with the number of entries converted to 0, whereas the performance of Bray-Curtis only fluctuates slightly. In PCoA also visually depicts improved discrimination of unweighted Unifrac with an increasing threshold. In Figure 5 (C), we plotted the PCoA of the original data, the dataset where entries less than 30 are converted to 0, and the dataset where entries less than 60 are converted to 0. As more non-zero entries are converted to 0, the two clusters separate further in the first coordinate.
As a counter example, we chose to examine in more detail the Martínez  We found that, as percentage of replacement increased, the two groups of samples become progressively less distinct.

Combined Metric Shows Stable Good Performance
With the above findings, it would be appealing to develop a metric that can use the information of both high and low abundance OTUs. An intuitive way of constructing such a new metric is to combine information from both Bray Curtis and unweighted UniFrac metrics, which tend to complement each other. After normalizing pairwise Bray Curtis and unweighted UniFrac metrics by their largest values the two metrics lie between 0 and 1. We define the distance between two observations in the new metric as As shown in the last column of the bar plots and the last column of PCoA plots of the Figure 1, the proposed new metric, which inherits complementary insights from the Bray Curtis and the unweighted UniFrac dissimilarities, results in high performance in Rand indices for all four datasets.

Discussion
A quantitative recommendation regarding when to avoid certain metrics would require a systematic examination of many more datasets. We did observe that in these datasets, hierarchical clustering, though less frequently used with microbiome data, can provide similar and sometimes superior results compared to the more commonly used PAM method. (Figure 2) Unlike PAM clustering, the number of clusters do not have to be pre-determined prior to hierarchical clustering, which is a potential advantage.

Conclusions
Our systematic evaluation of clustering performance in these four datasets show that there is no existing clustering method that universally performs the best across all datasets. In general, the weighted UniFrac outperforms other methods across studies. When OTUs with high mean abundance are rare, clustering methods that do not consider phylogeny (i.e., Bray-Curtis, Aitchison, DMM) perform less well.
In contrast, in a dataset with lots of low abundance OTUs, the unweighted UniFrac tends to produce poor separation, while other methods can still perform well. To capitalize on the complementary strength of the two metrics, we propose a new metric that incorporates the Bray Curtis metric and the unweighted UniFrac metric.          Table S1 The third and fourth columns of this table refer to the skewness and kurtosis of the averaged abundance vector of the whole dataset or each cluster. Skewness is often used as a measure of symmetry, while kurtosis is a measure for