The widespread application of high-throughput sequencing for immune repertoire analyses is generating vast amounts of data, which represents a challenge with respect to data analysis. Clustering methods to identify clonally related sequences commonly rely on pairwise comparison of nucleotide sequences or k-mers, which is computationally intensive and potentially prohibitive for large datasets. Instead of determining the genetic distance between all sequences of a given dataset directly, Anchor Clustering partitions sequences before performing pairwise distance analyses. This is achieved by selecting a set of maximally spaced sequences as anchors and triangulating the relative position of each remaining sequence with respect to the anchor reference points. The resulting distance vectors are then clustered using unsupervised machine learning. This strategy significantly reduces the number of pairwise comparisons resulting in faster runtimes and reduced memory requirements compared to existing methods.
A key consideration for determining the optimal parameter settings for Point Packing was to minimize runtime while ensuring a high quality of clustering. The minimum distance ratio, which defines the minimum distance between anchor sequences, had a significant impact on these performance metrics. A complicating factor in optimizing this parameter was that the number of generated anchors, is dependent on the size of a given dataset and the relatedness of its sequences. More closely related sequences require a lower minimum distance ratio than more distantly related sequences to obtain a given number of anchors. Increasing the minimum distance ratio between anchors resulted in fewer anchors and hence fewer sequence-to-anchor comparisons, which reduced total run-time. However, this also resulted in larger subclusters, which increased the number of partitioning cycles required to split a dataset into sufficiently small clusters for pairwise sequence comparison. In addition, for the smallest dataset M3-1, no anchors were obtained using a distance ratio of 0.9, suggesting that the use of high distance ratios may render the algorithm inapplicable for datasets of limited size or closely related sequences. Consequently, optimization of parameter settings for Point Packing requires the consideration of the entire workflow and reflects a balance between optimizing anchor selection by Point Packing and sequence partitioning by unsupervised clustering.
Following anchor selection, each nucleotide sequence is converted into a distance vector that reflects its genetic distance to all anchor sequences. In this study, we chose Hamming distance as a distance metric, which is straightforward to compute but requires prior partitioning of sequences based on junctional length. While the latter allowed us to improve speed by multi-threading, this approach precludes detection of similar sequences with differing length. This contrasts with methods such as the Alignment free method, which is based on k-mer frequencies and hence capable of clustering similar sequences with differing length. Using a different distance metric such as the Levenshtein distance could potentially alleviate this shortfall but was not explored in this study.
After conversion into a distance vector, the BIRCH algorithm was used to partition the dataset into subclusters. The choice of this clustering method was based its superior performance compared to other unsupervised clustering methods in initial trials. However, it is possible that other clustering methods could result in an even better performance if parameters are further optimized. An adaptation that significantly boosted performance was to only use a fraction of the data to construct the BIRCH tree model.
In addition to grouping sequences based on VJ usage after clustering, we assessed the effect of grouping sequences based on VJ usage before clustering. The potential advantage of sequence grouping pre-clustering is that the data is split into smaller subsets that are more amenable to pairwise comparison. Interestingly, this strategy yielded small but insignificant improvements in runtime with similar clustering quality.
The most significant advantage of Anchor Clustering over existing methods is that it facilitates the clustering of millions of sequences with minor hardware and software requirements. The fact that datasets with more than one million sequences using existing methods could not be clustered with the computational means used in this study may be due to the limited computational power used in this study but illustrates the potential of Anchor Clustering for large datasets. It might provide a tool to explore the immune repertoire space across a large number of individuals or species.
There are several limitations to this study. First, Anchor Clustering requires the definition of various hyperparameters that might affect performance. While we attempted to characterize a broad range of settings using synthetic and biological datasets, it is unlikely that our findings are universally applicable and individual datasets might require further refinement of hyperparameters. Second, the final step of Anchor Clustering still relies on pairwise comparison of nucleotide sequences, which could hamper the analysis of very large datasets. Third, Anchor clustering has only been tested on BCR nucleotide sequences. The use of Anchor Clustering on amino acid data will likely require modifications, such as the consideration of physicochemical features of amino acids.