<h1>Scaling Agglomerative Clustering for Big Data</h1>

<p>Agglomerative clustering is one of the best clustering tools in data science, but traditional implementations fail to scale to large datasets.</p>
<p>In this article, I will take you through some background on agglomerative clustering, an introduction to reciprocal agglomerative clustering (RAC) based on 2021 research from Google, a runtime comparison between <code>RAC++</code> and <code>scikit-learn</code>&rsquo;s AgglomerativeClustering, and finally a brief explanation of the theory behind RAC.</p>
<h2>Background on Agglomerative Clustering</h2>
<p>In data science, it is frequently useful to cluster unlabeled data. With applications ranging from grouping search engine results, to genotype classification, to banking anomaly detection, clustering is an essential piece of a data scientist&rsquo;s toolkit.</p>
<p>Agglomerative clustering is one of the most popular clustering approaches in data science, and for good reason. It:</p>
<ul>
<li>Is easy to use with little to no parameter tuning</li>
<li>Creates meaningful taxonomies</li>
<li>Works well with high-dimensional data</li>
<li>Does not require knowing the number of clusters beforehand</li>
<li>Creates the same clusters every time</li>
</ul>
<p>In comparison, partitioning methods like <code>K-Means</code> require the data scientist to guess the number of clusters, the very popular density-based method <code>DBSCAN</code> requires parameters for the density-calculation radius (epsilon) and the minimum neighborhood size, and <code>Gaussian mixture models</code> make strong assumptions about the underlying distribution of the cluster data.</p>
<p>With agglomerative clustering, all you need to specify is a distance metric.</p>
<p><a href="https://towardsdatascience.com/scaling-agglomerative-clustering-for-big-data-an-introduction-to-rac-fb26a6b326ad">Website</a></p>
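To make the "all you need is a distance metric" point concrete, here is a minimal, illustrative sketch of classic agglomerative clustering in pure Python. This is not the RAC algorithm and not scikit-learn's implementation; the function names, the single-linkage criterion, and the sample points are all my own assumptions for demonstration, and the naive nested loops are deliberately simple (roughly cubic time), which is exactly the scaling problem the article is about.

```python
import math

def euclidean(a, b):
    # The one choice the user makes: a distance metric.
    return math.dist(a, b)

def agglomerative(points, n_clusters, dist=euclidean):
    """Naive single-linkage agglomerative clustering (illustrative sketch)."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        # Find the pair of clusters with the smallest single-linkage
        # distance (minimum distance over all cross-cluster point pairs).
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Greedily merge the closest pair; speeding up this merge loop
        # is the bottleneck that motivates approaches like RAC.
        clusters[i].extend(clusters[j])
        del clusters[j]
    return clusters

# Two obvious groups of two points each.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(agglomerative(points, 2))
```

Swapping in a different metric (say, Manhattan distance) requires only passing a different `dist` function; no other tuning is needed, which is the ease-of-use advantage listed above.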