🚧 This project is very much under development - use at your own risk! 🚧
RAC++ - Reciprocal Agglomerative Clustering (in C++). Performs a parallelized, bottom up Agglomerative clustering by pairing and merging reciprocal nearest neighbors. This allows RAC++ to scale to much larger datasets than traditional agglomerative clustering while keeping the results (almost always) the same.
All the positive attributes of Agglomerative clustering remain with RAC++ as well. It produces meaningful clusters with little parameter tuning and can create whole taxonomies as well.
Based on the paper:
@article{DBLP:journals/corr/abs-2105-11653,
author = {Baris Sumengen and Anand Rajagopalan and Gui Citovsky and David Simcha and Olivier Bachem and Pradipta Mitra and Sam Blasiak and Mason Liang and Sanjiv Kumar},
title = {Scaling Hierarchical Agglomerative Clustering to Billion-sized Datasets},
journal = {CoRR},
volume = {abs/2105.11653},
year = {2021},
url = {https://arxiv.org/abs/2105.11653},
}
Porter Hunley — @porterehunley
Daniel Frees — @danielfrees
RAC++ is available via PyPI for Python 3.8 and up on all major platforms. Install it via pip:
pip install racplusplus
The RAC++ API is very similar to traditional Agglomerative clustering:
import racplusplus
X = np.random.random((10000, dim))
labels = racplusplus.rac(
X, max_merge_distance=.24, connectivity=None, batch_size=1000, no_processors=8, distance_metric="cosine"
)
It should be noted that only the Average linkage method is available as of writing this Readme.
Xthe array of pointsmax_merge_distancethe merge thresholdconnectivitythe optional connectivity matrix -- must be symmetricbatch_sizethe batch sized used to calculate the distance matrix. Pick a number large enough for fast results but small enough to not overload your memory.no_processorsthe number of threads you want it to use, should be less than or equal to the number of cores available on your machine.distance_metriceither "cosine" or "euclidean"
As of right now, returning the whole tree is not yet available.
RAC++ is designed to scale Agglomerative clustering to much larger datasets. It runs significantly faster than traditional Agglomerative clustering and scales better as well. Right now, RAC++ can run just fine on datasets in the hundreds of thousands, even in very high dimensions. We expect that to grow significantly as we add options to optimize towards a linear runtime.
Here are some benchmarking examples: Benchmarking
Results RAC++ produces the exact same results as Agglomerative clustering when the points are fully connected.
Even if the connectivity is limited, the results are almost always the same or a tad off. However, there are some outlier cases when the results can differ wildly with limited connectivity, so it's a good idea to check the results visually with subsample of data.
We're aiming to recreate as many features from traditional agglomerative clustering as is feasible for the RAC algorithm.
| Feature Name | Status |
|---|---|
| Average Linkage | ✅ |
| Cosine distance | ✅ |
| Euclidean distance | ✅ |
| Ward Linkage | 🚧 |
| Complete Linkage | 🚧 |
| External distance matrix | ❌ |
| Single Linkage | ❌ |
| Returning dendrogram | ❌ |
| Pre-set cluster input | ❌ |