Want something better than k-means? I'm happy to announce our SOTA k-medoids algorithm from NeurIPS 2020, BanditPAM, is now publicly available! `pip install banditpam` or `install.packages("banditpam")` and you're good to go!
k-means is one of the most widely-used algorithms to cluster data. However, it has several limitations: a) it requires the use of L2 distance for efficient clustering, which also b) restricts the data you're clustering to be vectors, and c) doesn't require the means to be datapoints in the dataset.
Unlike in k-means, the k-medoids problem requires cluster centers to be actual datapoints, which permits greater interpretability of your cluster centers. k-medoids also works better with arbitrary distance metrics, so your clustering can be more robust to outliers if you're using metrics like L1. Despite these advantages, most people don't use k-medoids because prior algorithms were too slow.
In our NeurIPS 2020 paper, BanditPAM, we sped up the best known algorithm from O(n^2) to O(nlogn) by using techniques from multi-armed bandits. We were inspired by prior research that demonstrated many algorithms can be sped up by sampling the data intelligently, instead of performing exhaustive computations.
We've released our implementation, which is pip- and CRAN-installable. It's written in C++ for speed, but callable from Python and R. It also supports parallelization and intelligent caching at no extra complexity to end users. Its interface also matches the sklearn.cluster.KMeans interface, so minimal changes are necessary to existing code.
PyPI: https://pypi.org/project/banditpam
CRAN: https://cran.r-project.org/web/packages/banditpam/index.html
Repo: https://github.com/motiwari/BanditPAM
Paper: https://arxiv.org/abs/2006.06856
If you find our work valuable, please consider starring the repo or citing our work. These help us continue development on this project.
I'm Mo Tiwari (motiwari.com), a PhD student in Computer Science at Stanford University. A special thanks to my collaborators on this project, Martin Jinye Zhang, James Mayclin, Sebastian Thrun, Chris Piech, and Ilan Shomorony, as well as the author of the R package, Balasubramanian Narasimhan.
(This is my first time posting on HN; I've read the FAQ before posting, but please let me know if I broke any rules)
I have a couple of non-technical questions:
1. Can you share some background on how this work developed? I am guessing there were many attempts to improve PAM over the last three decades, right? And in hindsight, bandit-based approach seems like a natural approach to try, right? Did you start with trying to improve PAM and realize no one else thought of a probabilistic/random approach?
2. Once you realized multi-armed bandit approach is the way to go, did implementation of the idea and empirical evaluation take a lot of time? I am guessing most of the effort went to providing complexity guarantees, right?
3. The paper has an interesting set of authors from diverse areas - areas in which the k-medoid problems seems highly relevant. This was partly the reason why I asked question 1. - was the project motivated by the need of such an algorithm in application areas or what is by looking for an area to apply the insight that bandit based approaches can actually perform better.
Overall, I really like the life-cycle of the entire paper. It started with a highly relevant and practical problem, gave an intuitive algorithm that comes with complexity bounds, has an accessible blog post to support the paper, and has what seems to be a very efficient implementation that can directly be used in production at scale. A lot of researchers miss the last part and move on to the next project (I am guilty of that) - kudos to you for spending time on the implementation! If you ever end up at UIUC, I'd love to buy you a coffee (:
PS: I am a grad student at UIUC and was scrolling by and stopped as I saw two familiar names: Ilan (took Random processes with him and loved it) and of course who in robotics wouldn't know Prof. Thrun (for those who don't, his Probabilistic Robotics is a mandatory reference in every robotics class).