
Curious to know what value you've seen out of these clusters. In my experience k-means clustering was very lackluster, and having to define the number of clusters up front was a big pain point too.

You almost certainly want a graph-like structure (overlapping communities rather than clusters).

But unsupervised clustering was almost entirely ineffective for every use case I had :/




I only got the clustering working this morning, so aside from playing around with it a bit I've not had any results that have convinced me it's a tool I should throw at lots of different problems.

I mainly like it as another example of the kind of things you can use embeddings for.

My implementation is very naive - it's just this:

    sklearn.cluster.MiniBatchKMeans(n_clusters=n, n_init="auto")
I imagine there are all kinds of improvements that could be made to this kind of thing.
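
To make that concrete, roughly what it looks like end to end (a minimal sketch - the embeddings array and n here are just placeholders, not the actual llm-cluster code):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Stand-in for real embedding vectors, one row per document
    embeddings = np.random.rand(100, 1536)
    n = 10  # number of clusters, picked up front

    kmeans = MiniBatchKMeans(n_clusters=n, n_init="auto")
    labels = kmeans.fit_predict(embeddings)  # cluster id for each document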

I'd love to understand if there's a good way to automatically pick an interesting number of clusters, as opposed to picking a number at the start.

https://github.com/simonw/llm-cluster/blob/main/llm_cluster....


There are iterative methods for optimizing the number of clusters in k-means (silhouette and knee/elbow are common), but in practice I prefer density-based methods like HDBSCAN and OPTICS. There's a very basic visual comparison at https://scikit-learn.org/stable/auto_examples/cluster/plot_c....
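
For example, a minimal HDBSCAN sketch (scikit-learn gained a built-in HDBSCAN in 1.3; the standalone hdbscan package works much the same way):

    import numpy as np
    from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3

    embeddings = np.random.rand(200, 64)  # stand-in for real embedding vectors

    # No n_clusters to pick: the number of clusters falls out of the density structure
    labels = HDBSCAN(min_cluster_size=5).fit_predict(embeddings)
    # A label of -1 means "noise": the point wasn't assigned to any cluster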


You could also use a Bayesian version of k-means. It applies a Dirichlet process as a prior over an infinite (truncated) set of clusters, so the most probable number of clusters k is found automatically. I found one implementation here: https://github.com/vsmolyakov/DP_means

Alternatively, there is a Bayesian GMM in sklearn. If you restrict it to diagonal covariance matrices, it should be fine even in high dimensions.
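
Roughly something like this with sklearn's BayesianGaussianMixture (a sketch - n_components is only an upper bound / truncation level, not the final count, and the embeddings here are placeholders):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    embeddings = np.random.rand(500, 32)  # stand-in for real embedding vectors

    # Dirichlet-process prior over a truncated set of components;
    # diagonal covariances keep it tractable in high dimensions
    bgmm = BayesianGaussianMixture(
        n_components=30,  # upper bound on the number of clusters, not the answer
        covariance_type="diag",
        weight_concentration_prior_type="dirichlet_process",
    )
    labels = bgmm.fit_predict(embeddings)
    effective_k = (bgmm.weights_ > 1e-2).sum()  # components with non-negligible weight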


Having close centers might help with the labeling. Let me know if I can help.


Switch to using HDBSCAN. It's good.


The elbow method is a good place to start for finding the number of clusters.
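
Something like this, sketched out: fit over a range of k and look for where the inertia curve stops dropping sharply (packages like kneed try to locate the bend automatically).

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    embeddings = np.random.rand(300, 64)  # stand-in for real embedding vectors

    # Record inertia (within-cluster sum of squares) for each k;
    # the "elbow" where the curve flattens is a reasonable choice
    for k in range(2, 21):
        km = MiniBatchKMeans(n_clusters=k, n_init="auto").fit(embeddings)
        print(k, round(km.inertia_, 1))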


That's a useful hint, thanks. I fed it through GPT-4 and got some interesting leads: https://chat.openai.com/share/400f76ae-b53b-4d07-ac31-adcef2... and https://chat.openai.com/share/48650db8-5a29-49c5-84b2-574f53...


Use bottom-up (agglomerative) clustering and you get the whole tree: fclusterdata in scipy.
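
Rough sketch (fclusterdata is the one-call convenience wrapper; linkage plus fcluster gives you the full tree to cut at whatever level you like - the embeddings and cluster count here are placeholders):

    import numpy as np
    from scipy.cluster.hierarchy import fclusterdata, linkage, fcluster

    embeddings = np.random.rand(200, 64)  # stand-in for real embedding vectors

    # One call: agglomerative clustering cut into at most 10 flat clusters
    labels = fclusterdata(embeddings, t=10, criterion="maxclust",
                          method="average", metric="cosine")

    # Or build the full tree and cut it later
    Z = linkage(embeddings, method="average", metric="cosine")
    labels_again = fcluster(Z, t=10, criterion="maxclust")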



