
Curious to know what value you've seen out of these clusters. In my experience k-means clustering was very lackluster, and having to define the number of clusters up front was a big pain point too.

You almost certainly want a graph-like structure (overlapping communities rather than clusters).

But unsupervised clustering was almost entirely ineffective for every use case I had :/




I only got the clustering working this morning, so aside from playing around with it a bit I've not had any results that have convinced me it's a tool I should throw at lots of different problems.

I mainly like it as another example of the kind of things you can use embeddings for.

My implementation is very naive - it's just this:

    sklearn.cluster.MiniBatchKMeans(n_clusters=n, n_init="auto")
I imagine there are all kinds of improvements that could be made to this kind of thing.
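
To make that concrete, roughly what it looks like end to end (a minimal sketch - the embeddings array and n here are just placeholders, not the actual llm-cluster code):

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    # Stand-in for real embedding vectors, one row per document
    embeddings = np.random.rand(100, 1536)
    n = 10  # number of clusters, picked up front

    kmeans = MiniBatchKMeans(n_clusters=n, n_init="auto")
    labels = kmeans.fit_predict(embeddings)  # cluster id for each document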

I'd love to understand if there's a good way to automatically pick an interesting number of clusters, as opposed to picking a number at the start.

https://github.com/simonw/llm-cluster/blob/main/llm_cluster....


There are iterative methods for optimizing the number of clusters in k-means (silhouette and knee/elbow are common), but in practice I prefer density-based methods like HDBSCAN and OPTICS. There's a very basic visual comparison at https://scikit-learn.org/stable/auto_examples/cluster/plot_c....
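
For example, a minimal HDBSCAN sketch (scikit-learn gained a built-in HDBSCAN in 1.3; the standalone hdbscan package works much the same way):

    import numpy as np
    from sklearn.cluster import HDBSCAN  # scikit-learn >= 1.3

    embeddings = np.random.rand(200, 64)  # stand-in for real embedding vectors

    # No n_clusters to pick: the number of clusters falls out of the density structure
    labels = HDBSCAN(min_cluster_size=5).fit_predict(embeddings)
    # A label of -1 means "noise": the point wasn't assigned to any cluster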


You could also use a Bayesian version of k-means. It applies a Dirichlet process as a prior over an infinite (truncated) set of clusters, so the most probable number of clusters k is found automatically. I found one implementation here: https://github.com/vsmolyakov/DP_means

Alternatively, there is a Bayesian GMM in sklearn. If you restrict it to diagonal covariance matrices, it should be fine even in high dimensions.
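
Roughly something like this with sklearn's BayesianGaussianMixture (a sketch - n_components is only an upper bound / truncation level, not the final count, and the embeddings here are placeholders):

    import numpy as np
    from sklearn.mixture import BayesianGaussianMixture

    embeddings = np.random.rand(500, 32)  # stand-in for real embedding vectors

    # Dirichlet-process prior over a truncated set of components;
    # diagonal covariances keep it tractable in high dimensions
    bgmm = BayesianGaussianMixture(
        n_components=30,  # upper bound on the number of clusters, not the answer
        covariance_type="diag",
        weight_concentration_prior_type="dirichlet_process",
    )
    labels = bgmm.fit_predict(embeddings)
    effective_k = (bgmm.weights_ > 1e-2).sum()  # components with non-negligible weight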


Having close centers might help with the labeling. Let me know if I can help.


Switch to using HDBSCAN. It's good.


The elbow method is a good place to start for finding the number of clusters.
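
Something like this, sketched out: fit over a range of k and look for where the inertia curve stops dropping sharply (packages like kneed try to locate the bend automatically).

    import numpy as np
    from sklearn.cluster import MiniBatchKMeans

    embeddings = np.random.rand(300, 64)  # stand-in for real embedding vectors

    # Record inertia (within-cluster sum of squares) for each k;
    # the "elbow" where the curve flattens is a reasonable choice
    for k in range(2, 21):
        km = MiniBatchKMeans(n_clusters=k, n_init="auto").fit(embeddings)
        print(k, round(km.inertia_, 1))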


That's a useful hint, thanks. I fed it through GPT-4 and got some interesting leads: https://chat.openai.com/share/400f76ae-b53b-4d07-ac31-adcef2... and https://chat.openai.com/share/48650db8-5a29-49c5-84b2-574f53...


Use bottom-up (agglomerative) clustering and you get the whole tree: fclusterdata in scipy.
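
Rough sketch (fclusterdata is the one-call convenience wrapper; linkage plus fcluster gives you the full tree to cut at whatever level you like - the embeddings and cluster count here are placeholders):

    import numpy as np
    from scipy.cluster.hierarchy import fclusterdata, linkage, fcluster

    embeddings = np.random.rand(200, 64)  # stand-in for real embedding vectors

    # One call: agglomerative clustering cut into at most 10 flat clusters
    labels = fclusterdata(embeddings, t=10, criterion="maxclust",
                          method="average", metric="cosine")

    # Or build the full tree and cut it later
    Z = linkage(embeddings, method="average", metric="cosine")
    labels_again = fcluster(Z, t=10, criterion="maxclust")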



