Hacker News | Pringled's comments

Thanks for the kind words! Unfortunately, it's hard to directly measure precision/recall (or other metrics) since there are no real labels. This is one of the reasons we tried to design this to be as explainable as possible, so that you can easily look at examples of deduplication at your chosen threshold and decide if they make sense. We are still thinking about more/better ways to evaluate this in addition to the benchmarks we've already done.
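The inspect-at-a-threshold workflow described above can be sketched roughly like this. This is a hedged, minimal illustration, not the library's actual API: the embeddings are random stand-ins for a real sentence encoder, and all names are made up.

```python
# Illustrative sketch: find near-duplicate pairs above a similarity
# threshold so a human can eyeball whether they make sense.
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = ["the cat sat", "the cat sat.", "a dog ran"]

# Stand-in embeddings; a real setup would use a sentence encoder.
rng = np.random.default_rng(0)
vecs = {d: rng.normal(size=64) for d in docs}
# Make the second doc a slight perturbation of the first, so they
# behave like a genuine near-duplicate pair.
vecs["the cat sat."] = vecs["the cat sat"] + 0.01 * rng.normal(size=64)

threshold = 0.95
pairs = [
    (a, b, cosine_sim(vecs[a], vecs[b]))
    for i, a in enumerate(docs)
    for b in docs[i + 1:]
]
# These are the candidate duplicates you would review by hand.
duplicates = [(a, b, s) for a, b, s in pairs if s >= threshold]
```

Lowering `threshold` surfaces more (and noisier) candidate pairs; raising it keeps only the most obvious duplicates, which is exactly the trade-off you evaluate by inspection when no labels exist.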


Thanks David, that's really nice to hear!


Thank you, that's nice to hear!


Some backends/algorithms don't natively support dynamic inserts and require you to rebuild the index when you want to add vectors to it. Of the backends we support, Annoy and PyNNDescent are the only ones with this limitation.
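The two behaviors can be sketched as follows. This is a toy illustration of the distinction, not the library's real backend interface; all class and method names are hypothetical.

```python
# Hypothetical sketch of the two backend styles: one with native
# inserts, one (Annoy/PyNNDescent-style) that must rebuild.
import numpy as np

class DynamicBackend:
    """Supports appending vectors without a rebuild."""
    def __init__(self):
        self.vectors = []
    def insert(self, vec):
        self.vectors.append(vec)

class RebuildBackend:
    """The index is immutable once built (Annoy/PyNNDescent-style)."""
    def __init__(self):
        self.vectors = []
        self.rebuilds = 0
    def build(self, vecs):
        self.vectors = list(vecs)
        self.rebuilds += 1
    def insert(self, vec):
        # No native insert: rebuild the whole index with the new vector,
        # which is why inserts get expensive on large datasets.
        self.build(self.vectors + [vec])

dyn = DynamicBackend()
dyn.insert(np.zeros(4))

reb = RebuildBackend()
reb.build([np.zeros(4)])
reb.insert(np.ones(4))  # triggers a full rebuild
```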

Hybrid search is a really cool idea though; it's not something we support at the moment, but definitely something we could investigate and add as an upcoming feature, thanks for the suggestion!


Thanks! This is actually something we have been experimenting with already (essentially auto-tuning on a specific dataset). It turned out to be quite complicated: a grid search over all the index and parameter combinations is very costly on larger datasets. That's why we first opted for this approach, where you evaluate with a chosen index + parameter set, but auto-tuning is definitely something we are still planning to do.
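To make the cost concrete, here's a small sketch of why the grid blows up. The parameter names and values are purely illustrative, not the library's actual options:

```python
# Illustrative sketch: the number of index/parameter combinations in a
# grid search grows multiplicatively, and each combination means a full
# index build plus a recall/QPS evaluation on the dataset.
from itertools import product

param_grid = {
    "backend": ["hnsw", "annoy", "faiss_ivf"],   # hypothetical choices
    "n_neighbors": [10, 20, 40],
    "ef_or_trees": [50, 100, 200],
}

combos = list(product(*param_grid.values()))
# Even this tiny grid already requires 27 full build-and-evaluate runs.
```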


1: That could be something for the future, but at the moment this is just meant as a way to quickly try out and evaluate various algorithms and libraries (we call those backends) without having to learn the syntax for each of them.

2: We adopted the same methodology as ann-benchmarks for our evaluation, so technically the benchmarks there are valid for the backends we support. However, it's a good suggestion to add those explicitly to the repo; I'll add a todo for that.
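The core measurement in that methodology is recall@k: the fraction of the exact k nearest neighbors (from brute force) that the approximate index also returns. A minimal sketch, with made-up neighbor IDs:

```python
# Hedged sketch of the recall@k measurement used in ANN benchmarking.
def recall_at_k(exact_ids, approx_ids):
    exact, approx = set(exact_ids), set(approx_ids)
    return len(exact & approx) / len(exact)

exact = [3, 7, 9, 12]    # ground-truth neighbors from brute force
approx = [3, 9, 12, 50]  # neighbors returned by the approximate index
r = recall_at_k(exact, approx)  # 3 of 4 ground-truth hits -> 0.75
```

In practice this is averaged over many queries and plotted against queries-per-second to show the speed/accuracy trade-off of each backend and parameter setting.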

3: Mainly because (a) it's the language we are most comfortable developing in, (b) it's the most widely used and adopted language for ML, and (c) (almost) all the algorithms we support are already written in C/C++/Cython.


I think a combination works quite well: first getting a small set of candidates from all the data using a lightweight model, and then using a heavy-duty model to rerank the results and produce the final candidates.
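That retrieve-then-rerank pattern can be sketched like this. Both scorers here are cheap stand-ins (dot product and cosine similarity); in a real pipeline the second stage would be a larger model such as a cross-encoder:

```python
# Illustrative two-stage retrieval: a cheap score narrows the full
# corpus to a candidate set, a heavier score reranks the candidates.
import numpy as np

rng = np.random.default_rng(42)
corpus = rng.normal(size=(1000, 16))  # stand-in document vectors
query = rng.normal(size=16)

# Stage 1: lightweight retrieval (dot product) over ALL the data.
cheap_scores = corpus @ query
candidate_ids = np.argsort(cheap_scores)[::-1][:50]

# Stage 2: heavier reranking over the 50 candidates only.
def heavy_score(doc, q):
    return float(doc @ q / (np.linalg.norm(doc) * np.linalg.norm(q)))

reranked = sorted(candidate_ids,
                  key=lambda i: heavy_score(corpus[i], query),
                  reverse=True)
top_k = reranked[:5]
```

The heavy model only ever sees the 50 candidates rather than all 1000 documents, which is what makes the combination fast without giving up much final-ranking quality.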

