An early look at HNSW performance with pgvector (jkatz05.com)
35 points by kiwicopple 9 months ago | 17 comments



Seems that pgvector has a viable competitor extension: https://github.com/tensorchord/pgvecto.rs



The charts in this blog post show benchmarks of pgvector's HNSW vs. pg_embedding.


The results are very promising for the Postgres ecosystem, as the HNSW index shows a significant performance/recall improvement over the current ivfflat index in pgvector.

HNSW will be merged in v0.5.0. I can't speak for Andrew (the creator), but it seems that this release is imminent[0], pending some benchmarking and minor improvements. This is a first look at the performance of pgvector’s HNSW implementation at a specific commit[1]. (A rough sketch of the new index syntax follows the links below.)

[0] https://github.com/pgvector/pgvector/commit/51d292c93dff82f6...

[1] https://github.com/pgvector/pgvector/commit/600ca5a7
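For anyone who wants to try it once v0.5.0 lands, here's a minimal sketch of building and querying an HNSW index. This assumes the v0.5.0 syntax, a hypothetical `items` table with a 3-dimensional vector column, and the psycopg2 driver; it is an illustration, not the blog's exact setup.

    # Sketch: building and querying a pgvector HNSW index (v0.5.0+ syntax).
    # Table name, column, dimension, and connection string are hypothetical.
    import psycopg2

    conn = psycopg2.connect("dbname=vectordb")
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS items (
            id bigserial PRIMARY KEY,
            embedding vector(3)
        );
    """)

    # HNSW index with explicit build parameters (m, ef_construction)
    cur.execute("""
        CREATE INDEX ON items
        USING hnsw (embedding vector_l2_ops)
        WITH (m = 16, ef_construction = 64);
    """)
    conn.commit()

    # ef_search trades recall for speed at query time
    cur.execute("SET hnsw.ef_search = 100;")
    cur.execute("""
        SELECT id FROM items
        ORDER BY embedding <-> %s::vector
        LIMIT 10;
    """, ("[0.1, 0.2, 0.3]",))
    print(cur.fetchall())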


Glad to see more work on pgvector, but why test on such small datasets on a large-memory machine? The big-ann datasets have 1B points and are much more interesting/representative of current embedding use cases (e.g., from dual encoder models).

I’m also curious if there is a way to not store everything in memory for pgvector. Is that possible?

Lastly, what is the parallelism story? Is it just using a thread pool under the hood? OpenMP?

Understanding if pgvector plans to support point insertions and deletions is also important in practice.


If you have a 10M-or-bigger dataset of real-world OpenAI-dimensional vectors, please share it and I'll use it in the next benchmarks. Random datasets are too misleading for vector search benchmarks, because all ANN engines exploit the internal distribution of the data to cope with the curse of dimensionality, so I never use random datasets for ANN index benchmarking. Using simplified lower-dimensional vectors (e.g., 128 instead of 1536) also changes the performance trends.


See https://big-ann-benchmarks.com/neurips21.html

They're not OpenAI embeddings, but they are realistic and much larger (in number of vectors).

I think many production embeddings at non-OpenAI companies will use lower-dimensional vectors than 1536, so it makes sense to focus on non-OpenAI embeddings as well in your benchmarking.


Pgvector doesn't need to store everything in memory. It behaves similarly to almost any other Postgres AM and stores index data on disk. Performance-wise, it's better to have enough memory for the index data to remain in shared buffers, but it is not a requirement for pgvector.
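To make that concrete, here's a rough way to compare an index's on-disk size against the memory Postgres has for caching. The index name and connection string are hypothetical; adjust them to your own schema.

    # Sketch: comparing a pgvector index's size against shared_buffers.
    # 'items_embedding_idx' and the connection string are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=vectordb")
    cur = conn.cursor()

    cur.execute("SHOW shared_buffers;")
    print("shared_buffers:", cur.fetchone()[0])

    cur.execute("SELECT pg_size_pretty(pg_relation_size('items_embedding_idx'));")
    print("index size:", cur.fetchone()[0])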


Thanks for the response. I wonder whether HNSW will still perform well if it needs to page neighbor lists to/from disk. Do you plan to benchmark the setting where the dataset is too large to fit in memory?


I like these measurements with ANN-Benchmarks! They allow comparing the performance of different index implementations apples-to-apples, i.e., at the same set of build parameters rather than at some fixed settings (or, even worse, default settings that differ between implementations).

The blog post is very thorough, with lots of measurements across different datasets, including the great 1536-dimensional, 1M-row dbpedia-openai dataset. Furthermore, a very strong point is that all parameters and the methodology are described and transparent.


Blog author. Thanks for the analysis -- I agree that ANN-Benchmarks provides a nice framework for helping with apples-to-apples comparisons. In this case, being able to use the "--local" flag made it easier to run in the native environment vs. putting it into a container. I'm looking forward to ANN-Benchmarks having more datasets!


This is pretty cool, but what are the scaling limits of a pgvector-based embeddings storage solution? 1 TB of embeddings? 100 TB? Is pgvector suitable for large-scale installations?


Blog author. I've done some separate testing on storing ~500GB of embeddings (~1B embeddings) in a partitioned table. The partition key was built using IVFFlat as a "coarse quantizer" (in this case, sampling the entire dataset and finding K means), storing the mean vectors in a separate table, and then loading each vector into the partition with the closest center. After that, I built an IVFFlat index on each partition. With the indexes, this added up to ~1TB of storage. This was primarily an "is it possible?" test rather than thorough benchmarking.
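Roughly, the partitioning scheme described above could be set up like this. The table names, vector dimension, and partition count are placeholders, not the exact configuration from that test; assigning each row's centroid_id (the nearest mean) happens at load time.

    # Sketch of a "coarse quantizer" partitioning scheme: one list
    # partition per centroid, each with its own IVFFlat index.
    # Names, dimension, and partition count are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=vectordb")
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")

    # Parent table partitioned by the id of the nearest centroid
    cur.execute("""
        CREATE TABLE embeddings (
            id bigint,
            centroid_id int,
            embedding vector(1536)
        ) PARTITION BY LIST (centroid_id);
    """)

    # One partition per centroid; in practice the ivfflat indexes would be
    # built after each partition is loaded with its vectors.
    for centroid_id in range(10):
        cur.execute(f"""
            CREATE TABLE embeddings_p{centroid_id}
            PARTITION OF embeddings FOR VALUES IN ({centroid_id});
        """)
        cur.execute(f"""
            CREATE INDEX ON embeddings_p{centroid_id}
            USING ivfflat (embedding vector_l2_ops) WITH (lists = 100);
        """)
    conn.commit()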


The scalability characteristics of HNSW haven’t been properly benchmarked yet. If it’s anything like ivfflat, then it should be reasonably predictable based on memory size: https://supabase.com/blog/pgvector-performance

I think 100TB is getting more into “sharded postgres” territory


This may be a dumb question, but with OpenAI embeddings do we need to use cosine similarity, or is simple (Euclidean) distance equivalent? I used cosine similarity before, but I'm not sure.


Blog author. You can choose to use any distance metric. One reason cosine similarity is popular (and used) is that for many of these higher-dimensional datasets, it gives a better representation of "nearness" across all the data, based on the nature of "angular" distance. But depending on how your data is distributed, something like L2 (Euclidean) distance could make more sense.
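For reference, pgvector exposes the different metrics as separate operators. A quick sketch on toy 3-dimensional vectors (the connection string is a placeholder):

    # Sketch: pgvector's distance operators on toy 3-dimensional vectors.
    #   <->  L2 (Euclidean) distance
    #   <=>  cosine distance
    #   <#>  negative inner product
    import psycopg2

    conn = psycopg2.connect("dbname=vectordb")
    cur = conn.cursor()

    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        SELECT '[1,2,3]'::vector <-> '[4,5,6]'::vector AS l2,
               '[1,2,3]'::vector <=> '[4,5,6]'::vector AS cosine,
               '[1,2,3]'::vector <#> '[4,5,6]'::vector AS neg_inner_product;
    """)
    print(cur.fetchone())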


Can this HNSW implementation in PgVector be sharded across different nodes? Qdrant seems to make that easy.



