Hacker News new | past | comments | ask | show | jobs | submit login

This is actually really cool, and despite what I'm sure will come off as (constructive) criticism, I am very impressed!

First, I think you oversell the overhead of keeping data in sync and the costs of not doing so in a timely manner. Almost any distributed system that is using multiple databases already needs to have a strategy for dealing with inconsistent data. As far as this problem goes, inconsistent embeddings are a pretty minor issue given that (1) most embedding-based workflows don't do a lot of updating/deletion; and (2) the sheer volume of embeddings from only a small corpus of data means that in practice you're unlikely to notice consistency issues. In most cases you can get away with doing much less than is described in this post. That being said, I want to emphasize that I still think not having to worrying about syncing data is indeed cool.

Second, IME the most significant drawback to putting your embeddings in a Postgres database with all your other data is that the workload looks so different. To take one example, HNSW indices using pgvector consume a ton of resources - even a small index of tens of millions of embeddings may be hundreds of gigabytes on disk and requires very aggressive vacuuming to perform optimally. It's very easy to run into resource contention issues when you effectively have an index that will consume all the available system resources. The canonical solution is to move your data into another database, but then you've recreated the consistency problem that your solution purports to solve.

Third, a question: how does this interact with filtering? Can you take advantage of partial indices on the underlying data? Are some of the limitations in pgvector's HNSW implementation (as far as filtering goes) still present?




Post co-author here. Really appreciate the feedback.

Your point about HNSW being resource intensive is one we've heard. Our team actually built another extension called pgvectorscale [1] which helps scale vector search on Postgres with a new index type (StreamingDiskANN). It has BQ out the box and can also store vectors on disk vs only in memory.

Another practice I've seen work well is for teams use to use a read replica to service application queries and reduce load on the primary database.

To answer your third question, if you combine Pgai Vectorizer with pgvectorscale, the limitations around filtered search in pgvector HNSW are actually no longer present. Pgvectorscale implements streaming filtering, ensuring more accurate filtered search with Postgres. See [2] for details.

[1]: https://github.com/timescale/pgvectorscale [2]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...


Thanks for your answer. I hear you on using a read-replica to serve embedding-based queries, but I worry there are lots of cases where that breaks down in practice: presumably you still need to do a bunch of IO on the primary to support insertion, and presumably reconstituting an index (e.g. to test out new hyperparameters) isn't cheap; at least you can offload the memory requirements of reading big chunks of your graph into memory onto the follower though.

Cool to see the pgvectorscale stuff; it sounds like the approach for filtering is not dissimilar to the direction that the pgvector team are taking with 0.8.0, although the much-denser graph (relative to HNSW) may mean the approach works even better in practice?


So… maybe 15 or 20 years ago I had setup MySQL servers such that some replicas had different indexes. MySQL only had what we would now call logical replication.

So after setting up replication and getting it going, I would alter the tables to add indexes useful for special purposes including full text which I did not being built on the master or other replicas.

I imagine, but can not confirm, that you could do something similar with PostgreSQL today.


Yeah, logical replication is supported on PostgreSQL today and would support adding indices to a replica. I am not sure if that works in this case, though, because what's described here isn't just an index.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: