Vector databases are the wrong abstraction (timescale.com)
493 points by jascha_eng 16 days ago | 90 comments



This is actually really cool, and despite what I'm sure will come off as (constructive) criticism, I am very impressed!

First, I think you oversell the overhead of keeping data in sync and the costs of not doing so in a timely manner. Almost any distributed system that is using multiple databases already needs to have a strategy for dealing with inconsistent data. As far as this problem goes, inconsistent embeddings are a pretty minor issue given that (1) most embedding-based workflows don't do a lot of updating/deletion; and (2) the sheer volume of embeddings from only a small corpus of data means that in practice you're unlikely to notice consistency issues. In most cases you can get away with doing much less than is described in this post. That being said, I want to emphasize that I still think not having to worry about syncing data is indeed cool.

Second, IME the most significant drawback to putting your embeddings in a Postgres database with all your other data is that the workload looks so different. To take one example, HNSW indices using pgvector consume a ton of resources - even a small index of tens of millions of embeddings may be hundreds of gigabytes on disk and requires very aggressive vacuuming to perform optimally. It's very easy to run into resource contention issues when you effectively have an index that will consume all the available system resources. The canonical solution is to move your data into another database, but then you've recreated the consistency problem that your solution purports to solve.
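
For reference, the knobs involved look roughly like this (hypothetical table name, illustrative settings -- tune for your own workload):

    -- pgvector HNSW index on a large embeddings table
    SET maintenance_work_mem = '8GB';  -- index builds want a lot of memory

    CREATE INDEX ON document_embeddings
        USING hnsw (embedding vector_cosine_ops)
        WITH (m = 16, ef_construction = 64);

    -- dead tuples hurt HNSW recall and bloat the graph, hence the aggressive vacuuming
    ALTER TABLE document_embeddings
        SET (autovacuum_vacuum_scale_factor = 0.01);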

Third, a question: how does this interact with filtering? Can you take advantage of partial indices on the underlying data? Are some of the limitations in pgvector's HNSW implementation (as far as filtering goes) still present?


Post co-author here. Really appreciate the feedback.

Your point about HNSW being resource-intensive is one we've heard. Our team actually built another extension called pgvectorscale [1] which helps scale vector search on Postgres with a new index type (StreamingDiskANN). It has BQ (binary quantization) out of the box and can also store vectors on disk rather than only in memory.

Another practice I've seen work well is for teams to use a read replica to serve application queries and reduce load on the primary database.

To answer your third question, if you combine Pgai Vectorizer with pgvectorscale, the limitations around filtered search in pgvector HNSW are actually no longer present. Pgvectorscale implements streaming filtering, ensuring more accurate filtered search with Postgres. See [2] for details.
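
For anyone curious what that looks like in practice, roughly (hypothetical table; see the pgvectorscale README for the exact options):

    CREATE EXTENSION IF NOT EXISTS vectorscale CASCADE;

    -- StreamingDiskANN index instead of pgvector's HNSW
    CREATE INDEX ON document_embeddings USING diskann (embedding);

    -- filtered search streams through the index until the LIMIT is satisfied,
    -- so a selective WHERE clause doesn't silently truncate results
    SELECT id
    FROM document_embeddings
    WHERE category = 'support_ticket'
    ORDER BY embedding <=> $1  -- $1: query embedding
    LIMIT 10;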

[1]: https://github.com/timescale/pgvectorscale
[2]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...


Thanks for your answer. I hear you on using a read replica to serve embedding-based queries, but I worry there are lots of cases where that breaks down in practice: presumably you still need to do a bunch of IO on the primary to support insertion, and presumably reconstituting an index (e.g. to test out new hyperparameters) isn't cheap; at least you can offload the memory cost of reading big chunks of the graph onto the follower, though.

Cool to see the pgvectorscale stuff; it sounds like the approach for filtering is not dissimilar to the direction that the pgvector team are taking with 0.8.0, although the much-denser graph (relative to HNSW) may mean the approach works even better in practice?


So… maybe 15 or 20 years ago I had set up MySQL servers such that some replicas had different indexes. MySQL only had what we would now call logical replication.

So after setting up replication and getting it going, I would alter the tables to add indexes useful for special purposes, including full text, which I did not want built on the master or other replicas.

I imagine, but cannot confirm, that you could do something similar with PostgreSQL today.


Yeah, logical replication is supported on PostgreSQL today and would support adding indices to a replica. I am not sure if that works in this case, though, because what's described here isn't just an index.
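
A minimal sketch of that pattern with built-in logical replication (names made up; you create the table on the subscriber yourself and can then index it however you like):

    -- on the publisher (primary)
    CREATE PUBLICATION docs_pub FOR TABLE documents;

    -- on the subscriber: create the same table, then subscribe
    CREATE SUBSCRIPTION docs_sub
        CONNECTION 'host=primary dbname=app user=replicator'
        PUBLICATION docs_pub;

    -- indexes that exist only on the subscriber
    CREATE INDEX ON documents USING gin (to_tsvector('english', body));
    CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);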


Great point!

(Disclaimer: I work for Elastic)

Elasticsearch has recently added a data type called semantic_text, which automatically chunks text, calculates embeddings, and stores the chunks with sensible defaults.

Queries are similarly simplified: vectors are calculated and compared internally, which means a lot less I/O and much simpler client code.

https://www.elastic.co/search-labs/blog/semantic-search-simp...


I made something similar, but used DuckDB as the vector store (and query engine)! It's impressively fast.

https://github.com/patricktrainer/duckdb-embedding-search


I love duckdb, but their concurrency model is very limiting:

DuckDB has two configurable options for concurrency:

1. One process can both read and write to the database.

2. Multiple processes can read from the database, but no processes can write (access_mode = 'READ_ONLY').

https://duckdb.org/docs/connect/concurrency.html


Any specific reason to use DuckDB?

I've got a crapload of json q & a formatted discussions on a topic, and am trying to figure out if I just store it somewhere and query it, or do I also do vector embeddings, kinda lost with all the possible options.


Embeddings are what encode the “meaning” of a given text. Similarity search works by computing the angle between your query vector and the rest of the vectors already stored. DuckDB (and columnar stores in general) is great at aggregation. It’s particularly well suited because DuckDB is a single file. There’s no server to muck with.


Is there a vector data type available in DuckDB now?


They call it a fixed-size array type, but yes. It was added earlier this year. Works really great.

https://duckdb.org/2024/05/03/vector-similarity-search-vss.h...
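
Toy example of the fixed-size array type plus the vss extension (3-dim vectors to keep the literals short; real embeddings would be e.g. FLOAT[1536]):

    INSTALL vss;
    LOAD vss;

    CREATE TABLE docs (id INTEGER, body VARCHAR, embedding FLOAT[3]);
    INSERT INTO docs VALUES
        (1, 'hello', [0.1, 0.2, 0.3]::FLOAT[3]),
        (2, 'world', [0.9, 0.1, 0.0]::FLOAT[3]);

    -- HNSW index on the array column
    CREATE INDEX docs_hnsw ON docs USING HNSW (embedding);

    -- nearest neighbours by (squared) L2 distance
    SELECT id, body
    FROM docs
    ORDER BY array_distance(embedding, [0.1, 0.2, 0.25]::FLOAT[3])
    LIMIT 2;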


Yep! It was added in v0.10.0 - which was released a month or two after I made this.

This is using v0.9.1


How does their embedding model compare in terms of retrieval accuracy to, say `text-embedding-3-small` and `text-embedding-3-large`?


You can use OpenAI embeddings in Elastic if you don't want to use their ELSER sparse embeddings.


It’s impossible to answer that question without knowing what content/query domain you are embedding. Check out the MTEB leaderboard, dig into the retrieval benchmark, and look for analogous datasets.


So we're talking about picking the best embedding model per use case? Medical data would require a different model than, say, sales data? Sounds like a very fragmented approach.


The answer lies with a validation dataset that you create for testing.


Hey HN! Post co-author here, excited to share our new open-source PostgreSQL tool that re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed.
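
Roughly, you point a vectorizer at a table and it keeps a chunked, embedded target in sync with it. Configuration looks something like this (illustrative names; see the repo for the exact options):

    SELECT ai.create_vectorizer(
        'blog_posts'::regclass,
        destination => 'blog_posts_embeddings',
        embedding   => ai.embedding_openai('text-embedding-3-small', 1536),
        chunking    => ai.chunking_recursive_character_text_splitter('content')
    );

    -- inserts, updates and deletes on blog_posts are then picked up
    -- automatically and reflected in blog_posts_embeddings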

Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon.

Eager to hear your feedback and reactions. If you'd like to leave an issue or better yet a PR, you can do so here [1]

[1]: https://github.com/timescale/pgai


Hey, this is really cool! Thanks for the article and the tool itself.

One question - in the RAG projects we've done, most of the source data was scattered in various source systems, but wasn't necessarily imported into a single DB or Data Lake. For example, building an internal Q&A tool for a company that has knowledge stored in services like Zendesk, Google Drive, an internal company Wiki, etc.

In those cases, it made sense to not import the source documents, or only import metadata about them, and keep the embeddings in a dedicated Vector DB. This seems to me to be a fairly common use case - most enterprises have this kind of data scattered across various systems.

How do you envision this kind of use case working with this tool? I may have missed it, but you mention things like working with images, etc. Is your assumption that everyone is storing all of that data in Postgres?


Pretty smart. Why is the DB API the abstraction layer, though? Why not two columns and a microservice? I assume you are making async calls to get the embeddings?

I say that because it seems unusual. An index would suit sync better. But async things like embeddings, geocoding an address, "is this email from a spammer", etc. feel like app-level stuff.


(post co-author here)

The DB is the right layer from an interface point of view -- because that's where the data properties should be defined. We also use the DB for bookkeeping of what needs to be done, because we can leverage transactions and triggers to make sure we never miss any data. From an implementation point of view, the actual embedding does happen outside the database in a Python worker or cloud functions.

Merging the embeddings and the original data into a single view allows the full feature set of SQL rather than being constrained by a REST API.
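
Concretely, the kind of query that falls out of that (names hypothetical):

    -- join the embeddings back to application tables and filter in one query
    SELECT p.title, p.published_at
    FROM blog_posts_embeddings e
    JOIN blog_posts p ON p.id = e.id
    WHERE p.published_at > now() - interval '30 days'
    ORDER BY e.embedding <=> $1  -- $1: embedding of the search query
    LIMIT 10;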


That is arguable because while it is a calculated field, it is not a pure one (IO is required), and not necessarily idempotent, not atomic and not guaranteed to succeed.

It is certainly convenient for the end user, but it hides things. What if the API calls to OpenAI fail or get rate limited? How is that surfaced? Will I see that in my observability tooling? Will queries just silently miss results?

If the DB did the embedding itself synchronously within the write, it would make sense. That would be more like Elasticsearch or a typical full-text index.


(co-author here) We automatically retry on failures after a while. We also log error messages in the worker (self-hosted) and have clear indicators in the cloud UI that something went wrong (with plans to add email alerts later).

The error handling is actually the hard part here. We don't believe that failing on inserts due to the endpoint being down is the right thing because that just moves the retry/error-handling logic upstream -- now you need to roll your own queuing system, backoffs etc.


Thanks for the reply. These are compelling points.

I agree not to fail on insert too by the way. The insert is sort of an enqueuing action.

I was debating if a microservice should process that queue.

Since you are a PaaS, the distinction might be almost moot. An implementation detail. (It would affect the API though.)

However if Postgres added this feature generally it would seem odd to me because it feels like the DB doing app stuff. The DB is fetching data for itself from an external source.

The advantage is it is one less thing for the app to do and maybe deals with errands many teams have to roll their own code for.

A downside is if I want to change how this is done I probably can't. Say I have data residency or security requirements that affect the data I want to encode.

I think there is much to consider. Probably the why not both meme applies though. Use the built in feature if you can, and roll your own where you can't.


Thank you for sharing this! I have one question: Is there any plan to add support for local LLM / embeddings models?


"Right now the system only supports OpenAI as an embedding provider, but we plan to extend with local and OSS model support soon."

In the post you responded to


Haha I feel so dumb now. Thank you!


This question keeps popping up but I don't get it. Everyone and their dog has an OpenAI-compatible API. Why not just serve a local LLM and put "127.0.0.1 api.openai.com" in your hosts file?

I mean why is that even a question? Is there some fundamental difference between the black box that is GPT-* and say, LLaMA, that I don't grok?


This is super cool! One suggestion for the blog: I would put "re-imagines vector embeddings as database indexes. It's not literally an index but it functions like one to update embeddings as source data gets added, deleted or changed." as a tl/dr at the top.

It wasn't clear to me why this was significantly different than using pg_vector until I read that. That makes the rest of the post (e.g. why you need the custom methods in a `SELECT`) make a lot more sense in context.


I'm doing something similar with Go + Postgres.


Clever!

A method that has worked well for me: divorced databases.

The first database is a plaintext database that stores rows of id, data, and metadata; the second database is a vector database that stores id and embedding. Whenever a new row is added, the first database makes a POST request to the second database. The second database embeds the data and returns the id of its row. The first database uses that id to store the plain text.

When searching, the second database is optimized for cosine sim with an HNSW index. It returns the IDs to the first database, which fetches the plaintext to return to the user.

The advantages of this are that the plaintext data can be A/B tested across multiple embedding models without affecting the source, and each database can be provisioned for a specific task. It also lowers hosting costs and improves security, because there only needs to be one central vector database alongside small provisioned plaintext databases.


It sounds like this is pretty similar to the approach that the post is advocating against although I can see your reasoning behind this.


Post co-author here. This is actually something that we are considering implementing in future versions of pgai Vectorizer. You point the vectorizer at database A but tell it to create and store embeddings in database B. You can always do joins across the two databases with postgres FDWs and it would solve issues of load management if those are concerns. Neat idea and one on our radar!


The limitation with that is no hybrid search, which is often needed. “Show me only results for this user or tenant or category etc.”

What's wrong with using FAISS as your single db?

It's like SQLite for vector embeddings, and you can store metadata (the primary data, foreign keys, etc) along with the vectors, preserving the relationship.

Not sure if the metadata is indexed, but at least iirc it's more or less trivial to update the embeddings when your data changes (tho I haven't used it in a while so not sure).


Good q. For most standalone vector search use cases, FAISS or a library like it is good.

However, FAISS is not a database. It can store metadata alongside vectors, but it doesn't have things you'd want in your app db like ACID compliance, non-vector indexing, and proper backup/recovery mechanisms. You're basically giving up all the DBMS capabilities.

For new RAG and search apps, many teams prefer just using a single app db with vector search capabilities included (Postgres, Mongo, MySQL etc) vs managing an app db and a separate vector db.


I've been in the vector database space for a while (primary author of txtai). I do think vector indexing in traditional databases with tools like pgvector is a good option.

txtai has long had SQLite + Faiss support to enable metadata filtering with vector search. That pattern can take you farther than you think.

The design decision I've made is to make it easy to plug different backends in for metadata and vectors. For example, txtai supports storing both in Postgres (w/ pgvector). It also supports sqlite-vec and DuckDB.

I'm not sure there is a one-size-fits-all approach. Flexibility and options seem like a win to me. Different situations warrant different solutions.


Wow, actually a good point I haven't seen anyone make.

Taking raw embeddings and then storing them into vector databases, would be like if you took raw n-grams of your text and put them into a database for search.

Storing documents makes much more sense.


Been using pgvector for a while, and to me it was kind of obvious that the source document and the embeddings are fundamentally linked, so we always stored them "together". Basically anyone doing embeddings at scale is doing something similar to what Pgai Vectorizer is doing, and it is certainly a nice abstraction.


I used FAISS as it also allowed me to trivially store them together.

Idk how well it scales though, it's just doing its job at my hobby-project scale.

For my few hundred thousand embeddings, I must say the performance was satisfactory.


This is how most modern vector dbs work, you usually can store much more than just the raw embeddings (full text, metadata fields, secondary/named vectors, geospatial data, relational fields, etc).


I agree that putting the vectors in a separate DB often does not make sense. Just use HANA https://news.sap.com/2024/04/sap-hana-cloud-vector-engine-ai... ;-) IMHO putting the calculation of the embedding vectors into the DB (even if it is just a remote call) is not a good idea. How do you react to failures of the remote call, or security issues because of code running within your DB?


Is SAP HANA used for anything outside the SAP environment?


No.


At my current company, we used Postgres with pgvector so the text is co-located with the embeddings on the same rows. At first, I was a bit apprehensive about the idea of getting so close to the nitty-gritty technical details of computing vector embeddings and doing cosine similarity matching but actually it has been wonderful. There is something magical about working directly with embeddings. Computing, serializing and storing everything yourself is actually surprisingly simple. Don't let the magic scare you.

Recently I've been doing hardcore stuff like taking an old hierarchical clustering library and substituting the vector distance functions with a cosine similarity function so that it groups/clusters records based on similarity of their embeddings. It's funny reading the README of that 10 year old library and they're showing how to use it to do tedious stuff like grouping together 3-dimensional color vectors. I'm using it to cluster together content based on meaning similarity using vectors of over 1.5k dimensions. Somehow, I don't think the library authors saw that coming.

How great is it to come across a library which hasn't been updated in 10 years and yet is flexible and simple enough that it can be re-purposed to serve a radically more advanced use case which would have been beyond the author's imagination at the time...

I think the most surprising aspect about the whole experience is that working with the embeddings directly makes it feel like your database is intelligent; but you know it's just a plain old dumb database and all the embeddings were pre-computed.


In the project I'm currently working on, I use OpenSearch for RAG because it allows me to use hybrid search which combines full-text search with vector search, and OpenSearch does all the math combining the two result sets for me. Research shows that hybrid search can give better results than just vector search alone. Another team was already integrating OpenSearch for full-text search for a different feature, so I just reused existing infra, sparing the time of DevOps/SRE.


I feel like most of the points raised in the article are solved by "use pgvector", and I'm very skeptical of handing responsibility for the API calls that create the embeddings over to the DB itself. I already have a software layer that knows how to handle things like logging and API call failures. Having the DB fetch data from external sources feels like the wrong abstraction to me.


I agree. However, I think what they are saying is the embedding should just be like any other index. I mean, yeah it should be but that isn't reality. There are massive latencies involved as well as costs.

Perhaps in ~10 years embedding / chunking approaches will be so mature that there will just be one way to do it and will take no more time than updating a btree but that certainly isn't the case now.

I think the right abstraction for today would be for OpenAI to manage the vector search. It is kind of weird to send all of the data to a service only to have it compute a vector and hand it back to me. I have to figure out how to chunk it etc (I'm sure they would do a better job than I would). I should just have to deal with text ideally. Someone else can figure out how to return the best results.


> I think the right abstraction for today would be for OpenAI to manage the vector search

So I disagree, but they have a very easy-to-use RAG system in beta that does what you want.

In my use cases, fine-grained control over chunking and so on is application-level code. I’m using an LLM to split documents into subdocuments with context (and location) and then searching those subdocuments, while pushing the user to the source


I’m using sqlite-vec along with FTS5 in (you guessed it) SQLite and it’s pretty cool. :)
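
For reference, the shape of my setup, roughly (check the sqlite-vec docs for the exact KNN query syntax):

    -- vector side: sqlite-vec virtual table
    CREATE VIRTUAL TABLE chunks_vec USING vec0(embedding float[384]);

    -- keyword side: FTS5 virtual table
    CREATE VIRTUAL TABLE chunks_fts USING fts5(content);

    -- KNN query; the query embedding is passed in as a parameter
    SELECT rowid, distance
    FROM chunks_vec
    WHERE embedding MATCH :query_embedding
    ORDER BY distance
    LIMIT 10;

    -- keyword query
    SELECT rowid FROM chunks_fts WHERE chunks_fts MATCH 'return policy' LIMIT 10;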


what's your experience with sqlite-vec? I'm considering using sqlite-vec in addition to/or replace qdrant vector db for a project (recurse.chat), since I'm moving all the data to sqlite. I love everything SQLite so far, but haven't got to try out sqlite-vec yet.


Hey, this looks great! I'm a huge fan of vectors in Postgres or wherever your data lives, and this seems like a great abstraction.

When I write a SQL query that includes a vector search and some piece of logic, like:

    select name from users
    where age > 21
    order by <vector_similarity(users.bio, "I like long walks on the beach")>
    limit 10;

Does it filter by age first or second? I've liked the DX of pg_vector, but they do vector search, followed by filtering. It seems like that slows down what should be the superpower of a setup like this.
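
(For concreteness, with plain pgvector and a hypothetical bio_embedding column, the actual query would be something along these lines, with the query embedding computed app-side:)

    SELECT name
    FROM users
    WHERE age > 21
    ORDER BY bio_embedding <=> $1  -- $1: embedding of "I like long walks on the beach"
    LIMIT 10;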

Here's a bit more of a complicated example of what I'm talking about: https://blog.bawolf.com/p/embeddings-are-a-good-starting-poi...


(post co-author here)

It could do either depending on what the planner decides. In pgvector it usually does post-filtering in practice (filter after vector search).

pgvector HNSW has the problem that there is a cutoff of retrieving some constant C results, and if none of them match the filter then it won't find results. I believe newer versions of pgvector address that. Also, pgvectorscale's StreamingDiskANN[1] doesn't have that problem to begin with.
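
(For what it's worth, recent pgvector also lets you widen or relax the index scan when a filter is selective; treat these as illustrative and check the pgvector README for the exact GUCs:)

    -- scan more of the graph before the filter is applied
    SET hnsw.ef_search = 200;

    -- pgvector 0.8.x: keep scanning the index until enough filtered results are found
    SET hnsw.iterative_scan = 'relaxed_order';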

[1]: https://www.timescale.com/blog/how-we-made-postgresql-as-fas...


pg_vector does post-filtering, not pre-filtering


Timescale's pgvectorscale extension does pre-filtering, thankfully. Shame I can't get it in RDS though.


You can request it for RDS


Safe to say that if you're using off-the-shelf character-based chunking, your AI app is not past PoC.


I agree.

Similar to the blog post, but at the ORM layer instead of the extension layer, I built a PostgreSQL ORM for Node.js based on ActiveRecord + Django's ORM that includes the concept of vector fields [0][1] and lets you write code like this:

    // Stores the `title` and `content` fields together as a vector
    // in the `content_embedding` vector field
    BlogPost.vectorizes(
      'content_embedding',
      (title, content) => `Title: ${title}\n\nBody: ${content}`
    );

    // Find the top 10 blog posts matching "blog posts about dogs"
    // Automatically converts query to a vector
    let searchBlogPosts = await BlogPost.query()
      .search('content_embedding', 'blog posts about dogs')
      .limit(10)
      .select();
I find it tremendously useful; you can query the underlying data or the embedding content, and you can define how the fields in the model get stored as embeddings in the first place.

[0] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...

[1] https://github.com/instant-dev/orm?tab=readme-ov-file#using-...


I agree with the author - introducing a vector database often isn't worth the extra complexity.

Personally, I can vouch for ParadeDB: https://www.paradedb.com/

It adds extra extensions to PostgreSQL which enable vector indexing, full text search and BM25. Works great and developers are helpful!

The major difference is that you must generate the embeddings by yourself, but I consider it an upside - to each their own :)


> I consider it an upside

I'm curious why you consider it an upside. Hypothetically speaking, wouldn't it be better if the embeddings could automatically be updated when you want them to be? Is the problem that it's not easy to automate based on the specific rules of when you want updates to happen?


Easier to handle edge-cases - real examples:

- What if certain rows in a table don't need to be embedded?

- What if we use a single API key for embedding database rows and user queries and it hits a rate limit - how to prioritize user queries?

- What if some rows should be vectorized using a different model, depending on an external configuration?


We could add support for something like `pg_vectorize` in order to generate embeddings directly from the database. We simply haven't seen enough demand yet. Perhaps we haven't listened hard enough :')


Yes. Materialized Views are good.


That was just what I was thinking. This approach will have the same issues that materialized views have as well


haha. We had a good internal debate as to whether this is more like indexes or more like Materialized Views. It's kinda a mixture of the two.


We managed 200M long- and short-form embeddings (patents), indexed in ScaNN at runtime, with a metadata layer on LevelDB. Some simple murmur-hash sharding and a stable K8s cluster on GCP was all we needed. Low-millisecond retrieval and rerank augmenting a primary search.

I think in no case would we go back and use vector DBs or managed services, even if they were available to us (including Lucene or relational DB add-ons).


This reads solely as a sales pitch, which quickly cuts to the "we're selling this product so you don't have to think about it."

...when you actually do want to think about it (in 2024).

Right now, we're collectively still figuring out:

  1. Best chunking strategies for documents
  2. Best ways to add context around chunks of documents
  3. How to mix and match similarity search with hybrid search
  4. Best way to version and update your embeddings


(post co-author here)

We agree a lot of stuff still needs to be figured out. Which is why we made vectorizer very configurable. You can configure chunking strategies, formatting (which is a way to add context back into chunks). You can mix semantic and lexical search on the results. That handles your 1,2,3. Versioning can mean a different version of the data (in which case the versioning info lives with the source data) OR a different embedding config, which we also support[1].

Admittedly, right now we have predefined chunking strategies. But we plan to add custom-code options very soon.

Our broader point is that the things you highlight above are the right things to worry about, not the data workflow ops and babysitting your lambda jobs. That's what we want to handle for you.

[1]: https://www.timescale.com/blog/which-rag-chunking-and-format...


Points 2-4 are clear pointers to a real database as the home for vector data & search.


Or you can use Cassandra / DataStax Astra to store the original text, the metadata and the embedding in a single table and then do hybrid queries against them (with pre- or post-filtering, optimized automatically).


Seems like a nice abstraction.

Since I see DuckDB mentioned, folks wanting serverless may also be interested in LanceDB, written in Rust, with most features built out for Python.

https://lancedb.com/

https://github.com/lancedb/lancedb

Side note: I wrote a proof of concept of embedding generation handled inside PostgreSQL, independent of the index.

https://github.com/Hendler/flame


This is really cool. I've been working on a RAG application to answer customer support tickets on and off for the past couple months. The whole time I never put together that vectors could "get out of sync" when swapping out the embedding model.

I probably won't use this right now while my app is so small, because of the complexity that managing another service introduces. But I imagine as it gets bigger this would make things simpler.


This seems very well reasoned. Ultimately what I think will win is whatever abstraction popular ORM providers can make easiest for devs.

It might be ‘wrong’ to treat vectors as a related table to your main model, but if frameworks and ORMs make it easy to handle the downsides of that abstraction in the app layer and a dev can just have one database for everything I think that will be the most common approach.


> Vector databases treat embeddings as independent data, divorced from the source data from which embeddings are created

With the exception of Pinecone: Chroma, Qdrant, Weaviate, Elastic, Mongo, and many others store the chunk/document alongside the embedding.

This is intentional misinformation.


Post co-author here. The point is a little nuanced, so let me explain:

You are correct in saying that you can store embeddings and source data together in many vector DBs. We actually point this out in the post. The main point is that they are not linked but merely stored alongside each other. If one changes, the other one does not automatically change, making the relationship between the two stale.

The idea behind Pgai Vectorizer is that it actually links embeddings with the underlying source data, so that changes in source data are automatically reflected in embeddings. This is a better abstraction, and it removes the burden on the engineer of keeping embeddings in sync as the data changes.
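
In practice that means a plain UPDATE is all it takes; the vectorizer notices the change and re-chunks/re-embeds the affected rows asynchronously (hypothetical table):

    -- edit the source row; no separate call into an embedding pipeline needed
    UPDATE blog_posts
    SET content = content || E'\n\nEdit: added a clarification.'
    WHERE id = 42;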


I know it is the case in Chroma that this is supported out of the box with 0 lines of code. I'm pretty sure it's supported everywhere else in no more than 3 lines of code.


This is also the case with Weaviate (as you assumed). If you update the value of any previously vectorized property, Weaviate generates new vectors automatically for you.


As far as I can tell, Chroma can only store chunks, not the original documents. This is from your docs: `If the documents are too large to embed using the chosen embedding function, an exception will be raised`.

In addition, it seems that embeddings happen at ingest time. So if, for example, the OpenAI endpoint is down, the insert will fail. That, in turn, means your users need a retry mechanism and a queuing system. All the complexity we describe in our blog.

Obviously, I am not an expert in Chroma. So apologies in advance if I got anything wrong. Just trying to get to the heart of the differences between the two systems.


Chroma certainly doesn't have the most advanced API in this area, but you can for sure store chunks or documents; it's up to you. If your document size is too large to generate embeddings in a single forward pass, then yes, you do need to chunk in that scenario.

Oftentimes though, even if the document does fit, you choose to chunk anyway or further transform the data with abstractive/extractive summarization techniques to improve your search dynamics. This is why I'm not sure the complexity noted in the article is relevant in anything beyond a "naive RAG" stack. How it's stored or linked is an issue to some degree, but the bigger, more complex smell is in what happens before you even get to the point of inserting the data.

For more production-grade RAG, just blindly inserting embeddings wholesale for full documents is rarely going to get you great results (this varies a lot between document sizes and domains). So as a result, you're almost always going to be doing ahead-of-time chunking (or summarization/NER/etc) not because you have to due to document size, but because your search performance demands it. Frequently this involves more than one embeddings model for capturing different semantics or supporting different tasks, not to mention reranking after the initial sweep.

That's the complexity that I think is worth tackling in a paid product offering, but the current state of the module described in the article isn't really competitive with the rest of the field in that respect IMHO.


(Post co-author) We absolutely agree that chunking is critical for good RAG. What I think you missed in our post is that the vectorizer allows you to configure a chunking strategy of your choice. So you store the full doc, and then the system will chunk and embed it for you. We don't blindly embed the full document.


I didn't miss that detail; I just don't think chunking alone is where the complexity lies, and the pgai feature set isn't really differentiated at all from other offerings in that context. My commentary about full documents was responding directly to your comment here in this thread more so than to the article (you claimed Chroma can only insert chunks, which isn't accurate, and I expanded from there).


This is true only in the trivial case where the entire document fits in a single chunk, correct?

That seems like a meaningful distinction.


Yes, that is correct, but my position (which has perhaps been poorly articulated) is that in the non-trivial instances, it is a distinction without a difference in the greater context of the RAG stack and related pipelines.

Just allowing for a chunking function to be defined which is called at insertion time doesn't really alleviate the major pain points inherent to the process. It's a minor convenience, but in fact, as pointed out elsewhere in this thread by others, it's a convenience you can afford yourself in a handful of lines of code that you only ever have to write once.


For anyone that wants to see how this compares on ann-benchmarks.com, the project is called 'sptag'.

This article depicts a perfect world and links it to a solution which is fairly distant from that. I understand the wishful thinking of having a "magic box" for search infrastructure, but as someone who worked on web-scale search at Google for years, I'd say the reality isn't that simple.

1. The real problem in embedding data lifecycle management is changing the embedding model, which involves a migration process. You can't really solve that by simply streamlining the vectorization and suddenly using a new model for newly ingested data. You need the non-fancy migration process: create a new collection, batch-generate new vectors with the new model, port all of them there, meanwhile doing dual writes for all newly ingested documents, and switch search traffic to the new collection once batch ingestion is done. Streamlining vectorization as part of the ingestion call doesn't solve that. It is still an interesting feature for lowering mental complexity, which is why at Zilliz (a vector db startup) our product https://zilliz.com/zilliz-cloud-pipelines supports that, and our open-source Milvus plans to support streamlined API calls to embedding services in version 3.0: https://milvus.io/docs/roadmap.md. That said, I must state that changing the embedding model is more difficult than the article makes it feel. We provide tools like bulk import to batch-port a whole dataset of vector embeddings with other metadata like original text or image URLs. But solving the problem with one "magic box" sounds unrealistic to me, at least for production use cases.

2. The article linked to an implementation that does naive doc processing like chunking, but in reality people need more flexibility on parsing, doc chunking, and choice of embedding models. That's why people need tools like LlamaIndex and unstructured.io, and why they write a doc processing pipeline for that.

3. Most vector DBs support storing original unstructured data with the vector embedding. For example, in Milvus users usually ingest the text, the vector of the text, and other labels like author, title, chunk id, and publish_time. The ingestion of that data is naturally atomic as it's one single row of data. "Data and embeddings getting out of sync" is just a false claim. When you update the document, you remove the old rows and add new rows with bundled new text and new vector. I'm not sure how it could be out of sync. The real problem is #1, the migration problem if you want to change the embedding model, in which case you need to wipe out all existing data's vectors as they are not compatible with the new embedding model, so you can't blend some docs with the old embedding and some with the new. You need to migrate the whole dataset to another new collection and decide when to start serving queries from the new collection.

4. Lastly, the consistency/freshness problem in search usually resides between the source data, say files on S3 or a Zendesk table, and the serving stack, say a vector db. Thus, to build production-ready search, you need a sophisticated syncing mechanism to detect data changes from the source (S3, business apps, or even the world wide web), sync them to the search indexing pipeline for processing, and write the updates to the serving stack. Tools like https://www.fivetran.com/blog/unlock-ai-powered-search-with-... can offer some help in avoiding the engineering complexity of implementing that in house.


> the responsibility for generating and updating them as the underlying data changes can be handed over to the database management system

And now we shift ever so slightly back towards logic in the DB. I for one am thrilled; there's no reason other than unfamiliarity not to let the RDBMS perform functions it's designed to do. As long as these offloads are documented in code, embrace not needing to handle it in your app.


Or you can use Postgres to store the original text, the metadata and the embedding in a single table and then do hybrid queries against them (with pre- or post-filtering, optimized automatically).

Shameless plug:

BM25 search implemented in PL/pgSQL: https://github.com/jankovicsandras/plpgsql_bm25

faster BM25 search algorithms in Python: https://github.com/jankovicsandras/bm25opt


BM25 implemented in Postgres as a Postgres extension: https://github.com/paradedb/paradedb (disclaimer: I work for ParadeDB)


Yeah, when I implemented RAG myself I wondered why people were storing the text separately; it doesn't make any sense to me!

It's not that "vector databases are the wrong abstraction", it's that "vector data is not an abstraction at all". It's just a data type with some operators, you are responsible for architecting that tool into your system in a coherent way.



