Pg_vectorize: Vector search and RAG on Postgres (github.com/tembo-io)
295 points by samaysharma 9 months ago | 116 comments



I did a hobby RAG project a little while back, and I'll just share my experience here.

1. First ask the LLM to answer your questions without RAG. It is easy to do and you may be surprised (I was, but my data was semi-public). This also gives you a baseline to beat.

2. Chunking of your data needs to be smart. Just chunking every N characters wasn't especially fruitful. My data was a book, so it was hierarchical (by heading level). I would chunk by book section and hand it to the LLM.

3. Use the context window effectively. There is a weighted knapsack problem here: chunks of various sizes (chars/tokens) with various weightings (quality of match). If your data supports it, the problem is also hierarchical. For example, I have 4 excellent matches in this chapter, so should I include each match, or should I include the whole chapter? (See the sketch after this list.)

4. Quality of input data counts. I spent 30 minutes copy-pasting the entire book into markdown format.
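
(Sketch for point 3.) A rough greedy approximation of that packing step; the chunk texts, token counts, and match scores below are made-up inputs, and a real system would get the scores from the retriever and the counts from a tokenizer:

    # Greedy approximation of the "weighted knapsack" packing from point 3.
    def pack_context(chunks, budget_tokens):
        """chunks: dicts with 'text', 'tokens' (int), 'score' (higher = better match)."""
        ranked = sorted(chunks, key=lambda c: c["score"] / c["tokens"], reverse=True)
        selected, used = [], 0
        for chunk in ranked:                       # fill by relevance-per-token density
            if used + chunk["tokens"] <= budget_tokens:
                selected.append(chunk)
                used += chunk["tokens"]
        return selected

    context = pack_context(
        [
            {"text": "Section 2.1 ...", "tokens": 400, "score": 0.91},
            {"text": "Section 4.3 ...", "tokens": 900, "score": 0.88},
            {"text": "Appendix A ...", "tokens": 250, "score": 0.55},
        ],
        budget_tokens=1200,
    )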

This was only a small project. I'd be interested to hear any other thoughts/tips.


Pro tip: you can use small models like phi-2 to do semantic chunking rather than just chunking based on size, and chunking works even better if you book-end chunks with summaries of the text coming before/after to enrich the context.

Another pro tip, you can again use a small model to summarize/extract RAG content to get more actual data in the context.


Could you share a bit more about semantic chunking with Phi? Any recommendations/examples of prompts?


Sure, it'll look something like this:

""" Task: Divide the provided text into semantically coherent chunks, each containing between 250-350 words. Aim to preserve logical and thematic continuity within each chunk, ensuring that sentences or ideas that belong together are not split across different chunks.

Guidelines: 1. Identify natural text breaks such as paragraph ends or section divides to initiate new chunks. 2. Estimate the word count as you include content in a chunk. Begin a new chunk when you reach approximately 250 words, preferring to end on a natural break close to this count, without exceeding 350 words. 3. In cases where text does not neatly fit within these constraints, prioritize maintaining the integrity of ideas and sentences over strict adherence to word limits. 4. Adjust the boundaries iteratively, refining your initial segmentation based on semantic coherence and word count guidelines.

Your primary goal is to minimize disruption to the logical flow of content across chunks, even if slight deviations from the word count range are necessary to achieve this. """
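
For what it's worth, here's roughly how you might run that prompt against a window of text with phi-2 via Hugging Face transformers. The generation settings and window size are illustrative, not tuned, and you'd still want to parse and validate the output afterwards:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-2")
    model = AutoModelForCausalLM.from_pretrained("microsoft/phi-2", torch_dtype="auto")

    chunking_prompt = "Task: Divide the provided text into semantically coherent chunks ..."  # the prompt above
    # Keep the window small: the re-emitted chunks also count against phi-2's 2k
    # context, so prompt + input + output all have to fit together.
    window = "... a few hundred tokens of source text ..."

    inputs = tokenizer(chunking_prompt + "\n\nText:\n" + window, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    # Decode only the newly generated tokens (the chunked output), not the echoed prompt.
    print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))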


Might sound like a rookie question, but curious how you'd tackle semantic chunking for a hefty text, like a 100k-word book, especially with phi-2's 2048 token limit [0]. Found some hints about stretching this to 8k tokens [1] but still scratching my head on handling the whole book. And even if we get the 100k words in, how do we smartly chunk the output into manageable 250-350 word bits? Is there a cap on how much the output can handle? From what I've picked up, a neat summary ratio for a large text without missing the good parts is about 10%, which translates to around 7.5K words or over 20 chunks for the output. Appreciate any insights here, and apologies if this comes off as basic.

[0]: https://huggingface.co/microsoft/phi-2

[1]: https://old.reddit.com/r/LocalLLaMA/comments/197kweu/experie...


Wild speculation - do you think there could be any benefit from creating two sets of chunks with one set at a different offset from the first? So like, the boundary between chunks in the first set would be near the middle of a chunk in the second set?


No, it's better to just create summaries of all the chunks, and return summaries of chunks that are adjacent to chunks that are being retrieved. That gives you edge context without the duplication. Having 50% duplicated chunks is just going to burn context, or force you to do more pre-processing of your context.
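
A sketch of that neighbour-summary idea, assuming chunks and their summaries were stored as parallel lists at indexing time:

    def with_neighbor_summaries(hit_indices, chunks, summaries, window=1):
        """Return each retrieved chunk plus summaries of its neighbours for edge context."""
        results = []
        for i in hit_indices:
            before = [summaries[j] for j in range(max(0, i - window), i)]
            after = [summaries[j] for j in range(i + 1, min(len(chunks), i + 1 + window))]
            results.append({"chunk": chunks[i], "context_before": before, "context_after": after})
        return results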


This just isn't working for me, phi-2 starts summarizing the document I'm giving it. I tried a few news articles and blog posts. Does using a GGUF version make a difference?


Depending on the number of bits in the quantization, for sure. The most common failure mode should be minor restatements which you can choose to ignore or not.


Any comments about using Sparse Priming Representations for achieving similar things?


That looks like it'd be an adjunct strategy IMO. In most cases you want to have the original source material on tap, it helps with explainability and citations.

That being said, it seems that everyone working at the state of the art is thinking about using LLMs to summarize chunks, and summarize groups of chunks in a hierarchical manner. RAPTOR (https://arxiv.org/html/2401.18059v1) was just published and is close to SoTA, and from a quick read I can already think of several directions to improve it, and that's not to brag but more to say how fertile the field is.


Is phi actually able to follow those instructions? How do you handle errors?


Whether or not it follows the instructions as written, it produces good output as long as the chunk size stays on the smaller side. You can validate that all the original text is present in the chunks and that no additional text has been inserted easily enough and automatically re-prompt.
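
One simple way to do that validation (a sketch; it assumes the chunker splits on whitespace boundaries): normalise whitespace, re-join the chunks, and compare against the original. If the model restated or dropped anything, the check fails and you re-prompt.

    import re

    def normalize(text):
        return re.sub(r"\s+", " ", text).strip()

    def chunks_are_faithful(original, chunks):
        # True only if the chunks reproduce the input exactly (up to whitespace).
        return normalize(" ".join(chunks)) == normalize(original)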


Can you speak a bit to the speed/slowness of doing such chunking? We recently started using LLMs to clean the text (data quality/text cleanliness is a problem for us), and it has increased our indexing time a lot.


It's going to depend on what you're running on, but phi2 is pretty fast so you can reasonably expect to be hitting ~50 tokens a second. Given that, if you are ingesting a 100k token document you can expect it to take 30-40 minutes if done serially, and you can of course spread stuff in parallel.


thanks for the info--good to know we aren't the only ones contending with speed for large documents lol


I've heard of people having success with methods like this. Would be awesome if we found a way to build that into this project :)


>tree-based approach to organize and summarize text data, capturing both high-level and low-level details.

https://twitter.com/parthsarthi03/status/1753199233241674040

nlm-ingestor processes documents, organizing content and improving readability: it handles sections, paragraphs, links, tables, lists, and page continuations; removes redundancies and watermarks; and applies OCR, with additional support for HTML and other formats through Apache Tika:

https://github.com/nlmatics/nlm-ingestor


I don't understand. Why build up text chunks from different, non-contiguous sections?


On the printed page, not everything is laid out linearly. The main text is often set in columns, the flow can be offset by pictures with captions, additional text can be placed in inserts, etc.

You need a human eye to figure that out and this is the task nlm-ingestor tackles.

As for the content, semantic contiguity is not always guaranteed. A typical example of this is conversation, where people engage in narrative/argumentative competitions. Topics get nested as the conversation advances, along the lines of "Hey, this reminds me of ...", building up a stack that gets popped once subtopics have been exhausted: "To get back to the topic of ...".

This is explored at length by Kerbrat-Orecchioni in:

https://www.cambridge.org/core/journals/language-in-society/...

And an explanation is offered by Dessalles in:

https://telecom-paris.hal.science/hal-03814068/document


If those non-contiguous sections share similar semantic/other meaning, it can make sense from a search perspective to group them?


it starts to look like a graph problem


>> Just chunking every N characters wasn't especially fruitful

Is there any science associated with creating effective embedding sets? For a book, you could do every sentence, every paragraph, every page or every chapter (or all of these options). Eventually people will want to just point their RAG system at data and have everything work.


The easy answer is just use a model to chunk your data for you. Phi-2 can chunk and annotate with pre/post summary context in one pass, and it's pretty fast/cheap.

There is an optimal chunk size, which IIRC is ~512 tokens depending on some factors. You could hierarchically model your data with embeddings by chunking the data, then generating summaries of those chunks and chunking the summaries, and repeating that process ad nauseam until you only have a small number of top-level chunks.
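
A rough sketch of that hierarchy-building loop; chunk_text and summarize are stand-ins for your splitter and summarisation model, and the loop assumes summaries are shorter than their sources so it terminates:

    def build_hierarchy(text, chunk_text, summarize, max_top_level=8):
        """Returns levels: levels[0] = finest chunks, levels[-1] = top-level overview."""
        levels = []
        chunks = chunk_text(text)
        while len(chunks) > max_top_level:
            levels.append(chunks)
            summaries = [summarize(c) for c in chunks]   # embed these alongside the raw chunks
            chunks = chunk_text(" ".join(summaries))
        levels.append(chunks)
        return levels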


This is an example of knowledge transfer from a model. I used a similar approach to augment chunked texts with questions, summaries, and keyterms (which require structured output from the LLM). I haven't tried using a smaller model to do this as GPT3.5 is fast and cheap enough, but I like the idea of running a model in house to do things like this.


How does this work when there is a limited context window? Do you do some pre-chunking?


Phi can ingest 2k tokens and the optimal chunk size is between 512-1024 depending on the model/application, so you just give it a big chunk and tell it to break it down into smaller chunks that are semantically related, leaving enough room for book-end sentences to enrich the context of the chunk. Then you start the next big chunk with the remnants of the previous one that the model couldn't group.


Isn't "give it a big chunk" just the same problem at a higher level? How do you handle, say, a book?


You don't need to handle a whole book; the goal is to chunk the book into chunks of the correct size, which is less than the context size of the model you're using to chunk it semantically. When you're ingesting data, you fill up the chunker model's context, and it breaks that up into smaller, self-relevant chunks and a remainder. You then start from the remainder and slurp up as much additional text as you can to fill the context and repeat the process.
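
In code, that ingestion loop looks something like this sketch; semantic_chunk stands in for the phi-2 call and is assumed to return (chunks, remainder_text), and word counts approximate tokens:

    def ingest(words, semantic_chunk, window_words=1200):
        """words: the document split into words; returns the final list of chunks."""
        chunks, carry = [], ""
        cursor = 0
        while cursor < len(words):
            take = max(1, window_words - len(carry.split()))   # assumes the remainder stays small
            window = (carry + " " + " ".join(words[cursor:cursor + take])).strip()
            cursor += take
            new_chunks, carry = semantic_chunk(window)
            chunks.extend(new_chunks)
        if carry:
            chunks.append(carry)                               # whatever is left at the end
        return chunks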


It seems like you put a lot of thought and effort into this. Were you happy with the results?

If you could put the entire book into the context window of Gemini (1M tokens), how do you think it would compare with your approach? It would cost like $15 to do this, so not cheap but also cost effective considering the time you’ve spent chunking it.


At the time I was working with either GPT 3.5 or 4, and, looking at my old code, I was limiting myself to around 14k tokens.

> Were you happy with the results?

It was somewhat ok. I was testing the system on a (copyrighted) psychiatry textbook and getting feedback on the output from a psychotherapist. The idea was to provide a tool to help therapists prep for a session, rather than help patients directly.

As usual it was somewhat helpful, but a little too vague sometimes, or it missed important specific information for the situation.

It is possible that it could be improved with a larger context window, having more data to select, or different prompting. But the frequent response was along the lines of, "this is good advice, but it just doesn't drill down enough".

Ultimately we found that GPT3.5/4 could produce responses that matched or exceeded our RAG-based solution. This was surprising as it is quite a domain-specific area, but it also seemed pretty clear that GPT must have been trained on data very similar to the (copyrighted) content we were using.

Further steps would be:

1. Use other LLM models. Is it just GPT3.5/4 that is reluctant to drill down?

2. Use specifically trained LLMs (/LoRA) based on the expected response style

I'd be careful of entering this kind of arms race. It seems to be a fight against mediocre results, and at any moment OpenAI et al. may release a new model that eats your lunch.


The thing you have to keep in mind is that OpenAI (and all the big tech companies) are risk averse. That's the reason that model alignment is so overbearing. The foundation models are also trying to be good at everything, which means they won't be sensitive to the nuances of specific types of questions.

RAG and chatbot memory systems are here to stay, and they will always provide a benefit.


Did you try or consider fine-tuning Gpt 3.5 as a complementary approach?


But you would need to spend the $15 on every request whereas the RAG approach would be most likely significantly cheaper per request.


I think people assume RAG will just be a vector search problem. You take the user's input and somehow get relevant context.

It's really hard to coordinate between LLMs, a vector store, chunking the embeddings, turning users' chats into query embeddings (and other queries), etc. It's a complicated search relevance problem that's extremely multifaceted and very use-case and domain specific. Just doing search well from a search bar, without all this complexity, is hard enough. And vector search is just one data store you might use alongside others (keyword search, SQL, whatever else).

I say this to get across Postgres is actually uniquely situated to bring a multifaceted approach that's not just about vector storage / retrieval.


Yes ... and no.

You're right for many applications. Yet, for many other applications, simply converting each document into an embedding, converting the search string into an embedding, and doing a search (e.g. cosine similarity), is all that is needed.

My first attempt at RAG was just that, and I was blown away at how effective it was. It worked primarily because each document was short enough (a few lines of text) that you didn't lose much by making an embedding for the whole document.

My next attempt failed, for the reasons you mention.

Point being: Don't be afraid to give the simple/trivial approach a try.
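
For reference, the whole of that simple approach fits in a few lines; the model name here is just a common default from sentence-transformers, not a recommendation:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")
    docs = ["First short document ...", "Second short document ...", "Third ..."]
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    query_vec = model.encode(["what does the second document say?"], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec          # cosine similarity, since vectors are normalised
    top = np.argsort(-scores)[:3]          # top documents to hand to the LLM
    print([docs[i] for i in top])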


In essence, what you are saying is it worked for a demo but not for an actual application


Most people need it just for a demo, though. I'm not developing any applications for it, but I have several ideas in mind for RAG just for my personal use case on my local machine.

The one that worked, BTW, would work almost unchanged in an application. I may put it online one day as several people have expressed interest, simply because there are no existing sites that do the job as well as my POC. The challenges will mostly be the usual web app challenges (rate limits, etc), but the actual RAG component doesn't need improvement to be the "best" one out there.


Totally agree - the "R" in RAG is about retrieval which is a complex problem and much more than just similarity between embedding vectors.


Not sure if postgres is uniquely situated, elasticsearch can do everything you mentioned and much more.


Yes, ES/OS has been especially great for the louie.ai team. We need to deal with a lot of data - we handle talking to real-time news, logs, social media, alerts, threat feeds, etc - and these systems were already scaled for large document corpuses with rich metadata even before having the vector index added. E.g., imagine having to deal with hot vs cold storage for different queries, and auto-graduating data.

When we looked at the pricing for vector DBs and less mature DB extensions, the real playing field shrank pretty quickly. (Especially if you want a managed and unified system wherever possible vs an NIH Frankenstein.)


Funny this has come up – I've literally just finished building a RAG search with Postgres this morning.

I run a directory site for UX articles and tools, which I've recently rebuilt in Django. It has links to over 5000 articles, making it hard to browse, so I thought it'd be fun to use RAG with citations to create a knowledge search tool.

The site fetches new articles via RSS, which are chunked, embedded and added to the vector store. On conducting a search, the site returns a summary as well as links to the original articles. I'm using LlamaIndex, OpenAI and Supabase.

It's taken a little while to figure out as I really didn't know what I was doing and there's loads of improvements to make, but you can try it out here: https://www.uxlift.org/search/

I'd love to hear what you think.


This is awesome! I just tried it out, and I have a few bits of feedback:

Show the question I asked. I asked something like "What are the latest trends in UX for mobile?" and got a good answer that started with "The latest trends in UX design for mobile include..." but that's not the same as just listing the exact thing I asked.

Either show a chat log or stuff it into the browser history or...something. There's no way to see the answer to the question I asked before this one.

I just went back and did another search, and the results had 6 links, while the response (only) had cites labeled [7] [6] and [8] in that order.

Again, great stuff!


Hey thanks for trying it out and giving your feedback, much appreciated!

Good idea about showing the question, I've just added that to the search result.

Yes, I agree a chat history is really needed - it's on the list, along with a chat-style interface.

Also need to figure out what's going on with the citations in the text - the numbers seem pretty random and don't link up to the cited articles.

Thanks again!


Could you share more about your strategy or the approach you took for chunking the articles? I'm curious about the criteria or methods you used to decide the boundaries of each chunk and how you ensured the chunks remained meaningful for the search functionality. Thanks!


I wrote a C# library to do this, which is similar to other chunking approaches that are common, like the way langchain does it: https://github.com/drittich/SemanticSlicer

Given a list of separators (regexes), it goes through them in order and keeps splitting the text by them until the chunk fits within the desired size. By putting the higher level separators first (e.g., for HTML split by <h1> before <h2>), it's a pretty good proxy for maintaining context.

Which chunk size you decide on largely depends on your data, so I typically eyeball a sample of the results to determine if the splitting is satisfactory.

You can see the separators here: https://github.com/drittich/SemanticSlicer/blob/main/Semanti...
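
For anyone not on .NET, the general idea translates to a few lines of Python; this is just a rough analogue of the approach, not the library itself, and the separators are illustrative:

    import re

    SEPARATORS = [r"\n<h1[^>]*>", r"\n<h2[^>]*>", r"\n\n", r"(?<=[.!?])\s+"]

    def split_to_size(text, max_chars=2000, separators=SEPARATORS):
        # Split by the highest-level separator first; recurse with the next one
        # only for pieces that are still too big. Oversized leftovers pass through.
        if len(text) <= max_chars or not separators:
            return [text]
        chunks = []
        for part in re.split(separators[0], text):
            if len(part) <= max_chars:
                chunks.append(part)
            else:
                chunks.extend(split_to_size(part, max_chars, separators[1:]))
        return [c for c in chunks if c.strip()]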


As this was my first attempt, I decided to take a pretty basic approach, see what the results were like and optimise it later.

Content is stored in Django as posts, so I wrote a custom document reader that created a new LlamaIndex document for each post, attaching the post id, title, link and published date as metadata. This gave better results than just loading in all the content as a text or CSV file, which I tried first.

I did try with a bunch of different techniques to split the chunks, including by sentence count and a larger and smaller number of tokens. In the end I decided to leave it to the LlamaIndex default just to get it working.


Any chance you'd Open Source it?


Yes, I'm planning to write an article showing the code, mostly so I don't forget how I did it but hopefully it'll also be useful to others.

I relied on a lot of articles and code examples when putting it together so I'm happy to do the same.


Looking forward to this. We're running a Django codebase and would like to pull in Confluence articles and run GPT queries against them.


Are there any examples for when RAG powered by vectorsearch works really well?

I tried best practices like having the LLM formulate an answer and using that answer for the search (instead of the question), and trying different chunk sizes and so on, but never got it to work in a way where I would consider the results "good".

Maybe it was because of the type of data or the capabilities of the model at the time (GPT 3.5 and GPT 4)?

By now context windows with some models are large enough to fit lots of context directly into the prompt which is easier to do and yields better results. It is way more costly but cost is going down fast so I wonder what this means for RAG + vectorsearch going forward.

Where does it shine?


We built a RAG system for one of our clients in the aviation industry: >20m technical support messages and associated answers/documentation, and we're seeing between 60-80% recall for the top 3 documents when testing. It definitely pays off to use as much of the structure you find in the data as possible, plus combining multiple strategies (knowledge graph for structured data, text embeddings across data types, filtering and boosting based on experts' experience, etc). The baseline pure RAG-only approach was under 25% recall.


We're getting very solid results.

Instead of performing rag on the (vectorised) raw source texts, we create representations of elements/"context clusters" contained within the source, which are then vectorised and ranked. That's all I can disclose, hope that helps.


Thanks for your message. I should say that giving your comment to GPT-4, with a request for a solution architecture that could produce good results based on the comment, produced a very detailed, fascinating solution. https://chat.openai.com/share/435a3855-bf02-4791-97b3-4531b8...


If only the thing could speak and summarize in plain English instead of hollow, overly verbose bulleted lists.


A whole lot of noise


Maybe, but it expanded on the idea in the vague comment and, in doing so, introduced me to the idea of embedding each sentence, then clustering the sentences, then taking the centroid of each cluster as the embedding to index/search against. I had not thought of doing that before.
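
A minimal sketch of that sentence-clustering idea (model choice and cluster count are arbitrary here):

    import numpy as np
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    model = SentenceTransformer("all-MiniLM-L6-v2")
    sentences = ["sentence one ...", "sentence two ...", "sentence three ..."]
    vecs = model.encode(sentences, normalize_embeddings=True)

    kmeans = KMeans(n_clusters=2, n_init=10).fit(vecs)
    centroids = kmeans.cluster_centers_                              # one vector per "context cluster"
    members = [np.where(kmeans.labels_ == k)[0] for k in range(2)]   # map centroids back to sentences
    # Index the centroids; at query time, resolve the best-matching centroid to its member sentences.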


Sounds a little like this recent paper:

"RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval"

https://arxiv.org/abs/2401.18059


After seeing raw source text performance, I agree that representational learning of higher-level semantic "context clusters" as you say seems like an interesting direction.


I had similar experiences. Can't understand all the hype around RAG when the results are so bad.


I'm using it for an internal application and the results so far are amazing, considering it was hacked together in a few hours.

It helps a lot with discovery. We have some large PDFs and also a large amount of smaller PDFs. Simply asking a question, getting an answer with the exact location in the pdf is really helpful.


In our experience, simple RAG is often not that helpful, as the questions themselves are not represented in the vector space (unless you use an FAQ dataset as input). Either preprocessing by an LLM or specific context handling needs to be done.


have you tried Vectara?


A few relevant blog posts on using pg_vectorize:

* Doing vector search with just 2 commands https://tembo.io/blog/introducing-pg_vectorize

* Connecting Postgres to any huggingface sentence transformer https://tembo.io/blog/sentence-transformers

* Building a question answer chatbot natively on Postgres https://tembo.io/blog/tembo-rag-stack


I don't know how to articulate the uncomfortable feeling I'd be having about something 'inside' the database doing the download and making requests to other systems outside a boundary. It might be a security threat or just my inexperience; how common is it for Postgres extensions to do this?


I feel the same way. I think it's useful to think of a database as simply a fast storage and lookup for data. When it starts to run network code or run doom it breaks our expectations. This will confuse people on your team, and confusion creates security problems and tech debt.

It's cool that postgresql can do this, and I've even bookmarked the project. But in my projects I expect all network code and API access like this to be on the same application layer.


There's a fair amount of skepticism towards the efficacy of RAG in these comments—often in contrast to simply using a model with a huge context window to analyze the corpus in one giant chunk. But that will not be a viable alternative in all use cases.

For example, one might need to analyze/search a very large corpus composed of many documents which, as a whole, is very unlikely to fit within any realistic context window. Or one might be constrained to only use local models and may not have access to models with these huge windows. Or both!

In cases like these, can anyone recommend a more promising approach than RAG?


RAG.

Large context windows can cause LLMs to get confused or to effectively use only part of the window. I think Google's new 1M context window is promising for its recall, but it needs more testing to be certain.

Other LLMs have shown reduced performance on larger contexts.

Additionally, the LLM might discard instructions and hallucinate.

Also, if you are concatenating user input, you are opening yourself up to prompt injection.

The larger the context, the larger the prompt injection can be.

There is also the cost. Why load up 1M tokens and pay for a massive request when you can load a fraction that is actually relevant?

So regardless of whether Google's 1M context is perfect and others can match it, I would steer away from just throwing everything into the massive context.

To me, it’s easier to mentally think of this as SQL query performance. Is a full table scan better or worse than using an index? Even if you can do a full table scan in a reasonable amount of time, why bother? Just do things the right way…


RAG is also so much faster, cheaper, and more energy efficient than running a giant transformer over every single document; it's not remotely close.


The high-level API seems very smooth for quickly iterating on RAG testing. It seems great for prototyping; however, I have doubts about whether it's a good idea to hide the LLM-calling logic in a DB extension.

Error handling when you get rate limited, the token has expired, or the request exceeds the token limit would be problematic, and from a security point of view it requires your DB to directly call OpenAI, which can also be risky.

Personally I haven't used that many Postgres extensions, so perhaps these risks are mitigated somehow that I don't know?


We are working on a 'self-hosted' alternative to OpenAI. The project already has that for the embeddings, i.e. you specify an open-source model from Hugging Face / sentence-transformers, then API calls get routed to that service, which you're self-hosting in a container next to Postgres. This is how the docker-compose example in the project readme is set up. We'll be doing the same pattern but for chat completion models.

On Tembo cloud, we deploy this as part of the VectorDB and RAG Stacks. So you get a dedicated Postgres instance, and a container next to Postgres that hosts the text-to-embeddings transformers. The API calls/data never leave your namespace.


I would agree with you. Similar to LangChain in my mind: some interesting ideas, but it's a lot more powerful to implement it on your own. I would much rather use pgvector directly.


Sorry if I'm completely missing it, but I noticed in the code there is something around chat:

https://github.com/tembo-io/pg_vectorize/blob/main/src/chat....

This would lead me to believe there is some way to actually use SQL for not just embeddings, but also prompting/querying the LLM... which would be crazy powerful. Are there any examples on how to do this?


There is a RAG example here https://github.com/tembo-io/pg_vectorize?tab=readme-ov-file#...

You can provide your own prompts by adding them to the `vectorize.prompts` table. There's an API for this in the works. It is poorly documented at the moment.


What is RAG in this context? I only know it as red, amber, green...


I found this explanation: https://www.promptingguide.ai/techniques/rag

> General-purpose language models can be fine-tuned to achieve several common tasks such as sentiment analysis and named entity recognition. These tasks generally don't require additional background knowledge.

> For more complex and knowledge-intensive tasks, it's possible to build a language model-based system that accesses external knowledge sources to complete tasks. This enables more factual consistency, improves reliability of the generated responses, and helps to mitigate the problem of "hallucination".

> Meta AI researchers introduced a method called Retrieval Augmented Generation (RAG) to address such knowledge-intensive tasks. RAG combines an information retrieval component with a text generator model. RAG can be fine-tuned and its internal knowledge can be modified in an efficient manner and without needing retraining of the entire model.


Thank you!


Given the context of a vector DB, it would be Retrieval-Augmented Generation (RAG) [1]

1: https://medium.com/@mutahar789/optimizing-rag-a-guide-to-cho...


Thanks for the information :)


Has anyone used sqlite for storing embeddings? Are there any extensions or tips for making it easier?

I have a small command line python app that uses sqlite for a db. Postgres would be a huge overkill for the app

PS: is sqlite-vss good? https://github.com/asg017/sqlite-vss


We do, from years ago, with embeddings stored as blobs (well before sqlite_vss). No SQLite extension, but some user-defined functions for some distance operations; for others we are able to load all the weights into memory and perform the ops there (the data is small enough).

Another comment mentioned meilisearch. You might be interested in the fact that a recent rearchitecture of their codebase split it into a library that can be embedded into your app without needing a separate process.
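
A minimal sketch of the blob-plus-UDF setup described above (stdlib sqlite3 plus numpy; the brute-force scan is fine at small scale):

    import sqlite3
    import numpy as np

    def cosine_sim(blob_a, blob_b):
        a = np.frombuffer(blob_a, dtype=np.float32)
        b = np.frombuffer(blob_b, dtype=np.float32)
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    conn = sqlite3.connect("embeddings.db")
    conn.create_function("cosine_sim", 2, cosine_sim)
    conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)")

    vec = np.random.rand(384).astype(np.float32)   # stand-in for a real embedding
    conn.execute("INSERT INTO docs (text, emb) VALUES (?, ?)", ("hello world", vec.tobytes()))

    query = np.random.rand(384).astype(np.float32).tobytes()
    rows = conn.execute(
        "SELECT id, text, cosine_sim(emb, ?) AS score FROM docs ORDER BY score DESC LIMIT 5",
        (query,),
    ).fetchall()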


Personally, I'd use meilisearch in your case (though, obv, I know next to nothing about all the requirements and details of your case).

Stuffing everything in PG makes sense, because PG is Good Enough in most situations that one throws at it, but hardly "very good". It's just "the best" because it is already there¹. PG truly is "very good", maybe even "the best", at relational data. But in all other cases, there are better alternatives out there.

Meilisearch is small. It's not embeddable (last time I looked), so it does run as a separate service, but does so with minimal effort and overhead. And aside from being blazingly fast at searching (often faster than Elasticsearch; definitely much easier and lighter), it has stellar vector storage.

¹ For example, I wrote github.com/berkes/postgres_key_value, a Ruby (so, slow) hashtable-like interface that uses existing Postgres tech, for a situation where Redis or Memcached would have been a much better candidate, but where we already had Postgres and, with this library, could postpone the introduction, complexity and management of a Redis server.

Edit: forgot this footnote.


Meilisearch is embeddable as of recently, but you have to do some legwork since it’s not the intended use case.


Have you used it for hybrid search? What kind of performance do you observe?


Duckdb seems pretty reasonable for embeddings, as there's a native vector type and you can insert it from a dataframe. For similarity, there's array_cosine and array_dot_product.


Tembo has been doing very interesting work.


Indeed! Everyone who pushes Postgres is a win in my book.


Why?


Because it empowers developers instead of salespeople?


I’m using pg_vector in supabase and it seems great in a prototype form.

Has anyone tried using it at scale? How does it do vs Pinecone / Cloudflare's vector search?


Pg_vector is great! I’m using it at scale here:

https://trendingpapers.com/

But it’s Postgres + pg_vector only, no Supabase


There's an issue in the pgvector repo about someone having several ~10-20million row tables and getting acceptable performance with the right hardware and some performance tuning: https://github.com/pgvector/pgvector/issues/455

I'm in the early stages of evaluating pgvector myself, but having used Pinecone, I currently like pgvector better because it is open source. The indexing algorithm is clear; one can understand and modify the parameters. Furthermore, the database is PostgreSQL, not a proprietary document store. When the other data in the problem is stored relationally, it is very convenient to have the vectors stored like this as well. And PostgreSQL has good observability and metrics. I think when it comes to flexibility for specialized applications, pgvector seems like the clear winner. But I can definitely see Pinecone's appeal if vector search is not a core component of the problem/business, as it is very easy to use and scales very easily.


I’ve used pgvector at scale (hundreds of millions of rows) and while it does work quite well, you’ll hit some rough edges:

* Reliance on vacuum means that you really have to watch/tune autovacuum or else recall will take a big hit, and the indices are huge and hard to vacuum.

* No real support for doing ANN on a subset of embeddings (ie filtering) aside from building partial indices or hoping that oversampling plus the right runtime parameters get the job done (very hard to reason about in practice), which doesn’t fit many workloads.

* Very long index build times makes it hard to experiment with HNSW hyperparameters. This might get better with parallel index builds, but how much better is still TBD.

* Some minor annoyances around database api support in Python, e.g. you’ll usually have to use a DBAPI-level cursor if you’re working with numpy arrays on the application side.

That being said, overall the pgvector team are doing an awesome job considering the limited resources they have.


fwiw, we have plenty of companies using pgvector at scale, in production - some of them migrating their workload directly from pinecone. I’ll see if I can dig up some of the high-level numbers later today


I read this paper, which claims there are no fundamental limitations to Postgres' vector support down the line. Pretty excited for the work Supabase, Neon, Tembo, ourselves at ParadeDB, and many others are doing to keep pushing it forward.

Definitely more and more vector production workloads coming to Postgres

Paper: https://www.cs.purdue.edu/homes/csjgwang/pubs/ICDE24_VecDB.p...


There have been a lot of these RAG abstractions posted recently. As someone working on this problem, it's unclear to me whether the calculation and ingestion of embeddings from source data should be abstracted into the same software package as their search and retrieval. I guess it probably depends on the complexity of the problem. This one does seem interesting in that it makes intuitive sense to have a built-in DB extension if the source data itself is coming from the same place the embeddings are going. But so far I have preferred a separation of concerns in this respect, as it seems that in some cases the models will be used to compute embeddings outside the DB context (for example, the user's search query needs to get vectorized; why not have the frontend and the backend query the same embedding service?). Anyone else have thoughts on this?


It's certainly up for debate and there is a lot of nuance. I think it can simplify the system's architecture quite a bit if all the consumers of the data do not need to keep track of which transformer model to use. After all, once the embeddings are first derived from the source data, any subsequent search query will need to use the same transformer model that created the embeddings in the first place.

I think the same problem exists with classical/supervised machine learning. Most models' features went through some sort of transformation, and when it's time to call the model for inference, those same transformations need to happen again.


How should developers think about using this extension versus PostgresML's pgml extension?


One difference between the two projects is that pg_vectorize does not run the embedding or chat models on the same host as Postgres; rather, they are always separate. The extension makes HTTP requests to those models and provides background workers that help with that orchestration. Last I checked, the PostgresML extension interoperates with a Python runtime on the same host as Postgres.

PostgresML has a bunch of features for supervised machine learning and MLOps, and pg_vectorize does not do any of that. E.g., you cannot train an XGBoost model with pg_vectorize, but PostgresML does a great job with that. PGML is a great extension; there is some overlap, but they are architecturally very different projects.


Is RAG just a fancy term for sticking an LLM in front of a search engine?


I'll also attempt an answer.

You chop up your documents into chunks. You create fancy numbers for those chunks. You take the user's question and find the chunks that kind of match it. You pass the user's question with the document to an LLM and tell it to produce a nice looking answer.

So it's a fancy way of getting an LLM to produce a natural looking answer over your chunky choppy search.
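
The whole pipeline, in miniature (retrieval here is plain cosine similarity; ask_llm stands in for whatever chat completion API you use, and the prompt wording is illustrative):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")

    def answer(question, chunks, ask_llm, top_k=3):
        chunk_vecs = model.encode(chunks, normalize_embeddings=True)
        q_vec = model.encode([question], normalize_embeddings=True)[0]
        top = np.argsort(-(chunk_vecs @ q_vec))[:top_k]              # the "chunky choppy search"
        context = "\n\n".join(chunks[i] for i in top)
        prompt = (
            "Answer the question using only the context below.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        return ask_llm(prompt)                                       # the "nice looking answer" step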


The other way round. You stick a search engine in front of your LLM to selectively feed it facts relevant to your query, which the LLM synthesizes in its own words.


To oversimplify, it's more like sending a block of text to an LLM and asking it to answer a query based on that block of text.


I assume you need to split the data into suitably sized database rows matching your model's max length? Or does it do some chunking magic automatically?


There is no chunking built into the postgres extension yet, but we are working on it.

It does check the context length of the request against the limits of the chat model before sending the request, and it optionally allows you to auto-trim the least relevant documents out of the request so that it fits the model's context window. IMO it's worth spending time getting chunks prepared, sized, and tuned for your use case, though. There are some good conversations above discussing methods for this, such as using a summarization model to create the chunks.


How does this compare to other vector search solutions (LanceDB, Chroma, etc.)? Curious to know which one I should choose.


Neat! Any notable gotchas we should know about?


RAG can cost a lot of money if not done thoughtfully. Most embedding and chat completion model providers charge by the token (think number of words in the request). You'll pay to have the data in the database transformed into embeddings; that is mostly a one-time fixed cost. Then every time there is a search query in RAG, that question needs to be transformed. The chat completion model (like GPT-4) will charge for the number of tokens in the request + the number of tokens in the response.

Self-hosting can be a big advantage for cost control, but it can be complicated too. Tembo.io's managed service provides privately hosted embedding models, but does not have hosted chat completion models yet.


Do you work for Tembo?

I only ask because making consumption-based charges for embedding models seem like the norm is off to me; there are a ton of open source, self-managed embedding models that you can use.

You can spin up a tool like Marqo and get a platform that handles making the embedding calls and chunking strategy as well.


Is this only for LLMs?


No. Anything with a vector representation is compatible.


Ok, so I don't understand what the difference is between pg_vectorize and pgvector.

I see that pg_vectorize uses pgvector under the hood, so it does... more things?


pg_vectorize is a wrapper around pgvector. In addition to what pgvector provides, vectorize provides hooks into many methods to generate your embeddings, implements several methods for keeping embeddings updated as your data grows or changes, etc. It also handles the transformation of your search query for you.

For example, it creates the index for you, creates a cron job to keep embeddings updated (or triggers, if that's what you prefer), and handles inserts/upserts as new data hits the table or existing data is updated. When you search for "products for mobile electronic devices", that needs to be transformed into embeddings, and then the vector similarity search needs to happen -- this is what the project abstracts.


High-level API that hand waves the embedding and LLM search/query implementation.


> and LLM search/query implementation.

So part of pg_vectorize is specific to LLMs?


There's an API that abstracts vector search only, vectorize.search(), and that part is not unique to LLMs, though it does require selection of an embedding model. Some people have called embedding models LLMs.

vectorize.rag() requires selection of a chat completion model. That's more specific to LLMs than vector search, IMO.



