How do domain-specific chatbots work? A retrieval augmented generation overview

isoprophlex · 2023-08-25T17:13:20

Some thoughts:

- RAG generally gets you to prototype stage with interesting/demo-able results quickly, however, if your users turn out to submit queries that the embedder finds hard to vectorize (meaning you don't retrieve relevant vectorized chunks of source data) then it quickly becomes painful

- It's easy to go overboard with prompt chains that do summarizing/reducing of the fetched results, or pre-processing prompts that help vectorize the query better (see pain point above); always invest in some testing framework & sane test data upfront so you can avoid the classical data science "tweak it until the demo looks good" trap

- Don't ever use langchain, it's a baroque shitshow of a tool with a cluttered, inconsistent API written with a bad, inefficient coding style

- Paying for bespoke vector databases is probably snake oil and besides the weird pricing only causes pain in the long run when you want to store more than just your embeddings (looking at you 70$+ usd/month Pinecone); postgres with pgvector gets you very far until you hit multiple million documents, and you get all the benefits of a mature, scalable rdbms. Keep your embeddings close to the rest of your data.

czue · 2023-08-25T17:51:38

I largely agree with this (and especially your first and last points!)

I'm intruiged by your second bullet. What have you seen in the way of test frameworks and test data that's been beneficial? I will admit, a lot of my iteration has fallen into the "tweak till the demos look good" camp...

On LangChain, I'd say it slightly differently. It is a fantastic tool for getting up and running with something quickly, and for giving a high-level overview of the pattern and components you will need. That said, it is pretty awful for scaling or customizing. I (and many other people I have talked to) benefited greatly from starting a project with LangChain, but then eventually re-wrote ~everything in the stack piece-by-piece, as needed. As I mentioned in the post, the LangChain loaders are the most valuable piece for me, if not for direct use, at least as a starting point of working code to build various integrations.

isoprophlex · 2023-08-25T19:46:42

Sounds legit what you mention wrt. Langchain. I agree that it's nice for grokking the patterns and getting something out there. Maybe I worded my post too harshly...

Testing; I haven't really found something solid, something that I'd write up in good conscience as a clear 'best practice'. What I meant to say with testing framework is more along the lines of "take some time to set up a bespoke test harness for your use case".

For example, a RAG summarizer chatbot-and-PowerPoint generator that is backed by a data store containing 100s of sales pitches. No low latency reqs, but we need some appreciable accuracy.

There's plenty of cool ideas to try: do I generate n expanded versions of my user query to help retrieval, get m records back, summarize those in one go? Or do I go for a single query but do I let the LLM filter out false positives? Etc etc.

It helps to invest in putting together an (almost) statistically relevant, diverse set of queries and grepping (or even ranking with a LLM) for expected results. In our case, when summarizing sales in the public sector in the past 5 years, did the thing exclude our Y2K work from 23 years back?

Simple, diverse test jobs you can use to give some flavor of scientific hypothesis testing to the weirdness of figuring out how to tweak everything inside the pipeline. This also allows you to register succes criteria upfront, eg. we accept this if we get 19/20 right. Then that becomes 190/200 in the next project phase. Then that becomes "less than 5% thumbs down feedback from users when we go live".

It's an exciting field. Makes me think about building my own little convnets before pytorch become so good.

janalsncm · 2023-08-25T19:46:54

For testing, there’s a whole slew of ranking-based metrics you can use. You want to make sure the content being retrieved is actually relevant. Offline precision, recall, nDCG and MRR tend to be pretty good. Online you can just directly measure user behavior as a function of your model’s inputs and outputs.

https://en.m.wikipedia.org/wiki/Evaluation_measures_(informa...

treprinum · 2023-08-28T23:21:35

LangChain is stuck in the LLMs of the GPT-3 era, all its concepts don't translate well to GPT-4 (multi-task single shot vs chain of multi-shot in LangChain) and it often gets in the way/overcomplicates things there.

eskibars · 2023-08-25T23:10:44

IMHO I think the best points you make here are the last 2. To contrast the first 2, I have seen that many users can provide really good capabilities with RAG -- especially combined with some kind of keyword/hybrid blending so that hyper-specific terms still come up.

Of course, it's difficult for most organizations to actually test their results in a meaningful and cost-effective way, which is why we see so much of this "deploy, cross fingers, wait for a call from the CEO" behavior.

sharemywin · 2023-08-25T17:22:52

what do you use to create your vectorized chunks?

J_Shelby_J · 2023-08-26T15:43:17

It's pretty straight forward.

Pre-process -> chunk -> openai embedding endpoint

I found langchain not great, so ended up rewritting most of it for my chat bot.

https://github.com/shelby-as-a/shelby-as-a-service/blob/50d8...

sharemywin · 2023-08-27T01:10:23

Sorry I meant the chunk part. do you do anything special with trying to break the text into manageable chunks. TextSplitter in Langchain just uses some line breaks stuff I guess.

--- The default recommended text splitter is the RecursiveCharacterTextSplitter. This text splitter takes a list of characters. It tries to create chunks based on splitting on the first character, but if any chunks are too large it then moves onto the next character, and so forth. By default the characters it tries to split on are ["\n\n", "\n", " ", ""] ---

https://python.langchain.com/docs/modules/data_connection/do...

isoprophlex · 2023-08-25T17:33:47

OpenAI's stuff, no experience with local models. I'm curious to hear from people that have compared between those.

kordlessagain · 2023-08-26T01:19:55

Check out Instructor Large.

andrenotgiant · 2023-08-25T16:19:26

Great explanation. I understand the process a lot better after reading.

I have questions now: It seems like everything hinges on the search step, where domain-specific content is gathered to serve as prompt input for the LLM.

1. In an area like technical docs, if I can't find what I need it's often because I don't know the right terms - e.g.: I am looking for "inline subquery" but they're really called "lateral joins" - Would the search step have any likelihood of finding the right context to feed the LLM here?

2. How much value is added by feeding the search results through the LLM with the user's prompt, vs just returning the results directly to the user?

3. Are there good techniques being developed for handling citations in the LLM output? IIRC Google had this in Bard

vharuck · 2023-08-25T17:19:50

>2. How much value is added by feeding the search results through the LLM with the user's prompt, vs just returning the results directly to the user?

These systems usually retrieve the "closest N" snippets. Which might include irrelevant ones if there aren't enough close matches. The LLM will favor the relevant info when composing an answer.

>3. Are there good techniques being developed for handling citations in the LLM output? IIRC Google had this in Bard

Like sharemywin said, you can include citation info with the snippets when making the prompt.

In a proof-of-concept I did at work, I was having trouble forcing the gpt-3.5-turbo model to include URL citations. What eventually worked was adding an extra snippet to the list of relevant ones: a few sentences from the Wikipedia page for Lorem Ipsum plus its URL (arbitrary choice, just had to not overlap with any topic from my documents). Then the prompt included an example question about Lorem Ipsum, with an example answer citing the URL.

sharemywin · 2023-08-25T16:33:08

1. it's using "semantic" search. so if the embeddings understand that "inline subquery" and "lateral joins" are more closely related than say cross joins or something then that doc segment will get returned.

2. the use cases for this would be you want shorter answers then a collection of document snippets and or you had follow up questions that might not be in the same document. and/or you prefer conversing with a chatobot over reading lots of extra text.

3. you can attach a source column to the embeddings and snippet in the vector database.

Here's an explanation of pgvector using supabase

https://supabase.com/docs/guides/database/extensions/pgvecto...

it's basically just a special kind of index to use on an embeddings column.

czue · 2023-08-25T16:45:42

Great answers. The only thing I'll add to number 3 is that you can also get the LLM to cite which source it used within the matching snippets. A simple example with OpenAI functions-style declarations looks something like this: https://gist.github.com/czue/81eea58cb718e67be7c57c224800e30...

If you do this you can pare down the citation from everything that returned from embedding search to what the bot actually used to generate its answer.

andrenotgiant · 2023-08-25T16:55:34

Is there something (besides the convenience of being already in the proper vector formats) unique about the combination of semantic search and LLM's that makes it a better fit?

I guess I'm surprised that the approach is to use semantic search here, where in regular search, completely different algorithms (Lucene?) have "won".

lmeyerov · 2023-08-25T17:10:31

Keyword search won for commodity bc semantic search was weaker and more complicated. More sophisticated orgs like Google switched to semantic/hybrid for years as both problems got addressed.

Now that opensearch/elasticsearch have vector indexes, and high quality text embedding is a lot easier...

petra · 2023-08-25T18:57:02

Is the search in such systems simple and intuitive, or are there tricks to get better results from such systems ,as a user?

goldemerald · 2023-08-25T18:33:20

ACL had a recent tutorial about state of the art for this topic. https://acl2023-retrieval-lm.github.io/

My favorite takeaway was that purely fine tuning your model on your documents (without extra document context during inference) consistently performs worse than using a context added from a datastore.

atmb4u · 2023-08-26T01:42:42

ACL tutorial is a treasure trove. Thank you so much for sharing.

Do you have any thoughts on how to handle real-time streaming data refreshes in such a system?

goldemerald · 2023-08-26T02:56:12

Hire a grad student for a year to write a research paper on it, haha. I've just gotten into the LLM space for work, so I wouldn't know

czue · 2023-08-25T13:09:05

Hello, author here. I wrote this guide up after spending a long time trying to wrap my head around LangChain and the associated ecosystem, and hope others find it useful. Would be happy to answer any questions or feedback anyone has.

sixhobbits · 2023-08-25T14:03:28

hey I have wanted to write a post like this for ages but of course I never got around to it, and now I can just link people to this one instead!

My only suggestions are:

- Coining (? at least I think you did) the new term "Embedding machine" I think loses some clarity versus using "vectorizer", or other existing terminology.

- I know that researchers were surprised by how well word embeddings work, but I think you could have built a bit more intuition here instead of just saying "black box" (but maybe article would have gotten too long) - maybe using some of the intuitions from word2vec, even though that is dated now, the idea that 'similar tokens appear in similar contexts' is useful I think.

czue · 2023-08-25T14:29:29

Thanks for the input and will very happily accept you sending links to this post. :)

My use of "embedding machine" is just revealing my own ignorance. Never heard the term "vectorizer" before, but will switch it (or find a better word).

I'll also have to do some reading to educate myself on word2vec. That's this? https://en.wikipedia.org/wiki/Word2vec

sixhobbits · 2023-08-25T14:36:05

yes, the most-quoted 'breakthrough' was figuring out analogies like this https://www.technologyreview.com/2015/09/17/166211/king-man-...

sharemywin · 2023-08-25T16:19:58

you might want to check these out too:

https://www.sbert.net/docs/pretrained_models.html#sentence-e...

that way you don't need to pay openai for the embeddings

Animats · 2023-08-25T18:30:26

OK, so this is one of those things where the query first goes to some kind of lookup system, and you get back something which is fed into the LLM as part of the prompt.

Is the lookup running locally (not on OpenAI)? Can you look at the output of the lookup to see what hopefully relevant info it is throwing into the LLM? Can you use this with some local open-source LLM system?

ryoshu · 2023-08-25T18:51:10

You can run it locally depending on the machine (13b models will run on a 32GB Apple Silicon laptop). You can also spin up one of the larger models on Huggingface or similar. If you want to productize it you'll likely need devops/mlops and money.

ingridpan · 2023-08-25T17:32:26

RAG is great for pulling some additional knowledge, but if you combine it with fine-tuning (i.e., the LLM 'understands' the domain-specific terminology better) it becomes a lot more effective

monkeydust · 2023-08-25T17:55:14

Looking exactly into this, any research on this topic?

ingridpan · 2023-08-25T18:24:05

Research: https://arxiv.org/abs/2005.11401

An implementation on Hugging Face: https://huggingface.co/docs/transformers/model_doc/rag

monkeydust · 2023-08-25T18:55:07

Txs but I meant fine tuning and RAG combined

spankalee · 2023-08-25T15:17:43

Relatedly, to have a useful chatbot you need to track chat history in a way very similar to augmenting with document retrieval, but you may need to generate embeddings and summaries as you go.

A friend of mine is working on an OSS memory system for chat apps that helps store, retrieve, summarize chat history, and documents to now, I believe, on top of LangChain: https://www.getzep.com/

captaindiego · 2023-08-25T20:14:27

For those playing with this - If you attach unique identifiers in a consistent way to your documents - you can prompt to cite sources when generating the answer.

hammeiam · 2023-08-26T05:05:16

One thing I’ve wondered about is how can I best perform query expansion to return vectors that answer my question rather than just looking like my question itself?

ofermend · 2023-08-25T17:27:45

retrieval augmented generation (we call it "grounded generation" at Vectara) is a great way to build GenAI apps with your data. This blog post can be useful: https://vectara.com/a-reference-architecture-for-grounded-ge.... The long and short of it is: building RAG applications seems easy at the start but gets complicated as you go from toy application to scalable enterprise deployment.

sharemywin · 2023-08-25T16:22:50

Something I wondered is why there isn't a language model trained on doing just RAG.

my suspicion is the language model could be a lot smaller if it's just regurgitating things from the context above.