LLMs, RAG, and the missing storage layer for AI (lancedb.com)
151 points by yurisagalov on Sept 7, 2023 | 61 comments



The first unstated assumption is that similar vectors are relevant documents, and for many use cases that's just not true. Cosine similarity != relevance. So if your pipeline pulls 2 or 4 or 12 document chunks into the LLM's context, and half or more of them aren't relevant, does this make the LLM's response more or less relevant?

The second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either. If you retrieve the top K vectors according to the vector index (instead of exhaustively computing all the pairwise similarities), that set of K vectors will be missing documents that have a higher cosine similarity than that of the K'th vector retrieved.

All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.
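
A minimal sketch of that last step, assuming a hypothetical ann_index.search API and using a sentence-transformers cross-encoder for the re-ranking; the model name and the over-retrieval factor are illustrative, not a recommendation:

    import numpy as np
    from sentence_transformers import CrossEncoder

    def exact_top_k(query_vec, doc_vecs, k):
        # Brute-force cosine similarity as ground truth (vectors assumed L2-normalized).
        return np.argsort(-(doc_vecs @ query_vec))[:k]

    def recall_at_k(ann_ids, exact_ids):
        # How many of the true top-K the index actually returned.
        return len(set(ann_ids) & set(exact_ids)) / len(exact_ids)

    def retrieve(query, query_vec, docs, ann_index, k=4, m=5):
        # Over-retrieve m*K candidates from the ANN index (hypothetical .search API),
        # then let a cross-encoder decide relevance rather than raw cosine similarity.
        candidate_ids = ann_index.search(query_vec, k * m)
        reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        scores = reranker.predict([(query, docs[i]) for i in candidate_ids])
        return [candidate_ids[i] for i in np.argsort(-scores)[:k]]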


The vectors are literally constructed so that cosine similarity is semantic similarity.

> second unstated assumption is that the vector index can accurately identify the top K vectors by cosine similarity, and that's not true either

It's not unstated, it's called ANN (approximate nearest neighbor) for a reason


> The vectors are literally constructed so that cosine similarity is semantic similarity.

Are they? A learned embedding doesn't guarantee this and a positional embedding certainly doesn't. Our latent embeddings don't either unless you are inferring this through the dot product in the attention mechanism. But that too is learned. There are no guarantees that the similarities that they learn are the same things we consider as similarities. High dimensional space is really weird.

And while we're at it, we should mention that methods like t-SNE and UMAP are clustering algorithms, not dimensionality reduction. Just because they can find ways to cluster the data in a lower-dimensional projection doesn't mean those points are similar in the higher-dimensional space. It all depends on the ability to unknot the data in the higher-dimensional space.

It is extremely important to do what the OP is doing and consider the assumptions of the model, data, and measurements. Good results do not necessarily mean good methods. I like to say that you don't need to know math to make a good model, but you do need to know math to know why your model is wrong. Your comment just comes off as dismissive rather than actually countering the claims. There are plenty more assumptions than the ones OP listed, too. But those assumptions don't mean the model won't work; they just define the constraints the model is working under. We want to understand the constraints/assumptions if we want to make better models. Large models have an advantage because they can have larger latent spaces, and that gives them a lot of freedom to unknot data and move it around as they please. But that doesn't mean the methods are efficient.


There are embeddings that are trained to reflect similarity, for example SentenceBERT, where the training process pushes pairs of similar sentences (as defined by whoever built the dataset) to have closer embeddings and dissimilar sentences to be further apart.
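
A minimal sketch of what that looks like in practice with sentence-transformers; the model name here is just an example of a SentenceBERT-style model:

    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")  # a SentenceBERT-style model
    sentences = [
        "The cat sat on the mat.",
        "A kitten is resting on a rug.",
        "Interest rates rose again this quarter.",
    ]
    embeddings = model.encode(sentences, normalize_embeddings=True)

    # Pairwise cosine similarities: the first two sentences should score much
    # higher against each other than against the third.
    print(util.cos_sim(embeddings, embeddings))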


As the OP points out, cosine similarity doesn't always equate to relevance. As I was expanding upon, things get really messy as the dimensions increase, and your intuition about how vectors relate to one another goes out the window, fast. Distributional mass is not uniform. Random vectors become nearly orthogonal. And of course, there are no guarantees that latent dimensions align with human-meaningful semantic features; there's no pressure to align basis vectors with human-perceived semantics. My argument isn't that there is no similarity pressure, it's that similarity in high dimensions means different things than similarity in low dimensions. For example, in high dimensions most of a unit cube's mass lies outside the unit sphere, while in 2 or 3 dimensions the unit cube is contained inside it with room to spare. High dimensions are weird, and that's what my comment is about, because many people are using their lower-dimensional intuition for ML.


Do you know how embedding models are trained?


Yes. My comment is about the geometry of higher dimensions and their meanings. These are not the same as in {2,3}D


To be fair… semantic similarity isn’t the same as relevance either.

They are related, and we frequently assume they are close enough that it doesn’t matter, but they are different.


I disagree. The embeddings are what the LLMs themselves use to produce relevant output, and the output is relevant; ergo, the embeddings do produce relevant output via similarity search.


You probably aren't using an LLM for your text embeddings for document retrieval (they don't perform as well as specialist embedding models[0]), and even if you were, you'd have an embedding of a bare document, without any context of what you are trying to get out of it. If you were to add your context in and then get an embedding, you would get a different answer. As your query gets more specific, irrelevant aspects of the embedding space can overwhelm the similarity function, leading to irrelevant answers that are still semantically similar.

[0] https://huggingface.co/spaces/mteb/leaderboard


The recent SILO-LM paper has a slightly different approach: rather than using input embeddings and prompting the LLM with documents, it searches the database according to the LLM's output embedding and uses KNN search to skew the output embedding vector before token generation. Done that way round, using LLM embeddings outperforms RAG, allegedly.

They did it with a custom language model. I really want to give this a try with llama2 embeddings but haven't had the bandwidth yet (and llama2's embedding vectors are inconveniently huge, but that's a different problem).
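
For what it's worth, the description sounds like the kNN-LM family of methods. A rough sketch of that interpolation idea follows; this is not the paper's code, and the datastore layout and hyperparameters are guesses:

    import torch
    import torch.nn.functional as F

    def knn_adjusted_log_probs(hidden, lm_logits, datastore_keys, datastore_next_tokens,
                               vocab_size, k=16, lam=0.25, temperature=1.0):
        # datastore_keys: (N, d) hidden states recorded while indexing the corpus
        # datastore_next_tokens: (N,) long tensor, the token that followed each stored state
        dists = torch.cdist(hidden.unsqueeze(0), datastore_keys).squeeze(0)      # (N,)
        knn_dists, knn_idx = dists.topk(k, largest=False)
        weights = F.softmax(-knn_dists / temperature, dim=0)
        p_knn = torch.zeros(vocab_size).scatter_add_(0, datastore_next_tokens[knn_idx], weights)
        p_lm = F.softmax(lm_logits, dim=0)
        # Interpolate the two next-token distributions before sampling.
        return torch.log(lam * p_knn + (1 - lam) * p_lm + 1e-10)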


Interesting! I’ll have to look into that.


Consider the extreme case: when I ask a question about X, then a page with just the questions about X will get the highest similarity. But what I want in terms of relevance for the answer is a page with a little bit about X and lots of surrounding context that answers the question. By definition the extra context will likely lower the similarity.


Not if you're using ANN. In some cases that will be very similar to exhaustive search but in other cases you'll get results that you don't want. You also need embeddings that distribute things mostly evenly across the embedding space (not all will).


That's interesting.

Are there any good sources to learn more about that?


This is kind of a moot argument; semantic similarity has a higher dimensionality than cosine similarity can capture.

If I'm using vectors for question/answer, then:

"What is a cat"

and

"What is a dog"

Should be more dissimilar to each other than either is to the documents that answer it.

If I'm using it for FAQ filtering then they should be more similar.


I've had decent results using a doc2query style approach:

    1. Ask an LLM to return a list of questions answered by the document
    2. Store the embeddings of the questions along with a document ID
    3. On user query, get the embedding of the user query
    4. KNN cosine similarity search the user embedding vs. the corpus of question embeddings
    5. Return the highest ranked documents
You can tweak this approach depending on your use case, so that in step 1 you generate embeddings that are more similar to the types of things you want returned in step 5. If you want the answer to "What is a cat" to be similar to "What is a dog," you'd prompt/finetune the LLM in step 1 to generate broad questions that would encompass both; if you want them to be very different, you'd do the opposite and avoid generalities.
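
A rough sketch of steps 1-5, assuming a hypothetical llm_generate helper that returns a list of questions and using sentence-transformers for the embeddings (the model name is illustrative):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any decent sentence-embedding model
    question_index = []                                  # list of (doc_id, question_embedding)

    def index_document(doc_id, text, llm_generate):
        # Steps 1-2: ask an LLM which questions the document answers, embed and store them.
        questions = llm_generate(f"List the questions answered by this document:\n{text}")
        for q in questions:
            question_index.append((doc_id, embedder.encode(q, normalize_embeddings=True)))

    def search(user_query, top_k=3):
        # Steps 3-5: embed the query, rank stored questions by cosine similarity,
        # and return the documents behind the best-matching questions.
        q_vec = embedder.encode(user_query, normalize_embeddings=True)
        ranked = sorted(question_index, key=lambda item: -float(item[1] @ q_vec))
        results = []
        for doc_id, _ in ranked:
            if doc_id not in results:
                results.append(doc_id)
            if len(results) == top_k:
                break
        return results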


You just reinvented a 2 year old technique with a more expensive pipeline and missed performance gains (from the cross-encoder step):

https://www.sbert.net/examples/domain_adaptation/README.html https://arxiv.org/abs/2112.07577


I'm aware of more efficient ways to do it! (Hence referencing e.g. doc2query.) But you have to train a model, whereas with an LLM you can get a working version in 5mins of work.


But with even less work you can just pick up a model that was pre-trained using GPL and get great results.

I'm able to pull messy results directly from internet sources and re-rank on the fly with a quantized e5 model small enough to fit in a serverless function.

You don't need a vector database to do all this stuff; the people getting paid off of others using vector databases are the ones hyping them up the most.


Oh, I wasn't suggesting using a vector DB. Personally I just iterate through the corpus and check cosine similarity with a for loop.
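
For small corpora that really is all it takes; a minimal sketch with numpy, assuming you already have the document embeddings in memory:

    import numpy as np

    def top_k_cosine(query_vec, doc_vecs, k=5):
        # Exhaustive search: exact, trivially simple, and fast enough for
        # corpora of a few hundred thousand vectors.
        q = query_vec / np.linalg.norm(query_vec)
        sims = doc_vecs @ q / np.linalg.norm(doc_vecs, axis=1)
        return np.argsort(-sims)[:k]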

If by "quantized e5 model small enough to fit in a serverless function" you mean e5-small-v2, FYI it actually underperforms just calling OpenAI for embeddings (text-embedding-ada-002) on the HuggingFace MTEB benchmarks. And that definitely doesn't negate using a doc2query-style approach to preprocess the documents before running them through the pretrained embedding model if you're comparing e.g. questions to answers, rather than raw document-to-document similarity. (Of course a custom trained model will be more efficient! In fact, the original doc2query paper in 2019 used a custom trained model for step 1, as did many enhancements on it e.g. doc-t5-query. What's neat is that with the advent of really good pretrained LLMs, you can get results approximating that without training your own models in like ~5mins of work.)


I guess this really boils down to your usecase: if you can have a result for your user with fully predictable latency (my biggest beef with non-Azure OpenAI), no additional round trip, and increased configurability, does MTEB performance move the needle?

Considering the LLM is still doing the final pass, and the latency from the LLM is based on output length, I find the UX to be significantly improved just doing reranking in-process.

I think there's been a bit of whiplash, where people went from gatekeeping "hard ML" to "I can shove this all at a REST API," but there's a golden path lying in between for use cases where UX matters.

I even fall back to old school NLP (like ML-less, glorified wordlist POS taggers) for LLM tasks and end up with significantly improved performance for almost 0 additional effort


Yes - but calculating the cosine similarity for all the candidates is prohibitively expensive.

Hence the heuristic.


Switching to Word2Vec embeddings led to a substantial improvement in my cosine similarity evaluations for text similarity, but granted I was looking for actual similarity, not relevance. I tried many different methods and had lots of mediocre results initially.

code: https://github.com/jimmc414/document_intelligence/blob/main/... https://github.com/jimmc414/document_intelligence
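
For context, the usual way to turn Word2Vec word vectors into a document embedding is to average them. A minimal sketch with gensim; the pretrained model and preprocessing are illustrative and not necessarily what the linked code does:

    import numpy as np
    import gensim.downloader as api

    w2v = api.load("word2vec-google-news-300")   # pretrained Word2Vec vectors (~1.6 GB)

    def doc_vector(text):
        # Average the vectors of in-vocabulary words into one document vector.
        words = [w for w in text.lower().split() if w in w2v]
        return np.mean([w2v[w] for w in words], axis=0)

    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    print(cosine(doc_vector("the cat sat on the mat"),
                 doc_vector("a kitten rests on a rug")))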


Interesting, do you happen to have some quantitative results on this/additional insights/etc?

I've interpreted transformer vector similarity as 'likelihood to be followed by the same thing' which is close to word2vec's 'sum of likelihoods of all words to be replaced by the other set' (kinda), but also very different in some contexts.


There's no simplified definition like that, vectors can even capture logical properties, it's all down to what the model was tuned for: https://www.sbert.net/examples/training/nli/README.html


This is very interesting. You had better results here than with OpenAI's ada-002 and other embeddings like BGE?


As opposed to sentencebert or what?


DistilBERT and RoBERTa


Could you please explain your second paragraph a bit? I couldn't quite understand either the problem statement or the reasoning itself.


"Cosine similarity != relevance" In all ML search products, there's a tradeoff between precision and recall, and moreover there's almost never any "gold" data that ensures the "correctness" of surfaced results. I mean, Bing and Google have both invested millions of dollars in labeling web pages and even evaluating search results, but those labels can become useless as your set of documents change.

Cosine similarity is a useful compromise, and yes, a lot of authors take this for granted. At the end of the day, an LLM product probably won't be evaluated on accuracy but rather on "lift" over an alternative. And the evaluation will be in units of user happiness.

> All of this means you'll need to retrieve a multiple of K vectors, figure out some way to re-rank them to exclude the irrelevant ones, and have your own ground truth to measure the index's precision and recall.

This is usually a Series E problem, not a Series A problem.


Azure Cognitive Search takes care of all of this, combining semantic search with other layers of traditional search methods.


As an architect working on LLM applications, I have these criteria for a database:

- Full SQL support

- Has good tooling around migrations (e.g. dbmate)

- Good support for running in Kubernetes or in the cloud

- Well understood by operations i.e. backups and scaling

- Supports vectors and similarity search.

- Well supported client libraries

So basically Postgres and PgVector.


Exactly. The whole point about databases is that you don't need "a database for AI"; you need a database, ideally with an extension that adds the AI functionality (i.e. Postgres and pgvector). If you take a special store you invent for AI and try to retrofit all the desirable things you need to make it work properly in the context of a real application, you're just going to end up with a mess.

As a thought experiment for people who don't understand why you need (for example) regular relational columns alongside vector storage, consider how you would implement RAG for a set of documents where not everyone has permission to view every document. In the pgvector case it's easy: I can add one or more label columns and then, in my search query, filter to only include labels that user has permission to view. Then my vector similarity results will definitely not include anything that violates my access control. Trivial with something like pgvector; basically impossible (afaics) with special-purpose vector stores.

Or think about ranking. Say you want to do RAG over a space where you want to prioritise the most recent results, not just pure similarity. Or prioritise on a set of other features somehow (source credibility whatever). Easy to do if you have relational columns, no bueno if you just have a vector store.
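
A minimal sketch of both points with psycopg2 and pgvector; the table, column names (documents, access_label, published_at), recency weighting, and query_embedding are illustrative assumptions:

    import psycopg2
    from pgvector.psycopg2 import register_vector

    conn = psycopg2.connect("dbname=app")
    register_vector(conn)          # lets us pass numpy arrays as pgvector values
    cur = conn.cursor()

    # One table holds the chunk text, its embedding, and ordinary relational columns,
    # so the permission filter and the ranking tweak live in the same query as the
    # similarity search. query_embedding is assumed to be computed elsewhere.
    cur.execute(
        """
        SELECT id, chunk
        FROM documents
        WHERE access_label = ANY(%(labels)s)                 -- only labels this user may see
        ORDER BY (1 - (embedding <=> %(q)s))                 -- cosine similarity (pgvector)
                 + 0.2 * exp(-extract(epoch FROM now() - published_at) / (86400.0 * 30)) DESC
        LIMIT 5
        """,
        {"labels": ["public", "team-eng"], "q": query_embedding},
    )
    rows = cur.fetchall()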

And that's not to mention the obvious things around ACID, availability, recovery, replication, etc.


Can I add one more nice-to-have? Good support for graph data. I'm not 100% certain on it yet, but there are a lot of ideas out there around storing knowledge as a graph, and it makes a lot of intuitive sense. I haven't found a killer use case for it yet, as so far I can get by just tagging things, and SQL querying on the tags is powerful enough.

Maybe someone could pitch in. Is knowledge really a graph (for your problem domain), or is that just some bullshit people made up when they still thought AI could be captured mathematically? It feels to me now that knowledge is much more like the way vector embeddings work: it's a cloud where things are related to each other in an analog or statistical way, not a discrete way.

But, perhaps for similar reasons, vector embeddings haven't been super useful to me in building RAG agents yet. Knowledge is either relevant or it's not, and at least for me if it's relevant it has the keywords or tags I need, and just a straight up SQL query brings it in.


You can think of a vector database with n vectors as a network whose adjacency matrix is n x n, with each edge weighted by whatever similarity metric between nodes you choose to use. So you can have strongly connected edges and weakly connected edges.
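
A tiny numpy illustration of that framing, with random vectors standing in for real embeddings:

    import numpy as np

    vecs = np.random.randn(6, 384)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)

    adjacency = vecs @ vecs.T        # entry (i, j) is the cosine-similarity edge weight
    strong_edges = adjacency > 0.8   # threshold it if you want a discrete graph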


You may want to take a look at Zep, an LLM application platform that wraps Postgres, pgvector, embedding models, and more to offer chat memory persistence and document vector search.

The Python and TS SDKs are designed to support drop-in replacements for the bits of LangChain that don’t scale, but nothing stops you accessing Postgres directly.

https://github.com/getzep/zep

Disclosure: I’m the primary author.


Yes totally agree with that (and other comments below). Moving from a toy example to production deployment requires all the things we are used to having in robust/mature products like postgres.


I don’t fully understand the fascination with retrieval augmented generation. The retrieval part is already really good and computationally inexpensive — why not just pass the semantic search results to the user in a pleasant interface and allow them to synthesize their own response? Reading a generated paragraph that obscures the full sourcing seems like a practice that’s been popularized to justify using the shiny new tech, but is the generated part what users actually want? (Not to mention there is no bulletproof way to prevent hallucinations, lies, and prompt injection even with retrieval context.)


On the modeling side, it's compelling to separate the memory from the linguistic skills. Vector search is hella fast and can be very good. So you can offload the memorization part of the problem and let the language model focus on the language. This should allow better performance with much smaller models.


I really like using LLMs to learn stuff because they can explain anything at the exact level I need. Hallucination is a big problem with that and RAG pretty much solves it. If I give chatGPT a good stackoverflow post and tell it to dumb it down for me, it does very well. RAG just automates that process with the added benefit of not letting the LLM decide which information to retrieve, which should greatly reduce the chance of accidentally biasing the model with your prompt.


In a strict "one question / one response" search, raw semantic search results are a great solution. And they consume far fewer tokens.

In conversational AI, providing search results appended to a long-memory context produces "human-like" results.


The main reason is that you might not want the raw information but some reasoning on top of it. An LLM is not only the context but all the information it has been trained with. For example, a math student asking a question doesn't want the raw theorems but some reasoning with them, and current LLMs can do that. They will make mistakes sometimes because of hallucinations, but for questions that aren't very difficult they usually give you the right answer. And that helps a lot when you are not an expert in the domain. That is the reason GPT-4 is a great tool for students: it helps you understand the basics as if you had a teacher with you.


Sometimes what I want is to ask Google/Alexa/Siri a question and get a summary response along with the source. I think that would be a good application of the above.

Less so IMO when I’m on my phone or in front of the computer.


For me, the #1 advantage is being able to ask follow-up questions


It's not clear to me that only a vector DB should be used for RAG. Vector DBs give you stochastic responses.

For customer chatbots, it seems that structured data - from an operational database or a feature store - adds more value. If the user asks about an order they made or a product they have a question about, you use the user-id (when logged in) to retrieve all the info about what the user bought recently - the LLM will figure out what the prompt is referring to.

Reference:

https://www.hopsworks.ai/dictionary/retrieval-augmented-llm


Thanks for sharing that observation on customer chatbots.

1. Will that query look like this:

  SELECT LLM("{user_question}", order_info)
  FROM postgres_data.order_table
  WHERE user_id = '101';
2. How will a feature store, like Hopsworks, help in this app?

Shameless self-plug: We are building EvaDB [1], a query engine for shipping fast AI-powered apps with SQL. Would love to exchange notes on such apps if you're up for it!

[1] https://github.com/georgia-tech-db/evadb


Why would your projection be this: SELECT LLM("{user_question}", ...)?

You can train a small llm on your private data to map the user question to tables in your db.

Then just SELECT with a LIMIT (or time-bounded). The feature store is just another operational store that could have relevant data for the query.


> You can train a small llm on your private data to map the user question to tables in your db.

Can you? You've personally done this? Deployed it to production at some kind of non trivial scale and it's working well? I'm not aware of any "small llm" that approaches the quality of gpt-3.5.


This is called Text2SQL or NL2SQL. It's a surprisingly difficult problem even with RAG and GPT-4 as soon as the query is non-trivial, especially if there are semantic differences between the question and the db schema.


And for technical documentation or code I'm unclear how well semantic search works for CEQ.

I would assume the embedding model isn't trained on code or on words that are industry/company specific.


A lot of things mentioned are too handwaved and not explained well.

It's not explained how a vector DB is going to help when incumbents like GPT-4 can already call functions and make API calls.

It doesn't make AI less of a black box; that claim is irrelevant and not explained.

There are already existing ways to fine-tune models without expensive hardware, such as using LoRA to inject small adapter layers trained on customized data, which takes a fraction of the time and resources needed to retrain the whole model.
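
For reference, a minimal sketch of that with Hugging Face PEFT; the base model and hyperparameters are illustrative:

    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
    config = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],   # inject small low-rank adapters into attention
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(base, config)
    model.print_trainable_parameters()         # typically well under 1% of the base weights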


There are lots of things you don't want to leak, e.g. customer-specific data. For those cases vectors are great.


We use Lance extensively at my startup. This blog post (previously on HN) details nicely why: https://thedataquarry.com/posts/vector-db-4/ but essentially it’s because Lance is “just a file” in the same way SQLite is “just a file”, which makes it embedded, serverless, and straightforward to use locally or in a deployment.


I find it quite comical to speak of a "missing storage layer" during your own self-promotion, considering that the market for vector databases is literally overflowing right now.

Everything else may be missing, but not the storage layer.


Does ChatGPT always start articles with “in the rapidly evolving landscape of X”?

Surely if you’re posting an article promoting miraculous AI tech you should human edit the article summary so that it’s not really obviously drafted by AI.

Or just use the prompt “tone your writing down and please remember that you’re not writing for a high school student who is impressed by nonsensical hyperbole”. I’ve started using this prompt and it works astonishingly well in the fast evolving landscape of directionless content creation.


Unrelated question: is there a standard way for writing down neural network diagrams? I'm thinking of how it is done in electrical circuit schematics, which capture all relevant information in a single diagram, in a (mostly) standardized way.

I've seen the diagrams in DL papers etc. but I guess everyone invents their own conventions, and the diagrams often don't convey the complete flow of information.


There are conventions, and most libraries have tools to export diagrams to LaTeX or an image (e.g., TorchViz).

Visualizations are highly context- and usage-dependent anyway. Generally, there is no value in showing fully connected or feed-forward layers in detail outside of teaching materials.


> Generally, there is no value in showing fully connected or feed-forward layers in detail outside of teaching materials.

Well, in electrical circuit diagrams it is customary to draw e.g. a signal bus as a single connection, with the number of wires in the bus written next to it (with a little strike-through line). I'm guessing something similar can be done for DL networks.


Shameless self promotion


404



