Who thinks document "chunking" sucks and why?
8 points by sebscholl 6 months ago | 4 comments
All the "memory" systems/pipelines for LLMs seems to be using the exact same approach:

1. Chunk docs + embed

2. Store in a VectorDB

3. Query embeddings based on semantic relevance

In my work, this has consistently failed to get meaningful context for prompts. Who has seen a better way of handling this problem?




If you consider how a mature search product works, that will probably give you an improvement over a single embedding method. You'll find that these systems are really an amalgamation of indexes, queries, and rankings. Often an item will have more data and linkage attached to it than just the embedding of its content. They will also run multiple different queries at the same time and then combine and rank the results. If you want Google-search response times, you'll go a step further and run multiple instances of each query so you get your result from whichever returns first.
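
For illustration, here's a minimal sketch of that fan-out-and-combine idea using reciprocal rank fusion. `search()` is a hypothetical stand-in for whatever index you query (vector, keyword, etc.); this is not any particular library's API.

    from collections import defaultdict
    from concurrent.futures import ThreadPoolExecutor

    def rrf_fuse(result_lists, k=60):
        # Reciprocal rank fusion: a doc scores 1/(k + rank) in each ranked list it appears in.
        scores = defaultdict(float)
        for results in result_lists:
            for rank, doc_id in enumerate(results, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    def multi_query(search, queries, top_k=20):
        # Fire the query variants concurrently; latency is bounded by the slowest variant.
        with ThreadPoolExecutor() as pool:
            result_lists = list(pool.map(lambda q: search(q, top_k), queries))
        return rrf_fuse(result_lists)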

That being said, people are finding the basic steps you showed above sufficient. There are parameters you can change in these 3 steps. Have you tried changing any of those?

- how you chunk: chunk size and rules

- how you embed: which model and embedding size

- how you query: the metric(s) used

Step 2 is probably only important to result quality in that it determines what is available for you to use in the other steps, notably the embedding comparison metric that really defines relevance.
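
Roughly, the three steps with those knobs exposed look like this. A Python sketch only: `embed_texts` is a hypothetical stand-in for your embedding model, and a plain NumPy matrix stands in for the vector DB.

    import numpy as np

    def chunk(text, chunk_size=800, overlap=100):            # 1. how you chunk
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]

    def build_index(docs, embed_texts):                       # 2. how you embed (and "store")
        chunks = [c for d in docs for c in chunk(d)]
        vecs = np.asarray(embed_texts(chunks), dtype=float)
        vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)   # pre-normalise for cosine
        return chunks, vecs

    def query(q, chunks, vecs, embed_texts, top_k=5):         # 3. how you query (cosine here)
        qv = np.asarray(embed_texts([q]), dtype=float)[0]
        qv /= np.linalg.norm(qv)
        scores = vecs @ qv
        return [chunks[i] for i in np.argsort(-scores)[:top_k]]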


Chunking is one of those things that needs to be customized to the document being processed. As a general rule, try recursive chunking for most questions (a sketch follows below), and consider using multiple strategies in tandem. It has the nice advantage of incorporating both broad and specific context. However, even the document itself is not enough to design a chunking strategy. Consider an HTML document and the questions:

1. How does that webpage code work?

2. Summarize this website.

As you can see, you might benefit from pre-processing the info differently based on the intended result. One question cares strongly about the tags, styling, etc., while the other only cares about the text, where you could maybe just scrub the tags.

Also, consider chunk overlap and max chunk size, and tune them based on different trials.
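
Here's the promised sketch: a minimal recursive chunker with tunable max size and overlap, assuming plain-text input. It splits on the coarsest separator first and only recurses when a piece is still too large; real libraries do this more carefully.

    def recursive_chunk(text, max_size=1000, overlap=100, seps=("\n\n", "\n", ". ", " ")):
        if len(text) <= max_size:
            return [text]
        for sep in seps:
            parts = [p for p in text.split(sep) if p]
            if len(parts) > 1:
                # Greedily pack neighbouring parts back together up to max_size.
                pieces, buf = [], ""
                for part in parts:
                    candidate = (buf + sep + part) if buf else part
                    if len(candidate) > max_size and buf:
                        pieces.append(buf)
                        buf = part
                    else:
                        buf = candidate
                if buf:
                    pieces.append(buf)
                if len(pieces) > 1:
                    # Recurse on anything that is still too large.
                    return [c for p in pieces for c in recursive_chunk(p, max_size, overlap, seps)]
        # No separator helped: fall back to fixed-size slices with overlap.
        step = max_size - overlap
        return [text[i:i + max_size] for i in range(0, len(text), step)]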

Check your chunk scores (cosine similarity) against sample queries and make sure the chunk texts are meaningful. "Is this how I would store info in my head?" might be a good way to start; if your chunks are garbage, you will get garbage.
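
Something as simple as this is enough to eyeball the scores (again, `embed_texts` is a hypothetical embedding function, not a specific library call):

    import numpy as np

    def show_top_chunks(sample_queries, chunks, embed_texts, top_k=3):
        C = np.asarray(embed_texts(chunks), dtype=float)
        C /= np.linalg.norm(C, axis=1, keepdims=True)
        for q in sample_queries:
            qv = np.asarray(embed_texts([q]), dtype=float)[0]
            qv /= np.linalg.norm(qv)
            sims = C @ qv
            print(f"\nQuery: {q}")
            for i in np.argsort(-sims)[:top_k]:
                print(f"  {sims[i]:.3f}  {chunks[i][:80]!r}")   # read these: do they answer the query?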

Consider visualizing your chunks in clusters to validate topic relevance.
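
For example, a quick sketch with scikit-learn and matplotlib, assuming you already have the (n_chunks, dim) chunk embedding matrix:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    def plot_chunk_clusters(vectors, n_clusters=8):
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(vectors)
        xy = PCA(n_components=2).fit_transform(vectors)        # crude 2D projection
        plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=10, cmap="tab10")
        plt.title("Chunk embeddings coloured by cluster")
        plt.show()
        return labels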

Last thing: a RAG is a multi-step architecture, and only one step being bad will turn the whole thing to garbage, so put lots of debugging and eval steps in yours. Make sure it's not the prompt step that's ruining it. Identify the weak points and triage accordingly.


To get consistent semantic chunks for RAG, you can't just slice a document into arbitrary 2k-character chunks after doing a PDF text extraction.

Most documents have implicit structural semantics that are not explicitly worded in the text. You need to surface and embed those into the chunks, and also further enrich chunk candidates by flattening in references and other relations.
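
As one illustration of such an enrichment pattern (not a general recipe), you can prepend the heading path to each chunk before embedding. The `(heading_path, paragraph)` input format here is an assumed example; producing it from HTML/PDF is the document-specific part.

    def enrich_chunks(sections):
        # sections: list of (heading_path, paragraph_text) pairs, e.g.
        #   (["Manual", "Setup", "Auth"], "To configure tokens ...")
        enriched = []
        for heading_path, paragraph in sections:
            prefix = " > ".join(heading_path)
            # Embed the prefixed text; keep the raw paragraph and path as metadata.
            enriched.append(f"[{prefix}]\n{paragraph}")
        return enriched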

There is no general solution to this. While you can apply chunking and enrichment patterns, each document type is a bespoke effort to get it right.


This is a technique called text tiling, which separates a document into semantic chunks:

https://github.com/Ighina/DeepTiling

https://medium.com/@ganymedenil/how-to-segment-large-texts-f...
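
A rough sketch of the idea (not the linked implementations): embed consecutive sentences, compare neighbouring windows, and cut where similarity dips. `embed_texts` is a hypothetical embedding function, and the threshold is something you'd tune.

    import numpy as np

    def tile(sentences, embed_texts, window=3, threshold=0.6):
        V = np.asarray(embed_texts(sentences), dtype=float)
        V /= np.linalg.norm(V, axis=1, keepdims=True)
        segments, start = [], 0
        for i in range(window, len(sentences) - window):
            left = V[i - window:i].mean(axis=0)                  # context before position i
            right = V[i:i + window].mean(axis=0)                 # context after position i
            sim = float(left @ right) / (np.linalg.norm(left) * np.linalg.norm(right))
            if sim < threshold and i - start >= window:          # likely topic shift: cut here
                segments.append(" ".join(sentences[start:i]))
                start = i
        segments.append(" ".join(sentences[start:]))
        return segments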



