
The biggest issue I have with semantic chunking is that it requires an LLM to help create the breakpoints. That's a big cost and latency penalty, potentially for no benefit. That being said, we've seen chunk size have a huge impact on naive extraction to the graph. With recursive character chunking, going from 1000 characters down to 500 showed huge gains, even with long-context LLMs. However, once we got out to 2000-4000 character chunks, there didn't appear to be much difference. So if you're looking to extract maximum detail from a text corpus, ultra-small chunking is likely beneficial.
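
For reference, here's a minimal sketch of what I mean by recursive character chunking (names and parameter values are illustrative, not TrustGraph's API): split on the coarsest separator first, then recurse to finer separators for any piece that still exceeds the chunk size.

    # Illustrative recursive character splitter, not a specific library's API.
    SEPARATORS = ["\n\n", "\n", " ", ""]  # paragraph -> line -> word -> character

    def recursive_split(text: str, chunk_size: int, seps=SEPARATORS) -> list[str]:
        if len(text) <= chunk_size:
            return [text]
        sep, *finer = seps
        if sep == "":
            # Nothing finer to split on: hard-cut at the character level.
            return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
        chunks, buf = [], ""
        for piece in text.split(sep):
            if len(piece) > chunk_size:
                # Piece is still too big: flush the buffer and recurse
                # with the next-finer separator.
                if buf:
                    chunks.append(buf)
                    buf = ""
                chunks.extend(recursive_split(piece, chunk_size, finer))
            elif buf and len(buf) + len(sep) + len(piece) <= chunk_size:
                buf += sep + piece
            else:
                if buf:
                    chunks.append(buf)
                buf = piece
        if buf:
            chunks.append(buf)
        return chunks

    # e.g. compare chunk counts at the sizes discussed above:
    # for size in (500, 1000, 2000):
    #     print(size, len(recursive_split(document_text, size)))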

That being said, with ultra-small chunking, there's a lot of redundancy in the extracted graph edges. These are some of the problems we're trying to solve with the TrustGraph extraction processes.
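
To make the redundancy concrete, a hypothetical sketch (this is not TrustGraph's actual pipeline): when two overlapping small chunks both yield the same fact with slightly different surface forms, naive insertion duplicates the edge, and even a cheap normalization pass over (subject, predicate, object) triples collapses the obvious cases.

    def norm(s: str) -> str:
        # Cheap normalization: lowercase and collapse whitespace.
        return " ".join(s.lower().split())

    def dedupe_edges(edges):
        seen, unique = set(), []
        for subj, pred, obj in edges:
            key = (norm(subj), norm(pred), norm(obj))
            if key not in seen:
                seen.add(key)
                unique.append((subj, pred, obj))
        return unique

    edges = [
        ("Marie Curie", "won", "Nobel Prize"),
        ("marie  curie", "won", "nobel prize"),  # same fact from an adjacent chunk
    ]
    print(dedupe_edges(edges))  # keeps only the first occurrence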

Daniel
