snyy's comments | Hacker News

We're working on Mongo integrations!


Awesome! Let me know if I can be helpful in any way in connecting you with Mongo resources.


We want to be the platform that connects documents to AI for all applications. Consequently, we want to cover all use cases, including the ones you mentioned :)


Yes :) CodeChunker is fantastic for SQL.
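
Something like this, if you want to try it (the parameter names are from memory, so double-check them against the docs):

    from chonkie import CodeChunker

    # Chunk a SQL file along syntactic boundaries (tree-sitter under the hood).
    # The language/chunk_size kwargs are from memory -- verify in the docs.
    chunker = CodeChunker(language="sql", chunk_size=512)

    with open("migration.sql") as f:
        chunks = chunker.chunk(f.read())

    for chunk in chunks:
        print(chunk.token_count, repr(chunk.text[:60]))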


We don't yet, but our library comes with a visualization tool that you can use to compare chunkers directly. https://docs.chonkie.ai/python-sdk/utils/visualizer
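
A quick sketch of how you'd compare two chunkers with it (method names from memory; the linked docs page is authoritative):

    from chonkie import RecursiveChunker, SentenceChunker, Visualizer

    text = open("doc.md").read()
    viz = Visualizer()

    # Print each chunker's boundaries side by side to eyeball the differences.
    for chunker in (RecursiveChunker(), SentenceChunker()):
        viz.print(chunker.chunk(text))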


Pretty much what you described: convert the PDF to Markdown, join the content across pages so that it's all one string, then chunk it. Our evals show this approach works best.
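
Roughly like this -- pymupdf4llm is just one converter choice, and the chunk size is illustrative:

    import pymupdf4llm
    from chonkie import RecursiveChunker

    # 1. PDF -> Markdown, already joined across pages into one string.
    md = pymupdf4llm.to_markdown("paper.pdf")

    # 2. Chunk the joined Markdown (512 tokens is an illustrative size).
    chunks = RecursiveChunker(chunk_size=512).chunk(md)
    print(len(chunks), "chunks")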


> you may want to remove errors while retaining the correction

Double-clicking on this: are these messages you’d want to drop from memory because they’re not part of the actual content (e.g., execution errors or warnings)? That kind of cleanup is something Chonkie can help with as a pre-processing step.

If you can share an example structure of your message threads, I can give more specific guidance. We've seen folks use Chonkie to chunk and embed AI chat threads — treating the resulting vector store as long-term memory. That way, you can RAG over past threads to recover context without redoing the conversation.
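
For concreteness, a minimal sketch of that pattern -- sentence-transformers and the numpy "store" are my stand-ins, so swap in whatever embedding model and vector DB you actually use:

    import numpy as np
    from chonkie import RecursiveChunker
    from sentence_transformers import SentenceTransformer

    # Flatten a thread into one string, dropping error messages up front
    # (the pre-processing step mentioned above). Thread shape is illustrative.
    thread = [
        {"role": "user", "content": "How do I deploy this?"},
        {"role": "assistant", "content": "Traceback (most recent call last): ..."},
        {"role": "assistant", "content": "Fixed: use `docker compose up`."},
    ]
    text = "\n".join(
        f"{m['role']}: {m['content']}"
        for m in thread
        if not m["content"].startswith("Traceback")
    )

    chunks = RecursiveChunker(chunk_size=256).chunk(text)

    # Embed the chunks; unit-normalized so cosine similarity is a dot product.
    model = SentenceTransformer("all-MiniLM-L6-v2")
    store = model.encode([c.text for c in chunks], normalize_embeddings=True)

    # "RAG over past threads": retrieve the closest chunk for a new question.
    q = model.encode(["how did we deploy?"], normalize_embeddings=True)[0]
    print(chunks[int(np.argmax(store @ q))].text)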

P.S. If HN isn’t ideal for going back and forth, feel free to send me an email at shreyash@chonkie.ai.


> We've seen folks use Chonkie to chunk and embed AI chat threads

Yep, that's what we're looking for. We'll give it a shot!

I think it's worth creating a guide for this use case. Seems like something many people would want to do and the input should be very similar across your users.


Awesome! Keep us posted :)


Great questions!

Chunking fundamentals remain the same whether you're doing traditional semantic search or agentic retrieval. The key difference lies in the retrieval strategy, not the chunking approach itself.

For quality agentic retrieval, you still need to create a knowledge base by chunking documents, generating embeddings, and storing them in a vector database. You can add organizational structure here—like creating separate collections for different document categories (Physics papers, Biology papers, etc.)—though the importance of this organization depends on the size and diversity of your source data.
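
As a concrete sketch (Chroma here is just an example vector DB; the collection names and file paths are illustrative):

    import chromadb
    from chonkie import RecursiveChunker

    client = chromadb.Client()
    chunker = RecursiveChunker(chunk_size=512)

    # One collection per document category. Whether this split is worth it
    # depends on the size and diversity of your corpus.
    corpus = {
        "physics": ["physics_paper.md"],
        "biology": ["biology_paper.md"],
    }
    for category, paths in corpus.items():
        collection = client.get_or_create_collection(category)
        for path in paths:
            chunks = chunker.chunk(open(path).read())
            collection.add(
                ids=[f"{path}-{i}" for i in range(len(chunks))],
                documents=[c.text for c in chunks],
            )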

The agent then operates exactly as you described: it queries the vector database, retrieves relevant chunks, and synthesizes them into a coherent response. The chunking strategy should still optimize for semantic coherence and appropriate context window usage.

Regarding your concern about large DB records: you're absolutely right. Even individual research papers often exceed context windows, so you'd still need to chunk them into smaller, semantically meaningful pieces (perhaps by section, abstract, methodology, etc.). The agent can then retrieve and combine multiple chunks from the same paper or across papers as needed.

The main advantage of agentic retrieval is that the agent can make multiple queries, refine its search strategy, and iteratively build context—but it still relies on well-chunked, embedded content in the underlying vector database.
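
In sketch form, that loop looks something like this -- `llm` is a hypothetical callable standing in for your model, and `collection` is the Chroma collection from the sketch above:

    def agentic_retrieve(question, collection, llm, max_rounds=3):
        """Iteratively query, inspect, and refine. `llm` is hypothetical:
        any callable that takes a prompt string and returns a string."""
        context, query = [], question
        for _ in range(max_rounds):
            hits = collection.query(query_texts=[query], n_results=5)
            context.extend(hits["documents"][0])
            # Let the model decide: answer now, or refine the search query.
            decision = llm(
                f"Question: {question}\nContext so far: {context}\n"
                "Reply 'ANSWER: <answer>' or 'SEARCH: <new query>'."
            )
            if decision.startswith("ANSWER:"):
                return decision[len("ANSWER:"):].strip()
            query = decision[len("SEARCH:"):].strip()
        return llm(f"Answer from this context: {context}\nQuestion: {question}")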


In addition to size and speed, we also offer the widest variety of chunking strategies!

Typically, our current users fall into one of two categories:

- People who run async chunking but need access to a strategy not supported in LangChain/LlamaIndex. Sometimes speed matters here too, especially if the user has a high volume of documents.

- People who need real-time chunking. Super useful for apps like codegen/code review tools.


Thank you :)

