Show HN: Neum AI – Open-source large-scale RAG framework (github.com/neumtry)
155 points by picohen 9 months ago | 30 comments
Over the last couple of months we have been helping developers build large-scale RAG pipelines that process millions of pieces of data.

We documented our approach in an HN post (https://news.ycombinator.com/item?id=37824547) a couple of weeks ago. Today, we are open-sourcing the framework we have developed.

The framework focuses on RAG data pipelines and provides scale, reliability, and data synchronization capabilities out of the box.

For those newer to RAG, it is a technique for providing context to Large Language Models. It consists of grabbing pieces of information (e.g. excerpts from news articles, papers, descriptions) and incorporating them into prompts to help contextualize the responses. The hard part is finding the right pieces of information to incorporate, and that search is typically done with vector embeddings and vector databases.

Those pieces of news articles, papers, etc. are transformed into vector embeddings that represent the semantic meaning of the information. These vector representations are organized into indexes where we can quickly search for the pieces of information that most closely resemble (from a semantic perspective) a given question or query. For example, if I take news articles from this year, vectorize them, and add them to an index, I can quickly search for pieces of information about the US elections.
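
To make that concrete, a minimal sketch of embed-then-search using sentence-transformers and numpy (the model choice and sample texts are purely illustrative):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model
    docs = [
        "The 2024 US election season saw record early voting.",
        "A new battery chemistry promises cheaper grid storage.",
    ]
    # Normalized embeddings make the dot product equal cosine similarity.
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    query_vec = model.encode(["news about the US elections"], normalize_embeddings=True)[0]
    scores = doc_vecs @ query_vec
    print(docs[int(np.argmax(scores))])  # prints the election sentence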

To help achieve this, the Neum AI framework features:

The framework starts with built-in data connectors for common data sources, embedding services, and vector stores, and provides the modularity to build data pipelines to your specification.

The connectors support pre-processing capabilities for defining loading, chunking, and selecting strategies to optimize the content to be embedded. This also includes extracting metadata that will be associated with a given vector.
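
As a toy illustration of that pre-processing step (plain Python, not Neum's actual classes; the chunk sizes and metadata fields are made up):

    def chunk_text(text, chunk_size=500, overlap=50):
        """Naive character chunker with overlap between consecutive chunks."""
        step = chunk_size - overlap
        return [text[i:i + chunk_size] for i in range(0, len(text), step)]

    def preprocess(doc):
        """Chunk a loaded document and select which metadata rides along with each vector."""
        metadata = {k: doc[k] for k in ("url", "title") if k in doc}
        return [{"text": chunk, "metadata": metadata} for chunk in chunk_text(doc["body"])]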

The generated pipelines support large-scale jobs through a high-throughput distributed architecture. The connectors allow you to parallelize tasks like downloading documents, processing them, generating embeddings, and ingesting data into the vector DB.
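
A rough sketch of that kind of parallelism using only the standard library (embed_batch is a placeholder for whatever embedding-service call you use; the work is I/O-bound, so threads suffice):

    from concurrent.futures import ThreadPoolExecutor

    def embed_batch(chunks):
        ...  # placeholder: call your embedding service (e.g. OpenAI) here

    def embed_all(chunks, batch_size=100):
        batches = [chunks[i:i + batch_size] for i in range(0, len(chunks), batch_size)]
        with ThreadPoolExecutor(max_workers=8) as pool:
            results = pool.map(embed_batch, batches)  # batches run concurrently
        return [vec for batch in results for vec in batch]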

For data sources that might be continuously changing, the framework supports data scheduling and synchronization. This includes delta syncs where only new data is pulled.
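
A delta sync can be as simple as remembering a content hash per document between runs. A toy sketch of the idea (conceptual only, not how the framework actually tracks state):

    import hashlib

    def delta_sync(documents, seen_hashes):
        """Return only documents whose content hasn't been ingested before."""
        new_docs = []
        for doc in documents:
            digest = hashlib.sha256(doc["body"].encode()).hexdigest()
            if digest not in seen_hashes:
                seen_hashes.add(digest)
                new_docs.append(doc)
        return new_docs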

Once data is loaded into a vector database, the framework supports querying it, including hybrid search using the metadata added during pre-processing. As part of the querying process, the framework provides capabilities to capture feedback on retrieved data as well as to run evaluations against different pipeline configurations.
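
Conceptually, the hybrid search amounts to vector similarity restricted by a metadata filter. A numpy sketch (the field names and filter shape are invented for illustration):

    import numpy as np

    def hybrid_search(query_vec, index_vecs, metadata, filters, top_k=5):
        """Cosine search restricted to vectors whose metadata matches the filters."""
        keep = [i for i, m in enumerate(metadata)
                if all(m.get(k) == v for k, v in filters.items())]
        candidates = index_vecs[keep]
        candidates = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
        q = query_vec / np.linalg.norm(query_vec)
        scores = candidates @ q
        order = np.argsort(-scores)[:top_k]
        return [(keep[i], float(scores[i])) for i in order]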

Try it out, and if you're interested in chatting more about this, shoot us an email at founders@tryneum.com.




Interesting to see that the semantic chunking in the tools library is a wrapper around GPT-4. It asks GPT for Python code and executes it: https://github.com/NeumTry/NeumAI/blob/main/neumai-tools/neu...


You seem to have found a very polite way of highlighting this.

I assume that in the not-so-distant future, a malware scanner will detect this and refuse to let you run it locally.


Yeah, we were playing around with doing some semantic chunking. It works okay for some use cases. We have some ideas to go further on that.

Generally we have found that recursive chunking and character chunking tend to be short-sighted.


Don't you find it dangerous to just run the code w/o any sanitizing?

Why not capture a few strategies that the LLM returns as code that can be properly audited (and run locally, improving overall performance)?


It is dangerous; that's part of the reason we haven't productized it further. One of the ideas we had to productize the capabilities further was to leverage edge/lambda functions to compartmentalize the generated code. (Plus, it becomes a general extensibility point for folks who are not using semantic code generation and simply want to write their own code.)

The idea of auditing the strategy is interesting. The flow we have used for the semantic chunkers to date has been along these lines: 1) use the utility to generate the code snippets (and do some manual inspection), 2) test the code snippets against some sample text, 3) validate the results.
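
A hedged sketch of that audit-then-run flow: write the generated snippet to disk for inspection, then execute it in a separate process with a timeout. A subprocess is only a weak sandbox, so real isolation would still need containers or the lambda approach mentioned above:

    import subprocess, sys, tempfile

    def run_audited_snippet(code, sample_text, timeout=10):
        """Run a (manually audited) generated chunker against sample text."""
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(code)  # snippet is expected to read stdin and print chunks
            path = f.name
        result = subprocess.run(
            [sys.executable, path], input=sample_text,
            capture_output=True, text=True, timeout=timeout,
        )
        return result.stdout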


Why not use Stanford Stanza?


Very bearish on these frameworks and abstractions.

Yes, they're obviously useful for prototyping and for creating hype articles and tweets with fun examples. However, any engineer is capable of doing their own RAG with the same effort: minimal data extraction using the ancient pdf/scrape tools that are still open SOTA (or cloud OCR for best results) -> brute-force chunking -> embed -> load into an ANN index with a complementary metadata store.

Anyone doing prod needs to know the intricacies and make advanced engineering decisions. There's a reason there aren't similar end-to-end abstractions over creating Lucene (Solr/Elastic) indexes. Hmm, why not, after many decades? …

In reality, the RAG tech is not entirely novel: it's ETL. And complex ETL is often a serious data-curation effort. LLMs are the closest thing to enabling better data curation, and as long as you aren't competing with OpenAI (arguably any commercial system is), you can use ChatGPT to create your chunks.

Beyond this, embedding strategies are nice to abstract, but the best approach to embeddings still remains to create your own and figure out contextual integration on your own. Creating your own can also just mean fine-tuning. Inference is often an ensemble, depending on your use case.


I don't disagree with all your points. That said, what we have built has proven useful for us as we have built pipelines for customers, and we think it might be useful for others.

Probably the main point where I disagree with you is that RAG is just ETL. If that were the case, all the AI apps people are building would be AMAZING, because we solved the ETL problem years ago. Yet app after app being released has issues like hallucinations and incorrect data. IMO, the second you insert a non-deterministic entity in the middle of an ETL pipeline, it is no longer just ETL. To add value here, our focus has been on adding capabilities to the framework around data synchronization (which is actually more of a vector-management problem), contextualization of data through metadata, and retrieval (the part where we have spent the least time to date, but are currently spending the most).


The problem is that business people have made money for decades by just rubbing libraries together (though maybe not the MOST money) and will see these frameworks as a "simple" way to accelerate development, when in fact most of them are opinionated (and bad) ETL. Just put LangChain together and it's done, right?

I went through building a RAG pipeline for a company and brought up at each stage how there had been no tuning, no efficacy testing for different scenarios, no testing of different chunking strategies, just the most basic work, and they released it almost immediately. Surprisingly, to not much fanfare.

It doesn't really matter


DAIR.AI > Prompt Engineering Guide > Techniques > Retrieval Augmented Generation (RAG): https://www.promptingguide.ai/techniques/rag

https://github.com/topics/rag


Cool. Do you do any of the relevance calculations directly, or is that all handled by Weaviate? If so, is there any way to influence that part of it, or is it something of a black box?


Relevance calculations are handled by the vector DB, but we try to improve relevance with the use of metadata (you will see how our components have "selectors" so that metadata can flow all the way to the vector database at the vector level and influence the results/scores retrieved at search time).
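
For example, with Weaviate (mentioned upthread), metadata that flowed through a selector can become a where filter alongside the vector search. A sketch with the v3 Python client; the class and property names here are made up:

    import weaviate

    client = weaviate.Client("http://localhost:8080")
    query_vector = [0.1] * 384  # stand-in for the embedded user query

    result = (
        client.query
        .get("Document", ["text", "url"])            # hypothetical class/properties
        .with_near_vector({"vector": query_vector})  # semantic part of the query
        .with_where({                                # metadata filter from a selector
            "path": ["url"],
            "operator": "Like",
            "valueText": "*neum.ai*",
        })
        .with_limit(5)
        .do()
    )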


Got it. I'd encourage you to expose more of that functionality at the level of your application if possible. I think there is a lot of potential in using more than just cosine similarity, especially when there are lots of candidates and you really want to sharpen up the top few recommendations to the best ones. You might find this open-source library I made recently useful for that:

https://github.com/Dicklesworthstone/fast_vector_similarity

I've had good results from starting with cosine similarity (using FAISS) and then "enriching" the top results from that with more sophisticated measures of similarity from my library to get the final ranking.
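
That two-stage pattern looks roughly like this. In the sketch, doc_vecs is assumed to be a normalized float32 array, and expensive_similarity stands in for the richer measures from the library above:

    import faiss
    import numpy as np

    d = 384
    index = faiss.IndexFlatIP(d)  # inner product = cosine on normalized vectors
    index.add(doc_vecs)           # doc_vecs: normalized float32 (n, d) array

    def rerank(query_vec, candidate_ids, k=5):
        """Stage 2: re-score a small candidate set with a more expensive measure."""
        ranked = sorted(candidate_ids,
                        key=lambda i: expensive_similarity(query_vec, doc_vecs[i]),
                        reverse=True)
        return ranked[:k]

    _, ids = index.search(query_vec.reshape(1, -1), 50)  # stage 1: cheap top-50
    top_5 = rerank(query_vec, ids[0].tolist())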


How does this improve upon retrieval compared to just using any vector DB and semantic search?


Co-founder here :)

Today, it is mostly about convenience. We provide abstractions in the form of a pipeline that encompasses a data source, an embed definition, and a sink definition. This means you don't have to think about embedding your query or which class you used to add the data into the vector DB.

In the future, we have additional abstractions coming that will add more convenience. For example, we are working on a concept of pipeline collections, so that you can search across multiple indexes but get unified results. We are also adding more automation around metadata: as part of the pipeline configuration we know what metadata was added (and examples of it), so we can help translate queries into hybrid search. I think of it as the self-query retriever from LangChain or LlamaIndex, but one that automatically has context about the data at hand (no need to provide attributes).
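
A sketch of what searching a pipeline collection might boil down to (purely illustrative; search stands in for a per-pipeline query call, and scores are assumed comparable across indexes):

    def search_collection(pipelines, query, top_k=5):
        """Query several pipelines/indexes and merge hits into one unified ranking."""
        hits = []
        for p in pipelines:
            hits.extend(p.search(query))  # assume each hit is (score, text, metadata)
        hits.sort(key=lambda hit: hit[0], reverse=True)
        return hits[:top_k]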

Are there any specific retrieval capabilities you are looking for?


this sums up the current wave of AI 'companies':

submissions by this user (https://news.ycombinator.com/submitted?id=picohen):

Show HN: Neum AI – Open-source large-scale RAG framework (github.com/neumtry)

Show HN: ElectionGPT – easy-to-consume information about U.S. candidates (electiongpt.ai)

Efficiently sync context for your LLM application (neum.ai)

Show HN: Neum AI – Improve your AI's accuracy with up-to-date context (neum.ai)


I haven't yet seen any competitor come close to what we've achieved at my startup https://olympia.chat - very humanlike assistants crafted specifically for solopreneurs and bootstrapped startups


Gotta start somewhere! Take a look at this one as well! https://news.ycombinator.com/item?id=37824547


How is this any different from LlamaIndex [1]?

[1] https://www.llamaindex.ai


LlamaIndex is pretty awesome.

There are a couple of areas where we think we are driving some differentiation.

1. The management of metadata as a first class citizen. This includes capturing metadata at every stage of the pipeline.

2. Being infra-ready. We are still evolving this point, but we want to add abstractions that help developers apply this type of framework to a large-scale distributed architecture.

3. Native support for different types of data synchronization. So far we enable both full and delta syncs, but we have work in the pipeline to bring in abstractions for real-time syncing.


If someone is about to start their project using Haystack would you suggest they instead look at Neumtry?


Well, of course I'm biased on the answer :). But to give a not-so-biased answer, I would first try to understand what the project is about and whether RAG is a priority in it. If the project is leveraging agents and LLMs without worrying too much about context or up-to-date data, then Haystack could be a good option. If the focus is to eventually use RAG, then our framework could help.

Additionally, there might be a potential route where both are used, depending on the use case.

Feel free to dm if you want to chat further on this!


Actually Haystack is very focused on RAG lately, just have a look at the latest blog articles: https://haystack.deepset.ai/blog

(Disclaimer: I am a Haystack maintainer)


A bit odd that they might not be aware of that. Any ideas from this project you see that might benefit Haystack?


I understood Haystack as doing RAG, but your comment seems to define it differently than my understanding.


Why MySQL and not PostgreSQL or Redis on the roadmap for sources?


Postgres is already available :)


In fact, we will be releasing a blog post in the next few days on how we do real-time syncing for a RAG application with Postgres hosted on Supabase.
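
For the curious, one plausible mechanism for that kind of real-time sync (not necessarily what the blog post will describe) is Postgres LISTEN/NOTIFY, where a table trigger notifies the pipeline so it re-embeds only the changed rows. A sketch with psycopg2; the channel name and the reembed_and_upsert helper are hypothetical:

    import select
    import psycopg2

    conn = psycopg2.connect("postgresql://user:pass@host/db")  # placeholder DSN
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor()
    cur.execute("LISTEN documents_changed;")  # a table trigger issues NOTIFY on changes

    while True:
        if select.select([conn], [], [], 5) == ([], [], []):
            continue  # timed out with no notifications
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            reembed_and_upsert(note.payload)  # hypothetical: re-embed just the changed row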


Have you guys connected with MemGPT?


Haven't connected.



