Ingesting documents and using natural language to search your org docs with an internal assistant sounds more like a good use case for RAG[1]. Agents are best when you need to autonomously plan and execute a series of actions[2]. You can combine the two but knowing when depends on the use case.
I really like the OpenAI approach and how they outlined the thought process of when and how to use agents.
In this case, the agent would also need to learn from new events, like project lessons learned, for example.
Just curious: can a RAG[1] system actually learn from new situations over time in this kind of setup, or is it purely pulling from what's already there?
Especially with a client, consider the word choices around "learning". When using LLMs, agents, or RAG, the system isn't learning (yet) but making a decision based on the context you provide. Most models are a fixed snapshot. If you provide up-to-date information, they can give you an output based on that.
"Learning" happens when initially training the LLM, or arguably when fine-tuning. Neither of which is needed for your use case as presented.
Thanks for the clarification, really appreciate it. It helps frame things more precisely.
In my case, there will be a large amount of initial data fed into the system as context. But the client also expects the agent to act more like a smart assistant or teacher, one that can respond to new, evolving scenarios.
Without getting into too much detail, imagine I feed the system an instruction like: “Box A and Box B should fit into Box 1 with at least 1" clearance.” Later, a user gives the agent Box A, Box B, and now adds Box D and E, and asks it to fit everything into Box 1, which is too small. The expected behavior would be that the agent infers that an additional Box 2 is needed to accommodate everything.
So I understand this isn't "learning" in the training sense, but rather pattern recognition and contextual reasoning based on prior examples and constraints.
Basically, I should be saying "contextual reasoning" instead of "learning."
There is no memory that the LLM has from your initial instructions to your later instructions.
In practice you have to send the entire conversation history with every prompt, so you should think of it as appending to an expanding list of rules that you send every time.
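To make that concrete, here's a toy sketch. The `call_llm` stub is hypothetical; in a real integration an API call to your model of choice goes there. The point is that the system-level rules only persist because the whole `history` list is resent on every turn:

```python
# Minimal sketch of a "stateless" chat loop: the model only sees what you send,
# so the full history (rules included) must accompany every request.

def call_llm(messages):
    # Placeholder: a real implementation would POST `messages` to an LLM API.
    return f"(reply based on {len(messages)} messages of context)"

history = [
    {"role": "system",
     "content": 'Rule: Box A and Box B must fit into Box 1 with 1" clearance.'}
]

def chat(user_text):
    history.append({"role": "user", "content": user_text})
    reply = call_llm(history)  # the ENTIRE history goes out each time
    history.append({"role": "assistant", "content": reply})
    return reply

chat("Can Box A and Box B fit?")
chat("Now also fit Box D and Box E.")  # earlier rules ride along only because we resend them
```

Nothing is remembered server-side in this model of things; "memory" is just you re-supplying the growing list.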
What you're attempting to do, integrating an agent into your business, is difficult. It is, however, relatively easy to fake: just set up a quick RAG tool, plug it into your LLM, and you're done. From the outside, the only difference between a quick-n-dirty integration and a much more robust approach will be in the numbers. One will be more accurate than the other, but you need to actually measure performance to establish that as a fact and not just a vibe.
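For illustration, the quick-n-dirty version really is just retrieve-then-prepend. Here's a toy pure-Python sketch, with a made-up document set and bag-of-words cosine similarity standing in for real embeddings:

```python
import math
from collections import Counter

# Made-up corpus standing in for the org docs you'd ingest.
docs = [
    "Box A and Box B should fit into Box 1 with at least 1 inch clearance.",
    "Project lessons learned: always confirm container dimensions with the client.",
    "Shipping policy: oversized items require a second container.",
]

def vec(text):
    # Crude bag-of-words "embedding"; a real system would use an embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    num = sum(a[t] * b[t] for t in a)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def retrieve(query, k=2):
    q = vec(query)
    return sorted(docs, key=lambda d: cosine(q, vec(d)), reverse=True)[:k]

def build_prompt(query):
    # Prepend the retrieved context, then hand the prompt to your LLM.
    context = "\n".join(retrieve(query))
    return f"Context:\n{context}\n\nQuestion: {query}"

print(build_prompt("What clearance do Box A and Box B need in Box 1?"))
```

That's the whole "fake": it will often look convincing, which is exactly why you need measurements to tell it apart from a robust pipeline.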
First piece of advice: build up a dataset and measure performance as you develop your agent. Or just don't, and deliver what hype demands.
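A minimal version of that measurement loop, with a placeholder agent and made-up eval rows (swap in your real pipeline call and a proper grader; exact match is the crudest option):

```python
# Tiny eval harness sketch: run the agent over a fixed dataset, score the answers.

eval_set = [
    {"question": "Do Box A and Box B fit in Box 1?", "expected": "yes"},
    {"question": "Do Boxes A, B, D and E fit in Box 1?", "expected": "no"},
]

def agent_answer(question):
    # Placeholder deterministic "agent" so the harness runs end to end.
    return "no" if "D and E" in question else "yes"

def accuracy(dataset):
    correct = sum(agent_answer(row["question"]) == row["expected"] for row in dataset)
    return correct / len(dataset)

print(f"accuracy: {accuracy(eval_set):.0%}")
```

Re-run the same dataset every time you change the prompt, retriever, or model, and you have numbers instead of vibes.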
As for advice ... looking at what other commenters left ... if you want to do this seriously, I'd recommend hiring someone who has already done that kind of integration, at least as a consultant. Someone whose first reflex won't be to just tell you LLMs are fixed and can't learn, but who will also add that this isn't a limitation, since RAG pipelines are better suited for this task than fine-tuning [1].
Also, RAG isn't a monolithic solution; there are many, many variations. For your use case, I'd consider more elaborate solutions than baseline RAG, such as GraphRAG [2]. For the box problem above, you might want to integrate symbolic reasoning tools such as Prolog, or consider using reasoning models and developing your own reinforcement learning environments. Needless to say, all of these aspects need to be carefully balanced and optimized to work together, and you need to follow a benchmark/dataset-centric approach to developing your solution. For this, consider frameworks that were designed to optimize LLM/agentic workflows as a whole [3][4].
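To illustrate the symbolic-tool idea on the box problem above: rather than having the LLM reason about geometry in prose, let it call a deterministic checker. A hypothetical 1-D sketch (greedy first-fit-decreasing packing, made-up dimensions, flat 1" clearance per box):

```python
# Deterministic packing check an agent could call as a tool instead of
# free-form reasoning. Lengths are in inches; 1-D for simplicity.

CLEARANCE = 1.0  # required clearance per box, per the client's rule

def containers_needed(box_lengths, container_length):
    """Greedy first-fit-decreasing: returns how many containers are required."""
    bins = []  # remaining capacity of each container opened so far
    for length in sorted(box_lengths, reverse=True):
        need = length + CLEARANCE
        for i, cap in enumerate(bins):
            if cap >= need:
                bins[i] -= need
                break
        else:
            bins.append(container_length - need)  # open a new container
    return len(bins)

# Box 1 is 10" long: A and B fit, but adding D and E forces a second container,
# which is exactly the inference the agent is expected to surface.
print(containers_needed([4, 4], 10))        # A and B -> 1 container
print(containers_needed([4, 4, 4, 4], 10))  # A, B, D, E -> 2 containers
```

The LLM's job then shrinks to extracting the numbers from the request and explaining the tool's verdict, which is far more reliable than asking it to do the arithmetic itself.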
Shit is complex really.
[1] https://arxiv.org/abs/2505.24832 tells us generalization happens in LLMs once their capacity for remembering things is saturated, which might explain why fine-tuning has been less efficient than RAG so far.
Sound advice and much appreciated.
In this case, I might team up with someone to help me add this feature to my SaaS. But I’ll definitely dive deeper into the subject. Thanks for the info and the links!
There's also (of course) agentic RAG, especially if your data comes from many different types of resources and you set up some context/memory for it to rely on. In actuality, with a lot of context there isn't a lot of "learning" needed.
Incorporating more data or new data into the RAG pool is a form of “learning”, but in general agents don’t “learn” unless you give them a journal or allow them to modify their own prompt.
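The journal idea can be sketched in a few lines (the names here are made up; a real setup would persist the journal and feed the assembled prompt to the model):

```python
# "Journal" memory sketch: the agent appends lessons as they happen, and the
# journal is injected into the system prompt on every run, so new lessons
# shape future behavior without any model training.

journal = []

def record_lesson(lesson):
    journal.append(lesson)

def build_system_prompt():
    base = "You are a packing assistant."
    if journal:
        base += "\nLessons learned so far:\n" + "\n".join(f"- {l}" for l in journal)
    return base

record_lesson("Box 1 was too small for four boxes; plan for a second container.")
print(build_system_prompt())
```

Same caveat as before: the model itself stays frozen; the "learning" lives entirely in the text you choose to carry forward.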
This article was written a few weeks after MCP was released and touches on why MCP is important. While I guess you could argue that there's technically nothing to it, protocols such as MCP address a missing need: standardizing interactions between your AI app and another service. Code now needs to be written for users, devs (APIs), and AI.
smolagents by Hugging Face would be more of an agent framework. If frameworks were in scope, we would see smolagents/llamaindex/pydantic/etc. listed alongside the frameworks in figure 2. Several frameworks were left out of this paper because it focuses more on the protocols.
I like the idea of more comparisons of models. Are there plans to add independent analyses of these models or is it only an aggregation of input limits?
How do you see this differing from or adding to other analyses such as:
I made https://aimodelreview.com/ to compare the outputs of LLMs over a variety of prompts and categories, allowing a side-by-side comparison between them. I ran each prompt four times at different temperature values, and that's available as a toggle.
I was going to add reviews of each model but ran out of steam. Some users have messaged me saying the comparisons are still helpful for getting a sense of how different models respond to the same prompt and how temperature affects a model's output on the same prompt.
Hey, this is pretty insightful! I wonder if, in the course of researching to build this website, you reached any conclusions as to which AI assistant is currently ahead.
I want to point out you dodged the data question, and there's a reason for it.
I like your work visually on first glance, and god knows you're right about Gradio, even if it's irrelevant.
But peddling extremely limited, out-of-date versions of other people's data trumps that, especially with this tagline: "A website to compare every AI model: LLMs, TTSs, STTs".
It is a handful of LLMs, then one TTS model and one STT model, both with zero data. And it's worth pointing out, since this endeavor is motivated by design trumping all else: all the columns are for LLM data.
Now imagine going one step further and actually running a prompt across every AI model, then showing you the best answer and the AI model that generated it.
Those tools exist; they do not need to be imagined. Look into the related comments. Also, they do little but increase the labor of getting an answer. It's not exactly an improvement for the user to spend more time reviewing AI answers.
[1] https://www.willowtreeapps.com/craft/retrieval-augmented-gen...
[2] https://www.willowtreeapps.com/craft/building-ai-agents-with...