Once we start talking about multi-agent AI systems I think my intuitions go out the window. If we consider that GPT-4 is about as capable as a reasonably intelligent high school graduate or college freshman, I wonder how much actual benefit we get from multi-agent setups.
I mean that intuitively I couldn't imagine replacing 1 experienced professional with 1, 2, 10, 100 or even 1000 intelligent high school graduates. Intelligence doesn't seem strictly additive across multiple individuals in all cases. But then I consider that one of the most powerful lifelines on the TV game show "Who Wants to Be a Millionaire" was "Ask the Audience". I am reminded of the cliché of the wisdom of crowds, even when the crowd is made up of non-experts.
This suggests to me that there are kinds of problems where multiple lower-powered agents can produce a higher-quality solution, but also kinds of problems where a single higher-powered intelligence will be necessary. I haven't developed an intuition for when each approach is valid.
I suspect the real value is more from an architecture and caching perspective.
Architecture as in you may have multiple GPT instances being prompted to look at a problem from different angles: "Analyse the problem as a pessimist", "As an optimist", "as a mathematician", "an engineer", "a philosopher", etc.
Caching, as in determining what you can store from these outputs, e.g. "give a numeric 1-10 score of what you think of this product as an LGBT-friendly conservative from the Midwest", etc.
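To make that concrete, here's a rough sketch in Python of what I have in mind; call_llm is a stand-in for whichever completion API you use, and the personas and cache key are purely illustrative:

    import hashlib

    def call_llm(prompt: str) -> str:
        """Placeholder: swap in whichever completion API you actually use."""
        raise NotImplementedError

    PERSONAS = ["a pessimist", "an optimist", "a mathematician", "an engineer", "a philosopher"]

    _cache: dict[str, str] = {}  # in practice a persistent store, e.g. keyed by prompt hash

    def persona_views(problem: str) -> dict[str, str]:
        """Ask the same model to analyse one problem from several angles, caching each answer."""
        views = {}
        for persona in PERSONAS:
            prompt = f"Analyse the following problem as {persona}:\n\n{problem}"
            key = hashlib.sha256(prompt.encode()).hexdigest()
            if key not in _cache:  # identical persona/problem pairs are answered only once
                _cache[key] = call_llm(prompt)
            views[persona] = _cache[key]
        return views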
Sure, but just as a thought experiment imagine you have a really bad disease. I give you two options: 100 high-school students can diagnose and prescribe a treatment or you can choose 1 professional with 10 years of experience in related diseases.
You might initially prefer the professional with 10 years of experience. Would your opinion change if I told you that I selected the high-schoolers to be diverse, so that one is a pessimist, one is an optimist, one got good grades in engineering, and one loves philosophy?
Of course, my intuition might be wrong. For example, perhaps it legitimately would be better to have 100 high-school students where one has high-school-level knowledge of ear-nose-throat medicine, one of oncology, one of cardiology, etc. Except some controlling AI would have to synthesize their answers into a coherent response ... and that controller would be high-school level.
I'm not really sold either way if I am honest. I don't think we have the answers to these questions. It just shows that my intuition about intelligence is open to challenges.
In real life, though, that specialist with 10 years experience has a 2 year waiting list for new patients, and you wouldn't even get in to see them anyway, because they don't take your health insurance.
Meanwhile, the "100 high school students" on a mobile phone are the only medical consultation someone in Sub-Saharan Africa or Southern Asia is going to have access to at all.
"Why Not Both" would surely apply here. Maybe we can't give everyone access to a kind, caring, patient human doctor; but we sure can make coming in second place a lot less painful.
I suspect the divide may end up similar to that between a start up and a large company. Yes, some problems are better solved by a single person. But others benefit from a division of thought even if there is some bureaucratic overhead.
That is my speculation. But the question is one of intuition. If the speculation is correct, that some problems are better solved by a single more intelligent agent, then how can I determine the appropriate approach?
What I mean to say is, if I am considering building a product based on LLMs then I may have to make a basic decision: can I use multiple cheap LLMs in a multi-agent setup, or must I use a single expensive, powerful LLM? Right now I don't have any intuition about which kinds of problems are solved most efficiently by either approach. Just looking at a problem description, I can't intuit which approach is appropriate.
This is a very interesting and important question. I think that until we answer it rigorously, the best bet right now is to start with the expensive, powerful LLM, and then, after your system is prototyped and working, move prompts in your prompt network over to cheaper systems piece by piece as it proves feasible.
If you try to start with the multiple-cheap-LLMs approach, it is almost certainly going to be more difficult to get the prompting right, and if you don't already know that what you are trying to accomplish is feasible, you're adding a lot of work that might add up to nothing (even if it would actually work with the more powerful model).
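As a rough sketch of what I mean (the node abstraction and model names here are hypothetical, not any particular library):

    from dataclasses import dataclass

    def call_llm(model: str, prompt: str) -> str:
        """Placeholder for your actual completion call."""
        raise NotImplementedError

    @dataclass
    class PromptNode:
        name: str
        template: str
        model: str = "expensive-model"  # everything starts on the powerful model

        def run(self, **kwargs) -> str:
            return call_llm(self.model, self.template.format(**kwargs))

    # Prototype the whole prompt network on the powerful model first...
    nodes = {
        "summarise": PromptNode("summarise", "Summarise this ticket:\n{ticket}"),
        "classify": PromptNode("classify", "Classify this summary as bug/feature/question:\n{summary}"),
    }

    # ...then, once the pipeline demonstrably works, downgrade nodes one at a time
    # and re-check output quality after each swap.
    nodes["classify"].model = "cheap-model"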
It's not quite the same thing, but the "mixture of experts" model is a popular deep learning approach that displays very good performance: see e.g., https://arxiv.org/abs/2208.02813
The difference is that an LLM isn't 1000 different intelligences. It's one intelligence, being asked to pretend to be 1,000 different people. Every instance is essentially the same weights trained on essentially the same data. The difference in perspective doesn't resemble the difference between any two humans.
Humans love to think of multi-agent systems as being like a team of people. It's much more like a writer imagining different characters and how they would respond. When George RR Martin imagines all 500 characters in Game of Thrones, there is a lot of diversity of perspective and thought there. But all of it is coming from one intelligence and doesn't represent a collaboration in any traditional sense.
>The difference is that an LLM isn't 1000 different intelligences. It's one intelligence, being asked to pretend to be 1,000 different people.
No, it's not. It does a language model no good to settle on a single global persona. It needs to be able to predict text from wildly varying backgrounds and contexts. It's not pretending any more than anything else it does is pretending.
That's why experiments like the one below actually work.
This is something I'm working on, for now in a much more simplified format and still very rough. But I did manage to get team members to be contextually aware, so they know the team and everyone's details. They can even search through chat threads with any given team member.
https://www.conjure.team/
Cosign, with extreme prejudice. We went through this with AutoGPT.
I see two camps, people quietly working, and people working on grandiose ideas from the initial rush that make good headlines but not good products.
Some informal warning signs I use subconsciously:
- "paper" on Arxiv, and its about prompt engineering
- agent_S_
- state machine where the LLM is eating its own output repeatedly
- LLM makes decisions based on its own output.
There's 100x more alpha in the obvious stuff because even that's not well-implemented or shared widely. Ex. people are still stuck on hallucinations 90% of the time: an obvious way to handle that is doc retrieval.
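Rough sketch of the obvious version (embed and call_llm are placeholders for whatever embedding model and completion API you use; nothing here assumes a particular vector store):

    import numpy as np

    def embed(text: str) -> np.ndarray:
        """Placeholder for your embedding model."""
        raise NotImplementedError

    def call_llm(prompt: str) -> str:
        """Placeholder for your completion API."""
        raise NotImplementedError

    def answer_with_retrieval(question: str, docs: list[str], k: int = 3) -> str:
        """Ground the answer in the k most similar passages instead of letting the model free-associate."""
        doc_vecs = np.stack([embed(d) for d in docs])
        q_vec = embed(question)
        scores = doc_vecs @ q_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec))
        context = "\n\n".join(docs[i] for i in np.argsort(scores)[::-1][:k])
        return call_llm(
            "Answer using ONLY the context below. If the answer is not there, say you don't know.\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )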
Last n.b.: in 2021 I was frustrated with ~100% of models being behind locked doors. Then, I saw how GPT3.0 was working and evolving. My mantra became "products not papers." Maybe that applies downstream of the LLM now.
> My mantra became "products not papers." Maybe that applies downstream of the LLM now.
I agree, in the sense that my intuition is fed by having as many available examples as possible. I am not even sure one could quantify the difference between intelligence levels in a way that I would be satisfied with in a paper.
For example, I find it intuitive that a mixture of 10 narrowly fine-tuned GPT-3s is better at a task than 1 broadly fine-tuned GPT-3. But I don't have a real intuition about how many GPT-3s you would have to mix to match the quality of GPT-4, or whether there even is any number of GPT-3s you can mix to achieve the result of GPT-4. I think we just need to start building systems and see what happens.
TL;DR: AutoGPT has similar if not identical problems
re: AutoGPT
Got really excited at first. Thought maybe I had missed that it was viable in a year of playing with LLMs. Tried a few demos; they didn't work. Looked into it more and confirmed a core loop involved the LLM eating its own output to make decisions.
I still kept "investigate it" on my todo list, in case my earlier experiments with that approach were wrong.
I took it off the todo list later.
I saw near-universal feedback like yours; that general technique never worked IMHO, and IMHO sycophancy explains why. Paper here[1]; TL;DR the model is very likely to agree, so critical-feedback loops over multiple steps tend to settle into agreement rather than improvement.
IMHO this doesn't mean sycophancy breaks _all_ workflows; e.g. a flow for writing a story that involves outlining, writing, criticizing, then rewriting gives a genuine quality boost.
However, if a human does write => criticize 10 times, it keeps getting better each iteration. If you have an LLM do it 10 times, IMHO it's actively harmful after round 3.
"outline" => ["write page 1", "write page 2", "write page 3"] => ["feedback on page 1"..."feedback on page3" => ["use feedback and original draft to rewrite page1"..."page3"] => "combine pages 1 2 and 3 into cohesive story"
I've tested the writing loop you describe, and the biggest problem seems to be that currently GPT-4 is good at generalities, but gets into loops of generalities that don't drive things forward when you expect too much detail.
It's not that it can't improve the writing further per se, but it takes very detailed prompting to get it to give a detailed enough critique to do so consistently across even a page (e.g. it might come up with a great line but proceed to edit out the best paragraph elsewhere). To that extent I tend to agree with you, in that it will at least take a much more convoluted chain of prompts to maybe get there at the moment, and you'll be fighting GPT-4's tendency to actively cheer on really juvenile prose the whole way.
In a way I think the biggest hindrance to getting it to write better at the moment is that it has awful "taste", and having to explicitly give it a long list of rules to check against is a poor substitute.
As an aside, though, I think you could get reasonable but not great results at "bridge these two paragraphs and maintain the style" type tasks, or at expanding descriptions into a paragraph or two, though more so for non-fiction writing.
For fleshing out the basics of a technical spec and pointing out what I've missed I've had decent luck, on the other hand. It's not come up with any Earth shattering revelations, but for a dry spec that's not the point.
> However, if a human does write => criticize 10 times, it keeps getting better each iteration. If you have an LLM do it 10 times, IMHO it's actively harmful after round 3.
>> The game has no winner: the entertainment comes from comparing the original and final messages. Intermediate messages may also be compared; some messages will become unrecognizable after only a few steps.
>> The transmission chain method is a method used in cultural evolution research to uncover biases in cultural transmission.[1] This method was first developed by Frederic Bartlett in 1932.[2][1]
> In mathematics, computer science and logic, convergence is the idea that different sequences of transformations come to a conclusion in a finite amount of time (the transformations are terminating), and that the conclusion reached is independent of the path taken to get to it (they are confluent).
> More formally, a preordered set of term rewriting transformations are said to be convergent if they are confluent and terminating.[1]
To simulate a full-scale multi-agent game, there would need to be a "Fourth Estate" (and maybe a Fifth Estate); a peanut gallery of dissenters with signs and no jobs.
Then, you could model consensus with LLMs and have something better than the mediocrity that sometimes results from committees.
More samples from the same LLM vs. samples from different LLMs?
Maybe feed one or more agents relevant encyclopedia articles as context first; which read the most encyclopedia articles first?
Mmmm… Assuming you’re referring to US high school, I’m going to say that GPT-4 is more useful than a high school graduate.
Rationale:
- Speed
- Ability to recall
- Depth and breadth of “exposure”
I can’t speak to others’ high school experience, but in my experience, high school does not prepare young adults to be useful in the same settings where GPT-4 could be useful.
Example prompt:
- Pose this to a high school graduate and GPT-4, somehow benchmark quality.
1. “You are an experienced business strategist. Your client has the following problem <insert problem synopsis here>. Generate an outline of brainstorming topics your client should consider to accomplish <insert goal here>.”
- First, GPT-4 is going to respond in ~30sec or less
- Second, a high school grad has 0 context for these types of prompts.
> Mmmm… Assuming you’re referring to US high school, I’m going to say that GPT-4 is more useful than a high school graduate.
Useful for what?
LLMs are largely useless for the vast majority of labor that people with HS degrees tend to do (which mostly involves not being behind a keyboard all day).
Forgive my conjecture, as I'm no expert in this field, but have enjoyed playing with the different models.
I see it as a matter of attention. GPT-4 is limited in the number of tokens you can feed it and receive back from it. I see this as how much attention the LLM can give you. I see MetaGPT and other setups like it as increasing the attention of the LLM by allowing it to combine its short attention span with multiple assessments of the request, giving you a more complete picture of what you want, because it can keep giving consistent attention to the problem at hand instead of simply trying to solve it on the first try.
Someone please correct me if I'm way off, but this simple mental model helps me abstract this.
The problem with this is that it has no memory across the different contexts. An analogy would be giving one page of a five page document to five different people, then taking it away and asking them to collaborate. While they can each give more attention to their individual page, none of them can see the whole picture and a lot of information will be lost when trying to communicate.
You can use multiple agents, or split a lot of information across multiple requests to one agent. The result is the same. Some problems require a full understanding of the whole picture.
> If we consider that GPT-4 is about as capable as a reasonably intelligent high school graduate or college freshmen, I wonder how much actual benefit we get from multi-agent setups.
It's worth noting that GPT-4 internally uses a Mixture of Experts (MoE) model with 8 'experts' internally, so it's more similar to a multi-agent setup than you might think initially.
> It's worth noting that GPT-4 internally uses a Mixture of Experts (MoE) model with 8 'experts' internally, so it's more similar to a multi-agent setup than you might think initially.
That's a very common misunderstanding of what MoEs are. What you're describing is an Ensemble.
An MoE is where the input is "routed" to one of X "experts" in the higher layers of the network. Something like this: {input} -> [lower layers] -> <decide which upper layers to send input to> -> [expert] -> output
And these experts aren't experts in the space of human-defined subjects. Their expertise is in some embedding space, which might or might not line up with our intuition of a subject.
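A toy version of that routing, just to make the contrast with an ensemble concrete (the shapes and the single top-1 expert are simplifications; a real MoE routes per token inside transformer blocks and usually mixes the top-k experts):

    import numpy as np

    rng = np.random.default_rng(0)
    D, H, N_EXPERTS = 16, 32, 8

    W_lower = rng.normal(size=(D, H))               # shared lower layers
    W_router = rng.normal(size=(H, N_EXPERTS))      # router that scores the experts
    W_experts = rng.normal(size=(N_EXPERTS, H, D))  # one upper block per "expert"

    def moe_forward(x: np.ndarray) -> np.ndarray:
        h = np.tanh(x @ W_lower)                # {input} -> [lower layers]
        expert = int(np.argmax(h @ W_router))   # <decide which upper layers to send input to>
        return h @ W_experts[expert]            # -> [expert] -> output

    print(moe_forward(rng.normal(size=D)).shape)  # (16,)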
Right, but the intuition I am going for isn't "Are 10 high-school students who collaborate together smarter than 1 high-school student working on their own".
The intuition I am asking for is more like: "Are 10, 100, or 1000 GPT-3s capable of producing the equivalent intelligence of 1 GPT-4". That kind of reasoning expanded to GPT-N.
And further, if no number of GPT-3s can reach the level of intelligence of a GPT-4, then what set of problems can be solved by 100s of GPT-3s in a mixed model and what problems require the single GPT-4.
You can generally amplify models by using more tokens, more calls, or relying on some oracle of correctness. I see rapid growth of synthetic datasets from now on; we need the diversity of synthetic data to progress further.
> It's worth noting that GPT-4 internally uses a Mixture of Experts (MoE) model with 8 'experts' internally
Has this actually been confirmed, either officially by OpenAI or otherwise? As far as I know, George Hotz claimed this once, and since then everyone just assumed it was the truth without actually waiting for any sort of verification.
TL;DR This might work - but it will be like watching Groundhog Day. It will require many iterations, make too many mistakes to get there and won’t remember a thing.
My naive understanding of layers in a model is that each layer loosely acts as an expert in one step of the entire process.
For example, in an object recognition model, one layer takes on the task of separating objects from the background, another excels at knowing the colour of different things, another might learn the difference between a blue sky vs. the colour sky blue.
So essentially, we’re trying to mimic the same working model at a higher level of abstraction. Similar to how our body is made of atoms. Many atoms make a molecule. Many molecules make organic tissue, and amino acids that perform more complex operations. Skip all the way up and you have a human being. Human beings in numbers can put a man on the moon, create anti-matter and nuclear explosions.
In theory - this can work just as well. One agent to provide a solution, another to critique it, another to verify it, another to mimic the end-user. The big obvious missing piece is memory and the ability to learn while doing (at least in existing LLMs).
It’s like having a software development team that’s frozen in time and knowledge. These LLM agents will always require some micromanagement and hand holding, and will waste a lot of resources with failed attempts - that they are going to repeat every time you ask them to perform this task.
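For what it's worth, the loop itself is easy to sketch; the hard parts are exactly the memory and learning gaps mentioned above. Here call_llm is a placeholder, and the round cap reflects the earlier observation that iterating too long stops helping:

    def call_llm(prompt: str) -> str:
        """Placeholder for your completion API."""
        raise NotImplementedError

    def solve_with_review(task: str, max_rounds: int = 3) -> str:
        solution = call_llm(f"Propose a solution to this task:\n{task}")
        for _ in range(max_rounds):
            critique = call_llm(
                f"Task:\n{task}\n\nSolution:\n{solution}\n\nList concrete problems, or reply exactly OK."
            )
            if critique.strip().upper() == "OK":
                break
            solution = call_llm(
                f"Task:\n{task}\n\nSolution:\n{solution}\n\nProblems:\n{critique}\n\nRevise the solution."
            )
        verdict = call_llm(
            f"Does this solution satisfy the task? Answer PASS or FAIL.\n\nTask:\n{task}\n\nSolution:\n{solution}"
        )
        return solution if "PASS" in verdict.upper() else f"UNVERIFIED:\n{solution}"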
GPT-4 is not comparable to a high school graduate.
It will happily hallucinate facts and claim they are true.
Well, ok, maybe it is ;)
The issue they seem to tackle is trying to minimize hallucinations by injecting some human expertise into the pipeline. They do this by more strictly defining the roles and tasks a step in that pipeline needs to accomplish.
Interesting, multi-agent was almost my first thought when I began using ChatGPT. For complex tasks I figured you would want to chain together several LLMs, each for a dedicated domain. It would actually be really neat to be able to use a Zapier-like interface to build your own AI workflow.
Understanding a single node of the system is not the same as understanding the sum of all its parts. In other words, a system of 2 high school graduates may appear at first as just the sum of the output of 2 high school graduates, but the sum of 1000 high school graduates may appear as superintelligence.
The phenomenon I am describing is known as "Emergence."
> I mean that intuitively I couldn't imagine replacing 1 experienced professional with 1, 2, 10, 100 or even 1000 intelligent high school graduates
This analogy doesn't make sense, because the professional is presumably also a high school graduate. This case is more like leveraging a team of specialists with expertise in different domains.
I think it has to do with the fact that LLMs are stochastic, autoregressive models. There are "trajectories" they can take that are wrong (hallucinations) and there are trajectories that lead to the right answer. In essence, multi-agent setups are doing something like self-consistency and self-reflection.
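Self-consistency in its simplest form is just sampling several trajectories and keeping the majority answer. A minimal sketch (call_llm is a placeholder; how you control sampling temperature depends on your API):

    from collections import Counter

    def call_llm(prompt: str, temperature: float = 0.8) -> str:
        """Placeholder for your completion API, with sampling enabled."""
        raise NotImplementedError

    def self_consistent_answer(question: str, n_samples: int = 10) -> str:
        """Sample several reasoning trajectories and keep the most common final answer."""
        prompt = f"{question}\nThink step by step, then give only the final answer on the last line."
        answers = [
            call_llm(prompt, temperature=0.8).strip().splitlines()[-1]
            for _ in range(n_samples)
        ]
        return Counter(answers).most_common(1)[0][0]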
It’s the power to persuade (others to follow you). How do you decide which of the 100 agents is the expert? They need to persuade the others, especially when they don’t know whether they are the expert themselves. Sometimes the right decision is to follow someone else.
Different components of thought should be specialized. Choosing the meta structure will likely have an impact on what the system can achieve simply by the benefit of compartmentalization.
Meta-commentary: these different AI agent systems (frameworks?) sure are good at racking up GitHub stars but don't appear to have any use beyond a fun demo. My impression is that, like AutoGPT, this will simply fall over with anything even mildly complicated. Maybe it's the direction we're ultimately headed, but I'm just far too skeptical given what I've seen so far, and I don't believe we just need the right orchestration of models to pull off useful general purpose AI agents.
I think at least some of it is outright fraud. A lot of these are being pushed by shameless spammers and former crypto hucksters. Some of the vanity metrics like followers and stars can just be bought if you have a few bucks, and boosting those numbers makes it easier to cheat and convince others to give you more money. This cycle effectively pays for itself assuming you don't care about the reputational risks.
The creation process used 11,940 tokens on input and 2,993 tokens on output, which cost $0.35 and $0.18, respectively.
The game it generated consisted of four python classes in four separate files: Main, Game, Snake, and Food.
The game executed without error on the first try, but the snake wasn't able to 'eat' the food. Here's the relevant code for 'eating' food:
    # Check if the snake ate the food
    if self.snake.body[0] == self.food.position:
        self.score += 1
        self.snake.grow()
        self.food.generate()
The issue was that the snake's body was represented as a list of lists, whereas the food position was stored in a tuple. After changing the food position to a list, the game worked correctly.
I strongly believe this to be total bullshit. Here's how I managed to recreate their example of a BlackJack game using a single simple prompt in ChatGPT:
I'm not convinced the meta/agenty/company stuff is doing anything to help the LLM generate working code, and it looks like they never bothered to check the null hypothesis before hitting publish.
Expand please. I think the initial hype is gone and we see that most implementations are not that much of a game changer just yet.
Stable Diffusion can create images, but it's still incredibly cumbersome to find the right seed and prompt to get the image one desires.
ControlNet, Inpainting, and Loras are helpful but have to be implemented in a useful workflow.
Just starting to read this. Their Fig. 1 is basically the README of a project we've been quietly working on.
It's easy to imagine how systems like these might succeed by following workflows analogous to those humans use, on greenfield toy problems.
The trouble is, if you parachute something like this into an existing repository, or you want to build something of real significance, then you've got more context than can fit into an LLM's window. You don't just have the hallucination / stay-on-task problems, you have the problem of the LLM not having everything it needs to know to complete its task.
In my experience, the latter problem of ensuring the LLM knows everything it needs to know is the bigger problem. The other stuff (using standard operating procedures and workflows, as in this paper) is actually the relatively easy stuff. I'm not saying it's easy-easy or that it's obvious from the outside if you haven't played around with it; I'm not disparaging the paper.
So, what I'm most curious about, and I wonder if anyone here has seen good solutions, is this:
How do these multi-agent systems maintain/retrieve the appropriate domain context required to make good design decisions, good coding decisions, and so on? "Use embeddings and shove stuff into a vector database" is hand-wavy and doesn't get you all the way there. I'm looking for concrete solutions; academic papers that show a lot of promise; and the like.
(Maybe they lay it out in this paper and I just haven't gotten to it yet. But if they've cracked that nut I suspect they would lead with it.)
Yes, recently upscaled research for my iOS app. Have made a pipeline for fact extraction from various sources, including some sentiment analysis. Would have taken me years to do it manually. Will mainly be value-adding for the end user. Not directly client-facing, since this is done offline and shipped as data into the final product.
Yes. We've launched multiple production instances of this across a handful of clients. In general humans using the new workflows show very significant performance and quality improvements over unassisted humans so far for the well-defined tasks the workflows are tailored to perform.
AWWWW... It would be great if you could share some metadata on the use case - I wonder whether it is a narrow task (something that could run as a cron job), or something more general that can be creative and sometimes not fully under control.
What kind of tools does the agent use (web search, generating content, ...)?
Did you make any modifications to the code or use it as is? If so, how much time did you spend on the modifications and in what kind of areas?
Feel free to answer only the subset of the questions you feel comfortable with.
Our entire focus is on multi-agent workflows based on and trained on real people's published work. I think the output is amazing, but I'm biased of course. Its ability to accomplish soft tasks like advice, help, etc. is far superior to vanilla GPT-4.
This is just a collection of static ChatGPT prompts and an output aggregator - I really don't see any novel research or real-world applicability of this aside from potentially scaffolding a project. One of the assumptions baked into the coding prompt is that you will never need to use an external library or API [1] which is an inherent requirement in most practical use cases.
> ## Code: {filename} Write code with triple quoto, based on the following list and context.
> 1. Do your best to implement THIS ONLY ONE FILE. ONLY USE EXISTING API. IF NO API, IMPLEMENT IT.
Honestly, this paper and many similar ones are more a scientific proof of what people in the open-source community have already published.
Especially after tree-of-thoughts it should be clear to everybody that LLMs profit from the same guidelines and structures that allow humans to think better.
The future will likely be simulated companies and the business logic will be how you compose this agent interaction.
I haven't run it yet, but I feel this specific paper is just good prompt engineering (with a little bit of real engineering in it).
Also it reads like a pitch honestly. Would bet they are looking for funding rn.
I am very skeptical that doing things like producing "Competitive Quadrant Charts" (see Figure 3) is a helpful form of prompt engineering. The extremely weak ablation (Section 4.4) rings a lot of alarm bells for me also.
I'm gonna take some time this afternoon to give this a spin. The challenge is a multiplayer Snake game set on an expanding canvas, which can support an arbitrary number of players and communicates over websocket. Players join through the browser and control their snake with the keyboard.
This was a take-home challenge I got back in 2015 for an internship. I spent all day on it and had a blast. Curious to compare my nooby college code to state of the art LLM code!
Is it just me, or should it be "Meta Programming for a Collaborative Multi-Agent Framework"? I'm not a native speaker, but I encounter this kind of odd word order a lot in papers recently, and it came as no surprise to me that the author list contains a lot of Chinese-sounding names. I would be curious whether this word order sounds better when translated into Mandarin word by word.
It's pretty interesting how well it works, but there is a lot more to improve, since you can refine the agents beyond their generalized domain, into their own perspective and real published knowledge, for further improvement.
Wanted to give it a try, but gating the product behind Twitter/Google SSO was a non-starter for me. Hope you can get a no-login-required demo up eventually!
I think you have to be pretty realistic about the costs of LLM products; there is a reason no one has a non-gated product -- it would be an absolutely ridiculous and uncontrollable cost.
That being said, totally understood. Logging in is very little cost though, we just need to make a better value prop upfront.