Multi AI agent systems using OpenAI's assistants API (github.com/metaskills)
227 points by metaskills 29 days ago | 80 comments

Assistants API is promising, but earlier versions have many issues, especially with how it calculates the costs. As per OpenAI docs, you pay for data storage, a fixed price per API call, + token usage. It sounds straightforward until you start using it.

Here is how it works. When you upload attachments, in my case a very large PDF, it chunks that PDF into small parts and stores them in a vector database. It seems like the chunking part is not that great, as every time you make a call, the system loads a large chunk or many chunks and sends them to the model along with your prompt, which inflates your per request costs to 10 times more than the prompt + response tokens combined. So, be mindful of the hidden costs and monitor your usage.

> as every time you make a call, the system loads a large chunk or many chunks and sends them to the model along with your prompt,

This is how RAG works.

While you can come up with work-arounds like using lesser LLMs as a pre-filtering step, the fact is that if you need GPT to read the doc, you need GPT to read the doc.

True, this is how RAG works, but this is why I prefer to use open-source LLMs for RAG: the token costs are less opaque, and I can control how many chunks I pull from the database to manage my costs.
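To make the "control how many chunks I pull" point concrete, here is a minimal sketch of top-k retrieval with a rough input-token estimate. Everything in it is an assumption for illustration: the function names, the toy two-dimensional embeddings, and the per-chunk token counts are made up, and it says nothing about how OpenAI's retrieval actually works internally.

```javascript
// Hypothetical sketch: none of these names come from OpenAI or any library.

// Cosine similarity between two equal-length numeric vectors.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k chunks most similar to the query embedding.
function topKChunks(queryEmbedding, chunks, k) {
  return chunks
    .map((chunk) => ({
      ...chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}

// Rough input size: prompt tokens plus the tokens of every retrieved chunk.
function estimateInputTokens(promptTokens, retrievedChunks) {
  return retrievedChunks.reduce((sum, c) => sum + c.tokens, promptTokens);
}

// Toy data: three 800-token chunks with made-up embeddings.
const chunks = [
  { id: 'a', embedding: [1, 0], tokens: 800 },
  { id: 'b', embedding: [0, 1], tokens: 800 },
  { id: 'c', embedding: [0.9, 0.1], tokens: 800 },
];

const picked = topKChunks([1, 0], chunks, 2);
console.log(picked.map((c) => c.id), estimateInputTokens(50, picked));
```

With k under your control, the retrieved chunks dominate the input size in a predictable way (here a 50-token prompt plus two 800-token chunks), which is the knob the hosted service hides from you.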

I believe it will get better and more efficient as we go. On a side note, OpenAI seems to release products before they are ready and they evolve as they go.

> I believe it will get better and more efficient as we go.

Yes of course. The point remains: the LLM has to process the data somehow.

If you are concerned about costs and token usage then switch to a provider that works for your problem (Flash Gemini looks very interesting..)

Yup, this seems right. You pay for tokens no matter what, even in other APIs. Did you know you can set an expiry for files, vector stores, etc.? No need to pay for long-term storage on those. Also, threads are free.
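For the expiry point, a small sketch of what a vector store expiry policy might look like as a request body. The helper name `vectorStoreParams` is hypothetical, and the `expires_after` shape (an `anchor` of `last_active_at` plus a `days` count) is my understanding of the Assistants API v2 vector store options, so check the current API reference before relying on it. The code only builds an object and makes no API call.

```javascript
// Hypothetical helper; it only builds a request body and makes no API call.
function vectorStoreParams(name, idleDays) {
  return {
    name,
    // Assumed shape of the expiry policy: expire N days after the store
    // was last used, so idle stores stop accruing storage charges.
    expires_after: { anchor: 'last_active_at', days: idleDays },
  };
}

console.log(JSON.stringify(vectorStoreParams('manual-pdf', 7)));
```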

There isn’t really any other way for this to work. The only way for the model to answer questions on your pdf is for the information to be somewhere in the prompt.

That might be true of specific models or specific APIs for accessing them, but I’d argue it isn’t even remotely true of neural networks generally or generatively-pretrained decoder-only attention-inspired language models in particular.

Ideally if you want a model’s weights to include a credible representation of non-trivial data you want it somewhere in the training pipeline (usually earlier is better for important stuff, but that’s a heuristic at best), but there’s transfer learning of various kinds, and joint losses of countless kinds (CLIP in SD-style diffusers comes to mind), and fine tunes (if that doesn’t just count as transfer learning), and dimensionality reduction that is often remarkably effective, and multi-tower models like what evolved into DLRM, and I’m forgetting/omitting easily 100x the approaches I mentioned.

It’s possible I misunderstand you, so please elaborate if so?

The way they vectorized the PDF could be less efficient than simply extracting the text and dropping it into context as text. If it's a 100 MB PDF then it's probably a scanned PDF, and OpenAI is probably using an OCR model to vectorize each page directly. It seems an opaque process with room to be inefficient. So I would be interested to know if we could save on token/vector fees by preprocessing the PDF to text with our own OCR.

No, it is not a scanned PDF but a standard textual PDF with tables, bullet points, chapters, etc. Somewhat like a manual.

How large is a very large PDF?

Close to 100mb.

FWIW, that's about an order of magnitude larger than I imagined a "very large PDF" to be. That's an enormous PDF.

Are the pages complete images (scanned document) or is it 100mb of text with some images (graphs etc.) mixed in?

plain text, tables, and bulleted lists - all text, no graphs or images.

I'd be interested in knowing if anyone is seriously using the Assistants API. It feels like such a lock-in to OpenAI's platform when you can alternatively just use completions, which are much more easily interchanged.

I do and built Assistants API compat layer for Groq and Anthropic: https://github.com/supercorp-ai/supercompat I’d argue that Assistants API DX > manual completions API.

Aye, but your FinOps will be complaining even with simple use.

Assistants API use in prod used to suck because it would send the full convo on each message. But last month they added an option to send truncated history, so it's no longer $2 a pop, thankfully. Also Grok, Haiku and Mistral are cheap.
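The truncated-history idea is easy to reproduce client-side if you are on plain chat completions. A minimal sketch, assuming messages in the usual `{ role, content }` shape; `truncateHistory` is a made-up helper, not an SDK function. On the Assistants side I believe the equivalent run option is `truncation_strategy` with a `last_messages` type, but treat that name as something to verify in the docs.

```javascript
// Hypothetical helper: keep system messages plus only the last N other messages.
function truncateHistory(messages, lastN) {
  const system = messages.filter((m) => m.role === 'system');
  const rest = messages.filter((m) => m.role !== 'system');
  // slice(-0) would return everything, so handle lastN <= 0 explicitly.
  const tail = lastN > 0 ? rest.slice(-lastN) : [];
  return [...system, ...tail];
}

const history = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Question 1' },
  { role: 'assistant', content: 'Answer 1' },
  { role: 'user', content: 'Question 2' },
  { role: 'assistant', content: 'Answer 2' },
  { role: 'user', content: 'Question 3' },
];

// Only the system prompt and the last three turns get sent (and billed).
console.log(truncateHistory(history, 3).map((m) => m.content));
```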

Are you using Assistants API v2 with streaming?

Yeah, I do both in prod and in the lib. In the lib I even ported Anthropics streaming API to be OpenAI compatible. Will write the docs over the coming days if interested.

I've indeed refused to work with some providers giving only a chat interface and not a completion interface because it made the communication "less natural" to the model (like adding new system messages in between for function calling on models which don't officially support it, or adding other categories than system/user/assistant)

Great points. Don't even get me started about how function calling in other LLMs costs me tokens. Something OpenAI provides OOTB. I'm also not a big fan of OpenAI's lock-in. Right now I'm on a huge Claude 3 Haiku kick. That said, OpenAI does seem to get the APIs right and my hunch is the new Assistants API is going to potentially disrupt things again. Time will tell.

I would love to be using Claude, but you can't get API access (beyond an initial trial period) in the EU without providing a European VAT number. They don't want personal users or people to even learn and experiment I guess.

You can use the Claude APIs via OpenRouter with a pre-paid account.

Thanks, this did the job!

Interesting, would Amazon Bedrock be an alternative? That's how I use Claude.

I'd guess it's more likely about the additional programming needed to meet GDPR compliance requirements.

Opus is really cool. I’ve found it to have a few persistent bugs in what I initially assumed was tokenization but now wonder if it might be more fundamental, but modulo a few typographical-level errors, I personally think it’s the most useful of the API-mediated models for involved sessions.

And there are some serious people at Anthropic, they’ll get the typo thing if they haven’t already (been a busy week and change, they easily could have shipped a fix and I overlooked it).

> Don't even get me started about how function calling in other LLMs costs me tokens. Something OpenAI provides out of the box.

Not sure what you mean by this.

I have some assumptions/guesses on how billing works. Gonna do a post on this on my unremarkable.ai blog, please do sign up for posts there, no spam. I could be right or wrong but need to do some experiments and publish later.

I'm not sure you're talking about the same thing: OpenAI specifically has an "Assistants API" that manages long-term memory and tool usage for the consumer: https://platform.openai.com/assistants

I'd guesstimate 99% of people using LLMs are using instruct-based message interfaces that have a variation of system/user/assistant. The top models mostly only come as completion models, and even Anthropic has switched to a message-based API.

I've used it, and in some cases it's cutting days or weeks of development off the time it takes to get to testing the market.

In some cases the lock in is what it is for now because a particular model in reality is so far ahead, or staying ahead.

It doesn't mean other options won't become available, but it does matter to relate your need to your actions.

Getting something working consistently for example might be the first goal, and then learning to implement it with multiple models might be secondary. The chances of that increase the later other models are explored in some cases.

It should be possible to tell pretty quickly if something works in a particular model that's the leader, how others compare to it and how to track the rate of change between them.

I know at least one team at work is using the Assistants API, and I'm talking with another team that is leaning pretty heavily towards using it over building a custom RAG solution themselves, or even over other in-house frameworks.

I use it mostly exclusively (I've even developed a Python library for it, https://github.com/skorokithakis/ez-openai), because it does RAG and function calling out of the box. It's pretty convenient, even if OpenAI's APIs are generally a trash fire.

I've not seen any of these "agentic" systems be all that useful in practice. It's a complicated chain of software where a lot can go wrong at any step, and the probability of failure explodes when you have many steps.
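The "probability of failure explodes" claim is just compounding. A quick sketch, under the simplifying assumption that every step succeeds independently with the same probability (real agent chains are messier, and the 95% figure is made up for illustration):

```javascript
// Probability that every step in a chain succeeds, assuming steps are
// independent and share the same per-step success rate.
function chainSuccessRate(perStepSuccess, steps) {
  return Math.pow(perStepSuccess, steps);
}

// A step that works 95% of the time looks fine alone, but not when chained.
for (const steps of [1, 5, 10, 20]) {
  console.log(`${steps} steps: ${(chainSuccessRate(0.95, steps) * 100).toFixed(1)}%`);
}
```

At 95% per step, ten chained steps already succeed only about 60% of the time, and twenty steps about 36%.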

I stay away from such frameworks because:

- Writing what I want in Python/other-lingo gives me much more customizability than these frameworks offer.

- No worries about the future plans of the repo and having to deal with abandonware.

- No vendor lock in. Currently most repos like this focus on OpenAI's models, but I prefer to work with local models of all kinds and any abstraction above llama.cpp or llama-cpp-python is a no-no for me.

The last point means I refuse to build on top of ollama's API as it's yet another wrapper around llama.cpp.

Not using the ollama API means you have to keep track of context yourself, and run all your stuff in the same box. Hardly ideal.

What are the use cases where people are using multi-agent AI systems to solve problems that deliver real value? Does anyone have something hands-on right now?

We’ve tried. A lot. Custom frameworks and all.

There is really no way to make the ensemble behave with an acceptable level of consistency.

Where we ended up is now having a frontier model generate a whole tree of possible execution plans, and then have the user select one of those paths, and then we just run whatever the user chose in a plain sequence until the next decision point that needs user approval.

I've encountered two viable cases. The first: the instructions are too complex, there are too many tools, or the processing steps are wildly different. In that case it simplifies the processing a lot to have a few well-defined steps, each doing their thing, with a coordinator on top, either sequential or intelligent, that is focused only on next-step routing.

The other is memory for conversational retrieval. AI memory is still quite limited, especially when a lot of tokens need to be in context, and an overly long context impedes the LLM's ability to focus on the task itself, especially if the context is itself a conversation or a request. So spreading the context across a few agents, propagating the user request among them, and having them produce answer fragments for another LLM to assemble into an answer lets you keep the conversational context without swamping the LLM with noise.

The problem, though, remains latency: as soon as you nest them, latency explodes, since you can only stream the last layer of LLM output.
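For the first case above (a few well-defined steps plus a coordinator that only decides the next step), here is a toy sketch. The "agents" are plain functions standing in for LLM calls, and `runCoordinator`, `router`, and the state shape are all invented for illustration, not taken from any framework.

```javascript
// Hypothetical coordinator: asks the router for the next agent name, runs it,
// and stops when the router returns null or the step budget is exhausted.
function runCoordinator(agents, router, initialState, maxSteps = 10) {
  let state = initialState;
  const trace = [];
  for (let i = 0; i < maxSteps; i++) {
    const name = router(state);
    if (name === null) break;
    trace.push(name);
    state = agents[name](state);
  }
  return { state, trace };
}

// Stand-ins for narrowly scoped agents; each does exactly one thing.
const agents = {
  parse: (s) => ({ ...s, numbers: s.input.split(',').map(Number) }),
  sum: (s) => ({ ...s, total: s.numbers.reduce((a, b) => a + b, 0) }),
};

// Rule-based router; an "intelligent" coordinator would put an LLM call here.
const router = (s) => {
  if (!('numbers' in s)) return 'parse';
  if (!('total' in s)) return 'sum';
  return null;
};

const result = runCoordinator(agents, router, { input: '1,2,3' });
console.log(result.trace, result.state.total);
```

The `maxSteps` cap is there because a router that never returns null would otherwise loop forever, which is a cheap guard against the runaway behaviour people describe in this thread.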

I tried the last crop. Interesting idea but the success rate of any real multi step task always approached 0% the longer it went

I imagine having an agent set up with specific RAG context to solve a specific problem and having another with a different RAG context to solve a different problem can be useful.

I see customer support talked about a lot as a use case for this. But do these systems really manage to solve the issue and dramatically reduce the need for human feedback?

From the website linked in the readme:

“A lot of research has been doing in this are and we can expect a lot more in 2024 in this space. I promise to share some clarity around where I think this industry is headed. In personal talks I have warned that multi-agent systems are complex and hard to get right. I've seen little evidence of real-world use cases too”

These assistant systems fascinate me, but I just don’t have the time and energy to set something up. I was going to ask if anyone had a good experience with it, but the above makes it sound like there’s not much hope at the moment. Curious what other people’s experiences are.

We tried using a multi-agent system for a complex NLP-type task and we found:

- Too many errors that just propagate on top of each other. If a single agent in the chain generates something even a little bit off, then the whole system goes off the rails.

- You often end up having to pass a massive amount of shared context to every agent which just increases the cost dramatically.

Curiously enough we had an architect from OpenAI tell us the same thing about agent systems a few days ago (our company is a big spender so they serve a consulting function), so I don't think anybody is really finding success with multi-agent systems currently. IMO the core tech is nowhere near good enough yet.

> Too many errors that just propagate on top of each other

LLMs are like the perfect improv comedy troupe, they virtually always say “yes, and…”

> perfect improv comedy troupe

Check out Vtubers like CodeMiko, who improvs against LLM agents. Or 24/7 streaming LLM cartoon shows that take audience plot suggestions.

we do multistep programs in louie.ai via a variety of agents/tools, like "get X data from DB Y, wrangle cols A+B in Python, and then draw an interactive map + graph"

The ultimate answer is fairly short if you are a senior python data scientist, like 50loc. The agents will wander and iterate until they push through. You might correct & tweak if a bit off.

Importantly, this does agents opposite of the way Devin AI engineer replacements are presented. Here, you get it to do a few steps, and then move on to the next few steps. The agents still crank away a ton and do all sorts of clever things for you... to get you more reliably to the next step, vs something big & wrong.

So the human is like a reviewer, coming in, checking things, tweaking etc, then sending it back to the machine? (At which point the cycle continues)

Yes, imagine data analysis scenarios like Excel users or Jupyter notebooks, or operational investigations like user 360's and security incidents. Just now defaulting to natural language and connected to your data silos and a variety of analytics tools & libraries.

We try to make the generated code and backing data explainable. Users are figuring out the scenario by having the AI go ahead for them, and automating much of the debug loop in typical coding and investigations, so folks can focus more on the analysis, less on syntax, schemas, libraries, and be more ambitious on each step.

Importantly, it is still kind of like making a much more accessible Jupyter notebook or editable excel/doc, vs a linear chat session. Instead of generating the whole notebook and it being buggy and you starting over (~= Devin, or notebook.io's ChatGPT plugin), you drive it forward only 1-3 cells at a time, and since it is an interactive document, you can edit those, go to the next, or non-destructively edit earlier ones. In contrast, ChatGPT's data assistant deletes cells below the current edit, which would stink in a normal data env.

There are other differences, but from a perspective of using genAI well, we budget 3-60s for genAI assisting in 1-3 steps, aiming for 10-100x productivity wins and a lot more peace of mind during it. Taking 1-3 steps forward may mean the AI takes 3-10 internally due to backtracking / CoT / etc

We could let the system take 100 turns, and have interesting experiments there such as around security investigations, but the use cases become more niche due to cascading errors => reliability.

Thanks @beoberha, I am too. I like one take I heard on Twitter. The sentiment was something like: these types of systems are useful under the AI-Powered Productivity industry, which has incremental gains, no big bangs. Said another way, if your job was to help a TON of your employees be more productive individually, it is worth it because companies measure those efforts broadly and the payoff is there. But again, not big. My advice is for folks to stay lower level and hook AI automation up with simple, closed-loop LLM patterns that feel more like basic API calls in a choreographed manner. OMG, hope all that made sense.

that's actually a great reply, thanks

A lot of folks I've spoken with say that single-agent systems are still extremely limited, let alone multi-agent platforms. In general, it seems to boil down to:

- Agents need lots of manual tuning and guardrails to make them useful

- Agents with too many guardrails are not general-purpose enough to be worth the time and effort to build

I believe truly great agents will only come from models whose weights are dynamically updated. I hope I'm wrong.

By the time you do get around to it, OpenAI will have built a full interface for this. This is the type of stuff that’s gonna get steamrolled.

I'm impatiently waiting to become the ultimate armchair music video director I've always dreamed of once this video AI thing rolls out...

Pretty much this. I'd love counter examples of startups in the space that haven't been crushed from the top yet.

Can anyone recommend the best way to use AI to search all of my documents for a project? I've got specifications, blueprints, emails, forms, etc.

Would be great to be able to ask it, 'have we completed the X process with contractor Y yet?'


A bit off topic, but has anyone seen any agent systems focused on improving the agents capabilities with more usage?

From their linked main page:

> In my opinion, exploration of multi-agent systems is going to require a broader audience of engineers. For AI to become a true commodity, it needs to move out of the Python origins and into more popular languages like JavaScript , a major fact on why I wrote Experts.js.

I wholeheartedly agree

I don’t understand the comment about server-sent events not being async friendly.

What is unfriendly about this?

  import OpenAI from 'openai';

  const openai = new OpenAI();

  async function main() {
    const stream = await openai.chat.completions.create({
      model: 'gpt-4',
      messages: [{ role: 'user', content: 'Say this is a test' }],
      stream: true,
    });
    // Print each streamed token as it arrives.
    for await (const chunk of stream) {
      process.stdout.write(chunk.choices[0]?.delta?.content || '');
    }
  }

  main();

It’s easy to collect the streaming output and return it all when the llm’s response is done.
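Collecting the stream and returning it once the response is done can be sketched like this. It works on any async iterable whose chunks look like chat completion deltas; `collectStream` and `fakeStream` are made-up names, and the fake stream stands in for the real API so the snippet runs without a key.

```javascript
// Concatenate the text deltas of a chat-completion-style stream into one string.
async function collectStream(stream) {
  let text = '';
  for await (const chunk of stream) {
    text += chunk.choices[0]?.delta?.content || '';
  }
  return text;
}

// Hypothetical stand-in for the API: yields chunks shaped like streaming deltas.
async function* fakeStream(parts) {
  for (const part of parts) {
    yield { choices: [{ delta: { content: part } }] };
  }
}

collectStream(fakeStream(['This ', 'is ', 'a ', 'test'])).then((text) => {
  console.log(text);
});
```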

They’re referring to the:

> assistant.on("textDelta", () => …

Callbacks, which are not async and can’t be streamed that way directly without wrapping it in some helper function.

(Which does seem obvious; I’m also not sure why they called it out specifically as not being async friendly? I guess most callback style functions these days have async equivalents in popular libraries and these ones don’t)
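The "wrapping it in some helper function" part might look roughly like the sketch below, which adapts `.on(event, callback)` style events into an async generator so they can be consumed with `for await`. All names here are hypothetical: `eventsToAsyncIterator` and `TinyEmitter` are made up, the `'end'` event name is an assumption, and a real SDK callback may pass different arguments than the single value used here.

```javascript
// Adapt callback-style events into an async generator.
// Listeners are attached when iteration starts, so start iterating
// before the emitter begins firing.
async function* eventsToAsyncIterator(emitter, dataEvent, endEvent) {
  const queue = [];
  let done = false;
  let wake = null;

  const notify = () => {
    if (wake) {
      wake();
      wake = null;
    }
  };
  emitter.on(dataEvent, (value) => { queue.push(value); notify(); });
  emitter.on(endEvent, () => { done = true; notify(); });

  while (true) {
    // Drain everything buffered so far, then wait for more or finish.
    while (queue.length > 0) yield queue.shift();
    if (done) return;
    await new Promise((resolve) => { wake = resolve; });
  }
}

// Minimal emitter used only for the demo (a stand-in for a real event source).
class TinyEmitter {
  constructor() { this.handlers = {}; }
  on(event, handler) {
    if (!this.handlers[event]) this.handlers[event] = [];
    this.handlers[event].push(handler);
    return this;
  }
  emit(event, value) {
    (this.handlers[event] || []).forEach((h) => h(value));
  }
}

async function demo() {
  const fake = new TinyEmitter();
  setTimeout(() => {
    fake.emit('textDelta', 'Hel');
    fake.emit('textDelta', 'lo');
    fake.emit('end');
  }, 0);
  let text = '';
  for await (const delta of eventsToAsyncIterator(fake, 'textDelta', 'end')) {
    text += delta;
  }
  return text;
}

demo().then((text) => console.log(text));
```

This sketch ignores error events and never removes its listeners, both of which a production wrapper would need to handle.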

> const stream = await…

Is this right? Aren’t you prematurely unwrapping the promise here?

I believe that is what gets the call started so awaiting there is okay. There isn’t anything to stream at that point.

AI botnet?

Ooh, shiny!

Bare JS. What is this 2001?

What would you like to see instead? What would be the benefit of something more complex?

LMAO. Yes, I love ESM modules. So maybe more like 2012 or 2015. Would you like to see TypeScript?

Thank you for using vanilla JS!

Not OP, but I use TypeScript because it adds a layer of safety to the codebase.

It's like having good test coverage - you can make large changes and if the tests pass (the code compiles), you can be fairly confident that you didn't mess anything up.

I've written Ruby for years, so I'm used to dynamically typed languages. But JavaScript is its own level of special, and there are so many ways you can accidentally mess things up.

Having tests cover every single path (especially failure paths) can be very time consuming, and often hard or messy to set up (how would you mock the OpenAI module returning an error when adding metadata to a thread?), whereas using something like TypeScript can make sure your code handles all paths somewhat correctly (at least as well as the types you defined).

Your code looks clean, and you appear to have good test coverage, so you do you though :-)

Yes, this is great, so much easier to work with.

Y'all just made my day!

I was pleasantly surprised to see the .js. THANK YOU!!

Please, no

My main conversation “loop” at https://olympia.chat has tool functions connected to “helper AIs” for things such as integrating with email. It lets me minimize functions on the main loop and actually works really well.

Sid Kapoor, Content Specialist, forgot to include himself in Growth or Pro plans. Guess he is Basic!

I'm sorry but that is absolutely hilarious.
