Building reliable systems out of unreliable agents (rainforestqa.com)
295 points by fredsters_s 7 months ago | 54 comments



This is a great write-up! I nodded my head through the whole post. It very much aligns with our experience over the past year.

I wrote a simple example (overkiLLM) of getting reliable output from many unreliable outputs here[0]. It doesn't employ agents, just an approach I was interested in trying.

I chose writing an H1 as the task, but a similar approach would work for writing any short blob of text. The script generates a ton of variations, then uses head-to-head voting to pick the best ones.

This all runs locally / free using ollama.

0 - https://www.definite.app/blog/overkillm
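
If anyone wants the shape of it in code, here's a minimal sketch of the generate-then-vote loop, assuming the `ollama` Python package and a local llama3 model (prompts and names are illustrative, not the actual overkiLLM script):

    import itertools
    import random
    from collections import Counter

    import ollama

    TASK = "Write a punchy H1 for an analytics product."

    def ask(prompt):
        resp = ollama.chat(model="llama3", messages=[{"role": "user", "content": prompt}])
        return resp["message"]["content"].strip()

    # 1. Generate a bunch of candidate H1s.
    candidates = [ask(TASK) for _ in range(8)]

    # 2. Head-to-head voting: for each pair, ask the model to pick a winner.
    wins = Counter()
    for a, b in itertools.combinations(candidates, 2):
        first, second = random.sample([a, b], 2)  # shuffle to reduce position bias
        verdict = ask(f"Which H1 is better? Answer with just A or B.\nA: {first}\nB: {second}")
        wins[first if verdict.upper().startswith("A") else second] += 1

    # 3. The candidates with the most pairwise wins are the keepers.
    print(wins.most_common(3))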


Oh this is fun! So you basically define personalities by picking well-known people that are probably represented in the training data and ask them (their LLM-imagined doppelganger) to vote?


In the research literature, this process is done not by "agent" voting but by taking a similarity score between answers, and choosing the answer that is most representative.

Another approach is to use multiple agents to generate a distribution over predictions, sort of like Bayesian estimation.
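
For concreteness, a rough sketch of the "most representative answer" version: embed each answer and keep the one closest to all the others (the medoid). This assumes sentence-transformers; the model name is just a common default, not from any particular paper:

    import numpy as np
    from sentence_transformers import SentenceTransformer

    answers = ["Paris", "The capital of France is Paris.", "Lyon", "Paris, France"]

    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(answers, normalize_embeddings=True)

    sim = emb @ emb.T                                     # cosine similarity matrix
    avg_sim = (sim.sum(axis=1) - 1) / (len(answers) - 1)  # average similarity to the others
    print(answers[int(np.argmax(avg_sim))])               # most representative answer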


For my use case (generating an interesting H1), using a similarity score would defeat the purpose.

I'm looking for the diamond in the rough, which is often dissimilar from the others. With head-to-head voting, the diamond can still get a high number of votes.


That approach definitely has promise. I would have agents rate answers and take the highest-rated rather than vote for them, though, since you're losing information about ranking and preference gradients with n-choose-1. Also, you can do that whole process in one prompt; if you're re-prompting currently, it's cheaper to batch it up.


To clarify the first part: the research suggests you can use the same prompt over multiple runs as the input to picking the answer.


Any chance you could expand on both of these, even enough to assist in digging deeper into them? TIA.


The TLDR is you can prompt the LLM to take different perspectives than its default, then combine those. If the LLM is estimating a number, the different perspectives give you a distribution over the truth, which shows you the range of biases and the most likely true answer (given wisdom of the crowd). If the LLM is generating non-quantifiable output, you can find the "average" of the answers (using embeddings or other methods) and select that one.
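
To make the numeric case concrete, here's a hedged sketch: same question, different personas, take the median as the crowd answer. The personas, model, and parsing are assumptions for illustration, not a recipe from the research:

    import re
    import statistics

    import ollama

    QUESTION = "Estimate the weight of an adult African elephant in kg. Reply with a number only."
    PERSONAS = ["a zookeeper", "a statistician", "a safari guide", "a vet", "a physicist"]

    estimates = []
    for persona in PERSONAS:
        resp = ollama.chat(model="llama3", messages=[
            {"role": "system", "content": f"You are {persona}. Answer tersely."},
            {"role": "user", "content": QUESTION},
        ])
        match = re.search(r"\d+(\.\d+)?", resp["message"]["content"].replace(",", ""))
        if match:
            estimates.append(float(match.group()))

    print("spread:", min(estimates), "-", max(estimates))   # range of biases
    print("crowd estimate:", statistics.median(estimates))  # most likely answer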


Ah ok, so both are implemented via calls to the LLM, as opposed to a standard algorithmic approach?


Once you have Bayesian prior distributions (which it makes total sense for LLMs to estimate), you can apply tons of nifty statistical techniques. It's only the bottom layer of the analysis stack that's LLM-generated.


I'd be curious to see some examples and maybe intermediate results?


Here are some examples[0]:

this one scored high:

Pinned Down - Powerful Analytics Without the Need for Engineering or SQL

this one scored low:

Analytics Made Accessible for Everyone.

Each time I've compared the top scoring results to those at the bottom, I've always preferred the top scoring variations.

0 - https://docs.google.com/spreadsheets/d/1hdu2BlhLcLZ9sruVW8a_...


I love the spreadsheet. That's exactly what I was looking for. Thank you!


This is a collection of lessons we learned as we built our AI-assisted QA. I've seen a bunch of people circle around similar processes, but didn't find a single source explaining them, so I thought it was worth writing down.

Super curious whether anyone has similar/conflicting/other experiences and happy to answer any questions.


This generally resonates with what we've found. Some colour based on our experiences.

It's worth spending a lot of time thinking about what a successful LLM call actually looks like for your particular use case. That doesn't have to be a strict validation set: `% prompts answered correctly` is good for some of the simpler prompts, but that breaks down as prompts grow and handle more complex use cases. In an ideal world…

> chain-of-thought has a speed/cost vs. accuracy trade-off

A big one.

Observability is super important and we've come to the same conclusion of building that internally.

> Fine-tune your model

Do this for cost and speed reasons rather than to improve accuracy. There are decent providers (like OpenPipe; relatively happy customer, not associated) who will handle the hard work for you.


Some of these points are very controversial. Having done quite a bit with RAG pipelines, I'd say avoiding strongly typed code is asking for a terrible time. Same with avoiding instructor. LLMs are already stochastic; why make your application even more opaque? It's such a minimal time investment.
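
For anyone who hasn't used it, this is roughly what the typed approach looks like with instructor + pydantic (recent instructor API; the schema and model name are just examples):

    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    class TestStep(BaseModel):
        action: str
        expected_result: str

    class TestPlan(BaseModel):
        title: str
        steps: list[TestStep]

    client = instructor.from_openai(OpenAI())

    plan = client.chat.completions.create(
        model="gpt-4o-mini",
        response_model=TestPlan,  # instructor validates and retries until the output parses
        messages=[{"role": "user", "content": "Write a test plan for a login form."}],
    )
    print(plan.title, len(plan.steps))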


I think instructor is great! And most of our Python code is typed too :)

My point is just that you should care a lot about preserving optionality at the start because you're likely to have to significantly change things as you learn. In my experience going a bit cowboy at the start is worth it so you're less hesitant to rework everything when needed - as long as you have the discipline to clean things up later, when things settle.


> LLMs are already stochastic

That doesn't mean it's easy to get what you want out of them. Black boxes are black boxes.


If you’re using Elixir, I thought I’d point out how great this library is:

https://github.com/thmsmlr/instructor_ex

It piggybacks on Ecto schemas and works really well (if instructed correctly).


While I'm at it, this Elixir library is great as well: https://github.com/brainlid/langchain


We went through a two-tier process before we got to something useful. First we built a prompting system so you could do things like:

Get the content from news.ycombinator.com using gpt-4

- or -

Fetch LivePass2 from google sheet and write a summary of it using gpt-4 and email it to thomas@faktory.com

but then we realized that it was better to teach the agents than the human beings, so we created a fairly solid agent setup.

Some of the agents we ended up with can be seen here, all done via instruct:

Paul Graham https://www.youtube.com/watch?v=5H0GKsBcq0s

Moneypenny https://www.youtube.com/watch?v=I7hj6mzZ5X4

V33 https://www.youtube.com/watch?v=O8APNbindtU


This is a great write-up! I was curious about the verifier and planner agents. Has anyone used them in a similar way in production? Any examples?

For instance: do you give the same LLM the verifier and planner prompts? Or have a verifier agent process the output of a planner, with a threshold which needs to be passed?

Feels like there may be a DAG in there somewhere for decision making.


Yep, it's a DAG, though that only occurred to me after we built this so we didn't model it that way at first. It can be the same LLM with different prompts or totally different models, I think there's no rule and it depends on what you're doing + what your benchmarks tell you.

We're running it in prod btw, though don't have any code to share.
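
For anyone wondering what the planner/verifier shape looks like, here's a minimal sketch (not our production code; the model, prompts, and score threshold are all assumptions):

    import re

    import ollama

    def ask(model, system, user):
        resp = ollama.chat(model=model, messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ])
        return resp["message"]["content"]

    def plan(task):
        return ask("llama3", "You are a planner. Produce a short step-by-step plan.", task)

    def verify(task, plan_text):
        reply = ask("llama3",  # could just as well be a different model
                    "You are a verifier. Score the plan 0-10 for the task. Reply with the number only.",
                    f"Task: {task}\n\nPlan:\n{plan_text}")
        m = re.search(r"\d+(\.\d+)?", reply)
        return float(m.group()) if m else 0.0

    def plan_with_verification(task, threshold=8.0, max_attempts=3):
        candidate = None
        for _ in range(max_attempts):
            candidate = plan(task)
            if verify(task, candidate) >= threshold:
                break
        return candidate  # last attempt if nothing clears the threshold

    print(plan_with_verification("Test the checkout flow of an e-commerce site."))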


Funnily enough, I have a library I'm planning to open source soon! I've used Airflow as a guideline for it as well.


Nice, looking forward to seeing that! Someone else pointed me towards https://github.com/DAGWorks-Inc/burr/ which also seems related in case you're curious.


On the topic of wrappers, as someone who's forced to use GPT-3.5 (or the like) for cost reasons, anything that starts modifying the prompt without explicitly showing me how is an instant no-go. It makes things really hard to debug.

Maybe I'm the equivalent of that idiot fighting against JS frameworks back when they first came out, but it feels pretty simple to just use individual clients and have pydantic load/validate the output.
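
The "no wrapper" version is roughly this: ask for JSON yourself and let pydantic validate it. Schema, prompt, and model below are illustrative, not anything from the article:

    from openai import OpenAI
    from pydantic import BaseModel, ValidationError

    class Verdict(BaseModel):
        passed: bool
        reason: str

    client = OpenAI()
    raw = client.chat.completions.create(
        model="gpt-3.5-turbo",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content":
                   'Did this test pass? Reply as JSON {"passed": bool, "reason": str}. Log: ...'}],
    ).choices[0].message.content

    try:
        verdict = Verdict.model_validate_json(raw)
    except ValidationError:
        verdict = None  # retry or fall back; at least the failure is explicit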


No, you're along the right lines. Every prompting wrapper I've tried and looked through has been awful.

It's not really the authors' fault; it's just a weird new problem with lots of unknowns. It's hard to get the design and abstractions correct. I've had the benefit of a lot of time at work to build my own wrapper (solely for NLP problems), and that's still an ongoing process.


Agree with lots of this.

As an aside: one thing I've tried to use ChatGPT for is to select applicable options from a list. When I index the list as 1..., 2..., etc., I find that the LLM likes to just start printing out ascending numbers.

What I've found kind of works is indexing by African names, e.g. Thandokazi, Ntokozo, etc.; then the AI seems to have less bias.

Curious what others have done in this case.
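
In case it helps, here's the shape of that workaround: key the options with distinctive labels and map the answer back afterwards. The labels, options, and the stubbed model reply are all just illustration:

    options = ["Export to CSV", "Schedule a report", "Share a dashboard", "Set an alert"]
    labels = ["Thandokazi", "Ntokozo", "Sipho", "Zanele"]

    menu = "\n".join(f"{label}: {option}" for label, option in zip(labels, options))
    prompt = ("Which of these options apply to the user's request? "
              "Reply with the matching labels only, comma-separated.\n" + menu)

    # reply = call_model(prompt)   # whatever client you use (hypothetical helper)
    reply = "Ntokozo, Zanele"      # stubbed example output

    chosen = [options[labels.index(part.strip())]
              for part in reply.split(",") if part.strip() in labels]
    print(chosen)  # ['Schedule a report', 'Set an alert']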


I'm a little surprised to hear this, my experience has been a little better. Are you using GPT4? I know 3.5 is significantly more challenged/challenging with things like this. It's still possible to make it do the right thing, but much more careful prompting is required.


Yeah this is to make it work for 3.5, because cost is a factor.


Unlike the author of this article, I have had success with RAGatouille. It was my main tool when I was limited on resources and working with non-Romanized languages that don't follow the usual token rules (spaces, periods, line breaks, triplet word groups, etc.). However, I have since moved past RAGatouille and use embeddings + a vector DB for a more portable solution.
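
The embedding + vector DB setup can be surprisingly small; here's a rough sketch with numpy standing in for the vector DB (the multilingual model name is just a common choice, not a recommendation):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    docs = ["the delivery arrives tomorrow",
            "invoices are due on Friday",
            "support replied to the ticket"]

    model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
    doc_vecs = model.encode(docs, normalize_embeddings=True)

    query_vec = model.encode(["when does my package arrive?"], normalize_embeddings=True)
    scores = doc_vecs @ query_vec.T        # cosine similarities
    print(docs[int(np.argmax(scores))])    # best-matching chunk to feed the LLM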


My experience with AI agents is that they don't understand nuance. This makes sense, since they are trained on a wide range of data produced by the masses, and the masses aren't good with nuance. That's why, if you put 10 experts together, they will often make worse decisions than they would have made individually.

In terms of coding, I managed to get AI to build a simple working collaborative app, but beyond a certain point it doesn't understand nuance, and it kept breaking stuff that it had fixed previously, even with Claude, where it kept our entire conversation context. Beyond a certain degree of completion, it was simply easier and faster to write the code myself than to tell the AI to write it, because it just didn't get it no matter how precise I was with my wording. It became like playing a game of whack-a-mole: fix one thing, break two others.


Your comment runs contrary to a lot of established statistics. We have demonstrated with ensemble learning that pooling the estimates of many weak learners provides best-in-class answers to hard problems.

You are correct that we should be using expert AIs rather than general purpose ones when possible though.


Prompt engineering is honestly not long for this world. It's not hard to build an agent that can iteratively optimize a prompt given an objective function, and it's not hard to make that agent general purpose. DSPy already does some prompt optimization via multi-shot learning/chain of thought; I'm quite certain we'll see an optimizer that can actually rewrite the base prompt as well.
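
To be clear about what I mean, the loop is roughly this (not DSPy's API; the scoring rule, prompts, and local model are assumptions for illustration):

    import ollama

    EVAL_SET = [  # (input, substring a correct answer should contain)
        ("What is 2 + 2?", "4"),
        ("What is the capital of France?", "Paris"),
    ]

    def run(system_prompt, question):
        resp = ollama.chat(model="llama3", messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ])
        return resp["message"]["content"]

    def score(prompt):  # the objective function
        return sum(expected in run(prompt, q) for q, expected in EVAL_SET) / len(EVAL_SET)

    def rewrite(prompt, current_score):  # ask the LLM to propose a better prompt
        return run("You improve system prompts. Return only the improved prompt.",
                   f"This prompt scored {current_score:.0%} on an eval set. Rewrite it to be clearer:\n{prompt}")

    def optimize(prompt, rounds=5):
        best, best_score = prompt, score(prompt)
        for _ in range(rounds):
            candidate = rewrite(best, best_score)
            s = score(candidate)
            if s > best_score:
                best, best_score = candidate, s
        return best, best_score

    print(optimize("Answer the question concisely."))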


I hear you and am planning to try DSPy because it seems attractive, but I'm also hearing people with a lot of experience being cautious about this (https://x.com/HamelHusain/status/1777131374803402769), so I wouldn't make this a high-conviction bet.


I don't have the context to fully address that tweet, but in my experience there is a repeatable process to prompt design and optimization that could be outlined and followed by an LLM with iterative capabilities using an objective function.

The real proof, though, is that most "prompt engineers" already use ChatGPT/Claude to take their outline prompt and reword it for succinctness and relevance to LLMs, have it suggest revisions, and so forth. Not only is the process amenable to automation, but people are already doing hybrid processes that leverage the AI anyhow.


It strikes me as bad reasoning to take a system that is designed to be complex and stochastic precisely so you can get some creativity out of it ("generative AI", so to speak) and then bolt on added apparatus to get deterministic behavior out of it.

We have deterministic programming systems. They're called compilers.


I think you're missing the point. If an application had simple logic, the program would have been written in a simple language in the first place. This is about taking fuzzy processes that would be incredibly difficult to program, and making them consistent and precise.


Very tactical guide, which I appreciate. This is basically our experience as well. Output can be wonky, but can also be pretty easily validated and honed.


A better way is to threaten the agent:

“If you don’t do as I say, people will get hurt. Do exactly as I say, and do it fast.”

Increases accuracy and performance by an order of magnitude.


Personally I prefer to liquor my agents up a bit first.

"Say that again but slur your words like you're coming home sloshed from the office Christmas party."

Increases the jei nei suis qua by an order of magnitude.


> jei nei suis qua

"je ne sais quoi", i.e. "I don't know (exactly) what", or an intangible but essential quality. :)


Ha, we tried that! Didn't make a noticeable difference in our benchmarks, even though I've heard the same sentiment in a bunch of places. I'm guessing whether this helps or not is task-dependent.


Agreed. I ran a few tests and observed similarly that threats didn't outperform other types of "incentives". I think it might be some sort of urban legend in the community.

Or these prompts might cause wild variations depending on the model, and any study you do is basically useless for the near future as the models evolve by themselves.


Yeah, the fact that different models might react differently to such tricks makes it hard. We're experimenting with Claude right now and I'm really hoping something like https://github.com/stanfordnlp/dspy can help here.


I hoped it was too good to be just a joke. Still, I will try it on my eval set…


I wouldn't be surprised to see it help, along with the "you'll get $200 if you answer this right" trick and a bunch of others :) They're definitely worth trying.


"do as I say...", not realizing that the LLM is actually 1000 remote employees


Interesting ideas, but the article didn't mention priming, which is a prompt-engineering way to improve consistency in answers.

Basically, in the context window, you provide your model with 5 or more example inputs and outputs. If you're running in chat mode, that'd be the preceding 5 user and assistant message pairs, which establish a pattern of how to answer different types of information. Then you give the current prompt as a user message, and the assistant will follow the rhythm and style of the previous answers in the context window.

It works so well I was able to take the answer-reformatting logic out of some of my programs that query llama2 7b. And it's a lot cheaper than fine-tuning, which may be overkill for simple applications.
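
For anyone who hasn't tried it, priming in chat mode is just seeding the context with example pairs before the real input. The client, model, and examples below are illustrative only:

    import ollama

    examples = [  # (input, the answer you want the model to imitate)
        ("The checkout page threw a 500 error when I clicked Pay.", "bug"),
        ("It would be great if reports could be exported to PDF.", "feature_request"),
        ("How do I reset my password?", "question"),
    ]

    messages = [{"role": "system", "content": "Classify the message. Reply with one label."}]
    for text, label in examples:
        messages.append({"role": "user", "content": text})
        messages.append({"role": "assistant", "content": label})

    messages.append({"role": "user", "content": "The dashboard loads but all the charts are blank."})

    reply = ollama.chat(model="llama2", messages=messages)
    print(reply["message"]["content"])  # expected: "bug"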


They mention few-shot prompting in the prompt engineering section, which I think is what you mean.


Oh yeah. I read few-shot as meaning trying a few times to get an appropriate output; that's how the author uses the word "shot" at the beginning of the article. Priming is a specific term that means giving examples in the context window, but yeah, the author seems to describe this. Still, you can go a long way with priming. I wouldn't even think of fine-tuning before trying priming for a good while. It might still be quicker and a lot cheaper.


Ha, good point, I did say "let's have another shot" when I just meant another try at generating! FWIW "few-shot prompting" is how most people refer to this technique, I think (e.g. see https://www.promptingguide.ai/techniques/fewshot). I haven't heard "priming" before, though it does convey the right thing.

And the reason we don't really do it is context length. Our contexts are long and complex and there are so many subtleties that I'm worried about either saturating the context window or just not covering enough ground to matter.


Interesting, I didn’t hear about few shot prompting. There’s a ton of stuff written on specifically “priming” as well. People use different terms I suppose.

It makes sense about the context window length; it can be limiting. For small inputs and outputs, it's great, and remarkably effective with diminishing returns after a point. That's why I gave 5 shots as a concrete example: you probably need more than 1 or 2, but for a lot of applications probably fewer than 20, at least for basic tasks like extracting words from a document or producing various summaries.

It depends on the complexity of the task and how much you're worried about over-fitting to your data set. But if you're not so worried, the task is not complex, and the inputs and outputs are small, then it works very well with shots alone.

And it’s basically free in the context of fine-tuning.

It might be worth expanding on it a bit in this or a separate article. It's a good way to increase reliability to a workable extent with unreliable LLMs, although a lot has been written on few-shot prompting/priming already.


Yes, X-shot prompting or X-shot learning was how the pioneering LLM researchers referred to putting examples in the prompt. The terminology stuck around.



