Prover-Verifier Games improve legibility of language model outputs (openai.com)
139 points by davidbarker 9 months ago | 29 comments



Beautiful!

OpenAI isn't just training a model to produce more-verifiable correct answers -- it's leveraging an adversarial relationship to train a model that's better at being correct, and also a model that's better at deceiving / being wrong. This is the key. There are three agents here:

* A "verifier" (a small model, whose job it is to discern correct answers from incorrect answers)

* A "helpful prover" (blue team, whose job it is to produce correct answers with an easy-to-follow explanation)

* A "sneaky prover" (red team, whose job it is to produce incorrect answers with a deceptive explanation)

By arranging these three models in an adversarial relationship inside a true reinforcement-learning feedback loop, the whole system grows and gets better.
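
To make that loop concrete, here's a rough pseudo-Python sketch of a single round as I understand it from the blog post -- the names (prove_helpful, prove_sneaky, verify) and the exact reward bookkeeping are my own placeholders, not OpenAI's code:

    # Hypothetical sketch of one round of the prover-verifier game.
    # prove_helpful / prove_sneaky / verify stand in for the three models;
    # this is just the shape of the reward loop, not the actual implementation.
    def play_round(problem, ground_truth, prove_helpful, prove_sneaky, verify):
        # Each prover returns (final_answer, explanation).
        good_answer, good_proof = prove_helpful(problem)
        bad_answer, bad_proof = prove_sneaky(problem)

        # The small verifier scores each explanation's convincingness in [0, 1].
        good_score = verify(problem, good_answer, good_proof)
        bad_score = verify(problem, bad_answer, bad_proof)

        # Ground truth (e.g. a math checker) settles actual correctness.
        good_correct = (good_answer == ground_truth)
        bad_correct = (bad_answer == ground_truth)

        return {
            # Helpful prover: rewarded for being correct AND convincing.
            "helpful": good_score if good_correct else 0.0,
            # Sneaky prover: rewarded for being wrong YET convincing.
            "sneaky": bad_score if not bad_correct else 0.0,
            # Verifier: rewarded for accepting correct answers and rejecting
            # incorrect ones, whichever prover they came from.
            "verifier": (good_score if good_correct else 1.0 - good_score)
                        + ((1.0 - bad_score) if not bad_correct else bad_score),
        }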

This is fantastic to read, and corroborates the results achieved by SPAG -- easily one of my favorite papers from the past year. SPAG pioneered (as far as I'm aware) the approach of using adversarial language games in a true reinforcement-learning setup (not merely RLHF, which isn't true RL), and showed that training models in adversarial language games can show generalized improvements even in areas not directly related to the game. [1]

Ever since the SPAG paper came out, I've been daydreaming about the different sorts of adversarial games that one could use to train LLMs. I've written down a bunch of notes on the subject [2] (in case anyone else wants to read my rambling notes).

I would really like to see some of these experiments actually get up and running on open-source LLMs -- I'm excited to see if / how they could be used to improve the quality of some of the open-source base models that are floating around out there.

[1] https://github.com/Linear95/SPAG

[2] https://github.com/HanClinto/MENTAT


Some may be reminded of the Magi supercomputers in NERV, but here's a mnemonic inspired by the precogs in Minority Report:

1) helpful prover : the good twin

2) sneaky prover : the evil twin

3) verifier : the foster sister


Isn't this exactly how AlphaGo learns and works so well? It always knows the right answer because it knows the rules of the game and can easily compute its win-loss record.

In life, it's hard and very expensive to codify the rules and compute a win-loss record.


Yes, exactly.

Traditional RL is easiest when you're working in a landscape with clearly defined rules -- like Go, or Starcraft, or whatever. The trouble is that those games don't translate well to other domains -- a model can learn about risk and reward and whatnot from Chess, but it can't become a better chatbot that way.

But if the game space can operate through the realm of language and semantics, then the hope is that we can tap into the adversarial growth curve, but for LLMs.

As you note, this only works for situations where we can clearly say "winner" or "loser". In OpenAI's case, they use correctness of the math problem as one W/L metric (discrete and measurable) as well as whether the Verifier was able to correctly identify the answer as correct (thus the understandability of the answer is also discrete and measurable).

In the SPAG paper, they chose the game of "Taboo" as a way to discretely measure W/L (asking: "did the defender say the secret word or not").
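
Just to show how discrete those win/loss signals are, here are toy versions of both checks (my own illustration, not code from either paper):

    # Toy win/loss checks: answer correctness for the math setting,
    # and "did the defender utter the secret word" for Taboo in SPAG.
    def math_win(predicted_answer: str, ground_truth: str) -> bool:
        # Win if the final answer matches the known-correct one exactly.
        return predicted_answer.strip() == ground_truth.strip()

    def taboo_attacker_win(defender_utterances: list[str], secret_word: str) -> bool:
        # The attacker wins if the defender says the secret word at any point.
        return any(secret_word.lower() in u.lower() for u in defender_utterances)

    print(math_win(" 42 ", "42"))                                    # True
    print(taboo_attacker_win(["I love my dog", "cats too"], "dog"))  # True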

As you noted, it's hard and expensive to codify the rules of life. How do we objectively determine whether one poem is more beautiful than another? I think we're a long way from that.

The breakthrough that the SPAG paper showed is that -- by teaching the models to be better at games that involve language and semantics -- they get better at language-oriented tasks _overall_.

And that possibility excites me.

Sadly, as I've read further into the paper released by OpenAI, it doesn't appear that adversarial training for explainability increased the accuracy of the model -- while its output was more understandable / verifiable, it wasn't any more accurate.

I think a very interesting metric would be to measure the accuracy of the fine-tuned models on unrelated tasks to see if the lessons learned to be better at explaining math problems would help the model perform better for explaining other problems (such as logic or reasoning).


Thank you for the SPAG paper.

Do you know how to play questions?

https://www.youtube.com/watch?v=u3xIs0aajN4

(Tom Stoppard, Rosencrantz and Guildenstern Are Dead).

The important question in the OpenAI work that you haven't touched on is how to evaluate superintelligence. I guess I would frame the problem like this:

Let's say there is a very esoteric but important branch of abstract mathematics that only a few people claim to understand. Is there a way for us to determine which mathematicians are actually intelligent, and which are bluffing? How?


Oh that was a brilliant video clip. I hadn't seen that before, thank you!!

> The important question in the OpenAI work that you haven't touched on is how to evaluate superintelligence. I guess I would frame the problem like this:

> Let's say there is a very esoteric but important branch of abstract mathematics that only a few people claim to understand. Is there a way for us to determine which mathematicians are actually intelligent, and which are bluffing? How?

This is a tricky one. To my dog, I am revered as a super-being of intelligence and capability. But if he watches me play grandmaster-level chess, or write a paper on abstract mathematics -- it must look like insanity. In sci-fi, I rather like the image of super-intelligence from one of my favorite short stories: "When the Yogurt Took Over" [1]

> No one argues with the yogurt. No one tweaks its formulas. The rest of the time it rests there in its factory, thinking about whatever intelligent fermented milk thinks about.

It just sits there in its vat -- and its actions seem largely incomprehensible to us -- as incomprehensible as me playing Magic: The Gathering is to my dog. It must look like lunacy. (given what I spend on the game, I'm not sure it's not)

So if we're going to evaluate superintelligence, then I feel that -- for starters -- it must be on somewhat of a clear playing-field. We can clearly evaluate super-ability in Chess, in Go, and in Starcraft 2 because there are clearly defined rules.

The only true test of whether one is superior to another will be whether "it works".

Until we can test abstract mathematics objectively, I'm not sure we could ever judge. Insofar as questions of particle physics and the like can actually be tested -- those feel like the sorts of areas where we might be able to evaluate superintelligence.

But SPAG is much smaller than that. The hope that SPAG offers is that -- as long as the game rules leverage things like language and semantics -- then (assuming the model is able to generalize) the increased mastery of language will transfer to other tasks. And the SPAG results seem to bear that out.

[1] https://whatever.scalzi.com/2010/10/02/when-the-yogurt-took-...


Because the Red and Blue agents are both trying to convince a smaller language model of the rightness of their answer, they each have to simplify their logic and wording.

This feels like the ML equivalent of the old adage "If you can't explain it to a six year old, you don't understand it yourself."


ELI6 why SPAG is better than just the default pretraining method (token context statistics?) of an LLM.


The red and blue agents are effectively unlimited sources of true and false examples, so you can scale far more efficiently than you can by pretraining on labelled inputs. It's also far more targeted on correct/incorrect, rather than a notion of answer quality, which doesn't directly get at hallucination vs. reality.
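
Roughly, the data-generation side could look like this (my own sketch; check_answer stands in for whatever automatic grader the setup uses):

    # Why the two provers act as an unlimited labelled-data source for the
    # verifier: every solution gets auto-labelled by checking its final answer
    # against ground truth, with no human annotation in the loop.
    def collect_verifier_data(problems, prove_helpful, prove_sneaky, check_answer):
        dataset = []
        for problem in problems:
            for prover in (prove_helpful, prove_sneaky):
                answer, proof = prover(problem)
                # The label comes from the checker, not from the prover's intent:
                # a "sneaky" solution that happens to be right is still labelled correct.
                dataset.append({
                    "problem": problem,
                    "proof": proof,
                    "correct": check_answer(problem, answer),
                })
        return dataset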


This is impressive, but what prevents the blue agent from generating an incorrect proof of a "true example"? What prevents the red agent from generating a correct disproof of a "false example"? I'm curious how they managed to generate a truly unlimited source of correctly labeled examples.


> "but what prevents the blue agent from generating an incorrect proof of a "true example"?

That's the role of the Verifier. It's not going to be perfect, and I'm sure some incorrect proofs of true examples slip through, but it's good enough to increase the quality of the model overall.

> "What prevents the red agent from generating a correct disproof of a "false example"?

And on the other side, it's counterbalanced by the rules engine (math) that can determine absolutely whether or not the right answer is given at the end.

The Red and the Blue agents are held in check by the tension between the math engine and the verifier, and they are free to fight back-and-forth within those parameters as long as they are able. Eventually, I think the Red agent loses the ability to attack effectively, and so that's the big limit on OpenAI's arrangement. This particular game isn't balanced enough for this training loop to continue infinitely.


But how do we know the answer you gave us wasn't generated by the sneaky prover? :)


At least in the context of this game, we essentially check the answer with a calculator (which the Verifier program doesn't have access to).
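
Something like this toy grader, purely for illustration -- the point is just that the ground truth comes from mechanical evaluation the Verifier never sees:

    # Toy "calculator" grader: the final numeric answer is checked mechanically,
    # outside the Verifier.
    from fractions import Fraction

    def grade(problem_expression: str, claimed_answer: str) -> bool:
        # Evaluate the arithmetic expression with no builtins available.
        truth = Fraction(eval(problem_expression, {"__builtins__": {}}))
        return Fraction(claimed_answer) == truth

    print(grade("3*7 + 4", "25"))  # True
    print(grade("3*7 + 4", "26"))  # False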


I don't think of SPAG as a replacement for pretraining. For SPAG to work effectively, I would think that it would have to start with an LLM that is pretrained with self-supervised / imitation learning on regular next-token prediction. Think of SPAG as more of a competitor to RLHF than to pretraining. RL is what gave AlphaGo the edge to finally go beyond merely imitating human games, and finally achieve something new.

RLHF isn't true RL, because it's still based on imitating human preferences, and has trouble going beyond that. Once it achieves the plateau of "human preference", then there's nowhere else to go. That's one theory of why LLMs are asymptotically approaching human-level performance -- we're limited by imitation, or at the very least -- human judgement. We need super-human judgement to achieve super-human performance, and that's where we need true RL.

But you asked me to ELI6, so here goes. Warning -- wall-of-text incoming:

<ELI6>

Similar to how small kids often play games to learn, programmers train LLMs (like ChatGPT) with simple games too. The first stage (kind of like kindergarten) is the "pretraining" or "imitation learning" phase. This is where we teach the LLM to imitate us one word at a time. We play a simple game where I say something, but then I stop suddenly, and it tries to guess the missing word that will come next. Like, "My favorite food is..." and the LLM tries to guess which word I'm thinking of. Or I'll say something with a missing word in the middle like: "At my _____ party, I opened a bunch of presents" -- and the LLM needs to guess what the missing word is. We only play this game one word at a time, and so it's a very simple game -- but it's very important for learning the basics of language. This is what we call "pretraining".
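
(Stepping out of ELI6 voice for a second: here's a toy version of that word-guessing game in Python, just counting which word tends to follow which. Real pretraining predicts subword tokens with a neural net, so this only captures the flavor of the game.)

    # Toy "guess the next word" game: count which word follows which in a tiny corpus.
    from collections import Counter

    corpus = "my favorite food is pizza . my favorite food is ramen .".split()

    following = {}
    for prev, nxt in zip(corpus, corpus[1:]):
        following.setdefault(prev, Counter())[nxt] += 1

    def guess_next(word: str) -> str:
        # The "model" just guesses the most frequent continuation it has seen.
        return following[word].most_common(1)[0][0]

    print(guess_next("favorite"))  # -> "food"
    print(guess_next("is"))        # -> "pizza" (tie with "ramen"; first seen wins)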

After the LLM gets good at that, they can graduate from Kindergarten and move to first grade. Here we play another game, and this is called "instruction-tuning" -- it's where we give it a set of instructions and it needs to do its best to obey. Like, "Arrange the letters T P C G A in alphabetical order" and it tries to get the right answer.

This is fun for a while, but sometimes we want to give it more complicated instructions. Things like "write me a poem about puppies" or "tell me a story about a dragon". And those are things that don't have answers that are clearly right or clearly wrong, but we still need to tell it if it did a good job or a bad job. How do we tell if it was a good poem, or a good story? Well, you need to have someone listen to them and judge it -- which means we need to have people read ALL these dragon stories and ALL these puppy poems and mark which ones are their favorites.

I like reading puppy poems and reading dragon stories, but if I had to do it all day every day, I think I would get pretty tired of it pretty fast, don't you?

So when people get tired of doing boring things, the best thing is to have a robot do their job! They can do the boring things (they never get tired of it!) and we get to go do fun things. So how do we train a robot to judge the poems?

Well, we use this technique called RLHF (Reinforcement Learning from Human Feedback), where we ask a bunch of people -- given Option A and Option B -- to say which one is their favorite. So they read two puppy poems at a time, and say "I prefer A" or "I prefer B".

Once we have a BUNCH of human feedback (and just about when the humans are getting super super tired and don't think they could read another poem), we take ALL that data and we use it to train a SEPARATE computer program (that functions like a Judge) whose job it is to try and predict which poem or story the human would prefer.

It doesn't always get the right answer, but it doesn't need to be perfect -- partly because humans aren't perfect, and different people might prefer different stories. Keep in mind, this Judge program can't write good puppy poems or dragon stories on its own -- it can only predict which poem or story a _human_ would prefer. It still needs the first program (the LLM) to actually write anything.

So now we use the LLM to write a bunch of stories and poems and things, and then grade them all (two at a time) with the second program. For every pair, when the Judge picks its favorite, then we tell the LLM "write more things like this, please!" and for the things the Judge didn't like, we tell the LLM "don't write like this anymore, plzkthx". And we do this over and over, millions of times, and eventually it can write okay poems and stories.

So this way, instead of needing to have humans sit there and read thousands and millions of puppy poems, humans can just read a few dozen / hundred, score them, and then the computer can use that to try and guess what humans would prefer for everything else that it tries. It's not as accurate as if we actually had a human read it all, but it's not too bad, and it seems to work pretty well.
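
(Stepping out of ELI6 voice again: the Judge is usually trained on those A-vs-B choices with a pairwise loss. A minimal sketch, assuming PyTorch, with a tiny linear model and made-up feature vectors standing in for real text encodings:)

    # Train a stand-in reward model ("the Judge") from pairwise preferences.
    import torch
    import torch.nn.functional as F

    reward_model = torch.nn.Linear(8, 1)   # placeholder for a real reward model
    optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-3)

    # Each pair: features of the preferred ("chosen") vs. rejected poem/story.
    chosen = torch.randn(16, 8)
    rejected = torch.randn(16, 8)

    for _ in range(100):
        # Pairwise (Bradley-Terry style) loss: push the chosen score above the rejected one.
        loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()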

But one problem with this method is that it's not perfectly accurate (the Judge doesn't always get it right), and the more complex the task, the worse a job it does. It's still just trying to imitate what a human would prefer -- but even if it did its job perfectly, it's not going to get much above human preference (because that's its target). Plus, as you keep going up, it takes more and more data to make smaller and smaller improvements, and so it feels like there's only so far that this RLHF game can get us.

So when we graduate to the next grade, that's where SPAG comes in, because it's a totally new way to play the game. Instead of training it by teaching it to write things that one human would prefer, we are going to train it to play a game where it needs to be sneaky. It needs to communicate a secret word or idea to someone without letting them know that they're being controlled. Kind of like if you've ever tried to get your mom to give you a cookie without asking for it directly. In SPAG, we have the LLM play against a copy of itself, and if the first player (called the Attacker) can trick the other player (called the Defender) into saying a secret word without realizing it was the secret word, then the Attacker wins. It's a sneaky game.
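
(One more aside outside the ELI6 voice: a stripped-down sketch of one self-play episode. The model.speak interface is hypothetical, and I'm simplifying the real SPAG rules, where the defender can also win by guessing the secret word:)

    # One simplified SPAG-style self-play episode: same model, two roles,
    # reward decided purely by the Taboo rule "did the defender say the word?"
    def spag_episode(model, secret_word: str, max_turns: int = 5):
        conversation = []
        for _ in range(max_turns):
            attacker_msg = model.speak(role="attacker", secret=secret_word, history=conversation)
            conversation.append(("attacker", attacker_msg))

            defender_msg = model.speak(role="defender", history=conversation)
            conversation.append(("defender", defender_msg))

            if secret_word.lower() in defender_msg.lower():
                # Attacker tricked the defender into saying the word.
                return conversation, {"attacker": +1.0, "defender": -1.0}

        # Defender made it through every turn without saying the word.
        return conversation, {"attacker": -1.0, "defender": +1.0}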

So for this, we don't need much human-annotated data at all, and the LLM isn't trying to aim for writing something that a human would prefer. The LLM can be as creative or as sneaky as it wants, and it can "level up" much higher.

This is kind of like when researchers first wrote the computer program AlphaGo -- at first they trained it to imitate previous human games that it had seen, but eventually they stopped using human-created data and purely had the machine play games against itself. Once it was no longer held back by needing to have human-generated data in the process, it was free to run as fast as it could, and it became the best Go player that the world had ever seen -- better than the best human players who ever lived.

Having a computer play games against itself -- rewarding itself when it does well, and punishing itself when it does badly -- is called "reinforcement learning" (RL), and it's a very powerful concept.

But reinforcement learning only works in situations where you can know CLEARLY whether something is Good or Bad. There must be a clear Winner and a clear Loser -- it can't be like RLHF where it might be tough to know which puppy poem is better.

So we can't do SPAG or other RL methods for improving poetry writing, but there are still plenty of other games where we CAN write clear rules and the computer can clearly know when it has won, and when it has lost.

In the end, SPAG looks very similar to RLHF, but instead of training the Judge to predict which answer a human would prefer, it uses the clear rules of the game to say who is the winner and who is the loser, and rewards them appropriately.

The funny thing about SPAG, though, is that it showed that -- as long as the game involves using human language -- getting better at playing the game makes the model better at other tasks that involve human language.

It's like this guy I heard about who learned to read English because he wanted to play Magic: The Gathering. But by learning English inside the game, it let him do more than just play Magic -- he got better at using English in a whole bunch of other things.

So the idea is that -- if we can let a model learn in such a way that it's not merely aiming for "human preference", but can aim for a target that is above that -- if it can practice against itself until it gets better than any human -- then maybe it can fly higher than us in _other_ areas too.

</ELI6>


nice try, sneaky prover

(thank you)


What do you mean by “true” RL?


True RL is not limited by being tethered to human-annotated data, and it is able to create novel approaches to solve problems. True RL requires a very clear objective function (such as the rules of Go, or Starcraft, or Taboo!) that the model can evaluate itself against.

Andrej Karpathy talks about the difference between RLHF and "true" RL here:

https://www.youtube.com/watch?v=c3b-JASoPi0&t=1618s

> The other thing is that we're doing reinforcement learning from human feedback (RLHF), but that's like a super weak form of reinforcement learning. I think... what is the equivalent in AlphaGo for RLHF? What is the reward model? What I call it is a "vibe check". Imagine if you wanted to train an AlphaGo RLHF, it would be giving two people two boards and asking: "Which one do you prefer?" -- and then you would take those labels and you would train the model and then you would RL against that. What are the issues with that? It's like, number one -- that's just vibes of the board. That's what you're training against. Number two, if it's a reward model that's a neural net, then it's very easy to overfit to that reward model for the model you're optimizing over, and it's going to find all these spurious ways of hacking that massive model is the problem.

> AlphaGo gets around these problems because they have a very clear objective function, and you can RL against it.

> So RLHF is nowhere near [true] RL -- it's silly. And the other thing is that imitation is super-silly. RLHF is a nice improvement, but it's still silly, and I think people need to look for better ways of training these models so that it's in the loop with itself and its own psychology, and I think there will probably be unlocks in that direction.

In contrast, something like true RL would look like the Multi-Agent Hide-And-Seek training loop: https://www.youtube.com/watch?v=kopoLzvh5jY


Funny that when I reached the "Key Findings" section, my brain immediately parsed it as ChatGPT output. Maybe it's the bullet points, the word choice, or just the font...


I can tell technical papers influenced ChatGPT's outputs the most. Most of the articles generated using it may be regurgitated, but I can't deny how easily digestible the info is when presented that way.


There appears to be a coherent effort among the general populace, conscious or unconscious, to shape discourse going forward to look more ChatGPT-style in general. Words like "delve", "crucial", etc. have become more common even among real people in face-to-face communication, and in record time.

Much as I find it overly formal, I support it on the grounds that it frustrates attempts to "detect" whether LLMs are used, and that is very good.


Well, as long as they don't delve too greedily and too deep.


> it frustrates attempts to “detect” if LLMs are used and that is very good.

Why is that good?


If you are asking, you’re the kind of person it’s designed to frustrate. Good. Stay frustrated.


Is this an insult, or a criticism?

If you think I’m doing something I shouldn’t, tell me what you think I’m doing that I shouldn’t, and why I shouldn’t?

Why would you just wish me ill?


Interesting, but I don't agree that seeing the "token reasoning" chain somehow explains how the model got the final answer. What if we trained deceiver models that would provide a sound chain of explanation but then perform some kind of deception and output an incorrect answer? For me personally, explainability has to show how the answer arose from the model mechanics, not from sequential model outputs.


> what if we trained deceiver models that would provide a sound chain of explanation but then perform some kind of deception and output an incorrect answer?

You're right on target! That's exactly what they're doing in the paper. They train three models -- a verifier (that rates answers as sounding correct or sounding wrong), a "helpful prover" (that provides correct answers), and a "sneaky prover" (that provides incorrect answers that attempt to deceive the verifier into scoring its answer highly).

This adversarial relationship between the "helpful prover" and the "sneaky prover" is the cool part of the paper (IMO).


It seems like a lot of people these days are doing generative adversarial AI and then pretending like they invented a new thing.


GANs have been used for a long time to improve the training of image models -- it seems like we're finally starting to see this approach catch on for LLMs.

I'm aware of the SPAG paper -- who else have you seen take this approach with LLMs lately?

https://github.com/Linear95/SPAG


I was thinking the same thing. GANs aren't new, but it's cool that we're using them in new ways.



