> It acknowledges that these behaviours are unsafe when its own outputs are fed back to it.
This is the typical "Ah yes you are right, I made a mistake. Let's correct the thing...."-type hallucination or whatever you want to call it. Calling it power-seeking or deceptive behaviour I find overblown.
Question about the rule-based rewards (correctness and format) mentioned in the paper: Is the raw base model just expected to "stumble upon" a correct answer / correct format to get a reward and start the learning process? Are there any more details about the reward modelling?
When BF Skinner used to train his pigeons, he’d initially reinforce any tiny movement that at least went in the right direction. For the exact reasons you mentioned.
For example, instead of waiting for the pigeon to peck the lever directly (which it might not do for many hours), he’d give reinforcement if the pigeon so much as turned its head towards the lever. Over time, he’d raise the bar. Until, eventually, only clear lever pecks would receive reinforcement.
I don’t know if they’re doing something like that here, but it would be smart; a rough sketch of that kind of shaping schedule follows below.
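A minimal sketch in Python, assuming a generic scoring function and a made-up linear threshold schedule (nothing like this appears in the DeepSeek paper; it is just the pigeon analogy written as code):

```python
from typing import Callable

def shaping_reward(score_fn: Callable[[str], float],
                   output: str,
                   step: int,
                   total_steps: int) -> float:
    """Skinner-style shaping: only behaviour that clears a rising bar is reinforced.

    score_fn maps a model output to a rough quality score in [0, 1]
    (e.g. head turned toward the lever ~ 0.2, clean lever peck = 1.0).
    """
    bar = 0.1 + 0.9 * (step / total_steps)  # the bar rises from 0.1 to 1.0 over training
    score = score_fn(output)
    return score if score >= bar else 0.0
```

Early in training, almost any movement in the right direction earns reward; by the end, only a full "peck" does.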
Since intermediate steps of reasoning are hard to verify, they only reward final results. Yet that produces enough signal to yield more productive reasoning over time. In a way, when the pigeons are virtual, one can afford to have a lot more of them.
Yes and no. In their paper they say they trained two models. One is purely RL-based (R1-Zero), so that one is trained like you described, i.e. it has to stumble upon the correct answer. They found it to be good, but it has problems like repetition and language mixing.
The main R1 model was first finetuned with synthetic CoT data before going through RL IIUC.
The prompt in Table 1 makes it very likely that the model will use the correct format. The pretrained model is pretty good, so it only needs to stumble upon a correct answer every once in a while to start making progress. There are some additional details in the Shao et al., 2024 paper.
I was wondering the same thing. I feel there is too large a gap between a raw base model and a model that produces fully correct answers and follows a specific format. My guess is their rule-based reward system is more nuanced than just correctness and format.
Yeah, I find this part not clearly expressed either. My best guess is that it's not a simple binary correct/incorrect signal; rather, the reward is made up of multiple parts (e.g. format + correctness) and structured so that "close enough" answers still get some reward. From there I would expect that a base model might at least be able to "autocomplete" the format/style, at which point the RL machinery would kick in to tune it to properly obey the format and, once that's mastered, eventually correctness.
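If that guess is right, the reward might look roughly like the sketch below. To be clear, this is just my reading, not their implementation; the tag format, weights, and partial-credit rule are all assumptions:

```python
import re

def rule_based_reward(output: str, ground_truth: str) -> float:
    """Toy composite reward: format + correctness, with some partial credit.

    Assumes the prompt asks for reasoning in <think>...</think> followed by the
    final answer in <answer>...</answer>; the weights are made up.
    """
    reward = 0.0

    # Format component: did the model produce the expected tag structure at all?
    if re.search(r"<think>.*?</think>\s*<answer>.*?</answer>", output, re.DOTALL):
        reward += 0.2

    # Correctness component: compare the extracted final answer to the ground truth.
    m = re.search(r"<answer>(.*?)</answer>", output, re.DOTALL)
    answer = m.group(1).strip() if m else ""
    if answer == ground_truth.strip():
        reward += 1.0
    elif answer and ground_truth.strip() in answer:
        reward += 0.3  # "close enough": the right value buried in extra text

    return reward
```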
They did mention something about tuning an un-SFT'd base model being much slower than "warming it up" first with some existing reasoning traces.
My question: "Which waiter got second most total money from tips? create a jpeg with a bar chart of the total tips for each waiter."
Step 0 included a correct SQL query.
Step 1 included Python code with matplotlib, which failed because matplotlib was not in the allowed imports. The system then pivoted to another Python script, which printed a bar plot with ## characters.
Final answer: The waiter with the second most total tips is Corey Johnson
Bar Chart of Total Tips for Each Waiter:
Michael Watts ################################################## (5.67)
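For reference, that kind of text-only fallback is a few lines of Python. Here is a rough reconstruction of what the agent likely did; the function name is mine, and all figures except the 5.67 shown above are invented:

```python
def ascii_bar_chart(totals: dict[str, float], width: int = 50) -> str:
    """Render a horizontal bar chart with '#' characters, like the agent's fallback."""
    max_val = max(totals.values())
    name_width = max(len(name) for name in totals)
    lines = []
    for name, value in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        bar = "#" * max(1, round(width * value / max_val))
        lines.append(f"{name.ljust(name_width)} {bar} ({value:.2f})")
    return "\n".join(lines)

# Invented numbers, just to show the output shape.
print(ascii_bar_chart({"Michael Watts": 5.67, "Corey Johnson": 4.10, "Other": 2.30}))
```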