Yep. And tbh you probably don't even have to do this; the R1 paper found that just running SFT on the base model with a relatively small number of monolingual reasoning traces was enough for it to get the idea, and IIRC they didn't even bother selecting for language specifically in the RL training loop itself.


One of the authors here. Happy to answer any questions about our methods/results!


Please define an acronym the first time you use it in the body text. I had to scroll about 20% of the way through your article just to understand the title.


We updated the first paragraph to define the acronym. Thanks again for the feedback!


Great point! Thanks for the feedback.


Can you elaborate on this point:

“We discovered that meaningful performance improvements, as high as 10–15%, can be achieved with as few as 16 training examples.”

In particular, did you need to change the hyperparameters much, and did this limited recipe show different improvements for the larger vs smaller models? Also, how did you select these 16 examples?


No meaningful changes to the hyperparameters; we just changed the tasks per iteration to 16 and trained on the same first 16 training tasks each iteration.

We only tested this with the 14B model. You can see the run here:

https://wandb.ai/bradhilton/rl-experiments/runs/062

Performance peaked after 21 iterations at 45% accuracy instead of the final 59%, but that's still a significant increase from very few samples.


Thanks.


Any hypotheses on why the performance dropped suddenly while training?


Hi, other author here. I think the models converged on shallow/greedy strategies that improved performance up to a point, but are ultimately shortsighted, especially for harder puzzles.

Something interesting I noticed in the responses was that for shorter puzzles the model would make deductions, building up a set of additional "clues" for itself, before answering the questions. However, for harder puzzles with more clues it would often merely repeat all the given clues and then try to answer the questions directly.

Maybe some form of curriculum learning would help, starting with easier puzzles and progressing to more challenging ones.

Other ideas to explore include:

- Distilling responses from stronger models

- Encouraging exploration with entropy regularization or reward shaping

- Training from base models instead of instruct models, like DeepSeek-R1-Zero


Is my understanding here correct? Could this be the reason?

https://news.ycombinator.com/item?id=43287312


As for why they dropped suddenly, I don't really know. Sometimes models develop degenerate behaviors, but even when forking from the best checkpoint and lowering the learning rate or changing other hyperparameters, performance still drops. It's as if its fate had already been sealed many iterations earlier.


Can I just wholeheartedly congratulate you for having found a critical benchmark to evaluate LLMs. Either they achieve 100% accuracy in your game, or they cannot be considered trustworthy. I remain very confident that modules must be added to the available architectures to achieve the "strict 100%" result.


Do you have any other logic puzzles you could use to see if the performance generalises?


To be honest, I don't expect the performance to generalize to other task types with this specific training regime. If we had a panel of like 30 logic puzzles and cross-trained against all of them simultaneously, it might, though.

I think there's a lot of benefit to discovering a training regime that allows small specialized models to do extremely well in one narrow task; if we can figure out how to make small models that beat SOTA on a specific task and are cheap to train and run, that's in some ways a more useful outcome than a very large model that is good at many tasks (but is more expensive to run for each of them).


The question to me is whether you can call that deduction in that case. Isn't it just a type of pattern matching that fits this particular task?


Once the problem gets narrow enough, do you risk training a model that reinvents a straightforward classic algorithm at far higher cost?


Well, in this case there is a much more straightforward method: the same CP-SAT solver used to create the puzzles can also solve them directly. This is more of a fun experiment to see if we can train LLMs to solve these kinds of logical deduction problems.
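
To give a flavor of how direct that is, here's a minimal sketch with OR-Tools' CP-SAT on a toy three-house puzzle I made up for illustration (not our actual generator or puzzle format):

    from ortools.sat.python import cp_model

    model = cp_model.CpModel()

    # Toy puzzle: place the cat, dog, and fish in houses 0-2, one pet per house.
    cat = model.NewIntVar(0, 2, "cat")
    dog = model.NewIntVar(0, 2, "dog")
    fish = model.NewIntVar(0, 2, "fish")
    model.AddAllDifferent([cat, dog, fish])

    # Clue 1: the dog is not in the first house.
    model.Add(dog != 0)
    # Clue 2: the cat lives immediately to the left of the fish.
    model.Add(cat + 1 == fish)

    solver = cp_model.CpSolver()
    status = solver.Solve(model)
    if status in (cp_model.OPTIMAL, cp_model.FEASIBLE):
        print(solver.Value(cat), solver.Value(dog), solver.Value(fish))  # 0 2 1

The solver cracks this instantly, which is also part of what makes these puzzles a nice RL testbed: checking an answer is trivial even though producing one by reasoning in natural language isn't.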


OpenPipe | ML & Full-Stack Engineers | Full-time | Seattle, WA (ONSITE) | https://openpipe.ai/ | Highly competitive pay + equity

We've built the world's best fine-tuning platform in just over a year. First to launch self-service preference tuning, integrated evals/data prep/fine-tuning, and self-service learning from human feedback.

Our customers range from fast-growing startups to Fortune 500s. We're growing 40% MoM and have achieved this with a team of just 5 engineers, including the founders.

Seeking:

- ML Engineers: Help advance our fine-tuning capabilities. (We have thousands of datasets and evals; testing new ideas is really easy!)

- Systems Engineers: Scale our platform serving hundreds of millions of daily requests

- Full-stack Engineers: Build end-to-end features in TypeScript/Python

Ideal candidates:

- Strong programming skills (TypeScript/Python)

- Systems architecture expertise

- Self-starters (founder/founding engineer experience valued)

- Willing to relocate to Seattle

Highly competitive salary + equity. Well-funded with large customers. Perfect for future founders wanting startup experience or engineers breaking into AI/ML with immediate impact.

Email: kyle@openpipe.ai. Include something impressive you've built.


Lots of folks working on open-source reasoning models trained with reinforcement learning right now. The best one atm appears to be Alibaba's 32B-parameter QwQ: https://qwenlm.github.io/blog/qwq-32b-preview/

I also recently wrote a blog post explaining how reinforcement fine-tuning works, which is likely at least part of the pipeline used to train o1: https://openpipe.ai/blog/openai-rft


I don't know if I would call it "the best one" when it has "How many r in strawberry" as one of its example questions and, when tried, it arrives at the answer "two".


No, generally speaking OpenAI doesn't re-use training data between customers. It's worth it to them anyway because they learn what does/doesn't work on different tasks.

Of course, it isn't your IP free and clear either, because the base model isn't open, so your fine-tuned model will always live inside OpenAI's walled garden.

If you're interested in reinforcement learning on top of truly open models where you own the end product, we're putting a lot of thought into that and are also looking for design partners! Feel free to email me at kyle@openpipe.ai.


> No, generally speaking OpenAI doesn't re-use training data between customers

How do you know this?


This is a fair point. The reason I think "correlation" is a better metric than "predicts the exact correct score" is because of how I'll be using this model in the next post.

Broadly, the main use case for this model (in the RL context) will be to take two different versions of the same post, and predict which of the two is more likely to be upvoted. So what matters isn't that it gets the exact number of upvotes correctly, but that it correctly predicts the relative difference in likely upvote count between two variants.

Now, it still doesn't do a great job at that (the correlation is only 0.53, after all), but it does a good enough job to provide some useful signal.


That makes me wonder, though, what the best loss function would be. I assume you used MSE on the log score. I wonder if a sigmoid over which of two articles has the higher score would yield better results for the downstream RLHF task.
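
Concretely, I'm imagining something like this rough sketch (PyTorch, with made-up tensor names):

    import torch.nn.functional as F

    def mse_on_logscore(pred_logscore, true_logscore):
        # Regression objective I'm assuming was used: MSE on log(1 + upvotes).
        return F.mse_loss(pred_logscore, true_logscore)

    def pairwise_loss(score_winner, score_loser):
        # Bradley-Terry style objective: sigmoid(score_winner - score_loser) is
        # treated as P(the winner out-scores the loser); minimize the negative
        # log-likelihood over pairs where the "winner" actually got more upvotes.
        return -F.logsigmoid(score_winner - score_loser).mean()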


Yes! The architecture is almost identical. The only difference is in the final layer. In an LLM used for text generation, the final layer has a separate output for every potential token the model could produce, and we decide which token to generate by choosing the one with the highest likelihood at each generation step (at least that's what the simplest sampling methods do). In an LLM used as a reward model, we only have one output in the final layer, and we interpret its value as the predicted reward.

Everything else in the model before that final layer is exactly identical, architecture-wise.
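
In code, the difference really is just the output dimension of one linear layer. Roughly (illustrative sizes, PyTorch-style):

    import torch.nn as nn

    hidden_size = 4096   # width of the transformer's hidden states (illustrative)
    vocab_size = 32000   # tokenizer vocabulary size (illustrative)

    # Text generation: the final layer scores every token in the vocabulary.
    lm_head = nn.Linear(hidden_size, vocab_size, bias=False)

    # Reward model: the final layer produces a single scalar, read as the predicted reward.
    reward_head = nn.Linear(hidden_size, 1, bias=False)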


But a typical LLM has a feedback loop: it looks at the last token it generated and then decides, given the N tokens before that, which token to output next.

In the case of a reward model, are you streaming in the list of tokens; if so, what is the output after each token? Or are you feeding in all of the tokens in one shot, with the predicted reward as the output?


There are multiple ways to model reward. You can have it be fine-grained, such that every token gets its own reward, but by far the most common is to feed in the whole sequence and generate a single reward at the end.


I guess I'm not sure how the "feed in the whole sequence" works, if there's a single reward at the end.


It depends on the model and the problem. As an example, BERT-based models have a special [CLS] token that was pre-trained to encode information about the whole sequence. A reward model based on BERT would take the output embedding of that token from the last layer and feed it through a classification head, which would depend on your problem. You could then train this classification head on your alignment dataset like a classification problem.

You can check the examples from the TRL library for more information.
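
In rough code, a hand-rolled sketch of the BERT version using Hugging Face transformers (not TRL's actual implementation):

    import torch.nn as nn
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    backbone = AutoModel.from_pretrained("bert-base-uncased")
    reward_head = nn.Linear(backbone.config.hidden_size, 1)

    inputs = tokenizer("Example post text", return_tensors="pt")
    outputs = backbone(**inputs)

    # Position 0 is the [CLS] token; its final-layer embedding summarizes the sequence.
    cls_embedding = outputs.last_hidden_state[:, 0, :]
    reward = reward_head(cls_embedding)  # one scalar per sequence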


> You can check the examples from the TRL library for more information.

What library is that? Thanks!


I hadn't heard of isotonic regression before but I like it!

> it's good to create two models, one for likelihood of zero karma, and another expected karma, conditional on it being non-zero.

Another way to do this is to keep a single model but have it predict two outputs: (1) likelihood of zero karma, and (2) expected karma if non-zero. This would require writing a custom loss function, which sounds intimidating but actually isn't too bad (there's a rough sketch below).

If I were actually putting a model like this into production at HN I'd likely try modeling the problem in that way.
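
For the curious, here's roughly what that could look like (a hedged sketch assuming PyTorch and a pooled embedding from whatever backbone you use; all names are made up):

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TwoHeadKarmaModel(nn.Module):
        def __init__(self, hidden_size: int = 4096):
            super().__init__()
            self.zero_head = nn.Linear(hidden_size, 1)   # logit for P(karma == 0)
            self.karma_head = nn.Linear(hidden_size, 1)  # predicted log-karma if non-zero

        def forward(self, pooled_embedding):
            return self.zero_head(pooled_embedding), self.karma_head(pooled_embedding)

    def two_part_loss(zero_logit, log_karma_pred, true_karma):
        is_zero = (true_karma == 0).float()
        # Classification term: did the post get zero karma?
        cls_loss = F.binary_cross_entropy_with_logits(zero_logit.squeeze(-1), is_zero)
        # Regression term: only computed on posts that actually got karma.
        nonzero = 1.0 - is_zero
        sq_err = (log_karma_pred.squeeze(-1) - torch.log1p(true_karma.float())) ** 2
        reg_loss = (nonzero * sq_err).sum() / nonzero.sum().clamp(min=1.0)
        return cls_loss + reg_loss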


> For me the objective of "most upvotes" is not fully correlated with where I get the most value on HN. Most of the time, I would have found the most upvoted posts anyway on other platforms.

Yes, this is a fantastic point. I'm curious if there's some other measurable proxy metric for "things I get the most value out of on HN"? Upvotes seem like the most natural, but optimizing for them too strongly would definitely take HN down a dark path.


Perhaps selecting for posts with the highest-quality reply engagement? If many different people were drawn to lengthy discussions, that suggests the content sparks thoughts that others then feel compelled to engage with. Or select for the emotional content of replies (awe/empathy/anger), depending on what one wants from HN?


lots of platforms optimize for engagement, but all that does is encourage ragebait


Ohh, I really like that as a potential proxy metric!


Perhaps number of comments, or number of non-flamewar comments, or proportion of flamewar comments together with number of comments?

