
While the post uses DPO to illustrate RL and RLHF, DPO is in fact an alternative to RLHF that does not use RL at all. See the abstract of the DPO paper (https://arxiv.org/abs/2305.18290) and its Figure 1: "DPO optimizes for human preferences while avoiding reinforcement learning".
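
To make the distinction concrete: the DPO objective is just a binary-classification-style loss on preference pairs. A minimal PyTorch sketch of it, with the caveat that the function and tensor names here are mine and it assumes per-token log-probs have already been summed into per-sequence log-probs:

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Log-ratios of the trained policy vs. the frozen reference model
        chosen = policy_chosen_logps - ref_chosen_logps
        rejected = policy_rejected_logps - ref_rejected_logps
        # A single supervised step: raise the chosen completion's ratio
        # relative to the rejected one. No rollouts, no reward model,
        # no policy-gradient estimate.
        return -F.logsigmoid(beta * (chosen - rejected)).mean()

That it trains with an ordinary differentiable loss, rather than sampling from the policy and estimating the gradient of an expected reward, is why the paper describes it as avoiding RL.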

The confusion is understandable. The definition of RL in the Sutton/Barto book extends over two chapters iirc, and even after reading them I did not see how RL differed from other learning methods. Studying some of the academic papers cleared things up.
