Show HN: Next-Gen AI Training: LLM-RLHF-Tuning with PPO and DPO (github.com/raghavc)
30 points by rags1 on March 18, 2024 | hide | past | favorite | 9 comments


Introducing LLM-RLHF-Tuning, a cutting-edge project implementing Reinforcement Learning from Human Feedback (RLHF) with an emphasis on the Proximal Policy Optimization (PPO) and Deterministic Policy Optimization (DPO) algorithms. Designed to fine-tune and train the Alpaca, LLaMA, and LLaMA2 models more effectively, our project supports various configurations, including LoRA adapters and accelerated training with DeepSpeed. Ideal for AI researchers and developers seeking to push the boundaries of machine learning models.
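[Editor's note: for readers unfamiliar with the "LoRA adapters" mentioned above — the idea is to freeze the pretrained weight matrix W and learn a low-rank update BA, so the adapted layer computes Wx + (alpha/r)·B(Ax). A minimal numeric sketch in plain Python, purely illustrative; the project itself presumably delegates this to PEFT:]

```python
# Minimal LoRA forward-pass sketch: y = W x + (alpha/r) * B (A x).
# Illustrative only; real implementations (e.g. PEFT) operate on GPU tensors.

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha=16, r=2):
    base = matvec(W, x)              # frozen pretrained path
    delta = matvec(B, matvec(A, x))  # trainable low-rank update (rank r)
    scale = alpha / r
    return [b + scale * d for b, d in zip(base, delta)]

# Toy example: 2x2 frozen weight, rank-1 adapter.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # r x d_in  (rank 1)
B = [[0.5], [0.25]]     # d_out x r
print(lora_forward(W, A, B, [2.0, 3.0], alpha=1, r=1))  # [4.5, 4.25]
```

Only A and B are updated during fine-tuning, which is why LoRA cuts memory use enough to make single-GPU RLHF practical.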


How does this compare to existing LLM fine-tuning projects, like axolotl and LLaMA-Factory?


chatgpt description?


It sounds like it to me. In the sentence "This project implements Reinforcement Learning from Human Feedback (RLHF) training...", the key point that triggers my "ChatGPT sense" is spelling out the abbreviation (RLHF), when it's fairly common terminology in the LLM space nowadays.

Otherwise, the "shape"/construction of the sentence: "Proximal Policy Optimization (PPO): PPO is an optimization algorithm used in reinforcement learning to update policy parameters by optimizing a clipped surrogate objective function. The objective function for PPO is defined as..."

sounds really like ChatGPT or the beginning of a Wikipedia article.
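[Editor's note: whatever its provenance, the "clipped surrogate objective" that boilerplate describes is real: PPO maximizes E[min(r_t·Â_t, clip(r_t, 1−ε, 1+ε)·Â_t)], where r_t is the new/old policy probability ratio and Â_t the advantage estimate. A per-sample sketch, my own illustration rather than the repo's code:]

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    """Per-sample PPO clipped surrogate:
    min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# If the new policy moves too far in the direction the advantage rewards,
# the clip caps the objective, removing the incentive to overshoot:
print(ppo_clip_objective(logp_new=0.5, logp_old=0.0, advantage=1.0))  # 1.2
```

Here the raw ratio is e^0.5 ≈ 1.65, but the clipped term caps the objective at 1 + ε = 1.2.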

I'd be interested to know whether these parts were indeed written by ChatGPT, by the author himself, or even by another LLM.



What about this seems like it's written by ChatGPT to you?


Very interested in the expansion of RL for transformers, but I can't quite tell what this project is.

Could you please add links to the documentation in the README where it states "It includes detailed documentation"?

Also, maybe DPO should use the DDPG acronym instead, so your repo's Deterministic Policy Optimization isn't confused with TRL's Direct Preference Optimization.
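[Editor's note: to make the name clash concrete — TRL's DPO (Direct Preference Optimization) is an offline preference loss, not a deterministic policy-gradient method. Its per-pair loss is −log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). A sketch with made-up scalar log-probs, not taken from either repo:]

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Direct Preference Optimization loss for one (chosen, rejected) pair:
    -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# When the policy prefers the chosen response more strongly than the
# reference model does, the margin is positive and the loss drops
# below log(2) (the value at margin = 0):
loss = dpo_loss(logp_w=-1.0, logp_l=-3.0, ref_logp_w=-2.0, ref_logp_l=-2.5)
print(loss < math.log(2))  # True
```

Note there is no reward model and no environment rollout anywhere in this loss, which is exactly why confusing it with a deterministic policy-gradient method (DDPG-style) would mislead readers.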


Does this also work with QLoRA?


Looks like a wrapper on top of a couple of popular Hugging Face libraries. All the heavy lifting is done by TRL, Transformers, and PEFT.



