Introducing LLM-RLHF-Tuning, a cutting-edge project implementing Reinforcement Learning from Human Feedback (RLHF) with an emphasis on the Proximal Policy Optimization (PPO) and Deterministic Policy Optimization (DPO) algorithms. Designed to fine-tune and train the Alpaca, LLaMA, and LLaMA2 models more effectively, our project supports various configurations, including LoRA adapters and accelerate/DeepSpeed training. Ideal for AI researchers and developers seeking to push the boundaries of machine learning models.
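For anyone trying to picture what "supports LoRA adapters" means in practice, here is a minimal sketch of attaching a LoRA adapter to a LLaMA-style model. It assumes the generic Hugging Face peft API rather than this repo's own scripts, and the model name and hyperparameters are placeholders, not values taken from the project.

    # Minimal LoRA sketch using the generic Hugging Face peft API;
    # an illustration only, not this repo's actual training entry point.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, TaskType, get_peft_model

    # Placeholder base model; swap in whichever LLaMA/LLaMA2 checkpoint you use.
    base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    lora_cfg = LoraConfig(
        task_type=TaskType.CAUSAL_LM,
        r=8,                      # low-rank dimension of the adapter matrices
        lora_alpha=16,            # scaling factor applied to the adapter output
        lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # typical attention projections in LLaMA
    )

    policy = get_peft_model(base, lora_cfg)
    policy.print_trainable_parameters()  # only the small adapter matrices are trainable

Because only the adapter weights are trained rather than the full base model, this is the kind of setup that makes the accelerate/DeepSpeed configurations mentioned above practical on modest hardware.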
It sounds like it to me.
In the sentence "This project implements Reinforcement Learning from Human Feedback (RLHF) training...", the key point for me that triggers my "ChatGPT-sense tingling" is spelling out the abbreviation (RLHF), when it's kind of common terminology in the LLM space nowadays.
Otherwise, the sentence's "shape"/construction of:
Proximal Policy Optimization (PPO)
PPO is an optimization algorithm used in reinforcement learning to update policy parameters by optimizing a clipped surrogate objective function. The objective function for PPO is defined as...
Sounds really like ChatGPT or the beginning of a Wikipedia article.
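(For reference, the standard clipped surrogate objective that the quoted text trails off into, as given in the original PPO paper, is

    L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t\right)\right],
    \qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},

with \hat{A}_t the advantage estimate, so the quoted wording reads like the boilerplate lead-in to exactly that formula.)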
I'd be interested in knowing if these parts were indeed written by ChatGPT, by the guy himself, or even by another LLM.
Very interested in the expansion of RL for transformers, but I can't quite tell what this project is.
Could you please add links to the documentation in the README where it states "It includes detailed documentation"?
Also, maybe DPO should use the DDPG acronym instead, so your repo's Deterministic Policy Optimization isn't confused with trl's Direct Preference Optimization.
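For context, trl's DPO is the Direct Preference Optimization loss of Rafailov et al. (2023): for a prompt x with preferred/rejected completions y_w and y_l it is usually written as

    \mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\!\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],

i.e. a supervised loss over preference pairs with no rollouts, which is quite different from DDPG-style deterministic policy gradients, so the acronym clash is easy to trip over.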