
I agree that RLHF is not full RL; it's closer to a contextual bandit, because there is only a single decision per episode and no credit-assignment difficulty. But RLHF has one great advantage over supervised training: it updates the model on the whole sequence instead of only the next token. This is fundamentally different from pre-training, where the objective is myopic and the model never learns to address the "big picture".
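To make that contrast concrete, here is a rough sketch (PyTorch-flavored; "model", "prompt", "response" and "reward" are hypothetical placeholders, and the model is assumed to be a causal LM returning per-position logits over the vocabulary):

    import torch
    import torch.nn.functional as F

    def next_token_loss(model, tokens):
        # Pre-training objective: per-token cross-entropy. Each position is
        # trained only to predict the very next token, so the signal is myopic.
        logits = model(tokens[:, :-1])                      # (batch, seq-1, vocab)
        return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               tokens[:, 1:].reshape(-1))

    def rlhf_sequence_loss(model, prompt, response, reward):
        # RLHF-style (REINFORCE / contextual-bandit) objective: a single scalar
        # reward for the whole response scales the log-probability of every
        # response token, so the update reflects the full sequence at once.
        tokens = torch.cat([prompt, response], dim=1)
        logits = model(tokens[:, :-1])
        logp = F.log_softmax(logits, dim=-1)
        resp_logp = logp[:, prompt.size(1) - 1:, :].gather(
            -1, response.unsqueeze(-1)).squeeze(-1).sum(dim=-1)
        return -(reward * resp_logp).mean()                 # maximize reward-weighted log-prob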

So there are three levels of optimization under discussion here:

1. for the next token (NTP)

2. for a single turn response (RLHF)

3. for actual task completion or long-term objectives (RL)
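What separates level 3 from level 2 is that reward arrives only after many decisions, so it has to be propagated back across steps. A minimal sketch of that piece (plain Python, hypothetical helper):

    def discounted_returns(rewards, gamma=0.99):
        # G_t = r_t + gamma * G_{t+1}: each step's return folds in later
        # rewards. With a single decision per episode (the RLHF case) this
        # collapses to the lone reward, so no credit assignment is needed.
        returns, g = [], 0.0
        for r in reversed(rewards):
            g = r + gamma * g
            returns.append(g)
        return list(reversed(returns))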



