Another little tidbit about RLHF and InstructGPT is that the training scheme is ...

Another little tidbit about RLHF and InstructGPT is that the training scheme is by far dominated by supervised learning. There is a bit of RL sprinkled on top, but the term is scaled down by a lot and 8x more compute time is spent on the supervised loss terms.