Another little tidbit about RLHF and InstructGPT is that the training scheme is by far dominated by supervised learning. There is a bit of RL sprinkled on top, but the term is scaled down by a lot and 8x more compute time is spent on the supervised loss terms.