Have you read about this specific model we're talking about?
My understanding is that the whole point of R1 is that it was surprisingly effective to train on synthetic data AND to reinforce on the final output rather than the whole chain of thought. That approach doesn't require nearly as much human-curated data, and it's a big part of where the efficiency gain came from.
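Roughly what I mean by "reinforce on the output": the reward is computed from the final answer alone, and the reasoning tokens are never scored. A minimal sketch below, where the `<answer>` tag format and the exact-match check are my own assumptions for illustration, not the paper's exact setup.

```python
import re


def outcome_reward(completion: str, reference_answer: str) -> float:
    """Score a rollout by its final answer only; the chain of thought is ignored.

    Hypothetical sketch: assumes the model is prompted to wrap its final answer
    in <answer>...</answer> tags. The rule-based check stands in for whatever
    verifiable reward (math answer match, test cases, etc.) is used in practice.
    """
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match is None:
        return 0.0  # no parsable answer -> no reward
    predicted = match.group(1).strip()
    # Only the outcome is checked; no learned judge ever looks at the reasoning.
    return 1.0 if predicted == reference_answer.strip() else 0.0


if __name__ == "__main__":
    rollout = "<think>Long chain of thought that is never scored...</think><answer>42</answer>"
    print(outcome_reward(rollout, "42"))  # 1.0
```

The point being: because the signal comes from a cheap, automatic check on the answer, you don't need humans rating every reasoning trace.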