> This technique, the "Self-Taught Reasoner" (STaR), relies on a simple loop: generate rationales to answer many questions, prompted with a few rationale examples; if the generated answers are wrong, try again to generate a rationale given the correct answer; fine-tune on all the rationales that ultimately yielded correct answers; repeat. We show that STaR significantly improves performance on multiple datasets compared to a model fine-tuned to directly predict final answers
But there are a few others. In general, good data is good data, and we're definitely learning more about how to produce good synthetic versions of it.
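Roughly, the loop reads like this. A minimal sketch, not the authors' code; `generate_rationale` and `finetune` are placeholder callables you'd supply:

```python
def star_loop(model, dataset, few_shot_prompt, generate_rationale, finetune, num_iters=5):
    """Illustrative STaR-style loop. `generate_rationale(model, prompt, question, hint=None)`
    returns (rationale, predicted_answer); `finetune(model, examples)` returns a new model.
    Both are hypothetical stand-ins, not APIs from the paper."""
    for _ in range(num_iters):
        training_examples = []
        for question, answer in dataset:
            # 1. Few-shot prompt the current model for a rationale and answer.
            rationale, predicted = generate_rationale(model, few_shot_prompt, question)
            if predicted != answer:
                # 2. If wrong, retry with the correct answer provided as a hint
                #    ("rationalization"); keep it only if it now reaches that answer.
                rationale, predicted = generate_rationale(
                    model, few_shot_prompt, question, hint=answer
                )
            if predicted == answer:
                training_examples.append((question, rationale, answer))
        # 3. Fine-tune on the rationales that ultimately yielded correct answers, then repeat.
        model = finetune(model, training_examples)
    return model
```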
One issue with that is that the model may learn to smuggle data: you as a human think the plain reading of the words is what's doing the reasoning, but (part of) the processing is actually carried by the exact comma placement, synonym choice, etc.
Data smuggling is a known phenomenon in similar tasks.
I don't think data smuggling is relevant in STaR-style scenarios. You're still validating the final output; if it works on test data, what could even be smuggled?