
Yes, they got "smarter" by compiling a corpus of knowledge that future generations could train on.

Sarcasm aside, throwing away the existing corpus in favor of creating a new one from scratch seems misguided.

This paper isn't about creating a new language. They omit the sampler that chooses a single token and instead send the entire output distribution back into the model, like a superposition of tokens. That's the breadth-first-search part: the choice is never collapsed down to a single token before continuing, so the model effectively operates on all of the possible tokens at each step until it decides it's done.
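Roughly, I read the feedback loop as something like this toy sketch (not the paper's code; gpt2, the prompt, and the 8 latent steps are all placeholders I picked for illustration): instead of sampling a token and embedding it, feed back the probability-weighted mixture of token embeddings.

    # Toy sketch of the "superposition" feedback loop (illustrative only,
    # not the paper's implementation).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    emb = model.get_input_embeddings().weight                 # (vocab, d_model)

    ids = tok("2 + 2 =", return_tensors="pt")["input_ids"][0]
    x = emb[ids].unsqueeze(0)                                 # start from real token embeddings

    for _ in range(8):                                        # a few "latent" steps
        probs = model(inputs_embeds=x).logits[:, -1].softmax(-1)  # full distribution, no sampling
        mixed = probs @ emb                                   # superposition: probability-weighted mix of embeddings
        x = torch.cat([x, mixed.unsqueeze(1)], dim=1)         # feed it back in, never collapsing to one token

    print(tok.decode(probs.argmax(-1)))                       # argmax only at the very end, just to peek

The point is that nothing gets decoded along the way; the argmax at the bottom is only there to inspect where the distribution ended up.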

It would be interesting to try this with similar models that had slightly different post-training, if you could devise a good way to choose the best answer, combine the outputs effectively, feed the output of a downstream model back into the initial model, etc. But I'm not sure there'd necessarily be any benefit over using a single specialized model.
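If anyone did want to poke at that, the laziest version I can imagine looks something like the following. Purely speculative, not from the paper: gpt2/distilgpt2 are arbitrary stand-ins for models that share a tokenizer but have different weights, each keeping its own latent sequence while the next-token distributions get averaged before being re-embedded.

    # Speculative riff on the ensemble idea above (illustrative only).
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    model_a = AutoModelForCausalLM.from_pretrained("gpt2")
    model_b = AutoModelForCausalLM.from_pretrained("distilgpt2")  # same vocab, different weights
    emb_a = model_a.get_input_embeddings().weight
    emb_b = model_b.get_input_embeddings().weight

    ids = tok("2 + 2 =", return_tensors="pt")["input_ids"][0]
    xa, xb = emb_a[ids].unsqueeze(0), emb_b[ids].unsqueeze(0)

    for _ in range(4):
        pa = model_a(inputs_embeds=xa).logits[:, -1].softmax(-1)
        pb = model_b(inputs_embeds=xb).logits[:, -1].softmax(-1)
        mixed = (pa + pb) / 2                                 # combine the two distributions
        xa = torch.cat([xa, (mixed @ emb_a).unsqueeze(1)], dim=1)  # re-embed in each model's own space
        xb = torch.cat([xb, (mixed @ emb_b).unsqueeze(1)], dim=1)

    print(tok.decode(mixed.argmax(-1)))

Mixing at the distribution level rather than the embedding level sidesteps the fact that the two embedding spaces aren't aligned, but whether the averaged distribution buys you anything over a single specialized model is exactly the open question.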