
Reminds me of the AlphaCode approach.

Why do you say it's sampling programs from "training data"? With that choice of words, you're rhetorically assuming the conclusion.

If he had only sampled 20 programs instead of 8,000, would we still say the programs came from "training data", or would we say it's genuine OOD generalization? At what point do we attribute the intelligence to the LLM itself instead of the outer loop?

This isn't meant to be facetious. Clearly, if the number of programs sampled, N, is very large, it's easy to get the right solution with little intelligence by relying on luck. But as N gets small, the LLM itself has to be intelligent and capable of OOD generalization, assuming the benchmark is good.
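For concreteness, here's a minimal sketch of the kind of outer loop being discussed (AlphaCode-style sample-and-filter); `generate` and `passes_tests` are hypothetical stand-ins for the LLM call and the public-test check, not anyone's actual implementation:

  import random

  def sample_and_select(prompt, n_samples, generate, passes_tests):
      """Sample n candidate programs, keep those passing the public tests,
      and return one. As n_samples grows, luck does more of the work;
      as it shrinks, the model itself has to generalize."""
      candidates = [generate(prompt) for _ in range(n_samples)]
      survivors = [c for c in candidates if passes_tests(c)]
      return random.choice(survivors) if survivors else None

  # Hypothetical usage: compare n_samples=8000 vs n_samples=20.
  # solution = sample_and_select(task_prompt, 8000, generate, passes_tests)

The question above is essentially about where the credit goes as n_samples shrinks: the filter, or the model doing the generating.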



