I've also worked in the area and published research in it a couple years ago. I ...

YeGoblynQueenne · on Feb 3, 2022

I don't agree. I don't know your work, but the approach in AlphaCode and Copilot (or the Codex model behind it) is a step backwards for neural program synthesis, and for program synthesis in general. The idea is to train a large language model to generate code. The trained language model has no way to direct its generation towards code that satisfies a specification. It can only complete source code from some initial prompt. The code generated is remarkably grammatical (in the context of a programming language grammar) which is certainly an advance for language representation, but some kind of additional mechanism is required to ensure that the generated code is relevant to the specification. In Copilot, that mechanism is the user's eyballs. In AlphaCode the mechanism is to test against the few I/O examples of the programming problems. In either case, the whole thing is just hit-and-miss. The language model generates mostly garbage -DeepMind brags that AlphaCode generates "orders of magnitude" more code than previous work, but that's just to say that it's even more random, and its generation misses the target even more than previous work! Even filtering on I/O examples is not enough to control the excessive over-generation and so additional measures are needed (clustering and ranking of programs etc).

All this could be done 40 years ago with a dumb DSL, or perhaps a more sophisticated system like a PCFG for programs, with a verifier bolted on [1]. It's nothing new. What's new is that it's done with a large language model trained with a Transformer, which is all the rage these days, and of course that it's done at the scale and with the amount of processing power available to DeepMind. Which I'm going to assume you didn't have back when you published your work.

Honestly, this is just an archaic, regressive approach, that can only work because of very big computers and very big datasets.

___________

[1] Which btw, is straightforard to do "by hand" and is something that people do all the time. In the AlphaCode work, the large language model simply replaces a hand-crafted program generator with a lot of data, but there is no reason to do that. This is the quintessential problem where a machine learning solution is not necessary because a hand-crafted solution is available, and easier to control.

Mehdi2277 · on Feb 3, 2022

I agree the approach itself is quite brute force heavy. There is a lot of information that could be used and I’d hope is helpful like grammar or language, traces/simulations of behavior, etc.

But when I say alpha code/copilot is good I’m referring solely to the difficulty of problems they are doing. There are many papers including mine that worked on simpler problems with more structure used to work on them.

I expect follow up work will include actually incorporating other knowledge more heavily to the model. My work was mainly on restricting tree like models to only make predictions following grammar of the language. Does that parallelize/fit well with a transformer? Unsure, but I would expect some language information/genuine problem constraints to be incorporated in future work.

Honestly I am pretty surprised how far pure brute force with large model is going. I would not have expected gpt3 level language modeling from more scale on a transformer and little else.

YeGoblynQueenne · on Feb 3, 2022

Well, I'm not surprised because I know that large language models can learn smooth approximations of natural language. They can generate very grammatical natural English, so why not grammatical source code, which is easier? Of course, once you have a generator for code, finding a program that satisfies a specification is just a matter of searching -and assuming the generated code includes such a program. But it seems like that isn't really the case with AlphaCode, because its performance is very poor.

I have to say that usually I'm the one speaking out against an over-reliance on machine learning benchmarks and against expecting a new approach to beat the state of the art before it can be taken seriously, but this is not a new approach, and that's the problem I have here. It's nothing new, repackaged as something new and sold as something it isn't ("reasoning" and "critical thinking" and other nonsense like that).

I agree that future work must get smarter, and incorporate some better inductive biases (knowledge, something). Or perhaps it's a matter of searching more intelligently because given they can generate millions of programs I'd have thought they'd be able to find more programs that approximate a solution.