
Because the backspace question (essentially: is T a subsequence of S with a deletion size of N?) probably occurs hundreds of times, in one form or another, within AlphaCode's training corpus.
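A minimal Go sketch of that subsequence framing (names and signature are illustrative only, not AlphaCode's statement of the problem):

  // Can t be obtained from s by deleting exactly n characters?
  // Greedy left-to-right scan: match each character of t against s.
  func isSubsequenceWithDeletions(s, t string, n int) bool {
      if len(s)-len(t) != n {
          return false
      }
      j := 0
      for i := 0; i < len(s) && j < len(t); i++ {
          if s[i] == t[j] {
              j++
          }
      }
      return j == len(t)
  }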

Any leetcode grinder can tell you there are only a few dozen types of competitive programming problem (monostack, breadth-first state search, binary search over the solution space, etc.), so solutions to new problems are often very similar to solutions to old problems.

The training corpus for these code transformers is so large that almost all evaluation involves asking them to generate code from the training corpus.

To evaluate CoPilot, we should ask questions that are unusual enough they can't be answered through regurgitation of the training corpus.

What does CoPilot generate, given this prompt:

  // A Go function to set the middle six bits of an unsigned 64-bit integer to 1.
Here is a human solution:

  func six1s(x uint64) uint64 {
      const (
          ones  = uint64(63)
          shift = (64 - 6) / 2
          mask  = ones << shift
      )
      return x | mask
  }
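A quick sanity check (assuming package main, import "fmt", and six1s as above) confirms the mask this computes:

  func main() {
      fmt.Printf("%#x\n", six1s(0)) // 0x7e0000000, i.e. bits 29-34 set
  }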
Can it solve this simple but unusual problem?



> To evaluate CoPilot, we should ask questions that are unusual enough they can't be answered through regurgitation of the training corpus.

Exactly! It's great that Copilot can generate correct code for a given question, but we cannot gauge its full capability unless we try it on a range of different questions, especially ones that are not found in the training data.

I mentioned this in the other AlphaCode post: It would be nice to know how "unusual" a given question is. Maybe an exact replica exists in the training data, or a solution exists but in a different programming language, or a solution can be constructed by combining two samples from the training data.

Quantifying the "unusual-ness" of a question would make it easier to gauge the capability of models like AlphaCode. I wrote a simple metric that uses nearest neighbors (https://arxiv.org/abs/2109.12075). There are other tools for this as well: conformal predictors, which are used in classification, and the RETRO transformer (https://arxiv.org/pdf/2112.04426.pdf), which includes a calculation of the effect of "dataset leakage".
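To make the nearest-neighbor idea concrete, here is a rough sketch of novelty scoring (not the metric from the paper; representing questions as float-vector embeddings is an assumption here):

  package metric

  import "math"

  // unusualness returns the Euclidean distance from a question's embedding q
  // to its nearest neighbor among the training-set embeddings train.
  // Larger means the question is farther from anything seen in training.
  func unusualness(q []float64, train [][]float64) float64 {
      best := math.Inf(1)
      for _, t := range train {
          sum := 0.0
          for i := range q {
              d := q[i] - t[i]
              sum += d * d
          }
          if dist := math.Sqrt(sum); dist < best {
              best = dist
          }
      }
      return best
  }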


The AlphaCode paper does some analysis of training-data copying (Section 6.1 and Appendix F): https://storage.googleapis.com/deepmind-media/AlphaCode/comp...

It does not seem to be copying from the training data in any meaningful way.


> It does not seem to be copying from the training data in any meaningful way.

My point is, I would like to verify this claim with different metrics, because we probably have different interpretations of the word "meaningful".

AlphaCode measures similarity between programs via longest common substrings. That's better than nothing, but it means two programs that differ only in variable naming would not be considered similar. If two programs differed only in the names of their variables and functions, I would consider that copying.

I think there are better comparisons of structural similarity: compare the ASTs, or the bytecode/assembly code generated, the control flow graphs, or perform SSA and compare the blocks generated. Each of these might have weaknesses as well, but they won't be as obvious as variable renaming, and so we'd get a better idea of what AlphaCode is copying, and therefore a better idea of its full capabilities.
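For example, erasing identifier names before comparing already catches the renaming case. A rough sketch using Go's go/ast, purely for illustration since this thread's examples are in Go:

  package compare

  import (
      "bytes"
      "go/ast"
      "go/format"
      "go/parser"
      "go/token"
  )

  // canonical parses a source file, renames every identifier to a fixed
  // placeholder, and prints the result. Two programs that differ only in
  // variable/function names end up with identical canonical forms.
  func canonical(src string) (string, error) {
      fset := token.NewFileSet()
      f, err := parser.ParseFile(fset, "src.go", src, 0)
      if err != nil {
          return "", err
      }
      ast.Inspect(f, func(n ast.Node) bool {
          if id, ok := n.(*ast.Ident); ok {
              id.Name = "v" // erase naming differences
          }
          return true
      })
      var buf bytes.Buffer
      if err := format.Node(&buf, fset, f); err != nil {
          return "", err
      }
      return buf.String(), nil
  }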

I expect AlphaCode performs well on Python because the training data is dominated by Python, but Python isn't ideal for comparing program structure. I wonder which programming language (given enough training data) would be best suited for language model generation and analysis.


Sure, I think this would be a really interesting study to do! I have been meaning to scrape a big chunk of GitHub in order to do this kind of analysis. I think I would be surprised (just based on my own use of Codex and Copilot) if they were copying at the level of "same code but renamed variables" either. Past that I think comparisons get pretty difficult, and to some degree I would actually be more impressed if it understood enough about programs to be able to mimic higher-level structure without just memorizing text.

Why do you think Python isn't good for comparing program structure? Certainly you could compare ASTs and bytecode pretty easily; I think it's actually much easier to do so than in C/C++, since you don't have to deal with preprocessor junk and compile flags influencing the meaning of the code. There's less available for classical analyses like data and control flow analysis, in part because those are much harder in dynamic languages, but there are some tools out there, like the ones used in PyPy for getting SSA CFGs [1].

I feel like many people are equivocating a bit on what they mean by "regurgitating" here. It seems clear that the models have at best a shaky grasp on code semantics (e.g., they are bad at things like predicting what the output of some code will be: https://arxiv.org/abs/2112.00114), and struggle with problems that are very different from anything they've seen before.

[1] https://foss.heptapod.net/pypy/pypy/-/tree/branch/default/rp...


> I would actually be more impressed if it understood enough about programs to be able to mimic higher-level structure without just memorizing text.

Agreed.

> Why do you think Python isn't good for comparing program structure?

From recent experience I prefer control flow analysis, or something that results in a graph structure. As you said, that's harder with dynamic languages. I also think some Python features (English-like syntax, f-strings, division converting int to float, whitespace indentation, loops vs generator expressions) make structural comparisons messy, but that may just be my bias.

The ideal language would be one with minimal syntax, where we can target a decent range of programs and obtain as much info as possible about program structure without actually running the program. I've come across LISP-without-macros in Dreamcoder (https://arxiv.org/abs/2006.08381), BF++ (https://arxiv.org/abs/2101.09571), and a couple of others I can't remember right now. I think the APL family (APL/J/K) would be interesting because there are fewer characters to generate, but each character carries a lot of meaning.

Right now I'm looking at flow-based programming (FBP) for this: In FBP the control flow is explicit - the program code describes a directed acyclic graph (DAG), so comparing program structure becomes straightforward (subgraph isomorphism with some heuristics). I'm writing a toy FBP language that draws images (https://github.com/mayahq/flatland), with which I aim to test what these models understand.
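To make "the program code describes a DAG" concrete, the representation is roughly this shape (a toy sketch, not the actual flatland code):

  // An FBP program as a directed acyclic graph of typed nodes.
  // Comparing two programs then means comparing graphs, not text.
  type Node struct {
      ID   int
      Kind string // e.g. "source", "transform", "sink" (illustrative labels)
  }

  type Edge struct {
      From, To int // data flows From -> To
  }

  type Program struct {
      Nodes []Node
      Edges []Edge
  }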


> What does CoPilot generate, given this prompt:

Here's what it generated for me:

  // A Go function to set the middle six bits of an unsigned 64-bit integer to 1.

  func set6(x uint64) uint64 {
   return x | 0x3f
  }


Incorrect. Thank you.

CoPilot is helpless if it needs to do more than just regurgitate someone else's code.

Training these models on GitHub, so that they regurgitate licensed code without attribution, is the greatest theft of intellectual property in the history of Man. Perhaps not according to the letter of the law, but surely according to the spirit.


I like CoPilot's answer better than yours, and I think it's closer to what most people would do; clearly 0x3F is the wrong constant but the approach is good.
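With that same one-liner approach, the intended mask would be something like:

  func set6(x uint64) uint64 {
      return x | (0x3F << 29) // 0x7E0000000: the middle six bits (29-34)
  }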


I like the manual solution better: it shows the thought process without any negative impact on readability or performance.

This makes it easier to match the code against the specification, something a random (and in this case even wrong) magic number fails to do.

But maybe I'm overthinking it.


You're not overthinking it.

You're correct.


CoPilot's solution is totally wrong. Sorry.

CoPilot regurgitated somebody's solution... to a different problem. It's pathetic.


Here's my solution (I'm a human):

    func set6(x uint64) uint64 {
       return x | 0x7E0000000
    }
Is this also pathetic?


A good solution! You SOLVED the problem.

CoPilot got it WRONG. You got it RIGHT.

You UNDERSTAND the problem but CoPilot does NOT.

Is that clear?


For fun I rephrased the prompt a little. "Middle bits" is kind of vague; when given an explicit description of which bits you want to set, it does fine:

Prompt:

  // Function to set bits 29-34 in a uint64 to 1
  func setbits (uint64 x) uint64 {
Completion:

    return x | (1 << 29) | (1 << 30) | (1 << 31) | (1 << 32) | (1 << 33) | (1 << 34)
  }
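For what it's worth, that completion is correct; the six ORed bits collapse to the same mask as the hand-written solutions above:

  const mask = (1 << 29) | (1 << 30) | (1 << 31) | (1 << 32) | (1 << 33) | (1 << 34)
  // mask == 0x3F<<29 == 0x7E0000000, i.e. bits 29-34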
It would be nice if it made a better guess about what "middle" is supposed to mean here, of course.


"Middle bits" is not ambiguous, but CoPilot hasn't seen code for that phrase in its training, so it has nothing to regurgitate.

You spelled out exactly what to do, in terms of what it has seen in its training, and it was able to regurgitate a solution.

By asking questions that require mathematical reasoning or are too far from the training corpus, I can create an endless list of simple problems that CoPilot can't solve.

Look at my comment history to see another one (swapping bits).



