I wonder if there is really so much low-hanging fruit still lying around, or if everybody who tried injecting more domain knowledge into tools like this has quietly failed.
For example, the obvious way of building a distributed representation of, e.g., the simply-typed lambda-calculus (STLC) is by building a model. There are four local constraints that the model has to satisfy and the payoff is a representation that is invariant under program equivalence.
There are complexity-theoretic reasons why this cannot work all the time (conversion in STLC is nonelementary), but even something that works in simple cases would be more robust than a statistical fingerprint that gets confused by the names of local variables...
Is this even computable at all (leaving aside the complexity-theoretic issues)?
Thus you get a representation invariant under computation. This remains decidable when you consider only normalizing programs as in STLC or related subsystems.
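To sketch what I mean (a toy, not the real thing: plain beta-reduction over de Bruijn-indexed terms, so local variable names drop out of the representation entirely and beta-equivalent terms normalize to the same tree):

```python
# Terms: ('var', i) with de Bruijn index i, ('lam', body), ('app', f, a).

def shift(t, d, cutoff=0):
    """Shift free indices >= cutoff by d."""
    tag = t[0]
    if tag == 'var':
        i = t[1]
        return ('var', i + d if i >= cutoff else i)
    if tag == 'lam':
        return ('lam', shift(t[1], d, cutoff + 1))
    return ('app', shift(t[1], d, cutoff), shift(t[2], d, cutoff))

def subst(t, j, s):
    """Substitute s for index j in t."""
    tag = t[0]
    if tag == 'var':
        return s if t[1] == j else t
    if tag == 'lam':
        return ('lam', subst(t[1], j + 1, shift(s, 1)))
    return ('app', subst(t[1], j, s), subst(t[2], j, s))

def normalize(t):
    """Full beta-normalization; terminates on normalizing terms (e.g. STLC)."""
    tag = t[0]
    if tag == 'var':
        return t
    if tag == 'lam':
        return ('lam', normalize(t[1]))
    f, a = normalize(t[1]), normalize(t[2])
    if f[0] == 'lam':  # beta-redex: (lam b) a  -->  b[0 := a]
        return normalize(shift(subst(f[1], 0, shift(a, 1)), -1))
    return ('app', f, a)

id_ = ('lam', ('var', 0))                  # \x. x
const = ('lam', ('lam', ('var', 1)))       # \x. \y. x
print(normalize(('app', id_, const)) == const)  # True: (\x.x) K == K
```

The point is just that two programs that differ by renaming or by a reducible subterm land on the same normal form, which is exactly the invariance a statistical fingerprint lacks.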
Herbrand equivalence is the best you can do (in general) if you are trying to say whether two variables have the same values at the same program points.
If you are willing to be probabilistically correct you can do better, but you will get wrong answers (and not know they are wrong).
That is likely okay for this application.
One way would be an IDE extension that suggests a reference implementation of an algorithm when it finds code in your code base that resembles it with a high prediction score.
Or if it sees code duplication it could suggest a refactoring that factors out the common function.
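Something like this hypothetical fingerprinting pass could do the duplication check (identifier-blind AST hashing; all the names here are made up for illustration):

```python
import ast
import collections

def fingerprint(func: ast.FunctionDef) -> str:
    """Hash a function's AST with identifiers stripped, so copy-pasted
    bodies that only rename variables collide on the same key."""
    stripped = ast.parse(ast.unparse(func))  # re-parse to get a fresh copy
    for node in ast.walk(stripped):
        if isinstance(node, ast.Name):
            node.id = '_'
        elif isinstance(node, ast.arg):
            node.arg = '_'
        elif isinstance(node, ast.FunctionDef):
            node.name = '_'
    return ast.dump(stripped)

src = """
def total(xs):
    s = 0
    for x in xs:
        s += x
    return s

def add_up(values):
    acc = 0
    for v in values:
        acc += v
    return acc
"""

groups = collections.defaultdict(list)
for fn in ast.parse(src).body:
    groups[fingerprint(fn)].append(fn.name)
dupes = [names for names in groups.values() if len(names) > 1]
print(dupes)  # → [['total', 'add_up']]
```

Real clone detectors are fuzzier than exact-hash matching, but even this catches the rename-only copy-paste case.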
Also, props for an academic work having an extensive README! https://github.com/tech-srl/code2vec
I do wonder whether some sort of AST normalization would improve the input signal. Example 8 on the website shows their system correctly identifying an isPrime function, yet some irrelevant perturbations break it. If you swap the if-statement condition around from `n % i == 0` to `0 == n % i`, the proposed names are totally different and make no sense.
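For instance, a normalization pass could canonicalize symmetric comparisons before feature extraction (a hypothetical pre-processing step sketched in Python, not anything code2vec actually does):

```python
import ast

class CanonicalizeCompare(ast.NodeTransformer):
    """Rewrite Yoda conditions like `0 == n % i` into `n % i == 0`,
    so semantically identical comparisons share one AST shape."""
    def visit_Compare(self, node):
        self.generic_visit(node)
        # Only the simple two-operand case with a symmetric operator.
        if (len(node.ops) == 1
                and isinstance(node.ops[0], (ast.Eq, ast.NotEq))
                and isinstance(node.left, ast.Constant)
                and not isinstance(node.comparators[0], ast.Constant)):
            node.left, node.comparators[0] = node.comparators[0], node.left
        return node

tree = ast.parse("if 0 == n % i: pass")
norm = ast.fix_missing_locations(CanonicalizeCompare().visit(tree))
print(ast.unparse(norm))  # the condition comes back as `n % i == 0`
```

Both spellings of the condition would then map to the same path contexts, so the perturbation above could no longer change the prediction.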
Then we could use an adversarial network to try to learn the relationship between requirements (or tests) and code.
Although we tend to think of Fibonacci and FizzBuzz as "basic", these are actually not common in real projects.
Because they are popular and actively maintained projects, we assumed their code and naming quality are worth learning, so that the learned patterns would transfer to other code and projects. As you can imagine, FizzBuzz is not that common there :)
To play devil's advocate, could I just run code2vec across different repos to see if they are "real enough"?
I like the library and just printed out the paper to read, but since my focus is more on learning to program, this sounds like a tool I'd love to be able to use for analysis, yet I'm being told no.
I suspect that's a function of the size and quality of their training set?