
Code2vec: learning distributed representations of code - abecedarius
https://code2vec.org/
======
fmap
Most examples I tried didn't work very well, but when one did work it was truly
neat. The performance makes sense from a quick glance into the paper. The
model represents programs as paths in the AST, which is not sufficient to
reconstruct the semantics, but is a good "fingerprint" of a program for fuzzy
retrieval tasks. That's the domain which the authors wanted to target.
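
To make the "paths in the AST" idea concrete, here is a rough sketch of extracting leaf-to-leaf paths. It is not the authors' extractor: the real model was trained on Java, while this uses Python's own `ast` module on Python source, and the names `collect_leaves`, `path_contexts` and `jaccard` are made up for illustration.

```python
import ast
from itertools import combinations


def collect_leaves(tree):
    """Return (token, ancestors) pairs; ancestors runs from the root down to the leaf."""
    leaves = []

    def walk(node, trail):
        trail = trail + [node]
        if isinstance(node, ast.Name):
            leaves.append((node.id, trail))
        elif isinstance(node, ast.arg):
            leaves.append((node.arg, trail))
        elif isinstance(node, ast.Constant):
            leaves.append((repr(node.value), trail))
        else:
            for child in ast.iter_child_nodes(node):
                walk(child, trail)

    walk(tree, [])
    return leaves


def path_contexts(source):
    """Return the set of (token, path, token) triples over every pair of leaves."""
    contexts = set()
    for (tok_a, trail_a), (tok_b, trail_b) in combinations(collect_leaves(ast.parse(source)), 2):
        # Count the ancestors both leaves share, compared by node identity.
        shared = 0
        for x, y in zip(trail_a, trail_b):
            if x is not y:
                break
            shared += 1
        # Walk up from the first leaf to the lowest common ancestor, then down.
        up = [type(n).__name__ for n in reversed(trail_a[shared:])]
        top = type(trail_a[shared - 1]).__name__
        down = [type(n).__name__ for n in trail_b[shared:]]
        contexts.add((tok_a, tuple(up) + (top,) + tuple(down), tok_b))
    return contexts


def jaccard(a, b):
    """Crude fingerprint similarity: overlap of two sets of path contexts."""
    return len(a & b) / len(a | b) if a | b else 1.0


original = path_contexts("def f(n):\n    return n % 2 == 0\n")
renamed = path_contexts("def f(k):\n    return k % 2 == 0\n")
print(jaccard(original, renamed))
```

In this sketch, renaming `n` to `k` leaves every path unchanged but changes most of the triples, because the leaf tokens are part of the fingerprint. That is the sensitivity to local variable names I mean.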

I wonder if there is really so much low-hanging fruit still lying around, or
if everybody who tried injecting some more domain knowledge into tools like
this has quietly failed.

For example, the obvious way of building a distributed representation of,
e.g., the simply-typed lambda-calculus (STLC) is by building a model. There
are four local constraints that the model has to satisfy and the payoff is a
representation that is invariant under program equivalence.

There are some complexity-theoretic reasons why this cannot really work all
the time (conversion in STLC is nonelementary), but even something that works
in simple cases would be more robust than a statistical fingerprint that gets
confused by the names of local variables...

~~~
0xBABAD00C
> representation that is invariant under program equivalence

is this even computable at all (leaving aside the complexity theoretic
issues)?

~~~
fmap
That was a poor choice of words. Models of lambda calculus are invariant under
beta-eta conversion, which is what I meant by program equivalence, but which
is not the same thing as contextual equivalence.

Thus you get a representation invariant under computation. This remains
decidable when you consider only normalizing programs as in STLC or related
subsystems.
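
A minimal sketch of what "invariant under computation" can look like, using normalization by evaluation: terms are interpreted as Python closures (a very crude model) and then read back as normal forms. The caveats are loud ones. It handles beta only (eta would need type-directed read-back), it does no type checking, and it only terminates on normalizing terms. All names (`Var`, `Lam`, `App`, `evaluate`, `reify`, `normalize`) are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    index: int  # de Bruijn index: 0 is the innermost binder


@dataclass(frozen=True)
class Lam:
    body: object


@dataclass(frozen=True)
class App:
    fn: object
    arg: object


@dataclass(frozen=True)
class NVar:
    level: int  # stuck variable, counted from the outermost binder


@dataclass(frozen=True)
class NApp:
    fn: object  # stuck application whose head is a variable
    arg: object


def evaluate(term, env):
    """Interpret a term; lambdas become Python closures."""
    if isinstance(term, Var):
        return env[term.index]
    if isinstance(term, Lam):
        return lambda v: evaluate(term.body, [v] + env)
    fn = evaluate(term.fn, env)
    arg = evaluate(term.arg, env)
    return fn(arg) if callable(fn) else NApp(fn, arg)


def reify(value, depth):
    """Read a semantic value back as a beta-normal term."""
    if callable(value):
        return Lam(reify(value(NVar(depth)), depth + 1))
    if isinstance(value, NVar):
        return Var(depth - value.level - 1)
    return App(reify(value.fn, depth), reify(value.arg, depth))


def normalize(term):
    """Beta-normal form of a closed term (assumes the term normalizes)."""
    return reify(evaluate(term, []), 0)


identity = Lam(Var(0))
apply_to_identity = App(Lam(Lam(App(Var(1), Var(0)))), identity)
print(normalize(apply_to_identity) == normalize(identity))
```

Two beta-convertible terms get the same normal form, so the normal form (or anything computed from it) is a representation that does not change under computation.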

------
kaveh_h
Nice! This could be a useful base for a tool to make code more DRY or higher quality.

One way would be an IDE extension that suggests a reference implementation of
an algorithm if it finds code in your code base resembling it with a high
prediction score. Or if it sees code duplication it will suggest a refactoring
to factor out the common function.

------
olalonde
This could be useful for reverse engineering minified/obfuscated code.

------
gcommer
Interesting concept, now to rush and look up word2vec applications and see if
they make sense for code :p

Also, props for an academic work having an extensive README!
[https://github.com/tech-srl/code2vec](https://github.com/tech-srl/code2vec)

I do wonder if some sort of AST normalization would improve the input signal.
Example 8 on the website shows their system correctly identifying an isPrime
function. However, some irrelevant perturbations can break it. If you swap the
if statement's condition around from `n % i == 0` to `0 == n % i`, the proposed
names are totally different and make no sense.
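
A rough idea of the kind of normalization pass I have in mind, sketched with Python's `ast` module on Python source rather than on Java (which is what the model actually consumes). The class name `CanonicalCompare` and the helper `normalize` are hypothetical, and the rewrite assumes `==` and `!=` are symmetric, which holds for primitive ints but not for arbitrary overloaded operators. It needs Python 3.9+ for `ast.unparse`.

```python
import ast


class CanonicalCompare(ast.NodeTransformer):
    """Rewrite `const == expr` and `const != expr` so the constant is on the right."""

    def visit_Compare(self, node):
        self.generic_visit(node)  # normalize nested comparisons first
        if (
            len(node.ops) == 1
            and isinstance(node.ops[0], (ast.Eq, ast.NotEq))
            and isinstance(node.left, ast.Constant)
            and not isinstance(node.comparators[0], ast.Constant)
        ):
            # Swap the operands; ordering comparisons such as `<` are left alone.
            node.left, node.comparators[0] = node.comparators[0], node.left
        return node


def normalize(source):
    """Parse, canonicalize equality comparisons, and print the source back."""
    tree = CanonicalCompare().visit(ast.parse(source))
    return ast.unparse(tree)


print(normalize("0 == n % i") == normalize("n % i == 0"))
```

Both spellings of the condition normalize to the same tree, so a model fed the normalized form would see identical input for them.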

~~~
namibj
Well, then maybe train it on random AST equivalency transformations? I'd
assume that to work better.

~~~
yahave
Yes, that indeed works better

------
w_t_payne
It would be nice if someone had some code with associated requirements,
use cases, user stories, or even test suites or documentation.

Then we could use an adversarial network to try to learn the relationship
between requirements (or tests) and code.

~~~
thwy12321
JIRA could be parsed to do this, specifically when git commits are tagged to
tickets. Quite an effort to get all that data in the right form though

------
ehsankia
Very cool idea, but other than the examples, I didn't have much luck. It
wasn't able to guess a basic Fibonacci or FizzBuzz, and for isEven, it
returned isOdd, which I guess is close but still.

~~~
urialon
The model was trained on real-world code of top starred repositories from
GitHub.

Although we tend to think of Fibonacci and FizzBuzz as "basic", these are
actually not common in real projects.

~~~
tsumnia
But what would we quantify as "real" projects?

~~~
urialon
We just took the top-starred Java projects from GitHub, assuming that their
source code and naming are of good quality. These included projects like
elasticsearch, hadoop-common and maven.

We assumed that their code and naming quality are worth learning, such that
the learned patterns will transfer to other code and projects, because they
are popular and actively maintained projects. So, as you can imagine, FizzBuzz
is not that common there :)

~~~
stult
You could probably dig up a pretty large number of high quality teaching repos
to supplement those projects. I'm thinking of the types of repos that are
included as supplements for MOOCs or reference books. Then you'll get some of
the canonical gimmicky algorithms like fizzbuzz. But more importantly, you
will get reference implementations of fundamental algorithms. Things like
mergesort or binary tree search. While you are unlikely to see a
straightforward implementation of those algorithms in a production repo, it
wouldn't surprise me to see some of the core abstract patterns repeated over and
over again because they are so fundamental to CS pedagogy. And if you pick
sufficiently high quality repos, you (hopefully) won't be compromising the
code quality metrics driving your selection of those top starred repos.

~~~
jpfed
Consider also Rosetta Code, which should have Java implementations for a ton
of simple problems.

------
wtroughton
This is a really cool demo. For common patterns like get, contains, ends with,
etc, this could be helpful in code representation without having to go into
the details.

------
zelly
We’re being replaced.

