
Deep code search - godelmachine
https://blog.acolyer.org/2018/06/26/deep-code-search/
======
mloncode
Hello there! Glad to see other people working on this area as well. Ho-Hsiang
and I from GitHub have been prototyping this exact same approach and have
published / open sourced our work about a month ago:
[https://towardsdatascience.com/semantic-code-search-3cd6d244...](https://towardsdatascience.com/semantic-code-search-3cd6d244a39c)

~~~
mloncode
Oh and [https://towardsdatascience.com/semantic-code-search-3cd6d244...](https://towardsdatascience.com/semantic-code-search-3cd6d244a39c) is completely open source end-to-end, with code, data, and detailed explanations on how to reproduce it step by step.
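Both projects share the same core mechanic: encode code snippets and natural-language queries into one vector space, then rank snippets by cosine similarity to the query. A minimal retrieval sketch, with a toy hash-based stand-in for the learned encoders (so the ranking itself is meaningless, only the mechanics are real):

```python
import math
import zlib

def embed(text, dim=8):
    # Toy stand-in for a learned encoder. The real systems train neural
    # networks so that matching code/description pairs land close together;
    # here we just derive a deterministic unit vector from the text.
    seed = zlib.crc32(text.encode("utf-8"))
    vals = [((seed >> (3 * i)) % 17) - 8.0 for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in vals)) or 1.0
    return [v / norm for v in vals]

def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def search(query, snippets, top_k=2):
    # Rank all snippets by similarity to the query embedding.
    q = embed(query)
    return sorted(snippets, key=lambda s: cosine(q, embed(s)), reverse=True)[:top_k]

snippets = [
    "def read_file(path): return open(path).read()",
    "def sort_desc(xs): return sorted(xs, reverse=True)",
    "def md5sum(data): return __import__('hashlib').md5(data).hexdigest()",
]
print(search("sort a list in decreasing order", snippets))
```

Swap `embed` for a trained model and precompute the snippet vectors offline, and this is essentially the serving path both write-ups describe.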

------
ryanackley
I think the reverse would be far more useful. Here is a giant block of code:
break it down into snippets and explain what they are doing.

I rarely have trouble finding examples of code using Google. When I do, I'm
most likely using obscure languages or it's a very niche programming scenario.

------
iamrohitbanga
Great! Brings back some old memories. Straight out of college, I interviewed
with some big company for a product management role. At the time the
interviewer asked me to pick a website I like and describe what new features I
would like to see in it. I picked github.com and said that GitHub has such a
big corpus of code, and people have so many questions about how to do this or
that in a programming language. I would like to see a way to search that
corpus and see related code examples. Also, maybe find common bug patterns and
suggest fixes (I didn't know about FindBugs at the time). Sadly I didn't have
a good way to implement it or a solid design idea at the time.

------
adrianmonk
I wonder if a similar approach could be used to help discover common
programming errors and their solutions, i.e. data-mine bug fixes in hindsight.

It would work something like:

1. Go through a repo's history and look at commits.

2. Using text classification of commit messages and/or cross-referencing with
a bug tracker database (and limiting to issues that are really defects, not
feature requests), identify commits that fix bugs.

3. Now you have before and after code. Try to discover the salient part of
what changed, perhaps by parsing them both and comparing ASTs, or by diffing
the text. Or run the unit test (that you added with the fix) through the old
code and the new code while tracing execution to see differences in the
dynamic behavior of the code.

4. Correlate this with descriptions of what the method is trying to do.

Perhaps you might be able to generate statements like "when fixing bugs
related to _create a daemon_, programmers often _add calls to close()_ or
_add calls to umask()_".
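Step 2 could start life as a crude keyword heuristic before graduating to a trained classifier. A sketch (the regexes and example messages are my own illustration, not from the proposal):

```python
import re

# Crude stand-in for step 2's commit-message classification. A real
# pipeline would train a text classifier and cross-reference the bug
# tracker as described above; this just pattern-matches the message.
FIX_RE = re.compile(r"\b(fix(es|ed)?|bug|defect|crash|regression)\b", re.I)
FEATURE_RE = re.compile(r"\b(feature|implement(s|ed)?|add(s|ed)? support)\b", re.I)

def looks_like_bug_fix(message):
    # Flag messages that mention defect vocabulary but not feature work.
    return bool(FIX_RE.search(message)) and not FEATURE_RE.search(message)

commits = [
    "Fix daemon crash: add missing umask() call",
    "Implement dark mode toggle",
    "Refactor logging; no functional change",
]
print([m for m in commits if looks_like_bug_fix(m)])
# -> ['Fix daemon crash: add missing umask() call']
```

From there, `git show` on each flagged commit gives you the before/after pair that steps 3 and 4 operate on.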

------
nl
I've been playing around in this area for a while and I think there is a lot
of potential - although I'm not sure code search is the ultimate expression of
how this should be used.

There's another interesting paper called code2vec [1]. Code2vec uses the AST of
a code snippet and builds an embedding from that, rather than dealing with the
code as just a sequence of tokens, as some earlier attempts do.

This paper does the same, which is nice.

Code2vec (at least the demo at [2] which I think is the same thing) is
extremely sensitive to variable names. I'm not sure if that is a bug or if
they are including that in their embedding.
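For what it's worth, code2vec's path-contexts keep the terminal tokens, variable names included, which would make the sensitivity a design consequence rather than a bug. A rough sketch of leaf-to-leaf path extraction using Python's `ast` module (a simplification: the real code2vec computes the shortest path through the lowest common ancestor, and uses its own extractor):

```python
import ast
import itertools

def leaves(tree):
    # Collect (leaf_token, path_of_node_types_from_root) pairs.
    found = []
    def walk(node, path):
        path = path + [type(node).__name__]
        if isinstance(node, ast.Name):
            found.append((node.id, path))        # variable names survive!
        elif isinstance(node, ast.Constant):
            found.append((repr(node.value), path))
        for child in ast.iter_child_nodes(node):
            walk(child, path)
    walk(tree, [])
    return found

def path_contexts(code):
    # code2vec-style (left_leaf, path, right_leaf) triples, simplified:
    # we join the tails of the two root paths instead of computing the
    # true shortest path through the common ancestor.
    for (a, pa), (b, pb) in itertools.combinations(leaves(ast.parse(code)), 2):
        yield (a, "^".join(pa[-2:]) + "_" + "^".join(pb[-2:]), b)

for ctx in path_contexts("total = price * qty"):
    print(ctx)
```

Because the leaf tokens themselves feed the embedding, renaming `price` to `x` changes every context it appears in, which matches the sensitivity the demo shows.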

If people have specific things they'd like to do if they had a tool which had
deep understanding of the intent of code I'd be pretty interested to hear
about it. Contact details in my profile, or reply here.

[1] [https://arxiv.org/abs/1803.09473](https://arxiv.org/abs/1803.09473)

[2] [https://code2vec.com/](https://code2vec.com/)

~~~
fullstackchris
Another cool paper along a similar thread is DLPaper2Code: Auto-generation of
Code from Deep Learning Research Papers
([https://arxiv.org/abs/1711.03543](https://arxiv.org/abs/1711.03543)).

I see opportunities with tools like this in the form of apps and/or SaaS, IF a
team is willing to put thousands of hours into a top-quality way to make the
machine learning super user-friendly. The only comparable product of such a
nature I've seen is deepl.com/translator (a translator better than Google
Translate that leverages deep learning - don't trust me? go try it).

------
Normal_gaussian
Anyone know if there is a chance we will actually get our hands on the core of
this so we can implement it on existing codebases?

It would be an invaluable tool to search through specific codebases for the
place they do X as well as for answering questions on how to do X.

~~~
cetra3
Code is here I believe: [https://github.com/guxd/deep-code-search](https://github.com/guxd/deep-code-search)

~~~
daveyand
Yeah it is. I've cloned it and am trying to run it, but I'm getting the
following error:

    python codesearcher.py --mode train
    Traceback (most recent call last):
      File "codesearcher.py", line 10, in <module>
        from datashape.coretypes import real
    ImportError: No module named datashape.coretypes

clearly I'm doing something wrong and have no idea how Python works :P

~~~
PurplePanda
something like `pip install --user datashape`, maybe.

------
daveyand
has anyone been able to implement this yet? I'd love to test it on my codebase
(which is 15 years old and a monolith)

------
keeganpoppen
ok wow that actually seems to work pretty damn credibly[1], at least given the
inputs that i could come up with. i did find it difficult to come up with
queries that (1) clearly would rely more on the code embeddings than the
description embeddings (which i'd assume are at least pretty decent on their
own), while (2) actually making sense / having a sensical answer.

the best i could come up with is describing what you want ~operationally,
rather than with domain-specific (read: predictive) jargon. but while also not
underspecifying it to the point where the query is nonsense[2].

but when i try the query "sort the operands in decreasing order", not only
does it get a bunch of sorting functions[3], but i'll be damned if the top 2
results weren't: (1) a `swapOperands` function that takes two Comparator<T>s
and returns a new one that invokes the comparison with the operands reversed
and (2) a function that sorts a deque into decreasing order.

obviously that's not the perfect query for telling the relative contributions
apart because "operands" is kinda "jargon"-y by my earlier standard, but the
results did (correctly) address the only reason i'd ever search for something
like that: to remember if the default comparison operator is "A - B" or "B -
A". if you change "operands" to "elements", the results do get a little
"worse", but i'd argue that the query (1) is actually quite a bit more vague
in that formulation and (2) is less likely to represent a code snippet that
actually exists, if only because most comparison operators on individual
"containee" types (~"element[ type]s") are parameterized by the sort direction
[citation needed].

tl;dr - lmk if anyone thinks of particularly good ways to "fool" this and/or
demarcate what kinds of things are and are not well-represented by the code
embeddings; to my eye they definitely seem to be doing a decent job of...
embedding the code... as code embeddings are wont to do.

[1] not to "damn with faint praise" -- if you had asked me whether something
like this paper would work well enough to be useful i... would have guessed
"no" :)

[2] e.g. queries like "do some machine learning on the user images" and "...
on the training data" don't get very "good" results in terms of topic
proximity, but there really isn't a sensible response anyway, so if anything
that's a good sign-- garbage in, garbage out

[3] which obviously is to be expected, as P(<can't find sorting functions> |
<paper published>) is (hopefully) pretty low...

