
GitHub releases an ImageNet for code and a CodeSearchNet challenge - slewis
https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/
======
mloncode
Hello folks, this is Hamel from GitHub -- I'm one of the Machine Learning
Engineers who worked on this project. The reason we are excited to host this
data is that we believe the community will be able to innovate and advance the
state of the art much faster if the data is provided in a tractable format for
machine learning researchers. This data is already public; however, specialized
knowledge is required to acquire, parse, dedupe, and clean code from many
programming languages at massive scale on GitHub. We strove to reduce these
barriers to encourage greater involvement.
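
If you want to poke at the data yourself, it ships as gzipped jsonlines
files. Here's a minimal loading sketch; the field and file names are from my
reading of the repo README, so treat them as assumptions:

    import gzip
    import json

    # Read one shard of the corpus (downloaded via the links in the repo
    # README). Each line is a JSON object describing one function.
    def load_shard(path):
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                yield json.loads(line)

    # Example: count how many functions come with a docstring. Field names
    # ("docstring" etc.) follow the repo's documentation; adjust if your
    # copy of the data differs.
    total = documented = 0
    for record in load_shard("python_train_0.jsonl.gz"):
        total += 1
        if record.get("docstring", "").strip():
            documented += 1
    print(f"{documented}/{total} functions have docstrings")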

While we present the task of information retrieval as one possible use case of
this dataset, we know there could be other practical applications of this data
(e.g., code summarization). While we went to great lengths to pre-process the
data for the community, the data is still messy: often you will find that a
code snippet we parsed has no high-quality comment aligned with it. However, we
believe this is part of the excitement of the dataset -- it poses challenges
that machine learning practitioners will have to address.

Code is very different from natural language with regard to its structure and
syntactic rules, and may benefit from approaches different from those used in
natural language processing. Our baseline models and benchmarks mostly treat
code as natural language, but we are aware that there could be an opportunity
to innovate on this front. If anyone creates any interesting projects from this
dataset, please do get in touch. Happy to answer any questions!

~~~
flohofwoe
I know this sounds silly because code on GitHub is visible to everybody anyway
and that's a good thing, but I would appreciate a way to opt my own code out of
automated data-gathering programs like this one, built for machine learning
purposes.

Something like a robots.txt for GitHub projects. Not that anybody would really
care; it would only make my intent clear that I don't support this sort of mass
data-gathering nonsense.

~~~
kogir
You opt out by using a restrictive license or making your repo private.

~~~
flohofwoe
Using a restrictive license seems a bit extreme when the whole point of a
public GitHub repo is to make the code available to other programmers.

I consider this kind of mass data gathering for machine learning an edge case
that should not harm the 'legitimate' users of the repository.

~~~
bpt3
Why do you see this effort as not "legitimate"?

------
slewis
Shawn from Weights & Biases here. We've been working with GitHub and Microsoft
Research on this for just about a year now, and we're super excited to launch
it today.

We've seen huge advances in human language modeling and translation due to the
success of deep learning. Often new directions start with a really motivated
team producing a new kind of dataset. Who better to do that for code as
language than GitHub!

This started as a grassroots effort inside of GitHub and went through many
iterations. When it was presented to GitHub's CEO six months ago, he correctly
pointed out that we needed to go back and include GitHub's most popular
language (JavaScript). As the project went on, many smart people chipped in,
and we produced something that we think is truly useful.

Check out the paper here:
[https://arxiv.org/abs/1909.09436](https://arxiv.org/abs/1909.09436)

We overcame plenty of challenges to pull this off. For example: how do you
clean this data? How do you label it? We've got folks from GitHub, Microsoft
Research, and Weights & Biases here to answer any and all questions you might
have. Can't wait to see where this goes!

~~~
codetrotter
> We've seen huge advances in human language modeling and translation due to
> the success of deep learning.

I wonder if we'll eventually see a system where, instead of writing code, you
describe in natural language what you want the program to do, and ML is then
applied to generate the code for you.

I mean, a lot of people have been interested in making human-friendly
programming languages in the past, with varying degrees of success.

Personally I love writing code, but it could happen, couldn't it?

Write some unit tests and a human description of what the program does, and,
based on the source code and descriptions of existing software, the system
would basically "debug the program into existence" for you.

That'd be kind of freaky, kind of cool, and a little bit scary.

~~~
mikekchar
Find a human who wants a computer program written (A). Find a human who can
program computers (B). Have A describe what they want and have B code it
without asking questions to clarify the issues further. What do you expect the
result to be like? My experience tells me it will be a total failure.

The problem with programming is, for the most part, _not_ the encoding of the
requirements in a programming language. The problem is that the specifier (A
in this example) usually does not have a full grasp of what they actually
want. In fact, they usually don't have any idea at all. "Give me an e-shopping
system to sell comic books" is the level of detail they can provide.

The closer A can come to expressing the requirements they need, the closer
they are to actually being B. B's real skill is not in knowing the syntax and
grammar of the computer language; it's in knowing that, in order to make a
system that will satisfy A, we need to do X, Y, and Z down to the tiniest
detail.

We get into trouble with our software when we write code that is dramatically
more complex than the problem we are trying to represent. This doesn't happen
so much because we don't know how to program; it happens because we are slowly
extending the code base over time with imperfect knowledge of what we are
ultimately building. We also have to trade off the benefit of getting something
done against discovering generalities that would let us simplify the code we
already have.

I don't think we will ever replace "programmers" with AI -- at least not until
the AI can be trained to ask the important questions about what the system
_really_ needs to be (and for that we need Turing-test-passing AI). I think
it's much more likely that we will build more and better tools that help
programmers visualise and plan the programming situation. I think we _will_
have automatic code generation, because we already have it: look at "derive" in
Haskell and Rust, for example. But I think _that's_ the level of automatic
code generation we're going to want for at least the next 20 years or so.
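
(For readers who don't know those languages: Python's `dataclasses` module is
the same flavor of mechanical code generation. A small illustrative sketch:)

    from dataclasses import dataclass

    # The decorator derives __init__, __repr__, __eq__, and the ordering
    # methods from the field declarations -- the same kind of automation
    # as Haskell's `deriving` or Rust's `#[derive(...)]`.
    @dataclass(frozen=True, order=True)
    class Point:
        x: float
        y: float

    p, q = Point(1.0, 2.0), Point(1.0, 2.0)
    print(p)                    # Point(x=1.0, y=2.0) -- derived __repr__
    print(p == q)               # True -- derived __eq__
    print(p < Point(2.0, 0.0))  # True -- derived ordering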

Interestingly, for testing I think we'll actually go in the opposite direction:
we will spend more time thinking about requirements, and the computer will help
us by writing tests that challenge our assumptions: "I've broken this function;
are you sure you got it right?" Again, we already have these kinds of systems
(mutation testing works roughly this way), and I think _this_ is the most
appropriate direction for research investment.

~~~
neuronexmachina
The only way I think something like what you described in your first paragraph
would work is if you had an AI system B that could present questions and
prototypes back to requirement-setter A for feedback. Of course, that'd be a
very difficult problem, even if you limit it to a constrained domain.

------
s_Hogg
Code challenges like this are interesting - thank you for putting this
together!

I do have a question about the set-up, if that's alright. Netflix and others
have found that shared tasks can lead to great models, but not necessarily
ones suited for use in a production environment. Have you put much thought
into how best to set up a challenge such as this so that the obvious
"ensemble everything" solution is less worthwhile?

Similarly, have you put much thought into how to encourage the sharing of
information between participants?

Thanks again.

~~~
metaphdor
Excellent questions, thank you!

1. We could log additional information about the model, such as inference
time, number of parameters, and memory usage, and make the primary metric
overall efficiency (best NDCG with the fewest parameters/fastest runtime/etc.;
there's a sketch of NDCG after this list).

2. We're experimenting with different kinds of benchmarks, and I am most
excited about explicitly collaborative ones. In these there is no
contest/prize (hence no incentive to cheat or withhold information); only the
shared goal of improving the model and our collective understanding of the
problem. I hope we can incentivize information sharing by tracking and
acknowledging individual contributions to the eventual best model in the
benchmark. We could approximate individual contribution by seeing which
scripts, code segments, workflows, architectural changes, writeups, or
discussion comments other participants rate as the most helpful or choose to
include in their experiments most often as the benchmark evolves. Of course
this could only be an estimate -- as Shawn says above, any idea could have
"actually happened in a hallway conversation". Still, this is much easier to
achieve in a logging/visualization platform like W&B than in the current
paradigm of "read research papers, clone relevant repos, spend weeks trying to
synthesize/reproduce their results, run your own experiments, write them up in
a research paper, hope it gets accepted to a conference before other people
publish the same idea, try to integrate your changes/publish your own repo,
repeat" -- and that's for hundreds of practitioners, ranging from brand new
students to PhDs, working on related problems. This cycle is especially
challenging for folks who are new to, working outside of, or trying to
collaborate across the relatively few established and well-funded academic and
industrial teams.
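
Here's the NDCG sketch promised in point 1 -- the textbook definition, not
necessarily the exact evaluation script used in the challenge:

    import math

    def dcg(relevances):
        # Discounted cumulative gain: results further down the ranking
        # contribute less (log2 discount).
        return sum(rel / math.log2(rank + 2)
                   for rank, rel in enumerate(relevances))

    def ndcg(relevances):
        # Normalize by the DCG of the ideal ordering, so a perfect
        # ranking scores 1.0.
        ideal = dcg(sorted(relevances, reverse=True))
        return dcg(relevances) / ideal if ideal > 0 else 0.0

    # Relevance of each returned result, in rank order; the most
    # relevant document was returned third, so the score is penalized.
    print(ndcg([0, 1, 2, 0]))  # ~0.62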

Collaborative benchmarks can be especially impactful for social-good projects,
where the primary incentive is to figure out and broadly implement the best
solution ASAP (e.g. climate change!), not to make money or argue over credit
attribution. So my long-term goal is as much sharing of information and
collaboration from as many folks as possible -- the more inclusive and
transparent the field of deep learning research becomes, the safer and better
its outcomes. Very open to ideas on how to help make this happen.

~Stacey, deep learning engineer at W&B

------
mishraka
Great effort in putting together this large corpus! While reading through your
paper, I noticed the difficulties you faced in correctly annotating the code
for quality and correctness, and in hiring annotators for different languages.
I can imagine how herculean this task must be.

I was wondering whether you considered including Stack Overflow questions and
answers, which have been vetted by thousands of programmers over a long period
of time. Stack Overflow might even want to participate in this effort to
provide clean ground truth for this great project.

~~~
mallamanis
[I'm one of the Microsoft Research people who worked on this]

We did consider adding StackOverflow questions. Some of our queries in the
CodeSearchNet challenge do actually come from StackOverflow (via StaQC [1]).
It's certainly interesting to see how other SO data could be useful for this
task. Thanks for the suggestion!

The reason we didn't try this at this point: many people in research have
tried working with SO data, and in my experience there is an interesting
problem with it: it's deduplicated! This is great for users but bad for
machine learning, since the data looks "sparse" (roughly, each concept appears
once). Sparsity is an obstacle, since it's hard for most existing machine
learning methods to generalize from sparse data. In contrast, in natural
language there are (e.g.) multiple articles describing more or less the same
event.
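
To illustrate the duplication point (this is my own sketch, not the pipeline
from the paper): a crude way to flag near-duplicates in a code corpus is
token-set overlap, and an already-deduplicated source like SO yields almost no
such pairs.

    def jaccard(tokens_a, tokens_b):
        # Jaccard similarity of two token sets: |A & B| / |A | B|.
        a, b = set(tokens_a), set(tokens_b)
        return len(a & b) / len(a | b) if a or b else 0.0

    def near_duplicates(snippets, threshold=0.8):
        # O(n^2) pairwise scan -- fine for a sketch, but at corpus scale
        # you would use MinHash/LSH instead.
        tokenized = [s.split() for s in snippets]
        return [(i, j)
                for i in range(len(tokenized))
                for j in range(i + 1, len(tokenized))
                if jaccard(tokenized[i], tokenized[j]) >= threshold]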

[1]
[https://ml4code.github.io/publications/yao2018staqc/](https://ml4code.github.io/publications/yao2018staqc/)

------
CountHackulus
Code-reuse search engines are something that Haskell is excellent for. With
Hoogle you can search for a generic type signature and probably find the
function you want.

Want to flatten a list of Maybes? Just search for [Maybe a] -> [a] and you'll
find catMaybes and takeWhileJust.

This of course only works for Haskell because so much of a function is
determined by its types. Doing this sort of thing in, say, JavaScript would be
an absolute nightmare. Still, it's an interesting train of thought.

------
Dowwie
I've been experimenting with my own custom markup for annotating code blocks.
I want to document code use cases and automatically generate an index of
independent sandbox examples. This approach doesn't require AI, just the
manual effort of remembering to annotate novel uses in code and a standard
convention for doing so. One problem this solves: I've written so much code
that I lose track of where I encountered certain patterns, and this helps me
track that better. It also facilitates knowledge sharing across teams with
access to the source.
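
Here's roughly what I mean, as a minimal Python sketch (the `@usecase:` marker
is just my own convention, not a standard):

    import re
    from pathlib import Path

    # Scan a source tree for annotation comments of the form
    #   # @usecase: retry-with-backoff
    # and build an index mapping each tag to where it appears.
    MARKER = re.compile(r"#\s*@usecase:\s*(\S+)")

    def build_index(root):
        index = {}
        for path in Path(root).rglob("*.py"):
            for lineno, line in enumerate(path.read_text().splitlines(), 1):
                match = MARKER.search(line)
                if match:
                    index.setdefault(match.group(1), []).append(f"{path}:{lineno}")
        return index

    for tag, locations in sorted(build_index("src").items()):
        print(tag, "->", ", ".join(locations))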

Has anyone worked on something like this who can comment or share ideas?

~~~
lifeisstillgood
sort of ...

I am playing with `todoinator` -- it is a way of finding all the "todos" I
leave scattered around my code, but it also gives me ways to rank my code.
It's not quite where you are, but I think the principle of having almost
everything derived from code is the guiding light here.

[https://github.com/mikadosoftware/todoinator/blob/master/tod...](https://github.com/mikadosoftware/todoinator/blob/master/todoinator/todoinator.py)

------
mikekchar
I haven't looked at this in detail, but where I could see this technology
going is toward a kind of lint. We've got hand-built linters for most
languages, and some of them are really good (Clippy in Rust is amazingly good
-- it practically writes my code sometimes). However, a tool that analysed my
code, picked out things that were idiomatically strange, and then suggested
example code that might be better _would_ be quite useful, I think.

~~~
melling
I’d like to type my code in one language and have it translated into another.

Someday I’d like to learn Rust, for example.

If I could leverage the languages that I already know, I could more quickly
build something more useful.

~~~
nestorD
A seq2seq model trained on something like Rosetta Code might be a viable way
to do that...

------
mkagenius
A phrase is never enough to describe the task at hand. It may work for simpler
use cases, like the results on Stack Overflow. Otherwise I do not see it doing
better than a Google search that leads me to Stack Overflow.

~~~
nbardy
You're right. We'll need bigger datasets and more complex models to achieve
more useful results, but machine learning models can give predictions in real
time, so you could have this running as you type in your IDE instead of
visiting a website, perhaps with a model fine-tuned to your code base.
"Simpler use cases like the results on Stack Overflow" could explain at least
10% of the work I do, so even just that tool would be very useful for mediocre
programmers like myself.

Behind all the hype, predictive text is something machine learning models are
beginning to do very well. Gmail has rolled out a lot of similar features
built on advances in deep learning models.

------
cmroanirgo
> _Searching for code to reuse, call into, or to see how others handle a
> problem is one of the most common tasks in a software developer’s day_

I don't disagree at all that this is how we code these days... but I
distinctly remember a time when this wasn't so. We had to do everything
ourselves. We engineered our solutions based on various requirements and
constraints, and most importantly, we had to figure things out ourselves. The
only external help we had was with the APIs we used... and they had to be
_studied_ to be understood.

Even in recent times, the most fun I've had _programming_ has been when it's
all come from my wee little head, rather than trawling for solutions and
shoehorning/reworking something similar.

~~~
probably_wrong
Only mildly related, but your comment reminded me of a recent experience that
made me fond of Java's way of doing things.

I had to use a tool for work (name withheld to protect the guilty) that has
awful documentation. The tool allows you to write snippets of your own code,
but provides no IDE and no documentation (AFAIK) of anything but the most
trivial aspects of the API.

I started using Python, but with no debugger and no interactive shell there
was no way I was going to guess the names of the functions I needed. Luckily
for me, someone had uploaded the Javadoc of an older version of the API, and
that was the missing piece of my puzzle: with the function names, the return
types, and Java's stack traces, I now had all I needed.

Back to the topic: like you, I sometimes wonder if there's a downside to not
having to scroll through hundreds of manual pages anymore. But until someone
shows some kind of evidence of something being lost, I won't worry too much.

That said, I definitely wish more companies would make their documentation
available offline, if only as a static copy of the online version. For those
of us who regularly program on trains and planes, offline docs are a
lifesaver.

------
psankar
A little off-topic.

> Our fully preprocessed CodeSearchNet Corpus is available for download on
> Amazon S3

I am surprised that GitHub went with S3 for this download. Isn't there an
Azure equivalent of S3 for large object storage? This just shows the dominance
of AWS.

~~~
powersnail
It's a good thing that MS doesn't force every team to always choose MS
products, especially for something as trivial as download storage. Maybe the
team was more familiar with AWS; who knows. I'm just glad that they can make
this decision.

------
zelly
Looks like they're using tokens as input, as if code were natural language.
You have an AST -- use it. I think the limitation is the lack of graph- and
tree-based neural networks, but I don't think you need a neural network to
search code. You already have the AST. This could be solved with a traditional
analytic approach, though it's probably a lot of hard work.
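
For Python, at least, the standard library already hands you the tree, so
structural features are a short script away. A minimal sketch:

    import ast

    source = """
    def flatten(xs):
        return [x for row in xs for x in row]
    """

    # Parse into an AST and pull out structural facts that a token-level
    # model never sees explicitly: function names, arity, and the node
    # types that make up the body.
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            print(node.name, "takes", len(node.args.args), "arg(s)")
            print(sorted({type(child).__name__ for child in ast.walk(node)}))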

------
crispyambulance
Something like this might be useful for making better "documentation" for
things like APIs and man pages.

Documentation tends to be either simplistic "hello world" examples or
everything-but-the-kitchen-sink dumps that take forever to consume. Neither of
these is helpful to a practitioner who just wants to get through a basic task
without starting a mini research project, or getting burnt to a crisp on
Stack Overflow.

So basically, I am thinking this corpus could be used to filter for specific
examples of usage within particular contexts and problem domains? Or maybe
not?

------
jakobgm
Since the CodeSearchNet Corpus contains metadata such as owner/repository, it
would be nice to create a search tool for the data set itself. That way you
could check if, by chance, some of your open source code is part of the
corpus.

The data set is apparently ~20GB [0], so a cheap VPS instance might do the job
of hosting the data in a searchable format.

[0] [https://github.com/github/CodeSearchNet#downloading-data-from-s3](https://github.com/github/CodeSearchNet#downloading-data-from-s3)
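
A minimal sketch of that check, assuming the per-function `repo` metadata
field documented in the README (the directory layout and repo name below are
hypothetical):

    import glob
    import gzip
    import json

    MY_REPOS = {"jakobgm/some-project"}  # hypothetical repo name

    # Walk the unpacked shards (adjust the glob to wherever you put the
    # ~20GB download) and print any functions drawn from your repos.
    for shard in glob.glob("*/final/jsonl/*/*.jsonl.gz"):
        with gzip.open(shard, "rt", encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record.get("repo") in MY_REPOS:
                    print(shard, record.get("path"), record.get("url"))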

------
AndrewKemendo
I think this is fantastic and I'm excited to see what people come up with
here.

It's very disappointing to see so many negative comments here, almost all
about licensing, despite the licensing being well explained.
You'd think this place was full of lazy bureaucrats.

------
thrownaway954
I'm not trying to be a jerk here, but screw it.

I'm sure I'll end up using whatever it is you've done at GitHub at some point.
However, from my 10-second scan of the page, I know absolutely NOTHING about
what this is or what it can do for me. Is it just me, or do I _constantly_ see
examples of developers being the worst marketers in the world? Where are the
code examples or videos showing whatever this is and how it will help me?
There is nothing on that page but a HUGE image and a bunch of text I can't
understand without 4 PhDs and half a tab of Adderall.

It's so frustrating to see stories on Hacker News that are explained so
poorly. You've most likely worked hard at this for a long time; why can't you
take 5 minutes to explain it in layman's terms for all to understand?

END RANT

~~~
mabrocks
This is a release of a dataset, to help people train machine learning models.
That's why the blog post says "We’re also releasing a large dataset to help
data scientists build models for this task."

We also provide a way to evaluate how well your machine learning model works.
That's why the blog post says "We’re announcing the CodeSearchNet Challenge
evaluation environment."

The released data and code (and hence, the announcement) are meant for data
scientists and ML researchers who want to work on this problem, and the rest
of the world does not need to care. There are no products or applications of
this work at this time. If the terms in the blog post don't mean anything to
you, then you are not in the target audience for this announcement, and that's
OK.

------
agsilvio
I remember a search engine for developers called SymbolHound. It wasn't
fruitful for me, but I'll never forget it.

------
baalimago
Very cool! I wish I had more time!

