While we present the task of information retrieval as one possible use case of this dataset, we know there could be other practical applications of this data (e.g. code summarization). While we went to great lengths to pre-process the data for the community, the data is still messy, and you will often find that there are no high-quality comments aligned with a code snippet we parsed. However, we believe this is part of the excitement of the dataset: it poses challenges that machine learning practitioners will have to address.
Code is very different from natural language with regard to its structure and syntactic rules, and may benefit from different approaches than those used in natural language processing. Our baseline models and benchmarks mostly treat code as natural language; however, we are aware that there could be an opportunity to innovate on this front. If anyone creates any interesting projects from this dataset, please do get in touch. Happy to answer any questions!
We used the repo-level license information from GitHub to filter to free non-copyleft licenses, and the license files are actually stored next to the extracted data corpus (see https://github.com/github/CodeSearchNet#Licenses).
It seems irresponsible for Github to continue distributing infringing code when it clearly has the tools to prevent that happening. Youtube has fingerprinting for infringing music and video. Github should be doing the same with source code.
If you're saying they are the entity in the best position to identify infringement, I would agree, but what tools does Github already have in place to do that?
By failing to do so, Github really seems to be encouraging copyright infringement.
If Github were somehow helping copyright infringers, that would be encouragement. What you're describing is additional effort on their part to actively discourage copyright infringement, which is not the same thing.
What are you talking about?
From the paper, it's not really clear how the resulting binary or intermediate representation fits into the picture? Wouldn't a robust model have to exploit features of the underlying computer architecture as well?
Besides the possible applications mentioned, such as recommendation systems and code generation, I am wondering about automatic code annotation and documentation, with the eventual goal of creating a "code tutor" that would assist students in real time as they type. I think what's interesting about this is that talent among CS grads tends to correlate with a deep understanding not only of the history of digital logic design, but also of math, physics, etc.
Fantastic work ;)
Thanks for your questions! We have thought of many heuristics, but we didn't want to constrain the dataset release on some heuristic that we picked, possibly ruining the dataset. Participants in the challenge should feel free to apply additional filters as they see fit. For example, this work could be useful as a filtering method.
Unfortunately, we do not have the budget to provide any compute resources to help with running the models at this time. Note that any techniques developed with this dataset will be owned by those who develop them and it's up to them how/if they will make them available/open-source.
All code is from public repositories. We only used publicly available APIs to obtain the data, so that others can reproduce the results / build on top of the data extraction pipeline we built: https://github.com/github/CodeSearchNet/tree/master/function...
It's also easy to test without access to private repos, by just splitting some repos off as a separate test set, so should hopefully be something that other people can also make progress on!
Do you realize how rude and disrespectful that tone is?
There is a civil way to air one's views; one that we use naturally in face-to-face conversations. Yet once it's from behind a computer screen, we somehow seem to think it's okay to lash out in a barrage of hurtful language...
C’mon, cut these folks some slack — GitHub is huge, and one annoying adjacent feature doesn’t give you the right to attack these folks like this.
> Just get the stupid dedupping done, for God's sake.
> You have no excuse to so badly that a teenager can beat you by simple handcrafted rules.
> First learn and do the basics and then come here to talk about your need for ML.
> No one needs to suffer like this.
Let’s see, the OP belittles the original authors for something they’re not responsible for, implies that they’re incompetent, tells them to go away until their problem is fixed, and, oh, for good measure: invokes a deity.
This is not “straightforward feedback” to me.
How about we just cut all the insults and shift the tone a bit:
> GitHub has a really poor search experience compared with most modern web services. Search doesn't need fancy ML or DL, and dedupping alone would help a ton. A simple TF-IDF similarity metric would make GitHub’s search 10,000% better. You even know forks, names of files, folders, number of stars.
> PS: Sorry, I had to get this out. I've spent the better part of my productivity over the past two years sifting through pages 50 and 100 of undedupped results that GitHub returns, and it’s been extremely frustrating.
See how this doesn’t blame the authors for something that isn’t their fault? Doesn’t imply they’re stupid or incompetent for some other feature they have nothing to do with? Doesn’t tell them to go away? And a bonus: It presents some approaches to solve a problem the author is frustrated by.
My experience ranting on the internet is that personal attacks don’t lead to feature improvements.
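For what it's worth, the TF-IDF dedupping idea in that rewritten feedback really is simple. Here's a stdlib-only sketch (the snippets and the similarity threshold are made up for illustration) that flags near-duplicate search results via cosine similarity of TF-IDF vectors:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    # Term frequency weighted by inverse document frequency, stdlib only.
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n = len(docs)
    return [
        {t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]

def cosine(u, v):
    # Cosine similarity between two sparse (dict-based) vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [
    "def quicksort(arr): pivot partition recurse",
    "def quicksort(arr): pivot partition recurse",   # e.g. an unmodified fork
    "def parse_config(path): read yaml settings",
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]))  # ~1.0 -> near-duplicate, collapse it
print(cosine(vecs[0], vecs[2]))  # ~0.0 -> distinct result, keep it
```

A real deduplicator would of course tokenize code properly and pick a threshold empirically, but even this level of similarity scoring is enough to collapse unmodified forks in a result list.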
I think the pointed out asymmetry was spot on.
Something like a robots.txt for github projects. Not that anybody would really care, only to make my intent clear that I don't support this sort of mass data gathering nonsense.
I consider this mass data gathering use for machine learning an edge case which should not harm the 'legitimate' users of the repository.
We've seen huge advances in human language modeling and translation due to the success of deep learning. Often new directions start with a really motivated team producing a new kind of dataset. Who better to do that for code as language than Github!
Check out the paper here: https://arxiv.org/abs/1909.09436
We overcame plenty of challenges to pull this off. For example: How do you clean this data? How do you label it? We've got folks from GitHub, Microsoft Research, and Weights & Biases here to answer any and all questions you might have. Can't wait to see where this goes!
I wonder if we’ll eventually see a system where instead of writing code you describe in natural language what you want the program to do and then ML is applied in order to generate the code for that.
I mean, a lot of people have been interested in the past in making human programming languages, and had varying degrees of success.
Personally I love writing code, but it could be done, couldn’t it?
Write some unit tests, a human description of what it does and based on the source code and description of existing software the system would basically “debug the program into existence” for you.
That’d be kind of freaky, kind of cool and a little bit scary.
The problem with programming is not the encoding of the requirements in a programming language, for the most part. The problem is that the specifier (A in this example) usually does not have a full grasp of what they actually want. In fact, they usually don't have any idea at all. "Give me an e-shopping system to sell comic books" is the level of detail they can understand.
The closer A can come to expressing the requirements they need, the closer they are to actually being B in reality. B's real skill is not in knowing the syntax and grammar of the computer language, it's in knowing that in order to make a system that will satisfy A we need to do X, Y, and Z to the tiniest detail.
Where we get into trouble with our software is when we write code that is dramatically more complex than the problem we are trying to represent. This doesn't happen so much because we don't know how to program; it happens because we are slowly extending the code base over time with imperfect knowledge of what we are ultimately building at any given time. We also have to trade off the benefit of getting something done against discovering generalities that allow us to simplify the expression of the code we already have.
I don't think we will ever replace "programmers" with AI -- at least not until the AI can be trained to ask the important questions about what the system really needs to be (and for that we need Turing-test-passing-level AI). I think it's much more likely that we will build more and better tools that help programmers visualise and plan the programming situation. I think we will have automatic code generation because we already have it: Look at "derive" in Haskell and Rust, for example. But I think that's the level of automatic code generation we're going to want for at least the next 20 years or so.
Interestingly for testing, I think we'll actually go the opposite direction: We will spend more time thinking about requirements and the computer will help us by writing tests that challenge our assumptions: I've broken this function, are you sure you got it right? Again, we already have these kinds of systems and I think that this is the most appropriate direction to invest in research.
This is already possible, but not with deep learning, which is probably the reason you haven't heard of it.
Learning programs from specifications (not necessarily in natural language) is the subject of the field of Program Synthesis.
Learning programs from examples of their inputs and outputs (which is basically writing unit tests) is the subject of Inductive Programming.
Inductive Programming encompasses the fields of Inductive Logic Programming and Inductive Functional Programming, that target logic and functional programming languages, respectively.
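To make the inductive idea concrete, here's a toy sketch (the candidate programs and examples are invented for illustration): enumerate small programs and keep the first one consistent with every input/output example, which is exactly the role unit tests play.

```python
# Candidate "programs" in a tiny hypothesis space; real synthesizers
# enumerate or search a grammar of expressions instead of a fixed list.
candidates = [
    ("x + 1", lambda x: x + 1),
    ("x * 2", lambda x: x * 2),
    ("x * x", lambda x: x * x),
    ("x * x + 1", lambda x: x * x + 1),
]

def synthesize(examples):
    # Return the first candidate program consistent with all (input, output) pairs.
    for name, prog in candidates:
        if all(prog(i) == o for i, o in examples):
            return name
    return None  # nothing in the hypothesis space fits

# The input/output pairs act as unit tests that the search must satisfy.
print(synthesize([(1, 2), (3, 6), (5, 10)]))  # x * 2
print(synthesize([(2, 5), (3, 10)]))          # x * x + 1
```

The hard part, naturally, is scaling the search beyond toy spaces; that's what the Inductive Logic/Functional Programming literature is about.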
Sounds more or less like the mechanism by which developer jobs succumb to automation. Hopefully those of us that are working class have seized the capital by then.
To spin that more positively, it may automate the basic boring stuff (both for us technical types and potentially for the simian on the street) and leave us more time to spend time in more fun & challenging playgrounds.
Microsoft was showing off similar work from their own code datasets this year at ICLR. I couldn't find a link online, but the demos had block suggestions from method signatures for C#. It should be possible to get similar results with natural language queries.
(Thanks, Nick. Here are the links:)
Generative Code Modeling with Graphs: https://arxiv.org/abs/1805.08490
Learning to Represent Programs with Graphs: https://arxiv.org/abs/1711.00740
I do think there would be value in machine-written code if it made code in general more alike, so that you don't have to relearn the weird tricks any new writer could have chosen to use.
There wasn't a technical reason (unless you count laziness as a technical reason) -- we simply had infrastructure for Python specifically lying around from past research projects, which we initially reused.
After we got Nat's feedback, we redid our data processing pipeline completely to be based on TreeSitter (which wasn't around when we started thinking about parsing Python), which makes it much easier to scale to the number of programming languages on GitHub.
I do have a question about the set-up, if that's alright. Netflix and others have found that shared tasks can lead to great models, but not necessarily ones that are suited for use in a production environment. Have you put much thought into how best to set up a challenge such as this to make the obvious "ensemble everything" solution less worthwhile?
Similarly, have you put much thought into how to encourage the sharing of information between participants?
1. We could log additional information about the model, such as inference time, number of parameters, memory usage, etc. and have the primary metric be overall efficiency (best NDCG with fewest parameters/fastest runtime/etc).
2. We're experimenting with different kinds of benchmarks, and I am most excited about explicitly collaborative ones. In these there is no contest/prize (hence no incentive to cheat/withhold information); only the shared goal of improving the model and our collective understanding of the problem. I hope we can incentivize information sharing by tracking and acknowledging individual contributions to the eventual best model in the benchmark. We could approximate individual contribution by seeing which scripts, code segments, workflows, architectural changes, writeups, or discussion comments other participants rate as the most helpful or choose to include in their experiments most often as the benchmark evolves. Of course this could only be an estimate--as Shawn says above, any idea could have "actually happened in a hallway conversation". Still, this is much easier to achieve in a logging/visualization platform like W&B than in the current paradigm of "read research papers, clone relevant repos, spend weeks trying to synthesize/reproduce their results, run your own experiments, write them up in a research paper, hope it gets accepted to a conference before other people publish the same idea, try to integrate your changes/publish your own repo, repeat"--and for hundreds of practitioners, ranging from brand new students to PhDs, working on related problems. This cycle is especially challenging for folks who are new to, working outside of, or trying to collaborate across the relatively few established/well-funded academic/industrial teams.
Collaborative benchmarks can be especially impactful for social good projects, where the primary incentive is to figure out and broadly implement the best solution ASAP (e.g. climate change!), not to make money or argue over the credit attribution. So, my long-term goal is for as much sharing of information and collaboration from as many folks as possible--the more inclusive and transparent the field of deep learning research becomes, the safer and better its outcomes. Very open to ideas on how to help make this happen.
~Stacey, deep learning engineer at W&B
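For reference, the NDCG metric mentioned in point 1 is easy to compute once you have graded relevance labels for a ranking; here's a minimal stdlib sketch (the relevance values are made up):

```python
import math

def dcg(relevances):
    # Discounted cumulative gain: rank 1 gets full credit,
    # later ranks are discounted logarithmically.
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))

def ndcg(ranked_relevances):
    # Normalize by the DCG of the ideal (descending-relevance) ordering,
    # so a perfect ranking scores exactly 1.0.
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0

print(ndcg([3, 2, 1, 0]))  # 1.0 -- results returned in ideal order
print(ndcg([0, 1, 2, 3]))  # < 1.0 -- best result buried at the bottom
```

An efficiency-weighted variant would then divide or penalize this score by parameter count, inference time, etc., as described above.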
- All the system logging information including CPU/GPU utilization, with the runtime and type of GPU card used
- Extensive logging of model training progression
- All of the model artifacts and metadata
- A link to the code on GitHub that produced that run.
- Anything emitted to stdout (for logging)
This allows for a high degree of reproducibility and very helpful insight. With these tools, the community can see if an "ensemble everything" method is used, how long the model takes to train, what resources are consumed, etc.
We've considered benchmarks that proceed in phases: a closed competitive phase for 3 months, then award a prize to the top result, and another prize for best user writeup. Follow that by a collaborative phase where it's more about sharing, teamwork etc. Rinse and repeat.
The question of attribution is really interesting. Who made the largest contribution to the development of a model? It could have actually happened in a hallway conversation, or something equally as untrackable. We'd love to hear other peoples' thoughts on this.
Stacey on our team has put a lot of thought into these topics and may have more to say here!
Resources and training time are also kept even between submissions.
I was wondering if you thought of including Stack Overflow questions and answers, which have been vetted by thousands of programmers over a long period of time. Stack Overflow might even want to participate in this effort to provide a clean ground truth for this great project.
We did consider adding StackOverflow questions. Some of our queries in the CodeSearchNet challenge do actually come from StackOverflow (via StaQC). It's certainly interesting to see how all other SO data can be useful for this task. Thanks for the suggestion!
The reason we didn't try this at this point:
Many people in research have tried working with SO data. In my experience, there is an interesting problem with the data: it's deduplicated! This is great for users but bad for machine learning, since the data looks "sparse" (roughly, each concept appears once). Sparsity is an obstacle, since it's hard for most existing machine learning methods to generalize from sparse data. In contrast, in natural language there are often multiple articles describing more or less the same event.
Want to flatten an array of Maybes? Just search for [Maybe a] -> [a] and you'll find catMaybes and takeWhileJust.
Has anyone worked on something such as this and can comment or share ideas?
I am playing with `todoinator` - it is a way of finding all the "todos" I leave scattered around my code, but it also gives me ways to rank my code. It's not quite what you're describing, but I think the principle of having almost everything derived from code is the guiding light here.
There are many interesting ideas that you could build on top of this kind of data, and we only scratched the surface so far.
For example, the simple "search" technique we are using as a baseline is based on the idea of joint embeddings: We learn functions f_query and f_js/f_python/... to map from the inputs into some vector space such that, for example, for a (python method, docstring) pair, f_query(docstring) is near to f_python(method). To search given a query, we just do f_query(query) and look for nearest neighbours in all the code we indexed before.
Now, we could also just do f_python(def bubblesort(...): ...) and look for the nearest neighbour that is in C#, and should get out a C# implementation of bubblesort. Similarly, we could apply all kinds of filters on the results (code from highly-starred repos, code that uses framework X, ...) to do more interesting things.
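The retrieval step in this joint-embedding scheme boils down to nearest-neighbour search in the shared vector space. Here's a stdlib-only sketch with random vectors standing in for the learned encoder outputs (the real f_query/f_python would of course be trained models, not random draws):

```python
import math
import random

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nearest(query_vec, code_vecs):
    # Index of the code embedding closest to the query embedding.
    return max(range(len(code_vecs)), key=lambda i: cosine(query_vec, code_vecs[i]))

# Stand-ins for f_python(method) over 100 indexed snippets, and for
# f_query(docstring) landing near snippet 42 in the shared space.
random.seed(0)
code_index = [[random.gauss(0, 1) for _ in range(64)] for _ in range(100)]
query = [x + 0.01 * random.gauss(0, 1) for x in code_index[42]]
print(nearest(query, code_index))  # 42
```

At CodeSearchNet scale you'd swap the linear scan for an approximate nearest-neighbour index, but the retrieval logic is the same.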
Someday I’d like to learn Rust, for example.
If I could leverage the languages that I already know, I could more quickly build something more useful.
Behind all the hype, predictive text is something machine learning models are beginning to do very well. G-mail has rolled out a lot of similar features from advancements in deep learning models.
That's certainly true for simple use cases. Our goal here is to eventually also capture the long-tail of queries about a codebase. Often, within the domain of a project there is a set of natural language terms/jargon that describe complex tasks specific to the domain. Imagine for example a developer joining a mid-sized project and trying to find how to achieve some simple but project/domain-specific task.
I don't disagree at all that this is how we code these days... but I distinctly remember a time when this wasn't so. We had to do everything ourselves. We engineered our solutions based on various requirements and constraints, and most importantly, we had to figure it out ourselves. The only external help we had was with the APIs we used... and they had to be studied to be understood.
Even in recent times, the most fun I've had programming has been when it's all come from my wee little head, rather than trawling for solutions and shoehorning/ reworking something similar.
I had to use a tool for work (name withheld to protect the guilty) that has awful documentation. The tool allows you to write snippets of your own code, but provides no IDE and no documentation (AFAIK) of anything but the most trivial aspects of the API.
I started using Python, but with no debugger and no interactive shell there was no way I was going to guess the names of the functions I needed. Lucky for me, someone uploaded the Javadoc of an older version of the API, and that was the missing piece of my puzzle: having the function names, the return types, and Java's stack traces, I now had all I needed.
Back to the topic: like you, I sometimes wonder if there's a downside to not having to scroll through hundreds of manual pages anymore. But until someone shows some kind of evidence of something being lost, I won't worry too much.
That said, I definitely wish more companies would make their documentation available offline, if only as a static version of the online version. For those of us who regularly program in trains and planes, offline docs are a lifesaver.
> Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3
I am surprised that GitHub went with S3 for this download. Isn't there an Azure equivalent of S3 for large object storage? This just shows the dominance of AWS.
Documentation tends to be either simplistic "hello-world" examples or everything-but-the-kitchen-sink dumps that take forever to consume. Neither of these is helpful to a practitioner that just wants to get through a basic task without starting a mini-research project, or getting burnt to a crisp on stackoverflow.
So basically, I am thinking this corpus could be used to filter for specific examples of usage within particular contexts and problem domains? Or maybe not?
The data set is apparently ~20GB, so a cheap VPS instance might do the job of hosting the data in a searchable format.
It's very disappointing to see a lot of the negative comments here, almost completely around licensing, despite the licensing being well explained.
You'd think this place was full of lazy bureaucrats.
I'm sure whatever it is you've done, I'll probably use it at some point. However, from my 10 second scan of the page, I know absolutely NOTHING about what this is or what it can do for me. Is it just me or do I _constantly_ see examples of developers making the worst marketers in the world? Where are the code examples or videos showing whatever it is this is and how it will help me? There is nothing on that page but a HUGE image and a bunch of text I can't understand without 4 PhDs and half a tab of Adderall.
It's so frustrating to see stories on Hacker News that are just piss-poorly explained. You've most likely worked hard at this for a long time, so why is it you can't take 5 minutes to explain it in layman's terms for all to understand?
We also provide a way to evaluate how well your machine learning model works. That's why the blog post says "We’re announcing the CodeSearchNet Challenge evaluation environment."
The released data and code (and hence, the announcement) are meant for data scientists and ML researchers who want to work on this problem, and the rest of the world does not need to care. There are no products or applications of this work at this time. If the terms in the blog post don't mean anything to you, then you are not in the target audience for this announcement, and that's OK.