GitHub releases an ImageNet for code and a CodeSearchNet challenge (github.blog)
428 points by slewis 20 days ago | 93 comments



Hello folks, this is Hamel from GitHub -- I'm one of the Machine Learning Engineers who worked on this project. The reason we are excited to host this data is that we believe the community will be able to innovate and advance the state of the art much faster if it is provided in a tractable format for machine learning researchers. This data is already public; however, there is specialized knowledge required to acquire, parse, dedupe, and clean code from many programming languages at a massive scale on GitHub. We strived to reduce these barriers to encourage greater involvement.

While we present the task of information retrieval as one possible use case of this dataset, we know there could be other practical applications of this data (e.g. code summarization). While we went to great lengths to pre-process the data for the community, the data is still messy, and often you will find that there might not be high-quality comments aligned with a code snippet we parsed. However, we believe this is part of the excitement of the dataset: it poses challenges that machine learning practitioners will have to address.

Code is very different from natural language with regard to its structure and syntactic rules and may benefit from different approaches relative to natural language processing. Our baseline models and benchmarks mostly treat code as natural language; however, we are aware that there could be an opportunity to innovate on this front. If anyone creates any interesting projects from this dataset, please do get in touch. Happy to answer any questions!


How would you make sure that any derived works (resulting from the use of this dataset) are properly licensed? It is very likely that this dataset (as per 1909.09436.pdf: "2 million functions, obtained from mechanically scraping and preprocessing associated function documentation") is contaminated by code with dubious licenses, right? And any derived work as a result would also be contaminated, right? What's the plan to deal with this issue?


[I'm one of the Microsoft Research people who worked on this]

We used the repo-level license information from github to filter to free non-copyleft licenses, and the license files are actually stored next to the extracted data corpus (see https://github.com/github/CodeSearchNet#Licenses).


Are you doing anything to help copyleft-licensed code repos find infringers? We all know it's common practice for unscrupulous developers and companies to take copyleft code, copy it, and then try to relicense it under permissive licenses on a regular basis. Like Google did with Oracle's GPLed Java source code.

It seems irresponsible for Github to continue distributing infringing code when it clearly has the tools to prevent that from happening. YouTube has fingerprinting for infringing music and video. Github should be doing the same with source code.


> Like Google did with Oracle's GPLed Java source code.

That's not what happened -at all-.


How does Github clearly have the tools to prevent copyright infringement from happening?

If you're saying they are the entity in the best position to identify infringement, I would agree, but what tools does Github already have in place to do that?


Github can ID project dependencies and notify devs when one of them is vulnerable to a CVE. They could just as easily find licenses for all project dependencies; they just don't. Github could ID private repos using GPL code for non-GPL-licensed projects. Github could even automatically apply the correct GPL license and make the projects public, which would satisfy the license.

By failing to do so, Github really seems to be encouraging copyright infringement.


Nothing you described here is a trivial problem, nor a problem that Github has any responsibility to unilaterally fix, nor do they have the tools to do so already as you previously stated.

If Github were somehow helping copyright infringers, that would be encouragement. What you're describing is additional effort on their part to actively discourage copyright infringement, which is not the same thing.


>Like Google did with Oracle's GPLed Java source code.

What are you talking about?



Test files that were not shipped to customers...


Absolute game changer. We are witnessing the birth of "Big Code"

From the paper, it's not really clear how the resulting binary or intermediate representation fits into the picture. Wouldn't a robust model have to exploit features of the underlying computer architecture as well?

Besides the possible applications mentioned, such as recommendation systems and code generation, I am wondering about automatic code annotation and documentation, with the eventual goal of creating a "code tutor" that would assist students in real time as they type. I think what's interesting about this is that the most talented CS grads tend to have a deep understanding not only of the history of digital logic design, but also of math, physics, etc.

Fantastic work ;)


Thanks for the dataset release! So many questions. Have you thought about any approaches to assess quality/accuracy of comments by using the data itself? One could assume a distribution of accuracy and look for correlated metrics on the comment? Other information such as author or project could be helpful if present, by simply regularizing comment quality over author/project? If GitHub would be a primary beneficiary of any techniques developed with this dataset, would they volunteer any compute resources toward running models?


[I'm one of the Microsoft Research people who worked on this]

Thanks for your questions! We have thought of many heuristics but we didn't want to constrain the dataset release on some heuristic that we picked, possibly ruining the dataset. Participants in the challenge should feel free to apply additional filters as they see fit. For example, this [1] work could be useful as a filtering method.

Unfortunately, we do not have the budget to provide any compute resources to help with running the models at this time. Note that any techniques developed with this dataset will be owned by those who develop them and it's up to them how/if they will make them available/open-source.

[1] https://arxiv.org/abs/1806.04616


Have you used code from private repositories in the training of this model?


[I'm one of the Microsoft Research people who worked on this]

All code is from public repositories. We only used publicly available APIs to obtain the data, so that others can reproduce the results / build on top of the data extraction pipeline we built: https://github.com/github/CodeSearchNet/tree/master/function...


Thank you for responding. So, then the next question is whether or not the code found in public repositories is also representative for the code found in private ones. I can see arguments for it going either way (better, worse), which might have substantial impact on the applicability of the models for code from the class that it never saw.


That's an interesting question which we haven't studied so far. There is a good argument to be made that ML models trained on public data do not work well on private data on the interesting queries, where I consider those queries interesting that only make sense in the specific project. The core issue there is around the specialised vocabulary that you would only see in the (private) project, which the model would be unfamiliar with. This could be mitigated by using absurd amounts of data, models that can generalise more easily (character-level/BPE/subtokens/...), and finetuning on project-specific data, but it's an open question nonetheless.
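To make the subtoken point concrete: a subtoken model would not see an unfamiliar project-specific identifier as one unknown symbol, but as pieces it has already seen elsewhere. A toy illustration (my own regex, not the tokenizer used in the baselines):

    import re

    def subtokens(identifier: str):
        """Split camelCase / snake_case identifiers into lower-cased subtokens."""
        parts = re.split(r"[_\W]+", identifier)
        out = []
        for part in parts:
            # Break on lower->Upper transitions and on acronym boundaries.
            out += re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", part)
        return [t.lower() for t in out if t]

    print(subtokens("parseAuthTokenV2"))   # ['parse', 'auth', 'token', 'v', '2']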

It's also easy to test without access to private repos, by just splitting some repos off as a separate test set, so should hopefully be something that other people can also make progress on!


You could test on GH and MS's own private repos to get some idea of the performance. These might still not be representative but it would be an interesting data point.


What about code from public repository that don't have some kind of open source licence? Won't you be running into a copyright issue somewhere?


We only used code from repos that github has marked as using a non-copyleft open source license (i.e., we had an explicit license whitelist and used only repos matching that).


So how do you handle repositories that apply different licenses to different subsections of code? E.g. what happens when an optional helper library is in a subdirectory and licensed differently?


Could you have code reviews in the data?


The unit of observation for this data is functions or methods and their associated docstrings or top-level comments. So there are no code reviews in the data. However, we do include metadata including the SHA and owner/repo which would allow you to retrieve this information! What were you thinking of doing with code reviews?


We can try to write an automatic code review system?


[flagged]


If you had met mloncode at a party, would you have said it this way to his face after he said "Hey, check out this thing I helped build to make code search better"?

Do you realize how rude and disrespectful that tone is?

There is a civil way to air one's views; one that we use naturally in face-to-face conversations. Yet once it's from behind a computer screen, we somehow seem to think it's okay to lash out in a barrage of hurtful language...


Seriously, how dare someone at GitHub release a large dataset that might be interesting while their website’s basic search functionality is still terrible!? Why isn’t everyone at GitHub working to solve just this problem??

C’mon, cut these folks some slack — GitHub is huge, and one annoying adjacent feature doesn’t give you the right to attack these folks like this.


This is not an "attack", it's just straightforward feedback, and one I agree with: a company that has received more than $100 million in investment shouldn't have such a bad search feature.


Really? Did you read the same post I did?

> Just get the stupid dedupping done, for God's sake.

> You have no excuse to so badly that a teenager can beat you by simple handcrafted rules.

> First learn and do the basics and then come here to talk about your need for ML.

> No one needs to suffer like this.

Let's see, the OP belittles the original authors for something they're not responsible for, implies that they're incompetent, tells them to go away until their problem is fixed, and, oh, for good measure: invokes a deity.

This is not “straight forward feedback” to me.

How about we just cut all the insults and shift the tone a bit:

> GitHub has a really poor search experience compared with most modern web services. Search doesn't need fancy ML or DL, and dedupping alone would help a ton. A simple TF-IDF similarity metric would make GitHub’s search 10,000% better. You even know forks, names of files, folders, number of stars.

> PS: Sorry, I had to get this out. I've spent the better part of my productivity in the past two years sifting through pages 50 and 100 of undedupped results that GitHub returns and it's been extremely frustrating.

See how this doesn’t blame the authors for something that isn’t their fault? Doesn’t imply they’re stupid or incompetent for some other feature they have nothing to do with? Doesn’t tell them to go away? And a bonus: It presents some approaches to solve a problem the author is frustrated by.

My experience ranting on the internet is that personal attacks don’t lead to feature improvements.


I didn't feel that the commenter attacked/belittled anyone personally. The comment seemed to address the company (albeit in an overly emotional way).

I think the pointed out asymmetry was spot on.


I think the OP intentionally wrote it very aggressively; he even put the <rant> tag to make that explicit.


While I agree with OP's sentiment (gh's search is surprisingly poor), putting a warning “I'm going to be a jerk” isn't an excuse to be one…


While I probably would have disagreed with the OP's tone (I can't see the post anymore because of too many downvotes, guess I'm late to the party), I agree in principle: GitHub could have drastically improved their search function already years ago with some pretty simple ideas (low-hanging fruit and such), no need to throw ML at it...


Case sensitive search is all I want.


I downvoted this because ranty tantrums don't belong on HN.


Lol, GitHub search is far better than TF-IDF. You are seriously mistaken if you think that TF-IDF would be better.


Downvoted for tone and non-constructive, unsubstantiated criticism.


Why are you hosting the data in AWS instead of on Azure?


I know this sounds silly because code on github is visible to everybody anyway and that's a good thing, but I would appreciate a way to opt out my own code from such automated data gathering programs for machine learning purposes.

Something like a robots.txt for github projects. Not that anybody would really care, only to make my intent clear that I don't support this sort of mass data gathering nonsense.


You opt out by using a restrictive license or making your repo private.


Using a restrictive license seems a bit extreme when the whole point of a public github repo is to make the code available for other programmers.

I consider this mass data gathering use for machine learning an edge case which should not harm the 'legitimate' users of the repository.


Why do you see this effort as not "legitimate"?


Self-host your own Git browser. GitLab or cgit are still good; cgit is used by the biggest project in the world (the Linux kernel).


Shawn from Weights & Biases here. We've been working with Github and Microsoft Research on this for just about a year now and we're super excited to launch it today.

We've seen huge advances in human language modeling and translation due to the success of deep learning. Often new directions start with a really motivated team producing a new kind of dataset. Who better to do that for code as language than Github!

This started as a grassroots effort inside of Github, and went through many iterations. When it was presented to Github's CEO six months ago, he correctly pointed out that we needed to go back and include Github's most popular language (JavaScript). As the project went on many smart people chipped in, and we produced something that we think is truly useful.

Check out the paper here: https://arxiv.org/abs/1909.09436

We overcame plenty of challenges to pull this off. For example: how do you clean this data? how do you label it? We've got folks from Github, Microsoft Research and Weights & Biases here to answer any and all questions you might have. Can't wait to see where this goes!


> We've seen huge advances in human language modeling and translation due to the success of deep learning.

I wonder if we’ll eventually see a system where instead of writing code you describe in natural language what you want the program to do and then ML is applied in order to generate the code for that.

I mean, a lot of people have been interested in the past in making human programming languages, and had varying degrees of success.

Personally I love writing code, but it could happen, couldn't it?

Write some unit tests, a human description of what it does and based on the source code and description of existing software the system would basically “debug the program into existence” for you.

That’d be kind of freaky, kind of cool and a little bit scary.


Find a human who wants a computer program written (A). Find a human who can program computers (B). Have A describe what they want and have B code it without asking questions to further clarify the issues. What do you expect the result to be like? For me, my experience tells me that it will be a total failure.

The problem with programming is not the encoding of the requirements in programming language for the most part. The problem is that the specifier (A in this example) usually does not have a full grasp of what they actually want. In fact, they usually don't have any idea at all. "Give me an e-shopping system to sell comic books" is the level of detail they can understand.

The closer A can come to expressing the requirements they need, the closer they are to actually being B in reality. B's real skill is not in knowing the syntax and grammar of the computer language, it's in knowing that in order to make a system that will satisfy A we need to do X, Y, and Z to the tiniest detail.

When we get into trouble with our software is when we write code that is dramatically more complex than the problem we are trying to represent. This doesn't happen so much because we don't know how to program. This happens because we are slowly extending the code base over time with imperfect knowledge about what we are ultimately building at any given time. We also have to trade-off the benefit for getting something done with discovering generalities that allow us to simplify the expression of code that we already have.

I don't think we will ever replace "programmers" with AI -- at least not until the AI can be trained to ask the important questions about what the system really needs to be (and for that we need Turing-test-passing level AI). I think it's much more likely that we will build more and better tools that help programmers visualise and plan the programming situation. I think we will have automatic code generation because we already have it: Look at "derive" in Haskell and Rust, for example. But I think that's the level of automatic code generation we're going to want for at least the next 20 years or so.

Interestingly for testing, I think we'll actually go the opposite direction: We will spend more time thinking about requirements and the computer will help us by writing tests that challenge our assumptions: I've broken this function, are you sure you got it right? Again, we already have these kinds of systems and I think that this is the most appropriate direction to invest in research.


The only way I think something like what you described in your first paragraph would work is if you had an AI system B that could present questions and prototypes back to requirement-setter A for feedback. Of course, that'd be a very difficult problem, even if you limit it to a constrained domain.


>> Write some unit tests, a human description of what it does and based on the source code and description of existing software the system would basically “debug the program into existence” for you.

This is already possible, but not with deep learning which is probably the reason you haven't heard of it.

Learning programs from specifications (not necessarily in natural language) is the subject of the field of Program Synthesis [1].

Learning programs from examples of their input and outputs (which is basically writing unit tests) is the subject of Inductive Programming [2].

Inductive Programming encompasses the fields of Inductive Logic Programming and Inductive Functional Programming, that target logic and functional programming languages, respectively.

You won't find anything learning from examples in JavaScript or Python directly, though; imperative languages are too sloppy for that sort of thing.
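To make that concrete, here is a toy sketch of the enumerative flavour of inductive programming (a throwaway DSL of my own, nothing like a real synthesis system), where a couple of input/output examples play the role of unit tests:

    from itertools import product

    # Toy inductive programming: brute-force search over a tiny DSL of list
    # functions, looking for a composition consistent with the given examples.
    PRIMITIVES = {
        "reverse": lambda xs: list(reversed(xs)),
        "sort":    sorted,
        "head3":   lambda xs: xs[:3],
        "evens":   lambda xs: [x for x in xs if x % 2 == 0],
    }

    def synthesize(examples, max_depth=3):
        """Return the shortest pipeline of primitives matching all (input, output) pairs."""
        for depth in range(1, max_depth + 1):
            for names in product(PRIMITIVES, repeat=depth):
                def run(xs, names=names):
                    for name in names:
                        xs = PRIMITIVES[name](xs)
                    return xs
                if all(run(i) == o for i, o in examples):
                    return " | ".join(names)
        return None

    # "Unit tests" as the specification: keep the even numbers, sorted.
    print(synthesize([([4, 5, 2, 3], [2, 4]), ([2, 4, 6, 1], [2, 4, 6])]))  # sort | evens

Real systems search vastly larger spaces using types, constraints, and logic programming rather than brute force, but the contract is the same: examples in, program out.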

___________

[1] https://en.wikipedia.org/wiki/Program_synthesis

[2] https://en.wikipedia.org/wiki/Inductive_programming


> Write some unit tests, a human description of what it does and based on the source code and description of existing software the system would basically “debug the program into existence” for you.

Sounds more or less like the mechanism by which developer jobs succumb to automation. Hopefully those of us that are working class have seized the capital by then.


> Sounds more or less like the mechanism by which developer jobs succumb to automation.

To spin that more positively, it may automate the basic boring stuff (both for us technical types and potentially for the simian on the street) and leave us more time to spend time in more fun & challenging playgrounds.


Yes that would be really cool. The field of program synthesis has made strides in this area but it doesn't appear you can create anything more than trivial programs from human languages at the moment. I think that you are more likely to see technology that augments the human significantly -- for example better code completion, error detection etc. that allows you to work much faster. It will be exciting to see how machine learning shapes developer tools and workflows in the future.


Great idea, I would love to see this. Tools for programming are going make some interesting leaps with all the new work going into language modeling. So much of the code we write is tweaks and combinations of a not particularly large set of patterns: loops, functions, merge, reduce, sort, filter, interleave, etc... Generating large blocks that are near your target result would be really useful for saving time, especially on the more repetitive tasks like writing tests or simple CRUD API endpoints.

Microsoft was showing off similar work from their own code datasets this year at ICLR, I couldn't find a link online, but the demos had block suggestions from method signatures for C#. It should be possible to get similar results with natural language queries.


[I'm one of the Microsoft Research people who worked on this]

(thanks Nick. Here are the links)

Generative Code Modeling with Graphs: https://arxiv.org/abs/1805.08490

Learning to Represent Programs with Graphs: https://arxiv.org/abs/1711.00740


It has a lot less real-world value if the AI is not capable of debugging the software in future conditions, or of explaining why it made it the way it did. I would love for the "writing programs into existence" parts to be automated, but it does make further investigation costlier (since the human investigating still has to get acquainted with the patterns in the code).

I do think there would be value in machine-written code if it made code in general more alike, so that you don't have to relearn the weird tricks any new writer could have chosen to use.


Just curious, is there a technical reason you initially omitted JavaScript?


[I'm one of the Microsoft Research people who worked on this]

There wasn't a technical reason (unless you count laziness as a technical reason) -- we simply had infrastructure for Python specifically lying around from past research projects, which we initially reused.

After we got Nat's feedback, we redid our data processing pipeline completely to be based on TreeSitter (which wasn't around when we started thinking about parsing Python), which makes it much easier to scale to the number of programming languages on GitHub.


In other words, the system is ready to be trained for languages not on the initial list, without requiring additional development?


You would need to extend the data-processing pipeline for the new language, which in the best case only requires adapting the standard wrapper around the Tree Sitter parser. The wrapper needs to take care of language-specific details (e.g., where to find the documentation for a function, what methods should be filtered out, etc.) but can be fairly small. See https://github.com/github/CodeSearchNet/blob/master/function... for examples.
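For a rough idea of what such a wrapper does, here is a sketch using the py-tree-sitter bindings (the older API with Language.build_library; this is not the project's actual wrapper code). The node and field names are the Python grammar's, and they are exactly the kind of detail you would swap out for a new language:

    from tree_sitter import Language, Parser

    # Assumes the older py-tree-sitter API and a local checkout of the grammar
    # repo; newer releases of the bindings changed how languages are loaded.
    Language.build_library("build/langs.so", ["vendor/tree-sitter-python"])
    PY_LANGUAGE = Language("build/langs.so", "python")

    parser = Parser()
    parser.set_language(PY_LANGUAGE)

    def functions_with_docs(source: bytes):
        """Yield (function name, docstring) pairs from one source file.

        Node/field names ("function_definition", "name", "body", ...) are
        specific to the tree-sitter-python grammar."""
        tree = parser.parse(source)
        stack = [tree.root_node]
        while stack:
            node = stack.pop()
            stack.extend(node.children)
            if node.type != "function_definition":
                continue
            name = node.child_by_field_name("name")
            body = node.child_by_field_name("body")
            doc = None
            first = body.children[0] if body is not None and body.children else None
            if (first is not None and first.type == "expression_statement"
                    and first.children and first.children[0].type == "string"):
                doc = source[first.start_byte:first.end_byte].decode()
            yield source[name.start_byte:name.end_byte].decode(), doc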


Spoken like a true Kool aid drinker


Code challenges like this are interesting - thank you for putting this together!

I do have a question about the set-up, if that's alright. Netflix and others have found that shared tasks can lead to great models, but not necessarily ones that are suited for use in a production environment. Have you put much thought into how best to set up a challenge such as this to make the obvious "ensemble everything" solution be less worthwhile?

Similarly, have you put much thought into how to encourage the sharing of information between participants?

Thanks again.


Excellent questions, thank you!

1. We could log additional information about the model, such as inference time, number of parameters, memory usage, etc. and have the primary metric be overall efficiency (best NDCG with fewest parameters/fastest runtime/etc).

2. We're experimenting with different kinds of benchmarks, and I am most excited about explicitly collaborative ones. In these there is no contest/prize (hence no incentive to cheat/withhold information); only the shared goal of improving the model and our collective understanding of the problem. I hope we can incentivize information sharing by tracking and acknowledging individual contributions to the eventual best model in the benchmark. We could approximate individual contribution by seeing which scripts, code segments, workflows, architectural changes, writeups, or discussion comments other participants rate as the most helpful or choose to include in their experiments most often as the benchmark evolves. Of course this could only be an estimate--as Shawn says above, any idea could have "actually happened in a hallway conversation".

Still, this is much easier to achieve in a logging/visualization platform like W&B than in the current paradigm of "read research papers, clone relevant repos, spend weeks trying to synthesize/reproduce their results, run your own experiments, write them up in a research paper, hope it gets accepted to a conference before other people publish the same idea, try to integrate your changes/publish your own repo, repeat"--and for hundreds of practitioners, ranging from brand new students to PhDs, working on related problems. This cycle is especially challenging for folks who are new to, working outside of, or trying to collaborate across the relatively few established/well-funded academic/industrial teams.

Collaborative benchmarks can be especially impactful for social good projects, where the primary incentive is to figure out and broadly implement the best solution ASAP (e.g. climate change!), not to make money or argue over the credit attribution. So, my long-term goal is for as much sharing of information and collaboration from as many folks as possible--the more inclusive and transparent the field of deep learning research becomes, the safer and better its outcomes. Very open to ideas on how to help make this happen.

~Stacey, deep learning engineer at W&B


What is exciting about the tools we provided in this competition, especially the Weights & Biases leaderboard, is the level of transparency you get that you don't always see in a Kaggle competition (unless it is shared voluntarily). You can see:

- All the system logging information, including CPU/GPU utilization, with the runtime and type of GPU card used
- Extensive logging of model training progression
- All of the model artifacts and metadata
- A link to the code on GitHub with the code that ran that data
- Anything emitted to stdout (for logging)
- etc.

This allows for extreme reproducibility and insight that is very helpful. With these tools, the community can see if an "ensemble everything" method is used, how long the model takes to train, what resources are consumed, etc.

Good question!


There's been a lot of thought at Weights & Biases about the tradeoff between competition and collaboration. Competition certainly fosters activity, but it encourages behaviors like ensembling everything to eke out a few more hundredths of a percent. This benchmark isn't incentivized in any way other than "let's drive the field forward", so we may see less of that behavior.

We've considered benchmarks that proceed in phases: a closed competitive phase for 3 months, then award a prize to the top result, and another prize for best user writeup. Follow that by a collaborative phase where it's more about sharing, teamwork etc. Rinse and repeat.

The question of attribution is really interesting. Who made the largest contribution to the development of a model? It could have actually happened in a hallway conversation, or something equally as untrackable. We'd love to hear other peoples' thoughts on this.

Stacey on our team has put a lot of thought into these topics and may have more to say here!


The Netflix challenge was quite a while ago now. Since then Kaggle has added things like Kaggle kernels, where the models are trained on data they haven’t seen before (not just evaluated).

Resources and training time are also kept even between submissions.


Great effort in putting together this large corpus! While reading through your paper, I noticed the difficulties you faced in correctly annotating the code for their quality, correctness and hiring annotators for different languages. I can imagine how herculean this task could be.

I was wondering if you thought to include stackoverflow questions and answers, which have been vetted by thousands of programmers over a long period of time. Stackoverflow might even want to participate in this effort to provide a clean ground truth for this great project.


[I'm one of the Microsoft Research people who worked on this]

We did consider adding StackOverflow questions. Some of our queries in the CodeSearchNet challenge do actually come from StackOverflow (via StaQC [1]). It's certainly interesting to see how all other SO data can be useful for this task. Thanks for the suggestion!

The reason we didn't try this at this point:

Many people in research have tried working with SO data. In my experience I have observed an interesting problem with the data: it's deduplicated! This is great for users but bad for machine learning, since the data looks "sparse" (roughly, each concept appears once). Sparsity is an obstacle, since it's hard for most existing machine learning methods to generalize from sparse data. In contrast, in natural language there are (e.g.) multiple articles describing the same event more or less.

[1] https://ml4code.github.io/publications/yao2018staqc/


Code reuse search engines is something that Haskell is excellent for. With Hoogle you can search for a generic type signature and probably find the method you want.

Want to flatten an array of Maybes? Just search for [Maybe a] -> [a] and you'll find catMaybes and takeWhileJust.

This of course only works for Haskell because so much of a function is determined by its types. Doing this sort of thing in, say, JavaScript would be an absolute nightmare. Still, it's an interesting train of thought.


I've been experimenting with my own custom markup for annotating code blocks. I want to document code use cases and automatically generate an index of independent sandbox examples. This approach doesn't require AI, but rather manual effort remembering to annotate novel uses in code and a standard convention for doing so. One problem this solves is that I've written so much code that I lose track of where I used certain patterns, and this would help me track that better. This also helps to facilitate knowledge sharing across teams with access to the source.
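For concreteness, a minimal sketch of the indexing side, with a made-up `#@usecase:` marker standing in for my markup (the marker and names are just placeholders):

    import pathlib
    import re
    from collections import defaultdict

    MARKER = re.compile(r"#@usecase:\s*(.+)")   # made-up annotation convention

    def build_index(root="."):
        """Map each use-case tag to the files/lines where it was annotated."""
        index = defaultdict(list)
        for path in pathlib.Path(root).rglob("*.py"):
            lines = path.read_text(errors="ignore").splitlines()
            for lineno, line in enumerate(lines, 1):
                match = MARKER.search(line)
                if match:
                    index[match.group(1).strip()].append(f"{path}:{lineno}")
        return index

    for tag, locations in sorted(build_index().items()):
        print(tag)
        for loc in locations:
            print("   ", loc)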

Has anyone worked on something such as this and can comment or share ideas?


sort of ...

I am playing with `todoinator` - it is a way of finding all the "todos" I leave scattered around my code, but it also gives me ways to rank my code - it's not quite where you are, but I think the principle of having almost everything derived from code is the guiding light here.

[#] https://github.com/mikadosoftware/todoinator/blob/master/tod...


What about some sort of code URI? Maybe something like:

    code:github.com/kstenerud/bit-tricks/absolute_value.h/absolute_value


I haven't looked at this in detail, but where I could see this technology going is by doing a kind of lint. We've got hand built linters for most languages and some of them are really good (Clippy in Rust is amazingly good -- it practically writes my code sometimes). However, a tool that analysed my code, picked out things that were idiomatically strange and then suggested example code that might be better would be quite useful, I think.


[I'm one of the Microsoft Research people who worked on this]

There are many interesting ideas that you could build on top of this kind of data, and we only scratched the surface so far.

For example, the simple "search" technique we are using as a baseline is based on the idea of joint embeddings: We learn functions f_query and f_js/f_python/... to map from the inputs into some vector space such that, for example, for a (python method, docstring) pair, f_query(docstring) is near to f_python(method). To search given a query, we just do f_query(query) and look for nearest neighbours in all the code we indexed before.

Now, we could also just do f_python(def bubblesort(...): ...) and look for the nearest neighbour that is in C#, and should get out a C# implementation of bubblesort. Similarly, we could apply all kinds of filters on the results (code from highly-starred repos, code that uses framework X, ...) to do more interesting things.
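The retrieval side of that boils down to embedding everything once and doing nearest-neighbour lookup. A toy sketch (a hashed bag-of-tokens stands in for the learned encoders; this is not our actual baseline code):

    import numpy as np

    DIM = 128

    def _toy_embed(text: str) -> np.ndarray:
        """Toy stand-in for a learned encoder: hashed bag-of-tokens.
        In the real baselines, f_query / f_python are neural models trained
        so that a docstring lands near its paired function."""
        v = np.zeros(DIM)
        for tok in text.lower().split():
            v[hash(tok) % DIM] += 1.0
        n = np.linalg.norm(v)
        return v / n if n else v

    f_query = _toy_embed    # encoder for natural-language queries
    f_python = _toy_embed   # encoder for Python functions

    def build_index(snippets):
        """Embed every candidate snippet once, ahead of query time."""
        return np.stack([f_python(s) for s in snippets])

    def search(query, snippets, index, k=3):
        """Embed the query, then rank snippets by cosine similarity."""
        scores = index @ f_query(query)
        top = np.argsort(-scores)[:k]
        return [(float(scores[i]), snippets[i]) for i in top]

    snippets = ["def bubblesort(xs): ...", "def read_json(path): ..."]
    print(search("sort a list with bubble sort", snippets, build_index(snippets)))

A learned model replaces the hashing trick and a proper approximate-nearest-neighbour index replaces the brute-force argsort, but the shape of the pipeline is the same.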


I’d like to type my code in one language and have it translated into another.

Someday I’d like to learn Rust, for example.

If I could leverage the languages that I already know, I could more quickly build something more useful.


A seq2seq model trained on something like Rosetta Code might be a viable way to do that...


A phrase is never enough to describe what the task at hand is. It may work for simpler use cases like the results on stack overflow. Otherwise I do not see it doing better than a google search which leads me to stack overflow.


You're right. We'll need bigger datasets and more complex models to achieve more useful results, but machine learning models can give predictions in real time so you could have this running as you type in your IDE, instead of visiting a website, also with a model fine tuned to your code base. "simpler use cases like the results on stack overflow" could explain at least 10% of the work I do, so even just that tool would be very useful for mediocre programmers like myself.

Behind all the hype, predictive text is something machine learning models are beginning to do very well. Gmail has rolled out a lot of similar features from advancements in deep learning models.


[I'm one of the Microsoft Research people who worked on this]

That's certainly true for simple use cases. Our goal here is to eventually also capture the long-tail of queries about a codebase. Often, within the domain of a project there is a set of natural language terms/jargon that describe complex tasks specific to the domain. Imagine for example a developer joining a mid-sized project and trying to find how to achieve some simple but project/domain-specific task.


It might be useful for scoped search and code navigation when you don't have a general-purpose question that is amenable to Stack Overflow. Let's say you are trying to find code in a repository that carries out a task but your keyword search turns up empty -- semantic search might be able to help you in that kind of situation.


> Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day

I don't disagree at all that this is how we code these days... but I distinctly remember a time when this wasn't so. We had to do everything ourselves. We engineered our solutions based on various requirements and constraints, and most importantly, we had to figure it out ourselves. The only external help we had was with the APIs we used... and they had to be studied to be understood.

Even in recent times, the most fun I've had programming has been when it's all come from my wee little head, rather than trawling for solutions and shoehorning/ reworking something similar.


Only mildly related, but I had a recent experience that made me fond of Java's way of doing things that your comment reminded me of.

I had to use a tool for work (name withheld to protect the guilty) that has awful documentation. The tool allows you to write snippets of your own code, but provides no IDE and no documentation (AFAIK) of anything but the most trivial aspects of the API.

I started using Python, but with no debugger and no interactive shell there was no way I was going to guess the names of the functions I needed. Lucky for me, someone uploaded the Javadoc of an older version of the API, and that was the missing piece of my puzzle: having the function names, the return types, and Java's stack traces, I now had all I needed.

Back to the topic: like you, I sometimes wonder if there's a downside to not having to scroll through hundreds of manual pages anymore. But until someone shows some kind of evidence of something being lost, I won't worry too much.

That said, I definitely wish more companies would make their documentation available offline, if only as a static version of the online version. For those of us who regularly program in trains and planes, offline docs are a lifesaver.


A little off-topic.

> Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3

I am surprised that Github went with S3 for this download. Isn't there an Azure equivalent of S3 for large object storage? This just shows the dominance of AWS.


It's a good thing that MS doesn't force every team to always choose MS products, especially for a trivial thing like a download storage. Maybe the team was more familiar with AWS, who knows. I'm just glad that they can make this decision.


GitHub was a heavy S3 user before MS took control. It may eventually move over to using their infrastructure more, maybe using it exclusively in the distant future, but there is no point rushing in such a massive systemic change. MS presumably learned something from the debacle that was moving HotMail's back-end from Solaris+Apache to Windows+IIS back in '97/'98!


GitHub is a heavy S3 user. Pretty much every file you upload to GitHub that isn't part of a git commit is stored on S3.


Looks like they're using tokens as input as if it's natural language. You have an AST, use it. I think the limitation is the lack of graph and tree based neural networks, but I don't think you need a neural network to search code. You already have the AST. This could be solved with a traditional analytic approach, but it's probably a lot of hard work.
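To illustrate how much structure is already available for free, here is a small sketch using nothing but Python's standard ast module to pull out the kind of signals (names, arguments, docstrings, called functions) a search index could key on:

    import ast

    source = '''
    def flatten(xs):
        """Flatten one level of nesting."""
        return [x for sub in xs for x in sub]
    '''

    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            print("function: ", node.name)
            print("args:     ", [a.arg for a in node.args.args])
            print("docstring:", ast.get_docstring(node))
            print("calls:    ", sorted({n.func.id for n in ast.walk(node)
                                        if isinstance(n, ast.Call)
                                        and isinstance(n.func, ast.Name)}))

Signals like these could feed a plain inverted index and ranking heuristics without any learning involved.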


Something like this might be useful for making better "documentation" for things like APIs and man-pages.

Documentation tends to be either simplistic "hello-world" examples or everything-but-the-kitchen-sink dumps that take forever to consume. Neither of these is helpful to a practitioner that just wants to get through a basic task without starting a mini-research project, or getting burnt to a crisp on stackoverflow.

So basically, I am thinking this corpus could be used to filter for specific examples of usage within particular contexts and problem domains? Or maybe not?


Since the CodeSearchNet Corpus contains metadata such as owner/repository, it would be nice to create a search tool for the data set itself. That way you could check if, by chance, some of your open source code is part of the corpus.

The data set is apparently ~20GB [0], so a cheap VPS instance might do the job of hosting the data in a searchable format.

[0] https://github.com/github/CodeSearchNet#downloading-data-fro...
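A rough sketch of that check, assuming the downloaded shards are gzipped JSON-lines files whose records carry a "repo" field of the form "owner/name" (field names taken from the CodeSearchNet README; adjust if they differ):

    import glob
    import gzip
    import json

    MY_REPO = "your-user/your-project"   # hypothetical owner/name to look for

    # Assumes the corpus was downloaded and unpacked locally as the
    # per-language *.jsonl.gz shards described in the repo's README.
    for path in glob.glob("data/**/*.jsonl.gz", recursive=True):
        with gzip.open(path, "rt", encoding="utf-8") as fh:
            for line in fh:
                record = json.loads(line)
                if record.get("repo") == MY_REPO:
                    print(path, record.get("path"), record.get("func_name"))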


I think this is fantastic and I'm excited to see what people come up with here.

It's very disappointing to see a lot of the negative comments here, almost completely around licensing, despite the licensing being well explained.

You'd think this place was full of lazy bureaucrats.


I'm not trying to be a jerk here, but screw it.

I'm sure whatever it is you've done, I'll probably use it at Github at some point. However, from my 10 second scan of the page, I know absolutely NOTHING about what this is or what it can do for me. Is it just me, or do I _constantly_ see examples of developers making the worst marketers in the world? Where are the code examples or videos showing whatever it is this is and how it will help me? There is nothing on that page but a HUGE image and a bunch of text I can't understand without 4 PhDs and half a tab of Adderall.

It's so frustrating to see stories on Hacker News that are just piss-poorly explained. You've most likely worked hard at this for a long time, so why is it you can't take 5 minutes to explain it in layman's terms for all to understand?

END RANT


This is a release of a dataset, to help people train machine learning models. That's why the blog post says "We’re also releasing a large dataset to help data scientists build models for this task."

We also provide a way to evaluate how well your machine learning model works. That's why the blog post says "We’re announcing the CodeSearchNet Challenge evaluation environment."

The released data and code (and hence, the announcement) are meant for data scientists and ML researchers who want to work on this problem, and the rest of the world does not need to care. There are no products or applications of this work at this time. If the terms in the blog post don't mean anything to you, then you are not in the target audience for this announcement, and that's OK.


They need Lin Clark[0]'s help. Explaining progress to a non technical audience is a critical skill that many developers lack.

0: https://hacks.mozilla.org/2018/10/webassemblys-post-mvp-futu...


I remember a search engine for developers called SymbolHound. It wasn't fruitful for me, but I'll never forget it.


Very cool! I wish I had more time!



