GitHub Code Search – Programmers' Goldmine (jakubdziworski.github.io)
393 points by knivek on Aug 28, 2016 | 56 comments

If anyone from GitHub is reading this, I would totally pay for a much better search, even $20-30/month. If you deduplicated results (it sucks when you're searching for something that appears in a popular library) and integrated it with my IDE, I think it would boost my productivity by something like a thousand percent.

Have you checked out https://sourcegraph.com/ ? It has awesome code search and navigation for GitHub.

I have. But they don't support the languages I care about these days, which is mostly Swift, I guess.

Sourcegraph CEO here. Go to https://sourcegraph.com/beta to get early access and to vote for your favorite language. We hear you and are rapidly adding support for new languages. :)

(BTW: Passionate about a language and want to help us build Sourcegraph support for it, or know someone who'd be interested? Email me at sqs@sourcegraph.com. We can sponsor you. The latest requirements for adding a new language are not yet publicly documented, so contact us before getting started.)

I kind of expected that you would be able to search all of GitHub there, but it seems to be limited to selected libraries and your own repos? Also, I'm surprised that C is not supported. Personally, I'm using GitHub's code search to discover dumps of code and docs for random Chinese hardware (SBCs and such). I've found it much easier than trying to find SDKs on Baidu's file sharing site (Baidu Pan).

Sourcegraph CEO here. We need to do a bunch of language-specific stuff to make search work well for finding definitions and usage examples, so we've scoped it for now. But we're almost ready to add more languages and index all repositories; check https://sourcegraph.com/beta to get early access and to +1 your preferred languages. This is the #1 request we get, and we can't wait to have Sourcegraph work for all code and more languages. :)

Shameless plug: there’s also Debian Code Search at https://codesearch.debian.net/. It’s based on the same idea as Google Code Search (using regular expressions sped up by a trigram index) and indexes everything that is available in Debian sid.

If your search on GitHub isn’t fruitful, consider giving Debian Code Search a spin.

Disclaimer: I built it. Happy to answer questions if you have any.
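For anyone curious how a trigram index speeds up regex search, here is a minimal Python sketch of the idea. It is not how Debian Code Search is actually implemented: the function names and sample files are made up, and the caller passes in the literal substring by hand, whereas a real engine derives the trigram query from the regex automatically.

```python
import re
from collections import defaultdict

def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

def build_index(docs):
    """Map each trigram to the set of document names containing it."""
    index = defaultdict(set)
    for name, body in docs.items():
        for gram in trigrams(body):
            index[gram].add(name)
    return index

def search(docs, index, literal, pattern):
    """Shortlist docs via the trigram index, then confirm with the regex.

    `literal` is a substring every match must contain (an assumption of
    this sketch: the caller supplies it instead of deriving it from the regex).
    """
    candidates = set(docs)
    for gram in trigrams(literal):
        candidates &= index.get(gram, set())
    regex = re.compile(pattern)
    return sorted(name for name in candidates if regex.search(docs[name]))

# Hypothetical sample corpus.
docs = {
    "a.c": "int main(void) { return 0; }",
    "b.py": "def main():\n    return 0",
    "c.txt": "nothing to see here",
}
index = build_index(docs)
hits = search(docs, index, "main", r"main\(\w*\)")
print(hits)
```

The point is that the expensive regex only runs on files that contain every trigram of the literal, so most of the corpus is never scanned.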

It shares the biggest problem of GitHub code search: duplicate results.

Partly this is because DCS is indexing Debian sid, in which multiple versions of packages can be present (due to transitions):

$ curl -s http://ftp.ch.debian.org/debian/dists/sid/main/source/Source... | zgrep '^Package: git$' | wc -l
4

If we didn’t index Debian sid, you’d only get one version per package, but also the results wouldn’t be as fresh.

You can follow one of these two issues:

https://github.com/Debian/dcs/issues/40 (at most 1 result per package)

https://github.com/Debian/dcs/issues/49 (limit by suite, where you could search something other than Debian sid)

If you have any other ideas about how to address the issue, please let me know.

Thanks for the project :) One question: is it actively maintained? I ask because I see "2012-2014 Debian Code Search" in the footer, so I just wanted to check!

thanks for the confirmation & details.

I've been using GitHub code search to let other Python programmers know about the % string formatter.

I search for ".2f}%'.format(100" and then send a pull request to simplify the code.

See https://github.com/EdwardBetts?tab=overview&from=2016-04-19
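For reference, the kind of rewrite that search targets looks roughly like this. It is a small sketch with a made-up value (`ratio` is just a placeholder); the `%` presentation type in Python's format spec multiplies by 100 and appends the percent sign itself.

```python
ratio = 0.25  # hypothetical value, chosen so rounding is not an issue

# Multiply by 100 by hand, format as fixed-point, append a literal percent sign.
manual = "Percentage: {:.2f}%".format(ratio * 100)

# The `%` presentation type multiplies by 100 and appends the sign itself.
builtin = "Percentage: {:.2%}".format(ratio)

print(manual)
print(builtin)
```

Both lines print the same string for this value.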

This makes me think... maybe this could be generalized into a linter-as-a-service that occasionally submits pull requests for obvious style wins. Maybe just commenting on existing pull requests to be less annoying (and because there's a better chance it can be fixed by the issue opener).

And what is the simplification? I can't see it in your git repo

"Percentage: {:.2f}%".format(foobar * 100) vs "Percentage: {:.2%}".format(foobar)

It's actually harder to read, in my opinion, since the % character is often used for formatting, e.g. '%s' % somevar.

Personally, I dislike such spammy PRs. It adds nothing to my code, and it's not like accepting the PR is going to attract you as a developer on my code.

This is what I've been using as of late to learn new technologies. I always seem to forget to tell younger developers about using this approach to learn.

Something else I would add is that you should try to find well-known OSS contributors who work on projects that use the tech/languages you use. If you monitor their projects, you can pick up great habits from reading through their code.

Sounds like a great approach. Can you (or anyone else) recommend contributors to follow?

I usually watch the Apache projects.

Apache Arrow (https://github.com/apache/arrow/) is a very interesting and promising project. It is at an early stage, so you can watch the design decisions being taken and how they are implemented. Plus, it is a multi-language project: you can find code in C++, Java and Python.

I've found that the following authors have great practices for managing OSS projects:

- https://github.com/mbostock

- https://github.com/tj

As great as it is to be able to search source code on GitHub, it is equally (if not more) frustrating.

I think it works fine for the simple searches mentioned in the post, but anything even slightly more advanced doesn't work very well. I also wish there was a way to search in branches other than master.

> I also wish there was a way to search in branches other than master

This is actually an insanely hard problem to solve, given current technology. Right now, you can either try to index a little bit of everything, which is what GitHub and others (GitLab, Bitbucket, etc.) are doing, or you can take the approach that I'm taking, which is to make it possible for the user to search any branch they want, but limit the number of repositories they can search in parallel.

GitHub and my approach are definitely targeted at two different use cases. My background is Enterprise, which is why my search solution is based on "search relevance", while GitHub is more focused on "search discovery", which is more suited for finding random things.

Interesting approach. We did something similar in RhodeCode. You can pick search refs like branches, bookmarks and tags for each repo individually. With this you can create a few forks and configure a different search setup for each of them.

It's a workaround for searching different branches of one repo, but it works fine.

Yeah, my indexers work on an on-demand basis as well. You can see how the on-demand indexing technology works with Bitbucket Server in the following video:


What may not be obvious from the video is that indexing additional branches is extremely efficient and fast, since most branches usually share 90-99% of the same content.

My indexing technology also introduces a "group" concept, which makes it possible to share indices across similar repos, so indexing hundreds of forked repos is quite efficient and fast.
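To illustrate why shared content makes extra branches cheap, here is a toy Python sketch that indexes file bodies by content hash, so a second branch only costs the files that actually differ. This is purely an assumption about how such sharing could work, not the indexer described above; `BranchIndex` and the sample branches are hypothetical.

```python
import hashlib

class BranchIndex:
    """Toy index that stores each distinct file body once, keyed by its hash."""

    def __init__(self):
        self.blobs = {}      # content hash -> file text (indexed once)
        self.branches = {}   # branch name -> {path: content hash}

    def add_branch(self, branch, files):
        """Index a branch and return how many blobs were newly indexed."""
        new_blobs = 0
        tree = {}
        for path, text in files.items():
            digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
            if digest not in self.blobs:
                self.blobs[digest] = text
                new_blobs += 1
            tree[path] = digest
        self.branches[branch] = tree
        return new_blobs

    def search(self, branch, needle):
        """Return the sorted paths on `branch` whose content contains `needle`."""
        tree = self.branches[branch]
        return sorted(p for p, d in tree.items() if needle in self.blobs[d])

# Hypothetical branches that differ in a single file.
index = BranchIndex()
master = {"a.py": "print('a')", "b.py": "print('b')", "c.py": "print('c')"}
feature = {**master, "c.py": "print('changed')"}
first = index.add_branch("master", master)
second = index.add_branch("feature", feature)
print(first, second)
```

Here the first branch indexes three blobs and the second only one, which is the same effect as branches sharing most of their content, just at a tiny scale.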

There's also the limitation that some characters [1] are ignored, which makes it improbable you'll find what you're looking for.

Apparently this is related to how the search index works, but it's cumbersome nevertheless.

[1] https://help.github.com/articles/searching-code/#considerati...

It's important to pay attention to how the code is licensed before incorporating it into your own work, lest it have ramifications in the future.

Totally agree. It's one of the reasons I created searchcode.com, back before GitHub had code search and after Google Code Search closed.

I doubled down by offering a locally installable version, which I then deployed at my workplace. It has been useful for finding how other teams implement bluebird promises and for general auditing for checked-in AWS keys and the like. Now that I think about it, it is used a lot for code auditing. That's probably something I should focus on for the next release.
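As a rough idea of what that kind of audit looks like, here is a small Python sketch that scans file contents for strings shaped like AWS access key IDs. It is not how searchcode does it; the pattern assumes long-term key IDs are 20 characters starting with "AKIA", and the sample files (using the example key from AWS's own documentation) are made up.

```python
import re

# Assumed pattern: long-term AWS access key IDs are 20 characters
# starting with "AKIA", followed by uppercase letters and digits.
AWS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def find_key_ids(files):
    """Return (filename, line number, match) for every suspected key ID."""
    findings = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in AWS_KEY_ID.finditer(line):
                findings.append((name, lineno, match.group()))
    return findings

# Hypothetical checked-in files.
files = {
    "config.py": 'REGION = "eu-west-1"\nKEY = "AKIAIOSFODNN7EXAMPLE"\n',
    "README.md": "No secrets here.\n",
}
findings = find_key_ids(files)
print(findings)
```

A match is only a candidate, so anything it flags still needs a human look before you rotate keys.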

Something so obvious yet I've never thought about using it like this before. Thanks for the tip OP!

Just tried a simple search for a lib I use and found great ways to refactor my code.


I'm going to use this method when learning about a new package.

It can be difficult to verify the quality of code on GitHub. Even well-known products that work really well can be implemented with awful code. I suspect the vast majority of projects on GitHub are experimental and unfinished. Using this feature would be very unlike a system like Stack Overflow, where there are strong incentives to provide good answers.

Still, if the user is aware of these pitfalls then it could be quite useful. Perhaps it could be improved if GitHub provided some incentive for developers to showcase their better work as 'known good' examples and let the community vote on the quality, similar to the Stack Overflow model.

What do you mean, difficult to verify the quality of code? You are looking at the code.

In this context, when you look at the code you are trying to learn how to use an unfamiliar API. So you can't verify that the code is using the API correctly or in the most efficient, organized, and secure manner, because you don't yet know everything the API offers.

For example, maybe the code sample has a section that formats and saves the result to a file using the standard library. From looking at that section, you have no way of knowing that the library offers a .saveToFile method that would have been better to use instead. Or if you search for "mysql" examples in PHP, you may find example usage of the hard-to-secure, deprecated mysql_real_escape_string function instead of the PDO library that is supposed to replace it.

This is, indeed, cool. However, be aware of the legal ramifications of viewing and using other people's code. From copyright license problems to having your proprietary code "tainted" (it's a legal concept that seems to say that you can infringe on IP even if you don't copy the code directly, something management at Microsoft apparently thinks about regularly), there's a lot to be aware of in today's litigious environment.

One of the things I used it for was to see how a specific API is used in production ready code. It was of immense help.

Searching for code without understanding it could lead one astray. Even places such as Rosettacode host buggy code.


Don't go from one extreme, only using docs, to another. Code examples provide context to the documentation. They should be used together.

Moreover, there are occasions when snippets from highly upvoted (and even accepted) answers on Stack Overflow should not be copied either.

I still think that Google's Chromium code search is by far the best one I've seen. Things like following refs everywhere and seeing git blame and related files make it a hell of a lot better than GitHub. https://codesearch.chromium.org/chromium/src/components/inva...

FYI, part of the tech stack behind that is open source: https://www.kythe.io/

97 results is doable, and that's what you get for a Scala search today, which is maybe why the author picked it in the first place. There are a ton of other real-life scenarios as well, and for me, more often than not, the number of results ranges from a couple of hundred to half a million. The only sort of sane way to filter that for quality is specifying the number of forks, but it's a bit hit or miss sometimes. Now, I know this is a hard problem to solve, so I'm not really blaming GitHub, but maybe they should try to come up with something. Oh, and this also illustrates that the 'goldmine' term is spot on: yes, it's gold, but you have to dig really hard sometimes.

You can do even more now that all GitHub public data is publicly available on Google BigQuery [1] [2], where you can use not only SQL commands but also code (JS) [3] to search for anything on GitHub.

There was a small lag (some days) between the repo code and the dump on BigQuery last time I checked, but it seems that the delay has been getting smaller and smaller over time.

[1] https://cloudplatform.googleblog.com/2016/06/GitHub-on-BigQu...

[2] https://changelog.com/209/

[3] https://cloud.google.com/bigquery/user-defined-functions

Having it parse symbols makes many code excerpts ungoogleable (or un-Githubable in this case).

Thanks for the post. I didn't know about this feature of GitHub before. I just found out about new ways to initialize a Sequelize object in a Node.js project I have been working on, which I couldn't find easily in their documentation.

have fun searching for anything that isn't [a-zA-Z0-9\.]+

Damn!!! I have been coding for almost 20 years and google everything and never thought of searching github. I can't believe I have never thought of that. Thanks for the tip.

I'd love an IDE that auto-pulled code snippets in based on what I was working on. You could use the file name, algo pattern, etc. It would be hard to get right but worth it.

This is a great idea and well presented. I often like to see how others implement a particular API and for what purpose, which in turn keeps me developing new ideas.

More than a programmer's goldmine, this is GitHub's. Stack Overflow is trying to become the go-to for sample code and documentation, and GitHub could have a lot to offer in a partnership.

Considering most devs copy and paste code from SO, I would say they don't need GitHub at all.

Looking at the comments on the original article, it seems like it doesn't work anymore? Any ideas?

I have been using GitHub code search this past week to extend my JavaScript parser/beautifier to also support TypeScript, C#, Java.

Try [ "FileIO.fromPath(" filetype:scala ] on Google :)

It would be interesting to hook that up with a copy-paste detector that takes the repo's LICENSE file into account.

I wonder how feasible it would be to implement a higher order programming language based on searching code snippets. If it could be made to work, it could potentially speed up development time by an order of magnitude.

Order of magnitude? How could anyone possibly know this if it doesn't exist?

It could just as easily turn out to be no better than CASE or Rational or any other junk that's come before.

Well, theoretically, it would make it possible to abstract away writing traditional fine-grained code, just like our languages today abstracted away assembly, which in turn abstracted away flipping mechanical switches.
