
GitHub Code Search – Programmers' Goldmine - knivek
http://jakubdziworski.github.io/tools/2016/08/26/github-code-advances-search-programmers-goldmine.html
======
adamnemecek
If anyone from GitHub is reading this, I would totally pay for a much better
search. Like even $20-30/month. If you deduplicate results (it sucks when you
are searching for something that appears in a popular library) and make
it integrated with my IDE, I think that it would spike my productivity by like
a thousand percent.

~~~
dcsommer
Have you checked out [https://sourcegraph.com/](https://sourcegraph.com/)?
It has awesome code search and navigation for GitHub.

~~~
adamnemecek
I have. But they don't support the languages I care about these days, which is
mostly Swift, I guess.

~~~
sqs
Sourcegraph CEO here. Go to
[https://sourcegraph.com/beta](https://sourcegraph.com/beta) to get early
access and to vote for your favorite language. We hear you and are rapidly
adding support for new languages. :)

(BTW: Passionate about a language and want to help us build Sourcegraph
support for it, or know someone who'd be interested? Email me at
sqs@sourcegraph.com. We can sponsor you. The latest requirements for adding a
new language are not yet publicly documented, so contact us before getting
started.)

------
secure
Shameless plug: there’s also Debian Code Search at
[https://codesearch.debian.net/](https://codesearch.debian.net/). It’s based
on the same idea as Google Code Search (using regular expressions sped up by a
trigram index) and indexes everything that is available in Debian sid.
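
To make the trigram idea concrete, here is a minimal Python sketch. It is not how Debian Code Search is actually implemented: the `TrigramIndex` class and its `search(literal, pattern)` signature are made up for illustration, and the caller supplies the literal that every match must contain, whereas a real engine would derive the required trigrams from the regular expression itself.

```python
import re
from collections import defaultdict


def trigrams(text):
    """Return the set of all 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


class TrigramIndex:
    """Hypothetical toy index: trigram -> names of documents containing it."""

    def __init__(self):
        self.postings = defaultdict(set)
        self.docs = {}

    def add(self, name, text):
        self.docs[name] = text
        for tri in trigrams(text):
            self.postings[tri].add(name)

    def search(self, literal, pattern):
        """Return sorted names of documents matching the regex `pattern`.

        `literal` is a substring every match must contain (an assumption
        of this sketch). Only documents containing all of its trigrams
        are handed to the comparatively slow regex engine.
        """
        candidates = set(self.docs)
        for tri in trigrams(literal):
            candidates &= self.postings.get(tri, set())
        regex = re.compile(pattern)
        return sorted(n for n in candidates if regex.search(self.docs[n]))


index = TrigramIndex()
index.add("a.c", "int main(void) { return 0; }")
index.add("b.c", "void helper(int x);")
index.add("c.py", "def main(): pass")

# b.c is never scanned by the regex: it lacks the trigrams "mai" and "ain".
print(index.search("main", r"\bmain\s*\("))
```

The speedup comes from the set intersection: the regex only runs over the few documents that survive the trigram filter, not the whole corpus.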

If your search on GitHub isn’t fruitful, consider giving Debian Code Search a
spin.

Disclaimer: I built it. Happy to answer questions if you have any.

~~~
biokoda
It shares the biggest problem of GitHub code search: duplicate results.

~~~
secure
Partly this is because DCS is indexing Debian sid, in which multiple versions
of packages can be present (due to transitions):

$ curl -s http://ftp.ch.debian.org/debian/dists/sid/main/source/Sources.gz | zgrep '^Package: git$' | wc -l
4

If we didn’t index Debian sid, you’d only get one version per package, but
also the results wouldn’t be as fresh.

You can follow one of these two issues:
[https://github.com/Debian/dcs/issues/40](https://github.com/Debian/dcs/issues/40)
(at most 1 result per package)
[https://github.com/Debian/dcs/issues/49](https://github.com/Debian/dcs/issues/49)
(limit by suite, where you could search something else than Debian sid)

If you have any other ideas about how to address the issue, please let me
know.

------
edward
I've been using GitHub code search to let other Python programmers know about
the % string formatter.

I search for ".2f}%'.format(100" and then send a pull request to simplify the
code.

See
[https://github.com/EdwardBetts?tab=overview&from=2016-04-19](https://github.com/EdwardBetts?tab=overview&from=2016-04-19)

~~~
Patient0
And what is the simplification? I can't see it in your git repo

~~~
Pyppe
"Percentage: {:.2f}%".format(foobar * 100)
vs
"Percentage: {:.2%}".format(foobar)
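
A quick way to check that the two spellings agree is to run both on the same value. The variable names below are only for illustration:

```python
ratio = 0.1234

# Multiply by 100 by hand and append a literal percent sign.
manual = "Percentage: {:.2f}%".format(ratio * 100)

# Let the "%" presentation type scale by 100 and add the sign itself.
builtin = "Percentage: {:.2%}".format(ratio)

print(manual)
print(builtin)
```

Both lines print the same string, so the second form drops the manual multiplication without changing the output.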

~~~
hughperkins
It's actually harder to read, in my opinion, since the % character is often
used for formatting, e.g. '%s' % somevar.

Personally, I dislike such spammy PRs. It adds nothing to my code, and it's not
like accepting the PR is going to attract you as a developer on my code.

------
_virtu
This is what I've been using as of late to learn new technologies. I always
seem to forget to tell younger developers about using this approach to learn.

Something else I would add is that you should try to find well known OSS
contributors that work on projects that use tech/languages you use. If you
monitor their projects you can pick up great habits from reading through their
code.

~~~
gagagababa
Sounds like a great approach. Can you (or anyone else) recommend contributors
to follow?

~~~
morazow
I usually watch the Apache projects.

Apache Arrow
([https://github.com/apache/arrow/](https://github.com/apache/arrow/)) is a very
interesting and promising project. It is at an early stage, so you can watch the
design decisions being taken and how they are implemented. Plus it is a
multi-language project: you can find code in C++, Java and Python.

------
TheOsiris
As great as it is to be able to search source code on GitHub, it is equally
(if not more) frustrating.

I think it works fine for simple searches mentioned in the post, but anything
even slightly more advanced doesn't work very well. I also wish there was a
way to search in branches other than master

~~~
sdesol
> I also wish there was a way to search in branches other than master

This is actually an insanely hard problem to solve, given current technology.
Right now, you can either try to index a little bit of everything, which is
what GitHub and others (GitLab, Bitbucket, etc.) are doing. Or you can take
the approach that I'm taking, which is make it possible for the user to search
any branch they want, but limit the number of repositories they can search in
parallel.

GitHub and my approach are definitely targeted at two different use cases. My
background is Enterprise, which is why my search solution is based on "search
relevance", while GitHub is more focused on "search discovery", which is more
suited for finding random things.

~~~
marcinkuzminski
Interesting approach, we did something similar in RhodeCode. You can pick search
refs like branches, bookmarks and tags for each repo individually. With this you
can create a few forks and set up a different search configuration for each.

It's a workaround for searching different branches of one repo, but it works fine.

~~~
sdesol
Yeah, my indexers work on an on-demand basis as well. You can see how the
on-demand indexing technology works with Bitbucket Server in the following
video:

[https://www.youtube.com/watch?v=-VQVmh0UOnU](https://www.youtube.com/watch?v=-VQVmh0UOnU)

What may not be obvious from the video is that indexing additional branches is
extremely efficient and fast, since most branches usually share 90 - 99% of
the same content.

My indexing technology also introduces a "group" concept, which makes it
possible to share indices across similar repos. That makes indexing
hundreds of forked repos quite efficient and fast.
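
One way shared content can make extra branches cheap is to key the index on a hash of each file's content, so identical files are processed only once. The sketch below illustrates that general idea in Python; it is an assumption for illustration, not the commenter's actual technology, and the `BranchIndex` class and its methods are hypothetical.

```python
import hashlib
from collections import defaultdict


class BranchIndex:
    """Hypothetical toy index: each distinct file content is tokenized once."""

    def __init__(self):
        self.tokens_by_blob = {}           # content hash -> set of tokens
        self.branches = defaultdict(dict)  # branch -> {path: content hash}
        self.blobs_indexed = 0             # how many contents were tokenized

    def add_branch(self, branch, files):
        for path, content in files.items():
            blob_id = hashlib.sha1(content.encode("utf-8")).hexdigest()
            if blob_id not in self.tokens_by_blob:
                # Only content never seen before costs indexing work.
                self.tokens_by_blob[blob_id] = set(content.split())
                self.blobs_indexed += 1
            self.branches[branch][path] = blob_id

    def search(self, branch, token):
        """Return sorted paths on `branch` whose content contains `token`."""
        return sorted(
            path
            for path, blob_id in self.branches[branch].items()
            if token in self.tokens_by_blob[blob_id]
        )


idx = BranchIndex()
idx.add_branch("master", {"a.py": "import os", "b.py": "print hello"})
idx.add_branch("feature", {"a.py": "import os", "b.py": "print goodbye"})

# Four files were added, but a.py is identical on both branches.
print(idx.blobs_indexed)
print(idx.search("feature", "goodbye"))
```

The same keying would let forks of one repository share entries, since a fork mostly consists of content the index has already seen.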

------
ISL
It's important to pay attention to how the code is licensed before
incorporating it into your own work, lest it have ramifications in the future.

------
boyter
Totally agree. It's one of the reasons I created searchcode.com back before
github had code search and after google code search closed.

I doubled down by offering the locally installable one which I then deployed
at my workplace. It has been useful for finding how other teams implement
bluebird promises and general auditing for checked in AWS keys and the like.
Now that I think about it, it is used a lot for code auditing. Probably
something I should focus on for the next release.
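
As a rough illustration of the key-auditing use case, the Python sketch below scans file contents for strings shaped like AWS access key IDs. The `AKIA` prefix plus 16 uppercase letters or digits is a commonly used heuristic and is an assumption here, not searchcode's actual rule; the function name and the sample files are made up, and the key shown is the placeholder used in AWS documentation.

```python
import re

# Heuristic (assumed): "AKIA" followed by 16 uppercase letters or digits.
ACCESS_KEY_ID = re.compile(r"\b(AKIA[0-9A-Z]{16})\b")


def find_access_key_ids(files):
    """Return (path, line_number, key_id) for every suspicious string."""
    hits = []
    for path, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            for match in ACCESS_KEY_ID.finditer(line):
                hits.append((path, lineno, match.group(1)))
    return hits


sample_files = {
    "config.py": 'REGION = "eu-west-1"\nAWS_KEY = "AKIAIOSFODNN7EXAMPLE"\n',
    "README.md": "No secrets here.\n",
}

for hit in find_access_key_ids(sample_files):
    print(hit)
```

A pattern like this will produce false positives and miss other credential formats, so it is a starting point for review, not a complete audit.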

------
sergiotapia
Something so obvious yet I've never thought about using it like this before.
Thanks for the tip OP!

Just tried a simple search for a lib I use and found great ways to refactor my
code.

[https://github.com/search?l=elixir&q=Exq.enqueue%28Exq&ref=s...](https://github.com/search?l=elixir&q=Exq.enqueue%28Exq&ref=searchresults&type=Code&utf8=%E2%9C%93)

I'm going to use this method when learning about a new package.

------
mekazu
It can be difficult to verify the quality of code on GitHub. Even well-known
products that work really well can be implemented with awful code. I suspect
the vast majority of projects on GitHub are experimental and unfinished. Using
this feature would be very unlike a system like Stack Overflow where there are
strong incentives to provide good answers.

Still, if the user is aware of these pitfalls then it could be quite useful.
Perhaps it could be improved if GitHub provided some incentive for developers
to showcase their better work as 'known good' examples and let the community
vote on the quality, similar to the Stack Overflow model.

~~~
SonicSoul
what do you mean difficult to verify the quality of code? you are looking at
the code.

~~~
roryokane
In this context, when you look at the code you are trying to learn how to use
an unfamiliar API. So you can't verify that the code is using the API
correctly or in the most efficient, organized, and secure manner, because you
don't yet know everything the API offers.

For example, maybe the code sample has a section that formats and saves the
result to a file using the standard library. From looking at that section, you
have no way of knowing that the library offers a .saveToFile method that would
have been better to use instead. Or if you search for "mysql" examples in PHP,
you may find example usage of the hard-to-secure, deprecated
mysql_real_escape_string function instead of the PDO library that is supposed
to replace it.

------
falcolas
This is, indeed, cool. However, be aware of the legal ramifications of viewing
and using other people's code. From copyright license problems to having your
proprietary code "tainted" (it's a legal concept that seems to say that you
can infringe on IP even if you don't copy the code directly, something
management at Microsoft apparently thinks about regularly), there's a lot to
be aware of in today's litigious environment.

------
sudeepj
One of the things I used it for was to see how a specific API is used in
production ready code. It was of immense help.

------
dschiptsov
Searching for code without understanding could lead one astray. Even such
places as Rosetta Code host buggy code.

[http://karma-engineering.com/lab/blog/DoNotCopyPasteShit](http://karma-engineering.com/lab/blog/DoNotCopyPasteShit)

~~~
friendlygrammar
Don't go from one extreme, only using docs, to another. Code examples provide
context to the documentation. They should be used together.

------
bombita
I still think that Google's Chromium code search is by far the best
one I've seen. Things like following refs everywhere and seeing git blame and
related files make it a hell of a lot better than GitHub.
[https://codesearch.chromium.org/chromium/src/components/inva...](https://codesearch.chromium.org/chromium/src/components/invalidation/impl/non_blocking_invalidator.cc)

~~~
secure
FYI, part of the tech stack behind that is open source:
[https://www.kythe.io/](https://www.kythe.io/)

------
stinos
97 results is doable. And that's what you get for a Scala search today, which
is maybe why the author picked it in the first place. There are a ton of other
real-life scenarios as well, and for me, more often than not, the number of
results ranges from a couple of hundred to half a million. The only sort of
sane way to filter that for quality is specifying the number of forks, but it's
a bit hit or miss sometimes. Now I know this is a hard problem to solve, so
I'm not really blaming GitHub, but maybe they should try to come up with
something. Oh, and this also illustrates that the 'goldmine' term is spot on:
yes, it's gold, but you have to dig really hard sometimes.

------
felipesabino
You can do even more now that all github public data is publicly available on
Google Big Query [1] [2], where you can use not only SQL commands but also
code (JS) [3] to search for anything on github.

There was a small lag between repo code and the dump done on Big Query (some
days) last time I checked, but it seems that the delay has been getting
smaller and smaller over time.

[1] [https://cloudplatform.googleblog.com/2016/06/GitHub-on-BigQu...](https://cloudplatform.googleblog.com/2016/06/GitHub-on-BigQuery-analyze-all-the-open-source-code.html)

[2] [https://changelog.com/209/](https://changelog.com/209/)

[3] [https://cloud.google.com/bigquery/user-defined-functions](https://cloud.google.com/bigquery/user-defined-functions)

------
nv-vn
Having it parse symbols makes many code excerpts ungoogleable (or
un-Githubable in this case).

------
sunilkumarc
Thanks for the post. I didn't know about this GitHub feature before. I just
found out about new ways to initialize a Sequelize object in a Node.js project
I have been working on, which I couldn't find in their documentation easily.

------
arekkas
have fun searching for anything that isn't [a-zA-Z0-9\.]+

------
eibrahim
Damn!!! I have been coding for almost 20 years and google everything and never
thought of searching github. I can't believe I have never thought of that.
Thanks for the tip.

------
BINARYcreature
I'd love an IDE that auto pulled code snippets in based on what I was working
on. You could use the file name, algo pattern, etc. Would be hard to get right
but worth it.

------
m0atz
This is a great idea and well presented. I often like to see how others
implement a particular API and for what purpose, which in turn keeps me
developing new ideas.

------
rollinDyno
More than a programmer's goldmine, this is Github's. Stack Overflow is trying
to become the go-to for sample code and documentation, Github could have a lot
to offer in a partnership.

~~~
JustSomeNobody
Considering most devs copy and paste code from SO, I would say they don't need
github at all.

------
ac123
Looking at the comments on the original article. Seems like it doesn't work
anymore? Any ideas?

------
achen2345
I have been using GitHub code search this past week to extend my JavaScript
parser/beautifier to also support TypeScript, C#, Java.

------
inaprovaline
Try [ "FileIO.fromPath(" filetype:scala ] on Google :)

------
DyslexicAtheist
It would be interesting to hook that up with a copy-paste detector that takes
the repo's LICENSE file into account.

------
hclgckxjtxjfxur
I wonder how feasible it would be to implement a higher order programming
language based on searching code snippets. If it could be made to work, it
could potentially speed up development time by an order of magnitude.

~~~
JustSomeNobody
Order of magnitude? How could anyone possibly know this if it doesn't exist?

It could just as easily turn out to be no better than CASE or Rational or any
other junk that's come before.

~~~
hclgckxjtxjfxur
Well, theoretically, it would make it possible to abstract away writing
traditional fine-grained code, just like our languages today abstracted away
assembly, which in turn abstracted away flipping mechanical switches.

