
Show HN: Search code in GitHub repos using regular expressions - danfox
https://grep.app
======
jasoncwarner
This is awesome!

@danfox, sent you an email though commenting here too.

I'm the CTO @ GitHub. Would love to talk to you about this and other things we
are building in this area at GitHub.

Feel free to email direct to jason at github.com

~~~
latenightcoding
github's code search is notoriously bad, feels like a huge missed opportunity.
Nice to see you guys reaching out to other people working in this area.

~~~
erikpukinskis
The fact that you can’t search for file names is the funniest part to me.

~~~
nickthemagicman
You can but only in the repo itself not on a site wide scale.

~~~
cole-h
Really? I searched for `filename:home.nix` (which brought me to
[https://github.com/search?utf8=%E2%9C%93&q=filename%3Ahome.n...](https://github.com/search?utf8=%E2%9C%93&q=filename%3Ahome.nix)).
That seems site-wide to me, unless I'm misunderstanding you.

~~~
giancarlostoro
These kinds of keywords really should be next to the search box with a question
mark next to them or something.

TIL some of them are on this page that you only see if you search for an empty
string:

[https://github.com/search?q=](https://github.com/search?q=)

Click on 'prefixes'. This kind of thing should be readily available from any
search box that searches through GitHub.

------
simonw
Impressive! Really fast, full featured code search across a huge corpus.

1. How did you build the index? Did you use a GitHub dump of some sort? How
often do you refresh it?

2. Is it Elasticsearch or similar or a completely custom engine?

3. What kind of RAM/CPU are you using to power it?

4. Any plans to open source the code or commercialize the technology?

I could absolutely imagine paying for a private code search engine like this
to run against a large internal company codebase spread across many
repositories.

~~~
danfox
Thanks! It's built on top of Solr. It fetches the repos from GitHub - it
should pick up any updates to repos within a few days. It's running on a
couple servers with 20 cores each, which is not really enough for the traffic
it's getting right now.

~~~
rattray
Have you seen livegrep?

Blazing fast multi-repo regex code search. May be more expensive to run in
prod, not sure.

------
dang
I still miss Google Code Search, which was a great way to find examples of
anything I wanted to learn about in programming and usually answered my
questions better than anything else, including Stack Overflow. Has it really
been 8 years...
[https://news.ycombinator.com/item?id=3112029](https://news.ycombinator.com/item?id=3112029)

If this tool can fill that hole in my world, I'll be stoked. I've bookmarked
it.

~~~
londons_explore
Google code search still exists as long as you want to search Chromium source
code.

[1]: [https://cs.chromium.org/](https://cs.chromium.org/)

~~~
londons_explore
The main difference it has IMO is it indexes a symbolic code graph extracted
from halfway through the compilation process. That means when you search, it
knows which functions are frequently called. For example, the LOG() macro is
defined in hundreds of places, but the one in logging.h is the one everyone
calls, so that's the one that comes top of the results.

It also keeps track of back references, so you can search "who calls any
function in this file", which is very hard to do with any other search system.

Major disadvantages are it only indexes one build config, so if you're
debugging android code in a multi-platform project and the indexing was done
on the windows version, you won't find much (apart from dumb text based search
which it does in addition).

The difficulty of compiling every project to build a decent index would make
this approach hard on a GitHub scale - all it takes is one missing header file
from a dependency not in the repo and the build fails and the whole project
can't be indexed. Also, have fun with things like JavaScript which are so
dynamic you have to solve the halting problem to know which bit of code calls
which other.
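
To make the "back references" idea concrete, here is a rough sketch using Python's `ast` module. It is only a syntax-level approximation, not the compiler-derived graph described above: it handles plain-name calls in a single made-up source string (`SOURCE`, `build_callers` and the sample functions are all hypothetical) and ignores methods, imports and dynamic dispatch.

```python
import ast
from collections import defaultdict

# Made-up source text standing in for one indexed file.
SOURCE = '''
def log(msg):
    print(msg)

def load(path):
    log("loading " + path)
    return path

def main():
    log("start")
    load("config.txt")
'''

def build_callers(source: str) -> dict[str, set[str]]:
    """Map each called function name to the set of functions that call it."""
    callers = defaultdict(set)
    tree = ast.parse(source)
    for func in ast.walk(tree):
        if not isinstance(func, ast.FunctionDef):
            continue
        for node in ast.walk(func):
            # Only plain-name calls like f(x) are recorded, not obj.method(x).
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                callers[node.func.id].add(func.name)
    return dict(callers)

callers = build_callers(SOURCE)
for callee in sorted(callers):
    print(callee, "is called by", sorted(callers[callee]))
```

Counting the size of each caller set would give a crude "which definition is actually used" ranking signal, in the spirit of the LOG() example.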

------
thanatos_dem
Next post from danfox - “how to get 3 job offers in 3 hours”.

Already has been publicly contacted by:

- GitHub CTO

- SerpApi CEO

- SourceGraph CEO

Search is hot right now!

~~~
swat535
Actually, it would be more like: "How I failed at 3 interviews, despite being
directly contacted by execs."

~~~
nickthemagicman
Sure, you built an app on multiple 20-core machines with functionality to
search hundreds of millions of lines of code almost instantaneously, but are
you someone I'd drink a beer with?

~~~
yakshaving_jgt
This snide remark dismisses the fact that working on software does mean
working with other humans, not just unemotional robots devoid of any kind of
irrational ideas. Being able to “drink a beer with” (and reasonably
substituting the drinking of beer for just about any other social interaction)
is an important part of being able to work with someone. Unless of course you
believe an office environment consisting of a tyrannical manager barking
orders at worker drones is a healthy relationship.

~~~
nickthemagicman
It's an ego thing to want to work with someone just like you instead of
adapting yourself to others. It's basically bro culture. It's kind of what's
wrong with technology culture.

Give me someone who is talented who makes great code so I can be home at
4:30pm and I don't care what their personality is like. Additionally someone
who tells me when something is an issue even at my ego's expense is extremely
valuable, over back patters and schmoozers who just want to keep everyone
happy. That leads to a terrible product. I would not like to see what whatever
product you're working on is like.

You all should take a long look at yourselves and ask why you have to work
with people who are just like you instead of being adaptive to other walks of
life, personality, and backgrounds. Try getting out of yourselves for a
minute. You might even learn something new outside of your own tiny tiny
worlds!

~~~
yakshaving_jgt
That’s a pretty unfortunate interpretation of my comment, and not entirely
logically consistent either.

I mean, if one person who rejects bro culture only wants to collaborate with
other people who also reject bro culture, does that mean they are now
proponents of bro culture?

I also find it frankly a bit weird for you to make grand sweeping assumptions
about who some strangers on an Internet forum choose to associate and
collaborate with. How do you know people here don’t work with people from
other backgrounds?

~~~
nickthemagicman
I found your interpretation of my original 'snide' comment pretty unfortunate.

And not a single thing you just said makes any logical sense.

I do know I would never want to work on any project that you're in charge of
because I guarantee they're nightmare environments.

Best of luck to you nonetheless.

------
sqs
This is really cool. What are you using it for? Usage examples, debugging,
etc.?

I'm the CEO at Sourcegraph (universal code search for companies to use on
their internal code). Our product is really optimized for searching a
company's internal code right now, but soon we'll start working on offering
much better search for public and open-source code as well. If you'd like to
help out or just chat, please reach out! sqs@sourcegraph.com

~~~
edwinyzh
Sorry, but his code search covers far more languages than yours the last time
I tried yours :)

~~~
akavel
Doesn't Sourcegraph allow you to just search regex over any files in a repo?
This is textual search, so how are languages relevant to it? I didn't seem to
have problems with that.

~~~
edwinyzh
Sorry, maybe I have confused SourceGraph with
[https://searchcode.com](https://searchcode.com), but last time I tried, it
supports only most widely used languages such as Java, Python and so on, but
not the language I use (Delphi/Object Pascal). I'm sorry if I'm wrong.

~~~
sqs
Sourcegraph CEO here. You can definitely search all languages (and all files,
and cross-repo, and all commits, etc.) with Sourcegraph.

~~~
edwinyzh
Great! Do you have a live demo? like the one being Showed HN?

~~~
sqs
Here ya go:
[https://sourcegraph.com/search?q=open+repo:edwinyzh+lang:pas...](https://sourcegraph.com/search?q=open+repo:edwinyzh+lang:pascal&patternType=regexp)
(search) and
[https://sourcegraph.com/github.com/edwinyzh/EditBone@d9ec56a...](https://sourcegraph.com/github.com/edwinyzh/EditBone@d9ec56affc9820940c2f96c76c996cb5cee2dbc9/-/blob/Forms/EditBone.Form.LanguageEditor.pas#L162:31&tab=references)
(find references in Pascal)

Sourcegraph.com is universal code search and navigation across all public
repositories. To use it on private code inside your company, run a self-hosted
instance at
[https://docs.sourcegraph.com/#quickstart](https://docs.sourcegraph.com/#quickstart).

We've been so focused on _internal_ code search for companies. See
[https://about.sourcegraph.com](https://about.sourcegraph.com) for some of the
logos of well-known companies whose devs all use Sourcegraph. Because of that,
our "public demo" site at Sourcegraph.com has a few limitations that we're
working on lifting, such as only searching across a subset of popular
repositories by default (unless you specify a specific subset with `repo:` in
the query).

------
franciscop
This is amazing! One thing this allows me to do, which I couldn't before, is
search for the repos that use some of my open source.

While there were some tools for this, they fall short for older projects where
using a library meant copy/pasting it into your project, which is not reported
in the CDN stats, npm installs or GitHub "uses".

Now I can run a search with a bit of code that is only present in my library
and find reliably those who copy/pasted it. While I publish my code under the
MIT, this would also be very useful for those publishing under the GPL to
detect bad actors.

------
danielecook
Wow. This is incredibly helpful. You can use it to see how someone may have
used a function with named parameters:

    
    
      my_function(label=x, option_1=2)
      my_function.*option_1 # search
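
A minimal sketch of what that query matches, run locally with Python's `re` module rather than against grep.app itself. The sample lines are made up for illustration.

```python
import re

# Hypothetical source lines standing in for search results.
lines = [
    "my_function(label=x, option_1=2)",
    "my_function(label=y)",
    "other_function(option_1=3)",
    "result = my_function(a, b, option_1=None)",
]

# Same pattern as the search above: the function name, anything, then the parameter.
pattern = re.compile(r"my_function.*option_1")

matches = [line for line in lines if pattern.search(line)]
for line in matches:
    print(line)
```

Only the calls to `my_function` that actually pass `option_1` survive the filter, which is the point of the query.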

~~~
SlowRobotAhead
That was my first thought. I’ll have to wait until tomorrow to try it, but I
have one super rarely used function in a rare package I’d love to see how other
people are using.

------
hoorayimhelping
to grep specific repos locally, I use a tool called Hound,
[https://github.com/hound-search/hound](https://github.com/hound-search/hound)
developed by a couple of engineers at Etsy while I was there, but never
released officially.

------
oefrha
A tangent, my biggest gripe with GitHub code search (within a repo) off the
top of my head is the inability to blacklist directories or only search
whitelisted directories. Often times I want to look up the implementation of a
function, and bam, three pages of results from tests.

~~~
Noctem
I'm glad I'm not the only one. It's very common that I'll be searching for a
keyword that only appears in the actual code a handful of times but hundreds
of times in tests. GitHub's search is practically useless in those cases.

I almost always just resort to cloning and searching with ripgrep, which can
be annoying if I have no other reason to have the codebase on my machine or
it's just a one-off.

------
glouwbug
Amazing. Why Microsoft hasn't built this for GitHub yet is beyond me.

Can it grep on individual repos?

~~~
funklute
Why would you want to use this tool to grep individual repos? If you know the
repo you're interested in, you can just clone it and then grep it locally...?

~~~
big_chungus
Some things can take a while to clone. On the top end, repos like blink,
webkit, and gecko can take half an hour or more.

~~~
leni536
Even with --depth=1?

------
fanf2
I wonder how this compares to Debian Code Search
([https://codesearch.debian.net/about](https://codesearch.debian.net/about))
and Russ Cox’s code search tools
([https://swtch.com/~rsc/regexp/regexp4.html](https://swtch.com/~rsc/regexp/regexp4.html)).

Obviously the source material is different (Debian packages vs GitHub repos)
and grep.app also uses re2, but that is all I can see from a look at the
“about” blurb.
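
For context, the Russ Cox article linked above describes using a trigram index to narrow down candidate files before running the real matcher. Below is a toy sketch of that idea for literal queries only. The corpus and helper names (`files`, `candidates`, `search`) are made up, and nothing here is a claim about how grep.app or Debian Code Search are actually implemented.

```python
from collections import defaultdict

def trigrams(text: str) -> set[str]:
    """All overlapping 3-character substrings of text."""
    return {text[i:i + 3] for i in range(len(text) - 2)}

# Made-up corpus: file name -> contents.
files = {
    "a.py": "def hello_world(): pass",
    "b.py": "print('goodbye')",
    "c.py": "hello = 1",
}

# Inverted index: trigram -> set of files containing it.
index = defaultdict(set)
for name, body in files.items():
    for t in trigrams(body):
        index[t].add(name)

def candidates(literal: str) -> set[str]:
    """Files containing every trigram of the literal (a superset of true matches)."""
    grams = trigrams(literal)
    if not grams:
        return set(files)  # query too short to filter on
    return set.intersection(*(index.get(t, set()) for t in grams))

def search(literal: str) -> list[str]:
    # Confirm each candidate with a real substring check.
    return sorted(f for f in candidates(literal) if literal in files[f])

print(search("hello"))
print(search("hello_world"))
```

Turning a full regular expression into a trigram query takes more work than this literal-only version; the sketch only shows the filter-then-verify shape.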

~~~
sciurus
Another related tool is

[https://searchfox.org/](https://searchfox.org/)

[https://github.com/bgrins/searchfox](https://github.com/bgrins/searchfox)

------
hartator
Excellent work!

I am the CEO at SerpApi. If you need a job, shoot me an email at julien _at_
serpapi.com.

------
nickjj
Hey Dan, if you ever wanted to come on my podcast to talk about your tech
stack (how your site is developed / deployed, lessons learned, etc.), I'd love
to have you on.

That podcast is at:
[https://runninginproduction.com/](https://runninginproduction.com/), drop me
a line at nick.janetakis@gmail.com if you're interested.

------
edwinyzh
@danfox, without revealing your tech/business secrets, I wonder if you can
share some tips about building such a search app :)

------
patrickdevivo
This is really cool! Awesome work. I assume you've seen
[https://sourcegraph.com/](https://sourcegraph.com/) as well? This to me seems
much clearer and a bit more intuitive (though I've only spent a little time in
sourcegraph). Really really cool. Does it also search code comments?

~~~
edwinyzh
last time I tried sourcegraph doesn't cover the language I use, so it's
useless to me.

~~~
akavel
For regex?? how's language relevant?

~~~
edwinyzh
Sorry, maybe I have confused SourceGraph with
[https://searchcode.com](https://searchcode.com), but last time I tried, it
supports only most widely used languages such as Java, Python and so on, but
not the language I use (Delphi/Object Pascal)

------
lol768
How did you pick the 500k repositories to index out of the 28 million or so
which are public?

~~~
danfox
It was based on the number of stars/forks and the size of the repository.

~~~
atxbcp
There must be something else or something wrong, because you indexed one of my
small repos (~100 stars, ~20 forks, ~20 MB) and not the bigger ones (~500 stars,
~100/150 forks, ~150 MB)

~~~
giovannibonetti
Maybe he is limiting it to repositories of 50 MB or less, for example.

~~~
tempay
Looking around at repositories I'm familiar with this seems to be the case.

------
TACIXAT
I do not have a great example to try on my phone, but are results
deduplicated? My big peeve with GitHub search is getting 5 pages of the
same forked repo.

~~~
danfox
There isn't any deduplication, although that will hopefully be less of an
issue at this point since there's a limited number of repositories in the
index.

------
aodj
You have no idea how often I've wanted something like this for GitHub. Thanks
so much!

------
j1elo
GitHub confirmed to me that their search is not able to match substrings;
this is annoying because if you want to find all affected code among all
possibly involved repositories, before a change, you need to clone them and
grep locally. In the end this means you need to clone absolutely everything
you work with, because otherwise you might miss changing that one repo you
didn't think of:

[https://stackoverflow.com/questions/43891605/search-partial-...](https://stackoverflow.com/questions/43891605/search-partial-words-in-github-organizations-code)

I've used Sourcegraph and it was cool; will have a look at this new tool too.
But GitHub, pretty please add plain good old grep abilities to your search!

------
w-m
Amazing feat!

Something I found when testing the regexp: the highlights seem to be off
sometimes. When grepping for '<.*?@gmail.com>' (sorry, just the first thing
that came to mind to try out the regexp), the second highlight in the first
result seems to be in the wrong location:

[https://grep.app/search?q=%3C.%2A%3F%40gmail.com%3E&regexp=t...](https://grep.app/search?q=%3C.%2A%3F%40gmail.com%3E&regexp=true)

[https://imgur.com/a/VyUXhcF](https://imgur.com/a/VyUXhcF)

------
sn4pp
Seems to be good for stuff like

api_key="[a-z0-9]+"

Ty

~~~
bananaeater
"We didn't find any matching results."

~~~
rafi_kamal
You need to enable regular expression.
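
A small local illustration of why the toggle matters, using Python's `re`. With regex mode off, the query is presumably treated as a literal string (that is an assumption about grep.app's behavior; `re.escape` is just a stand-in for it), and the sample line is made up.

```python
import re

query = 'api_key="[a-z0-9]+"'
line = 'api_key="abc123"'  # made-up sample line

# Regex mode: the character class and + are interpreted.
as_regex = re.search(query, line)

# Literal mode (assumed): every character must appear verbatim.
as_literal = re.search(re.escape(query), line)

print("regex match:", as_regex is not None)
print("literal match:", as_literal is not None)
```

The literal search only hits files that contain the text `[a-z0-9]+` itself, which is why it comes back empty for most code.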

------
ferenczy
I would say this needs a list of indexed repos and mainly an explanation of
how exactly it works to be usable (how the index is built and how often it's
refreshed, what types of files are being indexed, etc.). Otherwise, there's no
much value in searching in an unknown data, is it?

Anyway, to not only criticize, good job! It's definitely one of GitHub's
missing features. And I can imagine it's not an easy job to build something
like that. But as I wrote, it really has to be well explained to be actually
usable.

~~~
clarry
> there's no much value in searching in an unknown data, is it?

So you know exactly how Google's index works?

I think "best effort", whatever it is, is useful even if I don't know the
specifics of what it captures or misses. As long as it returns useful results.

------
tekkk
Superb work. You built a better code search than Github (well with some of its
features missing sure) with a lot less resources. Shows how stagnated the
progress in big companies is after a service is deemed "good enough". Good for
you kicking them in their butts to lead the way. Hope you get something more
out of this than HN karma.

Really like the minimalistic design, not too designy but still easy on my
eyes. Just the way I want it to let me focus on the task at hand

------
jakear
Any plans to include backrefs? I'd like to see how many examples of /(\w+) &&
\1\./ are out there in .js/.ts compared to /(\w+)\?\./

~~~
tyingq
The about blurb mentions it uses RE2. So backreferences aren't likely. See
[https://github.com/google/re2/issues/101](https://github.com/google/re2/issues/101)

~~~
jakear
Ripgrep is based on RE2 and supports backrefs. Wonder why they didn't use
that.

~~~
burntsushi
Not quite. ripgrep uses Rust's regex engine, not RE2. Rust's regex engine is
_descended_ from RE2, but there is no code sharing.

Rust's regex engine does not support backreferences. RE2 does not either.
ripgrep does however have a -P/--pcre2 flag which causes it to use PCRE2
instead of Rust's regex engine. PCRE2 supports backreferences and other
things, like look-around. (ripgrep also has an --auto-hybrid-regex flag, which
will automatically enable PCRE2 for you if you write a regex with
backreferences or look-around.)
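
For anyone wanting to try the two patterns from the question locally, here is a small sketch with Python's `re`, which is a backtracking engine and so accepts `\1`. The sample lines are made up.

```python
import re

# Backtracking engines such as Python's re accept \1; RE2 and Rust's regex do not.
guarded_access = re.compile(r"(\w+) && \1\.")
optional_chaining = re.compile(r"(\w+)\?\.")

samples = [
    "if (user && user.name) {}",   # same identifier on both sides of &&
    "if (user && admin.name) {}",  # different identifiers
    "const n = user?.name;",       # optional chaining
]

for s in samples:
    print(bool(guarded_access.search(s)), bool(optional_chaining.search(s)), s)
```

The backreference is what lets the first pattern insist that the identifier after `&&` is the same one that was tested.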

The reason not to use an engine like PCRE2 for a project like this is because
it would be trivially exposed to ReDoS:
[https://en.wikipedia.org/wiki/ReDoS](https://en.wikipedia.org/wiki/ReDoS)
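
A quick way to see the problem, sketched with Python's backtracking `re` engine as a stand-in for PCRE2-style matching. The pattern is a textbook pathological case, not something taken from grep.app, and the input sizes are kept small on purpose so the script finishes quickly.

```python
import re
import time

# Nested quantifiers: the run of "a"s can be split between the groups in many ways.
pattern = re.compile(r"(a+)+$")

results = []
for n in (14, 16, 18, 20):
    text = "a" * n + "b"  # the trailing "b" makes every split fail
    start = time.perf_counter()
    match = pattern.match(text)
    elapsed = time.perf_counter() - start
    results.append((n, match, elapsed))
    print(f"n={n:2d}  matched={match is not None}  seconds={elapsed:.4f}")
```

On a backtracking engine the time per step tends to grow much faster than the input does, so a short query from an anonymous user can tie up a core. Automata-based engines avoid this by giving up features like backreferences.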

~~~
manthideaal
Perhaps to protect against ReDoS the client should use an extended finite
automata (1).

[https://www.arl.wustl.edu/~pcrowley/a25-becchi.pdf](https://www.arl.wustl.edu/~pcrowley/a25-becchi.pdf)

(1) Extending Finite Automata to Efficiently Match Perl-Compatible Regular
Expressions.

~~~
burntsushi
Nope. That still supports backreferences, and resolving backreferences is an
NP-complete problem.[1] And I don't see anything in that paper that addresses
that. Note that there may be some versions of the problem that maybe aren't
NP-complete[2], but again, not addressed by that paper.

Besides, that paper was published 12 years ago. Where is the productionized
version of it? Or are you suggesting that the OP go spend a few years writing a
regex engine? :-) Doesn't seem like a particularly practical suggestion.

[1] -
[https://perl.plover.com/NPC/NPC-3SAT.html](https://perl.plover.com/NPC/NPC-3SAT.html)

[2] - [https://branchfree.org/2019/04/04/question-is-matching-fixed...](https://branchfree.org/2019/04/04/question-is-matching-fixed-regexes-with-back-references-in-p/)

~~~
manthideaal
In the paper there are some bounds on the number of states in the automaton
as a function of the length of the input. So one could limit the length of the
input when using back references to bound the complexity of the algorithm.
They have used their algorithm for Snort (network intrusion detection) using
an ASIC. The author could contact the authors of the paper and ask for (or pay
for) an implementation.

By the way, good work ripgrep and rust.

------
dabei
It’s interesting how it took so many years for such an obviously useful tool
to emerge. I guess hosting this is finally getting cheap enough.

~~~
edwinyzh
I've been wondering the same thing for many years. And I don't know why Google
killed Code Search

------
blackandblue
thank you so much for doing this! i hope it continues to open more doors of
opportunities to you!

primo, this is a crazy snappy proof that shows that github search can be done.
next, the UI is amazing. and finally, all my queries worked!

i am now going to remove "github search sucks" from my to-be-published rants
because this post demonstrates that 1. people care 2. github was already
working on it.

------
mrkramer
Very similar to
[https://news.ycombinator.com/item?id=18565239](https://news.ycombinator.com/item?id=18565239)

Backend for codegrep was Play framework + Elasticsearch and you could search
by programming languages.

Screenshot: [http://archive.is/0mFML](http://archive.is/0mFML)

------
edwinyzh
Awesome! To me it looks like the comeback of "Google Code Search" which I've
been missing for many years!

------
enriquto
Curious that I found many "secret forks" of my stuff, but none of my repos is
directly indexed.

~~~
justanotheratom
Can you elaborate how you found them?

~~~
enriquto
I looked for strings that I am sure only appear in my code, and I found
several copies of them, but not mine.

~~~
polyphonicist
Can you provide detailed steps to reproduce? What strings did you search? Two
examples of repos that appeared in the results? What is the link to your repo
that did not appear in the results?

Details like this would help the OP to track down the exact cause of why it
has indexed the forks but not the original repo.

~~~
enriquto
The authors are quite explicit that this site only includes a fraction of all
github repos. Thus, this is not a "bug" that needs to be corrected.

In my case, I am not talking about forks but about people who copied my files
into their repositories (with proper attribution and respecting the license).
I just searched for my surname and was happily surprised to see it in major
projects like ffmpeg, pytorch, bytedeco, scikit and opencv.

------
welder
Can I search only additions/deletions? Recently when searching GitHub I wanted
to find if anyone had replaced the usage of a deprecated method with the new
one, because the docs for that library don't mention the non-deprecated method
name.

------
yuz
Do you index the default branch of every repo? Or do you just index the master
branch?

~~~
danfox
It indexes the default branch of each repo.

~~~
yuz
Cool. Keep it up :) definitely gonna share with my co-workers.

Can't wait for filename filters which would make this the perfect solution

~~~
danfox
Thanks :) If you type into the path filter box, that'll match against the full
path for each file, so you can use that to filter on a filename.

------
cddotdotslash
The interface for this is really clean and nice - did you use a theme or
framework?

~~~
danfox
Thanks! It's using Elastic's Search UI
([https://github.com/elastic/search-ui](https://github.com/elastic/search-ui))
and Ant Design
([https://github.com/ant-design/ant-design](https://github.com/ant-design/ant-design)).

------
inetknght
I was going to say that I didn't want javascript on this.

But it's actually pretty #neat. It's all tidied up into a single app without
any dependencies.

This rocks and, so far, seems way way WAY better than Github's own search
tool.

------
bilekas
This is cool, reminds me of the vulnerability search too.

[https://shhgit.darkport.co.uk/](https://shhgit.darkport.co.uk/)

------
Existenceblinks
^(.*) '(.*)'(.*)$

I got a tooltip say:

Error: JSON.parse: unexpected character at line 1 column 1 of the JSON data

Update: Oh, ^(.*)"(.*) "(.*)$ works, and fast.

~~~
danfox
I think that error was just because the server was overloaded - sorry about
that.

------
stagas
I wish there was something this fast, but for searching error outputs instead
(along with discussions/solutions).

------
AdrianEGraphene
Feels like magic to me! Lets me easily see who's working on similar topics.
Thanks!

~~~
edwinyzh
Can you share your search string? Thanks.

------
OutsmartDan
This is one of the fastest, most responsive searches i've ever used. Great
work!

------
thrownaway954
might be a good idea to have some sort of clickable "demo" search or "try
these" example on the frontend page to show off the capabilities of this.

------
KhoomeiK
How is it that fast?

------
chasers
How do you handle expensive regex statements?

------
doubleorseven
My last name(Ament) is really rare where I come from, so I've used the tool to
find other people with the same last name. Was not disappoint. Thank you!

------
mtnGoat
this is awesome stuff, thank you! great work!

------
habit20
Hello world

------
whatever1
Why does regex still exist? It is unintuitive, requires mastering an obscure
syntax, is very hard to debug, and is very difficult to explain to others.
It feels like we are trying to write intermediate code by ourselves,
while we should have a human-readable language that generates regex.

~~~
frabert
Do it! You will find that it's very easy, but the result will either be
extremely verbose or just like regex. Since most regexes (at least for me) are
meant as one-time-use, the extra verbosity has no added benefit. If you have
complex needs, you should probably be using something other than regex
anyway.

~~~
thanatos_dem
Extremely verbose is right. Here's one such approach in java that I found last
year - [https://github.com/sgreben/regex-
builder](https://github.com/sgreben/regex-builder).

Yeah, regex can be a bit clunky at times and has a steeper learning curve, but
they're pretty industry standard at this point, and portable across languages
with a few caveats.

