

How Google Code Search Worked - basugasubaku
http://swtch.com/~rsc/regexp/regexp4.html

======
gruseom
Google made a mistake in killing code search. Indexing the world's source code
and making it searchable is so obviously part of their core mission that I
wonder how this decision even got made.

Yeah, code search is a niche market numerically speaking, but intellectually
and economically (considering the economic impact of software) it is vital.
Google was doing so much better a job of it than anybody else that they
completely owned the space. How could that not be worth what it cost them to
run it? And now we are bereft of anything good.

I used to use Google Code Search like this: encounter an unfamiliar language
or API construct, go to code search, get page after page of real-world usages.
Or like this: wonder how to do X, imagine what code that does X might say,
search for snippets of that until hitting a working example of X. It was a
powerful learning device that I am sad to lose. I sure hope a startup comes
along to get search right for the world's source code. Github ought to, but
their search sucks.

In any case, congratulations Russ Cox et al. on a brilliant piece of work.

~~~
boyter
You mentioned hoping someone can come along to get this right. I am certainly
trying with <http://searchco.de/>

It's still a long way from being close to Google Code Search in terms of
code indexed (amending that as I write this), but I hope to get things up to
par as soon as I possibly can.

Symbolhound <http://symbolhound.com/> also has a code index that's worth a
look.

~~~
jauco
That's a seriously cool website. You're obviously still developing it, but way
to go!

~~~
boyter
Thanks. Working on increasing the index size and speed for the moment. Live
indexes will be coming up soon too.

------
singular
I sometimes wonder whether Russ Cox is actually human or is in fact a
collective pen name for a group of very talented hackers :)

~~~
cdibona
You're not alone. I like to think of him as 'future Japan Prize winner Russ
Cox'.

------
boyter
Interesting solution. I did something a little different for searchco.de when
I was implementing regex search.

Essentially I take a regex such as [cb]at and expand it out to get all of
the terms, in this case cat and bat, and then do a standard fulltext search
based on those terms to get as close a match as possible. I then loop over
the results to keep only those which are exact matches.

It's actually more complex than that, with some other trickery (infix and
prefix indexing), but it works reasonably well, although I am still ironing
out some kinks.
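
Here's a minimal sketch of that idea (my own toy, not searchco.de's code):
expandClass handles only a single leading character class, and the fulltext
engine is left out entirely.

    package main

    import (
        "fmt"
        "regexp"
    )

    // expandClass expands a leading character class, so "[cb]at"
    // becomes ["cat", "bat"]. A real expander would walk the whole
    // regex syntax tree; this toy handles only one class up front.
    func expandClass(pattern string) []string {
        m := regexp.MustCompile(`^\[([^\]]+)\](.*)$`).FindStringSubmatch(pattern)
        if m == nil {
            return []string{pattern}
        }
        var terms []string
        for _, c := range m[1] {
            terms = append(terms, string(c)+m[2])
        }
        return terms
    }

    func main() {
        pattern := "[cb]at"
        fmt.Println(expandClass(pattern)) // [cat bat] -> feed to the fulltext index

        // Second pass: keep only candidates the real regex matches.
        re := regexp.MustCompile(pattern)
        for _, doc := range []string{"the cat sat", "concatenate", "a bar chart"} {
            if re.MatchString(doc) {
                fmt.Println("exact match:", doc)
            }
        }
    }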

~~~
ww520
Interesting.

    What does the wildcard match "foo.*bar" expand to?
    How do you handle those pathological cases like "f.*r"?

Edit: HN couldn't display the wildcard char, so I reformatted it.

~~~
boyter
Much the same. Most indexers these days allow a proximity search, so you can
expand "foo.*bar" into the search "foo* << *bar".

It is less likely to pick up things like "fooabar", but assuming there are
still results like "foomanchu bar", they will be found. Also, assuming the
default for your proximity search is OR logic, you should still pick up
"foobar" eventually.

As for the other case you will naturally find all sorts of things that match.
But as with the method in the article the more information you give it the
closer a match you will find.

I don't know if the best/worst case is any better than the linked approach,
but it does work reasonably well.
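
A rough sketch of that translation, assuming a Sphinx-style strict-order
operator (<<) and ignoring the wildcard markers for brevity
(toProximityQuery is a made-up name):

    package main

    import (
        "fmt"
        "strings"
    )

    // toProximityQuery turns "A.*B" into a strict-order fulltext
    // query "A << B". Real patterns need real parsing; this only
    // splits on the literal ".*".
    func toProximityQuery(pattern string) string {
        return strings.Join(strings.Split(pattern, ".*"), " << ")
    }

    func main() {
        fmt.Println(toProximityQuery("foo.*bar")) // foo << bar
        // The fulltext engine returns candidates such as
        // "foomanchu bar"; a second pass with the real regex
        // keeps only the true matches.
    }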

------
bri3d
This is the most awesome kind of solution: built out of a few mostly
off-the-shelf moving parts, simple and easy to understand, and perfectly
suited to the problem at hand. This write-up would be awesome teaching
material for anyone moving from "I know how to write a program" to "how do I
build a clean, elegant system to solve a specific problem?"

------
arnsholt
To be fair, the reason Perl, Python and PCRE (which all use pretty much the
same regex syntax) don't use the linear-time algorithms is because they
can't. Features like look-around and matching what you've already matched
(/(\w+) \1/ to find repeated words, for example) give you more expressivity
than regular languages, but they also take away the linear-time algorithms.

~~~
fasdg
As Russ points out in his earlier RE2-related blog posts, these regex
engines still perform non-linearly even on patterns which don't involve
look-around, backreferences, etc. There's plenty of room for improvement
even if they want to keep these features.
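
For illustration, the textbook case is a pattern like (a+)+$ run against a
long string of a's that ends in a mismatch: a backtracking engine takes
exponential time, while an automaton-based engine such as Go's regexp (which
follows the RE2 approach) stays linear. A small sketch:

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    func main() {
        // "(a+)+$" against "aaa...ab" is the classic backtracking bomb:
        // a backtracking matcher tries exponentially many ways to split
        // the a's between the two + operators before giving up.
        re := regexp.MustCompile(`(a+)+$`)
        input := strings.Repeat("a", 100000) + "b"

        // Go's engine is automaton-based, so this returns (false)
        // in time linear in the input length.
        fmt.Println(re.MatchString(input))
    }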

~~~
snprbob86
Seems like the default should be linear runtime, and you should have to
explicitly ask for the richer feature set (and opt into the assertion that
you trust the input not to DoS your process).

This could easily be added as a modifier (see `man perlre`), though it
really wants to be two modifiers, so that either behavior can be demanded
explicitly and the default toggled. Randomly picking the letter N:

    /(\w+) \1/n    # Error: backreferences are incompatible with the linear-runtime RE engine
    /(\w+) \1/N    # Works!
    /(\w+) \1/     # Preferably works, for backwards compat; maybe overridden by an ENV var

~~~
mdwrigh2
I don't think you even need to go as far as adding a modifier. A smart
enough regex engine would know when it could use the linear-runtime
algorithm and when it needs to fall back.
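
Detection at compile time is certainly feasible; Go's regexp, for instance,
doesn't fall back but simply rejects the feature outright (a small
illustration, not a claim about how Perl or PCRE would handle it):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        // Go's regexp package (automaton-based) has no backtracking
        // fallback, so a backreference is a compile-time parse error.
        _, err := regexp.Compile(`(\w+) \1`)
        fmt.Println(err) // prints a parse error about `\1`
    }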

~~~
snprbob86
Hence my last example without the modifier & the comment saying it should
work.

My claim was that there should be a modifier to _demand_ a particular
performance characteristic, i.e. "I want an error if I do something stupid."

Assuming that means that a library function exists to verify that no
linearity-breaking features are used, you could also use that to validate user
input, which may be good enough.

------
bambax
This post is very interesting and informative, esp. the part about indexing
trigrams instead of words:

> _Regular expression matches do not always line up nicely on word boundaries,
> so the inverted index cannot be based on words like in the previous example.
> Instead, we can use an old information retrieval trick and build an index of
> n-grams, substrings of length n. This sounds more general than it is. In
> practice, there are too few distinct 2-grams and too many distinct 4-grams,
> so 3-grams (trigrams) it is._

As he explains, in order to perform a search using regular expressions, the
RE is first converted to an "ordinary" query string, which finds a set of
candidate documents; the candidate documents are then loaded into memory,
and the "real" regular expression is run against them, in order to find only
the matching documents.
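
A toy illustration of that two-phase scheme (the article's query generation
handles arbitrary regexps; this sketch handles only a literal string, and
the in-memory "index" is made up):

    package main

    import (
        "fmt"
        "regexp"
        "strings"
    )

    // trigrams returns the length-3 substrings of s, the unit the
    // inverted index is keyed on.
    func trigrams(s string) []string {
        var out []string
        for i := 0; i+3 <= len(s); i++ {
            out = append(out, s[i:i+3])
        }
        return out
    }

    func main() {
        docs := map[string]string{
            "a.go": "func Google() {}",
            "b.go": "func Goggles() {}",
        }

        // Phase 1: a search for "Google" becomes the conjunction
        // Goo AND oog AND ogl AND gle, answered from the index.
        query := trigrams("Google")
        fmt.Println("index query:", query)

        // Phase 2: load the surviving candidates and run the real
        // regexp over them.
        re := regexp.MustCompile(`Google`)
        for name, body := range docs {
            candidate := true
            for _, t := range query {
                if !strings.Contains(body, t) {
                    candidate = false
                    break
                }
            }
            if candidate && re.MatchString(body) {
                fmt.Println("match in", name)
            }
        }
    }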

He used Google's retrieval engine to build the trigram index, but he doesn't
say how he identified "code" amidst the ocean of ordinary web pages.

He does say this regarding an example implementation:

> _The indexer (...) rejects files that are unlikely to be interesting, such
> as those that (...) have very long lines, or that have a very large number
> of distinct trigrams._

so maybe that's what Code Search did too.

What I'm wondering is this: wouldn't it be interesting to have a full web
index based on trigrams, one that would let us search not only with regular
expressions but also with wildcards (at the end of words or at the
beginning)?

Maybe it would be too complex to build such an index for the whole web, but
for limited corpora (such as one's mail) it would be very useful.

------
toddc
Russ's article is an excellent write-up and explanation.

However, many finite-state automata regex implementations have existed for
years (e.g. for Java, <http://cs.au.dk/~amoeller/automaton>), without the
backtracking-based features, of course. Also of interest is the benchmark
data at: <http://tusker.org/regex/regex_benchmark.html>

~~~
mdwrigh2
> However, many finite-state automata regex implementations have existed for
> years

If you read his write-up on regex matching, you'll see he notes that
Thompson wrote an implementation in the mid-'60s, so he definitely doesn't
claim they're new. What he does claim is that most regex libraries don't use
them, even when the regex they're matching doesn't require backtracking.

------
staunch
I'd love some idea of how large the index was for code search, how many
machines it required, and how much total code it was searching.

------
ch
And here I had always thought Google Code Search was based on some kind of
fancy federated radix tree. Very nice design, Russ.

------
petdog
This is a really great tool. If it could take the first .csearchindex found
going up the tree as the current index (somewhat like git does with .git
dirs), it could easily top rgrep/ack for searching within projects. (Just
add line numbers and some match coloring.)
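
A small sketch of that upward walk (the findIndex helper is hypothetical,
not part of csearch):

    package main

    import (
        "fmt"
        "os"
        "path/filepath"
    )

    // findIndex walks up from dir looking for a .csearchindex file,
    // the way git discovers its .git directory.
    func findIndex(dir string) (string, bool) {
        for {
            candidate := filepath.Join(dir, ".csearchindex")
            if _, err := os.Stat(candidate); err == nil {
                return candidate, true
            }
            parent := filepath.Dir(dir)
            if parent == dir { // reached the filesystem root
                return "", false
            }
            dir = parent
        }
    }

    func main() {
        cwd, _ := os.Getwd()
        if idx, ok := findIndex(cwd); ok {
            fmt.Println("using index:", idx)
        } else {
            fmt.Println("no .csearchindex found")
        }
    }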

------
ximeng
Not only does Russ Cox write code and English really well, but he's also on
HN. Thanks for the article rsc!

<http://news.ycombinator.com/user?id=rsc>

------
dr_rezzy
I like it. The word splitting is very interesting. Today, in 2012, I would
be hard pressed to provide a reason to use this technique. Splitting an
index is a classic complexity/resource trade-off (your index has a very
predictable, compact footprint). Again, today, memory is cheap, wide,
uniform, and predictable. Indexes are now cheap and highly specialized.
Complexity can be reduced for simplicity, and index specialization becomes
natural. My point here is that this solves a class of very expensive
searches with ease: leading-wildcard searches et al. Also, I couldn't really
tell from your code (you may be doing this), but reverse your trigrams in
your generated query. If ordered properly, your search will be a lot more
efficient.
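
If the suggestion is about intersection order (my reading, which may be
wrong), the standard trick is to intersect posting lists starting from the
rarest trigram, so the candidate set shrinks as quickly as possible. A
sketch with made-up posting lists:

    package main

    import (
        "fmt"
        "sort"
    )

    // intersect merges two ascending doc-ID lists.
    func intersect(a, b []int) []int {
        var out []int
        i, j := 0, 0
        for i < len(a) && j < len(b) {
            switch {
            case a[i] < b[j]:
                i++
            case a[i] > b[j]:
                j++
            default:
                out = append(out, a[i])
                i++
                j++
            }
        }
        return out
    }

    func main() {
        // Hypothetical posting lists: trigram -> sorted doc IDs.
        index := map[string][]int{
            "Goo": {1, 2, 3, 4, 5, 6},
            "oog": {2, 3, 5, 6},
            "ogl": {3, 5},
        }
        terms := []string{"Goo", "oog", "ogl"}

        // Start from the shortest list; every later intersection
        // can only shrink the candidate set.
        sort.Slice(terms, func(i, j int) bool {
            return len(index[terms[i]]) < len(index[terms[j]])
        })

        result := index[terms[0]]
        for _, t := range terms[1:] {
            result = intersect(result, index[t])
        }
        fmt.Println(result) // [3 5]
    }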

------
dmoy
What were the criteria for determining that there were too few unique
2-grams and too many 4-grams? Did it just come down to too much memory for
the latter, and barely enough memory for 3-grams?

~~~
rsc
Compare 256^2, 256^3, 256^4.
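
(For scale: 256^2 = 65,536 possible byte 2-grams, 256^3 ≈ 16.8 million
3-grams, 256^4 ≈ 4.3 billion 4-grams. Too few distinct terms means enormous
posting lists; too many means an enormous dictionary.)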

------
brown9-2
Wow, is there anything at Google that Jeff Dean didn't have a hand in?

------
_investigator
tl;dr

The original basic RE and extended RE (when backreferencing is not used) are
significantly faster than implementations that most programmers traditionally
rave about, e.g., Perl RE.

Tell me something I didn't know.

He thus used such 30-year-old code as a model and easily topped the speeds
of the built-in RE capabilities of today's popular scripting languages.

Common sense is underrated.

~~~
rsc
Wow, everyone's a cynic. Did you miss the part about the trigram index?

It sounds like you are replying to regexp1.html, not regexp4.html.

~~~
_investigator
No. You misread. I love Russ Cox's way of thinking and his continual
generosity in sharing his work. I too use the Plan 9 base sed and grep. I'm
cynical about the people who rave on about
perl/javascript/python/ruby/whatever regexp, who derive some perverse joy in
ridiculously complex RE and who often diss sed and all things old school as
being slow or deficient in some way.

------
dennisgorelik
It's a nice design (the trigram index in particular), but the overall
product still failed.

My guess is that regular-expression search is not as useful as the full-text
search that general Google Search does.

~~~
ajasmin
Finding code with the regular Google Search is nearly impossible though.

~~~
dennisgorelik
I'm finding code with regular Google Search all the time.

Yes, general Google Search is missing some neat features, but overall those
features are not as important as the convenience of familiar general search
queries, search speed, and the size of the general Google Search index.

BTW, do you have your own explanation of why Google Code Search was cancelled?

~~~
sciurus
Saying "it failed" isn't an explanation.

