Hacker News
How Google Code Search Worked (swtch.com)
287 points by basugasubaku on Jan 19, 2012 | 45 comments

Google made a mistake in killing code search. Indexing the world's source code and making it searchable is so obviously part of their core mission that I wonder how this decision even got made.

Yeah, code search is a niche market numerically speaking, but intellectually and economically (considering the economic impact of software) it is vital. Google was doing so much better a job of it than anybody else that they completely owned the space. How could that not be worth what it cost them to run it? And now we are bereft of anything good.

I used to use Google Code Search like this: encounter an unfamiliar language or API construct, go to code search, get page after page of real-world usages. Or like this: wonder how to do X, imagine what code that does X might say, search for snippets of that until hitting a working example of X. It was a powerful learning device that I am sad to lose. I sure hope a startup comes along to get search right for the world's source code. Github ought to, but their search sucks.

In any case, congratulations to Russ Cox et al. on a brilliant piece of work.

From Steve Yegge's most recent blog rant:

Now, as it happens, I am in fact working on a very cool project at Google. [...] a project that aims to turn source code -- ALL source code -- from plain text into Wikipedia.

Ah. Well, the decision to kill Code Search would make sense if it were in favour of something better. But then why kill it now and leave nothing for any length of time? Also, there's no guarantee the new thing will turn out to actually be better.

For a company that succeeded partly by leveraging the economic value of hackers in a way that hadn't been done before, this decision is disturbingly out of character. It feels like something that must have happened for inward-facing political reasons - in other words, a sign of rot.

I get that Steve told Larry to focus, but "code" and "search" almost define focus in their case.

Edit: It seems the service is still available under a different URL. Weird, but I'll happily take it! http://news.ycombinator.com/item?id=3487950

Historically, there were two types of projects at Google: the one that's deprecated and the one that doesn't work yet. It seems they have amended that slightly so now it's break-before-make during a migration instead of the other way around.

You mentioned hoping someone can come along to get this right. I am certainly trying with http://searchco.de/

It's still a long way from Google Code Search in terms of code indexed (I'm amending that as I write this), but I hope to bring things up to par as soon as I possibly can.

SymbolHound http://symbolhound.com/ also has a code index that's worth a look.

That's a seriously cool website. You're obviously still developing it, but way to go!

Thanks. Working on increasing the index size and speed for the moment. Live indexes will be coming up soon too.

SymbolHound developer here, thanks for the mention boyter!

No problem at all. The more people attacking the problem the better off we all are.

Did it stop working internally? Do Google employees still get to use it?

In the long term, this will be search only over code.google.com projects, as mentioned in the 'discussion' link on codesearch.google.com.

So it was too good to be true then. :(

Oh, thank you. This is my favourite form of ego-surfing. It's so nice to see my code, uploaded by someone else, modified and used in different ways. It's nice to think it could outlive me.

Wow. If this really is the same service that was available before - which it looks like it is - then thanks!

p.s. You should submit this as a story in its own right.

I sometimes wonder whether Russ Cox is actually human or is in fact a collective pen name for a group of very talented hackers :)

You're not alone. I like to think of him as 'Future Japan Prize winner Russ Cox'

Interesting solution. I did something a little different for searchco.de when I was implementing regex search.

Essentially I take a regex such as [cb]at and expand it out to get all of the terms, in this case cat and bat, and then do a standard full-text search based on those terms to get as close a match as possible. I then loop over the results to keep only those which are exact matches.

It's actually more complex than that, with some other trickery (infix and prefix indexing), but it works reasonably well, although I am still ironing out some kinks.
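A minimal sketch of that two-pass idea (the helper names are mine, the expansion only handles literals and simple character classes, and the "full-text search" is simulated with a linear substring scan rather than a real index):

```python
import itertools
import re

def expand_char_classes(pattern):
    """Expand a pattern of literals and simple character classes
    (e.g. "[cb]at") into every literal string it can match."""
    parts = re.findall(r'\[([^\]]+)\]|(.)', pattern)
    choices = [cls if cls else ch for cls, ch in parts]
    return [''.join(combo) for combo in itertools.product(*choices)]

def two_pass_search(pattern, documents):
    terms = expand_char_classes(pattern)
    # Pass 1: cheap term lookup -- keep documents containing any term.
    candidates = [d for d in documents if any(t in d for t in terms)]
    # Pass 2: confirm with the real regex to drop false positives.
    exact = re.compile(pattern)
    return [d for d in candidates if exact.search(d)]
```

For example, `expand_char_classes("[cb]at")` yields `["cat", "bat"]`, and the second pass then keeps only documents the original regex really matches.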


  What does the wildcard match "foo.*bar" expand to?
  How do you handle those pathological cases like "f.*r"?
Edit: HN can't display the wildcard char. Reformatted it.

Much the same. Most indexers these days allow a proximity search, so you can expand "foo.*bar" into the search "foo << *bar".

It is less likely to pick up things like "fooabar" but assuming there are still results like "foomanchu bar" they will be found. Also assuming the default for your proximity search is OR logic you should still pick up "foobar" eventually.

As for the other case you will naturally find all sorts of things that match. But as with the method in the article the more information you give it the closer a match you will find.

I don't know if the best/worst case is any better than the linked approach, but it does work reasonably well.

This is the most awesome kind of solution: built off of few mostly-off-the-shelf moving parts, simple and easy to understand, and entirely perfect for the problem at hand. This write-up would be awesome teaching material for anyone moving from "I know how to write a program" to "how do I build a clean, elegant system to solve a specific problem?"

To be fair, the reason Perl, Python and PCRE (which all use pretty much the same regex syntax) don't use the linear-time algorithms is because they can't. Features like look-around and matching what you've already matched (/(\w+) \1/ to find repeated words, for example) give you more expressivity than regular languages, but they also take away the linear-time algorithms.
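For instance, in Python (a backtracking engine) the repeated-word pattern works precisely because of the backreference:

```python
import re

# \1 must match exactly the text that group 1 captured -- a backreference.
# Repeated words form a non-regular language, which is why a pure
# automaton (linear-time) engine cannot support this pattern.
repeated = re.compile(r'\b(\w+) \1\b')

match = repeated.search("it was the the best of times")
print(match.group(0))  # -> "the the"
```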

As Russ points out in his earlier re2-related blog posts, these regex engines still perform non-linearly on inputs which don't involve look-around, look-behind, etc. There's plenty of room for improvement even if they want to keep these features.

Seems like the default should be linear runtime and you should have to explicitly ask for the richer feature set (and opt into the assertion that you trust the input not to DoS your process).

It could easily be added as a modifier (see `man perlre`), but it should really be two modifiers, so you can be explicit in either direction and the default can be toggled. Randomly picking the letter N:

    /(\w+) \1/n    # Error: look-behind is incompatible with linear runtime RE engine
    /(\w+) \1/N    # Works!
    /(\w+) \1/     # Preferably works, for backwards compat; maybe overridden by an ENV var

I don't think you need to even go as far as adding a modifier. A smart enough regex engine would know when it could use the linear runtime algorithm, and when it needs to fall back.

Hence my last example without the modifier & the comment saying it should work.

My claim was that there should be a modifier to demand a particular performance characteristic. i.e. "I want an error if I do something stupid"

Assuming that means that a library function exists to verify that no linearity-breaking features are used, you could also use that to validate user input, which may be good enough.
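A rough sketch of such a validator, assuming a simple scan of the pattern text is good enough (the function name and feature list are illustrative, not from any real library):

```python
import re

# Constructs that force a backtracking implementation in most engines:
# backreferences (\1 .. \9) and lookaround groups ((?=, (?!, (?<=, (?<!).
_NONLINEAR = re.compile(r'\\[1-9]|\(\?<?[=!]')

def assert_linear(pattern):
    """Raise ValueError if the pattern uses constructs incompatible
    with a linear-time (automaton-based) regex engine."""
    if _NONLINEAR.search(pattern):
        raise ValueError("pattern requires backtracking: %r" % pattern)
    return pattern
```

So `assert_linear(r"(\w+) \w+")` passes, while `assert_linear(r"(\w+) \1")` and `assert_linear(r"foo(?=bar)")` are rejected before they ever reach the engine.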

But the user may not necessarily know it.

This post is very interesting and informative, esp. the part about indexing trigrams instead of words:

> Regular expression matches do not always line up nicely on word boundaries, so the inverted index cannot be based on words like in the previous example. Instead, we can use an old information retrieval trick and build an index of n-grams, substrings of length n. This sounds more general than it is. In practice, there are too few distinct 2-grams and too many distinct 4-grams, so 3-grams (trigrams) it is.

As he explains, in order to perform a search using regular expressions, the RE is first converted to an "ordinary" query string, which finds a set of candidate documents; the candidate documents are then loaded in memory, and the "real" regular expression run against them, in order to find only the matching documents.
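A toy sketch of that two-phase scheme (the names are mine, and in the real design the required trigrams are derived automatically from the regex, whereas here the known-required literal is passed in by hand):

```python
import re
from collections import defaultdict

def trigrams(s):
    """All length-3 substrings of s."""
    return {s[i:i+3] for i in range(len(s) - 2)}

def build_index(documents):
    """Inverted index: trigram -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in enumerate(documents):
        for t in trigrams(text):
            index[t].add(doc_id)
    return index

def index_search(regex, literal, index, documents):
    # Phase 1: a candidate must contain every trigram of a literal
    # the regex is known to require (supplied by hand here).
    candidates = set(range(len(documents)))
    for t in trigrams(literal):
        candidates &= index.get(t, set())
    # Phase 2: run the real regex over the candidates only.
    pat = re.compile(regex)
    return sorted(d for d in candidates if pat.search(documents[d]))
```

Phase 1 never touches document text, only posting lists; the expensive regex scan in phase 2 runs against the (hopefully small) candidate set.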

He used Google's retrieval engine to build the trigram index, but he doesn't say how he identified "code" amid the ocean of ordinary web pages.

He does say this regarding an example implementation:

> The indexer (...) rejects files that are unlikely to be interesting, such as those that (...) have very long lines, or that have a very large number of distinct trigrams.

so maybe that's what Code Search did too.

What I'm wondering is this: wouldn't it be interesting to have a full web index based on trigrams, that would let us search not only using RE but also using wildcards (at the end of words or at the beginning)?

Maybe it would be too complex to build such an index for the whole web, but for limited corpora (such as one's mail) it would be very useful.
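Sketching that idea: a simple wildcard query reduces to the same trigram trick, since every match must contain the trigrams of each literal fragment (the function name is illustrative):

```python
def wildcard_query(pattern):
    """Map a simple wildcard query ("hel*", "*llo", "he*lo") to the
    set of trigrams every match must contain."""
    required = set()
    for part in pattern.split('*'):
        # Fragments shorter than 3 chars contribute no trigrams.
        required |= {part[i:i+3] for i in range(len(part) - 2)}
    return required
```

For example, `wildcard_query("hello*")` requires the trigrams `hel`, `ell`, `llo`; intersecting their posting lists gives the candidate set, exactly as in the regex case. Fragments of fewer than three characters contribute nothing, which is one reason very short wildcard queries stay expensive.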

Russ's articles are an excellent write-up and explanation.

However, many finite-state automata regex implementations have existed for years (e.g. Java http://cs.au.dk/~amoeller/automaton) without the backtracking feature, of course. Also of interest is the benchmark data at: http://tusker.org/regex/regex_benchmark.html

> However, many finite-state automata regex implementations have existed for years

If you read his write-up on RegEx matching, you'll see notes that Thompson wrote an implementation in the mid-60s, so he definitely doesn't claim they're new. What he does claim is that most regex libraries don't use them, even when the regex they're matching to doesn't require backtracking.

I'd love some idea of how large the index was for code search, how many machines it required, and how much total code it was searching.

And here I had always thought Google Code Search was based on some kind of fancy federated radix-tree. Very nice design Russ.

This is a really great tool. If it could take the first .csearchindex going up the tree as the current index (somewhat like git does with .git dirs), it could easily top rgrep/ack for searching into projects. (just add line numbers and some match coloring)

Not only does Russ Cox write code and English really well, but he's also on HN. Thanks for the article rsc!


I like it. The word splitting is very interesting.

Today, in 2012, I would be hard pressed to provide a reason to use this technique. Splitting an index is a classic complexity/resource trade-off (your index has a very predictable, compact footprint). But today memory is cheap, wide, uniform, and predictable; indexes are now cheap and highly specialized, so complexity can be reduced in favour of simplicity, and index specialization becomes natural. My point here is that this solves a class of very expensive searches, such as leading-wildcard searches, with ease.

Also, I couldn't really tell from your code (you may be doing this already), but reverse the trigrams in your generated query. If ordered properly, your search will be a lot more efficient.

What were the criteria for determining that there were too few distinct 2-grams and too many distinct 4-grams? Did it just come down to too much memory for 4-grams, and barely enough for 3-grams?

Compare 256^2, 256^3, 256^4.
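That is, the size of the possible byte n-gram space:

```python
# Number of distinct byte n-grams possible for n = 2, 3, 4:
print(256 ** 2)  # 65536      -- so few that each 2-gram's posting list is huge
print(256 ** 3)  # 16777216   -- a manageable middle ground
print(256 ** 4)  # 4294967296 -- so many that the index itself blows up
```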

Wow, is there anything at Google that Jeff Dean didn't have a hand in?


The original basic RE and extended RE (when backreferencing is not used) are significantly faster than implementations that most programmers traditionally rave about, e.g., Perl RE.

Tell me something I didn't know.

He thus used such 30-year-old code as a model and easily topped the speeds of the built-in RE capabilities of today's popular scripting languages.

Common sense is underrated.

Wow, everyone's a cynic. Did you miss the part about the trigram index?

It sounds like you are replying to regexp1.html, not regexp4.html.

No. You misread. I love Russ Cox's way of thinking and his continual generosity in sharing his work. I too use the Plan 9 base sed and grep. I'm cynical about the people who rave on about perl/javascript/python/ruby/whatever regexp, who derive some perverse joy in ridiculously complex RE and who often diss sed and all things old school as being slow or deficient in some way.

It's a nice design (in particular the trigram index), but the overall product still failed.

My guess is that regular expression search is not as useful as the full-text search that general Google Search does.

Finding code with the regular Google Search is nearly impossible though.

I'm finding code with regular Google Search all the time.

Yes, general Google Search is missing some neat features, but overall these features are not as important as the convenience of familiar general search queries, search speed, and the size of the general Google search index.

BTW, do you have your own explanation of why Google Code Search was cancelled?

Saying "it failed" isn't an explanation.
