
Ask HN: What kind of index is Google using here? - pc
http://google.com/codesearch?hl=en&lr=&q=%5Cd%5Cw%7B2%7D%5CW%7B3%7D%5C%28%5Cw%7B2%7D%5C.&sbtn=Search
======
wooster
AFAIK, Google Code Search was written by Russ Cox, who also wrote this:
<http://swtch.com/~rsc/regexp/regexp1.html>

My guess, back when I was doing code search at Apple, was that he was actually
scanning all the docs with his fast regexp engine, which would be doable given
the number of machines at his disposal and the expected traffic.

If anyone knows any differently, I'm all ears. :)

~~~
bravura
If Russell Cox wrote it, it must be good. I programmed competitively with him
at USACO and in college (we got to 8th best college team in the world), and he
was a consistent ass-kicker. Hardcore and disciplined and fast.

~~~
uriel
Other (open source) projects by Russ that people might find interesting:

* Plan 9 from User Space - <http://plan9.us>

* vx32/9vx - <http://pdos.csail.mit.edu/~baford/vm/>

* libtask - <http://swtch.com/libtask/>

(Plus quite a few bits of Plan 9 itself, obviously.)

Edit: forgot this really cool storage system:
<http://doc.cat-v.org/plan_9/misc/foundation/>

------
fizx
Reverse Aho-Corasick matching.

Basically, you store all of your searchable text in a suffix tree. You then do
a depth-first search of the tree for matching strings.

<http://www.cs.brandeis.edu/~mfc/cs120/papers/ptrees.pdf>

<http://en.wikipedia.org/wiki/Aho-Corasick_algorithm>

<http://en.wikipedia.org/wiki/Suffix_tree>

~~~
pc
How does that help here? My query doesn't contain any directly-searchable
strings, unless you start abstracting things by treating all alpha chars as
the same, etc.

~~~
fizx
A nice thing about regular expressions is that you always know where you are
in the expression (no backtracking, at least in theory).

Say your query is \d+(th|nd|rd), and your searchable text is "The 3rd planet
from the sun is 20000000 miles away." Your tree search will hit 2, then fail
on 20, then 3, then 3r, then 3rd, and then terminate successfully. So we
followed 5 pointers, and tested another 20 or so for nullity.

This was independent of the length of the searchable text (but not the number
of found matches). It would have performed equally well on "The 3rd planet
from the sun is 20000000 miles away. Lorem ipsum (insert 10mb of random Latin
here)".
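A toy Python sketch of this, assuming a hand-built DFA for \d+(th|nd|rd) and a
naive suffix trie (nothing here is Google's actual implementation):

```python
# Sketch (assumed, not Google's actual code) of the scheme above: put
# every suffix of the corpus in a trie, then depth-first search it while
# stepping a small hand-built DFA for the pattern \d+(th|nd|rd).
# Dead branches are pruned immediately, so work scales with how many
# trie paths can still match, not with the length of the corpus.

def build_suffix_trie(text):
    # naive O(n^2) suffix trie; a real system would use a suffix tree/array
    root = {}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node.setdefault(ch, {})
    return root

def step(state, ch):
    # transition function of a DFA for \d+(th|nd|rd); None means dead
    if state == "start":
        return "digits" if ch.isdigit() else None
    if state == "digits":
        if ch.isdigit():
            return "digits"
        if ch == "t":
            return "need_h"
        if ch in "nr":
            return "need_d"
    if state == "need_h" and ch == "h":
        return "accept"
    if state == "need_d" and ch == "d":
        return "accept"
    return None

def search(node, state="start", prefix="", hits=None):
    hits = [] if hits is None else hits
    if state == "accept":        # leftmost-shortest match; stop descending
        hits.append(prefix)
        return hits
    for ch, child in node.items():
        nxt = step(state, ch)
        if nxt is not None:      # only follow branches the DFA survives
            search(child, nxt, prefix + ch, hits)
    return hits
```

On "The 3rd planet from the sun is 20000000 miles away." this explores only
the digit-rooted paths and returns ['3rd'].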

~~~
nostrademons
How much RAM do you think it'd take to store all the world's source code in a
suffix tree, and how would you partition it across multiple machines?

~~~
fizx
Well, there's always suffix arrays, so two 64-bit pointers (plus the original
char) per char?

Assuming their claim of 116,000,000 matches for /./, and 5k mean document
size, 70TB, uncompressed.

Papers like "Compressed Suffix Arrays and Suffix Trees with Applications to
Text Indexing and String Matching" [1] lead me to believe that the state of
the art could be as low as 4 bits/char, in which case the answer would be a
much more tractable 270GB.

As far as partitioning goes, now I'm getting really out of my league. At a
first pass, I imagine you could keep the root and its neighbors on a primary
server, and subtrees as big as will fit in RAM on a bunch of slaves. The first
couple of chars get processed on the master, and then however many subtrees
are still alive get delegated in parallel to the secondary servers. The
secondaries return their answers to the primary, which returns to the client.

I feel like I'm answering hypothetical questions at a job interview :p

[1] <http://www.di.unipi.it/~grossi/PAPERS/sicomp05.pdf>

~~~
tlb
The math doesn't work: 116,000,000 matches x 5k is 580 GB. It fits on one big
disk, and with 5800 machines holding 100 MB of plain text each you could
search all of it naively in real time. (Grep searches 100 MB in 50 ms.)

5800 machines is practical for a real product. Hardware probably costs Google
$40/month/CPU, so $230k / month, about equal to 10 engineers. So if a more
clever solution needs a whole team to maintain it then it's probably better to
just use grep.
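The arithmetic, spelled out (decimal units, and the comment's assumed
$40/month/CPU figure):

```python
# Working through the numbers in the comment above (decimal units).
docs = 116_000_000           # claimed match count for /./
mean_doc_bytes = 5_000       # assumed 5k mean document size
corpus_bytes = docs * mean_doc_bytes
print(corpus_bytes / 1e9)    # 580.0 GB -- fits on one big disk

per_machine = 100 * 1_000_000          # 100 MB of plain text per machine
machines = corpus_bytes // per_machine
print(machines)                        # 5800

monthly_cost = machines * 40           # assumed $40/month/CPU
print(monthly_cost)                    # 232000, i.e. ~$230k/month
```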

~~~
fizx
10 engineers to program one data structure? I'd be greatly surprised if 1 PhD
and 2 engineers couldn't do the additional work beyond distributed grep in
less than six months, with maybe one engineer to maintain it. That saves
~$200k/month, and more once you go over the ~20 queries/second it takes to
max out those 5800 CPUs.

------
drcode
Is your question "How do you create an index that allows regexes?"

In that case, I'd like to know the answer to the question, as well.

~~~
thwarted
perl has a "study" function that examines a string in anticipation of doing
many pattern matches on it. I'm not sure exactly what it does, but the kind of
structures it creates could be serialized and stored along with an input file
(or perhaps in place of it), and used as a kind of regular-expression-optimized
index. This isn't necessarily an "index" in the normal sense, but then,
neither is a full-text index. Really, any data structure that is optimized
for a specific kind of processing is an index.

~~~
jrockway
study does not do anything really interesting. It looks up characters in a
table of character -> average frequency, then scans for the pattern's rarest
character first, so that it can fail quickly for most input.

From the docs:

(The way "study" works is this: a linked list of every character in the string
to be searched is made, so we know, for example, where all the 'k' characters
are. From each search string, the rarest character is selected, based on some
static frequency tables constructed from some C programs and English text.
Only those places that contain this "rarest" character are examined.)

Basically, this is not what Google is doing for code search.
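For what it's worth, the strategy the docs describe is easy to sketch in
Python (using positions gathered from the text itself rather than Perl's
static frequency tables):

```python
from collections import defaultdict

# Toy version of the study() strategy quoted above: index every
# character's positions once, then for a literal needle examine only the
# positions of its rarest character, so most candidates fail fast.

def build_char_index(text):
    index = defaultdict(list)
    for i, ch in enumerate(text):
        index[ch].append(i)
    return index

def find_all(text, index, needle):
    # pick the needle character with the fewest occurrences in the corpus
    offset, rare = min(enumerate(needle),
                       key=lambda pair: len(index.get(pair[1], [])))
    hits = []
    for pos in index[rare]:
        start = pos - offset
        if start >= 0 and text[start:start + len(needle)] == needle:
            hits.append(start)
    return hits
```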

~~~
thwarted
Study doesn't do anything really interesting, but neither is creating a
keyword index for a keyword-based search all that interesting.

Google Code Search is not intended for searching English with regular
expressions (their regular search technology would most likely suffice for
that), so I agree that the naive study implementation you've quoted is,
basically, not what Google is using for code search. However, there's no
reason the frequency tables need to be static, nor constructed solely from C
programs and English text. They can be continuously updated from the input
corpus, and the input corpus should be the code being searched. Part of search
is being able to rule out significant portions of the search space in the name
of returning positive results as fast as possible, so failing quickly on most
inputs matters (the example regular expression only returns three hits, so the
wide majority of documents don't match). A study-like implementation would
appear to be a possibility.

------
tlrobinson
This is off-topic, but I just realized Google Code Search is great for seeing
where code you open source ends up. Came across a few things I had no idea
were using my code. Kinda cool.

~~~
mc
Tom, did you know there are extra filters you can use to search Github.com?

Try "tlrobinson package:github.com lang:javascript"

------
robk
IIRC, this was a rather high-compute-cost project inside of Google, far more
so than other search projects. It was obviously a favorite of the eng decision
makers, though, so it was well cared for resource-wise.

------
pmorici
This presentation talks about using N-grams to make a regex-searchable index:

<http://www.informatik.hu-berlin.de/forschung/gebiete/wbi/teaching/archive/sose08/pr_text/vortraege/ngram_indexing.pdf>

~~~
fizx
No complexity analysis. Random student results.

It's viable to build an index with n-grams, but I don't think you'd get any
benefit over suffix-tree search. There might be a way to select n-grams
cleverly some of the time, but you can also easily run into pathological
cases.
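For concreteness, a minimal trigram-index sketch in Python; it assumes the
query carries at least one required literal substring, which is exactly where
the pathological cases come from (a query with no usable literal defeats the
prefilter):

```python
import re
from collections import defaultdict

# Minimal trigram prefilter: an inverted index from trigram -> doc ids.
# A query with a required literal intersects that literal's trigram
# posting lists, and the full regex runs only on the survivors.

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for t in trigrams(text):
            index[t].add(doc_id)
    return index

def search(docs, index, literal, pattern):
    # candidates must contain every trigram of the required literal
    candidates = set(range(len(docs)))
    for t in trigrams(literal):
        candidates &= index.get(t, set())
    # the real regex removes any remaining false positives
    return sorted(d for d in candidates if re.search(pattern, docs[d]))
```

Intersecting a few short posting lists is much cheaper than running the regex
over every document, and the final re.search pass keeps the result exact.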

------
mc
I'd (probably) consider building and searching an index of ASTs
(<http://en.wikipedia.org/wiki/Abstract_syntax_tree>).

Not only would it allow searching with regexps, but it could allow more
sophisticated search queries like "Search all my Ruby code for blocks that
rescue StandardError and are not followed by an ensure clause".
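Transposed to Python's own stdlib ASTs (the query above is Ruby, so this is
only an analog): find try blocks with a handler but no finally clause.

```python
import ast

# Rough Python analog of the query above, using the stdlib ast module:
# find `try` statements that have an exception handler but no `finally`
# clause (like "rescue without ensure" in the Ruby example).

def try_without_finally(source):
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Try) and node.handlers and not node.finalbody:
            hits.append(node.lineno)
    return hits

SOURCE = """\
try:
    risky()
except Exception:
    pass

try:
    risky()
except Exception:
    pass
finally:
    cleanup()
"""
```

An indexed version would store these trees (or facts derived from them) per
file; the catch is that every language needs its own parser and node
vocabulary.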

~~~
gruseom
But you'd have to do that separately for every language.

Edit: the OP is such a great question. I've been saving up the joke of
commenting "How is this Hacker News?" on the _most literally appropriate_ HN
post I run across... and almost used it on this one. Sadly, I probably never
will.

~~~
mc
> But you'd have to do that separately for every language.

That's correct. It's one of the challenging parts.

~~~
gruseom
It's one of those ideas that would be best tried out on one instance first.
It's not obvious that the ASTs of programs in different languages would be "A"
enough for a generalized querying system to be useful. Still, I don't mean to
shoot the idea down - it would make for an interesting experiment.

------
crux_
Something like this?

\- Regular expression searching on compressed text,
<http://portal.acm.org/citation.cfm?id=967612>

~~~
mbrubeck
The cost of that algorithm still grows linearly with respect to the length of
the compressed text. It's possible Google is doing something like this
(searching over the entire corpus until they find a match), but is there any
work on precomputing an actual "index" to speed up this type of search?

~~~
crux_
I'm certain there are many approaches to be found via citation-following.

But consider that you can use the linked algorithm, and a bit of trickiness,
to create an index already:

\- Step (a): prepare another version of the document where every digit is
replaced with 0, every word char with 'a', etc.

\- Step (b): note how much better it compresses. :)

\- Step (c): Perform a similar 'weakening' on the search regex.

\- Step (d): Done! Not an exact index, but it will allow you to quickly
eliminate a large number of documents.

You can also combine this approach with text-based ones for the cases where
there are parts of the search regex that contain literal text. (This is almost
certainly the most common use case).

Or you could make it hierarchical if you need to by creating several watered-
down texts for each input with different levels of substitution.
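A sketch of steps (a) through (d) in Python; the query's weakened form is
written out by hand here rather than derived mechanically:

```python
import re

# Sketch of the weakening prefilter above: collapse every digit to '0'
# and every letter to 'a' in the corpus; the weakened form of a query
# like \d{5}-\d{4} is then the plain literal "00000-0000" (weakened by
# hand here), so non-matching documents are ruled out by a substring test.

def weaken(text):
    return re.sub(r"[A-Za-z]", "a", re.sub(r"\d", "0", text))

PATTERN = r"\d{5}-\d{4}"      # the real query (a ZIP+4-shaped number)
WEAK_LITERAL = "00000-0000"   # its hand-weakened counterpart

def search(docs):
    # cheap substring prefilter on weakened text, then the real regex
    return [d for d in docs
            if WEAK_LITERAL in weaken(d) and re.search(PATTERN, d)]
```

The weakened text also compresses far better (step (b)); this sketch only
shows the filtering half.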

\---------------------------------------------------

Since this is the sort of thing that's fun to think about, here's a whole
different approach:

\- Break each document apart into code lines.

\- Create a hierarchical tree: leaves are the code snippets (with a pointer
back to source doc), each node represents a clustering-by-text-similarity of
all its children. (Appropriate distance measure left as an exercise for the
reader. :) )

\- Let's say that for two regular expressions, _RA_ & _RB_, _RA_ < _RB_ iff
any text matched by _RA_ will also be matched by _RB_.

\- At each node, create a regular expression _R_ so that for each child regex
_C_, _C_ < _R_. (The leaves become a regex that simply matches the source
line literally.) Along the way, make a best effort to restrict the length of
_R_ by introducing generalizations (_R_ can't simply be _Ca_|_Cb_|_Cc_...)
without making it so loose that it matches everything.

\- Then, searching for a regex _S_ can be done recursively: for each child
regex _R_ of the current node, if you can construct a new regex _T_ such that
_T_ < _S_ and _T_ < _R_, descend into that node.

n.b. I typed that out in a hurry and without being rigorous... so it's
probably not quite right, but an interesting path to think about exploring.
Also: testing if one regex is less than another (as defined above) is actually
fairly straightforward, even though it seems hard.

(And a quick edit to add: the last step is definitely wrong. Unfortunately I
need to hop onto a conference call, so it'll have to wait until the indefinite
future.)

(Edit to that: I think I fixed it for now. Shorter version: if the
intersection of _S_ and _R_ is nonempty, descend.)

~~~
crux_
Replying to myself, since editing was getting out of control.

A neat property of the 2nd approach is that if you ever reach a node _R_ where
_R_ < _S_ you're done: all children will be matched.

A less neat property is that it won't allow multi-line matches, but I'm not
sure those are allowed anyway.

------
mtrimpe
It might be possible to adapt a regex state machine to use an AVL TDAWG
(<http://www.ssp.isee.kyushu-u.ac.jp/~inenaga/papers/tcs-ternary.pdf>)

------
johngunderman
I am 100% postulating here, but this is what seems most likely to me:

Google is known for MapReduce. It seems like it would make regex searches
manageable when put in such an algorithm. All Google would have to do would be
to take whatever key system they use for their big-tables and use MapReduce to
expand the regex over the keys. Then they just have to return the data
associated with those keys.

~~~
tumult
..what?

I would love to see you try to explain how you think this works :)

~~~
johngunderman
to quote <http://infolab.stanford.edu/~backrub/google.html>

Indexing Documents into Barrels -- After each document is parsed, it is
encoded into a number of barrels. Every word is converted into a wordID by
using an in-memory hash table -- the lexicon. New additions to the lexicon
hash table are logged to a file. Once the words are converted into wordID's,
their occurrences in the current document are translated into hit lists and
are written into the forward barrels. The main difficulty with parallelization
of the indexing phase is that the lexicon needs to be shared. Instead of
sharing the lexicon, we took the approach of writing a log of all the extra
words that were not in a base lexicon, which we fixed at 14 million words.
That way multiple indexers can run in parallel and then the small log file of
extra words can be processed by one final indexer.

So we have a hash table of words/searches -> results

Because we have this, we can regex search the keys, and end up with the result
pages.

We do have other options, though. It could be:

a) magic, or b) a trade secret (think PigeonRank :-) )

~~~
tumult
Quote is unrelated to what you are saying!

 _use MapReduce to expand the regex over the keys_

 _So we have a hash table of words/searches -> results. Because we have this,
we can regex search the keys, and end up with the result pages._ ...

Looking forward to further explanations!

~~~
inerte
Imagine the following code:

money = 5;

This could be tokenized and mapped to a few regex keywords, stored as
key-value pairs in a datastore. Since Google is big on MapReduce, which can
be seen as computation on sets:

[a-z] = (DOCID_14566, DOCID_15999, DOCID_888)

\d{5} = (DOCID_15, DOCID_15999, DOCID_552)

Then when someone searches for a "number with five digits", the value for the
\d{5} key is retrieved; it has the documents where the search appears, and
this set/list is combined with the other sets/lists that match the other
search keywords, and you have your results.

On a normal web search, the document is split into words and you can make
these words into keys, so you know where they appear, right? (the keys'
values).

Well, if you think that the user will type a regex expression "\d{5}", then
you can map it to documents too.

It's like pre-processing the possible regexes that would match a document's
tokens, and searching _these regexes_ instead of words, as in a normal web
search.

It's "just" an extra step: splitting a document into words (tokens) and
assigning possible regexes to those tokens.

Not saying Google does it :p But I guess that's what the OP meant...

Edited: before anyone notices how inefficient my example is: I wouldn't
implement every possible \d as a key, or [a-b], [a-c], [a-d], etc. as more
keys; this is just an example of how sets of precomputed regexes (of any
depth) can be (ab)used.
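A toy version of that scheme with a single family of precomputed keys, \d{N}
for an N-digit token (illustrative only, per the edit above):

```python
import re
from collections import defaultdict

# Toy version of the scheme above: tokenize each document and file it
# under precomputed pattern keys -- here just one family, \d{N} for an
# N-digit token -- so a query that *is* such a pattern is a key lookup.

def pattern_keys(token):
    if token.isdigit():
        yield r"\d{%d}" % len(token)

def build_index(docs):
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for token in re.findall(r"\w+", text):
            for key in pattern_keys(token):
                index[key].add(doc_id)
    return index
```

A query like "\d{5}" is then served straight from the index, with no regex
evaluation at search time.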

~~~
tumult
This makes sense and it could work :) I imagine you would have to spend time
tuning the tokens you map out. I think this would break down pretty quickly
with moderate or complicated regexes, but those might not be the majority of
searches.

Is there any trickery you could do with the inner nodes on the b+trees? I
spent a few minutes thinking about tokens mapped out and intermediate
reductions but didn't come up with anything.

------
miracle
Easynews also has a global search where you can search through all the
newsgroup posts (> 150 days) with a regular expression.

You might ask them? They are certainly not as secretive as Google is ;)

~~~
jrockway
Is the global search really a regex? I thought it was just substrings.

~~~
s3graham
The labels are links to regularexpressions.info so I'd guess yes.

I'm pretty sure EN just greps subject + the output of something like
'strings', though; it chugs pretty hard on broad searches.

------
simonw
I've always wanted to know that.

