
How Google Code Search Worked: Regex Matching with a Trigram Index (2012) - kbumsik
https://swtch.com/~rsc/regexp/regexp4.html
======
ravenstine
This reminds me of a browser-side search I once wrote using a combination of
regex, Soundex, and Damerau-Levenshtein. I made it browser-side because the
app doesn't have _that_ many records and didn't need them on the server; the
records number in the hundreds to thousands, not millions.

[https://github.com/SCPR/fire-tracker/blob/master/app/lib/search-index.js](https://github.com/SCPR/fire-tracker/blob/master/app/lib/search-index.js)

Not the best code I've ever written by far, but it was a neat little
experiment.

Basically, indexing each record with Soundex (a phonetic algorithm) makes the
search semi-tolerant of misspellings and also brings up similar-sounding
results, while Damerau-Levenshtein distance is used to sort the results by
closeness. I also had it index timestamps with
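
A minimal sketch of that combination in Python (the Soundex and distance
implementations are simplified, and the record list is hypothetical; this is
not the linked code):

```python
def soundex(word):
    # Classic Soundex: first letter plus up to three consonant-class digits.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    out, prev = letters[0].upper(), codes.get(letters[0], "")
    for c in letters[1:]:
        code = codes.get(c, "")
        if code and code != prev:
            out += code
        if c not in "hw":  # h and w do not separate identical codes
            prev = code
    return (out + "000")[:4]

def damerau_levenshtein(a, b):
    # Optimal-string-alignment variant: edits plus adjacent transpositions.
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

def build_index(records):
    # Map each record to the set of Soundex codes of its words.
    return [(rec, {soundex(w) for w in rec.split()}) for rec in records]

def search(index, query):
    # Phonetic match first, then sort the survivors by edit distance.
    q_codes = {soundex(w) for w in query.split()}
    hits = [rec for rec, codes in index if codes & q_codes]
    return sorted(hits, key=lambda r: damerau_levenshtein(r.lower(), query.lower()))

index = build_index(["Thomas Fire", "Tubbs Fire", "Camp Fire"])
print(search(index, "tomas fire"))  # "Thomas Fire" ranks first despite the typo
```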

Here it is in action:

[https://firetracker.scpr.org/archive](https://firetracker.scpr.org/archive)

As you can see, you can misspell search terms and it will _usually_ get it
right. Granted, a lot of unrelated records usually come up in a search, but
the important thing is that the user gets what they're looking for in the
first few results, usually the first.

Regex is a powerful thing.

------
softwaredoug
Very cool! Most search problems are solved this way: thoughtful index data
modeling fitted to the use case, not magic machine-learning neural-network
unicorns. Many want to sell you the latter; the thing that gets the job done
is usually the former.

One big reason for this is the need for really good training data to do
supervised learning on search problems. If you're Google or Wikipedia, you can
get this data. But many sites struggle to understand what a good result for a
query was, based on clicks and other user data, because their traffic volume
is too low.

~~~
kqr
Then you're using magic machine learning wrong. I'm part of a small team that
uses it exactly for the purpose of giving intelligent search to non-Googles
and non-Wikipedias, and we have a very good track record.

The key is to use machine learning to generalize user behavior across multiple
items in roughly the same category, rather than individual items. In fact,
with tiny amounts of data you should pretty much never deal in individual
items -- always in groups of related items.

~~~
chudi
When you said that you work in groups, the groups are discovered with a
clustering technique and then you apply the learning to rank algos ?

~~~
kqr
It's more of a hierarchy than groups, actually. If a user indicates that a
pair of maroon walking shoes were relevant to their query for "red sneakers",
then we have learned that "red sneakers" is associated, in order of decreasing
strength, with e.g.

- maroon walking shoes
- walking shoes
- casual shoes
- footwear
- clothes

And obviously the same thing can be applied to generalize over the query.
These hierarchies are constructed statically and dynamically with unsupervised
learning, and then associations from query to groups happen dynamically.
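
A minimal sketch of that generalization step in Python (the category paths,
item names, and decay factor are all hypothetical):

```python
from collections import defaultdict

# Hypothetical category path for one item, most specific first.
PATHS = {
    "maroon-walking-shoes": ["maroon walking shoes", "walking shoes",
                             "casual shoes", "footwear", "clothes"],
}

# query -> category -> accumulated association strength
associations = defaultdict(lambda: defaultdict(float))

def record_click(query, item, decay=0.5):
    # Credit the clicked item's whole ancestry, with geometrically
    # decreasing weight the further up the hierarchy we generalize.
    weight = 1.0
    for category in PATHS[item]:
        associations[query][category] += weight
        weight *= decay

record_click("red sneakers", "maroon-walking-shoes")
# associations["red sneakers"] now ranks "maroon walking shoes" highest,
# then "walking shoes", "casual shoes", "footwear", "clothes".
```

The point is that even a single click teaches the system something about every
level of the hierarchy, which is what makes tiny amounts of data usable.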

~~~
softwaredoug
Very cool, how exactly do you generate the hierarchies? Is there an existing
site taxonomy or categorization you're using? Or are you associating query
strings with the docs clicked and using refinements to infer the hierarchy? Or
maybe LtR training data per site category?

~~~
kqr
They are constructed from a probabilistic similarity measure defined in terms
of the metadata available for items, where their path tends to be weighted
fairly heavily. Does that answer make sense?

~~~
softwaredoug
Cool stuff, appreciate the answer. Makes more sense now. BTW, feel free to
join us on the relevance Slack community; a lot of us are curious about what
you're doing.

[http://o19s.com/slack](http://o19s.com/slack)

------
secure
Shameless plug:

This is what I based Debian Code Search on:
[https://codesearch.debian.net/](https://codesearch.debian.net/)

See also [https://codesearch.debian.net/research/bsc-thesis.pdf](https://codesearch.debian.net/research/bsc-thesis.pdf)
if you're interested in the details.

~~~
devoply
Is your code open source?

~~~
LukeShu
[https://github.com/Debian/dcs](https://github.com/Debian/dcs)

------
btown
Are there any good writeups on how to implement the trigram-index side of
this? For instance, the code alludes to the idea that you could store each
trigram as a bitmap over documents, so that Boolean combinations become
bitwise operations. I suppose you could keep the bitmaps in compressed (or
even probabilistic-sketch) form, then decompress only those required for a
query, but is this the right intuition? Are there lower-level libraries or
databases that do this kind of thing well, rather than rolling your own on a
KV store or using something monstrous like Elasticsearch?
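
For the bitmap intuition, here is a minimal sketch in Python using plain ints
as bitsets (a real system would use compressed bitmaps such as Roaring; the
documents are made up):

```python
from collections import defaultdict

def trigrams(text):
    return {text[i:i + 3] for i in range(len(text) - 2)}

class TrigramIndex:
    """Each trigram maps to a bitmap over document ids (a Python int
    here; a compressed bitmap in a real system)."""

    def __init__(self):
        self.bitmaps = defaultdict(int)
        self.docs = []

    def add(self, text):
        doc_id = len(self.docs)
        self.docs.append(text)
        for t in trigrams(text):
            self.bitmaps[t] |= 1 << doc_id

    def candidates(self, literal):
        # AND the per-trigram bitmaps: docs containing *every* trigram of
        # the literal. This is a superset of the true matches, so each
        # candidate must still be verified by running the real search.
        ts = trigrams(literal)
        if not ts:
            return list(range(len(self.docs)))
        bits = None
        for t in ts:
            bits = self.bitmaps[t] if bits is None else bits & self.bitmaps[t]
        return [i for i in range(len(self.docs)) if bits >> i & 1]

idx = TrigramIndex()
idx.add("func Compile(expr string)")
idx.add("def compile(pattern):")
print(idx.candidates("Compile"))  # -> [0]
```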

~~~
shabble
[https://wiki.postgresql.org/images/6/6c/Index_support_for_re...](https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf)
is I think the slides for a conference talk on implementing within Postgres

~~~
JaggedNZ
Postgres FTS combined with trigrams (the pg_trgm extension) works really well
for many search applications. Most of the effort is in tuning the FTS
dictionary.

~~~
dreamer_soul
My issue with it is that a small term such as one word doesn't event match to
a record which is weird, I'm using it with rails. Thinking of using elastic
search to help with that

------
karmakaze
How fitting. I've been working on making Etsy's Hound [0] available as SaaS
[1]. Hound uses a trigram index with regexps and can search all the repos I
care about in milliseconds.

The demo page works. The signup just sends me a confirmation email and the
setup is still manual, so expect a delay of hours to a day. I'm aiming to have
this all automated by tomorrow. Appreciate any/all feedback.

[0] [https://github.com/etsy/hound](https://github.com/etsy/hound)

[1] [https://gitgrep.com](https://gitgrep.com)

~~~
xooms
Have you made sure that Git people are OK with the name GitGrep? See
[https://git-scm.com/about/trademark](https://git-scm.com/about/trademark) for
details.

------
sqs
Another shameless plug: Sourcegraph uses this for indexed search via Zoekt (it
also has index-less live search). Check out the cmd/indexed-search code at
[https://github.com/sourcegraph/sourcegraph](https://github.com/sourcegraph/sourcegraph).

------
rasmi
The search features discussed in the article are now available through Google
Cloud Source Repositories:
[https://cloud.google.com/source-repositories/docs/searching-code](https://cloud.google.com/source-repositories/docs/searching-code)

------
petters
I wish GitHub had decent search functionality.

~~~
enriquto
You mean searching the past history of all the code?

~~~
petters
Mostly searching for regexes in code and in filenames. But yes, history would
sometimes be useful too.

And this is really the bare minimum. An even better search would e.g. allow
searching for identifiers (comments and strings disregarded).

~~~
rpedela
I believe GitHub uses ES, and currently, allowing users to run regexes can bog
down the entire cluster if a regex is pathological. This is a problem with
Solr too. I believe there was some effort to resolve this at the Lucene level,
but I'm not sure of its status.

In other words, I agree with you, but I also know it's an extremely hard
problem to solve at GitHub's scale.

~~~
avar
If they use ES (or another Lucene-based engine), they could already be doing
fuzzy search via ES's own ngram support. At that point, indexed regex search
as described in the article isn't far away. You just need to bridge the gap
between a regex and a trigram index.

~~~
rpedela
Agreed, but it's not obvious to me how you would bridge that gap without a lot
of custom code.
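
The core of the bridge is the transformation the article describes: compile
the regex into a Boolean query of trigrams, use the index to find candidate
documents, then run the real regex only on those. A deliberately tiny sketch
in Python that handles only plain literals joined by ".*" and top-level
alternation (the article's version walks the full regex syntax tree):

```python
import re

def trigrams(s):
    return {s[i:i + 3] for i in range(len(s) - 2)}

def trigram_query(pattern):
    # Reduced sketch: only alternations of literals joined by ".*",
    # e.g. "foo.*bar|baz".
    branches = []
    for branch in pattern.split("|"):
        required = set()
        for literal in branch.split(".*"):
            if re.escape(literal) != literal:
                return None  # not a plain literal: give up (match everything)
            required |= trigrams(literal)
        branches.append(required)  # AND the trigrams within a branch
    return branches                # OR the branches together

# Every match of "handler.*error" must contain the trigrams of "handler"
# AND of "error"; index candidates are then verified with the real regex.
print(trigram_query("handler.*error|panic"))
```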

------
mslot
If you want to build something like this yourself, Postgres has a trigram
index:
[https://www.postgresql.org/docs/current/pgtrgm.html](https://www.postgresql.org/docs/current/pgtrgm.html)
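
A minimal usage sketch via psycopg2 (the "files" table with path/body columns
and the connection string are hypothetical):

```python
import psycopg2

conn = psycopg2.connect("dbname=codesearch")
cur = conn.cursor()

# One-time setup: enable the extension, build a trigram GIN index.
cur.execute("CREATE EXTENSION IF NOT EXISTS pg_trgm")
cur.execute("CREATE INDEX IF NOT EXISTS files_body_trgm "
            "ON files USING gin (body gin_trgm_ops)")

# Fuzzy match via the % similarity operator (%% escapes it for psycopg2):
cur.execute("SELECT path FROM files WHERE body %% %s", ("serach term",))

# The same index also accelerates regex matching (PostgreSQL 9.3+):
cur.execute("SELECT path FROM files WHERE body ~ %s", (r"err(or)?s",))
print(cur.fetchall())
```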

~~~
ta1234567890
That's great. It seems like it only supports character-level trigrams though.
Do you know of any tools that can create word-level trigrams from Postgres?

------
joatmon-snoo
FWIW, kythe.io is the modern successor that does this internally at Google. I
haven't worked on it, but I have written some code that's a client of it.
Unfortunately, I think the indexing pipelines aren't publicly available.

------
Thorrez
You can still use Google Code Search on the Chrome codebase.

[https://cs.chromium.org](https://cs.chromium.org)

For example here's a hacky regex that finds lambdas:

[https://cs.chromium.org/search/?q=%5CW%5C%5B%5B%5E%5C%5D%5D*...](https://cs.chromium.org/search/?q=%5CW%5C%5B%5B%5E%5C%5D%5D*%5C%5D%5Cs*%5C\(%5B%5E\)%5D*%5C\)%5Cs*%7B+lang:c%2B%2B&sq=package:chromium&type=cs)

------
ngnear
You can store a local search index of your code using
[https://github.com/google/codesearch](https://github.com/google/codesearch),
which uses this algorithm in the background.

------
techbio
Trigrams here are made of three characters, not three tokens.

This is not the obvious search strategy for code, since it captures no
semantic structure, though apparently it suits regex-searchable indexes.

~~~
crawshaw
Trigram is defined in the dictionary as "a group of three consecutive written
units such as letters, syllables, or words." Thus using the term for a triplet
of tokens seems appropriate.

------
sn41
For a reasonable code base, say the Linux kernel source, how large is the
trigram index? Is it necessary for the index to be kept in memory?

~~~
nestorD
77MB for the Linux 3.1.3 kernel sources according to the "Implementation"
section of the article.

------
PunchTornado
Be an intern with Jeff Dean as a mentor...

------
rurban
This was posted here at least 10x before.

The last time it came up, someone posted a link to a hopeful successor on
GitHub:
[https://news.ycombinator.com/item?id=18022357](https://news.ycombinator.com/item?id=18022357)

------
z3t4
Modern servers have a lot of compute and memory compared to 15 years ago.
Before doing any _optimizations_, I would start with a naive full-text search,
which will most likely be fast enough. A full code search using regex is also
more powerful for users. Imagine if you could use regex when searching on
GitHub or Google! Source code takes up relatively little space, and if you
strip the markup/JS from web sites, they also take up relatively little space.
The only remaining problem is educating users on how to do effective searches.

~~~
humbledrone
I think you might be underestimating the size of the document corpus that
you'd be running over.

~~~
z3t4
Let's say there are 5 billion URLs with an average of 10 KiB of data each (if
you take out JS/CSS/images etc.). If one server has 50 GB of RAM, you would
need 1,000 servers, which is very small considering Google probably has a
million servers deployed. I just tried a text search for your comment on
Google and it found your post! So Google is already doing full-text search,
and does it in less than one second (0.69 s, to be precise). There are
probably many reasons why they don't allow regex; one is probably that it
would make it very easy to scrape resources such as e-mail addresses, credit
card numbers, etc. It would, however, be cool if Google let you search
_structured_ data, for example to find 100 recipes that have eggs in them :P
Silly example, but the possibilities are endless!

~~~
humbledrone
But that's exactly my point -- when you get to the stage where you have 1,000
servers with 50 GB of RAM each, you have reached the point where an
optimization like an inverted index is completely sensible. The design you
propose has to do a full regex scan over 50 TB of RAM for every. single. user.
query. For Pete's sake! This is definitely the realm where the computational
costs make it worthwhile to spend engineering resources on optimization,
especially if you are going to serve lots of users.

