
Show HN: We built a fast substring search engine in Go - alexirobbins
http://www.tamber.com/posts/ferret.html
======
DrJosiah
There are two methods better known in the context of substring search that
may help you get something even better than your prefix + Levenshtein-
distance-1 searches.

The first and most common is stemming. It can be useful, but it requires a lot
of hand tuning and doesn't get you all that much (at least in my experience).

The second is using the phonetics of the words themselves to figure out what
someone meant to type. The granddaddy of these is Soundex, which works,
but isn't as good as Metaphone. Metaphone and Double Metaphone are algorithms
that translate words in most Latin-alphabet languages into abbreviated
phonetic keys.

Double Metaphone, in particular, has allowed me to build spelling-agnostic
prefix matching as an effectively trivial service for two different companies
(the details of which you can read about in the Redis mailing list, and/or
section 6.1.2 in my book, Redis in Action). The long and short of it is that
if you used double metaphone, for the cost of 2-3x index size, you could get
all of the results you are looking for in 2-3 searches instead of length(word)
searches.
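The phonetic-key idea can be sketched in Go. This uses classic Soundex as a simpler stand-in for Double Metaphone (which has many more rules but the same shape): spellings that sound alike collapse to the same short key, so you index keys instead of raw words and one lookup replaces many Levenshtein probes. A simplified sketch; the example names are just classic Soundex test cases:

```go
package main

import (
	"fmt"
	"strings"
)

// soundexCode maps a letter to its Soundex digit, or 0 for
// vowels and other letters that are skipped.
func soundexCode(c byte) byte {
	switch c {
	case 'b', 'f', 'p', 'v':
		return '1'
	case 'c', 'g', 'j', 'k', 'q', 's', 'x', 'z':
		return '2'
	case 'd', 't':
		return '3'
	case 'l':
		return '4'
	case 'm', 'n':
		return '5'
	case 'r':
		return '6'
	}
	return 0
}

// soundex computes a simplified Soundex key: the first letter plus up
// to three digits, skipping vowels and collapsing adjacent duplicates
// ('h' and 'w' do not break up a run of duplicates).
func soundex(word string) string {
	word = strings.ToLower(word)
	if word == "" {
		return ""
	}
	key := []byte{word[0] - 'a' + 'A'}
	prev := soundexCode(word[0])
	for i := 1; i < len(word) && len(key) < 4; i++ {
		code := soundexCode(word[i])
		if code != 0 && code != prev {
			key = append(key, code)
		}
		if word[i] != 'h' && word[i] != 'w' {
			prev = code
		}
	}
	for len(key) < 4 {
		key = append(key, '0')
	}
	return string(key)
}

func main() {
	// Misspellings collapse to one key, so the index maps
	// phonetic key -> words instead of word -> word.
	fmt.Println(soundex("Smith"), soundex("Smyth"))   // S530 S530
	fmt.Println(soundex("Robert"), soundex("Rupert")) // R163 R163
}
```

Double Metaphone refines this by emitting up to two keys per word (hence the 2-3x index size and 2-3 searches mentioned above), but the indexing structure is the same.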

~~~
argusdusty
This is a good idea :) I'll probably add in a module for phonetic matching
when I get the time.

~~~
DrJosiah
A good idea is using a paper towel to open the door on your way out of a
public restroom.

This is practical advice based on experience over several months of research
and development scattered over 9 years. ;)

------
diego
This is cool. However, I wonder why the author didn't switch to a Lucene-based
server such as Elasticsearch or Solr, for which you can find decent Go clients
(or use the REST API directly). The NGram tokenizer lets you do infix search
out of the box. By the way, the author says Lucene is "bloated." That's just
nonsense, like saying MySQL is "bloated." LinkedIn and Twitter use Lucene to
handle insane numbers of QPS with real-time updates. There is no excuse for
not trying a search server before deciding to hack something new.

My impression from reading the post is that they went from an extremely
inefficient solution straight to one that seems _way_ overengineered for the
company's scale. The app has only 28 ratings on iTunes, which means it's
not hugely popular ($150/month is not much more than my personal AWS bill).

My guess is that the highest ROI of this code (besides the joy of hacking)
comes from the HN page views. Usually, optimizing software to this level is a
bad idea for a small startup. You're not Twitter, GitHub, or Dropbox. Code less
and grow your business more.

------
danieldk
It's nice to see suffix arrays being used in the real world. For anyone who is
interested in suffix arrays, the following two papers are recommended:

* Suffix arrays: A new method for on-line string searches, U Manber, G Myers, 1993

* Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, M Yamamoto, KW Church, 2001

The second paper actually uses an 'inverted index' for matching sequences to
documents.

~~~
Scaevolus
Suffix arrays are useful for approximate matches too -- they're used by the
binary delta program BSDiff[1] to match similar code sequences. BSDiff is used
in Google products (Including Chrome)[2] and Firefox's delta updates.

[1]:
[http://www.daemonology.net/papers/bsdiff.pdf](http://www.daemonology.net/papers/bsdiff.pdf)
[2]: It performs additional preprocessing to make updates smaller (Courgette).

~~~
LukeShu
Courgette isn't BSDiff, it is Google's replacement for/improvement upon
BSDiff. They were using BSDiff for Chrome updates before they switched to
Courgette.

~~~
Scaevolus
Courgette is a preprocessing stage to reduce the number of differences
(disassembly + offset fixup + reassembly) plus BSDiff to catch anything else
(using lzma instead of bzip2).

------
Scaevolus
Related: Russ Cox wrote and documented a miniature version of Google Code
Search (a fast regex search engine) in Go:
[http://swtch.com/~rsc/regexp/regexp4.html](http://swtch.com/~rsc/regexp/regexp4.html)

Using a Suffix Array _should_ make it possible to do fuzzy matching easily,
but I haven't examined precisely how they combined it with an inverted index.
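The inverted-index half of that design is easy to sketch on its own: Code Search builds a trigram index, intersects the posting lists of the query's trigrams to get candidate documents, then verifies each candidate (the trigram filter can admit false positives). A toy Go version, with made-up documents:

```go
package main

import (
	"fmt"
	"strings"
)

// trigrams returns the 3-byte substrings of s.
func trigrams(s string) []string {
	var ts []string
	for i := 0; i+3 <= len(s); i++ {
		ts = append(ts, s[i:i+3])
	}
	return ts
}

// search finds documents containing a literal query: a doc is a
// candidate only if it contains every trigram of the query, and each
// candidate is verified with a real substring check afterwards.
func search(docs []string, query string) []string {
	// Inverted index: trigram -> set of document IDs.
	index := map[string]map[int]bool{}
	for id, d := range docs {
		for _, t := range trigrams(d) {
			if index[t] == nil {
				index[t] = map[int]bool{}
			}
			index[t][id] = true
		}
	}
	var hits []string
	for id, d := range docs {
		ok := true
		for _, t := range trigrams(query) {
			if !index[t][id] {
				ok = false
				break
			}
		}
		if ok && strings.Contains(d, query) {
			hits = append(hits, d)
		}
	}
	return hits
}

func main() {
	docs := []string{"google code search", "suffix arrays in go", "regular expressions"}
	fmt.Println(search(docs, "code")) // [google code search]
}
```

Regex queries work the same way once you derive a trigram query from the regex's required literal strings, which is the clever part of the linked article.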

------
thedufer
> Reading from the harddrive is very slow.

Why were you reading from the hard drive with Mongo? Prefix indexing is the only
kind of string indexing that Mongo does, but it does a pretty good job with
it. You must have a pretty large dataset if it couldn't keep the index in
memory.

> It could be easily hit with a regex injection

It should be pretty easy to regex-escape the search string. Easier than
building a substring search engine.

> It couldn't handle artists with multiple names

That sounds like a data format issue.

What I'm getting at is that, while the prefix-only issue might have been
enough to justify this switch, none of your other reasons make a ton of sense
to me.

------
fleitz
Amazing, servicing _thousands_ of users for $150 a month.

I'd clearly start writing my own indexing engine instead of using PostgreSQL,
or maybe switching to digital ocean/hetzner.

~~~
pjscott
You don't know the size of their dataset, or the number of queries per second
they need to support, or the latency requirements they have to meet. They do,
and found that the obvious solution -- prefix queries on B-trees -- was too
slow in practice. Their actual measurements trump your sarcasm, and you
would know this if you'd read beyond the first paragraph before contemptuously
dismissing the whole thing.

~~~
fleitz
You don't know that I didn't read the whole article, and since I did I know
their dataset is < 8GB because they are running EC2 medium instances and they
said they have a budget of $150 which means they have a max of 2 servers. Also
because I read the code I know it doesn't write to disk which means the
dataset has to fit in RAM unless they are relying on the pager to swap memory
for them.

Since an 8 GB dataset is a fucking joke I made fun of their post in a
sarcastic way.

Also databases that lock are a fucking joke too, maybe I'll just repeat
webscale a few times and that will make it performant.

Should be titled "Startup throws out Mongo DB, gets decent performance"

~~~
argusdusty
EC2 medium instances come with 480GB of space. Our database currently takes
roughly 60GB.

Ferret isn't running on this entire database, though - it's for auto-complete
searches over our artist names, roughly 4MB.

EDIT: Since I can't seem to reply to fleitz's comment below, I'll post a
response here:

1. Boyer-Moore takes linear time; Ferret takes logarithmic time. Also,
Boyer-Moore is intended for searching over a single string, so using it on a
dictionary would require some sort of hack like a termination character,
taking slightly more memory and time.

2. A trie was one of the original iterations of Ferret. The reason it lost
out was memory usage: it requires memory quadratic in the length of every word
(because this is a suffix search, not just a prefix search).

~~~
fleitz
4MB? Wouldn't boyer-moore, or a trie be very efficient for this?

~~~
thedufer
Boyer-Moore doesn't make sense if you're searching the same data set over and
over again - it doesn't do any pre-processing on the data, which is why it
can't be better than linear time.

------
drsintoma
Nice. I've been waiting for a while to see a full-text search engine in Go
to compete with the omnipresent Lucene. Things like Go's low memory footprint
could give it an important advantage in this area. This looks like a first
step in that direction.

~~~
danieldk
_Things like Go's low memory footprint could give it an important advantage
in this area._

You know that there is a C++ version of Lucene?

[http://clucene.sourceforge.net/](http://clucene.sourceforge.net/)

There's also Xapian, which is built in C++:

[http://xapian.org/](http://xapian.org/)

So, if Java's footprint is a concern, there are alternatives.

~~~
nkurz
I hadn't realized that CLucene was still active, thanks!

Apache Lucy is another option that might appeal if you consider Lucene
bloated. It's written in C, and while there aren't Go bindings yet, there's
great interest in providing them and a couple people starting on them:
[http://mail-archives.apache.org/mod_mbox/lucy-user/201308.mb...](http://mail-archives.apache.org/mod_mbox/lucy-user/201308.mbox/browser)

------
adcuz
Future HN headline: "Startup accelerator written in Go"

------
fspeech
Very cool. Considering that the core functionality you want is done in less
than 500 LOC
([https://github.com/argusdusty/Ferret/blob/master/ferret.go](https://github.com/argusdusty/Ferret/blob/master/ferret.go))
it seems a great way to go instead of depending on a large external
library. :) OTOH, you could probably have just as easily written it in another
performant language. But since Go happens to be performant and is already used
in other parts of your project, things just turned out great for you.

------
jph
Is this related to the Ferret search engine for Ruby?
[https://en.wikipedia.org/wiki/Ferret_search_library](https://en.wikipedia.org/wiki/Ferret_search_library)

~~~
argusdusty
Author here: They are unrelated.

We did take the time to examine writing our own full search engine based on
Lucene, like the aforementioned search library. The reasons why we didn't are
briefly mentioned in the blog post, but simply put, we found Lucene too
bloated for our purposes (intended for large-scale data retrievals, among many
other things). We just needed a low-cost search over a relatively small
dictionary, which could be used in an auto-complete field without stealing
resources from other processes, such as our recommendation engine.

------
6thSigma
Clicking on the logo gives a 404 error by the way.

~~~
argusdusty
Thanks for the heads-up. It should be working now.

------
oron
doesn't work for me ... looks like it's stuck

------
rorrr2
$150/month can buy you a very very good dedicated server.

[http://www.hetzner.de/en/hosting/produkte_rootserver/ex10](http://www.hetzner.de/en/hosting/produkte_rootserver/ex10)

~~~
argusdusty
The $150 per month includes several other servers we run. The server running
our core app backend costs under half that, and handles our requirements just
fine. Spending another $80 per month to save me development time is a cost we,
as a poorly funded startup, simply cannot afford.

~~~
thedufer
What are you valuing your time at? That sounds like shaky logic at best,
especially considering how much work it sounds like this took.

