Show HN: We built a fast substring search engine in Go (tamber.com)
145 points by alexirobbins on Aug 29, 2013 | 41 comments



There are two methods more typically associated with substring search that may help you get something even better than your prefix + Levenshtein-distance-1 searches.

The first and most common is stemming. It can be useful, but it requires a lot of hand tuning and doesn't get you all that much (at least in my experience).

The second is using the phonetics of the words themselves to try to figure out what someone meant to type. The grand-daddy of these is Soundex, which works, but isn't as good as Metaphone. Metaphone and Double Metaphone are algorithms that translate words in most Latin-alphabet languages into abbreviated phoneme representations.

Double Metaphone, in particular, has allowed me to build spelling-agnostic prefix matching as an effectively trivial service for two different companies (the details of which you can read about in the Redis mailing list, and/or section 6.1.2 in my book, Redis in Action). The long and short of it is that if you used double metaphone, for the cost of 2-3x index size, you could get all of the results you are looking for in 2-3 searches instead of length(word) searches.
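To make the idea concrete, here's a minimal Go sketch of a phonetic-key prefix index (my illustration, not the Redis recipe from the book): doubleMetaphone below is a crude stand-in for a real phonetic-encoding library, and the example names are made up.

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // doubleMetaphone is a placeholder for a real phonetic encoder --
    // here it just keeps the first letter and drops later vowels.
    func doubleMetaphone(word string) string {
        w := strings.ToUpper(word)
        key := w[:1]
        for _, c := range w[1:] {
            if !strings.ContainsRune("AEIOUY'", c) {
                key += string(c)
            }
        }
        return key
    }

    type phoneticIndex struct {
        keys  []string            // sorted phonetic keys
        words map[string][]string // phonetic key -> original words
    }

    func buildIndex(words []string) *phoneticIndex {
        idx := &phoneticIndex{words: map[string][]string{}}
        for _, w := range words {
            k := doubleMetaphone(w)
            idx.words[k] = append(idx.words[k], w)
        }
        for k := range idx.words {
            idx.keys = append(idx.keys, k)
        }
        sort.Strings(idx.keys)
        return idx
    }

    // prefixMatch encodes the query once and does a single prefix scan
    // over the sorted keys -- one search instead of length(word) searches.
    func (idx *phoneticIndex) prefixMatch(query string) []string {
        q := doubleMetaphone(query)
        var out []string
        for i := sort.SearchStrings(idx.keys, q); i < len(idx.keys) && strings.HasPrefix(idx.keys[i], q); i++ {
            out = append(out, idx.words[idx.keys[i]]...)
        }
        return out
    }

    func main() {
        idx := buildIndex([]string{"Smith", "Smythe", "Schmidt"})
        fmt.Println(idx.prefixMatch("smyth")) // [Smith Smythe]
    }

The real win comes from a proper Double Metaphone encoder, which maps different spellings of the same sounds onto the same key; the index structure stays the same.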


I'd say having spelling-agnostic search would be crucial. I searched Google Play Music yesterday for an artist I'd heard of whose name sounded something like "n door", so I typed that, and he came up as the top hit: "Youssou N'Dour". With this Ferret search engine, all I get is "Nas".


This is a good idea :) I'll probably add in a module for phonetic matching when I get the time.


A good idea is using a paper towel to open the door on your way out of a public restroom.

This is practical advice based on experience over several months of research and development scattered over 9 years. ;)


This is cool. However, I wonder why the author didn't switch to a Lucene-based server such as Elasticsearch or Solr, for which you can find decent Go clients (or use the REST api directly). The NGram tokenizer lets you do infix search out of the box. By the way, the author says Lucene is "bloated." That's just nonsense, like saying MySQL is "bloated." LinkedIn and Twitter use Lucene to handle insane numbers of QPS with real-time updates. There is no excuse for not trying a search server before deciding to hack something new.
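For what it's worth, the ngram setup is only a few lines of index settings. A hedged sketch of creating such an index from Go over the REST API (the index name, field name, and localhost URL are made up for illustration, and the exact settings syntax can differ between Elasticsearch versions):

    package main

    import (
        "fmt"
        "net/http"
        "strings"
    )

    func main() {
        // Trigram analyzer: every 3-character substring becomes a token,
        // so a plain match query behaves like infix search.
        settings := `{
          "settings": {
            "analysis": {
              "analyzer": {
                "substring": {"type": "custom", "tokenizer": "trigrams", "filter": ["lowercase"]}
              },
              "tokenizer": {
                "trigrams": {"type": "nGram", "min_gram": 3, "max_gram": 3}
              }
            }
          },
          "mappings": {
            "artist": {"properties": {"name": {"type": "string", "analyzer": "substring"}}}
          }
        }`
        req, err := http.NewRequest("PUT", "http://localhost:9200/artists", strings.NewReader(settings))
        if err != nil {
            panic(err)
        }
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            panic(err)
        }
        defer resp.Body.Close()
        fmt.Println(resp.Status)
    }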

My impression from reading the post is that they went from an extremely inefficient solution straight to one that seems way overengineered for the company's scale. The app has only 28 ratings on iTunes, which means that it's not greatly popular ($150/month is not much more than my personal AWS bill).

My guess is that the highest ROI of this code (besides the joy of hacking) comes from the HN page views. Usually, optimizing software to this level is a bad idea for a small startup. You're not Twitter, GitHub, or Dropbox. Code less and grow your business more.


It's nice to see suffix arrays being used in the real world. For anyone who is interested in suffix arrays, the following two papers are recommended:

* Suffix arrays: A new method for on-line string searches, U Manber, G Myers, 1993

* Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, M Yamamoto, KW Church, 2001

The second paper actually uses an 'inverted index' for matching sequences to documents.


Suffix arrays are useful for approximate matches too -- they're used by the binary delta program BSDiff[1] to match similar code sequences. BSDiff is used in Google products (including Chrome)[2] and in Firefox's delta updates.

[1]: http://www.daemonology.net/papers/bsdiff.pdf

[2]: It performs additional preprocessing to make updates smaller (Courgette).


Courgette isn't BSDiff, it is Google's replacement for/improvement upon BSDiff. They were using BSDiff for Chrome updates before they switched to Courgette.


Courgette is a preprocessing stage to reduce the number of differences (disassembly + offset fixup + reassembly) plus BSDiff to catch anything else (using lzma instead of bzip2).


URL of the paper: "Using Suffix Arrays to Compute Term Frequency and Document Frequency for All Substrings in a Corpus, M Yamamoto, KW Church, 2001"

http://acl.ldc.upenn.edu/J/J01/J01-1001.pdf


> It's nice to see suffix arrays being used in the real world.

You are apparently not working in bioinformatics. Suffix arrays are used a LOT here, though mostly via the more advanced stuff like the BWT. Although, maybe genomics and next-generation sequencing technology are not very "real worldy" to most... :)


Related: Russ Cox wrote and documented a miniature version of Google Code Search (a fast regex search engine) in Go: http://swtch.com/~rsc/regexp/regexp4.html

Using a Suffix Array should make it possible to do fuzzy matching easily, but I haven't examined precisely how they combined it with an inverted index.
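Incidentally, Go's standard library ships a suffix array (index/suffixarray), so it's easy to experiment with this. A small sketch over a newline-joined dictionary (the artist names and the separator trick are mine, not Ferret's actual layout):

    package main

    import (
        "fmt"
        "index/suffixarray"
        "regexp"
        "strings"
    )

    func main() {
        // Join the dictionary with newlines so a match offset can be
        // mapped back to the containing entry.
        artists := []string{"Youssou N'Dour", "Nas", "Nirvana"}
        data := "\n" + strings.Join(artists, "\n") + "\n"
        idx := suffixarray.New([]byte(data))

        // Exact substring query: Lookup returns the byte offset of every occurrence.
        for _, off := range idx.Lookup([]byte("Dour"), -1) {
            start := strings.LastIndex(data[:off], "\n") + 1
            end := off + strings.Index(data[off:], "\n")
            fmt.Println(data[start:end]) // Youssou N'Dour
        }

        // FindAllIndex accepts a regexp, so simple fuzzy patterns work too.
        re := regexp.MustCompile(`(?i)n.?d`)
        fmt.Println(len(idx.FindAllIndex(re, -1))) // 1
    }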


> Reading from the harddrive is very slow.

Why were you reading from the hard drive with Mongo? Prefix indexing is the only kind of string indexing that Mongo does, but it does a pretty good job with it. You must have a pretty large dataset if it couldn't keep the index in memory.

> It could be easily hit with a regex injection

It should be pretty easy to regex-escape the search string. Easier than building a substring searching engine.
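In Go that escaping is one standard-library call (the query string here is just an example):

    package main

    import (
        "fmt"
        "regexp"
    )

    func main() {
        userInput := `n.*dour$` // would be a regex injection if embedded raw
        fmt.Println(regexp.QuoteMeta(userInput)) // n\.\*dour\$ -- safe to embed in a prefix pattern
    }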

> It couldn't handle artists with multiple names

That sounds like a data format issue.

What I'm getting at is that, while the prefix-only issue might have been enough to justify this switch, none of your other reasons make a ton of sense to me.


Amazing, servicing thousands of users for $150 a month.

I'd clearly start writing my own indexing engine instead of using PostgreSQL, or maybe switch to DigitalOcean/Hetzner.


You don't know the size of their dataset, or the number of queries per second they need to support, or the latency requirements they have to meet. They do, and they found that the obvious solution -- prefix queries on B-trees -- was too slow in practice. Their actual measurements trump your sarcasm, and you would know this if you'd read beyond the first paragraph before contemptuously dismissing the whole thing.


You don't know that I didn't read the whole article. Since I did, I know their dataset is < 8GB, because they are running EC2 medium instances and they said they have a budget of $150, which means they have a max of 2 servers. Also, because I read the code, I know it doesn't write to disk, which means the dataset has to fit in RAM unless they are relying on the pager to swap memory for them.

Since an 8 GB dataset is a fucking joke I made fun of their post in a sarcastic way.

Also databases that lock are a fucking joke too, maybe I'll just repeat webscale a few times and that will make it performant.

Should be titled "Startup throws out Mongo DB, gets decent performance"


EC2 medium instances come with 480GB space. Our database currently takes roughly 60GB.

Ferret isn't running on this entire database, though - it's for auto-complete searches over our artist names, roughly 4MB.

EDIT: Since I can't seem to reply to fleitz's comment below, I'll post a response here:

1. Boyer-Moore takes linear time. Ferret takes logarithmic time. Also, it's intended for searching over a single string, so using it on a dictionary would require some sort of hack like a termination character, taking slightly more memory and time.

2. A trie was one of the original iterations of Ferret. The reason it lost out was memory usage, requiring quadratic memory from every word (because this is a suffix search, not just a prefix search).


I don't understand why the trie would take quadratic memory. I would expect it to be high-constant linear. Can you explain?


Because a trie works great for prefix searches, but we want substring searches, so we'd have to modify the trie to be more like a suffix tree: in its simplest form, inserting each possible suffix into the trie and doing a prefix search over that. For each word, adding up the length of each suffix gives quadratic memory.
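A quick Go sketch of that naive approach, just to show what ends up being stored (it uses a sorted slice of suffixes rather than actual trie nodes; a real trie adds per-node pointer overhead on top of the character counts, and the artist names are only an example):

    package main

    import (
        "fmt"
        "sort"
        "strings"
    )

    // Naive substring index: store every suffix of every word and answer
    // substring queries with a prefix scan over the sorted suffixes.
    // Storage is the sum of all suffix lengths -- O(m^2) characters per word of length m.
    type suffixList struct {
        suffixes []string // each entry is "suffix\x00originalWord"
    }

    func build(words []string) *suffixList {
        s := &suffixList{}
        for _, w := range words {
            lw := strings.ToLower(w)
            for i := range lw {
                s.suffixes = append(s.suffixes, lw[i:]+"\x00"+w)
            }
        }
        sort.Strings(s.suffixes)
        return s
    }

    func (s *suffixList) substringSearch(q string) []string {
        q = strings.ToLower(q)
        var out []string
        for i := sort.SearchStrings(s.suffixes, q); i < len(s.suffixes) && strings.HasPrefix(s.suffixes[i], q); i++ {
            out = append(out, strings.SplitN(s.suffixes[i], "\x00", 2)[1])
        }
        return out
    }

    func main() {
        idx := build([]string{"Youssou N'Dour", "Nas", "Nirvana"})
        fmt.Println(idx.substringSearch("dour")) // [Youssou N'Dour]
        fmt.Println(len(idx.suffixes))           // total number of stored suffixes
    }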


That's linear. It's quadratic on each word, but that doesn't matter (this is why I mentioned high-constant). It's linear on the size of the data set, which is the important part. I'd be happy to code it up if you'd like. Even a binary tree on all suffixes (which would be somewhat simpler to write) would have the linear memory/logarithmic search time that you're looking for.

Edit: I can't reply to you yet, so here goes:

Calling it O(n m^2) is, at best, an abuse of big-O notation. Since m is a constant (1), that is large-constant O(n). Which is what I've been saying.

Yes, it increases the size of the data set. That would be the "large-constant" part.

(1) If this is not the case, I would love to hear why.


I did say "requiring quadratic memory from every word." I'm not sure you can really call that linear - it's O(n m^2), where n is the number of words, and m is the average word length. It's not the O(n m) that Ferret achieves.

For even a medium sized dictionary of a few dozen MB, you'll find yourself quickly running out of memory. The 4MB dictionary running on the demo would jump to 328MB. Both a Trie and a binary search tree (both of which I've coded and tested) take significantly more memory, and the binary search tree is somewhat slower (try doing a tree traversal between two points as fast as the same traversal over an array).

EDIT: I'll just reply to your edit here for simplicity, and because I really don't want to start another thread as I have to get to sleep. I view m as about as constant as n - adding or subtracting words generally changes them both. Try thinking of m as c/n, where c is the total number of characters in your dictionary. O(n m^2) -> O(c^2 / n), whereas Ferret uses O(c). The 1,000,000 most frequent English words might have an m of 6-7, but the 100,000 longest English words might have an m of 12-15, taking more memory in a trie, but less memory in Ferret.

But the point is really irrelevant, tbh. We both know how m and n work, and how much memory the trie and Ferret cost for different dictionaries, which is all that should really matter.


4MB? Wouldn't Boyer-Moore or a trie be very efficient for this?


Boyer-Moore doesn't make sense if you're searching the same data set over and over again - it doesn't do any pre-processing on the data, which is why it can't be better than linear time.


Nice. I've been waiting for a while to see a full-text search engine in Go that can compete with the omnipresent Lucene. Things like Go's low memory footprint could give it an important advantage in this area. This looks like a first step in that direction.


> Things like Go's low memory footprint could give it an important advantage in this area.

You know that there is a C++ version of Lucene?

http://clucene.sourceforge.net/

There's also Xapian, which is built in C++:

http://xapian.org/

So, if Java's footprint is a concern, there are alternatives.


I hadn't realized that CLucene was still active, thanks!

Apache Lucy is another option that might appeal if you consider Lucene bloated. It's written in C, and while there aren't Go bindings yet, there's great interest in providing them and a couple people starting on them: http://mail-archives.apache.org/mod_mbox/lucy-user/201308.mb...


As for other non-Java Lucene alternatives, Sphinx works great, is compact and the C++ code is fairly easy to pick up enough of to make modifications if needed:

http://sphinxsearch.com/


There are certainly alternatives to Lucene.

What about idzebra? It's written in C, it predates Lucene and it was used as the indexing system for Harvest (the origin of Squid).

What is more likely to give an indexing system written in Go an advantage over the alternatives is not a "low memory footprint" but the marketing of Go. It seems the language is being promoted nearly every day on HN's front page. If this is any indication of usage trends among developers, the Go Authors must be pleased.


Since 4.0, Lucene's memory usage has dropped quite a bit actually.


There are a lot of good people working on Lucene, and every year good contributions come via GSoC.


Future HN headline: "Startup accelerator written in Go"


Very cool. Considering that the core functionality you want is done in less than 500 LOC (https://github.com/argusdusty/Ferret/blob/master/ferret.go), it seems a great way to go :) instead of depending on a large external library. OTOH, you could probably have written it just as easily in another performant language. But since Go happens to be performant and is already used in other parts of your project, things just turned out great for you.


Is this related to the Ferret search engine for Ruby? https://en.wikipedia.org/wiki/Ferret_search_library


Author here: They are unrelated.

We did take the time to examine writing our own full search engine based on Lucene, like the aforementioned search library. The reasons why we didn't are briefly mentioned in the blog post, but simply put, we found Lucene too bloated for our purposes (intended for large-scale data retrievals, among many other things). We just needed a low-cost search over a relatively small dictionary, which could be used in an auto-complete field without stealing resources from other processes, such as our recommendation engine.


I thought the same and was going to mention something. Seeing 'Ferret' alongside search brought up bad experiences with that Ruby library.

But it looks like this Go library is not based on Lucene, which the Ruby one was, and I don't think the Ruby version is maintained anymore, so maybe it's not a problem to have the same name.


Clicking on the logo gives a 404 error by the way.


Thanks for the heads-up. It should be working now.


doesn't work for me ... looks like it's stuck


$150/month can buy you a very very good dedicated server.

http://www.hetzner.de/en/hosting/produkte_rootserver/ex10


The $150 per month includes several other servers we run. The server running our core app backend costs under half that, and handles our requirements just fine. Spending another $80 per month to save me development time is a cost we, as a poorly funded startup, simply cannot afford.


What are you valuing your time at? That sounds like shaky logic at best, especially considering how much work it sounds like this took.



