

Announcing Soulmate, a Redis-backed service for fast autocompleting - ericwaller
http://seatgeek.com/blog/dev/announcing-soulmate-a-redis-backed-service-for-fast-autocompleting

======
timr
Are you really using the technique described in the Redis auto-complete page?
Doesn't that method take a lot more RAM than is necessary with a more
specialized approach (e.g. a trie)?

Also, from what I can tell, every query is log(N) in the size of the
completion set, instead of linear in the length of the query/suggestion
(again, like a trie). Seems like this might have trouble scaling to large
suggestion sets.

~~~
ericwaller
We're actually using a slight variation of the second technique described,
which involves maintaining a sorted set of the most relevant results for every
possible prefix.

A query for a single-prefix term (e.g. "yank") is technically O(log(N)+M),
where N is the # of items in the sorted set, and M is the # of items returned.
But since in practice we often want a small # of items, we can keep both N and
M small.

One tradeoff we made is that this approach doesn't support incremental updates
easily: you can always add new items, but it would take a good bit of work to
remove items that have expired, or to update the scores of items that change.

As for memory usage, we only store each item once (in a redis hash table) and
use the unique ids as entries in the sorted sets, which sort of approaches a
trie, since instead of:

 _ya -> yan -> yank -> yanke -> yankee -> yankees -> [data about yankees]_

we have:

 _ya -> 101_

 _yan -> 101_

 _..._

 _101 -> [data about yankees]_
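Soulmate's actual implementation is Ruby on top of Redis; as a rough in-memory sketch of the idea described above, here is a hypothetical Python version where plain dicts stand in for the Redis hash (id -> data, stored once) and Redis sorted sets (prefix -> ids ordered by score). Names and numbers are illustrative, not from the gem:

```python
# Hypothetical in-memory sketch of the prefix-index approach: every prefix of
# a term maps to a score-ordered list of item ids, and each item's data is
# stored exactly once under its id.

from collections import defaultdict

class PrefixIndex:
    def __init__(self, min_prefix=2):
        self.min_prefix = min_prefix
        self.prefixes = defaultdict(list)   # prefix -> [(-score, id), ...]
        self.data = {}                      # id -> payload, stored once

    def add(self, item_id, term, score, payload):
        self.data[item_id] = payload
        term = term.lower()
        for end in range(self.min_prefix, len(term) + 1):
            bucket = self.prefixes[term[:end]]
            bucket.append((-score, item_id))  # negate so best scores sort first
            bucket.sort()

    def complete(self, prefix, limit=5):
        ids = [i for _, i in self.prefixes.get(prefix.lower(), [])[:limit]]
        return [self.data[i] for i in ids]

idx = PrefixIndex()
idx.add(101, "yankees", 95, {"name": "New York Yankees"})
idx.add(102, "yankton", 10, {"name": "Yankton, SD"})
idx.complete("yank")   # best-scored match first: Yankees, then Yankton
```

In real Redis the `bucket.sort()` bookkeeping is what ZADD does for free, and `complete` would be a single ZRANGE plus an HMGET.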

~~~
timr
Yeah, that's what it sounded like.

Aside from not supporting incremental updates, that approach has a high memory
overhead. For every word of length N, you've got to store 1 + 2 + ... + N-1 +
N characters, which is O(N^2). The constant is 1/2, and you do get some
savings in shared prefixes, so that's definitely a worst-case upper bound --
but a trie grows with O(N).

The O(log(N)) lookup time can also be a limiting factor with large sets. A
trie is going to give you worst-case linear lookup time in the length of the
longest string in your set. This can be a pretty dramatic difference as string
sets get large.
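For comparison, here is a minimal trie sketch (my illustration, not from the article): shared prefixes are stored once as shared nodes, so memory grows with the number of distinct trie nodes rather than with the sum of all prefix lengths, and a prefix lookup walks only len(prefix) links before collecting results.

```python
# Minimal trie: one node per distinct prefix, children keyed by character.

class TrieNode:
    __slots__ = ("children", "ids")
    def __init__(self):
        self.children = {}  # char -> TrieNode
        self.ids = []       # ids of terms ending exactly here

def trie_insert(root, term, item_id):
    node = root
    for ch in term:
        node = node.children.setdefault(ch, TrieNode())
    node.ids.append(item_id)

def trie_prefix_ids(root, prefix):
    """Walk down the prefix (O(len(prefix))), then collect all ids below."""
    node = root
    for ch in prefix:
        node = node.children.get(ch)
        if node is None:
            return []
    out, stack = [], [node]
    while stack:
        n = stack.pop()
        out.extend(n.ids)
        stack.extend(n.children.values())
    return out

root = TrieNode()
for i, w in enumerate(["yankees", "yankton", "yarn"]):
    trie_insert(root, w, i)
```

Note the collection step is unbounded here; a production autocompleter would cap or pre-rank the results under each node, which is exactly the work the sorted-set scheme trades memory for.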

~~~
Robin_Message
Isn't that O(Nk), where k is the average word length and N is the number of
words? Given an upper bound on k, this is no worse than a trie. Of course, if
I've misunderstood that part, then ignore what follows.

In a real world scenario (English language), since words share so much
structure, this only grows as O(N). [1] Although the top 1000 recently
searched items might not share as much structure, that is a trivial amount of
data, even for O(N^2). The storage overhead on a trie is high, especially as
many bigrams don't occur in English, so array-based tries necessarily waste
space.

You're right that the lookup time is killer though. A ternary search tree
gives you a good trade-off – really it's just a good way to store a trie
with less memory overhead.
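To illustrate the trade-off (my sketch, not from the thread): a ternary search tree node holds a single character plus lo/eq/hi links, so it encodes a trie's branching without a per-node child array or hash map.

```python
# Rough ternary search tree sketch: ch plus three links per node.
# lo/hi branch on character order; eq advances to the next character.

class TSTNode:
    __slots__ = ("ch", "lo", "eq", "hi", "end")
    def __init__(self, ch):
        self.ch, self.lo, self.eq, self.hi, self.end = ch, None, None, None, False

def tst_insert(node, word, i=0):
    ch = word[i]
    if node is None:
        node = TSTNode(ch)
    if ch < node.ch:
        node.lo = tst_insert(node.lo, word, i)
    elif ch > node.ch:
        node.hi = tst_insert(node.hi, word, i)
    elif i + 1 < len(word):
        node.eq = tst_insert(node.eq, word, i + 1)
    else:
        node.end = True          # a stored word ends at this node
    return node

def tst_contains(node, word, i=0):
    if node is None:
        return False
    ch = word[i]
    if ch < node.ch:
        return tst_contains(node.lo, word, i)
    if ch > node.ch:
        return tst_contains(node.hi, word, i)
    if i + 1 < len(word):
        return tst_contains(node.eq, word, i + 1)
    return node.end

root = None
for w in ["yankees", "yankton", "yarn"]:
    root = tst_insert(root, w)
```

Prefix search works the same way as exact search: walk to the node for the last prefix character, then traverse its eq-subtree.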

1\. Running the following commands

    
    
        # wc -l < /usr/share/dict/words
        234936
        # cat /usr/share/dict/words | sed '{: foo
            p
            s/^\(.*\).$/\1/
            t foo
            p
        }' | sort | uniq | wc -l
        791089
    

shows only a factor of 3.3 increase when expanding a large English dictionary
to all of its prefixes. For propernames, 1323 words turn into 3612 prefixes, a
factor of 2.7.

~~~
timr
_"Isn't that O(Nk), where k is the average word length and N is the number of
words. Given an upper bound on k, this is no worse than a trie."_

Let's make an even stronger assumption that every word in your set has length
k. You then need to store (k^2 + k)/2 characters per distinct word with the
sorted set implementation (plus any additional data-structure overhead). In a
worst-case scenario, you have N of these k-length words, distinct at the first
character (this might seem paranoid for something like ASCII, but think about
Unicode here...or just change the problem to N distinct strings at character
>=2), resulting in a worst-case memory bound of O(N * ((k^2 + k) / 2)).
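A quick sanity check of the counts above (k = 10 is an illustrative value, not from the thread): storing every prefix of a k-character word costs 1 + 2 + ... + k = (k^2 + k)/2 characters in the sorted-set scheme, versus at most k new trie nodes for the same word.

```python
# Per-word character cost of the prefix sorted-set scheme vs. a trie path.
k = 10
prefix_chars = sum(range(1, k + 1))      # characters across all k prefixes
trie_chars = k                           # at most one new node per character
assert prefix_chars == (k**2 + k) // 2   # quadratic in k
```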

That said, the possible range of N will be much larger than k for any non-
trivial k, so I suppose it's fair to say that real-world memory use is
something more like O(N) with a very high constant that's quadratically
related to your average word length. But that's still substantially worse than
a well-implemented trie (where your constant is small and unrelated to k), and
with large data sets every byte counts.

 _"Although the top 1000 recently searched items might not share as much
structure, that is a trivial amount of data, even for O(N^2)....the storage
overhead on a trie is high, especially as many bigrams don't occur in English,
so array-based tries necessarily waste space."_

If your set size is only 1,000, then yeah, this conversation is academic. I'm
talking about big suggestion sets -- for a website, it's not too difficult to
get into a situation where your suggestion set is in the hundreds of thousands
or millions of strings or more. Remember that web text is nothing like
/usr/share/dict. Proper nouns, acronyms, non-words, new words, foreign words,
etc. all conspire to make real-world language much larger and more diverse
than the system dictionary. The size of real-world vocabularies is one of
those counter-intuitive things that I wouldn't have guessed before working on
websites.

Finally, while you definitely have to be careful when implementing a trie,
there are improved trie algorithms (like burst tries), that greatly improve
upon average- and worst-case memory usage. A totally naïve trie implementation
has a ton of overhead...but then, a naïve set implementation has a ton of
overhead, too. So it's important to not be naïve. ;-)

(related: looks like the redis guys have been oddly resistant to adding tries,
despite several requests:
<https://code.google.com/p/redis/issues/detail?id=110>)

~~~
Robin_Message
(These are all excellent points and it's been a pleasure to discuss them with
you.)

Yes, it should have been O(N * k^2). I just didn't think it was as bad as
O(N^2). Of course, if you switch to Unicode, your trie gets even more
exciting ;)

I guess suggesting search queries is more complex, since you're now talking
about sentences. I'm not convinced that web text is nothing like the system
dictionary though – that dictionary has 300 thousand words, which dwarfs the
lexicon of even the most advanced user of English. Also, proper nouns
and new words still follow the phonetic structure of your base language, so
will compress well, but you're right on acronyms and non-words. Do you have
any references on the size of real-world vocabularies? Obviously personal
experience with implementing such systems counts for a great deal.

Finally, I agree that a non-naive trie is the best data structure for this (in
fact, they seem to be the best data structure full-stop – some tries beat hash
tables on lookup by index) but if you are building on the primitives you've
got, sorted sets don't seem too bad.

------
jedsmith
As a UX note, I've always quietly loved autocompletes that aren't just a flat
list of terms, but actually contain structured information organized in an
intuitive fashion. The design of the suggestions on SeatGeek is fantastic.

~~~
ericwaller
Thanks, very much agreed. I think Spotlight on OSX is probably the first place
I saw it done really well. I remember on XP I used to use search all the time,
but now on OSX I just autocomplete to what I want probably 95% of the time.

------
ncavig
It would be cool to have an example of this running over WebSockets, getting
rid of the request/response latency that most autocompletes have. Keep the
socket open while the text field is focused and you should be able to cut
response time even further without that overhead.

~~~
ahrjay
Considering Chrome and Firefox are switching off WebSockets until they
resolve the security issues, that wouldn't be very viable. I guess you could
fall back to Flash sockets or long polling.

------
dmix
The example autocomplete on Seatgeek.com is indeed impressively fast.

I might have to use this in my next service.

~~~
ericwaller
Thanks. If you do, definitely let us (SeatGeek) know how it goes.

~~~
kscaldef
Quick bit of feedback: I thought it was odd that I couldn't type in my city
name in conjunction with a band name to pare things down to just the show in
my town. Instead, it shows no results.

~~~
ericwaller
Very good point. Ideally, a really solid autocompleter lets you bypass search
as often as possible, and location awareness is definitely an important part
of that.

------
dhruvbird
How many phrases of length 30 could you handle with 1GB of RAM?

Or do you have numbers on the mean length of a phrase you handle currently,
the number of such phrases and how much memory it takes?

------
kin
Awesome work guys! I really hope this type of UI becomes more widespread. On a
side note, I've always thought the guys over at www.glyde.com execute it quite
well.

------
henriklied
Great stuff!

I'm curious: what do you think about exposing this service via WebSockets?
Would that make it even faster?

------
siculars
Oh, oh yes. Thank you kindly. I was literally about to embark on this very
feature. Let's take a closer look...

------
kingkilr
The UI on seatgeek is almost identical to what rdio provides, I wonder if rdio
is using it.

------
Detrus
My first few searches took a while, then every search was pretty fast. Is it
the load?

------
jarin
How fitting – this is perfect for the dating site I'm working on :)

~~~
ericwaller
Glad to hear the name works for someone. We thought it meshed nicely with some
of the other hyperbole out there in ruby gem names (god, unicorn, thor,
shotgun, etc.).

------
jbendotnet
Nice work.

