

Revised JavaScript Dictionary Search - atularora
http://ejohn.org/blog/revised-javascript-dictionary-search/

======
mckoss
This was a fun (and distracting) challenge. After a lot of iterations, I
think I've squeezed out about as much compression and performance as I can. You
can play with the Trie/DAWG generator here:

<http://lookups.pageforest.com>

Source and format documentation here:

<http://github.com/mckoss/lookups>

------
pbrumm
This has been a great set of blog posts.

Since you're covering all the bases, you should look at bloom filters.

Something like this could work and could be generated on the server:

<http://dren.ch/js-bloom-filter/>

(note: the newest post has a compatible Perl and Ruby impl)

Ilya Grigorik has a couple of relevant posts

<http://www.igvita.com/2008/12/27/scalable-datasets-bloom-filters-in-ruby/>

and a Ruby + Redis backend:

<https://github.com/igrigorik/bloomfilter>

Here are some of the memory footprints that Ilya got:

    * 1.0% error rate for 1M items, 10 bits/item: 2.5 MB
    * 1.0% error rate for 150M items, 10 bits/item: 358.52 MB
    * 0.1% error rate for 150M items, 15 bits/item: 537.33 MB
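The linked implementations aren't reproduced here, but the basic technique is simple enough to sketch. A minimal, hypothetical bloom filter (m bits, k hash positions derived from two base hashes, as in the Kirsch-Mitzenmacher construction) might look like this; the hash functions are toy ones chosen for brevity:

```javascript
// Minimal bloom filter sketch (hypothetical; not the linked implementation).
// m bits of storage, k probe positions per key.
function BloomFilter(m, k) {
  this.m = m;
  this.k = k;
  this.bits = new Uint8Array(Math.ceil(m / 8));
}

BloomFilter.prototype.hashes = function (str) {
  // Two simple string hashes; real implementations use stronger ones.
  var h1 = 5381, h2 = 0;
  for (var i = 0; i < str.length; i++) {
    h1 = ((h1 * 33) ^ str.charCodeAt(i)) >>> 0;
    h2 = (h2 * 31 + str.charCodeAt(i)) >>> 0;
  }
  // Derive k indices from the two base hashes.
  var out = [];
  for (var j = 0; j < this.k; j++) {
    out.push((h1 + j * h2) % this.m);
  }
  return out;
};

BloomFilter.prototype.add = function (str) {
  this.hashes(str).forEach(function (bit) {
    this.bits[bit >> 3] |= 1 << (bit & 7);
  }, this);
};

// May return a false positive, but never a false negative.
BloomFilter.prototype.test = function (str) {
  return this.hashes(str).every(function (bit) {
    return (this.bits[bit >> 3] & (1 << (bit & 7))) !== 0;
  }, this);
};
```

At roughly 10 bits per item with k around 7, this is the ~1% false-positive regime in the figures above.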

~~~
jeresig
This was discussed a bit in the Hacker News thread after the last blog post, but
I continue to feel that bloom filters aren't particularly useful here. Since any
solution I use would have to be error-free, I'd have to place the bloom filter
in front of another solution (likely the succinct trie structure that I
described).

Let's assume that the bloom filter is as fast as a hash lookup (unlikely, but
anyway): at most it'll save me about 5-46ms of lookup time, with some
non-trivial additional memory usage and data transfer. I'm 100% OK with having
5-46ms of lookup time if it means that I can have such an incredibly low
memory profile. If lookups took somewhere around 200ms then, yeah, I'd
definitely look at layering a bloom filter on top of this. As it stands,
though, using a bloom filter would only add code and memory usage for a
negligible benefit.
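The layering being described can be sketched in a few lines (all names here are hypothetical stand-ins, not the actual game's code). The key point is that a bloom filter can only rule words *out*: every "maybe" still has to be verified against the exact structure, so the filter saves at most the exact-lookup cost on misses, never on hits:

```javascript
// Layered lookup sketch: bloom filter in front of an exact structure.
// bloomTest and exactLookup are hypothetical function parameters.
function layeredLookup(word, bloomTest, exactLookup) {
  if (!bloomTest(word)) return false; // definite miss: exact lookup skipped
  return exactLookup(word);           // possible false positive: must verify
}

// Toy stand-ins for the real structures:
var dict = { cat: true, dog: true };
var bloomTest = function (w) { return w.length === 3; };    // crude "filter"
var exactLookup = function (w) { return dict[w] === true; };

layeredLookup("cat", bloomTest, exactLookup);   // true (filter passed, verified)
layeredLookup("zebra", bloomTest, exactLookup); // false (filter short-circuits)
layeredLookup("fox", bloomTest, exactLookup);   // false (false positive caught)
```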

~~~
rorrr
Why do you require 0% error?

~~~
jeresig
I would prefer that people spell words in my game that are actually words and
not abuses of the system. I expect users to be challenging one another so I'd
imagine that they wouldn't take too kindly towards one person getting an
unnecessary (and unexpected) advantage.

~~~
rorrr
If the error rate is 0.1% it would take a user 1000 tries (on average) to
enter a word that's not in the dictionary. That's a lot of attempts.
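The "1000 tries" figure is just the mean of a geometric distribution: with per-guess false-positive rate p, a fake word slips through on average once every 1/p attempts. A quick check of the arithmetic:

```javascript
var p = 0.001;             // 0.1% false-positive rate
var expectedTries = 1 / p; // mean attempts until a fake word slips through: 1000

// Chance of at least one slip-through within n guesses: 1 - (1 - p)^n.
function slipProbability(p, n) {
  return 1 - Math.pow(1 - p, n);
}
```

Note the flip side, though: across 1000 guesses there's still about a 63% chance of at least one slip-through, so a determined abuser will eventually find one.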

------
jacobolus
I’m curious whether regular expression matching could be used instead of these
tricky plain-string or string+object structures; storing a trie in a regular
expression (using grouping and alternation) should be both small in terms of
data over the wire and fast for lookups. The only question is how long
compiling the regular expression would take at startup; if that took too long,
it could be broken into parts, with each part compiled only when needed and
then memoized.
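One rough way to sketch this idea (hypothetical code, not a tested proposal): build an ordinary trie from the word list, then serialize it as grouped alternation so shared prefixes are stored once, and let the regex engine do the matching. This sketch assumes dictionary words contain no regex metacharacters:

```javascript
// Build a plain object trie; "$" marks end-of-word.
function buildTrie(words) {
  var root = {};
  words.forEach(function (word) {
    var node = root;
    for (var i = 0; i < word.length; i++) {
      node = node[word[i]] || (node[word[i]] = {});
    }
    node.$ = true;
  });
  return root;
}

// Serialize the trie as a regex fragment: shared prefixes appear once,
// branches become (?:a|b|...) groups, end-of-word becomes an optional tail.
function trieToRegex(node) {
  var branches = [], hasEnd = false;
  Object.keys(node).forEach(function (key) {
    if (key === "$") { hasEnd = true; return; }
    branches.push(key + trieToRegex(node[key]));
  });
  if (branches.length === 0) return "";
  var body = branches.length === 1
    ? branches[0]
    : "(?:" + branches.join("|") + ")";
  return hasEnd ? "(?:" + body + ")?" : body;
}

var words = ["cat", "car", "cart", "dog"];
var re = new RegExp("^" + trieToRegex(buildTrie(words)) + "$");
// "cat", "car", "cart", "dog" produce ^(?:ca(?:t|r(?:t)?)|dog)$
re.test("cart"); // true
re.test("ca");   // false
```

Whether a real dictionary-sized regex compiles and matches quickly enough is exactly the open question raised above.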

John, do you have a link to the corpus you’re using, so that we can use it for
tests? Or should people just send in code for you to test?

Edit: nevermind, it's here:
<https://github.com/jeresig/trie-js/raw/master/dict/string.txt>

------
ot
> However the new Succinct Trie ends up being disturbingly slow (taking about
> 5.5ms to look up a word in Node.js on a fast machine)

Ouch! A succinct trie I implemented in C++ took 2-3 _micro_ seconds per
lookup. I guess JavaScript is not the best language for low-level
data structure engineering.

