
DAWG data structure - sytelus
http://porcupineprogrammer.blogspot.com/2012/03/dawg-data-structure-in-word-judge.html
======
cdelsolar
DAWGs are really cool. GADDAGs are faster than DAWGs only for the special
purpose of finding a word given a prefix (applicable to Scrabble move-
finding). Otherwise they are the same speed if you're just anagramming; since
a GADDAG is at least 5 times bigger than a DAWG on average (given an English
dictionary corpus), the GADDAG may even be slower because it wouldn't fit in
cache memory.

The best known Scrabble AI (Quackle) uses both data structures, although only
the DAWG is strictly necessary.

I wrote a GADDAG maker in Go a while ago with the intent of turning it into a
Scrabble move finder, but haven't found the time to work on it in a while. I
would like to optimize it further too...

[https://github.com/domino14/macondo/tree/master/gaddag](https://github.com/domino14/macondo/tree/master/gaddag)

------
fekberg
I saw a lecture from Stanford years back (2008) that talked through this,
really interesting lecture:
[https://www.youtube.com/watch?v=TJ8SkcUSdbU](https://www.youtube.com/watch?v=TJ8SkcUSdbU)

Also, if anyone wonders, DAWG means: Directed Acyclic Word Graph
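
For anyone who wants to see the idea in code, here is a minimal Python sketch: build a plain trie, then merge identical subtrees bottom-up so that shared word endings are stored only once. The names (`Node`, `build_trie`, `minimize`, `count_nodes`, `contains`) are made up for illustration, and this is the simple build-then-merge approach, not necessarily what the article or any particular library does.

```python
class Node:
    """One state of the word graph."""
    __slots__ = ("final", "edges")

    def __init__(self):
        self.final = False   # True if a word ends at this node
        self.edges = {}      # letter -> child Node


def build_trie(words):
    """Insert every word into an ordinary (unshared) trie."""
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge equivalent subtrees bottom-up and return the canonical node.

    Two nodes are equivalent when they agree on `final` and their edges
    lead to the same canonical children.
    """
    for ch in list(node.edges):
        node.edges[ch] = minimize(node.edges[ch], registry)
    key = (node.final,
           tuple(sorted((ch, id(child)) for ch, child in node.edges.items())))
    return registry.setdefault(key, node)


def count_nodes(root):
    """Count distinct nodes reachable from root."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)


def contains(root, word):
    """Membership test: follow one edge per letter."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final
```

With the words `tap`, `taps`, `top` and `tops`, the trie has 8 nodes and the merged graph has 5, because the `a` and `o` branches end up pointing at the same `p` and `s` nodes.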

------
kmike84
There is another good DAWG/DAFSA implementation, dawgdic:
[https://code.google.com/p/dawgdic/](https://code.google.com/p/dawgdic/). The
README of
[https://github.com/chalup/dawggenerator](https://github.com/chalup/dawggenerator)
says it takes 55s to encode the
[http://sjp.pl/slownik/growy/](http://sjp.pl/slownik/growy/) data. I just
tried the same data with dawgdic, and it builds a DAFSA in 2s using a
Python wrapper, [https://github.com/kmike/DAWG](https://github.com/kmike/DAWG).
I think it is so much faster because dawgdic uses Daciuk's algorithm.

The resulting file size is slightly larger (1.6MB instead of 1.5MB), but
that's likely because I converted data from cp1250 to utf8 before encoding.

------
chubot
Hm, so testing for membership in a DAWG is like testing for one in a trie,
which involves a lot of pointer chasing?

Couldn't you get the small space requirements with a more cache-friendly
structure? What about this solution:

1) Train a Huffman code on the dictionary
2) Compress all words with this code
3) Store them concatenated and sorted
4) On lookup, compress the key and do a binary search
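
The four steps above can be sketched roughly like this in Python. It is only an illustration under some assumptions: codes are kept as strings of `0`/`1` rather than packed bits, and the encoded words live in a sorted list rather than one concatenated buffer (a real version would need an offset table or similar to find word boundaries). `huffman_code`, `encode` and `CompressedWordList` are names invented here.

```python
import heapq
from bisect import bisect_left
from collections import Counter
from itertools import count


def huffman_code(words):
    """Step 1: build a letter -> bit-string table from letter frequencies."""
    freq = Counter("".join(words))
    tiebreak = count()  # unique counter so the heap never compares dicts
    heap = [(w, next(tiebreak), {ch: ""}) for ch, w in freq.items()]
    heapq.heapify(heap)
    if not heap:
        return {}
    if len(heap) == 1:  # degenerate one-letter alphabet
        return {ch: "0" for ch in heap[0][2]}
    while len(heap) > 1:
        w1, _, left = heapq.heappop(heap)
        w2, _, right = heapq.heappop(heap)
        merged = {ch: "0" + code for ch, code in left.items()}
        merged.update({ch: "1" + code for ch, code in right.items()})
        heapq.heappush(heap, (w1 + w2, next(tiebreak), merged))
    return heap[0][2]


def encode(word, code):
    """Step 2: replace each letter by its code word."""
    return "".join(code[ch] for ch in word)


class CompressedWordList:
    def __init__(self, words):
        self.code = huffman_code(words)
        # Step 3: store the encoded words in sorted order.
        self.encoded = sorted(encode(w, self.code) for w in set(words))

    def __contains__(self, word):
        # A letter that never occurs in the dictionary cannot be encoded.
        if any(ch not in self.code for ch in word):
            return False
        # Step 4: compress the key and binary-search for it.
        key = encode(word, self.code)
        i = bisect_left(self.encoded, key)
        return i < len(self.encoded) and self.encoded[i] == key
```

Because a Huffman code is prefix-free, two different words never encode to the same bit string, so comparing encoded keys is enough for membership.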

Well I guess he says that a 36 MB dictionary goes down to 1.5 MB. That's
pretty good. I'd guess that with a good Huffman encoder you could probably get
it down to 3.6 MB (10%), so maybe they are still doing better in terms of
size.

I guess for some applications like games you don't really care too much about
lookup speed.

~~~
kmike84
You can store a Trie or a DAFSA in a single chunk of memory and still have
blazingly fast lookups - see double-array tries. The disadvantage is that
inserting/removing values is more costly, but DAFSA doesn't support it
anyways.
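
To illustrate the "single chunk of memory" point, here is a rough Python sketch that packs a trie into a few flat arrays and does lookups by index instead of by following object references. To be clear, this is not a double-array trie; it is a simpler layout picked only to show the idea, and all the names (`build_trie`, `flatten`, `contains`) are hypothetical.

```python
from array import array


def build_trie(words):
    """Nested-dict trie; the empty-string key marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def flatten(root):
    """Pack nodes into parallel arrays.

    Node i owns the edges numbered first[i] .. first[i + 1] - 1.
    """
    order = [root]            # breadth-first order gives each node an index
    index = {id(root): 0}
    first = array("I")        # start of each node's run of edges
    final = array("B")        # 1 if a word ends at the node
    labels = array("I")       # code point on each edge
    targets = array("I")      # destination node index of each edge
    i = 0
    while i < len(order):
        node = order[i]
        first.append(len(labels))
        final.append(1 if "" in node else 0)
        for ch in sorted(k for k in node if k):
            child = node[ch]
            if id(child) not in index:
                index[id(child)] = len(order)
                order.append(child)
            labels.append(ord(ch))
            targets.append(index[id(child)])
        i += 1
    first.append(len(labels))  # sentinel closing the last node's run
    return first, final, labels, targets


def contains(flat, word):
    """Membership test using only array indexing."""
    first, final, labels, targets = flat
    node = 0
    for ch in word:
        for e in range(first[node], first[node + 1]):
            if labels[e] == ord(ch):
                node = targets[e]
                break
        else:
            return False
    return bool(final[node])
```

The arrays can be written to disk and loaded back as-is, which is the main attraction of this kind of layout; the price is that inserting a word means rebuilding them.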

[https://code.google.com/p/dawgdic/](https://code.google.com/p/dawgdic/)
library works this way; I'm not sure, but it seems
[https://github.com/chalup/dawggenerator](https://github.com/chalup/dawggenerator)
also works this way.

For mutable tries there are HAT-Tries, which are designed to be cache-friendly;
see e.g. an implementation at
[https://github.com/dcjones/hat-trie](https://github.com/dcjones/hat-trie)
and a Python wrapper at
[https://github.com/kmike/hat-trie](https://github.com/kmike/hat-trie).

------
qzervaas
I am using a DAWG in my iOS word game, Hexiled[1].

It was pretty interesting to implement and sped up the real-time dictionary
searches dramatically (compared to my original solution, which was a basic SQL
LIKE match).

Additionally, Hexiled supports a number of languages, including ones with
accented characters, and this didn't have any negative impact on things.

[1] [http://hexiledgame.com](http://hexiledgame.com)

------
colig
I was just looking into this today and I found GADDAG
([http://ericsink.com/downloads/faster-scrabble-gordon.pdf](http://ericsink.com/downloads/faster-scrabble-gordon.pdf)),
which claims to be faster than DAWG.

~~~
Kutta
Not faster, just able to search prefixes and suffixes of a string, while plain
DAWG can only do suffixes efficiently.
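
To make the difference concrete, here is a small Python sketch of the strings a GADDAG stores for a single word, following the scheme in the paper linked above: every non-empty prefix reversed, then a separator, then the rest of the word. The function name and the `+` separator are illustrative choices, and this sketch drops the separator when nothing follows it; some write-ups keep it.

```python
SEP = "+"  # separator between the reversed prefix and the remaining suffix


def gaddag_entries(word):
    """Return the strings a GADDAG would store for `word`.

    For each split point, the prefix is reversed so a search can start at
    any letter of the word, walk left to the beginning, then cross the
    separator and walk right to the end.
    """
    entries = []
    for i in range(1, len(word) + 1):
        prefix, suffix = word[:i], word[i:]
        entry = prefix[::-1] + (SEP + suffix if suffix else "")
        entries.append(entry)
    return entries
```

For `care` this gives `c+are`, `ac+re`, `rac+e` and `erac`. A word of length n contributes n strings before any node sharing, which goes some way towards explaining why a GADDAG is so much bigger than a DAWG built from the same word list.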

------
erikb
Often when people talk about trees as a data structure, what they actually want
are DAGs. Filesystems, for example, are usually used as DAGs rather than trees,
because they can have links (on Linux you can actually create cycles, although
in practice people try to avoid that). Nice to see a good example.

------
Kenji
The images also contain the word "ablates" which he failed to mention. But
this is a very interesting article, learned a lot.

------
pgen
How does it compare to Ternary Search Tree?

~~~
kmike84
DAFSA is much smaller and faster, but you can't insert new items in DAFSA once
it is created, and you can't attach arbitrary values to keys.

------
mmwako
wut up, dawg.

