

Compressing Scrabble Dictionaries - nkurz
http://williamedwardscoder.tumblr.com/post/87682811573/compressing-scrabble-dictionaries

======
binarymax
I'm not really a data structures guy, but I love anagrams. When I wrote the
anagramica API, the simplest way that I could come up with a fast search was
this:

    
        - Take a word and sort its characters.
        - Add it to a dictionary where the key is the sorted characters and the value is the word.
        - If the sorted characters already exist in the dictionary, add the word to the list of words for that key.
    

This gives O(log n) when you give it a list of letters and you need to find
all the possible words.
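
A minimal Python sketch of this approach (function names are mine, not from the anagramica source):

```python
from collections import defaultdict

def build_anagram_index(words):
    # Map each word's sorted letters to the list of words sharing them.
    index = defaultdict(list)
    for word in words:
        index["".join(sorted(word))].append(word)
    return index

def find_anagrams(index, letters):
    # Sorting the query letters is O(m log m) for m letters; the dict
    # lookup itself is expected O(1) (a tree map would make it O(log n)
    # in the number of keys).
    return index.get("".join(sorted(letters)), [])
```

For example, after indexing ["stop", "pots", "tops", "cat"], querying with the letters "opst" returns all three anagrams.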

What benefits does a GADDAG offer over the above?

~~~
dbaupp
In O(log n), what is n? The number of characters? I would've thought that was
O(n log n) (due to the sort).

~~~
binarymax
Yes good point, allow me to clarify. n is the number of keys in the dictionary
(log n for the binary search). I suppose yes it is probably the more complex
O(n log n) to sort the letters before the search.

~~~
bradleyjg
If n is the number of keys in the dictionary, then the whole procedure would
not be O(n log n), it'd be O(m log n) where m is the number of letters. Given
that n >> m, I would think your original statement is correct.

------
btn
The node packing format he describes sounds a bit like a LOUDS tree [1], which
stores the structure of a tree as a bit array (each node contributes a '1' for
each child, plus a terminating '0', for a total of 2n-1 bits for a tree of n
nodes), with the data in a separate packed array. It can't represent the
node deduplication (nodes with multiple parents), but I think it gives
comparable compression: for the full word list of 3,213,156 nodes, the tree
structure is 6,426,311 bits (0.76MB), plus 3,213,156 bytes of character data,
for 3.83MB total.

The downside is that traversing the tree is a series of linear bit-counting
operations, which can be painfully slow without a bit of pre-caching.
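
A minimal sketch of that encoding and the linear bit-counting traversal (function names are mine; real LOUDS implementations replace the linear scans with cached rank/select structures):

```python
from collections import deque

def louds_encode(children, root):
    # children: dict mapping node -> list of child nodes.
    # Emit, in level order, one '1' per child followed by a terminating
    # '0': 2n - 1 bits for n nodes, as described above.
    bits, order = [], []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        kids = children.get(node, [])
        bits.extend([1] * len(kids))
        bits.append(0)
        queue.extend(kids)
    return bits, order

def louds_children(bits, i):
    # Children of the i-th node in level order: scan to the i-th '0' to
    # find the node's description, then each '1' before the next '0'
    # names a child whose level-order index is the count of '1's seen
    # so far, plus one.
    pos = zeros = 0
    while zeros < i:
        if bits[pos] == 0:
            zeros += 1
        pos += 1
    ones = sum(bits[:pos])  # rank1(pos): linear bit counting
    kids = []
    while bits[pos] == 1:
        kids.append(ones + 1)
        ones += 1
        pos += 1
    return kids
```

A tree with root a, children b and c, and grandchild d under b encodes to the 7 bits 1101000 (2*4 - 1), and both scans in louds_children are the linear passes that make pre-caching worthwhile.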

[1]:
[http://www.cs.cmu.edu/afs/cs.cmu.edu/project/aladdin/wwwloca...](http://www.cs.cmu.edu/afs/cs.cmu.edu/project/aladdin/wwwlocal/compression/00063533.pdf)

------
SeanDav
What open source dictionaries are being used by these scrabble programs? (I
know I could bing (or DDG) the answer but would like to hear from an "insider"
if possible.)

~~~
willvarfar
The contest linked at the top of the article has a standard word list. You
should enter! :)

~~~
baking
[http://www.azspcs.net/Content/AlphabetCity/Lexicon.txt](http://www.azspcs.net/Content/AlphabetCity/Lexicon.txt)

for those who couldn't find it the first time through.

------
rickbradley
This appears to be just reinvention of known algorithms on suffix trees. I
recommend (and recommended as a comment on the original article) Dan
Gusfield's book "Algorithms on Strings, Trees, and Sequences" which does a
pretty thorough job of covering the relevant algorithms and data structures.

~~~
willvarfar
The GADDAG is a standard data structure - the original paper, linked from the
article, is from 1994.

The article doesn't pretend to invent the GADDAG, nor claim to compress it
better than others, only to try and explain how to simplify and pack a GADDAG.
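
For reference, the per-word entries in Gordon's 1994 construction can be sketched like this (the '>' delimiter is an arbitrary choice for illustration):

```python
def gaddag_paths(word, sep=">"):
    # For each nonempty prefix, store REV(prefix) + sep + suffix, so a
    # search anchored at any letter of a placed word can walk left
    # through the reversed prefix, then right through the suffix.
    paths = []
    for i in range(1, len(word) + 1):
        rev, suffix = word[:i][::-1], word[i:]
        paths.append(rev + sep + suffix if suffix else rev)
    return paths
```

For "care" this yields "c>are", "ac>re", "rac>e", and "erac": one entry per anchor letter, which is exactly the redundancy the article's packing steps then compress away.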

The steps would work on all DAGs generally. This is nothing new, but hopefully
it's new to some of us and a nice article.

