Dictionary Compression (kevincox.ca)
44 points by todsacerdoti on March 1, 2022 | 18 comments



For reference, it seems that the most publicized attempt at compressing the Wordle dictionary is the Game Boy Wordle clone [1]. There are also several other independent attempts, including the golf.horse Kolmogorov complexity competition [2] (where my entry is, incidentally, in first place ;-).

[1] https://news.ycombinator.com/item?id=30402407

[2] http://golf.horse/wordle/


Isn't this best done with the trie-based dictionary (as in lexical dictionary) approach? Am I missing something here? Is there some additional compression opportunity due to the fact that all the words are the same length?


Tries are generally a good choice if you need to access the word list without first decompressing it. If you are looking for the best compression possible, though, there may be latent contexts whose existence you don't realize but which can still be leveraged.
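
For illustration, here is a minimal trie sketch in Python (the function names and the three-word list are made up for the example): lookups walk one node per letter, and shared prefixes are stored only once.

  # Minimal trie sketch: shared prefixes are stored once, and membership
  # checks never require decompressing a full word list.
  def build_trie(words):
      root = {}
      for w in words:
          node = root
          for ch in w:
              node = node.setdefault(ch, {})
          node["$"] = True  # end-of-word marker
      return root

  def contains(trie, word):
      node = trie
      for ch in word:
          if ch not in node:
              return False
          node = node[ch]
      return "$" in node

  trie = build_trie(["crane", "crate", "slate"])
  print(contains(trie, "crate"), contains(trie, "cranes"))  # True False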


> I would have expected at least some simple domain-specific compression.

Why? The site is static, so it's cached and browsers get a 304 when they come back. For new players the server compresses the whole thing in flight, not just the word list.

Doing some fancy domain-specific compression on the word list would have added unnecessary complexity (read: places for bugs to creep in) and taken time/effort away from more important things.


Yeah, this seems like a classic case of optimization for optimization's sake.


This is really cool and I love that people are using Wordle to explore and explain computer science concepts (to people like me).

> The speed-up is significant, about 20x faster. This makes sense because instead of scanning half of all words on average, you only need to scan half of the words with the same first letter. I’ll assume this isn’t exactly 13x due to letter distribution.

Extremely minor, but I think one would still expect a 26x speedup, right? You go from searching half of all words to half of 1/26th of all words. Obviously there are more words starting with S than X and all, but my question is just in theory.


Right. The effect of letter distribution, however, is that you're more likely to be searching a bucket with more words in it. So getting somewhat less than 26x speedup makes sense.
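
To put a number on it (a rough sketch; "words.txt" is a placeholder path): with first-letter buckets the expected scan length is the sum over letters of (n_l/n)·(n_l/2) = Σ n_l² / 2n, so the speedup over n/2 is n²/Σ n_l², which only reaches 26x when every bucket is the same size.

  # Sketch: estimate the first-letter-bucket speedup from a word list.
  from collections import Counter

  with open("words.txt") as f:  # placeholder: one five-letter word per line
      words = [line.strip() for line in f if line.strip()]

  counts = Counter(w[0] for w in words)
  n = len(words)
  # A word starting with letter l takes n_l/2 comparisons on average and is
  # queried with probability n_l/n, so the expected scan is sum(n_l^2)/(2n).
  speedup = n * n / sum(c * c for c in counts.values())
  print(f"expected speedup: {speedup:.1f}x (26x only with equal buckets)")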


Ah, you are absolutely right. I'll update the post. I think your logic for why you get less than 26x makes sense too.


I've been wondering about this in relation to a unique business system I've been working on... Hypothetically, if you have a situation where you are not allowed to delete any prior business data (i.e. an append-only event log), what would stop you from directly using prior log entries as the basis for some sort of incremental/recursive compression dictionary?

Put differently, is there any way to leverage scenarios where you have a practically-unlimited dictionary size and willingness to retain all data forever? Or, am I playing with some basic information theory equation the wrong way in my mind?


You can have an adaptive dictionary that favors recently seen entries. The dictionary would have to be rebuilt periodically in practice, but it has the advantage of being able to cope with an ever-changing ("non-stationary") data distribution. Sometimes the dictionary doesn't even need to be stored and can instead be computed as the data gets decompressed; this is such a big deal that many compression algorithms indeed omit a stored dictionary and build it adaptively instead.
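
As a concrete (if simplified) sketch of using prior entries as the dictionary: zlib lets both sides prime deflate with a preset dictionary, and here the tail of the already-stored log plays that role (the entry contents below are made up).

  # Sketch: prime zlib with recent, already-stored log entries. Both sides can
  # reconstruct the same dictionary because the log is append-only.
  import zlib

  prior_entries = b'{"event":"order_created","sku":"A1"}\n' * 50  # stand-in for old log data
  zdict = prior_entries[-32768:]  # deflate only uses the last 32 KB of a preset dictionary

  new_entry = b'{"event":"order_created","sku":"B2"}\n'

  comp = zlib.compressobj(level=9, zdict=zdict)
  packed = comp.compress(new_entry) + comp.flush()

  decomp = zlib.decompressobj(zdict=zdict)
  assert decomp.decompress(packed) == new_entry
  print(len(new_entry), "->", len(packed), "bytes")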


What you're thinking of is a journaling archiver. Something like zpaq.


This might be a good fit for roaring bitmaps [1] or minimal perfect hashing.

[1] http://roaringbitmap.org/
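
As a sketch of why the word list fits that shape (plain Python here, not an actual roaring library): each five-letter word maps to an integer below 26^5 ≈ 11.9M, so the whole list is just a sparse set of integers that a compressed bitmap could store.

  # Sketch: map 5-letter words to integers in [0, 26**5), the kind of sparse
  # integer set a roaring bitmap (or any compressed bitset) would hold.
  def encode(word):
      n = 0
      for ch in word:
          n = n * 26 + (ord(ch) - ord("a"))
      return n

  def decode(n):
      letters = []
      for _ in range(5):
          n, r = divmod(n, 26)
          letters.append(chr(ord("a") + r))
      return "".join(reversed(letters))

  assert decode(encode("crane")) == "crane"
  print(encode("aaaaa"), encode("zzzzz"), 26 ** 5)  # 0 11881375 11881376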


I don't have a full understanding of [1] or why it would compress a lot better. It would be interesting to take a look.

I did consider perfect hashing; however, IIUC it maps a fixed set of entries to sequential integers but won't do anything clever for items not in the set, so I couldn't find a way to make it work well for this problem. I did see someone talking about perfect bloom filters, but to be honest I didn't quite understand it: https://iswsa.acm.org/bloom.html


I wonder what the lower bound on the entropy is.


cmix version 19 [1] compressed the Wordle word list (using the golf.horse dataset [2] as the canonical version) into 9,714 bytes in about three minutes, using ~15 GB of memory. This obviously excludes the decompressor size; my golf.horse entry puts the information entropy at no more than 11.7 KB, though.

[1] https://github.com/byronknoll/cmix

[2] http://golf.horse/wordle/list


paq8px v206 [1] with option "-8art" compresses it to 9,145 bytes using ~2GB of memory in 82s, mostly due to a better pre-training stage.

[1] https://github.com/hxim/paq8px


Hmm. My own entropy-estimating code suggests a minimum entropy of 17443 bits with a 63-bit word size (I would need to change how I do things to search above 64 bits, which I would like to do and probably will one day). I used this list:

https://gist.github.com/cfreshman/a03ef2cba789d8cf00c08f767e...

Here is the code:

  % assumes fid is an open handle to the word list and nBits is the word size (63 here)
  X = fread(fid, sprintf('*ubit%d', nBits));  % read the file as nBits-wide symbols
  M = numel(X);
  words = double(unique(X))';
  binEdges = [words-0.5; words+0.5];          % one histogram bin per distinct symbol
  binEdges = unique(binEdges(:));
  P = histcounts(X, binEdges) / M;            % empirical symbol probabilities
  P(P == 0) = [];
  H = -sum(P .* log(P) ./ log(max([nBits, 2]))) / nBits * log2(max([nBits, 2]));  % entropy in bits per input bit
  E = H * M * nBits;                          % estimated total entropy in bits
This is a textbook entropy implementation and I've been under the impression that you'll never do better than this (for a given word size).


You calculated the sum of the entropies of each word, not the entropy of the entire word list as a whole (which would be much lower due to the cross-correlation).
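
One way to frame "the list as a whole" (a rough sketch; the path is a placeholder, and I'm treating the list as nothing more than an unordered set of distinct five-letter strings): counting it as a single subset of the 26^5 possibilities already gives a bound of log2(C(26^5, k)) bits, well below spelling every word out, before any letter-level structure is even modeled.

  # Sketch: information needed to name the list as an unordered subset of the
  # 26**5 possible five-letter strings, versus spelling each word out.
  from math import comb, log2

  with open("words.txt") as f:  # placeholder path to the word list
      k = sum(1 for line in f if line.strip())

  naive_bits = k * 5 * log2(26)         # each word written out independently
  subset_bits = log2(comb(26 ** 5, k))  # order and repetition carry no information
  print(f"{k} words: naive {naive_bits / 8:.0f} B, subset bound {subset_bits / 8:.0f} B")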



