
I wonder what the lower bound on the entropy is.



cmix version 19 [1] compressed the Wordle word list (using the golf.horse dataset [2] as the canonical version) to 9,714 bytes in about three minutes, using ~15 GB of memory. This obviously excludes the decompressor size; my golf.horse entry, though, corresponds to an information entropy of at most 11.7 KB.

[1] https://github.com/byronknoll/cmix

[2] http://golf.horse/wordle/list


paq8px v206 [1] with the option "-8art" compresses it to 9,145 bytes in 82 s using ~2 GB of memory; the improvement is mostly due to a better pre-training stage.

[1] https://github.com/hxim/paq8px


Hmm. My own entropy-estimating code suggests a minimum entropy of 17443 bits with a 63-bit word size (searching above 64 bits would require changing how I do things, which I would like to do and probably will one day). I used this list:

https://gist.github.com/cfreshman/a03ef2cba789d8cf00c08f767e...

Here is the code:

  % fid: open handle to the word list; nBits: word size (63 for the figure above).
  % Read the file as a stream of nBits-wide unsigned integers ("words").
  X = fread(fid, sprintf('*ubit%d', nBits));
  M = numel(X);
  % One histogram bin per distinct word value.
  words = double(unique(X))';
  binEdges = [words-0.5; words+0.5];
  binEdges = unique(binEdges(:));
  % Empirical symbol probabilities; drop empty bins.
  P = histcounts(X, binEdges) / M;
  P(P == 0) = [];
  % Order-0 entropy per input bit (the max() guards against log(1) when nBits = 1).
  H = -sum(P .* log(P) ./ log(max([nBits, 2]))) / nBits * log2(max([nBits, 2]));
  % Total entropy estimate in bits.
  E = H * M * nBits;

This is a textbook entropy implementation and I've been under the impression that you'll never do better than this (for a given word size).
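
As an aside, the H line is just a change of base in disguise: dividing by log(b) and multiplying by log2(b) leaves a factor of 1/log(2), which converts nats to bits, so for nBits >= 2 it reduces to -sum(P .* log2(P)) / nBits, i.e. the per-symbol Shannon entropy in bits divided by the word size. Quick sanity check (Octave/MATLAB, arbitrary toy distribution):

  nBits = 63;
  P = [0.5 0.25 0.125 0.125];                   % any toy distribution
  b = max([nBits, 2]);
  H1 = -sum(P .* log(P) ./ log(b)) / nBits * log2(b);
  H2 = -sum(P .* log2(P)) / nBits;              % simplified form
  disp(abs(H1 - H2) < 1e-12)                    % prints 1: the two agree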


You calculated the sum of the entropies of the individual words, not the entropy of the entire word list as a whole (which would be much lower due to the cross-correlation).
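
In other words, H(X_1, ..., X_n) <= H(X_1) + ... + H(X_n), with equality only when the words are independent. A sorted list of five-letter words is anything but independent (shared prefixes, a 26-letter alphabet, fixed length), so the joint entropy sits well below the order-0 figure, which is how cmix/paq8px get under it. A rough way to see the effect (Octave/MATLAB sketch; the filename is a placeholder, and this only conditions on one previous byte, while real compressors model much longer contexts):

  % Order-0 vs order-1 byte entropy of the list.
  fid = fopen('wordle_list.txt', 'rb');   % placeholder filename for the linked list
  X = double(fread(fid, '*uint8')) + 1;   % bytes as indices 1..256
  fclose(fid);
  M = numel(X);
  P0 = accumarray(X, 1, [256 1]) / M;
  H0 = -sum(P0(P0 > 0) .* log2(P0(P0 > 0)));              % bits/byte, order-0
  C  = accumarray([X(1:end-1), X(2:end)], 1, [256 256]);  % bigram counts
  Pj = C / sum(C(:));                                     % joint P(prev, next)
  Pc = C ./ max(sum(C, 2), 1);                            % conditional P(next | prev)
  mask = Pj > 0;
  H1 = -sum(Pj(mask) .* log2(Pc(mask)));                  % bits/byte, order-1
  fprintf('order-0: %.3f  order-1: %.3f bits/byte\n', H0, H1);

On a list like this the order-1 figure typically comes out well below the order-0 one, and longer contexts push it down further.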



