
Improved chess game compression (2018) - psuter
https://lichess.org/blog/Wqa7GiAAAOIpBLoY/developer-update-275-improved-game-compression
======
dzdt
This reminds me of something I learned many years ago about the world record
holder for computer optical character recognition (OCR) accuracy.

The computer scientists took as a target an eastern European chess journal
which printed move-by-move reports of tournament chess matches. They
incorporated a crude chess engine into the recognition step, estimating the
likelihood of next moves and combining that with the OCR engine's estimate of
the likelihood that the printed characters were particular glyphs. Despite the
very low quality of the printing, the journal had very high quality editing:
the source material was self-consistent. Completely illegible characters could
mostly be filled in as the sensible game moves that were allowed. It took
hundreds of hours of human review time to find a single OCR mistake from this
process!
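The trick generalizes: if a domain model can assign a prior to each candidate
reading, you can combine that prior with the recognizer's own confidence. A
toy sketch of that combination step, with made-up probabilities and helper
names (not Baird and Thompson's actual system):

```python
# Toy sketch: combine an OCR confidence distribution over candidate readings
# of a move with a chess model's prior over legal moves, and keep the jointly
# most probable reading. All numbers below are invented for illustration.

def best_reading(ocr_scores, chess_prior):
    """ocr_scores: {move_text: P(glyphs | move_text)} from the OCR engine.
    chess_prior: {move_text: P(move | position)} from a chess engine.
    Returns the reading maximizing the product of the two likelihoods."""
    candidates = set(ocr_scores) & set(chess_prior)  # only legal readings
    return max(candidates, key=lambda m: ocr_scores[m] * chess_prior[m])

# Smudged print: OCR alone slightly prefers "Nc3", but "Nf3" is far more
# plausible as a chess move from the current position.
ocr = {"Nc3": 0.40, "Nf3": 0.35, "Ne3": 0.25}
prior = {"Nc3": 0.10, "Nf3": 0.60, "Ne3": 0.001}
print(best_reading(ocr, prior))  # -> "Nf3"
```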

~~~
kaiabwpdjqn
> It took hundreds of hours of human review time to find a single OCR mistake
> from this process!

This stands out to me as improbable. Not in that the error rate could be that
low, but in that they actually had humans spend hundreds of hours checking the
accuracy of difficult character recognition. How did that happen?

~~~
dzdt
I searched out the article: "Reading Chess", 1990, HS Baird and Ken Thompson.
(Yes, that Ken Thompson).

[http://doc.cat-v.org/bell_labs/reading_chess/reading_chess.pdf](http://doc.cat-v.org/bell_labs/reading_chess/reading_chess.pdf)

It doesn't actually quantify the human proofreading time. I might have
recalled incorrectly; I heard about this in the late 1990s as a war story
from another OCR researcher.

~~~
PaulHoule
It's an embarrassing problem to have a system with "accuracy too high to
measure"!

------
willvarfar
Hmm, a lot of effort to make documents small, but then storing them in
MongoDB?!?!

If size and performance are a focus, just store them in a normal sorted table
with compression (e.g. LevelDB, or MySQL using RocksDB).

This means all these small documents can be compressed with repetition between
games and not just within each game.

And probably much much faster and simpler etc.

Basically, the size taken by the database should be at the same kind of level
as you get by just having a text file with one line per game, and gzipping the
whole thing. I'd expect it to be an order of magnitude smaller than
per-document compression.
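A quick way to see the gap (a toy benchmark, with synthetic games standing in
for real PGN movetext and zlib standing in for whatever the storage engine
would actually use):

```python
# Compress a corpus of games one record at a time, then as a single stream.
# Real game data shows the same effect: cross-record repetition is only
# exploitable when many games share one compression context.
import zlib, random

random.seed(0)
moves = ["e4", "e5", "Nf3", "Nc6", "Bb5", "a6", "Ba4", "Nf6", "O-O", "Be7"]
games = [" ".join(random.choices(moves, k=40)).encode() for _ in range(1000)]

per_game = sum(len(zlib.compress(g)) for g in games)
whole = len(zlib.compress(b"\n".join(games)))
print(per_game, whole)  # the single stream comes out several times smaller
```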

~~~
gowld
Games are quite small -- under 50 turns, each under 2 bytes in binary, but
more if you start with a simple text language before compression. You need to
compress each game individually. Would database-wide compression work?

~~~
willvarfar
You do not need to compress every game individually. Behind the scenes, your
mainstream database is storing the data in pages, and these pages - containing
many adjacent rows - can be compressed.

The compression level achieved at the page level is much higher than
compressing rows individually because there is lots of repetition between
rows. It also speeds up I/O, among other good things. It is almost always a
win, which is why most modern database storage engines do it, and do it by
default.

Consider a csv file, and compare it to the same data stored as json objects,
one row per line. The uncompressed json file is going to be much bigger, as
the columns are repeated in every line. But both files gzip to much the same
size, because all those keys are repeated again and again and the two files
have basically the same entropy.

On the other hand, compressing each line in the file individually would be a
poor choice, giving relatively poor gains.
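The CSV-versus-JSON-lines point is easy to check directly; a small sketch
with an arbitrary made-up schema:

```python
# Same data as CSV and as JSON lines: the JSON file is much bigger raw
# because the key names repeat on every line, but gzip finds that repetition
# and the compressed sizes end up close.
import gzip, json

rows = [{"white": f"player{i}", "black": f"player{i+1}", "result": "1-0"}
        for i in range(10_000)]
as_json = "\n".join(json.dumps(r) for r in rows).encode()
as_csv = "\n".join(f"{r['white']},{r['black']},{r['result']}" for r in rows).encode()

print(len(as_json), len(as_csv))                                # raw: JSON far larger
print(len(gzip.compress(as_json)), len(gzip.compress(as_csv)))  # gzipped: the gap mostly closes
```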

There were database engines that did row-level compression, but these
performed poorly, and I know of nobody who used e.g. InnoDB compression.

------
thom
I have always loved just how much of computer science you could learn by doing
nothing but working on chess your entire life.

~~~
lihaciudaniel
Usually the earliest computers were made so that they could play chess and
"crack" it

~~~
saagarjha
I thought they all did math for science or the military?

~~~
johannes1234321
... and pr0n

------
SethTro
The article doesn't say it anywhere but I sure hope all the games are backed
up to disk/tape in txt format and that this is just used to keep in-memory
DB/memcache size down.

Otherwise an interesting article about injecting domain knowledge to improve
beyond what's possible with a normal compression algorithm.

~~~
apetresc
Indeed, you can download this backup yourself if you'd like:
[https://database.lichess.org/](https://database.lichess.org/)

------
jameshart
The basis of this algorithm is to rank the possible moves from the current
position, then use that ranking to choose a Huffman encoding. In essence, they
use a very naive single-move-lookahead chess AI to quickly rank moves, giving
them a crude measure of how ‘surprising’ a particular move would be at that
point in the game.

Interesting question: If you just generated the bit string that corresponded
to taking the ‘most obvious move’ according to their heuristics, what game
would that play out? In a way, that would be the ‘most obvious chess game’, or
perhaps the ‘least creative chess game’...
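A sketch of playing that out, assuming the python-chess library and a toy
ranking heuristic (a stand-in, not necessarily the article's actual scoring):

```python
# Decode the all-zeros bit string: at every position, play whatever move the
# ranking heuristic puts first. The heuristic here (prefer captures, then
# checks, then higher destination squares) is deliberately crude.
import chess

def rank_moves(board):
    def score(move):
        return (board.is_capture(move), board.gives_check(move), move.to_square)
    return sorted(board.legal_moves, key=score, reverse=True)

board = chess.Board()
while not board.is_game_over() and board.fullmove_number <= 40:
    board.push(rank_moves(board)[0])  # always take the top-ranked move
print(board)
```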

In theory, a better version would use a more sophisticated AI to rank the
next moves, and even choose how aggressively to Huffman-code the possible
options.

In a sense, this would measure the information content of a chess game (at
least relative to the ‘knowledge’ of that AI). I wonder what the variance in
the number of bits needed to encode various games would tell you about the
play styles of different players, or the nature of the game, or even the
relationship between information theory and creativity....

~~~
boarquantile
The most boring chess game according to Lichess move compression:
[https://lichess.org/study/5jQeJXXb](https://lichess.org/study/5jQeJXXb)

~~~
linkdd
damn, that's some "obvious" moves right there

------
jhrmnn
This nicely illustrates the point of the Hutter Prize, that efficient
compression comes from understanding the content on a nontrivial level.

[https://en.wikipedia.org/wiki/Hutter_Prize](https://en.wikipedia.org/wiki/Hutter_Prize)

------
6510
The too-slow approach is fun. If you freeze a reasonable engine version and
use it, a bit string could represent which moves match the engine's move
(let's call it block 1). The non-matching moves could be 3-bit numbers: 000
for the second-best engine move, 001 for 3rd, 010 for 4th, 011 for 5th, 100
for 6th, 101 for 7th, 110 for 8th, and 111 for other moves (block 2). The
other moves are simply written out, like E4 or NF3G5 (block 3).
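A rough sketch of that three-block encoding, with the frozen engine abstracted
as a hypothetical `ranked_moves` oracle (any deterministic engine pinned to a
fixed version would do):

```python
# Encode a game as three blocks: a match/no-match bit per move, a 3-bit rank
# for near-misses, and plain text for everything else.
def encode(game_moves, positions, ranked_moves):
    block1, block2, block3 = [], [], []
    for move, pos in zip(game_moves, positions):
        ranking = ranked_moves(pos)
        if move == ranking[0]:
            block1.append("1")                       # matched the engine's top choice
        else:
            block1.append("0")
            rank = ranking.index(move) if move in ranking[:8] else -1
            if 1 <= rank <= 7:
                block2.append(format(rank - 1, "03b"))  # 000 = 2nd best ... 110 = 8th
            else:
                block2.append("111")                 # escape: spell it out in block 3
                block3.append(move)
    return "".join(block1), "".join(block2), block3

# Toy oracle: pretend the engine ranks moves alphabetically.
demo_rank = lambda pos: sorted(pos)
print(encode(["a4", "c5"], [["a4", "b4"], ["b5", "c5", "d5"]], demo_rank))
# -> ('10', '000', [])
```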

~~~
dmurray
For optimum compression with this approach, you'd use an engine that generated
the most likely moves (perhaps given the time control and the players'
ratings, since Lichess stores those anyway), not the strongest moves. That
might not look much like a regular chess engine.

------
steerablesafe
As I understand it, the same Huffman code is used for all moves, including the
opening moves. Alternatively, statistics could be gathered for all first
moves, all second moves, etc., and then a different Huffman code could be
applied to the opening moves.

I wouldn't be surprised if the statistics for the first few moves were
significantly different from the moves deep into the game.
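Gathering the per-ply statistics is straightforward; a minimal sketch with a
toy sample of games, from which a separate Huffman table per ply could then be
built:

```python
# Count move frequencies separately for each ply; each counter would seed its
# own Huffman code. The `games` sample here is a tiny made-up stand-in.
from collections import Counter

games = [["e4", "e5", "Nf3"], ["e4", "c5", "Nf3"], ["d4", "d5", "c4"]]
MAX_PLY = 10  # dedicated tables for the first ten plies, say
per_ply = [Counter() for _ in range(MAX_PLY)]
for game in games:
    for ply, move in enumerate(game[:MAX_PLY]):
        per_ply[ply][move] += 1

print(per_ply[0])  # Counter({'e4': 2, 'd4': 1}); ply 0 has its own distribution
```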

~~~
jameshart
Probably worth just having short Huffman codes specifically to encode
particular opening sequences, before falling back to per-move coding.

------
gowld
> Our database runs on 3 servers (1 primary, 2 secondaries), each with two
> 480GB SSDs in a RAID 1 configuration

Algorithmically cool, but quite a lot of work to save <$500/yr in hardware for
a handful of 1TB SSDs.

Converting the data to the new format cost more than upgrading disks.

~~~
hinkley
In the conclusion they mention the rate of growth as well.

If the rate of growth in disk usage is at or below the rate of growth in
mid-tier SSDs, then yes, it's $500/yr. If you are growing faster than that,
then an improvement might be saving you $350/yr + $150/yr², and a wall in the
future might be pushed out for years.

With your cloud provider, it might be hard to get more disk, memory, or CPU
without paying for more of the other two. Also, many real organizations and
vendor agreements are full of the most stupid artificial constraints. Just
because someone else can build a 1PB SSD storage array doesn't mean that any
of my coworkers will be allowed to build one any time soon. CAPEX austerity
measures are among the most frustrating penny-wise pound-foolish policies we
have to deal with.

------
SamReidHughes
Nice. A touchy balance between overengineering and too much overengineering.
Maybe you could better compress combinations of moves, like exchanges.

If you also store move times, there’s not much of a win to this.

------
bumbledraven
The article includes a brief but clear explanation of Huffman coding. I never
understood how it works until now!
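For anyone else in the same boat, the construction is short enough to sketch
in full; a minimal, generic Huffman builder (not the article's chess-specific
tables):

```python
# Build a Huffman code by repeatedly merging the two lightest subtrees:
# frequent symbols end up near the root with short codes, rare ones deeper
# with long codes.
import heapq
from itertools import count

def huffman(freqs):
    """freqs: {symbol: weight} -> {symbol: bit string}."""
    tie = count()  # tiebreaker so the heap never has to compare dicts
    heap = [(weight, next(tie), {sym: ""}) for sym, weight in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        w0, _, zero = heapq.heappop(heap)  # lightest subtree gets prefix 0
        w1, _, one = heapq.heappop(heap)   # next lightest gets prefix 1
        merged = {s: "0" + c for s, c in zero.items()}
        merged.update({s: "1" + c for s, c in one.items()})
        heapq.heappush(heap, (w0 + w1, next(tie), merged))
    return heap[0][2]

# The most frequent "move" gets the shortest code:
print(huffman({"best": 50, "2nd": 25, "3rd": 15, "other": 10}))
```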

