
Branch prediction minutiae in LZ decoders - luu
http://mikejsavage.co.uk/blog/branch-prediction-minutiae-lz.html
======
terrelln
This is an important trick for LZ4's speed. But, there is a bit more to it
than this. Some important tricks are:

* LZ4 can represent a literal length of 14, and a match length of 18 in a single byte [0]. Most LZ4 sequences will have a literal length LL <= 14, and most LZ4 sequences will have a match length ML <= 18. Instead of copying LL and ML bytes, we can copy 14 bytes and 18 bytes. Then only if LL > 14 we copy extra bytes, and same for ML > 18.

* The case of overlapping matches (e.g. offset = 1, and ML = 100) needs to be handled specially, otherwise you will run into store forwarding problems.

* Always copy 16 or 32 bytes at a time and allow overwrites. Allowing overcopies simplifies the copy routine, improves branch prediction, and reduces store forwarding issues, at the expense of ensuring you always have enough slack at the end of your buffers.

[0]
[https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md](https://github.com/lz4/lz4/blob/dev/doc/lz4_Block_format.md)

~~~
giovannibajo1
Another big advantage is that the wire format is a byte stream and not a bit
stream, so parsing is so much faster.

------
ncmncm
The performance optimizations in modern CPUs are a deal with the Devil: They
make your programs go faster, but you have to do increasingly crazy things to
make them work.

~~~
oddity
Sadly, this is true generally in tech whenever an implementation evolves past
the abstraction. The alternative is that CPUs get faster, but only if all
programs are rewritten for the new architecture. Arguably, this has happened
already on a market-by-market basis, several times, with no end in sight :)

