
Zip Files: History, Explanation and Implementation - Breadmaker
https://www.hanshq.net/zip.html
======
RandallBrown
In college we had a project to write a compression/decompression utility.

We had to implement all the data structures from scratch and implement several
compression schemes.

I got the simplest one, run-length encoding, working, but I couldn't get the
fancier LZ77 stuff to work correctly, mostly because I had trouble
implementing some of the required data structures from scratch in C++.

The professor had a few of the better students present their programs and one
of them demonstrated that not only could they make smaller files than WinZip,
they did it _faster_ too.

It absolutely blew my mind and that's probably the day I realized I needed to
really step up my game and put in the work if I wanted to be successful in
this field.

~~~
userbinator
If you're not too worried about speed or memory consumption, a simple brute-
force LZ77 compressor is, to a first approximation, two nested loops: one to
find a match, and one to extend it. This will also compress better than the
hash-table-based approach discussed in the article, since it always finds the
_best_ match, although at the expense of speed. In comparison, RLE is a
single loop (to count repeated bytes).
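
Concretely, the two loops might look like this in C (a minimal sketch; the
function name and signature are mine, and a real Deflate compressor would
also cap the match length at 258 and the distance at 32 KiB):

    #include <stddef.h>

    /* Brute-force LZ77 match finder: for each earlier position, extend a
       match as far as it goes and remember the longest one. O(n) work per
       position, so O(n^2) overall. */
    static size_t find_match(const unsigned char *data, size_t len,
                             size_t pos, size_t *match_pos)
    {
        size_t best_len = 0;

        for (size_t i = 0; i < pos; i++) {       /* loop 1: try each start */
            size_t j = 0;
            while (pos + j < len && data[i + j] == data[pos + j])
                j++;                             /* loop 2: extend the match */
            if (j > best_len) {
                best_len = j;
                *match_pos = i;  /* match may run past pos; overlapping
                                    matches are fine in LZ77 */
            }
        }
        return best_len;                         /* 0 means emit a literal */
    }

Emitting a (distance, length) pair whenever the returned length is long
enough to be worth coding, and a literal otherwise, is essentially the whole
compressor.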

~~~
nauful
Do you mean following the hash chain to the end of the window?

If you want an exhaustive LZ match finder, a trie, a binary tree such as
LZMA's bt4, or a suffix array is a more efficient algorithm, especially with
window sizes larger than 32 KB, and there are better match finders still for
windows past 64 MB, such as Rabin-Karp for long matches.

Zip is an inefficient format on modern out-of-order CPUs. Modern
replacements such as zstd store literals, lengths and decisions in separate
blocks, so an entropy-coded table can be decoded all at once rather than
with a conditional branch per symbol.

Also, beating zip compression is quite easy: a larger window size plus a
relatively quick hash chain (one that tests only a small number of slots),
or multiple hash chains (e.g. on 3-, 4- and 7-byte prefixes), can find
better matches.
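
For example, a sketch of such a hash-chain match finder in C (all the
constants here, the window size, hash width and number of tested slots, are
illustrative choices of mine, not anything the zip format requires):

    #include <stdint.h>
    #include <string.h>

    /* Hash-chain match finder sketch. Positions are hashed on their next 4
       bytes; head[] holds the most recent position per hash and prev[]
       chains back to earlier positions with the same hash. For simplicity
       this assumes the input is at most WINDOW_SIZE bytes, so prev[] is
       indexed by position directly; a real implementation slides the
       window. */
    #define WINDOW_SIZE (1u << 20)   /* 1 MiB, vs. Deflate's 32 KiB window */
    #define HASH_BITS   16
    #define HASH_SIZE   (1u << HASH_BITS)
    #define MAX_TESTS   32           /* "relatively quick": few slots tested */
    #define NO_POS      UINT32_MAX

    static uint32_t head[HASH_SIZE];
    static uint32_t prev[WINDOW_SIZE];

    static void chain_init(void)
    {
        for (uint32_t i = 0; i < HASH_SIZE; i++)
            head[i] = NO_POS;
    }

    static uint32_t hash4(const unsigned char *p)
    {
        uint32_t v;
        memcpy(&v, p, 4);            /* caller ensures 4 bytes remain */
        return (v * 2654435761u) >> (32 - HASH_BITS);
    }

    /* Find the best match for pos, testing at most MAX_TESTS chain entries,
       then insert pos into the chain. Returns the match length (0 = none). */
    static size_t find_match(const unsigned char *data, size_t len,
                             uint32_t pos, uint32_t *match_pos)
    {
        uint32_t h = hash4(&data[pos]);
        uint32_t cand = head[h];
        size_t best_len = 0;

        for (int tests = 0; cand != NO_POS && tests < MAX_TESTS; tests++) {
            size_t j = 0;
            while (pos + j < len && data[cand + j] == data[pos + j])
                j++;
            if (j > best_len) {
                best_len = j;
                *match_pos = cand;
            }
            cand = prev[cand];
        }
        prev[pos] = head[h];         /* link pos into the chain */
        head[h] = pos;
        return best_len;
    }

The multiple-chain variant keeps separate head[]/prev[] tables hashed on
3-, 4- and 7-byte prefixes and takes the best match found across them.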

If decompression speed is not an issue, and you aren't limited to LZ-based
approaches, a bytewise model such as PPM or a bitwise one such as PAQ will
give much better compression without too much thinking.

Edit: forgot to mention, the parsing strategy is also important. Finding an
approximately optimal (lowest-cost) path through a block can often improve
the ratio by 10% or more with LZ + Huffman.
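
The lowest-cost-path search is a shortest-path pass over byte positions:
each position has one "literal" edge and one "match" edge per feasible
length, relaxed in order. A rough sketch (LIT_COST and match_cost() are
placeholders for real Huffman code lengths, which in practice come from a
previous pass or iterative refinement; find_match() is any match finder):

    #include <stddef.h>
    #include <stdint.h>

    #define LIT_COST 9   /* placeholder: roughly 9 bits per literal */

    /* Placeholder cost of coding a (length, distance) pair; the real code
       lengths come from the Huffman tables. */
    static uint32_t match_cost(size_t len, size_t dist)
    {
        (void)len; (void)dist;
        return 20;
    }

    extern size_t find_match(const unsigned char *data, size_t len,
                             size_t pos, size_t *match_pos);

    /* Fill cost[0..n] with the cheapest bit cost of coding the first i
       bytes, and from[i] with where the cheapest edge into i starts.
       Walking from[] backward from n recovers the chosen parse. Both
       arrays have n+1 entries. */
    static void optimal_parse(const unsigned char *data, size_t n,
                              uint32_t cost[], size_t from[])
    {
        cost[0] = 0;
        for (size_t i = 1; i <= n; i++)
            cost[i] = UINT32_MAX;

        for (size_t i = 0; i < n; i++) {
            /* Edge 1: code data[i] as a literal. */
            if (cost[i] + LIT_COST < cost[i + 1]) {
                cost[i + 1] = cost[i] + LIT_COST;
                from[i + 1] = i;
            }
            /* Edge 2: code a match of each feasible length (3 is
               Deflate's minimum match length). */
            size_t mpos;
            size_t mlen = find_match(data, n, i, &mpos);
            for (size_t l = 3; l <= mlen; l++) {
                uint32_t c = cost[i] + match_cost(l, i - mpos);
                if (c < cost[i + l]) {
                    cost[i + l] = c;
                    from[i + l] = i;
                }
            }
        }
    }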

------
jaclaz
As a side note, one of the "tightest" ZIP-compatible programs around, AFAIK,
is Ken Silverman's KZIP:

[http://advsys.net/ken/utils.htm](http://advsys.net/ken/utils.htm)

Also worth noting is ZIPMIX, which is very useful in a few cases.

------
okareaman
As a recovering alcoholic, I've always looked at the story of Phil Katz as a
cautionary tale. He made enough money from people registering PKZIP for
disks and manuals that he was able to succumb to his tendency to socially
isolate and drink to excess. He drank himself to death.

~~~
mwest
I remember the name Phil Katz from "PKZIP" back in the BBS days. I had no idea
he had such an unhappy story!

The Wikipedia page covers it briefly:
[https://en.wikipedia.org/wiki/Phil_Katz](https://en.wikipedia.org/wiki/Phil_Katz)

~~~
kawsper
I can recommend watching "BBS: The Documentary"; they cover the ARC wars as
well. There's also this document:
[http://www.bbsdocumentary.com/library/CONTROVERSY/LAWSUITS/S...](http://www.bbsdocumentary.com/library/CONTROVERSY/LAWSUITS/SEA/katzbio.txt)

