
How are zlib, gzip and Zip related? (2013) - yurisagalov
http://stackoverflow.com/questions/20762094/how-are-zlib-gzip-and-zip-related-what-do-they-have-in-common-and-how-are-they/20765054#20765054
======
lpage
In other compression news, Apple open sourced their implementation of lzfse
yesterday: [https://github.com/lzfse/lzfse](https://github.com/lzfse/lzfse).
It's based on a relatively new type of coding - asymmetric numeral systems.
Huffman coding is only optimal if you consider one bit as the smallest unit of
information. ANS (and more broadly, arithmetic coding) allows for fractional
bits and gets closer to the Shannon limit. It's also simpler to implement than
(real world) Huffman.
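
To make the "fractional bits" point concrete, here's a toy calculation (just standard Shannon entropy, nothing specific to Apple's code): for a heavily skewed source, the entropy is a small fraction of a bit per symbol, but a Huffman code can never spend less than one whole bit per symbol.

```python
import math

# Toy two-symbol source: the common symbol carries far less than one bit
# of information, but a prefix (Huffman) code must still spend >= 1 bit on it.
probs = {"a": 0.99, "b": 0.01}

# Shannon entropy: the theoretical lower bound in bits per symbol.
entropy = -sum(p * math.log2(p) for p in probs.values())

print(f"entropy       ~ {entropy:.3f} bits/symbol")  # ~0.081
print("huffman floor = 1.000 bits/symbol")           # every codeword >= 1 bit
```

ANS and arithmetic coders can get arbitrarily close to that ~0.081 figure; Huffman coding over single symbols cannot.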

Unfortunately, most open source implementations of ANS are not highly
optimized and are quite division-heavy, so they lag on speed benchmarks. Apple's
implementation looks pretty good (they're using it in OS X, err, macOS, and
iOS) and there's some promising academic work being done on better
implementations (optimizing Huffman for x86, ARM, and FPGA is a pretty well
studied problem). The compression story is still being written.

~~~
mayoff
Strange that Apple didn't publish it on
[https://github.com/apple](https://github.com/apple) (which holds all of their
open Swift repos).

Also, to clarify, ANS is relatively new (2009) but arithmetic coding has been
around for a long time. Historically it was avoided because of patents, many
(all?) of which have now expired. Apparently ANS isn't going to be patented.

~~~
bch
Question not about ANS, but patenting: Wouldn't it make more sense for Apple
(or anybody) to actually patent it, and keep those patents, but make it open
via licensing? If there's patentable material and $developer passes on filing
for it, doesn't that leave it open for poaching by $third_party?

~~~
riffraff
The patent is invalid if prior art exists, so publishing the work is enough to
avoid it being poached by others.

~~~
derekp7
But doesn't the patent office only look at previous patents when searching for
prior art? Of course, other sources of prior art can be brought in as a
defense, but it still may not stop someone from getting a patent initially and
using it to harass financially weak opponents.

~~~
thwarted
Lack of patent examiner familiarity with the industry, history, and the state
of the art is traditionally considered a weak point in the system as a whole.
When writing the patent claims you're supposed to do a search to make sure
you're not conflicting with other patents and that there is no prior art, but
this search is often used instead to word the patent in a way that keeps the
prior art from invalidating it, or to present differences from the prior art
as support for the uniqueness of the patent being applied for. Unfortunately,
patent examiners are traditionally ill-equipped to accurately make that
assessment, and the patent applicant is allowed to make adjustments to address
the issues the examiner found. I remember there was work on getting patent
examiners to be more familiar with the areas the patents they are
examining/approving cover (technology, software, or business, for example),
but I don't know where that stands today.

However, IANAPL and IANAPE.

~~~
leeoniya
[http://patents.stackexchange.com/](http://patents.stackexchange.com/) helps
with this

------
chickenbane
Not only is this a great read, but the follow-up asking for citations is
answered with "I am the reference".

If this were reddit I'd post the hot fire gif. Eh, here it is anyway:
[http://i.imgur.com/VQLGJOL.gif](http://i.imgur.com/VQLGJOL.gif)

~~~
dragontamer
I thought he was joking.

Then I read the name: Adler: as in Adler-32 (the checksum function that zlib
uses).

Then I knew it was real. The darn author of zlib answered the question. That's
as close as you're gonna get to a "primary source", folks!

~~~
phkahler
>> That's as close as you're gonna get to "primary source" folks!

Yes it is. Whenever I think of PKZIP I am reminded of the sad tale of Phil
Katz's death - partly because I share his first and last initials, and now,
reading his Wikipedia page, I see we had more in common. Fortunately,
alcoholism is not my thing.

[https://en.wikipedia.org/wiki/Phil_Katz](https://en.wikipedia.org/wiki/Phil_Katz)

~~~
allenu
I remember reading that the creators of the ARC format sued Phil Katz for
essentially taking their original source code and rebranding it PKARC (with
assembly optimizations). They claim [1] that Katz did essentially the same
thing with PKZIP.

Can anyone verify these claims? I always thought it was an injustice that the
original authors were treated poorly by the BBS community and regarded as a
big company despite just running things from home.

[1]
[http://www.esva.net/~thom/philkatz.html](http://www.esva.net/~thom/philkatz.html)

~~~
kimmel
This is covered in the BBS Documentary[1]. A comparison of Phil Katz's work
showed he had just renamed variables and moved things around. Phil rallied the
BBS community to assassinate Thom's character. The documentary was shot years
later, and Thom still breaks down and cries when talking about what happened.
It is very sad.

[1] [http://bbsdocumentary.com/](http://bbsdocumentary.com/)

~~~
WoodenChair
Crazy how that documentary appears not to be for sale in any form, while DVDs
on Amazon go for $3xx. Am I missing something?

~~~
rsneekes
The page says: "Enhanced Digital Downloads of the BBS Documentary series are
planned. Please sign up here: sales@bbsdocumentary.com if you wish to be
notified."

But it appears to be on YouTube:
[https://www.youtube.com/playlist?list=PL2B9EF89CE228ED0A](https://www.youtube.com/playlist?list=PL2B9EF89CE228ED0A)

------
sikhnerd
It's annoyingly common: the OP doesn't mark this answer as accepted, or even
acknowledge how amazing it is to get an answer from one of the technology's
creators -- instead they just go on to ask a followup.

~~~
mixmastamyk
The accepted answer is often wrong (or obsolete) and grants a small number of
points. The obsession with it on SO is a bit odd.

~~~
ghayes
Yeah, I tend to think that if another answer has double the points of the
accepted answer, that answer should come first.

~~~
dexterdog
Plus, often a decent answer is given quickly and accepted, and then a much
more complete answer comes along later that is really the one that should
stand in the archive.

------
kbenson
It seems like it wouldn't be that hard to create an indexed tar.gz format
that's backwards compatible.

One way would be to use the last file in the tar as the index. As files are
added, you remove the index, append the new file, add some basic file metadata
and the compressed offset (maybe of the deflate chunk) to the index, update
the index size in bytes in a small footer at the end of the index, and then
append the index back onto the compressed tar.

You can retrieve the index by starting at the end of the compressed archive
and reading backwards until you find a deflate header (at most 65k plus a few
more bytes, since that's the size of a deflate chunk). If it's an indexed tar,
the last file will be the index, and the end of the index will be a footer
with the index size (so you know the maximum you'll need to seek back from the
end). This isn't extremely efficient, but it is limited in scope, and helped
by knowing the index size.

You could verify the index by checking some or all of the reported file byte
offsets. The worst case scenario is small files with one or more per deflate
chunk, where you would have to visit each chunk. That makes the worst case
equivalent to listing the files of an un-indexed tar.gz, plus the overhead of
locating and reading the index (relatively small).

Uncompressing the archive as a regular tar.gz would work as normal, just with
an additional file (the index) included.
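
To make that a bit more concrete, here's a rough sketch of what the index-plus-footer layout could look like. The record format, field names and footer marker are all made up for illustration; nothing here is an existing format.

```python
import struct

# Hypothetical layout for the index file described above. Each entry stores
# the compressed offset of a member (e.g. of its deflate chunk), the member
# size, and its name. The footer records the index size in bytes so a reader
# can seek straight back to the start of the index from the end.
FOOTER_MAGIC = b"TIDX"            # made-up marker, not any real format
FOOTER = struct.Struct("<4sQ")    # marker + index size in bytes

def pack_index(entries):
    """entries: iterable of (compressed_offset, member_size, name) tuples."""
    blob = b""
    for offset, size, name in entries:
        raw = name.encode("utf-8")
        blob += struct.pack("<QQH", offset, size, len(raw)) + raw
    return blob + FOOTER.pack(FOOTER_MAGIC, len(blob))

def read_index_size(tail):
    """tail: the last FOOTER.size bytes of the index member."""
    magic, index_size = FOOTER.unpack(tail)
    if magic != FOOTER_MAGIC:
        raise ValueError("not an indexed archive")
    return index_size
```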

I imagine the reason this isn't popular is not that it hasn't been done, but
that most people don't really need an index.

~~~
TimMeade
If I recall correctly (and it's been a while): the initial header is x bytes
from the beginning of the file, or you have to search for a known key string,
PKsomething. Then you have to go back and forth between that and the
compressed data as you decompress. When compressing files you have to go back
to the beginning of the zip file (or disk 1 in a multi-disk archive) and
update that table with the CRC info. It just was not practical for streaming,
for multiple reasons. Remember, the original zip file format was created by
Phil in about 1987. I believe they just thought it easier to start over with a
better design for gzip, with the same compression algorithm.

~~~
dexterdog
That can't be the case anymore, because I implemented a streaming function a
few years back that sends an unlimited number of files as a zip, but builds
the zip on the fly, streaming it to the client. Maybe it works because I am
not using compression (the files in my case are all JPG and/or compressed
video, so there is little benefit). But my process definitely starts streaming
the zip right away and has to pull URLs on the fly to create the zip file.

~~~
TimMeade
This cannot be a 'true' zip file. The central directory is at the end of the
zip file and is made up of information that is needed for decompression. So if
you are streaming it, you cannot decompress until the entire file is
reassembled.

~~~
pixelglow
You can definitely compose a zip file "on the fly" and stream it out. The
central directory at the end of the zip file can be determined from all the
content already streamed.

The only wrinkle is that each entry has a header which typically states the
compressed size and checksum. Either you compress each entry's content into
some temp buffer or file to figure out the compressed size and checksum, then
write out the header and compressed content; or you write out the header with
these fields zeroed, then compress the content on the fly and write it out,
then write out a data descriptor with the compressed size and checksum.
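
A minimal sketch of that second approach (streaming one entry followed by a data descriptor), based on the published ZIP layout. It only writes a single local entry; a real archive still needs the central directory written at the end, and details like timestamps are left as zeros here:

```python
import struct
import zlib

def stream_one_entry(out, name, chunks):
    """Write one ZIP entry in streaming mode: local header with bit 3 set,
    raw-deflated data, then a data descriptor carrying the CRC and sizes."""
    raw_name = name.encode("utf-8")
    flags = 0x0008                       # bit 3: sizes/CRC follow the data
    # Local file header (30 bytes): CRC and sizes are zero, we don't know them yet.
    out.write(struct.pack("<IHHHHHIIIHH",
                          0x04034b50,    # local file header signature
                          20, flags, 8,  # version needed, flags, method = deflate
                          0, 0,          # mod time, mod date (left zero here)
                          0, 0, 0,       # crc-32, compressed size, uncompressed size
                          len(raw_name), 0))
    out.write(raw_name)

    crc = usize = csize = 0
    comp = zlib.compressobj(9, zlib.DEFLATED, -15)   # raw deflate stream
    for chunk in chunks:                 # content arrives on the fly
        crc = zlib.crc32(chunk, crc)
        usize += len(chunk)
        data = comp.compress(chunk)
        csize += len(data)
        out.write(data)
    data = comp.flush()
    csize += len(data)
    out.write(data)

    # Data descriptor: now the CRC and both sizes are known.
    out.write(struct.pack("<IIII", 0x08074b50, crc, csize, usize))
```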

~~~
derefr
What the parent is saying is that the client has to buffer the entire zip file
before they receive the index and can begin decompressing it. "Streaming
compression" usually implies streaming _de_compression as well, importantly
because a non-negligible use case for streaming compression is sending
compressed data that won't fit on the destination alongside its decompressed
copy, or sending _continuous_ streams of compressed data that never reach an
end at which they could be decompressed. Zip cannot work for these use cases.

------
rdslw
Worth reading: the answer about the coolest kid on the block, the xz
compression format (of LZMA fame), plus the tar.gz vs tar.xz
scenarios/discussion.

[http://stackoverflow.com/questions/6493270/why-is-tar-gz-still-much-more-common-than-tar-xz](http://stackoverflow.com/questions/6493270/why-is-tar-gz-still-much-more-common-than-tar-xz)

~~~
ygra
Did you intend to link the answer, maybe?

------
winterismute
My father teaching me to type PKUNZIP on files that "ended with .zip" in the
DOS shell (not long before the Norton Commander kind of GUI arrived on our
computer) is one of my earliest memories as a toddler: I would ask him "What
does it mean?" and he would simply not know. It was 1990 and I was 3 and a
half, I think. When I learned what it stood for, it was kind of epic for me.

------
digi_owl
The last few days I find myself wondering if there needs to be some kind of
org set up to preserve this sort of info.

Right now it seems to be strewn across a myriad of blogs, forums and whatnot
that risk going poof. And even if the Internet Archive picks them up, it is
anything but curated (unlike, say, Wikipedia, even with all its warts).

~~~
greglindahl
The Internet Archive has both curated and non-curated collections. Right now
the Wayback Machine doesn't have any curation, but that doesn't mean it's
going to stay that way.

However, in this situation, the collaboration between Wikipedia and the
Wayback may be what you had in mind. We're working with Wikipedia to make sure
all external links from Wikipedia articles are backed up in the Wayback
Machine, and there's a Wikipedia bot going around adding these links to
articles.

~~~
digi_owl
Not quite.

How easy would it be to find the linked-to stackoverflow answer in the archive
if stackoverflow were ever to vanish from the net one day?

There was also a Wikipedia article linked on HN recently about a certain
mainframe terminal, and a spreadsheet program that made use of a special
capability of that terminal. Said article was at risk of being deleted from
Wikipedia because they deemed it "original research".

Again, how easily could one find such an article within the archive?

~~~
greglindahl
Ah, now you want to talk about discovery! It so happens that I'm building a
search engine for the Wayback Machine. But it's true that discovery for it is
never going to be that great until the search engines that people commonly use
index the Wayback, which will probably never happen.

~~~
digi_owl
I dunno why you seem to treat those as two separate issues.

The way I see it, one feeds into the other. Wikipedia and other wikis have a
reputation for being time/attention sinks, because you can follow one article
to another to another.

Search doesn't do that. It may or may not barf up what you are looking for if
fed the right terms, but not much beyond that.

Maybe what I have in mind is something akin to Wikiquote, but for topics
rather than persons. So that this posting by Adler can be filed under zip,
gzip, zlib and whatnot, and people can find it to go alongside the
"encyclopedic" description of any of those technologies.

~~~
greglindahl
I don't think they're separate issues, and I totally agree that they feed into
each other. I'm just having a hard time figuring out what this HN discussion
is about.

------
hardwaresofton
It is rare to be able to have a question answered so completely and from such
a first-hand source. This post is gold and tickles me in all the right places.

StackOverflow is sitting on a veritable treasure trove of knowledge.

------
lunchables
Reminds me of the very sad zip story:

[https://www.youtube.com/watch?v=_zvFeHtcxuA](https://www.youtube.com/watch?v=_zvFeHtcxuA)

The whole "The BBS Documentary" is great and I recommend starting at the
beginning if you're interested in it.

[https://www.youtube.com/watch?v=dRap7uw9iWI](https://www.youtube.com/watch?v=dRap7uw9iWI)

------
404-universe
Where do the other popular compression utilities (e.g. bzip2, xzip, lzma,
7zip) fit into this?

~~~
gopalv
The big picture is that there are a few basic compression algorithms:

Run Length Encoding

Huffman Encoding

Lempel Ziv (LZ77)

Burrows Wheeler Transform

The interesting part of each specific implementation is its own specialized
way of extending or combining these algorithms for its own niche (or tradeoff
point).

Bzip2, for example, compresses really well because BWT is expensive but
clever; rzip extends the LZ part of bzip2 to look further across the file
(instead of a few hundred KB).

Zlib itself exposes enough of these knobs - Z_FILTERED, Z_HUFFMAN_ONLY,
Z_FIXED, Z_RLE, etc. - that not all zlib output is really the same. Look at
something like Zopfli to see how they can be remixed, once the CPU tradeoffs
change from zlib's historic position.
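
For instance, those strategy flags can be poked at through Python's zlib binding. This is just a quick comparison harness; whether Z_RLE and Z_FIXED are exposed as named constants depends on the Python build, so the raw zlib.h values are used as a fallback:

```python
import zlib

# Repetitive sample input so the different strategies actually diverge.
data = (b"aaaabbbbccccdddd" * 512) + bytes(range(256)) * 64

strategies = {
    "default":      zlib.Z_DEFAULT_STRATEGY,
    "filtered":     zlib.Z_FILTERED,             # bias toward Huffman over matches
    "huffman_only": zlib.Z_HUFFMAN_ONLY,         # no LZ77 matching at all
    "rle":          getattr(zlib, "Z_RLE", 3),   # matches limited to run lengths
    "fixed":        getattr(zlib, "Z_FIXED", 4), # no dynamic Huffman tables
}

for name, strategy in strategies.items():
    # compressobj(level, method, wbits, memLevel, strategy)
    c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 8, strategy)
    size = len(c.compress(data) + c.flush())
    print(f"{name:13s} {size} bytes")
```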

~~~
eln1
Besides Huffman coding, there is also arithmetic/range coding, and ANS coding
is now becoming widely used - e.g. in LZFSE (Apple), ZSTD (Facebook), VP10
(Google):
[http://encode.ru/threads/2078-List-of-Asymmetric-Numeral-Systems-implementations](http://encode.ru/threads/2078-List-of-Asymmetric-Numeral-Systems-implementations)

------
the_common_man
One important difference in practice is that zip files need to be saved to
disk to be extracted. gzip files, on the other hand, can be stream-unzipped,
i.e. curl [http://example.com/foo.tar.gz](http://example.com/foo.tar.gz) | tar
zxvf - is possible, but not with zip files. I am not sure if this is a
limitation of the unzip tool. I would love to know if there is a workaround
for this.

~~~
bluedino
You can't do that with a .zip file because the file header information is
actually at the end of the file. You could work around that by sending the
header first, though.

~~~
greggman
You can't put the header first. It is an intentional feature of .zip that the
header is at the end, so you can update a large zip file by just appending a
new header to the end. That way the entire file does not have to be rewritten:
just read the old header, append new files, append a new header. This was
important back in the floppy disk days.

~~~
bluedino
>> Just read the old header, append new files, append new header. This was
important back in floppy disk days

Don't forget 'overwrite old header'

Important for floppy disks in two ways: one, because of space constraints, and
two, because of how slow floppies were.

~~~
joosters
_Don't forget 'overwrite old header'_

Why? The new one will become the 'real' one since it will be at the end of the
modified file. So it doesn't really matter if you delete the old one. If
you're really trying to squeeze file sizes down, you could reference the old
'header' from the new one, so that you do not have to list the entire
archive's contents again.

~~~
spc476
Having written code to extract files from a ZIP file, it's because the header
is variable sized, anywhere from 22 to 65,557 bytes in size (22 bytes fixed,
up to 65,535 bytes for a comment). There are two ways to scan for the header,
one is to seek just 22 bytes shy of the end of the file and start checking
backwards (since the majority of ZIPs I've encountered do not have the
comment) or, seek 65,557 bytes from the end and scan forward.

This fact has been used to construct pathological ZIP files where one tool
will report one list of files and another tool will list a different set of
files. That's why you really need to overwrite the old header.
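
A minimal sketch of the backwards scan described above (the signature is the standard end-of-central-directory marker; a real reader would go on to validate the record's fields rather than trust the raw match):

```python
# The end-of-central-directory record is 22 bytes plus an optional comment
# of up to 65,535 bytes, so it must start within the last 65,557 bytes.
EOCD_SIG = b"PK\x05\x06"
MAX_TAIL = 22 + 65535

def find_eocd_offset(path):
    with open(path, "rb") as f:
        f.seek(0, 2)                      # 2 = SEEK_END
        size = f.tell()
        span = min(size, MAX_TAIL)
        f.seek(size - span)
        tail = f.read(span)
    # Scan from the end: take the last occurrence of the signature, which is
    # exactly what makes the pathological archives mentioned above read
    # differently in tools that scan forward instead.
    pos = tail.rfind(EOCD_SIG)
    if pos < 0:
        raise ValueError("no end-of-central-directory record found")
    return size - span + pos              # absolute offset of the EOCD record
```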

~~~
greggman
Then you wrote a bad extractor. Scanning forward is against the format.
Zipping up a zip file is a perfectly valid thing to do. If the inner zip is
stored uncompressed you'll have 2 headers at the end. If you scan forward
you'll fail.

------
tdicola
Wow, a stackoverflow question that hasn't been closed or removed for some
trivial reason -- thought I'd never see something like that again.

~~~
brianwawok
You think stackoverflow has too many good questions being closed? If anything
I think more could be closed.

~~~
Houshalter
Typically when an interesting stackoverflow thread is linked on HN, it's
closed. I don't know why this is. But it gives a very bad impression of the
site's moderators.

~~~
hrnnnnnn
Almost every link I follow from Google to SO results in a "closed as offtopic"
question.

Usually the question has the exact answer I was looking for (I guess that's
why its pagerank is high).

------
minionslave
It's kinda bad-ass when he said: you can use this text on Wikipedia, I'm the
primary reference.

~~~
Grue3
Wikipedia mostly doesn't allow primary sources though. It needs to be
corroborated by a reliable secondary source.

~~~
0xffff2
Which is why he actually said it could be cited:

>I am the reference, having been part of all of that. This post could be cited
in Wikipedia as an original source.

His post on SO is the reliable secondary source.

~~~
MyNameIsFred
No, his post on SO is a firsthand account, and therefore a primary source, not
a reliable secondary one. Technically speaking; nothing personal.

------
coryfklein
I love the discussion in the comments:

> This post is packed with so much history and information that I feel like
> some citations need be added incase people try to reference this post as an
> information source. Though if this information is reflected somewhere with
> citations like Wikipedia, a link to such similar cited work would be
> appreciated. - ThorSummoner

> I am the reference, having been part of all of that. This post could be
> cited in Wikipedia as an original source. – Mark Adler

------
virtualized
But can he invert a binary tree and is he willing to relocate to San
Francisco?

~~~
umanwizard
What does "invert a binary tree" mean?

~~~
fred256
It's a reference to
[https://twitter.com/mxcl/status/608682016205344768](https://twitter.com/mxcl/status/608682016205344768)

~~~
umanwizard
I know the reference and I saw the discussion around it when it came out. I
still don't understand what "invert a binary tree" means.

------
adontz
When I read "I am the reference" it reminded me of "I am the danger".

[https://www.youtube.com/watch?v=3v_zlyHgazs](https://www.youtube.com/watch?v=3v_zlyHgazs)

~~~
SNACKeR99
And another one: you are not IN a traffic jam, you ARE the traffic jam.

------
new_hackers
When Chuck Norris computes checksums, he uses Adler-32

------
agumonkey
Archived link just in case: [http://archive.is/SvUO5](http://archive.is/SvUO5)

~~~
sigjuice
I appreciate your effort and sentiment, but it would be nice if the awful
eyesores of ads were not plastered all over the page.

~~~
cyphar
I'm confused. The only ads in the Internet Archive are the ones from the
original page. Either you're getting MITM'd by someone who's injecting ads, or
your ad blocker doesn't deal with archived links so well.

