
There were a lot of encoders at the time using this general scheme (with a few more values to indicate match lengths or distances). PKZIP won because it was faster, and because PK had name recognition from his PKARC, a superfast implementation of SEA's ARC (the dominant archiver on PCs at the time).

PK had to stop working on PKARC because of a C&D request from SEA. He wrote the first algorithms of PKZIP, which were on par with SEA ARC on compression (and with PKARC on speed), but weren't much better. (And they have been deprecated since about 1990, if I'm not mistaken.)

Then the Japanese encoders started to rule on compression ratio (with still-reasonable compression times) - LHArc, LZARI, I don't remember the rest of the names. LHArc or LHA (I don't remember which) had basically the same scheme that PKZIP converged on, except it used adaptive arithmetic coding. PK replaced that with static Huffman coding, trading a little compression for a lot of speed, and the format we now know and love as "zlib compression" (deflate) was born (and it quickly took the world by storm, sitting in a sweet spot of compression and speed).

There's another non-trivial thing that PKZIP had going for it - it put the directory at the end, which meant you could see the list of files in the archive without reading the entire archive! This sounds simple, but back then everyone used the tar-style "file header then data" appendable layout, which meant that just listing the files inside a 300KB archive (almost a whole floppy disk!) meant reading that entire floppy (30 seconds at least). PKZIP could do it in about 2 seconds.
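To make that concrete, here's a minimal Python sketch (the archive name backup.zip is hypothetical; it assumes no ZIP64 extensions and no stray signature bytes in the archive comment) that lists the file names by reading only the tail of the archive - the end-of-central-directory record and the central directory it points to - never the compressed data:

    import struct

    with open("backup.zip", "rb") as f:
        f.seek(0, 2)
        size = f.tell()
        # The end-of-central-directory record (PK\x05\x06) is 22 bytes,
        # followed by at most a 64 KB archive comment.
        tail_len = min(size, 22 + 65535)
        f.seek(size - tail_len)
        tail = f.read()
        eocd = tail.rfind(b"PK\x05\x06")
        # Total entry count, central directory size and offset.
        total, cd_size, cd_offset = struct.unpack_from("<H2I", tail, eocd + 10)
        f.seek(cd_offset)
        cd = f.read(cd_size)

    pos = 0
    for _ in range(total):
        # Each central directory entry has a 46-byte fixed header,
        # followed by the file name, extra field and comment.
        name_len, extra_len, comment_len = struct.unpack_from("<3H", cd, pos + 28)
        print(cd[pos + 46:pos + 46 + name_len].decode("cp437", "replace"))
        pos += 46 + name_len + extra_len + comment_len

A tar-style archive, by contrast, has to be read (and, for tar.gz, decompressed) from the start to enumerate its members.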




> There's another non-trivial thing that PKZIP had going for it - it put the directory at the end, which meant you could see the list of files in the archive without reading the entire archive!

Beginning and the end, which continues to bite us in the ass to this day: we regularly stumble over bugs when one part of a toolchain uses one entry and the rest use the other.


Only the directory is at the end in PKZIP; there is also additional info before each compressed file, like in TAR. It's actually good to have both, as it allows recovering individual files even when there's corruption in other files or in the directory.


Yes, it's possible to use that information to recover from corruption.

The problem is: what happens if the two are different? This can happen by accident or maliciously. If one tool uses the first and another uses the second, then you'll end up with different results, and it can be hard to figure out why they differ.

It's like a CD, where audio players interpret the CD differently than CD-ROM players do. Some anti-copying techniques tried to take advantage of this to produce un-rippable CDs. A problem, however, was that car CD players used CD-ROM drives, so these CDs weren't playable in those cars. (https://en.wikipedia.org/wiki/Compact_Disc_and_DVD_copy_prot... )

Or some HTTP header attacks based on duplicate headers. Let's say there's a firewall which allows only requests with "Content-Type: text/plain". If the attacker includes the header twice, once as text/plain and once as image/gif, then the firewall might only check the first while the back-end web server interprets the second. This is a silly example; I couldn't remember the real attacks that take advantage of the same mechanism.
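A toy sketch of that disagreement (hypothetical names, not any real firewall or server): two components look at the same repeated header, one keeping the first occurrence and the other the last, and reach opposite conclusions:

    # Hypothetical raw request with the Content-Type header repeated.
    headers = [
        ("Content-Type", "text/plain"),  # what a first-match "firewall" checks
        ("Content-Type", "image/gif"),   # what a last-match back end acts on
    ]

    def first_match(headers, name):
        return next(v for k, v in headers if k.lower() == name.lower())

    def last_match(headers, name):
        return [v for k, v in headers if k.lower() == name.lower()][-1]

    print(first_match(headers, "Content-Type"))  # text/plain -> request allowed
    print(last_match(headers, "Content-Type"))   # image/gif  -> what actually gets served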


Zip inconsistency vulnerability: https://nakedsecurity.sophos.com/2013/08/09/android-master-k...

Duplicate HTTP header vulnerabilities:

https://bugzilla.mozilla.org/show_bug.cgi?id=376756

https://splash.riverbed.com/thread/7772

If a spec is open to interpretation, you can be sure that some software gets it wrong.


The "zip inconsistency" linked is not a problem of the Zip file format specification. The goal of the zip format was never to provide a tamper-proof cryptographically secure binary package. Whoever uses this structure for some more security-wise complex demands is responsible to make sure that the security assumptions he needs are respected by the code he implements. One simple method were signing the resulting archive, then no modification is possible, and therefore also isn't possible adding a second entry with the same name by the attacker.

I really don't care about HTTP headers or Reed-Solomon, as they are completely irrelevant to the Zip format, which was designed in the eighties for very specific purposes (being fast on floppy disks and a 4 MHz 8088 processor, and allowing recovery of as much as possible from the floppy even after a disk sector failure). I just claim, throughout the whole conversation here, that the Zip format is not a bad format for having a central directory and the metadata outside of it too, and I've given my arguments for that. NTFS also keeps some metadata in more than one place, and it's an intentional design decision solving real problems: in Zip's case, allowing fast access to the list of files in the archive but also easy recovery of individual files when some part of the archive gets corrupted.

I also claim that the Zip format is not in any way "bad" for not making it impossible, by design and specification, to have two identical file names in the archive. I can even imagine use cases where allowing exactly that is or was beneficial.


I have news for you: you'll find the same "problem" (cache consistency) everywhere in computing. Still, having the cache is a good thing. Maybe even the most important thing.

"Almost all programming can be viewed as an exercise in caching." —Terje Mathisen

The directory at the end was a fantastic cache as the seek times were unacceptably high. It's still worth maintaining for a lot of use cases. If you have a different use case where its nonexistence would make your life easier, fine, but that doesn't mean nobody needs it or that it's "wrong" to have it.


I didn't argue that it was right or wrong. I explained why creshal's comment about it biting them in the ass made sense.

Your first argument was about the benefits of fixing data corruption, not about latency.

Duplication is a poor way to handle data corruption. If you really wanted to recover data robustly, there are better methods for error correction.

Yes, you are free to make a new argument that there are other benefits to data duplication. Do be aware that the HTTP header duplication example I gave is not about cache consistency, but about the more general problem of data consistency.


> If you really wanted to recover data robustly, there are better methods for error correction.

Please tell me which format is better for recovering most of the files from a multi-file archive if corruption occurs somewhere (e.g. a small part of the whole archive file gets corrupted): the tar.gz or the Zip?

Which archive format would you use that is "better"? Is it widely used? What are its disadvantages?


> Please tell me which format is better for recovering most of the files from a multi-file archive if corruption occurs somewhere (e.g. a small part of the whole archive file gets corrupted): the tar.gz or the Zip?

For the same filesize as the zip you could transmit the tar.gz + a par2, which would be much better for recovery from such corruption.


Sometimes the corruption is in the metadata, in which case you are lucky and the duplication can help.

But neither method you mentioned lets you recover your data if the data itself is corrupted.

If you really want to recover data, use something with error-correction, like Parchive. RAR 5.0's recovery record is another solution, also based on Reed-Solomon error correction codes.
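As a toy illustration of the recovery-record idea (plain XOR parity that can rebuild one lost block; real tools like Parchive and RAR's recovery record use Reed-Solomon, which tolerates several lost blocks - the data here is made up):

    def xor_blocks(blocks):
        # XOR equal-sized blocks together byte by byte.
        out = bytearray(len(blocks[0]))
        for block in blocks:
            for i, b in enumerate(block):
                out[i] ^= b
        return bytes(out)

    data = [b"AAAA", b"BBBB", b"CCCC"]   # made-up data blocks
    parity = xor_blocks(data)            # stored alongside the archive

    # Lose block 1, then rebuild it from the surviving blocks plus parity.
    recovered = xor_blocks([data[0], data[2], parity])
    assert recovered == data[1]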

Archive tools usually don't use those methods because performance and compression size are the prime reasons people use an archive system, not robustness to errors. In large part that's because storage and network protocols are so good that such errors rarely occur - because they themselves use error-correcting codes.


> But neither method you mentioned lets you recover your data if the data itself is corrupted.

I have another experience: if I've made a zip archive of 1000 roughly similar files, even if it's a few GB big, then if I'm left with a third of the archive I can easily recover the first third of the files; if I have half of the archive, around half of the files. That's exactly because common zip archivers will recognize that the directory at the end is missing but still let me extract everything they can read, using the data in front of every file, even with the "central directory" completely missing.

If the corruption affects file 3 of 1000 files, I can extract both the first two and all of files 4-1000. If the corruption is in the corresponding early part of a tar.gz, most probably I won't be able to extract anything but files 1 and 2. All the remaining 998 are then lost.
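That recovery works because each file is preceded by a self-describing local header. A rough sketch of such a scan (the file name damaged.zip is hypothetical; this ignores the data-descriptor case where sizes are stored after the data, which real recovery tools handle):

    import struct

    SIG = b"PK\x03\x04"    # local file header signature
    FMT = "<4s5H3I2H"      # 30-byte fixed part of the local header
    FIXED = struct.calcsize(FMT)

    with open("damaged.zip", "rb") as f:
        data = f.read()

    pos = 0
    while True:
        pos = data.find(SIG, pos)
        if pos < 0 or pos + FIXED > len(data):
            break
        (_, _, flags, method, _, _, crc, comp_size, uncomp_size,
         name_len, extra_len) = struct.unpack_from(FMT, data, pos)
        name = data[pos + FIXED:pos + FIXED + name_len].decode("cp437", "replace")
        print(f"found {name!r} at offset {pos}, {comp_size} compressed bytes")
        pos += FIXED + name_len + extra_len + comp_size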

And listing the files inside a non-corrupted ZIP archive, no matter how big, is effectively instantaneous, thanks to the same central directory. Try listing the files inside a tar.gz that is a few GB big, then compare. ZIP is a very convenient format for fast access to the list of files and to a single file in the archive, and it's still very usable when part of the archive is missing.

Edit, in response to your response: I've never mentioned Reed-Solomon anywhere in the discussion. I've defended what I consider the good design decision of Zip archives having the central directory as well as the metadata in front of every compressed file, and I've given some real-life examples of corrupted files I've encountered where I've had the best recovery ratio with the ZIP format. And I've never experienced the kind of "data mismatch" you refer to myself. You are welcome to describe it more; I'd like to read how exactly you experienced it. If the answer is that you yourself wrote a program that ignored the full Zip format, or used such a bad library, then that is the real problem. Once again: "Almost all programming can be viewed as an exercise in caching." —Terje Mathisen


Please pick a topic and stick with it.

You asked about corruption, specifically when "a small part of the whole archive file gets corrupted". I addressed that point. I'm not going to cover all of the advantages and disadvantages between the two systems, because they are well-known and boring.

You of course must choose the method that matches your expected corruption modes. An incomplete write is a different failure mode than the one you asked me to address. Reed-Solomon isn't designed for that. Other methods are, and they are often used in databases.

And you of course must choose a method which fits your access model.

Just like you would of course choose tar over zip if you want to preserve all of the filesystem metadata on a unix-based machine. Zip doesn't handle hard links, for example.

My point again is that data duplication causes well-known problems when there is a data mismatch, and while the duplication in zip, as a result of its append-only/journalling format, helps with some forms of error recovery, there are better methods for error recovery if that is your primary concern.


What you call "data duplication" is actually "metadata redundancy", which results in faster access to the list of all the files in the archive (a good example of pre-computed, pre-cached info). The format is good; if some library can't cope with the format, the problem is with the library corrupting the archive. And that archive is still recoverable, even after being corrupted by a wrongly written tool, because of the existing redundancy.

The concept of metadata redundancy isn't specific to archives; filesystems do it too. It's always an engineering trade-off, and I consider the one in the Zip format a good one, based on the real-life examples I've given.


I agree that it's a good format. It's an engineering compromise. The same trade-offs behind the benefits you see are also what prompted creshal's comment:

> Beginning and the end, which continues to bite us in the ass to this day: we regularly stumble over bugs when one part of a toolchain uses one entry and the rest use the other.

Do you think creshal is mistaken?

Of course they are due to bugs in the toolchain. But such bugs are a natural consequence of the file format. That's the whole point of creshal's comment.

What do you think I'm trying to say? I don't understand your responses, based on what I thought I was trying to say.


Garbage in, Garbage out.


Weren't there also (alleged) patents on arithmetic coding back then? Huffman coding had that as an advantage too, especially when GNU was looking for an alternative to Unix's compress program.


Yes there were. JPEG was also affected, and most decoders still don't support the optional arithmetic mode.


LZH actually remains very common in BIOS images, probably because of its superior compression relative to the other formats, along with its low implementation complexity.

See this article, for example:

http://www.intel.com/content/www/xr/ar/support/boards-and-ki...

The file it references can be downloaded here:

https://downloadcenter.intel.com/download/8375/Logo-Update-U...

...and it's somewhat amazing that it contains the original LHA.EXE from 25 years ago:

    LHA version 2.13                      Copyright (c) Haruyasu Yoshizaki, 1988-91
    === <<< A High-Performance File-Compression Program >>> ========  07/20/91  ===


Hey, the first link seems to be from an RTL-language version of the site; here is the left-to-right version, which should look better:

https://www-ssl.intel.com/content/www/xr/en/support/boards-a...


And "there's another non-trivial thing that PKZIP had going for it":

https://en.wikipedia.org/wiki/Phil_Katz

"SEA was attempting to retroactively declare the ARC file format to be closed and proprietary. Katz received positive publicity by releasing the APPNOTE.TXT specification, documenting the Zip file format, and declaring that the Zip file format would always be free for competing software to implement."



