Zip – How not to design a file format (greggman.com)
274 points by st_goliath on July 22, 2021 | 162 comments



https://web.archive.org/web/20210722231119/https://games.gre...

Edit: For anyone seeing this much later in time, the website got flooded when it hit the Hacker News main page. I grabbed the archive mirror of it from earlier today.


Since this is a frequent occurrence, wouldn’t it be neat if we could do this kind of “archive interesting article” thing using IPFS? Maybe even do it automatically when submitting it to HN such that it could tell HN itself to grab a copy?


> Since this is a frequent occurrence, wouldn’t it be neat if we could do this kind of “archive interesting article” thing using IPFS? Maybe even do it automatically when submitting it to HN such that it could tell HN itself to grab a copy?

Saving things to the Wayback Machine at the Internet Archive is a far better solution than IPFS. IIRC, IPFS only keeps things as long as they're popular, which has famously resulted in things like most NFTs pointing to dead links.


IIRC, wayback machine has a pretty wicked flaw in that if a domain adds a restrictive robots.txt at some point (say, a domain changes hands), then all prior archives for that domain are retroactively taken offline. Not a problem for archiving things just to keep them online under heavy load, but it's a big problem for future use.

I'd prefer something like archive.is for longer-term storage.


With IPFS this can be avoided by using IPFS pinning. It can be done on any node but usually the use of an IPFS pinning service from providers such as Pinata and/or Infura makes the most sense.

It’s the best of both worlds - IPFS will still do its decentralized thing, but with a pinning service, even when individual IPFS nodes do garbage collection or otherwise expire an object locally, they can always go fetch it from the persistent copy stored with the pinning service.


If my understanding is correct, IPFS provides a way to only store files. Wayback Machine archives entire webpages, along with any CSS or JS (?) included. So building a webpage archival service over IPFS needs significantly more development effort.


You're absolutely correct. I wasn't commenting on the appropriateness or suitability of using IPFS for the purpose of archiving webpages. I was replying to the parent that there is an IPFS solution for the expired objects/dead links issue.


> How do you read a zip file?

> This is undefined by the spec.

> There are 2 obvious ways.

> 1. Scan from the front, when you see an id for a record do the appropriate thing.

> 2. Scan from the back, find the end-of-central-directory-record and then use it to read through the central directory, only looking at things the central directory references.

I was recently bitten by this at work. I got a zip from someone and couldn't find the files inside that were supposed to be there. I asked a colleague, and they sent me a screenshot showing that the files were there, and that they didn't see the set of files that I saw. I listed the content of the zip using the "unzip -l" command. They used the engrampa GUI. At that point I looked at the hexdump of the file. What caught my eye was that I saw the zip magic number near the end of the zip, which was odd. The magic number was also present at the beginning of the file. At this point I suspected that someone had used cat(1) to concatenate two zips together. I checked it with dd(1), extracting the sequence of bytes before the second occurrence of the zip magic number and the remainder into two separate files. And sure enough, at that point both "unzip -l" and engrampa showed the same set of files, and both could show both zips correctly. Turns out engrampa was reading the file forwards, whereas unzip was reading the file backwards.
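If you want to see the difference without a hexdump, here's a minimal sketch that reproduces it with Python's zipfile module (a back-reading implementation). The file names are made up; a front-scanning tool like engrampa would show the first archive's files instead.

    import io
    import zipfile

    def make_zip(names):
        # Build an in-memory zip containing small dummy files.
        buf = io.BytesIO()
        with zipfile.ZipFile(buf, "w") as zf:
            for name in names:
                zf.writestr(name, b"example data")
        return buf.getvalue()

    first = make_zip(["report.txt"])    # hypothetical contents
    second = make_zip(["notes.txt"])

    # Simulate `cat first.zip second.zip > combined.zip`
    combined = first + second

    # zipfile locates the end-of-central-directory record by scanning from
    # the back, so it should only report the *second* archive's entries.
    print(zipfile.ZipFile(io.BytesIO(combined)).namelist())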


The fact that merging zip files with cat does not produce an error is amazing by itself. Now we'll wait until someone finds / uses a way to distribute hidden files using this method.


What will really blow your mind is that you can cat a png and a zip together and the resulting file is both a valid png that looks identical to the image and a zip that contains the same contents as the archive.
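Roughly like this, assuming you already have some valid image.png and files.zip lying around (the names are hypothetical):

    import zipfile

    with open("image.png", "rb") as f:
        png_bytes = f.read()
    with open("files.zip", "rb") as f:
        zip_bytes = f.read()

    # PNG first (read front-to-back), ZIP second (read back-to-front).
    with open("polyglot.png", "wb") as f:
        f.write(png_bytes + zip_bytes)

    # Most image decoders stop at the PNG IEND chunk and ignore trailing
    # bytes, while zip readers that search backwards for the end-of-central-
    # directory record still find the appended archive.
    print(zipfile.ZipFile("polyglot.png").namelist())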


This is how these self-extracting archives work: it’s just an exe (that allows adding random data to the end) and a zip (that allows random data to be prepended) added together.


More specifically, the executable part contains the logic to decompress the remainder of the file, while still allowing a decompression program to open and decompress the file (ignoring the executable portion of that file).


This may also be the trick used by the fantasy console PICO-8[0], whose games are distributed as png-based cartridges[1].

[0]: https://www.lexaloffle.com/pico-8.php

[1]: https://www.lexaloffle.com/bbs/?cat=7&carts_tab=1#sub=2&mode...


Apparently, Pico8 encodes the compressed app data directly in the PNG pixel data. They use the 2 least significant bits of each color channel to encode a single byte of data per pixel. The pixel count of pico8 cartridges is what creates the console's 32,800 byte size limit.

https://pico-8.fandom.com/wiki/P8PNGFileFormat
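A rough decoding sketch of that scheme (assumes Pillow is installed; the A/R/G/B bit ordering below is my guess at the layout, so check the wiki page above before trusting it):

    from PIL import Image  # assumes Pillow is installed

    def decode_cart_bytes(path):
        # Rebuild one data byte per pixel from the 2 low bits of each channel.
        # The A/R/G/B ordering is an assumption; see the P8PNG wiki page for
        # the exact bit layout used by real PICO-8 cartridges.
        img = Image.open(path).convert("RGBA")
        data = bytearray()
        for r, g, b, a in img.getdata():
            byte = ((a & 0x03) << 6) | ((r & 0x03) << 4) | ((g & 0x03) << 2) | (b & 0x03)
            data.append(byte)
        return bytes(data)

    # A 160x205 cartridge image yields 32,800 bytes of embedded data.
    # print(len(decode_cart_bytes("cart.p8.png")))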


You might enjoy PoC||GTFO issue #10:

> The polyglot file pocorgtfo10.pdf is valid as a PDF, as a ZIP file, and as an LSMV recording of a Tool Assisted Speedrun (TAS) that exploits Pokémon Red in a Super GameBoy on a Super NES. The result of the exploit is a chat room that plays the text of PoC||GTFO 10:3.

https://www.alchemistowl.org/pocorgtfo/pocorgtfo10.pdf


Mind. Blown. Only works if the PNG is first and the ZIP second though, because PNG reads front to back and ZIP reads back to front.


> Now we'll wait until someone find / use a way to distribute hidden files using this method.

You mean like the GIFAR attack (https://en.wikipedia.org/wiki/GIFAR)?


Back in the day people used to spread zips hidden in jpegs on imageboards using basically the same idea (after the jpg data is the zip data)

The most iconic /i/ kit spread this way being Dangerous Kitten (https://web.archive.org/web/20120322025258/http://partyvan.i...)


This is what the original SFX self-extracting archives used: appending a zip file to an extractor exe.

This is also why zips are traditionally read from the back.


> Now we'll wait until someone find / use a way to distribute hidden files using this method.

No need to wait. Microsoft does this in their file formats. (E.g., PowerBI's pbix files have this pattern.)


A lot of software installers use this method. It’s a very old tried and true method.


You may enjoy the PoC || GTFO ("Proof of Concept or Get The Fuck Out", google it) "newsletter", which is a pdf and zip in a single file, sometimes with some other file types merged in, too (e.g. truecrypt), and which also contains back issues in the current issue, russian-doll-style.


> Turns out engrampa was reading the file forwards, whereas unzip was reading the file backwards.

Was the second zip less than 64KB in size? If not then engrampa was being extra wrong about looking for the header.


The second zip was 109 bytes.


I can think of some scenarios where that could really cause some harm


APK files are zips, although the same code that reads them probably checks signatures.


There were at least one or two vulnerabilities in Android because it wasn't the same code doing those two things. Later they added a signature for the whole zip (almost*), instead of the previous approach inherited from jar, which hashed each file inside the zip individually. That way the signature check doesn't need to parse zip headers*.

* Since the signature is put somewhere between the zip headers instead of in a separate file, there needs to be a little bit of parsing to locate it, and the hash isn't a straightforward hash of the whole file because it needs to exclude the signature part.

edit: turns out they later also added a version with an external signature


Intermingling signatures or authentication with data layout is such a common failure, and people do it over and over again, even in relatively new projects like Android. I wonder why the initial design took off? This is something where I would expect any crypto engineer to refuse to sign off on the design, because it is going to break; it's not even a question. If you do anything other than Sign(Blob) and later Verify(Signature, Blob) && Parse(Blob) you're going to do it wrong.


Not just APK; EPUB (the ebook file format) as well, and likely many others (IIRC JAR for Java?).


I've seen legitimate uses of this feature where an executable appends a zip file onto itself, and can then access this portable, compressed storage at runtime.


You got an invalid ZIP file, so behavior is inherently undefined.

Per the blog + spec:

A ZIP file MUST have only one "end of central directory record".


Super helpful for streaming though -- don't need to know all file sizes upfront.


The author is missing an obvious point that computing was way too different at the time the format was designed.

You couldn't do a "find XXX | grep YYY | zip" pipeline; you would instead often add files to an archive one-by-one. So the format consists of appendable records sharing the same structure, and adding new files is simply a matter of appending records to the end.

Now sometimes you would want to quickly locate and extract just one file out of the entire archive, and scanning the entire archive would be ridiculously slow on a system with <1MB/s throughput and some kilobytes of RAM. Hence, the central directory at the end of the archive.

Some other time, the power would go down while you are appending to the archive, or a bad block would pop up on your HDD, hence redundancy between the central directory and the records.

Also, writing archivers back then was very different from what we do now. There were no unit tests, no abstraction layers, and very limited debuggers. Software was hand-coded in assembly, and adding an extra field or a check here or there was much less trivial than it is now.

You can always find something that was designed under completely different constraints and call it a bad design. But in reality it only shows that the author hasn't done enough research on the topic.


> You couldn't do a "find XXX | grep YYY | zip" pipeline;

The format was developed in 1989, at which time find(1) and grep(1) and command-line pipes had been around for at least 10 years (more, depending on how you measure it).

It wasn't that you couldn't run pipelines like this, but more that zip's creators didn't care about Unix.

> Software was hand-coded in assembly

I'm not sure what version of software history you've been reading, but I can assure you that I personally was not hand-coding assembly in 1989, even though I was writing device drivers, PostScript RIPs and other stuff at that time.


> The format was developed in 1989, at which time find(1) and grep(1) and command lines pipes had been around for at least 10 years (more, depending on how you measure it).

PKZIP, which originated the Zip format, was developed on MS-DOS, which did not have find(1) and grep(1), nor did it have pipes (IIRC):

* https://en.wikipedia.org/wiki/PKZIP

It was a third-party who created utilities that ran under Unix-y systems:

* https://en.wikipedia.org/wiki/Info-ZIP

Unix had its own file archive formats at the time, but it seems that folks on the PC side of things either didn't know about them, or didn't care for them:

* https://en.wikipedia.org/wiki/Tar_(computing)

* https://en.wikipedia.org/wiki/Ar_(Unix)

* https://en.wikipedia.org/wiki/Cpio


> PKZIP, which originated the Zip format, was developed on MS-DOS, which did not have find(1) and grep(1), nor did it have pipes (IIRC):

COMMAND.COM supported find(1) in the form of DIR /S and grep(1) in the form of FIND. It also supported Unix-like pipe syntax, though because DOS wasn't multitasking it was implemented under the hood by redirecting to a temporary file and running the programs sequentially.


GP said "computing was different back then", not "Zip was developed on a toy operating system that lacked these sorts of facilities".

I did note that likely what mattered the most was that pkzip's creators didn't care about (or maybe even know about) Unix.


I can assure you this was exactly the case. The original was done in C and then compiled. Phil then would optimize it by hand resulting in highly optimized asm code. All the DOS versions were done this way. That was the genius of Phil Katz. The first VAX version was done in C and that code was then ported to the first unix variants at which time it was moved to 32 bit and the extended attributes fields were added.


>> I personally was not hand-coding assembly in 1989

None at all? 80's C compilers were garbage. Most example code for device drivers (sound cards, video cards, etc) was in ASM back then.


Maybe 80s compilers for MS-DOS/Windows were garbage. Maybe even the ones for Unix were garbage too, and we just didn't care. I've only ever worked on Unix-based systems in my 35-year career in software, and given that the entire operating system (bar a few low level details) was written in C from the beginning, it has always been rare to fall back to asm for anything unless it was essentially impossible to do the job in C.

I've written on the order of a half dozen device drivers in my life for sound cards, motor controllers, DSP boards and sensors, and have never used asm for any of it.


The discussion is about PkZIP which ran on DOS PC's, so your experience in the UNIX world is going to be very different. Life was different in 16-bit, PC XT/AT land with 640k.


Sure. But this started with a comment that said "computing was different back then". Not the same thing at all.

Also, note that in 1989, Xenix and others ran quite happily on the sort of hardware you're citing, so MS-DOS is/was the actual "limiting factor".


The ratio of people using UNIX workstations back then, even Xenix, compared to MS-DOS was probably 1 to 1000.


Sure. But MS-DOS didn't define "computing" in 1989. I think it's fine to say something like "pkzip was developed on a platform with a lot of limitations, and you can see some of them in both the program and file format designs". It's quite different to claim that the design of pkzip somehow reflected the general state of computing in 1989 - it just isn't true.


I'd argue that it did, again for 1 out of 1000 people.

Remember that the 'rest of the computing world' didn't really catch up to the early UNIX workstations until Windows 2000/XP.


I did some then (video device driver), and much more of it 10 years earlier.


> I personally was not hand-coding assembly in 1989

I was. You missed out on some fun.


It seems you didn't bother to read the article completely, because the author also goes into the reasoning of this problem:

> You might think this is nonsense but you have to remember, pkzip comes from the era of floppy disks. Reading an entire zip file's contents and writing out a brand new zip file could be an extremely slow process. In both cases, the ability to delete a file just by updating the central directory, or to add a file by reading the existing central directory, appending the new data, then writing a new central directory, is a desirable feature. This would be especially true if you had a zip file that spanned multiple floppy disks; something that was common in 1989. You'd like to be able to update a README.TXT in your zip file without having to re-write multiple floppies.


Sure you could, even though MS-DOS wasn't UNIX, and using pipelines meant writing temporary data into the current directory.

In fact that was one of the reasons ARJ got more love than zip, given its flexibility and ability to split archives across floppies.

Unless you were coding COM files, demoscene or commercial games, there was enough software being written in Turbo and Quick Basic, Turbo/Quick/TMT Pascal, Clipper/FoxPro/DBase III, Turbo/Quick C/C++, Turbo/TMT Modula-2,....


> Sure you could, even though MS-DOS wasn't UNIX, and using pipelines meant writing temporary data into the current directory.

Assuming you're commenting on the GP's point about UNIX pipelines, I think your comment here is a little disingenuous. Akin to that famous HN comment mocking the then-startup Dropbox, saying "you can just use FTP...." while completely missing the reasons Dropbox would go on to be successful.

> In fact that was one of the reasons ARJ got more love than zip, given its flexibility and ability to split archives across floppies.

Did ARJ actually get more love than ZIP, or is that just your anecdotal recall of the era? I seem to recall the opposite, with ZIP being far more prevalent. Remember ZIP also supported splitting archives over multiple floppies.

Interestingly ARJ's website is still live and not been updated in 10 years[1]. It might be ugly by today's standards but man is it better designed for getting the core information out. We've definitely lost something when sites became more about presentation and less about information.

[1] http://www.arjsoftware.com/arj.htm


Not disingenuous at all, MS-DOS CLI was good enough for most pipeline use cases, and if you really wanted a more UNIX like experience you could always buy 4DOS, the Dropbox of MS-DOS shell utilities.

Back in the MS-DOS days, you couldn't use zip across floppies unless you bought the commercial Pkzip version, the shareware one did not offer it.

Whereas ARJ offered all that pkzip was capable of, for free, and every high school kid with a PC at home got it on their floppy collection.


> Not disingenuous at all, MS-DOS CLI was good enough for most pipeline use cases, and if you really wanted a more UNIX like experience you could always buy 4DOS, the Dropbox of MS-DOS shell utilities.

It literally is disingenuous for the reasons I'd already outlined. And then saying "but you could always install this other lesser known DOS-like command shell" doesn't exactly help your claim.

Sure, with a little effort pretty much anything is possible in IT (that's a large part of the reason I was inspired to work in IT growing up). But my point was that if MS-DOS requires a new command shell or other hacks, then your whole argument of it supporting pipelines is rather stretched. Literally the whole point of pipelines is as a quick and common way of passing information between programs, and nothing you've detailed is quick or common.

But you'll come along and find some other tedious argument because that is what you do....

> Back in the MS-DOS days, you couldn't use zip across floppies unless you bought the commercial Pkzip version, the shareware one did not offer it.

But the file format did.

> Whereas ARJ offered all what pkzip was capable of, for free, and every high school kid with PC at home got it on their floppies collection.

Every high school kid would have had the commercial version of pkzip too -- albeit not legally. The amount of pirated software that would get swapped around in the 80s and 90s was pretty ridiculous.

In fairness to you though, I guess the preference of ARJ vs ZIP is always going to be an unprovable subjective point. I can't go back in time to prove your community of friends used one over the other any more than your anecdotal evidence being proof of my group's, or any others, preferences. And to be fair, it was a long time ago now and memory is fallible. I remember using pkzip a lot to package software I was writing but I might just have forgotten about all those times I used ARJ.


> the Dropbox of MS-DOS shell utilities

I'm not sure I understand the Dropbox analogy. In what way was 4DOS "the Dropbox of MS-DOS shell utilities"?


Akin to writing them yourself, as a tongue-in-cheek remark to the person I was replying to.


> Akin to writing them yourself

But...people don't write Dropbox for themselves? That's an existing product that someone else made (unless "you" refers to a Dropbox employee of course).



Yes, but that makes Dropbox and 4DOS dissimilar to writing things yourself, not akin to it.


From what I can remember of the 90s in Sweden, the most common archive formats for spanning multiple 3 1/2 inch floppies were ARJ or RAR. ZIP existed but was uncommon in comparison.

I think it wasn’t so much about the qualities of the zip format itself, rather that the WinZip software wasn’t as good as WinRAR. ARJ was command line, at least that is how I used it; it was well documented and easy to use.

It may have been that the zip format was more common in the DOS 5 1/4 inch floppy days, because zip (1989) existed prior to both ARJ (1992) and RAR (1993).


So it was in Czechia during the DOS times. ARJ was almost synonymous with compression. Some people used RAR.

The advent of Windows was when ZIP started to push them out.


I think the point of the article is that we are now stuck with a very technically obsolete format that no longer reflects the wants and needs of our environment - and that, in the future, anyone who designs a new format (which, in the HN community, some people actually might do), should think a bit more generously than the creator of ZIP.

For example, the need to be able to skip unknown blocks is fairly important, and incompatibility of parsers was a known technical problem in the early 1980s already.


>> The author is missing an obvious point that computing was way too different at the time the format was designed.

I think the author is aware, at the time that PkZIP was created he was writing NES games and using data compression to fit more content into the games.

https://games.greggman.com/game/programming_m_c__kids/


None of the author's suggestions are incompatible with those constraints.


Zip files on floppy disks!


There were zip floppies with 120MiB which was glorious back in those days.


>scanning the entire archive would be ridiculously slow on a system with <1MB/s throughput a some kilobytes of RAM

That's how tar works. Just the other day I was making a backup and streamed 5gb of data as a tar archive; opening it in the archiver took a lot of time.


> >scanning the entire archive would be ridiculously slow on a system with <1MB/s throughput a some kilobytes of RAM

> That's how tar works. Just the other day I was making a backup and streamed 5gb of data as a tar archive, opening it in the archiver took a lot of time.

That's because tar is the tape archiver, specifically written for a medium whose very nature is linear reading/writing and where skipping around was highly impractical or even impossible.


But having a file directory would be useful even with a tape because you wouldn't need to read the whole tape to get the list of files inside.


Could that just be the archiver? I often skip UIs since they seem to take longer to show the file contents than it takes me to unpack the entire archive on the command line. Worse, I am convinced some UIs end up unpacking the entire archive repeatedly when you select only two or three files.


No, tar does not have a central directory record, so you need to skip over all the data even just to list files. Of course you can have additional slowdowns on top of that by doing stupid things.


I was looking at the details of the ZIP format just last week when I discovered that a major performance issue was being caused by quirks of a ZIP implementation. Some knowledge of the ZIP format on the developer's part would have saved massive I/O loads (and WAN traffic).

There were various "archive" ZIP files that had grown to several gigabytes each. The developer was using Info-ZIP "zip.exe" to "append" files to the archives multiple times per hour (>100). Like virtually every ZIP implementation, Info-ZIP just re-writes the entire file when appending. As a result, I was seeing multiple terabytes of I/O churn each day as new ZIP files were being created to replace the old ones (which was killing VM replication across a 500Mbps pipe, and is what drew my attention to begin with).

While I would have loved to find a ZIP implementation that just appended the file and a new central directory I found it easier to just rename the ZIP files every day to minimize the churn. Still, I can't believe there isn't a ZIP implementation out there that can do this.


Huh? I just checked the Info-ZIP zip man page, and it says (reformatted):

> -g, --grow: Grow (append to) the specified zip archive, instead of creating a new one. If this operation fails, zip attempts to restore the archive to its original state. If the restoration fails, the archive might become corrupted. This option is ignored when there's no existing archive or when at least one archive member must be updated or deleted.

I tested this on Linux and it appears to work as advertised. I assume the reason why it is not enabled by default is because "If the restoration fails, the archive might become corrupted".

edit: python zipfile in append mode also appears to avoid rewriting the zip file. libarchive doesn't seem to support appending to zip files.
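For reference, a minimal sketch of that append-mode behavior (file names are made up): opening an existing archive in mode "a" appends the new entry data and writes a fresh central directory at the end, rather than rewriting the existing entries.

    import zipfile

    # Append to an existing archive without rewriting its existing entries;
    # only the new file data plus a new central directory get written.
    with zipfile.ZipFile("archive.zip", mode="a",
                         compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write("new_log.txt")  # hypothetical file to add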


Well I feel like a dunce. Thanks for this tip!

I didn't see this option on the version of Info-ZIP they were using but that would definitely do what I wanted. I thought I looked at current man pages but I guess I didn't.


I checked up on this. Turns out the "-g" option isn't listed in the "-?" help. I guess I didn't look at the man page-- just the "-?" help.


Another commenter pointed out that Info-ZIP supports append, but alternatively it seems like tar is built for exactly this sort of thing. Instead of the usual gzipped tar archive (ie .tar.gz), just gzip (or something else) and then append the result to an existing tar archive. (https://www.gnu.org/software/tar/manual/html_node/appending-...)

Also I'd preferably use zstd because it's a significant improvement in many ways and would afford you the option of placing a custom compression dictionary at the front of the tar archive.


Gzip files can just be straight concatenated together. There are some caveats around metadata but it’s an amazing property.
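A quick sketch of that property, using Python's gzip module (which handles multi-member streams):

    import gzip

    a = gzip.compress(b"first part\n")
    b = gzip.compress(b"second part\n")

    # Two complete gzip members glued together are still a valid gzip stream.
    combined = a + b
    print(gzip.decompress(combined))  # b'first part\nsecond part\n'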


But only if the first file ends on a byte boundary of course. Just to save someone trying this some debugging :)


> byte boundary

Do GZip files have bit-based lengths?


Yes they do!


Wow, I wonder what led to that design decision. Does that mean you have to take that into account when concatenating them?


Well, the whole article is about the flaws in the format :)

But the answer in this case is Huffman coding. The most common sequences compress to just a few bits, but backreferences and such are not byte-aligned, they just follow literals in the bitstream. Aligning things to bytes would significantly reduce the compression ratio.


Is that the compressed file that ends on the byte boundary, or the uncompressed one?


Compressed. The compressed stream isn't byte aligned. Normally this isn't a problem because an application will flush the output and pad it to end on a byte boundary before closing a file (see https://www.bolet.org/~pornin/deflate-flush.html for a good explanation)

Where this becomes a problem is when someone thinks "I'll compress my logs as I write them out". The next problem is what to do on restart. It's tempting to think "I can just append to the existing file because I can concatenate two gzip files". But if the restart is after a crash rather than a clean exit you get log salad.


Does this meet your needs? https://sqlite.org/zipfile.html


I think what is being forgotten here is the time the zip file format was created. It was more than 32 years ago, and Phil started with arc. Also, there was no internet, no Windows, not even Linux. It was designed for BBS file transfer. Even in the late 80's and early 90's, when it was expanded to the current format for new compression methods and multi-platform support, some machines were big endian and others little endian, and that is taken into account. Means had to be added to deal with "extended attributes".

When Jean-loup Gailly and Mark Adler started infozip a lot of discussion was done to try to guess what the future would hold, but it was just that. Guessing.

The fact we are still using it all is a major miracle.


The author makes good points about the poor documentation and vagueness of details of the current Zip format. That is something worth criticizing.

But I think criticizing the 30-year-old 'design' is pointless. So this is a good article with a bad title.

What is much more helpful is suggesting, in detail, how Zip might be extended to remain backwards compatible while moving forward to adopt a more sensible future format. (Though 30 years later, some guy is going to write an article saying how terrible that new design is).


ZIP is a product of the microcomputer era, which was filled with hobbyists, many new programmers learning the craft without the benefit of experience with mature systems and the preceding decades of CS work, and even experienced programmers who needed to take shortcuts to do a lot with a little. It was an amazing time where many unpolished programs were developed to try out all sorts of ideas, and there's a lot I find fascinating about the era. (For example, flipping through old computer magazines, it's always interesting how few advertiser addresses were in Silicon Valley.) But as a professional developer in 2021, looking back on that time and genre of computing is also a bit horrifying. ;)


While I agree with your overall point regarding the importance of contextual history when looking at any design, one nit I would pick is your statement that there was no internet when Phil Katz created and PKWARE released PKZIP. First release was Feb of 1989, and I was most assuredly on the internet at that time, and in fact that is where I first encountered PKZIP.

The 80s BBS scene was built around things like ARC, LHarc, and ZOO. ARC essentially got killed by its creators SEA thanks to their suing PKWARE over PKARC. Many BBS operators banned the use of ARC after that lawsuit in protest (a similar furor would erupt 5-6 years later over GIF).


My thoughts exactly. Easy to sit atop an ivory tower and look down upon deflate 32 years later. No denying its usefulness though.


Small nit. There was internet--but very few people had access to it. (And I'm guessing, don't remember, that ZIP wasn't used much or at all on internet FTP sites and that Unix-centric formats were used instead.)


I think what one needs to realize here is in what circumstances zip was born.

It was ultimately a private person who founded a small company who wrote a tool for DOS and distributed it as shareware. There were other tools like it (ARJ and LHA were other popular ones). PKZIP disappeared, but its format ZIP became the de-facto standard and survived decades later, which was likely the result of some random circumstances.

So it shouldn't be surprising that the format isn't necessarily ideal looking back decades later.

(One could add that ZIP encryption is a big mess. The original ZIP encryption is insecure, there are two incompatible less insecure modes that still are vulnerable to malleability attacks and there's no way to have any modern form of encryption with an AEAD in ZIP.)


I always used LHA. I think I'm in good company seeing as Carmack used it to compress assets in both Doom and Quake.


LHA was released as freeware and included some source code to help with porting, whereas PKZip was released as shareware and, as the article mentions, it's hard to find actual implementation details for it, i.e. it was probably easier/cheaper/smarter to go with LHA.

https://en.wikipedia.org/wiki/LHA_(file_format)


There's also ISO/IEC 21320-1 (https://www.iso.org/standard/60101.html, available as a free download at https://standards.iso.org/ittf/PubliclyAvailableStandards/in...), which subsets APPNOTE.TXT by forbidding several less common features of the file format (like multiple volumes, encryption, or compression methods other than DEFLATE).

Unfortunately, as far as I can see even that standard doesn't forbid self-extracting archives, non-empty zip file comments, or local file headers which don't match the central directory. It would be great if that standard were updated to also forbid these problematic features of the ZIP format, so that archive files created following that standard could be parsed without ambiguity. Then it would be a matter of updating other ZIP-using standards like OpenDocument to reference this simplified standard.


ZIP does have bad ideas like mentioned there.

I think it is better to use a separate format for archiving and for compression (like .tar.gz, and some others).

(If you do not need the metadata, I like the Hamster archive format, which is: It is a sequence of lumps, where each lump consists of: null-terminated ASCII filename, 32-bit PDP-endian data size, and then the data. (That is the entire specification.))
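For illustration, a tiny writer/reader for that lump layout. I'm assuming "PDP-endian" means the PDP-11 middle-endian convention (most significant 16-bit word first, each word little-endian); treat that as my reading of the spec, not gospel.

    import struct

    def pack_pdp32(n):
        # PDP-11 middle-endian (assumed): high 16-bit word first,
        # each word stored little-endian.
        return struct.pack("<2H", (n >> 16) & 0xFFFF, n & 0xFFFF)

    def unpack_pdp32(b):
        hi, lo = struct.unpack("<2H", b)
        return (hi << 16) | lo

    def write_lump(out, name, data):
        out.write(name.encode("ascii") + b"\x00")  # null-terminated ASCII filename
        out.write(pack_pdp32(len(data)))           # 32-bit PDP-endian data size
        out.write(data)                            # the data itself

    def read_lumps(f):
        while True:
            name = bytearray()
            while True:
                c = f.read(1)
                if not c and not name:
                    return                         # clean end of archive
                if not c:
                    raise ValueError("truncated lump")
                if c == b"\x00":
                    break
                name += c
            size = unpack_pdp32(f.read(4))
            yield name.decode("ascii"), f.read(size)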

(Another thing about ZIP: If you have a truncated ZIP file (possibly due to a disk error on a floppy disk), I have found that bsdtar can read it; other programs I have tried are unable to read truncated ZIP files.)


> I think is better to use a separate format for archive and for compression (like .tar.gz is, and some others).

That has the bad property that it does not allow for easy random access. With the ZIP format, you can directly read an arbitrary file within the archive without having to decompress or even read the compressed data of any of the other files, and many applications of the ZIP format (for example, JAR archives) depend on that property. With .tar.gz and similar, you have to decompress all data before the desired file.
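That's why, e.g., pulling one entry out of a large JAR is cheap. A sketch with the standard library (archive and member names are hypothetical):

    import zipfile

    # Only the central directory and the one requested member are read and
    # decompressed; the rest of the archive's compressed data is untouched.
    with zipfile.ZipFile("big-archive.zip") as zf:
        data = zf.read("docs/readme.txt")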


The solution here is to invert (like zip does). So rather than a .tar.gz you get a .tar full of .gzs.


Tar does not have a good index either. It is designed for tapes, not random-access disks.


This ruins your compression ratio, since references between files becomes impossible.


Zstd (https://facebook.github.io/zstd/) provides more flexibility and supports user provided compression dictionaries - you can just place one at the front of the archive.

Of course this doesn't fully solve the metadata issue and it's definitely more complex than just using the zip format. But on the plus side, it allows you to avoid using the zip format and the compression algorithm is significantly more performant and flexible.


Placing a user-supplied dictionary at the front of the archive is essentially how “normal” compression works.

The advantage of a user-supplied dictionary is that it can be transmitted once, out of band. With this approach it can also be larger than what would make sense for an individual file and it can take advantage of similarity between files.

But if you just send it along with the compressed data anyway, you’re not buying anything extra.


Yes, I realize that it's manually reinventing the wheel. However the context here was supporting random access (and append, truncate, etc) while also using separate formats for archiving and compression. Such an approach is more work but provides clean separation and thus far more flexibility.

Bear in mind that the alternative (in context) is throwing a bunch of independently compressed files at something like tar. In the event that the files are small, that will absolutely destroy the compression ratio. If the files are large enough or you only have a few then you could always omit the custom dictionary.


To manage this, you're essentially back to writing a tool which handles both compression and zipping, which is exactly what zzo38computer set out to avoid.


(Unlimited) references between files are incompatible with random access.


Yes, that is a valid point; for some applications, random access will be helpful.

One possibility is to make each file a separate compression block, which is possible if using concatenatable compression formats; then, the same format can be used for solid and for non-solid compression.

However, then you will need an index if you want random access. I don't know which compression formats have such an index feature, but I thought of some ideas for a (concatenatable) compression format; one possible thing to add is an optional index block (which links the compressed and uncompressed offsets of the beginning of blocks which do not depend on any previous blocks).


Random access of ZIPs was helpful during the dial-up days because you could download select files instead of the whole ZIP file too.

If something like tar.gz had been standard then, then that wouldn’t have been possible.


A solution would be to reinvent the wheel invented by pixie and others (PDF has a TOC at the end, too, for example, in ASCII, because PDF started life as a text-based format, “postscript without the halting problem”) and introduce a convention that a table of contents gets appended as two files:

- a file named “TOC” with a well-specified format that contains the byte offsets of each file in the archive

- a file named “TOC-offset” whose content is the 8-byte offset of the “TOC” entry (keeping this separate means the file can be fixed-size. That makes reliably finding it easier)

Preferably, both files also would have some checksum. The TOC itself could have a pointer to the previous TOC. If so, it could be a diff.

Supporting tools could easily check for the presence of a TOC and either overwrite it with new files and write a new one, or just append new files and write a new TOC pair.

Attackers could thwart this by creating an archive without a TOC that has a file ending in what looks like a TOC/TOC-offset pair.

An alternative would be to write the TOC-offset file at the start. That’s more robust against such attacks, but requires the ability to overwrite the start of a file.


There are dangers when allowing one TOC to refer to another, and the ability of offsets to cause self-reference, e.g. see Section 3 of http://www.ieee-security.org/TC/SPW2014/papers/5103a198.PDF

In general, it's also a good idea to keep data and metadata distinct, to avoid potential conflicts (and hence exploits). In this case having the archive metadata as a file inside the archive seems fraught with dangers. We could avoid this by e.g. having two levels of archive, e.g. the first level contains TOC files and data files (e.g. tar files); the 'payload' files are kept in the data files. That way, there's no way for payload to be interpreted as metadata, or vice versa.


Random access is a nice feature of zip. For a system I worked with, I made an HTTP API that could read individual files within a zip archive without unpacking it; trivial to implement and still performant, so we didn't need to change anything in how these archives were stored.


Obviously the solution is to make .gz.tar files:)


> 32-bit PDP-endian data size

That may have been sufficient 30 years ago...


>I like the Hamster archive format, which is: It is a sequence of lumps, where each lump consists of: null-terminated ASCII filename, 32-bit PDP-endian data size, and then the data.

gzip format does that too, but gunzip extracts concatenated files as one file with everything concatenated.


Another Zip article from early 2020 - historical but also covers quite a lot of technical ground:

Zip Files: History, Explanation and Implementation

Hans Wennborg

https://www.hanshq.net/zip.html

https://news.ycombinator.com/item?id=22506451


When scanning backwards, when you find 0x504b0506 (EOCD signature), you can also check that the length of comment field (20 bytes ahead) is correct and the offset to start of central directory (16 bytes ahead) points to the right signature to virtually eliminate the possibility of mistaking a spurious 0x504b0506 as the start of EOCD.

Not sure exactly how it works for ZIP64 but it should be similar.
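A sketch of that backward scan with the two sanity checks (ignoring ZIP64, empty archives, and archives with prepended data such as self-extracting stubs):

    import struct

    EOCD_SIG = b"PK\x05\x06"   # 0x504b0506 as it appears on disk
    CDH_SIG  = b"PK\x01\x02"   # central directory file header signature

    def find_eocd(data):
        # The EOCD record is 22 bytes plus a comment of at most 65535 bytes.
        start = max(0, len(data) - 22 - 0xFFFF)
        pos = data.rfind(EOCD_SIG, start)
        while pos != -1:
            if pos + 22 <= len(data):
                comment_len = struct.unpack_from("<H", data, pos + 20)[0]
                cd_offset = struct.unpack_from("<I", data, pos + 16)[0]
                if (pos + 22 + comment_len == len(data)            # comment fills the tail
                        and data[cd_offset:cd_offset + 4] == CDH_SIG):
                    return pos
            pos = data.rfind(EOCD_SIG, start, pos)                 # spurious match, keep looking
        raise ValueError("no valid end-of-central-directory record found")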


I guess file formats also somewhat(?) suffer from the ‘Worse Is Better’ effect[0] – a bunch of universal formats suffer from being not well designed because they were the result of a popular program that worked at the right time…

[0]: https://www.dreamsongs.com/RiseOfWorseIsBetter.html


To get an idea on how to deal with ZIP files in code:

— in C, minimal-style library: https://github.com/richgel999/miniz, look especially at https://github.com/richgel999/miniz/blob/master/miniz_zip.h

— in JS, writing only and no compression: https://github.com/PaulCapron/pwa2uwp/blob/master/src/zip.js (255 lines including blanks & comments)


You can also read the redbean source code, since it's the web server that reads/writes to its own executable zip file as a native process. https://github.com/jart/cosmopolitan/blob/f3e28aa192d433c379... Some ZIP APIs are designed in such a way that you can even use the POPCNT instruction to decode them.


Oh yes, the redbean source makes for a nice & adequate addition to the list.

BTW, jart, let me thank you here: redbean is boldly innovative, refreshingly witty. An inspiring work of art from a true hacker!


Thanks!


Given that apk’s and office docs are zip files, the ambiguities might be used as an attack vector. You might be able to construct a file that looks different when it is audited versus when it is used/run, due to differences in the zip parsing.


What is a good example of a well-designed file format? It seems like file formats are often designed according to subjective opinions and tastes. Is there "universally-accepted" guidance to designing them?


I always liked the PNG file format (http://www.libpng.org/pub/png/spec/1.2/png-1.2-pdg.html) as a good example of a well-designed file format. It starts with a fixed eight-byte signature, followed by a series of chunks, each chunk having exactly the same four fields: the chunk length, the chunk type, a variable-sized data, and a CRC over the type and data. The chunk type includes a flag telling the decoder whether the chunk can be skipped if the type is unknown (since all chunks have a length in the same place, every decoder will know how to skip the chunk). The specification also has a rationale section explaining the reason for several of its choices.
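A sketch of walking that structure: read the 8-byte signature, then loop over (length, type, data, CRC) records, verifying each CRC and checking the "ancillary" bit that marks chunks a decoder may skip.

    import struct
    import zlib

    PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

    def walk_png_chunks(path):
        # Yield (type, ancillary, data) for each chunk, checking CRCs as we go.
        with open(path, "rb") as f:
            if f.read(8) != PNG_SIGNATURE:
                raise ValueError("not a PNG file")
            while True:
                header = f.read(8)
                if len(header) < 8:
                    break                              # end of file
                length, ctype = struct.unpack(">I4s", header)
                data = f.read(length)
                crc = struct.unpack(">I", f.read(4))[0]
                if zlib.crc32(ctype + data) != crc:
                    raise ValueError("bad CRC in %r chunk" % ctype)
                ancillary = bool(ctype[0] & 0x20)      # lowercase first letter => skippable
                yield ctype.decode("ascii"), ancillary, data
                if ctype == b"IEND":
                    break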


RIFF [0] (based on IFF) is a generic chunked format that's not too different. It's the basis for WAV, AVI, and WebP, among others [1].

[0] https://johnloomis.org/cpe102/asgn/asgn1/riff.html

[1] https://en.wikipedia.org/wiki/Resource_Interchange_File_Form...


Jpeg2000 is also a really neat format. Among other things, it supports progressive decoding and tiling. (https://en.wikipedia.org/wiki/JPEG_2000)


I've seen PNG as a derivative of RIFF, in that it's a file comprised of atomic "chunks" (though PNG's chunks are much larger and not time-based). RIFF itself is basically the same as EA's IFF format from 1985 - so it's not like good ideas like a well-reasoned file format are a recent innovation.

It does seem a bit weird to think that something as highly technical (and thus: subject to becoming stale quickly) as a multimedia file format has withstood the test of time so well - while "simpler" files like Office documents from 15 years ago are almost unreadable today.


RIFF is not without issues either - all size fields are 32-bit, so handling RIFF files larger than 4GB is ambiguous, if not undefined. The header contains a size field that should indicate the total size of the RIFF container, and in theory this cannot exceed 4GB, just like each individual RIFF chunk cannot exceed 4GB. There are some specifications that try to solve this issue but they are not very widespread (e.g. RF64), and probably have their own issues as well.


It might be very problematic to read multimedia files from 15 years ago; weren't those the days of 'codec packs', with every downloaded movie seeming to require a different one?

But at least the multimedia files (and archive formats) were designed to be interchangeable between applications. '.DOC' word docs were basically a binary dump of its internal data structures - there was no intention of it being interchangeable with any other application, and it showed.


The Zip format is a sequence of chunks too: one file is one chunk marked by a signature, like png chunks; that's why files can be added to a zip archive relatively cheaply.


Sure, but with Zip, you can’t reliably figure out where the chunks start and stop. In PNG, you can easily enumerate the chunks and skip over any chunks your program does not parse.


Heh, well that's the point of the article: Zip is an example of what not to do. So to improve things, a file format should have a complete specification, conformance tests, a fingerprint, a version number, variable-length records with lengths (to allow skipping records you don't understand), support for 64-bit lengths, and an index (not a partial or incorrect index, which is legal in zip).

On top of that, anything intended for archival should provide for checksums and error correction (Reed-Solomon or similar).


Despite using the flawed ZIP structure, I was moderately impressed by the "System.IO.Packaging" library in .NET, which is an implementation of the Open Packaging Conventions (OPC) that underpins the various MS Office document formats.

It has a nice central directory, an extensibility mechanism, and even supports incremental saving of large documents in a reasonably sane way. Another fancy feature only typically seen in video container formats is the ability to interleave chunks to enable streaming loads or efficient storage on mechanical drives.

Unfortunately, layering XML on top of ZIP is hideously inefficient for all sorts of reasons.

IMHO, an ideal generic format should have some of the high-level API features of OPC, but implemented with a low-level storage encoding that is:

1) Linear in the sense that no O(n^2) or even O(n log n) algorithms are ever needed (such as XML parsing). Reading the data should be a single pass, with all buffer sizes known in advance. It's okay to sacrifice write speed, because writing is less common than reading, and often scales 1:many. E.g.: a file written once is often read many times. It's almost never the case that a file is repeatedly overwritten and read only once. Similarly, writes can be cached using write-back trivially, but reads are often uncachable and are on the critical path for end-user perceived latency.

2) Inherently paralleliseable. A mistake inherited from the 1990s era of file formats is strictly sequential formats that can never be decoded faster than whatever a single CPU core can do. At a very small compression efficiency loss of about 1-5%, huge wins can be achieved by simply restarting the compressor every 64KB or so. Similarly, instead of a naive hash function, use a Merkle tree, which can be verified in parallel.

3) Safe in the sense that overlapping block references are never allowed, and the decoder code should be generated from the schema in such as way that this is strictly enforced. This would allow languages like Rust or C# to safely return "span" references to the file data buffer without having to break it up into a million tiny allocations. If aliasing is allowed, this breaks the memory model of several languages.


At a glance, the 7zip specification (the official version of which might be in the SDK?), developed with years of hindsight, appears to be a counter-example to the zip file format.

This is a more easily accessed derivative of that specification.

https://py7zr.readthedocs.io/en/latest/archive_format.html


At this point, just don't design a file format. Use flatbuffers.

1. It has good forward / backward compatibility stories;

2. Schema is your file format documentation;

3. Don't need to decode every parts;

4. Can skip for unknowns (if you use union type);

5. Many language bindings.

The only downside, and this is a big one: the method for validating a flatbuffers payload doesn't have nearly as much language coverage as the encode / decode methods. (i.e. C++ has it, but a lot of other languages don't have the validate method implemented.)

Case study - Apache Arrow uses flatbuffers to persist its metadata: https://github.com/apache/arrow/tree/master/format


I'd mostly settle for "documented well". Like the Sqlite file format: https://www.sqlite.org/fileformat.html

Also nice that it's platform independent. Same file works on big/small endian machines.


Fittingly, the author of SQLite proposed an alternative to Zip files based on SQLite database files: https://www.sqlite.org/sqlar.html ; https://sqlite.org/sqlar/doc/trunk/README.md


Lzip, which is a better version of xz, is designed to be well designed. If you see what I mean.

https://www.nongnu.org/lzip/lzip_talk_ghm_2019.html


MS Word 97 binary doc format with OLE. /s


That hit me right in the feels, lol.

> Microsoft engineer looks at crack pipe...

“Yea, let’s put an entire filesystem inside this new document format”


You mean like ODF and DOCX still are zip files, basically a file system?

It was there for a reason - you could embed documents/excel sheets/external apps that supported OLE into each other (think adding an editable excel graph to your document). And the FAT-like OLE file format allowed you to do partial updates to a file in place, which was pretty useful given the slow write speeds of hard disks or even floppy drives. Fully rewriting the whole ZIP/DOCX file on every Save, like we can do now, wasn't an option then.

I'm not sure if, given the requirements driven by that embedding use case, the engineers could have come up with something better in the 90s.


Probably not a such a terrible design considering it was later used as the base, among others, for Java archives and all modern Office documents.


It gets worse, because the DEFLATE compression format used in Zip has a number of unspecified behaviors: https://www.nayuki.io/page/unspecified-edge-cases-in-the-def...


> Reading an entire zip file's contents and writing out a brand new zip file could be an extremely slow process.

That’s not the reason. Writing to disk and reading it back is bad, but keeping the whole file in memory is far worse and people seem to forget that routinely.

If you read from the beginning, you can stream the decompression. If you are already taking care not to clobber important files on your filesystem, this isn’t much worse.

The other advantage in the modern era of this format is that you can stream compression of the archive too. Need to send someone a bunch of files? You can start writing the file to the client while still compressing the data. If you’re very brave, you can start sending it while you’re still globbing the file system or pulling database records. This means you aren’t turning CPU time on the source into latency at the receiving end.


Well, a rant about an ancient format is a rant, but I guess that today a ZIP file is what a handful of well-known libraries agree upon, and it's okay — as long as you only need file archive functions, and not some data signing guarantees, file deduplication, mapping for random access, and everything else.


.PDF would like a word..


re the streaming thing, I assume the intent is that a server can stream a zipfile trivially while generating its contents on the fly. and then write the central directory at the end. It can't easily be read while streaming, but it's a great choice for writing a stream
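A sketch of that write path, assuming a Python version (3.6+) where zipfile can write to a non-seekable stream by falling back to data descriptors. The wrapper and names are hypothetical; the point is that entries go out as they're produced and the central directory is only emitted on close.

    import zipfile

    class NonSeekableWriter:
        # Write-only wrapper, e.g. around a socket or an HTTP response body.
        def __init__(self, raw):
            self._raw = raw
        def write(self, data):
            return self._raw.write(data)
        def flush(self):
            self._raw.flush()

    def stream_zip(out, records):
        # Entries are compressed and written one at a time (and can hit the
        # wire immediately); the central directory is appended at close().
        with zipfile.ZipFile(NonSeekableWriter(out), mode="w",
                             compression=zipfile.ZIP_DEFLATED) as zf:
            for name, payload in records:       # e.g. generated on the fly
                zf.writestr(name, payload)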


If that was the only application then why not get rid of the local file headers entirely?


yeah, fair enough, they seem useless


The discussion here made me try opening a ZIP file that Windows considered corrupted with 7zip. And it worked! I guess that could imply that it was truncated in transfer?

Edit: Extracting shows a warning regarding data after the last record. Yep, appears to have been truncated.


As an IDS/IPS architect, I find the Zip format (and many other compression formats) to be the gnarly kind of data stream that necessitates HUGE buffering, just to get to the compression coding table at the end of the file.

I only wish that compression table were at the front of the file format.


And Microsoft chose zip as the underlying format for all its documents!


They're far from being the first or only ones in doing that. There are many other examples such as JAR, ODF, APK, etc...


> such as JAR

Not to be confused with the other JAR: http://www.arjsoftware.com/jar.htm



And tar is a mire of slightly incompatible versions, that only supports streaming, not random access. I wish there was a modern alternative that was ubiquitous enough to be practical.


Anyone know a good widely supported archive format?

tar is nicely simple, if it weren't for the fact that there are so many legacy variants to support.


I only use zip because zip won. There exist other better archive formats, but what matters for me is availability of tooling, every OS has a zip implementation and every major programming language has zip library.

But one good thing that RAR has is recovery records to protect against data corruption, a nice feature for a long-term archiving format. Not sure if other formats have this too.


> But one good thing that RAR has is recovery records to protect against data corruption, nice feature for a long term archiving format. Not sure if other formats have this too.

This can be done with any format by having an additional .par2 file for the recovery information. [0]

[0] https://en.wikipedia.org/wiki/Parchive


Apart from all the negative points, the positive point is that we can make a .zip file act like a .iso file, a .iso file act like a .html file, and a .html act like a bash and .bat file, and have a global executable format bootable.apk...zip.iso.html.bat that runs on all platforms. I mean literally all platforms.


It's not that far-fetched for desktop platforms. This was actually implemented as the "αcτµαlly pδrταblε εxεcµταblε" [0]

[0] https://justine.lol/ape.html


I think APE currently modifies the file, but my idea (which was partially inspired by it) doesn't need to modify the executable: you just double-click it on Windows, or `chmod +x` it on Mac and Linux and it will run, and rename the file to .apk to install on Android (same for iOS). We could also use WebAssembly to share code, or even use a PWA for all platforms, and with a little effort we could make it bootable on all architectures.


Just distribute some checksum and/or parity files with the zip file and you should be good to go.


Greggman… I swear I remember reading his blog back in 2000? 2001? Maybe he worked for Nintendo in Japan?


I’m not designing a file format right now.



