
The .zip file specification is flawed - AndyKelley
https://github.com/thejoshwolfe/yauzl/issues/48#issuecomment-266587526
======
jplasmeier
This strikes me as an example of a sort of dualism that exists within
software. The .zip spec is flawed and can be broken in all sorts of different
ways. And yet, after years and years of using computers, I've never had an
issue with extracting or using a .zip file.

I wonder if this same pattern can be found in physical engineering
disciplines, e.g. a structural engineer assessing a bridge and finding
numerous faults in the design, despite traffic still using the bridge as
normal...

~~~
niftich
Most out-in-the-wild usage of zip files fits the 'Robustness Principle', or,
after its author, Postel's Law, found in RFC 1122 [1]:

> _Be liberal in what you accept, and conservative in what you send_

There are several analyses of this maxim and whether it's a good choice for
designing robust, secure systems. This particular Internet Draft doesn't agree
[2].

[1]
[https://tools.ietf.org/html/rfc1122#section-1.2.2](https://tools.ietf.org/html/rfc1122#section-1.2.2)
[2] [https://tools.ietf.org/html/draft-thomson-postel-was-
wrong-0...](https://tools.ietf.org/html/draft-thomson-postel-was-wrong-00)

~~~
user5994461
For user interfaces: Be liberal in what you accept from the user.

For M2M (machine to machine): Be strict in what you accept, and strict in what
you send.

That's my take on it now.

For instance, when a user copy/pastes a URL into the browser, there will
often be an extra space before or after it. It's okay to clean that up (and
it's a better experience than a "site not found" error).

When a machine sends something "weird", it's not possible to know what it
really meant, and the input can't be corrected in any meaningful way. Just
fail and throw an error so the developer can fix it.
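
A minimal sketch of that split (hypothetical function names; the strict side
leans on Node's built-in WHATWG URL parser):

    // UI input: be liberal -- quietly strip the whitespace a user pasted in
    function cleanUserUrl(input) {
      return input.trim();
    }
    
    // M2M input: be strict -- fail loudly so the sending developer can fix it
    function parseMachineUrl(raw) {
      if (raw !== raw.trim()) throw new Error("malformed URL: stray whitespace");
      return new URL(raw); // the WHATWG URL constructor throws on unparseable input
    }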

~~~
wolrah
I agree with those positions, but what about the middle ground? Browsers,
for example, take machine-to-machine data that is sometimes human-generated.

I'm a fan of the strict solution as far as that goes, but there's a reason
XHTML failed: apparently, asking the people who write web sites to do it
right, or else it doesn't work, is too much to ask.

~~~
taneq
The best approach seems to be: be strict in what you accept, be strict in
what you emit, and if you want your software to be forgiving of the vagaries
of human input, then strictly adhere to a forgiving spec.

------
arh68
I don't understand why this program should fail, though, if other tools can
unzip the file just fine. I specifically wonder about this part of the code
[0]:

    
    
        if (commentLength !== expectedCommentLength) {
          return callback(new Error("invalid comment length. expected: " + expectedCommentLength + ". found: " + commentLength));
        }
    

Why not just determine _this isn't the record_, keep looking, and handle the
error after the loop?

    
    
        if (commentLength !== expectedCommentLength) { continue }
    

[0]
[https://github.com/thejoshwolfe/yauzl/blob/master/index.js#L...](https://github.com/thejoshwolfe/yauzl/blob/master/index.js#L124-L126)
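
Fleshed out, the suggestion might look something like this (a sketch only;
the function shape and variable names are assumptions, not yauzl's actual
code):

    // sketch of the restructured scan: report an error only after every
    // candidate position has been rejected
    function findEndOfCentralDirectory(buffer, callback) {
      const fileSize = buffer.length;
      // the record is at least 22 bytes, and its comment at most 0xffff bytes
      const searchStart = Math.max(0, fileSize - 22 - 0xffff);
      for (let i = fileSize - 22; i >= searchStart; i--) {
        if (buffer.readUInt32LE(i) !== 0x06054b50) continue; // no signature here
        const commentLength = buffer.readUInt16LE(i + 20);   // comment length field
        const expectedCommentLength = fileSize - (i + 22);   // bytes actually after it
        if (commentLength !== expectedCommentLength) continue; // not the record; keep looking
        return callback(null, i); // first self-consistent candidate from the end
      }
      callback(new Error("end of central directory record signature not found"));
    }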

~~~
x1798DE
I think the argument would be that if you write a non-standards-compliant
unzip utility that "knows what you meant to do", then in many cases people
generating malformed zip files won't get any warning. People emitting
malformed zip files with a random amount of data at the end won't know
they're doing anything wrong until one day, way down the road, they hit the
limit of what is acceptable, after emitting malformed zip files for ages.

I think it's the programming equivalent of not telling someone they have some
food on their face - yeah, it's no skin off your back if you don't tell them,
but who knows who they're going to run into later.

That said, a hybrid approach which fails by default but then offers a way to
"force unzip" the file in a non-standards-compliant way might be the best of
both worlds here, because the users get an immediate way to fix the problem
and the emitters are likely to be notified that they are creating malformed
zips.
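
In yauzl terms that could look like an opt-in escape hatch (the option name
here is hypothetical, not part of yauzl's actual API):

    const yauzl = require("yauzl");
    
    // strict by default; the caller has to ask for leniency explicitly
    yauzl.open("archive.zip", { ignoreTrailingGarbage: true }, (err, zipfile) => {
      if (err) throw err; // still fails on anything else that's malformed
      console.log("entries:", zipfile.entryCount);
      zipfile.close();
    });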

~~~
jonahrd
I think you're missing arh68's point. If it's supposed to find the magic
number, and by definition the comment is allowed to contain the magic number,
then this section of the code should (correctly) diagnose that this is not
the magic number, but then (which the current implementation gets wrong)
continue to search backward for the REAL magic number.

~~~
Drdrdrq
The way I understood it, the magic number is not allowed in the comment.

------
twr
I almost wrote a zip parser. After studying the format a bit, I gave up that
idea.

- A valid zip does not begin, as you would normally expect, with a magic
number.

- A valid zip can contain arbitrary prepended data.

- A valid zip can contain sections with no identifier value.

- A valid zip can contain arbitrary data between sections.

- A valid zip is validated starting from a tail section located at the end.

- A valid zip can have a valid tail section that itself contains a valid
tail section.

- A valid zip can contain arbitrary appended data.

I don't get it: why design a format that's so hard to parse? Single-pass
streaming parsing, which should be a basic requirement for most file formats,
is impossible to implement. /usr/bin/unzip cannot even extract from standard
input; presumably its implementer didn't feel like fielding user complaints
about exhausted memory.
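
For reference, the tail section everything hinges on, the end-of-central-
directory record, is a fixed 22-byte structure plus a variable-length
comment. A sketch of reading it, assuming its offset has already been
located:

    // End of central directory record, per APPNOTE.TXT:
    //   offset 0   u32  signature (0x06054b50)
    //   offset 4   u16  number of this disk
    //   offset 6   u16  disk where the central directory starts
    //   offset 8   u16  central directory entries on this disk
    //   offset 10  u16  total central directory entries
    //   offset 12  u32  size of the central directory in bytes
    //   offset 16  u32  offset of the central directory from archive start
    //   offset 20  u16  comment length (comment bytes follow)
    function readEocd(buf, pos) {
      return {
        totalEntries:    buf.readUInt16LE(pos + 10),
        centralDirSize:  buf.readUInt32LE(pos + 12),
        centralDirStart: buf.readUInt32LE(pos + 16),
        commentLength:   buf.readUInt16LE(pos + 20),
      };
    }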

~~~
badsectoracula
Because the format was originally written by PKWARE for their PKZIP and
PKUNZIP programs for DOS in the 80s and used, among other things, for backups
(being able to span multiple floppy disks was an often-used feature).

To write the directory at the beginning, they'd have needed to keep all the
data in memory or do some extra postprocessing that would have made it
prohibitively slow on a 4.77MHz machine with a 20MB hard disk and a slow 360K
floppy disk drive. So the directory goes at the end.

The files do not begin with a magic number because ZIP files can be embedded
in other files - most commonly executable files for the self-extracting
feature of PKZIP that placed the ZIP file right after the decompressor. This
allowed the executable to be used both by itself and by PKZIP.

As for the data after the magic number: it holds the comment, which was most
likely added at a later point, and they put it after the magic number so that
the format would remain compatible with existing decompressors. In the 80s
and early 90s people didn't have the Internet to get the latest version, and
a lot of old versions of PKUNZIP were floating around for many years.

~~~
dkonofalski
To further add on to that, I would frequently run into programs (and
sometimes games) spanning multiple floppies where almost the entire first
part of the archive was just installer data. Sometimes the actual .zip file
data would start on the tail end of the first disk and then span however many
disks were needed; other times the first disk was _just_ the installer and
the .zip data was on the other disks. In either case, PKZIP and PKUNZIP could
read the file, but only in the second case could they read it without that
first disk.

Things got further complicated when RARs made the scene because I remember
that some of the early .RAR apps offered their own implementations of .zip
support that would create multi-span .zips where the installer data was on the
first disk, the .zip data was on the other disks, but the container spanned
_all_ the disks.

This meant that you could potentially have 3 different ways of dealing with
the same data and there was no safe way to assume which method was used:

1. Installer on disk 1, multi-spanned .zip on disks 2-n

2. Multi-span .zip on disks 1-n

3. Multi-span .zip on disks 1-n, but installer data only on disk 1

------
ztravis
Having browsed through a few different zip parsers in the course of writing
my own, I found that most handle a range of malformed/modified input,
including prepended/appended garbage, various filename encodings (CP437,
Unicode "extra field" flags), etc. I found this set of test files useful:
[https://github.com/Stuk/jszip/tree/master/test/ref](https://github.com/Stuk/jszip/tree/master/test/ref)

It is annoying that support for these non-standard zips has become the de
facto standard (you don't want your tool to fail where others succeed)...

------
optionalparens
Most specs I encounter have seriously flawed implementations in the wild,
regardless of the quality of the spec.

Here's a short list, off the top of my head, of tech with specs and horrific
offenders in the wild, including many popular implementors:

* Telnet

* VT-XXX, i.e. VT-100 or any term spec for that matter

* MIDI

* MKV, AVI, etc.

* GIF, JPEG

* A huge amount of tech in any browser

* SAUCE

* ANSI escape parsers/writers/sequences

* LZH/LHA

* Open Document Format

* XMP

* MAPI

* POP, IMAP

As I think about this, I realize I'll be typing all night. In some cases,
implementors all adopt quirks from the most popular implementations, even
historical ones. In other cases, things just don't work, or break in awesome
ways. Zip bombs were mentioned, but there are plenty of other awesome failure
modes.

I'll finish by saying I wonder whether many authors of anything related to
terminals ever bother to read the specs, or care. Over the years I've started
to wonder how some of these things even work in the wild. The answer is that
most people don't notice what isn't actually working, while anyone who needs
to care has to draw on what have become unofficial new specs.

------
trothamel
Another fairly major flaw with zip is the lack of a defined encoding for
filenames. As far as I can tell, it's impossible to 100% reliably include
non-ASCII filenames inside a zip. (Probably because the format predates
Unicode's general acceptance.)

~~~
TAForObvReasons
APPNOTE.TXT actually has a comment about it:

> D.1 The ZIP format has historically supported only the original IBM PC
> character encoding set, commonly referred to as IBM Code Page 437. This
> limits storing file name characters to only those within the original MS-DOS
> range of values and does not properly support file names in other character
> encoding, or languages.

Extra fields are generally used to carry other encodings, and there is a
direct way to signal UTF-8 with a bit in the general purpose bit flag (GPBF).
The problem is that some writers don't set the proper flags.
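
A sketch of that check (bit 11 of the GPBF is APPNOTE's "language encoding
flag"; the CP437 fallback helper is hypothetical, since Node has no built-in
CP437 decoder):

    // bit 11 of the general purpose bit flag: the "language encoding flag"
    // (EFS), meaning the filename and comment are UTF-8
    const EFS = 0x0800;
    
    function decodeFileName(nameBytes, generalPurposeBitFlag) {
      if (generalPurposeBitFlag & EFS) {
        return nameBytes.toString("utf8");
      }
      // otherwise the spec says IBM Code Page 437, which Node cannot decode
      // natively; a real implementation needs a 256-entry lookup table here
      return decodeCp437(nameBytes); // hypothetical helper
    }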

------
xenadu02
Why can't this algorithm work reliably:

1. Begin a linear search backwards from the end of the file.

2. If the magic constant is found, push the location onto a stack of
possible locations.

3. Once 64 KiB (the maximum comment length) has been scanned, stop searching.

4. For each location on the stack, validate the rest of the header and check
that the remaining comment length matches the recorded length.

5. The first location to validate as a correct header is chosen.

This should account for the magic header appearing in comments or in data
preceding the end-of-central-directory entry.
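
A sketch of those steps, assuming the tail of the file is already in a
Buffer (field offsets per APPNOTE's end-of-central-directory layout):

    const EOCD_SIG = 0x06054b50;
    const EOCD_MIN = 22; // fixed part of the record, before the comment
    
    function findEocd(buf) {
      const stack = [];
      // steps 1-3: scan backwards, pushing every signature match
      const searchFloor = Math.max(0, buf.length - EOCD_MIN - 0xffff);
      for (let i = buf.length - EOCD_MIN; i >= searchFloor; i--) {
        if (buf.readUInt32LE(i) === EOCD_SIG) stack.push(i);
      }
      // steps 4-5: popping the stack validates the front-most candidate first
      while (stack.length > 0) {
        const i = stack.pop();
        const commentLength = buf.readUInt16LE(i + 20);
        if (i + EOCD_MIN + commentLength === buf.length) return i;
      }
      return -1; // no self-consistent record found
    }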

Have I missed something?

~~~
ambrop7
Yes, you have missed that an apparently valid header may appear by chance
near the end of the main zip data. Because your algorithm prefers candidates
found toward the front, it can misinterpret a valid file, even one without a
magic number in the comment, if the last 64 KiB minus the actual header and
comment size happens to contain an apparently valid header.

~~~
j_s
This happens with nested .ZIPs.

------
londons_explore
Frankenfiles like this are made by PHP coders who have:

Download.php:

    <?php
    if (isset($_POST['whatever'])) echo $file_contents; // emit the zip bytes
    print "download page contents here"; // page HTML gets appended to the zip

~~~
mschuster91
When you're done flaming PHP, read the whole GitHub thread, and especially
comment #4: the file was produced by a .NET library.

~~~
Graphon1
Yep, it has nothing to do with PHP. It's easy to produce this kind of
frankenfile with an incorrect cp command. Or with incorrect stream management
using any zip library that emits streams.

------
Meic
See also zip bombs [0]

[0]
[https://en.m.wikipedia.org/wiki/Zip_bomb](https://en.m.wikipedia.org/wiki/Zip_bomb)

------
kazinator
What the article doesn't cover is further braindamage in the ZIP spec that
prevents a forward scan through the individual file records. This is detailed
in the Wikipedia article:

 _Tools that correctly read .ZIP archives must scan for the end of central
directory record signature, and then, as appropriate, the other, indicated,
central directory records. They must not scan for entries from the top of the
ZIP file, because only the central directory specifies where a file chunk
starts. Scanning could lead to false positives, as the format does not forbid
other data to be between chunks, nor file data streams from containing such
signatures._

~~~
762236
I wonder who here remembers self-extracting zip files under DOS/Windows? They
put the zip file into the executable file. And you could extract with a normal
unzip program, I believe.

~~~
flukus
You're probably thinking of NSIS
([http://nsis.sourceforge.net/Main_Page](http://nsis.sourceforge.net/Main_Page))
and a few others.

------
jbb555
Presumably this flaw explains why zip files are so obscure and difficult to
use. (Yeah, it's not great, but clearly it hasn't been an issue.)

------
Dylan16807
So the flaw in the spec itself is extremely minor and irrelevant to the real
complaint, which is about programs producing and accepting slightly invalid
files.

------
Aldo_MX
What is the consensus on better alternatives?

~~~
teilo
DAR. GPL, indexed, file permissions, ACLs, alternate file streams, per-file
compression, slices, hard and symbolic links, sparse files, encryption, and
parity records for recovery. Everything you could want. Not sure why it
hasn't seen much use. Maybe because it's usually thought of as a backup tool
(at which it is excellent), but it's also an excellent archival tool.

From the man page: "Dar can also use a sequential reading mode, in which dar
acts like tar, just reading byte by byte the whole archive to know its
contents and eventually extracting file at each step. In other words, the
archive contents is located at both locations, all along the archive used for
tar-like behavior suitable for sequential access media (tapes) and at the end
for faster access, suitable for random access media (disks). ... Note also
that the sequential reading mode let you extract data from a partially written
archive (those that failed to complete due to a lack of disk space for
example)."

[http://dar.linux.free.fr](http://dar.linux.free.fr)

~~~
witty_username
From [http://dar.linux.free.fr](http://dar.linux.free.fr)

> Consequently, to use the API, your program must be released under the GPL as
> well.

GPL is a bad thing for programs like this; it hampers adoption. The decoder,
at least, should be BSD/MIT/etc.

~~~
teilo
A fair point. Lacking good GUI tools, integration libraries, etc., something
like this will never gain traction. This looks to be a fairly complex code
base. Re-implementing it in a library licensed under BSD/MIT would be non-
trivial.

------
BentFranklin
There is an issue of POC|GTFO that discusses zip frankenfiles.

------
GFK_of_xmaspast
See also those tricks you can do with a file that's both a zip and a gif at
the same time.

~~~
klodolph
There are a ton of formats you can turn into "both a ZIP and X" at the same
time, because most formats have a header at the beginning of the file and ZIP
has a header at the end. Self-extracting EXEs are really just some standard
EXE which is also a ZIP file, so you can use unzip to extract them without
having to install WINE or whatever.
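
A minimal sketch of the trick: front-loaded format first, ZIP last, so each
parser finds what it expects where it looks. (Caveat: the zip's internal
offsets shift when data is prepended; many extractors compensate, while
stricter ones need the offsets adjusted.)

    const fs = require("fs");
    
    // image.gif is parsed from the front; data.zip is located from the back
    fs.writeFileSync("polyglot.gif", Buffer.concat([
      fs.readFileSync("image.gif"),
      fs.readFileSync("data.zip"),
    ]));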

~~~
xenadu02
Same applies to PDF. I believe it is possible to construct a file that is an
EXE, PDF, and ZIP all in one.

------
andrewmcwatters
Fascinating... I wonder how many .zip files in the wild have this issue and if
perhaps as a result many unarchiving applications tolerate it.

~~~
diggernet
I'm actually surprised yauzl can get away with rejecting .zip files due to
appended data. I recall that back in the BBS days downloaded files frequently
had trailing bytes on them (perhaps padding to block size, or something?).

~~~
optionalparens
You're making me feel old, I guess, but I've encountered this a bunch of
times. It was often done as a way to add metadata to things that otherwise
didn't have any, or that needed additional info. Typical use cases include
music players, trackers, drawing programs, media viewers, and anything that
scanned files for metadata (e.g. BBS software).

The SAUCE spec, for instance, adds trailing bytes after the EOF mark. I wrote
a few SAUCE reader/writers embedded in various codebases (and will release a
good stand-alone one someday soon).

Generally, the trick was that if you wanted some sort of metadata on a file,
you could add it before the contents of the file (MKV, for instance, allows
ASCII at the start) or after the EOF marker. The latter made sure you didn't
interfere with a lot of existing programs on the operating systems of the
time, including but not limited to DOS, Windows, and CP/M. This trick still
works on a lot of OSs today, in one form or another; we just don't rely on
EOF markers as much, so other methods are used, and people have started to
bake metadata into specs.

As for the BBS scene, although SAUCE was used, FILE_ID.DIZ tended to be used
more for zips; it was simply packed into the archive, where it could be found
via the internal TOC, or otherwise just uploaded separately.
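
For the curious, SAUCE is easy to probe for (a minimal detection sketch):

    const fs = require("fs");
    
    // a SAUCE record is a fixed 128 bytes at the very end of the file,
    // beginning with the literal bytes "SAUCE"
    function hasSauce(path) {
      const buf = fs.readFileSync(path);
      if (buf.length < 128) return false;
      return buf.toString("latin1", buf.length - 128, buf.length - 123) === "SAUCE";
    }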

~~~
softblush
[http://www.acid.org/info/sauce/sauce.htm](http://www.acid.org/info/sauce/sauce.htm)

[http://www.textfiles.com/computers/fileid.txt](http://www.textfiles.com/computers/fileid.txt)

------
winteriscoming
I haven't taken a look at the spec, but assuming what's noted in that issue
comment is indeed the case, I find it surprising that something this obvious
would not have been raised and fixed during an RFC process and review of the
spec itself.

~~~
gpcz
There was/is no RFC for the ZIP file format (there is one for gzip and for
the DEFLATE algorithm). It was just Phil Katz releasing a text file called
APPNOTE.TXT describing the file structure (the latest version is available at
[https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT](https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT)
).

~~~
andrewmcwatters
It almost makes .zip feel much more proprietary than it is!

