It's not yet open-sourced but it has defenses against excessive compression ratios, mismatching local and central directory headers, ambiguous filenames, directory traversals and symlink traversals, and anything ambiguous that could exploit differences in zip implementations, e.g. some zip implementations decode from the front while others decode from the back. Most importantly, it balks at deviations from the zip format, including any kind of overlapping or sparseness, buffer bleeds etc.
At least it detected all three of the author's samples as malicious.
But I don’t recall ever mentioning that fact to someone who already knew it. While you could probably get away with rejecting that file (who still uses that? Some sort of streaming protocol?), it was a feature at one point.
These "invisible" dead zones can be used for malware stuffing, or to exploit ambiguity across different zip implementations, those that parse forwards (using the local file header as the canonical header) and those that parse backwards (using the central directory header as the canonical header).
For example, a malware author might put an EXE in the first version of a file, and a TXT in the second version of that same file. Those that parse forwards get the EXE. Those that parse backwards get the TXT. Of course, the spec advocates parsing backwards according to the central directory record, but implementations exist that don't do this.
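To make the forwards/backwards ambiguity concrete, here is a rough Python sketch that lists an archive's names both ways and flags any disagreement. The helper names (`names_forward`, `names_backward`, `is_ambiguous`) are made up for illustration and are not how @ronomon/zip does it. The forward walk assumes no data descriptors, no ZIP64 and no encryption.

```python
import io
import struct
import zipfile

LOCAL_SIG = b"PK\x03\x04"
LOCAL_FMT = "<4sHHHHHIIIHH"  # fixed 30-byte part of a local file header


def names_backward(data: bytes) -> list[str]:
    """Names as seen via the central directory (what zipfile trusts)."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        return zf.namelist()


def names_forward(data: bytes) -> list[str]:
    """Names as seen by walking local file headers from offset 0.

    Simplifying assumptions: no data descriptors, no ZIP64, no encryption.
    """
    names, pos = [], 0
    while data[pos:pos + 4] == LOCAL_SIG:
        (_sig, _ver, flags, _method, _time, _date, _crc,
         csize, _usize, nlen, elen) = struct.unpack_from(LOCAL_FMT, data, pos)
        raw_name = data[pos + 30:pos + 30 + nlen]
        # General-purpose flag bit 11 marks UTF-8 names; otherwise CP437.
        names.append(raw_name.decode("utf-8" if flags & 0x800 else "cp437"))
        pos += 30 + nlen + elen + csize
    return names


def is_ambiguous(data: bytes) -> bool:
    """True when the two parsing directions disagree about the contents."""
    return names_forward(data) != names_backward(data)
```

A quick way to try it: concatenate two small archives. A forward parser sees the first archive's file, while Python's `zipfile` (which reads from the end) sees the second's.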
The goal of @ronomon/zip is to scan email attachments at the gateway and reject zip archives that might prove dangerous to more vulnerable zip software running downstream (MS Office, macOS) etc.
Also, as you say, I don't think incrementally updated archives are used much. From what I could see, there were no false positives for rejecting gaps between referenced local files on a small sample of 5000 archives.
Some have enough confidence and are willing to do it, others would rather only show the final piece and be free to not publish anything if they aren't happy with the result. Both stances are ok.
Windows 10 then began doing the same thing for Windows Defender, but some sane limits aborted it after a few seconds.
I worked on productizing a code signing tool a while back, and I believe the first thing I did after we got it working was change it so nothing touched the disk until after the signature had been validated. (In this case the signers had business relationships with each other; this would be necessary but insufficient for downloads from the internet.)
There were already well known CERT advisories about how relative paths can confuse archive tools, email tools and web servers from Microsoft. Know history or repeat it.
I didn’t know “zip bomb” as a phrase but I knew a good bit about compression, so when I needed to fix a problem with zips over 2G I managed to make myself a test fixture that was around 80k without modifying the file format. I think it was just 2.01G of white space.
Not really. For a normal ZIP created in one shot, yes. But the types of ZIPs in the article are going to act differently if you try streaming them. The core idea of the article is overlapping the various files within the ZIP, s.t. they share bytes. But this is only apparent if you're using the central directory, which you can't if you're streaming, since it appears after all the data. If you're streaming, you're using the local file headers, but for ZIPs such as those in the article, you will see far fewer LFHs than if you had looked them up in the central directory, b/c they overlap. (In the streaming case, I think you'd see exactly 1 file. It would still be huge, given the other tricks in the article.)
Also, I think you can "append" to ZIPs (this is why the central directory is at the end, s.t. it can be overwritten by new data, and then re-appended). I think this approach also allows tools to "delete" data by simply removing the entry from the central directory and re-appending the directory without it, so the central directory is essentially the authoritative source for the ZIP's contents. (Though I suppose a streaming decompressor could decompress to a temporary location and then only move non-deleted entries into their final place.)
The Wikipedia page echoes this:
> Tools that correctly read ZIP archives must scan for the end of central directory record signature, and then, as appropriate, the other, indicated, central directory records. They must not scan for entries from the top of the ZIP file, because (as previously mentioned in this section) only the central directory specifies where a file chunk starts and that it has not been deleted. Scanning could lead to false positives, as the format does not forbid other data to be between chunks, nor file data streams from containing such signatures.
> I don’t understand why you would extract the archive before checking the contents?
Zip is a streamable format but it also supports random access through a table of contents (the "central directory") located at the end of the file. This bomb works by overlapping the file offsets in the table of contents.
As to why they're writing the contents to disk I can only speculate. Perhaps they're using a library that doesn't expose an "extract to memory" feature, or maybe it's an anti-zipbomb measure to avoid out-of-memory/denial-of-service attacks.
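For the overlapping-offsets trick mentioned above, a checker can compute the byte span each central directory entry claims and look for collisions. This is only a sketch under my own assumptions: `find_overlaps` is a hypothetical helper, it trusts the name and extra lengths found in each local header, and it ignores data descriptors.

```python
import io
import struct
import zipfile


def find_overlaps(data: bytes) -> list[tuple[str, str]]:
    """Return pairs of neighbouring entries whose byte spans collide."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        infos = zf.infolist()

    spans = []
    for info in infos:
        start = info.header_offset
        # Name and extra-field lengths sit at byte 26 of the local header.
        nlen, elen = struct.unpack_from("<HH", data, start + 26)
        end = start + 30 + nlen + elen + info.compress_size
        spans.append((start, end, info.filename))

    # After sorting by start, any overlap at all shows up between neighbours.
    spans.sort()
    overlaps = []
    for (_s1, e1, n1), (s2, _e2, n2) in zip(spans, spans[1:]):
        if s2 < e1:
            overlaps.append((n1, n2))
    return overlaps
```

An empty result means no two entries share bytes. Comparing only neighbours is enough to detect that some overlap exists, though it does not enumerate every overlapping pair.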
Zip format can be de/compressed progressively, which is one reason why it’s nice for HTTP transport encoding. The file format is decompressed one record at a time and many or most libraries can give you this as a stream, so it never has to hit disk or be “sent to dev/null”.
If you take responsibility for streaming the records to disk (trivial), then you can check the canonical path before writing, and any other filesystem sanity tests you want to do.
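A minimal sketch of the "check the canonical path before writing" step, assuming you drive the extraction loop yourself. `safe_target` is a name I made up, not part of any zip library.

```python
import os


def safe_target(dest_dir: str, member_name: str) -> str:
    """Resolve where a member would land and refuse anything outside dest_dir."""
    root = os.path.realpath(dest_dir)
    # realpath collapses ".." segments and follows existing symlinks.
    target = os.path.realpath(os.path.join(root, member_name))
    if os.path.commonpath([root, target]) != root:
        raise ValueError(f"member escapes extraction directory: {member_name!r}")
    return target
```

Call it for every record before opening the output file. Names such as `../evil` or an absolute path raise `ValueError` instead of being written.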
> Zip format can be de/compressed progressively, which is one reason why it’s nice for HTTP transport encoding.
Do you mean HTTP transfer encoding? If so then it's not the zip archive format that's used, but rather the deflate compression algorithm (which zip also uses.)
> The file format is decompressed one record at a time
But not necessarily in the order they appear.
> many or most libraries can give you this as a stream, so it never has to hit disk or be “sent to dev/null”.
My point is that the compressed bytes have to be decompressed and checksummed in both extraction and checking, but after that the bytes may either be written or discarded.
> If you take responsibility for streaming the records to disk (trivial), then you can check the canonical path before writing, and any other filesystem sanity tests you want to do.
That's true but there's nothing wrong with the paths in this case.
I'm afraid to even download it now...
I prefer this functionality as most of the time I do want to unarchive it. I can rearchive it later (or remember to use another browser) when I need to.
MS Windows' transparent zip treated as folders works well (it's the same on Kubuntu), one could just do that and pre-cache an uncompressed version; then you get to use the file or open the folder with minimal friction?
Also, fun bonus fact re: checksumming:
Apple uses the https://en.m.wikipedia.org/wiki/Xar_(archiver) file format for their own software downloads (e.g. App Store downloads; and developer-tools packages from the Apple developer website; and downloading Safari extensions, before those were rolled into the App Store; etc.). Despite Apple being seemingly the sole user of .xar, it’s not an Apple-specific format; rather, it’s developed by OpenDarwin. So you can use it too (for, at least, your macOS-targeted downloads), if you like.
A .xar file contains embedded checksums (both for the archival representations of each file, and for their extracted representations); when Safari auto-unpacks a .xar, the .xar unpacker (Archive Utility?) verifies those checksums as it does so. IIRC, if the verification fails, the extraction stops, what has been extracted so far is deleted, and the user is told the archive is broken and asked whether they want to keep it or move it to the Trash.
A neat thing about .xar extraction, is that it seemingly tags the extracted files with an xattr declaring that they’ve already been checksummed. Apple ships applications like “Install macOS Whatever.app” as a .xar containing an .app bundle containing several mountable .dmg files; normally those .dmg files would do their own checksumming when they mount, but since they came out of a .xar, they know they’ve already been checksummed recently, so they just skip the internal checksumming step. (I think this is one of the main reasons Apple chose to move to .xar; they wanted to be able to make the macOS Installer run faster, by having it not have to do any checksums of its support .dmg files during install.)
So that’s the deeper answer to your question: ultimately, Apple expects people who want archives with checksums, to use .xar or a format like .xar, that does checksumming during extraction.
Purely my hypothesis: it’s too high a risk that someone accidentally leaves a password or private info in their clipboard. People don’t expect a clipboard to persist, so you’d need to re-educate everyone to avoid this “bug”, just like browser history and incognito mode.
Apple is notorious for assimilating popular third party extensions. Screencapture, night shift, colour picker, so why no stack-based clipboard? It’s too useful to have been overlooked. Must have been a conscious decision.
The only reason I keep Chrome around is for the growing number of sites that only work with Chrome. Apps made by Google, in particular, increasingly don't support macOS/Safari. Which I find infuriating, but that's another topic.
When a browser helpfully decompresses the archive you can no longer perform this check.
But if someone replaces the archive with a malicious file that decompresses normally, they will probably also change the checksum listed on the download page...
Many Linux distro ISOs are available from several different mirrors. Having a secure hash on the original site with the links to mirrors means I don't have to trust the mirror(s).
Another case where the archive hash is useful is when there's some public key crypto involved. I can have a public key from a publisher (gotten either out-of-band or in the past) and the hash can be signed so I can verify it. These schemes would mean that an attacker would at the least need to have compromised a site for an extended period of time (if I have history with the site, the first visit it doesn't do anything extra), or in the case of out-of-band key sharing, multiple communication methods might need to be compromised for an attack to succeed.
But yes, in the common case a hash next to a file link hosted on the same domain really doesn't do anything.
You can compare the checksum to the one on the author's site to ensure the mirror provider didn't alter the file.
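For the mirror case, the comparison is a few lines of Python. This sketch assumes the author's site publishes a SHA-256 hex digest; the function names are mine.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large ISOs do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path: str, published_hex: str) -> bool:
    """Compare the local file's hash with the digest copied from the author's site."""
    return hmac.compare_digest(sha256_of(path), published_hex.strip().lower())
```

The check has to run against the archive as downloaded, which is why a browser that auto-extracts and discards the original gets in the way.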
As for malware that would be unzipped when using an external zip file: first you would need to trigger the zip bomb on Defender but not the external tool, and second, Defender will still scan the individual files getting unzipped by that tool.
The other possibility I see harks back to algorithms - build a DAG from the archive's interfile dependencies and run an iterative deepening search through the structure against some heuristic checking for malicious design.
I've never gotten into anything security related (other than reading Schneier's blog) but the cat and mouse game is fascinating.
I wonder how Windows Defender would treat something that looked like a self-extracting archive? Perhaps the archive portion could be this zip bomb affair, but the executable portion had a small change in it to bypass that and do something else nefarious instead, eg hand execution control to a point later in the file.
One should also do a Lua nginx plugin for that: aggressive crawler? Comment spammer? Take this nice gzip HTTP response...
EDIT: nope, streaming doesn't work: the zip relies on the fact that it contains many files, and gzip assumes there is only one big blob.
EDIT 2: tried with zlib but it expects a different header. So my guess is you really need to open it as an archive.
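On the "different header" point: ZIP members, zlib streams and gzip streams all carry DEFLATE data, but with different framing, and Python's `zlib` module selects between them with `wbits`. A small sketch (the constant and function names are mine) showing that a decoder set up for one framing rejects another:

```python
import zlib

RAW_DEFLATE = -15  # bare DEFLATE stream, as stored in a ZIP member
ZLIB_FRAMED = 15   # 2-byte header + Adler-32 trailer (zlib.decompress default)
GZIP_FRAMED = 31   # gzip header + CRC-32 trailer (Content-Encoding: gzip)


def deflate(data: bytes, wbits: int) -> bytes:
    """Compress data using the framing selected by wbits."""
    compressor = zlib.compressobj(9, zlib.DEFLATED, wbits)
    return compressor.compress(data) + compressor.flush()


def inflate(blob: bytes, wbits: int) -> bytes:
    """Decompress blob, expecting the framing selected by wbits."""
    return zlib.decompress(blob, wbits)
```

Feeding `deflate(data, RAW_DEFLATE)` to `inflate(..., ZLIB_FRAMED)` raises `zlib.error`, which is presumably the failure seen above. It also means a single ZIP member could in principle be re-framed as gzip, but the multi-file overlap trick has no equivalent there.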
SSH however works with zero trust. Clients are protected from bad servers just as servers are protected from bad clients. It shouldn't be possible to send a file. If it is, it is a serious ssh vulnerability.
Most bots won't unzip a file they download.
But they will decompress an SSL packet.
Usually aimed against bots though: https://hackaday.com/2017/07/08/dropping-zip-bombs-on-vulner...
But then two questions sprang to mind:
1. Does this eventually get your domain marked as potentially harmful in Firefox/Chrome/other browser?
2. What happens if you're fronted by a CDN like Cloudflare? I mean, I assume nginx won't be screwed over by this but, even then, will it infuriate your CDN provider and put you at risk of getting your account shut down?
My fit of vengeful glee has therefore somewhat abated for the time being.
2. You create an exception so that they never cache the page and don't proxy this exact URL.
Better yet, mark it Disallow in robots.txt - to see "noindex, nofollow", they'd still need to request the URL, running the risk to be served with the bomb.
> 2. You create an exception so that they never cache the page and don't proxy this exact URL.
They work as reverse proxies on a per-host basis, so I don't think you can exclude a single URL. CF at least will never cache text/html (unless specifically told to), but I don't know whether they will unpack (and possibly cross-compress to a better suited compression algorithm) the content while transmitting.
My experience is that most bots just hit the usual suspects, /wp-login.php, /phpmyadmin/ etc, regardless of whether they are in robots.txt or not.
Yeah, basically what I see in my logs. To be more clear, the disallow is for a non-existent path in the document dir. I somewhat expected to find at least one script to actively crawl it, but it makes sense, as no sane person would put secrets on a website and protect them with a robots.txt... ^__^;
The Internet didn't exist yet in my country; it took me 3 years to figure out how to open that file, and in the meantime I infected my computers multiple times with tons of viruses (seemingly packing viruses in unzippers was popular... the one with the most viruses was "pkunzip" or something like that)
No. You may be thinking of gzip, a spiritual successor and replacement for the Unix compress/uncompress (and even earlier pack/unpack).
But both ZIP and ARJ, and the earlier ARC, all made multiple-file archives.
Or did you mean that the compression was “carried over” from file to file inside the archive?
So which of these files which are really zip do browsers or mail programs auto-open? Anyone think of any?
They use zips as embedded file systems with well-specified structures, I wouldn’t expect them to blindly decompress everything. The format usually defines specific entry points (specially named files) which serve as pointers to the relevant information and link to other files. The bomb part would mostly be ignored filler there.
There are various sandboxing technologies from using simple rlimit, or a cgroup, or even running a full VM (see libvirt-sandbox) depending on the threat level and the amount of effort you want to put in vs the perceived risk.
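As a sketch of the simplest option (plain rlimits), the extractor can be run as a child process with hard limits applied just before it starts. This is Unix-only, `run_limited` is a hypothetical helper, and `preexec_fn` is not safe in multi-threaded parents, so treat it as an illustration rather than a hardened sandbox.

```python
import resource
import subprocess


def run_limited(argv, limits, timeout=30):
    """Run argv as a child process with the given resource limits.

    limits maps resource constants (e.g. resource.RLIMIT_FSIZE) to a value
    used as both the soft and the hard limit.
    """
    def apply_limits():
        # Runs in the child between fork and exec, so the parent is unaffected.
        for res, value in limits.items():
            resource.setrlimit(res, (value, value))

    return subprocess.run(argv, preexec_fn=apply_limits,
                          timeout=timeout, capture_output=True)
```

For example, passing `{resource.RLIMIT_FSIZE: 100 * 1024 * 1024}` caps any single file the extractor writes at 100 MiB. `RLIMIT_AS` and `RLIMIT_CPU` can be added the same way for memory and CPU time.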
This is not invalid itself of course - some compression programs likely deduplicate files with this technique. But if it seems excessive, or it's the only thing in the archive then you've got a zip bomb.
You could probably come up with some techniques to obfuscate this of course but it'll increase the size of the archive.
Only this week did I discover that yahoo and gmail don't let you send zip attachments. I thought this was a bit silly but now I am agreeing with them!
In case of ZIPs: an implementation can look through the Central Directory index of files and sum up "uncompressed size" fields of all the files, and then check the sum vs the set limits - no prior decompression is needed (this is neither CPU intensive nor requires a lot of memory allocations).
The obvious "gotcha" here is that the "uncompressed size" might be declared low, while the actual data inside the compressed stream might be much higher - this is detectable only when trying to decompress, so it would seem we would fall into your idea (memory allocation / CPU counters). But that actually is not needed, as all good decompression libraries have functions to "decompress at most N bytes" - so the implementation just uses the previously declared "uncompressed size" as the limit, and therefore guarantees that the actual total decompressed size is within the checked (in previous step) total limit.
That said, I do recognize that some decompression algorithms might have possible inputs which get really CPU intensive even for a single byte, though that's not the case for typical "DEFLATE"-using ZIPs (i.e. you probably might structure the decompression stream in a way that does a lot of cache misses, but that's about it).
For non-DEFLATE compression YMMV and your method comes to mind as a decent solution.
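The two-step check described above (sum the declared sizes, then cap each inflate at the declared size) could look roughly like this in Python. The function names are mine, and the second step handles raw DEFLATE only.

```python
import io
import zipfile
import zlib


def declared_total(data: bytes, max_total: int) -> int:
    """Sum the central directory's uncompressed sizes without decompressing."""
    with zipfile.ZipFile(io.BytesIO(data)) as zf:
        total = sum(info.file_size for info in zf.infolist())
    if total > max_total:
        raise ValueError(f"archive declares {total} bytes, limit is {max_total}")
    return total


def inflate_at_most(raw: bytes, declared_size: int) -> bytes:
    """Inflate a raw DEFLATE stream, refusing to exceed its declared size."""
    decompressor = zlib.decompressobj(wbits=-15)
    # Ask for one byte more than declared: getting it proves the header lied.
    out = decompressor.decompress(raw, declared_size + 1)
    if len(out) > declared_size:
        raise ValueError("stream expands past its declared size")
    return out
```

Because `max_length` bounds the output buffer, a lying header costs at most `declared_size + 1` bytes of memory before it is caught.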
If the goal is to avoid zip bombs you can't get any more sloppy than that because then an attacker could exploit inaccuracies in your estimator.
Same goes for files that can crash your computer, they can send that to non-complying victims. Key is to waste as much of their time as possible while giving them as little value as possible.
1. ZIP archive has multiple files.
2. ZIP is an "index+pointers" based format (meaning the Central Directory index of archive files is basically a table of pointers - or rather offsets - telling where to look for data inside the file).
Thanks to these two properties David could create a very clever compressed stream that could be (partially) re-used by multiple files inside the archive.
While one could argue that PNG does meet the first criterion (multiple separate compressed blocks - vide https://www.w3.org/TR/PNG/#10CompressionOtherUses - do note that multiple IDATs make a single compressed stream, so one has to use these other separate blocks like iTXt, iCCP or zTXt; YMMV for animated PNG extensions), it certainly doesn't meet the second one - it's a block/chunk format (and by definition blocks are unable to overlap).
One note here is that in case of a faulty block/chunk format parser implementation - one with integer signedness/overflow problems related to block size - one might be able to pull an overlapping block trick (see Bug 2 in https://gynvael.coldwind.pl/?id=533 for an example in a different file format).
> A final plea
> It's time to put an end to Facebook. Working there is not ethically neutral: every day that you go into work, you are doing something wrong. If you have a Facebook account, delete it. If you work at Facebook, quit.
> And let us not forget that the National Security Agency must be destroyed.
Personally I do agree. By the way, I got to meet the author (David Fifield) in person. An extremely bright mind!