
Python – Writing large ZIP archives without memory inflation - sandes
https://github.com/BuzonIO/zipfly#zipfly
======
mehrdadn
It'd be better if they didn't require knowing the paths a priori. One of the
fundamental strengths of ZIP is that the file list is at the end of the
archive rather than the beginning, letting you dynamically discover and send
contents in a streaming fashion. Forcing the list of files to be known a
priori works against that.
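
A minimal sketch of that streaming style using only the standard library's zipfile module (not zipfly). It assumes Python 3.5 or later, where zipfile can write to a stream that cannot seek. WriteOnlyStream and discover_entries are hypothetical names made up for this illustration.

```python
import io
import zipfile


class WriteOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a socket or HTTP response: write-only, no seek."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def writable(self):
        return True

    def seekable(self):
        return False

    def write(self, data):
        self.chunks.append(bytes(data))
        return len(data)


def discover_entries():
    """Hypothetical generator: each entry is only known once it is produced."""
    for i in range(3):
        yield f"file_{i}.txt", f"contents of file {i}\n".encode()


sink = WriteOnlyStream()
with zipfile.ZipFile(sink, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
    # No file list is passed up front; names arrive while the archive is written.
    for name, payload in discover_entries():
        zf.writestr(name, payload)

archive = b"".join(sink.chunks)
```

Note that writestr still takes each entry's payload as a single bytes object, so this only avoids knowing the list of paths in advance, not holding one entry in memory.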

~~~
leni536
I have this same gripe with libzip in C.

------
kummappp
Please test your library with the c10-archive test suite
[https://www.ee.oulu.fi/research/ouspg/PROTOS_Test-
Suite_c10-...](https://www.ee.oulu.fi/research/ouspg/PROTOS_Test-
Suite_c10-archive)

~~~
wiredfool
How would one test a library for generating compressed archives with a test
designed for testing decompressors?

~~~
kummappp
Sorry. My bad.

------
2bluesc
I did something similar to stream large amounts of data off of an embedded
device via a zip file, falcon, and a wsgi server, with no external dependencies
for the actual zip stream.

Proof of concept:
[https://gist.github.com/1e22bbf31b7e5ae84bbdfa32c68e03a9](https://gist.github.com/1e22bbf31b7e5ae84bbdfa32c68e03a9)

------
jonatron
I remember needing this 5+ years ago, I used a branch of a fork of python-
zipstream [https://github.com/longaccess/python-
zipstream/tree/streamin...](https://github.com/longaccess/python-
zipstream/tree/streaminput)

------
m4rtink
Yeah, I looked into this a while ago for generating photo gallery zip files on
the fly. Currently I'm using lazygal to generate static html galleries and if
I want to give users the option to easily download the full gallery, it
basically doubles the size I need to store, as lazygal simply creates a zip
file and puts it next to the photos and html files in the static gallery
folder it generates.

So I looked into creating the zip files on the fly when requested & streaming
them to the client on the other end without having to create temporary files
and/or a lot of memory consumption on the server. I got it working on a
prototype and even found some articles from others that got it working - it is
not that hard, really.

~~~
kitotik
It was apparently hard enough that you had to build a prototype and consult
other prior write ups.

When you really need something done it’s sometimes very helpful to be able to
quickly get it done.

------
tyingq
The zip file format itself, and the compression algorithms, like deflate...all
seem set up to encourage low memory usage. The directory, for example, is at
the end of the file.

Also, most places in a zipfile you might not have context to write in a
streaming fashion are predictable in size/position so you can throw in a
placeholder and seek() back to it later.
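
The placeholder-and-seek pattern can be sketched on a toy record format. This is deliberately not the real ZIP layout: the [crc32][length][payload] record and the write_record name are assumptions made up for this illustration.

```python
import io
import struct
import zlib


def write_record(out, chunks):
    """Write [crc32][length][payload] when both values are only known at the end.

    `out` must be seekable; `chunks` is any iterable of bytes objects.
    """
    header_pos = out.tell()
    out.write(struct.pack("<II", 0, 0))    # fixed-size placeholder
    crc = 0
    length = 0
    for chunk in chunks:                   # stream the payload chunk by chunk
        out.write(chunk)
        crc = zlib.crc32(chunk, crc)
        length += len(chunk)
    end_pos = out.tell()
    out.seek(header_pos)                   # go back and fill in the real values
    out.write(struct.pack("<II", crc, length))
    out.seek(end_pos)                      # resume at the end for the next record
    return crc, length


buf = io.BytesIO()
write_record(buf, [b"hello ", b"world"])
record = buf.getvalue()
```

Only one chunk is held in memory at a time, but the output has to be seekable, which rules out sockets and pipes.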

Kind of a shame someone has to specifically implement a low memory usage
library. That implies other implementations went the lazy route.

~~~
shockinglytrue
Certain fields in the ZIP local file header require random access at write
time. Although deflate is self-delimiting, for non-compressed items you must
either know the size of the item upfront, seek after learning the size, or force
all readers to fully buffer the ZIP before decompressing any entry, in order
to access the size stored in the central directory at the end of the ZIP.

The CRC field of the local file header is similar, but its optionality is less
painful than the length field.

In other words, there isn't a single 'good' ZIP implementation, it all depends
on what you're aiming for. AFAIK the built-in Python zipfile module always
fully populates the local file header.

~~~
zmodem
> Certain fields in the ZIP local file header require random access at write
> time.

Isn't that what bit 3 in the general purpose bit flags is for? When that's
set, the CRC and compressed/uncompressed file lengths are written in a Data
Descriptor block after the member file data.
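
A hand-rolled sketch of what that looks like on the wire, assuming stored (uncompressed) entries, ASCII names, a fixed timestamp, and no ZIP64. stream_stored_zip is a hypothetical name, and this is an illustration of the record layout rather than a production writer.

```python
import io
import struct
import zlib
import zipfile


def stream_stored_zip(entries):
    """Yield the bytes of a ZIP archive piece by piece, never seeking backwards.

    `entries` is an iterable of (name, iterable_of_byte_chunks).
    """
    offset = 0
    central = []
    dos_time, dos_date = 0, (1 << 5) | 1   # fixed timestamp: 1980-01-01 00:00:00
    flags = 0x08                            # bit 3: CRC and sizes follow the data

    for name, chunks in entries:
        raw_name = name.encode("ascii")
        header_offset = offset
        # Local file header with CRC and both sizes left as zero.
        header = struct.pack("<4s5H3L2H", b"PK\x03\x04", 20, flags, 0,
                             dos_time, dos_date, 0, 0, 0, len(raw_name), 0)
        yield header + raw_name
        offset += len(header) + len(raw_name)

        crc = size = 0
        for chunk in chunks:
            crc = zlib.crc32(chunk, crc)
            size += len(chunk)
            offset += len(chunk)
            yield chunk

        # Data descriptor: the values that could not go into the header.
        descriptor = struct.pack("<4s3L", b"PK\x07\x08", crc, size, size)
        yield descriptor
        offset += len(descriptor)
        central.append((raw_name, crc, size, header_offset))

    cd_start = offset
    for raw_name, crc, size, header_offset in central:
        record = struct.pack("<4s6H3L5H2L", b"PK\x01\x02", 20, 20, flags, 0,
                             dos_time, dos_date, crc, size, size,
                             len(raw_name), 0, 0, 0, 0, 0, header_offset)
        yield record + raw_name
        offset += len(record) + len(raw_name)

    # End of central directory record.
    yield struct.pack("<4s4H2LH", b"PK\x05\x06", 0, 0, len(central),
                      len(central), offset - cd_start, cd_start, 0)


blob = b"".join(stream_stored_zip([("a.txt", [b"hello ", b"world"]),
                                   ("b.bin", [b"\x00" * 10])]))
names = zipfile.ZipFile(io.BytesIO(blob)).namelist()
```

Stored entries with a data descriptor are exactly the case a purely streaming reader struggles with, since nothing in the local header says where the data ends. A reader that starts from the central directory, like Python's zipfile here, has the sizes available.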

~~~
shockinglytrue
It's been a few years, but I seem to remember there simply was no way to
detect end of non-compressed content lacking a length header without first
reading the TOC except substring search. This would quickly get very messy
when the non-compressed content might contain another embedded ZIP, etc.

~~~
zmodem
That's during extraction though. There's no way to reliably decompress a ZIP
without reading the Central Directory first. While scanning for the Local File
Headers works for most files, it's not guaranteed to be correct since it may
not match what's in the Central Directory. Also as you point out it doesn't
work when the file length is not in the LFH.

Streaming zip creation is well supported by the format though.

------
mrbonner
I wrote a servlet that streamed gigabytes of zipped data to HTTP clients in
2005. Something like this was baked into the JDK a while ago. Why is it an
achievement for Python, though?

~~~
perraco
I think you can get the size of the zip before creating it.
[https://github.com/BuzonIO/zipfly/blob/master/zipfly/zipfly....](https://github.com/BuzonIO/zipfly/blob/master/zipfly/zipfly.py#L119)

~~~
plusperraco
smart! but always using ZIP_STORED?
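
A sketch of why the size is predictable when nothing is compressed. This is not zipfly's code. It assumes stored entries, no ZIP64, no extra fields or comments, and the layout CPython's zipfile produces when writing to a seekable file; predicted_stored_zip_size is a hypothetical name.

```python
import io
import zipfile


def predicted_stored_zip_size(entries):
    """entries: iterable of (archive_name, payload_size_in_bytes)."""
    total = 22                              # end of central directory record
    for name, size in entries:
        name_len = len(name.encode("utf-8"))
        total += 30 + name_len + size       # local file header + name + raw data
        total += 46 + name_len              # central directory header + name
    return total


# Compare the prediction with an archive actually written by zipfile.
payloads = [("a.txt", b"x" * 1000), ("dir/b.bin", b"")]
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_STORED) as zf:
    for name, payload in payloads:
        zf.writestr(name, payload)

actual = len(buf.getvalue())
predicted = predicted_stored_zip_size((n, len(p)) for n, p in payloads)
```

With a compression method such as deflate the output size depends on the data, so it cannot be computed from the file sizes alone. A writer that emits data descriptors would also add a fixed number of bytes per entry.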

------
nerdbaggy
I wish windows supported more than just zip natively. I would love something
like a tar. So much easier to generate on the fly than compressed formats.

~~~
johndough
If all you want is to generate zip files and not read them, you can use
uncompressed blocks, barely more complicated than tar and a lot faster and
less memory hungry than many compression libraries:
[https://tools.ietf.org/html/rfc1951#page-11](https://tools.ietf.org/html/rfc1951#page-11)
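
A minimal sketch of those uncompressed blocks in Python, following RFC 1951 section 3.2.4. deflate_stored is a hypothetical name, and zlib is used only to check that the result is a valid raw DEFLATE stream.

```python
import struct
import zlib


def deflate_stored(data, block_size=0xFFFF):
    """Wrap `data` in uncompressed DEFLATE blocks (BTYPE=00), no compression."""
    out = bytearray()
    pos = 0
    while True:
        block = data[pos:pos + block_size]   # at most 65535 bytes per block
        pos += len(block)
        final = pos >= len(data)
        # One header byte (BFINAL bit, BTYPE=00, padding), then LEN and its
        # one's complement NLEN as little-endian 16-bit values.
        out += struct.pack("<BHH", 1 if final else 0,
                           len(block), len(block) ^ 0xFFFF)
        out += block
        if final:
            return bytes(out)


sample = bytes(range(256)) * 600             # spans several blocks
raw = deflate_stored(sample)
roundtrip = zlib.decompress(raw, -15)        # negative wbits: raw DEFLATE stream
```

The overhead is five bytes per block, and the writer needs no compression state at all.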

I've been using this for a javascript bookmarklet (< 2000 characters) which
automatically downloads all images from a web page on click.

~~~
acqq
And if one wants to list the content of the archives, zip is better and orders
of magnitude faster, as it also has a "central directory" so one doesn't have
to read the whole archive to get the list.

I've used that "central directory" approach even over the internet with a
great success (downloading just that segment instead of the whole archive to
get a list of what is inside).
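
A sketch of listing an archive from its tail alone. It assumes no ZIP64, no archive comment containing the signature, and UTF-8 or ASCII names; locate_central_directory and list_names are hypothetical names. Over HTTP the tail and the directory would be fetched with range requests, which is not shown here.

```python
import io
import struct
import zipfile


def locate_central_directory(tail):
    """Given the last bytes of a ZIP file, return (entry_count, cd_size, cd_offset)."""
    pos = tail.rfind(b"PK\x05\x06")          # end of central directory signature
    if pos < 0 or len(tail) - pos < 22:
        raise ValueError("end of central directory record not found")
    fields = struct.unpack("<4s4H2LH", tail[pos:pos + 22])
    n_total, cd_size, cd_offset = fields[4], fields[5], fields[6]
    return n_total, cd_size, cd_offset


def list_names(central_directory):
    """Parse file names out of raw central directory bytes."""
    names = []
    pos = 0
    while central_directory[pos:pos + 4] == b"PK\x01\x02":
        # Name, extra and comment lengths sit at byte 28 of each 46-byte header.
        name_len, extra_len, comment_len = struct.unpack_from(
            "<3H", central_directory, pos + 28)
        name = central_directory[pos + 46:pos + 46 + name_len]
        names.append(name.decode("utf-8", "replace"))
        pos += 46 + name_len + extra_len + comment_len
    return names


# Build a test archive, then pretend only its last 256 bytes were fetched first.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("photos/one.jpg", b"\xff" * 5000)
    zf.writestr("photos/two.jpg", b"\xee" * 5000)
blob = buf.getvalue()

count, cd_size, cd_offset = locate_central_directory(blob[-256:])
listing = list_names(blob[cd_offset:cd_offset + cd_size])
```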

But how did you make a bookmarklet to produce a zip archive, even non-
compressed?

~~~
johndough
> But how did you make a bookmarklet to produce a zip archive, even non-
> compressed?

Not quite sure I understand the question. Bookmarklets are bookmarks which
contain JavaScript code, for example:

javascript:alert('Hello, World!');

You can put that in a link

<a href="javascript:alert('Hello, World!');">Bookmarklet</a>

and then you can right-click the link to bookmark it. Now, when you want to
run that JavaScript code on some website, visit the website and click on the
bookmark.

~~~
acqq
I know bookmarklets basics. What I wanted to know is which API calls you used
inside of the < 2000 character bookmarklet to achieve the functionality you
described. I also believed that the bookmarklets are limited to the security
context of the web page and I don't understand how you did all that, if I
understood you correctly you are generating zip out of the images from the web
page using only the bookmarklet?

Do you, or do you do something much simpler?

~~~
johndough
> I know bookmarklets basics. What I wanted to know is which API calls you
> used inside of the < 2000 character bookmarklet to achieve the functionality
> you described.

To create the archive data, I used a Uint8Array that I wrote the bytes into.

To download the images, I used XMLHttpRequest.

> I also believed that the bookmarklets are limited to the security context of
> the web page

That seems to be true unfortunately. However, besides the images on the same
domain, it should also be possible to download some more images from other
domains by allowing cross origin requests if the remote server cooperates and
sets the respective header, but I have not looked into that yet.

The bookmarklet is here if you have more questions:
[https://github.com/983/FileDownloader](https://github.com/983/FileDownloader)

It is a bit larger than 2500 characters now since previously the code was
optimized for size. Now it is slightly more readable.

------
eggsnbacon1
Is this an issue in new versions of vanilla Python? I mostly bang on crusty
old Java stuff and zip streaming has been built into JDK for decades.

~~~
toyg
Zip support in cpython stdlib has been there forever (I think it might even be
from 2.0 times, early ‘00s) but it’s never been very user-friendly or
particularly advanced. It’s one of those things like timezone and http, where
the implemented API didn’t really come out “right” and there has been some
pain ever since. It’s a bit like SSL stuff in JDK: something that should
instinctively be very easy just... isn’t.

