Hacker News new | past | comments | ask | show | jobs | submit login
Why are tar.xz files 15x smaller when using Python's tar compared to macOS tar? (superuser.com)
505 points by personjerry 8 months ago | hide | past | favorite | 150 comments



Sorting the files in an archive is a great trick to know for efficiency ( https://www.gwern.net/Archiving-URLs#sort-key-compression-tr... ); the more redundant your files are, the better it works and the savings can be enormous. Two near-duplicate files next to each other in a compressor's stream are basically free; space them by a few dozen megabytes, and you pay full-price...


A good ugly hack is to reverse the filenames before sorting them, which has the side effect of grouping all the same file extensions together.


7za will do this for you when running with the -mqs option.


Why isn't this the default? Or does it mean you have to use another option when de-7zipping in order to reverse them back?


It's not always beneficial. If directories contain groups of files with closely related content but different extensions, keeping those groups together may yield better compression results than reordering the files by extension.


Isn't the whole point of compression algorithms to find string of bytes that are the same? Why does it struggle if its out of order or not next to each other? Seems simple to detect and handle?


It depends on a lot on the compressor. Ancient junk like gzip/deflate only has a 32kb window, more modern compressors like zstd can do 512mb or more.

Larger window means potentially much more work during compression, and larger RAM requirements during decompression, so its not exactly free


> Ancient junk like gzip/deflate only has a 32kb window

Yes, a reflection of the time that they were first created, but hardly junk. gzip/DEFLATE have saved millions if not billions of hours of human time over the past few decades. Better things exist now, but it's unnecessarily derogatory to call something that significant "junk." And by being industry standards, they have made compression available in many more places than it used to be, even if they are not state of the art. Better to have some reasonable compression than none.


Picture the effect of compression ratios or decompression performance improving 1% globally, or conversely, if planes were replaced with a Wright Brothers model or cars with the T1. When discussing Deflate it's not just 1%, but regularly 200% or 2,000%, nor is it something people use once per month, but practically every time they interact with digital electronics.

Regressing to the T1 is unthinkable because they're long out of fashion and widely perceived as obsolete. Every change starts with a shift in perception, a process accelerated by a shift in language. For that I'm content with "junk", it describes exactly how we must feel about this dead technology before we'll ever be rid of it.

I'll gladly reminisce once we've escaped its legacy, be it time wasted installing an Android app and the effect it has on the battery (multiplied by billions of users), the time to open an Excel document, power up some smart TVs, fetch an itemized AWS bill over a typical S.E. Asian Internet connection, or grep that bill at speeds appropriate for a state of the art laptop manufactured this side of the millennium. You expect me to pay respect to software that wastes the finite breaths of billions of real people every single day.


So, I wanted to check so I downloaded linux source tree and ran gzip and zstd.

    Uncompressed .tar:    1064376320
    .xz (downloaded file): 117637692 (88.9%)
    .gz:                   189299444 (82.2%)
    .gz (level 9):         186533230 (82.5%)
    .zst:                  178012638 (83.3%)
    .zst --ultra -22:      119676867 (88.8%)
So, even at maximum level, zstd is only about 6% better, or about 37% if you consider .gz as baseline, while being about 10x slower than gzip level 9. There are places where having a smaller compressed file is worth the cost of 10x execution time (and much higher memory usage), but serving web pages is almost certainly not one of them.

In other words, the reason gzip is still being widely used is simply because it's good enough for its use cases. It's less like the original T1, and more like complaining that people are still driving Toyota Camry 2011 when Toyota Camry 2021 is a so much better car.


For the pile of XML in front of me just now (originally arrived in a ZIP container aka deflate, part of a much larger set), it's 1.4gb decompressed, 44M gzipped/deflated (default settings), 31mb zstd (default), 5.379s to decompress with gzip, 0.560s to decompress with zstd. Just under an order of magnitude difference in the most frequent operation (decompression).

Admittedly compressors generally love XML, this is just one example -- 28% less time on download and 89% less wasted on file open. Multiply by a few tens to a few thousand occurrences per week for 7.6 billion people, and I really struggle to call that a Camry.


You're admittedly talking about a difference in compression rate of 13MB in a 1400MB package, and a 5s difference in writing a 1400MB file to disk.

Come on, let's be honest here: that's nitpicking.

Unless you are running batch processes that store TB of compressed data, no one would even bother to switch app for thosd residual gains.

Let's put it this way: would you get any VC fund if your sales pitch was "I can improve compression in a 1400MB package by 13MB and shorten file write times by 5 seconds"? Odds are, you'd be asked why not just gzip it and get it over with.


> You're admittedly talking about a difference in compression rate of 13MB in a 1400MB package, and a 5s difference in writing a 1400MB file to disk.

> Come on, let's be honest here: that's nitpicking.

I agree on the compression size (13 megabytes is really not very much difference) but the decompression speed improvement really is remarkable. It's an order of magnitude of difference! Amortize that over every time you make a webrequest that has to decompress data, it makes a huge difference.

I'm mostly in the "gzip is good enough" camp, but a speed improvement of 10x is not nitpicking.


> Let's put it this way: would you get any VC fund if your sales pitch was "I can improve compression in a 1400MB package by 13MB and shorten file write times by 5 seconds"? Odds are, you'd be asked why not just gzip it and get it over with.

I agree but have a different takeaway: it’s why VCs aren’t the be all and end all. Such an improvement is worth the time invested to create it. It won’t change the world but spanned across the entire globe it’s a very notable step forward. That a VC can’t get hockey stick growth and a lucrative return out of it doesn’t invalidate the idea.


5.3s at 1.4gb on a laptop is 53ms at 14mb, is >70ms on mobile before adding any increased download time. 14mb is one retina image, even Reddit already weighs 8mb. It's silly to pretend this stuff doesn't matter, you can't even get search ranking without addressing it, for reasons that don't require concocted scenarios to explain.


Your difference might be mainly single threaded vs multi threaded decompression. This doesn't typically transform into a real-world speedup when browsing the web, as one will typically run multiple decompression streams in parallel anyway. Zstd is definitely an improvement on Gzip, but not by that much.


> 5.3s at 1.4gb on a laptop is 53ms at 14mb, is 70ms on mobile before adding any increased download time.

Factor in how much time it takes you to unpack those 1.4GB of data in a smartphone. Those 5s amount to which percentage of the total execution time?


A quick note on compressing XML. XML files typically have a high compression ratio because they consist of mostly redundancy, but compressing redundancy isn't free, you still end up with some overhead in the compressed file due to the redundancy. So if you pack the same data in a denser format, and then compress that, you typically end up saving a good chunk of space relative to the compressed XML file.


The tradeoff you show between compression ratio and compression speed is a misleading one.

The slow compression speed only takes place once (when compressing), while the saved bandwidth due to a better compression ratio is saved continuously.


Not if you're serving things dynamically from a web server, then it happens on the fly, every request.


True. But there you can go into optimizations like parallelization of the compression and building an optimized dictionary upfront once to make the process faster.


Cache by the ETag and/or If-Modified-Since headers, then


>but it's unnecessarily derogatory to call something that significant "junk."

By now, it certainly is. Back then it was probably state of the art. That's the cycle of computing. If I called an original Atari junk, some would take it as "this is useless" but most would take it as "this is useless now".


> If I called an original Atari junk, some would take it as "this is useless" but most would take it as "this is useless now".

I think you have that backwards. "junk" is an insulting word. If you called an Atari junk, most would understand you to be disparaging it in a roundabout way, not making a temporal point. To be fair to OP he did at least call it "ancient junk." But there's no need to add "junk." "old" or "ancient" more than ably would have made the point.


I suppose that's down to cultural differences. I'd consider junk to be not useful anymore (I call my core2 duo junk but it was a powerhouse when I got it), but maybe a better word is defunct? According to dictionary, junk = Cheap, shoddy, or worthless.


I'll start calling my grandpa junk too.


Your grandpa isn't particularly functionally different in design from new humans being produced today, just at a different stage of his life. Also, he's a human.

Let's take some non-human examples for comparison. Just about every good military strategy produced when your grandpa was young is, objectively, junk when compared to the state of war today. But just about every good piece of music still holds up - it might be different from good music produced today, but it's not worse. Your grandpa probably did mathematics with the help of books of log tables and books of mathematical instruction. The former are thoroughly junk, being at best as useful as a calculator (or calculator app) but much heavier, and at worst bearing typos; the latter are every bit as useful now as they were back then.

Describing something from a previous generation as "junk" does not convey the belief it is old - there's a perfectly good term for that, "old." It conveys the belief that it is junk.


I recall reading something in the 90s where in the 70s, Huffman coding was considered mathematically optimal compression and the whole field stagnated for a while because they thought they'd reached a fundamental limit.


“Huffman's original algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution, i.e., separately encoding unrelated symbols in such a data stream” -wikipedia

So being pedantic, it is optimal in this specific meaning, but I don’t know about this stagnation.


Makes me wonder how tar+zstd would perform for the StackOverflow user. They were using xz, which is hardly ancient junk.


>xz

It may not be ancient, but it's certainly not the greatest compression format out there. I found this article to be very informative:

https://www.nongnu.org/lzip/xz_inadequate.html

(The author worked on lzip so it's probably somewhat biased, but facts are facts.)


The article calls xz out for its inadequacy as a (long-term) container format, which is nothing to do with its compression method (shared by both xz and lzip). I guess in this thread we are specifically talking about the efficiency of compression algorithms...


> I guess in this thread we are specifically talking about the efficiency of compression algorithms

Yep, and on top of that xz is more efficient at compression than zstd despite being "older", which is the relevant consideration here. I'm not expecting zstd to compete with xz for the StackOverflow case in terms of efficiency, what I want to know is whether file ordering makes the same efficiency difference for zstd as it does for xz.


> I want to know is whether file ordering makes the same efficiency difference for zstd as it does for xz.

It should, because much like other LZ77-based formats zstd uses a distance symbol plus additional uncompressed bits for longer distances, which should be frequent with the unorganized file order. Those formats assume that lower bits of longer distances are essentially random, so if this assumption doesn't hold those bits will affect the efficiency. Zstd also recommends the minimum window size of 8 MB, which is an improvement but also not very large.

I've done a quick experiment with a directory hanging around my temporary folder that weighs about 14 MB (Android CellBroadcastReceiver source code, if you ask). The random file order resulted in a file about 1.4% larger than those for the file extension order. So the effect definitely still exists, though I'm not able to quantify that.


Most of the points that article makes are utterly irrelevant for most uses of xz, including this one.


xz is more than ten years old now :-).


In terms of standard Linux tools, that's hardly ancient.


Hell, it's barely gone from "bleeding edge" to mere "cutting edge".


Most tools do not become obsolete as quickly.

Unlinking a path is hardly something that sees cutting edge improvements, cryptographic hashing, filesystems, and compression is something that is.


Exactly because of your caveat they are not junk,


The 'window size' dictates how far back the algorithm can look for redundant data. If you set it too high, people without enough core memory might not be able to decompress your archive.


And the window size for gzip is only 32 kB -- it "forgets" context pretty quickly. gzip was written in 1992, so it had to run under much tighter memory constraints than necessarily make sense today.

Modern compressors like xz (LZMA) use much more memory for context. Under extreme settings and when working with large files, they can use gigabytes of memory.


If you know your compressor has enough memory to hold the whole plaintext at once, but your decompressor doesn't, why not do extra work on the compression end with the whole plaintext in memory to emit an optimal "dynamic" Huffman alphabet, i.e. a code that gets patched by each new compressed block, rather than resetting to a completely new dictionary for each new block?

I presume the tightest packing possible would be achieved by converting the entire document into a Huffman-encoded sequence of references into a single unified strings table; breaking that strings table down recursively using largest-common-substring until satisfied; and then packing the resulting strings tree into something more like an intrusive prefix trie where offsets are implicit IDs. The problem would, of course, be the size of the resulting trie.

But you could always make the document into bytecode for a compressor VM (RAR-alike) where one opcode is "insert nodes {KV1, KV2, ...} into the prefix trie", and another opcode is "drop nodes {K1, K2, ...} from the prefix trie", such that the trie "window" used over the document would grow and shrink over time. Then you could optimize for redundant re-definitions of identical trie nodes in the compressed data, vs. total in-memory trie "window" size at any given time.

If the decompression wasn't required to be streaming, but rather could advance-allocate an mmap(2)ed file to write to, then you could also have a jump opcode, so that you could organize the Huffman-code trie-prefix definitions in an optimal order to minimize definition redundancy, trading off for making the compressed representation out-of-order relative to the plaintext it decompresses to. The compressor would put the "spine" of the document all together at the beginning, together with the frontloaded rootmost trie-node definitions; then gradually undefine them, replacing them with less-frequently-used nodes needed to decode context-free "islands" of content.


What you described is pretty much LZ78. It has been proven that if you can have an infinite dictionary, then you can achieve optimal efficiency. But infinite dictionaries are impossible.

Much better algorithms exist nowadays. Also, no modern algorithm uses Huffman coding anymore, it was overtaken by arithmetic and range coders long ago, and more recently by ANS coders.


LZ78 isn't adaptive; entries are forced out of the dictionary as soon as the window moves beyond where the dictionary-entry was originally defined, and if those entries are still relevant, they'll get a redundant definition in the next block.

Also, LZ78 is not recursively compressed (i.e. it doesn't compress the dictionary itself—it won't recognize when the subsequences it's working with are themselves internally redundant and would be better represented as a function with an input.)

Huffman/arithmetic/range/ANS are all "the same" insofar as they're all local-context streaming coders, i.e. things that only have the power of an FSM rather than the power you can get from requiring a pushdown automata/Turing machine.

The kind of coder I was talking about above, definitely would require a Turing machine. You couldn't implement it as a DSP — you need that unbounded-length state tape, the ability to write unbounded amounts of code to it, and the ability to jump to that code. That doesn't guarantee that it would be more powerful/better compressing; just that it could be. It wouldn't have the same inherent limits to its power that FSM-equivalent codes do.

-----

Let me go on a tangent for a bit, about what such a code probably would achieve if it could work (which should convince you, like me, that this is probably impossible somehow):

One thing I'd expect the code above to accomplish, would be to take a set of files representing images produced by capturing the sequential output frames generated by running an input movie on a NES emulator; and to emit something that looks very analogous to a cut-down version of the original ROM + input movie. (That is, the compressed data would contain 1. a set of partial spritesheets — the nodes of the prefix trie; 2. a set of partial tilemaps, referencing those spritesheets — the dictionary patch definition blocks; and 3. a sequence of bytecode instructions that use arbitrary geometric "draw" operations, jumps, loop registers, etc., to generate the screen-capture images by referencing the tilemaps that in turn reference the spritesheets — i.e. by calling ops to load/unload dictionary patches by reference. That's a "demo", in demoscene terms.)

This kind of result, if achieved, would (AFAIK) beat all currently-known compression methods. The best ANS code can't compress a sequence-of-images of SMB1 as well as just storing the ROM of SMB1 together with the input movie does. This wouldn't get quite to that degree of compression, but it would get close, and it would do it without the "implicit" Kolmogorov complexity of a NES emulator†.

† Though you could think of the output as something like the Futamura projection of a NES emulator with a specific ROM + input movie loaded into it. (It'd be a very weird NES emulator; one that runs as a batch process and emits image files of video frames. But if you had such an emulator, the output of Futamura-ing it would be very close to — or possibly better than — the output of this kind of approach. But, unlike this approach, incredibly non-portable — it would likely only run on the host architecture + OS version it was Futamura-ed on.)

In the (1D) audio domain, what I'm describing would basically involve turning PCM audio that encodes a rendering of a MIDI sequence through a set of patches, back into the inputs (a MIDI sequence and a set of patches), plus any noise/mastering effects as a residue. We can already do that in domain-specific systems, of course — that's how Melodyne's Direct Note Access works.

I'm just talking about doing it instead in a general compressor, lifting data with opaque redundancies out into graphical data-structures heuristically in order to then compress those graphical data-structures into programs that build-and-then-walk those graphs to produce the original output. A code that synthesizes something like MIDI to represent music; something like NES nametables to represent collaged images; etc. Not as well as those domain-tuned codes do, but better than local-context non-structural codes do. Basically the promise of Hierarchical Temporal Memory, but without the ML part. Purely-algorithmic HTM.

And another fun consequence of such a graphical-structuring code existing, would be that you could feed it non-trivial program binaries, and its output would be an equivalent program retargeted into an hypothetical ISA optimized for just that program. Feed it a corpus of enough such programs, all distinct, and you'll get a generally-space-optimized version of the ISA. An automatic derivation of THUMB from ARM, sorta thing.

------

And again, as I said in my sibling comment: this seems so obvious a win to me that it must be broken or impossible somehow. Tries / DAFSAs are easy. Longest-common-substring iterated to a satisfaction threshold is easy (using temporary post-constructed suffix automata.) Hypothetical iterated restructuring is easy — see OptiPNG, HypoPG, etc. So, where's the impossible part where the wheels fall off? I'm pretty sure it's there.)


Interesting idea! I'm not an expert, but if I'm reading you right and the decompressor is Turing complete; you're searching for the smallest Turing machine capable of generating the input sequence?

That would indeed be an optimal compressor and start to enable things like Solomonof induction and AIXI.

Seems the challenge is to get feedback to guide the search for the right Turing machine. With infinite compute can just run an ever expanding set of Turing machines until you get one that generates the input, but without some breakthrough that would enable something like stochastic gradient descent to efficiently work on non-differentiable programs I don't see how it can be made practical.

Perhaps there is some sweet spot thats more general than current compressors that looks at a restrict class of Turing machines that could be feasible.


Sure. Why not code it up?


I'm not an information theorist / cryptographer / etc., and so I have a strong feeling that there must be something wrong with my assumptions, given that "compressing big files using lots of memory for people to later decompress back to files using average amounts of memory" is an extremely-common need (for e.g. packaged software/OS updates.)

When a physicist sees a proof for a perpetual-motion machine, they don't ask "can this work"; they ask "where is this going wrong, that it came to this absurd conclusion?"

My post wasn't suggesting a real technique — it was me sharing my own perpetual-motion machine proof, so that someone can hopefully point out the step that's impossible, and I can then learn something from that :)


It sounds plausible to me. Real-world constraints aren't usually "you can use any amount of resources on the compressor, and any amount of compute on the decompressor for free, but the decompressor is hardcapped at exactly X MB of memory". So you might only have a theoretical improvement on existing methods, but that would still be very cool.

You can't beat existing methods by much - we're pretty close to the Shannon limit - but it's not a closed subject.


Relevant xkcd: https://xkcd.com/1831/


Is this really true? Surely they should be able to decompress it with virtual memory - you tell them exactly where to look for each operation. On the other hand, it will take you much longer to compress the file if your lookback window is larger than your available memory.


They're stream compressors, so they don't want to rely on being able to seek in the file. Even with swap, you'll have to cut it off somewhere; swap space isn't infinite.


Due to limited memory, I think most compression algos have a window of "stuff they've seen recently" which they use to detect duplicates. If the duplicate is of something too far back, it's fallen out of the window and been forgotten.


Compression formats have to be small themselves, so they make tradeoffs on what data they can point at to make copies of. Most formats prioritize recent data, since that tends to be the most similar.


Sliding sindow effects. Most compression algos don't keep the entire set of files to be compressed in memory, they have a sliding window where the only lok back and ahead X bytes.


I know there are fundamental limits on compression and I know I don’t know the first thing about implementing a modern compression algorithm. With that said:

> Most compression algos don't keep the entire set of files to be compressed in memory

This makes sense, but why keep the uncompressed data in memory? If the memory constraint is your biggest concern and you’re not CPU conscious, compare the outputs rather than the inputs. If you’re concerned about collisions, do a second pass to validate.

I’m certain the best minds on the topic are already aware of these and either using them or have ruled them out for reasons I can’t anticipate. I really hope, that likely being the case, I’ll get a chance to learn in responses and not just unexplained downvotes.


> Isn't the whole point of compression algorithms to find string of bytes that are the same?

No. The goal of compression algorithms is to describe an object approximately (lossy) or accurately (lossless) using the smallest number of [insert unit] possible. In some cases looking for strings of bytes that are the same serves well enough.

Examples of lossy algorithms that don't just look for same strings of bytes are audio and video codecs.

An example of lossless compression that works differently would be compressing "123456789" as "1 to 9".

> Seems simple to detect and handle?

Sure. If you've got infinite memory and time. Depending on your format you may also need to read the whole input before you can start writing the first byte.


Simple but time consuming to detect I presume.


Sorting by what?? Type.


Fyi... not mentioned explicitly in that Q&A is the related concept : https://en.wikipedia.org/wiki/Solid_compression

[EDIT to reply] : >The difference was just the order in which the files were concatenated.

Yes, right. My link just emphasizes the files concatenation aspect (e.g. tar is one example) for people who base the intuition on the ZIP format in which files' order doesn't matter because each file is compressed separately.


Note that in both cases the archive was first put into a single tar file, then compressed. That's how all tar formats function. The difference was just the order in which the files were concatenated.


The point GP is making is that tars are "solid" archives, so compression can span files, which means putting files with lots of redundancy between them next to one another will yield much better compression ratios than orderings which are essentially random.

That is somewhat unintuitive to people used to zip archives, because zip contents are compressed independently, so the ordering really doesn't matter (at least to the compression ratio).


> because zip contents are compressed independently

I "discovered" this when compressing a collection of icon and wmf files in my youth (long before gif and svg were ubiquitous (in fact before svg was even invented)).

Because the files were fairly small without a lot of redundancy the resulting archive was not massively smaller than the input files. It did take a lot less space on-disk due to no longer taking at least 512 bytes (one allocation unit on a small FAT formatted partition) per file which was enough for my needs at that point but it would still be inconvenient if I wanted to transfer them over a 14k4 modem based link, but it seemed wrong so piqued my curiosity enough that I hunted out some info via Usenet and worked out what was going on. Recompressing the zip resulted in massive savings, because the headers in the files were very similar, identical in many cases, so the inner .zip acted like the .tar format in this discussion.

> so the ordering really doesn't matter

Unless you compress again, either as another compress-to-file or through compression in the transport method, in which case there might be significant extra savings to be made if things are in the optimal order.


Aah ok, that makes sense, sorry for the confusion. I didn't know zip worked that way, although in hindsight that explains why it's so simple and fast to extract and/or modify only a part of a .zip archive, even on old computers.


> although in hindsight that explains why it's so simple and fast to extract and/or modify only a part of a .zip archive, even on old computers.

That is also because zips have a « central directory », so to modify or add one of the files you just need to add a new record then update the central directory. If you don’t care too much for size you can literally just append them to the existing zip file and leave both the old record (for updates) and the old central directory.


This reminds me of discovering that some of my backups had bad blocks on the disk. And the backup is in the shape of a contiguous .tar.lzo archive of around 50 GB size; sometimes solid archives like that are not the most robust option.. Fortunately I have other full snapshots.


If I compare the two tar archives directly, they seem different

I like the fact that this user considered whether the filesystem was lying about the size.

It's interesting to see the problem-solving approach; my next step would be to hexdump both and see if one isn't actually compressed as much as it could be, then decompress and inspect the tars.

I'm not sure whether it's true here, but one situation that leads to seemingly randomly-ordered files is when the archiver tries to be faster by using multiple threads.

The XZ blocksize is another variable to consider.


FWIK, compression works by de-duplication. Finding duplicate patterns is limited to a window.

If similar patterns are close to each other, there is a higher probability of finding such duplicates in the window leading to better compression.

When the files are not sorted, this randomly distributed files with similar patterns beyond the compression window leading to poor compression.

If there is an option to increase the size of window, that would be a good experiment.

This is very similar to `git repack` window and depth parameters. Larger the window and depth, you get better compressed packs.

Wonder if a sort based on diffs (group similar files together) would help get best compression. The cost of such sorting might outweigh the benefits.


That’s not the case here. What you describe is basically compression in “solid blocks”. `xz`, however, is (in practice virtually always) fully solid. It allows greater compression efficiency at the expense of random access, which is not possible at all.

The order matters because it “defines” what the algorithm’s dictionary will look like.


I think it's right regarding the limited horizon of the window. Any dynamic compression scheme that evolves knowledge of the input stream will have this property. The tar block being compressed here is many orders of magnitude larger than the window, so the order in which the compressor sees the stream matters.

It is somewhat more complicated because e.g. dynamic Huffan coding rarely applied to raw streams but to a series of one or more encoders such as run-length, but the reasoning remains that if the window or coded/dynamic dictionary size is much larger than the data stream, compression can improve with input permutation.

Back in the 90s I recall an unpopular but very efficient encoder called UltraCompressor[1] which had some input permutation heruristics, "solid" compression, and damage recovery to boot!

[1] http://www.compression.ru/arctest/self/uc.htm


This isn't really specific to tar. For example if you save data to parquet, sorting the data before persisting it can make a huge difference in the size of the file AND the predicate pushdown performance (as you can skip more pages especially if you sort by the column you filter on)


Unrelated to the issue in the post, but a few years back we ran into an annoying oddity in the Python zipfile module [1].

By default, when you create a new zipfile object to create a .zip archive with, it stores the data uncompressed.

Only noticed the issue when a 100MB zip file suddenly became 3MB after I had edited one of the files and re-saved the archive.

[1] https://docs.python.org/3/library/zipfile.html


Yes, it's clearly documented but very much unexpected the first time 'round.

The reason for the default is probably that python can be built with none of zlib, bzip2 or lzma support:

> If ZIP_DEFLATED, ZIP_BZIP2 or ZIP_LZMA is specified but the corresponding module (zlib, bz2 or lzma) is not available, RuntimeError is raised.

in which case only STORED is available, therefore that's the only possible default, any other would lead to zipfile raising errors out of the box for some people.

It would probably have been a better idea to just not use a default, and require the compression method to always be specified, at least when opening zipfiles with modes other than "r".


> in which case only STORED is available, therefore that's the only possible default, any other would lead to zipfile raising errors out of the box for some people.

I don't think that logic works. You can default to "deflate if available" just fine.


An inconsistent and variable default would be significantly, unfathomably, worse than a "bad" default.

And I don’t even have to imagine it, because there’s an example right in the python builtins: open in text mode will use… whatever garbage `locale.getpreferredencoding()` returns. Fantastic way to create hard to reproduce bugs.


That's because using non-binary mode is bad, not because variable defaults are bad.

> An inconsistent and variable default would be significantly, unfathomably, worse than a "bad" default.

Even if you're right on that, I will still say that defaulting to no compression is significantly worse than going "error missing library" for people that have a partial install.


> That's because using non-binary mode is bad

Text mode is perfectly good, it’s what users want in a good 90% of cases. Possibly more.


They want text, and they want line endings that work, but that doesn't mean they want file contents secretly changed as they go.


TBH, Python's zipfile should probably just require zlib. Without that module, it's basically useless (can't open most ZIP files, can't emit compressed ZIP files).


Actually, I've argued the opposite in the past. Any file that is of any significant size is already compressed: images, audio, video, gifs, PDFs, game map/save data... even documents (and by extension, spreadsheets and presentations) are compressed these days (because they're... zips! It's zips all the way down). Compression in your zip archive implementation is unnecessary complexity at this point.

Compressing archives most only makes sense if you have some huge (many many millions of lines) set of source code because that's a lot of data, more than one file that you want to compress, and low entropy in those files. Only in that combination does zip with compression make sense over uncompressed archives or compression without archive format.

Of course, zip can shoehorn both functions that you usually don't need together into one format and we're usually not CPU-bound anyway so I totally see why you might as well always apply a small amount of compression, but I would beg to differ if you say "it's basically useless to have zip without compression support". There are more cases where zip with compression is completely redundant than cases where it helps you.


> even documents (and by extension, spreadsheets and presentations) are compressed these days (because they're... zips! It's zips all the way down)

And how do you read those archives (or, for that matter, create them) using a ZIP library that doesn't know how to deal with compressed data?

My point is simply that DEFLATE support should be considered mandatory for any ZIP implementation. Without it, most archives you encounter will be unreadable. Yes, it can still handle archives which contain only STOREd files -- but that's an edge case which most archives won't fall into.


> And how do you read those archives (or, for that matter, create them) using a ZIP library that doesn't know how to deal with compressed data?

You don’t because you just want to archive them not read them.


This isn't an oddity or issue with Python zip creation, it's expected behaviour as per ISO/IEC WD 21320-1[1].

> The compression method shall be either 0 (“stored”) or 8(“deflated”).

The format does not prescribe a compression algorithm (many could be available for use). It's merely a container.

[1] ISO/IEC WD 21320-1, Document Container File - Part 1: Core - https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.39...


What defaults python chooses is an oddity within python.


Using defaults aimed at compatability with other zip archivers, today and tomorrow, doesn't strike me as odd.


If you can't deflate existing ZIP files because no libzip, then you're not maintaining compatibility with anything really


Also a bit unrelated, but at my work place we reimplemented tar algorithm so we can run it in a lambda/state machine. We used the s3 multipart upload to help with the file transfer bit. The performance was phenomenal. On avg we were tar'ing 20gb in 15 seconds or so, a lot faster than on my pc.


Tar (as opposed to gzip) is almost entirely I/O bound, it's just reading from the filesystem, adding some headers and then outputting it again. Whatever speed your I/O runs at, you will basically tar at that speed as well.


Yep. One of the biggest speed improvements I got out of a process that creates large tar balls was moving it to use pbzip2 to allow the compression to use multiple cores.


pigz is another package.


Lrzip is another one too


I don't understand this comment. You can run arbitrary code in AWS lambda, you can just run "tar". Why did you need to reimplement it?


I suppose they are streaming the tarball directly to S3 using multipart uploads, maybe parallelizing this task at the same time.

You wouldn't be able to do it using regular tar.


Yes, you can, you can pipe tar to the AWS CLI to upload a stream, like "tar -c ./myfiles | aws s3 cp - s3://my-bucket/myobject".

For very large uploads you must also specify a chunksize using "multipart_chunksize" such that the upload will require 10,000 chunks or fewer (the CLI can't do this for you because it doesn't know the size of your stream up front).


Wouldn’t you have to have an ec2 open ? We are trying to minimize infrastructure and do it all in lambdas


Nothing prevents calling the AWS CLI inside a Lambda.


hit the nail on the head


It wasn't always like that though. I think they added custom runtimes around 2yrs ago... Your options were very limited before that.


Before custom runtimes it was still possible to ship your own binaries and shell out to them from the official handler function.


Do any compression algorithms use and approximate matching technique like ssdeep to sort the files by similarity before compression starts? It might be slower but if the compression was 15x better then it might be worth it in some scenarios.


Sorting chunks by similarity: commonly used tools don't do that. Most archive tools only sort by file type.

I wrote a tool that chunks the data (into variable-sized blocks, to re-sync if there are multiple files that have different length prefixes, but that's another story), and then sorts the chunks by LSH (locality sensitive hash). LSH is used by search engines to detect similar text. It can compress directories that contain multiple version of e.g. source code very well (e.g. trunk, branches). https://github.com/h2database/h2database/blob/master/h2/src/...

I discussed this approach with a researcher in this area in January 2020. AFAIK there is active research in this area, specially to compress DNA sequences. But he also wasn't aware of papers or research in this area for general-purpose data compression.

So, I think this area is largely uncharted. I would be interested (as a hobby side project) to help, if somebody is interested.

Summary:

* Regular compression algorithms use a certain "window size": this can be 32 KB, or 2 GB, or so. But sorting chunks can be faster faster and can improve the compression rate.

* Regular compression tools can sort by file name and file type.

* Data de-duplication tools (e.g. enterprise backup tools) chunk the data in a smart way, but then only support exact duplicates (not near duplicates).

* Locality sensitive hashing: it is not used in common compression tools so far AFAIK.


You can use binsort to sort the file list upfront by similarity of contents, which is usable with tar and zip and so can be piped into any other compression you want (every once in a great while when I can't come up with a good filename-based sort, I try out binsort); and lrzip attempts to do long-range similarity detection without changing order to do something similar.


"Binsort - sort files by binary similarity": Thanks for the info! Yes, this seems to do something similar to what I implemented, on a file level (not chunk level). It seems to be O(n^2), while my algorithm is O(sort(n)). But having a standalone too is great!

lrzip / rzip use 900 MB a window, which used to be huge at that time it was written (1998?). But nowadays it is considered medium size.


Nice trick with the sorting. We use xz compressed Linux root filesystem images quite a bit. I wonder whether having arranged the files in the filesystem images could achieve a significant saving. Unlikely to be a factor of 15, but 20-30 wouldn't be bad either.


There’s a lot of research in the field of scientific computing aimed at exactly this kind of data. You can definitely gain a lot of your floating point data is varying slowly.


tarfile sorts names only since Python 3.7:

https://bugs.python.org/issue30693


Semi-related fun fact: I've experienced a large speedup doing MD5 checksumming with Python's hashlib instead of md5sum(1).


tl;dr: the python lib adds files to the archive sorted by the last modification date by default. this happened to be better in this case, but macOS tar had the same results when using the appropriate sorting flag.

Makes me wonder: considering the speed of modern SSDs, would it make sense for compression tools to try various sortings by default, or compare files up front?


> macOS tar had the same results when using the appropriate sorting flag

They had to install and use GNU tar to gain the `--sort` option. macOS (BSD) tar doesn't have it. (You could emulate the behavior by using `find` and passing the member names in on the command-line.)


As mentioned near the end of your comment: typically a workaround (possibly with temporary files instead) E.G.

find /path | sort | bsdtar -c -f OUTPUT -n --files-from /dev/stdin ...Args

I prefer the second version which terminates paths with the NULL character. Both include -n to not-recurse.

find /path -s -print0 | bsdtar -c -f OUTPUT --null -n --files-from /dev/stdin ...Args

From the BSD manpage: -s Cause find to traverse the file hierarchies in lexicographical order, i.e., alphabetical order within each directory. Note: `find -s' and `find | sort' may give different results.


> macOS tar had the same results when using the appropriate sorting flag.

No, macOS has the same results when using gnutar which provides sorting option.

bsdtar (which macOS provides out of the box) has no such option. It just stores files in whichever order you provide them (on the command-line, via the CLI) or it finds them (for recursive tar, so whichever order the directory returns them in).


Another option is to use something like lrzip as the compressor - not sure if it would have helped in this instance but it incorporates a step which looks for long-range redundancy which can make a huge difference in these sorts of scenarios.


Seems better to have some separate higher level metacompression tool which does the heuristic searching across many possible compression formats and parameters.


This is precisely the approach taken by the Hydra metacompressor in the (closed-source) Oodle compression library:

http://cbloomrants.blogspot.com/2017/02/oodle-hydra.html


Compression is generally CPU bound for most of the typical algorithms (LZMA, zlib/deflate) unless your drive is very very slow (<10 MB/s). There are some quite fast algorithms where this could make sense, basically using something like LZ4 as a proxy for compress-ability. This is the approach some tools use when you tell them to "compress, but only if it makes sense".


Even better when you switch on the --lossy option in gzip.


You may be interested in Lossy Zip [1]. It's coming up on 22 years old, as of April 1st.

Admittedly, it's somewhat unclear why the format didn't take off. At its core, it uses the Lessis-Moore algorithm to perform a global bit-wise sort of its inputs, then performs a run-length encoding (plus some other magic) on the results to achieve up to 100% compression.

[1] http://web.archive.org/web/20050402203136/http://lzip.source...


So it only has a 0% quality setting?

I'm not convinced that actually qualifies as "lossy".


That's hilarious. Thank you.


At that point you might as well store your files in /dev/null


Does /dev/null support sharding?



if /dev/null is fast and web scale I will use it. ( https://devnull-as-a-service.com/ )


It's time to mention the name of the NABOB archiver, it seems. (-:

* https://news.ycombinator.com/item?id=26460317


That's where the zoster virus hides out


Although you are joking, gifsicle does have a "--lossy" switch which allows changes to a gif file in order to make the LZW compression more efficient.


An other example is pngquant, which performs "lossy" compression of PNGs (by quantizing them).


pngquant reduces the size by removing information from the image before compression.

gifsicle --lossy actually allows slightly lossy LZW dictionary matches to improve the compression itself.


I assume you're joking... I've never heard of lossy compression in this kinds of formats, as there's no general way to determine what can be discarded without affecting the utility of the data. There's no such option in any implementation of gzip I've found...


I use gzip --lossy if I forget to submit a rough draft for school papers. It comes out decent quality, but slightly worse.


Was hoping for an Easter egg, but it just says unrecognized option? At least on my phone it does, iirc that's a Debian Buster install that I haven't updated in three years.


I might be overexplaining but it's a joke because lossy compression of a gzip would end up with you losing data that could render the entire file unreadable.


Yeah, that much was clear, just thought it might refer to a hidden option that triggers a funny message, but it didn't for me so perhaps it was a macOS' gzip thing or an ancient/recent gzip version thing.


What version of gzip has a lossy option?


It's a joke. Lossy compression can only work with specific formats where information loss is acceptable, theoretically anything but in practice usually audio and video.

If gzip tried to do lossy compression, not knowing what the data is, it would have to randomly corrupt data.


There are plent of example texts where letters are deliberately left out to show how the brain fills in the missing data without losing meaning. Seems like this --lossy option might not be so bad after all. Might only work well for those that speak the language natively.



It was the best and worst of times.


It was the best and best of times.


I want to know how well Brotli fares with this. Browser and server support for Brotli and Brotli with static compression is the future rather than these MS-DOS things from a long time ago.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: