
> Ancient junk like gzip/deflate only has a 32kb window

Yes, a reflection of the time that they were first created, but hardly junk. gzip/DEFLATE have saved millions if not billions of hours of human time over the past few decades. Better things exist now, but it's unnecessarily derogatory to call something that significant "junk." And by being industry standards, they have made compression available in many more places than it used to be, even if they are not state of the art. Better to have some reasonable compression than none.




Picture the effect of compression ratios or decompression performance improving 1% globally; or, conversely, picture planes being replaced with a Wright Brothers model or cars with the T1. With Deflate it's not just 1% but regularly 200% or 2,000%, and it's not something people use once a month but practically every time they interact with digital electronics.

Regressing to the T1 is unthinkable because it's long out of fashion and widely perceived as obsolete. Every change starts with a shift in perception, a process accelerated by a shift in language. For that reason I'm content with "junk": it describes exactly how we must feel about this dead technology before we'll ever be rid of it.

I'll gladly reminisce once we've escaped its legacy, be it the time wasted installing an Android app and the effect that has on the battery (multiplied by billions of users), the time to open an Excel document, to power up some smart TVs, to fetch an itemized AWS bill over a typical S.E. Asian Internet connection, or to grep that bill at speeds appropriate for a state-of-the-art laptop manufactured this side of the millennium. You expect me to pay respect to software that wastes the finite breaths of billions of real people every single day?


So I wanted to check: I downloaded the Linux source tree and ran gzip and zstd.

    Uncompressed .tar:      1064376320
    .xz (downloaded file):   117637692 (88.9% saved)
    .gz:                     189299444 (82.2% saved)
    .gz (level 9):           186533230 (82.5% saved)
    .zst:                    178012638 (83.3% saved)
    .zst --ultra -22:        119676867 (88.8% saved)
So even at maximum level, zstd is only about 6 percentage points better, or about 37% if you consider .gz as baseline, while being about 10x slower than gzip level 9. There are places where a smaller compressed file is worth the cost of 10x execution time (and much higher memory usage), but serving web pages is almost certainly not one of them.

In other words, the reason gzip is still widely used is simply that it's good enough for its use cases. It's less like the original T1, and more like complaining that people are still driving a 2011 Toyota Camry when the 2021 Camry is a much better car.
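
If anyone wants to poke at this themselves, here's a rough sketch in Python: the tarball path is hypothetical, gzip/lzma are stdlib, and zstd goes through the third-party zstandard package.

    # Compare compressed sizes of one tarball across codecs. Reads the whole
    # file into memory, so this is only suitable as a quick experiment.
    import gzip, lzma
    import zstandard  # third-party: pip install zstandard

    data = open("linux.tar", "rb").read()  # hypothetical path

    results = {
        ".gz (CLI default)": gzip.compress(data, compresslevel=6),
        ".gz (level 9)":     gzip.compress(data, compresslevel=9),
        ".zst (default)":    zstandard.ZstdCompressor().compress(data),
        ".zst (level 22)":   zstandard.ZstdCompressor(level=22).compress(data),
        ".xz":               lzma.compress(data),
    }
    for name, blob in results.items():
        print(f"{name:20} {len(blob):12d} ({1 - len(blob) / len(data):.1%} saved)")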


For the pile of XML in front of me just now (originally arrived in a ZIP container, aka deflate, part of a much larger set): it's 1.4GB decompressed, 44MB gzipped/deflated (default settings), 31MB zstd (default), 5.379s to decompress with gzip, 0.560s to decompress with zstd. Just under an order of magnitude difference in the most frequent operation (decompression).

Admittedly compressors generally love XML, and this is just one example -- 28% less download time and 89% less time wasted on file open. Multiply by a few tens to a few thousand occurrences per week across 7.6 billion people, and I really struggle to call that a Camry.
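
(A minimal sketch of this kind of measurement in Python, assuming the third-party zstandard package; the file name is made up:)

    # Time one-shot decompression of the same payload under both codecs.
    import gzip, time
    import zstandard  # third-party

    raw = open("dump.xml", "rb").read()  # hypothetical file
    gz = gzip.compress(raw, compresslevel=6)
    zst = zstandard.ZstdCompressor().compress(raw)

    t0 = time.perf_counter(); gzip.decompress(gz)
    t1 = time.perf_counter(); zstandard.ZstdDecompressor().decompress(zst)
    t2 = time.perf_counter()
    print(f"gzip: {t1 - t0:.3f}s   zstd: {t2 - t1:.3f}s")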


You're admittedly talking about a difference in compression rate of 13MB in a 1400MB package, and a 5s difference in writing a 1400MB file to disk.

Come on, let's be honest here: that's nitpicking.

Unless you are running batch processes that store terabytes of compressed data, no one would even bother to switch apps for those residual gains.

Let's put it this way: would you get any VC funding if your sales pitch was "I can improve compression in a 1400MB package by 13MB and shorten file write times by 5 seconds"? Odds are, you'd be asked why not just gzip it and get it over with.


> You're admittedly talking about a difference in compression rate of 13MB in a 1400MB package, and a 5s difference in writing a 1400MB file to disk.

> Come on, let's be honest here: that's nitpicking.

I agree on the compression size (13 megabytes is really not much of a difference), but the decompression speed improvement really is remarkable. It's an order of magnitude of difference! Amortize that over every web request that has to decompress data, and it makes a huge difference.

I'm mostly in the "gzip is good enough" camp, but a speed improvement of 10x is not nitpicking.


> Let's put it this way: would you get any VC funding if your sales pitch was "I can improve compression in a 1400MB package by 13MB and shorten file write times by 5 seconds"? Odds are, you'd be asked why not just gzip it and get it over with.

I agree but have a different takeaway: it’s why VCs aren’t the be-all and end-all. Such an improvement is worth the time invested to create it. It won’t change the world, but spread across the entire globe it’s a very notable step forward. That a VC can’t get hockey-stick growth and a lucrative return out of it doesn’t invalidate the idea.


5.3s at 1.4GB on a laptop is 53ms at 14MB, which is >70ms on mobile before adding any increased download time. 14MB is one retina image; even Reddit already weighs 8MB. It's silly to pretend this stuff doesn't matter; you can't even get search ranking without addressing it, for reasons that don't require concocted scenarios to explain.


Your difference might be mainly single-threaded vs. multi-threaded decompression. That doesn't typically translate into a real-world speedup when browsing the web, as one will typically run multiple decompression streams in parallel anyway. Zstd is definitely an improvement on gzip, but not by that much.


> 5.3s at 1.4GB on a laptop is 53ms at 14MB, which is >70ms on mobile before adding any increased download time.

Factor in how much time it takes to unpack those 1.4GB of data on a smartphone. What percentage of the total execution time do those 5s amount to?


A quick note on compressing XML. XML files typically have a high compression ratio because they consist mostly of redundancy, but compressing redundancy isn't free: some overhead from it still ends up in the compressed file. So if you pack the same data in a denser format and then compress that, you typically save a good chunk of space relative to the compressed XML file.
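
A quick sketch of the effect in Python (the record shape is invented for illustration):

    # The same records as verbose XML vs. a denser CSV-style encoding, both gzipped.
    import gzip

    rows = [(i, i * 1.5, "item-%d" % i) for i in range(10000)]
    xml = "<rows>" + "".join(
        "<row><id>%d</id><price>%.2f</price><name>%s</name></row>" % r
        for r in rows) + "</rows>"
    csv = "\n".join("%d,%.2f,%s" % r for r in rows)

    for label, text in (("xml", xml), ("csv", csv)):
        print(label, len(text), "->", len(gzip.compress(text.encode())))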


The tradeoff you show between compression ratio and compression speed is misleading.

The slow compression only happens once (at compression time), while the bandwidth saved by a better compression ratio accrues on every download.


Not if you're serving things dynamically from a web server; then compression happens on the fly, on every request.


True. But there you can reach for optimizations like parallelizing the compression, or building an optimized dictionary up front, once, to make the process faster.
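
(A sketch of both ideas using the third-party zstandard package; the sample corpus is hypothetical:)

    # Train a shared dictionary once on sample responses, then compress each
    # response with it, letting the library use multiple worker threads.
    import zstandard  # third-party

    samples = [open(p, "rb").read() for p in sample_paths]  # hypothetical corpus
    dict_data = zstandard.train_dictionary(112640, samples)  # ~110KB dictionary

    cctx = zstandard.ZstdCompressor(level=3, dict_data=dict_data, threads=-1)

    def compress_response(body: bytes) -> bytes:
        return cctx.compress(body)

    # Clients must decompress with the same dictionary:
    dctx = zstandard.ZstdDecompressor(dict_data=dict_data)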


Cache by the ETag and/or If-Modified-Since headers, then.
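
Something like this toy sketch (not any particular framework's API):

    # Compress once per unique body; answer revalidations with 304.
    import gzip, hashlib

    _cache = {}  # etag -> gzipped body

    def respond(body, if_none_match):
        etag = '"%s"' % hashlib.sha256(body).hexdigest()[:16]
        if if_none_match == etag:
            return 304, etag, b""  # client's copy is still fresh
        if etag not in _cache:
            _cache[etag] = gzip.compress(body)
        return 200, etag, _cache[etag]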


[flagged]


> This is a deliberately misleading statement

> the same kind of bias

> another willful misrepresentation

These phrases, as well as the whole last paragraph, contribute no substantive value to the comment, but are aggressively vocal about assuming bad faith, and as such violate the HN guidelines. Please don't do that.


I disagree. Lies should be called out as lies, and aren't nearly often enough in the current era of fake news/science/politics.


IMO, attack the message, not the messenger. Pointing out flaws in an argument is good, but assuming malice hurts the discussion. Now we are suddenly talking about the morality of the poster instead of the issue at hand.


Never attribute to malice that which can be adequately explained by error.

https://en.wikipedia.org/wiki/Hanlon%27s_razor


Well, I did say "about 37% if you consider .gz as baseline", but I agree that it could have been worded better.

The problem is what you are measuring. If you have to compress a 100MB file, and algorithms A/B/C/D compress it to 10/9/1/0.9MB respectively, then we can argue that "B is 10% better than A" (because the file is 10% smaller) and also that "D is 10% better than C" (same).

However, moving from A to B saves us 1MB for this file. Moving from C to D only saves us 0.1MB. So depending on how you look at it, you could also argue that the difference between C and D is really only 0.1% (of the original file size). It may not be what you want (for a particular situation), but it's not wrong.
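
Spelled out with the numbers from the example (a few lines of Python):

    a, b, c, d = 10.0, 9.0, 1.0, 0.9     # MB of output from a 100MB input
    print((a - b) / a, (c - d) / c)      # ~0.1 and ~0.1: both "10% better"
    print((a - b) / 100, (c - d) / 100)  # 0.01 vs 0.001: 1% vs 0.1% of the input saved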


> at its default level, compresses not only more than gzip level 9, but also at more than 10x the speed

I could also call that a wilful misrepresentation, since gzip level 9 is very slow compared to the default setting and IMO isn't worth the extra execution time.

Also, nobody mentions that gzip is already installed everywhere. zstd is fairly new, and it's not installed by default on any of the machines I use.

I mean, I'm sure there are some exotic screwdrivers that work better than Torx. But I can get Torx screws and drivers in every hardware store, and everyone already has compatible bits, so why not use whatever works?


Never trust any statistics you have not manipulated yourself...

You could have demonstrated the math without your attacks.


> It's 10x times better. Because it can fit 10x more data in the same storage budget.

This makes no sense. It's 9% better because it's 9% better.


The resulting compressed archives of these operations are 10% and 1% of the original file sizes, respectively (10x difference). The percent reduction can’t really be compared linearly.


I share your frustration with misinformation, but this is just human error, which you could be much kinder about correcting. I detected no gzip-pushing "agenda" in that comment; are you sure you're not just projecting?


>but it's unnecessarily derogatory to call something that significant "junk."

By now, it certainly is. Back then it was probably state of the art. That's the cycle of computing. If I called an original Atari junk, some would take it as "this is useless" but most would take it as "this is useless now".


> If I called an original Atari junk, some would take it as "this is useless" but most would take it as "this is useless now".

I think you have that backwards. "Junk" is an insulting word. If you called an Atari junk, most would understand you to be disparaging it in a roundabout way, not making a temporal point. To be fair to OP, he did at least call it "ancient junk", but there was no need to add "junk"; "old" or "ancient" alone would have made the point.


I suppose that's down to cultural differences. I'd consider junk to be something that isn't useful anymore (I call my Core 2 Duo junk, but it was a powerhouse when I got it), but maybe a better word is defunct? According to the dictionary, junk = cheap, shoddy, or worthless.


I'll start calling my grandpa junk too.


Your grandpa isn't particularly functionally different in design from new humans being produced today, just at a different stage of his life. Also, he's a human.

Let's take some non-human examples for comparison. Just about every good military strategy produced when your grandpa was young is, objectively, junk when compared to the state of war today. But just about every good piece of music still holds up - it might be different from good music produced today, but it's not worse. Your grandpa probably did mathematics with the help of books of log tables and books of mathematical instruction. The former are thoroughly junk, being at best as useful as a calculator (or calculator app) but much heavier, and at worst bearing typos; the latter are every bit as useful now as they were back then.

Describing something from a previous generation as "junk" does not convey the belief it is old - there's a perfectly good term for that, "old." It conveys the belief that it is junk.


I recall reading something in the '90s saying that in the '70s, Huffman coding was considered mathematically optimal compression, and the whole field stagnated for a while because people thought they'd reached a fundamental limit.


“Huffman's original algorithm is optimal for a symbol-by-symbol coding with a known input probability distribution, i.e., separately encoding unrelated symbols in such a data stream” -wikipedia

So, being pedantic, it is optimal in that specific sense, but I don’t know about the stagnation.
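
For reference, the symbol-by-symbol construction itself is tiny; a textbook sketch in Python (not tied to anything in this thread):

    # Textbook Huffman construction over a known symbol distribution.
    import heapq
    from collections import Counter

    def huffman_code(text):
        # Heap entries: (weight, tiebreaker, {symbol: code_so_far}).
        heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(Counter(text).items())]
        heapq.heapify(heap)
        tie = len(heap)
        while len(heap) > 1:
            w1, _, c1 = heapq.heappop(heap)
            w2, _, c2 = heapq.heappop(heap)
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            heapq.heappush(heap, (w1 + w2, tie, merged))
            tie += 1
        return heap[0][2]

    print(huffman_code("abracadabra"))  # frequent symbols get shorter codes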



