
Better Compression with Zstandard - ctur
http://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/
======
rdtsc
Agreed, zstd is probably one of the most exciting new developments in
compression.

However this worries me:

[https://github.com/facebook/zstd/blob/dev/PATENTS](https://github.com/facebook/zstd/blob/dev/PATENTS)

\---

The license granted hereunder will terminate [...] if you [...] initiate [...]
any Patent Assertion: (i) against Facebook or any of its subsidiaries or
corporate affiliates, (ii) against any party if such Patent Assertion arises
in whole or in part from any software, technology, product or service of
Facebook or any of its subsidiaries or corporate affiliates, or (iii) against
any party relating to the Software.

\---

Not a lawyer, but I would guess that makes it a non-starter for many
situations. You sue them, and now your customers' data, your code repository,
or your backups have to be decoded with software you no longer have a license
to use. Am I crazy, and this is not a BigDeal(TM)? Or does anyone else have an
issue with this?

~~~
koolba
How does that play with the BSD license file, which presumably has no such
restrictions:
[https://github.com/facebook/zstd/blob/dev/LICENSE](https://github.com/facebook/zstd/blob/dev/LICENSE)

~~~
rdtsc
Good question. I can see two interpretations (but IANAL so am probably very
wrong here):

1) The BSD license covers copyright, and the PATENTS file covers patent
grants. Totally separate things. If you sue Facebook, you retain the BSD
license and they can't sue you back for copyright infringement, but they can
then come after you asserting that you infringed their compression patents and
demand royalties.

2) Because the PATENTS file says "Additional Grant of Patent Rights Version 2",
it could be read as an extension of the BSD license. That is, the termination
would trigger the revocation of the copyright license as well, not just the
patent grant.

~~~
electrum
The License FAQ covers interpretation #2:
[https://code.facebook.com/pages/850928938376556](https://code.facebook.com/pages/850928938376556)

 _Does termination of the additional patent grant in the Facebook BSD+Patents
license cause the copyright license to also terminate?_

 _No._

------
jdcarter
I've been completely impressed with Zstandard. I tested it when 1.0 was
released, and I was blown away by both its performance and compression ratio,
especially on multi-core machines. I'm using it in a project at work
(prototype stage at the moment), and I'm confident we'll ship with Zstandard
as the default compression method.

FWIW, I'm using this Go library:
[https://github.com/DataDog/zstd](https://github.com/DataDog/zstd)

I've made some contributions and the DataDog team is very prompt at reviewing
merge requests.

~~~
beagle3
Did the legal team at your work review the patent issues? (See the top-level
comment by rdtsc above:
[https://news.ycombinator.com/item?id=13814476](https://news.ycombinator.com/item?id=13814476))

edit: time -> team, thanks JoshTriplett, DYAC.

~~~
JoshTriplett
> Did the legal time at your work review the patent issues?

(assuming you mean "legal team", not "legal time")

If it did, they almost certainly couldn't say.

------
unsigner
If you control both sides of the channel - compression and decompression - and
can make money out of better or faster compression, you should definitely look
into RAD Game Tools' Oodle libraries - they exceed LZ4/Zstd on all axes.

[http://www.radgametools.com/oodle.htm](http://www.radgametools.com/oodle.htm)

(not an employee of or financially related to them in any way, just a
compression nerd following the field)

------
mavam
A while ago I did a similar-but-not-so-elaborate analysis with a large variety
of compression algorithms:
[https://github.com/mavam/compbench](https://github.com/mavam/compbench).

Only two input types: text logs and network traces. My conclusion was the
same: Zstandard has the best space-time tradeoff.

------
koolba
What's the memory footprint for Zstandard (vs. zlib) for compression and
decompression?

The article goes into CPU usage (which seems awesome...) but I don't see
anything about the buffer sizes for either. It just mentions them as knobs
that can be twiddled.

~~~
indygreg2
It depends on the "window size" used by the compressor. According to
[https://github.com/facebook/zstd/blob/dev/doc/zstd_compressi...](https://github.com/facebook/zstd/blob/dev/doc/zstd_compression_format.md#window_descriptor),
the window size can be anywhere from 1KB to 1.875TB.

The table mapping compression levels to window sizes is at
[https://github.com/facebook/zstd/blob/15a7a99653c78a57d1ccbf...](https://github.com/facebook/zstd/blob/15a7a99653c78a57d1ccbf5c5b4571e62183bf4f/lib/compress/zstd_compress.c#L3250).
The first element is the window log; I'm pretty sure you raise 2 to that power
to get the window size. As you can see, level 1 uses 2^18, which is 256KB -
far larger than the minimum of 1KB. Lowering it will make compression faster
at the expense of compression ratio.

For comparison, zlib's max window size is 32KB. That's one reason its
compression ratio is so limited.
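
To make the window knob concrete, here's a rough sketch using the
python-zstandard bindings the article covers (a sketch, not gospel: I believe
from_level and compression_params are the right names, but check the docs, and
the file name is made up):

    import zstandard as zstd

    data = open("example.bin", "rb").read()  # made-up input file

    # Default level 3 compressor; zstd picks a window size for the level.
    cctx = zstd.ZstdCompressor(level=3)
    default_out = cctx.compress(data)

    # Same level, but cap the window at 2^18 = 256KB (level 1's default per
    # the table above) to bound memory use on both ends.
    params = zstd.ZstdCompressionParameters.from_level(3, window_log=18)
    capped_out = zstd.ZstdCompressor(compression_params=params).compress(data)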

------
j_s
Ask HN: What is the best approach to compressing ProtoBuf streams? Does it
still depend completely on the content being serialized, even with the
ProtoBuf metadata?

For example: brotli is designed with an existing dictionary optimized for
HTTP. Based on this design decision, it seems like it wouldn't necessarily be
a good idea to use it with ProtoBuf, especially for small ProtoBuf serialized
plain-old-[whatever]-objects in a data access layer.

Poking around a bit so far, it seems [http://blosc.org/blosc-in-depth.html](http://blosc.org/blosc-in-depth.html)
may be the right choice, but I'm not so sure about adopting its serialization
layer since it isn't as widely cross-platform as ProtoBuf.

If your decision was obvious after reviewing multiple algorithms then you
might save me some time by sharing your experience; thanks in advance!
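
To make the small-record case concrete, here's roughly what I have in mind
with zstd's dictionary training, via the python-zstandard bindings from the
article (a sketch; the sample data is illustrative and the API names are to
the best of my knowledge):

    import zstandard as zstd

    # Stand-ins for many small serialized ProtoBuf records (illustrative).
    samples = [b'{"user_id": %d, "active": true}' % i for i in range(1000)]

    # Train a shared dictionary on representative samples, then reuse it
    # for every small message instead of compressing each one cold.
    dict_data = zstd.train_dictionary(16384, samples)
    cctx = zstd.ZstdCompressor(dict_data=dict_data)
    dctx = zstd.ZstdDecompressor(dict_data=dict_data)

    blob = cctx.compress(samples[0])
    assert dctx.decompress(blob) == samples[0]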

~~~
JyrkiAlakuijala
Results with geo.protodata (118588 bytes uncompressed) from snappy test data:

    11728 bytes - brotli 11
    11941 bytes - brotli 10
    12056 bytes - lzma
    12219 bytes - zstd 22
    12314 bytes - brotli 9
    12512 bytes - brotli 5
    12526 bytes - zstd 15
    12809 bytes - zstd 12
    12831 bytes - zstd 9
    14753 bytes - zopfli
    15110 bytes - gzip 9

------
oso2k
This is great. But please note, some of the graphs presented use a common
statistics cheat: a logarithmic scale along one or more of the axes.

Many times, lz4 level 1 isn't just 25% faster at compression [0] than, say,
zlib level 1 (as it appears visually); it's actually six times (6x, or 600%)
faster in terms of MB/s (~600 MB/s vs. ~100 MB/s)!

And it's not just 5% or so faster than zstd at decompression (visually) [1];
it's about twice as fast (~2000 MB/s vs. ~1000 MB/s).

Yes, it does not compress as well. But lz4's best-case ~3.2:1 compression
ratio (listed as ~3.2 in the chart) compresses almost as quickly as zstd's
6:1 [0] and then decompresses twice as fast as zstd at any level [1]. High
compression is a case I could (almost) not care about. If you're
youtube/google or facebook/instagram, you might care (streaming, photos,
static assets, all at enormous scales).

For me, however, the above results mean far less CPU burden on the client at
twice the burden on network I/O. If you're concerned about battery, loading
screens, or download dialogs, I'd still pick lz4. Just turn the AC up half a
notch in my AWS DC.

[0] [http://gregoryszorc.com/images/compression-bundle-
common.png](http://gregoryszorc.com/images/compression-bundle-common.png)

[1] [http://gregoryszorc.com/images/decompression-bundle-
output.p...](http://gregoryszorc.com/images/decompression-bundle-output.png)

------
partycoder
Zstandard is awesome technology, but I do not like the name. It is like those
ad jingles that change two notes of a hit song to avoid paying royalties.

------
Twirrim
I've been very impressed with Zstandard in general on x86, and satisfied with
it on other architectures. It seems to do its job and do it well.

About the main reason I haven't pushed on using it at work is concern about
maturity. gzip/pigz works well enough for our purposes and isn't enough of a
hindrance to justify adopting what is still new software without a proven
history.

~~~
ctur
I definitely understand reluctance to trust new data formats. I will say,
though, that we have been using Zstandard in production at Facebook since well
before 1.0 and trust it completely for both data at rest and data in flight.
Since 1.0, we've widened that usage even further to many different parts of
the company.

To that end, every release (and pretty much every commit) undergoes
significant testing. We also fuzz test heavily and have extensive regression
tests in place, constantly looking at corner cases and verifying that known
prior bugs stay fixed. It's impossible to prove there are no issues, but we've
put a lot of effort into being as reliable as possible and have so far seen
that play out well in our production environments. We trust it.

~~~
Twirrim
I feel pretty safe in trusting it.

I just like to be careful with a mental "new tech" budget for a
project/platform. The more new tech is involved, the more risk there is; the
balancing point is the value the new tech brings.

With the platform I'm working on at the moment, the gains are relatively
minor: shaving a few seconds off here and there, when things are on the scale
of minutes; and saving 20-30MB when things are on the scale of gigabytes, and
there aren't many of them. As the project grows, and as the new tech we've
already introduced proves itself, there's always scope to bring in other new
tech like Zstandard and start collecting the storage savings.

------
mastax
Are there any negatives to Zstandard (besides those listed in the article)?
Specifically, I was thinking about using it to compress backup archives. I
figure it might be annoying to use an algorithm whose output can't be
extracted with default-installed tools (though zstd is in the Ubuntu 16.10
repos), but otherwise I don't see any downside.

~~~
beagle3
The PATENTS file. Calling it "zstandard" is a genius move - everyone assumes
it's a replacement for zlib (royalty free, BSD-style copyright, patent free),
when only the first two are in fact true.

See the post by rdtsc above
[https://news.ycombinator.com/item?id=13814476](https://news.ycombinator.com/item?id=13814476)

------
foepys
Last week I needed to compress some large CSV files (70+ MB) and tested
several compression algorithms for speed/ratio, and I was surprised that bzip2
was (in this isolated case) way better than Zstandard: zstd needed far more
time (bzip2 3s vs. zstd 15s) to reach the same compression ratio, and in its
default configuration it produced noticeably larger files (5 MB vs. 6.2 MB).

So Zstandard is not the compression algorithm to end them all and definitely
has its weak spots.
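
In case anyone wants to reproduce this kind of comparison, a quick level sweep
with the python-zstandard bindings looks roughly like this (a sketch; the file
name is made up):

    import time
    import zstandard as zstd

    data = open("big.csv", "rb").read()  # made-up input file

    # Sweep a few levels to see the speed/ratio trade-off for this input.
    for level in (1, 3, 9, 15, 19):
        cctx = zstd.ZstdCompressor(level=level)
        start = time.time()
        size = len(cctx.compress(data))
        print("level %2d: %9d bytes in %.2fs" % (level, size, time.time() - start))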

~~~
wolf550e
Did your CSV files contain stuff that looks like human-language text? Use PPMd
like so:

    7z a -m0=PPMd demo.7z demo.txt

This beats bzip2 and zstd for such files.

~~~
mtdewcmu
The problem with PPM is that it's slow. Zstd is targeting a certain general-
purpose use case. There are other compressors that target different time-space
trade-offs, if you need extreme performance or extreme compression efficiency.

------
gameshot911
Somewhat off topic, but this reminded me of one of my favorite HN submissions:

Link -
[http://www.patrickcraig.co.uk/other/compression.php](http://www.patrickcraig.co.uk/other/compression.php)

Previous commentary -
[https://news.ycombinator.com/item?id=9163782](https://news.ycombinator.com/item?id=9163782)

------
rleigh
Potentially interesting for use with ZFS as an LZ4 alternative?

I recently did some benchmarking with lz4, gzip-6 and gzip-9 to look at using
better compression for archival datasets. However, the overhead of gzip made
the system largely unresponsive for days! An algorithm with better compression
ratios but less CPU overhead would be a great addition.

~~~
ptman
Maybe with HDDs, but SSDs are so fast that zstd would be the bottleneck.

------
seangrogg
Zstandard looks interesting - I'm somewhat hopeful it gets adopted by a
browser vendor or two so I actually have a solid use case for it.

------
IanCal
Sounds really interesting.

Does it support efficient random access?

~~~
ctur
We are currently exploring a seekable format for Zstandard. What use case do
you have in mind -- network or at rest? If you want to file a GitHub PR
describing how you'd like it to work, that will help guide our
implementation. We have a few use cases in mind based on internal Facebook
needs, but the more different needs we know about, the more general-purpose
the result will be.

~~~
alphapapa
I have always thought that the 7-Zip format is interesting in the way it (or
the reference implementation of the compressor, at least) groups files by
extension. I guess this helps compression by making it more likely that chunks
common within a filetype end up in the dictionary before it fills up with
chunks from all filetypes. Do you have any thoughts about this? Have you
looked at the 7-Zip format?
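
The effect is easy to experiment with outside 7-Zip, e.g. comparing compressed
sizes with and without grouping by extension (a sketch with the
python-zstandard bindings; the directory path is made up):

    import os
    import zstandard as zstd

    # Gather files under a directory (made-up path).
    paths = []
    for root, _, names in os.walk("some_directory"):
        paths.extend(os.path.join(root, n) for n in names)

    def packed_size(ordering):
        # Concatenate file contents in the given order and compress once.
        blob = b"".join(open(p, "rb").read() for p in ordering)
        return len(zstd.ZstdCompressor(level=19).compress(blob))

    # Grouping by extension mimics 7-Zip's file ordering.
    by_ext = sorted(paths, key=lambda p: os.path.splitext(p)[1])
    print(packed_size(paths), packed_size(by_ext))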

~~~
stardomSerf
Do you mean this? [https://github.com/mcmilk/7-Zip-
zstd](https://github.com/mcmilk/7-Zip-zstd)

~~~
alphapapa
That's cool, thanks for sharing. I actually meant to ask the OP if they had
looked at the 7-Zip format when designing their own archive format, but maybe
they don't need to...

------
kzrdude
Does zstd define an archive format (hierarchy and metadata)?

~~~
Zash
I saw nothing like that in the API when I had a look. But there's always the
trusty ol' tar format.
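
For example, streaming a tar archive through zstd with the python-zstandard
bindings could look roughly like this (a sketch; I believe stream_writer is
the bindings' streaming API, and the paths are made up):

    import tarfile
    import zstandard as zstd

    cctx = zstd.ZstdCompressor(level=3)
    with open("backup.tar.zst", "wb") as raw:
        # Wrap the output file in a zstd compressing writer.
        with cctx.stream_writer(raw) as compressed:
            # Mode "w|" streams the tar archive straight into the writer.
            with tarfile.open(fileobj=compressed, mode="w|") as tar:
                tar.add("some_directory")  # made-up path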

