PS - I didn't know a lot of the terms you used, but the Finite State Entropy (FSE) link you provided does a good job intro'ing some of them, and the linked paper Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding  (ANS) seems interesting.
There are a lot of other gems in his blog. He points to resources and other people he has worked with, making it much easier to get the bigger picture on related topics.
I don't think he is the first; although RAD Game Tools has been cagey about the details, many strongly suspect that their recently announced Kraken (and some of their previous products) use ANS in some form; see the discussion here:
But even with that, Yann might be the first. Yann's work on FSE is old, and if I'm reading this correctly it is a mix of his own work and Jarek Duda's. But it seems one of his failed attempts challenged Jarek to invent another version of ANS called rANS, and as you can see in the encode.ru comments from the RAD Game Tools folks, they seem to be using rANS at least for Kraken. Regardless, these are impressive works, and these people are bouncing ideas off each other, being challenged and inspired, which is a very good thing for the rest of us :).
i.e. download speed is up to 20% faster with Zstd as the compression algo for the backing store compared to Deflate. Assuming bandwidth is available ;)
Not sure exactly when repcodes were invented. Igor Pavlov had already used them in 7zip.
Huffman requires at least one bit to represent any symbol, because it finds unique prefix codes for every symbol by varying the leading bits.
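That one-bit floor is easy to demonstrate with a toy Huffman builder (a sketch; the two-symbol alphabet and frequencies are made up):

```python
import heapq

def huffman_lengths(freqs):
    # Build a Huffman tree bottom-up; return the code length of each symbol.
    # Heap entries are (frequency, tiebreaker, {symbol: depth}).
    heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, d1 = heapq.heappop(heap)
        f2, _, d2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

# Even at 99% probability, 'a' still costs a full bit per occurrence,
# although its information content is only about 0.014 bits:
lengths = huffman_lengths({"a": 99, "b": 1})
assert lengths == {"a": 1, "b": 1}
```

That gap between 1 bit and the true information content is exactly what arithmetic coding, range coding, and ANS recover.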
Arithmetic coding encodes symbols as fractional numbers of bits, by using binary fractions.
It divides up the range to make this work. In the end, you get one fraction per "message" that can be decoded back into the message (i.e. a fraction like 0.53817213781271231...)
Range coding is similar; it just uses integers instead of floating point. You get one number that can be decoded back into the message (i.e. a number like 12312381219129123123).
Note that in both of these, you have not changed the number system at all. The number of even numbers and odd numbers still has the same density.
Another way to look at the above is based on what a bit selects.
In Huffman, a single bit is generally not enough to select something; the prefixes are long. So you need to walk a bunch of bits to figure out what you've got.
In arithmetic or range coding, a single bit selects a very large range. They change the proportions of those ranges, but at some point, something has to describe that range, because it is a range. Knowing you have gotten to the 8 in 0.538 doesn't tell you anything on its own; you need to know "the subrange for symbol a is 0.530 ... 0.539", so it's a. So you have to transmit that range.
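The interval-narrowing idea can be sketched with a toy float-based coder (a made-up two-symbol model; floating-point precision limits it to short messages, which is part of why real coders use integers and renormalization):

```python
PROBS = [("a", 0.9), ("b", 0.1)]   # made-up model: p(a)=0.9, p(b)=0.1

def encode(message):
    # Repeatedly narrow [low, high); each symbol selects its sub-range.
    low, high = 0.0, 1.0
    for sym in message:
        span = high - low
        cum = 0.0
        for s, p in PROBS:
            if s == sym:
                low, high = low + cum * span, low + (cum + p) * span
                break
            cum += p
    return (low + high) / 2        # any number inside the final range works

def decode(x, n):
    # Re-run the same narrowing, asking which sub-range x falls into.
    out = []
    low, high = 0.0, 1.0
    for _ in range(n):
        span = high - low
        cum = 0.0
        for s, p in PROBS:
            if low + (cum + p) * span > x:   # x lies in this sub-range
                out.append(s)
                low, high = low + cum * span, low + (cum + p) * span
                break
            cum += p
    return "".join(out)

x = encode("aabaa")
assert decode(x, 5) == "aabaa"
```

Note how the decoder needs the sub-range boundaries at every step: the model (the ranges) must be shared, and the state is an interval, i.e. two numbers.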
ANS is a different trick. Instead of encoding things using the existing number system, it changes the number system.
That is, it redefines the number system so that our even and odd numbers are still uniformly distributed but have different densities. If you do this in the right way, you end up with a number system that lets you only require one number to determine the state, instead of two (like in a range).
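A toy rANS round-trip shows what "changing the number system" buys: the whole message folds into a single integer state (made-up counts, no renormalization, so the state simply grows as a big integer):

```python
# Toy rANS with a made-up two-symbol alphabet.
freqs = {"a": 3, "b": 1}           # p(a)=3/4, p(b)=1/4
M = sum(freqs.values())            # total count, 4
cum = {"a": 0, "b": 3}             # cumulative counts: a owns slots 0-2, b slot 3

def encode(symbols):
    x = 1                          # the single integer state
    for s in symbols:
        # The "redefined number system": symbol s occupies freqs[s] of every
        # M consecutive integers, so frequent symbols grow x more slowly.
        x = (x // freqs[s]) * M + cum[s] + (x % freqs[s])
    return x

def decode(x, n):
    out = []
    for _ in range(n):
        slot = x % M               # which symbol owns this slot?
        s = "a" if slot < cum["b"] else "b"
        x = freqs[s] * (x // M) + slot - cum[s]
        out.append(s)
    return out[::-1]               # rANS decodes last-in, first-out

msg = list("aabab")
assert decode(encode(msg), len(msg)) == msg
```

Unlike the range coder above, the state here really is one number, and decoding is a modulo plus a table lookup, which is where the speed comes from.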
I thought this wasn't possible with ANS. Or has this changed?
Static is much cheaper: it uses the same probabilities for the entire data block (e.g. 30 kB), and the probabilities are stored in the header. Practically all Huffman and tANS compressors work this way (there are exceptions, however: https://en.wikipedia.org/wiki/Adaptive_Huffman_coding ).
Adaptive can start with e.g. a uniform probability (no need to store it in the header) and learns on the way - it is more costly but gives better compression; it is used with arithmetic coding or rANS. See
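The adaptive idea fits in a few lines (assumption: a toy order-0 byte model with +1 updates, for illustration only, not any particular codec's model):

```python
# Encoder and decoder both start from the same uniform counts and apply the
# same update after each symbol, so no table needs to be shipped in a header.
counts = [1] * 256                         # uniform prior: every byte counts 1

def prob(byte):
    return counts[byte] / sum(counts)

def update(byte):
    counts[byte] += 1                      # "learns on the way"

assert prob(ord("e")) == 1 / 256           # before seeing anything: uniform
for b in b"eeeee":
    update(b)
assert prob(ord("e")) == 6 / 261           # model has shifted toward 'e'
```

The cost is that probabilities must be recomputed as you go, which is why adaptive models pair naturally with coders like arithmetic coding or rANS that take per-symbol probabilities.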
Thanks for the resources.
"The x-axis is a decreasing logarithmic scale in megabytes per second; the y-axis is the compression ratio achieved."
I'd love to see a version of this chart that also included Brotli. (And I'm somewhat surprised Brotli isn't mentioned at all.)
(Disclaimer: I work at Google, which made Brotli)
Unfortunately the graphs leave a bit to be desired, especially when you're trying to compare two algorithms. They provide the raw data, but it needs a little munging to make it workable.
For my day job I made the following graph to compare zlib level1 and zstd. This was using the "peltast" platform which is a Xeon based system, because that was most relevant for us.
I'll make one with brotli now.
This is with brotli level 1, by the way. My understanding is that brotli is pretty quick through the first few levels, but the levels that ask for the highest compression are insanely slow (which is a valuable option to have for things like game assets, which are compressed once and delivered many times!)
So a test between zstd and brotli would show brotli in a poor light if it used a mixed corpus, but a test between zstd and brotli on a web corpus would give an advantage to brotli...
I have a feeling that the dictionary was designed with the specific goal of performing well on a specific corpus similar to the Large Text Compression Benchmark. It has quite a few words and phrases that I'd associate with Wikipedia's "house style".
- 9216 phrases total
- 5857 (63.5%) pure ASCII phrases, mostly English with a few Spanish words thrown in
- 1027 (11.1%) CJK (Chinese, Japanese, and Korean, and mostly the first two) phrases -- it's very hard to tell Chinese and Japanese apart in this context; I didn't try
- 158 (1.7%) phrases containing extended Latin-1 characters (nearly all Spanish words)
- 303 (3.3%) Cyrillic script (probably Russian) phrases
- 322 (3.5%) Arabic phrases
- 172 (1.9%) Devanagari script (Hindi) phrases
Plus a few miscellaneous other scripts and generally unclassifiable content.
Because of its seemingly haphazard dictionary, I wouldn't rule out zstd outperforming brotli, if trained on a good dataset.
Uh ? What format of data was this ?
I did a pretty large test (2gb+) on OBJ/STL 3d data and brotli compressed within ~5% margin of lzma, and this holds true on other binary data I've compared.
It also compressed better than zstd (as in compression ratio) on the same data at their highest respective compression settings:
bro -quality 11 -window 24
zstd --ultra -22
So I find it baffling that you find brotli to be poor for any binary data, could you share the data in question ?
I understand that brotli is incredibly slow compared to LZMA when using these high-ratio settings (q11, w24), so slow as to be impractical in production even in a write-once, read-many scenario if you have any non-trivial amount of data being regularly produced. I do not want to have a farm of machines just to handle the brotli compression load of our data sets just because it is 10x slower than LZMA.
Compression time is indeed the achilles heel of Brotli, which is why it's something I would only use for compress once (and preferably decompress very often) scenarios.
Compared to lzma at its best compression setting for this particular data (-mx9 -m0=LZMA:d512m:fb273:lc8), brotli took 6 minutes and 4 seconds to compress, while the same data took 1 minute and 47 seconds with lzma.
On the other hand, brotli decompressed the same data in 0.6 seconds, while it took lzma 2.2 seconds.
This seems to be a slightly old doc - from April.
One thing I haven't figured out from either today's post or Yann's blog is whether Zstandard switches between huff0 and FSE depending on compression level, or whether it somehow uses both together. Also, the post says it's both OoO-friendly and multi-core friendly, but are the speed benchmarks in a single-core or multi-core context? Is only the format/algorithm multi-core friendly, or can the standard CLI run multi-threaded?
All benchmarks today are single threaded. The algorithm itself is single threaded, but can be parallelized across cores. We will soon release a pzstd command line utility to demonstrate this, similar to pigz, which accelerates both compression and decompression.
Zstandard uses both huff0 and FSE together when it compresses -- it doesn't switch between them based on the input.
If I wanted to use this in the pipeline of my server's journaling system, is there any requirement that I restart the stream periodically?
That is to say, should I use it per journal entry (probably not a good idea for short messages), for the entire uptime of the writer, or with periodic restarts?
Obviously I can measure and find out for myself, but I wondered if you had any thoughts. Thanks!
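One cheap way to get a feel for the tradeoff before measuring with zstd itself: the shape of the question (ratio versus restart interval) can be sketched with stdlib zlib as a stand-in, using made-up entries:

```python
import zlib

# Made-up, repetitive log entries; real journal data will differ.
entries = [b"journal entry %d: service started ok\n" % i for i in range(1000)]

def ratio(restart_every):
    # Restart the stream every `restart_every` entries and measure the
    # overall compression ratio (original size / compressed size).
    compressed = 0
    for i in range(0, len(entries), restart_every):
        chunk = b"".join(entries[i:i + restart_every])
        compressed += len(zlib.compress(chunk))
    return sum(len(e) for e in entries) / compressed

# Per-entry streams pay the header and model-warmup cost over and over:
assert ratio(1000) > ratio(1)
```

The general expectation: longer streams compress better, and periodic restarts trade a little ratio for bounded loss on corruption and bounded decoder state.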
"Zstandard has no inherent limit and can address terabytes of memory (although it rarely does). For example, the lower of the 22 levels use 1 MB or less. For compatibility with a broad range of receiving systems, where memory may be limited, it is recommended to limit memory usage to 8 MB. This is a tuning recommendation, though, not a compression format limitation."
8MB for the smallest preset? Back in the mid-2000s, I was attending a Jabber/XMPP discussion, about the viability of using libz for compressing the stream. It turned out that even just a 32kb window is huge when your connection server is handling thousands of connections at a time, and they were investigating the effect of using a modified libz with an even smaller window (it was hard-coded, back then).
I know Moore's law is in ZStandard's favor w.r.t. memory usage (what's 8MB when your server's got 64GB or more?), but I think it's useful to note that this is squarely aimed at web traffic backed by beefy servers.
In the mid-2000s it was still the accepted norm to spawn one thread for each connection, where the memory usage of the compressor would have been a problem. I doubt that it's a problem with today's software architecture.
The concern I have is that this makes it sound like the compressor can choose how much memory the decompressor will need to use. Does this mean that zstd can't be used in a potentially adversarial environment? (Eg. is there a denial-of-service vector here by forcing the server to use large amounts of memory to decompress my requests?)
Note that 8 MB is the size of the L3 cache on some modern Intel chips. You want fast lookups in the window area, where you constantly do random-access reads.
Would HTTP servers first have to add support, then browser vendors would follow?
The browser sends the server a request header indicating which compression methods it understands. Current Firefox for example sends
Accept-Encoding: gzip, deflate, br
This means the adoption path for the web would be an implementation in at least one major browser, which advertises this capability with the Accept-Encoding header. Then any server can start using Zstandard for clients that accept it.
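Server-side, the negotiation is just a header check; a minimal sketch (the preference order is hypothetical, and a real server would also honor q-values):

```python
def pick_encoding(accept_encoding):
    # Parse "gzip, deflate, br" into a set of coding names (q-values are
    # dropped here; production code should weigh them per the HTTP spec).
    offered = {part.strip().split(";")[0] for part in accept_encoding.split(",")}
    for enc in ("zstd", "br", "gzip"):     # hypothetical server preference
        if enc in offered:
            return enc
    return "identity"                      # no common coding: send uncompressed

assert pick_encoding("gzip, deflate, br") == "br"
assert pick_encoding("gzip, deflate") == "gzip"
```

Once a browser adds "zstd" to that header, servers can opportunistically switch without breaking older clients.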
Also, I actually discovered something very interesting (to me at least): the attached link says https://github.com/Cyan4973/zstd but then redirects to https://github.com/facebook/zstd . Anyone know why?
EDIT: After a little bit of sleuthing, it looks like the author of zstd (github.com/Cyan4973) is now contributing to github.com/facebook/zstd
And the page layout for lz4 looks the same as zstd
Anyone know if Yann Collet works for/with facebook on things other than zstd?
EDIT 2: In the time it took me to google a couple things, looks like the children comments have already answered my questions.
Also, previous discussions on zstd (not that it's completely relevant): https://news.ycombinator.com/item?id=8941955
I believe this is just GitHub's standard behavior when a repository is moved to another namespace. We recently renamed https://github.com/D-Programming-Language to https://github.com/dlang , and e.g. https://github.com/D-Programming-Language/dmd redirects in the same way.
Upon looking further, it turns out that the author, Yann Collet, works at Facebook now; the repo would thus have been transferred to Facebook. Author's blog.
Yep, and Google too
> it turns out that the author Yann Collet  works with Facebook now
On another note, I wonder what happens when personal and professional lives interfere. The profile of Yann Collet in the blog post is the above link, and I can't help but think: why am I on Facebook looking at a baby's picture on someone's profile instead of GitHub or at least LinkedIn? Seems like something he might want to keep private (and yes, I'm totally assuming here, I don't know his preferences).
Now, I know you can restrict the stuff you post to friends only instead of public like he did, but it is still something people (and Facebook, for its employees) should consider if they want their Facebook profile to become their professional contact page.
lzfse 45 MB/s encode, 229 MB/s decode, 1.12 comp ratio
zstd 181 MB/s encode, 713 MB/s decode, 1.13 comp ratio
Note that LZFSE has a somewhat different goal, however: it's designed to be the most power-efficient compression algorithm out there, in other words on mobile devices LZFSE optimizes for bytes-per-watt rather than bytes-per-second. Zstandard, on the other hand, runs multiple pipelines and such--it's banking on having a server-class processor to run on.
Edit: hardware is a 2013 MacBook Pro, pretty fast flash storage, and 2 cores/4 threads. I warmed cache before each run and sent output to /dev/null, so the numbers above are best-case.
Sure, but is that even relevant? I mean, is there any way that lzfse could possibly be more power-efficient per byte than zstd when zstd is 3-4 times faster for the same compression ratio? According to the docs zstd doesn't have any support for multiple threads right now, so it should be a fair comparison.
As for how to do it, I'm not sure... but if I were a valued customer of various CPU suppliers, and were famous for the depth of my pockets, I'm sure I'd be able to find somebody to explain it to me ;)
I'm asking because Apple claims the github code is a reference implementation. Those typically aren't tuned for speed or efficiency.
Not surprising. Apple-originated tech is always kind of middle of the road.
Are there cases where an algorithm takes 10x the wall clock time to execute, but actually uses less energy on the same chip?
(Memory use/access is the main thing I guess that could be different.)
It can be. Certain operations are more power efficient than others - subtraction and then a check for negative value instead of comparison, certain vector operations, loop unrolling, using less memory/bus traffic...
>> Are there cases where an algorithm takes 10x the wall clock time to execute, but actually uses less energy on the same chip?
Slower code can be more power efficent. You're just tuning for different results.
I'd imagine this also requires very detailed knowledge of the chip, microcode, etc. Probably hard for x86 but I guess this kinda stuff would be on ARM where the programmer has deep vertical access (like as mentioned, Apple).
1. Apple tried to write a fast, reasonably well-compressing alternative to LZ4 (thus improving power usage) by creating LZFSE since none existed, but it was beaten out handsomely by Zstandard.
2. Following a parent comment, Zstandard might do some things that depend on a highly OoO CPU with lots of cache and an extremely good branch predictor, and it could be significantly slower on an ARM chip, even Apple's, despite how good they are. Or it could still be slower, but on ARM the gap might not be as big and the decision not as cut and dried as it seems now.
Would love to know what the actual case is from someone involved in LZFSE.
On other architectures the picture gets a bit murky; it seems to get handily beaten by pigz, through what at first blush I'd guess is just sheer parallelism. It's got solid performance, and is without a shadow of a doubt faster than vanilla gzip. If/as/when I get time, it'll be interesting to dig into why performance is worse there.
In particular note the huge difference in branches between gzip and zstd on decompress:
8959780663 branches # 143.024 M/sec
2969481781 branches # 64.454 M/sec
and on misses:
542158823 branch-misses # 6.05% of all branches
89060880 branch-misses # 3.00% of all branches
I'm not a C programmer, so understanding what happened is a bit beyond me, but:
1) to compile on linux it needs the -pthread flag passed to it, Makefile is missing that (compiles fine on OS X)
2) decompression over stdin appears to be effectively impossible; it still demands an input file. Compression over stdin works fine.
Incompressible performance measurements are important for interactive/realtime workloads and the numbers are extremely interesting because they can differ dramatically from the average case measurements. LZ4 for instance has been measured at doing 10GB/sec on incompressible data on a single core of a modern Intel Xeon processor. At the other end of the spectrum is the worst case scenario for incompressible data where performance slows to a crawl. I do not recall any examples in this area, but the point is that it is possible for algorithms to have great average case performance and terrible worst case performance. Quick sort is probably the most famous example of that concept.
I have no reason to suspect that zstd has bad incompressible performance, but the omission of incompressible performance numbers is unfortunate.
Good compressors can't squeeze any more out of a JPEG, but they can back off fast and go faster. Snappy was designed to do this, and even implementations of gzip do it too. It greatly reduces the fear of CPU overhead from always-on compression. I wonder how Zstd handles such cases?
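A common way to implement that back-off is to probe a small prefix and store the block raw when the probe doesn't shrink; a sketch using stdlib zlib (the framing byte and threshold are made up, real formats have their own "raw block" encoding):

```python
import os
import zlib

def compress_block(block, sample=4096):
    # Probe: try squeezing a small prefix at the cheapest level.
    probe = zlib.compress(block[:sample], 1)
    if len(probe) >= len(block[:sample]):   # prefix looks incompressible
        return b"\x00" + block              # store raw, skip the CPU cost
    return b"\x01" + zlib.compress(block)   # worth compressing fully

jpeg_like = os.urandom(1 << 16)             # random bytes stand in for a JPEG
assert compress_block(jpeg_like)[0] == 0    # stored raw
assert compress_block(b"a" * 65536)[0] == 1 # compressed
```

The probe costs a fraction of a full pass, so incompressible inputs go through at nearly memcpy speed instead of slowing the pipeline to a crawl.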
*Ignoring security altogether
I love C, it is not the enemy everyone makes it out to be.
It's already in Debian: https://packages.debian.org/stretch/zstd and judging by the small requirements, it is portable indeed.
gzip -9: 27.574s, 48MiB output
zstd -9: 14.182s, 41MiB output
That being said: has anyone actually given any indication that there are any relevant patents in the first place?
Using zstd gives Facebook a free license on ALL your patents.
Using Opus gives Facebook only a license on patents that apply to Opus.
So no, large tech companies can and have given MUCH better grants for compression tech, than Facebook is doing.
I just read the Opus patent summary. It seems like if zstd followed the same license, using it wouldn't give Facebook any license, but if I sue Facebook I lose the license to use zstd. Am I correct in that?
If you accept the job you give up your work. It's not like there's no compensation.
> Am I correct in that.
Yes, the retaliation (and hence, implicit license to Facebook) is only for the technology itself.
Because as far as I can see, Facebook has an exception saying that if they for some reason patent-sue you first, you would be allowed to countersue without losing your license to their patents.
That could be a misinterpretation, I suppose. But if it's right, then Facebook's license seems superior for those parties who wish to end software patents.
It's mostly superior for Facebook. If you're a party that wants to end software patents, but intends to use zstd in any place you might want to interface with a company that doesn't hold the same position, then you're screwed.
If you want to end software patents, and Facebook sues with one (not applying to zstd), you're also still screwed. This is relevant because even if you're against software patents, you can take them out defensively. But this license makes that useless.
Compare to GPL vs LGPL, or how the free software codecs all eventually moved to BSD.
I guess the other factor for mobile is, besides memory and decompression speed, how do various compression schemes fare battery wise?
HTTP request times have a long tail; they can get really slow for the many, many people limited to slow connections. Decompression times are going to be much more consistent. Our aim here should be to improve the 10% slowest requests, and you do that by optimizing the actual transfer.
Spoilers: zstd wins at ethernet and wifi (and is among the best in 4G), lz4 wins at hard drive encryption… both were designed by the same author.
http://cbloomrants.blogspot.com http://www.cbloom.com/oodle_arm_report/ http://www.cbloom.com/rants.html
Caveat: proprietary commercial product.
But for compression, wouldn't the majority of languages actually use bindings rather than implementing it natively, for performance, making this moot?
I would assume the road to finding the optimal solution and understanding why it worked is much more complicated than the actual code. And a quick look doesn't suggest it's a very big code base for the library itself.
But at least within the perimeter of your own systems you can totally profit from this technology now.
They should have named it letter-zip, along the lines of gzip, bzip, and xzip, with the extension letterz. "fz" would have been a good one since they work at Facebook.
If you disagree, what do you think of naming your kid "Adolph Hitler [lastname]"? Most people agree that a name like that will cause great harm to a child growing up because of the constant ridicule and ostracization he'd inevitably face. That's an extreme example, but names are important.
Note that I don't think this project's name is horrible, but I don't think it's very good either, and could be a lot better.
(I am not sure if i need to clarify that my original comment was meant to be a joke about Huffman coding)