Sadly, the description of arithmetic coding bears almost no resemblance to the actual algorithm (it roughly describes the equal probability case, but that misses most of the point). The description of Shannon-Fano as "bottom up" and Huffman as "top-down" is also exactly backwards (the actual descriptions of those algorithms are accurate, but the labeling is confused).
The article contains a lot of terms you can search for if you are interested in these things, but sadly is not very informative on its own.
An interesting read; I will need to go through it with more care when I get home. A couple of personal connections: I studied with both David Huffman (though not for compression) and Glen Langdon (one of the arithmetic coding pioneers), so I need to see how this article compares with my notes.
I also need to see about getting some of my old (15+ years) course notes online.
Just in those two, much history and memory has been lost, and it's a real hit for UC Santa Cruz's computer engineering and science departments (Huffman passed away a number of years back, Glen this year after a period of retirement).
Edit - there is a serious disconnect in breadth between the scope of the article and the title it was given. But it's better than nothing.
I took Huffman's class, 'Cybernetics', at UCSC when I was an undergrad, mainly because I thought I'd learn how to make robots.
The first day, it didn't take long for the lecture to turn to 7-dimensional spheres, their packing, and its applications to network routing. He was a tough bastard -- lots of multiplication of large numbers on tests, with only a table of logarithms and exponentials. He loved tricky questions on the tests -- for example, he taught us Karnaugh maps, mentioned that the sides wrap around, then had an example on the final with 1s in all four corners (I hadn't realized maps wrap around at the corners, too).
Great teacher: wonderful lecturer, hard grader. I failed the class (I was a biochem major and still graduated); it was the only CS course I took in college.
I really wish I had kept the notes from that class.
This article is very good, but it says very little about gzip. Even if there was no technical innovation, I think gzip was really important for the development of compression software that is not patent-encumbered.
Seems like even though people have gone after this, there has not been much innovation in the last few decades. Most of the new stuff looks quite incremental. It almost seems like we are losing our edge in our ability to build from the ground up, because we are part of a system that keeps an eagle eye on new ideas. We need to take some of that pressure off so that we can think a bit outside the box.
I wonder how much of this is that relatively simple techniques work very well on small amounts of data, and that more complex techniques don't perform sufficiently better to justify baking them into a compression format?
But when we look at the state of the art like ZPAQ, the AI techniques used don't seem to be much more complex than, say, a one or two layer neural network which might as well be from the 1970s. You don't see anything fancy like deep networks or other modern staples like random forests.
So this makes me wonder: maybe compression performance has stagnated because we're not willing to provide compression algorithms extremely large amounts of data or runtime, and so simple algorithms really do perform best with the minimal resources we're willing to use for compression. (People are happy to run neural networks on thousands of GPUs with many gigabytes of data and wait weeks for training to finish; can you imagine a compression utility which required that?)
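For reference, the kind of "one or two layer network" used in PAQ-style context mixing is roughly the sketch below. This is my own toy illustration, not PAQ's code: each model predicts the next bit, the predictions are combined in the logit domain with online-trained weights, and the mixed probability goes to the arithmetic coder.

    import math

    def stretch(p):
        # Logit: map a probability in (0, 1) onto the real line.
        return math.log(p / (1.0 - p))

    def squash(x):
        # Logistic function: inverse of stretch.
        return 1.0 / (1.0 + math.exp(-x))

    class LogisticMixer:
        """Mix several bit predictions with online-trained weights --
        roughly the single-layer 'network' PAQ-style mixers use.
        Illustrative toy only."""
        def __init__(self, n_models, lr=0.02):
            self.w = [0.0] * n_models
            self.lr = lr
            self._inputs = [0.0] * n_models

        def mix(self, probs):
            # probs: each model's estimate of P(next bit = 1)
            self._inputs = [stretch(p) for p in probs]
            return squash(sum(w * x for w, x in zip(self.w, self._inputs)))

        def update(self, p_mix, actual_bit):
            # Online gradient step on the coding (log) loss.
            err = actual_bit - p_mix
            self.w = [w + self.lr * err * x
                      for w, x in zip(self.w, self._inputs)]

    # Usage: hand p to an arithmetic coder, then tell the mixer the real bit.
    mixer = LogisticMixer(n_models=2)
    p = mixer.mix([0.7, 0.55])     # two models' predictions for the next bit
    mixer.update(p, actual_bit=1)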
It naturally follows from the compression-as-intelligence school of thought that building general-purpose compression/intelligence is hard.
I very much believe in domain-specific intelligence, and correspondingly domain-specific compression. Here's a practical business use case for lossless compression (in-memory analytics), which I have been developing:
> that building general-purpose compression/intelligence is hard.
Yes, but my point is that we seem to be doing better at general-purpose intelligence than at compression, despite the apparent equivalence of the two problems.
I think it's also partly due to the way that the typical dev team develops their algos. They all start with a base version and later push out revisions to their earlier base work. The tendency is therefore to just refine and not go for more game-changing answers. If they executed over longer terms and were more isolated from their markets, we might have seen more range in the changes made.
Well, part of this is that some of these are just "solved problems". For example, once you have arithmetic coding, you never have to worry about the actual bitwise encoding again (anything on top of that is just speed optimization). That reduces the problem to modeling and probability estimation.
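To make that split concrete, here is a toy float-based arithmetic coder (my sketch, not anything from the article): the coder itself only narrows an interval, and every interesting decision lives in the probability model it is handed. Floating point limits this to short messages; real coders use fixed-point arithmetic with renormalization.

    def arith_encode(symbols, model):
        """'model' maps a symbol to its cumulative probability interval
        [lo, hi). The coder just narrows an interval; all the modeling
        lives in 'model'."""
        low, high = 0.0, 1.0
        for s in symbols:
            lo, hi = model[s]
            span = high - low
            high = low + span * hi
            low = low + span * lo
        return (low + high) / 2   # any number inside the final interval

    def arith_decode(code, n, model):
        out = []
        low, high = 0.0, 1.0
        for _ in range(n):
            span = high - low
            target = (code - low) / span
            for s, (lo, hi) in model.items():
                if lo <= target < hi:
                    out.append(s)
                    high = low + span * hi
                    low = low + span * lo
                    break
        return out

    # A static model: P(a)=0.5, P(b)=0.25, P(c)=0.25 as cumulative intervals.
    model = {"a": (0.0, 0.5), "b": (0.5, 0.75), "c": (0.75, 1.0)}
    msg = list("abaac")
    code = arith_encode(msg, model)
    assert arith_decode(code, len(msg), model) == msg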
There's an inherent limit at the entropy of the source, and usually it is not too hard to get within a few percent of that entropy (as near as we are able to model it). After that, you can do increasingly complex things, but for rapidly diminishing returns. For example, there are definitely lossless audio compressors that can make things smaller than FLAC, but usually only by a percent or two, and they are an order of magnitude or more slower.
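If you want to see how close a file sits to its order-0 entropy, a few lines of Python will do it (the file name below is a placeholder):

    import math
    from collections import Counter

    def order0_entropy_bits_per_byte(data: bytes) -> float:
        """Shannon entropy of the byte distribution: the bound for any
        memoryless (order-0) model. Real files have structure beyond
        this, which is what the modeling stage chases."""
        counts = Counter(data)
        n = len(data)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    data = open("some_file", "rb").read()   # placeholder path
    h = order0_entropy_bits_per_byte(data)
    print(f"order-0 entropy: {h:.3f} bits/byte "
          f"(~{h * len(data) / 8 / 1024:.1f} KiB ideal order-0 size)")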
To me, a lot of the interesting research at this point is exploring the compression/speed trade-off. That's what makes LZMA interesting (it dominates bzip2 on both axes). Google has also done some interesting work in this space. This matters when your goal is to save network transmission time: if you're only sending the data once, anything you do has to be faster than just sending more data on the wire.
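The break-even arithmetic for the one-shot transfer case fits in a few lines; the numbers below are invented purely for illustration:

    # Back-of-the-envelope for a one-shot transfer; all numbers invented.
    raw_mb        = 100.0    # uncompressed size
    bandwidth_mbs = 12.5     # ~100 Mbit/s link
    ratio         = 0.35     # compressed size / raw size
    comp_mbs      = 50.0     # compressor throughput over uncompressed input
    decomp_mbs    = 400.0    # decompressor throughput over uncompressed output

    send_raw = raw_mb / bandwidth_mbs
    send_compressed = (raw_mb / comp_mbs
                       + raw_mb * ratio / bandwidth_mbs
                       + raw_mb / decomp_mbs)
    print(f"raw: {send_raw:.1f}s  compressed: {send_compressed:.1f}s")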
I don't think it's as simple as you've put it, in terms of solved problems and entropy limits. Part of the quest is to find a transformed space in which the probabilities bin out a bit better.
Can you provide any concrete indication that anyone was actually discouraged from researching general-purpose lossless compression by a "system that maintains an eagle's eye on new ideas"? Looks to me like forcing a social explanation on a technical problem.
To me it simply looks like there just isn't much possibility of improvement in that field. Ultimately, you're limited by the pigeonhole principle.
Special-purpose lossy compression, e.g. video - that's where you see leaps and bounds in active research and improvement.
Most of the past algos have been developed by commercial development teams. This means they are typically driven by the current requirements of the user market. The techniques that get developed therefore tend to be mostly increments to older techniques, and since corporations value their IP, these get patented and everyone is happy.
If we could step out of this corporate context, we would likely do better on the algos.
On the topic of video, the latest stuff is still based on decades-old core technology. It's still block-based motion-compensated transforms, which have been around since at least MPEG-1. I don't see any leaps and bounds in what's used now. No wavelets, no fractals, no other new stuff. Just better-optimized versions of the base.
This is no surprise: reading the article, the history of data compression seems to be riddled with patent lawsuits and with blocking "competitors" from using the same or a similar algorithm.
The reason there has not been much innovation is that even ZIP files are pretty close to the best you can do for general purpose compression. You might get a few more percent by improving the algorithms but there simply isn't much more compression to be done...
One interesting domain-specific class of compression algorithms not mentioned here is for lossless audio compression, which tends to use a different (though also pretty simple) technique, somewhat related to PPM. A common approach is to predict the waveform using linear predictive coding (LPC), and then entropy-code the residuals. FLAC does that, for example.
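Roughly, the predict-then-code-residuals idea looks like the sketch below (using FLAC's simplest fixed predictors and a toy Rice coder; this is illustrative, not FLAC's actual code):

    def fixed_predict_residuals(samples, order=2):
        """Predict each sample from the previous ones with a fixed
        polynomial predictor (like FLAC's order-0..2 fixed predictors)
        and return the residuals; warm-up samples pass through verbatim."""
        res = []
        for i, s in enumerate(samples):
            if order == 0 or i < order:
                pred = 0
            elif order == 1:
                pred = samples[i - 1]
            else:  # order == 2
                pred = 2 * samples[i - 1] - samples[i - 2]
            res.append(s - pred)
        return res

    def rice_encode(value, k):
        """Rice code for one non-negative value with parameter k:
        unary quotient, one stop bit, then k binary remainder bits."""
        q, r = value >> k, value & ((1 << k) - 1)
        return "1" * q + "0" + format(r, f"0{k}b")

    # Smooth signals give small residuals, hence short Rice codes.
    samples = [100, 104, 109, 115, 120, 124]
    residuals = fixed_predict_residuals(samples, order=2)
    warmup, rest = residuals[:2], residuals[2:]
    zigzag = [2 * r if r >= 0 else -2 * r - 1 for r in rest]  # signed -> unsigned
    print(warmup, [rice_encode(v, k=1) for v in zigzag])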
Are there compression techniques that are only "mostly" lossless? I was thinking something along the lines of: for some delta > 0 and N > 0, the probability that compression and decompression of a random stream of N bytes will result in loss is less than delta?
snappy is just dictionary coding with no entropy-coding stage. The code is formed from literals and backreferences -- a pretty simple design. It's designed for a reasonable level of compression with a very fast codec, so you use it for comms between programs, where you want messages to move around quickly with fewer bytes than raw. I wouldn't use it for saving tons of data on disk for a long time.
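The general shape, if you've never looked at one of these formats, is something like the toy parse below (my sketch, not snappy's wire format): greedy matching into literals and (offset, length) copies, with no entropy coder afterwards.

    def lz_parse(data: bytes, min_match=4, window=1 << 15):
        """Greedy parse into literal runs and (offset, length) copies.
        Naive O(n^2) match search for clarity; real codecs hash recent
        positions instead."""
        ops, i, lit_start = [], 0, 0
        while i < len(data):
            best_len, best_off = 0, 0
            for j in range(max(0, i - window), i):
                k = 0
                while i + k < len(data) and data[j + k] == data[i + k]:
                    k += 1
                if k > best_len:
                    best_len, best_off = k, i - j
            if best_len >= min_match:
                if lit_start < i:
                    ops.append(("lit", data[lit_start:i]))
                ops.append(("copy", best_off, best_len))
                i += best_len
                lit_start = i
            else:
                i += 1
        if lit_start < len(data):
            ops.append(("lit", data[lit_start:]))
        return ops

    # [('lit', b'abc'), ('copy', 3, 9), ('lit', b'XYZ')]
    print(lz_parse(b"abcabcabcabcXYZ"))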
One (new?) trick that snappy uses is to stop compressing for a while when it cannot find matches. This makes it much faster than other compressors on incompressible files.
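Roughly, the heuristic is that the stride between match attempts grows with the run of consecutive misses, so incompressible input gets scanned rather than searched. A toy illustration (the constants and the match probe here are made up, not snappy's):

    import os

    def skip_scan(data: bytes, has_match):
        """'has_match(pos)' stands in for the real match probe. The longer
        the run of misses, the bigger the stride."""
        i, misses, probes = 0, 0, 0
        while i < len(data):
            probes += 1
            if has_match(i):
                misses = 0
                i += 4                      # pretend we emitted a 4-byte copy
            else:
                misses += 1
                i += 1 + (misses >> 5)      # stride grows with consecutive misses
        return probes

    random_data = os.urandom(1 << 16)
    # Far fewer probes than the 65536 bytes scanned.
    print(skip_scan(random_data, lambda pos: False))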
By the nature of things, anything that compresses some input data must necessarily lengthen other input data, since you can't get away from the fact that there are only so many input files that can be represented by the output number of bits. In fact, it will almost certainly lengthen many more of the possible inputs than it shortens.
I once heard someone describe compression programs as 'expansion programs with interesting failure cases', and so, of course, the best compression program to use depends on exactly which failure cases you're interested in.
While true, this doesn't seem to be a practical issue. Any incompressible data can be encoded with only one bit of overhead, where the bit is a flag indicating whether the rest of the data is compressed. In practice, there is a header and a field indicating which compression method to use, and you pay for the size of the header. Adding support for another compression method is nearly free as far as space is concerned; one byte can switch between 256 of them. (Time is another matter.)
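As a concrete sketch of that flag (using zlib from the Python standard library, not any particular archive format):

    import os
    import zlib

    STORED, DEFLATED = 0, 1

    def pack(data: bytes) -> bytes:
        """Store-or-compress behind a one-byte method flag, so
        incompressible input costs one byte instead of growing."""
        compressed = zlib.compress(data)
        if len(compressed) < len(data):
            return bytes([DEFLATED]) + compressed
        return bytes([STORED]) + data

    def unpack(blob: bytes) -> bytes:
        method, payload = blob[0], blob[1:]
        return zlib.decompress(payload) if method == DEFLATED else payload

    incompressible = os.urandom(4096)
    assert unpack(pack(incompressible)) == incompressible
    assert len(pack(incompressible)) == len(incompressible) + 1
    print(len(pack(b"aaaa" * 1000)))   # much smaller than 4000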
PAQ (particularly ZPAQ) is pretty good at most things because it selects the model which works best. Variants of PAQ tend to be the Hutter Prize winners (paqh). PPM derivatives are generally excellent at text, source code, HTML, and things with that kind of word-like symbol distribution, which is why PPMd.H in particular is used by RAR and 7-zip for text compression (although RAR selects it automatically and unfortunately 7-zip doesn't seem to, probably because of the extra RAM overhead for decompression that PPMd.H introduces).
However, PAQ tends to be really slow, largely because it tries more things. It's highly tunable, but people who aren't compression geeks tend not to want to tune their compression. Presets are available.
That's pretty much why it hasn't caught on - speed. There may be some hybrid approaches that deliver a better compromise between context mixing's effectiveness and dictionary coding's speed and memory usage: I guess you could argue LZMA, bringing a Markov-chain algorithm into the mix, is one such, in a way. Sort of.
I'm also a little antsy about ZPAQ archives containing bytecode descriptions of the decompression algorithm needed; they are, broadly speaking, executable (and in some cases actually executed). That seems like the kind of thing that may invite security problems if approached without due caution.
Absolutely, the compression part of it (LZW) is, yes. (LZW has been out of patent for about a decade, by the way, but it isn't particularly noteworthy on its own.)
GIF looks like crap on some images because it's limited to a palette of [2, 4, 8, 16, 32, 64, 128, 256] colours, so many images have to be reduced to a palette (say, with an octree) and/or dithered (perhaps badly, as dithering tends to increase noise, if not entropy). Also, some techniques exist (one implementation can be found in Photoshop's "Save for Web") which perform lossy transforms on the data so that LZW compresses it better; the result is noisier, however, because it intentionally introduces repeating patterns.
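If you want to see the palette/dither damage for yourself, something like this does it (assuming a recent Pillow is installed; the file names are placeholders). The LZW step itself is lossless; the visible loss happens in the quantisation and dithering:

    from PIL import Image

    im = Image.open("photo.png").convert("RGB")

    # Quantise to 256 colours, with and without dithering
    # (quantize() dithers with Floyd-Steinberg by default).
    dithered = im.quantize(colors=256)
    flat     = im.quantize(colors=256, dither=Image.Dither.NONE)

    dithered.save("photo_dithered.gif")
    flat.save("photo_flat.gif")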