Hacker News
The Upcoming Generation of Lossless Data Compression, Part 1 (adamsmith.cc)
25 points by adamsmith on May 31, 2010 | 27 comments

Now take a group of, say, 100 JPEGs, and compress them into a single archive. You’ll get no real compression. [..] But do that with a compression algorithm from the future, and you’ll get a much smaller file.

Really? I guess PAQ8PX, Winzip 14, Stuffit 14 and PackJPG must be from the future.


It's worth noting that those are content-aware lossless compressors for single image files, not cross-file.

That said -- very cool -- I didn't know those existed!

PAQ8 is definitely futuristic, by which I mean that it is incredibly slow on my machine and uses a ton of RAM :-) http://en.wikipedia.org/wiki/PAQ#Comparison

I was sitting here with 4 physical cores (8 logical) and 6GB of RAM, with the goal of compressing 7GB of data as small as possible, as soon as possible.

7-zip fastest got it down to ~1.4GB in about 10 minutes. 7-zip ultra got it down to ~745MB in about 30 minutes.

Yet the only options it gave were up to 2 cores and up to 250MB of RAM. If only I could have traded more cores and RAM for more speed and compression. Hrm.

The storage cost for all jpeg files in the world probably doesn't justify a major standards effort. The effort behind converting most of the world from gif to png must have been millions of man hours.

That's a really important question.

It should be possible to cross-file de-dup more than just images.

One important test to watch will be Ocarina's market success, or lack thereof.

EDIT: also, because of the growth of proprietary data stores in the cloud, it becomes more and more lucrative to leverage this tech even if it's not a standard. E.g. if you were Amazon S3, you could expose JPEG on your interfaces but compress away redundancies with a non-standardized algorithm on your backend.

It should be possible to cross-file de-dup more than just images.

Sit your files on ZFS with dedupe enabled and it will do that to any block-of-data. Not using filetype specific methods like the post describes, but file-agnostic dedupe nonetheless.
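The file-agnostic, block-level dedup described above can be sketched in a few lines. This is a toy model, assuming fixed-size blocks and an in-memory store; ZFS's real on-disk format is far more involved:

```python
import hashlib

# Toy block-level dedup: split data into fixed-size blocks and store each
# distinct block once, keyed by its hash. A file becomes a "recipe" of
# block hashes. BLOCK and the dict-based store are illustrative choices.
BLOCK = 4096

def dedup_store(files: list[bytes]):
    store: dict[bytes, bytes] = {}   # block hash -> block contents
    recipes = []                     # per-file list of block hashes
    for data in files:
        recipe = []
        for i in range(0, len(data), BLOCK):
            block = data[i:i + BLOCK]
            h = hashlib.sha256(block).digest()
            store.setdefault(h, block)   # identical blocks stored once
            recipe.append(h)
        recipes.append(recipe)
    return store, recipes

def restore(store: dict[bytes, bytes], recipe: list[bytes]) -> bytes:
    return b"".join(store[h] for h in recipe)
```

Two files that share blocks then cost only one copy of each shared block, regardless of file type.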

Versioning systems do something like this already. Most store old versions of files as diffs against new versions (or the reverse); git stores files as deltas against similarly sized files of, I think, similar age (which are most likely older/newer versions of the same file).

There's also a fingerprinting technique that divides your file into chunks based on a fixed short string or rolling-checksum value and then hashes each chunk; it can be used to find good candidates for storing as diffs against each other. Unfortunately I don't recall what name this goes by or where I ran across it, or what system actually uses it.

If you want to build a model on English text, you have terabytes of webpages to use. Exposing sensitive information (like, the phrase "Adam, I'm sorry your cat has feline AIDS" compresses via LZ to position 1230352024, so your account has 1.950001GB free instead of 1.949999GB free) is the wrong way to squeeze a few more bits out; make sure you don't break through the ACL of each document.
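The leak warned about above can be demonstrated in toy form: compress attacker-controlled data in the same context as someone else's secret, and the compressed length alone reveals whether a guess matches. The secret and oracle here are made up for illustration:

```python
import zlib

# If a guess shares more bytes with the secret, LZ finds longer matches
# and the output shrinks -- the size itself becomes the side channel.
secret = b"sessionid=8f3a2c9d1e4b;"

def oracle(guess: bytes) -> int:
    # attacker only observes the compressed size, never the plaintext
    return len(zlib.compress(secret + guess, 9))

right = oracle(b"sessionid=8f3a2c9d1e4b;")   # exact match: one long LZ match
wrong = oracle(b"sessionid=qwertyuiopas;")   # same length, mostly literals
```

A correct guess compresses strictly better than an incorrect one of equal length, which is why cross-document compression must respect ACL boundaries.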

Another feature to be mindful of is how much power you'll need for the text model; larger models compress better, but need more time and space.

Lossless data compression, and information complexity in general, has been kind of a hobby of mine for the last 13-14 years, more out of OCD/curiosity than for practical purposes; like a puzzle, if you will.

From what I've gathered, I think the future of compression lies in the way we look at data patterns themselves. That is, not looking at patterns statistically, and not treating the problem as a statistical one, but taking some new approach via discrete math.

That is, a fresh new "out of this world" approach. I have some ideas where it might come from, but I don't want to sound like a mad man.

Are you talking about Fractal Compression methods?


Well, fractal compression is lossy: a reconstructive process that produces something resembling the original, but not identical to it. So, no.

I am talking about lossless compression. There are various things putting hard limits on what can ultimately be done, Shannon's entropy limit and Kolmogorov complexity being the most important ones. Current methods are more or less all analysis/statistics/continuous math. I think, from what I've gathered, that the new approach will come from the area of discrete mathematics, mainly combinatorics. Also (this is where the mad-man part comes in) cellular-automata-like systems.

Discrete math in itself offers great tools for doing R&D on paper, but proving results right later requires serious computing power. New research also needs new minds unpolluted by "what can and can't be done" in order to bring new ways of thinking. I won't bore you with details; I've had lots of ideas over the years, but pretty much all of them turned out to be flukes for lossless.

Sounds like he's talking about an application of Principal Components Analysis (see, e.g. http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/p... )

My guess is that doing PCA on hundreds of images is too expensive, but I think you could prove that it would give you optimal lossless compression.

PCA is optimal from an L2 perspective, but with images, that is often not the metric you want to use.

Of course, I was a little confused by the fact that he's talking about lossless compression but then mentions JPEG, which is lossy.

If you consider lossy compression algorithms, then for images what matters most is human perception. JPEG and the various MPEG standards for video already take advantage of this, by more heavily quantizing things that humans have trouble distinguishing changes in. Thus, while these compression schemes may increase the L2 reconstruction error, the results look "better" to human eyes than the results of PCA, where artifacts are quite noticeable.

If you do want to use PCA, then the following method is quite a bit more efficient than straight PCA on images: http://www1.cs.columbia.edu/CAVE/publications/pdfs/Nishino_P...

That's for a single image, but you could extend the idea out to multiple images.

The problem with PCA is that it's a linear method, and lighting effects are not linear across a full image. However, in local settings (i.e., small patches of the image), the linearity assumption often holds, and you thus get much more bang-for-the-buck.
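The transform-coding view of PCA in this subthread can be sketched with synthetic data: keeping all components is just an orthogonal rotation (invertible, hence lossless in exact arithmetic), while truncating to the top k components is the lossy, L2-optimal linear approximation. The image and patch size below are made-up illustrations:

```python
import numpy as np

# PCA on small image patches ("local" PCA as suggested above),
# using a synthetic image so the sketch is self-contained.
rng = np.random.default_rng(0)
img = rng.random((64, 64))

P = 8  # patch side: local PCA operates on small patches like these
patches = (img.reshape(64 // P, P, 64 // P, P)
              .transpose(0, 2, 1, 3)
              .reshape(-1, P * P))               # (64 patches, 64 dims)

mean = patches.mean(axis=0)
X = patches - mean
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # rows of Vt = principal components

k = 16
lossy = (X @ Vt[:k].T) @ Vt[:k] + mean    # top-k codes: compact but lossy
lossless = (X @ Vt.T) @ Vt + mean         # all components: exact reconstruction

err_k = np.linalg.norm(lossy - patches) / np.linalg.norm(patches)
```

The L2-optimality mentioned above refers to the truncated version: among all rank-k linear approximations, the top-k components minimize `err_k`, which is exactly the metric that often disagrees with human perception.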

I've been wondering for a long time why nothing better than JPEG seems to be catching on for lossy photo compression. It seems like it would be possible to trade CPU cycles for file-size (in the same way that h.264 vs MPEG2 does), but I suppose it's really hard to get critical mass with anything that isn't backwards compatible...

I also wonder if it would be possible to establish a set of well known reference bits to compress against.

E.g. if you shipped an additional 10 MB of "data" with every browser, could you reduce the average image size on the internet by X% by leveraging that 10 MB?

(I have no idea; just an interesting thought exercise. Could make a cool masters thesis for someone.)
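One way to prototype the "shared reference bits" idea is DEFLATE's preset-dictionary feature: both ends ship the same dictionary, and common substrings in the payload compress to back-references into it. A sketch with made-up data (this is not a real proposed standard, just zlib's existing `zdict` mechanism):

```python
import zlib

# Hypothetical shared "reference bits": both sides hold the same preset
# dictionary, so the compressor can reference it instead of sending bytes.
shared_dict = b"<html><head><title></title></head><body><p></p></body></html>" * 8

payload = b"<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"

# baseline: compress without any shared context
plain = zlib.compress(payload, 9)

# with the preset dictionary (the receiver must supply the identical one)
co = zlib.compressobj(level=9, zdict=shared_dict)
with_dict = co.compress(payload) + co.flush()

recovered = zlib.decompressobj(zdict=shared_dict).decompress(with_dict)
```

The dictionary-assisted output is smaller because most of the payload becomes back-references; the cost is that every decompressor must already hold the exact same dictionary, which is the "ship 10 MB with every browser" part of the idea.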

I can see how it might potentially work with finely crafted Bloom filters (lossy, of course!). I'm just not sure training the set of filters would be worth the effort, because computing the filters might take a very long time at that level of information complexity. I've actually been toying with something like that over the years; I had some models I'd programmed (not for pictures, though), but I gave up.

Sounds like the kind of thing that a company like Google could tackle. It would be especially good in the mobile world, where bandwidth is more limited.

A well-funded organization could do it for sure, and Google would benefit from it, of course. But from my limited research over the years, I'm no longer convinced that any further significant advancement can be made via continuous mathematics. It might be personal bias from my own experience, but I have a strong hunch that new approaches via discrete mathematics will be the key to any serious advances approaching Shannon's limit / Kolmogorov complexity (if those hold true, and they should as far as we know).

I like it. With the size of hard drives now, it seems like that would be a good trade-off against bandwidth (which is scarcer). It would save bandwidth and speed things up, two things much more desirable than a few more megs of HDD.

I suggest that you make a blog post or Tell HN about this and post it as a top level submission. I'd love to hear more opinions about this.

Go for it if you're passionate about it!

I understand the principle but don't understand enough of how it would actually work to speak for it as much as I'd want to.

Such a technique could also be especially useful for low bandwidth situations, most notably Paul English's proposed Internet for Africa effort, http://joinafrica.org/ .

If you're interested in this pursuit you should also look at SPDY, Google's suggested augmentation of HTTP. It would probably speed up the web more than better compression in many cases.

This is somewhat possible; I thought about it a few years ago. I'm not sure it helps much, or at all, with current compression algorithms though. I should probably give it a second look some day.

IIRC JPEG 2000 has some patent FUD and JPEG XR has "it's from Microsoft" FUD.

There was some really forward-thinking work on Fractal Image Compression in the late 90s, but largely due to time-efficiency, plummeting storage costs, and patent encumbrances, it's never caught on.

Why go forth with these "futuristic" means of optimizing/compressing current formats, when said current formats will in all likelihood be abandoned soon enough in favor of something better? It appears as if this company's business idea is entirely redundant. After all, shedding the past in favor of the future is the exact history of all we've done in computing since... since... since before the dinosaurs died.

I fail to see the point of this in light of a much smarter idea: keep this approach in mind when developing new formats. Maybe I'm just not thinking outside the "boxeola" enough here... (or maybe I am?)

So for example a future image compression algorithm should define how to compress a single image, and how to compress a group of compressed images?

Interesting thought.

The downside of this coupling might be slower adoption of new cross-file compression algorithms if there is going to be a string of them, which empirically there seems to be. It's kind of funny that the de facto image compression algorithms haven't changed in ~10 years whereas video compression algorithms change flavors every two to four years.
