Really? I guess PAQ8PX, Winzip 14, Stuffit 14 and PackJPG must be from the future.
That said -- very cool -- I didn't know those existed!
7-zip fastest got it down to ~1.4 GB in about 10 minutes.
7-zip extra got it down to ~745 MB in about 30 minutes.
Yet the only options it offered were up to 2 cores and up to 250 MB of RAM. I wish I could have traded more cores and RAM for more speed and compression. Hrm.
It should be possible to cross-file de-dup more than just images.
One important test to watch will be Ocarina's market success, or lack thereof.
EDIT: also, because of the growth of proprietary data stores in the cloud, it becomes more and more lucrative to leverage this tech even if it's not a standard. E.g. if you were Amazon S3 and wanted to compress away redundancies, you could keep JPEG at your interfaces but use a non-standardized algorithm on the backend.
Put your files on ZFS with dedup enabled and it will do that to any block of data. It doesn't use filetype-specific methods like the post describes, but it's file-agnostic dedup nonetheless.
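For anyone curious, block-level dedup more or less boils down to this (a toy Python sketch with a made-up in-memory store; ZFS does the real thing down at the storage layer, not like this):

    import hashlib

    BLOCK_SIZE = 128 * 1024  # fixed block size; any value works for the sketch

    def dedup_store(path, store):
        """Record a file as a sequence of block hashes; identical blocks are kept once."""
        refs = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).hexdigest()
                store.setdefault(digest, block)  # only stored if not already present
                refs.append(digest)
        return refs  # the file is now just a list of references into the store

Two files that share blocks (at the same alignment) end up sharing entries in the store, which is all dedup really is.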
There's also a fingerprinting approach that divides a file into chunks wherever a fixed short string or a rolling-checksum value occurs, then hashes each chunk; matching chunks are good candidates for storing as diffs against each other. Unfortunately I don't recall what name this goes by, where I ran across it, or what system actually uses it.
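As best I can reconstruct it, the idea looks something like this rolling-checksum sketch (the window size, the mask, and the simple byte-sum checksum are all arbitrary choices for illustration):

    import hashlib

    WINDOW = 48            # bytes in the sliding window
    BOUNDARY_MASK = 0x3FF  # boundary whenever the low bits are zero (~1 KB average chunks)

    def chunk(data):
        """Split data at content-defined boundaries using a rolling sum of the last WINDOW bytes."""
        chunks, start, rolling = [], 0, 0
        for i, byte in enumerate(data):
            rolling += byte
            if i >= WINDOW:
                rolling -= data[i - WINDOW]  # slide the window forward
            if (rolling & BOUNDARY_MASK) == 0 and i + 1 - start >= WINDOW:
                chunks.append(data[start:i + 1])
                start = i + 1
        if start < len(data):
            chunks.append(data[start:])
        return chunks

    # Hash each chunk; chunks with the same digest across files only need to be
    # stored once, and near-matches are good candidates for diffing.
    digests = [hashlib.sha256(c).hexdigest() for c in chunk(b"example data " * 10000)]

Because boundaries depend on content rather than offsets, inserting a few bytes near the start of a file only perturbs the chunks around the edit instead of shifting every chunk after it.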
Another thing to be mindful of is how much computing power you'll need for the text model; larger models compress better, but they need more time and memory.
From what I've gathered, I think the future of compression lies in the way we look at the data patterns themselves. That is, not looking at the patterns statistically, and not treating the problem as statistical, but finding some new approach via discrete math.
That is, a fresh, new, "out of this world" approach. I have some ideas about where it might come from, but I don't want to sound like a madman.
I am talking about lossless compression. There are various things putting limits to what can be done ultimately.
those being the most important ones. Current methods are more or less all analysis/statistics/continuous math. From what I've gathered, I think the new approach will come from the area of discrete math, mainly combinatorics, and (this is where the madman part comes in) cellular-automata-like systems.
Discrete math in itself offers great tools for doing R&D on paper, but it takes serious computing power later on to prove the ideas right. New research also needs new minds, unpolluted by "what can and can't be done," to bring new ways of thinking. I won't bore you with the details; I've had lots of ideas over the years, but pretty much all of them turned out to be dead ends for lossless.
My guess is that doing PCA on hundreds of images is too expensive, but I think you could show that it gives you the optimal linear compression in the least-squares sense.
Of course I was a little bit confused by the fact that he's talking about lossless compression but then mentioning jpeg, which is lossy.
If you consider lossy compression algorithms, then for images what matters most is human perception. JPEG and the various MPEG standards for video already take advantage of this, by more heavily quantizing things that humans have trouble distinguishing changes in. Thus, while these compression schemes may increase the L2 reconstruction error, the results look "better" to human eyes than the results of PCA, where artifacts are quite noticeable.
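To make the quantization point concrete, here's a toy version of that step (not the actual JPEG pipeline, and the quantization table below is made up, but it shows the idea of dividing DCT coefficients by larger values at the high spatial frequencies the eye is least sensitive to):

    import numpy as np
    from scipy.fftpack import dct

    def dct2(block):
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    # Coarser quantization toward the bottom-right (high frequencies), where
    # human vision tolerates larger errors.
    quant = np.add.outer(np.arange(8), np.arange(8)) * 4 + 8

    block = np.random.rand(8, 8) * 255            # stand-in for an 8x8 image block
    coeffs = np.round(dct2(block - 128) / quant)  # most high-frequency coefficients round to 0

The runs of zeroed-out high-frequency coefficients are what make the later entropy-coding stage so effective, and the visual cost stays small because of the perceptual weighting.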
If you do want to use PCA, then the following is quite a bit more efficient than straight PCA on whole images: split the image into small patches and run PCA on the patches instead.
That's for a single image, but you could extend the idea out to multiple images.
The problem with PCA is that it's a linear method, and lighting effects are not linear across a full image. However, in local settings (i.e., small patches of the image), the linearity assumption often holds, and you thus get much more bang-for-the-buck.
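To make the patch version concrete, here's a minimal single-image sketch (the patch size and the number of retained components are arbitrary choices, and PCA is done via SVD):

    import numpy as np

    def patch_pca(image, patch=8, k=16):
        """Project non-overlapping patches onto their top-k principal components."""
        h, w = image.shape
        # Crop to a multiple of the patch size and turn each patch into a row vector.
        patches = (image[:h - h % patch, :w - w % patch]
                   .reshape(h // patch, patch, w // patch, patch)
                   .swapaxes(1, 2)
                   .reshape(-1, patch * patch))
        mean = patches.mean(axis=0)
        centered = patches - mean
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:k]                 # top-k principal directions of the patches
        codes = centered @ basis.T     # k numbers per patch instead of patch*patch
        approx = codes @ basis + mean  # lossy reconstruction of every patch
        return codes, basis, mean, approx

Extending it to multiple images just means stacking everyone's patches into the same matrix before the SVD, so the basis (the part you'd have to store or ship) is shared.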
E.g. if you shipped an additional 10 MB of "data" with every browser, could you reduce the average image size on the internet by X% by leveraging that 10 MB?
(I have no idea; just an interesting thought exercise. Could make a cool masters thesis for someone.)
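For what it's worth, DEFLATE already has a small-scale version of this mechanism in the form of preset dictionaries, so a toy sketch of the idea (with a placeholder dictionary, and text rather than images, since zlib exposes it directly) could look like:

    import zlib

    # Stand-in for the shipped "data": byte strings that recur across many files.
    shared_dict = b'<!DOCTYPE html><html><head><meta charset="utf-8"><link rel="stylesheet"'

    def compress_with_dict(payload, zdict):
        c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS, 9,
                             zlib.Z_DEFAULT_STRATEGY, zdict)
        return c.compress(payload) + c.flush()

    def decompress_with_dict(blob, zdict):
        d = zlib.decompressobj(zlib.MAX_WBITS, zdict)
        return d.decompress(blob) + d.flush()

    page = b'<!DOCTYPE html><html><head><meta charset="utf-8"><title>hi</title></head></html>'
    packed = compress_with_dict(page, shared_dict)
    assert decompress_with_dict(packed, shared_dict) == page

Anything already present in the dictionary costs almost nothing in the compressed output; doing the same for images would need a codec designed to reference shared data, which is exactly the standardization problem being discussed.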
I suggest that you make a blog post or Tell HN about this and post it as a top level submission. I'd love to hear more opinions about this.
I understand the principle, but I don't understand enough about how it would actually work to advocate for it as much as I'd like to.
Such a technique could also be especially useful for low bandwidth situations, most notably Paul English's proposed Internet for Africa effort, http://joinafrica.org/ .
If you're interested in this pursuit you should also look at SPDY, Google's suggested augmentation of HTTP. It would probably speed up the web more than better compression in many cases.
I fail to see the point of this in light of a much smarter idea: keep this approach in mind when developing new formats. Maybe I'm just not thinking outside the "boxeola" enough here... (or maybe I am?)
The downside of this coupling might be slower adoption of new cross-file compression algorithms if there is going to be a string of them, which empirically there seems to be. It's kind of funny that the de facto image compression algorithms haven't changed in ~10 years whereas video compression algorithms change flavors every two to four years.