
The Upcoming Generation of Lossless Data Compression, Part 1 - adamsmith
http://blog.adamsmith.cc/2010/05/the-upcoming-generation-of-lossless-data-compression-part-1.html
======
jodrellblank
_Now take a group of, say, 100 JPEGs, and compress them into a single archive.
You’ll get no real compression. [..] But do that with a compression algorithm
from the future, and you’ll get a much smaller file._

Really? I guess PAQ8PX, Winzip 14, Stuffit 14 and PackJPG must be from the
future.

<http://www.maximumcompression.com/data/jpg.php>

~~~
sp332
PAQ8 is definitely futuristic, by which I mean that it is incredibly slow on
my machine and uses a ton of RAM :-)
<http://en.wikipedia.org/wiki/PAQ#Comparison>

~~~
jodrellblank
I was sitting here with 4 physical cores and 8 logical ones, and 6GB of RAM, with
the priority of compressing 7GB of data as small as possible, as soon as
possible.

7-zip's fastest setting got it down to ~1.4GB in about 10 minutes. 7-zip's extra
setting got it down to ~745MB in about 30 minutes.

Yet the only options it offered were up to 2 cores and up to 250MB of RAM. If
only I could have traded more cores and RAM for more speed and compression. Hrm.

------
tlb
The storage cost of all the JPEG files in the world probably doesn't justify a
major standards effort. The effort behind converting most of the world from
GIF to PNG must have been millions of man-hours.

~~~
adamsmith
That's a really important question.

It should be possible to cross-file de-dup more than just images.

One important test to watch will be Ocarina's market success, or lack thereof.

EDIT: also, because of the growth of proprietary data stores in the cloud, it
becomes more and more lucrative to leverage this tech even if it's not a
standard. E.g., if you were Amazon S3 compressing away redundancies, you could
use JPEG on your interfaces but a non-standardized algorithm on your backend.

~~~
jodrellblank
_It should be possible to cross-file de-dup more than just images._

Sit your files on ZFS with dedupe enabled and it will do that to any block of
data. It's not using the filetype-specific methods the post describes, but it's
file-agnostic dedupe nonetheless.
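
A toy sketch of the idea (fixed-size blocks keyed by their hash; this is just an
illustration, not how ZFS actually implements it):

    import hashlib

    BLOCK_SIZE = 128 * 1024  # fixed-size blocks, recordsize-style

    def dedupe(paths):
        store = {}    # digest -> block bytes, each unique block kept once
        layouts = {}  # path -> ordered list of digests to rebuild the file
        for path in paths:
            digests = []
            with open(path, "rb") as f:
                while True:
                    block = f.read(BLOCK_SIZE)
                    if not block:
                        break
                    digest = hashlib.sha256(block).hexdigest()
                    store.setdefault(digest, block)  # duplicate blocks stored once
                    digests.append(digest)
            layouts[path] = digests
        return store, layouts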

------
tbrownaw
Versioning systems do something like this already. Most store old versions of
files as diffs against new versions (or it may be the reverse); git stores
files as diffs against similarly sized files of, I think, similar age (which are
most likely older/newer versions of the same file).

There's also a fingerprinting algorithm that basically divides your file
into chunks based on a fixed short string or rolling-checksum value and then
hashes each chunk, which can be used to find good candidates for storing as
diffs against each other. Unfortunately I don't recall what name this goes by,
where I ran across it, or what system actually uses it.
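
The rough idea, sketched in Python (the window size, cut mask, and toy checksum
are made-up stand-ins for a real rolling hash like Rabin or buzhash). Because
boundaries depend only on local content, inserting bytes early in a file doesn't
shift every later chunk:

    import hashlib

    WINDOW = 48    # bytes in the rolling window
    MASK = 0xFF    # cut where the checksum's low bits match this pattern

    def chunks(data):
        start, window_sum = 0, 0
        for i, byte in enumerate(data):
            window_sum += byte
            if i >= WINDOW:
                window_sum -= data[i - WINDOW]   # drop the byte leaving the window
            if i - start + 1 >= WINDOW and (window_sum & MASK) == MASK:
                yield data[start:i + 1]          # checksum hit the pattern: chunk boundary
                start = i + 1
        if start < len(data):
            yield data[start:]

    def fingerprints(data):
        # Hash each chunk; files sharing many chunk hashes are good delta candidates.
        return [hashlib.sha1(chunk).hexdigest() for chunk in chunks(data)]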

~~~
adamsmith
Here's a good overview paper:
[http://www.usenix.org/event/usenix04/tech/general/full_paper...](http://www.usenix.org/event/usenix04/tech/general/full_papers/kulkarni/kulkarni_html/paper.html)

------
lsb
If you want to build a model of English text, you have terabytes of webpages
to use. Exposing sensitive information (say, the phrase "Adam, I'm sorry your
cat has feline AIDS" compresses via LZ to position 1230352024, so your account
shows 1.950001GB free instead of 1.949999GB free) is the wrong way to squeeze a
few more bits out; make sure you don't break through the ACL of each document.

Another thing to be mindful of is how much computing power you'll need for the
text model; larger models compress better, but need more time and space.
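
A toy illustration of the leak above, using zlib's preset-dictionary support to
stand in for a shared cross-user model (the phrases and setup are made up):

    import zlib

    # Pretend some other user's document ended up in the shared dictionary/model.
    shared_model = b"...Adam, I'm sorry your cat has feline AIDS..."

    def stored_size(upload, zdict):
        c = zlib.compressobj(level=9, zdict=zdict)
        return len(c.compress(upload) + c.flush())

    right = stored_size(b"Adam, I'm sorry your cat has feline AIDS", shared_model)
    wrong = stored_size(b"Adam, I'm sorry your cat has canine AIDS", shared_model)
    # The matching guess typically compresses a little smaller, so an observable
    # size (or free-space counter) leaks a bit of someone else's data.
    print(right, wrong)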

------
Keyframe
Lossless data compression and information complexity in general have been kind
of a hobby of mine for the last 13-14 years. More out of OCD/curiosity than
practical purposes - like a puzzle, if you will.

From what I've gathered, I think the future of compression lies in the way we
look at the data patterns themselves. That is, not looking at the patterns
statistically, and not treating the problem as a statistical one, but taking
some new approach via discrete math.

That is, a fresh new "out of this world" approach. I have some ideas where it
might come from, but I don't want to sound like a madman.

~~~
apu
Are you talking about Fractal Compression methods?

<http://en.wikipedia.org/wiki/Fractal_compression>

~~~
Keyframe
Well, fractal compression is lossy and is sort of a reconstructive process
that resembles the original but isn't it. So, no.

I am talking about lossless compression. There are various things putting
limits on what can ultimately be done.

<http://en.wikipedia.org/wiki/Shannon%E2%80%93Hartley_theorem>

<http://en.wikipedia.org/wiki/Kolmogorov_complexity>

<http://en.wikipedia.org/wiki/Pigeonhole_principle>

those being the most important ones. Current methods are more or less all
analysis/statistics/continuous math. From what I've gathered, I think the new
approach will come from the area of discrete math - mainly combinatorics.
Also, and this is where the madman part comes in, cellular-automata-like systems.

Discrete math in itself offers great tools for doing R&D on paper, but it
requires serious computing power later on to prove you're right. New research
also needs new minds unpolluted by "what can and can't be done" in order to
bring new ways of thinking. I won't bore you with the details; I've had lots of
ideas over the years, but pretty much all of them were flukes for lossless.
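
For the pigeonhole limit above, the counting argument is just this
(back-of-the-envelope; any n works):

    n = 16
    inputs = 2 ** n                                   # distinct n-bit inputs
    shorter_outputs = sum(2 ** k for k in range(n))   # all bit strings shorter than n bits
    # 65536 inputs vs 65535 shorter outputs: no lossless scheme can shrink them all.
    print(inputs, shorter_outputs, inputs > shorter_outputs)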

------
jluxenberg
Sounds like he's talking about an application of Principal Components Analysis
(see, e.g.
[http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/p...](http://www.willamette.edu/~gorr/classes/cs449/Unsupervised/pca.html)
)

My guess is that doing PCA on hundreds of images is too expensive, but I think
you could prove that it would give you optimal lossless compression.

~~~
apu
PCA is optimal from an L2 perspective, but with images that is often not the
metric you want to use.

Of course, I was a little confused by the fact that he's talking about
lossless compression but then mentions JPEG, which is lossy.

If you consider lossy compression algorithms, then for images what matters
most is human perception. JPEG and the various MPEG standards for video
already take advantage of this, by more heavily quantizing things that humans
have trouble distinguishing changes in. Thus, while these compression schemes
may increase the L2 reconstruction error, the results look "better" to human
eyes than the results of PCA, where artifacts are quite noticeable.
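
A rough sketch of that quantize-the-hard-to-see-stuff step, in the usual 8x8
DCT shape (the quantization table here is a made-up stand-in for a real
perceptual one):

    import numpy as np

    def dct_matrix(n=8):
        # Orthonormal DCT-II basis.
        k = np.arange(n)
        m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
        m[0] /= np.sqrt(2)
        return m

    D = dct_matrix()
    block = np.random.randint(0, 256, (8, 8)).astype(float) - 128   # one 8x8 pixel block
    coeffs = D @ block @ D.T                                        # frequency domain

    # Coarser steps for higher frequencies, which humans distinguish poorly.
    Q = 1 + 4 * (np.arange(8)[:, None] + np.arange(8)[None, :])
    quantized = np.round(coeffs / Q)                 # most high-frequency entries become 0
    reconstructed = D.T @ (quantized * Q) @ D + 128  # lossy round trip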

If you do want to use PCA, then the following method is quite a bit more
efficient than straight PCA on images:
[http://www1.cs.columbia.edu/CAVE/publications/pdfs/Nishino_P...](http://www1.cs.columbia.edu/CAVE/publications/pdfs/Nishino_PAMI05.pdf)

That's for a single image, but you could extend the idea out to multiple
images.

The problem with PCA is that it's a linear method, and lighting effects are
not linear across a full image. However, in local settings (i.e., small
patches of the image), the linearity assumption often holds, and you thus get
much more bang-for-the-buck.
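
A minimal numpy sketch of that local-patch idea (just PCA on 8x8 blocks; the
patch size and number of kept components are arbitrary, and this is not the
method from the linked paper):

    import numpy as np

    def patch_pca(image, patch=8, keep=16):
        h, w = image.shape
        # Collect non-overlapping patch x patch blocks as row vectors.
        blocks = np.array([image[i:i + patch, j:j + patch].ravel()
                           for i in range(0, h - patch + 1, patch)
                           for j in range(0, w - patch + 1, patch)], dtype=float)
        mean = blocks.mean(axis=0)
        centered = blocks - mean
        # Principal axes are the right singular vectors of the centered data.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:keep]                 # top `keep` components
        coeffs = centered @ basis.T       # each patch stored as `keep` numbers
        return mean, basis, coeffs        # approximate reconstruction: coeffs @ basis + mean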

------
MikeCapone
I've been wondering for a long time why nothing better than JPEG seems to be
catching on for lossy photo compression. It seems like it should be possible to
trade CPU cycles for file size (in the same way H.264 does versus MPEG-2), but
I suppose it's really hard to get critical mass with anything that isn't
backwards compatible...

~~~
adamsmith
I also wonder if it would be possible to establish a set of well known
reference bits to compress against.

E.g. if you shipped an additional 10 MB of "data" with every browser, could
you reduce the average image size on the internet by X% by leveraging that 10
MB?

(I have no idea; just an interesting thought exercise. It could make a cool
master's thesis for someone.)
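
A rough sketch of the mechanism using zlib's preset-dictionary support (the
shipped 10 MB would play the role of zdict; the reference bytes here are
obviously made up):

    import zlib

    reference = b"<html><head><meta charset=utf-8><title></title></head><body>" * 50
    page = b"<html><head><meta charset=utf-8><title>hi</title></head><body>hello</body></html>"

    def size(data, zdict=None):
        c = zlib.compressobj(level=9, zdict=zdict) if zdict is not None else zlib.compressobj(level=9)
        return len(c.compress(data) + c.flush())

    # The decompressor needs the same dictionary, which is exactly what shipping
    # it with every browser would provide.
    print(size(page), size(page, reference))   # the shared dictionary should win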

~~~
Keyframe
I can see how it might potentially work with finely crafted Bloom filters -
lossy, of course! I'm just not sure training the set of filters would be worth
the effort, because computing the filters might take a looooooooooooooooong time
at that level of information complexity. I've actually been toying with
something like that over the years - had some models I'd programmed (not for
pictures, though) - but I gave up.

~~~
MikeCapone
Sounds like the kind of thing that a company like Google could tackle. It
would be especially good in the mobile world, where bandwidth is more limited.

~~~
Keyframe
A well-funded organization could do it, for sure. And Google would benefit from
it, of course. But from my limited research over the years, I'm no longer
convinced that any further significant advancement can come from continuous
mathematics. It might be a personal bias from my own experience, but I have a
strong hunch that new approaches via discrete mathematics will be the key to
any serious advances toward Shannon's limit / Kolmogorov complexity (if those
hold true, and they should as far as we know).

------
hackermom
Why go forth with these "futuristic" means of optimizing/compressing _current_
formats, when said current formats will in all likelihood be abandoned soon
enough in favor of something better? It seems as if this company's business
idea is entirely redundant. After all, shedding the past in favor of the future
is the exact history of everything we've done in computing since... since...
since before the dinosaurs died.

I fail to see the point of this in light of a much smarter idea: keep this
approach in mind when developing _new formats_. Maybe I'm just not thinking
outside the "boxeola" enough here... (or maybe I _am?_ )

~~~
adamsmith
So for example a future image compression algorithm should define how to
compress a single image, and how to compress a group of compressed images?

Interesting thought.

The downside of this coupling might be slower adoption of new cross-file
compression algorithms if there is going to be a string of them, which
empirically there seems to be. It's kind of funny that the de facto image
compression algorithms haven't changed in ~10 years whereas video compression
algorithms change flavors every two to four years.

