
How To Think About Compression - epall
http://changelog.complete.org/archives/910-how-to-think-about-compression
======
BigZaphod
Whoa - 10 minutes ago, out of the blue, I was poking around on Wikipedia
reading about compression algorithms and thinking how it might be a fun side
project to mess with something like that. Now I reload my RSS reader and find
this!

Life is strange sometimes...

What I started wondering about is whether there's a chance of achieving better
compression results if the compressor and decompressor could communicate about
the compression before it happened. For example, (ignoring CPU constraints)
what if when I was downloading a large file over the web, the web server and
the browser conspired together to achieve a higher rate of compression
customized just for me by using some kind of shared knowledge (like a history
of what was recently downloaded or whatever) to achieve a transfer of less
data? (Like, if I had a bunch of files on my client that the server also has,
could it use small chunks of those known files and index their location in the
new file to be transferred - thus saving me download time?)

Obviously this is not a fully formed thought... but I think it might be worth
considering that compression need not be shaped only by the type of data, but
also by the receiver of that data and the intention and method of its
transmission.
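
A rough Python sketch of the kind of exchange I'm imagining - the fixed chunk
size, the hash choice, and the assumption that the server already knows which
chunks the client holds are all made up for illustration:

    import hashlib

    CHUNK_SIZE = 64 * 1024  # made-up fixed chunk size; a real system would tune this

    def chunk_hashes(data):
        """Split data into fixed-size chunks and hash each one."""
        chunks = [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]
        return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]

    def plan_transfer(new_file, client_library):
        """Server side: for each chunk, either tell the client to reuse a chunk
        it already has (named by hash) or send the raw bytes."""
        plan = []
        for digest, chunk in chunk_hashes(new_file):
            if digest in client_library:
                plan.append(("reuse", digest))        # nothing crosses the wire
            else:
                plan.append(("send", digest, chunk))  # only these bytes do
        return plan

    def rebuild(plan, client_library):
        """Client side: reassemble the file from reused and freshly sent chunks."""
        out = bytearray()
        for step in plan:
            out += client_library[step[1]] if step[0] == "reuse" else step[2]
        return bytes(out)

    # The client has seen one chunk before, so only the new tail is "sent".
    known = b"A" * CHUNK_SIZE
    library = {hashlib.sha256(known).hexdigest(): known}
    new_file = known + b"B" * 1000
    assert rebuild(plan_transfer(new_file, library), library) == new_file

In a real protocol the client would presumably have to report which hashes it
already holds first, but the shape of the saving is the same.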

~~~
jodrellblank
There is something to that - due in the upcoming Windows Server 2008 R2 and
Windows 7 is an update to Offline Files and network copying, where the server
will keep hashes of every file, in smallish chunks.

The idea is that if you copy two similar files from the server, it can check
the hashes and tell your client to use parts of a file it already has instead
of transferring them (or only copy changed parts of a big file).

I have no source handy to find more details about exactly which scenarios it
will be used in at the moment, though.

~~~
BigZaphod
Awesome. That's a lot like what I was imagining - keeping a library of known
hashes and chunks of shared files in the hopes that, at least occasionally,
having those chunks pays off when transferring future files. You could even
reference OS files - and offsets within them - that you could be sure were
present on the client side. Stuff like that.
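
A toy illustration of the "known OS files and offsets" angle - the new file
becomes a recipe of pointers into files both sides already have, plus literal
bytes for whatever isn't covered (the file name and recipe format here are
purely hypothetical):

    # Hypothetical "recipe": describe a new file as pointers into files the
    # client is known to have, plus literal runs for anything not found there.
    def apply_recipe(recipe, known_files):
        out = bytearray()
        for entry in recipe:
            if entry[0] == "ref":                  # ("ref", name, offset, length)
                _, name, offset, length = entry
                out += known_files[name][offset:offset + length]
            else:                                  # ("lit", raw_bytes)
                out += entry[1]
        return bytes(out)

    known_files = {"shared.dll": b"0123456789abcdef"}  # stand-in for a known OS file
    recipe = [("ref", "shared.dll", 4, 6), ("lit", b"NEW DATA")]
    assert apply_recipe(recipe, known_files) == b"456789NEW DATA"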

~~~
code_devil
check these terms out: deduplication and WAN optimization

------
DaniFong
It seems to me that for backup you'd want a file format with redundancy, or
one that could tolerate a few bit or sector errors while remaining mostly
readable. This seems to conflict with most compression schemes.

~~~
dangoldin
Well, isn't that a separate process? You should be able to make any file
redundant, not just compressed files.

Something like the error-correcting codes CDs are written with to handle
scratches should work well in this case.
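
A toy sketch of redundancy layered on as a separate step after compression - a
single XOR parity block here stands in for the proper Reed-Solomon-style error
correction CDs actually use, and the block size is arbitrary:

    import zlib

    BLOCK = 1024  # arbitrary block size, just for illustration

    def protect(data):
        """Compress first, then add one XOR parity block so any single lost
        block can be rebuilt - redundancy layered on top of compression."""
        comp = zlib.compress(data)
        blocks = [comp[i:i + BLOCK].ljust(BLOCK, b"\0")
                  for i in range(0, len(comp), BLOCK)]
        parity = bytes(BLOCK)
        for b in blocks:
            parity = bytes(x ^ y for x, y in zip(parity, b))
        return blocks, parity, len(comp)

    def recover(blocks, parity, comp_len, lost_index):
        """Rebuild one missing block from the parity, then decompress."""
        rebuilt = parity
        for i, b in enumerate(blocks):
            if i != lost_index:
                rebuilt = bytes(x ^ y for x, y in zip(rebuilt, b))
        patched = list(blocks)
        patched[lost_index] = rebuilt
        return zlib.decompress(b"".join(patched)[:comp_len])

    data = b"the same backup data repeated many times " * 200
    blocks, parity, comp_len = protect(data)
    assert recover(blocks, parity, comp_len, lost_index=0) == data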

