
Using machine learning to choose compression algorithms - kootenpv
https://vks.ai/2019-12-05-shrynk-using-machine-learning-to-learn-how-to-compress
======
pedrocx486
Is there a way to do the reverse?

There's a quite "legendary" Game Boy Advance game out there (Klonoa - Densetsu
no Star Medal) that never got a translation to English because it has some
sort of in-house created compression by Namco applied to the game that was
made so it could fit into a GBA cartridge. AFAIK no one was ever able to crack
it open and release the code to de/compress it.

A while ago I had a "bounty" of USD100 for anyone that could do it (just the
decompression and re-compression, not translation) but there aren't many
people that want to fiddle with low-level GBA coding.

~~~
yorwba
I'm assuming the game is playable, i.e. the decompression code is included on
the cartridge and you just don't know how it works. In that case you could
emulate the game and use a language model to identify strings containing
Japanese text (you'd need to know the encoding to do that) so they can be
extracted for translation. That doesn't allow you to put the translations into
the compressed code, but you might be able to instrument the emulator to
inject translated strings on-the-fly.
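The string-identification step can be sketched without a language model at all: scan the dump for byte runs that decode as Japanese under an assumed encoding. A minimal sketch, assuming Shift-JIS (the game may well use a custom table, in which case this only finds standard-encoded text):

```python
# Sketch: scan a raw memory/ROM dump for byte runs that decode as
# Japanese text under an assumed encoding (Shift-JIS here -- the actual
# game may use a custom table, so this is illustrative only).

def find_japanese_runs(buf: bytes, encoding="shift_jis", min_chars=3):
    """Yield (offset, decoded_string) for runs of Japanese characters."""
    def is_japanese(ch):
        cp = ord(ch)
        return (0x3040 <= cp <= 0x30FF      # hiragana + katakana
                or 0x4E00 <= cp <= 0x9FFF)  # common kanji
    i = 0
    while i < len(buf) - 1:
        j = i
        chars = []
        # Shift-JIS encodes kana/kanji as 2-byte sequences.
        while j < len(buf) - 1:
            try:
                ch = buf[j:j + 2].decode(encoding)
            except UnicodeDecodeError:
                break
            if len(ch) != 1 or not is_japanese(ch):
                break
            chars.append(ch)
            j += 2
        if len(chars) >= min_chars:
            yield i, "".join(chars)
            i = j
        else:
            i += 1
```

A language model would then only be needed to rank the candidate runs by plausibility, filtering out byte sequences that happen to decode but aren't real text.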

~~~
amelius
You might as well write a tool that extracts strings from a video signal using
OCR, and translates them. That would make the solution more universal, and you
could even use it to e.g. suppress ads.

~~~
dnautics
Well that's super hard since Japanese encoding is an epic story in digital
archaeology itself.

~~~
simiones
I'm not well-versed on the subject, how does encoding come into play for text
displayed on the screen? Did they use a strange way of representing the
Japanese text because of technical limitations?

------
mappu
Are all compressors simply being run with their default arguments? There's a
lot of scope for speed/filesize tradeoffs within a single compressor.

EDIT: You are missing `csv+zstd`? It should obsolete `csv+gzip` at all speeds
and compression levels.

There is a pareto-optimality frontier here - I ran my own testing back in 2016
[https://code.ivysaur.me/compression-performance-test/](https://code.ivysaur.me/compression-performance-test/)
but the numbers are now a little bit obsolete (e.g. zstd and brotli have both
seen a lot of improvements).
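The tradeoff-within-one-compressor point is easy to demonstrate with the stdlib: sweep zlib's levels 1-9 (a stand-in for zstd/brotli, which expose far wider level ranges) and keep only the pareto-optimal (size, time) points:

```python
import time
import zlib

# Sketch of the speed/size tradeoff within a single compressor: sweep
# zlib's levels 1-9 and record (level, compressed size, seconds).

def level_sweep(data: bytes):
    results = []
    for level in range(1, 10):
        start = time.perf_counter()
        compressed = zlib.compress(data, level)
        elapsed = time.perf_counter() - start
        results.append((level, len(compressed), elapsed))
    return results

# A point is off the pareto frontier if some other level is strictly
# better on one axis and no worse on the other.
def pareto(results):
    return [r for r in results
            if not any((o[1] < r[1] and o[2] <= r[2]) or
                       (o[1] <= r[1] and o[2] < r[2])
                       for o in results)]
```

Running this on real payloads shows why "default arguments only" understates each compressor: several levels of the same algorithm usually survive on the frontier.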

~~~
alexott
+1 - and if you can use zstd with a custom dictionary, you can achieve even
better compression ratios

~~~
kootenpv
Yea, zstd is really amazing... if I had to choose a single one all the time,
it'd be zstd for sure.

------
jedimastert
A somewhat different but related example (at least by name): I recall an
article from Chris Wellons' blog, Null Program, where he wanted to "discover"
a new hashing algorithm, so he randomly generated candidates, JIT compiled
them to native code, then tested them.

There was, however, no machine learning or optimization. Instead, he called it
"prospecting" and just generated a new one from scratch each time until he
found something interesting.

[https://nullprogram.com/blog/2018/07/31/](https://nullprogram.com/blog/2018/07/31/)

~~~
kootenpv
Yea - that is related to genetic programming. That, and using auto-encoders
for e.g. image compression, are known approaches in "AI".

I'm particularly proud of this meta approach and I am actually thinking this
could become huge: the same thing can be done for hyperparameter optimization
in machine learning tasks.

Hyperparameter optimization is currently focused on minimizing cross-validation
error, but using this concept you could have weights on accuracy, training
time and prediction time (very similar to compression, where the 3 dimensions
are size, write time and read time), and then given a new, unknown dataset you
could predict what model/hyperparameters to use.

Maybe this should be patented ;)
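The weighted selection idea can be sketched in a few lines. Given (hypothetical) benchmark records per candidate model, normalize each dimension and rank by a weighted score, analogous to shrynk's size/write/read weights:

```python
# Minimal sketch of weighted model selection. All candidate names and
# benchmark numbers are made up for illustration.

def best_candidate(candidates, weights):
    """candidates: {name: (accuracy, train_seconds, predict_seconds)};
    weights: (w_accuracy, w_train, w_predict)."""
    names = list(candidates)
    cols = list(zip(*candidates.values()))

    def normalize(col, higher_is_better):
        lo, hi = min(col), max(col)
        span = (hi - lo) or 1.0
        return [((v - lo) / span) if higher_is_better
                else ((hi - v) / span) for v in col]

    # Accuracy: higher is better; both times: lower is better.
    norm = [normalize(cols[0], True),
            normalize(cols[1], False),
            normalize(cols[2], False)]
    scores = {name: sum(w * norm[d][i] for d, w in enumerate(weights))
              for i, name in enumerate(names)}
    return max(scores, key=scores.get)
```

With weights (1, 0, 0) this reduces to "pick the most accurate model"; non-zero time weights start penalizing slow candidates, which is the proposed extension over accuracy-only meta-learning.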

~~~
eugenhotaj
> I'm particularly proud of this meta approach and I am actually thinking this
> could become huge: the same thing can be done for hyperparameter
> optimization in machine learning tasks.

There is already a substantial field of Machine Learning/Meta Learning which
focuses on exactly this. For example, this paper [1] from NeurIPS 2015 does
exactly what you suggest.

[1]: [https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf](https://papers.nips.cc/paper/5872-efficient-and-robust-automated-machine-learning.pdf)

~~~
kootenpv
Yea, I am aware of the meta hyperparameter approach for ML, except they only
focus on accuracy instead of also including train/prediction times in the
equation :) That's what I was referring to! (You can save A LOT of compute and
zoom in on things that work if you can weed out slow or badly performing
algorithms as part of meta-learning the hyperparameters.)

To make it extra clear: by doing a lot of compute on different datasets,
recording not only the accuracy but also the time it took, and then including
that as a dimension, it will give even better results.

------
md_
When and why would you want to apply ML to this problem?

Off the cuff, this seems like it competes with an alternative of simply
running every considered compression algorithm and choosing the optimal one. I
guess this would be advantageous if the RF classifier is meaningfully faster
to run than the compression algorithms themselves. Is it?
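The brute-force alternative being described is straightforward to sketch with the stdlib compressors (zstd/brotli would slot in the same way):

```python
import bz2
import lzma
import zlib

# The brute-force baseline: run every candidate compressor on the full
# payload and keep whichever output is smallest -- k compression passes
# per payload, versus one cheap feature pass for the ML approach.

COMPRESSORS = {
    "zlib": zlib.compress,
    "bz2": bz2.compress,
    "lzma": lzma.compress,
}

def smallest(data: bytes):
    outputs = {name: fn(data) for name, fn in COMPRESSORS.items()}
    winner = min(outputs, key=lambda name: len(outputs[name]))
    return winner, outputs[winner]
```

The question then becomes purely empirical: is predicting from cheap features faster than those k passes, and how often does the prediction match the true winner.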

~~~
kootenpv
Spot on (I briefly touch on this in the article)! This is why I try to work
with cheap-to-compute features. I used to calculate how unique all values
were, but ended up taking a sample instead to speed that part up for large
data!

~~~
md_
How does that compare to just doing compression on a sample of data? :)

~~~
ma2rten
Running the compression algorithms is O(kn), where k is the number of
compression algorithms and n the size of the data. Taking the machine learning
approach is O(n).

~~~
md_
As the author noted after you posted this, it’s not a given that the ML
algorithm is O(n). It may be constant time (by looking only at column headers
and a sample of data, say).

That said, I was really more interested in practical runtimes. Like, in
practice the ML may have a high startup cost (e.g., due to the cost of loading
a model), whereas for most dataset sizes linear complexity may be fine...

~~~
kootenpv
Models are cached and not large, so the setup time is very low. I'll time it
when I get to a PC.

------
gwern
Isn't one of the benefits of time-series databases more compact storage at
minimal overhead?

~~~
proverbialbunny
Does anyone know what the leading time-series database is that people use
today? For, e.g., bar/tick data.

(I use PostgreSQL, due to my ignorance.)

~~~
JamesBarney
I've been happy with InfluxDB, but I've also noticed that a lot of people
switch to a time-series database way before they need to. You can go pretty
far with Postgres/Oracle/SQL Server, and it has the advantage that you don't
need to manage different databases.

~~~
proverbialbunny
Thanks for the response. I've avoided InfluxDB because it drops data. Bar and
tick data needs to be ACID, or at least very close to ACID.

InfluxDB is great for server usage stats, though.

~~~
JamesBarney
I don't know much about bar or tick data. Mind going into more detail about
dropping data?

We didn't really have a need for transactions because we were just recording a
bunch of sensor data from oil and gas wells around the world, then running
computations, and then displaying and alerting on the results.

~~~
proverbialbunny
Bar and tick data is financial, so we're dealing with numbers. It has to be
precise, like not 0.0000001 off, and dropped data isn't the end of the world,
but it's pretty bad.

------
WoodenChair
We did a fun experiment in Chapter 5 of _Classic Computer Science Problems in
Python_. We ran a genetic algorithm to find which order of a list of names
(assuming the order doesn't matter) led to the best compression using one of
the Python standard library's built-in compression algorithms. This article
has made me interested in re-running the experiment with multiple compression
algorithms.
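The core of that experiment can be sketched without a full genetic algorithm: random mutation search over permutations, scoring each ordering by its zlib-compressed size (the scoring function is the same; the book's GA just explores the permutation space more cleverly):

```python
import random
import zlib

# Sketch of the ordering experiment: search permutations of a name list
# for the order that compresses smallest. Simple mutate-and-keep search
# stands in for the book's genetic algorithm.

def compressed_size(names):
    return len(zlib.compress("\n".join(names).encode()))

def search_best_order(names, tries=200, seed=0):
    rng = random.Random(seed)
    best = list(names)
    best_size = compressed_size(best)
    for _ in range(tries):
        candidate = best[:]
        i, j = rng.randrange(len(names)), rng.randrange(len(names))
        candidate[i], candidate[j] = candidate[j], candidate[i]  # mutate
        size = compressed_size(candidate)
        if size <= best_size:  # keep equal-or-better orderings
            best, best_size = candidate, size
    return best, best_size
```

Orders that group similar names tend to win, since LZ-style matching exploits nearby repetition.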

------
d_burfoot
Why not use ML to build a statistical model of your data, and then use that
model to compress it? You will likely get better codelengths for your specific
dataset than off-the-shelf algorithms can achieve. And if you figure out a
good model that describes the fluctuations in the cryptocurrency prices very
well, then you can use that model pretty directly to make money via automated
trading.

I've got an easy-to-use library for arithmetic encoding in Java; it would be
easy to port to other languages:
[https://github.com/comperical/MirrorEncode](https://github.com/comperical/MirrorEncode)
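The codelength argument behind this can be illustrated without implementing the coder itself: an arithmetic coder assigns roughly -log2 p(symbol) bits per symbol, so a model fitted to the data beats a uniform one. A minimal sketch (not d_burfoot's library, just the arithmetic behind it):

```python
import math
from collections import Counter

# Sketch of the codelength argument, not an arithmetic coder itself:
# ideal cost of encoding `data` under per-symbol probabilities `model`.

def ideal_codelength_bits(data, model):
    return sum(-math.log2(model[sym]) for sym in data)

def fit_frequency_model(data):
    """Simplest possible statistical model: empirical symbol frequencies."""
    counts = Counter(data)
    total = len(data)
    return {sym: c / total for sym, c in counts.items()}
```

The better the model's probabilities match the data's actual regularities, the shorter the code, which is exactly why a good price-fluctuation model would double as a good compressor.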

------
js8
How does it compare to context mixing algorithms such as
[https://en.wikipedia.org/wiki/ZPAQ](https://en.wikipedia.org/wiki/ZPAQ) ?

~~~
kootenpv
shrynk is not doing any compression itself; it's using ML to take a "meta"
approach: it computes features on your data (how many rows, how many columns,
how much duplication, etc.), and then predicts which existing compression
algorithm (out of the available compressors) will be best given your
requirements/preferences in terms of size, write time and read time.
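In pseudocode-level Python, the pipeline looks something like this. The features, "training" rows and 1-nearest-neighbor stand-in below are all illustrative; shrynk's real feature set and classifier are its own:

```python
import math

# Illustrative sketch of the meta approach: cheap features of a table,
# plus a trivial 1-NN "model" over hypothetical benchmark results.

def features(rows):
    """rows: list of equal-length tuples (tiny stand-in for a DataFrame)."""
    n_rows, n_cols = len(rows), len(rows[0])
    distinct = len(set(rows)) / n_rows  # duplication: share of unique rows
    return (math.log10(n_rows), n_cols, distinct)

# Hypothetical benchmark data: (features, best compressor on that table).
TRAINED = [
    ((1.0, 3, 1.0), "csv+gzip"),
    ((5.0, 3, 0.2), "parquet+zstd"),
    ((3.0, 10, 0.9), "csv+bz2"),
]

def predict(rows):
    f = features(rows)
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return min(TRAINED, key=lambda t: dist(t[0], f))[1]
```

The point is that `features` is a single cheap pass (or a sample) over the data, while actually running each candidate compressor would cost a full compression pass per candidate.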

------
proverbialbunny
Is it possible to run data through a DNN, making sure the output is the same
as the input (or close to it, if some loss is acceptable), i.e. an autoencoder
variant, record the 'compressed' bottleneck representation, and then on the
other end use the decoder half of the DNN to decompress it?

I'm pretty ignorant on the topic, so I get that this may be off, but if so,
why wouldn't that be a valid approach to compressing data?

~~~
d_burfoot
Sure, and you can add some correction bits to fix the lossy encoding of the
autoencoder.

The problem is that you have to encode/send the DNN itself, otherwise your
receiver won't know how to decode the data. If you are not smart, the added
codelength of the DNN will likely blow away your savings. If you are smart,
this leads to a whole formulation of machine learning called MDL:

[https://en.wikipedia.org/wiki/Minimum_description_length](https://en.wikipedia.org/wiki/Minimum_description_length)

------
numbol
Funny, it deeply resembles the last episodes of "Silicon Valley".

More seriously, a generalized compression algorithm could be one of the keys
to, or even one of the definitions of, general artificial intelligence.

And Kolmogorov complexity was referenced subtly in the very last episode...

------
mebr
The wording could be improved to make it quicker to understand what is being
offered.

This uses machine learning to predict, for given heterogeneous tabular data,
which compression algorithm will yield the best compression ratio.

There are multiple other ways to do this approximation.

------
mattnewport
From the data shown, it's not obvious that some of the options would ever be
best; it would be interesting to see whether any of the non-csv options ever
beat the csv options, and for what type of data.

~~~
kootenpv
I can give you this... the larger the data, the more useful parquet and
compression on top of it will be...

~~~
mattnewport
From the brief description, it seems like Parquet is just a weaker version of
what a decent LZ-style compressor will do anyway.

