
Image Compression with Neural Networks - hurrycane
https://research.googleblog.com/2016/09/image-compression-with-neural-networks.html
======
emcq
This is pretty neat. But is it just me or does the dog picture look better in
JPEG?

When zoomed in, the JPEG artifacts are quite apparent and the RNN produces a
much smoother image. However, to my eye when zoomed out the high frequency
"noise", particularly in the snout area, looks better in JPEG. The RNN
produces a somewhat blurrier image that reminds me of the soft focus effect.

~~~
jcl
It looks like they are using a "multi-scale structural similarity" (MS-SSIM)
metric as a proxy for how well a human will think a compressed image
reproduces the original image. Both the JPEG and the RNN images are compressed
to the same MS-SSIM score, which in this case makes the RNN image take 25%
fewer bytes.

Part of the reason JPEG looks better in this case is that they have chosen to
feature a detail- and noise-rich portion of the image, which favors JPEG...
High JPEG compression has a sharpening effect around edges, including
"ringing" which hides well in noise. However, if you view the full-sized
comparison of the dog images, you can see that JPEG has essentially no detail
in the darker portions of the image. Presumably both the MS-SSIM metric and
humans would judge RNN better for these portions.
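
If anyone wants to poke at this themselves, here's a rough sketch using
TensorFlow's tf.image.ssim_multiscale as a stand-in for whatever MS-SSIM
implementation they used (the file names here are made up):

    import tensorflow as tf

    def load(path):
        # Decode to float32 in [0, 1]. MS-SSIM downsamples several times, so
        # the images need to be reasonably large (roughly >= 176 px per side
        # with the default 11-tap filter).
        data = tf.io.read_file(path)
        img = tf.image.decode_image(data, channels=3, expand_animations=False)
        return tf.image.convert_image_dtype(img, tf.float32)[tf.newaxis, ...]

    original = load("dog_original.png")   # hypothetical file names
    jpeg = load("dog_jpeg.png")
    rnn = load("dog_rnn.png")

    # Equal scores here would mean "equal quality" under this metric.
    print("JPEG MS-SSIM:", float(tf.image.ssim_multiscale(original, jpeg, max_val=1.0)[0]))
    print("RNN  MS-SSIM:", float(tf.image.ssim_multiscale(original, rnn, max_val=1.0)[0]))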

~~~
nomel
The NN images definitely have trouble reproducing sharp edges and have less
high-frequency content. Just look at the grass.

It appears to be a severely flawed metric.

~~~
nomel
By severely flawed metric, I meant,

> Presumably both the MS-SSIM metric and humans would judge RNN better for
> these portions.

doesn't remotely hold for detailed/high-contrast areas. If I can see a huge
difference, and vastly prefer one over the other, the metric is not useful by
its definition. Please don't get emotional about figures of merit.

I'm guessing a genuinely useful figure of merit would have put both file sizes
a bit closer together, and the NN would have shone, especially since it works
so well in low-contrast areas.

------
richard_todd
jpeg 2000 had about a 20% reduction in size over typical jpeg, while producing
virtually no blocking artifacts, 16 years ago[1]. Almost no one uses it,
though. Now in 2016 we are using neural networks to get a similar reduction,
except the dog's snout looks blurry, and with a process that I assume is much
more resource intensive. It's interesting for sure, but if people didn't care
about jp2, they would have to be drinking some serious AI Kool-aid to want
something like this.

[1]:
[https://en.m.wikipedia.org/wiki/JPEG_2000](https://en.m.wikipedia.org/wiki/JPEG_2000)

~~~
svantana
There's also BPG [1], an image format with a JavaScript decoder that can be
used on the web today, with (in my estimate) about a 40% improvement over
jpeg. It's a bit disingenuous to use something so far from the state of the
art as a baseline, but it's still an interesting line of research IMO.

[1] [http://bellard.org/bpg/](http://bellard.org/bpg/)

~~~
wyoh
BPG is derived from HEVC and is thus patent-encumbered like JPEG 2000. But we
have new codecs like Daala or AV1 that could be used as still image formats
once they're mature enough.

I made a website where one could compare various formats:

[http://wyohknott.github.io/image-formats-comparison](http://wyohknott.github.io/image-formats-comparison)

------
starmole
Important quote from the paper:

"The next challenge will be besting compression methods derived from video
compression codecs, such as WebP (which was derived from VP8 video codec), on
large images since they employ tricks such as reusing patches that were
already decoded."

Beating block-based JPEG with a global algorithm doesn't seem that exciting.

~~~
runeks
I would be extremely surprised if someone were able to create a neural network
that can come anywhere close to H.264/265. A lot of research has happened in
the area of video codecs. Neural networks are good at adaptation, but useless
at forming concepts about how the data is structured. For example: in video we
do motion compensation, because we know video captures motion since objects
move in physical reality. A neural network would have to do the same in order
to get the same compression levels. And I also doubt it can outperform
dedicated engineers in motion search estimation. But it's certainly
interesting to see the development.
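
To illustrate what I mean by motion compensation: predict each block of the
current frame by pointing at a nearby block in the previous frame, then keep
only the pointer and the residual. A toy NumPy sketch (exhaustive search, no
entropy coding, synthetic frames):

    import numpy as np

    def motion_compensate(prev, cur, block=16, search=8):
        # For each block of `cur`, find the best-matching block in `prev`
        # within +/- `search` pixels; store the motion vector and residual.
        h, w = cur.shape
        vectors = []
        residuals = np.zeros_like(cur, dtype=np.int16)
        for y in range(0, h - block + 1, block):
            for x in range(0, w - block + 1, block):
                target = cur[y:y + block, x:x + block].astype(np.int16)
                best, best_sad = (0, 0), np.inf
                for dy in range(-search, search + 1):
                    for dx in range(-search, search + 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy <= h - block and 0 <= xx <= w - block:
                            ref = prev[yy:yy + block, xx:xx + block].astype(np.int16)
                            sad = np.abs(target - ref).sum()
                            if sad < best_sad:
                                best, best_sad = (dy, dx), sad
                dy, dx = best
                ref = prev[y + dy:y + dy + block, x + dx:x + dx + block].astype(np.int16)
                vectors.append((y, x, dy, dx))
                residuals[y:y + block, x:x + block] = target - ref
        return vectors, residuals

    rng = np.random.default_rng(0)
    prev = rng.integers(0, 256, size=(64, 64), dtype=np.uint8)
    cur = np.roll(prev, shift=(2, 3), axis=(0, 1))   # fake "motion": shift the frame
    vectors, residuals = motion_compensate(prev, cur)
    print("nonzero residual values:", np.count_nonzero(residuals))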

~~~
joeyo
While I agree with you that modern video codecs work so well because they
embody knowledge about the statistical structure of natural scenes, there is
no reason a data-driven / neural network approach could not also learn these
sorts of hierarchical spatiotemporal contingencies.

The image-labeling neural networks are a good proof of concept of this
possibility. After all, what is a text label other than a concise (i.e. highly
compressed) representation of the pixel data in the image? Obviously, being
able to represent that a cat is in an image is quite lossy as compared to
being able to represent that a particular cat is in an image (and where it is,
and how it's scaled, etc). However, it's easy to imagine (in principle)
layering on a hierarchy of other features, each explaining more and more of
the statistics in the scene, until the original image is reproduced to
arbitrary precision.

Could this outperform a hardwired video compressor? In terms of file
size/bitrate, my intuition is yes and probably by a lot. In terms of
computational efficiency, no idea.

~~~
starmole
But isn't there a fundamental difference between labeling and compression? For
compression I would like stable guarantees for all data. For labeling it is
enough to do better. Think of the classic stereotype that all Asian faces look
alike to Europeans: that's still fine for labeling a human face, and useful.
But for image compression to have different quality depending on the subject
would be useless!

~~~
dTal
Not really a fundamental difference. The better you can predict the data, the
less you have to store. And the performance of all compression varies wildly
depending on subject matter. The deeper the understanding of the data, the
better the compression - and the more difficult to distinguish artefacts from
signal. A compression algorithm that started giving you plausible, but
incorrect faces if you turned the data rate too far down wouldn't be useless -
it would be a stunning technical achievement.
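
To make the prediction/compression link concrete, here's a toy sketch: predict
each pixel from its left neighbour and compare the entropy of the residuals
with that of the raw pixels (synthetic gradient image, NumPy only; real codecs
just use much better predictors):

    import numpy as np

    def entropy_bits_per_symbol(values):
        # Shannon entropy of the empirical symbol distribution.
        _, counts = np.unique(values, return_counts=True)
        p = counts / counts.sum()
        return -(p * np.log2(p)).sum()

    x = np.linspace(0, 1, 256)
    img = (255 * np.outer(x, x)).astype(np.uint8)       # smooth stand-in image
    residual = np.diff(img.astype(np.int16), axis=1)    # pixel minus left neighbour

    print("raw pixels:", entropy_bits_per_symbol(img), "bits/symbol")
    print("residuals: ", entropy_bits_per_symbol(residual), "bits/symbol")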

------
the8472
Why does a blog page showing static content do madness like this? I'd think
google engineers of all people would know better. The site doesn't even work
without javascript from a 3rd-party domain.

[https://my.mixtape.moe/klvzip.png](https://my.mixtape.moe/klvzip.png)

Static mirror: [https://archive.fo/yyozl](https://archive.fo/yyozl)

~~~
pipeep
Engineers are expensive. Google's research blog probably isn't a very high-
traffic surface, and most visitors are probably on modern machines and fast
networks. It doesn't make sense to spend a bunch of engineering resources on
optimizing something like this. So instead, you do whatever is easiest to
implement and maintain.

------
wyldfire
> Instead of using a DCT to generate a new bit representation like many
> compression schemes in use today, we train two sets of neural networks - one
> to create the codes from the image (encoder) and another to create the image
> from the codes (decoder).

So instead of implementing a DCT on my client I need to implement a neural
network? Or are these encoder/decoder steps merely used for the iterative
"encoding" process? It seems like the representation of a "GRU" file is
different from any other.

~~~
radarsat1
It sounds complicated, but neural networks these days are basically just a
bunch of filters with a nonlinear cutoff and binning (from a signal processing
point of view).

Super simple to implement the feed-forward scenario for decoding.

Not entirely sure what the residual memory aspect of these networks add in
terms of complexity, but it's probably just another vector add-multiply, or
something to that effect.
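
For anyone curious how little is involved, a feed-forward "layer" of that kind
is just a convolution followed by a pointwise nonlinearity. A sketch with
made-up weights (NumPy, not any real model):

    import numpy as np

    def conv2d(x, kernel):
        # 'valid' convolution of a 2-D array with a small square kernel.
        k = kernel.shape[0]
        h, w = x.shape[0] - k + 1, x.shape[1] - k + 1
        out = np.zeros((h, w))
        for i in range(h):
            for j in range(w):
                out[i, j] = (x[i:i + k, j:j + k] * kernel).sum()
        return out

    def relu(x):
        return np.maximum(x, 0.0)          # the "nonlinear cutoff"

    codes = np.random.rand(32, 32)         # stand-in for decoded codes
    weights = np.random.randn(3, 3) * 0.1  # stand-in for learned filter weights
    layer_out = relu(conv2d(codes, weights))   # one decoder layer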

------
jpambrun
It's fun and scientifically interesting, but the decoder model is 87MB by
itself.

~~~
aab0
Many NNs can be compressed considerably without losing much performance. The
runtime of RNNs is more concerning, as is whether anyone wants to move to a
new image format, but it's still interesting pure research in terms of
learning to predict image contents. It's a kind of unsupervised learning,
which is one of the big outstanding questions for NNs at the moment.
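
The simplest version of that is post-training quantization: store the float32
weights as 8-bit integers plus a scale and offset, which alone is a 4x size
reduction before any pruning or entropy coding. A rough sketch with a stand-in
weight matrix:

    import numpy as np

    weights = np.random.randn(1000, 1000).astype(np.float32)   # stand-in layer

    lo, hi = weights.min(), weights.max()
    scale = (hi - lo) / 255.0
    q = np.round((weights - lo) / scale).astype(np.uint8)      # 1 byte per weight
    dequantised = q.astype(np.float32) * scale + lo

    print("max abs error:", np.abs(weights - dequantised).max())
    print("size: %.1f MB -> %.1f MB" % (weights.nbytes / 1e6, q.nbytes / 1e6))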

------
ilaksh
I asked about the possibility of doing this type of thing on CS Stack Exchange
two years ago.

[http://cs.stackexchange.com/questions/22317/does-there-exist...](http://cs.stackexchange.com/questions/22317/does-there-exist-a-data-compression-algorithm-that-uses-a-large-dataset-distribu)

They basically ripped me a new one, said it was a stupid idea, and that I
shouldn't make suggestions in a question. Then I took the suggestions and
details out (but left the basic concept in there) and they gave me a lecture
on the basics of image compression.

Made me really not want to try to discuss anything with anyone after that.

~~~
nl
Expecting 2 years of research by around 20 of the best DNN researchers on the
planet [1] to be compressed into a StackOverflow answer before it has been
done is asking rather a lot.

Not a huge fan of the negativity on StackOverflow, but until August 2016 (when
the paper this was based on was published - or maybe 2015, with DRAW[2])
people actively working in this area didn't think it was possible.

Also, the single answer there certainly didn't say anything like it was a
stupid idea. I don't think the author of that answer knew much about
autoencoders.

Also^2, your question isn't really anything like what this addresses. Your
question concentrates on the idea of compressing a large set of images, and
sharing some kind of representation.

That certainly is possible without using a ML approach. And yes, autoencoders
have been around for a long time.

But hoverboards are a great idea, too.

[1] This paper has 7 authors, Gregor et al has 5 authors, DRAW has 6.

[2]
[https://arxiv.org/pdf/1502.04623v2.pdf](https://arxiv.org/pdf/1502.04623v2.pdf)

~~~
ilaksh
It really is like what it addresses.

------
ChrisFoster
It's quite exciting to see progress on a data driven approach to compression.
Any compression program encodes a certain amount of information about the
correlations of the input data in the program itself. It's a big engineering
task to determine a simple and computationally efficient scheme which models a
given type of correlation.

It seems to me like the data driven approach could greatly outperform hand
tuned codecs in terms of compression ratio by using a far more expressive
model of the input data. Computational cost and model size is likely to be a
lot higher though, unless that's also factored into the optimization problem
as a regularization term: if you don't ask for simplicity, you're unlikely to
get it!

Lossy codecs like jpeg are optimized to permit the kinds of errors that humans
don't find objectionable. However, it's easy to imagine that this is not the
_right kind_ of lossiness for some use cases. With a data-driven approach, one
could imagine optimizing for compression which only loses information
irrelevant to a (potentially nonhuman) process consuming the data.

------
Houshalter
This seems so overly complicated, with the RNN learning to do arithmetic
coding and image compression all at once. Why not do something like
autoencoders to compress the image? Then you need only send a small hidden
state. You can compress an image to many fewer bits like that. Then you can
clean up the remaining error by sending a smaller delta, which itself can be
compressed, either by the same neural net or with standard image compression.
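
Roughly what I mean, with a linear bottleneck standing in for a trained
autoencoder (all data here is made up):

    import numpy as np

    rng = np.random.default_rng(0)
    patches = rng.random((5000, 64)).astype(np.float32)    # stand-in 8x8 patches

    # "Train" the bottleneck: top-k principal directions of the patch data.
    mean = patches.mean(axis=0)
    _, _, vt = np.linalg.svd(patches - mean, full_matrices=False)
    basis = vt[:8]                                          # 8-dim hidden state

    def encode(p):
        return (p - mean) @ basis.T        # the small code you'd send

    def decode(c):
        return c @ basis + mean            # reconstruction on the receiver side

    code = encode(patches[0])
    delta = patches[0] - decode(code)      # residual: near zero where the model
                                           # fits, so it compresses well on its own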

The idea of using NNs for compression has been around for at least 2 decades.
The real issue is that it's ridiculously slow. Performance is a big deal for
most applications.

It's also not clear how to handle different resolutions or ratios.

------
Lerc
I see there being a number of paths for neural network compression.

The simplest is a network with inputs of [X, Y] and outputs of [R, G, B],
where the image is encoded into the network weights. You have to train the
network per image. My guess is it would need large, complex images before you
could get compression rates comparable to simpler techniques. An example of
this can be seen at
[http://cs.stanford.edu/people/karpathy/convnetjs/demo/image_...](http://cs.stanford.edu/people/karpathy/convnetjs/demo/image_regression.html)
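
A rough sketch of storing an image purely in fitted weights, with a linear fit
on random Fourier features of the coordinates standing in for the per-image
trained MLP (stand-in image data):

    import numpy as np

    rng = np.random.default_rng(0)
    h = w = 64
    img = rng.random((h, w, 3)).astype(np.float32)          # stand-in image

    ys, xs = np.mgrid[0:h, 0:w]
    coords = np.stack([ys.ravel() / h, xs.ravel() / w], axis=1)        # (H*W, 2)

    proj = rng.normal(scale=10.0, size=(2, 256))            # random frequencies
    feats = np.concatenate([np.sin(coords @ proj), np.cos(coords @ proj)], axis=1)

    # The fitted weights are the only thing you would need to store.
    weights, *_ = np.linalg.lstsq(feats, img.reshape(-1, 3), rcond=None)
    recon = (feats @ weights).reshape(h, w, 3)              # "decoded" image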

In the same vein, you could encode video as a network of [X,Y,T] --> [R, G,
B]. I suspect that would be getting into lifetime of the universe scales of
training time to get high quality.

The other way to go is a neural net decoder. The network is trained to
generate images from input data. You could theoretically train a network to do
an IDCT, so it is also within the bounds of possibility that you could train a
transform that has better quality/compressibility characteristics. This is one
network for all possible images.

You can also do hybrids of the above techniques, where you train a decoder to
handle a class of images and then provide an input bundle.

I think the place where Neural Networks would excel would be as a
predictive+delta compression method. Neural networks should be able to predict
based upon the context of the parts of the image that have already been
decoded.

Imagine a neural network image upscaler that doubled the size of a
lower-resolution image. If you store a delta map to correct any areas that the
upscaler gets badly wrong, then you have a method to store arbitrary images.
Ideally you can roll the delta encoding into the network as well. Rather than
just correcting poor guesses, the network could rank possible outputs by
likelihood. The delta map then just picks the correct guess, which, if the
predictor is good, should result in an extremely compressible delta map.
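
Stripped of the network, the delta-map idea looks like this, with a naive
pixel-repeat upscaler standing in for the learned predictor (stand-in image):

    import numpy as np

    rng = np.random.default_rng(0)
    img = rng.random((128, 128)).astype(np.float32)   # stand-in image

    low = img[::2, ::2]                               # what actually gets stored
    prediction = np.repeat(np.repeat(low, 2, axis=0), 2, axis=1)   # naive 2x upscale
    delta = img - prediction                          # correction map to encode

    reconstructed = prediction + delta                # decoder side
    assert np.allclose(reconstructed, img)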

The principle is broadly similar to wavelet compression, only with a neural
network the decoder can potentially go, "That's an eye/frog/egg/box, I know
how this is going to look scaled up."

------
concerneduser
That neural network technology is all fine and good for compressing images of
lighthouses and dogs - but what about other things?

------
rdtsc
Now that Google is fully on the neural network / deep learning train with
their Tensor Processing Units, we'll be seeing NNs applied to everything.
There was an article about translation; now, compression. It is a bit amusing,
but there's nothing wrong with it. This is great stuff, and I am glad they are
sharing all this work.

------
sevenless
I've been wondering when neural networks might be able to compress a movie
back down to the screenplay.

------
acd
Is there any image compression that uses eigenfaces, exploiting the fact that
your face may look similar to someone else's face?

What if you used uniqueness and an eigenface lookup table for compression?
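
Roughly what I have in mind, with plain PCA and made-up data (the eigenfaces
would be the shared lookup table both sides need):

    import numpy as np

    rng = np.random.default_rng(0)
    faces = rng.random((400, 64 * 64)).astype(np.float32)   # stand-in aligned faces

    mean = faces.mean(axis=0)
    _, _, vt = np.linalg.svd(faces - mean, full_matrices=False)
    eigenfaces = vt[:50]                       # shared table: 50 eigenfaces

    coeffs = (faces[0] - mean) @ eigenfaces.T  # 50 numbers instead of 4096 pixels
    reconstruction = coeffs @ eigenfaces + mean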

~~~
semaphoreP
Not an expert in this, but you would need a lot of eigenfaces to capture the
variance (i.e. a lot of upfront cost storing all the eigenimages). It might be
good for something that is very standardized (e.g. passport photos where
everyone is in the same position), but otherwise I think there is probably too
much variance to keep the number of eigenimages reasonable.

------
zump
Compression engineers shaking in their boots.

------
aligajani
I knew this was coming. Great stuff.

------
rasz_pl
You could probably reach a 20% reduction just by building a custom
quantization table (DQT) per image.

------
samfisher83
Was this inspired by Silicon Valley?

------
joantune
Nice!! They should call it Pied Piper :D (I can't believe I was the 1st one
with this comment)

~~~
joantune
wow, such a hated comment.. I don't get it, I guess that there aren't many
fans of the show around here

