
Show HN: HiFiC – High-Fidelity Generative Image Compression – demo and paper - atorodius
https://hific.github.io
======
crazygringo
I've been waiting for something to implement this concept for _so long_ , and
I'm so happy to finally get a chance to explore how it works in practice!

It's really fascinating to zoom in and compare the compressed version with the
original -- the "general idea" is all there and the quality is roughly the
same, but strands of hair move to different positions, skin pores are in a
completely different pattern, etc. -- it's all regenerated to look "right",
but at the lowest level the texture isn't "real".

Which is so fascinatingly different from JPEG or other lossy schemes -- the
compression artifacts are obviously exactly that, you'd never confuse them for
real detail. While here, you could never guess what is real detail vs. what is
reconstructed detail.

My hunch is that this won't actually take off at first for images, because
images on the internet don't use up _that_ much of our bandwidth anyways, and
the performance and model storage requirements here are huge.

BUT -- if the same idea can be applied to 1080p _video_ with a temporal
dimension, so that even if reconstructed textures are "false" they persist
across frames, then suddenly the economics here would start to make sense:
dedicated chips for encoding and decoding, with a built-in 10 GB model in ROM,
trained on a decent proportion of every movie and TV show ever made... imagine
if 2-hour movies were 200 MB instead of 2 GB?

(And once we have chips for video, they could be trivially re-used for still
images, much like Apple reuses h.265 for HEIC.)

~~~
ksec
> because images on the internet don't use up that much of our bandwidth
> anyways,

On the Internet, yes. But for web pages, images are still by far the largest
payload. They can be anywhere between 60 and 90% of total content size.

Thinking in the context of latency, you could benefit from images appearing
much quicker or even instantaneously (embedded inside the HTML) because of the
size reduction. Especially on the first screen.

~~~
crazygringo
Progressive JPEGs go a long way towards that though, no?

~~~
ksec
Not JPEG in its current form (maybe JPEG XL). The user experience is
horrible. And during the last HN discussion on Medium using progressive JPEG,
it turned out a lot of people had similar thoughts.

------
atorodius
Hi, author here! This is the demo of a research project demonstrating the
potential of using GANs for compression. For this to be used in practice, more
research is required to make the model smaller/faster, or to build a dedicated
chip! Currently, this runs at approximately 0.7 megapixel/sec on a GPU with
unoptimized TensorFlow. Happy to answer any questions.

~~~
ksec
>Currently, this runs at approximately 0.7 megapixel/sec on a GPU with
unoptimized TensorFlow.

Forgive me, is that encoding or decoding? And what is the size of the library
required to decode it? (Edit: looks like it's answered below)

It was only an hour ago I was looking at some pictures encoded with VVC /
H.266 reference encoder, it was better than BPG / HEVC, but this HiFiC still
beats it by quite a bit.

This whole thing blows my mind about the limitations I thought we had on image
compression.

Maybe I should take back my words about JPEG XL being the future. This seems
revolutionary.

~~~
gtoderici
(coauthor here) The 0.7 megapixels/sec is PNG decode (to get
input)+encoding+decoding+PNG encoding (to get output we can visualize in a
browser) speed.

Thanks for your kind comment!

------
jasonjayr
This is impressive, but isn't it also susceptible to the type of bug that
Xerox copiers got hit with many years ago?

[https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...](https://www.theregister.com/2013/08/06/xerox_copier_flaw_means_dodgy_numbers_and_dangerous_designs/)

~~~
atorodius
Small text is indeed one of the fail cases, and more research needs to be done
here (see also my other comment [1]). We mention this issue in the
supplementary materials [2], and you can check out an example at [3].

[1]
[https://news.ycombinator.com/item?id=23654161](https://news.ycombinator.com/item?id=23654161)

[2]
[https://storage.googleapis.com/hific/data/appendixb/appendix...](https://storage.googleapis.com/hific/data/appendixb/appendixb.pdf)

[3]
[https://storage.googleapis.com/hific/userstudy/visualize.htm...](https://storage.googleapis.com/hific/userstudy/visualize.html?perPage=1&page=5)

~~~
mmastrac
This might be a good opportunity to lead to research on a "good faith
watermark" for GAN-compressed images that may include hallucinated details.

~~~
gtoderici
One of the things we discussed to address this is to have the ability to: a)
turn off detail hallucination completely given the same bitstream; and b)
store the median/maximum absolute error across the image.

(b) should allow the user to determine whether the image is suitable for their
use-case.
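
The error-reporting idea in (b) is easy to illustrate; here is a minimal
sketch (the function name is hypothetical, not from the HiFiC codebase) of the
per-image statistics that could be stored alongside the bitstream:

```python
import numpy as np

def fidelity_stats(original, reconstruction):
    """Summarize the reconstruction error for one image.

    Both inputs are uint8 arrays of shape (H, W, 3); we widen to int16
    before subtracting so the difference cannot wrap around.
    """
    err = np.abs(original.astype(np.int16) - reconstruction.astype(np.int16))
    return {
        "median_abs_error": float(np.median(err)),
        "max_abs_error": int(err.max()),
    }
```

A downstream user could then gate on these numbers, e.g. reject any
reconstruction whose max error exceeds a domain-specific threshold.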

~~~
the8472
Is it possible to put the decoder into a feedback loop and search among
multiple possible encodings that minimize the residual errors? Similar to
trellis optimization in video codecs.
[http://akuvian.org/src/x264/trellis.txt](http://akuvian.org/src/x264/trellis.txt)
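
The feedback loop could be sketched like this (a toy illustration only;
`encode`, `decode`, and `perturbations` are hypothetical stand-ins, not the
actual HiFiC API):

```python
import numpy as np

def search_encoding(image, encode, decode, perturbations):
    """Trellis-style search: try several candidate latents and keep the
    one whose decoded output has the smallest mean residual error.

    `encode`/`decode` play the role of the codec's analysis/synthesis
    transforms; each entry in `perturbations` nudges the base latent.
    """
    base = encode(image)
    best, best_err = base, np.abs(decode(base) - image).mean()
    for perturb in perturbations:
        cand = perturb(base)
        err = np.abs(decode(cand) - image).mean()
        if err < best_err:
            best, best_err = cand, err
    return best, best_err
```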

~~~
atorodius
It would be possible - but by minimizing residual errors you end up in a
similar regime as when minimizing MSE again, likely making reconstructions
blurry!

------
nullc
Generally lossy compression methods have a 'knee' below which the perceived
quality rapidly drops off. The default jpeg examples here are well below that
knee.

Usually comparing below the knee isn't very useful except to better help
understand the sort of artefacts that one might want to also look for in
higher rate images.

It would be interesting to see some examples from a lower rate HiFiC
specifically for the purpose of bringing out some artefacts to help inform
comparisons at higher rates.

~~~
ebg13
> _The default jpeg examples here are well below that knee._

The examples are chosen as multiples of the HiFiC file size: 1x, 2x, ...

~~~
nullc
I used the word 'default' for a reason! :)

~~~
atorodius
The HiFiC model we show is already the low model :) We show JPEG and BPG at
the same rate to visualize how low the bitrate of our model actually is. And
for JPEG and BPG we show 1x, 2x, and so on to visualize how many more bits the
previous methods need to look similar visually.

~~~
nullc
Sure, but it isn't low rate enough to produce the level of gross artefacts
needed to train the viewer to recognize faults in the images.

E.g. after looking at those jpeg images you're able to flip to much higher
rate jpegs and notice blocking artefacts that a completely unprepared subject
would fail to notice.

In my work on lossy codecs I found that having trained listeners was critical
to detecting artifacts that would pass on quick inspection but would become
annoying after prolonged exposure.

From a marketing-fairness standpoint, since the only codec you evaluate 'below
the knee' is jpeg, it risks exaggerating the difference. It would be just
about as accurate to say that jpeg can't achieve those low rates -- since no
one will use jpeg at a point where it's falling apart. This work is impressive
enough that it doesn't need any (accidental) exaggeration.

I think it's best to either have all comparison points above the point where
the codecs fall apart, or have a below-knee example for all so that it's clear
how they fail and where they fail... rather than asking users to compare the
gigantic difference between a failed and non-failing output.

~~~
atorodius
In addition to my sibling comment, I would like to add:

> since no one will use jpeg at a point where its falling apart.

Hence we also put it at bitrates people actually use :) And then the point
should be that HiFiC uses much fewer bits.

> below-knee for HiFiC

We actually did not train lower than this model, HiFiC^lo. That it works so
well was somewhat surprising to us also! From other GAN literature, it seems
reasonable to expect the "below-knee" point for this kind of approach to still
look realistic, but not faithful anymore. I.e., without the original, it may
be hard to tell that it's fake.

~~~
nullc
> expect the "below-knee" point for this kind of approach to still look
> realistic, but not faithful anymore.

If so, that might be kinda cool on its own.

(and perhaps also useful for anti-forensic purposes, e.g. use it as a de-
noising mechanism that makes it more difficult to identify the source camera)

------
deweller
This is really impressive. But it raises some questions for me.

What size library would be required to decode these types of images?

And would the decoding library be updated on a regular basis? Would the image
change when decoding library is updated? Would images be tagged with a version
of the library when encoded (HiFiC v1.0.2?)

~~~
atorodius
Thanks for the kind words!

> What size library would be required to decode these types of images?

The model is 726MB. Keep in mind that this is a research project - further
research needs to be done on how to make it faster, now that we know that this
kind of result is possible!

> And would the decoding library be updated on a regular basis?

Only if you want even better reconstructions!

> Would the image change when decoding library is updated? Would images be
> tagged with a version of the library when encoded (HiFiC v1.0.2?)

Yes, some header would have to be added.
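
Such a header could be as simple as a few magic bytes plus a version triple; a
hypothetical sketch (this is not an actual HiFiC container format):

```python
import struct

MAGIC = b"HFC1"  # made-up magic bytes for this illustration

def pack_header(major, minor, patch, payload):
    """Prefix a compressed bitstream with the model version used to
    encode it, so a decoder can select (or fetch) a matching model."""
    return MAGIC + struct.pack(">BBB", major, minor, patch) + payload

def unpack_header(blob):
    """Split a container back into (version, bitstream)."""
    assert blob[:4] == MAGIC, "not a recognized container"
    major, minor, patch = struct.unpack(">BBB", blob[4:7])
    return (major, minor, patch), blob[7:]
```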

~~~
pornel
I'm very curious how such a thing could be standardized as an image format.
With classic image formats there's an expectation that one can write a spec
and an independent implementation from scratch. "Take my X-MB large
pre-trained model" is unprecedented.

Would it still be competitive with H.265 if the model was 10MB or 50MB in
size? 0.7GB may be difficult for widespread adoption.

~~~
gtoderici
Independently of this work, we have models which are competitive with HEVC
while being significantly smaller (this is from previous work). They will not
look nearly as good as what you see in the website demo, but they're still
better.

I don't have any such model handy but perhaps it's 10x-20x smaller.

We don't claim that this (or even the previous work) is the way to go forward
for images, but we hope to incentivize more researchers to look in this
direction so that we can figure out how to deploy these kinds of methods since
the results they produce are very compelling.

------
0-_-0
I'm very impressed, I was waiting for an image codec that combines something
like VGG loss + GANs! (Another thing that I'm waiting for is a neural JPEG
decoder with a GAN, which would be backwards compatible with all the pictures
already out there!) Now we need to get some massive standardisation process
going to make this more practical and perfect it, just like it was done for
JPEG in the old days! (And then do it for video and audio too!)

What happens if you compress a noisy image? Does the compression denoise the
image?

~~~
gtoderici
On the standardization issue: the advantage of a method like the one we
presented is that, as long as there exists a standard for model specification,
we can encode every image with an arbitrary computational graph that can be
linked from the container.

Imagine being able to have domain specific models - say we could have a high
accuracy/precision model for medical images (super-close to lossless), and one
for low bandwidth applications where detail generation is paramount. Also
imagine having a program written today (assuming the standard is out), and it
being able to decode images created with a model invented 10 years from today
doing things that were not even thought possible when the program was
originally written. This should be possible because new models are defined in
terms of the same low-level building blocks (like convolution and other
mathematical operations)!

On noise: I'll let my coauthors find some links to noisy images to see what
happens when you process those.

~~~
0-_-0
Absolutely! Being able to improve a decoder for an existing encoder (and vice
versa) is a great advantage!

------
ricardobeat
This is amazing. Not only does it completely blow JPEG and BPG out of the
water in efficiency, but it makes some images look _better_ than the original!
It seems to reduce local contrast, which works well for skin (photos 1, 2 and
5), and what it does for repeating patterns is quite pleasant to the eyes (1
and 4).

The only downside I see is the introduction of artifacts like you can see on
the right-side face of the clock tower.

~~~
pbowyer
> it makes some images look better than the original! It seems to reduce local
> contrast which works well for skin (photos 1, 2 and 5)

It's better than the lossy JPEGs, but I strongly disagree that it's better
than the original - or even close. Examining image 1 closely, the loss of
contrast and the regularising of the texture of the skin make the HiFiC image
feel fake when I look at it. I downloaded and opened the images at random on
my computer to get rid of the bias of "the right one is the original", and it
was the original I preferred each time.

Regardless, it's impressive work and I look forward to future developments!

~~~
CamperBob2
How about image 2? On both of my monitors, the colors in the regenerated image
have a striking subjective advantage over the original. I'd be surprised if
95%+ of observers didn't favor the compressed version of image 2.

~~~
pbowyer
Checked on 2 monitors here (one color calibrated) and I see the HiFiC version
of image 2 has slightly more muted colours than the original.

Unclear what you're seeing to get "striking subjective advantage" but it's not
visible here.

------
mmastrac
In addition to being an amazing result, props for including a variety of
different skin tones in the hero sample box. It's extremely good to see that
thought was put into reducing potential bias in this compression algorithm,
which has unfortunately been an issue in the history of image capture [1].

[1] [https://www.nytimes.com/2019/04/25/lens/sarah-lewis-
racial-b...](https://www.nytimes.com/2019/04/25/lens/sarah-lewis-racial-bias-
photography.html)

------
stereo
The obvious question when looking at the comparison is, what kind of jpeg did
they pick, and was it fair? It turns out that mozjpeg, which is pretty much
state of the art, can squeeze an extra 5% out of shades_jpg_4x_0.807.jpg.

Your comparison would be even more impressive if you produced the best jpegs
you can at those sizes.

~~~
atorodius
Thank you for your suggestion. We use libjpeg for the demo. Keep in mind that
we compare against JPEG files which are at around 100%-400% of the size of the
proposed method, so 5% would not really add much as far as we can tell.

------
roywiggins
Finally someone's instantiated something like Vernor Vinge's "evocation" idea,
quoted here:

[https://www.danvk.org/wp/2014-02-10/vernor-vinge-on-video-
ca...](https://www.danvk.org/wp/2014-02-10/vernor-vinge-on-video-
calls/index.html)

------
mchusma
If you are into this you may also like to see a similar model facebook used
for foveated rendering (Nov 2019): [https://ai.facebook.com/blog/deepfovea-
using-deep-learning-f...](https://ai.facebook.com/blog/deepfovea-using-deep-
learning-for-foveated-reconstruction-in-ar-vr/)

------
ebg13
A number of comments here mention the regularity of the noise in the Lo
variant as producing a slightly artificial appearance (zoom on the woman's
forehead in picture 1). Would it improve anything to look for swathes where
the dominant texture is noticeably uniform noise and mix it up a bit?

~~~
gtoderici
Not sure how that would work out in practice. We kind of rushed the
experiments in any case - maybe we could let the methods train for longer. In
the end the adversary might learn any regular patterns that appear, thereby
forcing the generator to come up with something that cannot be detected
easily.

(We trained everything to 1M steps. Perhaps letting it train to 2M would solve
it.)

------
isoprophlex
Super exciting! Fantastic work and cool demo!

I hope this can be 'stabilized' to allow video compression. If you run e.g.
style transfer networks on video, the artifacts in the output frames aren't
stable; they jump around from frame to frame.

(Not sure if you can decode this at 24 fps on normal hardware, but still...)

~~~
atorodius
Thanks! And agreed - doing this in real time for high-resolution videos still
needs quite some research though.

~~~
mikepurvis
Shipping the network in device ROMs seems like a pretty far-off thing at this
point, but I wonder if there could be something in the nearer term around
time-shifting bandwidth usage, eg:

\- Media appliance under your TV downloads a 5GB updated network during the
night so that you can use smaller-encoded streams of arbitrary content in the
evening.

\- Spotify maintains a network on your phone that is updated while you're at
home so that you can stream audio on the road with minimal data usage.

\- Your car has a network in it that is periodically updated over wifi and
allows you to receive streetview images over a manufacturer-supplied M2M
connection.

------
littlestymaar
Amazing work! It's something we knew would come at some point, but I didn't
expect it to be that good so early!

Since some of the authors are in this thread answering questions, I have one:
I wonder if such a project requires Google-scale infrastructure or if it's
something that can easily be replicated. For instance, how big is the training
set? And how much compute was necessary to train the model?

~~~
atorodius
Hi! It depends on how far you want to go. For this project, we did a lot of
exploration, _because_ we had Google-scale infrastructure. Replicating _all_
the exploration will need a lot of GPUs (in the 100s), replicating all the
experiments that actually went into the paper maybe a few dozen. Training
something similar to what's in the demo with the code we will release takes 1
V100 :)

We can't release the internal training set, but expect a dataset of a few
hundred thousand images (e.g., openimages) to be sufficient, maybe even less
(AFAIK this has not been explored in a controlled setting).

~~~
littlestymaar
Thanks a lot!

------
microcolonel
Hmm, it's too bad that none of the methods (including the novel one)
accurately represent her pupils. In the original image, you can tell which
direction she's looking, and see how dilated her pupils are.

What I feel could be a high priority is machine learning for rate-distortion
optimization. It would be really good to spend a higher rate on the pupils.

~~~
atorodius
FWIW, the network uses the hyperprior probability estimation network (citation
[6] in the paper), which already adapts the rate depending on the image
region.

------
fizixer
Fantastic.

When can we install the 1TiB video player/codec/plugin, that'll allow us to
stream 4K movies on a 1Mbps connection?

~~~
atorodius
Hopefully soon! However, in addition to research needed to make this work for
video, it also needs research to make it smaller and/or to put it in silicon
(see my other comments). The network shown is "only" 726MB BTW :)

------
lokl
Has this been extended to video (beyond compressing each frame separately)?

~~~
atorodius
This is the first time that GANs have been used to obtain high-fidelity
reconstructions at useful bitrates. We haven't tried extending it to video.

