Show HN: HiFiC – High-Fidelity Generative Image Compression – demo and paper (hific.github.io)
281 points by atorodius 17 days ago | 89 comments



I've been waiting for something to implement this concept for so long, and I'm so happy to finally get a chance to explore how it works in practice!

It's really fascinating to zoom in and compare the compressed version with the original -- the "general idea" is all there and the quality is roughly the same, but strands of hair move to different positions, skin pores are in a completely different pattern, etc. -- it's all regenerated to look "right", but at the lowest level the texture isn't "real".

Which is so fascinatingly different from JPEG or other lossy schemes -- there, the compression artifacts are obviously exactly that, and you'd never confuse them with real detail. Here, you could never guess what is real detail vs. what is reconstructed detail.

My hunch is that this won't actually take off at first for images, because images on the internet don't use up that much of our bandwidth anyways, and the performance and model storage requirements here are huge.

BUT -- if the same idea can be applied to 1080p video with a temporal dimension, so that even if reconstructed textures are "false" they persist across frames, then suddenly the economics here would start to make sense: dedicated chips for encoding and decoding, with a built-in 10 GB model in ROM, trained on a decent proportion of every movie and TV show ever made... imagine if 2-hour movies were 200 MB instead of 2 GB?

(And once we have chips for video, they could be trivially re-used for still images, much like Apple reuses h.265 for HEIC.)


I just can't wait for the glitches people will spot. Trained networks can "imagine" some elements existing in the same way that they recreate the different texture. If this gets applied based on existing movies, there will definitely be a time when someone in the background raises a hand to their ear to brush their hair, but suddenly a mobile phone materialises from a shadow on their hand. It will be just a few pixels that you'd miss... But it will be there to spot, like distant power lines in period dramas.


Given the seemingly random nature of the compression "artefacts", versus the more predictable compression artefacts with other algos, I wonder if this would limit use of such images in scenarios where the resulting images had to be relied on in court as evidence of something?


Absolutely. Exactly for this reason, I would hope not to see this ever become the default format for cell phone cameras unless there were somehow explicit guarantees around what is or is not faithful. Same with security cameras, etc.

But anywhere where the purpose of the output is social/entertainment, it seems this would be freely adopted: profile pictures, marketing images, videos and movies. It could very much mean, however, that a photo on someone's Facebook profile might not be admissible as evidence, if used to identify an exact match with someone, rather than just "someone who looks similar".



Wow! The implications are huge. At least the Xerox bug had the benefit of being reproducible. An AI algo might _occasionally_ spit out an incorrect digit, and other times not. Really, this highlights the need for explicability in AI/ML.


Given that our networks can generally handle 2GB movies now, can you imagine streaming a 4K movie, but instead of using 20GB it only takes 2GB?


It's going to be funny when the neural network is clever enough to hallucinate sequels.


> because images on the internet don't use up that much of our bandwidth anyways,

On the Internet as a whole, yes. But for web pages, images are still by far the largest payload. They can be anywhere between 60% and 90% of total content size.

Think of it in the context of latency, too: with such a reduction in size, images could appear much more quickly, or even instantaneously (embedded inside the HTML) -- especially on the first screen.


> Images are still by far the largest payload.

A good point that has been made before is that images are often by far the biggest part of a web page, but most sites still don't even do basic optimisations like resizing them to the viewing size, using WebP, using pngcrush, etc.

That said, this would be cool for making really well optimised sites.


Progressive JPEGs go a long way towards that though, no?


Not JPEG in its current form (maybe JPEG XL). The user experience is horrible. And during the last HN discussion about Medium using progressive JPEG, it turned out a lot of people had similar thoughts.


I think it will be extremely popular. Most people are concerned about how computers look. Whether or not the behavior is correct usually only bothers people who spend time working on computers.

I’m not sure if it’s because the first group is just used to life being unpredictable and don’t realize things could be better, or if they really don’t care.


Regarding your last paragraph: I wouldn't be surprised if existing "upscale" chips already work similarly to what you suggested: they convert lower-res content to (for example) 4K. See https://www.sony.com/electronics/4k-resolution-4k-upscaling-... for an example.


Yes!

Nvidia shows what is possible with dlss2 as well.

I'm very curious about what we will be able to do with voice as well. Perhaps you will sync voice profiles from your family/friends for better compression or better reconstruction.

We will also see models that achieve higher detail by incorporating all frames and enhancing existing material. We already see this in gaming as well.


There's a similar paper that was shared here a couple of years ago: https://data.vision.ee.ethz.ch/aeirikur/extremecompression/ -- when compared to low-bitrate DCT-based compressors it looks impressive, but compared to the original source the compression artifacts include things like incorrect license plate numbers, building architecture, species of trees, etc.


Hi, author here! This is the demo of a research project demonstrating the potential of using GANs for compression. For this to be used in practice, more research is required to make the model smaller/faster, or to build a dedicated chip! Currently, this runs at approximately 0.7 megapixel/sec on a GPU with unoptimized TensorFlow. Happy to answer any questions.


Great work. Thank you for making it so accessible online and sharing it on HN!

The first thought I had, after looking at some of the examples, is that it should be fairly trivial to train or finetune the compression network to hallucinate missing detail with "BPG-style" or "JPG-style" smoothing artifacts by feeding compressed images as "real" samples to the adversarial network during training/finetuning.

I wonder if doing that would make it possible to achieve even greater compression with loss of detail that is indistinguishable from traditional compression methods and therefore still acceptable to the human eye.


What you suggest has already been done: train a neural network with the output of BPG or JPEG, and ask it to reconstruct the input with just the decompressed pixels available.

It definitely is a valid approach but the limitation is that if the network needs some texture-specific information that cannot be extracted from the decoded pixels, it can't really do much.

There were approaches where such information was also sent on the side, which yielded better results, of course.

The field is wide open and each approach has its own challenges (e.g., you may need to train one network per quantization level if you're going to do restoration).
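
To make the restoration approach above concrete, here is a minimal sketch (my own illustration, not code from any of the papers) of a small TensorFlow CNN trained to map JPEG-decoded pixels back toward the original. The layer sizes, quality level, and residual design are assumptions:

    import tensorflow as tf

    def build_restoration_net():
        # Small residual CNN: predicts a correction on top of the decoded JPEG pixels.
        x_in = tf.keras.Input(shape=(None, None, 3))
        x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x_in)
        x = tf.keras.layers.Conv2D(64, 3, padding="same", activation="relu")(x)
        delta = tf.keras.layers.Conv2D(3, 3, padding="same")(x)
        return tf.keras.Model(x_in, tf.keras.layers.Add()([x_in, delta]))

    def make_training_pair(original):
        # Simulate the codec: JPEG-encode/decode a single HxWx3 image in [0, 1].
        degraded = tf.image.adjust_jpeg_quality(original, 20)
        return degraded, original

    model = build_restoration_net()
    model.compile(optimizer="adam", loss="mse")
    # The network only ever sees the decoded pixels, which is exactly the
    # limitation mentioned above: texture information the codec threw away
    # cannot be recovered without side information.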


>Currently, this runs at approximately 0.7 megapixel/sec on a GPU with unoptimized TensorFlow.

Forgive me, is that encoding or decoding? And what's the size of the library required to decode it? ( Edit: Looks like it's answered below )

Only an hour ago I was looking at some pictures encoded with the VVC / H.266 reference encoder; it was better than BPG / HEVC, but HiFiC still beats it by quite a bit.

This whole thing blows my mind about the limits I thought we had hit in image compression.

Maybe I should take back my words about JPEG XL being the future. This seems revolutionary.


(coauthor here) The 0.7 megapixels/sec is the speed of PNG decode (to get the input) + encoding + decoding + PNG encoding (to get output we can visualize in a browser).

Thanks for your kind comment!


JPEG XL and HiFiC are developed in the same team. JPEG XL is just positioned for slightly less compute than HiFiC.

Have you compared the result with latest image compression formats, such as AVIF[0]?

[0]: https://aomediacodec.github.io/av1-avif/


We haven't specifically compared to AVIF, which as far as we know is still under development. We'd be happy to compare, but it's unlikely that we'd learn much from it. As far as we know, AVIF is better than HEVC by <100%, but we're comparing against HEVC at 300% of the bitrate.

Of course, we'd be happy to add any additional images from other codecs if they're available.


I would add JPEG-XL in addition if you're looking for suggestions for other codecs to compare to. It's very competitive with AV1 and beats it, in my opinion, at higher bitrates.

Admittedly, you're not likely to learn much from this that is useful for your research, but most of the interest from people clicking on this is probably wanting to see the latest developments in image compression.


These are truly amazing results. Looking closely at the results vs. the original, it appears that many of the details are very different at almost a noise level. Does the perceptual evaluation allow for these similar-looking but completely different noise details?


(coauthor here) We used an adversarial loss in addition to a perceptual loss and MSE. None of these work super-well when the others are not used.

The adversarial loss "learns" what is a compressed image and tries to make the decoder go away from such outputs.

The perceptual loss (LPIPS) is not so sensitive to pure noise and allows for it, but is sensitive to texture details.

MSE tries to get the rough shape right.

We also asked users in a study to tell us which images they preferred when having access to the original. Most prefer the added details even if they're not exactly right.
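
For readers who want to see how these three terms fit together, here is a hedged sketch of a combined generator/decoder objective. The weights, the lpips_fn callable, and the omission of the rate term are assumptions for illustration, not the released training code:

    import tensorflow as tf

    def generator_loss(x, x_hat, disc_logits_fake, lpips_fn,
                       w_mse=1.0, w_lpips=1.0, w_adv=0.1):
        # MSE keeps the rough shape right.
        mse = tf.reduce_mean(tf.square(x - x_hat))
        # LPIPS-style perceptual loss is sensitive to texture, tolerant of noise.
        perceptual = tf.reduce_mean(lpips_fn(x, x_hat))
        # Non-saturating adversarial term: push reconstructions toward "real"
        # according to the discriminator.
        adv = tf.reduce_mean(tf.nn.sigmoid_cross_entropy_with_logits(
            labels=tf.ones_like(disc_logits_fake), logits=disc_logits_fake))
        return w_mse * mse + w_lpips * perceptual + w_adv * adv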


> The adversarial loss "learns" what is a compressed image and tries to make the decoder go away from such outputs.

Could you expand on this point


The idea is that any distortion loss imposes specific artifacts. For example, MSE tends to blur outputs, CNN-feature-based losses (VGG, LPIPS) tend to produce gridding patterns or also blur. Now, when the discriminator network sees these artifacts, those artifacts very obviously distinguish the reconstructions from the input, and thus the gradients from the discriminator guide the optimization away from these artifacts. Let me know if this helps!
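
As a complementary sketch of the discriminator side (my own simplification, omitting details such as conditioning): it is trained to label originals as real and reconstructions as fake, so any systematic artifact like blur or gridding becomes an easy giveaway, and the resulting gradients push the decoder away from producing it.

    import tensorflow as tf

    def discriminator_step(disc, disc_opt, x_real, x_fake):
        # Train the critic to separate originals from reconstructions.
        with tf.GradientTape() as tape:
            logits_real = disc(x_real, training=True)
            logits_fake = disc(x_fake, training=True)
            loss_real = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=tf.ones_like(logits_real), logits=logits_real)
            loss_fake = tf.nn.sigmoid_cross_entropy_with_logits(
                labels=tf.zeros_like(logits_fake), logits=logits_fake)
            loss = tf.reduce_mean(loss_real) + tf.reduce_mean(loss_fake)
        grads = tape.gradient(loss, disc.trainable_variables)
        disc_opt.apply_gradients(zip(grads, disc.trainable_variables))
        return loss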


Yeah, it seems like the HiFiC version has the same large-scale structure, but the details are sort of "regenerated": e.g. it's as if the image format says "there's some hair here", and the decoder just paints in hair.


That was exactly the goal of the project! Basically, if the size doesn't allow for detail, we need to "hallucinate" it. This of course is not necessary if there's enough bandwidth available for transmission or enough storage.

On the other hand, in our paper we show that some generated detail can help even at higher bitrates.


Hi, could you give a general comparison vs CLIC workshop[1] submissions at CVPR? It seems to be the same end goal and CLIC was even sponsored by Google so I am a little puzzled as to why I didn't see your work there.

EDIT: You mention CLIC2020 in your paper so I'm a bit more confused.

Another question would be: what are your thoughts on the future of learned image compression and ideas for future work?

[1] https://www.compression.cc/


Looks really interesting, thanks for sharing!

One silly question, the red door example image... why is the red so saturated in all the compressed versions vs the original?

edit: Ah, seems to be some kind of display issue in Firefox. When saving the images and comparing, the saturation level is roughly equal.


Interesting, I did not notice this before. This is likely due to the original having some color profile attached. Not sure why this only renders differently in Firefox.


The demo images used on hific.github.io appear to be part of the datasets used to train the system. In another comment you say the trained model is 726MB. The combined size of the training datasets appear to be about 8GB zipped. Is the currently trained model usable on images that are not part of the training datasets? with output of similar quality and size?


Hi -- no, the images on the demo page are not part of the training set; they are only used to evaluate the method. Arbitrary images of natural scenes will look similar at these bitrates. We'll release trained models and a colab soon!


Thank you, and my apologies. I should have read the pdf more carefully, where the distinction between the training data and evaluation datasets is described:

> Our training set consists of a large set of high-resolution images collected from the Internet, which we downscale to a random size between 500 and 1000 pixels, and then randomly crop to 256×256. We evaluate on three diverse benchmark datasets collected independently of our training set to demonstrate that our method generalizes beyond the training distribution: the widely used Kodak [22] dataset (24 images), as well as the CLIC2020 [43] test set (428 images), and the DIV2K [2] validation set (100 images).
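
For concreteness, here is a minimal sketch of that preprocessing step (the exact resizing convention -- scaling the shorter side to the random target -- is my assumption):

    import tensorflow as tf

    def preprocess(image):
        # Downscale so the shorter side is a random size in [500, 1000] pixels...
        target = tf.random.uniform([], 500, 1001, dtype=tf.int32)
        hw = tf.shape(image)[:2]
        scale = tf.cast(target, tf.float32) / tf.cast(tf.reduce_min(hw), tf.float32)
        new_hw = tf.cast(tf.round(tf.cast(hw, tf.float32) * scale), tf.int32)
        image = tf.image.resize(image, new_hw)
        # ...then take a random 256x256 crop.
        return tf.image.random_crop(image, [256, 256, 3])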


This is impressive; but isn't this also susceptible to the type of bug that Xerox copiers got hit with many years ago?

https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...


Small text is indeed one of the fail cases, and more research needs to be done here (see also my other comment [1]). We mention this issue in the supplementary materials [2], and you can check out an example here at [3]

[1] https://news.ycombinator.com/item?id=23654161

[2] https://storage.googleapis.com/hific/data/appendixb/appendix...

[3] https://storage.googleapis.com/hific/userstudy/visualize.htm...


This might be a good opportunity to lead to research on a "good faith watermark" for GAN-compressed images that may include hallucinated details.


One of the things we discussed to address this is to have the ability to: a) turn off detail hallucination completely given the same bitstream; and b) store the median/maximum absolute error across the image

(b) should allow the user to determine whether the image is suitable for their use-case.
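
A tiny illustration of what (b) could look like on the encoder side (the field names are purely hypothetical):

    import numpy as np

    def fidelity_metadata(original, reconstruction):
        # Per-pixel absolute error between the source and the GAN reconstruction.
        err = np.abs(original.astype(np.float32) - reconstruction.astype(np.float32))
        return {
            "median_abs_error": float(np.median(err)),
            "max_abs_error": float(np.max(err)),
        }

    # A consumer (e.g. a forensic workflow) could then refuse images whose
    # max_abs_error exceeds whatever tolerance their use case allows.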


Is it possible to put the decoder into a feedback loop and search among multiple possible encodings that minimize the residual errors? Similar to trellis optimization in video codecs. http://akuvian.org/src/x264/trellis.txt
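
A rough sketch of what that search could look like (encode, decode and perturb are placeholders for whatever codec API is available; this is just the idea being proposed, not something HiFiC does):

    import numpy as np

    def search_encoding(x, encode, decode, perturb, n_candidates=8):
        # Try the default latents plus a few perturbed variants and keep the one
        # whose decoded output has the smallest residual against the source.
        base = encode(x)
        best, best_err = base, np.mean((decode(base) - x) ** 2)
        for seed in range(1, n_candidates):
            candidate = perturb(base, seed=seed)
            err = np.mean((decode(candidate) - x) ** 2)
            if err < best_err:
                best, best_err = candidate, err
        return best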


It would be possible - but by minimizing residual errors you end up in a similar regime as when minimizing MSE again, likely making reconstructions blurry!


You could use an OCR encoder and concatenate its output to the feature vector. Great work by the way!


Agreed! Alternatively, some semantic segmentation network could be used and a masked MSE loss. In this paper, we focussed on showing the crazy potential of GANs for compression - let's see what future work brings.


Generally, lossy compression methods have a 'knee' below which the perceived quality rapidly drops off. The default jpeg examples here are well below that knee.

Usually, comparing below the knee isn't very useful, except to help better understand the sort of artefacts one might also want to look for in higher-rate images.

It would be interesting to see some examples from a lower rate HiFiC specifically for the purpose of bringing out some artefacts to help inform comparisons at higher rates.


> The default jpeg examples here are well below that knee.

The examples are chosen as multiples of the file size: 1x, 2x, ...


I used the word 'default' for a reason! :)


The HiFiC model we show is already the low model :) We show JPEG and BPG at the same rate to visualize how low the bitrate of our model actually is. And for JPEG and BPG we show 1x, 2x, and so on to visualize how many more bits the previous methods need to look similar visually.


Sure, but it isn't low rate enough to produce the level of gross artefacts needed to train the viewer to recognize faults in the images.

E.g. after looking at those jpeg images you're able to flip to much higher rate jpegs and notice blocking artefacts that a completely unprepared subject would fail to notice.

In my work on lossy codecs I found that having trained listeners was critical to detecting artifacts that would pass on quick inspection but would become annoying after prolonged exposure.

From a marketing-fairness point of view, since the only codec you evaluate 'below the knee' is jpeg, it risks exaggerating the difference. It would be just about as accurate to say that jpeg can't achieve those low rates -- since no one will use jpeg at a point where it's falling apart. This work is impressive enough that it doesn't need any (accidental) exaggeration.

I think it's best to either have all comparison points above the point where the codecs fall apart, or have a below-knee example for all so that it's clear how they fail and where they fail... rather than asking users to compare the gigantic difference between a failed and non-failing output.


In addition to my sibling comment, I would like to add:

> since no one will use jpeg at a point where it's falling apart.

Hence we also put it at bitrates people actually use :) And then the point should be that HiFiC uses much fewer bits.

> below-knee for HiFiC

We actually did not train lower than this model, HiFiC^lo. That it works so well was somewhat surprising to us also! From other GAN literature, it seems reasonable to expect the "below-knee" point for this kind of approach to still look realistic, but not faithful anymore. I.e., without the original, it may be hard to tell that it's fake.


> expect the "below-knee" point for this kind of approach to still look realistic, but not faithful anymore.

If so, that might be kinda cool on its own.

(and perhaps also useful for anti-forensic purposes, e.g. use it as a de-noising mechanism that makes it more difficult to identify the source camera)


> asking users to compare the gigantic difference between a failed and non-failing output

It isn't. It's asking users to compare what happens at a particular bitrate as a demonstration that it works (well!) at bitrates significantly lower than what anything else can safely achieve. To get comparable quality you need several _times_ the number of bytes with another codec.

While I agree that it would be _interesting_ to see what happens below HiFiC's knee, the location of the knee for individual codecs is irrelevant to the comparison that really matters, because users just want high quality without high size or low size without low quality. And this clearly produces the best quality at the least size by a mile.

I think they did a fantastic job of showing both how each codec looks at the same size and also how many bytes it takes for other codecs to reach the same quality. Showing what it looks like when HiFiC degrades if you go to an even lower than already extremely low bitrate would be fun and neat for comparing HiFiC's bitrates, but has little bearing on how it compares to other codecs.


This is really impressive. But it raises some questions for me.

What size library would be required to decode these type of images?

And would the decoding library be updated on a regular basis? Would the image change when the decoding library is updated? Would images be tagged with a version of the library when encoded (HiFiC v1.0.2?)


Thanks for the kind words!

> What size library would be required to decode these type of images?

The model is 726MB. Keep in mind that this is a research project -- further research needs to be done on how to make it faster, now that we know that this kind of result is possible!

> And would the decoding library be updated on a regular basis?

Only if you want even better reconstructions!

> Would the image change when the decoding library is updated? Would images be tagged with a version of the library when encoded (HiFiC v1.0.2?)

Yes, some header would have to be added.


I'm very curious how such a thing could be standardized as an image format. With classic image formats there's an expectation that one can write a spec and an independent implementation from scratch. "Take my X-MB large pre-trained model" is unprecedented.

Would it still be competitive with H.265 if the model was 10MB or 50MB in size? 0.7GB may be difficult for widespread adoption.


Independently of this work, we have models which are competitive with HEVC while being significantly smaller (this is from previous work). They will not look nearly as good as what you see in the website demo, but they're still better.

I don't have any such model handy but perhaps it's 10x-20x smaller.

We don't claim that this (or even the previous work) is the way to go forward for images, but we hope to incentivize more researchers to look in this direction so that we can figure out how to deploy these kinds of methods since the results they produce are very compelling.


And I also have questions about decompression speed and memory requirements.


I'm very impressed, I was waiting for an image codec that combines something like VGG loss + GANs! (Another thing that I'm waiting for is a neural JPEG decoder with a GAN, which would be backwards compatible with all the pictures already out there!) Now we need to get some massive standardisation process going to make this more practical and perfect it, just like it was done for JPEG in the old days! (And then do it for video and audio too!)

What happens if you compress a noisy image? Does the compression denoise the image?


On the standardization issue: the advantage of such a method that we presented is that as long as there exists a standard for model specification, we can encode every image with an arbitrary computational graph that can be linked from the container.

Imagine being able to have domain-specific models -- say we could have a high-accuracy/precision model for medical images (super-close to lossless), and one for low-bandwidth applications where detail generation is paramount. Also imagine having a program written today (assuming the standard is out), and it being able to decode images created with a model invented 10 years from today, doing things that were not even thought possible when the program was originally written. This should be possible because most of the low-level building blocks (like convolution and other mathematical operations) are how we define new models!
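
Purely as an illustration of the "model linked from the container" idea (none of these field names come from any real standard), the bitstream header could carry an identifier and hash of the computational graph needed for decoding:

    import struct

    def pack_header(model_id: str, model_sha256: bytes, payload: bytes) -> bytes:
        # Hypothetical layout: magic, model-id length, model id, 32-byte graph
        # hash, then the entropy-coded payload. A decoder written today could
        # fetch the referenced graph and execute it with generic building blocks.
        model_bytes = model_id.encode("utf-8")
        assert len(model_sha256) == 32
        return (struct.pack("<4sB", b"NIC0", len(model_bytes))
                + model_bytes + model_sha256 + payload)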

On noise: I'll let my coauthors find some links to noisy images to see what happens when you process those.


Absolutely! Being able to improve a decoder for an existing encoder (and vice versa) is a great advantage!


Thanks! Noise is actually preserved really well, and is one of the strengths of using a GAN. Check out this visual example from the "All Evaluation Images" link: https://storage.googleapis.com/hific/clic2020/visualize.html...


... wow


This is amazing. Not only does it completely blow JPEG and BPG out of the water in efficiency, but it makes some images look better than the original! It seems to reduce local contrast which works well for skin (photos 1, 2 and 5), and what it does for repeating patterns is quite pleasant to the eyes (1 and 4).

The only downside I see is the introduction of artifacts like you can see on the right-side face of the clock tower.


Thanks for the kind words!

We hope that the algorithm we presented is a step in the right direction, and we acknowledge that there's more work to be done! Just like with any algorithm there are ways in which it can be improved. Please check out the supplementary materials [1], to find some examples that can definitely be improved: small faces and small text. Overall, as you say, the algorithm does a great job though!

[1] https://storage.googleapis.com/hific/data/appendixb/appendix...


> it makes some images look better than the original! It seems to reduce local contrast which works well for skin (photos 1, 2 and 5)

It's better than the lossy JPEGs but I strongly disagree it's better than the original - or even close. Examining image 1 closely, the loss of contrast and the regularising of the texture of the skin make the HiFIC image feel fake when I look at it. I downloaded and opened the images at random on my computer to get rid of the bias of "the right one is the original" and it was the original I preferred each time.

Regardless, it's impressive work and I look forward to future developments!


How about image 2? On both of my monitors, the colors in the regenerated image have a striking subjective advantage over the original. I'd be surprised if 95%+ of observers didn't favor the compressed version of image 2.


Checked on 2 monitors here (one color calibrated) and I see the HiFiC version of image 2 has slightly more muted colours than the original.

Unclear what you're seeing to get "striking subjective advantage" but it's not visible here.


That effect is very close to what a photographer would aim for with manual editing. It is not about fidelity but that blemish-free look you'll find in any magazine. The authors also mentioned that in their research most users preferred the compressed one!


To me the effect looks like when heavy chroma noise filtering is applied, leaving luminance noise. This is particularly noticeable in image 2 all over the face; it's nothing like the blemish-free look we strive for when post-processing portraits or fashion photography.


In addition to being an amazing result, props for including a variety of different skin tones in the hero sample box. It's extremely good to see that thought was put into reducing any potential bias for this compression algorithm; bias has unfortunately been an issue in the history of image capture [1]

[1] https://www.nytimes.com/2019/04/25/lens/sarah-lewis-racial-b...


The obvious question when looking at the comparison is: what kind of JPEG did they pick, and was it fair? It turns out that mozjpeg, which is pretty much state of the art, can squeeze an extra 5% out of shades_jpg_4x_0.807.jpg.

Your comparison would be even more impressive if you produced the best jpegs you can at those sizes.


Thank you for your suggestion. We use libjpeg for the Demo. Keep in mind that we compare against JPEG files which are at around 100%-400% of the size of the proposed method, so 5% would not really add much as far as we can tell.


Finally someone's instantiated something like Vernor Vinge's "evocation" idea, quoted here:

https://www.danvk.org/wp/2014-02-10/vernor-vinge-on-video-ca...


If you are into this you may also like to see a similar model facebook used for foveated rendering (Nov 2019): https://ai.facebook.com/blog/deepfovea-using-deep-learning-f...


A number of comments here mention the regularity of the noise in the Lo variant as producing a slightly artificial appearance (zoom on the woman's forehead in picture 1). Would it improve anything to look for swathes where the dominant texture is noticeably uniform noise and mix it up a bit?


Not sure how that would work out in practice. We kind of rushed the experiments in any case -- maybe we could let the methods train for longer. In the end the adversary might learn any regular patterns that appear, therefore forcing the generator to come up with something that cannot be detected easily.

(We trained everything to 1M steps. Perhaps letting it train to 2M would solve it.)


Super exciting! Fantastic work and cool demo!

I hope this can be 'stabilized' to allow video compression. If you run e.g. style-transfer networks on video, the artifacts in the output frames aren't stable; they jump around from frame to frame.

(Not sure if you can decode this at 24 fps on normal hardware, but still...)


Thanks! And agreed -- doing this in real time for high-resolution videos still needs quite some research though.


Shipping the network in device ROMs seems like a pretty far-off thing at this point, but I wonder if there could be something in the nearer term around time-shifting bandwidth usage, eg:

- Media appliance under your TV downloads a 5GB updated network during the night so that you can use smaller-encoded streams of arbitrary content in the evening.

- Spotify maintains a network on your phone that is updated while you're at home so that you can stream audio on the road with minimal data usage.

- Your car has a network in it that is periodically updated over wifi and allows you to receive streetview images over a manufacturer-supplied M2M connection.


Amazing work! It's something we knew would come at some point, but I didn't expect it to be that good so early!

Since some of the authors are in this thread answering questions, I have one: I wonder if such a project requires Google-scale infrastructure, or if it's something that can easily be replicated. For instance, how big is the training set? And how much compute was necessary to train the model?


Hi! It depends on how far you want to go. For this project, we did a lot of exploration, because we had Google-scale infrastructure. Replicating all the exploration would need a lot of GPUs (in the 100s); replicating all the experiments that actually went into the paper, maybe a few dozen. Training something similar to what's in the demo with the code we will release takes 1 V100 :)

We can't release the internal training set, but expect a dataset of a few hundred thousand images (e.g., openimages) to be sufficient, maybe even less (AFAIK this has not been explored in a controlled setting).


Thanks a lot!


Hmm, it's too bad that none of the methods (including the novel one) accurately represent her pupils. In the original image, you can tell which direction she's looking, and see how dilated her pupils are.

What I feel could be a high priority is machine learning for rate distortion. It would be really good to have higher rate on the pupils.


FWIW, the network uses the hyperprior probability estimation network (citation [6] in the paper), which already adapts the rate depending on the image region.


Fantastic.

When can we install the 1TiB video player/codec/plugin, that'll allow us to stream 4K movies on a 1Mbps connection?


Hopefully soon! However, in addition to research needed to make this work for video, it also needs research to make it smaller and/or to put it in silicon (see my other comments). The network shown is "only" 726MB BTW :)


Has this been extended to video (beyond compressing each frame separately)?


This is the first time that GANs have been used to obtain high-fidelity reconstructions at useful bitrates. We haven't tried extending it to video.



