It's really fascinating to zoom in and compare the compressed version with the original -- the "general idea" is all there and the quality is roughly the same, but strands of hair move to different positions, skin pores are in a completely different pattern, etc. -- it's all regenerated to look "right", but at the lowest level the texture isn't "real".
Which is so fascinatingly different from JPEG or other lossy schemes -- those compression artifacts are obviously exactly that; you'd never confuse them for real detail. Here, though, you could never guess which detail is real and which is reconstructed.
My hunch is that this won't actually take off at first for images, because images on the internet don't use that much of our bandwidth anyway, and the performance and model-storage requirements here are huge.
BUT -- if the same idea can be applied to 1080p video with a temporal dimension, so that even if reconstructed textures are "false" they persist across frames, then suddenly the economics here would start to make sense: dedicated chips for encoding and decoding, with a built-in 10 GB model in ROM, trained on a decent proportion of every movie and TV show ever made... imagine if 2-hour movies were 200 MB instead of 2 GB?
(And once we have chips for video, they could be trivially re-used for still images, much like Apple reuses h.265 for HEIC.)
But anywhere the purpose of the output is social/entertainment, it seems this would be freely adopted: profile pictures, marketing images, videos and movies. It could very much mean, however, that a photo on someone's Facebook profile might not be admissible as evidence if used to identify an exact match with someone, rather than just "someone who looks similar".
On the Internet as a whole, yes. But for web pages, images are still by far the largest payload -- anywhere between 60% and 90% of total content size.
Think of it in the context of latency: images could appear much quicker, or even instantly (embedded inside the HTML), because of the size reduction -- especially on the first screen.
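For what it's worth, "embedding inside the HTML" usually means a base64 data URI, so a small image costs zero extra round-trips. A minimal Python sketch of the idea (the hero.jpg file name is made up):

  import base64

  def to_data_uri(path, mime="image/jpeg"):
      # Inline the image bytes as a base64 data URI so the image ships
      # inside the HTML document itself, with no extra request.
      with open(path, "rb") as f:
          payload = base64.b64encode(f.read()).decode("ascii")
      return f"data:{mime};base64,{payload}"

  html = f'<img src="{to_data_uri("hero.jpg")}" alt="hero">'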
A good point that has been made before is that images are often by far the biggest part of a web page, but most sites still don't even do basic optimisations like resizing them to the viewing size, using WebP, using pngcrush, etc.
That said, this would be cool for making really well optimised sites.
I’m not sure if it’s because the first group is just used to life being unpredictable and don’t realize things could be better, or if they really don’t care.
Nvidia shows what is possible with DLSS 2 as well.
I'm very curious about what we will be able to do with voice as well. Perhaps you will sync voice profiles from your family/friends for better compression or better reconstruction.
We will also see models that achieve higher detail by incorporating information across all frames and enhancing existing material. We already see this in gaming.
The first thought I had, after looking at some of the examples, is that it should be fairly trivial to train or finetune the compression network to hallucinate missing detail with "BPG-style" or "JPG-style" smoothing artifacts by feeding compressed images as "real" samples to the adversarial network during training/finetuning.
I wonder if doing that would make it possible to achieve even greater compression with loss of detail that is indistinguishable from traditional compression methods and therefore still acceptable to the human eye.
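If anyone wants to try that, here's a minimal PyTorch sketch of the trick (purely illustrative, not from the paper; disc, opt, and the two batches are placeholders you'd supply):

  import torch
  import torch.nn.functional as F

  def discriminator_step(disc, opt, bpg_batch, recon_batch):
      # BPG/JPEG-compressed images play the "real" class, so the generator
      # is rewarded for hallucinating detail that looks like traditional
      # smoothing artifacts instead of sharp synthetic texture.
      opt.zero_grad()
      real_logits = disc(bpg_batch)             # traditionally compressed = "real"
      fake_logits = disc(recon_batch.detach())  # network reconstructions = "fake"
      loss = (
          F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
          + F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
      )
      loss.backward()
      opt.step()
      return loss.item()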
It definitely is a valid approach but the limitation is that if the network needs some texture-specific information that cannot be extracted from the decoded pixels, it can't really do much.
There were approaches where such information was also sent on the side, which yielded better results, of course.
The field is wide open and each approach has its own challenges (e.g., you may need to train one network per quantization level if you're going to do restoration).
Forgive me, is that for encoding or decoding? And what size of library is required to decode it? (Edit: looks like it's answered below.)
Only an hour ago I was looking at some pictures encoded with the VVC / H.266 reference encoder. It was better than BPG / HEVC, but this HiFiC still beats it by quite a bit.
This whole thing blows my mind about the limits I thought we had in image compression.
Maybe I should take back my words about JPEG XL being the future. This seems revolutionary.
Thanks for your kind comment!
Of course, we'd be happy to add any additional images from other codecs if they're available.
Admittedly, you're not likely to learn much from this that is useful for your research, but most of the people clicking on this probably just want to see the latest developments in image compression.
The adversarial loss "learns" what a compressed image looks like and pushes the decoder away from such outputs.
The perceptual loss (LPIPS) is not very sensitive to pure noise and allows for it, but is sensitive to texture detail.
MSE tries to get the rough shape right.
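Roughly, the three terms combine like this (a hedged PyTorch sketch; the weights and the non-saturating GAN formulation are illustrative assumptions, not the paper's exact configuration):

  import torch.nn.functional as F
  import lpips  # pip install lpips

  lpips_fn = lpips.LPIPS(net="vgg")  # expects inputs scaled to [-1, 1]

  def generator_loss(x, x_hat, fake_logits, k_mse=1.0, k_lpips=1.0, k_adv=0.01):
      mse = F.mse_loss(x_hat, x)              # gets the rough shape right
      perceptual = lpips_fn(x, x_hat).mean()  # allows noise, punishes wrong texture
      adv = F.softplus(-fake_logits).mean()   # pushes away from "compressed-looking" outputs
      return k_mse * mse + k_lpips * perceptual + k_adv * adv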
We also asked users in a study to tell us which images they preferred while having access to the original. Most preferred the added details, even if they're not exactly right.
Could you expand on this point?
On the other hand, in our paper we show that some generated detail can help even at higher bitrates.
EDIT: You mention CLIC2020 in your paper so I'm a bit more confused.
Another question would be: what are your thoughts on the future of learned image compression, and ideas for future work?
One silly question, the red door example image... why is the red so saturated in all the compressed versions vs the original?
edit: Ah, it seems to be some kind of display issue in Firefox. When saving the images and comparing them, the saturation levels are roughly equal.
> Our training set consists of a large set of high-resolution images collected from the Internet, which we downscale to a random size between 500 and 1000 pixels, and then randomly crop to 256×256. We evaluate on three diverse benchmark datasets collected independently of our training set to demonstrate that our method generalizes beyond the training distribution: the widely used Kodak dataset (24 images), as well as the CLIC2020 testset (428 images), and the DIV2K validation set (100 images).
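In code, that preprocessing could look something like this torchvision sketch (the interpolation choice and resizing by the shorter side are my assumptions; the quoted text doesn't specify):

  import random
  import torchvision.transforms.functional as TF
  from torchvision import transforms

  crop = transforms.RandomCrop(256)

  def train_example(img):
      # Downscale so the shorter side is a random size in [500, 1000] px,
      # then take a random 256x256 crop from the result.
      img = TF.resize(img, random.randint(500, 1000))
      return crop(img)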
(b) should allow the user to determine whether the image is suitable for their use-case.
Usually comparing below the knee isn't very useful, except to help better understand the sort of artefacts one might also want to look for in higher-rate images.
It would be interesting to see some examples from a lower rate HiFiC specifically for the purpose of bringing out some artefacts to help inform comparisons at higher rates.
The examples are chosen as multiples of the file size: 1x, 2x, ...
E.g., after looking at those JPEG images you're able to flip to much higher-rate JPEGs and notice blocking artefacts that a completely unprepared subject would fail to notice.
In my work on lossy codecs I found that having trained listeners was critical to detecting artifacts that would pass on quick inspection but would become annoying after prolonged exposure.
From a marketing-fairness point of view, since the only codec you evaluate "below the knee" is JPEG, it risks exaggerating the difference. It would be just about as accurate to say that JPEG can't achieve those low rates, since no one will use JPEG at a point where it's falling apart. This work is impressive enough that it doesn't need any (accidental) exaggeration.
I think it's best to either have all comparison points above the point where the codecs fall apart, or have a below-knee example for all so that it's clear how they fail and where they fail... rather than asking users to compare the gigantic difference between a failed and non-failing output.
> since no one will use JPEG at a point where it's falling apart.
Hence we also put it at bitrates people actually use :) And then the point should be that HiFiC uses much fewer bits.
> below-knee for HiFiC
We actually did not train lower than this model, HiFiC^lo. That it works so well was somewhat surprising to us also! From other GAN literature, it seems reasonable to expect the "below-knee" point for this kind of approach to still look realistic, but not faithful anymore. I.e., without the original, it may be hard to tell that it's fake.
If so, that might be kinda cool on its own.
(and perhaps also useful for anti-forensic purposes, e.g. use it as a de-noising mechanism that makes it more difficult to identify the source camera)
It isn't. It's asking users to compare what happens at a particular bitrate as a demonstration that it works (well!) at bitrates significantly lower than what anything else can safely achieve. To get comparable quality you need several _times_ the number of bytes with another codec.
While I agree that it would be _interesting_ to see what happens below HiFiC's knee, the location of the knee for individual codecs is irrelevant to the comparison that really matters, because users just want high quality without high size or low size without low quality. And this clearly produces the best quality at the least size by a mile.
I think they did a fantastic job of showing both how each codec looks at the same size and also how many bytes it takes for other codecs to reach the same quality. Showing what it looks like when HiFiC degrades, if you go even lower than the already extremely low bitrate, would be fun and neat for comparing HiFiC's bitrates, but has little bearing on how it compares to other codecs.
What size library would be required to decode these types of images?
And would the decoding library be updated on a regular basis? Would the image change when the decoding library is updated? Would images be tagged with a version of the library when encoded (HiFiC v1.0.2)?
> What size library would be required to decode these types of images?
The model is 726MB. Keep in mind that this is a research project -- further research needs to be done on how to make this faster, now that we know that these kinds of results are possible!
> And would the decoding library be updated on a regular basis?
Only if you want even better reconstructions!
> Would the image change when the decoding library is updated? Would images be tagged with a version of the library when encoded (HiFiC v1.0.2)?
Yes, some header would have to be added.
Would it still be competitive with H.265 if the model was 10MB or 50MB in size? 0.7GB may be difficult for widespread adoption.
I don't have any such model handy but perhaps it's 10x-20x smaller.
We don't claim that this (or even the previous work) is the way to go forward for images, but we hope to incentivize more researchers to look in this direction so that we can figure out how to deploy these kinds of methods since the results they produce are very compelling.
What happens if you compress a noisy image? Does the compression denoise the image?
Imagine being able to have domain-specific models -- say, a high-accuracy/precision model for medical images (super-close to lossless), and one for low-bandwidth applications where detail generation is paramount. Also imagine having a program written today (assuming the standard is out) that can decode images created with a model invented 10 years from now, doing things that weren't even thought possible when the program was originally written. This should be possible because new models are defined from the same low-level building blocks (like convolutions and other mathematical operations)!
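To make that forward-compatibility point concrete: if the decoder network were shipped as a graph of standard ops (say, ONNX), a player written today could execute a model trained later. A sketch, where both file names and the "latents" input name are hypothetical:

  import numpy as np
  import onnxruntime as ort

  # The runtime only needs to know the low-level ops (conv, add, ...);
  # the model file itself can come from a future training run.
  sess = ort.InferenceSession("hific_decoder_v3.onnx",
                              providers=["CPUExecutionProvider"])
  latents = np.load("image_payload.npy")
  (pixels,) = sess.run(None, {"latents": latents})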
On noise: I'll let my coauthors find some links to noisy images to see what happens when you process those.
The only downside I see is the introduction of artifacts like you can see on the right-side face of the clock tower.
We hope that the algorithm we presented is a step in the right direction, and we acknowledge that there's more work to be done! Just like any algorithm, there are ways in which it can be improved. Please check out the supplementary materials to find some examples that can definitely be improved: small faces and small text. Overall, as you say, the algorithm does a great job though!
It's better than the lossy JPEGs, but I strongly disagree that it's better than the original -- or even close. Examining image 1 closely, the loss of contrast and the regularising of the texture of the skin make the HiFiC image feel fake when I look at it. I downloaded and opened the images at random on my computer to remove the bias of "the right one is the original", and it was the original I preferred each time.
Regardless, it's impressive work and I look forward to future developments!
It's unclear what you're seeing to get a "striking subjective advantage", but it's not visible here.
Your comparison would be even more impressive if you produced the best jpegs you can at those sizes.
(We trained everything to 1M steps. Perhaps letting it train to 2M would solve it.)
I hope this can be 'stabilized' to allow video compression. If you run e.g. style-transfer networks on video, the artifacts in the output frames aren't stable; they jump around from frame to frame.
(Not sure if you can decode this at 24 fps on normal hardware, but still...)
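One common trick from the video style-transfer literature is a flow-warped temporal consistency loss. A sketch, assuming the optical-flow warping and the visibility/occlusion mask are computed elsewhere:

  import torch

  def temporal_loss(cur_recon, prev_recon_warped, visibility_mask):
      # Warp the previous frame's reconstruction into the current frame
      # (done upstream, via optical flow), then require visible pixels to
      # agree so hallucinated texture stays put across frames.
      diff = (cur_recon - prev_recon_warped) ** 2
      return (visibility_mask * diff).mean()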
- Media appliance under your TV downloads a 5GB updated network during the night so that you can use smaller-encoded streams of arbitrary content in the evening.
- Spotify maintains a network on your phone that is updated while you're at home so that you can stream audio on the road with minimal data usage.
- Your car has a network in it that is periodically updated over wifi and allows you to receive streetview images over a manufacturer-supplied M2M connection.
Since some of the authors are in this thread answering questions, I have one: I wonder if such a project requires Google-scale infrastructure, or if it's something that can easily be replicated. For instance, how big is the training set? And how much compute was necessary to train the model?
We can't release the internal training set, but expect a dataset of a few hundred thousand images (e.g., openimages) to be sufficient, maybe even less (AFAIK this has not been explored in a controlled setting).
What I feel could be a high priority is machine learning for rate-distortion optimization. It would be really good to have a higher rate on the pupils.
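One hedged sketch of what that could look like (not from the paper): spatially weight the distortion term so regions of interest, like pupils found by a landmark detector, effectively receive more bits:

  import torch

  def roi_weighted_mse(x, x_hat, roi_mask, roi_weight=4.0):
      # roi_mask is 1.0 inside regions of interest and 0.0 elsewhere;
      # upweighting distortion there steers the rate-distortion trade-off
      # toward spending more bits on those regions.
      weights = 1.0 + (roi_weight - 1.0) * roi_mask
      return (weights * (x - x_hat) ** 2).mean()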
When can we install the 1 TiB video player/codec/plugin that'll allow us to stream 4K movies on a 1 Mbps connection?