Generative Adversarial Networks for Extreme Learned Image Compression (ethz.ch)
105 points by relate on April 11, 2018 | 31 comments



The picture is not compressed, it's hallucinated from a vague memory of the real thing, a mere dream. Cars vanish, buildings change wall structure, even the license plate receives fake text absent from the source material.

It's giant guesswork about what was there originally. Reminds me of Xerox scanners lying about scanned-in numbers: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...


All lossy compression algorithms hallucinate. That's the whole point: reducing image size by dropping some information and then hallucinating a plausible replacement to decompress.

The only difference is that this compression is better at hallucinating, so you don't get ringing artifacts or blocks, but some internally consistent alternate reality.

If you don't want to lose data you should not use lossy compression at all. JPEG can erase the distinction between digits as well.


Okay, but still, some kinds of changes are better than others. They should probably start testing for this in the visual perception tests: is lost information greyed out in a visible way? Are words and digits always fuzzed, or replaced?

Because it turns out that fuzziness and compression artifacts have a higher-level meaning: when you see them, you know something has been lost. That's an important (if inadvertent) signal. We need to make sure the artifacts don't go away.


I can kind of see regulations coming that require neurally compressed image or video data to carry a little Ⓝ on-screen graphic in one of the corners, in addition to (not necessarily perceivable) watermarks that make even small crops of the image identifiable as neurally generated. And that is probably the best case.

In the worst case (and more likely?), we are going to ban computational substrates large enough to perfectly forge important data altogether, because they will be too easy to misuse. We'd essentially go back to ~1960s electronics to have at least halfway functioning mechanisms for creating social trust, namely high-bandwidth personal interactions where every thought and every action has a high chance of leaving a trace in the real world and thus contributing to someone's reputation. No blockchain and no other technology can create nearly as much trust as that without being highly prone to misuse.


It could be a problem if it hallucinates the wrong license plate number at a crime scene. If all you want is a gigantic 8K resolution stock photo of a woman holding her baby in front of a laptop without devouring 10 MB of the user's data cap, it may be fine if the woman has a slightly different (but still highly detailed) hair style.


It would always decompress the same way, so for artistic purposes, if you preview it and it looks good, it is good.

But if you're using photography to look at things in the world, that's a whole different story.


If fuzziness and compression artifacts are a signal, it makes sense that eliminating them would reduce filesize.


There is a difference between dropping data and replacing it with fake data. Unreadable text is better than a scanned cheque with a different amount.


Sounds more or less like our human memory and image recall. Sometimes it's more accurate, but sometimes we make up details that were not there originally.


It's not as bad as you say. Sure, the positions of the individual leaves and of the grain of the concrete and of little puffs of the clouds are hallucinated, but most salient semantic features like the presence of a car or of a person are left untouched.

In other words, mostly only unimportant details are hallucinated, which is what we want.


The car behind the bus is gone, and the bus's license plate got some fake text. It's pretty bad as far as reproducing the original goes.


The trick is not so much about compression as about image generation. The trade-off is completely different from usual lossy algorithms: a highly compressed image might still retain high visual quality, but with completely different details and textures.

Kinda what would happen if you used a perfect painter with a blurry memory.


This is quite visible in their demonstration slider-image; the building in the back-right changes from a brick building with glass windows to a stucco-ish building with a bunch of exterior duct-work.


Also, the car behind the bus disappears. It looks like it's been photoshopped out. This is unexpected behavior from a compression algorithm. Users are conditioned to expect the quality to degrade uniformly across the whole image.


In fairness that detail isn't preserved in the other formats either. The new issue is merely the illusion of accuracy.


The car is visible in WebP.


You would have to be very very careful indeed if you wanted to use this for anything with safety implications.


This would have interesting implications if we accidentally trained biased networks (a provocative example: improperly changing people's skin color in images when decoding them).


I'd be curious to see how different levels of quantization affect the image. From the paper it looks like the quantization is applied in the latent feature space. I wonder if it has effects similar to the celebrity GANs we have seen, where interpolating in the latent space results in morphing from one face to another. It could be funny if compression didn't result in something blocky or distorted, but in objects being replaced with other objects that look similar to them.
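For intuition, quantizing in the latent space could look something like this (the codebook, latent shape, and number of centers are invented for illustration; this is not the paper's code):

    import numpy as np

    centers = np.linspace(-2.0, 2.0, num=6)          # a small, finite codebook of levels

    def quantize(z):
        # snap every latent value to its nearest center; only the center
        # indices need to be stored/entropy-coded
        idx = np.abs(z[..., None] - centers).argmin(axis=-1)
        return centers[idx]

    z = np.random.randn(4, 16, 16)                   # made-up latent feature map
    z_hat = quantize(z)
    print(np.unique(z_hat).size)                     # at most 6 distinct values

Presumably, the fewer the centers, the more the decoder has to make up, which is where I'd expect the object-replacement behavior to show up.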

This seems to be for static images, but it gets me wondering whether an RNN could be used to get better motion prediction than other current "hard coded" solutions.

Also, the more specific the domain, the better the compression, since it can specialize. I'm wondering about the practical applications of this. Do we have different baselines that can be used for different use cases?


It also means that you could get a meaningful average of two images by averaging their compressed forms (aka latent state z) and decoding, just like with the celebrities :)
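Something like this, with made-up stand-ins for the trained encoder/decoder (a real E and G are deep networks, and that is where the semantic blending would come from; these toys will only give you a blocky pixel average):

    import numpy as np

    def E(img):                  # hypothetical encoder: 8x8 block averages
        return img.reshape(8, 8, 8, 8).mean(axis=(1, 3))

    def G(z):                    # hypothetical decoder: nearest-neighbour upsample
        return np.kron(z, np.ones((8, 8)))

    img_a = np.random.rand(64, 64)
    img_b = np.random.rand(64, 64)
    z_mix = 0.5 * E(img_a) + 0.5 * E(img_b)   # average the compressed forms
    blend = G(z_mix)                          # decode the averaged latent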


I wonder how Pied Piper will respond to that. This could be a good idea for video compression that is "monothematic". Hmmm, I can think of a video industry that is monothematic...


Just waiting for this to show up in a video compression standard. With the right network it could be just as fast to decompress, though probably insanely slow to compress.


No, the tradeoff is actually the other way around. Encoding with a neural network can potentially be faster than the exhaustive tree searches that are done in current compression methods. On the other hand, current decoders are fairly dumb and extremely optimized for speed; neural networks will probably have trouble competing there. In the design of video codecs, an increase in decoding time is considered at least 10x more costly than the same increase in encoding time.

source: I worked with both HEVC and neural network based compression.


I think the title of the paper should say that it is lossy image compression, which would make clear how it works and what task it performs.

I would be surprised if there wasn't a way to provide a learned, lossless method of compression, but that would be a very different paper and result.


Take the following as coming from a dilettante... I'm still trying to understand the remainder of the paper but felt like writing on the basics of the encoder/decoder/quantizer setup they mention.

I found this particularly interesting: "To compress an image x ∈ X, we follow the formulation of [20, 8] where one learns an encoder E, a decoder G, and a finite quantizer q."
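In code, that formulation is roughly the pipeline below (PyTorch, with layer sizes and quantization centers invented for illustration; the real architecture is much deeper and presumably trains through q with some differentiable relaxation):

    import torch
    import torch.nn as nn

    E = nn.Sequential(                        # encoder: image -> latent feature map
        nn.Conv2d(3, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(8, 4, kernel_size=4, stride=2, padding=1),
    )
    G = nn.Sequential(                        # decoder/generator: latent -> image
        nn.ConvTranspose2d(4, 8, kernel_size=4, stride=2, padding=1), nn.ReLU(),
        nn.ConvTranspose2d(8, 3, kernel_size=4, stride=2, padding=1), nn.Sigmoid(),
    )
    centers = torch.linspace(-1.0, 1.0, steps=5)   # finite set of quantization levels

    def q(w):
        # hard nearest-center quantization of the latent
        return centers[(w.unsqueeze(-1) - centers).abs().argmin(dim=-1)]

    x = torch.rand(1, 3, 64, 64)              # dummy RGB image
    x_hat = G(q(E(x)))                        # reconstruction from the quantized code

The GAN part, as I read it, is that a discriminator pushes x_hat toward looking like a plausible photo rather than just minimizing pixel error.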

I feel like this is related to some standard human memorization/learning techniques. Example: I'm learning the guitar fretboard note placement in E standard. It's difficult for me to visualize the first 4 frets on a 6-string guitar with notes on each fret.

To help me memorize the note placement I develop various mnemonic devices (both lossy and lossless). I know I've memorized the fretboard sufficiently when I can visualize it.

Attempting to translate my reading of the paper I believe the following analogy is apt. My "encoder" operates on a short term image when I close my eyes after looking at a fret diagram. It produces semantic objects, i.e. an ordered sequence of "letters" or pairs of letters (letters that are horizontally, vertically or diagonally aligned). The quantizer takes these objects and looks at the order/distribution. The quantizer places more importance on some of the semantic objects than others (the fourth fret has 4 natural notes before an accidental). My decoder is interpreting the stored/compressed note information to try to produce the image. It may be off substantially, so I correct and repeat the process.

The process of optimizing what the semantic objects are, the weight each gets, and how I use them to derive the original image seems like a fairly good representation of what I do (though at least some of that appears to be fixed in the learning algorithm typically). Of course, analogies are just that and mine doesn't take into account the discriminator or the remaining "heart" of the paper.

I think the heart of the paper is that they're trying to determine through GANs a good way to both store the image and recover it while reducing bits per pixel and increasing the quality of reproduction. Using some classical terms, the GAN algorithm thus tweaks the compressor, the data storage format and the decompressor to optimize what should be "hard-coded" in the compressing/decompressing process or program vs what will be stored as a result of the compression program.

Very handwavey but I think the general idea is right?


An encoder/decoder architecture learns a more "efficient" representation. It tries to find features that are useful for describing the variations in the input data (images) it has seen.

For example, if trained on faces, it will learn features for things like eyes and mouths. So the image can be encoded as "put a mouth of this type with this width at this location" rather than operating at the level of pixels.

If trained on text, it might learn features related to letters and typography (boldness, italics, size, spacing). So it might encode things as Helvetica, 16pt, italic.

This is a gross oversimplification, and things rarely map exactly to concepts humans would use, but hopefully it communicates the concept.
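A bare-bones illustration of the idea (sizes and data are made up; real models are convolutional and trained on an actual dataset): squeeze the input through a small bottleneck and train the pair to reconstruct it, so the bottleneck is forced to keep only the most useful variations.

    import torch
    import torch.nn as nn

    enc = nn.Sequential(nn.Linear(784, 32), nn.ReLU())    # 784 pixels -> 32 numbers
    dec = nn.Sequential(nn.Linear(32, 784), nn.Sigmoid())
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)

    for step in range(100):                  # stand-in for a real training loop
        x = torch.rand(64, 784)              # dummy batch; imagine faces or text scans
        loss = nn.functional.mse_loss(dec(enc(x)), x)   # reconstruction error
        opt.zero_grad()
        loss.backward()
        opt.step()

Whatever ends up in those 32 numbers is exactly the kind of learned feature described above.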


For photo-quality material this will mean a loss of detail, but some media (animation/clip art/compressed video) could benefit greatly if the reconstruction algorithm is fast enough. They should compare it with the AV1/x265 codecs.


It seems that some form of neural-network-based compression is the future, but how do you go from an academic one-off implementation to a widely deployable codec?


I was thinking about this use case for neural networks for months... Glad to read a paper about it. I wonder how to adapt it to video.


It's very interesting. I've heard it said in online lectures that it does a sort of compression but nobody really uses it for that because existing algorithms perform much better. Guess this is no longer true.


Loved the site. Great way to present research, will be even better with source code or a notebook.



