Stable Diffusion based image compression (matthias-buehlmann.medium.com)
498 points by nanidin on Sept 20, 2022 | 199 comments



Nice work!

However, a cautionary tale on AI medical image "denoising":

(and beyond, in science)

- See the artifacts?

The algorithm fills ambiguous areas of the image with content it has seen before / was trained on. So, if such a system were to "denoise" (or compress, which - if you think about it - is basically the same operation) CT scans, X-rays, MRIs, etc., in ambiguous areas it could plug in diseased tissue where the ground truth was actually healthy.

Or the opposite, which is even worse: substitute diseased areas of the scan with healthy looking imagery it had been trained on.

Reading recent publications that try to do "denoising" or resolution "enhancement" in medical imaging contexts, the authors seem to be completely oblivious to this pitfall.

(maybe they had a background as World Bank / IMF economists?)


Here's a similar case of a scanner using a traditional compression algorithm. A bug in the compression algorithm made it replace numbers in the scanned image with different numbers.

https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...


That is completely outside all my expectations prior to reading it. The consequences are potentially life and death, or incarceration, etc, and yet they did nothing until called out and basically forced to act.

A good reminder that the bug can be anywhere, and when things stop working we often need to get very dumb, and just methodically troubleshoot.


We programmers tend to think our abstractions match reality somehow. Or that they don't leak. Or even if they do leak, that leakage won't spill down several layers of abstraction.

I used to install T1 lines a long time ago. One day we had a customer that complained that their T1 was dropping every afternoon. We ran tests on the line for extended periods of time trying to troubleshoot the problem. Every test passed. Not a single bit error no matter what test pattern we used.

We monitored it while they used it and saw not a single error, except for when the line completely dropped. We replaced the NIU card, no change.

Customer then hit us with, "it looks like it only happens when Jim VNCs to our remote server".

Obviously a userland program (VNC) could not possibly cause our NIU to reboot, right?? It's several layers "up the stack" from the physical equipment sending the DS1 signal over the copper.

But that's what it was. We reliably triggered the issue by running VNC on their network. We ended up changing the NIU and corresponding CO card to a different manufacturer (from Adtran to Soneplex I think?) to fix the issue. I wish I had had time to really dig into that one, because obviously other customers used VNC with no issues. Adtran was our typical setup. Nothing else was weird about this standard T1 install. But somehow the combination of our equipment, their networking gear, and that program on that workstation caused the local loop equipment to lose its mind.

This number-swapping story hit me the same way. We would all expect a compression bug to manifest as blurry text, or weird artifacts. We would never suspect a clean substitution of a meaningful symbol in what is "just a raster image".


Reminds me of this story: http://blog.krisk.org/2013/02/packets-of-death.html

tldr: Specific packet content triggers a bug in the firmware of an Intel network card and bricks it until powered off.


I assume something like JPEG (used in the DICOM standard today) has more eyes on the code than proprietary Xerox stuff? Hopefully, at least...

I have seen weird artifacts on MRI scans, specifically the FLAIR image enhancement algorithm used on T2 images, i.e. white spots, which could in theory be interpreted by a radiologist as small strokes or MS... so I always take what I see with a grain of salt.


The DICOM standard stuff did have a lot of eyes on it, and was tuned toward fidelity which helps. It's not perfect, but what is.

MRI artifacts though are a whole can of worms, but fundamentally most of them come from a combination of the EM physics involved, and the reconstruction algorithm needed to produce an image from the frequency data.

I'm not sure what you mean by "image enhancement algorithm"; FLAIR is a pulse sequence used to suppress certain fluid signals, typically used in spine and brain.

Many of the bright spots you see in FLAIR are due to B1 inhomogeneity, iirc (it's been a while though)


Probably worth mentioning also that "used in DICOM standard" is true but possibly misleading to someone unfamiliar with it.

DICOM is a vast standard. In its many crevasses, it contains wire and file encoding schemas, some of which include (many different) image data types, some of which allow (multiple) compression schemes, both lossy and lossless, as well as metadata schemes. These include JPEG, JPEG-LS, JPEG 2000, MPEG-2/4, HEVC.

I think you have to encode the compression ratio as well, if you do lossy compression. You definitely have to note that you did lossy compression.


The xerox problem wasn’t an issue with the standard, it was a bug in the code.


IIRC, this was an issue (or conspiracy fuel, or whatever) with the birth certificate that Obama released: some of the supposedly unique elements in the scan repeated over and over.


People have been publishing fairly useless papers "for" medical imaging enhancement/improvement for 3+ decades now. NB this is not universal (there are some good ones) and not limited to AI techniques, although essentially every AI technique that comes along gets applied to compression/denoising/"superres"/etc. if it can, eventually.

The main problem is that typical imaging researchers are too far from actual clinical applications, and often trying to solve the wrong problems. It's a structural problem with academic and clinical incentives, as much as anything else.


From the article:

> a bit of a danger of this method: One must not be fooled by the quality of the reconstructed features — the content may be affected by compression artifacts, even if it looks very clear

... plus an excellent image showing the algorithm straight making stuff up, so I suspect the author is aware.


In my experience, medical imaging at the diagnostic tier uses only lossless compression (JPEG 2000 et al.). It was explicitly stated in our SOPs/policies that we had to have a lossless setup.

Very sketchy to use super resolution for diagnostics. In research (fluorescence), sure.

ref: my direct experience of pathology slide scanning machines and their setup.


There is nothing in the article suggesting this should be used for medical imaging.


Fun to imagine this could show up in future court cases. Is the picture true, or were details changed by the ai compression algorithm?


This has already happened. Pinch-to-zoom became a subject of controversy in the Kyle Rittenhouse trial because of a fear that it could introduce false details.


This is similar to whether you can trust eye witnesses’ recollections. Brains do their own “compression” of what they observed.


That is a pretty scary side effect and why AI in medicine needs to be handled carefully, especially anything with generative capability.

I do think, however, the primary application of these newer generation of generative models will be useful for more sophisticated data augmentation. For example, a lung nodule detection AI trained on a combination of synthetic and real lung nodules may perform better than one trained on real data alone. Do you think there are any other responsible applications?


I think this could be avoided by using some simple-stupid metric for measuring the difference between the compressed and uncompressed images and rejecting the compressed one if it's too different.

Maybe something like dividing the image into small squares and computing the L2 difference between original and compressed. If any square has too large an L2 difference, then compression failed. The metric might need to be informed by the domain.

Edit: Or maybe PSNR+SSIM?
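
A minimal sketch of that check (block size and threshold are arbitrary here and would need to be tuned to the domain):

  import numpy as np

  def blockwise_l2_ok(original: np.ndarray, compressed: np.ndarray,
                      block: int = 16, threshold: float = 20.0) -> bool:
      """Reject the compressed image if any block's RMS error is too large."""
      diff = original.astype(np.float64) - compressed.astype(np.float64)
      h, w = diff.shape[:2]
      for y in range(0, h - h % block, block):
          for x in range(0, w - w % block, block):
              rms = np.sqrt(np.mean(diff[y:y + block, x:x + block] ** 2))
              if rms > threshold:
                  return False  # this region changed too much; fall back to a conventional codec
      return True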


Mentioned in TFA at least twice


Sounds like you need lossless compression.

I was told that the GPT-2 text compression variant was a lossless compressor (https://bellard.org/libnc/gpt2tc.html), why is stable diffusion lossy?


Probably something to do with the variational autoencoder, which is lossy.


Something interesting about the San Francisco test image is that if you start to look into the details, it's clear that some real changes have been made to the city. Rather than losing texture or grain or clarity, what's lost here is information about the particular layout of a neighborhood of streets, which has been replaced as if someone were drawing the scene from memory. A very different kind of loss that, without the original, might be imperceptible, because the information that was lost isn't replaced with random or systematic noise, but rather with new, structured information.


One thing that worries me about generative AI is the degradation of “truth” over time. AI will be the cheapest way to generated content, by far. It will sometimes get facts subtly wrong, and eventually that AI generated content will be used to train future models. Rinse and repeat.


We are getting closer and closer to a simulacrum and hyperreality.

We used to create things that were trying to simulate (reproduce) reality, but now we are using those "simulations" we'd created as if they were the real thing. With time we will be getting farther away from the "truth" (as you put it), and yes - I share your worry about that.

https://en.wikipedia.org/wiki/Simulacrum

EDIT: A good example I heard that explains what a simulacrum is was this: ask a random person to draw a picture of a princess and see how many will draw a Disney princess (which already was based on real princesses) vs how many will draw one looking like Catherine of Aragon or another real princess.


Similar to how we have low-background (pre-nuclear) steel, might we have pre-transformer content?


Yes indeed. I've been looking for an auto summarizer that reliably doesn't change the content. So far everything I've tried will make up or edit a key fact once in a while.


So you’ve described humans.


Currently computers can reliably do maths. Later AI will unreliably do maths. Exactly like humans.


So it will get stupider... maybe the singularity isn't bad like too smart but bad like dealing with too many stupid people.


The nice thing about math is that often it's much harder to find a proof than to verify that proof. So math AI is allowed to make lots of dumb mistakes, we just want it to make the occasional real finding too.


Unless we also ask AI to do the proof verification...


Why would you do that? Proof verification is pretty much a solved problem.


Both stupider and less deterministic, but also smarter and more flexible. Like humans.


Maybe making (certain kinds of) math mistakes is a sign of intelligence.


Fair point, though I feel there's a difference as AI can generate content much more quickly.


Anywhere that truth matters will be unaffected. If such deviations from truth can persist, then the truth never mattered. False assumptions will never hold where they can't, because reality is quite pervasive. Ask anyone who's had to productionize an ML model in a setting that requires a foot in reality. Even a single-digit drop in accuracy can have resounding effects.


The interesting thing is that in some ways this is a return to the pre-modern era of lossy information transmission between the generations. Every story is re-molded by the re-teller. Languages change, and thus the contextual interpretations. Even something as seemingly static as a book gets slowly modified as scribes rewrite scrolls over centuries.


Certainly possible, though we also have many hundreds of millions of people walking the globe taking pictures of things with their phones (not all of which are public to be used for training, but still).


It’s not too different from how human memory and written tradition or word of mouth works.


I've started seeing more of this crap show up on the front page of Google.


Kind of like how chicken tastes like everything.


Jpeg bitrot 2.0


art is truth


Yeah, if it were actually adopted as a way to do compression, it seems likely to lead to even worse problems than JBIG2 did https://news.ycombinator.com/item?id=6156238

Invisibly changing the content rather than the image quality seems like a really concerning failure mode for image compression!

I wonder if it'd be possible to use SD as part of a lossless system - use SD as something that tells us the likelihood of various pixel values given the rest of the image and combine that likelihood with a Huffman encoding. Either way, fantastic hack, but we really should avoid using anything lossy built on AI for image compression.


Imagine a world where bandwidth constraints meant transmitting a hidden compressed representation that gets expanded locally by smart TVs that have pretrained weights baked into the OS. Everyone sees a slightly different reconstitution of the same input video. Firmware updates that push new weights to your TV result in stochastic changes to a movie you've watched before.


"The weather forecast was correct as broadcast, sir, it's just your smart TV thought it was more likely that the weather in your region would be warm on that day, so it adjusted the symbol and temperature accordingly"


I mean, that is already happening. Almost all modern TV's do some signal processing before outputting the pixels, and the image looks slightly different on each model.

But it'd definitely be cool to have some latent representation of a video that then gets rendered on tv - you could apply latent style sheets to the content, like what actors you want to play the roles, or turn everything into a steam-punk anime on the fly. The more abstract the representation, the more interesting alterations you could apply


You could still use some kind of adaptive Huffman coding. Current compression schemes have some kind of dictionary embedded in the file to map between the common strings and the compressed representation. Google tried proposing SDCH a few years ago, using a common dictionary for web pages. There isn't any reason why we can't be a bit more deterministic and share a much larger latent representation of "human visual comprehension" or whatever to do the same. It doesn't need to be stochastic once generated.


Give it "enough" bits and it won't be a problem. How many is enough is the question.


It's interesting that this is closer to how human memory operates—we're quite good at unconsciously fabricating false yet strong memories.


True, but I'd like to continue using products that produce close-to-real images. Phones nowadays already process images a lot. The moment they start replacing pixels it'll all be fake.

And… Some manufacturer apparently already did it on their ultra zoom phones when taking photos of the moon.


Meh. Cameras have been "replacing pixels" for as long as I've been alive. Consider that a 4K camera only has 2k*4k pixels whereas a 4K screen has 2k*4k*3 subpixels.

2/3 of the image is just dreamed up by the ISP (image signal processor) when it debayers the raw image.

I'm not aware of any consumer hardware that has open source ISP firmware or claims to optimize for accuracy over beauty.


Okay, but a camera doing this is unlikely to dream up plausible features that didn't actually exist in the scene.


Of course it is! Try feeding static into a modern ISP. It will find patterns that don't exist.


The good old JBIG2 debacle.

“When used in lossy mode, JBIG2 compression can potentially alter text in a way that's not discernible as corruption. This is in contrast to some other algorithms, which simply degrade into a blur, making the compression artifacts obvious.[14] Since JBIG2 tries to match up similar-looking symbols, the numbers "6" and "8" may get replaced, for example.

In 2013, various substitutions (including replacing "6" with "8") were reported to happen on many Xerox Workcentre photocopier and printer machines, where numbers printed on scanned (but not OCR-ed) documents could have potentially been altered. This has been demonstrated on construction blueprints and some tables of numbers; the potential impact of such substitution errors in documents such as medical prescriptions was briefly mentioned.”

https://en.m.wikipedia.org/wiki/JBIG2


There was a scandal when it was discovered that Xerox machines were doing this; in that case, the example showed "photocopies" replacing numbers in documents with other numbers.


There is a talk about that issue [1].

During my PhD this issue came up amongst those in the group looking into compressed sensing in MRI. Many reconstruction methods (AI being a modern variant) work well because a best guess is visually plausible. These kinds of methods fall apart when visually plausible and "true" are different in a meaningful way. The simplest examples here being the numbers in scanned documents, or in the MRI case, areas of the brain where "normal brain tissue" was on average more plausible than "tumor".

[1]: http://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres_...


It's worth noting that these problems are things to be aware of, not the complete showstoppers some people seem to think that they are.


> not the complete showstoppers some people seem to think that they are.

idk if I had to second guess every single result coming out of a machine it would be a showstopper for me. This isn't pokemon go, tumor detection is serious matter


Why anyone would want to lossily compress a medical image is beyond me. You get equipment to make precise high-resolution measurements; it goes without saying that you do not want noise added to that.


In medical images, you don't record first and then compress later. Instead, you make sparse measurements and then reconstruct. Why? Because people move, so getting more frames/sec is a thing; you don't want people to stay for too long in the machine; and (ideally) with the same setup, you can focus on a smaller area and get a higher resolution than standard measurements too.


You are talking about compressed sensing which is not lossy compression (compressed sensing can be lossless unless you're dealing with noisy measurements).

But say you're doing noisy measurements, and you are under-measuring like you say, and you have to fabricate non-random, non-homogeneous reconstruction noise. In that case it would be a very good idea to produce, as they do for lossy compression, both the standard overall bit rate vs. PSNR characterization against alternate direct (non-sparse) measurement ground truths (which have to exist, or else the reconstruction method should be called into question), and the bit rate for each particular sparse measurement. That way people can see how reliable the reconstruction is. Ideally the image should be labeled at the pixel level with reconstruction probabilities, or presented in other ways to demonstrate the ratio of measured vs. fabricated information, like 95% confidence-interval extremal reconstructions or something.

It's not clear that the community is doing this level of due diligence, so the voices here are right: it's not a good idea to use.


A medical sensor filling in "plausible" information is not a show stopper? I hope you are never in control of making decisions like that.


To be aware of when you are building compression systems.

It's perfectly possible to build neural network based compression systems that do not output false information.


If the compression is lossless that's fine. I have not seen an AI system being used in this manner but I don't doubt it's possible. All lossy compression methods output false information, that's the point of lossy compression and why it works so well. Remove details that the compression algorithm deems unimportant.


I'm having a hard time seeing where the random substitution of all numbers isn't supposed to be a complete showstopper.


Well, for example, you train the VAE to reduce the compression on characters.


The right amount of compression in a photocopy machine is zero.

Compression that gives you a blurred image is a trade-off.

But what does it mean to “be aware of” compression that may give you a crisp image of some made up document?


> The right amount of compression in a photocopy machine is zero.

This isn't an obvious statement to me. If you've had the misfortune of scanning documents to PDF and getting the 100MB per page files automatically emailed to you then you might see the benefit in all that white space being compressed somehow.

> But what does it mean to “be aware of” compression that may give you a crisp image of some made up document?

This isn't something I said. A good compression system for documents will not change characters in any circumstances.


If you are making an image of a cityscape to illustrate an article it probably doesn't matter what the city looks like. But if the article is about the architecture of the specific city, it probably does, so you need to 'be aware' that the image you are showing people isn't correct, and reduce the compression.


This subthread was about changing numbers in scanned documents and vanishing tumors in medical images.


Arguably this is still fine with the definition of lossy compression. The compressed image still roughly shows the idea of the original image.


I would've thought anyone relying on lossy-compressed images of any sort already needs to be aware of the potential effects, or otherwise isn't really concerned by the effect on the image (and I'd guess that the vast majority of use cases actually don't care if parts of the image are essentially "imaginary")


This needs to be compared with automated tests. A lack of visual artifacts doesn't mean an accurate representation of the image in this case.


It opens up an interesting question: is it suggesting "improvements" that could be made in the real world?


Are you suggesting a lossy but 'correct' version?

IE, the algorithm ignores and loses the 'irrelevant' information, but holds the important stuff?


A few thoughts that aren't related to each other.

1. This is a brilliant hack. Kudos.

2. It would be great to see the best codecs included in the comparison - AVIF and JPEG XL. Without those it's rather incomplete. No surprise that JPEG and WEBP totally fall apart at that bitrate.

3. A significant limitation of the approach seems to be that it targets extremely low bitrates where other codecs fall apart, but at these bitrates it incurs problems of its own (artifacts take the form of meaningful changes to the source image instead of blur or blocking, very high computational complexity for the decoder).

When only moderate compression is needed, codecs like JPEG XL already achieve very good results. This proof of concept focuses on the extreme case, but I wonder what would happen if you targeted much higher bitrates, say 5x higher than used here. I suspect (but have no evidence) that JPEG XL would improve in fidelity faster as you gave it more bits than this SD-based technique. Transparent compression, where the eye can't tell a visual difference between source and transcode (at least without zooming in) is the optimal case for JPEG XL. I wonder what sort of bitrate you'd need to provide that kind of guarantee with this technique.


also thought it was odd that AVIF was not compared - it would show a major quality and size improvement over WebP.


The comparison doesn't make much sense because for fair comparisons you have to measure decompressor size plus encoded image size. The decompressor here is super huge because it includes the whole AI model. Also, everyone needs to have the exact same copy of the model in the decompressor for it to work reliably.


Only if decompressor and image are transmitted over the same channel at the same time, and you only have a small number of images. When compressing images for the web I don't care if a webp decompressor is smaller than a jpg or png decompressor, because the recipient already has all of those.

Of course stable diffusion's 4GB is much more extreme than Brotli's 120kb dictionary size, and would bloat a Browser's install size substantially. But for someone like Instagram or a Camera maker it could still make sense. Or imagine phones having the dictionary shipped in the OS to save just a couple kB on bad data connections.


Even if dictionaries were shipped, the biggest difficulty would be performance and resources. Most of these models require beefy compute and a large amount of VRAM that isn't likely to ever exist on end devices.

Unless that can be resolved it just doesn't make sense to use it as a (de)compressor.


The prospect of the images getting "structurally" garbled in unpredictable ways would probably limit real-world applications: https://miro.medium.com/max/4800/1*RCG7lcPNGAUnpkeSsYGGbg.pn...

There's something to be said about compression algorithms being predictable, deterministic, and only capable of introducing defects that stand out as compression artifacts.

Plus, decoding performance and power consumption matters, especially on mobile devices (which also happens be the setting where bandwidth gains are most meaningful).


While that is kind of true it is also sort of the point.

The optimal lossy compression algorithm would be based on humans as a target. it would remove details that we wouldn't notice to reduce the target size. If you show me a photo of a face in front of some grass the optimal solution would likely be to reproduce that face in high detail but replace the grass with "stock imagery".

I guess it comes down to what is important. In the past algorithms were focused on visual perception, but maybe we are getting so good at convincingly removing unnecessary detail that we need to spend more time teaching the compressor what details are important. For example if I know the person in the grass preserving the face is important. If I don't know them then it could be replaced by a stock face as well. Maybe the optimal compression of a crowd of people is the 2 faces of people I know preserved accurately and the rest replaced with "stock" faces.


Remember the Xerox scan-to-email scandal in which tiling compression was replacing numbers in structural drawings? We're talking about similar repercussions here.


This reminds me of a question I have about SD: why can’t it do a simple OCR to know those are characters not random shapes? It’s baffling that neither SD nor DE2 have any understanding of the content they produce.


You could certainly apply a “duct tape” solution like that, but the issue is that neural networks were developed to replace what were previously entire solutions built on a “duct tape” collection of rule-based approaches (see the early attempts at image recognition). So it would be nice to solve the problem in a more general way.


> why can’t it do a simple OCR to know those are characters not random shapes?

It's pretty easy to add this if you wanted to.

But a better method would be to fine tune on a bunch of machine-generated images of words if you want your model to be good at generating characters. You'll need to consider which of the many Unicode character sets you want your model to specialize in though.


With compression you often make a prediction then delta off of it. A structurally garbled one could be discarded or just result in a worse baseline for the delta.
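
Roughly what that might look like as a toy sketch (the MSE threshold and the "raw" fallback are assumptions, not any particular codec's design):

  import numpy as np

  def encode_block(block: np.ndarray, prediction: np.ndarray, max_mse: float = 100.0):
      """Keep the prediction only if its residual is small; otherwise fall back to raw data."""
      residual = block.astype(np.int16) - prediction.astype(np.int16)
      if float(np.mean(residual.astype(np.float64) ** 2)) > max_mse:
          return ("raw", block)      # the (possibly hallucinated) prediction gets discarded
      return ("delta", residual)     # small residuals entropy-code cheaply

  def decode_block(token, prediction: np.ndarray) -> np.ndarray:
      kind, payload = token
      if kind == "raw":
          return payload
      return (prediction.astype(np.int16) + payload).astype(np.uint8)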


Just a note that Stable Diffusion is/can be deterministic (if you set an RNG seed).


I was told (on the Unstable Diffusion discord, so this info might not be reliable) that even with using the same seed the results will differ if the model is running on a different GPU. This was also my experience when I couldn't reproduce the results generated by the discord's SD txt2img generating bot.


I'm not sure about the different GPU issue. But if that is an issue, the model can be made deterministic (probably compromising inference speed), by making sure the calculations are computed deterministically.


It absolutely should be reproducible, and in my experience it is.

I do tend to use the HuggingFace version though.


> To evaluate this experimental compression codec, I didn’t use any of the standard test images or images found online in order to ensure that I’m not testing it on any data that might have been used in the training set of the Stable Diffusion model (because such images might get an unfair compression advantage, since part of their data might already be encoded in the trained model).

I think it would be very interesting to determine if these images do come back with notably better compression.


Given the approach, they'll probably come back with better reconstruction/decompression too.


Not clear. Fully encoding the training images wouldn't be a feasible property of a good autoencoder.


I wonder if this technique could be called something like “abstraction” rather than “compression” given it will actually change information rather than its quality.

Ie. “There’s a neighbourhood here” is more of an abstraction than “here’s this exact neighbourhood with the correct layout just fuzzy or noisy.”


I would say any compression is abstraction in a certain sense. A simple example is a gradient. A lossy compression might abstract over the precise pixel values and simply record a gradient that almost matches the raw input. You could even make the argument that lossless compression is abstraction. A 2D grid with 5px lines and 50px spacing between them could feasibly be captured really well using a classical compression scheme.

What AI offers is just a more powerful and opaque way of doing the same thing.


like a MIDI file


Well, a MIDI file says nothing about the sound a Trumpet makes, whereas this SD-based abstraction does give a general idea of what your neighborhood should look like.

Maybe it's more like a MOD file?


Doesn't decompression require the entire Stable Diffusion model? (and the exact same model at that)

This could be interesting but I'm wondering if the compression size is more a result of the benefit of what is essentially a massive offline dictionary built into the decoder vs some intrinsic benefit to processing the image in latent space based on the information in the image alone.

That said... I suppose it's actually quite hard to implement a "standard image dictionary" and this could be a good way to do that.


Haha. Here’s a faster compression model. Make a database of every image ever made. Compute a thumbprint and use that as the index of the database. Boom!


A quick Google says there are 10^72 to 10^82 atoms in the universe.

Assuming 24-bit color, if you could store an entire image in a single atom, images of only about 12 pixels would already be numerous enough (2^(24*12) ≈ 10^87) to give each atom a unique image.


Not every possible image has been produced!


I'll get started, then!


The latent space _is_ the massive offline dictionary, and the benefit is not having to hand-craft the massive offline dictionary?


For those of us unfamiliar... roughly how large is that in terms of bytes?


I thought that's what "some important caveats" was going to be, but no, the article didn't mention this.


I'd love to see a series of increasingly compressed images, say 8kb -> 4kb -> 2kb -> ... -> 2bits -> 1bit. This would be a great way to demonstrate the increasing fictionalization of the method's recall.


Yes please. That would actually be an incredible blog post.

It also makes me wonder, if dealing with 8 bits, what would the 256 resulting images look like? It feels like it would be an eye into this "brain", what it considers to be the basic building blocks?


This is why for compression tests, they incorporate the size of everything needed to decompress the file. You can compress down to 4.97KB all you want, just include the 4GB trained model.


Do you also include the library to render a jpeg? And maybe the whole OS required to display it on your screen?

There are very many uses where any fixed overhead is meaningless. Imagine archiving billions of images for long term storage. The 4GB model quickly becomes meaningless.


Fixed overheads are never meaningless. An O(n^2) algorithm that processes your data in 5s is faster on your data than an O(log n) one that takes 20 hours.

Long term storage of billions of images is meaningless, if it takes billions of years to archive these images.


It’s a one time cost rather than per image. You need the 4GB model only once and then you can uncompress unlimited images.


Yes, but each image needs access to this 4GB (actually, I have no idea how much RAM it takes up), plus whatever the working set size is. It is a non-trivial overhead that really limits the throughput of your system, so you can process fewer images in parallel, so compressing a billion images in reasonable time suddenly may cost much more than the amount of storage it would save, compared to other methods.


> Do you also include the library to render a jpeg? And maybe the whole OS required to display it on your screen?

No, what does that have to do with reconstructing the original data?

If the fixed overhead works for you, that's fine, but including it is not meaningless.


Is that true? I have never seen this done in any image compression comparison (i.e. only data specific to the image being compressed is included, not standard tables that are always used by the algorithm, like the quantisation tables used in JPG compression).


Yes, it is done all the time.

However, several people here are conflating "best compression as determined for a competition" and "best compression for use in the real world". There is an important relationship between them, absolutely, but in the real world we do not download custom decoders for every bit of compressed content. Just because there is a competition that quite correctly measures the entire size of the decompressor and encoded content does not mean that is now the only valid metric to measure decompression performance. The competitions use that metric for good and valid reasons, but those good and valid reasons are only vaguely correlated to the issues faced in the normal world.

(Among the reasons why competitions must include the size of the decoder is that without that the answer is trivial; I define all your test inputs as a simple enumeration of them and my decoder hard-codes the output as the test values. This is trivially the optimal algorithm, making competition useless. If you could have a real-world encoder that worked this well, and had the storage to implement it, it would be optimal, but you can't possibly store all possible messages. For a humorous demonstration of this encoding method, see the classic joke: https://onemansblog.com/2010/05/18/prison-joke/ )


For text compression benchmarks it's done http://mattmahoney.net/dc/text.html

Matt doesn't do this on the Silesia corpus compression benchmark, even though it would make sense there as well: http://mattmahoney.net/dc/silesia.html

So a compressor of a few gigabytes would make sense if you have a set of pictures of more than a few gigabytes. It's a bit similar to preprocessing text compression with a dictionary and adding the dictionary to the extractor to squeeze out a few more bytes.
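
As a small illustration of the shared-dictionary idea (using the python-zstandard bindings; file names and the dictionary size are made up):

  import zstandard as zstd

  # Train a dictionary on sample documents once and ship it to both ends,
  # in the same spirit as SDCH or Brotli's built-in dictionary.
  samples = [open(p, "rb").read() for p in ["page1.html", "page2.html", "page3.html"]]
  shared_dict = zstd.train_dictionary(16 * 1024, samples)      # 16 KB shared dictionary

  compressor = zstd.ZstdCompressor(dict_data=shared_dict)
  decompressor = zstd.ZstdDecompressor(dict_data=shared_dict)

  blob = compressor.compress(open("page4.html", "rb").read())  # small because it can reference the dictionary
  page = decompressor.decompress(blob)                         # the decoder must hold the same dictionary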


By the way, the leading nncp in the LTCB (text.html) "is a free, experimental file compressor by Fabrice Bellard, released May 8, 2019" :)


The VAE used in Stable Diffusion is not ideal for compression. I think it would be better to use the vector-quantized variant (by the same authors of latent diffusion) instead of the KL variant, then store the indexes for each quantized vector using standard entropy coding algorithms.

From the paper, the VQ variant also performs better overall; SD may have chosen the KL variant only to lower VRAM use.


KL models perform better than VQ models, as you can see in the latent diffusion repo by CompVis.


Just checked the paper again and yes, you're right: the KL version is better on the OpenImages dataset. The VQ version is better in the inpainting comparison.

In this case you'd still want to use the VQ version though; it doesn't make sense to do an 8-bit quantization on the KL vectors when there's an existing quantization learned through training.


The one with the different buildings in the reconstructed image is a bit spooky. I've always argued that human memory is highly compressed, storing, for older memories anyway, a "vibe" plus pointers to relevant experiences/details that can be used to flesh it out as needed. Details may be wrong in the recollecting/retelling, but the "feel" is right.

And here we have computers doing the same thing! Reconstructing an image from a highly compressed memory and filling in appropriate, if not necessarily exact details. Human eye looks at it casually and yeah, that's it, that's how I remember it. Except that not all the details are right.

Which is one of those "Whoa!" moments, like many many years ago, when I wrote a "Connect 4" implementation in BASIC on the Commodore 64, played it and lost! How did the machine get so smart all of a sudden?


In theory, it would be possible to benefit from the ability of Stable Diffusion to increase perceived image quality without even using a new compression format. We could just enhance existing JPG images in the browser.

There already are client side algorithms that increase the quality of JPGs a lot. For some reason, they are not used in browsers yet.

A Stable Diffusion based enhancement would probably be much nicer in most cases.

There might be an interesting race to do client side image enhancements coming to the browsers over the next years.


Great idea to use Stable Diffusion for image compression. There are deep links between machine learning and data compression (which I’m sure the author is aware of).

If you could compute the true conditional Kolmogorov complexity of an image or video file given all visual online media as the prior, I imagine you would obtain mind-blowing compression ratios.
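
(For reference, a sketch of what that quantity is: with U a fixed universal machine, D the corpus of online visual media used as side information, and |p| the length of a program,

  K(x \mid D) = \min \{\, |p| : U(p, D) = x \,\}

i.e. the length of the shortest program that reproduces the image x given D.)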

People complain of the biased artifacts that appear when using neural networks for compression, but I’m not concerned in the long term. The ability to extract algorithmic redundancy from images using neural networks is obviously on its way to outclassing manually crafted approaches, and it’s just a matter of time before we are able to tack on a debiasing step to the process (such that the distribution of error between the reconstructed image and the ground truth has certain nice properties).


For text, GPT-2 was used in a similar demo a year ago albeit said demo is now defunct: https://news.ycombinator.com/item?id=23618465


I thought this was another take on this parody post: https://news.ycombinator.com/item?id=32671539

But no, it's the real deal. Great job author.


One interesting feature of ML-based image encoders is that it might be hard to evaluate them with standard benchmarks, because those are likely to be part of the training set, simply by virtue of being scraped from the web. How many copies of Lenna has Stable Diffusion been trained with? It’s on so many websites.


We might enter a time when every time a new model/compression algo is introduced, a new series of benchmark images may need to be introduced/taken and ALL historical benchmarks of major compression algos redone on the new images.


What they do is essentially fractal compression with an external library of patterns (which was IIRC patented, but the patent should be long expired).


This does remind me of fractal compression [1] from the 90's, which never took off for various reasons that will be relevant here as well.

[1] https://en.wikipedia.org/wiki/Fractal_compression


This relates to a strong hunch that consciousness is tightly coupled to whatever compression is as an irreducible entity.

Memory <> Compression <> Language <> Signal Strength <> Harmonics and Ratios


"Consciousness" is a pretty useless word without being very carefully defined, because people use it to mean a variety of different things. And often in the most ambiguous way possible such as this comment.

But also often some related but very specific and different things such as the reply that assumes it means only "self-awareness".

To me, the main purpose of the word is to prove the insufficiency of language and how imprecise most people's thinking is.


The beauty of compression is the paradox of specificity and simultaneous ambiguity.


Consciousness is IMHO being aware of being aware. The mystic specialness of it is IMHO a mental illusion, like the Penrose stairs optical illusion.


I see the relation between compression and consciousness. But what do you mean by irreducible entity, and how does it relate to the two?


I don't understand much of what the OP is saying.

But I do like the Stephen Wolfram idea of consciousness being the way a computationally bounded observer develops a coherent view of a branching universe.

This is related to compression because it is a (lossy!) reduction of information.

I understand that Wolfram is controversial, but the information-transmission-centric view of reality he works with makes a lot of intuitive sense to me.

https://writings.stephenwolfram.com/2021/03/what-is-consciou...


By irreducible entity, I mean the as-yet-undefined entity that sits at the nexus of mathematics, philosophy, computation and logic (consciousness).

It's not a well-defined ontology yet. So whatever it is, at its irreducible core, we can pinpoint it as the thing which gives rise to such other things.


What kind of reductions would be disallowed?


This but for video using the "infilling" version for changing parts between frames.

The structural changes per frame matter much less. Send a 5 kB image for every keyframe, then a few bytes per subsequent frame with a sketch of the changes and where to mask them onto the frame.

Modern video codecs are pretty amazing though, so not sure how it would compare in frame size


I've been thinking about more or less the same idea, but the computational edge inference costs probably makes it impractical for most of today's client devices. I see a lot of potential in this direction in the near future though.


I think it's unclear how many computational resources the decompression steps take.

At the moment it's fairly fast, but RAM hungry. But this article makes it clear that quantizing the representation works well (at least for the VAE). It's possible quantized models could also do a decent job.


I am currently also playing around with this. The best part is that for storage you don't need to store the reconstructed image, just the latent representation and the VAE decoder (which can do the reconstructing later). So you can store the image as relatively few numbers in a database. In my experiment I was able to compress a (512, 384, 3) RGB image to (48, 64, 4) floats. In terms of memory it was an 8x reduction.

However, on some images the artefacts are terrible. It does not work as a general-purpose lossy compressor unless you don't care about details.

The main obstacle is compute. The model is quite large, but HDDs are cheap. The real problem is that reconstruction requires a GPU with lots of VRAM. Even with a GPU it's 15 seconds to reconstruct an image in Google Colab. You could do it on CPU, but then it's extremely slow. This is only viable if compute costs go down a lot.
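
For anyone who wants to try the same roundtrip, here is a minimal sketch using the Hugging Face diffusers AutoencoderKL (the checkpoint name, preprocessing and sizes are illustrative assumptions, not the commenter's exact setup; the checkpoint may require accepting the model license):

  import numpy as np
  import torch
  from PIL import Image
  from diffusers import AutoencoderKL

  # Load only the VAE part of Stable Diffusion v1.4
  vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
  vae.eval()

  img = Image.open("input.png").convert("RGB").resize((512, 384))
  x = torch.from_numpy(np.array(img)).float() / 127.5 - 1.0  # map pixels to [-1, 1]
  x = x.permute(2, 0, 1).unsqueeze(0)                        # NCHW

  with torch.no_grad():
      latents = vae.encode(x).latent_dist.mean               # (1, 4, 48, 64) for a 384x512 input
      recon = vae.decode(latents).sample                     # back to (1, 3, 384, 512)

  out = ((recon[0].permute(1, 2, 0).clamp(-1, 1) + 1) * 127.5).byte().numpy()
  Image.fromarray(out).save("roundtrip.png")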


From the title, I expected this to be basically pairing stable diffusion with an image captioning algorithm by 'compressing' the image to a simple human readable description, and then regenerating a comparable image from the text. I imagine that would work and be possible, essentially an autoencoder with a 'latent space' of single short human readable sentences.

The way this actually works is pretty impressive. I wonder if it could be made lossless or less lossy in a similar manner to FLAC and/or video compression algorithms... basically first do the compression, and then add on a correction that converts the result partially or completely into the true image. Essentially, e.g. encoding real images of the most egregiously modified regions of the photo and putting them back over the result.


It definitely can be made lossless: all you need to do is a compress/decompress roundtrip, then save the resulting difference from the ground truth in a lossless image format like PNG, QOI or lossless JXL. The final size would be the lossy compression + difference image. This is of course the least sophisticated approach, but who knows, it might compare pretty well with "plain" lossless formats.
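
A minimal sketch of that residual trick (the lossy codec here is a crude stand-in; in practice it would be the SD/VAE roundtrip, and the residual would go into PNG/QOI/lossless JXL rather than an .npz):

  import numpy as np
  from PIL import Image

  def lossy_roundtrip(img_u8: np.ndarray) -> np.ndarray:
      """Stand-in for any lossy codec (here: crude 4-bit quantization)."""
      return (img_u8 & 0xF0) | 0x08

  original = np.array(Image.open("input.png").convert("RGB"))
  lossy = lossy_roundtrip(original)

  # Store the signed difference losslessly; lossy payload + residual reproduces the original exactly.
  residual = original.astype(np.int16) - lossy.astype(np.int16)
  np.savez_compressed("residual.npz", residual=residual)

  restored = (lossy.astype(np.int16) + np.load("residual.npz")["residual"]).astype(np.uint8)
  assert np.array_equal(restored, original)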


Indeed one way of looking at intelligence is that it is a method of compressing the external universe.

See e.g. the Hutter Prize.


Our sight is light detection compressed into human thought

Written language is human thought compressed into words

Digital images are light detection compressed into bits

Text to images AI compress digital images into written language

Then how do the AI weights relate to human thought?


The feeling of understanding is essentially a decompression result being successfully pattern matched.


It reminded me of a scene from "A Fire Upon the Deep" where connection bitrate is abysmal, but the video is crisp and realistic. It is used as a tool for deception, as it happens. Invisible information loss has its costs.


It is really interesting to talk about semantic lossy compression, which is probably what we get.

Where recreating with traditional codecs introduces syntactic noise, this will introduce semantic noise.

Imagine seeing a high-res, perfect-looking picture, right up until you see the source image and discover that it was reinterpreted.

It is also going to be interesting to see if this method will be chosen for specific pictures, e.g. pictures of celebrity objects (or people, when/if issues around that resolve), while for novel things we need to use "syntactical" compression.


This is the algorithmic equivalent of a metaphor.


Goodness, I love this. It's a great description of the approach.


Before I clicked through to the article, I thought maybe they were taking an image and spitting out a prompt that would produce an image substantially similar to the original.


> Quantizing the latents from floating point to 8-bit unsigned integers by scaling, clamping and then remapping them results in only very little visible reconstruction error.

This might actually be interesting/important for the OpenVINO adaptation of SD ... from what I gathered from the OpenVINO documentation, quantizing is actually a big part of optimizing, as this allows the usage of Intel's new(-ish) NN instruction sets.
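
For reference, a minimal sketch of that scale/clamp/remap step (the clamping range is an illustrative guess, not the value used in the article):

  import numpy as np

  CLAMP = 5.0  # assumed clamping range for the latents

  def quantize_latents(latents: np.ndarray) -> np.ndarray:
      """Scale, clamp and remap float latents to 8-bit unsigned integers."""
      clipped = np.clip(latents, -CLAMP, CLAMP)
      return np.round((clipped + CLAMP) / (2 * CLAMP) * 255).astype(np.uint8)

  def dequantize_latents(q: np.ndarray) -> np.ndarray:
      """Map the 8-bit values back to the original latent range."""
      return q.astype(np.float32) / 255.0 * (2 * CLAMP) - CLAMP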


While this is great as an experiment, before you jump into practical applications, it is worth remembering that the decompressor is roughly 5GB in size :-)


I believe ML techniques are the future of video/image compression. When you read a well written novel, you can kind of construct images of characters, locations and scenes in your mind. You can even draw these scenes, and if you're a good artist, those won't have any artifacts.

I don't expect future codecs to be able to reduce a movie to a simple text stream, but maybe it could do something in the same vein. Store abstract descriptions instead of bitmaps. If the encoding and decoding are good enough, your phone could reconstruct an image that closely resembles what the camera recorded. If your phone has to store a 50Gb model for that, it doesn't seem too bad, especially if the movie file could be measured in tens of megabytes.

Or it could go in another direction, where file sizes remain in the gigabytes, but quality jumps to extremely crisp 8k that you can zoom into or move the camera around if you want.

Can't wait for this stuff!


I would call this "confabulation" more than compression.

Its accuracy is proportional to and bounded by the training data; I suspect in practice it's got a specific strength (filling in fungible detail) and, as discussed ITT in fascinating and gnarly corners, some specific failure modes which are going to lead to bad outcomes.

At least with "lossy" CODECs of various kinds, even if you don't attend to absence until you do an A/B comparison, you can perceive the difference when you do do those comparisons.

In this case the serious peril is that an A/B comparison is [soon] going to just show difference. "What... is... the Real?"

When you contemplate that an ever-increasing proportion of the training data itself stems from AI- or otherwise-enhanced imagery, our hold on the real has never felt weaker, and our vulnerability to the rewriting of reality has never felt more present.


Compare compressed sensing's single pixel camera: https://news.mit.edu/2017/faster-single-pixel-camera-lensles...


This may give insights in how brain memory and thinking works.

Imagine if some day a computer could take a snapshot of the weights and memory bits of the brain and then reconstruct memories and thoughts.


This kind of already fits a little bit with how the brain processes images where there is information lacking. Neurocognitive specialists can likely correct me on the following.

Glaucoma is a disease where one slowly loses peripheral vision, until a small central island remains or you go completely blind.

So do patients perceive black peripheral vision? Or blurred peripheral vision?

Not really…patients actually make up the surrounding peripheral vision, sometimes with objects!


I heard Stable Diffusion's model is just 4 GB. It's incredible that billions of images could be squeezed in just 4 GB. Sure it's lossy compression but still.


I don't think that thinking of it as "compression" is useful, any more than an artist recreating the Mona Lisa from memory is "decompressing" it. The process that diffusion models use is fundamentally different from decompression.

For example, if you prompt Stable Diffusion with "Mona Lisa" and look at the iterations, it is clearer what is happening - it's not decompressing so much as drawing something it knows looks like Mona Lisa and then iterating to make it look clearer and clearer.

It clearly "knows" what the Mona Lisa looks like, but what it is doing isn't copying it - it's more like recreating a thing that looks like it.

(And yes, I realize lots of artists on Twitter are complaining that it is copying their work. I think "forgery" is a better analogy than "stealing" though - it can create art that looks like a Picasso or whatever, but it isn't copying it in a conventional sense)


Forgery requires some kind of deception/fraud. Painting an imitation of the Mona Lisa isn’t forgery. Trying to sell it as if it is the original is.


Yes I agree with this too.

I think using that language is better than "stealing", because the immoral act is the passing off, not training of the model.


In this regard, Stable Diffusion is not so much comparable to a corpus of JPEG images as to the JPEG compression algorithm itself.


I think it's easy to explain. If we split all those images into small 8x8 chunks and put all the chunks into a fuzzy, slightly lossy hashtable, we'll see that many chunks are very similar and can be merged into one. To address this "space of 8x8 chunks" we'll apply PCA to them (similar in spirit to the DCT used in JPEG) and use only the most significant components of the PCA vectors.

So in essence, this SD model is like an Alexandria library of visual elements, arranged on multidimensional shelves.
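
A toy version of that idea, as plain PCA over 8x8 patches of a grayscale image (component count is arbitrary; SD's learned latent space is of course far more powerful than a linear basis):

  import numpy as np

  def patch_pca(gray: np.ndarray, n_components: int = 8):
      """Learn a small basis for 8x8 chunks and keep only the top components per patch."""
      h, w = gray.shape
      patches = (gray[:h - h % 8, :w - w % 8]
                 .reshape(h // 8, 8, w // 8, 8)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, 64)
                 .astype(np.float64))
      mean = patches.mean(axis=0)
      centered = patches - mean
      _, _, vt = np.linalg.svd(centered, full_matrices=False)  # principal axes of the patch cloud
      basis = vt[:n_components]                                # (n_components, 64)
      codes = centered @ basis.T                               # compressed per-patch coefficients
      return codes, basis, mean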


Save around a kilobyte with a decompressor that’s ~5Gbyte.


Does anybody understand from the article how much data needs to be downloaded first on the decompression side? The entire ~2GB array of SD weights, right?


On another note, you can also downscale an image, save it as a JPEG or whatever, then Upscale it back using AI upscaling.


If it’s a VAE then the latents should really be distributions, usually represented as the mean and variance of a normal distribution. If so then it should be possible to use the variance to determine to what precision a particular latent needs to be encoded. Could perhaps help increase the compression further.


Why aren't they scaled to have uniform variances?


Each image is represented by its own distribution over the latents. So the encoder needs the ability to specify some latents very accurately and others more loosely, you could say.
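
A minimal sketch of that idea (numpy only; the step-size rule is an illustrative assumption): latents with a large posterior std get coarse bins, confident ones keep more precision. The decoder needs the same step sizes, e.g. from a shared prior or stored alongside the codes.

  import numpy as np

  def adaptive_quantize(mean: np.ndarray, std: np.ndarray, base_step: float = 0.1):
      """Quantization step grows with the posterior std of each latent."""
      step = base_step * np.maximum(std, 1e-3)
      return np.round(mean / step).astype(np.int32), step

  def adaptive_dequantize(q: np.ndarray, step: np.ndarray) -> np.ndarray:
      return q * step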


What if I just want something pretty similar but not necessarily the exact image. Maybe there could be a way to find a somewhat similar text prompt as a starting point, and then add in some compressed information to adjust the prompt output to be just a bit closer to the original?


If this were used in the wild, do you need a copy of the model locally to decompress the images?


Yes, but possibly not the entire model, hypothetically for instance some fine-tuning on compression and then distillation.


I can imagine some uses for this. Imagine having to archive a massive dataset where it’s unlikely any individual image will be retrieved and where perfect accuracy isn’t required.

Could cut down storage costs a lot.


And how much compute time/power does “decompressing” take compared to a jpg?


Reminds me of, and is sort of similar to, Nvidia's spatial/temporal upscaling DLSS. Lossy compression and upscaling are very closely related.


In the future you can have full 16K movies represented by seeds of only 1.44 MB. A giant 500-petabyte trained model file can run those movies. You can even generate your own movie by uploading a book.


Probably very unlikely, but sometimes I wonder if Jan Sloot did something like this back in '95: https://en.wikipedia.org/wiki/Sloot_Digital_Coding_System


This is not really "stable-diffusion based image compression", since it only uses the VAE part of "stable diffusion", and not the denoising UNet.

Technically, this is simply "VAE-based image compression" (that uses stable diffusion v1.4's pretrained variational autoencoder) that takes the VAE representations and quantizes them.

(Note: not saying this is not interesting or useful; just that it's not what it says on the label)

Using the "denoising UNet" would make the method more computationally expensive, but probably even better (e.g., you can quantize the internal VAE representations more aggressively, since the denoising step might be able to recover the original data anyway).


It does use the UNet to denoise the VAE compressed image:

"The dithering of the palettized latents has introduced noise, which distorts the decoded result. But since Stable Diffusion is based on de-noising of latents, we can use the U-Net to remove the noise introduced by the dithering."

The included Colab doesn't have line numbers, but you can see the code doing it:

  # Use Stable Diffusion U-Net to de-noise the dithered latents
  latents = denoise(latents)
  denoised_img = to_img(latents)
  display(denoised_img)
  del latents
  print('VAE decoding of de-noised dithered 8-bit latents')
  print('size: {}b = {}kB'.format(sd_bytes, sd_bytes/1024.0))
  print_metrics(gt_img, denoised_img)


I stand corrected, then :) cheers.


It is using the UNet, though.


How long does it take to compress and decompress an image that way?


Is there a general name for this kind of latent space round-trip compression? If not, I think a good name could be "interpretive compression"


I would like to see this with much smaller file sizes - like 100 bytes. How well can SD preserve the core subjects or meaning of the photos?


You can already "compress" them down to a few words, so you have your answer there.


A future where all the shadows on a Netflix series are full of ghost cats.


hm. would be interesting to see if any of the perceptual image compression quality metrics could be inserted into the VAE step to improve quality and performance...


Extraordinary! Is it going to be called Pied Piper?


Is there something like this for live video chat?


What does Johannes Ballé have to say about this?


You can do lossless neural compression too.


There's a really nice lecture series on data compression with deep probabalistic models https://www.youtube.com/playlist?list=PL05umP7R6ij0Mp1dW2HuX...


Didn't I do this last week?


The basic premise of these kinds of compression algorithms is actually pretty clever. Here's a very, very trivialized version of this style of approach:

1. both the compressor and decompressor contain knowledge beyond the algorithm used to compress/decompress some data

2. in this case the knowledge might be "all the images in the world"

3. when presented with an image, the compressor simply looks up some index or identifier of the image

4. the identifier is passed around as the "compressed image"

5. "decompression" means looking up the identifier and retrieving the image

I've heard this called "compression via database" before and it can give the appearance of defeating Shannon's theorem for compression even though it doesn't do that at all.
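
A toy version of steps 1-5 above, just to make the trick concrete (everything here is illustrative):

  import hashlib

  DATABASE = {}  # "all the images in the world", shared by compressor and decompressor

  def compress(image_bytes: bytes) -> bytes:
      key = hashlib.sha256(image_bytes).digest()
      DATABASE[key] = image_bytes  # the shared knowledge lives outside the "compressed" data
      return key                   # 32 bytes, no matter how big the image is

  def decompress(key: bytes) -> bytes:
      return DATABASE[key]         # "decompression" is just a lookup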

Of course the author's idea is significantly more sophisticated than the approach above, and trades a lossy approach for some gains in storage and retrieval efficiency (we don't have to have a copy of all of the pictures in the world in both the compressor and the decompressor). The evaluation note of not using any known image for the tests further challenges the approach and helps suss out where there are specific challenges like poor reconstruction of particular image constructs like faces or text -- I suspect that there are many other issues like these but the author homed in on these because we (as literate humans) are particularly sensitive to them.

In these types of lossy compression approaches (as opposed to the above which is lossless) the basic approach is:

1. Throw away data until you get to the desired file size. You usually want to come up with some clever scheme to decide what data you toss out. Alternatively, just hash the input data using some hash function that produces just the right number of bits you want, but use a scheme that results in a hash digest that can act as a (non-unique) index to the original image in a table of every image in the world.

2. For images it's usually easy to eliminate pixels (resolution) and color (bit-depth, channels, etc.). In this specific case, the author uses a variational autoencoder to "choose" what gets tossed. I suspect the autoencoder is very good at preserving information-rich, high-entropy, information-dense slices of a latent space or something. At any rate, this produces something that to us sorta kinda looks like a very low resolution, poorly colored postage stamp of the original image, but actually contains more data than that. I think at this point it can just be considered the hash digest.

3. this hash digest, or VAE encoded image or whatever we want to call it, is what's passed around as the "compressed" data.

4. just like above, "decompression" means effectively looking up the value in a "database". If we are working with hash digests, there was probably a collision during the construction of the database of all images, so we lost some information. In this case we're dealing with stable diffusion and instead of a simple index->table entry, our "compressed" VAE image wraps through some hyperspace to find the nearest preserved data. Since the VAE "pixels" probably align close to data dense areas of the space you tend to get back data that closely represents the original image. It's still a database lookup in that sense, but it's looking more for "similar" rather than "exact matches" which when used to rebuild the image give a good approximation of the original.
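
And a toy version of the lossy variant, where the "digest" is a crude thumbnail and "decompression" returns the nearest stored image rather than an exact match (again purely illustrative):

  import numpy as np

  CORPUS = []  # list of (thumbnail, full_image) pairs shared by both ends

  def thumbnail(img: np.ndarray, size: int = 8) -> np.ndarray:
      h, w = img.shape[:2]
      return img[::max(1, h // size), ::max(1, w // size)][:size, :size].astype(np.float32)

  def lossy_compress(img: np.ndarray) -> np.ndarray:
      return thumbnail(img)                    # a few hundred bytes stand in for the whole image

  def lossy_decompress(thumb: np.ndarray) -> np.ndarray:
      dists = [np.sum((t - thumb) ** 2) for t, _ in CORPUS]
      return CORPUS[int(np.argmin(dists))][1]  # "similar", not "exact" -- hence generally lossy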

Because it's an "approximation" it's "lossy". In fact I think it'd be more accurate to say it's "generally lossy" as there is a chance the original image can be reproduced exactly, especially if it's in the original training data. Which is why the author was careful not to use anything from that set.

Because we've stored so much information in the compressor and decompressor, it can also give the appearance of defeating the Shannon entropy limit for compression, except it doesn't, because:

a) it's generally lossy

b) just like the original example above we're cheating by simply storing lots of information elsewhere

There's probably some deep mathematical relationship between the author's approach and compressive sensing.

Still, it's useful, and has the possibility of improving data transmission speeds at the cost of storing lots of local data at both ends.

Source: Many years ago before deep learning was even a "thing", I worked briefly on some compression algorithms in an effort to reduce data transfer issues in telecom poor regions. One of our approaches was not too dissimilar to this -- throw away a bunch of the original data in a structured way and use a smart algorithm and some stored heuristics in the decompressor to guess what we threw away. Our scheme had the benefit of almost absolutely trivial "compression" with the downside of massive computational needs on the "decompression" side, but had lots of nice performance guarantees which you could use to design the data transport stuff around.

*edit* sorry if this explanation is confusing, it's been a while and it's also very late where I am. I just found this post really fun.


For people interested in more about this, it's probably worth reading the Hutter Prize FAQ: http://prize.hutter1.net/hfaq.htm



