However, a cautionary tale on AI medical image "denoising":
(and beyond, in science)
- See the artifacts?
In ambiguous areas of the image, the algorithm plugs in stuff it has seen before / was trained on.
So, if such a system were to "denoise" (or compress, which, if you think about it, is basically the same operation) CT scans, X-rays, MRIs, etc., in ambiguous areas it could plug in diseased tissue where the ground truth was actually healthy.
Or the opposite, which is even worse: substitute diseased areas of the scan with healthy looking imagery it had been trained on.
In recent publications that try to do "denoising" or resolution "enhancement" in medical imaging contexts, the authors seem completely oblivious to this pitfall.
(maybe they had a background as World Bank / IMF economists?)
Here's a similar case: a scanner using a traditional compression algorithm had a bug that made it replace a number in the scanned image with a different number.
That is completely outside all my expectations prior to reading it. The consequences are potentially life and death, or incarceration, etc, and yet they did nothing until called out and basically forced to act.
A good reminder that the bug can be anywhere, and when things stop working we often need to get very dumb, and just methodically troubleshoot.
We programmers tend to think our abstractions match reality somehow. Or that they don't leak. Or even if they do leak, that leakage won't spill down several layers of abstraction.
I used to install T1 lines a long time ago. One day we had a customer that complained that their T1 was dropping every afternoon. We ran tests on the line for extended periods of time trying to troubleshoot the problem. Every test passed. Not a single bit error no matter what test pattern we used.
We monitored it while they used it and saw not a single error, except for when the line completely dropped. We replaced the NIU card, no change.
Customer then hit us with, "it looks like it only happens when Jim VNCs to our remote server".
Obviously a userland program (VNC) could not possibly cause our NIU to reboot, right?? It's several layers "up the stack" from the physical equipment sending the DS1 signal over the copper.
But that's what it was. We reliably triggered the issue by running VNC on their network. We ended up changing the NIU and corresponding CO card to a different manufacturer (from Adtran to Soneplex I think?) to fix the issue. I wish I had had time to really dig into that one, because obviously other customers used VNC with no issues. Adtran was our typical setup. Nothing else was weird about this standard T1 install. But somehow the combination of our equipment, their networking gear, and that program on that workstation caused the local loop equipment to lose its mind.
This number-swapping story hit me the same way. We would all expect a compression bug to manifest as blurry text, or weird artifacts. We would never suspect a clean substitution of a meaningful symbol in what is "just a raster image".
I assume something like JPEG (used in the DICOM standard today) has more eyes on the code than proprietary Xerox stuff? Hopefully, at least...
I have seen weird artifacts on MRI scans, specifically from the FLAIR image enhancement algorithm used on T2 images, i.e. white spots, which could in theory be interpreted by a radiologist as small strokes or MS... so I always take what I see with a grain of salt.
The DICOM standard stuff did have a lot of eyes on it, and was tuned toward fidelity which helps. It's not perfect, but what is.
MRI artifacts though are a whole can of worms, but fundamentally most of them come from a combination of the EM physics involved, and the reconstruction algorithm needed to produce an image from the frequency data.
I'm not sure what you mean by "image enhancement algorithm"; FLAIR is a pulse sequence used to suppress certain fluid signals, typically used in spine and brain.
Many of the bright spots you see in FLAIR are due to B1 inhomogeneity, iirc (it's been a while though)
Probably worth mentioning also that "used in DICOM standard" is true but possibly misleading to someone unfamiliar with it.
DICOM is a vast standard. In its many crevasses, it contains wire and file encoding schemas, some of which include (many different) image data types, some of which allow (multiple) compression schemes, both lossy and lossless, as well as metadata schemes. These include JPEG, JPEG-LS, JPEG-2000, MPEG2/4, and HEVC.
I think you have to encode the compression ratio as well, if you do lossy compression. You definitely have to note that you did lossy compression.
IIRC, this was an issue (or conspiracy fuel, or whatever) with the birth certificate that Obama released: some of the unique elements in the scan repeated over and over.
People have been publishing fairly useless papers "for" medical imaging enhancement/improvement for 3+ decades now. NB this is not universal (there are some good ones) and not limited to AI techniques, although essentially every AI technique that comes along gets applied to compression/denoising/"superres"/etc. if it can, eventually.
The main problem is that typical imaging researchers are too far from actual clinical applications, and often trying to solve the wrong problems. It's a structural problem with academic and clinical incentives, as much as anything else.
> a bit of a danger of this method: One must not be fooled by the quality of the reconstructed features — the content may be affected by compression artifacts, even if it looks very clear
... plus an excellent image showing the algorithm straight making stuff up, so I suspect the author is aware.
In my experience, medical imaging at the diagnostic tier uses only lossless compression (JPEG2000 et al.). It was explicitly stated in our SOPs/policies that we had to have a lossless setup.
Very sketchy to use super-resolution for diagnostics. In research (fluorescence), sure.
ref: my direct experience of pathology slide scanning machines and their setup.
This has already happened. Pinch-to-zoom became a subject of controversy in the Kyle Rittenhouse trial because of a fear that it could introduce false details.
That is a pretty scary side effect and why AI in medicine needs to be handled carefully, especially anything with generative capability.
I do think, however, the primary application of these newer generation of generative models will be useful for more sophisticated data augmentation. For example, a lung nodule detection AI trained on a combination of synthetic and real lung nodules may perform better than one trained on real data alone. Do you think there are any other responsible applications?
I think this could be avoided by using some simple-stupid metric for measuring the difference between the compressed and uncompressed images and rejecting the compressed one if it's too different.
Maybe something like dividing the image into small squares and computing the L2 difference between original and compressed. If any square has too large an L2 difference, then compression failed. The metric might need to be informed by the domain.
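A minimal sketch of that kind of check, assuming plain numpy arrays and arbitrary block size / threshold values (the threshold is exactly the part that would need domain knowledge):

    import numpy as np

    def block_l2_ok(original, compressed, block=16, threshold=10.0):
        # Compare the images block by block; reject if any block drifts too far.
        h, w = original.shape[:2]
        for y in range(0, h, block):
            for x in range(0, w, block):
                a = original[y:y+block, x:x+block].astype(np.float64)
                b = compressed[y:y+block, x:x+block].astype(np.float64)
                # Root-mean-square (L2) difference for this block
                if np.sqrt(np.mean((a - b) ** 2)) > threshold:
                    return False  # compression failed, keep the original
        return True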
Something interesting about the San Francisco test image is that if you start to look into the details, it's clear that some real changes have been made to the city. Rather than losing texture or grain or clarity, the information lost here is information about the particular layout of a neighborhood of streets, which has now been replaced as if someone were drawing the scene from memory. A very different kind of loss, one that without the original might be imperceptible, because the information that was lost isn't replaced with random or systematic noise, but rather with new, structured information.
One thing that worries me about generative AI is the degradation of “truth” over time. AI will be the cheapest way to generate content, by far. It will sometimes get facts subtly wrong, and eventually that AI generated content will be used to train future models. Rinse and repeat.
We are getting closer and closer to a simulacrum and hyperreality.
We used to create things that were trying to simulate (reproduce) reality, but now we are using those "simulations" we'd created as if they were the real thing. With time we will be getting farther away from the "truth" (as you put it), and yes - I share your worry about that.
EDIT: A good example I heard that explains what a simulacrum is:
Ask a random person to draw a princess and see how many will draw a Disney princess (which was already based on real princesses) vs. how many will draw one looking like Catherine of Aragon or another real princess.
Yes indeed. I've been looking for an auto summarizer that reliably doesn't change the content. So far everything I've tried will make up or edit a key fact once in a while.
The nice thing about math is that often it's much harder to find a proof than to verify that proof. So math AI is allowed to make lots of dumb mistakes, we just want it to make the occasional real finding too.
Anywhere that truth matters will be unaffected. If such deviations from truth can take hold, then the truth never mattered. False assumptions will never hold where they can't, because reality is quite pervasive. Ask anyone who's had to productionize an ML model in a setting that requires a foot in reality. Even a single-digit drop in accuracy can have resounding effects.
The interesting thing is that in some ways this is a return to the pre-modern era of lossy information transmission between generations. Every story is re-molded by the re-teller. Languages change, and thus the contextual interpretations. Even something as seemingly static as a book gets slowly modified as scribes rewrite scrolls over centuries.
Certainly possible, though we also have many hundreds of millions of people walking the globe taking pictures of things with their phones (not all of which are public to be used for training, but still).
Invisibly changing the content rather than the image quality seems like a really concerning failure mode for image compression!
I wonder if it'd be possible to use SD as part of a lossless system: use SD as something that tells us the likelihood of various pixel values given the rest of the image, and combine that likelihood with Huffman encoding. Either way, fantastic hack, but we really should avoid using anything lossy built on AI for image compression.
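To get a feel for the idea: if a model assigns probability p to the pixel value that actually occurs, an ideal entropy coder spends about -log2(p) bits on it (Huffman gets close, arithmetic coding closer). A toy size estimate, with the probabilities standing in for whatever SD would output:

    import numpy as np

    def ideal_code_length_bits(probs_of_true_pixels):
        # probs_of_true_pixels[i] = model probability of the pixel value that
        # actually occurred at position i; the sum of -log2(p) is the best any
        # entropy coder can hope to achieve with this model.
        p = np.asarray(probs_of_true_pixels, dtype=np.float64)
        return float(np.sum(-np.log2(p)))

    # A model that is 90% sure about most pixels pays ~0.15 bits for each of them
    print(ideal_code_length_bits([0.9, 0.9, 0.5, 0.01]))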
Imagine a world where bandwidth constraints meant transmitting a hidden compressed representation that gets expanded locally by smart TVs that have pretrained weights baked into the OS. Everyone sees a slightly different reconstitution of the same input video. Firmware updates that push new weights to your TV result in stochastic changes to a movie you've watched before.
"The weather forecast was correct as broadcast, sir, it's just your smart TV thought it was more likely that the weather in your region would be warm on that day, so it adjusted the symbol and temperature accordingly"
I mean, that is already happening. Almost all modern TV's do some signal processing before outputting the pixels, and the image looks slightly different on each model.
But it'd definitely be cool to have some latent representation of a video that then gets rendered on tv - you could apply latent style sheets to the content, like what actors you want to play the roles, or turn everything into a steam-punk anime on the fly. The more abstract the representation, the more interesting alterations you could apply
You could still use some kind of adaptive Huffman coding. Current compression schemes have some kind of dictionary embedded in the file to map between the common strings and the compressed representation. Google proposed SDCH a few years ago, using a common dictionary for web pages. There isn't any reason why we can't be a bit more deterministic and share a much larger latent representation of "human visual comprehension" or whatever to do the same. It doesn't need to be stochastic once generated.
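zlib already supports this in miniature via preset dictionaries, which imposes the same contract a shared latent model would (both ends must hold the identical dictionary). A toy sketch, with a made-up dictionary:

    import zlib

    # Both sides agree on this dictionary out of band (SDCH/Brotli use large,
    # carefully built corpora; this one is just for illustration).
    shared = b"<html><head><title></title></head><body></body></html>"

    def compress(data: bytes) -> bytes:
        c = zlib.compressobj(zdict=shared)
        return c.compress(data) + c.flush()

    def decompress(blob: bytes) -> bytes:
        d = zlib.decompressobj(zdict=shared)
        return d.decompress(blob) + d.flush()

    page = b"<html><head><title>hi</title></head><body>hello</body></html>"
    assert decompress(compress(page)) == page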
True, but I'd like to continue using products that produce close-to-real images. Phones nowadays already process images a lot. The moment they start replacing pixels it'll all be fake.
And… Some manufacturer apparently already did it on their ultra zoom phones when taking photos of the moon.
Meh. Cameras have been "replacing pixels" for as long as I've been alive. Consider that a 4K camera only has 2k*4k pixels whereas a 4K screen has 2k*4k*3 subpixels.
2/3 of the image is just dreamed up by the ISP (image signal processor) when it debayers the raw image.
I'm not aware of any consumer hardware that has open source ISP firmware or claims to optimize for accuracy over beauty.
“When used in lossy mode, JBIG2 compression can potentially alter text in a way that's not discernible as corruption. This is in contrast to some other algorithms, which simply degrade into a blur, making the compression artifacts obvious.[14] Since JBIG2 tries to match up similar-looking symbols, the numbers "6" and "8" may get replaced, for example.
In 2013, various substitutions (including replacing "6" with "8") were reported to happen on many Xerox Workcentre photocopier and printer machines, where numbers printed on scanned (but not OCR-ed) documents could have potentially been altered. This has been demonstrated on construction blueprints and some tables of numbers; the potential impact of such substitution errors in documents such as medical prescriptions was briefly mentioned.”
There was a scandal when it was discovered that Xerox machines were doing this; in that case, the example showed "photocopies" replacing numbers in documents with other numbers.
During my PhD this issue came up amongst those in the group looking into compressed sensing in MRI. Many reconstruction methods (AI being a modern variant) work well because a best guess is visually plausible. These kinds of methods fall apart when visually plausible and "true" are different in a meaningful way. The simplest examples here being the numbers in scanned documents, or in the MRI case, areas of the brain where "normal brain tissue" was on average more plausible than "tumor".
> not the complete showstoppers some people seem to think that they are.
I don't know, if I had to second-guess every single result coming out of a machine, it would be a showstopper for me. This isn't Pokémon Go; tumor detection is a serious matter.
Why you would want to lossily compress any medical image is beyond me. You get equipment to make precise high-resolution measurements; it goes without saying that you do not want noise added to that.
In medical images, you don't record first and then compress later. Instead, you make sparse measurements and then reconstruct. Why? Because people move, so getting more frames/sec is a thing; you don't want people to stay for too long in the machine; and (ideally) with the same setup, you can focus on a smaller area and get a higher resolution than standard measurements too.
You are talking about compressed sensing which is not lossy compression (compressed sensing can be lossless unless you're dealing with noisy measurements).
But say you're doing noisy measurements, and you are under-measuring like you say, and you have to fabricate non-random, non-homogeneous reconstruction noise. In that case it would be a very good idea to produce, as they do for lossy compression, both the standard overall bit rate vs. PSNR characterization against alternate direct (non-sparse) measurement ground truths (which have to exist, or else the reconstruction method should be called into question), and the bit rate for each particular sparse measurement. That way people can see how reliable the reconstruction is. Ideally the image should be labeled at the pixel level with reconstruction probabilities, or presented in other ways that demonstrate the ratio of measured vs. fabricated information, like 95% confidence-interval extremal reconstructions or something.
It's not clear that community is doing this level of due diligence, so then the voices here are right: it's not a good idea to use.
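For reference, the PSNR characterization mentioned above is just a function of the mean squared error against the fully sampled ground truth; a minimal version for 8-bit images as numpy arrays:

    import numpy as np

    def psnr(ground_truth, reconstruction, peak=255.0):
        # Peak signal-to-noise ratio in dB; higher means the reconstruction
        # is closer to the directly measured ground truth.
        mse = np.mean((ground_truth.astype(np.float64) -
                       reconstruction.astype(np.float64)) ** 2)
        return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)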
If the compression is lossless that's fine. I have not seen an AI system being used in this manner but I don't doubt it's possible. All lossy compression methods output false information, that's the point of lossy compression and why it works so well. Remove details that the compression algorithm deems unimportant.
> The right amount of compression in a photocopy machine is zero.
This isn't an obvious statement to me. If you've had the misfortune of scanning documents to PDF and getting the 100MB per page files automatically emailed to you then you might see the benefit in all that white space being compressed somehow.
> But what does it mean to “be aware of” compression that may give you a crisp image of some made up document?
This isn't something I said. A good compression system for documents will not change characters in any circumstances.
If you are making an image of a cityscape to illustrate an article it probably doesn't matter what the city looks like. But if the article is about the architecture of the specific city, it probably does, so you need to 'be aware' that the image you are showing people isn't correct, and reduce the compression.
I would've thought anyone relying on lossy-compressed images of any sort already needs to be aware of the potential effects, or otherwise isn't really concerned by the effect on the image (and I'd guess that the vast majority of use cases actually don't care if parts of the image are essentially "imaginary")
2. It would be great to see the best codecs included in the comparison - AVIF and JPEG XL. Without those it's rather incomplete. No surprise that JPEG and WEBP totally fall apart at that bitrate.
3. A significant limitation of the approach seems to be that it targets extremely low bitrates where other codecs fall apart, but at these bitrates it incurs problems of its own (artifacts take the form of meaningful changes to the source image instead of blur or blocking, very high computational complexity for the decoder).
When only moderate compression is needed, codecs like JPEG XL already achieve very good results. This proof of concept focuses on the extreme case, but I wonder what would happen if you targeted much higher bitrates, say 5x higher than used here. I suspect (but have no evidence) that JPEG XL would improve in fidelity faster as you gave it more bits than this SD-based technique. Transparent compression, where the eye can't tell a visual difference between source and transcode (at least without zooming in) is the optimal case for JPEG XL. I wonder what sort of bitrate you'd need to provide that kind of guarantee with this technique.
The comparison doesn't make much sense because for fair comparisons you have to measure decompressor size plus encoded image size. The decompressor here is super huge because it includes the whole AI model. Also, everyone needs to have the exact same copy of the model in the decompressor for it to work reliably.
Only if decompressor and image are transmitted over the same channel at the same time, and you only have a small number of images. When compressing images for the web I don't care if a webp decompressor is smaller than a jpg or png decompressor, because the recipient already has all of those.
Of course stable diffusion's 4GB is much more extreme than Brotli's 120kb dictionary size, and would bloat a Browser's install size substantially. But for someone like Instagram or a Camera maker it could still make sense. Or imagine phones having the dictionary shipped in the OS to save just a couple kB on bad data connections.
Even if dictionaries were shipped, the biggest difficulty would be performance and resources. Most of these models require beefy compute and a large amount of VRAM that isn't likely to ever exist on end devices.
Unless that can be resolved it just doesn't make sense to use it as a (de)compressor.
There's something to be said about compression algorithms being predictable, deterministic, and only capable of introducing defects that stand out as compression artifacts.
Plus, decoding performance and power consumption matters, especially on mobile devices (which also happens be the setting where bandwidth gains are most meaningful).
While that is kind of true it is also sort of the point.
The optimal lossy compression algorithm would be based on humans as a target. It would remove details that we wouldn't notice in order to reduce the target size. If you show me a photo of a face in front of some grass, the optimal solution would likely be to reproduce that face in high detail but replace the grass with "stock imagery".
I guess it comes down to what is important. In the past, algorithms focused on visual perception, but maybe we are getting so good at convincingly removing unnecessary detail that we need to spend more time teaching the compressor which details are important. For example, if I know the person in the grass, preserving the face is important. If I don't know them, then it could be replaced by a stock face as well. Maybe the optimal compression of a crowd of people is the two faces of people I know preserved accurately and the rest replaced with "stock" faces.
Remember the Xerox scan-to-email scandal in which tiling compression was replacing numbers in structural drawings? We're talking about similar repercussions here.
This reminds me of a question I have about SD: why can’t it do a simple OCR pass to know those are characters, not random shapes? It’s baffling that neither SD nor DE2 has any understanding of the content they produce.
You could certainly apply a “duct tape” solution like that, but the issue is that neural networks were developed to replace what were previously entire solutions built on a “duct tape” collection of rule-based approaches (see the early attempts at image recognition). So it would be nice to solve the problem in a more general way.
> why can’t it do a simple OCR to know those are characters not random shapes?
It's pretty easy to add this if you wanted to.
But a better method would be to fine tune on a bunch of machine-generated images of words if you want your model to be good at generating characters. You'll need to consider which of the many Unicode character sets you want your model to specialize in though.
With compression you often make a prediction then delta off of it. A structurally garbled one could be discarded or just result in a worse baseline for the delta.
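That predict-then-delta structure looks roughly like the sketch below, using a trivial left-neighbour predictor; a model-based prediction would slot into the same place, and a garbled prediction only inflates the residuals rather than corrupting the output:

    import numpy as np

    def encode_row(row):
        # Residual against the previous pixel (essentially PNG's "Sub" filter).
        row = row.astype(np.int16)
        pred = np.concatenate(([0], row[:-1]))
        return ((row - pred) % 256).astype(np.uint8)

    def decode_row(residual):
        # Rebuild each pixel from the previously decoded one plus its residual.
        out, prev = np.zeros(len(residual), dtype=np.uint8), 0
        for i, r in enumerate(residual):
            prev = (prev + int(r)) % 256
            out[i] = prev
        return out

    row = np.array([10, 12, 13, 13, 200, 201], dtype=np.uint8)
    assert np.array_equal(decode_row(encode_row(row)), row)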
I was told (on the Unstable Diffusion discord, so this info might not be reliable) that even with using the same seed the results will differ if the model is running on a different GPU. This was also my experience when I couldn't reproduce the results generated by the discord's SD txt2img generating bot.
I'm not sure about the different GPU issue. But if that is an issue, the model can be made deterministic (probably compromising inference speed), by making sure the calculations are computed deterministically.
> To evaluate this experimental compression codec, I didn’t use any of the standard test images or images found online in order to ensure that I’m not testing it on any data that might have been used in the training set of the Stable Diffusion model (because such images might get an unfair compression advantage, since part of their data might already be encoded in the trained model).
I think it would be very interesting to determine if these images do come back with notably better compression.
I wonder if this technique could be called something like “abstraction” rather than “compression” given it will actually change information rather than its quality.
Ie. “There’s a neighbourhood here” is more of an abstraction than “here’s this exact neighbourhood with the correct layout just fuzzy or noisy.”
I would say any compression is abstraction in a certain sense. A simple example is a gradient. A lossy compressor might abstract over the precise pixel values and simply record a gradient that almost matches the raw input. You could even make the argument that lossless compression is abstraction. A 2D grid with 5px lines and 50px spacing between them could feasibly be captured really well using a classical compression scheme.
What AI offers is just a more powerful and opaque way of doing the same thing.
Well, a MIDI file says nothing about the sound a Trumpet makes, whereas this SD-based abstraction does give a general idea of what your neighborhood should look like.
Doesn't decompression require the entire Stable Diffusion model? (and the exact same model at that)
This could be interesting but I'm wondering if the compression size is more a result of the benefit of what is essentially a massive offline dictionary built into the decoder vs some intrinsic benefit to processing the image in latent space based on the information in the image alone.
That said... I suppose it's actually quite hard to implement a "standard image dictionary" and this could be a good way to do that.
Haha. Here’s a faster compression model. Make a database of every image ever made. Compute a thumbprint and use that as the index of the database. Boom!
A quick Google says there are 10^72 to 10^82 atoms in the universe.
Assuming 24-bit color, an image of only 60 pixels already has 2^1440 (roughly 10^433) possible values, so even if every atom stored its own unique 60-pixel image you would barely scratch the surface.
I'd love to see a series of increasingly compressed images, say 8kb -> 4kb -> 2kb -> ... -> 2bits -> 1bit. This would be a great way to demonstrate the increasing fictionalization of the method's recall.
Yes please. That would actually be an incredible blog post.
It also makes me wonder, if dealing with 8 bits, what would the 256 resulting images look like? It feels like it would be an eye into this "brain", what it considers to be the basic building blocks?
This is why for compression tests, they incorporate the size of everything needed to decompress the file. You can compress down to 4.97KB all you want, just include the 4GB trained model.
Do you also include the library to render a jpeg? And maybe the whole OS required to display it on your screen?
There are very many uses where any fixed overhead is meaningless. Imagine archiving billions of images for long term storage. The 4GB model quickly becomes meaningless.
Yes, but each image needs access to this 4GB (actually, I have no idea how much RAM it takes up), plus whatever the working set size is. It is a non-trivial overhead that really limits the throughput of your system, so you can process fewer images in parallel, and compressing billions of images in a reasonable time may suddenly cost much more than the amount of storage it would save, compared to other methods.
Is that true? I have never seen this done for any image compression comparisons that I have seen (i.e. only data that is specific to the image that is being compressed is included, not standard tables that are always used by the algorithm like the quantisation tables used in JPG compression)
However, several people here are conflating "best compression as determined for a competition" and "best compression for use in the real world". There is an important relationship between them, absolutely, but in the real world we do not download custom decoders for every bit of compressed content. Just because there is a competition that quite correctly measures the entire size of the decompressor and encoded content does not mean that is now the only valid metric to measure decompression performance. The competitions use that metric for good and valid reasons, but those good and valid reasons are only vaguely correlated to the issues faced in the normal world.
(Among the reasons why competitions must include the size of the decoder is that without that the answer is trivial; I define all your test inputs as a simple enumeration of them and my decoder hard-codes the output as the test values. This is trivially the optimal algorithm, making competition useless. If you could have a real-world encoder that worked this well, and had the storage to implement it, it would be optimal, but you can't possibly store all possible messages. For a humorous demonstration of this encoding method, see the classic joke: https://onemansblog.com/2010/05/18/prison-joke/ )
So a compressor of a few gigabytes would make sense if you have a set of pictures larger than a few gigabytes. It's a bit similar to preprocessing text compression with a dictionary and adding the dictionary to the extractor to squeeze out a few more bytes.
The vae used in stable diffusion is not ideal for compression. I think it would be better to use the vector-quantized variant (by the same authors of latent diffusion) instead of the KL variant, then store the indexes for each quantized vector using standard entropy coding algorithms.
From the paper the VQ variant also performs better overall, SD may have chosen the KL variant only to lower vram use.
just checked the paper again and yes you're right, the KL version is better on the openimages dataset. The VQ version is better in the inpainting comparison.
In this case you'd still want to use the VQ version though, it doesn't make sense to do an 8bit quantization on the KL vectors when there's an existing quantization learned through training.
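Roughly what "store the indexes for each quantized vector" could look like, with a random matrix standing in for the learned VQ codebook and zlib standing in for a proper entropy coder:

    import numpy as np
    import zlib

    def vq_encode(latents, codebook):
        # latents: (N, D) vectors; codebook: (K, D) learned entries.
        # Store only the index of the nearest codebook entry per vector.
        d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = d.argmin(axis=1).astype(np.uint16)
        return zlib.compress(idx.tobytes())

    def vq_decode(blob, codebook):
        idx = np.frombuffer(zlib.decompress(blob), dtype=np.uint16)
        return codebook[idx]

    codebook = np.random.randn(1024, 4).astype(np.float32)  # stand-in codebook
    latents = np.random.randn(48 * 64, 4).astype(np.float32)
    approx = vq_decode(vq_encode(latents, codebook), codebook)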
The one with the different buildings in the reconstructed image is a bit spooky. I've always argued that human memory is highly compressed, storing, for older memories anyway, a "vibe" plus pointers to relevant experiences/details that can be used to flesh it out as needed. Details may be wrong in the recollecting/retelling, but the "feel" is right.
And here we have computers doing the same thing! Reconstructing an image from a highly compressed memory and filling in appropriate, if not necessarily exact details. Human eye looks at it casually and yeah, that's it, that's how I remember it. Except that not all the details are right.
Which is one of those "Whoa!" moments, like many many years ago, when I wrote a "Connect 4" implementation in BASIC on the Commodore 64, played it and lost! How did the machine get so smart all of a sudden?
In theory, it would be possible to benefit from the ability of Stable Diffusion to increase perceived image quality without even using a new compression format. We could just enhance existing JPG images in the browser.
There already are client side algorithms that increase the quality of JPGs a lot. For some reason, they are not used in browsers yet.
A Stable Diffusion based enhancement would probably be much nicer in most cases.
There might be an interesting race to do client side image enhancements coming to the browsers over the next years.
Great idea to use Stable Diffusion for image compression. There are deep links between machine learning and data compression (which I’m sure the author is aware of).
If you could compute the true conditional Kolmogorov complexity of an image or video file given all visual online media as the prior, I imagine you would obtain mind-blowing compression ratios.
People complain of the biased artifacts that appear when using neural networks for compression, but I’m not concerned in the long term. The ability to extract algorithmic redundancy from images using neural networks is obviously on its way to outclassing manually crafted approaches, and it’s just a matter of time before we are able to tack on a debiasing step to the process (such that the distribution of error between the reconstructed image and the ground truth has certain nice properties).
One interesting feature of ML-based image encoders is that it might be hard to evaluate them with standard benchmarks, because those are likely to be part of the training set, simply by virtue of being scraped from the web. How many copies of Lenna has Stable Diffusion been trained with? It’s on so many websites.
We might enter a time when every time a new model/compression algo is introduced, a new series of benchmark images may need to be introduced/taken and ALL historical benchmarks of major compression algos redone on the new images.
What they do is essentially fractal compression with an external library of patterns (that was IIRC patented, but the patent should be long expired).
"Consciousness" is a pretty useless word without being very carefully defined, because people use it to mean a variety of different things. And often in the most ambiguous way possible such as this comment.
But also often some related but very specific and different things such as the reply that assumes it means only "self-awareness".
To me, the main purpose of the word is to prove the insufficiency of language and how imprecise most people's thinking is.
But I do like the Stephen Wolfram idea of consciousness being the way a computationally bounded observer develops a coherent view of a branching universe.
This is related to compression because it is a (lossy!) reduction in information.
I understand that Wolfram is controversial, but the information-transmission-centric view of reality he works with makes a lot of intuitive sense to me.
This but for video using the "infilling" version for changing parts between frames.
The structural changes per frame matter much less. Send a 5kB image every keyframe then bytes per subsequent image with a sketch of the changes and where to mask them on the frame.
Modern video codecs are pretty amazing though, so not sure how it would compare in frame size
I've been thinking about more or less the same idea, but the computational edge inference costs probably makes it impractical for most of today's client devices. I see a lot of potential in this direction in the near future though.
I think it's unclear how much computational resources the uncompression steps take.
At the moment it's fairly fast, but RAM hungry. But this article makes it clear that quantizing the representation works well (at least for the VAE). It's possible quantized models could also do decent jobs.
I am currently also playing around with this. The best part is that for storage you don't need to store the reconstructed image, just the latent representation and the VAE decoder (which can do the reconstructing later). So you can store the image as relatively few numbers in a database. In my experiment I was able to compress a (512, 384, 3) RGB image to (48, 64, 4) floats. In terms of memory it was a 8x reduction.
However, on some images the artefacts are terrible. It does not work as a general-purpose lossy compressor unless you don't care about details.
The main obstacle is compute. The model is quite large, but HDDs are cheap. The real problem is that reconstruction requires a GPU with lots of VRAM. Even with a GPU it's 15 seconds to reconstruct an image in Google Colab. You could do it on a CPU, but then it's extremely slow. This is only viable if compute costs go down a lot.
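A minimal sketch of that setup using the diffusers library; the model id and the [-1, 1] preprocessing below are illustrative assumptions rather than the article's exact pipeline:

    import numpy as np
    import torch
    from PIL import Image
    from diffusers import AutoencoderKL

    vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()

    img = Image.open("photo.png").convert("RGB").resize((512, 384))
    x = torch.from_numpy(np.array(img)).float().permute(2, 0, 1)[None] / 127.5 - 1.0

    with torch.no_grad():
        latents = vae.encode(x).latent_dist.mode()  # (1, 4, 48, 64): all you store
        recon = vae.decode(latents).sample          # (1, 3, 384, 512): rebuilt later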
From the title, I expected this to be basically pairing stable diffusion with an image captioning algorithm by 'compressing' the image to a simple human readable description, and then regenerating a comparable image from the text. I imagine that would work and be possible, essentially an autoencoder with a 'latent space' of single short human readable sentences.
The way this actually works is pretty impressive. I wonder if it could be made lossless or less lossy in a similar manner to FLAC and/or video compression algorithms... basically first do the compression, and then add on a correction that converts the result partially or completely into the true image. Essentially, e.g. encoding real images of the most egregiously modified regions of the photo and putting them back over the result.
It definitely can be made lossless: all you need to do is a compress/decompress roundtrip, and then save the resulting difference from the ground truth in a lossless image format like PNG, QOI or lossless JXL. The final size would be the lossy compression plus the difference image. This is of course the least sophisticated approach, but who knows, it might compare pretty well with "plain" lossless formats.
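A minimal sketch of that correction step, assuming uint8 arrays for the original and for the lossy round-trip; because the difference is wrapped mod 256 it fits in an ordinary 8-bit image, and the reconstruction is bit-exact:

    import numpy as np

    def residual(original, lossy_decoded):
        # Wrapped difference, storable losslessly as a normal 8-bit image (PNG etc.)
        return ((original.astype(np.int16) - lossy_decoded.astype(np.int16)) % 256).astype(np.uint8)

    def restore(lossy_decoded, res):
        # Adding the residual back (mod 256) returns the exact original pixels.
        return ((lossy_decoded.astype(np.int16) + res.astype(np.int16)) % 256).astype(np.uint8)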
It reminded me of a scene from "A Fire Upon the Deep" where connection bitrate is abysmal, but the video is crisp and realistic. It is used as a tool for deception, as it happens. Invisible information loss has its costs.
It is really interesting to talk about semantic lossy compression, which is probably what we get.
Where recreating with traditional codecs introduces syntactic noise, this will introduce semantic noise.
Imagine seeing a high res perfect picture, just until you see the source image and discover that it was reinterpreted..
It is also going to be interesting, to see if this method will be chosen for specific pictures, eg. pictures of celebrity objects (or people, when/if issues around that resolve), but for novel things, we need to use "syntactical" compression.
Before I clicked through to the article, I thought maybe they were taking an image and spitting out a prompt that would produce an image substantially similar to the original.
> Quantizing the latents from floating point to 8-bit unsigned integers by scaling, clamping and then remapping them results in only very little visible reconstruction error.
This might actually be interesting/important for the OpenVINO adaptation of SD ... from what I gathered from the OpenVINO documentation, quantizing is a big part of optimizing, as it allows the use of Intel's new(-ish) NN instruction sets.
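The article doesn't spell out the exact ranges, but the scale/clamp/remap step it describes is presumably something along these lines (the clamp value here is a guess):

    import numpy as np

    def quantize_latents(latents, clip=5.0):
        # Clamp to [-clip, clip], then map linearly onto 0..255.
        q = (np.clip(latents, -clip, clip) + clip) / (2 * clip) * 255.0
        return np.round(q).astype(np.uint8)

    def dequantize_latents(q, clip=5.0):
        return q.astype(np.float32) / 255.0 * (2 * clip) - clip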
While this is great as an experiment, before you jump into practical applications, it is worth remembering that the decompressor is roughly 5GB in size :-)
I believe ML techniques are the future of video/image compression. When you read a well written novel, you can kind of construct images of characters, locations and scenes in your mind. You can even draw these scenes, and if you're a good artist, those won't have any artifacts.
I don't expect future codecs to be able to reduce a movie to a simple text stream, but maybe it could do something in the same vein. Store abstract descriptions instead of bitmaps. If the encoding and decoding are good enough, your phone could reconstruct an image that closely resembles what the camera recorded. If your phone has to store a 50Gb model for that, it doesn't seem too bad, especially if the movie file could be measured in tens of megabytes.
Or it could go in another direction, where file sizes remain in the gigabytes, but quality jumps to extremely crisp 8k that you can zoom into or move the camera around if you want.
I would call this "confabulation" more than compression.
Its accuracy is proportional to and bounded by the training data; I suspect in practice it's got a specific strength (filling in fungible detail) and, as discussed ITT with fascinating and gnarly corners, some specific failure modes which are going to lead to bad outcomes.
At least with "lossy" CODECs of various kinds, even if you don't attend to absence until you do an A/B comparison, you can perceive the difference when you do do those comparisons.
In this case the serious peril is that an A/B comparison is [soon] going to just show difference. "What... is... the Real?"
When you contemplate that an ever-increasing proportion of the training data itself stems from AI- or otherwise-enhanced imagery, our hold on the real has never felt weaker, and our vulnerability to the rewriting of reality has never felt more present.
This kind of already fits a little bit with how the brain processes images where information is lacking. Neurocognitive specialists can likely correct me on the following.
Glaucoma is a disease where one slowly loses peripheral vision, until a small central island remains or you go completely blind.
So do patients perceive black peripheral vision? Or blurred peripheral vision?
Not really…patients actually make up the surrounding peripheral vision, sometimes with objects!
I heard Stable Diffusion's model is just 4 GB. It's incredible that billions of images could be squeezed in just 4 GB. Sure it's lossy compression but still.
I don't think that thinking of it as "compression" is useful, any more than an artist recreating the Mona Lisa from memory is "decompressing" it. The process that diffusion models use is fundamentally different from decompression.
For example, if you prompt Stable Diffusion with "Mona Lisa" and look at the iterations, it is clearer what is happening - it's not decompressing so much as drawing something it knows looks like Mona Lisa and then iterating to make it look clearer and clearer.
It clearly "knows" what the Mona Lisa looks like, but what is is doing isn't copying it - it's more like recreating a thing that looks like it.
(And yes I realize lots of artist on Twitter are complaining that it is copying their work. I think "forgery" is a better analogy than "stealing" though - it can create art that looks like a Picasso or whatever, but it isn't copying it in a conventional sense)
I think it's easy to explain. If we split all those images into small 8x8 chunks, and put all the chunks into a fuzzy and a bit lossy hashtable, we'll see that many chunks are very similar and can be merged into one. To address this "space of 8x8 chunks" we'll apply PCA to them, just like in jpeg, and use only the top most significant components of the PCA vectors.
So in essence, this SD model is like an Alexandria library of visual elements, arranged on multidimensional shelves.
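That 8x8-chunks-plus-PCA intuition is easy to play with directly; a rough sketch on a grayscale numpy image (the patch size and the number of kept components here are arbitrary choices):

    import numpy as np

    def patch_pca_recon(gray, patch=8, k=10):
        # Slice the image into patch x patch chunks, keep only the top-k
        # principal components, and rebuild each chunk from its k coefficients.
        h, w = gray.shape
        tiles = (gray[:h - h % patch, :w - w % patch]
                 .reshape(h // patch, patch, w // patch, patch)
                 .transpose(0, 2, 1, 3)
                 .reshape(-1, patch * patch)
                 .astype(np.float64))
        mean = tiles.mean(axis=0)
        _, _, vt = np.linalg.svd(tiles - mean, full_matrices=False)
        basis = vt[:k]                        # the "most significant shelves"
        coeffs = (tiles - mean) @ basis.T     # each chunk reduced to k numbers
        return coeffs, coeffs @ basis + mean  # lossy reconstruction of the chunks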
If it’s a VAE then the latents should really be distributions, usually represented as the mean and variance of a normal distribution. If so then it should be possible to use the variance to determine to what precision a particular latent needs to be encoded. Could perhaps help increase the compression further.
Each image is represented by its own distribution over the latents. So the encoder needs the ability to specify some latents very accurately and others more loosely, you could say.
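As a sketch of the idea (not anything from the article): quantize each latent with a step size tied to its posterior standard deviation, so confident latents get fine steps and uncertain ones coarse steps; the per-latent steps would of course also have to be known to, or derivable by, the decoder:

    import numpy as np

    def variance_aware_quantize(mu, sigma, base_step=0.05):
        # Larger posterior sigma -> the latent tolerates a coarser grid.
        step = base_step * np.maximum(sigma, 1e-3)
        return np.round(mu / step).astype(np.int32), step

    def variance_aware_dequantize(q, step):
        return q * step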
What if I just want something pretty similar but not necessarily the exact image. Maybe there could be a way to find a somewhat similar text prompt as a starting point, and then add in some compressed information to adjust the prompt output to be just a bit closer to the original?
I can imagine some uses for this. Imagine having to archive a massive dataset where it’s unlikely any individual image will be retrieved and where perfect accuracy isn’t required.
In the future you can have full 16k movies representing only 1.44mb seeds. A giant 500 petabyte trained model file can run those movies. You can even generate your own movie by uploading a book.
This is not really "stable-diffusion based image compression", since it only uses the VAE part of "stable diffusion", and not the denoising UNet.
Technically, this is simply "VAE-based image compression" (that uses stable diffusion v1.4's pretrained variational autoencoder) that takes the VAE representations and quantizes them.
(Note: not saying this is not interesting or useful; just that it's not what it says on the label)
Using the "denoising UNet" would make the method more computationally expensive, but probably even better (e.g., you can quantize the internal VAE representations more aggressively, since the denoising step might be able to recover the original data anyway).
It does use the UNet to denoise the VAE compressed image:
"The dithering of the palettized latents has introduced noise, which distorts the decoded result. But since Stable Diffusion is based on de-noising of latents, we can use the U-Net to remove the noise introduced by the dithering."
The included Colab doesn't have line numbers, but you can see the code doing it:
# Use Stable Diffusion U-Net to de-noise the dithered latents
latents = denoise(latents)
denoised_img = to_img(latents)
display(denoised_img)
del latents
print('VAE decoding of de-noised dithered 8-bit latents')
print('size: {}b = {}kB'.format(sd_bytes, sd_bytes/1024.0))
print_metrics(gt_img, denoised_img)
Hm, it would be interesting to see if any of the perceptual image compression quality metrics could be inserted into the VAE step to improve quality and performance...
The basic premise of these kinds of compression algorithms is actually pretty clever. Here's a very, very trivialized version of this style of approach:
1. both the compressor and decompressor contain knowledge beyond the algorithm used to compress/decompress some data
2. in this case the knowledge might be "all the images in the world"
3. when presented with an image, the compressor simply looks up some index or identifier of the image
4. the identifier is passed around as the "compressed image"
5. "decompression" means looking up the identifier and retrieving the image
I've heard this called "compression via database" before, and it can give the appearance of defeating Shannon's theorem for compression even though it doesn't do that at all.
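In toy form (with made-up image bytes), the whole scheme is just a shared lookup table:

    # Both ends ship with the full "database of every image"; the compressed
    # file is merely an index into it, so no information-theoretic limit is beaten.
    database = {0: b"<bytes of image A>", 1: b"<bytes of image B>"}
    reverse = {v: k for k, v in database.items()}

    def compress(image_bytes):
        return reverse[image_bytes]   # a tiny integer

    def decompress(index):
        return database[index]        # the full original, bit-exact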
Of course the author's idea is significantly more sophisticated than the approach above, and trades a lossy approach for some gains in storage and retrieval efficiency (we don't have to have a copy of all of the pictures in the world in both the compressor and the decompressor). The evaluation note about not using any known image for the tests further challenges the approach and helps suss out specific problems, like poor reconstruction of particular image constructs such as faces or text. I suspect there are many other issues like these, but the author homed in on these because we (as literate humans) are particularly sensitive to them.
In these types of lossy compression approaches (as opposed to the above which is lossless) the basic approach is:
1. Throw away data until you get to the desired file size. You usually want to come up with some clever scheme to decide what data you toss out. Alternatively, just hash the input data using some hash function that produces just the right number of bits you want, but use a scheme that results in a hash digest that can act as a (non-unique) index to the original image in a table of every image in the world.
2. For images it's usually easy to eliminate pixels (resolution) and color (bit depth, channels, etc.). In this specific case, the author uses a variational autoencoder to "choose" what gets tossed. I suspect the autoencoder is very good at preserving the information-rich, high-entropy slices of the latent space, or something like that. At any rate, this produces something that to us sorta kinda looks like a very low resolution, poorly colored postage stamp of the original image, but actually contains more data than that. I think at this point it can just be considered the hash digest.
3. this hash digest, or VAE encoded image or whatever we want to call it, is what's passed around as the "compressed" data.
4. just like above, "decompression" means effectively looking up the value in a "database". If we are working with hash digests, there was probably a collision during the construction of the database of all images, so we lost some information. In this case we're dealing with stable diffusion and instead of a simple index->table entry, our "compressed" VAE image wraps through some hyperspace to find the nearest preserved data. Since the VAE "pixels" probably align close to data dense areas of the space you tend to get back data that closely represents the original image. It's still a database lookup in that sense, but it's looking more for "similar" rather than "exact matches" which when used to rebuild the image give a good approximation of the original.
Because it's an "approximation" it's "lossy". In fact I think it'd be more accurate to say it's "generally lossy" as there is a chance the original image can be reproduced exactly, especially if it's in the original training data. Which is why the author was careful not to use anything from that set.
Because we've stored so much information in the compressor and decompressor, it can also give the appearance of defeating Shannon entropy for compression except it's also not because:
a) it's generally lossy
b) just like the original example above we're cheating by simply storing lots of information elsewhere
There's probably some deep mathematical relationship between the author's approach and compressive sensing.
Still, it's useful, and has the possibility of improving data transmission speeds at the cost of storing lots of local data at both ends.
Source: Many years ago before deep learning was even a "thing", I worked briefly on some compression algorithms in an effort to reduce data transfer issues in telecom poor regions. One of our approaches was not too dissimilar to this -- throw away a bunch of the original data in a structured way and use a smart algorithm and some stored heuristics in the decompressor to guess what we threw away. Our scheme had the benefit of almost absolutely trivial "compression" with the downside of massive computational needs on the "decompression" side, but had lots of nice performance guarantees which you could use to design the data transport stuff around.
*edit* sorry if this explanation is confusing, it's been a while and it's also very late where I am. I just found this post really fun.