MP3 for Image Compression (2006) (emphy.de)
344 points by joshumax 242 days ago | 96 comments

I just posted the article[1] that I think prompted this script to be brought up.

What I did instead was to run images through an audio editing tool, which lets you apply echoes or do mind-boggling things like change the volume of the image. The script can be found on GitHub[2].

[1] http://memcpy.io/audio-editing-images.html

[2] https://github.com/robertfoss/audio_shop/

Nice script :) I did a similar thing by hand with graphicsmagick and sox a couple years ago - the most fun result was from wahwah, pretty much a wide bandpass filter with centre frequency varying cyclically over time: http://lewisinthelandofmachines.tumblr.com/post/59405537096/...

Thank you!

I haven't even tried the wahwah effect, you've definitely piqued my interest.

I had a look through the SoX effects list[1] for the wahwah, but can't seem to find it. The closest thing I found was the 'flanger' effect[2].

[1] http://sox.sourceforge.net/sox.html#EFFECTS

[2] https://allg.one/zKMP

Shamefully I can't remember which wahwah I used - I was treating the audio with both Audacity and Pro Tools as well as sox though...

There's no description for each image. In which one did you change the volume?

None, but I did here[1]. Also, on a related note all image names include the effect that was used.

[1] https://allg.one/1QbX

Nice thanks. Seems like it mostly reduces the number of colors?

Also like how clearly you can see the overdrive introducing noise into the previously pure blue sky at the top

It's more like increasing the contrast, which has the effect of reducing the dynamic range in favor of extreme values.
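To make the "volume as contrast" point concrete, here is a minimal sketch. It assumes the pixels are handled as unsigned 8-bit samples centred on 128 and that the gain clips rather than wraps; `apply_gain` and the sample row are made up for illustration and are not taken from the script.

```python
def apply_gain(pixels, gain, midpoint=128):
    """Scale 8-bit values around a midpoint and clip them to 0..255."""
    out = []
    for p in pixels:
        scaled = midpoint + (p - midpoint) * gain
        # Clipping is what pushes values to the extremes (pure black/white).
        out.append(max(0, min(255, round(scaled))))
    return out


row = [0, 64, 120, 128, 136, 192, 255]   # hypothetical grayscale row
louder = apply_gain(row, 4.0)
print(louder)
```

With a gain of 4 only the values close to the midpoint survive; everything else saturates, which is the loss of dynamic range described above.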

The obvious question here is: what does JPEG encoded music sound like?

Horrible. I actually tried applying image compression to sounds over a decade ago (was still in school, did that for Jugend forscht). I used a Fourier transform to generate image sections of a specific window of samples and used that. In hindsight, perhaps just treating the waveform as image data may have been better. The idea behind what I did was to see whether a spectrogram can be compressed well, considering that there's lots of repetition and blank space usually.

In any case, JPEG murders pretty much everything in that representation.

It kind of reads like you have discredited frequency domain compression algorithms. FYI JPEG uses a block-based DCT which is very similar to a Fourier transform.

That was not my intent. I just noted that my approach came from visually inspecting spectrograms and noting that they might be compressed quite well by image compression algorithms. Only to notice in the end that lossy image compression completely butchers the signal with noise if doing that. Had I just used the original waveform as raw image data that wouldn't have been as much of a problem. Compression would still have changed the samples, but overall probably not as badly.

Heck, I was 17 when I did that. I certainly didn't know much about the theory behind it or why it would almost certainly fail. But I guess by now I wouldn't even attempt such craziness, so in a way it's probably a good thing to not know things too well, sometimes.

The problem is that we are much more sensitive to audio distortion than visual, particularly dynamic (e.g. video). Pretty much all lossy audio compression does a bad job without a psycho-acoustic model, whereas image compression works pretty well on simple metrics. It doesn't help that psycho-visual modelling is less well understood than its audio counterpart.

I'm more curious about a different idea. If you hook up electrodes to a tongue and then start activating its sensors according to the amount of light being received, people quickly develop the ability to see via their tongue.

So what happens when you encode visual information into sound in such a way as to have a sound-to-environment signal and you just keep that active and running real time. We already know that this mechanism exists in nature and can even be developed in humans, since echo-location is a thing.

Edit: It occurs to me that actually doing sensory experiments like this might really mess with someone's head... maybe not a good idea to play around with it casually.

Edit 2: If we were going to swap out one of our senses for another, which should we get rid of and which should we gain?

Actually, there are (and have been for a few years already) glasses that scan the scene in front of you and translate it into sound, which you can learn to interpret as an image with a bit of training[0]. Some people have gotten used to them, while others have gone the way of the bat and learned to echolocate[1].

[0] http://gizmodo.com/these-synthesia-glasses-help-blind-people...

[1] https://en.wikipedia.org/wiki/Human_echolocation

Here's an Android app that converts images from the phone's camera to sounds as a tool for the blind:


Your comment reminded me of the TED talk "How a blind astronomer found a way to hear the stars"[1].

[1] https://www.ted.com/talks/wanda_diaz_merced_how_a_blind_astr...

It's been done and it works. I believe they just map the image to a spectrogram and convert it to audio: https://www.seeingwithsound.com/

Lots of noise. https://milek7.pl/.stuff/jpegcompress.flac (24bit PCM as 10000px wide RGB image compressed using JPEG quality 85, ~50% of uncompressed size)

Similar results - converting to a square image at 8bit then lowering JPEG quality just gets noisy in an uninteresting way:

    sox mo.wav -e unsigned -b 8 -c 1 -r 48k mo.raw
    bytes=`stat -f %z mo.raw`
    width=`echo sqrt\($bytes\) | bc`
    square_bytes=`echo $width \* $width | bc`
    dd if=mo.raw of=mo_square.raw bs=$square_bytes count=1
    gm convert -depth 8 -size ${width}x${width} gray:mo_square.raw -quality 50 mo_square.jpg
    gm convert mo_square.jpg gray:mo_square_jpg.raw
    sox -e unsigned -b 8 -c 1 -r 48k -t raw mo_square_jpg.raw mo_jpg.wav
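The `bc`/`dd` steps in the middle just trim the raw samples to the largest perfect square so they fit a width x width image. A rough Python equivalent of only that step, assuming one byte per sample (`square_crop` is a name made up for this sketch):

```python
import math


def square_crop(samples: bytes):
    """Return (width, data) with data cut to width * width bytes."""
    width = math.isqrt(len(samples))      # integer square root, like bc with scale 0
    return width, samples[: width * width]


width, square = square_crop(bytes(1000))  # 1000 dummy samples
# Split the square into image rows, one row per `width` samples.
rows = [square[i * width:(i + 1) * width] for i in range(width)]
print(width, len(square), len(rows))
```

The leftover samples past the square are simply dropped, same as in the shell version.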

Sounds like a weak radio signal! I wonder if we could pass it through a deringing filter to remove some of that noise, eg, like these guys have done: http://www.cs.rpi.edu/~oztanb/Papers/spie07paper.pdf

Not that bad, actually.

Just tried converting the sound to 8 bit, compressing it as a monochrome image, then back.

You can't really hear much difference at higher quality settings, and as you turn it down it sounds like there is more and more digital-ish noise. I had to turn the compression way up before the voice became almost unintelligible, but you can still hear that it's someone talking.

I'd be curious what you consider garbage audio, as compared to 'not that bad.'

It's garbage when you need to switch to CW to continue the conversation. :)

Seriously, when it's unpleasant to listen to. As I said in the other reply, with music the quality is pretty bad on almost any setting, voice just happens to be an exception.

Tried again with music...

Even at high quality, there is a background sheesh kind of noise, similar to what you get if you play a 3D printed record.

On higher compression it's turning into something rather crumbly, with a recognizable tune but horrible sound quality.

So, yeah. Not quite usable.

> similar to what you get if you play a 3D printed record

I guess I'm behind the times, not relating to that reference...

Are you trying to fund a GIF music player?

There's a whole subreddit for things like this. https://www.reddit.com/r/glitch_art/

Now I wonder what happens if you add a sound effect to change the pitch or tone, or to widen the sound stage. What would that do to the decoded image?

I think the 2.00 bits/pixel result looks quite a bit more "analog" to me, with a film grain effect.

I'm kinda wondering if this post was because of that, because I found the link in question and posted it there. It's kinda neat to see the possible circuitous route by which something ends up somewhere else.

So you were the person who posted that link ;)

Awesome, thanks!

I did this with Audacity to some bmp images. I got some interesting results. I just read in the bmp files as raw data and manipulated them that way.

I did similar a long time ago with zip compressed TIFs and JPGs. Added reverb and the results were really unusual: https://vimeo.com/105317804

I really wanted salad fingers to pop out of my screen.

Thanks for that.

Reminds me that one could (can?) pipe random data to /dev/dsp and get it interpreted as sound data. Heck, you could pipe a wav file and actually hear it.

Or pipe output of C oneliner: http://canonical.org/~kragen/bytebeat/
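For anyone without /dev/dsp, a rough Python sketch of the same idea: treat arbitrary bytes as unsigned 8-bit mono samples and wrap them in a WAV header using the standard `wave` module. The formula is only an example expression in the bytebeat style, and `bytebeat`, `write_wav` and the 8000 Hz rate are assumptions made for this sketch.

```python
import io
import wave


def bytebeat(num_samples):
    """Generate unsigned 8-bit samples from a small integer formula of t."""
    return bytes((t * ((t >> 12 | t >> 8) & 63 & t >> 4)) & 0xFF
                 for t in range(num_samples))


def write_wav(target, samples, rate=8000):
    """Write raw bytes as 8-bit mono PCM to a path or a file-like object."""
    with wave.open(target, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(1)      # 1 byte per sample = unsigned 8-bit PCM
        wav.setframerate(rate)
        wav.writeframes(samples)


if __name__ == "__main__":
    buffer = io.BytesIO()                    # swap for a filename to keep the file
    write_wav(buffer, bytebeat(8000 * 5))    # five seconds at 8 kHz
    print(len(buffer.getvalue()), "bytes of WAV data")
```

Any other byte string (an image file, say) can be passed to `write_wav` in place of the generated samples.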

Does Audacity allow raw data input?

Yes, you have to import raw data.

I did something similar with Ogg Vorbis back in 2006.

There are also some results of experiments with the Opus codec posted in the comments.


MP3 is inherently a one dimensional codec, whereas JPEG is two dimensional. No wonder it performs much better.

JPEG transforms 2D images into 1D arrays using a "zigzag" ordering of 8x8 blocks. See this diagram on wikipedia:


It's a shame the author didn't do the same transformation, because it would de-correlate a lot of the error noise. You can see in the highest compression settings that the "MP3" image compression is smearing everything horizontally. If it used a zigzag transformation, it would be more smeared both horizontally and vertically, but probably less visually bad.

JPEG transforms 2D images into 1D arrays using a "zigzag" ordering of 8x8 blocks.

No. The zigzag ordering is applied to the frequency components, not to the image pixels.

You're right. I'm remembering a different codec — I think this is how H.264 orders DC coefficients for macroblocks. JPEG uses an actual 2D DCT, not a 1D DCT of a flattened block.
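A small sketch of what "an actual 2D DCT" means in practice: the transform is separable, so you can run a 1D DCT over every row and then over every column of the result. This assumes the orthonormal DCT-II and skips JPEG's level shift and quantization; `dct_1d` and `dct_2d` are illustrative names, not any library's API.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        total = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, v in enumerate(values))
        out.append(scale * total)
    return out


def dct_2d(block):
    """2D DCT of a square block: transform the rows, then the columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    # cols is indexed [horizontal freq][vertical freq]; transpose it back.
    return [list(row) for row in zip(*cols)]


flat_block = [[10] * 8 for _ in range(8)]   # hypothetical uniform 8x8 block
coeffs = dct_2d(flat_block)
print(round(coeffs[0][0], 6))               # all the energy lands in the DC term
```

A flattened block fed to a 1D transform would not see the vertical neighbours at all, which is the difference being pointed out.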

What if you try to compress the frequency components, scanned in zig-zag, using MP3 (without the first FFT-like layers, if they exist, I guess)? - if that even makes any sense...

That doesn't really make sense. MP3 takes inputs in signal-space, not frequency-space. You could run:

1. Divide image into blocks (as in JPEG),

2. Perform two-dimensional FFTs (as in JPEG),

3. Scan frequency components in zig-zag order (as in JPEG),

4. Run all of the steps of MP3 compression aside from the initial "split audio into blocks" and "perform FFTs" stages.

That would pretty much just give you a less efficient version of JPEG; both JPEG and MP3 take advantage of knowing how much each frequency component "matters" (i.e., how precisely it's necessary to encode the value to avoid artifacts noticed by humans), so using the MP3 quantization logic on frequency amplitudes from images would result in wasting bits by encoding certain amplitudes more precisely than is useful.

Good point. Do you have any idea why they didn't opt for a Hilbert curve, like others suggested here?

I'm not the parent, but I'd probably surmise that images are usually encoded in blocks to allow the image blocks to be decoded in parallel.

If you're asking why they use the zigzag (within a block) instead of a Hilbert curve, IIRC (quite fuzzy on this, so take it w/ a grain of salt and verify) the reason is that it allows for better spatial encoding (imagine having a ripple in one corner and going out - that's essentially what you want to encode w/ your DCT). Using a Hilbert curve would preserve locality, but I don't think it would line up with the spatial distribution of frequencies in an image.

The way the quantization matrix is set up is that most non-zero values end up in a corner, so the zig-zagging actually manages to be pretty good since the data is already suitably aligned.
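Here is a minimal sketch of that effect, assuming an 8x8 block of already-quantized coefficients whose only non-zero values sit near the top-left corner. The block contents are invented, and `zigzag_indices` / `zigzag_scan` are names made up for the example.

```python
def zigzag_indices(n=8):
    """(row, col) pairs of an n x n block in zig-zag scan order."""
    return sorted(
        ((r, c) for r in range(n) for c in range(n)),
        # Walk anti-diagonals; odd diagonals go down-left, even ones up-right.
        key=lambda rc: (rc[0] + rc[1],
                        rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]),
    )


def zigzag_scan(block):
    """Flatten a square block of coefficients in zig-zag order."""
    return [block[r][c] for r, c in zigzag_indices(len(block))]


block = [[0] * 8 for _ in range(8)]   # hypothetical quantized coefficients
block[0][0], block[0][1], block[1][0], block[1][1] = 12, -3, 2, 1
scan = zigzag_scan(block)
print(scan[:6], "then", scan[6:].count(0), "zeros")
```

Because the scan moves outward from the low-frequency corner, the zeros end up in one long run at the end, which is cheap to encode.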

You could iterate a bit by leveraging 'joint stereo' mode [1], eg, by considering pairs of rows as left/right channels. And perhaps even further with more 'channels'.

[1] https://en.wikipedia.org/wiki/Joint_(audio_engineering)

I wonder how mp3 would perform against a Hilbert-type path taken through the image.

JPEG uses this path for its blocks, rather than anything fancy and Hilbert-ish: http://www.johnloomis.org/ece563/notes/compression/jpeg/tuto...

I think this alone would be an improvement for the mp3 encoder.

One thing to try is to map 2D image data to 1D using Hilbert curve or some other locality-preserving space-filling curve.
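A sketch of what that mapping could look like, using the usual iterative distance-to-coordinate construction of the Hilbert curve. For images that are not a power-of-two square it walks the enclosing square and skips points outside the image, which is only one possible workaround and can introduce jumps. All function names are invented for the example.

```python
def hilbert_d2xy(side, d):
    """Map distance d along a Hilbert curve to (x, y); side is a power of two."""
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                      # rotate/flip the current quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def hilbert_order(width, height):
    """Pixel coordinates of a width x height image in Hilbert-curve order."""
    side = 1
    while side < max(width, height):
        side *= 2
    points = (hilbert_d2xy(side, d) for d in range(side * side))
    return [(x, y) for x, y in points if x < width and y < height]


def flatten(rows):
    """Turn a 2D list of pixels into a 1D sample list along the curve."""
    return [rows[y][x] for x, y in hilbert_order(len(rows[0]), len(rows))]


print(hilbert_order(2, 2))
```

The 1D list from `flatten` is what you would hand to the audio encoder; decoding walks `hilbert_order` again to put samples back in place.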

Not only that, but doesn't MP3 remove "sound changes you can't hear" or whatever, whereas JPEG removes "color changes you don't see"? There is a parallel there, but I would imagine they would be quite different since one is based on human auditory response vs visual response.

A psychovisual model.

Hmm, idea... suppose you ran tempo/beat detection on some music, and then took its 1D waveform and chopped it up and arranged it in a virtual 2D space, lining up similar sections (e.g. choruses would lie side by side). Could you then gainfully compress the result with a 2D compression algorithm?

I had this idea a while ago and later discovered that the MOD file format beat me by a few decades :)

Also, today I stopped halfway through importing a FLAC of repetitive electronic music into Audacity, and I was surprised to discover that there was almost no repetition (it would play one bar then skip ahead over the identical parts)

Edit: Whoops, totally wrong!

Turns out I was loading a half-torrented file and forgot that torrents don't download in order!

No, because things aren't gonna line up that nicely.

Our ears are more sensitive to amplitude errors than phase errors (as a function of frequency, in frequency space). Our eyes are the opposite.

I love this. I am really hoping others come to the comments to share similar "misuse" of technology stories.

As long as we're dealing with signals, everything is within the realm of possibility.

I'm hoping for someone to drive the timing of a spark-ignition internal-combustion-engine with MP3 data, and report what it sounds like.

Not exactly what you were asking for, but made me think of these:




and if you go down the rabbit hole, you end up here:


Not really related, but a while back I wrote a script to visualize/hear audio generation loss with different file formats:


So, mp3s add a bunch of silence to the beginning of the file, and ogg files start to "chirp". I never got around to putting this info in a consumable, easy to understand format though. The videos in these folders just continuously re-encode a source file w/ a given lossy format.

See also: https://en.wikipedia.org/wiki/Generation_loss

Kinda sounds like "I'm sitting in a room"


Thanks for the link! I hadn't heard of this before.

It would be interesting to see if the horizontal artifacts could be avoided by feeding the pixels in a different order to the encoder.

Ordering the pixel data by their locations along a space-filling curve, maybe.

Definitely: https://en.wikipedia.org/wiki/Hilbert_curve. I'd still expect the final result to look worse than jpeg, but it'd be a much more interesting comparison.

Or split the image into its frequency components first, and then run the MP3 compression along a space-filling curve. But at this point you're already doing a https://en.wikipedia.org/wiki/Discrete_cosine_transform, a core part of JPEG compression, which was actually adapted for MP3: https://en.wikipedia.org/wiki/Modified_discrete_cosine_trans...

Before I clicked the OP link, I originally thought it was going to be this type of implementation...cool nonetheless, but would really like to see someone try this and share!

Wouldn't the Hilbert curve require a POT bitmap?

You could try something akin to tessellation, with either more than one hilbert curve, or a truncated curve.

They're both based on the DCT. Mp3 (and AAC and Vorbis and more) use a modified DCT which uses block overlapping to mitigate aberrations on the block boundary.

It's no surprise it works, however you wouldn't necessarily get "as good" compression as you would from an optimized DCT coder (JPEG etc) based on the data duplication (2x for the overlapping blocks).

See https://en.wikipedia.org/wiki/Modified_discrete_cosine_trans... https://en.wikipedia.org/wiki/Discrete_cosine_transform
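A toy sketch of the block overlapping, assuming a plain unwindowed MDCT (real codecs apply a window first; `mdct`, `imdct` and `roundtrip` are illustrative names). Each frame reads 2N samples but emits only N coefficients, and adding the overlapping halves of neighbouring inverse transforms cancels the aliasing, so in this sketch the overlap alone does not double the number of stored values.

```python
import math


def mdct(frame):
    """MDCT of a frame of 2N samples, returning N coefficients."""
    n = len(frame) // 2
    return [sum(x * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
                for i, x in enumerate(frame))
            for k in range(n)]


def imdct(coeffs):
    """Inverse MDCT: N coefficients back to 2N (aliased) samples."""
    n = len(coeffs)
    return [sum(c * math.cos(math.pi / n * (i + 0.5 + n / 2) * (k + 0.5))
                for k, c in enumerate(coeffs)) / n
            for i in range(2 * n)]


def roundtrip(signal, n):
    """Transform 50%-overlapping frames and overlap-add the inverses."""
    out = [0.0] * len(signal)
    for start in range(0, len(signal) - 2 * n + 1, n):
        block = imdct(mdct(signal[start:start + 2 * n]))
        for i, value in enumerate(block):
            out[start + i] += value
    return out


signal = [float((7 * i) % 11) for i in range(16)]   # arbitrary test samples
rebuilt = roundtrip(signal, 4)
print([round(v, 6) for v in rebuilt[4:12]])
```

Only the samples covered by two frames (here indices 4 to 11) come back exactly; the first and last half-frames would need padding in a real encoder.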

I wonder if it's possible to create a set of data that produces an actual image when compressed with JPEG and actual music when compressed with MP3 (for example, a JPEG picture of a pianist that also gives an MP3 piano piece).

While amusing, this is bound to be quite bad since the function basis is restricted to a single axis.

Curiously enough, years ago I saw a blog post where the author used PNG lossless compression on FLAC audio. I guess there should be a lot of room for improving both image and audio compression, especially since we still end up using JPEG and MP3.

Isn't PNG just DEFLATE? You should get basically the same result by just gzipping it.
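Mostly, but PNG also runs a per-scanline prediction filter before DEFLATE, and that can matter for smooth data. A rough sketch of the difference, assuming one byte per sample and only the "Sub" style filter (difference from the previous byte); the ramp signal and `sub_filter` are invented for the example and this is not a PNG encoder.

```python
import zlib


def sub_filter(data: bytes) -> bytes:
    """Replace each byte with its difference from the previous byte (mod 256)."""
    prev = 0
    out = bytearray()
    for b in data:
        out.append((b - prev) % 256)
        prev = b
    return bytes(out)


ramp = bytes((i * 3) % 256 for i in range(10000))   # hypothetical smooth signal
plain = len(zlib.compress(ramp, 9))
filtered = len(zlib.compress(sub_filter(ramp), 9))
print("deflate only:", plain, "bytes; filter + deflate:", filtered, "bytes")
```

On noisy data the filter would be expected to help much less, so the gzip comparison is probably close for most real audio.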

Is there a way to use image compression for MP3s?

I'd like to store music files in my phone's camera roll, and easily upload them to a website where some Javascript could decode and play them.

Now for big time shits-and-grins, add a deep learning GAN to generate a more refined signal during the decompression / upsampling stage.

That's something your Turbo Pascal code never attempted.

I think the question really is, does it do anything for file size? If there were a radical difference in total size, the quality degradation might be an interesting compromise.

The side-by-side assessment compares JPEG and MP3 at the same bitrate.

I remember reading an article where file data (a ZIP file I believe) was converted into a bitmap image, and then it was compressed with PNG for another few % of compression.

While it might be possible for special cases [1], I have my concerns, you really can't compress (lossless) beyond the (information theoretic) entropy -- in which case one would try to compress deflate (zip) with deflate (png).

[1]: I remember taking an information theory class at university and telling the professor (a video/JPEG compression guru) that I had compressed an AVI file with my dumb arithmetic coding implementation. He was shocked. It turned out the file had a large, crappy header from the editor.
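A quick way to eyeball this is the zero-order byte entropy of a file. Caveat: this sketch treats bytes as independent, so it is only a rough proxy for the true information content, and `byte_entropy` is a name made up here.

```python
import math
from collections import Counter


def byte_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte, treating bytes as independent."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum(c / total * math.log2(c / total) for c in counts.values())


print(byte_entropy(b"aaaa"))             # a single repeated symbol
print(byte_entropy(bytes(range(256))))   # every byte value equally likely
```

Output that is already well compressed tends to sit close to 8 bits per byte under this measure, which is why a second DEFLATE pass has so little left to remove, apart from things like uncompressed headers.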

> another few % of compression

ZIP consists of sections of both compressed and uncompressed data, so some compressibility is to be expected.

What about turning the image into a sound, then compressing it with MP3, then turning it back to an image again?

The search for an image that encodes the Close Encounters theme is on.

Where are the images???

I'm amazed this works at all!

It kind of has to -- both compression algorithms work by doing a bunch of lossless steps to get to a format that can easily be quantized for lossy compression. The prioritization of how much quantization to apply is picked according to perceptual models of vision/hearing, to devote more bits to quantizing the things we can actually notice.

Needless to say, a perceptual model designed for audio is a pretty bad choice for a long string of grayscale pixels, interpreted as sampled audio. It looks like a lot of high-frequency content was discarded, resulting in horizontal blurring.
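A tiny sketch of why the blur comes out horizontal, assuming the image is flattened row by row and using a moving average as a crude stand-in for discarding high frequencies (MP3's real processing is far more involved). `smooth_row_major` and both test images are made up.

```python
def smooth_row_major(rows, radius=1):
    """Low-pass a row-major flattened image with a moving average."""
    width = len(rows[0])
    flat = [p for row in rows for p in row]
    out = []
    for i in range(len(flat)):
        lo = max(0, i - radius)
        hi = min(len(flat), i + radius + 1)
        window = flat[lo:hi]
        out.append(round(sum(window) / len(window)))
    # Cut the 1D result back into rows of the original width.
    return [out[i:i + width] for i in range(0, len(out), width)]


vertical_edge = [[0, 0, 255, 255], [0, 0, 255, 255]]
horizontal_edge = [[0, 0, 0, 0], [255, 255, 255, 255]]
print(smooth_row_major(vertical_edge))
print(smooth_row_major(horizontal_edge))
```

Neighbours in the 1D stream are horizontal neighbours in the image, so the vertical edge gets smeared sideways while the horizontal edge is only disturbed where one row wraps into the next.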

Has to? Are you saying that the whole thing would not fall apart if, say, someone used color images instead of greyscale?

If you ordered each subpixel linearly, that would work too. You'd see pretty much the same effect of horizontal distortion. Because the input data is PCM (just magnitude values), same as audio, it "works" just fine. If the input data were some other representation, like text, you'd get gibberish out.
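In code, "ordering each subpixel linearly" could be as simple as the sketch below, which assumes 8-bit RGB tuples and interleaves the channels into one stream of unsigned 8-bit samples. `rgb_to_samples` and `samples_to_rgb` are hypothetical helper names.

```python
def rgb_to_samples(pixels):
    """Interleave (r, g, b) tuples into one stream of 8-bit samples."""
    return bytes(channel for pixel in pixels for channel in pixel)


def samples_to_rgb(samples):
    """Group a sample stream back into (r, g, b) tuples, dropping leftovers."""
    usable = len(samples) - len(samples) % 3
    return [tuple(samples[i:i + 3]) for i in range(0, usable, 3)]


pixels = [(255, 0, 0), (0, 128, 255)]   # made-up two-pixel image
stream = rgb_to_samples(pixels)
print(list(stream))
print(samples_to_rgb(stream))
```

One side effect of interleaving is that adjacent samples belong to different colour channels, so any smearing from the codec would also mix colours.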

Okay, so it does depend on the representation, but most reasonable representations will work. I guess the big exception would be trying to do this on something that is already compressed, like a gzipped bmp, or something.

