I just posted the article[1] that I think prompted this script to be brought up.
What I did instead was to run images through an audio editing tool, which lets you apply echoes or do mind-boggling things like change the volume of the image.
The script can be found on github[2].
Nice script :) I did a similar thing by hand with graphicsmagick and sox a couple years ago - the most fun result was from wahwah, pretty much a wide bandpass filter with centre frequency varying cyclically over time: http://lewisinthelandofmachines.tumblr.com/post/59405537096/...
Horrible. I actually tried applying image compression to sounds over a decade ago (I was still in school, did it for Jugend forscht). I used a Fourier transform to turn windows of samples into image sections and compressed those. In hindsight, perhaps just treating the waveform as image data would have been better. The idea behind what I did was to see whether a spectrogram can be compressed well, considering that there's usually lots of repetition and blank space.
In any case, JPEG murders pretty much everything in that representation.
It kind of reads like you have discredited frequency domain compression algorithms. FYI, JPEG uses a block-based DCT, which is very similar to a Fourier transform.
That was not my intent. I just noted that my approach came from visually inspecting spectrograms and noting that they might be compressed quite well by image compression algorithms. Only to notice in the end that lossy image compression completely butchers the signal with noise if doing that. Had I just used the original waveform as raw image data that wouldn't have been as much of a problem. Compression would still have changed the samples, but overall probably not as badly.
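Roughly, the spectrogram experiment looked something like this - a from-memory sketch, not the original code; scipy and Pillow are just stand-ins here, and the window size, log scaling and quality setting are arbitrary guesses on my part:

    import io
    import numpy as np
    from PIL import Image
    from scipy.io import wavfile
    from scipy.signal import stft, istft

    rate, samples = wavfile.read("input.wav")  # placeholder mono 16-bit file
    _, _, spec = stft(samples.astype(np.float64), fs=rate, nperseg=1024)

    # Treat the log-magnitude spectrogram as an 8-bit grayscale picture
    mag, phase = np.abs(spec), np.angle(spec)
    log_mag = np.log1p(mag)
    img = np.uint8(255 * log_mag / log_mag.max())

    # The lossy step: round-trip the "picture" through JPEG
    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=30)
    buf.seek(0)
    img_back = np.asarray(Image.open(buf), dtype=np.float64)

    # Resynthesize with the original phase; this is where the noise floods in
    mag_back = np.expm1(img_back / 255 * log_mag.max())
    _, out = istft(mag_back * np.exp(1j * phase), fs=rate, nperseg=1024)
    wavfile.write("output.wav", rate, np.int16(np.clip(out, -32768, 32767)))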
Heck, I was 17 when I did that. I certainly didn't know much about the theory behind it or why it would almost certainly fail. But I guess by now I wouldn't even attempt such craziness, so in a way it's probably a good thing not to know things too well, sometimes.
The problem is that we are much more sensitive to audio distortion than visual distortion, particularly dynamic distortion (e.g. video). Pretty much all lossy audio compression does a bad job without a psycho-acoustic model, whereas image compression works pretty well on simple metrics. It doesn't help that psycho-visual modelling is less well understood (than audio).
I'm more curious about a different idea. If you hook up electrodes to a tongue and then start activating its sensors according to the amount of light being received, people quickly develop the ability to see via their tongue.
So what happens when you encode visual information into sound in such a way as to have a sound-to-environment signal, and you just keep that active and running in real time? We already know that this mechanism exists in nature and can even be developed in humans, since echo-location is a thing.
Edit: It occurs to me that actually doing sensory experiments like this might really mess with someone's head... maybe not a good idea to play around with it casually.
Edit 2: If we were going to swap out one of our senses for another, which should we get rid of and which should we gain?
Actually, there are (and have been for a few years already) glasses that scan the image you'd see and translate it into a sound that you need a bit of training to understand as an image[0]. Some people have gotten used to them, while others have gone the way of the bat and learned to echolocate[1].
Just tried converting the sound to 8 bit, compressing it as a monochrome image, then back.
You can't really hear much difference at higher quality settings, and as you turn it down it sounds like there is more and more digital-ish noise. I had to turn the compression way up before the voice became almost unintelligible, but you can still hear that it's someone talking.
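Roughly what that round trip looks like as a sketch (the width, padding and quality value are arbitrary choices, and "voice.wav" is a placeholder):

    import io
    import numpy as np
    from PIL import Image
    from scipy.io import wavfile

    rate, samples = wavfile.read("voice.wav")                  # placeholder mono 16-bit file
    pcm8 = np.uint8((samples.astype(np.int32) + 32768) >> 8)   # 16-bit signed -> 8-bit unsigned

    width = 1024
    pcm8 = np.pad(pcm8, (0, -len(pcm8) % width))               # pad so it reshapes cleanly
    img = pcm8.reshape(-1, width)                              # samples laid out as a grayscale image

    buf = io.BytesIO()
    Image.fromarray(img).save(buf, format="JPEG", quality=20)  # lower quality = more digital-ish noise
    buf.seek(0)
    decoded = np.asarray(Image.open(buf), dtype=np.uint8).ravel()[:len(samples)]

    out = (decoded.astype(np.int16) - 128) << 8                # back to 16-bit signed
    wavfile.write("voice_jpeg.wav", rate, out)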
It's garbage when you need to switch to CW to continue the conversation. :)
Seriously, when it's unpleasant to listen to. As I said in the other reply, with music the quality is pretty bad on almost any setting; voice just happens to be an exception.
I'm kinda wondering if this post was because of that, because I found the link in question and posted it there. It's kinda neat to see the possibly circuitous route something takes to end up somewhere else.
Reminds me that one could (can?) pipe random data to /dev/dsp and get it interpreted as sound data. Heck, you could pipe a wav file and actually hear it.
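Something like this still works on systems that expose the OSS device (or emulate it, e.g. via osspd); the duration and defaults below are just assumptions:

    import os

    # /dev/dsp traditionally defaults to 8 kHz, 8-bit, mono
    with open("/dev/dsp", "wb") as dsp:
        dsp.write(os.urandom(8000 * 5))                 # ~5 seconds of white noise
        # dsp.write(open("some.wav", "rb").read())      # the WAV header is just a brief click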
It's a shame the author didn't do the same transformation, because it would de-correlate a lot of the error noise. You can see in the highest compression settings that the "MP3" image compression is smearing everything horizontally. If it used a zigzag transformation, it would be smeared both horizontally and vertically, but probably less visually bad.
You're right. I'm remembering a different codec — I think this is how H.264 orders DC coefficients for macroblocks. JPEG uses an actual 2D DCT, not a 1D DCT of a flattened block.
What if you try to compress the frequency components, scanned in zig-zag, using MP3 (without the first FFT-like layers, if they exist, I guess)? - if that even makes any sense...
That doesn't really make sense. MP3 takes inputs in signal-space, not frequency-space. You could run:
1. Divide image into blocks (as in JPEG),
2. Perform two-dimensional DCTs (as in JPEG),
3. Scan frequency components in zig-zag order (as in JPEG),
4. Run all of the steps of MP3 compression aside from the initial "split audio into blocks" and "perform FFTs" stages.
That would pretty much just give you a less efficient version of JPEG; both JPEG and MP3 take advantage of knowing how much each frequency component "matters" (i.e., how precisely it's necessary to encode the value to avoid artifacts noticed by humans), so using the MP3 quantization logic on frequency amplitudes from images would result in wasting bits by encoding certain amplitudes more precisely than is useful.
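For reference, steps 1-3 might look something like this sketch (scipy's DCT standing in for the JPEG transform; the resulting 1D stream is what you'd hand to the remaining MP3 stages in this thought experiment):

    import numpy as np
    from scipy.fftpack import dct

    def zigzag_order(n=8):
        """Indices of an n x n block in JPEG zig-zag order."""
        return sorted(((i, j) for i in range(n) for j in range(n)),
                      key=lambda p: (p[0] + p[1],
                                     p[0] if (p[0] + p[1]) % 2 else p[1]))

    def blocks_to_stream(gray):
        """gray: 2D float array with dimensions divisible by 8."""
        order = zigzag_order()
        stream = []
        for y in range(0, gray.shape[0], 8):
            for x in range(0, gray.shape[1], 8):
                block = gray[y:y + 8, x:x + 8] - 128.0
                coeffs = dct(dct(block.T, norm="ortho").T, norm="ortho")  # separable 2D DCT-II
                stream.extend(coeffs[i, j] for i, j in order)
        return np.array(stream)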
I'm not the parent, but I'd probably surmise that images are usually encoded in blocks to allow the image blocks to be decoded in parallel.
If you're asking why they use the zigzag (within a block) instead of a Hilbert curve, IIRC (quite fuzzy on this, so take it w/ a grain of salt and verify) the reason is that it allows for better spatial encoding (imagine having a ripple in one corner and going out - that's essentially what you want to encode w/ your DCT). Using a Hilbert curve would preserve locality, but I don't think it would line up with the spatial distribution of frequencies in an image.
The way the quantization matrix is set up is that most non-zero values end up in a corner, so the zig-zagging actually manages to be pretty good since the data is already suitably aligned.
You could iterate a bit by leveraging 'joint stereo' mode [1], e.g. by considering pairs of rows as left/right channels. And perhaps even further with more 'channels'.
Not only that, but doesn't MP3 remove "sound changes you can't hear" or whatever, whereas jpg removes "color changes you don't see"? There is a parallel there, but I would imagine they would be quite different since one is based on human auditory response vs visual response.
Hmm, idea... suppose you ran tempo/beat detection on some music, and then took its 1D waveform and chopped it up and arranged it in a virtual 2D space, lining up similar sections (e.g. choruses would lie side by side). Could you then gainfully compress the result with a 2D compression algorithm?
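As a sketch of the chopping step, with a fixed tempo standing in for real beat detection (the BPM and file name are made-up assumptions):

    import numpy as np
    from scipy.io import wavfile

    rate, samples = wavfile.read("track.wav")        # placeholder mono file
    bpm, beats_per_bar = 128, 4
    bar_len = int(rate * 60 / bpm * beats_per_bar)   # samples per bar at the assumed tempo

    n_bars = len(samples) // bar_len
    grid = samples[:n_bars * bar_len].reshape(n_bars, bar_len)
    # One bar per row: repeated sections become near-identical rows, which is
    # the kind of vertical redundancy a 2D image codec could exploit.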
I had this idea a while ago and later discovered that the MOD file format beat me by a few decades :)
Also, today I stopped halfway through importing a FLAC of repetitive electronic music into Audacity, and I was surprised to discover that there was almost no repetition (it would play one bar then skip ahead over the identical parts)
Edit: Whoops, totally wrong!
Turns out I was loading a half-torrented file and forgot that torrents don't download in order!
So, mp3s add a bunch of silence to the beginning of the file, and ogg files start to "chirp". I never got around to putting this info in a consumable, easy to understand format though. The videos in these folders just continuously re-encode a source file w/ a given lossy format.
Before I clicked the OP link, I originally thought it was going to be this type of implementation...cool nonetheless, but would really like to see someone try this and share!
They're both based on the DCT. MP3 (and AAC and Vorbis and more) uses a modified DCT which relies on block overlapping to mitigate aberrations at the block boundaries.
It's no surprise it works; however, you wouldn't necessarily get "as good" compression as you would from an optimized DCT coder (JPEG etc.), given the data duplication (2x for the overlapping blocks).
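For the curious, the MDCT itself is simple to write down naively: each 2N-sample window maps to N coefficients, and consecutive windows overlap by N samples, which is where the roughly 2x duplication comes from (windowing omitted here for brevity):

    import numpy as np

    def mdct(frame):
        """Naive MDCT of a 2N-sample frame -> N coefficients."""
        two_n = len(frame)
        half = two_n // 2
        n = np.arange(two_n)
        k = np.arange(half)
        basis = np.cos(np.pi / half * (n[None, :] + 0.5 + half / 2) * (k[:, None] + 0.5))
        return basis @ frame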
I wonder if it's possible to create a set of data that produces an actual image when compressed with JPEG and actual music when compressed with MP3 (for example, a JPEG picture of a pianist that also gives an MP3 piano piece).
Curiously enough, years ago I saw a blog post where the author used PNG lossless compression for FLAC audio.
I guess there should be a lot of room for improving both image and audio compression, especially since we still end up using JPEG and MP3.
I think the question really is, does it do anything for file size? If there were a radical difference in total size, the quality degradation might be an interesting compromise.
I remember reading an article where file data (a ZIP file I believe) was converted into a bitmap image, and then it was compressed with PNG for another few % of compression.
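The conversion itself is only a few lines; a sketch, with an arbitrary width and a placeholder filename (and, as the reply below points out, already-compressed input normally leaves PNG little to squeeze out):

    import numpy as np
    from PIL import Image

    data = np.frombuffer(open("archive.zip", "rb").read(), dtype=np.uint8)
    width = 1024
    data = np.pad(data, (0, -len(data) % width))      # pad to a full last row
    Image.fromarray(data.reshape(-1, width)).save("archive.png", optimize=True)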
While it might be possible in special cases [1], I have my concerns: you really can't compress (losslessly) beyond the (information-theoretic) entropy -- and in this case one would be trying to compress deflate (zip) with deflate (png).
[1]: I remember when I was taking an information theory course at university and told the (video/JPEG compression guru) professor that I had compressed an AVI file with my dumb arithmetic coding implementation. He was shocked; it turns out the file had some large crappy header from the editor.
It kind of has to -- both compression algorithms work by doing a bunch of lossless steps to get to a format that can easily be quantized for lossy compression. The prioritization of how much quantization to apply is picked according to perceptual models of vision/hearing, to devote more bits to quantizing the things we can actually notice.
Needless to say, a perceptual model designed for audio is a pretty bad choice for a long string of grayscale pixels, interpreted as sampled audio. It looks like a lot of high-frequency content was discarded, resulting in horizontal blurring.
If you ordered each subpixel linearly, that would work too. You'd see pretty much the same effect of horizontal distortion. Because the input data is PCM (just magnitude values), same as audio, it "works" just fine. If the input data were some other representation, like text, you'd get gibberish out.
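A minimal sketch of that ordering, just to make it concrete (file names and the nominal sample rate are placeholders):

    import numpy as np
    from PIL import Image
    from scipy.io import wavfile

    img = np.asarray(Image.open("photo.png").convert("RGB"))
    pcm = img.ravel()                               # R, G, B, R, G, B, ... as plain magnitudes
    wavfile.write("photo.wav", 44100, pcm)          # written as 8-bit unsigned PCM

    rate, back = wavfile.read("photo.wav")          # ...after whatever audio processing
    Image.fromarray(back[:img.size].reshape(img.shape)).save("photo_back.png")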
Okay, so it does depend on the representation, but most reasonable representations will work. I guess the big exception would be trying to do this on something that is already compressed, like a gzipped bmp, or something.
[1] http://memcpy.io/audio-editing-images.html
[2] https://github.com/robertfoss/audio_shop/