This is interesting in terms of information theory. When you compress information you are basically figuring out the pattern behind something, and encoding that pattern instead of the entire image. For example, PNG assumes that pixels are likely to be the same as, or similar to, the pixels above or to the left of them, so you only have to encode how much each pixel differs from its neighbors rather than the full value of every pixel.
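A minimal sketch of that idea in Python (just the spirit of PNG's "Sub" filter, not the actual filter code; the sample row values are made up):

```python
# Toy version of PNG's "Sub" filter: store each pixel as the difference
# from its left neighbour (mod 256). Flat regions turn into long runs of
# zeros, which the real format then compresses with DEFLATE.

def sub_filter(row):
    out, prev = [], 0
    for px in row:
        out.append((px - prev) % 256)
        prev = px
    return out

def sub_unfilter(filtered):
    out, prev = [], 0
    for d in filtered:
        prev = (prev + d) % 256
        out.append(prev)
    return out

row = [100, 101, 101, 102, 180, 180, 181]
deltas = sub_filter(row)          # [100, 1, 0, 1, 78, 0, 1]
assert sub_unfilter(deltas) == row
```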
If you were to randomly corrupt a highly redundant format, like a bitmap, it would just change a few pixels. On more compressed formats like JPEG it seems to affect the entire image, and in very specific ways (mainly the color of every block of pixels after the point that was corrupted.)
If you corrupted a perfectly compressed image, it would give you a different image entirely, possibly of something very similar. For example, if you had an image format that was very good at compressing faces, corrupting a file would result in a different face entirely, not randomly misplaced pixels or colors. And the new face would be similar to the original, maybe with a different nose shape or an extra freckle.
The corruption reveals what kinds of assumptions the format is making about the content.
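It's easy to try this yourself; a rough sketch (file names and the offset are placeholders, and it assumes the files are big enough):

```python
# Flip a single byte in a BMP's raw pixel data vs. somewhere in a JPEG's
# entropy-coded data, and compare how the two images degrade.
import random
import shutil
import struct

def flip_byte(path, offset):
    """XOR one byte at the given offset with a random non-zero value."""
    with open(path, "r+b") as f:
        f.seek(offset)
        b = f.read(1)[0]
        f.seek(offset)
        f.write(bytes([b ^ random.randint(1, 255)]))

# BMP: the pixel-array offset is stored at header bytes 10-13, so any
# corruption past that point only touches raw pixel values.
shutil.copy("photo.bmp", "photo_glitched.bmp")
with open("photo_glitched.bmp", "rb") as f:
    header = f.read(14)
pixel_offset = struct.unpack_from("<I", header, 10)[0]
flip_byte("photo_glitched.bmp", pixel_offset + 5000)

# JPEG: the same single-byte flip lands in compressed scan data, so the
# decoder misreads everything from that point to the next marker.
shutil.copy("photo.jpg", "photo_glitched.jpg")
flip_byte("photo_glitched.jpg", 5000)
```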
Reminds me of an argument about crossword puzzles that I think comes from Shannon.
It goes like this: in a sufficiently "fault-tolerant" language (ie with low per-character information content, which sorta gives large Hamming distance), crosswords become impossible, because puzzles won't be satisfiable: words could never line up right. But in a sufficiently compact language (many bits per character in words, and Hamming distance thus tending towards zero), crosswords are impossible as well, but now because puzzles would be ambiguous: too many words that fit. Somehow, natural languages appear to fit somewhere in the middle.
> Somehow, natural languages appear to fit somewhere in the middle.
I don't recall where, but I have heard an argument that there is pressure on the information density of natural language, as follows: If you are frequently misunderstood, then you can save effort (not have to repeat/explain) by adding redundancy. If you are never misunderstood, then you can save effort by removing redundancy.
There's no particular reason the result of this process should be good for crosswords, but it's a reason to be in the middle rather than at the extremes.
> If you are frequently misunderstood, then you can save effort (not have to repeat/explain) by adding redundancy.
If the added words improve comprehension, then they're not redundant. Not all repetition is redundant. Oh, by the way, did I forget to mention that not all repetition is redundant? :)
> in a sufficiently compact language crosswords are impossible as well, but now because puzzles would be ambiguous: too many words that fit
This ignores the clues, which restrict which words are acceptable in a space. While the clues are mostly there to make the puzzle easier, resolving ambiguity is a secondary purpose.
It turns out that I either hallucinated (incorrectly, perhaps) one side of the argument, or I read it somewhere else, because the paper only talks about one side, the one about feasibility. The relevant paragraph follows:
"The redundancy of a language is related to the existence of crossword puzzles. If the redundancy is zero any sequence of letters is a reasonable text in the language and any two-dimensional array of letters forms a crossword puzzle. If the redundancy is too high the language imposes too many constraints for large crossword puzzles to be possible. A more detailed analysis shows that if we assume the constraints imposed by the language are of a rather chaotic and random nature, large crossword puzzles are just possible when the redundancy is 50%. If the redundancy is 33%, three-dimensional crossword puzzles should be possible, etc."
Anyway, to somewhat rescue a botched story, the ambiguity part reminds me of the mind-blowingly awesome and famous puzzle by Jeremiah Farrell on the eve of the 1996 presidential election, whose clue was "Lead story in tomorrow's paper", with possible answers BOBDOLE ELECTED or CLINTON ELECTED, both satisfying all the crossing clues as well! Amazing.
Not just Xerox, btw... My HP printer/scanner does the same. After reading that post I started paying attention and eventually saw the effect a few times. I just wonder how many copies with the wrong numbers I've produced so far...
Heh, at my first startup I used to get members of the audio codec team wandering over when there was a bug. I'd listen to the corrupted playback, draw out the codec block diagram, and point right to the likely error locations. As another commenter mentions, these were all lossy codecs. The character of noise would be very interesting much of the time -- effectively turning the decoder into a synthesizer.
FLAC would probably be a bad example, since it is lossless. It makes fewer assumptions about the content of the audio file, because it has to encode all of the noise and minor detail anyway. Making strong assumptions about the content would only require a facility to reintroduce the exact noise into the audio later, which I can't imagine would be both fast and small.
I don't think it has much to do with the fact that it is lossless, but rather with the fact that it doesn't compress in a way that fits the scheme the GGP outlined. Whether that scheme implies lossy compression is a question that still needs to be demonstrated, imho.
Most audio and video formats are based on encoding relatively short blocks independently of what has come before; otherwise, when you skip into the middle of a track, the player would need to read everything up to that point before it could resume playback.
Many video formats do look back a fair number of frames, so you do sometimes see corruption remain and spread for a few seconds, but then you hit a key frame or block boundary where everything resets and all is suddenly well again.
Audio compression is usually a different bag entirely: over a given time window, the algorithm looks for what it can leave out of, or merge in, the input signal because human ears are unlikely to hear the difference. There is much less opportunity (compared to still images and video) to encode "copy that chunk from a frame or few back, rotate/skew/whatever it a small amount, then change these few blobs".
Anyone got a jpeg un-glitcher? I'm thinking of a tool to fix these sorts of minor corruptions in the jpeg bitstream. The tool would load up the corrupt image, let the user visually select where the glitch starts by clicking the first "weird" looking pixel and then iterate through different values until the user says the picture looks better. Rinse and repeat.
Most pictures don't have hard transitions between block areas, so once the faulty area has been identified, an algorithm could look for the change that gives the image the softest transitions between neighboring blocks.
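Roughly what that search could look like, as a sketch only (assumes Pillow and numpy, a single bad byte at a roughly known offset, and a crude blockiness score; "glitched.jpg" and the offset are placeholders, not from any real tool):

```python
# Brute-force one byte of a corrupted JPEG and keep the value whose
# decoded image has the smoothest 8px block boundaries.
import io
import numpy as np
from PIL import Image

def blockiness(img):
    """Mean absolute jump across 8-pixel block boundaries (lower is better)."""
    a = np.asarray(img.convert("L"), dtype=float)
    height, width = a.shape
    v = np.mean([np.abs(a[:, c] - a[:, c - 1]).mean() for c in range(8, width, 8)])
    h = np.mean([np.abs(a[r, :] - a[r - 1, :]).mean() for r in range(8, height, 8)])
    return v + h

def unglitch(data, offset):
    """Try every byte value at `offset`, return the least blocky candidate."""
    best_val, best_score = None, float("inf")
    for candidate in range(256):
        patched = data[:offset] + bytes([candidate]) + data[offset + 1:]
        try:
            img = Image.open(io.BytesIO(patched))
            img.load()                       # force a full decode
        except Exception:
            continue                         # undecodable candidate, skip
        score = blockiness(img)
        if score < best_score:
            best_val, best_score = candidate, score
    return best_val

data = open("glitched.jpg", "rb").read()
print("best byte value:", unglitch(data, offset=5000))
```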
we should make your unglitcher part of the standard, that way every jpeg across the whole Internet could be reduced in size by just throwing away certain parts of it. Then your unglitcher could get those parts back. oh, wait....
It's doubtful that the image can be recovered; the information really is gone. I suspect it would be easier to keep redundant copies of the data than to try to fix the images after the fact.
That sounds plausible. Another idea: choose a section of "bad image" and a section of "good image" which should have similar colours and then generate mutations of the corrupted DCT square until you have a close match between the two areas.
Given that we're assuming some kind of correctness metric for the final image, could we not also use this metric to locate the start of the corruption?
I want something that can do the opposite of this. I have a bunch of old photos that are messed up where the bottom half is pink or offset or something else. I just haven't had the time to dig into the spec to figure out how to undo it.
It's just a Huffman code. When you encounter an undecodable sequence of bits, you can skip bits until you find the next symbol (i.e. the stream becomes decodable starting from the current position + X bits), then just guess which symbol it was that you skipped. The symbols are a run-length encoding of quantized, zigzagged DCT coefficients of (usually) 4:2:0 YCbCr. Wikipedia has the details and the spec is readable.
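A toy illustration of that resync idea (nothing like real JPEG tables, just an incomplete prefix code so that some bit patterns are undecodable):

```python
# Skip bits from the front of a glitched stream until it decodes again.
CODE = {"0": "A", "10": "B", "110": "C"}      # "111" matches nothing
MAX_LEN = max(len(c) for c in CODE)

def try_decode(bits):
    """Return decoded symbols, or None if an invalid prefix is hit."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in CODE:
            out.append(CODE[buf])
            buf = ""
        elif len(buf) >= MAX_LEN:
            return None                        # no codeword matches
    return out if buf == "" else None          # reject a trailing partial code

def resync(bits):
    """Skip bits until the remainder decodes cleanly."""
    for skip in range(len(bits)):
        decoded = try_decode(bits[skip:])
        if decoded is not None:
            return skip, decoded
    return len(bits), []

good = "0" + "10" + "110" + "0"                # A B C A
corrupted = "111" + good                       # glitch prepended
print(resync(corrupted))    # (1, ['C', 'B', 'C', 'A']) -- resyncs, but guesses a wrong split
```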
While on the subject of JPEG in JavaScript, I made a web page[1] to repeatedly encode an image in order to bring out the artifacts. (Works “best” on text.)
I don't call it cheating... I call it simulation of the effects of cropping.
But seriously though, I was quite surprised by how little is lost if you repeatedly encode the same image. Then again, it makes a lot of sense if you look at how blocks are encoded.
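If anyone wants to reproduce the generation-loss experiment locally, here's a minimal sketch with Pillow (quality setting, iteration count, and the input file name are arbitrary):

```python
# Re-encode the same image through JPEG many times and watch how little
# changes between consecutive generations after the first few.
import io
from PIL import Image, ImageChops

prev = Image.open("input.png").convert("RGB")
for generation in range(1, 101):
    buf = io.BytesIO()
    prev.save(buf, format="JPEG", quality=75)
    buf.seek(0)
    cur = Image.open(buf).convert("RGB")
    # Mean absolute difference between consecutive generations.
    diff = ImageChops.difference(prev, cur)
    mean_diff = sum(diff.convert("L").getdata()) / (cur.width * cur.height)
    if generation % 10 == 0:
        print(f"generation {generation}: mean diff {mean_diff:.3f}")
    prev = cur
```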
When I arrived in New Zealand, from a trip to the US (California mostly), my hard disk turned out to be damaged. I managed to restore only a small part of the photos; less than 10% turned out to be unaffected by a specific digital effect (clearly visible in the post). At first I was upset (which is why I didn't post for so long), but after perhaps the tenth viewing of the half-damaged photo archive I started to notice interesting frames.
That's actually pretty nifty, and shows what a little bit of corruption does to an image.
Hopefully more systems will start shipping with checksumming file systems by default. Even better if they have error correction.
I still have some of the first MP3s I ripped back in the late 90s. They still play, but it sounds like a scratched CD. HDDs aren't as robust as we'd like to think.
JPEG actually supports self-repair: when you save out a JPEG, you can insert "restart markers" every so often, which reset the decoder's state and allow it to recover from corruption in the preceding segment.
If you put enough restart markers in an image, you can swap or replace the compressed blocks between any given sets of restart markers in the same image without glitching it.
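You can check whether a given file actually has them by scanning for the DRI segment (FF DD) and the RST0–RST7 markers (FF D0–FF D7); a naive sketch:

```python
# Count JPEG restart markers (RST0-RST7, bytes FF D0..FF D7) and report
# the restart interval from the DRI segment (FF DD), if present.
# Naive byte scan; good enough for a quick look, not a real parser.
import struct
import sys

data = open(sys.argv[1], "rb").read()

restarts = 0
interval = None
for i in range(len(data) - 1):
    if data[i] == 0xFF:
        marker = data[i + 1]
        if 0xD0 <= marker <= 0xD7:          # RSTn
            restarts += 1
        elif marker == 0xDD:                # DRI: 2-byte length, 2-byte interval
            interval = struct.unpack(">H", data[i + 4:i + 6])[0]

print(f"restart interval: {interval} MCUs, restart markers found: {restarts}")
```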
They've been across about 4 different disks. I copy using SCP, so it's unlikely that the corruption happened over the network, most likely it was between the CPU and disk as it was being read/written.
Get an mp3 error checker and see if it has anything to say about the particular files.
You also might want to grab EncSpot Pro (free, for Windows) to figure out exactly which codec you used. Chances are that if the codec was buggy, it's been documented, so a Google search on the specific codec and version will probably turn up people talking about the known errors.
This technique is often used as a puzzle in ARGs[1]. It reminds me especially of I Love Bees[2]. Bits of plaintext data were added to jpgs on a supposedly-corrupted website[3], which then had to be combined in the correct order. As well as fitting the theme of the game, it's a good puzzle since the corruption gives a huge visual clue to investigate the image and it's relatively easy for anyone to load it up in notepad and find the 'hidden' data.
Nice :-) One can do something akin to this by using Radamsa -- https://code.google.com/p/ouspg/wiki/Radamsa (shameless plug) -- to fuzz bitmaps. Similar imagery is common when fuzzing browsers with image-containing samples and can be quite mesmerizing to observe. Fuzzing other media formats (such as MIDI or audio files) can also yield "interesting" results.
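For a zero-dependency toy in the same spirit, a few random byte flips already go a long way (flip count is arbitrary):

```python
# Minimal file fuzzer: copy the input and flip a handful of random bytes.
# Real fuzzers like Radamsa are far smarter, but this is enough to
# produce glitched bitmaps and images for experimentation.
import random
import sys

src, dst, flips = sys.argv[1], sys.argv[2], 10
data = bytearray(open(src, "rb").read())
for _ in range(flips):
    pos = random.randrange(len(data))
    data[pos] ^= random.randint(1, 255)
open(dst, "wb").write(bytes(data))
```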
Check out the music video for Chairlift - Evident Utensil. I don't feel like they took advantage of the effect in any meaningful way in this video, though. I also seem to remember another one with a similar effect.
Looks like applying the motion vectors from one video to the starting conditions from another, which sometimes happens when fast forwarding with a buggy video decoder or a glitchy video.
I wonder, could you do this just by chopping I frames from one video with B and P frames from another?
Also the music video for Philadelphia Grand Jury - Save Our Town, where the video gradually degrades over the song (although this one uses analogue degradation).
It is sickening. You may now feel entitled to rage about hacker privilege misappropriating glitch culture and the injustices against those without the choice of unbiased data who have to make do with corruption, a reality with no digital integrity net to fall back on when you need bits to be determinate or a file "format" that stops being forgiving just when you need it most. There is an invisible underlying source of so-called "malfunctions", yet the author of this tool emphasizes a convenient transformation/generation of "error typical" data and contrasts it against subjective compliance with a file "format", projecting a rigid labelling, totally side-stepping these issues and the subsequent injustices.