As some comments are arriving below -- lossy compression does not work that way!
Interpreting one set of data as another is a clever trick for puzzles (thematic, for example, to: http://web.mit.edu/puzzle/www/2013/coinheist.com/oceans_11/t... -- but that's only a hint, the solution is still lots of fun!) but is kind of wrong in practice.
Compression is about conveying a similar 'message' while removing the stuff that just doesn't compress well. In signals such as pictures and audio, this is often seen/heard in high frequency detail.
To the layman, the simplest lossy compression of an image is to resize it by a half in each dimension and then stretch the output. (Not unlike retina-display artifacts you may have seen). Assuming you do it right (with an averaging filter) -- you've just removed the upper half of the representable frequencies, it just so happens that most of what the eye cares about isn't in those ranges -- and so the image still looks decent. A little blocky if you stretch it, but still pretty good. JPEG is just a more technically advanced version of similar tricks.
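A minimal sketch of that half-size trick, done by hand with numpy on a synthetic smooth "photo" so it runs without an input file (the image and the 2x2 averaging are stand-ins for a proper resampling filter):

    # Crude lossy "compression": average each 2x2 block (a low-pass filter that
    # discards the top half of the representable frequencies), keep only those
    # averages, then "decompress" by stretching each one back into a 2x2 block.
    import numpy as np

    x, y = np.meshgrid(np.linspace(0, 4 * np.pi, 256), np.linspace(0, 4 * np.pi, 256))
    img = 128 + 60 * np.sin(x) * np.cos(y)          # smooth, mostly low-frequency stand-in image

    small = img.reshape(128, 2, 128, 2).mean(axis=(1, 3))   # keep 1/4 of the samples
    restored = small.repeat(2, axis=0).repeat(2, axis=1)    # blocky when stretched back up

    print("kept", small.size, "of", img.size, "samples")
    print("mean absolute error:", round(np.abs(img - restored).mean(), 2))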
A better analogy of lossy compression on text is actually xkcd's Up Goer Five (http://xkcd.com/1133/) -- using only the "ten hundred" most common words, you can still convey concepts (albeit simplistically) and stay under ten bits a word, or a little less than two ascii characters per. If you were to map the dictionary into "up-goer five" speak, you could compress Shakespeare pretty well. It would lose a lot in the translation -- but so does a heavily-compressed image. If you limit yourself to the first 64k words, you have a much larger vocabulary, still limited, and still fitting within two bytes per word. Though you may have to use more words, a word like "wherefore" still counts for four words worth of compressed space, when replaced by "why". Tradeoffs!
That'd be a curious hack -- sort the words in Romeo and Juliet by word frequency. Rewrite the least common words in terms of the other words. Compress.
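A back-of-the-envelope sketch of the fixed-dictionary side of this (the word list here is a tiny stand-in for a real top-64k list; a real version would rewrite unknown words in up-goer-five style first, so mapping them to a placeholder is where the loss happens):

    # Every word becomes a 2-byte index into an agreed-upon word list, so only
    # the indices need to be transmitted. Out-of-vocabulary words collapse to a
    # reserved <unk> slot here -- that's the lossy part.
    import struct

    TOP_WORDS = ["<unk>", "the", "and", "to", "why", "that", "love", "die", "not"]  # stand-in list
    INDEX = {w: i for i, w in enumerate(TOP_WORDS)}

    def encode(text):
        return b"".join(struct.pack(">H", INDEX.get(w, 0)) for w in text.lower().split())

    def decode(blob):
        return " ".join(TOP_WORDS[i] for (i,) in struct.iter_unpack(">H", blob))

    packed = encode("Why not die for love")
    print(len(packed), "bytes ->", decode(packed))   # 10 bytes -> "why not die <unk> love"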
I wrote a quick script to convert text into the most common 1000 words as best it can; here's your comment. It needs a bit of work, and a better synonym list, but whatever. (A rough sketch of the approach follows the output below.)
as some comments are coming below -- lossy press does not work that way!
interpreting one set of data as another is a able action for puzzles (thematic, for example, to: http://web.mit.edu/puzzle/www/2013/coinheist.com/oceans_11/t.... -- but that's only a direction, the action is still army of fun!) but is kind of wrong in act.
press is about giveing a similar 'account' while removing the air that just doesn't cut well. in act such as pictures and clear, this is often seen/heard in high light army.
to the christian, the simplest lossy press of an image is to resize it by a half in each area and then answer the data. (not different retina-display artifacts you may have seen). big you do it right (with an averaging filter) -- you've just alone the c half of the representable frequencies, it just so happens that most of what the eye care about isn't in those ranges -- and so the image still air christian. a little blocky if you answer it, but still beautiful good. jpeg is just a more technically developed account of similar actions.
a better approach of lossy press on back is actually xkcd's up goer five (http://xkcd.com/1133/) -- using only the "ten hundred" most common words, you can still give concepts (albeit simplistically) and stay under ten cut a word, or a little less than two ascii act per. if you were to art the cant into "up-goer five" speak, you could cut shakespeare beautiful well. it would clear a lot in the change -- but so does a heavily-cuted image. if you all i to the first 64k words, you have a much larger cant, still alled, and still change within two bytes per word. again you may have to use more words, a word like "ground" still counts for four words account of cuted space, when replaced by "why". tradeoffs!
that'd be a concerned common -- sort the words in man and juliet by word light. record the least common words in terms of the other words. cut.
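A minimal sketch of that kind of substitution pass, assuming a top-1000 allowed list and a thesaurus-style synonym table (both are tiny stand-ins here, not what the parent actually used):

    # Force text into a limited vocabulary: allowed words pass through, everything
    # else is swapped for the first synonym that is allowed, or kept as a last resort.
    import re

    ALLOWED = {"press", "give", "account", "cut", "work", "way", "word"}        # stand-in top-1000
    SYNONYMS = {"compression": ["press"], "message": ["account"],               # stand-in thesaurus
                "convey": ["give"], "compress": ["cut"]}

    def limit_vocabulary(text):
        out = []
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in ALLOWED:
                out.append(word)
            else:
                options = [s for s in SYNONYMS.get(word, []) if s in ALLOWED]
                out.append(options[0] if options else word)   # no usable synonym: leave it alone
        return " ".join(out)

    print(limit_vocabulary("Compression is about conveying a similar message."))
    # -> "press is about conveying a similar account"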
> If you limit yourself to the first 64k words, you have a much larger vocabulary, still limited, and still fitting within two bytes per word.
Maybe I'm misunderstanding what you're saying but if you're compressing adaptively by building a dictionary of the target material, 64k words would give you lossless compression of Shakespeare - the unique words in the collected works (with some wiggle room for 'word forms') are in the 30-40k range.
True! You could do that too. Sending the codebook is a cromulent strategy. It'd be lossless too. That's (to a first order) how LZ works.
I was arguing you can use a standard codebook (some agreed-upon top-N list) instead. I imagine if you did you'd lose a couple words ("wherefore" springs to mind, as it's not in common English usage) but by and large you'd have just about everything.
Technically, you can also have a lossless JPEG (or at least, within epsilon), but that's not why it's done.
More importantly, the entropy of English text isn't really that high -- ~ 1 bit per letter -- which means lossless compression works pretty well on it. Images less so.
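For a rough sanity check of that figure (the exact number depends on the text and the compressor; a general-purpose compressor like LZMA lands somewhat above the oft-quoted ~1 bit per letter):

    # Compress some English text losslessly and count output bits per input character.
    import lzma

    text = open("romeo_and_juliet.txt", "rb").read()   # hypothetical local copy of the play
    packed = lzma.compress(text, preset=9)
    print(f"{8 * len(packed) / len(text):.2f} bits per character")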
The lesson here being that compressing English text as suggested in the article is rather meaningless, and can be done much, much better. When the article asks at the end:
> We're sensitive to data loss in text form: we can only consume a few dozens of bytes per second, and so any error is obvious.
> Conversely, we're almost blind to it in pictures and images: and so losing quality doesn't bother us all that much.
> Should it?
My answer is "mu". I wanted to answer "no", but that would mean I even accept the question as valid.
EDIT: s/in this way/as suggested in the article/ for clarity.
I suppose my confusion arises from the fact that pretty much any practical compression scheme is essentially 'content adaptive' rather than 'standard codebooky'. In the lossy cases, add whatever magical filtering to toss out the perceptually irrelevant frequencies.
I agree with you that the article misunderstands compression and coding in general I just couldn't quite figure out what one of your counter-examples was about.
I found that when I tried to use up goer five speak (e.g. http://m50d.github.io/2013/01/18/girl-with-not-real-power.ht...) I used many more words to express the same number of concepts. So I'd be interested to see how much more or less compressible it actually makes things.
Compression depends on the data being filled with repeated pieces. Lossy compression improves the hit rate of repeated pieces by replacing rare pieces with similar, common pieces. In the case of colours, naive lossy compression means "Apricot" and "Burnt Sienna" both become "Orange", which becomes "Org". "Chartreuse" and "Pistachio" both become "Lime", which gets abbreviated to "Lm". A final list of all the colours gets tallied at the end, and colours like "Orange" and "Lime" may get abbreviated all the way down to "O" and "L".
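In toy code form (the colour groupings and abbreviations are made up purely for illustration):

    # Lossy step: rare colours are replaced by similar, common ones.
    # Lossless step: the now-frequent names get short codes, shortest for the commonest.
    from collections import Counter

    LOSSY_MAP = {"apricot": "orange", "burnt sienna": "orange",
                 "chartreuse": "lime", "pistachio": "lime"}

    palette = ["apricot", "burnt sienna", "orange", "chartreuse", "pistachio"]
    reduced = [LOSSY_MAP.get(c, c) for c in palette]

    ranked = [c for c, _ in Counter(reduced).most_common()]
    codes = {c: c[:i + 1] for i, c in enumerate(ranked)}        # "orange" -> "o", "lime" -> "li"

    print(reduced)                        # ['orange', 'orange', 'orange', 'lime', 'lime']
    print([codes[c] for c in reduced])    # ['o', 'o', 'o', 'li', 'li']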
> "The point isn't how many words you're using, but how repetitiously those words appear."
Not under the compression scheme barakm is proposing, which is to fix a universal dictionary (the commonest 1k or 64k English words across some very wide corpus), so that you don't have to transmit it.
Under the 64k scheme, every word is 2B, but you probably have to use a few more words. It's a naive scheme, but a very comprehensible one, and relatively useful for shorter texts.
As an alternative, sure, you can make a custom dictionary, in which case you can use whatever words you like, as long as you can keep the distinct number down, and ideally heavily-repeat certain words. This results in a larger filesize for small messages, because you have to start by sending the dictionary, but a much better filesize for nearly any large message.
Or, hmn. I guess you could allow recursive compression, or back-references, or something, so that you could compress repetition of word-sequences.
And, hmn, that might not always work out. Well, look, you could allocate the first byte of the message to switching between compression schemes, so that you can choose between 256 different schemes, and choose the best one.
Oh look, if I keep this up for another hour or two, maybe I'll catch up to the state of the art from before I was born. No, I don't know what wavelets are, nor who Fourier is, why do you ask?
Anyway, popping out of fake-naive mode, my point is that Huffman compression isn't the only compression technique.
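To put rough numbers on the fixed-versus-custom-dictionary trade-off from a few paragraphs up (very naive accounting, and the sample texts are arbitrary; real codecs do far better on both sides):

    # Custom dictionary: send the word list itself, then index into it with just
    # enough bits per word. Compare against the naive fixed 2-bytes-per-word scheme.
    import math

    def custom_dict_size(text):
        words = text.split()
        dictionary = sorted(set(words))
        dict_bytes = sum(len(w) + 1 for w in dictionary)             # the word list, with separators
        bits_per_word = max(1, math.ceil(math.log2(len(dictionary))))
        return dict_bytes + math.ceil(bits_per_word * len(words) / 8)

    short_text = "wherefore art thou romeo"
    long_text = " ".join(["but soft what light through yonder window breaks"] * 500)

    for name, text in [("short", short_text), ("long", long_text)]:
        print(f"{name}: raw {len(text)} B, fixed-dict {2 * len(text.split())} B, "
              f"custom-dict {custom_dict_size(text)} B")

The short message pays for shipping the dictionary; the long, repetitive one wins big with it.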
Shakespeare is particularly evil in this regard, given the terms he coined. Some we have picked up and others haven't made it to the modern age.
But, extra credit for compressing such that it still pentas its iambic meter. :-)
It would be fun to take arbitrary slices of force-ranked words and constrain composition likewise. Words of ordinal frequency 1500-3000 would create a bit of a challenge along the lines of writing a book without a single letter e. (http://books.google.com/books/about/Gadsby.html?id=jG1JU82Bd...)
Or, conversely: in images the data bits are the pixels, while in text the data bits are two steps removed from the words (one step is a character encoding, the second is a language).
Hello there! I'm Tom. I made this. I didn't expect it to go this far. A few notes for you, which I've also added to the original site:
Yes, I'm aware this is pretty meaningless - my original tweet was "Not sure if it's net-art or just a load of guff." It seems to have sparked some discussion, though.
No, I'm not selling the books. I try not to create too many useless physical objects. If you're hell bent on it, or want to see the full text, I've uploaded all the source files and there's a link at the bottom of the page. I ask that you don't sell anything you make with it, although given I'm ripping off Shakespeare and the JPEG algorithm that's probably a bit rich.
If anyone wants to commission a professionally bound hardcover set for some net art exhibit, let me know. I'll happily pretend to be a serious artist.
The whole reason this analogy breaks down is that letters are independent symbols while levels in an image represent a continuum. There's no mathematical correspondence among letters to exploit, only an accident of encoding. Apples to oranges.
That said, here's a particular rant transformed as a whole signal with the DCT and quantized a bit: http://pastebin.com/B5zS8W8f
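Something along those lines (not necessarily the parent's exact recipe) is only a few lines of SciPy; the quantization step size here is arbitrary:

    # Treat the text's byte values as a 1-D signal, take a DCT, quantize the
    # coefficients (the lossy step), invert, and round back to printable bytes.
    import numpy as np
    from scipy.fft import dct, idct

    text = "Now is the winter of our discontent made glorious summer by this sun of York."
    signal = np.frombuffer(text.encode("ascii"), dtype=np.uint8).astype(float)

    coeffs = dct(signal, norm="ortho")
    step = 20.0                                   # bigger step = more loss
    quantized = np.round(coeffs / step) * step    # this is where information is thrown away

    restored = idct(quantized, norm="ortho")
    garbled = bytes(np.clip(np.round(restored), 32, 126).astype(np.uint8)).decode("ascii")
    print(garbled)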
I'm not sure you should draw any meaning at all from this, whether technical or philosophical.
JPEG is an image specific compression format, and works by removing visually insignificant detail. Applying the same system blindly to an unrelated byte-stream and then looking at the before and after isn't meaningful.
Those objections aside, it's a bit silly from a technical point of view - the text was read in as a RAW image format. As I understand it, RAW has to be converted into a bitmap image using a camera-specific algorithm, so before he's even started compressing it the data has been munged; the data is then compressed and output as a completely different file format. I'm surprised he got anything recognisable at all.
Putting bitmap type file headers on the text data, then jpeg compressing it, and resaving it as a bitmap (and removing the headers), might produce something slightly more meaningful. I might try it later - curious to find out what Singular Value Decomposition looks like on text.
Actually, could just skip out Photoshop...
cracks knuckles and fires up Python
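Presumably something like this (a minimal Pillow sketch; packing three bytes per RGB pixel is a guess at the article's method, and the width, padding, and quality setting are arbitrary):

    # Round-trip text through JPEG: pack the bytes three-per-pixel into an RGB
    # image, save at a chosen quality (the lossy step), reload, and unpack.
    import io
    from PIL import Image

    text = open("romeo_and_juliet.txt", "rb").read()     # hypothetical local copy of the play
    width = 256
    height = -(-len(text) // (3 * width))                # ceil division: rows needed
    padded = text.ljust(3 * width * height, b" ")        # pad so the last row is full

    img = Image.frombytes("RGB", (width, height), padded)

    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=25)
    buf.seek(0)
    mangled = Image.open(buf).tobytes()

    print(mangled[:90].decode("latin-1"))                # latin-1 so every byte value decodes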
Only some camera manufacturers use a proprietary RAW format. There is OpenRAW, for instance. And even if he didn't use that, I don't think it would matter, except to the image generated. Text -> mung -> image -> unmung -> text. You should be able to do any format, assuming the conversion opens and saves into the same format.
Anyway, I gathered that the guy was precisely trying to show that lossy compression was a good thing when used in the right manner, and bad in the wrong manner.
But frankly, I don't care if he was making a point or not. It was a funny read, and having books printed of his various outcomes was a thoroughly wasteful, yet funny idea. So I guess even though I didn't learn anything, I was at least entertained.
> As I understand it, RAW has to be converted into a bitmap image using a camera specific algorithm, so before he's even started compressing it the data has been munged, the data is then compressed and output as a completely different file format
That depends on whether you are talking to a photographer or a Photoshop user :)
If you open the file as "Camera RAW" then it indeed goes through photographic processing (debayering, etc.), but if you use "Photoshop RAW" then the opening is essentially lossless.
Yeah, it would be more interesting if he had compressed the text with an algorithm that replaces similar-sounding letters, using rules like those in the SoundEx algorithm.
That's an interesting idea, and it's cool that he took it as far as he did, but I'm surprised by the last paragraph:
> We're sensitive to data loss in text form: we can only consume a few dozens of bytes per second, and so any error is obvious. Conversely, we're almost blind to it in pictures and images: and so losing quality doesn't bother us all that much.
I'm not aware of any "lossy" encodings for text, but I assume they'd do pretty well on Shakespeare and pretty terrible on images encoded as text. It's just that the techniques are tuned for the standard data, so most things you smuggle through are going to be "weird" and suffer more than usual.
Lossy encodings for text are easy. I can compress the entirety of 'Romeo And Juliet' to this:
"Boy and girl fall in love, they both die."
Yes, it's lost a bit of detail and resolution, but it's significantly more compact than the original, while retaining somewhat the same recognizable message. All we need next is a computer algorithm that generates TLDR.
You could probably quite effectively do 'lossy' compression on text given enough knowledge of the language. Compress words with similar meanings into each other, automatically select variants of words based on context (plurality, etc), automatically insert punctuation, and so on.
The result would probably look like it was written by someone with an incomplete knowledge of English (or by a machine) but might end up being quite readable.
Of course, on the other hand, you can just compress the text with LZMA :)
The easiest/quickest lossy text compression that could theoretically almost halve the size of your data is to drop capitalization (also assuming you're using some minimal encoding that can only encode the characters you're using). There is some semantic info that is imparted by capitals, but not much.
No, that would not cut the length in half, or anything close to it.
One form of "minimal encoding" you're describing would be to use arithmetic coding where the predictor simply assigns equal probability to each symbol actually used in the input. For example, if you encode a 1000 letter message which includes at least one instance of every letter (both upper and lower case), then you'd have 52 symbols, and your simple prediction step would simply assign 1/52 as the probability for each symbol. This will result in a constant 5.7 bits of output per symbol. So your 1000 letter message would encode at 5700 bits.
Now, suppose you discard capitalization. Now there are 26 symbols, each assigned a probability of 1/26, resulting in a constant 4.7 bits per symbol. Your 1000 letter message will now encode at 4700 bits. So you only save about 17% in this very generously constructed case.
The mistake you've made is to assume that cutting the number of symbols in half will cut the symbol length in half. In fact, reducing the number of symbols by half merely reduces the symbol length by a single bit. For example, you need 8 bits to specify between 256 possibilities, but if you cut the possibilities down to 128, you still need 7 bits.
In addition, lossy compression is usually combined with lossless compression, and a smart predictor could use context to reduce the cost of capital letters to much less than one bit per symbol (assuming that the text has a predictable structure, like human language).
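The arithmetic spelled out:

    # Halving the alphabet only shaves one bit per symbol, not half the length.
    from math import log2

    n = 1000
    mixed = n * log2(52)     # upper + lower case: ~5.70 bits/symbol -> ~5700 bits
    lower = n * log2(26)     # lower case only:   ~4.70 bits/symbol -> ~4700 bits
    print(f"{mixed:.0f} vs {lower:.0f} bits: {1 - lower / mixed:.1%} saved")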
it's comomn kdownlege in inoitrmfaon thoery ciecrls taht you can reoerdr lttres in wrods and the relsut is pecfretly undeastnrdlabe as lnog as you keep the fsirt and lsat letrtes the smae.
I've written this page, to try to explain the science behind this meme. There are elements of truth in this, but also some things which scientists studying the psychology of language (psycholinguists) know to be incorrect.
The page also contains a number of counter-examples, showing that the result often is not "pflteecry uaaedlbdrsntne".
But such features could be exploited for lossy compression. If humans can recognize words with some of the letters randomized, then there is no need to store those letters individually. Instead, just insert random letters at the decompression phase.
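A toy version of that idea (the "compression" here only discards the ordering of the interior letters; a real scheme would still need to pack that saving into fewer bits):

    # Keep the first and last letters, store the middle in a canonical (sorted)
    # order so the ordering information is gone, then re-insert the middle
    # letters in random order when "decompressing".
    import random

    def squash(word):
        if len(word) <= 3:
            return word
        return word[0] + "".join(sorted(word[1:-1])) + word[-1]

    def unsquash(word):
        if len(word) <= 3:
            return word
        middle = list(word[1:-1])
        random.shuffle(middle)
        return word[0] + "".join(middle) + word[-1]

    sentence = "information theory circles know this meme well"
    squashed = [squash(w) for w in sentence.split()]
    print(" ".join(squashed))
    print(" ".join(unsquash(w) for w in squashed))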
In a certain sense, that's what Reader's Digest Condensed Books are all about - lossy text compression.
But we let machines do it all the time too. Go from UTF-8 to ASCII-7. You won't lose much in English - orientation of quotation marks, distinction among various types of dashes. Naïve and Café and El Niño and other words borrowed from other languages and commonly spelt with their native character set would also need to be spelt differently.
You could drop down to 5 bits per character, giving you only upper case letters and room for 6 punctuations. Numerals would need to be spelt out, which means you could no longer distinguish among these: 1, One, ONE, one.
You can also find lots of lossy text compression, focused on minimizing characters, on Twitter or in text messages.
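A sketch of the 5-bit packing mentioned above (which six extra symbols make the cut is arbitrary):

    # 32 symbols fit in 5 bits: 26 uppercase letters plus space and five punctuation marks.
    ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ .,'?!"        # exactly 32 symbols
    CODE = {ch: i for i, ch in enumerate(ALPHABET)}

    def pack(text):
        bits = "".join(f"{CODE[ch]:05b}" for ch in text)
        bits += "0" * (-len(bits) % 8)                   # pad out the final byte
        return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

    msg = "ONE, NOT ONE HUNDRED"                         # numerals spelt out, as noted above
    print(len(msg), "chars ->", len(pack(msg)), "bytes") # 20 chars -> 13 bytes (5/8 the size)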
I hate it when I open a document and there's non-ASCII characters cluttering it up just because someone wanted magic quote marks. If you must have oriented quotes, you should do them the ``LaTeX way,'' so you don't have to get into Unicode, which sucks [1].
Exactly - the whole point of lossy compression is to compress it in a way that has a minimal effect on the type of data you're trying to compress while maximizing the compression.
Using a compression method on a different data set is unsurprisingly ineffective.
We're not "blind" to the changes in images, but the algorithm is designed that the changes are minimal when in image form.
Yes, I'm sure I could come up with a lossy compression scheme for text. There was a popular meme going around for a while of sorting all but the first and last characters of a word.
Interesting idea. So basically any word with greater than 3 characters could be converted into beginning / end char and a number representing the number of chars in between.
I wonder how much of our ability to read text like that is based on context though. Also, I'm going to assume that this only works with fairly common words (which is unique for each person). For example, I'm guessing running said compression on the works of Shakespeare wouldn't work as well, since most people struggle to make sense of them to begin with.
Your claim that the sorted version keeps all the information is false -- it is impossible to reverse the transformation.
I'd love to see this with the Burrows-Wheeler Transform (suffix sorting), which doesn't actually lose any information -- it might even be possible for a human to (very) slowly read it.
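For the curious, the naive (quadratic) version of that transform is tiny; the point is that, unlike sorting a word's letters, it is fully reversible:

    # Naive Burrows-Wheeler transform: sort all rotations of the string (with a
    # unique end marker) and take the last column. Similar contexts end up
    # adjacent, which is what makes the result compress well -- and it inverts.
    def bwt(s, end="\x00"):
        s += end
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def inverse_bwt(last, end="\x00"):
        table = [""] * len(last)
        for _ in range(len(last)):
            table = sorted(last[i] + table[i] for i in range(len(last)))
        return next(row for row in table if row.endswith(end))[:-1]

    line = "but soft what light through yonder window breaks"
    transformed = bwt(line)
    print(repr(transformed))
    assert inverse_bwt(transformed) == line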
"we can only consume a few dozens of bytes per second, and so any error is obvious."
That's not the point at all. The English language itself is already heavily compressed, by which I mean the space of all possible words is already densely packed. This is why we can understand misspelled words that have no particularly close neighbors, yet are nonetheless sensitive to misspellings in general.
The act of writing something down in language compresses it, significantly changing properties of the space, such as the comparability of neighbors.
So, this is a fun bit of tinkering and all, but I'd be more interested to see just how small HN could losslessly compress the Gutenberg plain text UTF8 file of Romeo and Juliet.
Yeah. You really need this caveat to avoid the loophole where you can compress anything to zero bytes by making the desired output a literal in your decompressor (which just ignores its input and prints the hard-coded literal).
This could be considered a real-world application of "hello world" programs: Winning compression prize contests by blatantly abusing the rules.
If even max quality has that many single-character shifts, I'm worried that this is barely demonstrating compression at all, and is instead mostly demonstrating color space conversion.
??? JPEG exploits the redundant (or negligible) information within 8x8 blocks. I do not see how text amounts to 8x8 blocks with negligible information inside them...
It's not. I think it would have been more insightful if they had output the text file to WAV instead, and then compared the audio quality of an MP3-compressed version of the WAV file. Bring the Noise. Only true audiophiles can hear the difference.
Next up, the startling results of outputting a text file to random locations in RAM... Basically the author was just bored.
They say a picture is worth a thousand words. The more damage you do to a picture, the lower that thousand drops, but it still retains a lot. A word is worth just one, so even changing it slightly can remove all of its meaning.
He describes the process: treat text file as RAW image file, read into picture editor, save as .jpg with varying compression. Abuse your preferred example text accordingly.
> We're sensitive to data loss in text form: we can only consume a few dozens of bytes per second, and so any error is obvious.
We've also optimized text to be pretty robust in its native form, such that all of those bits have a high chance of getting through successfully; for example, no writing system requires color to carry semantic weight. Therefore, text can be displayed successfully with monochrome technology, and is robust against color changes as long as contrast is preserved.
Also, every writing system has a fixed repertoire of atomic units (characters), even if some systems have a huge number of them. This makes writing robust against small changes, as they're more likely to turn a well-formed character into a badly-formed version of the same character, as opposed to a different character entirely. This, of course, is where JPEG breaks the rules somewhat, because its idea of a 'small change', when applied to a text file, will always produce an entirely different character, usually with no obvious relationship to the original. The properties are entirely different.
> Conversely, we're almost blind to it in pictures and images: and so losing quality doesn't bother us all that much.
Images are captured and displayed largely irrespective of human perceptual limits. Lossy compression is, in fact, a way to recognize the limits of human perception and take them into account by throwing away information likely to be ignored anyway. For example, human hearing tops out at around 22 kHz, so a sampling rate somewhere north of 44 kHz will, mathematically, capture everything about the sound humans are interested in. Had we canine perception, our sampling rates would be higher, but we could use fewer bits to encode the colors in color photographs.
Text is inefficient in its uncompressed form, but human perception of that type has little to do with that. It's inefficient because our language is redundant, which is a boon when we're trying to talk in a crowded café and must accept some noise along with the signal.
A better test would be to take a picture of a sign, then edge-detect to bring out the letter forms at maximum contrast, pound it down to monochrome, and then compress that at maximum lossage. I guarantee it would still be readable.
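Roughly, with Pillow (the threshold and quality values are arbitrary, and the input photo is hypothetical):

    # Edge-detect a photo of a sign, crush it to monochrome, then JPEG it at
    # the lowest quality setting and see whether the letter forms survive.
    from PIL import Image, ImageFilter

    sign = Image.open("sign_photo.jpg").convert("L")          # hypothetical input photo
    edges = sign.filter(ImageFilter.FIND_EDGES)
    mono = edges.point(lambda p: 255 if p > 40 else 0)        # maximum contrast, two levels
    mono.save("sign_crushed.jpg", quality=1)                  # maximum lossage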
This "experiment" is garbage. No description of test methodology other than "then outputted the compressed file back to plain text."
Last time I had to use Omnipage to convert some highly compressed JPEGs, it blew me away with its accuracy. Not only did it get the text nearly 100% right, it often got the tables right too. The only thing it struggled with was advanced mathematical symbols.
The methodology seems obvious - interpret every three bytes in the text as a pixel in an image, compress the image, then reinterpret each (post-compression) pixel as three ASCII characters.
Thanks. Great explanation. Considering that the ASCII character map is arbitrary and there are so many great ways to test lossy compression algorithms (PSNR to A/B testing), this seems like a pointless exercise, but at least I see what the author was trying to do now.
If I wanted to demonstrate lossy compression, I'd use Image Subtraction plugin in GIMP, not ASCII.