
Shakespeare.txt.jpg: A JPEG compression experiment - pavel_lishin
http://www.tomscott.com/romeo/
======
barakm
As some comments are arriving below -- lossy compression does not work that
way!

Interpreting one set of data as another is a clever trick for puzzles
(thematic, for example, to:
[http://web.mit.edu/puzzle/www/2013/coinheist.com/oceans_11/t...](http://web.mit.edu/puzzle/www/2013/coinheist.com/oceans_11/this_page_intentionally_left_blank/index.html)
\-- but that's only a hint, the solution is still lots of fun!) but is kind of
wrong in practice.

Compression is about conveying a similar 'message' while removing the stuff
that just doesn't compress well. In signals such as pictures and audio, this
is often seen/heard in high frequency detail.

To the layman, the simplest lossy compression of an image is to resize it by a
half in each dimension and then stretch the output. (Not unlike retina-display
artifacts you may have seen). Assuming you do it right (with an averaging
filter), you've just removed the upper half of the representable frequencies.
It just so happens that most of what the eye cares about isn't in those
ranges, and so the image still looks decent. A little blocky if you stretch
it, but still pretty good. JPEG is just a more technically advanced version of
similar tricks.
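
In code, the halve-and-stretch trick is about this much (a numpy sketch of the
idea, not of anything JPEG actually does; 2x2 block averaging plays the
averaging filter, np.repeat plays the blocky stretch):

    import numpy as np

    def halve_then_stretch(img):
        """img: 2-D array with even dimensions, values 0-255."""
        h, w = img.shape
        # average each 2x2 block -> half-resolution "compressed" image
        small = img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        # stretch back up by repeating each pixel 2x2 -> blocky reconstruction
        return np.repeat(np.repeat(small, 2, axis=0), 2, axis=1)

    demo = np.random.randint(0, 256, (8, 8)).astype(float)
    print(np.abs(demo - halve_then_stretch(demo)).mean())  # reconstruction error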

A better analogy of lossy compression on text is actually xkcd's Up Goer Five
([http://xkcd.com/1133/](http://xkcd.com/1133/)) -- using only the "ten
hundred" most common words, you can still convey concepts (albeit
simplistically) and stay under ten bits a word, or a little less than two
ascii characters per. If you were to map the dictionary into "up-goer five"
speak, you could compress Shakespeare pretty well. It would lose a lot in the
translation -- but so does a heavily-compressed image. If you limit yourself
to the first 64k words, you have a much larger vocabulary, still limited, and
still fitting within two bytes per word. Though you may have to use more
words, a word like "wherefore" still counts for four words' worth of
compressed space when replaced by "why". Tradeoffs!

That'd be a curious hack -- sort the words in Romeo and Juliet by word
frequency. Rewrite the least common words in terms of the other words.
Compress.
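
The boring half of that hack (skipping the rewriting step) is only a few lines
-- a sketch, with the file name and whitespace tokenisation as placeholders.
Built from the text itself it's lossless; capping or standardising the list is
what makes it lossy:

    from collections import Counter
    import struct

    words = open("romeo_and_juliet.txt").read().lower().split()
    ranked = [w for w, _ in Counter(words).most_common(65536)]
    index = {w: i for i, w in enumerate(ranked)}

    # each word becomes a 2-byte big-endian index into the frequency-ranked list
    encoded = b"".join(struct.pack(">H", index[w]) for w in words)
    print(len(" ".join(words)), "bytes of words ->", len(encoded), "bytes encoded")

    # decoding is just the table lookup in reverse
    decoded = " ".join(ranked[i] for (i,) in struct.iter_unpack(">H", encoded))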

~~~
glurgh
_If you limit yourself to the first 64k words, you have a much larger
vocabulary, still limited, and still fitting within two bytes per word._

Maybe I'm misunderstanding what you're saying but if you're compressing
adaptively by building a dictionary of the target material, 64k words would
give you lossless compression of Shakespeare - the unique words in the
collected works (with some wiggle room for 'word forms') are in the 30-40k
range.

~~~
barakm
True! You could do that too. Sending the codebook is a cromulent strategy.
It'd be lossless too. That's (to a first order) how LZ works.

I was arguing you can use a standard codebook (some agreed-upon top-N list)
instead. I imagine if you did you'd lose a couple words ("wherefore" springs
to mind, as it's not in common English usage) but by and large you'd have just
about everything.

Technically, you can also have a lossless JPEG (or at least, within epsilon),
but that's not why it's done.

More importantly, the entropy of English text isn't really that high -- ~ 1
bit per letter -- which means lossless compression works pretty well on it.
Images less so.
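
Easy enough to sanity-check with a stock coder; the file name below is a
placeholder, and a dedicated text model (PPM, the Hutter Prize entries) gets
much closer to ~1 bit/char than general-purpose LZMA will:

    import lzma

    raw = open("romeo_and_juliet.txt", "rb").read()
    packed = lzma.compress(raw, preset=9)
    print(8 * len(packed) / len(raw), "bits per character")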

The lesson here being that compressing English text as suggested in the
article is rather meaningless, and can be done much, much better. When the
article asks at the end:

> We're sensitive to data loss in text form: we can only consume a few dozens
> of bytes per second, and so any error is obvious.

> Conversely, we're almost blind to it in pictures and images: and so losing
> quality doesn't bother us all that much.

> Should it?

My answer is "mu". I wanted to answer "no", but that would mean I even accept
the question as valid.

EDIT: s/in this way/as suggested in the article/ for clarity.

~~~
glurgh
I suppose my confusion arises from the fact that pretty much any practical
compression scheme is essentially 'content adaptive' rather than 'standard
codebooky'. In the lossy cases, add whatever magical filtering to toss out the
perceptually irrelevant frequencies.

I agree with you that the article misunderstands compression and coding in
general; I just couldn't quite figure out what one of your counter-examples
was about.

------
tomscott
Hello there! I'm Tom. I made this. I didn't expect it to go this far. A few
notes for you, which I've also added to the original site:

Yes, I'm aware this is pretty meaningless - my original tweet was "Not sure if
it's net-art or just a load of guff." It seems to have sparked some
discussion, though.

No, I'm not selling the books. I try not to create too many useless physical
objects. If you're hell bent on it, or want to see the full text, I've
uploaded all the source files and there's a link at the bottom of the page. I
ask that you don't sell anything you make with it, although given I'm ripping
off Shakespeare and the JPEG algorithm that's probably a bit rich.

If anyone wants to commission a professionally bound hardcover set for some
net art exhibit, let me know. I'll happily pretend to be a serious artist.

------
0x09
The whole reason this analogy breaks down is that letters are independent
symbols while levels in an image represent a continuum. There's no
mathematical correspondence among letters to exploit, only an accident of
encoding. Apples to oranges.

That said, here's a particular rant transformed as a whole signal with the DCT
and quantized a bit:
[http://pastebin.com/B5zS8W8f](http://pastebin.com/B5zS8W8f)
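
If you want to mangle your own text the same way, roughly (scipy's DCT and an
arbitrary quantization step -- a sketch, not the exact settings behind that
paste):

    import numpy as np
    from scipy.fft import dct, idct

    text = b"Whole-signal DCT on text, quantized a bit, then inverted."
    signal = np.frombuffer(text, dtype=np.uint8).astype(float)

    q = 8.0                                    # crank this up for more damage
    coeffs = np.round(dct(signal, norm="ortho") / q) * q
    rebuilt = np.round(idct(coeffs, norm="ortho"))
    mangled = np.clip(rebuilt, 32, 126).astype(np.uint8)  # keep it printable
    print(mangled.tobytes().decode("ascii"))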

------
rkangel
I'm not sure you should draw any meaning at all from this, whether technical
or philosophical. JPEG is an image specific compression format, and works by
removing visually insignificant detail. Applying the same system blindly to an
unrelated byte-stream and then looking at the before and after isn't
meaningful.

Those objections aside, it's a bit silly from a technical point of view - the
text was read in as a RAW image format. As I understand it, RAW has to be
converted into a bitmap image using a camera-specific algorithm, so before
he's even started compressing it the data has already been munged; the data is
then compressed and output as a completely different file format. I'm
surprised he got anything recognisable at all.

Putting bitmap type file headers on the text data, then jpeg compressing it,
and resaving it as a bitmap (and removing the headers), might produce
something slightly more meaningful. I might try it later - curious to find out
what Singular Value Decomposition looks like on text.

Actually, could just skip out Photoshop... _cracks knuckles and fires up
Python_
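
Something along these lines, presumably (file name, matrix width and the rank
k are all arbitrary knobs; untested sketch of the SVD version):

    import numpy as np

    text = open("romeo_and_juliet.txt", "rb").read()   # placeholder file name
    width = 64
    rows = len(text) // width
    mat = np.frombuffer(text[:rows * width], dtype=np.uint8)
    mat = mat.reshape(rows, width).astype(float)

    u, s, vt = np.linalg.svd(mat, full_matrices=False)
    k = 16                                             # the "quality" setting
    approx = (u[:, :k] * s[:k]) @ vt[:k, :]
    mangled = np.clip(np.round(approx), 32, 126).astype(np.uint8).tobytes()
    print(mangled[:200].decode("ascii"))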

~~~
spyder
Yeah, it would be more interesting if he had compressed the text with an
algorithm that replaces similar-sounding letters, using rules like those in
the SoundEx algorithm.
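
For reference, a minimal Soundex in Python (the classic rules, minus the h/w
merging special case), to show the kind of "similar sounds collapse together"
coding I mean:

    CODES = {c: d for d, letters in enumerate(
        ["aeiouyhw", "bfpv", "cgjkqsxz", "dt", "l", "mn", "r"]) for c in letters}

    def soundex(word):
        word = word.lower()
        digits = [str(CODES.get(c, 0)) for c in word if c.isalpha()]
        out = word[0].upper()
        for prev, cur in zip(digits, digits[1:]):
            # drop vowel-ish letters and runs of the same code
            if cur != "0" and cur != prev:
                out += cur
        return (out + "000")[:4]

    print(soundex("Robert"), soundex("Rupert"))   # both collapse to R163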

~~~
TylerE
Or maybe a zip-style algorithm, but as the "quality" setting gets lower,
unique/low-frequency sequences are replaced by the closest common sequence.
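
A toy version of that at word granularity, with difflib standing in for a real
notion of "closest" (sketch only):

    from collections import Counter
    import difflib

    def lossy(words, quality):
        counts = Counter(words)
        common = [w for w, c in counts.items() if c >= quality]
        out = []
        for w in words:
            if counts[w] >= quality:
                out.append(w)
            else:
                # snap rare words to whichever common word looks closest
                match = difflib.get_close_matches(w, common, n=1, cutoff=0.0)
                out.append(match[0] if match else w)
        return out

    sample = "the cat sat on the mat the cat ate the rat".split()
    print(" ".join(lossy(sample, 2)))   # rare words get replaced by 'the'/'cat'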

------
ZeroGravitas
That's an interesting idea, and it's cool that he took it as far as he did,
but I'm surprised by the last paragraph:

_We're sensitive to data loss in text form: we can only consume a few dozens
of bytes per second, and so any error is obvious. Conversely, we're almost
blind to it in pictures and images: and so losing quality doesn't bother us
all that much._

I'm not aware of any "lossy" encodings for text, but I assume they'd do pretty
well on Shakespeare and pretty terrible on images encoded as text. It's just
that the techniques are tuned for the standard data, so most things you
smuggle through are going to be "weird" and suffer more than usual.

~~~
kevingadd
You could probably quite effectively do 'lossy' compression on text given
enough knowledge of the language. Compress words with similar meanings into
each other, automatically select variants of words based on context
(plurality, etc), automatically insert punctuation, and so on.

The result would probably look like it was written by someone with an
incomplete knowledge of English (or by a machine) but might end up being quite
readable.

Of course, on the other hand, you can just compress the text with LZMA :)
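
A very rough sketch of the "compress words with similar meanings into each
other" part, with WordNet standing in for "enough knowledge of the language"
(needs nltk and its wordnet data; plenty of caveats):

    import nltk
    nltk.download("wordnet", quiet=True)
    from nltk.corpus import wordnet as wn

    def canonical(word):
        synsets = wn.synsets(word)
        if not synsets:
            return word
        # replace the word with the first lemma of its most common sense
        return synsets[0].lemmas()[0].name()

    print(canonical("automobile"))   # likely collapses to "car"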

~~~
drewcrawford
it's comomn kdownlege in inoitrmfaon thoery ciecrls taht you can reoerdr
lttres in wrods and the relsut is pecfretly undeastnrdlabe as lnog as you keep
the fsirt and lsat letrtes the smae.

[http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4126...](http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4126417)

~~~
VMG
That is more about built-in redundancy and error correction than lossy
compression.

~~~
zokier
But such features could be exploited to make lossy compression. If humans can
recognize words with some of the letters randomized, then there is no need to
store those letters individually. Instead just insert random letters at the
decompression phase.
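
A milder toy codec along those lines: keep each word's first and last letters
and which letters sit in the middle, but throw away their order and shuffle
them back in at decompression time:

    import random

    def compress(word):
        if len(word) <= 3:
            return word
        # first letter, last letter, middle letters as an unordered bag
        return (word[0], word[-1], "".join(sorted(word[1:-1])))

    def decompress(token):
        if isinstance(token, str):
            return token
        first, last, middle = token
        middle = list(middle)
        random.shuffle(middle)
        return first + "".join(middle) + last

    words = "according to a researcher at Cambridge University".split()
    print(" ".join(decompress(compress(w)) for w in words))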

------
Harkins
I recently finished a similar project in computationally mutating a familiar
text: [http://www.wellsortedversion.com](http://www.wellsortedversion.com)

</shamelessplug>

~~~
Scaevolus
Your claim that the sorted version keeps all the information is false-- it is
impossible to reverse the transformation.

I'd love to see this with the Burrows-Wheeler-Transform (suffix sorting),
which doesn't actually lose any information-- it might even be possible for a
human to (very) slowly read it.
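
For anyone who wants to try reading it: a naive BWT (rotation sort, fine for
short strings; real implementations use suffix arrays) and its exact inverse:

    def bwt(s):
        s += "\0"                            # unique end-of-string sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return "".join(r[-1] for r in rotations)

    def ibwt(t):
        table = [""] * len(t)
        for _ in range(len(t)):
            table = sorted(c + row for c, row in zip(t, table))
        return next(row for row in table if row.endswith("\0"))[:-1]

    original = "wherefore art thou romeo"
    print(bwt(original))
    assert ibwt(bwt(original)) == original   # no information lost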

------
samineru
"we can only consume a few dozens of bytes per second, and so any error is
obvious."

That's not the point at all. The English language itself is already heavily
compressed, by which I mean the space of all possible words is already densely
packed. _This_ is why we can understand misspelled words that have no
particularly close neighbors, yet are nonetheless sensitive to misspellings in
general.

The act of writing something down in language compresses it, significantly
changing properties of the space, such as the comparability of neighbors.

------
DanBC
So, this is a fun bit of tinkering and all, but I'd be more interested to see
just how small HN could losslessly compress the Gutenberg plain text UTF8 file
of Romeo and Juliet.

([http://www.gutenberg.org/cache/epub/1112/pg1112.txt](http://www.gutenberg.org/cache/epub/1112/pg1112.txt))

Especially if you're using interesting techniques to build the dictionary.

EDIT: I got it down from 178983 to 44830 bytes, which isn't as impressive as I
was hoping for.

EDIT2: Currently at 43,666 bytes which is a bit better.

EDIT3: 40,125 bytes.
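
For reference, the stock general-purpose baselines are along these lines (I
haven't run them against that exact file, so the numbers may land above or
below 40k):

    import bz2, lzma, zlib, urllib.request

    url = "http://www.gutenberg.org/cache/epub/1112/pg1112.txt"
    raw = urllib.request.urlopen(url).read()

    print("original:", len(raw))
    print("zlib -9 :", len(zlib.compress(raw, 9)))
    print("bz2  -9 :", len(bz2.compress(raw, 9)))
    print("lzma    :", len(lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)))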

~~~
abecedarius
You'd want to count the size of the decoder, like
[http://en.wikipedia.org/wiki/Hutter_Prize](http://en.wikipedia.org/wiki/Hutter_Prize)

~~~
DanBC
Well, yes. Except the source text is tiny, so unfortunately all the software
is pretty big in comparison.

I love the Hutter Prize, and the various compression benchmark sites.

~~~
abecedarius
It's just pretty unconstrained without that kind of rule. And a decompressor
that doesn't have to be fast can be tiny.

------
Dylan16807
If even max quality has that many single-character shifts, I'm worried that
this is barely demonstrating compression at all, and is instead mostly
demonstrating color space conversion.

------
miniatureape
My brother did something similar a number of years ago.

[http://whaleycopter.blogspot.com/2007/09/project-jteg-compre...](http://whaleycopter.blogspot.com/2007/09/project-jteg-compression-format-for-ink.html)

It's interesting to see the same ideas pop up over and over again. (Note the
patent-pending message is a joke).

------
Qantourisc
??? JPEG exploits the redundant (or negligible) information in 8x8 blocks. I
do not see how text is made up of 8x8 blocks with negligible information
within each block...

~~~
lbenes
It's not. I think it would have been more insightful if they had output the
text file to WAV instead, and then compared the audio quality of an
MP3-compressed version of the WAV file. Bring the Noise. Only true audiophiles
can hear the difference.

Next up, the startling results of outputting a text file to random locations
in RAM... Basically the author was just bored.

------
film42
They say a picture is worth a thousand words. So the more damage you do to a
picture, the lower that thousand drops, but it still retains a lot. A word is
worth only one, so even changing it slightly removes all its meaning.

------
clebio
Very neat idea. Would love to have those bound copies....

Reminds me of Nethack, where text in the dirt degrades every time you walk
over it.

------
bcj
It's too bad the full text isn't posted anywhere. I'd love to take a more in-
depth look at the results.

~~~
ctdonath
He describes the process: treat text file as RAW image file, read into picture
editor, save as .jpg with varying compression. Abuse your preferred example
text accordingly.

------
derleth
> We're sensitive to data loss in text form: we can only consume a few dozens
> of bytes per second, and so any error is obvious.

We've also optimized text to be pretty robust in its native form, such that
all of those bits have a high chance of getting through successfully; for
example, no writing system requires color to carry semantic weight. Therefore,
text can be displayed successfully with monochrome technology, and is robust
against color changes as long as contrast is preserved.

Also, every writing system has a fixed repertoire of atomic units
(characters), even if some systems have a huge number of them. This makes
writing robust against small changes as they're more likely to turn a well-
formed character into a badly-formed version of the same character, as opposed
to a different character entirely. This, of course, is where JPEG breaks the
rules somewhat, because its idea of a 'small change', when applied to a text
file, will _always_ produce an entirely different character, usually with no
obvious relationship to the original. The properties are entirely different.

> Conversely, we're almost blind to it in pictures and images: and so losing
> quality doesn't bother us all that much.

Images are captured and displayed largely irrespective of human perceptual
limits. Lossy compression is, in fact, a way to recognize the limits of human
perception and take them into account by throwing away information likely to
be ignored anyway. For example, human hearing tops out at around 22 kHz, so a
sampling rate somewhere north of 44 kHz will, mathematically, capture
everything about the sound humans are interested in. Had we canine perception,
our sampling rates would be higher, but we could use fewer bits to encode the
colors in color photographs.

Text is inefficient in its uncompressed form, but human perception of that
type has little to do with that. It's inefficient because our language is
redundant, which is a boon when we're trying to talk in a crowded café and
must accept some noise along with the signal.

A better test would be to take a picture of a sign, then edge-detect to bring
out the letter forms at maximum contrast, pound it down to monochrome, and
then compress that at maximum lossage. I guarantee it would still be readable.
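
Something like this with Pillow, if anyone wants to check (the photo file name
is a placeholder; quality=1 is roughly "maximum lossage" for Pillow's JPEG
encoder):

    from PIL import Image, ImageFilter

    sign = Image.open("sign_photo.jpg").convert("L")        # greyscale
    edges = sign.filter(ImageFilter.FIND_EDGES)              # bring out letter forms
    mono = edges.point(lambda p: 255 if p > 32 else 0)       # pound to two levels
    mono.save("sign_crushed.jpg", quality=1)                 # maximum lossage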

------
slacka
This "experiment" is garbage. No description of test methodology other than
"then outputted the compressed file back to plain text."

Last time I had to use Omnipage to convert some highly compressed .JPEGs, it
blew me away with its accuracy. Not only did it get the text nearly 100%, it
often got the tables right too. The only thing it struggled with was advanced
mathematical symbols.

~~~
xsmasher
The methodology seems obvious - interpret every three bytes in the text as a
pixel in an image, compress the image, then reinterpret each (post-
compression) pixel as three ASCII characters.

This is not about OCR.
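
In Python the round trip would look roughly like this (numpy + Pillow; the
width and the truncation are my choices -- the original post used Photoshop's
raw import instead):

    import io
    import numpy as np
    from PIL import Image

    text = open("romeo_and_juliet.txt", "rb").read()       # placeholder name
    width = 256
    rows = len(text) // (3 * width)
    pixels = np.frombuffer(text[:rows * width * 3], dtype=np.uint8)
    pixels = pixels.reshape(rows, width, 3)                 # 3 bytes -> 1 RGB pixel

    buf = io.BytesIO()
    Image.fromarray(pixels, "RGB").save(buf, "JPEG", quality=25)
    buf.seek(0)
    mangled = np.asarray(Image.open(buf)).tobytes()         # pixels back to bytes

    print(mangled[:120].decode("ascii", errors="replace"))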

~~~
slacka
Thanks. Great explanation. Considering that the ASCII character map is
arbitrary and there are so many great ways to test lossy compression
algorithms (PSNR to A/B testing), this seems like a pointless exercise, but at
least I see what the author was trying to do now.

If I wanted to demonstrate lossy compression, I'd use the Image Subtraction
plugin in GIMP, not ASCII.

