

A Tweet is Worth (at Least) 140 Words With this Compression Algorithm - Umalu
http://thevirtuosi.blogspot.com/2011/08/tweet-is-worth-at-least-140-words.html

======
waitwhat
A hack for the sake of doing a hack.

If you _really_ want to compress as many words as you want into a tweet, just
include a link: <http://www.example.com/really-long-article.txt>

~~~
Nick_C
Heh, I never realised IANA supports example.com and example.org. Did you all
know that already and forget to tell me?

------
waffle_ss
I think the most realistic way to compress a tweet would be to replace words
like "before" with "b4", "too"/"to" with "2", reduce whitespace (e.g. double
spaces to single), and maybe start ripping out vowels ("vowels" -> "vwls").

Although not as efficient as the approach demonstrated above, this needs no
external dependencies; the content can be decompressed in place by the reader's
brain, at the slight cost of being harder to parse at a glance.
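
The substitution rules described above can be sketched in a few lines of
Python (the exact rules here are my own illustrative picks, not a standard
list):

```python
# Toy "netspeak" compressor along the lines waffle_ss describes;
# the substitution rules are illustrative, not exhaustive.
import re

RULES = [
    (r"\bbefore\b", "b4"),
    (r"\bto{1,2}\b", "2"),   # "to" and "too" both become "2"
    (r" {2,}", " "),         # collapse runs of spaces
]

def squeeze(text: str) -> str:
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text)
    return text

print(squeeze("too much  to read before lunch"))
# -> "2 much 2 read b4 lunch"
```

Vowel-stripping would be one more rule, though it needs an exception list to
stay readable.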

~~~
Jach
There's also a social status cost, which could be great or small, for example
if your readers have an intense dislike for netspeak and poor English.

~~~
FuzzyDunlop
Here in the UK it's interesting to note that, while text speak was all the
rage a few years ago, it faded out.

It was replaceddddd by making wordssss actuallllllly longerrrrrr for no
reeeeaaaasonnnnn!!!

Now the text speak is a bit more reined in and not totally incomprehensible
like it used to get.

~~~
hammock
Actually, making the words longer like that goes a long way toward expressing
emotion in your texts - while avoiding the use of emoticons, which make you
look like a girl.

For example, adding letters to a word is especially useful when teasing
someone - it gives them a hint that you're not 100% serious.

------
petercooper
I wondered how a naive approach would compare: Unicode has just over 2^20 code
points, so each character can carry roughly 20 bits, or 2800 bits over 140
characters. Standard ASCII is 7 bits per character. 2800/7 gives us a
potential 400 ASCII characters from the naive approach alone, a compression of
2.86x compared to the 5x he mentions.
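
That arithmetic can be sketched in Python (my own illustration, not the
article's scheme; the shift past the surrogate block is just to keep every
20-bit chunk a legal code point, and the decoder needs the original length
because the last chunk is padded):

```python
# Naive packing: concatenate 7-bit ASCII values into a bitstream,
# then emit one code point per 20-bit chunk.

def pack(text: str) -> str:
    bits = 0
    for ch in text:
        bits = (bits << 7) | ord(ch)
    nbits = 7 * len(text)
    chunks = []
    for shift in range(((nbits + 19) // 20 - 1) * 20, -1, -20):
        chunks.append((bits >> shift) & 0xFFFFF)
    # Shift values past the surrogate block (U+D800..U+DFFF) so
    # every chunk maps to a legal code point.
    return "".join(chr(v + 0x800 if v >= 0xD800 else v) for v in chunks)

def unpack(packed: str, length: int) -> str:
    bits = 0
    for ch in packed:
        v = ord(ch)
        bits = (bits << 20) | (v - 0x800 if v >= 0xE000 else v)
    # Peel off 7-bit ASCII values from the top of the bitstream.
    return "".join(chr((bits >> shift) & 0x7F)
                   for shift in range((length - 1) * 7, -1, -7))

packed = pack("Hello, world!")
print(len(packed), unpack(packed, 13))  # -> 5 Hello, world!
```

400 ASCII characters pack into exactly ceil(2800/20) = 140 code points, which
matches the 2.86x figure.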

------
blauwbilgorgel
Great hacking. If you like this topic, this should be relevant:
[http://stackoverflow.com/questions/891643/twitter-image-
enco...](http://stackoverflow.com/questions/891643/twitter-image-encoding-
challenge) (Compressing images in tweets)

------
instakill
All that's missing is a custom Twitter client that compresses on tweeting and
decompresses in the various timeline columns, a la Tweetdeck - it could make
for something interesting. Too bad that would be against Twitter's TOS.

------
bmalicoat
Good article, but isn't the author describing Huffman codes?

~~~
RodgerTheGreat
Same basic idea. He's using variable-length input strings and mapping them to
unique symbols rather than the other way around, but it seems like he's using
entropy measurements to build an optimal trie.
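
For comparison, the classic per-character Huffman construction being alluded
to can be sketched in Python (a minimal illustration; the tiebreak counter is
my own device to keep the heap from comparing tree tuples):

```python
# Minimal per-character Huffman code builder: repeatedly merge the
# two lowest-frequency nodes, then read codes off the tree.
import heapq
from collections import Counter

def huffman_codes(text: str) -> dict:
    # Heap entries are (frequency, tiebreak, tree); a tree is either
    # a character or a (left, right) pair.
    heap = [(f, i, ch) for i, (ch, f) in enumerate(Counter(text).items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        fa, _, a = heapq.heappop(heap)
        fb, _, b = heapq.heappop(heap)
        heapq.heappush(heap, (fa + fb, count, (a, b)))
        count += 1
    codes = {}
    def walk(node, prefix):
        if isinstance(node, tuple):
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:
            codes[node] = prefix or "0"
    walk(heap[0][2], "")
    return codes

codes = huffman_codes("this is an example of a huffman tree")
```

The article's twist is going the other direction: many input characters per
output symbol instead of many output bits per input character.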

~~~
akavi
Is an arbitrary bit sequence valid unicode? If so, I'm curious if a simple
per-character Huffman encoding of English would be more efficient.

~~~
JeremyBanks
No, it's not. (Although "Unicode" is ambiguous here, because there are several
Unicode encodings.) Even if the binary sequence can be decoded as Unicode code
points, certain combinations of code points aren't allowed (surrogates have to
come in matched pairs).
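
A quick way to see this in Python, where the str type tolerates a lone
surrogate but the UTF-8 codec rejects it:

```python
# chr(0xD800) is a high-surrogate code point: it fits in a Python
# str, but no Unicode encoding form will accept it on its own.
lone = "\ud800"
try:
    lone.encode("utf-8")
    print("encoded fine")
except UnicodeEncodeError:
    print("lone surrogate rejected by UTF-8")
# -> lone surrogate rejected by UTF-8
```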

------
jconnop
To extend this idea: if you really wanted to maximise the data you could
transmit in a single Twitter message, you could use the full 31 bits of
Unicode (instead of just the Chinese subset) and then apply standard lossless
data compression techniques to the generated Unicode for further improvement.

~~~
waffle_ss
There are a lot of code points that would need to be filtered out if you do
this - Noncharacters, Control codes, High/Low surrogates, Private-Use,
Whitespace, and then of course the ones that mutate other code points in the
sequence - Bidirectional, Combining characters / diacritical marks. It isn't
quite as simple as just combining random 32-bit characters, as I found when
creating my URL shortener.

If you want to play around with this, there is a helpful official (but really
hard to find) Web app here: <http://unicode.org/cldr/utility/properties.html>

The filter I ended up using is:

    [:Diacritic=No:]&[:Noncharacter_Code_Point=No:]&[:Deprecated=No:]&[:White_Space=No:]&[:General_Category=Math_Symbol:]|[:General_Category=Symbol:]|[:General_Category=Letter:]|[:General_Category=Punctuation:]|[:General_Category=Currency_Symbol:]|[:General_Category=Number:]&[:General_Category!=Modifier_Letter:]&[:General_Category!=Modifier_Symbol:]

~~~
masklinn
> High/Low surrogates

Surrogates are not codepoints.

~~~
psykotic
That's not right. Surrogates are code points. That's the whole idea! It means
you can express characters beyond the BMP with legacy encodings that were
designed back when the entire code space could be coded in 16 bits. Newer
encodings like UTF-8 don't have to rely on surrogates to do this because they
have that capability at the bytewise encoding level.

[http://www.google.com/search?&q=site%3Aunicode.org+surro...](http://www.google.com/search?&q=site%3Aunicode.org+surrogate+code+point)

------
tomotomo
I made a quick Chrome extension (userscript wasn't going to have enough
permissions) for this which is up at
[https://chrome.google.com/webstore/detail/idcnolgflhcckjdfpf...](https://chrome.google.com/webstore/detail/idcnolgflhcckjdfpfbcehjocggffdjk)

------
waffle_ss
Unicode is really fun. In a similar vein, my first Rails app was a URL
shortener that also takes advantage of Twitter's Unicode character counting
method:

<http://menosgrande.org>

------
cpeterso
So Twitter allows 140 (UTF-8?) characters, regardless of the number of bytes?
The article wasn't clear about this.

~~~
waffle_ss
Here's the Twitter dev article that I used for designing my URL shortener:
[https://dev.twitter.com/docs/counting-
characters#Twitter_Cha...](https://dev.twitter.com/docs/counting-
characters#Twitter_Character_Encoding)

Basically, they use the Normalization Form C of Unicode normalization which
counts code points, not UTF-8 bytes.

~~~
wisty
A slight modification to your tldr (it's a little confusing to me as I'm not
so familiar with unicode):

Basically, they count code points, not UTF-8 bytes.

Before they count the code points, they normalize using the Normalization Form
C of Unicode (NFC), which aims to combine diacritics, so the "é" in "café"
counts as 1 code point. If they used NFD to normalize, "é" would be normalized
into 2 code points - "e" and a diacritic mark. If they didn't normalize, then
it would be client dependent.

Normalization is distinct from encoding. A Unicode string can be normalized in
several different ways (including NFC and NFD), which changes the actual
Unicode code points. Each normalized form can be encoded using several
different encodings (e.g. UTF-8, UTF-32, Latin-1). Normalization affects both
the number of code points, and subsequently the number of bytes. Encoding
(Unicode -> bytes) changes the number of bytes, but should _not_ affect the
number of code points.

Recapping - Twitter does not count bytes. They count code points, after the
code points have been normalized (combining diacritics with characters), so
message length should not depend on the client.

Hmm, that wasn't so succinct.
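
The café example can be checked directly with Python's standard unicodedata
module:

```python
# NFC composes "e" + combining acute into one code point; NFD
# decomposes the precomposed "é" back into two.
import unicodedata

decomposed = "cafe\u0301"   # 5 code points: c a f e + combining acute
composed = "caf\u00e9"      # 4 code points: c a f é

assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed
print(len(composed), len(decomposed))  # -> 4 5
```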

------
heydenberk
Would it be possible to use the tweet metadata to cram more data into a tweet?

~~~
MostAwesomeDude
Yes! You can put whatever you want in the location field, including completely
invalid locations. I remember a discussion last year about using those bytes
for yet more space in a tweet, if your client software is looking for them.

