

International Longest Tweet Contest - 0xdeadc0de
http://blog.ksplice.com/2010/03/longest-tweet/

======
mnemonicsloth
Fun Facts: Ben and Alyssa make appearances in _Structure and Interpretation of
Computer Programs_, and in Sussman's supercool symbolic programming class:
<http://groups.csail.mit.edu/mac/users/gjs/6.945/>

"Alyssa P Hacker" is a pun on "A Lisp Hacker". She has a friend named Imelda
Macros.

~~~
d0m
Alyssa P. Hacker NEVER answers a question... she only asks tricky ones.

------
sp332
I feel like I should point out that Unicode is not being used correctly in the
article. A better understanding of Unicode encodings might help in the
competition. <http://www.joelonsoftware.com/articles/Unicode.html>

~~~
ximeng
At first I thought you were right, but reading the article more closely I
think it's OK. It's talking about the relationship between ISO/IEC 10646 and
Unicode. There's more information about that in the Unicode standard here:

<http://www.unicode.org/versions/Unicode5.0.0/appC.pdf>

See section C.3 for the differences between the UTF-8 encodings. The key
paragraph is:

"The definition of UTF-8 in Annex D of ISO/IEC 10646:2003 also allows for the
use of five- and six-byte sequences to encode characters that are outside the
range of the Unicode character set; those five- and six-byte sequences are
illegal for the use of UTF-8 as an encoding form of Unicode characters.
ISO/IEC 10646 does not allow mapping of surrogate code positions, known as RC-
elements in that standard; that restriction is identical to the restriction
for the Unicode definition of UTF-8."

That's where the extra characters come from: the UTF-8 encoding used by
Twitter apparently doesn't check whether the characters are valid Unicode
characters, as required by Unicode UTF-8. This is the extra restriction
referred to in the passage above that is imposed by the Unicode version of
UTF-8 but not the ISO version. As quoted in the article,
<http://en.wikipedia.org/wiki/UTF-8#Description> says that the ISO version of
UTF-8 can encode 31 bits. I couldn't find a source for the encoding of ISO
UTF-8.
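
For what it's worth, here is a rough sketch of how the longer sequences work, assuming the byte layout of the original UTF-8 definition (the one in RFC 2279, which as far as I know matches the ISO annex). The function name `encode_utf8_31bit` is made up for illustration; it is not from any library, and unlike a Unicode-conformant encoder it does not reject surrogates or values above 0x10FFFF.

```python
def encode_utf8_31bit(cp: int) -> bytes:
    """Encode a value up to 31 bits using the original 1- to 6-byte UTF-8 layout.

    Hypothetical helper for illustration only. It performs no Unicode
    validity checks (surrogates and values above 0x10FFFF are accepted).
    """
    if cp < 0 or cp > 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if cp < 0x80:
        return bytes([cp])  # plain ASCII, one byte

    # (largest value, total bytes, leading-byte prefix) for each length
    layouts = [
        (0x7FF, 2, 0xC0),       # 110xxxxx
        (0xFFFF, 3, 0xE0),      # 1110xxxx
        (0x1FFFFF, 4, 0xF0),    # 11110xxx
        (0x3FFFFFF, 5, 0xF8),   # 111110xx (not legal in Unicode UTF-8)
        (0x7FFFFFFF, 6, 0xFC),  # 1111110x (not legal in Unicode UTF-8)
    ]
    for limit, length, prefix in layouts:
        if cp <= limit:
            break

    out = []
    # Emit continuation bytes (10xxxxxx) from the low bits upward.
    for _ in range(length - 1):
        out.append(0x80 | (cp & 0x3F))
        cp >>= 6
    out.append(prefix | cp)  # remaining high bits go in the leading byte
    return bytes(reversed(out))
```

For values in the Unicode range this should agree with Python's own encoder, e.g. `encode_utf8_31bit(0x20AC) == "€".encode("utf-8")`; anything above 0x1FFFFF comes out as five or six bytes.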

The other part I wasn't sure about at first was where the 1,112,064 possible
characters figure came from. It turns out that's the 17 Unicode planes of
65,536 code points each (1,114,112 in total), less the 2,048 code points from
0xD800 to 0xDFFF reserved for surrogate pairs.

In other words:

1+0x10ffff-(0xdfff-0xd800+1) = 1 112 064
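
The same arithmetic as a quick Python check, in case anyone wants to verify it (the variable names are just mine):

```python
# All code points from 0 through 0x10FFFF: 17 planes of 65,536 each.
TOTAL_CODE_POINTS = 0x10FFFF + 1

# Code points 0xD800 through 0xDFFF are reserved for surrogates.
SURROGATES = 0xDFFF - 0xD800 + 1

scalar_values = TOTAL_CODE_POINTS - SURROGATES
print(TOTAL_CODE_POINTS, SURROGATES, scalar_values)
```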

~~~
sp332
Thanks, that helps!

------
chaosmachine
I wonder if this was inspired by my Tweet Compressor project:

<http://tweetcompressor.com/>

It got a lot of attention on Reddit last week (20k visitors).

