
Love hotels and Unicode - gioele
http://www.reigndesign.com/blog/love-hotels-and-unicode/
======
davidw
I loved the idea of the slides _interspersed with text_. Just posting slides
is usually lame, because you lose out on the actual talk, which contains
much, if not most, of the information.

~~~
pmjordan
This really ought to be the standard way to post presentations online.
Unfortunately, it's more work than uploading to the awful _slideshare_.

~~~
_delirium
Yeah, it's an interesting format I've recently started experimenting with for
some of my own talks (though only preliminarily). It seems sort of halfway
between the "throw up slides" non-solution, and the traditional academic
solution, which is to accompany a talk with a written paper. Then the paper
would serve as the "durable" version of the talk suitable for archiving,
laying out the same material but in a way more suited to text, and usually a
bit more formally, with more details (e.g. to enable someone to actually
reproduce the work).

The slides-interspersed-with-text format could be seen as a more heavily
illustrated, easy-reading sort of paper. More similar to the talk than a
traditional accompanying paper would be, but not _just_ the slides. But I
think it's still non-trivial work to make a good version, which is why people
often just throw up the slides: that's almost literally no additional work.

You basically have to take the same source material and think about how you
would write it up as a blog post, which is not 100% the same as thinking about
how to give it as a talk. Though I guess a first-cut solution could be to just
record the talk and type up a transcript in between the slides.

~~~
pmjordan
You could probably do a lot worse to prepare for a talk than writing such a
text+slides version of it, then distilling key words as notes from the prose
and then practicing and giving the talk based on those.

------
bgruber
I had never before read about the Unicode flags, and the rather (politically
and technically) brilliant end run around the problem. Fascinating.

~~~
jerf
I am torn between finding it brilliant and taking it as a sign that adding
all those symbols was just a waste of time. I sort of feel
like Unicode 5 is jumping the shark here. At the point where you're arguing
about how to refer to the color of the hair on one of your "graphemes", may I
humbly suggest that the goal of creating a complete global encoding scheme is
apparently done and the committee ought to disband. Unicode has apparently
ceased being a universal encoding for human text and expanded its mission
into becoming a universal icon catalog. This is just silly. Next up in
Unicode 7, we deprecate the face icons in favor of a series of combining
characters that allow you to mix and match hairstyles, face colors, eyes,
noses,
etc., to create arbitrary faces, why not.

~~~
greggman
Or you could consider that 140 million Japanese (not sure about other
countries) have been using emoji every day for the last 10+ years on their
cellphones.

They've become part of the language, and it would be culturally insensitive to
give them a big "fuck you, us westerners don't need your damn icons; either
keep those to yourself, and hence don't inter-operate with everyone else, or
stop using them, even though they've been a big part of your culture for 10+
years".

~~~
jerf
You're trying to make me feel bad, but you're doing it with a false dilemma:
_Either_ we stick emoji in Unicode _or_ the Japanese are screwed and we told
them to "fuck" off. This is a false dilemma, and exactly the sort of thinking
that worries me as we fuzzily expand the charter of Unicode beyond a universal
grapheme repository. I say there are plenty of third options, many of which are
better choices than jamming it into the international all-purpose standard.

Unicode is supposed to be the central hub for all graphemes, but I would
certainly like to see some argument that emoji are actually graphemes. For one
thing it's quite bizarre that suddenly Unicode appears to be specifying colors
as well as shapes, which is one bright line I'd be concerned about. (There may
have been colors before but they would be much more exceptional, and I'm not
aware of any.) Unicode was _ambitious_ as the all-grapheme repository; it's
simply a guaranteed failure if it tries to become the repository of all
vaguely iconic/smileyish/little pictures in the world.

~~~
aidenn0
Japanese do mix emoji with kanji and kana. Are emoji any less graphemes than
the more pictographic Han characters? I don't think the colors are normative
though, any more than the exact shape is. A different representation of a love
hotel ought to be fine.

~~~
jerf
The colors I'm referring to are the ones in the names of the code points, such
as the ones the Germans complained about. It doesn't get much more normative
than that.

~~~
aidenn0
Ah, that grapheme doesn't actually have color IIRC, but the hair is not
filled, thus giving the impression of a fair-haired person (rather than a
dark-haired one).

------
vorg
> Every character, or to be more exact, every "grapheme", is assigned a
> Unicode code point.

Every character is assigned a Unicode code point. The Unicode consortium
defines a grapheme as a "user perceived character", usually made up of one
Unicode code point, but sometimes two or more. A base character can be
followed by one or more non-spacing marks, together forming a "grapheme"; the
most common of these have a "canonical mapping" to a single character, but
they need not.
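
To make that concrete, here's a quick illustration in Python (standard
library only): the same user-perceived character "é" can be one code point or
two, and NFC normalization applies the canonical mapping:

    import unicodedata

    composed = "\u00e9"       # "é" as one precomposed code point (U+00E9)
    decomposed = "e\u0301"    # "e" (U+0065) + combining acute accent (U+0301)

    print(len(composed), len(decomposed))  # 1 2  -- code point counts differ
    print(composed == decomposed)          # False -- different code point sequences
    # NFC applies the canonical mapping, collapsing the pair to one code point
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True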

------
guard-of-terra
Minor point: I see copy-paste from Wikipedia about ISO-8859-5. It's
unfortunate since nobody ever used ISO-8859-5. They probably should change it
to ISO-8859-something-else in the Wikipedia article.

~~~
pmjordan
It also entirely glosses over the fact that before the ISO-8859 standards,
there were the horrendous code pages in DOS and numerous other encodings on
other platforms, which made things hard even for Europeans, let alone
languages with a non-Latin-derived alphabet.

~~~
derleth
And before that, you get into encodings like EBCDIC, RAD50, SIXBIT, FIELDATA,
and even more failed schemes now largely forgotten.

Why should a brief overview go back even to the pre-ISO-8859 days except to
mention ASCII? None of them are directly relevant: The world we're dealing
with now on the Web begins with ASCII, moves through a Pre-Unicode Period, and
finishes up in the Land of Unicode, where it's at least possible to do things
Right. All history tells a narrative; when it comes to character encodings,
that ASCII-to-Unicode arc is a good default unless you really think your
audience cares about why FORTRAN was spelled that way back in the Before Time.

Tom Jennings has an interesting history:

<http://www.wps.com/projects/codes/>

~~~
Someone
_"The world we're dealing with now on the Web begins with ASCII"_

I know nothing about the implementation of early web browsers/gopher/etc, but
I doubt there ever was anything on the web that used ASCII. 7-bit email may
have been around at the time, but I would guess Tim Berners-Lee just used
whatever character set his system used by default (corrections welcome; being
snarky isn't the only reason I write this).

~~~
derleth
> I know nothing about the implementation of early web browsers/gopher/etc,
> but I doubt there ever was anything on the web that used ASCII.

All headers, HTTP, email, or otherwise, are 99% or more ASCII. HTML markup is
over 99% ASCII for most documents, especially the complex ones.

ASCII is the only text encoding you can guarantee everything on the Web (and
the Internet in general, really) knows how to speak. Finally, guess what all
valid UTF-8 code points in the range U+0000 to U+007F inclusive are
compatible with: ASCII.
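
That last claim is easy to check in Python, for instance:

    # A pure-ASCII string encodes to byte-for-byte identical ASCII and UTF-8
    s = "GET /index.html HTTP/1.0"
    assert s.encode("ascii") == s.encode("utf-8")

    # Non-ASCII characters only ever use bytes >= 0x80 in UTF-8, so any byte
    # below 0x80 always means exactly the ASCII character
    assert "\u00e9".encode("utf-8") == b"\xc3\xa9"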

~~~
Someone
I know that, but "over 99% ASCII" = "not ASCII". For many users, UTF-8 is over
99% ASCII, but it is not ASCII.

~~~
derleth
> I know that, but "over 99% ASCII" = "not ASCII"

No, that's not what I meant. I meant that all of the essential bits are ASCII,
all of the software that generates those important pieces has to know ASCII,
and it's entirely possible for software that speaks only ASCII to handle it as
long as the filenames (the main source of non-ASCII characters) being served
are also ASCII.

Read the HTTP specification sometime.

------
mavroprovato
Mirror from Google cache, but unfortunately text only:

[http://webcache.googleusercontent.com/search?q=cache:http://...](http://webcache.googleusercontent.com/search?q=cache:http://www.reigndesign.com/blog/love-hotels-and-unicode/&hl=en&strip=1)

~~~
icebraining
The Coral cache has images: [http://www.reigndesign.com.nyud.net/blog/love-hotels-and-uni...](http://www.reigndesign.com.nyud.net/blog/love-hotels-and-unicode/)

------
brown9-2
Does anyone happen to know what the arguments for big-endianness versus
little-endianness in the Unicode formats were?

I've never really understood the advantage of one over the other, but this
section on Wikipedia <http://en.wikipedia.org/wiki/Endianness#Optimization>
helps explain that there are optimizations that can be made at a hardware
level when performing arithmetic on little-endian values.

Is the argument for using LE in Unicode similar?

~~~
ajross
If you're using, say, UCS-2 (not UTF-16, lest we get too confused), it's awfully
nice if the wide character (i.e. a C short) for "A" can be equal to the
literal 'A' and not 0x4100. Making that work depends on the endianness of the
host architecture, not just the data.
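
Concretely (a quick sketch in Python with the struct module):

    import struct

    # UCS-2 "A" (U+0041) laid out in each byte order
    print(struct.pack("<H", ord("A")))  # b'A\x00' -- little-endian: 'A' first
    print(struct.pack(">H", ord("A")))  # b'\x00A' -- big-endian: 'A' second

    # Reading big-endian bytes as if little-endian yields 0x4100, not 0x0041
    (value,) = struct.unpack("<H", struct.pack(">H", ord("A")))
    print(hex(value))                   # 0x4100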

------
sbierwagen

      ASCII was actually invented in the US in the 1960s, as a 
      standardised way of encoding text on a computer. ASCII 
      defined 128 characters, that's a character for half of the 
      256 possible bytes in an 8-bit computer system.
    

Uh? Wouldn't it be easier to just say that it's a 7-bit coding system? And
what does he mean by "256 possible bytes in an 8-bit computer system"?

~~~
hardy263
"7 bits" versus "half of 8 bits" are two slightly different things. One has a
padding, the other does not. So the file size for a 7 bit encoding would be
slightly smaller than an 8 bit one.
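
For what it's worth, here's a toy illustration in Python of that size
difference; in practice, of course, ASCII text is almost always stored one
character per 8-bit byte, with the top bit simply unused:

    # Bit-pack 8 seven-bit ASCII characters into 7 bytes (8 * 7 = 56 bits)
    text = "UNICODE!"
    bits = "".join(format(ord(c), "07b") for c in text)
    packed = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    print(len(text), len(packed))  # 8 7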

------
davvid
The one thing he forgot to say is, "If you have to choose an encoding, use
utf-8."

Someone who doesn't know the difference between UTF-16, UTF-8, etc. might not
know which to use.

------
adavies42
I'd've noted that endianness is hardly unique to Unicode.

------
akrifa
This was a fantastic read. Thank you!

------
derleth
When he says byte-order marks are optional, does he mean just in UTF-8 (where
they are) or also in UTF-16 (where I strongly suspect they are not)?

(Yes, you can use heuristics to guess which endianness is in use. The problem
is that while this is trivial for Western languages, I don't even know how
you'd begin when presented with arbitrary text using an East Asian or African
written language.)

~~~
patio11
_The problem is that while this is trivial for Western languages, I don't even
know how you'd begin when presented with arbitrary text using an East Asian or
African written language_

Not actually that hard.

Consider a document which is encoded in either a) ASCII like you know it or b)
ASCII where the top 4 bits and bottom 4 bits are transposed. How would you
tell the difference? Well, one can imagine creating a histogram of the values
of each half of the bytes and comparing them to expectations based on the
distribution of characters in naturally occurring English text. The half with
most of its entries at 0x5, 0x6, and 0x7 is the high-order half.

If you don't know what naturally occurring e.g. Japanese looks like in Unicode
code points, take this on faith: flipping the byte order does not give you a
document which looks even plausibly correct. (Also, crucially, Japanese with
the order flipped doesn't resemble any sensible document in any language --
you end up with Unicode code points from a mishmash of unrelated blocks.)

P.S. Why care about that algorithm? Here's a hypothetical: you're a forensic
investigator or system administrator who, given a hard drive which has been
damaged, needs to extract as much information as possible from it. The BOM is
very possibly not in the same undamaged sector which you are reading right
now, and it may be impossible to stitch the sectors together without first
reading the text. How would you determine a) whether an arbitrary stream of
bytes was likely a textual document, b) what the encoding was, c) what
endianness it was, if appropriate, and d) what human language it was written
in?
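
A rough sketch of that histogram idea in Python, for the transposed-nibble
toy case (a hypothetical example, not a real forensic tool):

    # For English ASCII the *high* nibble of most bytes is 0x2 (space) or
    # 0x4-0x7 (letters); if the nibbles were transposed, that concentration
    # shows up in the low nibble instead.
    def looks_nibble_transposed(data):
        letterish = {0x2, 0x4, 0x5, 0x6, 0x7}
        high = sum((b >> 4) in letterish for b in data)
        low = sum((b & 0x0F) in letterish for b in data)
        return low > high

    sample = b"The quick brown fox jumps over the lazy dog"
    swapped = bytes(((b << 4) & 0xF0) | (b >> 4) for b in sample)
    print(looks_nibble_transposed(sample), looks_nibble_transposed(swapped))
    # False True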

~~~
funkah
Sounds hard to me.

~~~
chc
How about this simplified version:

1. Try both byte orders.

2. If one produces valid text and the other does not, choose that one (this
will get you the correct answer almost every time, even if the source text is
Chinese).

3. If both happen to produce valid text, use the one with the smallest number
of scripts.

(Note that this just determines byte order, while Patrick was talking about
the more ambitious task of heuristically determining whether a random string
of bytes is text and if so what encoding it is. My point is just that you
really don't need to be told the order of the bytes in most cases.)
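
A minimal sketch of that in Python, using the count of distinct
256-code-point blocks as a crude stand-in for "number of scripts" (a
hypothetical helper, not production-grade detection):

    def guess_utf16_byte_order(data):
        def block_count(encoding):
            try:
                text = data.decode(encoding)
            except UnicodeDecodeError:
                return float("inf")  # an undecodable order loses outright
            # crude "number of scripts": distinct 256-code-point blocks
            return len({ord(ch) >> 8 for ch in text})
        return min(("utf-16-le", "utf-16-be"), key=block_count)

    data = "Bush hid the facts \u65e5\u672c\u8a9e".encode("utf-16-be")
    print(guess_utf16_byte_order(data))  # utf-16-be

The wrong byte order usually still decodes (the swapped values land on valid
code points), but it scatters the text across far more blocks than the right
one.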

~~~
kijin
Simple in theory, but hard enough in practice that companies like Microsoft
screw it up from time to time.

Try saving a text file in Windows XP Notepad with the words "Bush hid the
facts" and nothing else. Close it and open the file again. WTF Chinese
characters! Conspiracy!

~~~
jerf
That's not Microsoft "screwing it up"; that's you not feeding the algorithm
enough characters for it to be really sure. While that short string is below
the threshold, the threshold is actually surprisingly small; if I remember
correctly it's just over 100 bytes, and any non-pathological input will be
correctly identified with effectively 100% success.

