
Repeat after me: Unicode is not UTF-\d{1,2} - nreece
http://enjoydoingitwrong.wordpress.com/2009/06/22/unicode-is-not-utf/
======
jobu
From the comments on the article: "UTF-\d+ IS Unicode like a Ford Focus is a
car. Not every car is a Ford Focus. Thus UTF-8 IS Unicode, not vice-versa."

I think that's a better explanation than the whole rest of the article/rant.
That author is a little too nuts to follow.

~~~
DougWebb
That's not a great metaphor though, because there are many types of cars, but
only one type of Unicode. 'Unicode' is not a generalization for character
encodings the way 'car' is a generalization for car models. I think this is
exactly the point the article author is trying to make.

A better analogy would be color pixels. Consider a pure-red pixel; there is
only one particular shade of red a pixel can have and be pure-red. However,
there are multiple ways to represent that color: RGB, HSV, HSL, YUV, CMYK,
etc. These are all encodings of the same color. None of them /are/ pure-red,
but they all /represent/ pure-red.

Similarly, the 1-4 byte sequences within a UTF-8 encoded string aren't Unicode
characters themselves; they represent individual Unicode characters. There is
only one A in Unicode, but there are multiple ways to encode that A in a
stream of bytes.
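
For what it's worth, here's a minimal Python 3 sketch of that last point (the
variable name and choice of interpreter are mine, not the article's): the same
abstract character, U+0041, comes out as a different byte sequence under each
encoding.

    a = "\u0041"                  # LATIN CAPITAL LETTER A, one code point

    print(a.encode("utf-8"))      # b'A'              (1 byte)
    print(a.encode("utf-16-be"))  # b'\x00A'          (2 bytes)
    print(a.encode("utf-32-be"))  # b'\x00\x00\x00A'  (4 bytes)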

~~~
blasdel
Well there are at least a half-dozen valid unaccented capital a characters in
Unicode, but there is only one LATIN CAPITAL LETTER A, which sits at U+0041.

You are correct that there are many ways to encode U+0041 in a stream of
bytes.
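
A quick Python 3 sketch of the first point (these four code points are just
example picks, not an exhaustive list):

    import unicodedata

    # Four distinct code points that all render as an unaccented capital A.
    for ch in "\u0041\u0391\u0410\uFF21":
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))

    # U+0041 LATIN CAPITAL LETTER A
    # U+0391 GREEK CAPITAL LETTER ALPHA
    # U+0410 CYRILLIC CAPITAL LETTER A
    # U+FF21 FULLWIDTH LATIN CAPITAL LETTER A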

------
cduan
The analogy I prefer is that Unicode and its encodings are like network
protocol layers.
In a (very simplified) network, the goal is to present an interface of a
pipeline for sending a continuous stream of bytes. To achieve this, you have a
protocol like TCP, which consists of routines for transforming a series of
packets running through a wire into a stream of bytes. To achieve a series of
packets running through a wire, you have a protocol like Ethernet, which
consists of routines for transforming electrical pulses into a series of data
packets. You need both Ethernet and TCP to turn electrical pulses into a
stream of continuous bytes.

Similarly, the goal of Unicode is to present an interface of a stream of
abstract characters (the letter "A", the number 3, the nonbreaking space,
etc.). To achieve this, you use Unicode, which consists of routines to convert
a stream of arbitrarily large numbers (the analogue of the packets: code
points) into abstract characters. To achieve a stream of arbitrarily large
numbers, you use an encoding such as UTF-8, which consists of routines to
convert a series of bytes into a stream of arbitrarily large numbers. Now,
using an encoding and Unicode, you can
convert a stream of bytes (i.e., an arbitrary file) into a series of abstract
characters.
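
A rough Python 3 sketch of those two layers (the byte string here is made up
for illustration): the decode step is the encoding layer turning bytes into
code points, and the Unicode layer tells you which abstract character each
code point names.

    import unicodedata

    raw = b"\x41\xc3\xa9\xe2\x82\xac"   # some bytes "off the wire"
    text = raw.decode("utf-8")          # encoding layer: bytes -> code points

    for ch in text:                     # Unicode layer: code points -> characters
        print(f"U+{ord(ch):04X}", unicodedata.name(ch))

    # U+0041 LATIN CAPITAL LETTER A
    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
    # U+20AC EURO SIGN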

------
blasdel
Any article that attempts to lay down the truth about Unicode without
mentioning UCS-2 is seriously deficient.

There are way too many morons out there treating UTF-16 input as UCS-2, or
writing UCS-2 and calling it UTF-16 (or "Unicode", as the article nicely
addresses). Both Windows and Java have fucked this up pervasively in the past.
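
In case it isn't obvious what the failure mode looks like, a small Python 3
sketch (the example character is my pick): any code point above U+FFFF becomes
a surrogate pair in UTF-16, so code that assumes UCS-2 (one 16-bit unit per
character) miscounts.

    s = "\U0001D11E"        # MUSICAL SYMBOL G CLEF, outside the BMP

    utf16 = s.encode("utf-16-be")
    print(utf16.hex())      # d834dd1e -- a surrogate pair, D834 DD1E
    print(len(utf16) // 2)  # 2 code units (what UCS-2-style code would count)
    print(len(s))           # 1 code point (the actual character count)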

------
Tichy
"when transforming from (byte) strings to Unicode, you are decoding your data"

Oh, so in memory they are not bytes anymore, but "code sequences"? Fair enough
to attempt to clarify a point, but please don't make it even more confusing
than it actually is.

I guess (this is what I take away from the article, even though it is not
written in it) the actual "transforming" stage only applies to single letters
then: "unicode" would be the mapping of a number to a letter, and the
encodings (utf-8 and so on) are different ways to represent the number?

Also, is it true that utf-16 can represent all of unicode? Because I was under
the impression that it can't?

~~~
gizmo
It's true. UTF-16, like UTF-8, is a variable-length encoding. UTF-8 uses less
memory for mostly-ASCII text, and with UTF-16 it's much easier to determine
the string length.

~~~
jrockway
How is it easier in UTF-16? Most "normal" Unicode characters fit in two bytes,
sure, but you can't just count bytes and divide by two if you want the right
answer. It is just as difficult to implement as UTF-8.
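
To make that concrete, a small Python 3 sketch (the sample string is my own):
once you leave ASCII and the BMP, neither encoding lets you divide the byte
count by a constant to get the number of characters.

    s = "A\u00e9\u20ac\U0001F600"           # A, é, €, and an emoji: 4 code points

    print(len(s))                           # 4
    print(len(s.encode("utf-8")))           # 10 bytes: 1 + 2 + 3 + 4
    print(len(s.encode("utf-16-le")) // 2)  # 5 16-bit units: the emoji takes two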

------
sp332
Wow, there's a UTF-EBCDIC? What next, UTF-Morse code?

------
andrewvc
This is the kind of thing that doesn't matter till it does, which for most
programmers (at least in the US) is never.

When something bad happens you go off and learn some more, but, for a lot of
people, it flat out won't be an issue.

~~~
lallysingh
... and then you rewrite all your string manipulations?

Unlike other new technologies, you're already using it from the beginning (and
in your approach, incorrectly). Spend the two hours to figure out how to do
the basic work right at the beginning.

~~~
prodigal_erik
This. At my workplace we had legacy code written by people who either didn't
understand or didn't care about this stuff, and now our databases are full of
crap that we literally have to _guess_ how to correctly convert and render.

------
jodrellblank
So, unicode isn't called multicode because...?

~~~
blasdel
...because Unicode is a Universal set of Code Points that uniquely refer to
Characters.

Even if some asshole decreed that there'd be no variable-width encodings,
you'd still have endianness issues and combining characters to trip over. Shit
ain't easy.
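
A little Python 3 sketch of those two hazards (the example characters are
mine): the "same" text can be one precomposed code point or a base letter plus
a combining mark, and the same code point has two possible byte orders on the
wire.

    import unicodedata

    composed   = "\u00e9"     # é as a single precomposed code point
    decomposed = "e\u0301"    # e followed by COMBINING ACUTE ACCENT

    print(composed == decomposed)                                # False
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True

    print("\u00e9".encode("utf-16-le").hex())  # e900
    print("\u00e9".encode("utf-16-be").hex())  # 00e9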

