
UTF-8 good, UTF-16 bad - mattyb
http://benlynn.blogspot.com/2011/02/utf-8-good-utf-16-bad.html
======
chalst
Broad backwards compatibility with ASCII is a strong reason to prefer UTF-8 in
most applications; however, I find the issues with UTF-16 overstated and its
advantages ignored.

"UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e.,
quite a bit more compact, since, e.g., Chinese characters always take 3
characters in UTF-8, for asian languages.

"A related drawback is the loss of self-synchronization at the byte level." -
Maybe this is a problem, maybe not. Maybe the failure of UTF-8 to be self-
synchronising at the 4-bit level is a problem is some circumstances. I don't
mean to be flippant, but the wider point is that with UTF-16, you really need
to commit to 16-bit char width.
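
For reference, this is roughly what byte-level self-synchronization buys in
UTF-8 (a Python sketch; the helper name is mine): from an arbitrary byte
offset you can back up to a character boundary just by skipping continuation
bytes, which has no byte-level equivalent in UTF-16.

    def utf8_sync(data: bytes, pos: int) -> int:
        # Back up from an arbitrary byte offset to the start of the character
        # containing it: UTF-8 continuation bytes all look like 10xxxxxx.
        while pos > 0 and (data[pos] & 0xC0) == 0x80:
            pos -= 1
        return pos

    data = "héllo".encode("utf-8")   # b'h\xc3\xa9llo'
    print(utf8_sync(data, 2))        # 1 -- offset 2 is inside the two-byte 'é'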

"The encoding is inflexible" - I think the author has confused the fixed-width
UCS-2 and the variable-width UTF-16.

"We lose backwards compatibility with code treating NUL as a string
terminator." - Not true. NUL is a 16-bit character in UTF-16. Use a C compiler
that supports 16-bit char width.

~~~
othermaciej
> "UTF-16 has no practical advantages over UTF-8" - UTF-16 is better - i.e.,
> quite a bit more compact, since, e.g., Chinese characters always take 3
> characters in UTF-8, for asian languages.

In some contexts that may matter. In other cases, you can expect enough ASCII
mixed in to outweigh that effect. On the Web, for example, if you're sending
HTML, the savings from using 8 bits for the ASCII tags will nearly outweigh
the cost of the extra bytes for content text. Gzip shrinks the remaining
difference to essentially nothing.
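
A rough way to see the effect in Python (the HTML fragment is made up, and
gzip overhead dominates on inputs this small, so treat the numbers as
illustrative only):

    import gzip

    html = '<p class="intro">你好，世界。这是一个测试。</p>'  # made-up sample
    utf8, utf16 = html.encode("utf-8"), html.encode("utf-16-le")
    print(len(utf8), len(utf16))                                # raw sizes
    print(len(gzip.compress(utf8)), len(gzip.compress(utf16)))  # gzipped sizes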

~~~
fnl
As I posted there, too, he also missed one very important plus of UTF-16:
character counting. If you ensure there are no Surrogate Chars in your array
(which is a simple byte-mask comparison using only every second byte, as in
"0xD800 & byte == 0xD800" [checking against the entire range so you don't
have to think about byte order]), you can be 100% sure of the character
length just by dividing the byte length by two. With UTF-8, you have to look
at each and every byte, one by one, and count characters to establish the
actual length. Finally, given how rare the Supplementary Range characters
are, the Surrogate Range check can be safely skipped in many scenarios,
making it significantly faster to establish char lengths in UTF-16 than in
UTF-8.

EDIT: oh, and before any "semantic-nerd" comes along: I am fully aware that
0xXXXX is two bytes, so, if you want, read "two-byte" every time I mention
"byte" above... (doh ;))

~~~
wladimir
Right, if you ensure that only ASCII characters are used in UTF-8, which you
can check using 0x80 & byte == 0x00, counting the number of characters is
easy. But to check that this condition holds, you need to iterate over the
whole string anyway.

I don't see how this gives any advantage to either of the encodings. Both for
UTF-8 and UTF-16 you have to implement some decoding to reliably count the
number of characters.
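
For comparison, the UTF-8 side is just as mechanical (a Python sketch; the
helper names are made up): non-continuation bytes each start a character, and
the pure-ASCII check is the 0x80 mask mentioned above - either way you touch
the whole string.

    def utf8_char_count(data: bytes) -> int:
        # Every byte except continuation bytes (10xxxxxx) starts a character.
        return sum(1 for b in data if (b & 0xC0) != 0x80)

    def is_ascii(data: bytes) -> bool:
        # The pure-ASCII fast path mentioned above: no byte has its high bit set.
        return all(b & 0x80 == 0 for b in data)

    print(utf8_char_count("Hello, 世界".encode("utf-8")))  # 9
    print(is_ascii("Hello".encode("utf-8")))                # True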

~~~
fnl
Oh, and another thing: your argument only holds for ASCII - I hardly ever
encounter pure ASCII data nowadays when I work with text. On the other hand,
I _never_ have encountered a Surrogate Character except in my test libraries,
either. Next, if you are working with pure ASCII, who cares about UTF-8 or
-16? Last, you still need to scan every byte in UTF-8 for that, but only
every second byte (the high byte of each 16-bit unit) in UTF-16, which is
half as many.

~~~
roel_v
When you say 'ASCII' I guess you mean 'strings where all bytes have a decimal
value in the range [0-127]', right? If so, I agree that it's rare to encounter
that, but the common use (however wrong) of ASCII is 'chars are in [0-255]',
i.e. all chars are one byte; and that data is very common.

Thinking about it, I don't know what code page the UTF-8 code points 128-255
map to, if any, though; could you explain? If you treat UTF-8 as ASCII data
(as one byte, one character, basically), does it generally work with chars in
the [128-255] range?

~~~
fnl
256-char ASCII is called "8-bit", "high", or "extended" ASCII. So pure
(7-bit) ASCII is the only thing you can hold in a "one-byte UTF-8 array". The
8th bit (or the first, depending on how you look at it) is used to mark
multibyte characters, i.e. any character that is not ASCII. So you can only
represent 128 possible symbols in a one-byte UTF-8 character. In particular,
UTF-8 maps these to ASCII (or, US-ASCII) characters, and the byte starts with
a 0 bit. (In other words, you cannot encode high ASCII into one UTF-8 byte.)
For ALL other characters, without exception, the first bit of every byte in
the (multi-byte) sequence is set to 1. That's why it is easy to scan for the
length of ASCII chars in UTF-8, but not for any others.
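
For example, in Python (showing that a lone high byte is Latin-1 / "extended
ASCII", not valid UTF-8 by itself):

    print("A".encode("utf-8"))    # b'A'        -- high bit clear, one byte
    print("é".encode("utf-8"))    # b'\xc3\xa9' -- two bytes, both with the high bit set
    print("é".encode("latin-1"))  # b'\xe9'     -- one "high ASCII" byte
    try:
        b"\xe9".decode("utf-8")   # a lone high byte is not valid UTF-8
    except UnicodeDecodeError as err:
        print("not valid UTF-8:", err)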

~~~
roel_v
Oh I see, thank you, it seems I was misinformed.

------
wooster
Better discussion here, IMO:
<http://research.swtch.com/2010/03/utf-8-bits-bytes-and-benefits.html>

------
adobriyan
> The Go designers knew what they were doing, so they chose UTF-8.

The authors of Go and the authors of UTF-8 are more or less the same people,
so the choice was a no-brainer.

------
burgerbrain
I honestly had no idea there were parts of the development community that
actually used and preferred UTF-16...

~~~
btilly
Anyone who programs in Java is using UTF-16. And in my experience very few
Java programmers understand that UTF-16 is a variable length encoding.
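
The variable length shows up as soon as you leave the BMP; for instance
(Python used here just to count the code units that Java's String.length()
reports):

    s = "\U0001D11E"  # MUSICAL SYMBOL G CLEF, U+1D11E -- one character outside the BMP
    print(len(s.encode("utf-16-le")) // 2)  # 2 UTF-16 code units (a surrogate pair);
                                            # Java's String.length() counts these units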

~~~
Confusion
Too few programmers know that UTF-8 is a variable length encoding: I've heard
plenty assert that in UTF-8 every character takes two bytes, while
simultaneously claiming that they could encode _every possible_ character in
it.

A bit broader: too few programmers understand the difference between a
character set and a character encoding.
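
A concrete way to see the distinction, in Python: the character set assigns
the number, and the encoding decides the bytes.

    ch = "é"
    print(hex(ord(ch)))            # 0xe9 -- the code point: a character-set-level fact
    print(ch.encode("utf-8"))      # b'\xc3\xa9'
    print(ch.encode("utf-16-le"))  # b'\xe9\x00'
    print(ch.encode("utf-32-le"))  # b'\xe9\x00\x00\x00' -- three encodings, one set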

~~~
fedd
> too few programmers understand the difference between a character set and a
> character encoding

why then do they have the same names??? :)))

ps/ <http://www.grauw.nl/blog/entry/254> - is this article OK? It's the first
Google hit for "charset encoding difference".

~~~
Confusion
That article starts out OK and then suddenly tries to argue that you can use
the terms interchangeably. You cannot, and you will drown in confusion if you
try to. Just imagine that tomorrow the Chinese introduce their own character
set next to Unicode, but use UTF-8 to minimize the number of bytes it takes
to represent their language (which makes sense, because character frequency
drops off pretty fast and some characters are much more common than others,
so you'd like to represent those with one byte).

The fact that the HTTP RFC speaks of 'charset=utf-8' is explained by this part
of the spec:

      Note: This use of the term "character set" is more commonly
      referred to as a "character encoding." However, since HTTP and
      MIME share the same registry, it is important that the
      terminology also be shared.

Why does MIME use the 'wrong' terminology? Perhaps because the registry is old
and the difference between set and encoding was less obvious and relevant back
then. Perhaps it was simply a mistake; a detail meant to be corrected. Perhaps
the person who drew it up was inept. Who knows. It doesn't matter; it is still
wrong. And don't get me started on the use of 'character set' in MySQL...

------
prodigal_erik
I keep hoping a string API will catch on in which combining marks are mostly
treated as indivisible. Handling text one codepoint at a time is as bad an
idea as handling US-ASCII one bit at a time--almost everything it lets you do
is an elaborate way to misinterpret or corrupt your data.
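
For instance, in Python: a combining sequence is one user-perceived character
but two code points, and normalization only helps where a precomposed form
exists, which is why a proper grapheme-cluster API would be welcome.

    import unicodedata

    s = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT; displays as a single 'é'
    print(len(s))                                # 2 code points, 1 visible character
    print(len(unicodedata.normalize("NFC", s)))  # 1 after composing
    # Not every combining sequence has a precomposed form, so code-point-level
    # counting or slicing can still split a single user-perceived character.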

~~~
barrkel
It's not so simple: it depends on what you're doing with the text. If you're
not trying to do analysis with it, encoded text is more or less a program
written in a DSL that, when interpreted by a font renderer, draws symbols in
some graphical context. Depending on the analysis you want to do, you need
varying amounts of knowledge. Perhaps you only need to know about word
boundaries; perhaps you're trying to look things up in a normalized
dictionary; maybe even decompose a word into phonemes to try and pronounce it.
These require different levels of analysis, and one size won't fit all.

------
thristian
The article, at the end, claims that "ASCII was developed from telegraph codes
by a committee." It turns out the story is much, much more complicated and
interesting than that: <http://www.wps.com/projects/codes/>

------
zbowling
UCS-2, even though it is an outdated predecessor to UTF-16, has some unique
qualities that make it useful for things like databases or other storage
media where you are not mixing in a lot of low code point characters (as you
do with XML and HTML markup).

One is that it's fair to all languages with respect to size, e.g. when you
are storing your standard Chinese, Korean, or Japanese characters.

When UTF-16 made UCS-2 variable length, a few of the nice things were lost,
but when you are dealing mostly with higher code point characters, UTF-16 may
still save you space.

------
fedd
may i dare to draw a conclusion from my observations? :)

utf-8 is good for network interchange and is becoming the de facto standard.

utf-16 is not bad for internal storage of strings in memory or in a database.
not necessarily bad. maybe even better, for some reasons.

------
natmaster
Can someone link me to where Python uses UTF-16? It was my understanding that
it defaulted to UTF-8.

<http://www.python.org/dev/peps/pep-3120/>

~~~
lysium
That PEP only refers to the encoding of the source (code) file, not to the
encoding in which strings are stored in Python.

Edit: Your question might be answered at SO:
[http://stackoverflow.com/questions/1838170/what-is-
internal-...](http://stackoverflow.com/questions/1838170/what-is-internal-
representation-of-string-in-python-3-x#1838285)
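
If you want to check what your own interpreter does, sys.maxunicode gives a
hint (a minimal sketch; narrow builds of that era stored strings as UCS-2
with surrogate pairs, wide builds as UCS-4):

    import sys

    # 0xffff on "narrow" (UCS-2-like) builds, 0x10ffff on "wide" (UCS-4) builds
    # and on Python 3.3+, where PEP 393 made the internal representation flexible.
    print(hex(sys.maxunicode))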

------
wildmXranat
Correct me if I'm wrong, but I think Excel still outputs UTF-16 in some cases.
I remember parsing generated .txt/.csv files, and there were issues with it
and its endian order.
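
A BOM sniff usually sorts that out (a minimal Python sketch; the helper name
is made up):

    def sniff_bom(path):
        # Excel's "Unicode Text" export is typically UTF-16 with a byte order mark.
        with open(path, "rb") as f:
            head = f.read(3)
        if head.startswith(b"\xff\xfe"):
            return "utf-16-le"
        if head.startswith(b"\xfe\xff"):
            return "utf-16-be"
        if head.startswith(b"\xef\xbb\xbf"):
            return "utf-8-sig"
        return None  # no BOM; fall back to guessing or trying encodings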

------
mikecaron
Good read! I always HATE it when things complain that my Visual Studio
*.c/*.h files are BINARY! WTF!

------
fedd
and i find it fair that American characters require 2 bytes in Java like
everybody else's, not 1 as in utf-8! :)

~~~
brownleej
You seem to be confusing "fair" with "equal". Treating everybody the same is
not necessarily fair. It seems fair to me to have the most common characters
be shortest. I don't have any evidence, but I would guess that Latin
characters[1] are used most commonly.

[1] "Latin characters" is the proper term, not "American characters"

~~~
fedd
thanks! but i was just kidding, which i can't always control. :)

anyway, i think that even if Latin chars weren't the most used in the world,
it would be fair to keep them as the primary charset for programming and
markup languages, since computers first started to be massively developed in
America - just as no-one now complains that the international language of
medicine is Latin and not, say, Chinese :)

