

Should UTF-16 be considered harmful? - ch0wn
http://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful

======
pilif
UTF-16 gives you the worst of both worlds: it wastes a lot of memory for
ASCII-like languages (mostly ASCII plus a few special characters), you have to
deal with byte ordering, and you still don't get the advantage of directly
addressing a specific character without parsing the whole string up to the
character you want.

But.

If you widen the character even more, you'd probably still want it to be
somewhat word-aligned, so you'd use 32 bits per character, which would have
enough storage for all of Unicode plus some. The cost though is obvious: you
waste memory.

Depending on the language, you even waste a lot of memory: think ASCII or
ASCII-like text (the Central European languages). In UTF-8 those need,
depending on the language, barely more than one byte up to two bytes per
character. Representing these languages with 4 bytes per character makes you
use nearly 4 times the memory reasonably needed.

This changes the farther east you move. Representing Chinese (no ASCII, many
code points high up in the set) in UTF-8, you begin wasting a lot of memory
due to the ASCII compatibility, since encoding such a code point in UTF-8 uses
around one byte more than if you just stored the code point as a 16-bit
integer.

So for international software running on potentially limited memory while
targeting East Asian languages, you will again be better off using UTF-16, as
it requires less storage for characters high up in the Basic Multilingual
Plane.
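
A quick way to see the trade-off (a sketch in Python; the sample strings are
arbitrary and purely illustrative):

    samples = {
        "English": "The quick brown fox jumps over the lazy dog",
        "German":  "Falsches Üben von Xylophonmusik quält jeden größeren Zwerg",
        "Chinese": "我能吞下玻璃而不伤身体",
    }
    for name, text in samples.items():
        sizes = {enc: len(text.encode(enc))
                 for enc in ("utf-8", "utf-16-le", "utf-32-le")}
        print(name, sizes)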

Also, if you know that your text is just extended ASCII, you can optimize and
access characters directly without parsing the whole string, giving you
another speed advantage.

I don't know what's the best way to go. 32 bits is wasteful; UTF-16 is
sometimes wasteful, has endianness issues and still needs parsing (but is less
wasteful than 32 bits in most realistic cases); and UTF-8 is wasteful for the
higher BMP code points and always requires parsing, but doesn't have the
endianness issues.

I guess as always these are just tools and you have to pick what works in your
situation. Developers have to adapt to what was picked.

~~~
pornel
> Representing Chinese (no ASCII, many code points high up in the set) in
> UTF-8, you begin wasting a lot of memory due to the ASCII compatibility.

That's not "a lot": it's 3 bytes per character instead of 2 (50% more) in the
_best case_ of purely CJK text.

OTOH any non-CJK characters are 50% smaller. It starts to even out if you add
a few foreign names or URLs to the text.
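
The arithmetic is easy to check (a Python sketch; the strings are just
examples):

    cjk = "日本語のテキスト"                 # all BMP CJK characters
    print(len(cjk.encode("utf-8")),          # 24 bytes (3 per character)
          len(cjk.encode("utf-16-le")))      # 16 bytes (2 per character)

    url = "http://example.com/some/path"     # plain ASCII
    print(len(url.encode("utf-8")),          # 28 bytes (1 per character)
          len(url.encode("utf-16-le")))      # 56 bytes (2 per character)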

And for HTML, UTF-16 is just crazy. Makes HTML markup twice as expensive.

CJK websites don't use UTF-16, they use Shift-JIS or GBK, which are
technically more like UTF-8.

~~~
fleitz
How exactly does UTF-16 make HTML twice as expensive?

I guarantee you that there are very few apps that will double their memory
usage if you start using UTF-16 text. Even if you start looking at bandwidth,
once you compress the text there is very little difference. (You are
compressing your HTML, right?)

The case for UTF-8 saving memory makes a lot of sense if you're writing
embedded software, however in most stacks the amount of memory wasted by
UTF-16 is trivial compared to the amount of memory wasted by the GC, bloated
libraries, interpreted code, etc.

If you're using .NET or the JVM char is 16 bits wide anyway. The UTF-8 vs.
UTF-16 debate is a perfect example of microbenchmarking, where theoretically
there is a great case for saving a resource but in aggregate it makes very
little difference.

~~~
pornel
> How exactly does UTF-16 make HTML twice as expensive?

    < \0 h \0 t \0 m \0 l \0 > \0

> If you're using .NET or the JVM char is 16 bits wide anyway.

Hopefully you don't need to worry about what .Net/JVM have to do for legacy
reasons, and you can use UTF-8 for all input and output.
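
To put rough numbers on the markup point (and on the compression question
above), a quick sketch; the snippet is just a stand-in for a real page:

    import gzip

    page = "<html><body><p>你好，世界</p></body></html>" * 1000
    for enc in ("utf-8", "utf-16-le"):
        raw = page.encode(enc)
        print(enc, "raw:", len(raw), "gzipped:", len(gzip.compress(raw)))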

------
xpaulbettsx
UTF-16 has all of the disadvantages of UTF-8 and none of the advantages, and
originally comes from the UCS-2 era where they thought, "64k characters are
enough for everyone!" Unfortunately, all of Windows uses it, so we as an
industry are stuck with it.

~~~
jensnockert
You mean that it is not ASCII compatible?

Also, Mac OS X, Java, .Net, Qt, ICU... there is a lot of support for UTF-16,
for reasons other than backwards compatibility. Processing UTF-16 is easier in
many situations.

~~~
nitrogen
Processing UTF-16 would only be easier if you have a valid byte-order mark
and/or know the endianness in advance and you can guarantee that no surrogate
pairs exist. Otherwise the same pitfalls exist as when processing UTF-8, plus
the endianness issue and the possibility of UTF-16 programmers not knowing
about surrogate pairs.
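
For example (a Python sketch; the emoji is just a convenient non-BMP
character):

    ch = "😀"                         # U+1F600, outside the BMP
    encoded = ch.encode("utf-16-le")
    print(len(encoded))               # 4: one surrogate pair, not one 16-bit unit
    print(encoded.hex())              # "3dd800de" (0xD83D, 0xDE00, little-endian)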

~~~
xpaulbettsx
What you're describing is UCS-2, not UTF-16. That's why the latter is
frustrating to deal with.

------
jrockway
Harmful, not really. Sometimes UTF-16 is the most compact representation for a
given string, and its semantics are not that confusing. UTF-8 is a better
"default choice", but the algorithms for both are very hard to get right.
Given an idealized Unicode string object, it's still hard to do things like
count the number of glyphs required to display the text in font foo, or
convert all characters to "lowercase", or sort a list of names in
"alphabetical" order. But the problem is not Unicode or UTF-16 or UTF-8, the
problem is defining what a correct computer program is, and ultimately, that's
the job for the programmer, not the string encoding algorithm.

Think of UTF-8 and UTF-16 as being like in-memory compression for Unicode
strings. You can't take a gzip file, count the number of bytes, and multiply
by some factor to get the true length of the string. But, this complexity is
often worth the space savings, so that's why these encodings exist.

The problem that people have is twofold: first, the average programmer
doesn't realize that characters and character encodings are two different
things. Second, they don't realize that a Unicode string is a data structure
that does one thing and only one thing: maintain a sequence of Unicode
codepoints in memory.

A Unicode string is not "text" in some language; it's some characters for a
computer. To choose the right font or sort things correctly, you need
additional information sent out-of-band. There is no way to do it right given
only the Unicode string itself. People are reluctant to send out-of-band data
and so other people assume Unicode is hard. It's not. The problem is that
Unicode is not text.

------
jensnockert
Unicode should be considered harmful, possibly even text. Never think you
understand text; it is a very complex medium, and every time this topic is
brought up you learn something about some odd quality of some language that
you might never have heard of. Yes, UTF-16 is variable length; yes, it does
make many European scripts larger. Size is always a trade-off and there won't
be one standard for encoding.

Text is hard; do not approach it with a C library you built in an afternoon.
Leave it to the professionals. I just wish I knew any...

~~~
derefr
> Size is always a trade-off and there won't be one standard for encoding.

I don't see why, on modern multi-core CPUs/GPUs where 99% of the time is
spent idle, we can't just have "raw" text (UCS4) to manipulate in memory, and
_compressed_ text (using any standard stream compression algorithm) on disk/in
the DB/over the wire.

Anything that's not UCS4 is already variable-length-encoded, so you lose the
ability to random-seek it anyway; and (safely performing) any complex text
manipulation, e.g. upper-casing, requires temporarily converting the text to
UCS4 anyway. At that point, you may as well go all the way, and serialize it
as efficiently as possible, if you're just going to spit it out somewhere
else. I guess the only difference is that string-append operations would
require un-compressing compressed strings and then re-compressing the
result—but you could defer that as long as necessary using rope[1].

[1] <http://en.wikipedia.org/wiki/Rope_(computer_science)>

~~~
klodolph
The oft-stated advantage of UTF-32/UCS4 is that you can do random access. But
random character access is almost entirely useless for real text processing
tasks. (You can still do random byte access for UTF-8 text, and if your regexp
engine spits out byte offsets, you're fine.)

Even when you're doing something "simple" like upcasing/downcasing, the
advantages of UTF-32 are not great. You are still converting variable length
sequences to other variable length sequences -- e.g., eszett upcases to SS.
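
(A quick Python illustration of the length change:)

    s = "straße"
    print(s.upper())                  # "STRASSE": one character became two
    print(len(s), len(s.upper()))     # 6 7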

Now the final piece to this is that for some language implementations,
compilation times are dominated by lexical analysis. Sometimes, significant
_speed_ gains can be had by dealing with UTF-8 directly rather than UTF-32
because memory and disk representation are identical, and memory bandwidth
affects parsing performance. This doesn't matter for most people, but it
matters to the Clang developers, for example. Additional system speed gains
are had from reducing memory pressure.

Sure, we have plenty of memory and processor power these days. But simpler
code isn't always worth 3-4x memory usage.

Text is not simple.

------
perlgeek
UTF-16 is not harmful, just the "I use UTF-16, so I don't have to worry about
Unicode issues anymore" stance is harmful.

Processing text is _hard_. For example if you write an editor, you must be
aware of things like right-to-left text sequences, full-width characters and
graphemes composed of multiple codepoints.
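
For instance, a single on-screen grapheme can be more than one codepoint (a
Python sketch; "é" is just one example):

    import unicodedata

    composed   = "\u00e9"             # "é" as a single codepoint
    decomposed = "e\u0301"            # "e" + combining acute accent
    print(len(composed), len(decomposed))                        # 1 2
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True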

If you implement a case-insensitive search, you must be aware of title case,
and of the nasty German ß, which upper-cases to _two_ characters (SS). (OK,
you'll probably just use regexes; it's the author of the regex engine who has
to worry about it. And it is ugly, because it means that character classes can
match two characters, which you have to consider in all length-based
optimizations.)

To a first approximation it doesn't matter whether you implement your stuff
in UTF-8 or in one of the UTF-16 encodings; you have to deal with the same
issues (including variable byte length for a single codepoint).

------
thisrod
I see a different problem: the scope of Unicode has crept outside the scope of
its designers' expertise. Next thing you know, we'll have a code point for the
symbol "arctanh". (Moreover, sin through arccosh will be contiguous, erf will
be placed where you'd expect arctanh, and arctanh will sit between blackboard
bold B and D, because C is listed a trillion code points earlier as "Variant
Complex Number Sign", between "Engineering Right Angle Bracket" and "Computer
Programming Left Parenthesis".)

Encoding the technical and mathematical symbols in two bytes is necessarily
kludgy. There are International Standard kludges that everyone is supposed to
use, but not everyone does. My solution would be for Unicode to stick to human
languages.

------
maaku
You shouldn't care about UTF-8/UTF-16/UCS-4 except for performance, and that
totally depends on what you are doing with what data-sets.

As someone who speaks multiple languages and has written a fair amount of
language-processing code, the simple truth is that if you are not using a
vetted Unicode framework for writing your application--and if you're asking if
UTF-16 should be considered harmful, you're probably not--you are almost
certainly introducing massive numbers of bugs that your own cultural biases
are blinding you to.

Do not underestimate the difficulty of writing correct multi-lingual-aware
programs. These frameworks exist for a reason, and are often written by
professionally trained linguists.

~~~
archangel_one
Interfacing with third-party libraries is an obvious case where you do need to
worry about encodings for reasons that are not performance.

------
cpeterso
Is there any reason to use a character encoding _other than_ UTF-8 (for new
content)? There is a lot of legacy content in non-Unicode encodings. Also,
UCS-4 _might_ be a useful optimization for processing text.

~~~
maaku
Size, perhaps? Remember, most of the world does not speak English. I bet you
Baidu uses UTF-16 for its indexes.

~~~
burgerbrain
Is size _actually_ a concern in the modern world? For transmission and
storage, size is practically a nonissue, and we have proper, rather cheap
compression anyway. And in memory? We have _gigabytes_ of RAM, who cares? Of
the potential situations where it may become an issue that I can think of,
pretty much none of them are on end user hardware.

~~~
maaku
I reiterate my example of baidu.

Just because size isn't a problem for your application doesn't mean there
aren't others out there dealing with petabytes of text.

~~~
burgerbrain
_"Of the potential situations where it may become an issue that I can think
of, pretty much none of them are on end user hardware."_

Do what you need to on your company's machines, but keep it out of ecosystems
it has no reason to be in.

------
kogir
If whatever framework or libraries you're using don't hide this from you
nearly completely, you might be doing it wrong. Just use the default Unicode
format of the OS and framework or standard libraries you're using and never
think of it again.

Much like dates and times, correct handling of Unicode is best left to the
poor souls stuck dealing with them full time, all the time.

~~~
hrktb
_> Much like dates and times, correct handling of Unicode is best left to the
poor souls stuck dealing with them full time, all the time._

Reminds me that date parsing is fucked up in iOS and you actually need to
know your shit to get it right when dealing with users around the globe. But I
don't see people dealing with this full time.

~~~
chubs
In all apps I've worked on, passing dates around as 'seconds since 1970' has
made date parsing on iOS effortless - you may want to try that.
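
The idea, sketched in Python rather than Objective-C (the timestamp value is
arbitrary):

    from datetime import datetime, timezone

    ts = 1302393600                  # seconds since 1970 (UTC), unambiguous on the wire
    dt = datetime.fromtimestamp(ts, tz=timezone.utc)
    print(dt.isoformat())            # convert to a local zone only for display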

~~~
pyre

> passing dates around as 'seconds since 1970' has made date parsing on iOS
> effortless

I'm confused. How does the way that you store dates internally have anything
to do with date parsing? If you are accepting dates from users, then you still
have to normalize them to Unix timestamps.

That also doesn't cover things like:

* Take this date and convert it into a week represented by the last Sunday of the week (i.e. 20110410 represents 20110403-20110409).

* Basically anything involving days of the week.

------
caf
I agree with much of what the top-voted answer says, with the exception of
the criticism of wchar_t. wchar_t is _not_ the same as UTF-16: it is supposed
to be a type large enough to store any single codepoint. Environments where
wchar_t is UTF-16 simply have a broken wchar_t; when wchar_t is UCS-4, it
fulfills its original purpose.
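
A quick way to check which flavour of wchar_t your platform gives you
(Python's ctypes mirrors the C type):

    import ctypes
    print(ctypes.sizeof(ctypes.c_wchar))   # typically 2 on Windows, 4 on Linux/macOS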

------
j_baker
I dislike UTF-16 for a very simple reason: it's another type of encoding to
deal with. If I could force the entire world to dump UTF-16 and use UTF-8, I
would. Hell, I'd force everyone to dump UTF-8 and use UTF-16 if I could, too.
It's just such a pain in the ass to have to deal with so many ways of encoding
text on the web.

