
Why Unicode Won’t Work on the Internet (2001) - jordigh
http://www.hastingsresearch.com/net/04-unicode-limitations.shtml
======
lcuff
This article was written before UTF-8 became the de-facto standard. According
to Wikipedia, UTF-8 encodes each of the 1,112,064 valid code points, far more
than the 170,000 that Goundry (the author) calls for. Goundry's only complaint
against UTF-8
is that at the time, it was one of three possible encoding formats that might
work. Since it has now been widely embraced, the complaint is no longer valid.

In short, Unicode will work just fine on the internet in 2016 as far as
encoding all the characters goes. Problems having to do with how ordinal
numbers are used, right-to-left languages, upper-case/lower-case anomalies,
different glyphs being used for the same letter depending on the letter's
position in the word (and many other realities of language and script
differences) all need to be in the forefront of a developer's mind when trying
to build a multi-lingual site.
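
For example (a quick Python 3 sketch of one such case anomaly), Unicode's
default case mappings are locale-independent, so Turkish dotted/dotless i
can't round-trip through upper()/lower():

    print("ı".upper())        # 'I'  (dotless i uppercases to plain ASCII I)
    print(len("İ".lower()))   # 2    ('i' plus U+0307 COMBINING DOT ABOVE)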

~~~
WayneBro
> In short, Unicode will work just fine...

> Problems............all need to be in the forefront of a developer's mind
> when trying to build a multi-lingual site.

It will work. Just fine though? It sounds like way too much work!

~~~
Dylan16807
Unicode handily solves the problem of storing text.

Manipulating text, though, is inherently nightmarish. No format can prevent
that.
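
For example (a small Python 3 sketch): two canonically equivalent strings
aren't even equal until you normalize them, and no encoding can save you from
that:

    import unicodedata

    a = "é"          # U+00E9, precomposed
    b = "e\u0301"    # 'e' followed by U+0301 COMBINING ACUTE ACCENT
    print(a == b)                                 # False: same text, different code points
    print(unicodedata.normalize("NFC", b) == a)   # True, but only after normalizing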

------
TazeTSchnitzel
UTF-16 and the non-BMP planes were devised in 1996. The author seems to have
been five years late to the party.

> The current permutation of Unicode gives a theoretical maximum of
> approximately 65,000 characters

No, UTF-16 enables a maximum of 1,112,064 characters (the 17-plane codespace
up to U+10FFFF, minus the 2,048 surrogates).

> Clearly, 32 bits (4 octets) would have been more than adequate if they were
> a contiguous block. Indeed, "18 bits wide" (262,144 variations) would be
> enough to address the world’s characters if a contiguous block.

UTF-16 covers a 21-bit codespace, 3 bits more than the author asks for.

Except they're not “in a contiguous block”:

> But two separate 16 bit blocks do not solve the problem at all.

The author doesn't explain why having multiple blocks is a problem. This works
just fine, and has enabled Unicode to accommodate the hundreds of thousands of
extra characters the author said it ought to.
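
For what it's worth, the two reserved blocks combine mechanically; here's a
quick Python 3 sketch of the surrogate-pair arithmetic from RFC 2781:

    # Two 16-bit code units drawn from the reserved surrogate blocks
    # address all 16 supplementary planes:
    cp = 0x1D11E                # MUSICAL SYMBOL G CLEF, outside the BMP
    v = cp - 0x10000            # 20 bits to split across two code units
    high = 0xD800 | (v >> 10)   # high surrogate
    low = 0xDC00 | (v & 0x3FF)  # low surrogate
    print(hex(high), hex(low))  # 0xd834 0xdd1e
    assert "\U0001D11E".encode("utf-16-be") == bytes([0xD8, 0x34, 0xDD, 0x1E])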

Though maybe there's a hint in this later comment:

> One can easily formulate new standards using 4 octet blocks (ad infinitum) –
> but piggybacking them on top of Unicode 3.1 simply exacerbates the
> complexity of font mapping, as Unicode 3.1 has increased the complexity of
> UCS-2.

I guess they would have preferred that backwards compatibility be broken and
that everyone switch to a new format like UTF-32/UCS-4, just not called
Unicode?

~~~
jordigh
Maybe the errors in the article are more a statement of how complicated and
improperly communicated Unicode was... and mostly still is! While I think I
understand most of how UTF-8 works, I still have to read and re-read how
codepoints and planes and encodings and decodings work together. It's a pretty
complicated beast that could very easily be misunderstood when it was less
popular than it is now.

It's still widely misunderstood today.

~~~
TazeTSchnitzel
I'm not sure that's fair; Unicode's encodings are pretty straightforward,
particularly compared to some other character sets. Most of the complexity
sits above the encoding level.

~~~
0x0
It also doesn't help that the classic LAMP stack has very confusing defaults
and badly named functions:

* PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"

* MySQL for the longest time used latin1 as a default character set, then introduced an insufficient character set called "utf8", which allows at most 3 bytes per character (not enough for all UTF-8 encoded code points), and only later introduced a proper implementation called "utf8mb4".

* MySQL connectors and client libraries often default their "client character set" setting to latin1, causing "silent" transcoding against the "server character set" and table column character sets. Also, because their "latin1" charset is a more or less binary-safe encoding, it is very easy to get double latin1-to-utf8-transcoded data into the database. This often goes unnoticed as long as data is merely received, inserted, selected, and output to a browser, until you start to work on substrings or case-insensitive searches etc. (the double-transcode bug is sketched below).

* In Java, there are tons of methods that work on the boundary between bytes and characters and allow not specifying an encoding, which then silently falls back to an almost randomly set system encoding.

* Many languages, such as Java, JavaScript and the Unicode variants of Win32, were unfortunately designed at a time when Unicode characters could fit into 16 bits, with the devastating result that the data type "char" is too small to store a single Unicode character. It also plays hell with substring indexing.

In short, the APIs are stacked against the beginning programmer and don't make
it obvious that when you go from working with abstract "characters" to byte
streams, there is ALWAYS an encoding involved.
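
Here's what the double-transcode bug mentioned above looks like, as a minimal
Python 3 sketch (Python standing in for the PHP/MySQL layers; the byte-level
effect is the same):

    s = "é"
    once = s.encode("utf-8")       # b'\xc3\xa9'
    # A buggy layer treats those UTF-8 bytes as latin1 text and
    # "helpfully" re-encodes them to UTF-8:
    twice = once.decode("latin-1").encode("utf-8")   # b'\xc3\x83\xc2\xa9'
    print(twice.decode("utf-8"))   # 'Ã©', classic mojibake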

~~~
jordigh
Does any programming language get Unicode right all the way? I thought Python
did it mostly correctly, but with combining characters, for example, I would
argue that it gets it wrong if you try to reverse a Unicode string.
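
A quick Python 3 sketch of what I mean (assuming a string with a combining
accent):

    s = "cafe\u0301"    # 'café' spelled with a combining acute accent
    print(s)            # café
    print(s[::-1])      # the accent detaches; it no longer sits on the 'e'
    # A correct reversal would have to keep grapheme clusters intact,
    # which takes more than code-point-level operations.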

~~~
deathanatos
My basic litmus test for "does this language support Unicode" is, "does
iterating over a string get me code points?"¹

Rust, and recent versions of Python 3 (but not early versions of Python 3, and
definitely not 2…) pass this test.

I believe that all of JavaScript, Java, C#, C, C++ … all fail.

(Frankly, I'm not sure anything in that list even has built-in functionality
in the standard library for doing code-point iteration. You have to more or
less write it yourself. I think C# comes the closest, by having some Unicode
utility functions that make the job easier, but still doesn't directly let you
do it.)

¹Code units are almost always, in my experience, the wrong layer to work at.
One might argue that code points are still too low level, but this is a basic
litmus test (I don't disagree that code points are often wrong; it's mostly a
matter of what I can actually get from a language).

> _try to reverse a Unicode string._

A good example of where even code points don't suffice.
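
Here's the litmus test in Python 3, for the record (a sketch; the non-BMP
character is an arbitrary example):

    s = "a\U0001D11E"                # 'a' plus a non-BMP musical symbol
    for cp in s:                     # iterating yields code points...
        print(f"U+{ord(cp):04X}")    # U+0061, U+1D11E
    print(len(s.encode("utf-16-le")) // 2)   # ...not the 3 UTF-16 code units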

~~~
TorKlingberg
I basically agree with you, but note that code points are not the same as
characters or glyphs. Iterating over code points is a code smell to me. There
is probably a library function that does what you actually want.

~~~
deathanatos
I explicitly mention exactly this in my comment, and provide an example of
where it breaks down. The point, which I also heavily noted in the post, is
that it's a litmus test. If a language can't pass the iterate-over-code-points
bar, do you really think it would give you access to characters or glyphs?

------
jimjimjim
[http://utf8everywhere.org/](http://utf8everywhere.org/)

A very useful site, especially when having to explain what UTF-8 is to other
devs while working in a Windows shop.

~~~
wmccullough
'working in a windows shop.'

Surely you're flattering yourself.

~~~
jimjimjim
No, seriously. I'm a Windows application dev, and I have been for over a
decade.

If all you see around you is wchar_t and LPCWSTR, then that is what Unicode
means.

------
herge
Man, UCS-2 is the pits. I still remember fighting with "narrow builds" of
Python back in the day.

Any critique of Unicode that doesn't assume UTF-8 (which allows for more than
a million code points) is a bit suspect in my opinion. The biggest point
against UTF-8 might be that it takes more space than "local" encodings for
Asian languages.

~~~
mjevans
Wikipedia has a summary of comparisons:

[https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16](https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16)

Advantages

* Byte encodings and UTF-8 are represented by byte arrays in programs, and often nothing needs to be done to a function when converting from a byte encoding to UTF-8. UTF-16 is represented by 16-bit word arrays, and converting to UTF-16 while maintaining compatibility with existing ASCII-based programs (such as was done with Windows) requires every API and data structure that takes a string to be duplicated, one version accepting byte strings and another version accepting UTF-16.

* Text encoded in UTF-8 will be smaller than the same text encoded in UTF-16 if there are more code points below U+0080 than in the range U+0800..U+FFFF. This is true for all modern European languages.

* Most communication and storage was designed for a stream of bytes. A UTF-16 string must use a pair of bytes for each code unit:

* * The order of those two bytes becomes an issue and must be specified in the UTF-16 protocol, such as with a byte order mark.

* * If an odd number of bytes is missing from UTF-16, the whole rest of the string will be meaningless text. Any bytes missing from UTF-8 will still allow the text to be recovered accurately starting with the next character after the missing bytes.

Disadvantages

* Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi will take more space in UTF-8 if there are more of these characters than there are ASCII characters. This happens for pure text, but actual documents often contain enough spaces and line terminators, numbers (digits 0–9), and HTML or XML or wiki markup characters that they are shorter in UTF-8. For example, both the Japanese and the Hindi Unicode articles on Wikipedia take more space in UTF-16 than in UTF-8.
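
The trade-off is easy to check empirically; a small Python 3 sketch (the
sample strings are arbitrary):

    for text in ("hello", "こんにちは", "<p>こんにちは</p>"):
        u8, u16 = len(text.encode("utf-8")), len(text.encode("utf-16-le"))
        print(f"{text!r}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
    # 'hello':             UTF-8  5, UTF-16 10
    # 'こんにちは':         UTF-8 15, UTF-16 10
    # '<p>こんにちは</p>':  UTF-8 22, UTF-16 24  (markup tips it back)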

~~~
deathanatos
The biggest disadvantage of UTF-16, IMO, is that programmers blindly assume
that they can index into the string as if it were an array and get a code
point out, which you can _not_; you'll get a code unit, which is slightly
different and might not represent a full code point (let alone a full
character).

UTF-8's very encoding quickly beats this out of anyone who tries, whereas it's
easy to eke by in UTF-16. The real problem is that the APIs allow such
tomfoolery. (Some have historical excuses, I will grant, but new languages are
still being made that allow indexing into _code units_ without it being obvious
that this is probably not what the coder wants.)
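
A Python 3 sketch of the trap (Python's own str indexes code points, so the
UTF-16 view is simulated by decoding into 16-bit units):

    import array

    s = "a\U0001D11E"                                 # 2 code points
    units = array.array("H", s.encode("utf-16-le"))   # 3 16-bit code units
    print(len(s), len(units))                         # 2 3
    print(hex(units[1]))   # 0xd834: a lone high surrogate, not a code point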

~~~
int_19h
Strictly speaking, this is a disadvantage of languages that have strings as
first-class types _and_ allow indexing on strings in the first place (and
specify it to have these semantics).

For the most part, the developer shouldn't really care about the internal
encoding of the string, and the language/library shouldn't expose it to them
either.

------
mcaruso
More like, "Why UCS-2 Won’t Work on the Internet".

------
khaled
An extensive and very informative, though a bit sarcastic, rebuttal (from 2001
as well): [https://features.slashdot.org/story/01/06/06/0132203/why-unicode-will-work-on-the-internet](https://features.slashdot.org/story/01/06/06/0132203/why-unicode-will-work-on-the-internet)
(via [https://twitter.com/FakeUnicode/status/786324531828838400](https://twitter.com/FakeUnicode/status/786324531828838400)).

------
oconnor663
> Thus is can be said that Hiragana can form pictures but Katakana can only
> form sounds

That sounds really weird to me. Does that sound right to any native Japanese
speakers here?

~~~
xigency
This writing is sort of a strange metaphor, but I guess the point the author
makes is that kanji can be transliterated as hiragana but not katakana. The
writer goes on to talk about traumatic brain injuries so I guess he's aiming
at the cultural value of each syllabary.

I'm not a native speaker, but if I were to make an equally strange metaphor as
the author, katakana feels like writing in all capital letters.

~~~
achamayou
Kanji can be transliterated either way, and both forms are lossy, since there
are so many homonyms and kana only encode sounds. It's traditional to annotate
difficult Kanji pronunciations with small Hiragana, called Furigana, for
example in children's books. But it could be done all the same in Katakana.
Modern Chinese words that Japanese borrows are usually transliterated into
Katakana, for example.

~~~
klodolph
Kana mostly contain just sounds, but do contain some morphological
information—there are homonyms in kana as well, after all. This is a bit rare,
however.

------
zoom6628
The paper is far more interesting for its informative background on the use of
character sets in the CJK region.

------
david90
Good to see we've had a breakthrough after 15+ years.

------
jbmorgado
Why "640K ought to be enough to everyone"

~~~
Dylan16807
The opposite, really. It's arguing why 16 bits is not enough. But it does this
while casually dismissing the solutions we already had.

------
reality_czech
Hilarious, a document from 2001 talking about why Unicode is unsuitable to
"the orient." At the end, I half expected to read that "Negroes have also
proved to be most unfavorable to it."

~~~
darkengine
Simply because of his use of an outdated synonym of "Asia"? If anything, I got
the impression that the author was critical of Westerners for being
insensitive to the needs of Asian computer users. I think this is because of
Han Unification, but he does not mention it by name.

~~~
sundarurfriend
In this author's case, I agree that the intent doesn't appear to have been
malicious. But calling it just "an outdated synonym of Asia" is like saying
Negro is just an outdated synonym for Black (after all, negro is just Spanish
for black). In both cases, it ignores heavy historical and cultural
implications that come with the words.

~~~
hueving
What are the cultural implications of using Orient as a synonym for Asia?

