Why Unicode Won’t Work on the Internet (2001) (hastingsresearch.com)
102 points by jordigh on Oct 12, 2016 | 110 comments



This article was written before UTF-8 became the de facto standard. According to Wikipedia, UTF-8 encodes each of the 1,112,064 valid code points, far more than Goundry's (the author's) 170,000. Goundry's only complaint against UTF-8 is that at the time, it was one of three possible encoding formats that might work. Since it has now been widely embraced, the complaint is no longer valid.

In short, Unicode will work just fine on the internet in 2016 as far as encoding all the characters goes. Problems having to do with how ordinal numbers are used, right-to-left languages, upper-case/lower-case anomalies, different glyphs being used for the same letter depending on the letter's position in the word (and many other realities of language and script differences) all need to be in the forefront of a developer's mind when trying to build a multi-lingual site.


This article is written years after the 16-bit problem was solved, yet seems to be completely unaware of the solution. (It's unclear whether the mentions of adding "octet blocks" are a misunderstanding of the solution or a dismissal of it.) It mentions Unicode 3.1, yet surrogate pairs were introduced in Unicode 2.0, five years before the doc was written.

In actuality we don't concern ourselves overmuch with planes, "octet blocks", and so on these days. There's just a single numbered list of characters* that we call "Unicode", and a few fairly minimal algorithms for transforming a list of characters into a bytestream ("UTF-8", "UTF-16", etc).
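In Python 3 terms, very roughly (nothing here is specific to any library, it's just the standard str/bytes split):

    >>> ord('€'), hex(ord('€'))      # the "numbered list": character → number
    (8364, '0x20ac')
    >>> '€'.encode('utf-8')          # one transform of that number into a bytestream
    b'\xe2\x82\xac'
    >>> '€'.encode('utf-16-be')      # another transform of the same number
    b' \xac'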

Still, I love it as a historically interesting document. It's jarring to see "Oriental" used; I suspect the same author would never use that word in a professional context today.

[*] I use "characters" here when "code points" would be more accurate, but the details of that comparison aren't meaningful to this argument. I also don't go into normalization and so on, as this article seems to be more about the feasibility of fitting >100K code points into actual-bits-on-the-wire.


It doesn't have anything to do with UTF-8. It has to do with Unicode‡ growing from a 16-bit address space into a ~20.1-bit address space: 17 16-bit "planes"‡‡ (log₂(17×2¹⁶) ≈ 20.1 bits).

This expansion happened sort-of gradually. Unicode 3.1 (2001) assigned the Unicode 3.0 character set to be "plane 0", and added 14 additional planes‡‡‡ (for a total address space of log₂(15×2¹⁶) ≈ 19.9 bits). I'm not sure exactly which version between 3.1 and 9.0 added the remaining two planes.

That is to say, Unicode 3.1 solved the address-space problem.

So why does the article have a section "Why Unicode 3.1 Does Not Solve the Problem"? Well, there are two answers: the one that the article suggests, and the one giving the author the benefit of the doubt.

The way the section is written, it seems that the author thinks the address space of 16 bits + 16 bits is 2×2¹⁶ instead of 2^(2×16), because they are in separate "16 bit blocks". They seem to think that maybe one bit was added, but that it just grew the encoding size by 16 bits. It honestly read to me like it was written by a linguist who only had a rudimentary grasp of programming; I was surprised when it said the author was a programmer at the bottom.

Giving him the benefit of the doubt that the argument was only poorly expressed, not poorly thought: Unicode 3.1 expanded the address space by another 917,504 encodable characters; more than enough! However, it didn't actually define 900,000 more characters; it only defined 44,946 more characters (as the author noted). But it wasn't limited to 16 bits anymore; it had a full 19 (more actually; 19.9-ish!) to work with. The author even mentions that 18 bits would have been plenty. Well, Unicode 3.1 got them! They just weren't allocated yet.
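For anyone who wants to check the arithmetic, a quick Python sketch of the numbers above:

    >>> import math
    >>> 14 * 2**16                          # code points added by Unicode 3.1's 14 new planes
    917504
    >>> round(math.log2(15 * 2**16), 1)     # Unicode 3.1's address space, in bits
    19.9
    >>> round(math.log2(17 * 2**16), 1)     # the eventual 17-plane address space
    20.1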

That said, Unicode 9.0 (2016) still only has about 128,000 characters defined in it. A far cry from the author's claim of 170,000 characters needed to satisfy Asian languages.

‡: To say otherwise would be to confuse Unicode with its encodings, a mistake that only leads to confusion.

‡‡: The term "plane" comes from ISO/IEC standards dealing with character sets. Unicode 3.0 corresponded to the ISO/IEC "Basic Multilingual Plane"; so each 16-bit group of characters got its own cutesy name as a "Plane" to match.

‡‡‡: Why 15 planes, then 17? It has to do with what was encodable with existing encodings. That isn't to say that Unicode was limited by the encodings, but that it was informed by them. The growth beyond plane 0 meant that UCS-2 had to be phased out for UTF-16 (its successor), as UCS-2 couldn't encode anything but plane 0. However, seeing that UTF-16 could only encode 17×2¹⁶ characters, it made sense to limit the number of planes to 17 unless there was a pressing need for more, as going beyond that would require obsoleting UTF-16. And given that the current address space is only about 12% utilized, there's no reason to mandate phasing out UTF-16 yet.


I just want to say, I absolutely love that you used a whole bunch of unicode characters for math. I'm not sure I've ever actually seen someone use the superscript characters in a legitimate context (well, beyond my own very occasional usage). And you even threw in stuff like U+00D7 MULTIPLICATION SIGN (×)!

As a side note, if anybody here is on OS X and wants to be able to type these characters, years ago I wrote a DefaultKeyBinding.dict file that adds bindings for these and a lot more (including the greek alphabet). You can see the file at https://gist.github.com/kballard/7584246fa5d5fcb684e996ff095..., and if you want to use it, just put this in ~/Library/KeyBindings with the name DefaultKeyBinding.dict. Any running apps may have to be restarted to notice this.


To be honest, the only reason I used × was that it was easier than figuring out how to make HN not treat asterisk as italic!


Speaking of, I know it's way out of scope for Unicode, but am I the only one that is sad that Unicode only provides the symbols for math and not the layout? The poor standardization of mathematical formulae in various document editors is a constant frustration. You can happily move Word documents to Google Docs, but all the formulas vanish because there are a zillion standards for formulas and all of them are terrible.


Good news, that standard actually exists: http://unicode.org/notes/tn28/UTN28-PlainTextMath-v3.pdf


I know many math teachers who are using LaTeX because there were too many problems with Word (bad editor and incompatibilities with previous versions with no upgrade path).


LaTeX works across many websites.

There's an extension for Google Docs for LaTeX, BTW.


> In short, Unicode will work just fine...

> Problems............all need to be in the forefront of a developer's mind when trying to build a multi-lingual site.

It will work. Just fine though? It sounds like way too much work!


Unicode handily solves the problem of storing text.

Manipulating text, though, is inherently nightmarish. No format can prevent that.


UTF-16, and non-BMP planes, were devised in 1996. The author seems to have been 5 years late to the party.

> The current permutation of Unicode gives a theoretical maximum of approximately 65,000 characters

No, UTF-16 enables a maximum of 2,097,152 characters (2^21).

> Clearly, 32 bits (4 octets) would have been more than adequate if they were a contiguous block. Indeed, "18 bits wide" (262,144 variations) would be enough to address the world’s characters if a contiguous block.

UTF-16 provides 21 bits, 3 more than the author wants.

Except they're not “in a contiguous block”:

> But two separate 16 bit blocks do not solve the problem at all.

The author doesn't explain why having multiple blocks is a problem. This works just fine, and has enabled Unicode to accommodate the hundreds of thousands of extra characters the author said it ought to.

Though maybe there's a hint in this later comment:

> One can easily formulate new standards using 4 octet blocks (ad infinitum) – but piggybacking them on top of Unicode 3.1 simply exacerbates the complexity of font mapping, as Unicode 3.1 has increased the complexity of UCS-2.

They would have preferred if backwards-compatibility had been broken and everyone switched to a new format that's like UTF-32/UCS-4, but not called Unicode, I guess?


> UTF-16, and non-BMP planes, were devised in 1996. The author seems to have been 5 years late to the party.

Hell UTF-8 was devised in mid-1992 and presented at USENIX in 1993.

> No, UTF-16 enables a maximum of 2,097,152 characters (2^21).

And until 2003 (and RFC 3629 which neutered it to match UTF-16) UTF-8 enabled 2,147,483,648 characters (2^31).


Maybe the errors in the article are more a statement of how complicated and poorly communicated Unicode was... and mostly still is! While I think I understand most of how UTF-8 works, I still have to read and re-read how code points and planes and encodings and decodings work together. It's a pretty complicated beast that could very easily be misunderstood when it was less popular than it is now.

It's still widely misunderstood today.


Hello, sorry for the incoming information dump! You are probably familiar with all of this, but I've still typed it out for any passing interested reader. I've listed my responses in decreasing importance, feel free to stop after the next paragraph:

Unicode doesn't have to be complex; it is simply a single map of characters (such as "a" or "ﷺ") to a codepoint (such as "97" or "65018").

"Encodings": Encodings predate Unicode, and we had to deal with encoding before Unicode was ever invented (eg. "is this text in iso-8859-1, or is it koi8-r?"). Unicode has actually simplified the problem of encodings, for two reasons:

1) it's increasingly likely that any text sent over the internet is in the UTF-8 transformation of Unicode.

2) there is a mapping for any encoding into a Unicode transformation such as UTF-16 or UTF-8, therefore you only need three routines for text work: transform_encoding_to_unicode, process_text_as_unicode, transform_unicode_to_encoding.
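In Python terms the pattern is roughly the following (the three names above are just the shape of the pipeline, not a real API; koi8-r is an arbitrary example):

    def handle(raw: bytes, source_encoding: str) -> bytes:
        text = raw.decode(source_encoding)   # transform_encoding_to_unicode
        text = text.strip().upper()          # process_text_as_unicode (any text work)
        return text.encode('utf-8')          # transform_unicode_to_encoding

    handle('Привет'.encode('koi8-r'), 'koi8-r')   # UTF-8 bytes for 'ПРИВЕТ'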

"Planes": Planes are an aspect of Unicode that you may not need to concern yourself with, if you work in UTF-8 (as most Unix software does). However if you're in the world of Windows, Java, or Javascript (which use UTF-16 instead), then you may need to know that characters in planes other than 0 are encoded as 4 bytes instead of 2 bytes. Hopefully the tools available in your language of choice hide this implementation detail from you!

Misc: Beyond that it's true that there's an infinitely complex system of rabbit holes (think normalization, hyphenation, character ordering, right-to-left, ligature decomposition, and so on) that you can delve into when learning i18n, however Unicode wasn't the cause of them; they all existed before Unicode was invented! They're simply the messy but rewarding aspect of dealing with the written word.


I'm not sure that's fair; Unicode's encodings are pretty straightforward, particularly compared to some other character sets. Most of the complexity comes above the encoding level.


It also doesn't help that the classic LAMP stack has very confusing defaults and badly named functions:

* PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"

* MySQL for the longest time used latin1 as a default character set, then introduced an insufficient character set called "utf8", which only allows up to 3 bytes per character (not enough for all possible UTF-8 encoded code points), and then introduced a proper implementation called "utf8mb4".

* MySQL connectors and client libraries often default their "client character set" setting to latin1, causing "silent" transcodes against the "server character set" and table column character sets. Also, because their "latin1" charset is more or less a binary-safe encoding, it is very easy to get double latin1-to-utf8 transcoded data in the database (see the sketch below), something that often goes unnoticed as long as data is merely received-inserted-selected-output to a browser, until you start to work on substrings or case-insensitive searches, etc.

* In Java, there are tons of methods that work on the boundary between bytes and characters and allow not specifying an encoding, which then silently falls back to an almost randomly set system encoding.

* Many languages such as Java, JavaScript and the Unicode variants of Win32 were unfortunately designed at a time when Unicode characters could fit into 16 bits, with the devastating result that the data type "char" is too small to store a single Unicode character. It also plays hell on substring indexing.

In short, the APIs are stacked against the beginning programmer and don't make it obvious that when you go from working with abstract "characters" to byte streams, there is ALWAYS an encoding involved.
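The double-transcode trap from the MySQL point above is easy to reproduce in plain Python, no database required (a sketch, not the connector's actual code path):

    >>> good = 'é'.encode('utf-8')                        # b'\xc3\xa9', what should be stored
    >>> mangled = good.decode('latin-1').encode('utf-8')  # accidental second latin1→utf8 pass
    >>> mangled
    b'\xc3\x83\xc2\xa9'
    >>> mangled.decode('utf-8')                           # the familiar mojibake
    'Ã©'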


> * PHP has functions named "utf8_encode()" and "utf8_decode()", when they should have been called "latin1_to_utf8_transcode()" and "utf8_to_latin1_transcode()"

In the XML module, no less. I'll get round to moving those out of there eventually.


Well, I finally wrote a patch for that: https://github.com/php/php-src/pull/2160


Does any programming language get Unicode right all the way? I thought Python did it mostly correctly, but, for example, with combining characters I would argue that it gets it wrong if you try to reverse a Unicode string.
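For instance, a minimal Python 3 sketch of the reversal problem:

    >>> s = 'e\u0301'                      # "é" as base letter + combining acute accent
    >>> [hex(ord(c)) for c in s[::-1]]     # naive reverse detaches the accent
    ['0x301', '0x65']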


My basic litmus test for "does this language support Unicode" is, "does iterating over a string get me code points?"¹

Rust, and recent versions of Python 3 (but not early versions of Python 3, and definitely not 2…) pass this test.

I believe that all of JavaScript, Java, C#, C, C++ … all fail.

(Frankly, I'm not sure anything in that list even has built-in functionality in the standard library for doing code-point iteration. You have to more or less write it yourself. I think C# comes the closest, by having some Unicode utility functions that make the job easier, but still doesn't directly let you do it.)
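(To make the test concrete, a quick Python 3 check with a plane-1 character:)

    >>> list('a\U0001D11E')    # iteration yields code points, not UTF-8/UTF-16 code units
    ['a', '𝄞']
    >>> len('a\U0001D11E')
    2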

¹Code units are almost always, in my experience, the wrong layer to work at. One might argue that code points are still too low level, but this is a basic litmus test (I don't disagree that code points are often wrong, it's mostly a matter of what can I actually get from a language).

> try to reverse a Unicode string.

A good example of where even code points don't suffice.


Lua 5.3 can iterate over a UTF-8 string. You can even index character positions (not byte positions) in a UTF-8 string. Some more information here: https://www.lua.org/manual/5.3/manual.html#6.5

Edit: fixed link


I basically agree with you, but note that code points are not the same as characters or glyphs. Iterating over code points is a code smell to me. There is probably a library function that does what you actually want.


I explicitly mention exactly this in my comment, and provide an example of where it breaks down. The point, which I also heavily noted in the post, is that it's a litmus test. If a language can't pass the iterate-over-code-points bar, do you really think it would give you access to characters or glyphs?


The .iterator() method of strings in JS returns code points, not JS characters, so, e.g., Array.from("💩a") is ["💩", "a"].

(I use U+1F4A9 for testing all my non-BMP needs)


If I've got the right method[1], that appears to have been added in ES6. I'm still getting up to speed there. Not supported on IE, of course.

[1]: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Refe...


C# is built on top of what Windows offers, so I imagine it matches a mix of DBCS and Unicode APIs.


All strings in C# are UTF-16.

However, it exposes the encoding directly as a sequence of 16-bit ints. In other words, if you iterate over a string or index it, you're getting those, and not codepoints (i.e. it doesn't account for surrogate pairs).

Note that this only applies to iteration and indexing. All string functions do understand surrogates properly.

On the other hand, C# (or rather .NET) has a way to iterate over text elements in a string. This is one level higher than code points, in that it folds combining characters: https://msdn.microsoft.com/en-us/library/system.globalizatio...


Better still is forcing you to say what you want to iterate over, like Swift.


I'd accept that, too. I'm not familiar w/ Swift, so it wasn't in the list above. (But I do think the default should not be code units, or programmers will use it incorrectly out of ignorance. Forcing them to choose prevents that, hence, I'll allow it)


I would go one step further and say that there is no meaningful default for these things. It all depends on the context, and there's no single context that is so common that it's the only one that most people ever see. Thus, it should always be explicit, and an attempt to enumerate or index a string directly should not be allowed - you should always have to spell out if it's the underlying encoding units, or code points, or text elements, or something else.


> Does any programming language get Unicode right all the way?

Lisps with Unicode support seem to. This is a case where the Common Lisp standard's reluctance to mandate certain things paid off bigtime.


How does Clojure or ClojureScript do as they are built on top of JVM/CLR or JavaScript? I'm assuming that since they fall back to many of the primitives of their respective runtimes that they use their implementations.


> How does Clojure or ClojureScript do as they are built on top of JVM/CLR or JavaScript?

I don't know, but what you wrote sounds right. I don't really think of them as Lisps, although they have some Lisp-like features.


I've had the least trouble when using Apple's Objective-C (NSString), and Microsoft's C# - these two at least make you take conscious decisions when transcoding to bytes.


Right. The encoding is just figuring out how many bits you need for a number and putting those bits in a template. With UTF-8 having four templates and UTF-16 having two.
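For reference, the four templates as a minimal Python sketch (ignoring surrogates and other validity checks; real codecs do more):

    def utf8_encode_codepoint(cp: int) -> bytes:
        if cp < 0x80:                                        # 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                                       # 110xxxxx 10xxxxxx
            return bytes([0xC0 | cp >> 6, 0x80 | cp & 0x3F])
        if cp < 0x10000:                                     # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | cp >> 12, 0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])
        return bytes([0xF0 | cp >> 18, 0x80 | cp >> 12 & 0x3F,   # 11110xxx then 10xxxxxx ×3
                      0x80 | cp >> 6 & 0x3F, 0x80 | cp & 0x3F])

    assert utf8_encode_codepoint(0x20AC) == '€'.encode('utf-8')   # U+20AC → e2 82 ac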


> UTF-16, and non-BMP planes, were devised in 1996. The author seems to have been 5 years late to the party.

UTF-16 was only made into a standard in 2000.[0]

The first non-BMP characters weren't introduced until 2001.[1]

From a historical perspective, the article was written less than a month after Unicode 3.1 was officially released, when it was not yet clear that Unicode (the standard, not the technical implementation of the encoding) would actually work.

[0] - https://tools.ietf.org/html/rfc2781 [1] - http://www.unicode.org/reports/tr27/tr27-4.html


> UTF-16 was only made into a standard in 2000

It only got its own IETF RFC in 2000, but the surrogate pair mechanism was first specified in 1996's Unicode 2.0.


The maximum number of characters in Unicode is 1111998 (this excludes non-characters and surrogates). UTF-16 can encode 1112064 code points (this includes non-characters). This is quite a bit smaller than the 21 bits quoted: you really get the 16-bit BMP plus the 20-bit astral planes, minus the surrogates (2048 of them). The "non-characters" number 66.


> The maximum number of characters ... (this excludes non-characters and surrogates)

You should exclude the surrogates but include the non-characters when tallying the characters. According to Unicode's 2013 Corrigendum 9, "Noncharacters in the Unicode Standard are intended for internal use and have no standard interpretation when exchanged outside the context of internal use. However, they are not illegal in interchange nor do they cause ill-formed Unicode text".

See http://www.unicode.org/versions/corrigendum9.html


I can see why this might be confusing—I chose the word "character" carefully. The annex you cited clarifies that noncharacters are code points that can be used in interchange. However, this does not mean that a noncharacter is a character.

The number of code points available for use as characters must, by definition, exclude the noncharacters. So, to expand on the original comment, Unicode defines 1114112 code points, of which 1112064 can be used in interchange, and 1111998 can be defined as characters. UTF-16 can only represent the 1112064 that are valid for interchange, and the 66 noncharacters should generally be avoided (especially U+FFFE).
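The arithmetic, for anyone following along:

    >>> 17 * 2**16         # total code points U+0000..U+10FFFF
    1114112
    >>> 1114112 - 2048     # minus the surrogates: valid for interchange
    1112064
    >>> 1112064 - 66       # minus the noncharacters: assignable as characters
    1111998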


Thank you for that correction, I suspected 2^21 might not be quite correct.


http://utf8everywhere.org/

A very useful site, especially when having to explain what UTF-8 is to other devs when working in a Windows shop.



'working in a windows shop.'

Surely you're flattering yourself.


No, seriously. I'm a Windows application dev, and I have been for more than a decade.

If all you see around you is wchar_t and LPCWSTR, then that is what Unicode means.


Man, UCS-2 is the pits. I still remember fighting with 'narrow' builds of Python back in the day.

Any critique of Unicode that doesn't assume UTF-8 (which allows for more than 1 million code points) is a bit suspect in my opinion. The biggest point against UTF-8 might be that it takes more space than 'local' encodings for Asian languages.


Wikipedia has a summary of comparisons:

https://en.wikipedia.org/wiki/UTF-8#Compared_to_UTF-16

Advantages

* Byte encodings and UTF-8 are represented by byte arrays in programs, and often nothing needs to be done to a function when converting from a byte encoding to UTF-8. UTF-16 is represented by 16-bit word arrays, and converting to UTF-16 while maintaining compatibility with existing ASCII-based programs (such as was done with Windows) requires every API and data structure that takes a string to be duplicated, one version accepting byte strings and another version accepting UTF-16.

* Text encoded in UTF-8 will be smaller than the same text encoded in UTF-16 if there are more code points below U+0080 than in the range U+0800..U+FFFF. This is true for all modern European languages.

* Most communication and storage was designed for a stream of bytes. A UTF-16 string must use a pair of bytes for each code unit:

* * The order of those two bytes becomes an issue and must be specified in the UTF-16 protocol, such as with a byte order mark.

* * If an odd number of bytes is missing from UTF-16, the whole rest of the string will be meaningless text. Any bytes missing from UTF-8 will still allow the text to be recovered accurately starting with the next character after the missing bytes.

Disadvantages

* Characters U+0800 through U+FFFF use three bytes in UTF-8, but only two in UTF-16. As a result, text in (for example) Chinese, Japanese or Hindi will take more space in UTF-8 if there are more of these characters than there are ASCII characters. This happens for pure text[nb 2] but actual documents often contain enough spaces and line terminators, numbers (digits 0–9), and HTML or XML or wiki markup characters, that they are shorter in UTF-8. For example, both the Japanese UTF-8 and the Hindi Unicode articles on Wikipedia take more space in UTF-16 than in UTF-8.[nb 3]
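For a concrete feel of the trade-off (sample strings picked arbitrarily; real pages are dominated by markup):

    >>> english, japanese = 'Hello, world', 'こんにちは世界'
    >>> len(english.encode('utf-8')), len(english.encode('utf-16-le'))
    (12, 24)
    >>> len(japanese.encode('utf-8')), len(japanese.encode('utf-16-le'))
    (21, 14)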


The biggest disadvantage of UTF-16, IMO, is that programmers blindly assume that they can index into the string as if it were an array, and get a code point out — which you can not; you'll get a code unit, which is slightly different, and might not represent a full code point (let alone a full character).

UTF-8's very encoding quickly beats this out of anyone who tries, whereas it's easy to eke by in UTF-16. The real problem is that the APIs allow such tomfoolery. (Some have historical excuses, I will grant, but new languages are still made that allow indexing into code units without it being obvious that this is probably not what the coder wants.)


Strictly speaking, this is a disadvantage of languages that have strings as first-class types and allow indexing on strings in the first place (and specify it to have this semantics).

For the most part, the developer shouldn't really care about the internal encoding of the string, but the language/library should also not expose that to them.


It doesn't mention the biggest disadvantages of UTF-8 relative to UTF-16: the existence of non-shortest forms and invalid code units.


Ah, but those are disallowed by spec, and in the former case you'd open a security bug against anything that didn't transform it to U+FFFD or similar (e.g. it lets you sneak '\0's into C-style strings and '/'s into Unix paths).

So I'll grant you a point but could match that against similar problems in UTF-16 (bad surrogate pairs, surrogate singletons, BOM bombs, and the same invalid code units).
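For example, CPython's UTF-8 codec refuses the classic overlong NUL outright (a quick check; the exact error text may vary by version):

    >>> try:
    ...     b'\xc0\x80'.decode('utf-8')      # overlong two-byte form of U+0000
    ... except UnicodeDecodeError as e:
    ...     print(e)
    ...
    'utf-8' codec can't decode byte 0xc0 in position 0: invalid start byte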


Which really isn't a very serious concern at this point, considering how cheap storage and compression are.

What's expensive to store are images and sound, from an ever increasing number of devices at an ever higher resolution. The production and storage of text barely registers in comparison.


Text size is still relevant due to bandwidth and reliability of transmission. Not everyone has a gigabit internet connection--large portions of the world are still operating on 2G wireless, or even dialup.


Definitely. I would be interested to know of any good statistics on data size of real-world non-European language webpages in UTF-8 vs. UTF-16 and also with/without compression. Much of the markup will be smaller in UTF-8 but actual text content would be smaller in UTF-16.


And both end up being fed to optimized-deflate-encoder-of-the-week and have negligible size differences once compressed.


There was an experiment (see https://bug416411.bmoattachments.org/attachment.cgi?id=30307... for the results) that switched all the UTF-16 internal strings in Firefox to UTF-8. It found that UTF-8 lowered memory usage, even on text-heavy East Asian pages. Keep in mind that things like tag names or attribute names are all interned, so there's no space savings from, say, compressing <img src="">.


See comment above parent:

> This happens for pure text[nb 2] but actual documents often contain enough spaces and line terminators, numbers (digits 0–9), and HTML or XML or wiki markup characters, that they are shorter in UTF-8. For example, both the Japanese UTF-8 and the Hindi Unicode articles on Wikipedia take more space in UTF-16 than in UTF-8.[nb 3]


Wrong. Most of the world is on 3G and wifi:

https://opensignal.com/reports/2016/08/global-state-of-the-m...

Many "developing" countries never even deployed 2G and dial-up to any great extent. They were simply too poor to build large-scale telephone network infrastructure. When they did start getting connectivity in the late 1990s and 2000s, they were able to skip straight to the latest generation of technology.

This is a pattern we see all over the world — the wealth advantage of "developed" nations is offset by their historical investment in infrastructure that is no longer state-of-the-art.

For example, the London Underground has been in operation for over a hundred years. The newest bits are great, but the oldest parts are hamstrung by design decisions made in the Victorian era. Whereas, when China builds a new metro, it's able to build every part of it to modern standards, applying the accumulated knowledge from building those earlier metros.


No, not wrong. I said:

> Not everyone has a gigabit internet connection--large portions of the world are still operating on 2G wireless, or even dialup.

Sure, the majority of people have moved over to 3G or better worldwide, but there are still many areas where 2G is more common. We just did a deployment in India[1] which still has more 2G coverage than 3G. Performance on 2G connections was a requirement from our Indian business partners.

It's also worth noting that your link contains an implicit bias: it's measuring connections, not people. People with slower connections sometimes simply won't connect at all if your site doesn't perform on their connection, so this is always going to skew toward faster connections. Your link is also pretty vague on the actual statistics--given their claim that the vast majority of countries have > 3G availability 75% of the time, 25% of the majority of countries could not have > 3G availability, and if the vast minority country is India, that's hundreds of millions of people.

Yes, the majority of the world is on 3G or better, but the minority can still contain millions and millions of people.

[1] http://www.sensorly.com/map/2G-3G/IN/India/Vodafone/gsm_4040...


Encoding text is not the problem with Unicode. The problem is the complexity at the higher abstraction level.

At the top level of abstraction are the abstract character (aka the user-perceived character) and the grapheme cluster (a sequence of coded characters that should be kept together). The mapping from code points to abstract characters is not total, injective, or surjective.
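A small Python illustration of that many-to-one mapping: one user-perceived character, two different code point sequences:

    >>> import unicodedata
    >>> a, b = '\u00e9', 'e\u0301'     # both render as "é"
    >>> a == b, len(a), len(b)
    (False, 1, 2)
    >>> unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)
    True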

Unicode is not just a standard for encoding. It's also a standard for representing and handling text.

The exact semantics of a UTF-8 string in all cases are something that only a few programmers are able to comprehend. I know for sure that I don't, and I don't know anyone who does. Interchanging and storing UTF-8 strings and hoping for the best is the standard practice, and it works well, but it shows the overreach that the Unicode standard is.


SCSU and BOCU-1 can both help a lot with that last point. They're two standard methods of compressing unicode strings that are primarily in a single language—obviously it's not as good as lz4 or gzip or other general-purpose compression algorithms, but it also doesn't increase the size so is ideal for up to ~a few kb.


SCSU and BOCU-1 were never seriously used, as far as I can tell, except to prove a point back when developers were uncertain about implementing Unicode.

Unicode standardizer: "Your software needs to support more than one language at a time. Please use Unicode."

Developer: "I don't want to. I've already got a codepage that's designed for the language I care about, and Unicode will make everything take up too much space."

Unicode standardizer: "Here's BOCU-1, an encoding of Unicode that compresses monolingual text into nearly the same amount of space as your favorite codepage."

Developer: "Uh, thanks, but that's weird."

Unicode standardizer: "Yeah, never mind. How about you try this new encoding called UTF-8?"

Developer: "Oh, this works really well and I guess it's small enough. I'll use it."


I think I'd prefer transparent stream/file-system compression of text documents for this type of issue.


More like, "Why UCS-2 Won’t Work on the Internet".


An extensive and very informative, though a bit sarcastic, rebuttal (from 2001 as well): https://features.slashdot.org/story/01/06/06/0132203/why-uni... (via https://twitter.com/FakeUnicode/status/786324531828838400).


> Thus is can be said that Hiragana can form pictures but Katakana can only form sounds

That sounds really weird to me. Does that sound right to any native Japanese speakers here?


Not a native speaker, but this is still wrong: Hiragana is just like Katakana; each character is a syllable (more or less; n is a bit special in both sets). There's a 1:1 mapping between Hiragana, Katakana and the syllables that exist in Japanese.

Kanji (Chinese-style characters) aren't phonetic on the other hand, they map to several different sounds, and several Kanjis may map to the same sound (not necessarily a single syllable).


To contest your point of a 1:1 mapping, there are hentaigana which are older than the obsolete characters in both hiragana and katakana. Interestingly enough, these are not supported in Unicode.

Additionally, depending on context, hiragana might not have a phonological reading. See the object marker 'wo'. Katakana is usually phonological, except for the same usage of object marker.

In summary, someone who says "Hiragana can form pictures but Katakana can only form sounds" is either a philosopher who practices calligraphy or misinformed.


> To contest your point of a 1:1 mapping, there are hentaigana which are older than the obsolete characters in both hiragana and katakana.

Technically yes, but they don't matter. The modern hiragana and katakana syllabaries have intentional symmetry. Anything written in one can be written in the other, and sometimes is (stylistically unorthodox choices of syllabary, old or limited computers where only one set is available, Japanese Braille which makes no such distinction, etc.)

> Additionally, depending on context, hiragana might not have a phonological reading.

This can also be true of katakana in some situations.


They're not strict character-level substitutions for each other (see: katakana long syllable marker), but non-pedantically your point does stand.


The chōonpu (long vowel marker) can be used for hiragana too, but it's not how words are usually spelled. Likewise, you can use hiragana-style long vowels in katakana.


From a historical perspective, it's not true. Both types of kana were derived from Han characters, often the same characters, in different ways. Their usage has varied over time.

As for the neurological claim, I did a quick search and found an article which cited:

> Uno (as cited in Itani 2001) and Saito (as cited in Kosaka & Tsuzuki) say that kana is primarily processed phonologically in the reading process, whereas kanji is processed semantically.

http://www.staff.amu.edu.pl/~inveling/pdf/Dyszy-18.pdf

This is a few steps removed from the actual citation, but it seems to support your disagreement with the article.


It's a bit poetic but it's almost right: Hiragana is for words (native Japanese) and Katakana is for spelling things out phonetically (onomatopoeic words and foreign loanwords).

This ignores the fact that a significant percentage of everyday Japanese life is now loanwords, so you can't really dismiss them as not being part of the language.


I'm absolutely not qualified to address this definitively, but my understanding is that Hiragana is typically used for verb conjugations, adjective inflections, particles (like "の" for indication of possession for example), some natively Japanese words where no one really uses the Kanji for whatever reason, and some other stuff that's similar in nature. Katakana is typically used for writing foreign words so that they use the pronunciation rules of Japanese (like "コカ・コーラ" for Coca-Cola for example) and onomatopoeias.


This writing is sort of a strange metaphor, but I guess the point the author makes is that kanji can be transliterated as hiragana but not katakana. The writer goes on to talk about traumatic brain injuries so I guess he's aiming at the cultural value of each syllabary.

I'm not a native speaker, but if I were to make an equally strange metaphor as the author, katakana feels like writing in all capital letters.


Kanji can be transliterated either way, and both forms are lossy since there are so many homonyms and kana only encode sounds. It's traditional to annotate difficult Kanji pronunciation with small Hiragana called Furigana, for example in children's books. But it could be done all the same in Katakana. Modern Chinese words that Japanese borrows are usually transcribed in Katakana, for example.


Kana mostly contain just sounds, but do contain some morphological information—there are homonyms in kana as well, after all. This is a bit rare, however.


While possible phonetically, it would be strange to see kanji transliterated as katakana.

But I guess there are some Japanese sources that are written all in katakana.


There are situations where Kanji get transliterated to katakana - many name input forms will ask for the reading of your kanji name in katakana


Good point.


Not native here, but studied a bit. Hiragana is used for writing down native words, Katakana - of foreign origin. So it doesn't really sound right.


The paper is far more interesting for its informative background on the use of the character sets in the CJK region.


Good to see we've had a breakthrough after 15+ years.


Why "640K ought to be enough to everyone"


The opposite, really. It's arguing why 16 bits is not enough. But it does this while casually dismissing the solutions we already had.


[flagged]


We've banned this account for repeatedly violating the guidelines—several times in this comment alone—after we've asked you to stop. We detached this subthread from https://news.ycombinator.com/item?id=12696915 and marked it off-topic.


Banned what account??? I've definitely never been banned. I don't know what the hell you're talking about. Interesting that it's fine with you that the parent's comment is yet another stab at forcing liberal politics down my throat and you have no problem with that, but if I object to it, then there's a problem. Typical liberal thought police fascism -- let's guess who you're voting for this election?

Let me make it simple for you -- if that's what's going on here at Hacker News that they've got you playing hall monitor using your rules as a tool to censor only the political speech you don't like, and that happens to be liberal, you AND this site can stick your thought control where the sun don't shine and delete my entire account for all I care. I can promise you that if it's going to be okay for liberals to insert THEIR politics into their comments then I'm going to feel free to continue firing back with mine and if you don't like it that's too bad.


Not sure you're contributing a lot to the conversation there. Probably annoying a bunch of folks in the process. You probably know all that though.

But you're right. We love being called orientals. It's just that warm feeling of being categorised and separated from the occidentals in that lovely nostalgic way the PC guys just don't get.


And the odd thing is that orient = Asia doesn't make sense in an American context...


From time to time certain words accumulate enough baggage that they need to be retired from polite conversation.

This isn't "PC social justice crap", this is how society works. Words evolve in meaning, and sometimes those meanings become intrinsically linked with racism and prejudice.

If you'd rather speak your mind and, in the process, make everyone think you're an absolute dick, by all means, but don't think anyone who refuses to put up with your crap is being politically correct. They're just being polite by looking out for other people.


Looking out for other people without those people actually being offended is a bit silly though. The ultra PC crowd seems to have taken it upon themselves to decide what words are offensive without the 'victims' actually caring about the offensive word in the first place.

These scenarios seem to be white people deciding what words other white people are allowed to use to describe cultures, people, nationalities, ethnicities, etc. Would you chide an Asian group running a store with 'Orient' in the name?[1] If not, why the hypocrisy?

1. http://orientwatchusa.com


It's like if someone calls a girl "bitch" in what should be polite conversation you do have a right to say "no", that's not cool. This is no different.

It's not "white people deciding" it's "white people finally acknowledging what these words mean".

Whatever culture is the subject of a slur is naturally free to use that slur in any way they see fit. It does not automatically mean it's free for anyone to use.


Acknowledging what to whom and on behalf of whom?

I really want you to fill in the blanks for those 3 in the context of referring to Asia as 'the Orient'.


You really want to test everyone's patience, don't you?

"Oriental" became a slur just as "negro" did despite starting out as a more neutral term, though one in an age where many racial terminology was intrinsically loaded.

If people of Asian descent want to call their shops something with "Orient" in it, it doesn't bother me. If someone who's not from that descent instead appropriates that culture and makes a mockery of it, that's not cool.


>You really want to test everyone's patience, don't you?

Don't project your frustrations as a means to appeal to popularity.

>"Oriental" became a slur

According to whom?

>just as "negro"

No, not the same. "negro" is an offensive term to black people.

>instead appropriates that culture and makes a mockery of it

Can you explain to me in the quote from the document where the 'mockery' is?

"The fact of the matter is that this bias, and its glaring ignorance of the real value of such a large amount of so-called "redundancy" continues to this very day, and thus continues to be a chafing-point between Orientals and misguided Westerners."


> According to whom

Wikipedia gives a summary with respect to the American use of the word (https://en.wikipedia.org/wiki/Orient#American_English):

"John Kuo Wei Tchen, director of the Asian/Pacific/American Studies Program and Institute at New York University, said the basic critique of the term developed in U.S.A. in the 1970s. Tchen has said: "With the U.S.A. anti-war movement in the '60s and early '70s, many Asian Americans identified the term 'Oriental' with a Western process of racializing Asians as forever opposite 'others'."[10]"


Maybe you haven't been paying attention, but there's a lot of friction in very concrete cases of mockery like various sports team names and mascots, especially the Cleveland Indians.

Their mascot is literally a caricature of an "indian". It's from a time when that sort of thing was acceptable because we didn't have to listen to such voices. We'd just sweep them off to reservations and strip them of their dignity and rights. Great times.

So if you want to presume Oriental isn't a slur, go right ahead and keep using it. Just don't raise a fuss if someone comes to you and says "You might not want to use that word, it's got certain connotations."


You haven't responded at all to my questions, which implies you have no response and there isn't any substance to your argument.


It's called I've written you off as someone who's tone deaf and there's no point in discussing this further.

I'd rather explain the concept of color to dogs.


> From time to time certain words accumulate enough baggage that they need to be retired from polite conversation.

The problem is that's not universally the case. Outside of North America the word is not offensive or pejorative. While it's definitely offensive in North American English, without knowing the author's origin it's wrong to dismiss it completely out of hand. American culture is not the sole arbiter of what is offensive.

Edit: And after googling around, the author lives in Vancouver -- so the usage can be considered offensive.


[flagged]


"Oriental" comes from the Latin word for "eastern". If you're going to ban that word, then you should also ban words like "Western" (as in "Western civilization") since they represent a subjective point of view. But what, actually, is wrong with having a particular culturally-dependent PoV?

According to Wikipedia, outrage over the word "oriental" is itself culturally determined - it mostly seems to be a thing in American English.


While I agree that the word "oriental" is a bit dated, you should be aware that the notions of which words are offensive is cultural, not universal: https://en.wikipedia.org/wiki/Orient#Current_usage

A lot of the mismatch in offensiveness/offendedness on HN seems to boil down to differences between American and European speakers of English (as a rule of thumb, Americans tend to consider more things more offensive for historical reasons).

Personally, as a non-native speaker, I find it weird to call people Oriental because the notion feels a bit dated in my native language. However in the given context it seems a lot clearer than spelling out each individual culture or vaguely saying "non-Western" (which frequently excludes Eastern Europe).

IOW, occident/orient is a historical cultural distinction, much like Cold War era first/second/third worlds and East/West -- of course it's fuzzy and inadequate, but so are all abstractions.

As a European I can see why someone ethnically stemming from the "orient" might find the term quaint, inaccurate and even offensive, but I feel the same way about the various labels Americans unthinkingly throw around, e.g. "Caucasian" (because obviously all "white" people are the same and are best represented by the geographical area surrounding Azerbaijan).


>Are you so tiny that you have to shove your balls into every conversation about things that don't fucking concern you?

This is not a way to conduct a useful discussion. A discussion about the use of the word 'oriental' in a 15 year old document doesn't concern any of us. But this site is dedicated to discussions, so don't try to arbitrate what is worthy of discussion.


I've known Japanese, Chinese and Korean people who self-described as oriental. Clearly I don't know a representative sample, but it's not nearly as straightforward as you imply.


Hilarious, a document from 2001 talking about why Unicode is unsuitable to "the orient." At the end, I half expected to read that "Negroes have also proved to be most unfavorable to it."


Simply because of his use of an outdated synonym of "Asia"? If anything, I got the impression that the author was critical of Westerners for being insensitive to the needs of Asian computer users. I think this is because of Han Unification, but he does not mention it by name.


In this author's case, I agree that the intent doesn't appear to have been malicious. But calling it just "an outdated synonym of Asia" is like saying Negro is just an outdated synonym for Black (after all, negro is just Spanish for black). In both cases, it ignores heavy historical and cultural implications that come with the words.


What are the cultural implications of using Orient as a synonym for Asia?


Maybe a foreign language speaker? The term didn't get the same valences in places that didn't colonize it



