Dark corners of Unicode (eev.ee)
158 points by zeitg3ist on Sept 13, 2015 | 28 comments

I am not sure whether these are dark corners of Unicode or just complex language support. When dealing with text, you need to learn and educate yourself about text. International text is much more complex than localized text, but if you intend to support that, learn it. Sure, frameworks exist that hide a lot of the complexity, but then you hit a specific bug/corner case where the framework fails you, and since you are not familiar with the intricacies of the language-specific problem, you are clueless how to proceed and fix it, whereas if you invested the time to learn about it, it would have been much easier.

A similar (albeit simpler and more limited) problem is calendrical calculation, where people with little to no grasp of how to perform correct date operations build complex calendar applications and fail spectacularly in some edge cases.

Call me crazy, but if you are dealing with text, have some time set for research before you start your development.

As you point out, many of these features are dependent on how well they're handled by the application developer. If Unicode were "just" about rendering pre-made glyphs consistently it would not be nearly so hard, but it aims to do more - sorting, capitalization, string length - and that's also why it tends to fall over in production systems. Nobody is ever going to get Unicode completely right, just a subset of it for the languages they've successfully localized.

The simplest character encoding you could hope to work with is something like a single-line calculator or vending machine display - fixed-width, no line breaks, just the Arabic numerals and maybe a decimal point, some mathematical symbols, or a limited English alphabet to display "INSERT CASH". Any featureset above that produces issues. Just line breaks alone are responsible for all sorts of strange behaviors.

I think it's a bit magical that we've managed to do so much with text given the starting situation. At each step - from early telegram encodings through the proliferation of emoji - the implementations had to codify a thing that was previously left open, develop rules around its use, etc. We've made language more systematic than it ever was in history, for the benefit of machines to parse and process it.

Unicode is complex because language is complex. Before we had that complexity in Unicode there were a lot of languages and scripts that couldn't be represented accurately (or at all) in computers. Mixing scripts in one document was nigh impossible. I'm not sure that's a world we want back.

It's funny though, how people are not even aware of the fact that different languages are different until they see something break with Unicode. Casing and collation rules have been language-specific before. It's just that before lots of software didn't even attempt to do it right. Again, a world I'd rather not have back again.

The decimal mark requires some localisation (see https://en.wikipedia.org/wiki/Decimal_mark#Countries_using_A...).

Let's restrict the calculator to whole numbers only.

> I am not sure whether these are dark corners of Unicode or just complex language support.

Either, neither, both. Some are intrinsic to Unicode's purpose of encoding human text, others are accidents of Unicode history, yet others are design decisions which could have gone other ways (which may or may not have been better).

> Call me crazy, but if you are dealing with text, have some time set for research before you start your development.

The issue is that most people will have a hard time justifying a year of linguistic and calligraphic study before the project gets to start (whether employed or independent), and most languages have "string"-manipulation facilities which are easy, obvious and wrong.

I think a year is quite the exaggeration, but of course, it depends on the project. If you are developing a complex word processor, page layout or publishing software, you bet your ass you should devote a year and even more to get it right. In other projects, even taking two or three weeks of research before development will do wonders to complement the existing frameworks which also assist you in development.

For locale-aware correct sorting of unicode strings, based on the Unicode Collation Algorithm, some open source libraries twitter released are pretty awesome.

ruby: https://github.com/twitter/twitter-cldr-rb

javascript: https://github.com/twitter/twitter-cldr-js
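To make the problem concrete, here is a minimal Python sketch of what goes wrong without a UCA implementation like the libraries above. Python and the sample words are my own choice, not something from the article. It only shows the naive code point sort, not a fix.

```python
# Python compares strings code point by code point, with no locale awareness.
words = ["zebra", "éclair", "apple", "Zurich"]

naive = sorted(words)

# 'Z' (U+005A) < 'a' (U+0061) < 'z' (U+007A) < 'é' (U+00E9), so the
# capitalised word floats to the front and the accented one sinks to the end.
print(naive)
```

A collation library would put "éclair" next to the other "e" words for most Latin-script locales, which is the whole point of the UCA.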

Human written language is pretty complicated. The Unicode standards (including the Common Locale Data Repository, the Unicode Collation Algorithm, normalization forms, associated standards and algorithms, etc.) are a pretty damn amazing approach to dealing with it. It's not perfect, but it's amazing it's as well-designed and complete as it is. It's also not easy to implement solutions based on the Unicode standards from scratch, because it's complicated.

I don't agree with the section about JavaScript strings. Those are proper strings, just encoded in UTF-16.

> JavaScript’s string type is backed by a sequence of unsigned 16-bit integers, so it can’t hold any codepoint higher than U+FFFF and instead splits them into surrogate pairs.

You just contradicted yourself. Surrogate pairs are exactly what allow UTF-16 to encode any codepoint.
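For anyone who hasn't seen the arithmetic, this is roughly how a codepoint above U+FFFF gets split into two 16-bit code units. It's a sketch in Python rather than JavaScript, and the function name is made up for illustration.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (above U+FFFF) into UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go in the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go in the low surrogate
    return high, low

high, low = to_surrogate_pair(0x1F4A9)
print(hex(high), hex(low))

# Cross-check against Python's own UTF-16 encoder (big-endian, no BOM).
encoded = "\U0001F4A9".encode("utf-16-be")
print(encoded.hex())
```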

Once you start talking about in-memory representation, you need to agree on an encoding, UTF-8 and UTF-16 being the most common. wchar_t could be UTF-16 or UCS-2.

JavaScript strings are not UTF-16; you'd only ever see codepoints if that were the case. JavaScript "strings" are UCS-2, and it's trivial to demonstrate: "\ud83c" is a valid JavaScript string, but it's not valid UTF-16.

Here's the relevant section of the Unicode FAQ on the subject:

> UCS-2 does not describe a data format distinct from UTF-16, because both use exactly the same 16-bit code unit representations. However, UCS-2 does not interpret surrogate code points, and thus cannot be used to conformantly represent supplementary characters.

A correct UTF-16 implementation would interpret surrogate code points, validate that they're paired, and prevent access to either surrogate via string operations.
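A rough analogue in Python, for what it's worth. This is not JavaScript, but Python's str also lets you build a string holding a lone surrogate, and its strict UTF-16 encoder then refuses to serialise it, which is the "not valid UTF-16" part.

```python
lone = "\ud83c"      # a single high surrogate with no partner
print(len(lone))     # the string itself is accepted and has one element

try:
    lone.encode("utf-16-le")
    is_valid_utf16 = True
except UnicodeEncodeError:
    # The strict encoder rejects unpaired surrogates.
    is_valid_utf16 = False

print(is_valid_utf16)
```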

ES6 did get some new functions to correctly deal with surrogate pairs in strings. In the end, JS strings are just a sequence of 16-bit values, with the unfortunate case that many string functions interpret those as UCS-2 and only some new functions as UTF-16.

When you come across an invalid sequence while decoding particular input (like "\ud83c") then you generally have three choices: throw an exception, skip the invalid part, or replace it with a replacement character. The default JavaScript behavior is to be lenient. But if you need more control over the decoding behavior then you can use StringView or TextDecoder, which is part of this spec: https://encoding.spec.whatwg.org/
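The same three choices show up in most decoders. Here is a sketch using Python's bytes.decode as a stand-in (my assumption: its "strict", "ignore" and "replace" error handlers map onto throw, skip and replace). It is not the WHATWG TextDecoder API.

```python
# 0xD83C (a lone high surrogate) followed by 'A', as little-endian UTF-16 bytes.
data = bytes([0x3C, 0xD8, 0x41, 0x00])

# Choice 1: throw an exception.
try:
    data.decode("utf-16-le", errors="strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "exception"

# Choice 2: skip the invalid part.
skipped = data.decode("utf-16-le", errors="ignore")

# Choice 3: substitute U+FFFD REPLACEMENT CHARACTER.
replaced = data.decode("utf-16-le", errors="replace")

print(strict_result, repr(skipped), repr(replaced))
```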

> ES6 did get some new functions to correctly deal with surrogate pairs in strings. In then end, JS strings are just an sequence of 16 bit values

Which is exactly why they are not and cannot be UTF-16.

> The default JavaScript behavior is to be lenient.

The javascript behaviour is to have UCS2 "strings".

But they're not a sequence of codepoints. They're a series of 16 bit values that can be set to anything, even invalid unicode.

JavaScript has byte strings, not character strings.

For very wide bytes ;)

I'd say a sequence of code units instead of code points. Sadly that holds true for many string implementations in programming languages, often for history, compatibility, or efficiency reasons (e.g. C# could have done it right, being designed after the UCS-2/UTF-16 split, but they didn't for various reasons). So you get code unit sequences with a few functions tacked on top that add code point support.
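To show how much the unit of counting matters, here is a small Python sketch. Python's str happens to count code points, so the UTF-16 code unit count (what a JS, C# or Java string length reports) is derived by encoding.

```python
text = "a\U0001F4A9"   # 'a' followed by U+1F4A9, which lies outside the BMP

code_points = len(text)                            # Python counts code points
utf16_units = len(text.encode("utf-16-le")) // 2   # 16-bit code units
utf8_bytes = len(text.encode("utf-8"))             # 8-bit code units

print(code_points, utf16_units, utf8_bytes)
```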

Huh, I was wrong, there's no such thing as an invalid code unit. Even U+FFFE and U+FFFF are allowed.

Interesting that Firefox takes the decomposed Hangul and renders it as whole syllables, while Chrome shows them as the sequence of individual jamos. http://mcc.id.au/temp/hangul.png
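For reference, this is the decomposition in question, sketched with Python's unicodedata module (the sample syllable is my pick, not one from the article). Both forms are canonically equivalent, so in principle they should render the same.

```python
import unicodedata

syllable = "\ud55c"   # U+D55C, one precomposed Hangul syllable
jamo = unicodedata.normalize("NFD", syllable)

# NFD splits the syllable into leading consonant, vowel and trailing consonant.
print([f"U+{ord(c):04X}" for c in jamo])

# NFC puts the pieces back together.
roundtrip = unicodedata.normalize("NFC", jamo)
print(roundtrip == syllable)
```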

He's rendering normalized text, but normalization is only for string comparisons...

I don't understand why emoji are width 1 either. Really, the EastAsianWidth.txt from the Unicode standard needs to match fixed-width terminal emulators.
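If you want to see what the data says for a given character, something like this Python sketch works. The cell_width helper is hypothetical and far cruder than a real wcwidth, and the answer for the emoji depends on which Unicode version your Python ships with, so don't read too much into one run.

```python
import unicodedata

def cell_width(ch):
    """Guess terminal columns for one character from its East Asian Width."""
    if unicodedata.combining(ch):
        return 0   # characters with a non-zero combining class take no cell
    # Wide (W) and Fullwidth (F) occupy two cells; treat everything else as one.
    return 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1

for ch in ("A", "\u4e2d", "\U0001F600"):
    print(f"U+{ord(ch):04X}", unicodedata.east_asian_width(ch), cell_width(ch))
```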

I've been dealing with all of this recently in JOE: http://sourceforge.net/p/joe-editor/mercurial/ci/default/tre...

In particular JOE now finally renders combining characters correctly. It now stores a string for each character cell which includes the start character and any following combining characters. If any of them change, JOE re-emits the entire sequence.

But which characters are combining characters? I expect \p{Mn} and \p{Me}, but U+1160 - U+11FF needs to be included as well but isn't. It's crazy that these are not counted as combining characters. Now I'm going to have to check how zero-width joiner is handled in terminal emulators. JOE is not changing the start character after a joiner into a combining character, ugh..
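This is not JOE's actual code (JOE is written in C), just a Python sketch of the per-cell grouping described above, using the same rule: categories Mn and Me plus the U+1160 to U+11FF jamo range. The function names are made up.

```python
import unicodedata

def is_combining(ch):
    """Treat Mn/Me marks and the Hangul vowel/trailing jamo as combining."""
    return (unicodedata.category(ch) in ("Mn", "Me")
            or 0x1160 <= ord(ch) <= 0x11FF)

def cells(text):
    """Group each start character with the combining characters that follow it."""
    out = []
    for ch in text:
        if out and is_combining(ch):
            out[-1] += ch    # attach to the current cell
        else:
            out.append(ch)   # start a new cell
    return out

print(cells("e\u0301x"))             # 'e' + combining acute, then 'x'
print(cells("\u1112\u1161\u11ab"))   # decomposed Hangul syllable
```

Note that U+1161 has category Lo, which is why checking only Mn and Me misses it.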

Well, there are multiple normalization forms in unicode. The OP isn't clear about this. (Perhaps because the python library he's using also isn't as clear as it ought to be? I dunno)

'compatibility' normalizations are mainly for comparison (including indexing/search/retrieval) and sorting, although there might be other uses. But indeed you should not expect a 'compatibility' normalization to render the same as the not-normalized input that produced it under normalization.

The 'canonical' normalization outputs ought to render the same as de-normalized input, but rendering systems don't always get it quite right.

For the web, the WWW consortium recommends a canonical normalization.
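A quick way to see the difference between the canonical and compatibility forms, sketched with Python's unicodedata (the three sample characters are just ones I picked):

```python
import unicodedata

samples = {
    "e-acute": "\u00e9",
    "fi ligature": "\ufb01",
    "superscript two": "\u00b2",
}

def forms(s):
    """Return all four normalization forms of s, keyed by form name."""
    return {f: unicodedata.normalize(f, s) for f in ("NFC", "NFD", "NFKC", "NFKD")}

for name, s in samples.items():
    # Print code points rather than glyphs so the differences are visible.
    print(name, {f: [f"U+{ord(c):04X}" for c in v] for f, v in forms(s).items()})
```

The canonical forms only change how the e-acute is spelled (one code point or two), while the compatibility forms also turn the ligature into "fi" and the superscript into a plain "2".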

The Unicode documentation on normalization forms is actually pretty readable and straightforward, for being a somewhat confusing topic. http://unicode.org/reports/tr15/

> Normalization Forms KC and KD [compatibility normalizations] must not be blindly applied to arbitrary text. Because they erase many formatting distinctions, they will prevent round-trip conversion to and from many legacy character sets, and unless supplanted by formatting markup, they may remove distinctions that are important to the semantics of the text. It is best to think of these Normalization Forms as being like uppercase or lowercase mappings: useful in certain contexts for identifying core meanings, but also performing modifications to the text that may not always be appropriate.

The compatibility normalizations are pretty damn useful for indexing/search/retrieval though. Anyone storing non-ascii text in Solr or ElasticSearch (etc) probably wants to be familiar with them -- as a general rule of thumb, you probably want to do a compatibility normalization before indexing and again on query input.
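Something like this, sketched in Python. The search_key helper is hypothetical, and the case folding is my own addition, not something Solr or ElasticSearch does for you in this exact way. The important part is running the same function on both the indexed text and the query.

```python
import unicodedata

def search_key(text):
    """Hypothetical helper: fold text into a form suitable for index lookups."""
    # Compatibility-normalize first, then case-fold for caseless matching.
    return unicodedata.normalize("NFKC", text).casefold()

# Stored documents: one with an fi ligature, one in fullwidth letters.
documents = ["\ufb01anc\u00e9e", "\uff28\uff45\uff4c\uff4c\uff4f"]
index = {search_key(doc): doc for doc in documents}

# Queries typed in plain ASCII/Latin-1 still find them.
print(search_key("FIANC\u00c9E") in index)
print(search_key("hello") in index)
```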

OP here. You say this all, and yet, if I google for "unicode strip accents"...

Top-voted answer uses NFD, one below it uses NFKD: http://stackoverflow.com/questions/517923/what-is-the-best-w...

NFKD: http://www.perlmonks.org/?node_id=835238

NFD: http://www.perlmonks.org/?node_id=1105025

NFD: http://www.perlmonks.org/?node_id=485681

NFD: http://drillio.com/en/software/java/remove-accent-diacritic/

NFKD: https://gist.github.com/j4mie/557354

Two and a half of the first six results blindly apply NFKD to arbitrary text. All of them use normalization.

Sad state of affairs.
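For the curious, this is roughly the recipe those answers boil down to, as a Python sketch (strip_marks is my name for it), with the two forms side by side on a string that has more than accents in it:

```python
import unicodedata

def strip_marks(text, form):
    """Decompose with the given form, then drop characters with a combining class."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

sample = "\ufb01anc\u00e9e \u00b2"   # fi ligature + "ancée", a space, superscript two

nfd_result = strip_marks(sample, "NFD")     # accent gone, ligature and superscript kept
nfkd_result = strip_marks(sample, "NFKD")   # ligature and superscript flattened too

print(repr(nfd_result), repr(nfkd_result))
```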

"Strip accents" is not a well-defined operation outside of a specific locale. Does "Ö" have an accent or not? In German, yes: it's an O with an umlaut. In English, yes: it's an O with some funny dots on it (heavy metal umlauts?). In the "New Yorker" dialect of English, it's an O with a dieresis. But in Hungarian, Finnish, Turkish, and many others, it's not: it's the letter between O and P, or between O and Ő, or after Z, or...

If you do want to do this, you should know that it only makes sense in your own locale, and you shouldn't be surprised that the methods are somewhat ad-hoc (I'm not saying you shouldn't do this: I've done it myself).

In German, history of the letter and rules even dictate that ö should be written as oe in such cases (that's what it evolved from and that's what the two dots are; e.g. it's not a diaeresis in German, despite looking the same).
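In code, that kind of locale-specific folding ends up as an explicit table rather than a normalization call. A minimal Python sketch, where the table is hypothetical, deliberately tiny, and only makes sense for German:

```python
import unicodedata

# Hypothetical German-only table; other locales would need different rules.
GERMAN_FOLD = str.maketrans({
    "ä": "ae", "ö": "oe", "ü": "ue",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
    "ß": "ss",
})

def fold_german(text):
    """Transliterate German umlauts and sharp s to their ASCII spellings."""
    # Compose first so that 'o' + U+0308 is seen as the single letter 'ö'.
    return unicodedata.normalize("NFC", text).translate(GERMAN_FOLD)

print(fold_german("Köln"), fold_german("Ko\u0308ln"), fold_german("Straße"))
```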

Some of what you find googling is just wrong. Dealing with global characters is confusing, people get it wrong a lot, and suggest wrong answers.

But it's true, as far as I know, that there's no unicode standard way to 'strip accents', which is unfortunate because we sometimes do need to do it. Even if 'strip accents' is locale dependent, and may have no sensible answer in some locales, I think there are sensible ways to do it in some locales (certainly in English, for Latin characters at least), and I wish there were a recognized best practice standard for doing it that could be implemented identically in various languages (maybe there is and I don't know it?).

There are unicode standard ways to compare/sort strings ignoring accents, in at least some locales, which might get you there if you reverse engineered them and took them further.

At any rate, at the end of the day, you can't simply talk about 'unicode normalization' without talking about the four different unicode normalization forms (canonical and compatibility; decomposed and composed) -- if you do, you are definitely getting something wrong.

And also, unicode normalization forms are definitely _not_ intended to 'strip accents', that is not what they are for, they aren't the solution to that, even if the compatibility normalizations do it in some cases.

“Also, I strongly recommend you install the Symbola font, which contains basic glyphs for a vast number of characters. They may not be pretty, but they’re better than seeing the infamous Unicode lego.”

I disagree with the notion of Symbola not being a pretty font. As I mentioned here¹, the glyphs Symbola has for the Mathematical Alphanumeric Symbols block are quite beautiful². (It may help that I’m using the non-hinted version on a HiDPI display though… still that implies it will look even better when printed on paper with an inkjet or laser printer since they still produce more DPI than the typical HiDPI monitor).


¹ — https://news.ycombinator.com/item?id=10198620

² — http://f.cl.ly/items/2h2p0r1F1h2E1y2o2y0c/Screen%20Shot%2020...

It's a good article. There is a direct analogy with the article asking if HTML is a semantic markup language or a binary graphics art format, and the groups not overlapping very much other than in failure while mostly not being very interested in each other.

That sounds like an article I'd like to read - can anyone provide a link?

> I strongly recommend you install the Symbola font, which contains basic glyphs for a vast number of characters. They may not be pretty, but they’re better than seeing the infamous Unicode lego.

Well, I installed the Symbola font as he suggested but I'm still seeing lots of Unicode lego in the article.

I'm using Windows 7 and the latest version of Firefox, and I set the Symbola as the default font in Firefox and unchecked the box that says, "Allow pages to choose their own fonts, instead of my selections above".

What could I be doing wrong? I would assume that if the author recommends Symbola font, he's checked that Symbola has representations for all the symbols he's using.

Symbola certainly does support most of the characters used in that article (not all though). Specifically, it doesn’t have glyphs for the CJK characters or the regional indicator symbols used on the page, so you’ll need another font for those. (OS X v10.7+ includes the 'Apple Color Emoji' font which has glyphs for the regional indicator symbols, while on Windows 7, 'Segoe UI Symbol' includes glyphs for them (you need to have installed the update¹ for the font first though)).

Anyways, firstly, there’s no need to uncheck 'Allow pages to choose their own fonts, instead of my selections above'. You can go ahead and let the page specify whatever fonts it wants. If you don’t have the font(s) specified in the webpage’s stylesheet, or the webpage contains a character that the current font doesn’t have a glyph for, your OS/browser will substitute another font on your system that does have a glyph for that character (if one exists), so you can recheck that option. You don’t even have to explicitly pick Symbola for any type of text in Firefox, since your OS should use font substitution automatically whenever the fonts chosen there lack a glyph for a character on the page you’re on. Font substitution is absolutely necessary anyway: it’s impossible for any one font to contain all of Unicode right now, since even OpenType fonts can only contain a maximum of 65,536 glyphs, while Unicode has more than 120,000 assigned codepoints (so you can change the fonts in Firefox back to the defaults if you like).

Secondly, after you install the font, you may have to restart the computer or close Firefox before it actually picks up on the new font.

Thirdly, the version of Symbola that was linked to in the article is an old one. I’d recommend this² one instead (covers more codepoints).


¹ — https://support.microsoft.com/en-us/kb/2729094

² — https://web.archive.org/web/20150625020428/http://users.teil...

I predict someone will complain, as usual, that Unicode could and should be regular and programmer-friendly and everything.

My response would be this: http://xkcd.com/1576/

Unicode is merely as complex as that which it encodes: human language.

This is why you should provide a locale for most string functions in C#, for example. I think this article is mostly about bad Unicode support in programming languages.
