Why to normalize Unicode strings (withblue.ink)
143 points by bibyte 9 days ago | 95 comments





The official unicode documentation on normalization is good reading, and quite readable. It's actually an even more complicated topic than OP reveals, but the Unicode Standard Annex #15 explains it well.

http://unicode.org/reports/tr15/

OP has a significant error:

> You can choose whatever form you’d like, as long as you’re consistent, so the same input always leads to the same result.

Not so much! Do _not_ use the "Compatibility" (rather than "Canonical") normalization forms unless you know what you are doing! UAX15 will explain why, but they are "lossy". In general, NFC is the one to use as a default.
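
To make the canonical-vs-compatibility distinction concrete, here's a minimal Python sketch (standard unicodedata module; the sample string is just illustrative): NFC preserves what the text looks like, while NFKC folds "compatibility" characters into plain equivalents, and the original can't be recovered afterwards.

    import unicodedata

    s = "x\u00b2 \uff0b \u00bd"   # superscript two, fullwidth plus sign, vulgar fraction one half
    print(unicodedata.normalize("NFC", s))    # 'x² ＋ ½'  -- unchanged, canonical form is lossless here
    print(unicodedata.normalize("NFKC", s))   # 'x2 + 1⁄2' -- compatibility characters folded, lossy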


Ligatures, which are mentioned in the article, are a good example of the distinction between canonical and compatibility forms.

If your input contains "traffic", you don't necessarily want NFC to insert the "ffi" ligature and turn it into "tra\ufb03c". The ligature is generally a poor choice in a monospaced font, for example.

Similarly, if you've gone to the trouble to insert the ligature, you don't necessarily want NFD to strip it out. However, with NFKC or NFKD, a search for "traffic" will find the string.
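
For example (a small Python sketch using the standard unicodedata module; U+FB03 is the ﬃ ligature): the canonical forms leave an existing ligature alone, while the compatibility forms expand it so a plain "traffic" search can match.

    import unicodedata

    word = "tra\ufb03c"   # "traffic" written with the U+FB03 'ffi' ligature
    print(unicodedata.normalize("NFC", word) == "traffic")    # False: NFC keeps the ligature
    print(unicodedata.normalize("NFKC", word) == "traffic")   # True: NFKC expands ffi into f, f, i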


Ligatures are not a good example, actually. They have been obsoleted by Unicode, i.e. there's no way to turn ffi⟨normal⟩ into ffi⟨ligature⟩. (But NFKD(ffi⟨ligature⟩)≡ffi⟨normal⟩, of course.)

You're right that there's no way to apply any specified unicode normalization to turn `ffi`(normal) into `ffi`(ligature).

But they're a good example of the difference between "canonical" and "compatible" normalization anyway, in the other direction. Which is really the only direction that matters to illustrate the difference anyway.

NFKD (or NFKC too I think?) will, as you say, turn the ligatures into their "individual" forms. It's different glyphs, and there is at that point no way to know which it was "originally", and no way to convert it in the other direction. The "compatibility" normalizations are "lossy".

The "canonical" normalizations on the other hand are basically 'lossless' with regard to "glpyhs". Unless you actually _cared_ that it was represneted with a combining diacritic before, which in 99.9% of cases you don't. You should have exactly the same symbols on the screen after a 'canonical' normalization. And for any x, NFC(x) == NFC(NFD(NFC(x)).

The compatibility normalizations are super useful, because as the person up there mentioned, you often want a search query for `ffi` to match on `ﬃ` (and vice versa). But they are intended to lose symbolic representation (ﬃ and ffi are now the same thing with no way to distinguish), where the canonical normalizations are not.

I wouldn't say the `ffi` ligature is "obsoleted" in any way. People still use it all the time. We both just included it in our comments, unicode was happy to support that. Unicode is so happy for you to use it, that it provides compatibility normalization to make it _easier_ to recognize that it means the same thing as "ffi". :)


> In general, NFC is the one to use as a default.

Why nfc instead of nfd?


Not really sure why NFC has become the general standard.

UAX#15 notes that:

> The W3C Character Model for the World Wide Web 1.0: Normalization [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition) recommend using Normalization Form C for all content, because this form avoids potential interoperability problems arising from the use of canonically equivalent, yet different, character sequences in document formats on the Web. See the W3C Character Model for the World Wide Web: String Matching and Searching [CharMatch] for more background.

One of the W3C documents cited says:

> NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding), as well as data created by current software or entered by users on most (but not all) keyboards, is already in this form. NFC also has a slight compactness advantage and is a better match to user expectations in most languages with respect to the relationship between characters and graphemes.

https://www.w3.org/TR/charmod-norm/


To conform with W3C recommendations, for one thing:

"Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content."[1]

It's a SHOULD rather than a MUST. Non-NFC content is not a spec violation, but using NFC is strongly advised.

[1] https://www.w3.org/TR/charmod-norm/#normalizationChoice


Saves space, is just what people expect, easier to work fast and loose with when you want to do that.

But NFC needs about a third more time than NFD: normalization is three steps (decompose, reorder, compose), and NFD only performs the first two. Apple chose the long form, NFD, in its old filesystem (HFS+). The size rarely matters that much.

> Thankfully, there’s an easy solution, which is normalizing the string into the “canonical form”.

Cool, problem solved!

> There are four standard normalization forms:

(╯°□°)╯︵ ┻━┻


I do understand the need for the difference between NFC and NFKC, but in hindsight NFD and NFKD seem entirely unnecessary.

NFD is useful if you want to do a diacritic-insensitive search.

The problem is that diacritic-insensitive search is locale-dependent, so it doesn't do the right thing anyway.

Specific example: in English, you'd want a search for ‘a’ to find ‘ä’ while this is the entirely wrong thing in Swedish where a and ä are distinct letters.

An English speaker probably wouldn't want a search for ‘i’ to match ‘j’ even though the latter just has an extra hook on the bottom.
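
For what it's worth, here's the usual NFD-based trick in Python (unicodedata from the standard library): decompose, then drop combining marks. As you say, it's locale-blind, so it treats Swedish 'ä' the same as English 'a', which may or may not be what you want.

    import unicodedata

    def strip_marks(s: str) -> str:
        # Decompose, then remove combining marks (category Mn) -- deliberately locale-unaware
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(c for c in decomposed if unicodedata.category(c) != "Mn")

    print(strip_marks("Zoë"))    # 'Zoe'
    print(strip_marks("ärlig"))  # 'arlig' -- wrong for Swedish, where ä is a distinct letter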


That's a very good point. There's still a use for locale-insensitive diacritic-insensitive searches, but you're absolutely right that in most cases you'd want it to be locale-aware and therefore NFD isn't sufficient (though it may still be easier to do this on NFD than NFC).

I actually laughed out loud when I got to that sentence.

Glad that I still work in plain ASCII.


> Why use both [UTF-8 and UTF-16]? Western languages typically are most efficiently encoded with UTF-8 (since most characters would be represented with 1 byte only), while Asian languages can usually produce smaller files when using UTF-16 as encoding.

The second sentence is technically correct, but it's a strange followup here because it's not why UTF-8 and UTF-16 exist today. I don't know any Asian webpages that use UTF-16 to save bandwidth, e.g., Japanese Wikipedia is still UTF-8.

The major use of UTF-16 in 2019, AFAICT, is for legacy operating system interfaces.
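
A rough Python check of the size claim (the sample strings are just illustrative): BMP kana and kanji take 3 bytes each in UTF-8 but 2 in UTF-16, while ASCII-heavy text is smaller in UTF-8.

    ja = "こんにちは世界"   # 7 Japanese characters
    en = "hello, world"

    print(len(ja.encode("utf-8")), len(ja.encode("utf-16-le")))   # 21 vs 14
    print(len(en.encode("utf-8")), len(en.encode("utf-16-le")))   # 12 vs 24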


Well, the second sentence often isn't correct.

It'd probably be true for pure text files IIRC, but the markup and JavaScript of web pages also take up space, and /those/ are Western elements. On actual web pages, UTF-8 is usually the smallest choice.


Also "legacy" language runtimes. "Legacy" being in scare quotes because JavaScript, the JVM, and the CLR all work this way and are all very much in widespread use today.

JDK 9 introduced “compact strings” (https://bugs.openjdk.java.net/browse/JDK-8054307). That stores a string’s characters in a byte array, with either the traditional two bytes per ‘char’ (encoding the entire string as UTF16) or, if possible, one (encoding the entire string as ISO-8859-1/Latin-1). They probably didn’t use UTF-8 because it would break the fact that string indexing is O(1).

That's just internal representation, though. Semantically, strings are still sequences of chars, and char is still 16-bit, so the API is still UTF-16.

Yes, indeed. I didn't mean "legacy" to imply that they aren't still useful. Simply that many of them predate the creation of codepoints which don't fit in 16 bits -- and it's harder to upgrade a binary API distributed in a client OS, compared to changing a web server's preferred encoding.

JS/JVM/CLR all work fine, but I imagine if they were created today, their strings would not be based on UTF-16.


Utf8String is in the works for C#/.NET. It would still be pretty much bolted on though.

I imagine compression negates most of the overhead from encoding Asian web pages in UTF-8. Plus, considering how image/markup/script heavy modern web pages are, the actual human-readable text is a tiny sliver of most sites' bandwidth costs.

Note that Apple's APFS doesn't normalize Unicode filenames:

https://news.ycombinator.com/item?id=13953800

From what I understand, it stores them as-is but can read any (so is normalization insensitive):

https://medium.com/@yorkxin/apfs-docker-unicode-6e9893c9385d

https://developer.apple.com/library/archive/documentation/Fi...

This hit me a couple of years ago when I was working on a scraper and storing the title of the page as the filename. It looked fine, but would fail a Javascript string comparison. I can't remember if I was using HFS+ though, which I believe saved filenames as NFD:

https://en.wikipedia.org/wiki/HFS_Plus#Criticisms

The same script might work today on APFS.
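
If anyone hits the same thing, a hedged sketch of the workaround in Python (the title is made up): compare filenames only after normalizing both sides to the same form, since the filesystem may hand back NFD even if you wrote NFC.

    import unicodedata

    title = "Zoë's page"                                       # hypothetical scraped title, NFC from the browser
    # filenames = os.listdir("scraped/")                       # on HFS+ these may come back in NFD
    filename_from_fs = unicodedata.normalize("NFD", title)     # simulate what HFS+ stores

    print(title == filename_from_fs)                           # False
    print(unicodedata.normalize("NFC", title) ==
          unicodedata.normalize("NFC", filename_from_fs))      # True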


HFS+ uses Form D.

I had to remember to create my ZFS volumes with Form D enabled, as it isn't an attribute that can be changed afterwards.

IIRC, ZFS on Mac OS X would set that by default so if you created the volumes from a Mac, then ok. But I was creating my ZFS array on a Linux or OpenSolaris server, where I would need to set Form D Normalization explicitly.


By the way, the last letter of Zoë is an e with a diaeresis, not an umlaut. Like the second o in coöperate — it's just an ordinary o with a marker to tell you to pronounce it rather than form a diphthong.

Just tried this in Perl6; looks like string comparisons Do The Right Thing™.

    > "\x65\x301".contains("\xe9")
    True

  > "\c[latin small letter e]\c[combining acute accent]" eq "\c[latin small letter e with acute]"
  True
Edit: And of course

  > "\c[dog face]".chars
  1
and not 2 as in the article.

PS: WTF? HN strips emojis :/ (and does it incorrectly when they are emoji sequences).


Swift is another major language that has correctly solved this problem in this way - i.e. not representing/operating on strings as though they were naive arrays of bytes or code points - but rather as though they were arrays of characters, which Unicode thoroughly and intuitively defines in the same way that humans think about characters.

Swift is headed in the right direction.

But I think Perl6 is the only language that can do this magic:

  > 'Déjà vu' ~~ /:ignorecase:ignoremark deja \s vu/
  「Déjà vu」

> characters, which Unicode thoroughly and intuitively defines in the same way that humans think about characters

"Character" is a somewhat vague term, and Unicode prefers to use more specific terms like "code unit", "code point", "abstract character", etc.

In this case I think you may be referring to grapheme clusters, which come closer to how "humans think about characters" than Unicode abstract characters, which are building blocks of the technical encoding standard but in some cases don't really match a human concept of a graphical element of a writing system.

See also “Characters” and Grapheme Clusters in section 2.11 of https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf, for example.


Oops - you're right. I'm using the term "character" for both the intuitive and documented definition, but the documented term I'm referring to is actually grapheme cluster.

Normalization covers more than just combining characters. CJK full width digits and punctuation can be problematic when you want the canonical forms for pattern matching:

  import unicodedata

  s = '１　２'

  print(unicodedata.normalize('NFD', s))
  print(unicodedata.normalize('NFC', s))
  print(unicodedata.normalize('NFKC', s))

  １　２
  １　２
  1 2

It's called collation

  > $*COLLATION.set(:!secondary:!tertiary:!quaternary)
  collation-level => 1, Country => International, Language => None, primary => 1, secondary => 0, tertiary => 0, quaternary => 0

  > '１２' coll '12'
  Same

Perl 6 normalizes to NFC by default for everything except filenames: https://docs.perl6.org/language/unicode

As to lelf's 1-character emoji, str.chars returns the number of characters in the string-- it would only return 2 if it returned the number of code units instead (which, the documentation notes, is what currently happens on the JVM).


Perl 6 actually normalizes to NFG (Normalization Form Grapheme) https://docs.perl6.org/language/glossary#NFG

And on the C level you need something like safelibc's wcsnorm_s and then wcsfc_s for case-insensitive search. Or libunicode or ICU, but they are too slow and big and failed to be useful in the GNU coreutils.

That's for strings. For identifiers (names, filenames, ...) there are a lot more rules to consider, and almost nobody supports unicode identifiers safely.

There's also still no support for foreign strings on the most basic utilities, like expand, wc, cut, head/tail, tr, fold/fmt, od or sed, awk, grep, go, silversearch, go platinum searcher, rust ripgrep, ... => http://perl11.org/blog/foldcase.html I do maintain the multibyte patches for coreutils and fixed it for my projects at least.


And Ruby does what I expect:

    > "\x65\x301".include?("\xe9")
    => false

I still don't understand why Unicode allows two different ways to represent the same thing.

Naively, that appears as a major defect in Unicode.

Perhaps someone reading this knows why this was the right thing to do?


Unicode was created to unify a number of different pre-existing character sets so they all could be mapped directly to Unicode code points. Some of these character sets had precomposed characters, e.g. a single code point to represent 'ö'. Others used combining characters. Unicode therefore had to support both.

The Unicode space is massive. It's odd they didn't just spend a bunch of code points on those existing (predominantly Latin) characters.

Unless you are supposed to be able to put an umlaut on any character. CJK characters with umlauts.


Yes, you’re supposed to: &̈ *̈ ~̈ ⺃ ⺃̈ ⻑ ⻑̈

It wasn't massive back in UCS2 days...

In particular, Unicode codepoints 128-255 were explicitly defined to be the same as ISO-8859-1, which has a bunch of precomposed characters in it.

Yes, I am likewise confused.

Text is one of those enormously complicated human conventions, so it's probably just ignorance speaking. But I would like to understand.


As other people have mentioned, it's for historical reasons. If Unicode was started from a clean slate, without the need to be compatible with existing encodings there wouldn't be any precomposed forms.

...in web apps (i.e. during presentation). Don’t do it at the storage layer:

https://github.com/git/git/commit/76759c7dff53e8c84e975b88cb...

Edit: see comments below. My generalization is over broad. Maybe a fairer statement is that some forms of normalization lead to aliasing and sometimes you want that but sometimes not. So be aware of whether you want different strings to be treated the “same” or not.

My thought was that you can always test for sameness after the fact, but once you’ve normalized into storage, you can’t undo it.


Eh, in general I think it makes a lot of sense to do it at the storage layer.

That particular commit mentions _filenames_. I agree you should _never_ touch the bytes that are meant to be a filepath. file systems still do idiosyncratic things with non-ascii file paths, and most of us aren't filesystem experts. Leave the bytes of a filepath alone.

Since git is all about filepaths, it makes sense that git would want to generally avoid this.

But in general, "during presentation" is not enough to deal with the sorts of problems the OP talks about. If you're comparing strings somewhere, it's probably before "presentation".

In general, I think it's quite reasonable to normalize your input, on the way in, to NFC. I think it's reasonable enough in most cases that normalizing input to NFC on the way in is a reasonable "default" to get started with, unless you know a reason you shouldn't.

(For searching, you MIGHT want to normalize to NFKC, but that is "lossy" so I would never do that as a rule. I'd normally do it in some other field, and keep the original lossless copy too).
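
A minimal sketch of that split in Python (the function and field names are invented): keep the lossless NFC text as the record of truth, and derive a separate lossy key only for searching.

    import unicodedata

    def ingest(text: str) -> dict:
        nfc = unicodedata.normalize("NFC", text)                     # lossless canonical form, safe to store
        search_key = unicodedata.normalize("NFKC", nfc).casefold()   # lossy, for matching only
        return {"text": nfc, "search_key": search_key}

    print(ingest("Tra\ufb03c Report"))
    # {'text': 'Traﬃc Report', 'search_key': 'traffic report'}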


Even then, it's not just filepaths. Internal representations can range from "bytes that are sometimes string-ish" to "bytes which are valid strings", while in no way covering the domain where the string-like properties that normalization brings are desirable.

I doubt there are hard-and-fast rules where normalization is appropriate, but if you start applying it to all input, you're going to break things. Email addresses immediately spring to mind, for example.


Can you even have unicode/non-ascii in email addresses?

If you can, and your email address is josé@gmail.com, but whether it arrives to you depends on whether the address was entered as unicode codepoint 'LATIN SMALL LETTER E WITH ACUTE' or 'LATIN SMALL LETTER E'+'COMBINING ACUTE ACCENT'... you're going to have a lot of trouble getting email. Because people entering your address have no idea which they are entering. I know I can type an "é" on MacOS US keyboard layout by typing `option-e,e`, but I have _no idea_ what actual bytes (corresponding to which of these codepoints) are sent to HN when I enter that in an HN comment text box. And most people sending email, of course, don't even know any of this is a thing.

If it works at all, it's actually gonna be because the software on either or both ends (sending and receiving) is already doing NFC normalization!

So, yes, this illustrates that normalization _is a thing_ you have to pay attention to, and can get weird.

But in general, for "text" input (rather than addresses or filepaths), normalizing to NFC is _usually_ a good call. If it's addresses or filepaths or what have you... it's gonna get confusing no matter what.

If you know what you're doing, maybe NFC isn't right. If you don't know what you're doing, NFC is a lot better start-out-with call than not normalizing at all. Not normalizing at all will bite you more often than NFC will.
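
To make the failure mode concrete, a small Python sketch (the address is made up): the two spellings of josé look identical but compare unequal unless something normalizes them.

    import unicodedata

    typed   = "jos\u00e9@example.com"    # é as LATIN SMALL LETTER E WITH ACUTE
    on_file = "jose\u0301@example.com"   # e + COMBINING ACUTE ACCENT

    print(typed == on_file)                              # False: mail could be misrouted
    print(unicodedata.normalize("NFC", typed) ==
          unicodedata.normalize("NFC", on_file))         # True after NFC on both sides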


> Can you even have unicode/non-ascii in email addresses?

Yes, per RFC 6531:

* https://en.wikipedia.org/wiki/Unicode_and_email

* https://en.wikipedia.org/wiki/International_email


Strangely, RFC 6531 gives the impression of being written by people who didn't realize unicode normalization was a concern. I'm reading it for the first time, but it basically just says "an SMTP server can advertise it supports UTF8, and then it should accept UTF8 in all the places that were ascii before." I mean, a little bit more complicated than that, but with regard to _mailboxes_ (the part before the `@`), not much.

It doesn't seem to say anything about normalization, or the fact that a mailbox `josé@` could be using 'LATIN SMALL LETTER E WITH ACUTE' or 'LATIN SMALL LETTER E'+'COMBINING ACUTE ACCENT'.

I suspect any SMTP servers supporting UTF8 are actually doing NFC normalization to match mailboxes to requests.


Servers are free to do any transformation on the account names they accept, and shouldn't apply any on the ones they relay.

It's expected that they would do many kinds of transformations, like case normalization. Unicode normalization should be one of those but none of the email RFCs intend to enumerate them all.


Is running normalization algorithms against UCS-2 -- i.e., where it's possible to have invalid surrogate pairs -- well-defined?

I'm not certain, I'm not familiar with that situation. (I think "UCS-2" is basically "legacy" and not actually a unicode standard at this point? I'm not familiar with it).

If the issue is invalid bytes for a given unicode encoding (I am familiar with those!), I'm not certain what the official unicode normalization algorithms would do to them. They might just leave invalid bytes alone. The discussion of normalization seems to be in terms of codepoints though; a bytestream with invalid bytes (that can't be turned into unicode codepoints)... I would guess most actually existing normalization implementations are gonna complain about that. One could certainly write an algorithm that would just leave those bytes alone, I'm not sure if it would be considered compliant with the standard or not, just can't say.

Why does this come up?


This comes up because many implementations of "UTF-16" are in fact naive, and just operate within the mentality of UCS-2, where everything is just a 16-bit number.

This is relevant within this discussion because we were talking about filesystems, and Windows stores file names as an arbitrary array of 16-bit numbers, but UTF-16 has additional requirements that Windows file names do not satisfy.

When I say UCS-2, I just mean pretending that UTF-16 is short for `[]uint16`, and not worrying about surrogate pairs.


I think we can all agree that you should never be applying any normalization to bytes representing a file system path, unless you are doing it according to specifications for that particular file system+OS combo.

file system paths are tricky with non-ascii, that is for sure! Especially if you are trying to be agnostic to OS/filesystem. They are an edge case.


UCS-2 refers to the UTF-16 encoding of the first plane, defined when there was only one plane's worth of assignable space. UTF-16 extends UCS-2, and defines the reserved surrogates, thereby introducing the possibility of invalid surrogate usage. The parent post is referring to that (so it would be more accurate for the parent post to have said "UTF-16").

OK, thanks.

You can certainly have invalid bytes for UTF-8 too. I think in any unicode encoding, not every possible byte sequence is a valid representation of unicode codepoints. It doesn't require UCS-2/UTF-16.

Here's an invalid byte sequence in UTF-8: "\xc3\x28"
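
In Python terms (just illustrating the same byte sequence): decoding it fails, so those bytes never become code points that a normalizer could even see.

    data = b"\xc3\x28"   # 0xC3 starts a 2-byte sequence, but 0x28 is not a valid continuation byte
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)   # 'utf-8' codec can't decode byte 0xc3 in position 0: invalid continuation byte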


True, though the GP's example was somewhat unique in that UTF-16 created the possibility of invalid combinations at the code point level, rather than only the byte level.

> I'm not certain, I'm not familiar with that situation.

See RFC 6531.


Normalization has nothing to say about what happens if you have a string containing invalid surrogates - you have an invalid string at that point and normalization won't get you into or out of that situation - it only affects how valid characters are composed of code points.

> That particular commit mentions _filenames_.

This basically just comes down to the (sometime unpleasant) reality that on most Unix systems, file paths are not text; they are byte strings that don't have 00 or 2F bytes.


Sorry if this sounds a little clueless, but it sounds like the problem is because there's multiple Unicode standards (UTF-8, UTF-16, UTF-32). So it seems like if you just re-encode everything into one of these before committing to the storage layer, that you'd avoid this problem altogether, and you'd be able to do operations in the storage layer correctly too.

A Unicode string is an array of Unicode's building blocks: Code points. Normalization/composition refers to how we use code points to represent characters. Encoding refers to how we use bytes to represent code points. The two concepts exist at two independent levels of abstraction.
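
A small Python illustration of those two layers (the strings are arbitrary): both normalization forms of 'é' encode cleanly to UTF-8, so picking one encoding doesn't collapse them into one representation — only normalization does that, at the code point level.

    import unicodedata

    composed   = "\u00e9"    # é as one code point
    decomposed = "e\u0301"   # e + combining acute: two code points

    # Same encoding, still two different byte sequences
    print(composed.encode("utf-8"), decomposed.encode("utf-8"))   # b'\xc3\xa9'  b'e\xcc\x81'

    # Normalization is what unifies them, at the code point level
    print(unicodedata.normalize("NFC", decomposed) == composed)   # True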

Accurate but so arrogant. HN purity achieved.

Sorry, I didn't intend that. I honestly considered it a reasonable question.

No worries, your reply was a concise clarification of the issue and terminology, and a reasonable answer; I didn't perceive it as arrogant in the least.

OP here; I didn't perceive it as arrogant either. I think the "arrogant" accuser might have a point about some stuff I see on HN, but this reply to my post just isn't it.

Yeah, no. That is actually a different issue. Curious if you read the OP?

But, no, you do not avoid issues with unicode normalization by ensuring everything is in (eg) UTF-8. Unicode normalization is a thing within UTF-8.


Depends on the context such as filenames on Mac which you link. There are many cases where doing it at the storage layer makes sense. Are you indexing text for search? Normalize. Do you want to de-duplicate textual data and store it in a database? Normalize. Are you going to run some sort of NLP analysis? Normalize.

“The first of such conventions, or character encodings, was ASCII (American Standard Code for Information Interchange).” The author may know better and is glossing over history, but when I see statements like this that are obviously incorrect, I question everything else in the article.

The article is largely correct. I do a lot of search and NLP work.

There shouldn't even be any such thing as normalized strings, i.e. two different Unicode sequences that are supposed to be the same character.

Sounds great in theory...but you've just vastly extended the number of "characters". While text parsing is a nightmare, you also get the flexibility of combining different characters together, such as emojis and skin tone modifiers, which outputs a different display character.

On the other hand, I'm not sure it's possible to have a font that represents every valid combination of Unicode.


Unicode could simply not recognize combining characters where a special code point exists, and vice versa, on a case-by-case basis. For example, ä can remain its special code point, and a + combining diaeresis can just display as an 'a' with a floating diaeresis rather than being treated as equivalent. Then software that processes Unicode would become straightforward.

That ship has sailed for backwards compatibility reasons.

In that case you would lose information when transferring text from non-Unicode encoded files to Unicode encoded files. Not a great way for your standard to take off...

What is with the push to Unicode? Why not ASCII? It seems to give a lot less trouble, particularly wrt pangrams, normalization, etc.

ASCII is effectively just the keys on a US English keyboard. You couldn't even represent an accented 'e' or the price of British products (there's no £ sign). So even if you just wanted a character set defined as "very common characters in the English-speaking world", ASCII falls miles short of even that tiny goal.

Most of the internet is based on American English. See programming for an example. For the goal of representing American English, it works fine.

Why would anything more be necessary?


Unicode is not about the Internet. Just like ASCII or EBCDIC before it weren't about networking.

The reason we use Unicode is because ASCII is very limited in its scope. It can't handle the majority of the world's languages. It can't even handle American English. No dashes, no open or close quotes, etc.


Most programming is in American English too. And most people who use computers end up picking up a decent bit anyway, according to some friends from a Spanish-speaking country.

So ASCII is good enough for programming, but fails at general text, unless we resign ourselves to kludges like -- instead of – and --- instead of —.

And your friends from that Spanish-speaking country do UI work right? Can’t use ASCII for that. Ñ is out the window.


No, they do c.

If you're encoding data for a domain guaranteed to contain only US English words and US English symbols, and you're OK without the occasional non-ASCII bits in US English like diacritics in loan words, then sure. But such domains are vanishingly rare. As for your examples, "most of the internet" might still contain non-ASCII, I think, but even a little bit of the internet is still a gigantic and highly active part that's inseparable from the larger domain of "the internet". And, while the legal symbols in the syntax of a programming language might only be ASCII, programming languages must process arbitrary strings, which may be user-facing, or be in some other domain where ASCII is inadequate.

> Most of the internet is based on American English.

Emoji are rapidly becoming a staple of American English.

(Edit: and apparently HN filters emoji out of posts!)


I think it might have something to do with the ~7 billion people whose native language isn’t expressible in ascii. Just maybe?

I have really good news for you.

UTF-8 encoding is compatible with ASCII. As long as you just use ASCII characters, your strings are also valid UTF-8.
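
Concretely (a trivial Python check): pure-ASCII bytes decode unchanged as UTF-8, because UTF-8 was designed as a superset of ASCII.

    ascii_bytes = "plain ASCII text".encode("ascii")
    print(ascii_bytes.decode("utf-8"))                          # 'plain ASCII text' -- every ASCII byte is valid UTF-8
    print(ascii_bytes == "plain ASCII text".encode("utf-8"))    # True: identical byte sequences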


そうですね、僕も分からない(´-`)⁉️

Speaking of which, how does Han Unification affect Unicode normalization? If I understand correctly, you can compose strokes into characters? Does that work?

Google Translate: "Well, I do not know"

ASCII is pretty limited. For example your namesake's book Nationalökonomie could not properly by rendered in ASCII.

*be

I'm currently processing a huge archive of files where someone was too lazy to bother using the proper encoding. Out of the first 700K files I've run through, I've dumped 3000 into a secondary queue because they've been so bastardized the mojibake is completely indecipherable.

It continues to astonish me that programmers who spend hours arguing the relative merits of ECC RAM and database ACID implementations are so quick to destroy data by lazily defaulting to ASCII.



