
Why to normalize Unicode strings - bibyte
https://withblue.ink/2019/03/11/why-you-need-to-normalize-unicode-strings.html
======
jrochkind1
The official Unicode documentation on normalization is good reading, and quite
readable. It's actually an even more complicated topic than OP reveals, but
Unicode Standard Annex #15 explains it well.

[http://unicode.org/reports/tr15/](http://unicode.org/reports/tr15/)

OP has a significant error:

> You can choose whatever form you’d like, as long as you’re consistent, so
> the same input always leads to the same result.

Not so much! Do _not_ use the "Compatibility" (rather than "Canonical")
normalization forms unless you know what you are doing! UAX15 will explain
why, but they are "lossy". In general, NFC is the one to use as a default.
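
For anyone who wants to see the difference concretely, here's a minimal Python
sketch (using the standard library's unicodedata module) of why the canonical
forms are safe defaults and the compatibility forms are lossy:

    import unicodedata

    s = "ﬁ ² ½"  # LATIN SMALL LIGATURE FI, SUPERSCRIPT TWO, VULGAR FRACTION ONE HALF
    print(unicodedata.normalize("NFC", s))   # unchanged: canonical forms keep these distinctions
    print(unicodedata.normalize("NFKC", s))  # 'fi 2 1⁄2': compatibility mapping throws them away

    # Canonical normalization only unifies sequences defined to be equivalent:
    print(unicodedata.normalize("NFC", "e\u0301") == "\u00e9")  # True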

~~~
zokier
> In general, NFC is the one to use as a default.

Why NFC instead of NFD?

~~~
minitech
It saves space, it's what people expect, and it's easier to play fast and loose
with when you want to do that.

~~~
rurban
But it needs about a third more time than NFD. Apple chose the long form, NFD, for
its old filesystem. NFC needs three steps, NFD only the first two. The size rarely
matters that much.
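
If you want to check that claim on your own data, a rough timeit sketch (the
sample text here is only a placeholder):

    import timeit
    import unicodedata

    # Placeholder sample text; substitute your own corpus.
    text = "Zoe\u0308 ate at the cafe\u0301. " * 1000

    for form in ("NFC", "NFD"):
        t = timeit.timeit(lambda f=form: unicodedata.normalize(f, text), number=1000)
        print(form, f"{t:.3f}s")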

------
doodpants
> Thankfully, there’s an easy solution, which is normalizing the string into
> the “canonical form”.

Cool, problem solved!

> There are four standard normalization forms:

(╯°□°）╯︵ ┻━┻

~~~
Zarel
I do understand the need for the difference between NFC and NFKC, but in
hindsight NFD and NFKD seem entirely unnecessary.

~~~
eridius
NFD is useful if you want to do a diacritic-insensitive search.
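
A minimal sketch of that technique in Python: decompose with NFD, then drop the
combining marks (category Mn) before comparing. The helper name is arbitrary,
and the locale caveat raised in the replies below still applies.

    import unicodedata

    def fold_marks(s: str) -> str:
        # Decompose, then remove combining marks so 'ë' compares equal to 'e'.
        nfd = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in nfd if unicodedata.category(ch) != "Mn")

    print(fold_marks("Zoë") == fold_marks("Zoe"))  # True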

~~~
lokedhs
The problem is that diacritic-insensitive search is locale-dependent, so it
doesn't do the right thing anyway.

Specific example: in English, you'd want a search for ‘a’ to find ‘ä’ while
this is the entirely wrong thing in Swedish where a and ä are distinct
letters.

An English speaker probably wouldn't want a search for ‘i’ to match ‘j’ even
though the latter just has an extra hook on the bottom.

~~~
eridius
That's a very good point. There's still a use for locale-insensitive
diacritic-insensitive searches, but you're absolutely right that in most cases
you'd want it to be locale-aware and therefore NFD isn't sufficient (though it
may still be easier to do this on NFD than NFC).

------
ken
> _Why use both [UTF-8 and UTF-16]? Western languages typically are most
> efficiently encoded with UTF-8 (since most characters would be represented
> with 1 byte only), while Asian languages can usually produce smaller files
> when using UTF-16 as encoding._

The second sentence is technically correct, but it's a strange followup here
because it's not _why_ UTF-8 and UTF-16 exist today. I don't know of any Asian
webpages that use UTF-16 to save bandwidth; e.g., Japanese Wikipedia is still
UTF-8.

The major use of UTF-16 in 2019, AFAICT, is for legacy operating system
interfaces.

~~~
ameliaquining
Also "legacy" language runtimes. "Legacy" being in scare quotes because
JavaScript, the JVM, and the CLR all work this way and are all very much in
widespread use today.

~~~
Someone
JDK 9 introduced “compact strings”
([https://bugs.openjdk.java.net/browse/JDK-8054307](https://bugs.openjdk.java.net/browse/JDK-8054307)).
That stores a string’s characters in a byte array, with either the traditional
two bytes per ‘char’ (encoding the entire string as UTF-16) or, if possible,
one (encoding the entire string as ISO-8859-1/Latin-1). They probably didn’t
use UTF-8 because it would break the fact that string indexing is O(1).
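
The indexing point is easy to see if you sketch what finding the i-th code point
would require over UTF-8 bytes (illustrative Python, not the JDK's actual code):

    def utf8_byte_offset(data: bytes, index: int) -> int:
        """Byte offset of the index-th code point in UTF-8 data: an O(n) scan."""
        count = 0
        for pos, b in enumerate(data):
            if b & 0xC0 != 0x80:        # not a continuation byte, so a code point starts here
                if count == index:
                    return pos
                count += 1
        raise IndexError(index)

    print(utf8_byte_offset("héllo".encode("utf-8"), 2))  # 3, because 'é' took two bytes

With a fixed-width representation (Latin-1, or UTF-16 as seen through Java's
char-based API), the offset is just the index times the width.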

~~~
int_19h
That's just internal representation, though. Semantically, strings are still
sequences of chars, and char is still 16-bit, so the API is still UTF-16.

------
zackmorris
Note that Apple's APFS doesn't normalize Unicode filenames:

[https://news.ycombinator.com/item?id=13953800](https://news.ycombinator.com/item?id=13953800)

From what I understand, it stores them as-is but can read back any form (so it
is normalization-insensitive):

[https://medium.com/@yorkxin/apfs-docker-unicode-6e9893c9385d](https://medium.com/@yorkxin/apfs-docker-unicode-6e9893c9385d)

[https://developer.apple.com/library/archive/documentation/Fi...](https://developer.apple.com/library/archive/documentation/FileManagement/Conceptual/APFS_Guide/FAQ/FAQ.html)

This hit me a couple of years ago when I was working on a scraper and storing
the title of the page as the filename. It looked fine, but would fail a
JavaScript string comparison. I can't remember if I was using HFS+ though,
which I believe saved filenames as NFD:

[https://en.wikipedia.org/wiki/HFS_Plus#Criticisms](https://en.wikipedia.org/wiki/HFS_Plus#Criticisms)

The same script might work today on APFS.
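
A sketch of the kind of mismatch that causes, assuming (as with HFS+) that the
filesystem hands back a decomposed name:

    import unicodedata

    title = "Zo\u00eb"            # composed (NFC), as scraped from the page
    filename = "Zoe\u0308"        # decomposed (NFD), as a filesystem might return it

    print(title == filename)      # False: same text, different code points
    print(unicodedata.normalize("NFC", title) ==
          unicodedata.normalize("NFC", filename))   # True once both sides are normalized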

~~~
watersb
HFS+ uses Form D.

I had to remember to create my ZFS volumes with Form D enabled, as it isn't an
attribute that can be changed afterwards.

IIRC, ZFS on Mac OS X would set that by default so if you created the volumes
from a Mac, then ok. But I was creating my ZFS array on a Linux or OpenSolaris
server, where I would need to set Form D Normalization explicitly.

------
gumby
By the way, the last letter of Zoë is an e with a diaeresis, not an umlaut. Like
the second o in coöperate: it's just an ordinary o with a marker to tell you
to pronounce it rather than form a diphthong.

------
athenot
Just tried this in Perl 6; looks like string comparisons Do The Right Thing™.

    > "\x65\x301".contains("\xe9")
    True

~~~
lelf

      > "\c[latin small letter e]\c[combining acute accent]" eq "\c[latin small letter e with acute]"
      True
    

Edit: And of course

    > "\c[dog face]".chars
    1

and not 2 as in the article.

PS: WTF? HN strips emojis :/ (and does it incorrectly when they are emoji
sequences).

~~~
happytoexplain
Swift is another major language that has correctly solved this problem in this
way - i.e. not representing/operating on strings as though they were naive
arrays of bytes or code points - but rather as though they were arrays of
_characters_, which Unicode thoroughly and intuitively defines in the same
way that humans think about characters.
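
For contrast, a small Python sketch of the difference: Python's len() counts
code points, whereas Swift's String counts grapheme clusters.

    import unicodedata

    s = "Zoe\u0308"   # 'Z', 'o', 'e', COMBINING DIAERESIS
    print(len(s))                                   # 4 code points
    print(len(unicodedata.normalize("NFC", s)))     # 3 after composition
    # Swift reports 3 for both spellings, because it counts grapheme clusters
    # ("characters" in Swift's API) rather than code points.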

~~~
jfk13
> _characters_, which Unicode thoroughly and intuitively defines in the same
> way that humans think about characters

"Character" is a somewhat vague term, and Unicode prefers to use more specific
terms like "code unit", "code point", "abstract character", etc.

In this case I think you may be referring to _grapheme clusters_, which come
closer to how "humans think about characters" than Unicode _abstract
characters_, which are building blocks of the technical encoding standard but
in some cases don't really match a human concept of a graphical element of a
writing system.

See also _“Characters” and Grapheme Clusters_ in section 2.11 of
[https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf](https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf),
for example.

~~~
happytoexplain
Oops - you're right. I'm using the term "character" for both the intuitive and
documented definition, but the documented term I'm referring to is actually
grapheme cluster.

------
gwbas1c
I still don't understand why Unicode allows two different ways to represent
the same thing.

Naively, that appears to be a major defect in Unicode.

Perhaps someone reading this knows why this was the right thing to do?

~~~
goto11
Unicode was created to unify a number of different pre-existing character sets
so they all could be mapped directly to Unicode code points. Some of these
character sets had precomposed characters, e.g. a single code point to
represent 'ö'. Others used combining characters. Unicode therefore had to
support both.

~~~
paulddraper
The Unicode space is massive. It's odd they didn't just spend a bunch of code
points on those existing (predominantly Latin) characters.

Unless you are supposed to be able to put an umlaut on any character. CJK
characters with umlauts.

~~~
Someone
Yes, you’re supposed to: &̈ *̈ ~̈ ⺃ ⺃̈ ⻑ ⻑̈
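
Combining marks really do attach to any base character; NFC only re-composes
pairs that have a precomposed code point assigned. A quick Python check:

    import unicodedata

    print(len(unicodedata.normalize("NFC", "a\u0308")))  # 1: composes to U+00E4 'ä'
    print(len(unicodedata.normalize("NFC", "q\u0308")))  # 2: no precomposed 'q with diaeresis' exists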

------
js2
...in web apps (i.e. during presentation). Don’t do it at the storage layer:

[https://github.com/git/git/commit/76759c7dff53e8c84e975b88cb...](https://github.com/git/git/commit/76759c7dff53e8c84e975b88cb8245587c14c7ba)

Edit: see comments below. My generalization is overly broad. Maybe a fairer
statement is that some forms of normalization lead to aliasing; sometimes you
want that and sometimes not. So be aware of whether you want different strings
to be treated the "same" or not.

My thought was that you can always test for sameness after the fact, but once
you’ve normalized into storage, you can’t undo it.

~~~
jrochkind1
Eh, in general I think it makes a lot of sense to do it at the storage layer.

That particular commit mentions _filenames_. I agree you should _never_ touch
the bytes that are meant to be a filepath. File systems still do idiosyncratic
things with non-ASCII file paths, and most of us aren't filesystem experts.
Leave the bytes of a filepath alone.

Since git is all about filepaths, it makes sense that git would want to
generally avoid this.

But in general, "during presentation" is not enough to deal with the sorts of
problems the OP talks about. If you're comparing strings somewhere, it's
probably before "presentation".

In general, I think it's quite reasonable to normalize your input to NFC on the
way in; it's a sensible "default" to start with in most cases, unless you know a
reason you shouldn't.

(For searching, you MIGHT want to normalize to NFKC, but that is "lossy" so I
would never do that as a rule. I'd normally do it in some other field, and
keep the original lossless copy too).
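
A minimal sketch of that split in Python (the field names and the casefolding
are my additions for illustration): store the canonical NFC form as the source
of truth and derive a lossy NFKC key only for matching.

    import unicodedata

    def on_input(raw: str) -> dict:
        canonical = unicodedata.normalize("NFC", raw)
        return {
            "value": canonical,                                               # lossless source of truth
            "search_key": unicodedata.normalize("NFKC", canonical).casefold(),  # lossy, matching only
        }

    print(on_input("Ｚｏｅ\u0308 café"))  # fullwidth letters normalize away only in the search key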

~~~
magduf
Sorry if this sounds a little clueless, but it sounds like the problem is
because there are multiple Unicode standards (UTF-8, UTF-16, UTF-32). So it
seems like if you just re-encode everything into one of these before
committing to the storage layer, you'd avoid this problem altogether, and
you'd be able to do operations in the storage layer correctly too.

~~~
happytoexplain
A Unicode string is an array of Unicode's building blocks: Code points.
Normalization/composition refers to how we use code points to represent
characters. Encoding refers to how we use bytes to represent code points. The
two concepts exist at two independent levels of abstraction.
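
Concretely, a short Python sketch of the two levels: normalization changes
which code points you have; encoding changes which bytes represent those code
points.

    import unicodedata

    s = "é"                                     # one "character"

    # Normalization level: which code points represent it.
    nfc = unicodedata.normalize("NFC", s)       # 1 code point:  U+00E9
    nfd = unicodedata.normalize("NFD", s)       # 2 code points: U+0065 U+0301

    # Encoding level: which bytes represent those code points.
    print(nfc.encode("utf-8"))      # b'\xc3\xa9'
    print(nfc.encode("utf-16-le"))  # b'\xe9\x00'
    print(nfd.encode("utf-8"))      # b'e\xcc\x81'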

~~~
dummyfunnytoo
Accurate but so arrogant. HN purity achieved.

~~~
happytoexplain
Sorry, I didn't intend that. I honestly considered it a reasonable question.

~~~
FabHK
No worries, your reply was a concise clarification of the issue and
terminology, and a reasonable answer; I didn't perceive it as arrogant in the
least.

~~~
magduf
OP here; I didn't perceive it as arrogant either. I think the "arrogant"
accuser might have a point about some stuff I see on HN, but this reply to my
post just isn't it.

------
s1mon
“The first of such conventions, or character encodings, was ASCII (American
Standard Code for Information Interchange).” The author may know better and is
glossing over history, but when I see statements like this that are obviously
incorrect, I question everything else in the article.

~~~
rpedela
The article is largely correct. I do a lot of search and NLP work.

------
WalterBright
There shouldn't even be any such thing as normalized strings, i.e. two
different Unicode sequences that are supposed to be the same character.

~~~
timw4mail
Sounds great in theory...but you've just vastly extended the number of
"characters". While text parsing is a nightmare, you also get the flexibility
of combining different characters together, such as emojis and skin tone
modifiers, which outputs a different display character.

On the other hand, I'm not sure its possible to have a font that represents
every valid combination of unicode.

~~~
WalterBright
Unicode can simply not recognize combining characters where a special code
point exists, and vice versa, on a case by case basis. For example, the ä can
remain its special code point, and a¨ can display as a¨. Then software that
processes Unicode will become straightforward.

~~~
int_19h
That ship has sailed for backwards compatibility reasons.

------
mises
What is with the push to Unicode? Why not ASCII? It seems to give a lot less
trouble, particularly wrt pangrams, normalization, etc.

~~~
kalleboo
そうですね、僕も分からない（´-`）⁉️ ("Yeah, I don't understand it either.")

~~~
sevensor
Speaking of which, how does Han Unification affect Unicode normalization? If I
understand correctly, you can compose strokes into characters? Does that work?

