OP has a significant error:
> You can choose whatever form you’d like, as long as you’re consistent, so the same input always leads to the same result.
Not so much! Do _not_ use the "Compatibility" (rather than "Canonical") normalization forms unless you know what you are doing! UAX #15 explains why: they are "lossy". In general, NFC is the one to use as a default.
If your input contains "traffic", you don't necessarily want normalization to insert the "ffi" ligature and turn it into "tra\ufb03c" (happily, none of the forms compose ligatures). The ligature is generally a poor choice in a monospaced font, for example.
Similarly, if you've gone to the trouble to insert the ligature, you don't necessarily want NFKC or NFKD to strip it out. However, after NFKC or NFKD, a search for "traffic" will find the string.
But they're a good example of the difference between "canonical" and "compatibility" normalization anyway, in the other direction (decomposition rather than composition), which is really the only direction that matters for illustrating the difference.
NFKD (and NFKC as well) will, as you say, turn the ligature into its "individual" letters. They are different glyphs, and at that point there is no way to know which it was "originally", and no way to convert in the other direction. The "compatibility" normalizations are "lossy".
The "canonical" normalizations, on the other hand, are basically 'lossless' with regard to "glyphs", unless you actually _cared_ that it was represented with a combining diacritic before, which in 99.9% of cases you don't. You should have exactly the same symbols on the screen after a 'canonical' normalization. And for any x, NFC(x) == NFC(NFD(NFC(x))).
The compatibility normalizations are super useful, because as the person up there mentioned, you often want a search query for `ffi` to match on `ﬃ` (and vice versa). But they are intended to lose symbolic representation (ﬃ and ffi are now the same thing with no way to distinguish), where the canonical normalizations are not.
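A quick sketch of the difference, using Python's standard unicodedata module (just for illustration):

    import unicodedata

    lig = "tra\ufb03c"   # "traffic" written with the U+FB03 "ffi" ligature

    # Canonical forms leave the ligature alone (lossless for glyphs):
    unicodedata.normalize("NFC", lig) == lig             # True
    unicodedata.normalize("NFD", lig) == lig             # True

    # Compatibility forms fold it to plain f, f, i (lossy, one-way):
    unicodedata.normalize("NFKC", lig) == "traffic"      # True

    # And canonical normalization is stable: NFC(x) == NFC(NFD(NFC(x)))
    x = "re\u0301sume\u0301"   # "résumé" written with combining accents
    nfc = unicodedata.normalize("NFC", x)
    nfc == unicodedata.normalize("NFC", unicodedata.normalize("NFD", nfc))   # True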
I wouldn't say the `ﬃ` ligature is "obsoleted" in any way. People still use it all the time; we both just included it in our comments, and Unicode was happy to support that. Unicode is so happy for you to use it, that it provides compatibility normalization to make it _easier_ to recognize that it means the same thing as "ffi". :)
Why NFC instead of NFD?
UAX#15 notes that:
> The W3C Character Model for the World Wide Web 1.0: Normalization [CharNorm] and other W3C Specifications (such as XML 1.0 5th Edition) recommend using Normalization Form C for all content, because this form avoids potential interoperability problems arising from the use of canonically equivalent, yet different, character sequences in document formats on the Web. See the W3C Character Model for the World Wide Web: String Matching and Searching [CharMatch] for more background.
One of the W3C documents cited says:
> NFC has the advantage that almost all legacy data (if transcoded trivially, one-to-one, to a Unicode encoding), as well as data created by current software or entered by users on most (but not all) keyboards, is already in this form. NFC also has a slight compactness advantage and is a better match to user expectations in most languages with respect to the relationship between characters and graphemes.
"Content authors SHOULD use Unicode Normalization Form C (NFC) wherever possible for content."
It's a SHOULD rather than a MUST. Non-NFC content is not a spec violation, but using NFC is strongly advised.
Cool, problem solved!
> There are four standard normalization forms:
Specific example: in English, you'd want a search for ‘a’ to find ‘ä’, while this is entirely the wrong thing in Swedish, where a and ä are distinct letters.
An English speaker probably wouldn't want a search for ‘i’ to match ‘j’ even though the latter just has an extra hook on the bottom.
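One naive way to get the English-style matching, sketched in Python: decompose with NFD, then drop the combining marks. Note that this is exactly the behavior a Swedish speaker does not want:

    import unicodedata

    def fold_marks(s: str) -> str:
        # Decompose, then drop characters with a nonzero combining class
        # (i.e. the combining marks).
        return "".join(ch for ch in unicodedata.normalize("NFD", s)
                       if not unicodedata.combining(ch))

    fold_marks("Märta")   # 'Marta' - fine for an English-language search,
                          # wrong for Swedish, where ä is its own letter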
Glad that I still work in plain ASCII.
The second sentence is technically correct, but it's a strange followup here because it's not why UTF-8 and UTF-16 exist today. I don't know of any Asian webpages that use UTF-16 to save bandwidth; Japanese Wikipedia, for example, is still UTF-8.
The major use of UTF-16 in 2019, AFAICT, is for legacy operating system interfaces.
JS/JVM/CLR all work fine, but I imagine if they were created today, their strings would not be based on UTF-16.
From what I understand, it stores them as-is but can read any form (so it is normalization-insensitive):
The same script might work today on APFS.
I had to remember to create my ZFS volumes with Form D enabled, as it isn't an attribute that can be changed afterwards.
IIRC, ZFS on Mac OS X would set that by default so if you created the volumes from a Mac, then ok. But I was creating my ZFS array on a Linux or OpenSolaris server, where I would need to set Form D Normalization explicitly.
> "\c[latin small letter e]\c[combining acute accent]" eq "\c[latin small letter e with acute]"
> "\c[dog face]".chars
PS: WTF? HN strips emojis :/ (and does it incorrectly when they are emoji sequences).
But I think Perl6 is the only language that can do this magic:
> 'Déjà vu' ~~ /:ignorecase:ignoremark deja \s vu/
"Character" is a somewhat vague term, and Unicode prefers to use more specific terms like "code unit", "code point", "abstract character", etc.
In this case I think you may be referring to grapheme clusters, which come closer to how "humans think about characters" than Unicode abstract characters, which are building blocks of the technical encoding standard but in some cases don't really match a human concept of a graphical element of a writing system.
See also “Characters” and Grapheme Clusters in section 2.11 of https://www.unicode.org/versions/Unicode12.0.0/ch02.pdf, for example.
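To get a rough feel for the distinction, here's a small Python sketch; the grapheme-cluster line assumes the third-party regex module, which supports the \X grapheme pattern:

    import regex  # third-party module (pip install regex), not the stdlib re

    s = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT: one user-perceived character

    len(s)                            # 2 code points
    len(s.encode("utf-16-le")) // 2   # 2 UTF-16 code units
    len(s.encode("utf-8"))            # 3 UTF-8 bytes
    len(regex.findall(r"\X", s))      # 1 grapheme cluster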
s = '１ ２'
collation-level => 1, Country => International, Language => None, primary => 1, secondary => 0, tertiary => 0, quaternary => 0
> '１２' coll '12'
As to lelf's 1-character emoji, str.chars returns the number of characters in the string-- it would only return 2 if it returned the number of code units instead (which, the documentation notes, is what currently happens on the JVM).
That's for strings. For identifiers (names, filenames, ...) there are a lot more rules to consider, and almost nobody supports unicode identifiers safely.
There's also still no proper support for non-ASCII strings in the most basic utilities, like expand, wc, cut, head/tail, tr, fold/fmt, od or sed, awk, grep, go, silversearch, go platinum searcher, rust ripgrep, ... => http://perl11.org/blog/foldcase.html
I do maintain the multibyte patches for coreutils and fixed it for my projects at least.
Naively, that appears to be a major defect in Unicode.
Perhaps someone reading this knows why this was the right thing to do?
Unless you are supposed to be able to put an umlaut on any character: CJK characters with umlauts, say.
Text is one of those enormously complicated human conventions, so it's probably just ignorance speaking. But I would like to understand.
Edit: see comments below. My generalization is overly broad. Maybe a fairer statement is that some forms of normalization lead to aliasing, and sometimes you want that and sometimes you don't. So be aware of whether you want different strings to be treated the “same” or not.
My thought was that you can always test for sameness after the fact, but once you’ve normalized into storage, you can’t undo it.
That particular commit mentions _filenames_. I agree you should _never_ touch the bytes that are meant to be a filepath. File systems still do idiosyncratic things with non-ASCII file paths, and most of us aren't filesystem experts. Leave the bytes of a filepath alone.
Since git is all about filepaths, it makes sense that git would want to generally avoid this.
But in general, "during presentation" is not enough to deal with the sorts of problems the OP talks about. If you're comparing strings somewhere, it's probably before "presentation".
In general, I think it's quite reasonable to normalize your input to NFC on the way in. It's reasonable enough in most cases to be the "default" to get started with, unless you know a reason you shouldn't.
(For searching, you MIGHT want to normalize to NFKC, but that is "lossy" so I would never do that as a rule. I'd normally do it in some other field, and keep the original lossless copy too).
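As a sketch of what that might look like (hypothetical field names; Python's unicodedata, plus casefold() since search usually wants case-insensitivity too):

    import unicodedata

    def prepare_text(raw: str) -> dict:
        # Hypothetical helper: normalize once, at the input boundary.
        return {
            # What you store and display: canonical, lossless for glyphs.
            "text": unicodedata.normalize("NFC", raw),
            # What you index for search: compatibility-folded, lossy on purpose.
            "search_key": unicodedata.normalize("NFKC", raw).casefold(),
        }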
I doubt there are hard-and-fast rules where normalization is appropriate, but if you start applying it to all input, you're going to break things. Email addresses immediately spring to mind, for example.
If you can, and your email address is josé@gmail.com, but whether it arrives to you depends on whether the address was entered as the Unicode codepoint 'LATIN SMALL LETTER E WITH ACUTE' or 'LATIN SMALL LETTER E'+'COMBINING ACUTE ACCENT'... you're going to have a lot of trouble getting email. Because people entering your address have no idea which they are entering. I know I can type an "é" on MacOS US keyboard layout by typing `option-e,e`, but I have _no idea_ what actual bytes (corresponding to which of these codepoints) are sent to HN when I enter that in an HN comment text box. And most people sending email, of course, don't even know any of this is a thing.
If it works at all, it's actually gonna be because the software on either or both ends (sending and receiving) is already doing NFC normalization!
So, yes, this illustrates that normalization _is a thing_ you have to pay attention to, and can get weird.
But in general, for "text" input (rather than addresses or filepaths), normalizing to NFC is _usually_ a good call. If it's addresses or filepaths or what have you... it's gonna get confusing no matter what.
If you know what you're doing, maybe NFC isn't right. If you don't know what you're doing, NFC is a much better call to start out with than not normalizing at all. Not normalizing at all will bite you more often than NFC will.
Yes, per RFC 6531:
It doesn't seem to say anything about normalization, or about the fact that a mailbox `josé@` could be using 'LATIN SMALL LETTER E WITH ACUTE' or 'LATIN SMALL LETTER E'+'COMBINING ACUTE ACCENT'.
I suspect any SMTP servers supporting UTF-8 are actually doing NFC normalization to match mailboxes to requests.
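To make the ambiguity concrete (hypothetical address; Python's unicodedata again):

    import unicodedata

    precomposed = "jos\u00e9@example.com"   # LATIN SMALL LETTER E WITH ACUTE
    decomposed  = "jose\u0301@example.com"  # 'e' + COMBINING ACUTE ACCENT

    precomposed == decomposed   # False: same rendering, different code points

    # After NFC they compare equal, which is presumably what a mail server
    # matching mailboxes would want:
    unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", decomposed)   # True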
It's expected that they would do many kinds of transformations, like case normalization. Unicode normalization should be one of those but none of the email RFCs intend to enumerate them all.
If the issue is invalid bytes for a given unicode encoding (I am familiar with those!), I'm not certain what the official unicode normalization algorithms would do to them. They might just leave invalid bytes alone. The discussion of normalization seems to be in terms of codepoints though; a bytestream with invalid bytes (that can't be turned into unicode codepoints)... I would guess most actually existing normalization implementations are gonna complain about that. One could certainly write an algorithm that would just leave those bytes alone, I'm not sure if it would be considered compliant with the standard or not, just can't say.
Why does this come up?
This is relevant within this discussion because we were talking about filesystems, and Windows stores file names as an arbitrary array of 16-bit numbers, but UTF-16 has additional requirements that Windows file names do not satisfy.
When I say UCS-2, I just mean treating UTF-16 as if it were a plain array of `uint16`, and not worrying about surrogate pairs.
File system paths are tricky with non-ASCII, that is for sure! Especially if you are trying to be agnostic to OS/filesystem. They are an edge case.
You can certainly have invalid bytes for UTF-8 too. I think in any unicode encoding, not every possible byte sequence is a valid representation of unicode codepoints. It doesn't require UCS-2/UTF-16.
Here's an invalid byte sequence in UTF-8: "\xc3\x28"
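For instance, here's what Python does with it (purely illustrative):

    data = b"\xc3\x28"   # 0xC3 opens a 2-byte sequence, but 0x28 is not a valid continuation byte

    data.decode("utf-8")                     # raises UnicodeDecodeError
    data.decode("utf-8", errors="replace")   # '\ufffd(' - substitutes U+FFFD REPLACEMENT CHARACTER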
See RFC 6531.
This basically just comes down to the (sometimes unpleasant) reality that on most Unix systems, file paths are not text; they are byte strings whose components simply can't contain the bytes 0x00 or 0x2F ('/').
But, no, you do not avoid issues with unicode normalization by ensuring everything is in (eg) UTF-8. Unicode normalization is a thing within UTF-8.
On the other hand, I'm not sure it's possible to have a font that represents every valid combination of Unicode.
Why would anything more be necessary?
The reason we use Unicode is because ASCII is very limited in its scope. It can't handle the majority of the world's languages. It can't even handle American English. No dashes, no open or close quotes, etc.
And your friends from that Spanish-speaking country do UI work right? Can’t use ASCII for that. Ñ is out the window.
Emoji are rapidly becoming a staple of American English.
(Edit: and apparently HN filters emoji out of posts!)
UTF-8 encoding is compatible with ASCII. As long as you just use ASCII characters, your strings are also valid UTF-8.
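A quick way to convince yourself, in Python (purely illustrative):

    s = "plain ASCII text"

    # The ASCII and UTF-8 encodings of an ASCII-only string are byte-for-byte identical.
    s.encode("ascii") == s.encode("utf-8")   # True

    # Non-ASCII characters are where UTF-8 starts using multi-byte sequences.
    "café".encode("utf-8")   # b'caf\xc3\xa9'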
It continues to astonish me that programmers who spend hours arguing the relative merits of ECC RAM and database ACID implementations are so quick to destroy data by lazily defaulting to ASCII.