It is interesting to consider a theoretical system in which paths are represented in 0-terminated count-length format (e.g. "foo/bar/baz/myfile.txt" would be "\003foo\003bar\003baz\012myfile.txt\000"), truly allowing any byte in a filesystem node's name, although that might be going a little bit too far.
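A sketch of that encoding in Python, purely hypothetical (note that a one-byte length prefix would cap component names at 255 bytes):

    def encode_path(components):
        """Hypothetical count-length path encoding: each component is
        prefixed by a one-byte length, and the whole path ends in a 0 byte."""
        out = bytearray()
        for c in components:
            data = c.encode("utf-8")  # any raw bytes would do; even 0 is unambiguous here
            out.append(len(data))
            out += data
        out.append(0)
        return bytes(out)

    assert encode_path(["foo", "bar", "baz", "myfile.txt"]) == \
        b"\x03foo\x03bar\x03baz\x0amyfile.txt\x00"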
Things are much easier for the file system if it can just treat names as bags of bytes.
If you're really talking about bags (unordered sets), that would certainly make for an interesting filesystem since filename.txt, filemane.txt, and maletent.fix would all be the same...
The more fundamental question is whether filenames are under control of the user or the system. The answer today is "both": there are blessed paths like /System/Library... and non-blessed paths like ~/Documents/Pokemon.txt. Addressing this properly means reifying that distinction: making apps always be explicit about whether they're working with the user's or the filesystem's view of a file.
While we're here, why should the user even have to name every file? Do you name every piece of junk mail on your desk/kitchen island?
The ideal for the user is something like tagging, where naming stuff is optional, names are forgiving and perhaps not unique, and files are not restricted to being in one directory. Meanwhile, filesystem names are an implementation detail and the filesystem enjoys Unicode ignorance. Spotlight moved a bit in this direction, but the end goal is still awfully far away.
Show two files with apparently identical names. Is that so surprising? There are many many ways for two different Unicode strings to look visually identical (or near-identical) even if they aren't equivalent under normalization. For that matter, even if you were limited to plain keyboard ASCII, you could have two really long filenames that only differ in one character, which are effectively indistinguishable without massive hair-pulling. If you want to be robust, there's no way around having means to identify files other than names.
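For instance (a quick Python illustration; any language with Unicode support would do):

    import unicodedata

    # Three visually identical capitals from three different scripts:
    latin, greek, cyrillic = "A", "\u0391", "\u0410"
    print(latin == greek, latin == cyrillic)  # False False
    for ch in (latin, greek, cyrillic):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
    # U+0041 LATIN CAPITAL LETTER A
    # U+0391 GREEK CAPITAL LETTER ALPHA
    # U+0410 CYRILLIC CAPITAL LETTER A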
(I agree with your sentiment about tagging.)
My initial response was to not present confusing data to the user, but then I thought of exactly this. In many fonts, lowercase "L" (l) and uppercase "i" (I) look the same. I'm sure in some fonts they actually are identical, rather than just being nearly indistinguishable to the human eye.
So the discussion is not about whether we should present confusing identifiers to users, but about the point at which, as they become more common, they cause enough of a problem to be a bad choice, and about what techniques there are to mitigate it.
This distinction is important for other reasons, too. Should this file be indexed for search? Should it store an access time to support features like "recently opened"?
On the other side, we want to know which blessed paths "belong" to which system application or service so we can sync them, protect access to them, and uninstall them with their owners.
Maybe the user-visible document store should be a separate mechanism entirely from low-level app/service data persistence.
You still need to make a decision in the end, but are much more free to change it.
If the difference is not visible in your browser (it shouldn't be), try copy-pasting those two filenames into a text editor: one of them is 10 characters long and one of them is 11 characters long.
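If you'd rather not count by hand, a quick Python check on a hypothetical pair (not the filenames above) shows the same effect:

    import unicodedata

    name = "hällö"  # hypothetical stand-in
    nfc = unicodedata.normalize("NFC", name)
    nfd = unicodedata.normalize("NFD", name)
    print(nfc == nfd)          # False
    print(len(nfc), len(nfd))  # 5 7 -- identical on screen, different lengths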
I think the only thing you can really do about that is restricting names to ASCII minus control characters. Probably many programmers would be willing to accept that, but non-technical users would not.
The next best thing would be to enforce a canonical (code-point?) encoding. But given the complexities of Unicode that will get us only so far...
That's exactly what HFS+ did, and what APFS doesn't.
Yes, but software doesn't consider them equivalent.
It's still good then. And several systems allow for that just fine, including VMS (uniqueness comes from more than the filename). This is more like real-world folders (which can contain identical items, e.g. two copies of the same paper), and is also an excellent way to keep different document versions (keep the name the same, change the date shown).
You could do that trivially in UNIX since forever.
I would've actually preferred for identifiers to be Unicode text strings.
Ideally, the VFS layer should be handling all this garbage.
I am somewhat willing to accept the value of case sensitivity in identifiers for programming languages (though it's often related to the verbosity of the language, as in Java: "Foo foo = new Foo()"), but not in file naming.
Of course, I realize that the ship has sailed, and we're stuck with case sensitivity in file systems.
I'm with David Wheeler: https://www.dwheeler.com/essays/fixing-unix-linux-filenames....
We need to limit filenames for the good of the entire system and the whole community. Filenames as byte strings may sound good, but nobody ever thinks of the costs and the scant benefits.
And much, much harder for applications if that "bag of bytes" is an invalid UTF-8 sequence. You will end up with an invalid string (or an exception), and trying to open that file will then fail.
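A quick Python sketch of what that failure mode looks like (the filename bytes here are made up; surrogateescape is Python's actual escape hatch for this):

    raw = b"report\xff.txt"   # hypothetical filename bytes; not valid UTF-8
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as e:
        print(e)  # 'utf-8' codec can't decode byte 0xff in position 6 ...

    # Lossless, but yields a string that can't be re-encoded as strict UTF-8:
    name = raw.decode("utf-8", "surrogateescape")
    assert name.encode("utf-8", "surrogateescape") == raw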
I'd really hope that Apple checks that the filenames are valid UTF-8 as they otherwise can end up with very interesting security bugs.
It makes copying-and-pasting filenames as Unicode unreliable, for one thing.
> I'd really hope that Apple checks that the filenames are valid UTF-8 as they otherwise can end up with very interesting security bugs.
I think that's a bit pessimistic, as Linux has run for years like that without any real issues.
These things aren't just byte streams; they have semantics.
So one risk of treating this as "just bytes" is that bugs will introduce byte sequences that aren't utf8 at all, which will cause other programs to fail, or worse, to "try" and thereby corrupt the data further.
Another risk is that since it's supposed to be utf-8, some programs may do canonicalization internally to avoid confusing situations. This may even happen accidentally (though it's not likely): after all, a unicode-processing system could be forgiven for transparently changing canonicalization. But if a program canonicalizes you can now get really weird behavior such as opening a file, then saving it, and ending up with two files that look identical - because the write wrote to a path that was canonicalized.
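A contrived Python sketch of exactly that open-then-save trap, on a filesystem that doesn't normalize (the save() helper is hypothetical):

    import os, unicodedata

    def save(path, text):
        # hypothetical program that "helpfully" canonicalizes paths on write
        with open(unicodedata.normalize("NFC", path), "w") as f:
            f.write(text)

    nfd_name = unicodedata.normalize("NFD", "hällö.txt")
    open(nfd_name, "w").close()      # created under its NFD spelling
    save(nfd_name, "edited")         # silently writes to the NFC spelling
    print(sorted(os.listdir(".")))   # two visually identical entries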
Additionally, even though you can never avoid confusing paths entirely without considering the glyphs rendered, you are losing a very simple check against a fairly large class of errors.
Identifiers should be identifiable.
If a filename is encoded in utf8, it needs to be normalized.
To the canonical form, of course, not the crazy Python 3 or Apple idea of NFD, which is also slower.
If it's encoded as bytes you get garbage in - garbage out.
There's much more to consider, but unfortunately you cannot restrict a directory to forbid mixed scripts or confusables.
I summarized a few problems at http://perl11.org/blog/unicode-identifiers.html
In Unicode, they're generally unified into a single set, via a process called "Han unification". So, unlike Greek "Α" vs Cyrillic "А", the "same" character that may even look slightly different in Chinese vs Japanese (e.g. "海") has a single codepoint in Unicode. But that's another story...
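You can see both halves of that in any Unicode-aware language, e.g. Python:

    # Distinct code points for the look-alike Greek/Cyrillic capitals...
    print(hex(ord("\u0391")), hex(ord("\u0410")))  # 0x391 0x410
    # ...but one unified code point for the han character, in Chinese
    # and Japanese text alike:
    print(hex(ord("海")))  # 0x6d77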
• Chinese characters are called hànzì¹
• modern Japanese uses many hànzì, calling them kanji² instead
• katakana³ is used in modern Japanese to write words of foreign origin (loanwords) as well as onomatopoeia and is romanized as katakana, not katagana
• hiragana⁴ is used in modern Japanese to write okurigana⁵, particles, and certain words, and is romanized as hiragana, not hiregana
Here’s a typical (and extremely simple) sentence in modern Japanese that uses all three:
この文はサンプルです。 ("This is a sample sentence.")
この: this (hiragana)
文: sentence (kanji)
は: particle (pronounced wa) (hiragana)
サンプル: sample (loanword; pronounced sanpuru) (katakana)
です: particle (pronounced desu) (hiragana)
¹ — https://en.wikipedia.org/wiki/Chinese_characters
² — https://en.wikipedia.org/wiki/Kanji
³ — https://en.wikipedia.org/wiki/Katakana
⁴ — https://en.wikipedia.org/wiki/Hiragana
⁵ — https://en.wikipedia.org/wiki/Okurigana
(While we're being pedantic)
です isn't a particle, it's the imperfective (present/future) polite form of the copular verb (to be, in English). The only particle in that sentence is は.
And as a further side note, some words or names normally written in kanji or hiragana are occasionally written in katakana for emphasis or to emphasize a foreign nature.
Which part of it is relevant to pathnames?
> To the canonical form of course
Do you mean NFC?
> not the crazy Python 3 or Apple idea of NFD.
In what circumstances does Python 3 normalize pathnames?
> If it's encoded as bytes you get garbage in - garbage out.
What do you mean by "encoded as bytes"?
This is about programming language identifiers, not about pathnames, which are very different beasts.
All parts regarding identifiers and names, if you see pathnames as publicly identifiable information. See below.
> In what circumstances does Python 3 normalize pathnames?
None. Python 3 normalizes identifiers to the shorter non-canonical NFKC form.
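You can watch Python do this (PEP 3131 specifies NFKC for identifiers):

    import unicodedata

    print(unicodedata.normalize("NFKC", "ℌ"))  # -> 'H'

    # Because Python 3 applies NFKC to identifiers, these are the same name:
    ℌ = 42
    print(H)  # 42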
Apple HFS+ normalizes pathnames to the longer canonical decomposition form (NFD), which is faster but takes more space. Usually space is more important than CPU.
> What do you mean by "encoded as bytes"?
What everyone else means by byte encodings: a 1-1 mapping of bytes, without any further knowledge of the encoding.
> This is about programming language identifiers, not about pathnames, which are very different beasts.
Partially. Pathnames can also be argued to be identifiers, similar to domain names, email names, user names, and language identifiers. Apple does so. Ask plan9 what they think about pathname semantics.
Not everyone is in the garbage in - garbage out camp. Many systems do encode pathnames and have character restrictions.
Especially important is the popular "/" spoof, which people from the garbage camp don't care about.
Julia is a recent proponent of the garbage camp, btw.
What definition or sense of 'canonical' do you mean? NFKC stands for "Compatibility Decomposition, followed by Canonical Composition", so it's canonical in some sense, right?
That's fortunate, not unfortunate. I want the files corresponding to documents to be named according to the titles of those documents, and said titles often do mix Latin and Cyrillic scripts in a way that produces "confusables". So as a generic rule for all filenames, it's a no-go - it will significantly affect perfectly legitimate use cases.
1) History has shown that doing it at the filesystem level causes significant compatibility issues
2) Case insensitivity can present visually confusing results
3) Round-tripping between cases sometimes is not possible without loss of information
4) Some languages don't have a meaningful definition of 'case insensitive'
5) Some languages would demand a more comprehensive approach (i.e. being able to search using either kana or kanji for Japanese filenames)
So for things like case insensitivity (or #5 above) it's best to handle them at the application (or, ideally, OS user space library) level, taking advantage of information like system locale. Low-level APIs and system services can use case-sensitive filenames.
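A minimal sketch of the userspace approach, assuming Python and ignoring the system locale for brevity:

    import os

    def find_case_insensitive(dirpath, wanted):
        """Userspace case-insensitive lookup over a case-sensitive filesystem.
        casefold() is locale-independent; a real implementation might consult
        the system locale, per the point above."""
        target = wanted.casefold()
        for entry in os.listdir(dirpath):
            if entry.casefold() == target:
                return entry
        return None

    print(find_case_insensitive(".", "README.TXT"))  # e.g. 'readme.txt'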
This is different from the question of whether to normalize filenames, because filenames are still visually unambiguous (at least most of the time) in UI, logs, and debuggers.
Incidentally, Windows is proof of this technique's advantages because NTFS is a case-sensitive file system while Win32 is case-insensitive. IIRC the Linux subsystem for Windows takes advantage of this.
I was thinking the same thing! Although if we're being pedantic, I think "bag" is an "unordered multiset". If it were just a plain unordered set, then "filename.txt" would also be the same as "filenam.txt".
Aren't colons (:) an issue with macOS as well? I think Finder and other userspace apps convert them to slashes. I suppose, though, there's a case for the filesystem not caring about that.
"hello world!" encoded as "12:hello world!,"
An empty string as "0:,"
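(Those look like djb-style netstrings.) A tiny sketch, counting characters as the examples above do, though the real format counts bytes:

    def netstring(s: str) -> str:
        # length prefix, colon, payload, trailing comma
        return f"{len(s)}:{s},"

    assert netstring("hello world!") == "12:hello world!,"
    assert netstring("") == "0:,"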
Unicode and UTF-8 unify many Traditional & Simplified Chinese Characters as well as Japanese Kanji and Korean Hanja.
There were some extremely angry reactions to Han Character Unification, though efforts have been made to overcome many of the initial complaints.
 "Why Unicode Won’t Work on the Internet:
Linguistic, Political, and Technical Limitations" http://www.hastingsresearch.com/net/04-unicode-limitations.s...
It also means that, yet again, every app will need to be updated for a new version of iOS. Makes me wonder how many apps will be left behind if not updated? Thousands? Hundreds of thousands? Millions?
Something tells me practically no apps will be seriously affected.
So they have to deal with arbitrary filenames; on the other hand, for the same reason, they can't maintain a master list of files somewhere (which would break) and have to check the actual directory contents instead. Still, things like history or links between files might be broken.
That is, unless NSFileCoordination APIs act differently wrt normalization; iCloud clients have to use those rather than accessing the filesystem directly.
They should be doing that anyway. Anything else is just begging to get out of sync at some point.
Using a network value or a value returned by an API is just asking for trouble.
If a user names a file that will be hidden behind a URL, the same advice applies. If not, then the user can use any sequence of bytes they want and you shouldn't care.
Given how loudly the tech press proclaims any perceived mis-step by Apple, I'd have to believe we'd have been reading tons of 'Apple is Doooooooomed' articles about this by now. Given that this hasn't happened, and I haven't seen similar problem reports on dev forums and other hangouts, I'd lean towards there being some miscommunication or misunderstanding here.
But this may become a much bigger issue when we start using APFS on macOS.
It’s a reasonable decision to come to (especially for iOS where end-users don’t ever really interact with the filesystem directly), but it will cause quite a bit of churn in the short term.
Canonically equivalent Unicode sequences look the same to machines.
The latter is a much more significant problem, because it can wreak havoc with interoperability.
Memcmp disagrees, as do the default equality operators of most programming languages in existence.
If there isn't, accusations of non-compliance are just FUD.
I only just learnt about this Unicode normalisation recently, looking at ZFS, which has options for it that I had never seen until reading the Ubuntu Root FS on ZFS guides, which talk about setting it.
They are our tools, not the other way around ;-)
More generally, once APFS is deployed users can legitimately end up with multiple files in the same folder whose names only differ in normalization.
You wouldn't say "the OS doesn't check file attributes to see whether you are allowed to access a file; that's left to applications", either.
Could they introduce a directory-level option to automatically normalize all files below that node? (Same with case-sensitivity, which I think Adobe software still has problems with.)
Umlauts on HFS+ are still umlauts anywhere else. The only real oddity of HFS+ (beyond the fact that it does normalization at all) is that it doesn't use NFD proper; it uses a variant of NFD based on an old version of Unicode (it has to be this way because the normalization tables must be immutable, or there are compatibility issues when reading drives written by different versions of the OS). So if you take a filename and convert it to NFD, it may not be the exact same byte sequence that you get if you plug that filename into HFS+ (but in most cases it will be). But whatever byte sequence you get from HFS+ is still going to be a valid Unicode sequence.
Copy "hällö" from Mac to Linux over samba -> works fine.
scp or rsync that same file again from Mac to Linux (without special conversion options) -> you will have two "hällö" there.
The first one is recognized by Samba, the second one is invisible.
On HFS+, because it normalizes, if you try to pass either form to the filesystem, it will treat them the same. But on filesystems that don't normalize (which is most of them) they'll be treated as distinct files. As a result, depending on the input you provide and whether the tools in question do any normalization, you could end up with two files that look identical (e.g. your two "hällö" files) but are different unicode sequences under the hood.
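You can see the two spellings under the hood with a quick Python check:

    import unicodedata

    print(unicodedata.normalize("NFC", "hällö").encode("utf-8"))
    # b'h\xc3\xa4ll\xc3\xb6'
    print(unicodedata.normalize("NFD", "hällö").encode("utf-8"))
    # b'ha\xcc\x88llo\xcc\x88'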
And none of this really has anything to do with UTF-8. UTF-8 is just a byte encoding scheme that can represent all valid unicode sequences.
But this doesn't make it wrong. You can reproduce this exact behavior with other systems too simply by changing the local filename to be decomposed. The only Mac-specific part here is that HFS+ will automatically decompose the filename.
If you tell HFS to store a file with an Ö in it, and then list the directory, the character Ö is nowhere to be found.