The real problem isn't whether filename normalization is a good or bad thing; the problem is that Apple used to do it one way and is now switching, without warning, to doing it the other way.
It's the logical end product of the odyssey from Apple's original philosophy of a resource and data fork model for files to the UNIX stream-of-bytes model for files. The UNIX model traditionally kept metadata about files separate (anyone else remember naughtily hand-editing directory files with sed(1), back in the day?) but gradually picked up a bunch of new stuff that has to be stored somewhere. Meanwhile, the Mac platform adopted UNIX binaries with the move to NeXTStep underpinnings (we're going back to the late 90s here, and the adoption of OSX over traditional MacOS), obviating the need for the original resource bundle, which got rid of those annoying errors about missing bundle bits but left us with a legacy of .DS_Store turdlets in every directory to hold file metadata that was formerly stored in the resource fork.
As a destination it's laudable — design consistency is almost always laudable — but it's the end goal of a very messy process and it looks like APFS isn't quite there yet; NSFileManager and NSURL need some way of distinguishing files with different unicode representations, and the Finder in particular needs to be robust. I'm guessing this is going to be fixed when High Sierra finally ships, but isn't supported in Sierra at present, hence the OP's alarm at the situation.
One of the new features in iOS 11 presented yesterday is the Files app. It provides access to all files (at some level) on the system and on networked systems, including iCloud, Dropbox, etc.
Almost! But not quite: it aliases some possible filenames, though only ones that are valid UTF-8 encodings (or UTF-16, say, if one did this on Windows). In particular, it prevents two files with equivalent names from existing.
In other words, like most RDBMSes, ZFS differentiates "field type" from "collation."
"Field type" in an RDBMS controls what can be written (e.g. "valid UTF8 strings"), what will be read back (e.g. the use of the Unicode replacement character), and what special values like NULL will cast to.
An RDBMS field's collation controls how values in the field will compare for equality, and what will happen when you sort on that field.
Unique constraints in RDBMSes function on top of collation—so if the collation says two values are equivalent, the unique constraint will prevent you from inserting the new one.
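For illustration, here's roughly what that looks like with Python's bundled sqlite3, using the built-in NOCASE collation as a stand-in for a normalization-aware one (SQLite doesn't ship a Unicode-normalizing collation, so this is only a sketch of the principle):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    # The collation is part of the column definition; UNIQUE sits on top of it.
    conn.execute("CREATE TABLE files (name TEXT COLLATE NOCASE UNIQUE)")
    conn.execute("INSERT INTO files VALUES ('Readme.txt')")
    try:
        conn.execute("INSERT INTO files VALUES ('README.TXT')")
    except sqlite3.IntegrityError as e:
        print("rejected as duplicate:", e)        # the collation decided equality
    # The stored value keeps its original spelling:
    print(conn.execute("SELECT name FROM files").fetchall())   # [('Readme.txt',)]

Swap in a normalization-insensitive collation and you'd get the ZFS-style behavior discussed elsewhere in the thread.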
Great properties, if you can get them. It's too bad that the filesystem API is as low-level as it is, actually; if POSIX had some concept of "readdir(2) pre-sorted by a given field" like e.g. Windows does, then you could require that userland programs rely on FS-level collation rather than allowing them to collate the resulting values however they feel like (usually meaning "naively.")
Well, readdir() in ZFS does not produce outputs in any particular order, since directories are just plain hash tables.
And the n-i string comparison function produces a boolean if I remember correctly, not a three-way result, so it doesn't define a collation. But it could define one, that's true.
The problem with moving sorting into the kernel is that you now need to have the collations there (English? French? something else? "Unicode" is not enough), and user-land needs to tell the kernel what collation to use for any given process or thread or system call. That's ETOOMUCHWORK for everyone, so it doesn't happen.
(BTW, I found your comments interesting, even as I laughed at how often you were making them. Thanks for contributing, and sorry for getting us both flagged (for IMHO the goofiest possible reason).)
The former resource fork is not in ._$filename unless the underlying filesystem does not support it. On HFS+ volumes, which is what 99%+ of Mac users use, there will be no ._$filename file.
You are correct regarding the resource fork support in the fs and its visibility; maybe I didn't express myself as precisely as I should have.
However, 99%+ of Mac users do use FAT-formatted USB sticks, ZIP files, or some other fs/mechanism/whatever that does not support resource forks, which is where the compatibility littering kicks in, so they, or the people they share their files with, will see ._$filename files too.
On the other hand, Finder litters with .DS_Store files everywhere, even on HFS+.
Most files don't have resource forks though, so the ._$filename doesn't store resource fork data in most cases. It usually just stores metadata.
I really doubt that 99%+ of Mac users use flash drives / zip files, though. I suspect that well more than 1% of users never use anything like that. A lot of people only have a single computer and share through e.g. just mailing files to people or using Dropbox, or not even that.
While resource forks are deprecated and on the retreat, extended attributes are the new hotness, used extensively (just download a file in the browser, and it will get com.apple.quarantine and com.apple.metadata:kMDItemWhereFroms). These are also shoveled into AppleDouble files. Actually, the resource fork is just the com.apple.ResourceFork extended attribute.
I'm not sure that more than 1% of Mac users never exchange files with other people, possibly on other platforms. What do they use the computer for, then? An iPad would be more suitable in that case.
I think that there are well more than 1% of users, maybe 20% or higher, who never share files except to attach a picture to an email, upload it to Facebook, or something similar. It's a bit of a nitpick, but we do often forget the "non-power" computer users.
There's a potentially useful discussion to be had on normalization but the title is pure clickbait hyperbole. HFS+ is the only filesystem in common use which performs Unicode normalization and a statement that bold would require at least some evidence that Windows, Linux, etc. are only usable by English speakers.
My position on this is mixed. I've had to write code to deal with normalization changes in archives and it's quite tedious. The HFS+ approach of normalization is in many ways the best choice considered in isolation but in practice it's really expensive to support since most filesystems, tools, APIs, etc. predate Unicode and everyone else chose the bag of bytes approach.
Normalization is very expensive and does not belong at the FS level. This kills performance for some classes of applications.
The comparison with other filesystems does not hold since applications for other OSs have always been developed with no normalization at FS level, and hence it was done by the applications, or through the use of high-level OS APIs. Mac applications, on the other hand, expect it to be the responsibility of the filesystem. This is explained pretty clearly in the article.
On a side note, I wish Apple had taken this opportunity to switch normalization from NFD to NFC, which basically everything else uses. The distinction causes complexity and often issues in software which shares data between Apple platforms and other platforms, such as version control systems.
EDIT: according to pilif's comment, they did, which is awesome!
Do you have any recent benchmarks showing a significant impact from normalization? I haven't seen that on anything in at least a decade and that was simply Red Hat shipping an ancient and completely unoptimized libicu.
> Mac applications, on the other hand, expect it to be the responsibility of the filesystem. This is explained pretty clearly in the article.
Again, the article made a huge sweeping claim without supporting it. That's simply not true in either way – many apps on every OS don't handle this at all, some handle it consistently everywhere, and what a “Mac application” means varies widely from “clean, modern Cocoa” to “uses a lot of C, etc. libraries”, “cross platform C++”, “C# port”, “Electron shell”, etc. You can't make any statement which is true for every single one of those categories, much less for every code path which eventually results in a filesystem call. I've run into cases where something mostly worked until you hit their integrated ZIP, Git/SVN, etc. support and found a new way that a filename was constructed.
My point wasn't that everything is fine but simply that this is complicated and no decision results in avoiding problems. Not normalizing allows for confusing visually-identical files; normalizing results in errors or data loss which will be blamed on the OS.
> Rather than saying the article is wrong, can you demonstrate /why/ it is wrong?
I think you might want to re-read my entire comment: note that I'm not arguing that the technical details are wrong, only that they're insufficient to support the huge “APFS is unusable” conclusion.
As previously noted, Windows and Linux work the same way and they are used by more people in individual non-English locales than the total number of Mac users. Would you say “NTFS is unusable by non-English users” is a useful statement?
There's plenty of room to say that a particular tool needs improvement, or that people making systems which copy or archive files should check for pathological cases, but it doesn't help anything to overstate the case so broadly.
The issue isn't a "bag-of-bytes" filename model. The issue is a "bag-of-bytes" filename model combined with an inconsistent normalization scheme.
It's not a problem on Windows or Linux filesystems because Windows and Linux don't provide a half-assed normalization scheme that lets me fairly easily create files that can't be accessed. If the Cocoa libraries did no normalization, then the resulting behavior might be obnoxious from a human-interface perspective, but I don't think the article would describe it as "little short of catastrophic".
I'm sitting here on my US English keyboard typing scancodes that look just like they did in 1990, so I'm not the best authority on how big of a problem it really is, but I'd guess it's going to result in a lot of bugs. Anyone who's ever tried to use a Mac with a case-sensitive HFS+ partition should be able to tell you that programmers can't even "normalize" their filenames consistently strictly within their native language.
> It's not a problem on Windows or Linux filesystems because Windows and Linux don't provide a half-assed normalization scheme that lets me fairly easily create files that can't be accessed
This is only true if you're talking about the kernel APIs. Unfortunately, filenames come from a variety of sources and it's easy to find tools which inconsistently normalize them – e.g. simply copying and pasting a name from a Word doc, web page, etc. which has different normalization than whatever originally created the file – or which produce either duplicate error messages or confusing error messages because the normalization form used in a file doesn't match the normalization form written on disk.
I've encountered variations of this problem on all three systems. No approach is going to handle 100% of the filenames in the wild and all of them will require extra care in the user-interface which may or may not have been done – e.g. the Windows Explorer still provides no way to tell why Café.txt and Café.txt are not the same file – and fixing the cases where programs are internally inconsistent. APFS switching will expose some programs which were unsafe before but since it's consistent with the other common filesystems it'll remove the need for every archive, version control, etc. system to either special-case or break.
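For the Café.txt example: the two names really can be visually identical and still be different files. A quick Python illustration of the difference Explorer can't show you:

    import unicodedata

    nfc = "Caf\u00e9.txt"        # é as a single code point (NFC)
    nfd = "Cafe\u0301.txt"       # e + COMBINING ACUTE ACCENT (NFD)

    print(nfc == nfd)                                  # False: different code points
    print(nfc.encode("utf-8"))                         # b'Caf\xc3\xa9.txt'
    print(nfd.encode("utf-8"))                         # b'Cafe\xcc\x81.txt'
    print(unicodedata.normalize("NFC", nfd) == nfc)    # True once normalized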
Normalization is not that expensive if you only care to do normalization-insensitive string comparison and string hashing.
The reason it's not that expensive is: a) this requires no memory allocation, and b) most characters in most strings require no normalization!
Notionally you just look at pairs of next codepoints, and if the second one isn't combining and the first one is canonical for the chosen NF (a very fast check for ASCII!), then there's no need to normalize the first, otherwise you gather the combining codepoints and normalize, producing one normalized character and restarting the process where you left off. Most of the time the first codepoint requires no normalization, so the fast path is fast -- not as fast as a normal strcmp() or memcmp(), but still pretty fast.
In HFS+ it's even only done once per-create, so pretty cheap.
In ZFS it's once per-open()/stat()/and so on. But still, not at all on readdir(), and anyway it's highly optimized. For an all-ASCII filename the slow path is never taken, and for a mostly-ASCII filename the slow path is only taken for non-ASCII codepoints that are followed by combining codepoints (that check is itself slower than the fast path, but still faster than the slowest path).
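For the curious, a rough Python sketch of that fast path (the function names here are made up, and the real implementation is C code that's more careful about edge cases):

    import unicodedata

    def _next_cluster(s, i):
        # Gather a base code point plus any trailing combining marks.
        j = i + 1
        while j < len(s) and unicodedata.combining(s[j]):
            j += 1
        return s[i:j], j

    def norm_eq(a, b, form="NFC"):
        i = j = 0
        while i < len(a) and j < len(b):
            ca, i = _next_cluster(a, i)
            cb, j = _next_cluster(b, j)
            if ca == cb:
                continue                    # fast path: identical, no normalization needed
            if unicodedata.normalize(form, ca) != unicodedata.normalize(form, cb):
                return False                # slow path: normalize just this cluster
        return i == len(a) and j == len(b)

    print(norm_eq("Cafe\u0301", "Caf\u00e9"))   # True
    print(norm_eq("Cafe", "Cafo"))              # False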
Not true! ZFS also normalizes. However, while HFS+ normalizes on _create_ (bad), ZFS normalizes on lookup (good).
Specifically, ZFS has a normalization-preserving, normalization-insensitive behavior -- a lot like case-preserving but case-insensitive behavior, but for normalization forms rather than case.
The way this works is that there's a) a string comparison function that can provide normalization- and/or case-insensitive comparisons, b) a character-at-a-time normalization function of sorts used for directory hashing (since ZFS hashes directories).
This is much better than normalizing on create, which is destructive and obnoxious when the form you normalize to is not the more common form and you live in a sea of applications that don't do normalization.
> Not true! ZFS also normalizes. However, while HFS+ normalizes on _create_ (bad), ZFS normalizes on lookup (good).
That's not a question of being true or not but referring to different things. The distinction I was trying to make is that HFS+ will force every filename into NFD. With ZFS, the filename as received from the APIs should be the same byte sequence which was used to create it and that avoids an entire category of bugs where e.g. a program writes a file and fails to find it in a later readdir() call. As an example, Subversion and Git both had numerous bug reports over the years where it was impossible to simply checkout a repo containing a file which used a different normalization form.
Say you create a file with a name that has different NFC and NFD forms, and you create it with the NFC form. Then you go try to create it with the NFD form, well, if doing an exclusive create (O_EXCL) then you'll get EEXIST, else you'll open the existing file.
It normalizes on LOOKUP, not create. On create it preserves the original form. Think of this as like case-preserving/case-insensitive behavior, but for form rather than case.
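A toy model of that form-preserving / form-insensitive behavior, just to make the semantics concrete (a Python dict standing in for the normalizing directory hash/compare that ZFS actually uses):

    import unicodedata

    class Directory:
        def __init__(self):
            self._entries = {}                       # normalized key -> original name

        def create(self, name, exclusive=False):
            key = unicodedata.normalize("NFC", name) # compare in one canonical form
            if key in self._entries:
                if exclusive:
                    raise FileExistsError(name)      # like O_EXCL -> EEXIST
                return self._entries[key]            # "open" the existing entry
            self._entries[key] = name                # preserve the form used at create
            return name

        def listdir(self):
            return list(self._entries.values())      # original forms come back out

    d = Directory()
    d.create("Caf\u00e9")                    # created with the NFC form
    print(d.create("Cafe\u0301"))            # NFD lookup finds the same entry -> 'Café'
    print(d.listdir())                       # ['Café'] - the creator's form, preserved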
This is largely due to Microsoft being an early adopter of Unicode. In the 90s, Unicode was limited to 16 bits, so switching to fixed-length 16-bit characters made a lot of sense.
Of course, it didn't take long for that to change, but MS had already put a lot of resources into switching to UCS-2.
This timeline also meant that wchar_t was fixed as 16 bits on Windows for compatibility, but ended up as 32 bits on macOS and Unix. It can't really be removed from the C or C++ standards, but there's nothing you can do with wchar_t that's both useful and portable.
Well-meaning developers learn how to support Unicode with wchar_t only to discover that wchar_t is the worst way to support Unicode.
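The core gotcha, shown with Python since it exposes the encoded units directly: one code point outside the BMP becomes two UTF-16 code units, which is exactly what a 16-bit wchar_t can't paper over.

    s = "\U0001F600"                          # one code point, above U+FFFF
    utf16 = s.encode("utf-16-le")
    units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]
    print(len(s), [hex(u) for u in units])    # 1 ['0xd83d', '0xde00'] - a surrogate pair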
It doesn't help that many developers are introduced to the basics of Unicode by a popular Joel on Software article from 2003, which recommended the use of wchar_t. [1]
It was a very important article for 2003, but it should be honorably retired.
Its plan of action isn't as good as the modern "UTF-8 Everywhere", and even its motivating examples are becoming less relevant: when was the last time you went to the 'Encoding' menu in your web browser and guessed which codepage the page author meant to use? When was the last time you worried about whether your e-mail was 7-bit clean?
HFS+'s use of NFD also comes from it being an early adopter of Unicode.
When I argued strenuously for n-p/n-i behavior in ZFS (see elsewhere in this thread) I was told that gee, Unicode doesn't describe n-i string comparison the way I was proposing, so we can't do it :( However, fast n-i does follow from the spec, so eventually it got done. I can't take credit for the code, though I did contribute an optimization.
What about "bag of wchar_t" doesn't preserve normalization? Or am I not sure what you're trying to say?
It's no more a "bag of code units" than Linux filesystems store a "bag of code units". Windows will barf back whatever wchar_t array you give it, just like Linux will barf back whatever char array you give it.
You're right that "code unit" is the correct Unicode term for parts of encoded Unicode strings, which is exactly why it's the wrong term in this case, because the whole point is the filenames don't have to be Unicode.
So I can take the byte string [0xFF, 0xFF, 0xFF] and use that as a filename on Linux, or I can take the sequence of 16-bit values [0xD800, 0xD800, 0xD800] and use that as a filename on NTFS. They're not made of code units because they're not Unicode strings.
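On Linux you can actually try that first one (deliberately ugly, Linux-specific, do it in a scratch directory):

    import os, tempfile

    scratch = tempfile.mkdtemp()
    fd = os.open(os.path.join(scratch.encode(), b"\xff\xff\xff"),
                 os.O_CREAT | os.O_WRONLY)
    os.close(fd)
    print(os.listdir(scratch.encode()))      # [b'\xff\xff\xff'] - raw bytes, round-tripped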
Exactly - and that matters to me in cases where the ability to say that someone copying files won't change the checksum on a manifest, which users find extremely confusing.
HFS+ was the only file system I know of that was doing Unicode normalisation, and it certainly was the only one choosing NFD, which encodes characters in a way that's impossible for someone to easily type in the UI.
Short-term this will cause a lot of inconsistencies with applications using low-level APIs as the files currently existing will be NFD normalised, but any user-given path will very likely be more or less equivalent to what NFC would do.
So in the short term, this will be a mess (unless the APFS conversion at install time also does a one-time conversion of the normal form), but in the long term, I believe this is the way to go.
And for that matter, the same should happen with regards to case-insensitivity which is even worse as ensuring case-insensitivity might actually be dependent on locale, which means that depending on the user's locale two file names might or might not be identical under case-insensitivity rules.
Unfortunately, it looks like there's just too much legacy code around that plain doesn't work with a case-sensitive file system on the Mac (I'm looking at you, Adobe).
BTW: On my 10.13 Beta 1 setup with an APFS converted boot drive, unless I manually create a NFD encoded file name on the command line (which you can't do accidentally), everything is NFC both in the UI and on the command line. This also applies to files I haven't touched since the conversion to APFS.
>HFS+ was the only file system I know of that was doing Unicode normalisation
I've seen something in a couple comments now to this effect, does nobody here do anything with ZFS at all? It's a pretty great filesystem on not just illumos but FreeBSD, Linux, and macOS, and it has full native normalization support (and for all 4 forms too). Normalizing is optional, but it's definitely there and with Macs I have been using NFD for compatibility purposes for 5 years now with no discernible performance problems in day-to-day use. I know ZFS isn't remotely a majority but I don't think it's that obscure either.
In particular ZFS has normalization-preserving/normalization-insensitive behavior, which is far superior to HFS+'s opinionated normalization-on-create (to a form that is different from the common input modes' output!).
Yes, HFS+ implements normalization-insensitive behavior through not being normalization-preserving, in the same way that some FATs might implement case-insensitive behavior through not being case-preserving. They didn't realize that you could have both features: normalization-preserving and normalization-insensitive.
I guess it was a very forgivable lack of imagination. When we came up with form-preserving/insensitive we were motivated in great part by the interop mess caused by HFS+ -- we might not have arrived there without that mess, though I'd like to believe that someone would eventually have reached this conclusion regardless.
Not to repeat myself too much, but... ZFS also supports normalization. Only instead of normalizing on create, it has normalization-preserving/insensitive behavior.
I'd like to believe that most developers would prefer the current behaviour over the one proposed by the article (normalisation).
If there are issues in Apple's Finder or other high level APIs, I'd like to believe they will be fixed before the final release. These are fixable bugs, IMHO.
The problem seems to come from the high-level APIs doing normalization. So if you have two files in a directory, one normalized and one not, and you open the non-normalized file, the high-level API will normalize that filename and you open the wrong file.
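A simulation of that failure mode, with a made-up normalizing open() standing in for the high-level API (this is not the actual Cocoa code path, just the shape of the bug):

    import unicodedata

    disk = {
        "Caf\u00e9.txt":  "contents of the NFC file",
        "Cafe\u0301.txt": "contents of the NFD file",
    }

    def open_lowlevel(name):                  # byte-faithful, like the POSIX layer
        return disk[name]

    def open_highlevel(name):                 # hypothetical normalizing layer
        return disk[unicodedata.normalize("NFC", name)]

    print(open_lowlevel("Cafe\u0301.txt"))    # the NFD file, as asked for
    print(open_highlevel("Cafe\u0301.txt"))   # the NFC file - silently the wrong one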
I've always disliked the practice of messing around with file paths (storing them, concatenating them, etc). I preferred the way that the Classic MacOS typically dealt with filesystem references and aliases instead of file paths. This also meant users can rename or move files and references still work.
Having two sets of APIs, one that does not mess with paths and one that does or adds other constraints, will cause all kinds of funky problems.
For APFS it will be high-level-API using applications not being able to open certain files or using the wrong file[0], like stated.
You get a similar situation with NTFS, where the file system supports long paths while the userland WinAPI uses a much smaller length limit (unless you jump through "\\?\" hoops, which don't work for relative paths), which renders certain files "unopenable" by certain applications.
[0] Regarding opening the wrong files, anybody remember the Android zip vulnerabilities? https://googlesystem.blogspot.com/2013/07/the-8219321-androi...
It's not that hard to imagine that some macOS software does security/sanity checks on files using the low-level non-normalized API but then opens the (wrong and unchecked) file later with normalizing high level API, or vice versa. Having this API discrepancy built into your OS certainly makes these kinds of things more likely.
I understand that the bookmark feature of NSURL is (more or less, but mostly more) the same. Quote: "Whereas path- and file reference URLs are potentially fragile between launches of your app, a bookmark can usually be used to re-create a URL to a file even in cases where the file was moved or renamed."
> The title is click-bait and over-dramatises the issue.
Sadly it's not.
> The choice of APFS is that a filename is a sequence of bytes. Nothing more, nothing less (feel free to correct me if I'm wrong here).
That is incorrect. APFS treats filenames as utf-8 strings, and depending on the API you are using, normalization still takes place, but at different levels. For instance, all Cocoa APIs will perform normalization, but the underlying syscalls and POSIX APIs will not.
So it is correct then, the file system doesn't concern itself with normalization. File names are stored internally as utf-8 strings which are just a sequence of bytes.
"The case-sensitive variant of APFS is both normalization-preserving and normalization-sensitive."
It should just always be form-preserving and form-INsensitive.
There is absolutely no reason to want the combination they give. It makes me suspect that they didn't bother separating case- and form-insensitivity, although I can't imagine why one wouldn't. It seems like.. a mistake. Someone misunderstood.
So you disagree with the linked blog post then? I mean this entire discussion is based on the premise that HFS+ performs normalization and APFS does not. The discrepancy means that what used to happen automatically is now the responsibility of individual applications, and the author hypothesizes that this means that shit will break.
Not sure what you want me to say. In my mind the current situation with APFS is a massive pain and was not well thought through.
Previously normalization (while terrible) was at least consistent. Now different APIs have widely different behaviour and it's even worse at the application level. What used to be an annoyance has now become absolute chaos with no guidance from Apple.
My understanding of the article is effectively: APFS on macOS in its current form is unusable for non-English users, and I strongly agree with that. While it might be partially usable, in particular if an upgrade from an earlier Mac was performed, it has a ton of really terrible edge cases most users will quickly run into.
The problem is that most input methods produce something close to NFC while HFS+ decomposes to something close to NFD. Which means that if you cut-n-paste non-ASCII Unicode names from a finder into any app that doesn't normalize, then you'll have problems.
The solution we came up with for ZFS was normalization-preserving but normalization-insensitive name comparison and directory hashing. This produces the best interoperability via NFS, SMB, WebDAV, and so on, and local POSIX access.
I also think encoding doesn't belong in a file system. Let the names be arrays of bytes and leave the encoding to the people that use it, be it utf-8, utf-16 or something entirely different.
ZFS has an option to reject non-UTF-8. But if you disable that then it allows it, and for all UTF-8 it is normalization-preserving/insensitive. This is cheap and ideal.
EDIT: i.e., ZFS doesn't know or care about encodings. Its n-p/n-i behavior means that for non-UTF-8 that happens to appear as valid UTF-8 there is some potential aliasing behavior going on, but this is exceedingly unlikely, and we decided that it was worth taking that risk.
the problem that arrives there almost instantly is that people want to see a list of filenames.
If you're ever going to do some sort of "displaying" of data, you cannot store it as bytes. You need to know what characters things are supposed to be presented as.
You could imagine not settling on a specific encoding, but you must know the encoding. Unless your plan is to show a list of numbers to users.
Right. And it's just rather difficult for the kernel to know the user-land LC_CTYPE setting, so it doesn't, so it doesn't know the encoding of any string, so the best thing to do is assume UTF-8. If you want some other encoding, then libc's syscall stubs will have to do codeset conversions, or if not then you have to make sure that you only use that codeset everywhere.
The filesystem can't be responsible for displaying anything, though. Displaying filenames is the job of the shell / window manager / etc.
The filesystem is much better off handling filenames as a number of arbitrary bytes. Let people who want to put weird bytes in their filenames see ugly filenames along the lines of "\x00 Can you see this?"
Or worse, you speak multiple languages, or learn new ones, and need to... switch codesets? How do you then access your old files?!
We must switch to Unicode. Full stop. If there are imperfections in Unicode script support, then we must fix those, but otherwise we must adopt Unicode.
The actual problem is not APFS but Apple's programs doing (and their advice to developers to do) normalization. It wouldn't be a problem if programs just used the file names given to them.
In fact, I think the normalization HFS+ does is more problematic. For example, fish shell can't complete file names when you use un-normalized characters in the input. [0][1]
Apparently on iOS 11, even the case-sensitive variant of APFS will be normalization-insensitive. Previously it looked like this would only be the case on macOS's case-insensitive APFS variant.
Anyway, it looks like the issues raised in this (April) blog post will not actually apply to iOS 11 or macOS High Sierra.
I've been saying for years, to anyone who will listen (and many who won't!) that normalization-preserving/normalization-insensitive is the only sensible behavior for a filesystem.
(I've said this many times in the IETF in the context of NFSv4 and WebDAV and such. Every time I've noticed the subject of stringprep for filesystem protocols come up.)
"Happening at a higher level" is a reasonable solution, if it happens consistently, no matter which higher level you're using. If you have 18 different functions to open a file, and 11 of them normalize and 7 don't, then you're screwed before you even get out the door. Programmers simply aren't capable of getting this right in a consistent way if asked to solve the problem application by application.
which screws you because anyone can just call fopen on their own without using the core foundation libraries, leading to inconsistent states in the filesystem. You can be higher level than the filesystem, but only a little bit. You can't be higher level than some API that developers will regularly use (unless you do like ZFS and normalize at lookup rather than create).
From the "What's new for developers" in macOS 10.13 High Sierra document [0], case-sensitive APFS can be normalisation-insensitive:
> APFS now supports an on-disk format change to allow for normalization-insensitive Case Sensitive volumes. This means that file names in either Unicode NFC or NFD will point to the same files.
Which means that both versions support normalisation-insensitivity.
(Edit: There is also a one-line mention of this in the iOS 11 document [1] but it doesn't say if it is the default.)
"Unusable" is a strong word to use. Should filesystems be making up for our Unicode shortcomings? From a SW design perspective, is that the most sensible place to pass the burden of responsibility? I would say that another way to handle it is to store a file name as an array of bytes and put the burden on software developers to interpret Unicode correctly. Swift does this pretty nicely.
I would say the only downside to this approach is that a user wouldn't be able to tell two files with the same name apart, but it's hard to imagine how they'd get into such a situation in the first place without the developer rule above being violated.
> Should filesystems be making up for our Unicode shortcomings?
Absolutely, yes. File names are text by their very definition; that we've been treating them as "bags of bytes" is a historical tragedy. At the very least, file names need to be displayed, as text, to the user, so they should be stored as text, that is in some well-defined encoding, and yes, it should be the job of the filesystem driver / kernel to enforce that it's not writing garbage out to disk.
Reinventing that wheel in every system that in any way interacts with the filesystem is bad engineering, and doomed to fail.
Further, I don't see why the typical user should need to know or understand the differences between 'e\N{COMBINING ACUTE ACCENT}' and '\N{LATIN LOWERCASE E WITH ACUTE ACCENT}'. Likewise, I don't see why each and every piece of code should be forced to handle that. Developers will get this wrong. In fact, the article seems to say even Apple can't get it right, in that Finder will not correctly show the directory contents in some instances, and fails to open files in some instances, telling users the file "doesn't exist".
But what is text? Not everyone wants to use Unicode. It is dependent on the platform, the region, the OS, and many other things like LC_* variables on Linux. Why should a filesystem depend on those too?
> and on many other different things like LC_* variables on Linux.
The point is that it doesn't need to be. I would entertain that not everyone might want to use Unicode: in that case, the FS should still have a well-defined encoding, such that I can still arrive at a string to display to the user. The point is not that "Unicode is best" but that storing file names as "bags of bytes" is incredibly user-unfriendly, and there needs to be a straightforward, no-bullshit method to display and transmit the names of files.
But I would also argue that Unicode is the best we've got presently, and it would be pragmatic for a filesystem to simply adopt it outright. It's overwhelmingly the dominant character set in use today, especially if you ignore deprecated junk that Unicode is a strict superset of.
ZFS has a per-dataset option to allow/reject non-UTF-8. For valid UTF-8 names ZFS implements normalization-preserving/insensitive behavior. That was the best compromise we could find, and it works really well.
RDBMSes have gone through the same journey. First it was hardcoded; now for many of them we can specify the codepage or UTF encoding and collation of a DB, some even offer it per table or column, and so on.
Filesystems should do the same.
Either pick one way and stick to it (expose it as UTF8, with a sensible normalization and collation) or offer several options that can be specified when creating a volume and do on the fly conversions for clients that need it.
> Currently, the Mac Extended file system, HFS+, uses Normalisation Form D (NFD). Under that, é and é are automagically converted to é, and represented as three bytes, 65 cc 81.
This is actually an oversimplification. Mac NFD does not decompose characters in a few specific ranges[0]:
> U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF are not decomposed
But I do expect applications built for HFS+' pseudo-NFD to struggle with non-normalizing APFS.
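You can check both the article's byte claim and the exception ranges against plain Unicode NFD (what HFS+ implements is a frozen, slightly different variant):

    import unicodedata

    nfd = unicodedata.normalize("NFD", "\u00e9").encode("utf-8")
    print(" ".join(f"{b:02x}" for b in nfd))        # 65 cc 81 - matches the article
    # U+2126 OHM SIGN sits in the excluded U+2000-U+2FFF block: real NFD maps it
    # to U+03A9 (GREEK CAPITAL LETTER OMEGA), while HFS+'s frozen variant is
    # documented to leave it untouched.
    print(unicodedata.normalize("NFD", "\u2126") == "\u03a9")   # True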
Would be interesting to know why Apple made the decision not to normalize on a file system level then. Just an argument based on separation of concerns?
Normalizing on the file system level in the past was a mistake. Normalizing Unicode is not an easy task; it belongs in userspace, not in a kernel driver.
Other systems consider a filename a string of octets, and the interpretation of what these octets mean is left for userspace to decide.
Others have different opinions.
Normalization of Unicode identifiers is necessary for identifiers to be identifiable. Mostly this is only done for domain names; I also do it for identifiers in my programming language, and consequently path names and directory entries should also be identifiable. The garbage-in, garbage-out camp completely ignores all Unicode security concerns and simply shifts blame to user-space. I applauded Apple for using normalization in HFS+, but the APFS integration is just horrible: 1st it makes Unicode paths insecure again, and 2nd it's inconsistent.
NFD over NFC is simply trading data size for CPU. NFC requires 3 passes over a string, NFD only 2, whilst NFD is usually a few bytes longer. Linus talking about NFD corrupting data is of course nonsense; he should stay technical and only rant about things he has an idea about. The Python 3 NFKC normalization format is nonsense, but not really important. NFD is fine, because it's faster.
Shifting interpretation of byte encodings from utf-8 to user-space is typical Linux nonsense, but he inherited the existing mess. Using utf-8 is miles better than bytes.
Case-insensitivity in HFS+ is of course legacy nonsense. This should have been the target to get rid of, not normalization.
The 2nd big security issue would be to forbid mixed scripts in a name (filename).
I blogged about it here, and OP uses the same examples I used: http://perl11.org/blog/unicode-identifiers.html
Invisible combining marks, right-to-left overrides, and similar spoofing security problems are only fixable with normalization and more of TR31 (mixed scripts, confusables), though confusables can only be handled in user-space, not the FS.
ZFS does normalization on LOOKUP. This is much better, as it preserves whatever form you used, but is normalization-form insensitive, which is precisely what users need (and would want, if only they could be expected to understand what the heck is going on!).
This is going to be fun with the java.nio.file API. Java currently normalises everything to NFC even though the file system on macOS is NFD. This was introduced some time in Java 8 because reasons.
Of course this is all going to break now, of course it will take years until Oracle fixes this.
Per pilif[0], NFD filenames are converted to NFC during HFS+ to APFS conversion:
> BTW: On my 10.13 Beta 1 setup with an APFS converted boot drive, unless I manually create a NFD encoded file name on the command line (which you can't do accidentally), everything is NFC both in the UI and on the command line. This also applies to files I haven't touched since the conversion to APFS.
Sadly this was expected but it's not clear what Apple wants developers to do about this now. I am very happy that normalization is removed from the FS layer but I wish they had removed it from Cocoa as well or at least picked a different normalization form.
The fact that Cocoa apps create different filenames than the terminal is not great. That some UI (like Finder) seems to cause even more confusion is not helping either.
Wait, why is a file name a sequence of bytes? Why is it not a sequence of bytes that forms a valid string in some encoding, so that you can just refuse some file names? Who refuses invalid characters such as invisible characters or "/"? Would not the same place be a good place to refuse anything that doesn't normalize to the same byte sequence, or an even stricter subset?
Supporting more than one encoding requires knowing what that is, and apps and APIs are very bad at keeping track of that. Which is why only Unicode in some UTF makes sense.
ZFS allows you to store non-UTF-8 if you like, but for all valid UTF-8 it implements normalization-preserving/insensitive behavior, which is the best possible compromise (IMO).
To be honest, I blame Unicode. Why allow different representations for the same character, and then provide a normalized form anyway, except it's not one normalized form but several? Sounds like job security to me.
> To be honest, I blame Unicode. Why allow different representations for the same character, and then provide a normalized form anyway, except it's not one normalized form but several?
That's not really the issue here. Even if you can avoid normalization in some languages the problem will come back in others. For instance what are you going to do about invisible characters, control characters or more? Traditionally software attempted to just ignore the problems and let it blow up in other places. For some fun issues just try whitespace and bell characters in filenames and navigate around in your shell.
Normalization is just one of many issues with filenames.
No, just developers. Of course, the use of bytes for characters goes back a long time, to times when computers had small memories and disk (and other) storage capacities. And to even before then, to the days of telexes and typewriters. It's completely understandable. But UTF-8 is genius, which is why we use it.
Incidentally, ASCII was actually a multi-byte codeset... since one could combine most lower-case characters with BS (backspace) and overstrike with apostrophe, backtick, tilde, comma (for cedilles), and double-quotes (for umlauts), or with the same char for bold, or with underscore for underline. nroff(1) still uses this for bold and underscore, no?
I've always wondered what the concept of a "backspace" character was supposed to mean. (I tried printing it to erase a previously printed character, but that doesn't work.)
I guess it's another relic of the idea that computer output goes to a printer rather than a display?
Yes! Terminal vendors had to explicitly support this in terminals, though maybe not just because of printers but because it's actually quite useful.
Note that BS only produces overstrike when followed by certain characters (which we might term "combining" for the fun of it), while most will just change the character at that location. A tty spinner is just |BS/BS-BS\BS|BS... with some delay between each -- no overstrike there.
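That spinner still works today; a throwaway Python version for anyone who wants to see BS behave as "move left" rather than "erase" (run it in a real terminal):

    import sys, time

    for frame in "|/-\\" * 5:                 # |, /, -, \ repeatedly
        sys.stdout.write(frame + "\b")        # draw the frame, then BS steps back over it
        sys.stdout.flush()
        time.sleep(0.1)
    print()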
Not really. In a barely-networked and severely resource-constrained world (kilobytes of memory, not gigabytes), this decision made sense. Shift happens.
Indeed not: there is an almost infinite number of things that Latin-1 did not have; e.g. any characters beyond Western Europe. While keeping the 128-256 block compatible was a part of early Unicode (hence the "one-glyph" é), having composed characters was a Unicode primary design goal (hence e and the composing accent). A pure Unicode implementation would have been better, maybe; what we have instead is one that has gained mass adoption.
The biggest actual problem is Greek vs Cyrillic. They appear the same, but use different encodings. Any filename (directory entry as identifier) must forbid mixed scripts. See TR31 http://www.unicode.org/reports/tr31/
Combining marks or RTL switching tricks are the other popular spoofs.
In my view, no: not if you mixed Greek with Cyrillic, and not Math with Cyrillic.
Unfortunately nobody cares about Unicode identifier security models. Garbage out is the most popular ideology.
I take the view that modern file names are human readable labels, not access keys.
“10kΩ Резисторы” is Cyrillic, Latin, and Greek (U+2126 canonically maps to U+03A9). A proposal that disallows that but allows “10kΩ Resistors” is a political non-starter.
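A crude way to see the mixed-script point (Python's unicodedata module has no Script property, so bucketing by the first word of each character's name is only an approximation for illustration):

    import unicodedata

    def scripts(name):
        found = set()
        for ch in unicodedata.normalize("NFC", name):      # canonical mapping: Ω (U+2126) -> U+03A9
            if ch.isalpha():
                found.add(unicodedata.name(ch).split()[0])  # 'LATIN', 'GREEK', 'CYRILLIC', ...
        return found

    print(scripts("10k\u2126 \u0420\u0435\u0437\u0438\u0441\u0442\u043e\u0440\u044b"))
    # {'LATIN', 'GREEK', 'CYRILLIC'} (set order varies)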
The Unicode consortium has published recommendations with various security levels that everybody enjoys ignoring. Following even the weakest level would be a start, but not in the current filesystems community. This needs to start in a new OS, or with high-profile spoofing attacks, such as on GitHub.
Pre-composed characters were needed to make transcoding to ISO-8859-* fast.
Also, for Hangul, though decomposition is the better form to use normally (since it's phonetic, not syllabic), conversion to pre-composed syllabic form is very useful sometimes.
Even without pre-composition there would have been multiple equivalent forms of writing any character that requires more than one combining codepoint.
The only way to have avoided normalization altogether would have been to have no combining codepoints, and only a complete set of pre-compositions. This would have been less flexible, and very obnoxious for Hangul and possibly others.
I believe the need for normalization was simply unavoidable. It's not the fault of Unicode but the fault of humans' script designs. And it's OK.
> But why oh why did they have to specify FOUR normalization forms?
Ignoring compatibility normalizations there are really only two, and neither of those two can you remove, so that's a bit of an odd question. If you want to remove the compatibility forms that's fair, but nobody is forced to use them. You can consider them "partially normalized" or "not normalized at all" for all intents and purposes.
> I don't remember those having multiple representations for the same character in the same encoding or normal forms...
But that's not really what is happening here. What is happening is that the renderer decides to merge some code points into one glyph. That was true long before Unicode as well (for instance ligatures), or even on old terminals with overstriking. Unicode just made this more explicit.
Compatibility and efficiency are two big reasons why you would want to have composed forms like e.g. the single codepoint for é: legacy encodings also have a single codepoint for it, and it's more efficient to represent it with one codepoint than with two.
So, we need composed characters, but surely combining characters aren't needed, then? Well, no, because it's unreasonable to include a codepoint for every single combination, especially once you have multiple diacritics on a single character, for example. Also, compatibility factors in here too, because legacy encodings likewise have combining characters.
And also: scripts evolve. People may well start combining various diacritical marks in ways not done before.
Also, it can be useful to decompose in some cases. For example, it's useful for removing diacritical marks, which can be useful for fuzzy searches. Hangul is really phonetic, not syllabic, so there it's the reverse: decomposed forms are most useful, except when you need to think in syllables.
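The diacritic-stripping trick mentioned above is a one-liner once you have NFD (lossy by design, so only ever use it for matching, never for storage):

    import unicodedata

    def strip_marks(s):
        decomposed = unicodedata.normalize("NFD", s)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    print(strip_marks("Crème brûlée"))   # 'Creme brulee'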
Combining characters seems like a pretty stupid thing to have in a character set/encoding. Does anyone know how many combined characters are actually needed?
It's not Unicode's fault. Unicode took a terrible situation (lots of complex scripts) and did the best it could to give us a unified way of representing text in all scripts.
The gist is that normalization should only ever be used for comparison (if needed, e.g. "do these two files have filenames that would look the same to user"), and never for changing data (filenames are user data and should be stored verbatim without normalization). HFS+ should never have used normalization in the first place. You can think of normalization as essentially a lossy hash function (you cannot get back to HFS+ NFD form once you normalize it to NFC - HFS+ NFD and NFD proper are not the same thing - many people don't realize this). Using normalization for anything other than temporary comparison leads to data loss.
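In code, that discipline is tiny: keep names verbatim, and normalize both sides only at the moment of comparison (NFC here is an arbitrary choice; any single form works as long as it's only used transiently):

    import unicodedata

    def same_filename(a, b):
        # Normalize only for the comparison; never write the normalized form back.
        return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

    stored = "Cafe\u0301.txt"                        # kept exactly as created
    print(same_filename(stored, "Caf\u00e9.txt"))    # True  - equal for lookup/display
    print(stored == "Caf\u00e9.txt")                 # False - the data itself is untouched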
Actually Korean (Hangul) is the biggest user of NFD normalization, with very special logic. Hangul also has the only 2 still-remaining identifier bugs in the Unicode 9 database: HANGUL FILLER and HALFWIDTH HANGUL FILLER are wrongly valid ID_Cont, and as such wrongly usable as identifier characters (such as in filenames).
Personally I think the current Normalisation Form D is awful, storing an ü as two characters is really annoying and even bash can't really deal with it in the version Apple uses. I really hope APFS will fix this. But we'll see.
Apple could of course change their keyboard layouts to produce something close to NFD. (OS X has always allowed keys to produce multi-character results.)
Did anyone not read the documentation? I'm pretty sure this will be a non-issue soon.
> How does Apple File System handle filenames?
> APFS has case-sensitive and case-insensitive variants. The case-insensitive variant of APFS is normalization-preserving, but not normalization-sensitive. The case-sensitive variant of APFS is both normalization-preserving and normalization-sensitive. Filenames in APFS are encoded in UTF-8 and aren’t normalized. […]
> The first developer preview of APFS, made available in macOS Sierra in June 2016, offered only the case-sensitive variant.
No, in my mind a filename should be an identifier made of a bunch of bytes. How you represent those bytes is not important in the context of the file subsystem.
For me it seems more like a problem with Unicode.
I can see why it is the way it is from a certain perspective. Very convenient.
But it has broken the underlying abstraction layer.
There was no underlying layer, just lots of U.S.- or Western-centric assumptions. Unicode breaks them, but so did every codeset (Shift-JIS, this, that, and the other).
A pet peeve of mine is the use of the word "unusable" when reporting a behavior which the reporter finds undesirable. Apple is just now doing what Linux has always done. Is Linux unusable?
Keeping your issue reports neutral and avoiding hyperbole is an underrated skill.
ISTR an old blog post or doc about this. Apple engineers wanted to be able to do prefix matching in HFS+, if I remember correctly. And so making the name as long as possible (via canonical decomposition) made sense to them.
I'm glad this is getting attention. I've been working around this issue for over a year now, assuming that "if it's broke, Apple will know, and they will fix it".
i guess we'll see a new option on the finder "show non-normalized files" next to "show hidden files" pretty soon.
i don't see how they could automatically solve all the potential mismatch issues otherwise.